# Recommendation System Analysis

1. Feature Engineering and Weakly-Supervised Weight Learning
We designed an algorithm called Weakly-Supervised MMR Re-ranking. It learns feature weights from data instead of relying on hand-tuned values, then combines confidence scores with diversity optimization to generate the final recommendation list.
The system extracts a feature vector for each paper: a Bayesian-smoothed average citation rate per year, semantic similarity quantiles, and content type flags (Review, Application, Theory, Trending) derived from regular expression matching. Because labeled data is scarce in academic scenarios, we use a weak supervision strategy: predefined heuristic label functions produce pseudo-labels for each user intent, and the **Coordinate Ascent algorithm** searches for the feature weights that minimize the pairwise logistic loss on those labels, yielding one weight vector per view.
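
The weight-learning step can be illustrated with a small, self-contained sketch. It is not the code in `recommend_v3.py`: the function names, the step sizes, and the toy label function are all assumptions made for illustration. It perturbs one weight at a time, keeps the weights non-negative and summing to 1, and accepts a change only when the pairwise logistic loss on the weak labels goes down.

```python
import math
import random

def pairwise_logistic_loss(w, X, y):
    """Mean log(1 + exp(-(s_i - s_j))) over pairs whose weak labels satisfy y_i > y_j."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    total, n_pairs = 0.0, 0
    for i in range(len(X)):
        for j in range(len(X)):
            if y[i] > y[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

def normalize(w):
    """Rescale non-negative weights to sum to 1 (uniform if all are zero)."""
    s = sum(w)
    return [v / s for v in w] if s > 0 else [1.0 / len(w)] * len(w)

def coordinate_search(X, y, n_rounds=20, steps=(0.5, 0.2, 0.05), seed=0):
    """Hypothetical coordinate-wise search: change one weight at a time."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [1.0 / d] * d
    best = pairwise_logistic_loss(w, X, y)
    for _ in range(n_rounds):
        order = list(range(d))
        rng.shuffle(order)
        for k in order:
            for step in steps:
                for delta in (step, -step):
                    cand = list(w)
                    cand[k] = max(0.0, cand[k] + delta)
                    cand = normalize(cand)
                    loss = pairwise_logistic_loss(cand, X, y)
                    if loss < best - 1e-12:  # keep only strict improvements
                        w, best = cand, loss
    return w, best

# Toy data: columns are [f_sim, f_is_review]. The weak labels come from a
# hypothetical "review" label function that fires on the review flag.
X = [[0.9, 0.0], [0.5, 1.0], [0.2, 1.0], [0.7, 0.0]]
y = [0, 1, 1, 0]
w, loss = coordinate_search(X, y)
print("weights:", w, "loss:", loss)
```

On this toy input the search shifts weight toward the review flag, which is the behaviour one would expect from a review-oriented view.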
2. MMR Diversity Re-ranking
By detecting explicit signals in the query terms (e.g., “survey,” “benchmark”), we dynamically adjust how the per-view weights are blended, so the ranking adapts to the user's search intent. The final Top-K list is produced by post-processing with the Maximal Marginal Relevance (MMR) algorithm, which uses the Jaccard similarity between titles to penalize redundant results. The output JSON recommendation list therefore keeps high relevance while covering sufficiently diverse content.
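
As a rough illustration of the re-ranking idea, the sketch below greedily picks items by trading relevance against the largest title overlap with anything already selected. The tokenizer, the function names, and the default `lambda_` are assumptions for this example; the signature of `r3.mmr_rerank` is different and its internals are not shown in this notebook.

```python
import re

def title_tokens(title):
    """Lowercased alphanumeric word set of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mmr_select(titles, scores, k, lambda_=0.7):
    """Return indices of up to k items chosen greedily by the MMR criterion."""
    tokens = [title_tokens(t) for t in titles]
    remaining = list(range(len(titles)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_value(i):
            # Redundancy is the largest title overlap with an already chosen item.
            redundancy = max(
                (jaccard(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return lambda_ * scores[i] - (1.0 - lambda_) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up titles and scores, for illustration only.
titles = [
    "A Survey of Software Testing",
    "A Survey of Software Testing Tools",
    "Formal Methods in Practice",
]
scores = [0.9, 0.88, 0.6]
print(mmr_select(titles, scores, k=2, lambda_=0.5))  # near-duplicate title is skipped
print(mmr_select(titles, scores, k=2, lambda_=1.0))  # pure relevance ordering
```

With `lambda_=1.0` the redundancy term vanishes and the selection reduces to sorting by score; smaller values push near-duplicate titles down the list.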


This notebook imports functionality from `src/recommend/recommend_v3.py` to run and analyze the recommendation pipeline interactively.

In [10]:
import sys
from pathlib import Path
import pandas as pd
import json
from collections import defaultdict

# Add src to python path to allow imports
project_root = Path("../").resolve()
if str(project_root / "src") not in sys.path:
    sys.path.append(str(project_root / "src"))

# Import the recommendation module
from recommend import recommend_v3 as r3

## 1. Load Data
Load the records from the input JSON file configured in the script (`r3.INPUT_JSON`).

In [12]:
records = r3.load_records(r3.INPUT_JSON)
print(f"Loaded {len(records)} records.")

# Preview one record
if records:
    print("\nSample Record Title:", records[0].get('title'))

Loaded 100 records.

Sample Record Title: Software Engineering at Google


## 2. Feature Engineering
Convert raw records into feature vectors used for scoring.

In [13]:
feats = r3.build_features(records, fallback_query=r3.FALLBACK_QUERY)
print(f"Built features for {len(feats)} items.")

# Display features in a DataFrame
df_feats = pd.DataFrame(feats)
display(df_feats[["qid", "f_sim", "f_recency", "f_cit", "f_title"]].head())

Built features for 100 items.


| | qid | f_sim | f_recency | f_cit | f_title |
|---|---|---|---|---|---|
| 0 | i wish to study papers in the field of softwar... | 1.0 | 0.414141 | 0.787879 | 1.0 |
| 1 | i wish to study papers in the field of softwar... | 0.989899 | 0.313131 | 0.0 | 1.0 |
| 2 | i wish to study papers in the field of softwar... | 0.979798 | 0.69697 | 0.0 | 1.0 |
| 3 | i wish to study papers in the field of softwar... | 0.969697 | 0.232323 | 0.343434 | 1.0 |
| 4 | i wish to study papers in the field of softwar... | 0.959596 | 0.59596 | 0.0 | 1.0 |
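
The displayed values are all multiples of 1/99, which is what a rank-quantile transform over 100 items would produce. The sketch below shows one plausible way to compute such quantiles, together with a Bayesian-smoothed citations-per-year rate. Both functions, and the `prior_rate` and `prior_strength` parameters, are hypothetical; the actual formulas live in `r3.build_features` and are not shown here.

```python
def smoothed_citations_per_year(citations, age_years, prior_rate, prior_strength=2.0):
    """Shrink citations/age toward prior_rate.

    prior_strength acts as a number of pseudo-years observed at prior_rate,
    so very young papers are not scored on a near-zero denominator.
    """
    return (citations + prior_rate * prior_strength) / (max(age_years, 0.0) + prior_strength)

def rank_quantiles(values):
    """Map each value to rank / (n - 1), so the largest value gets 1.0."""
    n = len(values)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])
    q = [0.0] * n
    for rank, i in enumerate(order):
        q[i] = rank / (n - 1)
    return q

# Illustrative inputs only.
print(smoothed_citations_per_year(20, 8, prior_rate=1.5))  # older, well-cited paper
print(smoothed_citations_per_year(0, 0, prior_rate=1.5))   # brand-new paper falls back to the prior
print(rank_quantiles([0.3, 0.9, 0.5]))
```

Quantiles put every feature on the same [0, 1] scale regardless of its raw distribution, which keeps the learned weights comparable across features.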


## 3. Learn Weights
Learn expert weights for different views (review, application, theory, trending) using coordinate ascent.

In [14]:
views = ["review", "application", "theory", "trending"]
w_dict = r3.learn_or_default_weights(feats, views=views, seed=r3.SEED)

print("Learned Weights:")
weight_df = pd.DataFrame(w_dict, index=r3.FEATURE_KEYS)
display(weight_df)

Learned Weights:


| | review | application | theory | trending |
|---|---|---|---|---|
| f_sim | 0.390728 | 0.421927 | 0.469512 | 0.402957 |
| f_title | 0.152318 | 0.152824 | 0.167683 | 0.136784 |
| f_abs | 0.077815 | 0.167774 | 0.071646 | 0.053604 |
| f_cat | 0.0 | 0.0 | 0.0 | 0.0 |
| f_cit_per_year | 0.092715 | 0.0 | 0.0 | 0.170055 |
| f_recency | 0.0 | 0.0 | 0.0 | 0.236599 |
| f_is_review | 0.286424 | 0.0 | 0.0 | 0.0 |
| f_is_application | 0.0 | 0.257475 | 0.0 | 0.0 |
| f_is_theory | 0.0 | 0.0 | 0.291159 | 0.0 |
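
Each weight column sums to 1, so if a view's score is a plain weighted sum of features in [0, 1], the score also stays in [0, 1]. The sketch below assumes that form; `r3.score_with_weights` may differ. It reuses the `review` weights from the table, and the feature vector is made up for illustration.

```python
# Feature order and review weights copied from the table above.
FEATURE_KEYS = [
    "f_sim", "f_title", "f_abs", "f_cat", "f_cit_per_year",
    "f_recency", "f_is_review", "f_is_application", "f_is_theory",
]
REVIEW_WEIGHTS = [0.390728, 0.152318, 0.077815, 0.0, 0.092715, 0.0, 0.286424, 0.0, 0.0]

def weighted_score(x, w):
    """Assumed scoring rule: dot product of feature values and weights."""
    return sum(wi * xi for wi, xi in zip(w, x))

def contributions(x, w, keys=FEATURE_KEYS):
    """Per-feature share of the score, rounded for display."""
    return {k: round(wi * xi, 6) for k, wi, xi in zip(keys, w, x)}

# Hypothetical feature vector for a highly similar review paper.
x = [0.95, 1.0, 0.6, 0.0, 0.4, 0.3, 1.0, 0.0, 0.0]
print(round(weighted_score(x, REVIEW_WEIGHTS), 6))
print(contributions(x, REVIEW_WEIGHTS))
```

Features with zero weight, such as `f_cat` here, contribute nothing to the score no matter what value they take.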


## 4. Run Pipeline & Generate Recommendations
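Process each query, apply weights, score, and re-rank using MMR. Because MMR trades relevance against redundancy, `_score` does not have to decrease monotonically with `_rank` in the outputs.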

In [15]:
# Group features by query ID
by_q = defaultdict(list)
for f in feats:
    by_q[f["qid"]].append(f)

# Define target views
if r3.VIEWS == "all":
    target_views = ["default", "review", "application", "theory", "trending"]
else:
    target_views = [r3.VIEWS]

print(f"Target Views: {target_views}")

results = {}

for view in target_views:
    print(f"\nProcessing View: {view}")
    out_all = []
    
    for qid, group in by_q.items():
        # Determine weights for this view/query
        if view == "default":
            prior = r3.intent_prior_from_query(qid)
            w = r3.mix_weights(prior, w_dict)
        else:
            w = w_dict.get(view)
            if w is None:
                w = [1.0 / len(r3.FEATURE_KEYS)] * len(r3.FEATURE_KEYS)

        scores = [r3.score_with_weights(f, w) for f in group]

        # Re-rank (MMR or sort)
        if r3.TOPK > 0:
            reranked = r3.mmr_rerank(group, scores, k=r3.TOPK, lambda_=r3.MMR_LAMBDA)
        else:
            reranked = [x for _, x in sorted(zip(scores, group), key=lambda t: t[0], reverse=True)]

        # Format Output
        rank = 1
        for f in reranked:
            rec = dict(f["_raw"])
            
            # Calculate contribution for display
            x = r3.vectorize(f)
            contrib = {r3.FEATURE_KEYS[i]: round(w[i] * x[i], 6) for i in range(len(w))}
            
            rec["_view"] = view
            rec["_qid"] = qid
            rec["_score"] = round(r3.score_with_weights(f, w), 6)
            rec["_rank"] = rank
            rec["_contrib"] = contrib
            out_all.append(rec)
            rank += 1
            
    # Save results to dict for visualization
    results[view] = pd.DataFrame(out_all)
    
    # Display top 5 for inspection
    if not results[view].empty:
        display_cols = ["title", "_score", "_rank", "categories", "citation_count"]
        # Ensure columns exist
        display_cols = [c for c in display_cols if c in results[view].columns]
        print(f"Top 5 recommendations for {view}:")
        display(results[view][display_cols].head(5))

Target Views: ['default', 'review', 'application', 'theory', 'trending']

Processing View: default
Top 5 recommendations for default:


| | title | _score | _rank | categories | citation_count |
|---|---|---|---|---|---|
| 0 | Benchmarking as Empirical Standard in Software... | 0.738229 | 1 | cs.SE | 20 |
| 1 | The Risk-Taking Software Engineer: A Framed Po... | 0.697685 | 2 | cs.SE | 3 |
| 2 | Lessons from a Pioneering Software Engineering... | 0.717459 | 3 | cs.SE | 0 |
| 3 | The Framework For The Discipline Of Software E... | 0.716361 | 4 | cs.SE | 0 |
| 4 | An Exploration of the Mentorship Needs of Rese... | 0.695903 | 5 | cs.SE | 0 |



Processing View: review
Top 5 recommendations for review:


| | title | _score | _rank | categories | citation_count |
|---|---|---|---|---|---|
| 0 | Agile Software Engineering and Systems Enginee... | 0.756447 | 1 | astro-ph.IM cs.SE | 2 |
| 1 | The Risk-Taking Software Engineer: A Framed Po... | 0.734506 | 2 | cs.SE | 3 |
| 2 | What Practitioners Really Think About Continuo... | 0.717172 | 3 | cs.SE | 0 |
| 3 | Software Engineering in Australasia | 0.754039 | 4 | cs.SE | 0 |
| 4 | Benchmarking as Empirical Standard in Software... | 0.684477 | 5 | cs.SE | 20 |



Processing View: application
Top 5 recommendations for application:


| | title | _score | _rank | categories | citation_count |
|---|---|---|---|---|---|
| 0 | Improving Software Engineering Research throug... | 0.833786 | 1 | cs.SE | 2 |
| 1 | SELM: Software Engineering of Machine Learning... | 0.812477 | 2 | cs.SE cs.AI | 14 |
| 2 | How Research Software Engineers Can Support Sc... | 0.773239 | 3 | cs.SE | 0 |
| 3 | What Practitioners Really Think About Continuo... | 0.761334 | 4 | cs.SE | 0 |
| 4 | The Framework For The Discipline Of Software E... | 0.734001 | 5 | cs.SE | 0 |



Processing View: theory
Top 5 recommendations for theory:


| | title | _score | _rank | categories | citation_count |
|---|---|---|---|---|---|
| 0 | Deconcentration of Attention: Addressing the C... | 0.773797 | 1 | cs.SE | 2 |
| 1 | A Formal Method for Mapping Software Engineeri... | 0.704099 | 2 | cs.SE | 0 |
| 2 | Lessons from a Pioneering Software Engineering... | 0.680386 | 3 | cs.SE | 0 |
| 3 | An Exploration of the Mentorship Needs of Rese... | 0.689871 | 4 | cs.SE | 0 |
| 4 | Software Engineering Timeline: major areas of ... | 0.675644 | 5 | cs.SE | 1 |



Processing View: trending
Top 5 recommendations for trending:


| | title | _score | _rank | categories | citation_count |
|---|---|---|---|---|---|
| 0 | Spreadsheet Engineering: A Research Framework | 0.884314 | 1 | cs.SE | 38 |
| 1 | The Risk-Taking Software Engineer: A Framed Po... | 0.838981 | 2 | cs.SE | 3 |
| 2 | Lessons from a Pioneering Software Engineering... | 0.839728 | 3 | cs.SE | 0 |
| 3 | Taxing Collaborative Software Engineering | 0.857055 | 4 | cs.SE | 3 |
| 4 | Benchmarking as Empirical Standard in Software... | 0.862096 | 5 | cs.SE | 20 |
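
The `default` view in the loop above relies on `r3.intent_prior_from_query` and `r3.mix_weights`, whose implementations are not shown in this notebook. The sketch below is one way such a blend could work: keyword hits in the query raise the prior of a view, and the blended weight vector is the prior-weighted average of the per-view weights. The keyword lists, the mapping of keywords to views, and the `base` and `boost` parameters are all assumptions.

```python
# Hypothetical keyword lists; the real signals and their view mapping may differ.
INTENT_KEYWORDS = {
    "review": ("survey", "review", "overview"),
    "application": ("benchmark", "application", "case study"),
    "theory": ("theory", "formal", "proof"),
    "trending": ("recent", "latest", "trending"),
}

def intent_prior(query, base=1.0, boost=2.0):
    """Return a probability-like prior over views from keyword hits in the query."""
    q = query.lower()
    raw = {
        view: base + boost * sum(kw in q for kw in kws)
        for view, kws in INTENT_KEYWORDS.items()
    }
    total = sum(raw.values())
    return {view: v / total for view, v in raw.items()}

def mix(prior, w_dict):
    """Blend per-view weight vectors using the prior as mixing coefficients."""
    n = len(next(iter(w_dict.values())))
    mixed = [0.0] * n
    for view, p in prior.items():
        for i, wi in enumerate(w_dict[view]):
            mixed[i] += p * wi
    return mixed

# Toy two-feature weights per view, for illustration only.
toy_weights = {
    "review": [0.2, 0.8],
    "application": [1.0, 0.0],
    "theory": [1.0, 0.0],
    "trending": [1.0, 0.0],
}
prior = intent_prior("a survey of software testing")
print(prior)
print(mix(prior, toy_weights))
```

Since the prior sums to 1 and each per-view vector sums to 1, the blended vector also sums to 1, so scores from the `default` view stay on the same scale as the single-view scores.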


## 5. Output Generation
Save the results to JSON files, similar to the script. Each file is written in JSON Lines layout, with one JSON object per line.

In [16]:
r3.ensure_dir(r3.OUTPUT_DIR)

for view, df in results.items():
    out_path = r3.OUTPUT_DIR / f"recommend_{view}_jupyter.json"
    
    # Convert DataFrame back to list of dicts for saving
    out_records = df.to_dict(orient="records")
    
    with open(out_path, "w", encoding="utf-8") as fo:
        for r in out_records:
            # Clean up pandas timestamps if any
            if isinstance(r.get("update_date"), pd.Timestamp):
                r["update_date"] = r["update_date"].isoformat()
                
            fo.write(json.dumps(r, ensure_ascii=False) + "\n")
            
    print(f"Saved {len(out_records)} records to {out_path}")

Saved 10 records to /work3/s242644/ds/PaperTrail/data/recommend/recommend_default_jupyter.json
Saved 10 records to /work3/s242644/ds/PaperTrail/data/recommend/recommend_review_jupyter.json
Saved 10 records to /work3/s242644/ds/PaperTrail/data/recommend/recommend_application_jupyter.json
Saved 10 records to /work3/s242644/ds/PaperTrail/data/recommend/recommend_theory_jupyter.json
Saved 10 records to /work3/s242644/ds/PaperTrail/data/recommend/recommend_trending_jupyter.json
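
Because the cell above writes one JSON object per line, a single `json.load` call on these files would fail even though they carry a `.json` extension. A minimal sketch for reading them back, where the helper name `read_json_lines` is an assumption and not part of `recommend_v3.py`:

```python
import json

def read_json_lines(path):
    """Read a file containing one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

For example, `read_json_lines(r3.OUTPUT_DIR / "recommend_default_jupyter.json")` should return the records saved for the `default` view, in rank order.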
