# Basic Content-Based Recommender System

Importing required libraries for data processing, feature extraction, and similarity computation.

In [49]:
import pandas as pd
import numpy as np
import ast
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

## Feature Extraction

### Loading Data

Downloading and extracting the KuaiRec dataset if it is not already present.

In [33]:
%%bash
# Check if KuaiRec.zip already exists
if [ ! -f KuaiRec.zip ]; then
    wget --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE&export=download&confirm=t&uuid=b2002093-cc6e-4bd5-be47-9603f0b33470' -O KuaiRec.zip
    unzip KuaiRec.zip -d data_final_project
fi

Loading the small interaction matrix, user features, daily item features, item categories, and the big interaction matrix from CSV files.

In [34]:
interactions = pd.read_csv("data_final_project/KuaiRec 2.0/data/small_matrix.csv")
user_features = pd.read_csv("data_final_project/KuaiRec 2.0/data/user_features.csv")
item_daily_features = pd.read_csv("data_final_project/KuaiRec 2.0/data/item_daily_features.csv")
item_categories = pd.read_csv("data_final_project/KuaiRec 2.0/data/item_categories.csv")
big_matrix = pd.read_csv("data_final_project/KuaiRec 2.0/data/big_matrix.csv")

### One-Hot Encode Tags

Creating a copy of the item categories dataframe to prepare for feature encoding.

In [35]:
item_features = item_categories.copy()

Converting the string representation of feature lists into actual Python lists and one-hot encoding the item features using MultiLabelBinarizer.

In [36]:
item_features['feat_as_list'] = item_features['feat'].apply(ast.literal_eval)
mlb = MultiLabelBinarizer()
interests_encoded = mlb.fit_transform(item_features['feat_as_list'])
interests_df = pd.DataFrame(interests_encoded, columns=[f'feat_{cls}' for cls in mlb.classes_])
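
To make the encoding step concrete, here is a minimal sketch on toy data. The frame `toy_items` and its tag values are made up for illustration; only `ast.literal_eval` and `MultiLabelBinarizer` come from the cell above.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for item_categories.csv: 'feat' holds list literals stored as strings.
toy_items = pd.DataFrame({
    "video_id": [0, 1, 2],
    "feat": ["[8]", "[27, 9]", "[9]"],
})

# Parse each string into a real Python list of tag ids.
toy_items["feat_as_list"] = toy_items["feat"].apply(ast.literal_eval)

# One column per distinct tag; a 1 marks that the item carries the tag.
toy_mlb = MultiLabelBinarizer()
toy_encoded = toy_mlb.fit_transform(toy_items["feat_as_list"])
toy_encoded_df = pd.DataFrame(
    toy_encoded,
    columns=[f"feat_{cls}" for cls in toy_mlb.classes_],
    index=toy_items["video_id"],
)
print(toy_encoded_df)
```

Item 1 carries two tags, so its row has two 1s; the columns follow the sorted order of `classes_`.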

### Item x Features Matrix

Combining the original item features with the one-hot encoded features and setting 'video_id' as the index for the item-feature matrix.

In [37]:
item_features = pd.concat([item_features.drop(columns=['feat_as_list']), interests_df], axis=1)
item_features_content_based = item_features.drop(columns=["feat"])
item_features_content_based.set_index('video_id', inplace=True)

In [38]:
item_features_content_based.head(5)

video_id,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_21,feat_22,feat_23,feat_24,feat_25,feat_26,feat_27,feat_28,feat_29,feat_30
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### User x Features Matrix

Creating a binary 'liked' column in the interactions dataframe: an interaction counts as liked when its watch ratio is at least 2.

In [39]:
interactions_binarized = interactions.copy()
interactions_binarized["liked"] = interactions_binarized["watch_ratio"].apply(lambda x: 1 if x >= 2 else 0)

Building a user-feature matrix by averaging the features of items each user liked.

In [40]:
user_features_content_based = interactions_binarized[interactions_binarized["liked"] == 1].copy()
user_features_content_based = user_features_content_based.drop(columns=["play_duration", "video_duration", "time", "date", "timestamp", "watch_ratio", "liked"])
user_features_content_based = user_features_content_based.join(item_features_content_based, on="video_id", how="left")
user_features_content_based = user_features_content_based.groupby("user_id").mean().reset_index()
user_features_content_based = user_features_content_based.drop(columns=["video_id"])
user_features_content_based = user_features_content_based.set_index("user_id")
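
The averaging step can be illustrated on toy data. The sketch below uses hypothetical frames (`toy_item_features`, `toy_liked`) and assumes the liked filter has already been applied; it mirrors the join and group-by logic of the cell above without reproducing it exactly.

```python
import pandas as pd

# Hypothetical one-hot item features, indexed by video_id.
toy_item_features = pd.DataFrame(
    {"feat_0": [1, 0, 0], "feat_1": [0, 1, 1]},
    index=pd.Index([10, 11, 12], name="video_id"),
)

# Hypothetical liked interactions (rows that survive the liked == 1 filter).
toy_liked = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "video_id": [10, 11, 12, 10],
})

# Attach each liked item's feature vector, then average per user.
toy_profiles = (
    toy_liked.join(toy_item_features, on="video_id", how="left")
    .drop(columns=["video_id"])
    .groupby("user_id")
    .mean()
)
print(toy_profiles)
```

User 1 liked one `feat_0` item and two `feat_1` items, so the profile is (1/3, 2/3). Each entry is the share of the user's liked items that carry the tag.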

In [41]:
user_features_content_based.head(5)

user_id,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_21,feat_22,feat_23,feat_24,feat_25,feat_26,feat_27,feat_28,feat_29,feat_30
14,0.005587,0.061453,0.011173,0.0,0.0,0.03352,0.117318,0.083799,0.156425,0.150838,...,0.0,0.0,0.0,0.0,0.050279,0.089385,0.0,0.329609,0.0,0.0
19,0.0,0.05,0.0,0.0,0.025,0.1,0.05,0.1,0.125,0.1,...,0.0,0.0,0.025,0.0,0.05,0.05,0.0,0.275,0.0,0.0
21,0.0,0.026786,0.017857,0.008929,0.008929,0.044643,0.098214,0.053571,0.107143,0.053571,...,0.0,0.0,0.0,0.0,0.089286,0.080357,0.0,0.294643,0.008929,0.0
23,0.002387,0.031026,0.009547,0.0,0.002387,0.026253,0.090692,0.076372,0.128878,0.126492,...,0.004773,0.0,0.0,0.0,0.057279,0.112172,0.0,0.367542,0.002387,0.002387
24,0.006079,0.042553,0.00304,0.0,0.015198,0.039514,0.115502,0.06079,0.167173,0.072948,...,0.0,0.0,0.0,0.0,0.027356,0.121581,0.0,0.334347,0.00304,0.0


## Similarity Scores

### Similarity Matrix

Calculating the cosine similarity between user and item feature vectors to create a similarity matrix.

In [42]:
similarity_matrix = cosine_similarity(user_features_content_based.values, item_features_content_based.values)

similarity_df = pd.DataFrame(
    similarity_matrix,
    index=user_features_content_based.index,
    columns=item_features_content_based.index
)
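
As a sanity check on what `cosine_similarity` computes, the sketch below compares it with a hand-written version (dot product divided by the product of the L2 norms). The vectors and the helper `manual_cosine` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user profile (averaged tags) and two binary item vectors.
toy_user_vec = np.array([[0.5, 0.5, 0.0]])
toy_item_vecs = np.array([
    [1.0, 0.0, 0.0],  # shares one tag with the profile
    [0.0, 0.0, 1.0],  # shares no tag with the profile
])

def manual_cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

manual_scores = [manual_cosine(toy_user_vec[0], item) for item in toy_item_vecs]
sklearn_scores = cosine_similarity(toy_user_vec, toy_item_vecs)[0]
print(manual_scores, sklearn_scores)
```

Because the score depends only on direction, an item with no tag in common with the profile scores exactly 0, while the item sharing one tag scores about 0.707.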

In [43]:
similarity_df.head(5)

user_id \ video_id,0,1,2,3,4,5,6,7,8,9,...,10718,10719,10720,10721,10722,10723,10724,10725,10726,10727
14,0.335023,0.228437,0.323058,0.191442,0.071791,0.251267,0.04786,0.335023,0.02393,0.227337,...,0.083756,0.203055,0.227337,0.107686,0.071791,0.227337,0.02393,0.083756,0.04786,0.071791
19,0.30657,0.173422,0.245256,0.122628,0.245256,0.122628,0.0,0.30657,0.122628,0.122628,...,0.061314,0.173422,0.122628,0.122628,0.245256,0.122628,0.0,0.061314,0.0,0.245256
21,0.266011,0.094049,0.133005,0.199508,0.110838,0.243843,0.022168,0.266011,0.022168,0.332513,...,0.155173,0.297822,0.332513,0.221676,0.110838,0.332513,0.044335,0.155173,0.022168,0.110838
23,0.273086,0.189525,0.268029,0.237686,0.055629,0.192172,0.040457,0.273086,0.050572,0.262972,...,0.060686,0.203829,0.262972,0.121372,0.055629,0.262972,0.020229,0.060686,0.040457,0.055629
24,0.373726,0.115315,0.163081,0.271801,0.088335,0.258211,0.06795,0.373726,0.04077,0.237826,...,0.088335,0.187387,0.237826,0.061155,0.088335,0.237826,0.006795,0.088335,0.06795,0.088335


## Make Recommendations

Defining a function to retrieve the top N recommended items for a given user based on similarity scores.

In [44]:
def get_top_n_recommendations(similarity_df, user_id, n=10):
    user_similarities = similarity_df.loc[user_id]
    top_n_items = user_similarities.nlargest(n).index.tolist()
    return top_n_items
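
The function above ranks every item in `similarity_df`, including videos the user has already watched. If that is undesirable, one possible variant drops those items before ranking. This is a sketch under assumptions: `get_top_n_unseen` and its `seen_items` argument are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def get_top_n_unseen(similarity_df, user_id, seen_items, n=10):
    """Hypothetical variant: rank items by similarity, skipping already-seen ones."""
    user_similarities = similarity_df.loc[user_id]
    # errors="ignore" tolerates seen ids that are missing from the similarity columns.
    candidates = user_similarities.drop(index=list(seen_items), errors="ignore")
    return candidates.nlargest(n).index.tolist()

# Toy similarity matrix: one user, four videos.
toy_similarity_df = pd.DataFrame(
    [[0.9, 0.8, 0.1, 0.4]],
    index=pd.Index([14], name="user_id"),
    columns=pd.Index([0, 1, 2, 3], name="video_id"),
)
top_unseen = get_top_n_unseen(toy_similarity_df, 14, seen_items={0}, n=2)
print(top_unseen)
```

Video 0 has the highest score but is excluded as seen, so the next two best candidates are returned.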

## Evaluation

Filtering the big interaction matrix to the users that have a content-based profile, dropping interactions with zero play duration, and creating a binary 'liked' column for evaluation.

In [45]:
# Keep only users that have a content-based profile.
big_matrix_filtered = big_matrix[big_matrix["user_id"].isin(user_features_content_based.index)]
# Drop unplayed interactions; .copy() avoids a SettingWithCopyWarning on the assignment below.
big_matrix_filtered = big_matrix_filtered[big_matrix_filtered["play_duration"] != 0].copy()
big_matrix_filtered["liked"] = (big_matrix_filtered["watch_ratio"] >= 2).astype(int)
big_matrix_filtered = big_matrix_filtered.drop(columns=["play_duration", "video_duration", "time", "date", "timestamp", "watch_ratio"])

In [46]:
big_matrix_filtered.head(5)

,user_id,video_id,liked
27176,14,8221,0
27177,14,3547,0
27178,14,3594,0
27179,14,3558,0
27180,14,1947,1


Defining a function that compares a user's top-k recommendations to the items the user actually liked and returns Precision@k, Recall@k, NDCG@k, and average precision at k.

In [53]:
def evaluate_topk_metrics(y_true, top_k_preds, k=5):
    top_k = top_k_preds[:k]
    relevant = set(y_true)
    hits = [1 if item in relevant else 0 for item in top_k]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    dcg = sum(hit / np.log2(i + 2) for i, hit in enumerate(hits))
    ideal_hits = [1] * min(len(relevant), k)
    idcg = sum(1 / np.log2(i + 2) for i in range(len(ideal_hits)))
    ndcg = dcg / idcg if idcg != 0 else 0.0

    # AP@k for this user; averaging this value over users gives MAP@k
    ap_sum = 0.0
    hit_count = 0
    for i, hit in enumerate(hits):
        if hit:
            hit_count += 1
            ap_sum += hit_count / (i + 1)
    map_k = ap_sum / min(len(relevant), k) if relevant else 0.0

    return precision, recall, ndcg, map_k
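
A small worked example may help in reading these metrics. The sketch below recomputes them step by step for a made-up case (two relevant items, three recommendations) using the same formulas as the function above; all values are illustrative.

```python
import math

relevant = {3, 7}   # hypothetical liked items
top_k = [7, 1, 3]   # hypothetical ranked recommendations
k = len(top_k)

# 1 where the recommended item is relevant, 0 otherwise.
hits = [1 if item in relevant else 0 for item in top_k]

precision = sum(hits) / k
recall = sum(hits) / len(relevant)

# DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits))
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg = dcg / idcg

# Average precision: precision at each hit position, normalized by min(|relevant|, k).
running_hits, ap_sum = 0, 0.0
for i, hit in enumerate(hits):
    if hit:
        running_hits += 1
        ap_sum += running_hits / (i + 1)
ap = ap_sum / min(len(relevant), k)

print(precision, recall, ndcg, ap)
```

Both relevant items are retrieved, so recall is 1, but the miss at rank 2 pulls precision down to 2/3 and average precision to 5/6.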

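Running the evaluation for every user with a profile at k = 1, 5, 10, and 20, and averaging each metric across users.

In [55]: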
k_values = [1, 5, 10, 20]
results = []

for k in k_values:
    all_precisions, all_recalls, all_ndcgs, all_maps = [], [], [], []
    user_ids = user_features_content_based.index

    for user_id in user_ids:
        y_true = big_matrix_filtered[(big_matrix_filtered["user_id"] == user_id) & (big_matrix_filtered["liked"] == 1)]["video_id"].tolist()
        if not y_true:
            continue  # skip users with no relevant items
        top_k_preds = get_top_n_recommendations(similarity_df, user_id, k)
        precision, recall, ndcg, map_k = evaluate_topk_metrics(y_true, top_k_preds, k)
        all_precisions.append(precision)
        all_recalls.append(recall)
        all_ndcgs.append(ndcg)
        all_maps.append(map_k)

    results.append({
        "k": k,
        "precision": np.mean(all_precisions),
        "recall": np.mean(all_recalls),
        "ndcg": np.mean(all_ndcgs),
        "map": np.mean(all_maps)
    })

# Print results for each k
for res in results:
    print(f"Results for k={res['k']}:")
    print(f"  Mean Precision@{res['k']}: {res['precision']:.4f}")
    print(f"  Mean Recall@{res['k']}: {res['recall']:.4f}")
    print(f"  Mean NDCG@{res['k']}: {res['ndcg']:.4f}")
    print(f"  Mean MAP@{res['k']}: {res['map']:.4f}\n")

Results for k=1:
  Mean Precision@1: 0.0000
  Mean Recall@1: 0.0000
  Mean NDCG@1: 0.0000
  Mean MAP@1: 0.0000

Results for k=5:
  Mean Precision@5: 0.0003
  Mean Recall@5: 0.0001
  Mean NDCG@5: 0.0002
  Mean MAP@5: 0.0001

Results for k=10:
  Mean Precision@10: 0.0004
  Mean Recall@10: 0.0002
  Mean NDCG@10: 0.0003
  Mean MAP@10: 0.0001

Results for k=20:
  Mean Precision@20: 0.0002
  Mean Recall@20: 0.0003
  Mean NDCG@20: 0.0003
  Mean MAP@20: 0.0000
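
The evaluation loop above re-filters `big_matrix_filtered` for every user and every value of k. One possible speed-up, sketched here on toy data with hypothetical names (`toy_eval`, `liked_by_user`), is to build the liked-item sets once and look them up inside the loop.

```python
import pandas as pd

# Hypothetical stand-in for big_matrix_filtered.
toy_eval = pd.DataFrame({
    "user_id":  [14, 14, 14, 19, 19],
    "video_id": [8221, 1947, 3547, 100, 200],
    "liked":    [0, 1, 1, 0, 0],
})

# Build {user_id: set of liked video_ids} in a single pass over the data.
liked_by_user = (
    toy_eval[toy_eval["liked"] == 1]
    .groupby("user_id")["video_id"]
    .apply(set)
    .to_dict()
)

evaluated_users = []
for user_id in [14, 19]:
    y_true = liked_by_user.get(user_id, set())
    if not y_true:
        continue  # skip users with no relevant items, as in the loop above
    evaluated_users.append(user_id)

print(liked_by_user, evaluated_users)
```

User 19 has no liked items, so it never appears in the dictionary and is skipped, which matches the behavior of the original loop.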



## Conclusion

The evaluation results indicate that the basic content-based recommender system performs very poorly on this dataset: every metric (Precision, Recall, NDCG, MAP) is at or below 0.0004 for all tested values of *k*, and all four are 0.0000 at *k* = 1. The recommendations generated by the model therefore almost never match the items that users actually liked.

**Possible reasons for the poor performance:**

- **Limited Feature Representation:** The model relies solely on one-hot encoded item features (tags), which may not capture the true preferences of users or the nuanced similarities between items.
- **Sparse User-Item Interactions:** If users have interacted with only a small subset of items, the user profiles (averaged feature vectors) may not be representative enough to generalize to new recommendations.
- **Feature Overlap:** If many items share similar features or if features are not discriminative, the cosine similarity may not effectively distinguish between relevant and irrelevant items.
- **Cold Start Problem:** New or inactive users/items with little interaction history will have poorly defined feature vectors, leading to random or irrelevant recommendations.
- **No Temporal or Contextual Information:** The model does not consider time, sequence, or context of interactions, which are often important in real-world recommendation scenarios.
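
The feature-overlap hypothesis can be checked directly by counting how many distinct tag vectors exist. The sketch below uses a toy matrix (`toy_item_matrix` is a hypothetical stand-in for `item_features_content_based`); items with identical rows receive identical cosine scores for every user, so their relative order in a top-N list is arbitrary.

```python
import pandas as pd

# Toy item-feature matrix: videos 0, 1, and 3 share the same tag vector.
toy_item_matrix = pd.DataFrame(
    {"feat_0": [1, 1, 0, 1], "feat_1": [0, 0, 1, 0]},
    index=pd.Index([0, 1, 2, 3], name="video_id"),
)

# Rows that repeat an earlier tag vector cannot be told apart by cosine similarity.
n_distinct = len(toy_item_matrix.drop_duplicates())
n_items = len(toy_item_matrix)
print(f"{n_distinct} distinct tag vectors across {n_items} items")
```

A low ratio of distinct vectors to items on the real matrix would support the feature-overlap explanation; this sketch does not establish what that ratio is for KuaiRec.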