# Building Content Features for the Hybrid Recommender

This notebook builds the features that will be used in the hybrid model.
It takes the cleaned data from the pipeline and creates:

- genre features for each movie

- a simple taste profile for each user

- negative samples for training

- user–movie genre similarity

- ALS scores for each user–movie pair

The final output is a training file that will be used in the ranking model.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from numpy.linalg import norm
import pickle
import scipy.sparse as sp

In [2]:
PROJECT_ROOT = Path("..").resolve()
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

In [3]:
print("Project root:", PROJECT_ROOT)
print("Processed dir:", PROCESSED_DIR)

Project root: /Users/sanjaydilip/Desktop/Code/Projects/Movie Recommender
Processed dir: /Users/sanjaydilip/Desktop/Code/Projects/Movie Recommender/data/processed


# Load processed data

The notebook loads cleaned data produced by the pipeline:

- movies and their metadata

- movie–genre rows

- user–movie interactions from train

These files are the starting point for creating content features.

In [4]:
movie_map = pd.read_csv(PROCESSED_DIR / "movie_map.csv")
movie_genres = pd.read_csv(PROCESSED_DIR / "movie_genres.csv")
train = pd.read_csv(PROCESSED_DIR / "train.csv")

In [5]:
movie_map.head()

Unnamed: 0,movie_id,m_index,title,genres
0,1193,0,One Flew Over the Cuckoo's Nest (1975),Drama
1,661,1,James and the Giant Peach (1996),Animation|Children's|Musical
2,914,2,My Fair Lady (1964),Musical|Romance
3,3408,3,Erin Brockovich (2000),Drama
4,2355,4,"Bug's Life, A (1998)",Animation|Children's|Comedy


In [6]:
movie_genres.head()

Unnamed: 0,m_index,genre
0,0,Drama
1,1,Animation
2,1,Children's
3,1,Musical
4,2,Musical


In [7]:
train.head()

Unnamed: 0,user_id,movie_id,u_index,m_index,rating,timestamp,title,genres
0,1,3186,0,31,4.0,2000-12-31 22:00:19,"Girl, Interrupted (1999)",Drama
1,1,1270,0,22,5.0,2000-12-31 22:00:55,Back to the Future (1985),Comedy|Sci-Fi
2,1,1721,0,27,4.0,2000-12-31 22:00:55,Titanic (1997),Drama|Romance
3,1,1022,0,37,5.0,2000-12-31 22:00:55,Cinderella (1950),Animation|Children's|Musical
4,1,2340,0,24,3.0,2000-12-31 22:01:43,Meet Joe Black (1998),Romance


# Build movie genre features

Each movie may have several genres.

To use genres as features, the notebook creates one-hot columns like:

- genre_Action = 1

- genre_Drama = 0

- ...

Each movie ends up with one row of genre indicators.

In [8]:
movie_genre_dummy = pd.get_dummies(movie_genres, columns=["genre"])

In [9]:
movie_genre_agg = (
    movie_genre_dummy
    .groupby("m_index", as_index=False)
    .sum()
)

In [10]:
movie_genre_agg.head()

Unnamed: 0,m_index,genre_Action,genre_Adventure,genre_Animation,genre_Children's,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Fantasy,genre_Film-Noir,genre_Horror,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Thriller,genre_War,genre_Western
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
3,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,4,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0


# Build user genre profiles

Users often show patterns in what they watch.

To capture this, the notebook averages the genre vectors of the movies each user interacted with.

This gives a simple “taste profile” for every user.

In [11]:
train_with_genre = train.merge(movie_genre_agg, on="m_index", how="left")

In [12]:
genre_cols = [c for c in movie_genre_agg.columns if c.startswith("genre_")]

In [13]:
user_genre_profile = (
    train_with_genre
    .groupby("u_index")[genre_cols]
    .mean()
    .reset_index()
)

In [14]:
user_genre_profile.head()

Unnamed: 0,u_index,genre_Action,genre_Adventure,genre_Animation,genre_Children's,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Fantasy,genre_Film-Noir,genre_Horror,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Thriller,genre_War,genre_Western
0,0,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,0.488372,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0
1,1,0.298077,0.115385,0.0,0.0,0.221154,0.086538,0.0,0.711538,0.0,0.009615,0.009615,0.0,0.009615,0.192308,0.096154,0.173077,0.134615,0.019231
2,2,0.560976,0.585366,0.02439,0.02439,0.487805,0.0,0.0,0.170732,0.04878,0.0,0.073171,0.0,0.02439,0.073171,0.121951,0.121951,0.04878,0.146341
3,3,0.882353,0.352941,0.0,0.058824,0.0,0.058824,0.0,0.294118,0.117647,0.0,0.176471,0.0,0.0,0.117647,0.470588,0.117647,0.176471,0.058824
4,4,0.132075,0.050314,0.025157,0.031447,0.295597,0.113208,0.031447,0.522013,0.0,0.018868,0.056604,0.012579,0.031447,0.132075,0.062893,0.163522,0.031447,0.006289


# Create labeled user–movie pairs

The hybrid model needs labeled examples:

- **label = 1** → user interacted with the movie

- **label = 0** → user did not interact with the movie

Positive pairs come from the train split.
Negative pairs are sampled randomly.

In [15]:
positive_pairs = train[["u_index", "m_index"]].copy()
positive_pairs["label"] = 1

In [16]:
positive_pairs.head()

Unnamed: 0,u_index,m_index,label
0,0,31,1
1,0,22,1
2,0,27,1
3,0,37,1
4,0,24,1


# Negative sampling

Negative examples are created by selecting movies that a user did not interact with.

A small number of negatives per user is enough to train the model.

In [17]:
all_movies = movie_map["m_index"].unique()
user_pos = train.groupby("u_index")["m_index"].apply(set).to_dict()

In [18]:
neg_rows = []
NEG_PER_USER = 20

In [19]:
for user, pos_set in user_pos.items():
    max_negs = len(all_movies) - len(pos_set)
    if max_negs <= 0:
        continue
    num_negs = min(NEG_PER_USER, max_negs)
    candidates = np.setdiff1d(all_movies, np.array(list(pos_set), dtype=np.int64))
    if len(candidates) == 0:
        continue
    sampled = np.random.choice(candidates, size=num_negs, replace=False)
    for m in sampled:
        neg_rows.append((user, int(m), 0))

In [20]:
negative_pairs = pd.DataFrame(neg_rows, columns=["u_index", "m_index", "label"])
negative_pairs.head()

Unnamed: 0,u_index,m_index,label
0,0,289,0
1,0,1483,0
2,0,2618,0
3,0,2320,0
4,0,265,0


# Combine all pairs

Now, positive and negative pairs are merged into one dataset.

In [21]:
pairs = pd.concat([positive_pairs, negative_pairs], ignore_index=True)
pairs.head()

Unnamed: 0,u_index,m_index,label
0,0,31,1
1,0,22,1
2,0,27,1
3,0,37,1
4,0,24,1


# Add genre features to each pair

Each pair now receives:

- the user’s taste profile

- the movie’s genre vector

This helps the model learn which genres match a user.

In [22]:
pairs = pairs.merge(user_genre_profile, on="u_index", how="left")

In [23]:
user_genre_cols = [c for c in pairs.columns if c.startswith("genre_")]

In [24]:
pairs.rename(columns={c: f"{c}_user" for c in user_genre_cols}, inplace=True)

In [25]:
pairs.head()

Unnamed: 0,u_index,m_index,label,genre_Action_user,genre_Adventure_user,genre_Animation_user,genre_Children's_user,genre_Comedy_user,genre_Crime_user,genre_Documentary_user,...,genre_Fantasy_user,genre_Film-Noir_user,genre_Horror_user,genre_Musical_user,genre_Mystery_user,genre_Romance_user,genre_Sci-Fi_user,genre_Thriller_user,genre_War_user,genre_Western_user
0,0,31,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0
1,0,22,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0
2,0,27,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0
3,0,37,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0
4,0,24,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0.069767,0.0,0.0,0.232558,0.0,0.116279,0.069767,0.046512,0.046512,0.0


# Add movie genre columns

Movie genres are merged with a **_movie** suffix to keep them separate from user features.

In [26]:
movie_features = movie_genre_agg.copy()
movie_genre_cols = [c for c in movie_features.columns if c.startswith("genre_")]

In [27]:
movie_features.rename(
    columns={c: f"{c}_movie" for c in movie_genre_cols},
    inplace=True
)

In [28]:
pairs = pairs.merge(movie_features, on="m_index", how="left")
pairs.head()

Unnamed: 0,u_index,m_index,label,genre_Action_user,genre_Adventure_user,genre_Animation_user,genre_Children's_user,genre_Comedy_user,genre_Crime_user,genre_Documentary_user,...,genre_Fantasy_movie,genre_Film-Noir_movie,genre_Horror_movie,genre_Musical_movie,genre_Mystery_movie,genre_Romance_movie,genre_Sci-Fi_movie,genre_Thriller_movie,genre_War_movie,genre_Western_movie
0,0,31,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,0,0,0,0,0
1,0,22,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,0,1,0,0,0
2,0,27,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,1,0,0,0,0
3,0,37,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,1,0,0,0,0,0,0
4,0,24,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,1,0,0,0,0


# Add user–movie genre similarity

Now, a cosine similarity score is computed between:

- the user’s genre profile

- the movie’s genre vector

This gives a simple measure of how well the movie fits the user’s taste.

In [29]:
genre_user_cols = [c for c in pairs.columns if c.endswith("_user")]
genre_movie_cols = [c for c in pairs.columns if c.endswith("_movie")]

In [30]:
def compute_genre_similarity(row):
    u = row[genre_user_cols].values.astype(float)
    m = row[genre_movie_cols].values.astype(float)
    if norm(u) == 0 or norm(m) == 0:
        return 0.0
    return float(np.dot(u, m) / (norm(u) * norm(m)))

In [31]:
pairs["genre_sim"] = pairs.apply(compute_genre_similarity, axis=1)

In [32]:
pairs.head()

Unnamed: 0,u_index,m_index,label,genre_Action_user,genre_Adventure_user,genre_Animation_user,genre_Children's_user,genre_Comedy_user,genre_Crime_user,genre_Documentary_user,...,genre_Film-Noir_movie,genre_Horror_movie,genre_Musical_movie,genre_Mystery_movie,genre_Romance_movie,genre_Sci-Fi_movie,genre_Thriller_movie,genre_War_movie,genre_Western_movie,genre_sim
0,0,31,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,0,0,0,0,0.698836
1,0,22,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,1,0,0,0,0.282372
2,0,27,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,1,0,0,0,0,0.611807
3,0,37,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,1,0,0,0,0,0,0,0.557177
4,0,24,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,1,0,0,0,0,0.16639


# Adding ALS score

The ALS model captures user–movie patterns that genre features cannot.

The saved ALS model is loaded and computes an ALS score for every pair.

In [33]:
with open(PROCESSED_DIR / "als_model.pkl", "rb") as f:
    als_model = pickle.load(f)

In [34]:
item_user = sp.load_npz(PROCESSED_DIR / "item_user_train.npz")

In [35]:
print("User factors:", als_model.user_factors.shape)
print("Item factors:", als_model.item_factors.shape)

User factors: (3706, 64)
Item factors: (6040, 64)


In [36]:
def als_score(user, movie):
    if user >= als_model.user_factors.shape[0]:
        return 0.0
    if movie >= als_model.item_factors.shape[0]:
        return 0.0
    return float(np.dot(als_model.user_factors[user], als_model.item_factors[movie]))

In [37]:
pairs["als_score"] = pairs[["u_index", "m_index"]].apply(
    lambda row: als_score(int(row["u_index"]), int(row["m_index"])),
    axis=1
)

In [38]:
pairs.head()

Unnamed: 0,u_index,m_index,label,genre_Action_user,genre_Adventure_user,genre_Animation_user,genre_Children's_user,genre_Comedy_user,genre_Crime_user,genre_Documentary_user,...,genre_Horror_movie,genre_Musical_movie,genre_Mystery_movie,genre_Romance_movie,genre_Sci-Fi_movie,genre_Thriller_movie,genre_War_movie,genre_Western_movie,genre_sim,als_score
0,0,31,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,0,0,0,0,0.698836,0.216864
1,0,22,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,0,1,0,0,0,0.282372,0.70085
2,0,27,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,1,0,0,0,0,0.611807,0.764807
3,0,37,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,1,0,0,0,0,0,0,0.557177,-0.398312
4,0,24,1,0.116279,0.093023,0.186047,0.255814,0.209302,0.046512,0.0,...,0,0,0,1,0,0,0,0,0.16639,-0.122625


# Saving the final training data

The combined dataset now contains:

- labels

- movie genre features

- user genre profiles

- similarity scores

- ALS scores

It is saved as a Parquet file.

In [39]:
output_path = PROCESSED_DIR / "hybrid_train_pairs.parquet"
pairs.to_parquet(output_path, index=False)

In [40]:
print("Saved to:", output_path)

Saved to: /Users/sanjaydilip/Desktop/Code/Projects/Movie Recommender/data/processed/hybrid_train_pairs.parquet
