In [1]:
# If you re-run the notebook, this won't re-download unless needed.
import io, zipfile, requests
from pathlib import Path

DATA_DIR = Path("ml-latest-small")
URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

if not DATA_DIR.exists():
    print("Downloading MovieLens ml-latest-small...")
    r = requests.get(URL, timeout=60)
    r.raise_for_status()  # fail loudly on HTTP errors instead of unzipping an error page
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(".")
    print("Downloaded and extracted to ./ml-latest-small")
else:
    print("Dataset already present at ./ml-latest-small")

Downloading MovieLens ml-latest-small...
Downloaded and extracted to ./ml-latest-small


In [2]:
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")  # columns: movieId, title, genres
print(movies.shape)
movies.head()

(9742, 3)


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# Clean/normalize the genres and build a simple "content soup"
# For this part, we only use genres. We turn "Adventure|Animation|Children" into "adventure animation children".
movies["genres"] = movies["genres"].fillna("").replace("(no genres listed)", "", regex=False)
movies["content_soup"] = (
    movies["genres"]
    .str.replace("|", " ", regex=False)
    .str.replace("-", "", regex=False)  # treat "sci-fi" as "scifi"
    .str.lower()
    .str.strip()
)

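# Build a case-insensitive title lookup.
# Titles are not guaranteed to be unique; keep the first occurrence so that
# title_to_index[query] always returns a single integer rather than a Series.
movies["title_lower"] = movies["title"].str.lower().str.strip()
title_to_index = pd.Series(movies.index, index=movies["title_lower"])
title_to_index = title_to_index[~title_to_index.index.duplicated(keep="first")]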
movies[["movieId", "title", "genres", "content_soup"]].head(10)

,movieId,title,genres,content_soup
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,adventure animation children comedy fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy,adventure children fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance,comedy romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,comedy drama romance
4,5,Father of the Bride Part II (1995),Comedy,comedy
5,6,Heat (1995),Action|Crime|Thriller,action crime thriller
6,7,Sabrina (1995),Comedy|Romance,comedy romance
7,8,Tom and Huck (1995),Adventure|Children,adventure children
8,9,Sudden Death (1995),Action,action
9,10,GoldenEye (1995),Action|Adventure|Thriller,action adventure thriller


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use a token pattern that keeps whole space-separated tokens (including short tokens)
# because our tokens are clean genre tags after preprocessing.
vectorizer = TfidfVectorizer(token_pattern=r"[^ ]+")
tfidf = vectorizer.fit_transform(movies["content_soup"])  # shape: (num_movies, num_features)
tfidf.shape

(9742, 19)

In [5]:
from sklearn.metrics.pairwise import linear_kernel
import numpy as np

# Because TF-IDF is L2-normalized by default, the linear kernel equals cosine similarity.
cosine_sim = linear_kernel(tfidf, tfidf)  # dense NxN matrix
# Optional: store as float32 to save memory
cosine_sim = cosine_sim.astype(np.float32)

cosine_sim.shape

(9742, 9742)
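
The dense matrix above holds 9742 × 9742 = 94,906,564 entries, which is roughly 380 MB in float32 (about 760 MB in float64). A lighter alternative is to compute only the one row needed for each query. The sketch below assumes L2-normalized rows, so a dot product equals cosine similarity. `top_k_similar` is a hypothetical helper, and the small dense array stands in for the TF-IDF matrix. With the sparse `tfidf` matrix, the equivalent row would presumably come from `linear_kernel(tfidf[idx], tfidf).ravel()`.

```python
import numpy as np

def top_k_similar(X, idx, top_k=5):
    """Return (indices, scores) of the top_k rows most similar to row idx.

    Assumes every row of X is L2-normalized, so X @ X[idx] is one row
    of the cosine-similarity matrix.
    """
    scores = X @ X[idx]                         # shape (n,), O(n * d) work
    order = np.argsort(-scores, kind="stable")  # descending; ties keep row order
    order = order[order != idx][:top_k]         # drop the query movie itself
    return order, scores[order]

# Toy genre indicators (hypothetical data): columns = adventure, children, horror
genres = np.array([
    [1, 1, 0],   # movie 0
    [1, 1, 0],   # movie 1: same genres as movie 0
    [1, 0, 0],   # movie 2
    [0, 0, 1],   # movie 3
], dtype=np.float64)
X = genres / np.linalg.norm(genres, axis=1, keepdims=True)

indices, scores = top_k_similar(X, idx=0, top_k=2)
print(indices, scores)
```

Movies with an identical genre set tie at similarity 1.0, which is the same effect visible in the Toy Story recommendations: genre-only features cannot rank among exact matches.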

In [6]:
from difflib import get_close_matches

def recommend_similar_by_genre(title: str, top_k: int = 5):
    """
    Recommend top_k movies similar in genres to the provided title.
    Returns a DataFrame with Title and Similarity.
    """
    if not isinstance(title, str) or not title.strip():
        raise ValueError("Please provide a non-empty movie title (string).")

    query = title.lower().strip()
    if query not in title_to_index:
        # Fuzzy suggestion if exact match not found
        candidates = title_to_index.index.tolist()
        suggestions = get_close_matches(query, candidates, n=5, cutoff=0.6)
        raise ValueError(
            f"Title '{title}' not found. Did you mean: {suggestions} ?"
            if suggestions else f"Title '{title}' not found and no close matches."
        )

    idx = title_to_index[query]
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity score descending, skip the movie itself (idx)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [(i, score) for i, score in sim_scores if i != idx][:top_k]

    rec_indices = [i for i, _ in sim_scores]
    rec_scores = [float(s) for _, s in sim_scores]

    out = pd.DataFrame({
        "Title": movies.loc[rec_indices, "title"].values,
        "Similarity": rec_scores
    })
    return out


recommend_similar_by_genre

In [7]:

example_title = "Toy Story (1995)"
print(f"Recommendations for: {example_title}\n")
display(recommend_similar_by_genre(example_title, top_k=5))

Recommendations for: Toy Story (1995)



,Title,Similarity
0,Antz (1998),1.0
1,Toy Story 2 (1999),1.0
2,"Adventures of Rocky and Bullwinkle, The (2000)",1.0
3,"Emperor's New Groove, The (2000)",1.0
4,"Monsters, Inc. (2001)",1.0


### Analytical Question: Major limitation of a purely content-based recommender

A major limitation is overspecialization and lack of serendipity: the system only recommends items that are similar in content (e.g., same genres) to what a user already likes, which can trap users in a narrow bubble. It cannot leverage the wisdom of the crowd to surface unexpectedly relevant items outside the user’s established content profile. As a result, it often misses diverse or surprising recommendations that collaborative signals can uncover.

In [None]:
# Clean out conflicting installs
!pip -q uninstall -y scikit-surprise surprise numpy || true

# Install a NumPy version compatible with Surprise and the build tooling
!pip -q install --no-cache-dir "numpy<2.0" "cython<3.1"

# Build Surprise from source against the just-installed NumPy
!pip -q install --no-cache-dir --no-binary scikit-surprise scikit-surprise==1.1.3

# IMPORTANT: restart the Python process so the new NumPy is loaded cleanly
import os, sys
print("Restarting runtime to finalize the install...")
os.kill(os.getpid(), 9)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.

In [None]:
import os
print("Restarting runtime...")
os.kill(os.getpid(), 9)

In [1]:
import numpy as np
print("numpy:", np.__version__)

from surprise import Dataset, Reader, SVD
import surprise
print("surprise:", surprise.__version__)

numpy: 1.26.4
surprise: 1.1.3


In [2]:
import io, zipfile, requests
from pathlib import Path
import pandas as pd

DATA_DIR = Path("ml-latest-small")
URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

if not DATA_DIR.exists():
    r = requests.get(URL, timeout=60)
    r.raise_for_status()  # fail loudly on HTTP errors
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(".")
ratings = pd.read_csv("ml-latest-small/ratings.csv")
movies  = pd.read_csv("ml-latest-small/movies.csv")

In [3]:
from surprise import Dataset, Reader, SVD

reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
trainset = data.build_full_trainset()

svd = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
svd.fit(trainset)

svd

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7e8308f593a0>

In [4]:
sample = ratings.sample(1, random_state=123)
uid = int(sample["userId"].iloc[0])
iid = int(sample["movieId"].iloc[0])
true_rating = float(sample["rating"].iloc[0])

title = movies.loc[movies["movieId"] == iid, "title"]
title = title.iloc[0] if len(title) else f"movieId={iid}"

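# Note: the model was fit on the full trainset, so this rating was seen during training.
# The comparison below is a sanity check, not a held-out evaluation.
pred = svd.predict(uid, iid)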

print("Sample prediction with SVD")
print(f"User: {uid}")
print(f"Movie: {iid} - {title}")
print(f"Actual rating in dataset: {true_rating}")
print(f"Predicted rating: {pred.est:.3f}")
print(f"Details: {pred.details}")

Sample prediction with SVD
User: 105
Movie: 77455 - Exit Through the Gift Shop (2010)
Actual rating in dataset: 5.0
Predicted rating: 4.656
Details: {'was_impossible': False}


### Analytical Question: Core assumption of collaborative filtering (SVD)

Collaborative filtering with SVD assumes that user–item ratings are driven by a small number of shared latent factors, which makes the rating matrix approximately low-rank. Users with similar latent preference vectors rate items similarly, and items with similar latent attribute vectors are liked by the same users.
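
One common way to write this assumption down is the biased matrix-factorization prediction rule, which matches the `bu`, `bi`, `pu`, and `qi` attributes of the fitted model used elsewhere in this notebook. The symbols are notation chosen here for illustration:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u, \qquad p_u,\, q_i \in \mathbb{R}^{k}
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the $k$-dimensional latent vectors (`n_factors`). Stacking the vectors into matrices $P$ and $Q$, the interaction term $P Q^{\top}$ has rank at most $k$, which is the low-rank structure the assumption refers to.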

## Conceptual Analysis: The Cold Start Problem in Recommender Systems

The cold start problem arises when the system has too little interaction data to make reliable, personalized recommendations. It manifests in two common ways:

### 1) New User (a user with no or very few ratings)
- Challenge: The system lacks evidence of the user’s tastes.
- Impact on Collaborative Filtering (SVD):
  - SVD needs user–item interactions to learn a user’s latent vector. With zero (or very few) ratings, the model cannot personalize: each prediction falls back to the global mean plus the item’s bias, so every new user sees essentially the same popularity-like ranking rather than one that reflects their own preferences.
- Impact on Content-Based:
  - Content-based uses item attributes (e.g., genres). If the user provides even minimal input (e.g., likes a couple of movies or selects favorite genres during onboarding), the system can recommend similar-content items immediately.
  - With absolutely no input, content-based also can’t personalize; it must fall back to non-personalized lists (e.g., popular, recent, curated).

► Better model for new-user cold start: Content-Based (assuming minimal onboarding input like a favorite movie or genres). It can leverage item attributes right away, whereas pure CF lacks enough data to learn a meaningful user embedding.

### 2) New Movie (an item with no or very few ratings)
- Challenge: The system has no interaction history for the item.
- Impact on Collaborative Filtering (SVD):
  - SVD needs ratings to learn the item’s latent vector and bias. A brand-new movie has no ratings, so CF cannot position it in latent space; its prediction falls back to the global mean plus the user’s bias, which is identical for every unseen item and gives no basis for surfacing it to the right users.
- Impact on Content-Based:
  - As long as the new movie has metadata (genres, tags, synopsis), content-based can immediately place it in the feature space and recommend it to users who like similar content—even before it receives any ratings.

► Better model for new-movie cold start: Content-Based. It exploits item features to recommend the new title without waiting for interactions, while CF cannot learn the item’s latent factors yet.
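
The fallback behavior described above can be sketched in a few lines. This is a minimal illustration of how a biased matrix-factorization model degrades when an id is unknown, written in plain Python rather than against the Surprise API. `predict_rating` and every parameter value are hypothetical, and the clipping bounds assume the 0.5 to 5.0 rating scale used in this notebook.

```python
from typing import Dict, List

def predict_rating(
    user: int,
    item: int,
    mu: float,
    bu: Dict[int, float],
    bi: Dict[int, float],
    pu: Dict[int, List[float]],
    qi: Dict[int, List[float]],
    lo: float = 0.5,
    hi: float = 5.0,
) -> float:
    """Biased MF estimate that drops any term whose id was never seen in training."""
    known_user = user in bu
    known_item = item in bi
    est = mu                                   # always available
    if known_user:
        est += bu[user]                        # user bias
    if known_item:
        est += bi[item]                        # item bias
    if known_user and known_item:
        # The personalized term needs BOTH latent vectors.
        est += sum(p * q for p, q in zip(pu[user], qi[item]))
    return min(hi, max(lo, est))               # clip to the rating scale

# Made-up parameters for one known user (1) and one known movie (10)
mu = 3.5
bu, bi = {1: 0.2}, {10: 0.3}
pu, qi = {1: [1.0, 0.5]}, {10: [0.4, -0.2]}

print(predict_rating(1, 10, mu, bu, bi, pu, qi))    # known user, known movie
print(predict_rating(99, 10, mu, bu, bi, pu, qi))   # new user: mu + item bias
print(predict_rating(1, 77, mu, bu, bi, pu, qi))    # new movie: mu + user bias
print(predict_rating(99, 77, mu, bu, bi, pu, qi))   # both new: global mean only
```

In both cold-start cases the dot-product term disappears, so the estimate carries no information about how this particular user relates to this particular movie.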



In [5]:
import io, zipfile, requests
from pathlib import Path
import pandas as pd

DATA_DIR = Path("ml-latest-small")
URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

if not DATA_DIR.exists():
    r = requests.get(URL, timeout=60)
    r.raise_for_status()  # fail loudly on HTTP errors
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(".")
    print("Downloaded dataset.")
else:
    print("Dataset already present.")

ratings = pd.read_csv("ml-latest-small/ratings.csv")
movies  = pd.read_csv("ml-latest-small/movies.csv")

ratings.head()

Dataset already present.


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

In [7]:
from surprise import SVD
from surprise.model_selection import GridSearchCV

param_grid = {
    "n_factors": [20, 50, 100],
    "n_epochs": [10, 20, 30],
    "lr_all":   [0.002, 0.005, 0.007],
    "reg_all":  [0.02, 0.05, 0.1],
}

gs = GridSearchCV(
    algo_class=SVD,
    param_grid=param_grid,
    measures=["rmse", "mae"],
    cv=3,
    n_jobs=-1,           # use all cores
    joblib_verbose=1     # show basic progress
)

gs.fit(data)

print("Best RMSE:", gs.best_score["rmse"])
print("Best params for RMSE:", gs.best_params["rmse"])
print("Best MAE:", gs.best_score["mae"])
print("Best params for MAE:", gs.best_params["mae"])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  2.1min


Best RMSE: 0.8682336826755721
Best params for RMSE: {'n_factors': 50, 'n_epochs': 30, 'lr_all': 0.007, 'reg_all': 0.05}
Best MAE: 0.6660573261853587
Best params for MAE: {'n_factors': 50, 'n_epochs': 30, 'lr_all': 0.007, 'reg_all': 0.05}


[Parallel(n_jobs=-1)]: Done 243 out of 243 | elapsed:  2.9min finished


In [8]:
import pandas as pd

results = pd.DataFrame(gs.cv_results)
results = results.sort_values(by="mean_test_rmse")
results.head(10)

,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,...,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_n_factors,param_n_epochs,param_lr_all,param_reg_all
52,0.866827,0.865832,0.872042,0.868234,0.002723,1,0.664857,0.664113,0.669202,0.666057,...,1,1.190091,0.128878,0.318549,0.004542,"{'n_factors': 50, 'n_epochs': 30, 'lr_all': 0....",50,30,0.007,0.05
80,0.86579,0.866939,0.872519,0.868416,0.002939,2,0.664508,0.665196,0.671239,0.666981,...,4,2.174884,0.326833,0.410472,0.166513,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.007,0.1
53,0.866864,0.867004,0.873237,0.869035,0.002972,3,0.665672,0.665247,0.671789,0.667569,...,5,1.063317,0.018969,0.304611,0.015139,"{'n_factors': 50, 'n_epochs': 30, 'lr_all': 0....",50,30,0.007,0.1
79,0.866194,0.867916,0.87465,0.869587,0.003649,4,0.66396,0.664918,0.672056,0.666978,...,3,1.762531,0.037482,0.312614,0.010618,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.007,0.05
25,0.86766,0.867793,0.874147,0.869866,0.003027,5,0.664786,0.664887,0.670747,0.666807,...,2,0.632142,0.053735,0.326235,0.044183,"{'n_factors': 20, 'n_epochs': 30, 'lr_all': 0....",20,30,0.007,0.05
26,0.868385,0.868536,0.874495,0.870472,0.002845,6,0.666553,0.666533,0.673155,0.668747,...,7,0.717489,0.101074,0.390993,0.111516,"{'n_factors': 20, 'n_epochs': 30, 'lr_all': 0....",20,30,0.007,0.1
76,0.867159,0.869936,0.875047,0.870714,0.003267,7,0.665358,0.667035,0.673069,0.668487,...,6,2.000111,0.332376,0.446371,0.059195,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.005,0.05
49,0.868423,0.868622,0.875172,0.870739,0.003135,8,0.667051,0.666289,0.673308,0.668883,...,8,1.246772,0.109967,0.506102,0.156046,"{'n_factors': 50, 'n_epochs': 30, 'lr_all': 0....",50,30,0.005,0.05
22,0.867049,0.870034,0.875945,0.871009,0.003697,9,0.665617,0.667766,0.673739,0.669041,...,9,0.681158,0.019528,0.274589,0.006085,"{'n_factors': 20, 'n_epochs': 30, 'lr_all': 0....",20,30,0.005,0.05
43,0.867592,0.869832,0.875727,0.87105,0.003431,10,0.666191,0.667868,0.673837,0.669299,...,10,0.747721,0.059485,0.308801,0.023306,"{'n_factors': 50, 'n_epochs': 20, 'lr_all': 0....",50,20,0.007,0.05


In [9]:
# Build full trainset
trainset = data.build_full_trainset()

best_params = gs.best_params["rmse"]
final_svd = SVD(
    n_factors=best_params["n_factors"],
    n_epochs=best_params["n_epochs"],
    lr_all=best_params["lr_all"],
    reg_all=best_params["reg_all"],
    random_state=42
)

final_svd.fit(trainset)

final_svd  # keep this visible for your screenshot

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7e831ceea150>

In [10]:
sample = ratings.sample(1, random_state=123)
uid = int(sample["userId"].iloc[0])
iid = int(sample["movieId"].iloc[0])

pred = final_svd.predict(uid, iid)
print(f"Predicted rating for user {uid} on movie {iid}: {pred.est:.3f}")

Predicted rating for user 105 on movie 77455: 4.786


## Why Hyperparameter Tuning Matters (and the risks of defaults)

Hyperparameter tuning is critical because it aligns the model’s bias–variance trade‑off with the specific data and business objective. The same algorithm can perform very differently depending on learning rate, regularization strength, number of training epochs, and model capacity (e.g., latent factors). Systematically searching the hyperparameter space helps the model generalize better to unseen users/items, improves ranking/accuracy metrics, and can also reduce training instability.

Risks of using default parameters:
- Underfitting or overfitting: Defaults are not tailored to your data; they can be too weak (high bias) or too flexible (high variance).
- Poor generalization: Good training performance may not translate to real users, harming CTR/engagement.
- Instability and convergence issues: A default learning rate may be too high or too low, and default regularization may be too weak, leading to noisy or brittle models.
- Suboptimal resource use: Unnecessarily large models or long training schedules waste time and compute without improving outcomes.
- Hidden bias and fairness concerns: Inadequately regularized or poorly tuned models may amplify popularity bias or fail to serve tail users/items well.

Conclusion: Tune before deploying to achieve reliable, reproducible, and performant recommendations suited to your data and goals.

In [11]:
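# Run this in Colab after svd.fit(trainset).
# Note: `svd` is the model fitted in In [3]. To export the tuned model from In [9] instead,
# substitute `final_svd` for `svd` throughout this cell.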
import os, json
import numpy as np
import pandas as pd
from pathlib import Path

ART = Path("artifacts")
ART.mkdir(exist_ok=True)

# 1) Save Surprise model too (optional, for Option 1)
from surprise import dump as svd_dump
svd_dump.dump(file_name=str(ART / "svd_model"), algo=svd, verbose=1)

# 2) Export factors for Option 2 (no Surprise at runtime)
ts = svd.trainset
mu = float(ts.global_mean)
k = int(svd.n_factors)

# User factors and biases
rows = []
for u_inner in range(ts.n_users):
    u_raw = int(ts.to_raw_uid(u_inner))
    bu = float(svd.bu[u_inner])
    pu = svd.pu[u_inner].astype(float)
    row = {"userId": u_raw, "bu": bu}
    row.update({f"f{i}": float(pu[i]) for i in range(k)})
    rows.append(row)
user_factors = pd.DataFrame(rows)
user_factors.to_csv(ART / "svd_user_factors.csv", index=False)

# Item factors and biases
rows = []
for i_inner in range(ts.n_items):
    i_raw = int(ts.to_raw_iid(i_inner))
    bi = float(svd.bi[i_inner])
    qi = svd.qi[i_inner].astype(float)
    row = {"movieId": i_raw, "bi": bi}
    row.update({f"f{i}": float(qi[i]) for i in range(k)})
    rows.append(row)
item_factors = pd.DataFrame(rows)
item_factors.to_csv(ART / "svd_item_factors.csv", index=False)

# Meta
with open(ART / "svd_meta.json", "w") as f:
    json.dump({"global_mean": mu, "n_factors": k}, f)

print("Exported:", list(ART.iterdir()))

The dump has been saved as file artifacts/svd_model
Exported: [PosixPath('artifacts/svd_meta.json'), PosixPath('artifacts/svd_user_factors.csv'), PosixPath('artifacts/svd_item_factors.csv'), PosixPath('artifacts/svd_model')]
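
For the "Option 2" path (no Surprise at runtime), the exported tables can be scored with the standard library alone. The sketch below is one possible reader, assuming the column layout written above (`userId`/`movieId`, `bu`/`bi`, then `f0` … `f{k-1}`). `load_factors` and `predict` are hypothetical helpers, and the in-memory strings with made-up values stand in for the real CSV and JSON files.

```python
import csv
import io
import json

def load_factors(csv_text, id_col, bias_col, k):
    """Parse a factor table into {raw_id: (bias, [f0, ..., f{k-1}])}."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        vector = [float(row[f"f{i}"]) for i in range(k)]
        table[int(row[id_col])] = (float(row[bias_col]), vector)
    return table

def predict(user_id, movie_id, users, items, mu):
    """Unclipped estimate: global mean + biases + dot product of the factors."""
    bu, pu = users[user_id]
    bi, qi = items[movie_id]
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))

# Stand-ins for svd_meta.json, svd_user_factors.csv, svd_item_factors.csv
meta = json.loads('{"global_mean": 3.5, "n_factors": 2}')
user_csv = "userId,bu,f0,f1\n1,0.1,0.5,-0.2\n"
item_csv = "movieId,bi,f0,f1\n10,0.2,0.4,0.1\n20,-0.3,-0.6,0.3\n"

users = load_factors(user_csv, "userId", "bu", meta["n_factors"])
items = load_factors(item_csv, "movieId", "bi", meta["n_factors"])

# Score every exported movie for user 1 and rank from best to worst
scores = {m: predict(1, m, users, items, meta["global_mean"]) for m in items}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In real use, the strings would be replaced by the contents of the files under `artifacts/`. Unlike Surprise's `predict`, which clips to the rating scale by default, this sketch returns the raw estimate, which is sufficient for ranking.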


## Hybrid Ranking Strategy

I implemented a hybrid recommender that merges:
- The top 10 items from Collaborative Filtering (CF, SVD-based) personalized to the user, and
- The top 10 content-similar items to a favorite movie (genres-based).

To combine and re-rank candidates, I used Reciprocal Rank Fusion (RRF) with a slight preference for CF:

- For a movie with rank r_CF in the CF list and rank r_CB in the content list, the hybrid score is:
  
  Score = 0.6 / (60 + r_CF) + 0.4 / (60 + r_CB)

  If a movie appears in only one list, its score from the other list is 0.

- I then sort by this hybrid score (higher is better), breaking ties by:
  1) higher CF predicted rating, then
  2) higher content similarity.

- Duplicates are removed by movieId. The final top-10 is returned.
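
The fusion rule above can be sketched as follows. This is a minimal illustration, not the exact code behind the reported results: `rrf_merge` is a hypothetical helper, each input list is assumed to contain unique movieIds ordered best first, and the optional score dictionaries are used only to break ties.

```python
from typing import Dict, List, Optional

def rrf_merge(
    cf_ranked: List[int],
    cb_ranked: List[int],
    cf_scores: Optional[Dict[int, float]] = None,
    cb_scores: Optional[Dict[int, float]] = None,
    w_cf: float = 0.6,
    w_cb: float = 0.4,
    k: int = 60,
    top_n: int = 10,
) -> List[int]:
    """Fuse two ranked movieId lists with weighted Reciprocal Rank Fusion."""
    cf_scores = cf_scores or {}
    cb_scores = cb_scores or {}
    fused: Dict[int, float] = {}
    # Ranks are 1-based; a movie absent from a list contributes 0 for that list.
    for rank, movie_id in enumerate(cf_ranked, start=1):
        fused[movie_id] = fused.get(movie_id, 0.0) + w_cf / (k + rank)
    for rank, movie_id in enumerate(cb_ranked, start=1):
        fused[movie_id] = fused.get(movie_id, 0.0) + w_cb / (k + rank)
    # Keying by movieId merges duplicates across the two lists.
    # Sort by fused score, then CF predicted rating, then content similarity;
    # a missing tie-break score sorts last.
    ordered = sorted(
        fused,
        key=lambda m: (
            -fused[m],
            -cf_scores.get(m, float("-inf")),
            -cb_scores.get(m, float("-inf")),
        ),
    )
    return ordered[:top_n]

# Hypothetical movieIds: 10 and 30 appear in both lists
hybrid = rrf_merge(cf_ranked=[10, 20, 30], cb_ranked=[30, 40, 10])
print(hybrid)
```

With two top-10 lists and these weights, a movie present in both lists scores at least 1/70 ≈ 0.0143, above the 0.6/61 ≈ 0.0098 maximum for a movie in a single list. Every CF-only movie (at least 0.6/70 ≈ 0.0086) also outranks every content-only movie (at most 0.4/61 ≈ 0.0066), so the 0.6/0.4 split acts as a strict ordering between single-list candidates.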

Why prefer CF slightly? CF captures personal taste signals learned from the user’s historical ratings, which are strong predictors of future interest. Content-based complements CF by ensuring topical relevance to the user’s chosen favorite and by surfacing items CF may overlook (including less‑popular or newly-described titles).

Why is a hybrid approach often stronger than a single model?
- Personalization + relevance: CF leverages user behavior, while content-based ensures semantic similarity.
- Coverage and cold-start robustness: Content-based can recommend new items with metadata; CF exploits crowd wisdom once interactions exist.
- Diversity and serendipity: Blending two signals reduces overspecialization and helps surface relevant-but-unexpected items.
- Stability: Rank fusion is robust to score scale differences and model idiosyncrasies.

This hybrid approach balances personalization (CF) and semantic similarity (content), typically improving both precision and user satisfaction.