# Collaborative Filtering – Movie Recommendation 

In this notebook I implement a simple **user-based collaborative filtering** system on a small, synthetic movie-rating dataset.

I follow the assignment instructions:

- Choose **20 movies from different genres** (IMDb-style titles).
- Create a **user–item rating matrix** with 15 users, where:
  - Users 1–5 rate 8–10 movies each.
  - Users 6–10 rate 4–6 movies each.
  - Users 11–15 are “new users” and rate only 2–3 movies each.
- Ratings are integers between **1–5**.
- Perform **normalization including mean centering** of user ratings.
- Compute **user–user similarity** and find the **three most similar pairs** among the first 10 users.
- Generate **movie recommendations for the 5 new users** (users 11–15), based on their 3 most similar users, with a weighted score according to similarity.


In [11]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# 1. Define movies
movie_titles = [
    "The Shawshank Redemption",  # 0
    "The Dark Knight",           # 1
    "Inception",                 # 2
    "The Godfather",             # 3
    "Forrest Gump",              # 4
    "The Matrix",                # 5
    "Interstellar",              # 6
    "Parasite",                  # 7
    "Spirited Away",             # 8
    "The Lion King",             # 9
    "Pulp Fiction",              # 10
    "Fight Club",                # 11
    "The Lord of the Rings: The Fellowship of the Ring",  # 12
    "The Social Network",        # 13
    "La La Land",                # 14
    "Mad Max: Fury Road",        # 15
    "Avengers: Endgame",         # 16
    "Get Out",                   # 17
    "Titanic",                   # 18
    "Toy Story"                  # 19
]

n_users = 15
n_movies = len(movie_titles)

np.random.seed(42)  # for reproducibility

## 1. Movies and Rating Matrix

I define a set of **20 movies** from different genres (action, drama, sci-fi, fantasy, animation, etc.). The specific titles are not important for the algorithm itself, but they make the results easier to interpret.

I then build a **15 × 20 rating matrix**, where:

- Rows = users (`User 1` … `User 15`)
- Columns = movies
- Cells = integer rating from **1–5**, or **NaN** if the user did not rate that movie.

To match the assignment constraints:

- Users **1–5** each rate between **8–10** movies.
- Users **6–10** each rate between **4–6** movies.
- Users **11–15** (the “new users”) each rate between **2–3** movies.

The ratings are generated randomly with a fixed random seed so that the results are **reproducible**.


In [12]:
# 2. Create rating matrix with NaNs
ratings = np.full((n_users, n_movies), np.nan)

# Users 1-5 (indices 0-4): 8-10 ratings each
for u in range(0, 5):
    k = np.random.randint(8, 11)  # 8–10
    items = np.random.choice(n_movies, size=k, replace=False)
    ratings[u, items] = np.random.randint(1, 6, size=k)

# Users 6-10 (indices 5-9): 4-6 ratings each
for u in range(5, 10):
    k = np.random.randint(4, 7)  # 4–6
    items = np.random.choice(n_movies, size=k, replace=False)
    ratings[u, items] = np.random.randint(1, 6, size=k)

# Users 11-15 (indices 10-14): 2-3 ratings each (new users)
for u in range(10, 15):
    k = np.random.randint(2, 4)  # 2–3
    items = np.random.choice(n_movies, size=k, replace=False)
    ratings[u, items] = np.random.randint(1, 6, size=k)

# Put into a DataFrame for readability
df = pd.DataFrame(ratings, columns=movie_titles)
df.index = [f"User {i}" for i in range(1, n_users + 1)]

print("Ratings matrix (NaN = no rating):")
display(df)

print("\nNumber of rated movies per user:")
display(df.notna().sum(axis=1))

Ratings matrix (NaN = no rating):


User,The Shawshank Redemption,The Dark Knight,Inception,The Godfather,Forrest Gump,The Matrix,Interstellar,Parasite,Spirited Away,The Lion King,Pulp Fiction,Fight Club,The Lord of the Rings: The Fellowship of the Ring,The Social Network,La La Land,Mad Max: Fury Road,Avengers: Endgame,Get Out,Titanic,Toy Story
User 1,2.0,1.0,,4.0,,2.0,,,4.0,,,5.0,,,,5.0,1.0,4.0,1.0,
User 2,1.0,2.0,,,4.0,1.0,,2.0,,,2.0,,5.0,,,,4.0,,4.0,2.0
User 3,,,,,4.0,1.0,,,2.0,2.0,4.0,,5.0,,,2.0,2.0,,5.0,4.0
User 4,,1.0,1.0,3.0,5.0,2.0,,,,2.0,4.0,1.0,,,,4.0,,,,
User 5,,2.0,,1.0,,,1.0,,2.0,5.0,,,4.0,3.0,3.0,,,4.0,,
User 6,4.0,,,,,,,4.0,4.0,4.0,,,,,,,,,,
User 7,,3.0,,,1.0,,,4.0,,,,4.0,,,1.0,,,,,
User 8,,,,,,,,,,5.0,,,,2.0,,2.0,,4.0,3.0,
User 9,,,,,,,3.0,,,3.0,,,2.0,,4.0,3.0,,,,
User 10,,,4.0,4.0,1.0,2.0,,,,,,2.0,1.0,,,,,,,



Number of rated movies per user:


User 1     10
User 2     10
User 3     10
User 4      9
User 5      9
User 6      4
User 7      5
User 8      5
User 9      5
User 10     6
User 11     2
User 12     3
User 13     2
User 14     2
User 15     3
dtype: int64
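
As a quick sanity check, the counts printed above can be compared against the ranges required by the assignment. The sketch below is only an illustration: the `counts` list is copied by hand from the output, and the helper name `violations` is made up for this example.

```python
# Rated-movie counts per user, copied from the output above (User 1 ... User 15).
counts = [10, 10, 10, 9, 9, 4, 5, 5, 5, 6, 2, 3, 2, 2, 3]

# (first user, last user, min ratings, max ratings) for each group in the assignment.
groups = [(1, 5, 8, 10), (6, 10, 4, 6), (11, 15, 2, 3)]


def violations(counts, groups):
    """Return (user, count) pairs whose count lies outside the allowed range."""
    bad = []
    for first, last, low, high in groups:
        for user in range(first, last + 1):
            c = counts[user - 1]
            if not low <= c <= high:
                bad.append((user, c))
    return bad


# An empty list means every user satisfies the constraint for their group.
print(violations(counts, groups))
```

In the notebook itself, the same check could be run directly on `df.notna().sum(axis=1)` instead of a hand-copied list.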

## 2. Normalization and Mean Centering

Collaborative filtering usually compares **rating patterns**, not absolute values. For example, a user who rates three movies 5, 4, 5 is a “high rater”, and a user who rates the same movies 3, 2, 3 is a “low rater”, but relative to their own averages both prefer the same movies.

To handle this, I perform **per-user mean centering**:

1. For each user, I compute the **mean of all their existing ratings**.
2. I subtract this mean from each rating of that user:
   \[
   r'_{u,i} = r_{u,i} - \bar{r}_u
   \]
3. Unrated movies remain as **NaN**.

The result is a **centered rating matrix**, where each row has mean ≈ 0 (over the rated items). Later, for cosine similarity, I temporarily replace NaN with 0.
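
To illustrate why centering helps, the sketch below takes two hypothetical users (not from the dataset above) who rate the same three movies, one consistently high and one consistently low. The names `high_rater` and `low_rater` and their values are made up for this example.

```python
import numpy as np

# Two hypothetical users rating the same three movies (illustrative values only).
high_rater = np.array([5.0, 4.0, 5.0])
low_rater = np.array([3.0, 2.0, 3.0])

# Per-user mean centering: subtract each user's own mean rating.
high_centered = high_rater - high_rater.mean()
low_centered = low_rater - low_rater.mean()

print(high_centered)
print(low_centered)

# After centering, the two rating patterns coincide.
print(np.allclose(high_centered, low_centered))
```

The raw ratings differ by two points on every movie, but the centered vectors are the same, which is the sense in which the two users share preferences.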


In [13]:
# 3. Normalization with centering
user_means = df.mean(axis=1, skipna=True)
norm = df.sub(user_means, axis=0)  # centered ratings

print("User means:")
display(user_means)

print("\nCentered rating matrix (norm):")
display(norm)

User means:


User 1     2.900000
User 2     2.700000
User 3     3.100000
User 4     2.555556
User 5     2.777778
User 6     4.000000
User 7     2.600000
User 8     3.200000
User 9     3.000000
User 10    2.333333
User 11    4.000000
User 12    2.000000
User 13    3.500000
User 14    3.000000
User 15    3.000000
dtype: float64


Centered rating matrix (norm):


User,The Shawshank Redemption,The Dark Knight,Inception,The Godfather,Forrest Gump,The Matrix,Interstellar,Parasite,Spirited Away,The Lion King,Pulp Fiction,Fight Club,The Lord of the Rings: The Fellowship of the Ring,The Social Network,La La Land,Mad Max: Fury Road,Avengers: Endgame,Get Out,Titanic,Toy Story
User 1,-0.9,-1.9,,1.1,,-0.9,,,1.1,,,2.1,,,,2.1,-1.9,1.1,-1.9,
User 2,-1.7,-0.7,,,1.3,-1.7,,-0.7,,,-0.7,,2.3,,,,1.3,,1.3,-0.7
User 3,,,,,0.9,-2.1,,,-1.1,-1.1,0.9,,1.9,,,-1.1,-1.1,,1.9,0.9
User 4,,-1.555556,-1.555556,0.444444,2.444444,-0.555556,,,,-0.555556,1.444444,-1.555556,,,,1.444444,,,,
User 5,,-0.777778,,-1.777778,,,-1.777778,,-0.777778,2.222222,,,1.222222,0.222222,0.222222,,,1.222222,,
User 6,0.0,,,,,,,0.0,0.0,0.0,,,,,,,,,,
User 7,,0.4,,,-1.6,,,1.4,,,,1.4,,,-1.6,,,,,
User 8,,,,,,,,,,1.8,,,,-1.2,,-1.2,,0.8,-0.2,
User 9,,,,,,,0.0,,,0.0,,,-1.0,,1.0,0.0,,,,
User 10,,,1.666667,1.666667,-1.333333,-0.333333,,,,,,-0.333333,-1.333333,,,,,,,


## 3. User–User Similarity (Cosine) and Top 3 Pairs

Next, I compute similarity between users based on the **centered ratings**.

- I use **cosine similarity** on the rows of the centered rating matrix.
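- Before computing cosine similarity, I replace NaN values with 0, so movies a user did not rate contribute nothing to the dot product. After centering, 0 also means “rated exactly at the user's mean”, so a user whose ratings are all identical (such as User 6, who rated every movie 4) has an all-zero vector and gets similarity 0 with everyone from `cosine_similarity`.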

Formally, for two users \( u \) and \( v \), with centered rating vectors \( \mathbf{r}'_u \) and \( \mathbf{r}'_v \):

\[
\text{sim}(u, v) = \frac{\mathbf{r}'_u \cdot \mathbf{r}'_v}{\|\mathbf{r}'_u\| \cdot \|\mathbf{r}'_v\|}
\]

From the resulting similarity matrix, I focus on the **first 10 users** and extract the **three most similar pairs** of users according to the cosine similarity value.
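
The formula can be checked by hand on a single pair. The sketch below copies the centered ratings of User 2, User 3 and User 6 from the centered matrix above and applies the same NaN-to-0 convention. The helpers `to_vector` and `cosine_nan_as_zero` are hypothetical names written for this illustration; they are not scikit-learn's implementation, and returning 0 for an all-zero vector is a convention assumed here.

```python
import numpy as np

N_MOVIES = 20


def to_vector(centered):
    """Build a length-20 vector from {column index: centered rating}; unrated stays NaN."""
    vec = np.full(N_MOVIES, np.nan)
    for col, value in centered.items():
        vec[col] = value
    return vec


def cosine_nan_as_zero(a, b):
    """Cosine similarity after replacing NaN with 0 (0.0 if either vector is all zeros)."""
    a = np.nan_to_num(a, nan=0.0)
    b = np.nan_to_num(b, nan=0.0)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Centered ratings copied from the matrix above (keys are 0-based movie columns).
user2 = to_vector({0: -1.7, 1: -0.7, 4: 1.3, 5: -1.7, 7: -0.7, 10: -0.7,
                   12: 2.3, 16: 1.3, 18: 1.3, 19: -0.7})
user3 = to_vector({4: 0.9, 5: -2.1, 8: -1.1, 9: -1.1, 10: 0.9, 12: 1.9,
                   15: -1.1, 16: -1.1, 18: 1.9, 19: 0.9})
user6 = to_vector({0: 0.0, 7: 0.0, 8: 0.0, 9: 0.0})

sim_2_3 = cosine_nan_as_zero(user2, user3)
sim_2_6 = cosine_nan_as_zero(user2, user6)
print(f"sim(User 2, User 3) = {sim_2_3:.3f}")
print(f"sim(User 2, User 6) = {sim_2_6:.3f}")
```

Only the seven movies that both User 2 and User 3 rated contribute to the dot product, while each norm is taken over all of that user's rated movies. This is where the approach differs from a Pearson correlation restricted to co-rated items.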


In [14]:
# 4. User-user similarity (cosine) and top-3 pairs among users 1-10
norm_matrix = norm.fillna(0).values   # fill NaN with 0 for cosine
sim = cosine_similarity(norm_matrix, norm_matrix)

# restrict to first 10 users (indices 0-9)
sim_0_9 = sim[:10, :10]

pairs = []
for i in range(10):
    for j in range(i + 1, 10):
        # store user numbers as 1-based for reporting
        pairs.append(((i + 1, j + 1), sim_0_9[i, j]))

sorted_pairs = sorted(pairs, key=lambda x: x[1], reverse=True)

print("Top 3 most similar user pairs among users 1-10:")
for (u1, u2), s in sorted_pairs[:3]:
    print(f"Users ({u1}, {u2}) with cosine similarity = {s:.3f}")

Top 3 most similar user pairs among users 1-10:
Users (2, 3) with cosine similarity = 0.481
Users (5, 8) with cosine similarity = 0.458
Users (9, 10) with cosine similarity = 0.309
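
The same ranking can also be obtained without an explicit double loop. The sketch below is one possible vectorized variant using `np.triu_indices`; the function name `top_k_pairs` and the 4 × 4 matrix `toy_sim` are made up for illustration and are not the notebook's data.

```python
import numpy as np


def top_k_pairs(sim, k=3):
    """Return the k most similar (user_a, user_b) pairs, 1-based, with their similarity."""
    n = sim.shape[0]
    # Indices of the strict upper triangle: each unordered pair once, no self-pairs.
    rows, cols = np.triu_indices(n, k=1)
    values = sim[rows, cols]
    # Sort descending by similarity and keep the first k.
    order = np.argsort(values)[::-1][:k]
    return [((int(rows[i]) + 1, int(cols[i]) + 1), float(values[i])) for i in order]


# A small symmetric toy similarity matrix (illustrative values only).
toy_sim = np.array([
    [1.0, 0.2, 0.7, -0.1],
    [0.2, 1.0, 0.4, 0.9],
    [0.7, 0.4, 1.0, 0.0],
    [-0.1, 0.9, 0.0, 1.0],
])

top_pairs = top_k_pairs(toy_sim, k=3)
for (u1, u2), s in top_pairs:
    print(f"Users ({u1}, {u2}) with cosine similarity = {s:.3f}")
```

Applied to the notebook's data, the call would be `top_k_pairs(sim[:10, :10])`. The order of pairs with exactly equal similarity may differ from the loop-based version.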


## 4. Recommendations for New Users (Users 11–15)

Finally, I generate recommendations for the **5 new users** (users 11–15). These users have only **2–3 ratings each**, so they are a typical “cold-start” case.

The process for each new user is:

1. Compute their **similarity** to the first 10 users using the same cosine similarity matrix.
2. Select the **3 most similar users** (neighbors). Neighbors whose similarity is not positive are ignored in the prediction.
3. For each movie that the new user **has not rated**, predict a centered rating as a **similarity-weighted average** of the neighbors’ centered ratings:
   \[
   \hat{r}'_{u,i} = \frac{\sum_{v \in N_i(u)} s_{u,v} \cdot r'_{v,i}}{\sum_{v \in N_i(u)} |s_{u,v}|}
   \]
   where:
   - \( N_i(u) \) is the set of selected neighbors of \( u \) with positive similarity that have rated movie \( i \),
   - \( s_{u,v} \) is the similarity between the new user \( u \) and neighbor \( v \),
   - \( r'_{v,i} \) is the centered rating of neighbor \( v \) for movie \( i \).
4. Convert back to the original scale by adding the **mean rating of the new user**:
   \[
   \hat{r}_{u,i} = \bar{r}_u + \hat{r}'_{u,i}
   \]
5. If no neighbor with positive similarity rated a specific movie, or all similarities are 0, I fall back to the **global average rating of that movie**. This fallback is a raw movie mean and is not shifted by the new user's mean. Predictions are not clipped to the 1–5 range.

For each new user I then sort the unrated movies by the predicted rating and show the **top 3 recommended movies** with their predicted scores.
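
As a worked instance of the prediction formula, the sketch below recomputes one score by hand: User 12's predicted rating for "Titanic". It uses the rounded values printed elsewhere in this notebook: User 12's mean of 2.0, similarities 0.509, 0.357 and 0.230 to Users 2, 3 and 5, and centered Titanic ratings of 1.3 and 1.9 for Users 2 and 3 (User 5 did not rate it). The helper `predict_rating` is a hypothetical name for this illustration.

```python
def predict_rating(user_mean, neighbors):
    """
    neighbors: list of (similarity, centered_rating) pairs; centered_rating is
    None when that neighbor did not rate the movie.
    Returns the predicted rating, or None if no usable neighbor exists.
    """
    num = 0.0
    denom = 0.0
    for s, r_centered in neighbors:
        # Only neighbors with positive similarity that rated the movie contribute.
        if s > 0 and r_centered is not None:
            num += s * r_centered
            denom += abs(s)
    if denom == 0:
        return None  # the caller would fall back to the movie's global mean
    return user_mean + num / denom


# User 12 and "Titanic": neighbors are User 2, User 3 and User 5 (no rating).
titanic = predict_rating(2.0, [(0.509, 1.3), (0.357, 1.9), (0.230, None)])
print(round(titanic, 3))
```

Because the inputs are rounded to three decimals, the result should agree with the notebook's reported score only up to rounding.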


In [15]:
# 5. Recommendations for new users (11-15)
global_movie_means = df.mean(axis=0, skipna=True)

def recommend_for_new_user(u_new_index, top_n_neighbors=3, num_recs=3):
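    """
    Recommend movies for one new user.

    u_new_index: 0-based row index of the new user (10-14).
    top_n_neighbors: number of most similar users (among users 1-10) to consider.
    num_recs: number of recommendations to return.
    Returns: (neighbor_indices, neighbor_sims, top_recs), where top_recs is a
    list of (movie_title, predicted_rating) sorted by predicted rating.
    """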
    # Similarity to first 10 users (indices 0-9)
    sims = sim[u_new_index, :10]

    # Take top-n neighbors by similarity
    neighbor_idx = np.argsort(sims)[::-1][:top_n_neighbors]
    neighbor_sims = sims[neighbor_idx]

    user_ratings = df.iloc[u_new_index]
    user_mean = user_means.iloc[u_new_index]

    # Unrated movies for this user
    unrated_movies = user_ratings[user_ratings.isna()].index

    pred_scores = {}

    # Use only neighbors with positive similarity
    pos_mask = neighbor_sims > 0
    effective_neighbors = neighbor_idx[pos_mask]
    effective_sims = neighbor_sims[pos_mask]

    if len(effective_neighbors) == 0:
        # Completely cold-start (no overlap / all sims <= 0):
        # use global means as fallback
        for m in unrated_movies:
            pred_scores[m] = global_movie_means[m]
    else:
        for m in unrated_movies:
            num = 0.0
            denom = 0.0
            for idx, s in zip(effective_neighbors, effective_sims):
                r_norm = norm.iloc[idx][m]
                if not np.isnan(r_norm):
                    num += s * r_norm
                    denom += abs(s)
            if denom == 0:
                # No neighbor with rating – fallback on global mean
                pred = global_movie_means[m]
            else:
                pred_norm = num / denom
                pred = user_mean + pred_norm  # shift back to original scale
            pred_scores[m] = pred

    # sort movies by predicted score
    sorted_movies = sorted(pred_scores.items(), key=lambda x: x[1], reverse=True)
    return neighbor_idx, neighbor_sims, sorted_movies[:num_recs]


# Show recommendations for users 11-15
for u in range(10, 15):
    neighbors, sims_u, recs = recommend_for_new_user(u)
    neighbors_1based = [i + 1 for i in neighbors]

    print(f"=== Recommendations for User {u+1} ===")
    print("Most similar existing users (among users 1-10):")
    for nb, s in zip(neighbors_1based, sims_u):
        print(f"  User {nb} with similarity {s:.3f}")

    print("Top recommended movies:")
    for title, score in recs:
        print(f"  {title}  (predicted rating ≈ {score:.3f})")
    print()

=== Recommendations for User 11 ===
Most similar existing users (among users 1-10):
  User 10 with similarity 0.000
  User 9 with similarity 0.000
  User 8 with similarity 0.000
Top recommended movies:
  Get Out  (predicted rating ≈ 4.000)
  The Lion King  (predicted rating ≈ 3.571)
  The Lord of the Rings: The Fellowship of the Ring  (predicted rating ≈ 3.500)

=== Recommendations for User 12 ===
Most similar existing users (among users 1-10):
  User 2 with similarity 0.509
  User 3 with similarity 0.357
  User 5 with similarity 0.230
Top recommended movies:
  Titanic  (predicted rating ≈ 3.547)
  Get Out  (predicted rating ≈ 3.222)
  Forrest Gump  (predicted rating ≈ 3.135)

=== Recommendations for User 13 ===
Most similar existing users (among users 1-10):
  User 8 with similarity 0.488
  User 7 with similarity 0.373
  User 5 with similarity 0.359
Top recommended movies:
  Parasite  (predicted rating ≈ 4.900)
  Fight Club  (predicted rating ≈ 4.900)
  The Lord of the Rings: The Fell

## 5. Summary and Conclusions

In this project I implemented a small **user-based collaborative filtering** system:

- I created a **synthetic rating matrix** with 20 movies and 15 users, while respecting the constraints on how many movies each group of users rates.
- I performed **mean-centering normalization** per user, so that similarities rely on rating patterns rather than absolute values.
- Using **cosine similarity**, I identified the **three most similar pairs of users** among the first 10 users.
- For the **new users** (11–15), I generated **personalized movie recommendations** based on their 3 most similar neighbors, with a **weighted score** according to similarity and a fallback to global movie means when needed.

This demonstrates the core ideas of collaborative filtering:
- Similarity between users is based on their **rating behavior**.
- We can estimate how a user would rate an unseen movie by looking at **similar users**.
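- Even with a small synthetic dataset, each recommendation can be traced back to specific neighbors and their ratings, which makes the method easy to interpret. Because the ratings are random, the results illustrate the mechanics rather than real taste.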
