# Movie Recommendation

#### Business Case
You just joined a small but fast-growing streaming platform called CineStream (think early-stage competitor to Netflix).
They have:

- ~100k users
- ~9,000 movies
- A ratings dataset (users rate movies from 1–5 stars)

**The CEO wants to improve user engagement and retention.**
**Goal: Build a simple but effective movie recommender system that can:**

- Recommend movies to existing users (personalized)
- Show “popular for you” items to new/cold-start users

In [4]:
import pandas as pd

# Ratings
url_ratings = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
columns = ['userId', 'movieId', 'rating', 'timestamp']
ratings = pd.read_csv(url_ratings, sep='\t', names=columns)

# Movies (title + genres)
url_movies = "https://files.grouplens.org/datasets/movielens/ml-100k/u.item"
movies = pd.read_csv(url_movies, sep='|', encoding='latin-1',
                     names=['movieId', 'title', 'release_date', 'video_release_date',
                            'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
                            'Children', 'Comedy', 'Crime', 'Documentary', 'Drama',
                            'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
                            'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'])  # genres start at col 5

print("Ratings shape:", ratings.shape)
print("Movies shape:", movies.shape)
print("\nSample ratings:\n", ratings.head())
print("\nSample movies:\n", movies.head())

Ratings shape: (100000, 4)
Movies shape: (1682, 24)

Sample ratings:
    userId  movieId  rating  timestamp
0     196      242       3  881250949
1     186      302       3  891717742
2      22      377       1  878887116
3     244       51       2  880606923
4     166      346       1  886397596

Sample movies:
    movieId              title release_date  video_release_date  \
0        1   Toy Story (1995)  01-Jan-1995                 NaN   
1        2   GoldenEye (1995)  01-Jan-1995                 NaN   
2        3  Four Rooms (1995)  01-Jan-1995                 NaN   
3        4  Get Shorty (1995)  01-Jan-1995                 NaN   
4        5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0

# Observation:
The dataset has 100,000 ratings and 1,682 movies.

In [5]:
# Basic stats
print("Number of unique users:", ratings['userId'].nunique())
print("Number of unique movies:", ratings['movieId'].nunique())
print("Total ratings:", len(ratings))
print("Average rating:", ratings['rating'].mean())
print("\nRating distribution:")
print(ratings['rating'].value_counts(normalize=True).sort_index())

# Most active users
print("\nTop 5 most active users:")
print(ratings['userId'].value_counts().head())

# Most rated movies
print("\nTop 10 most rated movies (with titles):")
top_movies = ratings['movieId'].value_counts().head(10)
print(pd.merge(top_movies, movies[['movieId', 'title']], on='movieId'))

Number of unique users: 943
Number of unique movies: 1682
Total ratings: 100000
Average rating: 3.52986

Rating distribution:
rating
1    0.06110
2    0.11370
3    0.27145
4    0.34174
5    0.21201
Name: proportion, dtype: float64

Top 5 most active users:
userId
405    737
655    685
13     636
450    540
276    518
Name: count, dtype: int64

Top 10 most rated movies (with titles):
   movieId  count                          title
0       50    583               Star Wars (1977)
1      258    509                 Contact (1997)
2      100    508                   Fargo (1996)
3      181    507      Return of the Jedi (1983)
4      294    485               Liar Liar (1997)
5      286    481    English Patient, The (1996)
6      288    478                  Scream (1996)
7        1    452               Toy Story (1995)
8      300    431           Air Force One (1997)
9      121    429  Independence Day (ID4) (1996)


# Observation

- Number of unique movies: 1682
- Total ratings: 100000
- Average rating: 3.52986
- Ratings skew positive; 4 is the most common rating (about 34% of all ratings)
- The 5 most active users are 405, 655, 13, 450 and 276, with 737, 685, 636, 540 and 518 ratings respectively
- The ten most rated movies:

##### Top 10 most rated movies (with titles):

|  movieId | count | title                          |
| ------------- | ------------- |--------------------------------|
|   50 | 583 | Star Wars (1977)               |
|  258 | 509 | Contact (1997)                 |
|  100 | 508 | Fargo (1996)                   |
|  181 | 507 | Return of the Jedi (1983)      |
|  294 | 485 | Liar Liar (1997)               |
|  286 | 481 | English Patient, The (1996)    |
|  288 | 478 | Scream (1996)                  |
|    1 | 452 | Toy Story (1995)               |
|  300 | 431 | Air Force One (1997)           |
|  121 | 429 | Independence Day (ID4) (1996)  |







In [6]:
# 1. Most popular (most rated) movies
most_rated = ratings.groupby('movieId').size().sort_values(ascending=False)
top_20_most_rated = most_rated.head(20).index

print("Top 20 most rated movies:")
print(movies[movies['movieId'].isin(top_20_most_rated)][['movieId', 'title']])

# 2. Highest average rated (with minimum 50 ratings to reduce noise)
movie_stats = ratings.groupby('movieId').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'size')
).reset_index()

# Filter for movies with at least 50 ratings
reliable_movies = movie_stats[movie_stats['num_ratings'] >= 50]
top_rated = reliable_movies.sort_values('avg_rating', ascending=False).head(15)

print("\nTop 15 highest rated movies (min 50 ratings):")
print(pd.merge(top_rated, movies[['movieId', 'title']], on='movieId'))

Top 20 most rated movies:
     movieId                             title
0          1                  Toy Story (1995)
6          7             Twelve Monkeys (1995)
49        50                  Star Wars (1977)
55        56               Pulp Fiction (1994)
97        98  Silence of the Lambs, The (1991)
99       100                      Fargo (1996)
116      117                  Rock, The (1996)
120      121     Independence Day (ID4) (1996)
126      127             Godfather, The (1972)
171      172   Empire Strikes Back, The (1980)
173      174    Raiders of the Lost Ark (1981)
180      181         Return of the Jedi (1983)
221      222   Star Trek: First Contact (1996)
236      237              Jerry Maguire (1996)
257      258                    Contact (1997)
285      286       English Patient, The (1996)
287      288                     Scream (1996)
293      294                  Liar Liar (1997)
299      300              Air Force One (1997)
312      313                    Ti

# Observation

Top 20 most rated movies:

|     | movieId | title                            |
|-----|---------|----------------------------------|
| 0   | 1       | Toy Story (1995)                 |
| 6   | 7       | Twelve Monkeys (1995)            |
| 49  | 50      | Star Wars (1977)                 |
| 55  | 56      | Pulp Fiction (1994)              |
| 97  | 98      | Silence of the Lambs, The (1991) |
| 99  | 100     | Fargo (1996)                     |
| 116 | 117     | Rock, The (1996)                 |
| 120 | 121     | Independence Day (ID4) (1996)    |
| 126 | 127     | Godfather, The (1972)            |
| 171 | 172     | Empire Strikes Back, The (1980)  |
| 173 | 174     | Raiders of the Lost Ark (1981)   |
| 180 | 181     | Return of the Jedi (1983)        |
| 221 | 222     | Star Trek: First Contact (1996)  |
| 236 | 237     | Jerry Maguire (1996)             |
| 257 | 258     | Contact (1997)                   |
| 285 | 286     | English Patient, The (1996)      |
| 287 | 288     | Scream (1996)                    |
| 293 | 294     | Liar Liar (1997)                 |
| 299 | 300     | Air Force One (1997)             |
| 312 | 313     | Titanic (1997)                   |

Top 15 highest rated movies (min 50 ratings):

|     | movieId | avg_rating   | num_ratings |
|-----|---------|--------------|-------------|
|  0  |   408   |  4.491071    |      112    |
|  1  |   318   |  4.466443    |      298    |
|  2  |   169   |  4.466102    |      118    |
|  3  |   483   |  4.456790    |      243    |
|  4  |   114   |  4.447761    |       67    |
|  5  |    64   |  4.445230    |      283    |
|  6  |   603   |  4.387560    |      209    |
|  7  |    12   |  4.385768    |      267    |
|  8  |    50   |  4.358491    |      583    |
|  9  |   178   |  4.344000    |      125    |
|  10 |   513   |  4.333333    |       72    |
|  11 |   134   |  4.292929    |      198    |
|  12 |   427   |  4.292237    |      219    |
|  13 |   357   |  4.291667    |      264    |
|  14 |    98   |  4.289744    |      390    |




Next, a popularity-based recommender: rank movies by average rating, keeping only those with at least `min_ratings` ratings.

In [7]:
def recommend_popular(n=10, min_ratings=50):
    # Highest rated with enough data
    reliable = movie_stats[movie_stats['num_ratings'] >= min_ratings]
    top = reliable.sort_values('avg_rating', ascending=False).head(n)
    return pd.merge(top, movies[['movieId', 'title']], on='movieId')[['title', 'avg_rating', 'num_ratings']]

print("\nPopularity-based recommendations (top 10):")
print(recommend_popular(10))


Popularity-based recommendations (top 10):
                                               title  avg_rating  num_ratings
0                              Close Shave, A (1995)    4.491071          112
1                            Schindler's List (1993)    4.466443          298
2                         Wrong Trousers, The (1993)    4.466102          118
3                                  Casablanca (1942)    4.456790          243
4  Wallace & Gromit: The Best of Aardman Animatio...    4.447761           67
5                   Shawshank Redemption, The (1994)    4.445230          283
6                                 Rear Window (1954)    4.387560          209
7                         Usual Suspects, The (1995)    4.385768          267
8                                   Star Wars (1977)    4.358491          583
9                                12 Angry Men (1957)    4.344000          125
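
To see why the `min_ratings` filter matters, here is a minimal sketch on made-up data (the `toy_ratings` values are hypothetical, not taken from MovieLens). A movie with a single 5-star rating would top an unfiltered ranking, while the filter favours the movie whose high average is backed by more votes.

```python
import pandas as pd

# Hypothetical toy data: movie 2 has one perfect rating, movie 1 has four good ones.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 2, 3, 3, 3],
    "rating":  [4, 5, 4, 4, 5, 3, 4, 3],
})

# Same aggregation as movie_stats in the notebook.
stats = toy_ratings.groupby("movieId").agg(
    avg_rating=("rating", "mean"),
    num_ratings=("rating", "size"),
).reset_index()

# Without a threshold, the single 5-star rating wins.
unfiltered_top = stats.sort_values("avg_rating", ascending=False).iloc[0]

# With a threshold of 3 ratings, movie 2 is excluded.
filtered_top = (
    stats[stats["num_ratings"] >= 3]
    .sort_values("avg_rating", ascending=False)
    .iloc[0]
)

print(int(unfiltered_top["movieId"]))  # 2
print(int(filtered_top["movieId"]))    # 1
```

The threshold of 50 used in the notebook plays the same role on the real data; the right value is a judgment call that depends on how many ratings a typical movie has.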


In [11]:
import implicit
import scipy.sparse as sparse
import numpy as np
from scipy.sparse import coo_matrix

# Create user-item matrix (rows = users, cols = movies)
# We'll use rating as confidence (higher rating = stronger preference)
rows = ratings['userId'] - 1  # make 0-based
cols = ratings['movieId'] - 1
data = ratings['rating'].values.astype(float)

# Build sparse matrix
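# Size the matrix by the largest id rather than the number of distinct ids,
# so a gap in the ids cannot produce an out-of-range index
user_item_matrix = coo_matrix((data, (rows, cols)), shape=(ratings['userId'].max(), ratings['movieId'].max()))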

# Convert to CSR format (required by implicit)
user_item_matrix = user_item_matrix.tocsr()

# Create the model - Alternating Least Squares
model = implicit.als.AlternatingLeastSquares(
    factors=50,          # number of latent factors (like embedding size)
    iterations=15,       # how many times to iterate
    regularization=0.01, # prevent overfitting
    use_gpu=False        # set True if you have GPU!
)

# Train the model (implicit expects confidence matrix)
model.fit(user_item_matrix)

  check_blas_config()
100%|██████████| 15/15 [00:00<00:00, 144.63it/s]
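
The matrix construction can be tried on a toy example. The sketch below uses hypothetical ids and ratings, and it sizes the matrix by the largest id, which is an assumption that guards against gaps in the id range. It also computes the density, i.e. the share of user-movie cells that hold a rating. For the real data that share is 100000 / (943 × 1682), roughly 6.3%, which is why a sparse format is used.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy ratings with 1-based ids; user 2 and movie 3 never appear.
user_ids = np.array([1, 1, 3, 4])
movie_ids = np.array([1, 2, 2, 4])
stars = np.array([5.0, 3.0, 4.0, 1.0])

# Shift to 0-based row/column indices, as in the notebook.
rows = user_ids - 1
cols = movie_ids - 1

# Only 3 distinct users exist, but the largest row index is 3, so 4 rows are needed.
shape = (int(user_ids.max()), int(movie_ids.max()))
toy_matrix = coo_matrix((stars, (rows, cols)), shape=shape).tocsr()

# Fraction of cells that actually contain a rating.
density = toy_matrix.nnz / (shape[0] * shape[1])

print(toy_matrix.shape)  # (4, 4)
print(toy_matrix[0, 1])  # 3.0
print(density)           # 0.25
```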


In [12]:
user_id = 196 - 1  # 0-based

# Get top 10 recommendations (movie ids)
recommended_movie_ids, scores = model.recommend(
    userid=user_id,
    user_items=user_item_matrix[user_id],
    N=10,
    filter_already_liked_items=True
)

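# Convert to titles (+1 because movieId is 1-based).
# Note: isin() returns rows in the order of the movies table, so the titles
# printed below are listed by movieId, not by recommendation score.
recommended_movies = movies[movies['movieId'].isin(recommended_movie_ids + 1)]['title'].values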

print("Top 10 recommendations for user 196 (using implicit ALS):")
for i, title in enumerate(recommended_movies, 1):
    print(f"{i}. {title}")

Top 10 recommendations for user 196 (using implicit ALS):
1. Sleepless in Seattle (1993)
2. M*A*S*H (1970)
3. When Harry Met Sally... (1989)
4. Sense and Sensibility (1995)
5. In & Out (1997)
6. Everyone Says I Love You (1996)
7. Mother (1996)
8. Clueless (1995)
9. Dave (1993)
10. How to Make an American Quilt (1995)
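
The list above is numbered 1 to 10, but a lookup with `isin()` returns titles in the order of the movies table, not in the order of the scores. A minimal sketch of an order-preserving lookup, using a hypothetical four-movie table and a made-up ranking:

```python
import numpy as np
import pandas as pd

# Hypothetical catalogue and a ranking returned best-first by some model.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
})
ranked_ids = np.array([3, 1, 4])  # best match first

# isin() filters rows but keeps the catalogue order.
by_catalogue = toy_movies[toy_movies["movieId"].isin(ranked_ids)]["title"].tolist()

# Indexing by movieId and selecting with .loc keeps the ranking order.
by_rank = toy_movies.set_index("movieId").loc[ranked_ids, "title"].tolist()

print(by_catalogue)  # ['A', 'C', 'D']
print(by_rank)       # ['C', 'A', 'D']
```

In the notebook, the equivalent would be `movies.set_index('movieId').loc[recommended_movie_ids + 1, 'title']`, assuming every recommended id exists in `movies`.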


In [13]:
def get_user_recommendations(user_id, n=5):
    user_idx = user_id - 1
    recommended_movie_ids, _ = model.recommend(
        userid=user_idx,
        user_items=user_item_matrix[user_idx],
        N=n,
        filter_already_liked_items=True
    )
    # isin() keeps the movies-table order, so titles come back sorted by movieId rather than by score
    rec_titles = movies[movies['movieId'].isin(recommended_movie_ids + 1)]['title'].values
    return list(rec_titles)

# Compare 3 different users
print("Recommendations for User 196 (our original active user):")
print(get_user_recommendations(196))

print("\nRecommendations for User 1 (very different taste):")
print(get_user_recommendations(1))

print("\nRecommendations for User 405 (another active one):")
print(get_user_recommendations(405))

Recommendations for User 196 (our original active user):
['When Harry Met Sally... (1989)', 'In & Out (1997)', 'Everyone Says I Love You (1996)', 'Mother (1996)', 'Clueless (1995)']

Recommendations for User 1 (very different taste):
["Schindler's List (1993)", 'Batman (1989)', 'E.T. the Extra-Terrestrial (1982)', 'Piano, The (1993)', 'Dave (1993)']

Recommendations for User 405 (another active one):
['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)', 'Vertigo (1958)', 'Rebel Without a Cause (1955)', 'Great Race, The (1965)', 'Air Up There, The (1994)']
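
The business case asks for personalized recommendations for existing users and popular items for new users. One way to combine the two recommenders is a small routing function. This is a sketch only: `recommend_for`, `fake_personalized` and `fake_popular` are hypothetical names, and the stand-in recommenders exist just so the example runs on its own.

```python
def recommend_for(user_id, known_user_ids, personalized_fn, popular_fn, n=5):
    """Route known users to the personalized model and everyone else to the popularity list."""
    if user_id in known_user_ids:
        return personalized_fn(user_id, n)
    return popular_fn(n)


# Stand-in recommenders with hypothetical output, so the sketch is self-contained.
def fake_personalized(user_id, n):
    return [f"personal pick {k} for user {user_id}" for k in range(1, n + 1)]


def fake_popular(n):
    return [f"popular pick {k}" for k in range(1, n + 1)]


known = {1, 196, 405}
existing_recs = recommend_for(196, known, fake_personalized, fake_popular, n=2)
new_user_recs = recommend_for(9999, known, fake_personalized, fake_popular, n=2)

print(existing_recs)  # ['personal pick 1 for user 196', 'personal pick 2 for user 196']
print(new_user_recs)  # ['popular pick 1', 'popular pick 2']
```

In the notebook, `personalized_fn` could be `get_user_recommendations`, `popular_fn` could be `lambda n: recommend_popular(n)['title'].tolist()`, and `known_user_ids` could be `set(ratings['userId'])`.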


This project helped me understand how movie recommendation works on platforms such as Netflix and Amazon Prime.

**ALS (Alternating Least Squares)**

ALS factorizes the user-item matrix into one vector of latent factors per user and one per movie, alternately solving for one set while holding the other fixed.

- Why it works so well: it automatically discovers patterns like:
    * "People who like Star Wars also tend to like Empire Strikes Back"
    * "People who like When Harry Met Sally also tend to like Sleepless in Seattle"

It doesn’t need explicit rules; it learns from the ratings alone (this is collaborative filtering).

Comparing the two approaches:

- Popularity baseline → same recommendations for everyone
- Collaborative filtering (ALS) → personalized recommendations based on hidden patterns in how users rate things
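
To make the "alternating" part concrete, here is a minimal NumPy sketch of ALS on a tiny made-up rating matrix. It is a simplified explicit-feedback variant (plain ridge regression on the observed ratings), not the confidence-weighted formulation that the `implicit` library uses. The function names, the toy matrix and the hyperparameters are all assumptions for illustration.

```python
import numpy as np


def objective(R, mask, U, V, reg):
    """Squared error on observed ratings plus L2 penalty on both factor matrices."""
    err = (R - U @ V.T)[mask]
    return float(np.sum(err ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2)))


def als_explicit(R, mask, factors=2, iterations=20, reg=0.1, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    ridge = reg * np.eye(factors)
    for _ in range(iterations):
        # Fix item factors, solve a small ridge regression per user.
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix user factors, solve a small ridge regression per item.
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V


# Hypothetical ratings: 0 means "not rated". Users 0-1 and users 2-3 have opposite tastes.
R = np.array([
    [5, 5, 1, 0],
    [5, 0, 1, 1],
    [1, 1, 5, 5],
    [0, 1, 5, 5],
], dtype=float)
mask = R > 0

U1, V1 = als_explicit(R, mask, iterations=1)
U20, V20 = als_explicit(R, mask, iterations=20)

print("objective after 1 sweep:  ", objective(R, mask, U1, V1, 0.1))
print("objective after 20 sweeps:", objective(R, mask, U20, V20, 0.1))
print(np.round(U20 @ V20.T, 1))  # reconstructed matrix, including the unrated cells
```

Each sweep solves its sub-problem exactly, so the regularized objective cannot increase from one sweep to the next. The cells that were 0 in `R` are filled in by the product of the learned factors, which is where the recommendations come from.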

In [19]:
user_id = 196 - 1  # 0-based

movie_ids_to_check = [50, 181, 258, 100, 172]  # Star Wars, Return of the Jedi, Contact, Fargo, Empire Strikes Back

print(f"Predicted raw scores for User {user_id+1} (higher = better):")
for movie_id in movie_ids_to_check:
    score = np.dot(model.user_factors[user_id], model.item_factors[movie_id - 1])
    title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"  - {title:<40} : {score:.3f}")

Predicted raw scores for User 196 (higher = better):
  - Star Wars (1977)                         : 0.179
  - Return of the Jedi (1983)                : 0.102
  - Contact (1997)                           : 0.529
  - Fargo (1996)                             : 0.175
  - Empire Strikes Back, The (1980)          : 0.111


In [20]:
user_id = 196 - 1  # 0-based

movie_ids_to_check = [50, 181, 258, 100, 172, 318]  # Star Wars, Return of the Jedi, Contact, Fargo, Empire Strikes Back, Schindler's List

print(f"Predicted raw scores for User {user_id+1} (higher = better):")
for movie_id in movie_ids_to_check:
    score = np.dot(model.user_factors[user_id], model.item_factors[movie_id - 1])
    title = movies[movies['movieId'] == movie_id]['title'].values[0]
    print(f"  - {title:<40} : {score:.3f}")

Predicted raw scores for User 196 (higher = better):
  - Star Wars (1977)                         : 0.179
  - Return of the Jedi (1983)                : 0.102
  - Contact (1997)                           : 0.529
  - Fargo (1996)                             : 0.175
  - Empire Strikes Back, The (1980)          : 0.111
  - Schindler's List (1993)                  : -0.106
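
The raw score is just the dot product of a user's factor vector with a movie's factor vector, so all movies can be scored at once with a matrix-vector product. These scores are relative preferences, not predicted star ratings, which is why a highly rated film such as Schindler's List can still get a negative score for a particular user. A minimal sketch with made-up two-dimensional factors (the values below are hypothetical):

```python
import numpy as np

# Hypothetical factors: one user vector and four item vectors.
user_vec = np.array([1.0, 0.5])
item_factors = np.array([
    [0.2, 0.1],
    [0.9, 0.4],
    [-0.3, 0.2],
    [0.5, 0.5],
])

# One dot product per item, as in the loop above.
loop_scores = np.array([np.dot(user_vec, item_vec) for item_vec in item_factors])

# The same scores for every item in a single matrix-vector product.
all_scores = item_factors @ user_vec

# Indices of the two highest-scoring items, best first.
top_2 = np.argsort(-all_scores)[:2]

print(all_scores)      # scores 0.25, 1.1, -0.2, 0.75 (up to float rounding)
print(top_2.tolist())  # [1, 3]
```

With the trained model, `model.item_factors @ model.user_factors[user_id]` would give the same kind of score vector, assuming the factors are NumPy arrays (the CPU model with `use_gpu=False`).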
