# Problem Statement:

1. Customer behaviour and its prediction lie at the core of every business model. From stock exchanges, e-commerce and automobiles to presidential elections, predictions serve an important purpose. Most of these predictions are based on data about a person’s activity, either online or in person.

2. Recommendation engines are a practical application of predicting user activity. They go one step beyond reporting information: they propose strategies to increase users’ interaction with the platform.

3. Today, OTT platforms and streaming services account for a large share of the retail and entertainment industry. Organizations such as Netflix and Amazon analyse user activity patterns and suggest products that better suit users’ needs and choices.

4. In this project we build one such recommendation engine from the ground up: every user, based on their areas of interest and ratings, is recommended a list of movies best suited to them.

# Objective:

1. Find the most popular and the most liked genres.

2. Create a model that finds the best-suited movie for a user in every genre.

3. Find which genres have received the best and worst ratings, based on user ratings.

# Dataset Information:

1. ID – Contains the separate keys for users and movies (`userId`, `movieId`).
2. Rating – Contains the user ratings for the movies.
3. Genre – Gives the categories of the movie.
4. Movie Name – Name of the movie corresponding to the movie ID.

# 1. Load Important Libraries

In [31]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# 2. Set Working Directory

In [32]:
from google.colab import files
uploaded = files.upload()

Saving movies.csv to movies (2).csv
Saving ratings.csv to ratings (2).csv


# 3. Load Data

In [33]:
# Load movies & ratings table
movies = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

In [34]:
print("movies shape: ", movies.shape)
print("ratings shape: ", ratings_df.shape)

movies shape:  (27278, 3)
ratings shape:  (1048575, 4)


In [35]:
print("Movies table: \n")
movies.head()

Movies table: 



,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [36]:
print("ratings table: \n")
ratings_df.head()

ratings table: 



,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


# 4. Merged Data

In [37]:
df = ratings_df.merge(movies, on="movieId", how="left")
df.head()

,userId,movieId,rating,timestamp,title,genres
0,1,2,3.5,1112486027,Jumanji (1995),Adventure|Children|Fantasy
1,1,29,3.5,1112484676,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
2,1,32,3.5,1112484819,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
3,1,47,3.5,1112484727,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,3.5,1112484580,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
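
A left merge keeps every rating even when its `movieId` has no match in `movies`; in that case `title` and `genres` come back as NaN. A minimal sketch of how this could be checked, using made-up toy frames (`toy_ratings`, `toy_movies` and their values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (illustrative values only)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [2, 29, 999],   # 999 has no entry in toy_movies
    "rating": [3.5, 3.5, 4.0],
})
toy_movies = pd.DataFrame({
    "movieId": [2, 29],
    "title": ["Jumanji (1995)", "City of Lost Children, The (1995)"],
    "genres": ["Adventure|Children|Fantasy", "Adventure|Drama|Fantasy"],
})

# validate= raises if movieId is duplicated in toy_movies;
# indicator= adds a _merge column that flags ratings without a match
merged = toy_ratings.merge(
    toy_movies, on="movieId", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows,", len(unmatched), "without a matching movie")
```

The same two arguments could be passed to the real merge; the `_merge` column can be dropped once the check is done.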


# 5. Data Cleaning

In [38]:
# Convert the Unix timestamp (seconds since epoch) to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# Extract the release year: the first four digits that directly follow an opening parenthesis
df["year"] = df["title"].str.extract(r'\((\d{4})')

# Remove every parenthesised segment from the title (the year and any alternate titles)
df["title"] = df["title"].str.replace(r'\s*\([^)]*\)', '', regex=True).str.strip()


df.head()

,userId,movieId,rating,timestamp,title,genres,year
0,1,2,3.5,2005-04-02 23:53:47,Jumanji,Adventure|Children|Fantasy,1995
1,1,29,3.5,2005-04-02 23:31:16,"City of Lost Children, The",Adventure|Drama|Fantasy|Mystery|Sci-Fi,1995
2,1,32,3.5,2005-04-02 23:33:39,Twelve Monkeys,Mystery|Sci-Fi|Thriller,1995
3,1,47,3.5,2005-04-02 23:32:07,Seven,Mystery|Thriller,1995
4,1,50,3.5,2005-04-02 23:29:40,"Usual Suspects, The",Crime|Mystery|Thriller,1995
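
To see what the two regular expressions in the cleaning cell do, here is a small sketch that applies them to a handful of sample titles. The last title is a hypothetical edge case added to show what happens when no year is present.

```python
import pandas as pd

# Sample titles in the same "Name (Year)" style as movies.csv
titles = pd.Series([
    "Jumanji (1995)",
    "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)",
    "Usual Suspects, The (1995)",
    "Title Without Year",          # hypothetical edge case
])

# First "(" directly followed by four digits; titles without one yield NaN
years = titles.str.extract(r'\((\d{4})')[0]

# Drops every parenthesised segment, not only the year
clean = titles.str.replace(r'\s*\([^)]*\)', '', regex=True).str.strip()

print(pd.DataFrame({"title": clean, "year": years}))
```

Because the second pattern removes all parenthesised text, alternate titles such as "(a.k.a. 12 Monkeys)" disappear along with the year.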


# 6. Data Exploration

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 7 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   userId     1048575 non-null  int64         
 1   movieId    1048575 non-null  int64         
 2   rating     1048575 non-null  float64       
 3   timestamp  1048575 non-null  datetime64[ns]
 4   title      1048575 non-null  object        
 5   genres     1048575 non-null  object        
 6   year       1048573 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 56.0+ MB


# 7. Missing Value Analysis

In [40]:
missing_val = df.isna().sum()
missing_val_Percent = (missing_val/len(df))*100
missing_val_analysis = {'missing_val': missing_val , 'missing_val_percentage': missing_val_Percent}
missing_val_analysis = pd.DataFrame(missing_val_analysis)
missing_val_analysis

,missing_val,missing_val_percentage
userId,0,0.0
movieId,0,0.0
rating,0,0.0
timestamp,0,0.0
title,0,0.0
genres,0,0.0
year,2,0.000191
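
Only `year` has gaps (2 rows), and it is stored as text. One possible way to inspect those rows and convert the column to a numeric type that tolerates missing values is sketched below on a toy frame; the frame and its last row are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking df after cleaning: year is a string column with gaps
toy = pd.DataFrame({
    "title": ["Jumanji", "Seven", "Untitled Entry"],   # last row is hypothetical
    "year": ["1995", "1995", None],
})

# Inspect the rows whose year could not be extracted
print(toy[toy["year"].isna()])

# Convert to a nullable integer dtype so missing years stay missing
toy["year"] = pd.to_numeric(toy["year"], errors="coerce").astype("Int64")
print(toy.dtypes)
```

With so few affected rows, dropping them or leaving them as missing are both reasonable; the choice depends on whether `year` is used later.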


# Problem Statements

### 1. Most Popular and Liked Genre

In [41]:
df_genres = df.assign(genres=df['genres'].str.split('|')).explode('genres')
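
The line above turns each pipe-delimited `genres` string into one row per genre. A minimal sketch of the same step on a two-row toy frame (values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Jumanji", "Seven"],
    "rating": [3.5, 4.0],
    "genres": ["Adventure|Children|Fantasy", "Mystery|Thriller"],
})

# Split the pipe-delimited string into a list, then emit one row per list item
toy_genres = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")
print(toy_genres)
```

Each rating is repeated once per genre of its movie, so per-genre counts add up to more than the total number of ratings.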

#### Popular genre → genres with the most ratings.

In [42]:
# Popular genres (by count of ratings)
popular_genres = df_genres.groupby('genres')['rating'].count().sort_values(ascending=False).reset_index()
popular_genres

,genres,rating
0,Drama,461704
1,Comedy,395994
2,Action,293934
3,Thriller,278937
4,Adventure,230358
5,Romance,201209
6,Crime,171866
7,Sci-Fi,166024
8,Fantasy,110815
9,Children,87101


Drama, Comedy and Action are the three most popular genres.

#### Liked genre → genres with highest average rating.

In [43]:
# Liked genres (by average rating)
liked_genres = df_genres.groupby('genres')['rating'].mean().sort_values(ascending=False).reset_index()
liked_genres

,genres,rating
0,Film-Noir,3.956143
1,War,3.821342
2,Documentary,3.758738
3,Crime,3.683701
4,Drama,3.678378
5,Mystery,3.665257
6,IMAX,3.656697
7,Animation,3.608344
8,Western,3.573439
9,Musical,3.550363


Film-Noir, War and Documentary are the three most liked genres.

### 2. Create a model that finds the best-suited movie for a user in every genre

In [44]:
# Group by user, genre, and title, compute the mean rating, then sort each user's rows by rating (highest first)
best_movies_per_user_genre = df_genres.groupby(['userId', 'genres', 'title'])['rating']\
.mean().reset_index().sort_values(['userId', 'rating'], ascending=[True, False])

In [45]:
def get_top_movies_all_genres(df):
    """Return {userId: {genre: {'title', 'rating'}}} with each user's top-rated movie per genre."""

    result = {}

    # Group by userId and genre
    grouped = df.groupby(['userId', 'genres'])

    for (user, genre), group in grouped:
        top_movie = group.sort_values('rating', ascending=False).iloc[0]

        if user not in result:
            result[user] = {}

        result[user][genre] = {
            'title': top_movie['title'],
            'rating': top_movie['rating']
        }

    return result
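
The loop above sorts every (user, genre) group separately. A possible vectorized alternative, sketched here on a toy frame and not taken from the original notebook, uses `idxmax` to pick the best row per group in one pass. It assumes the frame has a unique index, and ties may be resolved differently than in the loop version.

```python
import pandas as pd

# Toy per-user, per-genre, per-title mean ratings (illustrative values)
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "genres": ["Action", "Action", "Drama", "Action"],
    "title":  ["A", "B", "C", "A"],
    "rating": [4.0, 5.0, 3.5, 2.0],
})

# idxmax returns the row label of the highest rating in each (user, genre) group
best_idx = toy.groupby(["userId", "genres"])["rating"].idxmax()
top = toy.loc[best_idx, ["userId", "genres", "title", "rating"]]

# Same nested shape as the loop-based result: {user: {genre: {...}}}
result = {}
for row in top.itertuples(index=False):
    result.setdefault(row.userId, {})[row.genres] = {
        "title": row.title, "rating": row.rating,
    }
print(result)
```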

In [46]:
user_top_movies = get_top_movies_all_genres(best_movies_per_user_genre)

# To get top movies for userId = 1
user_top_movies.get(1, {})

{'Action': {'title': 'Lord of the Rings: The Return of the King, The',
  'rating': 5.0},
 'Adventure': {'title': 'Lord of the Rings: The Fellowship of the Ring, The',
  'rating': 5.0},
 'Animation': {'title': 'Incredibles, The', 'rating': 4.0},
 'Children': {'title': 'E.T. the Extra-Terrestrial', 'rating': 4.0},
 'Comedy': {'title': 'Adventures of Baron Munchausen, The', 'rating': 4.0},
 'Crime': {'title': 'Freaks', 'rating': 5.0},
 'Drama': {'title': 'Freaks', 'rating': 5.0},
 'Fantasy': {'title': 'Lord of the Rings: The Fellowship of the Ring, The',
  'rating': 5.0},
 'Horror': {'title': 'Freaks', 'rating': 5.0},
 'IMAX': {'title': 'Spider-Man 2', 'rating': 4.5},
 'Musical': {'title': 'Labyrinth', 'rating': 4.0},
 'Mystery': {'title': 'Brotherhood of the Wolf', 'rating': 4.0},
 'Romance': {'title': 'Clash of the Titans', 'rating': 4.0},
 'Sci-Fi': {'title': 'Spider-Man 2', 'rating': 4.5},
 'Thriller': {'title': 'American Werewolf in London, An', 'rating': 4.0},
 'War': {'title': 'Dir

### 3. Find which genres have received the best and worst ratings, based on user ratings

In [47]:
# Step 1: Group by genre and calculate average rating
genre_rating_summary = (
    df_genres
    .groupby('genres')['rating']
    .agg(['mean', 'count'])  # mean rating and number of ratings
    .reset_index()
    .rename(columns={'mean': 'avg_rating', 'count': 'rating_count'})
    .sort_values('avg_rating', ascending=False)
)

# Step 2: Extract best and worst rated genres
best_genre = genre_rating_summary.iloc[0]
worst_genre = genre_rating_summary.iloc[-1]

# Step 3: Display results
print("🎬 Best Rated Genre:")
print(f"{best_genre['genres']} → Avg Rating: {best_genre['avg_rating']:.2f} ({int(best_genre['rating_count'])} ratings)")

print("\n💔 Worst Rated Genre:")
print(f"{worst_genre['genres']} → Avg Rating: {worst_genre['avg_rating']:.2f} ({int(worst_genre['rating_count'])} ratings)")

🎬 Best Rated Genre:
Film-Noir → Avg Rating: 3.96 (11241 ratings)

💔 Worst Rated Genre:
(no genres listed) → Avg Rating: 3.07 (7 ratings)
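
The worst-rated entry above is the placeholder label `(no genres listed)`, backed by only 7 ratings, so it says little about any real genre. One possible refinement, not part of the original analysis, is to drop the placeholder and require a minimum number of ratings. The sketch below uses a few rows copied from the outputs above; `MIN_RATINGS` and its value are assumptions.

```python
import pandas as pd

# A few rows in the same shape as genre_rating_summary (values copied from the outputs above)
summary = pd.DataFrame({
    "genres": ["Film-Noir", "Crime", "Drama", "(no genres listed)"],
    "avg_rating": [3.956143, 3.683701, 3.678378, 3.07],
    "rating_count": [11241, 171866, 461704, 7],
})

MIN_RATINGS = 1000  # assumed threshold; tune to the dataset

# Drop the placeholder label and any genre backed by too few ratings
filtered = summary[
    (summary["genres"] != "(no genres listed)")
    & (summary["rating_count"] >= MIN_RATINGS)
].sort_values("avg_rating", ascending=False)

print("Best:", filtered.iloc[0]["genres"])
print("Worst of this subset:", filtered.iloc[-1]["genres"])
```

Since only four rows are included here, the "worst" line reflects this subset only; running the same filter on the full `genre_rating_summary` would give the dataset's answer.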


### 4. SVD-Based Recommendation Engine

In [48]:
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

In [49]:
# Prepare data for surprise
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2)

# Train SVD model
model = SVD()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x780b4528a9c0>
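
For orientation, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below evaluates that formula with made-up numbers; it does not read anything from the fitted `model`, and every value in it is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 4                         # Surprise's default is 100; kept small for readability

mu = 3.5                              # global mean rating (made-up value)
b_u, b_i = 0.2, -0.1                  # user and item biases (made-up values)
p_u = rng.normal(0, 0.1, n_factors)   # user latent factors
q_i = rng.normal(0, 0.1, n_factors)   # item latent factors

# Biased matrix-factorisation estimate, clipped to the rating scale
raw = mu + b_u + b_i + q_i @ p_u
est = float(np.clip(raw, 0.5, 5.0))
print(round(est, 3))
```

During training, the biases and factor vectors are the quantities that get learned from the ratings in `trainset`.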

In [50]:
# Predict on test set
predictions = model.test(testset)

# Evaluate
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

RMSE: 0.8347
MAE:  0.6388
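
As a reminder of what the two metrics measure, here is a small sketch that computes them by hand on made-up (true, estimated) rating pairs. The arrays and the names `toy_rmse` and `toy_mae` are illustrative.

```python
import numpy as np

# Made-up (true rating, estimated rating) pairs
true_r = np.array([4.0, 3.5, 5.0, 2.0])
est_r = np.array([3.5, 3.5, 4.0, 3.0])

errors = est_r - true_r
toy_rmse = np.sqrt(np.mean(errors ** 2))   # penalises large misses more heavily
toy_mae = np.mean(np.abs(errors))          # average absolute miss
print(f"RMSE: {toy_rmse:.4f}  MAE: {toy_mae:.4f}")
```

Both metrics are in rating units, so the MAE of 0.6388 reported above means the model's estimates miss by roughly two thirds of a star on average on the held-out ratings.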


In [51]:
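def recommend_svd(user_id, df, model, top_n=5):
    """Return the top_n movies user_id has not rated, ranked by predicted rating."""
    # Movies the user hasn't rated (a set makes each membership test O(1))
    rated_movies = set(df.loc[df['userId'] == user_id, 'movieId'])
    all_movies = df['movieId'].unique()
    unrated = [m for m in all_movies if m not in rated_movies]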

    # Predict ratings for unrated movies
    predictions = [model.predict(user_id, movie_id) for movie_id in unrated]
    top_preds = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n]

    # Map movieId to title and genres
    movie_map = df.drop_duplicates('movieId')[['movieId', 'title', 'genres']]
    recs = pd.DataFrame([{'movieId': p.iid, 'predicted_rating': p.est} for p in top_preds])
    recs = recs.merge(movie_map, on='movieId', how='left')

    return recs[['title', 'genres', 'predicted_rating']]


In [52]:
# Sample Testing

recommend_svd(user_id=1, df=df, model=model, top_n=5)

,title,genres,predicted_rating
0,"Andalusian Dog, An",Fantasy,4.528056
1,Cosmos,Documentary,4.493999
2,Band of Brothers,Action|Drama|War,4.459006
3,"Passion of Joan of Arc, The",Drama,4.451181
4,Black Mirror,Drama|Sci-Fi,4.451174
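
Predicted ratings for movies with very few ratings rest on little evidence. One possible refinement, not part of the original notebook, is to restrict the candidate list to movies with a minimum number of ratings before predicting. The helper `candidate_movies` and the `min_ratings` parameter below are hypothetical names, shown on toy data.

```python
import pandas as pd

# Toy ratings log (illustrative values)
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 4.5, 5.0, 3.0, 3.5, 5.0],
})

def candidate_movies(ratings, user_id, min_ratings=2):
    """Movie ids the user has not rated that have at least min_ratings ratings."""
    counts = ratings["movieId"].value_counts()
    popular = set(counts[counts >= min_ratings].index)
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    return sorted(popular - seen)

print(candidate_movies(toy, user_id=3))
```

The returned ids could then replace the `unrated` list inside `recommend_svd`; a suitable threshold depends on how sparse the ratings are.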


# Thank You