# MOVIE RECOMMENDATION SYSTEM

## 1. BUSINESS UNDERSTANDING 



### Objective

Our objective is to use the MovieLens dataset to develop a robust movie recommendation system that enhances user engagement and satisfaction on our online movie streaming platform. By recommending movies that align with users' preferences, we aim to increase user retention, drive user-generated content, and boost overall revenue.

### Data Description 

The MovieLens dataset, curated by the GroupLens research lab at the University of Minnesota, is a well-established and widely used resource in the field of recommendation systems. It contains a wealth of information, including user ratings, movie metadata, and user profiles, collected over a significant period of time.

### Problem Definition 

Our primary business problem is content discovery. With an ever-expanding catalog of movies, users often struggle to decide what to watch. We address this by providing movie recommendations tailored to each user's preferences, thereby simplifying the selection process and improving user satisfaction.

### Key Stakeholders

#### Users: 
Our end-users are at the core of our business. We aim to provide them with an enjoyable and personalized movie-watching experience.
#### Platform Owners: 
The success of our recommendation system directly impacts platform owners by increasing user engagement and revenue.
#### Content Providers:
Enhanced user engagement can attract content providers to collaborate with the platform, enriching its movie catalog.
#### Data Scientists and Engineers: 
The data science and engineering teams play a crucial role in developing, deploying, and maintaining the recommendation system.

### Solution Approach

Our approach is centered around collaborative filtering, a proven recommendation technique. We will analyze user behavior and preferences within the dataset to build models that identify similarities between users and movies. This will enable us to provide personalized movie recommendations.
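To make the idea of "similarity between users" concrete, here is a minimal sketch of user-to-user cosine similarity. The toy matrix and the function name `cosine_similarity_matrix` are illustrative assumptions, not part of the project code, and treating unrated movies as 0 is a simplification.

```python
import numpy as np

# Hypothetical toy ratings matrix: rows are users, columns are movies.
# A value of 0 stands for "not rated" (a simplifying assumption).
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity_matrix(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    normalized = matrix / norms
    return normalized @ normalized.T

user_similarity = cosine_similarity_matrix(toy_matrix)
print(np.round(user_similarity, 3))
```

In this toy data, users 0 and 1 rate the same movies similarly, so their similarity is much higher than that between users 0 and 2. A neighborhood-based recommender would weight each neighbor's ratings by such scores.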

### Evaluation Metrics

To assess the effectiveness of our recommendation system, we will employ metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Precision, Recall, F1-score, Coverage, and Diversity. These metrics will help us quantify the system's performance in terms of accuracy and relevance.
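As a rough illustration of how some of these metrics are computed, the sketch below implements RMSE, MAE, precision and recall at k, and catalog coverage in plain Python. The function names and sample values are hypothetical; in practice a library such as Surprise supplies the rating-error metrics.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equally long rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items."""
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    unique_items = set()
    for items in recommendation_lists:
        unique_items.update(items)
    return len(unique_items) / catalog_size

# Hypothetical sample values
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted), mae(actual, predicted))
print(precision_recall_at_k([10, 20, 30, 40], {20, 40, 50}, k=3))
print(catalog_coverage([[1, 2], [2, 3]], catalog_size=10))
```

RMSE and MAE measure rating-prediction error, precision and recall measure the relevance of a ranked list, and coverage measures how much of the catalog the system actually surfaces.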

### Research Questions

1. What movies might I enjoy watching?
    Users can receive personalized movie recommendations based on their past viewing and rating history.

2. What are the most popular or highly-rated movies?
    The system can provide lists of top-rated or trending movies, helping users discover popular titles.

3. Are there movies similar to the ones I've enjoyed in the past?
    Users can receive recommendations for movies similar to those they've rated highly, expanding their viewing options.

4. How can I discover new movies from genres I like?
    The system can suggest movies from specific genres that align with a user's preferences.

5. What movies have received critical acclaim or awards?
    Users can access recommendations for award-winning or critically acclaimed films.

6. What are the top recommendations for a specific user, given their unique tastes?
    The system tailors recommendations for individual users based on their historical ratings and preferences.

7. How can we improve user engagement and retention on our platform?
    For businesses, the recommendation system can increase user engagement by providing relevant content, reducing churn, and increasing user satisfaction.

8. What is the diversity and coverage of our recommendations?
    Businesses can assess the diversity of recommendations to ensure users are exposed to a wide range of movie genres and styles. Additionally, they can measure how many unique movies in their catalog are being recommended.

9. How accurate are our recommendations?
    Businesses can evaluate the effectiveness of the recommendation system using metrics such as RMSE, MAE, or precision-recall, determining how closely the system's predictions align with user preferences.

10. How can we increase revenue through movie recommendations?
    Businesses can leverage the recommendation system to drive movie rentals, subscriptions, or sales, thereby increasing revenue and ROI.

11. How can we personalize the user experience and increase user-generated content?
    By offering tailored recommendations, businesses can encourage users to rate and review movies, contributing to a richer database of user-generated content.

### Success Criteria

The success of our recommendation system will be measured by improvements in key performance indicators (KPIs) including:

1. User Engagement: Increased user engagement through higher interaction with recommended movies.
2. User Retention: A decrease in user churn rates, indicating improved user satisfaction.
3. Revenue: A significant boost in revenue through increased user subscriptions and movie rentals.
4. Content Utilization: A broader range of movies being watched, leading to better utilization of the movie catalog.

## 2. DATA UNDERSTANDING

- The dataset is named "ml-latest-small" and is from MovieLens, a movie recommendation service.
- It includes 100,836 ratings and 3,683 tag applications across 9,742 movies.
- The data was generated by 610 users between March 29, 1996, and September 24, 2018.
- The dataset was last generated on September 26, 2018.
- Users were selected at random, and their demographic information is not included.

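Movies file (movies.csv):
1. movieId - unique identifier of the movie
2. title - the movie title, with the release year in parentheses
3. genres - pipe-separated list of the movie's genres

Ratings file (ratings.csv):
1. userId - unique identifier of the user
2. movieId - identifier of the rated movie (matches movieId in the movies file)
3. rating - the user's rating on a 0.5 to 5.0 star scale
4. timestamp - when the rating was made, in seconds since the Unix epoch (UTC)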
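The timestamp column is easier to interpret once converted to dates. A minimal sketch, using two rows copied from the ratings preview in this notebook; the `sample` DataFrame and the `rated_at` column name are illustrative assumptions:

```python
import pandas as pd

# Two rows copied from the ratings preview; `rated_at` is a hypothetical column name.
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# MovieLens timestamps are seconds since the Unix epoch, so unit="s" applies.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
print(sample[["movieId", "rating", "rated_at"]])
```

Both sample ratings date from July 30, 2000, which falls inside the collection period stated above.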

## 3. IMPORTING LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

# Surprise provides the data loaders, the SVD model, and evaluation helpers.
# Its train_test_split is the one used below, so the scikit-learn function of
# the same name is not imported (it would be shadowed anyway).
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split


## 4. READING DATA 

Start by loading the ratings and movies files.

In [2]:
# Load the ratings data
ratings = pd.read_csv("C:\\Users\\Administrator\\Desktop\\Moringa\\Phase 4\\Phase 4 Project Recommendation System\\ml-latest-small\\ml-latest-small\\ratings.csv")
# Load the movies data 
movies = pd.read_csv("C:\\Users\\Administrator\\Desktop\\Moringa\\Phase 4\\Phase 4 Project Recommendation System\\ml-latest-small\\ml-latest-small\\movies.csv")

In [3]:
ratings.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [4]:
movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


Check both datasets for missing values

In [5]:
print("Ratings Data - Missing Values:")
print(ratings.isnull().sum())

print("\nMovies Data - Missing Values:")
print(movies.isnull().sum())

Ratings Data - Missing Values:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Movies Data - Missing Values:
movieId    0
title      0
genres     0
dtype: int64


The output above shows that neither the ratings nor the movies data contains missing values.

Drop duplicate rows and clip ratings to the valid 0.5 to 5.0 range

In [6]:
# Remove exact duplicate rows
ratings.drop_duplicates(inplace=True)
# Outlier handling: clip any rating that falls outside the valid range
min_rating, max_rating = 0.5, 5.0
ratings['rating'] = ratings['rating'].clip(min_rating, max_rating)

In [7]:
# Check data consistency: every movieId that appears in ratings must exist in movies
# (the reverse need not hold, since a movie in the catalog may have no ratings yet)
if ratings['movieId'].isin(movies['movieId']).all():
    print("Consistent movie IDs between movies and ratings datasets. ")
else:
    print("Inconsistent movie IDs between movies and ratings datasets.")

Consistent movie IDs between movies and ratings datasets. 
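A consistency check has two directions: ratings that point to a movie missing from the catalog (a real integrity problem) and catalog movies that nobody has rated (usually harmless). The sketch below separates the two with set differences on hypothetical toy frames; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: movie 3 is in the catalog but has no ratings.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 2],
    "rating": [4.0, 3.5, 5.0],
})

rated_ids = set(toy_ratings["movieId"])
catalog_ids = set(toy_movies["movieId"])

orphan_ratings = rated_ids - catalog_ids   # rated but missing from the catalog
unrated_movies = catalog_ids - rated_ids   # in the catalog but never rated

print("Ratings without a catalog entry:", sorted(orphan_ratings))
print("Catalog movies without ratings:", sorted(unrated_movies))
```

Only a non-empty `orphan_ratings` set would indicate inconsistent data; unrated movies simply cannot be recommended by a purely collaborative model.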


### Feature Engineering

In [8]:
# Calculate the average rating for each movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.rename(columns={'rating': 'avg_rating'}, inplace=True)

In [9]:
# Genres are stored in the "genres" column as a pipe-separated string
unique_genres = set('|'.join(movies['genres']).split('|'))
# Add one binary column per genre (exact match against the split genre list)
for genre in unique_genres:
    movies[genre] = movies['genres'].apply(lambda x: 1 if genre in x.split('|') else 0)

In [10]:
# Display the first few rows of the resulting DataFrame
print("Average Ratings for Movies:")
print(average_ratings.head())

print("\nMovies DataFrame with Genre-Based Features:")
print(movies.head())

Average Ratings for Movies:
   movieId  avg_rating
0        1    3.920930
1        2    3.431818
2        3    3.259615
3        4    2.357143
4        5    3.071429

Movies DataFrame with Genre-Based Features:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Crime  War  Documentary  \
0  Adventure|Animation|Children|Comedy|Fantasy      0    0            0   
1                   Adventure|Children|Fantasy      0    0            0   
2                               Comedy|Romance      0    0            0   
3                         Comedy|Drama|Romance      0    0            0   
4                                       Comedy      0    0            0   

   Children  IMAX  Horror  Film-Noir 

Create a feature that counts the number of ratings each movie has received. Movies with a higher number of ratings may be more popular or well-known.

In [11]:
# Calculate the number of ratings for each movie
movie_rating_counts = ratings['movieId'].value_counts().reset_index()
movie_rating_counts.columns = ['movieId', 'num_ratings']

# Display the first few rows of the resulting DataFrame
print("Number of Ratings for Each Movie:")
print(movie_rating_counts.head())

Number of Ratings for Each Movie:
   movieId  num_ratings
0      356          329
1      318          317
2      296          307
3      593          279
4     2571          278
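Rating counts become more useful when combined with the average rating, because a movie with a single 5.0 rating should not outrank a widely rated favorite. A minimal sketch on hypothetical toy data follows; the `min_ratings` threshold is an assumed value to tune, not one taken from this project.

```python
import pandas as pd

# Hypothetical toy ratings: movie 2 has a perfect score but only one rating.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating": [4.0, 5.0, 4.5, 5.0, 3.0, 2.0],
})

# Compute the average rating and the rating count per movie in one pass.
movie_stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(avg_rating="mean", num_ratings="count")
    .reset_index()
)

min_ratings = 2  # assumed threshold; tune it for the real dataset
top_movies = (
    movie_stats[movie_stats["num_ratings"] >= min_ratings]
    .sort_values("avg_rating", ascending=False)
    .reset_index(drop=True)
)
print(top_movies)
```

Filtering by a minimum count before sorting is one simple way to produce a "top-rated" list that is not dominated by sparsely rated titles.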


Genre-Based Features:

Create a binary column for each genre (e.g., Action, Comedy, Romance) indicating whether a movie belongs to that genre. These binary indicators can be used in content-based filtering.
Then convert the indicators to proportions by dividing each row by the movie's genre count, so that a movie tagged with five genres gets 0.2 in each of its genre columns.

In [12]:
# Extract unique genres
unique_genres = set('|'.join(movies['genres']).split('|'))

# Create binary genre-based columns (exact match against the split genre list)
for genre in unique_genres:
    movies[genre] = movies['genres'].apply(lambda x: 1 if genre in x.split('|') else 0)

# Convert each row of indicators to proportions: divide by the movie's genre count
genre_columns = list(unique_genres)
movies[genre_columns] = movies[genre_columns].div(movies[genre_columns].sum(axis=1), axis=0)

# Display the first few rows of the resulting DataFrame
print("Movies DataFrame with Genre Proportion Columns:")
print(movies.head())

Movies DataFrame with Genre Proportion Columns:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Crime  War  Documentary  \
0  Adventure|Animation|Children|Comedy|Fantasy    0.0  0.0          0.0   
1                   Adventure|Children|Fantasy    0.0  0.0          0.0   
2                               Comedy|Romance    0.0  0.0          0.0   
3                         Comedy|Drama|Romance    0.0  0.0          0.0   
4                                       Comedy    0.0  0.0          0.0   

   Children  IMAX  Horror  Film-Noir  ...  Action  Sci-Fi   Romance     Drama  \
0  0.200000   0.0     0.0        0.0  ...     0.0     0.0  0.000000  0.000000   
1  0.333333   0.0     0.0        0.0
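An alternative way to build the same features is the pandas `str.get_dummies` method, which splits on the separator and matches whole genre names. This is a sketch on three titles from the preview above, not the notebook's own implementation; `genre_flags` and `genre_share` are illustrative names.

```python
import pandas as pd

# Three rows taken from the movies preview.
toy_movies = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Grumpier Old Men (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Comedy",
    ],
})

# One 0/1 column per genre, split on the pipe separator.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Divide each row by its genre count so that every row sums to 1.
genre_share = genre_flags.div(genre_flags.sum(axis=1), axis=0)
print(genre_share.round(3))
```

Toy Story has five genres, so each of its genre columns holds 0.2, which matches the Children value shown in the output above.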

Release Year:

Extract the release year from movie titles and create a feature for the movie's release year. This can be used to recommend recent movies or movies from a specific era.

In [13]:
# Assuming that the release year is enclosed in parentheses at the end of the title
# (the pattern is anchored to the end and tolerates trailing whitespace)
movies['release_year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$')

# Display the first few rows of the resulting DataFrame
print("Movies DataFrame with Release Year:")
print(movies.head())

Movies DataFrame with Release Year:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Crime  War  Documentary  \
0  Adventure|Animation|Children|Comedy|Fantasy    0.0  0.0          0.0   
1                   Adventure|Children|Fantasy    0.0  0.0          0.0   
2                               Comedy|Romance    0.0  0.0          0.0   
3                         Comedy|Drama|Romance    0.0  0.0          0.0   
4                                       Comedy    0.0  0.0          0.0   

   Children  IMAX  Horror  Film-Noir  ...  Sci-Fi   Romance     Drama  \
0  0.200000   0.0     0.0        0.0  ...     0.0  0.000000  0.000000   
1  0.333333   0.0     0.0        0.0  ...     0.0  0.000000  0.000
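The extracted year is a string, and titles without a year yield a missing value. The sketch below shows one way to get a numeric, nullable year column; the sample titles (including the one without a year) are hypothetical test cases, and `release_year` here is a standalone Series rather than the notebook's column.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "A Title Without Year",            # hypothetical title with no year
    "2001: A Space Odyssey (1968) ",   # leading digits and a trailing space
])

# Capture a four-digit year in parentheses at the end of the title.
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to a nullable integer type so missing years stay missing.
release_year = pd.to_numeric(years, errors="coerce").astype("Int64")
print(release_year)
```

Anchoring the pattern to the end of the string avoids picking up a parenthesized number that appears earlier in a title.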

User-Based Features:

For collaborative filtering, you can create user-based features such as the average rating given by each user or the number of movies each user has rated.

In [14]:
# Calculate the average rating given by each user
user_avg_rating = ratings.groupby('userId')['rating'].mean().reset_index()
user_avg_rating.rename(columns={'rating': 'avg_rating_by_user'}, inplace=True)

# Calculate the number of movies each user has rated
user_rating_counts = ratings['userId'].value_counts().reset_index()
user_rating_counts.columns = ['userId', 'num_movies_rated']

# Display the first few rows of the resulting DataFrames
print("Average Rating Given by Each User:")
print(user_avg_rating.head())

print("\nNumber of Movies Rated by Each User:")
print(user_rating_counts.head())

Average Rating Given by Each User:
   userId  avg_rating_by_user
0       1            4.366379
1       2            3.948276
2       3            2.435897
3       4            3.555556
4       5            3.636364

Number of Movies Rated by Each User:
   userId  num_movies_rated
0     414              2698
1     599              2478
2     474              2108
3     448              1864
4     274              1346
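One common use of the per-user average is mean-centering: subtracting each user's average from their ratings so that generous and strict raters become comparable. A minimal sketch on hypothetical toy data, with `rating_centered` as an assumed column name:

```python
import pandas as pd

# Hypothetical toy ratings: user 1 rates high, user 2 rates low.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 20],
    "rating": [5.0, 4.0, 3.0, 2.0],
})

# transform("mean") returns the user's average aligned with every rating row.
user_mean = toy_ratings.groupby("userId")["rating"].transform("mean")
toy_ratings["rating_centered"] = toy_ratings["rating"] - user_mean
print(toy_ratings)
```

After centering, both users show the same preference pattern (movie 10 above their average, movie 20 below), even though their raw ratings differ by two stars.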


Matrix Factorization Features:

Generate latent factors (embeddings) for movies and users with a matrix factorization technique such as Singular Value Decomposition (SVD). These embeddings can capture relationships between users and movies that are not visible in the explicit features above.

In [22]:
# Define the Reader object with the MovieLens rating scale (0.5 to 5.0 stars)
reader = Reader(rating_scale=(0.5, 5.0))

# Load the ratings data using the Reader
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the dataset into a trainset and testset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Create and train an SVD model
svd_model = SVD(n_factors=50, random_state=42)
svd_model.fit(trainset)

# Make predictions on the test set
predictions = svd_model.test(testset)

# Evaluate the model using RMSE
rmse = accuracy.rmse(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")

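# Get latent factors for movies (qi) and users (pu); each is an array with one
# row per item or user, indexed by Surprise's inner ids, and n_factors columns
movie_factors = svd_model.qi
user_factors = svd_model.pu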

RMSE: 0.8775
Root Mean Squared Error (RMSE): 0.8774680781839199


An RMSE of 0.8775 means that the model's predicted ratings typically deviate from the actual user ratings by a little under 0.9 stars on the 0.5 to 5.0 scale. Because RMSE squares each error before averaging, large errors are penalized more heavily than small ones. Lower RMSE values indicate better accuracy, while higher values suggest that the predictions are less accurate.
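To show how the latent factors turn into a predicted rating, the sketch below reproduces the biased matrix factorization estimate (global mean plus user bias plus item bias plus the dot product of the factor vectors) with NumPy. All numbers are hypothetical and `predict_rating` is an illustrative helper, not a Surprise function. With a fitted Surprise SVD model, the corresponding values would come from `trainset.global_mean`, `svd_model.bu`, `svd_model.bi`, `svd_model.pu`, and `svd_model.qi`, indexed by inner ids obtained through `trainset.to_inner_uid` and `trainset.to_inner_iid`.

```python
import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_vec, item_vec,
                   rating_scale=(0.5, 5.0)):
    """Biased matrix factorization estimate, clipped to the rating scale."""
    raw = global_mean + user_bias + item_bias + float(np.dot(item_vec, user_vec))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical values standing in for a fitted model's parameters
mu = 3.5                          # global mean rating
b_u = 0.2                         # this user rates slightly above average
b_i = -0.1                        # this movie is rated slightly below average
p_u = np.array([0.1, -0.3, 0.5])  # user latent factors
q_i = np.array([0.4, 0.2, 0.6])   # movie latent factors

estimate = predict_rating(mu, b_u, b_i, p_u, q_i)
print(round(estimate, 2))
```

The dot product captures how well the movie's latent traits match the user's tastes, while the bias terms account for users who rate generously and movies that are broadly liked or disliked.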