
# Movie Recommendation System
## Group 6 Phase 4 Project
### Contributors
 1. Dennis Mwanzia
 2. Amos Kirui
 3. Robert Mbau
 4. Fiona Kungu
 5. Maureen Kitanga
 6. Edwin Muhia
 

## Overview
- MovieXplosion, a new streaming platform, wants to improve user satisfaction. The platform's performance depends on how well it keeps users engaged, and one way to do this is to provide tailor-made recommendations that encourage users to spend more time on the platform.
- This project aims to develop a system that suggests movies to users. We will implement it using collaborative filtering, content-based filtering and hybrid approaches.
## Problem Statement
- The platform's current system does not provide suitable recommendations, which has led to low user engagement, satisfaction and retention. It has no way of giving new users good recommendations, and existing users do not receive tailor-made recommendations.
- The new system aims to address these issues and provide relevant recommendations to all users.
## Objectives
1. Build a model that provides the top 5 recommendations for a user.
2. Develop a system that addresses the `cold start` problem for new users.
3. Enhance the recommendation system to provide accurate and relevant movie suggestions based on each user's preferences.
4. Evaluate system performance using appropriate metrics such as `RMSE`.
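
For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower values indicate better predictions. The sketch below computes it in plain Python; the `rmse` helper and the sample ratings are illustrative assumptions, not values from the project data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative ratings only (not taken from the MovieLens data)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```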
## Data Understanding
The data was sourced from [MovieLens](https://grouplens.org/datasets/movielens/latest/). We used the small dataset because of limited computational power. The data contains information about movies, user ratings and other relevant information.
### Data Description
There are several files available with different columns:
1. Movies file
- It contains information about the movies.<br>
`movieId` - Unique identifier for each movie.<br>
`title` - The movie title.<br>
`genres` - The pipe-separated genres a movie falls into.<br>
2. Ratings file
- It contains the ratings for the movies by different users.<br>
`userId` - Unique identifier for each user.<br>
`movieId` - Unique identifier for each movie.<br>
`rating` - A value from 0.5 to 5, in half-star increments, that a user gives a movie. A higher rating indicates a stronger preference.<br>
`timestamp` - The number of seconds elapsed since midnight on January 1, 1970 (UTC).
3. Tags file
- It has user-generated words or short phrases about a movie, with the meaning or value of each tag determined by the specific user.<br>
`userId` - Unique identifier for each user.<br>
`movieId` - Unique identifier for each movie.<br>
`tag` - A word or phrase chosen by the user.<br>
`timestamp` - The number of seconds elapsed since midnight on January 1, 1970 (UTC).
4. Links file
- These are identifiers that can be used to link to other sources of movie data, as provided by MovieLens.<br>
`movieId` - An identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.<br>
`imdbId` - An identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.<br>
`tmdbId` - An identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.<br>
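
The `timestamp` and `imdbId` fields described above can be turned into more readable forms. The sketch below converts one timestamp value from the ratings data into a date and rebuilds the Toy Story IMDb link. The `imdb_url` helper is hypothetical, and the seven-digit zero-padding is an assumption based on the example link above.

```python
import pandas as pd

# A timestamp value that appears in the ratings data (seconds since the Unix epoch)
ts = pd.to_datetime(964982703, unit="s", utc=True)
print(ts)  # 2000-07-30 18:45:03+00:00


def imdb_url(imdb_id):
    """Hypothetical helper: build an IMDb link from an integer imdbId.

    Assumes the ID is zero-padded to at least seven digits, as in tt0114709.
    """
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"


print(imdb_url(114709))  # http://www.imdb.com/title/tt0114709/
```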

In [1]:
#importing relevant libraries
#standard libraries
import pandas as pd
import numpy as np
#from surprise import Reader

#visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#import model selection Modules
#from surprise.model_selection import train_test_split, GridSearchCV, cross_validate, KFold

#importing metrics
#from surprise import accuracy

#import models
#from surprise import BaselineOnly
#from surprise import KNNBasic
#from surprise import KNNWithMeans
#from surprise import KNNBaseline
#from surprise import SVD
#from surprise import SVDpp
#from surprise import NMF



In [2]:
# load data
mov_df = pd.read_csv('movies.csv')
mov_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'movies.csv'

In [None]:
# ratings dataframe
ratings_df = pd.read_csv('ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [None]:
# links dataframe
links_df = pd.read_csv('links.csv')
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
# tags dataframe
tags_df = pd.read_csv('tags.csv')
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [None]:
# The tags data has its own userId and timestamp columns. We drop them before merging
# so they do not clash with the columns of the same name in the ratings data.

# Drop the 'userId' and 'timestamp' columns from tags_df
tags_df = tags_df.drop(['userId', 'timestamp'], axis=1)

# View the updated DataFrame
tags_df.head()


Unnamed: 0,movieId,tag
0,60756,funny
1,60756,Highly quotable
2,60756,will ferrell
3,89774,Boxing story
4,89774,MMA


In [None]:
#Data Exploration

In [None]:
# Checking Missing Data & Null Values
# check for missing values in mov_df
mov_df.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [None]:
# check for missing values in ratings_df
ratings_df.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [None]:
# check for missing values in links_df
links_df.isna().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

In [None]:
# check for missing values in tags_df
tags_df.isna().sum()

movieId    0
tag        0
dtype: int64

In [None]:
# links.csv has 8 missing values in tmdbId, so we inspect those rows

# Count the number of missing values in the 'tmdbId' column
missing_values = links_df['tmdbId'].isnull().sum()

# Get the rows with missing values in the 'tmdbId' column
missing_rows = links_df[links_df['tmdbId'].isnull()]

# Print the number of missing values and the rows containing them
print("Number of missing values in 'tmdbId' column:", missing_values)
print("Rows with missing values in 'tmdbId' column:")
print(missing_rows)


Number of missing values in 'tmdbId' column: 8
Rows with missing values in 'tmdbId' column:
      movieId  imdbId  tmdbId
624       791  113610     NaN
843      1107  102336     NaN
2141     2851   81454     NaN
3027     4051   56600     NaN
5532    26587   92337     NaN
5854    32600  377059     NaN
6059    40697  105946     NaN
7382    79299  874957     NaN
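
The missing entries also explain why `tmdbId` is stored as a float (for example `862.0`): a plain integer column cannot hold `NaN`. One option, shown here only as a sketch and not something the notebook does, is pandas' nullable `Int64` dtype. The `links_sample` frame is a small stand-in built from rows shown in the outputs above.

```python
import numpy as np
import pandas as pd

# Small stand-in for links_df, using rows that appear in the outputs above
links_sample = pd.DataFrame({
    "movieId": [1, 2, 791],
    "imdbId": [114709, 113497, 113610],
    "tmdbId": [862.0, 8844.0, np.nan],
})

# The nullable integer dtype keeps missing entries as <NA>
# without forcing the remaining IDs to be floats
links_sample["tmdbId"] = links_sample["tmdbId"].astype("Int64")

print(links_sample.dtypes)
print("Missing tmdbId values:", links_sample["tmdbId"].isna().sum())
```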


In [None]:
# Combining the four datasets

# Merge the 'movies' and 'ratings' datasets based on the 'movieId' column
movies_and_ratings = pd.merge(left=mov_df, right=ratings_df, on='movieId')

# Merge the 'movies_and_ratings' and 'links' datasets based on the 'movieId' column
movies_ratings_links = pd.merge(left=movies_and_ratings, right=links_df, on='movieId')

# Merge the 'movies_ratings_links' and 'tags' datasets based on the 'movieId' column
merged_dataset = pd.merge(left=movies_ratings_links, right=tags_df, on='movieId')

# Print the merged dataset
merged_dataset.head()


Unnamed: 0,movieId,title,genres,userId,rating,timestamp,imdbId,tmdbId,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,fun
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar
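
Note that the inner merge with the tags data repeats each rating once for every tag on the same movie (the repeated rows above) and drops movies that have no tags. A possible alternative, offered as an assumption rather than the approach used in this notebook, is to collapse the tags to one row per movie and then use a left merge. The frames below are tiny made-up examples that reuse the notebook's column names.

```python
import pandas as pd

# Tiny illustrative frames (made-up rows, same column names as the notebook)
ratings_small = pd.DataFrame({
    "userId": [1, 5, 1],
    "movieId": [1, 1, 3],
    "rating": [4.0, 4.0, 4.0],
})
tags_small = pd.DataFrame({
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

# Inner merge: one output row per (rating, tag) pair; movie 3 has no tags and is dropped
inner = pd.merge(ratings_small, tags_small, on="movieId")

# Alternative: join the unique tags of each movie into one string,
# then left-merge so every rating is kept exactly once
tags_per_movie = (
    tags_small.groupby("movieId")["tag"]
    .agg(lambda s: "|".join(sorted(set(s))))
    .reset_index()
)
left = pd.merge(ratings_small, tags_per_movie, on="movieId", how="left")

print(len(ratings_small), len(inner), len(left))  # 3 6 3
```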


In [None]:
merged_dataset.columns

Index(['movieId', 'title', 'genres', 'userId', 'rating', 'timestamp', 'imdbId',
       'tmdbId', 'tag'],
      dtype='object')

In [None]:
merged_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    233213 non-null  int64  
 1   title      233213 non-null  object 
 2   genres     233213 non-null  object 
 3   userId     233213 non-null  int64  
 4   rating     233213 non-null  float64
 5   timestamp  233213 non-null  int64  
 6   imdbId     233213 non-null  int64  
 7   tmdbId     233213 non-null  float64
 8   tag        233213 non-null  object 
dtypes: float64(2), int64(4), object(3)
memory usage: 17.8+ MB


In [None]:
# Feature engineering: extract the release year from the 'title' column

# Capture the four-digit year in parentheses and store it in a new 'Year' column
merged_dataset['Year'] = merged_dataset['title'].str.extract(r'\((\d{4})\)')

# View the updated DataFrame with the 'Year' column
merged_dataset.head()


Unnamed: 0,movieId,title,genres,userId,rating,timestamp,imdbId,tmdbId,tag,Year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar,1995
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar,1995
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,fun,1995
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar,1995
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar,1995
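
The extracted `Year` values are strings, and any title without a `(YYYY)` suffix would produce a missing value. If the year is needed for numeric work, it could be converted as in the sketch below. The second title is made up to show the missing-value case.

```python
import pandas as pd

# The second title is a made-up example with no year in parentheses
titles = pd.Series(["Toy Story (1995)", "Untitled Film"])

# expand=False returns a Series; titles without a "(YYYY)" group yield a missing value
year = titles.str.extract(r'\((\d{4})\)', expand=False)

# Convert the strings to a nullable integer column so missing years stay missing
year = pd.to_numeric(year, errors="coerce").astype("Int64")
print(year)
```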


In [None]:
# Remove the year in parentheses from the 'title' column.
# regex=True is passed explicitly because the pattern is a regular expression.
merged_dataset['title'] = merged_dataset['title'].str.replace(r'\s*\(\d{4}\)', '', regex=True)

# View the updated DataFrame
merged_dataset.head()




Unnamed: 0,movieId,title,genres,userId,rating,timestamp,imdbId,tmdbId,tag,Year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar,1995
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,pixar,1995
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,fun,1995
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar,1995
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,pixar,1995


In [None]:
# Exploratory Data Analysis

## Modeling

In [None]:
import pandas as pd
import numpy as np
from surprise import Reader, Dataset, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import SVD, KNNBasic, KNNWithMeans, KNNBaseline

# Load ratings dataset (the file is ratings.csv, as in the loading cell above)
df = pd.read_csv('ratings.csv')
# Surprise expects exactly three columns in the order user, item, rating
new_df = df.drop(columns='timestamp')

# Load movies dataset
df_movies = pd.read_csv('movies.csv')

# Create Surprise dataset; MovieLens ratings range from 0.5 to 5.0
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(new_df, reader)
dataset = data.build_full_trainset()


def evaluate_models(data):
    # Split the data into training and test sets
    trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

    # Define a list of models to evaluate
    models = [
        SVD(),
        KNNBasic(),
        KNNWithMeans()
    ]

    # Evaluate each model and store the results
    results = []
    for model in models:
        # Train the model on the training set
        model.fit(trainset)

        # Make predictions on the test set
        predictions = model.test(testset)

        # Calculate the evaluation metric (RMSE)
        rmse = accuracy.rmse(predictions)

        # Store the model name and its performance
        results.append({'model': model.__class__.__name__, 'rmse': rmse})

    # Sort the results by RMSE in ascending order (lower is better)
    sorted_results = sorted(results, key=lambda x: x['rmse'])

    # Print the results
    for result in sorted_results:
        print(f"Model: {result['model']}, RMSE: {result['rmse']}")

    # Select the best performing model
    best_model = sorted_results[0]['model']
    print(f"Best performing model: {best_model}")
evaluate_models(data)
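
Objectives 1 and 2 call for top-5 recommendations and a strategy for new users. The sketch below shows one possible shape for that step in plain Python. It assumes predictions are available as `(user_id, movie_id, estimated_rating)` tuples, which with Surprise could be built from the `uid`, `iid` and `est` fields of each prediction. The helper names, the sample predictions and the popularity list are all made up for illustration.

```python
from collections import defaultdict


def top_n_per_user(predictions, n=5):
    """Group (user_id, movie_id, estimated_rating) tuples by user and keep the n highest."""
    by_user = defaultdict(list)
    for user_id, movie_id, est in predictions:
        by_user[user_id].append((movie_id, est))
    return {
        user_id: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for user_id, items in by_user.items()
    }


def recommend(user_id, top_n, popular_fallback, n=5):
    """Return personalised picks for a known user, else a popularity list (cold start)."""
    if user_id in top_n:
        return [movie_id for movie_id, _ in top_n[user_id]]
    return popular_fallback[:n]


# Made-up predictions and popularity ranking for illustration
predictions = [(1, 10, 4.5), (1, 20, 3.0), (1, 30, 4.9), (2, 10, 2.5)]
top_n = top_n_per_user(predictions, n=2)
popular = [20, 30, 10]

print(recommend(1, top_n, popular, n=2))   # [30, 10] for a known user
print(recommend(99, top_n, popular, n=2))  # [20, 30] for a new user
```

A popularity list is only one simple fallback for cold start; content-based similarity on genres or tags would be another option.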