# Movie Recommendation System


![picture](Images/image2.jpeg)


## Business Understanding

### Introduction

### Problem Statement

### Research Questions

### Main Objective

### Specific Objectives

### Metrics of success

## Data Understanding
The datasets used were extracted from the [MovieLens](https://grouplens.org/datasets/movielens/latest/) movie database. They are:
> Movies Dataset: contains 9742 movies and 3 columns: **movieId**, **title** and **genres**.

> Ratings Dataset: contains 100836 rows and 4 columns: **userId**, **movieId**, **rating** and **timestamp**. It has 610 userIds, 9724 rated movies and 100,836 ratings.

The ratings were created by 610 users between March 29, 1996 and September 24, 2018, and the dataset was generated on September 26, 2018. Users were selected at random for inclusion.

To download the datasets, see the [link](https://grouplens.org/datasets/movielens/latest/).

The information contained in these datasets will be used to train our model, with the columns serving as its features.

In [1]:
# loading necessary libraries
import pandas as pd
import numpy as np
from surprise import Reader , Dataset
from sklearn.metrics.pairwise import cosine_similarity
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

In [2]:
# loading the movies.csv file
movies_df = pd.read_csv('./data/movies.csv')

# printing the first 5 rows
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# viewing the datasets features
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [4]:
#number of movies in the dataset
print("Number of Movies: ", movies_df.movieId.nunique())

Number of Movies:  9742


In [5]:
# loading the second dataset, the ratings.csv file
ratings_df = pd.read_csv('./data/ratings.csv')

# printing the first 5 rows
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
# viewing the datasets features
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [7]:
# number of users and movies in the dataset
print("Number of Users: ", ratings_df.userId.nunique())
print("Number of Movies: ", ratings_df.movieId.nunique())
print("Number of Ratings: ", ratings_df.shape[0])

Number of Users:  610
Number of Movies:  9724
Number of Ratings:  100836
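
The movies file lists 9742 titles but only 9724 of them appear in the ratings, so 18 movies have no ratings at all. A minimal sketch of how such titles could be listed is given here. It uses small made-up frames in place of `movies_df` and `ratings_df`; the sample rows and the name `unrated` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for movies_df and ratings_df (made-up rows, for illustration only)
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A (1995)", "B (1995)", "C (1995)", "D (1995)"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 4.5],
})

# Keep only the movies whose movieId never appears in the ratings table
unrated = movies[~movies["movieId"].isin(ratings["movieId"])]
print("Movies without ratings:", len(unrated))
print(unrated)
```

Applied to the real frames, the same `isin` filter should return the 18 unrated movies implied by the two counts.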


* The **timestamp** column will be dropped since it will not be helpful in analysis.
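
Before the column is discarded, the timestamps can be converted to dates as a quick check on the collection period quoted earlier. MovieLens timestamps are seconds since the Unix epoch (UTC). The sketch below uses only the five values visible in the ratings preview; the names `sample` and `rated_at` are illustrative.

```python
import pandas as pd

# The five timestamp values shown in the ratings preview
sample = pd.Series([964982703, 964981247, 964982224, 964983815, 964982931])

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
rated_at = pd.to_datetime(sample, unit="s")
print(rated_at.min(), rated_at.max())
```

Running the same conversion on the full `timestamp` column and taking its minimum and maximum would show the first and last rating dates in the dataset.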


In [9]:
# dropping the timestamp column, which is not needed
ratings_df.drop('timestamp', axis=1, inplace=True)

## Data Preparation

In [10]:
# attaching each rating to its movie; the left join keeps movies that have no ratings
movies_df = pd.merge(left=movies_df, right=ratings_df[['movieId', 'rating']], how='left', on='movieId')
movies_df.head()

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5
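
Because a movie can have many ratings, the left merge returns one row per (movie, rating) pair, and a movie with no ratings is kept with a `NaN` rating. A small sketch of that behaviour, using made-up frames rather than the real data:

```python
import pandas as pd

# Toy frames: movie 1 has two ratings, movie 2 has none, movie 3 has one
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 3], "rating": [4.0, 4.5, 3.0]})

# Same call shape as in the notebook: left join on movieId
merged = pd.merge(left=movies, right=ratings, how="left", on="movieId")
print(merged)
print("Rows:", len(merged))
print("Movies with no rating:", merged["rating"].isna().sum())
```

This is why the preview above repeats Toy Story on every row: each row is a separate rating of the same movie.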


In [11]:

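def avg_rating(movie):
    """Return the mean rating of the given movieId, rounded to 2 decimals."""
    # select every rating row for this movie in the merged dataframe
    mean = movies_df.loc[movies_df.movieId == movie, 'rating'].mean()
    return np.round(mean, 2)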

# testing the function on the first movie id
avg_rating(movies_df.movieId[0])

3.92
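
The later cells use a frame called `movie_df` that has an `avg_rating` column and one row per movie, but the cell that builds it is not shown. One way such a frame could be produced is a grouped mean followed by dropping repeated movies. This is offered as an assumption, not as the notebook's actual code, and the rows and the name `per_movie` are made up for illustration.

```python
import pandas as pd

# Toy merged frame: one row per (movie, rating); movie 2 has no rating
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3],
    "title": ["A", "A", "A", "B", "C"],
    "rating": [4.0, 4.5, 3.0, None, 2.0],
})

# Mean rating per movie, broadcast back to every row of that movie
merged["avg_rating"] = (
    merged.groupby("movieId")["rating"].transform("mean").round(2)
)

# Keep the first row of each movie; the original row index is preserved
per_movie = merged.drop_duplicates(subset="movieId")
print(per_movie)
```

A grouped mean computes all averages in one pass, whereas calling `avg_rating` once per movie filters the whole frame each time.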

In [13]:
# checking for duplicates and null values in our ratings dataframe
print("Duplicates: ",ratings_df.duplicated().sum())
print("Null values:\n--------------\n",ratings_df.isna().sum())

Duplicates:  0
Null values:
--------------
 userId     0
movieId    0
rating     0
dtype: int64
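
`duplicated()` with no arguments only flags rows that match on every column. For a ratings table it may also be worth checking whether any user rated the same movie more than once. A sketch with made-up rows (the names `full_dupes` and `pair_dupes` are illustrative):

```python
import pandas as pd

# Toy ratings: user 2 rated movie 10 twice, with different scores
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 10],
    "rating": [4.0, 3.0, 5.0, 4.5],
})

# Rows identical in every column
full_dupes = ratings.duplicated().sum()

# Rows repeating an earlier (userId, movieId) pair, whatever the rating
pair_dupes = ratings.duplicated(subset=["userId", "movieId"]).sum()

print("Fully duplicated rows:", full_dupes)
print("Repeated user/movie pairs:", pair_dupes)
```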


In [14]:
# checking for duplicates and null values in our movies dataframe
print("Duplicates: ",movie_df.duplicated().sum())
print("Null values:\n--------------\n",movie_df.isna().sum())

Duplicates:  0
Null values:
--------------
 movieId        0
title          0
genres         0
rating        18
avg_rating    18
dtype: int64


In [15]:
# cleaning the genres column
print(movie_df.genres[:5])

#replacing the "|" marks with commas ','
movie_df.genres=movie_df.genres.apply( lambda x: x.replace('|',','))
print(movie_df.genres[:5])

0      Adventure|Animation|Children|Comedy|Fantasy
215                     Adventure|Children|Fantasy
325                                 Comedy|Romance
377                           Comedy|Drama|Romance
384                                         Comedy
Name: genres, dtype: object
0      Adventure,Animation,Children,Comedy,Fantasy
215                     Adventure,Children,Fantasy
325                                 Comedy,Romance
377                           Comedy,Drama,Romance
384                                         Comedy
Name: genres, dtype: object
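
The same replacement can be done with pandas' vectorised string methods, and the comma-separated genres can then be expanded into indicator columns. The sketch below assumes that one 0/1 column per genre is useful downstream, which is not something the notebook has stated. The sample rows come from the preview above and the name `genre_flags` is illustrative.

```python
import pandas as pd

# Three genre strings copied from the movies preview
genres = pd.Series([
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
])

# regex=False treats "|" as a literal character rather than regex alternation
cleaned = genres.str.replace("|", ",", regex=False)

# One indicator column per genre, split on the comma separator
genre_flags = cleaned.str.get_dummies(sep=",")

print(cleaned.tolist())
print(genre_flags)
```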


### Exploratory Data Analysis