# Movie Recommendation using Netflix Movie Reviews




This project builds a movie recommendation system from Netflix movie ratings. The dataset contains 17,337,458 ratings given by 143,458 users to 1,350 movies. Each rating is an integer from 1 to 5.


**Table of Contents**



#### 1.  Load Rating Data
#### 2.  Load Movie Data
#### 3.  Analyze Data
#### 4.  Recommendation Model
#### 4.1 Collaborative Filtering - SVD

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
! pip install scikit-surprise

Collecting scikit-surprise


ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/30/e1/f4f78b7dd32feaa6256f000668a6932e81b899d0e5a5f84ab3fd1f5e2743/scikit-surprise-1.1.3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))



In [3]:
from surprise import Reader, Dataset, SVD
from surprise import accuracy
from surprise.model_selection import train_test_split

# 1. Load Rating Data

In [5]:
df = pd.read_csv('Netflix_Dataset_Rating.csv')
df

Unnamed: 0,User_ID,Rating,Movie_ID
0,712664,5,3
1,1331154,4,3
2,2632461,3,3
3,44937,5,3
4,656399,4,3
...,...,...,...
17337453,520675,3,4496
17337454,1055714,5,4496
17337455,2643029,4,4496
17337456,1559566,3,4496


In [6]:
df.dtypes

User_ID     int64
Rating      int64
Movie_ID    int64
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17337458 entries, 0 to 17337457
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   User_ID   int64
 1   Rating    int64
 2   Movie_ID  int64
dtypes: int64(3)
memory usage: 396.8 MB
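
The three `int64` columns account for the 396.8 MB reported above (17,337,458 rows × 3 columns × 8 bytes). As a sketch, assuming the value ranges seen in this notebook (user IDs in the low millions, movie IDs below 32,767, ratings from 1 to 5), narrower integer types could hold the same values in less memory. The toy frame below only reuses a few rows from the preview and is not the real dataset.

```python
import pandas as pd

# Toy stand-in for the ratings frame (values copied from the preview above)
toy = pd.DataFrame({
    'User_ID': [712664, 1331154, 2632461],
    'Rating': [5, 4, 3],
    'Movie_ID': [3, 3, 3],
}, dtype='int64')

# Assumed ranges: User_ID fits int32, Movie_ID fits int16, Rating fits int8
small = toy.astype({'User_ID': 'int32', 'Rating': 'int8', 'Movie_ID': 'int16'})

bytes_before = toy.memory_usage(index=False).sum()   # 3 rows * 3 columns * 8 bytes
bytes_after = small.memory_usage(index=False).sum()  # 3 rows * (4 + 1 + 2) bytes
print(bytes_before, bytes_after)
```

Whether the saving matters depends on what consumes the frame afterwards; some libraries convert the columns back to wider types internally.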


In [8]:
df['Rating'].describe().astype('int')

count    17337458
mean            3
std             1
min             1
25%             3
50%             4
75%             4
max             5
Name: Rating, dtype: int32

In [9]:
print("Unique Values :\n",df.nunique())

Unique Values :
 User_ID     143458
Rating           5
Movie_ID      1350
dtype: int64
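
These counts also show how sparse the user-movie matrix is. A quick calculation, using only the figures printed above, gives the share of all possible (user, movie) pairs that actually carry a rating, which comes to roughly 9%:

```python
# Counts reported by df.info() and df.nunique() above
n_ratings = 17_337_458
n_users = 143_458
n_movies = 1_350

# Fraction of all possible (user, movie) pairs that actually have a rating
density = n_ratings / (n_users * n_movies)
print(f"{density:.2%} of the user-movie matrix is filled")
```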


# 2. Load Movie Data

In [12]:
df_title = pd.read_csv('Netflix_Dataset_Movie.csv')
df_title

Unnamed: 0,Movie_ID,Year,Name
0,1,2003,Dinosaur Planet
1,2,2004,Isle of Man TT 2004 Review
2,3,1997,Character
3,4,1994,Paula Abdul's Get Up & Dance
4,5,2004,The Rise and Fall of ECW
...,...,...,...
17765,17766,2002,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004,Fidel Castro: American Experience
17767,17768,2000,Epoch
17768,17769,2003,The Company


In [13]:
df_title.dtypes

Movie_ID     int64
Year         int64
Name        object
dtype: object

In [14]:
df_title.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17770 entries, 0 to 17769
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Movie_ID  17770 non-null  int64 
 1   Year      17770 non-null  int64 
 2   Name      17770 non-null  object
dtypes: int64(2), object(1)
memory usage: 416.6+ KB


In [15]:
df_title['Year'].describe().astype('int')

count    17770
mean      1990
std         16
min       1915
25%       1985
50%       1997
75%       2002
max       2005
Name: Year, dtype: int32

In [16]:
print("Unique Values :\n",df_title.nunique())

Unique Values :
 Movie_ID    17770
Year           91
Name        17297
dtype: int64


# 3. Analyze Data

In [17]:
no_of_rated_products_per_users = df.groupby(by='User_ID')['Rating'].count().sort_values(ascending=False)
no_of_rated_products_per_users.head()

User_ID
305344     1344
387418     1339
2439493    1324
2118461    1305
1664010    1257
Name: Rating, dtype: int64

In [18]:
no_of_rated_products_per_users.describe()


count    143458.000000
mean        120.853895
std          79.783702
min           5.000000
25%          67.000000
50%          95.000000
75%         147.000000
max        1344.000000
Name: Rating, dtype: float64

In [19]:
no_of_rated_products_per_movies = df.groupby(by='Movie_ID')['Rating'].count().sort_values(ascending=False)
no_of_rated_products_per_movies.head()

Movie_ID
1905    117075
2452    102721
4306    102376
571     101450
3860     98545
Name: Rating, dtype: int64

In [20]:
no_of_rated_products_per_movies.describe()

count      1350.000000
mean      12842.561481
std       17805.334719
min        1042.000000
25%        2607.750000
50%        5229.000000
75%       14792.000000
max      117075.000000
Name: Rating, dtype: float64

In [21]:
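# Per-movie rating count and mean rating
f = ['count', 'mean']
df_movie_summary = df.groupby('Movie_ID')['Rating'].agg(f)
df_movie_summary.index = df_movie_summary.index.map(int)

# Movies with fewer ratings than the 70th percentile of counts are excluded from recommendations
movie_benchmark = round(df_movie_summary['count'].quantile(0.7), 0)
drop_movie_list = df_movie_summary[df_movie_summary['count'] < movie_benchmark].index

# Movie_ID-indexed copy of the titles (not used by the cells below)
df__title = df_title.set_index('Movie_ID')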
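
The cell above marks every movie whose rating count falls below the 70th percentile of all counts as too sparsely rated to recommend. A small sketch on invented counts (the movie IDs and numbers below are hypothetical) shows the effect of that threshold:

```python
import pandas as pd

# Hypothetical rating counts for ten movies, indexed by Movie_ID
counts = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   index=range(1, 11), name='count')

# Same rule as above: the benchmark is the 70th percentile of the counts
benchmark = round(counts.quantile(0.7), 0)

# Movies below the benchmark are dropped; the rest stay eligible
dropped = counts[counts < benchmark].index.tolist()
kept = counts[counts >= benchmark].index.tolist()
print(benchmark, dropped, kept)
```

With this rule roughly 70% of the rated movies end up in the drop list, so only the most frequently rated titles remain candidates.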

# 4. Recommendation Model


## 4.1 Collaborative Filtering - SVD

In [22]:
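# Matrix-factorization model (Surprise's SVD), trained for 10 epochs
model = SVD(n_epochs=10, verbose=True)

# Surprise expects the columns in (user, item, rating) order
data = Dataset.load_from_df(df[['User_ID', 'Movie_ID', 'Rating']], Reader())

# Hold out 30% of the ratings for evaluation
trainset, testset = train_test_split(data, test_size=0.3, random_state=10)

# Fit on the training split only; rebuilding the trainset from the full data
# here would leak the test ratings into training and understate the RMSE
model.fit(trainset)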

NameError: name 'SVD' is not defined
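
Surprise documents its `SVD` as a matrix factorization with user and item biases, trained by stochastic gradient descent, where the estimate is `mu + b_u + b_i + q_i · p_u`. Because the package did not install in this run, the following is a minimal NumPy sketch of that model family. It is not Surprise's implementation: the function names, hyperparameters and toy ratings are all assumptions made for illustration.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2,
                  n_epochs=50, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + p[u] @ q[i] by SGD."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])       # global mean rating
    b_u = np.zeros(n_users)                        # user biases
    b_i = np.zeros(n_items)                        # item biases
    p = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
    q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            # Gradient step with L2 regularisation on every parameter
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])
    return mu, b_u, b_i, p, q

def predict(model, u, i):
    """Estimated rating of item i by user u."""
    mu, b_u, b_i, p, q = model
    return mu + b_u[u] + b_i[i] + p[u] @ q[i]

# Hypothetical toy data: (user index, item index, rating)
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
               (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 2)]

untrained = fit_biased_mf(toy_ratings, n_users=4, n_items=3, n_epochs=0)
trained = fit_biased_mf(toy_ratings, n_users=4, n_items=3, n_epochs=50)

# User 0 rated item 0 with a 5; compare the estimate before and after training
print(predict(untrained, 0, 0), predict(trained, 0, 0))
```

The sketch uses contiguous user and item indices; Surprise maps raw IDs such as `712664` to inner indices itself.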

In [None]:
predictions = model.test(testset)

accuracy.rmse(predictions, verbose=True)
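
`accuracy.rmse` reports the root mean squared error between the true ratings in the test set and the model's estimates. For reference, the same metric can be computed directly; the numbers below are made up and are not results from this notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up true ratings and model estimates
true_ratings = [5, 3, 4, 1]
estimates = [4, 3, 5, 3]
print(rmse(true_ratings, estimates))
```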

In [None]:
def Recommendation(given_user_id, n_movies):
    # Start from all titles and drop the infrequently rated movies
    given_user = df_title.copy()
    given_user = given_user[~given_user['Movie_ID'].isin(drop_movie_list)]

    # Estimate the rating this user would give each remaining movie
    given_user['Estimated_Rating'] = given_user['Movie_ID'].apply(
        lambda x: model.predict(given_user_id, x).est)

    # Rank by estimated rating and return the top n_movies
    given_user = given_user.drop('Movie_ID', axis=1)
    given_user = given_user.sort_values('Estimated_Rating', ascending=False)
    given_user = given_user.reset_index(drop=True)
    return given_user.head(n_movies)
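
`Recommendation` scores every title that is not in `drop_movie_list`, including movies the user has already rated. One possible variation, shown here only as a sketch, removes those first. The helper name `top_n_unseen`, the toy frames and the stand-in `fake_estimate` scorer are all hypothetical; in the notebook the estimate would come from `model.predict(user_id, movie_id).est`.

```python
import pandas as pd

def top_n_unseen(user_id, n, ratings, titles, estimate):
    """Return the n highest-estimated titles the user has not rated yet."""
    seen = ratings.loc[ratings['User_ID'] == user_id, 'Movie_ID'].unique()
    candidates = titles[~titles['Movie_ID'].isin(seen)].copy()
    candidates['Estimated_Rating'] = candidates['Movie_ID'].apply(
        lambda movie_id: estimate(user_id, movie_id))
    ranked = candidates.sort_values('Estimated_Rating', ascending=False)
    return ranked.head(n).reset_index(drop=True)

# Hypothetical toy data in the same column layout as the notebook's frames
toy_ratings = pd.DataFrame({'User_ID': [1, 1, 2],
                            'Movie_ID': [10, 20, 10],
                            'Rating': [5, 3, 4]})
toy_titles = pd.DataFrame({'Movie_ID': [10, 20, 30, 40],
                           'Name': ['A', 'B', 'C', 'D']})

# Stand-in scorer; a trained model's prediction would go here instead
def fake_estimate(user_id, movie_id):
    return {10: 4.5, 20: 3.0, 30: 4.0, 40: 2.5}[movie_id]

result = top_n_unseen(1, 2, toy_ratings, toy_titles, fake_estimate)
print(result)
```

User 1 has already rated movies 10 and 20, so only movies 30 and 40 are ranked.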

### Movie Recommendation for User - 712664

In [None]:
Recommendation(712664,10)

### Movie Recommendation for User - 2643029

In [None]:
Recommendation(2643029,10)