# Capstone Project - Netflix Movie Recommendation

# Problem Statement

#### Recommendation engines put the predictability of user activity to practical use. They go one step beyond reporting information and propose strategies that further increase users' interaction with the platform.

#### In today's world, OTT platforms and streaming services have taken up a big share of the retail and entertainment industry. Organizations like Netflix and Amazon analyse user activity patterns and suggest products that better suit users' needs and choices.

#### For this project we will build one such recommendation engine from the ground up, in which every user, based on their areas of interest and ratings, is recommended a list of the movies best suited to them.

# Dataset Information

#### 1. ID: contains the separate keys for customers and movies.
#### 2. Rating: contains the user ratings for all the movies.
#### 3. Genre: highlights the category of the movie.
#### 4. Movie Name: the name of the movie that corresponds to the movie id.

# Objective

#### 1. Find the most popular and most liked genre.
#### 2. Create a model that finds the best-suited movie for one user in every genre.
#### 3. Find which movie genres have received the best and worst ratings, based on user rating.

In [1]:
import zipfile
import os
from datetime import datetime
import numpy as np
import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")
import matplotlib
matplotlib.use('nbagg')

import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})


#### The method getNetflixData() reads the Netflix data, restructures it into a flat table, and returns it as a data frame.

In [2]:
def getNetflixData(zifilename,filename,usecols = [0,1,2], names=['Movie_id', 'User_id','Rating']):
    start = datetime.now()
    # Build the flat CSV only once; later calls reuse the cached file.
    if not os.path.isfile('Netflix_data.csv'):
        movie_id = ""
        # The with-blocks close both the CSV and the zip archive even if an error occurs.
        with open('Netflix_data.csv', mode='w') as Netflix_data:
            with zipfile.ZipFile(zifilename) as z:
                with z.open(filename) as f:
                    for line in f:
                        line = line.decode("utf-8").strip()
                        if line.endswith(':'):
                            # All lines below are ratings for this movie, until another movie id appears.
                            movie_id = line.replace(':', '')
                        else:
                            # Prefix each "user,rating,date" line with the current movie id.
                            Netflix_data.write(movie_id + ',' + line)
                            Netflix_data.write('\n')
    print('Time taken to creating Netflix_data file:', datetime.now() - start)
    Netflix_df = pd.read_csv('Netflix_data.csv', sep=',',usecols =usecols,names=names)
    return Netflix_df
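
The restructuring step above can be sketched on a small in-memory sample. This is only an illustration of the parsing logic, not the notebook's method: the helper name `iter_rating_rows` is hypothetical, and the sample lines (including the dates) are made up to mimic the "movie id header followed by rating lines" layout.

```python
from typing import Iterable, Iterator, Tuple


def iter_rating_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (movie_id, user_id, rating) from lines laid out in 'movie_id:' blocks."""
    movie_id = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            movie_id = line[:-1]  # header line: a new movie block starts here
        else:
            user_id, rating, _date = line.split(",")
            yield movie_id, user_id, rating


# Illustrative sample only; the dates are invented for the example.
sample = ["1:", "1488844,3,2005-09-06", "822109,5,2005-05-13", "2:", "2059652,4,2005-09-05"]
rows = list(iter_rating_rows(sample))
print(rows)
```

Each rating line inherits the movie id of the most recent header line, which is what lets the flat CSV be loaded with three columns.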
   


#### Call getNetflixData() to get the Netflix data frame.

In [3]:
netflix_df = getNetflixData(zifilename='combined_data_1.txt.zip',filename="combined_data_1.txt", usecols = [0,1,2], names=['Movie_id', 'User_id','Rating'])

Time taken to creating Netflix_data file: 0:01:54.338897


In [4]:
netflix_df.head()

Unnamed: 0,Movie_id,User_id,Rating
0,1,1488844,3
1,1,822109,5
2,1,885013,4
3,1,30878,4
4,1,823519,3


In [5]:
netflix_df.tail()

Unnamed: 0,Movie_id,User_id,Rating
24053759,4499,2591364,2
24053760,4499,1791000,2
24053761,4499,512536,5
24053762,4499,988963,3
24053763,4499,1704416,3


#### Load the movie title data into a data frame.

In [6]:
movie_title_df = pd.read_csv("movie_titles.csv",  encoding='ISO-8859-1', header=None, usecols=[0,1,2], names=['Movie_id','Year','Name' ])


In [7]:
movie_title_df.head()

Unnamed: 0,Movie_id,Year,Name
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW


# 1. Find the most popular and most liked genre

In [8]:
# Count the ratings each movie received, then keep the movie(s) with the highest count.
most_reviewed_movie = netflix_df.groupby('Movie_id')['Rating'].count()
most_reviewed_movie = most_reviewed_movie.to_frame(name="Total_review_count").reset_index()
most_reviewed_movie = most_reviewed_movie[most_reviewed_movie["Total_review_count"]==most_reviewed_movie["Total_review_count"].max()]
# Attach the title and year of the most reviewed movie.
pd.merge(movie_title_df, most_reviewed_movie, on="Movie_id")

Unnamed: 0,Movie_id,Year,Name,Total_review_count
0,1905,2003.0,Pirates of the Caribbean: The Curse of the Bla...,193941
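
The cell above ranks movies by how many ratings they received, which measures popularity. As a rough sketch of how "popular" and "liked" can differ, the example below computes both a review count and a mean rating per movie. The data is made up, and treating the mean rating as a measure of "liked" is an assumption of this sketch, not something the notebook does.

```python
from collections import Counter, defaultdict

# Toy (movie_id, rating) pairs; values are made up for illustration.
ratings = [(1, 3), (1, 5), (2, 4), (2, 4), (2, 1), (3, 5)]

# Popularity: number of ratings per movie.
review_counts = Counter(movie_id for movie_id, _ in ratings)
top_count = max(review_counts.values())
# Keep every movie that ties for the highest count, as the max filter does.
most_reviewed = sorted(m for m, c in review_counts.items() if c == top_count)

# "Liked" (assumption of this sketch): mean rating per movie.
totals = defaultdict(int)
for movie_id, rating in ratings:
    totals[movie_id] += rating
mean_rating = {m: totals[m] / review_counts[m] for m in review_counts}

print(most_reviewed, top_count, mean_rating)
```

In this toy data the most reviewed movie (2) also has the lowest mean rating, so the two notions can disagree.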


# Model building starts here

### The method delLowestReviewedMovies() cleans the data based on rating volumes: it drops movies and users whose rating counts fall below a quantile threshold.

In [9]:
def delLowestReviewedMovies(data, quantile_percent=0.7):
    print('The original dataframe has: ', data.shape, 'shape')
    ratings_movie_summary = data.groupby('Movie_id')['Rating'].agg(['count', 'mean', 'std'])
    movie_benchmark = ratings_movie_summary["count"].quantile(quantile_percent) 
    movie_benchmark = round(movie_benchmark)
    dataset_cust_summary=data.groupby('User_id')['Rating'].agg(['count','mean'])
    cust_benchmark=round(dataset_cust_summary['count'].quantile(quantile_percent),0)
    print('The movie_benchmark is: ', movie_benchmark)
    
    drop_movie_list = ratings_movie_summary[ratings_movie_summary['count']<movie_benchmark].index
    drop_cust_list  = dataset_cust_summary[dataset_cust_summary['count']<cust_benchmark].index    
    data=data[~data['Movie_id'].isin(drop_movie_list)]
    data=data[~data['User_id'].isin(drop_cust_list)]
    print('After the triming, the shape is: {}'.format(data.shape))
    return data

### Call delLowestReviewedMovies() to remove the movies and users whose rating counts fall below the 70 percent quantile.

In [10]:
netflix_df = delLowestReviewedMovies(data=netflix_df, quantile_percent=0.7)

The original dataframe has:  (24053764, 3) shape
The movie_benchmark is:  1799
After the triming, the shape is: (17337458, 3)
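
To make the quantile cut-off concrete, here is a minimal sketch on toy counts. The `quantile` helper is hypothetical and reimplements linear interpolation, which is the default interpolation in pandas; the per-movie counts are made up.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Toy rating counts per movie; values are made up.
counts_per_movie = {"A": 10, "B": 200, "C": 50, "D": 1000, "E": 5}

# Movies with fewer ratings than the 70 percent quantile are dropped.
benchmark = round(quantile(counts_per_movie.values(), 0.7))
kept = sorted(m for m, c in counts_per_movie.items() if c >= benchmark)
print(benchmark, kept)
```

With a 0.7 quantile only about the top 30 percent of movies by rating count survive; the same rule is then applied to users.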


### The method BuldModel reports the test_rmse, fit_time and test_time scores of different algorithms.

In [11]:
def BuldModel(data, columns=['User_id','Movie_id','Rating'], datalimit=100000):
    
    reader=Reader()
    # Use the data frame, columns and row limit passed in, rather than the global netflix_df.
    data=Dataset.load_from_df(data[columns][:datalimit], reader)
    
    svd=SVD()
    cross_validate(svd, data, measures=['RMSE','MAE'], cv=3)
    
    # Prepare predictions for the (user, movie) pairs that are in the trainset.
    trainset = data.build_full_trainset()
    testset = trainset.build_testset()
    predictions = svd.test(testset)
    model_pred = pd.DataFrame([[i.uid, i.iid, i.est] for i in predictions], columns=['User_id', 'Movie_id', 'svd'])
    print('The Trainset Data shape is: {}'.format(model_pred.shape))

    # Prepare predictions for the (user, movie) pairs that are not in the trainset (anti-testset).
    anti_testset = trainset.build_anti_testset()
    anti_predictions = svd.test(anti_testset)
    model_pred_anti = pd.DataFrame([[i.uid, i.iid, i.est] for i in anti_predictions], columns=['User_id', 'Movie_id', 'svd'])
    model_pred_anti = pd.concat([model_pred, model_pred_anti], ignore_index=True)
    print('The anti_testset Data shape is: {}'.format(model_pred_anti.shape))
    
    benchmark = []
    # Iterate over all algorithms
    algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(),  BaselineOnly(), CoClustering()]
    # KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore() are left out: these algorithms required 22 GiB of RAM.
    print ("Attempting: ", str(algorithms), '\n\n\n')

    for algorithm in algorithms:
        print("Starting: " ,str(algorithm))
        # Perform cross validation
        results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
        # results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=3, verbose=False)

        # Get results & append algorithm name
        tmp = pd.DataFrame.from_dict(results).mean(axis=0)
        # Series.append was removed in pandas 2.0; pd.concat is the equivalent.
        tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
        benchmark.append(tmp)
        print("Done: " ,str(algorithm), "\n\n")

    print ('\n\tDONE\n')
    surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
    return data, surprise_results
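
The anti-testset used in the method above is the set of (user, movie) pairs that have no rating in the trainset. A minimal sketch of the idea on made-up data, assuming missing ratings are filled with the global mean (the default fill in Surprise's `build_anti_testset`):

```python
from itertools import product

# Toy ratings: (user, movie, rating). Values are made up.
ratings = [("u1", "m1", 4), ("u1", "m2", 3), ("u2", "m1", 5), ("u3", "m3", 2)]

users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
seen = {(u, m) for u, m, _ in ratings}
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# Every (user, movie) pair without a known rating, filled with the global mean.
anti_testset = [(u, m, global_mean) for u, m in product(users, movies) if (u, m) not in seen]
print(len(seen), len(anti_testset), len(users) * len(movies))
```

Known pairs plus anti-testset pairs add up to users times movies, which is why concatenating the two prediction frames covers every user-movie combination in the sample.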

### Call the method BuldModel; it returns the Surprise dataset and the benchmark results data frame.
The call below uses datalimit=100000. I used this limit because building the model on the entire data set takes a lot of time, so the dataset limit is a parameter.

In [12]:
start = datetime.now()
model_data , surprise_results = BuldModel(data=netflix_df, columns=['User_id','Movie_id','Rating'], datalimit=100000)
print('Time taken to build suprise:', datetime.now() - start)

The Trainset Data shape is: (100000, 3)
The anti_testset Data shape is: (594088, 3)
Attempting:  [<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x0000024321D66C50>, <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x0000024321D66C90>, <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x000002437A547890>, <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x000002437A795E10>, <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x000002437A69BF90>, <surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002437A42DC10>, <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x00000243001800D0>] 



Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x0000024321D66C50>
Done:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x0000024321D66C50> 


Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x00

### The data frame below shows the test_rmse, fit_time and test_time scores of the different algorithms.

In [13]:
surprise_results

Algorithm,test_rmse,fit_time,test_time
BaselineOnly,0.989432,0.332659,0.258005
SVD,0.99868,1.051334,0.216667
SVDpp,1.012822,0.959997,0.354676
SlopeOne,1.097491,0.769333,0.236665
CoClustering,1.13174,7.027341,0.165985
NMF,1.145427,4.593641,0.218348
NormalPredictor,1.4116,0.098998,0.184349
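
For reference, the two error measures used in this notebook can be written out directly. This is a plain-Python sketch on made-up (true rating, estimated rating) pairs; the function names are hypothetical and are not the Surprise implementations.

```python
import math


def compute_rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def compute_mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up (true rating, estimated rating) pairs.
pairs = [(4, 3.5), (2, 3.0), (5, 4.5), (3, 3.0)]
print(compute_rmse(pairs), compute_mae(pairs))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE; lower is better for both.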


### The method getSurprisedResult tunes an algorithm with a grid search and returns a data frame of estimated scores for one user.

In [14]:
def getSurprisedResult(model_data,Algorithm, param_grid, MovieData, recommend_column='Movie_id', column_order=['Movie_id', 'Name', 'Year'], User_id =712664 ):
    # Tune the algorithm with a 5-fold grid search; refit=True refits the best model on the full data.
    gs = GridSearchCV(Algorithm, param_grid, measures=["rmse", "mae"], refit=True, cv=5)
    gs.fit(model_data)

    training_parameters = gs.best_params["rmse"]

    print("Algorithm Name:", Algorithm, "BEST RMSE: \t", gs.best_score["rmse"])
    print("Algorithm Name:", Algorithm,"BEST MAE: \t", gs.best_score["mae"])
    print("Algorithm Name:", Algorithm,"BEST params: \t", gs.best_params["rmse"])

    # Copy the slice so that adding a column does not touch MovieData or raise a SettingWithCopyWarning.
    user_df = MovieData[column_order].copy()
    # Estimate the rating User_id would give to every movie in MovieData.
    user_df['Estimate_Score'] =user_df[recommend_column].apply(lambda x: gs.predict(User_id, x).est)
    user_df = user_df.drop('Movie_id', axis = 1)
    # Highest estimated score first, so head(10) gives the top 10 recommendations.
    user_df = user_df.sort_values('Estimate_Score', ascending=False)
    return user_df
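
The last three steps of the method (score every movie, sort, take the head) amount to a top-N ranking. A minimal sketch with a stand-in scoring function; `top_n` and the toy scores are hypothetical, and in the notebook the scores come from `gs.predict(...).est`.

```python
def top_n(movie_ids, predict, n=10):
    """Score every movie with predict() and return the n highest (movie_id, score) pairs."""
    scored = [(movie_id, predict(movie_id)) for movie_id in movie_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Stand-in for a fitted model: a fixed score per movie id (made-up values).
toy_scores = {1: 3.2, 2: 4.7, 3: 4.1, 4: 2.9}
result = top_n(toy_scores, toy_scores.get, n=2)
print(result)
```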


### Execute getSurprisedResult for the BaselineOnly algorithm to get its recommendation data frame.

In [15]:
start = datetime.now()
param_grid = {
}
Base_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=BaselineOnly,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build BaselineOnly suprise model:', datetime.now() - start)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Algorithm Name: <class 'surprise.prediction_algorithms.baseline_only.BaselineOnly'> BEST RMSE: 	 0.988433993030869
Algorithm Name: <class 'surprise.prediction_algorithms.baseline_only.BaselineOnly'> BEST MAE: 	 0.7939843676080804
Algorithm Name: <class 'surprise.prediction_algorithms.baseline_only.BaselineOnly'> BEST params: 	 {}
Time taken to build BaselineOnly suprise model: 0:00:04.437555


### The data frame below shows the top 10 recommended movies using the BaselineOnly algorithm.

In [16]:
Base_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
27,Lilo and Stitch,2002.0,3.910127
17,Immortal Beloved,1994.0,3.87475
29,Something's Gotta Give,2003.0,3.816699
2,Character,1997.0,3.738633
11846,Dust to Glory,2005.0,3.705793
11851,Return to the Blue Lagoon,1991.0,3.705793
11850,The Yearling,1946.0,3.705793
11849,Dumb and Dumberer: When Harry Met Lloyd,2003.0,3.705793
11848,Earth,1998.0,3.705793
11847,For Richer or Poorer,1997.0,3.705793
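
Many rows in the table above share exactly the same score. A likely explanation, stated here as an assumption, is that those movies do not occur in the 100000-row training sample: in the baseline model the estimate is the global mean plus a user bias plus a movie bias, and Surprise's documentation says an unknown item's bias is taken as zero. The sketch below illustrates that with made-up values; `baseline_estimate` is a hypothetical helper, not the library code.

```python
def baseline_estimate(global_mean, user_bias, item_bias, user, item):
    """r_hat = mu + b_u + b_i, where an unknown user or item contributes zero bias."""
    return global_mean + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


# Made-up values for illustration.
mu = 3.6
b_u = {"alice": 0.1}
b_i = {"seen_movie": 0.3}

seen = baseline_estimate(mu, b_u, b_i, "alice", "seen_movie")
unseen_a = baseline_estimate(mu, b_u, b_i, "alice", "unseen_movie_a")
unseen_b = baseline_estimate(mu, b_u, b_i, "alice", "unseen_movie_b")
print(seen, unseen_a, unseen_b)
```

Every movie the model has never seen gets the same estimate for a given user, so such rows tie and their relative order in the top 10 carries no information.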


### Execute getSurprisedResult for the SVD algorithm to get its recommendation data frame.

In [17]:
start = datetime.now()
param_grid = {
"n_epochs": [10, 20],
"lr_all": [0.002, 0.005],
"reg_all": [0.02]
}
SVD_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=SVD,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build SVD suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVD'> BEST RMSE: 	 0.991384456503688
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVD'> BEST MAE: 	 0.7941747867698165
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVD'> BEST params: 	 {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.02}
Time taken to build SVD suprise model: 0:00:26.506996
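
A grid search tries every combination of the listed parameter values and cross-validates each one. The sketch below expands the grid used above to show how many fits that implies with cv=5; `expand_grid` is a hypothetical helper written for this illustration, not part of Surprise.

```python
from itertools import product

param_grid = {"n_epochs": [10, 20], "lr_all": [0.002, 0.005], "reg_all": [0.02]}


def expand_grid(grid):
    """Return one dict per combination of the parameter values in grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(param_grid)
n_fits = len(combos) * 5  # one fit per combination per fold with cv=5
print(len(combos), n_fits)
print(expand_grid({}))  # an empty grid yields a single combination: the defaults
```

The number of fits grows multiplicatively with each parameter list, which is worth keeping in mind for the larger grids used later in the notebook.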


### The data frame below shows the top 10 recommended movies using the SVD algorithm.

In [18]:
SVD_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
17,Immortal Beloved,1994.0,3.884352
27,Lilo and Stitch,2002.0,3.839536
29,Something's Gotta Give,2003.0,3.797985
0,Dinosaur Planet,2003.0,3.67269
11851,Return to the Blue Lagoon,1991.0,3.67269
11850,The Yearling,1946.0,3.67269
11849,Dumb and Dumberer: When Harry Met Lloyd,2003.0,3.67269
11848,Earth,1998.0,3.67269
11847,For Richer or Poorer,1997.0,3.67269
11846,Dust to Glory,2005.0,3.67269


### Execute getSurprisedResult for the SVDpp algorithm to get its recommendation data frame.

In [19]:
start = datetime.now()
param_grid = {
"n_epochs": [10, 20],
"lr_all": [0.002, 0.005],
"reg_all": [0.02]
}
SVDpp_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=SVDpp,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build SVDpp suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVDpp'> BEST RMSE: 	 0.9907300236438932
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVDpp'> BEST MAE: 	 0.7982833164737003
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.SVDpp'> BEST params: 	 {'n_epochs': 20, 'lr_all': 0.002, 'reg_all': 0.02}
Time taken to build SVDpp suprise model: 0:00:26.261997


### The data frame below shows the top 10 recommended movies using the SVDpp algorithm.

In [20]:
SVDpp_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
17,Immortal Beloved,1994.0,3.813666
27,Lilo and Stitch,2002.0,3.791544
29,Something's Gotta Give,2003.0,3.723045
0,Dinosaur Planet,2003.0,3.679149
11851,Return to the Blue Lagoon,1991.0,3.679149
11850,The Yearling,1946.0,3.679149
11849,Dumb and Dumberer: When Harry Met Lloyd,2003.0,3.679149
11848,Earth,1998.0,3.679149
11847,For Richer or Poorer,1997.0,3.679149
11846,Dust to Glory,2005.0,3.679149


### Execute getSurprisedResult for the SlopeOne algorithm to get its recommendation data frame.

In [21]:
start = datetime.now()
param_grid = {
}
Slope_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=SlopeOne,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build SlopeOne suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.slope_one.SlopeOne'> BEST RMSE: 	 1.1065073856632865
Algorithm Name: <class 'surprise.prediction_algorithms.slope_one.SlopeOne'> BEST MAE: 	 0.8837989057724268
Algorithm Name: <class 'surprise.prediction_algorithms.slope_one.SlopeOne'> BEST params: 	 {}
Time taken to build SlopeOne suprise model: 0:00:08.325999


### The data frame below shows the top 10 recommended movies using the SlopeOne algorithm.

In [22]:
Slope_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
27,Lilo and Stitch,2002.0,4.333514
29,Something's Gotta Give,2003.0,4.304397
17,Immortal Beloved,1994.0,4.297718
2,Character,1997.0,4.16789
16,7 Seconds,2005.0,3.754298
15,Screamers,1996.0,3.719256
0,Dinosaur Planet,2003.0,3.60376
11852,Privates on Parade,1982.0,3.60376
11851,Return to the Blue Lagoon,1991.0,3.60376
11850,The Yearling,1946.0,3.60376


### Execute getSurprisedResult for the CoClustering algorithm to get its recommendation data frame.

In [23]:
start = datetime.now()
param_grid = {
}

CoCluster_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=CoClustering,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build CoClustering suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.co_clustering.CoClustering'> BEST RMSE: 	 1.1338308519187397
Algorithm Name: <class 'surprise.prediction_algorithms.co_clustering.CoClustering'> BEST MAE: 	 0.9080891647735095
Algorithm Name: <class 'surprise.prediction_algorithms.co_clustering.CoClustering'> BEST params: 	 {}
Time taken to build CoClustering suprise model: 0:00:53.389018


### The data frame below shows the top 10 recommended movies using the CoClustering algorithm.

In [24]:
CoCluster_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
27,Lilo and Stitch,2002.0,4.235074
17,Immortal Beloved,1994.0,4.194078
29,Something's Gotta Give,2003.0,4.155878
2,Character,1997.0,4.047871
11846,Dust to Glory,2005.0,3.60376
11851,Return to the Blue Lagoon,1991.0,3.60376
11850,The Yearling,1946.0,3.60376
11849,Dumb and Dumberer: When Harry Met Lloyd,2003.0,3.60376
11848,Earth,1998.0,3.60376
11847,For Richer or Poorer,1997.0,3.60376


### Execute getSurprisedResult for the NMF algorithm to get its recommendation data frame.

In [25]:
start = datetime.now()
param_grid = {'n_factors': [1,2,3,4,5,6,7,8,9,10], 'n_epochs': [10, 20], 'biased': [True], 'reg_bu': [0.1], 'reg_bi': [0.1]}   
NMF_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=NMF,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build NMF suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.NMF'> BEST RMSE: 	 1.0830036445640525
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.NMF'> BEST MAE: 	 0.8875485128865248
Algorithm Name: <class 'surprise.prediction_algorithms.matrix_factorization.NMF'> BEST params: 	 {'n_factors': 1, 'n_epochs': 10, 'biased': True, 'reg_bu': 0.1, 'reg_bi': 0.1}
Time taken to build NMF suprise model: 0:04:33.496014


### The data frame below shows the top 10 recommended movies using the NMF algorithm.

In [27]:
NMF_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
17,Immortal Beloved,1994.0,4.118542
27,Lilo and Stitch,2002.0,4.109947
29,Something's Gotta Give,2003.0,3.936677
2,Character,1997.0,3.856555
11846,Dust to Glory,2005.0,3.673799
11851,Return to the Blue Lagoon,1991.0,3.673799
11850,The Yearling,1946.0,3.673799
11849,Dumb and Dumberer: When Harry Met Lloyd,2003.0,3.673799
11848,Earth,1998.0,3.673799
11847,For Richer or Poorer,1997.0,3.673799


### Execute getSurprisedResult for the NormalPredictor algorithm to get its recommendation data frame.

In [28]:
start = datetime.now()
param_grid = {
}
NormalPred_Surprised_movies_df = getSurprisedResult(model_data=model_data,Algorithm=NormalPredictor,param_grid = param_grid  , MovieData=movie_title_df, User_id =712664 )
print('Time taken to build NormalPredictor suprise model:', datetime.now() - start)

Algorithm Name: <class 'surprise.prediction_algorithms.random_pred.NormalPredictor'> BEST RMSE: 	 1.413681687401473
Algorithm Name: <class 'surprise.prediction_algorithms.random_pred.NormalPredictor'> BEST MAE: 	 1.1306476425007592
Algorithm Name: <class 'surprise.prediction_algorithms.random_pred.NormalPredictor'> BEST params: 	 {}
Time taken to build NormalPredictor suprise model: 0:00:03.394691


### The data frame below shows the top 10 recommended movies using the NormalPredictor algorithm.

In [29]:
NormalPred_Surprised_movies_df.head(10)

Unnamed: 0,Name,Year,Estimate_Score
6709,The Big Heat,1953.0,5.0
7830,Critters 2: The Main Course,1988.0,5.0
1548,Deja Vu,1997.0,5.0
1546,The Women,1939.0,5.0
1544,The Dimension Travelers,1998.0,5.0
12743,Who's That Knocking at My Door?,1969.0,5.0
1542,Nick of Time,1995.0,5.0
12744,River's Edge,1986.0,5.0
16933,The Wiggles: LIVE Hot Potatoes,2004.0,5.0
7838,Kiss of Fire,1998.0,5.0
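
All ten scores above are exactly 5.0. NormalPredictor draws each estimate at random from a normal distribution fitted to the training ratings, and Surprise clips predictions to the rating scale by default, so across thousands of movies many draws land on the upper bound. The sketch below imitates that behaviour; the mean, standard deviation, and number of movies are assumptions chosen for illustration, not values taken from the notebook's data.

```python
import random


def normal_predictor_estimate(rng, mean, std, low=1.0, high=5.0):
    """Draw a random rating from N(mean, std^2) and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mean, std)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumed mean 3.6 and standard deviation 1.1; 10000 stands in for the movie list.
estimates = [normal_predictor_estimate(rng, 3.6, 1.1) for _ in range(10000)]
top10 = sorted(estimates, reverse=True)[:10]
print(top10)
```

Because the draws do not depend on the user or the movie, a top 10 produced this way reflects chance rather than preference.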


# 3. Find which genre of movies received the best and worst ratings based on user rating

### The steps below show the details of the worst-rated movie.

In [30]:
worst_rated_movie = netflix_df[netflix_df["Rating"]==1]
worst_rated_movie = worst_rated_movie.groupby(["Movie_id","Rating"])["Rating"].sum()


In [31]:
worst_rated_movie = worst_rated_movie.to_frame(name ="Total_Rating").reset_index()

In [32]:
worst_rated_movie = worst_rated_movie[worst_rated_movie["Total_Rating"]==worst_rated_movie["Total_Rating"].max()]

In [33]:
# Attach the title and year of the movie with the most 1-star ratings.
pd.merge(worst_rated_movie, movie_title_df, on="Movie_id")

Unnamed: 0,Movie_id,Rating,Total_Rating,Year,Name
0,3151,1,7807,2004.0,Napoleon Dynamite


### The steps below show the details of the best-rated movie.

In [34]:
best_rated_movie = netflix_df[netflix_df["Rating"]==5]
best_rated_movie = best_rated_movie.groupby(["Movie_id","Rating"])["Rating"].sum()

In [35]:
best_rated_movie = best_rated_movie.to_frame(name ="Total_Rating").reset_index()

In [36]:
best_rated_movie = best_rated_movie[best_rated_movie["Total_Rating"]==best_rated_movie["Total_Rating"].max()]
pd.merge(best_rated_movie, movie_title_df, on="Movie_id")

Unnamed: 0,Movie_id,Rating,Total_Rating,Year,Name
0,2452,5,322510,2001.0,Lord of the Rings: The Fellowship of the Ring
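
Both lookups above sum the Rating column within a single star level, so Total_Rating equals the star value times the number of such ratings (for example, 322510 / 5 = 64502 five-star ratings). The same ranking can be sketched by counting directly; the data and the helper name `movie_with_most` are made up for illustration.

```python
from collections import Counter

# Toy (movie_id, rating) pairs; values are made up.
ratings = [(1, 5), (1, 5), (1, 1), (2, 1), (2, 1), (2, 1), (3, 5)]

# Number of times each (movie_id, rating) combination occurs.
star_counts = Counter(ratings)


def movie_with_most(stars):
    """Return the movie id that received the most ratings of the given star level."""
    per_movie = {m: c for (m, r), c in star_counts.items() if r == stars}
    return max(per_movie, key=per_movie.get)


most_five_star = movie_with_most(5)
most_one_star = movie_with_most(1)
print(most_five_star, most_one_star)
```

Counting within one star level favours heavily reviewed movies, so a title can lead on 1-star counts simply because it has many ratings overall.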


# Conclusion


    I tested multiple algorithms:
        BaselineOnly	
        SVD				
        SVDpp			
        SlopeOne		
        CoClustering	
        NMF	1.149467	
        NormalPredictor	

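    In conclusion, NormalPredictor returned the highest estimated scores in its top 10 recommendations (all 5.0), although it also had the highest test RMSE of the algorithms tested (about 1.41). BaselineOnly had the lowest test RMSE (about 0.99).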

# References
Intellipaat Timeseries Class on 27th July 2023

https://colab.research.google.com/github/singhsidhukuldeep/Recommendation-System/blob/master/Building_Recommender_System_with_Surprise.ipynb#scrollTo=j8BidQh53A8g
https://stackoverflow.com/questions/45376505/python-3-reading-a-file-within-zipped-archive-places-b-character-at-start-of
https://stackoverflow.com/questions/11482342/read-a-large-zipped-text-file-line-by-line-in-python
https://www.kaggle.com/code/sunyuanxi/surprise