# <font color = 'indianred'> **Movie Recommendation system using Pandas**

* The goal is to build a simple movie recommendation system using only pandas dataframes.
* Real-world recommendation systems rely on more complex algorithms.
* Content-based filtering and collaborative filtering are the approaches most commonly used to build such systems.


#  <font color = 'indianred'>**Importing Libraries**

In [None]:
# Import the required packages
import numpy as np
import pandas as pd
from pathlib import Path
import zipfile

#  <font color = 'indianred'>Mount Google Drive
We will mount Google Drive and specify the path where the dataset will be downloaded.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:


base_path = '/content/drive/MyDrive/AML'

In [None]:
# create a POSIX path for data folder
# we can use this to navigate file system
base_folder = Path(base_path)

In [None]:
archive_folder = base_folder/'archive'
data_folder = base_folder/'data'

#  <font color = 'indianred'>**Data set**

We will download the MovieLens dataset from the following URL: https://grouplens.org/datasets/movielens/latest/

Summary about the data files:

* This dataset describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

* Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

* The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`.

* For this task, I will focus on only two files: `ratings.csv` and `movies.csv`.

## Use the wget command to get data from the URL
Syntax: `!wget {url} -P {path_to_save_file}`

In [None]:
url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
!wget {url} -P {archive_folder}

--2023-09-09 02:02:40--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘/content/drive/MyDrive/AML/archive/ml-latest-small.zip.3’


2023-09-09 02:02:40 (4.58 MB/s) - ‘/content/drive/MyDrive/AML/archive/ml-latest-small.zip.3’ saved [978202/978202]
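
The log above saves the file as `ml-latest-small.zip.3`, which suggests that earlier copies already existed in the archive folder and that each rerun of the cell adds another one. A minimal sketch of a download step that skips the transfer when the file is already present is shown below. The helper name `download_if_missing` is hypothetical (it is not part of the original notebook), and it uses only the Python standard library instead of `wget`.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(url, dest_folder):
    """Download url into dest_folder unless a file with that name already exists.

    Hypothetical helper for illustration only.
    """
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    # Use the last component of the URL path as the local file name
    target = dest_folder / Path(urlparse(url).path).name
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Under these assumptions, `zipped_file = download_if_missing(url, archive_folder)` would return the same path on every run without creating numbered duplicates.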



In [None]:
import zipfile
# this is a zipped folder

# let us first look at the contents of the zipped file
# without unzipping it
zipped_file = archive_folder/'ml-latest-small.zip'
with zipfile.ZipFile(zipped_file, 'r') as f:
  print(f.namelist())

['ml-latest-small/', 'ml-latest-small/links.csv', 'ml-latest-small/tags.csv', 'ml-latest-small/ratings.csv', 'ml-latest-small/README.txt', 'ml-latest-small/movies.csv']


In [None]:
with zipfile.ZipFile(zipped_file, 'r') as f:
  f.extractall(path=data_folder)
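
Since only `movies.csv` and `ratings.csv` are used in this notebook, it is also possible to extract just those members. The sketch below builds a tiny stand-in archive in a temporary folder so that it runs without the real download; the file contents and the `wanted` set are assumptions made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a tiny stand-in archive so the sketch runs without the real download
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / 'demo.zip'
with zipfile.ZipFile(demo_zip, 'w') as zf:
    zf.writestr('ml-latest-small/movies.csv', 'movieId,title,genres\n')
    zf.writestr('ml-latest-small/ratings.csv', 'userId,movieId,rating,timestamp\n')
    zf.writestr('ml-latest-small/tags.csv', 'userId,movieId,tag,timestamp\n')

# Keep only the archive members whose file name is one we need
wanted = {'movies.csv', 'ratings.csv'}
with zipfile.ZipFile(demo_zip, 'r') as zf:
    members = [name for name in zf.namelist() if Path(name).name in wanted]
    zf.extractall(path=work_dir / 'data', members=members)

extracted = sorted(p.name for p in (work_dir / 'data' / 'ml-latest-small').iterdir())
print(extracted)
```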

#  <font color = 'indianred'>Task1 : Create data frames using (1) movies.csv file and (2) ratings.csv file

In [None]:
# Our files are in the folder ml-latest-small.
# We can construct a path to a file by joining the parts with the / operator.
# The / operator can join several paths, or a mix of paths and strings, as long as at least one operand is a Path object.

path_ratings = data_folder / 'ml-latest-small' / 'ratings.csv'
path_movies = data_folder / 'ml-latest-small' / 'movies.csv'
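
A small standalone sketch of the `/` operator is given below. It uses `PurePosixPath` rather than `Path` (an assumption made only so that the printed result is the same on every operating system) and also shows that two plain strings cannot be joined this way.

```python
from pathlib import PurePosixPath

# PurePosixPath keeps the output identical on every operating system
base = PurePosixPath('/content/drive/MyDrive/AML')
ratings_path = base / 'data' / 'ml-latest-small' / 'ratings.csv'

print(ratings_path)              # /content/drive/MyDrive/AML/data/ml-latest-small/ratings.csv
print(ratings_path.name)         # ratings.csv
print(ratings_path.parent.name)  # ml-latest-small

# Two plain strings cannot be joined with /; at least one operand must be a path object
try:
    'data' / 'ratings.csv'
except TypeError as err:
    print('TypeError:', err)
```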

In [None]:
# create pandas dataframe using ratings.csv file
user_movie_ratings = pd.read_csv(path_ratings)
user_movie_ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [None]:
# create pandas dataframe using movies.csv file
movie_info = pd.read_csv(path_movies)
movie_info

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


#  <font color = 'indianred'> Task2 : Identify top movies based on average rating and number of ratings received



## Step 1: Get movie review stats in a new dataframe
Use groupby on user_movie_ratings to get the count and mean of ratings for each movie, and store this information in a
new dataframe: movie_rating_stats

In [None]:
# 'rating' holds the mean rating per movie; 'movieId' holds the number of ratings per movie
movie_rating_stats = user_movie_ratings.groupby('movieId').agg({'rating':'mean','movieId':'count'})
movie_rating_stats


Unnamed: 0_level_0,rating,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.920930,215
2,3.431818,110
3,3.259615,52
4,2.357143,7
5,3.071429,49
...,...,...
193581,4.000000,1
193583,3.500000,1
193585,3.500000,1
193587,3.500000,1
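
The output above ends up with a column called `movieId` that actually holds a count, which is why two renames are needed later. As an alternative sketch (not the approach used in this notebook), pandas named aggregation can produce descriptively named columns in one step. The toy ratings below are invented for illustration.

```python
import pandas as pd

# Invented toy ratings: three users, three movies
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# Named aggregation: each keyword becomes an output column
toy_stats = (
    toy_ratings.groupby('movieId')['rating']
    .agg(avg_ratings='mean', num_ratings='count')
)
print(toy_stats)
```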


##  <font color = 'indianred'> Step 2 : Merge new dataframe with movie_info dataframe


In [None]:
# Rename the 'movieId' column in movie_rating_stats to 'movieId_stats'
movie_rating_stats.rename(columns={'movieId': 'movieId_stats'}, inplace=True)


movie_info = pd.merge(movie_info, movie_rating_stats, on='movieId', how='left')
movie_info

Unnamed: 0,movieId,title,genres,rating,movieId_stats
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.920930,215.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.000000,1.0
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.500000,1.0
9739,193585,Flint (2017),Drama,3.500000,1.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.500000,1.0
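
Note that the count column is displayed as a float (for example `215.0`). With a left merge, every row of `movie_info` is kept, and a movie without any ratings receives NaN, which forces the column to a floating-point type. The sketch below reproduces this on invented toy data and uses `indicator=True` to mark the rows that found no match.

```python
import pandas as pd

# Invented toy data: movie 30 has no rating statistics
toy_movies = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['A', 'B', 'C']})
toy_stats = pd.DataFrame(
    {'avg_ratings': [4.0, 2.5], 'num_ratings': [3, 2]},
    index=pd.Index([10, 20], name='movieId'),
)

# how='left' keeps all movies; indicator=True adds a '_merge' column
merged = pd.merge(toy_movies, toy_stats, on='movieId', how='left', indicator=True)
print(merged)
```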


##  <font color = 'indianred'> Step 3: Rename new columns in movie_info
The count column (movieId_stats) should be renamed to num_ratings and the mean column (rating) should be renamed to avg_ratings.

In [None]:
movie_info.rename(columns={'movieId_stats': 'num_ratings','rating':'avg_ratings'}, inplace=True)
movie_info

Unnamed: 0,movieId,title,genres,avg_ratings,num_ratings
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.920930,215.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.000000,1.0
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.500000,1.0
9739,193585,Flint (2017),Drama,3.500000,1.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.500000,1.0


##  <font color = 'indianred'> Step 4: check if any column in movie_info has null values

In [None]:
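# Count null values per column (only the last expression in a cell is displayed,
# so this first count is stored but not shown)
null_count = movie_info.isnull().sum()
# Movies without any ratings have NaN stats after the left merge; drop those rows
movie_info = movie_info.dropna()
# Confirm that no null values remain
null_count2 = movie_info.isnull().sum()
null_count2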

movieId        0
title          0
genres         0
avg_ratings    0
num_ratings    0
dtype: int64
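
The output above shows only the count taken after `dropna()`. A minimal sketch that reports the null counts both before and after cleaning, and how many rows were dropped, is given below. The toy dataframe is invented; the column names match those used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented toy data: movie 30 has no rating statistics
toy_info = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['A', 'B', 'C'],
    'avg_ratings': [4.0, 2.5, np.nan],
    'num_ratings': [3.0, 2.0, np.nan],
})

nulls_before = toy_info.isnull().sum()
# Drop only rows where the rating statistics are missing
cleaned = toy_info.dropna(subset=['avg_ratings', 'num_ratings'])
nulls_after = cleaned.isnull().sum()

print(nulls_before)
print(nulls_after)
print(len(toy_info) - len(cleaned), 'row(s) dropped')
```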

##  <font color = 'indianred'> Step 5: Display top 10 movies based on mean ratings.


In [None]:
# sort by average rating, highest first
movie_info.sort_values('avg_ratings', ascending=False).head(10)

Unnamed: 0,movieId,title,genres,avg_ratings,num_ratings
7656,88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,5.0,1.0
8107,100556,"Act of Killing, The (2012)",Documentary,5.0,1.0
9083,143031,Jump In! (2007),Comedy|Drama|Romance,5.0,1.0
9094,143511,Human (2015),Documentary,5.0,1.0
9096,143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,5.0,1.0
4251,6201,Lady Jane (1986),Drama|Romance,5.0,1.0
8154,102217,Bill Hicks: Revelations (1993),Comedy,5.0,1.0
8148,102084,Justice League: Doom (2012),Action|Animation|Fantasy,5.0,1.0
4246,6192,Open Hearts (Elsker dig for evigt) (2002),Romance,5.0,1.0
9122,145994,Formula of Love (1984),Comedy,5.0,1.0


This does not give us a good set of top movies. The movies at the top of this list have received only one rating each, so we cannot recommend them with any confidence. Let us impose the condition that a movie must have more than 100 ratings, and then sort by mean rating.

##  <font color = 'indianred'> Step 6: Display top 10 movies based on mean ratings with additional constraint.

Constraint: The movies should have more than 100 ratings, i.e. num_ratings > 100.
Hint: Select only those movies that have more than 100 ratings and then sort by avg_ratings in descending order.

In [None]:

movie_info[movie_info['num_ratings']>100].sort_values('avg_ratings', ascending=False).head(10)

Unnamed: 0,movieId,title,genres,avg_ratings,num_ratings
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,4.429022,317.0
659,858,"Godfather, The (1972)",Crime|Drama,4.289062,192.0
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller,4.272936,218.0
922,1221,"Godfather: Part II, The (1974)",Crime|Drama,4.25969,129.0
6315,48516,"Departed, The (2006)",Crime|Drama|Thriller,4.252336,107.0
914,1213,Goodfellas (1990),Crime|Drama,4.25,126.0
6710,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,4.238255,149.0
46,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.237745,204.0
899,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Fantasy|Romance,4.232394,142.0
224,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,4.231076,251.0
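
The filter-then-sort pattern used above can be wrapped in a small function so that the threshold is easy to vary. The sketch below is one possible way to do this; the function name `top_movies`, its parameters, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def top_movies(info, min_ratings=100, n=10):
    """Return the n best-rated movies that have more than min_ratings ratings."""
    popular = info[info['num_ratings'] > min_ratings]
    return popular.sort_values('avg_ratings', ascending=False).head(n)


# Invented toy data: movie A has a perfect score but only one rating
toy_info = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'avg_ratings': [5.0, 4.4, 4.1, 3.0],
    'num_ratings': [1, 317, 150, 200],
})
print(top_movies(toy_info, min_ratings=100, n=2))
```

With `min_ratings=0` the single-rating movie A would come first again, which mirrors the problem seen in Step 5.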


* We have fetched the top movies to recommend based on their average rating, restricted to movies with more than 100 ratings.
* But these recommendations are based only on average ratings.
* However, users who have watched certain movies may like other, similar movies.
* So, we'll try to find relationships between movies and recommend the movies that are most strongly related to those a user has already watched.
* To do so, we'll calculate the correlation of each movie with the other movies.
* Correlation tells us about the direction and the degree (strength) of the relationship between two variables. A high correlation value indicates that the variables are strongly related to each other.
* To find the correlation between movies, we will first create a pivot table. In this pivot table, each column is a movie (since we want to find correlations between movies) and each row is a user. The values are the ratings given by users to movies.


#  <font color = 'indianred'> Task3: Find top ten similar movies to a given movie

##  <font color = 'indianred'> Step 1: Create a Pivot Table </font>
1. Create a matrix that has the user ids on one axis (rows) and the movie ids on another axis (columns).
2. Each cell will then consist of the rating the user gave to that movie.

(Note there will be a lot of NaN values, because users have not rated all the movies)

In [None]:
movie_pivot =  user_movie_ratings.pivot_table(index='userId',columns='movieId',values='rating')

movie_pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [None]:
movie_pivot.shape

(610, 9724)
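
Going by the figures quoted earlier (100836 ratings spread over 610 × 9724 cells), only about 1.7% of the cells in this pivot table hold a rating. The sketch below shows the same reshaping on invented toy data and measures the fraction of empty cells. Note that `pivot_table` averages duplicate (user, movie) pairs by default, which does not matter here if each user rates a movie at most once.

```python
import pandas as pd

# Invented toy ratings in long format: one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Wide format: one row per user, one column per movie, NaN where no rating exists
toy_pivot = toy_ratings.pivot_table(index='userId', columns='movieId', values='rating')
print(toy_pivot)

# Fraction of cells that are empty (no rating)
sparsity = toy_pivot.isna().to_numpy().mean()
print(f'{sparsity:.0%} of the cells are NaN')
```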

##  <font color = 'indianred'> Step 2: Get movie id of a particular movie
Get the movie_id of movie 'Shawshank Redemption, The (1994)' from movie_info table

In [None]:
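# look up the movieId of the focal movie by its exact title
movie_id = movie_info.loc[movie_info['title']=='Shawshank Redemption, The (1994)','movieId']
# .item() returns the single matching value as a scalar (it raises if there is not exactly one match)
movie_id = movie_id.item()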
movie_id

318
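
The lookup above depends on an exact title match, and `.item()` raises a `ValueError` when the title matches no row or more than one row. A small sketch of a lookup with a clearer error message follows; the helper name `find_movie_id` is hypothetical and the two-row dataframe stands in for `movie_info`.

```python
import pandas as pd


def find_movie_id(info, title):
    """Return the movieId for an exact title, or raise if the match is not unique."""
    matches = info.loc[info['title'] == title, 'movieId']
    if len(matches) != 1:
        raise ValueError(f'expected exactly one match for {title!r}, found {len(matches)}')
    return int(matches.iloc[0])


# Stand-in for movie_info with two of the titles shown earlier
toy_info = pd.DataFrame({
    'movieId': [318, 858],
    'title': ['Shawshank Redemption, The (1994)', 'Godfather, The (1972)'],
})
print(find_movie_id(toy_info, 'Shawshank Redemption, The (1994)'))
```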

##  <font color = 'indianred'> Step 3: Get the column from pivot table  corresponding to the focal movie


In [None]:
movie_ratings = movie_pivot.loc[:, movie_id]
movie_ratings

userId
1      NaN
2      3.0
3      NaN
4      NaN
5      3.0
      ... 
606    3.5
607    5.0
608    4.5
609    4.0
610    3.0
Name: 318, Length: 610, dtype: float64

##  <font color = 'indianred'> Step 4: Get correlation of the selected movie with all movies
Correlation tells us about the direction and the degree (strength) of the relationship between two variables. A high correlation value indicates that the variables are strongly related to each other.

In our case, the correlation between two movies will be higher if they have received similar ratings from multiple users.
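
To make this concrete, the sketch below computes a Pearson correlation by hand for two invented rating columns, using only the users who rated both movies. This is meant as an illustration of the idea; as far as I understand, `corrwith` likewise ignores users with a missing rating in either column, but the code below is not the pandas implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


# Invented ratings from five users; None marks "user did not rate this movie"
focal = [5.0, 4.0, None, 3.0, 1.0]
other = [4.5, 4.0, 2.0, None, 1.5]

# Keep only the users who rated both movies
pairs = [(a, b) for a, b in zip(focal, other) if a is not None and b is not None]
xs, ys = zip(*pairs)
r = pearson(xs, ys)
print(len(pairs), 'users rated both movies; correlation =', round(r, 3))
```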

In [None]:
movie_corr = movie_pivot.corrwith(movie_ratings)
movie_corr



movieId
1         0.174984
2         0.097461
3         0.466380
4         0.644380
5         0.138314
            ...   
193581         NaN
193583         NaN
193585         NaN
193587         NaN
193609         NaN
Length: 9724, dtype: float64

##  <font color = 'indianred'> Step 5: Create a dataframe of correlations and clean the dataframe
Create a dataframe from the movie correlations computed in the previous step. Remove null values from the dataframe.

In [None]:
# create a DataFrame holding the correlation values
corDF = pd.DataFrame({ 'movieId' : movie_corr.index, 'Correlation' : movie_corr.values})
# drop the NA values in place (movies for which no correlation could be computed)
corDF.dropna(inplace = True)
# display the top 5 rows using the head() method
corDF.head(5)

Unnamed: 0,movieId,Correlation
0,1,0.174984
1,2,0.097461
2,3,0.46638
3,4,0.64438
4,5,0.138314


In [None]:
## Merge the correlation data with the movie_info data using the movieId column
movie_corr_with_title = pd.merge(corDF,movie_info,on='movieId',how='left')

In [None]:
movie_corr_with_title

Unnamed: 0,movieId,Correlation,title,genres,avg_ratings,num_ratings
0,1,0.174984,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.920930,215.0
1,2,0.097461,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0
2,3,0.466380,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
3,4,0.644380,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0
4,5,0.138314,Father of the Bride Part II (1995),Comedy,3.071429,49.0
...,...,...,...,...,...,...
4780,185029,-0.991241,A Quiet Place (2018),Drama|Horror|Thriller,2.750000,4.0
4781,185135,-1.000000,Sherlock - A Study in Pink (2010),Crime,4.750000,2.0
4782,187593,-0.004544,Deadpool 2 (2018),Action|Comedy|Sci-Fi,3.875000,12.0
4783,187595,0.207514,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,3.900000,5.0


##  <font color = 'indianred'> Step 6: Sort the above dataframe in descending order
Sort the above data frame in descending order based on values in the correlation column. Display top ten results.

In [None]:

movie_corr_with_title.sort_values(by = 'Correlation', ascending= False).head(10)

Unnamed: 0,movieId,Correlation,title,genres,avg_ratings,num_ratings
3760,55080,1.0,"Brave One, The (2007)",Crime|Drama|Thriller,3.25,2.0
3155,8656,1.0,"Short Film About Killing, A (Krótki film o zab...",Crime|Drama,3.75,2.0
4145,80166,1.0,"Switch, The (2010)",Comedy|Romance,2.666667,6.0
2335,4833,1.0,"Changeling, The (1980)",Horror|Mystery|Thriller,3.333333,3.0
3178,8835,1.0,Paparazzi (2004),Drama|Thriller,2.333333,3.0
4153,80846,1.0,Devil (2010),Horror|Mystery|Thriller,3.25,2.0
2672,6013,1.0,Kangaroo Jack (2003),Action|Comedy,1.75,2.0
2336,4835,1.0,Coal Miner's Daughter (1980),Drama,3.5,3.0
4169,81819,1.0,Biutiful (2010),Drama,4.25,2.0
2675,6022,1.0,American Me (1992),Drama,2.5,3.0


The top ten movies do not seem to be related to the focal movie "Shawshank Redemption, The (1994)". The focal movie has its highest correlations with movies that have very few ratings, and correlations based on so few ratings are not reliable. For example, suppose a movie Z and the focal movie "Shawshank Redemption, The (1994)" have been rated by the same two users only. A Pearson correlation computed from just two pairs of ratings is always exactly +1 or -1 (or undefined if either movie received the same rating from both users), regardless of how similar the movies really are. Such a correlation reflects the preferences of only two users and hence is not reliable. To overcome this, we will use only those movies that have at least a minimum number of ratings.
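
The effect can be reproduced on invented toy data. In the sketch below, the column `rare` shares only two raters with the focal column and still gets a perfect correlation. The sketch also counts, for each movie, how many users rated both it and the focal movie. Filtering on that overlap count is a variation on the approach used in this notebook (which filters on the total number of ratings), and the threshold of 3 is arbitrary.

```python
import numpy as np
import pandas as pd

# Invented ratings: rows are users, columns are movies
toy_pivot = pd.DataFrame(
    {
        'focal':  [5.0, 4.0, 3.0, 1.0, 4.5],
        'rare':   [2.0, np.nan, np.nan, 1.0, np.nan],  # shares 2 users with focal
        'common': [4.5, 4.0, 2.5, 1.5, 5.0],           # shares 5 users with focal
    },
    index=pd.Index([1, 2, 3, 4, 5], name='userId'),
)

focal = toy_pivot['focal']
corr = toy_pivot.corrwith(focal)

# For each movie, count the users who rated both it and the focal movie
overlap = toy_pivot[focal.notna()].notna().sum()

summary = pd.DataFrame({'Correlation': corr, 'n_common_users': overlap})
print(summary)

# Keep only correlations supported by at least 3 common users (arbitrary threshold)
reliable = summary[summary['n_common_users'] >= 3]
print(reliable)
```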

In [None]:
# select only those movies from the movie_corr_with_title dataframe that have more than 100 ratings
movie_corr_with_title =  movie_corr_with_title[movie_corr_with_title['num_ratings']> 100]

In [None]:
# Sort the above dataframe by the Correlation column and get the top ten rows using the head() method
# the output is stored in a new dataframe: top_ten_recommendations
top_ten_recommendations = movie_corr_with_title.sort_values(by = 'Correlation', ascending=False).head(10)

In [None]:
top_ten_recommendations

Unnamed: 0,movieId,Correlation,title,genres,avg_ratings,num_ratings
206,318,1.0,"Shawshank Redemption, The (1994)",Crime|Drama,4.429022,317.0
234,357,0.446212,Four Weddings and a Funeral (1994),Comedy|Romance,3.519417,103.0
341,527,0.402202,Schindler's List (1993),Drama|War,4.225,220.0
42,50,0.394294,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.237745,204.0
2382,4963,0.391546,Ocean's Eleven (2001),Crime|Thriller,3.844538,119.0
1649,3147,0.382818,"Green Mile, The (1999)",Crime|Drama,4.148649,111.0
4128,79132,0.377839,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX,4.066434,143.0
2660,5989,0.356612,Catch Me If You Can (2002),Crime|Drama,3.921739,115.0
643,1193,0.354215,One Flew Over the Cuckoo's Nest (1975),Drama,4.203008,133.0
669,1221,0.349872,"Godfather: Part II, The (1974)",Crime|Drama,4.25969,129.0
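
The first row of this result is the focal movie itself, which trivially has a correlation of 1.0 with its own ratings, so the list contains only nine actual recommendations. One possible refinement, sketched below on a few rows copied from the table above, is to drop the focal movie before taking the top rows. The variable names `toy_corr` and `focal_id` are introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the table above
toy_corr = pd.DataFrame({
    'movieId': [318, 357, 527, 50],
    'Correlation': [1.0, 0.446212, 0.402202, 0.394294],
    'title': [
        'Shawshank Redemption, The (1994)',
        'Four Weddings and a Funeral (1994)',
        "Schindler's List (1993)",
        'Usual Suspects, The (1995)',
    ],
})

focal_id = 318
# Exclude the focal movie, then sort and keep the top rows
recommendations = (
    toy_corr[toy_corr['movieId'] != focal_id]
    .sort_values('Correlation', ascending=False)
    .head(3)
)
print(recommendations)
```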
