# Collaborative Movie Recommender System

## Importing the necessary libraries

In [32]:
# Python library used for working with arrays
import numpy as np
# Library that provides many functions and methods to expedite the data analysis process.
import pandas as pd
from math import sqrt

## Reading datasets

> **movies.csv** : movie titles with the release year and genres attached <br><br>
> **ratings_sample.csv** : user ratings for a movie, each recorded with a timestamp <br>

We use the pandas **read_csv** function to load both files into DataFrames.

In [13]:
movies  = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings_sample.csv")

The **head()** function, which takes **n** (default = 5) as an argument, shows the first n rows of a dataset. You can use **tail()** to display the last n rows instead.

In [14]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496
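
As a quick illustration of passing an explicit **n** to **head()** and **tail()**, here is a minimal sketch on a made-up frame (the `demo` DataFrame and its values are hypothetical and are not part of the datasets):

```python
import pandas as pd

# Hypothetical toy frame standing in for a larger dataset
demo = pd.DataFrame({
    "movieId": range(1, 9),
    "title": [f"Movie {i}" for i in range(1, 9)],
})

first_three = demo.head(3)    # first 3 rows
last_two = demo.tail(2)       # last 2 rows
default_head = demo.head()    # n defaults to 5

print(first_three)
print(last_two)
```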


## Optimizing Datasets

We need to remove the **year** from the movie title and move it to its own column. This can be done in several ways, but a concise one is **regex** (regular expressions).<br>
pandas supports regex in its **extract** and **replace** string functions. In our case we use **\d**, which matches a single digit, to select the year digits from the title column.<br>

#### pandas.Series.str.extract(pat, expand=True)
> **pat** : Regular expression pattern with capturing groups.<br>
> **expand** : If True, return DataFrame with one column per capture group. If False, return a Series/Index if there is one capture group or DataFrame if there are multiple capture groups.<br><br>
#### Series.str.replace(pat, repl, case=None, regex=None)
> **pat** : String can be a character sequence or regular expression.<br>
> **repl** : Replacement string.<br>
> **regex** : Determines if the passed-in pattern is a regular expression<br>
> **case** : Determines if replace is case sensitive:
> * If True, case sensitive (the default if pat is a string)
> * Set to False for case insensitive
> * Cannot be set if pat is a compiled regex
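
To see how the two functions behave before applying them to the real data, here is a small sketch on a hypothetical Series of titles (the `titles` values are invented for illustration). It uses `\d{4}` as a shorthand for four consecutive digits:

```python
import pandas as pd

# Hypothetical titles; the last one has no year on purpose
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "No Year Here"])

# Capture four digits wrapped in parentheses; expand=False returns a Series
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year, then trim the leftover whitespace
clean = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean.tolist())
```

Titles without a match get a missing value from **extract** and are left unchanged by **replace**.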

In [16]:
# Extracting the year, with its parentheses, from the title column
# (raw strings keep Python from treating \( and \d as string escapes)
movies['year'] = movies.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# Extracting the year without the parentheses
movies['year'] = movies.year.str.extract(r'(\d\d\d\d)', expand=False)

# Removing the years from the title column
movies['title'] = movies.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


Apply the **strip()** function to remove any whitespace from the left and right ends of each title:

In [17]:
movies['title'] = movies['title'].str.strip()

We won't need the genres column in a collaborative recommendation system, so let's remove it to save some memory.
> **DataFrame.drop(labels=None, axis=0)**
> * **labels** : Index or column labels to drop.
> * **axis** : Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
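
The effect of **axis** can be checked on a tiny frame. This is only a sketch; the `demo` DataFrame below is hypothetical:

```python
import pandas as pd

# Hypothetical two-row frame
demo = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story", "Jumanji"],
    "genres": ["Comedy", "Fantasy"],
})

no_genres = demo.drop("genres", axis=1)   # axis=1 drops a column by label
no_first_row = demo.drop(0, axis=0)       # axis=0 drops a row by index label

print(no_genres.columns.tolist())
print(no_first_row.index.tolist())
```

Note that **drop** returns a new DataFrame and leaves `demo` unchanged, which is why the notebook assigns the result back to `movies`.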

In [18]:
movies = movies.drop('genres', axis=1)
# here's the edited movies dataframe
movies.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


Now let's modify the ratings DataFrame.

We won't use the **timestamp** column, so let's remove it to save memory.
> DataFrame.drop(labels=None, axis=0)<br>
> * **labels** : Index or column labels to drop.<br>
> * **axis** : Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

In [19]:
ratings = ratings.drop('timestamp', axis=1)

## Collaborative Filtering

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

<img src="https://www.researchgate.net/publication/332293983/figure/fig1/AS:745757950894081@1554813959500/General-recommendation-process-of-collaborative-filtering-algorithm.png" width=800px>

Let's begin with a sample user input and suggest movies based on it:

In [20]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


We have to add the movie ID for each user input. This can be done by filtering the original movies dataset.

In [21]:
# selecting the movies whose titles appear in the user's input
inputId = movies[movies["title"].isin(inputMovies["title"].tolist())]
# Merging it with the user inputs
inputMovies = pd.merge(inputId, inputMovies)
# Removing unnecessary columns
inputMovies = inputMovies.drop('year', axis=1)
# Final result
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
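
The filter-then-merge step can be tried in isolation. In this sketch the `catalog` and `user_input` frames are hypothetical, and the merge key is passed explicitly with `on="title"` (the cell above relies on pandas picking the shared `title` column automatically):

```python
import pandas as pd

# Hypothetical catalog and user input
catalog = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
})
user_input = pd.DataFrame({
    "title": ["Jumanji", "Toy Story"],
    "rating": [2.0, 3.5],
})

# Keep only the catalog rows whose title appears in the input
matched = catalog[catalog["title"].isin(user_input["title"])]

# Attach the user's rating to each matched movie
with_ids = pd.merge(matched, user_input, on="title")
print(with_ids)
```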


Now let's find the users who have rated the same movies, by filtering the original ratings dataset on the movie IDs we just added to the input data.

In [26]:
userSubset = ratings[ratings['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


> **pandas.DataFrame.groupby**<br>
> A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Grouping by user is useful here because it lets us sort the groups by size, so that users with the most movies in common with the input user get priority.

In [28]:
# grouping by a single column name (not a one-element list) keeps the group keys as plain user ids
userSubsetGroup = userSubset.groupby('userId')
userSubsetGroup.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0
...,...,...,...
3899049,42118,296,3.0
3899165,42121,1968,4.0
3899433,42127,296,4.5
3899764,42128,2,3.0


For example, let's look at one of the subgroups:

In [29]:
userSubsetGroup.get_group(448)

Unnamed: 0,userId,movieId,rating
40546,448,1,1.0
40547,448,2,5.0


Sorting the subgroups:

In [30]:
# sorting the groups in descending order of size (number of movies shared with the input)
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
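
The grouping and sorting steps above can be reproduced on a few rows. This is a minimal sketch with a hypothetical `toy` frame; iterating over a groupby object yields `(key, sub-DataFrame)` pairs, which is what the `lambda` in the sort relies on:

```python
import pandas as pd

# Hypothetical ratings: user 13 has rated three of the input movies
toy = pd.DataFrame({
    "userId": [4, 12, 13, 13, 14, 13],
    "movieId": [296, 1968, 2, 1274, 296, 1],
    "rating": [4.0, 3.0, 2.0, 5.0, 2.0, 4.5],
})

groups = toy.groupby("userId")

# Each element is a (userId, rows) pair; the largest groups come first
ranked = sorted(groups, key=lambda pair: len(pair[1]), reverse=True)

top_user, top_rows = ranked[0]
print(top_user, len(top_rows))
```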

### Similarity of users to input user

Next, we are going to compare all users to our specified user and find the ones that are most similar.<br>
We measure how similar each user is to the input user with the **Pearson correlation coefficient**, which measures the strength of a linear association between two variables.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSD66GkMDNKG_af1YxK0PqGT_BZRV1qWvVydvdnDp1nZ58dsJNhRD58LQYfHN_Wap5ROQ&usqp=CAU" width=450px>

> * Numerator : Sxy
> * denominator(right) : Syy
> * denominator(left) : Sxx
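
The three quantities listed above can be written out explicitly. As a sketch, assume $x_i$ are the input user's ratings, $y_i$ are the other user's ratings for the same movies, and $n$ is the number of movies they have in common (these symbols are introduced here for illustration):

```latex
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
```

The coefficient $r$ ranges from $-1$ (opposite tastes) to $1$ (identical rating pattern), and it is undefined when $S_{xx}$ or $S_{yy}$ is zero, that is, when one of the users gave every shared movie the same rating.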

We will use only part of the extracted data so that we don't spend time on less similar users:

In [31]:
userSubsetGroup = userSubsetGroup[0:100]

Now it's time to calculate the Pearson correlation between the input user and each subset group, storing the results in a Python dictionary.

In [33]:
# creating an empty dictionary to assign:  key = user id   and   value = coefficient
pearsonCorrelationDict = dict()

# iterating over the subset groups
for name, group in userSubsetGroup:
    # sorting our data to avoid future issues
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    
    # obtaining n for upper limit of sigma
    nratings = len(group)
    
    # creating a dataframe of the movies that both users have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    # getting the ratings from temp_df to ease the upcoming calculations
    tempRatingList = temp_df['rating'].tolist()
    # also put the current user group ratings in a list format
    tempGroupList = group['rating'].tolist()
    
    # calculating the pearson correlation between two users
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nratings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nratings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nratings)
    
    # check whether the denominator is 0; if it is, assign 0 for the pearson correlation
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy) 
    else:
        pearsonCorrelationDict[name] = 0
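
The arithmetic inside the loop above can be factored into a small helper and checked against NumPy. This is only a sketch: the function name `pearson` and the sample ratings are hypothetical, and `np.corrcoef` is used purely as a reference value.

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists (0.0 if undefined)."""
    n = len(x)
    if n != len(y):
        raise ValueError("both rating lists must have the same length")
    if n == 0:
        return 0.0
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A zero denominator means one user gave every movie the same rating
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of two users for the same five movies
a = [5.0, 3.5, 2.0, 5.0, 4.5]
b = [4.0, 3.0, 1.5, 5.0, 4.0]

r = pearson(a, b)
reference = np.corrcoef(a, b)[0, 1]
print(r, reference)
```

The two printed values should agree up to floating-point rounding.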

Now we create a DataFrame from the dictionary of Pearson correlations that we generated.

In [34]:
# creating dataframe from dictionary
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
# renaming the pearson correlation column 
pearsonDF.columns = ['similarityIndex']
# since we used a dictionary to create this dataframe, the dict key (user id) is the dataframe index,
# so create a user id column and reset the index
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.827278,75
1,0.586009,106
2,0.83205,686
3,0.576557,815
4,0.943456,1040


### Choosing the most similar rows

In this case we choose the 50 most similar users.

In [35]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
64,0.961678,12325
34,0.961538,6207
55,0.961538,10707
67,0.960769,13053
4,0.943456,1040


### Adding rating column to dataset

 **DataFrame.merge(right, how='inner')**
 * **right** : Object to merge with.
 * **how** :  
> * left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
> * right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
> * outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
> * inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
> * cross: creates the cartesian product from both frames, preserves the order of the left keys.
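
To make the difference between the **how** options concrete, here is a short sketch comparing `inner` and `left` joins. The `top` and `rates` frames are hypothetical, and the key is passed explicitly with `on="userId"`:

```python
import pandas as pd

# Hypothetical similar users and ratings; user 20 has no ratings, user 30 is not a top user
top = pd.DataFrame({"userId": [10, 20], "similarityIndex": [0.9, 0.8]})
rates = pd.DataFrame({
    "userId": [10, 10, 30],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.0, 5.0],
})

inner = top.merge(rates, how="inner", on="userId")  # only users present in both frames
left = top.merge(rates, how="left", on="userId")    # every top user, NaN where no rating exists

print(inner)
print(left)
```

An inner join is the natural fit for this step, since a similar user without any ratings contributes nothing to the recommendation.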


In [37]:
topUsersRating = topUsers.merge(ratings, how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.961678,12325,1,3.5
1,0.961678,12325,2,1.5
2,0.961678,12325,3,3.0
3,0.961678,12325,5,0.5
4,0.961678,12325,6,2.5


Multiply the similarity by each user's ratings to obtain the weighted ratings used in the next steps:

In [38]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.961678,12325,1,3.5,3.365874
1,0.961678,12325,2,1.5,1.442517
2,0.961678,12325,3,3.0,2.885035
3,0.961678,12325,5,0.5,0.480839
4,0.961678,12325,6,2.5,2.404196


Group the data by movie ID and calculate the sums of **similarityIndex** and **weightedRating**:

In [39]:
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,38.376281,140.800834
2,38.376281,96.656745
3,10.253981,27.254477
4,0.929294,2.787882
5,11.723262,27.151751


In [40]:
# creating a recommendation dataframe
recommendation = pd.DataFrame()
# calculating the weighted average
recommendation['weighted average score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
# creating movie id column
recommendation['movieId'] = tempTopUsersRating.index
recommendation.head()

movieId,weighted average score,movieId
1,3.668955,1
2,2.518658,2
3,2.657941,3
4,3.0,4
5,2.316058,5
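
The weighted average computed above is, for each movie, the sum of (similarity × rating) divided by the sum of similarities of the users who rated it. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical) shows the arithmetic:

```python
import pandas as pd

# Hypothetical data: user 1 is twice as similar to the input user as user 2
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 20],
    "similarityIndex": [1.0, 1.0, 0.5, 0.5],
    "rating": [5.0, 2.0, 2.0, 5.0],
})

toy["weightedRating"] = toy["similarityIndex"] * toy["rating"]

# Per-movie sums of similarity and weighted rating
sums = toy.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()

# Weighted average score per movie
score = sums["weightedRating"] / sums["similarityIndex"]
print(score)
```

Both movies have the same plain average rating (3.5), but movie 10 scores higher here because the more similar user rated it highly.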


Sorting the rows by weighted average score in descending order:

In [42]:
recommendation = recommendation.sort_values(by='weighted average score', ascending=False)
recommendation.head(10)

movieId,weighted average score,movieId
5073,5.0,5073
3329,5.0,3329
2284,5.0,2284
26801,5.0,26801
6776,5.0,6776
6672,5.0,6672
3759,5.0,3759
3769,5.0,3769
3775,5.0,3775
90531,5.0,90531


Finally, we take the first 10 recommendations and look up their details in the original movies DataFrame.

In [50]:
final_rec = movies.loc[movies['movieId'].isin(recommendation.head(10)['movieId'].tolist())]
final_rec.reset_index()

Unnamed: 0,index,movieId,title,year
0,2200,2284,Bandit Queen,1994
1,3243,3329,"Year My Voice Broke, The",1987
2,3669,3759,Fun and Fancy Free,1947
3,3679,3769,Thunderbolt and Lightfoot,1974
4,3685,3775,Make Mine Music,1946
5,4978,5073,"Son's Room, The (Stanza del figlio, La)",2001
6,6563,6672,War Photographer,2001
7,6667,6776,Lagaan: Once Upon a Time in India,2001
8,9064,26801,Dragon Inn (Sun lung moon hak chan),1992
9,18106,90531,Shame,2011


**Happy watching!**

# By Sina Kazemi