# Content-Based Movie Recommender System

## Importing the necessary modules

In [69]:
# Python library used for working with arrays
import numpy as np
# Library that provides many functions and methods to expedite the data analysis process.
import pandas as pd

## Reading datasets

> **movies.csv** : movie names with release year and genres data attached <br><br>
> **ratings_sample.csv** : user ratings for a movie on a specific date <br>

We are going to use the pandas **read_csv** function to load both files into DataFrames.

In [39]:
movies  = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings_sample.csv")

The **head()** function, which takes **n** (default = 5) as an argument, shows the first n rows of the dataset. You can use **tail()** to display the last n rows instead.  

In [33]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [34]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


## Optimizing Datasets

We need to remove the **year** from the movie title and move it to its own column. This can be done in different ways, but a convenient one is **Regex** (regular expressions).<br>
pandas supports regex in its **extract** and **replace** functions. In our case **\d** matches a single digit, so four of them inside escaped parentheses match a year such as (1995) in the title column.<br>

#### pandas.Series.str.extract(pat, expand=True)
> **pat** : String can be a character sequence or regular expression.<br>
> **expand** : If True, return DataFrame with one column per capture group. If False, return a Series/Index if there is one capture group or DataFrame if there are multiple capture groups.<br><br>
#### Series.str.replace(pat, repl, case=None, regex=None)
> **pat** : String can be a character sequence or regular expression.<br>
> **repl** : Replacement string.<br>
> **regex** : Determines if the passed-in pattern is a regular expression. The default has changed between pandas versions, so it is safer to pass it explicitly.<br>
> **case** : Determines if replace is case sensitive:
> * If True, case sensitive (the default if pat is a string)
> * Set to False for case insensitive
> * Cannot be set if pat is a compiled regex

In [40]:
# Extracting the year (with its parentheses) from the title column
# Raw strings (r'...') keep Python from treating \( and \d as string escapes
movies['year'] = movies.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# Extracting the year without the parentheses
movies['year'] = movies.year.str.extract(r'(\d\d\d\d)', expand=False)

# Removing the year from the title column
movies['title'] = movies.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
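
The two-step extraction above could probably be collapsed into a single pattern by putting the capture group inside the escaped parentheses. The snippet below is only a sketch on a tiny hand-made Series: the `titles` values, including the year-less "Untitled Film", are made up for illustration and are not taken from movies.csv.

```python
import pandas as pd

# Hypothetical sample titles; the last one has no year on purpose
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Film"])

# \( and \) match literal parentheses; (\d{4}) captures only the four digits
year = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" part together with the whitespace in front of it
clean = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(year.tolist())
print(clean.tolist())
```

Titles without a year simply get a missing value in `year`, which is worth keeping in mind before any later step that fills missing cells.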


Applying the **strip()** function to remove any whitespace from the left and right sides of each title

In [41]:
movies['title'] = movies['title'].str.strip()

There is also the problem that a movie's genres are separated by the **|** character, so we have to split them up and put them into a list with the pandas **str.split()** function.

In [42]:
movies['genres'] = movies.genres.str.split('|')
# a small preview of result
movies.head(2)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995


Since we need the **weighted genres matrix**, we have to convert each list of genres into a vector of binary values that indicate whether the movie belongs to each genre or not. 

> **pandas.DataFrame.iterrows** : a Python generator which iterates over DataFrame rows and yields (index, Series) pairs.<br><br>
> **pandas.DataFrame.fillna** : a function which gives a specified value to **NaN** cells.

In [43]:
# Copy the first dataframe and add the genre columns to the copy, since the genres list itself is not used again
moviesWithGenres = movies.copy()

# Iterating over rows and writing a 1 into the column of every genre the movie has
for index, row in moviesWithGenres.iterrows():
    for genre in row["genres"]:
        moviesWithGenres.at[index, genre] = 1

# Filling the NaN cells with 0 (note: this also turns a missing year into 0)
moviesWithGenres = moviesWithGenres.fillna(0)
moviesWithGenres.head(2)

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
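
Looping with `iterrows` is easy to follow but can be slow on a large catalog. As a possible alternative, the same binary columns can be built without an explicit loop by joining each list back into a string and calling `Series.str.get_dummies`. The sketch below uses a made-up three-row frame named `toy`, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the movies frame after the genres were split into lists
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [["Adventure", "Children"], ["Comedy"], ["Adventure", "Comedy"]],
})

# Join each list with "|" and let pandas create one 0/1 column per genre
genre_flags = toy["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns next to the original columns
toy_with_genres = pd.concat([toy, genre_flags], axis=1)
print(toy_with_genres)
```

The indicator columns come out as integers rather than floats and in alphabetical order, so the column order differs from the loop-based version even though the content is the same.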


**Now let's take a look at the ratings dataframe**

We won't use the **timestamp** column, so let's remove it to save memory.
> DataFrame.drop(labels=None, axis=0)<br>
> * **labels** : Index or column labels to drop.<br>
> * **axis** : Whether to drop labels from the index (0 or 'index') or the columns (1 or 'columns').

In [44]:
ratings = ratings.drop('timestamp', axis=1)

## Content-Based recommendation system

This technique takes the user's activity history (watched and rated movies, in this case), builds a profile that measures the user's preferences, matches that profile against other movies, and suggests the best-matching results.

Let's begin with a sample user input and suggest movies based on it:

In [48]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


We have to add the movie ID of each user input in order to look up the genres of that movie. This can be done by filtering the original movies dataset.

In [51]:
# Selecting the catalog rows whose title appears in the user's input
inputId = movies[movies["title"].isin(inputMovies["title"].tolist())]
# Merging it with the user inputs
inputMovies = pd.merge(inputId, inputMovies)
# Removing unnecessary columns
inputMovies = inputMovies.drop('genres', axis=1).drop('year', axis=1)
# Final result
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
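
The default inner merge silently drops any input title that does not match the catalog exactly, for example because of a typo or a different article placement. One way to surface such cases, shown here only as a sketch, is a left merge with `indicator=True`. The `catalog` and `user_input` frames and the misspelled "Pulp Fictoin" are invented for the illustration.

```python
import pandas as pd

# Hypothetical slice of the movie catalog
catalog = pd.DataFrame({
    "movieId": [1, 2, 296],
    "title": ["Toy Story", "Jumanji", "Pulp Fiction"],
})

# Hypothetical user input with one misspelled title
user_input = pd.DataFrame({
    "title": ["Toy Story", "Pulp Fiction", "Pulp Fictoin"],
    "rating": [3.5, 5.0, 4.0],
})

# Keep every input row; the _merge column says whether a catalog match was found
checked = user_input.merge(catalog, on="title", how="left", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()
matched = checked.loc[checked["_merge"] == "both", ["movieId", "title", "rating"]]

print("No catalog match for:", unmatched)
print(matched)
```

Because the left merge introduces missing values, `movieId` becomes a float column here and may need casting back to an integer type afterwards.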


Now we start the learning process, using the user-input values.

First, let's get the subset of movies that the user has watched from the DataFrame that contains the genres as binary values.

In [53]:
userMovies = moviesWithGenres[moviesWithGenres['movieId'].isin(inputMovies['movieId'].tolist())]

Only the genre columns are needed, so let's clean up the data.

> **DataFrame.reset_index** : Reset the index of the DataFrame, and use the default one instead.

In [55]:
# Resetting the index so the rows line up with inputMovies when the ratings are applied
userMovies = userMovies.reset_index(drop=True)
# Generating the genre matrix
userGenreTable = userMovies.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


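We use the genre matrix, weighted by the user's ratings, to create the user profile: the weight of each genre is the sum of the ratings of the input movies that belong to it.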

> **DataFrame.transpose** : Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. 

In [61]:
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
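
To make the dot product concrete, the sketch below rebuilds the profile by hand from the five input movies and their ratings shown earlier, keeping only the ten genres that actually occur. The variable names are illustrative and are not the ones used in the notebook.

```python
import pandas as pd

# Only the genres that occur in the five input movies
genres = ["Adventure", "Animation", "Children", "Comedy", "Fantasy",
          "Drama", "Action", "Crime", "Thriller", "Sci-Fi"]

# Genre tags per input movie, read off the binary table above
tags = {
    "Toy Story": ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
    "Jumanji": ["Adventure", "Children", "Fantasy"],
    "Pulp Fiction": ["Comedy", "Drama", "Crime", "Thriller"],
    "Akira": ["Adventure", "Animation", "Action", "Sci-Fi"],
    "Breakfast Club, The": ["Comedy", "Drama"],
}

# Binary movie-by-genre matrix
genre_matrix = pd.DataFrame(
    [[1.0 if g in movie_tags else 0.0 for g in genres] for movie_tags in tags.values()],
    index=list(tags),
    columns=genres,
)

# The user's ratings, indexed by the same titles
ratings = pd.Series([3.5, 2.0, 5.0, 4.5, 5.0], index=list(tags))

# genre-by-movie matrix times the rating vector gives one weight per genre
profile = genre_matrix.T.dot(ratings)
print(profile)
```

Comedy, for instance, ends up with 3.5 + 5.0 + 5.0 = 13.5. Note that `dot` matches rows to ratings by label, which is why the index of the genre table and the index of the ratings have to agree.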

Now we know the user's preferences. We can use them to suggest movies that the user would like.

In [63]:
# genres of all movies in our original dataframe
genreTable = moviesWithGenres.set_index(moviesWithGenres['movieId'])
# Removing the unnecessary information
genreTable = genreTable.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
genreTable.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [64]:
# Scoring each movie: sum of the profile weights of its genres, divided by the total profile weight
recommendationTable = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable

movieId
1         0.594406
2         0.293706
3         0.188811
4         0.328671
5         0.188811
            ...   
151697    0.069930
151701    0.000000
151703    0.139860
151709    0.202797
151711    0.000000
Length: 34208, dtype: float64
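
The first three scores above can be checked by hand. The sketch below hard-codes the profile weights that follow from the five rated movies (they sum to 71.5) and scores the first three catalog movies. It is a verification aid with illustrative names, not part of the pipeline.

```python
import pandas as pd

# Genre weights implied by the sample ratings; genres not listed have weight 0
profile = pd.Series({
    "Adventure": 10.0, "Animation": 8.0, "Children": 5.5, "Comedy": 13.5,
    "Fantasy": 5.5, "Drama": 10.0, "Action": 4.5, "Crime": 5.0,
    "Thriller": 5.0, "Sci-Fi": 4.5,
})

# Binary genre rows for movieId 1, 2 and 3
candidates = pd.DataFrame(0.0, index=pd.Index([1, 2, 3], name="movieId"),
                          columns=profile.index)
candidates.loc[1, ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]] = 1.0
candidates.loc[2, ["Adventure", "Children", "Fantasy"]] = 1.0
candidates.loc[3, ["Comedy"]] = 1.0  # Romance carries no weight in this profile

# Weighted sum per movie, normalised by the total weight so scores fall in [0, 1]
scores = (candidates * profile).sum(axis=1) / profile.sum()
print(scores.round(6))
```

Toy Story, for example, scores (10 + 8 + 5.5 + 13.5 + 5.5) / 71.5, which matches the 0.594406 in the table.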

In [66]:
# Sorting recommendations table in descending order
recommendationTable.sort_values(ascending=False, inplace=True)
#Just a peek at the values
recommendationTable.head()

movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
dtype: float64

We have found the movies that we want to suggest. Now we convert the movie IDs back to titles so that the user can read them.

In [68]:
movies.loc[movies['movieId'].isin(recommendationTable.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
664,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
1824,1907,Mulan,"[Adventure, Animation, Children, Comedy, Drama...",1998
2902,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
4923,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991
6793,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
8605,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
8783,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
9296,27344,Revolutionary Girl Utena: Adolescence of Utena...,"[Action, Adventure, Animation, Comedy, Drama, ...",1999
9825,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
11716,51632,Atlantis: Milo's Return,"[Action, Adventure, Animation, Children, Comed...",2003
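
Filtering with `isin` returns the rows in catalog order, so the ranking by score is lost in the table above. If the order matters, one option is to turn the top scores into a small frame and merge the titles onto it. This is only a sketch on invented data: `movies_demo` and `scores` are placeholders for the real `movies` and `recommendationTable`.

```python
import pandas as pd

# Hypothetical stand-ins for the movie catalog and the score Series
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
scores = pd.Series([0.2, 0.9, 0.5], index=pd.Index([1, 2, 3], name="movieId"))

# Take the best two scores and turn the Series into a movieId/score frame
top = scores.sort_values(ascending=False).head(2).rename("score").reset_index()

# A left merge keeps the order of `top`, so the result stays sorted by score
ranked = top.merge(movies_demo, on="movieId", how="left")
print(ranked)
```

The same frame also makes it easy to show the score next to each title.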


**These are the movies that we recommend our user to see**

# By Sina Kazemi