# Movie Recommendation API
The first step in building my movie recommender is to explore the data, cross-reference the datasets, and clean the data.

In [1]:
# Load pandas; the CSV files are read from the ml-latest-small folder by relative path,
# so no change to sys.path is needed
import pandas as pd

In [3]:
# Load csv files and merge into a single dataframe

df_movies = pd.read_csv("ml-latest-small/movies.csv")
df_ratings = pd.read_csv("ml-latest-small/ratings.csv")
df_merged = pd.merge(df_movies, df_ratings, on="movieId", how='inner')

df_merged.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
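One thing worth keeping in mind about the merge above: with `how='inner'`, a movie that has no rows in the ratings file does not appear in the merged dataframe at all. The sketch below illustrates this on made-up toy data (the `toy_*` names and their rows are assumptions for illustration, not part of the MovieLens files).

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (made-up rows, not the real data)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# Inner join: only movieIds present in both frames survive
toy_merged = pd.merge(toy_movies, toy_ratings, on="movieId", how="inner")

# Movie C has no ratings, so it is missing from the merged frame
unrated = set(toy_movies["title"]) - set(toy_merged["title"])
print(len(toy_merged), unrated)
```

For a recommender built on ratings this is probably what we want, since a movie nobody rated cannot be correlated with anything.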


## Initial treatments on the dataframe
I create a dataframe with one row per **user** and one column per **movie**, where each cell holds that user's **rating** of that movie. I then drop movies that have fewer than 8 ratings and fill the remaining missing ratings with 0.

In [4]:
# pivot into a user x movie rating matrix (NaN where a user has not rated a movie)
df = df_merged.pivot_table(index='userId', columns='title', values='rating')

# keep only movies that have at least 8 ratings
df = df.dropna(thresh=8, axis=1)
# treat every remaining missing rating as 0
df = df.fillna(0)

df.head()

userId,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
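To make the pivot and the rating threshold easier to follow, here is a minimal sketch on a tiny made-up dataset. The `toy` frame and the threshold of 2 are assumptions chosen so the effect is visible with six rows; the notebook itself uses a threshold of 8.

```python
import pandas as pd

# Made-up long-format ratings: movie A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 3],
    "title":  ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0, 4.5],
})

# One row per user, one column per movie; missing ratings become NaN
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# thresh=2 keeps only columns with at least 2 non-NaN values, so B is dropped;
# the NaN that remains (user 1 never rated C) is then filled with 0
toy_matrix = toy_matrix.dropna(thresh=2, axis=1).fillna(0)
print(toy_matrix)
```

Note that `thresh` counts the non-missing values required to keep a column, which is why the filtering has to happen before the `fillna(0)` step.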


## Working with correlations
I use the [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) correlation to measure the similarity between movies. Pearson treats each movie (column) as a vector of user ratings and measures how closely two such vectors vary together. The score ranges from -1 (strongly dissimilar) through 0 (no linear relationship) to +1 (strongly similar). As the similarity matrix below shows, each movie is perfectly correlated with itself (1.0 on the diagonal), while most other pairs score much closer to 0. Remember that correlation does not imply causation :)

In [5]:
df_similarity = df.corr(method='pearson')

# Save the similarity matrix for later use when building the API
df_similarity.to_csv('movie_similarity.csv')

df_similarity.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
"'burbs, The (1989)",1.0,0.063117,-0.023768,0.143482,0.011998,0.087931,0.224052,-0.018608,0.034223,0.009277,...,0.03247,0.134701,0.153158,0.101301,0.049897,0.003233,-0.017905,0.187953,0.062174,0.353194
(500) Days of Summer (2009),0.063117,1.0,0.142471,0.273989,0.19396,0.148903,0.142141,0.066567,0.159756,0.135486,...,0.178655,0.068407,0.414585,0.355723,0.252226,0.216007,0.126147,0.053614,0.241092,0.125905
10 Cloverfield Lane (2016),-0.023768,0.142471,1.0,-0.005799,0.112396,0.006139,-0.016835,-0.017692,0.031704,-0.024275,...,0.099059,-0.023477,0.272347,0.241751,0.195054,0.319371,0.082246,0.177846,0.096638,0.002733
10 Things I Hate About You (1999),0.143482,0.273989,-0.005799,1.0,0.24467,0.223481,0.211473,0.109729,0.011784,0.091964,...,0.104858,0.13246,0.091853,0.158637,0.281934,0.050031,0.088391,0.121029,0.130813,0.110612
"10,000 BC (2008)",0.011998,0.19396,0.112396,0.24467,1.0,0.234459,0.119132,0.086195,0.059187,-0.025882,...,0.087592,0.094913,0.184521,0.242299,0.240231,0.094773,0.074425,0.088045,0.203002,0.083518
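To show what `df.corr(method='pearson')` computes for a single pair of columns, the sketch below works out the coefficient by hand and compares it with pandas. The `toy_ratings` frame and the `pearson` helper are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ratings from four users for three movies (0 = not rated, as in the notebook)
toy_ratings = pd.DataFrame({
    "Movie A": [5.0, 4.0, 0.0, 1.0],
    "Movie B": [4.0, 5.0, 1.0, 0.0],
    "Movie C": [0.0, 1.0, 5.0, 4.0],
})

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    xc = x - x.mean()  # center each column on its mean
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

by_hand = pearson(toy_ratings["Movie A"], toy_ratings["Movie B"])
from_pandas = toy_ratings.corr(method="pearson").loc["Movie A", "Movie B"]
print(by_hand, from_pandas)
```

In this toy data, users who rate Movie A highly also rate Movie B highly, so the pair scores close to +1, while Movie A and Movie C are rated in opposite directions and score -1.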


If you like "Star Wars: Episode IV - A New Hope (1977)" as much as I do, let's see which recommendations we get by selecting that movie's column and sorting the similarity scores from highest to lowest to get the top 50 recommendations:

In [6]:
movieliked = 'Star Wars: Episode IV - A New Hope (1977)'
similarity_score = df_similarity[movieliked]
# skip position 0 (the movie itself, with score 1.0) and keep the next 50
similarity_score.sort_values(ascending=False)[1:51]

title
Star Wars: Episode V - The Empire Strikes Back (1980)                             0.739568
Star Wars: Episode VI - Return of the Jedi (1983)                                 0.682934
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.552444
Indiana Jones and the Last Crusade (1989)                                         0.504209
Star Wars: Episode I - The Phantom Menace (1999)                                  0.462449
Terminator, The (1984)                                                            0.450467
Back to the Future (1985)                                                         0.447119
Indiana Jones and the Temple of Doom (1984)                                       0.434504
Matrix, The (1999)                                                                0.427021
Star Wars: Episode III - Revenge of the Sith (2005)                               0.408958
Aliens (1986)                                                                     0.

The top two recommendations are sequels, and the third is an Indiana Jones movie (which also stars Harrison Ford).
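The lookup above can be wrapped in a small reusable function, which is roughly the shape an API endpoint would need. This is only a sketch: `recommend`, its parameters, and the `toy_similarity` matrix are hypothetical names introduced here, not code from the project.

```python
import pandas as pd

def recommend(similarity, liked, n=10):
    """Return the n titles most correlated with `liked`, excluding `liked` itself."""
    if liked not in similarity.columns:
        raise KeyError(f"{liked!r} is not in the similarity matrix")
    # Drop the movie by label instead of slicing off the first sorted row
    scores = similarity[liked].drop(labels=liked)
    return scores.sort_values(ascending=False).head(n)

# Made-up 3x3 similarity matrix for illustration
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

top = recommend(toy_similarity, "A", n=2)
print(top)
```

Dropping the liked movie by label is a little more robust than skipping the first sorted row, since it still works if another movie happens to tie with a score of 1.0.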

Now it's time to create the Flask API. You can access the code right [here]()