# Building a Movie Recommendation System 



In this project we will use the well-known MovieLens 100k dataset.
You can download the dataset from Kaggle at the link below:
https://www.kaggle.com/imkushwaha/movielens-100k-dataset

You only need two files from this dataset:

- u.item: contains information about movies (movie id and title)
- u.data: contains the user ratings




<IMG src="s.jpg" width="250" height="350" >

In [6]:
import pandas as pd

In [7]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
# u.data is tab-separated and has no header row, so we supply the column names ourselves
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [8]:
df.head()


| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [9]:
# Let's see how many records we have:
len(df)

100000

### Now import the u.item file:

In [12]:

m_cols = ['item_id', 'title']
movie_titles = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding='latin1')
movie_titles.head()

| | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [13]:
# Let's see how many records we have:

len(movie_titles)

1682

In [14]:
# Now let's merge the u.data and u.item data on item_id
df = pd.merge(df, movie_titles, on='item_id')
df.head()

| | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
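The merge step can be illustrated on a tiny example. The sketch below uses made-up toy frames (`toy_ratings` and `toy_titles` are hypothetical stand-ins for u.data and u.item) and adds the optional `validate` argument, which is not part of the notebook above, as one possible way to guard against duplicate ids in the lookup table.

```python
import pandas as pd

# Toy stand-ins for u.data and u.item (values are made up for illustration).
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [10, 10, 20],
    'rating':  [4, 5, 3],
})
toy_titles = pd.DataFrame({
    'item_id': [10, 20, 30],
    'title':   ['Movie A', 'Movie B', 'Movie C'],
})

# pd.merge defaults to an inner join: only item_ids present in both frames survive.
# validate='many_to_one' raises an error if an item_id is repeated in toy_titles.
toy_merged = pd.merge(toy_ratings, toy_titles, on='item_id', validate='many_to_one')
print(toy_merged)
```

Every rating row picks up the title for its item_id, and 'Movie C', which nobody rated, does not appear in the result.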


### Setting Up Our Recommendation System:


In [16]:
# First, we build a structure similar to an Excel pivot table.
# Each row is a user (the index of our DataFrame is user_id),
# each column is a movie title,
# and each cell holds the rating that user gave that movie (NaN if the user did not rate it).

moviepivot = df.pivot_table(index='user_id',columns='title',values='rating')
moviepivot.head()

| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | 2.0 | 5.0 | NaN | NaN | 3.0 | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | 4.0 | NaN |
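To see what `pivot_table` does here, it may help to run it on a handful of rows. The following is a minimal sketch with made-up data (the `toy` frame and its values are hypothetical): each (user_id, title) pair becomes one cell, and pairs that never occur are left as NaN.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) rating.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['Movie A', 'Movie B', 'Movie A', 'Movie B', 'Movie C'],
    'rating':  [5, 3, 4, 2, 1],
})

# Rows become users, columns become titles, cells hold the rating.
# The default aggfunc is 'mean', so repeated (user, title) pairs would be averaged.
toy_pivot = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_pivot)

# Count the cells with no rating; in real rating data most cells are empty.
n_missing = int(toy_pivot.isna().sum().sum())
print(n_missing)
```

With 3 users, 3 movies, and only 5 ratings, 4 of the 9 cells are NaN. The real matrix is far sparser, which matters for the correlation step that follows.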


In [17]:
type(moviepivot)

pandas.core.frame.DataFrame

### Goal: Recommend movies similar to Star Wars

Let's take a look at the user ratings of Star Wars (1977):

In [18]:
starwars_user_ratings = moviepivot['Star Wars (1977)']
starwars_user_ratings.head()

user_id
1    5.0
2    5.0
3    NaN
4    5.0
5    4.0
Name: Star Wars (1977), dtype: float64

Let's calculate each movie's correlation with Star Wars using the corrwith() method:

In [19]:
similar_to_starwars = moviepivot.corrwith(starwars_user_ratings)


[RuntimeWarning output from NumPy's `cov` and `corrcoef`, raised at source lines such as `c = cov(x, y, rowvar, dtype=dtype)`, `c *= np.true_divide(1, fact)`, and `c /= stddev[:, None]`]


In [20]:
similar_to_starwars

title
'Til There Was You (1997)                0.872872
1-900 (1994)                            -0.645497
101 Dalmatians (1996)                    0.211132
12 Angry Men (1957)                      0.184289
187 (1997)                               0.027398
                                           ...   
Young Guns II (1990)                     0.228615
Young Poisoner's Handbook, The (1995)   -0.007374
Zeus and Roxanne (1997)                  0.818182
unknown                                  0.723123
Á köldum klaka (Cold Fever) (1994)            NaN
Length: 1664, dtype: float64
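A small example shows what `corrwith` returns: for each column it computes the Pearson correlation with the given Series, using only the users who rated both movies. The toy matrix below is made up (`toy_pivot` and its column names are hypothetical), and the warning filter is an assumption added only to keep the output quiet.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy pivot: rows are users, columns are movies (values are made up).
toy_pivot = pd.DataFrame(
    {
        'Target':   [5.0, 4.0, 1.0, 2.0],
        'Similar':  [4.0, 5.0, 2.0, 1.0],
        'Opposite': [1.0, 2.0, 5.0, 4.0],
        'Sparse':   [np.nan, 3.0, np.nan, np.nan],  # rated by a single user
    },
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)

# Silence the NumPy RuntimeWarning that the single-rating column can trigger.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    toy_corr = toy_pivot.corrwith(toy_pivot['Target'])

print(toy_corr)
```

'Similar' comes out at about 0.8, 'Opposite' at -1.0, and 'Sparse' is NaN because a correlation cannot be computed from one shared rating.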

In [21]:
type(similar_to_starwars)

pandas.core.series.Series

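#### It throws a warning because many entries are missing (NaN): some movies share too few ratings with Star Wars for a correlation to be computed. Let's convert the result to a DataFrame named corr_starwars, drop the NaN records, and take a look: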

In [22]:
corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()

| title | Correlation |
|---|---|
| 'Til There Was You (1997) | 0.872872 |
| 1-900 (1994) | -0.645497 |
| 101 Dalmatians (1996) | 0.211132 |
| 12 Angry Men (1957) | 0.184289 |
| 187 (1997) | 0.027398 |


### Let's sort the DataFrame we obtained and see which movies it recommends as the closest:

In [23]:
corr_starwars.sort_values('Correlation',ascending=False).head(10)

| title | Correlation |
|---|---|
| Hollow Reed (1996) | 1.0 |
| Commandments (1997) | 1.0 |
| Cosi (1996) | 1.0 |
| No Escape (1994) | 1.0 |
| Stripes (1981) | 1.0 |
| Star Wars (1977) | 1.0 |
| Man of the Year (1995) | 1.0 |
| Beans of Egypt, Maine, The (1994) | 1.0 |
| Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) | 1.0 |
| Outlaw, The (1943) | 1.0 |


#### As you can see, the results are irrelevant. The reason is that these movies received very few ratings, so a perfect correlation can come from just a couple of users. To fix this, let's keep only the movies that received more than 100 ratings. First, we need the number of ratings for each movie.
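Why do rarely rated movies reach a correlation of 1.0? When two movies share only two raters, the two points always lie on a straight line, so the Pearson correlation is +1 or -1 (or undefined if one movie got the same rating twice). The sketch below uses made-up Series (`target` and `obscure` are hypothetical names) and also shows the `min_periods` argument of `Series.corr` as one possible guard. That guard is an alternative to the rating-count filter used in this notebook, not the notebook's method.

```python
import pandas as pd

# Made-up ratings: 'target' was rated by five users, 'obscure' by only two of them.
target = pd.Series([5.0, 4.0, 1.0, 2.0, 3.0], index=[1, 2, 3, 4, 5])
obscure = pd.Series([4.0, 3.0], index=[1, 3])

# Series.corr aligns on the index, so only users 1 and 3 are compared.
r = target.corr(obscure)
print(r)

# Requiring at least 3 shared raters turns the unreliable value into NaN.
r_guarded = target.corr(obscure, min_periods=3)
print(r_guarded)
```

Two shared ratings are enough to produce a "perfect" match here, which says nothing about how similar the movies really are.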

In [24]:
df.head()

| | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |


We don't need the timestamp column, so let's drop it.

In [25]:
# drop() returns a new DataFrame, so assign the result back to keep the change
df = df.drop(columns=['timestamp'])
df

| | user_id | item_id | rating | title |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | Kolya (1996) |
| 1 | 63 | 242 | 3 | Kolya (1996) |
| 2 | 226 | 242 | 5 | Kolya (1996) |
| 3 | 154 | 242 | 3 | Kolya (1996) |
| 4 | 306 | 242 | 5 | Kolya (1996) |
| ... | ... | ... | ... | ... |
| 99995 | 840 | 1674 | 4 | Mamma Roma (1962) |
| 99996 | 655 | 1640 | 3 | Eighth Day, The (1996) |
| 99997 | 655 | 1637 | 3 | Girls Town (1996) |
| 99998 | 655 | 1630 | 3 | Silence of the Palace, The (Saimt el Qusur) (1... |


In [26]:
# Let's find the mean value rating of each movie
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())

# Let's sort them from high to low...
ratings.sort_values('rating',ascending=False).head()


| title | rating |
|---|---|
| They Made Me a Criminal (1939) | 5.0 |
| Marlene Dietrich: Shadow and Light (1996) | 5.0 |
| Saint of Fort Washington, The (1993) | 5.0 |
| Someone Else's America (1995) | 5.0 |
| Star Kid (1997) | 5.0 |


#### Attention: These averages ignore how many ratings each movie received, which is why unfamiliar movies with very few ratings end up at the top of the list.

In [27]:
# Now let's find the number of votes each movie received.
ratings['rating_count'] = df.groupby('title')['rating'].count()
ratings.head()

| title | rating | rating_count |
|---|---|---|
| 'Til There Was You (1997) | 2.333333 | 9 |
| 1-900 (1994) | 2.6 | 5 |
| 101 Dalmatians (1996) | 2.908257 | 109 |
| 12 Angry Men (1957) | 4.344 | 125 |
| 187 (1997) | 3.02439 | 41 |
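The mean and the count can also be computed in a single groupby pass. The sketch below uses pandas named aggregation on made-up data (`toy` and `toy_stats` are hypothetical names). It is an alternative way to build the same two columns, not the approach taken in the cells above.

```python
import pandas as pd

# Made-up ratings for two movies.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B'],
    'rating': [5, 4, 3, 5],
})

# Named aggregation: each keyword becomes an output column computed per title.
toy_stats = toy.groupby('title')['rating'].agg(rating='mean', rating_count='count')
print(toy_stats)
```

'Movie B' has a perfect average of 5.0 from a single rating, while 'Movie A' averages 4.0 over three ratings. This is the same pattern that puts obscure movies at the top of the average-rating list.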


In [28]:
# Now let's sort the movies with the most votes, from largest to smallest...
ratings.sort_values('rating_count',ascending=False).head()


| title | rating | rating_count |
|---|---|---|
| Star Wars (1977) | 4.358491 | 583 |
| Contact (1997) | 3.803536 | 509 |
| Fargo (1996) | 4.155512 | 508 |
| Return of the Jedi (1983) | 4.00789 | 507 |
| Liar Liar (1997) | 3.156701 | 485 |


In [29]:
# Let's go back to our main goal and add the rating_count column to our corr_starwars dataframe (with join)

In [30]:
corr_starwars.sort_values('Correlation',ascending=False).head(10)

| title | Correlation |
|---|---|
| Hollow Reed (1996) | 1.0 |
| Commandments (1997) | 1.0 |
| Cosi (1996) | 1.0 |
| No Escape (1994) | 1.0 |
| Stripes (1981) | 1.0 |
| Star Wars (1977) | 1.0 |
| Man of the Year (1995) | 1.0 |
| Beans of Egypt, Maine, The (1994) | 1.0 |
| Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) | 1.0 |
| Outlaw, The (1943) | 1.0 |


In [31]:
corr_starwars = corr_starwars.join(ratings['rating_count'])
corr_starwars.head()

| title | Correlation | rating_count |
|---|---|---|
| 'Til There Was You (1997) | 0.872872 | 9 |
| 1-900 (1994) | -0.645497 | 5 |
| 101 Dalmatians (1996) | 0.211132 | 109 |
| 12 Angry Men (1957) | 0.184289 | 125 |
| 187 (1997) | 0.027398 | 41 |


### And the result:

In [32]:
corr_starwars[corr_starwars['rating_count']>100].sort_values('Correlation',ascending=False).head()

| title | Correlation | rating_count |
|---|---|---|
| Star Wars (1977) | 1.0 | 583 |
| Empire Strikes Back, The (1980) | 0.747981 | 367 |
| Return of the Jedi (1983) | 0.672556 | 507 |
| Raiders of the Lost Ark (1981) | 0.536117 | 420 |
| Austin Powers: International Man of Mystery (1997) | 0.377433 | 130 |


As a result, we have reasonable movie recommendations for Star Wars. You can try other movies in the same way and see what our system recommends for them.
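To try other movies without repeating every cell, the steps can be wrapped in a function. The helper below is a hypothetical sketch, not part of the original notebook: the name `recommend_similar` and its parameters are assumptions, it derives the rating count from the pivot itself (non-NaN cells per column, which counts users rather than raw rating rows), and unlike the cells above it removes the query movie from its own results.

```python
import warnings

import numpy as np
import pandas as pd


def recommend_similar(pivot, title, min_ratings=100, top_n=5):
    """Hypothetical helper: rank movies by rating correlation with `title`."""
    with warnings.catch_warnings():
        # Sparse columns can trigger NumPy RuntimeWarnings; they end up as NaN.
        warnings.simplefilter('ignore', RuntimeWarning)
        corr = pivot.corrwith(pivot[title])
    result = pd.DataFrame({'Correlation': corr, 'rating_count': pivot.count()})
    result = result.dropna(subset=['Correlation'])
    # Keep only movies with more than min_ratings ratings, as in the notebook.
    result = result[result['rating_count'] > min_ratings]
    # Assumption: the query movie itself is not a useful recommendation.
    result = result.drop(index=title, errors='ignore')
    return result.sort_values('Correlation', ascending=False).head(top_n)


# Made-up user x movie matrix to exercise the function.
toy_pivot = pd.DataFrame(
    {
        'Target': [5.0, 4.0, 1.0, 2.0, 5.0, 1.0],
        'Close':  [5.0, 5.0, 1.0, 1.0, 4.0, 2.0],
        'Far':    [1.0, 2.0, 5.0, 4.0, 1.0, 5.0],
        'Rare':   [np.nan, np.nan, 1.0, 2.0, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name='user_id'),
)

recs = recommend_similar(toy_pivot, 'Target', min_ratings=3, top_n=2)
print(recs)
```

With the notebook's data, a call such as `recommend_similar(moviepivot, 'Star Wars (1977)')` would be the equivalent usage. In the toy run, 'Rare' correlates perfectly with 'Target' but is filtered out because it has only two ratings.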