# Data Mining and Statistics
## Session 7 - Recommendation
*Peter Stikker - Haarlem, the Netherlands*

----

## 7.1. Example data

We'll be using the MovieLens data, the 5MB version (the 'ml-100k' folder). To load this into Python we'll need pandas:

In [None]:
import pandas as pd

Now we can use 'read_csv' to load the ratings. These are in the 'u.data' file, which is tab-delimited and has no header row, so we pass sep='\t' and supply the column names ourselves:

In [None]:
df = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])

Let's take a quick look:

In [None]:
df.head()
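To see why the 'names' argument matters, here is a minimal sketch on three made-up rows in the same layout as 'u.data'. The sample values and the variable names are invented for illustration; only 'read_csv', 'sep' and 'names' come from the cell above. Without 'names', pandas treats the first line as a header, so one rating is silently lost.

```python
import io

import pandas as pd

# Three made-up rows: user_id, item_id, rating, timestamp (tab-separated, no header)
sample = "1\t10\t4\t880000000\n1\t20\t5\t880000100\n2\t10\t3\t880000200\n"

# With explicit column names every line is read as data
with_names = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

# Without names the first line is consumed as the header row
without_names = pd.read_csv(io.StringIO(sample), sep="\t")

print(len(with_names), "rows with names,", len(without_names), "rows without")
```

Comparing the two row counts is a quick check that the file was read the way you intended.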

Each 'item_id' identifies a movie. The full title can be found in the 'u.item' file, which uses '|' as its delimiter, has an encoding of 'latin-1', and has no header row:

In [None]:
movie_titles = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1', header=None)
movie_titles.head()

I'll only need the item_id and the title itself, so let's store those in a separate dataframe:

In [None]:
# Columns 0 and 1 of u.item hold the movie id and the title
dfMovieNames = movie_titles[[0, 1]].rename(columns={0: 'item_id', 1: 'MovieName'})
dfMovieNames.head()

And now merge the two dataframes we have:

In [None]:
df2 = pd.merge(df, dfMovieNames, on='item_id')
df2.head()
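By default 'merge' performs an inner join, so any rating whose item_id has no matching title would quietly disappear. The sketch below, on two made-up dataframes, shows one way to check for that with the 'indicator' parameter of 'merge'. The frames and their names are invented for illustration, and whether the real data has any unmatched ids is not something this example tells you.

```python
import pandas as pd

# Made-up ratings; item 30 has no entry in the titles table
ratings_small = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [10, 20, 30], "rating": [4, 5, 3]}
)
titles_small = pd.DataFrame(
    {"item_id": [10, 20], "MovieName": ["Movie A", "Movie B"]}
)

# Default inner join: the rating for item 30 is dropped
inner = pd.merge(ratings_small, titles_small, on="item_id")

# Left join with an indicator column shows which ratings found no title
checked = pd.merge(
    ratings_small, titles_small, on="item_id", how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(inner), "rows after inner merge")
print(unmatched[["user_id", "item_id", "rating"]])
```

If 'unmatched' is empty on the real data, the inner merge did not lose anything.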

## 7.2. Building an Item-Based Recommendation

Each row is one user's rating of one particular movie. It will be easier to work with an overview in which each row is a unique user and each column holds that user's rating for one movie.

This can be done using a pivot table:

In [None]:
movie_matrix = df2.pivot_table(index='user_id', columns='MovieName', values='rating')
movie_matrix.head()
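What the pivot does is easiest to see on a handful of rows. The following is a minimal sketch with invented ratings (the dataframe and its values are assumptions for illustration): every user/movie combination without a rating becomes NaN, which is why the real matrix is mostly empty.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair
long_df = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "MovieName": ["A", "B", "A", "C", "B"],
        "rating": [5, 3, 4, 2, 1],
    }
)

# One row per user, one column per movie; unrated movies become NaN
wide = long_df.pivot_table(index="user_id", columns="MovieName", values="rating")
print(wide)

# Share of the matrix that is actually filled with ratings
filled = wide.notna().to_numpy().mean()
print("filled fraction:", filled)
```

Note that 'pivot_table' averages by default, so if a user had rated the same title twice the cell would hold the mean of those ratings.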

We can use this movie_matrix to calculate the (Pearson) correlation coefficient between any two movies.

Let's say we are interested in someone who gave 'Air Force One (1997)' a high rating. Which movies have ratings that correlate strongly and positively with those of 'Air Force One'? A strong positive correlation indicates that users who rate that movie high usually also rate Air Force One high, so we could recommend that movie to this person.

To calculate the correlations between one field and all other fields we can use 'corrwith'.

First select the movie:

In [None]:
myMovie = movie_matrix['Air Force One (1997)']

Now for the correlations, sorted of course to make life easy:

In [None]:
myCorrs = movie_matrix.corrwith(myMovie).sort_values(ascending=False)

# and as a dataframe
corrDf = pd.DataFrame(myCorrs, columns=['Correlation'])
corrDf.dropna(inplace=True)
corrDf = corrDf.sort_values('Correlation', ascending=False)
corrDf.head()

A few odd things stand out.

First, the warning. It mentions '*Degrees of freedom <= 0 for slice c = cov(x, y, rowvar, dtype=dtype)*'. This happens when fewer than two users rated both Air Force One and the other movie, so no correlation can be computed for that pair. Luckily it is just a warning, so we don't need to worry about it too much.

Second, the correlation between Air Force One and itself is a perfect 1 (unrounded), but it is not listed at the top.

Third, some other movies also have a perfect correlation. My guess is that those have rounding errors, or very few ratings in common with Air Force One: with only two shared raters the correlation is always exactly 1 or -1 (when it is defined at all).
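These effects can be reproduced on a tiny, made-up rating matrix. The sketch below is an illustration under invented data (the column names 'Target', 'ManyRaters', 'TwoRaters' and 'OneRater' are hypothetical): a movie sharing only two raters with the target gets a perfect correlation, and a movie sharing only one rater gets NaN. The numpy warning is silenced here only to keep the output clean.

```python
import warnings

import numpy as np
import pandas as pd

# Rows are users, columns are movies; NaN means "not rated"
toy = pd.DataFrame(
    {
        "Target": [5, 4, 2, 1, 3],
        "ManyRaters": [4, 5, 1, 2, 3],
        "TwoRaters": [2, 1, np.nan, np.nan, np.nan],
        "OneRater": [np.nan, np.nan, 3, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Correlate every column with the target column, ignoring runtime warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    corrs = toy.corrwith(toy["Target"])

# Number of users who rated both the target and each other movie
overlap = toy[toy["Target"].notna()].count()

print(pd.DataFrame({"Correlation": corrs, "nShared": overlap}))
```

A perfect correlation based on two shared raters says very little, which is the motivation for looking at the number of ratings next.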

The number of ratings itself might also be helpful. Perhaps we should only take into consideration movies for which we have a decent number of ratings. Let's add the number of ratings.

In [None]:
ratings = pd.DataFrame(df2.groupby('MovieName')['rating'].mean())
ratings['nRatings'] = df2.groupby('MovieName')['rating'].count()
ratings.sort_values(by=['nRatings']).head()

Let's add the number of ratings to the correlations data frame:

In [None]:
corrDf = corrDf.merge(ratings['nRatings'], on='MovieName')
corrDf.head()

To find a suitable minimum number of ratings for our data, let's create a histogram of the number of ratings:

In [None]:
ratings['nRatings'].hist(bins=50, range=(0,500));

It is up to you to decide on a threshold, but I'll use 75.
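As a complement to the histogram, it can help to tabulate how many movies survive a few candidate thresholds. This is only a sketch on made-up counts; the series 'n_ratings' and the threshold list are invented for illustration, and on the real data you would use ratings['nRatings'] instead.

```python
import pandas as pd

# Made-up number of ratings per movie
n_ratings = pd.Series([3, 10, 40, 80, 120, 300], index=list("ABCDEF"), name="nRatings")

# Count the movies with at least t ratings for a few candidate thresholds
thresholds = [10, 50, 75, 100]
remaining = pd.Series({t: int((n_ratings >= t).sum()) for t in thresholds})
remaining.index.name = "threshold"

print(remaining)
```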

Let's filter out the movies with fewer than 75 ratings:

In [None]:
corrDf[corrDf['nRatings']>=75].head()

So, the final recommendation for someone who was happy with 'Air Force One' is to go and watch 'Copycat'.
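The steps of this section can be bundled into one function. The following is a rough sketch, not the notebook's own code: 'recommend_similar' and its parameters are hypothetical names, the toy ratings are invented, and unlike the cells above it removes the selected movie itself from the result.

```python
import pandas as pd


def recommend_similar(ratings_long, title, min_ratings=75, top_n=5):
    """Hypothetical helper: movies whose ratings correlate most with `title`."""
    # One row per user, one column per movie
    matrix = ratings_long.pivot_table(
        index="user_id", columns="MovieName", values="rating"
    )
    # Pearson correlation of every movie with the selected one
    corrs = matrix.corrwith(matrix[title]).rename("Correlation")
    # Number of ratings per movie
    counts = ratings_long.groupby("MovieName")["rating"].count().rename("nRatings")
    result = pd.concat([corrs, counts], axis=1).dropna(subset=["Correlation"])
    # Keep well-rated movies and leave out the selected movie itself
    result = result[result["nRatings"] >= min_ratings]
    result = result.drop(index=title, errors="ignore")
    return result.sort_values("Correlation", ascending=False).head(top_n)


# Made-up ratings: A, B and C have four raters each, D only two
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4] * 3 + [1, 2],
        "MovieName": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 2,
        "rating": [5, 4, 2, 1, 4, 5, 1, 2, 1, 2, 4, 5, 5, 4],
    }
)

recs = recommend_similar(toy, "A", min_ratings=3, top_n=2)
loose = recommend_similar(toy, "A", min_ratings=1, top_n=3)
print(recs)
print(loose)
```

With the threshold of 3 the sparsely rated movie D is excluded, while with a threshold of 1 it ends up on top because of its two matching ratings, which mirrors why a minimum number of ratings was introduced above.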