## MovieLens Dataset

We will use the MovieLens dataset to build a movie recommender system. The `ml-latest` release comes in two sizes: small and full.

According to the GroupLens website (https://grouplens.org/datasets/movielens/):<br>
<strong>Small:</strong> 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016.<br>
<strong>Full:</strong> 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Last updated 8/2017.

In [1]:
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
full_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'

### Download Files

In [None]:
!wget {small_dataset_url} -P data/
!wget {full_dataset_url} -P data/

### Extract Files

In [None]:
!unzip data/ml-latest-small.zip -d data/
!unzip data/ml-latest.zip -d data/

In [2]:
import pandas as pd
import numpy as np

In [3]:
ratings_df = pd.read_csv('data/ml-latest-small/ratings.csv')
ratings_df = ratings_df[['userId','movieId','rating']]
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [4]:
movies_df = pd.read_csv('data/ml-latest-small/movies.csv')
movies_df = movies_df[['movieId','title']]
movies_df.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [5]:
num_users = len(ratings_df['userId'].unique())
num_movies = len(ratings_df['movieId'].unique())
print('Number of Users : {}'.format(num_users))
print('Number of Movies: {}'.format(num_movies))

Number of Users : 671
Number of Movies: 9066
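The user and movie counts also determine how sparse the rating matrix is, which matters for the dense matrices built later in this notebook. The following is a minimal sketch on made-up data: `toy_ratings` is a hypothetical stand-in for `ratings_df`, and the numbers in it are for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings_df.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

n_users = toy_ratings['userId'].nunique()
n_movies = toy_ratings['movieId'].nunique()

# Share of user/movie pairs that actually have a rating.
density = len(toy_ratings) / (n_users * n_movies)
sparsity = 1.0 - density
print('Density : {:.3f}'.format(density))
print('Sparsity: {:.3f}'.format(sparsity))
```

Running the same three lines on `ratings_df` gives the figure for the real data.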


In [6]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings_df, test_size=0.25)
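The split is random, so the rows that land in `train` and every number computed from them change between runs. scikit-learn's `train_test_split` accepts a `random_state` argument to pin this down. The sketch below shows the underlying idea with NumPy only; the helper name `seeded_split` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def seeded_split(df, test_size=0.25, seed=42):
    """Shuffle row positions with a seeded generator and cut off a test share."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(round(len(df) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]

# Hypothetical toy frame with the same columns as ratings_df.
toy = pd.DataFrame({'userId': range(8), 'movieId': range(8), 'rating': [3.0] * 8})

train_a, test_a = seeded_split(toy)
train_b, test_b = seeded_split(toy)

# The same seed yields the same split on every call.
print(train_a.index.equals(train_b.index))
```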



In [7]:
train.head()

Unnamed: 0,userId,movieId,rating
78739,547,1236,3.0
34602,247,1320,3.0
81945,559,1193,5.0
50318,370,1645,4.5
66043,468,1962,3.0


In [8]:
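def create_matrix(df):
    """Build a dense user x movie rating matrix; unrated cells are 0."""
    users = df['userId'].unique()
    movies = df['movieId'].unique()
    # Start from an all-zero float matrix labelled by the IDs seen in df.
    mat = pd.DataFrame(0.0, index=users, columns=movies)
    # itertuples keeps userId/movieId as integers, whereas iterrows
    # upcasts every value in the row to float.
    for row in df.itertuples(index=False):
        mat.at[row.userId, row.movieId] = row.rating
    return mat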

In [9]:
train_matrix = create_matrix(train)
test_matrix = create_matrix(test)
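Because `create_matrix` takes its row and column labels from whichever frame it receives, `train_matrix` and `test_matrix` generally have different shapes and label orders, so the same (row, column) position need not refer to the same user and movie in both. One way to keep the two aligned, sketched here on toy data, is to pivot each split and reindex it onto a shared set of IDs. The helper name `create_aligned_matrix` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def create_aligned_matrix(df, users, movies):
    """Pivot ratings into a user x movie matrix on a fixed set of labels."""
    # pivot_table averages duplicates if a user rated a movie more than once.
    mat = df.pivot_table(index='userId', columns='movieId', values='rating')
    # reindex adds users/movies missing from this split; unrated cells become 0.
    return mat.reindex(index=users, columns=movies).fillna(0.0)

# Hypothetical toy splits.
toy_train = pd.DataFrame({'userId': [1, 1, 2], 'movieId': [10, 20, 10], 'rating': [4.0, 3.0, 5.0]})
toy_test = pd.DataFrame({'userId': [2, 3], 'movieId': [20, 10], 'rating': [2.0, 1.0]})

all_ratings = pd.concat([toy_train, toy_test])
users = sorted(all_ratings['userId'].unique())
movies = sorted(all_ratings['movieId'].unique())

toy_train_matrix = create_aligned_matrix(toy_train, users, movies)
toy_test_matrix = create_aligned_matrix(toy_test, users, movies)
print(toy_train_matrix.shape == toy_test_matrix.shape)
```

With both matrices on the same labels, a test rating and the prediction for it sit at the same position, which is what a position-based error metric assumes.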

In [11]:
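from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' returns cosine *distance* (1 - cosine similarity),
# so 0 means identical rating vectors and larger values mean less similar.
user_similarity = pairwise_distances(train_matrix, metric='cosine')
item_similarity = pairwise_distances(train_matrix.T, metric='cosine')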
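It helps to keep in mind how cosine distance and cosine similarity relate, since the two run in opposite directions. The sketch below computes both with plain NumPy on a made-up matrix; `cosine_similarity_matrix` is a hypothetical helper, and its result should match scikit-learn's up to floating-point details.

```python
import numpy as np

def cosine_similarity_matrix(mat):
    """Row-by-row cosine similarity of a 2-D array."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = mat / norms
    return unit @ unit.T

# Hypothetical ratings: rows 0 and 1 are alike, row 2 shares nothing with them.
toy = np.array([[5.0, 0.0, 3.0],
                [4.0, 0.0, 3.0],
                [0.0, 2.0, 0.0]])

similarity = cosine_similarity_matrix(toy)
distance = 1.0 - similarity  # the quantity metric='cosine' reports

print(np.round(similarity, 3))
print(np.round(distance, 3))
```

Rows with no ratings in common get similarity 0 and distance 1, while near-identical rows get similarity close to 1 and distance close to 0.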

In [12]:
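def predict(ratings, similarity, type='user'):
    """Predict ratings as a weighted average over users or over items."""
    if type == 'user':
        # Centre each user's ratings on that user's mean (zeros included).
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        # Weighted sum of other users' deviations, normalised per user.
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.abs(similarity).sum(axis=1)[:, np.newaxis]
    elif type == 'item':
        # Weighted sum of the user's own ratings, normalised per item.
        pred = ratings.dot(similarity) / np.abs(similarity).sum(axis=1)[np.newaxis, :]
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred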
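Written out, the two branches of `predict` compute the following, where $r_{u,i}$ is the entry of `ratings` for user $u$ and movie $i$, $s(\cdot,\cdot)$ is the corresponding entry of the matrix passed as `similarity`, and $\bar r_u$ is the mean of row $u$ of `ratings` (zeros included, since unrated cells are stored as 0).

User-based:

$$\hat r_{u,i} = \bar r_u + \frac{\sum_{v} s(u,v)\,\bigl(r_{v,i} - \bar r_v\bigr)}{\sum_{v} \lvert s(u,v) \rvert}$$

Item-based:

$$\hat r_{u,i} = \frac{\sum_{j} r_{u,j}\, s(j,i)}{\sum_{j} \lvert s(i,j) \rvert}$$

In both cases the weights are whatever values the `similarity` argument contains, so the result depends on whether a similarity or a distance matrix is passed in.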

In [13]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same array.
item_prediction = predict(train_matrix.to_numpy(), item_similarity, type='item')
user_prediction = predict(train_matrix.to_numpy(), user_similarity, type='user')

In [14]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
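The `rmse` function scores only the cells where the ground truth holds a rating, and it selects them by position, so both arrays need the same shape and the same row and column order. A small worked example with NumPy only; `masked_rmse` and the toy arrays are hypothetical and chosen so the result can be checked by hand.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

toy_truth = np.array([[4.0, 0.0],
                      [0.0, 2.0]])
toy_pred = np.array([[3.0, 5.0],
                     [1.0, 4.0]])

# Only two cells are rated: errors are -1 and +2, so RMSE = sqrt((1 + 4) / 2).
print(masked_rmse(toy_pred, toy_truth))
```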

In [15]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_matrix.to_numpy())))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_matrix.to_numpy())))

User-based CF RMSE: 3.557193493162695
Item-based CF RMSE: 3.628474744819959


In [58]:
item_prediction = pd.DataFrame(item_similarity, index = train_matrix.columns, columns = train_matrix.columns )
item_prediction = item_prediction.stack().reset_index()
item_prediction.rename({'level_0':'movie1','level_1':'movie2',0:'score'},axis='columns',inplace=True)

In [59]:
item_prediction.head()

Unnamed: 0,movie1,movie2,score
0,1236,1236,0.0
1,1236,1320,0.859766
2,1236,1193,0.896463
3,1236,1645,0.891674
4,1236,1962,0.933397


In [60]:
movies_df = movies_df[['movieId','title']]
movies_df = movies_df.drop_duplicates()
movieMap = movies_df.set_index("movieId").title
movieMap = movieMap.loc[~movieMap.index.duplicated(keep='first')]

In [61]:
item_prediction['title1'] = item_prediction.movie1.map(movieMap)
item_prediction['title2'] = item_prediction.movie2.map(movieMap)
item_prediction = item_prediction[['title1','title2','score']]
item_prediction = pd.concat([pd.DataFrame(np.sort(item_prediction[['title1','title2']],1), index=item_prediction.index), item_prediction['score']], axis=1)

In [63]:
item_prediction.rename({0:'title1',1:'title2'},axis='columns',inplace=True)

In [None]:
# Titles end in "(YYYY)": take the last five characters and drop the closing parenthesis.
item_prediction['title1_year'] = item_prediction['title1'].str[-5:]
item_prediction['title1_year'] = item_prediction['title1_year'].str.replace(')', '', regex=False)
item_prediction['title1_year'] = item_prediction['title1_year'].astype(int, errors='ignore')
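Slicing the last five characters assumes every title ends in `(YYYY)`. A regular expression is a somewhat more defensive alternative, sketched below on a few sample titles; the third title is hypothetical and included only to show what happens when no year is present.

```python
import pandas as pd

titles = pd.Series([
    'Avengers, The (2012)',
    'Toy Story (1995)',
    'Title Without Year',  # hypothetical title with no year suffix
])

# Capture four digits inside parentheses at the end of the string.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)
# Titles without a match become NaN, so the numeric result is float, not int.
years = pd.to_numeric(years, errors='coerce')
print(years.tolist())
```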

In [90]:
item_prediction = item_prediction.drop_duplicates()
# score is a cosine distance, so ascending order lists the closest titles first.
item_prediction[item_prediction.title1=='Avengers, The (2012)'].sort_values('score',ascending=True)

Unnamed: 0,title1,title2,score,title1_year
6494126,"Avengers, The (2012)","Avengers, The (2012)",0.000000,2012
2266100,"Avengers, The (2012)",Guardians of the Galaxy (2014),0.274420,2012
6497299,"Avengers, The (2012)",Thor (2011),0.414472,2012
6494258,"Avengers, The (2012)",Iron Man (2008),0.436875,2012
6495154,"Avengers, The (2012)",X-Men: First Class (2011),0.441359,2012
6496308,"Avengers, The (2012)","Hobbit: The Desolation of Smaug, The (2013)",0.468035,2012
3803564,"Avengers, The (2012)","Dark Knight Rises, The (2012)",0.476154,2012
2764958,"Avengers, The (2012)",Elysium (2013),0.481796,2012
6494221,"Avengers, The (2012)",Zombieland (2009),0.484881,2012
6498559,"Avengers, The (2012)",Thor: The Dark World (2013),0.488188,2012
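Stacking the full item matrix into a long table produces one row per pair of movies, which grows quadratically with the catalogue. When only the neighbours of a single title are needed, they can be read from one row of the distance matrix instead. A minimal sketch, assuming a square distance array and a matching list of labels; `nearest_items` and the toy values are hypothetical.

```python
import numpy as np

def nearest_items(distance, labels, target, n=3):
    """Return the n labels with the smallest distance to target, excluding itself."""
    labels = list(labels)
    row = distance[labels.index(target)]
    order = np.argsort(row, kind='stable')  # ascending: closest first
    return [(labels[i], float(row[i])) for i in order if labels[i] != target][:n]

# Hypothetical 4-item distance matrix (symmetric, zero diagonal).
toy_labels = ['A', 'B', 'C', 'D']
toy_distance = np.array([[0.0, 0.2, 0.9, 0.5],
                         [0.2, 0.0, 0.7, 0.4],
                         [0.9, 0.7, 0.0, 0.3],
                         [0.5, 0.4, 0.3, 0.0]])

print(nearest_items(toy_distance, toy_labels, 'A', n=2))
```

In the notebook's terms, `distance` would correspond to `item_similarity` and `labels` to the movie IDs in `train_matrix.columns`, mapped to titles through `movieMap`.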
