<a href="https://colab.research.google.com/github/zayedupal/MovieRecommendationNotebook/blob/master/MovieLens_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Recommendation System 101**


This notebook uses the MovieLens small dataset: https://grouplens.org/datasets/movielens/latest/

In [0]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

PATH = '/content/'

# read all the data
ratings = pd.read_csv(PATH + 'ratings.csv')
movies = pd.read_csv(PATH + 'movies.csv')
tags = pd.read_csv(PATH + 'tags.csv')
links = pd.read_csv(PATH + 'links.csv')
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


**1. POPULARITY BASED RECOMMENDATION**

In [0]:
g = ratings.groupby('userId')['rating'].count()
np.unique(g)


In [0]:
g = ratings.groupby('movieId')['rating'].count()
np.unique(g)

Every user has given at least 20 ratings,
while the least-rated movies have only 1 rating.
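
The minimum counts can also be read off directly instead of scanning the `np.unique` output. A minimal sketch on a small made-up ratings table (the `toy_ratings` frame and its values are illustrative and not taken from MovieLens):

```python
import pandas as pd

# Hypothetical toy ratings; the notebook itself reads ratings.csv instead.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Number of ratings given by each user, and received by each movie
ratings_per_user = toy_ratings.groupby('userId')['rating'].count()
ratings_per_movie = toy_ratings.groupby('movieId')['rating'].count()

min_per_user = int(ratings_per_user.min())
min_per_movie = int(ratings_per_movie.min())
print(min_per_user, min_per_movie)
```

On the real data the same two `.min()` calls should reproduce the numbers quoted above.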

Next, we score every movie. The plain average rating is a poor score: a movie with only a few votes can have a very high average and outrank a movie with a slightly lower average but far more votes. To account for the number of votes, we use IMDb's weighted rating.




```
weighted rank (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

 where:
  R = average rating for the movie
  v = number of votes for the movie
  m = minimum votes required to be listed in the Top
  C = the mean vote across the whole report
```
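
To see the effect of the formula, here is a small worked example. The numbers (`C = 3.0`, `m = 50` and the two movies) are hypothetical and chosen only for illustration; `weighted_rating_value` is a helper name introduced here.

```python
def weighted_rating_value(R, v, m, C):
    """IMDb-style weighted rating: blends a movie's mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values chosen only for illustration
C = 3.0   # mean vote across all movies
m = 50    # minimum votes required

few_votes = weighted_rating_value(R=5.0, v=2, m=m, C=C)     # perfect mean, only 2 votes
many_votes = weighted_rating_value(R=4.2, v=500, m=m, C=C)  # lower mean, 500 votes
print(round(few_votes, 3), round(many_votes, 3))
```

The movie with two perfect votes is pulled strongly toward the global mean C, while the movie with 500 votes keeps a score close to its own average, so it ranks higher.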






In [0]:
movie_rating_mean = ratings.groupby('movieId',as_index=False)['rating'].mean()
movie_rating_mean.columns = ['movieId','avg_rating']
movie_rating_count = ratings.groupby('movieId',as_index=False)['rating'].count()
movie_rating_count.columns = ['movieId','votes']
movie_rating_mean_count = movie_rating_mean.merge(movie_rating_count,how='inner',on='movieId')
movie_rating_mean_count


Unnamed: 0,movieId,avg_rating,votes
0,1,3.920930,215
1,2,3.431818,110
2,3,3.259615,52
3,4,2.357143,7
4,5,3.071429,49
...,...,...,...
9719,193581,4.000000,1
9720,193583,3.500000,1
9721,193585,3.500000,1
9722,193587,3.500000,1


In [0]:
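# C: mean of the per-movie average ratings
C = movie_rating_mean_count['avg_rating'].mean()
# m: minimum number of votes to qualify (90th percentile of the vote counts)
m = movie_rating_mean_count['votes'].quantile(0.9)

def weighted_rating(x, m=1, C=1):
    """IMDb-style weighted rating for a row with 'votes' and 'avg_rating'.

    The defaults m=1 and C=1 apply unless m and C are passed explicitly.
    """
    v = x['votes']
    R = x['avg_rating']
    # Calculation based on the IMDb formula
    return (v / (v + m) * R) + (m / (m + v) * C)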

In [0]:
# .copy() avoids pandas' SettingWithCopyWarning when the 'score' column is added
movie_score = movie_rating_mean_count[movie_rating_mean_count['votes'] >= m].copy()
# m and C are not passed here, so weighted_rating falls back to its defaults
# (m=1, C=1); call .apply(weighted_rating, axis=1, m=m, C=C) to use the
# values computed above.
movie_score['score'] = movie_score.apply(weighted_rating, axis=1)
movie_score = movie_score.sort_values('score', ascending=False)
movie_score


Unnamed: 0,movieId,avg_rating,votes,score
277,318,4.429022,317,4.418239
659,858,4.289062,192,4.272021
2224,2959,4.272936,218,4.257991
602,750,4.268041,97,4.234694
921,1221,4.259690,129,4.234615
...,...,...,...,...
1173,1562,2.214286,42,2.186047
2028,2701,2.207547,53,2.185185
1234,1644,2.109375,32,2.075758
1372,1882,1.954545,33,1.926471
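
Note that `weighted_rating` only uses the computed m and C if they are passed in; otherwise its defaults (m=1, C=1) apply. A hedged sketch of passing them through `apply` on a made-up table (`toy` and its numbers are illustrative only, and the median is used as the threshold because the table is tiny). Extra keyword arguments given to `DataFrame.apply` are forwarded to the function.

```python
import pandas as pd

def weighted_rating(x, m=1, C=1):
    v = x['votes']
    R = x['avg_rating']
    return (v / (v + m) * R) + (m / (m + v) * C)

# Hypothetical per-movie statistics
toy = pd.DataFrame({
    'movieId':    [1, 2, 3, 4],
    'avg_rating': [4.5, 4.0, 3.0, 5.0],
    'votes':      [200, 100, 50, 1],
})

C = toy['avg_rating'].mean()     # mean of the per-movie averages
m = toy['votes'].quantile(0.5)   # vote threshold (median, for this tiny table)

# Keep only movies with enough votes, then score with the computed m and C
qualified = toy[toy['votes'] >= m].copy()
qualified['score'] = qualified.apply(weighted_rating, axis=1, m=m, C=C)
qualified = qualified.sort_values('score', ascending=False)
print(qualified)
```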


Return the top 10 movies according to the score

In [0]:
movie_top_ten = movie_score.head(10)
# isin() keeps the row order of `movies` (by index), not the score order
movie_top_ten = movies[movies['movieId'].isin(movie_top_ten['movieId'])]
movie_top_ten

Unnamed: 0,movieId,title,genres
46,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
277,318,"Shawshank Redemption, The (1994)",Crime|Drama
602,750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War
659,858,"Godfather, The (1972)",Crime|Drama
686,904,Rear Window (1954),Mystery|Thriller
906,1204,Lawrence of Arabia (1962),Adventure|Drama|War
914,1213,Goodfellas (1990),Crime|Drama
922,1221,"Godfather: Part II, The (1974)",Crime|Drama
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller
6315,48516,"Departed, The (2006)",Crime|Drama|Thriller
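
Because `isin` filters `movies` in its own row order, the table above is not sorted by score. One way to keep the ranking, sketched here on made-up frames (`toy_scores` and `toy_movies` are hypothetical), is to merge the titles onto the sorted scores:

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_scores = pd.DataFrame({
    'movieId': [3, 1, 2],
    'score': [4.4, 4.1, 3.9],   # already sorted, best first
})

# A left merge keeps the row order of the left frame, i.e. the score order
top = toy_scores.head(2).merge(toy_movies, on='movieId', how='left')
print(top[['title', 'score']])
```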


**2. CONTENT BASED RECOMMENDER**

A content-based recommender suggests movies whose content is similar to a given movie. For this dataset, the content can be described by the **genres** and **tags**.

We combine the two in a new dataframe named **movie_metadata**.

In [0]:
# Join all tags of a movie into one space-separated string
movie_tags = tags.groupby('movieId')['tag'].apply(list).reset_index(name='tag')
movie_tags['tag'] = movie_tags['tag'].apply(lambda x: ' '.join(x))
# The inner merge keeps only movies that have at least one tag
movie_genres_tags = movies.merge(movie_tags, how='inner', on='movieId')
movie_metadata = movie_genres_tags
movie_metadata['genres'] = movie_metadata['genres'].apply(lambda x: x.replace('|', ' '))
# No separator is added here, so the last genre and the first tag are fused into one token
movie_metadata['meta'] = movie_metadata['genres'] + movie_metadata['tag']
movie_metadata = movie_metadata.drop(columns=['genres', 'tag'])
movie_metadata

Unnamed: 0,movieId,title,meta
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasypix...
1,2,Jumanji (1995),Adventure Children Fantasyfantasy magic board ...
2,3,Grumpier Old Men (1995),Comedy Romancemoldy old
3,5,Father of the Bride Part II (1995),Comedypregnancy remake
4,7,Sabrina (1995),Comedy Romanceremake
...,...,...,...
1567,183611,Game Night (2018),Action Comedy Crime HorrorComedy funny Rachel ...
1568,184471,Tomb Raider (2018),Action Adventure Fantasyadventure Alicia Vikan...
1569,187593,Deadpool 2 (2018),Action Comedy Sci-FiJosh Brolin Ryan Reynolds ...
1570,187595,Solo: A Star Wars Story (2018),Action Adventure Children Sci-FiEmilia Clarke ...
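
In the `meta` column above, the last genre and the first tag run together (for example `Fantasypix...`), because the two strings are concatenated without a separator. A small sketch of joining them with a space instead, on made-up rows (`toy_meta` is hypothetical):

```python
import pandas as pd

toy_meta = pd.DataFrame({
    'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance'],
    'tag': ['magic board game', 'moldy old'],
})

# Turn the pipe-separated genres into space-separated words
genres = toy_meta['genres'].str.replace('|', ' ', regex=False)
# Insert a space between the genre string and the tag string
toy_meta['meta'] = genres + ' ' + toy_meta['tag']
print(toy_meta['meta'].tolist())
```

With the separator in place, the last genre and the first tag remain distinct tokens for the vectorizer.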


Next, use scikit-learn's **CountVectorizer** to turn each movie's metadata text into a vector of token counts.

In [0]:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words='english')

# Replace NaN with an empty string
movie_metadata['meta'] = movie_metadata['meta'].fillna('')

# Construct the required CountVectorizer matrix by fitting and transforming the data
count_matrix = count_vectorizer.fit_transform(movie_metadata['meta'])

# Output the shape of count_matrix
count_matrix.shape


(1572, 2375)
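
The shape above means 1572 movies (rows) and 2375 distinct tokens (columns). What such a count matrix contains can be sketched in plain Python. This is a simplification with made-up strings, not scikit-learn's implementation: the real `CountVectorizer` also lowercases the text, drops English stop words (as configured above) and applies its own token pattern.

```python
from collections import Counter

# Hypothetical metadata strings
docs = ['comedy romance remake', 'comedy pregnancy remake remake']

# Vocabulary: every distinct token, in sorted order
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary token
count_rows = []
for doc in docs:
    counts = Counter(doc.split())
    count_rows.append([counts[token] for token in vocab])

print(vocab)
print(count_rows)
```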

Now let's compute the similarity between each pair of movies using sklearn's **cosine_similarity**. Each row of the resulting matrix holds one movie's similarity to every movie in the dataset, including itself.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim.shape

(1572, 1572)
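
For intuition, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch on made-up count vectors (`movie_a`, `movie_b`, `movie_c` are hypothetical):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical count vectors over a 4-token vocabulary
movie_a = np.array([1, 1, 0, 0])
movie_b = np.array([1, 0, 1, 0])   # shares one of its two tokens with movie_a
movie_c = np.array([0, 0, 1, 1])   # shares no tokens with movie_a

print(cosine(movie_a, movie_a), cosine(movie_a, movie_b), cosine(movie_a, movie_c))
```

A movie compared with itself scores about 1, sharing half of the tokens gives about 0.5, and no overlap gives 0.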

In [0]:
count_matrix

<1572x2375 sparse matrix of type '<class 'numpy.int64'>'
	with 7310 stored elements in Compressed Sparse Row format>

In [0]:
# Construct a reverse map from movie titles to row indices
indices = pd.Series(movie_metadata.index, index=movie_metadata['title'])
indices

title
Toy Story (1995)                         0
Jumanji (1995)                           1
Grumpier Old Men (1995)                  2
Father of the Bride Part II (1995)       3
Sabrina (1995)                           4
                                      ... 
Game Night (2018)                     1567
Tomb Raider (2018)                    1568
Deadpool 2 (2018)                     1569
Solo: A Star Wars Story (2018)        1570
Gintama: The Movie (2010)             1571
Length: 1572, dtype: int64

Get Top 10 recommendations according to similarity

In [0]:
# Function that takes a movie title as input and returns the most similar movies
def get_recommendations(title, movie_metadf):
    print('you want recommendation for: ', title)
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the movie itself
    sim_scores = [(i, score) for i, score in sim_scores if i != idx][:10]
    print(sim_scores)

    # Return the titles of the 10 most similar movies
    movie_indices = [i for i, _ in sim_scores]
    return movie_metadf['title'].iloc[movie_indices]

In [0]:
get_recommendations('Batman Returns (1992)', movie_metadata)

you want recommendation for:  Batman Returns (1992)
[(43, 0.6324555320336758), (113, 0.4999999999999999), (360, 0.4999999999999999), (383, 0.4999999999999999), (384, 0.4999999999999999), (411, 0.4999999999999999), (412, 0.4999999999999999), (540, 0.4999999999999999), (553, 0.4999999999999999), (554, 0.4999999999999999)]

43                         Batman Forever (1995)
113                   In the Line of Fire (1993)
360    Butch Cassidy and the Sundance Kid (1969)
383                                  Jaws (1975)
384                              Jaws 3-D (1983)
411                             G.I. Jane (1997)
412                         Air Force One (1997)
540                            Siege, The (1998)
553                             Rocky III (1982)
554                              Rocky IV (1985)
Name: title, dtype: object
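
Many movies tie at the same similarity in the result above, so which of them make the top 10 depends on how ties are ordered (here: by row index). A sketch of the same selection with NumPy, using a stable sort so that ties keep index order; `top_n_similar` and the sample row are hypothetical:

```python
import numpy as np

def top_n_similar(sim_row, idx, n=3):
    """Indices of the n highest similarities in sim_row, excluding idx itself."""
    # A stable sort on the negated scores keeps ties in ascending index order
    order = np.argsort(-sim_row, kind='stable')
    return [int(i) for i in order if i != idx][:n]

# Hypothetical similarity row for the movie at index 1
sim_row = np.array([0.5, 1.0, 0.0, 0.63, 0.5])
print(top_n_similar(sim_row, idx=1, n=3))
```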

**3. COLLABORATIVE FILTERING**

**a. Singular value decomposition (SVD)**

In [0]:
pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1678260 sha256=437c200fafdc64bf08c760cab08c9b4d181de8e7f4d94f030c1175b6fb3e80d1
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [0]:
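from surprise import Reader, Dataset, SVD, model_selection
# MovieLens ratings range from 0.5 to 5.0; Reader() alone assumes a 1 to 5 scale
reader = Reader(rating_scale=(0.5, 5.0))
ratings.head()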

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [0]:
# Load the ratings into a Surprise dataset and set up an SVD model (it is fitted later).
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()
algo = SVD()

In [0]:
model_selection.cross_validate(algo,data,measures=['RMSE','MAE'])

{'fit_time': (4.0735249519348145,
  4.099108934402466,
  4.186619758605957,
  4.083319664001465,
  4.146083116531372),
 'test_mae': array([0.66677594, 0.66548906, 0.67150253, 0.68132314, 0.67317764]),
 'test_rmse': array([0.86719639, 0.87024851, 0.86642002, 0.88519781, 0.87721507]),
 'test_time': (0.12087416648864746,
  0.11387276649475098,
  0.21832275390625,
  0.11393904685974121,
  0.21335482597351074)}
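
The per-fold values are easier to compare once summarized. A small sketch that averages the five folds, with the numbers copied from the output above:

```python
import numpy as np

# Per-fold values copied from the cross_validate output above
test_rmse = np.array([0.86719639, 0.87024851, 0.86642002, 0.88519781, 0.87721507])
test_mae = np.array([0.66677594, 0.66548906, 0.67150253, 0.68132314, 0.67317764])

mean_rmse = test_rmse.mean()
mean_mae = test_mae.mean()
print(f'RMSE: {mean_rmse:.4f} +/- {test_rmse.std():.4f}')
print(f'MAE:  {mean_mae:.4f} +/- {test_mae.std():.4f}')
```

Since no random seed is set, a rerun will give slightly different folds and values.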

In [0]:
# taken from surprise docs
import collections
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = collections.defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
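
A quick way to check what `get_top_n` returns is to feed it a few hand-made predictions. In this sketch plain tuples stand in for Surprise `Prediction` objects, which unpack into the same five fields (uid, iid, true rating, estimate, details); the values are made up, and the function is repeated in compact form so the block runs on its own.

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    # Group (item, estimate) pairs by user
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # Keep each user's n highest estimates
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Hypothetical (uid, iid, true rating, estimate, details) tuples
fake_predictions = [
    (1, 'A', 3.5, 4.2, {}),
    (1, 'B', 3.5, 4.8, {}),
    (1, 'C', 3.5, 3.1, {}),
    (2, 'A', 3.5, 2.9, {}),
    (2, 'C', 3.5, 4.0, {}),
]

top = get_top_n(fake_predictions, n=2)
print(dict(top))
```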

In [0]:
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=10)
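
`build_anti_testset` yields every (user, item) pair that has no rating in the training set. The idea can be sketched in plain Python on made-up data (`known` and its IDs are hypothetical; this is not Surprise's implementation, which also attaches a fill value as a placeholder rating):

```python
from itertools import product

# Hypothetical known ratings as (user, item) pairs
known = {(1, 'A'), (1, 'B'), (2, 'A')}

users = sorted({u for u, _ in known})
items = sorted({i for _, i in known})

# Every user-item combination that has not been rated yet
anti_pairs = [(u, i) for u, i in product(users, items) if (u, i) not in known]
print(anti_pairs)
```

On a full dataset this set is large, since most users have rated only a small fraction of the movies.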



In [0]:
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

1 [318, 720, 3949, 912, 1250, 3508, 4226, 750, 4993, 5952]
2 [898, 56782, 246, 1732, 6016, 720, 1204, 4973, 910, 177593]
3 [109374, 930, 3681, 1104, 223, 260, 3266, 356, 1041, 3552]
4 [1278, 858, 4235, 5902, 38061, 1228, 1204, 562, 1035, 177593]
5 [1204, 750, 1250, 1278, 1136, 260, 858, 1219, 1104, 1945]
6 [58559, 1203, 2160, 898, 56782, 3147, 8368, 2398, 44195, 1266]
7 [2571, 318, 1213, 177593, 2959, 1272, 1201, 246, 3429, 296]
8 [1198, 1270, 2571, 78499, 1203, 904, 246, 1250, 912, 908]
9 [1204, 260, 858, 608, 1201, 7153, 58559, 1136, 7361, 3468]
10 [1204, 246, 1203, 899, 6385, 5902, 5747, 1217, 3275, 2150]
11 [58559, 1196, 1197, 899, 904, 4306, 1223, 91529, 3429, 527]
12 [50, 110, 260, 356, 457, 527, 1089, 1136, 1196, 1197]
13 [2959, 1204, 898, 1213, 1250, 296, 78499, 750, 1197, 7361]
14 [5618, 78499, 1104, 1215, 2324, 1267, 4226, 1233, 6502, 1283]
15 [1204, 608, 1223, 1732, 1213, 1225, 1617, 1203, 926, 922]
16 [1204, 741, 54997, 48516, 1248, 6016, 142488, 1233, 2324, 51255]
17 [1276