### The Data Life Podcast - Overview of Netflix and Spotify like recommendation engines. 
Welcome!
To get started we can use MovieLens dataset that has a collection of 100k ratings applied to 9,000 movies by 600 users. This is great to get started.

We can then create vector for each movie and the corresponding genre. So if a movie has genres “Adventure, Drama and Thriller” we can make a vector of 1s and 0s, 1s correspond to the genre this movie has and 0s corresponds to genres this movie does not have. 
Having created this vector, we can use cosine similarity to find movies with similar genres. So our friend Adam told us that he likes Interstellar, we will look up the genre vector for this movie and find cosine similarity of this vector with other movie genre vectors - that might give us Martian and Contact. 

For collaborative filtering algorithms, we will build user-product matrix where each user’s ratings for every other movie or song is in vector format. For sites like Spotify that do not use ratings, this matrix can just be 1s and 0s of streamed and unstreamed songs. 
We can then use the rows in this matrix to find similar users by using cosine similarity. Instead of movie genres we are just using user activity here. 
These kind of models are “neighbourhood based” or “memory based” collaborative filtering systems. 

There is one small problem to this approach. For 600 users and 10k movies, this works fine, but it does not scale for Netflix or Spotify level. They have 100s of millions of users and millions of products, how does calculating similarity between different users or items work at this scale? 

That is actually done via Matrix Factorization methods that basically reduce the dimensions of sparse matrices. We can rely on these methods to work much faster and scale better, the reason being that most users won’t interact with more than a few hundred items, and rest of the millions of items will not have any activity for them. These methods fall under “model based” collaborative filtering systems. SVD is a common technique which helps us find principal components of our user-movie matrix. So we build one matrix that identifies each users' preference for action movies vs romance movies vs drama movies etc. And we build another matrix that identifies each products' content in action, romance vs drama etc. And learn this automatically - not relying on humans creating these labels for us. That's what makes this so powerful.

In [860]:
import pandas as pd

### Users > Movie Ratings

In [861]:
df_ratings = pd.read_csv("ratings.csv")

In [862]:
df_ratings.shape

(100836, 4)

In [863]:
df_ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [864]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Personal Tags by Users

In [865]:
df_tags = pd.read_csv("tags.csv")

In [866]:
df_tags.shape

(3683, 4)

In [867]:
df_tags.isnull().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [868]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [869]:
df_tags["tag"].value_counts().head()

In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
funny                 23
Name: tag, dtype: int64

### Movie Tags Official

In [870]:
df_movies = pd.read_csv("movies.csv")

In [871]:
df_movies.shape

(9742, 3)

In [872]:
df_movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [873]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### External Links

In [874]:
df_links = pd.read_csv("links.csv")

In [875]:
df_links.shape

(9742, 3)

In [876]:
df_links.isnull().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

In [877]:
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


### Content Recommendation System

To make this we will do following steps: 

1) Build a user profile of what movies this user has liked in the past.  
2) Build a vector for each movie  
3) Find similar movies that user hasn't seen that are highly rated similar to what he/she has liked in past.

#### First, get movies that are liked by each user in a dictionary

In [878]:
user_liked_movies = {}
unique_users = list(set(df_ratings["userId"]))
print("Number of Unique Users is", len(unique_users))

for user_id in unique_users:
    user_df = df_ratings[df_ratings["userId"]==user_id]
    user_liked_movies[user_id] = user_df[user_df["rating"] > 4.5]["movieId"].tolist()

print("Movies liked by user 550", user_liked_movies[550])

Number of Unique Users is 610
Movies liked by user 550 [59315, 79132, 89745, 109487, 122904]


#### Second, build genres list to help vectorize them

In [879]:
genres = []

for _, row in df_movies.iterrows():
    genres.extend(row["genres"].split("|"))
    
genres = list(set(genres))
print("Total genres are {} and these are {}".format(len(genres), genres))

Total genres are 20 and these are ['Fantasy', 'Action', 'Thriller', 'Documentary', 'Film-Noir', 'Animation', 'Drama', 'Sci-Fi', 'Children', 'Adventure', 'Romance', 'Mystery', 'Comedy', 'Musical', '(no genres listed)', 'Horror', 'IMAX', 'Western', 'War', 'Crime']


In [880]:
removed_movies = df_movies[df_movies["genres"] == "(no genres listed)"]["movieId"].tolist()
df_movies[df_movies["genres"] == "(no genres listed)"].head()

Unnamed: 0,movieId,title,genres
8517,114335,La cravate (1957),(no genres listed)
8684,122888,Ben-hur (2016),(no genres listed)
8687,122896,Pirates of the Caribbean: Dead Men Tell No Tal...,(no genres listed)
8782,129250,Superfast! (2015),(no genres listed)
8836,132084,Let It Be Me (1995),(no genres listed)


#### Let's remove movies that have (no genres listed) and redo genre building

In [881]:
print(removed_movies)

[114335, 122888, 122896, 129250, 132084, 134861, 141131, 141866, 142456, 143410, 147250, 149330, 152037, 155589, 156605, 159161, 159779, 161008, 165489, 166024, 167570, 169034, 171495, 171631, 171749, 171891, 172497, 172591, 173535, 174403, 176601, 181413, 181719, 182727]


In [882]:
df_movies = df_movies[df_movies["genres"] != "(no genres listed)"].reset_index()
genres = []

for _, row in df_movies.iterrows():
    genres.extend(row["genres"].split("|"))
    
genres = sorted(set(genres))
print("Total genres are {} and these are {}".format(len(genres), genres))

Total genres are 19 and these are ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


In [883]:
print(df_movies.loc[1284])

index                          1284
movieId                        1704
title      Good Will Hunting (1997)
genres                Drama|Romance
Name: 1284, dtype: object


#### Third, let's vectorize each of the movie's genres

In [884]:
df_movies["feature_vector"] = [[0]*len(genres) for i in range(len(df_movies))]
df_movies.head()

Unnamed: 0,index,movieId,title,genres,feature_vector
0,0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,1,2,Jumanji (1995),Adventure|Children|Fantasy,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,2,3,Grumpier Old Men (1995),Comedy|Romance,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,4,5,Father of the Bride Part II (1995),Comedy,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [885]:
def annotate_vector(x):
    v = [0 for i in range(len(genres))]
    try:
        this_movie_genres = x["genres"].split("|")
    except:
        print(x)
    for g in this_movie_genres:
        v[genres.index(g)] = 1
    x["feature_vector"] = v
    return x

In [886]:
df_movies = df_movies.apply(annotate_vector, axis=1)
        
print(df_movies.shape)
df_movies.head()

(9708, 5)


Unnamed: 0,index,movieId,title,genres,feature_vector
0,0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
1,1,2,Jumanji (1995),Adventure|Children|Fantasy,"[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,2,3,Grumpier Old Men (1995),Comedy|Romance,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..."
3,3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, ..."
4,4,5,Father of the Bride Part II (1995),Comedy,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [887]:
df_movies.loc[df_movies['movieId']==5].index[0]

4

#### 4. Write a cosine similarity function

In [888]:
import math
def cosine_similarity_self(v1, v2):
    numerator, denominator, v1mag, v2mag, cosine_sim = 0.0, 0.0, 0.0, 0.0, 0.0 
    for i in range(len(v1)):
        numerator += v1[i]*v2[i]
        v1mag += v1[i] * v1[i]
        v2mag += v2[i] * v2[i]
    
    denominator = math.sqrt(v1mag) * math.sqrt(v2mag)
    if denominator == 0.0:
        return 0.0
    cosine_sim = numerator/ denominator
    return round(cosine_sim, 4)
    

In [889]:
print(cosine_similarity_self([0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]))
print(cosine_similarity_self([5, 5, 5], [1, 1, 1]))


0.3162
1.0


In [890]:
print(df_movies.loc[1, "feature_vector"])

[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


#### Let's calculate pairwise cosine similarity, to prevent re-calculating this for every movie for every user.

In [891]:
from sklearn.metrics.pairwise import cosine_similarity
print(df_movies["feature_vector"].tolist()[:5])
print(cosine_similarity(df_movies["feature_vector"].tolist()[:5]))

[[0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
[[1.         0.77459667 0.31622777 0.25819889 0.4472136 ]
 [0.77459667 1.         0.         0.         0.        ]
 [0.31622777 0.         1.         0.81649658 0.70710678]
 [0.25819889 0.         0.81649658 1.         0.57735027]
 [0.4472136  0.         0.70710678 0.57735027 1.        ]]


In [892]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_matrix = cosine_similarity(df_movies["feature_vector"].tolist())
print(cosine_matrix, cosine_matrix.shape)

[[1.         0.77459667 0.31622777 ... 0.         0.31622777 0.4472136 ]
 [0.77459667 1.         0.         ... 0.         0.         0.        ]
 [0.31622777 0.         1.         ... 0.         0.         0.70710678]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.31622777 0.         0.         ... 0.         1.         0.        ]
 [0.4472136  0.         0.70710678 ... 0.         0.         1.        ]] (9708, 9708)


#### Test cosine similarity function

In [893]:
from sklearn.metrics.pairwise import cosine_similarity
a = df_movies.loc[9653]["feature_vector"]
b = df_movies.loc[507]["feature_vector"]
cosine_similarity([a, b])

a = (1,1,1,2) # now normalize and see 
b = (10,10,10,8) # normalize and see below

print(cosine_similarity([a, b]))
a, b = [-0.25000, -0.25000, -0.25000, 0.75000], [0.50000, 0.50000, 0.50000, -1.50000]
cosine_similarity([a, b])

[[1.         0.91129318]
 [0.91129318 1.        ]]


array([[ 1., -1.],
       [-1.,  1.]])

#### Time for recommending movies to users!

In [894]:
print(cosine_matrix, cosine_matrix[507])

[[1.         0.77459667 0.31622777 ... 0.         0.31622777 0.4472136 ]
 [0.77459667 1.         0.         ... 0.         0.         0.        ]
 [0.31622777 0.         1.         ... 0.         0.         0.70710678]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.31622777 0.         0.         ... 0.         1.         0.        ]
 [0.4472136  0.         0.70710678 ... 0.         0.         1.        ]] [0.  0.  0.  ... 0.  0.5 0. ]


In [895]:
def get_movie_title(index_id, df):
    # return df.loc[df["movieId"] == movie_id, "title"].tolist()[0] 
    return df.loc[index_id, "title"]
print(get_movie_title(1142, df_movies))

Anna Karenina (1997)


In [896]:
import numpy as np

def content_recommender_for_user(user_id):
    liked_movie_ids = user_liked_movies[user_id]
    for liked_movie_id in liked_movie_ids:
        if liked_movie_id in removed_movies:
            continue
            
        # Get index of this movie in df
        idx = df_movies.loc[df_movies['movieId']==liked_movie_id].index[0]
        liked_movie_title = get_movie_title(idx, df_movies)
        
        # idx = df_movies.index[df_movies['movieId'] == liked_movie_id].tolist()[0]
        print(liked_movie_title, idx, liked_movie_id)
        cosine_sims = np.copy(cosine_matrix[idx])  # Get cosine sim of this movie to all other movies
        # Find top 5 highest movies using argsort in the pairwise cosine sim matrix
        best_movies_to_recommend = cosine_sims.argsort()[::-1][:5]
        print("best movies to recommend", best_movies_to_recommend, np.sort(cosine_sims)[::-1][:5])
        for m in best_movies_to_recommend:
            if m != idx:
                print("Because you liked {}, we recommend {}".format(liked_movie_title, get_movie_title(m, df_movies)))

### Content Recommender given a particular user
Given a user, find which movies they have liked in the past and find similar movies based on genres

In [897]:
content_recommender_for_user(10)

Troy (2004) 4948 7458
best movies to recommend [4877 6457 4795 5707 4948] [1. 1. 1. 1. 1.]
Because you liked Troy (2004), we recommend Flesh & Blood (1985)
Because you liked Troy (2004), we recommend Inglorious Bastards (Quel maledetto treno blindato) (1978)
Because you liked Troy (2004), we recommend Last Samurai, The (2003)
Because you liked Troy (2004), we recommend Saints and Soldiers (2003)
Notebook, The (2004) 5227 8533
best movies to recommend [1144 1142 8072 1268 3651] [1. 1. 1. 1. 1.]
Because you liked Notebook, The (2004), we recommend Inventing the Abbotts (1997)
Because you liked Notebook, The (2004), we recommend Anna Karenina (1997)
Because you liked Notebook, The (2004), we recommend Misérables, Les (2000)
Because you liked Notebook, The (2004), we recommend Wings of the Dove, The (1997)
Because you liked Notebook, The (2004), we recommend Monster's Ball (2001)
First Daughter (2004) 5332 8869
best movies to recommend [4688 3777  361 3792 3790] [1. 1. 1. 1. 1.]
Because yo

### Content Recommender given a particular movie
Given a movie, find similar movies based on common genres


In [986]:
def get_movie_genres(index_id, df):
    # return df.loc[df["movieId"] == movie_id, "title"].tolist()[0] 
    return df.loc[index_id, "genres"]
print(get_movie_genres(1142, df_movies))

Drama|Romance


In [899]:
def content_recommender_for_movie(movie_title):
    # Get index of this movie in df
    idx = df_movies[df_movies["title"].str.contains(movie_title)].index[0]

    # idx = df_movies.index[df_movies['movieId'] == liked_movie_id].tolist()[0]
    print("Movie you picked is {} with genres {} and index {}".format(get_movie_title(idx, df_movies),
                                                                    get_movie_genres(idx, df_movies), idx))
    cosine_sims = np.copy(cosine_matrix[idx])  # Get cosine sim of this movie to all other movies
    # print(pd.Series(cosine_sims).sort_values(ascending=False))
    # Find top 5 highest movies using argsort in the pairwise cosine sim matrix
    best_movies_to_recommend = cosine_sims.argsort()[::-1][:10]
    print("best movies to recommend", best_movies_to_recommend, np.sort(cosine_sims)[::-1][:10])
    for m in best_movies_to_recommend:
        if m != idx:
            print("We recommend {} with genres {}".format(get_movie_title(m, df_movies), get_movie_genres(m, df_movies)))

In [927]:
content_recommender_for_movie("Return of the Jedi")

Movie you picked is Star Wars: Episode VI - Return of the Jedi (1983) with genres Action|Adventure|Sci-Fi and index 911
best movies to recommend [3599 7627  911 5896 4658 6791 3731 1692 9064  176] [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
We recommend Flash Gordon (1980) with genres Action|Adventure|Sci-Fi
We recommend Green Lantern (2011) with genres Action|Adventure|Sci-Fi
We recommend Star Wars: Episode III - Revenge of the Sith (2005) with genres Action|Adventure|Sci-Fi
We recommend Timeline (2003) with genres Action|Adventure|Sci-Fi
We recommend Journey to the Center of the Earth (2008) with genres Action|Adventure|Sci-Fi
We recommend Time Machine, The (2002) with genres Action|Adventure|Sci-Fi
We recommend Six-String Samurai (1998) with genres Action|Adventure|Sci-Fi
We recommend Our Brand Is Crisis (2015) with genres Comedy|Drama
We recommend Waterworld (1995) with genres Action|Adventure|Sci-Fi


## Things to improve content recommendation engine:
1) Remove years from movie titles and use years in some way 

2) Do ranking within the filtered movies for user above to select movies based on User-User CF or some other technique 

3) Only look at 5 star movies first for each user, and then 4.5 and so on 

4) What if the only movie liked by the user has no genres?

## Now, let's try User-User Collaborative Filtering system
We could have chosen Item-Item Similarity also but users are more sparse and fewer, thats why its better, also somewhat more fun to find similar users also. In real life, Amazon and others use item-item CF because items are lesser than total users.

In [908]:
df_movies = pd.read_csv("movies.csv")

In [909]:
df_movies.shape

(9742, 3)

In [910]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [911]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [912]:
df_movies.loc[df_movies["movieId"]==33669]

Unnamed: 0,movieId,title,genres
5908,33669,"Sisterhood of the Traveling Pants, The (2005)",Adventure|Comedy|Drama


In [913]:
x = df_ratings.loc[df_ratings["userId"]==2]
x[x["rating"]>4.5]

Unnamed: 0,userId,movieId,rating,timestamp
241,2,60756,5.0,1445714980
248,2,80906,5.0,1445715172
250,2,89774,5.0,1445715189
254,2,106782,5.0,1445714966
259,2,122882,5.0,1445715272
260,2,131724,5.0,1445714851


In [914]:
df_ratings.shape

(100836, 4)

In [915]:
movie_index = {}
for i, m in enumerate(df_movies["movieId"]):
    movie_index[m] = i
print(len(movie_index))

9742


In [916]:
# Which movie Ids not in ratings?
# df_movies.join(df_ratings, on="movieId", how="left", lsuffix='_left', rsuffix='_right')
c = 0 
ans = []
for m in df_movies["movieId"]:
    if m not in df_ratings["movieId"]:
        if m not in ans:
            c = c+1
            ans.append(m)
        
print(df_movies.loc[df_movies["movieId"]==m])
print(c)

      movieId                                title  genres
9741   193609  Andrew Dice Clay: Dice Rules (1991)  Comedy
1629


In [917]:
all_movies = df_movies["movieId"].unique()
len(all_movies)

9742

In [918]:
all_users = df_ratings["userId"].unique()

In [919]:
len(df_ratings["userId"].unique())

610

### Build a user movie matrix 
Each row corresponds to user and each column represents movie. Every cell corresponds to rating given by a user for a movie.

In [920]:
user_movie_matrix = []
for user_id in all_users:
    user_vector = [0 for m in all_movies]
    user_df = df_ratings.loc[df_ratings["userId"]==user_id]
    for _, row in user_df.iterrows():
        user_vector[movie_index[row["movieId"]]] = row["rating"]
    user_movie_matrix.append(user_vector)

print("The shape of matrix is {} x {}".format(len(user_movie_matrix), len(user_movie_matrix[0])))
# for user in all_users:
    

The shape of matrix is 610 x 9742


In [921]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_user_matrix = cosine_similarity(user_movie_matrix)
print(cosine_user_matrix, cosine_user_matrix.shape)

[[1.         0.02728287 0.05972026 ... 0.29109737 0.09357193 0.14532081]
 [0.02728287 1.         0.         ... 0.04621095 0.0275654  0.10242675]
 [0.05972026 0.         1.         ... 0.02112846 0.         0.03211875]
 ...
 [0.29109737 0.04621095 0.02112846 ... 1.         0.12199271 0.32205486]
 [0.09357193 0.0275654  0.         ... 0.12199271 1.         0.05322546]
 [0.14532081 0.10242675 0.03211875 ... 0.32205486 0.05322546 1.        ]] (610, 610)


In [922]:
df_movies[df_movies["movieId"].isin([1,23, 123124])]["title"].tolist()

['Toy Story (1995)', 'Assassins (1995)']

### Collaborative Filtering for user

In [979]:
def cf_recommender_for_user(user_id, cosine_user_matrix=cosine_user_matrix):
    liked_movie_ids = user_liked_movies[user_id]
    #df[df['A'].isin([3, 6])]
    print("Some movies liked by this user", df_movies[df_movies["movieId"].isin(liked_movie_ids)]["title"].tolist())
    idx = list(all_users).index(user_id)
    print(idx)
    cosine_sims_for_this_user = np.copy(cosine_user_matrix[idx]) 
    # Top 5 most similar users
    similar_users = cosine_sims_for_this_user.argsort()[::-1][:10]
    print("best users to recommend", similar_users, np.sort(cosine_sims_for_this_user)[::-1][:5])
    for u in similar_users:
        if u!=user_id:
            movies_to_recommend = user_liked_movies[u]
            print(movies_to_recommend)
            for m in movies_to_recommend[:5]:
                if m not in liked_movie_ids:
                    print("We recommend {} with genres {}".format(df_movies.loc[df_movies["movieId"]==m, "title"].tolist()[0], 
                                                              df_movies.loc[df_movies["movieId"]==m]["genres"].tolist()[0]))


In [929]:
cf_recommender_for_user(10)

Some movies liked by this user ['Troy (2004)', 'Notebook, The (2004)', 'First Daughter (2004)', 'Batman Begins (2005)', 'Casino Royale (2006)', 'Holiday, The (2006)', 'Education, An (2009)', 'Despicable Me (2010)', "King's Speech, The (2010)", 'Dark Knight Rises, The (2012)', 'Intouchables (2011)', 'Skyfall (2012)', 'Spectre (2015)', 'The Intern (2015)']
9
best users to recommend [  9 158 142 562 176 188 508 490 465  67] [1.         0.28826463 0.27390192 0.26435889 0.24630531]
[923, 1198, 1270, 2300, 4993, 5481, 5902, 5952]
We recommend Citizen Kane (1941) with genres Drama|Mystery
We recommend Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) with genres Action|Adventure
We recommend Back to the Future (1985) with genres Adventure|Comedy|Sci-Fi
We recommend Producers, The (1968) with genres Comedy
We recommend Lord of the Rings: The Fellowship of the Ring, The (2001) with genres Adventure|Fantasy
[2490]
We recommend Payback (1999) with genres Action|Thrill

### Summary of Content vs Collaborative Filtering
User user CF is much better overall - somehow from the look of it it seems it is able to capture freshness in an amazing way, also we were able to use quality of judgment of others by recommending movies that actually have good ratings by other similar users. In Content Recommendation case, we were just recommending similar items based on genre without taking rating in to account.


### Next, try matrix factorization methods
First try NMF and then SVD, look at results and see if they make sense.

In [991]:
from sklearn.decomposition import NMF
model = NMF(n_components=50, init='random', random_state=0)
W = model.fit_transform(user_movie_matrix)
H = model.components_

In [992]:
print(W.shape)
W

(610, 50)


array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.02999298, 0.        , 0.        , ..., 0.        , 0.28837545,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [993]:
print(H.T.shape)
H.T

(9742, 50)


array([[0.21860701, 0.09536652, 0.29798082, ..., 0.84760486, 0.23298279,
        0.        ],
       [0.2415129 , 0.07280174, 0.15206926, ..., 0.        , 0.23960137,
        0.54777911],
       [0.16771816, 0.10291515, 0.27788272, ..., 0.        , 0.00122041,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [994]:
H.T[0].shape

(50,)

In [995]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_user_matrix_nmf = cosine_similarity(W)
print(cosine_user_matrix_nmf, cosine_user_matrix_nmf.shape)

[[1.00000000e+00 2.15372533e-02 6.74456305e-01 ... 1.39227663e-01
  2.95907548e-03 2.23155834e-05]
 [2.15372533e-02 1.00000000e+00 0.00000000e+00 ... 1.02461000e-01
  6.60409085e-03 0.00000000e+00]
 [6.74456305e-01 0.00000000e+00 1.00000000e+00 ... 1.92349823e-01
  0.00000000e+00 1.66114955e-01]
 ...
 [1.39227663e-01 1.02461000e-01 1.92349823e-01 ... 1.00000000e+00
  3.07839424e-03 1.96994099e-06]
 [2.95907548e-03 6.60409085e-03 0.00000000e+00 ... 3.07839424e-03
  1.00000000e+00 1.46222986e-04]
 [2.23155834e-05 0.00000000e+00 1.66114955e-01 ... 1.96994099e-06
  1.46222986e-04 1.00000000e+00]] (610, 610)


In [996]:
print(cosine_user_matrix, cosine_user_matrix.shape)

[[1.         0.02728287 0.05972026 ... 0.29109737 0.09357193 0.14532081]
 [0.02728287 1.         0.         ... 0.04621095 0.0275654  0.10242675]
 [0.05972026 0.         1.         ... 0.02112846 0.         0.03211875]
 ...
 [0.29109737 0.04621095 0.02112846 ... 1.         0.12199271 0.32205486]
 [0.09357193 0.0275654  0.         ... 0.12199271 1.         0.05322546]
 [0.14532081 0.10242675 0.03211875 ... 0.32205486 0.05322546 1.        ]] (610, 610)


Nice resource to learn about details of SVD http://nicolas-hug.com/blog/matrix_facto_3 - creator of Suprise package 

SVD Implementation below from From https://beckernick.github.io/matrix-factorization-recommender/


In [967]:
import numpy as np
# R = cosine_user_matrix.as_matrix()
user_ratings_mean = np.mean(user_movie_matrix, axis = 1)
user_movie_matrix_demeaned = user_movie_matrix - user_ratings_mean.reshape(-1, 1)

In [968]:
print(user_movie_matrix_demeaned, user_movie_matrix_demeaned.shape)

[[ 3.89601724 -0.10398276  3.89601724 ... -0.10398276 -0.10398276
  -0.10398276]
 [-0.01175323 -0.01175323 -0.01175323 ... -0.01175323 -0.01175323
  -0.01175323]
 [-0.00975159 -0.00975159 -0.00975159 ... -0.00975159 -0.00975159
  -0.00975159]
 ...
 [ 2.23265243  1.73265243  1.73265243 ... -0.26734757 -0.26734757
  -0.26734757]
 [ 2.98757955 -0.01242045 -0.01242045 ... -0.01242045 -0.01242045
  -0.01242045]
 [ 4.50703141 -0.49296859 -0.49296859 ... -0.49296859 -0.49296859
  -0.49296859]] (610, 9742)


In [973]:
# https://beckernick.github.io/matrix-factorization-recommender/
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(user_movie_matrix_demeaned, k = 50)

In [974]:
print(U.shape, sigma.shape, Vt.shape)

(610, 50) (50,) (50, 9742)


In [975]:
sigma = np.diag(sigma)

In [976]:
print(U.shape, sigma.shape, Vt.shape)

(610, 50) (50, 50) (50, 9742)


In [977]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_user_matrix_svd = cosine_similarity(U)
print(cosine_user_matrix_svd, cosine_user_matrix_svd.shape)

[[ 1.         -0.18804202  0.10409503 ... -0.09132694 -0.01344317
  -0.04456514]
 [-0.18804202  1.          0.03868282 ...  0.10333583 -0.0309986
  -0.08058716]
 [ 0.10409503  0.03868282  1.         ...  0.1693497  -0.17250777
   0.14641208]
 ...
 [-0.09132694  0.10333583  0.1693497  ...  1.         -0.14915979
   0.00459655]
 [-0.01344317 -0.0309986  -0.17250777 ... -0.14915979  1.
  -0.01122566]
 [-0.04456514 -0.08058716  0.14641208 ...  0.00459655 -0.01122566
   1.        ]] (610, 610)


In [980]:
cf_recommender_for_user(10, cosine_user_matrix=cosine_user_matrix_svd)

Some movies liked by this user ['Troy (2004)', 'Notebook, The (2004)', 'First Daughter (2004)', 'Batman Begins (2005)', 'Casino Royale (2006)', 'Holiday, The (2006)', 'Education, An (2009)', 'Despicable Me (2010)', "King's Speech, The (2010)", 'Dark Knight Rises, The (2012)', 'Intouchables (2011)', 'Skyfall (2012)', 'Spectre (2015)', 'The Intern (2015)']
9
best users to recommend [  9 142 562 577 158 597 489 465 147 307] [1.         0.68104754 0.62562493 0.52782395 0.52728088]
[923, 1198, 1270, 2300, 4993, 5481, 5902, 5952]
We recommend Citizen Kane (1941) with genres Drama|Mystery
We recommend Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) with genres Action|Adventure
We recommend Back to the Future (1985) with genres Adventure|Comedy|Sci-Fi
We recommend Producers, The (1968) with genres Comedy
We recommend Lord of the Rings: The Fellowship of the Ring, The (2001) with genres Adventure|Fantasy
[50, 150, 161, 296, 527, 588, 593]
We recommend Usual Suspec