[View in Colaboratory](https://colab.research.google.com/github/saranyamandava/Lambda-School-DataScience/blob/master/Week8__Coding_Challenge_2_Collaborative_Filtering.ipynb)

**Coding Challenge #2** - Collaborative Filtering

**Coding Challenge: Context**

With collaborative filtering, an application can find users with similar tastes, look at the items they like, and combine those items into a ranked list of suggestions; this is known as user-based recommendation. Alternatively, it can find items that are similar to each other and suggest them to users based on their past purchases; this is known as item-based recommendation. The first step in either approach is to find users with similar tastes or items that are similar to each other.
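
To make the user-based idea concrete, here is a minimal sketch on a made-up ratings matrix. The names `toy_ratings` and `recommend_for` are hypothetical, and the scoring rule (a similarity-weighted average that treats unrated items as 0) is one simple choice among many, not the method this challenge prescribes.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings matrix (hypothetical data): rows are users, columns are items,
# 0 means "not rated"
toy_ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [1, 0, 5, 4]],
    index=['alice', 'bob', 'carol'],
    columns=['item_a', 'item_b', 'item_c', 'item_d'],
    dtype=float)


def recommend_for(user, ratings, n=2):
    """Rank the items `user` has not rated by similarity-weighted ratings."""
    sim = pd.DataFrame(cosine_similarity(ratings.values),
                       index=ratings.index, columns=ratings.index)
    weights = sim.loc[user].drop(user)   # similarity to every other user
    others = ratings.drop(index=user)    # ratings given by the other users
    # Weighted average of the other users' ratings for each item
    # (simplification: an unrated item counts as a rating of 0)
    scores = others.mul(weights, axis=0).sum() / weights.sum()
    unseen = ratings.loc[user] == 0
    return scores[unseen].sort_values(ascending=False).head(n)


toy_recommendations = recommend_for('alice', toy_ratings)
print(toy_recommendations)
```

Here `bob` rates the shared items much like `alice` does, so his rating of `item_c` dominates the ranking for her.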

There are various similarity measures, such as **Cosine Similarity, Euclidean Distance Similarity and Pearson Correlation Similarity**, which can be used to find similarity between users or items.
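
As a rough illustration of how these three measures differ, the sketch below computes each one for a pair of made-up rating vectors. The helper names are hypothetical, and mapping a Euclidean distance `d` to a similarity with `1 / (1 + d)` is an assumed convention; other mappings are also used.

```python
import numpy as np


def cosine_sim(a, b):
    # Cosine of the angle between the two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_sim(a, b):
    # Pearson correlation: cosine similarity of the mean-centered vectors
    return float(np.corrcoef(a, b)[0, 1])


def euclidean_sim(a, b):
    # Assumed convention: turn distance d into a similarity 1 / (1 + d)
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))


u = np.array([5.0, 3.0, 4.0, 4.0])
v = np.array([3.0, 1.0, 2.0, 2.0])  # same pattern as u, shifted down by 2

print('cosine   :', cosine_sim(u, v))
print('pearson  :', pearson_sim(u, v))
print('euclidean:', euclidean_sim(u, v))
```

Because `v` is just `u` shifted by a constant, the Pearson coefficient is 1 while the Euclidean-based similarity is low: Pearson ignores a user's overall rating level, Euclidean distance does not.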

In this coding challenge, you will go through the process of identifying users that are similar ("User Similarity") and items that are similar ("Item Similarity").

**User Similarity:**

**1a)** Compute "User Similarity" based on the cosine similarity coefficient (the other commonly used similarity coefficients are the Pearson correlation coefficient and Euclidean distance)

**1b)** Based on the cosine similarity coefficient, identify 2 users who are similar and then discover common movie names that have been rated by the 2 users; examine how the similar users have rated the movies

**Item Similarity:**

**2a)** Compute "Item Similarity" based on the Pearson Correlation Similarity Coefficient

**2b)** Pick 2 movies and find movies that are similar to the movies you have picked

**Challenges:**

**3)** Do you foresee any issue(s) associated with Collaborative Filtering?

**Dataset:** For the purposes of this challenge, we will use the data set available at https://grouplens.org/datasets/movielens/

The data set is posted under the section ***recommended for education and development***, and we will stick to the small version of the data set with 100,000 ratings.

In [0]:
# Imports and reading in data
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Distance measurements
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import correlation

# Read in the data
df_movies = pd.read_csv(
    'https://www.dropbox.com/s/qo7v9k5rcwt7wgh/movies.csv?raw=1')
df_ratings = pd.read_csv(
    'https://www.dropbox.com/s/f2h5px87n2lfqj7/ratings.csv?raw=1')
df_movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [0]:
# How many movies are there?
print(df_movies.movieId.count())

# How many (unique) users are there?
print(df_ratings.userId.nunique())

9125
671


In [0]:
df_ratings.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [0]:
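# Preview the ratings without the timestamp column. The result is not
# assigned, so df_ratings itself keeps the column (a later cell relies on it)
df_ratings.drop(['timestamp'], axis=1)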

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0
5,1,1263,2.0
6,1,1287,2.0
7,1,1293,2.0
8,1,1339,3.5
9,1,1343,2.0


In [0]:
# Transform the data so rows are users, columns are movies, values are ratings
df_ratings_pivot = pd.pivot_table(df_ratings, index='userId', columns='movieId',
                                  values='rating').fillna(0)
print(df_ratings_pivot)

movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
2           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
5           0.0     0.0     4.0     0.0     0.0     0.0     0.0     0.0   
6           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
7           3.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
8           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
9           4.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
10          0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
11          0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
12          0.0     0.0  
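
Filling missing ratings with 0 makes "not rated" look like a very low rating, and most cells in the pivot are such zeros: with about 100,000 ratings spread over 671 users and roughly 9,000 movies, fewer than 2% of the cells hold a real rating. A small sketch of how that density could be measured, using made-up data and the hypothetical names `toy_long` and `toy_pivot`:

```python
import pandas as pd

# Hypothetical long-format ratings, same column names as df_ratings
toy_long = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Pivot WITHOUT fillna(0) so that missing ratings stay NaN
toy_pivot = pd.pivot_table(toy_long, index='userId', columns='movieId',
                           values='rating')

# Share of user/movie cells that contain an actual rating
density = toy_pivot.notna().values.mean()
print('density: {:.3f}'.format(density))
```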

In [0]:
# .values replaces as_matrix(), which was removed in pandas 1.0
cos_similarity = cosine_similarity(df_ratings_pivot.values)
print(cos_similarity)  # User similarity matrix
print(cos_similarity.shape)

[[1.         0.         0.         ... 0.06291708 0.         0.01746565]
 [0.         1.         0.12429498 ... 0.02413984 0.17059464 0.1131753 ]
 [0.         0.12429498 1.         ... 0.08098382 0.13660585 0.17019275]
 ...
 [0.06291708 0.02413984 0.08098382 ... 1.         0.04260878 0.08520194]
 [0.         0.17059464 0.13660585 ... 0.04260878 1.         0.22867673]
 [0.01746565 0.1131753  0.17019275 ... 0.08520194 0.22867673 1.        ]]
(671, 671)


In [0]:
# We want to find users similar to each other, other than themselves
np.fill_diagonal(cos_similarity, 0)
print(cos_similarity)

[[0.         0.         0.         ... 0.06291708 0.         0.01746565]
 [0.         0.         0.12429498 ... 0.02413984 0.17059464 0.1131753 ]
 [0.         0.12429498 0.         ... 0.08098382 0.13660585 0.17019275]
 ...
 [0.06291708 0.02413984 0.08098382 ... 0.         0.04260878 0.08520194]
 [0.         0.17059464 0.13660585 ... 0.04260878 0.         0.22867673]
 [0.01746565 0.1131753  0.17019275 ... 0.08520194 0.22867673 0.        ]]


In [0]:
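# Answering question 1a, user similarity based on cosine similarity coefficient
df_cos_sim = pd.DataFrame(cos_similarity)
# For each user (row), find the column holding the highest similarity.
# Labels are 0-based positions here, so position p corresponds to userId p + 1
max_col_idx = df_cos_sim.idxmax(axis=1)
# Sample 5 values randomly from the Series, random_state to reproduce
print(max_col_idx.sample(5, random_state=15))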

200    294
40     196
90      41
218    561
365    396
dtype: int64
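
The labels printed by this cell are 0-based row positions, not userIds. One way to avoid translating between the two by hand is to carry the pivot's index into the similarity DataFrame. The sketch below shows the idea on made-up data; `toy_pivot`, `labeled_sim` and `most_similar` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x movie matrix whose userIds do not start at 0
toy_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 1.0, 5.0]],
    index=pd.Index([101, 102, 103], name='userId'),
    columns=pd.Index([1, 2, 3], name='movieId'))

sim = cosine_similarity(toy_pivot.values)
np.fill_diagonal(sim, 0)  # ignore self-similarity

# Label rows and columns with the real userIds
labeled_sim = pd.DataFrame(sim, index=toy_pivot.index,
                           columns=toy_pivot.index)

# idxmax now returns userIds rather than positions
most_similar = labeled_sim.idxmax(axis=1)
print(most_similar)
```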


In [0]:
# Question 1b, examine ratings for similar users
def review_ratings_by_similar_users(user1, user2):
  df_ratings1 = df_ratings[df_ratings.userId == user1]
  df_ratings2 = df_ratings[df_ratings.userId == user2]
  df_ratings_merged = df_ratings1.merge(df_ratings2, on='movieId', how='inner')
  df_common_names = df_ratings_merged.merge(df_movies, on='movieId')
  df_common_names.drop(['timestamp_x', 'timestamp_y'], axis=1, inplace=True)
  return df_common_names

# Positions 200 and 294 in the sample above correspond to userIds 201 and 295
print(review_ratings_by_similar_users(201, 295))

    userId_x  movieId  rating_x  userId_y  rating_y  \
0        201        6       5.0       295       4.5   
1        201       47       5.0       295       4.5   
2        201       50       5.0       295       4.5   
3        201      110       5.0       295       4.0   
4        201      150       5.0       295       4.5   
5        201      153       4.0       295       3.0   
6        201      165       4.0       295       4.0   
7        201      186       3.0       295       3.5   
8        201      260       5.0       295       5.0   
9        201      293       5.0       295       4.5   
10       201      296       5.0       295       4.0   
11       201      316       4.0       295       4.5   
12       201      318       5.0       295       4.5   
13       201      356       4.0       295       4.5   
14       201      377       4.0       295       4.0   
15       201      380       4.0       295       4.0   
16       201      457       4.0       295       4.5   
17       2
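
Reading the merged table row by row works, but a short summary can make the agreement between two users easier to judge. The sketch below is one possible summary; `summarize_agreement` is a hypothetical helper, and `toy_common` reuses four rows from the output above as a stand-in for the full merged frame.

```python
import pandas as pd

# Stand-in for the frame returned by review_ratings_by_similar_users
# (four rows copied from the output above)
toy_common = pd.DataFrame({
    'movieId':  [6, 47, 153, 186],
    'rating_x': [5.0, 5.0, 4.0, 3.0],
    'rating_y': [4.5, 4.5, 3.0, 3.5],
})


def summarize_agreement(df_common):
    """Summarize how closely two users rated their common movies."""
    abs_diff = (df_common['rating_x'] - df_common['rating_y']).abs()
    return {
        'n_common': len(df_common),
        'mean_abs_diff': float(abs_diff.mean()),
        'share_within_half_star': float((abs_diff <= 0.5).mean()),
    }


agreement = summarize_agreement(toy_common)
print(agreement)
```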

In [0]:
# Question 2a
df_movie_pivot = pd.pivot_table(df_ratings, index='movieId', columns='userId',
                                values='rating').fillna(0)
df_movie_pivot.head()

movieId / userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,...,0.0,4.0,3.5,0.0,0.0,0.0,0.0,0.0,4.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
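# pairwise_distances returns the correlation distance between movies (rows);
# 1 - distance gives the Pearson correlation similarity.
# .values replaces as_matrix(), which was removed in pandas 1.0
corr_sim = 1 - pairwise_distances(df_movie_pivot.values,
                                  metric='correlation')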
print(corr_sim)

[[ 1.          0.22374218  0.18326579 ... -0.0281574  -0.0281574
   0.04097762]
 [ 0.22374218  1.          0.12379014 ... -0.01619963 -0.01619963
  -0.01619963]
 [ 0.18326579  0.12379014  1.         ... -0.01122147 -0.01122147
  -0.01122147]
 ...
 [-0.0281574  -0.01619963 -0.01122147 ...  1.          1.
  -0.00149254]
 [-0.0281574  -0.01619963 -0.01122147 ...  1.          1.
  -0.00149254]
 [ 0.04097762 -0.01619963 -0.01122147 ... -0.00149254 -0.00149254
   1.        ]]
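
As a sanity check on the "1 - distance" step, the sketch below compares it with NumPy's direct Pearson computation on a small made-up matrix (`toy_items` is hypothetical data, with rows playing the role of movies).

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy item x user matrix: each row is one item's ratings
toy_items = np.array([
    [5.0, 3.0, 0.0, 4.0],
    [4.0, 2.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 0.0],
])

# Similarity derived from the correlation distance, as in the cell above
sim_from_distance = 1 - pairwise_distances(toy_items, metric='correlation')

# Pearson correlation between rows, computed directly
sim_from_corrcoef = np.corrcoef(toy_items)

print(np.allclose(sim_from_distance, sim_from_corrcoef))
```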


In [0]:
df_corr_sim = pd.DataFrame(corr_sim)
print(df_corr_sim)

          0         1         2         3         4         5         6     \
0     1.000000  0.223742  0.183266  0.071055  0.105076  0.201503  0.156075   
1     0.223742  1.000000  0.123790  0.125014  0.193144  0.085889  0.117211   
2     0.183266  0.123790  1.000000  0.147771  0.317911  0.158071  0.390331   
3     0.071055  0.125014  0.147771  1.000000  0.150562  0.024466  0.156876   
4     0.105076  0.193144  0.317911  0.150562  1.000000  0.186936  0.339605   
5     0.201503  0.085889  0.158071  0.024466  0.186936  1.000000  0.198062   
6     0.156075  0.117211  0.390331  0.156876  0.339605  0.198062  1.000000   
7     0.019379  0.209299  0.109818  0.496859  0.179371  0.072355  0.181772   
8     0.023699  0.053810  0.274638  0.238193  0.339402  0.201880  0.321178   
9     0.089163  0.306685  0.086065  0.063511  0.150292  0.175972  0.075798   
10    0.168445  0.186660  0.139719  0.116366  0.220251  0.182425  0.311394   
11    0.124866  0.022262  0.146940 -0.020366  0.153043  0.101210
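
These coefficients are computed on zero-filled ratings, so two movies that were both left unrated by the same users look correlated. An alternative, sketched here under the assumption that missing ratings are kept as NaN, is pandas' `DataFrame.corr`, which uses only co-rated pairs and can require a minimum number of them through `min_periods`. The data and the threshold of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix: rows are users, NaN means "not rated"
toy_user_item = pd.DataFrame({
    'movie_a': [5.0, 4.0, 1.0, np.nan],
    'movie_b': [4.0, 5.0, 2.0, 1.0],
    'movie_c': [np.nan, np.nan, 5.0, 4.0],
})

# Pearson correlation between columns (movies), using only users who rated
# both movies and requiring at least 3 such users per pair
item_corr = toy_user_item.corr(method='pearson', min_periods=3)
print(item_corr)
```

Pairs with too few common raters come back as NaN instead of a misleadingly confident coefficient.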

In [0]:
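def retrieve_similar_movies(movieId, n=5):
  # Locate the movie's row by label: movieIds are not contiguous, so
  # movieId - 1 is not a reliable row position in the similarity matrix
  pos = df_movie_pivot.index.get_loc(movieId)
  # Label the coefficients with movieIds so they align with df_movies
  similarities = pd.Series(df_corr_sim.iloc[pos].values,
                           index=df_movie_pivot.index)
  df_movies['corr_similarity'] = df_movies['movieId'].map(similarities)
  # Exclude the movie itself, then keep the n most similar movies
  top_similar_items = df_movies[df_movies.movieId != movieId].sort_values(
      'corr_similarity', ascending=False).head(n)
  print(top_similar_items)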

In [0]:
# 2b examine similar movies for two movies
retrieve_similar_movies(6)  # Heat
retrieve_similar_movies(42) # Restoration

     movieId                                      title  \
615      733                           Rock, The (1996)   
24        25                   Leaving Las Vegas (1995)   
650      786                              Eraser (1996)   
87        95                        Broken Arrow (1996)   
31        32  Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   

                        genres  corr_similarity  
615  Action|Adventure|Thriller         0.430485  
24               Drama|Romance         0.421901  
650      Action|Drama|Thriller         0.398260  
87   Action|Adventure|Thriller         0.375565  
31     Mystery|Sci-Fi|Thriller         0.367041  
      movieId                             title  \
401       452               Widows' Peak (1994)   
280       314  Secret of Roan Inish, The (1994)   
1943     2434          Down in the Delta (1998)   
339       375               Safe Passage (1994)   
1137     1399              Marvin's Room (1996)   

                              genres  