**Coding Challenge #2** - Collaborative Filtering

**Coding Challenge:** **Context**

With collaborative filtering, an application can find users with similar tastes, look at the items they like, and combine those items into a ranked list of suggestions; this is known as user-based recommendation. Alternatively, it can find items that are similar to each other and suggest items to users based on their past purchases; this is known as item-based recommendation. The first step in either approach is to find users with similar tastes or items that are similar to each other.

There are various similarity measures, such as **Cosine Similarity, Euclidean Distance Similarity and Pearson Correlation Similarity**, which can be used to find similarity between users or items.
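
To make the three measures concrete, here is a minimal sketch that computes each of them for two rating vectors. The helper names and the ratings for the two users are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between u and v (0 means identical)."""
    return float(np.linalg.norm(u - v))

def pearson_correlation(u, v):
    """Cosine similarity of the mean-centered vectors."""
    return cosine_similarity(u - u.mean(), v - v.mean())

# Hypothetical ratings by two users for the same four movies
alice = np.array([5.0, 3.0, 4.0, 4.0])
bob = np.array([3.0, 1.0, 2.0, 2.0])  # same pattern, every rating 2 lower

cos = cosine_similarity(alice, bob)
dist = euclidean_distance(alice, bob)
corr = pearson_correlation(alice, bob)
print(cos, dist, corr)
```

Because Pearson correlation centers each vector on its mean, it ignores the fact that one user rates systematically lower than the other, whereas Euclidean distance does not.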

In this coding challenge, you will go through the process of identifying users that are similar ("User Similarity") and items that are similar ("Item Similarity").

**User Similarity:**

**1a)** Compute "User Similarity" based on the cosine similarity coefficient (the other commonly used similarity coefficients are the Pearson correlation coefficient and Euclidean distance)

**1b)** Based on the cosine similarity coefficient, identify 2 users who are similar and then discover common movie names that have been rated by the 2 users; examine how the similar users have rated the movies

**Item Similarity:**

**2a)** Compute "Item Similarity" based on the Pearson Correlation Similarity Coefficient

**2b)** Pick 2 movies and find movies that are similar to the movies you have picked

**Challenges:**

**3)** Do you foresee any issue(s) associated with Collaborative Filtering?

**Dataset:** For the purposes of this challenge, we will use the data set available at https://grouplens.org/datasets/movielens/

The data set is posted under the section ***recommended for education and development***, and we will stick to the small version of the data set with 100,000 ratings.

In [0]:
import zipfile
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import pdist, squareform

In [2]:
! wget 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

--2018-05-25 23:39:01--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 918269 (897K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2018-05-25 23:39:01 (3.05 MB/s) - ‘ml-latest-small.zip’ saved [918269/918269]



In [0]:
folder = zipfile.ZipFile('ml-latest-small.zip')

In [4]:
folder.infolist()

[<ZipInfo filename='ml-latest-small/' filemode='drwxr-xr-x' external_attr=0x10>,
 <ZipInfo filename='ml-latest-small/links.csv' compress_type=deflate filemode='-rw-r--r--' file_size=183372 compress_size=80618>,
 <ZipInfo filename='ml-latest-small/movies.csv' compress_type=deflate filemode='-rw-r--r--' file_size=458390 compress_size=155389>,
 <ZipInfo filename='ml-latest-small/ratings.csv' compress_type=deflate filemode='-rw-r--r--' file_size=2438266 compress_size=663997>,
 <ZipInfo filename='ml-latest-small/README.txt' compress_type=deflate filemode='-rw-r--r--' file_size=8364 compress_size=3289>,
 <ZipInfo filename='ml-latest-small/tags.csv' compress_type=deflate filemode='-rw-r--r--' file_size=41902 compress_size=13898>]

In [0]:
ratings = pd.read_csv(folder.open('ml-latest-small/ratings.csv'))
movies = pd.read_csv(folder.open('ml-latest-small/movies.csv'))

In [6]:
display(ratings.head())
display(movies.head())

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


## User Similarity

In [7]:
ratings_pivot = pd.pivot_table(ratings.drop('timestamp', axis=1), 
                               index='userId', columns='movieId', 
                               aggfunc=np.max).fillna(0)
print(ratings_pivot.shape)
ratings_pivot.head()

(671, 9066)


Ratings by movieId (columns) for each userId (rows):

| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 161084 | 161155 | 161594 | 161830 | 161918 | 161944 | 162376 | 162542 | 162672 | 163949 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [12]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
distances = pdist(ratings_pivot.to_numpy(), 'cosine')
squareform(distances)

array([[0.        , 1.        , 1.        , ..., 0.93708292, 1.        ,
        0.98253435],
       [1.        , 0.        , 0.87570502, ..., 0.97586016, 0.82940536,
        0.8868247 ],
       [1.        , 0.87570502, 0.        , ..., 0.91901618, 0.86339415,
        0.82980725],
       ...,
       [0.93708292, 0.97586016, 0.91901618, ..., 0.        , 0.95739122,
        0.91479806],
       [1.        , 0.82940536, 0.86339415, ..., 0.95739122, 0.        ,
        0.77132327],
       [0.98253435, 0.8868247 , 0.82980725, ..., 0.91479806, 0.77132327,
        0.        ]])

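Since `pdist` returns the cosine distance, $1 - \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$, rather than the cosine similarity, I subtract the result from 1. Note that `squareform` fills the diagonal with zeros, so a user's similarity with themselves shows up as 0 and does not interfere with the search for the most similar pair.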
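
The relationship between the cosine distance from `pdist` and the cosine similarity can be checked on a small matrix. This is only a sanity-check sketch; the matrix `X` is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 3 users x 3 movies rating matrix (0 = not rated)
X = np.array([[5.0, 0.0, 3.0],
              [4.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])

# Cosine distance as computed by pdist, expanded to a square matrix
cos_dist = squareform(pdist(X, 'cosine'))

# Cosine similarity computed directly from the definition
norms = np.linalg.norm(X, axis=1)
cos_sim = (X @ X.T) / np.outer(norms, norms)

# The two agree up to floating-point error: distance = 1 - similarity
print(np.allclose(cos_dist, 1 - cos_sim))

# Converting the condensed vector first leaves zeros on the diagonal
sim_zero_diag = squareform(1 - pdist(X, 'cosine'))
print(np.diag(sim_zero_diag))
```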

In [27]:
similarities = squareform(1-distances)
print(similarities.shape)
similarities

(671, 671)


array([[0.        , 0.        , 0.        , ..., 0.06291708, 0.        ,
        0.01746565],
       [0.        , 0.        , 0.12429498, ..., 0.02413984, 0.17059464,
        0.1131753 ],
       [0.        , 0.12429498, 0.        , ..., 0.08098382, 0.13660585,
        0.17019275],
       ...,
       [0.06291708, 0.02413984, 0.08098382, ..., 0.        , 0.04260878,
        0.08520194],
       [0.        , 0.17059464, 0.13660585, ..., 0.04260878, 0.        ,
        0.22867673],
       [0.01746565, 0.1131753 , 0.17019275, ..., 0.08520194, 0.22867673,
        0.        ]])

In [28]:
ix = np.unravel_index(np.argmax(similarities), similarities.shape)
print(ix)
print(similarities[ix])

(150, 368)
0.8453008752801064


The positions (150, 368) are zero-based, so they correspond to users 151 and 369, who form the most similar pair with a cosine similarity of about 0.85.
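
Reading user IDs off as position + 1 works only when the IDs run contiguously from 1, which appears to be the case for the users in this data set. A more general sketch looks the IDs up in the pivot table's index instead; the toy matrix and its user IDs below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical user x movie matrix (0 = not rated) with non-contiguous user IDs
toy = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [1, 0, 3, 5]],
    index=pd.Index([10, 20, 30, 40], name='userId'),
    columns=pd.Index([1, 2, 3, 4], name='movieId'),
    dtype=float,
)

# Pairwise cosine similarity; squareform leaves zeros on the diagonal
sim = squareform(1 - pdist(toy.to_numpy(), 'cosine'))

# Positions of the most similar pair, then the user IDs behind them
i, j = np.unravel_index(np.argmax(sim), sim.shape)
most_similar_pair = (int(toy.index[i]), int(toy.index[j]))
print(most_similar_pair)
```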

In [29]:
print('Common movies rated')
display(ratings_pivot.iloc[[150, 368], :].T[(ratings_pivot.iloc[150]>0) 
                                          & (ratings_pivot.iloc[368]>0)])

Common movies rated


| movieId | rating by userId 151 | rating by userId 369 |
|---|---|---|
| 2 | 4.0 | 3.0 |
| 10 | 5.0 | 3.0 |
| 21 | 5.0 | 3.0 |
| 32 | 5.0 | 4.0 |
| 39 | 3.0 | 3.0 |
| 47 | 5.0 | 4.0 |
| 50 | 3.0 | 3.0 |
| 110 | 4.0 | 5.0 |
| 150 | 5.0 | 4.0 |
| 153 | 1.0 | 3.0 |


## Item Similarity

In [30]:
# Transpose so that rows are movies; as_matrix() was removed in pandas 1.0
correlations = squareform(1-pdist(ratings_pivot.to_numpy().T, 'correlation'))
correlations

array([[ 0.        ,  0.22374218,  0.18326579, ..., -0.0281574 ,
        -0.0281574 ,  0.04097762],
       [ 0.22374218,  0.        ,  0.12379014, ..., -0.01619963,
        -0.01619963, -0.01619963],
       [ 0.18326579,  0.12379014,  0.        , ..., -0.01122147,
        -0.01122147, -0.01122147],
       ...,
       [-0.0281574 , -0.01619963, -0.01122147, ...,  0.        ,
         1.        , -0.00149254],
       [-0.0281574 , -0.01619963, -0.01122147, ...,  1.        ,
         0.        , -0.00149254],
       [ 0.04097762, -0.01619963, -0.01122147, ..., -0.00149254,
        -0.00149254,  0.        ]])

In [41]:
np.argsort(correlations[0])[::-1]

array([2506, 1866, 1019, ...,  169, 4494, 1749])

In [42]:
correlations[0][np.argsort(correlations[0])[::-1]]

array([ 0.47414073,  0.39379904,  0.3723713 , ..., -0.06178254,
       -0.06197323, -0.07638657])

In [34]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


I will see which movies correlate the most with "Toy Story" and "Jumanji."

In [77]:
np.argsort(correlations[1])[::-1][:5] + 1

array([448, 284, 329, 138, 332])

In [0]:
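def most_correlated_movies(movieId, corr_matrix, n=5):
    # Assumes movieId - 1 is the movie's row position in corr_matrix
    ix = movieId - 1
    # Use the matrix passed in (not the global); returns top-n positions + 1
    return np.argsort(corr_matrix[ix])[::-1][:n] + 1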

In [60]:
toy_story_similar = most_correlated_movies(1, correlations)
movies[movies['movieId'].isin(toy_story_similar)]

| | movieId | title | genres |
|---|---|---|---|
| 824 | 1020 | Cool Runnings (1993) | Comedy |
| 2008 | 2507 | Breakfast of Champions (1999) | Comedy\|Sci-Fi |
| 3041 | 3804 | H.O.T.S. (1979) | Comedy |


In [80]:
jumanji_similar = most_correlated_movies(2, correlations)
movies[movies['movieId'].isin(jumanji_similar)]

| | movieId | title | genres |
|---|---|---|---|
| 294 | 329 | Star Trek: Generations (1994) | Adventure\|Drama\|Sci-Fi |
| 297 | 332 | Village of the Damned (1995) | Horror\|Sci-Fi |
| 397 | 448 | Fearless (1993) | Drama |


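Only three of the five IDs returned for each movie match a title. The reason is not a shortage of titles: as the next cell shows, `movies` has 9125 rows, more than the 9066 movies in `ratings_pivot`. Rather, `most_correlated_movies` returns column positions plus one and treats them as movie IDs, while the movie IDs in `ratings_pivot` are not contiguous (9066 columns, with IDs up to 163949). Some of the returned numbers therefore match no movie, and those that do may not be the movies that actually correlate most; the positions need to be mapped back to IDs through the column labels of `ratings_pivot`.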

In [81]:
movies.shape

(9125, 3)
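
Because `ratings_pivot` has 9066 movie columns while its movie IDs run up to 163949, a column position cannot be turned into a movie ID by adding 1. The following is a hedged sketch of a lookup that goes through the column labels instead. The function name `most_correlated_movie_ids`, its `movie_ids` parameter, and the toy ratings are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def most_correlated_movie_ids(movie_id, corr_matrix, movie_ids, n=5):
    """Return the n movie IDs most correlated with movie_id.

    movie_ids is a pd.Index holding the movie ID behind each
    row/column of corr_matrix (hypothetical parameter).
    """
    pos = movie_ids.get_loc(movie_id)            # movie ID -> position
    order = np.argsort(corr_matrix[pos])[::-1]   # positions, best first
    order = order[order != pos][:n]              # drop the movie itself
    return movie_ids[order].tolist()             # positions -> movie IDs

# Hypothetical user x movie ratings with non-contiguous movie IDs
toy = pd.DataFrame(
    [[5, 4, 1, 0],
     [4, 5, 0, 1],
     [1, 0, 5, 4],
     [0, 1, 4, 5],
     [5, 5, 1, 1]],
    columns=pd.Index([1, 2, 50, 1000], name='movieId'),
    dtype=float,
)

# Item-item Pearson correlation, computed the same way as in the notebook
toy_corr = squareform(1 - pdist(toy.to_numpy().T, 'correlation'))

similar_to_1 = most_correlated_movie_ids(1, toy_corr, toy.columns, n=2)
similar_to_1000 = most_correlated_movie_ids(1000, toy_corr, toy.columns, n=1)
print(similar_to_1, similar_to_1000)
```

For the real data, the movie IDs would probably come from something like `ratings_pivot.columns.get_level_values('movieId')`, since the pivot table has a two-level column index; that call is untested here.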