# COLLABORATIVE FILTERING - Finding Similar Books and Movies

We'll start by loading the Goodreads dataset. With pandas, we can quickly load the rows of the rating and item files that we care about and merge them, so we can work with book names instead of IDs. (In a real production job, you'd stick with IDs and resolve names only at the display layer, which is more efficient. Working with names makes it easier to see what's going on for now.)
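A minimal sketch of that load-and-merge step, using two tiny in-memory frames in place of the real files. The column names `user_id`, `book_id`, `rating`, and `title`, and the rows themselves, are made up for illustration; the actual Goodreads files may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-ins for the rating and item files (made-up rows).
ratings_raw = pd.DataFrame({
    'user_id': [1, 1, 2],
    'book_id': [10, 20, 10],
    'rating': [4, 3, 5],
})
books = pd.DataFrame({
    'book_id': [10, 20],
    'title': ['Book A', 'Book B'],
})

# Attach a title to each rating by joining on the shared ID column.
named = pd.merge(ratings_raw, books, on='book_id')
print(named[['user_id', 'title', 'rating']])
```

With real files, `pd.read_csv(..., usecols=[...])` would take the place of the literal frames, so that only the needed columns are loaded.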

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter('ignore')

#### Illustration of the Principle

In [2]:
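import numpy as np
# Fix the seed so the toy ratings are reproducible.
np.random.seed(5)
# 5 users x 5 books; randint(5, ...) draws integers from 0 to 4 inclusive.
ratings = np.random.randint(5, size=(5, 5))
ratings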

array([[3, 0, 1, 0, 4],
       [3, 0, 0, 4, 1],
       [0, 3, 4, 3, 1],
       [4, 2, 1, 1, 2],
       [1, 1, 1, 2, 0]])

In [3]:
ratingsDF = pd.DataFrame(ratings, columns=['BK1', 'BK2', 'BK3', 'BK4', 'BK5'])

In [4]:
ratingsDF['user'] = ['user1', 'user2', 'user3', 'user4', 'user5']
ratingsDF.set_index('user', inplace=True)

In [5]:
ratingsDF

user,BK1,BK2,BK3,BK4,BK5
user1,3,0,1,0,4
user2,3,0,0,4,1
user3,0,3,4,3,1
user4,4,2,1,1,2
user5,1,1,1,2,0


In [6]:
corrTable = ratingsDF.corr()
corrTable

,BK1,BK2,BK3,BK4,BK5
BK1,1.0,-0.490098,-0.742379,-0.3849,0.541736
BK2,-0.490098,1.0,0.834441,0.121268,-0.328719
BK3,-0.742379,0.834441,1.0,0.104257,-0.130435
BK4,-0.3849,0.121268,0.104257,1.0,-0.7298
BK5,0.541736,-0.328719,-0.130435,-0.7298,1.0


`DataFrame.corr()` computes Pearson's correlation coefficient by default. For a sample of $n$ paired ratings $(x_i, y_i)$ it is:

$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
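As a sanity check, the sample Pearson correlation can be computed by hand for one pair of books. The sketch below copies the BK2 and BK3 columns from the toy ratings table; the helper name `pearson` is ours, not part of pandas or NumPy.

```python
import numpy as np

# Ratings for BK2 and BK3, copied from the toy table (user1..user5).
bk2 = np.array([0, 0, 3, 2, 1], dtype=float)
bk3 = np.array([1, 0, 4, 1, 1], dtype=float)

def pearson(x, y):
    """Sample Pearson correlation of two equal-length 1-D arrays."""
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_bk2_bk3 = pearson(bk2, bk3)
print(round(r_bk2_bk3, 6))  # 0.834441
```

The result agrees with the BK2/BK3 entry of the correlation table.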

In [7]:
ratingsDF[['BK2', 'BK3']]

user,BK2,BK3
user1,0,1
user2,0,0
user3,3,4
user4,2,1
user5,1,1


In [8]:
corrTable['BK2']['BK3']

0.8344408667498864

In [9]:
ratingsDF[['BK2', 'BK4']]

user,BK2,BK4
user1,0,0
user2,0,4
user3,3,3
user4,2,1
user5,1,2


In [10]:
corrTable['BK2']['BK4']

0.12126781251816648

In [11]:
ratingsDF[['BK4', 'BK5']]

user,BK4,BK5
user1,0,4
user2,4,1
user3,3,1
user4,1,2
user5,2,0


In [12]:
corrTable['BK4']['BK5']

-0.7298004491997616
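The pairwise lookups above suggest how to find similar books: take one book's column of the correlation table, drop the book itself, and sort. A minimal sketch on the same toy ratings follows. The frame is rebuilt from the literal values so the block runs on its own, and the names `ratings_df`, `corr_table`, and `similar_to_bk2` are ours.

```python
import pandas as pd

# Toy ratings from the illustration: rows are users, columns are books.
ratings_df = pd.DataFrame(
    [[3, 0, 1, 0, 4],
     [3, 0, 0, 4, 1],
     [0, 3, 4, 3, 1],
     [4, 2, 1, 1, 2],
     [1, 1, 1, 2, 0]],
    index=['user1', 'user2', 'user3', 'user4', 'user5'],
    columns=['BK1', 'BK2', 'BK3', 'BK4', 'BK5'],
)

# Pairwise Pearson correlation between the book columns.
corr_table = ratings_df.corr()

# Rank the other books by their correlation with BK2, highest first.
similar_to_bk2 = corr_table['BK2'].drop('BK2').sort_values(ascending=False)
print(similar_to_bk2)
```

On this toy data BK3 ranks first for BK2. With only five users such correlations are very noisy, so this shows the mechanics, not a meaningful recommendation.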