# Notebook 2: Distance Intro
***

In this notebook we'll have a look at some ways of defining the distance between two items. Namely, we are interested in measuring how similar two items are. Example contexts include recommender systems and clustering.

We'll need Numpy and Pandas for this notebook, so let's load them.

In [1]:
import numpy as np 
import pandas as pd

### Exercise 1: Euclidean distance calculations
***

On the course Canvas page you can obtain a shortened and cleaned version of a data set of Google reviews for 20 users across 10 categories of locations. The full original data set spans 5456 users and 24 categories and can be found [here](https://archive.ics.uci.edu/ml/datasets/Tarvel+Review+Ratings). Each rating is the user's average for that category (1-5), and a 0 represents no ratings from that user in that category.

Each row corresponds to a different category. The 10 rows that we have represent:
1. Average ratings on churches 
1. Average ratings on resorts 
1. Average ratings on beaches 
1. Average ratings on parks 
1. Average ratings on theatres 
1. Average ratings on museums 
1. Average ratings on malls 
1. Average ratings on zoos
1. Average ratings on restaurants 
1. Average ratings on pubs/bars

More information can be found at the original link above.

Let's load up the data set and use it to practice some distance calculations, shall we?

In [130]:
dfR = pd.read_csv("../data/google_review_ratings_inclass.csv")
dfR.head(10)  # for our 10-row dataframe, this will show all

Unnamed: 0,User 10,User 18,User 280,User 667,User 762,User 778,User 839,User 851,User 874,User 1045,User 1219,User 1232,User 1603,User 2213,User 3768,User 4068,User 5153,User 5228,User 5350,User 5413
0,0.0,0.0,2.33,0.0,1.69,1.92,2.64,2.65,4.43,0.98,2.43,1.62,0.82,0.78,0.72,1.11,1.37,0.54,2.36,3.62
1,5.0,0.53,2.27,0.0,1.73,4.51,2.64,2.64,3.64,1.54,2.58,2.44,0.85,0.8,5.0,1.15,1.42,0.53,2.25,5.0
2,3.64,3.69,5.0,1.43,2.66,1.96,2.17,2.65,0.0,1.54,2.58,0.0,1.82,1.56,1.37,1.83,1.43,0.0,0.0,2.62
3,3.64,0.0,0.0,0.0,1.96,0.0,0.0,0.0,0.0,0.0,5.0,0.0,1.63,2.9,0.0,3.25,2.56,0.0,1.29,2.62
4,0.0,0.0,2.32,1.42,1.94,2.01,0.0,1.87,0.0,1.53,0.0,0.0,0.0,0.0,0.0,0.0,3.18,1.47,1.65,0.0
5,0.0,2.93,0.0,1.57,0.0,2.0,0.0,0.0,0.0,5.0,3.72,0.0,0.0,0.0,1.48,5.0,0.0,1.48,1.25,1.44
6,5.0,2.95,4.17,5.0,1.97,2.02,1.3,1.33,2.23,5.0,2.09,3.74,1.63,5.0,5.0,5.0,5.0,1.5,1.24,1.87
7,2.35,0.0,2.91,1.52,0.0,1.99,0.0,0.0,1.91,4.0,2.64,2.26,1.66,3.41,4.15,0.0,0.0,1.53,0.0,1.35
8,0.0,0.0,0.0,0.0,5.0,0.0,1.41,0.0,1.93,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,3.2,0.0,0.0,2.1,1.58,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.59,0.0,0.0,0.0,0.0


Users 10, 18 and 667 all stand out because they have no ratings - or no love - for churches. So, let's pick on them.

**First**, make a guess based on just looking at the three users' data - the ***eyeball metric***, if you will - which pair of users among those three is the most similar, and which pair is the least similar.

**Then**, compute the Euclidean distance between each of the three possible pairs among the three of them to determine which pair is most similar. We'll compute the distance between User 10 and User 18 together:

In [131]:
# square the differences
diffs = (dfR.iloc[:,0] - dfR.iloc[:,1])**2

# sum them up
ssd = sum(diffs)

# and square root the sum
dist_10_18 = np.sqrt(ssd)
print(dist_10_18)

7.179338409630794


In [132]:
# SOLUTION:

# 10 and 667
dist_10_667 = np.sqrt(sum((dfR.iloc[:,0] - dfR.iloc[:,3])**2))

# 18 and 667
dist_18_667 = np.sqrt(sum((dfR.iloc[:,1] - dfR.iloc[:,3])**2))

print("d(10,18)  = {:0.4f}".format(dist_10_18))
print("d(10,667) = {:0.4f}".format(dist_10_667))
print("d(18,667) = {:0.4f}".format(dist_18_667))

d(10,18)  = 7.1793
d(10,667) = 6.9501
d(18,667) = 3.9708
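As a cross-check on the manual square, sum, and square-root steps, the same distances can be sketched with `np.linalg.norm`. The three arrays below are copied by hand from the table printed earlier, so the snippet runs without the CSV file; the variable names are just for illustration.

```python
import numpy as np

# Ratings copied from the table above (one entry per attraction category)
user_10 = np.array([0.0, 5.0, 3.64, 3.64, 0.0, 0.0, 5.0, 2.35, 0.0, 0.0])
user_18 = np.array([0.0, 0.53, 3.69, 0.0, 0.0, 2.93, 2.95, 0.0, 0.0, 0.0])
user_667 = np.array([0.0, 0.0, 1.43, 0.0, 1.42, 1.57, 5.0, 1.52, 0.0, 0.0])

# The Euclidean distance is the 2-norm of the difference vector
d_10_18 = np.linalg.norm(user_10 - user_18)
d_10_667 = np.linalg.norm(user_10 - user_667)
d_18_667 = np.linalg.norm(user_18 - user_667)

print("d(10,18)  = {:0.4f}".format(d_10_18))
print("d(10,667) = {:0.4f}".format(d_10_667))
print("d(18,667) = {:0.4f}".format(d_18_667))
```

The printed values should agree with the ones above to four decimal places.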


### Exercise 2: Finding the most similar users
***

There are 20 users in our data set, and a pair is produced by drawing 2 distinct users (without replacement). Thus, there are ${{20}\choose{2}} = \frac{20 \cdot 19}{2} = 190$ possible pairs of users. So, it is feasible for us to actually compare all pairs of users to find the most similar pairs. In the next notebook and your homework, you will use some nifty computational tools to approximate these similarities.
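The pair count can be double-checked with Python's standard library. This is only a small sanity check; `n_users` and `pairs` are names chosen for this sketch.

```python
from itertools import combinations
from math import comb

n_users = 20

# Enumerate every unordered pair of distinct users (by position 0..19)
pairs = list(combinations(range(n_users), 2))

# Both the explicit enumeration and the binomial coefficient give the count
print(len(pairs), comb(n_users, 2))
```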

First, let's write a function that will take as input two users and return their Euclidean distance.

In [133]:
def euclidean_distance(u1, u2, df):
    return np.sqrt(sum((df.iloc[:,u1] - df.iloc[:,u2])**2))

Now let's use that function to create an $n \times n$ matrix, where the element in row $i$ and column $j$ represents the distance between Users $i$ and $j$. Finish the code off below, where some `for` loops may be just what the doctor ordered!

In [134]:
n = dfR.shape[1]
distances = np.zeros([n,n])
# your code goes here!
# be a hero and fill in the distances matrix

# SOLUTION:
for row in range(n):
    for col in range(n):
        distances[row,col] = euclidean_distance(row,col,dfR)
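The double loop is easy to read, but the same matrix can also be built in one shot with NumPy broadcasting. Below is a minimal sketch, assuming the ratings are laid out as in `dfR` (one row per category, one column per user); `pairwise_euclidean` and the toy data are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def pairwise_euclidean(ratings):
    """ratings: 2-D array with one row per category and one column per user."""
    users = np.asarray(ratings, dtype=float).T      # shape (n_users, n_categories)
    # Broadcasting builds every user-minus-user difference vector at once
    diffs = users[:, None, :] - users[None, :, :]   # shape (n_users, n_users, n_categories)
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Toy data: 2 categories, 3 hypothetical users with ratings (0,0), (3,4), (0,1)
toy = np.array([[0.0, 3.0, 0.0],
                [0.0, 4.0, 1.0]])
D = pairwise_euclidean(toy)
print(D)
```

With the notebook's data, something like `pairwise_euclidean(dfR.to_numpy())` should reproduce `distances`, at the cost of holding an n × n × 10 intermediate array in memory.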

What is an important matrix property that `distances` should have?  Perform a computation that will allow you to verify that the matrix `distances` indeed has this property.

In [135]:
# SOLUTION:

# distances should be symmetric, which means that distances equals np.transpose(distances),
# so the largest absolute difference between the two should be 0
np.max(np.abs(distances - np.transpose(distances)))

0.0

If we wanted to find the three most similar other users to User 10, we could do this in a straightforward but crude fashion using some of Python's native list operations:

In [136]:
# Convert the distances from User 10 to other users into a list
dists = list(distances[0,:])

# Sort that list
dists_sorted = sorted(dists)

# Find the indices of the 3 lowest values
# (User 10 will always be the lowest, so start at index 1 within the sorted version)
k = 3
neighbors = []
for idx in range(1,k+1):
    new_neighbor = dists.index(dists_sorted[idx])
    neighbors.append(new_neighbor)
    
print(dfR.columns[neighbors])

Index(['User 2213', 'User 3768', 'User 5413'], dtype='object')
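One caveat with the list-based approach: `list.index` returns the first match, so if two users were exactly the same distance away, the same user would be reported twice. A sketch of an alternative using `np.argsort`, which returns positions directly, is shown below; `k_nearest` and the toy distances are hypothetical names and values for illustration.

```python
import numpy as np

def k_nearest(dist_row, self_idx, k):
    """Return the positions of the k smallest distances, skipping self_idx."""
    # A stable sort keeps tied distances in their original column order
    order = np.argsort(dist_row, kind="stable")
    return [int(i) for i in order if i != self_idx][:k]

# Toy distances from user 0 to five users; positions 1 and 3 are tied
toy_dists = np.array([0.0, 2.0, 1.0, 2.0, 5.0])
print(k_nearest(toy_dists, self_idx=0, k=3))  # [2, 1, 3]
```

In the notebook, `k_nearest(distances[0,:], 0, 3)` would play the role of the loop above.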


Modify the code above to find the 5 users most similar to User 1219. Then, determine as a group what type of attractions they rated the highest (on average) and what type of attraction they rated the lowest.

In [137]:
# SOLUTION:

# Convert the distances from User 1219 to other users into a list
dists = list(distances[dfR.columns=="User 1219",:][0])
#dists = list(distances[10,:])

# Sort that list
dists_sorted = sorted(dists)

# Find the indices of the k lowest values
k = 5
neighbors = []
for idx in range(1,k+1):
    new_neighbor = dists.index(dists_sorted[idx])
    neighbors.append(new_neighbor)
    
print(dfR.iloc[:,neighbors])

   User 5413  User 4068  User 1603  User 2213  User 5350
0       3.62       1.11       0.82       0.78       2.36
1       5.00       1.15       0.85       0.80       2.25
2       2.62       1.83       1.82       1.56       0.00
3       2.62       3.25       1.63       2.90       1.29
4       0.00       0.00       0.00       0.00       1.65
5       1.44       5.00       0.00       0.00       1.25
6       1.87       5.00       1.63       5.00       1.24
7       1.35       0.00       1.66       3.41       0.00
8       0.00       0.00       0.00       0.00       0.00
9       0.00       2.59       0.00       0.00       0.00


To determine their favorite and least favorite, we can use the **mean** function on our dataframe:

In [138]:
dfR.iloc[:,neighbors].mean(axis=1)

0    1.738
1    2.010
2    1.566
3    2.338
4    0.330
5    1.538
6    2.948
7    1.284
8    0.000
9    0.518
dtype: float64

So it appears they like malls the best and restaurants the least. But none of these users rated any restaurants at all, and yet that appears as a 0... the *worst* possible rating. Perhaps a method of measuring similarity that accounts for what elements users have in common/don't have in common would be a nice way to go... so let's check out Exercise 3 for more on that!

(Note that more elegant methods for finding the $k$ **nearest neighbors** exist, and you are encouraged to check them out if you are interested! For example, scikit-learn's [Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html) module is prevalent and fairly easy to get up and running.)
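To see how much the unrated 0s drag the averages down, here is a rough sketch that averages only over entries that were actually rated. The array is copied by hand from the five-neighbor table printed above, and treating 0 as "missing" is an assumption of this sketch rather than something the data set encodes explicitly.

```python
import numpy as np

# Rows = categories 0..9, columns = Users 5413, 4068, 1603, 2213, 5350
neighbor_ratings = np.array([
    [3.62, 1.11, 0.82, 0.78, 2.36],
    [5.00, 1.15, 0.85, 0.80, 2.25],
    [2.62, 1.83, 1.82, 1.56, 0.00],
    [2.62, 3.25, 1.63, 2.90, 1.29],
    [0.00, 0.00, 0.00, 0.00, 1.65],
    [1.44, 5.00, 0.00, 0.00, 1.25],
    [1.87, 5.00, 1.63, 5.00, 1.24],
    [1.35, 0.00, 1.66, 3.41, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 2.59, 0.00, 0.00, 0.00],
])

# Count how many of the five users actually rated each category
counts = (neighbor_ratings > 0).sum(axis=1)
sums = neighbor_ratings.sum(axis=1)

# Average over rated entries only; categories nobody rated become NaN
rated_means = np.divide(sums, counts,
                        out=np.full(sums.shape, np.nan),
                        where=counts > 0)
print(rated_means)
```

Under this view restaurants (row 8) have no average at all instead of a "worst possible" 0.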

### Exercise 3: Set representations
***

Some difficulty is presented in the above analysis because there are attraction categories that some users did not visit/rate. For example, some of the users do not have any ratings for churches, which is represented with a 0 rating (the worst of the worst). But this data set is from Europe, where there are some really nice churches!

> <img width=500px src="http://europeanpost.co/wp-content/uploads/2017/07/Barcelona.360.728.jpg">
>
> **Figure 1**. A very nice cathedral in Barcelona, Spain.

So perhaps a reasonable approach to grouping similar users is to represent each user using a Boolean set, with 1s for attraction categories that they have rated (anything > 0) and 0s for attraction categories that they have not rated (0s in the raw array that we have been working with). 

Characterizing users' preferences for vacation activities is an obvious use for this type of analysis: users with many 1s in common across categories have done (or at least rated) similar activities, so one might believe they prefer to do similar things while traveling.

Recall that the **Jaccard similarity** for two sets $A$ and $B$ is:
$$sim(A,B) = \dfrac{|A \cap B|}{|A \cup B|}$$

Let's compute the Jaccard similarity for Users 10, 18 and 667. First, we need to represent each user's ratings as a Boolean set. Since the maximum rating is a 5 and the minimum is a 0, if we divide a user's ratings by 5, then we will end up with 0s where there were originally 0s, and numbers strictly greater than 0 and at most 1 wherever there was an actual rating (a rating of 5 maps to exactly 1). So we can turn this into a Boolean vector by hitting it with the ceiling function:

In [146]:
dfB = np.ceil(dfR/5)

In [148]:
dfB

Unnamed: 0,User 10,User 18,User 280,User 667,User 762,User 778,User 839,User 851,User 874,User 1045,User 1219,User 1232,User 1603,User 2213,User 3768,User 4068,User 5153,User 5228,User 5350,User 5413
0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0
4,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
5,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
8,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
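The divide-and-ceiling trick relies on every rating lying between 0 and 5. A more direct way to get the same indicator is to compare against 0. The sketch below checks that the two agree on a few hypothetical ratings.

```python
import numpy as np

# Hypothetical ratings covering the edge cases: unrated, tiny, mid-range, maximum
ratings = np.array([0.0, 0.53, 2.5, 5.0])

via_ceil = np.ceil(ratings / 5)              # the approach used above
via_compare = (ratings > 0).astype(float)    # direct "was it rated?" test

print(via_ceil)
print(via_compare)
```

On the full data frame the comparison version would be something like `(dfR > 0).astype(float)`, which does not depend on knowing the maximum rating.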


We can get the numerator $|A \cap B|$ and denominator $|A \cup B|$ using Boolean operations on the columns from the updated characteristic matrix:

In [156]:
# & for intersection
numer = np.sum((dfB["User 10"]==1) & (dfB["User 18"]==1))
print(numer)

# | for union
denom = np.sum((dfB["User 10"]==1) | (dfB["User 18"]==1))
print(denom)

3
6


Thus, $sim(User\ 10, User\ 18) = 3/6 = 0.5$
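The numerator and denominator steps can be wrapped into a small helper. This is a minimal sketch: `jaccard_similarity` is a hypothetical name, the two vectors are the User 10 and User 18 columns of `dfB` copied by hand, and returning 0 for two empty sets is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard similarity of two 0/1 indicator vectors of equal length."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.sum(a | b)
    if union == 0:
        return 0.0  # assumption: two empty sets are given similarity 0
    return np.sum(a & b) / union

# Indicator vectors for User 10 and User 18, copied from dfB above
user_10 = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
user_18 = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0]

print(jaccard_similarity(user_10, user_18))  # 0.5
```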

Now, let's find the 5 most similar users to User 1219, using Jaccard similarity. What sorts of attractions seem to interest this group of users? What sorts of attractions are they not interested in visiting while on vacation?

In [None]:
sims = []
for user in dfB.columns:
    # Compute the Jaccard similarity numerator and denominator
    numer = 0 # your code goes here!
    denom = 0 # your code goes here!
    # Then compute the Jaccard similarity itself and append it to sims
    sims.append(0) # your code goes here!
    
# Find the 5 *highest* similarities
# your code goes here!
# stealing from the method above for finding the *lowest* distances is encouraged :D

In [159]:
# SOLUTION:

sims = []
for user in dfB.columns:
    # Compute the Jaccard similarity numerator and denominator
    numer = np.sum((dfB["User 1219"]==1) & (dfB[user]==1))
    denom = np.sum((dfB["User 1219"]==1) | (dfB[user]==1))
    # Then compute the Jaccard similarity itself and append it to sims
    sims.append(numer/denom)

# Find the 5 *highest* similarities.
# Sort the column indices from highest to lowest similarity: negating the values
# turns argsort's ascending order into descending, and the stable sort keeps tied
# users in column order instead of returning the same user more than once.
order = np.argsort(-np.array(sims), kind="stable")

# Keep the first k indices, skipping User 1219 (similarity 1 with itself)
k = 5
neighbors = [idx for idx in order if dfB.columns[idx] != "User 1219"][:k]

print(dfB.iloc[:,neighbors])

   User 5413  User 1603  User 2213  User 3768  User 1045
0        1.0        1.0        1.0        1.0        1.0
1        1.0        1.0        1.0        1.0        1.0
2        1.0        1.0        1.0        1.0        1.0
3        1.0        1.0        1.0        0.0        0.0
4        0.0        0.0        0.0        0.0        1.0
5        1.0        0.0        0.0        1.0        1.0
6        1.0        1.0        1.0        1.0        1.0
7        1.0        1.0        1.0        1.0        1.0
8        0.0        0.0        0.0        0.0        0.0
9        0.0        0.0        0.0        0.0        0.0


<br><br>

## Code to process raw Google review ratings data set into in-class version

* draw a sub-sample of users and rating categories
* zero out extra entries (so the in-class version is *not* the same as the original data set!)

In [118]:
dfRaw = pd.read_csv("../data/google_review_ratings.csv")
dfRaw.head()

num_users_inclass = 20 # randomly sampled
num_ratings_inclass = 10
newCols = {}
np.random.seed(3)
user_idx = np.random.choice(range(len(dfRaw)), size=num_users_inclass, replace=False)
user_idx.sort()

for i in range(num_users_inclass):
    newCols[dfRaw["User"][user_idx[i]]] = list(dfRaw.iloc[user_idx[i],1:(num_ratings_inclass+1)])
    
dfNew = pd.DataFrame(newCols)
print(np.sum(np.sum(dfNew==0)))
dfNew.head(10)

4


Unnamed: 0,User 10,User 18,User 280,User 667,User 762,User 778,User 839,User 851,User 874,User 1045,User 1219,User 1232,User 1603,User 2213,User 3768,User 4068,User 5153,User 5228,User 5350,User 5413
0,0.0,0.0,2.33,0.0,1.69,1.92,2.64,2.65,4.43,0.98,2.43,1.62,0.82,0.78,0.72,1.11,1.37,0.54,2.36,3.62
1,5.0,0.53,2.27,0.0,1.73,4.51,2.64,2.64,3.64,1.54,2.58,2.44,0.85,0.8,5.0,1.15,1.42,0.53,2.25,5.0
2,3.64,3.69,5.0,1.43,2.66,1.96,2.17,2.65,5.0,1.54,2.58,2.6,1.82,1.56,1.37,1.83,1.43,1.57,2.21,2.62
3,3.64,3.66,1.88,1.41,1.96,2.03,2.15,2.08,2.68,1.56,5.0,2.59,1.63,2.9,1.39,3.25,2.56,1.48,1.29,2.62
4,5.0,2.95,2.32,1.42,1.94,2.01,1.85,1.87,2.67,1.53,5.0,2.61,1.6,2.85,1.45,4.25,3.18,1.47,1.65,2.48
5,2.92,2.93,3.2,1.57,1.95,2.0,1.32,1.67,2.04,5.0,3.72,5.0,1.61,2.84,1.48,5.0,3.2,1.48,1.25,1.44
6,5.0,2.95,4.17,5.0,1.97,2.02,1.3,1.33,2.23,5.0,2.09,3.74,1.63,5.0,5.0,5.0,5.0,1.5,1.24,1.87
7,2.35,3.0,2.91,1.52,1.99,1.99,1.3,1.31,1.91,4.0,2.64,2.26,1.66,3.41,4.15,1.99,3.87,1.53,1.25,1.35
8,2.32,1.7,2.93,3.33,5.0,2.11,1.41,1.35,1.93,5.0,2.64,2.68,3.01,2.85,5.0,2.59,5.0,2.66,1.46,1.33
9,2.63,2.62,3.2,3.33,2.05,2.1,1.58,1.58,1.38,5.0,2.66,2.66,2.97,2.88,5.0,2.59,0.93,3.26,1.71,1.34


Zero out more entries (for a random subset of users in each of rows 2 through 9) so the Jaccard similarity exercise is more interesting.

In [127]:
for row in range(2,10):
    num = np.random.choice(range(dfNew.shape[1]))
    users = list(np.random.choice(range(dfNew.shape[1]), replace=False, size=num))
    for col in users:
        dfNew.iloc[row,col] = 0

In [128]:
dfNew.head(10)

Unnamed: 0,User 10,User 18,User 280,User 667,User 762,User 778,User 839,User 851,User 874,User 1045,User 1219,User 1232,User 1603,User 2213,User 3768,User 4068,User 5153,User 5228,User 5350,User 5413
0,0.0,0.0,2.33,0.0,1.69,1.92,2.64,2.65,4.43,0.98,2.43,1.62,0.82,0.78,0.72,1.11,1.37,0.54,2.36,3.62
1,5.0,0.53,2.27,0.0,1.73,4.51,2.64,2.64,3.64,1.54,2.58,2.44,0.85,0.8,5.0,1.15,1.42,0.53,2.25,5.0
2,3.64,3.69,5.0,1.43,2.66,1.96,2.17,2.65,0.0,1.54,2.58,0.0,1.82,1.56,1.37,1.83,1.43,0.0,0.0,2.62
3,3.64,0.0,0.0,0.0,1.96,0.0,0.0,0.0,0.0,0.0,5.0,0.0,1.63,2.9,0.0,3.25,2.56,0.0,1.29,2.62
4,0.0,0.0,2.32,1.42,1.94,2.01,0.0,1.87,0.0,1.53,0.0,0.0,0.0,0.0,0.0,0.0,3.18,1.47,1.65,0.0
5,0.0,2.93,0.0,1.57,0.0,2.0,0.0,0.0,0.0,5.0,3.72,0.0,0.0,0.0,1.48,5.0,0.0,1.48,1.25,1.44
6,5.0,2.95,4.17,5.0,1.97,2.02,1.3,1.33,2.23,5.0,2.09,3.74,1.63,5.0,5.0,5.0,5.0,1.5,1.24,1.87
7,2.35,0.0,2.91,1.52,0.0,1.99,0.0,0.0,1.91,4.0,2.64,2.26,1.66,3.41,4.15,0.0,0.0,1.53,0.0,1.35
8,0.0,0.0,0.0,0.0,5.0,0.0,1.41,0.0,1.93,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,3.2,0.0,0.0,2.1,1.58,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.59,0.0,0.0,0.0,0.0


In [129]:
dfNew.to_csv("../data/google_review_ratings_inclass.csv", index=None)