# Cosine Similarity

In this section you will construct another similarity metric, this time based on the cosine.

Remember trigonometry (or better, linear algebra!) from your mathematics class? This metric is based on the angle between two vectors: it computes the cosine of that angle. It might look difficult, but it is rather simple.
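
As a warm-up, here is a minimal sketch of the idea on two tiny vectors. The values and the names `u` and `v` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Two toy rating vectors (hypothetical values, not from the dataset)
u = np.array([5, 0, 3])
v = np.array([4, 0, 0])

# cos(theta) = (u . v) / (|u| * |v|)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Angle in degrees; clip guards against floating-point drift outside [-1, 1]
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(round(cos_theta, 4), round(theta_deg, 2))
```

The closer the cosine is to 1, the smaller the angle, and the more alike the two vectors are.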

## 1. Load the dataset

In [1]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')
df_books_ratings

## 2. Explore the dimensions (shape) of users' rating data

Here are the IDs of two users in our ratings dataset. What are their respective ratings' dimensions (shape)? How many books did these users rate respectively?

In [2]:
df = df_books_ratings

a = df[df['User-ID'] == 277427]
b = df[df['User-ID'] == 277203]

print(a.shape)
print(b.shape)

# What is the problem here? The shapes differ, but the two vectors need to be the same size.

If we are to produce vectors from the users' ratings and apply trigonometric operations to them, can you see a problem here? Are the vectors of the same dimension? If not, why is this a 'problem'?
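
To make the problem concrete, here is a small sketch with made-up rating arrays (the values and the names `ratings_a` and `ratings_b` are illustrative only). NumPy refuses to take the dot product of two 1-D arrays of different lengths. Even if the lengths happened to match, position `i` in each array would have to refer to the same book for the result to mean anything.

```python
import numpy as np

# Hypothetical rating arrays of different lengths for two users
ratings_a = np.array([8, 0, 5])
ratings_b = np.array([7, 9])

try:
    np.dot(ratings_a, ratings_b)
    mismatch_detected = False
except ValueError as err:
    # NumPy reports that the two shapes are not aligned
    mismatch_detected = True
    print(f"dot product failed: {err}")
```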

## 3. Vectorize ratings

Can you vectorize the above users' ratings so they have the same dimension? To help you do this, here is a sorted list of all the ISBNs in our dataset. How can you use this list of all the ISBNs to create a (large!) vector for the first user above?

In [4]:
import numpy as np

ISBNS_array = df['ISBN'].unique()
ISBNS_array = np.sort(ISBNS_array).tolist()

# print(ISBNS_array[:10])

# Select a user

df_a_user = df[df['User-ID'] == 277427]

# print(df_a_user)

a_user_ISBN_rating = dict(zip(df_a_user['ISBN'], df_a_user['Book-Rating']))

# print(a_user_ISBN_rating)

a_user_ISBNS = a_user_ISBN_rating.keys()

# print(a_user_ISBNS)

a_user_ISBN_rating_vector = [0 if v not in a_user_ISBNS else a_user_ISBN_rating[v] for v in ISBNS_array]

# len(a_user_ISBN_rating_vector)

print(a_user_ISBN_rating_vector)

# big (and sparse) vector!!
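
The same vectorization can be sketched more compactly with `dict.get`, which returns a default value for missing keys. The ISBNs and ratings below are toy values, and the names `all_isbns`, `user_ratings` and `vector` are placeholders rather than names from the notebook.

```python
# Toy data (hypothetical ISBNs and ratings)
all_isbns = ['0001', '0002', '0003', '0004']
user_ratings = {'0002': 8, '0004': 5}

# dict.get returns 0 for any ISBN the user has not rated
vector = [user_ratings.get(isbn, 0) for isbn in all_isbns]

print(vector)  # [0, 8, 0, 5]
```

Every user's vector built this way has one slot per ISBN, in the same order, so any two of them can be compared position by position.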


## 4. Helper functions

Below are two functions that (1) retrieve user ratings from a given dataset and (2) vectorize these ratings according to a certain dimension (all ISBNs).

In [6]:
def get_user_ratings(user_id, df_subset):
    
    df_user = df_subset[df_subset['User-ID'] == user_id]
        
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

In [7]:
def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    
    user_ISBNS = user_ISBN_rating_dict.keys()
    
    return [0 if v not in user_ISBNS else user_ISBN_rating_dict[v] for v in all_ISBNS_array]    
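
Here is a short usage sketch of the two helpers on a toy DataFrame. The helpers are repeated so the example runs on its own; the frame `toy` and its users, ISBNs and ratings are invented for illustration.

```python
import pandas as pd

def get_user_ratings(user_id, df_subset):
    df_user = df_subset[df_subset['User-ID'] == user_id]
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    user_ISBNS = user_ISBN_rating_dict.keys()
    return [0 if v not in user_ISBNS else user_ISBN_rating_dict[v] for v in all_ISBNS_array]

# Toy frame with hypothetical users, ISBNs and ratings
toy = pd.DataFrame({
    'User-ID': [1, 1, 2],
    'ISBN': ['0002', '0003', '0001'],
    'Book-Rating': [7, 9, 4],
})
toy_isbns = sorted(toy['ISBN'].unique())

ratings_1 = get_user_ratings(1, toy)
vector_1 = create_ratings_vector(ratings_1, toy_isbns)
vector_2 = create_ratings_vector(get_user_ratings(2, toy), toy_isbns)

# Both vectors have one slot per ISBN, in the same sorted order
print(vector_1, vector_2)
```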

## 5. Cosine distance function

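Can you finish writing the function below, which calculates the cosine of the angle between two vectors of the same dimension? Use the numpy `dot` and `norm` functions to translate the given formula into code. Note that, despite the function's name, the value it returns is the cosine similarity: 1 means the vectors point in the same direction, and 0 means they are orthogonal (here, no book has a non-zero rating from both users). The cosine distance proper is 1 minus this value.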

In [8]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(ratings_vector_user_a, ratings_vector_user_b):
    
    #     a . b  -> dot(a, b)
    #     -----
    #     |a||b| -> norm(a) * norm(b)
    
    return dot(ratings_vector_user_a, ratings_vector_user_b) / (norm(ratings_vector_user_a) * norm(ratings_vector_user_b))
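
One edge case worth keeping in mind: if a user's vector contains only zeros, its norm is 0 and the division above yields `nan`. The variant below is a sketch of one way to handle that. The name `safe_cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example, not part of the notebook.

```python
import numpy as np
from numpy.linalg import norm

def safe_cosine_similarity(vec_a, vec_b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    denominator = norm(vec_a) * norm(vec_b)
    if denominator == 0:
        # Assumed convention: an all-zero vector is similar to nothing
        return 0.0
    return float(np.dot(vec_a, vec_b) / denominator)

zero_case = safe_cosine_similarity([0, 0, 0], [5, 0, 3])
parallel_case = safe_cosine_similarity([1, 2, 3], [2, 4, 6])

print(zero_case)  # 0.0
print(parallel_case)
```

Vectors that point in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, score approximately 1 regardless of their lengths.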
    

## 6. Calculate distances

Here is the ID of a user in our dataset (you can of course choose another one!).

Can you calculate this user's cosine distance from all the other users in the dataset?

In [5]:
a_user_id = 277427

all_user_ids = df['User-ID'].unique().tolist()

a_user_ISBN_ratings = get_user_ratings(a_user_id, df)

a_user_ratings_vector = create_ratings_vector(a_user_ISBN_ratings, ISBNS_array)

# print(a_user_ratings_vector)

for u_id in all_user_ids:
    
    if u_id == a_user_id:
        continue
    
    user_ISBN_ratings = get_user_ratings(u_id, df)
    
    user_ratings_vector = create_ratings_vector(user_ISBN_ratings, ISBNS_array)
    
    d = cosine_distance(a_user_ratings_vector, user_ratings_vector)
    
    # Only report users with a positive score, i.e. at least one book with a non-zero rating from both
    if d > 0.0:
        print(f'{a_user_id} - {u_id} : d={d}')


## 7. Function calculating distances

Considering the code above, can you write a function that takes a user's ID as input and calculates that user's distance from all other users in our dataset?

In [None]:
# code goes here
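
One possible shape for such a function is sketched below. It is only a starting point: the name `cosine_scores_for_user`, the inner helper `to_vector`, the decision to return a dictionary, and the toy DataFrame are all assumptions made for this example. Users whose vector is all zeros are skipped to avoid dividing by zero.

```python
import pandas as pd
from numpy import dot
from numpy.linalg import norm

def cosine_scores_for_user(user_id, df_ratings):
    """Return {other_user_id: cosine score} for every other user with a score > 0."""
    all_isbns = sorted(df_ratings['ISBN'].unique())

    def to_vector(uid):
        # Build one rating per ISBN, 0 where the user has no rating
        rows = df_ratings[df_ratings['User-ID'] == uid]
        ratings = dict(zip(rows['ISBN'], rows['Book-Rating']))
        return [ratings.get(isbn, 0) for isbn in all_isbns]

    target = to_vector(user_id)
    target_norm = norm(target)
    scores = {}
    for other_id in df_ratings['User-ID'].unique().tolist():
        if other_id == user_id:
            continue
        other = to_vector(other_id)
        denominator = target_norm * norm(other)
        if denominator == 0:
            continue  # skip all-zero vectors
        score = float(dot(target, other) / denominator)
        if score > 0.0:
            scores[other_id] = score
    return scores

# Toy frame with hypothetical users, ISBNs and ratings
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 2, 3],
    'ISBN': ['A', 'B', 'A', 'B', 'C'],
    'Book-Rating': [5, 3, 10, 6, 8],
})
toy_scores = cosine_scores_for_user(1, toy)
print(toy_scores)
```

In the toy data, user 2 rated the same books as user 1 in the same proportions, so the score is approximately 1, while user 3 shares no books with user 1 and is left out.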