# Jaccard Similarity 

In this section you will construct a similarity metric based on the Jaccard similarity coefficient. 

Remember sets from your mathematics class? The coefficient is simple because it is built from two set operations, intersection and union: it is the size of the intersection divided by the size of the union.
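As a quick illustration of these two set operations, here is a minimal sketch using Python's built-in `set` type. The user IDs and the names `readers_a` and `readers_b` are made up for this example and do not come from the dataset.

```python
# Toy sets of user IDs, invented for illustration only.
readers_a = {1, 2, 3, 4}
readers_b = {3, 4, 5}

# Users present in both sets, and users present in at least one set.
intersection = readers_a & readers_b   # {3, 4}
union = readers_a | readers_b          # {1, 2, 3, 4, 5}

# Jaccard coefficient: size of the intersection over size of the union.
jaccard = len(intersection) / len(union)
print(jaccard)  # 0.4
```

Two of the five distinct users appear in both sets, so the coefficient is 2 / 5.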

## 1. Load the dataset

In [None]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')
df_books_ratings

## 2. Make a set for each book entry

Now we need to construct a set for each book in our dataset. This set will be composed of User-IDs who rated the book. 

From our data frame, can you construct a Python dictionary containing ISBNs as keys and a list of User-IDs as values? 

In [1]:
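df = df_books_ratings

# Group the ratings by ISBN, collect each book's User-IDs into a list,
# then convert the resulting Series into a plain dictionary.
dict_isbn_groups = df.groupby('ISBN')['User-ID'].apply(list).to_dict()

# Number of distinct books (one key per ISBN).
print(len(dict_isbn_groups))

dict_isbn_groups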

## 3. Jaccard distance function

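Here is the `jaccard_distance` function we provide for the exercise. It scores how close two books are based on who rated them: the more raters two books share, relative to all users who rated either book, the higher the score. Note that, despite its name, the function returns the Jaccard similarity coefficient (1.0 for identical sets of raters, 0.0 for no shared raters); the Jaccard distance in the strict sense would be 1 minus this value.

Please take a closer look at the function. It converts its inputs to Python sets, so it expects two lists of User-IDs.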

In [7]:
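def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    # Convert both collections of User-IDs to sets; duplicates are dropped.
    set_isbn_a = set(user_ids_isbn_a)
    set_isbn_b = set(user_ids_isbn_b)

    # Users who rated either book, and users who rated both books.
    union = set_isbn_a.union(set_isbn_b)
    intersection = set_isbn_a.intersection(set_isbn_b)

    # Share of all raters that the two books have in common (0.0 to 1.0).
    return len(intersection) / float(len(union))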
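To get a feel for the values the function returns, the sketch below calls it on a few hand-made lists. The function body is repeated only so that the cell runs on its own, and the User-IDs are hypothetical, not taken from the dataset.

```python
def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    # Same logic as the function above, repeated so this cell is self-contained.
    set_isbn_a = set(user_ids_isbn_a)
    set_isbn_b = set(user_ids_isbn_b)
    union = set_isbn_a | set_isbn_b
    intersection = set_isbn_a & set_isbn_b
    return len(intersection) / len(union)

# Hypothetical User-ID lists, invented for illustration.
same = jaccard_distance([10, 20, 30], [30, 20, 10])      # identical raters
partial = jaccard_distance([10, 20, 30], [20, 30, 40])   # 2 shared out of 4
disjoint = jaccard_distance([10, 20], [30, 40])          # no shared raters
dupes = jaccard_distance([10, 10, 20], [10, 20])         # duplicates collapse

print(same, partial, disjoint, dupes)  # 1.0 0.5 0.0 1.0
```

The order of the IDs and any repeated IDs have no effect, because both inputs are turned into sets before they are compared.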

## 4. Calculate distances 

Here is the ISBN of a book in our dataset (you can of course choose another one!). 

Can you calculate this book's Jaccard distance from all the other books in the dataset, using the function above?

In [2]:
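a_book_isbn = '002542730X'

# Users who rated the reference book; looked up once, outside the loop.
users_a = dict_isbn_groups[a_book_isbn]

for isbn, users in dict_isbn_groups.items():
    if isbn != a_book_isbn:
        d = jaccard_distance(users_a, users)
        # Only print books that share at least one rater with the reference book.
        if d > 0.0:
            print(a_book_isbn + ' - ' + isbn + ' : d=' + str(d))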

## 5. Function calculating distances 

Considering the code above, can you make a function that will take as input a given book's ISBN and calculate its distance from all other books in our dataset? 

In [None]:
# code goes here
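One possible shape for such a function is sketched below; treat it as a starting point rather than the reference solution. The name `distances_from_book`, its parameters, and the toy dictionary are assumptions made for this example. The score is computed inline, with the same formula as `jaccard_distance`, so that the cell runs on its own.

```python
def distances_from_book(book_isbn, isbn_to_users):
    """Return {other_isbn: score} for every other book sharing at least one rater.

    Hypothetical helper: `isbn_to_users` maps each ISBN to a list of User-IDs.
    """
    target_users = set(isbn_to_users[book_isbn])
    scores = {}
    for isbn, users in isbn_to_users.items():
        if isbn == book_isbn:
            continue  # skip the reference book itself
        other_users = set(users)
        union = target_users | other_users
        if not union:
            continue  # both books have no raters; the score is undefined
        score = len(target_users & other_users) / len(union)
        if score > 0.0:
            scores[isbn] = score
    return scores


# Toy data, invented for illustration (not real ISBNs or User-IDs).
toy = {'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [9], 'D': [1]}
result = distances_from_book('A', toy)
print(result)
```

With the dictionary built in step 2, the call would look like `distances_from_book(a_book_isbn, dict_isbn_groups)`. Returning a dictionary instead of printing makes it easy to sort the results or keep only the closest books.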