# Introduction

This notebook demonstrates how to create a basic recommender system (RS) for books. The RS takes as input a historic set of book ratings from multiple reviewers. For a given book, it then suggests other books that might be interesting, similar to what some online shops do.

The accompanying presentation explains some types of recommender systems, including a manual calculation for user-based and item-based recommenders. Below we will develop a simple item-based recommender.

This notebook is by no means complete. The purpose is to give a quick hands-on experience with building a data product.

## Item Based Recommender System

To start, we import the Python libraries that we require for our prototype.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
#matplotlib.use('tkagg')
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Inspect and prepare the data

We are going to create a book recommender, and for that we need books and user ratings. We will use a dataset from [Book-Crossing](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) which has already been transformed into a usable user-book-rating format. So that first step has been done. ;-)

We also have to think about how we will evaluate the performance of our RS in the end and therefore how to prepare the data.

Let's read in the data and see what it looks like.

In [None]:
data = pd.read_csv("data_books.csv", sep = ",", header=None, names=['Reviewer', 'Book', 'Rating'])

In [None]:
# Let's see what the dataframe looks like
data.head()

To get a feel for the data, we want a few statistics:
- Size of the dataset
- Number of unique books and unique reviewers
- Top 20 most reviewed books and reviewers
- The sparsity of the user-item matrix

In [None]:
# Get the dimensions of the dataframe
dim = data.shape
print ("There are {:d} rows/records in this dataframe and {:d} columns/features.".format(dim[0], dim[1]))

In [None]:
unique_books = pd.unique(data[['Book']].values.ravel()).size
print ("{:d} unique books.".format(unique_books))

In [None]:
unique_reviewers = pd.unique(data[['Reviewer']].values.ravel()).size
print ("{:d} unique reviewers.".format(unique_reviewers))

In [None]:
# Top 20 most reviewed books 
top_books = data.Book.value_counts()
top_books.head(20)

Let's show a graph of the top 1000 most reviewed books.

In [None]:
plt.plot(top_books[:1000].values)

In [None]:
# Top Reviewers
top_reviewers = data.Reviewer.value_counts()
top_reviewers.head(20)

In [None]:
plt.plot(top_reviewers[:100].values)

**Question**: Do we see something interesting?

The sparsity is calculated here as the fraction of non-empty cells over the total number of cells in the matrix, so a low percentage means a very sparse matrix.

In [None]:
# Total matrix would be
total_cells = unique_reviewers * unique_books

# How sparse is the matrix now
perc = dim[0] / float(total_cells) * 100.
print ('{:.4f}%'.format(perc))
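
To make the calculation concrete, here is a small sketch on a made-up ratings table. The `toy` DataFrame and its values are illustrative assumptions, not part of the Book-Crossing data, and the sketch assumes each reviewer-book pair occurs only once. It computes the share of filled cells in the reviewer-by-book matrix, which is the quantity printed in the cell above, together with its complement.

```python
import pandas as pd

# Hypothetical toy data: 3 reviewers, 4 books, 5 ratings
toy = pd.DataFrame({
    'Reviewer': ['r1', 'r1', 'r2', 'r3', 'r3'],
    'Book':     ['A',  'B',  'A',  'C',  'D'],
    'Rating':   [8, 6, 9, 7, 5],
})

n_reviewers = toy['Reviewer'].nunique()
n_books = toy['Book'].nunique()

# Share of reviewer-book cells that actually contain a rating
filled = len(toy) / (n_reviewers * n_books) * 100
# Share of cells that are empty
empty = 100 - filled

print('{:.2f}% filled, {:.2f}% empty'.format(filled, empty))
```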

**Discussion**: How can we evaluate the performance of a top-N recommender system?

**Discussion**: How to prepare the data?

**Prepare the train and test datasets** and save them as files for later. For this exercise we keep only the records of reviewers with more than two reviews. Each reviewer's highest rated review goes into the test dataset and their other reviews go into the train dataset.

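```python
# This is working code, but it can take a while to run on the full dataset
train_parts = []
test_parts = []

for reviewer, group in data.groupby('Reviewer'):
    # We are only interested in reviewers with more than 2 reviews
    if len(group) > 2:
        # Sort a copy of the group so the highest rating comes first
        group = group.sort_values('Rating', ascending=False)
        test_parts.append(group.iloc[:1])    # highest rated review
        train_parts.append(group.iloc[1:])   # all remaining reviews

# DataFrame.append was removed in pandas 2.0, so concatenate the parts once at the end
train = pd.concat(train_parts)
test = pd.concat(test_parts)

train.to_csv('train_books.csv', index=False)
test.to_csv('test_books.csv', index=False)
```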
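
A loop over every reviewer group can be slow. As a possible alternative, the sketch below does the same kind of split without an explicit loop, using `groupby` with `transform` and `cumcount`. It runs on a made-up `toy` table; the names `toy`, `toy_train` and `toy_test` are assumptions for illustration, and ties in the highest rating are broken arbitrarily.

```python
import pandas as pd

# Hypothetical toy ratings; the real notebook would use `data` instead
toy = pd.DataFrame({
    'Reviewer': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'Book':     ['B1', 'B2', 'B3', 'B1', 'B4', 'B2', 'B3', 'B4', 'B5'],
    'Rating':   [5, 9, 7, 8, 6, 3, 10, 4, 6],
})

# Keep only reviewers with more than 2 reviews
n_reviews = toy.groupby('Reviewer')['Book'].transform('size')
eligible = toy[n_reviews > 2]

# Order each reviewer's ratings from highest to lowest
ordered = eligible.sort_values(['Reviewer', 'Rating'], ascending=[True, False])

# Position of each review within its reviewer group (0 = highest rated)
position = ordered.groupby('Reviewer').cumcount()

toy_test = ordered[position == 0]
toy_train = ordered[position > 0]

print('Train: {:d}, Test: {:d}'.format(len(toy_train), len(toy_test)))
```

Reviewer `b` has only two reviews and therefore ends up in neither set.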

In [None]:
train = pd.read_csv("train_books.csv", sep = ",")
test = pd.read_csv("test_books.csv", sep = ",")

In [None]:
print ('Train: {:d}, Test: {:d}'.format(train.shape[0], test.shape[0]))

### Data Size and performance

*For performance reasons we select only the most reviewed books, because calculating the similarity between all book pairs would take too much time. This is also why these calculations are executed separately from the online scoring mechanism.*

In [None]:
most_reviewed_books = pd.DataFrame({'count' : train.groupby(["Book"]).size()})\
                                    .reset_index().sort_values(['count'],ascending = False)

most_reviewed_books.head(20)

**Question**: Why do we see fewer reviews for *The Lovely Bones: A Novel*?

We need these names in a list for later use.

In [None]:
# Getting the list of the 20 most reviewed books
top_books = most_reviewed_books.Book.head(20).tolist()

## 2. Let's use the wisdom of the crowd! Calculating Similarities

For the most reviewed books we need to calculate their mutual similarities.
To do so, we start simple by calculating the similarity of two books and then generalize this to any two books. The similarity will be determined by the cosine similarity of their ratings.

**Approach**:
- Choose two books
- Determine their shared reviewers
- For each book, get the list of reviews given by their shared reviewers
- Calculate the cosine similarity over these reviews

Cosine similarity:
$$\cos (\theta) = \frac{v \cdot w}{\left\| v \right\| \left\| w \right\|} = \frac{\sum_{i=1}^{n} v_{i} w_{i}}{\sqrt{\sum_{i=1}^{n} v_{i}^{2}} \sqrt{\sum_{i=1}^{n} w_{i}^{2}}}$$
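
As a worked example of the formula, the sketch below computes the cosine similarity of two made-up rating vectors. It also computes the Pearson correlation, which is the cosine similarity after subtracting each vector's mean, to show that the two measures can differ. The vectors and the function names `cosine` and `pearson` are illustrative assumptions.

```python
import numpy as np

def cosine(v, w):
    """Cosine of the angle between two equally long vectors."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def pearson(v, w):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return cosine(v - v.mean(), w - w.mean())

# Hypothetical ratings by the same four reviewers for two books
v = [8, 9, 7, 10]
w = [7, 10, 6, 9]

print('cosine :', cosine(v, w))
print('pearson:', pearson(v, w))
```

Because ratings are all positive, the raw cosine similarity tends to sit close to 1, while the mean-centered version spreads the scores out more.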

Choose two books.

In [None]:
book_1, book_2 = "Harry Potter and the Chamber of Secrets", "Harry Potter and the Sorcerer's Stone (Harry Potter)"

Determine the shared reviewers.

In [None]:
# Getting all the reviewers for these books. For this we still need the train dataset as source.
book_1_reviewers = train[train.Book == book_1].Reviewer
book_2_reviewers = train[train.Book == book_2].Reviewer

# Determine any common reviewers
common_reviewers = set(book_1_reviewers).intersection(book_2_reviewers)

print("%d people have reviewed both books" % len(common_reviewers))

Get the ratings for both books given by the shared reviewers.

In [None]:
# Checking the table with only the common reviewers and get the book name and rating
com_rev_book1 = train[(train.Reviewer.isin(common_reviewers)) & (train.Book == book_1)]
com_rev_book2 = train[(train.Reviewer.isin(common_reviewers)) & (train.Book == book_2)]

In [None]:
com_rev_book1.head(10)

In [None]:
com_rev_book2.head(10)

We see duplicate reviewers. These must be removed before calculating the similarity, because we need two vectors of the same size.

In [None]:
# Fix the duplicates to prevent errors when calculating the similarity.
# Sort by reviewer so both vectors list the shared reviewers in the same order.
com_rev_book1 = com_rev_book1.sort_values('Reviewer')
com_rev_book1 = com_rev_book1[~com_rev_book1.Reviewer.duplicated()]

com_rev_book2 = com_rev_book2.sort_values('Reviewer')
com_rev_book2 = com_rev_book2[~com_rev_book2.Reviewer.duplicated()]
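
The similarity only makes sense if position i in both vectors belongs to the same reviewer. One way to guarantee this, sketched below on made-up data, is to drop duplicate reviewer-book pairs and then join the two books' ratings on `Reviewer`. The `toy_train` frame, the book titles and the `_1`/`_2` suffixes are assumptions for illustration.

```python
import pandas as pd

# Hypothetical training data; reviewer 'r2' rated book 'X' twice
toy_train = pd.DataFrame({
    'Reviewer': ['r1', 'r2', 'r2', 'r3', 'r1', 'r2', 'r4'],
    'Book':     ['X',  'X',  'X',  'X',  'Y',  'Y',  'Y'],
    'Rating':   [8, 7, 6, 9, 9, 5, 10],
})

# Keep the first rating per (Reviewer, Book) pair
deduped = toy_train.drop_duplicates(subset=['Reviewer', 'Book'])

ratings_x = deduped[deduped.Book == 'X'][['Reviewer', 'Rating']]
ratings_y = deduped[deduped.Book == 'Y'][['Reviewer', 'Rating']]

# An inner join keeps only the shared reviewers and aligns their ratings row by row
paired = ratings_x.merge(ratings_y, on='Reviewer', suffixes=('_1', '_2'))
print(paired)
```

The columns `Rating_1` and `Rating_2` of `paired` can then be used directly as the two vectors.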

In [None]:
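def cosin_sim(v, w):
    # Work on plain arrays so the result does not depend on the pandas index
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return np.dot(v, w) / np.sqrt(np.dot(v, v) * np.dot(w, w))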

In [None]:
cosin_sim(com_rev_book1.Rating, com_rev_book2.Rating)

Now we have two books that look quite similar.

### Create reusable code

In order to calculate the similarity for any two books we need to build two functions that will make our lives much easier.

1. A function that retrieves the reviews for a specific book, given a set of reviewers shared with another book
2. A function that calculates the cosine similarity for two books.

In [None]:
# First let's create a function that collects the reviews of our common reviewers for a specified book
def get_book_reviews(title, common_reviewers):
    mask = (train.Reviewer.isin(common_reviewers)) & (train.Book==title)
    reviews = train[mask].sort_values('Reviewer')
    reviews = reviews[reviews.Reviewer.duplicated()==False]
    return reviews

Next we define the similarity function and check that we get the same similarity as before.

In [None]:
def calculate_cosine_similarity(book1, book2):
    # We start by finding the common reviewers
    book_1_reviewers = train[train.Book == book1].Reviewer
    book_2_reviewers = train[train.Book == book2].Reviewer
    common_reviewers = set(book_1_reviewers).intersection(book_2_reviewers)

    # Then we look for the reviews given by common reviewers
    book_1_reviews = get_book_reviews(book1, common_reviewers)
    book_2_reviews = get_book_reviews(book2, common_reviewers)
    
    # Calculate the cosine similarity score
    return cosin_sim(book_1_reviews.Rating, book_2_reviews.Rating)

In [None]:
# Print the similarity score
calculate_cosine_similarity(book_1,book_2)

Calculating these similarities in real time and for many users in parallel would require huge computational power. This type of recommender system (item based) has the advantage that the mutual book similarities can be precalculated.

Let's do that!

Build the matrix with all mutual similarities.

In [None]:
# Calculate the similarity for our top books
correlation_coefficient = []

for book1 in top_books:
    print("Calculating the correlations for:", book1)
    for book2 in top_books:
        if book1 != book2:
            row = [book1, book2] + [calculate_cosine_similarity(book1, book2)]
            correlation_coefficient.append(row)

print("Done calculating.")
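
Because the similarity of (A, B) equals that of (B, A), each pair only has to be computed once. The sketch below shows this with `itertools.combinations`. Here `toy_similarity` is a hypothetical stand-in for the similarity function defined above, and the book titles are made up.

```python
from itertools import combinations
import pandas as pd

toy_books = ['A', 'B', 'C', 'D']

def toy_similarity(book1, book2):
    # Hypothetical placeholder: any symmetric function of two books would do here
    return 1.0 / (1 + abs(ord(book1) - ord(book2)))

rows = []
for book1, book2 in combinations(toy_books, 2):
    score = toy_similarity(book1, book2)   # computed once per unordered pair
    rows.append((book1, book2, score))
    rows.append((book2, book1, score))     # mirror the pair so lookups work in both directions

toy_corr = pd.DataFrame(rows, columns=['Book_1', 'Book_2', 'Correlation'])
print(len(toy_corr))
```

This halves the number of similarity calculations while producing the same table of ordered pairs.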

The resulting list of rows is converted into a pandas DataFrame to make it easier to access.

In [None]:
# Define the columns to use
cols = ["Book_1", "Book_2", "Correlation"]
correlation_coefficient = pd.DataFrame(correlation_coefficient, columns=cols).sort_values('Correlation')
#correlation_coefficient.head(10)

**Question**: What is the size of the correlation matrix?

In [None]:
correlation_coefficient.shape

## 3. Define the Recommender System

So now we need a function that gives us the similarity score for any two books.

In [None]:
# corr is the precalculated similarity table
def calc_correlation(corr, book1, book2):
    mask = (corr.Book_1 == book1) & (corr.Book_2 == book2)
    # Read the Correlation column explicitly; summing across the whole row
    # would also try to add the book title strings
    return corr.loc[mask, 'Correlation'].iloc[0]

In [None]:
calc_correlation(correlation_coefficient,"Harry Potter and the Chamber of Secrets", "Harry Potter and the Sorcerer's Stone (Harry Potter)")

Again, the same!
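
Filtering the long table with a boolean mask scans every row on each lookup. A possible alternative, sketched here on made-up values, is to pivot the table into a square book-by-book matrix once and read scores with `.loc`. The `toy_corr` frame and its numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical precalculated similarities in the same long format as above
toy_corr = pd.DataFrame({
    'Book_1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Book_2': ['B', 'C', 'A', 'C', 'A', 'B'],
    'Correlation': [0.9, 0.4, 0.9, 0.7, 0.4, 0.7],
})

# One row and one column per book; the diagonal stays NaN because
# a book is never paired with itself in the long table
sim_matrix = toy_corr.pivot(index='Book_1', columns='Book_2', values='Correlation')

# Look up a single score
print(sim_matrix.loc['A', 'B'])

# The most similar books to 'B', highest score first
print(sim_matrix.loc['B'].dropna().sort_values(ascending=False))
```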

### Check: for "Harry Potter and the Chamber of Secrets"

In [None]:
# Get the sorted correlations for Harry Potter
my_book = "Harry Potter and the Chamber of Secrets"

results = []
for b in top_books:
    if my_book!=b:
        results.append((b, calc_correlation(correlation_coefficient, my_book, b)))
sorted(results, key=lambda x: x[1], reverse=True)

But what if we need a top-N instead of all?

In [None]:
def get_recommendations(my_book, topn=5):
    results = []
    for other_book in top_books:
        if my_book != other_book:
            correlation = calc_correlation(correlation_coefficient, my_book, other_book)
            results.append((my_book, other_book, correlation))
    # Rank by similarity, highest first; head() copes with topn larger than the list
    ranked = sorted(results, key=lambda x: x[2], reverse=True)
    return pd.DataFrame(ranked).head(topn)

Now we can check, for any given book, which suggestions we should consider.

In [None]:
get_recommendations("Harry Potter and the Chamber of Secrets", 3)
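
One possible way to evaluate a top-N recommender with the held-out test set is a hit rate: for each test reviewer, recommend N books based on a book from their training history and count how often the held-out book is among them. The sketch below uses made-up data and a hypothetical `toy_recommend` function standing in for `get_recommendations`. It is one option among several, not the definitive evaluation.

```python
# Hypothetical precomputed recommendations: book -> ranked list of similar books
toy_similar = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'D', 'C'],
    'C': ['D', 'A', 'B'],
}

def toy_recommend(book, topn):
    """Stand-in for the recommender: return the top-N similar books."""
    return toy_similar.get(book, [])[:topn]

# Hypothetical evaluation cases: (a book the reviewer rated in train, the held-out test book)
cases = [('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')]

def hit_rate(cases, topn):
    """Fraction of cases where the held-out book appears in the top-N list."""
    hits = sum(1 for seen, held_out in cases if held_out in toy_recommend(seen, topn))
    return hits / len(cases)

print(hit_rate(cases, topn=1))
print(hit_rate(cases, topn=3))
```

A larger N can only keep or raise the hit rate, so results are usually reported for a fixed N.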

**Discussion**: How would you apply this in your company or job?

## References

- Part of the code is from: http://www.mickaellegal.com/blog/2014/1/30/how-to-build-a-recommender
- CF technique backgrounds are explained at http://www.hindawi.com/journals/aai/2009/421425/
- Item-Based Top-N Recommendation Algorithms (Deshpande and Karypis)
