Recommending beers with Python
==============================

We are going to build a recommendation system for beers from thousands of reviews written by beer lovers. Each review scores a beer on several aspects: aroma, taste, appearance, and so on.

We will use only a subset of these features (and of the beers). At the end we will also assign a weight to each feature, so that the recommended beers better reflect our own taste.

For more data and notebooks see https://github.com/marcelcaraciolo/big-data-tutorial/tree/master/data

In [None]:
import pandas as pd
import numpy as np
import pylab as pl

import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D

%matplotlib inline

In [None]:
# read and explore data
df = pd.read_csv("https://s3-eu-west-1.amazonaws.com/asi-training-data/beer_reviews.csv")
df.head()

## Finding Beers with the largest number of reviews...

We start by counting the reviews for each beer. The recommendation system compares beers through the reviewers they have in common, so we need beers reviewed by many people: the more shared reviewers, the more opinions we can match and the better the results.

In [None]:
beer_names = df.beer_name.unique()
print('total number of reviewed beers:', len(beer_names))

# number of reviews per beer, most reviewed first
rev_num = df.groupby('beer_name').count().iloc[:, 0].sort_values(ascending=False)
print(rev_num[:20])


## ...and common reviewers set

In [None]:
# we should choose the beers with the biggest number of reviews...

beer_1, beer_2 = "90 Minute IPA", "India Pale Ale"

beer_1_reviewers = df[df.beer_name==beer_1].review_profilename.unique()
beer_2_reviewers = df[df.beer_name==beer_2].review_profilename.unique()
common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)
print("Users in the same set: %d" % len(common_reviewers))
list(common_reviewers)[:20]

## Extracting Reviews

In [None]:
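def get_beer_reviews(beer, common_users):
    # keep only this beer's reviews written by the common reviewers
    mask = (df.review_profilename.isin(common_users)) & (df.beer_name==beer)
    # sort by reviewer so that both beers list their reviewers in the same order
    reviews = df[mask].sort_values('review_profilename')
    # keep a single review per reviewer
    reviews = reviews[~reviews.review_profilename.duplicated()]
    return reviews
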
beer_1_reviews = get_beer_reviews(beer_1, common_reviewers)
beer_2_reviews = get_beer_reviews(beer_2, common_reviewers)

cols = ['beer_name', 'review_profilename', 'review_overall', 'review_aroma', 'review_palate', 'review_taste']
beer_2_reviews[cols].head()
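
Sorting by reviewer and keeping one review per reviewer matters because the distance step compares the two review lists position by position. The sketch below illustrates this on made-up data; the `toy` frame and the `first_review_per_user` helper are assumptions for illustration and are not part of the dataset or of the notebook. It uses a stable sort so that the review kept for a repeated reviewer is predictable.

```python
import pandas as pd

# Made-up reviews: "bob" reviewed beer A twice, and rows are in no particular order.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B"],
    "review_profilename": ["bob", "ann", "bob", "bob", "ann"],
    "review_taste": [4.0, 3.5, 4.5, 3.0, 5.0],
})

def first_review_per_user(frame, beer, users):
    """Hypothetical helper: one row per reviewer, sorted by reviewer name."""
    mask = frame.review_profilename.isin(users) & (frame.beer_name == beer)
    # mergesort is stable, so repeated reviewers keep their original row order
    reviews = frame[mask].sort_values("review_profilename", kind="mergesort")
    return reviews.drop_duplicates("review_profilename")

common = {"ann", "bob"}
a = first_review_per_user(toy, "A", common)
b = first_review_per_user(toy, "B", common)

# Both frames now list the same reviewers in the same order.
print(a.review_profilename.tolist())
print(b.review_profilename.tolist())
```

With the rows aligned this way, the i-th score of one beer and the i-th score of the other come from the same reviewer.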

## Calculating Distance

We will use the Euclidean distance again!
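
As a reminder, the Euclidean distance between two score vectors is the square root of the sum of their squared differences. Here is a minimal sketch with made-up scores (the arrays `scores_a` and `scores_b` are assumptions for illustration), written with NumPy's vectorized operations rather than an explicit loop:

```python
import numpy as np

# Made-up scores given by the same three reviewers to two beers.
scores_a = np.array([4.0, 3.5, 5.0])
scores_b = np.array([4.0, 4.5, 3.0])

# Square root of the sum of squared differences.
dist = np.sqrt(np.sum((scores_a - scores_b) ** 2))

# np.linalg.norm of the difference vector gives the same quantity.
assert np.isclose(dist, np.linalg.norm(scores_a - scores_b))
print(dist)
```

The differences here are 0, -1 and 2, so the distance is the square root of 5.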


In [None]:
def euclideanDistance(instance1, instance2):
    length = len(instance1)
    # you can also check if instance1 and instance2 have the same length
    distance = 0
    for l in range(length):
        distance += (instance1.iloc[l] - instance2.iloc[l])**2
    return np.sqrt(distance)

In [None]:
ALL_FEATURES = ['review_overall', 'review_aroma', 'review_palate', 'review_taste']
def calculate_distance(beer1, beer2):
    # find common reviewers
    beer_1_reviewers = df[df.beer_name==beer1].review_profilename.unique()
    beer_2_reviewers = df[df.beer_name==beer2].review_profilename.unique()
    common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)

    # get reviews
    beer_1_reviews = get_beer_reviews(beer1, common_reviewers)
    beer_2_reviews = get_beer_reviews(beer2, common_reviewers)
    dists = []
    for f in ALL_FEATURES:
        dists.append(euclideanDistance(beer_1_reviews[f], beer_2_reviews[f]))
    
    return dists

calculate_distance(beer_1, beer_2)

## Calculate the Similarity for a Set of Beers

In [None]:
# calculate only a subset for the demo, let's say the 5 most reviewed beers...

beers = ["90 Minute IPA", "India Pale Ale", "Old Rasputin Russian Imperial Stout", "Sierra Nevada Celebration Ale", 
         "Two Hearted Ale"]

simple_distances = []
for beer1 in beers:
    print("computing distances for", beer1)
    for beer2 in beers:
        if beer1 != beer2:
            row = [beer1, beer2] + calculate_distance(beer1, beer2)
            simple_distances.append(row)

## Inspect the Results

In [None]:
cols = ["beer1", "beer2", "overall_dist", "aroma_dist", "palate_dist", "taste_dist"]
simple_distances = pd.DataFrame(simple_distances, columns=cols)
simple_distances.head()

## Allow the User to Customize the Weights

In [None]:
def calc_distance(dists, beer1, beer2, weights):
    mask = (dists.beer1==beer1) & (dists.beer2==beer2)
    row = dists[mask]
    row = row[['overall_dist', 'aroma_dist', 'palate_dist', 'taste_dist']]
    dist = weights * row
    return dist.sum(axis=1).tolist()[0]

weights = [4, 1, 5, 1]
print(calc_distance(simple_distances, "90 Minute IPA", "India Pale Ale", weights))
print(calc_distance(simple_distances, "90 Minute IPA", "Old Rasputin Russian Imperial Stout", weights))
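
The weighted distance is simply a weighted sum of the four per-feature distances. The sketch below spells out the arithmetic with made-up numbers; the values in `feature_dists` are assumptions for illustration, not results computed from the dataset.

```python
import numpy as np

# Made-up per-feature distances, in the order overall, aroma, palate, taste.
feature_dists = np.array([10.0, 12.0, 8.0, 11.0])
weights = np.array([4, 1, 5, 1])

# 4*10 + 1*12 + 5*8 + 1*11
weighted = float(np.dot(weights, feature_dists))
print(weighted)
```

A larger weight makes differences in that feature count for more, so two beers that disagree on a heavily weighted feature end up further apart.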

## Find Similar Beers for your favourite beer!

In [None]:
my_beer = "90 Minute IPA"
results = []
weights = [2, 1, 4, 1]
for b in beers:
    if my_beer!=b:
        results.append((my_beer, b, calc_distance(simple_distances, my_beer, b, weights)))
sorted(results, key=lambda x: x[2])[:10]