## Beer Recommender Using a Content-Based Recommender System

In [96]:
# Dependencies and packages
%reload_ext lab_black
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
from sklearn.metrics import accuracy_score

pd.options.display.max_columns = 30
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

In [97]:
df = pd.read_csv("../data/csv/reviews_NLTK1686.csv", encoding="latin-1")

In [98]:
df.head()

| | beer_id | username | text | score | name | style |
|---|---|---|---|---|---|---|
| 0 | 5 | woodychandler | I feel as though The CANQuest (TM) is once mor... | 4.13 | Amber | Vienna Lager |
| 1 | 6 | rickyleepotts | I have had this beer before, and all I can rem... | 2.96 | Turbodog | English Brown Ale |
| 2 | 7 | Jadjunk | #115. I haven't reviewed much by Abita Brewing... | 3.67 | Purple Haze | Fruit and Field Beer |
| 3 | 10 | bditty187 | From notesÂ Batch #29 Dark brown in hue, damn... | 3.86 | Dubbel Ale | Belgian Dubbel |
| 4 | 17 | mikesgroove | So, going to GrandmaÂs for dinner tonight, an... | 3.78 | Widmer Hefeweizen | German Hefeweizen |
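
The `Â` in the `text` column (and sequences such as `Ã¤` in beer names later in the notebook) are the usual signature of UTF-8 bytes decoded as Latin-1. The sketch below shows one way such strings can be repaired after loading. `fix_mojibake` is a hypothetical helper, not part of the notebook, and it assumes the damage is a single Latin-1/UTF-8 mix-up.

```python
def fix_mojibake(text: str) -> str:
    """Re-decode text that was UTF-8 on disk but was read as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean Latin-1/UTF-8 round trip; leave the string untouched.
        return text


# "Ur-MÃ¤rzen" written with escapes so the example does not depend on file encoding
garbled = "Spaten Oktoberfestbier Ur-M\u00c3\u00a4rzen"
repaired = fix_mojibake(garbled)  # "Spaten Oktoberfestbier Ur-Märzen"

# Strings that are already clean pass through unchanged
unchanged = fix_mojibake("Amber")
```

If the whole file is UTF-8, passing `encoding="utf-8"` to `pd.read_csv` may avoid the problem at the source. That is an assumption about the CSV and is not verified here.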


In [99]:
print("We have", len(df), "beers in the data")

We have 1686 beers in the data


In [100]:
def print_description(index):
    # Print the raw review text and beer name for the row labelled `index`
    text, name = df.loc[index, ["text", "name"]]
    print(text)
    print("Name:", name)

In [101]:
print_description(10)

Revisiting this ale nearly 5 years since my first encounter, this bottle's BB date is... DEC. 2017! Served cool in Gulden Draak's tulip-shaped bowl sniffer. A: super dark russet brown hue with mahogany shines, coming with a moderate carbonated body and a nicely lingering beige frothy head. S: very different from the Wee Heavy tasted immediately before this - Orkney Skullsplitter - the aroma here is not so pronounced in its overripe-fruit departments; it's low-profiled, more distinctive with a clear highlight of the yeastiness, comprising touches of cheese, salty-sweet butterscotch and a yeasty-maltiness not unlike a robust and estery Belgian dark ale. The sour hint of Chinese herbed prunes provides a suitable balancer with the yeasty hints as well as with the rich yet underlying flow of overripe fruitiness (prunes, apples, and figs mainly). T: a clean entry of faintly raisiny, sour black dates and black-sugary dark malts comes not sweet at all and pretty understated in taste actually, 

In [102]:
print_description(200)

It's time for another "Good Idea, Bad Idea". Good idea, drinking and reviewing awesome beers on BA. Bad idea, the existence of this beer. Well it's that time again, time for another Jared's Bad Beer Review. In this session, St. Pauli Girl, what I remember as being the most foul thing I've ever put into my mouth from years ago, the first and only time I've ever touched this thing. I almost wanted to skunk it more for the fun, but why bother. It's bad enough I'm sure. They even have the date on there, best by 03/2012, so it can't be that bad, can it?! And the cap is removed, no going back now. The pour yields a pasty white 1.5 finger fizzy but creamy semi thick but bubbly head, reminiscent of albino sperm, and fades with actual sperm like lacing. The beer is more clear than Boston water, less brown as well, it's golden colored like a fake tan. Nose, well not quite as bad as I remember actually. I'm a little turned off by not throwing up as I breath it in. It's actually somewhat breathabl

### Beer Review Text Length Distribution

In [103]:
df["word_count"] = df["text"].apply(lambda x: len(str(x).split()))

In [104]:
desc_lengths = list(df["word_count"])

print(
    "Number of descriptions:",
    len(desc_lengths),
    "\nAverage word count",
    np.average(desc_lengths),
    "\nMinimum word count",
    min(desc_lengths),
    "\nMaximum word count",
    max(desc_lengths),
)

Number of descriptions: 1686 
Average word count 599.008896797153 
Minimum word count 275 
Maximum word count 973
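
Minimum, mean, and maximum give only a rough picture of the distribution. As a small, dependency-free sketch, quartiles can be computed with the standard library. `word_count_summary` and the toy reviews are made up for illustration and are not the notebook's data.

```python
import statistics


def word_count_summary(texts):
    """Return min, quartiles, and max of the per-text word counts."""
    counts = sorted(len(str(t).split()) for t in texts)
    q1, median, q3 = statistics.quantiles(counts, n=4)
    return {"min": counts[0], "q1": q1, "median": median, "q3": q3, "max": counts[-1]}


# Toy reviews with 1 to 5 words each (illustrative only)
toy_reviews = [
    "hazy",
    "hazy ipa",
    "hazy little ipa",
    "hazy little thing ipa",
    "a hazy little thing ipa",
]
summary = word_count_summary(toy_reviews)
print(summary)
```

On the real DataFrame, `df["word_count"].describe()` reports comparable statistics in one call.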


### Preprocessing review description text 


In [105]:
import nltk

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sheetalbongale/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [106]:
df.dtypes

beer_id         int64
username       object
text           object
score         float64
name           object
style          object
word_count      int64
dtype: object

In [107]:
df["text"] = df["text"].astype(str)

In [108]:
# Raw strings keep regex escapes such as \[ from being read as string escapes
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_Â]")
STOPWORDS = set(stopwords.words("english"))


def clean_text(text):
    """
    text: a string

    return: the lowercased string with punctuation and stopwords removed
    """
    text = text.lower()  # lowercase text
    # replace each character matched by REPLACE_BY_SPACE_RE with a space
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    # delete every character matched by BAD_SYMBOLS_RE
    text = BAD_SYMBOLS_RE.sub("", text)
    # remove stopwords from text
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text


df["desc_clean"] = df["text"].apply(clean_text)
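
To make the effect of each step concrete, here is a self-contained sketch of the same three-stage cleaning applied to one made-up sentence. `clean_text_demo`, `TOY_STOPWORDS`, and the sample sentence are assumptions for illustration; the tiny stopword set stands in for NLTK's English list so the example runs without a download.

```python
import re

# Same patterns as in the notebook cell
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_Â]")
# Assumption: a hand-picked stand-in for stopwords.words("english")
TOY_STOPWORDS = {"i", "had", "the", "it", "is", "and"}


def clean_text_demo(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # "/", "(", ")", ";" become spaces
    text = BAD_SYMBOLS_RE.sub("", text)  # "'", "&", "!" are deleted outright
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)


sample = "I've had the golden/amber pour (thick head); it is clean & malty!"
cleaned = clean_text_demo(sample)
print(cleaned)  # ive golden amber pour thick head clean malty
```

Note that deleting the apostrophe turns "I've" into "ive", which no longer matches the stopword "i". Tokens such as "ive", "im", and "cant" survive in the cleaned reviews for the same reason.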


In [109]:
def print_description(index):
    # Print the cleaned review text and beer name for the row labelled `index`
    text, name = df.loc[index, ["desc_clean", "name"]]
    print(text)
    print("Name:", name)


print_description(0)

feel though canquest tm gaining traction recent visit pinocchios netted abita cans anxious give try remember one industry professional telling ambers dead gar hope ambers take many forms knowing difficult work lager yeast duly impressed red ale got rocky finger tawny head pour hung around long could humidity evening color beautiful orangishamber nequality clarity nose malty also clean mouthfeel full nearly thick malty sweetness reminded hot cereal oatmeal cream wheat finish semisweet sweet cloying means pleasant would nice beer cooler hot day visit beach bottle also review original canquest tm takes precedence today good friday f 25 march 2016 bottle abita amber munich style lager brewed pale caramel malts german perle hops smooth malty caramel flavor rich amber color one first popular brews amber pairs well many foods popped cap gave semiagro pour try might could raise much way head even inglass swirl failed create anything mere skim wisps atop beer color deep amber light copper coppe

In [110]:
print_description(200)

time another good idea bad idea good idea drinking reviewing awesome beers ba bad idea existence beer well time time another jareds bad beer review session st pauli girl remember foul thing ive ever put mouth years ago first time ive ever touched thing almost wanted skunk fun bother bad enough im sure even date best 03 2012 cant bad cap removed going back pour yields pasty white 15 finger fizzy creamy semi thick bubbly head reminiscent albino sperm fades actual sperm like lacing beer clear boston water less brown well golden colored like fake tan nose well quite bad remember actually im little turned throwing breath actually somewhat breathable get faint cracked straw pilsner like malts breath light juiciness almost one point made actual beer hint skunked lager much way horridness actually granted pull right nonlight penetrating case fridge directly fridge little light exposure light bavarian malt aroma really seals deal actual beer even hops wait kidding hops harmed making beer taste 

In [111]:
df.set_index("name", inplace=True)

In [120]:
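tf = TfidfVectorizer(
    analyzer="word",
    ngram_range=(2, 3),  # bigrams and trigrams only, no single words
    min_df=1,  # keep every term; recent scikit-learn releases reject an integer 0
    stop_words="english",
)
tfidf_matrix = tf.fit_transform(df["desc_clean"])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)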
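
`TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so the plain dot product computed by `linear_kernel` equals the cosine similarity between reviews. The sketch below checks that identity on two made-up vectors in plain Python; the helper names and the numbers are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two toy term-weight vectors (made-up values)
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
cosine_of_raw = cosine(a, b)
print(math.isclose(dot_of_normalized, cosine_of_raw))  # True
```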

In [121]:
indices = pd.Series(df.index)

In [122]:
indices[:50]

0                                                 Amber
1                                              Turbodog
2                                           Purple Haze
3                                            Dubbel Ale
4                                     Widmer Hefeweizen
5                             Mackeson Triple XXX Stout
6                                        Trois Pistoles
7                                    Blanche De Chambly
8                                               Maudite
9                                       La Fin Du Monde
10                                   Traquair House Ale
11                                           Alpha King
12                                            Grand Cru
13                                                White
14                                        Anchor Porter
15                                    Anchor Steam Beer
16                                            Budweiser
17                       Young's Double Chocolat

In [123]:
def recommendations(name, cosine_similarities=cosine_similarities):

    # get the positional index of the first beer whose name matches
    idx = indices[indices == name].index[0]

    # create a Series of similarity scores sorted in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # get the indexes of the 10 most similar beers, skipping the first entry
    # (the beer itself, which has similarity 1.0)
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(score_series.iloc[1:11])

    # look up the names of the top 10 matching beers
    recommended_beers = [indices[i] for i in top_10_indexes]

    return recommended_beers
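
The lookup above skips the queried beer by assuming it always sorts first, and it fails with an `IndexError` for an unknown name. A minimal variant, written in plain Python on a made-up 3x3 similarity matrix, excludes the query by position and returns names together with scores. `top_k_similar` and the numbers below are hypothetical and not part of the notebook.

```python
def top_k_similar(name, names, similarity, k=10):
    """Return up to k (name, score) pairs most similar to `name`, excluding itself."""
    if name not in names:
        raise KeyError(f"Unknown beer: {name!r}")
    idx = names.index(name)  # first match if the name appears more than once
    scored = [
        (other, similarity[idx][j]) for j, other in enumerate(names) if j != idx
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: beer names from the notebook with made-up similarity values
names = ["Corona Extra", "Corona Light", "Shiner Bock"]
similarity = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
result = top_k_similar("Corona Extra", names, similarity, k=2)
print(result)  # [('Corona Light', 0.6), ('Shiner Bock', 0.1)]
```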

In [124]:
recommendations("Hazy Little Thing IPA")

98      0.046232
1355    0.023741
1014    0.019565
1643    0.019498
101     0.018839
1504    0.016843
1410    0.015325
99      0.013884
402     0.013102
1612    0.011933
dtype: float64


['Sierra Nevada Pale Ale',
 'Sierra Nevada Estate Brewers Harvest Ale',
 'Torpedo Extra IPA',
 'Second Fiddle',
 'Sierra Nevada Wheat Beer',
 'Best Of Beer Camp: California Common - Beer Camp #8',
 "30th Anniversary - Fritz And Ken's Ale",
 'Porter',
 'Sierra Nevada India Pale Ale',
 'Flipside Red IPA']

In [125]:
recommendations("Samuel Adams Summer Ale")

320     0.235412
733     0.224747
1119    0.200300
952     0.190512
32      0.187793
30      0.168832
797     0.164552
155     0.160932
31      0.158118
298     0.144466
dtype: float64


['Pyramid Hefeweizen',
 'Samuel Adams White Ale',
 'Blue Moon Honey Moon Summer Ale',
 'Pyramid Apricot Ale',
 'Samuel Adams Octoberfest',
 'Blue Moon Harvest Moon Pumpkin Ale',
 'Redhook Winterhook',
 'Spaten Oktoberfestbier Ur-MÃ¤rzen',
 'Samuel Adams Winter Lager',
 'Redhook ESB']

In [126]:
recommendations("Corona Extra")

83      0.059339
781     0.008774
235     0.007242
604     0.007043
693     0.006685
1117    0.006327
211     0.005964
1390    0.005585
533     0.005194
399     0.005079
dtype: float64


['Corona Light',
 'Morimoto Imperial Pilsner',
 "Mickey's",
 'Rugged Trail Nut Brown Ale',
 'Northern Hemisphere Harvest Wet Hop IPA',
 'Wailua Wheat',
 "He'Brew Genesis Ale (Old Version)",
 'Sierra Nevada Glissade Golden Bock',
 'Maredsous 8 - Brune',
 'Winter Ale']

### These are not great recommendations for a Corona lover: after Corona Light, every similarity score is below 0.01
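
One plausible contributor to the very low scores (an assumption, not something this notebook verifies) is `ngram_range=(2, 3)`: two reviews can share most of their vocabulary and still share almost no exact bigrams. The sketch below counts shared n-grams for two made-up sentences; `ngrams` and `shared` are illustrative helpers.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def shared(text_a, text_b, n):
    """Return the n-grams that occur in both texts."""
    return ngrams(text_a.split(), n) & ngrams(text_b.split(), n)


# Two toy reviews that use the same five words in a different order
review_a = "crisp light lager with lime"
review_b = "light crisp lager served with lime"

shared_unigrams = shared(review_a, review_b, 1)
shared_bigrams = shared(review_a, review_b, 2)
print(len(shared_unigrams), len(shared_bigrams))  # 5 1
```

Rerunning the vectorizer with `ngram_range=(1, 2)` would be one way to test whether single words improve the recommendations.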

In [74]:
cosine_similarities

array([[1.        , 0.01667019, 0.        , ..., 0.        , 0.0484561 ,
        0.        ],
       [0.01667019, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.0484561 , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [119]:
recommendations("Shiner Bock")

543     0.529969
799     0.348127
840     0.335035
1612    0.329352
1238    0.323763
312     0.320974
1055    0.320538
716     0.302118
673     0.289096
1614    0.281792
dtype: float64


['Uff-da',
 'Samuel Adams Chocolate Bock',
 'Hop Sun (Summer Wheat Beer)',
 'Flipside Red IPA',
 '400 Pound Monkey',
 "Raison D'Ã\x8atre",
 'Lager of The Lakes',
 'Sisyphus',
 'Staghorn Octoberfest',
 'Accumulation']

In [127]:
recommendations("Salvator Doppel Bock")

359     0.019193
691     0.011363
6       0.011282
636     0.010804
1575    0.010200
1270    0.009908
161     0.009672
89      0.009510
236     0.009469
846     0.009286
dtype: float64


['Delirium Nocturnum',
 "Dale's Pale Ale",
 'Trois Pistoles',
 'Matilda',
 'Organic Chocolate Stout',
 'Vienna Lager',
 'Fat Tire Belgian Style Ale',
 'Heineken Lager Beer',
 'Bass Pale Ale',
 'Twilight Summer Ale']