## Beer Recommender Using a Content-Based Recommender System

In [1]:
# Dependencies and packages
%reload_ext lab_black
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
from sklearn.metrics import accuracy_score

pd.options.display.max_columns = 30
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

In [2]:
df = pd.read_csv("../data/csv/reviews_NLTK1686.csv", encoding="latin-1")

In [3]:
df.head()

| Unnamed: 0 | beer_id | username | text | score | name | style |
|---|---|---|---|---|---|---|
| 0 | 5 | woodychandler | I feel as though The CANQuest (TM) is once mor... | 4.13 | Amber | Vienna Lager |
| 1 | 6 | rickyleepotts | I have had this beer before, and all I can rem... | 2.96 | Turbodog | English Brown Ale |
| 2 | 7 | Jadjunk | #115. I haven't reviewed much by Abita Brewing... | 3.67 | Purple Haze | Fruit and Field Beer |
| 3 | 10 | bditty187 | From notes Batch #29 Dark brown in hue, damn... | 3.86 | Dubbel Ale | Belgian Dubbel |
| 4 | 17 | mikesgroove | So, going to Grandma's for dinner tonight, an... | 3.78 | Widmer Hefeweizen | German Hefeweizen |


In [4]:
print("We have", len(df), "reviews in the data")
print("We have", df["beer_id"].nunique(), "unique beers in the data")

We have 1686 reviews in the data
We have 1686 unique beers in the data


In [5]:
def print_description(index):
    """Print the review text and beer name stored at row label `index`."""
    example = df.loc[index, ["text", "name"]]
    print(example["text"])
    print("Name:", example["name"])

In [6]:
print_description(10)

Revisiting this ale nearly 5 years since my first encounter, this bottle's BB date is... DEC. 2017! Served cool in Gulden Draak's tulip-shaped bowl sniffer. A: super dark russet brown hue with mahogany shines, coming with a moderate carbonated body and a nicely lingering beige frothy head. S: very different from the Wee Heavy tasted immediately before this - Orkney Skullsplitter - the aroma here is not so pronounced in its overripe-fruit departments; it's low-profiled, more distinctive with a clear highlight of the yeastiness, comprising touches of cheese, salty-sweet butterscotch and a yeasty-maltiness not unlike a robust and estery Belgian dark ale. The sour hint of Chinese herbed prunes provides a suitable balancer with the yeasty hints as well as with the rich yet underlying flow of overripe fruitiness (prunes, apples, and figs mainly). T: a clean entry of faintly raisiny, sour black dates and black-sugary dark malts comes not sweet at all and pretty understated in taste actually, 

In [7]:
print_description(200)

It's time for another "Good Idea, Bad Idea". Good idea, drinking and reviewing awesome beers on BA. Bad idea, the existence of this beer. Well it's that time again, time for another Jared's Bad Beer Review. In this session, St. Pauli Girl, what I remember as being the most foul thing I've ever put into my mouth from years ago, the first and only time I've ever touched this thing. I almost wanted to skunk it more for the fun, but why bother. It's bad enough I'm sure. They even have the date on there, best by 03/2012, so it can't be that bad, can it?! And the cap is removed, no going back now. The pour yields a pasty white 1.5 finger fizzy but creamy semi thick but bubbly head, reminiscent of albino sperm, and fades with actual sperm like lacing. The beer is more clear than Boston water, less brown as well, it's golden colored like a fake tan. Nose, well not quite as bad as I remember actually. I'm a little turned off by not throwing up as I breath it in. It's actually somewhat breathabl

### Beer Review Text Length Distribution

In [8]:
df["word_count"] = df["text"].apply(lambda x: len(str(x).split()))

In [9]:
desc_lengths = list(df["word_count"])

print(
    "Number of descriptions:",
    len(desc_lengths),
    "\nAverage word count",
    np.average(desc_lengths),
    "\nMinimum word count",
    min(desc_lengths),
    "\nMaximum word count",
    max(desc_lengths),
)

Number of descriptions: 1686 
Average word count 599.008896797153 
Minimum word count 275 
Maximum word count 973
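
The summary above reports only the mean and the two extremes. As a small sketch on made-up reviews (the `toy_text` series is hypothetical and not taken from the dataset), `Series.describe()` adds the quartiles to the same word-count calculation:

```python
import pandas as pd

# Hypothetical reviews standing in for df["text"]
toy_text = pd.Series(
    [
        "hazy juicy ipa",
        "dark roasted malt with coffee notes",
        "crisp lager",
    ]
)

# Same whitespace-based word count as in the cell above
toy_word_count = toy_text.apply(lambda x: len(str(x).split()))

# count, mean, std, min, quartiles, and max in one call
summary = toy_word_count.describe()
print(summary)
```

On the real data the equivalent call would presumably be `df["word_count"].describe()`.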


### Preprocessing review description text 


In [10]:
import nltk

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sheetalbongale/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
df.dtypes

beer_id         int64
username       object
text           object
score         float64
name           object
style          object
word_count      int64
dtype: object

In [12]:
df["text"] = df["text"].astype(str)

In [13]:
# Raw string: avoids invalid-escape warnings for \[ \] and \|
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile("[^0-9a-z #+_Â]")
STOPWORDS = set(stopwords.words("english"))


def clean_text(text):
    """
    text: a string

    return: the lowercased string with punctuation and stopwords removed
    """
    text = text.lower()
    # replace separator-like punctuation with a space
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    # delete every remaining character that is not in the allowed set
    text = BAD_SYMBOLS_RE.sub("", text)
    # remove stopwords from the text
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text


df["desc_clean"] = df["text"].apply(clean_text)
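
To make the three cleaning steps concrete, here is a minimal, self-contained sketch on a made-up sentence. It reuses the two regular expressions (without the stray `Â`) and swaps NLTK's stopword list for a tiny hypothetical set, `TOY_STOPWORDS`, so that it runs without any downloads:

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")
# Assumption: a tiny stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"a", "and", "the", "with", "is"}


def clean_text_demo(text):
    text = text.lower()
    # brackets, slashes, commas, ... become spaces
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    # everything else outside [0-9a-z #+_] is deleted, not spaced out
    text = BAD_SYMBOLS_RE.sub("", text)
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)


sample = "Semi-sweet finish (4.5/5), with a custard-colored head!"
cleaned = clean_text_demo(sample)
print(cleaned)  # semisweet finish 45 5 custardcolored head
```

Because hyphens and decimal points are deleted rather than replaced by spaces, hyphenated words fuse and `4.5` becomes `45`. The same effect produces tokens such as `semisweet` and `orangishamber` in the cleaned reviews.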


In [14]:
def print_description(index):
    """Print the cleaned review text and beer name stored at row label `index`."""
    example = df.loc[index, ["desc_clean", "name"]]
    print(example["desc_clean"])
    print("Name:", example["name"])


print_description(0)

feel though canquest tm gaining traction recent visit pinocchios netted abita cans anxious give try remember one industry professional telling ambers dead gar hope ambers take many forms knowing difficult work lager yeast duly impressed red ale got rocky finger tawny head pour hung around long could humidity evening color beautiful orangishamber nequality clarity nose malty also clean mouthfeel full nearly thick malty sweetness reminded hot cereal oatmeal cream wheat finish semisweet sweet cloying means pleasant would nice beer cooler hot day visit beach bottle also review original canquest tm takes precedence today good friday f 25 march 2016 bottle abita amber munich style lager brewed pale caramel malts german perle hops smooth malty caramel flavor rich amber color one first popular brews amber pairs well many foods popped cap gave semiagro pour try might could raise much way head even inglass swirl failed create anything mere skim wisps atop beer color deep amber light copper coppe

In [15]:
print_description(200)

time another good idea bad idea good idea drinking reviewing awesome beers ba bad idea existence beer well time time another jareds bad beer review session st pauli girl remember foul thing ive ever put mouth years ago first time ive ever touched thing almost wanted skunk fun bother bad enough im sure even date best 03 2012 cant bad cap removed going back pour yields pasty white 15 finger fizzy creamy semi thick bubbly head reminiscent albino sperm fades actual sperm like lacing beer clear boston water less brown well golden colored like fake tan nose well quite bad remember actually im little turned throwing breath actually somewhat breathable get faint cracked straw pilsner like malts breath light juiciness almost one point made actual beer hint skunked lager much way horridness actually granted pull right nonlight penetrating case fridge directly fridge little light exposure light bavarian malt aroma really seals deal actual beer even hops wait kidding hops harmed making beer taste 

In [16]:
df.set_index("name", inplace=True)

In [35]:
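# Unigram TF-IDF over the cleaned reviews. ngram_range=(1, 1) and min_df=1 are
# the explicit "unigrams only, keep every term" settings; recent scikit-learn
# releases reject an integer min_df of 0.
tf = TfidfVectorizer(
    analyzer="word", ngram_range=(1, 1), min_df=1, stop_words="english"
)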
tfidf_matrix = tf.fit_transform(df["desc_clean"])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
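
`linear_kernel` computes plain dot products, yet the result is named `cosine_similarities`. That works because `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so the dot product of two rows already is their cosine similarity. A minimal check on a toy corpus (the three reviews below are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus, not taken from the dataset
toy_reviews = [
    "hazy ipa juicy citrus hops",
    "juicy citrus ipa tropical hops",
    "dark roasted malt coffee stout",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_reviews)

# Every row has unit length because norm="l2" is the default
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)))

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(row_norms, 1.0))         # rows are unit vectors
print(np.allclose(dot_products, cosines))  # dot product == cosine similarity
print(dot_products.round(3))
```

The two IPA reviews share most of their terms and score high against each other, while the stout review shares no terms with the first review and scores zero.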

In [36]:
indices = pd.Series(df.index)

In [37]:
indices[:50]

0                                                 Amber
1                                              Turbodog
2                                           Purple Haze
3                                            Dubbel Ale
4                                     Widmer Hefeweizen
5                             Mackeson Triple XXX Stout
6                                        Trois Pistoles
7                                    Blanche De Chambly
8                                               Maudite
9                                       La Fin Du Monde
10                                   Traquair House Ale
11                                           Alpha King
12                                            Grand Cru
13                                                White
14                                        Anchor Porter
15                                    Anchor Steam Beer
16                                            Budweiser
17                       Young's Double Chocolat

In [63]:
def recommendations(name, cosine_similarities=cosine_similarities):
    """Return the 10 beers most similar to `name` and their similarity scores."""
    # position of the first row whose beer name matches
    idx = indices[indices == name].index[0]

    # similarity scores against every row, in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # skip position 0, which is normally the queried beer itself
    top_10 = score_series.iloc[1:11]

    # map row positions back to beer names
    recommended_beers = [df.index[i] for i in top_10.index]

    return recommended_beers, top_10
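
Slicing with `iloc[1:11]` drops whichever row sorts first, on the assumption that it is the queried beer. If another row ties at the top (a duplicated review, for example), the queried beer can remain in the list. One possible alternative, sketched here with NumPy on a made-up similarity matrix, masks the query row explicitly before ranking. `top_k_similar` and `toy_sim` are hypothetical names, not part of the notebook:

```python
import numpy as np


def top_k_similar(similarity_matrix, idx, k=10):
    """Return (positions, scores) of the k rows most similar to row idx, excluding idx."""
    scores = np.asarray(similarity_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf  # exclude the query row itself
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores[top]


# Rows 0 and 3 are identical, so they tie at similarity 1.0
toy_sim = np.array(
    [
        [1.0, 0.8, 0.1, 1.0],
        [0.8, 1.0, 0.3, 0.8],
        [0.1, 0.3, 1.0, 0.1],
        [1.0, 0.8, 0.1, 1.0],
    ]
)

top_idx, top_scores = top_k_similar(toy_sim, idx=3, k=2)
print(top_idx, top_scores)
```

Here row 3 is never returned for its own query, even though row 0 ties with it at 1.0.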

In [64]:
recommendations("Hazy Little Thing IPA")

(['Sierra Nevada Pale Ale',
  'Sierra Nevada Estate Brewers Harvest Ale',
  'Sierra Nevada Wheat Beer',
  'Best Of Beer Camp: California Common - Beer Camp #8',
  'Torpedo Extra IPA',
  'Flipside Red IPA',
  'Weizenbock - Beer Camp #37 (Best Of Beer Camp)',
  "30th Anniversary - Fritz And Ken's Ale",
  'Beer Camp: Tropical IPA (2016)',
  'Estate Homegrown Wet Hop Ale'],
 98      0.309395
 1355    0.223830
 101     0.211645
 1504    0.198116
 1014    0.196591
 1612    0.190541
 1505    0.183568
 1410    0.179690
 1677    0.166768
 1457    0.157394
 dtype: float64)

In [65]:
recommendations("Samuel Adams Summer Ale")

(['Samuel Adams White Ale',
  'Pyramid Hefeweizen',
  'Samuel Adams Winter Lager',
  'Samuel Adams Octoberfest',
  'Blue Moon Honey Moon Summer Ale',
  'Pyramid Apricot Ale',
  'Winter Welcome Ale',
  'Redhook Winterhook',
  'Spaten Oktoberfestbier Ur-Märzen',
  'Winter Solstice Seasonal Ale'],
 733     0.694529
 320     0.659415
 31      0.633963
 32      0.556539
 1119    0.547519
 952     0.540080
 153     0.532500
 797     0.525485
 155     0.521907
 489     0.513892
 dtype: float64)

In [66]:
recommendations("Corona Extra")

(['Corona Light',
  'Northern Hemisphere Harvest Wet Hop IPA',
  'Morimoto Imperial Pilsner',
  'Red Stripe Jamaican Lager',
  'Harvest Ale',
  'Rugged Trail Nut Brown Ale',
  'Landshark Lager',
  'Rolling Rock Extra Pale',
  'Imperial Stout',
  'Hop Trip Harvest Ale (Fresh Hop Ale)'],
 83      0.326586
 693     0.117405
 781     0.115421
 194     0.114012
 879     0.113377
 604     0.111596
 1082    0.111556
 146     0.103927
 120     0.102173
 966     0.101908
 dtype: float64)

In [68]:
recommendations("Shiner Bock")

(['Uff-da',
  'Samuel Adams Chocolate Bock',
  'Hop Sun (Summer Wheat Beer)',
  'Flipside Red IPA',
  '400 Pound Monkey',
  "Raison D'Être",
  'Lager of The Lakes',
  'Sisyphus',
  'Staghorn Octoberfest',
  'Accumulation'],
 543     0.529969
 799     0.348127
 840     0.335035
 1612    0.329352
 1238    0.323763
 312     0.320974
 1055    0.320538
 716     0.302118
 673     0.289096
 1614    0.281792
 dtype: float64)

In [69]:
recommendations("Salvator Doppel Bock")

(['Sierra Nevada Tumbler Autumn Brown Ale',
  'Ranger',
  'Delirium Nocturnum',
  "Dale's Pale Ale",
  'Matilda',
  'Double Bag',
  'Fat Tire Belgian Style Ale',
  'Fire Rock Pale Ale',
  'Bass Pale Ale',
  'Vienna Lager'],
 1442    0.295929
 1389    0.287635
 359     0.282161
 691     0.262696
 636     0.260952
 97      0.257077
 161     0.245514
 668     0.242698
 236     0.238420
 1270    0.236661
 dtype: float64)

- - -
## Try the model with the CSV file of beers that have at least 500 reviews

In [2]:
df2 = pd.read_csv("../data/csv/reviews500_NLTK.csv", encoding="latin-1")

In [3]:
print("We have", len(df2), "reviews in the data")
print("We have", df2["beer_id"].nunique(), "unique beers in the data")

We have 958712 reviews in the data
We have 916 unique beers in the data


In [4]:
def print_description(index):
    """Print the review text and beer name stored at row label `index`."""
    example = df2.loc[index, ["text", "name"]]
    print(example["text"])
    print("Name:", example["name"])

In [5]:
print_description(10)

Hazy NE IPA golden color. Foamy head that quickly disappears. Juicy explosions at first pour. Smooth taste, little after taste, not too hoppy. Good mouth feel, some creaminess yet too much lingering feel. Overall great offering from TH, easy to drink everyday.
Name: Haze


In [6]:
print_description(200)

Straight pour from a 16oz can to an oversized wineglass (TH stemware, for those curious). There's a canning date of May 2, 2016 printed in black ink on the underside of the can, which reads "KEEP COLD 05 02 16 SUHH DUDE" - which would make this five days old at the time of consumption. This was purchased at the brewery on the 4th, and it immediately went into the fridge so these are essentially ideal tasting circumstances. Appearance (4.5): Two very full fingers of creamy, custard-colored foam rise happily off of the pour, capping a cloudy, orange and honey-colored body. The head dies down relatively slowly, leaving scattered suds near the peak of the head, a half-finger collar, and a diminutive clump of foam near the center. As the body goes down, it leaves a solid sheath of foam on the far side of the glass, with peaks and valleys in its profile. Smell (4.5): Positively pungent, nectar-like citrus dominates the nose: musky papaya flesh, sticky mango juice, and passion fruit. It

### Beer Review Text Length Distribution

In [7]:
df2["word_count"] = df2["text"].apply(lambda x: len(str(x).split()))

In [8]:
desc_lengths = list(df2["word_count"])

print(
    "Number of descriptions:",
    len(desc_lengths),
    "\nAverage word count",
    np.average(desc_lengths),
    "\nMinimum word count",
    min(desc_lengths),
    "\nMaximum word count",
    max(desc_lengths),
)

Number of descriptions: 958712 
Average word count 114.83544901910062 
Minimum word count 1 
Maximum word count 973


### Preprocessing review description text 


In [9]:
import nltk

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sheetalbongale/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
df2.dtypes

beer_id         int64
username       object
text           object
score         float64
name           object
style          object
word_count      int64
dtype: object

In [11]:
df2["text"] = df2["text"].astype(str)

In [12]:
# Raw string: avoids invalid-escape warnings for \[ \] and \|
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile("[^0-9a-z #+_Â]")
STOPWORDS = set(stopwords.words("english"))


def clean_text(text):
    """
    text: a string

    return: the lowercased string with punctuation and stopwords removed
    """
    text = text.lower()
    # replace separator-like punctuation with a space
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    # delete every remaining character that is not in the allowed set
    text = BAD_SYMBOLS_RE.sub("", text)
    # remove stopwords from the text
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text


df2["desc_clean"] = df2["text"].apply(clean_text)


In [13]:
def print_description(index):
    """Print the cleaned review text and beer name stored at row label `index`."""
    example = df2.loc[index, ["desc_clean", "name"]]
    print(example["desc_clean"])
    print("Name:", example["name"])


print_description(0)

0 16 oz funny story finally walked doors 45 min wait line freezing temps sweet sound grateful deads sugar magnolia greeted treehouse sound system bottom reads going wind goes bloomin like red rose white haze yellow golden liquid thick healthy totally unfiltered brawny white foam cap thick allwhite clumps huge lacing left aroma zesty citrus hop effect mellon mango grainy earthiness tropical fruit blend bitter sweet effect peppery kick end aromatic flavor bursting complex hops zesty earthy tones sweet orange peppery malt clean fresh feel overall vibe crispy bite wakes full lush mouthfeel follows totally unfiltered expereince feel flavor finishes fun earthy zesty dry bite tropical juicy zesty citrus zippy golden wheat malt melons rustic earthiness sums taste pretty well levels complexity deep interesting ride sure overall one stands somewhere near top new england ipas
Name: Haze


In [14]:
print_description(200)

straight pour 16oz oversized wineglass th stemware curious theres canning date may 2 2016 printed black ink underside reads keep cold 05 02 16 suhh dudewhich would make five days old time consumption purchased brewery 4th immediately went fridge essentially ideal tasting circumstances appearance 45 two full fingers creamy custardcolored foam rise happily pour capping cloudy orange honeycolored body head dies relatively slowly leaving scattered suds near peak head halffinger collar diminutive clump foam near center body goes leaves solid sheath foam far side glass peaks valleys profile smell 45 positively pungent nectarlike citrus dominates nose musky papaya flesh sticky mango juice passion fruit blends smoothly sugar cookielike sweetness country bread earthy tones paired slightly grassy aroma dank weedy note brings rear warms shifts toward passion fruit sauvignon blanclike gooseberry notes well much stronger musk seriously excellent stuff taste 45 shit bonkers im entirely sure whats go

In [15]:
df2.set_index("name", inplace=True)

In [None]:
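# Bigram/trigram TF-IDF. Caution: df2 has one row per review, so the dense
# linear_kernel result below would be 958,712 x 958,712 (roughly 7 TB as
# float64). min_df=1 keeps every term; recent scikit-learn releases reject an
# integer min_df of 0.
tf2 = TfidfVectorizer(
    analyzer="word", ngram_range=(2, 3), min_df=1, stop_words="english"
)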
tfidf_matrix2 = tf2.fit_transform(df2["desc_clean"])
cosine_similarities = linear_kernel(tfidf_matrix2, tfidf_matrix2)
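
With 958,712 reviews but only 916 unique beers, a review-by-review similarity matrix is far larger than a beer-by-beer one. One possible way around this, which is an assumption of this sketch and not what the notebook does, is to join each beer's cleaned reviews into a single document before vectorizing. The `toy_reviews` frame below is a hypothetical stand-in for `df2` before `set_index`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data: several cleaned reviews per beer
toy_reviews = pd.DataFrame(
    {
        "name": ["Haze", "Haze", "Julius", "Julius", "Porter"],
        "desc_clean": [
            "hazy juicy citrus ipa",
            "tropical juicy hops ipa",
            "juicy orange citrus ipa",
            "tropical mango hops",
            "dark roasted coffee chocolate malt",
        ],
    }
)

# One document per beer: concatenate all of its reviews
beer_docs = toy_reviews.groupby("name")["desc_clean"].apply(" ".join)

# The similarity matrix is now (number of beers) x (number of beers)
beer_tfidf = TfidfVectorizer().fit_transform(beer_docs)
beer_sim = pd.DataFrame(
    linear_kernel(beer_tfidf, beer_tfidf),
    index=beer_docs.index,
    columns=beer_docs.index,
)

# Rank the other beers by similarity to "Haze"
nearest_to_haze = beer_sim["Haze"].drop("Haze").sort_values(ascending=False)
print(nearest_to_haze)
```

Applied to the full file, this would give a 916 x 916 matrix, and a lookup by beer name would return other beers and never another review of the same beer.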

In [None]:
"""
def ChunkIterator():
    for chunk in pd.read_csv("../data/csv/reviews_tfidf.csv", chunksize=1000):
        for doc in chunk["desc_clean"].values:
            yield doc


corpus = ChunkIterator()
tfidf = TfidfVectorizer()
tfidf_matrix2 = tfidf.fit_transform(corpus)
cosine_similarities = linear_kernel(tfidf_matrix2, tfidf_matrix2)
"""

In [None]:
indices = pd.Series(df2.index)

In [None]:
indices[:50]

In [None]:
def recommendations(name, cosine_similarities=cosine_similarities):
    """Print the top-10 similarity scores for `name` and return the matching beer names."""
    # position of the first row whose beer name matches
    idx = indices[indices == name].index[0]

    # similarity scores against every row, in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # skip position 0, which is normally the queried row itself
    top_10 = score_series.iloc[1:11]
    print(top_10)

    # map row positions back to beer names
    return [df2.index[i] for i in top_10.index]

In [124]:
recommendations("Hazy Little Thing IPA")

98      0.046232
1355    0.023741
1014    0.019565
1643    0.019498
101     0.018839
1504    0.016843
1410    0.015325
99      0.013884
402     0.013102
1612    0.011933
dtype: float64


['Sierra Nevada Pale Ale',
 'Sierra Nevada Estate Brewers Harvest Ale',
 'Torpedo Extra IPA',
 'Second Fiddle',
 'Sierra Nevada Wheat Beer',
 'Best Of Beer Camp: California Common - Beer Camp #8',
 "30th Anniversary - Fritz And Ken's Ale",
 'Porter',
 'Sierra Nevada India Pale Ale',
 'Flipside Red IPA']

In [125]:
recommendations("Samuel Adams Summer Ale")

320     0.235412
733     0.224747
1119    0.200300
952     0.190512
32      0.187793
30      0.168832
797     0.164552
155     0.160932
31      0.158118
298     0.144466
dtype: float64


['Pyramid Hefeweizen',
 'Samuel Adams White Ale',
 'Blue Moon Honey Moon Summer Ale',
 'Pyramid Apricot Ale',
 'Samuel Adams Octoberfest',
 'Blue Moon Harvest Moon Pumpkin Ale',
 'Redhook Winterhook',
 'Spaten Oktoberfestbier Ur-Märzen',
 'Samuel Adams Winter Lager',
 'Redhook ESB']

In [126]:
recommendations("Corona Extra")

83      0.059339
781     0.008774
235     0.007242
604     0.007043
693     0.006685
1117    0.006327
211     0.005964
1390    0.005585
533     0.005194
399     0.005079
dtype: float64


['Corona Light',
 'Morimoto Imperial Pilsner',
 "Mickey's",
 'Rugged Trail Nut Brown Ale',
 'Northern Hemisphere Harvest Wet Hop IPA',
 'Wailua Wheat',
 "He'Brew Genesis Ale (Old Version)",
 'Sierra Nevada Glissade Golden Bock',
 'Maredsous 8 - Brune',
 'Winter Ale']