# Beer Data Analysis

## Assignment

In this assignment you will work with a beer data set. Please provide an answer to the questions below. Answer as many questions as possible:

1. Rank the top 3 breweries which produce the strongest beers.
2. Which year did beers enjoy the highest ratings?
3. Based on the users' ratings, which factors are important among taste, aroma, appearance, and palette?
4. If you were to recommend 3 beers to your friends based on this data, which ones would you recommend?
5. Which beer style seems to be the favourite based on the reviews written by users? How do written reviews compare to the overall review score for the beer style?

In [126]:
import numpy as np
import pandas as pd

In [127]:
data = pd.read_csv('/Users/rosiebai/Downloads/datasets-6/BeerDataScienceProject.tar.bz2', compression="bz2")

In [128]:
data.head()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_palette,review_overall,review_taste,review_profileName,review_aroma,review_text,review_time
0,5.0,47986,10325,Sausa Weizen,Hefeweizen,2.5,2.0,1.5,1.5,stcules,1.5,A lot of foam. But a lot. In the smell some ba...,1234817823
1,6.2,48213,10325,Red Moon,English Strong Ale,3.0,2.5,3.0,3.0,stcules,3.0,"Dark red color, light beige foam, average. In ...",1235915097
2,6.5,48215,10325,Black Horse Black Beer,Foreign / Export Stout,3.0,2.5,3.0,3.0,stcules,3.0,"Almost totally black. Beige foam, quite compac...",1235916604
3,5.0,47969,10325,Sausa Pils,German Pilsener,3.5,3.0,3.0,2.5,stcules,3.0,"Golden yellow color. White, compact foam, quit...",1234725145
4,7.7,64883,1075,Cauldron DIPA,American Double / Imperial IPA,4.0,4.5,4.0,4.0,johnmichaelsen,4.5,"According to the website, the style for the Ca...",1293735206


In [129]:
data.dtypes

beer_ABV              float64
beer_beerId             int64
beer_brewerId           int64
beer_name              object
beer_style             object
review_appearance     float64
review_palette        float64
review_overall        float64
review_taste          float64
review_profileName     object
review_aroma          float64
review_text            object
review_time             int64
dtype: object

## 1. Rank the top 3 breweries which produce the strongest beers.


In [130]:
avg_taste_by_brewer = data.groupby('beer_brewerId')['review_taste'].mean().reset_index(name = 'avg_review_taste').sort_values(by = 'avg_review_taste', ascending=False)
avg_taste_by_brewer['rank'] = avg_taste_by_brewer['avg_review_taste'].rank(method = 'dense',ascending = False)
avg_taste_by_brewer[avg_taste_by_brewer['rank'] <= 3]

Unnamed: 0,beer_brewerId,avg_review_taste,rank
1245,15524,5.0,1.0
772,6225,5.0,1.0
1437,19161,5.0,1.0
71,387,5.0,1.0
563,3700,5.0,1.0
266,1175,4.6,2.0
1794,27645,4.5,3.0
994,10393,4.5,3.0
555,3625,4.5,3.0
1438,19208,4.5,3.0
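
The cell above ranks breweries by mean `review_taste`, which measures how well a beer tastes rather than how strong it is. A minimal sketch of a strength-based ranking follows, assuming "strongest" means the highest mean `beer_ABV` per brewery. The toy frame, the `toy`, `beers`, and `top3` names, and the choice to count each beer once are illustrative assumptions, not the notebook's results:

```python
import pandas as pd

# Toy stand-in for the beer data: one row per review (values are made up)
toy = pd.DataFrame({
    'beer_brewerId': [1, 1, 1, 2, 2, 3, 4],
    'beer_beerId':   [10, 10, 11, 20, 21, 30, 40],
    'beer_ABV':      [5.0, 5.0, 7.0, 12.0, 10.0, None, 9.0],
})

# Drop rows without ABV and count each beer once, so heavily
# reviewed beers do not dominate a brewery's average
beers = toy.dropna(subset=['beer_ABV']).drop_duplicates('beer_beerId')

# Mean ABV per brewery, strongest first, keep the top 3
top3 = (
    beers.groupby('beer_brewerId')['beer_ABV']
    .mean()
    .sort_values(ascending=False)
    .head(3)
)
print(top3)
```

On the real data, the same pattern would apply to `data` directly; whether to average over beers or over reviews is a judgment call.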


## 2. Which year did beers enjoy the highest ratings?


In [131]:
data['review_time'] = pd.to_datetime(data['review_time'], errors='coerce')
#data['review_time']= data.review_time.dt.date
data['review_year'] = data['review_time'].dt.year
data.groupby('review_year')['review_overall'].mean().reset_index(name = 'avg_rating').sort_values(by = 'avg_rating', ascending = False)

Unnamed: 0,review_year,avg_rating
0,1970,3.833197


In [132]:
data['review_year'].describe()

count    528870.0
mean       1970.0
std           0.0
min        1970.0
25%        1970.0
50%        1970.0
75%        1970.0
max        1970.0
Name: review_year, dtype: float64

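Every review falls in 1970, which points to a parsing problem rather than a property of the data. `review_time` holds integers such as 1234817823, which look like Unix timestamps in seconds, but `pd.to_datetime` treats integers as nanoseconds by default, so every value lands within the first two seconds of 1970-01-01. I cannot answer this question from the output above; the timestamps would first need to be re-parsed with `unit='s'`.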
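
A minimal sketch of the unit issue, using the five `review_time` values from the `data.head()` output and assuming they are Unix timestamps in seconds (the `ts`, `as_ns`, and `as_s` names are illustrative):

```python
import pandas as pd

# review_time values copied from the data.head() output
ts = pd.Series([1234817823, 1235915097, 1235916604, 1234725145, 1293735206])

as_ns = pd.to_datetime(ts)            # integers read as nanoseconds (default)
as_s = pd.to_datetime(ts, unit='s')   # integers read as seconds

print(as_ns.dt.year.tolist())
print(as_s.dt.year.tolist())
```

With the column parsed this way, the existing `groupby('review_year')` step should yield one average per year; years with very few reviews may deserve a minimum-count filter before picking a winner.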

## 3. Based on the users' ratings, which factors are important among taste, aroma, appearance, and palette?


In [133]:
data.columns

Index(['beer_ABV', 'beer_beerId', 'beer_brewerId', 'beer_name', 'beer_style',
       'review_appearance', 'review_palette', 'review_overall', 'review_taste',
       'review_profileName', 'review_aroma', 'review_text', 'review_time',
       'review_year'],
      dtype='object')

In [134]:
data.isna().sum()

beer_ABV              20280
beer_beerId               0
beer_brewerId             0
beer_name                 0
beer_style                0
review_appearance         0
review_palette            0
review_overall            0
review_taste              0
review_profileName      115
review_aroma              0
review_text             119
review_time               0
review_year               0
dtype: int64

In [135]:
from sklearn.linear_model import LinearRegression
X = data[['review_appearance','review_palette', 'review_taste', 'review_aroma']]
y = data['review_overall']
model = LinearRegression()
model.fit(X,y)
# view importance
importance = pd.Series(model.coef_, index = X.columns)
print(importance)

review_appearance    0.033035
review_palette       0.044925
review_taste         0.259415
review_aroma         0.553102
dtype: float64


Beer aroma turned out to be the most important factor (coefficient 0.55), followed by taste (0.26); palette (0.04) and appearance (0.03) contribute very little. This makes sense: we usually smell a drink before we taste it, so a beer that smells very good may increase our satisfaction with it.
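
Raw regression coefficients are easiest to compare when the predictors share a scale and are not strongly correlated with each other. One way to check the ranking is to standardize the inputs before fitting. The following is a minimal sketch on synthetic data, using NumPy least squares rather than the notebook's `LinearRegression`; the data, the true weights, and all variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic sub-scores: "taste" is built to be correlated with "aroma"
aroma = rng.normal(3.8, 0.7, n)
taste = 0.6 * aroma + rng.normal(1.5, 0.5, n)
appearance = rng.normal(3.8, 0.6, n)
X = np.column_stack([appearance, taste, aroma])

# Synthetic overall score with known weights plus noise
y = 0.1 * appearance + 0.5 * taste + 0.3 * aroma + rng.normal(0, 0.3, n)

# z-score predictors and target so coefficients are in standard-deviation units
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

# Centered data needs no intercept column
beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
for name, b in zip(['appearance', 'taste', 'aroma'], beta):
    print(f"{name:<11}{b:.3f}")
```

If the standardized ranking on the real data matches the raw one, the conclusion above is on firmer ground.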

## 4. If you were to recommend 3 beers to your friends based on this data, which ones would you recommend?


In [136]:
data.beer_style.unique()

array(['Hefeweizen', 'English Strong Ale', 'Foreign / Export Stout',
       'German Pilsener', 'American Double / Imperial IPA',
       'Herbed / Spiced Beer', 'Oatmeal Stout', 'American Pale Lager',
       'Rauchbier', 'American Pale Ale (APA)', 'American Porter',
       'Belgian Strong Dark Ale', 'American Stout',
       'Russian Imperial Stout', 'American Amber / Red Ale',
       'American Strong Ale', 'Märzen / Oktoberfest',
       'American Adjunct Lager', 'American Blonde Ale', 'American IPA',
       'Fruit / Vegetable Beer', 'American Double / Imperial Stout',
       'English Bitter', 'English Porter', 'Irish Dry Stout',
       'American Barleywine', 'Belgian Strong Pale Ale', 'Doppelbock',
       'Maibock / Helles Bock', 'Light Lager', 'Pumpkin Ale',
       'Dortmunder / Export Lager', 'Euro Strong Lager',
       'Euro Dark Lager', 'Low Alcohol Beer', 'Euro Pale Lager', 'Bock',
       'English India Pale Ale (IPA)', 'Altbier', 'Kölsch',
       'Munich Dunkel Lager', 'Rye Beer',

In [137]:
rating_cols = ['review_overall', 'review_aroma', 'review_taste',
               'review_palette', 'review_appearance']
# Mean score per beer, best overall first (ties broken by aroma, then taste)
avg_rating_aroma_taste = (
    data.groupby(['beer_name', 'beer_style'])[rating_cols]
    .mean()
    .reset_index()
    .sort_values(by=['review_overall', 'review_aroma', 'review_taste'], ascending=False)
)
avg_rating_aroma_taste['rank'] = avg_rating_aroma_taste['review_overall'].rank(method='min', ascending=False)
# Keep top-ranked lagers with decent aroma, taste, and appearance
is_lager = avg_rating_aroma_taste['beer_style'].str.contains('lager', case=False, na=False)
staff_pick = avg_rating_aroma_taste[
    (avg_rating_aroma_taste['rank'] == 1)
    & is_lager
    & (avg_rating_aroma_taste['review_aroma'] >= 3)
    & (avg_rating_aroma_taste['review_taste'] >= 3)
    & (avg_rating_aroma_taste['review_appearance'] > 3)
]
staff_pick

Unnamed: 0,beer_name,beer_style,review_overall,review_aroma,review_taste,review_palette,review_appearance,rank
9997,Lemon Light,Light Lager,5.0,5.0,4.5,4.0,4.0,1.0
11504,Neustadt 180,American Pale Lager,5.0,4.5,5.0,4.0,3.5,1.0
5252,Draught Lager,American Adjunct Lager,5.0,4.5,4.5,4.5,3.5,1.0
12046,Ohota Extra Light,Light Lager,5.0,4.5,4.5,4.5,4.0,1.0
12939,Pioneer American Lager,American Amber / Red Lager,5.0,4.0,5.0,4.0,4.5,1.0
14905,Shakedown Nut Brown,Munich Helles Lager,5.0,4.0,4.5,4.5,4.0,1.0
16431,Tado Dunkel,Munich Dunkel Lager,5.0,4.0,4.0,3.5,4.0,1.0
16432,Tado Helles,Munich Helles Lager,5.0,4.0,3.5,4.0,3.5,1.0
733,Amber Classic,Euro Dark Lager,5.0,3.0,4.0,3.0,3.5,1.0
16111,Sub Zero Lager,American Pale Lager,5.0,3.0,4.0,3.0,4.0,1.0


In [138]:
len(staff_pick)

10

First of all, I love lagers. To recommend beers to my friends, I would pick lagers with the highest overall rating, a strong aroma, a good (maybe fruity) taste that is not too strong, and a good appearance. I don't care much about the color. Hence my picks: Lemon Light, Shakedown Nut Brown, and Sub Zero Lager.
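
A perfect 5.0 average can come from a beer with only one or two reviews, so the mean alone is weak evidence. A minimal sketch that also requires a minimum number of reviews before ranking follows; the toy data and the `min_reviews` threshold are assumptions for illustration:

```python
import pandas as pd

# Toy reviews: beer A has a single perfect score, B and C have several
toy = pd.DataFrame({
    'beer_name': ['A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'review_overall': [5.0, 4.5, 4.0, 4.5, 4.0, 3.5, 4.0, 3.5],
})

min_reviews = 3  # hypothetical threshold; tune to the data

# Mean rating and number of reviews per beer
per_beer = toy.groupby('beer_name')['review_overall'].agg(['mean', 'count'])

# Rank only beers with enough reviews to trust the mean
reliable = per_beer[per_beer['count'] >= min_reviews]
top = reliable.sort_values('mean', ascending=False)
print(top)
```

Here beer A drops out despite its 5.0 mean, and B ranks ahead of C.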

## 5. Which beer style seems to be the favourite based on the reviews written by users? How do written reviews compare to the overall review score for the beer style?

In [139]:
data2 = data.dropna(subset = ['beer_style', 'review_text', 'review_overall'])

In [140]:
data2.isna().sum()

beer_ABV              20278
beer_beerId               0
beer_brewerId             0
beer_name                 0
beer_style                0
review_appearance         0
review_palette            0
review_overall            0
review_taste              0
review_profileName      115
review_aroma              0
review_text               0
review_time               0
review_year               0
dtype: int64

In [141]:
#import re
# without lemmatization
#def clean_text(text):
#    text = text.lower()
#    text = re.sub(r'[^a-z\s]','', text)
#    text = re.sub(r'\s+',' ', text)
#    return text.strip()
#data2['clean_review'] = data2['review_text'].apply(clean_text)

In [142]:
import re
import nltk 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
custom_stopwords = {'beer','like','taste','drink','flavor'}
stop_words.update(custom_stopwords)

lemmatizer = WordNetLemmatizer()
def clean_text2(text):
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]','', text)
    tokens = word_tokenize(text)
    return ' '.join(
        lemmatizer.lemmatize(word)
        for word in tokens
        if word not in stop_words
    )
data2['clean_review'] = data2['review_text'].apply(clean_text2)

[nltk_data] Downloading package punkt to /Users/rosiebai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rosiebai/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rosiebai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data2['clean_review'] = data2['review_text'].apply(clean_text2)
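
The `SettingWithCopyWarning` above appears because pandas flags `data2` as possibly derived from `data`, so it cannot tell whether the column assignment is meant to affect the original. One common way to avoid it is to take an explicit copy after `dropna`. A minimal sketch, with a toy frame and `str.lower` standing in for `clean_text2`:

```python
import pandas as pd

toy = pd.DataFrame({
    'beer_style': ['Hefeweizen', 'Bock', None],
    'review_text': ['A lot of foam.', None, 'Nice.'],
})

# .copy() makes the filtered frame independent of the original
toy2 = toy.dropna(subset=['beer_style', 'review_text']).copy()

# Adding a column now touches only toy2
toy2['clean_review'] = toy2['review_text'].str.lower()
print(toy2)
```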


In [143]:
data2.head()

Unnamed: 0,beer_ABV,beer_beerId,beer_brewerId,beer_name,beer_style,review_appearance,review_palette,review_overall,review_taste,review_profileName,review_aroma,review_text,review_time,review_year,clean_review
0,5.0,47986,10325,Sausa Weizen,Hefeweizen,2.5,2.0,1.5,1.5,stcules,1.5,A lot of foam. But a lot. In the smell some ba...,1970-01-01 00:00:01.234817823,1970,lot foam lot smell banana lactic tart good sta...
1,6.2,48213,10325,Red Moon,English Strong Ale,3.0,2.5,3.0,3.0,stcules,3.0,"Dark red color, light beige foam, average. In ...",1970-01-01 00:00:01.235915097,1970,dark red color light beige foam average smell ...
2,6.5,48215,10325,Black Horse Black Beer,Foreign / Export Stout,3.0,2.5,3.0,3.0,stcules,3.0,"Almost totally black. Beige foam, quite compac...",1970-01-01 00:00:01.235916604,1970,almost totally black beige foam quite compact ...
3,5.0,47969,10325,Sausa Pils,German Pilsener,3.5,3.0,3.0,2.5,stcules,3.0,"Golden yellow color. White, compact foam, quit...",1970-01-01 00:00:01.234725145,1970,golden yellow color white compact foam quite c...
4,7.7,64883,1075,Cauldron DIPA,American Double / Imperial IPA,4.0,4.5,4.0,4.0,johnmichaelsen,4.5,"According to the website, the style for the Ca...",1970-01-01 00:00:01.293735206,1970,according website style caldera cauldron chang...


In [144]:
# sentiment analysis on text 
from nltk.sentiment import SentimentIntensityAnalyzer 
import nltk 
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer() 
data2['sentiment_score'] = data2['clean_review'].apply(lambda x: sia.polarity_scores(x)['compound'])

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/rosiebai/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data2['sentiment_score'] = data2['clean_review'].apply(lambda x: sia.polarity_scores(x)['compound'])
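
One caveat about scoring `clean_review`: NLTK's English stop-word list includes negations such as "not" and "no", and the regex strips punctuation, both of which VADER uses as cues. The sketch below uses a small hand-written subset of that stop-word list (an assumption, not the full list) and omits lemmatization and the custom stop words. It shows how cleaning can make a negative sentence indistinguishable from a positive one:

```python
import re

# Hand-picked subset of NLTK's English stop words, for illustration only
stop_subset = {'this', 'is', 'not', 'a', 'at', 'all', 'no'}

def clean(text):
    """Lowercase, drop non-letters, and remove stop words (no lemmatization)."""
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_subset)

negative = clean('This is not a good beer at all!')
positive = clean('This is a good beer!')
print(negative)
print(positive)
```

Scoring the raw `review_text` and comparing the result with the cleaned-text scores may be a worthwhile check.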


In [145]:
# aggregate the sentiment scores by beer style
agg_df = data2.groupby('beer_style').agg({
    'review_overall':'mean',
    'sentiment_score':'mean',
    'review_text':'count'
}).rename(columns = {
    'review_overall':'avg_rating',
    'sentiment_score':'avg_sentiment',
    'review_text':'review_count'
}).sort_values('avg_sentiment',ascending=False)
agg_df

beer_style,avg_rating,avg_sentiment,review_count
Eisbock,4.079487,0.876705,195
Quadrupel (Quad),4.049159,0.872963,4933
Flanders Red Ale,3.961835,0.860493,2856
American Double / Imperial Stout,4.100441,0.857466,23352
Wheatwine,3.817059,0.857361,891
...,...,...,...
Happoshu,2.818182,0.556395,55
Japanese Rice Lager,3.032258,0.549719,496
American Malt Liquor,2.721986,0.523239,1410
Light Lager,2.913554,0.478990,4471


In [146]:
data2[['sentiment_score', 'review_overall']].corr()

Unnamed: 0,sentiment_score,review_overall
sentiment_score,1.0,0.346192
review_overall,0.346192,1.0


Eisbock looks like the favourite beer style among reviewers based on the sentiment score. However, review text sentiment and the overall rating are only weakly correlated at the level of individual reviews (r ≈ 0.35). A high average sentiment score for a style does not guarantee a high overall rating, but the two do tend to move in the same direction: the styles with the highest sentiment mostly have average ratings around 4, while those with the lowest sentiment have average ratings around 3 or below.
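
Eisbock has only 195 reviews, so its top spot is less certain than the ranking suggests. A minimal sketch that applies a minimum review count before picking a favourite follows. It uses only the rows visible in the `agg_df` output above (the elided middle rows are not included, so the rating result in particular is illustrative), and `min_reviews` is a hypothetical threshold:

```python
import pandas as pd

# Rows copied from the agg_df output above (visible head and tail only)
shown = pd.DataFrame(
    [
        ('Eisbock', 4.079487, 0.876705, 195),
        ('Quadrupel (Quad)', 4.049159, 0.872963, 4933),
        ('Flanders Red Ale', 3.961835, 0.860493, 2856),
        ('American Double / Imperial Stout', 4.100441, 0.857466, 23352),
        ('Wheatwine', 3.817059, 0.857361, 891),
        ('Happoshu', 2.818182, 0.556395, 55),
        ('Japanese Rice Lager', 3.032258, 0.549719, 496),
        ('American Malt Liquor', 2.721986, 0.523239, 1410),
        ('Light Lager', 2.913554, 0.478990, 4471),
    ],
    columns=['beer_style', 'avg_rating', 'avg_sentiment', 'review_count'],
).set_index('beer_style')

min_reviews = 1000  # hypothetical cut-off for a "well-reviewed" style

well_reviewed = shown[shown['review_count'] >= min_reviews]
by_sentiment = well_reviewed['avg_sentiment'].idxmax()
by_rating = well_reviewed['avg_rating'].idxmax()
print(by_sentiment)
print(by_rating)
```

Among these rows, the sentiment favourite and the rating favourite differ once small styles are excluded, which fits the weak review-level correlation.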

## Topic modeling for a given beer style

In [147]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Choose a specific style to analyze
ipa_df = data2[data2['beer_style'].str.contains('ipa', case=False, na=False)]

# Vectorize text
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(ipa_df['clean_review'])

# LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

# Show top words per topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    # argsort is ascending, so the highest-weight word is listed last
    top_words = [words[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i+1}: {' | '.join(top_words)}")


Topic 1: pine | good | really | great | ipa | citrus | head | malt | hop | nice
Topic 2: poured | color | citrus | malt | glass | nice | smell | good | head | hop
Topic 3: citrus | bitterness | alcohol | glass | head | orange | pine | grapefruit | malt | hop
Topic 4: carbonation | nice | citrus | bitterness | aroma | light | finish | head | malt | hop
Topic 5: ive | bottle | head | really | ale | smell | good | im | ipa | hop


In [148]:

# Choose a specific style to analyze 
# Filter rows where beer_style contains 'lager' (case-insensitive)
lager_df = data2[data2['beer_style'].str.contains('lager', case=False, na=False)]


# Vectorize text
vectorizer = CountVectorizer(max_df = 0.95, min_df = 5, stop_words = 'english')
doc_term_matrix = vectorizer.fit_transform(lager_df['clean_review'])

# LDA
lda = LatentDirichletAllocation(n_components = 5, random_state = 123)
lda.fit(doc_term_matrix)
# Show top words per topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    # argsort is ascending, so the highest-weight word is listed last
    top_words = [words[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i+1}: {' | '.join(top_words)}")


Topic 1: good | aroma | finish | amber | nice | sweet | caramel | head | hop | malt
Topic 2: bit | smell | corn | malt | white | hop | carbonation | sweet | light | head
Topic 3: really | head | beer | bad | good | better | macro | lager | smell | light
Topic 4: carbonation | color | white | nice | good | lager | malt | head | hop | light
Topic 5: great | im | really | brew | smell | time | beer | bottle | lager | good
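
The word "beer" still shows up in the lager topics even though it is in `custom_stopwords`. One likely cause is that `clean_text2` filters stop words before lemmatizing, so a plural such as "beers" passes the filter and is then reduced to "beer". A minimal sketch of the two orderings follows, using a toy dictionary lemmatizer as a stand-in for `WordNetLemmatizer` (the `toy_lemmas` mapping and the function names are hypothetical):

```python
stop_words = {'beer', 'taste', 'the', 'a'}
toy_lemmas = {'beers': 'beer', 'tastes': 'taste', 'hops': 'hop'}

def lemmatize(word):
    """Toy stand-in for WordNetLemmatizer().lemmatize."""
    return toy_lemmas.get(word, word)

def filter_then_lemmatize(tokens):
    # Order used in clean_text2: plurals slip past the stop-word check
    return [lemmatize(w) for w in tokens if w not in stop_words]

def lemmatize_then_filter(tokens):
    # Lemmatize first so "beers" is caught by the "beer" stop word
    lemmas = (lemmatize(w) for w in tokens)
    return [w for w in lemmas if w not in stop_words]

tokens = 'the beers taste of hops'.split()
print(filter_then_lemmatize(tokens))   # ['beer', 'of', 'hop']
print(lemmatize_then_filter(tokens))   # ['of', 'hop']
```

Swapping the order in `clean_text2` would change `clean_review` and therefore the topics, so the outputs above would need to be regenerated.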
