### Import the Necessary Libraries

In [None]:
#import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

import plotly.express as px
import statsmodels

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from nltk import sent_tokenize, word_tokenize
from nltk.probability import FreqDist

from bs4 import BeautifulSoup

from wordcloud import WordCloud, STOPWORDS

from tqdm import tqdm

import re
import os
import datetime
from collections import Counter

import pickle

import warnings
warnings.filterwarnings(action = 'ignore')

### Reading the Data from file

In [None]:
#https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c

#df = pd.read_csv('../input/beer-data-analytics/BeerProject.csv',encoding='latin-1') #this works too for Utf-8 encoding errors
df = pd.read_csv('../input/beer-data-analytics/BeerProject.csv',engine='python')

In [None]:
df.head(10)

### Save a copy of the original data.

In [None]:
df_original = df.copy()

### Exploratory Data Analysis (EDA)

In [None]:
# Feature Names

df.columns

In [None]:
#Shape of data

df.shape

**Observations:**
* Data set contains 528870 rows and 13 columns.

### Check the data type and count of each feature

In [None]:
df.info()

**Observations:**
* Out of 13 features, 4 are categorical / text based: ['beer_name', 'beer_style', 'review_profileName', 'review_text'].
* The remaining 9 features are numeric: ['beer_ABV', 'beer_beerId', 'beer_brewerId', 'review_appearance', 'review_palette', 'review_overall', 'review_taste', 'review_aroma', 'review_time'].

* The features can be summarized as follows:
<ol>
    <li>beer_ABV : Alcohol by volume content of a beer</li>
    <li>beer_beerId : Unique ID for beer identification</li>
    <li>beer_brewerId : Unique ID identifying the brewer</li>
    <li>beer_name : Name of the beer</li>
    <li>beer_style : Beer Category</li>
    <li>review_appearance: Rating based on how the beer looks [Range : 1-5]</li>
    <li>review_palette : Rating based on how the beer interacts with the palate [Range : 1-5]</li>
    <li>review_overall : Overall experience of the beer is combined in this rating [Range : 1-5]</li>
    <li>review_taste : Rating based on how the beer actually tastes [Range : 1-5]</li>
    <li>review_profileName: Reviewer’s profile name / user ID</li>
    <li>review_aroma : Rating based on how the beer smells [Range : 1-5]</li>
    <li>review_text : Review comments/observations in text format</li>
    <li>review_time : Time in UNIX format when review was recorded</li>
</ol>
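
A minimal sketch of how a UNIX-format `review_time` value maps to a calendar date, using made-up timestamps rather than the beer data. The `toy` frame and the `review_dt` column are illustrative names, and `utc=True` is an assumption chosen so the result does not depend on the machine's local timezone.

```python
import pandas as pd

# Toy UNIX timestamps (seconds since 1970-01-01 UTC); values are illustrative only.
toy = pd.DataFrame({"review_time": [0, 946684800, 1234567890]})

# unit="s" tells pandas the integers are seconds; utc=True makes the result timezone-aware.
toy["review_dt"] = pd.to_datetime(toy["review_time"], unit="s", utc=True)

# The .dt accessor exposes calendar fields such as the year.
toy["year"] = toy["review_dt"].dt.year

print(toy)
```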

### Summary Statistics of Numeric Features

In [None]:
df.describe()

**Observation :**
* The IQR (interquartile range, between the 25th and 75th percentiles) of the beer_ABV feature spans 5.3 to 8.5, with a mean of around 7.0. The maximum beer_ABV is around 57.7, far above the upper quartile, so outliers are present.
* The count for beer_ABV is lower than the number of rows, so null values exist for this feature.
* beer_brewerId is numeric, but it is an identifier for a brewery rather than a measured quantity.
* review_appearance, review_palette, review_taste, review_aroma and review_overall are the key indicators of the different aspects of a beer review. All values lie in the range 1-5, the IQR for these features is 3.5 to 4.5, and the values are spread around a mean of about 3.8.
* review_time is a numeric feature which records the UNIX time when the review was given.
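
The IQR reasoning above can be sketched on a small made-up series. The values in `abv` are illustrative, and the 1.5 × IQR fence is a common convention assumed here, not something the notebook itself applies.

```python
import pandas as pd

# Toy ABV values (illustrative only); the last one mimics an extreme maximum.
abv = pd.Series([4.5, 5.0, 5.5, 6.0, 7.0, 8.5, 9.0, 57.7])

# First and third quartiles, and the interquartile range between them.
q1, q3 = abv.quantile(0.25), abv.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; values outside them are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = abv[(abv < lower) | (abv > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(outliers)
```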

### Summary Statistics of Non-Numeric (Categorical) Features

In [None]:
df.describe(exclude=np.number)

**Observation :**
* There are 18339 unique varieties of beers presented in this dataset. Most common beer observed is 'Sierra Nevada Celebration Ale'.
* Most common beer_style is 'American IPA'.
* Missing values exist for the features 'review_profileName' and 'review_text'.

### Check for missing values across features

In [None]:
df.isna().sum()

In [None]:
(df.isna().sum()/len(df)) * 100

**Observation :**
* Missing values are present for 3 features - beer_ABV, review_profileName, review_text.
* Around 3.8% of the values for the feature [beer_ABV] are missing, whereas the missing values for the [review_profileName, review_text] features are minuscule, only around 0.02% of the total.
* Since missing data can reduce the statistical power and can produce biased estimates, leading to invalid conclusions, we will need to handle these missing values before building the model.

### Handling the missing values for 'review_profileName' feature

In [None]:

#since profile name (categorical feature) is merely the name of the person giving the review comments,
#we will replace the missing values with the mode of the feature i.e. northyorksammy

df['review_profileName'] = df['review_profileName'].fillna(df['review_profileName'].mode()[0])


In [None]:
df['review_profileName'].mode()[0]

In [None]:
df.review_profileName.isna().sum()

### Handling the missing values for 'review_text' feature

In [None]:

#since review text is the description of the user's specific comments about a particular beer,
#we will replace the missing values for the review text with the most common review text i.e. '#NAME?'.

df['review_text'] = df['review_text'].fillna(df['review_text'].mode()[0])

In [None]:
df['review_text'].mode()[0]

In [None]:
df.review_text.isna().sum()

### Handling the missing values for 'beer_ABV' feature

* 'beer_ABV' describes the alcohol content of a beer by volume.
* beer_ABV is tied to the beer name: a given beer name is expected to have a single ABV value. So we will look up the 'beer_ABV' by its 'beer_name' and use it to replace the null values.
* On further analysis, some 'beer_name' values have both null and non-null values for the corresponding 'beer_ABV'. For these, the nulls are replaced with the non-null value observed for the same beer name.
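
The idea in the bullets above can be sketched on a toy frame. This is a simplified variant under stated assumptions: it fills only the missing entries, whereas the cells below overwrite the whole column with the per-name mode. The names `toy`, `group_mode`, `name_to_abv` and `beer_ABV_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: beer "A" and "B" have a known ABV for some rows, beer "C" has none at all.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "beer_ABV": [5.0, np.nan, 5.0, 7.2, np.nan, np.nan],
})

def group_mode(values):
    """Return the most frequent non-null value, or NaN if the group has none."""
    modes = values.mode()  # Series.mode() drops NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# One representative ABV per beer name.
name_to_abv = toy.groupby("beer_name")["beer_ABV"].agg(group_mode)

# Fill only the missing entries, looking the value up by beer name.
toy["beer_ABV_filled"] = toy["beer_ABV"].fillna(toy["beer_name"].map(name_to_abv))

print(toy)
```

Rows for a beer name with no known ABV at all (beer "C" here) stay null, which mirrors the records the notebook later drops.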


In [None]:
df.loc[:,['beer_name','beer_ABV']]

In [None]:
print('No. of unique values of beer names in the given data :',df.beer_name.nunique(dropna=False))

print('No. of unique values of beer abv in the given data :',df.beer_ABV.nunique(dropna=False))

In [None]:
#create a dataframe for beer_ABV not null data

df_NNA = df.loc[df.beer_ABV.notna(),['beer_name','beer_ABV']].sort_values(by = 'beer_name', axis=0, ascending=True, 
                                                         inplace=False, kind='quicksort', na_position='last')
df_NNA

In [None]:
print('No. of unique values of beer names in the not_null data :',df_NNA.beer_name.nunique(dropna=False))

print('No. of unique values of beer abv in the not_null data :',df_NNA.beer_ABV.nunique(dropna=False))

In [None]:
#get the mode of the 'beer_ABV' feature corresponding to the 'beer_name' feature

# credits: https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value
get_items = lambda vals : max(Counter(vals).items(), key = lambda x : x[1])[0] 
beer_name_abv1 = df_NNA.groupby('beer_name')['beer_ABV'].agg(get_items).to_dict()
beer_name_abv1

In [None]:
#replace the beer_ABV feature with the mode of the 'beer_ABV' feature corresponding to the beer_name

df.beer_ABV = df.beer_name.map(beer_name_abv1)

In [None]:
#around (20280-17920=) 2360 of the missing values in the beer_ABV feature were replaced by the mode.

df.loc[df.beer_ABV.isna()].shape[0]

In [None]:
df.loc[df.beer_ABV.notna(),['beer_name','beer_ABV']]

In [None]:
#now get the mode of the 'beer_ABV' feature corresponding to the 'beer_name' feature for the entire data

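# sort key: prefer non-null values, then the highest count
# (the earlier `(x[0] != np.NaN) & x[1]` was always a bitwise AND of True with the count)
get_items1 = lambda vals : max(Counter(vals).items(), key = lambda x : (pd.notna(x[0]), x[1]))[0]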
beer_name_abv2 = df.groupby('beer_name')['beer_ABV'].agg(get_items1).to_dict()
beer_name_abv2

In [None]:
#again replace the beer_ABV feature with the mode of the 'beer_ABV' feature corresponding to the beer_name

df.beer_ABV = df.beer_name.map(beer_name_abv2)

In [None]:
#we can observe that there are still 17920 missing values present in the beer_ABV feature
#corresponding to the beer_name.

df.beer_ABV.isna().sum()


In [None]:
df_temp = df.loc[df.beer_ABV.isna(),['beer_name','beer_ABV']]
df_temp.groupby('beer_name')['beer_ABV'].count().to_dict()

* **Since no beer_ABV value is present at all for the beers behind these 17920 records, we will drop these data points from our dataset for further analysis.**

In [None]:
df.dropna(inplace=True)
df

In [None]:
df.isna().sum()

### Featurization - Adding new Feature for simplifying analysis

In [None]:
#converting the 'review_time' from UNIX timestamp to date_time format

df['review_time'] = df['review_time'].apply(lambda x :datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
df['year'] = df['review_time'].apply(lambda x : x[0:4]).astype(int)

### Univariate and Bivariate analysis of different features

In [None]:
#Beer Name

df['beer_name'].value_counts().head(50).plot.bar(figsize=(16,5),title= 'Most Popular Beers by Name')

In [None]:
#Beer style

df['beer_style'].value_counts().head(50).plot.bar(figsize=(16,5),title= 'Most Popular Beers by Style')

In [None]:
#Beer ABV
plt.figure(figsize=(12,5))
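# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same histogram plus density curve
sns.histplot(df['beer_ABV'], bins=50, kde=True)
##df['beer_ABV'].plot.density()  # this can be used alternatively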
plt.xlabel("Alcohol By Volume")
plt.show()

**Observation :**
* Most of the 'beer_ABV' distribution lies between 5 and 10, with a long tail towards the right.
* The data is not perfectly normally distributed (it is right-skewed) but is reasonably well behaved overall.

In [None]:
plt.figure(figsize=(12,5))
df['beer_ABV'].plot.box(title= 'beer_ABV') 
##df.boxplot(column='beer_ABV') # this can be used alternatively 
#plt.tight_layout()
plt.show()

**Observation :**
* The box plot shows that the feature 'beer_ABV' contains outlier values.
* Since only around 3.8% of the values were missing, they were replaced earlier with the value of the feature corresponding to the same beer_name, where one existed.

In [None]:
#Review Overall

plt.figure(figsize=(16,5))

plt.subplot(121) 
sns.histplot(df.review_overall, bins=50, kde=True)  # replaces the deprecated sns.distplot

plt.subplot(122) 
df['review_overall'].plot.box(title= 'review_overall') 

plt.tight_layout()
plt.show()

**Observation :**
* The overall ratings are distributed in the range 1 to 5, and the most common rating is 4.
* The data is not normally distributed; left-skewness is observed.
* The IQR for the overall review feature is between 3.5 and 4.5.
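
A rough way to check the left-skewness claim numerically is to look at the sample skewness and at the mean relative to the median. The sketch below uses made-up ratings; `ratings` is a hypothetical stand-in for `df['review_overall']`.

```python
import pandas as pd

# Toy ratings on the 1-5 scale (illustrative only): most values sit near 4, a few are low.
ratings = pd.Series([1.0, 3.0, 3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 5.0])

# Series.skew() returns the sample skewness; a negative value indicates a longer left tail.
skewness = ratings.skew()

print(f"skewness = {skewness:.3f}")
print(f"mean = {ratings.mean():.3f}, median = {ratings.median():.3f}")
```

For a left-skewed sample like this one, the mean is pulled below the median by the low ratings.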

In [None]:
# Plotting Histograms to display the PDF of all the numeric type features in this dataset. 

df.hist(bins = 15,figsize=(16,12))
plt.show()

In [None]:
# Number of Beers By Alcohol content

d1 = df.groupby('beer_ABV')['beer_name'].count().sort_values(ascending=False).head(50)

x = list(d1.index.values)
for i in range(len(x)):
    x[i] = np.format_float_positional(np.float16(x[i]*1))

y = d1.values

plt.figure(figsize=(20,10))

sns.barplot(x=x, y=y)  # recent seaborn versions require x and y as keyword arguments
plt.xlabel("Alcohol By Volume (%)",color='blue')
plt.ylabel("Number of Beers",color='red')
plt.title("Beer by Alcohol content", color='green')
plt.show()

In [None]:
d2 = df.groupby('beer_style')[['beer_ABV','review_overall']].mean().sort_values('beer_style').reset_index()
#d2

In [None]:
#Beer style vs Beer ABV
fig = px.scatter(d2,x="beer_style",y="beer_ABV")
fig.show()

**Observation:**

* Almost all beer styles have an average alcohol by volume (ABV) above 4%.


In [None]:
#Beer ABV vs Overall Review

fig = px.scatter(d2,x="beer_ABV",y="review_overall",trendline ='ols')
fig.show()

**Observation:**

* Beer styles with an average ABV > 5% tend to get higher overall ratings, with almost all of them getting an average overall rating > 3.
* There is a positive correlation between ABV levels and the overall rating of the beer.
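
The scatter plot above is drawn from `d2`, which holds per-style averages, so the trend describes style means rather than individual reviews. A minimal sketch of how the strength and slope of such a trend could be quantified with NumPy follows; the arrays are made up and the variable names are illustrative.

```python
import numpy as np

# Toy per-style averages (illustrative only): mean ABV and mean overall rating.
abv = np.array([4.0, 5.0, 6.0, 8.0, 10.0])
rating = np.array([3.4, 3.6, 3.7, 3.9, 4.1])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(abv, rating)[0, 1]

# Least-squares straight line rating ~ slope * abv + intercept.
slope, intercept = np.polyfit(abv, rating, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```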

In [None]:
# Pair Plot for all the user ratings

dat = df.loc[:,['review_appearance', 'review_palette', 'review_overall', 'review_taste','review_aroma']]
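# select several columns after groupby with a list (double brackets); the tuple form fails in recent pandas
dat = dat.groupby('review_overall')[['review_appearance', 'review_palette','review_taste','review_aroma']].mean().sort_values('review_overall').reset_index()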
#dat
sns.pairplot(data=dat)

## Let's explore some really interesting intuitive questions about the beer data :

### Q1:

1. **Rank top 3 Breweries which produce the strongest beers?**

* Based on the alcohol volume of a beer (i.e. 'beer_ABV'), we can determine how strong it is. 
* In this dataset, we are given only 'beer_brewerId' and not the corresponding **'brewer_names'**, so we will use the 'beer_brewerId' for finding the breweries which produce the strongest beer.

In [None]:
df.beer_brewerId.value_counts()

In [None]:
df_abv = df.groupby('beer_brewerId')['beer_ABV'].mean()
df_abv = pd.DataFrame(data=df_abv).sort_values(by=['beer_ABV'],ascending=False).reset_index()
df_abv.head(3)

* **The top 3 breweries which produce the strongest beers can be recognized by the brewery ids below.**
<ul>
    <li>6513</li> 	
    <li>736</li>  	
    <li>24215</li> 
</ul>

In [None]:
fig = px.scatter(df_abv,x="beer_brewerId",y="beer_ABV")
fig.show()

### Q2:

2. **Which year did beers enjoy the highest ratings?**

* To determine whether a beer is good overall, we will use the overall rating, i.e. 'review_overall'.


In [None]:
df.groupby('year')['year'].count()

In [None]:
df_dt = df.loc[:,['year','review_overall']]

In [None]:
df_dt = df_dt.groupby('year')[['review_overall']].mean().sort_values('review_overall',ascending = False).reset_index()
df_dt

* **Thus Beers enjoyed the highest ratings in the year 2000.**

In [None]:
fig = px.scatter(df_dt,x="year",y="review_overall")
fig.show()

### Q3:

3. **Based on the user’s ratings which factors are important among taste, aroma, appearance, and palette?**

* For determining the important factors, we need to find the correlation amongst different factors.
* Compare the different factors with the overall review and thus find the important factor.

In [None]:
df_taap = df.loc[:,['review_taste','review_aroma','review_appearance', 'review_palette', 'review_overall']]
df_taap

In [None]:
corr_mat = df_taap.corr()
corr_mat

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(data=corr_mat, annot=True,cmap="YlGnBu")

* **The standard Pearson correlation coefficient is used here to calculate the correlation between the different factors and overall beer quality.**
* **It can be observed that the 'review_aroma' feature is most correlated with the 'review_overall' feature, and thus we can conclude it to be an important feature based on the users' reviews and ratings.**
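
Reading the ranking off a heatmap is easy to get wrong, so one option is to sort the `review_overall` column of the correlation matrix directly. The sketch below uses made-up ratings, so the ordering it prints says nothing about the real data; `toy` and `ranked` are hypothetical names.

```python
import pandas as pd

# Toy ratings (illustrative only); column names follow the dataset.
toy = pd.DataFrame({
    "review_taste": [4.0, 3.0, 4.5, 2.5, 4.0],
    "review_aroma": [3.5, 3.0, 4.5, 2.5, 4.0],
    "review_appearance": [4.0, 3.5, 4.0, 3.0, 3.5],
    "review_palette": [4.0, 3.0, 4.5, 2.0, 4.0],
    "review_overall": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Pearson correlation of every factor with the overall rating, strongest first.
ranked = (
    toy.corr()["review_overall"]
    .drop("review_overall")          # the self-correlation of 1.0 is not informative
    .sort_values(ascending=False)
)

print(ranked)
```

On the real data the same expression would be applied to `corr_mat`.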

In [None]:
#Review_aroma vs Overall Review
df_taap = df_taap.groupby('review_overall')[['review_taste','review_aroma','review_appearance', 'review_palette']].mean().sort_values('review_overall').reset_index()
#df_taap
fig = px.scatter(df_taap,x="review_aroma",y="review_overall",trendline ='ols')
fig.show()

* **We can see that a strong positive correlation exists between review_aroma and review_overall.**

### Q4:

4. **If you were to recommend 3 beers to your friends based on this data which ones will you recommend?**

* For determining whether a beer is overall good or not, we will consider 2 factors here - 'review_overall' and 'beer_ABV'.
* The beers with the best values for both overall rating and alcohol volume will be considered for recommendation.
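
A plain sort by mean rating can put beers with a single perfect review at the top. One possible refinement, sketched here on toy data, is to require a minimum number of reviews before ranking. The `MIN_REVIEWS` threshold and the frame below are hypothetical and not part of the notebook's method.

```python
import pandas as pd

# Toy reviews (illustrative only): beer "A" has one perfect review, "B" and "C" have three each.
toy = pd.DataFrame({
    "beer_name": ["A", "B", "B", "B", "C", "C", "C"],
    "review_overall": [5.0, 4.5, 4.0, 4.5, 4.0, 3.5, 4.5],
    "beer_ABV": [9.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0],
})

MIN_REVIEWS = 3  # hypothetical threshold; tune it to the size of the real dataset

# Named aggregation: mean rating, mean ABV and review count per beer.
summary = toy.groupby("beer_name").agg(
    mean_overall=("review_overall", "mean"),
    mean_abv=("beer_ABV", "mean"),
    n_reviews=("review_overall", "size"),
)

# Keep only beers with enough reviews, then rank by rating and ABV.
eligible = summary[summary["n_reviews"] >= MIN_REVIEWS]
top = eligible.sort_values(["mean_overall", "mean_abv"], ascending=False)

print(top)
```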

In [None]:
df.groupby('beer_name')['beer_name'].count().sort_values(ascending=False)

* **Observation** : 14028 different varieties of beers are available in the dataset.

In [None]:
df_br = df.loc[:,['beer_name','review_overall','beer_ABV']]
df_br

In [None]:
df_br = df_br.groupby('beer_name')[['review_overall','beer_ABV']].mean().reset_index().sort_values(by = ['review_overall','beer_ABV'],ascending = False).head(10)
df_br

* **The top 3 beers which can be considered for recommendation, based on the overall ratings and the alcohol volume, can be recognized by the beer names below.**
<ul>
    <li>AleSmith Speedway Stout - Oak Aged</li> 	
    <li>Pilot Series Imperial Sweet Stout - Palm Ridge Reserve Barrel Aged</li>  	
    <li>Bees Knees Barleywine</li> 
</ul>

In [None]:
fig = px.scatter(df_br,x="beer_name",y="review_overall")
fig.show()


### Q5:

5. **Which Beer style seems to be the favorite based on reviews written by users?**

In [None]:
df['beer_style'].value_counts()

In [None]:
plt.figure(figsize=(20,10))

df['beer_style'].value_counts().plot(kind = "bar", color = "blue")

plt.title("Most Favorite Beer Styles by Count of written reviews")

In [None]:
df_bsrt = df.loc[:,['beer_style','review_text']].sort_values(by='beer_style')
df_bsrt = df_bsrt.iloc[0:100000,:]
#df_bsrt.to_dict()

In [None]:
df_tmp = df_bsrt.groupby('beer_style')['review_text'].count().nlargest(10)
df_tmp

In [None]:
# Credits : https://stackoverflow.com/a/47091490/4084039

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    #phrase = re.sub(r"I\'d", "I had", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

## Natural Language Processing of Text Feature

In [None]:
nltk.download('stopwords')

In [None]:
preprocessed_reviews = []

if os.path.isfile('./preprocessed_reviews.pkl'):
    #retrieve the preprocessed_reviews list for usage.
    with open('./preprocessed_reviews.pkl', 'rb') as f:
        preprocessed_reviews = pickle.load(f)
else:
    for rev in  tqdm(df_bsrt['review_text'].values):
        rev = re.sub(r"http\S+", "", rev)
        rev = BeautifulSoup(rev, 'lxml').get_text()
        rev = decontracted(rev)
        rev = re.sub(r"\S*\d\S*", "", rev).strip()  # raw string avoids invalid-escape warnings
        rev = re.sub(r"[^A-Za-z]+", ' ', rev)
        rev = ' '.join(w.lower() for w in rev.split() if w.lower() not in stop_words)
        preprocessed_reviews.append(rev)

    #save the preprocessed_reviews list for later usage.
    with open('preprocessed_reviews.pkl', 'wb') as f: 
        pickle.dump(preprocessed_reviews, f)

In [None]:
review_text_string = ' '.join(map(str, preprocessed_reviews)) 
review_text_words = word_tokenize(review_text_string)
len(review_text_words)

In [None]:
wordsToken = FreqDist(review_text_words)

In [None]:
wordsToken.most_common(50)

In [None]:
review_text_words_clean = [w for w in review_text_words if w.isalpha()]
print(len(review_text_words_clean))

In [None]:
wordstring = ' '.join(map(str,review_text_words_clean))

In [None]:
# Word Cloud

wc = WordCloud(background_color="white",stopwords=STOPWORDS)
# generate word cloud
wc.generate(wordstring)
print ("Word Cloud for input text:")
plt.figure(figsize=(20,20))
plt.imshow(wc)
plt.axis("off")
plt.show()

**Observations:**

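* The word cloud displays the most frequent words in the sampled review text (the first 100000 reviews after sorting by beer_style), so it reflects the styles in that sample rather than any single beer_style.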
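
What the word cloud visualizes is essentially a word-frequency table. A minimal standard-library sketch on made-up review text follows; the `reviews` list and the tiny `stop` set are illustrative stand-ins for `preprocessed_reviews` and the NLTK stop words.

```python
from collections import Counter

# Toy reviews (illustrative only).
reviews = [
    "dark roasted malt with coffee notes",
    "light malt and citrus hops",
    "coffee and chocolate malt",
]

# Tiny stand-in for a stop-word list.
stop = {"with", "and"}

# Lowercase, split on whitespace, and drop stop words.
tokens = [w for text in reviews for w in text.lower().split() if w not in stop]

# Count occurrences; the most frequent words would be drawn largest in a word cloud.
counts = Counter(tokens)
top2 = counts.most_common(2)

print(top2)
```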

In [None]:
d_0 = df.loc[:,['beer_style','review_text','review_overall']]
d_0

In [None]:
d_0 = d_0.groupby(['beer_style','review_text'])[['review_overall']].sum().sort_values('review_overall',ascending = False).reset_index()
d_0

In [None]:
d_0.loc[d_0.review_text != '#NAME?'].head(5)

**Observations:**

* The most favored 'beer_style' values, based on the reviews written by users, are:
<ul>
    <li>American Adjunct Lager</li> 	
    <li>Märzen / Oktoberfest</li>
    <li>American Adjunct Lager</li> 
    <li>English Porter</li>
    <li>Fruit / Vegetable Beer</li>
</ul>

### Q6:

6. **How does written review compare to overall review score for the beer styles?**
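
The cells below summarize the overall score per beer style. To put the written text on a comparable footing, one hypothetical approach is to score each review with a small word list and average that score per style next to the mean rating. Everything in this sketch (the `POSITIVE` and `NEGATIVE` word sets, `text_score`, and the toy frame) is an assumption for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical word lists; a real analysis would use a proper sentiment lexicon.
POSITIVE = {"great", "excellent", "smooth", "delicious"}
NEGATIVE = {"bad", "watery", "bland", "awful"}

def text_score(text):
    """Count positive words minus negative words in a review."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Toy reviews (illustrative only).
toy = pd.DataFrame({
    "beer_style": ["Stout", "Stout", "Lager", "Lager"],
    "review_text": [
        "great smooth roast",
        "excellent and delicious",
        "watery and bland",
        "bad but smooth",
    ],
    "review_overall": [4.5, 5.0, 2.0, 3.0],
})

toy["text_score"] = toy["review_text"].apply(text_score)

# Average text score and average overall rating, side by side per style.
by_style = toy.groupby("beer_style")[["text_score", "review_overall"]].mean()

print(by_style)
```

If the two columns rank the styles in a similar order, the written reviews and the numeric scores agree with each other.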

In [None]:
df_rtro = df.loc[:,['beer_style','review_text']].sort_values('beer_style')
df_rtro

In [None]:
df_bsro = df.loc[:,['beer_style','review_overall']].sort_values('beer_style')
df_bsro

In [None]:
q6 = df_bsro.loc[:,['beer_style','review_overall']]

In [None]:
q6 = q6.groupby('beer_style')[['review_overall']].mean().sort_values('review_overall', ascending = False).reset_index()
q6

In [None]:
fig = px.scatter(q6,x="beer_style",y="review_overall",color='beer_style')

fig.show()