# Data Wrangling - Capstone 1
***

## Beer Review Data Description
***

The dataset contains over 1.5 million beer reviews collected from two websites: BeerAdvocate.com and RateBeer.com. Besides the review text, product category, and alcohol by volume (ABV), it includes sensory ratings for taste, look, and smell, plus an overall rating. For this project I will train and test models that predict beer ratings and beer style from the reviews users left.

These reviews were made available by Julian McAuley, a UCSD Computer Science professor, and cover a collection period from January 1998 to November 2011. The dataset was accessed with permission. Here are some key specs of the dataset itself.
+ Number of users: 33,387
+ Number of items: 66,051
+ Number of reviews: 1,586,259
    
To tackle the issue of size, I will take a subset of 99,999 reviews to train and test the models before applying them to the rest of the dataset.
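
One way to draw such a subset reproducibly is pandas' `DataFrame.sample` with a fixed seed. The sketch below uses a small made-up frame in place of the full review table; the names `reviews` and `subset`, the sizes, and the seed are illustrative assumptions, not necessarily how the sample file used here was produced.

```python
import pandas as pd

# Stand-in for the full review table (hypothetical toy data)
reviews = pd.DataFrame({
    'name': [f'beer_{i}' for i in range(1000)],
    'overall': [i % 5 + 1 for i in range(1000)],
})

# Draw a fixed-size random subset; random_state makes the draw repeatable
subset = reviews.sample(n=100, random_state=42).reset_index(drop=True)

print(subset.shape)
```

Resetting the index after sampling gives the subset a clean 0..n-1 index, which keeps later positional lookups simple.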


### Import and Review Data
***

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re

In [2]:
beer = pd.read_excel('data/ratebeer_sample.xlsx')

In [3]:
beer.head()

Unnamed: 0,name,beerId,brewerId,ABV,style,appearance,aroma,palate,taste,overall,time,profileName,text
0,John Harvards Simcoe IPA,63836,8481,5.4,India Pale Ale &#40;IPA&#41;,4,6,3,6,13,1157587200,hopdog,"On tap at the Springfield, PA location. Poured..."
1,John Harvards Simcoe IPA,63836,8481,5.4,India Pale Ale &#40;IPA&#41;,4,6,4,7,13,1157241600,TomDecapolis,On tap at the John Harvards in Springfield PA....
2,John Harvards Cristal Pilsner,71716,8481,5.0,Bohemian Pilsener,4,5,3,6,14,958694400,PhillyBeer2112,"Springfield, PA. I've never had the Budvar Cri..."
3,John Harvards Fancy Lawnmower Beer,64125,8481,5.4,K•À_lsch,2,4,2,4,8,1157587200,TomDecapolis,On tap the Springfield PA location billed as t...
4,John Harvards Fancy Lawnmower Beer,64125,8481,5.4,K•À_lsch,2,4,2,4,8,1157587200,hopdog,"On tap at the Springfield, PA location. Poured..."


In [4]:
beer.columns

Index(['name', 'beerId', 'brewerId', 'ABV', 'style', 'appearance', 'aroma',
       'palate', 'taste', 'overall', 'time', 'profileName', 'text'],
      dtype='object')

In [5]:
beer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 13 columns):
name           99999 non-null object
beerId         99999 non-null object
brewerId       99999 non-null int64
ABV            99999 non-null object
style          99999 non-null object
appearance     99999 non-null int64
aroma          99999 non-null int64
palate         99999 non-null int64
taste          99999 non-null int64
overall        99999 non-null int64
time           99999 non-null int64
profileName    99999 non-null object
text           99999 non-null object
dtypes: int64(7), object(6)
memory usage: 9.9+ MB


## Examining and Cleaning the DataFrame 
***
Examining the first few rows of the DataFrame, a few things stand out:
+ Special characters and notation weren't read properly (for example `K•À_lsch` and `&#40;IPA&#41;`), so they need to be repaired.
+ The time column would need to be converted before it could be used. In this case we don't need it, so we will drop it. The same goes for beerId and brewerId.
+ Some beers don't have an alcohol percentage (ABV).
+ The index needs to be redone.
+ The ratings are not all on the same scale. "Aroma" and "Taste" are on a scale of 10, "Appearance" and "Palate" are on a scale of 5, and "Overall" uses a larger range still (values such as 13 and 14 appear in the sample rows).
+ Some text reviews are not in English. Let's figure out a way to remove them.
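
If the `time` column were needed, its values look like Unix timestamps in seconds (an assumption based on the sample rows in the preview) and could be converted with `pd.to_datetime`. A minimal sketch on three values copied from that preview:

```python
import pandas as pd

# Values copied from the 'time' column of the preview
times = pd.Series([1157587200, 1157241600, 958694400])

# Interpret the integers as seconds since the Unix epoch
review_dates = pd.to_datetime(times, unit='s')

print(review_dates.dt.year.tolist())
```

The resulting years fall inside the stated collection period, which supports the seconds-since-epoch reading.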

***
### Remove Unnecessary Data


In [6]:
# Drop Time, beerId and brewerId columns from pandas dataframe
beer = beer.drop(['beerId', 'brewerId', 'time'], axis=1)

### Fix Spelling Errors

In [7]:
# Check for Beer Styles where the name needs to be altered.
beer['style'].unique()

array(['India Pale Ale &#40;IPA&#41;', 'Bohemian Pilsener', 'K•À_lsch',
       'Sweet Stout', 'Brown Ale', 'Belgian Ale', 'Abbey Tripel',
       'Belgian White &#40;Witbier&#41;', 'Mild Ale', 'Pale Lager',
       'Imperial/Double IPA', 'Sour Ale/Wild Ale', 'Traditional Ale',
       'Heller Bock', 'Porter', 'Bitter', 'Spice/Herb/Vegetable',
       'Imperial Stout', 'Belgian Strong Ale', 'Golden Ale/Blond Ale',
       'Scottish Ale', 'Stout', 'Scotch Ale', 'Abbey Dubbel', 'Saison',
       'Dunkel', 'American Pale Ale', 'Altbier', 'Wheat Ale',
       'Abt/Quadrupel', 'Oktoberfest/M•À_rzen', 'Baltic Porter',
       'Premium Lager', 'Imperial/Strong Porter', 'Smoked', 'Fruit Beer',
       'Amber Ale', 'English Pale Ale', 'Pilsener', 'German Hefeweizen',
       'Premium Bitter/ESB', 'Cream Ale', 'California Common', 'Vienna',
       'Barley Wine', 'Doppelbock', 'Sak•À_ - Ginjo',
       'American Strong Ale', 'Dunkler Bock', 'Black IPA',
       'Strong Pale Lager/Imperial Pils', 'Irish Ale', 

In [9]:
def fix_spelling(col):
    """Replaces character errors in a given column"""
    # Each key is treated as a regex pattern; every match is replaced by the paired value
    return col.replace({
        'K•À_lsch': 'Kölsch',
        'M•À_rzen': 'Märzen',
        'Sak•À_': 'Sake',
        'Bi•À_re de Garde': 'Bière de Garde',
        '&quot;': '"',
        '&#40;': '(',
        '&#41;': ')',
        'Brï¿½u': 'Bräu',
        'Kï¿½r': 'Kür',
        'Mï¿½r': 'Mär',
        'hï¿½f': 'häf',
        'lï¿½n': 'lán',
        'gï¿½u': 'gäu',
        'rï¿½n': 'rän',
        'tï¿½c': 'tüc'
    }, regex=True)

# Apply fix_spelling to the 'style' column
beer['style'] = fix_spelling(beer['style'])

# Apply fix_spelling to the 'name' column
beer['name'] = fix_spelling(beer['name'])
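
The `&#40;`, `&#41;` and `&quot;` sequences are HTML character references, so for those entries Python's standard `html.unescape` is a possible alternative to listing each one by hand. This is a sketch of that alternative, not what the notebook does, and it would not repair the `K•À_lsch`-style encoding damage, which still needs an explicit mapping.

```python
import html

styles = ['India Pale Ale &#40;IPA&#41;', 'Belgian White &#40;Witbier&#41;', 'Porter']

# html.unescape decodes numeric and named character references
decoded = [html.unescape(s) for s in styles]

print(decoded)
```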

### Account for Missing ABV for Beers
Some beers do not have an alcohol content listed in the data. For these we will convert the '-' to NaN so that we can still work with the data.

In [10]:
beer.ABV = beer.ABV.replace('-', np.nan)
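
After the replacement the column may still have the `object` dtype, since it was read as a mix of numbers and strings. A possible follow-up, shown here on made-up values rather than the real column, is `pd.to_numeric` with `errors='coerce'`, which turns anything non-numeric (including a `'-'` placeholder) into NaN and yields a float column.

```python
import pandas as pd

# Toy ABV column mixing numbers and the '-' placeholder (made-up values)
abv = pd.Series([5.4, '-', 5.0, '-'], dtype=object)

# Coerce to float; entries that cannot be parsed become NaN
abv_numeric = pd.to_numeric(abv, errors='coerce')

print(abv_numeric.dtype)
print(abv_numeric.isna().sum())
```

A numeric dtype makes later steps such as `mean()` or histogram plots work without extra casting.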

### Adjust Rating Scales
***
In order to have all the ratings on the same scale the following was done:
+ 'aroma' and 'taste' columns were linearly rescaled from a 1-10 range to a 1-5 range (1 stays 1, 10 becomes 5) and rounded to the nearest integer
+ 'overall' rating column was recalculated as the mean of the other 4 ratings, which puts it on a scale of 5 as well

In [11]:
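# Linearly map the 1-10 ratings onto 1-5 (1 -> 1, 10 -> 5), then round to the nearest integer
beer['aroma'] = (4 * beer['aroma'] / 9 + 5 / 9).round().astype(int)
beer['taste'] = (4 * beer['taste'] / 9 + 5 / 9).round().astype(int)
# Recompute 'overall' as the mean of the four ratings, now all on a 1-5 scale
beer['overall'] = (beer['appearance'] + beer['aroma'] + beer['palate'] + beer['taste']) / 4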
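
To see what the formula does to each possible score, the sketch below applies the same expression, (4x + 5) / 9 followed by rounding, to every value from 1 to 10. The variable names are illustrative only.

```python
import pandas as pd

# Every possible rating on the original 1-10 scale
ten_point = pd.Series(range(1, 11))

# Same linear map as in the cell: 1 -> 1 and 10 -> 5, then round to the nearest integer
five_point = (4 * ten_point / 9 + 5 / 9).round().astype(int)

print(dict(zip(ten_point, five_point)))
```

Each point on the new scale absorbs two adjacent points of the old one, so the endpoints are preserved rather than simply halved.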

### Extract a few distinct Beer Styles
***
Beer has many variations, and the complexity of styles continues to grow as brewers blend styles from all around the world. Given the data and the size of our sets, there are 5 distinct styles of beer we will look into. The rest are set aside for the time being.
+ India Pale Ale (IPA)
+ Stout
+ Amber Ale
+ Brown Ale
+ Belgian Ale
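
As a compact alternative to filtering one style at a time, the five styles can be selected in one pass with `isin` and then split with `groupby`. This is a sketch on made-up rows; `beer_demo`, `target_styles`, `selected` and `by_style` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the cleaned review table (made-up rows)
beer_demo = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'style': ['Stout', 'Porter', 'Amber Ale', 'Stout', 'Saison', 'Brown Ale'],
})

target_styles = ['India Pale Ale (IPA)', 'Stout', 'Amber Ale', 'Brown Ale', 'Belgian Ale']

# Keep only rows whose style is one of the five of interest
selected = beer_demo[beer_demo['style'].isin(target_styles)].reset_index(drop=True)

# One sub-frame per style, keyed by style name
by_style = {style: group for style, group in selected.groupby('style')}

print(selected['style'].value_counts().to_dict())
```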

In [14]:
def beer_style_info(df, title):
    """Displays various information about a DataFrame: number of brands, reviews, reviewers, etc.
       Returns the rating distribution from 1-5 in the four categories: palate, taste, appearance, and aroma"""
    print('\n', title, '\n-----------------------------')
    print('Number of Different Brands:   ', len(df['name'].value_counts()))
    # Counts distinct review texts, so identical reviews are only counted once
    print('Number of Reviews:            ', len(df['text'].value_counts()))
    print('Average Reviews per Brand:    ', round(df['name'].value_counts().mean()))
    print('Number of Reviewers:          ', len(df['profileName'].value_counts()))
    print('Reviews per Brand:            ', min(df['name'].value_counts()), '-', max(df['name'].value_counts()))
    
    # Count how often each rating value occurs in each category
    ratings = pd.DataFrame(columns=['appearance', 'taste', 'aroma', 'palate'])
    ratings['appearance'] = df.appearance.value_counts()
    ratings['taste'] = df.taste.value_counts()
    ratings['aroma'] = df.aroma.value_counts()
    ratings['palate'] = df.palate.value_counts()
    return ratings.sort_index()

In [15]:
beer_style_info(beer, title='All Beers')


 All Beers 
-----------------------------
Number of Different Brands:    4349
Number of Reviews:             98457
Average Reviews per Brand:     23
Number of Reviewers:           7538
Reviews per Brand:             1 - 3043


Unnamed: 0,appearance,taste,aroma,palate
1,789,1303,1434,1409
2,6840,7073,7362,11013
3,44757,33781,36566,49457
4,40253,52700,49981,33523
5,7360,5142,4656,4597


In [16]:
# Separate Amber Ale style of beers and extract information from it
amber = beer[beer['style'] == 'Amber Ale'].reset_index()
beer_style_info(amber, title='Amber Ale')


 Amber Ale 
-----------------------------
Number of Different Brands:    136
Number of Reviews:             4450
Average Reviews per Brand:     33
Number of Reviewers:           2084
Reviews per Brand:             1 - 994


Unnamed: 0,appearance,taste,aroma,palate
1,19,42,40,47
2,221,379,394,578
3,2454,2113,2353,2636
4,1541,1835,1604,1073
5,217,83,61,118


In [17]:
# Separate Belgian Ale style of beers and extract information from it
belgian = beer[beer['style'] == 'Belgian Ale'].reset_index()
beer_style_info(belgian, title='Belgian Ale')


 Belgian Ale 
-----------------------------
Number of Different Brands:    90
Number of Reviews:             2603
Average Reviews per Brand:     29
Number of Reviewers:           1217
Reviews per Brand:             1 - 572


Unnamed: 0,appearance,taste,aroma,palate
1,5,10,16,21
2,177,202,239,372
3,1548,1359,1421,1613
4,793,1024,913,578
5,93,21,27,32


In [18]:
# Separate Brown Ale style of beers and extract information from it
brown = beer[beer['style'] == 'Brown Ale'].reset_index()
beer_style_info(brown, title='Brown Ale')


 Brown Ale 
-----------------------------
Number of Different Brands:    139
Number of Reviews:             4042
Average Reviews per Brand:     30
Number of Reviewers:           2129
Reviews per Brand:             1 - 1211


Unnamed: 0,appearance,taste,aroma,palate
1,9,37,46,46
2,148,259,264,518
3,1817,1699,1892,2304
4,1910,2026,1852,1170
5,267,130,97,113


In [19]:
# Separate IPA style of beers and extract information from it
ipa = beer[beer['style'] == 'India Pale Ale (IPA)'].reset_index()
beer_style_info(ipa, title='India Pale Ale')


 India Pale Ale 
-----------------------------
Number of Different Brands:    231
Number of Reviews:             7831
Average Reviews per Brand:     35
Number of Reviewers:           2808
Reviews per Brand:             1 - 1414


Unnamed: 0,appearance,taste,aroma,palate
1,19,23,22,32
2,212,234,237,452
3,3707,2223,2320,4184
4,3722,5282,5108,3148
5,478,376,451,322


In [20]:
# Separate Stout style of beers and extract information from it
stout = beer[beer['style'] == 'Stout'].reset_index()
beer_style_info(stout, title='Stout')


 Stout 
-----------------------------
Number of Different Brands:    138
Number of Reviews:             3998
Average Reviews per Brand:     29
Number of Reviewers:           2019
Reviews per Brand:             1 - 1052


Unnamed: 0,appearance,taste,aroma,palate
1,4,12,12,16
2,68,96,90,268
3,1229,998,1131,1937
4,2241,2721,2594,1588
5,460,175,175,193


***
### Text Preprocessing
Our main feature will be built around the user-written reviews for each beer. However, text reviews more often than not contain spelling errors, extraneous words, and other issues that need to be addressed before they become useful.

#### Expand Contractions
Words like "can't" or "couldn't" are contractions that combine two words. A model can have trouble with these, sometimes ignoring the "not" that was folded into the contraction. To fix this I expanded contractions using a list I found here, from Springboard mentor DJ Sarkar:

https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/nlp%20proven%20approach/contractions.py

In [21]:
# Contraction Map to be used to expand contractions 
CONTRACTION_MAP = {
"ain't": "is not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", 
"could've": "could have", "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not",
"don't": "do not", "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not",
"he'd": "he would", "he'd've": "he would have", "he'll": "he will", "he'll've": "he will have", "he's": "he is",
"how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would", 
"I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have", "i'd": "i would", 
"i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am", "i've": "i have", "isn't": "is not",
"it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have", "it's": "it is", 
"let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not",
"mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", 
"needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock", "oughtn't": "ought not", 
"oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
"she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have",
"she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
"so's": "so as", "that'd": "that would", "that'd've": "that would have", "that's": "that is","there'd": "there would",
"there'd've": "there would have", "there's": "there is", "they'd": "they would", "they'd've": "they would have",
"they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
"wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
"we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
"what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have",
"where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have",
"who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not",
"won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
"y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have", "y'all're": "you all are",
"y'all've": "you all have", "you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
"you'll've": "you will have", "you're": "you are", "you've": "you have"
}


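def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    """Expand contractions in the given text if found in the CONTRACTION_MAP"""
    # Try longer keys first so "can't've" is matched before its prefix "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact match first, then fall back to its lowercase form
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())
        # Keep the original capitalization; skip this for keys such as "'cause" that start with an apostrophe
        if first_char.isalpha():
            expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, str(text))
    # Drop any apostrophes that remain (e.g. possessives)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text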
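
To make the mechanics concrete, here is a reduced, self-contained sketch of the same idea with a three-entry map. `small_map` and `expand` are illustrative names; the notebook itself uses the full `CONTRACTION_MAP` and `expand_contractions`.

```python
import re

# Tiny illustrative map; the notebook uses the much larger CONTRACTION_MAP
small_map = {"can't": "cannot", "couldn't": "could not", "it's": "it is"}

# Longest keys first so a longer contraction is never shadowed by a shorter prefix
pattern = re.compile(
    '|'.join(re.escape(k) for k in sorted(small_map, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand(text):
    """Replace each contraction with its expansion, keeping the first letter's case"""
    def repl(m):
        expanded = small_map[m.group(0).lower()]
        return m.group(0)[0] + expanded[1:]
    return pattern.sub(repl, text)

print(expand("It's good but I can't finish it"))
```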

#### Remove Special Characters
I'm not interested in looking at special characters or punctuation and therefore I need to remove all of this.

In [22]:
def remove_special_characters(text, remove_digits=False):
    """Removes special characters. If you set remove_digits to true it will also remove digits"""
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
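
A common pitfall with this kind of pattern is the range `A-z` in place of `A-Z`. In ASCII, `A-z` spans codes 65 to 122, which also covers `[`, `\`, `]`, `^`, `_` and the backtick, so those characters survive the cleanup. The short comparison below uses a made-up string to show the difference.

```python
import re

sample = "hoppy_finish [4/5] ^nice^ `malt`"

# 'A-z' also covers [ \ ] ^ _ and the backtick, so only '/' is removed here
loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)

# 'A-Z' restricts the class to letters only
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(loose)
print(strict)
```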

#### Stemming and Lemmatization 
To get a better understanding of which words are used when, I only care about the root or base of each word. For example, the words "played", "plays", "playing" all come from "play". There are two routes I will try: stemming and lemmatization. Stemming is often sufficient and faster, but lemmatization might provide more robust output.

In [23]:
# Load spaCy's small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

def simple_stemmer(text):
    """Apply the PorterStemmer() onto the given text to extract only the base of the word"""
    ps = nltk.stem.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text


def lemmatize_text(text):
    """Use spacy to lemmatize the text"""
    text = nlp(text)
    # spaCy 2.x lemmatizes pronouns to '-PRON-'; keep the original token in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

#### Stopwords
There are certain words that won't really be helpful to include because they appear frequently across all reviews. Examples of these words are "a", "is", etc. I used the built-in English stopword list from the nltk library. However, I did make some adjustments.
 1. Remove the words "no" and "not" from the stopword list because they can reverse the sentiment of a text
 2. After looking into the frequency of words, more words may be added to the stopword list in the future.

In [24]:
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

def remove_stopwords(text, is_lower_case=False):
    """Remove stopwords from text"""
    # Tokenize text to separate words
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    # If the text is already lowercase, compare tokens directly; otherwise lowercase each token for the comparison
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
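
The effect of keeping "no" and "not" is easiest to see on a single sentence. The sketch below uses a small hand-written stopword set rather than the NLTK list, and a plain whitespace split instead of `ToktokTokenizer`; both are simplifications for illustration.

```python
# Small hand-written stopword set for illustration (not the NLTK list)
stopwords_demo = {'a', 'is', 'the', 'this', 'no', 'not'}

# Keep negations so that "not good" does not collapse into "good"
stopwords_kept_negation = stopwords_demo - {'no', 'not'}

def filter_tokens(text, stopwords):
    """Drop stopwords from whitespace-separated, lowercased text"""
    return ' '.join(tok for tok in text.lower().split() if tok not in stopwords)

review = "This is not a good stout"
print(filter_tokens(review, stopwords_demo))
print(filter_tokens(review, stopwords_kept_negation))
```

With the negation removed, the review reads as positive; with it kept, the cleaned text still carries the original sentiment.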

#### Putting it all together
It is messy to deal with each step on its own, so I put them all together into one function whose parameters turn individual steps on and off.

In [25]:
def normalize_corpus(corpus, contraction_expansion=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    """Normalize each document in the corpus. Functions can be excluded by changing the parameter to False"""
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # Stem text
        else:
            doc = simple_stemmer(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

# Clean "text" column using stemming
beer['clean_text_stem'] = normalize_corpus(beer['text'], text_lemmatization=False)

# Clean "text" column using lemmatization
beer['clean_text_lem'] = normalize_corpus(beer['text'])

#### Word Count 
I want to create another feature to look at besides the text review while doing my EDA. This will be the word count of each review. 

In [26]:
# Determine the word count for each text review: word_count
beer['word_count'] = beer['text'].apply(lambda x: len(str(x).split(" ")))
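
One detail worth knowing about this count: `split(" ")` breaks only on single spaces, so consecutive spaces produce empty strings that are counted as words, while `split()` with no argument breaks on any run of whitespace. A small comparison on a made-up string:

```python
review = "Poured a  deep amber  color"

# split(" ") yields an empty string for every extra space
count_space = len(review.split(" "))

# split() with no argument treats any run of whitespace as one separator
count_any = len(review.split())

print(count_space, count_any)
```

For reviews with regular spacing the two counts agree, so the difference only matters for irregularly spaced text.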

In [27]:
beer.head()

Unnamed: 0,name,ABV,style,appearance,aroma,palate,taste,overall,profileName,text,clean_text_stem,clean_text_lem,word_count
0,John Harvards Simcoe IPA,5.4,India Pale Ale (IPA),4,3,3,3,3.25,hopdog,"On tap at the Springfield, PA location. Poured...",tap springfield pa location pour deep cloudi o...,tap springfield pa location pour deep cloudy o...,73
1,John Harvards Simcoe IPA,5.4,India Pale Ale (IPA),4,3,4,4,3.75,TomDecapolis,On tap at the John Harvards in Springfield PA....,tap john harvard springfield pa pour rubi red ...,tap john harvard springfield pa pour ruby red ...,74
2,John Harvards Cristal Pilsner,5.0,Bohemian Pilsener,4,3,3,3,3.25,PhillyBeer2112,"Springfield, PA. I've never had the Budvar Cri...",springfield pa never budvar cristal thi exactl...,springfield pa never budvar cristal exactly im...,43
3,John Harvards Fancy Lawnmower Beer,5.4,Kölsch,2,2,2,2,2.0,TomDecapolis,On tap the Springfield PA location billed as t...,tap springfield pa locat bill fanci lawnmow li...,tap springfield pa location bill fancy lawnmow...,53
4,John Harvards Fancy Lawnmower Beer,5.4,Kölsch,2,2,2,2,2.0,hopdog,"On tap at the Springfield, PA location. Poured...",tap springfield pa location pour lighter golde...,tap springfield pa location pour light golden ...,74


### Remove Non-English Text Reviews

In [None]:
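from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    """Return True if the detected language is English or no language can be identified"""
    try:
        return detect(str(text)) == 'en'
    except LangDetectException:     # Keep the row if no language can be identified
        return True

# Keep only the rows whose lemmatized review is detected as English, then rebuild the index
beer = beer[beer['clean_text_lem'].apply(is_english)].reset_index(drop=True)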

### Export cleaned dataframe for later use.

In [28]:
# Export cleaned dataframe into an new csv file. 
# beer.to_csv(r'C:\Users\soham\OneDrive\Desktop\Springboard\beercleaned.csv')

## Conclusion
***
To wrap up data wrangling I did the following to the dataset:
+ Took a subset of the entire data collected
+ Dropped unnecessary columns: time, beerId, brewerId
+ Fixed spelling errors
+ Broke down text reviews in preparation for sentiment analysis
+ Added a new feature: word_count
+ Extracted five beer styles to examine along with the entire dataset: Amber Ale, Belgian Ale, Brown Ale, IPA, Stout
+ Started with a dataset size of 99,999 reviews. After cleaning and removing data, it was reduced to 81,048.


## Sources
***
Lipton, Zachary & Vikram, Sharad & McAuley, Julian. (2015). Capturing Meaning in Product Reviews with Character-Level Generative Text Models.
https://www.researchgate.net/publication/283761921_Capturing_Meaning_in_Product_Reviews_with_Character-Level_Generative_Text_Models