# Data Wrangling - Capstone 1
***

## Beer Review Data Description
***

The dataset contains over 1.5 million reviews of various beers from two beer review websites, one of which is BeerAdvocate.com. The data includes not only user reviews, product category, and alcohol by volume (ABV), but also sensory ratings such as taste, look, and smell, along with an overall rating. For this project we will train and test models that predict beer ratings and beer style based on the text reviews users left.

These reviews were made available by Julian McAuley, a computer science professor at UCSD, and cover a collection period from January 1998 to November 2011. The dataset was accessed with permission. Here are some key specs of the dataset itself:
+ Number of users: 33,387
+ Number of items: 66,051
+ Number of reviews: 1,586,259
    
To keep the data size manageable, we take a subset of 99,999 reviews to train and test our models before applying them to the rest of the dataset.
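
The notebook does not show how the 99,999-review subset was drawn. One possible approach, sketched here under that assumption, is a seeded random sample so the subset can be reproduced later. The helper name `take_sample`, the seed, and the toy columns are all hypothetical.

```python
import pandas as pd

def take_sample(reviews: pd.DataFrame, n: int = 99_999, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of at most n rows."""
    n = min(n, len(reviews))  # guard against frames smaller than n
    return reviews.sample(n=n, random_state=seed).reset_index(drop=True)

# Toy stand-in for the full review table (hypothetical data)
full = pd.DataFrame({"review_id": range(1_000), "overall": [3.5] * 1_000})
sample = take_sample(full, n=100)
print(len(sample))
```

Fixing `random_state` means the same rows are selected on every run, which keeps later model comparisons on equal footing.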


### Import and Review Data
***

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
# TextBlob is a third-party package; install it first (pip install textblob) if this import fails
from textblob import TextBlob

In [None]:
beersmall = pd.read_excel('data/ratebeer_sample.xlsx')

## Examining and Cleaning the DataFrame 
***
Examining the DataFrame, we notice a few things:
+ Special characters and notation weren't read properly and need to be repaired.
+ There is a time column that would need reformatting before it could be used. We don't need it here, so we will drop it.
+ The number of reviews per beer varies widely, starting from a single review. We want beers to have at least 5 reviews, so we need to remove those with fewer.
+ Some beers don't have an alcohol percentage (ABV).
+ The index needs to be reset.
+ The ratings are not all on the same scale. "Aroma" and "Taste" ratings are on a scale of 10, while the rest are on a scale of 5.
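
A few of these observations can be checked with quick summary statistics. The sketch below is illustrative only: the column name `abv` is an assumption (the notebook never shows the actual ABV column name), `summarize_issues` is a hypothetical helper, and the toy frame stands in for `beersmall`.

```python
import pandas as pd

def summarize_issues(df: pd.DataFrame, abv_col: str = "abv") -> dict:
    """Collect a few data-quality indicators from a review table."""
    reviews_per_beer = df["name"].value_counts()
    return {
        "missing_abv": int(df[abv_col].isna().sum()),               # rows without an ABV value
        "beers_under_5_reviews": int((reviews_per_beer < 5).sum()),  # beers with fewer than 5 reviews
        "max_aroma": df["aroma"].max(),                              # hints at a 10-point scale
        "max_appearance": df["appearance"].max(),                    # hints at a 5-point scale
    }

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "name": ["A"] * 5 + ["B"] * 2,
    "abv": [5.0, 5.0, None, 5.0, 5.0, 6.5, None],
    "aroma": [8, 7, 9, 10, 6, 5, 4],
    "appearance": [4, 3, 5, 4, 4, 2, 3],
})
report = summarize_issues(toy)
print(report)
```
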
***
### Remove Unnecessary Data

In [None]:
# Drop Time Column
beersmall = beersmall.drop(['time'], axis=1)

In [None]:
# Keep only beer styles with more than 5 reviews
# Note: `> 5` also drops styles with exactly 5 reviews; use `>= 5` to keep them
id_count = beersmall.groupby('style')['style'].transform('size')
mask = id_count > 5
beersmall = beersmall[mask]

# Keep only beer names with more than 5 reviews (same strict threshold as above)
id_count = beersmall.groupby('name')['name'].transform('size')
mask = id_count > 5
beersmall = beersmall[mask]
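
The group-size filter above can be tried on a small example. This is a minimal sketch on toy data, with a hypothetical helper `keep_groups_with_min_reviews`. It shows how the choice of threshold decides whether a group with exactly five reviews survives.

```python
import pandas as pd

def keep_groups_with_min_reviews(df: pd.DataFrame, column: str, min_reviews: int = 5) -> pd.DataFrame:
    """Keep rows whose value in `column` occurs at least `min_reviews` times."""
    # transform('size') broadcasts each group's row count back onto its rows
    counts = df.groupby(column)[column].transform("size")
    return df[counts >= min_reviews]

# Toy data: 6 IPA reviews, 5 Stout reviews, 2 Sake reviews
toy = pd.DataFrame({"style": ["IPA"] * 6 + ["Stout"] * 5 + ["Sake"] * 2})

at_least_5 = keep_groups_with_min_reviews(toy, "style", min_reviews=5)   # like `>= 5`
more_than_5 = keep_groups_with_min_reviews(toy, "style", min_reviews=6)  # like `> 5`

print(sorted(at_least_5["style"].unique()))
print(sorted(more_than_5["style"].unique()))
```

With a minimum of 5, both IPA and Stout remain; with a minimum of 6, only IPA does.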

### Fix Spelling Errors

In [None]:
# Function to replace character errors in the style column
def replacestyle(col):
    # Replace each value that matches the left side of the pair
    return col.replace({
        'K•À_lsch': 'Kölsch',
        '&#40;' : '(',
        '&#41;' : ')',
        'M•À_rzen' : 'Märzen',
        'Sak•À_' : 'Sake',
        'Bi•À_re de Garde' : 'Bière de Garde'
    }, regex=True)

# Apply the function replacestyle to the beersmall['style'] column
beersmall['style'] = replacestyle(beersmall['style'])

# Function to replace character errors in the name column 
def replacename(col):
    # Replace each value that matches the left side of the pair
    return col.replace({
        '&quot;' : '"', 
        '&#40;' : '(',
        '&#41;' : ')',
        'Brï¿½u' : 'Bräu',
        'Kï¿½r' : 'Kür',
        'Mï¿½r' : 'Mär',
        'hï¿½f' : 'häf',
        'lï¿½n' : 'lán',
        'gï¿½u' : 'gäu',
        'rï¿½n' : 'rän',
        'tï¿½c' : 'tüc'
    }, regex=True)

# Apply the function replacename to the beersmall['name'] column
beersmall['name'] = replacename(beersmall['name'])
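
Entries such as `&#40;`, `&#41;`, and `&quot;` are HTML entities, so an alternative to listing each one by hand is Python's standard `html.unescape`. The sketch below is not what the notebook does; `decode_entities` is a hypothetical helper and the beer name is made up. It does not repair the mis-encoded letters (such as `ï¿½`), which still need explicit replacements.

```python
import html

def decode_entities(value):
    """Decode HTML entities in a string; pass non-strings (e.g. NaN) through unchanged."""
    return html.unescape(value) if isinstance(value, str) else value

# Hypothetical beer name containing HTML entities
print(decode_entities("&quot;Hop Head&quot; &#40;Imperial&#41;"))
```

Applied to a column, this could look like `beersmall['name'].map(decode_entities)`.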

### Adjust Rating Scales
***
To put all the ratings on the same scale, we did the following:
+ The 'aroma' and 'taste' columns were halved and rounded to bring them down to a 5-point scale
+ The 'overall' rating column was recalculated as the mean of the other 4 ratings, which puts it on the same 5-point scale

In [None]:
beersmall['aroma'] = round(beersmall['aroma'] / 2)
beersmall['taste'] = round(beersmall['taste'] / 2)
beersmall['overall'] = (beersmall['appearance'] + beersmall['aroma'] + beersmall['palate'] + beersmall['taste']) / 4
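
One detail worth knowing about the rounding step: pandas rounds exact halves to the nearest even number, so halved odd ratings do not always round up. The sketch below uses toy ratings (hypothetical values) to show that behaviour, and also shows an alternative that keeps the half-points instead of rounding. The alternative is an illustration, not the notebook's method.

```python
import pandas as pd

# Toy ratings (hypothetical values)
ratings = pd.DataFrame({
    "appearance": [4, 3, 5],
    "aroma": [5, 7, 9],    # out of 10
    "palate": [4, 3, 4],
    "taste": [8, 6, 10],   # out of 10
})

halved = ratings["aroma"] / 2   # 2.5, 3.5, 4.5
rounded = halved.round()        # ties go to the nearest even number

# Alternative: keep half-points rather than rounding them away
rescaled = ratings.copy()
rescaled["aroma"] = ratings["aroma"] / 2
rescaled["taste"] = ratings["taste"] / 2
rescaled["overall"] = rescaled[["appearance", "aroma", "palate", "taste"]].mean(axis=1)

print(rounded.tolist())
print(rescaled["overall"].tolist())
```
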

### Break Down and Clean Text Reviews
*** 
Text reviews often contain spelling errors, extraneous words, and punctuation that we may not need when analyzing the information. For this reason we break down each text review into the following:
+ word count
+ character count
+ average word length
+ stop words
    + words that are generally considered useless when it comes to performing a sentiment analysis
+ number of upper case words
    + this may indicate moments of anger or rage

In [None]:
def avg_word_len(sentence):
    """Return the mean word length of a review, or 0 if it contains no words."""
    words = str(sentence).split()
    return sum(len(word) for word in words) / len(words) if words else 0

# Determine the word count for each text review: word_count
# split() with no argument ignores repeated whitespace, unlike split(" ")
beersmall['word_count'] = beersmall['text'].apply(lambda x: len(str(x).split()))

# Determine the character count for each text review (spaces included): char_count
beersmall['char_count'] = beersmall['text'].str.len()

# Determine the average word length per text review: avg_word_len
beersmall['avg_word_len'] = beersmall['text'].apply(avg_word_len)

# Determine the number of stop words per text review: stop_words
# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set gives fast membership tests
beersmall['stop_words'] = beersmall['text'].apply(lambda x: len([w for w in str(x).split() if w in stop]))

# Determine the number of uppercase words per text review: upper
beersmall['upper'] = beersmall['text'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
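
To make these features concrete, here is a small self-contained sketch that computes them for a single made-up review. The tiny `STOP_WORDS` set is an assumption standing in for NLTK's English list, and `text_features` is a hypothetical helper.

```python
# Assumption: a tiny stand-in for NLTK's English stop word list
STOP_WORDS = {"the", "a", "is", "and", "of", "this"}

def text_features(text, stop_words=STOP_WORDS):
    """Compute the per-review text features described above for one review."""
    words = str(text).split()
    n_words = len(words)
    return {
        "word_count": n_words,
        "char_count": len(str(text)),  # includes spaces
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "stop_words": sum(w in stop_words for w in words),  # case-sensitive match
        "upper": sum(w.isupper() for w in words),
    }

features = text_features("This IPA is AMAZING and hoppy")
print(features)
```

In this example "IPA" counts as an uppercase word alongside "AMAZING", so acronyms add noise to the uppercase count. "This" is also not counted as a stop word, because the comparison is case-sensitive and the stop word list is lowercase.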

Furthermore, we clean the text into a new column, "clean_text", by performing the following actions:
+ Change all characters to lowercase
+ Remove any special characters and punctuation
+ Remove any stop words
+ Remove the 10 most frequent and 10 least frequent words

In [None]:
# Create a new column clean_text where each text review is cleaned in preparation for sentiment analysis
# Change all characters in each review to lowercase.
beersmall['clean_text'] = beersmall['text'].apply(lambda x: " ".join(w.lower() for w in str(x).split()))

# Remove any special characters from each review (regex=True is required in recent pandas versions).
beersmall['clean_text'] = beersmall['clean_text'].str.replace(r'[^\w\s]', '', regex=True)

# Remove any stop words.
beersmall['clean_text'] = beersmall['clean_text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop))

# Determine the 10 most frequent words and 10 least frequent words and remove them from each review
word_counts = pd.Series(' '.join(beersmall['clean_text']).split()).value_counts()
freq = set(word_counts.index[:10])    # 10 most frequent words
freq2 = set(word_counts.index[-10:])  # 10 least frequent words
unwanted = freq | freq2
beersmall['clean_text'] = beersmall['clean_text'].apply(lambda x: " ".join(w for w in x.split() if w not in unwanted))
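
The same cleaning steps can be followed on a handful of strings without pandas. This is a minimal sketch using only the standard library. The stop word set and the reviews are made up, and only the single most frequent word is removed so the effect is visible on such a small sample.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for NLTK's English stop word list
STOP_WORDS = {"the", "a", "is", "and", "of", "this", "it"}

def clean_review(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    lowered = str(text).lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(w for w in no_punct.split() if w not in stop_words)

def drop_words(text, unwanted):
    """Remove every word that appears in the `unwanted` set."""
    return " ".join(w for w in text.split() if w not in unwanted)

# Hypothetical reviews
reviews = ["This beer is GREAT!!!", "Great beer, and a crisp finish.", "The beer: malty."]
cleaned = [clean_review(r) for r in reviews]

# Count words across all cleaned reviews and drop the single most frequent one
counts = Counter(" ".join(cleaned).split())
most_common = {word for word, _ in counts.most_common(1)}
trimmed = [drop_words(r, most_common) for r in cleaned]

print(cleaned)
print(trimmed)
```

Very frequent words (here "beer") appear in nearly every review and carry little signal. Note that the 10 least frequent words are an arbitrary pick whenever many words occur only once.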

### Extract One Beer Style for Training Set
***
Looking at the value counts, India Pale Ales (IPAs) have the most reviews within this chunk of the data, so we will use these reviews as the dataset for training and testing our models. To do so, we extract only the IPA rows from the beersmall DataFrame.

In [None]:
beer_ipa = beersmall[beersmall['style'] == 'India Pale Ale (IPA)'] 

We have 7,818 reviews for 101 different IPAs. We can increase this size by incorporating the rest of the reviews at a later time.
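
The value counts and the summary figures can be obtained along these lines. The sketch runs on toy data with hypothetical beer names; on the real data the same calls would be made against `beersmall`.

```python
import pandas as pd

# Toy data (hypothetical beer names)
toy = pd.DataFrame({
    "style": ["India Pale Ale (IPA)"] * 4 + ["Stout"] * 2,
    "name": ["Hop A", "Hop A", "Hop B", "Hop C", "Dark D", "Dark D"],
})

style_counts = toy["style"].value_counts()  # reviews per style, most reviewed first
top_style = style_counts.idxmax()           # label of the most reviewed style
subset = toy[toy["style"] == top_style]

n_reviews = len(subset)                     # number of reviews in the subset
n_beers = subset["name"].nunique()          # number of distinct beers in the subset
print(top_style, n_reviews, n_beers)
```
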

## Conclusion
***
To wrap up, data wrangling involved the following steps:
+ Took a subset of the entire data collected
+ Dropped the unused time column
+ Dropped specific beers and beer styles that had fewer than 5 reviews
+ Fixed character and spelling errors in beer names and styles
+ Brought all ratings onto the same 5-point scale
+ Broke down text reviews into separate columns by words, characters, etc. for further analysis
+ Cleaned each text review in preparation for sentiment analysis
+ Extracted one beer style, IPAs, as our training dataset


## Sources
***
Lipton, Zachary & Vikram, Sharad & McAuley, Julian. (2015). Capturing Meaning in Product Reviews with Character-Level Generative Text Models.
https://www.researchgate.net/publication/283761921_Capturing_Meaning_in_Product_Reviews_with_Character-Level_Generative_Text_Models