# Predicting Beer Reviews Data Wrangling
***

This data is a portion of a csv file filled with beer reviews from RateBeer.com. This is my attempt at cleaning the dataset to use later.   

### Import in the Data
***

In [2]:
# Import necessary libraries
import pandas as pd
import csv
from glob import glob
import numpy as np

In [3]:
beersmall = pd.read_excel('ratebeer_sample.xlsx')

In [4]:
beersmall.head()

Unnamed: 0,name,beerId,brewerId,ABV,style,appearance,aroma,palate,taste,overall,time,profileName,text
0,John Harvards Simcoe IPA,63836,8481,5.4,India Pale Ale &#40;IPA&#41;,4,6,3,6,13,1157587200,hopdog,"On tap at the Springfield, PA location. Poured..."
1,John Harvards Simcoe IPA,63836,8481,5.4,India Pale Ale &#40;IPA&#41;,4,6,4,7,13,1157241600,TomDecapolis,On tap at the John Harvards in Springfield PA....
2,John Harvards Cristal Pilsner,71716,8481,5.0,Bohemian Pilsener,4,5,3,6,14,958694400,PhillyBeer2112,"Springfield, PA. I've never had the Budvar Cri..."
3,John Harvards Fancy Lawnmower Beer,64125,8481,5.4,K•À_lsch,2,4,2,4,8,1157587200,TomDecapolis,On tap the Springfield PA location billed as t...
4,John Harvards Fancy Lawnmower Beer,64125,8481,5.4,K•À_lsch,2,4,2,4,8,1157587200,hopdog,"On tap at the Springfield, PA location. Poured..."


In [5]:
beersmall.columns

Index(['name', 'beerId', 'brewerId', 'ABV', 'style', 'appearance', 'aroma',
       'palate', 'taste', 'overall', 'time', 'profileName', 'text'],
      dtype='object')

### Examining and Cleaning the DataFrame 
***
Examining the first few columns of the dataframe you can notice a few things
+ Special characters and notation weren't read properly. We need to fix this somehow.
+ There is a time column that if we would want to use needs to be altered. In this case we don't need it, so we will drop it.
+ Beers have a range of reviews from 1 to x amount. We want beers to have at least 5 reviews so we need to remove the ones that have less
+ Some beers don't have an alcohol percentage(ABV)
+ Index needs to be redone.

In [6]:
# Drop Time Column
beersmall = beersmall.drop(['time'], axis=1)

In [7]:
# Drop any beers that hav less than 5 reviews
id_count = beersmall.groupby('style')['style'].transform(len)
mask = id_count > 5
beersmall = beersmall[mask]

In [8]:
# Check for Beer Styles where the name needs to be altered.
beersmall['style'].unique()

array(['India Pale Ale &#40;IPA&#41;', 'Bohemian Pilsener', 'K•À_lsch',
       'Sweet Stout', 'Brown Ale', 'Belgian Ale', 'Abbey Tripel',
       'Belgian White &#40;Witbier&#41;', 'Mild Ale', 'Pale Lager',
       'Imperial/Double IPA', 'Sour Ale/Wild Ale', 'Traditional Ale',
       'Heller Bock', 'Porter', 'Bitter', 'Spice/Herb/Vegetable',
       'Imperial Stout', 'Belgian Strong Ale', 'Golden Ale/Blond Ale',
       'Scottish Ale', 'Stout', 'Scotch Ale', 'Abbey Dubbel', 'Saison',
       'Dunkel', 'American Pale Ale', 'Altbier', 'Wheat Ale',
       'Abt/Quadrupel', 'Oktoberfest/M•À_rzen', 'Baltic Porter',
       'Premium Lager', 'Imperial/Strong Porter', 'Smoked', 'Fruit Beer',
       'Amber Ale', 'English Pale Ale', 'Pilsener', 'German Hefeweizen',
       'Premium Bitter/ESB', 'Cream Ale', 'California Common', 'Vienna',
       'Barley Wine', 'Doppelbock', 'Sak•À_ - Ginjo',
       'American Strong Ale', 'Dunkler Bock', 'Black IPA',
       'Strong Pale Lager/Imperial Pils', 'Irish Ale', 

In [9]:
# Function to replace character errors in the style column
def replacestyle(col):
    # Replace each value that matches the left side of the pair
    return col.replace({
        'K•À_lsch': 'Kölsch',
        '&#40;' : '(',
        '&#41;' : ')',
        'M•À_rzen' : 'Märzen',
        'Sak•À_' : 'Sake',
        'Bi•À_re de Garde' : 'Bière de Garde'
    }, regex=True)

# Apply the function replacestyle to the beer['style'] column
beersmall['style'] = replacestyle(beersmall['style'])

In [15]:
types_of_beer = beersmall['style'].unique()

In [24]:
len(types_of_beer)

77

In [None]:
†how 

In [None]:
# Function to replace character errors in the name column 
def replacename(col):
    # Replace each value that matches the left side of the pair
    return col.replace({
        '&quot;' : '"', 
        '&#40;' : '(',
        '&#41;' : ')',
        'Brï¿½u' : 'Bräu',
        'Kï¿½r' : 'Kür',
        'Mï¿½r' : 'Mär',
        'hï¿½f' : 'häf',
        'lï¿½n' : 'lán',
        'gï¿½u' : 'gäu',
        'rï¿½n' : 'rän',
        'tï¿½c' : 'tüc'
    }, regex= True)

# Apply the function replacename to the beer.name column
beer.name = replacename(beer.name)