### How To Choose The Perfect Beer?

This competition is well suited to people who have a beginner-level understanding of machine learning and data science concepts and who want to polish their skills and see how they stand against a larger community.

Data scientists take their beer very seriously. Recommendations from friends? No thank you. Websites? Too many pop-ups. Ads? Yeah, as if. They can trust only true solid numbers. Here’s a fun fact: Last year, Indians drank a total of 4.7 million litres of beer and the number is expected to go up to 6.5 billion litres by 2022.

Newer brands are also entering the market — take Delhi-based startup Bira 91, for example. With the $50 million they have raised, Bira plans to flood India with more beer and fill in that gap between traditional inexpensive brands and expensive ones.

So how will data scientists choose their beer? Will they look at the particular combination of barley, water, hops and yeast that went into it? There are many things to consider, since a series of complex biochemical reactions needs to take place to make the perfect beer.

That’s why, here at MachineHack, we have entrusted this very important job to the most trustworthy people in the world (especially when it comes to beer): you, the data scientists.

#### Data

The train and test data consist of various features that describe a beer. In many beer cellars, important factors such as temperature and humidity are maintained by a climate control system, so features like Cellar Temperature and Serving Temperature become really important. This is an actual data set that was curated over months of primary and secondary research by our team. Each row contains a fixed-size set of fields: nine features plus the target, Score. Each field can be accessed by its name.

#### Features

*   ABV – Alcohol By Volume
*   Brewing Company
*   Food Pairing – Perfect food to have with this beer (the column is spelled `Food Paring` in the data file)
*   Glassware Used – Perfect glassware to use to enjoy this beer
*   Beer Name – Name of the beer
*   Ratings
*   Score (Predict) – Overall score of the beer
*   Style Name – Style in which the beer is prepared
*   Cellar Temperature
*   Serving Temperature

#### Problem Statement

With the given nine features (categorical and continuous), build a model to predict the score of the beer.

#### Evaluation

**Goal**: It is your job to predict the score for each beer. For each beer in the test set, you must predict the score variable.

**Metric**: Submissions are evaluated on the Root-Mean-Squared Error (RMSE) between the predicted and observed score values. The final score is calculated as follows:

1. X = sigmoid of the RMSE, which squashes the RMSE into the range between 0 and 1
2. Score = 1 - X. Hence, the lower the RMSE, the better your score.
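
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that "sigmoid" means the standard logistic function; the name `leaderboard_score` is hypothetical and is not part of any MachineHack tooling.

```python
import numpy as np

def leaderboard_score(y_true, y_pred):
    """Return 1 - sigmoid(RMSE), following the two steps listed above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root-mean-squared error
    squashed = 1.0 / (1.0 + np.exp(-rmse))            # step 1: logistic sigmoid (assumed)
    return 1.0 - squashed                             # step 2: lower RMSE -> higher score

# Perfect predictions (RMSE = 0) versus predictions that are off by 1 everywhere (RMSE = 1)
perfect = leaderboard_score([3.5, 4.0], [3.5, 4.0])
worse = leaderboard_score([3.5, 4.0], [2.5, 5.0])
```

Because the RMSE is never negative, the logistic sigmoid of it is at least 0.5, so under this assumption the best attainable score is 0.5 (for a perfect prediction) and the score falls toward 0 as the error grows.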

**Submission File Format**: Please do not change the format of the test file when submitting. Just fill in the Score column without touching any other data in the file.

In [1]:
import pandas as pd                 # pandas is a dataframe library
import matplotlib.pyplot as plt     # matplotlib.pyplot plots data
import numpy as np                  # numpy provides N-dim object support

# do plotting inline instead of in a separate window
%matplotlib inline



In [2]:
df = pd.read_csv("data/Beer Train Data Set.csv")  # load the beer training data. Adjust path as necessary

df.shape

(185643, 10)

In [3]:
df.dtypes

ABV                    float64
Brewing Company          int64
Food Paring             object
Glassware Used          object
Beer Name                int64
Ratings                 object
Style Name              object
Cellar Temperature      object
Serving Temperature     object
Score                  float64
dtype: object

In [4]:
df.head(5)

Unnamed: 0,ABV,Brewing Company,Food Paring,Glassware Used,Beer Name,Ratings,Style Name,Cellar Temperature,Serving Temperature,Score
0,6.5,8929,"(Curried,Thai)Cheese(pepperyMontereyPepperJack...","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",15121,22,AmericanIPA,40-45,45-50,3.28
1,5.5,13187,"(PanAsian)Cheese(earthyCamembert,Fontina,nutty...","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",59817,1,AmericanPaleAle(APA),35-40,40-45,3.52
2,8.1,6834,"Meat(Pork,Poultry)","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",32669,3,IrishRedAle,35-40,40-45,4.01
3,,11688,"(Indian,LatinAmerican,PanAsian)General(Aperitif)","PintGlass(orBecker,Nonic,Tumbler),PilsenerGlas...",130798,0,AmericanMaltLiquor,35-40,35-40,0.0
4,6.0,10417,"Meat(Poultry,Fish,Shellfish)",PilsenerGlass(orPokal),124087,1,EuroPaleLager,35-40,40-45,2.73


In [5]:
df.tail(5)

Unnamed: 0,ABV,Brewing Company,Food Paring,Glassware Used,Beer Name,Ratings,Style Name,Cellar Temperature,Serving Temperature,Score
185638,4.5,9105,"(Dessert,Aperitif,Digestive)","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",141522,0,HerbedSpicedBeer,,45-50,0.0
185639,4.5,3348,"(Barbecue,Italian)Cheese(earthyCamembert,Fonti...",PilsenerGlass(orPokal),85557,1,AmericanPaleLager,35-40,40-45,4.19
185640,,8216,"Cheese(earthyCamembert,Fontina,nuttyAsiago,Col...","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",105072,1,EnglishBrownAle,40-45,45-50,3.11
185641,6.2,1755,"(Curried,Thai)Cheese(pepperyMontereyPepperJack...","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",70788,2,AmericanIPA,40-45,45-50,3.4
185642,6.4,4341,"(Curried,Thai)Cheese(pepperyMontereyPepperJack...","PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel...",149979,1,AmericanIPA,40-45,45-50,4.31


In [6]:
df.isnull().values.any()

True

In [7]:
df.describe()

Unnamed: 0,ABV,Brewing Company,Beer Name,Score
count,170513.0,185643.0,185643.0,185643.0
mean,6.354961,7008.757659,83738.220111,3.198432
std,1.907205,3914.168053,48520.065146,1.358862
min,0.01,0.0,0.0,0.0
25%,5.0,3825.0,41232.5,3.27
50%,6.0,7111.0,83335.0,3.71
75%,7.2,10402.0,125148.5,3.97
max,80.0,13541.0,168534.0,5.0


In [25]:
df['Ratings'] = df['Ratings'].astype('category')
df['Ratings'].head(5)

0    22
1     1
2     3
3     0
4     1
Name: Ratings, dtype: category
Categories (1824, object): [0, 1, 1,000, 1,001, ..., 995, 996, 997, 999]
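
The category listing above includes values such as `1,000`, which suggests that `Ratings` is stored as text with thousands separators. If a numeric count is preferred over 1824 separate categories, one possible approach is to strip the commas and convert. This is only a sketch on a small hand-made Series (`ratings_raw` is hypothetical sample data, not the real column), and it is an alternative to the categorical cast used in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical sample mimicking the text format seen in the Ratings column
ratings_raw = pd.Series(["22", "1", "1,000", "1,001", "0"])

# Remove the thousands separator, then parse the remaining digits as numbers
ratings_num = pd.to_numeric(ratings_raw.str.replace(",", "", regex=False))
```

On the real data the same two calls would be applied to `df['Ratings']` before any categorical conversion, assuming every value follows this digits-and-commas pattern.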

In [9]:

df['Cellar Temperature'] = df['Cellar Temperature'].astype('category')
df['Cellar Temperature'].head(5)

0    40-45
1    35-40
2    35-40
3    35-40
4    35-40
Name: Cellar Temperature, dtype: category
Categories (3, object): [35-40, 40-45, 45-50]

In [10]:
df['Serving Temperature'] = df['Serving Temperature'].astype('category')
df['Serving Temperature'].head(5)

0    45-50
1    40-45
2    40-45
3    35-40
4    40-45
Name: Serving Temperature, dtype: category
Categories (4, object): [35-40, 40-45, 45-50, 50-55]
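
The temperature columns hold ranges such as `40-45`. Treating them as categories, as done above, is one option; another is to turn each range into a number, for example its midpoint, so that the ordering between ranges is preserved. The sketch below assumes every non-missing value has the form `low-high`; `range_midpoint` is a hypothetical helper and the sample Series is made up for illustration.

```python
import pandas as pd

def range_midpoint(series):
    """Convert strings like '40-45' to their numeric midpoint; missing stays NaN."""
    bounds = series.str.extract(r"(\d+)-(\d+)").astype(float)  # columns 0 (low) and 1 (high)
    return (bounds[0] + bounds[1]) / 2

# Hypothetical sample, including one missing value
temps = pd.Series(["40-45", "35-40", None, "50-55"])
midpoints = range_midpoint(temps)
```

Missing entries remain `NaN` after the conversion, so they still need to be handled separately before modelling.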

In [17]:
print("# rows in dataframe : {0}".format(len(df)))
print("# rows missing ABV : {0}".format(df['ABV'].isnull().sum()))
print("# rows missing Brewing Company : {0}".format(df['Brewing Company'].isnull().sum()))
print("# rows missing Food Paring : {0}".format((df['Food Paring'] == '').sum()))
print("# rows missing Glassware Used : {0}".format((df['Glassware Used'] == '').sum()))
print("# rows missing Beer Name : {0}".format(df['Beer Name'].isnull().sum()))
print("# rows missing Ratings : {0}".format((df['Ratings'] == '').sum()))
print("# rows missing Style Name : {0}".format((df['Style Name'] == '').sum()))
print("# rows missing Cellar Temperature : {0}".format(df['Cellar Temperature'].isnull().sum()))
print("# rows missing Serving Temperature : {0}".format(df['Serving Temperature'].isnull().sum()))

# rows in dataframe : 185643
# rows missing ABV : 15130
# rows missing Brewing Company : 0
# rows missing Food Paring : 0
# rows missing Glassware Used : 0
# rows missing Beer Name : 0
# rows missing Ratings : 0
# rows missing Style Name : 0
# rows missing Cellar Temperature : 6781
# rows missing Serving Temperature : 193
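
Note that the text columns above are checked for empty strings, while the numeric and temperature columns are checked for `NaN`. These are different kinds of "missing", and a text column can contain either. A minimal sketch of counting both, on a small made-up frame (`sample` is hypothetical and is not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN per column and one empty string
sample = pd.DataFrame({
    "ABV": [6.5, np.nan, 8.1],
    "Style Name": ["AmericanIPA", "", None],
})

nan_counts = sample.isnull().sum()                                     # NaN / None per column
blank_counts = (sample.select_dtypes(include="object") == "").sum()   # empty strings, text columns only
```

Running both counts on the real `df` would show whether the zero counts reported for the text columns also hold for `NaN` values.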



In [None]:
def plot_corr(df, size=10):
    """
    Function plots a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot

    Displays:
        matrix of correlation between the numeric columns, colored with matplotlib's default colormap.
        Correlation values range from -1 to 1.
        Expect a line of maximum correlation (1) running from top left to bottom right.
    """

    corr = df.select_dtypes(include='number').corr()    # correlation of numeric columns only
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)   # color code the rectangles by correlation value
    plt.xticks(range(len(corr.columns)), corr.columns)  # draw x tick marks
    plt.yticks(range(len(corr.columns)), corr.columns)  # draw y tick marks

In [None]:
plot_corr(df)

In [None]:
df.select_dtypes(include='number').corr()   # text and categorical columns are excluded

In [None]:
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in scikit-learn 0.20

featured_col_names = ['ABV', 'Brewing Company', 'Beer Name', 'Ratings', 'Style Name', 'Cellar Temperature', 'Serving Temperature']
predicted_class_name = ['Score']

x = df[featured_col_names].values               # predictor feature columns (m x 7)
y = df[predicted_class_name].values             # target column (m x 1): the continuous Score

split_test_size = 0.30

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=split_test_size, random_state=42)
        # test_size = 0.30 holds out 30% of the rows; random_state = 42 seeds the shuffle so the split is reproducible

In [None]:
print("{0:0.2f}% in Training set".format((len(x_train)/len(df.index))*100))
print("{0:0.2f}% in Test set".format((len(x_test)/len(df.index))*100))