# Let the fun begin
**To cut a long story short**, the provided data is too small to significantly distinguish a beer's style by IBU or ABV.

## Import beers and breweries

In [1]:
#import libraries
import pandas as pd
import numpy as np

In [2]:
#import beers
beers = pd.read_csv('../input/beers.csv', index_col='name')

In [3]:
#import breweries
breweries = pd.read_csv('../input/breweries.csv', index_col='name')

# Can you predict the beer type from the characteristics provided in the dataset?

## Cleaning the beer dataset
To predict the beer type, I'm going to use abv and ibu.
<br>**ABV** stands for **alcohol by volume** and describes ... the dose of fairytale in the bottle. Seriously.
<br>**IBU** stands for **International Bitterness Units**, a scale that describes the amount of iso-alpha acids in the beer.

Drop NaNs

In [4]:
sbeers = beers[
    ['abv','ibu','style']
]
sbeers = sbeers.dropna()
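Before modelling, it may be worth checking how many rows survive `dropna()` and how many examples each style keeps. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the real `beers` table) to show the kind of check I have in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the beers table; values are made up for illustration.
toy = pd.DataFrame({
    'abv': [0.05, 0.066, np.nan, 0.09, 0.075, 0.045],
    'ibu': [np.nan, 40.0, 65.0, 110.0, 70.0, 19.0],
    'style': ['American Pale Lager', 'American Pale Ale (APA)', 'American IPA',
              'American Double / Imperial IPA', 'American IPA', 'Scottish Ale'],
})

# Same selection and cleaning steps as in the cell above.
subset = toy[['abv', 'ibu', 'style']].dropna()

# How many rows were lost, and how many samples remain per style.
rows_lost = len(toy) - len(subset)
per_style = subset['style'].value_counts()

print('rows lost:', rows_lost)
print(per_style)
```

On the real data, running `value_counts()` on `sbeers['style']` would show how many styles are left with only a handful of beers.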

Split the beers data into train and test

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sbeers[['ibu','abv']], sbeers[['style']], random_state=0)
from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)
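The scaler lines above are commented out. If scaling is wanted, one way to keep the fit-on-train, transform-on-test order is to put the scaler and the model in a single pipeline. This is a minimal sketch on synthetic data: the value ranges and the two-class label are assumptions made up for the example, not properties of the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: ibu roughly in 5..120, abv roughly in 0.03..0.10.
X = np.column_stack([rng.uniform(5, 120, 200), rng.uniform(0.03, 0.10, 200)])
# Hypothetical two-class label driven by ibu only.
y = np.where(X[:, 0] > 60, 'hoppy', 'mild')

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only,
# then reuses the fitted scaler when scoring the test split.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
scaled_train = model.named_steps['minmaxscaler'].transform(X_tr)
print('test accuracy: {:.2f}'.format(test_accuracy))
```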

## Using logistic regression

In [6]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', multi_class='ovr')
logreg.fit(X_train, np.ravel(y_train))
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.25
Accuracy of Logistic regression classifier on test set: 0.24
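An accuracy figure like this is easier to judge next to a trivial baseline that always predicts the most frequent style. The sketch below shows the idea with scikit-learn's `DummyClassifier`; the labels in `y_toy` are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: one style makes up 4 of the 10 samples.
y_toy = np.array(['IPA'] * 4 + ['APA'] * 3 + ['Stout'] * 2 + ['Porter'])
# The baseline ignores the features, so zeros are enough here.
X_toy = np.zeros((len(y_toy), 2))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the most frequent class.
baseline_accuracy = baseline.score(X_toy, y_toy)
print('baseline accuracy: {:.2f}'.format(baseline_accuracy))
```

Fitting the same baseline on `X_train`, `np.ravel(y_train)` and scoring it on the test split would give the number to compare the classifiers against.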


## Using random forests

In [7]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
fvals = le.fit_transform(np.ravel(sbeers[['style']]))
fvals = np.unique(fvals)

In [8]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1001)
rf.fit(X_train, le.transform(np.ravel(y_train)));

In [9]:
predictions = rf.predict(X_test)

Match each prediction to the closest label

In [10]:
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

In [11]:
for i in range (len(predictions)):
    predictions[i] = find_nearest(fvals, predictions[i])
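The loop above can also be written without Python-level iteration. The sketch below is one possible vectorized variant; `find_nearest_all` is a name I made up for the example, and the input values are illustrative.

```python
import numpy as np

def find_nearest_all(labels, values):
    """Return, for every entry in values, the closest entry in labels."""
    labels = np.asarray(labels)
    values = np.asarray(values)
    # Distance matrix: one row per value, one column per label.
    idx = np.abs(values[:, None] - labels[None, :]).argmin(axis=1)
    return labels[idx]

codes = np.arange(5)                           # encoded labels 0..4
raw = np.array([0.2, 1.6, 3.49, 7.0, -1.0])    # made-up regressor outputs
snapped = find_nearest_all(codes, raw)
print(snapped)
```

Values outside the label range are snapped to the nearest end of the range, which matches what `find_nearest` does one element at a time.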

In [12]:
predictions = predictions.astype(int)
predictions = le.inverse_transform(predictions)

In [13]:
foo = pd.DataFrame({'actual_values' : y_test['style'], 'predictions':predictions})
foo['comp'] = np.where(foo['actual_values']==foo['predictions'], 1, 0)
print('Accuracy of random forest regressor on test set: {:.2f}'
     .format((len(foo[foo['comp']==1])/len(foo))))

Accuracy of random forest regressor on test set: 0.11
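One caveat with the approach above: `LabelEncoder` assigns integer codes in sorted label order, so a regressor treats styles with neighbouring codes as numerically close even when they are unrelated. A possible alternative is `RandomForestClassifier`, which works on the class labels directly. The sketch below runs on synthetic data; the three style names and their centres are assumptions invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Three made-up styles with well separated (ibu, abv) centres.
centers = {'style_a': (20, 0.045), 'style_b': (60, 0.065), 'style_c': (100, 0.09)}
X_parts, y_parts = [], []
for name, (ibu, abv) in centers.items():
    X_parts.append(np.column_stack([rng.normal(ibu, 5, 60),
                                    rng.normal(abv, 0.004, 60)]))
    y_parts.extend([name] * 60)
X = np.vstack(X_parts)
y = np.array(y_parts)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# The classifier votes on labels, so no label ordering is implied.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print('accuracy: {:.2f}'.format(acc))
```

The high score here only reflects how cleanly the synthetic clusters are separated; it says nothing about what the classifier would reach on the real beers.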


## Building a linear scale

The idea was to check whether naively sorting the beers by ibu or abv is enough to reveal cutoff levels between styles, that is, to build a kind of scale.

In [14]:
tr_sc = pd.concat([X_train,y_train], axis=1)
tr_sc = tr_sc.sort_values(by='abv')

As can be seen below, there is no clear relationship between abv and style, so a simple scale cannot cleanly separate beer styles.

In [15]:
tr_sc.tail(20)

name,ibu,abv,style
Abrasive Ale,120.0,0.097,American Double / Imperial IPA
Bad Axe Imperial IPA,76.0,0.098,American Double / Imperial IPA
Hopkick Dropkick,115.0,0.099,American Double / Imperial IPA
Chaotic Double IPA,93.0,0.099,American Double / Imperial IPA
Upslope Imperial India Pale Ale,90.0,0.099,American Double / Imperial IPA
Ten Fidy Imperial Stout,98.0,0.099,Russian Imperial Stout
Elevation Triple India Pale Ale,100.0,0.099,American Double / Imperial IPA
Bourbon Barrel Aged Timmie,75.0,0.099,Russian Imperial Stout
Johan the Barleywine,60.0,0.099,English Barleywine
Ten Fidy,98.0,0.099,Russian Imperial Stout
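The overlap can also be summarised per style instead of read off the sorted table. The sketch below uses a few rows copied from the output above and computes the abv range of each style; on the full data the same `groupby` could be applied to `tr_sc`.

```python
import pandas as pd

# A few rows copied from the table above.
rows = pd.DataFrame({
    'abv': [0.097, 0.098, 0.099, 0.099, 0.099, 0.099],
    'style': ['American Double / Imperial IPA',
              'American Double / Imperial IPA',
              'American Double / Imperial IPA',
              'Russian Imperial Stout',
              'English Barleywine',
              'Russian Imperial Stout'],
}, index=pd.Index(['Abrasive Ale', 'Bad Axe Imperial IPA', 'Hopkick Dropkick',
                   'Ten Fidy Imperial Stout', 'Johan the Barleywine',
                   'Bourbon Barrel Aged Timmie'], name='name'))

# Minimum, maximum and count of abv per style.
ranges = rows.groupby('style')['abv'].agg(['min', 'max', 'count'])

# Number of distinct styles sharing the highest abv value.
styles_at_max = rows.loc[rows['abv'] == rows['abv'].max(), 'style'].nunique()

print(ranges)
print('styles at the top abv value:', styles_at_max)
```

Several styles sharing the same abv value is exactly what prevents a single cutoff from separating them.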


## Using SVM

In [16]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='rbf')
#svclassifier = SVC(kernel='Gaussian')
svclassifier.fit(X_train, np.ravel(y_train));



In [17]:
y_pred = svclassifier.predict(X_test)

In [18]:
foo = pd.DataFrame({'actual_values' : y_test['style'], 'predictions':y_pred})
foo['comp'] = np.where(foo['actual_values']==foo['predictions'], 1, 0)
print('Accuracy of SVM classifier on test set: {:.2f}'
     .format((len(foo[foo['comp']==1])/len(foo))))

Accuracy of SVM classifier on test set: 0.25
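Two things might be worth trying here. The RBF kernel depends on distances between points, and ibu spans tens of units while abv spans hundredths, so the unscaled distance is dominated by ibu. Also, with a small dataset a single split gives a noisy estimate. The sketch below combines standardisation with 5-fold cross-validation on synthetic data; the two classes and their distributions are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two made-up classes with different ibu and abv distributions.
ibu = np.concatenate([rng.normal(30, 8, 50), rng.normal(80, 8, 50)])
abv = np.concatenate([rng.normal(0.05, 0.005, 50), rng.normal(0.07, 0.005, 50)])
X = np.column_stack([ibu, abv])
y = np.array(['mild'] * 50 + ['hoppy'] * 50)

# Standardise both features, then fit an RBF SVM; the scaler is refit per fold.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_val_score(svm, X, y, cv=5)

print('fold accuracies:', np.round(scores, 2))
print('mean accuracy: {:.2f}'.format(scores.mean()))
```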


# What is the most popular beer in North Dakota?



In [19]:
beers.head()

name,Unnamed: 0,abv,ibu,id,style,brewery_id,ounces
Pub Beer,0,0.05,,1436,American Pale Lager,408,12.0
Devil's Cup,1,0.066,,2265,American Pale Ale (APA),177,12.0
Rise of the Phoenix,2,0.071,,2264,American IPA,177,12.0
Sinister,3,0.09,,2263,American Double / Imperial IPA,177,12.0
Sex and Candy,4,0.075,,2262,American IPA,177,12.0


In [20]:
breweries = pd.read_csv('../input/breweries.csv')
breweries = breweries.rename(columns ={'Unnamed: 0':'brewery_id'})

In [21]:
cdata = beers.merge(breweries, on='brewery_id')
cdata['state'] = cdata['state'].str.strip()

In [22]:
cdata[cdata['state']=="ND"]

,Unnamed: 0,abv,ibu,id,style,brewery_id,ounces,name,city,state
771,771,0.05,32.0,1722,American Pale Ale (APA),335,12.0,Fargo Brewing Company,Fargo,ND
772,772,0.045,19.0,1435,Scottish Ale,335,12.0,Fargo Brewing Company,Fargo,ND
773,773,0.067,70.0,1434,American IPA,335,12.0,Fargo Brewing Company,Fargo,ND
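Note that the `name` column in the output above is the brewery name: `beers` is indexed by beer name, and a column-on-column merge ignores the index, so the beer names are lost. One way to keep both is to reset the index before merging and pass `suffixes`. The sketch below uses toy frames; the beer names are hypothetical placeholders, since the real ones are not shown in the output.

```python
import pandas as pd

# Toy beers frame indexed by name, like `beers`; beer names are hypothetical.
beers_toy = pd.DataFrame(
    {'abv': [0.05, 0.045], 'brewery_id': [335, 335]},
    index=pd.Index(['Hypothetical Pale Ale', 'Hypothetical Scottish Ale'],
                   name='name'),
)
# Toy breweries frame; note the leading space in the state value.
breweries_toy = pd.DataFrame({
    'brewery_id': [335],
    'name': ['Fargo Brewing Company'],
    'city': ['Fargo'],
    'state': [' ND'],
})

# reset_index() turns the beer name into a column; suffixes keep both names apart.
merged = beers_toy.reset_index().merge(
    breweries_toy, on='brewery_id', suffixes=('_beer', '_brewery')
)
merged['state'] = merged['state'].str.strip()

nd = merged[merged['state'] == 'ND']
print(nd[['name_beer', 'name_brewery', 'abv']])
```

The columns shown in this notebook carry no sales or rating figures, so this only lists the beers brewed in North Dakota.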
