# Workbook : Machine Learning

For our last section workbook (so that next week you can ask questions about and work on your final projects in section), we're going to work with a dataset all about craft beer. We'll try to predict each beer's style from its characteristics.

**Disclaimer**: Working with data about beer does *NOT* mean that I'm encouraging the drinking of beer by students. In fact, your professor doesn't even like beer (blech). Specifically, individuals under the age of 21 are not legally allowed to consume alcoholic beverages, but lucky for you all, that doesn't stop us from working with data on the topic!

The data we'll use here come from a publicly-available [Kaggle dataset on craft beer](https://www.kaggle.com/nickhould/craft-cans).

# Part I : Data, Wrangling, & EDA

To get started, you'll need to **import the following**:
   * `pandas`
   * `numpy`
   * `SVC` from `sklearn.svm`
   * `confusion_matrix`, `classification_report`, `precision_recall_fscore_support` from `sklearn.metrics`

In [1]:
## YOUR CODE HERE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn imports
#   SVM (Support Vector Machine) classifier
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support

Now that you're set up in Python, **read in the `breweries.csv` file from the `data` directory. Assign this to the variable `breweries`**. Then, **read in the file `beers.csv` from the `data` directory. Assign this to the variable `beers`.**

In [2]:
## YOUR CODE HERE
breweries = pd.read_csv('../data/breweries.csv')
beers = pd.read_csv('../data/beers.csv')

Take a **look at the first few rows of each dataset** to give yourself an idea of what data are included in each. Notice whether the two datasets have any columns in common.

In [3]:
## YOUR CODE HERE
print(breweries.head())
print(beers.head())

   Unnamed: 0                       name           city state
0           0         NorthGate Brewing     Minneapolis    MN
1           1  Against the Grain Brewery     Louisville    KY
2           2   Jack's Abby Craft Lagers     Framingham    MA
3           3  Mike Hess Brewing Company      San Diego    CA
4           4    Fort Point Beer Company  San Francisco    CA
   Unnamed: 0    abv  ibu    id                 name  \
0           0  0.050  NaN  1436             Pub Beer   
1           1  0.066  NaN  2265          Devil's Cup   
2           2  0.071  NaN  2264  Rise of the Phoenix   
3           3  0.090  NaN  2263             Sinister   
4           4  0.075  NaN  2262        Sex and Candy   

                            style  brewery_id  ounces  
0             American Pale Lager         408    12.0  
1         American Pale Ale (APA)         177    12.0  
2                    American IPA         177    12.0  
3  American Double / Imperial IPA         177    12.0  
4          

To start to get a handle on what's going on in these data, **print out the number of missing values in each column of the `beers` DataFrame.**

In [4]:
## YOUR CODE HERE
beers.isnull().sum(axis = 0)

Unnamed: 0       0
abv             62
ibu           1005
id               0
name             0
style            5
brewery_id       0
ounces           0
dtype: int64

We're going to try to predict the `style` of beer from its alcohol by volume (`abv`) and its International Bitterness Units (`ibu`). To do this, **remove any beers from our `beers` dataset where data are missing for any of these three variables. Do this in place.** Note that removing samples from your dataset will not always be appropriate, but for this example, it's a reasonable approach.

In [5]:
## YOUR CODE HERE
beers.dropna(subset=['style','abv','ibu'], inplace=True)
beers.isnull().sum(axis = 0)

Unnamed: 0    0
abv           0
ibu           0
id            0
name          0
style         0
brewery_id    0
ounces        0
dtype: int64
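As a rough illustration of what `dropna(subset=...)` does, here is a minimal sketch on a toy DataFrame. The values are made up and are not from the beer data. Only rows missing one of the listed columns are dropped, so a missing value in another column (here `ounces`) is left alone.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for the beers table
toy = pd.DataFrame({
    "abv":    [0.05, 0.06, np.nan, 0.07, 0.06],
    "ibu":    [40.0, 35.0, 20.0, np.nan, 60.0],
    "style":  ["American IPA", "Witbier", "American IPA", "American IPA", None],
    "ounces": [12.0, np.nan, 12.0, 12.0, 16.0],
})

# Keep only rows where all three modeling columns are present.
# Row 1 is missing `ounces` only, so it is kept.
toy_clean = toy.dropna(subset=["style", "abv", "ibu"])

print(toy_clean.shape)
print(toy_clean.isnull().sum())
```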

Check to see how many entries remain in your `beers` dataset now.

In [6]:
## YOUR CODE HERE
print(beers.shape)

(1403, 8)


In [7]:
assert beers.shape == (1403, 8)

Using the `beers` dataset you've now got, **merge `beers` and `breweries` together using a left join. Assign this to the variable `beer`. Look at the first few rows of `beer`.**

In [8]:
## YOUR CODE HERE
# No `on=` is given, so pandas joins on every column name the two tables share
beer = pd.merge(beers, breweries, how="left")
beer.head()

Unnamed: 0.1,Unnamed: 0,abv,ibu,id,name,style,brewery_id,ounces,city,state
0,14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0,,
1,21,0.099,92.0,1036,Lower De Boom,American Barleywine,368,8.4,,
2,22,0.079,45.0,1024,Fireside Chat,Winter Warmer,368,12.0,,
3,24,0.044,42.0,876,Bitter American,American Pale Ale (APA),368,12.0,,
4,25,0.049,17.0,802,Hell or High Watermelon Wheat (2009),Fruit / Vegetable Beer,368,12.0,,
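In the output above, `city` and `state` are empty for every row shown. A likely explanation is that, with no `on` argument, `pd.merge` joins on all shared column names (here `Unnamed: 0` and `name`), and a beer's name rarely equals a brewery's name. The sketch below contrasts that default with a keyed merge on toy tables. It assumes that `brewery_id` in `beers` refers to the `Unnamed: 0` row index of `breweries`, which the workbook does not state. The rows and the `brewery_name` column are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) with the same column layout as the real files
breweries_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "name": ["Brewery A", "Brewery B"],
    "city": ["Minneapolis", "Louisville"],
    "state": ["MN", "KY"],
})
beers_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "name": ["Beer X", "Beer Y", "Beer Z"],
    "brewery_id": [1, 0, 1],
})

# Default merge: joins on the shared columns ("Unnamed: 0" and "name"),
# so no rows match and city/state come back as NaN
default_merge = pd.merge(beers_toy, breweries_toy, how="left")

# Keyed merge: ASSUMES brewery_id points at the breweries row index
breweries_keyed = breweries_toy.rename(
    columns={"Unnamed: 0": "brewery_id", "name": "brewery_name"}
)
keyed_merge = pd.merge(beers_toy, breweries_keyed, on="brewery_id", how="left")

print(default_merge["city"].isnull().all())
print(keyed_merge[["name", "brewery_name", "city", "state"]])
```

If you try a keyed merge on the real data, the resulting column count will differ from the shapes shown later in this workbook.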


**Use the `describe` method to describe the quantitative variables in your `beer` dataset.**

In [9]:
## YOUR CODE HERE
beer.describe()

Unnamed: 0.1,Unnamed: 0,abv,ibu,id,brewery_id,ounces
count,1403.0,1403.0,1403.0,1403.0,1403.0,1403.0
mean,1241.128297,0.059919,42.739843,1413.88881,223.375624,13.510264
std,691.675612,0.013585,25.962692,757.572191,150.38751,2.254112
min,14.0,0.027,4.0,1.0,0.0,8.4
25%,681.5,0.05,21.0,771.0,95.5,12.0
50%,1228.0,0.057,35.0,1435.0,198.0,12.0
75%,1864.5,0.068,64.0,2068.5,350.0,16.0
max,2408.0,0.125,138.0,2692.0,546.0,32.0


**Be sure to look at the output from what you just ran. What do you learn? Do any values surprise you? Are there any with really big standard deviations? Does this make sense?**

Now, let's take a look and **see how many different styles of beer we have in our dataset.** The `value_counts` method may help you accomplish this.

In [10]:
## YOUR CODE HERE
beer['style'].value_counts()

American IPA                           301
American Pale Ale (APA)                153
American Amber / Red Ale                77
American Double / Imperial IPA          75
American Pale Wheat Ale                 61
American Blonde Ale                     61
American Porter                         39
American Brown Ale                      38
Fruit / Vegetable Beer                  30
Kölsch                                  27
Hefeweizen                              27
Witbier                                 24
Saison / Farmhouse Ale                  23
Märzen / Oktoberfest                    21
American Black Ale                      20
Cream Ale                               18
German Pilsener                         17
American Stout                          16
Czech Pilsener                          16
American Amber / Red Lager              16
American Pale Lager                     16
Vienna Lager                            14
Extra Special / Strong Bitter (ESB)     14
American Pi

Because time in section is limited, let's just try to predict the three most common styles of beer. **Filter your `beer` dataset to only include entries from the three most common styles. Be sure to determine how many beers are now included in your dataset.**

In [11]:
## YOUR CODE HERE
# The three most common styles from the value_counts output above
styles = ['American IPA', 'American Pale Ale (APA)', 'American Amber / Red Ale']
beer = beer[beer['style'].isin(styles)]
beer.shape

(531, 10)
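Typing the style names by hand is easy to get subtly wrong, since a stray space is enough to stop a name from matching. One alternative is to pull the top three labels straight from `value_counts`. This is only a sketch on a toy column with made-up counts, not the workbook's required solution.

```python
import pandas as pd

# Toy style column (hypothetical counts) standing in for beer['style']
toy = pd.DataFrame({
    "style": ["IPA"] * 5 + ["APA"] * 4 + ["Amber"] * 3 + ["Stout"] * 2 + ["Porter"]
})

# value_counts sorts by frequency (descending by default),
# so the first three index labels are the three most common styles
top_styles = toy["style"].value_counts().index[:3].tolist()
toy_top = toy[toy["style"].isin(top_styles)]

print(top_styles)
print(toy_top["style"].nunique(), toy_top.shape)
```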

# Part II : Prediction Model

Let's start to build our model! To do so, **create a variable `num_training` that holds the number of samples corresponding to 80% of the total samples in our `beer` dataset. Be sure that this is an integer. Also, create a variable `num_testing` holding the number corresponding to the remaining 20% of our total samples.**

In [12]:
## YOUR CODE HERE
num_training = int(len(beer)*0.8)
num_testing = len(beer)-num_training

In [13]:
assert num_training == 424
assert num_testing == 107

To model these data, **split your data into `beer_X`, which includes the `abv` and `ibu` columns from `beer` (predictors). This should be a `pandas` DataFrame. The outcome variable will be `style`. Assign the outcome variable to the variable `beer_Y`. This should be a `numpy` array.**

In [14]:
## YOUR CODE HERE
beer_X = beer[['abv','ibu']]
beer_Y = np.array(beer['style'])

Before running our model, we'll need to **split our data into a training and test set. Use `num_training` (created above) to extract the following variables**: 
* from `beer_X`, generate : `beer_train_X`, `beer_test_X`
* from `beer_Y`, generate: `beer_train_Y`, `beer_test_Y`

In [15]:
## YOUR CODE HERE
beer_train_X = beer_X[:num_training]
beer_train_Y = beer_Y[:num_training]
beer_test_X = beer_X[num_training:]
beer_test_Y = beer_Y[num_training:]

In [16]:
assert len(beer_train_X) == 424
assert len(beer_test_X) == 107
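The slicing above takes the first 80% of rows in file order, so if the file happens to be ordered (for example by brewery), the training and test sets may contain different mixes of styles. A common alternative is a shuffled, stratified split with scikit-learn's `train_test_split`. The sketch below uses toy data with made-up values, and the variable names are illustrative rather than the ones the asserts in this workbook check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy predictors and labels (hypothetical values), 10 samples per class
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "abv": rng.uniform(0.04, 0.09, 30),
    "ibu": rng.uniform(10, 100, 30),
})
toy_Y = np.array(["A"] * 10 + ["B"] * 10 + ["C"] * 10)

# Shuffled 80/20 split; stratify keeps class proportions the same in both parts,
# and random_state makes the shuffle reproducible
train_X, test_X, train_Y, test_Y = train_test_split(
    toy_X, toy_Y, test_size=0.2, stratify=toy_Y, random_state=0
)

print(len(train_X), len(test_X))
print(np.unique(test_Y, return_counts=True))
```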

To train our model, we'll use a linear SVM classifier. Here a function has been defined for you. **Run the following cell, but be sure you understand what the function is doing.**

In [17]:
def train_SVM(X, y, kernel='linear'):
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    
    return clf

Using the `train_SVM` function defined above, **train your model. Assign this output to `beer_clf`.**

In [18]:
## YOUR CODE HERE
beer_clf = train_SVM(beer_train_X, beer_train_Y)

In [19]:
assert isinstance(beer_clf, SVC)
assert hasattr(beer_clf, "predict")

Now, **generate predictions from your training and test sets of predictors using the `predict` method. Assign your predictions from the training data to `beer_predicted_train_Y`. Assign your predictions from the test data to `beer_predicted_test_Y`.**

In [20]:
## YOUR CODE HERE
beer_predicted_train_Y = beer_clf.predict(beer_train_X)
beer_predicted_test_Y = beer_clf.predict(beer_test_X)

# Part III : Model Assessment

At this point, you should have built your model and generated predictions using that model for both your training and test datasets. 

Let's determine how our predictor did. **Generate a `classification_report` for the predictions generated for your training data relative to the truth (from the original beers dataset). Print the output.**

In [21]:
## YOUR CODE HERE
print(classification_report(beer_train_Y,beer_predicted_train_Y))

                          precision    recall  f1-score   support

American Amber / Red Ale       0.82      0.45      0.58        69
            American IPA       0.80      0.85      0.83       230
 American Pale Ale (APA)       0.57      0.64      0.60       125

               micro avg       0.72      0.72      0.72       424
               macro avg       0.73      0.65      0.67       424
            weighted avg       0.73      0.72      0.72       424



What are precision and recall? What do these numbers represent? How accurate are our predictions?

**Generate a `classification_report` for the predictions generated for your *test* data relative to the truth (from the original beers dataset). Print the output.**

In [22]:
## YOUR CODE HERE
print(classification_report(beer_test_Y, beer_predicted_test_Y))

                          precision    recall  f1-score   support

American Amber / Red Ale       0.71      0.62      0.67         8
            American IPA       0.90      0.76      0.82        71
 American Pale Ale (APA)       0.55      0.79      0.65        28

               micro avg       0.76      0.76      0.76       107
               macro avg       0.72      0.72      0.71       107
            weighted avg       0.79      0.76      0.77       107



How is our model performing? Does this differ between training and test data? Where does it have trouble? Where does it perform well? Do we have thoughts as to why? One way to determine where a model is going wrong is to look at a confusion matrix. **Generate a confusion matrix comparing the training data predictions with the ground truth from the `beer` dataset.**

In [23]:
## YOUR CODE HERE
confusion_matrix(beer_train_Y, beer_predicted_train_Y)

array([[ 31,  11,  27],
       [  0, 196,  34],
       [  7,  38,  80]])
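To connect this matrix to the classification report, the sketch below recomputes per-class precision and recall from the training confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels. Precision divides each diagonal entry by its column total, and recall divides it by its row total.

```python
import numpy as np

# Training confusion matrix copied from the output above
# (rows = true style, columns = predicted style)
cm = np.array([[31, 11, 27],
               [0, 196, 34],
               [7, 38, 80]])

true_positives = np.diag(cm)

# Precision: of everything predicted as a style, how much really was that style
precision = true_positives / cm.sum(axis=0)
# Recall: of everything truly a given style, how much the model found
recall = true_positives / cm.sum(axis=1)
# Accuracy: fraction of all samples on the diagonal
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(accuracy, 2))
```

The rounded values should line up with the precision, recall, and micro average columns of the training `classification_report`.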

**Generate a confusion matrix for the testing data.**

In [24]:
## YOUR CODE HERE
confusion_matrix(beer_test_Y, beer_predicted_test_Y)

array([[ 5,  2,  1],
       [ 0, 54, 17],
       [ 2,  4, 22]])
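A bare array is hard to read, so one option is to wrap it in a labeled DataFrame. The sketch below does this for the test matrix printed above. It assumes the rows and columns follow the sorted label order that also appears in the classification report.

```python
import numpy as np
import pandas as pd

# Assumed label order: sorted style names, as in the classification report
labels = ["American Amber / Red Ale", "American IPA", "American Pale Ale (APA)"]

# Test confusion matrix copied from the output above
cm_test = np.array([[5, 2, 1],
                    [0, 54, 17],
                    [2, 4, 22]])

cm_df = pd.DataFrame(
    cm_test,
    index=pd.Index(labels, name="true"),
    columns=pd.Index(labels, name="predicted"),
)
print(cm_df)

# Largest off-diagonal cell: true IPAs predicted as APA
print(cm_df.loc["American IPA", "American Pale Ale (APA)"])
```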

While this is a somewhat small example using a limited dataset for prediction, we hope you have a better understanding of how to approach a machine learning question, knowing specifically what training and test datasets are used for, how to build a model, and how to assess model/prediction performance. **Feel free to try different models, include more beer types in your analysis, or ask a completely different prediction question!**
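As one possible starting point for trying a different model: `abv` and `ibu` sit on very different scales (in the `describe` output their standard deviations are roughly 0.014 and 26), and SVMs are sensitive to feature scale. The sketch below standardizes the features before fitting. `train_scaled_SVM` is a hypothetical variant of `train_SVM`, and the data are synthetic with made-up values, so the score says nothing about the real beer data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_scaled_SVM(X, y, kernel="rbf"):
    """Hypothetical variant of train_SVM: standardize features, then fit an SVC."""
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X, y)
    return clf

# Synthetic stand-in data (made-up values): two styles that differ in abv and ibu
rng = np.random.default_rng(0)
abv = np.concatenate([rng.normal(0.055, 0.005, 50), rng.normal(0.068, 0.005, 50)])
ibu = np.concatenate([rng.normal(35, 5, 50), rng.normal(70, 5, 50)])
X = np.column_stack([abv, ibu])
y = np.array(["pale"] * 50 + ["ipa"] * 50)

clf = train_scaled_SVM(X, y)

# Training accuracy on the synthetic data
print(clf.score(X, y))
```

On the real data you would pass `beer_train_X` and `beer_train_Y` instead, then compare the test-set report against the linear model's.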