# Workbook : Machine Learning

For our last section workbook (so that next week you can use section to ask questions about and work on your final projects), we're going to work with a dataset all about craft beer. We'll try to predict each beer's style from the characteristics of that beer.

**Disclaimer**: Working with data about beer does *NOT* mean that I'm encouraging the drinking of beer by students. In fact, your professor doesn't even like beer (blech). Specifically, individuals under the age of 21 are not legally allowed to consume alcoholic beverages, but lucky for you all, that doesn't stop us from working with data on the topic!

The data we'll use here come from a publicly-available [Kaggle dataset on craft beer](https://www.kaggle.com/nickhould/craft-cans).

# Part I : Data, Wrangling, & EDA

To get started, you'll need to **import the following**:
   * `pandas`
   * `numpy`
   * `SVC` from `sklearn.svm`
   * `confusion_matrix`, `classification_report`, `precision_recall_fscore_support` from `sklearn.metrics`

In [None]:
## YOUR CODE HERE

Now that you're set up in Python, **read in the `breweries.csv` file from the `data` directory. Assign this to the variable `breweries`**. Then, **read in the file `beers.csv` from the `data` directory. Assign this to the variable `beers`.**
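If reading a CSV with `pandas` is unfamiliar, here is a minimal sketch of how `pd.read_csv` behaves. The in-memory text and its column names are made up for illustration and stand in for a file on disk; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,abv\nToy Ale,0.05\nToy Stout,0.07\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
print(toy.head())
```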

In [None]:
## YOUR CODE HERE

Take a **look at the first few rows of each dataset** to give yourself an idea of what data are included in each dataset. Notice if there are any common columns between the two datasets.

In [None]:
## YOUR CODE HERE

To start to get a handle on what's going on in these data, **print out the number of missing values in each column of the `beers` DataFrame.**
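As a hint, one common way to count missing values per column is to chain `isnull()` with `sum()`. The sketch below uses a small made-up DataFrame (not the real `beers` data) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "abv": [0.05, np.nan, 0.07],
    "ibu": [np.nan, np.nan, 60.0],
    "style": ["IPA", "Stout", None],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```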

In [None]:
## YOUR CODE HERE

We're going to try to predict the `style` of beer from its alcohol by volume (`abv`) and its international bitterness units (`ibu`). To do this, **remove any beers from our `beers` dataset where data are missing for any of these three values. Do this in place.** Note that you may not always want to take this approach and removing samples from your dataset will not always be appropriate, but for this example, it's a reasonable approach.
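To see how dropping rows "in place" for only certain columns works, here is a sketch on a made-up DataFrame. It assumes `dropna` with the `subset` and `inplace` arguments is the approach you choose; the toy values are not from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "abv": [0.05, np.nan, 0.07, 0.06, 0.045],
    "ibu": [40.0, 20.0, np.nan, 65.0, 18.0],
    "style": ["IPA", "Stout", "Porter", None, "Lager"],
    "name": ["a", "b", "c", "d", None],
})

# Drop rows missing any of the three columns of interest.
# A missing `name` does not matter because it is not in `subset`.
toy.dropna(subset=["abv", "ibu", "style"], inplace=True)

print(toy)
```

Only the first and last rows survive here: the last row is missing `name`, but that column is not part of the check.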

In [None]:
## YOUR CODE HERE

Check to see how many entries remain in your `beers` dataset now.

In [None]:
## YOUR CODE HERE

In [None]:
assert beers.shape == (1403, 8)

Using the beers dataset you've now got, **merge `beers` and `breweries` together using a left join. Assign this to the variable `beer`. Look at the first few rows of `beer`.**
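As a reminder of what a left join does, here is a sketch on two made-up tables. The column names (`brewery_id`, `brewery`) are assumptions for this toy example; check the first few rows of the real datasets to find the column(s) they actually share.

```python
import pandas as pd

# Toy tables, invented for illustration; real column names may differ.
toy_beers = pd.DataFrame({
    "beer": ["A", "B", "C"],
    "brewery_id": [0, 1, 5],
})
toy_breweries = pd.DataFrame({
    "brewery_id": [0, 1],
    "brewery": ["X Brewing", "Y Brewing"],
})

# A left join keeps every row of the left table (toy_beers),
# filling in NaN where the right table has no matching key.
merged = toy_beers.merge(toy_breweries, how="left", on="brewery_id")
print(merged)
```

Beer "C" has no matching brewery, so it is kept with a missing `brewery` value rather than dropped.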

In [None]:
## YOUR CODE HERE

**Use the `describe` method to describe the quantitative variables in your `beer` dataset.**

In [None]:
## YOUR CODE HERE

**Be sure to look at the output from what you just ran. What do you learn? Do any values surprise you? Are there any with really big standard deviations? Does this make sense?**

Now, let's take a look and **see how many different styles of beer we have in our dataset.** The `value_counts` method may help you accomplish this.

In [None]:
## YOUR CODE HERE

Due to limitations in time here in section, let's just try to predict the three most common styles of beer. **Filter your `beer` dataset to only include entries from the three most common styles. Be sure to determine how many beers are now included in your dataset.**
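One possible way to filter down to the most common categories is to combine `value_counts` with `isin`. The sketch below uses made-up style labels, so the counts have nothing to do with the real data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "style": ["IPA"] * 4 + ["Stout"] * 3 + ["Lager"] * 2 + ["Porter"],
})

# value_counts() sorts from most to least common,
# so the first three index entries are the top three styles.
top_styles = toy["style"].value_counts().index[:3]

# Keep only the rows whose style is one of those three.
filtered = toy[toy["style"].isin(top_styles)]

print(list(top_styles))
print(len(filtered))
```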

In [None]:
## YOUR CODE HERE

# Part II : Prediction Model

Let's start to build our model! To do so, **create a variable `num_training` that includes the number of samples that corresponds to 80% of our total samples in our `beer` dataset. Be sure that this is an integer. Also, create a variable `num_testing` including the number of remaining samples (roughly 20% of our total samples).**
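To illustrate the arithmetic, here is a sketch with a made-up sample count. Taking the test size as "whatever is left over" is one reasonable choice, since it guarantees the two numbers add up to the total even when 80% is not a whole number.

```python
# Hypothetical total, chosen so that 80% is not a whole number.
n_samples = 11

# int() truncates toward zero: 0.8 * 11 = 8.8 becomes 8.
n_train = int(0.8 * n_samples)

# The remaining samples go to the test set.
n_test = n_samples - n_train

print(n_train, n_test)
```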

In [None]:
## YOUR CODE HERE

In [None]:
assert num_training == 424
assert num_testing == 107

To model these data, **split your data into `beer_X`, which includes the `abv` and `ibu` columns from `beer` (predictors). This should be a `pandas` DataFrame. The outcome variable will be `style`. Assign the outcome variable to the variable `beer_Y`. This should be a `numpy` array.**
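If the difference between the two types is unclear, this sketch on made-up data shows one way to get a DataFrame of predictors and a `numpy` array of outcomes. Selecting with a list of column names returns a DataFrame, and `np.array` converts a single column to an array.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "abv": [0.05, 0.07, 0.045],
    "ibu": [40.0, 65.0, 18.0],
    "style": ["IPA", "IPA", "Lager"],
})

# Double brackets (a list of columns) give back a DataFrame.
toy_X = toy[["abv", "ibu"]]

# np.array turns the outcome column into a numpy array.
toy_Y = np.array(toy["style"])

print(type(toy_X), type(toy_Y))
```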

In [None]:
## YOUR CODE HERE

Before running our model, we'll need to **split our data into a training and test set. Use `num_training` (created above) to extract the following variables**: 
* from `beer_X`, generate : `beer_train_X`, `beer_test_X`
* from `beer_Y`, generate: `beer_train_Y`, `beer_test_Y`
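A simple positional split can be sketched as below, using made-up data and a hypothetical `n_train`. Note the assumption baked into this approach: the first rows become training data and the rest become test data, which only works well if the rows are not sorted by the outcome.

```python
import numpy as np
import pandas as pd

# Toy predictors and outcomes, invented for illustration only.
toy_X = pd.DataFrame({
    "abv": [0.05, 0.06, 0.07, 0.04, 0.08],
    "ibu": [40, 55, 70, 15, 90],
})
toy_Y = np.array(["IPA", "IPA", "IPA", "Lager", "IPA"])

n_train = 4  # hypothetical number of training samples

# First n_train rows for training, everything after for testing.
toy_train_X = toy_X.iloc[:n_train]
toy_test_X = toy_X.iloc[n_train:]
toy_train_Y = toy_Y[:n_train]
toy_test_Y = toy_Y[n_train:]

print(len(toy_train_X), len(toy_test_X))
```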

In [None]:
## YOUR CODE HERE

In [None]:
assert len(beer_train_X) == 424
assert len(beer_test_X) == 107

To train our model, we'll use a linear SVM classifier. Here a function has been defined for you. **Run the following cell, but be sure you understand what the function is doing.**

In [None]:
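def train_SVM(X, y, kernel='linear'):
    """Fit a support vector classifier on predictors X and labels y."""
    clf = SVC(kernel=kernel)  # linear kernel unless another is requested
    clf.fit(X, y)

    return clf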

Using the `train_SVM` function defined above, **train your model. Assign this output to `beer_clf`.**

In [None]:
## YOUR CODE HERE

In [None]:
assert isinstance(beer_clf, SVC)
assert hasattr(beer_clf, "predict")

Now, **generate predictions from your training and test sets of predictors using the `predict` method. Assign your predictions from the training data to `beer_predicted_train_Y`. Assign your predictions from the test data to `beer_predicted_test_Y`.**
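To see the fit-then-predict pattern on its own, here is a sketch with a tiny made-up dataset of two well-separated groups. The points and labels are invented; the only `sklearn` calls used are `SVC`, `fit`, and `predict`.

```python
import numpy as np
from sklearn.svm import SVC

# Two clearly separated toy groups, invented for illustration.
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
toy_y = np.array(["low", "low", "high", "high"])

# Fit a linear SVM, then predict labels for new points.
toy_clf = SVC(kernel="linear")
toy_clf.fit(toy_X, toy_y)

new_points = np.array([[0.5, 0.5], [5.5, 5.5]])
toy_pred = toy_clf.predict(new_points)

print(toy_pred)
```

`predict` returns one label per row of its input, so the same call works for both the training and the test predictors.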

In [None]:
## YOUR CODE HERE

# Part III : Model Assessment

At this point, you should have built your model and generated predictions using that model for both your training and test datasets. 

Let's determine how our predictor did. **Generate a `classification_report` for the predictions generated for your training data relative to the truth (from the original beers dataset). Print the output.**

In [None]:
## YOUR CODE HERE

What are precision and recall? What do these numbers represent? How accurate are our predictions?
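To make these two terms concrete, here is a small worked sketch on made-up labels. For a given class, precision is the fraction of predictions of that class that were correct, and recall is the fraction of true members of that class that were found.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy truth and predictions, invented for illustration only.
y_true = ["ale", "ale", "ale", "ipa", "ipa", "ipa"]
y_pred = ["ale", "ale", "ipa", "ipa", "ipa", "ipa"]

# One value per class, in the order given by `labels`.
precision, recall, fscore, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["ale", "ipa"]
)

# "ale": 2 predicted, 2 correct -> precision 2/2; 3 true, 2 found -> recall 2/3.
# "ipa": 4 predicted, 3 correct -> precision 3/4; 3 true, 3 found -> recall 3/3.
print(precision, recall)

# classification_report summarizes the same numbers as text.
print(classification_report(y_true, y_pred))
```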

**Generate a `classification_report` for the predictions generated for your *test* data relative to the truth (from the original beers dataset). Print the output.**

In [None]:
## YOUR CODE HERE

How is our model performing? Does this differ between training and test data? Where does it have trouble? Where does it perform well? Do we have thoughts as to why? One way to determine where a model is going wrong is to look at a confusion matrix. **Generate a confusion matrix for the training data predictions as well as the ground truth from the `beer` dataset.**
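As a guide to reading a confusion matrix, here is a sketch on made-up labels. In `sklearn`, rows correspond to the true class and columns to the predicted class, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Toy truth and predictions, invented for illustration only.
y_true = ["ale", "ale", "ale", "ipa", "ipa", "ipa"]
y_pred = ["ale", "ale", "ipa", "ipa", "ipa", "ipa"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["ale", "ipa"])
print(cm)
```

Here the first row reads: of the three true "ale" samples, two were predicted "ale" and one was predicted "ipa". Off-diagonal entries are where the model went wrong.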

In [None]:
## YOUR CODE HERE

**Generate a confusion matrix for the testing data.**

In [None]:
## YOUR CODE HERE

While this is a somewhat small example using a limited dataset for prediction, we hope you have a better understanding of how to approach a machine learning question, knowing specifically what training and test datasets are used for, how to build a model, and how to assess model/prediction performance. **Feel free to try different models, include more beer types in your analysis, or ask a completely different prediction question!**