
# Training Data

We will use two datasets to test your ML model and process: the 'zoo dataset' and the 'ANU coffee dataset', which we will create together. Once the 'ANU coffee dataset' has been collected, we will release it for you to analyze with a process similar to the one outlined below.

[Zoo dataset](http://archive.ics.uci.edu/ml/datasets/Zoo?ref=datanews.io) -- As described in the dataset information sheet:

“A simple database containing 17 Boolean-valued attributes. The "type" attribute appears to be the class attribute.” The datasheet will quickly reveal that the dataset itself has some problematic categories. We are primarily using it in order to have access to multiple datasets with binary-valued attributes.


In [None]:
# First, we'll download the zoo dataset to a local (temporary) folder
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data

In [None]:
# We can also download and display the dataset's description:
# This command downloads the relevant file
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.names
# This command displays the file's contents
!cat zoo.names

## Data Ingestion

Here we'll "ingest" the data by importing it into [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html):

In [None]:
from IPython.display import display
import pandas as pd

# Because the data file doesn't have header names, we'll list them here
# You can find a description of the data file at http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.names
feature_names = ['animal name', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'type']

# Import the "zoo" dataset
zoo = pd.read_csv('zoo.data', names = feature_names)

# Let's take a peek at the data
zoo.head()

In [None]:
# We'll now import a few useful packages

# Numpy is a linear algebra library, 
# useful for common math operations
import numpy as np 
# Matplotlib is a common plotting library
import matplotlib.pyplot as plt
# Seaborn is handy for creating beautiful plots
import seaborn as sns; sns.set()

## Data validation

Now's a good chance to have a look at your data and make sure it "checks out". Try plotting several aspects of the data, following the exploratory data analysis steps we've looked at in build, to check for trends and inconsistencies in the data. This is also a good opportunity to begin familiarizing yourself with pandas' capabilities, or to revise Tableau or Excel. We've included a 'correlation matrix' to help you get started.
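
Before plotting, a few quick tabular checks can catch obvious problems. The sketch below is one possible starting point, not a required procedure. It uses a small hand-made DataFrame called `toy` (a hypothetical stand-in, so the cell runs on its own); the same calls should apply to the `zoo` DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the zoo DataFrame (only a few rows and columns)
toy = pd.DataFrame({
    "animal name": ["aardvark", "bass", "chicken", "frog"],
    "hair": [1, 0, 0, 0],
    "eggs": [0, 1, 1, 1],
    "legs": [4, 0, 2, 4],
    "type": [1, 4, 2, 5],
})

# 1. Count missing values per column (all zeros means nothing is missing)
missing = toy.isna().sum()

# 2. Boolean-valued columns should only ever contain 0 or 1
binary_cols = ["hair", "eggs"]
is_binary = toy[binary_cols].isin([0, 1]).all()

# 3. Class balance: how many animals fall into each type
class_counts = toy["type"].value_counts().sort_index()

# 4. Repeated animal names can point to duplicated rows
n_duplicates = toy["animal name"].duplicated().sum()

print(missing, is_binary, class_counts, n_duplicates, sep="\n\n")
```

On the real data, an uneven `class_counts` is worth noting, since it affects how you later split and evaluate.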

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

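# Compute the correlation matrix over the numeric columns only
# (the 'animal name' column holds strings and cannot be correlated)
corr = zoo.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
# (np.bool was removed from recent NumPy releases; the built-in bool works)
mask = np.triu(np.ones_like(corr, dtype=bool))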

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(10, 240, as_cmap=True, sep=20, n=11)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})


# Feature Extraction

We'll now split up the data into features and labels. We will not do any special pre-processing to generate features, but you are of course welcome to experiment with engineering new features if you have an intuition for how it will improve your model performance.  

In [None]:
features = zoo.loc[:, 'hair':'catsize'] # Omit animal name
labels = zoo.loc[:, 'type']
# We then convert the feature and labels dataframes to 
# numpy ndarrays, which can interface with the scikit-learn models
X = features.to_numpy()
y = labels.to_numpy()

In [None]:
# To help familiarize yourself with these matrices, 
# have a look at their 'shapes' and understand why they are so.
print('X shape', X.shape)
print('y shape', y.shape)

# ML Algorithm and Model

The model you will use to create classifications is called a “decision tree”. You have likely seen decision trees charted as thought diagrams, and they are a popular tool among biologists for species identification. A decision tree is also a powerful and relatively intuitive machine learning model, which makes it a good way to practice classification in the ML pipeline.

The decision tree model is described in the assigned pre-reading *A visual introduction to machine learning* (parts [I](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) and [II](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)) by Stephanie Yee and Tony Chu, 2018.

One example of pseudo-code for training a simple version of a decision tree (with binary splits) is given in:

[A Course in Machine Learning, ch 1., Decision Trees](http://ciml.info/dl/v0_99/ciml-v0_99-ch01.pdf), by Hal Daumé III, 2015

We encourage you to read it carefully and work with peers to understand its behavior. Depending on your stretch task submission, it may be to your advantage to explain how the model learns.

Additional resources: To use the scikit-learn decision tree algorithm, have a look at their [documentation](https://scikit-learn.org/stable/modules/tree.html). A more advanced ensemble of decision trees is called a “random forest”. We do not cover it in class, but you are welcome to learn more about this approach. You can find one useful resource on visualizing decision trees and generating random forests [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html).
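
To make "how the model learns" a little more concrete, here is a minimal sketch of the core step: choosing which binary feature to split on. It scores each candidate split with Gini impurity, which is scikit-learn's default criterion; the pseudo-code in the reading may score splits differently. The data and the helper names `gini` and `split_impurity` are hypothetical, written only for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Size-weighted Gini impurity of the two groups made by a 0/1 feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: each row is (hair, feathers)
feature_names_toy = ["hair", "feathers"]
rows = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (0, 0)]
labels_toy = ["mammal", "mammal", "mammal", "bird", "bird", "fish"]

# Lower weighted impurity means the split produces purer groups
scores = {name: split_impurity(rows, labels_toy, i)
          for i, name in enumerate(feature_names_toy)}
best = min(scores, key=scores.get)
print(scores, "-> split on", best)
```

A full tree repeats this choice inside each resulting group until the groups are pure or a stopping rule such as `max_depth` is reached.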

In [None]:
# INITIALIZE YOUR MODEL HERE

## Quality Metric and Model Tuning

First, let's begin by splitting our data into training and test sets. You can do so using scikit-learn's [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. When setting its parameters, be mindful to record your decisions. (You can also create a validation set by using `train_test_split` a second time.)
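
As a rough illustration of the mechanics, the sketch below splits a small synthetic dataset twice: once into train/test, and once more to carve a validation set out of the training portion. The arrays `X_toy` and `y_toy` and the particular sizes and seeds are assumptions chosen for the example, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples, 3 binary features, 2 balanced classes
X_toy = np.array([[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(20)])
y_toy = np.array([0, 1] * 10)

# First split: hold out 25% of the samples as the test set.
# stratify keeps the class proportions similar in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

# Second split: hold out 20% of the training samples as a validation set
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, X_fit.shape, X_val.shape)
```

Note that `stratify` needs at least two examples of every class, which is worth checking on small datasets.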

In [None]:
# SPLIT YOUR DATA INTO TRAIN/TEST SETS HERE 

### Train your model

Try training your decision tree model below on the training set.

In [None]:
# TRAIN YOUR MODEL HERE

## Evaluate your model

Start by measuring your model's test set accuracy using the `DecisionTreeClassifier`'s `score` method.
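
For reference, the fit-then-score pattern looks roughly like this on a synthetic problem. Everything here (`X_toy`, `y_toy`, `toy_clf`, the split settings) is a hypothetical stand-in so the snippet runs by itself; it is a sketch of the workflow, not a solution for the zoo data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is simply the first binary feature
X_toy = np.array([[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(40)])
y_toy = X_toy[:, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Fit on the training set only, so the test set stays unseen
toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X_train, y_train)

# score returns mean accuracy: the fraction of correctly predicted labels
train_acc = toy_clf.score(X_train, y_train)
test_acc = toy_clf.score(X_test, y_test)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

Comparing the two numbers is informative: a training accuracy far above the test accuracy is a common sign of overfitting.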

In [None]:
# EVALUATE YOUR MODEL ON TEST SET HERE

## Visualizing your model

A good first way to investigate your model is by visualizing it! We can do so using code similar to that provided by scikit-learn on the documentation page for the DecisionTreeClassifier model.

You may also find their example "[Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py)" useful for understanding your model's behavior.
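
If graphviz is not available in your environment, one lightweight alternative is scikit-learn's `export_text`, which prints the learned rules as indented text. The sketch below uses a hypothetical toy classifier (`toy_clf`) and made-up feature names purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label is simply the first binary feature
X_toy = np.array([[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(16)])
y_toy = X_toy[:, 0]
toy_names = ["hair", "feathers", "eggs"]

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Render the fitted tree as plain text, one line per split or leaf
rules = export_text(toy_clf, feature_names=toy_names)
print(rules)
```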

In [None]:
from sklearn import tree
import graphviz

# clf = # YOUR DECISION TREE CLASSIFIER HERE
dot_data = tree.export_graphviz(clf, out_file=None, 
                      feature_names=features.columns,  
                      class_names=sorted(list(map(str, labels.unique()))),  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

## Tuning your model

Hyperparameters like max depth can drastically affect your model's performance. Use k-fold cross validation to determine a good `max_depth` for your decision tree. Plot the cross validation score for each `max_depth` setting. Be sure to record how many folds you selected.

You can learn how to do k-fold cross validation with scikit-learn from the [documentation](https://scikit-learn.org/stable/modules/cross_validation.html).
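
One way to organize the search is a simple loop over candidate depths, scoring each with `cross_val_score`. The sketch below assumes synthetic data (`X_toy`, `y_toy`), three candidate depths, and five folds; all of these are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is 1 only when the first two features are both 1
X_toy = np.array([[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(40)])
y_toy = X_toy[:, 0] & X_toy[:, 1]

depths = [1, 2, 3]   # candidate max_depth values (an assumption for this sketch)
n_folds = 5          # number of cross-validation folds

mean_scores = []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # cross_val_score returns one accuracy value per fold
    fold_scores = cross_val_score(model, X_toy, y_toy, cv=n_folds)
    mean_scores.append(fold_scores.mean())

# Pick the depth with the highest mean accuracy (ties go to the shallowest)
best_depth = depths[int(np.argmax(mean_scores))]
print(dict(zip(depths, mean_scores)), "-> best max_depth:", best_depth)
```

The resulting `depths` and `mean_scores` lists can be passed directly to `plt.plot(depths, mean_scores)` to produce the requested plot.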

In [None]:
# YOUR CROSS VALIDATION CODE HERE

Next steps -- now that you've completed the ML pipeline for a classifier model, it's time for you to complete the stretch task. The stretch tasks are explained in more detail in the project handout -- a good place to start is taking some time to reflect on your skills through the skills essay, and then selecting the stretch task you think will best help you develop those skills. This will likely lead you to continue investigating concepts related to the code above, implementing or experimenting with a new approach, and so forth. In this sense, you can view the above as a helpful scaffold for what you will discuss in the video. We also encourage you to use the above as a framework for applying the ML pipeline to create a classifier on the ANU coffee dataset. We plan to release it by 5pm on Friday, August 27.

