# Welcome to "Classification Intro": a quick primer

This notebook is a primer on machine learning classification using a decision tree model.

## Getting Started
1. Create your own account on Kaggle.com.
2. Click the blue "Copy and Edit" button in the upper-right part of this document to create a copy in your own Kaggle account.
3. As you complete exercises, click the blue "Save" button to create save points for your work.

### Orientation
- This notebook is composed of cells. Each cell contains either text or Python code.
- To run a Python code cell, click the "play" button next to the cell, or click inside the cell and press "Shift + Enter" on your keyboard.
- Run the code cells in order from top to bottom, because later cells depend on the imports and variables created by earlier ones.

### Troubleshooting
- If the notebook appears not to be working correctly, restart the environment by going to **Run** and selecting **Restart Session**.
- If the notebook is running correctly, but you need a fresh copy of the original notebook, go to https://www.kaggle.com/ryanorsinger/data-basics and click "Copy and Edit" to make yourself a new copy.
- Save often, so you have access to all of your exercise solutions!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split # sklearn is a machine learning library
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

## What kinds of questions can Data Science methods answer?
- How Many or How Much of something (Regression)
- **Is this observation A or B, or C or D or E... (Classification)**
- What groupings exist in the data already (Clustering)
- What should we expect to happen next? (Time Series Analysis)
- Is this weird? (Anomaly Detection)

## What are we doing?
- We'll be using a decision tree classifier to predict the species of an iris flower based on measurements of its petals and sepals.
- Classification machine learning is used all the time for such things as:
    - Facial recognition
    - Handwriting recognition and conversion to typed text
- Classification is a "supervised learning" type of machine learning. That means we train the algorithm on existing data to learn a rule, a recognized pattern, to apply to future data.

![machine learning vs. classical programming](https://camo.githubusercontent.com/fedd5d66bea57a430635498de58dc7c6f064f280/68747470733a2f2f64707a6268796262327064636a2e636c6f756466726f6e742e6e65742f63686f6c6c65742f466967757265732f303166696730322e6a7067)

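## How does a decision tree work?
- Many classification algorithms use training data to measure distances between points or to find boundaries between groups of points. A decision tree finds its boundaries by learning a sequence of yes/no questions about the features (for example, whether a petal is longer than some threshold), where each question splits the observations into groups that are more uniform in species.
- By "learning" these splits, the classifier produces a "decision rule" that it can apply to classify new incoming data.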
![decision tree diagram](https://raw.githubusercontent.com/ryanorsinger/machine-learning-classification-workshop/master/decision_tree_diagram.png)
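
To make the idea of a "decision rule" concrete, here is a minimal sketch of what a trained tree amounts to: a chain of threshold questions. The function name `classify_iris` and the threshold values are hypothetical, chosen only for illustration. They are not the thresholds the model in this notebook learns.

```python
def classify_iris(petal_length, petal_width):
    """Hand-written stand-in for a learned decision tree.

    The thresholds below are made up for illustration only.
    """
    if petal_length < 2.5:      # first question (the root of the tree)
        return "setosa"
    if petal_width < 1.7:       # second question, asked only if the first answer is "no"
        return "versicolor"
    return "virginica"


print(classify_iris(petal_length=1.4, petal_width=0.2))
print(classify_iris(petal_length=5.1, petal_width=2.4))
```

A real decision tree has the same shape, but the training step picks the questions and thresholds from the data instead of a person writing them by hand.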

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
df["petal_area"] = df.petal_width * df.petal_length
df["sepal_area"] = df.sepal_width * df.sepal_length
df.head()
df.shape

(150, 7)

In [3]:
# For our modeling, X holds the input variables (features) used to make predictions
X = df.drop(['species'], axis=1)
X.head()

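| | sepal_length | sepal_width | petal_length | petal_width | petal_area | sepal_area |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.28 | 17.85 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.28 | 14.7 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.26 | 15.04 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.3 | 14.26 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.28 | 18.0 |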


In [4]:
# y is our target variable: we're trying to predict the species of future irises based on their measurements
y = df[['species']]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

X_train.head()

| | sepal_length | sepal_width | petal_length | petal_width | petal_area | sepal_area |
|---|---|---|---|---|---|---|
| 114 | 5.8 | 2.8 | 5.1 | 2.4 | 12.24 | 16.24 |
| 136 | 6.3 | 3.4 | 5.6 | 2.4 | 13.44 | 21.42 |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | 5.2 | 12.65 |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | 0.45 | 19.38 |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | 0.26 | 13.2 |
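
As a rough sketch of what the split does (this is not scikit-learn's actual implementation), the idea is to shuffle the row positions and hold out 30% of them for testing. The helper name `simple_split` is hypothetical. With 150 rows and a 30% test share, it yields 105 training rows and 45 test rows.

```python
import math
import random


def simple_split(n_rows, test_size=0.30, seed=123):
    """Shuffle row positions and hold out a share of them for testing."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)   # seeded so the split is repeatable
    n_test = math.ceil(n_rows * test_size)   # number of held-out rows
    return positions[n_test:], positions[:n_test]


train_rows, test_rows = simple_split(150)
print(len(train_rows), len(test_rows))
```

The shuffled row labels in `X_train.head()` (114, 136, 53, ...) reflect the same idea: rows are drawn in random order, and fixing the seed keeps the draw repeatable.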


## Remember the Data Science Pipeline
- Plan your project: What counts as success? What's the most important thing? What are our initial hypotheses? What's the research question?
- Acquire the raw data
- Prepare the raw data (lots of data cleaning and validation)
- Explore the data (visualizing and performing statistical tests)
    - Goal of exploring: identify drivers or predictors of your target
    - Identify and create derived columns, if they're helpful
    - Statistical testing of our hypotheses
- Modeling
    - Build an ML model
    - Train it on the training data
    - Then evaluate the accuracy of the model when making predictions on the test dataset

In [6]:
# For classification you can set the split criterion to 'gini' or 'entropy' (information gain). The default is 'gini'.
# The pattern for sklearn is:
# 1. Make a thing (a new, blank machine learning model of a specific kind)
# 2. Fit the thing (fitting means training the machine learning model)
# 3. Use the thing (we'll use our trained model to make predictions on future datapoints)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=123)
clf

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')
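
A short sketch of what `criterion='entropy'` measures: entropy is 0 for a group that contains a single class and largest when the classes are evenly mixed, and the tree prefers splits that lower it. The helper `entropy` below is written for illustration and is not scikit-learn's internal code. The class counts are made-up examples.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a group, given the count of each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:                      # a class with 0 members contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result


print(entropy([50, 0, 0]))    # one class only: perfectly pure
print(entropy([25, 25]))      # two classes, evenly mixed
print(entropy([50, 50, 50]))  # three classes, evenly mixed
```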

In [7]:
# The easiest part of the entire Data Science pipeline is fitting the machine learning model...
# It's almost anticlimactic...
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')

In [8]:
# Produce a set of species predictions for the training rows
# Also get the model's estimated probability of each species for every row
y_pred = clf.predict(X_train)
y_pred_proba = clf.predict_proba(X_train)
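
For a decision tree, the probabilities returned by `predict_proba` are the class shares within the leaf that a row lands in. The sketch below illustrates that idea with hypothetical leaf counts. `leaf_probabilities` is an illustrative helper, not part of scikit-learn.

```python
def leaf_probabilities(leaf_counts):
    """Turn the class counts in one leaf into per-class probabilities."""
    total = sum(leaf_counts.values())
    return {label: count / total for label, count in leaf_counts.items()}


# Hypothetical leaf holding 1 versicolor and 9 virginica training rows
leaf = {"setosa": 0, "versicolor": 1, "virginica": 9}
probabilities = leaf_probabilities(leaf)
print(probabilities)
print(max(probabilities, key=probabilities.get))  # the most likely class in this leaf
```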

In [9]:
labels = sorted(y_train.species.unique())
predicted_labels = [name + " predicted" for name in labels]

conf = pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=predicted_labels)
conf.index.name = "actual"
conf

| actual | setosa predicted | versicolor predicted | virginica predicted |
|---|---|---|---|
| setosa | 32 | 0 | 0 |
| versicolor | 0 | 40 | 0 |
| virginica | 0 | 3 | 30 |


In [10]:
# Accuracy = number of correct predictions (the diagonal of the confusion matrix) divided by the total number of observations
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.97
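
The accuracy figure can be recomputed by hand from the training confusion matrix, since correct predictions sit on the diagonal. A minimal sketch, with the counts copied from the training confusion matrix above:

```python
# Rows are actual species, columns are predicted species
# (order: setosa, versicolor, virginica), copied from the training confusion matrix.
train_confusion = [
    [32, 0, 0],
    [0, 40, 0],
    [0, 3, 30],
]

correct = sum(train_confusion[i][i] for i in range(len(train_confusion)))
total = sum(sum(row) for row in train_confusion)
accuracy = correct / total
print(f"{correct} correct out of {total}: {accuracy:.2f}")
```

This gives 102 correct out of 105, which rounds to the 0.97 reported by `clf.score`.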


In [11]:
# The model is a little less accurate on the test data, but 93% accuracy is pretty good!
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on test set: 0.93


In [12]:
# Actual vs. predicted numbers on the test set!
# y_prediction based on X_test
y_pred = clf.predict(X_test)

labels = sorted(y_train.species.unique())
predicted_labels = [name + " predicted" for name in labels]

conf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=labels, columns=predicted_labels)
conf.index.name = "actual"
conf

| actual | setosa predicted | versicolor predicted | virginica predicted |
|---|---|---|---|
| setosa | 18 | 0 | 0 |
| versicolor | 0 | 10 | 0 |
| virginica | 0 | 3 | 14 |
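
Accuracy alone hides which species the model confuses. Per-class precision and recall, the quantities summarized by the `classification_report` function imported at the top, can be worked out from the test confusion matrix. The sketch below does the arithmetic by hand rather than calling scikit-learn.

```python
labels = ["setosa", "versicolor", "virginica"]
# Rows are actual species, columns are predicted species (test set counts from above)
test_confusion = [
    [18, 0, 0],
    [0, 10, 0],
    [0, 3, 14],
]

metrics = {}
for i, label in enumerate(labels):
    true_positives = test_confusion[i][i]
    predicted_as_label = sum(row[i] for row in test_confusion)  # column total
    actually_label = sum(test_confusion[i])                     # row total
    metrics[label] = {
        "precision": true_positives / predicted_as_label,
        "recall": true_positives / actually_label,
    }
    print(f"{label}: precision={metrics[label]['precision']:.2f}, "
          f"recall={metrics[label]['recall']:.2f}")
```

The only errors are three virginica flowers predicted as versicolor, so versicolor precision (10 of 13) and virginica recall (14 of 17) are the two values that fall below 1.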


## Get More Practice with Modeling
Use what you've learned from the code above to work with the Titanic dataset. Can you build a decision tree that accurately predicts who would survive the Titanic disaster?

In [13]:
# The "survived" column is the target we're trying to predict based on other features/columns
titanic = sns.load_dataset("titanic")
titanic.head()

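| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |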


## What to do next:
- Use the steps above as a guide for how to separate the features from the prediction column
- Be sure to split your data into training and testing sets
- Train your model on the training set
- Then evaluate your model on the testing dataset
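
One difference from the iris data is that some Titanic columns hold text, which scikit-learn's decision tree cannot use directly. The sketch below shows one possible way to separate features from the target, using only the five preview rows shown earlier. The feature choice in `FEATURES` and the helper name `split_features_and_target` are hypothetical starting points, and filling missing ages with the median assumes the full dataset has gaps in `age`, which the preview alone does not show. In the preview, `alive` appears to restate `survived`, so it is left out of the features.

```python
import pandas as pd

# The five preview rows shown earlier, limited to a few columns
preview = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

FEATURES = ["pclass", "sex", "age", "fare"]  # a hypothetical starting choice


def split_features_and_target(df):
    """Return (X, y) with the text column converted to numbers."""
    X = df[FEATURES].copy()
    X["sex"] = (X["sex"] == "female").astype(int)   # 1 = female, 0 = male
    X["age"] = X["age"].fillna(X["age"].median())   # one simple way to handle missing ages
    y = df[["survived"]]                            # same shape as y in the iris example
    return X, y


X, y = split_features_and_target(preview)
print(X)
print(y.survived.tolist())
```

From here, the remaining steps follow the iris example: split `X` and `y`, fit a `DecisionTreeClassifier` on the training portion, and score it on the test portion.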