# Multi-Class and Imbalanced Class Machine Learning

<b> Goals: </b>

- Finish the rest of the advanced Sklearn tools lesson by learning about imputation and one hot encoding.
- Work on a supervised classification dataset with more than two classes, specifically the famous MNIST digits dataset.
- Work on a supervised classification dataset with imbalanced classes, specifically the credit card fraud dataset.

## Advanced Sklearn tools (cont.)

In [None]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.metrics import recall_score, roc_auc_score, roc_curve, accuracy_score
#Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer, OneHotEncoder, LabelEncoder, LabelBinarizer

In [None]:
#Load in titanic data

path = "../data/titanic.csv"



### Encoding aka dummy variables with sklearn

**One hot encoding:** Transforming a categorical variable into a set of binary (0/1) columns, one per category.

In [None]:
#Assign X and y

X = 
y = 

In [None]:
#Make a train test split with the titanic data



We're going to use LabelEncoder to turn the object values into numbers. Instead of turning each unique value into its own column a la dummy variables, this tool returns a single column and replaces each object/string with an integer.

In [None]:
#Initialize LabelEncoder object

le = 

#Use le on the sex column




This turns male and female into 1s and 0s.

The advantage of using this is that we can use the LabelEncoder object (le) to transform other data.
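
As a rough sketch of that reuse, the toy example below fits a LabelEncoder on a small made-up training list and then applies the same mapping to a made-up testing list. The names `sex_train`, `sex_test`, and `le_demo` are hypothetical stand-ins, not the Titanic columns used in the cells of this lesson.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for X_train.Sex and X_test.Sex
sex_train = ["male", "female", "female", "male"]
sex_test = ["female", "male"]

le_demo = LabelEncoder()
# fit_transform learns the classes (stored in sorted order) and encodes them
sex_train_encoded = le_demo.fit_transform(sex_train)
# transform reuses the mapping learned from the training data
sex_test_encoded = le_demo.transform(sex_test)

print(le_demo.classes_)      # original string values, in encoded order
print(sex_train_encoded)
print(sex_test_encoded)
```

Because the learned classes are stored in sorted order, female maps to 0 and male maps to 1 in this sketch.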

In [None]:
#Pass in the Sex column on the testing dataset into the le object



Now let's try this on the Embarked column

In [None]:

#Initialize LabelEncoder



#Pass Embarked column into le object


#Look at first twenty rows


In [None]:
# Call .classes_ to see the original object values


In [None]:
#Transform the embarked class of the testing dataset



In [None]:
#Look at original X_test.Embarked



How to use the OneHotEncoder object

In [None]:

#Initialize object


#Fit and transform using the emb_encoded variable



#Look at emb_onehot



In [None]:
#Transform emb_encoded_test using onehot object



We can also use the LabelBinarizer to do this

In [None]:

#Initialize LabelBinarizer



#Fit and transform on the Embarked column of the training dataset

#Fill nans with unknown


In [None]:
#Look at the class or column values



In [None]:
#Transform the testing data using lb



You may be asking yourself "Why use this instead of pd.get_dummies?"

That's because testing data, or any other new data you want to use, may not have the same values in its categorical columns.

In [None]:
#Create new dataset from X_test where there is no C value in the Embarked column



#Transform the Embarked column from  X_test2 using the LabelBinarizer



This returns a 0 for every value under the C column. Using pd.get_dummies we would have three columns instead of four. This is important because when you fit your model using the training data and then make predictions using the testing data, the model won't work if your testing and training data don't have the same number of columns.
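
The column mismatch can be seen on a tiny made-up example. The sketch below assumes NaNs have already been filled with "unknown"; the names `emb_train`, `emb_test_no_c`, and `lb_demo` are hypothetical and are not the variables from the cells above.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical Embarked values; NaNs are assumed to be filled with "unknown"
emb_train = pd.Series(["S", "C", "Q", "unknown", "S"])
emb_test_no_c = pd.Series(["S", "Q", "unknown"])  # no "C" in this data

lb_demo = LabelBinarizer()
lb_demo.fit(emb_train)

# The binarizer keeps one column per class seen during fit
test_binarized = lb_demo.transform(emb_test_no_c)
# get_dummies only creates columns for values present in the data it is given
test_dummies = pd.get_dummies(emb_test_no_c)

print(lb_demo.classes_)
print(test_binarized.shape)   # four columns, the C column is all zeros
print(test_dummies.shape)     # only three columns
```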

In conclusion:

![e](https://chrisalbon.com/machine-learning/one-hot_encode_nominal_categorical_features/One-Hot_Encoding_print.png)

### Imputation

Like the previous topic, imputation is something we've done using pandas, but in this lesson we'll use sklearn to do it.

**Imputation:** Replacing nan or null values with the mean, median, or most frequent value of a specific column in a dataset.
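
Here is a minimal sketch of the fit-then-transform pattern on made-up ages. It assumes a scikit-learn version that ships `sklearn.impute.SimpleImputer` (the replacement for the older `Imputer` class); `train_age`, `test_age`, and `imp_demo` are hypothetical names. Note that `SimpleImputer` expects 2-D input, so each age sits in its own row of a single column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age columns with missing values (2-D: one column)
train_age = np.array([[22.0], [38.0], [np.nan], [30.0]])
test_age = np.array([[np.nan], [50.0]])

imp_demo = SimpleImputer(strategy="mean")
# fit learns the mean of the non-missing training values
imp_demo.fit(train_age)

# transform fills missing values with the mean learned from the training data
train_filled = imp_demo.transform(train_age)
test_filled = imp_demo.transform(test_age)

print(imp_demo.statistics_)   # the learned fill value
print(train_filled.ravel())
print(test_filled.ravel())
```

The testing data is filled with the training mean, not its own mean, which mirrors how a fitted model would treat new data.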

In [None]:
#Initialize Imputer object and set axis = 1




In [None]:
#Fit imp on Age column of training data



In [None]:
#What's the average of X_train.Age?



In [None]:
#Transform X_train.Age using imp



#view age_imp


Transformation time

In [None]:
#What's the average age in X_test?
X_test.Age.mean()

In [None]:

#Transform the Age column of testing dataset with the imp object

age_test_imp = 



We can also use the median instead of the mean

In [None]:
#Initialize imputer object with strategy set to median
imp_med = 
#Fit

#What's the median of Age?
X_train.Age.median()

In [None]:
#Median age of X_test
X_test.Age.median()

In [None]:
#Transform the Age column of the testing dataset


We can also use the mode as an imputation strategy

In [None]:
#Initialize imputation object and set strategy to most_frequent

imp_cat = 


#Fit imp_cat on Age column



In [None]:
#What's the most frequent age?
X_train.Age.value_counts().idxmax()

In [None]:
#Transform



## Multi-class Supervised Learning

So far in our classification lessons we've mainly modeled binary classification datasets, aka either-or data. In this class we're going to work through the MNIST digits dataset and learn to work with and interpret models trained on multiple classes, meaning more than two.

Before we get into the MNIST dataset, let's bring back the Iris dataset

In [None]:
#Load in iris data using seaborn
iris = 



The iris dataset is a multi-class dataset because there are three unique values in the dependent variable

In [None]:
#The class of species


Now let's model this data and use a confusion matrix to analyze the results.

In [None]:
#Step 1. Assign X and y

X = 

y = 

#Step 2. train test split with random state = 27

X_train, X_test, y_train, y_test = 

#Step 3. Fit KNN model with 3 neighbors on training data





#Step 4. Make predictions on the X_test using the model



#Step 5. Score predictions



Pretty decent model, right? Now let's use the confusion matrix to better understand our predictions, particularly the wrong ones.

In [None]:

#Pass y_test and preds into the confusion_matrix function

cm = 

cm

Let's turn this into a dataframe

In [None]:
#Column and index values for our confusion matrix dataframe
cols = ["pred_setosa", "pred_versicolor", "pred_virginica"]
index = ["actual_setosa", "actual_versicolor", "actual_virginica"]

#Make dataframe out of confusion matrix

cm_df = 

cm_df

The accuracy score told us that our model correctly classified 88% of the testing dataset.

What does the confusion matrix tell us that the accuracy score doesn't? What is the accuracy if we ignore setosa?

In [None]:
#Non setosa accuracy score



The 13, 15, and 16 on the diagonal are the counts that we correctly identified.

The two 3s off the diagonal are the counts that we incorrectly identified.

<br>

If we want to calculate the recall and precision scores then we need to designate one of the classes as true and the rest as false

![ee](https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAykAAAAJDUzZGVlZGM0LTUyNWMtNDNjZi1hNjkxLTdlZjEzY2VmMmM4OQ.png)

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

Good blog post to refresh your memory on these metrics: http://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
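
To make the one-vs-rest idea concrete, here is a small sketch that computes per-class precision and recall directly from a 3x3 confusion matrix. The matrix `cm_demo` is made up: its counts echo the numbers quoted above, but where the two 3s sit is an assumption, so treat it as an illustration rather than the output of the model in this lesson. Rows are actual classes and columns are predicted classes, matching the layout sklearn's confusion_matrix returns.

```python
# Made-up 3-class confusion matrix: rows are actual, columns are predicted
labels = ["setosa", "versicolor", "virginica"]
cm_demo = [
    [13, 0, 0],
    [0, 15, 3],
    [0, 3, 16],
]

def recall_for(cm, k):
    # TP divided by everything that actually belongs to class k (row sum)
    return cm[k][k] / sum(cm[k])

def precision_for(cm, k):
    # TP divided by everything predicted as class k (column sum)
    return cm[k][k] / sum(row[k] for row in cm)

for k, name in enumerate(labels):
    print(name,
          "precision:", round(precision_for(cm_demo, k), 3),
          "recall:", round(recall_for(cm_demo, k), 3))
```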

In [None]:
#The recall score for setosa

tp_set = 

#Divide tp_set by the sum of values in the actual setosa row



In [None]:
#The precision score for setosa



#Divide tp_set by the sum of values in the predicted setosa column



We got perfect scores but that's boring, let's try this with versicolor

In [None]:
#The recall score for versicolor

tp_ver = 

#Divide tp_ver by the sum of values in the actual versicolor row



In [None]:
#The precision score for versicolor

#Divide tp_ver by the sum of values in the predicted versicolor column



Multi-class confusion matrix explained:
![s](https://3.bp.blogspot.com/-YpiS7AXxlgs/VEVrZGx5oaI/AAAAAAAAG1c/E8PdwoUamYw/s1600/multi-class-confusionmatrix.png)

[Source](http://text-analytics101.rxnlp.com/2014/10/computing-precision-and-recall-for.html)

## MNIST Dataset

A famous dataset that is used again and again in machine learning courses.
http://yann.lecun.com/exdb/mnist/

The handwritten digits of the MNIST dataset.
![a](https://kuanhoong.files.wordpress.com/2016/01/mnistdigits.gif)

In [None]:
#Load in dataset from sklearn
from sklearn import datasets
digits_dict = datasets.load_digits()
#Data description
digits_dict["DESCR"].split("\n")

In [None]:
#Load in data

data = 

target = 

In [None]:
#Look at data
data

In [None]:
#Look at target variable
target

In [None]:
# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print('Image Data Shape' , data.shape)
# Print to show there are 1797 labels (integers from 0–9)
print("Label Data Shape", target.shape)

In [None]:
#View data of single digit

data[1].reshape(8, 8)

Can you tell what number this data is showing?

Let's visualize the digits

In [None]:
#Use matplotlib to view the images
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(data[0:5], target[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
    plt.title('Training: %i\n' % label, fontsize = 20)

Now let's use a decision tree to model this data

In [None]:

#Step 1. train test split with random state = 19 and test size = 0.4

X_train, X_test, y_train, y_test = 

#Step 2. Fit DecisionTreeClassifier with max_depth 10 on training data



#Step 3. Make predictions on the X_test using the model

preds = 

#Step 4. Score predictions



Confusion matrix time

In [None]:
#Create confusion matrix from predictions and y_test

cm_digits = 

cm_digits

In [None]:
#Heatmap version

plt.figure(figsize=(8, 8))
sb.heatmap(cm_digits, annot=True)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels");

What does this tell us?

Get the real MNIST dataset

In [None]:
# fetch_mldata was removed from scikit-learn; fetch_openml is its replacement
# from sklearn.datasets import fetch_openml
# mnist = fetch_openml("mnist_784", version=1)

Let's see some of the images we got wrong.

In [None]:
#Identify indices of wrong predictions
index = 0
false_index = []

In [None]:
#Iterate to find indices of bad predictions
false_preds = y_test != preds

for i, e in enumerate(false_preds):
    if e: 
        false_index.append(i)

In [None]:
#Plot five incorrectly classified numbers
plt.figure(figsize=(20,4))
for plotIndex, badIndex in enumerate(false_index[0:5]):
    plt.subplot(1, 5, plotIndex + 1)
    plt.imshow(np.reshape(X_test[badIndex], (8,8)), cmap=plt.cm.gray)
    plt.title("Predicted: {}, Actual: {}".format(preds[badIndex], y_test[badIndex]), fontsize = 15)

Class exercise time: 

Which digit did our model do the best at identifying? What digit was the worst?

In [None]:
#Accuracy score answer 


In [None]:
#Precision score answer 


In [None]:
#Recall score answer 


## Imbalanced Class Machine Learning

As is often the case in real-world machine learning, there is a huge imbalance in the class distribution of the variable we're trying to predict.

Class False: 98.23%

Class True: 1.77%

Illustrated example
![www](http://www.svds.com/wp-content/uploads/2016/08/messy.png)

The main problem that imbalanced machine learning projects present is that they render the accuracy score almost irrelevant.

If our null accuracy is 99.5%, then we don't have much room for improvement, which is why we need to rely on metrics such as precision and recall.

<br>

Recall aka sensitivity aka the True Positive Rate: The number of correct positive predictions divided by number of positive instances. 

Recall = TP/(TP + FN)

<br>

Precision: The number of correct positive predictions divided by number of positive predictions. 

Precision = TP/(TP+ FP)
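
As a quick illustration of how these formulas sit next to accuracy and null accuracy, the sketch below computes all four from raw confusion-matrix counts. The function name `summarize_counts` and the counts passed to it are hypothetical, chosen only for illustration.

```python
def summarize_counts(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Null accuracy: always predict the most frequent actual class
        "null_accuracy": max(tn + fp, fn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts, chosen only for illustration
demo_scores = summarize_counts(tn=900, fp=30, fn=20, tp=50)

for name, value in demo_scores.items():
    print(name, round(value, 3))
```

With these made-up counts the accuracy sits only slightly above the null accuracy, while precision and recall show how many of the positive cases the predictions actually get right.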

<br>

Imagine a confusion matrix with 1000 TNs, 20 FPs, 15 FNs, and 25 TPs. What would the accuracy, precision, and recall scores be?

In [None]:
#Assign variables

tn = 1000.
fp = 20.
fn = 15.
tp = 25.

In [None]:
#Accuracy score



Pretty good score! Or is it???

In [None]:
#What's the null accuracy?



We barely beat the null accuracy. Now let's calculate precision and recall

In [None]:
#Precision



In [None]:
#Recall


Do we have a good model or not?

Let's move onto the real thing by modeling credit card fraud data

https://www.kaggle.com/dalpozz/creditcardfraud

In [None]:
path = "../data/fraud.csv"
#Read data into notebook
fraud = pd.read_csv(path, index_col=[0])

#Drop the Time column
fraud = fraud.drop("Time", axis = 1)
fraud.head()

The V features are principal components which we'll learn about next Thursday, in the meantime think of them as hidden features.

Let's see how imbalanced the classes are

In [None]:
#Value counts without normalize


In [None]:
#Value counts with normalize


That is pretty imbalanced, wouldn't you say?

In [None]:
#Quick EDA to find the relationship between amount and fraud status



What does this say about the relationship between the amount of the transaction and fraudulent status?

Before we get into modeling, which should we try to minimize: False Negatives or False Positives? Why?

Let's do some modeling.

Train a logistic regression model and evaluate it on the testing dataset using accuracy, recall, and precision scores

In [None]:
#Assign variables

X = 

y = 

In [None]:
#Make a train test split with random state = 25 and test size = 0.4



In [None]:
#Fit logistic regression model on training data

lr = 


In [None]:
#Null accuracy of testing data



In [None]:
#Evaluate it on testing set using accuracy score

preds = 


In [None]:
#Evaluate it on testing set using precision score



In [None]:
#Evaluate it on testing set using recall score



In [None]:
#Confusion matrix



What do these metrics tell us about our dataset?

Cross validation time

In [None]:
#Cross validate using precision score



In [None]:
#Cross validate using recall score



In [None]:
#Cross validate using roc_auc score



Make a roc curve

In [None]:
#Derive probabilities of class 1 from the test set

#Pass the true test labels aka y_test first, then the test_probs variable, into the roc_curve function
fpr, tpr, thres = 
#Outputs the fpr, tpr, for varying thresholds

In [None]:
#Plot ROC_curve again but this time annotate the curve with the threshold value
plt.figure(figsize=(15,11))
plt.plot(fpr, tpr, linewidth=12, alpha = .6)
plt.plot([0,1], [0,1], "--", alpha=.6)
for label, x, y in zip(thres[::10], fpr[::10], tpr[::10]):
    plt.annotate("{0:.2f}".format(label), xy=(x, y ), size = 15)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

How do you interpret this chart?
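
One way to read the annotated curve: each point is the (FPR, TPR) pair you get by predicting class 1 whenever the predicted probability is at least the annotated threshold. The sketch below shows this on four made-up observations; `y_true_demo` and `probs_demo` are hypothetical values, not output from the fraud model.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1
y_true_demo = [0, 0, 1, 1]
probs_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve expects the true labels first, then the scores
fpr_demo, tpr_demo, thres_demo = roc_curve(y_true_demo, probs_demo)
auc_demo = roc_auc_score(y_true_demo, probs_demo)

for f, t, th in zip(fpr_demo, tpr_demo, thres_demo):
    print("threshold:", th, "fpr:", f, "tpr:", t)
print("AUC:", auc_demo)
```

Lowering the threshold moves you up and to the right along the curve: more true positives are caught, at the cost of more false positives.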

### Imbalanced class techniques

![eop](https://chrisalbon.com/machine-learning/handling_imbalanced_classes_with_downsampling/Downsampling_print.png)
Source: Chris Albon

Let's go ahead and apply downsampling to our training dataset

In [None]:
#How many observations of the true class are there in the training dataset?



In [None]:
#Assign number of fraud class in training data to variable
N = 
N

In [None]:
#Import resample function from sklearn
from sklearn.utils import resample

fraud_maj = fraud[fraud.Class==0]
fraud_min = fraud[fraud.Class==1]
 
# Downsample majority class
fraud_majority_downsampled = resample(fraud_maj, 
                                 replace=False,     # Do not sample with replacement
                                 n_samples=N,    # to match minority class
                                 random_state=123) # reproducible results
 
# Combine downsampled majority class with minority class
fraud_ds = pd.concat([fraud_majority_downsampled, fraud_min])
 
# Display new class counts
fraud_ds.Class.value_counts()

Perfectly balanced classes. Let's use cross validation to see how well our model does.

Use accuracy, recall, precision, and roc_auc metrics

In [None]:
X_ds = fraud_ds.drop("Class", axis = 1)
y_ds = fraud_ds.Class

#Accuracy score


In [None]:
#Precision



In [None]:
#Recall


In [None]:
#Roc Auc


What does this tell us about our model and our data?

![aw](https://chrisalbon.com/machine-learning/handling_imbalanced_classes_with_upsampling/Upsampling_print.png)

In [None]:

#Number non fraud observations
N = fraud.Class.value_counts()[0]

# Upsample minority class
fraud_minority_upsampled = resample(fraud_min, 
                                 replace=True,     # Sample with replacement
                                 n_samples=N,    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
fraud_us = pd.concat([fraud_minority_upsampled, fraud_maj])
 
# Display new class counts
fraud_us.Class.value_counts()

Cross validation again

In [None]:
#Assign X and y
X_us = fraud_us.drop("Class", axis = 1)
y_us = fraud_us.Class

#Accuracy score


In [None]:
#Precision score


In [None]:
#Recall


In [None]:
#Roc auc score


With both techniques, our interpretation of the accuracy score is more meaningful.

![wee](https://svds.com/wp-content/uploads/2016/08/ImbalancedClasses_fig5.jpg)

However, there is an issue here: can a model trained on balanced data work well with imbalanced data? Let's find out!


We're going to train a logistic regression model on a downsampled training dataset and then apply it to an imbalanced testing dataset.
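
The general shape of that workflow can be sketched on synthetic data before we apply it to the fraud dataset. Everything below is an assumption made for illustration: the data comes from `make_classification`, the `toy_*` names are hypothetical, and the `stratify` argument is an addition that is not used elsewhere in this lesson.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 95% class 0); not the fraud dataset
X_toy, y_toy = make_classification(n_samples=2000, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   weights=[0.95, 0.05], random_state=0)
toy = pd.DataFrame(X_toy, columns=["f0", "f1", "f2", "f3", "f4"])
toy["Class"] = y_toy

# Split first so the testing data keeps its original imbalance
toy_train, toy_test = train_test_split(toy, test_size=0.4, random_state=23,
                                       stratify=toy["Class"])

# Downsample the majority class in the training data only
toy_maj = toy_train[toy_train.Class == 0]
toy_min = toy_train[toy_train.Class == 1]
toy_maj_ds = resample(toy_maj, replace=False, n_samples=len(toy_min),
                      random_state=123)
toy_train_ds = pd.concat([toy_maj_ds, toy_min])

# Fit on the balanced training data, evaluate on the imbalanced testing data
toy_lr = LogisticRegression(max_iter=1000)
toy_lr.fit(toy_train_ds.drop("Class", axis=1), toy_train_ds.Class)
toy_preds = toy_lr.predict(toy_test.drop("Class", axis=1))

print(toy_train_ds.Class.value_counts())
print("precision:", precision_score(toy_test.Class, toy_preds))
print("recall:", recall_score(toy_test.Class, toy_preds))
```

The key design choice is the order of operations: resampling happens after the split, so no testing rows influence the balanced training set.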

In [None]:
#reassign variables

X = 

y = 

In [None]:
#Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.4, 
                                                    random_state = 23)

In [None]:
# X_train

Downsample data

In [None]:
#Combine the two training datasets
train = 


In [None]:
#Class count


In [None]:
N = train.Class.value_counts()[1]

fraud_maj = train[train.Class==0]
fraud_min = train[train.Class==1]
 
# Downsample majority class
fraud_majority_downsampled = resample(fraud_maj, 
                                 replace=False,     # Do not sample with replacement
                                 n_samples=N,    # to match minority class
                                 random_state=123) # reproducible results
 
# Combine downsampled majority class with minority class
fraud_ds = pd.concat([fraud_majority_downsampled, fraud_min])
 
# Display new class counts
fraud_ds.Class.value_counts()


Train Logistic Regression on downsampled data and evaluate it on testing data

In [None]:
#Assign X and y
fraud_ds_X = 
fraud_ds_y = 
#Initialize



In [None]:
#Null accuracy 



In [None]:
#Evaluate on testing dataset

preds = 


In [None]:
#precision



In [None]:
#recall



In [None]:
#Confusion matrix



What's your interpretation now?

<br>
Let's try the upsampling technique to see if that produces a better model.

In [None]:
N = train.Class.value_counts()[0]

fraud_maj = train[train.Class==0]
fraud_min = train[train.Class==1]
 
# Upsample minority class
fraud_minority_upsampled = resample(fraud_min, 
                                 replace=True,     # Sample with replacement
                                 n_samples=N,    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
fraud_us = pd.concat([fraud_minority_upsampled, fraud_maj])
 
# Display new class counts
fraud_us.Class.value_counts()

In [None]:
#Assign X and y
fraud_us_X = 
fraud_us_y = 
#Initialize



In [None]:
#Evaluate on testing dataset

preds = 



In [None]:
#precision



In [None]:
#recall



In [None]:
#Confusion matrix



What do we make of these results??

Here's the good news: we can set the class_weight parameter in our models to "balanced".

From sklearn:

"The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))."



We'll cross validate with Logistic Regression, Decision Trees, and RandomForest models with the class_weight parameter set to "balanced."

But first let's calculate those weights

In [None]:
#Class 0 weight
(y.shape[0])/float((2*y.value_counts()[0]))

In [None]:
#Class 1 weight
(y.shape[0])/float((2*y.value_counts()[1]))
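
The two cells above apply the quoted formula by hand. As a cross-check, the sketch below computes the same weights on a tiny made-up label array, once with the formula and once with sklearn's `compute_class_weight` helper. The array `y_demo` is hypothetical and stands in for the fraud labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: eight of class 0, two of class 1
y_demo = np.array([0] * 8 + [1] * 2)

n_samples = y_demo.shape[0]
n_classes = len(np.unique(y_demo))

# Formula quoted from the sklearn docs
manual_weights = n_samples / (n_classes * np.bincount(y_demo))

# Same calculation via sklearn's helper
sk_weights = compute_class_weight(class_weight="balanced",
                                  classes=np.unique(y_demo), y=y_demo)

print(manual_weights)
print(sk_weights)
```

The rarer class receives the larger weight, so mistakes on it cost more during fitting.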

In [None]:
#Logistic regression 

cross_val_score(LogisticRegression(class_weight="balanced"), 
                X, y, cv = 5, scoring="accuracy").mean()

In [None]:
#Decision Tree with max depth = 14

cross_val_score(DecisionTreeClassifier(class_weight="balanced", max_depth=14), 
                X, y, cv = 5, scoring="accuracy").mean()

In [None]:
#Random Forest with n_estimators = 40

cross_val_score(RandomForestClassifier(class_weight="balanced", n_estimators=40), 
                X, y, cv = 5, scoring="accuracy").mean()

# Resources

MNIST:

- https://github.com/grfiv/MNIST/blob/master/knn/knn.ipynb
- https://github.com/monsta-hd/ml-mnist
- https://www.youtube.com/watch?v=aZsZrkIgan0
- http://joshmontague.com/posts/2016/mnist-scikit-learn/

Multi-class:

- https://gallery.cortanaintelligence.com/Competition/Tutorial-Competition-Iris-Multiclass-Classification-2
- https://www.youtube.com/watch?v=6kzvrq-MIO0


Imbalanced classes:

- https://towardsdatascience.com/what-metrics-should-we-use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba
- https://svds.com/learning-imbalanced-classes/
- https://www.youtube.com/watch?v=X9MZtvvQDR4
- https://elitedatascience.com/imbalanced-classes
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/



# Class work

If we have enough time for class work, I want you to optimize RandomForest and AdaBoost models fit on the fraud and MNIST datasets. Use GridSearch!
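
As a possible starting point, here is a minimal grid search sketch on the sklearn digits data. The parameter grid is hypothetical and deliberately tiny so it runs quickly; the values are not tuned recommendations, and `digits_X`, `digits_y`, and `grid` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits_X, digits_y = load_digits(return_X_y=True)

# Hypothetical, deliberately tiny grid so the search runs quickly
param_grid = {"n_estimators": [10, 30], "max_depth": [5, 10]}

# Every combination in the grid is scored with 3-fold cross validation
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    cv=3, scoring="accuracy")
grid.fit(digits_X, digits_y)

print(grid.best_params_)
print(grid.best_score_)
```

For the fraud data, a scoring value such as "recall" or "roc_auc" would probably be more informative than accuracy, for the reasons covered in the imbalanced class section.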