# Classification with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import sklearn as skl

## Game of Thrones Dataset

In this section, we will use a dataset based on the popular book series by George R. R. Martin (and its HBO TV adaptation), Game of Thrones. The dataset, made available through [Kaggle](https://www.kaggle.com/mylesoneill/game-of-thrones/data), has information on character deaths. It has been cleaned, and we will work with a sample of it for this analysis.

Game of Thrones is known for abruptly killing off its characters. We will use machine learning methods to predict whether a character is alive or dead.

In [None]:
got_data = pd.read_csv("./data/GoT_Character_Deaths.csv")
print(got_data.shape)
got_data.head()

Note that the data also includes the 'Name' of the character and their 'Allegiances'. We will remove 'Name' because the name itself does not indicate whether the character will be alive or dead. We will also remove 'Allegiances' **for now, as we do not yet know how to handle categorical data types**. We will handle categorical data types in the next class.

In [None]:
got_data.drop(['Name', 'Allegiances'], axis = 1, inplace=True)
got_data.head()

## Classification using Logistic Regression

In [None]:
## Split the input features and outcome variable

got_data_X = got_data.drop('dead', axis=1)
got_data_Y = got_data['dead']

In [None]:
got_data_X.head()

### `train_test_split()`: Method to split the data into train and test

We usually split the data into a training set, used to learn a classifier, and a test set, used to validate how good our model is.

Important parameters of this method:

* **random_state**: Seed used by the random number generator to shuffle and split the data. Fixing it makes the split reproducible.
* **train_size**: A float between 0 and 1 specifies the fraction of the data to use for training.
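
As a minimal sketch of how these two parameters behave, the toy example below splits ten made-up rows with `train_size=0.7` and checks that reusing the same `random_state` reproduces the same split. The names `toy_X` and `toy_y` and their values are assumptions for illustration only and are not part of the GoT data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, binary outcome
toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series([0, 1] * 5, name="outcome")

# train_size=0.7 asks for roughly 70% of the rows in the training set
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, random_state=42, train_size=0.7
)
print(len(toy_X), len(toy_train_X), len(toy_test_X))

# Splitting again with the same seed selects the same training rows
toy_train_X2, _, _, _ = train_test_split(
    toy_X, toy_y, random_state=42, train_size=0.7
)
same_split = toy_train_X.index.equals(toy_train_X2.index)
print("Same split with the same seed:", same_split)
```

Changing `random_state` to a different integer would generally produce a different, but equally valid, split.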

In [None]:
from sklearn.model_selection import train_test_split

got_train_X, got_test_X, got_train_Y, got_test_Y = train_test_split(got_data_X, got_data_Y, random_state=42, train_size = 0.7)

In [None]:
print(len(got_data_X), len(got_train_X), len(got_test_X))

### Learn a classifier: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_regression_model = LogisticRegression()

log_regression_model.fit(got_train_X, got_train_Y)

### Predict on test data

In [None]:
got_predict_Y = log_regression_model.predict(got_test_X)

In [None]:
import sklearn.metrics as sklmetrics

sklmetrics.accuracy_score(got_test_Y, got_predict_Y)

### Confusion Matrix and plotting it

In [None]:
conf_mat = sklmetrics.confusion_matrix(got_test_Y, got_predict_Y, labels =[0,1])
conf_mat

In [None]:
sns.heatmap(conf_mat, square=True, annot=True, cbar = False, xticklabels = ['Alive','Dead'], yticklabels = ['Alive','Dead'])
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

### Understanding the feature importance of the Logistic Regression

In [None]:
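# Plot the coefficients of a fitted model as feature importance
# INPUT: model  - fitted Logistic Regression classifier (binary outcome)
#        Xnames - feature names, in the same order as the training columns
#        cls_nm - classifier name, used in the plot title and output file name
# OUTPUT: A horizontal bar plot of the coefficients, sorted by absolute value
def plot_feature_importance_coeff(model, Xnames, cls_nm = "Model"):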

    imp_features = pd.DataFrame(np.column_stack((Xnames, model.coef_.ravel())), columns = ['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # Sort the features based on absolute value of importance
    imp_features = imp_features.sort_values(by = 'abs_importance', ascending = True)
    
    # Plot the coefficients as horizontal bars
    plt.figure()
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
            color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'], )
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout() 
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()

In [None]:
plot_feature_importance_coeff(log_regression_model, got_data_X.columns, cls_nm="Logistic Regression")

## Classification using Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dec_tree_model = DecisionTreeClassifier()

dec_tree_model.fit(got_train_X, got_train_Y)

In [None]:
got_predict_Y = dec_tree_model.predict(got_test_X)

print(sklmetrics.accuracy_score(got_test_Y, got_predict_Y))

conf_mat = sklmetrics.confusion_matrix(got_test_Y, got_predict_Y, labels =[0,1])
print(conf_mat)

sns.heatmap(conf_mat, square=True, annot=True, cbar = False, xticklabels = ['Alive','Dead'], yticklabels = ['Alive','Dead'])
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

### Understanding the feature importance of the Decision Tree

In [None]:
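# Plot the feature importance of a tree-based model
# INPUT: model  - fitted tree-based classifier (exposes feature_importances_)
#        Xnames - feature names, in the same order as the training columns
#        cls_nm - classifier name, used in the plot title and output file name
# OUTPUT: A horizontal bar plot of the features, sorted by importance

def plot_feature_importance(model, Xnames, cls_nm = "Model"):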

    # Measuring important features
    imp_features = pd.DataFrame(np.column_stack((Xnames, model.feature_importances_)), columns = ['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # Sort the features based on absolute value of importance
    imp_features = imp_features.sort_values(by = 'abs_importance', ascending = True)
    
    # Plot the feature importances of the tree
    plt.figure()
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
            color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'], )
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout() 
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()

In [None]:
plot_feature_importance(dec_tree_model, got_data_X.columns, cls_nm='Decision Tree Classifier')

## Activity

We will be using a dataset available from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#) that provides information on a phone campaign run by a bank to see whether a customer can be persuaded to open a term deposit at the bank. We will use only a sample of the data.

In [None]:

bank_data = pd.read_csv('./data/bank_campaign_small.csv')
bank_data.head()

In [None]:
bank_data.dtypes

### Activity: Data Preprocessing Step - Remove categorical input variables

Note that some input variables are categorical (their data type is `object`), and the classifiers cannot handle that data type directly. So, we will remove them for now. Later, we will learn how to handle categorical input variables.
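
One possible way to do this, sketched on a made-up frame rather than on `bank_data` itself, is to keep only the numeric columns with `select_dtypes`, which leaves out the text (categorical) columns. The frame `toy_bank` and its column names are hypothetical stand-ins, so the real columns may differ.

```python
import pandas as pd

# Hypothetical toy frame standing in for bank_data; column names are illustrative only
toy_bank = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["admin", "technician", "retired"],   # text column (categorical)
    "balance": [1200.0, -50.0, 300.5],
    "marital": ["single", "married", "married"], # text column (categorical)
    "outcome": [0, 1, 0],
})
print(toy_bank.dtypes)

# Keep only numeric columns; the text columns are left out
toy_bank_numeric = toy_bank.select_dtypes(include="number")
print(toy_bank_numeric.columns.tolist())
```

Selecting by `include="number"` avoids listing the categorical columns by hand, at the cost of also dropping any other non-numeric column that might be present.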

### Activity: Classification using Logistic Regression and Decision Trees

Follow these steps:
1. Separate X (input features) and Y (outcome)
2. Split into training data and test data. Use 70% of data for training
    * Verify that the data is split appropriately by checking the number of rows in the training and test data.
3. Learn the following two classifiers to predict success or failure
    * Logistic Regression
    * Decision Tree 
4. Predict using the test data for both the classifiers
5. Report the accuracy score and plot the confusion matrix
    * Think about the consequences of False Positives and False Negatives
6. Provide the variable importance for each classifier
    * Use `plot_feature_importance_coeff` for Logistic Regression
    * Use `plot_feature_importance` for Decision Tree
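
To help with step 5, here is a small sketch of how false positives and false negatives can be read off a confusion matrix. The labels in `toy_true_y` and `toy_pred_y` are invented for illustration (1 = success, 0 = failure) and are not results from the bank data.

```python
import sklearn.metrics as sklmetrics

# Hypothetical true and predicted labels: 1 = success, 0 = failure
toy_true_y = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred_y = [0, 1, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1], rows are the true classes and columns are the predicted classes
toy_conf_mat = sklmetrics.confusion_matrix(toy_true_y, toy_pred_y, labels=[0, 1])

# Flattening row by row gives: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = toy_conf_mat.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Accuracy is the share of correct predictions (the diagonal of the matrix)
toy_accuracy = (tn + tp) / toy_conf_mat.sum()
print("Accuracy:", toy_accuracy)
```

In the campaign setting, a false positive would mean predicting a success for a customer who does not convert, while a false negative would mean missing a customer who would have converted. Which error is more costly depends on how the predictions are used.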