# Classification with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import sklearn as skl

## Example 1: Who Lives or Dies in Game of Thrones?

In this section, we will use a dataset based on the popular book series (and HBO TV series) from George RR Martin, Game of Thrones. The dataset, made available through [Kaggle](https://www.kaggle.com/mylesoneill/game-of-thrones/data), contains information on character deaths. It has already been cleaned, and we will work with a sample of it for this analysis. 

Game of Thrones is known for killing off its characters abruptly. We will use machine learning methods to predict whether a character is alive or dead. 

In [None]:
got_data = pd.read_csv("https://www3.nd.edu/~jng2/GoT_Character_Deaths.csv")
print(got_data.shape)
got_data.head()

Note that the data also includes the 'Name' of the character and the 'Allegiances'. We will remove 'Name' because the name itself does not indicate whether the character will be alive or dead. We will also remove 'Allegiances' for now, as we do not yet know how to handle categorical data types. 

In [None]:
got_data.drop(['Name', 'Allegiances'], axis = 1, inplace=True)
got_data.head()

## Classification using Logistic Regression

In [None]:
## Split the input features and outcome variable

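# Pass axis by keyword: the positional form drop('dead', 1) is not accepted by recent pandas versions
got_data_X = got_data.drop('dead', axis=1)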
got_data_Y = got_data['dead']

In [None]:
got_data_X.head()

### `train_test_split()`: Method to split the data into train and test

We usually split the data into a training set, used to learn a classifier, and a test set, used to check how well the model performs on data it has not seen.

Important parameters of this method:

* **random_state**: Seed used by the random number generator to split the data. You must set this if you want reproducible results.
* **train_size**: Use a float to specify the fraction of the data used for training, e.g. 0.8 if you want 80%. 
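
As a quick check of what `random_state` does, the sketch below splits a small made-up array twice with the same seed and confirms that both calls return the same rows. The toy arrays `X_demo` and `y_demo` are assumptions for illustration only and are not part of the Game of Thrones data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, binary outcome
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Two independent calls with the same seed and the same train fraction
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, random_state=123, train_size=0.75)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, random_state=123, train_size=0.75)

# The same rows should land in train and test both times
same_split = (np.array_equal(X_train_a, X_train_b)
              and np.array_equal(X_test_a, X_test_b))
print(len(X_train_a), len(X_test_a), same_split)
```

Changing the seed, or leaving it unset, will generally move different rows into the test set, which is why results can differ from run to run.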

In [None]:
from sklearn.model_selection import train_test_split

got_train_X, got_test_X, got_train_Y, got_test_Y = train_test_split(got_data_X, got_data_Y, 
                                                                    random_state=123, train_size = 0.7)

In [None]:
print(len(got_data_X), len(got_train_X), len(got_test_X))

### Learn a classifier: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_regression_model = LogisticRegression()

log_regression_model.fit(got_train_X, got_train_Y)

### Predict on test data

In [None]:
got_predict_Y = log_regression_model.predict(got_test_X)

In [None]:
import sklearn.metrics as sklmetrics

sklmetrics.accuracy_score(got_test_Y, got_predict_Y)

### Confusion Matrix and plotting it

In [None]:
conf_mat = sklmetrics.confusion_matrix(got_test_Y, got_predict_Y, labels =[0,1])
conf_mat

In [None]:
sns.heatmap(conf_mat, square=True, annot=True, cbar = False, 
            xticklabels = ['Alive','Dead'], yticklabels = ['Alive','Dead'],
            fmt='g')
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

### Understanding the feature importance of the Logistic Regression

In [None]:
# Plot logistic regression coefficients as a measure of feature importance
# INPUT: model  - a fitted binary Logistic Regression classifier
#        Xnames - feature names, in the same order as the training columns
#        cls_nm - classifier name used in the plot title and the output file name
# OUTPUT: A horizontal bar plot of the coefficients, sorted by absolute value
def plot_feature_importance_coeff(model, Xnames, cls_nm="Classifier"):

    # One row per feature: its name and its (signed) coefficient
    imp_features = pd.DataFrame({'feature': list(Xnames),
                                 'importance': model.coef_.ravel()})
    imp_features['abs_importance'] = imp_features['importance'].abs()
    # Sort the features based on absolute value of importance
    imp_features = imp_features.sort_values(by='abs_importance', ascending=True)

    # Plot the signed coefficients; the largest magnitude ends up at the top
    plt.figure()
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
             color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'])
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    # Save a copy of the figure next to the notebook
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()

In [None]:
plot_feature_importance_coeff(log_regression_model, got_data_X.columns, cls_nm="Logistic Regression")

## Example 2: Bank Marketing Campaign Success or Failure

We will be using a dataset available from [UCI data repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#), that provides information on a phone campaign run by a bank to see if a customer can be converted to start a term deposit at the bank. We will only be using a smaller sample of the data. 

In [None]:
bank_data = pd.read_csv('https://www3.nd.edu/~jng2/bank_campaign_small.csv')
bank_data.head()

In [None]:
bank_data.dtypes

### Data Preprocessing Step - Dealing with categorical input variables

Note that there are three string variables (the data type is object). Looking at them, they are clearly categorical variables, and scikit-learn classifiers expect numeric inputs, so they cannot use string categories directly. To make them usable by classifiers, we will convert them to **dummy variables** (binary 0-1).

In [None]:
bank_data_2 = pd.get_dummies( bank_data, columns=['marital', 'education', 'contact'])

In [None]:
bank_data_2.head()

In [None]:
bank_data_2.describe()

<div class="alert alert-block alert-danger">
<h5>Remove one category for each categorical variable</h5>
<p>
As you saw, the `get_dummies` method introduces a column for each value of a categorical variable. It is **very important** to remove one column from each set of dummy variables for the models to work appropriately; otherwise the dummy columns in a set always sum to 1 and are perfectly collinear. 
</p>
<p>
For example, because the categorical variable `marital status` has three categories, we generated three dummy variables: marital_divorced, marital_married, and marital_single. We MUST remove one of these dummy variables. We can choose to remove marital_divorced.
    
The same applies for the education dummies and contact dummies.
</p>
</div> 

In [None]:
bank_data_2.drop(['marital_divorced','education_unknown','contact_unknown'], axis = 1, inplace=True)
bank_data_2.head()

In [None]:
# If you don't care which category is dropped, you can ask get_dummies to drop the first category from each categorical
# variable by setting the keyword drop_first = True.
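
To illustrate the `drop_first` keyword mentioned above, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration and are not the bank data.

```python
import pandas as pd

# Hypothetical toy table with one categorical column
toy = pd.DataFrame({'marital': ['single', 'married', 'divorced', 'married'],
                    'age': [25, 40, 52, 33]})

# One dummy column per category
all_dummies = pd.get_dummies(toy, columns=['marital'])

# Same call, but the first category (in sorted order) is dropped
dropped = pd.get_dummies(toy, columns=['marital'], drop_first=True)

print(list(all_dummies.columns))
print(list(dropped.columns))
```

Note that `drop_first=True` picks the dropped category for you, so use the explicit `drop` shown earlier when you want to control which category serves as the baseline.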

## Solving Class Imbalance problem with `class_weight='balanced'`

Notice below that the bank campaign dataset has many more failures than successes. In this case it is important to let the classifier know that it needs to handle the class imbalance. There are many ways to handle class imbalance, including oversampling and undersampling.

A simpler option is to create the classifier with the parameter `class_weight='balanced'`. The classifier then weights each class inversely to its frequency, so that mistakes on the rare class (here, successes) cost more than mistakes on the common class. 
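
To make the effect of `class_weight='balanced'` concrete, the sketch below computes the balanced weights for a made-up outcome vector with 8 failures and 2 successes. The array `y_toy` is an assumption for illustration; the weights follow the formula `n_samples / (n_classes * count_of_class)` documented by scikit-learn.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced outcome: 8 failures (0) and 2 successes (1)
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_toy)

# Weights that scikit-learn uses for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# The same weights computed by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class receives the larger weight, so each misclassified success contributes more to the loss than each misclassified failure.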

In [None]:
bank_data_2['success'].value_counts()

## Activity: Classification using Logistic Regression

Follow these steps
1. Separate X (input features) and Y (outcome)
2. Split into training data and test data. Use 70% of data for training
    * Verify if the data is appropriately split by checking the number of rows in each of the training and test data. 
3. Learn the Logistic Regression classifier to predict success or failure
4. Predict using the test data 
5. Provide accuracy score as well as plot the confusion matrix
    * Think about the consequence of False Positives and False Negatives
6. Plot variable importance for the classifier
    * Use `plot_feature_importance_coeff` for Logistic Regression
    
The first 2 steps have been done for you.

In [None]:
# Step 1: Separate X (input features) and Y (outcome)
bank_data_X = bank_data_2.drop('success', axis = 1)
bank_data_Y = bank_data_2['success']

In [None]:
# Step 2: Split into training data and test data
bank_train_X, bank_test_X, bank_train_Y, bank_test_Y = train_test_split(bank_data_X, 
                                                                        bank_data_Y, 
                                                                        random_state = 123, 
                                                                        train_size = 0.7)

In [None]:
# Next steps here...

## Parameters of the Classification models

Every classification model has a set of **parameters** that need to be set *before* the model can be learned. These are also called hyperparameters. In the following, we explain a couple of important parameters for Logistic Regression. 

* **Logistic Regression**: When you run logistic regression, having a lot of input variables can lead to overfitting. One classic machine learning method to guard against this is **regularization**, which adds a penalty on the size of the coefficients and thereby **penalizes models that use too many input variables**. With an `'l1'` penalty, some coefficients are driven to exactly zero, so regularization acts as a statistically rigorous form of **variable selection**, rather than selecting variables manually in an ad hoc or arbitrary way. With an `'l2'` penalty, coefficients are shrunk toward zero but usually stay nonzero. There are two parameters with respect to regularization in the LogisticRegression classifier:
    * **penalty**: This is the kind of penalty you wish to apply. The most common ones are `'l1'` and `'l2'`. In recent scikit-learn versions, `'l1'` needs a solver that supports it, such as `solver='liblinear'`.
    * **C**: The inverse of regularization strength. You can think of C as the tolerance for many inputs. The larger the value of C, the more tolerant the model is of using more input variables. The smaller the value of C, the stronger the regularization and hence the fewer input variables are allowed. The most common values are from [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
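
The effect of `C` can be seen by counting how many coefficients survive an `'l1'` penalty. The following is a minimal sketch on synthetic data; the dataset from `make_classification`, the choice of `solver='liblinear'`, and the variable names are assumptions for illustration, not part of the bank or Game of Thrones examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only 3 carry signal
X_syn, y_syn = make_classification(n_samples=300, n_features=20,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

# Fit an l1-penalized model for several values of C and
# count the coefficients that are not exactly zero
nonzero_counts = {}
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    clf.fit(X_syn, y_syn)
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

Small values of `C` should leave few nonzero coefficients, while large values let most of the 20 features into the model.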

## Activity

Fit LogisticRegression model for various parameter choices. **Which parameter combination has best accuracy score**? Select the model parameter that has good accuracy on the test data. 

**Important Note**: This way of selecting the model parameters is strongly recommended against. **YOU SHOULD NEVER DO THIS**. This is only for learning. See the section on **cross-validation** below for the correct way to select model parameters. 


An example of creating a model with specific parameters is shown below. 

```python
# Logistic regression model with penalty = 'l1' and inverse of regularization strength C = 1
# The 'l1' penalty needs a solver that supports it, e.g. 'liblinear'
log_reg_with_parameter = LogisticRegression(class_weight='balanced', penalty='l1', C=1, solver='liblinear')
```

Use `bank_data_2` as your starting dataset. Remember to do all the steps of the classification and choose the model that has the highest accuracy. Remember the outcome variable of interest is 'success'. 

First two steps are done for you. 

In [None]:
# Step 1: Split the input variables (X) and outcome variable (Y)

bank_data_X = bank_data_2.drop(['success'], axis = 1)
bank_data_Y = bank_data_2['success']

In [None]:
# Step 2: Split into training and test data.  Use 70% of the data for training. 
# PLEASE set random_state = 123
bank_train_X, bank_test_X, bank_train_Y, bank_test_Y = train_test_split(bank_data_X, 
                                                                        bank_data_Y, 
                                                                        random_state = 123, 
                                                                        train_size = 0.7)

In [None]:
# Step 3a: Learn the classifier with various parameters on the training data for Logistic Regression.  
# Try these combinations

# penalty = 'l1', C = 0.1
# penalty = 'l1', C = 1
# penalty = 'l1', C = 10
# penalty = 'l2', C = 0.1
# penalty = 'l2', C = 1
# penalty = 'l2', C = 10

# Set class_weight as balanced for all runs.


In [None]:
# Step 4: Use the test data to predict the Y and then print the accuracy_score for Logistic Regression 


<div class="alert alert-block alert-danger">
<h5>DO NOT SELECT HYPER PARAMETERS BASED ON TEST DATA </h5>
<p> Above exercise is NOT the correct way to select the best hyper parameters!</p>
<p> Rather, we must use cross validation techniques discussed below.</p>
</div> 

## Cross Validation to Select the Parameters of a Model using `GridSearchCV()`


As we said above, selecting the model parameters has to be done by cross validation. See lecture notes for more details on cross validation. [Here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) is a nice discussion.

Scikit-Learn provides a wide variety of cross validation techniques, but the most common way is the GridSearchCV() method.

Grid search is the process of trying every possible combination of the candidate parameter values and selecting the best combination. For example, take

`params = {'penalty':['l1','l2'], 'C':[0.1,1,10]}`

This gives 6 different combinations of parameters, listed below, which together form a grid. Hence, it is called grid search.
```
penalty = 'l1', C = 0.1
penalty = 'l1', C = 1
penalty = 'l1', C = 10
penalty = 'l2', C = 0.1
penalty = 'l2', C = 1
penalty = 'l2', C = 10
```
**Finally you can select the best model, based on a metric (usually accuracy).**
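
The grid above can be enumerated programmatically with scikit-learn's `ParameterGrid` helper. This is a small sketch for illustration; the variable names `demo_params` and `combos` are assumptions.

```python
from sklearn.model_selection import ParameterGrid

# The same parameter dictionary as in the example above
demo_params = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}

# ParameterGrid expands the dictionary into every combination
combos = list(ParameterGrid(demo_params))
for combo in combos:
    print(combo)

print(len(combos), "combinations")
```

The number of combinations is the product of the list lengths, so the grid grows quickly as more parameters or values are added.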

GridSearchCV() is a very useful method that automates this process of selecting the model with the best set of parameters. The method has four main parameters:

* `estimator`: The classifier whose parameters you wish to learn.
* `param_grid`: Dictionary (dict) of parameters and their values to be searched over.
* `cv`: How many cross-validation folds you want to use. Usually 3 for smaller data and 10 for large data.
* `n_jobs`: How many jobs to run in parallel. Usually you specify this as 1. You can parallelize the search by specifying a number greater than 1. Do not parallelize on Vocareum or you may end up stalling the machine so that it has to be rebooted. For more experienced students, or if you have Python installed on your own machine, check the number of processing cores of your machine before you increase the number.


In [None]:
from sklearn.model_selection import GridSearchCV

**Important Notes**
1. Usually for large datasets, the above `GridSearchCV` method takes a long time. You might have to first run with a limited set of parameters before you increase the number of possible values for the parameters. 
2. Every time you run the GridSearchCV, you might find a different combination of parameters to be the best one. This is another issue with **consistency** of machine learning algorithms. Addressing this is a whole topic in itself.
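
One way to judge how stable the chosen combination is would be to look at the cross-validated score of every candidate, not just the winner. The sketch below does this on synthetic data using the `cv_results_` attribute of a fitted `GridSearchCV`; the dataset, the small grid, and `solver='liblinear'` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)

# A small grid keeps the search fast
grid_demo = GridSearchCV(LogisticRegression(solver='liblinear'),
                         param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
                         cv=3, n_jobs=1)
grid_demo.fit(X_cv, y_cv)

# One row per parameter combination, best rank first
results = pd.DataFrame(grid_demo.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
results = results.sort_values('rank_test_score')
print(results)
```

When several rows have nearly identical `mean_test_score` values, the "best" combination is effectively a tie, which is one reason the winner can change between runs or data splits.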

## Activity

Learn the parameters of LogisticRegression classifier using `GridSearchCV()` method. 
1. Specify the dictionary of the parameters for LogisticRegression and the values 
2. Set GridSearchCV for LogisticRegression classifier. 
    * Set cv=10
3. Use the training data and fit the GridSearchCV so that it learns the model and the best parameter
4. Select the best model
5. Verify the accuracy and confusion matrix on the testing data. 
6. Present the importance of each characteristic for Logistic Regression using the `plot_feature_importance_coeff` function defined earlier. 

In [None]:
# Step 0
# The grid below includes penalty='l1', so pick a solver that supports it
model = LogisticRegression(class_weight='balanced', solver='liblinear')

# Step 1: 
params = {'penalty':['l1','l2'],
          'C': [0.01, 0.1, 1, 10, 100]}

# Prepare the GridSearch for cross validation
grid_search_log_reg = GridSearchCV(model, # Note the model is the LogisticRegression defined in Step 0
                                   param_grid=params, # The parameters to search over. 
                                   cv=10 # How many hold out sets to use
                                   )

# Do the cross validation on the training data 
grid_search_log_reg.fit(bank_train_X, bank_train_Y)

# Select the best model

best_log_reg_cv = grid_search_log_reg.best_estimator_

# Print the best parameter combination 
print(grid_search_log_reg.best_params_)

In [None]:
# Finally test the performance of the best model on the test data

bank_predict_Y = best_log_reg_cv.predict(bank_test_X)

#Print the accuracy 
print('Accuracy ', sklmetrics.accuracy_score(bank_test_Y, bank_predict_Y))

# Confusion matrix
conf_mat = sklmetrics.confusion_matrix(bank_test_Y, bank_predict_Y)

print('\n', conf_mat)

sns.heatmap(conf_mat, fmt='g',square=True, annot=True, cbar = False, 
            xticklabels = ['Failure','Success'], 
            yticklabels = ['Failure','Success'])
plt.xlabel("Predicted Value")
plt.ylabel("True Value")
plt.show()

In [None]:
plot_feature_importance_coeff(best_log_reg_cv, bank_data_X.columns, cls_nm='Logistic Regression')

## Metrics for evaluating the performance of classification

Until now, we have always used `accuracy_score` to verify the performance of the classifier on the test data. However, accuracy is often not the most informative metric, especially with imbalanced classes such as the bank data, where always predicting the majority class already yields a high accuracy. Other metrics are listed below.

1. Precision, Recall, F1 Score
2. Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curve
3. Area under Precision-Recall Curve
4. Specificity and Sensitivity
5. Positive predictive value
6. ... many more
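
Several of these metrics are available in `sklearn.metrics`. The sketch below computes them on a small hand-made example; the arrays `y_true`, `y_pred`, and `y_score` are made-up values for illustration and do not come from either dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Hypothetical true labels, predicted labels, and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.2, 0.4, 0.7, 0.9, 0.8, 0.45, 0.35])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # TP / (TP + FP), also called positive predictive value
recall = recall_score(y_true, y_pred)         # TP / (TP + FN), also called sensitivity
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)          # uses the scores, not the hard predictions

# Specificity is not a built-in scorer, so derive it from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

print(accuracy, precision, recall, f1, auc, specificity)
```

In this example the classifier finds only half of the true positives (recall 0.5) even though its accuracy is 0.7, which shows how accuracy alone can hide poor performance on the positive class.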


## Many more classification techniques

We discussed the most basic classifier, logistic regression. Another basic classifier is the decision tree, which we did not discuss. In addition, there are many other classifiers you can use to improve your classification accuracy. Below are three widely used classification methods (the list is by no means exhaustive):

* MLPClassifier: Multi Layer Perceptron model. This is a very basic model that is a primer to deep learning neural networks. 
    ```python
    from sklearn.neural_network import MLPClassifier
    ```
* GradientBoostingClassifier: Learns many trees in sequence, where each new tree tries to correct the errors of the trees before it. 
    ```python
    from sklearn.ensemble import GradientBoostingClassifier
    ```
* RandomForestClassifier: Learns many trees on random subsets of the data and features, and combines their predictions. 
    ```python
    from sklearn.ensemble import RandomForestClassifier
    ```
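
All three classifiers follow the same `fit` / `predict` pattern used for logistic regression. The sketch below fits each one on synthetic data and reports test accuracy; the dataset, the `random_state` values, and `max_iter=1000` are assumptions for illustration. Note that the scikit-learn class name for gradient boosting is `GradientBoostingClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data standing in for a real dataset
X_all, y_all = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all,
                                          random_state=123, train_size=0.7)

# Same workflow for every classifier: create, fit, score
candidates = {
    'MLP': MLPClassifier(max_iter=1000, random_state=0),
    'GradientBoosting': GradientBoostingClassifier(random_state=0),
    'RandomForest': RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # accuracy on the test split

print(scores)
```

Each of these models has its own hyperparameters, which can be tuned with `GridSearchCV` in the same way as for logistic regression.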
