# More Classification with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import sklearn as skl
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics as sklmetrics

## Bank Campaign Dataset

We will use a dataset from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#) that describes a phone campaign run by a bank to see whether a customer can be converted to hold a term deposit at the bank. We will use only a sample of the data.

In [None]:
bank_data = pd.read_csv('./data/bank_campaign_small.csv')
bank_data.head()

In [None]:
bank_data.dtypes

In [None]:
print(bank_data['success'].value_counts())

## Solving Class Imbalance problem with `class_weight='balanced'`

Notice from the counts above that the bank campaign dataset has more failures than successes. In such cases it is important to let the classifier know that it needs to handle the class imbalance. There are many ways to do this, including oversampling, undersampling, and changing the cost of making False Negatives and False Positives.

We can do the latter by creating a classifier with the parameter `class_weight='balanced'`. The classifier then weights each class inversely to its frequency, which raises the cost of misclassifying the rarer class.
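
To make the weighting concrete, the sketch below reproduces the weights that `class_weight='balanced'` assigns, using the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The label vector `y_toy` is made up for illustration and is not the bank data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels (assumption): 90 failures (0) and 10 successes (1)
y_toy = np.array([0] * 90 + [1] * 10)

# Weights computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_toy)

# The same weights computed by hand: n_samples / (n_classes * count per class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights)
print(manual)
```

The rarer class receives the larger weight, so a mistake on a success counts for more than a mistake on a failure during training.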

In [None]:
bank_data.drop(['marital','education','contact'], axis=1, inplace=True)
bank_data.head()

In [None]:
bank_data_X = bank_data.drop('success', axis = 1)
bank_data_Y = bank_data['success']

In [None]:
bank_train_X, bank_test_X, bank_train_Y, bank_test_Y = train_test_split(bank_data_X, bank_data_Y, random_state = 42, 
                                                                        train_size = 0.7)

In [None]:
LogisticRegression?

In [None]:
bank_logistic = LogisticRegression(random_state=42)

bank_logistic.fit(bank_train_X, bank_train_Y)

<div class="alert alert-block alert-warning">
<h5>Side Note: Optimization in Logistic Regression</h5>
<p>
Logistic Regression uses an optimization routine to learn the coefficients of the model, and that routine may sometimes fail to converge. You can try a different solver to see whether it converges. For example, you can change `solver='lbfgs'` (default) to `solver='liblinear'`.
</p>
</div> 

In [None]:
bank_logistic = LogisticRegression(solver='liblinear', random_state=42)

bank_logistic.fit(bank_train_X, bank_train_Y)

In [None]:
bank_predict_Y = bank_logistic.predict(bank_test_X)
print("The accuracy is {0}".format(sklmetrics.accuracy_score(bank_test_Y, bank_predict_Y)))

conf_mat = sklmetrics.confusion_matrix(bank_test_Y, bank_predict_Y, labels =[0,1])
print(conf_mat)

sns.heatmap(conf_mat, square=True, annot=True, cbar = False, xticklabels = ['Failure','Success'], 
            yticklabels = ['Failure','Success'], fmt = 'g')
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
bank_logistic_balanced = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced')

bank_logistic_balanced.fit(bank_train_X, bank_train_Y)

In [None]:
bank_predict_Y = bank_logistic_balanced.predict(bank_test_X)
print("The accuracy is {0}".format(sklmetrics.accuracy_score(bank_test_Y, bank_predict_Y)))

conf_mat = sklmetrics.confusion_matrix(bank_test_Y, bank_predict_Y, labels =[0,1])
print(conf_mat)

sns.heatmap(conf_mat, square=True, annot=True, cbar = False, xticklabels = ['Failure','Success'], 
            yticklabels = ['Failure','Success'], fmt = 'g')
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

## Handling Categorical Input Variables using `pd.get_dummies()` method

In the previous class, we deleted the categorical input variables, but you don't always have to delete them. You can convert the categorical input variables to numeric columns using `get_dummies()`.

<div class="alert alert-block alert-danger">
<h5>Remove one category for each categorical variable</h5>
<p>
As you will see, the `get_dummies` method introduces a column for each value of the categorical variable. It is **very important** to remove one column for each categorical variable for the models to work appropriately. 
</p>
<p>
For example, for the bank data, we can remove 'marital_divorced' for the categorical variable marital status, 'education_unknown' for education, and 'contact_unknown' for contact. 
</p>
</div> 
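
One way to see why a column has to go is to look at a tiny example. The sketch below uses a made-up `toy` DataFrame (not the bank data) and shows that the dummy columns of one variable always add up to 1, so any one of them is implied by the others.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the bank dataset
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'single']})

# One dummy column per category value
all_dummies = pd.get_dummies(toy, dtype=int)
print(all_dummies)

# Every row sums to exactly 1, so one column carries no extra information
row_sums = all_dummies.sum(axis=1)
print(row_sums.tolist())

# Dropping one column removes the redundancy
reduced = all_dummies.drop('marital_divorced', axis=1)
print(reduced.columns.tolist())
```

A row with zeros in both remaining columns still identifies the dropped category, so no information is lost.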

In [None]:
bank_data = pd.read_csv('./data/bank_campaign_small.csv')
bank_data.head()

In [None]:
bank_data['marital'].unique()

In [None]:
bank_data['education'].unique()

In [None]:
bank_data['contact'].unique()

In [None]:
bank_data_with_dummies = pd.get_dummies(bank_data)
bank_data_with_dummies.head()

### Remove the additional variables introduced by `pd.get_dummies()` method

After removing the additional columns for each categorical variable, we can use this data with dummy columns added as input to various classifiers. 

In [None]:
bank_data_with_dummies.drop(['marital_divorced','education_unknown','contact_unknown'], axis = 1, inplace=True)
bank_data_with_dummies.head()

#### You can also use the keyword argument `drop_first` to remove the first value for each categorical variable

In [None]:
bank_data_with_dummies_auto = pd.get_dummies(bank_data, drop_first=True)
bank_data_with_dummies_auto.head()

## Parameters of the Classification models

Every classification model has a set of parameters that need to be set before the model can be learned. In the following we explain a couple of important parameters for Logistic Regression and Decision Tree Classification.

* **Logistic Regression**: When you are running logistic regression, which is very similar to multiple linear regression, having a lot of input variables can lead to overfitting the data. Hence, there are a number of ways to reduce the number of input variables used in the logistic regression. These methods are called **variable selection** methods. One classic machine learning method for variable selection is **regularization**. You can think of regularization as selecting only a subset of the input features for use in the logistic regression. Rather than selecting variables manually, regularization is a statistically rigorous way of doing the **variable selection**. There are two parameters for regularization in the LogisticRegression classifier. The main idea of these parameters is to **penalize models that use more input variables** in the logistic regression.
    * **penalty**: The kind of penalty you wish to apply. The most common ones are **['l1','l2']**. `'l1'` can shrink coefficients to exactly zero and so drops variables, while `'l2'` only shrinks them toward zero. 
    * **C**: The inverse of the regularization strength. The larger the value of C, the weaker the regularization and the more input variables the model tends to use. The smaller the value of C, the stronger the regularization, and hence fewer input variables are preferred. The most common values are from **[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]**
    
    
* **Decision Tree**: Recall that a decision tree is built by growing a tree of increasing depth. The two important parameters to set are 
    * **max_depth**: How deep the decision tree should be built. This usually depends on the data, and the **most common values are [3,4,5]**
    * **max_features**: How many features should be considered when splitting a decision tree node. **The most common values are ['sqrt','log2', None]**. Older scikit-learn versions also accepted `'auto'`, which meant the same as `'sqrt'` for classifiers and has been removed in recent versions.
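
The effect of `C` on variable selection can be sketched on synthetic data. The example below is an illustration under assumptions: the data comes from `make_classification` rather than the bank dataset, and the three values of `C` are arbitrary picks from the range above. It counts how many coefficients remain non-zero under an `'l1'` penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 20 inputs, 5 of them informative
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=5,
                                   n_redundant=0, random_state=42)

nonzero_counts = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(solver='liblinear', penalty='l1', C=C, random_state=42)
    model.fit(X_toy, y_toy)
    # Count the input variables that keep a non-zero coefficient
    nonzero_counts[C] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a small `C` you would typically expect only a few variables to survive, and with a large `C` most or all of them.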

## Activity

Create LogisticRegression and DecisionTreeClassifier models for various parameter choices. **Which parameter combination has the best accuracy score**? Select the model parameters that give good accuracy on the test data. 

**Important Note**: This way of selecting the model parameters is strongly recommended against. **YOU SHOULD NEVER DO THIS**. **This is only for learning**. See the cross-validation technique below for how to select the right model parameters. 


An example of how to create a model with certain parameters is shown below. 

```python
# Logistic regression model with penalty = 'l1' and the weight of penalization C = 0.1
log_reg_with_parameter = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced', penalty='l1', C= 0.1)
```

Use `bank_data_with_dummies` as your starting dataset. Remember to do all the steps of the classification and choose the model that has the highest accuracy. Remember the outcome variable of interest is 'success'. 

The first step is done for you. 

In [None]:
# Step 1: Split the input variables (X) and outcome variable (Y)

bank_dummies_X = bank_data_with_dummies.drop(['success'], axis = 1)
bank_dummies_Y = bank_data_with_dummies['success']

In [None]:
# Step 2: Split into training and test data.  Use 70% of the data for training.
bank_dum_train_X, bank_dum_test_X, bank_dum_train_Y, bank_dum_test_Y = train_test_split(bank_dummies_X, bank_dummies_Y, 
                                                                                       random_state=42,
                                                                                       train_size = 0.7)

In [None]:
# Step 3a: Learn the classifier with various parameters on the training data for Logistic Regression.  
# Try these combinations

# penalty = 'l1', C = 0.1
# penalty = 'l1', C = 1
# penalty = 'l1', C = 10
# penalty = 'l2', C = 0.1
# penalty = 'l2', C = 1
# penalty = 'l2', C = 10
log_reg_with_parameter = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced', penalty='l1', C= 0.1)


In [None]:
# Step 4: Use the test data to predict the Y and then print the accuracy_score for Logistic Regression 

#Get an accuracy


In [None]:
# Step 3b: Learn the classifier with various parameters on the training data for Decision Tree. 
# Try these combinations

# max_depth = 2, max_features = 'sqrt'
# max_depth = 3, max_features = 'sqrt'
# max_depth = 2, max_features = 'log2'
# max_depth = 3, max_features = 'log2'
# max_depth = 2, max_features = None
# max_depth = 3, max_features = None

# 'sqrt' replaces 'auto', which recent scikit-learn versions no longer accept
dec_tree_with_parameter = DecisionTreeClassifier(random_state=42, class_weight='balanced', max_depth = 2, max_features = 'sqrt')


In [None]:
# Step 4: Use the test data to predict the Y and then print the accuracy_score for Decision Tree


<div class="alert alert-block alert-danger">
<h5>DO NOT SELECT THE PARAMETERS BASED ON TEST DATA </h5>
<p> Use the cross validation techniques discussed below instead.</p>
</div> 


## Cross Validation to Select the Parameters of a Model using `GridSearchCV()`


As we said above, model parameters have to be selected through cross validation. Refer to the lecture notes (pdf, pptx) in Sakai for more on cross validation. 

Scikit-Learn provides a wide variety of cross validation techniques, but the most common way is to use the `GridSearchCV()` method.
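
Before moving to the grid search, it may help to see a single round of cross validation. The sketch below is a minimal illustration using `cross_val_score` on synthetic data (an assumption, not the bank data). The training data is split into 3 folds, and each fold is held out once while the model is trained on the other two.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# 3-fold cross validation: one accuracy score per held-out fold
fold_scores = cross_val_score(tree, X_toy, y_toy, cv=3, scoring='accuracy')

print(fold_scores)
print(fold_scores.mean())
```

The mean of the fold scores estimates how one parameter choice performs without ever touching the test data.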

**GridSearchCV()** is the process of searching through every possible combination of parameters and selecting the best parameter combination. For example

```python
params = {'max_depth':[2,3],
          'max_features':['sqrt','log2',None]}
```

Then you have 6 different combinations of parameters, listed below. Hence, it's called grid search. 
* max_depth = 2, max_features = 'sqrt'
* max_depth = 3, max_features = 'sqrt'
* max_depth = 2, max_features = 'log2'
* max_depth = 3, max_features = 'log2'
* max_depth = 2, max_features = None
* max_depth = 3, max_features = None
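
The combinations can also be enumerated programmatically. The sketch below uses scikit-learn's `ParameterGrid` helper to expand a dictionary of this shape. It uses `'sqrt'` for `max_features` because recent scikit-learn versions no longer accept `'auto'` for decision trees.

```python
from sklearn.model_selection import ParameterGrid

params = {'max_depth': [2, 3],
          'max_features': ['sqrt', 'log2', None]}

# Expand the dictionary into every combination of parameter values
combinations = list(ParameterGrid(params))
for combo in combinations:
    print(combo)

print(len(combinations))
```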

**Finally, you can select the best model based on a metric (usually accuracy).**


`GridSearchCV()` is a very useful method that automates this process of selecting the model with the best set of parameters. It has four main parameters:

* **estimator**: The classifier whose parameters you want to learn, e.g. LogisticRegression, DecisionTreeClassifier, etc. 
* **param_grid**: Dictionary (dict) of parameters and their values to be searched over. 
* **cv**: How many folds of cross validation you want to use. Usually 3 for smaller data and 10 for large data. 
* **n_jobs**: The number of parallel processes to run; usually you specify this as 1. You can parallelize the search by specifying a value greater than 1. First-time users should **not set n_jobs to more than 3**. On a laptop, a lab machine, or Vocareum in particular, a larger value can stall the machine so that it has to be rebooted. More experienced students should check the number of processing cores of the machine before increasing the number.
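
Checking the number of cores takes only the standard library. The sketch below is one conservative way to pick a value; the rule of leaving one core free is an assumption for illustration, while the cap of 3 follows the advice above.

```python
import os

# Number of logical cores reported by the operating system (can be None)
n_cores = os.cpu_count() or 1

# Assumed rule of thumb: leave one core free, and never exceed 3
safe_n_jobs = max(1, min(3, n_cores - 1))

print(n_cores, safe_n_jobs)
```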


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# For decision tree based classification

# The model you want to set the parameters for
model = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# The parameters to search over for the model
params = {'max_depth':[2,3,4],
          'max_features':['sqrt','log2',None]}  # 'sqrt' replaces 'auto', removed in recent scikit-learn


# Prepare the GridSearch for cross validation
grid_search_dec_tree = GridSearchCV(model, # Note the model is DecisionTreeClassifier as stated above
                                    param_grid=params, # The parameters to search over. 
                                   cv=3, # How many hold out sets to use
                                   n_jobs = 1 # Number of parallel processes to run.  
                                   )

# Do the cross validation on the training data 
grid_search_dec_tree.fit(bank_dum_train_X, bank_dum_train_Y)

# Select the best model

best_dec_tree_cv = grid_search_dec_tree.best_estimator_

# Print the best parameter combination 
print(grid_search_dec_tree.best_params_)

**Important Notes**
1. For large datasets, the above `GridSearchCV` method usually takes a lot of time. Start with a limited set of parameter values before you increase the number of possible values for the parameters. 
2. Every time you run GridSearchCV, you might find a different combination of parameters to be the best one, for instance when the random seed or the data split changes. This is another issue with the **consistency** of machine learning algorithms. Addressing it is out of scope for this course. 
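
To judge how close the competing combinations are, you can inspect the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below is an illustration on synthetic data (an assumption, not the bank data) with a deliberately small grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': [2, 3, 4]},
                      cv=3, n_jobs=1)
search.fit(X_toy, y_toy)

# One row per parameter combination, with the mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
```

When the mean scores differ by less than their standard deviations, a change in the best combination between runs is not surprising.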

In [None]:
# Finally test the performance of the best model on the test data

bank_dum_pred_Y = best_dec_tree_cv.predict(bank_dum_test_X)

#Print the accuracy 
print(sklmetrics.accuracy_score(bank_dum_test_Y, bank_dum_pred_Y))

conf_mat = sklmetrics.confusion_matrix(bank_dum_test_Y, bank_dum_pred_Y)
print(conf_mat)

# Confusion matrix
sns.heatmap(conf_mat, fmt='g',square=True, annot=True, cbar = False, xticklabels = ['Failure','Success'], 
                                                            yticklabels = ['Failure','Success'])
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

### Understanding the feature importance of the Decision Tree

In [None]:
# Defining a function to plot feature importance for trees
# INPUT: Used for Tree based Classifier
#        Feature Names
# OUTPUT: A plot of top most features

def plot_feature_importance(model, Xnames, cls_nm = None):

    # Measuring important features
    imp_features = pd.DataFrame(np.column_stack((Xnames, model.feature_importances_)), columns = ['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # Sort the features based on absolute value of importance
    imp_features = imp_features.sort_values(by = ['abs_importance'], ascending = [1])
    
    imp_features = imp_features.iloc[10:]
    
    # Plot the feature importances of the forest
    plt.figure(figsize=(10,6))
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
            color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'], )
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout() 
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()

In [None]:
plot_feature_importance(best_dec_tree_cv, bank_dum_train_X.columns, cls_nm='Best CV Decision Tree')

## Activity

Learn the parameters of LogisticRegression classifier using `GridSearchCV()` method. 
1. Specify the dictionary of the parameters for LogisticRegression and the values 
2. Set GridSearchCV for LogisticRegression classifier. 
    * Set cv=10
3. Use the training data and fit the GridSearchCV so that it learns the model and the best parameter
4. Select the best model
5. Verify the accuracy and confusion matrix on the testing data. 
6. Present the importance of each characteristic for Logistic Regression using the `plot_feature_importance_coeff` method below. 

### Understanding the feature importance of the Logistic Regression

In [None]:
# Defining a function to plot coefficients as feature importance
# INPUT: Used for Logistic Regression Classifier
#        Feature Names
# OUTPUT: A plot of top most Coefficients
def plot_feature_importance_coeff(model, Xnames, cls_nm = None):

    imp_features = pd.DataFrame(np.column_stack((Xnames, model.coef_.ravel())), columns = ['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # Sort the features based on absolute value of importance
    imp_features = imp_features.sort_values(by = ['abs_importance'], ascending = [1])
    
    # Plot the feature importances of the forest
    plt.figure(figsize=(10,6))
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
            color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'], )
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout() 
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()

## Metrics for evaluating the performance of classification

Until now, we have always used `accuracy_score` to verify the performance of the classification on the test data. However, in practice this is often not the best metric, especially when the classes are imbalanced. Other important metrics used in the real world are listed below. Again, the discussion of these metrics is out of scope for this course. 

1. Precision, Recall, F1 Score
2. Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curve
3. Area under Precision-Recall Curve
4. Specificity and Sensitivity
5. Positive predictive value
6. ... many more
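
Several of these metrics are available in `sklearn.metrics`. The sketch below computes a few of them on made-up labels and scores (`y_true` and `y_score` are assumptions for illustration, not model output from the bank data). Specificity has no dedicated function, so it is derived from the confusion matrix.

```python
import sklearn.metrics as sklmetrics

# Hypothetical true labels and predicted probabilities of success
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9, 0.65, 0.4]

# Turn the probabilities into class predictions with a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

precision = sklmetrics.precision_score(y_true, y_pred)
recall = sklmetrics.recall_score(y_true, y_pred)   # also called sensitivity
f1 = sklmetrics.f1_score(y_true, y_pred)
auc = sklmetrics.roc_auc_score(y_true, y_score)    # uses the scores, not the labels

# Specificity = true negatives / all actual negatives
tn, fp, fn, tp = sklmetrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)

print(precision, recall, f1, auc, specificity)
```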


## Many more classification techniques

We discussed two basic models in this section. However, there are many other classifiers you can use to improve your classification accuracy. Below, I have listed three main classification methods (the list is by no means exhaustive).

* MLPClassifier: Multi Layer Perceptron model. This is a very basic model that is a primer to deep learning neural networks. 
    ```python
    from sklearn.neural_network import MLPClassifier
    ```
* GradientBoostingClassifier: Learns a sequence of trees, each one correcting the errors of the trees before it. 
    ```python
    from sklearn.ensemble import GradientBoostingClassifier
    ```
* RandomForestClassifier: Learns many trees on random subsets of the data and features and combines their votes. 
    ```python
    from sklearn.ensemble import RandomForestClassifier
    ```
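
All three classifiers follow the same `fit` / `predict` pattern used throughout this notebook. The sketch below is a rough illustration on synthetic data (an assumption, not the bank data) with default parameters apart from `random_state` and a larger `max_iter` for the MLP, so the scores say nothing about how these models would rank on the bank campaign.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data for illustration only
X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=5,
                                   random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, train_size=0.7,
                                                    random_state=42)

classifiers = {
    'MLPClassifier': MLPClassifier(max_iter=1000, random_state=42),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
}

# Fit each classifier and record its accuracy on the held-out data
test_scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    test_scores[name] = clf.score(X_test, y_test)

print(test_scores)
```

Each of these can also be passed to `GridSearchCV` as the estimator, with its own parameter dictionary.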
