# This notebook demonstrates basic machine learning concepts

Please fill in the code between the ### START CODE HERE ### and ### END CODE HERE ### comments.

##### Installing XGBoost and graphviz (uncomment the line below if you have not installed them yet)

In [0]:
#! pip install xgboost graphviz

##### Importing the required libraries

In [0]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, \
precision_score, recall_score, f1_score, roc_curve, roc_auc_score, auc
from matplotlib import pyplot as plt
plt.style.use('seaborn')  # matplotlib >= 3.6 renamed this style to 'seaborn-v0_8'
import seaborn as sns
from datetime import datetime #for calculating running time

##### Uncomment this cell if you are using colab

In [0]:
#from google.colab import drive
#drive.mount('/content/drive/')


In [0]:
#cd '/content/drive/My Drive/'

# Default of Credit Card Clients Dataset

**Data sources**

**Google drive:** https://drive.google.com/file/d/1E5p9qQK8E3kemMBnGpC4RLSMOUr5RliY/view?usp=sharing

**Kaggle:** https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/home

**UCI machine learning repository:** https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#

----------------------------------

**Dataset Information**

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

**Content**

**There are 25 variables:**

ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1=male, 2=female)

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

MARRIAGE: Marital status (1=married, 2=single, 3=others)

AGE: Age in years

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)

# Setting the random seed and reading the data

In [0]:
RAND_SEED = 41
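The cell that reads the data is not shown here. The following is a minimal sketch of one way to do it, not the notebook's own code. It assumes the CSV has been downloaded locally (the filename `UCI_Credit_Card.csv` in the comment is an assumption), and the names `df`, `x`, `y` and `TARGET` are illustrative. So that the snippet runs without the file, it parses two made-up rows with only a subset of the columns.

```python
import io
import pandas as pd

TARGET = 'default.payment.next.month'

# Two made-up rows with a subset of the columns, standing in for the real file.
# With the real data this would be something like:
#   df = pd.read_csv('UCI_Credit_Card.csv')  # assumed local filename
sample_csv = io.StringIO(
    "ID,LIMIT_BAL,SEX,AGE,PAY_0,default.payment.next.month\n"
    "1,20000,2,24,2,1\n"
    "2,120000,2,26,-1,0\n"
)
df = pd.read_csv(sample_csv)

# ID is only a row identifier, so it is dropped from the features.
x = df.drop(columns=['ID', TARGET])
# Kept as a one-column DataFrame; y.values.ravel() flattens it later.
y = df[[TARGET]]

print(x.shape, y.shape)
```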


# Exploratory data analysis

# Train test split

In [0]:
### START CODE HERE ### (1~2 lines)
x_train, x_test, y_train, y_test = #call the train_test_split() function
### END CODE HERE ###

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)

In [0]:
x_train.head()
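As a reference for how `train_test_split()` behaves, here is a small sketch on synthetic data. The `test_size=0.2` and `stratify` settings, and the names ending in `_demo`, `_tr` and `_te`, are illustrative assumptions rather than the settings this notebook expects.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RAND_SEED = 41

# Synthetic stand-in data: 100 rows, 3 features, 20% positive class.
rng = np.random.default_rng(RAND_SEED)
x_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
y_demo = pd.DataFrame({'target': [1] * 20 + [0] * 80})

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=RAND_SEED,
    stratify=y_demo['target']
)

print(x_tr.shape, x_te.shape)
print(y_tr['target'].mean(), y_te['target'].mean())
```

Stratifying is often useful for imbalanced targets such as default labels, because it prevents the test set from ending up with a noticeably different share of positive cases.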

# Feature scaling

Notice the standard deviation (std) of each feature. 'SEX' is a categorical variable, 'PAY_0' is an ordinal variable, and the remaining columns shown below are continuous variables.

In [0]:
x_test[['SEX','PAY_0','AGE','LIMIT_BAL','BILL_AMT1','PAY_AMT1']].describe()

Scale the numerical features only, then stack the returned numpy array of scaled features with a numpy array of the categorical features.

Tip: if df is a pandas DataFrame, df.values returns its contents as a numpy array.

In [0]:
cate_features = ['EDUCATION','SEX','MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] #categorical and ordinal features

### START CODE HERE ### (3~4 lines)
scaler =  #instantiate the StandardScaler
x_train_s =  #fit the scaler on the train set numerical data, then transform it
x_test_s = #transform the test set numerical data with the fitted scaler

### END CODE HERE ###

In [0]:
pd.DataFrame(x_train_s).describe()
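The fit-on-train, transform-on-test pattern can be sketched on a toy array as follows. The arrays and the names `scaler_demo`, `train_num` and `test_num` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical data: two features on very different scales.
train_num = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_num = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
# fit_transform learns the mean and std from the train set and applies them.
train_scaled = scaler_demo.fit_transform(train_num)
# transform reuses the train-set mean and std; nothing is learned from the test set.
test_scaled = scaler_demo.transform(test_num)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler on the train set only keeps information about the test set from leaking into preprocessing.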

Combine the array of categorical features with the scaled arrays using np.hstack([ ]).

In [0]:
### START CODE HERE ### (2~4 lines)
x_train_s = # combine x_train_s with categorical and ordinal features
x_test_s =  # combine x_test_s with categorical and ordinal features
### END CODE HERE ###
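A small sketch of how `np.hstack` joins arrays column-wise, using a made-up two-row frame. The names `demo`, `cate_demo`, `num_demo` and `scaled_num` are illustrative only. Note that stacking changes the column order, so any list of feature names used later has to follow the stacked order.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numerical and two categorical/ordinal columns.
demo = pd.DataFrame({'SEX': [1, 2], 'AGE': [24.0, 35.0], 'PAY_0': [0, 2]})
cate_demo = ['SEX', 'PAY_0']
num_demo = [c for c in demo.columns if c not in cate_demo]

# Pretend this is the scaled numerical column returned by a scaler.
scaled_num = np.array([[-1.0], [1.0]])

# hstack joins arrays side by side; both must have the same number of rows.
stacked = np.hstack([scaled_num, demo[cate_demo].values])
# Column names in the order they now appear in `stacked`.
stacked_columns = num_demo + cate_demo

print(stacked.shape)
print(stacked_columns)
```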

In [0]:
print('x_train shape:', x_train_s.shape)
print('y_train shape:', y_train.shape)
print('x_test shape:', x_test_s.shape)
print('y_test shape:', y_test.shape)

# Training with XGBoost

In [0]:
from xgboost import XGBClassifier

#Defining the model and its hyperparameters 

### START CODE HERE ### (1 line)
xgb = #specifying the model
### END CODE HERE ###


In [0]:
#Fitting the data to the model

start = datetime.now()

### START CODE HERE ### (1 line)

#Fitting the model with x_train_s and y_train

### END CODE HERE ###

time_elapsed = datetime.now() - start
print('Time elapsed:',time_elapsed)

In [0]:
#Making the predictions

### START CODE HERE ###
y_train_pred = #predict class for x_train_s
y_train_pred_prob = #predict probability of class 1 for x_train_s

y_test_pred =  #predict class for x_test_s
y_test_pred_prob = #predict probability of class 1 for x_test_s
### END CODE HERE ###
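The difference between predicting a class and predicting a probability can be sketched with any scikit-learn style classifier. The example below deliberately uses sklearn's `GradientBoostingClassifier` as a stand-in (not XGBoost) on synthetic data; `XGBClassifier` follows the same `fit` / `predict` / `predict_proba` interface. All names and hyperparameter values here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data.
rng = np.random.default_rng(41)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Stand-in tree ensemble; hyperparameter values are arbitrary.
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=41)
clf.fit(X_demo, y_demo)

# predict returns hard class labels (0 or 1).
pred_class = clf.predict(X_demo)
# predict_proba returns one column per class; column 1 is P(class = 1).
pred_prob = clf.predict_proba(X_demo)[:, 1]

print(pred_class[:5], pred_prob[:5].round(3))
```

The hard labels are what accuracy, precision, recall and F1 need, while the class 1 probabilities are what the ROC curve and AUC are computed from.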

# Evaluating the model

## Functions for model evaluation (do NOT edit the code here)

In [0]:
def print_confusion_matrix(confusion_matrix, class_names, figsize = (6,5)):
  df_cm = pd.DataFrame(
      confusion_matrix, index=class_names, columns=class_names, 
  )
  fig = plt.figure(figsize=figsize)
  try:
      heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cmap='Reds')
  except ValueError:
      raise ValueError("Confusion matrix values must be integers.")
  heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=20)
  heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=20)
  plt.title('Confusion matrix', fontsize=25)
  plt.ylabel('True label', fontsize=17)
  plt.xlabel('Predicted label', fontsize=17)
  return fig
  

def evaluate_model(y_true, y_pred, y_pred_prob):
  '''
  y_true: a numpy array of true class label, containing 0 and 1
  y_pred: a numpy array of predicted class label, containing 0 and 1
  y_pred_prob: a numpy array of predicted probability of belonging to a class
  '''  
    
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  cm = np.array([tp,fn,fp,tn]).reshape(2,2)
  acc = accuracy_score(y_true, y_pred)
  precision = precision_score(y_true, y_pred)
  recall = recall_score(y_true, y_pred)
  f1 = f1_score(y_true, y_pred)
  roc_auc = roc_auc_score(y_true, y_pred_prob)

  print('Accuracy: %.3f'%acc)
  print('Precision: %.3f'%precision)
  print('Recall: %.3f'%recall)
  print('F1 Score: %.3f'%f1)
  print('AUC score: %.3f \n'%roc_auc)
  print_confusion_matrix(cm,[1,0])
  
def plot_roc(y_true, y_pred_prob): 
  '''
  y_true: a numpy array of true class label, containing 0 and 1
  y_pred_prob: a numpy array of predicted probability of belonging to a class
  '''  
    
  fpr, tpr, thresholds = roc_curve(y_true, y_pred_prob, pos_label=None, drop_intermediate=False)
  roc_auc = roc_auc_score(y_true, y_pred_prob)
  plt.figure()
  lw = 2
  plt.plot(fpr, tpr, color='darkorange',
          lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
  plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('ROC Curve')
  plt.legend(loc="lower right")
  plt.show()
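To make the metrics printed by `evaluate_model` concrete, here is a small worked sketch on made-up labels. It computes precision, recall and F1 by hand from the confusion matrix counts; the names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up true and predicted labels.
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

These hand-computed values agree with `precision_score`, `recall_score` and `f1_score` on the same arrays.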
  
  

## Model Evaluation on train set

Read the function definitions above and determine which arguments should be passed to each function.

In [0]:
### START CODE HERE ###
evaluate_model() #What should be filled inside?
### END CODE HERE ###

In [0]:
### START CODE HERE ###
plot_roc() #What should be filled inside?
### END CODE HERE ###

## Model Evaluation on test set

In [0]:
### START CODE HERE ###
evaluate_model() #What should be filled inside?
### END CODE HERE ###

In [0]:
### START CODE HERE ###
plot_roc() #What should be filled inside?
### END CODE HERE ###

# Plotting and saving the tree graph

Set rankdir to 'UT' for a top-to-bottom graph or 'LR' for a left-to-right graph. The cell saves the figure to a file called 'xgb.jpg'.

Just run the cell, no need to edit.

In [0]:
from xgboost import plot_tree
ax = plt.gca()
ax.figure.set_size_inches(25, 25)
plot_tree(xgb, ax=ax, rankdir='LR')
ax.figure.savefig('xgb.jpg')

# Feature importance
Just run the cell, no need to edit.

In [0]:
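# The names must follow the column order of x_train_s. This assumes it was stacked as
# [scaled numerical features, categorical features], which differs from x_train.columns.
features = [c for c in x_train.columns if c not in cate_features] + cate_features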
feature_imp = pd.DataFrame(sorted(zip(xgb.feature_importances_,features)), columns=['Value','Feature'])

plt.figure(figsize=(7, 7))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Features Importance')
plt.show()

### Question: Can you fit other machine learning models and compare the performance with XGBoost? 
Many machine learning models are available in sklearn and can be fitted in the same way as XGBoost (XGBoost is compatible with the sklearn API).

Popular classification models include: Support Vector Machine, Logistic Regression, Linear Discriminant Analysis, AdaBoost (another type of tree ensemble model).

See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model and the code below for reference:


In [0]:
#Building a logistic regression classifier

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='newton-cg', random_state=RAND_SEED)

start = datetime.now()
lr.fit(x_train_s, y_train)
time_elapsed = datetime.now() - start
print('Time elapsed:',time_elapsed)

y_test_pred =  #Predict class label
y_test_pred_prob =  #Predict class probability

In [0]:
evaluate_model() #What should be filled inside?

In [0]:
plot_roc() #What should be filled inside?
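One way to compare several models is to fit them in a loop and collect the same metric for each. The sketch below does this on synthetic data with three of the models named above, using test AUC as the metric. The data, the split size and all variable names are illustrative assumptions, and the numbers it prints say nothing about performance on the credit card dataset.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 41

# Synthetic binary classification data with some label noise.
rng = np.random.default_rng(SEED)
X_demo = rng.normal(size=(400, 5))
noise = rng.normal(scale=0.5, size=400)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=SEED
)

models = {
    'Logistic Regression': LogisticRegression(random_state=SEED),
    'Linear Discriminant Analysis': LinearDiscriminantAnalysis(),
    'AdaBoost': AdaBoostClassifier(random_state=SEED),
}

# Fit each model on the same train set and score it on the same test set.
test_auc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_auc[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print('%s: test AUC = %.3f' % (name, test_auc[name]))
```

Keeping the split and the metric fixed across models is what makes the comparison fair.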