
##Credit risk scoring project: Decision tree

Imagine that we work at a bank. When we receive a loan application, we need to make
sure that if we give the money, the customer will be able to pay it back. Every application
carries a risk of default — the failure to return the money.

Credit risk scoring is a binary classification problem: the target is positive (“1”) if the
customer defaults and negative (“0”) otherwise.

We will use machine learning to calculate the risk of
default. The plan for the project is the following:

* We will train a decision tree model for predicting the probability
of default.
* Then we combine multiple decision trees into one model: a random forest.
* Finally, we explore a different way of combining decision trees: gradient
boosting (XGBoost).

##Setup

In [None]:
import pandas as pd
import numpy as np
import pickle 
import requests

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
# Directory in Google Drive that holds kaggle.json (the Kaggle API credentials)
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [None]:
%%shell

# Download the dataset from Kaggle. URL: https://www.kaggle.com/blastchar/telco-customer-churn
kaggle datasets download -d blastchar/telco-customer-churn

unzip -qq telco-customer-churn.zip
rm -rf telco-customer-churn.zip

##Dataset

In [None]:
# let’s read our dataset
data_df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
len(data_df)

7043

In [None]:
# Parse TotalCharges as numbers: entries that cannot be parsed become NaN, which we then set to zero
data_df["TotalCharges"] = pd.to_numeric(data_df.TotalCharges, errors="coerce")
data_df["TotalCharges"] = data_df["TotalCharges"].fillna(0)
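
To see what `errors="coerce"` does in the cell above, here is a minimal sketch on a made-up column. The values are illustrative only and do not come from the dataset; the blank entry stands in for a charge that was stored as whitespace.

```python
import pandas as pd

# Toy column mimicking charges stored as text; the values are made up
raw = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isnull().sum())

# Replace the NaN entries with zero, as in the cell above
filled = parsed.fillna(0)

print(n_missing)
print(filled.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows the zero-fill actually affects.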

In [None]:
# Let’s make the column names uniform by lowercasing everything and replacing spaces with underscores
data_df.columns = data_df.columns.str.lower().str.replace(" ", "_")
string_columns = list(data_df.dtypes[data_df.dtypes == "object"].index)

for col in string_columns:
  data_df[col] = data_df[col].str.lower().str.replace(" ", "_")
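
The effect of this normalization is easier to see on a tiny frame. The sketch below uses a hypothetical two-row DataFrame (the column names and values are assumptions for illustration) and applies the same `lower` and `replace` calls.

```python
import pandas as pd

# Hypothetical two-row frame; column names and values are illustrative only
toy = pd.DataFrame({
    "Payment Method": ["Bank transfer (automatic)", "Electronic check"],
    "Tenure": [41, 2],
})

# Normalize the column names
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Normalize the values of the one string column; the numeric column is left alone
toy["payment_method"] = toy["payment_method"].str.lower().str.replace(" ", "_")

print(list(toy.columns))
print(toy["payment_method"].tolist())
```

After this step a value such as `Bank transfer (automatic)` becomes `bank_transfer_(automatic)`, which is the spelling the customer record later in the notebook uses.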

In [None]:
# so, let’s convert the target variable to numbers
data_df.churn = (data_df.churn == "yes").astype(int)

In [None]:
# split such that 80% of the data goes to the train set and the remaining 20% goes to the test set.
df_train_full, df_test = train_test_split(data_df, test_size=0.2, random_state=1)

df_train_full = df_train_full.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
# let's split it one more time into train and validation
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
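
The two splits above leave roughly 54% of the rows for training, 26% for validation, and 20% for testing. The arithmetic can be sketched in plain Python; the helper `split_sizes` is hypothetical, and it assumes the held-out share is rounded up, which is how scikit-learn appears to handle fractional `test_size` values.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumption: the held-out share is rounded up, the rest goes to training
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# First split: full training set vs. test set
n_full_train, n_test = split_sizes(7043, 0.2)

# Second split: training set vs. validation set
n_train, n_val = split_sizes(n_full_train, 0.33)

print(n_train, n_val, n_test)
```

Comparing these numbers with `len(df_train)`, `len(df_val)`, and `len(df_test)` is a quick sanity check that the splits came out as intended.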

In [None]:
# Takes the column with the target variable, churn, and saves it outside the dataframe
y_train = df_train.churn.values
y_val = df_val.churn.values

In [None]:
# Deletes the churn columns
del df_train["churn"]
del df_val["churn"]

In [None]:
# let's create two lists for categorical and numerical variables
categorical_cols = [
  'gender', 'seniorcitizen', 'partner', 'dependents',
  'phoneservice', 'multiplelines', 'internetservice',
  'onlinesecurity', 'onlinebackup', 'deviceprotection',
  'techsupport', 'streamingtv', 'streamingmovies',
  'contract', 'paperlessbilling', 'paymentmethod'
]

numerical_cols = ['tenure', 'monthlycharges', 'totalcharges']

##Model

In [None]:
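def train(df, y, C):
  # Converts each row into a dictionary of feature name -> value
  cat = df[categorical_cols + numerical_cols].to_dict(orient="records")

  # Applies one-hot encoding to string values; numerical values pass through unchanged
  dv = DictVectorizer(sparse=False)
  dv.fit(cat)

  X = dv.transform(cat)

  # C is the inverse of the regularization strength: smaller values regularize more
  model = LogisticRegression(solver="liblinear", C=C)
  model.fit(X, y)
  return dv, model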

def predict(df, dv, model):
  cat = df[categorical_cols + numerical_cols].to_dict(orient="records")

  X = dv.transform(cat)

  y_pred = model.predict_proba(X)[:, 1]
  return y_pred
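
Both functions rely on `DictVectorizer` to turn row dictionaries into a numeric matrix. The following is a small sketch of that step on made-up records; the names `dv_demo`, `X_demo`, and `X_new` are illustrative and are not used elsewhere in the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; only the feature names echo the dataset
records = [
    {"contract": "one_year", "tenure": 41},
    {"contract": "two_year", "tenure": 2},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

# One column per (feature, string value) pair, plus one per numeric feature
print(dv_demo.feature_names_)
print(X_demo)

# A category that was not seen during fit gets zeros in every one-hot column
X_new = dv_demo.transform([{"contract": "month-to-month", "tenure": 5}])
print(X_new)
```

Because unseen categories and unseen feature names are silently ignored at transform time, a misspelled value does not raise an error; it simply contributes nothing to the prediction.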

In [None]:
# Uses the full training set (train + validation) for the final model
y_train = df_train_full.churn.values
y_test = df_test.churn.values

# Trains the model and makes predictions
dv, model = train(df_train_full, y_train, C=0.5)
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
print(f"auc={auc:.3f}")

auc=0.858
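
One way to read an AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by brute force on toy labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    # Separate the scores of positive and negative examples
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]

    # Count the positive/negative pairs ranked correctly; ties count as half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and scores: three of the four positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

The double loop is quadratic in the number of examples, so this is only meant to illustrate the definition; `roc_auc_score` remains the practical choice.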


##Prediction

Let’s use this model to calculate the probability of churning.

In [None]:
customer = {
    'customerid': '8879-zkjof',
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'no',
    'dependents': 'no',
    'tenure': 41,
    'phoneservice': 'yes',
    'multiplelines': 'no',
    'internetservice': 'dsl',
    'onlinesecurity': 'yes',
    'onlinebackup': 'no',
    'deviceprotection': 'yes',
    'techsupport': 'yes',
    'streamingtv': 'yes',
    'streamingmovies': 'yes',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'bank_transfer_(automatic)',
    'monthlycharges': 79.85,
    'totalcharges': 3320.75
}

df = pd.DataFrame([customer])
y_pred = predict(df, dv, model)
y_pred[0]

0.061875685940780745

In [None]:
def predict_single(customer, dv, model):
  X = dv.transform([customer])
  y_pred = model.predict_proba(X)[:, 1]
  return y_pred[0]

In [None]:
predict_single(customer, dv, model)

0.061875685940780745

##Save model

In [None]:
# Let's save the model using the pickle module
with open("churn-model.bin", "wb") as f_out:
  pickle.dump(model, f_out)

In our case, however, saving just the model is not enough: we also have a DictVectorizer that we “trained” together with the model. 

We need to save both.

In [None]:
# This time, save both the DictVectorizer and the model as a single tuple
with open("churn-model.bin", "wb") as f_out:
  pickle.dump((dv, model), f_out)

##Load model

In [None]:
# Let's load the saved model
with open("churn-model.bin", "rb") as f_in:
  dv, model = pickle.load(f_in)
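
The save and load steps rely on the fact that a tuple written with `pickle.dump` comes back as a tuple that can be unpacked. Here is a self-contained sketch of that round trip using stand-in objects and a temporary file; the file name and the stand-ins are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for (dv, model): any picklable objects behave the same way
artifacts = ({"feature_names": ["tenure"]}, [0.5, -1.2])

# Write the tuple to a temporary file
path = os.path.join(tempfile.mkdtemp(), "demo-model.bin")
with open(path, "wb") as f_out:
    pickle.dump(artifacts, f_out)

# Read it back and unpack it in the same order it was saved
with open(path, "rb") as f_in:
    dv_like, model_like = pickle.load(f_in)

print(dv_like, model_like)
```

Unpickling can execute arbitrary code, so it is safest to load only files that come from a trusted source.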

In [None]:
# And apply it
prediction = predict_single(customer, dv, model)
print(f"prediction: {prediction:.3f}")

prediction: 0.062


In [None]:
# let’s display the results
if prediction >= 0.5:
  print("verdict: Churn")
else:
  print("verdict: No Churn")

verdict: No Churn
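
The decision rule above can be wrapped in a small function so that the cutoff becomes a parameter. This is a sketch: the `verdict` helper and its `threshold` argument are hypothetical, and 0.5 simply mirrors the cell above.

```python
def verdict(probability, threshold=0.5):
    # threshold is a tunable assumption; 0.5 mirrors the cell above
    return "Churn" if probability >= threshold else "No Churn"

print(verdict(0.062))
print(verdict(0.3, threshold=0.25))
```

Lowering the threshold flags more customers as likely to churn, at the cost of more false alarms.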
