# Final Exercise: The Rumos Bank

The Rumos Bank has been losing a lot of money because of the number of loans it grants that are not repaid on time.

    - For each client predicted not to pay on time who does pay after all, the bank incurs a cost of 1,000 euros.

    - For each client predicted to be a good payer who does not pay on time after all, the bank incurs a cost of 3,000 euros.


You, top data scientists, have been hired to help the bank predict which clients will miss their deadlines, so that it can manage its funds better.

Can you build a model that helps detect bad payers successfully and ahead of time?


Dataset: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

Available variables:

    ID: ID of each client
    LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
    SEX: Gender (1=male, 2=female)
    EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
    MARRIAGE: Marital status (1=married, 2=single, 3=others)
    AGE: Age in years
    PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
    PAY_2: Repayment status in August, 2005 (scale same as above)
    PAY_3: Repayment status in July, 2005 (scale same as above)
    PAY_4: Repayment status in June, 2005 (scale same as above)
    PAY_5: Repayment status in May, 2005 (scale same as above)
    PAY_6: Repayment status in April, 2005 (scale same as above)
    BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
    BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
    BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
    BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
    BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
    BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
    PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
    PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
    PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
    PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
    PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
    PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
    default.payment.next.month: Default payment (1=yes, 0=no)

#### Questions:

    1. How many features are available? How many clients?
    2. How many clients in the dataset actually were bad payers? And how many were not?
    3. Which model gave the best results? Which metric was used to compare the different models?
    4. Which features are most relevant for deciding whether a client is more likely to be a bad payer?
    5. What would the bank's cost be without any model?
    6. What does the bank's cost become with your model?

Based on the information given, we can define:

    True positive - A bad payer is correctly identified.
    True negative - A good payer is correctly identified.
    False positive - A good payer is identified as a bad payer.
    False negative - A bad payer is identified as a good payer.
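
The four outcomes above can be counted directly from a list of true labels and a list of predictions. The following is a minimal pure-Python sketch, assuming the dataset's convention that 1 marks a bad payer (the positive class); the helper `confusion_counts` and the sample labels are made up for illustration and are not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating 1 (bad payer) as the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # bad payer correctly flagged
        elif truth == 0 and pred == 0:
            tn += 1  # good payer correctly cleared
        elif truth == 0 and pred == 1:
            fp += 1  # good payer wrongly flagged (costs 1,000 euros)
        else:
            fn += 1  # bad payer missed (costs 3,000 euros)
    return tn, fp, fn, tp


# Made-up labels for illustration: 1 = bad payer, 0 = good payer.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
cost = 1000 * fp + 3000 * fn
print(tn, fp, fn, tp, cost)
```

The tuple order matches what `confusion_matrix(...).ravel()` returns for binary labels, so the same unpacking works with scikit-learn.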

In [None]:
import mlflow

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import (
    accuracy_score, auc, confusion_matrix, f1_score, precision_recall_curve,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

In [None]:
ROOT_PATH = '../data/'
SEED = 3
TARGET_COL = "default.payment.next.month"

## Set the location where the experiments are stored

In [None]:
from pathlib import Path

uri = "http://0.0.0.0:5000"

mlflow.set_tracking_uri(uri)

## Set the experiment "Rumos Bank lending prediction Experiment"

In [None]:
mlflow.set_experiment("Rumos Bank lending prediction Experiment")

In [None]:
df = pd.read_csv(ROOT_PATH + 'lending_data.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.ID.nunique()

#### 1. How many features are available? How many clients?

    - There are 24 features available in the dataset.
    - There are 30,000 clients.

Are there missing values?

In [None]:
df.isnull().values.any()

Is the number of clients the same in both classes?

In [None]:
df.groupby('default.payment.next.month')['default.payment.next.month'].count()

No! The dataset is highly imbalanced.

#### 2. How many clients in the dataset actually were bad payers? And how many were not?

    - 23,364 were good payers. 6,636 were bad payers.
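
To put the imbalance into numbers, the class counts reported above can be turned into shares. This is a small arithmetic sketch that uses only those two counts; the variable names are illustrative.

```python
# Class counts reported above.
good_payers = 23_364
bad_payers = 6_636

total = good_payers + bad_payers
bad_share = bad_payers / total
good_share = good_payers / total

print(f"{total} clients, {bad_share:.2%} bad payers, {good_share:.2%} good payers")
```

Roughly one client in five is a bad payer, so a model that always predicts "good payer" would already be right about 78% of the time on the full dataset. That is why accuracy alone is a weak yardstick here.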

Are there non-numeric features?

In [None]:
df.dtypes

All features are numeric.

Let's drop the client ID:

In [None]:
df = df.drop('ID', axis = 1)

## Create the datasets

In [None]:
train_path = ROOT_PATH + 'rumos_bank_train.csv'
test_path = ROOT_PATH + 'rumos_bank_test.csv'

train_set = pd.read_csv(train_path)
test_set = pd.read_csv(test_path)

X_train = train_set.drop([TARGET_COL], axis = 1)
y_train = train_set[TARGET_COL]

X_test = test_set.drop([TARGET_COL], axis = 1)
y_test = test_set[TARGET_COL]

X_train.head()

Before starting, let's compute the baseline, that is, the cost the bank incurs without any model.

$$\text{totalCost} = 1000 \cdot FP + 3000 \cdot FN$$

In [None]:
y_preds_all_bad = np.ones(y_test.shape) 

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_preds_all_bad).ravel()

print('Number of True Negatives:', tn)
print('Number of True Positives:', tp)
print('Number of False Negatives:', fn)
print('Number of False Positives:', fp)

In [None]:
print('Total Cost:', fp*1000)

In [None]:
accuracy_score(y_test, y_preds_all_bad)

If every client were treated as a bad payer, the bank would incur a cost of 4,687,000 euros.

In [None]:
y_preds_all_good = np.zeros(y_test.shape) 

tn, fp, fn, tp = confusion_matrix(y_test, y_preds_all_good).ravel()

print('Number of True Negatives:', tn)
print('Number of True Positives:', tp)
print('Number of False Negatives:', fn)
print('Number of False Positives:', fp)

In [None]:
print('Total Cost:', fn*3000)

In [None]:
accuracy_score(y_test, y_preds_all_good)

If every client were treated as a good payer, the bank would incur a cost of 3,939,000 euros.
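
The two baseline costs also reveal the composition of the test set. The sketch below works backwards from the reported totals, assuming they come exactly from the cost formula (1,000 euros per false positive, 3,000 euros per false negative); the variable names are illustrative.

```python
FP_COST = 1000  # good payer flagged as bad
FN_COST = 3000  # bad payer flagged as good

cost_all_bad = 4_687_000   # every client predicted as a bad payer
cost_all_good = 3_939_000  # every client predicted as a good payer

# Predicting "bad" for everyone makes every good payer a false positive.
good_in_test = cost_all_bad // FP_COST
# Predicting "good" for everyone makes every bad payer a false negative.
bad_in_test = cost_all_good // FN_COST

test_size = good_in_test + bad_in_test
acc_all_bad = bad_in_test / test_size
acc_all_good = good_in_test / test_size

print(good_in_test, bad_in_test, test_size)
print(f"accuracy if all bad: {acc_all_bad:.4f}, if all good: {acc_all_good:.4f}")
```

Under that assumption the test set holds 4,687 good payers and 1,313 bad payers. Treating everyone as a good payer is the cheaper of the two trivial strategies, so 3,939,000 euros is the figure a model has to beat.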

In [None]:
def total_cost(y_test, y_preds, threshold = 0.5):
    
    tn, fp, fn, tp = confusion_matrix(y_test == 1, y_preds > threshold).ravel()
    
    cost_fn = fn*3000
    cost_fp = fp*1000
    
    return cost_fn + cost_fp

In [None]:
def min_cost_threshold(y_test, y_preds):
    
    costs = {}
    
    for threshold in np.arange(0, 1.1, 0.1):
        
        costs[round(threshold, 1)] = total_cost(y_test, y_preds, threshold = threshold)
        
    plt.plot(list(costs.keys()), list(costs.values()))
    plt.ylabel('Cost')
    plt.xlabel('Threshold')
    plt.show()
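
The plot shows where the cost curve bottoms out, but it can also help to get the minimising threshold back as a value. The following is a pure-Python sketch of that idea, not the notebook's implementation: `cost_at_threshold` and `best_threshold` are hypothetical helpers, the scores are made up, and a score counts as "bad payer" only when it is strictly greater than the threshold, as in `total_cost`.

```python
def cost_at_threshold(y_true, scores, threshold, fp_cost=1000, fn_cost=3000):
    """Cost of flagging as bad payers all clients with score > threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s > threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and not s > threshold)
    return fp_cost * fp + fn_cost * fn


def best_threshold(y_true, scores, thresholds=None):
    """Return (threshold, cost) for the cheapest threshold on the grid."""
    if thresholds is None:
        thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    costs = {t: cost_at_threshold(y_true, scores, t) for t in thresholds}
    best = min(costs, key=costs.get)  # ties resolve to the lowest threshold
    return best, costs[best]


# Made-up labels and scores for illustration (1 = bad payer).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

print(best_threshold(y_true, scores))
```

Because a missed bad payer costs three times as much as a false alarm, the cheapest threshold in this toy example sits well below 0.5.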
    

#### Logistic Regression

## Create a run

In [None]:
run = mlflow.start_run(run_name="Logistic Regression Run")
RUN_ID = run.info.run_id
RUN_ID

## Log the run's datasets, models, artifacts, metrics, and parameters

In [None]:
# Log the training and test datasets associated with the run
train_dataset = mlflow.data.from_pandas(train_set, targets=TARGET_COL, name="Logistic Train Dataset")
test_dataset = mlflow.data.from_pandas(test_set, targets=TARGET_COL, name="Logistic Test Dataset")
mlflow.log_input(train_dataset, context="train")
mlflow.log_input(test_dataset, context="test")

# Log the seed used as a parameter
mlflow.log_param("seed", SEED)

In [None]:
from mlflow.models import infer_signature

signature = infer_signature(X_train, y_train)

In [None]:
lr_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("logistic_regression", LogisticRegression(max_iter=500, solver='lbfgs', random_state=SEED, class_weight='balanced')),
    ]
)
# Pipeline hyperparameters are addressed as <step name>__<parameter>
parameters = {'logistic_regression__C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Set up GridSearchCV; by default it refits the best pipeline on the full training set
grid_search_lr = GridSearchCV(lr_pipeline, parameters, cv=5)

grid_search_lr.fit(X_train, y_train)

# Log the tuned pipeline, which is the model evaluated below
best_lr_pipeline = grid_search_lr.best_estimator_
mlflow.sklearn.log_model(best_lr_pipeline, artifact_path="lr_pipeline", registered_model_name="logistic_reg", signature=signature)
best_lr_pipeline

In [None]:
# Log the parameters of the tuned pipeline selected by the grid search
params = grid_search_lr.best_estimator_.get_params()

modified_params = {}
for k, v in params.items():
    new_key = k.replace("logistic_regression__", '')
    modified_params[new_key] = v

mlflow.log_params(modified_params)
modified_params
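
The key rewriting above relies on scikit-learn's `<step name>__<parameter>` naming for pipeline parameters. Below is a pure-Python sketch of the same idea that strips the prefix only when it appears at the start of a key. The helper `strip_step_prefix` and the sample dictionary are hypothetical; the real `get_params()` output contains many more entries.

```python
def strip_step_prefix(params, step_name):
    """Drop a leading '<step_name>__' from parameter names; keep other keys as they are."""
    prefix = step_name + "__"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in params.items()
    }


# Hypothetical subset of a pipeline's parameters, for illustration only.
params = {
    "scaler__with_mean": True,
    "logistic_regression__C": 1.0,
    "logistic_regression__max_iter": 500,
}

flat = strip_step_prefix(params, "logistic_regression")
print(flat)
```

Keys from other steps keep their prefix, which avoids collisions if two steps happen to share a parameter name.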

In [None]:
grid_search_lr.score(X_test, y_test)

In [None]:
y_preds = grid_search_lr.predict(X_test)
acc = accuracy_score(y_test, y_preds)
mlflow.log_metric("accuracy", acc)
acc

In [None]:
# Use the predicted probability of default so that the threshold has an effect
y_probs = grid_search_lr.predict_proba(X_test)[:, 1]
total_cost(y_test, y_probs, threshold = 0.5)

In [None]:
min_cost_threshold(y_test, y_probs)

In [None]:
total_cost(y_test, y_probs, threshold = 0.6)

The cost is lowest at a threshold of 0.6: 2,646,000 euros, which is better than the baseline!
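
A threshold sweep is only informative when it runs over probability scores. Hard 0/1 predictions give the same cost for every threshold strictly between 0 and 1. The toy sketch below illustrates this with made-up numbers; `cost_at` is a hypothetical stand-in for the cost formula, not the notebook's function.

```python
def cost_at(y_true, y_score, threshold):
    """1000 per false positive plus 3000 per false negative at a given threshold."""
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s > threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and not s > threshold)
    return 1000 * fp + 3000 * fn


# Made-up data for illustration (1 = bad payer).
y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.55, 0.45, 0.9]
hard_labels = [1 if p > 0.5 else 0 for p in probabilities]

thresholds = [0.3, 0.5, 0.7]
costs_from_probs = [cost_at(y_true, probabilities, t) for t in thresholds]
costs_from_labels = [cost_at(y_true, hard_labels, t) for t in thresholds]

print(costs_from_probs)   # varies with the threshold
print(costs_from_labels)  # flat: the labels were already cut at 0.5
```

With a scikit-learn classifier, scores of this kind usually come from `predict_proba(X)[:, 1]` rather than from `predict(X)`.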

In [None]:
mlflow.end_run()

In [None]:
run = mlflow.get_run(RUN_ID)
run.data

#### KNN

In [None]:
run = mlflow.start_run(run_name="KNN Run")
RUN_ID = run.info.run_id
RUN_ID

In [None]:
# Log the training and test datasets associated with the run
train_dataset = mlflow.data.from_pandas(train_set, targets=TARGET_COL, name="KNN Train Dataset")
test_dataset = mlflow.data.from_pandas(test_set, targets=TARGET_COL, name="KNN Test Dataset")
mlflow.log_input(train_dataset, context="train")
mlflow.log_input(test_dataset, context="test")

# Log the seed used as a parameter
mlflow.log_param("seed", SEED)

In [None]:
KNN_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        # Default hyperparameters (n_neighbors=5); KNeighborsClassifier takes no random_state
        ("knn", KNeighborsClassifier())
])
KNN_pipeline.fit(X_train, y_train)
# Log the KNN pipeline itself, not the logistic regression one
mlflow.sklearn.log_model(KNN_pipeline, artifact_path="KNN_pipeline", registered_model_name="KNN")
KNN_pipeline

In [None]:
params=KNN_pipeline.get_params()

modified_params = {}
for k, v in params.items():
    new_key = k.replace("knn__", '')
    modified_params[new_key] = v

mlflow.log_params(modified_params)
modified_params