# Assignment 7.1: Pitching Your ML Algorithm

Find an exciting business problem, find data, and solve the problem with machine learning in Python using the CRISP-DM methodology and algorithms covered in the course:
1. Business understanding - What does the business need?
2. Data understanding - What data do we have/need? Is it clean?
3. Data preparation - How do we organize the data for modeling?
4. Modeling - What modeling techniques should we apply?
5. Evaluation - Which model best meets the business objectives?
6. Deployment - How to get the model in production and ensure it works?

Note: Your team is required to use GitHub to host your code, collaborate, and manage versions. GitHub helps ensure traceability, allows rollback, and avoids unintended overwrites and loss of code. You can use the integration between Google Colab and GitHub to achieve the goals of the project.

Dataset: **Home Credit Default Dataset**

**Project Scenario:**
Before presenting to the executive decision-making body, there is a code review round and a technical solution review by a technical committee. The committee consists of fellow ML Engineers, Data Engineers, Architects, ML managers, and the Head of Machine Learning.
There are three key deliverables for this final team project:

* Google Colab notebook (ipynb)
* Business presentation slides (pptx/pdf)
* Recorded video presentation (mp4)

The Colab notebook is for the technical committee, whereas the business brief is targeted to a non-technical executive committee. Both committees will view the slides and video. Please see the following explanation for the requirements on each file.


**Google Colab Notebook with Python code:**
The Google Colab Notebook must be organized like a report where the code blocks are interspersed with text blocks. The text block that appears before a code block must explain the approach. The text blocks that follow the output graphs and tables must contain inference, actionable insight, and recommendations. The code blocks themselves must be annotated with comments so they are readable.

The notebook must contain the following sections:

* Problem statement and justification for the proposed approach.
* Data understanding (EDA) - a graphical and non-graphical representation of relationships between the response variable and predictor variables.
* Data preparation.
* Feature engineering - data pre-processing - missing values, outliers, etc.
* Feature Selection - how were the features selected based on the data analysis?
* Modeling - selection, comparison, tuning, and analysis - consider ensembles.
* Evaluation - performance measures, results, and conclusions.
* Discussion and conclusions - address the problem statement and recommendation.

# Problem Statement

In [None]:
# @title Justification


In [None]:
# @title Development Approach


# Data Preparation

In [2]:
# @title Imports
!pip install Cython dataprep mrmr_selection pymrmr --quiet

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn import tree
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from tabulate import tabulate
from mrmr import mrmr_classif
from dataprep.clean import clean_df
from dataprep.eda import create_report
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier


In [None]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster()
client = Client(cluster)
client

In [None]:
import dask.dataframe as dd

# Base directory path
base_path = "/home/jovyan/shared/javonkitson/data/home-credit-default-risk/"

# Read and store CSV files
train_csv_path = base_path + "application_train.csv"
test_csv_path = base_path + "application_test.csv"
bureau_csv = base_path + "bureau.csv"
bureau_balance_csv = base_path + "bureau_balance.csv"
cash_balance_csv = base_path + "POS_CASH_balance.csv"
credit_csv = base_path + "credit_card_balance.csv"
installment_csv = base_path + "installments_payments.csv"

# Load the data first
df_train = dd.read_csv(train_csv_path, dtype='object')

# List of all CSV files to load and merge
csv_files = [test_csv_path, cash_balance_csv, credit_csv, installment_csv]

# Proceed with load and merge incrementally
for csv_file in csv_files:
    df_temp = dd.read_csv(csv_file, dtype='object')
    df_train = df_train.merge(df_temp, on='SK_ID_CURR', how='outer')
    del df_temp  # delete the dataframe that has been merged

# Specific handling for bureau data as it requires merge with bureau_balance before merging with df_train
df_bureau = dd.read_csv(bureau_csv, dtype='object')
df_bureau_balance = dd.read_csv(bureau_balance_csv, dtype='object')
df_bureau = df_bureau.merge(df_bureau_balance, on='SK_ID_BUREAU', how='outer')
del df_bureau_balance
df_train = df_train.merge(df_bureau, on='SK_ID_CURR', how='outer')
del df_bureau

# It is recommended to perform a compute() operation only when necessary because it will load the data into memory.
df_train = df_train.compute()

df_train.to_csv('dataset-dirty.csv') #save df
print(df_train.head())
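
The merges above join a one-row-per-applicant table with tables that hold many rows per `SK_ID_CURR`, so each applicant can be repeated once per matching child row. One common alternative is to aggregate each child table to one row per applicant before merging. A minimal pandas sketch on toy data follows; the frames and the aggregate column names (`AMT_PAYMENT_mean` and so on) are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Toy stand-ins for application_train and installments_payments (values are made up)
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
installments = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "AMT_PAYMENT": [100.0, 150.0, 50.0, 200.0, 300.0],
})

# A direct merge repeats each applicant once per installment row
direct = applications.merge(installments, on="SK_ID_CURR", how="left")

# Aggregating first keeps exactly one row per applicant
agg = installments.groupby("SK_ID_CURR")["AMT_PAYMENT"].agg(["mean", "sum", "count"])
agg.columns = [f"AMT_PAYMENT_{name}" for name in agg.columns]
aggregated = applications.merge(agg.reset_index(), on="SK_ID_CURR", how="left")

print(len(direct), len(aggregated))
```

Applicants without any child rows keep their row and receive NaN in the aggregate columns.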

In [None]:
# Close the Dask client
client.close()

In [2]:
# # Check point
# del df_train
df_train = pd.read_csv("dataset-dirty.csv")
df_train.set_index('SK_ID_CURR', inplace=True)

In [3]:
df_train = df_train.drop(df_train.columns[0], axis=1)
df_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE_x,CODE_GENDER_x,FLAG_OWN_CAR_x,FLAG_OWN_REALTY_x,CNT_CHILDREN_x,AMT_INCOME_TOTAL_x,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,...,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,MONTHS_BALANCE,STATUS
100002.0,1.0,Cash loans,M,N,Y,0.0,202500.0,406597.5,24700.5,351000.0,...,,,,,,,,,,
100003.0,0.0,Cash loans,F,N,N,0.0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,,,,,
100004.0,0.0,Revolving loans,M,Y,Y,0.0,67500.0,135000.0,6750.0,135000.0,...,,,,,,,,,,
100006.0,0.0,Cash loans,F,N,Y,0.0,135000.0,312682.5,29686.5,297000.0,...,,,,,,,,,,
100007.0,0.0,Cash loans,M,N,Y,0.0,121500.0,513000.0,21865.5,513000.0,...,,,,,,,,,,
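
The `_x` and `_y` suffixes in the columns above are what pandas (and Dask) add by default when both sides of a merge share column names other than the join key, as happens when `application_test.csv` is merged into `application_train.csv`. The sketch below reproduces that behavior on toy frames and shows `pd.concat` as one possible alternative for stacking tables that share a schema. It is an illustration, not the notebook's approach.

```python
import pandas as pd

train = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [10.0, 20.0], "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3, 4], "AMT_CREDIT": [30.0, 40.0]})

# Outer merge on the key: the shared column is duplicated as AMT_CREDIT_x / AMT_CREDIT_y
merged = train.merge(test, on="SK_ID_CURR", how="outer")

# Stacking keeps a single AMT_CREDIT column; TARGET is NaN for the test rows
stacked = pd.concat([train, test], ignore_index=True)

print(merged.columns.tolist())
print(stacked.columns.tolist())
```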


# Feature Engineering

## Cleaning

In [4]:
column_dtypes_dict = df_train.dtypes.to_dict()
print(column_dtypes_dict)

{'TARGET': dtype('float64'), 'NAME_CONTRACT_TYPE_x': dtype('O'), 'CODE_GENDER_x': dtype('O'), 'FLAG_OWN_CAR_x': dtype('O'), 'FLAG_OWN_REALTY_x': dtype('O'), 'CNT_CHILDREN_x': dtype('float64'), 'AMT_INCOME_TOTAL_x': dtype('float64'), 'AMT_CREDIT_x': dtype('float64'), 'AMT_ANNUITY_x': dtype('float64'), 'AMT_GOODS_PRICE_x': dtype('float64'), 'NAME_TYPE_SUITE_x': dtype('O'), 'NAME_INCOME_TYPE_x': dtype('O'), 'NAME_EDUCATION_TYPE_x': dtype('O'), 'NAME_FAMILY_STATUS_x': dtype('O'), 'NAME_HOUSING_TYPE_x': dtype('O'), 'REGION_POPULATION_RELATIVE_x': dtype('float64'), 'DAYS_BIRTH_x': dtype('float64'), 'DAYS_EMPLOYED_x': dtype('float64'), 'DAYS_REGISTRATION_x': dtype('float64'), 'DAYS_ID_PUBLISH_x': dtype('float64'), 'OWN_CAR_AGE_x': dtype('float64'), 'FLAG_MOBIL_x': dtype('float64'), 'FLAG_EMP_PHONE_x': dtype('float64'), 'FLAG_WORK_PHONE_x': dtype('float64'), 'FLAG_CONT_MOBILE_x': dtype('float64'), 'FLAG_PHONE_x': dtype('float64'), 'FLAG_EMAIL_x': dtype('float64'), 'OCCUPATION_TYPE_x': dtype('O

In [4]:
data_types = {
    'TARGET': float,
    'NAME_CONTRACT_TYPE_x': str,
    'CODE_GENDER_x': str,
    'FLAG_OWN_CAR_x': str,
    'FLAG_OWN_REALTY_x': str,
    'CNT_CHILDREN_x': float,
    'AMT_INCOME_TOTAL_x': float,
    'AMT_CREDIT_x': float,
    'AMT_ANNUITY_x': float,
    'AMT_GOODS_PRICE_x': float,
    'NAME_TYPE_SUITE_x': str,
    'NAME_INCOME_TYPE_x': str,
    'NAME_EDUCATION_TYPE_x': str,
    'NAME_FAMILY_STATUS_x': str,
    'NAME_HOUSING_TYPE_x': str,
    'REGION_POPULATION_RELATIVE_x': float,
    'DAYS_BIRTH_x': float,
    'DAYS_EMPLOYED_x': float,
    'DAYS_REGISTRATION_x': float,
    'DAYS_ID_PUBLISH_x': float,
    'OWN_CAR_AGE_x': float,
    'FLAG_MOBIL_x': float,
    'FLAG_EMP_PHONE_x': float,
    'FLAG_WORK_PHONE_x': float,
    'FLAG_CONT_MOBILE_x': float,
    'FLAG_PHONE_x': float,
    'FLAG_EMAIL_x': float,
    'OCCUPATION_TYPE_x': str,
    'CNT_FAM_MEMBERS_x': float,
    'REGION_RATING_CLIENT_x': float,
    'REGION_RATING_CLIENT_W_CITY_x': float,
    'WEEKDAY_APPR_PROCESS_START_x': str,
    'HOUR_APPR_PROCESS_START_x': float,
    'REG_REGION_NOT_LIVE_REGION_x': float,
    'REG_REGION_NOT_WORK_REGION_x': float,
    'LIVE_REGION_NOT_WORK_REGION_x': float,
    'REG_CITY_NOT_LIVE_CITY_x': float,
    'REG_CITY_NOT_WORK_CITY_x': float,
    'LIVE_CITY_NOT_WORK_CITY_x': float,
    'ORGANIZATION_TYPE_x': str,
    'EXT_SOURCE_1_x': float,
    'EXT_SOURCE_2_x': float,
    'EXT_SOURCE_3_x': float,
    'APARTMENTS_AVG_x': float,
    'BASEMENTAREA_AVG_x': float,
    'YEARS_BEGINEXPLUATATION_AVG_x': float,
    'YEARS_BUILD_AVG_x': float,
    'COMMONAREA_AVG_x': float,
    'ELEVATORS_AVG_x': float,
    'ENTRANCES_AVG_x': float,
    'FLOORSMAX_AVG_x': float,
    'FLOORSMIN_AVG_x': float,
    'LANDAREA_AVG_x': float,
    'LIVINGAPARTMENTS_AVG_x': float,
    'LIVINGAREA_AVG_x': float,
    'NONLIVINGAPARTMENTS_AVG_x': float,
    'NONLIVINGAREA_AVG_x': float,
    'APARTMENTS_MODE_x': float,
    'BASEMENTAREA_MODE_x': float,
    'YEARS_BEGINEXPLUATATION_MODE_x': float,
    'YEARS_BUILD_MODE_x': float,
    'COMMONAREA_MODE_x': float,
    'ELEVATORS_MODE_x': float,
    'ENTRANCES_MODE_x': float,
    'FLOORSMAX_MODE_x': float,
    'FLOORSMIN_MODE_x': float,
    'LANDAREA_MODE_x': float,
    'LIVINGAPARTMENTS_MODE_x': float,
    'LIVINGAREA_MODE_x': float,
    'NONLIVINGAPARTMENTS_MODE_x': float,
    'NONLIVINGAREA_MODE_x': float,
    'APARTMENTS_MEDI_x': float,
    'BASEMENTAREA_MEDI_x': float,
    'YEARS_BEGINEXPLUATATION_MEDI_x': float,
    'YEARS_BUILD_MEDI_x': float,
    'COMMONAREA_MEDI_x': float,
    'ELEVATORS_MEDI_x': float,
    'ENTRANCES_MEDI_x': float,
    'FLOORSMAX_MEDI_x': float,
    'FLOORSMIN_MEDI_x': float,
    'LANDAREA_MEDI_x': float,
    'LIVINGAPARTMENTS_MEDI_x': float,
    'LIVINGAREA_MEDI_x': float,
    'NONLIVINGAPARTMENTS_MEDI_x': float,
    'NONLIVINGAREA_MEDI_x': float,
    'FONDKAPREMONT_MODE_x': str,
    'HOUSETYPE_MODE_x': str,
    'TOTALAREA_MODE_x': float,
    'WALLSMATERIAL_MODE_x': str,
    'EMERGENCYSTATE_MODE_x': str,
    'OBS_30_CNT_SOCIAL_CIRCLE_x': float,
    'DEF_30_CNT_SOCIAL_CIRCLE_x': float,
    'OBS_60_CNT_SOCIAL_CIRCLE_x': float,
    'DEF_60_CNT_SOCIAL_CIRCLE_x': float,
    'DAYS_LAST_PHONE_CHANGE_x': float,
    'FLAG_DOCUMENT_2_x': float,
    'FLAG_DOCUMENT_3_x': float,
    'FLAG_DOCUMENT_4_x': float,
    'FLAG_DOCUMENT_5_x': float,
    'FLAG_DOCUMENT_6_x': float,
    'FLAG_DOCUMENT_7_x': float,
    'FLAG_DOCUMENT_8_x': float,
    'FLAG_DOCUMENT_9_x': float,
    'FLAG_DOCUMENT_10_x': float,
    'FLAG_DOCUMENT_11_x': float,
    'FLAG_DOCUMENT_12_x': float,
    'FLAG_DOCUMENT_13_x': float,
    'FLAG_DOCUMENT_14_x': float,
    'FLAG_DOCUMENT_15_x': float,
    'FLAG_DOCUMENT_16_x': float,
    'FLAG_DOCUMENT_17_x': float,
    'FLAG_DOCUMENT_18_x': float,
    'FLAG_DOCUMENT_19_x': float,
    'FLAG_DOCUMENT_20_x': float,
    'FLAG_DOCUMENT_21_x': float,
    'AMT_REQ_CREDIT_BUREAU_HOUR_x': float,
    'AMT_REQ_CREDIT_BUREAU_DAY_x': float,
    'AMT_REQ_CREDIT_BUREAU_WEEK_x': float,
    'AMT_REQ_CREDIT_BUREAU_MON_x': float,
    'AMT_REQ_CREDIT_BUREAU_QRT_x': float,
    'AMT_REQ_CREDIT_BUREAU_YEAR_x': float,
    'NAME_CONTRACT_TYPE_y': str,
    'CODE_GENDER_y': str,
    'FLAG_OWN_CAR_y': str,
    'FLAG_OWN_REALTY_y': str,
    'CNT_CHILDREN_y': float,
    'AMT_INCOME_TOTAL_y': float,
    'AMT_CREDIT_y': float,
    'AMT_ANNUITY_y': float,
    'AMT_GOODS_PRICE_y': float,
    'NAME_TYPE_SUITE_y': str,
    'NAME_INCOME_TYPE_y': str,
    'NAME_EDUCATION_TYPE_y': str,
    'NAME_FAMILY_STATUS_y': str,
    'NAME_HOUSING_TYPE_y': str,
    'REGION_POPULATION_RELATIVE_y': float,
    'DAYS_BIRTH_y': float,
    'DAYS_EMPLOYED_y': float,
    'DAYS_REGISTRATION_y': float,
    'DAYS_ID_PUBLISH_y': float,
    'OWN_CAR_AGE_y': float,
    'FLAG_MOBIL_y': float,
    'FLAG_EMP_PHONE_y': float,
    'FLAG_WORK_PHONE_y': float,
    'FLAG_CONT_MOBILE_y': float,
    'FLAG_PHONE_y': float,
    'FLAG_EMAIL_y': float,
    'OCCUPATION_TYPE_y': str,
    'CNT_FAM_MEMBERS_y': float,
    'REGION_RATING_CLIENT_y': float,
    'REGION_RATING_CLIENT_W_CITY_y': float,
    'WEEKDAY_APPR_PROCESS_START_y': str,
    'HOUR_APPR_PROCESS_START_y': float,
    'REG_REGION_NOT_LIVE_REGION_y': float,
    'REG_REGION_NOT_WORK_REGION_y': float,
    'LIVE_REGION_NOT_WORK_REGION_y': float,
    'REG_CITY_NOT_LIVE_CITY_y': float,
    'REG_CITY_NOT_WORK_CITY_y': float,
    'LIVE_CITY_NOT_WORK_CITY_y': float,
    'ORGANIZATION_TYPE_y': str,
    'EXT_SOURCE_1_y': float,
    'EXT_SOURCE_2_y': float,
    'EXT_SOURCE_3_y': float,
    'APARTMENTS_AVG_y': float,
    'BASEMENTAREA_AVG_y': float,
    'YEARS_BEGINEXPLUATATION_AVG_y': float,
    'YEARS_BUILD_AVG_y': float,
    'COMMONAREA_AVG_y': float,
    'ELEVATORS_AVG_y': float,
    'ENTRANCES_AVG_y': float,
    'FLOORSMAX_AVG_y': float,
    'FLOORSMIN_AVG_y': float,
    'LANDAREA_AVG_y': float,
    'LIVINGAPARTMENTS_AVG_y': float,
    'LIVINGAREA_AVG_y': float,
    'NONLIVINGAPARTMENTS_AVG_y': float,
    'NONLIVINGAREA_AVG_y': float,
    'APARTMENTS_MODE_y': float,
    'BASEMENTAREA_MODE_y': float,
    'YEARS_BEGINEXPLUATATION_MODE_y': float,
    'YEARS_BUILD_MODE_y': float,
    'COMMONAREA_MODE_y': float,
    'ELEVATORS_MODE_y': float,
    'ENTRANCES_MODE_y': float,
    'FLOORSMAX_MODE_y': float,
    'FLOORSMIN_MODE_y': float,
    'LANDAREA_MODE_y': float,
    'LIVINGAPARTMENTS_MODE_y': float,
    'LIVINGAREA_MODE_y': float,
    'NONLIVINGAPARTMENTS_MODE_y': float,
    'NONLIVINGAREA_MODE_y': float,
    'APARTMENTS_MEDI_y': float,
    'BASEMENTAREA_MEDI_y': float,
    'YEARS_BEGINEXPLUATATION_MEDI_y': float,
    'YEARS_BUILD_MEDI_y': float,
    'COMMONAREA_MEDI_y': float,
    'ELEVATORS_MEDI_y': float,
    'ENTRANCES_MEDI_y': float,
    'FLOORSMAX_MEDI_y': float,
    'FLOORSMIN_MEDI_y': float,
    'LANDAREA_MEDI_y': float,
    'LIVINGAPARTMENTS_MEDI_y': float,
    'LIVINGAREA_MEDI_y': float,
    'NONLIVINGAPARTMENTS_MEDI_y': float,
    'NONLIVINGAREA_MEDI_y': float,
    'FONDKAPREMONT_MODE_y': str,
    'HOUSETYPE_MODE_y': str,
    'TOTALAREA_MODE_y': float,
    'WALLSMATERIAL_MODE_y': str,
    'EMERGENCYSTATE_MODE_y': str,
    'OBS_30_CNT_SOCIAL_CIRCLE_y': float,
    'DEF_30_CNT_SOCIAL_CIRCLE_y': float,
    'OBS_60_CNT_SOCIAL_CIRCLE_y': float,
    'DEF_60_CNT_SOCIAL_CIRCLE_y': float,
    'DAYS_LAST_PHONE_CHANGE_y': float,
    'FLAG_DOCUMENT_2_y': float,
    'FLAG_DOCUMENT_3_y': float,
    'FLAG_DOCUMENT_4_y': float,
    'FLAG_DOCUMENT_5_y': float,
    'FLAG_DOCUMENT_6_y': float,
    'FLAG_DOCUMENT_7_y': float,
    'FLAG_DOCUMENT_8_y': float,
    'FLAG_DOCUMENT_9_y': float,
    'FLAG_DOCUMENT_10_y': float,
    'FLAG_DOCUMENT_11_y': float,
    'FLAG_DOCUMENT_12_y': float,
    'FLAG_DOCUMENT_13_y': float,
    'FLAG_DOCUMENT_14_y': float,
    'FLAG_DOCUMENT_15_y': float,
    'FLAG_DOCUMENT_16_y': float,
    'FLAG_DOCUMENT_17_y': float,
    'FLAG_DOCUMENT_18_y': float,
    'FLAG_DOCUMENT_19_y': float,
    'FLAG_DOCUMENT_20_y': float,
    'FLAG_DOCUMENT_21_y': float,
    'AMT_REQ_CREDIT_BUREAU_HOUR_y': float,
    'AMT_REQ_CREDIT_BUREAU_DAY_y': float,
    'AMT_REQ_CREDIT_BUREAU_WEEK_y': float,
    'AMT_REQ_CREDIT_BUREAU_MON_y': float,
    'AMT_REQ_CREDIT_BUREAU_QRT_y': float,
    'AMT_REQ_CREDIT_BUREAU_YEAR_y': float,
    'SK_ID_PREV_x': float,
    'MONTHS_BALANCE_x': float,
    'CNT_INSTALMENT': float,
    'CNT_INSTALMENT_FUTURE': float,
    'NAME_CONTRACT_STATUS_x': str,
    'SK_DPD_x': float,
    'SK_DPD_DEF_x': float,
    'SK_ID_PREV_y': float,
    'MONTHS_BALANCE_y': float,
    'AMT_BALANCE': float,
    'AMT_CREDIT_LIMIT_ACTUAL': float,
    'AMT_DRAWINGS_ATM_CURRENT': float,
    'AMT_DRAWINGS_CURRENT': float,
    'AMT_DRAWINGS_OTHER_CURRENT': float,
    'AMT_DRAWINGS_POS_CURRENT': float,
    'AMT_INST_MIN_REGULARITY': float,
    'AMT_PAYMENT_CURRENT': float,
    'AMT_PAYMENT_TOTAL_CURRENT': float,
    'AMT_RECEIVABLE_PRINCIPAL': float,
    'AMT_RECIVABLE': float,
    'AMT_TOTAL_RECEIVABLE': float,
    'CNT_DRAWINGS_ATM_CURRENT': float,
    'CNT_DRAWINGS_CURRENT': float,
    'CNT_DRAWINGS_OTHER_CURRENT': float,
    'CNT_DRAWINGS_POS_CURRENT': float,
    'CNT_INSTALMENT_MATURE_CUM': float,
    'NAME_CONTRACT_STATUS_y': str,
    'SK_DPD_y': float,
    'SK_DPD_DEF_y': float,
    'SK_ID_PREV': float,
    'NUM_INSTALMENT_VERSION': float,
    'NUM_INSTALMENT_NUMBER': float,
    'DAYS_INSTALMENT': float,
    'DAYS_ENTRY_PAYMENT': float,
    'AMT_INSTALMENT': float,
    'AMT_PAYMENT': float,
    'SK_ID_BUREAU': float,
    'CREDIT_ACTIVE': str,
    'CREDIT_CURRENCY': str,
    'DAYS_CREDIT': float,
    'CREDIT_DAY_OVERDUE': float,
    'DAYS_CREDIT_ENDDATE': float,
    'DAYS_ENDDATE_FACT': float,
    'AMT_CREDIT_MAX_OVERDUE': float,
    'CNT_CREDIT_PROLONG': float,
    'AMT_CREDIT_SUM': float,
    'AMT_CREDIT_SUM_DEBT': float,
    'AMT_CREDIT_SUM_LIMIT': float,
    'AMT_CREDIT_SUM_OVERDUE': float,
    'CREDIT_TYPE': str,
    'DAYS_CREDIT_UPDATE': float,
    'AMT_ANNUITY': float,
    'MONTHS_BALANCE': float,
    'STATUS': str
}
# Set the data types for the DataFrame columns
df_train = df_train.astype(data_types)

In [5]:
df_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE_x,CODE_GENDER_x,FLAG_OWN_CAR_x,FLAG_OWN_REALTY_x,CNT_CHILDREN_x,AMT_INCOME_TOTAL_x,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,...,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,MONTHS_BALANCE,STATUS
100002.0,1.0,Cash loans,M,N,Y,0.0,202500.0,406597.5,24700.5,351000.0,...,,,,,,,,,,
100003.0,0.0,Cash loans,F,N,N,0.0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,,,,,
100004.0,0.0,Revolving loans,M,Y,Y,0.0,67500.0,135000.0,6750.0,135000.0,...,,,,,,,,,,
100006.0,0.0,Cash loans,F,N,Y,0.0,135000.0,312682.5,29686.5,297000.0,...,,,,,,,,,,
100007.0,0.0,Cash loans,M,N,Y,0.0,121500.0,513000.0,21865.5,513000.0,...,,,,,,,,,,


In [6]:
column_dtypes_dict = df_train.dtypes.to_dict()
print(column_dtypes_dict)

{'TARGET': dtype('float64'), 'NAME_CONTRACT_TYPE_x': dtype('O'), 'CODE_GENDER_x': dtype('O'), 'FLAG_OWN_CAR_x': dtype('O'), 'FLAG_OWN_REALTY_x': dtype('O'), 'CNT_CHILDREN_x': dtype('float64'), 'AMT_INCOME_TOTAL_x': dtype('float64'), 'AMT_CREDIT_x': dtype('float64'), 'AMT_ANNUITY_x': dtype('float64'), 'AMT_GOODS_PRICE_x': dtype('float64'), 'NAME_TYPE_SUITE_x': dtype('O'), 'NAME_INCOME_TYPE_x': dtype('O'), 'NAME_EDUCATION_TYPE_x': dtype('O'), 'NAME_FAMILY_STATUS_x': dtype('O'), 'NAME_HOUSING_TYPE_x': dtype('O'), 'REGION_POPULATION_RELATIVE_x': dtype('float64'), 'DAYS_BIRTH_x': dtype('float64'), 'DAYS_EMPLOYED_x': dtype('float64'), 'DAYS_REGISTRATION_x': dtype('float64'), 'DAYS_ID_PUBLISH_x': dtype('float64'), 'OWN_CAR_AGE_x': dtype('float64'), 'FLAG_MOBIL_x': dtype('float64'), 'FLAG_EMP_PHONE_x': dtype('float64'), 'FLAG_WORK_PHONE_x': dtype('float64'), 'FLAG_CONT_MOBILE_x': dtype('float64'), 'FLAG_PHONE_x': dtype('float64'), 'FLAG_EMAIL_x': dtype('float64'), 'OCCUPATION_TYPE_x': dtype('O

In [6]:
# Get categorical columns
categorical_cols = df_train.select_dtypes(include=['object', 'category']).columns

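# NOTE: dropna() returns a new DataFrame; the result is not assigned here, so no rows are dropped.
# Use df_train = df_train.dropna() to actually drop rows with NaN values.
df_train.dropna(inplace=False)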

# Apply LabelEncoder on categorical columns
le = LabelEncoder()
for col in categorical_cols:
    df_train[col] = le.fit_transform(df_train[col])

# Print the encoded dataframe
print(df_train.head())

            TARGET  NAME_CONTRACT_TYPE_x  CODE_GENDER_x  FLAG_OWN_CAR_x  \
SK_ID_CURR                                                                
100002.0       1.0                     0              1               0   
100003.0       0.0                     0              0               0   
100004.0       0.0                     1              1               1   
100006.0       0.0                     0              0               0   
100007.0       0.0                     0              1               0   

            FLAG_OWN_REALTY_x  CNT_CHILDREN_x  AMT_INCOME_TOTAL_x  \
SK_ID_CURR                                                          
100002.0                    1             0.0            202500.0   
100003.0                    0             0.0            270000.0   
100004.0                    1             0.0             67500.0   
100006.0                    1             0.0            135000.0   
100007.0                    1             0.0            121
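
Assuming `astype(str)` turns missing values into the literal string `'nan'` (as it does for object columns in pandas 1.x and 2.x), `LabelEncoder` then assigns that string its own integer code. This would explain why columns such as `CREDIT_TYPE` and `STATUS` show one constant code in rows that have no bureau data. A toy illustration with made-up values, using `map(str)` to mimic the cast:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

credit_type = pd.Series(["Consumer credit", np.nan, "Credit card", np.nan], dtype=object)

# str() turns a missing value into the literal text 'nan'
as_text = credit_type.map(str)

encoder = LabelEncoder()
codes = encoder.fit_transform(as_text)

print(list(encoder.classes_))  # 'nan' appears as a regular class
print(codes.tolist())          # missing rows all share the code of 'nan'
```

If missing values should stay distinguishable from real categories, filling them with an explicit label such as `"MISSING"` before encoding is one option.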

In [7]:
df_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE_x,CODE_GENDER_x,FLAG_OWN_CAR_x,FLAG_OWN_REALTY_x,CNT_CHILDREN_x,AMT_INCOME_TOTAL_x,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,...,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,MONTHS_BALANCE,STATUS
100002.0,1.0,0,1,0,1,0.0,202500.0,406597.5,24700.5,351000.0,...,,,,,,12,,,,11
100003.0,0.0,0,0,0,0,0.0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,12,,,,11
100004.0,0.0,1,1,1,1,0.0,67500.0,135000.0,6750.0,135000.0,...,,,,,,12,,,,11
100006.0,0.0,0,0,0,1,0.0,135000.0,312682.5,29686.5,297000.0,...,,,,,,12,,,,11
100007.0,0.0,0,1,0,1,0.0,121500.0,513000.0,21865.5,513000.0,...,,,,,,12,,,,11


In [None]:
column_dtypes_dict = df_train.dtypes.to_dict()
print(column_dtypes_dict)

In [8]:
# @title Cleaning data of null values

inferred_dtypes, df_train1 = clean_df(df_train)

Data Type Detection Report:
	These data types are supported by DataPrep to clean: ['coordinate']
Column Headers Cleaning Report:
	295 values cleaned (100.0%)
Downcast Memory Report:
	Memory reducted from 3556228786 to 1458280824. New size: (41.01%)


In [9]:
df_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE_x,CODE_GENDER_x,FLAG_OWN_CAR_x,FLAG_OWN_REALTY_x,CNT_CHILDREN_x,AMT_INCOME_TOTAL_x,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,...,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,MONTHS_BALANCE,STATUS
100002.0,1.0,0,1,0,1,0.0,202500.0,406597.5,24700.5,351000.0,...,,,,,,12,,,,11
100003.0,0.0,0,0,0,0,0.0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,12,,,,11
100004.0,0.0,1,1,1,1,0.0,67500.0,135000.0,6750.0,135000.0,...,,,,,,12,,,,11
100006.0,0.0,0,0,0,1,0.0,135000.0,312682.5,29686.5,297000.0,...,,,,,,12,,,,11
100007.0,0.0,0,1,0,1,0.0,121500.0,513000.0,21865.5,513000.0,...,,,,,,12,,,,11


In [10]:
# check point
df_train1.to_csv('dataset-clean.csv')  # save the cleaned frame returned by clean_df

In [3]:
# del df_train
df_train = pd.read_csv("dataset-clean.csv")

In [4]:
df_train = df_train.drop(df_train.columns[0], axis=1)
df_train.head()

Unnamed: 0,target,name_contract_type_x,code_gender_x,flag_own_car_x,flag_own_realty_x,cnt_children_x,amt_income_total_x,amt_credit_x,amt_annuity_x,amt_goods_price_x,...,cnt_credit_prolong,amt_credit_sum,amt_credit_sum_debt,amt_credit_sum_limit,amt_credit_sum_overdue,credit_type,days_credit_update,amt_annuity,months_balance,status
0,1.0,0,1,0,1,0.0,202500.0,406597.5,24700.5,351000.0,...,,,,,,12,,,,11
1,0.0,0,0,0,0,0.0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,12,,,,11
2,0.0,1,1,1,1,0.0,67500.0,135000.0,6750.0,135000.0,...,,,,,,12,,,,11
3,0.0,0,0,0,1,0.0,135000.0,312682.5,29686.5,297000.0,...,,,,,,12,,,,11
4,0.0,0,1,0,1,0.0,121500.0,513000.0,21865.5,513000.0,...,,,,,,12,,,,11


In [None]:
# @title Exploration - Post Clean
create_report(df_train)

In [5]:
# Split the train dataset into features and target
X = df_train.drop('target', axis=1)  # Features excluding 'TARGET'
y = df_train['target']  # Target variable
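
Because `application_test.csv` has no `TARGET` column, rows that come only from the test file carry a missing target after the outer merge. A supervised model cannot be fit on those rows, so one option is to keep only labeled rows before building the feature matrix and target. A toy sketch, where the frame and the names `X_labeled` and `y_labeled` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "target": [1.0, 0.0, np.nan, 0.0, np.nan],
    "amt_credit_x": [406597.5, 1293502.5, np.nan, 312682.5, np.nan],
})

labeled = df[df["target"].notna()]         # keep rows that have a label
X_labeled = labeled.drop(columns="target")
y_labeled = labeled["target"].astype(int)  # 0/1 integer labels

print(len(df), len(labeled))
```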

In [6]:
# @title Select top features

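# WARNING: make_classification replaces the real X and y from the previous cell with
# synthetic data of the same shape, so the steps below run on synthetic data.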
X, y = make_classification(n_samples=len(X), n_features = X.shape[1])
X = pd.DataFrame(X)
y = pd.Series(y)

# use mrmr classification
selected_features = mrmr_classif(X, y, K = 250) # reduce features to 250

100%|██████████| 250/250 [18:16<00:00,  4.39s/it]
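
mRMR (minimum redundancy, maximum relevance) picks features greedily: at each step it favors a feature that is strongly related to the target but weakly correlated with the features already chosen. The following is a simplified sketch of that idea, using an F-statistic for relevance and mean absolute Pearson correlation for redundancy. It is not the `mrmr_selection` implementation, and the helper name `mrmr_sketch` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def mrmr_sketch(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Greedy relevance/redundancy selection (illustrative, not the library's code)."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        if selected:
            # Mean absolute correlation with the features chosen so far
            redundancy = corr.loc[remaining, selected].mean(axis=1).clip(lower=1e-6)
        else:
            redundancy = pd.Series(1.0, index=remaining)
        score = relevance[remaining] / redundancy
        best = score.idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected


X_arr, y_arr = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y_demo = pd.Series(y_arr)

top = mrmr_sketch(X_demo, y_demo, k=3)
print(top)  # column names, usable directly as X_demo[top]
```

Returning column names rather than positions means the result can be used with label-based indexing regardless of column order.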


In [7]:
# @title Top Features move to dataframe
column_names = X.columns.tolist()
# Slice the original dataframe using the selected features indices
X_selected = X.iloc[:, selected_features]

# Assign the column names to the selected features
X_selected.columns = [column_names[idx] for idx in selected_features]

X = X_selected
X.shape

(1335422, 250)

In [8]:
# @title Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [9]:
# @title Attribute Transformation - PCA

pca = PCA(n_components=250)
X_pca = pca.fit_transform(X_scaled)
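
Fitting `StandardScaler` and `PCA` on the full dataset before the train/test split lets statistics from the test rows influence the transformation. One way to avoid that is to put both steps in a scikit-learn pipeline that is fit on the training split only. A minimal sketch on synthetic data, with illustrative sizes and variable names:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Scaler and PCA learn their parameters from the training rows only
preprocess = make_pipeline(StandardScaler(), PCA(n_components=10))
X_tr_pca = preprocess.fit_transform(X_tr)
X_te_pca = preprocess.transform(X_te)  # test rows are transformed, never fitted

print(X_tr_pca.shape, X_te_pca.shape)
```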

In [None]:
# Number of components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()

In [10]:
# @title Components to capture 95% of variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
num_components = np.where(cumulative_variance > 0.95)[0][0] + 1
num_components

236
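
scikit-learn's `PCA` can also choose the number of components itself when `n_components` is a float between 0 and 1 (with `svd_solver="full"`), which avoids fitting twice. The sketch below compares that option with the cumulative-variance calculation above on synthetic data; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_classification(n_samples=500, n_features=30, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Manual route: fit all components, then find the first count exceeding 95% cumulative variance
full_pca = PCA().fit(X_demo_scaled)
manual_k = int(np.argmax(np.cumsum(full_pca.explained_variance_ratio_) > 0.95)) + 1

# Direct route: ask PCA for enough components to explain 95% of the variance
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_demo_scaled)

print(manual_k, auto_pca.n_components_)
```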

In [11]:
# Transform the data with selected number of components
pca = PCA(n_components=num_components)
X_pca = pca.fit_transform(X_scaled)
X_pca = pd.DataFrame(X_pca)

# X_pca is now a DataFrame, so pandas methods can be used on it
X_pca.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,226,227,228,229,230,231,232,233,234,235
0,0.745438,0.095931,-1.047625,0.966647,-0.894694,-0.970628,-0.369723,-0.062495,-0.310169,-0.969923,...,-0.545846,0.30562,0.690527,0.371785,0.286293,-1.365181,-0.075678,0.150954,0.32547,-0.959555
1,3.662334,1.020315,1.145328,1.397463,-0.339386,0.341313,1.735152,-0.457706,-0.37834,-0.22367,...,-0.44325,0.842433,-2.122636,0.900954,0.386745,-0.134709,0.896469,-0.206256,0.379123,-0.492379
2,-1.9891,0.700436,-1.543952,-1.069058,1.33059,0.068835,0.814924,1.876856,-0.153752,0.121498,...,-2.197661,-0.925977,0.789072,0.290272,0.621475,-0.717782,0.068961,-0.633579,-0.587526,0.028881
3,0.299007,-1.188679,1.167004,0.509994,-0.928622,1.578074,-0.404966,1.482033,-0.273937,0.142765,...,-0.562447,-0.06433,-0.856476,0.144761,1.501965,-1.695894,0.416346,-0.544713,0.029865,-0.696143
4,-0.088042,0.374622,0.219495,-1.078353,1.468514,0.382709,0.954323,-2.365139,0.505188,-0.298718,...,1.89052,-0.505255,0.443064,-0.651609,0.681985,-0.765576,-0.99766,-0.124857,0.070311,1.056096


# Feature Selection

In [12]:
# Split the PCA-transformed features and the target into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=1) # 70% training and 30% test
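
Default is usually a relatively rare outcome in credit data, so a purely random split can leave the two sets with slightly different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both sets. A sketch on synthetic labels with an assumed 8% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 80 + [0] * 920)  # assumed 8% positive class, for illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```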

# Modeling

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5, 6, 7, 10],
    'subsample': [0.8, 0.9, 1]
}

gbc = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Create GradientBoostingClassifier with the best parameters
gbc = GradientBoostingClassifier(**best_params, random_state=42)

# Fit the model to the training data
gbc.fit(X_train, y_train)
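
With `refit=True` (the default), `GridSearchCV` already retrains the best configuration on the full training data and exposes it as `best_estimator_`, so a separate refit is optional. The `scoring` argument selects the metric used to rank candidates; ROC AUC is used below as one reasonable choice for an imbalanced target. This is a scaled-down sketch on synthetic data, and the grid and data sizes are deliberately small and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

small_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=small_grid,
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
    cv=3,
)
search.fit(X_tr, y_tr)

best_model = search.best_estimator_  # already refit on all of X_tr
print(search.best_params_, round(search.best_score_, 3))
```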

# Evaluation

In [None]:
#@title Evaluating the Gradient Boost model
y_hat = gbc.predict(X_train) # Predict the response for train dataset
y_pred = gbc.predict(X_test) # Predict the response for test dataset
y_pred_proba = gbc.predict_proba(X_test) # Predicted probabilities of the target classes from a classifier.

# Compute training and test accuracy
print("Train Accuracy:", metrics.accuracy_score(y_train, y_hat))
print("Test Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Compute F1-score
print("Train F1-Score:", f1_score(y_train, y_hat))
print("Test F1-Score:", f1_score(y_test, y_pred))

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = gbc.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
print("AUC:", auc)

scores = cross_val_score(gbc, X_train, y_train, cv=6)
print("Cross-Validation Accuracy Scores:", scores)
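
On an imbalanced target, accuracy alone can look strong even for a model that never predicts a default, so it helps to compare against a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with an assumed 8% positive rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 80 + [0] * 920)  # assumed 8% positives, for illustration only

# Baseline that always predicts the majority class (no default)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)
proba = baseline.predict_proba(X_demo)[:, 1]

print("Accuracy:", accuracy_score(y_demo, pred))       # 0.92, despite no skill
print("F1:", f1_score(y_demo, pred, zero_division=0))  # 0.0, no positives predicted
print("ROC AUC:", roc_auc_score(y_demo, proba))        # 0.5, no ranking ability
```

A model is only useful here if it clearly beats this baseline on F1, ROC AUC, or a similar metric that accounts for the minority class.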

In [None]:
# Create a DataFrame to store the evaluation metrics
metrics_table = pd.DataFrame(columns=['Model', 'Train Accuracy', 'Test Accuracy', 'Train F1-Score', 'Test F1-Score', 'Precision', 'Recall', 'AUC', 'AUPRC'])

# Define the models
models = {
    'gbc': gbc,
}

# Compute and populate the evaluation metrics for each model
for model_name, model in models.items():
    # Predict the response for train dataset
    y_hat = model.predict(X_train)
    # Predict the response for test dataset
    y_pred = model.predict(X_test)
    # Predicted probabilities of the target classes from a classifier
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Compute the evaluation metrics
    train_accuracy = metrics.accuracy_score(y_train, y_hat)
    test_accuracy = metrics.accuracy_score(y_test, y_pred)
    train_f1_score = metrics.f1_score(y_train, y_hat)
    test_f1_score = metrics.f1_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    auc = metrics.roc_auc_score(y_test, y_pred_proba)
    auprc = metrics.average_precision_score(y_test, y_pred_proba)

    # Compute the confusion matrix
    confusion_mat = metrics.confusion_matrix(y_test, y_pred)

    # Add the metrics to the table (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    metrics_table = pd.concat([metrics_table, pd.DataFrame([{
        'Model': model_name,
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy,
        'Train F1-Score': train_f1_score,
        'Test F1-Score': test_f1_score,
        'Precision': precision,
        'Recall': recall,
        'AUC': auc,
        'AUPRC': auprc
    }])], ignore_index=True)

    # Print the confusion matrix
    print(f"Confusion Matrix for {model_name}:")
    print(confusion_mat)
    print()

    # Print the additional evaluation metrics
    print("Train Accuracy:", train_accuracy)
    print("Test Accuracy:", test_accuracy)
    print("Train F1-Score:", train_f1_score)
    print("Test F1-Score:", test_f1_score)
    print("Precision:", precision)
    print("Recall:", recall)
    print(metrics.classification_report(y_test, y_pred))
    print("AUC:", auc)
    print("AUPRC:", auprc)
    print()

    # Plot ROC curve
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {auc:.2f})')

# Plot the diagonal line
plt.plot([0, 1], [0, 1], 'k--')

# Set labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')

# Set legend
plt.legend()

# Show the plot
plt.show()

# Display the comparison table
print(tabulate(metrics_table, headers='keys', tablefmt='github'))

# Conclusion