<div style="display: flex; background-color: #3F579F;">
    <h1 style="margin: auto; font-weight: bold; padding: 30px 30px 0px 30px; color:#fff;" align="center">Implement a scoring model - P7</h1>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 5px 30px 0px 30px;" >
    <h3 style="width: 100%; text-align: center; float: left; font-size: 24px; color:#fff;" align="center">| Notebook optimization |</h3>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 10px 30px 30px 30px;">
    <h4 style="width: 100%; text-align: center; float: left; font-size: 24px; color:#fff;" align="center">Data Scientist course - OpenClassrooms</h4>
</div>

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">1. Libraries and files</h2>
</div>

<div style="background-color: #506AB9;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">1.1. Libraries</h3>
</div>

In [1]:
import re
import numpy as np
import pandas as pd
from functools import partial

from imblearn.over_sampling import SMOTE

import lightgbm as lgb
from lightgbm import LGBMClassifier

import sklearn
from sklearn.metrics import (roc_auc_score, roc_curve, 
                             precision_recall_curve, confusion_matrix, 
                             PrecisionRecallDisplay, ConfusionMatrixDisplay)
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Hyperparameter optimization
from hyperopt import tpe, hp, fmin, STATUS_OK, Trials, space_eval
from hyperopt.pyll.base import scope

import joblib

## Own specific functions 
from functions import *

<div style="background-color: #506AB9;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">1.2. Files</h3>
</div>

In [2]:
df = pd.read_csv(r"datasets\df_processed.csv")
df = df.drop(columns=["index"])

In [3]:
df_analysis(df, "df", analysis_type="header")


Analysis Header of df dataset
--------------------------------------------------------------------------------
- Dataset shape:			 356251 rows and 797 columns
- Total of NaN values:			 72099981
- Percentage of NaN:			 25.39 %
- Total of infinite values:		 21
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 2.1 GB


<div class="alert alert-block alert-warning">
    <p><b>Observations / Conclusions</b></p>
    <ul style="list-style-type: square;">
        <li><b>Missing values</b> - 25.39% of the values are missing and need to be treated</li>
        <li><b>Infinite values</b> - There are 21 infinite values</li>
    </ul> 
</div>

<div style="background-color: #506AB9;" >
    <h4 style="margin: auto; padding: 20px; color:#fff; ">1.2.1 Optimizing memory usage</h4>
</div>

<div class="alert alert-block alert-info">
    <p>We reduce the memory usage by downcasting column types, to avoid memory problems during execution</p>
</div>

In [4]:
# TARGET contains NaN (customers to predict), so it cannot be stored as int8:
# replacing a -99 placeholder with NaN would upcast it back to float64.
# Cast directly to float32 instead.
df["TARGET"] = df["TARGET"].astype("float32")

In [5]:
# Downcast binary int64 columns (flags) to int8
for col in df.columns:
    if df[col].dtype == "int64" and df[col].nunique() == 2:
        df[col] = df[col].astype("int8")

In [6]:
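# Downcast float64 columns whose values fit in the int32 range to float32
# (columns containing infinite values fail the check and stay float64)
for col in df.columns:
    if df[col].dtype == "float64" and df[col].min() >= -2147483648 and df[col].max() <= 2147483648:
        df[col] = df[col].astype("float32")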

In [7]:
df_analysis(df, "df", analysis_type="header")


Analysis Header of df dataset
--------------------------------------------------------------------------------
- Dataset shape:			 356251 rows and 797 columns
- Total of NaN values:			 72099981
- Percentage of NaN:			 25.39 %
- Total of infinite values:		 21
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 941.8 MB
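
<div class="alert alert-block alert-info">
    <p>The downcasting above can be reproduced on a small example. This is a minimal sketch, not the notebook's exact code: the <code>toy</code> DataFrame and its column names are hypothetical, and the float check uses the float32 limits from <code>np.finfo</code> (an assumption) instead of the int32 bounds used in the cell above. Keep in mind that float32 only holds about 7 significant digits, so large values can lose precision.</p>
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "FLAG": np.array([0, 1, 0, 1], dtype="int64"),
    "AMOUNT": np.array([1.5, 2.5, np.nan, 4.0], dtype="float64"),
})

before = toy.memory_usage(deep=True).sum()

# Binary integer columns fit in int8
for col in toy.select_dtypes(include="int64").columns:
    if toy[col].nunique() == 2:
        toy[col] = toy[col].astype("int8")

# float64 columns whose finite range fits in float32
# (min/max skip NaN by default; a column containing inf fails the check)
f32 = np.finfo(np.float32)
for col in toy.select_dtypes(include="float64").columns:
    if toy[col].min() >= f32.min and toy[col].max() <= f32.max:
        toy[col] = toy[col].astype("float32")

after = toy.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(toy.dtypes.to_dict())
```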


In [8]:
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,CC_NAME_CONTRACT_STATUS_Signed_MAX,CC_NAME_CONTRACT_STATUS_Signed_MEAN,CC_NAME_CONTRACT_STATUS_Signed_SUM,CC_NAME_CONTRACT_STATUS_Signed_VAR,CC_NAME_CONTRACT_STATUS_nan_MIN,CC_NAME_CONTRACT_STATUS_nan_MAX,CC_NAME_CONTRACT_STATUS_nan_MEAN,CC_NAME_CONTRACT_STATUS_nan_SUM,CC_NAME_CONTRACT_STATUS_nan_VAR,CC_COUNT
0,100002,1.0,0,0,0,0,202500.0,406597.5,24700.5,351000.0,...,,,,,,,,,,
1,100003,0.0,1,0,1,0,270000.0,1293502.5,35698.5,1129500.0,...,,,,,,,,,,
2,100004,0.0,0,1,0,0,67500.0,135000.0,6750.0,135000.0,...,,,,,,,,,,
3,100006,0.0,1,0,0,0,135000.0,312682.5,29686.5,297000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
4,100007,0.0,0,0,0,0,121500.0,513000.0,21865.5,513000.0,...,,,,,,,,,,


In [9]:
df.select_dtypes(include=["object"]).columns.tolist()

[]

<div class="alert alert-block alert-warning">
    <p><b>Observations / Conclusions</b></p>
    <ul style="list-style-type: square;">
        <li><b>Column types</b> - All columns are numeric</li>
    </ul> 
</div>

<div style="background-color: #506AB9;" >
    <h4 style="margin: auto; padding: 20px; color:#fff; ">1.2.2. Missing-values</h4>
</div>

<div class="alert alert-block alert-info">
    <p>Before treating the class imbalance in the target, we need to treat the missing values in the whole dataset. We fill them with the mean of each column, the same strategy that SimpleImputer applies by default.
   </p>
    <p>Let's start by identifying the features with infinite values and replacing them with missing values.
   </p>
</div>

In [10]:
inf_cols = df.columns.to_series()[np.isinf(df).any()]

In [11]:
for col in inf_cols:
    df[col] = df[col].replace([np.inf, -np.inf], np.nan)

In [12]:
df_analysis(df, "df", analysis_type="header")


Analysis Header of df dataset
--------------------------------------------------------------------------------
- Dataset shape:			 356251 rows and 797 columns
- Total of NaN values:			 72100002
- Percentage of NaN:			 25.39 %
- Total of infinite values:		 0
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 941.8 MB
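
<div class="alert alert-block alert-info">
    <p>As a sanity check, the NaN total grew by exactly the 21 infinite values that were replaced (72099981 to 72100002). The same bookkeeping can be sketched on a small example; the <code>toy</code> DataFrame and its columns are hypothetical, and the replacement is applied to the whole frame at once instead of column by column.</p>
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two infinite values and one NaN
toy = pd.DataFrame({
    "RATIO": [0.5, np.inf, 2.0, -np.inf],
    "COUNT": [1.0, 2.0, np.nan, 4.0],
})

n_inf_before = int(np.isinf(toy.to_numpy()).sum())
n_nan_before = int(toy.isna().sum().sum())

# Replace +inf and -inf with NaN in every column
toy = toy.replace([np.inf, -np.inf], np.nan)

n_inf_after = int(np.isinf(toy.to_numpy()).sum())
n_nan_after = int(toy.isna().sum().sum())

# Every replaced infinite value becomes one additional NaN
print(n_inf_before, n_nan_before, n_inf_after, n_nan_after)
```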


<div class="alert alert-block alert-info">
    <p>Let's continue by identifying the features with missing values, excluding TARGET
   </p>
</div>

In [13]:
nan_cols = [i for i in df.columns if i!="TARGET" and df[i].isnull().any()]

In [14]:
for col in nan_cols:
    mean_value = df[col].mean()
    # assign back instead of calling fillna(inplace=True) on a column selection
    df[col] = df[col].fillna(mean_value)

In [15]:
df_analysis(df, "df", analysis_type="header")


Analysis Header of df dataset
--------------------------------------------------------------------------------
- Dataset shape:			 356251 rows and 797 columns
- Total of NaN values:			 48744
- Percentage of NaN:			 0.02 %
- Total of infinite values:		 0
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 941.8 MB
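
<div class="alert alert-block alert-info">
    <p>The mean imputation can also be written without a loop. The sketch below uses a hypothetical <code>toy</code> DataFrame. Variant A mirrors what this notebook does (means computed over all rows). Variant B is an assumption added for comparison: it computes the means on the labelled rows only, which may be preferable if the customers to predict should not influence the imputation values.</p>
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: TARGET is NaN for the customers to predict
toy = pd.DataFrame({
    "TARGET": [1.0, 0.0, np.nan, np.nan],
    "AMT": [100.0, np.nan, 300.0, 200.0],
    "DAYS": [np.nan, 10.0, 30.0, np.nan],
})

features = toy.columns.drop("TARGET")
labelled_mask = toy["TARGET"].notna()

# Variant A (as in the notebook): column means over all rows
means_all = toy[features].mean()
filled_all = toy.copy()
filled_all[features] = toy[features].fillna(means_all)

# Variant B (assumption): column means over labelled rows only
means_labelled = toy.loc[labelled_mask, features].mean()
filled_labelled = toy.copy()
filled_labelled[features] = toy[features].fillna(means_labelled)

# In both variants TARGET keeps its missing values
print(filled_all)
print(filled_labelled)
```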


<div class="alert alert-block alert-success">
    <p>At this point, TARGET is the only column with missing-values</p>
</div>

<div class="alert alert-block alert-info">
    <p>Let's save the customers whose TARGET we are going to predict</p>
</div>

In [16]:
df_customers_to_predict = df[df["TARGET"].isnull()]

In [17]:
df_analysis(df_customers_to_predict, "df_customers_to_predict", analysis_type="header")


Analysis Header of df_customers_to_predict dataset
--------------------------------------------------------------------------------
- Dataset shape:			 48744 rows and 797 columns
- Total of NaN values:			 48744
- Percentage of NaN:			 0.13 %
- Total of infinite values:		 0
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 1
	+ The empty column is:		 ['TARGET']
- Unique indexes:			 True
- Memory usage:				 129.2 MB


In [18]:
# dropping TARGET feature (reassigning avoids pandas' SettingWithCopyWarning)
df_customers_to_predict = df_customers_to_predict.drop(columns="TARGET")


In [19]:
df_analysis(df_customers_to_predict, "df_customers_to_predict", analysis_type="header")


Analysis Header of df_customers_to_predict dataset
--------------------------------------------------------------------------------
- Dataset shape:			 48744 rows and 796 columns
- Total of NaN values:			 0
- Percentage of NaN:			 0.0 %
- Total of infinite values:		 0
- Percentage of infinite values:	 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 130.1 MB
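
<div class="alert alert-block alert-info">
    <p>The split between labelled customers and customers to predict can be sketched as follows. The <code>toy</code> DataFrame and the <code>labelled</code> name are hypothetical. Using <code>.loc</code> with a boolean mask and a non-inplace <code>drop</code> (or an explicit <code>.copy()</code>) produces independent DataFrames, so no "value is trying to be set on a copy of a slice" warning is raised.</p>
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: TARGET is NaN for the customers to predict
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "TARGET": [1.0, np.nan, 0.0, np.nan],
    "AMT": [10.0, 20.0, 30.0, 40.0],
})

to_predict_mask = toy["TARGET"].isna()

# Rows without a label: drop() returns a new DataFrame
to_predict = toy.loc[to_predict_mask].drop(columns="TARGET")

# Rows with a label: explicit copy, safe to modify later
labelled = toy.loc[~to_predict_mask].copy()

print(to_predict.shape, labelled.shape)
```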


In [20]:
# saving the customers to predict
df_customers_to_predict.to_csv(r"datasets\df_customers_to_predict.csv", index=False)

In [21]:
# saving the optimized dataset 
df.to_csv(r"datasets\df_optimized.csv", index=False)
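
<div class="alert alert-block alert-info">
    <p>Two points about saving are worth illustrating. A CSV file does not store column types, so the int8 / float32 downcasting is lost when the file is read back unless the types are passed to <code>read_csv</code>. Also, building the path with <code>pathlib</code> avoids backslash escapes and works on any operating system. The sketch below writes to a temporary folder; the <code>toy</code> DataFrame and the file name are hypothetical.</p>
</div>

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy frame with already optimized dtypes
toy = pd.DataFrame({
    "FLAG": np.array([0, 1], dtype="int8"),
    "AMT": np.array([1.5, 2.5], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp:
    # Portable path, no backslash escapes needed
    out_path = Path(tmp) / "datasets" / "toy_optimized.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Default read: pandas infers int64 / float64 again
    default_dtypes = pd.read_csv(out_path).dtypes.astype(str).to_dict()

    # Passing the dtypes explicitly restores the optimized types
    wanted = toy.dtypes.astype(str).to_dict()
    restored = pd.read_csv(out_path, dtype=wanted)

print(default_dtypes)
print(restored.dtypes.astype(str).to_dict())
```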

In [None]:
df[df["SK_ID_CURR"] == 371573]
