# Data Science Lecture 2 - Exercise 4

## Context

Often (if not always), our dataset contains NaN entries. These values can arise for different reasons: wrong encoding, missing value, unavailable value, etc. Dealing with NaN values is crucial for conducting a meaningful analysis. In this notebook, we will see how to handle them, and how to automate this process in the future.
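
As a small illustration of the "wrong encoding" case, the sketch below reads a toy CSV in which missing entries are written as placeholder strings. The toy data, the column names and the placeholders `"missing"` and `"?"` are assumptions made for this example; they do not come from the credit dataset.

```python
import io
import pandas as pd

# Hypothetical toy CSV: "missing" and "?" stand in for absent values.
raw = io.StringIO(
    "income,job_years\n"
    "52000,3 years\n"
    "missing,10+ years\n"
    "48000,?\n"
)

# na_values lists extra strings that read_csv should parse as NaN.
toy = pd.read_csv(raw, na_values=["missing", "?"])

# Count the NaN entries per column, as done for the real data in this notebook.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Without `na_values`, the `income` column would be read as text, and the placeholders would not be counted by `isna()`.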

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('sample_data/credit_train.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/credit_train.csv'

In [None]:
df.head()

In [6]:
df.columns.to_list()

['Loan ID',
 'Customer ID',
 'Loan Status',
 'Current Loan Amount',
 'Term',
 'Credit Score',
 'Annual Income',
 'Years in current job',
 'Home Ownership',
 'Purpose',
 'Monthly Debt',
 'Years of Credit History',
 'Months since last delinquent',
 'Number of Open Accounts',
 'Number of Credit Problems',
 'Current Credit Balance',
 'Maximum Open Credit',
 'Bankruptcies',
 'Tax Liens']

In [7]:
df['Annual Income'].isna().sum()

19668

In [8]:
ncol_missing = df.isna().sum(axis=1) # number of missing values in each row

In [9]:
#number of missing values per row
ncol_missing

0          1
1          2
2          0
3          1
4          3
          ..
100509    19
100510    19
100511    19
100512    19
100513    19
Length: 100514, dtype: int64

In [10]:
nrow_missing = df.isna().sum(axis=0) # number of missing values in each column

In [11]:
#number of missing values per column
nrow_missing

Loan ID                           514
Customer ID                       514
Loan Status                       514
Current Loan Amount               514
Term                              514
Credit Score                    19668
Annual Income                   19668
Years in current job             4736
Home Ownership                    514
Purpose                           514
Monthly Debt                      514
Years of Credit History           514
Months since last delinquent    53655
Number of Open Accounts           514
Number of Credit Problems         514
Current Credit Balance            514
Maximum Open Credit               516
Bankruptcies                      718
Tax Liens                         524
dtype: int64

In [12]:
#how many rows have more than 18 missing values
(ncol_missing > 18).sum()

514

In [13]:
#drop the rows that have more than 18 missing values each
df_1 = df.drop(df.index[ncol_missing > 18])
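
The same kind of row filter can also be written with `DataFrame.dropna` and its `thresh` parameter, which keeps rows that have at least `thresh` non-missing values. The sketch below checks that both approaches agree on a toy frame; the toy data and the cutoff `max_missing` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 3 columns; row 1 is missing everywhere.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 2.0, 5.0],
    "c": ["x", None, "y", "z"],
})

max_missing = 2  # assumed cutoff: drop rows with more than 2 missing values

# Approach 1: count missing values per row and filter with a boolean mask.
missing_per_row = toy.isna().sum(axis=1)
kept_mask = toy[missing_per_row <= max_missing]

# Approach 2: require at least (n_columns - max_missing) non-missing values.
kept_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept_mask.equals(kept_thresh))
```

For the credit data, with 19 columns and a cutoff of 18, the corresponding value would be `thresh=1`.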

### Categorical missing values

In [14]:
#to see which is the most common value for the column 'years in current job' (because it is 'categorical')
df_1['Years in current job'].value_counts()

10+ years    31121
2 years       9134
3 years       8169
< 1 year      8164
5 years       6787
1 year        6460
4 years       6143
6 years       5686
7 years       5577
8 years       4582
9 years       3955
Name: Years in current job, dtype: int64

In [15]:
#replace the NaN values of column 'Years in current job' with the most common value, '10+ years'
#(.loc avoids chained indexing, which triggers pandas' SettingWithCopyWarning)
df_1.loc[df_1['Years in current job'].isna(), 'Years in current job'] = '10+ years'

In [16]:
df_1['Years in current job'].value_counts()

10+ years    35343
2 years       9134
3 years       8169
< 1 year      8164
5 years       6787
1 year        6460
4 years       6143
6 years       5686
7 years       5577
8 years       4582
9 years       3955
Name: Years in current job, dtype: int64
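
Rather than hard-coding the most common category, it can be looked up with `Series.mode()`. This is a minimal sketch on a toy column; the column name `job_years` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame(
    {"job_years": ["10+ years", "2 years", np.nan, "10+ years", np.nan]}
)

# mode() ignores NaN by default and returns a Series; take its first entry.
most_common = toy["job_years"].mode()[0]

# fillna returns a new Series, so assign it back to the column.
toy["job_years"] = toy["job_years"].fillna(most_common)

print(toy["job_years"].value_counts())
```

If several categories are tied, `mode()` returns all of them, and `[0]` simply picks the first one.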

### Numerical missing values

In [17]:
#there are no missing values in the column "years in current job" 
df_1.apply(lambda x: x.isna().sum(), axis = 0)

Loan ID                             0
Customer ID                         0
Loan Status                         0
Current Loan Amount                 0
Term                                0
Credit Score                    19154
Annual Income                   19154
Years in current job                0
Home Ownership                      0
Purpose                             0
Monthly Debt                        0
Years of Credit History             0
Months since last delinquent    53141
Number of Open Accounts             0
Number of Credit Problems           0
Current Credit Balance              0
Maximum Open Credit                 2
Bankruptcies                      204
Tax Liens                          10
dtype: int64

In [18]:
#to calculate the mean value for each numerical column
df_1.mean(numeric_only=True)

Current Loan Amount             1.176045e+07
Credit Score                    1.076456e+03
Annual Income                   1.378277e+06
Monthly Debt                    1.847241e+04
Years of Credit History         1.819914e+01
Months since last delinquent    3.490132e+01
Number of Open Accounts         1.112853e+01
Number of Credit Problems       1.683100e-01
Current Credit Balance          2.946374e+05
Maximum Open Credit             7.607984e+05
Bankruptcies                    1.177402e-01
Tax Liens                       2.931293e-02
dtype: float64

In [19]:
#to fill each NaN value with the mean value of its column
df_1 = df_1.fillna(df_1.mean(numeric_only=True))

In [20]:
#there are no missing values in any column 
df_1.apply(lambda x: x.isna().sum(), axis = 0)

Loan ID                         0
Customer ID                     0
Loan Status                     0
Current Loan Amount             0
Term                            0
Credit Score                    0
Annual Income                   0
Years in current job            0
Home Ownership                  0
Purpose                         0
Monthly Debt                    0
Years of Credit History         0
Months since last delinquent    0
Number of Open Accounts         0
Number of Credit Problems       0
Current Credit Balance          0
Maximum Open Credit             0
Bankruptcies                    0
Tax Liens                       0
dtype: int64
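
To see what the mean imputation does on a scale that can be checked by hand, here is a sketch on a toy frame. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two numerical columns with gaps, one text column.
toy = pd.DataFrame({
    "income": [40000.0, np.nan, 60000.0],
    "debt": [100.0, 300.0, np.nan],
    "term": ["Short Term", "Long Term", "Short Term"],
})

# numeric_only=True restricts the mean to the numerical columns.
col_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the value under its own label.
filled = toy.fillna(col_means)

print(filled)
```

Columns that do not appear in `col_means`, such as `term` here, are left untouched by `fillna`.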

# Bonus: Automated Imputation with Sklearn

In [43]:
import sklearn
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [63]:
num_cols = df.select_dtypes(include=[np.number]).columns.to_list()
cat_cols = np.setxor1d(df.columns.to_list(), num_cols).tolist()

In [64]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])

In [50]:
df_cleaned = full_pipeline.fit_transform(df)

In [65]:
df_cleaned[0]

array([445412.0, 709.0, 1167493.0, 5214.74, 17.2, 34.90132098422929, 6.0,
       1.0, 228190.0, 416746.0, 1.0, 0.0,
       '981165ec-3274-42f5-a3b4-d104041a9ca9', 'Home Mortgage',
       '14dd8831-6af5-400b-83ec-68e61888a048', 'Fully Paid',
       'Home Improvements', 'Short Term', '8 years'], dtype=object)

In [66]:
df[num_cols+cat_cols].values[0]

array([445412.0, 709.0, 1167493.0, 5214.74, 17.2, nan, 6.0, 1.0, 228190.0,
       416746.0, 1.0, 0.0, '981165ec-3274-42f5-a3b4-d104041a9ca9',
       'Home Mortgage', '14dd8831-6af5-400b-83ec-68e61888a048',
       'Fully Paid', 'Home Improvements', 'Short Term', '8 years'],
      dtype=object)
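
The `ColumnTransformer` returns a plain NumPy array, so the column names are lost. One possible way to get a DataFrame back is sketched below on a toy frame; the toy data and the names `imputer` and `cleaned` are assumptions for illustration, and the column order follows the order in which the transformers were listed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "income": [40000.0, np.nan, 60000.0],
    "term": ["Short Term", np.nan, "Short Term"],
})
num_cols = ["income"]
cat_cols = ["term"]

imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# The output columns come out as num_cols followed by cat_cols.
values = imputer.fit_transform(toy)
cleaned = pd.DataFrame(values, columns=num_cols + cat_cols, index=toy.index)

# Mixed output is an object array, so restore the numerical dtypes explicitly.
cleaned = cleaned.astype({col: float for col in num_cols})

print(cleaned)
```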