# Tutorial for using the package `fast-ml` 

This package is as good as having a junior Data Scientist working for you. Most of the commonly used EDA steps, missing data imputation techniques, and feature engineering steps are covered in a ready-to-use format.

## Part 3. Missing Data Imputation for Numerical Variables & Categorical Variables



#### 1. Import missing data imputation module from the package 
`from fast_ml.missing_data_imputation import MissingDataImputer_Categorical, MissingDataImputer_Numerical`

#### 2. Define the imputer object. 
* For Categorical variables use `MissingDataImputer_Categorical`
* For Numerical variables use `MissingDataImputer_Numerical`

`cat_imputer = MissingDataImputer_Categorical(method = 'frequent')`
<br>or<br>
`num_imputer = MissingDataImputer_Numerical(method = 'median')`

#### 3. Fit the object on your dataframe and provide a list of variables
Note: Even a single variable has to be passed as a list. <br>
`cat_imputer.fit(train, variables = ['BsmtQual'])`

#### 4. Apply the transform method on train / test dataset
`train = cat_imputer.transform(train)`
<br>&<br>
`test = cat_imputer.transform(test)`

#### 5. A parameter dictionary gets created that stores the values used for imputation. It can be viewed with
`cat_imputer.param_dict_`
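
The five steps above follow the usual fit/transform pattern: imputation values are learned once from the training data, stored, and then reused on any other dataset. The sketch below reproduces that pattern in plain pandas on a toy dataframe. It is not the `fast_ml` implementation; the helper names `fit_imputer` and `transform_imputer` and the toy values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def fit_imputer(df, variables, method="median"):
    """Learn one imputation value per variable (hypothetical helper)."""
    param_dict = {}
    for var in variables:
        if method == "median":
            param_dict[var] = df[var].median()        # missing values are skipped
        elif method == "frequent":
            param_dict[var] = df[var].mode().iloc[0]  # most frequent value
        else:
            raise ValueError(f"unsupported method: {method}")
    return param_dict

def transform_imputer(df, param_dict):
    """Fill missing values with the learned values and return a copy."""
    out = df.copy()
    for var, value in param_dict.items():
        out[var] = out[var].fillna(value)
    return out

# Toy data, made up for illustration
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

params = fit_imputer(train, ["LotFrontage"], method="median")
train_imputed = transform_imputer(train, params)
test_imputed = transform_imputer(test, params)  # reuses the value learned on train
print(params)
```

Learning the value on `train` only and reusing it on `test` keeps information from the test set out of the preprocessing step.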


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from fast_ml.missing_data_imputation import MissingDataImputer_Categorical, MissingDataImputer_Numerical

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [4]:
df = pd.read_csv('../data/house_prices.csv')
df.shape

(1460, 81)

In [3]:
df.head(5)

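| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |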


In [7]:
numeric_type = ['float64', 'int64']
category_type = ['object']

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
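
The `df.info()` listing is long, so a compact per-column summary of missing values can be easier to scan. The sketch below uses only standard pandas calls on a small toy dataframe (the values are made up for illustration); on the real data the same steps would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, "Grvl", None, None],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = toy.isnull().sum()  # number of missing values per column
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
    "dtype": toy.dtypes.astype(str),
})

# Keep only columns that have missing values, largest count first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```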

## Start Missing Data Imputation

In [2]:
num_imputer = MissingDataImputer_Numerical(method = 'median')

## Numerical Variables 

### 1. GarageYrBlt

In [8]:
# Use the following method for a numerical variable 
num_imputer.fit(df, ['GarageYrBlt'])

<fast_ml.missing_data_imputation.MissingDataImputer_Numerical at 0x1085ddf10>

In [9]:
num_imputer.param_dict_

{'GarageYrBlt': 1980.0}

In [10]:
df = num_imputer.transform(df)

UnboundLocalError: local variable 'var' referenced before assignment
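
The `transform` call above failed in this run with an error raised inside the package, so `df` was left unchanged. As a fallback, the same median imputation can be done directly in pandas. The sketch below is a plain-pandas equivalent on toy values, not the package's code; adding a `_nan` indicator column is an assumption that mirrors what the categorical imputer does later in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"GarageYrBlt": [2003.0, np.nan, 1976.0, 1980.0, np.nan]})

# Median is computed on the non-missing values only
median_value = toy["GarageYrBlt"].median()

# Assumed convention: flag rows that were missing before filling them
toy["GarageYrBlt_nan"] = toy["GarageYrBlt"].isnull().astype(int)
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(median_value)

print(median_value)
print(toy)
```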

### 2. LotFrontage

In [None]:
# Fit the numerical imputer on LotFrontage, following the same pattern as GarageYrBlt
num_imputer.fit(df, variables = ['LotFrontage'])

## Categorical Variables

### 1. BsmtQual 

In [13]:
#Before Imputation
df['BsmtQual'].value_counts()

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64
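
Note that `value_counts()` leaves out missing values by default, which is why the counts above add up to 1423 rather than 1460 rows. Passing `dropna=False` includes them. A small illustration on toy values (made up for this example):

```python
import numpy as np
import pandas as pd

s = pd.Series(["TA", "Gd", np.nan, "TA", np.nan, "Ex"], name="BsmtQual")

counts_default = s.value_counts()               # missing values excluded
counts_with_nan = s.value_counts(dropna=False)  # missing values get their own row

print(counts_with_nan)
print("missing:", s.isnull().sum())
```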

In [14]:
cat_imputer1 = MissingDataImputer_Categorical(method = 'frequent')
cat_imputer1.fit(df, variables = ['BsmtQual'])

df = cat_imputer1.transform(df)

In [15]:
cat_imputer1.param_dict_

{'BsmtQual': 'TA'}

In [16]:
#After Imputation
df['BsmtQual'].value_counts()

TA    686
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64

In [17]:
# After Imputation a new indicator variable gets created
df['BsmtQual_nan'].value_counts()

0    1423
1      37
Name: BsmtQual_nan, dtype: int64
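
One way to sanity-check a 'frequent' imputation is to confirm that every row flagged by the indicator now holds the most frequent category, and that the count of that category grew by exactly the number of flagged rows (here, 686 - 649 = 37). The sketch below runs this check on toy data in plain pandas; the imputation itself is done with `fillna` as a stand-in, not with the package.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"BsmtQual": ["TA", "Gd", np.nan, "TA", np.nan, "Ex"]})

before = toy["BsmtQual"].value_counts()
most_frequent = before.idxmax()  # 'TA' in this toy data

# Stand-in for the imputer: flag missing rows, then fill them
toy["BsmtQual_nan"] = toy["BsmtQual"].isnull().astype(int)
toy["BsmtQual"] = toy["BsmtQual"].fillna(most_frequent)

after = toy["BsmtQual"].value_counts()
n_flagged = toy["BsmtQual_nan"].sum()

# Every flagged row holds the most frequent category
assert (toy.loc[toy["BsmtQual_nan"] == 1, "BsmtQual"] == most_frequent).all()
# The count of that category grew by exactly the number of flagged rows
assert after[most_frequent] - before[most_frequent] == n_flagged
```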

### 2. FireplaceQu

In [3]:
#Before Imputation
df['FireplaceQu'].value_counts()

Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: FireplaceQu, dtype: int64

In [18]:
cat_imputer2 = MissingDataImputer_Categorical(method = 'custom_value', value = 'Missing')
cat_imputer2.fit(df, variables = ['FireplaceQu'])

print (cat_imputer2.param_dict_)

df = cat_imputer2.transform(df)

{'FireplaceQu': 'Missing'}


In [19]:
#After Imputation
df['FireplaceQu'].value_counts()

Missing    690
Gd         380
TA         313
Fa          33
Ex          24
Po          20
Name: FireplaceQu, dtype: int64

In [20]:
#After Imputation
df['FireplaceQu_nan'].value_counts()

0    770
1    690
Name: FireplaceQu_nan, dtype: int64