<h2 style="color:blue" align="center"> Imputation with Pandas fillna() </h2>

### Problem:
 Sometimes our datasets have missing values.

 Most machine learning algorithms cannot handle missing values directly.

### Solutions
#### Solution 1:
   Drop each feature that contains missing values (drop the column).
#### Solution 2:
   Drop each entry that contains missing values (drop the row).
#### Solution 3:
   Imputation (fill in the missing values).
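
The three options can be compared side by side on a small frame. This is only a sketch: the `toy` DataFrame and its column names are made up for illustration and are not part of the Kaggle data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Kaggle file)
toy = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 80.0],
    "rooms": [2, 3, 3, 4],
    "quality": ["Gd", "TA", None, "TA"],
})

# Solution 1: drop every column that contains a missing value
drop_cols = toy.dropna(axis=1)

# Solution 2: drop every row that contains a missing value
drop_rows = toy.dropna(axis=0)

# Solution 3: impute (mean for the numeric column, a fixed label for the text column)
imputed = toy.fillna({"area": toy["area"].mean(), "quality": "TA"})

print(drop_cols.shape, drop_rows.shape, imputed.shape)
```

Dropping columns or rows discards data, while imputation keeps the full shape of the frame.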

### Imputation:
Imputation deals with missing data points by substituting new values for them.

Common strategy: replace missing data points with the mean, median or mode of the column.
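
For numeric columns, the choice between mean and median matters most when outliers are present. A minimal sketch, assuming a made-up Series with one extreme value:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one outlier (100.0) and one missing entry
values = pd.Series([10.0, 12.0, np.nan, 11.0, 100.0])

# Both statistics skip NaN by default
mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # stays close to the typical values

filled_with_mean = values.fillna(mean_value)
filled_with_median = values.fillna(median_value)

print(mean_value, median_value)
```

Here the mean is 33.25 and the median is 11.5, so the median is likely the safer substitute when a column is skewed.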

### 1. Get the data
From Kaggle.com:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

In [1]:
import pandas as pd

In [10]:
df = pd.read_csv('F:/03. Suresh/1. Material/05. Data Science/14. Jupyter/09. Kaggle/5. House Prices_Advanced Regression Techniques/train.csv')
df.head()

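| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |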


### Subset of the data to work with
- LotFrontage: Linear feet of street connected to property
- FireplaceQu: Fireplace quality
- GarageYrBlt: Year garage was built
- BsmtCond: General condition of the basement

In [12]:
housing = df[['LotFrontage','FireplaceQu','GarageYrBlt','BsmtCond']].copy()
housing.head()

| | LotFrontage | FireplaceQu | GarageYrBlt | BsmtCond |
|---|---|---|---|---|
| 0 | 65.0 | NaN | 2003.0 | TA |
| 1 | 80.0 | TA | 1976.0 | TA |
| 2 | 68.0 | TA | 2001.0 | TA |
| 3 | 60.0 | Gd | 1998.0 | Gd |
| 4 | 84.0 | TA | 2000.0 | TA |


### 2. Explore the missing values
#### Examine missing data

In [14]:
housing.isnull().sum()

LotFrontage    259
FireplaceQu    690
GarageYrBlt     81
BsmtCond        37
dtype: int64

In [15]:
housing.isnull().sum() / len(housing) * 100

LotFrontage    17.739726
FireplaceQu    47.260274
GarageYrBlt     5.547945
BsmtCond        2.534247
dtype: float64

### Drop columns with more than 25% of missing data

In [16]:
housing.drop('FireplaceQu', inplace=True, axis=1)
housing.head(10)

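| | LotFrontage | GarageYrBlt | BsmtCond |
|---|---|---|---|
| 0 | 65.0 | 2003.0 | TA |
| 1 | 80.0 | 1976.0 | TA |
| 2 | 68.0 | 2001.0 | TA |
| 3 | 60.0 | 1998.0 | Gd |
| 4 | 84.0 | 2000.0 | TA |
| 5 | 85.0 | 1993.0 | TA |
| 6 | 75.0 | 2004.0 | TA |
| 7 | NaN | 1973.0 | TA |
| 8 | 51.0 | 1931.0 | TA |
| 9 | 50.0 | 1939.0 | TA |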
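
Instead of naming the column by hand, the 25% rule could also be applied programmatically. The sketch below assumes a made-up `toy` frame and a `threshold` variable chosen for illustration; neither appears in the notebook itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different shares of missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],  # 50% missing
    "b": [1.0, 2.0, 3.0, np.nan],     # 25% missing
    "c": ["x", "y", "z", "w"],        # 0% missing
})

threshold = 0.25  # assumed cut-off, matching the 25% rule above

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()

# Keep only the names of columns strictly above the threshold
too_sparse = missing_share[missing_share > threshold].index

reduced = toy.drop(columns=too_sparse)
print(list(reduced.columns))
```

Because the comparison is strict, a column with exactly 25% missing values (`b` here) is kept.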


### 3. Impute substitute values
#### Strategy 1: Impute mean

In [17]:
Garage_Yr_mean = housing['GarageYrBlt'].mean()
Garage_Yr_mean

1978.5061638868744

In [18]:
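# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
housing['GarageYrBlt'] = housing['GarageYrBlt'].fillna(Garage_Yr_mean)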

#### Strategy 2: Impute median

In [19]:
Frontage_median = housing['LotFrontage'].median()
Frontage_median

69.0

In [20]:
# Assign the result back instead of using inplace=True on a selected column
housing['LotFrontage'] = housing['LotFrontage'].fillna(Frontage_median)

#### Strategy 3: Impute mode

In [21]:
housing['BsmtCond'].value_counts()

TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64

In [23]:
# 'TA' is the most frequent value (the mode) according to value_counts() above
housing['BsmtCond'] = housing['BsmtCond'].fillna('TA')
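
The mode does not have to be read off `value_counts()` and typed in by hand. A minimal sketch on a made-up Series (`cond` is a hypothetical name, not a column from the notebook):

```python
import pandas as pd

# Hypothetical categorical data with two missing entries
cond = pd.Series(["TA", "Gd", None, "TA", "Fa", None])

# mode() ignores missing values and returns a Series, because ties are possible;
# [0] takes the first of the most frequent values
most_common = cond.mode()[0]

cond_filled = cond.fillna(most_common)
print(most_common, cond_filled.tolist())
```

If several values are tied for most frequent, `mode()` returns all of them, so taking `[0]` is a simplification that may need a more deliberate rule on real data.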

### Check for missing data

In [24]:
housing.isnull().sum()

LotFrontage    0
GarageYrBlt    0
BsmtCond       0
dtype: int64
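
The steps in this notebook could be wrapped into one helper. The function below is a hypothetical sketch, not part of pandas: `impute_simple` is an invented name, and the rule it applies (median for numeric columns, mode for everything else) is an assumption chosen for illustration. The toy values only borrow the column names used above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    fill_values = {}
    for column in frame.columns:
        if not frame[column].isnull().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(frame[column]):
            fill_values[column] = frame[column].median()
        else:
            # Assumes the column has at least one non-missing value
            fill_values[column] = frame[column].mode()[0]
    # fillna returns a new frame; the input is left unchanged
    return frame.fillna(fill_values) if fill_values else frame.copy()


# Toy values for illustration only
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "BsmtCond": ["TA", None, "TA", "Gd"],
})

filled = impute_simple(toy)
print(filled.isnull().sum())
```

Returning a new frame rather than modifying the input makes it easy to compare the data before and after imputation.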