
### In this lecture
* Identifying missing values
* Approaches to fill the missing values

In [0]:
import pandas as pd
import numpy as np

In [0]:
data = pd.read_csv('https://raw.githubusercontent.com/bhargav23/Dataset/master/Toyota.csv',index_col=0,na_values=['??','????'])

In [3]:
data.head()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90.0,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90.0,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90.0,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90.0,0.0,0,2000,3,1170


### Identifying missing values
* In pandas DataFrames, missing data is represented by **NaN** (short for Not a Number)
* To check for null values in a DataFrame, use **isnull()** or **isna()**; the two are aliases and behave identically
* Both functions return a DataFrame of Boolean values that are **True** wherever the original value is **NaN**
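
As a small illustration of the points above, the sketch below parses a made-up CSV string (not the Toyota data) in which `??` and `????` stand in for missing entries, mirroring the `na_values` argument used when loading the dataset. The names `csv_text` and `toy` are placeholders chosen for this example.

```python
import io

import pandas as pd

# Made-up CSV text; '??' and '????' play the role of missing-value markers.
csv_text = "Price,KM,HP\n13500,46986,90\n13750,??,90\n13950,41711,????\n"

# na_values lists extra strings that read_csv should parse as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["??", "????"])

mask_isna = toy.isna()      # Boolean DataFrame, True where a value is NaN
mask_isnull = toy.isnull()  # isnull() is an alias of isna()

print(mask_isna)
print(mask_isna.equals(mask_isnull))
```

Without `na_values`, the marker strings would be read as ordinary text, and neither function would flag them.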

In [4]:
data.isnull()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
1431,False,True,False,False,False,False,False,False,False,False
1432,False,False,True,False,False,False,False,False,False,False
1433,False,True,False,False,False,False,False,False,False,False
1434,False,False,True,True,False,False,False,False,False,False


In [5]:
data.isna()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
1431,False,True,False,False,False,False,False,False,False,False
1432,False,False,True,False,False,False,False,False,False,False
1433,False,True,False,False,False,False,False,False,False,False
1434,False,False,True,True,False,False,False,False,False,False


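### Counting missing values
* To count the missing values present in each column, chain **sum()** onto the Boolean result
* Syntax: DataFrame.isna().sum() or DataFrame.isnull().sum()
* Each **True** counts as 1 when summed, so the total for a column is its number of missing values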
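
A minimal sketch of the counting step on a made-up DataFrame (the values and the name `toy` are assumptions for illustration). It also shows that taking the mean of the same Boolean mask gives the fraction of missing rows per column, which can be easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
toy = pd.DataFrame({
    "Age": [23.0, np.nan, 30.0, np.nan],
    "KM": [46986.0, 72937.0, np.nan, 38500.0],
    "Price": [13500, 13750, 13950, 14950],
})

missing_count = toy.isna().sum()   # number of NaN values per column
missing_share = toy.isna().mean()  # fraction of rows that are NaN per column

print(missing_count)
print(missing_share)
```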

In [6]:
data.isnull().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

In [7]:
data.isna().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

* Subsetting the rows that have one or more missing values

In [8]:
data.isnull().any(axis=1)
# any(axis=1) checks each row across its columns and returns True
# if at least one value in that row is missing.

0       False
1       False
2        True
3       False
4       False
        ...  
1431     True
1432     True
1433     True
1434     True
1435    False
Length: 1436, dtype: bool

In [0]:
missing = data[data.isnull().any(axis=1)]

In [10]:
data.shape

(1436, 10)

In [11]:
missing.shape

(340, 10)
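
The row-level subsetting above can be sketched on a made-up DataFrame as follows. The names `row_has_missing`, `missing_rows` and `complete_rows` are chosen for this example; negating the mask with `~` selects the rows with no missing values.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
toy = pd.DataFrame({
    "Age": [23.0, np.nan, 30.0, 26.0],
    "FuelType": ["Diesel", "Petrol", None, "Petrol"],
    "Price": [13500, 13750, 13950, 14950],
})

row_has_missing = toy.isna().any(axis=1)  # one Boolean per row
missing_rows = toy[row_has_missing]       # rows with at least one NaN
complete_rows = toy[~row_has_missing]     # rows with no NaN at all

print(missing_rows)
print(complete_rows.shape)
```

The two subsets always partition the original rows, so their lengths add up to the length of the full DataFrame.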

### Approaches to fill the missing values
* Two common approaches, depending on the type of variable:
  * For a **numerical variable**, fill the missing values with the **mean / median**
  * For a **categorical variable**, fill the missing values with the class that has the **maximum count** (the mode)
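
The two approaches can be combined in a small helper. This is only a sketch: `impute_simple` and its `numeric_strategy` parameter are hypothetical names introduced here, and the helper assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_simple(df, numeric_strategy="median"):
    """Hypothetical helper: return a copy with NaNs filled column by column."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            # Numerical variable: use the median or the mean.
            if numeric_strategy == "median":
                fill = out[col].median()
            else:
                fill = out[col].mean()
        else:
            # Categorical variable: use the most frequent class.
            fill = out[col].mode()[0]
        out[col] = out[col].fillna(fill)
    return out


# Toy data, not the Toyota dataset.
toy = pd.DataFrame({
    "KM": [10.0, 20.0, np.nan, 1000.0],
    "FuelType": ["Petrol", "Petrol", None, "Diesel"],
})
filled = impute_simple(toy)
print(filled)
```

Because the helper decides by dtype, a numerically coded category such as a 0/1 flag would be treated as numerical here and would need to be handled separately.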

### Imputing missing values
* Look at the descriptive statistics to decide whether a numerical variable should be imputed with its mean or its median; the median is usually the safer choice when outliers pull the mean away from it
* Syntax: DataFrame.describe()
* Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
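
To make the mean-versus-median comparison concrete, the sketch below uses a made-up Series with one large outlier (the values are invented for illustration). Comparing the `mean` and `50%` entries of `describe()` is one common rule of thumb, not the only possible criterion.

```python
import pandas as pd

# Made-up values with a single large outlier.
km = pd.Series([40000.0, 45000.0, 50000.0, 55000.0, 240000.0])

summary = km.describe()
mean_km = summary["mean"]   # pulled upward by the outlier
median_km = summary["50%"]  # the 50% row of describe() is the median

print(mean_km, median_km)
```

Here the mean lies well above the median, which suggests the median is the more representative fill value for this column.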

In [12]:
data.describe()

Unnamed: 0,Price,Age,KM,HP,MetColor,Automatic,CC,Weight
count,1436.0,1336.0,1421.0,1430.0,1286.0,1436.0,1436.0,1436.0
mean,10730.824513,55.672156,68647.239972,101.478322,0.674961,0.05571,1566.827994,1072.45961
std,3626.964585,18.589804,37333.023589,14.768255,0.468572,0.229441,187.182436,52.64112
min,4350.0,1.0,1.0,69.0,0.0,0.0,1300.0,1000.0
25%,8450.0,43.0,43210.0,90.0,0.0,0.0,1400.0,1040.0
50%,9900.0,60.0,63634.0,110.0,1.0,0.0,1600.0,1070.0
75%,11950.0,70.0,87000.0,110.0,1.0,0.0,1600.0,1085.0
max,32500.0,80.0,243000.0,192.0,1.0,1.0,2000.0,1615.0


### Imputing missing values of **Age**
* Calculating the mean value of the **Age** variable

In [13]:
data['Age'].mean()

55.67215568862275

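* To fill NA/NaN values with a specified value
* Syntax: Series.fillna(value), called here on a single column of the DataFrame

In [0]:
# Assign the result back: recent pandas versions warn about inplace=True on a selected column and may leave data unchanged
data['Age'] = data['Age'].fillna(data['Age'].mean())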
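
The fill step can be checked on a made-up column (the DataFrame `toy` is an assumption for this example). The mean is computed over the non-missing values only, and `fillna` returns a new Series that is assigned back to the column.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan]})

age_mean = toy["Age"].mean()              # NaN values are skipped: (20 + 40) / 2
toy["Age"] = toy["Age"].fillna(age_mean)  # fillna returns a new Series

print(toy)
```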

### Imputing missing values of **KM**
* Calculating the median value of the **KM** variable

In [15]:
data['KM'].median()

63634.0

* To fill NA/NaN values with a specified value
* Syntax: Series.fillna(value)

In [0]:
# Assign the result back instead of using inplace=True on a selected column
data['KM'] = data['KM'].fillna(data['KM'].median())

### Imputing missing values of **HP**
* Calculating the mean value of the **HP** variable

In [17]:
data['HP'].mean()

101.47832167832168

* To fill NA/NaN values with a specified value
* Syntax: Series.fillna(value)

In [0]:
# Assign the result back instead of using inplace=True on a selected column
data['HP'] = data['HP'].fillna(data['HP'].mean())

* Check for missing data after filling values

In [19]:
data.isna().sum()

Price          0
Age            0
KM             0
FuelType     100
HP             0
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64
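
As an alternative to filling one column at a time, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values. The sketch below uses made-up data; `fill_values` is a name chosen for this example. Columns not listed in the dictionary are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "KM": [1000.0, 3000.0, np.nan],
    "FuelType": ["Petrol", None, "Diesel"],
})

# One fill value per numerical column.
fill_values = {"Age": toy["Age"].mean(), "KM": toy["KM"].median()}
filled = toy.fillna(value=fill_values)

print(filled.isna().sum())
```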

### Imputing missing values of **FuelType**
* Syntax: Series.value_counts()
  * Returns a Series containing counts of unique values
  * The values will be in descending order so that the first element is the most frequently occurring element
  * Excludes NA values by default
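
A short sketch of these properties on a made-up Series (the values are invented for illustration). Passing `dropna=False` includes the missing entries in the counts.

```python
import pandas as pd

# Toy data, not the Toyota dataset.
fuel = pd.Series(["Petrol", "Diesel", "Petrol", None, "CNG", "Petrol", "Diesel"])

counts = fuel.value_counts()                      # descending, NaN excluded
counts_with_na = fuel.value_counts(dropna=False)  # NaN counted as its own entry
most_frequent = counts.index[0]                   # first label is the mode

print(counts)
print(most_frequent)
```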

In [20]:
data['FuelType'].value_counts()

Petrol    1177
Diesel     144
CNG         15
Name: FuelType, dtype: int64

* To get the mode value of **FuelType**

In [21]:
data['FuelType'].value_counts().index[0]

'Petrol'

* To fill NA/NaN values with the most frequent class

In [0]:
# Assign the result back instead of using inplace=True on a selected column
data['FuelType'] = data['FuelType'].fillna(data['FuelType'].value_counts().index[0])

### Imputing missing values of **MetColor**
* To get the mode value of **MetColor**

In [23]:
data['MetColor'].mode()

0    1.0
dtype: float64
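
`mode()` returns a Series rather than a single value because several values can tie for the highest count. The sketch below, on made-up data, shows both cases; taking element `[0]` picks the first (smallest) of the tied values, which is a choice made here and not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
met = pd.Series([1.0, 0.0, 1.0, np.nan, 1.0])
met_mode = met.mode()                 # one mode; NaN is ignored by default
met_filled = met.fillna(met_mode[0])  # [0] extracts the scalar fill value

tied = pd.Series(["a", "b", "a", "b"]).mode()  # two modes, returned sorted

print(met_mode)
print(tied)
```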

* To fill NA/NaN values with the mode

In [0]:
# mode() returns a Series, so [0] extracts the value; assign the result back to the column
data['MetColor'] = data['MetColor'].fillna(data['MetColor'].mode()[0])

### Checking for missing values
* Check for missing data after filling values

In [26]:
data.isnull().sum()

Price        0
Age          0
KM           0
FuelType     0
HP           0
MetColor     0
Automatic    0
CC           0
Doors        0
Weight       0
dtype: int64
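
The final check can also be reduced to a single number or a single Boolean, which is convenient in scripts. This is a sketch on made-up data; `toy_clean` and `toy_dirty` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy data, not the Toyota dataset.
toy_clean = pd.DataFrame({"Age": [20.0, 30.0], "FuelType": ["Petrol", "Diesel"]})
toy_dirty = pd.DataFrame({"Age": [20.0, np.nan], "FuelType": ["Petrol", None]})

# Sum the per-column counts to get the total number of missing cells.
total_missing_clean = int(toy_clean.isna().sum().sum())
total_missing_dirty = int(toy_dirty.isna().sum().sum())

# any().any() answers "is anything missing anywhere?"
has_missing_dirty = bool(toy_dirty.isna().any().any())

print(total_missing_clean, total_missing_dirty, has_missing_dirty)
```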

### Summary
* Identifying missing values
* Approaches to fill the missing values