In [36]:
import pandas as pd
cars_data = pd.read_csv('Toyota.csv')

In [37]:
cars_data = pd.read_csv('Toyota.csv', index_col=0, na_values = ["??", "????"] )
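The effect of `na_values` can be seen on a small in-memory sample. This is only a sketch: the three rows in `csv_text` are made up for illustration and are not taken from Toyota.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Toyota.csv
csv_text = """,Price,KM,HP
0,13500,46986,90
1,13750,??,90
2,13950,41711,????
"""

# Without na_values the placeholders stay as text, so KM and HP are not numeric
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With na_values the placeholders are parsed as NaN and the columns become float
clean = pd.read_csv(io.StringIO(csv_text), index_col=0,
                    na_values=["??", "????"])

print(raw.dtypes)
print(clean.dtypes)
print(clean.isna().sum())
```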

In [38]:
cars_data2 = cars_data.copy()

### Identifying missing values

In pandas DataFrames, missing data is represented by NaN (short for "Not a Number").

To check for null values in a DataFrame, use isnull() or isna(); the two are aliases of each other.

These functions return a DataFrame of Boolean values that are True wherever a value is NaN.

DataFrame.isna().sum(), DataFrame.isnull().sum()

These return the count of missing values in each column.
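A minimal sketch of the Boolean mask and the per-column counts, using a made-up two-column frame (`toy` is hypothetical, not the cars data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Age": [23.0, np.nan, 30.0],
    "FuelType": ["Diesel", "Petrol", None],
})

mask = toy.isna()    # same shape as toy, True where a value is missing
counts = mask.sum()  # True counts as 1, so this is the missing count per column

print(mask)
print(counts)
print(toy.isnull().equals(mask))  # isnull() gives the same mask as isna()
```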

In [39]:
cars_data2.isna().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

In [40]:
cars_data2.isnull().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

#### Subsetting the rows that have one or more missing values

In [41]:
missing = cars_data2[cars_data2.isnull().any(axis = 1)]

In [42]:
missing

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
6,16900,27.0,,Diesel,,,0,2000,3,1245
7,18600,30.0,75889.0,,90.0,1.0,0,2000,3,1245
9,12950,23.0,71138.0,Diesel,,,0,1900,3,1105
15,22000,28.0,18739.0,Petrol,,0.0,0,1800,3,1185
...,...,...,...,...,...,...,...,...,...,...
1428,8450,72.0,,Petrol,86.0,,0,1300,3,1015
1431,7500,,20544.0,Petrol,86.0,1.0,0,1300,3,1025
1432,10845,72.0,,Petrol,86.0,0.0,0,1300,3,1015
1433,8500,,17016.0,Petrol,86.0,0.0,0,1300,3,1015
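The row-wise selection above can be sketched on a small made-up frame. `any(axis=1)` reduces the Boolean mask across columns, so a row is kept if at least one of its cells is missing; the names `toy`, `row_has_nan`, `missing_rows` and `complete_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, rows 1-3 each miss something
toy = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950],
    "KM": [46986.0, np.nan, 41711.0, np.nan],
    "FuelType": ["Diesel", "Diesel", None, None],
})

row_has_nan = toy.isnull().any(axis=1)  # one Boolean per row
missing_rows = toy[row_has_nan]         # rows with one or more NaN
complete_rows = toy[~row_has_nan]       # rows with no NaN at all

print(len(missing_rows), len(complete_rows))
# The per-column counts can add up to more than the number of affected rows,
# because a single row may have several missing cells
print(toy.isnull().sum().sum())
```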


### Approach to fill the missing values

There are two approaches, depending on the type of variable:

• For a numerical variable, fill the missing values with the mean or median

• For a categorical variable, fill the missing values with the class that has the maximum count (the mode)

Imputing missing values

• Look at the summary statistics to decide whether a numerical variable should be imputed with its mean or its median; when extreme values pull the mean away from the median, the median is usually the more robust choice
    
    DataFrame.describe()

• Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

In [43]:
cars_data2.describe()

Unnamed: 0,Price,Age,KM,HP,MetColor,Automatic,CC,Weight
count,1436.0,1336.0,1421.0,1430.0,1286.0,1436.0,1436.0,1436.0
mean,10730.824513,55.672156,68647.239972,101.478322,0.674961,0.05571,1566.827994,1072.45961
std,3626.964585,18.589804,37333.023589,14.768255,0.468572,0.229441,187.182436,52.64112
min,4350.0,1.0,1.0,69.0,0.0,0.0,1300.0,1000.0
25%,8450.0,43.0,43210.0,90.0,0.0,0.0,1400.0,1040.0
50%,9900.0,60.0,63634.0,110.0,1.0,0.0,1600.0,1070.0
75%,11950.0,70.0,87000.0,110.0,1.0,0.0,1600.0,1085.0
max,32500.0,80.0,243000.0,192.0,1.0,1.0,2000.0,1615.0


Statistical summary of data
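As a rough illustration of why the summary matters, the sketch below uses five made-up mileage values with one extreme reading (243000, the KM maximum reported above). The single large value drags the mean well above the median, which is the situation where the median is generally preferred for imputation.

```python
import pandas as pd

# Hypothetical mileage values; only the last one is extreme
km = pd.Series([40000, 45000, 50000, 55000, 243000])

print(km.mean())    # 86600.0, pulled upward by the extreme value
print(km.median())  # 50000.0, unaffected by it
```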

#### Imputing missing values of ‘Age’

Calculating the mean value of the Age variable

In [44]:
cars_data2["Age"].mean()

55.67215568862275

To fill NA/NaN values with a specified value

DataFrame.fillna()

In [45]:
cars_data2["Age"] = cars_data2["Age"].fillna(cars_data2["Age"].mean())

#### Imputing missing values of ‘KM’

Calculating the median value of the KM variable

In [46]:
cars_data2["KM"].median()

63634.0

To fill NA/NaN values with a specified value

DataFrame.fillna()

In [47]:
cars_data2["KM"] = cars_data2["KM"].fillna(cars_data2["KM"].median())

#### Imputing missing values of ‘HP’

• Filling the missing values with the mean of the HP variable

In [48]:
cars_data2["HP"] = cars_data2["HP"].fillna(cars_data2["HP"].mean())
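Several numerical columns can also be filled in one call by passing a dict of column-to-value pairs to `DataFrame.fillna()`. The sketch below assumes a made-up frame (`toy`) and returns a new frame rather than modifying the original, which avoids relying on `inplace=True` on a selected column (newer pandas versions warn about that pattern).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each numerical column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "KM": [1000.0, 3000.0, np.nan],
})

# Map each column to its own fill value: mean for Age, median for KM
fill_values = {"Age": toy["Age"].mean(), "KM": toy["KM"].median()}
filled = toy.fillna(fill_values)  # returns a new DataFrame

print(filled)
```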

#### Check for missing data after filling values

In [49]:
cars_data2.isnull().sum()

Price          0
Age            0
KM             0
FuelType     100
HP             0
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

#### Imputing missing values of ‘FuelType’

Series.value_counts()

• Returns a Series containing counts of unique values

• The values will be in descending order so that the first element is the most frequently occurring element

• Excludes NA values by default

In [50]:
cars_data2["FuelType"].value_counts()

Petrol    1177
Diesel     144
CNG         15
Name: FuelType, dtype: int64

#### To get the mode value of FuelType

In [51]:
cars_data2["FuelType"].value_counts().index[0]

'Petrol'

#### To fill NA/NaN values with a specified value

DataFrame.fillna()

In [52]:
cars_data2["FuelType"] = cars_data2["FuelType"].fillna(
    cars_data2["FuelType"].value_counts().index[0])

#### Imputing missing values of ‘MetColor’

To get the mode value of MetColor

In [53]:
cars_data2["MetColor"].mode()

0    1.0
dtype: float64

To fill NA/NaN values with a specified value

In [54]:
cars_data2["MetColor"] = cars_data2["MetColor"].fillna(
    cars_data2["MetColor"].mode()[0])
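The two ways of finding the most frequent class used above, `value_counts().index[0]` and `mode()[0]`, can be compared on a small made-up Series. Both ignore missing values by default. When there is a clear winner they agree; when two classes tie, they may not pick the same one, so this sketch deliberately avoids ties.

```python
import pandas as pd

# Hypothetical fuel types with two missing entries
fuel = pd.Series(["Petrol", "Diesel", None, "Petrol", "CNG", None])

by_counts = fuel.value_counts().index[0]  # first label of the sorted counts
by_mode = fuel.mode()[0]                  # first element of the mode Series

filled = fuel.fillna(by_mode)             # returns a new Series

print(by_counts, by_mode)
print(filled.tolist())
```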

#### Check for missing data after filling values

In [55]:
cars_data2.isnull().sum()

Price        0
Age          0
KM           0
FuelType     0
HP           0
MetColor     0
Automatic    0
CC           0
Doors        0
Weight       0
dtype: int64

#### Imputing missing values using lambda functions

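To fill the NA/NaN values in both numerical and categorical variables in one step, apply a lambda to every column: float columns are filled with their mean and all other columns with their most frequent value. Note that MetColor, a 0/1 flag stored as float, is therefore filled with its mean here rather than its mode.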

In [56]:
cars_data3 = cars_data.copy()

In [57]:
cars_data3 = cars_data3.apply(
    lambda x: x.fillna(x.mean()) if x.dtype == "float"
    else x.fillna(x.value_counts().index[0]))

In [58]:
cars_data3.isnull().sum()

Price        0
Age          0
KM           0
FuelType     0
HP           0
MetColor     0
Automatic    0
CC           0
Doors        0
Weight       0
dtype: int64
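The same column-wise idea can be written as a named function, which is sometimes easier to read and test than a lambda. The sketch below is one possible variant, not the notebook's method: `impute_column` is a hypothetical helper, it uses `pandas.api.types.is_numeric_dtype` instead of comparing the dtype to `"float"`, and it runs on a made-up frame.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def impute_column(col):
    """Hypothetical helper: mean for numeric columns, mode otherwise."""
    if is_numeric_dtype(col):
        return col.fillna(col.mean())
    return col.fillna(col.mode()[0])

# Hypothetical toy frame with one numerical and one categorical column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "FuelType": ["Petrol", None, "Petrol"],
})

imputed = toy.apply(impute_column)  # the helper is called once per column

print(imputed)
print(imputed.isnull().sum())
```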