## Handling Missing Values in Data

* Missing data, or missing values, occur when no value is stored for a variable in an observation. Missing data are common and can have a significant effect on the conclusions that can be drawn from the data.

&nbsp;
* **```Missing data present various problems```**

     * First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. 
     * Second, the lost data can cause bias in the estimation of parameters. 
     * Third, it can reduce the representativeness of the samples.
     
     &nbsp;
     
* **```Common ways to Handle Missing Data```**

     * ```Encoding``` the null values with a sentinel such as ```-1 or -9999```.
     * ```Deleting``` the missing data: casewise deletion drops the rows (cases) that contain missing values, and when more than 60% of a column is missing it is often better to remove that column completely.
     * Replacing missing values with the mean, median, or mode of the feature in which they occur.
         * Use the ```Mean``` to impute missing values when the variable is continuous and has few or no outliers.
         * Use the ```Median``` to impute missing values when the variable is continuous and has a large number of outliers.
         * Use the ```Mode``` to impute missing values when the variable is categorical.
     * Using ```Predictive Models``` to impute the missing values.
     * Using the missing values to create a new feature, i.e., ```Feature Engineering```.
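
The imputation choices listed above can be illustrated on a small toy dataframe. This is only a minimal sketch: the column names `salary` and `dept` and all of the values are invented for illustration and do not come from the datasets used in this tutorial.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values): one extreme salary acts as an outlier
toy = pd.DataFrame({
    'salary': [30.0, 32.0, 31.0, np.nan, 500.0],
    'dept': ['Sales', 'Sales', None, 'HR', 'Sales'],
})

# feature engineering: remember which rows were missing before imputing
toy['salary_was_missing'] = toy['salary'].isnull()

# the mean is pulled up by the outlier, the median is not
mean_value = toy['salary'].mean()      # (30 + 32 + 31 + 500) / 4 = 148.25
median_value = toy['salary'].median()  # median of 30, 31, 32, 500 = 31.5

# median for the skewed continuous column, mode for the categorical column
toy['salary'] = toy['salary'].fillna(median_value)
toy['dept'] = toy['dept'].fillna(toy['dept'].mode()[0])

print(toy)
```

With a single outlier in the column, the mean (148.25) is far from any typical salary, while the median (31.5) stays close to the bulk of the values, which is why the median is the safer fill here.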

In [2]:
# importing the basic libraries
import numpy as np
import pandas as pd

In [3]:
# lets import a dataset to perform the operations
data = pd.read_csv('Datasets/employee.csv')

# lets check the shape of the dataset
data.shape

(1470, 35)

In [5]:
# lets check the structure of the dataset
data.head(3)

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |


### Functions that help us identify null values in dataframes

#### Functions that tell us whether a value is null
* ```isnull()```: Indicates the presence of missing values in a dataset and returns boolean values, i.e., it returns True where the data is missing and False where it is not.
* ```notnull()```: The exact opposite of ```isnull()```; it also returns boolean values, True where the data is present and False where it is missing.

#### Functions that help us check whether some or all of the data is missing
* ```any()```: Returns True if at least one value along the axis is True, so ```isnull().any()``` flags the columns (or rows) in which some data is missing.
* ```all()```: Returns True only if every value along the axis is True, so ```isnull().all()``` flags the columns (or rows) in which all of the data is missing.
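
As a rough illustration of how `isnull()` combines with `any()` and `all()`, the sketch below uses a tiny made-up dataframe (the columns `a`, `b`, and `c` are hypothetical) in which one column is partly missing, one is entirely missing, and one is complete.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],        # partly missing
    'b': [np.nan, np.nan, np.nan],  # entirely missing
    'c': [7.0, 8.0, 9.0],           # complete
})

mask = toy.isnull()                     # True where a value is missing
cols_with_any_null = mask.any(axis=0)   # at least one missing value per column
cols_all_null = mask.all(axis=0)        # every value missing per column
rows_all_null = mask.all(axis=1).sum()  # number of completely empty rows

print(cols_with_any_null)
print(cols_all_null)
print(rows_all_null)
```

Here `any()` flags both `a` and `b`, `all()` flags only `b`, and no row is completely empty because column `c` has no missing values.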

In [7]:
# lets first check if there is any null value present in the data
data.isnull().sum().sum()

0

**We can see that there are no missing values in the dataset.**

* ```isnull()``` returns True if the value in a cell is missing and False if data is present in the cell.
&nbsp;

* ```isnull().sum()``` returns the number of null values present in each of the columns.
&nbsp;

* ```isnull().sum().sum()``` returns the total number of null or missing values present in the whole dataframe.
&nbsp;

* In the result above the total is 0, which means that no column in the dataframe contains null data.


In [18]:
# let's load another dataset and try to handle the missing values in it

data = pd.read_csv('Datasets/melbourne.csv')
data.head(3)

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 03-09-2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 03-12-2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 04-02-2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |


In [19]:
# lets check the basic information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23547 entries, 0 to 23546
Data columns (total 21 columns):
Suburb           23547 non-null object
Address          23547 non-null object
Rooms            23547 non-null int64
Type             23547 non-null object
Price            18396 non-null float64
Method           23547 non-null object
SellerG          23547 non-null object
Date             23547 non-null object
Distance         23546 non-null float64
Postcode         23546 non-null float64
Bedroom2         19066 non-null float64
Bathroom         19063 non-null float64
Car              18921 non-null float64
Landsize         17410 non-null float64
BuildingArea     10018 non-null float64
YearBuilt        11540 non-null float64
CouncilArea      15656 non-null object
Lattitude        19243 non-null float64
Longtitude       19243 non-null float64
Regionname       23546 non-null object
Propertycount    23546 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 3.8+ M

In [30]:
# let's check whether this dataset has any null values

# first, let's check how many rows consist entirely of null values
data.isnull().all(axis = 1).sum()

0

* The returned value is the number of rows in which every value is missing; it is 0, so no row is completely empty.
* In total, however, 66,918 data points are missing (the value returned by ```data.isnull().sum().sum()```), which is huge.
* Let's also check the details of the missing values.
* In the detailed report we will be able to check how many missing values are associated with each of the columns.

In [14]:
# let's check the number of missing values in the dataset
# we check the missing values column by column
data.isnull().sum(axis = 0)

Suburb               0
Address              0
Rooms                0
Type                 0
Price             5151
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          4481
Bathroom          4484
Car               4626
Landsize          6137
BuildingArea     13529
YearBuilt        12007
CouncilArea       7891
Lattitude         4304
Longtitude        4304
Regionname           1
Propertycount        1
dtype: int64

* We can see that many columns have null values; in fact, some columns have a huge number of missing values.
* Let's also check the percentage of missing values present in each column.



### How to Handle Missing Values from a Dataframe

* **We have already discussed the ways to handle missing values in a dataset, but broadly speaking there are only two major choices:**
    * To ```drop/delete``` the missing values
    * To ```impute``` the missing values
    
* **We have to decide which of the two options to go with depending on the missing values**.
    * We can drop/delete the missing values only when the missing percentage is more than 50% or the data is not so important.
    * Otherwise, we have to impute the values using one of the methods listed in the introduction.
    
Let's take some examples using the available dataset to see how to impute or delete the missing values.
    

### Methods we will use to Drop or Impute Missing Values

* There are many ways to handle missing values; two popularly used pandas functions are:
    * ```dropna()```: This function drops the missing values from a dataset and returns the rest of the dataframe.
    * ```fillna()```: This function replaces the missing values with a specified value.
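
A minimal sketch of both functions on a made-up dataframe follows; the columns `price`, `rooms`, and `suburb` and their values are hypothetical and are only meant to show the most common call patterns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'price': [100.0, np.nan, 300.0, 400.0],
    'rooms': [2.0, 3.0, np.nan, 4.0],
    'suburb': ['A', 'B', 'C', 'D'],
})

rows_complete = toy.dropna()                    # drop rows with any missing value
rows_with_price = toy.dropna(subset=['price'])  # drop rows only where price is missing
cols_complete = toy.dropna(axis=1)              # drop columns with any missing value

# fill each column with its own replacement value
filled = toy.fillna({'price': toy['price'].mean(), 'rooms': 0})

print(rows_complete)
print(rows_with_price)
print(cols_complete)
print(filled)
```

Both functions return a new dataframe by default, so the original `toy` is left unchanged unless the result is assigned back.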

In [35]:
# lets check the percentage of missing values in the columns of the dataset

percentage_of_missing_values = round(100*(data.isnull().sum()/len(data.index)), 2)
print(percentage_of_missing_values)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
BuildingArea     57.46
YearBuilt        50.99
CouncilArea      33.51
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64


<p style="color: green;">The results above show that many columns have a large percentage of their data missing. Some columns have ```18% to 26%``` of their values missing, and others have ```33% to 57%``` of their values missing.</p>

* What action should we take: should we remove them all, or impute?
    * We ```cannot remove the columns having 18% to 26% missing values```, so in this case we will have to impute the missing values using one of the methods.
        * It is very important to keep these columns, because removing columns that have only 18% to 26% of their values missing would cause a huge loss of data, and we might lose some very important and relevant data for the analysis and results.
    
    * But ```we can remove the columns having 33% to 57% of missing values```, so in this case we will drop these columns from the data.
        * It is very important to remove these columns, because imputing so many missing values can introduce a huge bias into the data, which is bad for the analysis and might also produce inappropriate and misleading results.
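
The decision rule above can be sketched as a small helper. Note that `drop_sparse_columns` and the 30% cut-off are assumptions made for this illustration (the tutorial only says that 18% to 26% is kept and 33% or more is dropped), and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=30.0):
    """Return a copy of df without the columns whose missing percentage exceeds the threshold."""
    missing_pct = 100 * df.isnull().mean()  # percentage of missing values per column
    to_drop = missing_pct[missing_pct > max_missing_pct].index
    return df.drop(columns=to_drop)

toy = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0, np.nan],         # 25% missing, kept
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],  # 75% missing, dropped
    'full': [1, 2, 3, 4],                           # 0% missing, kept
})

reduced = drop_sparse_columns(toy, max_missing_pct=30.0)
print(list(reduced.columns))
```

Keeping the threshold as a parameter makes it easy to try a stricter or looser cut-off before committing to one.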

### Deleting the Columns using drop() Function

In [42]:
# lets check the documentation of drop function
help(pd.DataFrame.drop)

Help on function drop in module pandas.core.frame:

drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
    Drop specified labels from rows or columns.
    
    Remove rows or columns by specifying label names and corresponding
    axis, or by specifying directly index or column names. When using a
    multi-index, labels on different levels can be removed by specifying
    the level.
    
    Parameters
    ----------
    labels : single label or list-like
        Index or column labels to drop.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Whether to drop labels from the index (0 or 'index') or
        columns (1 or 'columns').
    index : single label or list-like
        Alternative to specifying axis (``labels, axis=0``
        is equivalent to ``index=labels``).
    
        .. versionadded:: 0.21.0
    columns : single label or list-like
        Alternative to specifying axis (``labels, axis=1``
        is equivalen

In [36]:
# removing the three columns having 33% to 57% of missing values in the dataset
data = data.drop(['BuildingArea', 'YearBuilt', 'CouncilArea'], axis=1)

# lets check the columns left after the deletion of above listed columns
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [44]:
# count the number of rows having > 5 missing values
# use len(df.index)
len(data[data.isnull().sum(axis=1) > 5].index)

4278

In [46]:
# retaining the rows having <= 5 NaNs
data = data[data.isnull().sum(axis=1) <= 5]
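
The same row filter can also be written with the `thresh` parameter of `dropna()`, which keeps the rows that have at least that many non-null values. The sketch below checks the equivalence on a made-up dataframe; the columns `a` to `d` and the limit of one missing value per row are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, np.nan, np.nan],
    'd': [1.0, 2.0, 3.0, 4.0],
})

max_missing = 1  # keep rows with at most this many missing values

# boolean-mask version, as used in the cell above
by_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna version: require at least (number of columns - max_missing) non-null values
by_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(by_mask.equals(by_thresh))
```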

In [48]:
# look at the summary again
round(100*(data.isnull().sum()/len(data.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.71
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2          1.05
Bathroom          1.07
Car               1.81
Landsize          9.65
Lattitude         0.13
Longtitude        0.13
Regionname        0.00
Propertycount     0.00
dtype: float64

* We can see that the ```Price``` column still has a large number of missing values, i.e., about 21% of its data is missing. Imputing these values would introduce heavy bias into the data and the results would be misleading, so it is better to discard the rows where Price is missing.

In [49]:
# removing all the rows where there is a missing value in the Price Column
data = data[~np.isnan(data['Price'])]

In [51]:
# look at the summary again
round(100*(data.isnull().sum()/len(data.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         1.05
Bathroom         1.07
Car              1.76
Landsize         9.83
Lattitude        0.15
Longtitude       0.15
Regionname       0.00
Propertycount    0.00
dtype: float64

* Now Landsize is the column with the highest percentage of missing values. Let's check the range of values present in the Landsize column so that we can decide how to handle its missing values.

In [52]:
# lets check the description of values in the Landsize Column
data['Landsize'].describe()

count     13603.000000
mean        558.116371
std        3987.326586
min           0.000000
25%         176.500000
50%         440.000000
75%         651.000000
max      433014.000000
Name: Landsize, dtype: float64

* The description above summarizes the values in the Landsize column through the mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

* We can see that the minimum value is 0 and the maximum value is 433014. The spread is huge (the standard deviation is about seven times the mean), so filling with the mean or the median would not represent the missing entries well.

* Also, there is a large difference between the 25th, 50th, and 75th percentiles, indicating that if we impute the values for the Landsize column we will most probably introduce bias into the data.

* It is most appropriate to remove the rows where we find null values for Landsize.
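
One way to make the "huge difference" argument more concrete is to compare the mean with the median and to count the values above the usual 1.5 × IQR fence. The IQR rule is a common convention brought in for this sketch, not something the tutorial itself uses, and the land sizes below are invented.

```python
import pandas as pd

# hypothetical land sizes with one extreme value
sizes = pd.Series([150.0, 200.0, 450.0, 600.0, 700.0, 40000.0])

q1, q3 = sizes.quantile(0.25), sizes.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr          # values above this are treated as outliers
outliers = sizes[sizes > upper_fence]

mean_value = sizes.mean()             # pulled far upwards by the extreme value
median_value = sizes.median()         # stays near the typical values

print(outliers.tolist())
print(mean_value, median_value)
```

A mean that is many times larger than the median, as in this toy series, is a sign that a single fill value will misrepresent most of the missing entries.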

In [53]:
# removing all the rows where there is a null value in the Landsize column
data = data[~np.isnan(data['Landsize'])]

# lets check the summary again
round(100*(data.isnull().sum()/len(data.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.46
Landsize         0.00
Lattitude        0.16
Longtitude       0.16
Regionname       0.00
Propertycount    0.00
dtype: float64

* We can see that we now have a very low fraction of missing values in the columns Bathroom, Car, Lattitude, and Longtitude.
* Let's check the range of values of Lattitude and Longtitude.

In [55]:
# describing the values for lattitude and longitude
data[['Lattitude','Longtitude']].describe()

| | Lattitude | Longtitude |
|---|---|---|
| count | 13581.0 | 13581.0 |
| mean | -37.809204 | 144.995221 |
| std | 0.079257 | 0.103913 |
| min | -38.18255 | 144.43181 |
| 25% | -37.85682 | 144.9296 |
| 50% | -37.80236 | 145.0001 |
| 75% | -37.7564 | 145.05832 |
| max | -37.40853 | 145.52635 |


### Imputing using fillna() method

In [57]:
help(pd.DataFrame.fillna)

Help on function fillna in module pandas.core.frame:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use next valid observation to fill gap.
    axis : {0 or 'index', 1 or 'columns'}
        Axis along which to fill missing values.
    inplace : bool, default False
        If T

In [56]:
# there is only a minute difference between the min, mean, max, and the 25th, 50th, and 75th percentiles,
# so we can use the mean to impute the values in Lattitude and Longtitude.
# the result is assigned back, because fillna(inplace=True) on a selected column may not update the dataframe in newer pandas versions

data['Lattitude'] = data['Lattitude'].fillna(data['Lattitude'].mean())
data['Longtitude'] = data['Longtitude'].fillna(data['Longtitude'].mean())
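
When several numeric columns are filled with their own means, the two calls can also be collapsed into one by passing the Series of column means to `fillna()`. This is only a sketch on invented coordinates; the column names mirror the dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Lattitude': [-37.80, np.nan, -37.82],
    'Longtitude': [144.99, 145.01, np.nan],
})

coord_cols = ['Lattitude', 'Longtitude']

# toy[coord_cols].mean() is a Series indexed by column name,
# so each column is filled with its own mean
toy[coord_cols] = toy[coord_cols].fillna(toy[coord_cols].mean())

print(toy)
```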

In [58]:
# lets check the values present in the Bathroom Column

data['Bathroom'].value_counts()

1.0    7517
2.0    4987
3.0     921
4.0     106
0.0      34
5.0      28
6.0       5
8.0       2
7.0       2
Name: Bathroom, dtype: int64

In [59]:
# lets also check the values present in the Car Column

data['Car'].value_counts()

2.0     5606
1.0     5515
0.0     1026
3.0      748
4.0      507
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64

In [61]:
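# Bathroom and Car take only a few discrete values, so we treat them like categorical columns
# and impute them using the mode; the result is assigned back instead of using inplace=True

data['Bathroom'] = data['Bathroom'].fillna(data['Bathroom'].mode()[0])
data['Car'] = data['Car'].fillna(data['Car'].mode()[0])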

In [62]:
# lets check the missing values present in the data

data.isnull().sum().sum()

0

We can see that there are no null values left in the dataset, i.e., we are done with this tutorial on handling missing values in data.