# A Comprehensive Guide to Handling Missing Values Effectively in Messy Data

*You drop missing data, but how do you know you did it effectively?*

Real-world datasets are messy and, in many cases, full of missing values. Missing values are among the most common problems that Data Scientists and Machine Learning Engineers deal with day to day, and it is not straightforward to know the right strategy. Often, we just drop them.

There are many ways to deal with missing values, but there is no one-size-fits-all strategy. The right strategy depends on the dataset, its size (the number of examples you have), the share of missing values in the affected features, what can be tolerated, and so on. Choosing a good strategy helps us provide accurate insights and keeps us from communicating wrong information. It can also save time we would otherwise spend tuning the model, since good models come from good data.

In this tutorial, I will walk through the common ways to handle missing values. Let's get started!

## Loading data

Let's first import the tools that we will need throughout the tutorial.

In [None]:
import numpy as np  # for maths and scientific computations
import pandas as pd  # for data manipulation
import seaborn as sns  # for simple visualization
import matplotlib.pyplot as plt  # for visualization
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
housing= pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')

In [None]:
housing.head()

In [None]:
housing.isna().sum()

We only have missing values in `total_bedrooms`. It would be better for this experiment if several features had missing values, but let's go on.
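
Raw counts are easier to judge next to the size of the dataset. Here is a minimal sketch of how you could express missing values as a percentage per column. The `toy` DataFrame and its values are made up for illustration and are not part of the housing data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "rooms": [5.0, 7.0, np.nan, 4.0],
    "bedrooms": [2.0, np.nan, np.nan, 1.0],
    "age": [10, 25, 31, 8],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean().mul(100).round(1)

# Show the columns with the largest share of missing values first.
print(missing_share.sort_values(ascending=False))
```

The same one-liner should work on `housing` as well, which helps when deciding whether dropping rows is affordable.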

Now that we know which feature has missing values, let's do some analysis around it and then go through the techniques to handle these values, while trying to understand what we are doing.

## Exploratory Data Analysis

In [None]:
sns.pairplot(housing, vars=['total_bedrooms', 'total_rooms','households','housing_median_age' ])

While our goal is not to explore every feature, you can see that `total_bedrooms` correlates with `total_rooms` and `households`. That makes sense: bedrooms are counted in total rooms, and both grow with the number of households in a district (`households`).

In [None]:
plt.figure(figsize=(10,8))
housing['total_bedrooms'].plot(kind='hist')
plt.xlabel('Total bedrooms')

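You can see that most districts have between 0 and 1000 total bedrooms. Very few have more than 1000. (Each row describes a housing district rather than a single house, which is why the counts are so large.)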

In [None]:
print("Recall that the missing values in our dataset are: \n \n {}".format(housing.isna().sum()))

## Handling Missing Values

There are a number of methods to handle missing values, but basically, everything we can do falls into one of the following:

* Removing the missing values
* Filling the missing values
* Leaving the missing values as they are.


Starting with the first one: removing the missing values is fairly simple. You can do the following, and you are done.

## 1. Removing the missing values

### A. Removing Missing values completely

In [None]:
housing_df=housing.copy()

housing_cleaned=housing.dropna()

In [None]:
housing_cleaned.isnull().sum()

# Same as housing_cleaned.isna().sum()

Now we are done, but we lost data :). Using the above method, we removed every row that has a missing value in any feature; here that means the rows where `total_bedrooms` is missing. You can see in the results that we no longer have any missing values in `total_bedrooms`.

### B. Removing missing values by a condition

What if we wanted to remove the missing values by a condition? Say you only want to remove the columns that contain missing values. Since only one column has missing values here, only that column will be dropped, but it is something you can try on your end with a different dataset.

In [None]:
housing_cleaned_2=housing.dropna(axis='columns')

In [None]:
housing_cleaned_2.isnull().sum()

`total_bedrooms` is now removed. This can, however, lead to the loss of data that could be meaningful even though it contains NaNs. If you want more control, you can use `thresh` to specify the minimum number of non-missing values a column or row must have in order to be kept.

In [None]:
len(housing)

In [None]:
housing_cleaned_3=housing.dropna(axis='columns', thresh=200)

In [None]:
housing_cleaned_3.isnull().sum()

What we did above was keep any column that has at least 200 non-missing values. Every column meets that bar, so nothing was removed and `total_bedrooms` still shows its missing values. Let's see what happens if we change `thresh`.

In [None]:
housing_cleaned_3=housing.dropna(axis='columns', thresh=20600)

In [None]:
housing_cleaned_3.isnull().sum()

What we did above was say: remove any column that doesn't have at least 20600 non-missing values. Since `total_bedrooms` is such a column, it was removed. It has only `20640-207=20433` non-missing values.
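
To make the meaning of `thresh` easier to see, here is a small sketch on a made-up DataFrame (the `toy` frame and its column names are illustrative only). Each column has a different number of non-missing values, so you can watch columns drop out as the threshold rises.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 4 non-missing values
    "b": [1.0, np.nan, 3.0, 4.0],        # 3 non-missing values
    "c": [np.nan, np.nan, np.nan, 4.0],  # 1 non-missing value
})

# Keep only columns with at least 3 non-missing values: "c" is dropped.
kept_3 = toy.dropna(axis="columns", thresh=3)

# Raising the bar to 4 also drops "b".
kept_4 = toy.dropna(axis="columns", thresh=4)

print(list(kept_3.columns), list(kept_4.columns))
```

In other words, `thresh` counts the values that are present, not the ones that are missing.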

Another interesting thing to try is the `how` parameter, which determines whether a row or column is removed when it has at least one missing value or only when all of its values are missing.

When:

* `how` is set to `any`, remove any column or row which has any missing value

* `how` is set to `all`, remove a column or row if all values are missing. 

In [None]:
housing_cleaned_4=housing.dropna(axis='rows', how='any')

# Remove all rows that contain missing values. The 207 affected rows are removed, leaving 20433.

In [None]:
len(housing_cleaned_4)

In [None]:
housing_cleaned_4.isnull().sum()
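
The housing data has no row where every value is missing, so `how='all'` would not remove anything there. As a rough illustration of the difference, here is a sketch on a made-up frame (the `toy` data is an assumption for demonstration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})
# Row 0 is complete, row 1 is partly missing, row 2 is entirely missing.

# how="any": drop a row as soon as one value is missing (only row 0 survives).
any_dropped = toy.dropna(axis="index", how="any")

# how="all": drop a row only when every value is missing (only row 2 is dropped).
all_dropped = toy.dropna(axis="index", how="all")

print(list(any_dropped.index), list(all_dropped.index))
```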


## 2. Filling the missing values

### A. Constant or Number Fill

With Pandas, filling the missing values is very straightforward. Here is how you can fill any missing value with a given number.

In [None]:
housing_filled=housing.fillna(3)

In [None]:
housing_filled.isnull().sum()
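
A single number applied to every column is rarely what you want when features have different meanings. One possible refinement, sketched here on made-up data, is to pass `fillna` a dictionary that maps each column to its own fill value. The column names and the `"UNKNOWN"` label are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bedrooms": [2.0, np.nan, 3.0],
    "proximity": ["INLAND", None, "NEAR BAY"],
})

# Each key is a column name, each value is the constant used for that column.
filled = toy.fillna({"bedrooms": 0, "proximity": "UNKNOWN"})

print(filled)
```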

### B. Forward and Backward Fill

You could also use forward fill (`ffill`) or backward fill (`bfill`), where each missing value is filled with the value that precedes it or the value that follows it, respectively.

In [None]:
housing_filled=housing.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas

#housing_filled=housing.bfill()

In [None]:
housing_filled[2820:2830]

The downside of this is that it can mislead. Take index 2826 as an example: the row with `total_rooms` of 154 has `total_bedrooms` of 522, which is impossible because bedrooms are counted in total rooms.
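
To see exactly which neighbor each method copies from, here is a tiny sketch on a made-up Series. It also shows `limit`, which caps how many consecutive gaps get filled and is one way to keep a value from being carried too far.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()         # copies the last seen value downward
backward = s.bfill()        # copies the next valid value upward
limited = s.ffill(limit=1)  # fills at most one consecutive gap

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "ffill_limit_1": limited}))
```

Neither method looks at the other columns of the row, which is why the filled value can contradict them.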


### C. Mean or Median Imputation

Another strategy that you may want to use is filling the missing values with the mean or median of the values in a feature.

In this case, we will use Scikit-Learn's `SimpleImputer` to handle this.

In [None]:
from sklearn.impute import SimpleImputer

housing_numeric=housing.drop('ocean_proximity', axis=1) 
#The mean strategy only works with numeric features, so we drop ocean_proximity

#np.nan is used because the np.NaN alias was removed in NumPy 2.0
mean_fill=SimpleImputer(missing_values=np.nan,strategy='mean')

mean_fill.fit(housing_numeric)

In [None]:
mean_filled=mean_fill.transform(housing_numeric)

In [None]:
mean_filled=pd.DataFrame(mean_filled, columns=housing_numeric.columns)

In [None]:
mean_filled.head()

You can do the same thing with the median: just replace `mean` with `median` in the `SimpleImputer` call above. It will be `median_fill=SimpleImputer(missing_values=np.nan,strategy='median')`.

You can also fill the missing values with the most frequent value in the feature by passing `most_frequent` as the strategy. It will look like: `most_frequent_fill=SimpleImputer(missing_values=np.nan,strategy='most_frequent')`.

The last thing about `SimpleImputer` is that you can use it to replace all missing values with a constant value. You only need to say: `constant_fill=SimpleImputer(missing_values=np.nan,strategy='constant')`, optionally passing `fill_value` to choose the constant (numeric data is filled with 0 when it is not set).
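
To compare the strategies side by side, here is a minimal sketch on a made-up single-feature array with one missing entry. The values and the `fill_value` of -1 are assumptions chosen so the strategies give visibly different answers.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with a single missing entry (row 3) and one large outlier (10).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that ends up in the previously missing cell.
    results[strategy] = imputer.fit_transform(X)[3, 0]

# The constant strategy uses fill_value instead of a statistic of the data.
constant = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=-1)
results["constant"] = constant.fit_transform(X)[3, 0]

print(results)
```

Notice how the outlier pulls the mean upward while the median stays put, which is one reason to prefer the median on skewed features.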

As with all previous strategies, you have to inspect the results to be sure you are not distorting your data. Let's look at the last strategy!

### D. Multivariate Imputation (Iterative Imputation)

If you have only a few missing values, it is probably fine to remove them completely and avoid imputing them with irrelevant values. *"Quality over quantity"*


But you may also wish to keep the data and find a better way to handle the missing values.

One of the best ways out there (considering all the flaws of the above methods) is to fill a given missing value by considering the values of the other features.


In this case, we will use a Scikit-Learn class called `IterativeImputer`. By default it relies on a regression model (Bayesian ridge regression, a form of linear regression). I will not explain that here, but here is how it is done.

What you have to know is that it estimates each feature from all the other features.

> A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion. (From the Scikit-Learn documentation.)


In [None]:
from sklearn.experimental import enable_iterative_imputer  # required: IterativeImputer is still experimental
from sklearn.impute import IterativeImputer

iter_imputer = IterativeImputer()
iter_imputer

In [None]:
housing_imputed=iter_imputer.fit_transform(housing_numeric)

In [None]:
housing_imputed=pd.DataFrame(housing_imputed, columns=housing_numeric.columns)

The missing values will be filled based on their relationship with the other features.
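
To get a feel for what "based on the other features" means, here is a small sketch on made-up data where the second feature is exactly twice the first. It compares mean imputation with `IterativeImputer` for the single missing cell. The array and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Second column is 2 * first column; the value for the row with 4.0 is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Mean imputation ignores the first column entirely.
mean_value = SimpleImputer(strategy="mean").fit_transform(X)[3, 1]

# Iterative imputation predicts the missing cell from the first column.
iter_value = IterativeImputer(random_state=0).fit_transform(X)[3, 1]

print("mean imputation:", mean_value)
print("iterative imputation:", iter_value)
```

The mean fill lands at 5.5, while the iterative fill should come out close to 8, the value the relationship between the two columns suggests.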

As we said before, the right strategy will depend on your problem, the number of missing values you have, and the size of your dataset.

## 3. Leaving the missing values as they are

In this case, you will leave the missing values as they are. Keep in mind that most machine learning models expect complete numeric inputs and will fail on `NaN`, so this only works with models that can handle missing values on their own.

Although you will still have empty values with this strategy, at least you will not have introduced noise or eliminated important data. There is always a tradeoff!

## This is the end!!  


Thanks for finishing this tutorial. I hope you learned something new or found it helpful. If you want to stay in touch, find me on [LinkedIn](https://render.githubusercontent.com/view/www.linkedin.com/in/nyandwi/) and [Twitter](https://twitter.com/Jeande_d).