**Handling Missing Values – the first task when you start coding**

***What is a missing value?***

For various reasons, or sometimes simply at random, values are not captured for every row of certain columns in a dataset. These are called missing values, and they are very common in real-world datasets.

***Why handle missing values?***

The first problem is that many algorithms don’t work if you feed them NAs (missing values).
Beyond that, missing values can introduce bias into the data and sometimes lead to bad results, which is why it is important to handle them appropriately. The best option is to avoid missing data at the time it is created. In real scenarios, however, you are usually not the one who collects the data, so you must work with what has been collected and handle it on your end.




***Methods of handling them.***

Let us now discuss how to handle the missing values.


We will use the Titanic dataset. Let us check which columns have missing values.

In [None]:
import numpy as np
import pandas as pd 
data = pd.read_csv("/kaggle/input/titanic/train.csv")
print(data.isna().any())

In [None]:
# Removing a few columns (mostly text/categorical) to keep the example simple
cols = ['Name', 'Sex', 'SibSp',
       'Ticket']
data = data.drop(columns = cols)


As we can see, three columns have missing values: Age, Cabin and Embarked.
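
Beyond a yes/no check, it often helps to see how much of each column is missing, because the choice of method usually depends on that share. The following is a minimal sketch on a small made-up frame; `toy` and its values are assumptions for illustration, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values per column
na_counts = toy.isna().sum()
# Share of missing values per column, as a percentage
na_percent = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": na_counts, "percent": na_percent})
print(summary)
```

The same two lines (`isna().sum()` and `isna().mean()`) should work unchanged on the `data` frame loaded above.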

Let us see different methods.

1.	**Delete the rows having NA:** When you don’t want to spend time thinking about what to do, the immediate solution that comes to mind is to delete the rows having NA. No missing values, no problem. But this cannot be done every time: removing rows leads to loss of data, which we want to avoid. As a rule of thumb, use this step only if the column has less than 5% missing values.



In [None]:
data1 = data.copy()
data1 = data1.dropna()
print('original data length ',len(data))
print('new data  length ',len(data1))

As you can see, the size of the data decreases drastically if we drop NAs. This step has led to a huge loss of data. Hence, we should do this only if NAs are few in number.
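
One way to limit the loss is to drop rows only when a specific, sparsely missing column is NA, using the `subset` argument of `dropna`. This is a sketch on a made-up frame (the `toy` data is an assumption, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Age is often missing, Embarked only rarely
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Dropping every row that has any NA removes three of the five rows
all_dropped = toy.dropna()

# Restricting the check to one column removes only the rows missing there
subset_dropped = toy.dropna(subset=["Embarked"])

print(len(toy), len(all_dropped), len(subset_dropped))
```

The remaining NAs in the other columns can then be handled with one of the other methods.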

**2.	Delete the columns having NA:** Another approach is to remove the columns that have NA. But if you do this with every such column, you may lose much of your data. You can do it for some columns, and it is suggested only when a column has more than 80% missing values. It also depends on the importance of the column in your use case.



In [None]:
na_columns = data.columns[data.isnull().any()]
data2 = data.drop(columns = na_columns)
print(data2.columns)
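
The cell above drops every column that has any NA. To apply the 80% rule of thumb instead, one option is to compute the missing share per column and drop only the columns above a threshold. A minimal sketch, with a made-up `toy` frame and a `threshold` variable that are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Cabin is entirely missing, Age only partly
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

threshold = 0.8  # drop columns with more than 80% missing values

# Fraction of missing values in each column
na_share = toy.isna().mean()
# Names of the columns whose missing share exceeds the threshold
cols_to_drop = na_share[na_share > threshold].index

reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```

Here `Age` survives because only one of its five values is missing.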

**3.	Missing value imputation:** This is a widely used and very effective method for handling missing values. The main options for the imputed value are the mean, median, mode or a constant. No option is best in general; it varies with the dataset and use case, so choose the value to impute accordingly.

In [None]:
### Let us take column age and see 
#Mean
data3 = data.copy()
print(data3)
data3['Age'] = data3.Age.fillna(data3.Age.mean())
print("-----After imputation------")
print(data3)

You can see that row 888 has NaN in the Age column. We replaced it with the mean.

In [None]:
#Similarly for median, mode or  a constant value
#Median
data3 = data.copy()
data3['Age'] = data3.Age.fillna(data3.Age.median())
#Mode
data3 = data.copy()
# mode() returns a Series (there can be ties), so take its first entry
data3['Age'] = data3.Age.fillna(data3.Age.mode()[0])
#Constant
data3 = data.copy()
data3['Age'] = data3.Age.fillna(20)
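
For a text column such as Embarked, the mean and median are not defined, so the mode or a constant is the usual choice. The sketch below also records which rows were imputed before filling them; the `toy` frame and the `Age_was_missing` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", np.nan, "S", "C"],
})

# Remember which rows were missing before filling (hypothetical helper column)
toy["Age_was_missing"] = toy["Age"].isna()

# Numeric column: fill with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: mode() returns a Series, so take its first entry
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Keeping the indicator column lets a later model distinguish observed ages from imputed ones, if that turns out to matter.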


**4.	Advanced missing value imputation:** Another option for imputing missing values is to predict them with a model. You can use simple regression or classification algorithms for this task.



In [None]:
#### Doing this for Age column
data4 = data.copy()
#drop columns having nan
data4 = data4.drop(columns = ['Cabin','Embarked'])

x_train = data4.dropna()
y_train = x_train['Age']
x_train = x_train.drop(columns = ['Age'])

# rows whose Age is missing, with the same feature columns as x_train
x_predict = data4[data4['Age'].isnull()]
x_predict = x_predict.drop(columns = ['Age'])

#Lets fit model
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 100 decision trees
rf = RandomForestRegressor(n_estimators = 100, random_state = 22)
# Train the model on training data
rf.fit(x_train, pd.to_numeric(y_train))

##predict
y_predict = rf.predict(x_predict)
print(y_predict)

Let us reconstruct our data.

In [None]:
x_temp = pd.concat([x_train,x_predict])
y_temp = np.concatenate((y_train, y_predict))
data4 = x_temp.copy()
data4['Age'] = y_temp
# restore the original row order (concat placed the imputed rows last)
data4 = data4.sort_index()
print(data4)
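
An alternative to concatenating the pieces is to write the predictions straight into the missing positions with a boolean mask, which keeps the original row order and leaves the observed values untouched. This is a minimal sketch, not the approach used above; the `toy` frame, the `features` list and the small `n_estimators` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy frame with two fully observed features and a gappy Age
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
})

features = ["Pclass", "Fare"]
missing = toy["Age"].isna()  # True where Age has to be imputed

# Train only on the rows where Age is known
model = RandomForestRegressor(n_estimators=10, random_state=22)
model.fit(toy.loc[~missing, features], toy.loc[~missing, "Age"])

# Fill only the missing positions; everything else stays as it was
filled = toy.copy()
filled.loc[missing, "Age"] = model.predict(toy.loc[missing, features])

print(filled)
```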

***What if you don’t handle them?***

*Note: I do not recommend this approach.*

Some algorithms have built-in mechanisms for dealing with missing values in the data. Some Naive Bayes implementations, for example, let you choose what happens to a missing value. To repeat: handle your missing values beforehand and don’t rely on the algorithm to handle them for you.


This was a very introductory kernel/post on handling missing values. Let me know if I should cover the topic in more detail.

Thank you.