Some data may be missing from fields that are expected to be filled.
This can result from human error, loss during data entry, or any other cause that leaves the value absent from the start or unstandardised during processing.

Eg: surveys

Individuals tend to not provide certain data such as their personal information (Contacts, Age, Salary, etc) if not mandated.

---

Some of these missing data can be:
1. Continuous or

2. Categorical

---

Generally, there are a few types of missing data:

1. MCAR - missing completely at random. There is no relationship between the data being missing and any values in the dataset, observed or missing

2. MNAR - missing not at random (systematic missing values). Whether a value is missing depends on the missing value itself, for example when people with high salaries decline to report them

3. MAR - missing at random. Whether a value is missing depends on other observed fields in the dataset, but not on the missing value itself
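
To make the three mechanisms concrete, the sketch below simulates each one on synthetic data. The column names (`age`, `income`), the sample size, and all probabilities are assumptions chosen for illustration; none of them come from the datasets used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, fully observed data (hypothetical columns)
toy = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (here, age)
mar_mask = rng.random(n) < np.where(toy['age'] > 60, 0.5, 0.1)

# MNAR: missingness depends on the value that goes missing (here, high incomes)
mnar_mask = rng.random(n) < np.where(toy['income'] > 65_000, 0.6, 0.05)

# Series.mask replaces the entries where the mask is True with NaN
toy['income_mcar'] = toy['income'].mask(mcar_mask)
toy['income_mar'] = toy['income'].mask(mar_mask)
toy['income_mnar'] = toy['income'].mask(mnar_mask)

# Compare the share of missing values and the mean of what remains observed
masked_columns = ['income_mcar', 'income_mar', 'income_mnar']
summary = pd.DataFrame({
    'missing_share': toy[masked_columns].isnull().mean(),
    'observed_mean': toy[masked_columns].mean(),
})
print(summary)
```

In this simulation, the observed mean of the MNAR column should fall below the mean of the full `income` column, because the largest values are the ones most likely to be removed.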

In [None]:
import numpy as np 
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input/titanicdataset-traincsv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/titanicdataset-traincsv/train.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

Since these data were collected after the incident, some fields have missing values. From the output above, the Age and Cabin columns have missing values, so it is reasonable to hypothesise that the missingness is related to other fields, such as whether the passenger survived.

In [None]:
df[df['Embarked'].isnull()]

In [None]:
df[df['Age'].isnull()]

In [None]:
df['Cabin_Null'] = np.where(df['Cabin'].isnull(),1,0)
df['Cabin_Null'].mean()

In [None]:
df.columns

In [None]:
df['Cabin_Null'].value_counts()

In [None]:
df.groupby(['Survived'])['Cabin_Null'].mean()

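About 87.6% (0.876138) of the passengers who did not survive have a missing Cabin value.

About 60.2% (0.602339) of the passengers who survived have a missing Cabin value.
This difference supports, but does not prove, the hypothesis above: Cabin is unlikely to be missing completely at random.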

## Techniques to handle missing data
There are various techniques to handle missing data, including:
1. Mean/median/mode replacement (measures of central tendency)
2. Random sample imputation
3. Capturing null values with a new feature
4. End of distribution imputation
5. Arbitrary imputation
6. Frequent categories imputation

### 1. Mean/median/mode imputation 
When to apply?

Assumptions made in mean/median/mode imputation are that the missing values are completely at random (MCAR).

|#  |Pros             |Cons                                    |
|---|----             |----                                    |
|1. |Easy to implement|Change/distortion in the original dataset|
|2. |Fast             |                                        |


In [None]:
#Taking certain columns from the dataset
df.head()

In [None]:
df = df.drop(['PassengerId','Pclass','Name','Sex','SibSp','Parch','Ticket','Cabin','Embarked','Cabin_Null'], axis=1)
df.head()

In [None]:
#Checking the percentage of missing values
df.isnull().mean()

In [None]:
median = df['Age'].median()
def impute(dataset, variable, median):
    # Add a new column in which missing values are replaced by the median
    dataset[variable+'_median'] = dataset[variable].fillna(median)

In [None]:
impute(df,'Age',median)
df.head()

In [None]:
df.describe()

In [None]:
(df.Age.std())-(df.Age_median.std())

In [None]:
df.isnull().mean()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde',ax=ax)
df['Age_median'].plot(kind='kde',ax=ax,color='red')
lines,labels = ax.get_legend_handles_labels()
ax.legend(lines,labels, loc='best')

### 2. Random sample imputation
Random observations are drawn from the non-missing values of the variable and used to replace the missing values. Like mean/median imputation, this assumes the data are MCAR.

|#  |Pros                         |Cons                                       |
|---|----                         |----                                       |
|1. |Easy to implement            |Randomness will not work in every situation|
|2. |Less distortion in variance |                                           |


In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean()

In [None]:
df['Age'].dropna().sample(df['Age'].isnull().sum(),random_state=0)

In [None]:
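median = df['Age'].median()
def impute(dataset, variable, median):
    # Median-imputed copy of the variable, kept for comparison
    dataset[variable+'_median'] = dataset[variable].fillna(median)
    dataset[variable+'_random'] = dataset[variable]
    # Draw as many observed values as there are missing ones
    random_sample = dataset[variable].dropna().sample(dataset[variable].isnull().sum(),random_state=0)
    # Align the sample with the index of the missing rows before assigning
    random_sample.index = dataset[dataset[variable].isnull()].index
    dataset.loc[dataset[variable].isnull(),variable+'_random']=random_sample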

In [None]:
impute(df,'Age',median)

In [None]:
df.isnull().mean()

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
(df.Age.std())-(df.Age_random.std())

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde',ax=ax)
df['Age_median'].plot(kind='kde',ax=ax,color='red')
df['Age_random'].plot(kind='kde',ax=ax,color='yellow', alpha=0.3, linewidth=5)
lines,labels = ax.get_legend_handles_labels()
ax.legend(lines,labels, loc='best')

The distribution is distorted far less than with median imputation.

### 3. Capturing Null values with new features/variables
This works well when the data are not missing completely at random, because the missingness itself then carries information.
It provides the model with at least some information about the missing values.
However, it creates an additional feature for every variable with missing values, which increases the number of fields.

|#  |Pros                                 |Cons|
|---|----                                 |----|
|1. |Easy to implement                    |Curse of dimensionality|
|2. |Captures importance of missing values||

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()

In [None]:
df['Age_Null'] = np.where(df["Age"].isnull(),1,0)
df.head()

In [None]:
df.Age.mean()

In [None]:
df.Age.median()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Age'] = df['Age'].fillna(df['Age'].median())
df.head()

In [None]:
df

### 4. End of distribution imputation
Missing values are replaced with a value taken from the far end of the variable's distribution, for example the mean plus three standard deviations.



|#  |Pros                                 |Cons                   |
|---|----                                 |----                   |
|1. |Easy to implement|Distorts original distribution|
|2. |Captures the importance of missingness if there is one|If the missingness is not important, it may mask the predictive power of the original variable|
|3. ||If the number of NaN values is large, it will mask true outliers in the distribution|
|4. ||If the number of NaN values is small, the replaced values may be treated as outliers and pre-processed in a subsequent feature engineering step|

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()

In [None]:
df.Age.hist(bins=50)

In [None]:
df.Age.mean()

In [None]:
df.Age.describe()

In [None]:
df.Age.mean()+3*df.Age.std()

In [None]:
import seaborn as sns
# Pass the column as a keyword argument; recent seaborn versions reject it positionally
sns.boxplot(x='Age',data=df)

In [None]:
median = df.Age.median()
extreme_value = df.Age.mean()+3*df.Age.std()

def impute(df, variable, median, extreme_value):
    # Keep an end-of-distribution version alongside the median-imputed original
    df[variable+'_end_of_distribution'] = df[variable].fillna(extreme_value)
    df[variable] = df[variable].fillna(median)

In [None]:
df

In [None]:
impute(df, 'Age', median, extreme_value)

In [None]:
df

In [None]:
df['Age'].hist(bins=50)

In [None]:
df['Age_end_of_distribution'].hist(bins=50)

In [None]:
sns.boxplot(x='Age_end_of_distribution',data=df)

### 5. Arbitrary imputation

Def: Arbitrary value imputation consists of replacing all occurrences of missing values within a variable with an arbitrary value. Ideally, the arbitrary value should differ from the mean/median/mode and lie outside the normal range of the variable.

It is not suitable for every use case and is rarely the first choice.

Properties:
* The arbitrary value should be rare, i.e. it should not already occur frequently in the variable
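
As a minimal sketch of the idea, the example below fills missing ages with a sentinel value. The toy data, the helper name `impute_arbitrary`, and the choice of `-1` as the arbitrary value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 58.0, np.nan, 41.0]})

def impute_arbitrary(dataset, variable, value):
    """Add a column where missing values of `variable` are replaced by `value`."""
    dataset[variable + '_arbitrary'] = dataset[variable].fillna(value)
    return dataset

# -1 is an assumed sentinel: it lies outside the plausible range of ages
ages = impute_arbitrary(ages, 'Age', -1)
print(ages)
```

Because the sentinel lies outside the normal range, the imputed rows remain easy to identify afterwards, at the cost of distorting the distribution.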

# Handling Categorical Features

### 6. Frequent Category Imputation

|#  |Pros                                 |Cons                   |
|---|----                                 |----                   |
|1. |Easy and fast to implement|The most frequent label may become over-represented if there are many NULL values|
|2. |Captures missingness of values|Distorts the relationship between the most frequent label and the other variables|


In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input/house-prices-advanced-regression-techniques'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#This dataset was chosen due to it having lots of categorical values  
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df.head()

In [None]:
df.columns

In [None]:
del df

In [None]:
#Taking 3 features for simplicity
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.head()

In [None]:
df.shape[0]

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean().sort_values(ascending=True)

In [None]:
# Computing the frequency of every feature
#df['BsmtQual'].value_counts().plot.bar()
df.groupby(['BsmtQual'])['BsmtQual'].count().sort_values(ascending=False).plot.bar()

In [None]:
df['GarageType'].value_counts().plot.bar()

In [None]:
df['FireplaceQu'].value_counts().plot.bar()

In [None]:
df['GarageType'].value_counts().index[0]

In [None]:
#Imputation function
def impute(df, variable):
    #most_frequent = df[variable].mode()[0]
    most_frequent = df[variable].value_counts().index[0]
    # Assign the result back instead of using inplace=True on a column selection
    df[variable] = df[variable].fillna(most_frequent)

In [None]:
for features in ['BsmtQual','FireplaceQu','GarageType']:
    impute(df,features)
df.isnull().sum()

### 7. Adding a Variable to Capture NULL

|#  |Pros                                 |Cons                   |
|---|----                                 |----                   |
|1. |||
|2. |||


In [None]:
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.head()

In [None]:
df.isnull().sum()

In [None]:
df['BsmtQual_Nulls'] = np.where(df['BsmtQual'].isnull(),1,0)

In [None]:
df.isnull().sum()

In [None]:
df.head()

In [None]:
frequent = df['BsmtQual'].mode()[0]

In [None]:
df['BsmtQual'] = df['BsmtQual'].fillna(frequent)

In [None]:
df.isnull().sum()

In [None]:
# Flag the missing rows first, then fill them with the most frequent category
df['FireplaceQu_Nulls'] = np.where(df['FireplaceQu'].isnull(),1,0)
frequent = df['FireplaceQu'].mode()[0]
df['FireplaceQu'] = df['FireplaceQu'].fillna(frequent)

In [None]:
df.head()

In [None]:
df.isnull().sum()

Filling the NULLs with the most frequent category distorts the distribution of the data when many values are missing.
Therefore, we can instead replace the NULLs with a new category.

In [None]:
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.head()

In [None]:
def impute(df,variable):
    df[variable+'_new_feature'] = np.where(df[variable].isnull(),'Missing',df[variable])

In [None]:
for features in ['BsmtQual','FireplaceQu','GarageType']:
    impute(df,features)
df.isnull().sum()

In [None]:
df.head()

In [None]:
df = df.drop(['BsmtQual','FireplaceQu','GarageType'],axis=1)
df.head()

## One Hot Encoding

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Sex'])
df.head()

In [None]:
pd.get_dummies(df)

In [None]:
#But instead we can do
pd.get_dummies(df,drop_first=True)

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Embarked'])
df.head()

In [None]:
df.Embarked.unique()

In [None]:
df.dropna(inplace=True)

In [None]:
df.Embarked.unique()

In [None]:
# Dropping the first column because "C" can be inferred
# from the "Q" and "S" columns:
# if both are 0, the port must be "C"

pd.get_dummies(df,drop_first=True)

Using a different dataset

In [None]:
df = pd.read_csv('../input/mercedes-benz-greener-manufacturing/train.csv.zip')

In [None]:
df.head()

### One Hot Encoding with Many Categorical Features

In [None]:
df.info()

In [None]:
#Taking only the categorical features
df = pd.read_csv('../input/mercedes-benz-greener-manufacturing/train.csv.zip', usecols=['X0','X1','X2','X3','X4','X5','X6','X8'])
df.head()

In [None]:
df.X0.value_counts()

In [None]:
df.X1.value_counts()

In [None]:
for i in df.columns:
    print(len(df[i].unique()))

One Hot Encoding every category would not be efficient here, because it would add a very large number of columns.

Therefore, a slightly different technique is used:

take the 10 most frequent categories of a variable and One Hot Encode only those.

In [None]:
df.X0.value_counts().sort_values(ascending=False).head(10)

In [None]:
list_top_10 = list(df.X0.value_counts().sort_values(ascending=False).head(10).index)

In [None]:
for categories in list_top_10:
    df[categories] = np.where(df['X0']==categories,1,0)

In [None]:
list_top_10.append('X0')

In [None]:
df[list_top_10]

Showing the presence of the top 10 categories with the highest frequency only
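
The same steps can be wrapped in a small reusable function. This is only a sketch: the helper name `one_hot_top_k`, the toy data, and the `variable_category` column naming are assumptions, not part of the notebook above.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(dataset, variable, k=10):
    """One-hot encode only the k most frequent categories of `variable`."""
    # value_counts() already sorts by descending frequency
    top_k = dataset[variable].value_counts().head(k).index
    encoded = dataset.copy()
    for category in top_k:
        # Prefix new columns with the source column name to avoid name clashes
        encoded[f'{variable}_{category}'] = np.where(encoded[variable] == category, 1, 0)
    return encoded

# Hypothetical toy data standing in for a high-cardinality column
toy = pd.DataFrame({'X0': ['a', 'b', 'a', 'c', 'a', 'b', 'd']})
encoded = one_hot_top_k(toy, 'X0', k=2)
print(encoded)
```

Rows whose category is outside the top k get 0 in every new column, so all rare categories are treated alike.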

# Ordinal Data Encoding

Eg: Years of Experience

* 20 years - 1
* 10 years - 2
* 5 years - 3

Eg: Grades: A,B,C,D,F

* A - 1
* B - 2
* C - 3
* D - 4
* F - 5
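
The grade example can be sketched with a plain dictionary and `Series.map`. The toy `grades` frame below is hypothetical; the mapping follows the list above.

```python
import pandas as pd

# Hypothetical grades column
grades = pd.DataFrame({'Grade': ['B', 'A', 'F', 'C', 'D', 'A']})

# Mapping taken from the list above: A is ranked first, F last
grade_order = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'F': 5}
grades['Grade_Ordinal'] = grades['Grade'].map(grade_order)
print(grades)
```

Any label missing from the dictionary would be mapped to NaN, so it is worth checking for nulls after the mapping.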

In [None]:
import datetime

In [None]:
todays_date = datetime.datetime.today()
todays_date

In [None]:
# Creating a dataset
# List comprehension
days = [todays_date - datetime.timedelta(x) for x in range(0,15)]

In [None]:
data = pd.DataFrame(days)
data.columns = ["Day"]
data

In [None]:
data['Weekday'] = data['Day'].dt.day_name()
data.head()

In [None]:
#Encoding the categorical feature, Weekday

# Named weekday_map to avoid shadowing the built-in dict
weekday_map = {
    'Monday':1,
    'Tuesday':2,
    'Wednesday':3,
    'Thursday':4,
    'Friday':5,
    'Saturday':6,
    'Sunday':7
}
weekday_map

In [None]:
data['Weekday_Ordinal'] = data['Weekday'].map(weekday_map)
data

# Frequency Encoding / Count Encoding

|#  |Pros             |Cons                   |
|---|----             |----                   |
|1. |Easy to implement|If two categories have the same frequency, they get the same encoded value and cannot be told apart|
|2. |Not increasing feature space||
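
The drawback in the table can be seen on a tiny example. The `City` column and its values are hypothetical and were chosen so that two categories occur equally often.

```python
import pandas as pd

# Hypothetical column in which 'B' and 'C' occur equally often
toy = pd.DataFrame({'City': ['A', 'A', 'A', 'B', 'B', 'C', 'C']})

# The count of each category is used as its encoded value
frequency_map = toy['City'].value_counts().to_dict()
toy['City_Freq'] = toy['City'].map(frequency_map)
print(frequency_map)
print(toy)
```

After encoding, `B` and `C` share the same value, so a model can no longer distinguish them.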

In [None]:
df = pd.read_csv('../input/adult-dataset/adult.csv',header=None)
df.head()

In [None]:
df[1].unique(),len(df[1].unique())

In [None]:
df.shape

In [None]:
df.info()

In [None]:
cat_feature_columns = [1,3,5,6,7,8,9,13]
cat_feature_columns

In [None]:
# Copy the selection so later column assignments do not act on a view of the original frame
df = df[cat_feature_columns].copy()

In [None]:
df.columns = ['Employment', 'Education', 'Status', 'Position', 'Family', 'Race', 'Sex', 'Country']
df.head()

In [None]:
for feature in df.columns[:]:
    print(feature, ': ', len(df[feature].unique()), ' labels')

In [None]:
#Converting to dictionary
country_map = df['Country'].value_counts().to_dict()

In [None]:
df['Country'] = df['Country'].map(country_map)
df.head()

# Target Guided Ordinal Encoding
Labels are ordered according to the target

Or, labels are replaced by the joint probability of being 0 or 1 in classification problems


|#  |Pros             |Cons                   |
|---|----             |----                   |
|1. |||
|2. |||

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Cabin','Survived'])
df.head()

In [None]:
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

In [None]:
df.Cabin.unique()

The first letter represents the block a cabin belongs to

In [None]:
df['Cabin'] = df['Cabin'].astype(str).str[0]
df.head()

In [None]:
df.Cabin.unique()

In [None]:
df.groupby(['Cabin'])['Survived'].mean()

In [None]:
df.groupby(['Cabin'])['Survived'].mean().sort_values().index

In [None]:
ordinal_labels = df.groupby(['Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

In [None]:
# Wrap in list() so the (rank, label) pairs are displayed
list(enumerate(ordinal_labels,0))

In [None]:
#Mapping the labels to a number
# k is the key (the label) and i is the value (its rank)
ordinal_labels_2 = {k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels_2

In [None]:
df['Cabin_Ordinal_Labels'] = df['Cabin'].map(ordinal_labels_2)
df.head()

# Mean Encoding

|#  |Pros             |Cons                   |
|---|----             |----                   |
|1. |Captures information within label|Can lead to overfitting|
|2. |Creates monotonic relationship between feature and target||

In [None]:
df.groupby(['Cabin'])['Survived'].mean()

In [None]:
mean_ordinal = df.groupby(['Cabin'])['Survived'].mean().to_dict()
mean_ordinal

In [None]:
df['Mean_Ordinal_Encode'] = df['Cabin'].map(mean_ordinal)
df.head()

# Probability Ratio Encoding

In [None]:
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols = ['Cabin','Survived'])
df.head()

In [None]:
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

In [None]:
df['Cabin'].unique()

In [None]:
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

In [None]:
df.Cabin.unique()

In [None]:
prob_df=df.groupby(['Cabin'])['Survived'].mean()

In [None]:
prob_df=pd.DataFrame(prob_df)
prob_df

In [None]:
prob_df['Died']=1-prob_df['Survived']

In [None]:
prob_df.head()

In [None]:
prob_df['Probability_ratio']=prob_df['Survived']/prob_df['Died']
prob_df.head()

In [None]:
probability_encoded=prob_df['Probability_ratio'].to_dict()

In [None]:
df['Cabin_encoded']=df['Cabin'].map(probability_encoded)
df.head()

In [None]:
df.head(20)
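
The ratio used above is P(Survived = 1) / P(Survived = 0) for each category. The sketch below works through it on hypothetical toy data. The `Deck` column and the decision to map a zero denominator to NaN are assumptions for illustration; the notebook above does not need that guard.

```python
import numpy as np
import pandas as pd

# Hypothetical binary-target data
toy = pd.DataFrame({
    'Deck':     ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'Survived': [1,   1,   1,   0,   1,   0,   1,   1],
})

# Share of positives and negatives per category
p_survived = toy.groupby('Deck')['Survived'].mean()
p_died = 1 - p_survived

# Assumption: a category with no negative cases gets NaN instead of an infinite ratio
ratio = p_survived / p_died.replace(0, np.nan)
toy['Deck_encoded'] = toy['Deck'].map(ratio)
print(ratio)
```

Here deck `A` has 0.75 / 0.25 = 3.0, deck `B` has 0.5 / 0.5 = 1.0, and deck `C` has no negative cases, so its ratio is left as NaN for separate handling.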