The objective is to explore the data for NAs and missing values using various techniques and to fill them in.

In [None]:
import numpy as np
import pandas as pd 
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
df = pd.read_csv('../input/bank-loan2/madfhantr.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
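def summary(df):
    """Build a per-column overview: dtype, counts, missing values and shape statistics."""
    types = df.dtypes
    counts = df.count()                     # non-null values per column
    # One array of unique values per column, stored as a Series of objects
    uniques = pd.Series({col: df[col].unique() for col in df.columns})
    nas = df.isnull().sum()                 # missing values per column
    distincts = df.nunique(dropna=False)    # distinct values, counting NaN as one value
    missing = (df.isnull().sum() / df.shape[0]) * 100
    # Skewness and kurtosis only make sense for numeric columns
    sk = df.skew(numeric_only=True)
    krt = df.kurt(numeric_only=True)

    print('Data shape:', df.shape)

    cols = ['Type', 'Total count', 'Null Values', 'Distinct Values', 'Missing Ratio', 'Unique Values', 'Skewness', 'Kurtosis']
    dtls = pd.concat([types, counts, nas, distincts, missing, uniques, sk, krt], axis=1, sort=False)

    dtls.columns = cols
    return dtls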

In [None]:
details = summary(df)
details

Skewness and kurtosis help show whether the data are normally distributed. A skewness of zero means the distribution is symmetric, which is necessary but not sufficient for normality. Negative skewness indicates data that are skewed left: the left tail is longer than the right one. Positive skewness indicates the opposite. If the data are multi-modal, this may affect the sign of the skewness.
![Skewness](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cc/Relationship_between_mean_and_median_under_different_skewness.png/651px-Relationship_between_mean_and_median_under_different_skewness.png)

The kurt() method in pandas reports excess kurtosis, which is zero for a normal distribution. Positive kurtosis indicates a "heavy-tailed" distribution and negative kurtosis indicates a "light-tailed" distribution, relative to the normal.

Many classical statistical tests and intervals depend on normality assumptions. Significant skewness or kurtosis indicates that the data are not normal, and a transformation (for example, a log transform) may be needed before applying such methods.
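
To make the two measures more concrete, here is a minimal sketch on synthetic data. The two samples (`symmetric` and `right_skewed`) are assumptions made up for illustration, not columns from the loan dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic samples (illustrative assumptions, not the loan data)
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))

# Skewness: close to 0 for symmetric data, positive for a long right tail
print('skew: normal %.2f, exponential %.2f' % (symmetric.skew(), right_skewed.skew()))

# Excess kurtosis: close to 0 for normal data, positive for heavy tails
print('kurt: normal %.2f, exponential %.2f' % (symmetric.kurt(), right_skewed.kurt()))
```

The normal sample should give values near zero for both measures, while the exponential sample should give clearly positive ones.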

## Method 1: isnull().sum()

Counting the missing values in each column of the dataframe is a quick way to see where data are missing.

In [None]:
df.isnull().sum()

## Method 2: Seaborn Heatmap

The heatmap is another good way to see where the data are missing the most. Red marks the NAs in each column.

In [None]:
colours = ['g', 'r']  # green = value present, red = value missing
sns.set_style("whitegrid")  # set the style before the figure is created so it takes effect
f, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Missing Values Heatmap')
sns.heatmap(df.isnull(), cmap=sns.color_palette(colours), ax=ax);

## Method 3: Missingno

In [None]:
import missingno as msno
msno.matrix(df);

msno.bar() is a simple visualization of nullity by column. It provides the same information as the matrix, but in a simpler format.

In [None]:
msno.bar(df,  color = 'y', figsize = (14,10));

You can switch to a logarithmic scale by specifying log=True.

In [None]:
msno.bar(df, log = True, color = 'g');

The heatmap presents the nullity correlation between each pair of columns, that is, how strongly the presence or absence of one variable is related to the presence of another. In our example, the correlation between Dependents and Married is 0.4, which means that if the Married value is present, the Dependents value is more likely to be present too.
A value near -1 means that if one variable appears, the other is very likely to be missing.
A value near 0 means there is no correlation between the missing values of the two variables.
A value near 1 means that if one variable appears, the other is very likely to appear as well.

In [None]:
msno.heatmap(df,  cmap='GnBu_r');
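
The same idea can be sketched by hand: correlate the "is missing" indicators of each pair of columns. As far as I understand, this is the quantity the missingno heatmap visualizes, but treat that as an assumption. The frame `toy` below is a hypothetical example, not the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is missing exactly where 'a' is missing,
# while 'c' is missing exactly where 'a' is present
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'b': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'c': [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# Turn each column into a 0/1 "is missing" indicator and correlate the indicators
nullity_corr = toy.isnull().astype(int).corr()
print(nullity_corr)
```

Here the pair (a, b) gets a correlation of 1 and the pair (a, c) gets -1, matching the interpretation above.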

The dendrogram shows a tree representing groups of columns that have strong nullity correlations. It is a more advanced representation of the heatmap that identifies correlated groups of columns rather than pairs.

In [None]:
ax = msno.dendrogram(df)

## Method 4: Percentage of Missing Values

In [None]:
for col in df.columns:
    prct = np.mean(df[col].isnull())
    print('{}:{}%'.format(col, round(prct*100)))
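
The loop can also be written without explicit iteration. This is a minimal sketch on a hypothetical frame `toy` (the column names are made up), using the fact that the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 25%, 50% and 0% missing values
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, 4.0],
    'y': ['a', None, None, 'd'],
    'z': [1, 2, 3, 4],
})

# Percentage of missing values per column, worst columns first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting the result puts the columns that need the most attention at the top.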

# Handling Missing Data

## Dropping NAs

Dropping rows is only advisable when missing values are few (say 0.01–0.5% of the data); otherwise too much information is lost.

To drop all rows with an NA in a particular column, use .dropna() and pass that column in subset.

In [None]:
df.dropna(subset = ['Loan_Amount_Term'], axis = 0, how = 'any', inplace = True)
df.isnull().sum()
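
One way to enforce this rule of thumb is to check the missing fraction before dropping. The helper `drop_if_few_missing` and its `max_fraction` default are hypothetical names and values chosen for this sketch, and `toy` is made-up data.

```python
import numpy as np
import pandas as pd

def drop_if_few_missing(frame, column, max_fraction=0.005):
    """Drop rows where `column` is NA, but only if they make up at most `max_fraction` of all rows."""
    fraction = frame[column].isnull().mean()
    if fraction > max_fraction:
        # Too much data would be lost, so return the frame unchanged
        return frame
    return frame.dropna(subset=[column])

# Hypothetical data: 'term' is 0.1% missing, 'amount' is 50% missing
toy = pd.DataFrame({
    'term': [360.0] * 999 + [np.nan],
    'amount': [np.nan] * 500 + [100.0] * 500,
})

after_term = drop_if_few_missing(toy, 'term')      # below the threshold: rows are dropped
after_amount = drop_if_few_missing(toy, 'amount')  # above the threshold: frame is unchanged
print(len(toy), len(after_term), len(after_amount))
```

The helper returns a new frame instead of modifying the input, so the original data stay available for comparison.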

## Filling missing values

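How to fill in missing values depends on the variable. For categorical variables such as Gender, Married and Dependents, the mode (the most frequent value) is a common choice: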

In [None]:
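# Assign the result back: inplace=True on a selected column is unreliable in recent pandas versions
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])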
df.isnull().sum()

Back-fill ('bfill') or forward-fill ('ffill') can also be used to replace NAs with the next or the previous valid value, respectively.

In [None]:
# fillna(method=...) is deprecated in recent pandas; ffill() and bfill() do the same job
df['Credit_History'] = df['Credit_History'].ffill()
df['Self_Employed'] = df['Self_Employed'].bfill()
df.isnull().sum()

Another option is to fill the missing values with a summary statistic such as the mean:

In [None]:
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df.isnull().sum()
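
The choice of statistic matters when a column is skewed. The sketch below compares mean and median filling on a hypothetical series `amounts` with one large outlier. It illustrates the trade-off and is not a recommendation for the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts with one large outlier and one missing value
amounts = pd.Series([100.0, 120.0, 110.0, np.nan, 130.0, 2000.0])

# The mean is pulled towards the outlier, the median is not
mean_filled = amounts.fillna(amounts.mean())
median_filled = amounts.fillna(amounts.median())

print('filled with mean:  ', mean_filled.iloc[3])
print('filled with median:', median_filled.iloc[3])
```

For strongly skewed columns the median is often the more representative fill value.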

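Note: forward-fill and back-fill can leave NAs behind. Forward-fill cannot fill NAs at the start of a column because there is no previous valid value, and back-fill cannot fill NAs at the end because there is no next valid value.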
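
A small sketch of how forward-fill and back-fill behave at the edges of a column. The series `s` is a made-up example.

```python
import numpy as np
import pandas as pd

# Made-up series with NAs at the start, in the middle and at the end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()       # the leading NA stays: there is no previous valid value
backward = s.bfill()      # the trailing NA stays: there is no next valid value
both = s.ffill().bfill()  # chaining the two fills every gap in this series

print(forward.isnull().sum(), backward.isnull().sum(), both.isnull().sum())
```

Checking isnull().sum() after filling, as done throughout this notebook, is what catches any leftovers.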

And voilà! We discovered and fixed all of the missing values.

My reference sources for this kernel were [Data Cleaning in Python: the Ultimate Guide (2020)](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d), [Advanced Configuration](https://github.com/ResidentMario/missingno/blob/master/CONFIGURATION.md) and [Missing data visualization module for Python](https://github.com/ResidentMario/missingno). 