In [None]:
%%HTML

<style type="text/css">

div.h2 {
    background-color: #3B3B3B; 
    color: white; 
    padding: 10px; 
    padding-right: 300px; 
    font-size: 25px;  
    margin-top: 2px;
    margin-bottom: 10px;
}

div.h3 {
    background-color: white; 
    color: #fe0000; 
    padding: 5px; 
    padding-right: 300px; 
    font-size: 20px; 
    margin-top: 2px;
    margin-bottom: 10px;
}
</style>

# "Missing" Data Analysis

#### Hi! In this notebook I will introduce you to missingno, a Python library built for visualising and analysing missing values.

### <div class="h2">Table of Contents</div>
* [Importing Libraries](#section-one)
* [Reading the data files](#section-two)
* [Overview](#section-three)
* [Missing Data Analysis (MDA)](#section-four)
    - [1. List of features having null values](#subsection-fourone)
    - [2. Bar Graph](#subsection-fourtwo)
    - [3. Nullity Correlation Heatmap](#subsection-fourthree)
    - [4. Nullity Matrix](#subsection-fourfour)
    - [5. Dendrogram](#subsection-fourfive)

<a id="section-one"></a>
### <div class="h2">Importing Libraries</div>

In [None]:
#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

import warnings
warnings.filterwarnings("ignore")

<a id="section-two"></a>
### <div class="h2">Reading the data files</div>

In [None]:
#Reading the data files

train = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')
sample = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv')

<a id="section-three"></a>
---

### <div class="h2">Overview</div>

In [None]:
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()

<a id="section-four"></a>

### <div class="h2">Missing Data Analysis (MDA)</div>

In this section, we will analyse the missing values in this dataset using the various plots provided by the missingno library.

<a id="subsection-fourone"></a>
## 1. List of features having null values

In [None]:
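# True for every column that contains at least one null value
has_null = train.isnull().any()

missing1 = int(has_null.sum())        # number of columns with missing values
missing0 = train.shape[1] - missing1  # number of columns without missing values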
missing1per = missing1 / train.shape[1] * 100
missing0per = missing0 / train.shape[1] * 100

plt.figure(figsize=(10,6))
# x and y are passed by keyword: recent seaborn releases no longer accept them positionally
sns.barplot(x=['Columns with missing values', 'Columns without missing values'], y=[missing1, missing0], palette='Greys')

plt.xlabel('No. of Columns', size=12, labelpad=15)
plt.ylabel('Count', size=12, labelpad=15)
plt.xticks((0, 1), ['Columns with missing values ({0:.2f}%)'.format(missing1per), 'Columns without missing values ({0:.2f}%)'.format(missing0per)])
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)

plt.title('Distribution of features with Missing Values', size=15, y=1.05)

### Observations:

About one third of the features contain missing values; the remaining two thirds have none.
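The plot above only counts the affected columns. To actually list them, a small pandas summary is enough. The sketch below is a minimal example on a toy frame; `missing_summary` and `toy` are hypothetical names used for illustration, and in the notebook `train` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": counts / len(df) * 100,
    })
    # Keep only the columns that contain nulls, most affected first
    affected = summary[summary["missing_count"] > 0]
    return affected.sort_values("missing_count", ascending=False)

# Toy frame standing in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train)` should give one row per feature that has at least one null value.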

<a id="subsection-fourtwo"></a>
## 2. Bar Graph

The bar chart displays the number of non-missing values in each column.

In [None]:
sns.set()
msno.bar(train, labels = list(train.columns), fontsize = 10, color = 'grey', figsize=(25, 10), sort = 'ascending')

### Observations:

All features are at least 98% filled, i.e. fewer than 2% of the values in any feature are missing.
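The fill rates read off the chart can also be checked numerically. This is a minimal sketch on a toy frame (`toy` is a hypothetical stand-in for `train`): the per-column fraction of non-null values is the same proportion the bars represent.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of non-null values per column, least complete column first
fill_rate = toy.notna().mean().sort_values()
print(fill_rate)

# Share of missing cells over the whole frame, in percent
overall_missing_pct = toy.isna().to_numpy().mean() * 100
print(f"Overall missing: {overall_missing_pct:.2f}%")
```

On the real data, `train.notna().mean().min()` would give the fill rate of the least complete feature.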

<a id="subsection-fourthree"></a>
## 3. Nullity Correlation Heatmap
    
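The correlation heatmap measures the nullity correlation between columns of the dataset, i.e. how strongly the presence or absence of one variable affects the presence of another. Values range from -1 (when one variable is present, the other is missing) through 0 (no relationship) to 1 (both variables are present or missing together).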

**The heatmap approach is more suitable for smaller datasets.**

In [None]:
msno.heatmap(train.sample(300), fontsize = 10, cmap = 'Greys')

### Observations:

The absence of F_4_3 is moderately correlated with that of feature F_1_5.

The absence of F_1_9 is strongly correlated with that of features F_4_6 & F_3_7.
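The nullity correlation behind the heatmap can be approximated with plain pandas by correlating the boolean "is missing" indicators of each column. The sketch below uses a toy frame (`toy` is hypothetical); columns that are never or always missing are dropped first, because an indicator with zero variance has no defined correlation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],  # missing exactly where `a` is present
    "c": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],        # missing exactly where `a` is missing
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],              # never missing
})

is_null = toy.isna()
# Keep only columns whose nullity actually varies from row to row
varying = is_null.loc[:, is_null.nunique() > 1]
# Pearson correlation between the 0/1 missingness indicators
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(2))
```

Here `a` and `c` are always missing together, so their nullity correlation is 1, while `a` and `b` are never missing together, so theirs is -1.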

<a id="subsection-fourfour"></a>
## 4. Nullity Matrix

The nullity matrix lets us see how data and missing values are distributed across all columns of the dataset. It also shows a sparkline (or, in some cases, a striped line) that highlights the rows with the highest and lowest nullity.

In [None]:
msno.matrix(train, fontsize = 10, labels = list(train.columns))

### Observations:

The sparkline on the right shows the completeness of each row. When a row has all values filled in each column, the line will be at the maximum right position. As missing values start to increase within that row the line will move towards the left.

White marks missing values, so the amount of white in a column indicates how sparse that feature is.

The set of features in the middle of the graph does not contain any missing values.
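The row completeness traced by the sparkline is simply the number of non-null values in each row, which can be computed directly. A minimal sketch on a toy frame (`toy` is a hypothetical stand-in for `train`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# Number of non-null values in each row
row_completeness = toy.notna().sum(axis=1)
print(row_completeness)

# Index labels of the least and most complete rows
least_complete_row = row_completeness.idxmin()
most_complete_row = row_completeness.idxmax()
print(least_complete_row, most_complete_row)
```

On the real data, `train.notna().sum(axis=1).value_counts()` would show how many rows have each level of completeness.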

<a id="subsection-fourfive"></a>
## 5. Dendrogram

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. Here, the objects being clustered are the features, grouped by how similar their patterns of missing values are.

In [None]:
msno.dendrogram(train, filter = 'top', fontsize = 10, orientation = 'top')

### Observations:

Feature pairs joined by a smaller U-shaped link have more similar patterns of missing values than pairs joined by a larger one.

For example: Features F_2_23 & F_4_4 are more similar to each other compared to F_3_21 & F_4_2.
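To make the idea of "similar" features concrete, the clustering behind such a dendrogram can be sketched with scipy. This is an illustration under stated assumptions, not missingno's exact implementation: it uses average linkage on the Euclidean distance between 0/1 missingness vectors, and `toy` is a hypothetical stand-in for `train`.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Toy frame standing in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],  # same nullity pattern as `a`
    "c": [np.nan, 2.0, np.nan, 4.0, 5.0],  # different nullity pattern
})

# One row per feature, one column per observation; 1 where the value is missing
nullity = toy.isna().astype(int).to_numpy().T

# Average linkage on Euclidean distance (assumed settings for this sketch)
linkage = hierarchy.linkage(nullity, method="average")

# Each linkage row is [cluster_i, cluster_j, merge_distance, cluster_size]
first_merge = linkage[0]
print(linkage)
```

Features `a` and `b` are merged first, at distance 0, because they are missing in exactly the same rows; `c` joins later at a larger distance, which corresponds to a larger U-shaped link in the plot.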

### <div class="h2">Ending with a Quote</div>
## "You can have data without information, but you cannot have information without data."

# The End!
Thank you for reading this notebook. I hope you found this analysis interesting and useful.

### Please upvote if you liked the analysis. It will motivate me to do better :)
![](http://68.media.tumblr.com/e1aed171ded2bd78cc8dc0e73b594eaf/tumblr_o17frv0cdu1u9u459o1_500.gif)