# Introduction

This Kernel does not build a predictive model. Instead, it takes an in-depth look at the missing values in the tabular data. I'll spend some time exploring the data, which is an often neglected step, and try to find useful insights along the way.

## Table of Contents

1. [Importing Libraries and Dataset](#imp)

2. [Exploratory Data Analysis](#eda)

    2.1 [Describing the Dataset and Missing Values](#desc)
        
3. [Chi-Square Test of Independence](#chi)
    
    3.1 [Analyzing the Results](#p)
        
    3.2 [Post-Hoc Testing](#posthoc)
    
    3.3 [Adjusted P-Value](#adj)
    
    3.4 [Observations](#obs)
 
4. [Conclusions](#concl)

<a id = 'imp'></a>

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/test.csv')

<a id = 'desc'></a>

In [None]:
train.corr(), train.describe()

In [None]:
X1 = train.drop(columns = ['claim'])
X2 = train.drop(columns = ['claim'])
y = train.claim

In [None]:
X1.isna().sum(), test.isna().sum()

<a id = 'chi'></a>
# Chi-Square Test of Independence 

In a test of independence, we use statistical analysis to decide whether to **reject or fail to reject** a null hypothesis.

Which independence test applies depends on whether the explanatory and response variables are categorical or numerical.

In this case, I'll use a Chi-Square test of independence.
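
To make the test less of a black box, here is a minimal sketch of what it computes, assuming a made-up 3x2 table of counts (the numbers are illustrative, not taken from the competition data). It builds the counts expected under independence, sums the squared deviations, and cross-checks the result against `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (rows: groups, columns: outcome 0 / 1).
observed = np.array([[30, 10],
                     [20, 20],
                     [10, 30]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Counts we would expect if rows and columns were independent.
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom.
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under H0.
p_value = chi2.sf(statistic, dof)

# Cross-check with SciPy (Yates' correction is only applied when dof == 1).
scipy_stat, scipy_p, scipy_dof, scipy_expected = chi2_contingency(observed)
print(statistic, p_value, dof)
```

A small p-value means the observed counts sit far from what independence would predict.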

## Understanding our Relation

I want to know whether the **target variable** (whether or not there was a claim) **depends** on the number of missing values in each instance (each row). In this context, let's define H0 (the null hypothesis) and H1 (the alternative hypothesis):

**H0**: There is **NO** relation between the number of missing values in each instance and the target variable.

**H1**: There is a relation between the number of missing values in each instance and the target variable.
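
As a rough illustration of how these hypotheses turn into a test, the sketch below uses a tiny made-up DataFrame (the columns `f1` and `f2` and all values are hypothetical). It counts missing values per row, cross-tabulates the counts against the target, and runs the test. The toy table is far too small for the test to be reliable; it only shows the mechanics.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: two features with gaps and a binary target.
toy = pd.DataFrame({
    "f1":    [1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, 8.0],
    "f2":    [np.nan, np.nan, 1.0, 2.0, 3.0, np.nan, 4.0, 5.0],
    "claim": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Explanatory variable: number of missing features in each row.
n_missing = toy.drop(columns=["claim"]).isna().sum(axis=1)

# Contingency table: rows are missing-value counts, columns are target classes.
table = pd.crosstab(n_missing, toy["claim"])

# Unpack the result: statistic, p-value, degrees of freedom, expected counts.
statistic, p_value, dof, expected = chi2_contingency(table)
print(table)
print(p_value)
```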



In [None]:
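# Build 0/1 missing-value indicators (1 = missing). Each mask is computed from
# the frame's own NaNs, before any values are overwritten.
X2 = X2.isna().astype(int)
X1 = X1.isna().astype(int)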

X1 = X1.sum(axis=1).astype(int)

In [None]:
contigency= pd.crosstab(X1, y)
contigency

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(contigency, annot=True, cmap="YlGnBu")

In [None]:
chi2_contingency(contigency)

In [None]:
total = contigency[0] + contigency[1]
cont_rel_0 = contigency[0] / total
cont_rel_1 = contigency[1]  / total
pd.DataFrame(np.array([cont_rel_0,cont_rel_1]))

<a id = 'p'></a>
# Analyzing the Results

The table above shows the **proportion of each claim outcome for every number of null values**.
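
One possible shortcut for this kind of table, sketched on made-up data, is the `normalize` argument of `pd.crosstab`. The series `n_missing` and `claim` below are hypothetical stand-ins for the notebook's missing-value counts and target.

```python
import pandas as pd

# Hypothetical stand-ins for the per-row missing-value counts and the target.
n_missing = pd.Series([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], name="n_missing")
claim = pd.Series([0, 0, 1, 0, 1, 1, 1, 1, 1, 1], name="claim")

# normalize="index" divides each row by its own total, so every row sums to 1.
claim_share = pd.crosstab(n_missing, claim, normalize="index")
print(claim_share)
```

Each row then reads directly as the share of no-claim and claim outcomes for that number of missing values.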

As the output of the Chi-Square test shows, our p-value is near 0. This is below 0.05, the significance level the p-value is commonly compared against. But this kind of analysis can be prone to **Type 1 errors** (false positives), especially when several tests are run on the same data.
<a id = 'posthoc'></a>
# Post-Hoc Tests

To find out whether this dependency holds between specific groups of missing-value counts, I'll run a post-hoc test: a Chi-Square test for every possible pair of groups.

The number of possible comparisons is $$n(n-1)/2$$ where n is the number of groups of the explanatory variable (from no missing values up to 14, so n = 15).
<a id = 'adj'></a>
# Adjusted P-Value

For these pairwise comparisons, we will compare each p-value against a significance level with the **Bonferroni adjustment**, $$0.05/c$$ where c is the number of pairwise comparisons run in the test.
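
The adjustment can be sketched in a few lines, assuming 15 groups (0 to 14 missing values per row, the value this notebook uses). The helper `is_significant` is a hypothetical name introduced only for illustration.

```python
from itertools import combinations

alpha = 0.05
groups = range(15)  # assumed: 0 to 14 missing values per row

# Every unordered pair of groups is one post-hoc comparison.
pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

# Bonferroni: split the overall significance level across all comparisons.
adjusted_alpha = alpha / n_comparisons


def is_significant(p_value, threshold=adjusted_alpha):
    """Return True when a pairwise p-value is below the adjusted threshold."""
    return p_value < threshold


print(n_comparisons, adjusted_alpha)
```

Bonferroni is deliberately conservative: it keeps the chance of any false positive across all comparisons at or below 0.05, at the cost of making each individual comparison harder to pass.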

In [None]:
n = 15
n*(n-1)/2

In [None]:
0.05 /105

In [None]:
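# Start from NaN so untested cells (diagonal, upper triangle) are not mistaken
# for p-values of 0 when compared against the threshold.
pair_comparisons = np.full((15, 15), np.nan)
for i in range(15):
    for e in range(i + 1, 15):
        # Keep only the rows whose missing-value count is i or e.
        temp = X1[(X1 == i) | (X1 == e)]
        y_temp = y[temp.index]
        temp_contingency = pd.crosstab(temp, y_temp)
        # chi2_contingency returns (statistic, p-value, dof, expected counts).
        c, p, dof, expected = chi2_contingency(temp_contingency)
        pair_comparisons[e, i] = p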

In [None]:
pair_comparisons = pd.DataFrame(pair_comparisons)

In [None]:
pair_comparisons

In [None]:
sns.heatmap(pair_comparisons < (0.05 /105))
plt.title('Pair Comparisons Test Results < Adjusted P-Value')
plt.show()

<a id = 'obs'></a>
# Observations

As the DataFrame above and its heatmap show, **there are many pairs for which we fail to reject the null hypothesis**, in contrast to the result of the overall test.

As the number of missing values increases, the pairwise p-values stop being significant at the adjusted threshold.
<a id = 'concl'></a>
# Conclusions

The Chi-Square test of independence lets you check whether two categorical variables are related. From there, we ran post-hoc tests and dug deeper into the statistical relations between the groups of the explanatory variable.


What do you think? What information do you think this evaluation is giving?

drK~