### Clustermap

In this file, we want to understand how data is entered. More specifically, we want to see whether the presence or absence of values is correlated across variables. While some of these relationships are obvious (e.g. if `SAT_reading` is present, so is `SAT_math`), there may be others we would not have thought of otherwise.

Import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
# alt.renderers.enable('notebook')

import matplotlib.pyplot as plt
import warnings
import folium
warnings.filterwarnings('ignore')

Read in the .csv file as a DataFrame.

In [None]:
filename = '../data/processed/CriticalPath_Data_EM_Confidential_lessNoise.csv'

df = pd.read_csv(filename).drop(columns=['Unnamed: 0'])

First, we construct the shadow matrix (`M`) of the data. That is, we construct a matrix with the same dimensions as the original data set made up of only 1's and 0's. We place a 1 where there is a missing entry in the data, and 0's everywhere else. (Swapping the two labels would not change the correlations computed below.)

From there, we keep only the columns that have a standard deviation $\neq$ 0, since the correlation with a constant column is undefined, and compute the correlation matrix. We can use `sns.clustermap()` to plot a hierarchically clustered heatmap of the correlations, which groups similar variables together and highlights areas of high and low correlation.

In [None]:
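# Shadow matrix: isnull() flags missing entries, cast to 0/1 integers
M = df.isnull().astype(int)
# Columns whose indicator never varies (always or never missing)
constant = M.std() == 0
shadow_matrix = M.loc[:, ~constant]

clustergrid = sns.clustermap(shadow_matrix.corr(), figsize=(7, 7), cmap='coolwarm');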
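To make the construction concrete, here is a minimal sketch on a toy DataFrame. The data and the `ACT` and `ID` columns are hypothetical and are not taken from the real dataset. It shows why constant indicator columns are dropped, and that flipping the 0/1 labels leaves the correlations unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'SAT_reading': [500, np.nan, 620, np.nan, 580],
    'SAT_math':    [510, np.nan, 600, np.nan, 590],
    'ACT':         [np.nan, 24, np.nan, 30, np.nan],
    'ID':          [1, 2, 3, 4, 5],  # never missing
})

# Missingness indicators: 1 where a value is missing, 0 where present.
shadow = toy.isnull().astype(int)

# 'ID' is never missing, so its indicator has zero variance and is dropped.
varying = shadow.loc[:, shadow.std() != 0]
shadow_corr = varying.corr()

# Flipping the labels (1 = present, 0 = missing) gives the same correlations.
flipped_corr = (1 - varying).corr()

print(shadow_corr)
```

In this toy case the two SAT columns are always missing together, while `ACT` is missing exactly when they are present, so the matrix contains both strongly positive and strongly negative entries.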

We can repeat this for each year of entry, here with plain heatmaps rather than clustermaps. Splitting by year appears to have little to no effect on the resulting matrix.

In [None]:
for year in [201930, 201830, 201730, 201630]:
    # Shadow matrix restricted to rows with this year of entry
    M = df[df['Year_of_entry'] == year].isnull().astype(int)
    constant = M.std() == 0
    shadow_matrix = M.loc[:, ~constant]

    # One figure per year; draw the heatmap on that figure's axes
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.set_title(str(year))
    sns.heatmap(shadow_matrix.corr(), cmap='coolwarm', ax=ax);
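The year-by-year comparison is judged by eye from the heatmaps. One possible way to put a number on it is sketched below: compute the largest absolute difference between the overall shadow correlation matrix and each per-year matrix. The helper names `shadow_corr` and `max_abs_diff` and the toy data are assumptions made for illustration; only the `Year_of_entry` column name comes from the notebook.

```python
import numpy as np
import pandas as pd

def shadow_corr(frame):
    """Correlation of missingness indicators, skipping constant columns."""
    shadow = frame.isnull().astype(int)
    return shadow.loc[:, shadow.std() != 0].corr()

def max_abs_diff(corr_a, corr_b):
    """Largest absolute difference over the cells both matrices share."""
    shared = corr_a.index.intersection(corr_b.index)
    if len(shared) == 0:
        return float('nan')
    diff = corr_a.loc[shared, shared] - corr_b.loc[shared, shared]
    return float(np.nanmax(np.abs(diff.to_numpy())))

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    'Year_of_entry': [201830] * 4 + [201930] * 4,
    'a': [1, np.nan, 3, np.nan, 1, np.nan, 3, np.nan],
    'b': [1, np.nan, 3, np.nan, 1, np.nan, 3, np.nan],
    'c': [np.nan, 2, np.nan, 4, np.nan, 2, 3, 4],
})

overall = shadow_corr(toy)
diffs = {
    year: max_abs_diff(overall, shadow_corr(toy[toy['Year_of_entry'] == year]))
    for year in [201830, 201930]
}
print(diffs)
```

A value near 0 for every year would support the claim that splitting by year changes little. Columns that are constant within a single year drop out of that year's matrix, so only the shared cells are compared.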

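We can also do this with the actual dataset, and see whether a relationship exists between the actual data values. In theory, this should match the correlation matrix generated from the shadow matrix fairly closely, since variables that are entered together are most likely related. Note that missing values are replaced with the sentinel -999 before the correlations are computed, so missingness itself also contributes to the result.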

In [None]:
M = df.select_dtypes(exclude=['object'])  # drop string columns; the rest are treated as numeric
M = M.fillna(-999)  # sentinel value for missing entries
clustergrid = sns.clustermap(M.corr(), figsize=(14, 14), cmap='coolwarm');
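As a small illustration of how a sentinel fill interacts with correlation, the sketch below compares `fillna(-999)` with the default behaviour of `DataFrame.corr()`, which excludes missing values pairwise. The toy columns `x` and `y` are hypothetical: they are perfectly negatively correlated wherever both are present, and they are missing on the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, np.nan, np.nan],
    'y': [4, 3, 2, 1, np.nan, np.nan],
})

# Default: rows with a missing value in either column are ignored.
pairwise = toy.corr().loc['x', 'y']

# Sentinel fill: the shared -999 entries dominate the correlation.
filled = toy.fillna(-999).corr().loc['x', 'y']

print(f'pairwise deletion: {pairwise:.3f}')
print(f'sentinel fill:     {filled:.3f}')
```

With the sentinel fill, the shared missing rows pull the correlation strongly positive even though the observed values move in opposite directions. This may be one reason the filled correlation matrix resembles the shadow-matrix one; comparing both versions on the real data would show how much of the pattern comes from the values themselves.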