In [None]:
import pandas as pd
from case_study_utils import CorrCommunity
from case_study_ui import show_communities
import qgrid
pd.set_option("display.max_rows", None)

# Load Correlations

In [None]:
corrs = pd.read_csv('./chicago_month_census_tract.csv')
corr_community = CorrCommunity(corrs, 'chicago')

# Get correlation communities

Nexus searches for an optimal set of signals that, when applied as filters, yields the correlation graph with the highest modularity score. The signals we consider for Chicago Open Data are:

    - Missing value ratio in the aggregated column
    - Missing value ratio in the original column
    - Zero value ratio in the aggregated column
    - Zero value ratio in the original column
    - Absolute value of the correlation coefficient
    - Overlap: number of samples used to calculate the correlation

For Chicago Open Data, the best set of thresholds for the above signals is [1.0, 1.0, 1.0, 0.8, 0.6, 70]. This means we keep only correlations with missing_ratio <= 1.0, missing_ratio_original <= 1.0, zero_ratio <= 1.0, zero_ratio_original <= 0.8, |r| >= 0.6, and samples >= 70 (the overlap). Because a ratio can never exceed 1.0, a threshold of 1.0 effectively disables that filter.
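
To make the threshold semantics concrete, here is a minimal sketch of such a filter in pandas. It is not the code inside `CorrCommunity`: the helper `filter_correlations`, the `SIGNALS` list, and the column names (taken from the signal names in the text) are assumptions for illustration, and the real CSV columns may be named differently.

```python
import pandas as pd

# Hypothetical column names, one per signal, in the same order as the
# threshold list. "max" keeps values <= threshold, "min" keeps values
# >= threshold, and "abs_min" compares the absolute value.
SIGNALS = [
    ("missing_ratio", "max"),
    ("missing_ratio_original", "max"),
    ("zero_ratio", "max"),
    ("zero_ratio_original", "max"),
    ("r", "abs_min"),
    ("samples", "min"),
]


def filter_correlations(corrs: pd.DataFrame, thresholds: list) -> pd.DataFrame:
    """Keep only the rows that satisfy every signal threshold."""
    mask = pd.Series(True, index=corrs.index)
    for (column, kind), threshold in zip(SIGNALS, thresholds):
        if kind == "max":
            mask &= corrs[column] <= threshold
        elif kind == "abs_min":
            mask &= corrs[column].abs() >= threshold
        else:  # "min"
            mask &= corrs[column] >= threshold
    return corrs[mask]


# Toy data (made up): three correlations with different signal values.
toy = pd.DataFrame({
    "missing_ratio": [0.1, 0.2, 0.0],
    "missing_ratio_original": [0.0, 0.5, 0.0],
    "zero_ratio": [0.3, 0.1, 0.2],
    "zero_ratio_original": [0.9, 0.2, 0.1],
    "r": [0.95, -0.7, 0.4],
    "samples": [100, 80, 200],
})
kept = filter_correlations(toy, [1.0, 1.0, 1.0, 0.8, 0.6, 70])
```

In the toy data, the first row fails the zero_ratio_original threshold and the third row fails the |r| threshold, so only the second row is kept.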

You can try different sets of thresholds as well!
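
When comparing threshold sets, it helps to know what the modularity score measures. The sketch below computes the standard Newman modularity of a given partition of an undirected, unweighted graph. It is only an illustration of the score: the function `modularity` and the toy graph are hypothetical, and this is not how Nexus detects communities or searches over thresholds.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, one per undirected edge.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra[community_of[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)    # one community per triangle
q_single = modularity(edges, single)  # everything in one community
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the sense in which a higher modularity indicates better-separated communities.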

In [None]:
signal_thresholds = [1.0, 1.0, 1.0, 0.8, 0.6, 70]
corr_community.get_correlation_communities_chicago(signal_thresholds)

# Examine Correlation Communities

We implement a simple interface for you to explore our correlation communities. Each community is composed of a group of variables. By default, the display is set to only show the tables where these variables are found. To view the specific variables within a community, simply click the "Show Variables" button.

Clicking the "Show Correlations" button reveals all the correlations within a community. Once they are displayed, you can apply any filters to the resulting dataframe.

FAQ:
1. Why do some communities display the exact same set of tables?

    The reason is that while the tables might be the same, the variables within these communities differ. We construct the correlation graph based on variables, and then present it in a table-view for clarity.
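
    As a small illustration, the sketch below collapses two made-up communities into their table views. The variable names and the "table.column" naming scheme are assumptions for this example only and are not taken from the Chicago data.

```python
# Hypothetical communities: each variable is written as "table.column".
communities = {
    0: ["crimes.count", "crimes.arrest_rate", "permits.total_fee"],
    1: ["crimes.domestic_ratio", "permits.count"],
}


def table_view(variables):
    """Collapse a community's variables into the sorted set of their tables."""
    return sorted({name.split(".", 1)[0] for name in variables})


views = {cid: table_view(variables) for cid, variables in communities.items()}
```

    Both communities collapse to the same pair of tables even though they share no variables.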


In [None]:
show_communities(corr_community, show_corr_in_same_tbl=False)

# Report interesting correlations

If you find any interesting correlations (you are free to propose your own criteria for "interesting"), simply record and report their indices.