Supplemental Figures for _Uncovering mental structure through data-driven ontology discovery_

In [None]:
import matplotlib.pyplot as plt
from os import path
import pandas as pd
import pickle
import seaborn as sns
pd.set_option('display.max_rows', 200)

from dimensional_structure import DA_plots
from dimensional_structure.EFA_plots import plot_heatmap_factors, plot_bar_factors, plot_factor_correlation
from dimensional_structure.HCA_plots import plot_subbranches, plot_results_dendrogram
from selfregulation.utils.utils import get_recent_dataset
from selfregulation.utils.result_utils import load_results

In [None]:
%matplotlib inline
dataset = get_recent_dataset()
results = load_results(datafile=dataset)

# Exploratory Factor Analysis Results

Below are the loading matrices for the exploratory factor analysis (EFA) solutions for surveys, tasks, and the outcome measures. These matrices are depicted as heatmaps, as well as dataframes with the actual values.

### Survey Exploratory Factor Analysis Loadings

Twelve factors were selected using the BIC criterion for exploratory factor analysis. The 66 survey DVs are grouped and ordered by the factor on which each DV has its largest (absolute) loading. Dotted lines separate the groups derived from this criterion and are used for visualization purposes only.
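
The grouping rule described above can be sketched with pandas. This is a minimal illustration, not the code used by `plot_heatmap_factors`: the loading values and DV names are hypothetical, and ordering DVs within each group by loading strength is an assumption.

```python
import pandas as pd

# Hypothetical loading matrix: rows are DVs, columns are factors.
loadings = pd.DataFrame(
    {
        "Factor1": [0.70, -0.10, 0.20, -0.65],
        "Factor2": [0.10, 0.80, -0.55, 0.05],
    },
    index=["dv_a", "dv_b", "dv_c", "dv_d"],
)

# Assign each DV to the factor with the largest absolute loading.
abs_loadings = loadings.abs()
assigned_factor = abs_loadings.idxmax(axis=1)
max_abs_loading = abs_loadings.max(axis=1)

# Order DVs by assigned factor (in column order), then by loading strength
# within each group (the within-group ordering is an assumption).
factor_position = {name: i for i, name in enumerate(loadings.columns)}
sort_keys = pd.DataFrame(
    {
        "factor_rank": assigned_factor.map(factor_position),
        "max_abs": max_abs_loading,
    }
)
order = sort_keys.sort_values(
    ["factor_rank", "max_abs"], ascending=[True, False]
).index
ordered_loadings = loadings.loc[order]
print(ordered_loadings)
```

Boundaries between consecutive values of `assigned_factor` in the reordered frame correspond to where the dotted lines would be drawn.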

In [None]:
survey_results = results['survey']
survey_factor_loading = survey_results.EFA.get_loading()

In [None]:
survey_c = survey_results.EFA.results['num_factors']
plot_heatmap_factors(survey_results, survey_c, thresh=0, size=12)

The full loading matrix, as a dataframe:

In [None]:
survey_factor_loading

### Task Exploratory Factor Analysis Loadings

Five factors were selected using the BIC criterion for exploratory factor analysis. The 130 task DVs are grouped and ordered by the factor on which each DV has its largest (absolute) loading. Dotted lines separate the groups derived from this criterion and are used for visualization purposes only.

In [None]:
task_results = results['task']
task_factor_loading = task_results.EFA.get_loading()

In [None]:
task_c = task_results.EFA.results['num_factors']
plot_heatmap_factors(task_results, task_c, thresh=0, size=13)

In [None]:
task_factor_loading

### Outcome Exploratory Factor Analysis Loadings

Nine factors were selected using the BIC criterion for exploratory factor analysis. The 55 target measures are grouped and ordered by the factor on which each target measure has its largest (absolute) loading. Dotted lines separate the groups derived from this criterion and are used for visualization purposes only.

In [None]:
outcome_factor_loading = task_results.DA.get_loading()

In [None]:
outcome_c = task_results.DA.results['num_factors']
DA_plots.plot_heatmap_factors(task_results, outcome_c, thresh=0, size=8, DA=True)

In [None]:
outcome_factor_loading

## Factor Robustness Analyses

Factor robustness was assessed in two ways:

**(1)** By dropping one measure at a time, recalculating the factor solution (with the same number of factors as the full sample), and correlating the new factor loadings with the original factor loadings. This correlation is calculated on all DVs except those removed along with the dropped measure. Tables of these correlations are shown with values below .9 highlighted in red.

**(2)** By using a bootstrap procedure (see fa.sapa), which creates confidence intervals for each loading. 95% confidence intervals are plotted as bar plots for each loading.
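
The drop-one comparison in (1) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `loading_correlations` is a hypothetical helper, the loading values are invented, and factors are assumed to be already matched by column name across the two solutions.

```python
import numpy as np
import pandas as pd


def loading_correlations(full_loadings, reduced_loadings):
    """Correlate each factor's loadings across two solutions on shared DVs.

    Assumes factors are already aligned by column name.
    """
    shared = full_loadings.index.intersection(reduced_loadings.index)
    full = full_loadings.loc[shared]
    reduced = reduced_loadings.loc[shared]
    return pd.Series(
        {
            factor: np.corrcoef(full[factor], reduced[factor])[0, 1]
            for factor in full.columns
        }
    )


# Hypothetical full solution over six DVs.
full = pd.DataFrame(
    {
        "Factor1": [0.80, 0.70, 0.10, -0.20, 0.05, 0.60],
        "Factor2": [0.10, -0.10, 0.90, 0.70, -0.60, 0.00],
    },
    index=["dv_0", "dv_1", "dv_2", "dv_3", "dv_4", "dv_5"],
)

# Hypothetical solution refit after dropping the measure that
# contributed dv_0 and dv_1; loadings shift slightly.
reduced = pd.DataFrame(
    {
        "Factor1": [0.12, -0.18, 0.04, 0.61],
        "Factor2": [0.88, 0.72, -0.61, 0.02],
    },
    index=["dv_2", "dv_3", "dv_4", "dv_5"],
)

corrs = loading_correlations(full, reduced)
flagged = corrs[corrs < 0.9]  # factors that would be highlighted in red
print(corrs)
```

Restricting both matrices to `shared` is what excludes the DVs that disappear with the dropped measure.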

In [None]:
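# Helper plotting function: summarizes the bootstrapped loadings
def plot_bootstrap_results(boot_stats):
    # boot_stats holds the bootstrap 'means' and 'sds' of each loading
    mean_loading = boot_stats['means']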
    std_loading = boot_stats['sds']
    coef_of_variation = std_loading/mean_loading
    f, (ax1, ax2) = plt.subplots(1,2, figsize=(12,6))
    ax1.plot(mean_loading.values.flatten(), coef_of_variation.values.flatten(), 'o')
    ax1.set_xlabel('Mean Loading')
    ax1.set_ylabel('Coefficient of Variation')
    ax1.set_ylim([-1,1])
    ax1.grid()

    ax2.plot(mean_loading.values.flatten(), std_loading.values.flatten(),'o')
    ax2.set_xlabel('Mean Loading')
    ax2.set_ylabel('Standard Deviation of Loading')

    plt.subplots_adjust(wspace=.3)

### Survey Robustness

#### Measurement Drop

Individual surveys sometimes had large effects on the factor structure, likely because of sparse measurement of highly discriminant psychological constructs.

In [None]:
f = path.join(results['survey'].get_output_dir(), 'EFAdrop_robustness.pkl')
with open(f, 'rb') as pkl_file:
    survey_EFA_robustness = pickle.load(pkl_file)

In [None]:
def color(val):
    # Highlight correlations below .9 in red
    text_color = 'red' if val < .9 else 'black'
    return 'color: %s' % text_color
survey_robustness_df = pd.DataFrame(survey_EFA_robustness).T
survey_robustness_df.style.applymap(color)

#### Bootstrap

The loadings are robust to the particulars of the sample: the standard deviations of the bootstrapped loadings are very small relative to the mean loadings.
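
The summary statistics behind this comparison can be sketched with NumPy. The bootstrap replicates here are simulated rather than refit factor solutions, and the percentile interval is an assumption; the procedure referenced above may construct its intervals differently.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" loadings: 3 DVs by 2 factors.
true_loadings = np.array([[0.7, 0.1], [0.2, -0.6], [0.5, 0.4]])

# Simulated bootstrap replicates with shape (n_boot, n_dvs, n_factors).
# In a real analysis each replicate is a factor solution refit on a
# resampled dataset.
boot = true_loadings + rng.normal(scale=0.03, size=(1000, 3, 2))

means = boot.mean(axis=0)
sds = boot.std(axis=0, ddof=1)

# Coefficient of variation: spread relative to the size of the loading.
cv = sds / means

# 95% percentile confidence interval for each loading.
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)

print(np.round(cv, 3))
```

The coefficient of variation becomes unstable for loadings near zero, which is why the helper above clips its y-axis to [-1, 1].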

In [None]:
plot_bootstrap_results(survey_results.EFA.get_boot_stats())

### Task Robustness

#### Measurement Drop

Task factors are more robust to dropping out particular measures, likely due to the greater overlap in the psychological constructs measured by individual tasks.

In [None]:
f = path.join(results['task'].get_output_dir(), 'EFAdrop_robustness.pkl')
with open(f, 'rb') as pkl_file:
    task_EFA_robustness = pickle.load(pkl_file)
task_robustness_df = pd.DataFrame(task_EFA_robustness).T
task_robustness_df.style.applymap(color)

#### Bootstrap

The loadings are robust to the particulars of the sample: the standard deviations of the bootstrapped loadings are very small relative to the mean loadings.

In [None]:
plot_bootstrap_results(task_results.EFA.get_boot_stats())

# Hierarchical Clustering

Hierarchical clustering was used to order dependent variables based on the similarity of their loading vectors. This resulted in a dendrogram, which was subset into clusters using the DynamicTreeCut algorithm. These clusters are separately plotted below, allowing the constituent DVs to be read.
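
The clustering step can be sketched with SciPy. This is a simplified stand-in: the loadings are hypothetical, the Euclidean metric and average linkage are assumptions, and the tree is cut into a fixed number of clusters with `fcluster` rather than with the DynamicTreeCut algorithm used in the analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import pdist

# Hypothetical loading vectors: 6 DVs by 3 factors.
loadings = np.array(
    [
        [0.80, 0.10, 0.00],
        [0.75, 0.05, 0.10],
        [0.10, 0.90, 0.05],
        [0.05, 0.85, 0.00],
        [0.00, 0.10, 0.70],
        [0.10, 0.00, 0.75],
    ]
)

# Pairwise distances between DV loading vectors (metric is an assumption).
distances = pdist(loadings, metric="euclidean")

# Agglomerative clustering; the linkage matrix encodes the dendrogram.
tree = linkage(distances, method="average")

# Leaf order of the dendrogram, used to order the DVs.
dv_order = leaves_list(tree)

# Fixed-number cut as a simple substitute for DynamicTreeCut.
labels = fcluster(tree, t=3, criterion="maxclust")

print(dv_order, labels)
```

DVs with similar loading vectors end up adjacent in `dv_order` and share a cluster label.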

### Survey Clusters

Below is the survey dendrogram (reproduced from the main manuscript), followed by the 13 clusters plotted separately. The third and fourth clusters, referenced in the main text, together reflect canonical components of "self-control".

In [None]:
_ = plot_results_dendrogram(survey_results, size=20, drop_list=[1,3,5,7, 9,11])

In [None]:
plot_subbranches(survey_results, size=6)

### Task Clusters

Below is the task dendrogram (reproduced from the main manuscript), followed by the 13 clusters plotted separately. The 8th and 9th clusters, referenced in the main text, divide two groups of "information processing" tasks.

In [None]:
_ = plot_results_dendrogram(task_results, size=20, drop_list=[1,3,5,7,9,11,13,15], double_drop_list=[2,6,10,14])

In [None]:
plot_subbranches(task_results, size=6)

## Cluster Robustness Analyses

### Survey Cluster Robustness

### Task Cluster Robustness