# Reconstructing out-of-sample DVs

Given a quantitative ontology, or psychological space, that DVs can be projected into, how can we determine the embedding of new variables?

Currently, our embedding is determined by factor analysis. Thus ontological embeddings are only known for the DVs entered into the original model. How could we extend this?

One possibility is measuring new variables in the same population that completed our original battery. After doing this we could either (1) run the model anew, or (2) use linear regression to map the already discovered factors onto the new variables. The former is better, but results in small changes to the actual factors with each new variable. The latter method ensures that our factors stay the same. Neither is scalable, however, as we do not, in general, have access to a constant population that can be remeasured whenever new measures come into the picture.
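
As a rough illustration of option (2), the sketch below regresses a new variable on existing factor scores. Everything here is synthetic and hypothetical (`factor_scores`, `true_loading`, and the noise level are made up for the example); it is a minimal sketch of the idea, not the code used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects, n_factors = 500, 3
# Hypothetical factor scores for subjects who completed the original battery
factor_scores = rng.standard_normal((n_subjects, n_factors))

# Hypothetical "true" loading of the new variable, used only to simulate data
true_loading = np.array([0.6, -0.3, 0.0])
new_dv = factor_scores @ true_loading + 0.5 * rng.standard_normal(n_subjects)

# Regress the new DV on the existing factor scores (no intercept, assuming
# centered data); the coefficients serve as its estimated embedding
estimated_loading, *_ = np.linalg.lstsq(factor_scores, new_dv, rcond=None)
```

Because the factors are held fixed, adding another variable later only requires one more regression of this kind.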

Another possibility that works with new populations requires that the new population completes the entire battery used to estimate the original factors, in addition to whatever new variables are of interest. Doing so allows the calculation of factor scores for this new population based on the original model, which can then be mapped to the new measures of interest. This allows researchers to capitalize on the original model (presumably fit on more subjects than the new study), while expanding the ontology. Problems exist here, however.
- The most obvious problem is that you have to measure the new sample on the entire battery used to fit the original EFA model. Given that this takes many hours (the exact number depending on whether tasks, surveys or both are used), this is exceedingly impractical. In our case we did have our new fMRI sample (or at least a subset of participants) take the entire battery, so this problem isn't as relevant.
- Problems still remain, however. If N is small, the estimate of the ontological embedding for new DVs is likely unstable.

This latter problem necessitates some quantitative exploration. This notebook simulates the issue by:
1. Removing a DV from the original ontology dataset
2. Performing EFA on this subset
3. Using linear regression to map these EFA factors to the left out variable

Step (3) is performed on smaller population sizes to reflect the reality of most studies (including ours) and is repeated to get a sense of the mapping's variability.
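
The steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: scikit-learn's `FactorAnalysis` stands in for the EFA model actually used, and the data, the `estimate_embedding` helper, and the noise level are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_subjects, n_vars, n_factors = 500, 12, 3

# Synthetic battery: latent factors times loadings plus noise, then z-scored
latent = rng.standard_normal((n_subjects, n_factors))
loadings = rng.uniform(-1, 1, size=(n_factors, n_vars))
data = latent @ loadings + 0.5 * rng.standard_normal((n_subjects, n_vars))
data = (data - data.mean(axis=0)) / data.std(axis=0)

held_out = 0  # column index of the DV to remove
remaining = np.delete(data, held_out, axis=1)

# Steps 1-2: fit the factor model without the held-out DV
fa = FactorAnalysis(n_components=n_factors, random_state=0)
scores = fa.fit_transform(remaining)

def estimate_embedding(idx):
    """Step 3: regress the held-out DV on the factor scores of subjects idx."""
    reg = LinearRegression(fit_intercept=False)
    reg.fit(scores[idx], data[idx, held_out])
    return reg.coef_

full_estimate = estimate_embedding(np.arange(n_subjects))

# Repeat step 3 on random subsamples to gauge the variability of the mapping
spread = {}
for pop_size in (50, 100, 400):
    reps = np.array([
        estimate_embedding(rng.choice(n_subjects, pop_size, replace=False))
        for _ in range(100)
    ])
    spread[pop_size] = reps.std(axis=0).mean()
```

Here `spread` holds, for each pseudo-population size, the average standard deviation of the estimated coefficients across repetitions, which is one simple way to quantify instability.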

### Small issues not currently addressed

- The EFA model is fit on the entire population. An even more stringent simulation would subset the subjects used in the "new study" and fit the EFA model on a completely independent group. I tried this once - the factor scores hardly differed. In addition, I want the EFA model to be as well-powered as possible, as that will be the reality for this method moving forward
- I am currently not holding out entire tasks, but only specific DVs

In [1]:
import argparse
import numpy as np
from os import makedirs, path
import pandas as pd
import pickle
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from dimensional_structure.reconstruction_utils import (get_reconstruction_results, 
                                                        k_nearest_reconstruction, 
                                                        linear_weighted_reconstruction)
from selfregulation.utils.plot_utils import format_num
from selfregulation.utils.result_utils import load_results
from selfregulation.utils.utils import get_recent_dataset, get_info, get_retest_data


In [2]:
# argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-pop_sizes', nargs='+', default=[50, 100, 400], type=int)
    parser.add_argument('-n_reps', default=100, type=int)
    parser.add_argument('-n_measures', default=None, type=int)
    parser.add_argument('-dataset', default=None)
    parser.add_argument('-rerun', action='store_true')
    args, _ = parser.parse_known_args()
    pop_sizes = args.pop_sizes
    n_reps = args.n_reps
    n_measures = args.n_measures
    rerun = args.rerun
    if args.dataset is not None:
        dataset = args.dataset
    else:
        dataset = get_recent_dataset()

In [3]:
# additional setup
np.random.seed(12412)
results = load_results(dataset)['task']
retest_data = get_retest_data(dataset.replace('Complete', 'Retest'))
c = results.EFA.results['num_factors']

classifiers = {'Ridge': Ridge(fit_intercept=False),
               'LR': LinearRegression(fit_intercept=False)}
# get output dir to store results
output_dir = path.join(get_info('results_directory'),
                       'ontology_reconstruction', results.ID)
makedirs(output_dir, exist_ok=True)


In [None]:
# get a random subset of measures to perform the calculation on if n_measures is set
measures = np.unique(['^'+i.split('.')[0] for i in results.data.columns])
if n_measures is not None:
    measure_list = np.random.choice(measures, n_measures, replace=False)
else:
    measure_list = measures
# get all variables from selected tasks
var_list = results.data.filter(regex='|'.join(measure_list)).columns

Run simulation for every variable at different population sizes. 

That is, do the following:

1. Take a variable (say, Stroop incongruent-congruent RT) and remove it from the data matrix.
2. Run EFA on the resulting 522 (subjects) x N-1 (variables) data matrix.
3. Calculate factor scores for all 522 subjects.
4. Select a subset of `pop_size` subjects to do an "ontological mapping". That is, pretend that these subjects did the whole battery (missing the one variable) *and then* completed one more task. The idea is to map from those subjects' factor scores to the new variable.
   1. We can do a linear mapping (regression) from the ontological scores to the output variable.
   2. We can do a k-nearest neighbor interpolation, where we say the unknown ontological embedding is a blend of the "nearest" variables in the dataset.
5. Repeat (4) a number of times to get a sense of the accuracy and variability of that mapping.
6. Compare the estimated ontological scores for the held-out variable (Stroop incongruent-congruent RT) to the original "correct" ontological mapping (the one that would have been obtained if the variable had been included in the original data matrix).
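
To make the k-nearest neighbor idea in step 4 concrete, here is one possible version on synthetic data. I am assuming that "nearest" means most correlated within the subsample and that the blend is a correlation-weighted average of known loadings; the actual `k_nearest_reconstruction` function may differ, and `knn_reconstruct` below is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_vars, n_factors = 500, 30, 3

# Synthetic data with a known loading matrix (variables x factors)
loadings = rng.uniform(-1, 1, size=(n_vars, n_factors))
latent = rng.standard_normal((n_subjects, n_factors))
data = latent @ loadings.T + 0.5 * rng.standard_normal((n_subjects, n_vars))

held_out = 0                  # the "new" variable whose loading is unknown
known = np.arange(1, n_vars)  # variables whose loadings are known

def knn_reconstruct(idx, k=5):
    """Blend the loadings of the k variables most correlated with the held-out DV."""
    corr = np.corrcoef(data[idx].T)[held_out, known]
    nearest = np.argsort(-np.abs(corr))[:k]
    weights = corr[nearest]  # signed, so anticorrelated neighbors are flipped
    return weights @ loadings[known[nearest]] / np.abs(weights).sum()

# Step 4: estimate the loading from a small pseudo-population
subsample = rng.choice(n_subjects, 100, replace=False)
estimate = knn_reconstruct(subsample)

# Step 6: compare the estimate with the "correct" loading
corr_score = np.corrcoef(estimate, loadings[held_out])[0, 1]
```

Repeating the last three lines over many subsamples (step 5) would give a distribution of `corr_score` values for each population size.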

## Perform reconstruction

### K Nearest Neighbor Reconstruction

In [None]:
k_list = list(range(1,20))
filename = path.join(output_dir, 'k_reconstruct.pkl')
if not path.exists(filename) or rerun:
    k_reconstruction=get_reconstruction_results(results, measure_list, pop_sizes, 
                                                n_reps=n_reps, 
                                                recon_fun=k_nearest_reconstruction, 
                                                k_list=k_list)
    k_reconstruction.to_pickle(filename)
else:
    k_reconstruction = pd.read_pickle(filename)

*******************************************************************************
Starting K Nearest reconstruction, measures: ['adaptive_n_back.hddm_drift', 'adaptive_n_back.hddm_drift_load', 'adaptive_n_back.hddm_non_decision', 'adaptive_n_back.hddm_thresh', 'adaptive_n_back.mean_load']
*******************************************************************************
Starting full reconstruction
Starting partial reconstruction, pop size: 50
Rep 0
*******************************************************************************
Starting K Nearest reconstruction, measures: ['angling_risk_task_always_sunny.keep_adjusted_clicks', 'angling_risk_task_always_sunny.keep_coef_of_variation', 'angling_risk_task_always_sunny.release_adjusted_clicks', 'angling_risk_task_always_sunny.release_coef_of_variation']
*******************************************************************************
Starting full reconstruction
Starting partial reconstruction, pop size: 50
Rep 0
**********************************

In [None]:
summary = k_reconstruction.query('label=="partial_reconstruct"') \
                .groupby(['pop_size', 'k', 'weighted'])['corr_score'].agg(['mean', 'std'])

In [None]:
# summarize further
k_best_params = {}
for pop_size in pop_sizes:
    tmp=summary.query('pop_size == %s' % pop_size)
    best_params = tmp.loc[:,'mean'].idxmax()
    best_val = tmp.loc[best_params,'mean']
    k_best_params[pop_size] = {'k': best_params[1], 
                               'weighted': bool(best_params[2]),
                               'best_val': best_val}

In [None]:
k_best_reconstruction = pd.DataFrame()
for k, v in k_best_params.items():
    tmp = k_reconstruction.query('pop_size == %s and \
                                 k == %s and \
                                 weighted == %s' % (k, v['k'], v['weighted']))
    k_best_reconstruction = pd.concat([k_best_reconstruction, tmp], axis=0)
k_best_reconstruction.groupby('pop_size')['corr_score'].agg(['mean','std'])

### Linear Weighted Reconstruction

In [None]:
weight_functions = [('Lasso', Lasso(fit_intercept=False)),
                    ('Ridge', Ridge(fit_intercept=False))]
weighted_reconstructions = {}
for name, clf in weight_functions:
    filename = path.join(output_dir, '%s_weighted_reconstruct.pkl' % name)
    if not path.exists(filename) or rerun:
        tmp_reconstruction=get_reconstruction_results(results, measure_list, pop_sizes, 
                                                      n_reps=n_reps, 
                                                      recon_fun=linear_weighted_reconstruction,
                                                      clf=clf)
        tmp_reconstruction.to_pickle(filename)
    else:
        tmp_reconstruction = pd.read_pickle(filename)
    weighted_reconstructions[name] = tmp_reconstruction

### Linear Reconstruction

In [None]:
linear_reconstructions = {}
for name, clf in classifiers.items():
    for robust in [True, False]:
        out=get_reconstruction_results(results, measure_list, pop_sizes, n_reps=n_reps, clf=clf, robust=robust)
        filename = path.join(output_dir, 'linearreconstruct_clf-%s_robust-%s.pkl' % (name, str(robust)))
        out.to_pickle(filename)
        robust_str = '_robust' if robust else ''
        linear_reconstructions['%s%s' % (name, robust_str)] = out

In [None]:
# summarize further
for name, reconstruction_df in linear_reconstructions.items():
    print(name)
    print(reconstruction_df.query('label=="partial_reconstruct"') \
            .groupby('pop_size')[['score-MSE', 'score-corr']].agg(['mean', 'std']))
    print('')

## Visualization

Of concern are the average correspondence between the estimated ontological fingerprint of a DV and its "ground truth" (the original estimate when it was part of the EFA model), and the variability of that correspondence.

One way to look at this is the average reconstruction score (e.g., the correlation score) and the variability of the reconstruction score as a function of pseudo-population size and model parameters.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import MDS, TSNE
from scipy.spatial.distance import pdist, squareform

### K Nearest Visualization

#### Average Performance by Model Parameters

In [None]:
sns.set_context('talk')
n_cols = 1
n_rows = len(pop_sizes)//n_cols
f, axes = plt.subplots(n_rows, n_cols, figsize=(12,n_rows*6))
axes = f.get_axes()
legend_on=True
for ax, pop_size in zip(axes, pop_sizes):
    sns.boxplot(x='k', y='corr_score', hue='weighted', 
                data=k_reconstruction.query('pop_size==%s' % pop_size),
                ax=ax)
    ax.set_title('Simulated Population Size: %s' % pop_size)
    ax.set_ylim(-1.1,1.1)
    ax.legend().set_visible(legend_on)
    legend_on=False
plt.subplots_adjust(hspace=.4)

#### Performance for each DV by reliability

This uses only the best parameters from the k-nearest neighbor algorithm.

In [None]:
retest_index = [i.replace('.logTr','').replace('.ReflogTr','') for i in k_reconstruction['var'].unique()]
retest_vals = retest_data.loc[retest_index,'icc']
sns.set_context('talk')
f, axes = plt.subplots(1,2,figsize=(14,6))
colors = sns.color_palette(n_colors = len(pop_sizes))
for i, pop_size in enumerate(pop_sizes):
    reconstruction = k_best_reconstruction.query('pop_size == %s' % pop_size) \
                                     .groupby('var')['corr_score'].agg(['mean','std'])
    sns.regplot(x=retest_vals, y=reconstruction['mean'], marker='o', label=pop_size, ax=axes[0], color=colors[i])
    sns.regplot(x=retest_vals, y=reconstruction['std'], marker='o', label=pop_size, ax=axes[1], color=colors[i])
axes[1].legend()
plt.subplots_adjust(wspace=.3)

We can dive in and look at one variable each of high, medium, and low reliability to see the reconstruction performance.

In [None]:
sorted_retest_vals = retest_vals.sort_values().index
N = len(sorted_retest_vals)
high_var = sorted_retest_vals[N-1]
med_var = sorted_retest_vals[N//2]
low_var = sorted_retest_vals[0]

In [None]:
f, axes = plt.subplots(1,3, figsize=(20,8))
for ax, var in zip(axes, [high_var, med_var, low_var]):
    retest_in = var.replace('.logTr','').replace('.ReflogTr','')
    reliability = format_num(retest_data.loc[retest_in]['icc'])
    plot_df = k_best_reconstruction.query('var == "%s" and label=="partial_reconstruct"' % var)
    sns.boxplot(x='pop_size', y='corr_score', data=plot_df,  ax=ax)
    ax.set_title('%s\nICC: %s' % (var, reliability))
plt.subplots_adjust(wspace=.6)

In a more complicated view, we can visualize this with an MDS plot of:
1. The original DVs
2. The "best" reconstruction using all the data
3. The n_reps simulated estimates with a smaller population size
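
A small self-contained sketch of this kind of plot input is shown below, using made-up loading vectors rather than the real reconstructions. The names `true_loadings` and `estimates` are hypothetical; the point is only to show correlation distances being passed to MDS as a precomputed matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_factors, n_reps = 5, 20

# Hypothetical "true" loading vectors for three DVs
true_loadings = rng.standard_normal((3, n_factors))
# Simulated reconstructions: n_reps noisy estimates scattered around each truth
estimates = np.repeat(true_loadings, n_reps, axis=0)
estimates = estimates + 0.2 * rng.standard_normal(estimates.shape)

points = np.vstack([true_loadings, estimates])
# Correlation distance (1 - r) between every pair of loading vectors
distances = squareform(pdist(points, metric='correlation'))

# Embed in 2D; 'precomputed' tells MDS the input is already a distance matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(distances)
```

In the resulting `embedding`, tight clusters of estimates around their true point would indicate a stable reconstruction, while diffuse clouds would indicate an unstable one.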

In [None]:
plot_df = pd.concat([k_best_reconstruction,
                    k_reconstruction.query('label=="true"')]).reset_index(drop=True)

In [None]:
# MDS
mds_reduced = []
tsne_reduced = []
for pop_size in pop_sizes:
    mds = MDS(2, dissimilarity='precomputed')
    tsne = TSNE(2, metric='precomputed', init='random')  # PCA init is not valid for precomputed distances
    subset = plot_df.query('label=="true" or pop_size == %s'% pop_size)
    reconstructions = subset.iloc[:, :c]
    distances = squareform(pdist(reconstructions, metric='correlation'))
    mds_reduced.append(mds.fit_transform(distances)) # taking too long
    tsne_reduced.append(tsne.fit_transform(1-reconstructions.T.corr()))

In [None]:
tmp_subset = plot_df.query('label=="true" or pop_size == %s'% pop_sizes[-1]).reset_index(drop=True)
colored_vars = np.random.choice(var_list, size=10, replace=False)
base_colors = sns.color_palette(palette='Paired', n_colors=len(colored_vars))
color_map = {k:v for k,v in zip(colored_vars, base_colors)}
colored_indices = tmp_subset[tmp_subset['var'].isin(colored_vars)].index
color_list = list(tmp_subset.loc[colored_indices,'var'].apply(lambda x: color_map[x]))
colored_sizes = [200 if x=='true' else 75 for x in tmp_subset.loc[colored_indices,'label']]
uncolored_indices = list(set(tmp_subset.index) - set(colored_indices))

In [None]:
N_pop = len(pop_sizes)
f,axes = plt.subplots(N_pop,1,figsize=(12,12*N_pop))
for ax, reduced, pop_size in zip(axes, tsne_reduced, pop_sizes):
    ax.scatter(reduced[uncolored_indices,0], reduced[uncolored_indices,1], s=10, color=[.5,.5,.5])
    ax.scatter(reduced[colored_indices,0], reduced[colored_indices,1], s=colored_sizes,
               c=color_list, edgecolor='black', linewidth=2)
    ax.set_title('Pseudo-Population Size: %s' % pop_size)