# Reconstructing out-of-sample DVs

Given a quantitative ontology, or psychological space, that DVs can be projected into, how can we determine the embedding of new variables?

Currently, our embedding is determined by factor analysis. Thus ontological embeddings are only known for the DVs entered into the original model. How could we extend this?

One possibility is measuring new variables in the same population that completed our original battery. After doing this we could either (1) run the model anew, or (2) use linear regression to map the already discovered factors onto the new variables. The former is better, but results in small changes to the actual factors with each new variable. The latter method ensures that our factors stay the same. Neither is scalable, however, as we do not, in general, have access to a constant population that can be remeasured whenever new measures come into the picture.
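Option (2) can be sketched as an ordinary least-squares fit of the new variable on the existing factor scores. This is a minimal illustration on synthetic data, not the pipeline's own code: `factor_scores`, `true_loadings`, and `new_dv` are hypothetical stand-ins for the EFA scores and a newly measured DV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: factor scores for 500 subjects on 5 factors,
# and a new DV generated from known loadings plus noise.
n_subjects, n_factors = 500, 5
factor_scores = rng.standard_normal((n_subjects, n_factors))
true_loadings = np.array([0.6, 0.0, -0.3, 0.0, 0.2])
new_dv = factor_scores @ true_loadings + 0.5 * rng.standard_normal(n_subjects)

# Ordinary least squares with an intercept; the factors themselves are
# left untouched, only the mapping onto the new DV is estimated.
design = np.column_stack([np.ones(n_subjects), factor_scores])
coefs, *_ = np.linalg.lstsq(design, new_dv, rcond=None)
estimated_loadings = coefs[1:]  # drop the intercept
```

The regression coefficients play the role of the new DV's embedding in the factor space.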

Another possibility that works with new populations requires that the new population completes the entire battery used to estimate the original factors, in addition to whatever new variables are of interest. Doing so allows the calculation of factor scores for this new population based on the original model, which can then be mapped to the new measures of interest. This allows researchers to capitalize on the original model (presumably fit on more subjects than the new study), while expanding the ontology. Problems exist here, however.
- The most obvious problem is that you have to measure the new sample on the entire battery used to fit the original EFA model. Given that this takes many hours (the exact number depending on whether tasks, surveys or both are used), this is exceedingly impractical. In our case, we did have our new fMRI sample (or at least a subset of participants) complete the entire battery, so this problem is less relevant here.
- Problems still remain. If N is small, the estimate of the ontological embedding for new DVs is likely unstable.
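To make "factor scores for the new population based on the original model" concrete, here is a small sketch using the regression (Thurstone) scoring method with orthogonal factors. That choice is an assumption for illustration; the scoring method used by the actual EFA pipeline may differ. The loadings are taken as known here, whereas in practice they would come from the fitted model, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "original model": 12 DVs loading on 3 orthogonal factors.
n_dvs, n_factors = 12, 3
loadings = np.zeros((n_dvs, n_factors))
for f in range(n_factors):
    loadings[4 * f:4 * (f + 1), f] = 0.8
uniqueness_sd = np.sqrt(1.0 - (loadings ** 2).sum(axis=1))

def simulate(n_subjects):
    """Draw latent factor scores and the observed battery they generate."""
    latent = rng.standard_normal((n_subjects, n_factors))
    noise = rng.standard_normal((n_subjects, n_dvs)) * uniqueness_sd
    return latent, latent @ loadings.T + noise

_, original_battery = simulate(1000)    # large original sample
new_latent, new_battery = simulate(60)  # small new sample

# Scoring weights come from the original sample only: W = R^-1 @ loadings
orig_mean = original_battery.mean(axis=0)
orig_sd = original_battery.std(axis=0, ddof=1)
corr_matrix = np.corrcoef(original_battery, rowvar=False)
weights = np.linalg.solve(corr_matrix, loadings)

# Standardize the new sample with original-sample statistics, then score it
new_scores = ((new_battery - orig_mean) / orig_sd) @ weights

# How well do the estimated scores track the simulated latent scores?
recovery = [np.corrcoef(new_scores[:, f], new_latent[:, f])[0, 1]
            for f in range(n_factors)]
```

Because the weights depend only on the original sample, the new sample contributes nothing to the factor definitions; it only needs to have completed the same battery.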

This latter problem necessitates some quantitative exploration. This notebook simulates the issue by:
1. Removing a DV from the original ontology dataset
2. Performing EFA on this subset
3. Using linear regression to map these EFA factors to the left-out variable

Step 3 is performed on smaller population sizes to reflect the reality of most studies (including ours), and is repeated to get a sense of the mapping's variability.
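The repeated small-sample regression in step 3 can be sketched as follows. This is an illustration under stated assumptions rather than what `assess_var_reconstruction` does internally: the factor scores are synthetic stand-ins for the output of steps 1 and 2, subsamples are drawn without replacement, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for EFA factor scores and the held-out DV
n_subjects, n_factors = 500, 5
factor_scores = rng.standard_normal((n_subjects, n_factors))
held_out_dv = (factor_scores @ np.array([0.6, 0.0, -0.3, 0.0, 0.2])
               + 0.5 * rng.standard_normal(n_subjects))

def fit_loadings(scores, dv):
    """Regress the DV on the factor scores; return the slopes."""
    design = np.column_stack([np.ones(len(dv)), scores])
    coefs, *_ = np.linalg.lstsq(design, dv, rcond=None)
    return coefs[1:]

def subsample_loadings(scores, dv, pop_size, n_reps, rng):
    """Refit the regression on n_reps random subsamples of size pop_size."""
    estimates = np.empty((n_reps, scores.shape[1]))
    for rep in range(n_reps):
        idx = rng.choice(len(dv), size=pop_size, replace=False)
        estimates[rep] = fit_loadings(scores[idx], dv[idx])
    return estimates

# Average standard deviation of the estimated loadings per population size
spread = {}
for pop_size in (50, 100, 200, 400):
    estimates = subsample_loadings(factor_scores, held_out_dv,
                                   pop_size, n_reps=100, rng=rng)
    spread[pop_size] = estimates.std(axis=0).mean()
```

The `spread` dictionary gives a rough sense of how much the estimated embedding moves around at each pseudo-population size.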

### Small issues not currently addressed

- The EFA model is fit on the entire population. An even more stringent simulation would subset the subjects used in the "new study" and fit the EFA model on a completely independent group. I tried this once; the factor scores hardly differed. In addition, I want the EFA model to be as well-powered as possible, as that will be the reality for this method moving forward.
- I am currently not holding out entire tasks, but only specific DVs

In [None]:
import numpy as np
import pandas as pd

from dimensional_structure.prediction_utils import assess_var_reconstruction
from selfregulation.utils.result_utils import load_results
from selfregulation.utils.utils import get_recent_dataset

In [None]:
results = load_results(get_recent_dataset())['task']
c = results.EFA.results['num_factors']
n_reps = 100
n_vars = 30

Run the simulation for a random subset of `n_vars` variables at each pseudo-population size

In [None]:
reconstruction_results = {}
pop_sizes = [50, 100, 200, 400]
var_list = np.random.choice(results.data.columns, n_vars, replace=False)
for pop_size in pop_sizes:
    var_out = {}
    for var in var_list:
        var_out[var] = assess_var_reconstruction(results, var,
                                                 pseudo_pop_size=pop_size,
                                                 n_reps=n_reps)
    reconstruction_results[pop_size] = var_out

In [None]:
reconstruction_df = pd.DataFrame()
for pop_size, out in reconstruction_results.items():
    for k, v in out.items():
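        # Judging by the labels assigned below: v[0] holds the true loadings,
        # v[1] the partial reconstructions (one row per rep), and
        # v[2] the reconstruction based on the full sample
        reps = v[1].shape[0]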
        combined = pd.concat([v[0], v[2], v[1].T], axis=1).T
        combined.reset_index(drop=True, inplace=True)
        combined.insert(combined.shape[1], 
                        'score-corr', 
                        combined.T.corr().iloc[1:,0])
        combined.insert(combined.shape[1],
                       'score-MSE',
                       ((v[1]-v[0])**2).mean(1))
        label = ['true'] + ['full_reconstruct'] + ['partial_reconstruct'] * reps
        combined.loc[:, 'label'] = label
        combined.loc[:, 'var'] = k
        combined.loc[:, 'pop_size'] = pop_size
        reconstruction_df = pd.concat([reconstruction_df, combined])

In [None]:
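# Select the numeric score columns before aggregating so that string
# columns such as 'label' are never passed to mean/std
reconstruction_df.query('label=="partial_reconstruct"') \
    .groupby(['pop_size', 'var'])[['score-MSE', 'score-corr']].agg(['mean', 'std'])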

## Visualization

Of concern is the average correspondence and variability between the estimated ontological fingerprint of a DV and its "ground truth" (the original estimate when it was part of the EFA model).

One way to look at this is the average reconstruction score (e.g., the correlation between reconstructed and true loadings, `score-corr`) and its variability as a function of pseudo-population size.
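As a small worked illustration of the two reconstruction scores, the sketch below compares noisy reconstructions against a known loading vector. The data are synthetic and the helper names (`score_corr`, `score_mse`) are hypothetical; they mirror the `score-corr` and `score-MSE` columns only in spirit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "ground-truth" loadings and 100 noisy reconstructions of them
true_loadings = np.array([0.6, 0.0, -0.3, 0.0, 0.2])
reconstructions = true_loadings + 0.1 * rng.standard_normal((100, 5))

def score_corr(estimate, truth):
    """Pearson correlation between an estimated and a true loading profile."""
    return np.corrcoef(estimate, truth)[0, 1]

def score_mse(estimate, truth):
    """Mean squared error between an estimated and a true loading profile."""
    return np.mean((estimate - truth) ** 2)

corrs = np.array([score_corr(r, true_loadings) for r in reconstructions])
mses = np.array([score_mse(r, true_loadings) for r in reconstructions])

# Average score and its variability across repetitions
summary = {
    'corr_mean': corrs.mean(), 'corr_std': corrs.std(ddof=1),
    'mse_mean': mses.mean(), 'mse_std': mses.std(ddof=1),
}
```

The correlation is insensitive to overall scaling of the loadings while the MSE is not, so the two scores can disagree about the same reconstruction.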

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import MDS, TSNE
from scipy.spatial.distance import pdist, squareform

In [None]:
plot_df = reconstruction_df.query('label=="partial_reconstruct"') \
            .groupby(['pop_size', 'var'])[['score-corr']] \
            .agg(['mean', 'std']).reset_index()
plot_df.columns = [' '.join(col).strip() for col in plot_df.columns.values] # flatten hierarchical columns
plot_df.head()

In [None]:
plt.figure(figsize=(12,8))
sns.swarmplot(x='pop_size', y='score-corr mean', data=plot_df, size=7)

A more involved approach is to visualize this with a 2D embedding (MDS or, as below, t-SNE) that plots:
1. The original DVs
2. The "best" reconstruction using all the data
3. The n_reps simulated estimates with a smaller population size
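Both MDS and t-SNE can take a precomputed dissimilarity matrix. A minimal sketch of the correlation distance (1 minus Pearson r) between loading profiles, using random stand-in data, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loading profiles: 6 rows (DVs or reconstructions) x 5 factors
profiles = rng.standard_normal((6, 5))

# Correlation distance between every pair of rows
distances = 1.0 - np.corrcoef(profiles)

# Clean up floating-point noise so the matrix is a valid precomputed input:
# exactly zero on the diagonal and exactly symmetric
np.fill_diagonal(distances, 0.0)
distances = (distances + distances.T) / 2
```

Profiles with the same shape across factors end up close together regardless of their overall magnitude, which is why correlation distance suits comparisons of ontological fingerprints.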

In [None]:
reconstruction_df.sort_values(by='label', inplace=True)

In [None]:
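# 2D embeddings of the reconstructions (MDS is disabled because it is too slow)
mds_reduced = []
tsne_reduced = []
for pop_size in pop_sizes:
    mds = MDS(2, dissimilarity='precomputed')
    # init='random' is required by recent scikit-learn when metric='precomputed'
    tsne = TSNE(2, metric='precomputed', init='random')
    subset = reconstruction_df.query('pop_size == %s' % pop_size)
    reconstructions = subset.iloc[:, :c]
    # correlation distance (1 - Pearson r) between every pair of loading profiles
    distances = squareform(pdist(reconstructions, metric='correlation'))
    #mds_reduced.append(mds.fit_transform(distances)) # taking too long
    tsne_reduced.append(tsne.fit_transform(distances))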

In [None]:
tmp_subset = reconstruction_df.query('pop_size == %s'% pop_sizes[-1])
base_colors = sns.color_palette(n_colors=len(var_list))
color_map = {var_list[i]:base_colors[i] for i in range(len(var_list))}
color_list = list(tmp_subset.loc[:,'var'].apply(lambda x: color_map[x]))
edge_colors = [color_list[i] if x=='partial_reconstruct' else [1,1,1] for i,x in enumerate(tmp_subset.label)]
size_list = [30 if x=='partial_reconstruct' else 200 for x in tmp_subset.label]

In [None]:
N_pop = len(pop_sizes)
f,axes = plt.subplots(N_pop,1,figsize=(8,8*N_pop))
for ax, reduced, pop_size in zip(axes, tsne_reduced, pop_sizes):
    ax.scatter(reduced[:,0], reduced[:,1], c=color_list, s=size_list, edgecolors=edge_colors, linewidth=2)
    ax.set_title('Pseudo-Population Size: %s' % pop_size)