# Imputation of missing data using an SGD regression approach

In [None]:
# Library imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 
import random as rd
import seaborn as sns

# Read in the competition data
sample = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/sample_submission.csv')
data = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/data.csv', index_col='row_id')

TTESTS = True
# Take a sample of the data just for notebook development purposes:
#MAXROWS = 100000
MAXROWS = data.shape[0]
#TCOLS = [f'F_1_{x}' for x in range(3)] + \
#    [f'F_2_{x}' for x in range(4)] + \
#    [f'F_3_{x}' for x in range(5)] + \
#    [f'F_4_{x}' for x in range(6)]
TCOLS = data.columns
data = data.head(MAXROWS)
data = data[TCOLS]

# Introduction 

This month's TPS is slightly different from previous TPS competitions in that there is no target variable. Instead, we are asked to impute (infill) missing data. I tried mean (and median) infilling, `IterativeImputer`, and a clustering idea, which didn't work as well as I had hoped. My next approach to this problem is to treat each column in turn as a target (response) variable, fit a model to the rest of the columns, and use this model to impute the missing values. The main difficulty with this methodology is that we cannot select an appropriate model directly, as there is no training data with known answers. As such, a training dataset first has to be created from the complete rows of the provided data. The overall plan is:

* Identify the missingness mechanism
* Create a training dataset
* Fit models on each column of the training dataset
* Use model parameters for each column to impute data 

## Missingness mechanism

It is generally accepted that there are three different types of missingness:

* Missing Completely at Random (MCAR): the occurrence of missing values is basically a random process. In particular, missingness in a particular column doesn't depend on values in other columns or in the column itself.
* Missing at Random (MAR): the occurrence of missing values depends on observed values in other columns, but not on the (unseen) values in the column itself.
* Missing Not at Random (MNAR): the occurrence of missing values depends on (unseen) values in the column itself.
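
To make the three definitions concrete, here is a minimal simulation sketch on toy data. The two-column frame `toy`, the `sigmoid` helper, and the particular missingness probabilities are all assumptions made for illustration; none of them come from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
# Hypothetical toy data: two independent standard-normal columns
toy = pd.DataFrame({"x": rng.normal(size=n), "y": rng.normal(size=n)})

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every value of y has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2
# MAR: the chance that y is missing depends on the observed column x
mar_mask = rng.random(n) < sigmoid(2.0 * toy["x"].to_numpy() - 1.5)
# MNAR: the chance that y is missing depends on the (unseen) value of y itself
mnar_mask = rng.random(n) < sigmoid(2.0 * toy["y"].to_numpy() - 1.5)

masks = {"MCAR": mcar_mask, "MAR": mar_mask, "MNAR": mnar_mask}
summary = pd.DataFrame({
    "mean_x_when_y_missing": {k: toy.loc[m, "x"].mean() for k, m in masks.items()},
    "mean_y_when_y_missing": {k: toy.loc[m, "y"].mean() for k, m in masks.items()},
})
print(summary.round(2))
```

Under MAR the shift shows up in the observed column `x`, so it can be detected from the data. Under MNAR the shift is in `y` itself, which is exactly the part we would not get to see in practice.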

Identifying the type of missingness is important as it guides the choice of imputation method. Unfortunately, it is not easy to determine the type of missingness. In particular, it can be hard, if not impossible, to identify MNAR, as we would need to know the values of the missing data (or at least have some meta-understanding of why data is missing). Differentiating between MCAR and MAR is a bit easier, but it is like identifying outliers: different statistical tests give us evidence, but there is generally no conclusive way to establish whether data is MAR or MCAR.

# Creating a dataset to test ideas

## Feature engineering

Here I just add summary columns giving the row-wise standard deviation of the values within each `F_x` group of variables. The `F_4` variables also do better after a power transform that makes them more nearly normal.

In [None]:

F1cols = [x for x in data.columns if 'F_1' in x]
F2cols = [x for x in data.columns if 'F_2' in x]
F3cols = [x for x in data.columns if 'F_3' in x]
F4cols = [x for x in data.columns if 'F_4' in x]
data['F1_sd'] = data.loc[:,F1cols].std(axis=1)
data['F2_sd'] = data.loc[:,F2cols].std(axis=1)
data['F3_sd'] = data.loc[:,F3cols].std(axis=1)
data['F4_sd'] = data.loc[:,F4cols].std(axis=1)

from sklearn.preprocessing import PowerTransformer

yjpt = PowerTransformer(method='yeo-johnson', standardize=False)
data[F4cols] = yjpt.fit_transform(data[F4cols])



## Identifying the missingness mechanism (MAR vs. MCAR)


Normally we have a training dataset provided, which can be split using one of the [scikit-learn cross-validation splitters](https://scikit-learn.org/stable/modules/cross_validation.html). Unfortunately we weren't given a training dataset this month, so let's make one from the subset of complete rows. First, we need to work out what kind of missingness we have.

As discussed above, there are generally three kinds of missingness considered: MNAR, MAR and MCAR. It is difficult to identify MNAR, but here we consider whether the data is MAR or MCAR. Two tests are applied: relating the occurrence of missingness to values in other columns, and relating missingness to the occurrence of missingness in other columns.

### Is missing data related to values in other columns?

Iterating through each column with missing data, we can check if missingness is related to values in all the other columns in turn by performing a t-test for the difference in means of the variable for missing data and non-missing data in the response column. 



In [None]:
%%time 
if TTESTS:

    from statsmodels.stats.weightstats import ttest_ind
    results = pd.DataFrame(columns = ['missing_column','regressor_column','p_value'])
    results_missing_indicator = pd.DataFrame(columns = ['missing_column','regressor_column','p_value'])
    row = 0
    for j, y_column in enumerate(data.columns):
        #print('-----')
        #print(f'Column {y_column}')
        dlc = data.loc[:,y_column].isna().value_counts()
        #print(dlc)
        if len(dlc) == 1:
            #print(f'Column {y_column} has no (or all) missing data, skipping')
            for x_column in data.columns:
                results.loc[row] = [y_column, x_column, 1]
                results_missing_indicator.loc[row] = [y_column, x_column, 1]
                row +=1 

            continue
        Bvals = [x for x in data.groupby(data.loc[:,y_column].isna())[y_column].groups.keys()]
        x0=[list(x) for x in data.groupby(data.loc[:,y_column].isna())[y_column].groups.values()]

        for i, x_column in enumerate(data.columns):
            if data[x_column].isna().sum() == 0:
                results.loc[row] = [y_column, x_column, 1]
                results_missing_indicator.loc[row] = [y_column, x_column, 1]
                row +=1 
                continue
            if x_column == y_column:
                continue
            ttest_res = ttest_ind(data.loc[x0[0], x_column].dropna(), data.loc[x0[1], x_column].dropna())
            results.loc[row] = [y_column, x_column, ttest_res[1]]
            ttest_res_missing = ttest_ind(data.loc[x0[0], x_column].isna(), data.loc[x0[1], x_column].isna())
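            # `x == np.nan` is always False, so np.isnan is needed to catch an undefined p-value
            if np.isnan(ttest_res_missing[1]):
                raise ValueError(f'NaN p-value for {y_column} vs. {x_column}')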
            #print('          ',ttest_res_missing[1])
            #print('xxx',ttest_res_missing)
            results_missing_indicator.loc[row] = [y_column, x_column, ttest_res_missing[1]]
            row += 1
            #print(f'Missing data column: {y_column} against {x_column}, t-test p-value = {ttest_res[1]:.3f}')
    results_missing_indicator


    plt.figure(figsize=(9,9))
    sns.heatmap(results.pivot(index='missing_column', columns='regressor_column', values='p_value'))

    results

Another way missing data could be related to other columns is if data in the response column is more (or less) likely to be missing if data in another column is missing. We can check this using a t-test for the difference in missingness (say as an indicator variable) in each other column.

In [None]:
if TTESTS:
    sns.heatmap(results_missing_indicator.pivot(index='missing_column', columns='regressor_column', values='p_value'))
    results_missing_indicator

Based on these graphs, there doesn't appear to be any pattern in the missingness. [eduus710](https://www.kaggle.com/eduus710) came to a similar conclusion with a more complete [missingness analysis](https://www.kaggle.com/code/eduus710/tps-jun2022-how-random-are-the-nans). So, to create a dataset for method testing, we can create missing values randomly (and independently) in each column based on the proportions of missingness from the full dataset.
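
As a minimal sketch of this idea, the helper below blanks out a fixed proportion of values independently in each column. The function name `mask_mcar`, the toy frame and the proportions are hypothetical; the implementation actually used in this notebook appears later and differs in detail.

```python
import numpy as np
import pandas as pd

def mask_mcar(complete, missing_props, seed=0):
    """Return a copy of `complete` with values removed independently per column.

    `missing_props` maps column name -> proportion of rows to blank out.
    """
    rng = np.random.default_rng(seed)
    masked = complete.copy()
    for col, prop in missing_props.items():
        n_missing = int(round(prop * len(masked)))
        # Sample row positions without replacement, independently for each column
        rows = rng.choice(len(masked), size=n_missing, replace=False)
        masked.iloc[rows, masked.columns.get_loc(col)] = np.nan
    return masked

# Hypothetical complete data with three columns
toy_complete = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 3)),
                            columns=["a", "b", "c"])
toy_masked = mask_mcar(toy_complete, {"a": 0.02, "c": 0.05})
print(toy_masked.isna().mean())
```

Because the original frame is copied rather than modified, the true values stay available for scoring any imputation method against the masked copy.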

### Caveats

1. I have not considered interactions between variables here. If missingness is related to the interaction of two variables, this won't come up in this test.
2. The `F_2` variables are discrete (ordinal), and this methodology doesn't check for level-specific effects either. For example, if a particular column is more likely to be missing when, say, variable `F_2_1` is equal to 2, a t-test will not necessarily pick this up.
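
One way the second caveat could be addressed is a chi-squared test on the contingency table of a discrete variable's levels against the missingness indicator. The sketch below uses made-up data in which missingness is deliberately tied to one level; the column names and probabilities are assumptions, and I have not run this test on the competition data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame({
    "ordinal": rng.integers(0, 5, size=n),   # stand-in for a discrete variable
    "y": rng.normal(size=n),
})
# Hypothetical mechanism: y is much more likely to be missing when ordinal == 2
p_missing = np.where(toy["ordinal"] == 2, 0.30, 0.02)
toy.loc[rng.random(n) < p_missing, "y"] = np.nan

# Contingency table: levels of the discrete variable vs. missingness of y
table = pd.crosstab(toy["ordinal"], toy["y"].isna())
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

Unlike a t-test on the mean of the discrete variable, this compares the missingness rate level by level, so an effect confined to a single level can still be picked up.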

## Proportion of missing values in each column

Let's have a look at the proportion of missing values in each column. We can use this information to make the missing columns in the testing data similar to the problem data.

In [None]:

%%time 
plt.figure(figsize=(18,6))
# Proportion of missing values in each column (explicit per-column reduction)
missing_pcts = data.isna().mean(axis=0).tolist()
plt.bar(list(data.columns), missing_pcts)
#plt.hist(missing_pcts)
nzero_missing_pcts = [x for x in missing_pcts if x > 0]
# Get mean and sd for use in generating number of missing rows for each column:
norm_params = (np.mean(nzero_missing_pcts), 
               np.std(nzero_missing_pcts))
ax = plt.gca()
_=ax.set(ylabel='Proportion of values missing',
         xlabel='Column')
_=plt.xticks(rotation=90)


Now let's put together the testing dataset. The dataframe `incomplete_data` will be like the `data` dataset from the competition, and `missing` is like `sample_submission`, i.e. with one row per missing value. `complete_data` contains all rows of `data` with no missing values, and we have the missing values themselves in the `answer` column of the `missing` data frame.

In [None]:
%%time 

missing_cols = np.sum(data.isna(),0)
missing_col_ixes = np.nonzero(list(missing_cols>0))[0]
incomplete_rows = np.sum(data.isna(), 1)
complete_row_ixes = np.nonzero(list(incomplete_rows==0))[0]

complete_data = data.iloc[complete_row_ixes,:]
complete_data.reset_index(drop=True, inplace=True)
complete_data.index.name = 'row_id'
#complete_data
incomplete_data = complete_data.copy()
ncomplete_rows,_ = complete_data.shape

rd.seed(1)
missing_parts = []
for col_ix in missing_col_ixes:
    # Number of values to mask in this column, drawn to resemble the real missingness rates
    # (floored at zero in case the normal draw is negative)
    nmissing_col = max(0, int(np.round(rd.normalvariate(*norm_params) * ncomplete_rows)))
    missing_ixes_col = rd.sample(range(ncomplete_rows), nmissing_col)
    # Record the true values before masking them
    missing_parts.append(pd.DataFrame({'row': missing_ixes_col,
                                       'col': col_ix,
                                       'answer': complete_data.iloc[missing_ixes_col, col_ix].to_numpy()}))
    incomplete_data.iloc[missing_ixes_col, col_ix] = np.nan
missing = pd.concat(missing_parts, ignore_index=True)


missing = missing.astype({'row': int, 'col': int, 'answer': float})
missing['row-col'] = [f'{int(x["row"])}-{complete_data.columns[int(x["col"])]}' for i,x in missing.iterrows()]
missing.head()

Save the data for possible use elsewhere:

In [None]:
missing.to_csv('missing.csv', index=False)
incomplete_data.to_csv('incomplete_data.csv', index=False)
complete_data.to_csv('complete_data.csv', index=False)

# Model development and model selection

Now that we have a training dataset with an appropriate (hopefully!) missingness structure, models can be developed for the data by using the constructed training data. We iterate through each column separately, so let's suppose we have fixed a response column. The data will have missingness randomly distributed through each column:

In [None]:
# I adapted these diagrams from https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html
cmap_data = plt.cm.Paired
fig, ax = plt.subplots()
ncols = 5
npts = 100
thresh = 0.85
groups = np.reshape((np.random.rand(ncols*npts) < thresh)*1, (npts,ncols))
for col in range(ncols):
    _=ax.scatter(
        range(len(groups[:,col])),
        [col] * len(groups[:,col]),
        c=groups[:,col],
        marker="_",
        lw=15,
        cmap=cmap_data,
    )
_=ax.set(
    ylim=[-1, 5],
    yticks=range(5),
    yticklabels=['Response (missing col)'] + [f'Variable {x}' for x in [4,3,2,1]],
    xlabel="Row index",
)



A train/test split on this data can then be set up based on missingness in the response column:

In [None]:
fig, ax = plt.subplots()
order = np.argsort(groups[:,0])[::-1]
groups_sorted = groups[order, :]
for col in range(ncols):
    _=ax.scatter(
        range(len(groups_sorted[:,col])),
        [col] * len(groups_sorted[:,col]),
        c=groups_sorted[:,col],
        marker="_",
        lw=15,
        cmap=cmap_data,
    )
_=ax.set(
    ylim=[-1.5, 5],
    yticks=range(5),
    yticklabels=['Response (missing col)'] + [f'Variable {x}' for x in [4,3,2,1]],
    xlabel="Row index",
)

obs_prop_missing = sum(groups_sorted[:,0])/len(groups_sorted[:,0])
_=ax.plot([-1,-1],[-1.25,4.15],color='black')
_=ax.plot([obs_prop_missing*npts-.75,]*2,[-1.25,4.15],color='black')
_=ax.plot([npts,]*2,[-1.25,4.15],color='black')
_=ax.text(30,-1.255, 'Training\n data', fontsize=14)
_=ax.text(obs_prop_missing*npts + 2,-1.255, 'Test\ndata', fontsize=14)

As can be seen, we have missing data in the other columns throughout, but no missing data in the training data split of our manufactured dataset. The test data split has all missing data for the response column, but of course we know the correct values for this as the `answer` column in the `missing` dataframe.
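
The split in the diagram can be expressed with scikit-learn's `PredefinedSplit`. A toy sketch follows; the six-element `response` array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy response column: NaN marks the rows whose values were masked
response = np.array([0.3, np.nan, 1.2, -0.5, np.nan, 0.8])

# PredefinedSplit convention: -1 = always in training, 0 = member of test fold 0
test_fold = np.where(np.isnan(response), 0, -1)

splitter = PredefinedSplit(test_fold)
train_ix, test_ix = next(splitter.split())
print("train rows:", train_ix)
print("test rows: ", test_ix)
```

With only one distinct non-negative fold label there is a single train/test split, so a grid search using this splitter fits each candidate model once rather than k times.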

My approach is to fit multiple linear regressions to each response column separately, which seems reasonable given the [more or less normal distribution](https://www.kaggle.com/code/matthewszhang/tps-june-interactive-eda-sklearn-imputer) of all continuous variables. This approach could prove problematic for the discrete (ordinal) variables, `F_2_xx`. The main issue here is that we don't want to fit linear regressions to all of the other columns as we will end up with an overfit model. Some kind of variable selection is needed, such as stepwise regression variable selection. Of course, nobody does stepwise regression anymore as it is way too slow. 

I recently read the [Lasso chapter](https://www.cambridge.org/core/books/computer-age-statistical-inference/sparse-modeling-and-the-lasso/F840511B2F6A5A756FDDF1EA91BBA9DE) of Computer Age Statistical Inference by Efron and Hastie (good book by the way). The fundamental idea here is that we can use a regularisation parameter (optimised with cross-validation) to prevent overfitting of a regression model. [Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) is essentially multiple linear regression with a squared (L2) regularisation function, whereas [Lasso regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) uses an absolute value (L1) regularisation function. As the L1 constraint region of the lasso has sharp corners, as opposed to the curved boundary of the L2 region, lasso regression can set many coefficients exactly to zero, which gives us variable selection almost for free (or at least much cheaper than using stepwise regression).
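
A small sketch of the difference on synthetic data, where only the first three of twenty predictors matter. The data, the true coefficients and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.5, 1.0]          # only three informative predictors
y = X @ true_coef + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks coefficients towards zero; lasso sets many of them exactly to zero
print("ridge: coefficients exactly zero =", int(np.sum(ridge.coef_ == 0)))
print("lasso: coefficients exactly zero =", int(np.sum(lasso.coef_ == 0)))
```

The non-zero lasso coefficients act as the selected variables, which is the "almost free" variable selection referred to above.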

I tried a few different linear models (both ridge regression and the lasso), as well as the usual suspects (random forests, xgboost, etc.) but ended up settling on [SGD Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html). This is a very flexible linear model methodology which, through the choice of loss and penalty, can encompass ordinary least squares, ridge regression, lasso regression, elastic net and linear support vector regression (logistic regression and linear SVM classifiers are handled by its sibling, `SGDClassifier`). The SGD in `SGDRegressor` stands for stochastic gradient descent and refers to the parameter optimisation method under the hood.

I use `GridSearchCV` to choose the optimal regularisation parameter (`alpha`) and `l1_ratio` parameter. According to the function API, `l1_ratio` equal to 1 corresponds to lasso regression, and `l1_ratio` equal to 0 corresponds to ridge regression. `SGDRegressor` thus allows both models and a kind of ridge-lasso hybrid regression model if the parameter is between 0 and 1.
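
For reference, this is the objective as I understand it from the scikit-learn SGD documentation, written with $\rho$ standing for `l1_ratio`. The symbols $w$ (coefficients), $b$ (intercept), $L$ (loss) and $n$ (number of training rows) are notation introduced here, so treat this as a sketch rather than a quotation of the documentation:

$$
\min_{w,\,b}\;\; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\; w^\top x_i + b\big) \;+\; \alpha\left(\rho\,\lVert w\rVert_1 + \frac{1-\rho}{2}\,\lVert w\rVert_2^2\right)
$$

Setting $\rho = 1$ leaves only the L1 term (lasso), setting $\rho = 0$ leaves only the L2 term (ridge), and `alpha` scales the whole penalty in both cases.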

The preprocessing and modelling pipeline consists of three steps:
1. Infill missing values in the predictor columns using mean values, or similar,
2. Standardise predictor columns (this is important for `SGDRegressor`, but not important for something like `RandomForestRegressor`),
3. Fit regression model

I have implemented this in a scikit-learn type estimator using `BaseEstimator` (to inherit a few things like the `set_params` method) and `fit` and `predict` methods. This way it works directly with `GridSearchCV`.

In [None]:
%%time 

from sklearn.base import BaseEstimator

class Model(BaseEstimator):
    def __init__(self, alpha=1, l1_ratio=0):#, alpha=1):
        self.alpha = alpha
        self.l1_ratio = l1_ratio
    def predict(self, X):
        #print('predicting...')
        Xtr = self.si.transform(X)
        Xtr2 = self.ss.transform(Xtr)
        #print('  infilling zeros...')
        ypred = self.mod.predict(Xtr2)
        return ypred
    def fit(self, X, y):
        from sklearn.impute import SimpleImputer
        from sklearn.preprocessing import StandardScaler
        #print('infilling...')
        self.si = SimpleImputer()
        Xtr = self.si.fit_transform(X)
        self.ss = StandardScaler()
        #Xtr2 = self.ss.fit_transform(Xtr)
        Xtr2 = self.ss.fit_transform(Xtr)
        #print('fitting model...')
        from sklearn.linear_model import SGDRegressor
        self.mod = SGDRegressor(alpha=self.alpha, 
                                l1_ratio=self.l1_ratio,
                                penalty='elasticnet', 
                                random_state = 1)

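        self.mod.fit(Xtr2,y)
        return self  # scikit-learn convention: fit returns the estimator

def RMSE(y_pred, y_true):
    # Root-mean-squared error (symmetric in its two arguments).
    # np.sqrt is used instead of `squared=False`, which is deprecated in recent scikit-learn releases.
    from sklearn.metrics import mean_squared_error
    return np.sqrt(mean_squared_error(y_pred, y_true))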
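
For comparison, the same three steps could also be sketched with scikit-learn's built-in `Pipeline` instead of a custom class. This is an alternative, not what the notebook uses; the step names, toy data and hyperparameter values below are arbitrary assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # step 1: mean-infill predictors
    ("scale", StandardScaler()),                  # step 2: standardise predictors
    ("sgd", SGDRegressor(penalty="elasticnet", alpha=1e-4,
                         l1_ratio=0.5, random_state=1)),  # step 3: regression
])

# Toy data with roughly 5% of predictor values missing
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=2000)
X[rng.random(X.shape) < 0.05] = np.nan

pipe.fit(X, y)
print("R^2 on the toy data:", round(pipe.score(X, y), 3))
```

With a pipeline, the grid-search parameter names pick up the step prefix, e.g. `sgd__alpha` and `sgd__l1_ratio`, whereas the custom `Model` class keeps the plain names `alpha` and `l1_ratio`.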


Now, perform the hyperparameter tuning on a range of values of `alpha` and `l1_ratio`. The `PredefinedSplit` cross validator allows us to specify the training/test data split directly (i.e. using the missingness of the response column) for use in `GridSearchCV`:

In [None]:
alphas = [pow(10,x) for x in range(-8,1)]
l1_ratios = [0,0.25,0.5,0.75,1]



from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.metrics import make_scorer
plt.figure(figsize=[16,16])
panel = 1
best_params = {}

for response_column in complete_data.columns[np.unique(missing['col'])]:
    
    print(response_column, end=' ')

    # PredefinedSplit uses the following coding:
    # -1: train
    #  0: test
    pds_indices = (-1*(1-incomplete_data[response_column].isna()*1))

    ixes = list(PredefinedSplit(pds_indices).split())
    train_ix, test_ix = ixes[0]

    gscv = GridSearchCV(estimator = Model(),
                        cv = PredefinedSplit(pds_indices),
                        param_grid = {'alpha': alphas,
                                      'l1_ratio': l1_ratios},
                        scoring = make_scorer(RMSE,
                                             greater_is_better=False),
                        verbose=1,refit=True)

    gscv.fit(incomplete_data.drop(response_column, axis=1), 
             complete_data[response_column])
    
    cv_results = pd.DataFrame(gscv.cv_results_['params'])
    cv_results['score'] = gscv.cv_results_['mean_test_score']
    results_for_plotting = cv_results.pivot(index='alpha',columns='l1_ratio', values='score')
    plty = np.array(results_for_plotting)
    plt.subplot(7,8,panel)
    plt.semilogx(alphas,-plty)
    plt.plot(alphas, 
             [RMSE(complete_data.iloc[test_ix,:][response_column], 
                   [np.mean(complete_data.iloc[test_ix,:][response_column]),]*complete_data.iloc[test_ix,:].shape[0]),]*len(alphas),
             'k:')
    ax3 = plt.gca()
    plt.title(response_column)
    ax3.set_yticklabels([])
    ax3.set_xticklabels([])
    ix = gscv.cv_results_['params'].index(gscv.best_params_)
    plt.plot(gscv.best_params_['alpha'], -gscv.cv_results_['mean_test_score'][ix],'ro')
    
    
    ypred_train = gscv.best_estimator_.predict(incomplete_data.iloc[train_ix,:].drop(response_column, axis=1))
    RMSE_train = RMSE(ypred_train, complete_data.iloc[train_ix,:][response_column])
    ypred_test = gscv.best_estimator_.predict(incomplete_data.iloc[test_ix,:].drop(response_column, axis=1))
    RMSE_test = RMSE(ypred_test, complete_data.iloc[test_ix,:][response_column])

    best_params[response_column] = dict(gscv.best_params_, **{'RMSE_train': RMSE_train,
                                                              'RMSE_test': RMSE_test}) # concatenate dictionaries
    panel += 1
    #print('-------------------------------------')
    
print()
# Plot final panel as legend
plt.subplot(7,8,panel)
_=plt.plot([0,1],np.array(((1,1),(2,2),(3,3),(4,4),(5,5))).T)
ax = plt.gca()
ax.set_xlim(0,1.3)
for i in range(1,6):
    _=ax.text(1.05, i, 'l1-ratio = '+str(l1_ratios[i-1]),size=12)
_=ax.axis('off')

plt.suptitle('Test RMSE (y-axis) vs. regularisation parameter (x-axis)')
# Save results
res_df = pd.DataFrame.from_dict(best_params, orient='index')
res_df
res_df.to_csv('cv_results.csv')

Solid lines in each panel show the test-set RMSE for different values of `alpha` (x-axis) and `l1_ratio` (line colour). The best model based on test-set RMSE is marked by the red dot. In almost all cases, the `GridSearchCV` is picking out a minimal-RMSE model. 

The dotted line shows the RMSE of the test data from the mean-infilled baseline model, i.e. $\sqrt{\frac{1}{n}\sum_{i}(y_i-\bar{y})^2}$. For some variables, the model is doing better than the mean-infilled baseline model, whereas in other cases the mean model fits the test data better. 
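
The baseline formula is just the population standard deviation of the test values, which a quick check on made-up numbers confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.normal(loc=3.0, scale=2.0, size=1000)   # hypothetical test-set values

# Predict the mean for every row, then compute the RMSE of that prediction
baseline_pred = np.full_like(y_test, y_test.mean())
baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))

# The mean-infill RMSE equals the population standard deviation (ddof=0)
print(baseline_rmse, np.std(y_test))
```

So a fitted model only beats the dotted line if its RMSE is smaller than the spread of the column itself.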


# Model prediction

Next, we fit the best models to the competition data. Data is filled in left-to-right, and so imputed columns will be used for predictions for later columns.

In [None]:
%%time 

# Columns that contain at least one missing value (evaluated once, before any infilling)
for response_column in data.columns[data.isna().any()]:

    print(response_column, end=' ')
    missing_ixes = data[response_column].isna()
    train_i = data.loc[~missing_ixes]
    test_i = data.loc[missing_ixes]
    

    #Xte = si.transform(test_i.drop(response_column, axis=1))
    mod_i = Model(alpha = res_df.loc[response_column,'alpha'],
                  l1_ratio = res_df.loc[response_column,'l1_ratio'])
    mod_i.fit(train_i.drop(response_column, axis=1), 
              train_i[response_column])
    ypred = mod_i.predict(test_i.drop(response_column, axis=1))
    # Label-based assignment: does not rely on row_id labels coinciding with row positions
    data.loc[missing_ixes, response_column] = ypred
print()

# Untransform the F_4 variables

data[F4cols] = yjpt.inverse_transform(data[F4cols])

In [None]:
%%time

rc = [x.split('-') for x in sample['row-col']]
row,col = zip(*rc)
row = [int(x) for x in row]
sample['row'] = row
sample['col'] = col
sample_ix = np.bitwise_and(np.array(row) < MAXROWS, [x in TCOLS for x in col])
sample = sample.iloc[sample_ix,:].copy()  # copy so that adding columns below doesn't touch a view

sample['col_ix'] = [list(TCOLS).index(c) for c in sample['col']]

sample.head()

## Populate the submission data frame

In [None]:
%%time 

submission = sample.copy()
# Look up every (row, column) position in one vectorised step rather than row by row
submission['value'] = data.to_numpy()[submission['row'].to_numpy(),
                                      submission['col_ix'].to_numpy()]

submission[['row-col','value']].head()

submission[['row-col','value']].to_csv('submission.csv', index=False)

## Error estimate

An estimate of the RMSE on the competition data can be obtained from the cross-validation model fitting. This is likely to be an underestimate, since the hyperparameters were chosen to minimise the error on the same test split.

In [None]:
np.sqrt(np.mean([x**2 for x in res_df['RMSE_test']]))
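
The pooling above weights every column equally. If the overall score is an RMSE taken over all missing cells, and the columns have different numbers of missing values, a count-weighted version might be a closer estimate. A sketch with a hypothetical helper `pooled_rmse` and made-up numbers:

```python
import numpy as np

def pooled_rmse(rmse_per_column, n_per_column):
    """Combine per-column RMSEs into one RMSE over all cells, weighting by cell counts."""
    rmse = np.asarray(rmse_per_column, dtype=float)
    n = np.asarray(n_per_column, dtype=float)
    return float(np.sqrt(np.sum(n * rmse ** 2) / np.sum(n)))

# Hypothetical per-column test RMSEs and numbers of masked cells
rmses = [1.0, 0.5, 2.0]
counts = [100, 300, 100]
print(pooled_rmse(rmses, counts))
```

When all counts are equal this reduces to the unweighted expression used in the cell above.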

# Conclusions

Schafer and Graham (2002) suggest that: "with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest — not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data." The goals of this month's TPS competition oppose this somewhat, in that our goal is explicitly to recover missing observations as well as possible. 

The column-by-column `SGDRegressor` approach I developed here definitely does a better job than mean-infilled imputation. However, several of the variables end up with mean-infilled models anyway. I ended up getting better results in this competition playing around with parameters from `IterativeImputer`.

Apart from simple mean-infilled imputation, imputation seems to take forever. The `IterativeImputer` approach from sklearn has a run time of a similar order of magnitude to the approach I developed here, and comments in the discussion suggest that lots of people struggled with developing a non-trivial approach that ran relatively quickly.
 
Faced with imputation methods that might end up taking longer than the actual modelling itself, mean infilling looks attractive. However, the main issue with simple approaches like this is underestimation of variance. If we need to use the imputed data column for subsequent modelling, infilling with the mean could end up biasing model parameters, variances, and estimates of predictive uncertainty, etc. As Schafer and Graham (2002) write: "The average of the variable is preserved, but other aspects of its distribution - variance, quantiles, and so forth - are altered with potentially serious ramifications." In this case, resampling ("hot deck") type approaches start looking attractive.
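
The variance shrinkage is easy to demonstrate on simulated data. In the sketch below, the 30% missingness rate is exaggerated for visibility, and the "hot deck" is the simplest possible variant (random draws from the observed values), which is my own simplification rather than a method taken from the references.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(loc=0.0, scale=1.0, size=10_000)
is_missing = rng.random(full.size) < 0.3          # 30% MCAR
observed = full[~is_missing]

# Mean infilling: every gap receives the same value
mean_filled = full.copy()
mean_filled[is_missing] = observed.mean()

# Simple random hot deck: each gap receives a randomly drawn observed value
hot_deck = full.copy()
hot_deck[is_missing] = rng.choice(observed, size=is_missing.sum(), replace=True)

print("variance, complete data:", round(full.var(), 3))
print("variance, mean-filled:  ", round(mean_filled.var(), 3))
print("variance, hot deck:     ", round(hot_deck.var(), 3))
```

Mean infilling shrinks the variance roughly in proportion to the fraction of values filled, while the resampled column keeps a spread close to the original. The price is a worse RMSE for the individual imputed values, which is why it would not score well in this competition.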

Finally, given that missing data problems are not (or should not be) really prediction problems, perhaps a better result could be achieved using unsupervised models. I spent some time early on this month investigating this kind of approach. The basic idea would be to cluster the complete data into 'similar' groups and use the groups as analogues for records with missing values. After identifying clusters, we'd identify which cluster each incomplete row belongs to and infill using the mean, or perhaps a more sophisticated estimate, of the variable for the group. If there was more time left I'd probably spend some of it exploring this kind of approach in more depth.
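
A minimal sketch of what such a cluster-based imputer might look like, using k-means on toy data. The function `cluster_impute`, the choice of k-means, and the three-blob data are all assumptions for illustration; this is not the clustering code I experimented with earlier in the month.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_impute(X, n_clusters=3, seed=0):
    """Fill NaNs with the centroid value of the nearest cluster.

    Clusters are fitted on complete rows only; each incomplete row is
    assigned using squared distance over its observed columns.
    """
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X[complete])
    centers = km.cluster_centers_

    filled = X.copy()
    for i in np.flatnonzero(~complete):
        obs = ~np.isnan(X[i])
        # Distance to each centroid, restricted to the columns observed in row i
        dists = ((centers[:, obs] - X[i, obs]) ** 2).sum(axis=1)
        nearest = np.argmin(dists)
        filled[i, ~obs] = centers[nearest, ~obs]
    return filled

# Toy data: three well-separated blobs, with about 5% of cells removed at random
rng = np.random.default_rng(0)
blob_means = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]], dtype=float)
truth = np.vstack([rng.normal(m, 0.5, size=(200, 3)) for m in blob_means])
X_missing = truth.copy()
X_missing[rng.random(truth.shape) < 0.05] = np.nan

X_filled = cluster_impute(X_missing)
mask = np.isnan(X_missing)
rmse_cluster = np.sqrt(np.mean((X_filled[mask] - truth[mask]) ** 2))
col_means = np.nanmean(X_missing, axis=0)
rmse_mean = np.sqrt(np.mean((np.broadcast_to(col_means, truth.shape)[mask] - truth[mask]) ** 2))
print(f"cluster RMSE = {rmse_cluster:.2f}, column-mean RMSE = {rmse_mean:.2f}")
```

On data with this much cluster structure the cluster means are far better than the column means. Whether the competition data has enough structure for this to pay off is a separate question that this toy example does not answer.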

# References

I found the following (classic) papers helpful:

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

Schafer, J. & Graham, J. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. doi:10.1037/1082-989X.7.2.147

