# Predicting GPA
## Baseline analysis using OLS regression

***Note that these analyses were performed after the completion of the competition and that these models were not submitted as part of the challenge.***

The initial pre-processing steps are exactly the same as in the main notebook, although any setup pertaining to Keras has been removed.


# Loading packages and data

In [1]:
# Set up to ensure reproducibility following https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
import numpy as np
import random
import os

os.environ['PYTHONHASHSEED'] = '0'

# The below is necessary for starting Numpy generated random numbers
# in a well-defined initial state.
np.random.seed(42)

# The below is necessary for starting core Python generated random numbers
# in a well-defined state.
random.seed(54321)

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

Loading the files

***Note: These data cannot be provided on GitHub and I will delete my copies in accordance with the FFC agreement. If you would like copies of the data to replicate these analyses, please contact the Fragile Families survey.***

In [2]:
train=pd.read_csv('../../ff_data/train.csv',low_memory=False, index_col='challengeID')
predictions=pd.read_csv('../../ff_data/prediction.csv',low_memory=False, index_col='challengeID')

To generate `full_imputed.p`, the script `clean_files.py` must first be run. If necessary, it can be run by uncommenting the line below.

In [3]:
#! python clean_files.py

In [3]:
data = pd.read_pickle('../../ff_data/full_imputed.p') # load imputed data output after running the clean_files.py

In [4]:
data.shape

(4242, 4568)

Extract the outcomes from the imputed data.

In [5]:
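outcomes = ['gpa','grit','materialHardship','eviction','layoff','jobTraining']
y = data[outcomes]
# Drop the outcome columns to obtain the feature matrix
# (avoids deleting columns from a DataFrame while iterating over it)
X = data.drop(columns=outcomes)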

# Data processing

***Note: These are the exact same preprocessing steps as used with the neural network models so the baseline comparison is fair. It is possible, and indeed highly likely, that a linear regression might perform better with different inputs or preprocessing.***

Before modelling, I apply the same two transformations that were used to prepare the data for the neural network.

Categorical variables are transformed using one-hot encoding. Continuous variables are standardized to have a mean of zero and unit variance.

To identify which columns belong to which group, I use the same heuristic as in the imputation script.

In [6]:
# Identify categorical columns
cat_cols = []
non_cat_cols = []
for i, c in enumerate(X.columns):
    is_categorical = False
    vals = set(list(X[c]))
    vals = {x for x in vals if x==x} # Removes nans, otherwise treated as unique
    if X[c].dtype == 'float64': # if float and low num distinct then treat as cat
        if len(vals) <= 20:
            is_categorical = True
        else:
            pass
    else:
        is_categorical = True
    
    # Now append to relevant list of columns
    if is_categorical:
        cat_cols.append(c)
        
    else:
        non_cat_cols.append(c)
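
The heuristic above can also be written as a small reusable function. The following is a minimal sketch on made-up data, not the code used in the analysis: the function name `split_columns`, the `max_levels` argument, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd

def split_columns(df, max_levels=20):
    """Split column names into (categorical, continuous) lists.

    A column is treated as categorical if it is not float64, or if it is
    float64 with at most `max_levels` distinct non-missing values.
    """
    cat_cols, non_cat_cols = [], []
    for c in df.columns:
        n_distinct = df[c].nunique(dropna=True)  # NaNs are not counted as levels
        if df[c].dtype != 'float64' or n_distinct <= max_levels:
            cat_cols.append(c)
        else:
            non_cat_cols.append(c)
    return cat_cols, non_cat_cols

# Hypothetical toy data: one continuous float, one low-cardinality float, one string column
toy = pd.DataFrame({
    'income': np.linspace(0.0, 1000.0, 50),  # float64 with 50 distinct values
    'flag': np.tile([0.0, 1.0], 25),         # float64 with 2 distinct values
    'city': ['a', 'b'] * 25,                 # non-float column
})
cat, cont = split_columns(toy)
```

Here `income` ends up in the continuous list, while `flag` and `city` are treated as categorical.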

In [7]:
X_dummies = pd.get_dummies(X, columns=cat_cols)
# Note that sklearn also has one-hot encoding but doesn't relabel
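
As a small illustration of the relabelling mentioned in the comment, the sketch below applies `pd.get_dummies` to a made-up frame (the column names `income` and `flag` are hypothetical). Each encoded column is replaced by one indicator column per level, named `<column>_<level>`, which is where names such as `hv4mflag_1.0` come from.

```python
import pandas as pd

# Hypothetical toy data: one continuous column and one categorical column stored as floats
toy = pd.DataFrame({
    'income': [10.5, 20.0, 30.25],
    'flag': [0.0, 1.0, 0.0],
})

# Only the listed columns are encoded; the remaining columns are passed through unchanged
toy_dummies = pd.get_dummies(toy, columns=['flag'])
dummy_names = list(toy_dummies.columns)
```

Depending on the pandas version, the indicator columns may be stored as integers or as booleans.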

In [8]:
X_dummies.head()

challengeID,m1lenmin,m1citywt,m1e1d1,m1e1d2,m1e1d3,m1i2a,m1i2b,m1j2a,m1j2b,cm1hhinc,...,hv4mflag_1.0,hv4mflag_2.0,hv4mflag_3.0,hv4mompreg_0.0,hv4mompreg_1.0,hv4selfht_0.0,hv4selfht_1.0,hv4selfht_2.0,hv4selfwt_0.0,hv4selfwt_1.0
1,40.0,202.485367,25.0,6.723174,13.260396,38.0,1682.415602,0.038262,2.211822,29579.694329,...,0,0,0,1,0,1,0,0,1,0
2,40.0,45.608219,43.0,16.0,3.0,25.0,3050.504448,0.110909,1.985703,20829.093487,...,0,0,0,0,1,1,0,0,1,0
3,35.0,39.060299,49.0,46.0,23.0,20.0,0.0,12.158179,1.386592,132483.450592,...,0,0,0,1,0,1,0,0,1,0
4,30.0,22.304855,23.0,23.169628,5.699719,20.0,0.0,4.165048,1.157385,0.0,...,0,0,0,1,0,1,0,0,1,0
5,25.0,35.518272,90.0,64.0,58.0,12.0,1974.812374,12.212538,2.965919,49026.982561,...,0,0,0,1,0,1,0,0,1,0


In [9]:
normalizer = StandardScaler()
for c in non_cat_cols:
    normed = normalizer.fit_transform(X_dummies[c].values.reshape(-1,1))
    X_dummies[c] = normed
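
Because `StandardScaler` standardizes each column independently, the loop above should be equivalent to a single call over all continuous columns. A minimal sketch on made-up data follows; the frame `toy` and the list `cont_cols` are hypothetical stand-ins for `X_dummies` and `non_cat_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two continuous columns and one indicator column
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 10.0, 30.0, 30.0],
    'c_1': [0, 1, 0, 1],
})
cont_cols = ['a', 'b']

scaled = toy.copy()
# One fit_transform call scales every listed column to mean 0 and unit (population) variance
scaled[cont_cols] = StandardScaler().fit_transform(toy[cont_cols])
```

The indicator column `c_1` is left untouched, as in the loop above.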

In [10]:
X_dummies.head()

challengeID,m1lenmin,m1citywt,m1e1d1,m1e1d2,m1e1d3,m1i2a,m1i2b,m1j2a,m1j2b,cm1hhinc,...,hv4mflag_1.0,hv4mflag_2.0,hv4mflag_3.0,hv4mompreg_0.0,hv4mompreg_1.0,hv4selfht_0.0,hv4selfht_1.0,hv4selfht_2.0,hv4selfwt_0.0,hv4selfwt_1.0
1,0.364267,0.675623,-0.545055,-0.788818,0.183309,0.255698,-0.245954,-1.512365,0.017524,-0.10533,...,0,0,0,1,0,1,0,0,1,0
2,0.364267,-0.197911,0.606432,-0.149595,-1.041256,-0.986483,-0.080941,-1.497946,-0.168142,-0.377045,...,0,0,0,0,1,1,0,0,1,0
3,-0.071568,-0.234372,0.990261,1.917563,1.345718,-1.464245,-0.44888,0.893159,-0.660072,3.089933,...,0,0,0,1,0,1,0,0,1,0
4,-0.507403,-0.32767,-0.672998,0.34443,-0.719048,-1.464245,-0.44888,-0.693293,-0.848273,-1.023809,...,0,0,0,1,0,1,0,0,1,0
5,-0.943238,-0.254094,3.613094,3.157858,5.522921,-2.228664,-0.210687,0.903948,0.636712,0.498527,...,0,0,0,1,0,1,0,0,1,0


In [11]:
X = X_dummies # rename X

Next, split the X and y matrices into the cases that belong to the training set and the cases that belong to the prediction set.

In [12]:
X_training=X.loc[X.index.isin(train.index)]
X_pred=X.loc[~X.index.isin(train.index)]

In [13]:
y_training=y.loc[y.index.isin(train.index)]
y_pred=y.loc[~y.index.isin(train.index)]
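
The split relies on boolean masks built from the shared `challengeID` index. A minimal sketch of the same pattern on made-up data (the frames `features` and `labelled` and their values are hypothetical):

```python
import pandas as pd

# Hypothetical feature matrix covering all cases
features = pd.DataFrame(
    {'x': [0.1, 0.2, 0.3, 0.4]},
    index=pd.Index([1, 2, 3, 4], name='challengeID'),
)
# Hypothetical training file that only contains cases 2 and 4
labelled = pd.DataFrame(
    {'gpa': [3.0, 2.5]},
    index=pd.Index([2, 4], name='challengeID'),
)

in_train = features.index.isin(labelled.index)  # boolean mask, one entry per row
X_lab = features.loc[in_train]     # rows whose ID appears in the training file
X_unlab = features.loc[~in_train]  # all remaining rows
```

Every row lands in exactly one of the two subsets, so their lengths sum to the length of the full matrix.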

# Modeling

Randomly split the training cases into training and test sets, holding out 20% of the data for testing. Validation is handled by cross-validation within the remaining 80%.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_training, y_training.gpa, test_size=0.20, random_state=12345)

## Baseline model: OLS regression

In [15]:
# Features are already standardized, so no further normalization is applied.
# (The `normalize` argument defaulted to False and is not accepted by recent scikit-learn releases.)
baseline1 = LinearRegression()
baseline1_params = [{'fit_intercept': [True, False]}]
grid = GridSearchCV(baseline1,
                         param_grid=baseline1_params,
                         scoring='neg_mean_squared_error', #sklearn optimizing by maximizing negative MSE
                         n_jobs=1,
                         verbose=2,
                         cv=5)

In [16]:
%%time
grid.fit(np.array(X_train), np.array(y_train))

print('The parameters of the best model are: ')
print(grid.best_params_)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] fit_intercept=True ..............................................




[CV] ............................... fit_intercept=True, total=  30.1s
[CV] fit_intercept=True ..............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   30.2s remaining:    0.0s


[CV] ............................... fit_intercept=True, total=  30.2s
[CV] fit_intercept=True ..............................................
[CV] ............................... fit_intercept=True, total=  30.1s
[CV] fit_intercept=True ..............................................
[CV] ............................... fit_intercept=True, total=  30.8s
[CV] fit_intercept=True ..............................................
[CV] ............................... fit_intercept=True, total=  31.3s
[CV] fit_intercept=False .............................................
[CV] .............................. fit_intercept=False, total=  28.9s
[CV] fit_intercept=False .............................................
[CV] .............................. fit_intercept=False, total=  29.8s
[CV] fit_intercept=False .............................................
[CV] .............................. fit_intercept=False, total=  30.8s
[CV] fit_intercept=False .............................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  5.1min finished


The parameters of the best model are: 
{'fit_intercept': True}
CPU times: user 7min 1s, sys: 9.3 s, total: 7min 11s
Wall time: 6min 2s


Mean cross-validated MSE of the best model (the negated `best_score_`):

In [18]:
abs(grid.best_score_)

0.30053974716511073
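
For reference, `best_score_` in `GridSearchCV` is the mean score over the validation folds for the best parameter setting, so its negation is a cross-validated MSE rather than an error computed on the rows the model was fitted to. The sketch below checks this on synthetic data; the data, coefficients, and variable names are made up for illustration and are unrelated to the FFC data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem (hypothetical values)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(LinearRegression(),
                      param_grid={'fit_intercept': [True, False]},
                      scoring='neg_mean_squared_error',
                      cv=5)
search.fit(X_toy, y_toy)

# Re-score the winning setting fold by fold with the same 5-fold split
best = LinearRegression(**search.best_params_)
fold_scores = cross_val_score(best, X_toy, y_toy,
                              scoring='neg_mean_squared_error', cv=5)

cv_mse = -search.best_score_  # mean MSE over the validation folds
in_sample_mse = mean_squared_error(y_toy, search.predict(X_toy))  # error on the fitting data
```

Comparing `cv_mse` with `in_sample_mse` shows how far the two quantities differ for a given dataset.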

MSE of the best model identified by the grid search, evaluated on the 20% test split:

In [19]:
mean_squared_error(y_test, grid.predict(np.array(X_test)))

0.37037734985025444

Now making predictions for all observations. These will then be sent to be scored on the hold-out data.

In [19]:
preds = grid.predict(np.array(X))
predictions['gpa'] = preds
predictions.to_csv('regression_baseline_predictions.csv')

This model was not submitted to the FFC competition but it was scored on the leaderboard and final held-out data so that the results could be used in the final paper. The MSE scores on these datasets are as follows:

***Leaderboard:*** 0.467964593

***Held-out:*** 0.445437226

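Overall we see that performance has deteriorated out-of-sample, from an MSE of 0.30 in cross-validation and 0.37 on the test split to 0.47 on the leaderboard and 0.45 on the held-out data, suggesting that the model overfit the training data.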
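
One plausible contributor is the shape of the data: the imputed file has more columns (4568) than rows (4242) even before one-hot encoding, and an unregularized linear regression with more features than observations can fit its training rows almost exactly. The sketch below illustrates that effect on synthetic data only; the dimensions and the data-generating process are made up and are not an analysis of the FFC data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with more features than rows (hypothetical dimensions)
rng = np.random.RandomState(0)
n_rows, n_features = 120, 200
X_toy = rng.normal(size=(n_rows, n_features))
y_toy = X_toy[:, 0] + rng.normal(size=n_rows)  # only the first feature carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))  # error on the rows used for fitting
test_mse = mean_squared_error(y_te, model.predict(X_te))   # error on unseen rows
```

In this setting the training error is essentially zero while the test error is not, which is the pattern that regularized baselines such as ridge or lasso regression are designed to reduce.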