
# Wage Prediction using Lasso

In this notebook we ask what determines workers' wages, taking a predictive perspective.

This example focuses on a sample of Registered Nurses in the US collected during 2017. The hourly wage of a nurse is denoted by $Y$ and $X$ is a vector of nurses' characteristics, e.g., human capital, demographics, job-relevant characteristics. The question that we want to answer is:

- How can we use nurses' characteristics, such as education and experience, to best predict wages?


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline

import warnings
warnings.simplefilter('ignore')

## Data

In [2]:
# Read data
path_data = "https://github.com/pabloestradac/causalml-basics/raw/main/data/"
df = pd.read_csv(path_data + 'wages_nurses.csv')
df.tail()

Unnamed: 0,lwage,female,age,race,children,marital,education,RN_experience,left_nursing,country_ed_US,english_only,military,certificates,labor_union,work_setting,work_situation,level_care,care_specialty,state
5617,3.496087,1,72,white,0,widowed_divorced_sep,ed_assoc,51,left_0,1,1,never_served_mil,0,0,SET_long_term_inpatient,SIT_self_employed,LC_nursing_home,CC_no_patient_care,state_n13
5618,3.678408,1,50,white,0,never_married,ed_msn,14,left_0,1,1,never_served_mil,0,0,SET_hospital,SIT_agency_facility,LC_education,CC_emergency_care,state_n24
5619,3.952845,0,36,white,0,currently_married,ed_bsn,4,left_0,1,1,never_served_mil,2,1,SET_hospital,SIT_agency_facility,LC_inpatient,CC_oncology,state_n32
5620,3.95796,1,61,white,0,currently_married,ed_bsn,40,left_0,1,1,never_served_mil,0,0,SET_hospital,SIT_agency_facility,LC_inpatient,CC_no_patient_care,state_n37
5621,3.10027,1,38,white,0,currently_married,ed_assoc,3,left_0,1,1,never_served_mil,2,0,SET_hospital,SIT_agency_facility,LC_others,CC_medical_surgical,state_n29


We construct a prediction rule for the log hourly wage, which depends linearly on relevant characteristics $X$:

$$
\log(\text{wage}) = \beta' X + e
$$

We then assess the predictive performance of each model in two ways: in sample, using the MSE and $R^2$ together with their degrees-of-freedom-adjusted versions, and out of sample, using the MSE and $R^2$ computed on held-out data obtained by data splitting.
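As a minimal sketch of the in-sample measures mentioned above, the helper below computes the MSE, the adjusted MSE, $R^2$ and adjusted $R^2$ from fitted values. The function name `fit_metrics` and the toy numbers are illustrative assumptions, not part of the original notebook; `p` counts all fitted parameters, including the intercept.

```python
import numpy as np

def fit_metrics(y, y_hat, p):
    """In-sample fit measures for n observations and p fitted parameters."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.shape[0]
    ssr = np.sum((y - y_hat) ** 2)        # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return {
        'mse': ssr / n,
        'mse_adj': ssr / (n - p),         # degrees-of-freedom correction
        'r2': 1 - ssr / sst,
        'r2_adj': 1 - (ssr / (n - p)) / (sst / (n - 1)),
    }

# Toy example (hypothetical numbers)
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_toy = np.array([1.5, 1.5, 3.5, 3.5])
metrics = fit_metrics(y_toy, y_hat_toy, p=2)
print(metrics)
```

The adjusted versions penalize the number of fitted parameters, so they move away from the unadjusted ones as `p` approaches `n`.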


## Basic vs Flexible Model

We employ two different specifications for prediction:

1. Basic Model: $X$ consists of a set of raw regressors

2. Flexible Model: $X$ consists of all raw regressors from the basic model plus a dictionary of transformations.
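To make the idea of a "dictionary of transformations" concrete, here is a small NumPy-only sketch that expands one continuous regressor into polynomial terms and interacts them with a single dummy variable. The function `flexible_design` and the toy inputs are hypothetical; the notebook itself builds its design matrices from formula strings.

```python
import numpy as np

def flexible_design(exper, dummy, degree=4):
    """Intercept, dummy, powers of exper, and powers of exper times dummy."""
    exper = np.asarray(exper, dtype=float)
    dummy = np.asarray(dummy, dtype=float)
    # Polynomial terms exper^1, ..., exper^degree as columns
    powers = np.column_stack([exper ** k for k in range(1, degree + 1)])
    # Interact every polynomial term with the dummy
    interactions = powers * dummy[:, None]
    return np.column_stack([np.ones_like(exper), dummy, powers, interactions])

# Toy inputs (hypothetical): 5 observations
exper_toy = [0, 1, 2, 3, 4]
dummy_toy = [0, 1, 0, 1, 1]
X_toy = flexible_design(exper_toy, dummy_toy)
print(X_toy.shape)
```

With one dummy and a fourth-degree polynomial the design has 10 columns; with many categorical regressors, as in the flexible model here, the number of columns grows quickly.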

Let's start with OLS.

In [None]:
# 0. Traditional Model
model_base = ('lwage ~ female + age + race + children + marital + education + RN_experience')
base = smf.ols(model_base, data=df)
results_base = base.fit()

In [None]:
X, y = base.data.exog, base.data.endog
rsquared_base = results_base.rsquared
rsquared_adj_base = results_base.rsquared_adj
mse_base = np.mean(results_base.resid**2)
mse_adj_base = results_base.mse_resid
print(f'No_Features = {X.shape[1]:.0f}')
print(f'Rsquared = {rsquared_base:.3f}')
print(f'Rsquared_adjusted = {rsquared_adj_base:.3f}')
print(f'MSE = {mse_base:.3f}')
print(f'MSE_adjusted = {mse_adj_base:.3f}')

In [None]:
# 1. Basic Model
model_base = ('lwage ~ female + age + race + children + marital '
              '+ education + RN_experience + left_nursing + country_ed_US + english_only + military + certificates'
              '+ labor_union + work_setting + work_situation + level_care + care_specialty + state')
base = smf.ols(model_base, data=df)
results_base = base.fit()

In [None]:
X, y = base.data.exog, base.data.endog
rsquared_base = results_base.rsquared
rsquared_adj_base = results_base.rsquared_adj
mse_base = np.mean(results_base.resid**2)
mse_adj_base = results_base.mse_resid
print(f'No_Features = {X.shape[1]:.0f}')
print(f'Rsquared = {rsquared_base:.3f}')
print(f'Rsquared_adjusted = {rsquared_adj_base:.3f}')
print(f'MSE = {mse_base:.3f}')
print(f'MSE_adjusted = {mse_adj_base:.3f}')

In [None]:
# 2. Flexible Model
# Inside a formula, `**` is the interaction-power operator, not arithmetic,
# so the polynomial terms are wrapped in I() to get actual powers of RN_experience.
model_flex = ('lwage ~ female + age + race + children + marital '
              '+ education + left_nursing + country_ed_US + english_only + military + certificates'
              '+ labor_union + work_setting + work_situation + level_care + care_specialty + state'
              '+ (RN_experience + I(RN_experience**2) + I(RN_experience**3) + I(RN_experience**4))'
              '*(age + race + children + marital '
              '+ education + left_nursing + country_ed_US + english_only + military + certificates'
              '+ labor_union + work_setting + work_situation + level_care + care_specialty + state)')
flex = smf.ols(model_flex, data=df)
results_flex = flex.fit()

In [None]:
X, y = flex.data.exog, flex.data.endog
rsquared_flex = results_flex.rsquared
rsquared_adj_flex = results_flex.rsquared_adj
mse_flex = np.mean(results_flex.resid**2)
mse_adj_flex = results_flex.mse_resid
print(f'No_Features = {X.shape[1]:.0f}')
print(f'Rsquared = {rsquared_flex:.3f}')
print(f'Rsquared_adjusted = {rsquared_adj_flex:.3f}')
print(f'MSE = {mse_flex:.3f}')
print(f'MSE_adjusted = {mse_adj_flex:.3f}')

## Flexible model using Lasso

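Re-estimate the flexible model using Lasso (the least absolute shrinkage and selection operator) rather than OLS. We use scikit-learn's `LassoCV`, which tunes the regularization hyperparameter by cross-validation, and standardize the regressors first because the Lasso penalty is not invariant to their scale.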
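Before applying this to the wage data, the following sketch shows on simulated data how the cross-validated penalty and the selected coefficients can be inspected. The data-generating process and all variable names (`X_sim`, `beta_sim`, `lasso_sim`, ...) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data (illustrative): 50 regressors, only the first 3 matter
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(200, 50))
beta_sim = np.zeros(50)
beta_sim[:3] = [3.0, -2.0, 1.5]
y_sim = X_sim @ beta_sim + rng.normal(size=200)

# Standardize, then let LassoCV choose the penalty by 5-fold cross-validation
lasso_sim = Pipeline([('scale', StandardScaler()),
                      ('lasso', LassoCV(cv=5, random_state=0))])
lasso_sim.fit(X_sim, y_sim)

fitted = lasso_sim.named_steps['lasso']
alpha_cv = fitted.alpha_                       # penalty level picked by CV
n_nonzero = int(np.sum(fitted.coef_ != 0))     # number of selected regressors
print(f'alpha chosen by CV = {alpha_cv:.4f}')
print(f'nonzero coefficients = {n_nonzero} of {X_sim.shape[1]}')
```

The same two attributes, `alpha_` and `coef_`, are available on the `lasso` step of the pipeline fitted below.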

In [None]:
X = flex.data.exog[:, 1:] # exclude the intercept
y = flex.data.endog

# 3. Train model using Lasso with cross validation and variable normalization
lasso = Pipeline([('scale', StandardScaler()),
                  ('lasso', LassoCV())])
lasso.fit(X, y)

In [None]:
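n, p = X.shape
p += 1  # count the intercept
mse_lasso = np.mean((y - lasso.predict(X))**2)
mse_adj_lasso = mse_lasso * n / (n - p)
rsquared_lasso = 1 - mse_lasso / np.var(y)
# Adjusted R2 compares degrees-of-freedom-corrected variances, hence ddof=1
rsquared_adj_lasso = 1 - mse_adj_lasso / np.var(y, ddof=1)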
print(f'No_Features = {p:.0f}')
print(f'Rsquared = {rsquared_lasso:.3f}')
print(f'Rsquared_adjusted = {rsquared_adj_lasso:.3f}')
print(f'MSE = {mse_lasso:.3f}')
print(f'MSE_adjusted = {mse_adj_lasso:.3f}')

In [None]:
# Store results
res_df = pd.DataFrame()

res_df['Model'] = ['Basic Reg', 'Flexible Reg', 'Flexible Lasso']
res_df['p'] = [results_base.params.shape[0],
               results_flex.params.shape[0],
               results_flex.params.shape[0]]
res_df['R2'] = [rsquared_base, rsquared_flex, rsquared_lasso]
res_df['MSE'] = [mse_base, mse_flex, mse_lasso]
res_df['adj_R2'] = [rsquared_adj_base, rsquared_adj_flex, rsquared_adj_lasso]
res_df['adj_MSE'] = [mse_adj_base, mse_adj_flex, mse_adj_lasso]

res_df

In sample, the flexible model fits slightly better than the basic model.

Let's now use sample splitting.

## Out-of-sample performance

Having looked at in-sample fit, we now evaluate the models on their out-of-sample performance.


In [None]:
# Use smf.ols only to build the design matrix from the formula; fit with sm.OLS on the training split
tmp = smf.ols(model_base, data=df)
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, random_state=42)

# Predict out of sample
reg_basic = sm.OLS(y_train, X_train).fit()
yhat_reg_base = reg_basic.predict(X_test)

# Calculate out-of-sample MSE
MSE_test1 = np.mean((y_test - yhat_reg_base)**2)
R2_test1 = 1. - MSE_test1/np.var(y_test)

print("Test MSE for the basic model: "+ str(MSE_test1))
print("Test R2 for the basic model: "+ str(R2_test1))

In [None]:
# Use smf.ols only to build the design matrix from the formula; fit with sm.OLS on the training split
tmp = smf.ols(model_flex, data=df)
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, random_state=42)

# Predict out of sample
reg_flex = sm.OLS(y_train, X_train).fit()
yhat_reg_flex = reg_flex.predict(X_test)

# Calculate out-of-sample MSE
MSE_test2 = np.mean((y_test - yhat_reg_flex)**2)
R2_test2 = 1. - MSE_test2 / np.var(y_test)

print("Test MSE for the flexible model: "+ str(MSE_test2))
print("Test R2 for the flexible model: "+ str(R2_test2))

In [None]:
# Predict out of sample
lasso = Pipeline([('scale', StandardScaler()),
                  ('lasso', LassoCV())])
lasso.fit(X_train[:, 1:], y_train)
yhat_test_lasso = lasso.predict(X_test[:, 1:])

# Calculate out-of-sample MSE
MSE_test3 = np.mean((y_test - yhat_test_lasso)**2)
R2_test3 = 1. - MSE_test3 / np.var(y_test)

print("Test MSE for the lasso model: "+ str(MSE_test3))
print("Test R2 for the lasso model: "+ str(R2_test3))

In [None]:
# Store results
res_df2 = pd.DataFrame()

res_df2['Model'] = ['Basic Reg', 'Flexible Reg', 'Flexible Lasso']
res_df2['MSE_test'] = [MSE_test1, MSE_test2, MSE_test3]
res_df2['R2_test'] = [R2_test1, R2_test2, R2_test3]

res_df2

## Extra flexible model



In [None]:
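# Extra Flexible Model: all main effects plus all two-way interactions
# Inside a formula, `**` is the interaction-power operator, so powers of
# RN_experience are wrapped in I(); the trailing `**2` adds two-way interactions.
model_extra = ('lwage ~ (female + age + race + children + marital '
              '+ education + RN_experience + I(RN_experience**2) + I(RN_experience**3) + I(RN_experience**4)'
              '+ left_nursing + country_ed_US + english_only + military + certificates'
              '+ labor_union + work_setting + work_situation + level_care + care_specialty + state)**2')
tmp = smf.ols(model_extra, data=df) # builds the design matrix; the full-sample fit below is for in-sample metrics only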
print(f'No_Features = {tmp.data.exog.shape[1]:.0f}')

# In-sample fit
insamplefit = tmp.fit()
rsquared_ex = insamplefit.rsquared
rsquared_adj_ex = insamplefit.rsquared_adj
mse_ex = np.mean(insamplefit.resid**2)
mse_adj_ex = insamplefit.mse_resid
print(f'(In-sample) Rsquared = {rsquared_ex :.3f}')
print(f'(In-sample) Rsquared_adjusted = {rsquared_adj_ex :.3f}')
print(f'(In-sample) MSE = {mse_ex :.3f}')
print(f'(In-sample) MSE_adjusted = {mse_adj_ex:.3f}')

# Train test Split
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, random_state=42)

# Predict out of sample
reg_extra = sm.OLS(y_train, X_train).fit()
yhat_reg_extra = reg_extra.predict(X_test)

# Calculate out-of-sample MSE
MSE_test4 = np.mean((y_test - yhat_reg_extra)**2)
R2_test4 = 1. - MSE_test4 / np.var(y_test)

print(f'(Out-of-sample) MSE = {MSE_test4:.3f}')
print(f'(Out-of-sample) R2 = {R2_test4:.3f}')

Plain OLS overfits when the number of covariates is large relative to the sample size: the in-sample fit improves, but the out-of-sample performance suffers dramatically.
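The mechanism can be reproduced on simulated data. In the sketch below, which is an illustrative assumption rather than part of the wage analysis, only one of 80 regressors actually matters and the training sample has just 100 observations; all names (`X_sim`, `r2_in`, `r2_out`, ...) are hypothetical.

```python
import numpy as np

def r2(y, y_hat):
    """R^2 relative to the mean of the evaluated sample."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n_train, n_test, p = 100, 1000, 80

# Only the first regressor enters the true model; the other 79 are noise
X_sim = rng.normal(size=(n_train + n_test, p))
y_sim = X_sim[:, 0] + rng.normal(size=n_train + n_test)
X_sim = np.column_stack([np.ones(n_train + n_test), X_sim])  # add intercept

X_tr, X_te = X_sim[:n_train], X_sim[n_train:]
y_tr, y_te = y_sim[:n_train], y_sim[n_train:]

# OLS by least squares on the training sample only
beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

r2_in = r2(y_tr, X_tr @ beta_hat)
r2_out = r2(y_te, X_te @ beta_hat)
print(f'In-sample R2 = {r2_in:.3f}')
print(f'Out-of-sample R2 = {r2_out:.3f}')
```

With 81 parameters fitted on 100 observations, the in-sample $R^2$ is inflated by fitting noise, while the held-out $R^2$ reflects the estimation error in the 79 irrelevant coefficients.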

In [None]:
# Train model using Lasso
lasso = Pipeline([('scale', StandardScaler()),
                  ('lasso', LassoCV())])
lasso.fit(X_train[:, 1:], y_train)

# In-sample fit
yhat_train_lasso = lasso.predict(X_train[:, 1:])
R2_L = 1 - np.sum((yhat_train_lasso - y_train)**2) / np.sum((y_train - np.mean(y_train))**2)
pL = np.sum(lasso.named_steps['lasso'].coef_ != 0)
ntrain = len(X_train)
R2_adjL = 1 - (np.sum((yhat_train_lasso - y_train)**2) / (ntrain - pL - 1)) / (np.sum((y_train- np.mean(y_train))**2) / (ntrain - 1))
lasso_res = y_train - yhat_train_lasso
MSEL = np.mean(lasso_res**2)
MSE_adjL = (ntrain / (ntrain - pL - 1)) * MSEL
print(f'No_Nonzero_Features = {pL:.0f}')
print(f'(In-sample) Rsquared = {R2_L:.3f}')
print(f'(In-sample) Rsquared_adjusted = {R2_adjL:.3f}')
print(f'(In-sample) MSE = {MSEL:.3f}')
print(f'(In-sample) MSE_adjusted = {MSE_adjL:.3f}')

# Out-of-sample fit
yhat_test_lasso = lasso.predict(X_test[:, 1:])
MSE_test5 = np.mean((y_test - yhat_test_lasso)**2)
R2_test5 = 1. - MSE_test5 / np.var(y_test)

print(f'(Out-of-sample) MSE = {MSE_test5:.3f}')
print(f'(Out-of-sample) R2 = {R2_test5:.3f}')

The penalized regression model (Lasso) mitigates this overfitting.