In [1]:
%%capture
import stata_setup, os
if os.name == 'nt':
    stata_setup.config('C:/Program Files/Stata17/','mp')
else:
    stata_setup.config('/usr/local/stata17','mp')

## Preparing the data

In [2]:
%%stata -qui

use "../data/data", clear
rename log_flesch_kincaid_grade_level FKG
quietly tabulate year, generate(y_)
quietly tabulate cluster, generate(c_)

local journals  ecm jpe qje res  // AER is the base category

local jel_imp a_imp b_imp c_imp  e_imp f_imp g_imp h_imp i_imp j_imp k_imp ///
		l_imp m_imp n_imp o_imp p_imp q_imp r_imp y_imp z_imp // JEL code D is the base category




We use ```splitsample``` with the option ```split(.75 .25)``` to generate the variable ```sample```, which equals 1 for 75% of the observations and 2 for the remaining 25%. The assignment of each observation to group 1 or 2 is random, but the ```rseed()``` option makes the assignment reproducible.
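The idea behind a seeded random split can be sketched in plain Python. This is an illustration only, not Stata's algorithm: the function name `split_sample` and its arguments are hypothetical, and the resulting assignment will not match the one produced by ```splitsample``` even with the same seed.

```python
import random

def split_sample(n_obs, train_share=0.75, seed=42):
    """Return a list with 1 (training) or 2 (validation) for each observation."""
    rng = random.Random(seed)      # private generator: the seed fully determines the split
    order = list(range(n_obs))
    rng.shuffle(order)             # random permutation of the observation indices
    n_train = round(n_obs * train_share)
    sample = [2] * n_obs           # start with everything in the validation group
    for i in order[:n_train]:
        sample[i] = 1              # move the first n_train shuffled indices to training
    return sample

sample = split_sample(1000)
print(sample.count(1), sample.count(2))   # 750 250
```

Calling `split_sample(1000)` a second time returns the same list, which is the property that ```rseed()``` provides in the Stata cell.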

In [None]:
%%stata
splitsample , generate(sample) split(.75 .25) rseed(42)
label define slabel 1 "Training" 2 "Validation"
label values sample slabel
tabulate sample

## OLS

In [None]:
%%stata -qui
#delimit ;
regress FKG log_num_authors log_num_pages both_genders prop_women
            `journals' `jel_imp' y_2-y_20  c_2-c_215  jel_flag
        if sample==1;
estimates store ols;
#delimit cr

## Ridge

In [None]:
%%stata -qui
#delimit ;
elasticnet linear FKG log_num_authors log_num_pages both_genders prop_women
        `journals' `jel_imp' y_2-y_20  c_2-c_215  jel_flag
        if sample==1,
        alpha(0) lambda(1.4) nolog;
estimates store ridge;
#delimit cr

## Lasso

In [None]:
%%stata -qui
#delimit ;
lasso linear FKG log_num_authors log_num_pages both_genders prop_women
             `journals' `jel_imp' y_2-y_20  c_2-c_215  jel_flag
             if sample==1,
             lambda(0.004) nolog;
estimates store lasso;
#delimit cr

## Elastic Net

In [None]:
%%stata -qui
#delimit ;
elasticnet linear FKG log_num_authors log_num_pages both_genders prop_women
                  `journals' `jel_imp' y_2-y_20  c_2-c_215  jel_flag
                  if sample==1,
                  alpha(.0001) nolog;
estimates store elasticnet;
#delimit cr

## In- and Out-of-Sample Prediction

In [None]:
%%stata
lassogof ols ridge lasso elasticnet, over(sample)

<strong>Postselection</strong> coefficients should not be used with <em>elasticnet</em> and, in particular, with <em>ridge regression</em>. Ridge works by shrinking the coefficient estimates, and these shrunken (penalized) estimates are the ones that should be used for prediction. Postselection coefficients are OLS regression coefficients for the selected variables, and ridge always selects all variables, so postselection coefficients after ridge are simply OLS coefficients for all potential variables, which we do not want to use for prediction.
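A minimal sketch in plain Python may make the point concrete. It assumes a single predictor and the textbook ridge objective (residual sum of squares plus `penalty` times the squared slope). Stata scales its penalty differently, so `penalty` here is not comparable to the ```lambda()``` values used above. The data and the helper `fit_slope` are made up for illustration.

```python
def fit_slope(x, y, penalty=0.0):
    """Slope of y on one predictor after centering; penalty=0 gives plain OLS."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Minimizing sum((y - a - b*x)^2) + penalty * b^2 yields b = sxy / (sxx + penalty)
    return sxy / (sxx + penalty)

# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

ols_slope = fit_slope(x, y)                  # unpenalized estimate
ridge_slope = fit_slope(x, y, penalty=5.0)   # shrunken toward zero

print(ols_slope, ridge_slope)
```

Ridge keeps the predictor in the model, so refitting by OLS on the "selected" variable just returns `ols_slope` and discards the shrinkage that ridge was meant to apply.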
