### Introduction to Python and machine learning for economists (intermediate Python users) - FINAL EXAM

Date: 14/11/2025

**_PART 2 (10 points)_**

In [None]:
Last name: 
First name:
Student ID:

**Data and context**

* The data are based on Giacobino, Huillery, Michel and Sage (AEJ: Applied Economics 2024). 

* This paper studies the impact of providing scholarships for secondary education to adolescent girls in Niger on the incidence of child marriage. More specifically the authors evaluate a program of the government of Niger in which girls entering middle school (approximately 13 years old) were randomly selected to receive a scholarship covering the cost of housing, food and school supplies for three years. The authors collected baseline data on all girls admitted to middle school in 2017 in 285 villages provided that they were still enrolled in school at the time of the baseline survey in December 2017. A follow-up survey was conducted in August 2020.

* You are provided with the GHMS_2024.dta database which contains data for 1,344 girls of whom 680 were randomly assigned to the scholarship (T100 = 1).
You also have access to a dictionary for all variables in the attached xlsx file (dictionary_GHMS). The treatment variable is T100 and the two key outcomes of interest are `m_gq_dropout` and `m_gq_Married`, the school enrolment status and marriage status of girls in August 2020 respectively.

* Baseline covariates are identified by the prefix ‘b_’ which is followed by ‘gq’ if the variable comes from the girls’ questionnaire (measured at the girl level) and by ‘hq’ if the variable was collected as part of the household questionnaire (information provided by an adult member of the household, generally measured at the household level).


**Assignment**

Using the GHMS_2024.dta data, you are asked to complete the tasks listed below. You will present your analysis in this Jupyter Notebook. You should comment each cell of code in your notebook to show that you understand what you are doing. You are allowed to write your comments in English or in French.

1.	Load the GHMS_2024.dta dataset in a dataframe and describe it. Hint: you will need to use the read_stata function of pandas to load the data. (/1pt)
2.  Using the sub-sample of girls assigned to the control group (C100 = 1) and the validation set approach, predict the school enrollment status of Nigerian girls approximately three years after they enter middle school (i.e: the endline enrollment status) using a **logistic** model. Use the list of baseline covariates proposed below (object `features`) as predictors. Use a seed of 3. Note: don't worry if you receive a "RuntimeWarning:" message at this stage. (/3pts)
3. Obtain the predicted values of the response variable and use them to compute the accuracy rate of your model. Compare it to the accuracy rate you would obtain if you predicted the endline enrollment status by flipping a coin. What is the percentage of improvement in accuracy that you achieve?   (/2 pts)
4. Fit a LASSO model to the same data, using the same predictors. Use 5-fold cross-validation to select the optimal lambda and a validation set to obtain the trained model. Always use a seed of 3. (/2 pts)
5. Obtain the predicted values of the response variable for the test sample of your outer split and use them to compute the accuracy rate of your LASSO model. Compare it to the accuracy rate of your logistic model. What is the percentage of improvement in accuracy that you achieve? (/1pt). Hint: To obtain the matrix of predictor values for you test sample, you can use:

`for train_idx, test_idx in outer_valid.split(X):`

    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    Y_train, Y_test = Y.iloc[train_idx], Y.iloc[test_idx]
 
6. Create a dataframe containing one column for the labels of predictor variables and another one with the coefficients on each predictor variable in the trained LASSO model. What is the predictor for which the coefficient has the largest absolute value? Hint : you can use DataFrame.sort_values(by = '', ascending=False) to sort your dataframe (replace DataFrame by the name of your dataframe and pass the name of the column on which you want to sort to the `by` argument). (/1pt)

Note: We would normally need to use the logistic version of LASSO given that our response variable is binary. Here, we stick to standard LASSO for simplicity.


**List of predictors to be included in your models**

features = ['b_hq_a2_wood', 'b_hq_a2_stone', 'b_hq_a2_brick', 'b_hq_a2_cement', 'b_hq_a6_Index_bw_Q2', 'b_hq_a6_Index_bw_Q3', 'b_hq_a6_Index_bw_Q4', 'b_hq_a6_Index_bw_miss', 'b_hq_d_Index_auto_cat2', 'b_hq_d_Index_auto_cat3', 'b_hq_d_Index_auto_cat4', 'b_hq_d_Index_auto_cat5', 'b_hq_d_Index_egal_cat2', 'b_hq_d_Index_egal_cat3', 'b_hq_d_Index_egal_cat4', 'b_hq_d_Index_egal_cat5', 'b_hq_e7_h', 'b_gq_age17', 'b_gq_age16', 'b_gq_age15', 'b_gq_age14', 'b_gq_age13', 'b_gq_age12', 'b_gq_age11', 'b_gq_age10', 'b_gq_c1_ever', 'b_gq_c16', 'b_gq_c26', 'b_gq_c29', 'b_gq_d42', 'b_gq_e7a_cat2m', 'b_gq_e7a_cat3m', 'b_gq_e12a_2', 'b_gq_e12a_3', 'b_gq_e12a_4', 'b_gq_e12a_5', 'b_gq_e12a_6', 'b_gq_e12a_7', 'b_gq_e12a_8', 'b_gq_g_Index_auto_cat2', 'b_gq_g_Index_auto_cat3', 'b_gq_g_Index_auto_cat4', 'b_gq_g_Index_egal_cat2', 'b_gq_g_Index_egal_cat3', 'b_gq_g_Index_egal_cat4', 'b_gq_k7_h', 'b_gq_k7_l', 'b_hh_size', 'b_hq_hh_b6', 'b_hq_hh_b9_30_35', 'b_hq_hh_b9_35_40', 'b_hq_hh_b9_40_45', 'b_hq_hh_b9_45_50', 'b_hq_hh_b9_50_55', 'b_hq_hh_b9_55_60', 'b_hq_hh_b9_60_65', 'b_hq_hh_b9_65_70', 'b_hq_hh_b9_70', 'b_hq_hh_b12a_0', 'b_hq_hh_b12a_missing', 'b_hq_hh_b10_poly', 'b_hq_hh_b10_wido', 'b_hq_hh_b17_Kanouri', 'b_hq_hh_b17_Peul', 'b_hq_hh_b17_Touareg', 'b_hq_hh_b17_Other', 'b_hq_hh_b19_1', 'b_hq_hh_b19_2', 'b_hq_hh_b19_3', 'b_hq_hh_b19_4', 'b_hq_hh_b19_5', 'b_hq_hh_b19_6', 'b_hq_hh_b18_muslim', 'b_hq_a6f_dum', 'b_hq_a6g_dum', 'b_gq_b_a_any', 'b_gq_c_40_all_16', 'b_gq_c_40_all_17', 'b_gq_c_40_all_18', 'b_gq_c_40_all_19', 'b_gq_c_40_all_20', 'b_gq_c_40_all_21', 'b_gq_c_40_all_22', 'b_gq_c_40_all_23', 'b_gq_c_40_all_24', 'b_gq_c_40_all_25', 'b_gq_c_40_all_26', 'b_gq_c_40_all_27', 'b_gq_c_40_all_28', 'b_gq_c_40_all_29', 'b_gq_c_40_all_30', 'b_gq_c_42_all', 'b_gq_d_knowledge', 'b_gq_f_control_cat2', 'b_gq_f_control_cat3', 'b_gq_f_control_cat4', 'b_gq_f_efficacite_cat2', 'b_gq_f_efficacite_cat3', 'b_gq_f_efficacite_cat4', 'b_gq_f_estime_low', 'b_gq_f_estime_ave', 'b_gq_f_estime_hig', 'b_gq_f_estime_very_hig', 'b_gq_j_sp_cat2', 'b_gq_j_sp_cat3', 'b_gq_j_sp_cat4', 'b_admin_moyenne_cat2', 'b_admin_moyenne_cat3', 'b_admin_moyenne_cat4']

In [5]:
!pip install ISLP

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from ISLP.models import sklearn_sm
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Collecting ISLP
  Downloading ISLP-0.4.0-py3-none-any.whl.metadata (7.0 kB)
Collecting lxml (from ISLP)
  Downloading lxml-6.0.2-cp313-cp313-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (3.6 kB)
Collecting lifelines (from ISLP)
  Downloading lifelines-0.30.0-py3-none-any.whl.metadata (3.2 kB)
Collecting pygam (from ISLP)
  Downloading pygam-0.10.1-py3-none-any.whl.metadata (9.7 kB)
Collecting torch (from ISLP)
  Downloading torch-2.9.1-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting pytorch-lightning (from ISLP)
  Downloading pytorch_lightning-2.5.6-py3-none-any.whl.metadata (20 kB)
Collecting torchmetrics (from ISLP)
  Downloading torchmetrics-1.8.2-py3-none-any.whl.metadata (22 kB)
Collecting autograd>=1.5 (from lifelines->ISLP)
  Downloading autograd-1.8.0-py3-none-any.whl.metadata (7.5 kB)
Collecting autograd-gamma>=0.3 (from lifelines->ISLP)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Installing build dependencies ... [?25ldone
[?25h  Getti

**1. Load the GHMS_2024.dta dataset in a dataframe and describe it. Hint: you will need to use the read_stata function of pandas to load the data.**

In [10]:
ghms = pd.read_stata('GHMS_2024.dta')
ghms.describe()

Unnamed: 0,hh_id,sample,clus,T100,C100,m_gq_dropout,m_gq_Married,b_hq_a2_wood,b_hq_a2_stone,b_hq_a2_brick,...,b_gq_f_estime_ave,b_gq_f_estime_hig,b_gq_f_estime_very_hig,b_gq_j_sp_cat1,b_gq_j_sp_cat2,b_gq_j_sp_cat3,b_gq_j_sp_cat4,b_admin_moyenne_cat2,b_admin_moyenne_cat3,b_admin_moyenne_cat4
count,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,...,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0,1344.0
mean,2186.673363,1.0,144.235119,0.505952,0.494048,0.297619,0.107887,0.123512,0.273065,0.05878,...,0.316964,0.234375,0.002232,0.158482,0.40997,0.272321,0.159226,0.272321,0.200149,0.122024
std,685.601257,0.0,84.506538,0.500151,0.500151,0.457381,0.310353,0.329146,0.4457,0.2353,...,0.465467,0.423765,0.04721,0.365328,0.492011,0.44532,0.366023,0.44532,0.40026,0.327435
min,1001.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1597.75,1.0,66.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2162.5,1.0,146.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,2814.25,1.0,215.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
max,3366.0,1.0,284.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**2. Using the sub-sample of girls assigned to the control group (C100 = 1) and the validation set approach, predict the school enrollment status of Nigerian girls approximately three years after they enter middle school (endline enrollment status) using a logit model. Use all available baseline covariates as predictors. Use a seed of 3. Hint: you will need to drop a few variables which do not make sense as predictors from the list of predictors (such as the unique IDs of individuals and households, you can also drop the strata).**

In [11]:
#features = ['b_hq_a2_wood', 'b_hq_a2_stone', 'b_hq_a2_brick', 'b_hq_a2_cement', 'b_hq_a6_Index_bw_Q2', 'b_hq_a6_Index_bw_Q3', 'b_hq_a6_Index_bw_Q4', 'b_hq_a6_Index_bw_miss', 'b_hq_d_Index_auto_cat2', 'b_hq_d_Index_auto_cat3', 'b_hq_d_Index_auto_cat4', 'b_hq_d_Index_auto_cat5', 'b_hq_d_Index_egal_cat2', 'b_hq_d_Index_egal_cat3', 'b_hq_d_Index_egal_cat4', 'b_hq_d_Index_egal_cat5', 'b_hq_e7_h', 'b_hq_e7_m', 'b_gq_agem', 'b_gq_age17', 'b_gq_age16', 'b_gq_age15', 'b_gq_age14', 'b_gq_age13', 'b_gq_age12', 'b_gq_age11', 'b_gq_age10', 'b_gq_c1_ever', 'b_gq_c16', 'b_gq_c26', 'b_gq_c29', 'b_gq_d42', 'b_gq_e7a_m', 'b_gq_e7a_cat2m', 'b_gq_e7a_cat3m', 'b_gq_e12a_2', 'b_gq_e12a_3', 'b_gq_e12a_4', 'b_gq_e12a_5', 'b_gq_e12a_6', 'b_gq_e12a_7', 'b_gq_e12a_8', 'b_gq_g_Index_auto_cat2', 'b_gq_g_Index_auto_cat3', 'b_gq_g_Index_auto_cat4', 'b_gq_g_Index_egal_cat2', 'b_gq_g_Index_egal_cat3', 'b_gq_g_Index_egal_cat4', 'b_gq_k7_h', 'b_gq_k7_l', 'b_gq_k7_m', 'b_hh_size', 'b_hq_hh_b6', 'b_hq_hh_b6_missing', 'b_hq_hh_b9_30_35', 'b_hq_hh_b9_35_40', 'b_hq_hh_b9_40_45', 'b_hq_hh_b9_45_50', 'b_hq_hh_b9_50_55', 'b_hq_hh_b9_55_60', 'b_hq_hh_b9_60_65', 'b_hq_hh_b9_65_70', 'b_hq_hh_b9_70', 'b_hq_hh_b9_m', 'b_hq_hh_b12a_0', 'b_hq_hh_b12a_missing', 'b_hq_hh_b10_poly', 'b_hq_hh_b10_wido', 'b_hq_hh_b10_missing', 'b_hq_hh_b17_Kanouri', 'b_hq_hh_b17_Peul', 'b_hq_hh_b17_Touareg', 'b_hq_hh_b17_Other', 'b_hq_hh_b17_Missing', 'b_hq_hh_b19_1', 'b_hq_hh_b19_2', 'b_hq_hh_b19_3', 'b_hq_hh_b19_4', 'b_hq_hh_b19_5', 'b_hq_hh_b19_6', 'b_hq_hh_b18_muslim', 'b_hq_hh_b18_missing', 'b_hq_a6f_dum', 'b_hq_a6g_dum', 'b_gq_b_a_any', 'b_gq_c_40_all_m', 'b_gq_c_40_all_16', 'b_gq_c_40_all_17', 'b_gq_c_40_all_18', 'b_gq_c_40_all_19', 'b_gq_c_40_all_20', 'b_gq_c_40_all_21', 'b_gq_c_40_all_22', 'b_gq_c_40_all_23', 'b_gq_c_40_all_24', 'b_gq_c_40_all_25', 'b_gq_c_40_all_26', 'b_gq_c_40_all_27', 'b_gq_c_40_all_28', 'b_gq_c_40_all_29', 'b_gq_c_40_all_30', 'b_gq_c_42_all', 'b_gq_d_knowledge', 'b_gq_f_control_cat2', 'b_gq_f_control_cat3', 'b_gq_f_control_cat4', 'b_gq_f_efficacite_cat2', 'b_gq_f_efficacite_cat3', 'b_gq_f_efficacite_cat4', 'b_gq_f_estime_low', 'b_gq_f_estime_ave', 'b_gq_f_estime_hig', 'b_gq_f_estime_very_hig', 'b_gq_j_sp_cat2', 'b_gq_j_sp_cat3', 'b_gq_j_sp_cat4', 'b_admin_moyenne_cat2', 'b_admin_moyenne_cat3', 'b_admin_moyenne_cat4']

features = ['b_hq_a2_wood', 'b_hq_a2_stone', 'b_hq_a2_brick', 'b_hq_a2_cement', 'b_hq_a6_Index_bw_Q2', 'b_hq_a6_Index_bw_Q3', 'b_hq_a6_Index_bw_Q4', 'b_hq_a6_Index_bw_miss', 'b_hq_d_Index_auto_cat2', 'b_hq_d_Index_auto_cat3', 'b_hq_d_Index_auto_cat4', 'b_hq_d_Index_auto_cat5', 'b_hq_d_Index_egal_cat2', 'b_hq_d_Index_egal_cat3', 'b_hq_d_Index_egal_cat4', 'b_hq_d_Index_egal_cat5', 'b_hq_e7_h', 'b_gq_age17', 'b_gq_age16', 'b_gq_age15', 'b_gq_age14', 'b_gq_age13', 'b_gq_age12', 'b_gq_age11', 'b_gq_age10', 'b_gq_c1_ever', 'b_gq_c16', 'b_gq_c26', 'b_gq_c29', 'b_gq_d42', 'b_gq_e7a_cat2m', 'b_gq_e7a_cat3m', 'b_gq_e12a_2', 'b_gq_e12a_3', 'b_gq_e12a_4', 'b_gq_e12a_5', 'b_gq_e12a_6', 'b_gq_e12a_7', 'b_gq_e12a_8', 'b_gq_g_Index_auto_cat2', 'b_gq_g_Index_auto_cat3', 'b_gq_g_Index_auto_cat4', 'b_gq_g_Index_egal_cat2', 'b_gq_g_Index_egal_cat3', 'b_gq_g_Index_egal_cat4', 'b_gq_k7_h', 'b_gq_k7_l', 'b_hh_size', 'b_hq_hh_b6', 'b_hq_hh_b9_30_35', 'b_hq_hh_b9_35_40', 'b_hq_hh_b9_40_45', 'b_hq_hh_b9_45_50', 'b_hq_hh_b9_50_55', 'b_hq_hh_b9_55_60', 'b_hq_hh_b9_60_65', 'b_hq_hh_b9_65_70', 'b_hq_hh_b9_70', 'b_hq_hh_b12a_0', 'b_hq_hh_b12a_missing', 'b_hq_hh_b10_poly', 'b_hq_hh_b10_wido', 'b_hq_hh_b17_Kanouri', 'b_hq_hh_b17_Peul', 'b_hq_hh_b17_Touareg', 'b_hq_hh_b17_Other', 'b_hq_hh_b19_1', 'b_hq_hh_b19_2', 'b_hq_hh_b19_3', 'b_hq_hh_b19_4', 'b_hq_hh_b19_5', 'b_hq_hh_b19_6', 'b_hq_hh_b18_muslim', 'b_hq_a6f_dum', 'b_hq_a6g_dum', 'b_gq_b_a_any', 'b_gq_c_40_all_16', 'b_gq_c_40_all_17', 'b_gq_c_40_all_18', 'b_gq_c_40_all_19', 'b_gq_c_40_all_20', 'b_gq_c_40_all_21', 'b_gq_c_40_all_22', 'b_gq_c_40_all_23', 'b_gq_c_40_all_24', 'b_gq_c_40_all_25', 'b_gq_c_40_all_26', 'b_gq_c_40_all_27', 'b_gq_c_40_all_28', 'b_gq_c_40_all_29', 'b_gq_c_40_all_30', 'b_gq_c_42_all', 'b_gq_d_knowledge', 'b_gq_f_control_cat2', 'b_gq_f_control_cat3', 'b_gq_f_control_cat4', 'b_gq_f_efficacite_cat2', 'b_gq_f_efficacite_cat3', 'b_gq_f_efficacite_cat4', 'b_gq_f_estime_low', 'b_gq_f_estime_ave', 'b_gq_f_estime_hig', 'b_gq_f_estime_very_hig', 'b_gq_j_sp_cat2', 'b_gq_j_sp_cat3', 'b_gq_j_sp_cat4', 'b_admin_moyenne_cat2', 'b_admin_moyenne_cat3', 'b_admin_moyenne_cat4']


In [12]:
#Filter dataframe and select relevant columns
filt = ghms['C100'] == 1
df = ghms.loc[filt, :]

In [13]:
#Create response vector Y and matrix of predictors X

df_train, df_test = train_test_split(df,
                                         test_size=0.25,
                                         random_state=3)

Y_train, Y_test = df_train['m_gq_dropout'], df_test['m_gq_dropout']


X_train, X_test = df_train[features], df_test[features]

In [14]:
#Standardize and add constants to X matrices
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_std = sm.add_constant(X_train_scaled) 
X_test_std = sm.add_constant(X_test_scaled)

#Fit logistic model on training data
glm = sm.GLM(Y_train,
             X_train_scaled,
             family=sm.families.Binomial())
results_logit = glm.fit()
results_logit.summary()


  t = np.exp(-z)
  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +
  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +


0,1,2,3
Dep. Variable:,m_gq_dropout,No. Observations:,498.0
Model:,GLM,Df Residuals:,396.0
Model Family:,Binomial,Df Model:,101.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,
Date:,"Fri, 14 Nov 2025",Deviance:,15289.0
Time:,17:52:25,Pearson chi2:,7.48e+17
No. Iterations:,79,Pseudo R-squ. (CS):,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,3.894e+14,3.59e+06,1.09e+08,0.000,3.89e+14,3.89e+14
x2,1.426e+14,3.61e+06,3.95e+07,0.000,1.43e+14,1.43e+14
x3,1.504e+14,3.58e+06,4.2e+07,0.000,1.5e+14,1.5e+14
x4,-9.073e+13,3.52e+06,-2.58e+07,0.000,-9.07e+13,-9.07e+13
x5,9.862e+14,4.09e+06,2.41e+08,0.000,9.86e+14,9.86e+14
x6,3.603e+14,4.34e+06,8.29e+07,0.000,3.6e+14,3.6e+14
x7,-1.437e+14,4.61e+06,-3.12e+07,0.000,-1.44e+14,-1.44e+14
const,1.5487,2.31e-08,6.7e+07,0.000,1.549,1.549
x8,-7.347e+14,1.39e+07,-5.28e+07,0.000,-7.35e+14,-7.35e+14


**3. Obtain the predicted values of the response variable and use them to compute the accuracy rate of your model. Compare it to the accuracy rate you would obtain if you predicted the endline enrollment status by flipping a coin. What is the percentage of improvement in accuracy that you achieve?**

In [15]:
#Compute predicted probabilities
probs_logit = results_logit.predict(X_test_std)

#Transform probabilities into predicted failure/success
Y_pred_logit = np.array([0]*X_test_std.shape[0])
Y_pred_logit[probs_logit>0.5]= 1

#Compute error and accuracy rates
error_rate_logit = np.mean(Y_pred_logit != Y_test)
accuracy_logit = 1 - error_rate_logit
print(accuracy_logit)

#Compute percentage of improvement
imp = (accuracy_logit - 0.5) / 0.5 * 100
print("Percentage of improvement:", imp)

0.5602409638554218
Percentage of improvement: 12.048192771084354


  t = np.exp(-z)


**4. Fit a LASSO model to the same data, using the same predictors. Use 5-fold cross-validation to select the optimal lambda and a validation set to obtain a test MSE. Always use a seed of 3.**

In [16]:
# Create Y and X
Y = df['m_gq_dropout']
X = df[features]

# Create search grid for tuning parameter (lambda)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()

# Choose splitting rule for outer split (train set vs. test set)
outer_valid = skm.ShuffleSplit(n_splits=1, 
                               test_size=0.25,
                               random_state=3)

# Choose splitting rule for cross-validation step (used to select optimal lambda)
inner_cv = skm.KFold(n_splits=5,
                     shuffle=True,
                     random_state=3)

# Program lasso regression with lambda selected based on inner_CV
lassoCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=1,
                           cv=inner_cv)

# Standardize predictors: 
scaler = StandardScaler(with_mean=True,  with_std=True)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('lasso', lassoCV)])

# Obtain fitted model from validation set using outer split
results_lasso = skm.cross_validate(pipeCV, 
                             X,
                             Y,
                             cv=outer_valid,
                             scoring='neg_mean_squared_error',
                             return_estimator=True) #Add this line to store estimator results (to print selected alpha below)

est = results_lasso['estimator'][0]


**5. Obtain the predicted values of the response variable and use them to compute the accuracy rate of your LASSO model. Compare it to the accuracy rate of your logit model. What is the percentage of improvement in accuracy that you achieve?**

In [17]:
#Obtain matrix of predictor values for the test sample in the outer split
for train_idx, test_idx in outer_valid.split(X):

    X_train_lasso, X_test_lasso = X.iloc[train_idx], X.iloc[test_idx]
    Y_train_lasso, Y_test_lasso = Y.iloc[train_idx], Y.iloc[test_idx]

#Compute success probabilities and convert them to dummy
prob_lasso = est.predict(X_test_lasso)
Y_pred_lasso = np.array([0]*X_test_lasso.shape[0])
Y_pred_lasso[prob_lasso>0.5]= 1

#Compute accuracy rate of LASSO
error_rate_lasso = np.mean(Y_pred_lasso != Y_test_lasso)
accuracy_lasso = 1 - error_rate_lasso
print(accuracy_lasso)
np.mean(Y_test)

#Compute percentage of improvement over coin
imp_coin = (accuracy_lasso - 0.5) / 0.5 * 100
print("Percentage of improvement over coin:", imp_coin)

#Compute percentage of improvement over logit
imp_logit = (accuracy_lasso - accuracy_logit) / accuracy_logit * 100
print("Percentage of improvement over logit:", imp_logit)

0.608433734939759
Percentage of improvement over coin: 21.68674698795181
Percentage of improvement over logit: 8.602150537634396


**6. Create a dataframe containing one column for the labels of predictor variables and another one with the coefficients on each predictor variable in the trained LASSO model. What is the predictor for which the coefficient has the largest absolute value? Hint : you can use DataFrame.sort_values(by = '', ascending=False) to sort your dataframe (replace DataFrame by the name of your dataframe and pass the name of the column on which you want to sort to the by argument)**

In [18]:
#Print selected coefs
selected_coefs = est.named_steps['lasso'].coef_
coefs = pd.DataFrame({'coefs': selected_coefs, 'labels': features})
coefs = coefs.sort_values(by='coefs', ascending=False)
coefs

Unnamed: 0,coefs,labels
40,0.009134,b_gq_g_Index_auto_cat3
76,0.002621,b_gq_c_40_all_16
25,0.000568,b_gq_c1_ever
17,0.000095,b_gq_age17
2,-0.000000,b_hq_a2_brick
...,...,...
84,-0.010893,b_gq_c_40_all_24
1,-0.013641,b_hq_a2_stone
55,-0.015289,b_hq_hh_b9_60_65
61,-0.021764,b_hq_hh_b10_wido


The coefficient with the largest absolute magnitude is the coefficient on b_hq_a6g_dum (TV ownership dummy).