### Introduction to Python and machine learning for economists (intermediate Python users) - FINAL EXAM

Date: 14/11/2025

**_PART 2 (10 points)_**

Last name: 
First name:
Student ID:

**Data and context**

* The data are based on Giacobino, Huillery, Michel and Sage (AEJ: Applied Economics 2024). 

* This paper studies the impact of providing scholarships for secondary education to adolescent girls in Niger on the incidence of child marriage. More specifically, the authors evaluate a program of the government of Niger in which girls entering middle school (approximately 13 years old) were randomly selected to receive a scholarship covering the cost of housing, food and school supplies for three years. The authors collected baseline data on all girls admitted to middle school in 2017 in 285 villages, provided that they were still enrolled in school at the time of the baseline survey in December 2017. A follow-up survey was conducted in August 2020.

* You are provided with the `GHMS_2024.dta` dataset, which contains data for 1,344 girls, of whom 680 were randomly assigned to the scholarship (`T100 = 1`). You also have access to a dictionary of all variables in the attached xlsx file (`dictionary_GHMS`). The treatment variable is `T100`, and the two key outcomes of interest are `m_gq_dropout` and `m_gq_Married`: respectively, the school enrollment status and the marriage status of girls in August 2020.

* Baseline covariates are identified by the prefix `b_`, which is followed by `gq` if the variable comes from the girls' questionnaire (measured at the girl level) and by `hq` if the variable was collected as part of the household questionnaire (information provided by an adult member of the household, generally measured at the household level).
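The naming convention above can be used to pick out groups of variables programmatically. The following is a minimal sketch, not the expected answer to any question: it writes a small made-up table to a temporary `.dta` file, reads it back with `pandas.read_stata`, and splits the baseline columns by prefix. The toy values and the temporary file name `toy.dta` are assumptions introduced for illustration only; the column names are borrowed from the exam text.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for GHMS_2024.dta; all values are made up.
toy = pd.DataFrame({
    "T100": [1, 0, 1, 0],
    "m_gq_dropout": [0, 1, 0, 1],
    "m_gq_Married": [0, 1, 0, 0],
    "b_gq_age13": [1, 0, 1, 1],
    "b_hq_hh_b18_muslim": [1, 1, 1, 0],
})

# Round-trip through a Stata file to illustrate read_stata.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")  # hypothetical file name
    toy.to_stata(path, write_index=False)
    df = pd.read_stata(path)

# Baseline covariates from the girls' and the household questionnaires.
girl_vars = [c for c in df.columns if c.startswith("b_gq")]
household_vars = [c for c in df.columns if c.startswith("b_hq")]

print(df.shape)
print(girl_vars, household_vars)
```

Note that a few baseline variables in the predictor list (for example `b_hh_size`) start with `b_` but carry neither `gq` nor `hq`, so a filter on `b_` alone is the broader selection.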


**Assignment**

Using the GHMS_2024.dta data, you are asked to complete the tasks listed below. You will present your analysis in this Jupyter Notebook. You should comment each cell of code in your notebook to show that you understand what you are doing. You are allowed to write your comments in English or in French.

1. Load the `GHMS_2024.dta` dataset into a DataFrame and describe it. Hint: you will need to use the `read_stata` function of pandas to load the data. (/1pt)
2. Using the sub-sample of girls assigned to the control group (C100 = 1) and the validation set approach, predict the school enrollment status of Nigerien girls approximately three years after they enter middle school (i.e., the endline enrollment status) using a **logistic** model. Use the list of baseline covariates proposed below (object `features`) as predictors. Use a seed of 3. Note: don't worry if you receive a "RuntimeWarning:" message at this stage. (/3pts)
3. Obtain the predicted values of the response variable and use them to compute the accuracy rate of your model. Compare it to the accuracy rate you would obtain if you predicted the endline enrollment status by flipping a coin. What is the percentage of improvement in accuracy that you achieve? (/2pts)
4. Fit a LASSO model to the same data, using the same predictors. Use 5-fold cross-validation to select the optimal lambda and a validation set to obtain the trained model. Always use a seed of 3. (/2pts)
5. Obtain the predicted values of the response variable for the test sample of your outer split and use them to compute the accuracy rate of your LASSO model. Compare it to the accuracy rate of your logistic model. What is the percentage of improvement in accuracy that you achieve? (/1pt) Hint: to obtain the matrix of predictor values for your test sample, you can use:

       for train_idx, test_idx in outer_valid.split(X):
           X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
           Y_train, Y_test = Y.iloc[train_idx], Y.iloc[test_idx]
 
6. Create a DataFrame containing one column for the labels of the predictor variables and another with the coefficient on each predictor variable in the trained LASSO model. Which predictor has the coefficient with the largest absolute value? Hint: you can use `DataFrame.sort_values(by='', ascending=False)` to sort your DataFrame (replace `DataFrame` with the name of your DataFrame and pass the name of the column on which you want to sort to the `by` argument). (/1pt)
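One point about the sorting hint may be worth illustrating. Sorting a signed coefficient column with `ascending=False` puts the largest positive value first, so a large negative coefficient ends up at the bottom. A minimal sketch of ranking by absolute value instead is shown here; the predictor labels `x1` to `x4`, the coefficient values, and the helper column `abs_coefficient` are all hypothetical.

```python
import pandas as pd

# Hypothetical predictor labels and coefficients, for illustration only.
coef_table = pd.DataFrame({
    "predictor": ["x1", "x2", "x3", "x4"],
    "coefficient": [0.02, -0.15, 0.00, 0.07],
})

# Rank on the magnitude of each coefficient rather than its signed value.
coef_table["abs_coefficient"] = coef_table["coefficient"].abs()
ranked = coef_table.sort_values(by="abs_coefficient", ascending=False)

# First row after sorting holds the largest coefficient in absolute value.
top = ranked.iloc[0]["predictor"]
print(ranked)
print(top)
```

In this toy table the negative coefficient on `x2` ranks first, which a sort on the signed column would have placed last.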

Note: We would normally need to use the logistic version of LASSO given that our response variable is binary. Here, we stick to standard LASSO for simplicity.
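Because standard LASSO is a linear regression, its predictions are continuous rather than 0/1, so they must be converted to classes before an accuracy rate can be computed. The sketch below shows one possible way to do this with scikit-learn on simulated data. It is not the reference solution: the use of `LassoCV`, the 25% test share, the 0.5 cut-off, and the simulated `X` and `Y` are all assumptions made for illustration. Note that scikit-learn calls the penalty `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, ShuffleSplit

rng = np.random.default_rng(3)
n, p = 400, 10
X = rng.normal(size=(n, p))
# Simulated binary outcome driven by the first two columns only.
Y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Outer validation split: a single random train/test partition.
# The 25% test share is an assumption, not taken from the exam.
outer_valid = ShuffleSplit(n_splits=1, test_size=0.25, random_state=3)
train_idx, test_idx = next(outer_valid.split(X))
X_train, X_test = X[train_idx], X[test_idx]
Y_train, Y_test = Y[train_idx], Y[test_idx]

# Inner 5-fold cross-validation selects the penalty on the training part.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=3)
lasso = LassoCV(cv=inner_cv).fit(X_train, Y_train)

# Continuous predictions turned into classes with an assumed 0.5 cut-off.
Y_hat = lasso.predict(X_test)
Y_class = (Y_hat >= 0.5).astype(int)
accuracy = (Y_class == Y_test).mean()

print(lasso.alpha_, accuracy)
```

The selected penalty is stored in `lasso.alpha_` and the fitted coefficients in `lasso.coef_`, one per column of `X`.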


**List of predictors to be included in your models**

features = ['b_hq_a2_wood', 'b_hq_a2_stone', 'b_hq_a2_brick', 'b_hq_a2_cement', 'b_hq_a6_Index_bw_Q2', 'b_hq_a6_Index_bw_Q3', 'b_hq_a6_Index_bw_Q4', 'b_hq_a6_Index_bw_miss', 'b_hq_d_Index_auto_cat2', 'b_hq_d_Index_auto_cat3', 'b_hq_d_Index_auto_cat4', 'b_hq_d_Index_auto_cat5', 'b_hq_d_Index_egal_cat2', 'b_hq_d_Index_egal_cat3', 'b_hq_d_Index_egal_cat4', 'b_hq_d_Index_egal_cat5', 'b_hq_e7_h', 'b_gq_age17', 'b_gq_age16', 'b_gq_age15', 'b_gq_age14', 'b_gq_age13', 'b_gq_age12', 'b_gq_age11', 'b_gq_age10', 'b_gq_c1_ever', 'b_gq_c16', 'b_gq_c26', 'b_gq_c29', 'b_gq_d42', 'b_gq_e7a_cat2m', 'b_gq_e7a_cat3m', 'b_gq_e12a_2', 'b_gq_e12a_3', 'b_gq_e12a_4', 'b_gq_e12a_5', 'b_gq_e12a_6', 'b_gq_e12a_7', 'b_gq_e12a_8', 'b_gq_g_Index_auto_cat2', 'b_gq_g_Index_auto_cat3', 'b_gq_g_Index_auto_cat4', 'b_gq_g_Index_egal_cat2', 'b_gq_g_Index_egal_cat3', 'b_gq_g_Index_egal_cat4', 'b_gq_k7_h', 'b_gq_k7_l', 'b_hh_size', 'b_hq_hh_b6', 'b_hq_hh_b9_30_35', 'b_hq_hh_b9_35_40', 'b_hq_hh_b9_40_45', 'b_hq_hh_b9_45_50', 'b_hq_hh_b9_50_55', 'b_hq_hh_b9_55_60', 'b_hq_hh_b9_60_65', 'b_hq_hh_b9_65_70', 'b_hq_hh_b9_70', 'b_hq_hh_b12a_0', 'b_hq_hh_b12a_missing', 'b_hq_hh_b10_poly', 'b_hq_hh_b10_wido', 'b_hq_hh_b17_Kanouri', 'b_hq_hh_b17_Peul', 'b_hq_hh_b17_Touareg', 'b_hq_hh_b17_Other', 'b_hq_hh_b19_1', 'b_hq_hh_b19_2', 'b_hq_hh_b19_3', 'b_hq_hh_b19_4', 'b_hq_hh_b19_5', 'b_hq_hh_b19_6', 'b_hq_hh_b18_muslim', 'b_hq_a6f_dum', 'b_hq_a6g_dum', 'b_gq_b_a_any', 'b_gq_c_40_all_16', 'b_gq_c_40_all_17', 'b_gq_c_40_all_18', 'b_gq_c_40_all_19', 'b_gq_c_40_all_20', 'b_gq_c_40_all_21', 'b_gq_c_40_all_22', 'b_gq_c_40_all_23', 'b_gq_c_40_all_24', 'b_gq_c_40_all_25', 'b_gq_c_40_all_26', 'b_gq_c_40_all_27', 'b_gq_c_40_all_28', 'b_gq_c_40_all_29', 'b_gq_c_40_all_30', 'b_gq_c_42_all', 'b_gq_d_knowledge', 'b_gq_f_control_cat2', 'b_gq_f_control_cat3', 'b_gq_f_control_cat4', 'b_gq_f_efficacite_cat2', 'b_gq_f_efficacite_cat3', 'b_gq_f_efficacite_cat4', 'b_gq_f_estime_low', 'b_gq_f_estime_ave', 'b_gq_f_estime_hig', 
'b_gq_f_estime_very_hig', 'b_gq_j_sp_cat2', 'b_gq_j_sp_cat3', 'b_gq_j_sp_cat4', 'b_admin_moyenne_cat2', 'b_admin_moyenne_cat3', 'b_admin_moyenne_cat4']
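To show how a predictor list of this kind feeds into a validation-set workflow, here is a minimal sketch on simulated data. It is one possible approach under stated assumptions, not the expected answer: the three-variable list `toy_features`, the simulated values, the `C100` column built as the complement of `T100`, the 25% test share, and the choice of scikit-learn's `LogisticRegression` are all introduced for illustration. scikit-learn applies an L2 penalty by default, so the sketch sets a very large `C` to approximate an unpenalized logistic model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 600

# Hypothetical short predictor list; names borrowed from the exam's list.
toy_features = ["b_gq_age13", "b_hh_size", "b_hq_hh_b18_muslim"]

# Simulated stand-in for the exam data; all values are made up.
df = pd.DataFrame({
    "T100": rng.integers(0, 2, size=n),
    "b_gq_age13": rng.integers(0, 2, size=n),
    "b_hh_size": rng.integers(2, 15, size=n),
    "b_hq_hh_b18_muslim": rng.integers(0, 2, size=n),
})
df["C100"] = 1 - df["T100"]  # assumption: control flag is the complement
signal = 0.3 * df["b_hh_size"] - 2.0 + rng.normal(size=n)
df["m_gq_dropout"] = (signal > 0).astype(int)

# Keep the control group only, then build predictors and response.
control = df[df["C100"] == 1]
X = control[toy_features]
Y = control["m_gq_dropout"]

# Validation set approach: one random train/test split with seed 3.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=3
)

# Large C makes the default L2 penalty negligible.
logit = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, Y_train)
accuracy = logit.score(X_test, Y_test)

# A fair coin is right half of the time, whatever the class balance.
coin_flip = 0.5
improvement_pct = 100 * (accuracy - coin_flip) / coin_flip
print(accuracy, improvement_pct)
```

The improvement is expressed relative to the coin-flip benchmark, so an accuracy of 0.75 would correspond to a 50% improvement.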


# **Good luck!**