# 2. Grameen Bank 

## 2.a.

Using Khandker’s Grameen Bank dataset hh_98.dta, estimate the intent-to-treat (ITT) effect, i.e. the
impact on household consumption, measured as log total expenditures (lexptot), of having i) a female
microcredit program in the village (progvillf) and ii) a male microcredit program in the
village (progvillm), without using control variables in either case. Remember that since you
are carrying out a regression on log total expenditures, you can interpret your coefficients as
a percentage increase or decrease in total expenditures. Interpret your results in this light. Can
you give reasons for why your results are what they are? How might the lack of control variables
have contributed to your results?

**Answer**

Households in villages with a female microcredit program spend, on average, ~13% more than households in villages without the program. The result is statistically significant (p = 0.045 < 0.05).

Households in villages with a male microcredit program spend, on average, 4.74% less, but this effect is not statistically significant (p = 0.130 > 0.05).

These results could reflect household spending patterns: women may be more likely to manage household spending, so credit extended to them may go more directly toward household expenses.

The lack of control variables can cause omitted variable bias and make our estimates less precise.
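
Reading a coefficient from a log-outcome regression directly as a percentage is an approximation that works well for small values. As a minimal sketch, the helper below (`log_points_to_percent` is a hypothetical name, and the coefficients are illustrative values rather than numbers from the regression output) converts a coefficient on a 0/1 regressor into the exact implied percent change, 100 · (exp(β) − 1):

```python
import math


def log_points_to_percent(beta):
    """Exact percent change implied by a coefficient on a 0/1 regressor
    when the outcome is in logs: 100 * (exp(beta) - 1)."""
    return 100.0 * (math.exp(beta) - 1.0)


# Illustrative coefficients only (not taken from the regression output)
for beta in (0.05, 0.10, -0.05):
    print(f"beta = {beta:+.2f} -> {log_points_to_percent(beta):+.2f}%")
```

For coefficients of the size discussed here, the exact and approximate readings differ by well under one percentage point, so the interpretation above is unaffected.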

In [17]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_white
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
# Import the dataset
df = pd.read_csv('../data/hh_98.csv')  # named df to match the cells below
df.head()

| | nh | year | villid | thanaid | agehead | sexhead | educhead | famsize | hhland | hhasset | ... | rice | wheat | milk | potato | egg | oil | lexptot | lnland | progvillm | progvillf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11054 | 1 | 1 | 1 | 79 | 1 | 0 | 2 | 36.0 | 33295.0 | ... | 12.631389 | 8.120178 | 11.503587 | 8.547428 | 2.199215 | 40.600895 | 9.159501 | 0.307485 | 1 | 1 |
| 1 | 11061 | 1 | 1 | 1 | 43 | 1 | 6 | 4 | 116.0 | 180325.0 | ... | 12.631389 | 8.120178 | 11.503587 | 8.547428 | 2.199215 | 40.600895 | 9.863307 | 0.770108 | 1 | 1 |
| 2 | 11081 | 1 | 1 | 1 | 52 | 0 | 0 | 7 | 91.0 | 80735.0 | ... | 12.631389 | 8.120178 | 11.503587 | 8.547428 | 2.199215 | 40.600895 | 8.923725 | 0.647103 | 1 | 1 |
| 3 | 11101 | 1 | 1 | 1 | 48 | 1 | 0 | 7 | 8.0 | 16755.0 | ... | 12.631389 | 8.120178 | 11.503587 | 8.547428 | 2.199215 | 40.600895 | 8.582025 | 0.076961 | 1 | 1 |
| 4 | 12021 | 1 | 2 | 1 | 35 | 1 | 10 | 5 | 10.0 | 18795.0 | ... | 10.150224 | 6.090134 | 10.826905 | 6.868469 | 2.030045 | 43.307621 | 10.113386 | 0.09531 | 0 | 1 |


In [11]:
df['lexptot'] = np.log(df['exptot'] + 1)

# Female program ITT
model_f = smf.ols("lexptot ~ progvillf", data=df).fit()
print(model_f.summary())

# Male program ITT
model_m = smf.ols("lexptot ~ progvillm", data=df).fit()
print(model_m.summary())


                            OLS Regression Results                            
Dep. Variable:                lexptot   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     4.035
Date:                Fri, 12 Sep 2025   Prob (F-statistic):             0.0448
Time:                        22:54:11   Log-Likelihood:                -847.79
No. Observations:                1129   AIC:                             1700.
Df Residuals:                    1127   BIC:                             1710.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.3285      0.063    132.843      0.0

## 2.b. 

Now include the following controls (see Khandker p. 147 for definitions) in both of these
estimations: sexhead agehead educhead lnland vaccess pcirr rice
wheat milk oil egg. How do results change if at all?

**Answer**

Households in villages with a female microcredit program still spend more on average (~10%), but the effect is slightly smaller than in the naive model (~13%). Statistical significance has also weakened (from p = 0.045 to p = 0.075).

This means that part of the naive effect was likely due to pre-treatment differences in household characteristics (wealth, land, education, access), which are now controlled for.

Households in villages with a male microcredit program spend slightly less on average (~5.8%). Including controls made this negative effect statistically significant.
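
The role of controls can be illustrated with a small simulation. The sketch below is not based on hh_98: it assumes a hypothetical data-generating process in which a `wealth` variable raises both the chance of living in a program village and household spending, with an assumed true program effect of 0.05 log points. The naive regression then overstates the effect, and adding the control brings the estimate back toward the assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: wealthier households are more
# likely to be in a program village AND spend more.
wealth = rng.normal(size=n)
program = (wealth + rng.normal(size=n) > 0).astype(float)
true_effect = 0.05
log_exp = 8.0 + true_effect * program + 0.30 * wealth + rng.normal(scale=0.5, size=n)


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


naive = ols(log_exp, program)[1]               # omits wealth
controlled = ols(log_exp, program, wealth)[1]  # controls for wealth

print(f"naive estimate:      {naive:.3f}")
print(f"controlled estimate: {controlled:.3f}")
```

In this simulation the control moves the estimate toward the assumed true effect because the omitted variable is positively correlated with both treatment and outcome; with other correlation patterns the bias could go in the opposite direction.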

In [14]:
df.columns

Index(['nh', 'year', 'villid', 'thanaid', 'agehead', 'sexhead', 'educhead',
       'famsize', 'hhland', 'hhasset', 'expfd', 'expnfd', 'exptot', 'dmmfd',
       'dfmfd', 'weight', 'vaccess', 'pcirr', 'rice', 'wheat', 'milk',
       'potato', 'egg', 'oil', 'lexptot', 'lnland', 'progvillm', 'progvillf'],
      dtype='object')

In [23]:
# Female program ITT with controls
# (land enters as hhland in this specification; the question lists lnland)
model_f = smf.ols("lexptot ~ progvillf + sexhead + agehead + educhead + " \
                  "hhland + vaccess + pcirr + rice + wheat + milk + oil + egg", data=df).fit()
print(model_f.summary())

# Male program ITT
model_m = smf.ols("lexptot ~ progvillm + sexhead + agehead + educhead + " \
                  "hhland + vaccess + pcirr + rice + wheat + milk + oil + egg", data=df).fit()
print(model_m.summary())

                            OLS Regression Results                            
Dep. Variable:                lexptot   R-squared:                       0.202
Model:                            OLS   Adj. R-squared:                  0.193
Method:                 Least Squares   F-statistic:                     23.51
Date:                Fri, 12 Sep 2025   Prob (F-statistic):           3.81e-47
Time:                        23:27:18   Log-Likelihood:                -722.56
No. Observations:                1129   AIC:                             1471.
Df Residuals:                    1116   BIC:                             1537.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.3913      0.229     32.271      0.0

## 2.c.

For both male and female program placement, carry out the White test for heteroscedasticity
(estat imtest, white) with controls after your OLS estimation. Do you find evidence of
heteroscedasticity in either estimation? If so, correct for it by using robust standard errors and
re-run your estimations. How do results and significance of results change?

**Answer**

The p-values of both White tests are below 0.05, so we reject the null hypothesis that the residuals are homoskedastic. Heteroskedasticity is therefore present in both models, which violates the fifth Gauss-Markov assumption (constant error variance). The coefficient estimates are unaffected, but the conventional OLS standard errors are biased.

After correcting by using robust standard errors:
* The female program leads to an ~10.5% increase in household expenditures. p-value = 0.089, marginally significant at 10%, but no longer significant at 5%. Robust SE is larger than the OLS SE (0.062 vs. 0.059), reflecting heteroscedasticity correction. We are now more cautious about claiming a strong effect; the positive effect is still there, but with more uncertainty.

* The male program leads to an ~5.8% decrease in household expenditures. The p-value is 0.054, so the effect is significant at the 10% level but no longer at the 5% level, slightly weaker than before. The robust SE increased slightly (0.030 vs. 0.029), reflecting the heteroscedasticity correction.
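
To make the correction concrete, here is a minimal NumPy sketch of the HC3 sandwich estimator on simulated data. The data are hypothetical (the error spread is made to grow with |x|), and the code is a hand-rolled illustration of the formula rather than a description of statsmodels internals. The coefficients are identical under both covariance estimators; only the standard errors change.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
# Error spread grows with |x|, so errors are heteroskedastic by construction
y = 1.0 + 0.5 * x + np.abs(x) * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical covariance: s^2 (X'X)^-1
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3 covariance: (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1,
# where h_i is the leverage of observation i
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
u = resid / (1.0 - leverage)
meat = (X * (u ** 2)[:, None]).T @ X
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("HC3 SE:      ", se_hc3)
```

In this design the robust slope SE is clearly larger than the classical one. In the regressions above the gap is much smaller (e.g. 0.062 vs. 0.059), which suggests the heteroskedasticity detected by the White test has only a modest effect on the program coefficients.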

In [24]:
# White test
white_test_f = het_white(model_f.resid, model_f.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
white_df_f = pd.DataFrame([white_test_f], columns=labels)
print(white_df_f)

white_test_m = het_white(model_m.resid, model_m.model.exog)
white_df_m = pd.DataFrame([white_test_m], columns=labels)
print(white_df_m)

   LM Statistic  LM-Test p-value  F-Statistic  F-Test p-value
0    197.142578     1.039146e-10     2.563305    4.380980e-12
   LM Statistic  LM-Test p-value  F-Statistic  F-Test p-value
0    171.500729     1.764877e-07     2.143182    2.912712e-08


In [25]:
# Female program with robust SE
model_f_robust = model_f.get_robustcov_results(cov_type='HC3')
print(model_f_robust.summary())

# Male program with robust SE
model_m_robust = model_m.get_robustcov_results(cov_type='HC3')
print(model_m_robust.summary())


                            OLS Regression Results                            
Dep. Variable:                lexptot   R-squared:                       0.202
Model:                            OLS   Adj. R-squared:                  0.193
Method:                 Least Squares   F-statistic:                     21.03
Date:                Fri, 12 Sep 2025   Prob (F-statistic):           4.10e-42
Time:                        23:27:32   Log-Likelihood:                -722.56
No. Observations:                1129   AIC:                             1471.
Df Residuals:                    1116   BIC:                             1537.
Df Model:                          12                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.3913      0.235     31.505      0.0

## 2.d.

Now let’s look at impacts of microcredit at the individual level where the variables dmmfd and
dfmfd indicate the presence of a male and female microcredit borrower in a household
respectively. Carry out a regression of log total expenditures (lexptot) on individual
household borrowing by both females and males with your controls and using robust standard
errors.

**Answer**

Households with a female borrower spend about 8.1% more on average, and the effect is statistically significant.

Having a male borrower in the household has no statistically significant effect on total household expenditures.

In [27]:
# OLS regression on the household borrower indicators (dmmfd, dfmfd);
# fit() is called with the default (nonrobust) covariance here
model_indv = smf.ols("lexptot ~ dmmfd + dfmfd + sexhead + agehead + educhead + " \
                  "hhland + vaccess + pcirr + rice + wheat + milk + oil + egg", data=df).fit()
print(model_indv.summary())

                            OLS Regression Results                            
Dep. Variable:                lexptot   R-squared:                       0.206
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                     22.20
Date:                Fri, 12 Sep 2025   Prob (F-statistic):           1.45e-47
Time:                        23:36:31   Log-Likelihood:                -719.86
No. Observations:                1129   AIC:                             1468.
Df Residuals:                    1115   BIC:                             1538.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.4900      0.221     33.934      0.0

## 2.e.

We might have good reason to think that consumption (log total expenditures) is correlated
within villages. Carry out your estimation using clustered standard errors at the village level
(villid) for both individual female and male borrowers. How does this affect your inference
about statistical significance?

**Answer**

Clustering did not substantially change inference. Female borrowers remain significant; male borrowers remain insignificant.
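
For intuition about what clustering does, the sketch below computes cluster-robust standard errors by hand on simulated data. Everything in it is an assumption for illustration: 50 hypothetical villages of 20 households, a regressor that is constant within a village, and an error with a shared village shock. It also applies a common small-sample correction, G/(G-1) · (N-1)/(N-K).

```python
import numpy as np

rng = np.random.default_rng(2)
n_villages, per_village = 50, 20
village = np.repeat(np.arange(n_villages), per_village)
n = village.size

# Hypothetical design: the regressor and part of the error are shared
# by all households in the same village
x = rng.normal(size=n_villages)[village]
shock = rng.normal(size=n_villages)[village]
y = 1.0 + 0.5 * x + shock + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs treat all households as independent
se_classical = np.sqrt(np.diag(resid @ resid / (n - k) * XtX_inv))

# Cluster "meat": sum over villages of (X_g' u_g)(X_g' u_g)'
meat = np.zeros((k, k))
for g in range(n_villages):
    in_g = village == g
    score = X[in_g].T @ resid[in_g]
    meat += np.outer(score, score)

correction = n_villages / (n_villages - 1) * (n - 1) / (n - k)
se_cluster = np.sqrt(np.diag(correction * XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("clustered SE:", se_cluster)
```

Here the clustered SE is much larger because both the regressor and the error are strongly correlated within villages. When the regressor varies mostly within villages, as household-level borrowing plausibly does, clustering can leave the SEs roughly unchanged or even make them smaller, which is consistent with the result above.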

In [28]:
# OLS with clustered SEs at the village level (villid)
model_clustered = smf.ols("lexptot ~ dmmfd + dfmfd + sexhead + agehead + educhead + " \
                  "hhland + vaccess + pcirr + rice + wheat + milk + oil + egg", data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['villid']})
print(model_clustered.summary())

                            OLS Regression Results                            
Dep. Variable:                lexptot   R-squared:                       0.206
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                -2.769e+13
Date:                Fri, 12 Sep 2025   Prob (F-statistic):               1.00
Time:                        23:37:32   Log-Likelihood:                -719.86
No. Observations:                1129   AIC:                             1468.
Df Residuals:                    1115   BIC:                             1538.
Df Model:                          13                                         
Covariance Type:              cluster                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.4900      0.180     41.595      0.0



## 2.f.

Now with our individual borrower treatment again, let’s try bootstrapping our standard errors
(non-clustered at first). Do this by using the bootstrap command in Stata and prefixing your
reg command with bootstrap, reps(1000) seed(12345): Do this for both the male
and female borrower treatment variables. Compare your results to non-bootstrapped, non-robust,
and non-clustered (i.e. normal) standard error regressions. What do you find?

**Answer**

Coefficients are quite stable across all methods.
* Female borrowers are consistently significant.
* Male borrowers are consistently insignificant.

However, as expected, the standard errors vary slightly across methods (reflecting within-village correlation and heteroskedasticity):
* Clustered SEs are slightly smaller than the other two.
* Bootstrapped SEs are very similar to the normal OLS SEs.

Using clustered or bootstrapped SEs confirms that our inference is not sensitive to heteroskedasticity or within-village correlation.
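
Why bootstrapped and normal SEs should be close can be checked on simulated data where the classical assumptions hold by construction. This is a rough sketch with hypothetical data and 500 rather than 1000 replications. It resamples (x, y) pairs with replacement and compares the spread of the bootstrap slopes with the analytic OLS standard error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_boot = 500, 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)  # homoskedastic, independent errors
X = np.column_stack([np.ones(n), x])


def fit(X, y):
    """OLS coefficients via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


beta = fit(X, y)
resid = y - X @ beta
se_analytic = np.sqrt(np.diag(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)))

# Pairs bootstrap: resample rows with replacement and refit each time
draws = np.empty((n_boot, 2))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    draws[b] = fit(X[idx], y[idx])
se_boot = draws.std(axis=0, ddof=1)

print("analytic SE: ", se_analytic)
print("bootstrap SE:", se_boot)
```

Under these assumptions the two sets of SEs agree closely; sizeable gaps between them would point to heteroskedasticity or dependence that the analytic formula ignores.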

In [29]:
# Number of bootstrap repetitions
n_boot = 1000
np.random.seed(12345)

# Store bootstrap coefficients
coef_boot = []

# Formula
formula = "lexptot ~ dmmfd + dfmfd + sexhead + agehead + educhead + " \
          "hhland + vaccess + pcirr + rice + wheat + milk + oil + egg"

# Bootstrap loop
for i in range(n_boot):
    # Resample the data with replacement
    df_boot = df.sample(n=len(df), replace=True)
    
    # Fit OLS on the bootstrap sample
    model_boot = smf.ols(formula, data=df_boot).fit()
    
    # Store the coefficients
    coef_boot.append(model_boot.params)

# Convert to DataFrame
coef_boot = pd.DataFrame(coef_boot)

# Compute bootstrap mean, std (as SE), 95% CI
bootstrap_summary = pd.DataFrame({
    'coef_mean': coef_boot.mean(),
    'boot_se': coef_boot.std(),
    'ci_lower': coef_boot.quantile(0.025),
    'ci_upper': coef_boot.quantile(0.975)
})

print(bootstrap_summary)


           coef_mean   boot_se  ci_lower  ci_upper
Intercept   7.481130  0.224800  7.040269  7.935806
dmmfd      -0.013867  0.033277 -0.074573  0.052976
dfmfd       0.083240  0.028892  0.026315  0.136075
sexhead    -0.055694  0.055432 -0.171299  0.048979
agehead     0.003910  0.001141  0.001669  0.006281
educhead    0.052563  0.004474  0.043876  0.061381
hhland      0.000407  0.000134  0.000210  0.000706
vaccess    -0.006375  0.040732 -0.091232  0.070325
pcirr       0.152146  0.047656  0.059687  0.249065
rice        0.004552  0.008718 -0.012741  0.020931
wheat      -0.038098  0.016415 -0.069570 -0.007662
milk        0.020606  0.005611  0.009606  0.031844
oil         0.009534  0.003314  0.002884  0.015902
egg         0.104170  0.048609  0.013849  0.197825


## 2.g. 

Finally, let’s do a clustered bootstrap for male and female borrowers where we cluster again at
the villid level using the prefix command bootstrap, reps(1000) seed(12345)
cluster(villid): What do you find in these regressions? Are they similar to your non-
bootstrapped clustered standard errors?

**Answer**

The clustered bootstrap SE for male borrowers is essentially unchanged from the analytic clustered SE, while the one for female borrowers is slightly smaller. The coefficients remain the same as before. This suggests that our analytic clustered SEs were reliable.

In [None]:
# Number of bootstrap repetitions
np.random.seed(12345)
n_boot = 1000

# Regression formula
formula = "lexptot ~ dmmfd + dfmfd + sexhead + agehead + educhead + hhland + vaccess + pcirr + rice + wheat + milk + oil + egg"

# Get list of unique villages
villages = df['villid'].unique()

# Store bootstrap coefficients
coef_boot = []

for i in range(n_boot):
    # Sample villages with replacement
    sampled_villages = np.random.choice(villages, size=len(villages), replace=True)
    
    # Build a bootstrap dataset by concatenating all households in sampled villages
    df_boot = pd.concat([df[df['villid'] == v] for v in sampled_villages])
    
    # Fit OLS on the bootstrap sample
    model_boot = smf.ols(formula, data=df_boot).fit()
    
    # Store coefficients
    coef_boot.append(model_boot.params)

# Convert to DataFrame
coef_boot = pd.DataFrame(coef_boot)

# Compute bootstrap mean, SE, and 95% CI
bootstrap_summary = pd.DataFrame({
    'coef_mean': coef_boot.mean(),
    'boot_se': coef_boot.std(),
    'ci_lower': coef_boot.quantile(0.025),
    'ci_upper': coef_boot.quantile(0.975)
})

print(bootstrap_summary.loc[['dmmfd', 'dfmfd']])


       coef_mean   boot_se  ci_lower  ci_upper
dmmfd  -0.022956  0.027358 -0.072843  0.023228
dfmfd   0.084918  0.024632  0.045687  0.138884
