In [58]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
import statsmodels.api as smf
pd.options.mode.chained_assignment = None # gets rid of annoying warnings from pandas

**Reweighting Methods**

In this notebook, we will demonstrate the use and importance of reweighting methods in econometrics by analyzing how the gender gap in selection as a Fellow of the Econometric Society relates to the number of papers that male and female economists have published in top journals. The data come from a recent paper by Card, DellaVigna, Funk, and Iriberri, which examines gender differences in the probability that female versus male economists are selected to become Fellows of the Econometric Society, an honorific society.

Reweighting methods are useful when you want to compare a group of interest against a "reference group." One such example would be comparing the earnings of females against males as a reference group. One drawback of running a simple OLS regression with a dummy for group membership is that the groups may be distributed differently across categories, which affects the overall results. To account for this, we can "reweight" the group of interest to match the distribution of the reference group by imagining a "counterfactual" world in which the group of interest has the same distribution across categories as the reference group but retains its own mean within each category. This allows us to find out how much of the gap between group means is due to differences in distribution across categories as opposed to differences within individual categories.

In this notebook, we will analyze the difference in probability of becoming a new fellow of the Econometric Society for males vs. females.

As in a normal OLS model, let $x$ represent observations and $y$ represent outcomes. For the purposes of this notebook, let the $x_{i}$'s be composed of $g$ dummies, one per category: $x_{i} = (D_{1i}, D_{2i}, ..., D_{gi})$

Suppose there are two groups within the observations, group $a$ and group $b$. Let group $a$ be the "reference group" and let group $b$ be the group of interest. 

Additionally, let $N$ be the total number of observations, $N^{a}$ be the total number of obs. for group $a$, and $N^{b}$ be the total number of obs. for group $b$. 

Then, 

$\bar{y}^{a} = \frac{1}{N^{a}}\sum_{i\in a}{y_{i}}$ is the mean of the outcome for group $a$,  

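$\bar{y}^{a}_{g} = \frac{1}{N^{a}_{g}}\sum_{i\in a}{D_{gi}y_{i}}$, where $N^{a}_{g} = \sum_{i\in a}{D_{gi}}$ is the number of observations in group $a$ and category $g$, is the mean of the outcome for group $a$, category $g$, and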

$\bar{p}^{a}_{g} = \frac{1}{N^{a}}\sum_{i\in a}{D_{gi}}$  is the fraction of group $a$ that is in category $g$. 


Drawing on the results above, we can see that: 

$\bar{y}^{a} = \sum_{g}{\bar{p}^{a}_{g}\bar{y}^{a}_{g}} = (\bar{x}^{a})'\hat{\beta}^{a}$

and 

$\bar{y}^{b} = \sum_{g}{\bar{p}^{b}_{g}\bar{y}^{b}_{g}} = (\bar{x}^{b})'\hat{\beta}^{b}$
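
Before turning to the real data, the identity can be checked on a tiny made-up example. The sketch below uses hypothetical category labels and outcomes (not taken from fellows.csv) and only NumPy; the variable names are ours.

```python
import numpy as np

# Hypothetical toy data for one group: category index (0, 1, 2) and a binary outcome.
category = np.array([0, 0, 0, 1, 1, 2])
y = np.array([0, 0, 1, 0, 1, 1])
n_categories = 3

counts = np.bincount(category, minlength=n_categories)   # observations per category
p = counts / len(y)                                       # share of the group in each category
# Sum of outcomes per category divided by the category size gives the within-category mean.
ybar_g = np.bincount(category, weights=y, minlength=n_categories) / counts

ybar_direct = y.mean()                 # overall mean computed directly
ybar_from_cells = np.dot(p, ybar_g)    # overall mean rebuilt from shares and category means

print(ybar_direct, ybar_from_cells)
```

The two printed numbers should agree, because the shares $\bar{p}_{g}$ weight each within-category mean by exactly the fraction of observations it summarizes.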

Let us confirm this fact using the data below:

In [51]:
fellows_df = pd.read_csv("fellows.csv")
fellows_df

Unnamed: 0,author,AER_cumulative,ECTA_cumulative,JPE_cumulative,QJE_cumulative,REStud_cumulative,new_fellow,female,top5
0,(Vela) K. Velupillai,0,0,0,0,0,0,0,0
1,A. Arnon,0,0,0,0,0,0,0,0
2,A. J. J. Talman,0,0,0,0,0,0,0,0
3,A. J. Vermeulen,0,0,0,0,0,0,0,0
4,A. K. M. Mahbub Morshed,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
21246,Zvi Lerman,0,0,0,0,0,0,0,0
21247,Zvi Lotker,0,0,0,0,0,0,0,0
21248,Zvi Safra,1,6,0,0,1,0,0,8
21249,Zvi Wiener,0,0,0,0,0,0,0,0


Let group $a$ be represented by the males in the dataset, and group $b$ by the females. Let the categories be defined by the number of articles that each author has in one of the top 5 journals (top5): an author with $top5 = k$ falls in category $g = k + 1$, so the dummy $D_{gi}$ equals 1 when author $i$ has exactly $g - 1$ top 5 articles. 
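
As a quick illustration of this mapping from publication counts to dummies, here is a minimal sketch on hypothetical counts for five authors (the values are made up and the names `top5`, `D`, and `xbar` are ours).

```python
import numpy as np

# Hypothetical publication counts for five authors (not taken from fellows.csv).
top5 = np.array([0, 2, 0, 1, 2])
n_categories = 3  # counts 0, 1, 2 correspond to dummies D1, D2, D3

# Row i has a single 1 in column top5[i], i.e. in dummy D_{top5[i] + 1}.
D = (top5[:, None] == np.arange(n_categories)).astype(int)

# Averaging each dummy over authors gives the share of authors in each category.
xbar = D.mean(axis=0)

print(D)
print(xbar)
```

Each row of `D` sums to one, so the entries of `xbar` sum to one as well, which is what lets us read $\bar{x}$ as a distribution across categories.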

Now, let's find the means for each group within the data: 

In [52]:
gender_means = fellows_df.groupby('female').agg('mean').reset_index()
gender_means

Unnamed: 0,female,AER_cumulative,ECTA_cumulative,JPE_cumulative,QJE_cumulative,REStud_cumulative,new_fellow,top5
0,0,0.228161,0.113579,0.11411,0.109326,0.093615,0.004016,0.646861
1,1,0.13125,0.036806,0.043287,0.055093,0.04838,0.003935,0.314583


Looking at the table above, male authors have, on average, more papers published in each of the five journals (shown in the first 5 columns) than female authors. Furthermore, the share of male authors accepted as new fellows (about 0.40%) is only slightly higher than the share of female authors (about 0.39%), whereas male authors have on average about twice as many publications in the top 5 journals (0.65 versus 0.31).

Now, let's find the fraction within each gender group that become New Fellows, along with other metrics...

In [53]:
# create male column for later use
fellows_df.loc[:, 'male'] = 1 - fellows_df.loc[:, 'female']

# create dummies
for i in range(1, 12): 
    col = 'D' + str(i)
    fellows_df[col] = (fellows_df['top5'] == (i-1)).astype(int)

# group a: males
# group b: females 
grp_a_df = fellows_df[(fellows_df.loc[:, 'female'] == 0)]
grp_b_df = fellows_df[(fellows_df.loc[:, 'female'] == 1)] 

N = len(fellows_df)
Na = len(grp_a_df)
Nb = len(grp_b_df)

# find xbar (vector of proportions for categories in each group)
xbar_a = list(grp_a_df.groupby('top5').sum().loc[:, 'male']/Na)
xbar_b = list(grp_b_df.groupby('top5').sum().loc[:, 'female']/Nb)

In [64]:
# For each bucket (i.e., each number of top-5 publications), compute:
# - number of male economists (not yet fellows) in that bucket
# - fraction of all male economists with previous publications in that bucket
# - fraction of male economists in that bucket who become new Fellows
# - number of female economists (not yet fellows) in that bucket
# - fraction of all female economists with previous publications in that bucket
# - fraction of female economists in that bucket who become new Fellows

num_male_nf = np.zeros(11)
num_female_nf = np.zeros(11)
frac_male_pubs = np.zeros(11)
frac_female_pubs = np.zeros(11)
frac_male_f = np.zeros(11)
frac_female_f = np.zeros(11)

num_males = Na
num_females = Nb
    
print("Finished finding number in each bucket...") 

for index, row in fellows_df.iterrows():
    t5idx = row['top5']
    if (row['female'] == 1): 
        # females with previous publications
        frac_female_pubs[t5idx] += 1
        
        # females that are not new fellows
        if (row['new_fellow'] == 0): 
            num_female_nf[t5idx] += 1
        
        # females that are new fellows
        elif (row['new_fellow'] == 1): 
            frac_female_f[t5idx] += 1
                    
            
    elif (row['female'] == 0): 
        # males with previous publications
        frac_male_pubs[t5idx] += 1
        
        # males that are not new fellows 
        if (row['new_fellow'] == 0): 
            num_male_nf[t5idx] += 1
            
        # males that are new fellows
        elif (row['new_fellow'] == 1): 
            frac_male_f[t5idx] += 1
            
            
Ngb = fellows_df[fellows_df.loc[:, 'female']==1].groupby('top5').sum().loc[:, 'female']
Nga = fellows_df[fellows_df.loc[:, 'male']==1].groupby('top5').sum().loc[:, 'male']

print("Finished calculating each columns sums...")
frac_male_pubs = frac_male_pubs / num_males
frac_female_pubs = frac_female_pubs / num_females
frac_male_f = frac_male_f / Nga
frac_female_f = frac_female_f / Ngb

mean_df = pd.DataFrame({'top5': np.arange(0,11),
                     'Number of male economists that are not yet fellows': num_male_nf,
                     'Number of female economists that are not yet fellows': num_female_nf,
                     'Fraction of all male economists with previous publications': frac_male_pubs,
                     'Fraction of all female economists with previous publications': frac_female_pubs,
                     'Fraction of male economists who become new Fellows': frac_male_f,
                     'Fraction of female economists who become new Fellows': frac_female_f})

mean_df

Finished finding number in each bucket...
Finished calculating each columns sums...


top5 (index),top5,Number of male economists that are not yet fellows,Number of female economists that are not yet fellows,Fraction of all male economists with previous publications,Fraction of all female economists with previous publications,Fraction of male economists who become new Fellows,Fraction of female economists who become new Fellows
0,0,12212.0,3517.0,0.72128,0.81412,0.0,0.0
1,1,2466.0,523.0,0.145827,0.121528,0.001215,0.00381
2,2,843.0,145.0,0.050499,0.034259,0.014035,0.02027
3,3,483.0,62.0,0.028764,0.014352,0.008214,0.0
4,4,293.0,27.0,0.017601,0.007176,0.016779,0.129032
5,5,195.0,12.0,0.011931,0.003241,0.034653,0.142857
6,6,134.0,12.0,0.008328,0.003009,0.049645,0.076923
7,7,78.0,2.0,0.00502,0.000694,0.082353,0.333333
8,8,54.0,3.0,0.003426,0.001157,0.068966,0.4
9,9,31.0,0.0,0.002067,0.000231,0.114286,1.0
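
The row-by-row loop used to build this table could plausibly be replaced by vectorized counts. The sketch below shows one way to do that with `np.bincount` on small hypothetical arrays standing in for the `top5`, `female`, and `new_fellow` columns; the helper `bucket_table` is our own illustration, not part of the original analysis.

```python
import numpy as np

# Hypothetical toy columns standing in for fellows_df['top5'], ['female'], ['new_fellow'].
top5 = np.array([0, 0, 1, 1, 1, 2, 0, 1, 2, 2])
female = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
new_fellow = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
n_buckets = 3

def bucket_table(mask):
    """Return (share of the group in each bucket, selection rate within each bucket)."""
    n_in_bucket = np.bincount(top5[mask], minlength=n_buckets)
    n_new = np.bincount(top5[mask], weights=new_fellow[mask], minlength=n_buckets)
    share = n_in_bucket / mask.sum()
    # Empty buckets have no defined selection rate; report NaN instead of dividing by zero.
    rate = np.divide(n_new, n_in_bucket,
                     out=np.full(n_buckets, np.nan), where=n_in_bucket > 0)
    return share, rate

share_m, rate_m = bucket_table(female == 0)
share_f, rate_f = bucket_table(female == 1)

print(share_m, rate_m)
print(share_f, rate_f)
```

Counting per bucket in one call avoids iterating over every row, and the explicit guard makes it visible when a bucket contains no one from a given group.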


Now, let's run a regression, separately for each gender, of the probability of being selected as a new fellow on the category dummies. We then confirm that the estimated coefficients equal the within-category means found above, and that they reconstruct the overall mean share of new fellows for each group:

$\bar{y}^{a} = (\bar{x}^{a})'\hat{\beta}^{a}$

and 

$\bar{y}^{b} = (\bar{x}^{b})'\hat{\beta}^{b}$

In [65]:
# create dataset for each regression using results from above 
male_y = list(mean_df.iloc[:, 5])
female_y = list(mean_df.iloc[:, 6])


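# look up the within-category mean outcome for an author's group and category
def assign_y(x, female): 
    """Return the share of new fellows among authors with x top-5 publications."""
    if (female == 0): 
        return male_y[x]
    else: 
        return female_y[x]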

grp_a_y = grp_a_df.loc[:, 'top5'].apply(lambda x: assign_y(x, 0))
grp_b_y = grp_b_df.loc[:, 'top5'].apply(lambda x: assign_y(x, 1))

grp_a_df.loc[:, 'y'] = grp_a_y
grp_b_df.loc[:, 'y'] = grp_b_y

grp_a_df.head()

Unnamed: 0,author,AER_cumulative,ECTA_cumulative,JPE_cumulative,QJE_cumulative,REStud_cumulative,new_fellow,female,top5,male,...,D3,D4,D5,D6,D7,D8,D9,D10,D11,y
0,(Vela) K. Velupillai,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0
1,A. Arnon,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0
2,A. J. J. Talman,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0
3,A. J. Vermeulen,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0
4,A. K. M. Mahbub Morshed,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0


In [59]:
# OLS Regression for Group A
# Regressors: the 11 category dummies D1..D11, i.e. the columns just before 'y'
X = grp_a_df.iloc[:, -12:-1]
Y = grp_a_df['y']
model1 = smf.OLS(Y, X) 
res1 = model1.fit()
print(res1.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.211e+34
Date:                Thu, 28 Oct 2021   Prob (F-statistic):               0.00
Time:                        23:07:58   Log-Likelihood:             6.5957e+05
No. Observations:               16931   AIC:                        -1.319e+06
Df Residuals:                   16920   BIC:                        -1.319e+06
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
D1                  0   2.64e-20          0      1.0

In [60]:
# OLS Regression for Group B
# Regressors: the 11 category dummies D1..D11, i.e. the columns just before 'y'
X = grp_b_df.iloc[:, -12:-1]
Y = grp_b_df['y']
model2 = smf.OLS(Y, X) 
res2 = model2.fit()
print(res2.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.859e+34
Date:                Thu, 28 Oct 2021   Prob (F-statistic):               0.00
Time:                        23:08:10   Log-Likelihood:             1.6836e+05
No. Observations:                4320   AIC:                        -3.367e+05
Df Residuals:                    4309   BIC:                        -3.366e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
D1                  0   4.85e-20          0      1.0

In [66]:
pd.DataFrame({'top5': np.arange(0, 11),
              'OLS Reg. for grp. a': res1.params, 
              'Mean probabilities for grp. a': list(mean_df.iloc[:, 5]),
              'OLS Reg. for grp. b': res2.params, 
              'Mean probabilities for grp. b': list(mean_df.iloc[:, 6])
             })

Unnamed: 0,top5,OLS Reg. for grp. a,Mean probabilities for grp. a,OLS Reg. for grp. b,Mean probabilities for grp. b
D1,0,0.0,0.0,0.0,0.0
D2,1,0.001215,0.001215,0.00381,0.00381
D3,2,0.014035,0.014035,0.02027,0.02027
D4,3,0.008214,0.008214,0.0,0.0
D5,4,0.016779,0.016779,0.129032,0.129032
D6,5,0.034653,0.034653,0.142857,0.142857
D7,6,0.049645,0.049645,0.076923,0.076923
D8,7,0.082353,0.082353,0.333333,0.333333
D9,8,0.068966,0.068966,0.4,0.4
D10,9,0.114286,0.114286,1.0,1.0


Comparing the regression coefficients with the probabilities of becoming a new fellow for group $a$ (males) and group $b$ (females), shown in the "Mean probabilities" columns of the table above, we can indeed verify that the coefficients from both regressions ($\hat{\beta}^{a}$ and $\hat{\beta}^{b}$) equal the mean probabilities within each group and category. 
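
The same equality should also hold when the raw 0/1 outcome, rather than the pre-computed category mean, is regressed on a full set of category dummies with no intercept. The sketch below illustrates this on hypothetical toy data, using `np.linalg.lstsq` in place of statsmodels so that it only needs NumPy.

```python
import numpy as np

# Hypothetical toy data: category index and raw binary outcome for one group.
category = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0], dtype=float)
n_categories = 3

# Saturated design: one dummy per category and no intercept.
X = (category[:, None] == np.arange(n_categories)).astype(float)

# Least-squares coefficients of y on the dummies.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Within-category means computed directly.
cell_means = np.array([y[category == g].mean() for g in range(n_categories)])

print(beta)
print(cell_means)
print(X.mean(axis=0) @ beta, y.mean())
```

Because the dummies are mutually exclusive, each coefficient is fitted using only the observations in its own category, which is why it coincides with that category's mean.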

Let's now use these estimated regression coefficients and the calculated means, $\bar{x}^a$ and $\bar{x}^b$, above to find: 

$\bar{y}^{a} = (\bar{x}^{a})'\hat{\beta}^{a}$

and 

$\bar{y}^{b} = (\bar{x}^{b})'\hat{\beta}^{b}$

In [67]:
xbar_a = np.array(xbar_a)
xbar_b = np.array(xbar_b)
beta_a = np.array(res1.params)
beta_b = np.array(res2.params)

# calculating ybar for each group using xbars and coefficients
ybar_a = np.dot(xbar_a, beta_a)
ybar_b = np.dot(xbar_b, beta_b)

# finding ybar for each group directly 
ybar_a_actual = gender_means['new_fellow'][0]
ybar_b_actual = gender_means['new_fellow'][1]

print("ybar for group a obtained through OLS: ", ybar_a, " actual ybar for group a: ", 
      ybar_a_actual)
print("ybar for group b obtained through OLS: ", ybar_b, " actual ybar for group b: ", 
      ybar_b_actual)

ybar for group a obtained through OLS:  0.004016301458862442  actual ybar for group a:  0.004016301458862442
ybar for group b obtained through OLS:  0.003935185185185185  actual ybar for group b:  0.003935185185185185


We can see above that the means reconstructed from the regression coefficients match the directly calculated mean selection rates for both groups. Thus, this part is verified.

In order to continue with the reweighting analysis, let us assume a world where the distribution of female authors across categories is the same as the distribution of male authors. That is, the probability of a female author being in category $g$ is the same as that of a male author. However, assume that the female mean within each category remains the same. Then, we can calculate the "counterfactual" mean for the female authors group, or group $b$: 

$\bar{y}^{b}_{counterf} = \sum_{g}{\bar{p}^{a}_{g}\bar{y}^{b}_{g}} = (\bar{x}^{a})'\hat{\beta}^{b}$


In [69]:
ybar_b_cf = np.dot(xbar_a, beta_b)

print("The calculated ybar_b counterfactual mean is: ", ybar_b_cf)
print("Adjusted selection gap: ", ybar_b_cf - ybar_a)
print("Actual gap: ", ybar_b_actual - ybar_a_actual)

The calculated ybar_b counterfactual mean is:  0.01656281603606705
Adjusted selection gap:  0.012546514577204609
Actual gap:  -8.111627367725707e-05
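
The counterfactual mean can equivalently be obtained by literally reweighting individuals: each member of group $b$ receives the weight $\bar{p}^{a}_{g} / \bar{p}^{b}_{g}$ of their own category. The sketch below checks this equivalence on hypothetical toy data; it assumes that every category that occurs in group $a$ also occurs in group $b$, since otherwise the weights are undefined.

```python
import numpy as np

# Hypothetical toy data (not from fellows.csv): categories for both groups, outcome for group b.
cat_a = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cat_b = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_b = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)
n_categories = 3

p_a = np.bincount(cat_a, minlength=n_categories) / len(cat_a)
p_b = np.bincount(cat_b, minlength=n_categories) / len(cat_b)

# Cell-based counterfactual: group a's shares times group b's within-category means.
ybar_b_g = np.array([y_b[cat_b == g].mean() for g in range(n_categories)])
cf_cells = np.dot(p_a, ybar_b_g)

# Individual weights: each member of b is weighted by p_a / p_b of their own category.
# Assumes p_b > 0 for every category that appears in group a.
w = p_a[cat_b] / p_b[cat_b]
cf_weighted = np.average(y_b, weights=w)

print(cf_cells, cf_weighted, y_b.mean())
```

The weighted mean and the cell-based formula should print the same value, while the unweighted mean of `y_b` differs whenever the two groups are distributed differently across categories.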


Comparing the actual gap, $\bar{y}^b - \bar{y}^a$, to the counterfactual gap, $\bar{y}^{b}_{counterf} - \bar{y}^a$, we can see that the adjusted selection gap is much larger in magnitude than the actual gap, and has the opposite sign. The difference between the two gaps is the part of the actual gap that is caused by the different distributions of the two groups across categories $g$, as opposed to differences within categories. Thus, while the actual selection gap mixes both sources, the adjusted selection gap, $\sum_{g}{\bar{p}^{a}_{g}(\bar{y}^{b}_{g} - \bar{y}^{a}_{g})}$, is a weighted average of the within-category differences in the probability of becoming a new fellow, with weights given by the male distribution across categories of # of top 5 journal publications.
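
To make the split explicit, the actual gap can be written as the sum of a composition part, $\bar{y}^{b} - \bar{y}^{b}_{counterf}$, and the adjusted (within-category) part, $\bar{y}^{b}_{counterf} - \bar{y}^{a}$. The short sketch below plugs in the three means printed above; the names `composition` and `within` are ours.

```python
# Values copied from the printed output of the cells above.
ybar_a = 0.004016301458862442      # male mean
ybar_b = 0.003935185185185185      # female mean
ybar_b_cf = 0.01656281603606705    # female mean under the male distribution

composition = ybar_b - ybar_b_cf   # part due to different distributions across categories
within = ybar_b_cf - ybar_a        # adjusted (within-category) gap
actual_gap = ybar_b - ybar_a

print("composition part:", composition)
print("within-category part:", within)
print("sum of parts:", composition + within, "actual gap:", actual_gap)
```

The two parts have opposite signs and nearly cancel, which is why the actual gap is so close to zero even though the adjusted gap is not.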

Looking at the table above in which we calculated, for each group and category, the fraction that become new fellows, we can see why the counterfactual mean for females is so much higher than their actual mean. A much larger fraction of male economists than of female economists falls in the categories with many top 5 publications, and in both groups the probability of becoming a new fellow generally rises with the number of top 5 publications. Redistributing the females to look like the males therefore places more weight on the high-publication categories, where the female selection rates are highest. Note that these female rates rest on very few observations (the rate of 1.0 at 9 publications corresponds to a share of 0.000231 of the 4,320 female economists, i.e. a single author), so the size of the adjusted gap should be read with caution. 

Thus, through reweighting, we find greater nuance within the actual selection gap. This analysis suggests that while the raw selection gap is small, in most categories women are selected at higher rates than men with the same number of top 5 publications, and that this is masked in the raw gap by the difference in the distribution of publications between the two groups. This could be consistent with a push for greater diversity within the Econometric Society, and provides one piece of evidence on how bias within the field of economics is being addressed. 