# Resume Experiment Analysis

Assignment descriptions: https://www.unifyingdatascience.org/html/exercises/exercise_resume.html

## Checking for Balance

### Exercise 1

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_stata("resume_experiment.dta")
df.head()

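|   | education | ofjobs | yearsexp | computerskills | call | female | black |
|---|-----------|--------|----------|----------------|------|--------|-------|
| 0 | 4 | 2 | 6 | 1 | 0.0 | 1.0 | 0.0 |
| 1 | 3 | 3 | 6 | 1 | 0.0 | 1.0 | 0.0 |
| 2 | 4 | 1 | 6 | 1 | 0.0 | 1.0 | 1.0 |
| 3 | 3 | 4 | 6 | 1 | 0.0 | 1.0 | 1.0 |
| 4 | 3 | 3 | 22 | 1 | 0.0 | 1.0 | 0.0 |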


In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   education       4870 non-null   int8   
 1   ofjobs          4870 non-null   int8   
 2   yearsexp        4870 non-null   int8   
 3   computerskills  4870 non-null   int8   
 4   call            4870 non-null   float32
 5   female          4870 non-null   float32
 6   black           4870 non-null   float32
dtypes: float32(3), int8(4)
memory usage: 114.1 KB
None


In [4]:
treat = df[df['black'] == 1]
ctrl = df[df['black'] == 0]

In [5]:
# from pandas_profiling import ProfileReport
# profT = ProfileReport(treat, title="Treatment Group Balance Check Report")
# profC = ProfileReport(ctrl, title="Control Group Balance Check Report")

In [6]:
# profT.to_file(output_file='Treatment_balance_check.html')
# profC.to_file(output_file='Control_balance_check.html')

In [7]:
meanT = treat[['female', 'computerskills', 'yearsexp']].mean()
meanC = ctrl[['female', 'computerskills', 'yearsexp']].mean()
print(meanT - meanC) # mean difference across groups

female            0.010678
computerskills    0.023819
yearsexp         -0.026694
dtype: float64


In [8]:
print(stats.ttest_ind(treat['female'], ctrl['female']))
print(stats.ttest_ind(treat['computerskills'], ctrl['computerskills']))
print(stats.ttest_ind(treat['yearsexp'], ctrl['yearsexp']))

Ttest_indResult(statistic=0.8841321018026016, pvalue=0.37666856909823254)
Ttest_indResult(statistic=2.1664271042751966, pvalue=0.030326933955391936)
Ttest_indResult(statistic=-0.18461970685747395, pvalue=0.8535350182481283)


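Gender and years of experience look balanced across race groups (p = 0.38 and p = 0.85). Computer skills differ by about 0.02 with p = 0.03, which is statistically significant at the 5% level, although the difference is small in magnitude.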
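A balance check like the one above can also be summarized in a single table. The following is a minimal sketch, not the notebook's own method: `balance_table` is a hypothetical helper, and the data are synthetic stand-ins rather than the real resume file. It reports group means, the raw difference, and a standardized difference so covariates on different scales can be compared.

```python
import numpy as np
import pandas as pd

def balance_table(data, treatment, covariates):
    """Return group means, raw difference, and standardized mean difference."""
    treated = data[data[treatment] == 1]
    control = data[data[treatment] == 0]
    rows = []
    for col in covariates:
        diff = treated[col].mean() - control[col].mean()
        # Pooled standard deviation puts every covariate on the same scale.
        pooled_sd = np.sqrt((treated[col].var() + control[col].var()) / 2)
        rows.append({
            "covariate": col,
            "mean_treated": treated[col].mean(),
            "mean_control": control[col].mean(),
            "diff": diff,
            "std_diff": diff / pooled_sd,
        })
    return pd.DataFrame(rows).set_index("covariate")

# Synthetic stand-in for the resume data (hypothetical values, not the real file).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "black": np.repeat([0, 1], n // 2),
    "female": rng.integers(0, 2, n),
    "yearsexp": rng.integers(1, 20, n),
})
table = balance_table(demo, "black", ["female", "yearsexp"])
print(table.round(3))
```

With the real data, the call would presumably be `balance_table(df, "black", ["female", "computerskills", "yearsexp"])`.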

### Exercise 2

In [9]:
# education
obs1 = np.array([treat['education'], ctrl['education']])
stat, p, dof, expected = stats.chi2_contingency(obs1)
print(f"p value is {p}.") 

p value is 1.0.


In [10]:
# number of previous jobs 
obs2 = np.array([treat['ofjobs'], ctrl['ofjobs']])
stat, p, dof, expected = stats.chi2_contingency(obs2)
print(f"p value is {p}.") 

p value is 1.0.


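Taken at face value, these p-values suggest that education and number of previous jobs are balanced across groups. However, `chi2_contingency` expects a contingency table of counts, and the calls above pass the raw per-resume values as if they were counts, so a p-value of 1.0 here is not a reliable balance test.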
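For a categorical covariate, a chi-square test of independence is usually run on a table of counts (group by category). The sketch below builds such a table with `pd.crosstab` on synthetic data; the column names mirror the notebook, but the values are made up for illustration and the resulting p-value says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (hypothetical values, not the real resume file).
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "black": np.repeat([0, 1], n // 2),
    "education": rng.integers(0, 5, n),
})

# Rows are race groups, columns are education levels, cells are counts.
counts = pd.crosstab(demo["black"], demo["education"])

# Test whether the distribution of education is independent of group.
chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(counts)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is a quick sanity check that the table has the intended shape.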

### Exercise 3

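Balance on observable characteristics is what we expect when names are randomly assigned to resumes. It suggests that race is not confounded with other resume characteristics, so there is no selection bias and differences in callback rates can be given a causal interpretation.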

## Estimating Effect of Race

### Exercise 4

In [11]:
print(stats.ttest_ind(treat['call'], ctrl['call']))

Ttest_indResult(statistic=-4.114705290861751, pvalue=3.940802103128886e-05)


The t-statistic is negative (treatment minus control) and the p-value is extremely low, so Black-named fictitious resumes are significantly less likely to receive a callback for an interview than White-named resumes.
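Because `call` is binary, the comparison is a difference in callback rates, and it can help to report its size and a confidence interval alongside the p-value. This is a minimal sketch using a normal-approximation interval; `diff_in_proportions` is a hypothetical helper and the counts below are invented for illustration.

```python
import numpy as np

def diff_in_proportions(y_treated, y_control, z=1.96):
    """Difference in means of two binary outcomes with a normal-approximation CI."""
    p_t = np.mean(y_treated)
    p_c = np.mean(y_control)
    diff = p_t - p_c
    # Standard error of a difference between two independent proportions.
    se = np.sqrt(p_t * (1 - p_t) / len(y_treated) + p_c * (1 - p_c) / len(y_control))
    return diff, (diff - z * se, diff + z * se)

# Made-up example: 6 callbacks out of 100 treated, 10 out of 100 control.
y_treated = np.array([1] * 6 + [0] * 94)
y_control = np.array([1] * 10 + [0] * 90)

diff, (low, high) = diff_in_proportions(y_treated, y_control)
print(f"difference = {diff:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With the notebook's data, the inputs would presumably be `treat['call']` and `ctrl['call']`.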

### Exercise 5

In [13]:
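# LinearRegression fits its own intercept by default, so the explicit column
# of ones is redundant and its coefficient is reported as 0.
X0 = np.c_[np.ones(df.shape[0]), df['black']]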
reg0 = LinearRegression().fit(X0, df['call'])
reg0.coef_

array([ 0.        , -0.03203285])

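Given the negative coefficient, Black-named fictitious applicants have a callback rate about 3.2 percentage points lower than White-named applicants (the outcome is binary, so the coefficient is a difference in callback probabilities). This is evidence of racial discrimination by employers.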
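With a single binary regressor, the OLS slope is exactly the difference in group means and the intercept is the control-group mean, which is why this regression agrees with the comparison in Exercise 4. The sketch below checks that identity on synthetic data; the callback probabilities are made up for illustration.

```python
import numpy as np

# Synthetic stand-in data (hypothetical callback probabilities).
rng = np.random.default_rng(2)
n = 1000
black = np.repeat([0.0, 1.0], n // 2)
call = (rng.random(n) < np.where(black == 1, 0.06, 0.10)).astype(float)

# OLS of call on an intercept and the binary indicator.
X = np.column_stack([np.ones(n), black])
beta, *_ = np.linalg.lstsq(X, call, rcond=None)
intercept, slope = beta

# Direct comparison of group means.
diff_means = call[black == 1].mean() - call[black == 0].mean()

print(f"OLS slope:           {slope:.6f}")
print(f"difference in means: {diff_means:.6f}")
print(f"OLS intercept:       {intercept:.6f} (control-group mean)")
```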

### Exercise 6

In [14]:
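# Cast education to a categorical dtype. Note: the model below still receives it
# as a single numeric column (one coefficient), not as dummy variables.
df['education'] = df['education'].astype('category')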
# df['education'].dtypes

In [15]:
X1 = np.c_[np.ones(df.shape[0]), df[['black', 'education', 'yearsexp', 'female', 'computerskills']]]
reg1 = LinearRegression().fit(X1, df['call'])
reg1.coef_

array([ 0.        , -0.03163539, -0.00181811,  0.00316307,  0.01140299,
       -0.01862955])

The coefficient on `black` is essentially unchanged after adding controls (-0.0316 versus -0.0320), as expected when the covariates are balanced across groups.
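To let each education level have its own coefficient, the column can be expanded into dummy variables before fitting. The following is a sketch on synthetic data, assuming `pd.get_dummies` with one level dropped as the reference category; it is one possible encoding, not necessarily the one intended in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the real resume file).
rng = np.random.default_rng(3)
n = 1000
demo = pd.DataFrame({
    "black": rng.integers(0, 2, n),
    "education": rng.integers(0, 5, n),
    "yearsexp": rng.integers(1, 20, n),
})
prob = 0.10 - 0.03 * demo["black"].to_numpy()
demo["call"] = (rng.random(n) < prob).astype(float)

# One dummy per education level; the first level is dropped as the reference.
edu_dummies = pd.get_dummies(demo["education"], prefix="edu", drop_first=True, dtype=float)
X = pd.concat([demo[["black", "yearsexp"]], edu_dummies], axis=1)

# LinearRegression adds the intercept itself, so no column of ones is needed.
model = LinearRegression().fit(X, demo["call"])
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(4))
```

Labeling the coefficients with the column names, as in `coefs`, also avoids having to match positions in the raw `coef_` array by hand.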

## Estimating Heterogeneous Effects

### Exercise 7

Let's define highly educated candidates as those with college-level education, coded here as `education` values 3 and 4.

In [16]:
high = df[(df['education'] == 3) | (df['education'] == 4)]
high.head()

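|   | education | ofjobs | yearsexp | computerskills | call | female | black |
|---|-----------|--------|----------|----------------|------|--------|-------|
| 0 | 4 | 2 | 6 | 1 | 0.0 | 1.0 | 0.0 |
| 1 | 3 | 3 | 6 | 1 | 0.0 | 1.0 | 0.0 |
| 2 | 4 | 1 | 6 | 1 | 0.0 | 1.0 | 1.0 |
| 3 | 3 | 4 | 6 | 1 | 0.0 | 1.0 | 1.0 |
| 4 | 3 | 3 | 22 | 1 | 0.0 | 1.0 | 0.0 |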


In [17]:
X2 = np.c_[np.ones(high.shape[0]), high[['black', 'education', 'yearsexp', 'female', 'computerskills']]]
reg2 = LinearRegression().fit(X2, high['call'])
reg2.coef_

array([ 0.        , -0.03287964, -0.00076322,  0.00279903,  0.01637111,
       -0.01628077])

Unfortunately, racial discrimination persists among highly educated candidates: the coefficient on `black` is -0.0329, about the same as in the full sample.

### Exercise 8

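Recalling our finding in Exercise 6, the estimated coefficient on `female` is positive, which implies that women are more likely to be called back in both racial groups. On its own, however, this additive coefficient does not tell us whether the racial gap differs by gender. Concluding that discrimination is greater for Black men than for Black women would require an interaction between `black` and `female`, or separate regressions by gender.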
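A heterogeneous effect by gender can be estimated by adding a `black` x `female` interaction term. The sketch below uses synthetic data with made-up callback probabilities, so the numbers are illustrative only. In this specification the coefficient on `black` is the racial gap among men, and `black` plus the interaction is the gap among women.

```python
import numpy as np

# Synthetic stand-in data (hypothetical data-generating process).
rng = np.random.default_rng(4)
n = 4000
black = rng.integers(0, 2, n).astype(float)
female = rng.integers(0, 2, n).astype(float)
prob = 0.10 - 0.04 * black + 0.01 * female + 0.02 * black * female
call = (rng.random(n) < prob).astype(float)

# Regress call on an intercept, black, female, and their interaction.
X = np.column_stack([np.ones(n), black, female, black * female])
beta, *_ = np.linalg.lstsq(X, call, rcond=None)
b0, b_black, b_female, b_inter = beta

gap_men = b_black               # racial gap when female == 0
gap_women = b_black + b_inter   # racial gap when female == 1

print(f"estimated gap among men:   {gap_men:.4f}")
print(f"estimated gap among women: {gap_women:.4f}")
print(f"interaction coefficient:   {b_inter:.4f}")
```

Because the model is fully saturated in two binary variables, each estimated gap equals the raw difference in callback rates within that gender.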

### Exercise 9

What is the share of applicants in our dataset with college degrees?

In [18]:
share = round(high.shape[0]/df.shape[0], 4)
print(f'The share of applicants in our dataset with college degrees is {share}.')

The share of applicants in our dataset with college degrees is 0.9261.


In [19]:
treat_high = treat[(treat['education'] == 3) | (treat['education'] == 4)]
share_treat = round(treat_high.shape[0]/treat.shape[0], 4)
print(f'The share of Black-named resumes in our dataset with college degrees is {share_treat}.')

The share of Black-named resumes in our dataset with college degrees is 0.9253.


### Exercise 10

The estimated effect of a Black-sounding name is not sensitive to restricting the sample by education level, because highly educated candidates make up the large majority (over 90%) of the fictitious sample.