# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding names. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [40]:
import pandas as pd
import numpy as np
from scipy import stats

In [41]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [42]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [43]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [44]:
data.describe()

| | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | occupbroad | workinschool | ... | educreq | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | ... | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 |
| mean | 3.61848 | 3.661396 | 7.842916 | 0.052772 | 0.411499 | 0.097125 | 0.448049 | 215.637782 | 3.48152 | 0.559548 | ... | 0.106776 | 0.437166 | 0.07269 | 0.082957 | 0.03039 | 0.08501 | 0.213963 | 0.267762 | 0.154825 | 0.165092 |
| std | 0.714997 | 1.219126 | 5.044612 | 0.223601 | 0.492156 | 0.296159 | 0.497345 | 148.127551 | 2.038036 | 0.496492 | ... | 0.308866 | 0.496083 | 0.259649 | 0.275854 | 0.171677 | 0.278932 | 0.410141 | 0.442847 | 0.361773 | 0.371308 |
| min | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 267.0 | 4.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 4.0 | 4.0 | 9.0 | 0.0 | 1.0 | 0.0 | 1.0 | 313.0 | 6.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 4.0 | 7.0 | 44.0 | 1.0 | 1.0 | 1.0 | 1.0 | 903.0 | 6.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [45]:
data.head()

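| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |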


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

**1. What test is appropriate for this problem? Does CLT apply?**
    
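Since we want to evaluate whether the callback rate differs significantly between two independent groups (b and w), a two-sample t-test is appropriate; because the outcome is binary, this is in practice equivalent to a two-proportion z-test. The central limit theorem applies because each group is large (2435 resumes) and contains many callbacks and many non-callbacks, so the sample callback rates are approximately normally distributed.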
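The CLT claim can be sanity-checked with the common rule of thumb that each group should contain at least 10 successes and 10 failures. The sketch below is illustrative only: the helper name `clt_conditions_met` and the threshold of 10 are assumptions, and the counts are the ones reported elsewhere in this notebook.

```python
# Counts reported in this notebook: 2435 resumes per group,
# 235 callbacks for white-sounding and 157 for black-sounding names.
n_per_group = 2435
callbacks = {"w": 235, "b": 157}


def clt_conditions_met(successes, n, threshold=10):
    """Rule of thumb (assumed threshold): enough successes AND enough failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Check the condition separately for each group
clt_ok = {group: clt_conditions_met(k, n_per_group) for group, k in callbacks.items()}
print(clt_ok)
```

Both groups pass the check by a wide margin, which supports using a normal approximation for the sample callback rates.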
    

**2. What are the null and alternate hypotheses?**
    
The null hypothesis is that there is no systematic difference in interview requests from employers between candidates with black-sounding and white-sounding names, i.e., the probability of a callback is the same for both groups. The alternate hypothesis is that such a systematic difference exists.
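In symbols, the two hypotheses can be written as follows, where $p_w$ and $p_b$ denote the unknown callback probabilities for white-sounding and black-sounding names (notation introduced here for illustration only):

$$
H_0:\; p_w = p_b
\qquad \text{versus} \qquad
H_1:\; p_w \neq p_b
$$

Because $H_1$ does not specify a direction, the corresponding test is two-sided.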

In [70]:
w = data[data.race=='w'].call
b = data[data.race=='b'].call

**3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.**

In [71]:
# basic sample statistics
n_w = w.count() # number of w samples
n_b = b.count() # number of b samples
mu_w = w.mean() # sample mean
mu_b = b.mean()
s2_w = mu_w*(1-mu_w) # sample variance of the Bernoulli distribution
s2_b = mu_b*(1-mu_b)
print(n_w, n_b, mu_w, mu_b, s2_w, s2_b)

2435 2435 0.09650924056768417 0.0644763857126236 0.08719520705273304 0.060319181398060584


In [72]:
# sample pair statistics
d_mu = mu_w - mu_b # difference of sample means
d_s2 = s2_w/n_w + s2_b/n_b  # variance of the sampling distribution for the difference of means

Because of the large sample numbers, we can assume that the sampling distributions of the means are normal. Also, for such a large number of the degrees of freedom, the t-distribution is well approximated by the standard normal distribution, and we can therefore use z* critical values instead of t* critical values for a given confidence interval.

Here I target a 95% confidence interval, for which z* is 1.96.
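As a quick check of the approximation, the critical values can be computed directly with `scipy.stats`. This is a minimal sketch; the degrees of freedom are hard-coded from the rounded value this notebook reports for the Welch test, which is an assumption made to keep the snippet self-contained.

```python
from scipy import stats

confidence = 0.95
alpha = 1 - confidence
dof = 4711.6  # rounded Welch degrees of freedom reported in this notebook

# Two-sided critical values: put alpha/2 in each tail
z_star = stats.norm.ppf(1 - alpha / 2)
t_star = stats.t.ppf(1 - alpha / 2, dof)
print("z*:", z_star)
print("t*:", t_star)
```

With this many degrees of freedom the two critical values differ only in the fourth decimal place, so using z* = 1.96 is harmless here.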

In [73]:
print('Difference of means:', d_mu)
moe = 1.96*np.sqrt(d_s2)
print('Margin of error for 95% confidence interval:', moe)

Difference of means: 0.03203285485506058
Margin of error for 95% confidence interval: 0.0152554063487


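The observed difference (about 0.032) is roughly twice the margin of error (about 0.015), so the 95% confidence interval for the difference in callback rates, approximately 0.017 to 0.047, does not include zero.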
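The confidence interval follows from the two numbers printed above. The snippet below is a small sketch that hard-codes those printed values, so it runs on its own; in the notebook itself one would reuse `d_mu` and `moe` directly.

```python
# Values copied from the notebook output above
d_mu = 0.03203285485506058  # difference of sample callback rates (w - b)
moe = 0.0152554063487       # margin of error at 95% confidence

# Confidence interval = point estimate +/- margin of error
ci_low, ci_high = d_mu - moe, d_mu + moe
print(f"95% CI for the difference: [{ci_low:.4f}, {ci_high:.4f}]")

# The interval excludes zero when both ends have the same sign
excludes_zero = ci_low > 0 or ci_high < 0
print("Excludes zero:", excludes_zero)
```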

In [74]:
# Calculate the one-sided p-value from the Welch t-statistic
t_statistic = d_mu/np.sqrt(d_s2)
print('t statistic:', t_statistic)
# Welch-Satterthwaite degrees of freedom
dof = (s2_w/n_w + s2_b/n_b)**2/((s2_w/n_w)**2/(n_w-1) + (s2_b/n_b)**2/(n_b-1))
print('degrees of freedom', dof, n_w+n_b)
# stats.t.sf gives the upper-tail (one-sided) probability; double it for a two-sided test
print('p-value:', stats.t.sf(np.abs(t_statistic), dof))
# one-sided p-value assuming a normal distribution (large number of dof)
print('p-value(approximate):', stats.norm.sf(np.abs(t_statistic)))

t statistic: 4.115550519
degrees of freedom 4711.60244264 4870
p-value: 1.9642854884e-05
p-value(approximate): 1.93128190645e-05


The very small p-value (about 2e-05 one-sided, hence about 4e-05 two-sided) is far below the conventional 0.05 threshold, so we reject the null hypothesis.
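A common alternative for binary outcomes is the two-proportion z-test with a pooled standard error, which estimates a single callback rate under the null hypothesis. The sketch below is not the method used in the cell above; it is an illustrative cross-check that hard-codes the counts reported in this notebook.

```python
import numpy as np
from scipy import stats

# Counts reported in this notebook
n_w = n_b = 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b

# Under H0 both groups share one callback rate, estimated from the pooled data
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

z_pooled = (p_w - p_b) / se_pool
p_two_sided = 2 * stats.norm.sf(abs(z_pooled))  # both tails
print("z (pooled):", z_pooled)
print("two-sided p-value:", p_two_sided)
```

The pooled statistic comes out close to the unpooled t-statistic above, so the conclusion does not depend on which variant is used.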

Calculate the same using the bootstrapping approach.

In [75]:
# define bootstrapping function
def bootstrap(in_array, bs_number=1):
    """Draw bs_number resamples of in_array (with replacement), one per row."""
    bs_samples = []
    for _ in range(bs_number):
        bs_samples.append(np.random.choice(in_array, size=len(in_array)))

    return np.array(bs_samples)

In [94]:
# Shift b so that its mean coincides with that of w (this simulates the null hypothesis)
b_shifted = b - mu_b + mu_w

# bootstrap means from w and b
bs_mean_w = np.mean(bootstrap(w.values, bs_number=100000), axis=1)
bs_mean_b = np.mean(bootstrap(b_shifted.values, bs_number=100000), axis=1)

# calculate the differences between the bootstrap means for w and b
bs_mean_dif = bs_mean_w - bs_mean_b
print('Difference of means:', d_mu, '\nMean of differences:', np.mean(bs_mean_dif))

# p-value: fraction of mean differences same or larger than observed difference between sample means
pvalue = np.sum(np.abs(bs_mean_dif) >= np.abs(d_mu))/len(bs_mean_dif)
print('P-value', pvalue)

Difference of means: 0.03203285485506058 
Mean of differences: -5.55202e-05
P-value 3e-05


The very small two-sided p-value (which agrees well with twice the one-sided frequentist value above, about 4e-05) suggests that we should reject the null hypothesis of equal callback rates for people with white-sounding and black-sounding names.
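The bootstrap can also provide the margin of error and confidence interval asked for in question 3. The sketch below is one possible approach, not the cell used above: it relies on the fact that resampling n binary outcomes with replacement and counting the ones is equivalent to a single binomial draw with the observed rate. The seed and variable names are assumptions, and the counts are hard-coded from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

# Counts reported in this notebook
n_w = n_b = 2435
calls_w, calls_b = 235, 157
n_boot = 100_000

# One binomial draw per bootstrap replicate gives the resampled callback count
bs_rate_w = rng.binomial(n_w, calls_w / n_w, size=n_boot) / n_w
bs_rate_b = rng.binomial(n_b, calls_b / n_b, size=n_boot) / n_b
bs_diff = bs_rate_w - bs_rate_b

# Percentile 95% confidence interval for the difference in callback rates
ci_low, ci_high = np.percentile(bs_diff, [2.5, 97.5])
bs_moe = (ci_high - ci_low) / 2
print("Bootstrap 95% CI:", ci_low, ci_high)
print("Bootstrap margin of error:", bs_moe)
```

Unlike the shifted bootstrap above, this version resamples the unshifted data, so the replicates are centered on the observed difference rather than on zero.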

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

**4. Write a story describing the statistical significance in the context of the original problem.**

The results of the analysis show a statistically significant bias of employers against applicants whose names are perceived as black: resumes with white-sounding names received callbacks at a rate of about 9.7%, compared with about 6.4% for black-sounding names. The sample was large and broad enough to suggest that this is a widespread problem across industries and locations. If action were to be taken, a more detailed analysis of the data could show whether there are specific industries or regions where the problem is more pronounced, so that efforts could focus on those first. Since the dataset is now relatively old (years 2000-2002) and was collected around the time the tech bubble burst, it would be interesting to repeat the study and see whether the financial crisis, new technologies, a very different political climate, and the general development of society have influenced the results one way or another.

**5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?**

While the data suggest that race is an important factor in callback success, the analysis does not show that it is the most important one: many other factors were not explored above, such as education and years of experience. To obtain more definitive answers about these factors, the analysis should be extended with similar hypothesis tests. Ideally, the resumes should be modified by random assignment of factors, such as military service, to obtain uncorrelated samples. Alternatively, we could look for combinations of several factors that maximize or minimize callbacks.
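A minimal sketch of how the same kind of test could be repeated for another binary factor is shown below. The helper `callback_gap_test` is hypothetical, and the small data frame is made-up toy data used only to show the call; it is not taken from the study. In the notebook one would pass `data` and a real binary column such as `military` instead.

```python
import numpy as np
import pandas as pd
from scipy import stats


def callback_gap_test(df, factor, outcome="call"):
    """Two-sided z-test for the callback-rate gap between the two levels of a binary factor."""
    # groupby sorts the keys, so the lower level comes first
    groups = [g[outcome].to_numpy(dtype=float) for _, g in df.groupby(factor)]
    if len(groups) != 2:
        raise ValueError("factor must have exactly two levels")
    low, high = groups
    diff = high.mean() - low.mean()
    # Unpooled standard error, using the Bernoulli variance p * (1 - p)
    se = np.sqrt(low.mean() * (1 - low.mean()) / len(low)
                 + high.mean() * (1 - high.mean()) / len(high))
    z = diff / se
    return diff, z, 2 * stats.norm.sf(abs(z))


# Toy data, purely illustrative: 200 rows per level with 20 and 40 callbacks
toy = pd.DataFrame({
    "military": [0] * 200 + [1] * 200,
    "call": [1] * 20 + [0] * 180 + [1] * 40 + [0] * 160,
})
diff, z, p_value = callback_gap_test(toy, "military")
print(diff, z, p_value)
```

Running several such tests raises the chance of a false positive, so a multiple-comparison correction would be needed before ranking factors by importance.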