# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [6]:
data.head()


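|   | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Nonprofit |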


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

#### Q1: Because `call` is coded 0/1, the mean of `call` in each group is that group's callback rate. We are comparing this rate between two independent samples, so we will use a two-sample test to examine whether the callback rate differs between white-sounding and black-sounding names.

In [7]:
w = data[data.race=='w']
b = data[data.race=='b']
print(len(w), len(b))

2435 2435


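#### Both sample sizes are far larger than 30, BUT we don't know the population standard deviation, so a t-test is the more appropriate procedure. With 2,435 resumes per group, the t distribution is practically indistinguishable from the normal, so a z-test would give nearly the same result.

#### Additionally, because the samples are this large and the names were assigned to resumes at random, the sample size is sufficient for the CLT to apply to the sample means.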
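
One common rule of thumb for 0/1 data is that each group should contain at least 10 successes and 10 failures before the normal approximation is trusted. A minimal sketch of that check follows. The count of 235 callbacks for `race == 'w'` and the group size of 2435 come from the cells above; the count of 157 for the black-sounding group is an assumption used for illustration, and `clt_ok` is a hypothetical helper name.

```python
def clt_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

n_per_group = 2435   # resumes per group, from len(w) and len(b)
white_calls = 235    # callbacks for race == 'w', from In [3]
black_calls = 157    # ASSUMPTION: illustrative count, not printed in the notebook

print(clt_ok(white_calls, n_per_group))
print(clt_ok(black_calls, n_per_group))
```

Both groups clear the threshold comfortably under these counts, which is consistent with relying on the CLT here.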

#### Q2: Our null hypothesis is that the callback rate is the same for white-sounding and black-sounding names. Our alternative hypothesis is that the two callback rates differ.
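
Written out in symbols, the hypotheses and the unequal-variance (Welch) test statistic look roughly as follows. The notation is introduced here for illustration and is not part of the original exercise: $p_w$ and $p_b$ are the true callback rates, $\hat{p}$ the sample callback rates, $s^2$ the sample variances of `call`, and $n$ the number of resumes in each group.

```latex
H_0:\; p_w = p_b \qquad \text{vs.} \qquad H_a:\; p_w \neq p_b

t = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\dfrac{s_w^2}{n_w} + \dfrac{s_b^2}{n_b}}}
```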

In [8]:
# Your solution to Q3 here

### First let's use bootstrapping ###

''' The procedure is to shift both the white-sounding and black-sounding samples to the same mean, draw many bootstrap samples from each shifted sample and compute their means, take the difference between the two means for each draw, and then see what fraction of those simulated differences are at least as extreme as the difference actually observed.
'''
np.random.seed(42)
# set up function for bootstrapping
def bs_test(data, func, size=1):
    bs_reps = np.empty(size)
    for i in range(size):
        bs_sample = np.random.choice(data, size=len(data))
        bs_rep = func(bs_sample)
        bs_reps[i] = bs_rep
    return bs_reps

w_call = w['call']
b_call = b['call']

# find mean that we will shift both white-sounding and black-sounding samples to
mean_shift = np.mean(data['call'])

# shift both means to the same value (i.e., setting up the null hypothesis assumption)
w_shifted = w_call - np.mean(w_call) + mean_shift
b_shifted = b_call - np.mean(b_call) + mean_shift

# now let's simulate computing the mean for both of these mean-shifted samples 100,000 times
bs_rep_w = bs_test(w_shifted, np.mean, 100000)
bs_rep_b = bs_test(b_shifted, np.mean, 100000)

# calculate the difference between the bootstrap replicate means
bs_difference = bs_rep_w - bs_rep_b

# compute the actual observed difference
obs_difference = np.mean(w_call) - np.mean(b_call)

# compute the one-sided p-value: the fraction of simulated differences that are at least as large as the observed difference
p_val = np.sum(bs_difference >= obs_difference) / float(len(bs_difference))
print('p-value = %.5f' % p_val)
# compute the 95% interval of the differences simulated under the null hypothesis, and the margin of error
ci = np.percentile(bs_difference, [2.5, 97.5])
print('95 percent confidence interval between %.3f and %.3f' % (ci[0], ci[1]))
# the spread of the bootstrap replicates is already the standard error, so it is not divided by sqrt(number of replicates)
margin = 1.96 * np.std(bs_difference)
print('Margin of error: %.3f' % (margin))

p-value = 0.00000
95 percent confidence interval between -0.015 and 0.015
Margin of error: 0.015


#### As shown, none of the 100,000 simulated differences was as large as the observed difference, so the estimated p-value is below 1 in 100,000 (very small, though not literally zero), and the null hypothesis should be rejected.
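
The shifted-mean bootstrap above describes what the null hypothesis would produce. A complementary view is a percentile bootstrap confidence interval for the observed difference itself, obtained by resampling each group without shifting it. The sketch below uses only the standard library and stand-in 0/1 lists rather than the real DataFrame; the 235 of 2435 figure comes from In [3], the 157 of 2435 figure is an assumption, and `bootstrap_diff_ci` is a hypothetical helper name.

```python
import random

def bootstrap_diff_ci(a, b, n_reps=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        # resample each group with replacement, keeping its original size
        a_star = rng.choices(a, k=len(a))
        b_star = rng.choices(b, k=len(b))
        diffs.append(sum(a_star) / len(a) - sum(b_star) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_reps)]
    hi = diffs[int((1 - alpha / 2) * n_reps) - 1]
    return lo, hi

# Stand-in data: 235 of 2435 matches In [3]; 157 of 2435 is an ASSUMPTION.
w_calls = [1] * 235 + [0] * (2435 - 235)
b_calls = [1] * 157 + [0] * (2435 - 157)

lo, hi = bootstrap_diff_ci(w_calls, b_calls)
print('95%% bootstrap CI for the difference: %.3f to %.3f' % (lo, hi))
```

If the resulting interval excludes zero, it tells the same story as the small p-value, while also indicating how large the gap in callback rates plausibly is.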

#### Now we repeat the procedure using the frequentist approach

In [19]:
### Now let's use the frequentist approach ###
w_call = w['call']
b_call = b['call']

# Welch's t-test, which does not assume equal variances in the two groups
t, p = stats.ttest_ind(w_call, b_call, equal_var=False)
print('t-statistic: %.3f with p-value of: %.6f' % (t, p))
print('\n')

print('---WHITE SOUNDING---')
# margin of error is 1.96 standard errors; the confidence interval is the sample mean plus or minus that margin
margin_w = 1.96 * (np.std(w_call) / np.sqrt(len(w_call)))
ci_w = (np.mean(w_call) - margin_w, np.mean(w_call) + margin_w)
print('95 percent confidence interval between %.3f and %.3f' % (ci_w[0], ci_w[1]))
print('Margin of error: %.6f' % (margin_w))
print('\n')

print('---BLACK SOUNDING---')
margin_b = 1.96 * (np.std(b_call) / np.sqrt(len(b_call)))
ci_b = (np.mean(b_call) - margin_b, np.mean(b_call) + margin_b)
print('95 percent confidence interval between %.3f and %.3f' % (ci_b[0], ci_b[1]))
print('Margin of error: %.6f' % (margin_b))

t-statistic: 4.115 with p-value of: 0.000039


---WHITE SOUNDING---
95 percent confidence interval between 0.085 and 0.108
Margin of error: 0.011729


---BLACK SOUNDING---
95 percent confidence interval between 0.055 and 0.074
Margin of error: 0.009755
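
Because `call` is a 0/1 outcome, a two-proportion z-test is a common alternative to the t-test, and it also yields a confidence interval for the difference in callback rates directly. The following is a minimal sketch using only the standard library, not the notebook's own method. The counts passed in are stand-ins (235 of 2435 from In [3]; 157 of 2435 is an assumption), and `two_prop_ztest` is a hypothetical helper name.

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, plus a 95% CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # pooled rate under the null, used for the test statistic
    p_pool = (x1 + x2) / (n1 + n2)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_null
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    # unpooled standard error, used for the confidence interval
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = 1.96 * se
    return z, p_value, (diff - margin, diff + margin)

# 235 of 2435 matches In [3]; 157 of 2435 is an ASSUMPTION for illustration.
z, p_value, ci = two_prop_ztest(235, 2435, 157, 2435)
print('z = %.3f, p-value = %.6f' % (z, p_value))
print('95%% CI for the difference: %.3f to %.3f' % ci)
```

With samples this large, the z statistic should land very close to the t-statistic reported above.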


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

#### Q4: In both the bootstrapping and frequentist analyses, we see strong evidence that we should reject our null hypothesis that there is no difference in callback rate between white-sounding and black-sounding names. With the bootstrapping analysis, after 100,000 simulations, none of the simulated mean differences was as extreme as the one observed, and with the frequentist analysis, we found a very small p-value of ~4x10^-5. Both results indicate that the observed difference would be extremely unlikely if the null hypothesis were true, so the null hypothesis should be rejected.

#### Q5: This analysis does not imply that race is THE MOST important factor in callback success. It shows that race, or at least having a white-sounding rather than a black-sounding name, is one statistically significant factor: it is very unlikely that the callback rate is the same for the two groups of names. We have not yet examined the other factors that could contribute to callback success or assessed their significance, so we cannot say whether race is THE MOST important. One could amend the analysis by sending resumes with the same names but different backgrounds to assess the significance of those factors. Geographic factors, based on where the resumes are sent, could also skew the results. Either way, this analysis shows that race is a significant factor, but establishing whether it is THE MOST important one would require more tests.
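
As a rough first step toward comparing factors, one could compute the gap in callback rate for each binary column and see how it compares with the gap for race. The sketch below is plain Python on made-up rows, not on the real dataset. `callback_gap` is a hypothetical helper, and `honors` and `volunteer` are used only because they appear among the columns in `data.head()`.

```python
def callback_gap(rows, factor):
    """Difference in callback rate between rows where `factor` is 1 and 0."""
    def rate(group):
        return sum(r['call'] for r in group) / len(group)
    yes = [r for r in rows if r[factor] == 1]
    no = [r for r in rows if r[factor] == 0]
    return rate(yes) - rate(no)

# Made-up rows for illustration only; NOT taken from the dataset.
toy_rows = [
    {'honors': 1, 'volunteer': 0, 'call': 1},
    {'honors': 1, 'volunteer': 1, 'call': 0},
    {'honors': 0, 'volunteer': 1, 'call': 0},
    {'honors': 0, 'volunteer': 0, 'call': 0},
]
for factor in ('honors', 'volunteer'):
    print(factor, callback_gap(toy_rows, factor))
```

Raw gaps like these ignore interactions between factors, so they are a starting point rather than a ranking; a regression that includes all factors at once would be the natural next step.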