# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

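We want to compare two population proportions: the callback rate for black-sounding names and the callback rate for white-sounding names. A two-sample $z$-test for proportions is appropriate.

Each observation is a Bernoulli trial: call=0 is failure, call=1 is success.

1. The samples are random, according to the experiment description.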

In [5]:
n = len(data)
p = sum(data.call)/len(data)
nw = sum(data.race=='w')
nb = sum(data.race=='b')
print('n = ', n)
print('p = ', p)
print('n black = ', nb)
print('n white = ', nw)

if (nb*p>10 and nb*(1-p)>10):
    print('The black distribution is normal')
if (nw*p>10 and nw*(1-p)>10):
    print('The white distribution is normal')

n =  4870
p =  0.0804928131417
n black =  2435
n white =  2435
The black distribution is normal
The white distribution is normal


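2. Each group is large enough (n\*p > 10 and n\*(1-p) > 10) for the sampling distribution of its callback proportion to be approximately normal.

3. The observations can be treated as independent because n is less than 10% of the entire population.

Given 1, 2 and 3, we can apply the CLT.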

The null hypothesis is that both populations have the same callback proportion: pb=pw, or pb-pw=0.

The alternative hypothesis is that the callback proportion for black-sounding names is lower than the callback proportion for white-sounding names: pb<pw, or pb-pw<0.

This is a one-tailed test.

In [6]:
w = data[data.race=='w']
b = data[data.race=='b']

In [7]:
# Your solution to Q3 here

I set the significance level alpha=0.05

In [8]:
alpha = 0.05

First, use the $z$-score to calculate the p-value.
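
For reference, the statistic computed in the next cell can be written as below. The hat notation is added here for readability and is not part of the original notebook: $\hat{p}_b$ and $\hat{p}_w$ stand for the sample callback proportions (`pb` and `pw` in the code), $n_b$ and $n_w$ for the group sizes (`nb` and `nw`), and $\Phi$ for the standard normal CDF (`stats.norm.cdf`). Note that this form uses the unpooled standard error.

$$
z = \frac{\hat{p}_b - \hat{p}_w}{\sqrt{\dfrac{\hat{p}_b(1-\hat{p}_b)}{n_b} + \dfrac{\hat{p}_w(1-\hat{p}_w)}{n_w}}},
\qquad \text{p-value} = \Phi(z)
$$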

In [9]:
pb = sum(b.call)/nb
pw = sum(w.call)/nw
print('proportion call black sample = ', pb)
print('proportion call white sample = ', pw)
z = (pb-pw) / np.sqrt((pb*(1-pb)/nb) + (pw*(1-pw)/nw))
print('z_score = ',z)
p_value = stats.norm.cdf(z)
print('p_value = ',p_value)
if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Cannot reject the null hypothesis')

proportion call black sample =  0.064476386037
proportion call white sample =  0.0965092402464
z_score =  -4.11555043573
p_value =  1.93128260376e-05
Reject the null hypothesis
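
The cell above divides by the unpooled standard error. Under the null hypothesis pb=pw, a common alternative is to estimate a single pooled proportion from both groups and use it in the standard error. The following is only a sketch of that variant, not the method used in this notebook. It assumes the callback counts 235 (printed by In [3]) and 157 (inferred here from pb\*nb); the function name `pooled_z_test` is made up for illustration.

```python
import numpy as np
from scipy import stats

def pooled_z_test(x_b, n_b, x_w, n_w):
    """One-tailed (lower) two-proportion z-test with a pooled standard error.

    x_b, x_w: number of callbacks in each group; n_b, n_w: group sizes.
    """
    p_b = x_b / n_b
    p_w = x_w / n_w
    # pooled proportion: the best estimate of the common rate under H0
    p_pool = (x_b + x_w) / (n_b + n_w)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))
    z = (p_b - p_w) / se_pool
    # lower-tail p-value, matching the alternative pb < pw
    p_value = stats.norm.cdf(z)
    return z, p_value

# Counts assumed from the outputs above: 157 = pb * nb (inferred), 235 from In [3]
z_pooled, p_pooled = pooled_z_test(157, 2435, 235, 2435)
print('pooled z_score = ', z_pooled)
print('pooled p_value = ', p_pooled)
```

With groups of equal size and similar proportions, the pooled and unpooled statistics should be close, so the conclusion is not expected to change.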


In [10]:
# margin of error
std_diff = np.sqrt((pb*(1-pb)/nb) + (pw*(1-pw)/nw))
mel, meh = stats.norm.interval(0.9, loc=pb-pw, scale=std_diff)
# I use 0.9 because the function is 2 tails, and my problem is 1 tail
print('The difference of proportion of calls is {:f} with a margin of error {:f}, a confidence interval up to {:f} at a confidence level of 95%'.format(pb-pw, meh-(pb-pw), meh))

The difference of proportion of calls is -0.032033 with a margin of error 0.012803, a confidence interval up to -0.019230 at a confidence level of 95%


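The upper bound of the one-sided 95% confidence interval (-0.019230) is below 0, so the difference is statistically significant: resumes with black-sounding names receive fewer callbacks.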
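
The interval above is one-sided, to match the one-tailed test. For comparison, a conventional two-sided 95% interval can be sketched as follows. This is an illustration rather than part of the original solution, and it assumes the callback counts 157 (inferred from pb\*nb) and 235 (from In [3]) instead of reading the data file.

```python
import numpy as np
from scipy import stats

# Counts assumed from the outputs above (157 is inferred from pb * nb)
x_b, n_b = 157, 2435
x_w, n_w = 235, 2435

p_b = x_b / n_b
p_w = x_w / n_w
diff = p_b - p_w

# unpooled standard error of the difference of two proportions
se = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# two-sided 95% interval: critical value is the 97.5th percentile of N(0, 1)
z_crit = stats.norm.ppf(0.975)
moe = z_crit * se
ci = (diff - moe, diff + moe)

print('difference = {:f}, margin of error = {:f}'.format(diff, moe))
print('two-sided 95% CI = [{:f}, {:f}]'.format(ci[0], ci[1]))
```

The two-sided margin of error is larger than the one-sided one because the critical value is about 1.96 instead of about 1.645.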

Now, use a permutation test (resampling under the null hypothesis)

In [11]:
# initialize replicate : difference of proportion
size = 50000
bs_r = np.empty(size)

# permutation resampling
for i in range(size):
    # permute the samples
    s = np.random.permutation(data.call)
    # assign samples to groups
    s1 = s[:nb]
    s2 = s[nb:]
    # compute replicate: difference of proportion
    bs_r[i] = sum(s1)/len(s1) - sum(s2)/len(s2)
    
# compute p_value: proportion of bs replicates as extreme as the difference in my samples
p_value = sum(bs_r<(pb-pw)) / size

print('p_value = ', p_value)
if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Cannot reject the null hypothesis')

p_value =  4e-05
Reject the null hypothesis
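
The loop above shuffles the call outcomes across both groups, which tests the null hypothesis but does not give a confidence interval. A bootstrap confidence interval instead resamples each group separately with replacement. The sketch below is one possible way to do that; it rebuilds the 0/1 outcomes from the assumed counts 157 (inferred from pb\*nb) and 235 (from In [3]), and the seed and number of replicates are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Rebuild the 0/1 call outcomes from the assumed counts
n_b, n_w = 2435, 2435
calls_b = np.concatenate([np.ones(157), np.zeros(n_b - 157)])
calls_w = np.concatenate([np.ones(235), np.zeros(n_w - 235)])

n_reps = 5000
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    # resample each group separately, with replacement
    res_b = rng.choice(calls_b, size=n_b, replace=True)
    res_w = rng.choice(calls_w, size=n_w, replace=True)
    # replicate: difference of callback proportions
    boot_diffs[i] = res_b.mean() - res_w.mean()

# percentile bootstrap interval for pb - pw
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print('95% bootstrap CI for pb - pw: [{:f}, {:f}]'.format(ci_low, ci_high))
```

If the whole interval lies below 0, it supports the same conclusion as the permutation test.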


With both methods, I reject the null hypothesis.

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Is there racial discrimination in the US labor market?

We take a random sample of 4870 resumes and randomly assign black-sounding names to half of them and white-sounding names to the other half.

We then compare the proportion of resumes in each group that receive a call from an employer.

Resumes with black-sounding names receive a call 6.4% of the time, against 9.7% for resumes with white-sounding names.

If there were no discrimination, the probability of observing a difference at least this large would be about 0.002%.

This is far below the 5% significance level, so we reject the null hypothesis: the data provide strong evidence that resumes with black-sounding names receive fewer callbacks.

This analysis does not mean that race/name is the most important factor in callback success, because we did not compare its influence with that of other factors.
We could repeat the same analysis for other factors in the dataset (for example education or honors) and compare the proportion of calls with and without each factor.
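
As a starting point for that comparison, the callback rate could be tabulated for each level of another column. The helper below is a hypothetical sketch (the name `callback_rate_by` is not from the notebook), and the small DataFrame is invented purely to show the call pattern; it is not the study data.

```python
import pandas as pd

def callback_rate_by(df, factor, outcome='call'):
    """Callback count, group size and callback rate for each level of `factor`."""
    summary = df.groupby(factor)[outcome].agg(callbacks='sum', n='count')
    summary['rate'] = summary['callbacks'] / summary['n']
    return summary

# Toy rows, made up for illustration only (NOT the experiment's data)
toy = pd.DataFrame({
    'honors': [0, 0, 0, 0, 1, 1, 1, 1],
    'call':   [0, 0, 0, 1, 0, 1, 1, 0],
})
toy_summary = callback_rate_by(toy, 'honors')
print(toy_summary)
```

On the real dataset, the call would be something like `callback_rate_by(data, 'honors')`, and the resulting counts could then be fed into the same two-proportion test used for race.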