# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [15]:
import pandas as pd
import numpy as np
from scipy import stats

In [16]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [17]:
data.groupby('race').size()

race
b    2435
w    2435
dtype: int64

In [18]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [19]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [27]:
data.head()

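|   | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|----|----|-----------|--------|----------|--------|-----------|----------|----------|---------------|-----|---------|--------|-------|----------|----------|-------|------------|------------|---------|-----------|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |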


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
<p>1. What test is appropriate for this problem? Does CLT apply?</p>
<p>Both the race and call features are categorical (binary) variables. Because the question compares the callback proportion between two groups, a two-proportion z-test is appropriate. The race labels were assigned at random, the observations are independent, and each group is large (2,435 resumes, with 235 and 157 callbacks respectively), so the CLT applies.</p>
<p>2. What are the null and alternate hypotheses?</p>
<p>Null hypothesis: the callback rate is the same for resumes with black-sounding and white-sounding names.</p>
<p>Alternative hypothesis: the callback rate is not the same for resumes with black-sounding and white-sounding names.</p>
</div>
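The "size is large enough" condition in the answer to Q1 can be checked from the counts printed earlier in the notebook. This is a minimal sketch, assuming the common rule of thumb that each group needs at least 10 callbacks and 10 non-callbacks. The threshold and the helper name `clt_ok` are illustrative assumptions, not part of the exercise.

```python
# Counts taken from the notebook outputs (groupby sizes and callback sums)
n_w, calls_w = 2435, 235   # white-sounding names
n_b, calls_b = 2435, 157   # black-sounding names

def clt_ok(n, successes, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    return successes >= threshold and (n - successes) >= threshold

print(clt_ok(n_w, calls_w), clt_ok(n_b, calls_b))
```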

In [30]:
w = data[data.race=='w']
b = data[data.race=='b']

In [53]:
# Your solution to Q3 here
# 3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
def bootstrap_replicate_1d(data, func):
    """Resample data with replacement and apply func to the resample."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

# White-sounding names: 10,000 bootstrap replicates of the callback count (sum)
sample = draw_bs_reps(w['call'], np.sum, size = 10000)
# Standard error of the callback proportion (a rate, not a count)
sem = np.std(w['call']) / np.sqrt(len(w))
print(sem)

# Standard deviation of the bootstrap callback counts
bs_std = np.std(sample)
print(bs_std)

# 95% confidence interval for the number of callbacks
conf_int = np.percentile(sample, [2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'calls')



# Black-sounding names: 10,000 bootstrap replicates of the callback count (sum)
sampleb = draw_bs_reps(b['call'], np.sum, size = 10000)
# Standard error of the callback proportion (a rate, not a count)
semb = np.std(b['call']) / np.sqrt(len(b))
print(semb)

# Standard deviation of the bootstrap callback counts
bs_stdb = np.std(sampleb)
print(bs_stdb)

# 95% confidence interval for the number of callbacks
conf_intb = np.percentile(sampleb, [2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_intb, 'calls')

0.0059840016981803105
14.378770345199898
95% confidence interval = [207. 263.] calls
0.004977108869798699
12.081098194700678
95% confidence interval = [134. 181.] calls
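
The bootstrap above treats each group separately. The same idea can be applied to the difference in callback rates, which is the quantity the hypotheses are about. The following is a rough sketch, not the notebook's own method. It assumes that resampling a 0/1 `call` column with replacement is equivalent to a binomial draw at the observed rate, and the names `rng`, `bs_diff`, and `null_diff` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

n = 2435                      # resumes per group (from the groupby output)
calls_w, calls_b = 235, 157   # observed callbacks per group
obs_diff = calls_w / n - calls_b / n
reps = 10000

# Bootstrap the difference in callback rates (white minus black)
bs_diff = (rng.binomial(n, calls_w / n, reps) / n
           - rng.binomial(n, calls_b / n, reps) / n)
conf_int_diff = np.percentile(bs_diff, [2.5, 97.5])
margin_of_error = (conf_int_diff[1] - conf_int_diff[0]) / 2

# Simulate the null hypothesis: both groups share the pooled callback rate
pooled = (calls_w + calls_b) / (2 * n)
null_diff = (rng.binomial(n, pooled, reps) / n
             - rng.binomial(n, pooled, reps) / n)
p_value = np.mean(np.abs(null_diff) >= abs(obs_diff))

print('observed difference =', obs_diff)
print('95% confidence interval =', conf_int_diff)
print('margin of error =', margin_of_error)
print('p-value =', p_value)
```

If the printed p-value is 0, that only means none of the 10,000 simulated differences was as extreme as the observed one, so the p-value is below roughly 1 in 10,000.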


In [54]:
import statsmodels.api as sm
# Two-proportion z-test: callback counts and sample sizes for white- and black-sounding names
z_score, p_value = sm.stats.proportions_ztest([235, 157], [2435, 2435])
print(z_score, p_value)

4.108412152434346 3.983886837585077e-05
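
For the frequentist side of Q3, the margin of error and confidence interval for the difference in callback rates can be worked out by hand from the same counts. This is a minimal sketch using only the standard library. It assumes a pooled standard error for the test statistic, an unpooled standard error for the interval, and 1.96 as the 95% critical value.

```python
import math

n_w = n_b = 2435
calls_w, calls_b = 235, 157
p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# z-test under the null hypothesis (equal rates): pooled standard error
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pool
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# 95% confidence interval for the difference: unpooled standard error
se = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
margin_of_error = 1.96 * se
conf_int = (diff - margin_of_error, diff + margin_of_error)

print(z, p_value)
print(margin_of_error, conf_int)
```

The z statistic and p-value computed this way should agree with the `proportions_ztest` output above.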


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
<p>4. Write a story describing the statistical significance in the context of the original problem.</p>
<p>The p-value is about 0.00004, far below 0.05, so we reject the null hypothesis: the callback rate differs between resumes with black-sounding and white-sounding names. Based on the 95% bootstrap confidence intervals, out of 2,435 resumes each, white-sounding names would receive between 207 and 263 callbacks, while black-sounding names would receive only between 134 and 181.</p>
<p></p>
<p>5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?</p>
<p>It does not mean that race/name is the most important factor. The analysis only compares callback rates by race, so it cannot rank race against other factors. To do that, I would need to look more closely at the relationship between the callback rate and the other features in the data.</p>


</div>
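
One way to start on the amendment described in the answer to Q5 is to compare callback rates across other features, one at a time. This is only a sketch: the `toy` frame is a made-up stand-in for the notebook's `data`, and the helper name `callback_rate_by` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame; these values are made up
toy = pd.DataFrame({
    'race':   ['w', 'w', 'w', 'w', 'b', 'b', 'b', 'b'],
    'honors': [1, 0, 1, 0, 1, 0, 1, 0],
    'call':   [1, 0, 1, 1, 1, 0, 0, 0],
})

def callback_rate_by(df, feature):
    """Mean callback rate for each level of one feature."""
    return df.groupby(feature)['call'].mean()

for feature in ['race', 'honors']:
    print(callback_rate_by(toy, feature))
```

On the real data, the same helper could be pointed at columns shown in `data.head()`, such as `education` or `yearsexp`. Comparing factors jointly would need a model with `call` as the outcome, for example a logistic regression.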