
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names and total number of black-sounding names
cb_b, n_b = sum(data[data.race=='b'].call), len(data[data.race=='b'])

In [4]:
# number of callbacks for white-sounding names and total number of white-sounding names
cb_w, n_w = sum(data[data.race=='w'].call), len(data[data.race=='w'])

## 1. What test is appropriate? Are conditions for the CLT met?

The first step in this problem is to check the conditions of the Central Limit Theorem for proportions:

1) Independence:
    * Random assignment: the 'b' and 'w' names were assigned to the resumes at random, so the two groups are independent of each other.
    * n1 and n2 < 10% of population: each group contains 2435 resumes, which is assumed to be well under 10% of all resumes submitted in the US job market.

2) Sample size:
    * Each group has at least 10 successes (callbacks) and at least 10 failures (no callback).

We can conclude that the conditions for the CLT are met, so the sampling distribution of the difference in proportions is approximately normal and a two-sample z-test for proportions is appropriate.
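The sample-size condition can also be checked in code rather than by eye. The following is a minimal sketch: the helper name `clt_conditions_met` is hypothetical, and the callback counts (157 and 235 out of 2435) are assumed values chosen to be consistent with the rates printed later in this notebook. In the notebook itself, `cb_b`, `n_b`, `cb_w` and `n_w` would be passed instead.

```python
def clt_conditions_met(successes, n, minimum=10):
    """Return True if a group has at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

# Assumed counts (callbacks, group size), not read from the data file;
# in the notebook, use (cb_b, n_b) and (cb_w, n_w) instead.
groups = {'b': (157, 2435), 'w': (235, 2435)}
checks = {race: clt_conditions_met(cb, n) for race, (cb, n) in groups.items()}
print(checks)
```

Both groups need to pass for the normal approximation to be reasonable.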

## 2. What are the null and alternate hypotheses?

The hypotheses for this test will be set up as follows:

    * H0: p_w - p_b = 0, i.e. the callback rate is the same for white-sounding and black-sounding names
    * HA: p_w - p_b ≠ 0, i.e. the callback rate differs between white-sounding and black-sounding names

## 3. Compute the margin of error, confidence interval, and p-value

In [5]:
# Compute the margin of error at a 95% confidence level (significance level of 0.05)
p_b, p_w = cb_b/n_b, cb_w/n_w
z = stats.norm.ppf(0.975)
SE = np.sqrt((p_b*(1-p_b))/n_b + (p_w*(1-p_w))/n_w)
margin_error = z*SE
print('Margin of Error = ', margin_error)
print('Confidence Interval: ', p_w - p_b, '+/-', margin_error)

Margin of Error =  0.0152551260282
Confidence Interval:  0.0320328542094 +/- 0.0152551260282
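The interval printed above is given as an estimate plus or minus a margin. To see its endpoints directly, the same calculation can be wrapped in a small function. This is a sketch under stated assumptions: `diff_proportion_ci` is a hypothetical helper, the counts 157 and 235 are assumed values consistent with the output above, and the standard library's `statistics.NormalDist` stands in for `scipy.stats.norm` so the block runs on its own.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportion_ci(cb_1, n_1, cb_2, n_2, confidence=0.95):
    """Confidence interval for p_1 - p_2 using the unpooled standard error."""
    p_1, p_2 = cb_1 / n_1, cb_2 / n_2
    # Critical value for a two-sided interval, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    diff = p_1 - p_2
    return diff - z * se, diff + z * se

# Assumed counts: white-sounding group first, then black-sounding group
lower, upper = diff_proportion_ci(235, 2435, 157, 2435)
print('95% CI for p_w - p_b: ({:.4f}, {:.4f})'.format(lower, upper))
```

With the estimate and margin of error printed above, the interval runs from roughly 0.017 to 0.047, so it does not include zero.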


In [6]:
# Compute the two-sided p-value; abs() keeps it valid if the z-statistic is negative
z_s = (p_w - p_b)/SE
p_value = (1 - stats.norm.cdf(abs(z_s)))*2
print('z-statistic = ', z_s, 'p-value = ', p_value)

z-statistic =  4.11555043573 p-value =  3.86256520752e-05
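The cell above divides by the unpooled standard error, the same one used for the confidence interval. A common textbook variant for testing H0: p_w - p_b = 0 instead pools the two groups into a single proportion, because under the null hypothesis both groups share one callback rate. The sketch below shows that variant for comparison only; it is not the method used in this notebook. The function name `two_proportion_ztest` is hypothetical, the counts are assumed values consistent with the output above, and `math.erfc` is used in place of `scipy.stats.norm` to keep the block self-contained.

```python
from math import erfc, sqrt

def two_proportion_ztest(cb_1, n_1, cb_2, n_2):
    """Two-sided z-test of p_1 - p_2 = 0 using the pooled standard error."""
    p_1, p_2 = cb_1 / n_1, cb_2 / n_2
    # Pooled proportion: the single callback rate assumed under H0
    p_pool = (cb_1 + cb_2) / (n_1 + n_2)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se_pool
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided p-value
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Assumed counts: white-sounding group first, then black-sounding group
z_pooled, p_pooled = two_proportion_ztest(235, 2435, 157, 2435)
print('pooled z-statistic = ', z_pooled, 'p-value = ', p_pooled)
```

With groups of equal size and similar callback rates, the pooled and unpooled statistics should be close, so the conclusion of the test is not expected to change.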


## 4. Discuss statistical significance

Based on the results above, the racial association of a name does appear to affect resume callbacks. The p-value of about 3.9e-05 is far below the significance level of 0.05, so we reject the null hypothesis, and the 95% confidence interval for p_w - p_b (0.032 +/- 0.015) does not include zero. Although we can reject the null hypothesis, the results of this experiment do not by themselves make a definitive statement about racial discrimination. The experiment used black-sounding and white-sounding names, which do not necessarily belong to people who are actually black or white. Resume callbacks also say nothing about discrimination when the applicant's race is definitively known, either by self-identification or by meeting the person. The study as a whole points to racial discrimination, but this analysis alone does not establish why employers responded differently to the two sets of names.