
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.read_stata('us_job_market_discrimination.dta')

In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

# 1. What test is appropriate for this problem? Does CLT apply?
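One way to approach this question: both groups are large, the names are randomly assigned, and `call` is a 0/1 outcome, so a two-sample z-test for a difference in proportions is a natural candidate, provided each group has enough callbacks and non-callbacks for the normal approximation to hold. The sketch below checks the usual rule of thumb (at least 10 successes and 10 failures per group). The helper name `clt_conditions_met`, the threshold of 10, the group size of 2,435, and the 235 white-sounding callbacks are assumptions for illustration; only the 157 black-sounding callbacks appear in the output above.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Rule-of-thumb check that a sample proportion is approximately normal."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Assumed counts for illustration (only the 157 is shown in this notebook).
n_per_group = 2435
calls_b = 157
calls_w = 235

print(clt_conditions_met(calls_b, n_per_group))
print(clt_conditions_met(calls_w, n_per_group))
```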

# 2. What are the null and alternate hypotheses?
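Writing $p_b$ and $p_w$ for the true callback rates of résumés with black-sounding and white-sounding names (symbols introduced here for convenience), the hypotheses for a two-sided test, matching the null hypothesis P1 = P2 used in the discussion at the end of this notebook, can be stated as:

```latex
H_0:\; p_b = p_w \qquad \text{vs.} \qquad H_1:\; p_b \neq p_w
```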

# 3. Compute margin of error, confidence interval, and p-value.

In [5]:
# check data is there
df = data[['race','call']]
df.tail(10)

| | race | call |
|---|---|---|
| 4860 | w | 0.0 |
| 4861 | w | 1.0 |
| 4862 | w | 0.0 |
| 4863 | w | 0.0 |
| 4864 | b | 0.0 |
| 4865 | b | 0.0 |
| 4866 | b | 0.0 |
| 4867 | w | 0.0 |
| 4868 | b | 0.0 |
| 4869 | w | 0.0 |


In [6]:
# create black and white dfs
df_b = df[df['race'] == 'b']
df_w = df[df['race'] == 'w']

In [7]:
# create dataframes with calls == 1
df_b1 = df_b[df_b.call == 1]
df_w1 = df_w[df_w.call == 1]

In [8]:
# sample callback proportion for each group (callbacks / resumes)
b_mean = len(df_b1) / len(df_b)
w_mean = len(df_w1) / len(df_w)
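
Because `call` is coded 0/1, the callback proportion of each group is simply the mean of that column, so the same quantities can be obtained in one step with `groupby`. The sketch below uses a tiny made-up DataFrame (not the study data) so that it runs on its own.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the study.
toy = pd.DataFrame({
    "race": ["b", "b", "b", "b", "w", "w", "w", "w"],
    "call": [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
})

# For a 0/1 column: sum = number of callbacks, count = group size,
# mean = callback proportion.
summary = toy.groupby("race")["call"].agg(["sum", "count", "mean"])
print(summary)
```

On the notebook's data, the equivalent call would be `df.groupby('race')['call'].agg(['sum', 'count', 'mean'])`.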

In [9]:
# variance of each group's Bernoulli distribution, p * (1 - p)
# (note: despite the _sd names, these are variances, not standard deviations)
b_sd = b_mean * (1 - b_mean)
w_sd = w_mean * (1 - w_mean)

In [10]:
# point estimate: absolute difference between the two sample proportions
mean_diff = np.abs(b_mean - w_mean)

In [11]:
# standard error of the difference in proportions (unpooled)
pop_sd_est = np.sqrt((b_sd / len(df_b)) + (w_sd / len(df_w)))
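
For reference, the quantity computed in `pop_sd_est` is the unpooled standard error of the difference between two independent sample proportions. With $\hat{p}_b$, $\hat{p}_w$ standing for `b_mean`, `w_mean` and $n_b$, $n_w$ for the group sizes (notation introduced here), it reads:

```latex
\mathrm{SE}(\hat{p}_b - \hat{p}_w)
  = \sqrt{\frac{\hat{p}_b\,(1 - \hat{p}_b)}{n_b} + \frac{\hat{p}_w\,(1 - \hat{p}_w)}{n_w}}
```

Each term under the root is a Bernoulli variance $\hat{p}(1 - \hat{p})$ divided by its sample size, which is why `b_sd` and `w_sd` enter without a square root of their own.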

# Margin of Error

In [12]:
# margin of error at roughly 95% confidence: about 2 standard errors
# (2 is a rounded version of the exact critical value, z = 1.96)
mo_error = 2 * pop_sd_est
print("margin of error: ", np.round(mo_error, 4) * 100, '%')

margin of error:  1.56 %
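
The factor of 2 used here is a rounded version of the 95% critical value of the standard normal distribution. As a sketch, the exact value can be taken from `scipy.stats.norm.ppf`; the helper `margin_of_error` and the example standard error of 0.0078 (roughly half the margin printed above) are illustrative assumptions.

```python
from scipy import stats


def margin_of_error(std_error, confidence=0.95):
    """Half-width of a two-sided normal confidence interval."""
    # Two-sided: split the remaining probability between both tails.
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z_crit * std_error


# Critical value for 95% confidence (about 1.96, slightly below 2).
print(stats.norm.ppf(0.975))

# Illustrative standard error, roughly half the margin printed above.
print(margin_of_error(0.0078))
```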


# Confidence Interval

In [None]:
# the 95% CI is the point estimate minus and plus the margin of error calculated in the previous step

In [13]:
print("95% confidence interval for |P1 - P2|:", np.round(mean_diff, 4), "+-", np.round(mo_error, 4))

95% confidence interval for |P1 - P2|: 0.032 +- 0.0156
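
Spelled out as explicit endpoints, the interval is the point estimate minus and plus the margin of error. The sketch below wraps that in a hypothetical helper, `diff_ci`, and feeds it the rounded values printed above.

```python
def diff_ci(point_estimate, margin):
    """Return (lower, upper) bounds of a symmetric confidence interval."""
    return point_estimate - margin, point_estimate + margin


# Rounded values taken from the output above.
lower, upper = diff_ci(0.032, 0.0156)
print(round(lower, 4), round(upper, 4))

# If 0 lies outside the interval, a difference of zero is not
# compatible with the data at this confidence level.
print(lower > 0)
```

Since the lower bound is above zero, the interval excludes a difference of zero.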


# p-value

In [None]:
# the p-value can be obtained from z-tables, online calculators, or the normal CDF in scipy.stats

In [14]:
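# two-sided p-value: twice the probability that a Normal(mean_diff, pop_sd_est)
# variable falls below 0, i.e. 2 * Phi(-mean_diff / pop_sd_est)
p_value = 2 * stats.norm.cdf(0, mean_diff, pop_sd_est)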

In [15]:
print("p-value = ", p_value)

p-value =  3.86256520752e-05
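
A common textbook variant of this calculation pools the two samples when estimating the standard error, since under the null hypothesis both groups share one callback rate. The sketch below implements that pooled two-proportion z-test; the function name `two_proportion_ztest` is hypothetical, and the group size of 2,435 and the 235 white-sounding callbacks are assumed for illustration (only the 157 black-sounding callbacks are shown in this notebook).

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # sf is the upper-tail probability; doubling it gives the two-sided p-value.
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


# Assumed counts for illustration: black-sounding first, white-sounding second.
z, p_value = two_proportion_ztest(157, 2435, 235, 2435)
print(z, p_value)
```

With these counts, the pooled version should give a p-value of the same order of magnitude as the unpooled one above.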


# 4. Discuss statistical significance.

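The p-value (about 3.9e-05) is far below the conventional significance level of α = 0.05, which corresponds to the 95% confidence level used above. We can therefore reject the null hypothesis P1 = P2. In other words, the data support the alternative hypothesis that résumés with white-sounding and black-sounding names receive callbacks at different rates. Because the names were assigned to the résumés at random, this difference is evidence of racial discrimination in the US labor market.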