# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [4]:
import pandas as pd
import numpy as np
from scipy import stats

In [11]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [24]:
#  number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [13]:
#data.info()
data.head()

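| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |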


In [14]:
len(data)  # Total number of resumes in the study

4870

In [15]:
len(data[data.call==1])  # Total number of resumes that received a callback

392

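Since 'call' is a categorical variable that takes only the values 1 and 0, and 'race' is also categorical, we will use a chi-squared test. The CLT applies: names were assigned to resumes at random, so the observations can be treated as independent, and each group is large (2435 resumes), so the sampling distribution of the callback proportion is approximately normal even though the underlying data are binary.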
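A common rule of thumb for applying the CLT to a proportion is that each group should contain at least 10 successes and 10 failures. The sketch below checks this using the counts reported in this notebook; the helper name `clt_counts_ok` and the threshold of 10 are illustrative assumptions, not part of the original analysis.

```python
# Counts taken from this notebook: 2435 resumes per group,
# 157 callbacks for black-sounding names, 235 for white-sounding names.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235


def clt_counts_ok(successes, n, threshold=10):
    """Rule-of-thumb check (assumed threshold): at least `threshold`
    successes and `threshold` failures in the sample."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


print(clt_counts_ok(calls_b, n_b))  # black-sounding names
print(clt_counts_ok(calls_w, n_w))  # white-sounding names
```

Both groups comfortably satisfy the condition, which supports using a normal or chi-squared approximation here.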

Our null hypothesis:

There is no difference in callback rates between black-named and white-named resumes.

Alternate hypothesis:

There is a difference in callback rates between black-named and white-named resumes.
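The two hypotheses can also be written compactly. The symbols $p_b$ and $p_w$ are notation introduced here for illustration: they stand for the true callback rates for resumes with black-sounding and white-sounding names.

$$
H_0: p_b = p_w \qquad \text{vs.} \qquad H_A: p_b \neq p_w
$$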

In [28]:
s1 = sum(data[data.race=='b'].call)  # Callbacks for black-sounding names
s2 = sum(data[data.race=='w'].call)  # Callbacks for white-sounding names
print(s1, s2)

157.0 235.0


Assume that there is no difference in callback rates between black-named and white-named resumes. Since the total number of resumes in the study is 4870 and the total number that received a callback is 392, we can calculate the expected number of callbacks for each group:


In [20]:
tpop = 4870                   # Total number of resumes in the study
rbpop = 392                   # Total number of resumes that received a callback
cbratio = 1.0 * rbpop / tpop  # Overall callback rate under the null hypothesis

b_pop = len(data[data.race=='b'])  # Number of black-named resumes (2435)
w_pop = len(data[data.race=='w'])  # Number of white-named resumes (2435)

b_e_pop = b_pop * cbratio  # Expected callbacks for black-named resumes
w_e_pop = w_pop * cbratio  # Expected callbacks for white-named resumes

b_a_pop = sum(data[data.race=='b'].call)  # Actual callbacks for black-named resumes
w_a_pop = sum(data[data.race=='w'].call)  # Actual callbacks for white-named resumes

In [21]:
print(b_e_pop)
print(w_e_pop)
print(b_a_pop)
print(w_a_pop)

196.0
196.0
157.0
235.0


In [23]:
# Now we can calculate the chi-square statistic from the callback counts
# (observed vs. expected callbacks in each group).

Df = (2 - 1) * (2 - 1)  # Degrees of freedom: 1

Chisq = (b_a_pop - b_e_pop)**2 / b_e_pop + (w_a_pop - w_e_pop)**2 / w_e_pop
Chisq

15.520408163265309

Looking up the chi-square table for 1 degree of freedom, we can see that the p-value is less than 0.005. We can therefore confidently reject the null hypothesis and conclude that there is a difference in callback rates between black-named and white-named resumes.
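Rather than reading a printed table, the p-value can be computed directly from the chi-square distribution. The sketch below recomputes the statistic from the counts reported above and uses `scipy.stats.chi2`. The variable names and the choice of a 0.05 significance level for the critical value are assumptions made for illustration.

```python
from scipy import stats

# Observed and expected callback counts reported in this notebook.
observed = [157.0, 235.0]
expected = [196.0, 196.0]

# Same statistic as computed by hand above.
chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = 1

# Upper-tail probability of a value at least this large under the null.
p_value = stats.chi2.sf(chisq, df)

# Critical value for a test at the (assumed) 0.05 significance level.
critical_05 = stats.chi2.ppf(0.95, df)

print(chisq, p_value, critical_05)
```

The statistic lies well above the critical value, which is consistent with the table lookup.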

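For a 95% level, alpha = 0.05. With 1 degree of freedom, the lower chi-square critical value (cumulative probability alpha/2 = 0.025) is 0.001 and the upper critical value (cumulative probability 1 - alpha/2 = 0.975) is 5.024, so the central 95% of the null distribution lies in (0.001, 5.024). The observed statistic of 15.52 falls far outside this range. Note that this is a range for the test statistic under the null hypothesis. A confidence interval for the difference in callback rates would instead be built from the two sample proportions.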
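Exercise 3 also asks for a margin of error and a confidence interval. One common approach, sketched below, is a normal-approximation (Wald) interval for the difference between the two callback proportions. This is an assumed method rather than the one used in the original analysis, and all variable names are illustrative.

```python
import math

from scipy import stats

# Counts reported in this notebook.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

p_b = calls_b / n_b  # callback rate, black-sounding names
p_w = calls_w / n_w  # callback rate, white-sounding names
diff = p_w - p_b     # difference in callback rates

# Standard error of the difference between two independent proportions.
se = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# Two-sided 95% critical value of the standard normal distribution.
z_crit = stats.norm.ppf(0.975)

margin_of_error = z_crit * se
ci = (diff - margin_of_error, diff + margin_of_error)

print(diff, margin_of_error, ci)
```

Under these assumptions the interval sits at roughly 1.7 to 4.7 percentage points and excludes zero, which agrees with rejecting the null hypothesis.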

The same test can also be run with the `scipy.stats` module, for example with `stats.chisquare`.
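A minimal sketch of the SciPy version, using the counts reported in this notebook. `stats.chisquare` with its default uniform expected frequencies reproduces the hand calculation on callback counts. `stats.chi2_contingency` is shown as an alternative that also includes the resumes that received no callback; it is a slightly different test, so its statistic is not expected to match exactly.

```python
from scipy import stats

# Callback counts: black-sounding names, white-sounding names.
callbacks = [157, 235]

# Goodness-of-fit test: are callbacks split evenly between the two groups?
# With equal group sizes, the default expected counts are 196 and 196.
chi2_gof, p_gof = stats.chisquare(callbacks)

# Full 2x2 table: rows are groups, columns are (callback, no callback).
table = [
    [157, 2435 - 157],
    [235, 2435 - 235],
]
# correction=False turns off the Yates continuity correction.
chi2_full, p_full, dof, expected = stats.chi2_contingency(table, correction=False)

print(chi2_gof, p_gof)
print(chi2_full, p_full, dof)
```

Both tests have 1 degree of freedom here, and both lead to the same conclusion at the 0.005 level.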

According to the analysis, there is indeed a difference in callback success by race. However, our test did not in any way examine whether race/name is the most important factor in callback success. In fact, the data show that even white-named resumes have a callback rate of only 235/2435 = 9.6%. To test whether race/name is the most important factor, we would need to conduct further tests that compare the effects of the other factors on callback success.
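One possible way to start comparing factors is to run the same chi-squared test for each candidate column and report an effect size alongside the p-value. The sketch below is illustrative only: the function name `rank_factors`, the use of Cramér's V as the effect size, and the toy data are all assumptions, and a fuller analysis would more likely use a multivariate model such as logistic regression.

```python
import numpy as np
import pandas as pd
from scipy import stats


def rank_factors(df, factors, outcome="call"):
    """Chi-squared test of each factor against the outcome,
    ranked by Cramér's V (a 0-to-1 measure of association strength)."""
    rows = []
    for factor in factors:
        table = pd.crosstab(df[factor], df[outcome]).to_numpy()
        chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)
        n = table.sum()
        k = min(table.shape) - 1
        rows.append({
            "factor": factor,
            "chi2": chi2,
            "p_value": p,
            "cramers_v": np.sqrt(chi2 / (n * k)),
        })
    result = pd.DataFrame(rows).set_index("factor")
    return result.sort_values("cramers_v", ascending=False)


# Toy data for illustration only (NOT the study data).
toy = pd.DataFrame({
    "race": ["b"] * 50 + ["w"] * 50,
    "honors": [0, 1] * 50,
    "call": [0] * 45 + [1] * 5 + [0] * 35 + [1] * 15,
})
ranked = rank_factors(toy, ["race", "honors"])
print(ranked)
```

With the real dataset, the same function could be called as `rank_factors(data, ['race', 'honors', 'volunteer', 'military'])`, using column names shown in the `data.head()` output above.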