# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

1. A two-sample z-test for proportions is appropriate. There are 4,870 resumes in this experiment, far more than the usual rule-of-thumb threshold of 30, so the CLT applies and the sampling distribution of each proportion is approximately normal. I want to compare the proportion of white-sounding-name resumes that get a callback with the proportion of black-sounding-name resumes that do.

2. The null hypothesis is that resumes with white-sounding and black-sounding names receive callbacks at the same rate. The alternative hypothesis is that resumes with white-sounding names receive a higher proportion of callbacks than resumes with black-sounding names.
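A common rule of thumb for the normal approximation behind a proportion z-test is that each group has at least 10 successes and 10 failures. The sketch below checks that, using the callback counts computed further down in this notebook (157 and 235) and assuming the 4,870 resumes split evenly into 2,435 per group. The threshold of 10 and the helper name `large_sample_ok` are illustrative choices, not part of the original analysis.

```python
def large_sample_ok(successes, n, threshold=10):
    """Return True if both the success and failure counts reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Assumption: the 4,870 resumes are split evenly between the two name groups.
n_per_group = 4870 // 2

# Callback counts per group, as reported by the notebook.
callbacks = {'b': 157, 'w': 235}

checks = {race: large_sample_ok(count, n_per_group)
          for race, count in callbacks.items()}
print(checks)
```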

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats as st

In [3]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [4]:
# number of callbacks for black-sounding and white-sounding names, in that order
sum(data[data.race=='b'].call), sum(data[data.race=='w'].call)

(157.0, 235.0)

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [8]:
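# split the resumes by the randomly assigned name type
white, black = data[data.race=='w'], data[data.race=='b']
# callback proportion in each group
whitep, blackp = white.call.sum()/len(white), black.call.sum()/len(black)
# standard error of each proportion: sqrt(p*(1-p)/n)
white_stdv, black_stdv = ((whitep*(1-whitep))/len(white))**.5, ((blackp*(1-blackp))/len(black))**.5
# difference in proportions and its (unpooled) standard error
diff = whitep-blackp
stdv = (white_stdv**2+black_stdv**2)**.5
diff, stdv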

(0.032032854209445585, 0.0077833705866767544)
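As a cross-check that does not need the data file, the same two numbers can be recomputed from the counts alone. This is a minimal Python 3 sketch: the function name `diff_and_se` is hypothetical, and the group size of 2,435 assumes the 4,870 resumes are split evenly between the two name types.

```python
import math

def diff_and_se(successes_a, n_a, successes_b, n_b):
    """Difference in sample proportions (a minus b) and its unpooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a - p_b, se

# 235 callbacks for white-sounding names, 157 for black-sounding names.
# Assumption: 2,435 resumes in each group.
diff_check, se_check = diff_and_se(235, 2435, 157, 2435)
print(diff_check, se_check)
```

The unpooled standard error adds the two per-group variances, which is the same formula the cell above applies to the DataFrame columns.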

In [11]:
margin_of_error = stdv*1.96 # 1.96 is the z critical value for 95% confidence
confidence_interval = [diff-margin_of_error, diff+margin_of_error]
pvalue = 1 - st.norm.cdf(diff/stdv) # one-sided: P(Z >= z) under the null
print('margin of error: ', margin_of_error, 'confidence interval: ', confidence_interval, 'p-value: ', pvalue)

margin of error:  0.0152554063499 confidence interval:  [0.016777447859559147, 0.047288260559332024] p-value:  1.93128260376e-05
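The p-value above uses the unpooled standard error. Many textbook versions of the two-proportion z-test instead pool the two groups when computing the standard error, since the null hypothesis says they share one callback rate. The sketch below computes the one-sided p-value both ways with only the standard library. It assumes 2,435 resumes per group; `normal_sf` is a hypothetical helper built on `math.erfc`, not a SciPy function.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n = 2435                      # assumed resumes per group (4,870 split evenly)
calls_w, calls_b = 235, 157   # callbacks for white- and black-sounding names
p_w, p_b = calls_w / n, calls_b / n
diff = p_w - p_b

# Unpooled standard error: each group keeps its own estimated variance.
se_unpooled = math.sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)

# Pooled standard error: one shared callback rate, as the null hypothesis states.
p_pool = (calls_w + calls_b) / (2 * n)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (2 / n))

p_unpooled = normal_sf(diff / se_unpooled)
p_pooled = normal_sf(diff / se_pooled)
print(p_unpooled, p_pooled)
```

Both variants give a p-value on the order of 10^-5, so the conclusion does not depend on which standard error is used.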


4. Researchers studied how prevalent racism is in the US labor market. To do so, they randomly assigned identical resumes to white-sounding and black-sounding names before submitting them to employers, then recorded the number of calls these fictitious job seekers received about their applications. Resumes with white-sounding names received callbacks at a rate 3.2 percentage points higher than their black-sounding counterparts. The margin of error at 95% confidence is about 1.5 percentage points, so the confidence interval for the difference (roughly 1.7 to 4.7 points) excludes zero. If in truth there were no racism at play, the chance of recording a difference at least this large through random variation alone would be about 1.9*10^-5. The result is clearly significant: the callback rate is not the same for black-sounding and white-sounding names.

5. The description of the experiment is frustratingly ambiguous. The study says it randomly assigns identical resumes to the different groups. Does that mean each group gets one copy of every resume in the pool? Does it mean that all the resumes are the same? If it's either of these cases, then racism is the biggest effect in callback success. However, if there are many different resumes that look similar but carry different qualifications, then the qualifications are probably the biggest factor in callback success. I would modify the study so that the same resumes are assigned to white and black names, or break the analysis up and compare only resumes with similar qualifications. For example, I could split the analysis into three categories of resume qualification: low, medium, and highly qualified.
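The stratified comparison proposed above could be sketched as follows. Everything here is illustrative: the function `callback_gap_by_stratum`, the use of `yearsexp` as a stand-in for qualification, the cutoffs of 5 and 10 years, and the toy rows are all assumptions made for the example, not values taken from the study.

```python
from collections import defaultdict

def callback_gap_by_stratum(rows, stratum_of):
    """Return {stratum: white callback rate - black callback rate}.

    rows: iterable of dicts with keys 'race' ('b' or 'w') and 'call' (0 or 1).
    stratum_of: function mapping a row to its stratum label.
    """
    # counts[stratum][race] holds [number of callbacks, number of resumes]
    counts = defaultdict(lambda: {'b': [0, 0], 'w': [0, 0]})
    for row in rows:
        cell = counts[stratum_of(row)][row['race']]
        cell[0] += row['call']
        cell[1] += 1
    gaps = {}
    for stratum, by_race in counts.items():
        (calls_b, n_b), (calls_w, n_w) = by_race['b'], by_race['w']
        if n_b and n_w:  # skip strata missing one of the groups
            gaps[stratum] = calls_w / n_w - calls_b / n_b
    return gaps

def experience_band(row):
    """Hypothetical qualification proxy: bin years of experience into three bands."""
    if row['yearsexp'] < 5:
        return 'low'
    return 'medium' if row['yearsexp'] < 10 else 'high'

# Made-up toy rows, only to show the call pattern; NOT data from the study.
toy_rows = [
    {'race': 'w', 'call': 1, 'yearsexp': 2},
    {'race': 'w', 'call': 0, 'yearsexp': 2},
    {'race': 'b', 'call': 0, 'yearsexp': 3},
    {'race': 'b', 'call': 0, 'yearsexp': 1},
    {'race': 'w', 'call': 1, 'yearsexp': 7},
    {'race': 'b', 'call': 0, 'yearsexp': 6},
    {'race': 'w', 'call': 1, 'yearsexp': 12},
    {'race': 'b', 'call': 1, 'yearsexp': 15},
]

gaps = callback_gap_by_stratum(toy_rows, experience_band)
print(gaps)
```

With the real data, each stratum's gap would still need its own standard error and test, since splitting the sample leaves fewer resumes per comparison.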