# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [5]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


#### 1. What test is appropriate for this problem? Does CLT apply?

##### This calls for a two-sample test, because we are comparing callback rates between two independent groups of resumes.  The CLT applies because each group is large (2,435 resumes, far more than 30) and each contains well over 10 callbacks and 10 non-callbacks, so the sampling distribution of the difference in proportions is approximately normal.  A two-sample Z-test for proportions is appropriate, because with samples this large the standard error can be estimated reliably from the sample proportions.
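
The large-sample condition can be checked directly from the counts. The snippet below is a minimal sketch, not part of the original analysis: the helper `large_enough` and the threshold of 10 are illustrative choices, and the count of 235 callbacks for white-sounding names is inferred from the reported rate difference rather than printed anywhere in the notebook.

```python
# Counts taken from the notebook: 2435 resumes per group and 157 callbacks
# for black-sounding names. The 235 callbacks for white-sounding names is
# an inferred value (157 + 0.03203 * 2435), not a printed output.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

def large_enough(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    return successes >= threshold and (n - successes) >= threshold

# The normal approximation is reasonable only if both groups pass the check.
clt_ok = large_enough(calls_w, n_w) and large_enough(calls_b, n_b)
print(clt_ok)
```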

#### 2. What are the null and alternate hypotheses?

##### Null Hypothesis: The callback rate for black-sounding names is the same as for white-sounding names.  P1 - P2 = 0, where P1 and P2 are the callback rates for white-sounding and black-sounding names.

##### Alternate Hypothesis: The callback rate for black-sounding names is NOT the same as for white-sounding names.  P1 - P2 != 0

#### 3. Compute margin of error, confidence interval, and p-value.

In [23]:
# P1 = White names % callback
p1 = sum(data[data.race=='w'].call)/len(data[data.race=='w'])

# P2 = Black names % callback
p2 = sum(data[data.race=='b'].call)/len(data[data.race=='b'])

# Difference in sample population means
mean = p1 - p2
mean

0.032032854209445585

In [21]:
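# Standard error of the difference in proportions (unpooled);
# each group contains 2435 resumes, so both variance terms share the same n
n = 2435
std = ((p1*(1-p1) + p2*(1-p2))/n)**(1/2.0)

# Margin of error at 95% confidence (z score of 1.96)
std*1.96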

0.015255406349886438

In [20]:
# 95% confidence interval of the difference in means (z score of 1.96)
mean - 1.96*std, mean + 1.96*std

(0.016777447859559147, 0.047288260559332024)
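
The exercise also asks for a p-value. One common way to get it is a two-sample z-test that pools the two proportions under the null hypothesis. The sketch below rests on stated assumptions: it uses only the standard library, works from the counts rather than the DataFrame, and takes 235 callbacks for white-sounding names as a value inferred from the rate difference above.

```python
import math

# Counts from the notebook: 2435 resumes per group, 157 callbacks for
# black-sounding names. 235 for white-sounding names is inferred, not printed.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w = calls_w / n_w
p_b = calls_b / n_b

# Under H0 both groups share one callback rate, so pool the counts
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

# z statistic and two-sided p-value; erfc(|z|/sqrt(2)) equals 2*(1 - Phi(|z|))
z = (p_w - p_b) / se_pool
p_value = math.erfc(abs(z) / math.sqrt(2))
print(z, p_value)
```

With these counts the z statistic comes out at roughly 4.1, which puts the two-sided p-value far below the usual 0.05 threshold. The pooled standard error is used only for the test; the confidence interval above keeps the unpooled version.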

#### 4. Write a story describing the statistical significance in the context of the original problem.

##### Based on the sample, we are 95% confident that the callback rate for white-sounding names exceeds the callback rate for black-sounding names by between 1.678 and 4.729 percentage points.  Because this interval does not contain zero, we reject the null hypothesis that white-sounding and black-sounding names receive callbacks at the same rate.

#### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

##### My analysis does not mean that race / name is the most important factor in callback success.  It shows only that the name on a resume has a statistically significant effect on the callback rate.  To determine the most important factor, I would also need to measure the effect of other factors, including years of experience, education level, number of prior jobs, gender, and anything else likely to shape an employer's perception of the candidate, and then compare the sizes of those effects.
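
A first step toward that comparison is to compute, for each factor, how far the callback rate moves across its levels. The sketch below is illustrative only: `callback_rate_by` and `rate_spread` are hypothetical helpers, and `toy_rows` is made-up data rather than values from the dataset.

```python
from collections import defaultdict

def callback_rate_by(rows, factor):
    """Return {level: callback rate} for one column of the resume records."""
    calls = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        level = row[factor]
        totals[level] += 1
        calls[level] += row["call"]
    return {level: calls[level] / totals[level] for level in totals}

def rate_spread(rows, factor):
    """Largest minus smallest callback rate across the levels of a factor."""
    rates = callback_rate_by(rows, factor)
    return max(rates.values()) - min(rates.values())

# Toy records for illustration only; these are NOT values from the dataset.
toy_rows = [
    {"race": "w", "honors": 1, "call": 1},
    {"race": "w", "honors": 0, "call": 0},
    {"race": "b", "honors": 1, "call": 1},
    {"race": "b", "honors": 0, "call": 0},
    {"race": "w", "honors": 0, "call": 1},
    {"race": "b", "honors": 0, "call": 0},
]

spreads = {factor: rate_spread(toy_rows, factor) for factor in ("race", "honors")}
print(spreads)
```

A raw spread like this does not hold the other factors fixed, so it is only a screening step. A regression on all factors at once, for example a logistic regression of `call` on the resume attributes, would be a more defensible way to rank them.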