# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [23]:
import pandas as pd
import numpy as np
from scipy import stats
import math

In [15]:
# Load the resume data and count the resumes in each group
data = pd.read_stata('data/us_job_market_discrimination.dta')
print(len(data[data.race == 'b']))
print(len(data[data.race == 'w']))

2435
2435


# 1. What test is appropriate for this problem? Does CLT apply?

Since the objective of this analysis is to determine whether race has a significant impact on the callback rate, we should use a two-sample test to compare the callback rates of the white-sounding and black-sounding groups.

Each callback outcome is either 0 or 1, so each observation follows a Bernoulli distribution, and the callback rate of a group is the sample mean of these observations. Does the CLT still apply in this case? Let's examine the conditions for the CLT:

1. The values have to be drawn independently
2. The values have to come from the same distribution
3. The values have to be drawn from a distribution with finite mean and variance
4. The sample has to be large enough

Because identical resumes were randomly assigned black-sounding or white-sounding names, we can be reasonably confident that the observations are independent. For the second criterion, all the data come from a single experiment, so the observations within each group should come from the same distribution. Third, each callback is either 1 or 0, a Bernoulli variable with finite mean and variance. Lastly, each of the black-sounding and white-sounding groups contains 2435 resumes, a large enough sample for the distribution of the sample mean to be approximately normal. Therefore, the CLT applies to this case.
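The sample-size argument can be checked with a small simulation. The sketch below is illustrative only: the callback probability `p = 0.08` is an assumed value, not one taken from the dataset, and `n_repeats` is an arbitrary number of simulated experiments. It draws many groups of 2435 Bernoulli outcomes and compares the spread of the simulated callback rates with what the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2435           # resumes per group, as in the dataset
p = 0.08           # ASSUMED callback probability, for illustration only
n_repeats = 20000  # number of simulated experiments (arbitrary)

# Each simulated experiment: number of callbacks among n Bernoulli(p) resumes,
# converted to a callback rate (the sample mean of the 0/1 values).
rates = rng.binomial(n, p, size=n_repeats) / n

# CLT prediction for the standard deviation of the sample mean.
predicted_sd = np.sqrt(p * (1 - p) / n)

# Share of simulated rates within 1.96 predicted standard deviations of p;
# a normal sampling distribution would put roughly 95% of them there.
coverage = np.mean(np.abs(rates - p) <= 1.96 * predicted_sd)

print('Simulated sd of callback rate =', round(rates.std(), 5))
print('CLT-predicted sd =', round(predicted_sd, 5))
print('Share within 1.96 sd =', round(coverage, 3))
```

If the normal approximation is reasonable, the two standard deviations should be close and the share should be near 0.95.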

# 2. What are the null and alternate hypotheses?

The objective is to analyze whether race, as signaled by a black-sounding or white-sounding name, affects the callback rate. The null hypothesis is that the callback rate is the same for both groups, and the alternate hypothesis is that the callback rates of the two groups differ. The significance level for this test is set to 5% (a 95% confidence level).
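Written symbolically, the hypotheses could be stated as follows. The symbols are notation introduced here for illustration: $p_b$ and $p_w$ denote the true callback rates for black-sounding and white-sounding names, and $\alpha$ is the threshold that the p-value is compared against.

$$H_0: p_b = p_w \qquad\qquad H_1: p_b \neq p_w \qquad\qquad \alpha = 0.05$$

Because $H_1$ allows a difference in either direction, the test is two-sided, and $H_0$ is rejected when the two-sided p-value falls below $\alpha$.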

# 3. Compute margin of error, confidence interval, and p-value.

In [33]:
B = data[data.race == 'b']
W = data[data.race == 'w']

# Compute descriptive statistics of both the black-sounding and white-sounding groups
meanB = B.call.mean()
varB = B.call.var()
meanvarB = varB/len(B)
meanW = W.call.mean()
varW = W.call.var()
meanvarW = varW/len(W)

# Compute the difference of sample mean
meanDiff = meanB - meanW
meanvarDiff = meanvarB + meanvarW
stderrDiff = math.sqrt(meanvarDiff)

# Compute margin of error for a 95% confidence interval (two-tailed)
zscore = stats.norm.ppf(0.975)
marginError = zscore*stderrDiff
# Two-sided p-value, matching the two-sided alternate hypothesis
pvalue = 2*stats.norm.cdf(-abs(meanDiff/stderrDiff))

print('Difference of callback rate =', round(meanDiff,4))
print('Margin of error =',round(marginError,4))
# The confidence interval is centered on the observed difference, not on zero
print('95% Confidence interval =', round(meanDiff - marginError,4),' to ', round(meanDiff + marginError,4))
print('P-value =', round(pvalue,7))

Difference of callback rate = -0.032
Margin of error = 0.0153
95% Confidence interval = -0.0473  to  -0.0168
P-value = 3.88e-05


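From the test result, we can reject the null hypothesis. The callback rate of the black-sounding group is 3.2 percentage points lower than that of the white-sounding group. The margin of error is 1.53 percentage points at the 95% confidence level, so the 95% confidence interval for the difference (the observed difference plus or minus the margin of error) runs from about -4.7 to -1.7 percentage points and excludes zero. The two-sided p-value is about 0.004% (twice the one-tailed probability of roughly 0.002%), far below the 5% threshold, which implies a highly significant difference between the groups.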
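As a cross-check, the same question can be put to the textbook two-proportion z-test, which uses a pooled standard error under the null hypothesis. The sketch below is not the calculation used in this notebook: the function name `pooled_two_proportion_ztest` is made up here, and the counts in the demo call are hypothetical rather than taken from the dataset.

```python
import math
from scipy import stats

def pooled_two_proportion_ztest(calls_a, n_a, calls_b, n_b):
    """Two-sided z-test for equal proportions using the pooled standard error."""
    rate_a = calls_a / n_a
    rate_b = calls_b / n_b
    # Under the null hypothesis both groups share one callback rate,
    # estimated by pooling the callbacks of the two groups.
    pooled = (calls_a + calls_b) / (n_a + n_b)
    stderr = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / stderr
    # Two-sided p-value: probability of a |z| at least this large under the null
    pvalue = 2 * stats.norm.sf(abs(z))
    return z, pvalue

# HYPOTHETICAL counts, for illustration only (not taken from the dataset)
z, pvalue = pooled_two_proportion_ztest(60, 1000, 90, 1000)
print('z =', round(z, 3))
print('Two-sided p-value =', round(pvalue, 4))
```

With the notebook's data, the call would presumably look like `pooled_two_proportion_ztest(B.call.sum(), len(B), W.call.sum(), len(W))`. With equal group sizes of this magnitude, the pooled and unpooled versions should give very similar answers.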

# 4. Write a story describing the statistical significance in the context of the original problem.

How does race affect the callback rate for job applications?

Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

In the test results, we found that resumes with black-sounding names had a callback rate 3.2 percentage points lower than resumes with white-sounding names. The margin of error for this difference is 1.53 percentage points at the 95% confidence level, and the two-sided p-value is about 0.004%, implying a statistically significant difference in callback rate between the two groups.

# 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The analysis does not show that race/name is the most important factor in callback success. It only shows that race/name has a statistically significant impact on callback success. To verify whether race/name is actually the most important factor, we should compare its impact with that of other factors.
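One simple way to make such a comparison is to compute the same difference in callback rate, with its margin of error, for each factor in turn. The sketch below uses synthetic data only: the factor names `factor_a` and `factor_b`, their effect sizes, and the helper `rate_difference` are all invented for illustration and do not refer to columns in the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# SYNTHETIC 0/1 factors; names and effect sizes are invented for illustration.
factors = {
    'factor_a': rng.integers(0, 2, size=n),
    'factor_b': rng.integers(0, 2, size=n),
}
# By construction, factor_b shifts the callback probability more than factor_a.
prob = 0.05 + 0.02 * factors['factor_a'] + 0.06 * factors['factor_b']
call = (rng.random(n) < prob).astype(int)

def rate_difference(flag, outcome):
    """Callback rate where flag == 1 minus rate where flag == 0, with its standard error."""
    on, off = outcome[flag == 1], outcome[flag == 0]
    diff = on.mean() - off.mean()
    stderr = np.sqrt(on.var(ddof=1) / len(on) + off.var(ddof=1) / len(off))
    return diff, stderr

# Report each factor's difference with a 95% margin of error
for name, flag in factors.items():
    diff, stderr = rate_difference(flag, call)
    print(name, 'difference =', round(diff, 4), '+/-', round(1.96 * stderr, 4))
```

Comparing factors one at a time ignores any overlap between them, so a regression that includes several factors at once would be the more usual tool for ranking their importance.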