# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [14]:
import pandas as pd
import numpy as np
from scipy import stats
import math

In [15]:
data = pd.io.stata.read_stata('C:/springbord/Dsc_racial_disc/EDA_racial_discrimination/data/us_job_market_discrimination.dta')

In [16]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [17]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [18]:
data.head()

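| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Nonprofit |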


<div class="span5 alert alert-success">
<p>Q 1.What test is appropriate for this problem? Does CLT apply?</p>
</div>

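ANS: The appropriate test is a two-sample z-test for the difference between two proportions. Each callback is a Bernoulli trial (call or no call), so the number of callbacks in each group is binomial. The CLT can be applied to a binomial population provided that min(np, n(1-p)) > 10, and the cells below show that both groups satisfy this condition.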

In [19]:
# Number of observations for black-sounding and white-sounding names
data_b = data[data.race=='b']
data_w = data[data.race=='w']
num_b = len(data_b)
num_w = len(data_w)
print "Number of observations where race is b : ",num_b
print "Number of observations where race is w : ",num_w

Number of observations where race is b :  2435
Number of observations where race is w :  2435


In [20]:
# number of calls received by black-sounding and white-sounding names
b_success = len(data_b[data_b.call == 1])
w_success = len(data_w[data_w.call == 1])
print b_success, w_success

157 235


In [21]:
# proportion of black-sounding and white-sounding names that received a call
p_b = 1.0 * b_success/num_b
p_w = 1.0 * w_success/num_w
print "Proportion of black sounding names getting a callback : ",p_b
print "Proportion of white sounding names getting a callback : ",p_w

Proportion of black sounding names getting a callback :  0.064476386037
Proportion of white sounding names getting a callback :  0.0965092402464


In [22]:
# check the second condition for the central limit theorem: np > 10 and n(1-p) > 10
print num_b * p_b > 10
print num_b * (1-p_b) >10
print "---"
print num_w * p_w > 10
print num_w * (1-p_w)> 10

True
True
---
True
True


Q 2 : What are the null and alternate hypotheses?

Ans 2:

Null Hypothesis (H0: pw = pb): The callback rate is the same for résumés with white-sounding and black-sounding names, i.e. there is no racial discrimination in callbacks.

Alternate Hypothesis (H1: pw != pb): The callback rates differ between the two groups, i.e. there is some racial discrimination in callbacks.
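For reference, the same hypotheses can be written in terms of the difference in callback probabilities, together with the usual pooled z statistic for comparing two proportions. The symbols $x_w$, $x_b$ (callback counts), $n_w$, $n_b$ (sample sizes), and $\hat{p}$ (pooled proportion) are notation introduced here for illustration and do not appear in the original notebook.

$$
H_0: p_w - p_b = 0 \qquad\qquad H_1: p_w - p_b \neq 0
$$

$$
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}},
\qquad
\hat{p} = \frac{x_w + x_b}{n_w + n_b}
$$

Under $H_0$ and the CLT conditions checked above, $z$ is approximately standard normal, which is what makes a z-test applicable.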

Q 3: Compute margin of error, confidence interval, and p-value.

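ANS 3: Because both samples are large, we can use the z-statistic to place a confidence interval on the difference in callback rates, $p_w - p_b$. The margin of error is $Z_{\alpha/2} \cdot SE$, where $SE = \sqrt{p_w(1-p_w)/n_w + p_b(1-p_b)/n_b}$. For a 95% confidence interval, the z-value is 1.96.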

In [23]:
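z = 1.96  # critical value for a 95% confidence level
# unpooled standard error: each proportion is divided by its own sample size
margin = z * math.sqrt( ( p_w*(1-p_w) / num_w) + (p_b*(1-p_b)/num_b) )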

print "Margin of error = ", margin

Margin of error =  0.0152554063499


In [24]:
# margin already includes the factor z, so it is not multiplied by z again
print "The confidence interval is given by :", p_w-p_b-margin,"to", p_w-p_b+margin

The confidence interval is given by : 0.0167774478596 to 0.0472882605593
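The exercise also asks for a bootstrapping approach. The following is a minimal sketch of a percentile bootstrap interval for $p_w - p_b$, not part of the original analysis. It uses only the standard library and the callback counts printed earlier (157 of 2435 and 235 of 2435). The function name `bootstrap_diff_ci`, the number of replicates, and the seed are assumptions chosen for illustration.

```python
import random

def bootstrap_diff_ci(k_w, n_w, k_b, n_b, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for p_w - p_b, given callback counts k and sizes n."""
    rng = random.Random(seed)
    p_w, p_b = k_w / float(n_w), k_b / float(n_b)
    diffs = []
    for _ in range(reps):
        # Resampling n rows with replacement from a 0/1 column is equivalent
        # to drawing n Bernoulli(p_hat) values, so the raw rows are not needed.
        boot_w = sum(rng.random() < p_w for _ in range(n_w))
        boot_b = sum(rng.random() < p_b for _ in range(n_b))
        diffs.append(boot_w / float(n_w) - boot_b / float(n_b))
    diffs.sort()
    tail = (1.0 - level) / 2.0
    lower = diffs[int(tail * reps)]
    upper = diffs[int((1.0 - tail) * reps) - 1]
    return lower, upper

# Counts taken from the notebook output: white-sounding first, then black-sounding.
low, high = bootstrap_diff_ci(235, 2435, 157, 2435)
print("95% bootstrap CI for p_w - p_b: {0:.4f} to {1:.4f}".format(low, high))
```

An interval that excludes zero points the same way as the z-test. The exact endpoints depend on the seed and on the number of replicates.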


In [30]:
from statsmodels.stats.proportion import proportions_ztest as pz
stat, pval = pz(np.array([b_success,w_success]),np.array([num_b,num_w]),value=0)
print (stat,pval)
print('{0:0.3f}'.format(pval))

(-4.1084121524343464, 3.9838868375850767e-05)
0.000


The second value is the p-value, and it is much smaller than 0.05. Hence, we can reject the null hypothesis.
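To make the calculation behind the library call visible, here is a hedged sketch that recomputes a pooled two-proportion z-test by hand using only the standard library. The helper name `two_proportion_ztest` is hypothetical. The sketch assumes the pooled-variance form of the test and a two-sided alternative, and obtains the normal tail probability from `math.erfc`.

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Pooled two-sided z-test for H0: p1 == p2, given success counts and sizes."""
    p1, p2 = k1 / float(n1), k2 / float(n2)
    pooled = (k1 + k2) / float(n1 + n2)
    # Standard error under H0, where both groups share the pooled proportion.
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Counts from the notebook output: black-sounding first, then white-sounding.
z_stat, p_value = two_proportion_ztest(157, 2435, 235, 2435)
print("z = {0:.4f}, p = {1:.3g}".format(z_stat, p_value))
```

With the counts in the same order as in the cell above, this should agree with the reported statistic and p-value up to rounding.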

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Ans 4: Résumés with white-sounding names received a callback about 9.65% of the time, compared with about 6.45% for résumés with black-sounding names. The p-value of roughly 0.00004 is far below 0.05, so a gap this large would be very unlikely if employers responded to both groups at the same rate. We therefore reject the null hypothesis in favor of the alternative: there is some racial discrimination in callback rates for résumés.

ANS 5: No. The analysis shows that race/name has a statistically significant effect on the callback rate, but it does not compare that effect with any other factor, so it cannot establish that race/name is the most important one. To amend the analysis, we would compare the other features of the data set (for example education, years of experience, or honors) and see how strongly each one relates to callbacks and to the others. Further experiments would also help confirm the result.
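One simple way to start comparing features is to tabulate the callback rate for each value of a feature. The sketch below is illustrative only. The helper `callback_rate_by` is hypothetical, and the rows are toy values that are NOT taken from the study data. Only the column names `race`, `honors`, and `call` come from the data set.

```python
from collections import defaultdict

def callback_rate_by(rows, feature):
    """Return {feature value: callback rate} for a list of dict-like rows."""
    totals = defaultdict(lambda: [0, 0])  # value -> [callbacks, resumes]
    for row in rows:
        totals[row[feature]][0] += row["call"]
        totals[row[feature]][1] += 1
    return {value: calls / float(n) for value, (calls, n) in totals.items()}

# Toy rows for illustration only; they are made up, not real observations.
toy_rows = [
    {"race": "b", "honors": 0, "call": 0},
    {"race": "b", "honors": 0, "call": 0},
    {"race": "b", "honors": 1, "call": 1},
    {"race": "b", "honors": 1, "call": 0},
    {"race": "w", "honors": 0, "call": 0},
    {"race": "w", "honors": 0, "call": 1},
    {"race": "w", "honors": 1, "call": 1},
    {"race": "w", "honors": 1, "call": 0},
]

by_race = callback_rate_by(toy_rows, "race")
by_honors = callback_rate_by(toy_rows, "honors")
print("by race: {0}".format(sorted(by_race.items())))
print("by honors: {0}".format(sorted(by_honors.items())))
```

On the real data, the same helper could be fed `data.to_dict("records")`. Comparing the size of the gap across features gives only a first impression, and a model that includes all features at once would be needed to rank their importance.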