# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [4]:
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [6]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [7]:
data.head()


Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [8]:
black_df=data[data.race=='b']
white_df=data[data.race=='w']

data.groupby('race').call.sum()

race
b    157.0
w    235.0
Name: call, dtype: float32

In [24]:
# a notebook cell only displays its last expression, so show both sizes as a tuple
len(black_df), len(white_df)


(2435, 2435)

In [10]:
b_callback=len(black_df[black_df.call==1])
b_notcall=len(black_df[black_df.call==0])
w_callback=len(white_df[white_df.call==1])
w_notcall=len(white_df[white_df.call==0])

In [11]:
# Form a contingency table for a chi-squared test
contingency_table=pd.DataFrame({'black':{'called':b_callback,'not_called':b_notcall},
                       'white':{'called':w_callback,'not_called':w_notcall}})

In [12]:

contingency_table


Unnamed: 0,black,white
called,157,235
not_called,2278,2200


In [34]:
sum_black=contingency_table.iloc[0]['black']+contingency_table.iloc[1]['black']
sum_white=contingency_table.iloc[0]['white']+contingency_table.iloc[1]['white']

p_b_success=b_callback/sum_black
p_w_success=w_callback/sum_white
diff_p= abs(p_b_success-p_w_success)

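The CLT applies here because we have a large set of independent data points, and in both groups the success-failure condition holds: $n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$. Because race and callback are both binary categorical variables, a chi-squared test on the 2×2 contingency table is appropriate. The analysis below uses the equivalent two-proportion z-test, which also yields a confidence interval for the difference in callback rates.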
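
The success-failure condition cited above can be checked directly from the counts. The following is a minimal sketch: the counts are copied from the contingency table, and the helper name `success_failure_ok` and its `threshold` parameter are illustrative assumptions rather than part of the original analysis.

```python
def success_failure_ok(called, not_called, threshold=10):
    """Return True when both n*p_hat and n*(1 - p_hat) reach the threshold."""
    n = called + not_called
    p_hat = called / n
    expected_successes = n * p_hat          # equals the observed callback count
    expected_failures = n * (1 - p_hat)     # equals the observed non-callback count
    return expected_successes >= threshold and expected_failures >= threshold


# Counts copied from the contingency table (called, not_called)
groups = {"black": (157, 2278), "white": (235, 2200)}

for name, (called, not_called) in groups.items():
    print(name, success_failure_ok(called, not_called))
```

Both groups pass the check by a wide margin, so the normal approximation to the sampling distribution of each proportion is reasonable.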

The null hypothesis is that the callback rate is not affected by the name on the resume. The alternative hypothesis is that the callback rates of the two groups examined in the study differ.

$H_0: p_b = p_w$, $H_A: p_b \neq p_w$

The standard error of the difference in sample proportions is given by $SE = \sqrt{\frac{\hat{p}_b(1-\hat{p}_b)}{n_b} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}}$

We can use the z-statistic to place a confidence interval on this sample statistic. Hence, the margin of error is $Z_{\alpha/2} \times SE$. For a 95% confidence interval, the z-value is 1.96.

The confidence interval is then $|\hat{p}_b - \hat{p}_w| \pm Z_{\alpha/2} \times SE$. The absolute difference is used so that the interval is reported for $\hat{p}_w - \hat{p}_b$, which is positive in this sample.
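
The value 1.96 quoted above is the two-sided critical value of the standard normal distribution for $\alpha = 0.05$. As a small sketch, it can be obtained from `scipy.stats.norm.ppf` instead of being hard-coded; the helper name `z_critical` is an illustrative assumption.

```python
from scipy import stats


def z_critical(confidence):
    """Two-sided standard-normal critical value for a given confidence level."""
    alpha = 1 - confidence
    return stats.norm.ppf(1 - alpha / 2)


# Critical values for a few common confidence levels
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

Using the computed value makes it easy to rerun the interval at a different confidence level.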

In [35]:
import math
standard_error = math.sqrt( ( p_w_success*(1-p_w_success) / len(white_df)) + (p_b_success*(1-p_b_success)/len(black_df)) )

In [36]:
margin_error = 1.96* standard_error

In [37]:
lower_CI=diff_p-margin_error
higher_CI=diff_p+margin_error
print ("The confidence interval is given by :", lower_CI,"to", higher_CI)


The confidence interval is given by : 0.0167774478596 to 0.0472882605593


In [41]:
from statsmodels.stats.proportion import proportions_ztest as pz
pz(np.array([b_callback,w_callback]),np.array([len(black_df),len(white_df)]),value=0)

(-4.1084121524343464, 3.9838868375850767e-05)
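
The statistic and p-value returned by `proportions_ztest` can be reproduced by hand, which makes the calculation explicit. This is a sketch under the assumption that the default pooled-variance form of the test is in use; the counts are copied from the contingency table.

```python
import math

from scipy import stats

# Counts copied from the contingency table
b_called, n_b = 157, 2435
w_called, n_w = 235, 2435

p_b = b_called / n_b
p_w = w_called / n_w

# Under H0 both groups share one callback rate, so the counts are pooled
p_pool = (b_called + w_called) / (n_b + n_w)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z = (p_b - p_w) / se_pool
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(z, p_value)
```

Note that the test uses the pooled standard error, because it is computed under the null hypothesis, whereas the confidence interval above uses the unpooled standard error.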

Given a p-value of about 4e-05, well below 0.05, we can reject the null hypothesis.
This means that there is a statistically significant difference in callback rates: resumes with white-sounding names receive callbacks at a higher rate (about 9.7%) than otherwise identical resumes with black-sounding names (about 6.4%).
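
As a cross-check, the chi-squared test named earlier can be run on the contingency table itself. This is a sketch using `scipy.stats.chi2_contingency` with the counts copied from the table. Turning off the continuity correction is a choice made here so that the result lines up with the z-test: for a 2×2 table the uncorrected chi-squared statistic equals the square of the pooled z statistic.

```python
import numpy as np
from scipy import stats

# Rows: called / not called; columns: black / white (copied from the contingency table)
table = np.array([[157, 235],
                  [2278, 2200]])

# correction=False disables the Yates continuity correction
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print("chi2 =", chi2)
print("p-value =", p_value)
print("degrees of freedom =", dof)
```

The two tests give the same p-value, so the conclusion does not depend on which of them is used.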