# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [53]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [6]:
len(data)

4870

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [5]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


### 1. What test is appropriate for this problem? Does CLT apply?

Here we want to test whether the callback rate for black-sounding names is statistically different from the callback rate for white-sounding names. A callback rate is a probability, which can also be seen as the expectation of a Bernoulli random variable (1 if called, 0 if not called). The number of callbacks in each group is binomially distributed, so the callback rate is a scaled binomial. With thousands of independent resumes and more than a hundred callbacks in each group, the CLT applies, and the callback rates can be modeled as approximately normal. A two-sample z-test for the difference between two proportions is therefore appropriate.
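
The CLT claim above can be checked with the usual rule of thumb that each group should contain at least about 10 successes and 10 failures. The sketch below is illustrative only: the callback counts come from the notebook output, while the group size of 2435 is an assumption implied by the 4870 total and the callback rates the notebook reports, and the threshold of 10 is a common convention rather than a hard rule.

```python
# Callback counts are taken from the notebook output (157 and 235).
# Assumption: the 4870 resumes split evenly into two groups of 2435.
groups = {"b": (157, 2435), "w": (235, 2435)}


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: both successes and failures should reach `threshold`."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


checks = {race: normal_approx_ok(k, n) for race, (k, n) in groups.items()}
print(checks)
```

Both groups clear the threshold by a wide margin, which supports using the normal approximation.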

### 2. What are the null and alternate hypotheses?

**H0:** Black-sounding and white-sounding names have the same callback rate.

**H1:** Black-sounding and white-sounding names have different callback rates.

### 3. Compute margin of error, confidence interval, and p-value.

In [19]:
# callback rate for black-sounding names
mb = sum(data[data.race=='b'].call)/len(data[data.race=='b'].call)
mb

0.064476386036960986

In [20]:
# callback rate for white-sounding names
mw = sum(data[data.race=='w'].call)/len(data[data.race=='w'].call)
mw

0.096509240246406572

In [22]:
# standard error of the callback rate for black-sounding names
sb = data[data.race=='b'].call.std()/np.sqrt(len(data[data.race=='b'].call))
sb

0.0049781310543264307

In [23]:
# standard error of the callback rate for white-sounding names
sw = data[data.race=='w'].call.std()/np.sqrt(len(data[data.race=='w'].call))
sw

0.005985230735411586

In [31]:
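# The two callback rates are several standard errors apart, so we seem set to reject the null.
# Caveat: this divides by sw alone. The standard error of the difference is
# np.sqrt(sb**2 + sw**2), which gives a smaller statistic of roughly 4.1.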
abs(mb - mw)/sw

5.3519831775111646
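
The statistic above divides the gap by the standard error of one group only. A minimal sketch of the standard two-proportion z-statistic follows, using the standard error of the difference instead. It works from the counts in the notebook output rather than the DataFrame; the group size of 2435 is an assumption implied by the 4870 total and the reported rates, and the variable names are illustrative.

```python
import math

# Counts from the notebook output; group sizes assume an even 2435/2435 split.
k_b, k_w = 157, 235
n_b, n_w = 2435, 2435

p_b, p_w = k_b / n_b, k_w / n_w
diff = p_w - p_b

# Unpooled standard error: each group keeps its own variance estimate.
se_unpooled = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# Pooled standard error: under H0 both groups share one callback rate.
p_pool = (k_b + k_w) / (n_b + n_w)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z_unpooled = diff / se_unpooled
z_pooled = diff / se_pooled
print(z_unpooled, z_pooled)
```

Both versions come out at roughly 4.1, smaller than the 5.35 shown above but still far into the tail of the standard normal. The pooled form is the conventional choice for the hypothesis test, and the unpooled form for a confidence interval on the difference.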

In [35]:
# Let's calculate 99% confidence intervals.
# With more than 100 callbacks for both black- and white-sounding names, it is safe to assume
# that each callback rate (the mean of thousands of iid Bernoulli variables) is approximately normal by the CLT.
# 99% interval of the standard normal:
stats.norm.interval(.99)

(-2.5758293035489004, 2.5758293035489004)

In [39]:
# To get the confidence interval for the callback rate of black-sounding names, scale the standard normal interval by sb and shift it by mb
conf_b = np.array(stats.norm.interval(.99))*sb + mb
conf_b

array([ 0.05165357,  0.0772992 ])

In [40]:
# same construction for the callback rate of white-sounding names
conf_w = np.array(stats.norm.interval(.99))*sw + mw
conf_w

array([ 0.08109231,  0.11192617])

In [41]:
# margin of error (99%) for the black-sounding callback rate
merror_b = (conf_b[1] - conf_b[0])/2
merror_b

0.0128228158466408

In [42]:
# margin of error (99%) for the white-sounding callback rate
merror_w = (conf_w[1] - conf_w[0])/2
merror_w

0.015416932716774703

In [47]:
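# p-value. Caveats: this is one-sided and uses sw alone as the standard error;
# the two-sided alternative in section 2 calls for doubling the tail probability.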
1 - stats.norm.cdf(abs(mb - mw)/sw)

4.3497744073306421e-08
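
Because the alternative hypothesis is two-sided, a two-sided p-value is the natural companion to it. The sketch below computes that p-value, together with a 99% confidence interval and margin of error for the difference in callback rates. It uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm`, purely so that it runs on its own. The group size of 2435 is an assumption implied by the 4870 total and the reported rates.

```python
import math
from statistics import NormalDist

# Counts from the notebook output; group sizes assume an even 2435/2435 split.
k_b, k_w = 157, 235
n_b, n_w = 2435, 2435

p_b, p_w = k_b / n_b, k_w / n_w
diff = p_w - p_b

# Standard error of the difference between the two rates (unpooled).
se_diff = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)
z = diff / se_diff

std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided p-value: probability of a gap at least this large in either direction.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

# 99% confidence interval and margin of error for the difference.
z_crit = std_normal.inv_cdf(0.995)
margin_diff = z_crit * se_diff
ci_diff = (diff - margin_diff, diff + margin_diff)

print(p_two_sided, margin_diff, ci_diff)
```

The two-sided p-value is on the order of 1e-5, and the 99% interval for the difference stays above zero, so the conclusion (reject the null) does not change.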

### 4. Write a story describing the statistical significance in the context of the original problem.



<div class="span5 alert alert-info">

### White-sounding names are 50% more likely to get an interview call than black-sounding names

A study conducted by Marianne Bertrand and Sendhil Mullainathan in Chicago and Boston between 2000 and 2002 indicates that racial discrimination exists in the job market in the United States. The study covered 4870 resumes. Resumes that were identical in all other respects were randomly assigned white-sounding or black-sounding names and sent to employers in relevant industries, and every interview request was recorded. About 6.45% of resumes with black-sounding names received an interview call, compared with about 9.65% of resumes with white-sounding names. In other words, for otherwise identical resumes, a white-sounding name alone made the chance of being called for an interview 49.7% higher!


To make sure the observation could not be attributed to chance, a statistical test was conducted. The margin of error around the 6.45% callback rate for black-sounding names was estimated at 1.28 percentage points with 99% confidence. The margin of error around the 9.65% callback rate for white-sounding names was estimated at 1.54 percentage points with 99% confidence. The two 99% confidence intervals (5.17% to 7.73% and 8.11% to 11.19%) do not overlap, so the gap between the two rates lies well outside the margin of error of both measurements.

</div>
****
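
The "chance" check in the story above relies on the normal approximation. As a cross-check, a permutation test asks the same question without that assumption: if names had no effect, how often would a random split of the 392 callbacks across two groups produce a gap at least as large as the observed one? This is a minimal sketch, not part of the original analysis. The function name, the number of permutations, and the seed are all illustrative choices, and the group size of 2435 is an assumption implied by the 4870 total and the reported rates.

```python
import random


def permutation_p_value(k_b, k_w, n_b, n_w, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in callback rates."""
    rng = random.Random(seed)
    total_calls = k_b + k_w
    n_total = n_b + n_w
    observed = abs(k_w / n_w - k_b / n_b)
    extreme = 0
    for _ in range(n_perm):
        # Treat resumes 0..total_calls-1 as the ones that got a callback,
        # then draw a random set of n_b resumes to play the "b" group.
        group_b = rng.sample(range(n_total), n_b)
        calls_b = sum(1 for i in group_b if i < total_calls)
        calls_w = total_calls - calls_b
        if abs(calls_w / n_w - calls_b / n_b) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


p_perm = permutation_p_value(157, 235, 2435, 2435)
print(p_perm)
```

With only 1000 permutations the estimate cannot go below about 0.001, so it gives an upper bound on the p-value rather than a precise figure. More permutations tighten the bound at the cost of run time.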

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

No, it does not. We have only compared the callback rates of black-sounding and white-sounding names; we have not compared race/name with any other factor.

To find the most important factor, we would have to analyze how callback rates vary with the other factors in the data (for example education, years of experience, or honors) and compare the size of those effects with the effect of race/name.
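
A first step toward that comparison could be to compute, for each factor, the gap between the highest and lowest callback rate across its values. The sketch below shows the idea on plain Python records. The helper names `callback_rate_by` and `rate_spread` are hypothetical, and the toy rows are made up purely to exercise the code; they are not from the study. Raw gaps also ignore confounding and differences in scale between factors, so a regression on all factors would be a more careful follow-up.

```python
from collections import defaultdict


def callback_rate_by(rows, factor):
    """Return {factor value: callback rate} for an iterable of dict-like rows."""
    calls = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        calls[row[factor]] += row["call"]
        counts[row[factor]] += 1
    return {value: calls[value] / counts[value] for value in counts}


def rate_spread(rows, factor):
    """Gap between the highest and lowest callback rate across factor values."""
    rates = callback_rate_by(rows, factor)
    return max(rates.values()) - min(rates.values())


# Made-up toy rows, NOT data from the study; only the column names match.
toy_rows = [
    {"race": "b", "honors": 0, "call": 0},
    {"race": "b", "honors": 1, "call": 1},
    {"race": "w", "honors": 0, "call": 1},
    {"race": "w", "honors": 1, "call": 1},
    {"race": "b", "honors": 0, "call": 0},
    {"race": "w", "honors": 0, "call": 0},
]

spreads = {factor: rate_spread(toy_rows, factor) for factor in ("race", "honors")}
print(spreads)
```

On the real dataset, the rows could be obtained with `data.to_dict('records')`, and the same comparison could be run over every column of interest.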