# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [4]:
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
%time data = pd.io.stata.read_stata('/users/youcefdjeddar/downloads/EDA_racial_discrimination/data/us_job_market_discrimination.dta')

CPU times: user 113 ms, sys: 10.6 ms, total: 124 ms
Wall time: 183 ms


In [6]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [8]:
w = data[data.race=='w']
b = data[data.race=='b']

In [9]:
w.head(10)

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
5,b,1,4,2,6,1,0,0,0,266,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
6,b,1,4,2,5,0,1,0,0,13,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
11,b,1,4,4,8,0,0,0,0,316,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
13,b,1,4,2,4,0,0,0,0,21,...,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,Private
15,b,1,1,3,4,0,0,0,0,316,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
16,b,1,4,3,5,0,1,0,0,268,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
18,b,1,4,2,6,1,1,0,0,266,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

In [10]:
b.head(10)

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
7,b,1,3,4,21,0,1,0,1,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
8,b,1,4,3,3,0,0,0,0,316,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
9,b,1,4,2,6,0,1,0,0,263,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
10,b,1,4,4,8,0,1,0,1,379,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
12,b,1,4,4,4,0,0,0,1,27,...,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,Private
14,b,1,4,2,5,0,1,0,0,263,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
17,b,1,4,3,6,0,0,0,0,267,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
19,b,1,2,2,8,0,0,0,1,265,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private


Question 1

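A two-sample z-test for the difference between two proportions is appropriate. Each callback is a Bernoulli outcome, so the standard deviation follows directly from the callback proportion.
The CLT applies because the samples are independent (race is assigned randomly) and both groups are large, with many callbacks and many non-callbacks in each.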
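
A common rule of thumb for applying the CLT to proportions is at least 10 successes and 10 failures in each group. The check below is a minimal sketch of that rule; `clt_conditions_met` is a hypothetical helper, and the counts are inferred from the rates printed in this notebook under the assumption of 2,435 résumés per group, so treat them as illustrative.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Return True if both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Illustrative counts (assumption): 2,435 resumes per group,
# 157 callbacks for black-sounding names and 235 for white-sounding names.
black_ok = clt_conditions_met(successes=157, n=2435)
white_ok = clt_conditions_met(successes=235, n=2435)
print(black_ok, white_ok)
```

If either check failed, an exact or resampling-based test would be a safer choice than the normal approximation.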

In [12]:
# Black Rate of Callback
n1 = len(data[data.race=='b']) # number of black-sounding names
sum_b = sum(data[data.race=='b'].call) # number of callbacks for black-sounding names

rate_b = sum_b / n1

# White Rate of Callback
n2 = len(data[data.race=='w']) # number of white-sounding names
sum_w = sum(data[data.race=='w'].call) # number of callbacks for white-sounding names

rate_w = sum_w / n2

print ("The black rate of callback is", rate_b)
print ("The white rate of callback is", rate_w)

The black rate of callback is 0.06447638603696099
The white rate of callback is 0.09650924024640657


Question 2: 

Null Hypothesis (Ho): rate_w - rate_b = 0

Alternative Hypothesis (Ha): rate_w - rate_b != 0



In [16]:
# Question 3 (margin of error):

p = rate_w - rate_b

std = np.sqrt((rate_w * (1-rate_w) / n2) + (rate_b * (1-rate_b)/n1))

z_score = 1.96

# with alpha = .05
# For a 95% confidence level the critical z-value is 1.96: about 95% of a
# normal sampling distribution lies within 1.96 standard errors of its mean.

In [18]:
ME = z_score * std

lower = p - ME
upper = p + ME

print ('Margin of Error: ', ME)
print ('Confidence Interval', [lower, upper])

Margin of Error:  0.015255406349886438
Confidence Interval [0.016777447859559147, 0.047288260559332024]
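
The two cells above implement the standard formulas for the margin of error and confidence interval of a difference between two independent proportions. The notation is introduced here for illustration only: $\hat{p}_w$ and $\hat{p}_b$ correspond to `rate_w` and `rate_b`, $n_w$ and $n_b$ to `n2` and `n1`, and $z^{*}$ to `z_score`.

$$
\mathrm{ME} = z^{*}\sqrt{\frac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \frac{\hat{p}_b(1-\hat{p}_b)}{n_b}},
\qquad
\mathrm{CI} = (\hat{p}_w - \hat{p}_b) \pm \mathrm{ME}
$$

With the values printed above this is roughly $0.032 \pm 0.015$.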


In [20]:
# Recompute the standard error, because we now assume the null hypothesis is true.
# Under the null, p1 = p2, so we estimate the common proportion with p_hat:
# the proportion of callbacks in the whole sample, disregarding race.

p = rate_w - rate_b  # observed difference in callback rates

p_hat = (sum_b + sum_w) / (n1 + n2)

std = np.sqrt(2 * p_hat * (1 - p_hat) / n1)  # n1 == n2, so divide by either

# Calculate z-score:
# how many standard errors the observed difference lies from 0
z_score = (p - 0) / std

round(z_score, 3)

4.108
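
Question 3 also asks for a p-value. The sketch below converts a z statistic into a two-sided p-value, assuming a standard normal null distribution; `two_sided_p_value` is a hypothetical helper name, and the value 1.96 is used only to illustrate the familiar 5% threshold.

```python
from scipy import stats

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    # sf is the survival function, 1 - cdf, which is more accurate in the tail
    return 2 * stats.norm.sf(abs(z))

# Illustration: the critical value 1.96 corresponds to p of about 0.05
example_p = two_sided_p_value(1.96)
print(round(example_p, 3))
```

Passing the notebook's `z_score` in place of 1.96 gives the p-value for the observed difference in callback rates.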

We are 95% confident that the true difference between the callback rates for white-sounding and black-sounding names lies between .017 and .047.

If the null hypothesis were true, the probability of observing a z-score this large would be very small. This means that the effect we see (a difference in the proportion of callbacks between white-sounding and black-sounding names) is statistically significant.

Therefore, race has a statistically significant impact on the rate of callbacks for resumes.
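
The exercise also asks for a resampling approach. The sketch below uses a permutation test, which is a close relative of the bootstrap and is named here as a substitute, not as the method used in this notebook. The function name `permutation_p_value` and the arrays are assumptions for illustration: the arrays are rebuilt from counts inferred from the printed rates (2,435 résumés per group, 235 and 157 callbacks).

```python
import numpy as np

def permutation_p_value(calls_a, calls_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean callback rate."""
    rng = np.random.default_rng(seed)
    observed = calls_a.mean() - calls_b.mean()
    pooled = np.concatenate([calls_a, calls_b])
    n_a = len(calls_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the race labels by permuting the pooled outcomes
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_permutations

# Illustrative data (assumption): 1 = callback, 0 = no callback
calls_w = np.concatenate([np.ones(235), np.zeros(2200)])
calls_b = np.concatenate([np.ones(157), np.zeros(2278)])

perm_p = permutation_p_value(calls_w, calls_b)
print(perm_p)
```

In the notebook itself, `w.call.values` and `b.call.values` could be passed directly. A small permutation p-value would agree with the z-test conclusion without relying on the normal approximation.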

Question 5:

The analysis only looked at the relationship between black/white sounding names and callbacks. There are a number of other factors that should be taken into consideration.

We must test the significance of these factors for the callback rate to determine which one best predicts callback success.

The next step in this analysis would be to bring in the additional columns in the table and test them for significance and correlation.
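
One possible way to start on that next step is a chi-square test of independence between each candidate column and `call`. This is a minimal sketch, not a full analysis: `callback_association` is a hypothetical helper, and the small `toy` DataFrame is made-up data standing in for the real table (the column name `honors` is borrowed from the dataset only as an example).

```python
import pandas as pd
from scipy import stats

def callback_association(df, factor, outcome="call"):
    """Chi-square test of independence between a factor and the outcome."""
    table = pd.crosstab(df[factor], df[outcome])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    return chi2, p_value

# Made-up data for illustration only: 200 rows, two binary columns
toy = pd.DataFrame({
    "honors": [0, 0, 0, 0, 1, 1, 1, 1] * 25,
    "call":   [0, 0, 0, 1, 0, 1, 1, 1] * 25,
})

chi2, p_value = callback_association(toy, "honors")
print(chi2, p_value)
```

Running the helper over several columns of `data` would show which factors are associated with callbacks, although comparing their relative importance would call for a joint model such as logistic regression.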