# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
<h3> Exercises </h3>
<p>You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.</p>

<p>Answer the following questions <b>in this notebook below and submit to your Github account.</b></p>

<ol>
   <li>  What test is appropriate for this problem? Does CLT apply?
   <li> What are the null and alternate hypotheses?
   <li> Compute margin of error, confidence interval, and p-value.
   <li> Write a story describing the statistical significance in the context of the original problem.
   <li> Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
</ol>

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
   

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****
</div>


In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.api as sms

In [2]:
# pd.read_stata is the public entry point for reading Stata .dta files
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


#### Solutions ####
1) What test is appropriate for this problem? Does CLT apply?

In [7]:
# check the number of resumes for each race: data_b, data_w
# .loc selects rows and columns in one step instead of chained indexing
data_b = data.loc[data['race'] == 'b', ['race', 'call']]
data_w = data.loc[data['race'] == 'w', ['race', 'call']]
print("Black resumes: ", len(data_b))
print("White resumes: ", len(data_w))

Black resumes:  2435
White resumes:  2435


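**Sol-1)** A **two-sample** test is appropriate because we are comparing callback rates between two independent groups. Since `call` is binary, each group mean is a proportion, so a two-sample z-test for proportions applies; with samples this large, the two-sample t-test used below gives nearly the same result.
The total sample size is 4870 (2435 black-sounding and 2435 white-sounding resumes), and each group contains far more than 10 callbacks and 10 non-callbacks, so the CLT applies to the sample proportions.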
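
For a binary outcome such as `call`, a common rule of thumb for the normal approximation is the success-failure condition: each group should contain at least about 10 callbacks and 10 non-callbacks. The check can be sketched as follows. The helper name `clt_ok` and the threshold of 10 are illustrative assumptions. The values 157 and 2435 come from the notebook output, while the count of 235 callbacks for white-sounding names is assumed here (it is consistent with the callback rate of about 0.097 printed in the notebook).

```python
def clt_ok(successes, n, threshold=10):
    """Success-failure rule of thumb for a sample proportion (hypothetical helper)."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

n_b, calls_b = 2435, 157   # black-sounding names: counts from the notebook output
n_w, calls_w = 2435, 235   # white-sounding names: callback count is an assumption

print("CLT condition met (b):", clt_ok(calls_b, n_b))
print("CLT condition met (w):", clt_ok(calls_w, n_w))
```

Both groups pass the check by a wide margin, which supports treating the sample proportions as approximately normal.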


2) What are the null and alternate hypotheses?

**Sol-2)** Hypotheses
1. $H_0$: The mean callback rate is the same for both groups, $p_w - p_b = 0$.
2. $H_A$: The mean callback rates of the two groups differ, $p_w - p_b \neq 0$.

3) Compute margin of error, confidence interval, and p-value.

**Sol-3)** Computation as below:

In [8]:
# Compute the summary statistics behind the t-test: group means and variances.
# Split the data by race: array_data_b, array_data_w
array_data_b = data[data['race'] == 'b']
array_data_w = data[data['race'] == 'w']

# mean callback rate of each group: mean_call_b, mean_call_w
mean_call_b = array_data_b.call.mean()
mean_call_w = array_data_w.call.mean()

# sample variance of each group: var_call_b, var_call_w
var_call_b = array_data_b.call.var()
var_call_w = array_data_w.call.var()


# print all the variables calculated above
print("The mean callback rate for black-sounding names is ", round(mean_call_b, 3))
print("The mean callback rate for white-sounding names is ", round(mean_call_w, 3))
print("The difference of means = ", round((mean_call_w - mean_call_b), 3))
print('--------------------------------------------------')
print("The variance for black-sounding names is ", round(var_call_b, 3))
print("The variance for white-sounding names is ", round(var_call_w, 3))
print("The average variance = ",(var_call_b + var_call_w)/2) # equals the pooled variance because the groups are the same size

The mean callback rate for black-sounding names is  0.064
The mean callback rate for white-sounding names is  0.097
The difference of means =  0.032
--------------------------------------------------
The variance for black-sounding names is  0.06
The variance for white-sounding names is  0.087
The average variance =  0.0737863127142191


In [9]:
from scipy.stats import ttest_ind
print(ttest_ind(array_data_w.call, array_data_b.call, equal_var=True))

Ttest_indResult(statistic=4.1147052908617514, pvalue=3.9408021031288859e-05)
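
Because `call` is binary, the same comparison can also be framed as a two-proportion z-test. The following is a minimal sketch under stated assumptions, not the notebook's own method: the function name `two_proportion_ztest` is hypothetical, the count of 235 callbacks for white-sounding names is assumed (157 and 2435 come from the notebook output), and the two-sided p-value is obtained from the standard normal tail via `math.erfc`.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one rate, so pool the counts to estimate it.
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# white-sounding: 235 callbacks (assumed) of 2435; black-sounding: 157 of 2435
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.3f}, p-value = {p_value:.2e}")
```

With groups this large, the z statistic should land very close to the t statistic reported by `ttest_ind`, so both tests lead to the same conclusion.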


In [57]:
cm = sms.CompareMeans(sms.DescrStatsW(array_data_w.call), sms.DescrStatsW(array_data_b.call))

# margin of error: critical value times the standard error of the difference in means
n_b, n_w = len(array_data_b), len(array_data_w)
std_err = np.sqrt(var_call_b / n_b + var_call_w / n_w)
MOE = 1.96 * std_err # 1.96 is the critical value for 95% confidence

# calculation of confidence interval
conf_int = cm.tconfint_diff(usevar = 'pooled')
print('Confidence Interval = ', conf_int)
print("Margin of error = ", round(MOE, 4))

Confidence Interval =  (0.016770799977396562, 0.047294908441494608)
Margin of error =  0.0153
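
The margin of error for a difference in proportions is the critical value multiplied by the standard error of the difference, and the confidence interval is the observed difference plus or minus that margin. A minimal sketch working from counts alone follows. The function name `diff_proportion_ci` is hypothetical, the count of 235 callbacks for white-sounding names is assumed, and the unpooled standard error with a fixed critical value of 1.96 is one common choice rather than the only one.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Difference in proportions with margin of error and CI (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: each group keeps its own estimated rate.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    moe = z_crit * se
    diff = p1 - p2
    return diff, moe, (diff - moe, diff + moe)

# white-sounding: 235 callbacks (assumed) of 2435; black-sounding: 157 of 2435
diff, moe, ci = diff_proportion_ci(235, 2435, 157, 2435)
print(f"difference = {diff:.4f}, margin of error = {moe:.4f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The half-width of the interval returned by `tconfint_diff` should be very close to this margin of error, since the t critical value with several thousand degrees of freedom is nearly 1.96.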


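4) Write a story describing the statistical significance in the context of the original problem.

**Sol-4): Story and Explanation**

The p-value (about 0.00004) is far below 0.05, so we reject the null hypothesis: the mean callback rates of the two groups differ. Resumes with white-sounding names received callbacks about 3.2 percentage points more often (9.7% versus 6.4%), and the 95% confidence interval for the difference, roughly 1.7 to 4.7 percentage points, excludes zero. Because the names were assigned to the resumes at random, the two groups are otherwise comparable, so this gap is evidence of racial discrimination against job seekers with similar backgrounds. However, this analysis does not examine the many other factors in the dataset, so it says nothing about how they affect callbacks. Name-based racial discrimination cannot be justified.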

5) Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

**Sol-5:** 

Not necessarily. The analysis above indicates that the race suggested by a name has a significant effect on the callback rate. However, it does not tell us whether other variables are also significant or whether race is the most important factor. To understand the relationship between callbacks and the other variables, we could fit a regression model (for example, a logistic regression of `call` on race and the other resume characteristics) as further analysis.
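
One way to carry out such a follow-up is a logistic regression of `call` on race plus other resume characteristics. The sketch below only illustrates the mechanics with `statsmodels`: it runs on synthetic stand-in data, not the study data, and the data-generating coefficients are arbitrary assumptions chosen so the example executes. The column names `race`, `call`, and `yearsexp` mirror columns in the dataset; with the real data, `df` would simply be replaced by `data` and more covariates added to the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the study data); sizes mimic the real sample.
rng = np.random.default_rng(0)
n = 4870
df = pd.DataFrame({
    "race": rng.choice(["b", "w"], size=n),
    "yearsexp": rng.integers(1, 21, size=n),
})

# Assumed data-generating process, chosen only for illustration.
linear = -3.0 + 0.4 * (df["race"] == "w") + 0.03 * df["yearsexp"]
prob = 1.0 / (1.0 + np.exp(-linear))
df["call"] = (rng.random(n) < prob).astype(int)

# Logistic regression: log-odds of a callback as a function of race and experience.
result = smf.logit("call ~ C(race) + yearsexp", data=df).fit(disp=0)
print(result.params)
print(result.pvalues)
```

Comparing the coefficients and p-values across covariates (ideally after putting them on comparable scales) gives a first indication of which factors matter most, which the two-sample test alone cannot provide.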