# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
(data[data.race=='w'].call).value_counts()

0.0    2200
1.0     235
Name: call, dtype: int64

In [4]:
# number of callbacks for black-sounding names
(data[data.race=='b'].call).value_counts()

0.0    2278
1.0     157
Name: call, dtype: int64

In [5]:
data.head()

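|   | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |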


In [6]:
list(data)

['id',
 'ad',
 'education',
 'ofjobs',
 'yearsexp',
 'honors',
 'volunteer',
 'military',
 'empholes',
 'occupspecific',
 'occupbroad',
 'workinschool',
 'email',
 'computerskills',
 'specialskills',
 'firstname',
 'sex',
 'race',
 'h',
 'l',
 'call',
 'city',
 'kind',
 'adid',
 'fracblack',
 'fracwhite',
 'lmedhhinc',
 'fracdropout',
 'fraccolp',
 'linc',
 'col',
 'expminreq',
 'schoolreq',
 'eoe',
 'parent_sales',
 'parent_emp',
 'branch_sales',
 'branch_emp',
 'fed',
 'fracblack_empzip',
 'fracwhite_empzip',
 'lmedhhinc_empzip',
 'fracdropout_empzip',
 'fraccolp_empzip',
 'linc_empzip',
 'manager',
 'supervisor',
 'secretary',
 'offsupport',
 'salesrep',
 'retailsales',
 'req',
 'expreq',
 'comreq',
 'educreq',
 'compreq',
 'orgreq',
 'manuf',
 'transcom',
 'bankreal',
 'trade',
 'busservice',
 'othservice',
 'missind',
 'ownership']

### The test
To test whether race is a factor in call-backs from employers, we apply a two-sample t-test to the difference between the means of the call variable, which is 1 if the employer called back and 0 otherwise. The mean of such a binary variable is the call-back rate: 0.0965 for white-sounding names and 0.0645 for black-sounding names. In other words, 9.65% of résumés with white-sounding names received a call-back, compared with only 6.45% of résumés with black-sounding names.

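Our null hypothesis is that the difference between the mean call-back rates of the two groups is 0; the alternative hypothesis is that it is not. The central limit theorem applies because the samples are large: 4,870 résumés were sent, split evenly by race (2,435 each), so the sampling distribution of the difference in means is approximately normal and a two-sample t-test is appropriate.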
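
The claim that the CLT applies can be sanity-checked with the usual rule of thumb for proportions: each group should contain at least about 10 call-backs and 10 non-call-backs. The sketch below uses only the counts shown in the earlier cell outputs; the helper name `clt_rule_of_thumb` and the threshold of 10 are assumptions made for illustration, not part of the original analysis.

```python
# Counts taken from the value_counts() outputs earlier in the notebook.
n_white, calls_white = 2435, 235
n_black, calls_black = 2435, 157


def clt_rule_of_thumb(n, successes, threshold=10):
    """Return True if both successes and failures reach the threshold.

    This is a common heuristic for treating a sample proportion as
    approximately normal; the threshold of 10 is a convention.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


print('white group passes:', clt_rule_of_thumb(n_white, calls_white))
print('black group passes:', clt_rule_of_thumb(n_black, calls_black))
```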

In [7]:
white = data[data.race=='w'].call
print('Number of "white" resumes: ', len(white), ' Number of call-backs: ', int(white.sum()), ' call-back rate: ', white.mean())
black = data[data.race=='b'].call
print('Number of "black" resumes: ', len(black), ' Number of call-backs: ', int(black.sum()), ' call-back rate: ', black.mean())

Number of "white" resumes:  2435  Number of call-backs:  235  call-back rate:  0.09650924056768417
Number of "black" resumes:  2435  Number of call-backs:  157  call-back rate:  0.0644763857126236


In [8]:
stats.ttest_ind(white, black)

Ttest_indResult(statistic=4.1147052908617514, pvalue=3.9408021031288859e-05)
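
As a cross-check on the t-test above, a two-proportion z-test is a common alternative when the outcome is binary. The following is a minimal sketch that uses only the counts reported earlier and the Python standard library, so it runs without the dataset; the function name `two_proportion_ztest` is illustrative and is not a SciPy function.

```python
from math import erfc, sqrt


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    # Under the null hypothesis both groups share one call-back rate.
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# 235 of 2435 white-sounding and 157 of 2435 black-sounding names got a call.
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print('z = {:.2f}, p-value = {:.1e}'.format(z, p_value))
```

With samples this large, the z statistic should land very close to the t statistic reported by `stats.ttest_ind`.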

In [9]:
# significance level
alpha = .05
# degrees of freedom
dof = len(white)+len(black)-2
# critical value for t-distribution
t_crit = stats.t.ppf(1-alpha/2, dof)
# standard error of the difference in means: sqrt(s_w^2/n_w + s_b^2/n_b)
se = np.sqrt(white.var()/len(white) + black.var()/len(black))
# margin of error
moe = t_crit * se
# confidence interval
ci = stats.t.interval(1-alpha, dof, loc=white.mean()-black.mean(), scale=se)
print('Margin of error: {:.4f}'.format(moe))
print('95% confidence interval: ({:.4f}, {:.4f})'.format(ci[0], ci[1]))

Margin of error: 0.0153
95% confidence interval: (0.0168, 0.0473)


The test gives a t statistic of 4.11 with a p-value of 4e-05 (0.00004), indicating that the 3.2 percentage point difference in call-back rates is very unlikely to be due to sampling randomness alone. Although the difference might seem small, it is large relative to the call-back rates themselves. White-sounding names are 50% more likely to be called back, as (9.65-6.45)/6.45 = .50; equivalently, black-sounding names are 33% less likely to be called back, as (6.45-9.65)/9.65 = -.33.

In addition, the 95% confidence interval, (0.017, 0.047), does not include 0, so we reject the null hypothesis at the 5% significance level. Stated differently: we are 95% confident that the difference in call-back rates is 3.2 plus or minus 1.5 percentage points.
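
The relative differences quoted above follow directly from the call-back counts. This short sketch reproduces the arithmetic from the raw counts rather than the rounded percentages; the variable names are illustrative.

```python
# Call-back rates from the counts reported earlier in the notebook.
rate_white = 235 / 2435
rate_black = 157 / 2435

# Absolute difference, in proportion units (multiply by 100 for points).
abs_diff = rate_white - rate_black

# Relative differences, each expressed against a different baseline.
white_vs_black = abs_diff / rate_black   # white relative to black
black_vs_white = -abs_diff / rate_white  # black relative to white

print('absolute difference: {:.4f}'.format(abs_diff))
print('white vs. black: {:+.0%}'.format(white_vs_black))
print('black vs. white: {:+.0%}'.format(black_vs_white))
```

The two percentages differ because each uses a different group's rate as the denominator.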

### Is race the main factor?

We recode the race column as ones (1) and zeros (0), instead of 'w' and 'b', so that we can compute the Pearson correlation coefficient with the call column, along with its p-value. From the SciPy documentation: _The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets._ That is, a low p-value indicates that the observed correlation is unlikely to arise by chance if the two quantities were uncorrelated, while a high p-value indicates that it could plausibly be due to randomness.

In [10]:
# work on an explicit copy so the assignment does not touch a view of `data`
temp = data[['race','call']].copy()
temp['race'] = temp['race'].map({'w': 1, 'b': 0})
stats.pearsonr(temp.race,temp.call)

(0.05887209917662773, 3.9408030032935854e-05)
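
For two binary variables, the Pearson coefficient coincides with the phi coefficient of the 2x2 contingency table, so the value above can be reproduced from the four counts alone. The sketch below assumes the counts from the `value_counts()` outputs earlier in the notebook and uses only the standard library.

```python
from math import sqrt

# 2x2 table of race (rows) by call-back outcome (columns).
a, b = 235, 2200  # white-sounding: call-back, no call-back
c, d = 157, 2278  # black-sounding: call-back, no call-back

# Phi coefficient: (ad - bc) / sqrt(product of the four marginal totals).
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print('phi = {:.5f}'.format(phi))
```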

We obtain a correlation coefficient of 0.059 with a very low p-value, confirming that race and call-backs are correlated, although only weakly.
Next, we run the same correlation calculation between every numeric column and the 'call' column, and then select the columns whose correlation coefficient is higher, in absolute value, than that of the 'race' column.

In [11]:
tempdf = data.call
df = data.select_dtypes(include=[np.number])  # selecting numeric columns
#df = df.drop('call',1)
corr_array = df.corrwith(tempdf)          
# keep only coefficients larger, in absolute value, than the race-call coefficient
high_corr = corr_array[abs(corr_array) > .05887]
high_corr

yearsexp         0.061436
honors           0.071951
empholes         0.071888
specialskills    0.111074
call             1.000000
adid             0.063178
dtype: float64

We can see that, excluding call, which is perfectly correlated with itself, five other variables have a higher correlation coefficient than race: yearsexp, honors, empholes, specialskills, and adid. In particular, the correlation coefficient of _special skills_ is noticeably higher than the others. Having _special skills_ appears to be more strongly associated with getting the coveted call-back than having a white-sounding name is.

It must be noted, though, that having special skills does not "replace" being white: a black applicant with special skills is still at a disadvantage compared to a white applicant with the same skills. Let's see how race influences the correlation between special skills and call-backs.

In [12]:
# first we split the dataframe into two parts depending on race, then we select the 'specialskills' and 'call' columns 
black_special = data[data.race=='b'].loc[:,['specialskills','call']]
white_special = data[data.race=='w'].loc[:,['specialskills','call']]

In [13]:
# compute correlation coefficient for 'black'
stats.pearsonr(black_special.specialskills,black_special.call)

(0.098398639666004178, 1.1453317980714801e-06)

In [14]:
# compute correlation coefficient for 'white'
stats.pearsonr(white_special.specialskills, white_special.call)

(0.1224525826132314, 1.3424342882078501e-09)

We can see that the correlation is about 0.122 for whites, while for blacks it is only 0.098. The gap between the two coefficients is small, so by itself it is only suggestive, but it is consistent with the randomized comparison above: race might not be the main factor that determines whether an applicant receives a call-back, yet it does influence the outcome. Given two candidates, one white and one black, all else being equal, the black one is 33% less likely to receive a call-back.
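
One way to gauge whether the gap between the two correlation coefficients is itself distinguishable from noise is a Fisher z-transformation test for two independent correlations. This technique is not part of the original analysis, and it assumes roughly normal data, so with binary variables it should be read only as a rough check. The function name `compare_correlations` is illustrative; the inputs are the rounded coefficients and group sizes reported above.

```python
from math import atanh, erfc, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference of two independent r's."""
    # Fisher transformation makes the sampling distribution near-normal.
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Coefficients for the white and black groups, 2435 resumes each.
z, p_value = compare_correlations(0.1225, 2435, 0.0984, 2435)
print('z = {:.2f}, p-value = {:.2f}'.format(z, p_value))
```

With these inputs the z statistic is well below 1.96, so the difference between the two coefficients alone would not be significant at the 5% level; the evidence for a race effect rests on the difference in call-back rates tested earlier.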