# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# pd.read_stata is the public entry point for reading Stata .dta files
data = pd.read_stata('/Users/ruhama.ahale/Documents/Springboard_Coursework/racial_disc/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [6]:
# a list of column names:
list(data)

['id',
 'ad',
 'education',
 'ofjobs',
 'yearsexp',
 'honors',
 'volunteer',
 'military',
 'empholes',
 'occupspecific',
 'occupbroad',
 'workinschool',
 'email',
 'computerskills',
 'specialskills',
 'firstname',
 'sex',
 'race',
 'h',
 'l',
 'call',
 'city',
 'kind',
 'adid',
 'fracblack',
 'fracwhite',
 'lmedhhinc',
 'fracdropout',
 'fraccolp',
 'linc',
 'col',
 'expminreq',
 'schoolreq',
 'eoe',
 'parent_sales',
 'parent_emp',
 'branch_sales',
 'branch_emp',
 'fed',
 'fracblack_empzip',
 'fracwhite_empzip',
 'lmedhhinc_empzip',
 'fracdropout_empzip',
 'fraccolp_empzip',
 'linc_empzip',
 'manager',
 'supervisor',
 'secretary',
 'offsupport',
 'salesrep',
 'retailsales',
 'req',
 'expreq',
 'comreq',
 'educreq',
 'compreq',
 'orgreq',
 'manuf',
 'transcom',
 'bankreal',
 'trade',
 'busservice',
 'othservice',
 'missind',
 'ownership']

"In practice, we typically send four resumes in response to each ad: two higher-quality and two lower-quality ones"

"... in each detailed occupational category into two groups: high and low quality."

"... criteria such as labor market experience, career profile, existence of gaps in employment, and skills listed."

"... Such a classification is admittedly subjective but it is made independently of any race assignment on the resumes ..."

In [4]:
white_high = data[(data['race'] == 'w') & (data['h'] == 1)]
black_high = data[(data['race'] == 'b') & (data['h'] == 1)]
white_low = data[(data['race'] == 'w') & (data['l'] == 1)]
black_low = data[(data['race'] == 'b') & (data['l'] == 1)]

In [5]:
callback_white_high = len(white_high.loc[(white_high['call'] == 1), :])
callback_white_low = len(white_low.loc[(white_low['call'] == 1), :])

In [6]:
callback_white_rate = (callback_white_high + callback_white_low)  / (len(white_high) + len(white_low))
callback_white_rate

0.09650924024640657

In [7]:
callback_black_high = len(black_high.loc[(black_high['call'] == 1), :])
callback_black_low = len(black_low.loc[(black_low['call'] == 1), :])

In [8]:
callback_black_rate = (callback_black_high + callback_black_low)  / (len(black_high) + len(black_low))
callback_black_rate

0.06447638603696099

### 1. What test is appropriate for this problem? Does CLT apply? 2. What are the null and alternate hypotheses? 3. Compute margin of error, confidence interval, and p-value.

We want to test whether a resume with a black-sounding name has a different probability of receiving a callback than a resume with a white-sounding name. The hypotheses are $H_0: p_{black} - p_{white} = 0$ vs $H_1: p_{black} - p_{white} \neq 0$.

Each callback is a Bernoulli variable $X$, so the sample proportion of each group is an unbiased estimator of its callback probability: $E\left[\frac{\sum{X_{black}}}{n_{black}}\right] = p_{black}$ and $E\left[\frac{\sum{X_{white}}}{n_{white}}\right] = p_{white}$. By the CLT, the mean of a sufficiently large sample is approximately normally distributed. Our sample size is large (>1000), so we can assume that $\hat{p} = \frac{\sum{x}}{n} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$ approximately, where the parameters are estimated by $\hat{\mu}_{\bar{x}} = \frac{\sum{x}}{n}$ and $\hat{\sigma}_{\bar{x}} = \hat{\sigma}_{x} / \sqrt{n}$. Because $X$ follows a Bernoulli distribution, $\hat{\sigma}_{x} = \sqrt{\hat{p} (1-\hat{p})}$.

We can then form the test statistic $t = \frac{\bar{x}_{black} - \bar{x}_{white}}{\sqrt{\frac{s_{black}^2}{n_{black}} + \frac{s_{white}^2}{n_{white}} }}$, where $\bar{x} = \hat{p}$ and $s^2 = \hat{p}(1-\hat{p})$ for each group. Under $H_0$ it approximately follows a t distribution, which at these sample sizes is practically indistinguishable from the standard normal.
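The test statistic described above can be sketched as a small standalone function. This is an illustration rather than the notebook's own code: it uses the normal approximation instead of the t distribution, the function name `two_proportion_z_test` is made up for this example, and the counts (157 of 2435 and 235 of 2435) are inferred from the callback rates reported earlier, so treat them as assumptions.

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for a difference in proportions (unpooled standard error)."""
    p1, p2 = k1 / n1, k2 / n2
    diff = p1 - p2
    # Bernoulli variance p(1-p) of each group, divided by that group's sample size
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, se, z, p_value

# Assumed counts, inferred from the rates above: black-sounding vs white-sounding names
diff, se, z, p_value = two_proportion_z_test(157, 2435, 235, 2435)
print(diff, se, z, p_value)
```

Because both groups are large, the normal-based p-value from this sketch should be very close to a t-based one.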

In [10]:
import math
s_sq_black = callback_black_rate*(1-callback_black_rate)
s_sq_white = callback_white_rate*(1-callback_white_rate)
sd_dif = math.sqrt(s_sq_black/ (len(black_high) + len(black_low)) +  s_sq_white / (len(white_high) + len(white_low)))
t = (callback_black_rate -callback_white_rate) / sd_dif
d_f = len(black_high) + len(black_low) - 1
sd_dif

0.0077833705866767544
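The cell above takes the degrees of freedom as the size of one group minus one. A common alternative for a two-sample statistic with unequal variances is the Welch-Satterthwaite approximation. The sketch below is not part of the original analysis; the helper name `welch_df` is hypothetical and the group sizes (2435 each) and callback counts are inferred from the reported rates.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Assumed inputs: callback rates and group sizes inferred from the output above
n_black, n_white = 2435, 2435
p_black, p_white = 157 / n_black, 235 / n_white

# Same Bernoulli variance estimate p(1-p) as the notebook uses
df_welch = welch_df(p_black * (1 - p_black), n_black,
                    p_white * (1 - p_white), n_white)
print(df_welch)
```

With thousands of degrees of freedom either way, the t critical value is almost identical to the normal one, so the choice has little practical effect here.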

In [92]:
# margin of error:
import scipy.stats as stats
const = stats.t.interval(0.95, d_f)[1] * sd_dif
const

0.0152627157139896

In [94]:
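# the confidence interval is centered on the difference in callback rates
# (not on the difference in variances s_sq_black - s_sq_white):
dif_rate = callback_black_rate - callback_white_rate
(round(dif_rate - const, 4), round(dif_rate + const, 4))

(-0.0473, -0.0168)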
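A 95% interval for the difference in callback rates is centered on the observed difference, with the margin of error added and subtracted. As a cross-check, the sketch below rebuilds it with a normal critical value from the standard library instead of the t distribution. The counts are inferred from the reported rates and are assumptions, not values read from the data file.

```python
import math
from statistics import NormalDist

# Assumed counts: callbacks and resumes per group, inferred from the rates above
k_b, n_b = 157, 2435   # black-sounding names
k_w, n_w = 235, 2435   # white-sounding names

p_b, p_w = k_b / n_b, k_w / n_w
diff = p_b - p_w
se = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

z_crit = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value
margin = z_crit * se
ci = (diff - margin, diff + margin)
print(margin, ci)
```

An interval that lies entirely below zero is consistent with rejecting $H_0$ at the 5% level.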

In [93]:
import scipy.stats as stats
# two-sided p-value; -abs(t) selects the lower tail whatever the sign of t
p = stats.t.cdf(-abs(t), df = d_f) * 2
p

3.991074585679982e-05
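Since names were assigned to resumes at random, the p-value can also be checked without any distributional assumption by a permutation-style simulation. The sketch below is an addition, not the notebook's method. It assumes 2435 resumes per group with 157 and 235 callbacks (inferred from the reported rates) and repeatedly redistributes the 392 callbacks at random over all resumes.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed counts, inferred from the rates reported earlier
n_black, n_white = 2435, 2435
calls_black, calls_white = 157, 235
total_calls = calls_black + calls_white
n_total = n_black + n_white
# The groups are equal in size, so a gap in counts is equivalent to a gap in rates
observed_gap = abs(calls_black - calls_white)

n_sim = 5000
extreme = 0
for _ in range(n_sim):
    # Under H0 callbacks land on resumes regardless of the name:
    # draw which resume slots get a callback; slots 0..n_black-1 count as "black"
    slots = random.sample(range(n_total), total_calls)
    sim_black = sum(1 for s in slots if s < n_black)
    sim_white = total_calls - sim_black
    if abs(sim_black - sim_white) >= observed_gap:
        extreme += 1

p_perm = extreme / n_sim
print(p_perm)
```

With only 5000 simulations the estimate cannot resolve p-values much below 1/5000, so a result at or near zero simply indicates that a gap this large is very rare under random assignment.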

### 4. Write a story describing the statistical significance in the context of the original problem.

At a 5% significance level we reject the null hypothesis and conclude that the race signaled by the name has a statistically significant effect on whether a resume receives a callback. By design, the study randomly assigned black-sounding and white-sounding names to both high-quality and low-quality resumes, so the qualifications behind the two groups of names should be balanced. The name is therefore the only systematic difference between the groups, and the data give strong evidence that a resume with a black-sounding name is less likely to receive a callback than a comparable resume with a white-sounding name (about 6.4% versus 9.7%).
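Statistical significance says little about the size of the effect, so it can help to restate the gap in more tangible terms. The sketch below is an illustrative addition that uses counts inferred from the reported rates (assumptions, not values read from the data file).

```python
# Assumed counts, inferred from the callback rates reported earlier
calls_black, n_black = 157, 2435
calls_white, n_white = 235, 2435

rate_black = calls_black / n_black
rate_white = calls_white / n_white

# Absolute gap in callback rates
abs_gap = rate_white - rate_black
# Relative gap: how many times more often white-sounding names are called back
ratio = rate_white / rate_black
# Average number of resumes sent per callback in each group
resumes_per_call_black = 1 / rate_black
resumes_per_call_white = 1 / rate_white

print(abs_gap, ratio, resumes_per_call_black, resumes_per_call_white)
```

On these figures, resumes with white-sounding names receive roughly one and a half times as many callbacks as resumes with black-sounding names.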

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

Not necessarily: the test above considers race alone. As a first check, we can examine how resume quality affects callback success using a similar approach.

In [13]:
callback_high_rate = (callback_white_high + callback_black_high) / (len(white_high) + len(black_high))
callback_low_rate = (callback_white_low + callback_black_low) / (len(white_low) + len(black_low))

In [14]:
dif_high_low = callback_high_rate - callback_low_rate
dif_high_low

0.014057435997074763

In [17]:
import math
s_sq_high = callback_high_rate*(1-callback_high_rate)
s_sq_low = callback_low_rate*(1-callback_low_rate)
sd_dif = math.sqrt(s_sq_high/ (len(white_high) + len(black_high)) +  s_sq_low / (len(white_low) + len(black_low)))
t = dif_high_low / sd_dif
d_f = (len(white_low) + len(black_low)) - 1
p = stats.t.cdf(-t, df = d_f) * 2
p


0.071326145693657708

With p ≈ 0.071, the difference in callback rates between high- and low-quality resumes (about 1.4 percentage points) is not statistically significant at the 5% level when black and white names are pooled.

In [144]:
callback_white_high_rate = callback_white_high / len(white_high)
callback_white_low_rate = callback_white_low / len(white_low)
dif_hl_in_white = callback_white_high_rate - callback_white_low_rate
import math
s_sq_high_white = callback_white_high_rate*(1-callback_white_high_rate)
s_sq_low_white = callback_white_low_rate*(1-callback_white_low_rate)
sd_dif = math.sqrt(s_sq_high_white/ len(white_high) +  s_sq_low_white / len(white_low))
t = dif_hl_in_white / sd_dif
d_f = len(white_low) - 1
p = stats.t.cdf(-t, df = d_f) * 2
dif_hl_in_white

0.02294781808516093

In [143]:
p

0.055122895785852957

Among white-sounding names, a higher-quality resume has a callback rate about 2.29 percentage points higher than a lower-quality one, but with p ≈ 0.055 this difference is not significantly different from 0 at the 5% level.

In [145]:
callback_black_high_rate = callback_black_high / len(black_high)
callback_black_low_rate = callback_black_low / len(black_low)
dif_hl_in_black = callback_black_high_rate - callback_black_low_rate
import math
s_sq_high_black = callback_black_high_rate*(1-callback_black_high_rate)
s_sq_low_black = callback_black_low_rate*(1-callback_black_low_rate)
sd_dif = math.sqrt(s_sq_high_black/ len(black_high) +  s_sq_low_black / len(black_low))
t = dif_hl_in_black / sd_dif
d_f = len(black_low) - 1
p = stats.t.cdf(-t, df = d_f) * 2
dif_hl_in_black

0.005167053908988611

In [146]:
p

0.60372207527297927

Among black-sounding names, a higher-quality resume has a callback rate only about 0.5 percentage points higher than a lower-quality one, and with p ≈ 0.60 this difference is not significantly different from 0.

We could also fit a regression model to find out which factors, such as education and experience, matter most. So far, considering only resume quality and race, race appears to be the dominant factor.
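One way to write down such a regression, offered as a sketch and not as a model the study itself fitted, is a logistic regression of the callback indicator on race, resume quality and a few resume attributes. The indicator $\text{black}_i$ (equal to 1 when `race` is `'b'`) is a name introduced here for illustration; `h`, `education` and `yearsexp` are columns of the dataset, and the choice of covariates is an assumption.

$$\log\frac{P(\text{call}_i = 1)}{1 - P(\text{call}_i = 1)} = \beta_0 + \beta_1\,\text{black}_i + \beta_2\,h_i + \beta_3\,\text{education}_i + \beta_4\,\text{yearsexp}_i$$

Under this specification, $\beta_1$ measures the change in the log-odds of a callback associated with a black-sounding name while the other covariates are held fixed. Comparing its estimate and standard error with those of the other coefficients would give a more direct answer to whether race is the most important factor.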