# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
<h3>Exercises<a class="anchor-link" href="#Exercises">¶</a>
</h3>
<p>You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.</p>
<p>Answer the following questions <strong>in this notebook below and submit to your Github account</strong>.</p>
<ol>
<li>What test is appropriate for this problem? Does CLT apply?</li>
<li>What are the null and alternate hypotheses?</li>
<li>Compute margin of error, confidence interval, and p-value.</li>
<li>Write a story describing the statistical significance in the context of the original problem.</li>
<li>Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?</li>
</ol>
<p>You can include written notes in notebook cells using Markdown:</p>
<ul>
<li>In the control panel at the top, choose Cell &gt; Cell Type &gt; Markdown</li>
<li>Markdown syntax: <a href="http://nestacms.com/docs/creating-content/markdown-cheat-sheet">http://nestacms.com/docs/creating-content/markdown-cheat-sheet</a>
</li>
</ul>
<h4>Resources<a class="anchor-link" href="#Resources">¶</a>
</h4>
<ul>
<li>Experiment information and data source: <a href="http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states">http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states</a>
</li>
<li>Scipy statistical methods: <a href="http://docs.scipy.org/doc/scipy/reference/stats.html">http://docs.scipy.org/doc/scipy/reference/stats.html</a> </li>
<li>Markdown syntax: <a href="http://nestacms.com/docs/creating-content/markdown-cheat-sheet">http://nestacms.com/docs/creating-content/markdown-cheat-sheet</a>
</li>
</ul>
<hr>

</div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import scipy.stats as stats
import statsmodels.stats.weightstats as wstats
import math
%matplotlib inline

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
sum(data[data.race!='b'].call)

235.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [7]:
data[['race','call']].describe()

Unnamed: 0,call
count,4870.0
mean,0.080493
std,0.272079
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [8]:
data[['race','call']].head(10)

Unnamed: 0,race,call
0,w,0.0
1,w,0.0
2,b,0.0
3,b,0.0
4,w,0.0
5,w,0.0
6,w,0.0
7,b,0.0
8,b,0.0
9,b,0.0


In [9]:
data.race.unique()

array(['w', 'b'], dtype=object)

In [10]:
# dataframe for black-sounding names
df_color=data[data.race=='b']
# dataframe for white-sounding names (every row where race is not 'b')
df_non_color=data[data.race!='b']

In [11]:
# summary statistics for black-sounding names (df_color)
df_color[['race','call']].describe()

Unnamed: 0,call
count,2435.0
mean,0.064476
std,0.245649
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [12]:
# summary statistics for white-sounding names (df_non_color)
df_non_color[['race','call']].describe()

Unnamed: 0,call
count,2435.0
mean,0.096509
std,0.295346
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [13]:
# critical z value for a two-sided test at the 95% confidence level (from the z table)
critical_value =1.96

mean_color_calls, sd_color_calls, var_color_calls= df_color.call.mean(), df_color.call.std(), df_color.call.var()
mean_non_color_calls, sd_non_color_calls, var_non_color_calls= df_non_color.call.mean(), df_non_color.call.std(), df_non_color.call.var()

print (' The mean of calls for black-sounding names is ' + str(mean_color_calls))
print (' The standard deviation of calls for black-sounding names is ' + str(sd_color_calls))
print (' The variance of calls for black-sounding names is ' + str(var_color_calls))
print (' The mean of calls for white-sounding names is ' + str(mean_non_color_calls))
print (' The standard deviation of calls for white-sounding names is ' + str(sd_non_color_calls))
print (' The variance of calls for white-sounding names is ' + str(var_non_color_calls))

 The mean of calls for black-sounding names is 0.0644763857126236
 The standard deviation of calls for black-sounding names is 0.24564945697784424
 The variance of calls for black-sounding names is 0.060343656688928604
 The mean of calls for white-sounding names is 0.09650924056768417
 The standard deviation of calls for white-sounding names is 0.2953455150127411
 The variance of calls for white-sounding names is 0.08722896873950958


<div class="span5 alert alert-info">
   1. What test is appropriate for this problem? Does CLT apply?
 </div>


<div class="span5 alert alert-info" style="background-color:#ffff66; color:black">
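<b>Answer:</b> Each resume either receives a callback or does not, so <code>call</code> is a Bernoulli variable, and the problem is a comparison of two proportions: the callback rate for white-sounding names and the callback rate for black-sounding names. A two-sided test is appropriate because we want to detect a difference in either direction, i.e. whether white-sounding names receive <b>more</b> or <b>fewer</b> callbacks than black-sounding names.<br/>
Both samples are large (2435 resumes each), so a two-sample z-test for the difference in proportions is appropriate. The population standard deviation is not known, but with samples this large it can be estimated from the sample proportions.
<br/> 
The CLT applies: names were assigned to resumes at random, and each group has well over 10 callbacks and 10 non-callbacks, so the sampling distribution of each sample proportion, and of their difference, is approximately normal.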
</div>
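
The large-sample condition behind the CLT claim can be checked directly. The sketch below uses the callback counts and group sizes printed earlier in this notebook; the helper name `normal_approx_ok` and the threshold of 10 are illustrative assumptions (a common rule of thumb), not part of the original analysis.

```python
# Counts taken from the notebook output above.
n_black, calls_black = 2435, 157
n_white, calls_white = 2435, 235


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: require at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


black_ok = normal_approx_ok(calls_black, n_black)
white_ok = normal_approx_ok(calls_white, n_white)
print('Normal approximation reasonable for black-sounding names:', black_ok)
print('Normal approximation reasonable for white-sounding names:', white_ok)
```

Both groups pass the check comfortably, which supports using a z-test here.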

<div class="span5 alert alert-info">
   2. What are the null and alternate hypotheses?
</div>

<div class="span5 alert alert-info" style="background-color:#ffff66; color:black">
<b>Answer:</b> The hypotheses are as follows:
    <br/>
    <b> Null Hypothesis: </b>  There is no discrimination in callbacks based on race. That is, the callback rate for black-sounding names minus the callback rate for white-sounding names equals zero.
    <br/>
    <b> Alternate Hypothesis: </b>  There is discrimination in callbacks based on race. That is, the callback rate for black-sounding names minus the callback rate for white-sounding names is greater than or less than zero.
</div>
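
In symbols, writing $p_b$ and $p_w$ for the true callback rates of black-sounding and white-sounding names (notation introduced here for illustration), the two hypotheses read:

$$
H_0:\; p_b - p_w = 0 \qquad\qquad H_1:\; p_b - p_w \neq 0
$$

Because the alternative is two-sided, the order of subtraction only changes the sign of the test statistic, not the conclusion.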

<div class="span5 alert alert-info">
   3. Compute margin of error, confidence interval, and p-value.
</div>

### The point estimate is the difference between the two sample callback proportions (white-sounding minus black-sounding)

In [14]:
difference_population_mean =mean_non_color_calls-mean_color_calls
difference_population_mean

0.03203285485506058

### The variance of the difference is the sum of the variances of the two sample proportions

In [15]:
# proportion calculation for black-sounding names
n_c=len(df_color)# total number of resumes with black-sounding names
c_got_call=len(df_color[df_color.call==1]) # black-sounding names that got a call
p_c=c_got_call/n_c # sample proportion of black-sounding names that got a call
print('Black-sounding names that got a call: '+str(c_got_call))
print('Total black-sounding names: '+str(n_c))
print('Callback proportion for black-sounding names: '+str(p_c))

# proportion calculation for white-sounding names
n_nc=len(df_non_color)# total number of resumes with white-sounding names
nc_got_call=len(df_non_color[df_non_color.call==1])# white-sounding names that got a call
p_nc=nc_got_call/n_nc # sample proportion of white-sounding names that got a call
print('White-sounding names that got a call: '+str(nc_got_call))
print('Total white-sounding names: '+str(n_nc))
print('Callback proportion for white-sounding names: '+str(p_nc))

# variance of the sample proportion for black-sounding names: p(1-p)/n
variance_color=(p_c*(1-p_c))/n_c
print('Variance of the black-sounding proportion: %.12f' % float(variance_color))

# variance of the sample proportion for white-sounding names: p(1-p)/n
variance_non_color=(p_nc*(1-p_nc))/n_nc
print('Variance of the white-sounding proportion: %.12f' % float(variance_non_color))

# variance of the difference between the two independent proportions
population_variance=variance_color+variance_non_color

print('Variance of the difference in proportions: %.12f' % (float(population_variance)))

Black-sounding names that got a call: 157
Total black-sounding names: 2435
Callback proportion for black-sounding names: 0.06447638603696099
White-sounding names that got a call: 235
Total white-sounding names: 2435
Callback proportion for white-sounding names: 0.09650924024640657
Variance of the black-sounding proportion: 0.000024771738
Variance of the white-sounding proportion: 0.000035809120
Variance of the difference in proportions: 0.000060580858


### The standard error of the difference is the square root of the total variance

In [16]:
# standard error of the difference in proportions
population_standard_deviation=math.sqrt(population_variance)
print('The standard error of the difference in proportions is '+str(population_standard_deviation))

The standard error of the difference in proportions is 0.0077833705866767544


### The margin of error is the critical value (1.96) times the standard error; the 95% confidence interval is the observed difference (0.032) plus or minus this margin

In [17]:
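# margin of error at the 95% confidence level
# (the variable keeps the name conf_interval, but it holds the margin of error)
conf_interval= critical_value*population_standard_deviation

print('The margin of error at the 95% confidence level is '+str(conf_interval))

The margin of error at the 95% confidence level is 0.015255406349886438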


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> The margin of error at the 95% confidence level is <b>0.015255406349886438</b></div>

In [18]:
# 95% confidence interval: observed difference plus or minus the margin of error
difference_population_mean+np.array([-1, 1]) * conf_interval

array([ 0.01677745,  0.04728826])

<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> The 95% confidence interval for the difference in callback rates is between <b>0.01677745 and 0.04728826</b></div>
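
The margin of error and confidence interval steps can be collected into one small function. This is a minimal sketch that works from the callback counts alone; the function name `diff_proportion_ci` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Difference p1 - p2, its margin of error, and the confidence interval.

    Uses the unpooled standard error sqrt(p1(1-p1)/n1 + p2(1-p2)/n2).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * se
    diff = p1 - p2
    return diff, margin, (diff - margin, diff + margin)


# White-sounding names: 235 callbacks out of 2435; black-sounding: 157 out of 2435.
diff, margin, (low, high) = diff_proportion_ci(235, 2435, 157, 2435)
print('difference:', diff)
print('margin of error:', margin)
print('95% confidence interval:', (low, high))
```

The interval excludes zero, which is consistent with rejecting the null hypothesis at the 5% level.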

### Under the null hypothesis, the difference between the two callback proportions is zero. The observed difference (0.032) lies some distance from 0, and the z-score measures that distance in standard errors

#### To calculate the z-score we need the standard error under the null hypothesis. The null hypothesis says both proportions are equal, so they are estimated by a single pooled proportion: all callbacks divided by all resumes

In [19]:
# under the null hypothesis p1 = p2, so pool both groups into one proportion
null_hypothese_proporation=(nc_got_call+c_got_call)/(n_c+n_nc)  #(157+235)/4870
# standard error under the null: sqrt(2p(1-p)/n), valid because both groups have the same size n
standard_deviation_z=math.sqrt((2*null_hypothese_proporation*(1-null_hypothese_proporation))/(n_c))
print(standard_deviation_z)

0.007796894036170457


#### The z-score is the observed difference (0.032) minus the hypothesized difference (0), divided by the standard error under the null hypothesis computed above

In [20]:
# z = (observed difference - hypothesized difference) / standard error under the null
z_value = (difference_population_mean-0)/standard_deviation_z
print (z_value)

4.108412235238472


In [21]:
# two-sided p-value: twice the upper-tail probability beyond |z|
p_value = scipy.stats.norm.sf(abs(z_value))*2
print('The p value from the z value above is %.12f' %float(p_value))

The p value from the z value above is 0.000039838854


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> The two-sided p-value for this z-score is <b>0.000039838854</b></div>
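
The pooled z-test above can also be reproduced from the four counts with the standard library only. This is a sketch under the same assumptions as the cells above; `two_proportion_ztest` is a hypothetical helper name, and the two-sided p-value uses the identity 2·(1 − Φ(|z|)) = erfc(|z|/√2) in place of `scipy.stats.norm.sf`.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# White-sounding names: 235 of 2435; black-sounding names: 157 of 2435.
z, p = two_proportion_ztest(235, 2435, 157, 2435)
print('z-score:', z)
print('two-sided p-value:', p)
```

Writing the standard error as `p(1-p)(1/n1 + 1/n2)` also covers groups of unequal size, whereas the `2p(1-p)/n` shortcut relies on both groups having the same number of resumes.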

In [22]:
# two-sided test: compare the absolute z-score with the critical value
if (abs(z_value)>critical_value):
    print ('The calculated z-score is higher than critical z value hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

The calculated z-score is higher than critical z value hence reject the null hypothesis


#### Recheck the z-score using statsmodels (wstats.ztest)

In [23]:
zstat, pvalue = wstats.ztest(df_non_color['call'],df_color['call'], alternative='two-sided',
                    value=0, usevar='pooled', ddof=1.0)
print ('The z stat is '+str(zstat)+ ' and p value is ' +str(pvalue))

The z stat is 4.11470535675 and p value is 3.87674291161e-05
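
The statsmodels z statistic (4.1147) differs slightly from the hand-computed one (4.1084). A likely explanation, stated here as an assumption about how `ztest` with `usevar='pooled'` works, is that it pools the sample variances of the raw 0/1 values (dividing by n1 + n2 − 2) rather than using the pooled proportion. The sketch below reproduces that calculation from the counts alone.

```python
import math

# White-sounding names: 235 callbacks of 2435; black-sounding names: 157 of 2435.
n1, x1 = 2435, 235
n2, x2 = 2435, 157
p1, p2 = x1 / n1, x2 / n2

# For 0/1 data the sum of squared deviations from the mean is n * p * (1 - p).
ss1 = n1 * p1 * (1 - p1)
ss2 = n2 * p2 * (1 - p2)

# Pooled sample variance of the raw values, then the standard error of the
# difference in means (assumed to mirror the pooled-variance z-test).
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
z_from_sample_variance = (p1 - p2) / se
print('z from pooled sample variance:', z_from_sample_variance)
```

Both versions lead to the same conclusion; they differ only in how the standard error is estimated.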


In [24]:
# two-sided test: compare the absolute z statistic with the critical value
if (abs(zstat)>critical_value):
    print ('The calculated z statistic is higher than critical z value hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

The calculated z statistic is higher than critical z value hence reject the null hypothesis


In [25]:
# 5% significance level, matching the 1.96 critical value used above
significance_level =0.05
if (pvalue<significance_level):
    print ('The calculated p-value is lower than the 0.05 significance level hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

The calculated p-value is lower than the 0.05 significance level hence reject the null hypothesis


#### Check the null hypothesis again using a t-test

In [26]:
# T-score
tstat, p_from_t = stats.ttest_ind(df_non_color['call'],df_color['call'],equal_var=False)
print('t-statistic: ', tstat)
print('p-value: ', p_from_t)

t-statistic:  4.11470529086
p-value:  3.94294151365e-05


In [27]:
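# with samples this large the t distribution is close to normal, so the z critical value is reused
if (abs(tstat)>critical_value):
    print ('The calculated t-score is higher than critical z value hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')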

The calculated t-score is higher than critical z value hence reject the null hypothesis


In [28]:
# 5% significance level, matching the 1.96 critical value used above
significance_level =0.05
if (p_from_t<significance_level):
    print ('The calculated p-value is lower than the 0.05 significance level hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

The calculated p-value is lower than the 0.05 significance level hence reject the null hypothesis


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> The calculated z-score, t-score, and p-values all lead to rejecting the null hypothesis.</div>

<div class="span5 alert alert-info">
   4. Write a story describing the statistical significance in the context of the original problem.
</div>

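<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> Resumes with white-sounding names received callbacks about 9.65% of the time, compared with about 6.45% for resumes with black-sounding names, a difference of roughly 3.2 percentage points (95% confidence interval of about 1.7 to 4.7 points). The z-score of about 4.11 lies far outside the range of -1.96 to 1.96 that would contain 95% of outcomes if there were no discrimination, and the p-value of about 0.00004 means a gap this large would almost never arise by chance alone. The assumption of no discrimination is therefore rejected: the name on the resume has a statistically significant effect on the callback rate.</div>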

<div class="span5 alert alert-info">
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
</div>

<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer:</b> The analysis found that the difference in callback rates between resumes with black-sounding names and resumes with white-sounding names is <b>statistically significant</b>. Because names were assigned to resumes at random, this difference can be attributed to the name rather than to other resume characteristics.<br/> It does not show that race/name is the <b>most important</b> factor in callback success, because only this one factor was tested. Other columns, such as years of experience, education, and skills, may also affect callbacks, and their effects may be larger. To amend the analysis, the effect of race should be compared with the effects of these other columns, for example by modelling the callback outcome on all of them together. </div>