# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import scipy.stats as stats
import statsmodels.stats.weightstats as wstats
import math
%matplotlib inline

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
# number of callbacks for white-sounding names
sum(data[data.race!='b'].call)

235.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [7]:
data[['race','call']].describe()

Unnamed: 0,call
count,4870.0
mean,0.080493
std,0.272079
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [8]:
data[['race','call']].head(10)

Unnamed: 0,race,call
0,w,0.0
1,w,0.0
2,b,0.0
3,b,0.0
4,w,0.0
5,w,0.0
6,w,0.0
7,b,0.0
8,b,0.0
9,b,0.0


In [9]:
data.race.unique()

array(['w', 'b'], dtype=object)

In [10]:
# callback counts cross-tabulated by race (rows) and call outcome (columns)
cal_race = data.groupby(['race', 'call']).call.count().unstack()

In [11]:
# dataframe for black-sounding names
df_color=data[data.race=='b']
# dataframe for white-sounding names
df_non_color=data[data.race!='b']


In [12]:
# describe df_color (black-sounding names)
df_color[['race','call']].describe()

Unnamed: 0,call
count,2435.0
mean,0.064476
std,0.245649
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [13]:
# describe df_non_color (white-sounding names)
df_non_color[['race','call']].describe()

Unnamed: 0,call
count,2435.0
mean,0.096509
std,0.295346
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [14]:
# summary statistics for each group
# for a 95% confidence level the critical z value is 1.96 (from the z table)
critical_value =1.96

mean_color_calls, sd_color_calls, var_color_calls= df_color.call.mean(), df_color.call.std(), df_color.call.var()
mean_non_color_calls, sd_non_color_calls, var_non_color_calls= df_non_color.call.mean(), df_non_color.call.std(), df_non_color.call.var()

print (' The mean of calls for black-sounding names is ' + str(mean_color_calls))
print (' The standard deviation of calls for black-sounding names is ' + str(sd_color_calls))
print (' The variance of calls for black-sounding names is ' + str(var_color_calls))
print (' The mean of calls for white-sounding names is ' + str(mean_non_color_calls))
print (' The standard deviation of calls for white-sounding names is ' + str(sd_non_color_calls))
print (' The variance of calls for white-sounding names is ' + str(var_non_color_calls))

 The mean of calls for black-sounding names is 0.0644763857126236
 The standard deviation of calls for black-sounding names is 0.24564945697784424
 The variance of calls for black-sounding names is 0.060343656688928604
 The mean of calls for white-sounding names is 0.09650924056768417
 The standard deviation of calls for white-sounding names is 0.2953455150127411
 The variance of calls for white-sounding names is 0.08722896873950958


<div class="span5 alert alert-info">
   1. What test is appropriate for this problem? Does CLT apply?
 </div>
****

<div class="span5 alert alert-info" style="background-color:#ffff66; color:black">
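<b>Answer :</b> Each resume either receives a callback or does not, so the calls in each group follow a Bernoulli distribution. A two-sided test is appropriate because the comparison covers both scenarios, i.e. resumes with white-sounding names may receive <b>more</b> or <b>fewer</b> callbacks than resumes with black-sounding names.<br/>
Both samples are large (2435 resumes each), so a two-sample z-test for the difference of proportions will be applied.
<br/> 
The CLT applies: with 157 and 235 callbacks in groups of 2435, each group has far more than 10 successes and 10 failures, so the sample proportions are approximately normally distributed. 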
</div>
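
A common rule of thumb for using the normal approximation with proportions is that each group should contain at least 10 successes and 10 failures. The sketch below checks that rule with the callback counts reported earlier in this notebook (157 of 2435 and 235 of 2435). The helper name `clt_ok` and the threshold of 10 are illustrative assumptions, not part of the original analysis.

```python
def clt_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Callback counts and group sizes taken from the notebook output above
groups = {"b": (157, 2435), "w": (235, 2435)}

for race, (calls, n) in groups.items():
    print(race, "successes:", calls, "failures:", n - calls,
          "normal approximation ok:", clt_ok(calls, n))
```
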


<div class="span5 alert alert-info">
   2. What are the null and alternate hypotheses?
</div>

<div class="span5 alert alert-info" style="background-color:#ffff66; color:black">
<b>Answer :</b> The hypotheses are as follows
    <br/>
    <b> Null Hypothesis : </b> There is <b>no</b> discrimination in callbacks based on race, which means that <i>the callback rate for white-sounding names minus the callback rate for black-sounding names</i> equals <b>zero</b>.
    <br/>
    <b> Alternate Hypothesis : </b> There is discrimination in callbacks based on race, i.e. the difference between the two callback rates is <b>not</b> zero.
</div>
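
In symbols, writing $p_w$ and $p_b$ for the true callback rates of resumes with white-sounding and black-sounding names (notation introduced here for illustration only), the two-sided test can be stated as:

```latex
H_0:\; p_w - p_b = 0 \qquad \text{versus} \qquad H_A:\; p_w - p_b \neq 0
```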

<div class="span5 alert alert-info">
   3. Compute margin of error, confidence interval, and p-value.
</div>

### The point estimate of the difference in callback rates is the difference between the two sample proportions


In [15]:
mean_call_diff =mean_non_color_calls-mean_color_calls
mean_call_diff

0.03203285485506058

### The standard error of that difference is the square root of the sum of the two sample-proportion variances, p(1-p)/n

In [16]:
n_c=len(df_color)
c_got_call=len(df_color[df_color.call==1])
p_c=c_got_call/n_c
print('The number of black-sounding resumes with a call '+str(c_got_call))
print('The total number of black-sounding resumes '+str(n_c))
print('The callback proportion for black-sounding resumes '+str(p_c))

n_nc=len(df_non_color)
nc_got_call=len(df_non_color[df_non_color.call==1])
p_nc=nc_got_call/n_nc
print('The number of white-sounding resumes with a call '+str(nc_got_call))
print('The total number of white-sounding resumes '+str(n_nc))
print('The callback proportion for white-sounding resumes '+str(p_nc))

# variance of a sample proportion: p(1-p)/n
c_sd=(p_c*(1-p_c))/n_c
print('The variance of the black-sounding callback proportion %.12f' % float(c_sd))

nc_sd=(p_nc*(1-p_nc))/n_nc
print('The variance of the white-sounding callback proportion %.12f' % float(nc_sd))

# variance of the difference = sum of the two variances
pop_var=c_sd+nc_sd

print('The variance of the difference %.12f' % float(pop_var))

# standard error of the difference
pop_sd=math.sqrt(pop_var)
print('The standard error of the difference '+str(pop_sd))


The number of black-sounding resumes with a call 157
The total number of black-sounding resumes 2435
The callback proportion for black-sounding resumes 0.06447638603696099
The number of white-sounding resumes with a call 235
The total number of white-sounding resumes 2435
The callback proportion for white-sounding resumes 0.09650924024640657
The variance of the black-sounding callback proportion 0.000024771738
The variance of the white-sounding callback proportion 0.000035809120
The variance of the difference 0.000060580858
The standard error of the difference 0.0077833705866767544


In [17]:
# margin of error at the 95 % confidence level = critical z value * standard error
conf_interval= critical_value*pop_sd

print('The margin of error at 95 % confidence '+str(conf_interval))

The margin of error at 95 % confidence 0.015255406349886438


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> The margin of error at the 95% confidence level is <b>0.015255406349886438</b></div>

In [18]:
# 95 % confidence interval: difference in callback rates +/- margin of error
mean_call_diff+np.array([-1, 1]) * conf_interval

array([ 0.01677745,  0.04728826])

<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> The 95% confidence interval for the difference in callback rates is <b>0.01677745 to 0.04728826</b></div>
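
The margin of error and the confidence interval can also be reproduced from the four counts alone. The following is a minimal sketch, assuming the unpooled standard error used above; the function name `diff_proportion_ci` is hypothetical and introduced only for this illustration.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Return (difference, margin_of_error, (low, high)) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: each group contributes its own p(1-p)/n
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    moe = z_crit * se
    diff = p1 - p2
    return diff, moe, (diff - moe, diff + moe)

# White-sounding names first, so the difference is positive
diff, moe, (low, high) = diff_proportion_ci(235, 2435, 157, 2435)
print(f"difference = {diff:.4f}, margin of error = {moe:.4f}")
print(f"95% CI = [{low:.4f}, {high:.4f}]")
```

Passing a different `z_crit` (for example 2.576 for 99 %) widens the interval accordingly.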

In [19]:
# null hypothesis: p1 = p2, so both groups share the pooled proportion
prop_zp=(157+235)/4870
# pooled standard error under the null: sqrt(p*(1-p)*(1/n_nc + 1/n_c)),
# which equals sqrt(2p*(1-p)/n) when each group has n resumes
sd_z=math.sqrt(prop_zp*(1-prop_zp)*(1/n_nc+1/n_c))
print(round(sd_z, 6))


# z = (observed difference - 0) / pooled standard error
z_value = (mean_call_diff-0)/sd_z

print(round(z_value, 4))

0.007797
4.1084


In [20]:
# two-sided p-value: twice the upper-tail area beyond |z|
p_value = scipy.stats.norm.sf(abs(z_value))*2
print('The p-value for the z-score above is %.2e' % float(p_value))

The p-value for the z-score above is 3.98e-05


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> The p-value for the z-score is <b>3.98e-05</b></div>
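
As a cross-check, the pooled two-proportion z-test can be computed from the counts with only the standard library. This is a sketch under the usual assumptions (independent samples, pooled standard error sqrt(p(1-p)(1/n1 + 1/n2)) under the null); the function name `two_proportion_ztest` is hypothetical.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for H0: p1 == p2. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error of p1 - p2 when both groups share the pooled rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail area: 2 * sf(|z|) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.3f}, p-value = {p:.2e}")
```
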

In [21]:
if (abs(z_value)>critical_value):
    print ('The calculated z-score is higher than the 5 % critical value, hence reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

The calculated z-score is higher than the 5 % critical value, hence reject the null hypothesis


<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> The calculated z-score is higher than the critical value at the 5 % significance level, hence <b>reject</b> the null hypothesis.</div>

<div class="span5 alert alert-info">
   4. Write a story describing the statistical significance in the context of the original problem.
</div>

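<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> Resumes with white-sounding names received callbacks about 9.65 % of the time, compared with about 6.45 % for identical resumes with black-sounding names. The calculated z-score lies well beyond the ±1.96 critical values that bound the central 95 % of the distribution expected under the null hypothesis. In other words, if there were no discrimination, a difference in callback rates this large would be very unlikely to arise by chance, so the assumption of no discrimination is rejected. </div>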

<div class="span5 alert alert-info">
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
</div>

<div class="span5 alert alert-info" style='background-color:#ffff66; color:black'><b>Answer :</b> The test supports the alternate hypothesis that <b><i>discrimination exists based on race</i></b>.<br/> However, it does not show that race/name is the most important factor in callback success: only race was analyzed here, and many other columns (for example education, years of experience, or honors) might also affect callbacks. More analysis is needed that takes these other columns into account. </div>
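
One simple way to start that follow-up is to compare callback rates by race within each level of another resume attribute. The sketch below uses only the standard library and a handful of made-up records; the values are invented for illustration and are NOT rows from the dataset, although `race`, `honors`, and `call` are column names that appear in it. The helper `callback_rates` is hypothetical.

```python
from collections import defaultdict

def callback_rates(records, keys):
    """Mean of `call` for each combination of the given keys."""
    totals = defaultdict(lambda: [0, 0])  # key -> [calls, count]
    for row in records:
        k = tuple(row[name] for name in keys)
        totals[k][0] += row["call"]
        totals[k][1] += 1
    return {k: calls / count for k, (calls, count) in totals.items()}

# Made-up toy records for illustration only (not from the real dataset)
toy = [
    {"race": "b", "honors": 0, "call": 0},
    {"race": "b", "honors": 0, "call": 0},
    {"race": "b", "honors": 1, "call": 1},
    {"race": "b", "honors": 1, "call": 0},
    {"race": "w", "honors": 0, "call": 0},
    {"race": "w", "honors": 0, "call": 1},
    {"race": "w", "honors": 1, "call": 1},
    {"race": "w", "honors": 1, "call": 1},
]

rates = callback_rates(toy, ["honors", "race"])
for key in sorted(rates):
    print(key, rates[key])
```

If the gap between races persists within every level of the other attribute, that attribute is unlikely to explain the difference; ranking the importance of several factors at once would call for a multivariate model such as logistic regression.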