
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to identical résumés and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# load the Stata file with the experiment data into a DataFrame
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [4]:
len(data)

4870

In [5]:
dataW=data[data.race=='w']
len(dataW)

2435

In [6]:
dataB=data[data.race=='b']
len(dataB)

2435

In [7]:
# number of callbacks for white-sounding names
sum(dataW.call)

235.0

#### Calculate the mean and SD of callbacks for white-sounding names (sample dataW)

In [13]:
meanW= dataW.call.mean()
print (meanW)
varianceW=np.var(dataW.call)
print(varianceW)
sdW=np.std(dataW.call)
print(sdW)

0.09650924056768417
0.08719314634799957
0.29528485627949086


In [9]:
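# manual check of the SD: 235 resumes with call=1 and 2200 with call=0,
# each contributing its squared deviation from the sample mean
var1= 235*(np.square(1-0.09650924056768417))+ 2200*(np.square(0-0.09650924056768417))
print(np.sqrt(var1/2435))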

0.29528834517
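
The manual check above relies on a property of 0/1 data: when the mean of an indicator column is p, its population variance is p(1 - p). The following is a minimal sketch of that identity using only the standard library and the counts shown in the notebook (235 callbacks out of 2435 white-sounding résumés). The variable names are illustrative and not part of the original analysis.

```python
import math

# Counts taken from the notebook output for the white-sounding group.
n = 2435
callbacks = 235

p = callbacks / n                   # sample callback rate (mean of the 0/1 column)
var_bernoulli = p * (1 - p)         # variance of a 0/1 variable with mean p
sd_bernoulli = math.sqrt(var_bernoulli)

# The same variance computed from the definition, as in the manual check.
var_definition = (callbacks * (1 - p) ** 2 + (n - callbacks) * (0 - p) ** 2) / n

print(round(p, 4), round(var_bernoulli, 4), round(sd_bernoulli, 4))
```

Both routes should agree with `np.var` and `np.std`, which use the population formula (division by n) by default.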


#### Calculate the mean and SD of callbacks for black-sounding names (sample dataB)

In [10]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [14]:
meanB= dataB.call.mean()
print (meanB)
varianceB=np.var(dataB.call)
print(varianceB)
sdB=np.std(dataB.call)
print(sdB)

0.0644763857126236
0.060318876057863235
0.24559901477380408


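#### 1. A two-sample z-test for the difference between two proportions is appropriate, and the CLT applies
Names were assigned to résumés at random, each group contains 2435 résumés, and each group has far more than 10 callbacks and 10 non-callbacks. Under these conditions the sampling distribution of the difference in callback rates is approximately normal.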
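
One common rule of thumb for applying the CLT to proportions is that each group should contain at least 10 successes and 10 failures. The sketch below checks that rule. The white-sounding count (235) comes from the notebook output; the black-sounding count (157) is an assumption inferred from `meanB * 2435`, and the helper name `clt_ok` is hypothetical.

```python
# Group sizes and callback counts.
# 235 is printed in the notebook; 157 is inferred from meanB * 2435 (assumption).
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157

def clt_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    return successes >= threshold and (n - successes) >= threshold

print(clt_ok(calls_w, n_w), clt_ok(calls_b, n_b))
```

The threshold of 10 is a convention rather than a hard limit; some texts use 5 or 15.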

#### 2. Null hypothesis: the name on the résumé makes no difference to the callback rate. Alternative hypothesis: white-sounding names receive a higher callback rate.
    1. Under the null hypothesis, meanW-meanB = 0: the two groups share the same population callback rate.
    2. We reject the null hypothesis if the probability of observing a difference in sample means at least as large as ours, assuming the null hypothesis is true, falls below a threshold value.
    3. At a 5% significance level (95% confidence), that threshold probability is 5%.
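
Written symbolically, the one-sided test and the z statistic computed later in this notebook look roughly as follows. The symbols $p_w$, $p_b$ (population callback rates), $\hat{p}_w$, $\hat{p}_b$ (sample rates, i.e. `meanW` and `meanB`), and $n_w$, $n_b$ (group sizes) are notation introduced here for illustration.

```latex
\[
H_0 : p_w - p_b = 0
\qquad
H_A : p_w - p_b > 0
\]
\[
z = \frac{\hat{p}_w - \hat{p}_b}
         {\sqrt{\dfrac{\hat{p}_w (1 - \hat{p}_w)}{n_w} + \dfrac{\hat{p}_b (1 - \hat{p}_b)}{n_b}}}
\]
```

The denominator is the unpooled standard error of the difference, which matches the variance-based formula `np.sqrt(varianceB/2435+varianceW/2435)`.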

In [15]:
diffSampleMean=meanW-meanB
print (diffSampleMean)

0.03203285485506058


#### We reject the null hypothesis if, assuming it is true, the probability of a sample mean difference of at least 0.03203285485506058 is less than 5%

In [16]:
#### Standard error of the difference in means: sqrt(sdW^2/nW + sdB^2/nB)
sdDifferenceOfMean=np.sqrt((np.square(sdW))/2435+ (np.square(sdB))/2435)
print(sdDifferenceOfMean)

0.00778330816545


In [17]:
sdDifferenceOfMean2=np.sqrt(varianceB/2435+varianceW/2435)
print(sdDifferenceOfMean2)

0.00778330816545
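
The standard error above is the unpooled version. For a hypothesis test of equal proportions, many textbooks instead use a pooled standard error, because under the null hypothesis both groups share one callback rate. The sketch below compares the two, assuming 235 and 157 callbacks per group (the 157 is inferred from `meanB * 2435`, not printed in the notebook). Variable names are illustrative.

```python
import math

# 235 is printed in the notebook; 157 is inferred from meanB * 2435 (assumption).
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157

p_w = calls_w / n_w
p_b = calls_b / n_b

# Unpooled: each group keeps its own variance p(1 - p).
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Pooled: one shared callback rate, as assumed under the null hypothesis.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

print(round(se_unpooled, 5), round(se_pooled, 5))
```

With samples this large the two values differ only slightly, so the choice should not change the conclusion here.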


In [19]:
varInterval=1.65*sdDifferenceOfMean2  ## one-tailed critical value at the 5% level
print(round(varInterval, 4))

0.0128


#### If the null hypothesis is true, there is a 95% probability that the mean difference falls below varInterval, about 0.013
 Since the actual mean difference is 0.032, which is greater than 0.013, the chance of getting a value this high is less 
    than 5%, so we reject the null hypothesis.
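
The multiplier 1.65 is a rounded one-tailed critical value. A minimal sketch of the same decision rule, deriving the critical value from the standard normal distribution with the standard library (`statistics.NormalDist`, Python 3.8+), is shown here. The standard error and observed difference are copied from the outputs of `In [17]` and `In [15]`; the variable names are illustrative.

```python
from statistics import NormalDist

se_diff = 0.00778330816545            # standard error printed in In [17]
observed_diff = 0.03203285485506058   # meanW - meanB printed in In [15]

# One-tailed critical value at the 5% significance level (about 1.645).
z_crit = NormalDist().inv_cdf(0.95)

# Largest difference we would expect 95% of the time if the null were true.
threshold = z_crit * se_diff
reject_null = observed_diff > threshold

print(round(z_crit, 3), round(threshold, 4), reject_null)
```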

#### 3. Margin of error and confidence interval at 95% confidence.  
       Having rejected the null hypothesis, we conclude that race affects the rate of callbacks, and we use the sample mean difference to estimate the population difference in callback rates between white-sounding and black-sounding names. We take the sample mean difference (meanW-meanB) as the point estimate of the population mean difference and report the range (1.96*sdDifferenceOfMean2) above and below it, which covers the population difference with 95% confidence.

In [22]:
## For a two-sided 95% interval under the normal distribution, the z-value is 1.96
UpperLimit =  (meanW-meanB) + (1.96*sdDifferenceOfMean2) 
LowerLimit= (meanW-meanB) - (1.96*sdDifferenceOfMean2) 
print (round(UpperLimit, 4))
print (round(LowerLimit, 4))
marginOfError= (1.96*sdDifferenceOfMean2) 
print (round(marginOfError, 4))

0.0473
0.0168
0.0153


#### Difference in population callback rates between white-sounding and black-sounding names 
    Confidence interval: 0.017 to 0.047 at 95% confidence
    Margin of error: 0.015
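
The interval can also be wrapped in a small reusable function. The following is a sketch under stated assumptions: the function name `diff_proportion_ci` is hypothetical, it uses the unpooled standard error, and the black-sounding callback count (157) is inferred from `meanB * 2435` instead of being printed in the notebook.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 using the unpooled standard error."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * se
    diff = p1 - p2
    return diff - margin, diff + margin, margin

# 235 white-sounding callbacks (from the notebook), 157 black-sounding (assumption).
low, high, margin = diff_proportion_ci(235, 2435, 157, 2435)
print(round(low, 4), round(high, 4), round(margin, 4))
```

Because the interval excludes zero, it is consistent with rejecting the null hypothesis of no difference.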

In [23]:
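zValue=(meanW-meanB)/sdDifferenceOfMean2
print (round(zValue, 4))
 # one-tailed p-value is < 0.0001

4.1156


#### 4.  Since the p-value is less than the significance level of 0.05, the null hypothesis is rejected: the difference in callback rates between white-sounding and black-sounding names is statistically significant.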
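
The p-value can be computed directly from the z statistic instead of being read from a table. This minimal sketch uses only the standard library, with the standard error printed in `In [17]` and the difference printed in `In [15]`; the helper name `upper_tail_p` is hypothetical. `scipy.stats.norm.sf(z_value)` should give the same tail probability.

```python
import math

def upper_tail_p(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

observed_diff = 0.03203285485506058   # meanW - meanB printed in In [15]
se_diff = 0.00778330816545            # standard error printed in In [17]

z_value = observed_diff / se_diff
p_one_tailed = upper_tail_p(z_value)  # matches the one-sided alternative
p_two_tailed = 2 * p_one_tailed       # for a two-sided alternative

print(round(z_value, 4), p_one_tailed < 0.05, p_two_tailed < 0.05)
```

Either version of the p-value is far below 0.05, so the conclusion does not depend on whether the test is treated as one-sided or two-sided.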