
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import math as mt

In [2]:
data = pd.read_stata('F:\\Data_Science\\Statistic\\statistics project 2\\data\\us_job_market_discrimination.dta')

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [6]:
len(data[data.race=='w'])

2435

# What test is appropriate for this problem? Does CLT apply? 

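### The outcome (`call`) is binary, so each group's callback rate is a sample proportion. Each group contains 2,435 résumés, with well over 10 callbacks and 10 non-callbacks in each, so the CLT applies and the difference in sample proportions is approximately normally distributed. A two-sample z-test for the difference in proportions is therefore appropriate.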
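The CLT condition can be checked with a common rule of thumb: each group should have at least 10 successes and 10 failures. The snippet below is a minimal sketch that hard-codes the counts shown in the notebook outputs (235 and 157 callbacks out of 2,435 résumés per group); the helper name `clt_ok` and the threshold of 10 are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the notebook outputs (not recomputed from the .dta file).
n_w, calls_w = 2435, 235   # résumés with white-sounding names
n_b, calls_b = 2435, 157   # résumés with black-sounding names


def clt_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


print("white-sounding group:", clt_ok(calls_w, n_w))
print("black-sounding group:", clt_ok(calls_b, n_b))
```

Both groups pass the check comfortably, which supports using the normal approximation.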

### Sample callback rate for résumés with white-sounding names

In [7]:
mean_w=sum(data[data.race=='w'].call)/len(data[data.race=='w'])
mean_w

0.096509240246406572

### Sample callback rate for résumés with black-sounding names

In [8]:
mean_b=sum(data[data.race=='b'].call)/len(data[data.race=='b'])
mean_b

0.064476386036960986

### Sample standard deviation of callbacks for résumés with white-sounding names

In [9]:
call_w=sum(data[data.race=='w'].call)  ## Number of résumés with white-sounding names that received a call
call_w

235.0

In [10]:
n_call_w=len(data[data.race=='w'])-sum(data[data.race=='w'].call)  ## Number of résumés with white-sounding names that did not receive a call
n_call_w

2200.0

In [11]:
# sample variance (n-1 denominator) of the 0/1 call outcome for white-sounding names
var_w=((call_w*(1-mean_w)**2)+(n_call_w*(0-mean_w)**2))/(len(data[data.race=='w'])-1)
var_w

0.087231030625346942

In [12]:
#standard deviation 
std_w=mt.sqrt(var_w)
std_w

0.2953489980097223

### Sample standard deviation of callbacks for résumés with black-sounding names

In [13]:
call_b=sum(data[data.race=='b'].call)  ## Number of résumés with black-sounding names that received a call
call_b

157.0

In [14]:
n_call_b=len(data[data.race=='b'])-sum(data[data.race=='b'].call)  ## Number of résumés with black-sounding names that did not receive a call
n_call_b

2278.0

In [15]:
# sample variance (n-1 denominator) of the 0/1 call outcome for black-sounding names
var_b=((call_b*(1-mean_b)**2)+(n_call_b*(0-mean_b)**2))/(len(data[data.race=='b'])-1)
var_b

0.060343963595808181

In [16]:
#standard deviation 
std_b=mt.sqrt(var_b)
std_b

0.24565008364706123
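Because `call` takes only the values 0 and 1, the sum of squared deviations simplifies to n·p·(1−p), so the sample variance reduces to p·(1−p)·n/(n−1). The sketch below uses that shortcut with the counts from the notebook outputs; the function name `bernoulli_sample_std` is a hypothetical helper introduced for illustration.

```python
import math


def bernoulli_sample_std(successes, n):
    """Sample standard deviation (n - 1 denominator) of n zero/one outcomes."""
    p = successes / n
    variance = p * (1 - p) * n / (n - 1)
    return math.sqrt(variance)


# Counts copied from the notebook outputs.
std_w = bernoulli_sample_std(235, 2435)   # white-sounding names
std_b = bernoulli_sample_std(157, 2435)   # black-sounding names
print(std_w, std_b)
```

The results should agree with `std_w` and `std_b` computed cell by cell above, up to floating-point rounding.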

# What are the null and alternate hypotheses? 

### Null hypothesis, H0: There is no racial discrimination; the probability of a callback for résumés with white-sounding names equals the probability of a callback for résumés with black-sounding names.
### Alternate hypothesis, H1: The probability of a callback for résumés with white-sounding names is not equal to the probability of a callback for résumés with black-sounding names.

# Compute margin of error, confidence interval, and p-value.

### Margin of error 
### Difference between the callback rates for white-sounding and black-sounding names

In [17]:
mean_diff=mean_w-mean_b
mean_diff

0.032032854209445585

### Standard error of the difference

In [19]:
var_diff=(var_w/len(data[data.race=='w']))+(var_b/len(data[data.race=='b']))
var_diff

6.0605747113410729e-05

In [21]:
std_diff=mt.sqrt(var_diff)
std_diff

0.007784969307159196
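For two independent samples, the variance of the difference in sample means is the sum of the two per-group variances of the mean, so the standard error is sqrt(s_w²/n_w + s_b²/n_b). The sketch below wraps this in a small function; `se_of_difference` is a hypothetical name, and the inputs are the standard deviations and group sizes printed earlier in the notebook.

```python
import math


def se_of_difference(std_1, n_1, std_2, n_2):
    """Unpooled standard error of the difference between two sample means."""
    return math.sqrt(std_1 ** 2 / n_1 + std_2 ** 2 / n_2)


# Values copied from the notebook outputs.
std_diff = se_of_difference(0.2953489980097223, 2435,
                            0.24565008364706123, 2435)
print(std_diff)
```

This is the unpooled form, which is the usual choice for a confidence interval because it does not assume the two callback rates are equal.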

### 95% Confidence interval 

In [30]:
import scipy.stats as st
st.norm.ppf(0.975)

1.959963984540054

In [33]:
interval=1.96*std_diff

#### Lower Limit of difference 

In [34]:
mean_diff-interval

0.016774314367413563

#### Upper Limit of difference 

In [35]:
mean_diff+interval

0.047291394051477607
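The margin of error is the critical z-value multiplied by the standard error, and the confidence interval is the estimate plus or minus that margin. The sketch below computes all three in one step from the values printed above. It uses the standard library's `statistics.NormalDist` (Python 3.8+) instead of `scipy.stats.norm` so that it runs on its own; `normal_ci` is a hypothetical helper name.

```python
from statistics import NormalDist

# Values copied from the notebook outputs.
mean_diff = 0.032032854209445585
std_diff = 0.007784969307159196


def normal_ci(estimate, std_err, confidence=0.95):
    """Return (lower, upper, margin) for a normal-approximation interval."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z_crit * std_err
    return estimate - margin, estimate + margin, margin


low, high, margin = normal_ci(mean_diff, std_diff)
print(f"margin of error: {margin:.4f}")
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

The limits differ from the cells above only in the later decimal places, because those cells round the critical value to 1.96.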

#### Assuming H0 is true, determine the probability of observing a difference at least as large as `mean_diff`

#### Distance of `mean_diff` from the mean under H0 (zero), measured in standard errors (the z-score)

In [37]:
dist=mean_diff/std_diff
dist

4.114705266723095

In [39]:
prob=st.norm.cdf(dist)*100
prob

99.998061627787692

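#### P-value (one-sided, expressed as a percentage)
#### For the two-sided alternative H1, double this value (about 0.0039%).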

In [41]:
pval=100-prob
pval

0.0019383722123080815
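Since H1 is two-sided, the p-value is usually reported as twice the upper-tail probability, and as a proportion rather than a percentage. The sketch below shows that calculation for the z-score printed above. It also computes a pooled-proportion z-score, a common variant for tests of H0: p_w = p_b; this variant is an addition for comparison and is not the method used in the cells above. The standard library's `statistics.NormalDist` stands in for `scipy.stats.norm`, and all variable names are illustrative.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# z-score copied from the notebook output (unpooled standard error).
z_unpooled = 4.114705266723095
p_one_sided = std_normal.cdf(-z_unpooled)   # upper-tail area, as a proportion
p_two_sided = 2 * p_one_sided

# Variant: pooled standard error, which assumes a common rate under H0.
calls_w, calls_b, n_w, n_b = 235, 157, 2435, 2435
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_pooled = (calls_w / n_w - calls_b / n_b) / se_pool
p_two_sided_pooled = 2 * std_normal.cdf(-z_pooled)

print(f"one-sided p: {p_one_sided:.2e}, two-sided p: {p_two_sided:.2e}")
print(f"pooled z: {z_pooled:.3f}, two-sided p: {p_two_sided_pooled:.2e}")
```

Either way the p-value is far below 0.05, so the choice between the pooled and unpooled standard error does not change the conclusion here.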

# Discuss statistical significance. 

### If the null hypothesis were true, the probability of observing a difference this large would be almost 0 (a one-sided p-value of about 0.002%), far below the conventional 5% significance level. We therefore reject the null hypothesis: in this sample, the callback rate for white-sounding names differs significantly from the callback rate for black-sounding names.

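### The 95% confidence interval for the difference in callback rates runs from about 1.7 to 4.7 percentage points and excludes zero, so we can say with confidence that résumés with white-sounding names receive more callbacks than identical résumés with black-sounding names.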