# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [58]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns

In [101]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [102]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [103]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [104]:
data.describe()

Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,...,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,...,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,...,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,...,0.308866,0.496083,0.259649,0.275854,0.171677,0.278932,0.410141,0.442847,0.361773,0.371308
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [105]:
print(data.columns)

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')


In [106]:
w = data[data.race=='w']
b = data[data.race=='b']

## 1.What test is appropriate for this problem? Does CLT apply?

Background: conditions for inference on a proportion.

1. Random: The data must come from a random sample or a randomized experiment.
2. Normal: The sampling distribution of $\hat{p}$ must be approximately normal, which requires at least 10 expected successes and 10 expected failures.
3. Independent: Individual observations must be independent. When sampling without replacement, the sample size should be no more than 10% of the population.

In [107]:
## Check the conditions for inference on a proportion:

In [108]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [109]:
w_c_rate=sum(data[data.race=='w'].call)/len(w)
print('White-sounding names that received a callback: ', w_c_rate)

p1=w_c_rate

White-sounding names that received a callback:  0.09650924024640657


In [110]:
b_c_rate=sum(data[data.race=='b'].call)/len(b)
print('Black-sounding names that received a callback:', b_c_rate)

p2=b_c_rate

Black-sounding names that received a callback: 0.06447638603696099


In [111]:
### Random:
## Based on the background information:

## ...Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names...

## Because names were assigned to résumés at random, the randomization condition is satisfied by the experimental design.


In [112]:
### Normality (at least 10 expected successes and failures) for both groups
if len(w)*w_c_rate>=10 and len(w)*(1-w_c_rate)>=10:
    print ('Normal tests passed for white group')
if len(b)*b_c_rate>=10 and len(b)*(1-b_c_rate)>=10:
    print ('Normal tests passed for Black group')   

Normal tests passed for white group
Normal tests passed for Black group


In [113]:
## 3 Independent test:
## The 2435 resumes in each group are far fewer than 10% of all resumes submitted in the United States,
## so the observations can be treated as independent.

### Answer to Q1: A two-sample z-test for the difference between two proportions is appropriate. The conditions above are met, so the CLT applies.

## 2.What are the null and alternate hypotheses?

### Answer to Q2:

H0: The proportions of white-sounding and black-sounding names receiving a callback are equal: w_c_rate (p1) = b_c_rate (p2), or p1 - p2 = 0.

Ha: The proportions of white-sounding and black-sounding names receiving a callback are not equal: w_c_rate (p1) != b_c_rate (p2), or p1 - p2 != 0.

Significance level (alpha): 0.05
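
For reference, the standard pooled two-proportion z statistic under H0 can be written as below. The symbols $x_1, x_2$ (callback counts) and $n_1, n_2$ (group sizes) are notation introduced here for illustration; they correspond to the white-sounding and black-sounding groups respectively.

$$
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

When $n_1 = n_2 = n$, the denominator reduces to $\sqrt{2\,\hat{p}\,(1-\hat{p})/n}$, which is the form used for `sigma_p1_p2` in the code further down.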

## 3.Compute margin of error, confidence interval, and p-value.

Try using both the bootstrapping and the frequentist statistical approaches.

In [114]:
# Compute the variance of each group's sample proportion and the standard error of their difference:

var_w=(p1*(1-p1))/len(w)
print(var_w)
var_b=(p2*(1-p2))/len(b)
print(var_b)
var_w_b=var_w+var_b
print(var_w_b)
sigma_w_b=np.sqrt(var_w_b)
print(sigma_w_b)


3.580911983304638e-05
2.4771737856498466e-05
6.058085768954485e-05
0.0077833705866767544


In [115]:
real_diff=p1-p2
print(real_diff)

0.032032854209445585


In [116]:
# Margin of error at 95% confidence: 1.96 x standard error of the difference
sigma_w_b_95=1.96*sigma_w_b
print(round(sigma_w_b_95, 6))

0.015255


In [117]:
# Confidence interval: 
confi_itl=[real_diff-sigma_w_b_95,real_diff+sigma_w_b_95]
print(confi_itl)

[0.016777447859559147, 0.047288260559332024]
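
The interval above can also be reproduced from the raw counts alone. The sketch below is a minimal, self-contained version, assuming 235 callbacks out of 2435 white-sounding résumés and 157 out of 2435 black-sounding résumés (the counts printed elsewhere in this notebook). The function name `diff_proportion_ci` is a hypothetical helper, not part of any library.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Unpooled (Wald) confidence interval for p1 - p2 from success counts."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference of two independent sample proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_crit * se
    diff = p1 - p2
    return diff, margin, (diff - margin, diff + margin)

# Counts taken from the notebook: white-sounding first, black-sounding second
diff, margin, (ci_low, ci_high) = diff_proportion_ci(235, 2435, 157, 2435)
print(f"difference = {diff:.4f}, margin of error = {margin:.4f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

The margin of error here is simply the half-width of the interval, i.e. the critical value times the standard error of the difference.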


In [118]:
p_total=(sum(data[data.race=='w'].call)+sum(data[data.race=='b'].call))/(len(w)+len(b))

In [119]:
p_total

0.08049281314168377

In [120]:
sigma_p1_p2=np.sqrt((2*p_total*(1-p_total)/(len(w))))

In [121]:
sigma_p1_p2

0.007796894036170457

In [122]:
#Z score
Z=(real_diff-0)/sigma_p1_p2
print(Z)

4.108412152434346


In [123]:
# Margin of error = critical value x standard error of the difference
# (the standard error already accounts for the sample sizes, so no further division by sqrt(n) is needed)
m_e=1.96*sigma_w_b
print(round(m_e, 6))

0.015255


In [124]:
# Two-sided p-value from the standard normal distribution
p_value = stats.norm.sf(abs(Z))*2
print(p_value)

3.983886837585077e-05


In [125]:
data_1=data[data.race=='w'].call
print(sum(data_1))

235.0


In [126]:
data_2=data[data.race=='b'].call
print(sum(data_2))

157.0


In [127]:
# Cross-check with the two-proportion z-test from statsmodels
n_success_b = int(np.sum(data[data.race=='b'].call))

n_success_w = int(np.sum(data[data.race=='w'].call))

from statsmodels.stats.proportion import proportions_ztest

stat, p_val = proportions_ztest([n_success_w, n_success_b], [len(w), len(b)])

print('Z-Score= ',stat, 'P-Value=',  p_val)

Z-Score=  4.108412152434346 P-Value= 3.983886837585077e-05
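
As a sanity check on the library result, the same pooled z-test can be sketched with the standard library only. This assumes the counts shown earlier (235/2435 and 157/2435); `two_proportion_ztest` is a hypothetical helper name chosen for this example.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for H0: p1 == p2, from success counts."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one proportion, estimated by pooling
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.4f}, p-value = {p_value:.3e}")
```

Using `math.erfc` avoids the loss of precision that `1 - cdf` would suffer this far out in the tail.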


#### Try using a resampling approach (permutation test)

In [27]:
def permutation_sample(data_1, data_2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data =np.concatenate((data_1,data_2))

    # Permute the concatenated array: permuted_data
    permuted_data =np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    # (both slices use len(data_1) so the split is correct even for unequal group sizes)
    perm_sample_1 = permuted_data[:len(data_1)]
    perm_sample_2 = permuted_data[len(data_1):]

    return perm_sample_1, perm_sample_2


def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 =permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1,perm_sample_2)

    return perm_replicates

def diff_of_propo(data_1, data_2):
    """Difference in proportions of two 0/1 arrays."""

    # The difference of proportions of data_1, data_2: diff
    diff = sum(data_1)/len(data_1)-sum(data_2)/len(data_2)

    return diff

# Compute the observed difference in callback proportions: empirical_diff_means
empirical_diff_means = sum(data_1)/len(data_1)-sum(data_2)/len(data_2)

# Draw 100,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(data_1,data_2,
                                 diff_of_propo, size=100000)

# Compute p-value: p
# (one-sided: fraction of replicates at least as large as the observed difference)
p = np.sum(perm_replicates>= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)

print(len(perm_replicates))

p-value = 2e-05
100000
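
The permutation test above yields a p-value but no interval. A bootstrap confidence interval for the difference in callback rates could be sketched as follows. This is an illustration under stated assumptions: it uses the counts 235/2435 and 157/2435, a fixed seed of 42 and 10,000 replicates (both arbitrary choices), and it draws binomial counts instead of resampling the 0/1 arrays, which gives the same distribution for a proportion.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

n_w, x_w = 2435, 235   # white-sounding: group size, callbacks
n_b, x_b = 2435, 157   # black-sounding: group size, callbacks
n_reps = 10_000

# Resampling n zeros/ones with replacement from a sample containing x ones
# gives a success count distributed as Binomial(n, x/n), so draw that directly.
boot_w = rng.binomial(n_w, x_w / n_w, size=n_reps) / n_w
boot_b = rng.binomial(n_b, x_b / n_b, size=n_reps) / n_b
boot_diff = boot_w - boot_b

# Percentile bootstrap 95% confidence interval and its half-width
boot_ci_low, boot_ci_high = np.percentile(boot_diff, [2.5, 97.5])
boot_margin = (boot_ci_high - boot_ci_low) / 2
print("bootstrap 95% CI:", boot_ci_low, boot_ci_high)
print("bootstrap margin of error:", boot_margin)
```

If the normal approximation is reasonable, this interval should land close to the frequentist one computed earlier.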


In [28]:
alpha = 0.05
if p < alpha:
    print("H0 (equal callback proportions for white- and black-sounding names) is rejected")
else:
    print("H0 (equal callback proportions for white- and black-sounding names) cannot be rejected")

H0 (equal callback proportions for white- and black-sounding names) is rejected


## 4.Write a story describing the statistical significance in the context of the original problem.

### Answer to Q4: Both the permutation test and the parametric z-test reject the null hypothesis that white-sounding and black-sounding names receive callbacks at the same rate. The 95% confidence interval for the difference puts the callback rate for résumés with white-sounding names 1.7 to 4.7 percentage points above the rate for otherwise identical résumés with black-sounding names.
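
A difference in percentage points is easy to misread as a relative change, so it can help to report both. The short sketch below computes the ratio of the two callback rates from the counts used in this notebook (235/2435 and 157/2435); the variable names are illustrative.

```python
x_w, n_w = 235, 2435   # white-sounding: callbacks, résumés
x_b, n_b = 157, 2435   # black-sounding: callbacks, résumés

rate_w = x_w / n_w
rate_b = x_b / n_b

# Absolute gap in percentage points versus relative ratio of the two rates
gap_points = 100 * (rate_w - rate_b)
ratio = rate_w / rate_b

print(f"absolute gap: {gap_points:.1f} percentage points")
print(f"rate ratio (white / black): {ratio:.2f}")
```

Because the group sizes are equal, the ratio is just 235/157, so white-sounding names received roughly one and a half times as many callbacks.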

## 5.Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?


In [83]:
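# Encode race as 1 (white-sounding) / 0 (black-sounding) so it is included in the correlation matrix.
# Note: replace() modifies `data` in place, so earlier cells that filter on 'w'/'b' need a fresh load before re-running.
data.replace(to_replace='w', value='1', inplace=True) 
data.replace(to_replace='b', value='0', inplace=True) 
# Round-trip through CSV so the string codes '1'/'0' are read back as integers
data.to_csv('data_job.csv')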

In [86]:
df_csv=pd.read_csv('data_job.csv')

In [88]:
df_csv.columns

Index(['Unnamed: 0', 'id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors',
       'volunteer', 'military', 'empholes', 'occupspecific', 'occupbroad',
       'workinschool', 'email', 'computerskills', 'specialskills', 'firstname',
       'sex', 'race', 'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack',
       'fracwhite', 'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col',
       'expminreq', 'schoolreq', 'eoe', 'parent_sales', 'parent_emp',
       'branch_sales', 'branch_emp', 'fed', 'fracblack_empzip',
       'fracwhite_empzip', 'lmedhhinc_empzip', 'fracdropout_empzip',
       'fraccolp_empzip', 'linc_empzip', 'manager', 'supervisor', 'secretary',
       'offsupport', 'salesrep', 'retailsales', 'req', 'expreq', 'comreq',
       'educreq', 'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal',
       'trade', 'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [93]:
df_csv.corr().call.sort_values()

fracdropout          -0.056671
lmedhhinc_empzip     -0.049879
req                  -0.041699
educreq              -0.033864
orgreq               -0.033416
fracwhite_empzip     -0.032989
branch_sales         -0.029126
computerskills       -0.028813
manuf                -0.028785
workinschool         -0.027888
branch_emp           -0.026909
l                    -0.025835
compreq              -0.024907
fracblack            -0.022130
trade                -0.021853
salesrep             -0.021584
military             -0.020577
manager              -0.020269
expreq               -0.019250
supervisor           -0.012061
bankreal             -0.008996
col                  -0.008479
missind              -0.007555
education            -0.005748
ofjobs                0.002311
retailsales           0.002336
comreq                0.002421
eoe                   0.003092
secretary             0.004038
busservice            0.006882
linc_empzip           0.006883
volunteer             0.007197
parent_s

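### Answer to Q5: No, my analysis does not mean that race/name is the most important factor in callback success; it only shows that race has a statistically significant effect. Based on the correlation analysis above, 'specialskills', 'honors', 'empholes', 'adid' and 'yearsexp' are more strongly correlated with callbacks than 'race'. To amend the analysis, the effect of race should be compared with these factors in a single model that includes all of them.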
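
One possible way to compare several factors at once is a logistic regression of `call` on race and the other résumé attributes. The sketch below only illustrates the mechanics: it runs on synthetic data with made-up coefficients (they are not estimates from the study), uses a single extra covariate standing in for `honors`, and fits the model with a small hand-written Newton-Raphson routine (`fit_logistic`, a hypothetical helper) so that it depends on NumPy alone.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian (X' W X)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic data: coefficients below are arbitrary and for illustration only
rng = np.random.default_rng(0)
n = 5000
race_w = rng.integers(0, 2, n)    # 1 = white-sounding, 0 = black-sounding
honors = rng.integers(0, 2, n)    # stand-in for one other résumé attribute
true_beta = np.array([-2.7, 0.45, 0.8])   # intercept, race, honors (made up)

X = np.column_stack([np.ones(n), race_w, honors])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_hat = fit_logistic(X, y)
print("estimated coefficients (intercept, race, honors):", beta_hat)
```

On the real data, the fitted coefficients (ideally with standard errors, which a library such as statsmodels reports) would show how the race effect compares with the other attributes once they are all accounted for together.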
