Examining Racial Discrimination in the US Job Market

Background
Racial discrimination continues to be pervasive in cultures throughout the world.
Researchers examined the level of racial discrimination in the United States labor market
by randomly assigning identical résumés to black-sounding or white-sounding names and observing
the impact on requests for interviews from employers.

Data
In the dataset provided, each row represents a resume. The 'race' column has two values,
'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values,
1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.


Exercises

You will perform a statistical analysis to establish whether race has a significant
impact on the rate of callbacks for resumes.

Answer the following questions in this notebook below and submit to your Github account. 
What test is appropriate for this problem? Does CLT apply?
What are the null and alternate hypotheses?
Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and
the frequentist statistical approaches.

Write a story describing the statistical significance in the context or the original problem.

Does your analysis mean that race/name is the most important factor in callback success? Why or why not?
If not, how would you amend your analysis?


In [1]:
# import libraries
import pandas as pd
import numpy as np
from scipy import stats


In [2]:
# Read data file
data = pd.io.stata.read_stata('C:/Users/rivas/OneDrive/Documents/JMR/Education/Springboard/Projects/EDA_racial_discrimination/data/us_job_market_discrimination.dta')

# EDA
print(data.columns)
print(data.shape)
print(data.call.describe())


Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')
(4870, 65)
count    4870.000000
mean        0.080493
std         0.272079
min         0.000000
25%         0.

In [8]:
# define 2 functions

##############################################
# Define a function stats_computation
def stats_computation(w, b):
    """ stats_computatio
    Parameters
    ----------
    w : np array
    b : np array
    -------
    a two-sample z-test
    
    null and alternate hypotheses:
           H0: W = B  (call back rate)
           H1: W ≠ B  (call back rate)
 
    -------
    returns z, pval, moe, ci
    """
# start computation process
    n_w = len(w)
    n_b = len(b)
    
    sum_w = np.sum(w)
    sum_b = np.sum(b)
    
    prop_w = sum_w / n_w
    prop_b = sum_b / n_b
    
    prop_diff = prop_w - prop_b
    phat = (prop_w + prop_b) /(n_w + n_b)

# Zvalue and Pvalue
    z = prop_diff / np.sqrt(phat * (1 - phat) * ((1 / n_w) + (1 / n_b)))
    pval = stats.norm.cdf(-z) * 2
    print("Z score: {}".format(z))
    print("P-value: {}".format(pval))
# MOE
    moe = 1.96 * np.sqrt(phat * (1 - phat) * ((1 / n_w) + (1 / n_b)))
    ci = prop_diff + np.array([-1, 1]) * moe
    print("Margin of Error: {}".format(moe))
    print("Confidence interval: {}".format(ci))
#

    return z, pval, moe, ci


In [9]:
# Define a function bootstrap_resample

def bootstrap_resample(X, n=None):
    """ Bootstrap resample an array_like
    Parameters
    ----------
    X : array_like
      data to resample
    n : int, optional
      length of resampled array, equal to len(X) if n==None
    Results
    -------
    returns X_resamples
    """
    if n == None:
        n = len(X)
        
    resample_i = np.floor(np.random.rand(n)*len(X)).astype(int)
    X_resample = X[resample_i]
    return X_resample

##############################################


1 - What test is appropriate for this problem:
       Answer: a two-sample z-test

       
    Does CLT apply?
       Answer: Yes,  because we assume that the samples are representative of the population. 
       The dataset size, 4870, is sufficient sample size and by randomly assigning identical 
       résumés to black-sounding or white-sounding names assumed to be independent

    What are the null and alternate hypotheses?
       Answer:  null and alternate hypotheses:
           H0: W = B  (call back rate)
           H1: W ≠ B  (call back rate)
           

In [10]:
# Split the dataset by race
np_w = data[data.race=='w'].call.values
np_b = data[data.race=='b'].call.values
print(len(np_w), len(np_b))


2435 2435


In [11]:
# frequentist approach

# start computation process
f_z, f_pval, f_moe, f_ci = stats_computation(np_w, np_b)


Z score: 194.40535795127548
P-value: 0.0
Margin of Error: 0.0003229560898534969
Confidence interval: [ 0.0317099   0.03235581]


In [12]:
# Bootstrap  approach
# split dataset into numpy arrays
np_w = data[data.race=='w'].call.values
np_b = data[data.race=='b'].call.values

# Call boostrap function
bs_w = bootstrap_resample(np_w, n=1000)
bs_b = bootstrap_resample(np_b, n=1000)
print(len(bs_w), len(bs_b))


# start computation process
bs_z, bs_pval, bs_moe, bs_ci = stats_computation(bs_w, bs_b)


1000 1000
Z score: 123.89968282638898
P-value: 0.0
Margin of Error: 0.0008384202253815207
Confidence interval: [ 0.05216158  0.05383842]


The low value of the p-value indicates  to reject the H0 hypothesis.
It means that there are differences in the call back rate based on race


Does your analysis mean that race/name is the most important factor in
callback success? Why or why not? If not, how would you amend your analysis?

Answer:
The callback rate differs based on race.  But other factors can also be factors
in the callback rate that were not measured i.e. geography, sex, age,...

To amend the analysis would look at other factors as mentioned above.