# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


# Question 1

Because each resume either receives a callback or does not, and we are comparing callback rates between two groups, a two-sample z-test for proportions is appropriate. The CLT applies to the sampling distribution of each sample proportion, provided the observations are independent and each sample is large enough. A common rule of thumb is n * p >= 10 and n * (1-p) >= 10, where n is the number of observations and p is the proportion of successes; when both hold, the sample proportion is approximately normally distributed.

In [5]:
w = data[data.race=='w']
b = data[data.race=='b']
w.info()
b.info()
sum(b.call)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2435 entries, 0 to 4869
Data columns (total 65 columns):
id                    2435 non-null object
ad                    2435 non-null object
education             2435 non-null int8
ofjobs                2435 non-null int8
yearsexp              2435 non-null int8
honors                2435 non-null int8
volunteer             2435 non-null int8
military              2435 non-null int8
empholes              2435 non-null int8
occupspecific         2435 non-null int16
occupbroad            2435 non-null int8
workinschool          2435 non-null int8
email                 2435 non-null int8
computerskills        2435 non-null int8
specialskills         2435 non-null int8
firstname             2435 non-null object
sex                   2435 non-null object
race                  2435 non-null object
h                     2435 non-null float32
l                     2435 non-null float32
call                  2435 non-null float32
city        

157.0

Both groups have 2435 entries. For the white group, the proportion of callbacks is 235/2435, so n * p = 235 and n * (1-p) = 2200. For the black group, the proportion of callbacks is 157/2435, so n * p = 157 and n * (1-p) = 2278. All four counts are well above 10, so the sampling distribution of each sample proportion is approximately normal. We will also assume that callbacks are independent of each other (calling one person back does not affect whether or not someone else will be called back). Thus, the CLT applies in this scenario.
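The rule-of-thumb check described above can be written as a small helper. This is a minimal sketch: the function name `normal_approx_ok` and the `threshold` parameter are illustrative, and the counts are the ones reported in this notebook.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Return True if both n*p and n*(1-p) reach the rule-of-thumb threshold."""
    failures = n - successes  # equals n * (1 - p)
    return successes >= threshold and failures >= threshold

# Callback counts reported above: 235 of 2435 (white), 157 of 2435 (black)
for label, successes, n in [("white", 235, 2435), ("black", 157, 2435)]:
    print(label, successes, n - successes, normal_approx_ok(successes, n))
```

Note that n * p is simply the number of callbacks and n * (1-p) the number of resumes without a callback, so the check only needs the two counts.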

# Question 2 and 3 

- Null hypothesis: The proportion of callbacks is the same for black-sounding and white-sounding names.
- Alternate hypothesis: The proportion of callbacks differs between black-sounding and white-sounding names.

We will use alpha = 0.05. Since this is a two-sided test, we will reject the null hypothesis if the two-sided p-value is below 0.05, which is equivalent to the z-score falling in either 2.5% tail of the standard normal distribution.

In [6]:
# Let's use the frequentist approach first to test our hypothesis.
p_white = sum(w.call)/len(w.call)
p_black = sum(b.call)/len(b.call)
difference = p_white - p_black
var_white = (p_white * (1-p_white))/len(w)
var_black = (p_black * (1-p_black))/len(b)
standard_error = np.sqrt(var_white + var_black)
z_score = difference/standard_error
# two-sided p-value: probability of a z-score at least this extreme in either tail
p_value = 2 * stats.norm.sf(abs(z_score))
print("The p-value is %.3g" % p_value)

The p-value is 3.86e-05


Thus, we reject the null hypothesis.
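As a cross-check, the same test can be run from the raw counts alone. The sketch below is not part of the original analysis: the function name `two_proportion_ztest` is illustrative, it uses the unpooled standard error as in the cell above, and it reports the conventional two-sided p-value, that is, the probability of a z-score at least as extreme in either direction.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Callback counts from this notebook: 235 of 2435 (white), 157 of 2435 (black)
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print("z = %.3f, two-sided p-value = %.3g" % (z, p_value))
```

A pooled standard error is another common choice for the test statistic; with equal group sizes like these, the two versions give very similar z-scores.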

Now, let's create a confidence interval for the difference between callback proportions for white-sounding names and black-sounding names.

In [7]:
conf_int = stats.norm.interval(0.95,loc=difference,scale=standard_error)
print(conf_int)

(0.016777728181230755, 0.047287980237660412)
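The exercise also asks for the margin of error, which is half the width of this interval: the critical z-value times the standard error. A minimal sketch using only the standard library follows; the function name `diff_proportion_ci` is illustrative and the counts are those reported in this notebook.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Return (difference, margin of error, confidence interval) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Critical value, about 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * se
    diff = p1 - p2
    return diff, margin, (diff - margin, diff + margin)

diff, margin, ci = diff_proportion_ci(235, 2435, 157, 2435)
print("difference = %.4f, margin of error = %.4f" % (diff, margin))
print("95%% CI = (%.4f, %.4f)" % ci)
```

The interval excludes zero, which agrees with rejecting the null hypothesis at alpha = 0.05.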


We will now use the bootstrap approach to test the hypotheses and create a confidence interval.

In [8]:
# defining the helper functions needed to draw bootstrap replicates
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)

    return bs_replicates

In [9]:
# First we need the overall proportion of callbacks.
# Because call is coded 0/1, the proportion of callbacks is the same as
# the mean of the call column, so we do a two-sample bootstrap test on the mean.

p_total = np.mean(data.call)

# shift both call columns so that each has mean p_total (the null hypothesis)
calls_white_shifted = w.call - p_white + p_total
calls_black_shifted = b.call - p_black + p_total

bs_replicates = draw_bs_reps(calls_white_shifted,np.mean,10000) - draw_bs_reps(calls_black_shifted,np.mean,10000)
empirical_diff_p = p_white - p_black

# Two-sided p-value: fraction of null replicates at least as extreme as the
# observed difference. np.mean avoids integer division, which under Python 2
# makes np.sum(...) / len(...) evaluate to 0 for any count below 10000.
p = np.mean(np.abs(bs_replicates) >= abs(empirical_diff_p))
print("p-value is %g" % p)

p-value is 0


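The bootstrap p-value is far below alpha = 0.05, so we again reject the null hypothesis. A reported value of 0 does not mean the true p-value is exactly zero; it only means the p-value is too small to resolve with 10,000 replicates. Next, let's look at the central 95% of the bootstrap replicates. Because the replicates were generated from the shifted data, this range describes the differences we would expect if the null hypothesis were true; it is not a confidence interval for the observed difference.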

In [10]:
# central 95% of the replicates drawn under the null hypothesis (shifted data)
null_interval = np.percentile(bs_replicates,[2.5,97.5])
print(null_interval)

[-0.01519509  0.01478437]
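The replicates above were drawn from the shifted data, so their percentiles are centered near zero. A bootstrap confidence interval for the difference itself needs replicates drawn from the unshifted callbacks. The sketch below is one way to do that, not the notebook's original method: it rebuilds the 0/1 callback arrays from the counts reported earlier, uses the standard library's `random` module instead of the NumPy helpers, and draws 2,000 replicates to keep the run time short. The function name `bootstrap_diff_ci` and the fixed seed are illustrative choices.

```python
import random

def bootstrap_diff_ci(x1, n1, x2, n2, n_reps=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for p1 - p2, resampling each group separately."""
    rng = random.Random(seed)
    # Rebuild the 0/1 outcomes from the counts (order does not matter)
    group1 = [1] * x1 + [0] * (n1 - x1)
    group2 = [1] * x2 + [0] * (n2 - x2)
    diffs = []
    for _ in range(n_reps):
        # Resample each group with replacement and take the difference in means
        m1 = sum(rng.choices(group1, k=n1)) / n1
        m2 = sum(rng.choices(group2, k=n2)) / n2
        diffs.append(m1 - m2)
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[int(tail * n_reps)]
    upper = diffs[int((1 - tail) * n_reps) - 1]
    return lower, upper

lower, upper = bootstrap_diff_ci(235, 2435, 157, 2435)
print("95%% bootstrap CI: (%.4f, %.4f)" % (lower, upper))
```

Because the resampling is random, the endpoints vary slightly with the seed and the number of replicates, but they should land close to the normal-approximation interval computed earlier.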


# Question 4 and 5

Since we rejected the null hypothesis with both approaches, we conclude that there is a statistically significant difference in callback rates: resumes with white-sounding names received callbacks about 9.7% of the time (235 of 2435), compared with about 6.4% (157 of 2435) for resumes with black-sounding names. The 95% confidence interval for the difference runs from roughly 1.7 to 4.7 percentage points. There are some sources of variability to keep in mind. For one, whether a name sounds white or black is subjective. Moreover, callback rates could depend on the industry or field of the employer. Still, because the resumes were otherwise identical and the names were assigned at random, a gap of this size is a cause for concern and is certainly worth investigating further.

Our analysis does not show that race is the most important factor in callback success. The test only establishes that callback rates differ between the two groups; it says nothing about how this effect compares with that of other factors, such as education or years of experience. To assess those factors, we would need additional analyses and hypothesis tests. One option is to split the dataset into two groups by another factor, just as we did for race, and compare the callback rates in the same way.