# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [40]:
import pandas as pd
from matplotlib import pyplot
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from statsmodels.stats.weightstats import ztest
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt
import numpy as np 

In [6]:
# Load the Stata file through the public pandas reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [12]:
#total number of rows
entries = len(data)
print(entries)

4870


In [7]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [18]:
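# Callbacks for white-sounding names as a share of ALL resumes.
# Note: this divides by the total number of resumes, not by the number of 'w' resumes,
# so it is not the within-group callback rate.
callBackrate_w = sum(data[data.race=='w'].call)/entries
print(callBackrate_w)
print(np.around(callBackrate_w,2)*100, '% of all resumes were w and received a callback')

0.048254620123203286
5.0 % of all resumes were w and received a callback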


In [8]:
sum(data[data.race=='b'].call)

157.0

In [17]:
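# Callbacks for black-sounding names as a share of ALL resumes.
# Note: this divides by the total number of resumes, not by the number of 'b' resumes,
# so it is not the within-group callback rate.
callBackrate_b = sum(data[data.race=='b'].call)/entries
print(callBackrate_b)
print(np.around(callBackrate_b,2)*100, '% of all resumes were b and received a callback')

0.03223819301848049
3.0 % of all resumes were b and received a callback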


In [33]:
# Difference between the two shares computed above (w minus b)
callBackDifference = callBackrate_w - callBackrate_b

In [21]:
# Overall callback rate and total number of callbacks
total_callbacks_rate = sum(data.call)/entries
total_callbacks = sum(data.call)
print(np.around(total_callbacks_rate,2)*100, '% of all resumes received a callback')

# Share of all callbacks that went to black-sounding names
b_outoftotal = sum(data[data.race=='b'].call)/total_callbacks
print(np.around(b_outoftotal,2)*100, '% of the total call backs for blacks')

# Share of all callbacks that went to white-sounding names
w_outoftotal = sum(data[data.race=='w'].call)/total_callbacks
print(np.around(w_outoftotal,2)*100, '% of the total call backs for whites')

8.0 % of all resumes received a callback
40.0 % of the total call backs for blacks
60.0 % of the total call backs for whites


From this analysis, it seems that resumes with white-sounding names get more callbacks than resumes with black-sounding names.
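
The shares above divide by all 4870 resumes. A within-group callback rate divides by the number of resumes in each group instead. The following is a minimal sketch that reuses the callback counts printed above and assumes an even split of 2435 resumes per group (an assumption, though it is consistent with the empirical difference of about 0.032 that the notebook computes from the data).

```python
# Callback counts printed above; the per-group size is an assumption
# (an even split of the 4870 resumes).
calls_w, calls_b = 235, 157
n_per_group = 4870 // 2

rate_w = calls_w / n_per_group  # within-group callback rate, white-sounding names
rate_b = calls_b / n_per_group  # within-group callback rate, black-sounding names

print(f"w callback rate: {rate_w:.4f}")
print(f"b callback rate: {rate_b:.4f}")
print(f"difference:      {rate_w - rate_b:.4f}")
print(f"ratio w/b:       {rate_w / rate_b:.2f}")
```

The difference of these two rates is the quantity that the hypothesis test further down is about.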

In [9]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

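The ideal test for this problem is a two-sample t-test comparing the callback rate for black-sounding names with the callback rate for white-sounding names. Because `call` is a 0/1 outcome, each group mean is a proportion, so a two-proportion z-test is an equally natural choice and gives nearly the same result at this sample size.

The Central Limit Theorem (CLT) applies in this case: names were assigned to resumes at random, so the observations are independent, and each group is far larger than the common rule-of-thumb minimum of 30 observations. For a binary outcome the more relevant check is that each group contains at least 10 callbacks and 10 non-callbacks, which holds here (235 and 157 callbacks).

A t-test rather than a z-test is used because the population variances are unknown and must be estimated from the samples.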
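
The CLT condition for a 0/1 outcome can be checked directly from the counts. This is a small sketch of the usual success-failure rule of thumb; the helper `clt_ok`, its threshold of 10, and the even 2435/2435 group split are assumptions made for illustration.

```python
# (callbacks, resumes) per group. Callback counts come from the notebook;
# the even 2435/2435 split is an assumption.
groups = {"w": (235, 2435), "b": (157, 2435)}

def clt_ok(successes, n, threshold=10):
    """Success-failure rule of thumb: enough successes AND enough failures."""
    return successes >= threshold and (n - successes) >= threshold

for name, (calls, n) in groups.items():
    print(name, "callbacks:", calls, "non-callbacks:", n - calls,
          "CLT ok:", clt_ok(calls, n))
```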

<b>Null Hypothesis: There is no difference in the call back rate of both groups (b and w).</b>

<b>Alternate Hypothesis: There is a difference in the call back rate of the two groups (b and w). </b>
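
In symbols, writing $p_w$ and $p_b$ for the true callback probabilities of resumes with white-sounding and black-sounding names (notation introduced here for convenience), the two-sided hypotheses can be stated as:

$$
H_0:\; p_w - p_b = 0
\qquad\text{versus}\qquad
H_A:\; p_w - p_b \neq 0
$$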

In [29]:
w = data[data.race=='w'].call
b = data[data.race=='b'].call

In [31]:
# Compute callback rate for all resumes: cbr_mean_all
cbr_mean_all = np.mean(data.call)
print('Mean callback rate for all resumes: ', cbr_mean_all)
# Compute Empirical Difference in Mean Callback Rates: empirical_diff_means
empirical_diff_means = np.mean(w) - np.mean(b)
print('Empirical difference between callback rates: ', empirical_diff_means)

Mean callback rate for all resumes:  0.08049282
Empirical difference between callback rates:  0.032032855


In [37]:
# Your solution to Q3 here
#Bootstrap Approach
callBackRate = np.mean(data.call)

# Generated shifted arrays
#w_new = w - np.mean(w) + callBackRate
#b_new = b - np.mean(b) + callBackRate


# Compute 10000 bootstrap replicates
size = 10000
bs_replicates_w = np.empty(size)
bs_replicates_b = np.empty(size)

for i in range(size):
    bs_replicates_w[i] = np.mean(np.random.choice(w,len(w)))
    bs_replicates_b[i] = np.mean(np.random.choice(b,len(b)))
    
# Compute difference of means: bs_replicates
bs_replicates = bs_replicates_w - bs_replicates_b
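# Compute and print a two-sided p-value.
# Testing float equality (bs_replicates == some value) almost never matches and says
# nothing about the null. Instead, centre the replicates on zero to approximate the
# distribution of the difference under the null hypothesis, then count how often a
# difference at least as large as the observed one occurs.
bs_null = bs_replicates - empirical_diff_means
p = np.mean(np.abs(bs_null) >= abs(empirical_diff_means))
print('p-value, bootstrap approach:', p)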

conf_intrv = np.percentile(bs_replicates, [2.5, 97.5])
print('95% confidence interval is from ', conf_intrv[0], ' to ', conf_intrv[1])

margin_error = 1.96 * bs_replicates.std()
print('Margin of error is: ', margin_error)

p-value, bootstrap approach: 0.0
95% confidence interval is from  0.01642710715532303  to  0.047227926552295685
Margin of error is:  0.01531849896805089


The p-value obtained is below 0.05, so the null hypothesis can be rejected.
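
A permutation test is a common resampling alternative for this kind of question: under the null hypothesis the race labels are interchangeable, so shuffling the outcomes across groups simulates the null directly. The sketch below is not the notebook's method. It uses only the standard library, rebuilds the 0/1 outcomes from the counts printed earlier, and assumes an even 2435/2435 split, a fixed seed, and 500 permutations (all illustrative choices).

```python
import random

# Rebuild 0/1 callback outcomes from the counts printed earlier in the notebook.
# The even 2435/2435 split is an assumption.
n_w = n_b = 2435
calls_w, calls_b = 235, 157
total_calls = calls_w + calls_b
outcomes = [1] * total_calls + [0] * (n_w + n_b - total_calls)
observed_diff = calls_w / n_w - calls_b / n_b

rng = random.Random(42)  # fixed seed so the run is repeatable
n_perm = 500
extreme = 0
for _ in range(n_perm):
    rng.shuffle(outcomes)
    # The first n_w shuffled outcomes play the role of the 'w' group.
    perm_calls_w = sum(outcomes[:n_w])
    perm_calls_b = total_calls - perm_calls_w
    perm_diff = perm_calls_w / n_w - perm_calls_b / n_b
    if abs(perm_diff) >= abs(observed_diff):
        extreme += 1

p_value = extreme / n_perm
print("observed difference:", round(observed_diff, 4))
print("two-sided permutation p-value:", p_value)
```

With only a few hundred permutations the smallest nonzero p-value that can be reported is 1/n_perm, so a printed value of 0 should be read as "smaller than 1/n_perm", not as exactly zero.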

In [41]:
#Frequentist approach

t, p = ttest_ind(w, b)
print('p-value, frequentist approach:', p)

cm = sms.CompareMeans(sms.DescrStatsW(w), sms.DescrStatsW(b))

conf_int = cm.tconfint_diff(alpha=0.05, usevar='unequal')
print('The 95% confidence interval is from ', conf_int[0], ' to ', conf_int[1])

margin_of_error = conf_int[1] - empirical_diff_means
print('The margin of error is ', margin_of_error)

p-value, frequentist approach: 3.940802103128885e-05
The 95% confidence interval is from  0.016770673983991798  to  0.04729503443489937
The margin of error is  0.015262179579838796
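
Because the outcome is binary, the frequentist result can be cross-checked with a two-proportion z-test computed by hand. This is a sketch under stated assumptions, not the notebook's own calculation: it uses the callback counts printed earlier, assumes 2435 resumes per group, uses the pooled standard error for the test statistic, and the unpooled standard error for the confidence interval.

```python
from math import sqrt
from statistics import NormalDist

n_w = n_b = 2435  # assumed even split of the 4870 resumes
calls_w, calls_b = 235, 157
p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# Test statistic: pooled proportion, since the null says both rates are equal.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Confidence interval: unpooled standard error, 95% level.
se_unpooled = sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * se_unpooled

print("z statistic:", round(z, 3))
print("p-value:", p_value)
print("margin of error:", round(margin, 4))
print("95% CI:", (round(diff - margin, 4), round(diff + margin, 4)))
```

At this sample size the z-based interval and p-value should land very close to the t-based values printed above, which is why the choice between the two tests matters little here.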


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Write a story describing the statistical significance in the context of the original problem.

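Both the bootstrap method and the frequentist approach reject the null hypothesis that callback rates are equal for white-sounding and black-sounding names.
We therefore accept the alternate hypothesis that the callback rates are not equal: the observed difference of about 3.2 percentage points in favor of white-sounding names (95% confidence interval roughly 1.7 to 4.7 percentage points) is very unlikely to be due to chance.

This suggests that race, as signaled by a white-sounding or black-sounding name, is an important factor in callback success.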



Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

From this analysis, we cannot conclude that race is the most important factor. The test shows only that race has a statistically significant effect on callbacks; it does not measure how that effect compares with the effects of other factors. The dataset has many other columns whose importance would need to be analyzed and compared with that of race. For example, gender and years of experience could be important factors, possibly more important than race.
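
One simple way to start comparing factors is to compute the same callback-rate gap for each of them. The sketch below uses a handful of invented rows purely for illustration: `honors` is one of the dataset's columns, but the values here are made up, and the helper `callback_gap` is hypothetical. A fuller analysis would more likely fit a regression model that includes all factors at once.

```python
# Toy rows invented for illustration only: (race, honors, call).
rows = [
    ("w", 1, 1), ("w", 0, 0), ("w", 1, 1), ("w", 0, 1),
    ("b", 1, 1), ("b", 0, 0), ("b", 1, 0), ("b", 0, 0),
]

def callback_gap(rows, column, value_a, value_b):
    """Difference in callback rate between two values of one column."""
    def rate(value):
        calls = [r[2] for r in rows if r[column] == value]
        return sum(calls) / len(calls)
    return rate(value_a) - rate(value_b)

gap_race = callback_gap(rows, 0, "w", "b")    # white-sounding minus black-sounding
gap_honors = callback_gap(rows, 1, 1, 0)      # honors minus no honors

print("callback gap by race:  ", gap_race)
print("callback gap by honors:", gap_honors)
```

Comparing such gaps (with their confidence intervals) across columns gives a first indication of which factors matter most, although it ignores interactions between factors.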