# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context or the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF 

In [3]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [4]:
# number of callbacks for black-sounding names
per_b_call=(sum(data[data.race=='b'].call)/data[data.race=='b'].race.size)*100
# number of callbacks for white-sounding names
per_w_call=(sum(data[data.race=='w'].call)/data[data.race=='w'].race.size)*100
print('Percentage of call backs for black people is {:.3} %'.format(per_b_call)+ \
      ' percentage of call backs for white people is {:.3} %'.format(per_w_call))

P1=per_b_call/100
var_b=(P1*(1-P1)/data[data.race=='b'].race.size)
P2=per_w_call/100
var_w=(P2*(1-P2)/data[data.race=='w'].race.size)

print('Variance in percentage of call backs for black people is {:.3} '.format(var_b)+ \
      ' variance in percentage of call backs for white people is {:.3} '.format(var_w))

## Sampling Distribution P1-P2 variance
var_b_w= var_b + var_w
std_b_w=np.sqrt(var_b_w)

print('Variance in sample distribution is {:.3} '.format(var_b_w)+ \
      ' and standard deviation in sample distribution is {:.3} '.format(std_b_w))

Percentage of call backs for black people is 6.45 % percentage of call backs for white people is 9.65 %
Variance in percentage of call backs for black people is 2.48e-05  variance in percentage of call backs for white people is 3.58e-05 
Variance in sample distribution is 6.06e-05  and standard deviation in sample distribution is 0.00778 


In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

### Q1. What test is appropriate for this problem? Does CLT apply?
This is a binary response type of problem (1,0) with discrete variable and a Bernoulli distribution or binomial distribution. However, testing the difference between the "percentage called back" for each race will follow a normal distribution  and hence CLT can be applied. The requirements for CLT is that the samples must be randomly drawn, and be representative of the population. Conversely since sample size **n** is > 30, z-statistics is more appropriate that t-statistics.

A two-sample z-test can be used to test the difference between two population proportions
p_1 and p_2 when a sample is randomly selected from each population. The test statistic
is
<br>
z = $\frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\bar{p}\bar{q}(\frac{1}{n_1}+\frac{1}{n_2})}}$

Where:

* $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$

* $\bar{q} = 1 - \bar{p}$

* $\hat p_1 = \frac{x_1}{n_1}$

* $\hat p_2 = \frac{x_2}{n_2}$

* $x_1$, $x_2$ : number of success in each sample.

* $p_1$, $p_2$:Population proportions.

* $\hat p_1$, $\hat p_2$  :Sample proportions of successes.

### Q2. What are the null and alternate hypotheses?
Ho: There is no difference between black and white resumes.
<br>
H1: There is difference between black and white resumes.
<br>
Alternatively,
<br>
Ho: There is no significant difference between "percentage called back" for black and white resumes.
<br>
H1:There is significant difference between "percentage called back" for black and white resumes.
<br>
Since the sample size is > 30 z-statistic is appropriate.

In [6]:
# number of callbacks for black-sounding names
tot_b_called=sum(data[data.race=='b'].call)
tot_b_called

157.0

In [7]:
w = data[data.race=='w']
b = data[data.race=='b']

### Q3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

#### Bootstapping approach of confidence interval

In [8]:
def bootstrap_replicate_1d(data, func):
    return func(np.random.choice(data, size=len(data)))
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates_b_called = draw_bs_reps(data[data.race=='b'].call, np.mean, size=10000)

# Compute and print SEM
sem_b = np.std(data[data.race=='b'].call) / np.sqrt(len(data[data.race=='b'].call))
# Compute and print standard deviation of bootstrap replicates
bs_std_b = np.std(bs_replicates_b_called)
print('For black call backs the sem is: {:.3}\n and standard deviation of bootstrap replicates is: {:.3}' \
      .format(sem_b,bs_std_b))
# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates_b_called = draw_bs_reps(data[data.race=='b'].call, np.var, size=10000)

# Make a histogram of the results
#_ = plt.hist(bs_replicates_b_called, bins=50, normed=True)
#_ = plt.xlabel('variance of black callback')
#_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# Compute the 95% confidence interval: conf_int
conf_int_b = np.percentile(bs_replicates_b_called,[2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int_b, 'of black called back')

##################################################################
bs_replicates_w_called = draw_bs_reps(data[data.race=='w'].call, np.mean, size=10000)

# Compute and print SEM
sem_w = np.std(data[data.race=='w'].call) / np.sqrt(len(data[data.race=='w'].call))


# Compute and print standard deviation of bootstrap replicates
bs_std_w = np.std(bs_replicates_b_called)

print('For white call backs the sem is: {:.3}\n and standard deviation of bootstrap replicates is: {:.3}' \
      .format(sem_w,bs_std_w))
# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates_w_called = draw_bs_reps(data[data.race=='w'].call, np.var, size=10000)

# Make a histogram of the results
#_ = plt.hist(bs_replicates_w_called, bins=50, normed=True)
#_ = plt.xlabel('variance of white callback')
#_ = plt.ylabel('PDF')

# Show the plot
plt.show()

# Compute the 95% confidence interval: conf_int
conf_int_w = np.percentile(bs_replicates_w_called,[2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int_w, 'of white called back')



For black call backs the sem is: 0.00498
 and standard deviation of bootstrap replicates is: 0.00499
95% confidence interval = [0.05200241 0.0688073 ] of black called back
For white call backs the sem is: 0.00598
 and standard deviation of bootstrap replicates is: 0.00431
95% confidence interval = [0.0774425  0.09634244] of white called back


### Two-sample bootstrap hypothesis test for difference of means.

In [12]:
empirical_diff_means = np.mean(data[data.race=='w'].call)- np.mean(data[data.race=='b'].call)
print(empirical_diff_means)
# Concatenate race: race_concat
race_concat = np.concatenate((data[data.race=='b'].call, data[data.race=='w'].call))
# Compute mean of all forces: mean_force
mean_race = np.mean(race_concat)

# Generate shifted arrays
b_shifted = data[data.race=='b'].call- np.mean(data[data.race=='b']) + mean_race
w_shifted = data[data.race=='w'].call - np.mean(data[data.race=='w']) + mean_race

# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_b = draw_bs_reps(b_shifted, np.mean, size=10000)
bs_replicates_w = draw_bs_reps(w_shifted, np.mean, size=10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_b - bs_replicates_w

# Compute and print p-value: p
p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates)
print('p-value =', p)

0.03203285485506058


  return this.join(other, how=how, return_indexers=return_indexers)


p-value = 0.0




#### Frequentist statistical approaches.

Z-statistics

In [56]:
#(p1-p2)
difference = P1 - P2
n1=len(data[data.race=='b'])
n2= len(data[data.race=='w'])
#calculate p-bar
p_bar = (sum(data[data.race=='w'].call) + sum(data[data.race=='b'].call))  \
        / (n1 +n2)
#calculate q-bar
q_bar = 1 - p_bar
#calculate p-hat 1
p_hat_1 = np.divide(sum(data[data.race=='w'].call),n1)
#calculate p-hat 2
p_hat_2 = np.divide(sum(data[data.race=='b'].call),n2)
x = np.add(np.divide(1,n1),np.divide(1,n2))
#calculate z-score
z = (p_hat_1 - p_hat_2) / np.sqrt(p_bar * q_bar * x)
#calculate p-value
p = (1 - stats.norm.cdf(z)) * 2
print('The z-score is: {}\nThe p-value is: {}'.format(z,p))

The z-score is: 4.108412152434346
The p-value is: 3.983886837577444e-05


#### Margin of Error
z * $\sqrt{\bar{p}\bar{q}(\frac{1}{n_1}+\frac{1}{n_2})}$

where z is the z-score for the desired confidence interval (critical values) e.g. if you want the z score for 95% confidence interval, you can reference the table [here](http://www.statisticshowto.com/probability-and-statistics/find-critical-values/#CommonCI) and see that the z score for this critical value is 1.96

In [39]:
moe = 1.96 * np.sqrt(p_bar * q_bar * x)
print('The margin of error is: {:.4}'.format(moe))

The margin of error is: 0.01528


#### Confidence Interval values
Using the margin of error, we can take our sample difference, add and subtract the m.o.e and establish (in this case) the 95% confidence interval. 

* Confidence Interval = ($p_1 - p_2$) $\pm$ (margin of error)

In [41]:
ci = [(difference-moe), (difference+moe)]
print('With 95% confidence, the interval range is :\n {}'.format(ci))

With 95% confidence, the interval range is :
 [-0.04731476652033968, -0.01675094189855149]


#### $\chi^2$ Chi-squared Test for Equality of Proportions
The chi-square test compares the expected frequencies to observed frequencies. The null hypothesis is rejected if observed and expected frequencies are too far apart.

In [62]:
table = data[['race','call']]
tabulated = pd.crosstab(index = table.call, columns = table.race)
print(tabulated)
chi, pval, _, _ = stats.chi2_contingency(tabulated)
print('chi-square test value is {:.3} and p value is {:0.3}'.format(chi,p))

race     b     w
call            
0.0   2278  2200
1.0    157   235
chi-square test value is 16.4 and p value is 3.98e-05


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

### Q4 and Q5 : Conclusion
Performing both CLT and Chi square yields a p-value of 3.98e-05 and null hypothesis gives the the same result that there is difference between black and white resumes. This however does not mean it is the only reason for the disproportion. Multivariate analysis must be performed on all supplied variables, as some may weigh more heavy on the resulting statistics than only race.