# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context or the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [6]:
# number of callbacks for black-sounding names
sum(data[data.race=='w'].call)

235.0

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [8]:
w = data[data.race=='w']
b = data[data.race=='b']

In [9]:
print("Dataset count: ", len(data))
print("Resume with White sounding names: ", len(data[data.race=='w']))
print("Resume with Black sounding names: ", len(data[data.race=='b']))
print("Call back received for white resumes: ",sum(data[data.race=='w'].call))
print("Call back received for black resumes: ",sum(data[data.race=='b'].call))

#print(data[data.race=='w'].call)

Dataset count:  4870
Resume with White sounding names:  2435
Resume with Black sounding names:  2435
Call back received for white resumes:  235.0
Call back received for black resumes:  157.0


In [10]:
#ax=sns.countplot(x = data[data.race=='w'].call , data = data[data.race=='w'].call,  orient="H")
#ax=pd.crosstab(asd_no.Jaundice, asd_no.ASD).plot(kind='bar', stacked=False, color=['green','red'], grid=False)
#_=plt.hist(data[data.race=='w'].call, color="red", bins=2)
#pd.crosstab(data.ASD, data.Jaundice).plot(kind='bar')
#ax=pd.crosstab(data[data.race=='w'].race, data[data.race=='b'].race).plot(kind='bar', stacked=False, color=['green','red'], grid=False)
#ct=pd.crosstab(data.Jaundice, data.ASD)
#print(ct)
#plt.ylabel("COUNT", fontsize=10)
#plt.xlabel("JAUNDICE", fontsize=10)
#ax.set_title("Based on Jaundice condition at Birth", fontsize=12)
#plt.show()




<h3>Q1. What test is appropriate for this problem? Does CLT apply?</h3>
Central Limit Theorem (CLT) applies as its a large enough sample. We have two large enough proportion (white and black sounding resumes) to compare if they are same. We can use z-test for that.

<h3>Q2. What are the null and alternate hypotheses?</h3>
Null and alternate hypothesis are as follows
<ul>
<li><b>Null Hypothesis HO</b>:No impact of race on call back. Probabilities are equal
</li><b>Alternate Hypothesis HA</b>: Race has a impact on callbacks. Probabilities are NOT equal




<h3>Q3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.</h3><br>
For bootstraping we will be doing the permutation approach. We will ignore the to which catetegory (white or black sounding) the resume belongs to. Then we scramble the data and assign them first portion as white and the remaining as black sounding resumes. Then we calculate the difference in mean and check what we can infer.  

In [11]:
# Your solution to Q3 here

The z score test for two population proportions is used when you want to know whether two populations or groups (e.g., males and females; theists and atheists) differ significantly on some single (categorical) characteristic - for example, whether they are vegetarians. <br>
<br>
Requirements <br>
 • A random sample of each of the population groups to be compared.<br>
 • Categorial data<br>
<br>
Null Hypothesis 
<br>
H0: p1 - p2 = 0, where p1 is the proportion from the first population and p2 the proportion from the second. 
<br>
As above, the null hypothesis tends to be that there is no difference between the two population proportions; or, more formally, that the difference is zero (so,for example, that there is no difference between the proportion of males who are vegetarian and the proportion of females who are vegetarian).  
<br>
Equation<br>
z = ((p1 - p2) -0) / sqrt(p(1 - p)*(1/n1 - 1/n2))



In [12]:

w = data[data.race=='w']
b = data[data.race=='b']
# calculate probability of getting call back for white and black sounding enteries
prob_w = np.sum(w.call)/len(w)
prob_b = np.sum(b.call)/len(b)
print(prob_w)
print(prob_b)
percent_diff = ((prob_w - prob_b) / prob_b) *100
print(percent_diff)



# probability difference
pr_diff = prob_w - prob_b
print(pr_diff)

# probability of getting a callback
prob = (np.sum(w.call) + np.sum(b.call)) / (len(w) + len(b))

# z-score
z = pr_diff / np.sqrt( prob * (1-prob) * ( (1/len(w)) + (1/len(b)) ) )
print('z-score:', z)
# p_val, multiply by 2 for two-tail test
p_value = stats.norm.cdf(-z)*2
print('p-value:', p_value)
z_critical = stats.norm.ppf(q = 0.975)
print('z-critical:', z_critical)

0.0965092402464
0.064476386037
49.6815286624
0.0320328542094
z-score: 4.10841215243
p-value: 3.98388683759e-05
z-critical: 1.95996398454


Based on the p value, the null hypothesis can be rejected

In [13]:
#some functions needed for bootstrap approach

def diff_of_means(data1, data2):
    return np.mean(data1) - np.mean(data2)

def generate_permutation_sample(data1, data2):
    combined_data  = np.concatenate((data1, data2))
    permuted_data  = np.random.permutation(combined_data)
    permuted_data1 = permuted_data[:len(data1)]
    permuted_data2 = permuted_data[len(data1):]
    return permuted_data1, permuted_data2

def draw_permutation_replicate(dataA, dataB, func, size=1):
    rep_array = np.empty(size)
    for i in range(size):
        perm1, perm2   = generate_permutation_sample(dataA, dataB)
        rep_array[i] = func(perm1,perm2)
        return rep_array


wht_calls = data[data.race=='w'].call.values
blk_calls = data[data.race=='b'].call.values

mean_diff = diff_of_means(wht_calls, blk_calls)
print(mean_diff)
permutation_reps = draw_permutation_replicate(wht_calls, blk_calls, diff_of_means, size=10000)
#print(permutation_reps)
p_val = np.sum(permutation_reps > mean_diff) / len(permutation_reps)
print('p_value=', p_val)
confidence_interval = np.percentile(permutation_reps, [2.5, 97.5])
print('confidence Interval:', confidence_interval)

0.0320329
p_value= 0.0
confidence Interval: [  1.09007486e-311   1.09013937e-311]


We calculated the probability of getting difference of mean through sampling and repeating the same for 10000 times. The probability comes out to be close to 0. Hence the null hypothesis can be rejected.

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

<h3>Q4. Write a story describing the statistical significance in the context or the original problem.</h3><br>
Based on the analysis and statistical evidence seems there is influence of race in getting back calls.

<h3>Q5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?</h3><br>
Not necessarily that the race is the most important factor in callback success. The data sample that we have contains lots of other variables/features alo. Those needs to be analyzed too. After analyzing all we can come to some statistical conclusion on what factor that dominates the callback success.