# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [3]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
data = pd.io.stata.read_stata('us_job_market_discrimination.dta')

In [5]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [6]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

#### 1. What test is appropriate for this problem? Does CLT apply?

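Each résumé either receives a callback or does not, so `call` is a Bernoulli variable, and the 'b' and 'w' values were assigned at random.
A two-sample hypothesis test on the difference between the two callback proportions is appropriate for this yes/no outcome. The CLT applies: the sample is less than 10% of the
original population (assumed to be the entire working population of the United States), so the observations can be treated as independent, and each group is large enough (for example, 235 callbacks among white-sounding names) for the sampling distribution of a proportion to be approximately normal.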
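
A common rule of thumb for treating a sample proportion as approximately normal is that each group contains at least 10 successes and 10 failures. The sketch below checks that rule on a stand-in array. The helper name `clt_conditions`, the threshold of 10, and the group size of 2435 (inferred from the 235 callbacks and the printed mean of about 0.0965) are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def clt_conditions(calls, threshold=10):
    """Return (successes, failures, ok) for a 0/1 array of callbacks.

    ok is True when both counts reach the rule-of-thumb threshold.
    """
    calls = np.asarray(calls)
    successes = int(calls.sum())
    failures = int(calls.size - successes)
    ok = successes >= threshold and failures >= threshold
    return successes, failures, ok

# Stand-in for w.call (assumption): 235 callbacks out of 2435 resumes
w_demo = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
print(clt_conditions(w_demo))
```

In the notebook itself, the same call could be made as `clt_conditions(w.call)` and `clt_conditions(b.call)` once `w` and `b` are defined.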

#### 2. What are the null and alternate hypotheses?

$H_{0}$: There is no difference in interview requests from employers between resumes with black-sounding and white-sounding names,
i.e. $$P_w = P_b$$

$H_{a}$: There is a difference in interview requests from employers between resumes with black-sounding and white-sounding names,
i.e. $$P_w \neq P_b$$<br/>
where $P_w$ is the proportion of resumes with white-sounding names called for interview and $P_b$ is the proportion of resumes with black-sounding names called for interview.

In [7]:
w = data[data.race=='w']
b = data[data.race=='b']

#### 3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

Confidence interval = Sample Mean $\pm\; Z\times S/\sqrt{n}$<br/>
Margin of error = $Z\times S/\sqrt{n}$


In [30]:
w_call = w['call']
b_call = b['call']

w_mean = np.mean(w.call)
w_std = np.std (w.call)

b_mean = np.mean(b.call)
b_std = np.std (b.call)
p_diff = w_mean - b_mean #(for the sample distribution, mean value = p)

mean_diff = np.abs(np.mean(w_call) - np.mean(b_call))

print ('The mean and std deviation of w dataframe is:', w_mean,',', w_std)
print ('The mean and std deviation of b dataframe is :', b_mean,',', b_std)

The mean and std deviation of w dataframe is: 0.09650924056768417 , 0.29528486728668213
The mean and std deviation of b dataframe is : 0.0644763857126236 , 0.24559901654720306


We treat the two dataframes w and b as independent samples drawn from the population represented by the main dataframe ('data'), which is reasonable because the 'b' and 'w' values were assigned at random. With samples this large, the sampling distribution of the difference in proportions is approximately normal, and its variance is the sum of the variances of the two sample proportions. The standard error of the difference in sample proportions is therefore

$$SE = \sqrt{\frac{\sigma^2_w}{n_w} + \frac{\sigma^2_b}{n_b}}$$



In [13]:
se_bw = np.sqrt((w_std**2/len(w)) + (b_std**2)/len(b))

print('The standard error for difference in proportion is : ', se_bw)

The standard error for difference in proportion is :  0.007783308359923415


#### Margin of error considering alpha (significance level) = 0.05

In [19]:
moe = 1.96* (se_bw)             # z value for 95% CI
print('Margin of Error is:', moe)
print('The 95% confidence interval is between [', p_diff-moe, ',', p_diff+moe,']')

Margin of Error is: 0.015255284385449893
The 95% confidence interval is between [ 0.016777570469610682 , 0.04728813924051047 ]
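
The cell above hard-codes 1.96 for a 95% interval. As a hedged sketch, the same calculation can be wrapped in a function that looks up the critical value with `scipy.stats.norm.ppf`, so other confidence levels can be tried. The function name `diff_prop_ci` and the stand-in arrays (235 and 157 callbacks out of 2435 resumes each, inferred from the printed means) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def diff_prop_ci(calls_1, calls_2, confidence=0.95):
    """Difference in proportions, margin of error, and confidence interval.

    Uses the unpooled standard error sqrt(p1(1-p1)/n1 + p2(1-p2)/n2).
    """
    calls_1 = np.asarray(calls_1, dtype=float)
    calls_2 = np.asarray(calls_2, dtype=float)
    p1, p2 = calls_1.mean(), calls_2.mean()
    se = np.sqrt(p1 * (1 - p1) / calls_1.size + p2 * (1 - p2) / calls_2.size)
    # Two-sided critical value, about 1.96 for confidence=0.95
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    moe = z * se
    diff = p1 - p2
    return diff, moe, (diff - moe, diff + moe)

# Stand-ins for w_call and b_call (assumption, see lead-in)
w_demo = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
b_demo = np.concatenate([np.ones(157), np.zeros(2435 - 157)])
diff, moe, ci = diff_prop_ci(w_demo, b_demo)
print('difference:', diff, 'margin of error:', moe, 'CI:', ci)
```

Calling `diff_prop_ci(w_call, b_call, confidence=0.99)` in the notebook would give a wider interval for the same data.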


In [26]:
# Calculating the p-value of distribution
t, p = stats.ttest_ind(w_call,b_call)
print('p val:',p)

p val: 3.940802103128886e-05


##### The above t-test is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.<br/>
##### This test assumes that the populations have identical variances by default.

As the above p-value (about 0.0000394) is much less than our significance level (0.05), we reject the null hypothesis that<br/>
there is no difference in requests for interview between black-sounding and white-sounding names.

For a Bernoulli sample, the sample proportion p equals the sample mean, so comparing the two means is the same as comparing the two proportions.
Under the null hypothesis above, pw = pb, i.e. pw - pb = 0.
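
The t-test is one frequentist route. For two proportions, a pooled two-proportion z-test is a common alternative, because under the null hypothesis both groups share a single callback rate. The sketch below is not part of the original analysis; the function name `two_proportion_ztest` is hypothetical, and the counts (235 and 157 callbacks out of 2435 resumes each) are inferred from the printed means.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(successes_1, n_1, successes_2, n_2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    # Under H0 both groups share one rate, estimated from the pooled data
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal survival function
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Counts inferred from the notebook output (assumption, see lead-in)
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print('z:', z, 'p val:', p_value)
```

With samples this large, the z-test and the t-test should lead to the same decision at the 0.05 level.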

In [27]:
diff_means = w_mean -b_mean
print('The difference of means of w and b sample datasets is :', diff_means)

The difference of means of w and b sample datasets is : 0.03203285485506058


The above p value is less than alpha (0.05) which makes us reject the null hypothesis.


In [33]:
# Defining a function that permutes the pooled samples, splits them in two, and records the difference in means
def permutation_sample(data1, data2):
    """Generate 10000 permutation replicates from two data sets.

    Writes into the global array permutation_replicates, which must be
    created (length 10000) before this function is called.
    """
    # Concatenate the data sets once; the pooled data do not change between iterations
    data_final = np.concatenate((data1, data2))

    for i in range(10000):
        # Permute the concatenated array: permuted_data
        permuted_data = np.random.permutation(data_final)

        # Split the permuted array into two: perm_sample_1, perm_sample_2
        perm_sample_1 = permuted_data[:len(data1)]
        perm_sample_2 = permuted_data[len(data1):]

        # Store the absolute difference in means for this permutation
        permutation_replicates[i] = np.abs(np.mean(perm_sample_1) - np.mean(perm_sample_2))
    

In [37]:
# permutation replicates (filled in by permutation_sample)
permutation_replicates = np.empty(10000)

permutation_sample(w_call, b_call)

p_val = np.sum(permutation_replicates > mean_diff)/10000 
print('p val:',p_val)



p val: 0.0


#### Above, we count the permutation replicates whose absolute difference in means is greater than the observed difference in means of the original samples. Dividing that count by the number of permutations gives the p-value.

A p-value of 0.0 here means that none of the 10000 permutations produced a difference larger than the observed one. As this p-value is less than the significance level (0.05), the null hypothesis is rejected.
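
The permutation test gives a p-value but no interval. A bootstrap confidence interval for the difference in callback rates could be sketched as follows: resample each group with replacement, recompute the difference in means, and take percentiles of the replicates. The function name `bootstrap_diff_ci`, the seed, the number of replicates, and the stand-in arrays (235 and 157 callbacks out of 2435 resumes each, inferred from the printed means) are assumptions for illustration.

```python
import numpy as np

def bootstrap_diff_ci(data1, data2, size=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(data1) - mean(data2)."""
    rng = np.random.default_rng(seed)
    data1 = np.asarray(data1, dtype=float)
    data2 = np.asarray(data2, dtype=float)
    replicates = np.empty(size)
    for i in range(size):
        # Resample each group independently, with replacement
        sample_1 = rng.choice(data1, size=data1.size, replace=True)
        sample_2 = rng.choice(data2, size=data2.size, replace=True)
        replicates[i] = sample_1.mean() - sample_2.mean()
    # Cut off equal tails on both sides of the replicate distribution
    tail = 100 * (1 - confidence) / 2
    lower, upper = np.percentile(replicates, [tail, 100 - tail])
    return lower, upper

# Stand-ins for w_call and b_call (assumption, see lead-in)
w_demo = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
b_demo = np.concatenate([np.ones(157), np.zeros(2435 - 157)])
lower, upper = bootstrap_diff_ci(w_demo, b_demo)
print('Bootstrap 95% confidence interval: [', lower, ',', upper, ']')
```

If the interval excludes 0, it agrees with the decision to reject the null hypothesis at the 0.05 level.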

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

#### 4. Write a story describing the statistical significance in the context of the original problem.
The project examined the level of racial discrimination in the United States labor market.
The researchers carried out the experiment by randomly assigning
identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews
from employers.

To check the existence of any bias in interview calls, we carried out a hypothesis test where we assumed that there
is no difference between the calls received by the candidate with black or white sounding names.

We split the existing dataset into two samples (white and black) and computed how likely a difference in callback
rates at least as large as the observed one would be if the null hypothesis were true. The resulting p-value was
below the significance level of 0.05 (corresponding to a 95% confidence level), so we reject the null hypothesis:
résumés with white-sounding names received callbacks at a significantly higher rate (about 9.7%) than identical
résumés with black-sounding names (about 6.4%).

#### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
No. The analysis compared the callback rates of the two groups in the dataset and concluded that résumés with
black-sounding and white-sounding names do not have the same callback success rate.

However, the test considered only race/name; no other variables were included when forming and testing the
hypotheses. It therefore shows that race/name matters, not that it matters most.

Several other factors (education level, experience, etc.) could contribute to callback success. They were not
explored in this project and may have a bigger impact on the callback rate than race/name. To amend the analysis,
these variables would need to be examined alongside race, for example by comparing callback rates across their
levels or by fitting a regression of `call` on race and the other résumé attributes.
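
As a first, rough step toward comparing factors, the callback rate can be tabulated for each value of a résumé attribute. This is only a descriptive sketch, not a substitute for a regression that controls for several variables at once. The helper name `callback_rate_by` and the toy dataframe are made up for illustration; `honors` is one of the columns shown in `data.head()`.

```python
import pandas as pd

def callback_rate_by(df, column, outcome="call"):
    """Callback rate and group size for each value of one resume attribute."""
    return df.groupby(column)[outcome].agg(rate="mean", n="size")

# Toy data for illustration only (not taken from the real dataset)
demo = pd.DataFrame({
    "race":   ["w", "w", "w", "w", "b", "b", "b", "b"],
    "honors": [1, 0, 1, 0, 1, 0, 1, 0],
    "call":   [1, 0, 1, 1, 1, 0, 0, 0],
})

for column in ["race", "honors"]:
    print(callback_rate_by(demo, column))
```

In the notebook, `callback_rate_by(data, 'education')` or `callback_rate_by(data, 'yearsexp')` would show how much the callback rate varies with those attributes compared with race.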