# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [9]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [11]:
len(data[data.race=='b']), len(data[data.race=='w'])

(2435, 2435)

In [12]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


## Introduction

We need to determine whether there is an association between the perceived race of a name and the likelihood of getting a callback. We have 2435 resumes each for white-sounding and black-sounding names.

We then calculate the number of callbacks for each group. At first glance, resumes with white-sounding names receive noticeably more callbacks (235) than resumes with black-sounding names (157).

The aim of this analysis is to find out whether this difference is statistically significant or whether names and callbacks are statistically independent.

## Type of Test

First, the Central Limit Theorem applies to this problem. The sample sizes for both groups are much greater than 30, each group contains well over 10 callbacks and 10 non-callbacks, and the observations are independent. In other words, no observation affects any other observation in our dataframe.
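The sample-size condition can be checked directly from the counts in the cells above. The snippet below is a minimal sketch, assuming the common rule of thumb that each group needs at least 10 successes and 10 failures; the threshold and the variable names are illustrative choices, not part of the original notebook.

```python
# Counts taken from the notebook cells above.
n_per_group = 2435
callbacks = {"b": 157, "w": 235}

# Rule-of-thumb threshold (an assumption, not from the notebook).
threshold = 10

checks = {}
for group, successes in callbacks.items():
    failures = n_per_group - successes
    # The normal approximation is usually considered reasonable when both
    # the success and failure counts clear the threshold.
    checks[group] = successes >= threshold and failures >= threshold
    print(group, successes, failures, checks[group])
```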

Since we're trying to find out whether there is an association between names and callbacks, we will apply the **Chi-Square Test for Independence**.

A contingency table can be constructed to visualise the data better.

|               | Black         | White |  Total  |
| ------------- |:-------------:| -----:| -------:|
| Success       | 157           | 235   |  392    |
| Failure       | 2278          | 2200  |  4478   |
| Total         | 2435          | 2435  |  4870   |

## Null and Alternate Hypothesis

To do hypothesis testing, we define the following:

* **Null Hypothesis:** There is no relation between names and callbacks
* **Alternate Hypothesis:** There is a relationship between names and callbacks

We start by assuming that the null hypothesis is true, and we use a significance level of 10% (0.1).

In [18]:
bs = 157
bf = 2278
ws = 235
wf = 2200
st = bs + ws
ft = bf + wf
total = 4870
group_total = 2435

In [19]:
#Likelihood of getting and not getting a callback for the entire sample
p0 = st/total
p1 = ft/total

p0, p1

(0.08049281314168377, 0.9195071868583162)

In [20]:
e11 = p0 * group_total
e21 = p1 * group_total
e12 = p0 * group_total
e22 = p1 * group_total

e11, e21, e12, e22

(195.99999999999997, 2239.0, 195.99999999999997, 2239.0)
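The expected counts above follow the general rule for a contingency table: expected count = (row total × column total) / grand total. As a rough sketch of the same calculation done in one step, the snippet below uses NumPy broadcasting; the array layout (rows are callback/no callback, columns are black/white) and the variable names are assumptions made for illustration.

```python
import numpy as np

# Observed counts: rows = (callback, no callback), columns = (black, white).
observed = np.array([[157, 235],
                     [2278, 2200]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected counts under independence of name and callback.
expected = row_totals * col_totals / grand_total
print(expected)
```

Because both groups have the same size, the two columns of expected counts are identical, which matches the values computed by hand.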

In [21]:
df = 1
chi2 = (bs-e11)**2/e11 + (ws-e12)**2/e12 + (bf-e21)**2/e21 + (wf-e22)**2/e22
chi2

16.879050414270225

In [24]:
p = 1 - stats.chi2.cdf(chi2, df)
p

3.9838868375885461e-05
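The manual calculation can be cross-checked against SciPy's built-in routine. This is a sketch rather than part of the original analysis: it assumes `scipy.stats.chi2_contingency` with `correction=False`, because the default setting applies Yates' continuity correction to 2×2 tables and would give a slightly different statistic than the hand computation. The variable names are chosen to avoid clashing with `chi2`, `p` and `df` above.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = (callback, no callback), columns = (black, white).
observed = np.array([[157, 235],
                     [2278, 2200]])

# correction=False disables Yates' continuity correction so that the result
# is comparable with the uncorrected statistic computed by hand.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(chi2_stat, p_value, dof)
print(expected)
```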

The p-value obtained (about 4 × 10⁻⁵) is far smaller than the significance threshold of 0.1, so we reject the null hypothesis in favour of the alternate hypothesis. In other words, **there is a statistically significant association between the name of a candidate and the success in getting a callback based on the resume.**

The confidence level assumed was 90%, consistent with the 0.1 significance level, with an assumed margin of error of 20%.
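A margin of error and confidence interval can also be computed from the data rather than assumed. The following is a minimal sketch for the difference in callback rates between the two groups, assuming the usual normal approximation with an unpooled standard error and the 90% confidence level used above; it relies only on the Python standard library, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 2435                 # resumes per group
p_w = 235 / n            # callback rate, white-sounding names
p_b = 157 / n            # callback rate, black-sounding names
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se = sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)

# Two-sided critical value for a 90% confidence level.
confidence = 0.90
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z_crit * se
ci = (diff - margin_of_error, diff + margin_of_error)

print(diff, margin_of_error, ci)
```

Under these assumptions the margin of error comes out at roughly 1.3 percentage points around a difference of about 3.2 percentage points, so the interval does not include zero.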

## Conclusion and Final Remarks

1. There is a statistically significant association between names and callback success. This could imply that there is active discrimination taking place in the industry based on race.
2. However, we cannot conclude that name is the most important factor for callback success. While a correlation has been established, this does not directly imply causation. Other parameters such as education and work experience may also play a role, and the relationship between all these variables has not been established, so no definitive conclusion can be drawn.

A possible amendment to this analysis is to check whether names and callback success both correlate with some third variable (such as work experience, education or age). If that correlation is strong, we can offer an alternative hypothesis: both variables are influenced by the third variable and are therefore correlated but not causally related.
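One way to start on that amendment is to compare callback rates by race within each level of a candidate third variable. The sketch below assumes a DataFrame with the `race` and `call` columns described earlier plus a covariate such as `education`; the helper `callback_rate_by` is hypothetical, and the tiny `toy` frame contains made-up rows that only illustrate the shape of the output, not the study data.

```python
import pandas as pd

def callback_rate_by(frame, covariate):
    """Mean callback rate for each (covariate level, race) pair.

    Hypothetical helper: rows are covariate levels, columns are race values.
    """
    return frame.groupby([covariate, "race"])["call"].mean().unstack("race")

# Made-up rows for illustration only; replace with the real `data` frame.
toy = pd.DataFrame({
    "race":      ["b", "w", "b", "w", "b", "w", "b", "w"],
    "education": [3, 3, 3, 3, 4, 4, 4, 4],
    "call":      [0, 1, 0, 0, 1, 1, 0, 1],
})

rates = callback_rate_by(toy, "education")
print(rates)
```

With the real data, calling `callback_rate_by(data, "education")` would show whether the gap between the `b` and `w` columns persists at every education level or is concentrated in a few of them.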