# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.


### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet




In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.read_stata('./us_job_market_discrimination.dta')

In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [6]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [9]:
# number of callbacks and no callbacks for white-sounding names
data[data.race=='w'].call.value_counts()

0.0    2200
1.0     235
Name: call, dtype: int64

In [10]:
# number of callbacks and no callbacks for black-sounding names
data[data.race=='b'].call.value_counts()

0.0    2278
1.0     157
Name: call, dtype: int64

In [5]:
data.head()

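| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |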


In [8]:
data.shape

(4870, 65)

# Question 1:

**What test is appropriate for this problem?** A two-sample z-test for the difference between two proportions is appropriate, because we are comparing the callback rate for black-sounding names with the callback rate for white-sounding names.

**Does CLT apply?** Since we are comparing two proportions, the Central Limit Theorem applies if two criteria are met:
1. **Are the samples independent?** Yes. The samples are independent, because the names were assigned to the resumes at random.
2. **Check for sample size/skew. Is np >= 10 and n(1-p) >= 10?** Yes. Both samples fulfill this criterion.

Since we don't have the actual population proportion, we'll have to compute the **sample pool proportion**:
- Ppool = TotalSuccess / TotalSampleSize **=** (#Success1 + #Success2) / (Size1 + Size2) **=** (#WhiteCallBack + #BlackCallBack) / (TotalSampleSize)

In [16]:
# Calculate the # of callbacks for each race
white_callback = sum(data[data.race=='w'].call)
black_callback = sum(data[data.race=='b'].call)

# Calculate the sample size for each race
white_sample_size = len(data[data.race=='w'])
black_sample_size = len(data[data.race=='b'])

# Calculate the sample pool proportion
sample_pool_proportion = (white_callback + black_callback) / (white_sample_size + black_sample_size)
sample_pool_proportion


0.080492813141683772

In [24]:
# Check both cases (np>=10 and n(1-p)>=10) for each race
whiteCLT1 = white_sample_size*sample_pool_proportion >= 10
whiteCLT2 = white_sample_size*(1-sample_pool_proportion) >= 10

blackCLT1 = black_sample_size*sample_pool_proportion >= 10
blackCLT2 = black_sample_size*(1-sample_pool_proportion) >= 10

# Print the results to see whether the second CLT criterion is fulfilled
print(whiteCLT1, whiteCLT2, blackCLT1, blackCLT2)

True True True True
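The check above uses the pooled proportion to get expected counts. As a complementary sketch, the same rule of thumb can be applied to the observed counts in each group, taken from the `value_counts()` outputs earlier in the notebook. The names `counts`, `MIN_COUNT`, and `meets_success_failure` are illustrative and not part of the original analysis.

```python
# Observed counts copied from the value_counts() outputs earlier in the notebook
counts = {
    "white": {"callbacks": 235, "no_callbacks": 2200},
    "black": {"callbacks": 157, "no_callbacks": 2278},
}

MIN_COUNT = 10  # rule-of-thumb threshold for the normal approximation


def meets_success_failure(group_counts, threshold=MIN_COUNT):
    """Return True if both successes and failures reach the threshold."""
    return min(group_counts.values()) >= threshold


# One boolean per group: True means the condition is satisfied
checks = {group: meets_success_failure(c) for group, c in counts.items()}
print(checks)
```

Both groups have well over 10 callbacks and 10 non-callbacks, so the normal approximation looks reasonable under this rule of thumb as well.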


# Question 2:

**What are the null and alternate hypotheses?** 

**Null Hypothesis:** *(H0: pw = pb)* The callback rate is the same for resumes with white-sounding and black-sounding names, i.e. there is **no** racial discrimination in callbacks.

**Alternate Hypothesis:** *(H1: pw != pb)* The callback rate differs between resumes with white-sounding and black-sounding names, i.e. there **is** some racial discrimination in callbacks.

# Question 3:

**Compute margin of error, confidence interval, and p-value.**

**Standard Error:** The pooled standard error of the difference in proportions is about 0.0078.

**Margin of Error:** At the 95% confidence level (z = 1.96), the margin of error is 1.96 × 0.0078 ≈ 0.0153.

**Confidence Interval:** The 95% confidence interval for the difference in callback rates (black minus white) is about (-0.0473, -0.0168).

**P-Value:** The two-sided p-value is about 4E-5.

In [25]:
# Calculate the total size of the sample population
n = black_sample_size + white_sample_size

# Store the pooled sample proportion as p
p = sample_pool_proportion

# Let our Z-Score be 1.96 at 95% Confidence Level
z = 1.96

In [57]:
import math

# Calculate the pooled standard error of the difference in proportions
SE = math.sqrt(p*(1-p)*((1/black_sample_size) + (1/white_sample_size)))
SE

0.007796894036170457

In [58]:
# Calculate the Confidence Interval
white_sample_proportion = white_callback / white_sample_size

black_sample_proportion = black_callback / black_sample_size

estimate_point = black_sample_proportion - white_sample_proportion

# Margin of error = critical value * standard error
margin_of_error = z * SE

lower_bound = estimate_point - margin_of_error
upper_bound = estimate_point + margin_of_error

print(lower_bound, upper_bound)

-0.0473147665203 -0.0167509418986
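
The interval above is built from the pooled standard error, which assumes the two callback rates are equal. A common convention for confidence intervals is to use the unpooled standard error instead, since the interval does not assume the null hypothesis. The sketch below shows that variant using the callback counts reported earlier in the notebook (157 of 2435 and 235 of 2435). It is an alternative for comparison rather than the notebook's own method, and the variable names are illustrative.

```python
import math

# Counts from the notebook: 157 + 2278 = 2435 and 235 + 2200 = 2435 resumes
black_callbacks, black_n = 157, 2435
white_callbacks, white_n = 235, 2435

p_black = black_callbacks / black_n
p_white = white_callbacks / white_n
diff = p_black - p_white  # same sign convention as estimate_point

# Unpooled standard error: each group contributes its own variance term
se_unpooled = math.sqrt(
    p_black * (1 - p_black) / black_n + p_white * (1 - p_white) / white_n
)

z_crit = 1.96  # critical value for a 95% confidence level
ci_lower = diff - z_crit * se_unpooled
ci_upper = diff + z_crit * se_unpooled

print(se_unpooled, ci_lower, ci_upper)
```

With samples this large, the pooled and unpooled standard errors differ only slightly, so the choice has little practical effect on the interval here.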


In [59]:
# Calculate Z-Score and P-value

#Z-Score
z_score = estimate_point / SE
z_score

#P-Value = 4E-05 

-4.1084121524343464
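
The p-value quoted in the comment is not computed in the cell. As a minimal sketch, it can be derived from the z-score with the standard normal tail probability. This version uses only the standard library, and the helper name `two_sided_p_value` is an assumption for illustration.

```python
import math

z_score = -4.1084121524343464  # z-score displayed in the cell above


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi
    return math.erfc(abs(z) / math.sqrt(2))


p_value = two_sided_p_value(z_score)
print(p_value)
```

Since the notebook already imports `scipy.stats`, `2 * stats.norm.sf(abs(z_score))` should give the same quantity.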

# Question 4:


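**Write a story describing the statistical significance in the context of the original problem.**

The hypothesis test supports the alternative hypothesis that callback rates differ between resumes with black-sounding and white-sounding names. Resumes with black-sounding names received callbacks about 6.4% of the time (157 of 2435), compared with about 9.7% (235 of 2435) for white-sounding names, a gap of roughly 3.2 percentage points. Because the p-value (about 4E-5) is far below 0.05, a gap this large would be very unlikely if the true callback rates were equal, so we reject the null hypothesis.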

# Question 5:

**Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?**

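No. This analysis shows only that race/name has a statistically significant effect on callback rates. Because the names were assigned to resumes at random, the gap can be attributed to the name rather than to other resume characteristics, but the analysis never compares race against any other factor, so it cannot tell us whether race is the most important one. To amend the analysis, we would measure how other features in the dataset (for example education, years of experience, or honors) relate to callbacks and compare the size of their effects with the effect of race. Further experiments could also help confirm the result.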
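
As a starting point for comparing features, the sketch below tabulates callback rates by race within each level of another column. The helper `callback_rate_by` is hypothetical, and it assumes the `race` and `call` columns are coded as described in the Data section ('b'/'w' and 1/0).

```python
import pandas as pd


def callback_rate_by(df, feature, race_col="race", call_col="call"):
    """Callback rate per race within each level of `feature`.

    Assumes `race_col` holds 'b'/'w' and `call_col` holds 1/0.
    """
    # The mean of a 0/1 column is the callback rate for that group
    rates = df.groupby([feature, race_col])[call_col].mean().unstack(race_col)
    # Gap in callback rate (white minus black) at each level of the feature
    rates["gap"] = rates["w"] - rates["b"]
    return rates
```

For example, `callback_rate_by(data, 'honors')` would show whether the gap between white-sounding and black-sounding names persists among resumes with and without honors. A regression model with several features would be a more complete comparison.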