# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.io.stata.read_stata('C:\\Users\\tilleymusprime\\Desktop\\us_job_market_discrimination.dta')

In [4]:
data.shape

(4870, 65)

In [5]:
data1 = pd.get_dummies(data['race'])
data = pd.concat([data, data1], axis=1)

In [6]:
dc = data.corr()
dc['call']
#Using correlation and dummy columns, the race of the name is only weakly correlated with callbacks
#At the same time, no single category looks like a magic bullet for increasing calls, since our
#highest correlation is 0.11 for special skills (probably higher for Springboard-acquired special skills, though)

education            -0.005748
ofjobs                0.002311
yearsexp              0.061436
honors                0.071951
volunteer             0.007197
military             -0.020577
empholes              0.071888
occupspecific         0.040548
occupbroad            0.034536
workinschool         -0.027888
email                 0.025880
computerskills       -0.028813
specialskills         0.111074
h                     0.025835
l                    -0.025835
call                  1.000000
adid                  0.063178
fracblack            -0.022130
fracwhite             0.035148
lmedhhinc             0.047699
fracdropout          -0.056671
fraccolp              0.047016
linc                  0.049649
col                  -0.008479
eoe                   0.003092
parent_sales          0.008430
parent_emp            0.039060
branch_sales         -0.029126
branch_emp           -0.026909
fed                   0.014471
fracblack_empzip      0.009882
fracwhite_empzip     -0.032989
lmedhhin

In [7]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [8]:
bdf = data[data['race'] == 'b']
bdf.shape

(2435, 67)

In [9]:
bsr = 157/2435
print(bsr)

0.06447638603696099


In [10]:
# number of callbacks for white-sounding names
sum(data[data.race == 'w'].call)

235.0

In [11]:
wdf = data[data['race'] == 'w']
wdf.shape

(2435, 67)

In [12]:
wsr = 235 / 2435
print(wsr)

0.09650924024640657


In [13]:
data[data['call'] == 1].shape

(392, 67)

In [14]:
dsr = 392 / 4870  # overall callback rate: 392 callbacks out of all 4870 resumes
round(dsr, 4)

0.0805

In [15]:
#From our basic comparison, it looks like people with white-sounding names are roughly 150% as likely (about 1.5 times
#as likely) to get a callback as people with black-sounding names.  We will move on to statistical testing to determine whether the difference is real
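
The "150% as likely" figure can be checked directly from the counts above. This is a minimal sketch that hard-codes the callback counts and group size from the earlier cells; the variable names are illustrative, not part of the original notebook.

```python
# Callback counts and group size taken from the cells above
black_calls, white_calls, n_per_group = 157, 235, 2435

black_rate = black_calls / n_per_group
white_rate = white_calls / n_per_group

ratio = white_rate / black_rate   # relative callback rate (white vs. black)
pct_more = (ratio - 1) * 100      # percent increase over the black-name rate

print(f"ratio = {ratio:.3f}, i.e. about {pct_more:.0f}% more likely")
```

A ratio near 1.5 means "150% as likely", which is the same as "about 50% more likely".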

In [16]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership,b,w
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,1
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,1
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1,0
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1,0
4,b,1,3,3,22,0,0,0,0,313,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit,0,1


In [17]:
# Since we are dealing with two large, independent samples and a binary outcome, we will use a two-proportion z-test


In [18]:
#Does the Central Limit Theorem apply here?
#Independence: names were randomly assigned to resumes, so the observations can be treated as independent
#Sample size: each group has more than 10 successes (157 and 235 callbacks) and more than 10 failures (2278 and 2200),
#so the sampling distribution of each callback rate is approximately normal
#Randomness: the data description says the assignment was random, so we will trust that this time
#Because all of these conditions hold, we can state that the CLT does apply to this problem
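
The success/failure condition mentioned above can be checked in a few lines. This sketch assumes the common rule of thumb of at least 10 successes and 10 failures per group, and reuses the counts from the earlier cells; `clt_ok` is a name introduced here for illustration.

```python
# Counts from the cells above: callbacks and resumes per group
n_per_group = 2435
callbacks = {"black": 157, "white": 235}

clt_ok = {}
for group, successes in callbacks.items():
    failures = n_per_group - successes
    # Rule of thumb: at least 10 successes and 10 failures in each group
    clt_ok[group] = successes >= 10 and failures >= 10
    print(f"{group}: {successes} successes, {failures} failures, ok={clt_ok[group]}")
```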

In [19]:
# Question 2:  What are the null and alternative hypotheses?
#Our null hypothesis is that the true callback rate is the same for black-sounding and white-sounding names (p_b = p_w);
#the observed sample rates are about 6.4% and 9.7%
#Our alternative hypothesis is that the two callback rates differ (p_w - p_b != 0)
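
With the hypotheses stated, the frequentist test can be sketched as a two-sided two-proportion z-test. This is one reasonable way to run it, not necessarily the only one: it pools the two groups to estimate the standard error under the null, and it hard-codes the counts from the earlier cells. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Callbacks and group sizes from the cells above
x_b, x_w = 157, 235
n_b, n_w = 2435, 2435

p_b, p_w = x_b / n_b, x_w / n_w

# Under the null hypothesis both groups share one callback rate, so pool them
p_pool = (x_b + x_w) / (n_b + n_w)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z = (p_w - p_b) / se_pool
p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value from the standard normal

print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

The z statistic comes out at roughly 4.1, so the p-value is far below an alpha of 0.01.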

<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [20]:
w = data[data.race=='w']
b = data[data.race=='b']
#wsr is going to be our sample mean for white sounding names and bsr will be our sample mean for black sounding names
print(wsr, bsr)

0.09650924024640657 0.06447638603696099


In [21]:
print(w.shape, b.shape)

(2435, 67) (2435, 67)


In [22]:
#Now that we have the means for both sets of data, we will find the sample variances and standard deviations
wsv = (2200* (0-wsr)**2 + 235* (1-wsr)**2) / 2434
wst = wsv**(0.5)
wst

0.2953489980097223

In [23]:
bsv = (2278* (0-bsr)**2 + 157* (1-bsr)**2) / 2434
bst = bsv**(.5)
bst

0.24565008364706123

In [24]:
#White-sounding names have a higher standard deviation of callbacks (about 0.295 vs. 0.246, roughly 20% higher),
#but the relative gap is much smaller than the gap in the means (roughly 50%)
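
For 0/1 data the sample standard deviation is fully determined by the callback rate: s = sqrt(p(1-p) * n/(n-1)). The sketch below rebuilds each group's 0/1 outcomes from the counts above (an assumption made so the example runs without the data file) and compares the formula with NumPy's `std(ddof=1)`.

```python
import numpy as np

n = 2435
sd_by_group = {}
for group, successes in {"black": 157, "white": 235}.items():
    # Rebuild the 0/1 callback outcomes from the counts
    calls = np.concatenate([np.ones(successes), np.zeros(n - successes)])
    p_hat = calls.mean()
    sd_formula = np.sqrt(p_hat * (1 - p_hat) * n / (n - 1))  # Bernoulli shortcut
    sd_numpy = calls.std(ddof=1)                             # sample standard deviation
    sd_by_group[group] = (sd_formula, sd_numpy)
    print(f"{group}: formula = {sd_formula:.5f}, numpy = {sd_numpy:.5f}")
```

Both routes should agree with the hand-computed `bst` and `wst` values.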

In [25]:
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

In [26]:
#Let's start by generating a permutation sample for this data.
perm = permutation_sample(b['call'], w['call'])
perm[0], perm[1]

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32),
 array([0., 0., 0., ..., 0., 0., 0.], dtype=float32))

In [32]:
def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

In [33]:
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)

    return diff

In [34]:
empirical_diff_of_means = diff_of_means(w['call'], b['call'])  # observed difference between the real groups
round(float(empirical_diff_of_means), 4)

0.032

In [39]:
perm_replicates = draw_perm_reps(w['call'], b['call'], diff_of_means, size=10000)

In [40]:
p = np.sum(perm_replicates >= empirical_diff_of_means) / len(perm_replicates)
p
#The observed difference in callback rates (about 0.032) is roughly four standard errors above zero, so very few, if any,
#of the 10,000 permutation replicates should be this large. We therefore expect a p-value far below our alpha of 0.01
#and reject the null hypothesis that black-sounding and white-sounding names have the same callback rate.

In [51]:
perm_replicates = draw_perm_reps(w['call'], b['call'], diff_of_means, size=10000)
perm_replicates

array([-0.00985626,  0.00985626, -0.00082136, ..., -0.00657084,
       -0.00328542, -0.00246406])
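
The permutation test can also be sketched end to end without the data file. This version rebuilds the 0/1 callback arrays from the counts above, uses a fixed seed (an arbitrary choice for reproducibility), and compares the permutation replicates with the observed difference between the real groups. The names `observed_diff`, `p_one_sided` and `p_two_sided` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

n = 2435
black_calls = np.concatenate([np.ones(157), np.zeros(n - 157)])
white_calls = np.concatenate([np.ones(235), np.zeros(n - 235)])

# Test statistic on the real (unpermuted) groups
observed_diff = white_calls.mean() - black_calls.mean()

# Under the null the labels are exchangeable, so shuffle the pooled outcomes
pooled = np.concatenate([white_calls, black_calls])
n_reps = 10_000
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()

p_one_sided = np.mean(perm_diffs >= observed_diff)
p_two_sided = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print(f"observed diff = {observed_diff:.4f}")
print(f"one-sided p = {p_one_sided}, two-sided p = {p_two_sided}")
```

The key design point is that `observed_diff` comes from the original groups, while only the replicates come from shuffled data.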

In [40]:
# Your solution to Q3 here
#Standard error of the mean for black-sounding names (the 95% margin of error is 1.96 times this value):
bmoe = stats.sem(b['call'])
bmoe

0.0049781434352911685

In [43]:
#Standard error of the mean for white-sounding names (the 95% margin of error is 1.96 times this value):
wmoe = stats.sem(w['call'])
wmoe

0.005985301397503016
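
The margin of error is the standard error scaled by a critical value. The sketch below assumes a 95% confidence level and the normal approximation, and uses the plain proportion formula sqrt(p(1-p)/n), which differs very slightly from `stats.sem` (that function divides by n-1). Counts are hard-coded from the earlier cells; `moe` is a name introduced for illustration.

```python
import numpy as np
from scipy import stats

n = 2435
z_star = stats.norm.ppf(0.975)   # about 1.96 for a 95% confidence level

moe = {}
for group, successes in {"black": 157, "white": 235}.items():
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)   # standard error of a proportion
    moe[group] = z_star * se                # margin of error at 95% confidence
    print(f"{group}: SE = {se:.5f}, margin of error = {moe[group]:.5f}")
```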

In [80]:
#Confidence intervals
#We compute a 95% confidence interval for the callback rate of each group.
#The first interval is for black-sounding names and the second is for white-sounding names
bcipos = bsr + (1.96 * (bst / (2435**.5)))
bcineg = bsr - (1.96 * (bst / (2435**.5)))
print(bcipos, bcineg)

0.07423354779459014 0.05471922427933184


In [84]:
wcipos = wsr + (1.96 * (wst / (2435**.5)))
wcineg = wsr - (1.96 * (wst / (2435**.5)))
print(wcipos, wcineg)

0.10824043083160882 0.08477804966120432
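
A single interval for the difference in callback rates answers the question more directly than two per-group intervals. The sketch below shows two hedged approaches at the 95% level: a normal-approximation interval with an unpooled standard error, and a percentile bootstrap. For 0/1 data, resampling a group with replacement is equivalent to drawing the number of callbacks from Binomial(n, p_hat), which is the shortcut used here. The seed and all variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

x_b, x_w, n = 157, 235, 2435
p_b, p_w = x_b / n, x_w / n
diff = p_w - p_b

# Frequentist interval: unpooled standard error of the difference
se_diff = np.sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)
z_star = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z_star * se_diff, diff + z_star * se_diff

# Percentile bootstrap: resample each group's callback count, then take the difference
rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n_reps = 10_000
boot_w = rng.binomial(n, p_w, size=n_reps) / n
boot_b = rng.binomial(n, p_b, size=n_reps) / n
boot_low, boot_high = np.percentile(boot_w - boot_b, [2.5, 97.5])

print(f"normal approx: ({ci_low:.4f}, {ci_high:.4f})")
print(f"bootstrap:     ({boot_low:.4f}, {boot_high:.4f})")
```

If both intervals exclude zero, that is consistent with rejecting the null hypothesis of equal callback rates.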


In [None]:
#The two 95% confidence intervals do not overlap: the black callback rate (about 0.064) lies well below the white
#interval (0.085 to 0.108), and the white callback rate (about 0.097) lies well above the black interval (0.055 to 0.074)


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

In [95]:
# Question 4: Write a story describing the statistical significance in the context of the original problem
#Job hunting is a tough process no matter who you are.  However, it seems that the process is more difficult if you have a 
#black-sounding name.  In fact, a resume was roughly 50% more likely (about 1.5 times as likely) to get a call with a
#white-sounding name: 9.7% versus 6.4%.  The takeaway is that the name you put on your resume matters far more than you might think.

In [None]:
# Question 5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? 
#If not, how would you amend your analysis?
#  No. While there is a clear difference in callback rates between white-sounding and black-sounding names, this analysis
#only looks at race in isolation. The correlations at the top of the sheet show a weak association between race and callbacks
#(0.059 for whites and -0.059 for blacks), and some other features, such as 'specialskills' (0.11), correlate more strongly.
#Race does seem to be very important, but to rank it against other factors we would need to model them together,
#for example with a logistic regression of 'call' on race and the other resume features.