# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st
from scipy.stats import norm
import seaborn as sb


In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [5]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [9]:
data.head()

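|  | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |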


In [21]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [12]:
# Count duplicate rows. This is a basic sanity check on the data;
# on its own it does not establish that observations are independent.
len(data[data.duplicated()])

0

*There are no duplicate rows in the data.*

**1. What test is appropriate for this problem? Does CLT apply?**

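*A chi-squared test of independence is appropriate for this problem because both variables, `race` and `call`, are categorical. A two-sample test for the difference between two proportions is a closely related alternative.*

*The central limit theorem also applies: each callback is a Bernoulli outcome, names were assigned to résumés at random, and each group is large, so each group's sample callback rate is approximately normally distributed. This lets us compute confidence intervals for the callback probability of résumés with **black-sounding** versus **white-sounding** names.*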
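A common rule of thumb for the normal approximation is that each group should contain at least about 10 successes and 10 failures. The sketch below checks that condition; the helper name `clt_conditions_met` and the threshold of 10 are assumptions made for illustration, and the counts are the ones this notebook reports in its later cells.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures.

    `threshold=10` is a conventional choice, not a hard requirement.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Counts reported later in this notebook: 2435 resumes per group,
# 157 callbacks (black-sounding names) and 235 (white-sounding names).
print(clt_conditions_met(157, 2435))
print(clt_conditions_met(235, 2435))
```

Both groups clear the threshold by a wide margin, which supports using normal-based intervals and tests here.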

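**2. What are the null and alternate hypotheses?**



*H0: Race has no impact on receiving an interview call, which implies P(b) = P(w).*

*H1: Race has an impact on receiving an interview call, which implies P(b) ≠ P(w).*

*Here P(b) and P(w) denote the callback probabilities for résumés with black-sounding and white-sounding names.*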
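Because names were assigned to résumés at random, H0 can also be probed with a permutation test: if race has no effect, shuffling the callback outcomes between the two groups should produce gaps as large as the observed one fairly often. The following is a minimal sketch, not part of the original analysis. It rebuilds the 0/1 outcomes from the counts this notebook reports in later cells, and the seed and the number of shuffles are arbitrary choices.

```python
import numpy as np

# Counts reported later in this notebook.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

# Rebuild the pooled 0/1 callback outcomes for all resumes.
outcomes = np.concatenate([
    np.ones(calls_b + calls_w),
    np.zeros(n_b + n_w - calls_b - calls_w),
])
observed_diff = calls_w / n_w - calls_b / n_b

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n_perm = 5000                   # number of shuffles, arbitrary choice
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(outcomes)
    # Relabel at random: first n_w outcomes play "white", the rest "black".
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(observed_diff, p_perm)
```

With only a few thousand shuffles the estimated p-value is coarse, so it should be read as "very small" rather than as a precise figure.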

**3. Compute margin of error, confidence interval, and p-value.**



In [30]:
n_b = len(data[data.race=='b']) # number of résumés with black-sounding names
n_w = len(data[data.race=='w']) # number of résumés with white-sounding names
n_bcald = sum(data[data.race=='b'].call) # callbacks for black-sounding names
n_wcald = sum(data[data.race=='w'].call) # callbacks for white-sounding names
p_b = n_bcald/n_b # callback rate for black-sounding names = mean of the corresponding Bernoulli distribution
p_w = n_wcald/n_w # callback rate for white-sounding names = mean of the corresponding Bernoulli distribution

sd_b = np.sqrt(p_b*(1-p_b)) # standard deviation of the Bernoulli distribution (black-sounding names)
sd_w = np.sqrt(p_w*(1-p_w)) # standard deviation of the Bernoulli distribution (white-sounding names)

n_b,p_b,sd_b,n_w,p_w,sd_w

(2435,
 0.064476386036960986,
 0.24559963697158382,
 2435,
 0.096509240246406572,
 0.29528834517039093)

In [17]:

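# Function to calculate a confidence interval for a proportion:
def conf_int_cat(p,n,alpha):
    '''
    p = probability of success in a Bernoulli distribution
    n = sample size
    alpha = level of significance

    Returns (confidence interval, margin of error).
    Relies on the notebook-level imports `np` and `st`.
    '''
    SE = np.sqrt(p*(1-p))/np.sqrt(n)  # standard error of the sample proportion
    # Critical value from Student's t with n-1 degrees of freedom;
    # for large n it is nearly identical to the normal z critical value.
    t_crit = st.t.ppf(1-alpha/2,n-1)
    mean = p
    margin_err = t_crit*SE
    conf_interval = (mean-margin_err,mean+margin_err)
    print('Confidence Interval: ',conf_interval,' with margin of error = %f'%(margin_err))
    return conf_interval,margin_err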


In [26]:
# Function calls for confidence intervals
conf_int_cat(p_b,n_b,0.05) # Confidence interval for the callback probability of black-sounding names


Confidence Interval:  (0.054716553993800515, 0.074236218080121458)  with margin of error = 0.009760


((0.054716553993800515, 0.074236218080121458), 0.0097598320431604678)

In [27]:
conf_int_cat(p_w,n_w,0.05) # Confidence interval for the callback probability of white-sounding names

Confidence Interval:  (0.084774839134489383, 0.10824364135832376)  with margin of error = 0.011734


((0.084774839134489383, 0.10824364135832376), 0.011734401111917184)
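The two intervals above describe each group separately. For the question of whether the groups differ, a confidence interval for the difference in callback rates is more direct. The sketch below is an addition, not part of the original analysis: it uses the unpooled standard error and a normal critical value, with the counts hard-coded from the results above so that it runs on its own.

```python
import numpy as np
from scipy import stats

# Counts taken from the results above.
n_b, n_w = 2435, 2435
p_b, p_w = 157 / n_b, 235 / n_w

diff = p_b - p_w
# Unpooled standard error of the difference between two proportions.
se_diff = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)
z_crit = stats.norm.ppf(0.975)  # two-sided 95% critical value
margin = z_crit * se_diff
ci_diff = (diff - margin, diff + margin)
print('difference:', diff)
print('margin of error:', margin)
print('95% CI:', ci_diff)
```

The whole interval lies below zero, which is consistent with a lower callback rate for black-sounding names.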

In [28]:
# Two-sample t-test (Welch's, unequal variances) of whether both groups have the same probability of being called for an interview.
st.ttest_ind_from_stats(p_b,sd_b,n_b, p_w, sd_w, n_w, equal_var=False)

Ttest_indResult(statistic=-4.1155504357300003, pvalue=3.928572390043594e-05)

*From the results above, the p-value is below 0.05, so we reject the null hypothesis and conclude that the callback rate depends on the race implied by the applicant's name.*
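For binary outcomes, a pooled two-proportion z-test is the more conventional counterpart of the t-test above. The following sketch is an alternative shown for comparison, not the method used in this notebook. The counts are hard-coded from the earlier results so that the cell runs on its own.

```python
import numpy as np
from scipy import stats

# Counts taken from the results above.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235
p_b, p_w = calls_b / n_b, calls_w / n_w

# Under H0 both groups share one callback probability, estimated by pooling.
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z_stat = (p_b - p_w) / se_pool
p_value = 2 * stats.norm.sf(abs(z_stat))  # two-sided p-value
print(z_stat, p_value)
```

The statistic is close to the t-test result above, so both approaches lead to the same decision at the 0.05 level.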

In [31]:
# The same analysis can be run directly as a chi-squared test.
# Contingency table: rows = called / not called, columns = black-sounding / white-sounding
cont_table = np.array([[n_bcald,n_wcald],[n_b-n_bcald,n_w-n_wcald]])
cont_table

array([[  157.,   235.],
       [ 2278.,  2200.]])

In [32]:
st.chi2_contingency(cont_table)

(16.449028584189371, 4.9975783899632552e-05, 1, array([[  196.,   196.],
        [ 2239.,  2239.]]))

*From the results above, the p-value (≈ 0.00005) is below 0.05, so we reject the null hypothesis and conclude that the callback rate depends on the race implied by the applicant's name.*
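To see where the expected counts and the test statistic come from, the chi-squared computation can be reproduced by hand. This sketch assumes the Yates continuity correction, which `chi2_contingency` applies by default to 2x2 tables; the table is hard-coded from the output above.

```python
import numpy as np
from scipy import stats

# Rows: called / not called. Columns: black-sounding / white-sounding.
observed = np.array([[157.0, 235.0],
                     [2278.0, 2200.0]])

# Expected count under independence = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic with Yates continuity correction (2x2 table, 1 dof).
chi2_yates = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
p_value = stats.chi2.sf(chi2_yates, df=1)
print(expected)
print(chi2_yates, p_value)
```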

**4. Write a story describing the statistical significance in the context of the original problem.**


**Assuming the null hypothesis is true:** we would expect 196 callbacks for résumés with black-sounding names and 196 for résumés with white-sounding names.

**p-value from the chi-squared test:** p ≈ 0.00005 is much smaller than 0.05, and the observed counts (157 and 235 callbacks) differ significantly from the expected counts. This is statistically significant evidence of racial discrimination at the callback stage of the hiring process.
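Statistical significance says the gap is unlikely to be chance; it does not say how large the gap is. A short sketch of simple effect-size measures, using the counts from the contingency table above (the variable names are illustrative):

```python
# Counts taken from the contingency table above.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

rate_b = calls_b / n_b
rate_w = calls_w / n_w

abs_gap = rate_w - rate_b        # difference in callback rates
rate_ratio = rate_w / rate_b     # relative callback rate, white vs black
odds_ratio = (calls_w * (n_b - calls_b)) / (calls_b * (n_w - calls_w))

print('absolute gap:', abs_gap)
print('rate ratio:', rate_ratio)
print('odds ratio:', odds_ratio)
```

The rate ratio is roughly 1.5, meaning résumés with white-sounding names received about 50% more callbacks in this sample.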



**5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?**

*The analysis shows a statistically significant racial gap in callbacks.
However, it does not show that race is the most important factor, or the only one, in receiving an interview call, because the tests above consider only `race` and `call`.*

Analysing the other variables in `data` (for example `education`, `yearsexp`, or `honors`) would help identify what else contributes to interview calls and how large the race effect is by comparison.