Adapted and extended from Sections 2.1, 2.2, & 2.3 of Imai, Kosuke (2017). *Quantitative 
Social Science: An Introduction.* Princeton, NJ: Princeton University Press.

# 1 Introduction



## 1.1 Overview - Causality

The sciences, whether physical, life, or social, aim to capture the causal workings of the world.    

* Does this particular vaccine prevent infection?  
* Which get-out-the-vote efforts are most impactful on voter turnout? 
* Does the introduction of a catalyst increase the rate of a chemical reaction?

To establish causation, scientists often conduct experiments comparing results in the presence and absence of potential causes.

In this sequence of lessons, we begin with a superficial analysis of data from a very famous social science experiment.  Reflecting on the results will raise a deeper question, which leads us to our next topic.

For the most part we will use the same Python commands we used in previous notebooks.  The one addition will be the Pandas dataframe method `pivot_table()`.

## 1.2 "Are Emily and Greg more employable than Lakisha and Jamal?"



That is the title of a well-known study published in 2004 by Marianne Bertrand and Sendhil Mullainathan in the *American Economic Review*, vol. 94, pp. 991-1013.  You can read a PDF of it [here](https://www.aeaweb.org/articles?id=10.1257/0002828042002561).  On [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=lakisha+and+jamal&oq=lakis), you can see that this article has been cited over 5,300 times!



The **first 32 seconds** of this video give you a quick snapshot of the study.

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('fTcSVQJ2h8g', start=0, end=32)  # extra keyword arguments become URL parameters: play seconds 0-32 only

The first part of the journal article's abstract:
>We perform a field experiment to measure racial discrimination in the labor market. We respond
with fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate
perception of race, each resume is assigned either a very African American sounding name or a very
White sounding name. The results show significant discrimination against African-American
names: White names receive 50 percent more callbacks for interviews. ...

Let's look at some of the data.

# 2 Imports


In [2]:
import pandas as pd
import numpy as np
import altair as alt

# 3 Analyzing the data

A subset of the experimental data is available in Kosuke Imai's GitHub repo for his book *Quantitative Social Science: An Introduction*. 

For this data set the fields are:   

| **Variable** | **Description**| **Values**|
|--- | --- | --- |
| `firstname`| first name of the fictitious job applicant | Ex: `Jamal`, `Emily`|
|`sex`| sex of applicant | `female` or `male` |
|`race`| race of applicant |`black` or `white`|
|`call`|whether a callback was made | `1`=yes, `0`=no|


## 3.1 Reading it in

To read it in:
* go to Kosuke Imai's github repo [here](https://github.com/kosukeimai/qss/tree/master/CAUSALITY)
* find the file labeled `resume.csv`  
* copy the link for the `RAW` version
* use the command `pd.read_csv()` 
* read it into a dataframe named `resumes`
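
Before pointing `pd.read_csv()` at the real file, it can help to see what the call does on a tiny stand-in. The sketch below is only an illustration: the three rows are made up (they are not taken from `resume.csv`), and `sample` is a hypothetical dataframe name. `pd.read_csv()` accepts a URL string in exactly the same position as the file-like object used here.

```python
import io
import pandas as pd

# Made-up rows with the same four columns as resume.csv (NOT real data).
sample_csv = io.StringIO(
    "firstname,sex,race,call\n"
    "Emily,female,white,1\n"
    "Jamal,male,black,0\n"
    "Lakisha,female,black,0\n"
)

# pd.read_csv() accepts a file-like object, a local path, or a URL string.
sample = pd.read_csv(sample_csv)

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
```

Checking `.shape` and `.dtypes` right after loading is a quick way to confirm that the file was read as a table and that `call` came in as a number.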

## 3.2  `resumes` counts...  (w QUIZ)

Complete the tasks in the next two subsections and then answer the questions in Canvas quiz *QUIZ 05: Resumes A*.

### 3.2.1 ...via `value_counts()`

You can answer the following questions using `value_counts()`.  Just run it on a column using either
* `dataframe['column_name'].value_counts()`
* `dataframe.column_name.value_counts()`

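**Note:**
you can only use the second method when the column name is a valid Python name (no spaces, for example) and does not clash with an existing dataframe attribute or method such as `count`.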
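
As a small illustration of the two access styles, here is a sketch on a made-up five-row dataframe (the name `toy` and its values are hypothetical, not the resume data). It also shows the `normalize=True` option, which reports proportions instead of counts.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "race": ["black", "white", "white", "black", "white"],
    "call": [0, 1, 0, 0, 1],
})

counts_bracket = toy["race"].value_counts()   # bracket style: works for any column name
counts_dot = toy.race.value_counts()          # dot style: works because 'race' is a valid Python name

# normalize=True returns the share of rows in each category instead of the count.
shares = toy["race"].value_counts(normalize=True)

print(counts_bracket)
print(shares)
```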

**DO THIS!:** Run `value_counts()` on all of the columns.    

To see all of the many `firstname` values, run the next cell first.

In [3]:
pd.set_option('display.max_rows', None)

### 3.2.2 ...via `pivot_table()` 

Let's try the `pivot_table()` method.  It produces slightly different information from what `value_counts()` gives when run on individual columns. 

**Notes:** 
* there is also a top-level Pandas function, `pd.pivot_table(df, ...)`, that does the same job as the `dataframe.pivot_table()` method
* you can write the `pivot_table()` call on one line or split it across several lines to make it more readable. Because the arguments sit inside parentheses, Python continues the statement onto the next line automatically, so no `\` line-continuation symbol is needed. This layout also lets you comment (via #) mid-command.

In [4]:
resumes.pivot_table(
    index='sex',      # the variable for the rows in your pivot table
    columns='race',   # the variable for the columns of your pivot table
    values='call',    # the column being aggregated; since we are just counting, the choice doesn't matter much
    aggfunc=len)      # len counts the rows in each group


To calculate totals and subtotals, just add the parameter `margins=True` as below.  Here's what it looks like written on one line.

In [None]:
resumes.pivot_table(index='sex', columns='race', values='call', aggfunc=len, margins=True)

Not only do you get the `value_counts()` for `race` and `sex`, but you also get the total number of resumes, as well as the counts for `black`-`female`, `black`-`male`, `white`-`female`, and `white`-`male`.  Useful!

You can read more about `pivot_table()` [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html).  There is also a `crosstab()` command which is similar, but has some slightly different features.
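
To make the comparison with `crosstab()` concrete, here is a minimal sketch on a made-up six-row dataframe (`toy` and its values are hypothetical). It builds the same table of counts twice: once with the `pivot_table()` method and once with `pd.crosstab()`, which counts by default and so needs no `values` or `aggfunc`.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "sex":  ["female", "female", "male", "male", "female", "male"],
    "race": ["black", "white", "black", "white", "white", "white"],
    "call": [0, 1, 0, 0, 1, 1],
})

# Counts via pivot_table(): a values column and an aggregation function are required.
by_pivot = toy.pivot_table(index="sex", columns="race", values="call",
                           aggfunc=len, margins=True)

# Counts via crosstab(): pass the two columns directly; counting is the default.
by_crosstab = pd.crosstab(toy["sex"], toy["race"], margins=True)

print(by_pivot)
print(by_crosstab)
```

Both tables label the totals row and column `All`.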

## 3.3 Analyzing the overall results



Now let's *finally* get to the IMPORTANT question:  did `black` resumes receive callbacks at a lower rate than `white` ones?  

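Because `call` is coded `1` for a callback and `0` otherwise, the mean of `call` within a group is that group's callback rate. So you could easily use `.groupby()` to calculate these rates. 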

In [None]:
resumes.groupby('race')['call'].mean()

So about 6.4% of the `black` resumes received callbacks, whereas about 9.7% of the `white` resumes did. That seems like a pretty big deal, both numerically and in what it says about the world.  You can see why there has been such interest in this study.
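
If it is not obvious why a mean gives a rate, the following sketch checks it on a made-up eight-row dataframe (`toy` is hypothetical): the group mean of a 0/1 column matches the number of callbacks divided by the number of resumes in that group.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "race": ["black", "black", "black", "black", "white", "white", "white", "white"],
    "call": [0, 0, 0, 1, 0, 1, 0, 1],
})

grouped = toy.groupby("race")["call"]

rate_by_mean = grouped.mean()                   # mean of the 0/1 values
rate_by_hand = grouped.sum() / grouped.count()  # callbacks divided by resumes

print(rate_by_mean)
print(rate_by_hand)
```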

You could also use a pivot table to break down this data by `sex` in addition to `race`. We're going to assign the resulting pivot table to the variable `resumes_pt`.

In [None]:
resumes_pt = resumes.pivot_table(
    index='sex',
    columns='race',
    values='call',
    aggfunc='mean',   # average the 0/1 call values, which gives the callback rate
    margins=True
)
resumes_pt

Again, you can see that `white` resumes receive a call roughly 9.65% of the time, whereas `black` resumes receive a call roughly 6.45% of the time.  You can also see the breakdown by `sex`.  

Since `resumes_pt` is itself a dataframe, we can easily create a new column that is calculated from other columns, as we have done in earlier notebooks.

In [None]:
resumes_pt['w2b'] = resumes_pt['white'] / resumes_pt['black']   # ratio of white to black callback rates
resumes_pt

With this new column, you can see that the 50% increase in call rate for white resumes over black resumes was pretty consistent (and discouraging) across `male` and `female` groups.
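
To connect the ratio to the phrase "50 percent more", here is the arithmetic for the overall rates quoted above (roughly 9.65% and 6.45%). The variable names are just for illustration.

```python
# Overall callback rates quoted in the text (rounded).
white_rate = 0.0965
black_rate = 0.0645

ratio = white_rate / black_rate      # how many times higher the white rate is
percent_more = (ratio - 1) * 100     # the same comparison as a percentage increase

print(round(ratio, 2), round(percent_more, 1))
```

A ratio close to 1.5 is the same statement as "about 50 percent more callbacks".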

## 3.4 The results by name

Now rank the names by `call` rate by using 
* `groupby()` 
* `[]` column selection
* `mean()`
* `sort_values()`

And then answer the following questions on *QUIZ 05: Resumes B* on Canvas.

* What name had the highest rate for `call`?
* What name had the lowest rate?
* How many names were above the overall average rate for all resumes that you found in the pivot table?
* How many black names were above this average?
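
The four steps listed above can be chained together. The sketch below shows one possible pattern on a made-up dataframe; `toy`, the placeholder names `Ann`, `Bea`, and `Cal`, and the variable names are all hypothetical, so you will still need to run the equivalent on `resumes` to answer the quiz.

```python
import pandas as pd

# Made-up data for illustration only; these names are not from the study.
toy = pd.DataFrame({
    "firstname": ["Ann", "Ann", "Bea", "Bea", "Cal", "Cal", "Cal", "Cal"],
    "call":      [1, 1, 0, 1, 0, 0, 0, 0],
})

# groupby -> column selection -> mean -> sort, highest rate first.
rate_by_name = (
    toy.groupby("firstname")["call"]
       .mean()
       .sort_values(ascending=False)
)

overall_rate = toy["call"].mean()                   # callback rate across all rows
n_above = int((rate_by_name > overall_rate).sum())  # number of names above that rate

print(rate_by_name)
print(overall_rate, n_above)
```

Comparing a Series with a single number gives a Series of `True`/`False` values, and summing it counts the `True` entries.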

## 3.5 Something to ponder

White resumes receive approximately 50% more calls than black resumes, which seems to be evidence of racial bias. But doesn't the fact that several black names received calls at an above-average rate undermine that claim?

In the following lessons, we'll develop a theory that will help us resolve this tension.

But first...

# 4 You try it!

Here is an excerpt from the end of the abstract of the Bertrand and Mullainathan article:

>...The amount of discrimination is uniform across occupations and industries. Federal contractors and
employers who list “Equal Opportunity Employer” in their ad discriminate as much as other
employers. We find little evidence that our results are driven by employers inferring something
other than race, such as social class, from the names. These results suggest that racial discrimination
is still a prominent feature of the labor market.

I created a very abridged version of the data (acquired from [here](https://dev.openicpsr.org/openicpsr/project/108486/version/V1/view)) and put it on the repo for this course.  It has only four columns.

|**variable**|**description**|**values**|
|--- | --- | --- |
|race|race of applicant|`b`=black; `w`=white|
|city|city of employer and applicant | `b`=Boston; `c`=Chicago|
|eoe|whether the employer declares itself an 'equal opportunity employer'|`1`=yes; `0`=no|
|call|whether a callback was made| `1`=yes; `0`=no|

The following line will load this data into a dataframe named `resumes_city_eoe`.

In [None]:
resumes_city_eoe = pd.read_csv("https://raw.githubusercontent.com/sawula/Survey-of-Python-data-science-stack/gh-pages/resumes_city_eoe.csv")

Answer these two questions using two different pivot tables.  

* Is the racial bias different in the different cities?
* Is the racial bias different among equal opportunity employers? (The abstract says no!)

To answer these questions, calculate the ratio of the white call rate to the black call rate for each group.  Once you've answered these, take *QUIZ 05: Resumes C*.