# CSS Bootcamp

## Day 4 (Multiple and logistic regression): Lab

This lab is intended to accompany **Day 4** of the week on **Statistics**, which focuses on:

- Building and interpreting **multiple linear regression** models.  
- The theoretical foundations of **logistic regression**.  
- Building and interpreting **logistic regression** models. 

This lab has some "free response" questions, in which you are asked to describe or make some inference from a graph. 

It also has questions that require you to write Python code. Some of these use built-in functions we've discussed in class (either today or in previous weeks). Others use a built-in function that we *haven't* discussed, which you will have to look up in the documentation. In still others, you'll be asked to write an original function.

Please reach out for help if anything is unclear!

#### Key imports

Here, we import some of the libraries that will be critical for the lab.

In [1]:
import matplotlib.pyplot as plt
import math
import numpy as np
import seaborn as sns
import scipy.stats as ss
import statsmodels.formula.api as smf
import pandas as pd

%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!



# Part 1: Multiple linear regression

In this first section, we'll revisit a dataset from **Day 3**, which contains judgments about the **relatedness** of ambiguous words across different contexts.  

##### Load data

First, let's load a version of the dataset, which can be found in `data/lab/raw-c.csv`. (I've copied the same data file from the `data/lab` folder under the **Day 3** directory for ease of use.)

In [2]:
df_rawc = pd.read_csv("data/lab/raw-c.csv")
df_rawc.head(5)

| Unnamed: 0 | word | sentence1 | sentence2 | same | ambiguity_type | mean_relatedness | distance_bert | distance_elmo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | act | It was a desperate act. | It was a magic act. | False | Polysemy | 2.181818 | 0.20411 | 0.034093 |
| 1 | act | It was a desperate act. | It was a comedic act. | False | Polysemy | 2.0 | 0.215616 | 0.045927 |
| 2 | act | It was a humane act. | It was a magic act. | False | Polysemy | 2.818182 | 0.191488 | 0.042351 |
| 3 | act | It was a humane act. | It was a comedic act. | False | Polysemy | 2.809524 | 0.225272 | 0.057707 |
| 4 | act | It was a desperate act. | It was a humane act. | True | Polysemy | 3.9 | 0.16799 | 0.04144 |


##### First, build a linear model using `statsmodels` predicting `mean_relatedness`, with `distance_bert` and `same` as predictors.

In [10]:
#### Your code here.
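
As a rough illustration of the formula interface (not a solution to this exercise), the sketch below fits `smf.ols` to synthetic data with one continuous and one boolean predictor. The column names `y`, `x`, and `flag`, and the simulated coefficients, are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: column names and true coefficients are invented for illustration.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "x": rng.uniform(0, 1, n),
    "flag": rng.integers(0, 2, n).astype(bool),
})
toy["y"] = 1.0 - 2.0 * toy["x"] + 1.5 * toy["flag"] + rng.normal(0, 0.3, n)

# Fit y as a function of a continuous and a boolean predictor.
toy_model = smf.ols("y ~ x + flag", data=toy).fit()
print(toy_model.params)  # three coefficients: the intercept, x, and the flag contrast

# Predictions for two new rows with the same x but different values of flag.
new_rows = pd.DataFrame({"x": [0.3, 0.3], "flag": [False, True]})
toy_preds = toy_model.predict(new_rows)
print(toy_preds)
```

The gap between the two predictions is the estimated coefficient for `flag`, since the rows differ only in that predictor.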

##### Write out the linear equation corresponding to these parameters.

In [32]:
#### Your code here.

##### According to this model, what is the predicted relatedness for different-sense pairs (`same == False`) that have a `distance_bert` of $0.3$?

In [33]:
#### Your code here.

##### According to this model, what is the predicted relatedness for same-sense pairs (`same == True`) that have a `distance_bert` of $0.3$?

In [35]:
#### Your code here.
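
For the two prediction questions above, it can help to see the linear equation evaluated by hand. The sketch below uses hypothetical coefficients (they are *not* the fitted values for this dataset), and the helper name `predict_linear` is just for illustration.

```python
def predict_linear(intercept, b_distance, b_same, distance, same):
    """Evaluate y = intercept + b_distance * distance + b_same * same."""
    # A boolean predictor enters the equation as 0 (False) or 1 (True).
    return intercept + b_distance * distance + b_same * int(same)

# Hypothetical coefficients, NOT the fitted values for this dataset.
b0, b_dist, b_same = 2.0, -4.0, 1.0

different_sense = predict_linear(b0, b_dist, b_same, distance=0.3, same=False)
same_sense = predict_linear(b0, b_dist, b_same, distance=0.3, same=True)
print(different_sense, same_sense)
```

Holding distance fixed, the two predictions differ by exactly the coefficient on the boolean predictor.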

##### Use `predict` to collect the model's predictions, then plot them against the real values of `mean_relatedness`.

Note: when visualizing this, I found it helpful to use `sns.scatterplot`, and set the `hue` to `df_rawc['same']`. 

In [37]:
#### Your code here.

##### Plot the model's residuals against `distance_bert`. 

In [39]:
#### Your code here.
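
If it's unclear what heteroscedastic residuals look like, the sketch below simulates data whose noise grows with the predictor, computes residuals as observed minus predicted, and compares their spread across the lower and upper halves of `x`. Everything here is synthetic. With a fitted `statsmodels` OLS model, the residuals are available directly as the `resid` attribute of the results object.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
# The noise standard deviation grows with x, so these data are
# heteroscedastic by construction.
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + 0.5 * x)

# Least-squares line, then residuals = observed - predicted.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Crude check: compare residual spread in the lower and upper halves of x.
low_spread = residuals[x < np.median(x)].std()
high_spread = residuals[x >= np.median(x)].std()
print(low_spread, high_spread)
```

In a residual plot, this pattern shows up as a "fan" shape: the points spread further from zero as `x` increases.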

##### How would you interpret these residuals? Is there evidence of heteroscedasticity?

In [41]:
#### Your response here.

##### Build and compare (visually, e.g., in a barplot) the $R^2$ of three models (described below).

- A model with only `same`. 
- A model with only `distance_bert`.
- A model with both predictors.

In [30]:
#### Your code here.
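
As a reminder of what $R^2$ measures, the sketch below computes it from its definition, $1 - SS_{res}/SS_{tot}$, for three nested models fit to synthetic data. The helper `r_squared` and the variables `a` and `b` are made up for this example; with `statsmodels`, the same quantity is available as the `rsquared` attribute of a fitted OLS model.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coefs) ** 2)  # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation
    return 1 - ss_res / ss_tot

# Synthetic data: one continuous predictor (a) and one binary predictor (b).
rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
b = rng.integers(0, 2, n).astype(float)
y = 0.5 * a + 2.0 * b + rng.normal(0, 1, n)

r2 = {
    "A only": r_squared(a, y),
    "B only": r_squared(b, y),
    "Both": r_squared(np.column_stack([a, b]), y),
}
print(r2)
```

A dictionary like `r2` can be passed to a bar plot, e.g. `sns.barplot(x=list(r2.keys()), y=list(r2.values()))`. Because the reduced models are nested in the full one, the full model's $R^2$ can never be lower than either reduced model's.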

##### Considering each of the "reduced" models (i.e., `Same` and `Distance`), which model has a bigger difference in $R^2$ with `Both`? What does this mean?

In [48]:
#### Your response here.

# Part 2: Logistic regression

### Predicting hiring discrimination.

In this section, we'll consider a [dataset](https://www.openintro.org/data/index.php?data=resume) from a [now well-known paper](https://www.aeaweb.org/articles?id=10.1257/0002828042002561) (Bertrand & Mullainathan, 2004). Here's what the authors did:

> We study race in the labor market by sending fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate perceived race, resumes are randomly assigned African-American- or White-sounding names. 

The authors then measured whether resumes received a **callback** (yes or no).

Importantly, this was an **experiment**: resumes were randomly assigned to one of two treatment conditions (i.e., whether the name given was judged separately as being more likely to belong to an African-American or a White applicant).

#### Load data

The data can be found in `data/lab/resume.csv`. There are a *ton* of columns/variables here, but for our purposes, we're focused on just a few:

- `received_callback`: whether or not a resume received a callback.  
- `years_experience`: how many years of experience the applicant has.  
- `race`: whether the resume was coded as `white` or `black`.  

In [83]:
df_resume = pd.read_csv("data/lab/resume.csv")
df_resume.head(2)

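| Unnamed: 0 | job_ad_id | job_city | job_industry | job_type | job_fed_contractor | job_equal_opp_employer | job_ownership | job_req_any | job_req_communication | job_req_education | ... | honors | worked_during_school | years_experience | computer_skills | special_skills | volunteer | military | employment_holes | has_email_address | resume_quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 384 | Chicago | manufacturing | supervisor |  | 1 | unknown | 1 | 0 | 0 | ... | 0 | 0 | 6 | 1 | 0 | 0 | 0 | 1 | 0 | low |
| 1 | 384 | Chicago | manufacturing | supervisor |  | 1 | unknown | 1 | 0 | 0 | ... | 0 | 1 | 6 | 1 | 0 | 1 | 1 | 0 | 1 | high |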


##### How many observations are there in this dataset?

In [84]:
#### Your code here

##### What proportion of resumes received a callback?

In [86]:
#### Your code here

##### How many resumes were coded as `white` and how many as `black`?

In [88]:
#### Your code here

##### What is the average `years_experience`, and what does the distribution look like?

In [91]:
#### Your code here
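
The descriptive questions above mostly come down to a handful of `pandas` calls. Here's a minimal sketch on a tiny made-up DataFrame (the column names `outcome`, `group`, and `years` are placeholders, not columns of the resume data):

```python
import pandas as pd

# Tiny made-up dataset, for illustration only.
toy = pd.DataFrame({
    "outcome": [0, 1, 0, 0, 1, 0],
    "group": ["a", "b", "a", "b", "a", "a"],
    "years": [2, 5, 7, 1, 4, 5],
})

n_rows = toy.shape[0]                    # number of observations
prop = toy["outcome"].mean()             # mean of a 0/1 column is a proportion
counts = toy["group"].value_counts()     # number of rows per category
summary = toy["years"].describe()        # mean, std, quartiles, etc.
print(n_rows, prop)
print(counts)
print(summary)
```

For the shape of a distribution, something like `sns.histplot` on the column of interest is one reasonable option.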

##### Let's start simple. Fit a model using `smf.logit` predicting `received_callback` from `years_experience`. 

In [94]:
#### Your code here
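
In case the `smf.logit` interface is new, here's a sketch of the same kind of fit on simulated data. The column names `x` and `y` and the true coefficients are invented for the example. The outcome needs to be coded as 0/1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a binary outcome whose log-odds are linear in x.
rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0, 10, n)
true_log_odds = -2.0 + 0.3 * x
p = 1 / (1 + np.exp(-true_log_odds))
toy = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# Fit the logistic regression; disp=0 silences the optimizer's progress output.
logit_model = smf.logit("y ~ x", data=toy).fit(disp=0)
print(logit_model.params)  # coefficients are on the log-odds scale
print(logit_model.aic)
```

The estimated slope should land close to the value used in the simulation, though it won't match exactly because of sampling noise.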

##### Print out the parameters using `mod.params`.  

In [96]:
#### Your code here

##### Write out the equation corresponding to these parameters.

Remember that $Y$ is actually the **log odds**, i.e., $\log\left(\frac{p(y)}{1 - p(y)}\right)$.

In [99]:
#### Your code here

##### What is the predicted log-odds of someone with 2 years of experience getting a callback?

In [103]:
#### Your code here

##### What is the predicted probability of someone with 2 years of experience getting a callback?

Recall that log-odds (here, $LO$) can be converted to probability using the following equation:

$\Large \frac{e^{LO}}{1 + e^{LO}}$

In [105]:
#### Your code here
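
The conversion formula above translates directly into a small helper. The function names below are my own, and the sketch also includes the inverse (probability to log-odds) as a sanity check.

```python
import math

def log_odds_to_prob(log_odds):
    """Convert log-odds to a probability: e^LO / (1 + e^LO)."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def prob_to_log_odds(p):
    """Inverse transformation: log(p / (1 - p))."""
    return math.log(p / (1 - p))

print(log_odds_to_prob(0.0))   # log-odds of 0 correspond to even odds: 0.5
print(log_odds_to_prob(-2.0))  # negative log-odds give a probability below 0.5
print(prob_to_log_odds(log_odds_to_prob(1.3)))  # round trip recovers about 1.3
```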

##### Consider a (ridiculous) range of years of experience, from $1$ to $200$. Use the equations above to produce a probability of getting a callback for each value of `years_experience`. Then plot these using `sns.lineplot`. 

In [115]:
#### Your code here
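
One way to approach this is to vectorize the two steps (linear equation, then conversion to probability) with `numpy`. The coefficients below are hypothetical placeholders rather than the fitted values, so substitute your own from `mod.params`.

```python
import numpy as np

# Hypothetical coefficients, NOT the fitted values for this dataset.
b0, b1 = -2.5, 0.02

years = np.arange(1, 201)            # 1, 2, ..., 200
log_odds = b0 + b1 * years           # linear on the log-odds scale
probs = np.exp(log_odds) / (1 + np.exp(log_odds))  # S-shaped on the probability scale

print(probs[:3], probs[-3:])
```

The result can then be plotted with `sns.lineplot(x=years, y=probs)`. Note that the line is straight in log-odds but curved in probability, and it never leaves the interval between 0 and 1.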

##### Now add `race` as a covariate to the model predicting `received_callback` (along with a covariate of `years_experience`). 

In [94]:
#### Your code here

##### Compare the AIC of this more complex model to the model with just `years_experience`. 

In [94]:
#### Your code here
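
As a refresher, AIC is defined as $2k - 2\ln(L)$, where $k$ is the number of estimated parameters and $L$ is the maximized likelihood; lower values indicate a better trade-off between fit and complexity. The sketch below applies that definition to two hypothetical log-likelihoods (made-up numbers, not results from this dataset). With `statsmodels`, the value is available as the `aic` attribute of a fitted model.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods, invented for illustration only.
aic_simple = aic(log_likelihood=-1360.0, n_params=2)   # intercept + one predictor
aic_complex = aic(log_likelihood=-1352.0, n_params=3)  # intercept + two predictors

print(aic_simple, aic_complex)
print("difference:", aic_simple - aic_complex)
```

In this made-up case, the extra parameter improves the log-likelihood by more than its penalty, so the more complex model ends up with the lower AIC.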

##### Which AIC is bigger? What does that mean about the more complex model? Is it better? Why or why not?

In [94]:
#### Your response here

##### Print out the parameters of the more complex model using `params`. Are these significant? (Hint: the p-values are available via `pvalues` or `summary()`.)

In [139]:
#### Your code here

##### Write out the equation corresponding to these parameters.

Remember that $Y$ is actually the **log odds**, i.e., $\log\left(\frac{p(y)}{1 - p(y)}\right)$.

In [141]:
#### Your code here

##### How should we interpret the $\beta$ parameter for `race`? What does this mean?

Notes:

- Feel free to answer this qualitatively (we'll be more precise in a moment).
- This answer should include a reference to how the levels of `race` are coded.  

In [142]:
#### Your response here
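
A note that may help with interpretation: a coefficient on a categorical predictor is a *difference in log-odds* relative to the reference level, so exponentiating it gives an odds ratio. The sketch below checks this with hypothetical coefficients (not the fitted values). My understanding is that the formula interface treatment-codes string columns with levels sorted alphabetically by default, which would make `black` the reference level here, but it's worth confirming against the parameter names in your own output.

```python
import math

# Hypothetical coefficients, NOT the fitted values for this dataset.
b0, b_years, b_group = -2.7, 0.04, 0.4

def odds(years, in_group):
    """Odds implied by the model: exp(log-odds)."""
    return math.exp(b0 + b_years * years + b_group * int(in_group))

# The ratio of odds between the two levels is the same at any value of years...
ratio_at_2 = odds(2, True) / odds(2, False)
ratio_at_20 = odds(20, True) / odds(20, False)

# ...and equals exp(coefficient).
odds_ratio = math.exp(b_group)
print(ratio_at_2, ratio_at_20, odds_ratio)
```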

##### What is the predicted log-odds of a Black applicant with 2 years of experience getting a callback?

In [143]:
#### Your code here

##### What is the predicted probability of a Black applicant with 2 years of experience getting a callback?

Recall that log-odds (here, $LO$) can be converted to probability using the following equation:

$\Large \frac{e^{LO}}{1 + e^{LO}}$

In [145]:
#### Your code here

##### What is the predicted log-odds of a White applicant with 2 years of experience getting a callback?

In [147]:
#### Your code here

##### What is the predicted probability of a White applicant with 2 years of experience getting a callback?

Recall that log-odds (here, $LO$) can be converted to probability using the following equation:

$\Large \frac{e^{LO}}{1 + e^{LO}}$

In [149]:
#### Your code here

##### Once again, consider a range of `years_experience` from $1$ to $200$. This time, use the parameters from the updated model to calculate `p(callback)` for this range. Do this twice: first assuming the applicants are Black, then assuming they are White.

In [151]:
#### Your code here

##### Based on these results, what should we conclude about racial discrimination in the application process?

In [131]:
#### Your response here

# Conclusion

Congratulations! You've now:

- Built and interpreted multiple linear regression models.  
- Built and interpreted multiple logistic regression models.  
- Replicated some of the key analyses in a seminal paper looking at discrimination in the job market.  