<a href="https://colab.research.google.com/github/tsvoronos/API202-students/blob/main/section-KP/section3-KP-exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## API-202M ABC SECTION #3
###### TF: Kelsey Pukelis

**I - INSTRUCTIONS**  

1. **Create a copy of this Jupyter notebook in your own drive by clicking `Copy to Drive` in the menubar (this is explained in more detail below) - *if you do not do this your work will not be saved!***
    1. Remember to save your work frequently by pressing `command-S` or clicking `File > Save` in the menubar.
    1. We recommend completing this in Google Chrome.

## Load `R` libraries and data

**Please refer to Sheet 1 in this [R Cheat Sheet](https://bit.ly/HKS-R), which includes the commands you learned last semester along with a number of new ones.**

The code cell below imports the R tidyverse. Make sure to run it before starting the problem set!

*Note: Click the "play" button that appears when you hover over a cell to run it. The first time you do this you may receive an alert that this notebook was not authored by Google. If so, click "Run anyway" to proceed.*

In [None]:
library(tidyverse)

# PART I: Preferences for Receiving Redistribution: Explaining Welfare Participation



The purpose of this exercise is to understand:
* how to interpret a linear probability model
* the math regression is doing when we have all dummy variables on the right-hand side. 
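
To make the first goal concrete, here is a minimal sketch on made-up data. The data frame `toy` and its columns `y` and `d` are invented for illustration and are not part of the GSS data. With a binary outcome and a single dummy regressor, the intercept should equal the share of `y = 1` in the `d = 0` group, and the slope should equal the difference in shares between the two groups.

```r
# Toy data, invented for illustration (not the GSS data)
toy <- data.frame(
  y = c(1, 0, 0, 1, 1, 0, 0, 0),   # binary outcome
  d = c(0, 0, 0, 0, 1, 1, 1, 1)    # dummy regressor
)

# Linear probability model: regress the binary outcome on the dummy
fit <- lm(y ~ d, data = toy)
coef(fit)

# Share of y = 1 within each group, for comparison with the coefficients
share_d0 <- mean(toy$y[toy$d == 0])   # compare with the intercept
share_d1 <- mean(toy$y[toy$d == 1])   # compare with intercept + slope
c(share_d0 = share_d0, share_d1 = share_d1)
```

In a linear probability model the slope is read in percentage points: here, moving from `d = 0` to `d = 1` is associated with a change of `share_d1 - share_d0` in the probability that `y = 1`.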

In this exercise, we will replicate exploratory analysis I did for my own research using [General Social Survey (GSS)](https://gss.norc.org/About-The-GSS) data. 

The dataset `gss` includes the following variables: 
* `id_`: unique identifier for individuals in the dataset (You can ignore this.)
* `year`: The year of the survey, which is 1986 for all observations. (You can ignore this.)
* `getaid`: ever received welfare: "Have you personally ever received income from Aid to Families with Dependent Children (AFDC), General Assistance, Supplemental Security Income, or Food Stamps?"
* `welfare1`: "For each of the following statements, please tell me whether you strongly agree, agree, disagree, or strongly disagree with it. 'Welfare makes people work less than they would if there wasn't a welfare system.'"
* `wrkstat`: labor force status
* `age`: age in years
* `educ`: years of completed education
* `sex`: male or female
* `race`: White, Black, or Other race
* `incom16`: "Thinking about the time when you were 16 years old, compared with American families in general then, would you say your family income was--far below average, below average, average, above average, or far above average?"
* `income`: total family income
* `rincome`: respondent's earned income
* `partyid`: political party affiliation

In the code below, I recode some of these variables to make them easier to use in regression analysis, and save the new data as `gss_clean`. 
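
As a small illustration of the recoding pattern used in the pre-processing cell, the sketch below applies `recode()` from `dplyr` to a short, made-up character vector. The vector `answers` is hypothetical; only the pattern (map labels to text digits, map missing-value codes to `NA_character_`, then convert with `as.numeric()`) mirrors the cleaning code.

```r
library(dplyr)

# Hypothetical survey responses, including the missing-value code ".n"
answers <- c("1. YES", "2. NO", ".n", "2. NO")

# recode() replaces each old label with a new one (still text at this point);
# as.numeric() then converts the text to numbers, and NA_character_ becomes NA
answers_num <- as.numeric(recode(answers,
                                 "2. NO"  = "0",
                                 "1. YES" = "1",
                                 ".n"     = NA_character_))
answers_num
```

Keeping the intermediate values as text is what lets the missing-value code be mapped to `NA_character_` in the same call.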

In [None]:
gss <- read.csv('https://raw.githubusercontent.com/tsvoronos/API202-students/main/data/gss_short.csv')
head(gss)

Unnamed: 0_level_0,id_,year,getaid,welfare1,wrkstat,age,educ,sex,race,incom16,income,rincome,partyid
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,1986,2. NO,3. DISAGREE,1. Working full time,28,14,1. MALE,1. White,5. FAR ABOVE AVERAGE,"11. $20,000 to $24,999",11. $20000 - 24999,"3. Independent (neither, no response)"
2,2,1986,2. NO,2. AGREE,7. Keeping house,54,16,2. FEMALE,1. White,5. FAR ABOVE AVERAGE,"12. $25,000 or more",.i,6. Strong republican
3,3,1986,1. YES,2. AGREE,1. Working full time,44,16,2. FEMALE,1. White,4. ABOVE AVERAGE,"12. $25,000 or more",6. $6000 TO 6999,0. Strong democrat
4,4,1986,2. NO,3. DISAGREE,5. Retired,77,14,2. FEMALE,2. Black,1. FAR BELOW AVERAGE,.r,.i,0. Strong democrat
5,5,1986,1. YES,2. AGREE,2. Working part time,44,14,2. FEMALE,2. Black,1. FAR BELOW AVERAGE,.d,.r,0. Strong democrat
6,6,1986,1. YES,1. STRONGLY AGREE,7. Keeping house,47,10,1. MALE,2. Black,3. AVERAGE,"5. $5,000 to $5,999",.i,1. Not very strong democrat


In [None]:
# some pre-processing of the data 
gss_clean <- gss %>% 
  mutate(getaid = as.numeric(recode(getaid, 
        "2. NO" = "0", 
        "1. YES" = "1", 
        ".n" = NA_character_))) %>%
  mutate(welfare1_num = as.numeric(recode(welfare1,
        "4. STRONGLY DISAGREE" = "4", 
        "3. DISAGREE" = "3",
        "2. AGREE" = "2",
        "1. STRONGLY AGREE" = "1",
        ".n" = NA_character_,
        ".d" = NA_character_))) %>%
  relocate(welfare1_num, .after = welfare1) %>%
  mutate(working = as.numeric(recode(wrkstat,
        "1. Working full time" = "1",
        "2. Working part time" = "1",
        "3. With a job, but not at work because of temporary illness, vacation, strike" = "1",
        "4. Unemployed, laid off, looking for work" = "0",
        "5. Retired" = "0",
        "6. In school" = "0",
        "7. Keeping house" = "0",
        "8. Other" = "0"))) %>%
  relocate(working, .after = wrkstat) %>%
  mutate(age = as.numeric(recode(age,
        "89. 89 or older" = "89",
        ".n" = NA_character_))) %>%
  mutate(educ = as.numeric(recode(educ,
        "0. No formal schooling" = "0",
        ".n" = NA_character_))) %>%
  mutate(female = as.numeric(recode(sex,
        "1. MALE" = "0",
        "2. FEMALE" = "1"))) %>%
  relocate(female, .after = sex) %>%
  mutate(race_white = ifelse(race == "1. White",1,0)) %>%
  mutate(race_black = ifelse(race == "2. Black",1,0)) %>%
  mutate(race_other = ifelse(race == "3. Other",1,0)) %>%
  relocate(race_white, .after = race) %>%
  relocate(race_black, .after = race_white) %>%
  relocate(race_other, .after = race_black) %>%
  mutate(income_num = as.numeric(recode(income,
        '1. Under $1,000' = "1",
        '2. $1,000 to $2,999' = "2",
        '3. $3,000 to $3,999' = "3",
        '4. $4,000 to $4,999' = "4",
        '5. $5,000 to $5,999' = "5",
        '6. $6,000 to $6,999' = "6",
        '7. $7,000 to $7,999' = "7",
        '8. $8,000 to $9,999' = "8",
        '9. $10,000 to $14,999' = "9",
        '10. $15,000 to $19,999' = "10",
        '11. $20,000 to $24,999' = "11",
        '12. $25,000 or more' = "12",
        '.d' = NA_character_,
        '.n' = NA_character_,
        '.r' = NA_character_))) %>%
  relocate(income_num, .after = income) %>%
  mutate(party_short = recode(partyid,
        '0. Strong democrat' = "Democrat",
        '1. Not very strong democrat' = "Democrat",
        '2. Independent, close to democrat' = "Independent",
        '3. Independent (neither, no response)' = "Independent",
        '4. Independent, close to republican' = "Independent",
        '5. Not very strong republican' = "Republican",
        '6. Strong republican' = "Republican",
        '7. Other party' = "Independent",
        ".n" = NA_character_)) %>%
  relocate(party_short, .after = partyid) %>%
  mutate(party_dem = ifelse(party_short == "Democrat",1,0)) %>%
  mutate(party_ind = ifelse(party_short == "Independent",1,0)) %>%
  mutate(party_rep = ifelse(party_short == "Republican",1,0)) %>%
  relocate(party_dem, .after = party_short) %>%
  relocate(party_ind, .after = party_dem) %>%
  relocate(party_rep, .after = party_ind)

head(gss_clean)

Unnamed: 0_level_0,id_,year,getaid,welfare1,welfare1_num,wrkstat,working,age,educ,sex,⋯,race_other,incom16,income,income_num,rincome,partyid,party_short,party_dem,party_ind,party_rep
Unnamed: 0_level_1,<int>,<int>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
1,1,1986,0,3. DISAGREE,3,1. Working full time,1,28,14,1. MALE,⋯,0,5. FAR ABOVE AVERAGE,"11. $20,000 to $24,999",11.0,11. $20000 - 24999,"3. Independent (neither, no response)",Independent,0,1,0
2,2,1986,0,2. AGREE,2,7. Keeping house,0,54,16,2. FEMALE,⋯,0,5. FAR ABOVE AVERAGE,"12. $25,000 or more",12.0,.i,6. Strong republican,Republican,0,0,1
3,3,1986,1,2. AGREE,2,1. Working full time,1,44,16,2. FEMALE,⋯,0,4. ABOVE AVERAGE,"12. $25,000 or more",12.0,6. $6000 TO 6999,0. Strong democrat,Democrat,1,0,0
4,4,1986,0,3. DISAGREE,3,5. Retired,0,77,14,2. FEMALE,⋯,0,1. FAR BELOW AVERAGE,.r,,.i,0. Strong democrat,Democrat,1,0,0
5,5,1986,1,2. AGREE,2,2. Working part time,1,44,14,2. FEMALE,⋯,0,1. FAR BELOW AVERAGE,.d,,.r,0. Strong democrat,Democrat,1,0,0
6,6,1986,1,1. STRONGLY AGREE,1,7. Keeping house,0,47,10,1. MALE,⋯,0,3. AVERAGE,"5. $5,000 to $5,999",5.0,.i,1. Not very strong democrat,Democrat,1,0,0


**1. Run a bivariate regression of receiving welfare on a dummy for agreeing with the statement, "Welfare makes people work less." (Note: You should create a dummy variable first, grouping together people who "agree" or "strongly agree" into the "agree" group.) Interpret the slope coefficient and intercept.** 

In [1]:
# Your code here 

Your Answer Here 

**2. Beliefs about welfare programs are very political, so we should probably account for respondents' party. Control for political party by including the dummies for "Republican", "Democrat", and/or "Independent" in a multivariate regression. Which dummies did you include in the regression and why? Say what each of the coefficients is measuring. Interpret the coefficient on agreeing with the statement, and briefly note how the coefficient changed compared to the bivariate regression.**
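
When thinking about which dummies to include, it may help to see what `lm()` does when a full set of mutually exclusive dummies is entered alongside an intercept. The sketch below uses made-up data (the data frame `toy`, the groups `"a"`, `"b"`, `"c"`, and the dummy names are all invented for illustration).

```r
# Toy data, invented for illustration: three mutually exclusive groups
toy <- data.frame(
  y     = c(1, 3, 5, 7, 10, 12),
  group = c("a", "a", "b", "b", "c", "c")
)
toy$d_a <- ifelse(toy$group == "a", 1, 0)
toy$d_b <- ifelse(toy$group == "b", 1, 0)
toy$d_c <- ifelse(toy$group == "c", 1, 0)

# All three dummies plus an intercept: d_a + d_b + d_c equals 1 in every row,
# so the regressors are perfectly collinear with the constant
fit_all <- lm(y ~ d_a + d_b + d_c, data = toy)
coef(fit_all)   # lm() cannot estimate one of the coefficients and reports NA

# Leaving one dummy out makes that group the reference category
fit_two <- lm(y ~ d_b + d_c, data = toy)
coef(fit_two)   # intercept is the mean of group "a"; slopes are differences from it
```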

In [2]:
# Your code here 

Your Answer Here

**Let's practice running the same regression, now using factors in R instead of using dummy variables.**

Recall the following coding example from the lecture appendix:

```
NLS <- NLS %>% mutate(grad = as_factor(case_when(educ < 12 ~ "Less than HS",
                                                  educ >= 12 & educ < 16 ~ "HS",
                                                  educ >= 16 ~ "College")))
```
If we already had a variable called `educ_group` that took on (character) values `"Less than HS","HS","College"`, then we could create a factor variable using the following code:
```
NLS <- NLS %>% mutate(grad = as_factor(educ_group))
```
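
One detail worth checking when creating factors is the order of the levels, since `lm()` treats the first level as the omitted (reference) group. The sketch below uses a made-up vector `educ_group` and compares `as_factor()` from `forcats` (loaded with the tidyverse) with base R's `factor()`.

```r
library(forcats)

# Hypothetical character vector, invented for illustration
educ_group <- c("HS", "College", "Less than HS", "HS")

# as_factor() orders levels by first appearance in the data
levels_forcats <- levels(as_factor(educ_group))
levels_forcats

# base factor() sorts levels alphabetically
levels_base <- levels(factor(educ_group))
levels_base
```

The two functions can therefore pick different reference groups for the same data, so it is worth running `levels()` on a factor before interpreting a regression that uses it.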

**3. Turn the political party variable `party_short` into a factor variable, and run the same regression as before. Note which group R automatically omitted.**

In [3]:
# Your code here

Your Answer Here

#### START 

When we use the factor variable, R automatically omits the Independent group, so our regression results are exactly the same as above. 

#### END 

If we have already created a factor variable called `grad`, we can set the reference group by naming it in the `ref` argument of the `relevel` command:

```
NLS$grad <- relevel(NLS$grad, ref = "Less than HS")
```
This code sets `"Less than HS"` as the reference group for the factor variable `grad`.
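
To see how changing the reference group changes the coefficients but not the fit, here is a minimal sketch on made-up data (the data frame `toy` and the group labels are invented for illustration).

```r
# Toy data, invented for illustration: group means are 2, 6, and 11
toy <- data.frame(
  y     = c(1, 3, 5, 7, 10, 12),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# Default: the first level ("a") is the reference group
fit_a <- lm(y ~ group, data = toy)
coef(fit_a)   # intercept = mean of "a"; other coefficients are differences from "a"

# Make "c" the reference group instead
toy$group_c <- relevel(toy$group, ref = "c")
fit_c <- lm(y ~ group_c, data = toy)
coef(fit_c)   # intercept = mean of "c"; other coefficients are differences from "c"

# The fitted values are the same either way; only the baseline changes
all.equal(unname(fitted(fit_a)), unname(fitted(fit_c)))
```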

**4. Using a factor variable, set the reference group to Republicans and run the regression again. Briefly discuss how the interpretation of the regression changes.**

In [4]:
# Your code here 

Your Answer Here 


**Suppose we want to test the null hypothesis that the relationship between agreeing with the statement and receiving welfare is the same across the three political groups.** 

**5. To start, simply calculate the means of the welfare participation variable for 6 groups: Independents who disagree, Republicans who disagree, Democrats who disagree, Independents who agree, Republicans who agree, and Democrats who agree.**
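
As a reminder of the general pattern for computing means within cells defined by two variables, here is a sketch on made-up data using `group_by()` and `summarise()`. The data frame `toy` and its columns `y`, `party`, and `agree` are invented for illustration; the variable names in your own dataset will differ.

```r
library(dplyr)

# Toy data, invented for illustration: two parties, agree/disagree
toy <- data.frame(
  y     = c(1, 0, 1, 1, 0, 0, 1, 0),
  party = c("A", "A", "A", "A", "B", "B", "B", "B"),
  agree = c(0, 0, 1, 1, 0, 0, 1, 1)
)

# One row per (party, agree) cell, with the mean of y in that cell;
# na.rm = TRUE skips missing outcomes rather than returning NA
cell_means <- toy %>%
  group_by(party, agree) %>%
  summarise(mean_y = mean(y, na.rm = TRUE), .groups = "drop")
cell_means
```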

In [5]:
# Your code here 

Your Answer Here


**Suppose now we actually want to *test* the null hypothesis that the relationship between agreeing with the statement and receiving welfare is the same across the three political groups. Someone suggests running the following regression:**

$$
getaid_i = \beta_0 + \beta_1 Republican_i + \beta_2 Democrat_i + \beta_3 welfare1.agree_i + \beta_4 (welfare1.agree_i \times Republican_i) + \beta_5 (welfare1.agree_i \times Democrat_i) + u_i
$$


**6. Fill in the following table to show how the coefficients in the regression relate to the 6 means that you just calculated. What is the omitted group in the regression? Which coefficient(s) are relevant for testing the null hypothesis? Can we test the null hypothesis that the effect is the same for *all* three groups?**

$$
\begin{array}{|l|l|l|}
\hline \\
Party & Agree/Disagree & Coefficient(s) \\
\hline \\
Independents & Disagree & \\
\hline \\
Republicans & Disagree & \\
\hline \\
Democrats & Disagree & \\
\hline \\
Independents & Agree & \\
\hline \\
Republicans & Agree & \\
\hline \\
Democrats & Agree & \\
\hline \\
\end{array}
$$


Your Answer Here 

**7. Now create the interaction terms in the dataset and run this regression.**
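
An interaction term between two dummies is simply their product. The sketch below, on made-up data (the data frame `toy` and the columns `y`, `x`, `z`, and `x_z` are invented for illustration), builds the product by hand and checks that it gives the same coefficients as R's formula shorthand `x * z`, which expands to `x + z + x:z`.

```r
# Toy data, invented for illustration: two dummies x and z
toy <- data.frame(
  y = c(1, 0, 1, 1, 0, 0, 1, 0),
  x = c(0, 0, 1, 1, 0, 0, 1, 1),
  z = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Interaction term created by hand: equals 1 only when both dummies equal 1
toy$x_z <- toy$x * toy$z

fit_manual  <- lm(y ~ x + z + x_z, data = toy)   # hand-made interaction
fit_formula <- lm(y ~ x * z, data = toy)         # formula shorthand

# Same estimates; only the coefficient label differs ("x_z" vs "x:z")
coef(fit_manual)
coef(fit_formula)
```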

In [6]:
# Your code here

**8. Confirm that you can recover the means for each of the 6 groups by adding together different coefficients as indicated in your table above. What does this tell you about what regression is doing when we are working with all dummy variables on the right hand side?**

In [7]:
# Your code here (optional)

Your Answer Here 

#### START

Wow! By adding together coefficients from the regressions, we can *exactly* recover each of the group means that we calculated before. This shows that, when all the variables on the right hand side are dummy variables, regression is just a fancy average-making machine. 

Note: We refer to a regression that includes all possible combinations of dummy variable interactions as "fully saturated."
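
The same point can be checked on a small made-up example with two dummies. In the sketch below, the data frame `toy` and the columns `y`, `a`, and `b` are invented for illustration; the regression includes both dummies and their interaction, so it is fully saturated, and sums of coefficients reproduce the four cell means.

```r
# Toy data, invented for illustration: cell means are 3, 8, 2, and 10
toy <- data.frame(
  y = c(2, 4, 6, 10, 1, 3, 9, 11),
  a = c(0, 0, 1, 1, 0, 0, 1, 1),
  b = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Cell means computed directly (rows: a = 0, 1; columns: b = 0, 1)
cell_means <- with(toy, tapply(y, list(a = a, b = b), mean))
cell_means

# Fully saturated regression: both dummies plus their interaction
b_hat <- coef(lm(y ~ a * b, data = toy))

# Each cell mean is a sum of coefficients
from_reg <- c(
  a0_b0 = unname(b_hat["(Intercept)"]),
  a1_b0 = unname(b_hat["(Intercept)"] + b_hat["a"]),
  a0_b1 = unname(b_hat["(Intercept)"] + b_hat["b"]),
  a1_b1 = unname(b_hat["(Intercept)"] + b_hat["a"] + b_hat["b"] + b_hat["a:b"])
)
from_reg
```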

#### END 

**Bonus: Run a regression controlling for other factors that you think might affect welfare participation. How does this change the estimate on the "agree" coefficient?**

In [8]:
# Your code here

Your Answer Here 

#### START 

Controlling for these other characteristics reduces the magnitude of the coefficient on the agree variable. (Especially noteworthy: all else constant, working people are less likely to report receiving welfare, richer people are less likely to report receiving welfare, and Black individuals are more likely to report receiving welfare.)

However, it is still the case that the relationship between agreeing with the statement "Welfare makes people work less" and reported welfare receipt is negative, statistically significant, and quite large in magnitude. 


#### END 

**Bonus. Create a table showing results from your regressions: the bivariate regression, the regression controlling for political party, the regression with interaction terms, and perhaps the regression including other controls.** 

First, run the script below installing the package modelsummary (this may take a minute or so to install).

In [None]:
check_installed <- require(modelsummary)
if(check_installed==F){
  install.packages("modelsummary")
  require(modelsummary)
}

In the code snippet below, `fit1`, `fit2`, and `fit3` are the regression outputs you saved in previous code, `Name Outcome Variable` is the label you would like to show at the top of each column, and `var1` and `var2` are the names of two independent variables included in the regression (e.g. `party_rep`). Note that you can and should include all the variables that you used in the models you ran in the `coef_map` line.
```
modelsummary(list("Name Outcome Variable" = fit1,
                  "Name Outcome Variable" = fit2,
                  "Name Outcome Variable" = fit3),
             stars = TRUE,
             coef_map = c("(Intercept)" = "Constant",
                          "var1" = "Name of var1",
                          "var2" = "Name of var2"),
             title = "Add Here the Title of the Table",
             gof_omit = 'IC|Log',
             output = "jupyter")
```

In [9]:
# Your code here 

**9. Finally, can you reject the null hypothesis that the relationship between  agreeing with the statement and receiving welfare is the same across the political groups?**
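
One common way to test several interaction coefficients jointly is an F-test comparing a model without the interactions to a model with them. The sketch below shows the mechanics on simulated data and is only one possible approach; the data frame `sim`, its columns, and the group labels are invented for illustration, and the test your course expects may differ. The classical F-test shown also assumes homoskedastic errors.

```r
# Simulated data, invented for illustration (no true interaction is built in)
set.seed(1)
n <- 300
sim <- data.frame(
  group = factor(sample(c("g1", "g2", "g3"), n, replace = TRUE)),
  x     = rbinom(n, 1, 0.5)
)
sim$y <- rbinom(n, 1, 0.3)

# Restricted model: the coefficient on x is forced to be the same in every group
fit_restricted <- lm(y ~ x + group, data = sim)

# Unrestricted model: x is interacted with the group dummies
fit_full <- lm(y ~ x * group, data = sim)

# Joint F-test of the two interaction coefficients
# (null hypothesis: both interaction coefficients equal zero)
f_test <- anova(fit_restricted, fit_full)
f_test
```

The `Df` column of the output shows how many restrictions are being tested (here, the two interaction terms), and the `Pr(>F)` column gives the p-value for the joint null.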

Your Answer Here 

**10. Come up with a variable in the dataset other than political party that you think may affect the relationship between beliefs about welfare and welfare participation. Create interaction term(s) and run a regression that allows you to test your hypothesis. Compare to a regression where you include that variable as a control, but without an interaction term. Interpret your results.**

In [10]:
# Your code here 

Your Answer Here 

**Bonus. Describe what this exercise overall tells us about the causal relationship between beliefs about welfare and participation in welfare programs.**

Your answer here