-
Notifications
You must be signed in to change notification settings - Fork 9
/
Regression2_Implementation.Rmd
365 lines (280 loc) · 10.2 KB
/
Regression2_Implementation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
---
title: "Regression 2: Implementation in R"
author: "Instructor: Yuta Toyama"
date: "Last updated: `r Sys.Date()`"
fig_width: 6
fig_height: 4
output:
xaringan::moon_reader:
css: xaringan-themer.css
nature:
countIncrementalSlides: no
highlightLines: yes
highlightStyle: github
ratio: '16:9'
knit: pagedown::chrome_print
---
class: title-slide-section, center, middle
name: logistics
# Introduction
---
## Acknowledgement
This note is based on "Introduction to Econometrics with R". https://www.econometrics-with-r.org/index.html
---
## Preliminary: packages
- We use the following packages:
- `AER` :
- `dplyr` : data manipulation
- `stargazer` : output of regression results
``` {r, warning = FALSE, message=FALSE}
# Install package if you have not done so
# install.packages("AER")
# install.packages("dplyr")
# install.packages("stargazer")
# install.packages("texreg")
# install.packages("estimatr")
# load packages
library("AER")
library("dplyr")
library("stargazer")
library("texreg")
library("estimatr")
```
---
## Empirical setting: Data from California School
- Question: How does the student-teacher ratio affects test scores?
- We use data from California school, which is included in `AER` package.
- See here for the details: https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/CASchools
```{r}
# load the the data set in the workspace
data(CASchools)
```
---
- Use `class()` function to see `CASchools` is `data.frame` object.
```{r}
class(CASchools)
```
- We take 2 steps for the analysis.
- Step 1: Look at data (descriptive analysis)
- Step 2: Run regression
---
class: title-slide-section, center, middle
name: logistics
# Step 1: Descriptive analysis
---
### Descriptive analysis
- It is always important to grasp your data before running regression.
- `head()` function give you a first overview of the data.
```{r}
head(CASchools)
```
- Alternatively, you can use `View()` to see the entire dataset in browser window.
---
## Create variables
- Create several variables that are needed for the analysis.
- We use `dplyr` for this purpose.
```{r}
CASchools %>%
mutate( STR = students / teachers ) %>%
mutate( score = (read + math) / 2 ) -> CASchools
```
---
## Descriptive statistics
- There are several ways to show descriptive statistics
- The standard one is to use `summary()` function
```{r,eval=FALSE}
summary(CASchools)
```
---
```{r,echo=FALSE}
summary(CASchools)
```
---
- This returns the desriptive statistics for all the variables in dataframe.
- You can combine this with `dplyr::select`
```{r}
CASchools %>%
select(STR, score) %>%
summary()
```
---
- You can do a bit lengthly thing manually like this.
```{r}
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)
# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)
# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)
# gather everything in a data.frame
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score),
StandardDeviation = c(sd_STR, sd_score),
quantile = rbind(quant_STR, quant_score))
```
---
```{r}
DistributionSummary
```
---
- My personal favorite is to use `stargazer` function.
```{r}
stargazer(CASchools, type = "text")
```
---
- You can choose summary statistics you want to report.
```{r}
CASchools %>%
stargazer( type = "text", summary.stat = c("n", "p75", "sd") )
```
---
<br/><br/>
- See https://www.jakeruss.com/cheatsheets/stargazer/#the-default-summary-statistics-table for the details.
- `stargazer` can be also used to report regression results.
- But, we will use `texreg` instead.
---
## Scatter plot
- Let's see how test score and student-teacher-ratio is correlated.
```{r, eval=FALSE}
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of TestScore and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
```
---
.center[
```{r, fig.height=8, fig.width=12, echo=FALSE}
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of TestScore and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
```
]
---
<br/><br/>
- Use `cor()` to compute the correlation between two numeric vectors.
```{r}
cor(CASchools$STR, CASchools$score)
```
---
class: title-slide-section, center, middle
name: logistics
# Step 2: Run regression
---
## Simple linear regression
- We use `lm()` function to run linear regression
- First, consider the simple linear regression
$$score_i = \beta_0 + \beta_1 size_i + \epsilon_i$$
where $size_i$ is the class size (student-teacher-ratio).
- From now on we call student-teacher-ratio (STR) class size.
---
- To run this regression, we use `lm`
```{r}
# First, we rename the variable `STR`
CASchools %>%
dplyr::rename( size = STR) -> CASchools
# Run regression and save results in the varaiable `model1_summary`
model1_summary <- lm( score ~ size, data = CASchools)
# See the results
summary(model1_summary)
```
---
- Interpretations
- An increase of one student per teacher leads to 2.2 point decrease in test scores.
- p value is very small. The effect of the class size on test score is significant.
- Note: Be careful. These standard errors are NOT heteroskedasiticity robust. We will come back to this point soon.
- $R^2 = 0.051$, implying that 5.1% of the variance of the dependent variable is explained by the model.
- You can add more variable in the regression (will see this soon)
---
## Robust standard error with `lm_robust`
- We use `lm_robust()` in `estimatr` package to run regression with robust standard error.
```{r}
model1_robust <- lm_robust( score ~ size, data = CASchools, se_type = "HC1")
summary(model1_robust)
```
- Notice that robust standard errors are larger than the one we obtained from `lm`!
---
## Report by texreg
- `texreg` is useful to show the regression result.
- `screenreg` function shows the table on R markdown.
- You can use `htmlreg` (`texreg`) to get html (latex) format.
- `stargazer` is also used to show regression results, however it does not follow `lm_robust`.
```{r, eval=FALSE}
# Create output by `screenreg` function.
screenreg(l=list(model1_summary, model1_robust),
digits = 3,
# caption = 'title',
custom.model.names = c("model1", "model1 Robust"),
custom.coef.names = NULL, # add a class, if you want to change the names of variables.
include.ci = F,
include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE,
include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable
stars = numeric(0) # to delete star expression
)
```
---
```{r, echo = FALSE }
# Create output by `screenreg` function.
screenreg(l=list(model1_summary, model1_robust),
digits = 3,
# caption = 'title',
custom.model.names = c("model1", "model1 Robust"),
custom.coef.names = NULL, # add a class, if you want to change the names of variables.
include.ci = F,
include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE,
include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable
stars = numeric(0) # to delete star expression
)
```
---
## Full results
Taken from https://www.econometrics-with-r.org/7-6-analysis-of-the-test-score-data-set.html
```{r, eval=FALSE}
# estimate different model specifications
spec1 <- lm_robust(score ~ size, data = CASchools, se_type = "HC1")
spec2 <- lm_robust(score ~ size + english, data = CASchools, se_type = "HC1")
spec3 <- lm_robust(score ~ size + english + lunch, data = CASchools, se_type = "HC1")
spec4 <- lm_robust(score ~ size + english + calworks, data = CASchools, se_type = "HC1")
spec5 <- lm_robust(score ~ size + english + lunch + calworks, data = CASchools, se_type = "HC1")
# generate a table using texreg
screenreg(l = list(spec1, spec2, spec3, spec4, spec5),
digits = 3,
# caption = 'title',
custom.model.names = c("(I)", "(II)", "(III)", "(IV)", "(V)"),
custom.coef.names = NULL, # add a class, if you want to change the names of variables.
include.ci = F,
include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE,
include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable
stars = numeric(0) # to delete star expression
)
```
---
```{r, echo=FALSE}
# estimate different model specifications
spec1 <- lm_robust(score ~ size, data = CASchools, se_type = "HC1")
spec2 <- lm_robust(score ~ size + english, data = CASchools, se_type = "HC1")
spec3 <- lm_robust(score ~ size + english + lunch, data = CASchools, se_type = "HC1")
spec4 <- lm_robust(score ~ size + english + calworks, data = CASchools, se_type = "HC1")
spec5 <- lm_robust(score ~ size + english + lunch + calworks, data = CASchools, se_type = "HC1")
# generate a table using texreg
screenreg(l = list(spec1, spec2, spec3, spec4, spec5),
digits = 3,
# caption = 'title',
custom.model.names = c("(I)", "(II)", "(III)", "(IV)", "(V)"),
custom.coef.names = NULL, # add a class, if you want to change the names of variables.
include.ci = F,
include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE,
include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable
stars = numeric(0) # to delete star expression
)
```
- The coefficient on the class size decreases as we add more explantory variables. Can you explain why? (Hint: omitted variable bias)