<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 18</h1>

<h2>Online Midterm Scores</h2>

Learning objectives:

1. Some remarks about privacy and Human Subjects
2. Exam results
3. More experience with OLS with treatment groups
4. Some fun kernel density plots, courtesy of Data Science

Multiple choice exams are not my favorite, but at least they provide instant satisfaction to "data heads" like me. And not only do we have the scores already; we also have the responses to an interesting survey question, shown in the image below.

<font color = "blue">Do you have any priors about what association, if any, we might find between test scores and the answers to this question?</font>

<img src="images/econ140_fa22_mt_survey.png" width="500" />

I wanted to include this survey question on the exam for at least three reasons:

* It makes for a nice exercise for class
* I would like to know whether there are big performance differences associated with participation mode
* I think students should also be aware of what the data say

As you may have deduced from our term thus far, however, I tend to believe that people in general, and students in particular, choose what works best for them unless they face constraints that push them one way or another. There is much heterogeneity in people's circumstances, and often the one best placed to make the right choice given those circumstances is the person living them.

With that perspective, one might well expect not to reject the <b>null hypothesis that there is no difference in scores between groups</b>. If the choice to attend in person vs. online is a function of other things, like "being a morning person" or not, it is not clear one would see any systematic correlation between scores and in-person attendance. 

But with mode of instruction and mode of participation, the typical concern one will hear is that <b>"there is no substitute for in-person instruction,"</b> meaning that distance participation is inherently worse. That hypothesis would be weakly supported if we found evidence that students who attended less often in person also performed less well on the midterm. (I'm saying "weakly supported" because this is not a rigorously designed study, and all kinds of things could be going on.)

Probably one of the wildest things to think about is this: <i>Does the mode of this <b>online exam</b> bias the results in a particular way?</i> One might hypothesize that an online exam could be easier for students who usually attend online. I have no idea whether this might be true.

There are many other things one can state, but for now let us look at the data.

<hr>

<h2>These data and your privacy</h2>

Data may be collected and examined without Human Subjects Review when the purpose is <i>educational only</i>. Therefore, please use these data only for educational purposes, as we are doing here.

These data are <b>anonymized</b>, with no identifiers. I have scrambled the order in which the data appear, while preserving the covariance structure. Further, I have dropped the students who answered "6. I don't know" and "7. Refuse to answer" from the dataset. Scores and the response are not public and not known except to the individual student and to the instructors. The instructors are already bound not to reveal grades and identities by FERPA. The risk of reidentification is thus limited to self-reidentification by a student, or a data hack of bCourses.
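
Scrambling the row order leaves every mean, variance, and covariance unchanged, as long as each score stays in the same row as its survey response. One way to do this kind of scrambling is sketched below with made-up numbers (the data frame `toy` and its values are hypothetical, not the class data, and this is not necessarily the exact procedure used for the class files):

```r
# Hypothetical toy data; these are NOT the class scores.
set.seed(140)
toy <- data.frame(
    omtscore = round(rnorm(20, mean = 30, sd = 5)),
    inperson = sample(1:5, size = 20, replace = TRUE)
)

# Shuffle whole rows, so each score stays paired with its own response.
toy_scrambled <- toy[sample(nrow(toy)), ]

# The covariance matrix does not depend on row order.
cov(toy)
cov(toy_scrambled)
all.equal(cov(toy), cov(toy_scrambled))
```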

In [None]:
library(haven)
library(ggplot2)
library(dplyr)

These are the online midterm scores from Fall 2022:

In [None]:
omtscores_22 <- read_dta("data/omtscores_public_22.dta")
head(omtscores_22)

In [None]:
summary(omtscores_22)

And these are the online midterm scores from Fall 2023: 

In [None]:
omtscores_23 <- read_dta("data/omtscores_public_23.dta")
head(omtscores_23)

In [None]:
summary(omtscores_23)

And let's combine (or stack or pool) the datasets with `rbind()`:

In [None]:
omtscores_22_23 <- rbind(omtscores_22, omtscores_23)
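
`rbind()` stacks rows and expects both data frames to have the same columns. A small sketch with hypothetical stand-ins (`toy_22`, `toy_23`, their values, and the assumption that a `term` column marks the semester are all invented for illustration) shows the kind of sanity check worth running after stacking:

```r
# Hypothetical stand-ins for the two terms' data; values are invented.
toy_22 <- data.frame(omtscore = c(31, 27, 35), inperson = c(1, 3, 5), term = 2022)
toy_23 <- data.frame(omtscore = c(29, 33), inperson = c(2, 1), term = 2023)

# Stack the rows; this works because the column names match.
toy_22_23 <- rbind(toy_22, toy_23)

# Sanity checks: no rows lost, and each term contributes what we expect.
nrow(toy_22_23) == nrow(toy_22) + nrow(toy_23)
table(toy_22_23$term)
```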

Here are tables of frequencies by counts and then percentages:

In [None]:
# Counts and shares of each in-person attendance response, Fall 2022
inperson_22_counts <- table(omtscores_22$inperson)
inperson_22_table <- data.frame(
    inperson = names(inperson_22_counts),
    count = as.vector(inperson_22_counts),
    share = as.vector(prop.table(inperson_22_counts))
)
inperson_22_table

In [None]:
# Counts and shares of each in-person attendance response, Fall 2023
inperson_23_counts <- table(omtscores_23$inperson)
inperson_23_table <- data.frame(
    inperson = names(inperson_23_counts),
    count = as.vector(inperson_23_counts),
    share = as.vector(prop.table(inperson_23_counts))
)
inperson_23_table

This is BY NO MEANS a randomized controlled trial. But the structure of the data allows us to look at a "control group" and four "treatment groups," similar to the structure of the RAND Health Insurance Experiment.

As we did there, let us estimate the following model using ordinary least squares:

$$
omtscore_i = \alpha + \beta^2 \cdot D^2_i + \beta^3 \cdot D^3_i + \beta^4 \cdot D^4_i
+ \beta^5 \cdot D^5_i + \epsilon_i
$$

where the $D$'s are indicator variables for the given response (2, 3, 4, or 5) to the survey question. The superscripts on $\beta$ and $D$ are labels for the response category, not exponents. The omitted category is the "control group," and that will be the group that responded "1. Most/all."
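
In this kind of dummy-variable regression, the intercept is the mean score of the omitted group, and each $\beta$ is the difference between that group's mean and the omitted group's mean. A minimal sketch on simulated data (the data frame `sim` and the numbers behind it are assumptions made up for illustration, not the class data):

```r
# Simulated data for illustration only; NOT the class scores.
set.seed(18)
sim <- data.frame(inperson = rep(1:5, each = 8))
sim$omtscore <- 30 - 0.5 * sim$inperson + rnorm(nrow(sim), sd = 4)

# factor() makes R build the indicators for categories 2-5, omitting category 1.
head(model.matrix(~ factor(inperson), data = sim), 3)

fit <- lm(omtscore ~ factor(inperson), data = sim)
group_means <- tapply(sim$omtscore, sim$inperson, mean)

# Intercept = mean of group 1; slopes = differences from group 1.
coef(fit)
c(group_means[1], group_means[-1] - group_means[1])
```

The last two lines should print the same five numbers, which is a handy way to read the regression output: each coefficient is just a gap in average scores relative to the "Most/all" group.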

In [None]:
# The story in Fall 2022
omtscores_reg22 <- lm(omtscore ~ factor(inperson), 
                    data = omtscores_22)
summary(omtscores_reg22)

Our study has control and treatment arms that are small by statistical standards, so a lack of statistical significance might be expected. But what signs and magnitudes do you see here? With which hypothesis is this more consistent?
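
One way to test the null hypothesis of no difference between groups all at once, rather than one coefficient at a time, is an F-test that compares the model with the group indicators to a model with only an intercept. A hedged sketch on simulated data (`sim` is made up for illustration and built to have no true group differences; the same pair of `lm()` calls could be pointed at `omtscores_22`):

```r
# Simulated data for illustration only; NOT the class scores.
set.seed(22)
sim <- data.frame(inperson = rep(1:5, each = 8))
sim$omtscore <- 30 + rnorm(nrow(sim), sd = 4)  # no true group differences

restricted   <- lm(omtscore ~ 1, data = sim)                 # null: one common mean
unrestricted <- lm(omtscore ~ factor(inperson), data = sim)  # separate group means

# F-test of the joint null that all four group coefficients are zero
ftest <- anova(restricted, unrestricted)
ftest
```

The F statistic in this table is the same one reported on the last line of `summary(unrestricted)`, so the regression output above already contains this joint test.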

<hr>

In [None]:
# The story in Fall 2023
omtscores_reg23 <- lm(omtscore ~ factor(inperson), 
                    data = omtscores_23)
summary(omtscores_reg23)

In [None]:
# The story in Fall 2022 and in Fall 2023
omtscores_reg22_23 <- lm(omtscore ~ factor(inperson) + factor(term), 
                    data = omtscores_22_23)
summary(omtscores_reg22_23)

<hr>

Here are some nifty moves courtesy of the Data Science team. I adapted their code to help us look at the different distributions of scores in the 5 different groups.

<h2>Fall 2022</h2>

In [None]:
# Plotting density curves for scores across the 5 answer categories
ggplot(omtscores_22, aes(omtscore)) + 
    geom_density(data = subset(omtscores_22, inperson == 1), color = "blue") +
    geom_density(data = subset(omtscores_22, inperson == 2), color = "red") +
    geom_density(data = subset(omtscores_22, inperson == 3), color = "orange") +
    geom_density(data = subset(omtscores_22, inperson == 4), color = "green") +
    geom_density(data = subset(omtscores_22, inperson == 5), color = "black") +
    labs(title = "Comparison of the Online Midterm Score by In-Person Frequency",
         subtitle = "Blue = Most/all     Red = Majority     Orange = Half     Green = Some     Black = Rarely/never") +
    xlab("Score") +
    ylab("Density")
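
Each curve drawn by `geom_density()` is a kernel density estimate: a smoothed histogram built by placing a small bump (a Gaussian, under the default settings) on every observation and adding the bumps up. A rough sketch of the same kind of calculation with base R's `density()` on simulated scores (the vector `scores` is invented for illustration, not the class data):

```r
# Simulated scores for illustration only; NOT the class data.
set.seed(2022)
scores <- rnorm(40, mean = 30, sd = 5)

# Gaussian kernel with the "nrd0" rule-of-thumb bandwidth (density()'s defaults).
kde <- density(scores, kernel = "gaussian", bw = "nrd0")

# The chosen bandwidth controls how smooth the curve looks.
kde$bw

# A density integrates to about 1 (approximated here with a Riemann sum).
area <- sum(kde$y) * diff(kde$x[1:2])
area
```

Calling `plot(kde)` draws the curve. With only a handful of students in a group, the shape is sensitive to the bandwidth, so small bumps in the group curves should not be over-interpreted.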

In [None]:
# Plotting density curves for all scores
ggplot(omtscores_22, aes(omtscore)) + 
    geom_density(data = omtscores_22, color = "purple") +
    labs(title = "Online Midterm Scores") +
    xlab("Score") +
    ylab("Density")

<h2>Fall 2023</h2>

In [None]:
# Plotting density curves for scores across the 5 answer categories
ggplot(omtscores_23, aes(omtscore)) + 
    geom_density(data = subset(omtscores_23, inperson == 1), color = "blue") +
    geom_density(data = subset(omtscores_23, inperson == 2), color = "red") +
    geom_density(data = subset(omtscores_23, inperson == 3), color = "orange") +
    geom_density(data = subset(omtscores_23, inperson == 4), color = "green") +
    geom_density(data = subset(omtscores_23, inperson == 5), color = "black") +
    labs(title = "Comparison of the Online Midterm Score by In-Person Frequency",
         subtitle = "Blue = Most/all     Red = Majority     Orange = Half     Green = Some     Black = Rarely/never") +
    xlab("Score") +
    ylab("Density")

In [None]:
# Plotting density curves for all scores
ggplot(omtscores_23, aes(omtscore)) + 
    geom_density(data = omtscores_23, color = "purple") +
    labs(title = "Online Midterm Scores") +
    xlab("Score") +
    ylab("Density")

<h2>Fall 2022 and Fall 2023 combined</h2>

In [None]:
# Plotting density curves for scores across the 5 answer categories
ggplot(omtscores_22_23, aes(omtscore)) + 
    geom_density(data = subset(omtscores_22_23, inperson == 1), color = "blue") +
    geom_density(data = subset(omtscores_22_23, inperson == 2), color = "red") +
    geom_density(data = subset(omtscores_22_23, inperson == 3), color = "orange") +
    geom_density(data = subset(omtscores_22_23, inperson == 4), color = "green") +
    geom_density(data = subset(omtscores_22_23, inperson == 5), color = "black") +
    labs(title = "Comparison of the Online Midterm Score by In-Person Frequency",
         subtitle = "Blue = Most/all     Red = Majority     Orange = Half     Green = Some     Black = Rarely/never") +
    xlab("Score") +
    ylab("Density")

Thoughts? Plan to come to class less often? More often? All of the above?

Thanks for participating!!

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>