In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# Estimating covid total cases

In [None]:
covid <- readRDS(sprintf("%s/rds/05_01_covid4.rds", datapath))

The covid dataset was compiled for this course from several sources:

In [None]:
covid %>% str

In [None]:
covid %>% filter(max_tc != max(max_tc))

In [None]:
covid %>% datatable

- max_tc: Cumulative number of cases until April 2020
- intl_flights: Total number of international flights that the country received between January and April 2020
- LP: Total population
- dom_flights: Total number of domestic flights within the country between January and April 2020
- sq_km: Land area of country in square kilometers
- household_size: Household size in persons

Your tasks are as follows:

- Create a linear model to estimate max_tc using all other variables. Assign the model result to an object. Note: You may exclude the country identifier and year columns, as we do not need them.
- Is there a relationship between the response and predictors?
- Write the regression equation. 
- What can you say about the significance of the relationship between the response and each predictor? 
- What can you say about the model fit?
- Calculate the fitted values for all observations. 
- Plot residuals vs fitted values. What are some insights?

Write your comments in markdown cells.

# Answer

In [None]:
covid2 <- covid %>% select(-c("iso3c", "title", "year"))

In [None]:
model1 <- lm(max_tc ~ ., covid2)

In [None]:
model1 %>% summary

model1 %>% tidy %>% filter(p.value < 0.05)

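- The model is significantly better than an intercept-only model (the p-value of the F-test is below 0.05), so there is a relationship between the response and the predictors.
- The coefficients of intl_flights and dom_flights are significantly different from 0 at the 0.05 level.
- The model explains about 90% of the total variance in max_tc (R-squared of about 0.9).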
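The task list also asks for the regression equation, which can be assembled from the estimated coefficients. The following is a minimal sketch in base R, not the course's own solution. It uses the built-in `mtcars` data as a stand-in, so the names `fit`, `equation`, `r_squared` and `f_p_value` are illustrative assumptions. Replacing `fit` with `model1` should give the equation for the covid model.

```r
# Stand-in data: mtcars ships with R; it is NOT the covid dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Build "response = b0 + b1*x1 + ..." from the estimated coefficients.
coefs <- coef(fit)
response <- all.vars(formula(fit))[1]
slope_terms <- sprintf("%+.4g*%s", coefs[-1], names(coefs)[-1])
equation <- paste0(response, " = ", sprintf("%.4g", coefs[1]), " ",
                   paste(slope_terms, collapse = " "))
print(equation)

# Overall fit: R-squared and the p-value of the overall F-test.
s <- summary(fit)
r_squared <- s$r.squared
f <- s$fstatistic
f_p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
```

The signed format `%+.4g` keeps the sign of each slope in the printed equation, so negative coefficients appear as subtractions.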

In [None]:
predicted <- predict(model1, covid2)

In [None]:
data.table(predicted = predicted, residual = covid2$max_tc - predicted) %>%
    ggplot(aes(x = predicted, y = residual)) +
    geom_point()

- The residuals show a pattern across the predicted values, and their variance is not constant (heteroscedasticity). The model specification might be reconsidered.
- The outlier can distort the model. The model can be refit after excluding it.
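The effect of a single extreme observation can be checked by refitting without it and comparing the coefficients. The following is a minimal sketch on simulated data, not on the covid dataset. The data frame `toy`, its values and the planted outlier are hypothetical. The filter mirrors the `max_tc != max(max_tc)` condition used earlier in this notebook.

```r
set.seed(1)

# Hypothetical toy data standing in for covid2; all values are simulated.
toy <- data.frame(intl_flights = runif(50, 0, 1000))
toy$max_tc <- 5 * toy$intl_flights + rnorm(50, sd = 100)
toy$max_tc[1] <- 1e5  # plant one extreme value to act as the outlier

# Fit on all rows, then on the rows that exclude the largest max_tc.
fit_all <- lm(max_tc ~ ., data = toy)
toy_trim <- toy[toy$max_tc != max(toy$max_tc), ]
fit_trim <- lm(max_tc ~ ., data = toy_trim)

# Compare the coefficients of the two fits side by side.
comparison <- rbind(all = coef(fit_all), trimmed = coef(fit_trim))
print(comparison)
```

If the coefficients change substantially between the two fits, the outlier is influential, and the residual plot should be redrawn for the trimmed model.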