# POLI 175

## Class 05 - More Julia and More Regression

Dr. Umberto Mignozzetti

UCSD

# Julia

## Load CSV files

In [None]:
# Load packages
import Pkg; Pkg.add("LaTeXStrings"); Pkg.add("StatsPlots")
using CSV, DataFrames, Plots, GLM, StatsBase, Random, LaTeXStrings, StatsPlots

# Auxiliary function: scatterplot matrix for every pair of columns in df
function pairplot(df)
    _, cols = size(df)
    plots = []
    for row = 1:cols, col = 1:cols
        push!(
            plots,
            scatter(
                df[:, row],
                df[:, col],
                xtickfont = font(4),
                ytickfont = font(4),
                legend = false,
            ),
        )
    end
    plot(plots..., layout = (cols, cols))
end

In [None]:
# Loading the Duncan occupational prestige dataset
urldat = "https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/Duncan.csv"
dat = CSV.read(download(urldat), DataFrame)

## Exploring

In [None]:
# Dataset dimension
size(dat)

In [None]:
# Column names
names(dat)

## Exploring

In [None]:
# Head
first(dat, 3)

In [None]:
# Tail
last(dat, 3)

## Exploring

In [None]:
describe(dat)

# Regression Analysis

# Regression

A few questions about `prestige`:

1. Is there a relationship between `prestige` and `income`?
1. How strong is the relationship between `prestige` and `income`?
1. Which variables are associated with `prestige`?
1. How can we accurately predict the prestige of professions not studied in this survey?
1. Is the relationship linear?
1. Is there a synergy among predictors?

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

In [None]:
scatter(
    dat.income, dat.prestige,
    xlabel = "Income", ylabel = "Prestige",
    series_annotations = text.(dat.profession, :bottom, 8),
    legend = false
)

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

$$ \widehat{prestige}_i = \widehat{\beta}_0 + \widehat{\beta}_1 income_i $$

In [None]:
mod = lm(@formula(prestige ~ income), dat)

Now, where should we look to decide whether the two are related?

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

**Coefficient:** Is it statistically distinguishable from zero?

I bet this sounds silly... You may be thinking: "What do you mean? I see a 2.46 and a 1.08 there. They are different from zero."

Not necessarily. Let me cook some data to show you.

In [None]:
# No relationship between X and Y
Random.seed!(12345)
df = DataFrame(x = randn(10), y = randn(10))
lm(@formula(x ~ y), df)

In [None]:
# Again, no relationship between X and Y
df = DataFrame(x = randn(10), y = randn(10))
lm(@formula(x ~ y), df)

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

We need to know how precise our coefficients are. We find that out by computing the `standard errors`.

$$ SE(\hat{\beta}_0) = \sqrt{\sigma^2\left[\dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_i(x_i-\overline{x})^2}\right]}\ , \quad SE(\hat{\beta}_1) = \sqrt{\dfrac{\sigma^2}{\sum_i(x_i-\overline{x})^2}}$$

Suppose our coefficient estimates are normally distributed, with mean equal to the estimated coefficient and variance equal to the square of the standard error. Then we can benchmark how common it is to see a "zero" relationship!
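The standard-error formulas above can be checked by hand. The following is a minimal sketch on simulated data: the seed, the sample size, and the simulated model are assumptions made for illustration, not part of the Duncan data. Since $\sigma^2$ is unknown in practice, the sketch replaces it with the usual estimate $RSS/(n-2)$.

```julia
using Random, Statistics

Random.seed!(2024)                      # arbitrary seed, for reproducibility only
n = 50
x = randn(n)
y = 2 .+ 0.5 .* x .+ randn(n)           # hypothetical model: intercept 2, slope 0.5

# OLS estimates for the simple regression
sxx = sum((x .- mean(x)) .^ 2)
b1 = sum((x .- mean(x)) .* (y .- mean(y))) / sxx
b0 = mean(y) - b1 * mean(x)

# Estimate sigma^2 with RSS / (n - 2), then plug it into the SE formulas
rss = sum((y .- b0 .- b1 .* x) .^ 2)
sigma2 = rss / (n - 2)
se_b0 = sqrt(sigma2 * (1 / n + mean(x)^2 / sxx))
se_b1 = sqrt(sigma2 / sxx)

println((b0 = b0, se_b0 = se_b0, b1 = b1, se_b1 = se_b1))
```

These values should agree with the `Std. Error` column that `lm` would report for the same simulated data.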

In [None]:
#= 
First cooked regression, repeated 100 
times under the same parameters found
=#
Random.seed!(12345)  
p1 = histogram(-0.277385 .+ (0.290078 .* randn(100)))
vline!([-0.277385, 0], linewidth = 4)
p2 = histogram(-0.487333 .+ (0.255324 .* randn(100)))
vline!([-0.487333, 0], linewidth = 4)
plot(p1, p2, layout = (1, 2))

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

***Question:*** How common is zero in the 100 trials?

In [None]:
#= 
Second cooked regression, repeated 100 
times under the same parameters found
=#
Random.seed!(12345)
p1 = histogram(-0.106654 .+ (0.175614 .* randn(100)))
vline!([-0.106654, 0], linewidth = 4)
p2 = histogram(0.169026 .+ (0.227732 .* randn(100)))
vline!([0.169026, 0], linewidth = 4)
plot(p1, p2, layout = (1, 2))

## Simple Linear Regression

### Is there a relationship between `prestige` and `income`?

Now we can answer this question:

In [None]:
mod = lm(@formula(prestige ~ income), dat)

In [None]:
Random.seed!(12345)
p1 = histogram(2.45657 .+ (5.19006 .* randn(100)), label = latexstring("\\widehat{\\beta}_0"))
vline!([2.45657, 0], linewidth = 4, label = latexstring("\\widehat{\\beta}_0"))
p2 = histogram(1.08039 .+ (0.107369 .* randn(100)), label = latexstring("\\widehat{\\beta}_1"))
vline!([1.08039, 0], linewidth = 4, label = latexstring("\\widehat{\\beta}_1"))
plot(p1, p2, layout = (1, 2))

# Regression

## Simple Linear Regression

A few questions about `prestige`:

1. *Is there a relationship between `prestige` and `income`?* **Yes!**
1. How strong is the relationship between `prestige` and `income`?
1. Which variables are associated with `prestige`?
1. How can we accurately predict the prestige of professions not studied in this survey?
1. Is the relationship linear?
1. Is there a synergy among predictors?


## Simple Linear Regression

### Estimation

- Actual value:

$$ y_i = \hat{y}_i + e_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i $$

- The `best` line is the one that minimizes the **residual sum of squares**:

$$ RSS \ = \ e_1^2 + e_2^2 + \cdots + e_n^2 $$

- It is a well-behaved (convex) function of $\hat{\beta}_0$ and $\hat{\beta}_1$.

## Simple Linear Regression

### Estimation

- With simple optimization, we can find the $\hat{\beta}$s that minimize this.

![reg](https://github.com/umbertomig/POLI175julia/blob/e0a55ce2350643ed200a482a3581479c6a8e3f74/img/fig2.png?raw=true)
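For the simple regression case, the minimization can be worked out by hand. Setting the partial derivatives of the RSS to zero gives two equations:

$$ \dfrac{\partial RSS}{\partial \hat{\beta}_0} = -2\sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\ , \quad \dfrac{\partial RSS}{\partial \hat{\beta}_1} = -2\sum_i x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 $$

Solving them yields the familiar closed-form estimates:

$$ \hat{\beta}_1 = \dfrac{\sum_i(x_i-\overline{x})(y_i-\overline{y})}{\sum_i(x_i-\overline{x})^2}\ , \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x} $$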

## Simple Linear Regression

### How strong is the relationship between `prestige` and `income`?

There are three ways to study this question in the regression framework:

1. What is the magnitude of the effect?
1. Are there bounds for the effects? What are they?
1. What proportion of the variation in prestige is explained by income?

## Simple Linear Regression

### How strong is the relationship between `prestige` and `income`?

What is the magnitude of the effect?

In [None]:
scatter(dat.income, dat.prestige, xlabel = "Income", 
    ylabel = "Prestige", smooth = :true, legend = false)
annotate!(20, 70, 
    latexstring("\\widehat{pr} = $(round(coef(mod)[1], digits = 2)) + \\textbf{$(round(coef(mod)[2], digits = 2))}inc"), :darkblue)

## Simple Linear Regression

### Assessing the accuracy of the whole model

$R^2$: Measure of goodness-of-fit.

It is widely used because it is between zero and one.

The proportion of the variability of $Y$ that is explained by modeling it using $X$.

It is defined as:

$$ \text{R}^2 \ = \ \dfrac{TSS - RSS}{TSS} \  = \ 1 - \dfrac{RSS}{TSS} $$

And the total sum of squares is defined as $TSS = \sum_i(y_i-\overline{y})^2$. 

All else equal, the higher the $R^2$, the better the in-sample fit.

In [None]:
r2(mod)

In [None]:
# Alternatively
TSS = sum((dat.prestige .- mean(dat.prestige)).^2)
RSS = sum((residuals(mod)).^2)
1 - RSS / TSS

## Simple Linear Regression

### Assessing the accuracy of the whole model

#### RMSE

The residual standard error (root mean squared error) is one of the best measures of the fit quality.

As we said in the second class, it is the criterion we use for most Supervised Machine Learning models.

It is defined as:

$$ \text{RMSE (or RSE)} \ = \ \sqrt{\dfrac{RSS}{n-2}} \ = \ \sqrt{\dfrac{\sum_i(y_i - \hat{y}_i)^2}{n-2}} $$

The lower, the better.

In [None]:
# MSE
RSS/dof_residual(mod)

In [None]:
# RMSE
sqrt(RSS/dof_residual(mod))

## Simple Linear Regression

### How strong is the relationship between `prestige` and `income`?

What is the magnitude of the effect? + What proportion of the variation in prestige is explained by income?

In [None]:
scatter(dat.income, dat.prestige, xlabel = "Income", 
    ylabel = "Prestige", smooth = :true, legend = false)
annotate!(20, 70, 
    latexstring("\\widehat{pr} = $(round(coef(mod)[1], digits = 2)) + $(round(coef(mod)[2], digits = 2))inc"), :darkblue)
annotate!(20, 60, 
    latexstring("R^2 = $(round(r2(mod), digits = 3))"), :darkblue)

## Simple Linear Regression

### How strong is the relationship between `prestige` and `income`?

Are there bounds for the effects? What are they?

The natural bound for our estimates is called a `confidence interval`.

A 95% confidence interval looks like this:

$$ \hat{\beta}_k \pm 1.96 \times SE(\hat{\beta}_k) $$

You may use the number 2 instead of 1.96. This number changes with the confidence level you choose: 95% = 1.96; 90% = 1.645; 99% = 2.576, etc.
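As a quick check, the interval for the income slope can be computed by hand. This is a minimal sketch that plugs in the slope estimate and standard error already used in the histogram cell earlier in these slides (1.08039 and 0.107369), with the normal critical value 1.96.

```julia
# Slope estimate and standard error from the prestige ~ income regression
b1 = 1.08039
se_b1 = 0.107369

z = 1.96                                 # normal critical value for a 95% interval
lower = b1 - z * se_b1
upper = b1 + z * se_b1
println("95% CI for the income slope: (", round(lower, digits = 3), ", ", round(upper, digits = 3), ")")
```

The interval is roughly 0.870 to 1.291, which excludes zero. A function such as `confint` typically relies on the t distribution rather than the normal, so its interval may come out slightly wider in a sample this small.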

In [None]:
mod

In [None]:
round.(confint(mod), digits = 3)

## Simple Linear Regression

### How strong is the relationship between `prestige` and `income`?

What is the magnitude of the effect? + What proportion of the variation in prestige is explained by income? + Effect bounds

On confidence vs. prediction intervals, see [this](https://real-statistics.com/regression/confidence-and-prediction-intervals/).
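For reference, in simple regression the two intervals at a new point $x_0$ differ by a single term. With $\hat{\sigma}$ the RMSE and $t^{*}_{n-2}$ the critical value of the t distribution with $n-2$ degrees of freedom, the standard textbook expressions are:

$$ \text{Confidence:} \quad \hat{y}_0 \pm t^{*}_{n-2} \ \hat{\sigma} \sqrt{\dfrac{1}{n} + \dfrac{(x_0-\overline{x})^2}{\sum_i(x_i-\overline{x})^2}} $$

$$ \text{Prediction:} \quad \hat{y}_0 \pm t^{*}_{n-2} \ \hat{\sigma} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0-\overline{x})^2}{\sum_i(x_i-\overline{x})^2}} $$

The extra $1$ under the square root accounts for the noise of a single new observation, which is why the prediction band is always wider than the confidence band.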


In [None]:
scatter(dat.income, dat.prestige, xlabel = "Income", 
    ylabel = "Prestige", smooth = :true, legend = :bottomright)
pred = DataFrame(income = minimum(dat.income .- 2):0.01:maximum(dat.income .+ 2));
pr = predict(mod, pred, interval = :confidence, level = 0.95)
plot!(pred.income, pr.prediction, label="confidence", linewidth=3, seriesalpha = 0.2,
        ribbon = (pr.prediction .- pr.lower, pr.upper .- pr.prediction), color = :red)
pr = predict(mod, pred, interval = :prediction, level = 0.95)
plot!(pred.income, pr.prediction, label="prediction", linewidth=3, seriesalpha = 0.2,
        ribbon = (pr.prediction .- pr.lower, pr.upper .- pr.prediction), color = :purple)
annotate!(20, 110, 
    latexstring("\\widehat{pr} = $(round(coef(mod)[1], digits = 2)) + $(round(coef(mod)[2], digits = 2))inc"), :darkblue)
annotate!(20, 100, 
    latexstring("R^2 = $(round(r2(mod), digits = 3))"), :darkblue)

# Regression

## Simple Linear Regression

A few questions about `prestige`:

1. *Is there a relationship between `prestige` and `income`?* **Yes!**
1. *How strong is the relationship between `prestige` and `income`?* **I'd say strong, but without a baseline it is hard to make a good comparison**.
1. Which variables are associated with `prestige`?
1. How can we accurately predict the prestige of professions not studied in this survey?
1. Is the relationship linear?
1. Is there a synergy among predictors?

## Multiple Linear Regression

- We use multiple linear regression when we have multiple predictors for the same outcome variable.

- Let:
    + $y_i$ be the variable we want to predict.
    + $x_{ik}$ be the variables we will use to make the prediction.
    + $p$ be the number of predictors.
    + $n$ be the number of observations.
    + $i$ be a given observation.
    + $k$ and $l$ be given predictors.
    + If we assume a linear relationship, we want to find an intercept $\beta_0$ and slopes $\beta_1, \ldots, \beta_p$.
    + Thus:

$$ y_i \ = \ \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \varepsilon_i $$

### Estimation

- And the residual sum of squares is defined similarly as before, but we optimize over more parameters:

$$ \text{RSS} \ = \ \sum_ie_i^2 \ = \ \sum_i(y_i - \hat{\beta}_0 - \hat{\beta}_1x_{i1} - \cdots - \hat{\beta}_px_{ip})^2 $$

![reg](https://github.com/umbertomig/POLI175julia/blob/e0a55ce2350643ed200a482a3581479c6a8e3f74/img/fig4.png?raw=true)
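The minimization of the RSS over several coefficients can be sketched directly with a design matrix. This is an illustration on simulated data, not the estimation routine used by `lm`: the seed, the sample size, and the coefficients are assumptions chosen for the example.

```julia
using Random, LinearAlgebra

Random.seed!(1)                              # arbitrary seed
n = 200
x1, x2 = rand(n), rand(n)
y = 1 .+ 5 .* x1 .- 10.5 .* x2 .+ randn(n)   # hypothetical coefficients

X = hcat(ones(n), x1, x2)                    # design matrix with an intercept column
beta_hat = X \ y                             # least-squares solution

e = y .- X * beta_hat                        # residuals
rss = sum(abs2, e)
println("estimates: ", round.(beta_hat, digits = 2), "  RSS: ", round(rss, digits = 2))
```

At the minimum, the residuals are orthogonal to every column of `X`, which is the matrix version of setting each partial derivative of the RSS to zero.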

## Multiple Linear Regression

Let us cook the following model:

$$ y_i = 1 + 5x_{i1} - 10.5 x_{i2} + \varepsilon_i $$

In [None]:
Random.seed!(4321)
df = DataFrame(x1 = rand(100), x2 = rand(100))
df.y = 1 .+ 5 .* df.x1 .- 10.5 .* df.x2 .+ randn(100)
describe(df)

In [None]:
pairplot(df)

## Multiple Linear Regression

Fitting a multiple regression would give us:

In [None]:
cookm = lm(@formula(y ~ x1 + x2), df)

Interpretation?

# Regression

## Simple Linear Regression

A few questions about `prestige`:

1. *Is there a relationship between `prestige` and `income`?* **Yes!**
1. *How strong is the relationship between `prestige` and `income`?* **I'd say strong, but without a baseline it is hard to make a good comparison**.
1. Which variables are associated with `prestige`?
1. How can we accurately predict the prestige of professions not studied in this survey?
1. Is the relationship linear?
1. Is there a synergy among predictors?

# Regression

## Which variables are associated with `prestige`?

In [None]:
pairplot(dat[:, Not("profession")])

# Regression

## Which variables are associated with `prestige`?

Let's try `income` and the `type` of the profession:

In [None]:
mod2 = lm(@formula(prestige ~ income + type), dat)

# Regression

## Which variables are associated with `prestige`?

Is this model better than the previous one?

### F-Statistic

The F-Statistic tests whether at least one coefficient is different from zero. The null hypothesis is:

$$ H_0: \ \beta_1 = \beta_2 = \cdots = \beta_p = 0 $$

The alternative hypothesis is:

$$ H_a: \ \exists k \in \{1, \cdots, p\}, \ s.t. \ \beta_k \neq 0 $$

The F-Statistic is equal to:

$$ \text{F} \ = \ \dfrac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}} \ \sim \ F(p, n-p-1) $$

## Which variables are associated with `prestige`?

### F-Statistic

Why is this a good test? Because under the null hypothesis:

$$ \mathbb{E}\left[\dfrac{TSS-RSS}{p}\right] = \mathbb{E}\left[\dfrac{RSS}{n-p-1}\right] = \sigma^2 $$

And so, $F \approx 1$ under $H_0$.
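To see this behavior, the F-Statistic can be computed by hand on simulated data. The sketch below is only an illustration: `overall_f` is a hypothetical helper name, and the seed, sample size, and coefficients are assumptions. It compares one outcome generated under $H_0$ with one generated under $H_a$.

```julia
using Random, Statistics, LinearAlgebra

Random.seed!(7)                              # arbitrary seed
n, p = 100, 2
X = hcat(ones(n), randn(n, p))               # intercept plus p hypothetical predictors

# Hypothetical helper: overall F-statistic for y regressed on X
function overall_f(X, y)
    n, k = size(X)                           # k = p + 1 columns, including the intercept
    p = k - 1
    rss = sum(abs2, y .- X * (X \ y))
    tss = sum(abs2, y .- mean(y))
    return ((tss - rss) / p) / (rss / (n - p - 1))
end

y_null = randn(n)                            # H0 true: y is unrelated to X
y_alt = 1 .+ 2 .* X[:, 2] .+ randn(n)        # H0 false: the first slope is 2

println("F under H0: ", round(overall_f(X, y_null), digits = 2))
println("F under Ha: ", round(overall_f(X, y_alt), digits = 2))
```

A single draw under $H_0$ will not be exactly 1, since F is only close to 1 on average, but it should be far smaller than the value obtained when a slope is truly nonzero.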

In [None]:
ftest(mod2.model)

## Which variables are associated with `prestige`?

But the best thing about F is that it allows us to compare two models, and check if we improved when moving from one to the other.

In this case, we can ask: does adding the `type` of profession to our model improve the fit?

## Which variables are associated with `prestige`?

### F-Statistic for model selection

Suppose we have $\{1, \cdots, l \}$ predictors, but we could add $\{l+1, \cdots, p \}$ extra predictors to our model.

We can test the *RSS of the restricted model* ($RSS_0$) against the *RSS of the full model* ($RSS$).

The null hypothesis is:

$$ H_0: \ \beta_{l+1} = \cdots = \beta_{p} = 0 $$

And the F-Stat:

$$ \text{F} \ = \ \dfrac{\frac{RSS_0-RSS}{p-l}}{\frac{RSS}{n-p-1}} \ \sim \ F(p-l, n-p-1) $$
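The same statistic can be computed by hand for a pair of nested models, where $RSS_0$ is the RSS of the restricted model. This is a minimal sketch on simulated data: `rss_of` is a hypothetical helper name, and the seed and coefficients are assumptions chosen for illustration.

```julia
using Random, LinearAlgebra

Random.seed!(11)                             # arbitrary seed
n = 120
x1, x2, x3 = randn(n), randn(n), randn(n)
y = 1 .+ 2 .* x1 .+ 0.8 .* x2 .+ randn(n)    # hypothetical model; x3 is irrelevant

rss_of(X, y) = sum(abs2, y .- X * (X \ y))   # hypothetical helper: RSS of an OLS fit

X0 = hcat(ones(n), x1)                       # restricted model: l = 1 predictor
X1 = hcat(ones(n), x1, x2, x3)               # full model: p = 3 predictors
l, p = 1, 3

rss0, rss1 = rss_of(X0, y), rss_of(X1, y)
F = ((rss0 - rss1) / (p - l)) / (rss1 / (n - p - 1))
println("RSS_0 = ", round(rss0, digits = 2), ", RSS = ", round(rss1, digits = 2), ", F = ", round(F, digits = 2))
```

Adding predictors can never increase the RSS, so the numerator is non-negative. The test asks whether the drop is large relative to the noise left in the full model.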

In [None]:
ftest(mod.model, mod2.model)

## Which variables are associated with `prestige`?

Your turn. Add `education` and study the result.

1. How to interpret it?
1. Run the F-test. Any improvements?
1. Answer the main Q: ***Which variables are associated with `prestige`?***

In [None]:
# Your code here

# Questions?

# See you next class
