# CSS 201 / 202 - CSS Bootcamp

## Week 06 - Lecture 01

### Umberto Mignozzetti

# Machine Learning

# Introduction

- My computer is good. 

- It is new, fast, and reliable. It is capable of storing 1.0 Terabytes.

- But humans produce around 2.5 quintillion bytes every day. This is equal to 2,500,000 Terabytes.

- Every day, we produce the equivalent of 2.5 million computers like mine of data!

    + Mostly lovely cat pics :)

# Introduction

- This is popularly known as Big Data.

- However, data per se means nothing! Big data is just a passive description of the world we live in now.

- To prove that, try to take a large dataset and learn something from it.

- If you want inspiration, take all your pictures and try to create coherent slides of moments of your life.

# Introduction

- This is hard and time-consuming. You would spend days doing it.

- But note that, interestingly, your phone does that for you every day!

- Almost every week, I open up my iPhone, and it shows me a slide show with music and pictures of my family and me.

- How does it do that? Machine Learning!

# Introduction

- Machine Learning is a branch of Artificial Intelligence that uses data and algorithms to imitate how humans learn (IBM).

- Algorithm: short for *recipe*.

- Data: can be anything.

- **And note the intent:** *Learn* here means making sense of things, discovering patterns, and predicting things.

# Introduction

- How can we use this as Social Scientists?

- Many applications in Political Science, Economics, Public Policy, Cognitive Sciences, etc.

- We will have an *applied focus*, meaning that we will talk about theory, but the focus will be on generating results.

# Introduction

We will use three books:

1. [ISL] James et al. (second edition, 2021) *Introduction to Statistical Learning with Applications in R. Springer.* [https://www.statlearning.com]

2. [MG] Müller & Guido (2017) *Introduction to Machine Learning with Python.* O'Reilly.

3. [PDA] McKinney (2013) *Python for Data Analysis.* O'Reilly.

## Introduction

- Suppose you are hired as a consultant to help design campaign expenditures for a firm.

- And they ask you: Where should we spend our resources? The options are: `TV`, `radio`, and `newspaper`.

- They want to maximize the sales revenue.

- Where would you spend the money?

## Introduction

- Let me give you a bit more info: here are the previous advertising expenditures and their effects on sales:

![image](https://github.com/umbertomig/POLI175public/blob/main/img/sales.png?raw=true)

## Introduction

- Did it help?

- Some people would say yes, I'd say *not really*.

![image](https://github.com/umbertomig/POLI175public/blob/main/img/sales.png?raw=true)

## Introduction

Let's formalize the ideas:

- $X$: Matrix of predictors ($X_1$: TV expenditures, $X_2$: radio, $X_3$: newspaper)

- $Y$: Response variable

- $f(.)$: Unknown function that connects the predictors with the response variable.

- $\varepsilon$: Random error term

$$ Y \ = \ f(X) + \varepsilon $$

## Introduction

Another example: Do you think your years of study will reflect into a better salary in the future?

- $Y$: Future salary

- $X$: Years of study

![image](https://github.com/umbertomig/POLI175public/blob/main/img/educ.png?raw=true)

## Why estimate $f$?

Our job when doing ML is to estimate $f$. But why do we do that?

1. **Prediction**: We want to predict the values of $Y$: $\hat{Y} = \hat{f}(X)$
    
$$ E(Y - \hat{Y})^2 \ = \ E[f(X) + \varepsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\varepsilon)}_{\text{Non-reducible}} $$
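
A small simulation can make this decomposition concrete. The sketch below is only an illustration: the "true" function `f`, the imperfect estimate `f_hat`, the point `x0`, and the error standard deviation are all made-up assumptions, not part of the lecture data.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function (assumption for illustration)
    return 2 + 3 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f (assumption for illustration)
    return 2.5 + 2.5 * x

x0 = 2.0                                  # fixed predictor value
sigma = 1.0                               # standard deviation of the error term
eps = rng.normal(0, sigma, 1_000_000)     # random errors
y = f(x0) + eps                           # Y = f(X) + epsilon

total = np.mean((y - f_hat(x0)) ** 2)     # simulated E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2      # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                  # Var(epsilon)

print(total, reducible + irreducible)
```

The two printed numbers should be close: a better `f_hat` can shrink the reducible part, but nothing we do to `f_hat` changes `Var(epsilon)`.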

## Why estimate $f$?

2. **Inference**: We want, as scientists, to understand how $Y$ is related to a set of $X$s.
    
    1. *Which predictors are associated with the response?*
    
    2. *What is the relationship between the response and each predictor?*
    
    3. *Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?*

## How do we estimate $f$?

- Consider a set of $n$ observations, $(Y_1, X_1)$, ..., $(Y_n, X_n)$.

- We will call these observations the **training set**, since we will use these to estimate the function $f$.

- Broadly speaking we have two methods to estimate the $f$ function:

    1. Parametric

    2. Non-parametric

## How do we estimate $f$?

**Parametric**:

1. We make an assumption about the functional form, e.g., that the f.f. is linear:

$$ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p $$

2. After the f.f. is selected, we fit (train) the model using the data.

## How do we estimate $f$?

**Parametric**:

$$ \text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority} $$

![image](https://github.com/umbertomig/POLI175public/blob/main/img/linreg.png?raw=true)

## How do we estimate $f$?

**Parametric**:

- This parametric approach has advantages. The main one is that it is straightforward to estimate.

- However, it is not very flexible, and it does not capture more complex relationships.

- We can estimate more flexible relations, but we may *overfit* our estimates.

- We can always conjecture the wrong $f$!

- In any case, in the parametric models we need to make assumptions regarding the f.f. of $f$.

## How do we estimate $f$?

**Non-parametric**:

- Does not assume the f.f. of $f$.

- Seek an estimate of $f$ that gets as close to the data points as possible, without being too rough or wiggly.

- Requires lots of observations.

- *Overfitting* becomes a more salient problem.

**Overfitting:** The estimate does well in the training set, but when you apply it to other observations, it does poorly.

## How do we estimate $f$?

**Non-parametric**: Thin-plate splines

![spline](https://github.com/umbertomig/POLI175public/blob/main/img/spline.png?raw=true)

# Estimation of $f$

**Trade-offs:** Flexibility x Interpretability

- *Why would we ever choose to use a more restrictive method instead of a very flexible approach?*

- If you are a scientist, you may want to interpret the results more than have a flexible but hard-to-understand approach.

- Thus, when **inference** is the goal, we may choose a more restrictive model.

- When **prediction** is the goal, we may use a more flexible model. It captures more nuanced relationships.

- Think self-driving Teslas: you need to predict when to turn, not explain it to me.

- But the interpretability problem does not go away: think about why some people complain about self-driving Teslas.

## Estimation of $f$

**Trade-offs:** Flexibility x Interpretability

![flexint](https://github.com/umbertomig/POLI175public/blob/main/img/flexint.png?raw=true)

## Estimation of $f$

**Approaches:** Supervised x Unsupervised Machine Learning

- The machine learning techniques roughly divide into *Supervised* and *Unsupervised* methods.

- **Supervised:** For each observation $i$, we have a target $Y_i$.

- **Unsupervised:** We have **no** target $Y_i$. Only $X_i$s, and we want to make sense of them.

- **Semi-Supervised:** We know a few $Y_i$, but we want to predict the $Y_i$s for the majority of the data.

## Estimation of $f$

**Unsupervised approach:**

![unsup](https://github.com/umbertomig/POLI175public/blob/main/img/unsup.png?raw=true)

## Model Accuracy

- Too many methods... How to choose?

- *There is no free lunch in statistics*: **no one method dominates all others over all possible data sets.**

- We will spend some time choosing methods, and then, choosing the best *tuning* parameters for these methods.

- One criterion: 

**Mean Squared Error (MSE)**

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$

## Model Accuracy

**Mean Squared Error (MSE)**

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$

- We can compute the MSE on the *training* data, but what we really want to know is how the model performs on *unseen* data.

- That's why for most training purposes, we will split our dataset into two parts: *training* and *testing*.

- We want to compute the MSE in this *testing* data: it is our best shot at knowing how it is going to behave in real-world applications!
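
A minimal sketch of the training/testing idea, using simulated data and polynomial fits of increasing flexibility. The data-generating function, the 100/100 split, and the degrees tried are illustrative assumptions, not choices made in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a hypothetical nonlinear f (assumption for illustration)
x = rng.uniform(-2, 2, 200)
y = np.sin(2 * x) + rng.normal(0, 0.3, 200)

# Random split: 100 training and 100 testing observations
idx = rng.permutation(200)
train, test = idx[:100], idx[100:]

def mse(y_true, y_pred):
    # Mean squared error, as defined above
    return np.mean((y_true - y_pred) ** 2)

results = {}
for degree in [1, 3, 10]:
    # Fit on the training data only
    coefs = np.polyfit(x[train], y[train], degree)
    results[degree] = (
        mse(y[train], np.polyval(coefs, x[train])),  # training MSE
        mse(y[test], np.polyval(coefs, x[test])),    # testing MSE
    )
    print(degree, results[degree])
```

The training MSE can only go down as the polynomial degree grows, because each model nests the previous one. The testing MSE has no such guarantee, which is why it is the number to watch.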

## Model Accuracy

**Mean Squared Error (MSE)**

![bvt](https://github.com/umbertomig/POLI175public/blob/main/img/bvt.png?raw=true)

## Model Accuracy

**Mean Squared Error (MSE)**

![bvt](https://github.com/umbertomig/POLI175public/blob/main/img/bvt2.png?raw=true)

## Model Accuracy

- This trade-off is called **Bias-Variance Trade-off**.

- When we adopt a more flexible approach, we **decrease** the bias (distance between $f$ and $\hat{f}$).

    - This means that the training MSE decreases.

- However, when we adopt a more flexible approach, we **increase** the variance (think overfitting).

$$ E(y_0 - \hat{f}(x_0))^2 \ = \ Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\varepsilon) $$

- Our job is to fit a model that has **low bias** and **low variance**.
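
The decomposition can be checked by simulation: draw many training sets, refit the model each time, and look at how $\hat{f}(x_0)$ varies. This is a sketch under made-up assumptions (the true function, the point `x0`, the sample size, and the two polynomial degrees are all hypothetical).

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Hypothetical true function (assumption for illustration)
    return np.sin(2 * x)

x0, sigma, n, reps = 1.0, 0.3, 50, 2000

def simulate(degree):
    preds = np.empty(reps)
    sq_err = np.empty(reps)
    for r in range(reps):
        # A fresh training set each repetition
        x = rng.uniform(-2, 2, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)  # f_hat(x0)
        # A fresh test response at x0
        y0 = f(x0) + rng.normal(0, sigma)
        sq_err[r] = (y0 - preds[r]) ** 2
    variance = preds.var()                       # Var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2        # [Bias(f_hat(x0))]^2
    return variance, bias_sq, sq_err.mean()      # last entry: simulated test MSE

out = {}
for degree in [1, 5]:
    out[degree] = simulate(degree)
    variance, bias_sq, test_mse = out[degree]
    print(degree, variance, bias_sq, variance + bias_sq + sigma ** 2, test_mse)
```

For each degree, the last two printed numbers should roughly agree: the sum of variance, squared bias, and $Var(\varepsilon)$ approximates the simulated test MSE at $x_0$.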

## Model Accuracy

**Bias-Variance Trade-off**

![bvt](https://github.com/umbertomig/POLI175public/blob/main/img/bvt3.png?raw=true)

# Regression

# Regression

- Regression analysis is one of the most studied approaches for Supervised ML.

- It has been around for a long time: we know it well.

- It is a great starting point for learning more sophisticated methods.

# Regression

- Consider that we run a survey to measure the `prestige` of several professions in the U.S. (We are going to study a survey like this in the next class.)

- A few questions about `prestige`:
    + Is there a relationship between `prestige` and `income`?
    + How strong is the relationship between `prestige` and `income`?
    + Which variables are associated with `prestige`?
    + How can we accurately predict the prestige of professions not studied in this survey?
    + Is the relationship linear?
    + Is there a synergy among predictors?
    
- These are relevant questions, and regression analysis can help us here.

## Simple Linear Regression

### Estimation

- It lives up to its name! A very simple approach to regression:

- Let:
    + $y_i$ be the variable we want to predict.
    + $x_i$ be the variable we will use to make the prediction.
    + $\beta_1$ be the slope and $\beta_0$ the intercept we want to find, assuming a linear relationship.
    + $n$ be the number of observations.
    + $i$ index a given observation.
    + Thus:

$$ y_i \approx \beta_0 + \beta_1 x_i $$

## Simple Linear Regression

### Estimation

How do we estimate $\beta_0$ and $\beta_1$? 

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig1.png?raw=true)

## Simple Linear Regression

### Estimation

- We are searching, among all the possible lines, for the one that does `best`.

- What does `best` mean in this context?

- One concept: minimize the distance between the `predicted` values and the `actual` value.

- Predicted value:

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i $$

## Simple Linear Regression

### Estimation

- Actual value:

$$ y_i = \hat{y}_i + e_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i $$

- And `best` here will mean that we minimized the **residuals sum of squares**:

$$ RSS \ = \ e_1^2 + e_2^2 + \cdots + e_n^2 $$

- It is a well-behaved function of $\hat{\beta}_0$ and $\hat{\beta}_1$.

## Simple Linear Regression

### Estimation

- With simple optimization, we can find the $\hat{\beta}$s that minimize this.

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig2.png?raw=true)
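
For simple linear regression, minimizing the RSS has a well-known closed-form solution: $\hat{\beta}_1 = \sum_i (x_i - \overline{x})(y_i - \overline{y}) / \sum_i (x_i - \overline{x})^2$ and $\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}$. The sketch below applies these formulas to simulated ("cooked") data; the true parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Cooked data with known parameters: beta_0 = 2, beta_1 = 3 (assumption)
x = rng.normal(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)

# Closed-form least squares estimates
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual sum of squares at the minimum
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)

# Cross-check against numpy's generic least squares polynomial fit
slope_np, intercept_np = np.polyfit(x, y, 1)

print(beta0_hat, beta1_hat, rss)
```

Any other line, for example one with a slightly shifted intercept, gives a larger RSS than the fitted one.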

## Simple Linear Regression

### Assessing the accuracy of the estimates

We rarely know the true parameters $\beta_1$ and $\beta_0$ (we only do if we `cook the data`).

How do we know how good these $\hat{\beta}_1$ and $\hat{\beta}_0$ are as an approximation of the true $\beta$s?

## Simple Linear Regression

### Assessing the accuracy of the estimates

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig3.png?raw=true)

## Simple Linear Regression

### Assessing the accuracy of the estimates

- We find how precise our estimates are by computing the `standard error`.

- The standard error of a given $\hat{\beta}_k$ is the square root of its variance.

- The variance, for each of the $\hat{\beta}$s in here, is:

$$ SE(\hat{\beta}_0)^2 = \sigma^2\left[\dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_i(x_i-\overline{x})^2}\right]\ , \quad SE(\hat{\beta}_1)^2 = \dfrac{\sigma^2}{\sum_i(x_i-\overline{x})^2}$$

## Simple Linear Regression

### Assessing the accuracy of the estimates

- And since $\sigma^2 = Var(\varepsilon)$, the variance of the true error term, is unknown, we also need to estimate it:

$$ \hat{\sigma} \ = \ \sqrt{\dfrac{RSS}{n-2}} $$

- These standard errors give us an idea of how much we can `trust` our estimates. The smaller, the better!

## Simple Linear Regression

### Assessing the accuracy of the estimates

#### Confidence Intervals

- We can also put together a `confidence interval` for our estimates:

- A 95\% confidence interval looks like this:

$$ \hat{\beta}_k \pm 1.96 \times SE(\hat{\beta}_k) $$

- The number 1.96 would change depending on the confidence level you choose.
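
The standard error and confidence interval formulas can be computed by hand. This is a sketch on simulated data (the true parameters are assumptions), with $\sigma$ replaced by its estimate $\sqrt{RSS/(n-2)}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Cooked data: beta_0 = 2, beta_1 = 3 (assumption for illustration)
x = rng.normal(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)
n = len(x)

# Least squares estimates
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Estimate of sigma from the residuals
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma_hat = np.sqrt(rss / (n - 2))

# Standard errors, following the variance formulas above
se_beta0 = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)
se_beta1 = sigma_hat / np.sqrt(sxx)

# Approximate 95% confidence interval for the slope
ci_beta1 = (beta1_hat - 1.96 * se_beta1, beta1_hat + 1.96 * se_beta1)

print(se_beta0, se_beta1, ci_beta1)
```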


## Simple Linear Regression

### Assessing the accuracy of the estimates

#### Hypothesis testing

- We can also test whether the coefficient could be considered `statistically different` from zero.

- We test the hypothesis (called null hypothesis):

$$ H_0: \beta_k = 0 $$

- Against the hypothesis (called an alternative hypothesis):

$$ H_a: \beta_k \neq 0 $$

## Simple Linear Regression

### Assessing the accuracy of the estimates

#### Hypothesis testing

And to do that, we put together the `t-statistic`:

$$ t = \dfrac{\hat{\beta}_k - 0}{SE(\hat{\beta}_k)} \ \sim \ \text{Student's-t}(n-2) $$

- And the p-value is the probability of finding a value at least as extreme as the observed $t$ (in absolute value) in the Student's-t distribution.
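
A sketch of the t-test for the slope, computed by hand on simulated data. It assumes `scipy` is installed (it is not among the packages loaded in this lecture) and uses a deliberately weak hypothetical slope so that the p-value is not trivially zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Cooked data with a weak slope: beta_1 = 0.3 (assumption for illustration)
x = rng.normal(0, 1, 100)
y = 2 + 0.3 * x + rng.normal(0, 1, 100)
n = len(x)

# Slope estimate and its standard error
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
se_beta1 = np.sqrt(rss / (n - 2)) / np.sqrt(sxx)

# t-statistic for H_0: beta_1 = 0, and the two-sided p-value with n - 2 degrees of freedom
t_stat = (beta1_hat - 0) / se_beta1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)

# Cross-check with scipy's built-in simple regression
res = stats.linregress(x, y)

print(t_stat, p_value, res.pvalue)
```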

## Simple Linear Regression

### Assessing the accuracy of the whole model

#### RSE

The residual standard error is one of the best measures of the fit quality.

As we said in the second class, it is the criterion we use for most Supervised Machine Learning models.

It is defined as:

$$ \text{RSE} \ = \ \sqrt{\dfrac{RSS}{n-2}} \ = \ \sqrt{\dfrac{\sum_i(y_i - \hat{y}_i)^2}{n-2}} $$

The lower, the better.

## Simple Linear Regression

### Assessing the accuracy of the whole model

#### $R^2$

The $R^2$ is a measure of goodness-of-fit.

It is widely used because it is between zero and one.

It is the proportion of the variability of $Y$ that is explained by modeling it using $X$.

It is defined as:

$$ \text{R}^2 \ = \ \dfrac{TSS - RSS}{TSS} \  = \ 1 - \dfrac{RSS}{TSS} $$

And the total sum of squares is defined as $TSS = \sum_i(y_i-\overline{y})^2$. 

The higher the $R^2$, the better.
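
Both whole-model measures follow directly from the RSS and the TSS. A minimal sketch on simulated data (the true parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Cooked data: beta_0 = 2, beta_1 = 3 (assumption for illustration)
x = rng.normal(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)
n = len(x)

# Fit the line and compute predicted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

rse = np.sqrt(rss / (n - 2))            # residual standard error
r2 = 1 - rss / tss                      # R-squared

print(rse, r2)
```

In simple linear regression, $R^2$ equals the squared correlation between $x$ and $y$, which is a handy sanity check.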

## Simple Linear Regression

### Assessing the accuracy of the whole model

#### F-Statistic

- Not now! First, multiple Linear Regression :)

## Multiple Linear Regression

- We use multiple linear regression when we have multiple predictors for the same outcome variable.

- Let:
    + $y_i$ be the variable we want to predict.
    + $x_{ik}$ be the variables we will use to make the prediction.
    + $p$ be the number of predictors.
    + $\beta_1, \ldots, \beta_p$ be the slopes and $\beta_0$ the intercept we want to find, assuming a linear relationship.
    + $n$ be the number of observations.
    + $i$ index a given observation.
    + $k$ and $l$ index given predictors.
    + Thus:
    
$$ y_i \ = \ \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \varepsilon_i $$

### Estimation

- And the residual sum of squares is defined similarly as before, but we optimize over more parameters:

$$ \text{RSS} \ = \ \sum_ie_i^2 \ = \ \sum_i(y_i - \hat{\beta}_0 - \hat{\beta}_1x_{i1} - \cdots - \hat{\beta}_px_{ip})^2 $$

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig4.png?raw=true)

### Assessing accuracy

#### F-Statistic

- Now, yes! The F-Statistic tests whether at least one coefficient is different from zero.

- The null hypothesis is:

$$ H_0: \ \beta_1 = \beta_2 = \cdots = \beta_p = 0 $$

### Assessing accuracy

#### F-Statistic

- And the alternative hypothesis is:

$$ H_a: \ \exists k \in \{1, \cdots, p\}, \ s.t. \ \beta_k \neq 0 $$

- And $F$ is equal to:

$$ \text{F} \ = \ \dfrac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}} \ \sim \ F(p, n-p-1) $$

### Assessing accuracy

#### F-Statistic

- Why is this a good test? Because under the null hypothesis:

$$ \mathbb{E}\left[\dfrac{TSS-RSS}{p}\right] = \mathbb{E}\left[\dfrac{RSS}{n-p-1}\right] = \sigma^2 $$

- And so, $F \approx 1$ under $H_0$.
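
A sketch of the overall F-statistic computed by hand, first on simulated data where some predictors matter, then repeatedly on data where $H_0$ is true. The sample size, number of predictors, and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3

def f_statistic(X, y):
    """Overall F-statistic for a least squares fit with an intercept."""
    n_obs, n_pred = X.shape
    design = np.column_stack([np.ones(n_obs), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / n_pred) / (rss / (n_obs - n_pred - 1))

# Case 1: two of the three predictors truly matter (assumption)
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - 1 * X[:, 2] + rng.normal(size=n)
f_signal = f_statistic(X, y)

# Case 2: H_0 is true (no predictor matters), repeated many times
f_null = np.array([
    f_statistic(rng.normal(size=(n, p)), rng.normal(size=n))
    for _ in range(1000)
])

print(f_signal, f_null.mean())
```

When the predictors matter, $F$ is far above 1; under $H_0$, the simulated values of $F$ average out close to 1.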


#### F-Statistic for model selection

- Suppose we have $\{1, \cdots, l \}$ predictors, but we could add $\{l+1, \cdots, p \}$ extra predictors in our model.

- Does that make sense? We can test the RSS of the restricted model against the RSS of the full model.

- The null hypothesis is:

$$ H_0: \ \beta_{l+1} = \cdots = \beta_{p} = 0 $$

- And the F-Stat:

$$ \text{F} \ = \ \dfrac{\frac{RSS_0-RSS}{p-l}}{\frac{RSS}{n-p-1}} \ \sim \ F(p-l, n-p-1) $$
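
The same comparison can be sketched by hand: fit the restricted and the full model, then plug both RSS values into the formula. The data, the number of predictors, and which predictors truly matter are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, l = 200, 4, 2
X = rng.normal(size=(n, p))

# Hypothetical truth: only the first l = 2 predictors matter
y = 1 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(size=n)

def rss_of(columns):
    """RSS of the least squares fit with an intercept plus the given columns of X."""
    design = np.column_stack([np.ones(n), X[:, columns]])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta_hat) ** 2)

rss_0 = rss_of(list(range(l)))       # restricted model: first l predictors
rss_full = rss_of(list(range(p)))    # full model: all p predictors

# F-statistic for H_0: the extra p - l coefficients are all zero
f_stat = ((rss_0 - rss_full) / (p - l)) / (rss_full / (n - p - 1))

print(rss_0, rss_full, f_stat)
```

With no predictors in the restricted model, `rss_of([])` is just the TSS, so the overall F-statistic is the special case $l = 0$.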

#### Deciding on important variables

- Several criteria can be used. We will discuss later their trade-offs.

- But we have a couple of automated ways to select them that are easier to implement:

1. **Forward selection**:
    + Start with the null model and fit $p$ simple regressions, one for each predictor. 
    + Add to the model the variable that results in the lowest RSS.
    + Repeat until some stopping rule is satisfied.


#### Deciding on important variables

2. **Backward selection**:
    + Start with the full model, with all $p$ predictors. 
    + Remove the variable with the largest p-value (the least statistically significant one).
    + Fit the new model with $p-1$ variables.
    + Repeat until some stopping rule is satisfied.

#### Deciding on important variables

3. **Mixed selection**:
    + Start with the null model and fit $p$ simple regressions, one for each predictor.
    + Add to the model the variable that results in the lowest RSS.
    + Look at the p-values and remove any variable whose p-value rises above a certain threshold.
    + Repeat until some stopping rule is satisfied.
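
A minimal sketch of forward selection on simulated data. The stopping rule used here (stop after a fixed number of variables, `max_vars`) is an assumption chosen for simplicity, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.normal(size=(n, p))

# Hypothetical truth: only predictors 0 and 3 matter
y = 1 + 4 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=n)

def rss(columns):
    """RSS of the least squares fit using an intercept plus the given columns."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta_hat) ** 2)

def forward_selection(max_vars):
    selected = []                                  # start with the null model
    while len(selected) < max_vars:                # illustrative stopping rule
        candidates = [j for j in range(p) if j not in selected]
        # Add the candidate that gives the lowest RSS
        best = min(candidates, key=lambda j: rss(selected + [j]))
        selected.append(best)
    return selected

order = forward_selection(max_vars=2)
print(order)
```

Backward and mixed selection follow the same loop structure, except that the removal step is driven by p-values rather than by the RSS.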


## Data

Some data: `prestige` dataset.

| **Variable** | **Meaning**                                                                                                                                                        |
|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `type`         | Type of occupation. A factor with the following levels: <br>`prof`, professional and managerial; `wc`, white-collar; `bc`, blue-collar.           |
| `income`      | Percentage of occupational incumbents in the 1950 US Census who earned USD 3,500 <br>or more per year (about USD 36,000 in 2017 US dollars).                             |
| `education`    | Percentage of occupational incumbents in 1950 who were high school graduates<br>(which, were we cynical, we would say is roughly equivalent to a Ph.D. in 2017) |
| `prestige`     | Percentage of respondents in a social survey who rated the occupation as “good” <br>or better in prestige                                                          |
| `profession`   | Name of the profession                                                                                                                                             |


## Questions

- Quick reminder of a few relevant questions:

    + Is there a relationship between `prestige` and `income`?
    + How strong is the relationship between `prestige` and `income`?
    + Which variables are associated with `prestige`?
    + How can we accurately predict the prestige of professions not studied in this survey?
    + Is the relationship linear?
    + Is there a synergy among predictors?
    
- These are relevant questions, and regression analysis can help us here.

## Loading Packages

In [None]:
## Loading Libraries and Modules

# scikit-learn: barebones, but fast and reliable
from sklearn.linear_model import LogisticRegression 
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

# statsmodels: pretty and good to use, great for interpretable ML
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Data processing
import pandas as pd
import numpy as np

# Plotting things:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Regression

**Check-in**: Explore the `duncan` dataset.

In [None]:
## Loading the data
duncan = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/Duncan.csv')
duncan = duncan.set_index('profession')

In [None]:
## Your code here

## Regression

- There are two packages in Python to run Regression:
    + `statsmodels`
    + `scikit-learn`

- Today we are going to use `statsmodels`.

## Bivariate Regression

This regression is in the form of 

$$ Y = \beta_0 + \beta_1X_1 + \varepsilon $$

The packages we need were already loaded above.

## Running a Bivariate Regression

- Intuitively, earning more `income` is probably a good predictor of `prestige`. We can check that!

In [None]:
# Scatterplot:
sns.scatterplot(x = 'income', y = 'prestige', data = duncan)
plt.show()

In [None]:
# Regplot:
sns.regplot(x = 'income', y = 'prestige', data = duncan)
plt.show()

In [None]:
## Now running the actual regression:

# Create and fit the model
model = ols('prestige ~ income', data = duncan).fit()

# Print the parameters
print(model.params)

Meaning:

$$ \text{prestige} \ \approx \ 2.46 + 1.08 \times \text{income} $$

## Regression

- Relevant Questions:
    + *Is there a relationship between `prestige` and `income`?*
        + Yes! But not sure yet if `statistically significant` or not...
    + *How strong is the relationship between `prestige` and `income`?*
        + When we increase the income by one unit (which is a percentage of people in the profession earning the equivalent of about 36k or more), we increase the prestige **on average** by 1.08 units.
        
- How about statistical significance? Let's test it!

In [None]:
print(model.summary())

## Regression

- Relevant Questions:
    + *Is there a relationship between `prestige` and `income`?*
        + Yes! It is statistically significant at a level lower than 0.001!
    + *How strong is the relationship between `prestige` and `income`?*
        + When we increase the income by one unit (which is a percentage of people in the profession earning the equivalent of about 36k or more), we increase the prestige **on average** by 1.08 units.
        
- How accurate is our overall model?
    + Let's check the R$^2$ and the Residual Standard Error (RSE)

In [None]:
# R-squared
rsq = model.rsquared
print(rsq) 

# Around 70% of the variation of prestige is explained by income

In [None]:
# Mean Squared Error
mse = model.mse_resid
print('The mean squared error: ' + str(mse))

# Residual Standard Error
rse = np.sqrt(mse)
print('The Residual Standard Error: ' + str(rse))

# The "typical" distance between the predicted and the observed values is 17.4 prestige points

## Diagnostics

- The fit seems to be good.

- But if you think about it, we still have many questions about our model.
    + Is it linear?
    + Can we do better than the bivariate model?
    + Are the standard errors well-defined?
    + And others...
    
- Here we are going to see how to diagnose some of these problems.

## Diagnostics

Several plots can help us diagnose the quality of our model.

**Warning**: Finding and analyzing these violations is **more of an art**.

- In any case, be mindful that a careful analysis is usually enough to ensure you have a `good` model.

- We are going to look at some of them, that make sense for the bivariate case. 

- Later we are going to look at the ones that make sense for the multivariate case.

### Non-linearity

When the relationship is non-linear, you could have done better using a different (more flexible) functional form.

The plot to detect this puts the residuals on the y-axis against the fitted values on the x-axis:

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig5.png?raw=true)

- Plot: Fitted Values x Raw Residuals

- The best: You should find no patterns.

- The ugly: A discernible pattern tells you that you could have done better with a more flexible model.

### Non-linearity

- Hint: Look at the smoothing trend line (the `lowess`). You should see no discernible trend.

For the `prestige` x `income` relationship:

In [None]:
# Residual x fitted values (linearity + heteroscedasticity)
# Use the fitted values and residuals stored in the fitted model
sns.residplot(x = model.fittedvalues, y = model.resid, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

### Non-linearity

- Let's `cook` a non-linear relation:

- I will cook the following:

$$ Y = 2 + 3 X - 2 X^2 + \varepsilon $$

- The relationship is obviously non-linear.

In [None]:
## Cooking
cooked_data = pd.DataFrame({
    'x': np.random.normal(0, 1, 100)
})
cooked_data['y'] = 2 + 3 * cooked_data['x'] - 2 * (cooked_data['x'] ** 2) + np.random.normal(0, 1, 100)
## Fitting
model2 = ols('y ~ x', data = cooked_data).fit()

## Checking
print(model2.summary())

In [None]:
# Residual x fitted values (linearity + heteroscedasticity)
# Use the fitted values and residuals stored in the fitted model
sns.residplot(x = model2.fittedvalues, y = model2.resid, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
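
The pattern in the residuals can also be checked numerically. The sketch below cooks the same relationship, $Y = 2 + 3X - 2X^2 + \varepsilon$, and compares a linear fit with a fit that adds the squared term; it uses `numpy` only, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)

# Cooking the same non-linear relationship as above
x = rng.normal(0, 1, 100)
y = 2 + 3 * x - 2 * x ** 2 + rng.normal(0, 1, 100)

# Linear fit: the residuals still carry the x^2 pattern
lin = np.polyfit(x, y, 1)
resid_lin = y - np.polyval(lin, x)

# Quadratic fit: the more flexible functional form absorbs the pattern
quad = np.polyfit(x, y, 2)
resid_quad = y - np.polyval(quad, x)

corr_lin = np.corrcoef(resid_lin, x ** 2)[0, 1]
corr_quad = np.corrcoef(resid_quad, x ** 2)[0, 1]
rss_lin, rss_quad = np.sum(resid_lin ** 2), np.sum(resid_quad ** 2)

print(corr_lin, corr_quad, rss_lin, rss_quad)
```

In `statsmodels` formula syntax, the squared term can be added with `ols('y ~ x + I(x**2)', data = cooked_data)`.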

### Heteroscedasticity

- It is fancy wording to say that the variance of the error term is not constant.

- It usually means that you are better at fitting some range of the predictors than others.

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig7.png?raw=true)

- Plot: Fitted Values x Raw Residuals

- The best: You should find no patterns.

- The ugly: A funnel-shaped figure tells you that you may have heteroscedasticity. It invalidates your standard errors.

### Heteroscedasticity

- Hint: Look at the "shape" of the data cloud. You should see no discernible "shape."

For the `prestige` x `income` relationship:

In [None]:
# Residual x fitted values (linearity + heteroscedasticity)
# Use the fitted values and residuals stored in the fitted model
sns.residplot(x = model.fittedvalues, y = model.resid, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
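
To see what a funnel looks like in numbers, we can cook heteroscedastic data and compare the spread of the residuals for low and high fitted values. This is a sketch under made-up assumptions: the error standard deviation is set to grow proportionally with `x`.

```python
import numpy as np

rng = np.random.default_rng(11)

# Cooked data where the error standard deviation grows with x (assumption)
x = rng.uniform(1, 10, 500)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)

# Fit the line and compute fitted values and residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Compare the residual spread below and above the median fitted value
low = fitted <= np.median(fitted)
sd_low, sd_high = resid[low].std(), resid[~low].std()

print(sd_low, sd_high)
```

A clearly larger spread on one side is the numerical counterpart of the funnel shape in the plot.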

### Outliers

- Outliers are observations whose actual values are very far from the values predicted by the model.

- Sometimes the value is correct, but frequently it tells you that you made a mistake in collecting the data!

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig8.png?raw=true)

- Plot: Fitted x Studentized residuals

- The best: You should find no extreme values in the plot.

- The ugly: An extreme value can affect your RSE and $R^2$, and mess up the p-values.

In [None]:
## More technical info about the model
summary_info = model.get_influence().summary_frame()
summary_info.head()

In [None]:
# Checking the Studentized Residuals
summary_info['fittedvalues'] = model.fittedvalues
sns.regplot(x = 'fittedvalues', y = 'student_resid', data = summary_info, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Studentized residuals")
plt.show()

In [None]:
# Largest positive Studentized Residuals
print(summary_info[['student_resid', 'fittedvalues']].sort_values('student_resid', ascending = False).head())

# Most negative Studentized Residuals
print(summary_info[['student_resid', 'fittedvalues']].sort_values('student_resid', ascending = False).tail())

In [None]:
# Checking the Absolute Value of the Studentized Residuals
summary_info['abs_student_resid'] = np.abs(summary_info['student_resid'])
sns.regplot(x = 'fittedvalues', y = 'abs_student_resid', data = summary_info, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Absolute Value of the Studentized Residuals")
plt.show()

In [None]:
# Checking the Absolute Value of the Studentized Residuals
summary_info.sort_values('abs_student_resid', ascending = False).head()

### High Leverage

- High-leverage observations have very unusual $x_i$ values that could potentially tilt the regression line towards them.

- **If high leverage and outlier, bad combination!**

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig9.png?raw=true)

- Plot: Leverage x Studentized residuals

- The best: You should find no extreme values in the plot.

- The ugly: An extreme value can affect your fit.

In [None]:
## Checking the Leverage
sns.scatterplot(x = 'hat_diag', y = 'student_resid', data = summary_info)
plt.xlabel("Leverage (called hat_diag there)")
plt.ylabel("Studentized residuals")
plt.show()

In [None]:
summary_info.sort_values('hat_diag', ascending = False).head()

In [None]:
## Checking the Leverage and Outliers: Cook's-d x Leverage
sns.scatterplot(x = 'hat_diag', y = 'cooks_d', data = summary_info)
plt.xlabel("Leverage (called hat_diag there)")
plt.ylabel("Cook's-d (how much removing the observation\nshifts the regression)")
plt.show()

In [None]:
# Checking the Cook's-d measure
summary_info.sort_values('cooks_d', ascending = False).head()
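
Leverage and Cook's distance can also be computed by hand with the standard textbook formulas: leverage is the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, and Cook's distance combines the residual with the leverage. The sketch below uses simulated data with one planted unusual observation; all values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(12)

# Cooked data with one planted problem point (assumption for illustration)
x = rng.normal(0, 1, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
x[0] = 6.0                       # unusual x value: high leverage
y[0] = 2 + 3 * x[0] + 8          # far from the true line: outlier as well
n = len(x)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])
k = X.shape[1]                   # number of estimated coefficients

# Leverage: diagonal of the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Residuals and estimated error variance
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = np.sum(resid ** 2) / (n - k)

# Cook's distance for each observation
cooks_d = resid ** 2 / (k * s2) * leverage / (1 - leverage) ** 2

print(leverage.argmax(), cooks_d.argmax())
```

The planted observation has both the highest leverage and the largest Cook's distance: the bad combination described above.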

## Multiple Linear Regression

- So far:
    + Is there a relationship between `prestige` and `income`? **Yes**
    + How strong is the relationship between `prestige` and `income`? **Fairly strong ($R^2 \approx 0.70$)**
    + Which variables are associated with `prestige`?
    + How can we accurately predict the prestige of professions not studied in this survey? **Yes, so far...**
    + Is the relationship linear? **Yes, so far...**
    + Is there a synergy among predictors?
    
- Can we do better? **Yes**, we have other predictors that we did not explore.

## Multiple Linear Regression

Let's fit the following model:

$$ \text{prestige} = \beta_0 + \beta_1\text{income} + \beta_2\text{education} + \varepsilon $$

In [None]:
## Running the actual regression:

# Create and fit the model
model3 = ols('prestige ~ income + education', data = duncan).fit()

# Print the parameters
print(model3.params)

Meaning:

$$ \text{prestige} \ \approx \ -6.06 + 0.60\text{income} + 0.55\text{education} $$

## F-Statistic

Are we doing better than the bivariate regression? We can test that!

**Null hypothesis:** The coefficients of the added variables are all zero, so the model with fewer parameters is enough.

**Alternative hypothesis:** At least one of the added variables has a non-zero coefficient.

In [None]:
## ANOVA: model without education x model with education
anova_lm(model, model3)

## RSE and R$^2$

We can also look at the Residual Standard Error and the R$^2$ to determine this:

In [None]:
# Model with only income
mse = model.mse_resid
print('The mean squared error: ' + str(mse))

# Residual Standard Error
rse = np.sqrt(mse)
print('The Residual Standard Error: ' + str(rse))

# R-squared
rsq = model.rsquared
print(rsq)

In [None]:
# Model with income and education
mse = model3.mse_resid
print('The mean squared error: ' + str(mse))

# Residual Standard Error
rse = np.sqrt(mse)
print('The Residual Standard Error: ' + str(rse))

# R-squared
rsq = model3.rsquared
print(rsq)

## Diagnostics

Besides the diagnostics that we ran before, we can check something called *multicollinearity*.

### Multicollinearity

- Multicollinearity is a situation in which your predictors are highly correlated.

- In extreme cases, it messes up the computations in your model.

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig10.png?raw=true)

In [None]:
## Pairplot to check
sns.pairplot(duncan[['prestige', 'income', 'education']])
plt.show()

### Multicollinearity

- One measure of multicollinearity is the *Variance Inflation Factor*.
    + How much the multicollinearity inflates the variance of the estimates.
    
- It is fairly easy to compute. As a rule-of-thumb, we would like to see values lower than 5.

- It is rarely a problem, though... Especially with large datasets.

In [None]:
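## VIF
# statsmodels does not add the intercept for us, so we add it before computing the VIFs
from statsmodels.tools.tools import add_constant
variables = add_constant(duncan[['income', 'education']])
# Skip column 0 (the constant) and report one VIF per predictor
vif = [variance_inflation_factor(variables.values, i) for i in range(1, variables.shape[1])]
vif 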
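
The VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data (not the Duncan dataset). The helper `vif` and the variables `x1`, `x2`, `x3` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
x1 = z + rng.normal(scale=0.5, size=n)   # x1 and x2 share the component z,
x2 = z + rng.normal(scale=0.5, size=n)   # so they are correlated
x3 = rng.normal(size=n)                  # x3 is unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

The two correlated predictors should show clearly higher VIFs than the unrelated one, whose VIF stays close to one.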

## Deciding on important variables

- Several criteria can be used. We will discuss later their trade-offs.

- But we have a couple of automated ways to select them that are easier to implement:

1. **Forward selection**:
    + Start with the null model and fit $p$ simple regressions, one for each predictor. 
    + Add to the model the variable that results in the lowest RSS.
    + Repeat until some stopping rule is satisfied.
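
A minimal sketch of forward selection, assuming simulated data and a made-up stopping rule (stop after two variables). In practice the stopping rule would be based on a criterion of your choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only columns 0 and 3 matter

def rss(cols):
    """RSS of an OLS fit of y on an intercept plus the chosen columns."""
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

selected, remaining = [], list(range(p))
max_vars = 2                                   # hypothetical stopping rule
while remaining and len(selected) < max_vars:
    # Try adding each remaining predictor and keep the one with the lowest RSS
    best = min(remaining, key=lambda c: rss(selected + [c]))
    selected.append(best)
    remaining.remove(best)

print(selected)
```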


#### Deciding on important variables

2. **Backward selection**:
    + Start with the full model, with all $p$ predictors. 
    + Remove the variable with the largest p-value (the least statistically significant one).
    + Fit the new model with p-1 variables.
    + Repeat until some stopping rule is satisfied.

#### Deciding on important variables

3. **Mixed selection**:
    + Start with the null model and fit $p$ simple regressions, one for each predictor.
    + Add to the model the variable that results in the lowest RSS.
    + Look at the p-values and remove any variable whose p-value rises above a certain threshold.
    + Repeat until some stopping rule is satisfied.


## Application

- So far:
    + Is there a relationship between `prestige` and `income`? **Yes**
    + How strong is the relationship between `prestige` and `income`? **Look at the R$^2$ and the RSE**
    + Which variables are associated with `prestige`? **income, education, others?**
    + How can we accurately predict the prestige of professions not studied in this survey? **Plug their values into the fitted model**
    + Is the relationship linear? **It seems so**
    + Is there a synergy among predictors? **Good question!**

## Application

**Check-in**: Chattopadhyay and Duflo ran this study:

[Raghabendra Chattopadhyay and Esther Duflo. 2004. "*Women as Policy Makers: Evidence from a Randomized Policy Experiment in India.*" **Econometrica**, 72 (5): 1409–43.](https://economics.mit.edu/files/792)

They claim that women implemented different policies than men.

| **Variable** | **Description**                                                                                  |
|--------------|--------------------------------------------------------------------------------------------------|
| village      | village identifier ("Gram Panchayat number _ village number")                                    |
| female       | whether village was assigned a female politician: 1=yes, 0=no                                    |
| water        | number of new (or repaired) drinking water facilities in the village <br>since random assignment |
| irrigation   | number of new (or repaired) irrigation facilities in the village<br>since random assignment      |

1. Explore the dataset.
1. Fit regression models for `irrigation` and `water`, using `female` as the predictor.
    + Hint for diagnostics: use plotly with `village` as hover info.
1. Run all the diagnostics. Anything odd in there? If yes, rerun removing the *oddity*.

In [None]:
# Dataset
india = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30Dpublic/main/datasets/india.csv')

# Your answers in here

# Classification

## Classification

- Linear regression is great! But it assumes we want to predict a continuous target variable.

- But there are situations in which our response variable is qualitative.

**Examples:**

- Will a country default on its debt obligations?

- Did a person vote Republican, Democrat, Independent, for a different party, or not turn out to vote?

- What determines the number of FOI requests that a given public office receives every day?

- Is a country expected to meet, exceed, or fall short of its Nationally Determined Contributions under the Paris Agreement?

All these questions involve qualitative or discrete outcomes.

## Example

- In 1988, the Chilean dictator Augusto Pinochet held a referendum on whether he should stay in power.

- FLACSO in Chile conducted a survey of 2,700 respondents.

- We are going to build a model to predict their voting intentions.

## Data

| **Variable** | **Meaning** |
|:---:|---|
| region | A factor with levels:<br>- `C`, Central; <br>- `M`, Metropolitan Santiago area; <br>- `N`, North; <br>- `S`, South; <br>- `SA`, city of Santiago. |
| population | The population size of respondent's community. |
| sex | A factor with levels: <br>- `F`, female; <br>- `M`, male. |
| age | The respondent's age in years. |
| education | A factor with levels: <br>- `P`, Primary; <br>- `S`, Secondary; <br>- `PS`, Post-secondary. |
| income | The respondent's monthly income, in Pesos. |
| statusquo | A scale of support for the status-quo. |
| vote | A factor with levels: <br>- `A`, will abstain; <br>- `N`, will vote no (against Pinochet);<br>- `U`, is undecided; <br>- `Y`, will vote yes (for Pinochet). |

In [None]:
## Loading the data
chile = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/chilesurvey.csv')

# Drop missing values and keep only the decided voters (Y or N)
chile_clean = chile.dropna()
chile_clean = chile_clean[chile_clean['vote'].isin(['Y', 'N'])].copy()

# Recode the vote as binary: 1 = yes (for Pinochet), 0 = no
chile_clean['vote'] = np.where(chile_clean['vote'] == 'Y', 1, 0)

# Log-transform income and population
chile_clean['logincome'] = np.log(chile_clean['income'])
chile_clean['logpop'] = np.log(chile_clean['population'])
chile_clean.head()

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- If you want to **measure a treatment effect**, or any other fitting where **explanation trumps prediction**, go with the linear regression.
    + Easy to explain to a lay audience.
    + Good polynomial expansion around the ATE.
    + Needs a careful design (in Causal Inference, the design is more important than the statistical method!).
    + Interaction terms are just partial derivatives of the fitted equation.

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- If you want to **predict outcomes**, go with a classification model appropriate for your target variable unit.
    + You are not going to make `weird` predictions (e.g., probabilities below zero or above one).
    + You have a marginal efficiency gain (in terms of Standard Errors).
    + If you have an ordered target variable, your model looks more meaningful.
    + Need to be careful about interaction terms (has to do with taking derivatives of link function in Generalized Linear Models).

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- Be **careful when you have discrete nominal variation in your target variable**:
    + Binary outcome: classifying with Linear Regression is equivalent to Linear Discriminant Analysis.
    + Three or more categories, like `vote` in the Chilean dataset: coding them as numbers imposes an arbitrary order and spacing, which badly distorts the results.

## Book's Example

Chance of Default on Credit Card Debt by Account Balance:

![linear x logistic regression ISLR book](https://github.com/umbertomig/POLI175public/blob/main/img/linvslogit.png?raw=true)

## Logistic Regression

Logistic Regression belongs to a class of models called [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (or GLM for short).

- A GLM, in a nutshell (and in a proudly lazy definition), is an extension of the Linear Model that assumes:
    + A linear relationship in part of the model (the linear predictor).
    + But then connects the linear predictor to the expected value of the response variable through a non-linear transformation.

- The non-linear transformation is called the `link function`. There are many link functions around (check [here](https://en.wikipedia.org/wiki/Generalized_linear_model) for various link functions).

- The link function is going to determine which type of model we run.

- When the outcome variable is binary, we may use the `Logit` or `Probit` links.

## Logistic Regression

In a regression, we are investigating something along the lines of:

$$ \mathbb{E}[Y | X] \ = \ \beta_0 + \beta_1 X $$

But when the outcome is binary we would like to get:

$$ \mathbb{E}[Y | X] \ = \ \mathbb{P}(Y = 1 | X) $$

And the logistic model is nothing but:

$$ \mathbb{P}(Y = 1 | X) \ = \ \dfrac{e^{(\beta_0 + \beta_1X)}}{1 + e^{(\beta_0 + \beta_1X)}} $$

## Logistic Regression

With a bit of manipulation, we get to something called the odds:

$$ \dfrac{\mathbb{P}(Y = 1 | X)}{\mathbb{P}(Y = 0 | X)} \ = \ \dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)} \ = \ e^{(\beta_0 + \beta_1X)} $$

And taking the log gets rid of the exponential:

$$ \log \left( \dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)}\right) \ = \ \beta_0 + \beta_1X $$

And this is the Logit Link.

## Logistic Regression

A little detour to talk about odds:

- Note the odds: $\dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)}$

- It is the chance of $Y = 1$ divided by the chance of $Y = 0$.

- Since probabilities are between zero and one, the odds are always in $(0, \infty)$.

Example:

- If, based on some characteristics, two in every ten people vote for Pinochet, $\mathbb{P}(Y = 1 | X = \text{some characs.}) = 0.2$ and the odds are $0.2 / 0.8 = 1/4$.

- If, based on another set of characteristics, nine out of ten people vote for Pinochet, $\mathbb{P}(Y = 1 | X = \text{some other characs.}) = 0.9$ and the odds are $0.9 / 0.1 = 9$.

- Odds equal to one mean that $Y = 1$ and $Y = 0$ are equally likely. The ratio between two odds is called the *odds ratio*, and an odds ratio of one means that the odds do not change.


## Logistic Regression

Another little detour to talk about the coefficients:

- In linear regression, a one-unit change in $x_i$ changes your target variable by $\beta_i$ units, on average.

- In logistic regression, a one-unit change in $x_i$ changes **the log odds** of your target variable by $\beta_i$ units.

- That is, it multiplies the odds by $e^{\beta_i}$! In terms of probabilities, this is **not** a straight line!

- Easy proxy (does not work for interaction terms): 
    + When $\beta_1$ is **positive**, it **increases** the $\mathbb{P}(Y = 1 | X)$
    + When $\beta_1$ is **negative**, it **decreases** the $\mathbb{P}(Y = 1 | X)$
    
- Try to compute the partial derivatives on $X$ and you will see the complications!

## Logistic Regression

Technical:

1. The estimation is through [maximizing the likelihood function](https://en.wikipedia.org/wiki/Likelihood_function).
    + This is outside the scope of the course, but an interesting topic to learn in an advanced course.


2. The hypothesis test for a coefficient's significance here is a Z-test (based on the Normal distribution).
    + Null Hypothesis: $H_0: \ \beta_i = 0$ or alternatively $H_0: \ e^{\beta_i} = 1$.


3. Making predictions:
    + Just insert the estimated $\hat{\beta}$s into the equation.
    
$$ \hat{p}(X) \ = \ \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} $$
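
For the curious, here is a rough sketch of what maximizing the likelihood can look like. It uses Newton-Raphson on simulated data. This is one common way to do it, not necessarily what `statsmodels` or `scikit-learn` run internally, and `true_beta` is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
true_beta = np.array([-0.5, 1.2])            # made-up values for the simulation
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)

beta = np.zeros(2)                            # starting values
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))           # current fitted probabilities
    gradient = X.T @ (y - p)                  # derivative of the log-likelihood
    hessian = -(X * (p * (1 - p))[:, None]).T @ X
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                        # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

# Prediction: plug the estimates into the logistic equation, here at x = 1
p_hat = 1 / (1 + np.exp(-(beta[0] + beta[1] * 1.0)))
print(beta, p_hat)
```

At the maximum, the gradient of the log-likelihood is (numerically) zero, and the estimates should land near the values used to simulate the data.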

## Logistic Regression

First, let's fit a Linear Regression:

In [None]:
sns.regplot(x = 'logincome', y = 'vote', x_jitter = 0.1, y_jitter = 0.1, data = chile_clean)
plt.show()

## Logistic Regression

In [None]:
# Linear Model
modlin = ols('vote ~ logincome', data = chile_clean).fit()
modlin.summary()

## Logistic Regression

Now, let us fit a Logistic Regression:

In [None]:
## Seaborn plot
sns.regplot(x = 'logincome', y = 'vote', 
            x_jitter = 0.1, y_jitter = 0.1, 
            data = chile_clean, logistic = True)
plt.show()

## Logistic Regression

In [None]:
# Logistic Regression
modlogit = logit('vote ~ logincome', data = chile_clean).fit()
modlogit.summary()

## Logistic Regression

In [None]:
# Logistic Regression
modlogit2 = logit('vote ~ logincome + logpop + region + age + education', data = chile_clean).fit()
modlogit2.summary()

## Logistic Regression

- Let's look at the exponentiated parameters, which tell us by how much each variable multiplies the odds:

In [None]:
## Parameters
np.exp(modlogit2.params)

## Logistic Regression

- And as the proportional change in the odds for a one-unit increase, $e^{\beta} - 1$:

In [None]:
## Parameters
np.exp(modlogit2.params)-1

## Logistic Regression

- Now with Scikit Learn:

In [None]:
# Target variable
y = chile_clean['vote']

# Predictors
X = chile_clean[['logincome', 'logpop', 'age']]

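# Loading the model
# (scikit-learn applies an L2 penalty by default, so the coefficients
# can differ a bit from the statsmodels estimates)
logreg = LogisticRegression() 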

# Fitting the model
logreg.fit(X, y)

# Getting parameters
print(logreg.intercept_, logreg.coef_)

## Logistic Regression

Where are the categorical variables?

In Scikit Learn, you need to create dummy variables for the categorical vars. 

Thus, you should do:

In [None]:
## Detour: Creating Dummies for Male
dummies = pd.get_dummies(chile_clean['sex'], prefix = 'sex', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean, dummies], axis=1)
chile_clean_wdumvars.head()

## Logistic Regression

**Your turn:** Create dummies for `region` and `education`. Which category was dropped in each of the processes?

In [None]:
## Your code here

## Creating dummies

In [None]:
## Dummies

# Sex
dummies = pd.get_dummies(chile_clean['sex'], prefix = 'sex', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean, dummies], axis=1)

# Region
dummies = pd.get_dummies(chile_clean['region'], prefix = 'region', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean_wdumvars, dummies], axis=1)

# Education
dummies = pd.get_dummies(chile_clean['education'], prefix = 'education', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean_wdumvars, dummies], axis=1)

## Head
chile_clean_wdumvars.head()

# You can even drop the original variables, if you want to: 
# DataFrame.drop(labels = ['v1', 'v2', ..., 'vn'], axis = 1)

## Logistic Regression

- Now with Scikit Learn, and using all the categorical variables:

In [None]:
# Target variable
y = chile_clean_wdumvars['vote']

# Predictors
X = chile_clean_wdumvars[['logincome', 'logpop', 'age', 
                          'sex_M', 
                          'region_M', 'region_N', 'region_S', 'region_SA', 
                          'education_PS', 'education_S']]

# Loading the model
logreg =  LogisticRegression(solver = 'newton-cg') 

# Fitting the model
logreg.fit(X, y)

## Logistic Regression

In [None]:
# Getting parameters
print('Original coefficients: ')
print(logreg.intercept_, logreg.coef_)

print('\n\n')

# Exps:
print('Exponentiated coefficients: ')
print(np.exp(logreg.intercept_), np.exp(logreg.coef_))

# Generative Models of Classification

## Generative Models of Classification

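Logistic regression involves modeling the probability of a response given a set of predictors:

- It uses the logistic function for the *conditional distribution*:
    
$$ \mathbb{E}(Y | X = x) \ = \ \mathbb{P}(Y = 1 | X = x) \ = \ \text{Logit}^{-1}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p) $$

Another approach is to model the distribution of the predictors $X$ separately for each value of $Y$.

And then, use Bayes' Theorem to get the conditional distribution of $Y$ given $X$.

But why?

1. Separation: when the classes are well separated, the logistic regression estimates become unstable.

2. Small sample size: if the predictors are approximately Normal within each class, the generative approach can be more accurate.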

## Generative Models of Classification

Let $\pi_k$ be the prior probability of $Y = k$.

And let $f_k(x) = \mathbb{P}(X = x | Y = k)$ be the density function for an observation that comes from the $k$-th class.

Bayes' theorem says that:

$$ \mathbb{P}(Y = k | X = x) \ = \ \dfrac{\pi_kf_k(x)}{\sum_l \pi_l f_l(x)} $$

Now, estimating $\pi_k$ is easy: we just compute the fraction that belongs to the $k$-th class.

How about $f$?

+ Different estimators are going to give us different classifiers!
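
To make the formula concrete, here is a sketch with two classes. The priors and the Gaussian densities below are made-up numbers, chosen only to illustrate how the posterior is computed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical two-class setup: priors and class-specific Gaussian densities
priors = {0: 0.7, 1: 0.3}
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}   # (mean, standard deviation) per class

def posterior(x):
    """P(Y = k | X = x) for every class k, via Bayes' theorem."""
    joint = {k: priors[k] * normal_pdf(x, *params[k]) for k in priors}
    total = sum(joint.values())            # the denominator: sum over all classes
    return {k: joint[k] / total for k in joint}

for x in [0.0, 1.0, 2.0, 3.0]:
    print(x, posterior(x))
```

Swapping `normal_pdf` for a different density estimator would give a different classifier with the same Bayes step.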

## Generative Models of Classification

### 1. Linear Discriminant Analysis

- Suppose we have only one variable $x$ and $f_k$ is Gaussian:

$$ x \sim N(\mu_k, \sigma_k^2) $$

- And assume further that all classes have the same variance: $\sigma_k^2 = \sigma^2 \ \forall k$

- Taking the log of the posterior, and dropping the terms that do not depend on $k$, gives us:

$$ \delta_k(x) \ = \ x \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log(\pi_k) $$

## Generative Models of Classification

### 1. Linear Discriminant Analysis

And the decision about which class $x$ belongs to is simple: **Whichever class has the highest $\delta_k(x)$ (and hence the highest posterior probability) is the "winner"**.

1. Take an observation $x$

2. Compute $\delta_0(x)$

3. Compute $\delta_1(x)$

4. The highest is the winner :-)

## Generative Models of Classification

### 1. Linear Discriminant Analysis

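But what does the decision boundary look like? We need to find the *indifference point*:

$$ \delta_1(x) = \delta_0(x) $$

Do the algebra, and you are going to find:

$$ x \ = \ \dfrac{\mu_0 + \mu_1}{2} + \dfrac{\sigma^2}{\mu_1 - \mu_0}\log\left(\dfrac{\pi_0}{\pi_1}\right) $$

When both classes are equally likely a priori ($\pi_0 = \pi_1$), the last term vanishes and the boundary is just the midpoint between the two means:

$$ x \ = \ \dfrac{\mu_0 + \mu_1}{2} $$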



## Generative Models of Classification

### 1. Linear Discriminant Analysis

![img lda](https://github.com/umbertomig/POLI175public/blob/main/img/ldabounds.png?raw=true)

## Generative Models of Classification

### 1. Linear Discriminant Analysis

And LDA approximates the quantities of interest by doing the following:

1. $$ \widehat{\mu}_k  \ = \ \dfrac{1}{n_k} \sum_{i:y_i = k}x_i $$


2. $$ \widehat{\sigma}^2 \ = \ \dfrac{1}{n - K} \sum_{k=1}^K\sum_{i:y_i = k}(x_i - \widehat{\mu}_k)^2 $$


3. $$ \widehat{\pi}_k \ = \ \dfrac{n_k}{n} $$

Note that you can classify more than two categories.

## Generative Models of Classification

### 1. Linear Discriminant Analysis

The discriminant score of $x$ for class $y=k$ is going to be:

$$ \widehat{\delta}_k(x) \ = \ x \dfrac{\widehat{\mu}_k}{\widehat{\sigma}^2} - \dfrac{\widehat{\mu}_k^2}{2\widehat{\sigma}^2} + \log(\widehat{\pi}_k) $$

Note that this is a linear function of $x$, hence the name `Linear Discriminant Analysis`!
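
The estimates and the discriminant scores can be computed by hand in a few lines. The sketch below uses simulated one-dimensional data with two classes, and the helper `delta` is a made-up name. With unequal priors, solving $\delta_1(x) = \delta_0(x)$ shifts the boundary away from the midpoint of the means by $\frac{\sigma^2}{\mu_1 - \mu_0}\log(\pi_0 / \pi_1)$, which the code also checks.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated one-dimensional data: two classes with a common standard deviation
x0 = rng.normal(loc=0.0, scale=1.0, size=300)   # class 0
x1 = rng.normal(loc=2.0, scale=1.0, size=100)   # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(100, dtype=int)])

classes = np.unique(y)
n, K = len(x), len(classes)

mu_hat = np.array([x[y == k].mean() for k in classes])
pi_hat = np.array([(y == k).mean() for k in classes])
# Pooled variance: within-class sums of squares divided by n - K
sigma2_hat = sum(((x[y == k] - mu_hat[i]) ** 2).sum()
                 for i, k in enumerate(classes)) / (n - K)

def delta(x_new):
    """Estimated discriminant scores: one row per observation, one column per class."""
    x_new = np.atleast_1d(x_new)[:, None]
    return x_new * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

y_pred = delta(x).argmax(axis=1)       # the class with the highest score wins
boundary = ((mu_hat[0] + mu_hat[1]) / 2
            + sigma2_hat / (mu_hat[1] - mu_hat[0]) * np.log(pi_hat[0] / pi_hat[1]))
print(boundary, (y_pred == y).mean())
```

Because class 0 is three times more common here, the boundary sits closer to the mean of class 1 than the simple midpoint would.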

## Generative Models of Classification

### 1. Linear Discriminant Analysis

Now let's fit it using `scikit learn`

In [None]:
# Predictors and target
X, y = chile_clean[['logincome', 'age']], chile_clean['vote']

# Create the LDA model (do not mix this up with Latent Dirichlet Allocation!)
ldan = LinearDiscriminantAnalysis()

# Fitting model
ldan.fit(X, y)

# Plotting the decision boundaries
disp = DecisionBoundaryDisplay.from_estimator(ldan, X, response_method="predict",
                                              alpha=0.5, cmap=plt.cm.coolwarm)

# Plotting the data points    
disp.ax_.scatter(x = chile_clean['logincome'], y = chile_clean['age'], 
                 c = y, alpha = 0.5,
                 cmap = plt.cm.coolwarm)

plt.show()

## Generative Models of Classification

### 1. Linear Discriminant Analysis

The most fundamental question:

- How many classification errors are we making?

- To learn that, we need to study the `confusion matrix`!

### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

1. **Accuracy:** $$\dfrac{\text{correct predictions}}{\text{total observations}} \ = \ \dfrac{tp + tn}{tp + tn + fp + fn}$$

- High accuracy: lots of correct predictions!

### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

2. **Precision:** $$\dfrac{\text{true positives}}{\text{total predicted positive}} \ = \ \dfrac{tp}{tp + fp}$$

- High precision: low false-positive rates.


### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

3. **Recall:** $$\dfrac{\text{true positives}}{\text{total actual positive}} \ = \ \dfrac{tp}{tp + fn}$$

- High recall: low false-negative rates.


### Measuring Performance

4. **F1-Score**:

$$ \text{F1} \ = \ 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$
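
All four measures follow directly from the counts in the confusion matrix. A tiny worked example, with made-up labels:

```python
# Hypothetical actual and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

Here there are 3 true positives, 4 true negatives, 1 false positive, and 2 false negatives, so the accuracy is 0.7, the precision is 0.75, and the recall is 0.6.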

- Let's look at the two models: logistic x LDA:

In [None]:
# Target variable
y = chile_clean_wdumvars['vote']

# Predictors
X = chile_clean_wdumvars[['logincome', 'logpop', 'age', 
                          'sex_M', 
                          'region_M', 'region_N', 'region_S', 'region_SA', 
                          'education_PS', 'education_S']]

# Loading the model
logreg =  LogisticRegression(solver = 'newton-cg')
ldan = LinearDiscriminantAnalysis()

# Fitting the models
logreg.fit(X, y)
ldan.fit(X, y)

### Measuring Performance

- Let's look at the two models: logistic x LDA:

In [None]:
# Predictions
y_pred_logreg = logreg.predict(X)
y_pred_ldan = ldan.predict(X)

# Logistic Regression
print(confusion_matrix(y, y_pred_logreg))

# Linear Discriminant Analysis
print(confusion_matrix(y, y_pred_ldan))

In [None]:
# Logistic Classification Report
print(classification_report(y, y_pred_logreg))

In [None]:
# LDA Classification Report
print(classification_report(y, y_pred_ldan))

## Generative Models of Classification

### 2. Quadratic Discriminant Analysis

The main difference is that it assumes that every class has its own covariance matrix:

- Drop the `same-sigma-assumption`.

### 3. Naïve Bayes

Instead of assuming that $f_k$ belongs to a family of multivariate distributions (e.g., Multivariate Normal), it assumes that, within each class, the predictors are independent:

- Drop the `Multivariate-Normal-assumption`.

- For $p$ predictors, you only make assumptions about the one-dimensional density $f_{kj}$ of each predictor $x_j$ in class $k$:

$$ f_k(x) \ = \ f_{k1}(x_1)\times \cdots \times f_{kp}(x_p) $$

- In Gaussian Naïve Bayes, you assume a normal distribution (Gaussian shape) for each variable...
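
A rough sketch of the Gaussian Naïve Bayes computation by hand, on simulated data with two predictors. It works with logs, so the product of densities becomes a sum. The function name `log_posterior_scores` is made up, and this is not meant to reproduce scikit-learn's `GaussianNB` exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated data: two predictors, two classes
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 2.0], size=(150, 2))
X1 = rng.normal(loc=[2.0, 3.0], scale=[1.5, 1.0], size=(150, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 150)

classes = np.unique(y)
priors = np.array([(y == k).mean() for k in classes])
means = np.array([X[y == k].mean(axis=0) for k in classes])        # shape (K, p)
sds = np.array([X[y == k].std(axis=0, ddof=1) for k in classes])   # shape (K, p)

def log_posterior_scores(X_new):
    """log(pi_k) + sum_j log f_kj(x_j): one row per observation, one column per class."""
    z = (X_new[:, None, :] - means) / sds                           # shape (m, K, p)
    log_dens = -0.5 * z ** 2 - np.log(sds) - 0.5 * np.log(2 * np.pi)
    return np.log(priors) + log_dens.sum(axis=2)

scores = log_posterior_scores(X)
y_pred = classes[scores.argmax(axis=1)]     # the class with the highest score wins
print((y_pred == y).mean())
```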

## Generative Models of Classification


![img](https://github.com/umbertomig/POLI175public/blob/main/img/ldaxqdaxnb.png?raw=true)

- Purple: Naïve Bayes; Black: LDA; Green: QDA.

In [None]:
## QDA
qdan = QuadraticDiscriminantAnalysis()
qdan.fit(X, y)

## Gaussian Naive Bayes
nbays = GaussianNB()
nbays.fit(X, y)


# Predictions
y_pred_logreg = logreg.predict(X)
y_pred_ldan = ldan.predict(X)
y_pred_qdab = qdan.predict(X)
y_pred_nbays = nbays.predict(X)

In [None]:
# Logistic Regression
print(classification_report(y, y_pred_logreg))

In [None]:
# Linear Discriminant Analysis
print(classification_report(y, y_pred_ldan))

In [None]:
# Quadratic Discriminant Analysis
print(classification_report(y, y_pred_qdab))

In [None]:
# Gaussian Naive Bayes
print(classification_report(y, y_pred_nbays))

## Logistic x Generative Models for Classification

**Check-in**: Does social pressure affect turnout?

Gerber, Green, and Larimer (2008) studied this question in their ["*Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.*" **American Political Science Review**, 102 (1): 33-48.](http://www.donaldgreen.com/wp-content/uploads/2015/09/Gerber_Green_Larimer-APSR-2008.pdf).

They randomly selected households in Michigan to receive a letter containing the following information:

> Dear Registered Voter: **WHAT IF YOUR NEIGHBORS KNEW WHETHER YOU VOTED?** ... We’re sending this mailing to you and your neighbors to publicize who does and does not vote. The chart shows the names of some of your neighbors, showing which have voted in the past. After the August 8 election, we intend to mail an updated chart. You and your neighbors will all know who voted and who did not. **DO YOUR CIVIC DUTY--VOTE!**

| MAPLE DR                 | Aug 2004    | Nov 2004 | Aug 2006 |
|--------------------------|-------------|----------|----------|
| 9995 JOSEPH JAMES SMITH  | Voted       | Voted    | ???      |
| 9995 JENNIFER KAY SMITH  | Didn't vote | Voted    | ???      |
| 9997 RICHARD B JACKSON   | Didn't vote | Voted    | ???      |
| 9999 KATHY MARIE JACKSON | Didn't vote | Voted    | ???      |

The treatment assignment is called `pressure`. If no pressure, then the voter received no letter. We want to study whether `pressure` affected `voted`. 

Fit all the models we have learned so far on this dataset (Note: The linear model is the most appropriate, since the data come from a randomized experiment, but please fit all).

In [None]:
voting = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30Dpublic/main/datasets/voting.csv')

# Your answers here

# Questions?

# See you in the next class!