# CSS 201 / 202 - CSS Bootcamp

## Week 06 - Lecture 02

### Umberto Mignozzetti

# Regression

# Regression

Recap Regression:

Theory:
- Minimizes mean squared error (or the residual sum of squares)
- Can have as many variables as we want (not really...)
- Good for answering questions about relationships (existence and strength)
- Can capture synergy (interactions) among predictors
- Not very flexible
- You need to check the consistency of your model (diagnostics: `model.get_influence()`)

Estimation:
- `statsmodels` does a good job. Tutorial [here](https://www.statsmodels.org/dev/examples/index.html).

In [None]:
## Loading Libraries and Modules

# scikit-learn: barebones, but fast and reliable
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score, get_scorer_names
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score, KFold
from sklearn.inspection import DecisionBoundaryDisplay
#from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# statsmodels: pretty and good to use, great for interpretable ML
from statsmodels.formula.api import ols, logit
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.regressionplots import plot_partregress_grid, influence_plot

# Data processing
import pandas as pd
import numpy as np

# Plotting things:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Regression

Datasets:

- `duncan` dataset.
- `education` expenditure by US state dataset

In [None]:
## Loading the data
duncan = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/Duncan.csv')
duncan = duncan.set_index('profession')
educexp = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/educexp.csv')

## Multiple Linear Regression

- So far:
    + Is there a relationship between `prestige` and `income`? **Yes**
    + How strong is the relationship between `prestige` and `income`? **Yes**
    + Which variables are associated with `prestige`?
    + How can we accurately predict the prestige of professions not studied in this survey? **Yes, so far...**
    + Is the relationship linear? **Yes, so far...**
    + Is there a synergy among predictors?
    
- Can we do better? **Yes**, we have other predictors that we did not explore.

## Multiple Linear Regression

Let's fit the following model:

$$ \text{prestige} = \beta_0 + \beta_1\text{income} + \beta_2\text{education} + \varepsilon $$

In [None]:
## Running the actual regression:

# Create and fit the models
model = ols('prestige ~ income', data = duncan).fit()
model3 = ols('prestige ~ income + education', data = duncan).fit()

# Print the parameters
print(model3.params)

Meaning:

$$ \text{prestige} \ \approx \ -6.06 + 0.60\text{income} + 0.55\text{education} $$
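
To see how the fitted equation is used, here is a minimal sketch that plugs values into it by hand. The coefficients are the rounded ones from the equation above; the income and education values (50 and 60) are invented for illustration and do not refer to any profession in the dataset.

```python
# Rounded coefficients taken from the fitted equation above
b0, b_income, b_education = -6.06, 0.60, 0.55

def predict_prestige(income, education):
    """Predicted prestige from the fitted linear equation."""
    return b0 + b_income * income + b_education * education

# Hypothetical profession (values invented for illustration)
pred = predict_prestige(income=50, education=60)
print(round(pred, 2))
```

With the fitted `statsmodels` object, `model3.predict(...)` on a DataFrame with `income` and `education` columns should do the same job without rounding the coefficients.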

## Multiple Linear Regression

Partial regression plots:

In [None]:
fig = plot_partregress_grid(model3)

## Multiple Linear Regression

Influence plot:

In [None]:
fig = influence_plot(model3)

## F-Statistic

Are we doing better than the simple linear regression? We can test that!

**Null hypothesis:** The coefficients on the added variables are all zero (the model with fewer parameters is good enough).

**Alternative hypothesis:** At least one of the added variables has a nonzero coefficient.

In [None]:
## Anova for model without x model with education
anova_lm(model, model3)
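
The F statistic reported by `anova_lm` compares the residual sums of squares of the two nested models. A minimal sketch of that computation, assuming the usual nested-model formula; the function name and the numbers are made up for illustration and are not the Duncan results:

```python
def f_statistic(rss_small, rss_big, df_resid_small, df_resid_big):
    """F statistic for comparing two nested linear models.

    rss_*: residual sums of squares.
    df_resid_*: residual degrees of freedom (n minus number of coefficients).
    """
    extra_params = df_resid_small - df_resid_big
    numerator = (rss_small - rss_big) / extra_params   # gain per added parameter
    denominator = rss_big / df_resid_big               # noise level of the bigger model
    return numerator / denominator

# Made-up numbers for illustration
f_stat = f_statistic(rss_small=100.0, rss_big=60.0,
                     df_resid_small=43, df_resid_big=42)
print(f_stat)
```

A large F means the drop in residual sum of squares is big relative to the remaining noise, which is evidence against the smaller model.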

## RSE and R$^2$

We can also look at the Residual Standard Error and the R$^2$ to determine this:

In [None]:
# Model with only income
mse = model.mse_resid
print('The mean squared error: ' + str(mse))

# Residual Standard Error
rse = np.sqrt(mse)
print('The Residual Standard Error: ' + str(rse))

# R-squared
rsq = model.rsquared
print(rsq)

## RSE and R$^2$

We can also look at the Residual Standard Error and the R$^2$ to determine this:

In [None]:
# Model with income and education
mse = model3.mse_resid
print('The mean squared error: ' + str(mse))

# Residual Standard Error
rse = np.sqrt(mse)
print('The Residual Standard Error: ' + str(rse))

# R-squared
rsq = model3.rsquared
print(rsq)
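
For reference, both quantities can be written directly in terms of sums of squares. The sketch below assumes a model with an intercept and `p` predictors; the helper names and the numbers are hypothetical.

```python
import math

def residual_standard_error(rss, n, p):
    """RSE = sqrt(RSS / (n - p - 1)) for p predictors plus an intercept."""
    return math.sqrt(rss / (n - p - 1))

def r_squared(rss, tss):
    """R^2 = 1 - RSS / TSS: share of the variance explained by the model."""
    return 1 - rss / tss

# Made-up sums of squares for illustration
rse_toy = residual_standard_error(rss=8.0, n=10, p=1)
rsq_toy = r_squared(rss=8.0, tss=32.0)
print(rse_toy, rsq_toy)
```

A lower RSE and a higher R$^2$ both point to a better in-sample fit, but R$^2$ never decreases when a predictor is added, so it should not be the only criterion.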

## Diagnostics

Besides the diagnostics that we ran before, we can check for something called *multicollinearity*.

### Multicollinearity

- Multicollinearity is a situation in which your predictors are highly correlated with each other.

- In extreme cases, it messes up the computations in your model.

![reg](https://github.com/umbertomig/POLI175public/blob/main/img/fig10.png?raw=true)

In [None]:
## Pairplot to check
sns.pairplot(duncan[['prestige', 'income', 'education']])
plt.show()

### Multicollinearity

- One measure of multicollinearity is the *Variance Inflation Factor*.
    + It measures how much multicollinearity inflates the variance of the coefficient estimates.
    
- It is fairly easy to compute. As a rule of thumb, we would like to see values lower than 5.

- It is rarely a problem, though... Especially with large datasets.

In [None]:
## VIF (add a constant column so each auxiliary regression has an intercept)
variables = duncan[['income', 'education']].assign(const = 1.0)
vif = [variance_inflation_factor(variables.values, i) for i in range(2)]
vif
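
With only two predictors, the VIF (computed with an intercept) reduces to $1 / (1 - r^2)$, where $r$ is the correlation between the two predictors. A minimal sketch with made-up data; the helper `correlation` and the toy lists are illustrative only:

```python
def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Toy predictors (invented for illustration)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = correlation(x1, x2)
vif_toy = 1 / (1 - r ** 2)
print(r, vif_toy)
```

As $r$ approaches one, the denominator approaches zero and the VIF explodes, which is the "extreme case" where the computations break down.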

## Multiple Regression Models

**Check-in:** Run a multiple regression model for the education expenditure dataset.

In [None]:
## Your code here
educexp.head(2)

## Adding dummy variables to the mix

In [None]:
duncan.head(2)

We should add `type` to the model. But how do we do that, since it is a categorical (string) variable?

We need to **create dummies**!

## Adding dummy variables to the mix

In [None]:
pd.get_dummies(duncan.type).sample(5)

And we can add to the dataset:

In [None]:
dummies = pd.get_dummies(duncan.type, prefix = 'type', drop_first = True)
duncan = pd.concat([duncan, dummies], axis=1)

In [None]:
duncan.sample(3)

## Regression with dummies

**Check-in:** Add dummies to the mix and estimate the models.

In [None]:
## Your code here

## Diagnostics

**Check-in**: Run the diagnostics for the regression you just ran.

In [None]:
# Your code here

## Application

- So far:
    + Is there a relationship between `prestige` and `income`? **Yes**
    + How strong is the relationship between `prestige` and `income`? **Yes**
    + Which variables are associated with `prestige`? **income, education, others?**
    + How can we accurately predict the prestige of professions not studied in this survey? **Yes**
    + Is the relationship linear? **It seems so**
    + Is there a synergy among predictors? **Good question!**

## Regression with interactions

Check for interactions!

In [None]:
model4 = ols('prestige ~ income * education', data = duncan).fit()
model4.summary()

# Classification

## Classification

- Linear regression is great! But it assumes we want to predict a continuous target variable.

- But there are situations when our response variable is qualitative.

**Examples:**

- Does a country default on its debt obligations?

- Did a person vote Republican, Democrat, or Independent, vote for a different party, or not turn out to vote?

- What determines the number of FOI requests that a given public office receives every day?

- Is a country expected to meet, exceed, or not meet its Nationally Determined Contributions under the Paris Agreement?

All these questions are qualitative in nature.

## Example

- In 1988, the Chilean dictator Augusto Pinochet held a referendum on whether he should stay in power.

- FLACSO in Chile conducted a survey of 2,700 respondents.

- We are going to build a model to predict their voting intentions.

## Data

| **Variable** | **Meaning** |
|:---:|---|
| region | A factor with levels:<br>- `C`, Central; <br>- `M`, Metropolitan Santiago area; <br>- `N`, North; <br>- `S`, South; <br>- `SA`, city of Santiago. |
| population | The population size of respondent's community. |
| sex | A factor with levels: <br>- `F`, female; <br>- `M`, male. |
| age | The respondent's age in years. |
| education | A factor with levels: <br>- `P`, Primary; <br>- `S`, Secondary; <br>- `PS`, Post-secondary. |
| income | The respondent's monthly income, in Pesos. |
| statusquo | A scale of support for the status-quo. |
| vote | A factor with levels: <br>- `A`, will abstain; <br>- `N`, will vote no (against Pinochet);<br>- `U`, is undecided; <br>- `Y`, will vote yes (for Pinochet). |

In [None]:
## Loading the data
chile = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/chilesurvey.csv')
chile.head()
chile_clean = chile.dropna()
chile_clean = chile_clean[chile_clean['vote'].isin(['Y', 'N'])].copy()  # copy avoids SettingWithCopyWarning below
chile_clean['vote'] = np.where(chile_clean['vote'] == 'Y', 1, 0)
chile_clean['logincome'] = np.log(chile_clean['income'])
chile_clean['logpop'] = np.log(chile_clean['population'])
chile_clean.head()

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- If you want to **measure a treatment effect**, or any other fitting where **explanation trumps prediction**, go with the linear regression.
    + Easy to explain to a lay audience.
    + Good polynomial expansion around the ATE.
    + Needs a careful design (in Causal Inference, the design is more important than the statistical method!).
    + Interaction terms are just partial derivatives of the fitted equation.

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- If you want to **predict outcomes**, go with a classification model appropriate for your target variable unit.
    + You are not going to make `weird` predictions (e.g., probabilities below zero or above one).
    + You have a marginal efficiency gain (in terms of Standard Errors).
    + If you have an ordered target variable, your model looks more meaningful.
    + You need to be careful about interaction terms (this has to do with taking derivatives of the link function in Generalized Linear Models).

## Why not run a Linear Regression?

You could ask this very valid question. And my answer here differs a bit from the book.

**My suggestion:**

- Be **careful when you have discrete nominal variation in your target variable**:
    + Binary outcome: Linear Regression and Linear Discriminant Analysis are the same.
    + Three or more categories, like `vote` in the Chilean dataset: linear regression breaks down badly, because coding the categories as numbers imposes an arbitrary order and spacing on them.

## Book's Example

Chance of Default on Credit Card Debt by Account Balance:

![linear x logistic regression IRLR book](https://github.com/umbertomig/POLI175public/blob/main/img/linvslogit.png?raw=true)

## Logistic Regression

Logistic Regression belongs to a class of models called [Generalized Linear Models](https://en.wikipedia.org/wiki/Generalized_linear_model) (or GLM for short).

- A GLM, in a nutshell (and in a proudly lazy definition), is an expansion of the Linear Model that assumes:
    + A linear relationship in part of the model (the linear predictor),
    + But then connects that linear part to the expected value of the response variable through a non-linear transformation.

- The non-linear transformation is called the `link function`. There are many link functions around (check [here](https://en.wikipedia.org/wiki/Generalized_linear_model) for various link functions).

- The link function is going to determine which types of models we run.

- When the outcome variable is binary, we may use the `Logistic` or `Probit` links.

## Logistic Regression

In a regression, we are investigating something along the lines of:

$$ \mathbb{E}[Y | X] \ = \ \beta_0 + \beta_1 X $$

But when the outcome is binary we would like to get:

$$ \mathbb{E}[Y | X] \ = \ \mathbb{P}(Y = 1 | X) $$

And the Logistic link is nothing but:

$$ \mathbb{P}(Y = 1 | X) \ = \ \dfrac{e^{(\beta_0 + \beta_1X)}}{1 + e^{(\beta_0 + \beta_1X)}} $$

## Logistic Regression

With a bit of manipulation, we get to something called the odds:

$$ \dfrac{\mathbb{P}(Y = 1 | X)}{\mathbb{P}(Y = 0 | X)} \ = \ \dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)} \ = \ e^{(\beta_0 + \beta_1X)} $$

And taking logs on both sides gets rid of the exponential:

$$ \log \left( \dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)}\right) \ = \ \beta_0 + \beta_1X $$

And this is the Logit Link.

## Logistic Regression

Little detour to talk about odds:

- Note the odds: $\dfrac{\mathbb{P}(Y = 1 | X)}{1 - \mathbb{P}(Y = 1 | X)}$

- It is the chance of $Y = 1$ divided by the chance of $Y = 0$.

- Since probabilities are between zero and one, the odds always lie in $(0, \infty)$.

Example:

- If, based on some characteristics, two in every ten people vote for Pinochet, $\mathbb{P}(Y = 1 | X = \text{some characs.}) = 0.2$ and the odds are $0.2 / 0.8 = 1/4$.

- If, based on another set of characteristics, nine out of ten people vote for Pinochet, $\mathbb{P}(Y = 1 | X = \text{some other characs.}) = 0.9$ and the odds are $0.9 / 0.1 = 9$.

- Odds of one mean that $Y = 1$ and $Y = 0$ are equally likely. Similarly, an odds ratio (the ratio of two odds) equal to one means that the odds do not change.
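
The two examples above can be checked with a couple of lines. This is a minimal sketch; the function names are hypothetical helpers, not part of any library.

```python
import math

def odds(p):
    """Convert a probability into p / (1 - p)."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert p / (1 - p) back into the probability p."""
    return o / (1 + o)

odds_a = odds(0.2)   # two in every ten people vote for Pinochet
odds_b = odds(0.9)   # nine in every ten people vote for Pinochet
print(odds_a, odds_b)

# A probability of one half sits exactly at the neutral value of one
print(odds(0.5))
```

Going back and forth between the two scales is useful because logistic regression coefficients live on the (log) odds scale, while we usually want to report probabilities.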


## Logistic Regression

Another little detour to talk about the coefficients:

- In linear regression, a one-unit change in $x_i$ changes your target variable by $\beta_i$ units, on average.

- In logistic regression, a one-unit change in $x_i$ changes **the log odds** of your target variable by $\beta_i$ units, on average.

- Equivalently, it multiplies the odds by $e^{\beta_i}$! This is **not** a straight line!

- Easy proxy (does not work for interaction terms): 
    + When $\beta_i$ is **positive**, it **increases** the $\mathbb{P}(Y = 1 | X)$
    + When $\beta_i$ is **negative**, it **decreases** the $\mathbb{P}(Y = 1 | X)$
    
- Try to compute the partial derivatives on $X$ and you will see the complications!
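
A small numerical sketch makes the point: a one-unit increase in $x$ always multiplies the odds by the same factor, but the change in probability depends on where you start. The coefficients below are hypothetical, chosen only for illustration.

```python
import math

def logistic(z):
    """Inverse of the logit link: maps the linear predictor to a probability."""
    return 1 / (1 + math.exp(-z))

beta0, beta1 = -1.0, 0.5   # hypothetical coefficients

def prob_at(x):
    return logistic(beta0 + beta1 * x)

def odds_at(x):
    p = prob_at(x)
    return p / (1 - p)

multipliers, prob_changes = [], []
for x in (0, 2, 4):
    multipliers.append(odds_at(x + 1) / odds_at(x))   # constant: exp(beta1)
    prob_changes.append(prob_at(x + 1) - prob_at(x))  # varies with x

print(multipliers)
print(prob_changes)
```

This is why exponentiated coefficients are convenient to report, while effects on the probability scale have to be evaluated at specific values of the predictors.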

## Logistic Regression

Technical:

1. The estimation is through [maximizing the likelihood function](https://en.wikipedia.org/wiki/Likelihood_function).
    + This is outside the scope of the course, but an interesting topic to learn in an advanced course.


2. The hypothesis test for each coefficient's significance here is a Z-test (based on the Normal distribution).
    + Null Hypothesis: $H_0: \ \beta_i = 0$ or alternatively $H_0: \ e^{\beta_i} = 1$.


3. Making predictions:
    + Just plug the estimated $\hat{\beta}$s into the equation.
    
$$ \hat{p}(X) \ = \ \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} $$

## Logistic Regression

First, let's fit a Linear Regression:

In [None]:
sns.regplot(x = 'logincome', y = 'vote', x_jitter = 0.1, y_jitter = 0.1, data = chile_clean)
plt.show()

## Logistic Regression

In [None]:
# Linear Model
modlin = ols('vote ~ logincome', data = chile_clean).fit()
modlin.summary()

## Logistic Regression

Now, let us fit a Logistic Regression:

In [None]:
## Seaborn plot
sns.regplot(x = 'logincome', y = 'vote', 
            x_jitter = 0.1, y_jitter = 0.1, 
            data = chile_clean, logistic = True)
plt.show()

## Logistic Regression

In [None]:
# Logistic Regression
modlogit = logit('vote ~ logincome', data = chile_clean).fit()
modlogit.summary()

## Logistic Regression

In [None]:
# Logistic Regression
modlogit2 = logit('vote ~ logincome + logpop + region + age + education', data = chile_clean).fit()
modlogit2.summary()

## Logistic Regression

- Let's look at the parameters:

In [None]:
## Exponentiated coefficients: multiplicative change in the odds per one-unit increase
np.exp(modlogit2.params)

## Logistic Regression

- Let's look at the parameters:

In [None]:
## Proportional change in the odds per one-unit increase (multiply by 100 for percent)
np.exp(modlogit2.params)-1

## Logistic Regression

- Now with Scikit Learn:

In [None]:
# Target variable
y = chile_clean['vote']

# Predictors
X = chile_clean[['logincome', 'logpop', 'age']]

# Loading the model (note: scikit-learn applies L2 regularization by default)
logreg =  LogisticRegression() 

# Fitting the model
logreg.fit(X, y)

# Getting parameters
print(logreg.intercept_, logreg.coef_)

## Logistic Regression

Where are the categorical variables?

In Scikit Learn, you need to create dummy variables for the categorical vars. 

Thus, you should do:

In [None]:
## Detour: Creating Dummies for Male
dummies = pd.get_dummies(chile_clean['sex'], prefix = 'sex', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean, dummies], axis=1)
chile_clean_wdumvars.head()

## Logistic Regression

**Your turn:** Create dummies for `region` and `education`. Which category was dropped in each of the processes?

In [None]:
## Your code here

## Creating dummies

In [None]:
## Dummies

# Sex
dummies = pd.get_dummies(chile_clean['sex'], prefix = 'sex', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean, dummies], axis=1)

# Region
dummies = pd.get_dummies(chile_clean['region'], prefix = 'region', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean_wdumvars, dummies], axis=1)

# Education
dummies = pd.get_dummies(chile_clean['education'], prefix = 'education', drop_first = True)
chile_clean_wdumvars = pd.concat([chile_clean_wdumvars, dummies], axis=1)

## Head
chile_clean_wdumvars.head()

# You can even drop the original variables, if you want to: 
# DataFrame.drop(labels = ['v1', 'v2', ..., 'vn'], axis = 1)

## Logistic Regression

- Now with Scikit Learn, and using all the categorical variables:

In [None]:
# Target variable
y = chile_clean_wdumvars['vote']

# Predictors
X = chile_clean_wdumvars[['logincome', 'logpop', 'age', 
                          'sex_M', 
                          'region_M', 'region_N', 'region_S', 'region_SA', 
                          'education_PS', 'education_S']]

# Loading the model
logreg =  LogisticRegression(solver = 'newton-cg') 

# Fitting the model
logreg.fit(X, y)

## Logistic Regression

In [None]:
# Getting parameters
print('Original coefficients: ')
print(logreg.intercept_, logreg.coef_)

print('\n\n')

# Exps:
print('Exponentiated coefficients: ')
print(np.exp(logreg.intercept_), np.exp(logreg.coef_))

# Generative Models of Classification

## Generative Models of Classification

Logistic regression models the probability of a response given a set of predictors:

- It uses the logistic link for the *conditional distribution*:
    
$$ \mathbb{E}[Y | X = x] \ = \ \mathbb{P}(Y = 1 | X = x) \ = \ \dfrac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} $$

Another approach is to model the distribution of the predictors $X$ separately for each value of $Y$.

Then we use Bayes' theorem to flip these around and get the conditional distribution of $Y$ given $X$.

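But why?

1. Separation: when the classes are well separated, the logistic regression estimates become unstable.

2. Small sample size: when $n$ is small and the predictors are approximately Normal within each class, the generative approach can be more accurate.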

## Generative Models of Classification

Let $\pi_k$ be the prior probability of $Y = k$.

And let $f_k(x) = \mathbb{P}(X = x | Y = k)$ be the density function for an observation that comes from the $k$-th class.

Bayes' theorem says that:

$$ \mathbb{P}(Y = k | X = x) \ = \ \dfrac{\pi_kf_k(x)}{\sum_l \pi_l f_l(x)} $$

Now, estimating $\pi_k$ is easy: we just compute the fraction of observations that belong to the $k$-th class.

How about $f$?

+ Different estimators are going to give us different classifiers!
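
To make Bayes' theorem concrete, here is a minimal sketch that computes the posterior class probabilities for one predictor, assuming Gaussian densities with a common standard deviation. The priors, means, and standard deviation are invented for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, priors, mus, sigma):
    """P(Y = k | X = x) for every class k, via Bayes' theorem."""
    weighted = [pi_k * normal_pdf(x, mu_k, sigma) for pi_k, mu_k in zip(priors, mus)]
    total = sum(weighted)                      # the denominator: sum_l pi_l f_l(x)
    return [w / total for w in weighted]

# Hypothetical two-class problem
priors, mus, sigma = [0.5, 0.5], [0.0, 2.0], 1.0

post_mid = posterior(1.0, priors, mus, sigma)    # halfway between the two means
post_right = posterior(3.0, priors, mus, sigma)  # much closer to the class-1 mean
print(post_mid, post_right)
```

Swapping `normal_pdf` for another density estimate changes the classifier, which is exactly the point of the last bullet.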

## Generative Models of Classification

### 1. Linear Discriminant Analysis

- Suppose we have only one variable $x$ and $f_k$ is Gaussian:

$$ x \sim N(\mu_k, \sigma_k^2) $$

- And assuming further that the draws have the same variance: $\sigma^2 = \sigma_k^2 \forall k$

- Taking the log of the posterior and dropping the terms that do not depend on $k$ gives us the discriminant function:

$$ \delta_k(x) \ = \ x \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log(\pi_k) $$

## Generative Models of Classification

### 1. Linear Discriminant Analysis

And the decision about which class $x$ belongs to is simple: **whichever class has the highest posterior probability (equivalently, the highest $\delta_k(x)$) is the "winner"**.

1. Take an observation $x$

2. Compute $\delta_0(x)$

3. Compute $\delta_1(x)$

4. The highest is the winner :-)

## Generative Models of Classification

### 1. Linear Discriminant Analysis

But what does the decision boundary look like? We need to find the *indifference point*:

$$ \delta_1(x) = \delta_0(x) $$

Do the algebra and, when the two classes have equal priors ($\pi_0 = \pi_1$), you are going to find:

$$ x \ = \ \dfrac{\mu_0 + \mu_1}{2} $$
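
A quick numerical check of the indifference point, as a sketch under made-up parameter values. The general formula with unequal priors used in `boundary` is an addition here, obtained by solving $\delta_1(x) = \delta_0(x)$ for $x$; with equal priors it collapses to the midpoint between the two means.

```python
import math

def delta(x, mu, sigma2, prior):
    """Linear discriminant score for one class (one predictor)."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + math.log(prior)

def boundary(mu0, mu1, sigma2, prior0, prior1):
    """Point where delta_0(x) = delta_1(x)."""
    return (mu0 + mu1) / 2 + sigma2 * math.log(prior0 / prior1) / (mu1 - mu0)

mu0, mu1, sigma2 = 0.0, 2.0, 1.0   # hypothetical class means and common variance

x_equal = boundary(mu0, mu1, sigma2, 0.5, 0.5)     # equal priors: the midpoint
x_unequal = boundary(mu0, mu1, sigma2, 0.8, 0.2)   # class 0 more common a priori
print(x_equal, x_unequal)
```

When class 0 is more common, the boundary moves toward the class-1 mean: we need stronger evidence before predicting the rarer class.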



## Generative Models of Classification

### 1. Linear Discriminant Analysis

![img lda](https://github.com/umbertomig/POLI175public/blob/main/img/ldabounds.png?raw=true)

## Generative Models of Classification

### 1. Linear Discriminant Analysis

LDA estimates the quantities of interest as follows:

1. $$ \widehat{\mu}_k  \ = \ \dfrac{1}{n_k} \sum_{i:y_i = k}x_i $$


2. $$ \widehat{\sigma}^2 \ = \ \dfrac{1}{n - K} \sum_{k=1}^K\sum_{i:y_i = k}(x_i - \widehat{\mu}_k)^2 $$


3. $$ \widehat{\pi}_k \ = \ \dfrac{n_k}{n} $$

Note that you can classify more than two categories.
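
The three estimators are simple enough to compute by hand. A minimal sketch on a tiny made-up sample with one predictor and two classes (the data and variable names are illustrative only):

```python
# Toy data: one predictor and a binary class label (invented for illustration)
x_toy = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
y_toy = [0, 0, 0, 1, 1, 1]

classes = sorted(set(y_toy))
n, K = len(x_toy), len(classes)

mu_hat, pi_hat, pooled_ss = {}, {}, 0.0
for k in classes:
    x_k = [xi for xi, yi in zip(x_toy, y_toy) if yi == k]
    mu_hat[k] = sum(x_k) / len(x_k)                          # class mean
    pi_hat[k] = len(x_k) / n                                 # class share (prior)
    pooled_ss += sum((xi - mu_hat[k]) ** 2 for xi in x_k)    # within-class squares

sigma2_hat = pooled_ss / (n - K)                             # pooled variance

print(mu_hat, pi_hat, sigma2_hat)
```

The same loop works for any number of classes, since nothing in it depends on $K = 2$.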

## Generative Models of Classification

### 1. Linear Discriminant Analysis

The estimated discriminant score for $y = k$ (the higher the score, the more likely it is that $x$ belongs to class $k$) is going to be:

$$ \widehat{\delta}_k(x) \ = \ x \dfrac{\widehat{\mu}_k}{\widehat{\sigma}^2} - \dfrac{\widehat{\mu}_k^2}{2\widehat{\sigma}^2} + \log(\widehat{\pi}_k) $$

Note that this is a linear function of $x$, hence the name `Linear Discriminant Analysis`!

## Generative Models of Classification

### 1. Linear Discriminant Analysis

Now let's fit it using `scikit-learn`:

In [None]:
# Start a LDA (do not mix this up with Latent Dirichlet Allocation!)
X, y = chile_clean[['logincome', 'age']], chile_clean['vote']

# Create the model
ldan = LinearDiscriminantAnalysis()

# Fitting model
ldan.fit(X, y)

# Plotting the decision boundaries
fig = DecisionBoundaryDisplay.from_estimator(ldan, X, response_method="predict",
                                             alpha=0.5, cmap=plt.cm.coolwarm)

# Plotting the data points    
fig.ax_.scatter(x = chile_clean['logincome'], y = chile_clean['age'], 
                c = y, alpha = 0.5,
                cmap = plt.cm.coolwarm)

plt.show()

## Generative Models of Classification

### 1. Linear Discriminant Analysis

The most fundamental question:

- How much classification error are we making?

- To learn that, we need to study the `confusion matrix`!

### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

1. **Accuracy:** $$\dfrac{\text{correct predictions}}{\text{total observations}} \ = \ \dfrac{tp + tn}{tp + tn + fp + fn}$$

- High accuracy: lots of correct predictions!

### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

2. **Precision:** $$\dfrac{\text{true positives}}{\text{total predicted positive}} \ = \ \dfrac{tp}{tp + fp}$$

- High precision: low false-positive rates.


### Measuring Performance

**Confusion Matrix**:

|  | **Predicted: 0** | **Predicted: 1** |
|---|---|---|
| **Actual: 0** | True Negative | False Positive |
| **Actual: 1** | False Negative | True Positive |

3. **Recall:** $$\dfrac{\text{true positives}}{\text{total actual positive}} \ = \ \dfrac{tp}{tp + fn}$$

- High recall: low false-negative rates.


### Measuring Performance

4. **F1-Score**:

$$ \text{F1} \ = \ 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$
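
All four metrics follow directly from the four cells of the confusion matrix. A minimal sketch with made-up counts (not results from the Chilean data):

```python
# Made-up confusion matrix counts for illustration
tn, fp, fn, tp = 45, 5, 10, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of the predicted positives, how many are right
recall = tp / (tp + fn)               # of the actual positives, how many we found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

For a binary problem, `confusion_matrix(y, y_pred).ravel()` in scikit-learn returns the counts in the same `tn, fp, fn, tp` order, so the lines above can be reused on real predictions.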

- Let's look at the two models, logistic regression vs. LDA:

In [None]:
# Target variable
y = chile_clean_wdumvars['vote']

# Predictors
X = chile_clean_wdumvars[['logincome', 'logpop', 'age', 
                          'sex_M', 
                          'region_M', 'region_N', 'region_S', 'region_SA', 
                          'education_PS', 'education_S']]

# Loading the model
logreg =  LogisticRegression(solver = 'newton-cg')
ldan = LinearDiscriminantAnalysis()

# Fitting the models
logreg.fit(X, y)
ldan.fit(X, y)

### Measuring Performance

- Let's look at the two models, logistic regression vs. LDA:

In [None]:
# Predictions
y_pred_logreg = logreg.predict(X)
y_pred_ldan = ldan.predict(X)

# Logistic Regression
print(confusion_matrix(y, y_pred_logreg))

# Linear Discriminant Analysis
print(confusion_matrix(y, y_pred_ldan))

In [None]:
# Logistic Classification Report
print(classification_report(y, y_pred_logreg))

In [None]:
# LDA Classification Report
print(classification_report(y, y_pred_ldan))

## Generative Models of Classification

### 2. Quadratic Discriminant Analysis

The main difference is that it assumes that every class has its own covariance matrix:

- Drop the `same-sigma-assumption`.

### 3. Naïve Bayes

Instead of assuming that $f_k$ belongs to a particular family of multivariate distributions (e.g., Multivariate Normal), it assumes that, within each class, the predictors are independent:

- Drop the `Multivariate-Normal-assumption`.

- For $p$ predictors, you only make assumptions about the one-dimensional density of each predictor $x_j$ within class $k$:

$$ f_k(x) \ = \ f_{k1}(x_1)\times \cdots \times f_{kp}(x_p) $$

- In Gaussian Naïve Bayes, you then assume a normal distribution (Gaussian shape) for each variable...
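
The product formula above translates almost word for word into code. A minimal sketch of the Gaussian Naïve Bayes density and the resulting posterior, with invented class means, standard deviations, and priors:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def naive_density(x, mus, sigmas):
    """f_k(x): product of one-dimensional normal densities, one per predictor."""
    return math.prod(normal_pdf(xj, m, s) for xj, m, s in zip(x, mus, sigmas))

# Hypothetical parameters for two classes and two predictors
params = {0: {'mus': (0.0, 0.0), 'sigmas': (1.0, 1.0), 'prior': 0.5},
          1: {'mus': (2.0, 2.0), 'sigmas': (1.0, 1.0), 'prior': 0.5}}

x_new = (2.0, 2.0)
weighted = {k: p['prior'] * naive_density(x_new, p['mus'], p['sigmas'])
            for k, p in params.items()}
post_class1 = weighted[1] / sum(weighted.values())
print(post_class1)
```

Because each predictor gets its own one-dimensional density, the number of parameters grows linearly in $p$, which is what makes the method attractive with many predictors and few observations.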

## Generative Models of Classification


![img](https://github.com/umbertomig/POLI175public/blob/main/img/ldaxqdaxnb.png?raw=true)

- Purple: Naïve Bayes; Black: LDA; Green: QDA.

In [None]:
## QDA
qdan = QuadraticDiscriminantAnalysis()
qdan.fit(X, y)

## Gaussian Naive Bayes
nbays = GaussianNB()
nbays.fit(X, y)


# Predictions
y_pred_logreg = logreg.predict(X)
y_pred_ldan = ldan.predict(X)
y_pred_qdab = qdan.predict(X)
y_pred_nbays = nbays.predict(X)

## Generative Models of Classification

In [None]:
# Logistic Regression
print(classification_report(y, y_pred_logreg))

## Generative Models of Classification

In [None]:
# Linear Discriminant Analysis
print(classification_report(y, y_pred_ldan))

## Generative Models of Classification

In [None]:
# Quadratic Discriminant Analysis
print(classification_report(y, y_pred_qdab))

## Generative Models of Classification

In [None]:
# Gaussian Naive Bayes
print(classification_report(y, y_pred_nbays))

## Logistic x Generative Models for Classification

**Check-in**: Does social pressure affect turnout?

Gerber, Green, and Larimer (2008) studied this question in their ["*Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.*" **American Political Science Review**, 102 (1): 33-48.](http://www.donaldgreen.com/wp-content/uploads/2015/09/Gerber_Green_Larimer-APSR-2008.pdf).

They randomly selected households in Michigan to receive a letter containing the following information:

> Dear Registered Voter: **WHAT IF YOUR NEIGHBORS KNEW WHETHER YOU VOTED?** ... We’re sending this mailing to you and your neighbors to publicize who does and does not vote. The chart shows the names of some of your neighbors, showing which have voted in the past. After the August 8 election, we intend to mail an updated chart. You and your neighbors will all know who voted and who did not. **DO YOUR CIVIC DUTY--VOTE!**

| MAPLE DR                 | Aug 2004    | Nov 2004 | Aug 2006 |
|--------------------------|-------------|----------|----------|
| 9995 JOSEPH JAMES SMITH  | Voted       | Voted    | ???      |
| 9995 JENNIFER KAY SMITH  | Didn't vote | Voted    | ???      |
| 9997 RICHARD B JACKSON   | Didn't vote | Voted    | ???      |
| 9999 KATHY MARIE JACKSON | Didn't vote | Voted    | ???      |

The treatment assignment is called `pressure`. If no pressure, then the voter received no letter. We want to study whether `pressure` affected `voted`. 

Fit all the models we have learned so far on this dataset (Note: The linear model is the most adequate, since the data comes from a randomized experiment, but please fit all).

In [None]:
voting = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30Dpublic/main/datasets/voting.csv')

# Your answers here

# Resampling

## Resampling

- Involves repeatedly drawing `samples` from a `training dataset` and refitting the model on each sample to obtain additional information about the fit.

- `Samples`: A randomly selected fraction of the original data.
    + Do not mistake it for a different sample from a population.
    
- `Training`: Training the model means fitting the model.

## Resampling

- This sounds weird: why not fit the model to the full data?
    + We would not have a measure of how well our model is doing.
    + In the end, this matters! And it matters especially for data on which we did not train the model!

- In this sense, resampling is a clever trick to see how the model would do in the `real world`, without going to the real world.

## Resampling

- It helps us to:
    + Evaluate the performance of the model (`Model assessment`).
    + Select the proper flexibility for our model (`Model selection`).

- Drawback: it is computationally intensive.
    + It usually involves refitting the model again and again.
    
- We are going to discuss the following:
    + `Cross-validation`: Measure the performance and select appropriate flexibility.
    + (not covered now) `Bootstrap`: Measure the accuracy of parameters.

In [None]:
## Loading Chile data
chile = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/chilesurvey.csv')
chile_clean = chile.dropna()
chile_clean = chile_clean[chile_clean['vote'].isin(['Y', 'N'])].copy()  # copy avoids SettingWithCopyWarning below
chile_clean['vote'] = np.where(chile_clean['vote'] == 'Y', 1, 0)
chile_clean['logincome'] = np.log(chile_clean['income'])
chile_clean['logpop'] = np.log(chile_clean['population'])
dummies = pd.get_dummies(chile_clean['sex'], prefix = 'sex', drop_first = True)
chile_clean = pd.concat([chile_clean, dummies], axis=1)
dummies = pd.get_dummies(chile_clean['region'], prefix = 'region', drop_first = True)
chile_clean = pd.concat([chile_clean, dummies], axis=1)
dummies = pd.get_dummies(chile_clean['education'], prefix = 'education', drop_first = True)
chile_clean = pd.concat([chile_clean, dummies], axis=1)
chile_clean.head()

In [None]:
## Education Expenditure Dataset
educ = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/educexp.csv')
educ = educ.set_index('states')
for i in educ.columns:
    educ[i + '_log'] = np.log(educ[i])
educ.head()

## Cross-Validation

- We talked about it yesterday.

- In that context, we looked at the idea of a
    + `training error rate` (the boring one): The error when fitting the model to data that was used to train the parameters, and
    + `test error rate` (the cool one): The error associated with fitting the model to ***unseen*** data.

## Cross-Validation

### Validation Set Approach

- Randomly divide the data into two sets:
    + `Training set`: The data used to fit the model
    + `Testing set`: The data used to test the performance of the fitted model.

## Cross-Validation

### Validation Set Approach

- Split the sample into half training and half testing, then run the estimation:

![img vsa](https://github.com/umbertomig/POLI175public/blob/main/img/cv1.png?raw=true)

In [None]:
## With 50% split (no urban_log)
y = educ['education_log']
X = educ[['income_log', 'young_log']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1234)

reg = LinearRegression().fit(X_train, y_train)

y_pred = reg.predict(X_test)

np.sum((y_pred - y_test) ** 2)

In [None]:
## With 50% split (with urban_log)
y = educ['education_log']
X = educ[['income_log', 'young_log', 'urban_log']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1234)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
np.sum((y_pred - y_test) ** 2)

In [None]:
## Your turn: Check the MSE when removing income_log. Is it
##  better?

In [None]:
## Your turn: Check the MSE when removing 'urban_log' 
##   with only 20% of observations in the testing set.

## Cross-Validation

### Leave-One-Out Cross-Validation

- It does what it says: leaves one observation out and fits the model with $n-1$ cases.

- Then, it predicts the outcome for the case left out.

- **Great** for small datasets and when prediction is critical.

- **Bad** in terms of computational time.

$$ CV_n \ = \ \dfrac{1}{n}\sum_i MSE_i $$
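
To see what happens under the hood, here is a minimal sketch of leave-one-out cross-validation written by hand for a regression with a single predictor. The helper functions and the toy data are invented for illustration; the course code relies on `cross_val_score` instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def loocv_mse(xs, ys):
    """Average squared prediction error, leaving one observation out at a time."""
    errors = []
    for i in range(len(xs)):
        x_train, y_train = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(x_train, y_train)   # fit with n - 1 cases
        pred = intercept + slope * xs[i]                # predict the case left out
        errors.append((ys[i] - pred) ** 2)              # MSE_i
    return sum(errors) / len(errors)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_linear = [2 * x + 1 for x in xs]        # perfectly linear: no prediction error
ys_noisy = [3.1, 4.8, 7.2, 8.9, 11.3]      # same trend with some noise

print(loocv_mse(xs, ys_linear), loocv_mse(xs, ys_noisy))
```

The loop refits the model $n$ times, which is where the computational cost comes from.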

## Cross-Validation

### Leave-One-Out Cross-Validation

![img](https://github.com/umbertomig/POLI175public/blob/main/img/cv2.png?raw=true)

In [None]:
## LOOCV
## Variables: model without urban population
y = educ['education_log']
X = educ[['income_log','young_log']]

## Leave-One-Out-CV
cv = LeaveOneOut()
reg = LinearRegression()

## Run the CV
scores = cross_val_score(reg, X, y,
                         scoring = 'neg_mean_squared_error',
                         cv = cv)

## RMSE
print(np.sqrt(np.mean(np.absolute(scores))))

## MSE
np.mean(np.absolute(scores))

In [None]:
## LOOCV
## Variables: model **with** urban population
y = educ['education_log']
X = educ[['income_log', 'young_log', 'urban_log']]

## Leave-One-Out-CV
cv = LeaveOneOut()
reg = LinearRegression()


## Run the CV
scores = cross_val_score(reg, X, y, 
                         scoring = 'neg_mean_squared_error',
                         cv = cv)

## MSE
print(np.mean(np.absolute(scores)))

## RMSE
np.sqrt(np.mean(np.absolute(scores)))

In [None]:
## Your turn: compare the models with and without logs
## Note: the target has to be the same!

## Cross-Validation

### Metrics

- To do the comparison, you need a metric.

- `scikit-learn` has many metrics available:

In [None]:
## Lots of stats to compute the error:
print(get_scorer_names())

In [None]:
## Your turn: find and use R-squared as the parameter for a
## LOOCV. What is the difference?

## Cross-Validation

### K-Fold Cross-Validation

- Splits the data into $k$ groups (folds). Each fold is left out once, and the model is fit on the other $k-1$ folds.

- Then, it predicts the outcomes for the cases in the fold left out.

- **Great** in most cases.

- **Bad**: *sometimes* computationally expensive, since it requires $k$ model fits.

$$ CV_k \ = \ \dfrac{1}{k}\sum_i MSE_i $$
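
To make the terms in this average explicit, one possible way to write the fold-level error is shown below. The notation is an assumption added for illustration: $F_i$ is the set of cases in fold $i$, $n_i$ is its size, and $\widehat{y}_j^{(-i)}$ is the prediction for case $j$ from the model fit without fold $i$.

$$ MSE_i \ = \ \dfrac{1}{n_i}\sum_{j \in F_i} \left(y_j - \widehat{y}_j^{(-i)}\right)^2, \qquad i = 1, \dots, k $$

Setting $k = n$ puts a single case in each fold, so $n_i = 1$ and $CV_k$ reduces to the LOOCV estimate $CV_n$.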

## Cross-Validation

### K-Fold Cross-Validation

![img](https://github.com/umbertomig/POLI175public/blob/main/img/cv3.png?raw=true)

In [None]:
## K-Fold CV (k = 5)
y = educ['education_log']
X = educ[['income_log', 'young_log']]

## k-Fold CV (n_splits = k, shuffle: reshuffle data before split)
cv = KFold(n_splits = 5, random_state = 1234, shuffle = True) 
reg = LinearRegression()


## Run the CV
scores = cross_val_score(reg, X, y,
                         scoring = 'neg_mean_squared_error',
                         cv = cv)

## MSE
print(np.mean(np.absolute(scores)))

## RMSE
np.sqrt(np.mean(np.absolute(scores)))

In [None]:
## K-Fold CV (k = 5)
y = educ['education_log']
X = educ[['income_log', 'young_log', 'urban_log']]

## k-Fold CV (n_splits = k, shuffle: reshuffle data before split)
cv = KFold(n_splits = 5, random_state = 1234, shuffle = True) 
reg = LinearRegression()


## Run the CV
scores = cross_val_score(reg, X, y,
                         scoring = 'neg_mean_squared_error',
                         cv = cv)

## MSE
print(np.mean(np.absolute(scores)))

## RMSE
np.sqrt(np.mean(np.absolute(scores)))

In [None]:
## Your turn: Run a 10-fold CV. Any differences?

## Cross-Validation

### Bias-Variance Trade-off

- k-Fold CV is more computationally efficient than LOOCV. But what about the Bias-Variance Trade-off?

- In the validation set approach, holding out a larger fraction leaves fewer observations for training, which leads to high bias: it tends to overestimate the test error rate.

- LOOCV: trains on $n-1$ observations, so it gives an approximately unbiased estimate of the test error rate:
    + Very good for bias reduction!

## Cross-Validation

### Bias-Variance Trade-off

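- LOOCV has high variance: the $n$ training sets are almost identical, so the fitted models are highly correlated, and an average of highly correlated errors varies more.
    + Very bad for variance.
    
- k-Fold CV:
    + The training sets overlap less, so each fit is a *bit more different* from the others.
    + Leads to less correlation between the fold-level error estimates.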
    + Good balance usually with $k=5$ or $k=10$.
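
The overlap argument above can be made concrete by counting shared training observations. The sketch below is an illustration only: the helper `mean_train_overlap` and the sample size `n = 50` are assumptions, and the code measures overlap between training sets, not the variance of the error estimate itself.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import KFold, LeaveOneOut

def mean_train_overlap(cv, n):
    """Average share of training cases that two CV fits have in common."""
    X_dummy = np.zeros((n, 1))  # only the number of rows matters here
    train_sets = [set(train) for train, _ in cv.split(X_dummy)]
    shares = [len(a & b) / len(a) for a, b in combinations(train_sets, 2)]
    return float(np.mean(shares))

n = 50
overlap_loocv = mean_train_overlap(LeaveOneOut(), n)       # 48 of 49 cases shared
overlap_5fold = mean_train_overlap(KFold(n_splits=5), n)   # 30 of 40 cases shared
print(overlap_loocv, overlap_5fold)
```

With $n = 50$, any two LOOCV fits share 48 of their 49 training cases, while any two 5-fold fits share 30 of their 40. The less the training sets overlap, the less correlated the fold-level errors tend to be.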

## Cross-Validation

### Bias-Variance Trade-off

![img](https://github.com/umbertomig/POLI175public/blob/main/img/cv4.png?raw=true)

## Cross-Validation

### CV on Classification Problems

- When we have a classification, we must change how we evaluate the error.

- With classification, the LOOCV would look like this:

$$ CV_n \ = \ \dfrac{1}{n} \sum_i I(y_i \neq \widehat{y}_i) $$

- The `accuracy` measure averages $I(y_i = \widehat{y}_i)$, so the error rate is one minus the accuracy.
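
As a quick check of the relation between accuracy and error rate, the sketch below computes the LOOCV misclassification rate in two ways. It is a minimal illustration on synthetic data from `make_classification` (the names `X_toy` and `y_toy` are assumptions, not the course data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score

# Synthetic two-class data (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=60, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=1234)
cv = LeaveOneOut()

# Route 1: one minus the mean accuracy reported by cross_val_score
acc = cross_val_score(LogisticRegression(), X_toy, y_toy,
                      scoring='accuracy', cv=cv)
error_from_accuracy = 1 - acc.mean()

# Route 2: average the indicator I(y_i != y_hat_i) over held-out predictions
y_hat = cross_val_predict(LogisticRegression(), X_toy, y_toy, cv=cv)
error_direct = np.mean(y_toy != y_hat)

print(error_from_accuracy, error_direct)
```

Both routes should print the same number, because each leave-one-out accuracy score is either 0 or 1 for the single held-out case.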

## Cross-Validation

### CV on Classification Problems

![img](https://github.com/umbertomig/POLI175public/blob/main/img/cv5.png?raw=true)

In [None]:
## LOOCV on a Logistic Regression
# Checking best polynomial for Age
poly = list(range(1, 6))
errmea = []
y = chile_clean['vote']
for p in poly:
    if p == 1:
        X = pd.DataFrame({
            'age_1': chile_clean['age']
        })
    else:
        X['age_' + str(p)] = X['age_1'] ** p
    cv = LeaveOneOut()
    logreg = LogisticRegression()
    scores = cross_val_score(logreg, X, y, 
                             scoring = 'accuracy',
                             cv = cv, n_jobs = -1)
    print('For polynomial order {a}, the Logistic Regression Error Rate is {b}.\n'.format(a = str(p), b = str(1-scores.mean())))
    errmea.append(1-scores.mean())

## Classification

### K-Nearest Neighbors Classifier

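- A short detour: K-Nearest Neighbors is a good (and very intuitive) classification algorithm.

- Given a positive integer $K$ and a test observation $x_0$, it estimates the probability of class $j$ as the share of the $K$ nearest training points (the set $N_0$) that belong to class $j$: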

$$ \mathbb{P}(Y = j| X = x_0) \ = \ \dfrac{1}{K}\sum_{i \in N_0} I(y_i = j) $$

- Meaning: classify the observation based on the classes of the $K$ closest observations:
    + The most frequent class is the winner.
    
- Closest: this requires a distance metric.
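
The rule above can be computed by hand for one test point. The sketch below is an illustration under stated assumptions: synthetic data (`X_toy`, `y_toy`), a made-up test point `x0`, Euclidean distance as the metric, and $K = 5$. It then compares the result with `KNeighborsClassifier`, whose defaults (Euclidean distance, uniform weights) match this setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data and one test point (assumptions for illustration)
rng = np.random.default_rng(1234)
X_toy = rng.normal(size=(40, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
x0 = np.array([0.25, -0.1])
K = 5

# Euclidean distance from x0 to every training point
dist = np.sqrt(((X_toy - x0) ** 2).sum(axis=1))

# N_0: indices of the K closest training points
N0 = np.argsort(dist)[:K]

# P(Y = j | X = x0): share of the K neighbors with class j
probs = np.array([np.mean(y_toy[N0] == j) for j in (0, 1)])
pred = int(np.argmax(probs))  # the most frequent class wins

# Same computation through scikit-learn
knn_toy = KNeighborsClassifier(n_neighbors=K).fit(X_toy, y_toy)
sk_probs = knn_toy.predict_proba(x0.reshape(1, -1))[0]
print(probs, sk_probs, pred)
```

Using an odd $K$ with two classes avoids ties in the vote. Changing the metric changes which points count as neighbors, so it can change the prediction.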

## Classification

### K-Nearest Neighbors Classifier

![img](https://github.com/umbertomig/POLI175public/blob/main/img/knn1.png?raw=true)

## Classification

### K-Nearest Neighbors Classifier

![img](https://github.com/umbertomig/POLI175public/blob/main/img/knn2.png?raw=true)

In [None]:
# KNN
X = chile_clean[['age', 'statusquo']]
y = chile_clean['vote']

# Create the model
knn = KNeighborsClassifier(n_neighbors = 10).fit(X, y)

# Plotting the KNN decision boundaries
fig = DecisionBoundaryDisplay.from_estimator(knn, X, response_method="predict",
                                             alpha=0.5, cmap=plt.cm.coolwarm)

# Plotting the data points    
fig.ax_.scatter(x = chile_clean['age'], y = chile_clean['statusquo'], 
                c = y, alpha = 0.5,
                cmap = plt.cm.coolwarm)

plt.show()

In [None]:
## Now choose K! (We will learn a better method for doing this: GridSearchCV!)
bigK = list(range(1, 100))
errmea = []
y = chile_clean['vote']
X = chile_clean[['statusquo', 'logincome', 'logpop', 'age']]
for smallk in bigK:
    cv = KFold(n_splits = 10, random_state = 1234, shuffle = True)
    knn = KNeighborsClassifier(n_neighbors = smallk)
    scores = cross_val_score(knn, X, y, 
                             scoring = 'accuracy',
                             cv = cv, n_jobs = -1)
    errmea.append(1-scores.mean())
print('Best K is {a}.'.format(a = str(bigK[errmea.index(min(errmea))])))

In [None]:
sns.lineplot(x = bigK, y = errmea)
plt.title('KNN algorithm')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.scatter(bigK[errmea.index(min(errmea))], min(errmea), marker='X', color = 'red')
plt.show()

## Classification

**Check-in:** Find the best method for predicting credit card default.

In [None]:
default = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/default.csv')
default.head()

# Questions?

# See you in the next class!