So far, you've learned how to build linear regression models and interpret estimated coefficients. In this checkpoint, you'll explore how to evaluate model performance in the training phase. Recall that there are two contexts where performance matters: with the training set and with a test set. Evaluating performance on the training set tells you how well your model explains the information in the target variable. Evaluating performance on a test set tells you how well your model will perform when it's given previously unseen observations.

In this checkpoint, you'll go over concepts like *F-tests* and *R-squared*. F-tests allow you to compare your model to a reduced model with no features. R-squared (and *adjusted R-squared*, which is a variant of R-squared) values tell you how well the model accounts for variance in the target.

After that, you'll see how to compare different models in terms of their explanatory power. You'll learn how to read the *Akaike information criterion* and the *Bayesian information criterion* for this purpose.

## Key topics

* Training and test data
* Evaluating training performance
* F-tests
* Degrees of freedom
* R-squared
* Akaike information criterion
* Bayesian information criterion

At the end of this checkpoint, you'll work through two assignments where you'll evaluate the performance of your weather and house prices models.

## Is your model better than an "empty" model?

When evaluating your model, you first need to ask whether your model contributes anything to the explanation of the outcome variable. In other words, you need to determine whether or not your features explain variance in the outcome. If they don't, you could drop your features altogether, and the resulting "empty" model would perform equally well—which is to say, not very well!

For this purpose, use an *F-test*.

### F-tests

F-tests can be calculated in different ways, depending on the situation. But in general, they compare a model's unexplained variance with the unexplained variance of a reduced model. Here, the reduced model is a model with no features, meaning that all variance in the outcome is unexplained. For a linear regression model with two parameters, $y=\alpha+\beta x$, the F-test is built from these pieces:

| Name | Equation | 
| :-- | :-- |
| Unexplained model variance | $$SSE_F=\sum(y_i-\hat{y}_i)^2$$ |
| Unexplained variance in reduced model | $$SSE_R=\sum(y_i-\bar{y})^2$$ |
| Number of parameters in the model | $$p_F = 2 (\alpha \text{ and } \beta)$$ |
| Number of parameters in the reduced model | $$p_R = 1 (\alpha)$$ |
| Number of observations | $$n$$ |
| Degrees of freedom of $SSE_F$ | $$df_F = n - p_F$$ |
| Degrees of freedom of $SSE_R$ | $$df_R = n - p_R$$ |

These pieces come together to give you the full equation for the F-test:

$$F=\dfrac{SSE_R-SSE_F}{df_R-df_F}\div\dfrac{SSE_F}{df_F}$$

This introduces some new terminology. *Degrees of freedom* quantifies the amount of information "left over" to estimate variability after all parameters are estimated.

In regression, degrees of freedom for a function works like this: With two data points, a regression line $y=\alpha + \beta x$ has `0` degrees of freedom (`2` minus the number of parameters). Those two parameters encompass all the information in the data. Knowing $\alpha$ and $\beta$ alone, you can perfectly reproduce the original data. No additional information is available from the data itself. If you have 10 data points, then the model's degrees of freedom would be `8` (`10` minus the number of parameters).
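To make the pieces in the table concrete, here is a minimal sketch that assembles the F-statistic by hand for a one-feature model. The five data points are made up for illustration, and the variable names simply mirror the table's symbols. It uses only the Python standard library and is not how statsmodels computes the statistic internally.

```python
# Toy data (hypothetical values chosen only for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for y = alpha + beta * x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Unexplained variance of the full model and of the reduced (mean-only) model
y_hat = [alpha + beta * xi for xi in x]
sse_f = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
sse_r = sum((yi - y_bar) ** 2 for yi in y)

# Parameter counts and degrees of freedom
p_f, p_r = 2, 1
df_f, df_r = n - p_f, n - p_r

# Extra variance explained per extra parameter, relative to the
# full model's unexplained variance per degree of freedom
f_stat = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
print(f"SSE_F={sse_f:.3f}, SSE_R={sse_r:.3f}, F={f_stat:.1f}")
```

Because the toy data lie almost exactly on a line, `sse_f` is tiny compared with `sse_r`, and the resulting F-statistic is large.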

The F-test's null hypothesis states that the model is indistinguishable from the reduced model, which means that the features contribute nothing to the explanation of the target variable. Instead of reading the F-statistic directly, it's easier to read its associated p-value. The lower the p-value, the stronger the evidence against the null hypothesis. Namely, if the p-value of the F-test for your model is less than or equal to `0.1` (or, more strictly, less than or equal to `0.05`), you can reject the null hypothesis and say that your model is useful and contributes something that is statistically significant in the explanation of the target.

Now, calculate the F-statistic of your medical costs model:

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [6]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'medicalcosts'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

insurance_df = pd.read_sql_query('select * from medicalcosts',con=engine)

# No need for an open connection, because you're only doing a single query
engine.dispose()

insurance_df.head(10)

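| | age | sex | bmi | children | smoker | region | charges |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.9 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.5 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.1 |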


In [7]:
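# `dtype=int` keeps the dummies as 0/1 integers (recent pandas versions
# default to booleans, which statsmodels may not accept directly)
insurance_df["is_male"] = pd.get_dummies(insurance_df.sex, drop_first=True, dtype=int)
insurance_df["is_smoker"] = pd.get_dummies(insurance_df.smoker, drop_first=True, dtype=int)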

# `Y` is the target variable
Y = insurance_df['charges']

# `X` is the feature set
X = insurance_df[['is_male','is_smoker', 'age', 'bmi']]

# Add a constant to the model because it's best practice
# to do so every time!
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.747
Method:                 Least Squares   F-statistic:                     986.5
Date:                Wed, 19 Dec 2018   Prob (F-statistic):               0.00
Time:                        17:18:48   Log-Likelihood:                -13557.
No. Observations:                1338   AIC:                         2.712e+04
Df Residuals:                    1333   BIC:                         2.715e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.163e+04    947.267    -12.281      0.0

This model's F-statistic is `986.5`, and the associated p-value is very close to zero. This means that your features add some information to the reduced model, and your model is useful in explaining `charges`.

However, F-tests don't quantify how much information your model contributes. That requires R-squared, which you'll learn about next.

## Quantifying a model's performance on the training set

R-squared is probably the most common measure of goodness of fit in a linear regression model. It is a proportion (typically between `0` and `1`) that expresses how much variance in the outcome variable is explained by the explanatory variables in the model. Generally speaking, higher $R^2$ values are better, up to a point. A low $R^2$ indicates that your model isn't explaining much information about the outcome, which means that it will not give very good predictions. But a very high $R^2$ can be a warning sign of overfitting. No dataset is a perfect representation of reality, so a model that perfectly fits your data ($R^2$ of `1` or close to `1`) is likely to be biased by quirks in the data and will perform less well on the test set.
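In terms of the quantities defined in the F-test table earlier, R-squared can be written as the share of the reduced model's unexplained variance that the full model removes:

$$R^2 = 1 - \dfrac{SSE_F}{SSE_R}$$

For instance, an $R^2$ of `0.75` corresponds to $SSE_F/SSE_R = 0.25$. In other words, a quarter of the variance that the mean-only model leaves unexplained is still unexplained by the full model.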

In the regression summary table above, you can see that your medical costs model's R-squared value is `0.747`. This means that your model explains 74.7% of the variance in the charges, leaving 25.3% unexplained. You can conclude that there's still room for improvement. Now, refit the model from the previous checkpoint, where you included the interaction of `bmi` and the `is_smoker` dummy:

In [8]:
# `Y` is the target variable
Y = insurance_df['charges']

# This is the interaction between BMI and smoking
insurance_df["bmi_is_smoker"] = insurance_df.bmi * insurance_df.is_smoker

# `X` is the feature set
X = insurance_df[['is_male','is_smoker', 'age', 'bmi', "bmi_is_smoker"]]

# Add a constant to the model because it's best practice
# to do so every time!
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     1365.
Date:                Wed, 19 Dec 2018   Prob (F-statistic):               0.00
Time:                        17:18:50   Log-Likelihood:                -13265.
No. Observations:                1338   AIC:                         2.654e+04
Df Residuals:                    1332   BIC:                         2.657e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const         -2071.0750    840.644     -2.464

The R-squared of this model is `0.837`, which is higher than your previous model's R-squared. This improvement indicates that the interaction of `bmi` and `is_smoker` explains some previously unexplained variance in `charges`.

As mentioned before, high R-squared values are generally desirable. But, in some cases, very high R-squared values indicate some potential problems with a model. Specifically, it could mean the following:

* A very high R-squared value may be a sign of overfitting. If your model is too complex for the data, then it may overfit the training set and do a poor job on the test set. That said, there isn't an agreed-upon R-squared threshold to detect overfitting. Instead, you need to compare the model's performance on test versus training data. If your model performs significantly worse on the test set than the training set, you should suspect overfitting. You'll explore how to evaluate linear regression models on the test set in the next checkpoint.

* R-squared is an inherently optimistic estimate of performance, in the sense that adding explanatory variables to the model never lowers R-squared and usually raises it. This is true even when you include irrelevant variables like noise or random data. To mitigate this problem, data scientists usually use a metric called *adjusted R-squared* instead of *R-squared*. Adjusted R-squared does the same job as R-squared, but it is adjusted according to the number of features included in the model. Hence, it's safer to look at the adjusted R-squared value instead of the R-squared value.
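As a rough illustration of that adjustment, the sketch below applies the standard adjusted R-squared formula, $1-(1-R^2)\frac{n-1}{n-k-1}$, where $k$ is the number of features excluding the constant. The function name is hypothetical, and the inputs are the rounded values read from the second summary table, so the last digit may not always match what statsmodels prints.

```python
def adjusted_r_squared(r_squared, n_obs, n_features):
    """Adjusted R-squared; `n_features` excludes the constant."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)


# Rounded values read from the second summary table (interaction model)
adj = adjusted_r_squared(0.837, n_obs=1338, n_features=5)
print(round(adj, 3))  # 0.836

# With the same R-squared, adding features lowers the adjusted value
print(adjusted_r_squared(0.837, n_obs=1338, n_features=50) < adj)  # True
```

With 1,338 observations and only a handful of features, the penalty is small, which is why the two values in the summary tables are nearly identical.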

**A note on negative R-squared values:** It is possible to get negative R-squared values for some models. In general, if a model is weaker than a straight horizontal line, then the R-squared value becomes negative. This usually happens when a constant is not included in the model. If you get a negative value for R-squared, that means that your model explains the target very poorly. 
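The following sketch shows one way a negative value can arise. It fits a line through the origin (no constant) to made-up data whose true intercept is far from zero, and then computes R-squared relative to the mean of the target. The data and the helper name `r_squared` are assumptions for illustration only.

```python
def r_squared(y, y_hat):
    """R-squared relative to a horizontal line at the mean of y."""
    y_bar = sum(y) / len(y)
    sse_model = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sse_mean = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse_model / sse_mean


# Hypothetical data that follow y = 9 + x exactly
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 11.0, 12.0, 13.0]

# Least-squares slope for a model with no constant: y = b * x
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
y_hat = [b * xi for xi in x]

r2 = r_squared(y, y_hat)
print(f"slope={b:.1f}, R-squared={r2:.1f}")
```

Forcing the line through the origin produces larger squared errors than simply predicting the mean of `y`, so the ratio of the two error sums exceeds `1` and R-squared drops below zero.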

## Comparing different models

Comparing different models and choosing the best one is an essential practice in data science. Often, you'll try several models and evaluate their performance on a test set in order to determine the top-performing model. However, *inference* is also a critical task when it comes to linear regression models. Unlike testing the predictive power, in inference, you care about the explanatory power of your models.

Throughout this checkpoint, you've seen that you can measure the performance of your models on the training set using an F-test or R-squared. Hence, both can be used to compare different models. Unfortunately, the two metrics suffer from some drawbacks that make them inappropriate to use in certain situations.

Here, you'll briefly explore how you can use F-tests and R-squared to compare models. Then, you'll learn about information criteria that you can also use to compare different models.

### Using F-tests for model comparison

You can use an F-test to compare two models if one of them is nested within the other. That is, if the feature set of one model is a subset of the feature set of the other, then you can use an F-test. In this case, the smaller model plays the role of the reduced model in the F-test equation, and a low p-value indicates that the additional features in the larger model explain a statistically significant amount of extra variance.

However, if models are not nested, then using an F-test may be misleading. F-tests are quite sensitive to the normality of the error terms. If errors are not normally distributed, you should try other methods.
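As a sketch of a nested comparison, the two medical costs models can be compared directly, because the first model's features are a subset of the second's. Since both models share the same target, each $SSE$ equals $(1-R^2)$ times the same total sum of squares, so the F-test equation from earlier can be rewritten in terms of R-squared. The helper name is hypothetical, and the inputs are the rounded values from the two summary tables, so the result is approximate.

```python
def partial_f_from_r2(r2_reduced, r2_full, n_obs, p_reduced, p_full):
    """F-statistic comparing nested models; p counts include the constant."""
    df_full = n_obs - p_full
    extra_explained = (r2_full - r2_reduced) / (p_full - p_reduced)
    unexplained_per_df = (1 - r2_full) / df_full
    return extra_explained / unexplained_per_df


# First model: constant + 4 features; second model adds the interaction term
f_stat = partial_f_from_r2(0.747, 0.837, n_obs=1338, p_reduced=5, p_full=6)
print(round(f_stat, 1))
```

A statistic this far above `1` suggests that the interaction term explains a substantial amount of additional variance. In statsmodels, the fitted results object also offers a `compare_f_test()` method for nested models, which returns the F-statistic together with its p-value.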

### Using R-squared for model comparison

R-squared can also be used. You already saw that R-squared is biased, as it tends to increase with the number of explanatory variables. So, instead of R-squared, you can use adjusted R-squared. The higher the adjusted R-squared, the better the model explains the target variable. 

### Using information criteria

Using information criteria is also a common way of comparing different models and selecting the best one. Here, you'll learn about two information criteria: *Akaike information criterion* (AIC) and *Bayesian information criterion* (BIC). Both take into consideration the sum of the squared errors (SSE), the sample size, and the number of parameters.

The formula for AIC is as follows:

$$AIC = n\ln(SSE) - n\ln(n) + 2p$$


And the formula for BIC is as follows:

$$BIC = n\ln(SSE) - n\ln(n) + p\ln(n)$$

In both of these formulas, $n$ represents the sample size, $p$ represents the number of regression coefficients in the model (including the constant), and $\ln$ stands for the natural logarithm.

For both AIC and BIC, lower values indicate better models. So you should choose the model with the lowest AIC or BIC value. Although you can use either of the two criteria, AIC is often criticized for its tendency to overfit. In contrast, BIC penalizes the number of parameters more severely than AIC, so it favors more parsimonious models (that is, models with fewer parameters).
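The sketch below implements the two formulas above and applies them to a pair of hypothetical models, chosen so that the criteria disagree. The SSE values, sample size, and parameter counts are invented for illustration.

```python
import math


def aic(sse, n, p):
    """AIC as defined above: n*ln(SSE) - n*ln(n) + 2p."""
    return n * math.log(sse) - n * math.log(n) + 2 * p


def bic(sse, n, p):
    """BIC as defined above: n*ln(SSE) - n*ln(n) + p*ln(n)."""
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)


n = 100
small = {"sse": 550.0, "p": 3}  # fewer parameters, slightly worse fit
large = {"sse": 500.0, "p": 6}  # more parameters, slightly better fit

for name, m in (("small", small), ("large", large)):
    print(name, round(aic(m["sse"], n, m["p"]), 2), round(bic(m["sse"], n, m["p"]), 2))

# AIC prefers the larger model here, while BIC prefers the smaller one
print(aic(large["sse"], n, large["p"]) < aic(small["sse"], n, small["p"]))
print(bic(large["sse"], n, large["p"]) > bic(small["sse"], n, small["p"]))
```

The disagreement comes entirely from the penalty term: each parameter costs `2` under AIC but $\ln(100)\approx 4.6$ under BIC. Note that statsmodels computes both criteria from the log-likelihood, so for a fixed sample size its absolute values differ from these formulas by a constant, while the differences between models stay the same.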

## Which medical costs model is better?

The statsmodels `summary()` method provides all of the above metrics. Take a look at these metrics in the tables above. For your first model, R-squared is `0.747`, adjusted R-squared is `0.747`, the F-statistic is `986.5`, AIC is `27,120`, and BIC is `27,150`. For your second model, R-squared is `0.837`, adjusted R-squared is `0.836`, the F-statistic is `1365`, AIC is `26,540`, and BIC is `26,570`. According to all of the metrics, the second model seems better than the first one.