# Quantifying model fitness

- Determine how well a model fits the data
- Goodness of fit is expressed as a number
1. Coefficient of determination
    - Sometimes called "r-squared" or "R-squared".
    - Quantifies how strongly the response variable is related to the explanatory variable
    - The proportion of the variance in the response variable that is predictable from the explanatory variable (how much of the variation in the response the explanatory variable explains)
    - 1 means a perfect fit, 0 means the worst possible fit
    - get the value : `print(model.rsquared)`
2. Residual standard error (RSE) 
    - A "typical" difference between predicted and observed responses
    - It has the same unit as the response variable.
    - get the value : `print(np.sqrt(model.mse_resid))`
    - Degrees of freedom = the number of observations minus the number of model coefficients.
    - An RSE of 74 means the difference between predicted and observed values is typically about 74 units.
<center><img src="images/03.01.jpg"  style="width: 400px, height: 300px;"/></center>

3. Mean Squared error (MSE)
    - MSE = RSE²
    - get the value : `print(model.mse_resid)`
4. Root-mean-square error (RMSE)
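    - Like RSE, but the sum of squared residuals is divided by the number of observations instead of the degrees of freedom
    - get the value : `print(np.sqrt(model.ssr / model.nobs))`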
5. Correlation vs Coefficient of determination
    - The coefficient of determination for a simple linear regression model is the correlation squared.
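
The link between the coefficient of determination and correlation can be checked by hand. The sketch below uses NumPy only and synthetic data invented for illustration (the variable names and numbers are assumptions, not part of these notes); with a fitted statsmodels model, `model.rsquared` would play the role of `r_squared`.

```python
import numpy as np

# Synthetic data for illustration only: a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=50)

# Least-squares line; np.polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# R-squared: share of the response variance explained by the model.
ss_resid = np.sum(resid ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total

# For simple linear regression this equals the squared correlation.
corr = np.corrcoef(x, y)[0, 1]
print("r-squared:", r_squared)
print("correlation squared:", corr ** 2)
```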

### Calculating RSE

In [1]:
# residuals_sq = model.resid ** 2
# resid_sum_of_sq = sum(residuals_sq)
# deg_freedom = len(df.index) - len(model.params) # Observations minus number of model coefficients
# rse = np.sqrt(resid_sum_of_sq/deg_freedom)
# print("rse :", rse)

### Calculating RMSE

In [3]:
# residuals_sq = model.resid ** 2
# resid_sum_of_sq = sum(residuals_sq)
# n_obs = len(df.index) # Number of observations used
# rmse = np.sqrt(resid_sum_of_sq/n_obs)
# print("rmse :", rmse)
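
The two calculations above differ only in the denominator. A minimal NumPy sketch on made-up data (an assumption for illustration) shows the effect: RSE divides by the degrees of freedom, RMSE by the number of observations, so RSE is always slightly larger.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n_obs = 30
x = rng.uniform(0, 10, size=n_obs)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=n_obs)

# Fit a straight line and collect the residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
resid_sum_of_sq = np.sum(resid ** 2)

n_coeffs = 2  # intercept and slope
rse = np.sqrt(resid_sum_of_sq / (n_obs - n_coeffs))  # divide by degrees of freedom
rmse = np.sqrt(resid_sum_of_sq / n_obs)              # divide by number of observations

print("rse :", rse)
print("rmse:", rmse)
```

The gap shrinks as the number of observations grows, since `rse = rmse * sqrt(n_obs / (n_obs - n_coeffs))`.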

# Residual properties of a good fit

- Residuals are normally distributed
- The mean of the residuals is zero
<center><img src="images/03.04.jpg"  style="width: 400px, height: 300px;"/></center>
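
Both properties can be checked numerically before looking at any plot. This is a rough sketch on synthetic data (an assumption for illustration): it computes the mean of the residuals and their sample skewness, which should be near zero for a symmetric, bell-shaped distribution. With a statsmodels model, `model.resid` would replace `resid`.

```python
import numpy as np

# Synthetic data for illustration only: linear trend with normal noise.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 4.0 + 1.5 * x + rng.normal(0, 1.0, size=200)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Mean of the residuals: essentially zero for a least-squares fit with an intercept.
resid_mean = resid.mean()

# Sample skewness: close to zero when the residuals are symmetric.
z = (resid - resid_mean) / resid.std()
skewness = np.mean(z ** 3)

print("mean of residuals:", resid_mean)
print("skewness:", skewness)
```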


# Plot Diagnostics

### Residuals vs. fitted

- Lowess trend line = blue line
- smooth curve following the data
- useful for visualizing trends
- If the residuals are normally distributed with mean zero, the blue line stays close to the dotted line at 0
- Shows whether the residuals (on the Y-axis) drift positive or negative as the fitted values (on the X-axis) change
<center><img src="images/03.05.jpg"  style="width: 400px, height: 300px;"/></center>


In [4]:
# sns.residplot(x="x_var", y="y_var", data=df, lowess=True)
# plt.xlabel("Fitted values")
# plt.ylabel("Residuals")
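
To see what a bad residuals-vs-fitted pattern looks like without plotting, the sketch below fits a straight line to deliberately curved synthetic data. The mean residual in three bins is a crude stand-in for the LOWESS line, not the LOWESS algorithm itself; all data and names are assumptions for illustration.

```python
import numpy as np

# Synthetic data for illustration only: a curved (quadratic) relationship.
rng = np.random.default_rng(2)
x = np.linspace(0, 6, 90)
y = x ** 2 + rng.normal(0, 0.3, size=x.size)

# Deliberately fit a straight line to the curved data.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Mean residual in three equal-sized bins, ordered by fitted value.
order = np.argsort(fitted)
bin_means = [chunk.mean() for chunk in np.array_split(resid[order], 3)]
print("mean residual per bin (low, middle, high fitted values):", bin_means)
```

The bin means swing from positive to negative and back to positive, which is the kind of curved trend line that signals a poor fit.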

### Q-Q Plot

- Shows whether or not the residuals follow a normal distribution
- X-axis shows theoretical quantiles
- Y-axis shows quantiles derived from the dataset
- If the points track along the straight line, the residuals are normally distributed
- The points that deviate most from the line correspond to the rows with the most extreme residuals
<center><img src="images/03.06.jpg"  style="width: 400px, height: 300px;"/></center>


In [5]:
# from statsmodels.api import qqplot
# qqplot(data=model.resid, fit=True, line="45")
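
The coordinates of a Q-Q plot can also be built by hand, which makes the two axes concrete. This sketch uses NumPy and the standard library's `statistics.NormalDist`. The plotting positions `(i - 0.5) / n` are one common choice and an assumption here, and the residuals are simulated rather than taken from a fitted model.

```python
import numpy as np
from statistics import NormalDist

# Simulated stand-in for model.resid (illustration only).
rng = np.random.default_rng(3)
resid = rng.normal(0, 2.0, size=200)

# Y-axis: standardized residuals sorted in ascending order (sample quantiles).
sample_q = np.sort((resid - resid.mean()) / resid.std(ddof=1))

# X-axis: standard normal quantiles at evenly spread probabilities.
n = len(resid)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the residuals are normal, the points lie close to a straight line,
# so the correlation between the two sets of quantiles is close to 1.
qq_corr = np.corrcoef(theoretical_q, sample_q)[0, 1]
print("Q-Q correlation:", qq_corr)
```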

### Scale-location plot

- Square root of the absolute value of the standardized residuals versus the fitted values
- Shows whether the size of the residuals gets bigger or smaller as the fitted values change
- Constant size = roughly flat trend line, changing size = trend line that rises or falls
- flat = good, many ups and downs = bad
<center><img src="images/03.07.jpg"  style="width: 400px, height: 300px;"/></center>


In [2]:
# model_norm_residuals = model.get_influence().resid_studentized_internal # Normalized residuals
# model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# sns.regplot(x=model.fittedvalues, y=model_norm_residuals_abs_sqrt, ci=None, lowess=True)
# plt.xlabel("Fitted values")
# plt.ylabel("Sqrt of abs val of stdized residuals")

In [1]:
# model_norm_residuals_abs_sqrt = np.sqrt(np.abs(residuals)) # Assumes residuals holds the standardized residuals

# plt.figure(figsize=(7,7))
# sns.regplot(x=predictions.reshape(-1), y=model_norm_residuals_abs_sqrt, # Keyword x/y, as newer seaborn requires
#               scatter=True,
#               lowess=True,
#               line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
# plt.ylabel("Sqrt of abs val of standardized residuals")
# plt.xlabel("Fitted values")
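
The quantity on the Y-axis of a scale-location plot can be computed directly. The sketch below is a NumPy-only approximation on synthetic data whose noise grows with `x` (an assumption chosen to produce a bad plot). It standardizes each residual by the residual standard error and its leverage, which is the usual definition of an internally studentized residual for simple linear regression.

```python
import numpy as np

# Synthetic data for illustration only: the noise spread grows with x.
rng = np.random.default_rng(4)
n = 200
x = np.linspace(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Leverage for simple linear regression, then standardized residuals.
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard error
std_resid = resid / (s * np.sqrt(1 - leverage))
sqrt_abs_std_resid = np.sqrt(np.abs(std_resid))

# Compare the typical size for low and high fitted values.
order = np.argsort(fitted)
low_half = sqrt_abs_std_resid[order[: n // 2]].mean()
high_half = sqrt_abs_std_resid[order[n // 2:]].mean()
print("low fitted values :", low_half)
print("high fitted values:", high_half)
```

A clearly larger value for the high fitted values means the residuals grow with the fitted values, so the trend line in the plot would slope upward instead of staying flat.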

# Outliers

- Unusual data points, which can arise in two ways:
    - The explanatory variable values are extreme (easy to visualize)
    - The point lies a long way from the model's prediction



# Leverage

- Quantifies how extreme the explanatory variable values are
- Simple if there is only one explanatory variable (Just find minimum and maximum)
- Mathematically more complex with multiple explanatory variables (which points count as extreme in a high-dimensional space?)
- Highly leveraged points are the ones with explanatory variables that are furthest away from the others.
- `hat_diag` in `summary_frame()` stores the leverage
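
For a single explanatory variable, leverage has a simple closed form that can be compared with the general definition as the diagonal of the hat matrix. The sketch below uses a tiny made-up array (an assumption for illustration) in which the last value is far from the rest.

```python
import numpy as np

# Made-up explanatory values; the last one is far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
n = len(x)

# Closed form for simple linear regression.
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# General definition: diagonal of the hat matrix H = X (X'X)^-1 X'.
X = np.column_stack([np.ones(n), x])
hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print("leverage:", leverage)
print("highest leverage at index:", int(np.argmax(leverage)))
print("sum of leverages:", leverage.sum())
```

The leverages add up to the number of model coefficients (2 here), and the most extreme `x` value takes the largest share.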

# Influence

- Measures how much the model would change if you leave the observation out of the dataset when modeling.
- Influence of each observation is based on the size of the residuals and the leverage
- Get the influence measures for every observation as a DataFrame with `model.get_influence().summary_frame()`
- Cook's distance is the most common measure of influence.
- `cooks_d` in `summary_frame()` stores Cook's distance

In [5]:
# model = ols("y_var ~ x_var", data=df).fit()
# df_summary = model.get_influence().summary_frame()
# df["leverage"] = df_summary["hat_diag"]
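
Cook's distance can be reproduced by hand to see how it combines residual size and leverage, and how it relates to leaving an observation out. This is a NumPy-only sketch on synthetic data with one planted influential point. The data and the helper `fit_predict` are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Synthetic data for illustration only, with one planted influential point.
rng = np.random.default_rng(5)
x = np.append(rng.uniform(0, 10, size=29), 25.0)  # last x is extreme
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=30)
y[-1] -= 15.0                                     # and far from the trend

n, p = len(x), 2                                  # observations, coefficients
X = np.column_stack([np.ones(n), x])

def fit_predict(X_train, y_train, X_all):
    """Hypothetical helper: least-squares fit, then predictions for X_all."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_all @ beta

fitted = fit_predict(X, y, X)
resid = y - fitted
s2 = np.sum(resid ** 2) / (n - p)                 # residual mean square
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance from residual size and leverage.
cooks_d = resid ** 2 / (p * s2) * leverage / (1 - leverage) ** 2

# Same quantity from its definition: refit without the observation
# and measure how much all the predictions move.
i = int(np.argmax(cooks_d))
keep = np.arange(n) != i
fitted_without_i = fit_predict(X[keep], y[keep], X)
cooks_d_loo = np.sum((fitted - fitted_without_i) ** 2) / (p * s2)

print("most influential index:", i)
print("formula:", cooks_d[i], "leave-one-out:", cooks_d_loo)
```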