### Coefficient of Determination
How strong is the linear relationship?

one metric is **r-squared** (the coefficient of determination); it's written with a lowercase r for simple linear regression and an uppercase R when you have more than one explanatory variable

it's the proportion of the variance in the response variable that is predictable from the explanatory variable

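a score of 1 means the model is a perfect fit, and a score of 0 means the model is no better than randomness (always predicting the mean of the response)

for example, a score of 0.89 for a model of clicks versus impressions means that the number of impressions explains 89% of the variability in the number of clicks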

it shows up in the first line of .summary(), but an easier way to extract the metric is to use the .rsquared attribute

for simple linear regression, the coefficient of determination is the square of the correlation between the explanatory and response variables
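
as a rough sketch of the point above, you can check with plain numpy that r-squared equals the squared correlation in simple linear regression; the data is made up and the names (length_cm, mass_g) are only for illustration, they aren't from a real koi dataset

```python
import numpy as np

# Hypothetical data: koi lengths (cm) and masses (g), invented for illustration.
rng = np.random.default_rng(42)
length_cm = rng.uniform(20, 60, size=50)
mass_g = 15 * length_cm - 200 + rng.normal(0, 40, size=50)

# Fit a simple linear regression (a degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(length_cm, mass_g, deg=1)
fitted = intercept + slope * length_cm

# r-squared: 1 minus the ratio of residual variation to total variation.
ss_resid = np.sum((mass_g - fitted) ** 2)
ss_total = np.sum((mass_g - mass_g.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total

# Squared correlation between the explanatory and response variables.
corr_squared = np.corrcoef(length_cm, mass_g)[0, 1] ** 2

# The two values agree up to floating-point rounding.
print(r_squared, corr_squared)
```

with a fitted statsmodels model, the first of these numbers is what the .rsquared attribute reports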

another metric is the **residual standard error (RSE)**

each residual is the difference between a predicted value and an observed value, and the RSE measures the typical size of the residuals (how much the predictions are typically wrong); it has the same unit as the response variable

the **mean squared error (MSE)** is related but less commonly used; it's the RSE squared, so you can take the square root of .mse_resid to get the RSE
    mse = mdl_koi.mse_resid
    rse = np.sqrt(mse)
an RSE of 74 would mean that the difference between the predicted koi masses and the observed koi masses is typically about 74 grams

another related metric is the **root-mean-square error (RMSE)**

it's calculated the same way as the RSE, except that the number of coefficients isn't subtracted from the number of observations in the denominator

like the RSE, it quantifies how inaccurate the model predictions are, but it's worse for comparisons between models, so you should mostly use the RSE
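
a minimal sketch of the difference between the two, computed by hand with numpy on made-up data (the variable names are illustrative assumptions): the RSE divides the sum of squared residuals by the degrees of freedom, the RMSE divides by the number of observations

```python
import numpy as np

# Hypothetical data, invented for illustration.
rng = np.random.default_rng(0)
length_cm = rng.uniform(20, 60, size=40)
mass_g = 15 * length_cm - 200 + rng.normal(0, 40, size=40)

# Least-squares line and its residuals.
slope, intercept = np.polyfit(length_cm, mass_g, deg=1)
residuals = mass_g - (intercept + slope * length_cm)

n_obs = len(mass_g)
n_coeffs = 2  # intercept and slope
ss_resid = np.sum(residuals ** 2)

# RSE: divide by the degrees of freedom (observations minus coefficients).
mse = ss_resid / (n_obs - n_coeffs)
rse = np.sqrt(mse)

# RMSE: divide by the number of observations only.
rmse = np.sqrt(ss_resid / n_obs)

print(f"RSE: {rse:.2f}  RMSE: {rmse:.2f}")
```

the RSE always comes out slightly larger than the RMSE, and the gap shrinks as the number of observations grows relative to the number of coefficients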

### Visualizing Model Fit
if the linear regression model is a good fit, then the residuals are approximately normally distributed with a mean of 0
there are a few diagnostic plots that can help you visualize this

**residuals versus fitted values**
the blue line is a LOWESS trend line, a smooth curve following the data; it's not good for making predictions but is useful for visualizing trends
for a good fit, the trend line closely follows the y=0 line of the plot
    sns.residplot(x="length_cm", y="mass_g", data=koi, lowess=True)
    plt.xlabel("Fitted Values")
    plt.ylabel("Residuals")

**Q-Q plot**
shows whether or not the residuals follow a normal distribution
the x-axis is the quantiles from the normal distribution and the y-axis is the sample quantiles (quantiles derived from your dataset)
if the points track along the straight line, the residuals are normally distributed; the most extreme residuals are the points at either end
    from statsmodels.api import qqplot
    qqplot(data=mdl_koi.resid, fit=True, line="45")
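
to make the two axes concrete, here's a hedged sketch that builds the Q-Q points by hand; the residuals are simulated, and the (i - 0.5) / n plotting positions are just one common convention that I'm assuming here, so a library's plot may use slightly different positions

```python
import numpy as np
from statistics import NormalDist

# Hypothetical residuals, simulated for illustration.
rng = np.random.default_rng(1)
residuals = rng.normal(0, 40, size=30)

# Standardize so the points are comparable to a standard normal (45-degree line).
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)

# y-axis: sample quantiles are just the sorted standardized residuals.
sample_quantiles = np.sort(standardized)

# x-axis: theoretical normal quantiles at evenly spread probabilities.
# Assumption: plotting positions (i - 0.5) / n.
n = len(sample_quantiles)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])

# For normally distributed residuals the pairs lie close to the line y = x.
for t, s in zip(theoretical_quantiles[:3], sample_quantiles[:3]):
    print(f"theoretical {t:6.2f}   sample {s:6.2f}")
```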

**scale-location plot**
shows the square root of the absolute value of the standardized residuals versus the fitted values
it shows whether the size of the residuals gets bigger or smaller as the fitted values change
a roughly flat trend line means the size doesn't change much, and a line that goes up and down all over the place means it's a poor fit
    # first extract the normalized residuals from the model
    model_norm_residuals_koi = mdl_koi.get_influence().resid_studentized_internal
    # take the absolute values, then the square root, of the normalized residuals
    model_norm_residuals_abs_sqrt_koi = np.sqrt(np.abs(model_norm_residuals_koi))
    sns.regplot(x=mdl_koi.fittedvalues, y=model_norm_residuals_abs_sqrt_koi, ci=None, lowess=True)
    plt.xlabel("Fitted Values")
    plt.ylabel("sqrt of abs val of stdized residuals")
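
for reference, a sketch of what goes on the y-axis, computed by hand with numpy on made-up data; it assumes the usual definition of internally studentized residuals (each residual divided by the RSE times sqrt(1 - leverage)), and all names here are illustrative

```python
import numpy as np

# Hypothetical data, invented for illustration.
rng = np.random.default_rng(2)
length_cm = rng.uniform(20, 60, size=40)
mass_g = 15 * length_cm - 200 + rng.normal(0, 40, size=40)

# Least-squares line, fitted values and residuals.
slope, intercept = np.polyfit(length_cm, mass_g, deg=1)
fitted = intercept + slope * length_cm
residuals = mass_g - fitted

# Residual standard error (two coefficients: intercept and slope).
n_obs = len(mass_g)
rse = np.sqrt(np.sum(residuals ** 2) / (n_obs - 2))

# Leverage of each point in simple linear regression.
x_dev = length_cm - length_cm.mean()
leverage = 1 / n_obs + x_dev ** 2 / np.sum(x_dev ** 2)

# Standardized (internally studentized) residuals.
standardized = residuals / (rse * np.sqrt(1 - leverage))

# The scale-location plot shows this quantity against the fitted values.
sqrt_abs_standardized = np.sqrt(np.abs(standardized))
print(sqrt_abs_standardized[:5])
```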

### Outliers
Is your model overly affected by some unusual data points?
an outlier is an unusual data point
- an outlier can be a point with an extreme explanatory variable value (an extremely long/short koi)
- an outlier can also be a point that's far from the model predictions (a koi with 0 mass)

**Leverage**
a measure of how extreme the explanatory variable values are; highly leveraged points are the ones whose explanatory variable values are furthest away from the others
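
a small sketch of where leverage comes from, assuming the standard definition as the diagonal of the hat matrix; the lengths below are made up, with one deliberately short koi

```python
import numpy as np

# Hypothetical koi lengths (cm), including one unusually short fish.
length_cm = np.array([12.9, 30.0, 32.5, 35.0, 36.5, 38.0, 40.0, 42.5])

# Design matrix with an intercept column and the explanatory variable.
X = np.column_stack([np.ones_like(length_cm), length_cm])

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverage of each point.
hat_matrix = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat_matrix)

# The point furthest from the other lengths has the highest leverage.
print(length_cm[np.argmax(leverage)])  # prints 12.9
```

note that leverage only looks at the explanatory variable; the response values don't enter the calculation at all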

**Influence**
measures how much the model would change if you left the observation out of the dataset when modeling; it's a "leave one out" metric
influence is based on the size of the residuals and the leverage, so the most influential points tend to be far from the trend line and have high leverage

**Cook's Distance**
the most common measure of influence, you can find the most influential koi by arranging the rows by descending Cook's distance values
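
to tie Cook's distance back to the "leave one out" idea, here's a hedged sketch on made-up data that computes it two ways: from the residuals and leverage, and by actually refitting without each point; fit_line is a hypothetical helper written for this example, not a library function

```python
import numpy as np

# Hypothetical data: 19 ordinary koi plus one short, poorly fitting koi.
rng = np.random.default_rng(3)
length_cm = np.append(rng.uniform(30, 45, size=19), 12.9)
mass_g = 15 * length_cm - 200 + rng.normal(0, 30, size=20)
mass_g[-1] += 250  # push the short koi away from the trend


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line (illustrative helper)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope


n_obs, n_coeffs = len(mass_g), 2
intercept, slope = fit_line(length_cm, mass_g)
fitted = intercept + slope * length_cm
residuals = mass_g - fitted
mse = np.sum(residuals ** 2) / (n_obs - n_coeffs)

# Way 1: closed form from residual size and leverage.
x_dev = length_cm - length_cm.mean()
leverage = 1 / n_obs + x_dev ** 2 / np.sum(x_dev ** 2)
cooks_closed = residuals ** 2 / (n_coeffs * mse) * leverage / (1 - leverage) ** 2

# Way 2: refit without each point and measure how much all fitted values move.
cooks_loo = np.empty(n_obs)
for i in range(n_obs):
    keep = np.arange(n_obs) != i
    b0, b1 = fit_line(length_cm[keep], mass_g[keep])
    fitted_without_i = b0 + b1 * length_cm
    cooks_loo[i] = np.sum((fitted - fitted_without_i) ** 2) / (n_coeffs * mse)

# Indices of the three most influential observations, most influential first.
print(np.argsort(cooks_closed)[::-1][:3])
```

both versions give the same numbers, which is why you never need to refit the model in practice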

use .get_influence() and .summary_frame() to get these metrics; the leverage values are stored in the hat_diag column and the Cook's distance values in the cooks_d column
    from statsmodels.formula.api import ols
    mdl_koi = ols("mass ~ length", data=koi).fit()
    summary_koi = mdl_koi.get_influence().summary_frame()
    koi["leverage"] = summary_koi["hat_diag"]
    koi["cooks_dist"] = summary_koi["cooks_d"]
    print(koi.sort_values("cooks_dist", ascending=False).head())
    
you can remove the most influential koi and compare the trend lines with and without it
    koi_not_short = koi[koi["length"] != 12.9]
    
    sns.regplot(x="length", 
                y="mass",
                data=koi,
                ci=None,
                line_kws={"color": "green"})
                
    sns.regplot(x="length", 
                y="mass",
                data=koi_not_short,
                ci=None,
                line_kws={"color": "red"})
                
Code example:
    # Create summary_info
    summary_info = mdl_price_vs_dist.get_influence().summary_frame()

    # Add the hat_diag column to taiwan_real_estate, name it leverage
    taiwan_real_estate["leverage"] = summary_info["hat_diag"]

    # Add the cooks_d column to taiwan_real_estate, name it cooks_dist
    taiwan_real_estate["cooks_dist"] = summary_info["cooks_d"]

    # Sort taiwan_real_estate by cooks_dist in descending order and print the head.
    print(taiwan_real_estate.sort_values("cooks_dist", ascending=False).head())

