# Coefficient of determination

The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.
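The equivalence between the coefficient of determination and the squared correlation can be checked numerically. The following is a minimal sketch on synthetic data (the data and all variable names are assumptions for illustration, not part of the advertising dataset); it fits a straight line with NumPy rather than `statsmodels`.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=200)

# Fit a straight line by least squares (polyfit returns the highest power first).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_resid = np.sum((y - fitted) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total

# Squared Pearson correlation between the explanatory and response variables.
corr_squared = np.corrcoef(x, y)[0, 1] ** 2

print(r_squared, corr_squared)
```

The two printed numbers should agree up to floating-point error. This holds only for a straight-line fit with an intercept and one explanatory variable.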

Here, you'll take another look at the second stage of the advertising pipeline: modeling the click response to impressions. Two models are available: `mdl_click_vs_impression_orig` models `n_clicks` versus `n_impressions`. `mdl_click_vs_impression_trans` is the transformed model you saw in Chapter 2. It models `n_clicks` to the power of 0.25 versus `n_impressions` to the power of 0.25.

In [1]:
# # Print a summary of mdl_click_vs_impression_orig
# print(mdl_click_vs_impression_orig.summary())

# # Print a summary of mdl_click_vs_impression_trans
# print(mdl_click_vs_impression_trans.summary())

In [2]:
# # Print the coeff of determination for mdl_click_vs_impression_orig
# print(mdl_click_vs_impression_orig.rsquared)

# # Print the coeff of determination for mdl_click_vs_impression_trans
# print(mdl_click_vs_impression_trans.rsquared)

`mdl_click_vs_impression_orig` has a coefficient of determination of 0.89. Which statement about the model is true?
- The number of impressions explains 89% of the variability in the number of clicks.

Which model does the coefficient of determination suggest gives a better fit?
- The transformed model, `mdl_click_vs_impression_trans`

# Residual standard error


Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it's a measure of how wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.
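As a rough sketch of what sits behind this number: the RSE is the square root of the sum of squared residuals divided by the residual degrees of freedom (number of observations minus number of fitted coefficients). The example below uses synthetic data and assumed variable names; in `statsmodels`, the `mse_resid` attribute holds the quantity under the square root.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 3.0, size=100)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Residual degrees of freedom: observations minus fitted coefficients
# (an intercept and a slope).
n_obs = len(y)
n_coeffs = 2
df_resid = n_obs - n_coeffs

# Mean squared error of the residuals, then its square root (the RSE).
mse = np.sum(residuals ** 2) / df_resid
rse = np.sqrt(mse)

# Root-mean-square error divides by n_obs instead, so it is slightly smaller.
rmse = np.sqrt(np.mean(residuals ** 2))

print("RSE:", rse, "RMSE:", rmse)
```

Because the RSE is in the same units as the response variable, it should come out close to the noise standard deviation used to generate the data (3.0 here).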

Again, you'll look at the models from the advertising pipeline, `mdl_click_vs_impression_orig` and `mdl_click_vs_impression_trans`.

In [3]:
# # Calculate mse_orig for mdl_click_vs_impression_orig
# mse_orig = mdl_click_vs_impression_orig.mse_resid

# # Calculate rse_orig for mdl_click_vs_impression_orig and print it
# rse_orig = np.sqrt(mse_orig)
# print("RSE of original model: ", rse_orig)

# # Calculate mse_trans for mdl_click_vs_impression_trans
# mse_trans = mdl_click_vs_impression_trans.mse_resid

# # Calculate rse_trans for mdl_click_vs_impression_trans and print it
# rse_trans = np.sqrt(mse_trans)
# print("RSE of transformed model: ", rse_trans)

`mdl_click_vs_impression_orig` has an RSE of about 20. Which statement is true?
- The typical difference between observed number of clicks and predicted number of clicks is 20.

Which model does the RSE suggest gives more accurate predictions? `mdl_click_vs_impression_orig` has an RSE of about 20, `mdl_click_vs_impression_trans` has an RSE of about 0.2.
- The transformed model, `mdl_click_vs_impression_trans`.

# Residuals vs. fitted values

Here you can see diagnostic plots of residuals versus fitted values for two models on advertising conversion.

Original model (`n_clicks` versus `n_impressions`):

<center><img src="images/03.05.png" style="width: 400px; height: 300px;"/></center>

Transformed model (`n_clicks` ** 0.25 versus `n_impressions` ** 0.25):

<center><img src="images/03.052.png" style="width: 400px; height: 300px;"/></center>

Look at the numbers on the y-axis scales and how well the trend lines follow the y = 0 line. Which statement is true?

- The residuals track the y = 0 line more closely in the transformed model than in the original model, indicating that the transformed model is a better fit for the data.

# Q-Q plot of residuals

Here are normal Q-Q plots of the previous two models.

Original model (`n_clicks` versus `n_impressions`):

<center><img src="images/03.061.png" style="width: 400px; height: 300px;"/></center>

Transformed model (`n_clicks` ** 0.25 versus `n_impressions` ** 0.25):

<center><img src="images/03.062.png" style="width: 400px; height: 300px;"/></center>

Look at how well the points track the "normality" line. Which statement is true?

- The residuals track the "normality" line more closely in the transformed model compared to the original model, indicating that the transformed model is a better fit for the data.
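For reference, the points on a normal Q-Q plot can be computed by hand: sort the standardized residuals and pair them with the normal quantiles at evenly spaced probabilities. The sketch below uses simulated residuals and one common choice of plotting positions, `(i - 0.5) / n`; plotting libraries may use a slightly different convention, and all names here are assumptions.

```python
import numpy as np
from statistics import NormalDist

# Simulated stand-in for model residuals, assumed purely for illustration.
rng = np.random.default_rng(2)
residuals = rng.normal(0, 5.0, size=500)

# Standardize, then sort to get the sample quantiles (the y-axis values).
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
sample_quantiles = np.sort(standardized)

# Plotting positions (i - 0.5) / n, mapped through the inverse normal CDF
# to get the theoretical quantiles (the x-axis values).
n = len(residuals)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the residuals are normally distributed, the points lie close to the
# line y = x, so the two sets of quantiles are almost perfectly correlated.
qq_corr = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(qq_corr)
```

Points that bend away from the line at either end indicate tails that are heavier or lighter than a normal distribution would produce.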

# Scale-location

Here are scale-location plots of the previous two models. That is, they show the size of the residuals (the square root of the absolute standardized residuals) versus fitted values.

Original model (`n_clicks` versus `n_impressions`):

<center><img src="images/03.071.png" style="width: 400px; height: 300px;"/></center>

Transformed model (`n_clicks` ** 0.25 versus `n_impressions` ** 0.25):

<center><img src="images/03.072.png" style="width: 400px; height: 300px;"/></center>

Look at the numbers on the y-axis and the slope of the trend line. Which statement is true?

- The size of the standardized residuals is more consistent in the transformed model compared to the original model, indicating that the transformed model is a better fit for the data.
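The y-axis values of a scale-location plot can be sketched with NumPy alone. The example below uses synthetic data whose noise deliberately grows with the explanatory variable (an assumption made to produce an upward trend), standardizes the residuals using the simple-linear-regression leverage formula, and takes the square root of their absolute values.

```python
import numpy as np

# Synthetic data, assumed purely for illustration. The noise standard
# deviation grows with x, so the residual spread is not constant.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x)

# Fit a straight line and compute fitted values and residuals.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Leverage of each observation for simple linear regression.
n = len(x)
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Standardized (internally studentized) residuals.
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
standardized = residuals / (rse * np.sqrt(1 - leverage))

# Scale-location y-axis: square root of the absolute standardized residuals.
scale_location_y = np.sqrt(np.abs(standardized))

# A positive correlation with the fitted values corresponds to an
# upward-sloping trend line, meaning the residual size is not consistent.
trend = np.corrcoef(fitted, scale_location_y)[0, 1]
print(trend)
```

For a model with consistent residual size, the trend would be close to flat instead.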

# Drawing diagnostic plots

It's time for you to draw these diagnostic plots yourself using the Taiwan real estate dataset and the model of house prices versus number of convenience stores.

`taiwan_real_estate` is available as a `pandas` DataFrame, and `mdl_price_vs_conv` is available as a fitted model.

In [1]:
# # Plot the residuals vs. fitted values
# sns.residplot(x='n_convenience', y='price_twd_msq', data=taiwan_real_estate, lowess = True)
# plt.xlabel("Fitted values")
# plt.ylabel("Residuals")

# # Show the plot
# plt.show()

In [2]:
# # Import qqplot
# from statsmodels.api import qqplot

# # Create the Q-Q plot of the residuals
# qqplot(data=mdl_price_vs_conv.resid, fit=True, line="45")

# # Show the plot
# plt.show()

In [3]:
# # Preprocessing steps
# model_norm_residuals = mdl_price_vs_conv.get_influence().resid_studentized_internal
# model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# # Create the scale-location plot
# sns.regplot(x=mdl_price_vs_conv.fittedvalues, y=model_norm_residuals_abs_sqrt, ci=None, lowess=True)
# plt.xlabel("Fitted values")
# plt.ylabel("Sqrt of abs val of stdized residuals")

# # Show the plot
# plt.show()

# Leverage

Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, high leverage means that the explanatory variable has values that are different from other points in the dataset. In the case of simple linear regression, where there is only one explanatory variable, this typically means observations with a very high or very low explanatory value.
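For simple linear regression, leverage has a closed form: `1 / n` plus the squared distance of each explanatory value from the mean, divided by the total of those squared distances. The sketch below uses a small made-up set of explanatory values (not taken from the Taiwan dataset) and cross-checks the formula against the diagonal of the hat matrix.

```python
import numpy as np

# Hypothetical explanatory values: mostly small, with two large ones.
x = np.array([1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.5, 12.0, 20.0])
n = len(x)

# Closed-form leverage for simple linear regression.
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Cross-check against the diagonal of the hat matrix H = X (X'X)^-1 X',
# where X has a column of ones (intercept) and a column of x values.
X = np.column_stack([np.ones(n), x])
hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# The observation furthest from the mean of x has the highest leverage.
most_leveraged = int(np.argmax(leverage))
print(leverage.round(3), most_leveraged)
```

Note that leverage depends only on the explanatory values, not on the response, and that the leverages sum to the number of model coefficients (two here).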

Here, you'll look at highly leveraged values in the model of house price versus the square root of distance from the nearest MRT station in the Taiwan real estate dataset.

Guess which observations you think will have a high leverage, then move the slider to find out.
<center><img src="images/03.10.jpg" style="width: 400px; height: 300px;"/></center>


Which statement is true?

- Observations with a large distance to the nearest MRT station have the highest leverage, because most of the observations have a short distance, so long distances are more extreme.

# Influence

Influence measures how much a model would change if each observation were left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that point, compared to running a linear regression on the whole dataset.

The standard metric for influence is Cook's distance, which calculates influence based on the residual size and the leverage of the point.
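To make the link between the two descriptions concrete, the sketch below computes Cook's distance in two ways on synthetic data (data and names are assumptions for illustration): once from the residual and leverage of each point, and once by literally refitting the model with each observation left out and measuring how far the fitted values move.

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=30)
y = 5.0 - 0.8 * x + rng.normal(0, 1.0, size=30)

n, p = len(x), 2  # p = number of coefficients (intercept and slope)
X = np.column_stack([np.ones(n), x])


def fit(design, response):
    """Return least-squares coefficients for the given design matrix."""
    coeffs, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coeffs


# Full-data fit, residuals, residual mean squared error, and leverage.
fitted = X @ fit(X, y)
residuals = y - fitted
mse = np.sum(residuals ** 2) / (n - p)
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance from residual size and leverage.
cooks_formula = residuals ** 2 / (p * mse) * leverage / (1 - leverage) ** 2

# Cook's distance from the leave-one-out definition: refit without
# observation i and measure the shift in all fitted values.
cooks_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    shift = fitted - X @ fit(X[keep], y[keep])
    cooks_loo[i] = np.sum(shift ** 2) / (p * mse)

print(np.max(np.abs(cooks_formula - cooks_loo)))
```

The printed difference should be at the level of floating-point error, which is why the closed form is used in practice instead of refitting the model once per observation.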

You can see the same model as last time: house price versus the square root of distance from the nearest MRT station in the Taiwan real estate dataset.

Guess which observations you think will have a high influence, then move the slider to find out.

<center><img src="images/03.11.jpg" style="width: 400px; height: 300px;"/></center>


Which statement is true?

- Observations far away from the trend line have high influence, because they have large residuals and are far away from other observations.

# Extracting leverage and influence

In the last few exercises, you explored which observations had the highest leverage and influence. Now you'll extract those values from the model.

`mdl_price_vs_dist` and `taiwan_real_estate` are available.

In [4]:
# # Create summary_info
# summary_info = mdl_price_vs_dist.get_influence().summary_frame()

# # Add the hat_diag column to taiwan_real_estate, name it leverage
# taiwan_real_estate["leverage"] = summary_info['hat_diag']

# # Sort taiwan_real_estate by leverage in descending order and print the head
# print(taiwan_real_estate.sort_values(["leverage"],ascending = False).head())

In [5]:
# # Create summary_info
# summary_info = mdl_price_vs_dist.get_influence().summary_frame()

# # Add the hat_diag column to taiwan_real_estate, name it leverage
# taiwan_real_estate["leverage"] = summary_info["hat_diag"]

# # Add the cooks_d column to taiwan_real_estate, name it cooks_dist
# taiwan_real_estate["cooks_dist"] = summary_info["cooks_d"]

# # Sort taiwan_real_estate by cooks_dist in descending order and print the head.
# print(taiwan_real_estate.sort_values(["cooks_dist"],ascending = False).head())
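Since `mdl_price_vs_dist` and `taiwan_real_estate` are not defined in these notes, here is a self-contained sketch of the same workflow on a made-up dataset. The DataFrame `houses`, its column names, and the coefficients used to simulate prices are all assumptions for illustration; only the `statsmodels` calls mirror the cells above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical stand-in for the real estate data; names and values are assumed.
rng = np.random.default_rng(5)
houses = pd.DataFrame({"sqrt_dist": rng.uniform(5, 80, size=50)})
houses["price"] = 17 - 0.18 * houses["sqrt_dist"] + rng.normal(0, 2, size=50)

# Fit the model and extract the per-observation influence summary.
mdl = ols("price ~ sqrt_dist", data=houses).fit()
summary_info = mdl.get_influence().summary_frame()

# Leverage is in the hat_diag column, Cook's distance in cooks_d.
houses["leverage"] = summary_info["hat_diag"]
houses["cooks_dist"] = summary_info["cooks_d"]

# Cross-check hat_diag against the simple-linear-regression leverage formula.
x = houses["sqrt_dist"].to_numpy()
manual_leverage = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Most influential observations first.
print(houses.sort_values("cooks_dist", ascending=False).head())
```

Sorting by `leverage` instead of `cooks_dist` would list the observations with the most extreme explanatory values, whether or not they have large residuals.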