# Unit 2 R-squared Metric

Hello! Today, we will learn about the **R-squared** metric, a vital measure in machine learning. Have you ever wondered how to measure how well your model fits the real data? That’s exactly what the R-squared metric helps with! By the end of this lesson, you'll understand what R-squared is, why it’s essential, and how to calculate it using Python.

-----

## Understanding R-squared

### What is R-squared ($R^2$)?

Imagine you're predicting how the height of children changes with age. R-squared, also called the **coefficient of determination**, helps us understand how well our model explains this variability in height based on age.

The formula for R-squared is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:

  * $SS_{res}$ is the sum of the squares of the residuals (errors): $SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
  * $SS_{tot}$ is the total sum of the squares (proportional to the variance of the data): $SS_{tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2$
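
The formula can be traced step by step with NumPy. The heights and predictions below are made-up numbers chosen for easy arithmetic, and `y_hat` stands in for the output of some hypothetical model; treat this as a minimal sketch rather than a fitted model.

```python
import numpy as np

# Hypothetical data: actual heights (cm) and a model's predicted heights (cm)
y = np.array([110.0, 120.0, 130.0, 140.0])
y_hat = np.array([112.0, 119.0, 131.0, 138.0])

ss_res = np.sum((y - y_hat) ** 2)      # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"SS_res = {ss_res}, SS_tot = {ss_tot}, R^2 = {r2_manual:.2f}")
# SS_res = 10.0, SS_tot = 500.0, R^2 = 0.98
```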

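R-squared tells us what proportion of the variance in the dependent variable (e.g., height) is predictable from the independent variable (e.g., age). An R-squared value is at most 1 and usually falls between 0 and 1; it can drop below 0 when a model's predictions are worse than always predicting the mean. The two reference points are: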

  * $R^2 = 0$: The model explains none of the variability.
  * $R^2 = 1$: The model explains all the variability.
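
Both reference points can be reproduced with `r2_score`. This is a small sketch using sample values; the "model" for the $R^2 = 0$ case simply predicts the mean of the actual values everywhere, which makes $SS_{res}$ equal to $SS_{tot}$.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])

# Predicting the mean everywhere makes SS_res equal SS_tot, so R^2 is 0
mean_baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, mean_baseline)

# Predicting every value exactly makes SS_res zero, so R^2 is 1
r2_perfect = r2_score(y_true, y_true)

print(f"Mean baseline: {r2_baseline:.1f}, perfect predictions: {r2_perfect:.1f}")
```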

### Interpreting R-squared

### Why is R-squared Important?

Think about predicting someone’s height based on their age. If your predictions are very close to the actual heights, your model does a good job. If predictions are off, your model needs improvement. R-squared gives you a single number to show how well your model performs.

Higher R-squared values mean the model explains more of the variability of the target variable. For instance, a high R-squared value in a model predicting house prices means that inputs like square footage and number of bedrooms account for most of the variation in price.

If your model has an R-squared of 0.85, it tells you 85% of the variance in house prices is explained by your model.

-----

## Calculating R-squared in Python

Here’s how to calculate R-squared using Python. Let’s take a look at the code snippet first and explain it step-by-step.

```python
from sklearn.metrics import r2_score
import numpy as np

# Sample regression dataset: true values and the model's predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2:.3f}")  # R-squared: 0.949
```

  * **Importing Libraries**: First, we import the function `r2_score` from the `sklearn.metrics` module, along with NumPy for the arrays. `r2_score` makes calculating R-squared straightforward.
  * **Defining the Data**: `y_true` holds the actual values and `y_pred` holds the model's predictions.
  * **Calculating R-squared**: We call `r2_score` with `y_true` first and `y_pred` second to calculate the R-squared value.
  * **Displaying the Result**: We print out the R-squared value.
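
One detail worth checking when calling `r2_score` is the argument order. R-squared is not symmetric, because $SS_{tot}$ is computed from whichever array is passed first. The sketch below reuses the lesson's arrays; `r2_swapped` is only there to illustrate the mistake.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2_correct = r2_score(y_true, y_pred)   # true values first
r2_swapped = r2_score(y_pred, y_true)   # arguments reversed by mistake

print(f"Correct order: {r2_correct:.3f}")  # Correct order: 0.949
print(f"Swapped order: {r2_swapped:.3f}")  # Swapped order: 0.957
```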

-----

## R-squared vs. Mean Squared Error (MSE)

While both R-squared and Mean Squared Error (MSE) are used to evaluate the performance of a regression model, they provide different insights:

  * **R-squared**: This metric provides a **relative measure** of how well the model's predictions match the actual data. It tells us the proportion of variability in the dependent variable that can be explained by the model. R-squared is useful when you want to understand the explanatory power of your model.
  * **MSE**: This metric provides an **absolute measure** of the average squared difference between the predicted and actual values. It focuses on the magnitude of prediction errors, regardless of the variability in the data. MSE is useful when you want to understand the size of your prediction errors; note that it is expressed in the squared units of the target variable.

In summary, R-squared is important because it gives a normalized measure of model performance that accounts for the variability in the data, whereas MSE provides a direct measure of prediction error magnitude. Both metrics together can offer a comprehensive view of your model's accuracy and explanatory power.
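
One way to see the relative-versus-absolute distinction is to change the units of the target. In the sketch below, the same made-up house prices are expressed in thousands of dollars and in dollars (the units are an assumption for illustration). MSE changes by a factor of a million, while R-squared stays the same.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Same hypothetical prices, in thousands of dollars and in dollars
y_true_k = np.array([300.0, 400.0, 500.0, 600.0])
y_pred_k = np.array([310.0, 390.0, 505.0, 595.0])
y_true_usd = y_true_k * 1000
y_pred_usd = y_pred_k * 1000

mse_k = mean_squared_error(y_true_k, y_pred_k)
mse_usd = mean_squared_error(y_true_usd, y_pred_usd)
r2_k = r2_score(y_true_k, y_pred_k)
r2_usd = r2_score(y_true_usd, y_pred_usd)

print(f"MSE: {mse_k:.1f} (thousands) vs {mse_usd:.1f} (dollars)")
print(f"R-squared: {r2_k:.3f} vs {r2_usd:.3f}")
```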

-----

## Lesson Summary

Great job! You’ve learned what R-squared is and how it helps measure the performance of a regression model. You now know how to interpret the R-squared value: a higher value means a better fit. You also know how to calculate R-squared using Python.

Now it's time to practice! You'll get hands-on experience calculating the R-squared metric with different datasets. This will solidify your understanding and let you apply what you've learned. Happy coding!

## Update Predictions for R-Squared Calculation

Great progress, Space Explorer!

Now, let's tweak the `pred_prices` in the starter code to see how it affects the R-squared value. Let's make `pred_prices` closer to `true_prices` and see what happens to the metric.

Keep exploring!

```python
from sklearn.metrics import r2_score
import numpy as np

# Predicting house prices based on various features
true_prices = np.array([350000, 460000, 580000, 610000])
pred_prices = np.array([330000, 440000, 600000, 615000])  # TODO: adjust these values to be closer to the true prices

# Calculate and print R-squared value
r_squared = r2_score(true_prices, pred_prices)
print(f"R-squared: {r_squared}")

```

```python
from sklearn.metrics import r2_score
import numpy as np

# Predicting house prices based on various features
true_prices = np.array([350000, 460000, 580000, 610000])
# Adjust these values to be closer to the true prices
pred_prices = np.array([348000, 462000, 579000, 611000])

# Calculate and print R-squared value
r_squared = r2_score(true_prices, pred_prices)
print(f"R-squared: {r_squared}")
```
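
To make the effect of the adjustment visible, the two sets of predictions can be scored side by side. This sketch reuses the values from the starter code and from the adjusted version above; the variable names `starter_preds` and `closer_preds` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

true_prices = np.array([350000, 460000, 580000, 610000])
starter_preds = np.array([330000, 440000, 600000, 615000])  # from the starter code
closer_preds = np.array([348000, 462000, 579000, 611000])   # adjusted values

r2_starter = r2_score(true_prices, starter_preds)
r2_closer = r2_score(true_prices, closer_preds)

print(f"Starter predictions: {r2_starter:.4f}")  # Starter predictions: 0.9712
print(f"Closer predictions:  {r2_closer:.4f}")   # Closer predictions:  0.9998
```

Moving the predictions closer to the true prices shrinks $SS_{res}$, so R-squared moves toward 1.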

## Calculating R-squared for House Prices

Alright, Space Wanderer! Let's test your skills. Fill in the missing pieces to calculate the R-squared value for predicting house prices.

```python
# TODO: import a correct function for R-squared

# Predicting house prices
true_prices = [300000, 420000, 500000, 600000]
predicted_prices = [310000, 430000, 490000, 610000]

# TODO: Calculate and print R-squared

```

To calculate R-squared, you'll need the `r2_score` function from `sklearn.metrics`. The completed code below also computes MSE with `mean_squared_error`, so the two metrics can be compared:

```python
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Predicting house prices
true_prices = [300000, 420000, 500000, 600000]
predicted_prices = [310000, 430000, 490000, 610000]

# Calculate and print R-squared
r_squared = r2_score(true_prices, predicted_prices)
print(f"R-squared: {r_squared:.4f}")

# Calculate and print MSE
mse = mean_squared_error(true_prices, predicted_prices)
print(f"Mean Squared Error (MSE): {mse:.2f}")
```

### R-squared vs. Mean Squared Error (MSE)

Both R-squared and MSE are metrics used to evaluate the performance of regression models, but they provide different insights:

**R-squared (Coefficient of Determination)**

  * **Definition:** R-squared represents the proportion of the variance in the dependent variable (target) that is predictable from the independent variables (features). In simpler terms, it tells you how well your model explains the variability of the data.
  * **Range:** It usually falls between 0 and 1, and it can be negative when a model does worse than always predicting the mean.
      * An R-squared of 0 indicates that the model explains none of the variability of the response data around its mean.
      * An R-squared of 1 indicates that the model explains all the variability of the response data around its mean.
  * **Interpretation:** A higher R-squared value generally indicates a better fit for the model. For example, an R-squared of 0.80 means that 80% of the variation in the dependent variable can be explained by the independent variables in the model.
  * **Advantages:**
      * Easy to interpret as a proportion.
      * Provides a sense of the "goodness of fit" of the model.
  * **Disadvantages:**
      * Can increase with the addition of more independent variables, even if those variables are not actually improving the model's predictive power (this is why adjusted R-squared is sometimes preferred).
      * Does not tell you if your model is biased or if the predictions are accurate in absolute terms.
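
Adjusted R-squared, mentioned above, can be computed from `r2_score` with the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ the number of features. The helper `adjusted_r2` below is a hypothetical function written for this sketch, and `n_features=1` is an assumption, since the exercise does not say how many features produced the predictions.

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("Need more samples than n_features + 1")
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

true_prices = [300000, 420000, 500000, 600000]
predicted_prices = [310000, 430000, 490000, 610000]

# n_features=1 is assumed purely for illustration
r2_plain = r2_score(true_prices, predicted_prices)
r2_adj = adjusted_r2(true_prices, predicted_prices, n_features=1)

print(f"R-squared:          {r2_plain:.4f}")
print(f"Adjusted R-squared: {r2_adj:.4f}")
```

The adjusted value is never higher than the plain one, and the gap widens as more features are added for the same number of samples.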

**Mean Squared Error (MSE)**

  * **Definition:** MSE is the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. Each error is the vertical distance between the actual data point and the regression line.
  * **Formula:** $MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
      * $Y_i$ is the actual value
      * $\hat{Y}_i$ is the predicted value
      * $n$ is the number of data points
  * **Range:** It ranges from 0 to infinity.
      * A MSE of 0 means the model has no error, i.e., perfect predictions.
      * Larger MSE values indicate larger errors.
  * **Interpretation:** MSE gives you a concrete measure of the average magnitude of the errors. The units of MSE are the square of the units of the target variable, which can sometimes make it harder to interpret directly in the context of the original data. (Root Mean Squared Error (RMSE) is often used for easier interpretation as it's in the same units as the target variable).
  * **Advantages:**
      * Penalizes larger errors more heavily due to the squaring of the differences, which can be desirable in many situations.
      * Provides an absolute measure of error.
  * **Disadvantages:**
      * Sensitive to outliers due to the squaring of errors.
      * The units are squared, which can make it less intuitive to interpret compared to the original units of the target variable.
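
RMSE, mentioned above as the easier-to-read variant, is just the square root of MSE. A minimal sketch with the house-price values from this exercise, assuming the prices are in dollars:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

true_prices = [300000, 420000, 500000, 600000]
predicted_prices = [310000, 430000, 490000, 610000]

mse = mean_squared_error(true_prices, predicted_prices)
rmse = np.sqrt(mse)  # back in the target's own units

print(f"MSE:  {mse:.2f}")   # MSE:  100000000.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 10000.00
```

Every prediction here is off by exactly 10,000, so an RMSE of 10,000 matches the typical error far more readably than an MSE of 100,000,000.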

**Key Difference and When to Use Which:**

  * **R-squared** tells you the **proportion of variance explained** by your model relative to the total variance. It's a relative measure of fit.
  * **MSE** (or RMSE) tells you the **average magnitude of the prediction errors** in the units squared (or original units for RMSE). It's an absolute measure of error.

You would typically use both metrics to get a comprehensive understanding of your model's performance. R-squared helps you understand how much of the target variable's variability your model captures, while MSE (or RMSE) helps you understand the typical size of the errors your model makes.

## Compare R-squared to MSE

Alright, Space Wanderer! Let's test your skills. Fill in the missing pieces to calculate both MSE and R-squared for each set of predicted house prices.

```python
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Actual house prices
y_true = np.array([300, 400, 500, 600])

# Different sets of predicted house prices
predictions = {
    "Perfect Predictions": np.array([300, 400, 500, 600]),
    "Good Predictions": np.array([310, 390, 505, 595]),
    "Poor Predictions": np.array([200, 450, 480, 610]),
}

for description, y_pred in predictions.items():
    # TODO: calculate and print both MSE and R-squared for all the predictions
    pass  # placeholder so the loop is valid Python until the TODO is filled in

```

All right, Space Wanderer! Let's fill in those missing pieces and see how R-squared and MSE behave with different prediction scenarios.

```python
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Actual house prices
y_true = np.array([300, 400, 500, 600])

# Different sets of predicted house prices
predictions = {
    "Perfect Predictions": np.array([300, 400, 500, 600]),
    "Good Predictions": np.array([310, 390, 505, 595]),
    "Poor Predictions": np.array([200, 450, 480, 610]),
}

for description, y_pred in predictions.items():
    # Calculate MSE
    mse = mean_squared_error(y_true, y_pred)
    # Calculate R-squared
    r_squared = r2_score(y_true, y_pred)

    print(f"--- {description} ---")
    print(f"  Mean Squared Error (MSE): {mse:.2f}")
    print(f"  R-squared: {r_squared:.4f}")
    print("-" * (len(description) + 8)) # Separator for readability
```

### Observations and Comparison:

Let's run through the results and reinforce the comparison between R-squared and MSE:

1.  **Perfect Predictions:**

      * **MSE will be 0:** This is because there is no difference between the true and predicted values ($Y_i - \hat{Y}_i = 0$ for all $i$). MSE directly measures the average squared error, so a perfect prediction yields zero error.
      * **R-squared will be 1:** A perfect model explains 100% of the variance in the true prices. This is the ideal scenario for R-squared.

2.  **Good Predictions:**

      * **MSE will be a small positive number:** There are some errors, but they are relatively small. MSE quantifies the average magnitude of these squared errors.
      * **R-squared will be close to 1 (but less than 1):** The model still explains a large proportion of the variance in house prices, indicating a strong fit, but not perfect.

3.  **Poor Predictions:**

      * **MSE will be a larger positive number:** The errors are significantly larger, leading to a higher average squared error.
      * **R-squared will be noticeably lower:** For these predictions it comes out to 0.74, so the model still explains part of the variance in the true prices, but far less than the good predictions do. With even worse predictions, R-squared can turn negative. That happens when the model performs worse than a simple horizontal line at the mean of the actual values, meaning its predictions are worse than simply predicting the average house price for every instance.
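
A negative R-squared is easy to provoke. In the sketch below, `poor_preds` is the "Poor Predictions" set from the exercise, while `reversed_preds` is a made-up set that trends in the wrong direction, added only to illustrate the negative case.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([300, 400, 500, 600])

poor_preds = np.array([200, 450, 480, 610])      # from the exercise
reversed_preds = np.array([600, 500, 400, 300])  # hypothetical, wrong direction

r2_poor = r2_score(y_true, poor_preds)
r2_reversed = r2_score(y_true, reversed_preds)

print(f"Poor predictions:     {r2_poor:.2f}")      # Poor predictions:     0.74
print(f"Reversed predictions: {r2_reversed:.2f}")  # Reversed predictions: -3.00
```

For the reversed predictions $SS_{res}$ is four times $SS_{tot}$, so $R^2 = 1 - 4 = -3$.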

**Summary of Comparison through this example:**

  * **MSE provides an absolute measure of error.** It quantifies *how much* error your model has, with smaller values indicating better performance. Its value is directly related to the scale of your target variable.
  * **R-squared provides a relative measure of how well your model fits the variance in the data.** It tells you *what proportion* of the variability in the target variable is explained by your model, ranging from 0 to 1 (with potential for negative values for very poor models).

In essence, MSE tells you about the *accuracy* of your predictions in the units of your data (squared), while R-squared tells you about the *explanatory power* of your model.