# R²

'In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).' - https://en.wikipedia.org/wiki/Coefficient_of_determination

Variance is a statistical measure that quantifies the spread of a set of data points: how far the individual points deviate from the mean of the dataset. A higher variance means the points are more widely spread, while a lower variance means they cluster more closely around the mean.
- Variance is calculated by averaging the squared differences between each data point and the mean. (This is the population variance; the sample variance divides by n - 1 instead of n.)
- It is a measure of variability, meaning it shows how much the data varies around its central tendency.
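
As a small illustration of that definition, the sketch below computes variance by hand and compares it with NumPy's np.var, which divides by the number of points by default (ddof=0). The arrays tight and spread and the helper variance_by_hand are made up for this example and are not part of the R² walkthrough.

```python
import numpy as np

# Hypothetical data: both arrays have a mean of 10, but different spread.
tight = np.array([9.0, 10.0, 11.0])
spread = np.array([0.0, 10.0, 20.0])

def variance_by_hand(values):
    # Average of the squared differences between each point and the mean.
    return np.mean((values - np.mean(values)) ** 2)

tight_var = variance_by_hand(tight)
spread_var = variance_by_hand(spread)

print("tight variance:", tight_var)
print("spread variance:", spread_var)

# np.var uses the same definition by default (ddof=0).
print(np.isclose(tight_var, np.var(tight)), np.isclose(spread_var, np.var(spread)))
```

The more widely spread array should report the larger variance.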

## Calculating R² manually

In [1]:
import numpy as np

### Example data
Here we create one NumPy array holding the true values and another holding the estimates from some machine learning model.

In [2]:
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

### Step 1: Calculate the mean of the true values
This is the average value for the observed data points.

In [3]:
y_mean = np.mean(y_true)

y_mean

2.875

### Step 2: Calculate Total Sum of Squares (TSS)
TSS measures the total variation in the observed data. It is the **sum** of squared differences between each true value and the mean. Remember that variance is the **average** of those squared differences, so TSS equals the variance multiplied by the number of data points.

In [4]:
total_sum_of_squares = np.sum((y_true - y_mean) ** 2)

total_sum_of_squares

29.1875
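
The link between TSS and variance can be checked directly. This sketch repeats the example array so that it runs on its own, and it assumes the population variance (np.var with its default ddof=0); the name variance_from_tss is introduced only for this check.

```python
import numpy as np

# Same true values as in the example data.
y_true = np.array([3, -0.5, 2, 7])

total_sum_of_squares = np.sum((y_true - np.mean(y_true)) ** 2)

# Dividing the sum by the number of points turns it into an average,
# which is the (population) variance.
variance_from_tss = total_sum_of_squares / len(y_true)

print("TSS / n:", variance_from_tss)
print("np.var: ", np.var(y_true))
```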

### Step 3: Calculate Residual Sum of Squares (RSS)
RSS measures the error of the predictions. It is the sum of squared differences between each true value and its prediction.

In [5]:
sum_of_error_squared = np.sum((y_true - y_pred) ** 2)

sum_of_error_squared

1.5

### Step 4: Calculate R-squared (R²)
The formula for calculating R² is:
- R² = 1 - (RSS / TSS)

It represents the proportion of variance explained by the model. If the error is zero, R² equals 1. Larger errors result in lower R² values. An R² of 0 indicates that the model is no better than always predicting the mean: it explains none of the variability in the data. A model that does worse than predicting the mean has a negative R².

In [6]:
r2_manual = 1 - (sum_of_error_squared / total_sum_of_squares)

print("manual R²:", r2_manual)

manual R²: 0.9486081370449679
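
To see how R² behaves at its boundary cases, the steps can be wrapped in a small helper and applied to different predictions. The function name r2_from_steps and the extra prediction arrays are hypothetical, introduced only for this illustration: perfect predictions, mean-only predictions, and a deliberately poor constant guess.

```python
import numpy as np

def r2_from_steps(y_true, y_pred):
    # Hypothetical helper that repeats Steps 1-4 in one place.
    y_mean = np.mean(y_true)
    tss = np.sum((y_true - y_mean) ** 2)
    rss = np.sum((y_true - y_pred) ** 2)
    return 1 - (rss / tss)

y_true = np.array([3, -0.5, 2, 7])

# Perfect predictions: no error, so R² is 1.
r2_perfect = r2_from_steps(y_true, y_true)

# Always predicting the mean: RSS equals TSS, so R² is 0.
y_pred_mean = np.full(len(y_true), np.mean(y_true))
r2_mean = r2_from_steps(y_true, y_pred_mean)

# Made-up predictions that are worse than the mean:
# RSS exceeds TSS, so R² is negative.
y_pred_bad = np.array([7, 7, 7, 7])
r2_bad = r2_from_steps(y_true, y_pred_bad)

print("perfect:", r2_perfect)
print("mean-only:", r2_mean)
print("worse than mean:", r2_bad)
```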


## Using scikit-learn's built-in r2_score function
We don't need to calculate R² manually when evaluating our machine learning models. scikit-learn's r2_score function computes it directly from the true values and the predictions.

In [7]:
from sklearn.metrics import r2_score

In [8]:
r2_sklearn = r2_score(y_true, y_pred)
print("scikit-learn R²:", r2_sklearn)

scikit-learn R²: 0.9486081370449679
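
One detail worth checking when calling r2_score is argument order: the true values come first and the predictions second. The sketch below swaps the two to show that the result changes, since the total sum of squares is then computed around the mean of the predictions instead. The variable names r2_correct and r2_swapped are introduced only for this comparison.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Intended call: true values first, predictions second.
r2_correct = r2_score(y_true, y_pred)

# Swapped call: the predictions are treated as the ground truth.
r2_swapped = r2_score(y_pred, y_true)

print("correct order:", r2_correct)
print("swapped order:", r2_swapped)
```

The two values are close for this data but not equal, so a swapped call can go unnoticed unless it is checked.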
