# Module 1: Introduction to Scikit-Learn

## Section 2: Supervised Learning Algorithms

### Part 1: Linear Regression

In this part, we will explore Linear Regression, one of the fundamental supervised learning algorithms used for predicting continuous numeric values. Linear Regression models the relationship between independent variables (features) and a dependent variable (target) by fitting a linear equation to the data.

### 1.1 Understanding Linear Regression

Linear Regression assumes a linear relationship between the independent variables and the target variable. With $n$ independent variables, the equation of the linear regression model can be represented as:

$y = b_0 + b_1 * x_1 + b_2 * x_2 + \dots + b_n * x_n$


Where:

- $y$ is the target variable
- $x_1$, $x_2$, ..., $x_n$ are the independent variables (features)
- $b_0$ is the intercept, and $b_1$, $b_2$, ..., $b_n$ are the coefficients (slopes) of the linear equation

The goal of linear regression is to find the best-fit line that minimizes the difference between the predicted values and the actual values.
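
As a rough illustration of the equation above, the following sketch evaluates it with NumPy. The values of `b0`, `b`, and `X` are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
b0 = 1.0                      # intercept b_0
b = np.array([2.0, -0.5])     # one slope per feature (b_1, b_2)

# Two samples with two features each (x_1, x_2)
X = np.array([[1.0, 4.0],
              [3.0, 2.0]])

# y = b_0 + b_1 * x_1 + b_2 * x_2, computed for every row at once
y = b0 + X @ b
print(y)  # [1. 6.]
```

Each row of `X` is one sample, so the matrix product applies the same linear equation to all samples in a single step.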

Linear Regression makes certain assumptions about the data. It assumes that:

- There is a linear relationship between the independent variables and the target variable.
- There is no multicollinearity among the independent variables.
- The residuals (the differences between the predicted and actual values) follow a normal distribution.
- The residuals have constant variance (homoscedasticity).
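
Some of these assumptions can be checked numerically. The sketch below shows one rough way to do so on synthetic data. The data, the seed, and the idea of comparing the residual spread in two halves are assumptions made for illustration, not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this sketch): two independent features
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Multicollinearity: pairwise correlation between the two features
feature_corr = np.corrcoef(X, rowvar=False)[0, 1]

# Residuals of a fitted model (actual minus predicted)
model = LinearRegression().fit(X, y)
predictions = model.predict(X)
residuals = y - predictions

# Constant variance: compare the residual spread for low and high predictions
order = np.argsort(predictions)
spread_low = residuals[order[:100]].std()
spread_high = residuals[order[100:]].std()

print("Feature correlation:", feature_corr)
print("Residual mean:", residuals.mean())
print("Residual std (low / high predictions):", spread_low, spread_high)
```

A feature correlation near 0 and similar residual spreads in both halves are consistent with the assumptions; a histogram or Q-Q plot of `residuals` could be used to look at normality.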

### 1.2 Training and Evaluation

To train a Linear Regression model, we need a labeled dataset with the target variable and the corresponding feature values. The model learns the coefficients ($b_0$, $b_1$, $b_2$, ..., $b_n$) by minimizing the residual sum of squares (RSS) or, equivalently, the mean squared error (MSE) between the predicted and actual values.
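
To make the minimization concrete, the sketch below solves the least-squares problem directly with NumPy's `np.linalg.lstsq` on made-up data and compares the result with scikit-learn's LinearRegression. The data and the helper `rss` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: y = 4 + 1.5 * x1 - 2 * x2 plus Gaussian noise
X = rng.uniform(0, 10, size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=50)

# Add a column of ones so that the first coefficient plays the role of b0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the RSS
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def rss(c):
    """Residual sum of squares for a coefficient vector c (hypothetical helper)."""
    return np.sum((y - X_design @ c) ** 2)

# scikit-learn's LinearRegression should find the same coefficients
model = LinearRegression().fit(X, y)
print("lstsq:  ", coeffs)
print("sklearn:", model.intercept_, model.coef_)
print("RSS at the solution:", rss(coeffs))
```

Moving the coefficients away from the solution, for example `rss(coeffs + 0.1)`, gives a larger RSS.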

Once trained, we can evaluate the model's performance using evaluation metrics such as:

- Mean Squared Error (MSE) is a common metric used to evaluate the performance of regression models. It measures the average squared difference between the predicted values and the actual values.

    $\text{MSE} = \frac{1}{n} * \sum(y_{actual} - y_{predicted})^2$
    
- Mean Absolute Error (MAE) is a metric that measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to MSE.

    $\text{MAE} = \frac{1}{n} * \sum|y_{actual} - y_{predicted}|$

- Root Mean Squared Error (RMSE) is the square root of the MSE. It is a popular metric as it is in the same unit as the target variable, making it easier to interpret.

    $\text{RMSE} = \sqrt{MSE}$

- R-squared (coefficient of determination) is another evaluation metric for regression models that measures the proportion of the variance in the target variable that is predictable from the independent variables. A value of 1 indicates a perfect fit, and 0 indicates that the model explains none of the variance. On the training data of a least-squares fit with an intercept, R2 lies between 0 and 1; on new data it can be negative when the model performs worse than always predicting the mean.

    $\text{R2} = 1 - \frac{\sum(y_{actual} - y_{predicted})^2}{\sum(y_{actual} - y_{mean})^2}$
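
The formulas above can be checked by hand against scikit-learn. The sketch below uses four made-up values; the arrays `y_actual` and `y_predicted` are hypothetical and do not come from any of the examples.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_actual - y_predicted

# Direct translations of the four formulas
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
rmse = np.sqrt(mse)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)

print("MSE: ", mse, mean_squared_error(y_actual, y_predicted))
print("MAE: ", mae, mean_absolute_error(y_actual, y_predicted))
print("RMSE:", rmse)
print("R2:  ", r2, r2_score(y_actual, y_predicted))
```

Here the MSE is 0.375, the MAE is 0.5, and R2 is 0.925, and the hand-computed values agree with the scikit-learn functions.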


### 1.3 Implementing Linear Regression in Scikit-Learn

#### Example 1: One independent variable

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
# 100 x values between 0 and 10
X = np.linspace(0, 10, 100)  
# 100 y values on the line y = 2x + 5, plus Gaussian noise with mean 0 and standard deviation 2
y = 2 * X + 5 + np.random.normal(0, 2, 100)  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)

# Create a linear regression model
regressor = LinearRegression()
# Fit the model on the training data
regressor.fit(X_train, y_train)
# Make predictions on the test data
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

# Plot the original data points and the linear regression line
plt.scatter(X_train, y_train, label='Training data', color='blue')
plt.scatter(X_test, y_test, label='Test data', color='green')
plt.scatter(X_test, y_pred, label='Predicted test data', color='red', marker='x')
plt.plot(X, regressor.predict(X.reshape(-1, 1)), label='Linear regression line', color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Linear Regression Example')
plt.legend()
plt.grid(True)
plt.show()

The plot shows the scatter plot of the training and test data points in blue and green, respectively. The red 'x' markers represent the predicted values for the test data based on the trained linear regression model. The red line represents the linear regression line fitted to the entire X range using the trained model. The goal of the linear regression model is to learn a line that best fits the training data and predicts the target variable for new data points.
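
The fitted line is fully described by its slope and intercept, which scikit-learn exposes as the `coef_` and `intercept_` attributes of the fitted model. The sketch below regenerates the same synthetic data as Example 1 so that it can run on its own; the variable names `slope` and `intercept` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic data as Example 1: y = 2x + 5 plus Gaussian noise
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel() + 5 + np.random.normal(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression().fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is b0
slope = regressor.coef_[0]
intercept = regressor.intercept_
print("Estimated slope:", slope)
print("Estimated intercept:", intercept)
```

Because the data were generated from y = 2x + 5, the estimates should land near 2 and 5, with some deviation caused by the noise.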

The most common metrics used to evaluate a linear regression model are the Mean Squared Error (MSE) and the R-squared (R2) score:

MSE: Quantifies the average squared difference between the predicted values and the actual values. Its magnitude depends on the scale of the target variable, and it can take any non-negative value:
- A smaller MSE (close to 0) indicates that the model's predictions are closer to the actual values, and therefore, the model performs better.
- A larger MSE means the model's predictions are farther from the actual values and the model is less accurate.

R2: Proportion of the variance in the dependent variable (target) that is explained by the model. It indicates how well the model fits the data, and we expect it to be as close to 1 as possible.
- R2 = 0: The model does not explain any variance in the target variable, and it performs no better than predicting the mean of the target variable.
- R2 = 1: The model perfectly explains all the variance in the target variable, and it makes perfect predictions.
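
On held-out data the R2 score can also drop below 0, which happens when the predictions are worse than always predicting the mean. The sketch below illustrates the three cases with `r2_score` on three made-up target values.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, for illustration only
y_true = np.array([1.0, 2.0, 3.0])

# Perfect predictions
r2_perfect = r2_score(y_true, y_true)
# Always predicting the mean of the target
r2_mean = r2_score(y_true, np.full(3, y_true.mean()))
# Predictions in reversed order: worse than predicting the mean
r2_reversed = r2_score(y_true, np.array([3.0, 2.0, 1.0]))

print(r2_perfect, r2_mean, r2_reversed)  # 1.0 0.0 -3.0
```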

In this case, the MSE is 2.488168969160713, which means that, on average, the squared difference between the predicted and actual values is around 2.49. This corresponds to an RMSE of about 1.58, which is small compared with the range of the target (roughly 5 to 25) and in line with the noise (standard deviation 2) added when generating the data.

An R2 score of 0.9325280805909615 indicates that approximately 93.25% of the variance in the target variable is explained by the single feature in the model. This is quite high, suggesting that the model captures the relationship between the feature and the target well.

Overall, this is a strong indication that the linear regression model is a good fit for the data and is making accurate predictions.

#### Example 2: Two independent variables

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data with two independent variables
np.random.seed(42)
# 100 x1 values between 0 and 10
x1 = np.linspace(0, 10, 100)
# 100 x2 values between -5 and 5 (evenly spaced like x1, so x2 = x1 - 5)
x2 = np.linspace(-5, 5, 100)
# Combine the two independent variables into a 2D array as X
X = np.column_stack((x1, x2))
# Generate y as a linear combination of x1 and x2, plus Gaussian noise with mean 0 and standard deviation 2
y = 2 * x1 + 3 * x2 + np.random.normal(0, 2, 100)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
regressor = LinearRegression()
# Fit the model on the training data
regressor.fit(X_train, y_train)
# Make predictions on the test data
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

# Plot the actual and predicted Y values in 3D
fig, ax = plt.subplots(subplot_kw={'projection': '3d'}, figsize=(8,6))
ax.scatter(X_test[:, 0], X_test[:, 1], y_test, c='green', marker='o', label='Actual values')
ax.scatter(X_test[:, 0], X_test[:, 1], y_pred, c='red', marker='x', label='Predicted values')
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Y')
ax.set_title('Actual and Predicted Y values')
ax.view_init(elev=20, azim=-75)
ax.legend()
plt.show()

In this code, we generate two independent variables (x1, x2) and combine them into a 2D array X. The target variable y is generated using a linear combination of these two variables, with added noise. We then split the data into training and test sets.

Next, we create a linear regression model and fit it to the training data with multiple independent variables (X_train). We make predictions on the test data with multiple independent variables (X_test) and obtain corresponding predicted target values (y_pred).

The plot shows the actual and predicted values of the target variable (Y) in a 3D space, with two independent variables (X1 and X2). The green points represent the actual Y values for the test set while the red "x" markers represent the predicted Y values for the same test set.

In this case, the MSE is 2.4881689691607116, which means that, on average, the squared difference between the predicted and actual values is around 2.49. This matches the MSE of Example 1 up to floating-point rounding, which is expected: x2 is simply x1 shifted by 5, so it carries no additional information, and the noise values and the train/test split are the same as before.

An R2 score of 0.9882143330718749 indicates that approximately 98.82% of the variance in the target variable is explained by the independent variables in the model. The R2 is higher than in Example 1 because the target now varies over a wider range while the errors stay the same size.

Overall, this is a strong indication that the linear regression model is a good fit for the data and is making accurate predictions.
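
Because x1 and x2 in Example 2 are both evenly spaced grids of 100 points, they are perfectly correlated, which is at odds with the no-multicollinearity assumption listed earlier. The sketch below is one way to check this; the independently drawn `x2_independent` is a hypothetical alternative and is not part of the example above.

```python
import numpy as np

# Same feature construction as Example 2
x1 = np.linspace(0, 10, 100)
x2 = np.linspace(-5, 5, 100)

# Correlation between the two features (1.0 means perfectly collinear)
corr = np.corrcoef(x1, x2)[0, 1]

# Rank of the design matrix with an intercept column:
# three columns, but only two of them are linearly independent
design = np.column_stack((np.ones_like(x1), x1, x2))
rank = np.linalg.matrix_rank(design)

# Hypothetical alternative: draw x2 independently of x1
rng = np.random.default_rng(0)
x2_independent = rng.uniform(-5, 5, 100)
corr_independent = np.corrcoef(x1, x2_independent)[0, 1]

print("Correlation in Example 2:", corr)
print("Rank of design matrix:", rank)
print("Correlation with independent x2:", corr_independent)
```

With perfectly collinear features the predictions are still well defined, but the individual coefficients are not uniquely determined, so they should not be interpreted one by one.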

#### Example 3: Multiple independent variables

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

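# Preprocess the data - standardize the features
# (in practice, fit the scaler on the training split only so that no test-set statistics leak into preprocessing)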
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
regressor = LinearRegression()
# Fit the model on the training data
regressor.fit(X_train, y_train)
# Make predictions on the test data
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

# Compute the residuals (differences between y_actual and y_predicted)
residuals = y_test - y_pred
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the actual values against the predicted values (x-axis: predicted, y-axis: actual)
ax1.scatter(y_pred, y_test, color='b', alpha=0.6)
# Dashed diagonal y = x: points on this line are predicted exactly
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], linestyle='--', color='r')
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Actual Values')
ax1.set_title('Predicted vs. Actual Values')
ax1.grid(True)
# Plot the residuals against the predicted values
ax2.scatter(y_pred, residuals, color='b', alpha=0.6)
ax2.axhline(y=0, color='r', linestyle='--')
ax2.set_xlabel('Predicted Values')
ax2.set_ylabel('Residuals')
ax2.set_title('Residual Plot: Residuals vs. Predicted Values')
ax2.grid(True)
plt.tight_layout()
plt.show()

The California housing dataset is loaded using fetch_california_housing from scikit-learn. It contains features describing the housing in each California district and the corresponding target variable (median house value). The features are standardized using StandardScaler from scikit-learn so that all features have zero mean and unit variance. The data is split into training and test sets using train_test_split from scikit-learn. The linear regression model is trained on the training data using the fit method.
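
One common way to combine standardization and regression is a scikit-learn pipeline, which fits the scaler on the training split only. The sketch below is an illustration under assumptions rather than the method used in the example above: it uses synthetic data from `make_regression` as a stand-in so that it runs without downloading the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the California housing data)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting the pipeline fits the scaler on X_train only, then the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R2 on the test set

# For comparison: ordinary least squares without scaling
plain = LinearRegression().fit(X_train, y_train)
same_predictions = np.allclose(model.predict(X_test), plain.predict(X_test))

print("Test R2:", r2)
print("Same predictions with and without scaling:", same_predictions)
```

For ordinary least squares, standardizing the features does not change the predictions; it mainly makes the coefficients comparable across features and matters more for regularized models.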

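In this example, we cannot draw a 2D or 3D plot of the data because there are more than two independent variables. Instead, we can plot the actual values against the predicted values, and the residuals against the predicted values.

- In the first plot, actual vs. predicted values, we would expect a strong correlation, with the points lying close to the dashed diagonal. Instead, the points are spread widely around it, which shows that the model's predictions are not very accurate.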

- In the second plot, predicted vs. residuals, positive values for the residuals (on the y-axis) mean the prediction was too low, and negative values mean the prediction was too high; 0 means the guess was exactly correct. Ideally, we want to see more points clustered around y=0.

Finally, the MSE value of 0.5558915986952444 means that, on average, the squared difference between the predicted and actual values is approximately 0.56. The R-squared value of 0.5757877060324508 means that the regression model explains approximately 57.58% of the variance in the target variable, leaving a large share unexplained.

Overall, the model's performance is moderate, as indicated by the MSE and R-squared values.
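
An MSE is easier to interpret after converting it back to the units of the target. The short sketch below does this for the value reported above, assuming the target is expressed in units of $100,000 as described in the scikit-learn documentation of this dataset.

```python
import math

mse = 0.5558915986952444         # MSE reported above
rmse = math.sqrt(mse)            # same units as the target
rmse_dollars = rmse * 100_000    # assumes the target is in units of $100,000

print(f"RMSE: {rmse:.3f}")
print(f"RMSE in dollars: {rmse_dollars:,.0f}")
```

Under that assumption, a typical prediction error is on the order of $75,000, which supports reading the fit as moderate.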

### 1.4 Summary

Linear Regression is a widely used supervised learning algorithm for modeling the relationship between a dependent variable (target) and one or more independent variables (features). It assumes a linear relationship between the features and the target, and aims to find the best-fit line that minimizes the difference between the predicted and actual values.

To evaluate a Linear Regression model:
- Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual values.
- Mean Absolute Error (MAE) measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to MSE.
- Root Mean Squared Error (RMSE) is the square root of the MSE. It is a popular metric as it is in the same unit as the target variable, making it easier to interpret.
- R-squared (R2), also known as the coefficient of determination, measures the proportion of the variance in the target variable that is predictable from the independent variables. A value of 1 indicates a perfect fit, and 0 indicates that the model explains none of the variance; on unseen data, R2 can also be negative when the model performs worse than predicting the mean.

In summary, MSE and RMSE penalize larger errors more severely, while MAE is more robust to outliers. R2 measures how well the model fits the data, with values closer to 1 indicating a better fit.
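
The different sensitivity to outliers can be seen on a small made-up example. In the sketch below, the arrays are hypothetical: the first set of predictions is off by 0.5 everywhere, and the second set has a single prediction that is off by 10.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values, for illustration only
y_true = np.zeros(5)
y_pred_small = np.full(5, 0.5)                           # every prediction off by 0.5
y_pred_outlier = np.array([0.5, 0.5, 0.5, 0.5, 10.0])    # one prediction off by 10

mae_small = mean_absolute_error(y_true, y_pred_small)
mse_small = mean_squared_error(y_true, y_pred_small)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

print("Without outlier: MAE =", mae_small, " MSE =", mse_small)
print("With outlier:    MAE =", mae_outlier, " MSE =", mse_outlier)
```

The single large error raises the MAE from 0.5 to 2.4, but it raises the MSE from 0.25 to 20.2, because the error is squared before averaging.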

To evaluate a Linear Regression model graphically, we can also look at the actual vs. predicted plot and the predicted vs. residuals plot:

- Actual vs. Predicted Plot:<br>
    The actual vs. predicted plot is a scatter plot where the actual target values are plotted on the y-axis, and the corresponding predicted values are plotted on the x-axis. Each data point represents a specific sample in the test dataset. The plot helps to visualize how well the model predictions align with the actual values. Ideally, the points should lie close to the diagonal line (y = x), indicating a good fit of the model. Deviations from the diagonal line suggest discrepancies between the predicted and actual values.

- Predicted vs. Residuals Plot:<br>
    The predicted vs. residuals plot is also a scatter plot, where the predicted target values are plotted on the x-axis, and the residuals (differences between actual and predicted values) are plotted on the y-axis. This plot allows us to examine the relationship between the residuals and the predicted values. In an effective model, the residuals should be randomly scattered around the horizontal line at y = 0, indicating that the model has captured the underlying patterns in the data. Patterns or trends in the residuals indicate that the model is not performing optimally.
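
A pattern in the residuals can also be detected numerically. The sketch below fits a straight line to hypothetical data with a curved relationship (y = x^2) and compares the average residual at the edges and in the middle of the x range; the data and the thresholds used to define "edge" and "middle" are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a curved relationship: y = x^2
x = np.linspace(-3, 3, 61)
y = x ** 2

# Fit a straight line, which cannot capture the curvature
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Average residual at the edges and in the middle of the x range
edge_mean = residuals[np.abs(x) > 2].mean()
middle_mean = residuals[np.abs(x) < 1].mean()

print("Mean residual at the edges:", edge_mean)
print("Mean residual in the middle:", middle_mean)
```

The residuals are positive at the edges and negative in the middle, a systematic U-shaped pattern that signals that a linear model is missing structure in the data.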