# Linear Regression

## Problem Type
**Linear Regression** is primarily used for:
- **Regression** problems
- **Supervised** learning

### How Linear Regression Works
- **Assumes a linear relationship** between the input variables (features) and the output variable (target).
- **Fits a line** (in simple linear regression) or a hyperplane (in multiple linear regression) to minimize the difference between actual and predicted values.
- **Uses the Ordinary Least Squares (OLS) method** to minimize the sum of squared residuals (differences between observed and predicted values).
- **Calculates coefficients (weights)** for each feature to determine the best-fit line/hyperplane.
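
As a minimal sketch of the OLS step described above, the coefficients can be recovered with NumPy's least-squares solver. The synthetic data, the true coefficients (3, 2, -1), and the names `X_demo`, `y_demo`, `X_design`, and `beta` are illustrative assumptions, not part of the housing example.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve min over beta of ||X_design @ beta - y||^2, the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

residuals = y_demo - X_design @ beta
print("intercept and weights:", beta.round(2))
print("sum of squared residuals:", float(residuals @ residuals))
```

The recovered values should land close to the coefficients used to generate the data, which is what "best-fit hyperplane" means in practice.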

### Key Tuning Parameters
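- **`fit_intercept`:**
  - Controls whether to calculate the intercept (`True` by default).
  - Setting it to `False` forces the fitted line or hyperplane through the origin, which is only appropriate when the target is expected to be zero when every feature is zero (or the data is already centered).
- **`n_jobs`:**
  - Specifies the number of CPU cores to use for computation (`None` by default).
  - `-1` uses all processors. For `LinearRegression`, scikit-learn documents a speedup only on sufficiently large problems with more than one target, so a single-target fit is unlikely to run noticeably faster.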
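
The effect of `fit_intercept` can be seen on a tiny noise-free example. This is a sketch on made-up data (`x_demo`, `y_demo` are illustrative assumptions), not on the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic 1-D data (illustrative assumption): y = 5 + 2*x, no noise
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 5.0 + 2.0 * x_demo.ravel()

with_intercept = LinearRegression(fit_intercept=True).fit(x_demo, y_demo)
through_origin = LinearRegression(fit_intercept=False).fit(x_demo, y_demo)

# With an intercept, the offset and slope are recovered
print(with_intercept.intercept_, with_intercept.coef_)
# Without one, the intercept is fixed at 0 and the slope is inflated to compensate
print(through_origin.intercept_, through_origin.coef_)
```

Because the data has a non-zero offset, forcing the line through the origin biases the slope, which is why `fit_intercept=True` is the safer default.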

### Pros vs Cons

| Pros                                                | Cons                                               |
|-----------------------------------------------------|----------------------------------------------------|
| Simple to understand and implement                  | Assumes a linear relationship between features and target |
| Interpretable coefficients                          | Sensitive to outliers, because errors are squared  |
| Computationally efficient for small to medium datasets | Cannot capture non-linear patterns without feature engineering |
| Works well when the relationship is approximately linear | Coefficients become unstable when features are highly correlated (multicollinearity) |
| Coefficients on standardized features indicate relative feature influence | Can overfit when there are many features; plain OLS has no regularization |

### Evaluation Metrics
- **Mean Absolute Error (MAE):**
  - **Description:** Average of absolute errors between predicted and actual values.
  - **Good Value:** Lower values indicate better model performance.
  - **Bad Value:** Higher values suggest poor model accuracy.
- **Mean Squared Error (MSE):**
  - **Description:** Average of squared errors between predicted and actual values.
  - **Good Value:** Lower values indicate fewer errors.
  - **Bad Value:** Higher values indicate greater errors; sensitive to outliers.
- **Root Mean Squared Error (RMSE):**
  - **Description:** Square root of the mean squared errors; gives error in the same units as the target variable.
  - **Good Value:** Lower values indicate better fit.
  - **Bad Value:** Higher values indicate poor model performance.
- **R-squared (R²):**
  - **Description:** Proportion of variance in the dependent variable that is predictable from the independent variables.
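  - **Good Value:** Closer to 1 suggests a good fit. The threshold depends on the domain, so a figure such as 0.9 is a rough guide rather than a rule.
  - **Bad Value:** Close to 0 means the model does little better than always predicting the mean; on held-out data R² can even be negative.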
- **Adjusted R-squared:**
  - **Description:** R² adjusted for the number of predictors in the model; penalizes adding irrelevant features.
  - **Good Value:** Higher values are better, and the value should stay close to R².
  - **Bad Value:** A large drop from R² suggests overfitting with unnecessary features.
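
The five metrics above can be computed by hand in a few lines, which also covers adjusted R² (computed manually here from the standard formula). The helper name `regression_metrics` and the toy values are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R² and adjusted R² for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes each extra predictor; requires n > n_features + 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Toy values (illustrative assumption)
metrics_demo = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0], n_features=1)
print(metrics_demo)
# mae=0.25, mse=0.125, rmse≈0.354, r2=0.975, adj_r2=0.9625
```

For the model in this notebook, the same helper could be called with the test targets, the test predictions, and the number of feature columns.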


In [None]:
from math import sqrt

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Load data
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseValue"] = housing.target
df.head()

In [None]:
# Drop rows at the maximum target value, where the dataset caps house values
df = df[df["MedHouseValue"] < df["MedHouseValue"].max()]
df.shape

In [None]:
X = df.drop("MedHouseValue", axis=1)
y = df["MedHouseValue"]

In [None]:
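# Train-test split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features with statistics from the training set only (avoids data leakage).
# Scaling does not change OLS predictions, but it makes coefficients comparable.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)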

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
linear_model = LinearRegression(fit_intercept=True, n_jobs=-1).fit(X_train, y_train)

## Model Evaluation

In [None]:
# score() returns R² here; this is the fit on the training data, not a generalization estimate
print(f"Training R² score: {linear_model.score(X_train, y_train)}")

In [None]:
predictors = X.columns
predictors

In [None]:
coef = pd.Series(linear_model.coef_, index=predictors).sort_values()
coef

In [None]:
y_pred = linear_model.predict(X_test)

In [None]:
df_pred_actual = pd.DataFrame({"predicted": y_pred, "actual": y_test})
df_pred_actual.head()

In [None]:
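fig, ax = plt.subplots(figsize=(12, 8))

# Points on the dashed diagonal are perfect predictions
ax.scatter(y_test, y_pred, alpha=0.3)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="black")
ax.set_xlabel("Actual median house value (y_test)")
ax.set_ylabel("Predicted median house value (y_pred)")
plt.show()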

In [None]:
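# A fixed random_state keeps the sample, and the plot built from it, reproducible
df_pred_actual_sample = df_pred_actual.sample(100, random_state=42)
df_pred_actual_sample = df_pred_actual_sample.reset_index(drop=True)
df_pred_actual_sample.head()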

In [None]:
plt.figure(figsize=(20, 10))

plt.plot(df_pred_actual_sample["predicted"], label="predicted")
plt.plot(df_pred_actual_sample["actual"], label="actual")

plt.ylabel("median_house_value")
plt.legend()
plt.show()

## Mean Squared Error (MSE)
- **Interpretation:** Measures the average squared difference between predictions and actual values, in squared units of the target variable. Lower MSE means predictions are closer to the actual values.
- **Good vs. Bad Values:** There's no universal threshold; lower is better, but the number only has meaning relative to the scale of the target. An MSE of 0.5 is large when the target ranges over 0-1 and negligible when it ranges over 1000-2000.
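
To illustrate the scale dependence, the sketch below evaluates the same predictions twice: once on a 0-1 target and once after multiplying everything by 1000. The data and the helper names `mse_of` and `r2_of` are illustrative assumptions.

```python
import numpy as np

def mse_of(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2_of(y_true, y_pred):
    """Coefficient of determination (R²)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic target in the 0-1 range with noisy predictions (illustrative assumption)
rng = np.random.default_rng(0)
y_small = rng.uniform(0.0, 1.0, size=100)
pred_small = y_small + rng.normal(scale=0.1, size=100)

# The same quantities rescaled by a factor of 1000, as a change of units would do
y_large, pred_large = 1000.0 * y_small, 1000.0 * pred_small

print(mse_of(y_small, pred_small), r2_of(y_small, pred_small))
print(mse_of(y_large, pred_large), r2_of(y_large, pred_large))  # MSE is 1e6 times larger, R² is unchanged
```

The fit quality is identical in both cases; only the units changed, which is why MSE values should not be compared across targets with different scales.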

In [None]:
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

## Root Mean Squared Error (RMSE)

- **Interpretation:** The square root of MSE, expressed in the same units as the target variable. It approximates the standard deviation of the errors (exactly so when the mean error is zero) and is easier to interpret than MSE because it is on the same scale as the target values.
- **Good vs. Bad Values:** A lower RMSE is better, indicating a smaller typical error magnitude. The importance depends on the scale and range of your target variable, similar to MSE and MAE.

In [None]:
rmse = sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")

## R-Squared
- **Interpretation:** The proportion of variance in the target that the model explains; a value closer to 1 indicates a better fit. However, R² can be misleading: on the training data it never decreases when features are added, so it can rise even when the model doesn't capture the underlying relationships any better.
- **Good vs. Bad Values:** Higher R² is generally preferred, but a training R² well above the test R² is a sign of overfitting. Consider it alongside other metrics for a more comprehensive evaluation.
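
One way R² can mislead is easy to reproduce: measured on the fitting data, it cannot go down when extra columns are added, even columns of pure noise. The sketch below uses synthetic data and a hypothetical helper `r2_in_sample`; it is an illustration, not part of the housing analysis.

```python
import numpy as np

def r2_in_sample(X, y):
    """Fit OLS with an intercept and return R² on the same data."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2))

# One informative feature plus 20 columns unrelated to y (illustrative assumption)
rng = np.random.default_rng(0)
x_signal = rng.normal(size=(50, 1))
y_demo = 2.0 * x_signal[:, 0] + rng.normal(size=50)
noise_features = rng.normal(size=(50, 20))

r2_base = r2_in_sample(x_signal, y_demo)
r2_padded = r2_in_sample(np.hstack([x_signal, noise_features]), y_demo)
print(f"1 feature: {r2_base:.3f}, 21 features: {r2_padded:.3f}")
```

The padded model scores at least as high in-sample despite learning nothing new, which is the reason to check R² on held-out data or to use adjusted R².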

In [None]:
r2 = r2_score(y_test, y_pred)
print(f"R squared: {r2}")

## Mean Absolute Error (MAE)

- **Interpretation:** Measures the average magnitude of errors, in the same units as the target variable. It's less sensitive to outliers compared to MSE. Lower MAE suggests better model performance (smaller average prediction errors).
- **Good vs. Bad Values:** A lower MAE is better. Similar to MSE, the significance depends on the scale and range of your target variable.
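
The difference in outlier sensitivity can be shown with five hand-picked values (an illustrative assumption): replacing one prediction with a badly wrong one moves MAE modestly and MSE dramatically.

```python
import numpy as np

y_true_demo = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])  # every error has magnitude 1
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = 22.0                                 # one error of magnitude 10

for label, pred in [("clean", y_pred_clean), ("with outlier", y_pred_outlier)]:
    errors = y_true_demo - pred
    mae_demo = np.mean(np.abs(errors))
    mse_demo = np.mean(errors ** 2)
    print(f"{label}: MAE={mae_demo:.1f}, MSE={mse_demo:.1f}")
# clean: MAE=1.0, MSE=1.0
# with outlier: MAE=2.8, MSE=20.8
```

A single bad prediction raises MAE by a factor of 2.8 and MSE by a factor of 20.8, because squaring weights large errors much more heavily.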

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")

## Cross-Validation Scores

- **Interpretation:** Estimates how well the model might perform on unseen data by fitting and scoring it on several train/validation splits. With `scoring="r2"`, each fold's score is an R² value, so scores closer to 1 indicate better generalization.
- **Good vs. Bad Values:** Higher and more consistent scores across folds suggest better generalizability; a wide spread between folds means the estimate is unstable. Consider other evaluation metrics alongside this for a more holistic understanding.
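
When features are scaled, one common pattern is to wrap the scaler and the model in a pipeline so the scaler is refit inside each training fold and never sees the validation rows. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, so it runs without downloading the housing data); the names `pipeline_demo` and `scores_demo` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The pipeline refits the scaler on the training part of every fold
pipeline_demo = make_pipeline(StandardScaler(), LinearRegression())
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
scores_demo = cross_val_score(pipeline_demo, X_demo, y_demo, cv=cv_demo, scoring="r2")

print(f"R² per fold: {scores_demo.round(3)}")
print(f"mean: {scores_demo.mean():.3f}, std: {scores_demo.std():.3f}")
```

Reporting the mean together with the standard deviation gives a sense of both the expected score and how much it varies between folds.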

In [None]:
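cross_val_scores = cross_val_score(linear_model, X_train, y_train, cv=5, scoring="r2")
print(f"Cross Validation Scores: {cross_val_scores}")
# Summarize the folds: the mean is the headline estimate, the std shows its stability
print(f"Mean: {cross_val_scores.mean():.3f}, Std: {cross_val_scores.std():.3f}")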