### Bias-Variance Tradeoff

**Our objective is to achieve low bias and low variance.**

### Bias

**Bias** refers to errors that come from incorrect assumptions in the learning algorithm. It's like having a preconceived notion that all fruits are round and sweet.

High Bias: The model makes strong assumptions about the data, leading to systematic errors. It's like always aiming your arrows at the same spot, regardless of where the target is. High bias is also known as underfitting.

Example: If you use a simple model to predict house prices, it might consider only the size and ignore other important factors such as location. This oversimplification leads to inaccurate predictions.

Low Bias: The model makes fewer assumptions and can capture more complex patterns.
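
The house-price example can be sketched numerically. The snippet below is only an illustration on synthetic data: the price formula, the noise level, and the helper `fit_and_score` are assumptions made up for this sketch, not part of the notebook's dataset. It fits one least-squares model that sees only the size and another that also sees the location.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (made up for illustration): price depends on size AND location.
size = rng.uniform(50, 200, n)        # square metres
location = rng.integers(0, 2, n)      # 1 = city centre, 0 = suburb
price = 1000 * size + 80000 * location + rng.normal(0, 5000, n)


def fit_and_score(features, target):
    """Least-squares fit with an intercept; returns the training MSE."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return float(np.mean(residuals ** 2))


mse_size_only = fit_and_score(size, price)
mse_size_and_location = fit_and_score(np.column_stack([size, location]), price)

print(f"MSE, size only:         {mse_size_only:,.0f}")
print(f"MSE, size and location: {mse_size_and_location:,.0f}")
```

The size-only model cannot represent the location effect however much data it sees, so its error stays high even on the data it was trained on. That is the signature of high bias.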

### Variance
**Variance** refers to how much the model's predictions fluctuate when it is trained on different sets of training data. For example, suppose we have cat images, split them into 3 sets, and train a model independently on each set. If the features learned by the three models differ greatly, the model is said to have high variance: it has overfitted to its training data and is sensitive to small changes in that data.

Low Variance: The model's predictions are consistent across different training sets. High variance is also known as overfitting.
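
A regression analogue of the "train on several sets and compare" experiment can be sketched as follows. Everything here is an assumption chosen for illustration (the sine-shaped target, the noise level, the polynomial degrees, and the helper `prediction_spread`). The idea is to train the same kind of model on many independent training sets and measure how much its prediction at one fixed point moves around.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    return np.sin(np.pi * x)


def prediction_spread(degree, n_datasets=200, n_points=20, x0=0.9):
    """Fit a polynomial of the given degree on many independent training
    sets and return the variance of its prediction at the point x0."""
    predictions = []
    for _ in range(n_datasets):
        x = rng.uniform(-1, 1, n_points)
        y = true_function(x) + rng.normal(0, 0.3, n_points)
        coefs = np.polyfit(x, y, degree)
        predictions.append(np.polyval(coefs, x0))
    return float(np.var(predictions))


var_simple = prediction_spread(degree=1)
var_flexible = prediction_spread(degree=9)

print(f"variance of prediction, degree 1: {var_simple:.4f}")
print(f"variance of prediction, degree 9: {var_flexible:.4f}")
```

The flexible degree-9 model changes its answer much more from one training set to the next than the straight line does, which is what high variance means in practice.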

In practice, reducing bias tends to increase variance, and reducing variance tends to increase bias, so the two are studied together as the bias-variance tradeoff. We need to be aware of both issues when training machine learning models. Usually, larger models and a lot of training lead to high variance, while smaller models and too little training lead to high bias. Managing this tradeoff is one of the most common challenges in ML.
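
The tradeoff is often inspected by sweeping model complexity and comparing training and test error. The sketch below does this with polynomial fits on synthetic data. The quadratic signal, the noise level, and the list of degrees are assumptions picked for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)   # quadratic signal plus noise
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 2, 5, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error is the number to watch: it is high for the underfitting straight line, and it may rise again once the model starts fitting noise.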

**We normally use regularization to reduce or avoid overfitting when training models.**

In this notebook, we will use several regression techniques: linear, polynomial, Lasso, Ridge, and Elastic Net regression.


In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Regression models, preprocessing, and evaluation metrics
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Ignore warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set visual style
sns.set(style="whitegrid")

In [2]:
# TODO: ask Mostafa about evapotranspiration data, then switch datasets and rerun.

df = pd.read_csv('../dataset/water_potability.csv')
df = df.drop(['Potability'], axis=1)

In [3]:
len(df)

for col in df.columns:
    print(col, df[col].isna().sum())

ph 491
Hardness 0
Solids 0
Chloramines 0
Sulfate 781
Conductivity 0
Organic_carbon 0
Trihalomethanes 162
Turbidity 0


In [4]:
def load_data(file_path):
    """
    Loads the water quality dataset from a CSV file.

    Parameters:
    - file_path (str): Path to the CSV file.

    Returns:
    - df (pd.DataFrame): Loaded DataFrame.
    """
    df = pd.read_csv(file_path)
    return df


def handle_missing_values(df):
    """
    Handles missing values in the dataset by dropping every row that contains a NaN.

    Parameters:
    - df (pd.DataFrame): The DataFrame to process.

    Returns:
    - df_clean (pd.DataFrame): DataFrame with the incomplete rows removed.
    """
    # Check for missing values
    missing = df.isnull().sum()
    print("Missing Values in Each Column:")
    print(missing)

    # dropna() with default arguments drops rows (not columns) that contain NaN
    df_clean = df.dropna()

    return df_clean


def split_dataset(df, target_column, test_size=0.2, random_state=42):
    """
    Splits the dataset into training and testing sets.

    Parameters:
    - df (pd.DataFrame): The DataFrame to split.
    - target_column (str): The name of the target column.
    - test_size (float): Proportion of the dataset to include in the test split.
    - random_state (int): Controls the shuffling applied to the data before splitting.

    Returns:
    - X_train, X_test, y_train, y_test: Split datasets.
    """
    X = df.drop(columns=[target_column])
    y = df[target_column]
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    return X_train, X_test, y_train, y_test



def scale_features(X_train, X_test):
    """
    Scales the features using StandardScaler.

    Parameters:
    - X_train (pd.DataFrame or np.ndarray): Training feature set.
    - X_test (pd.DataFrame or np.ndarray): Testing feature set.

    Returns:
    - X_train_scaled, X_test_scaled (np.ndarray): Scaled feature sets.
    """
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled



def multiple_linear_regression(X_train, y_train):
    """
    Trains a Multiple Linear Regression model.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.

    Returns:
    - model (LinearRegression): Trained Linear Regression model.
    """
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model



def polynomial_regression(X_train, y_train, degree=2):
    """
    Trains a Polynomial Regression model of a specified degree.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.
    - degree (int): Degree of the polynomial.

    Returns:
    - poly_model (Pipeline): Trained Polynomial Regression pipeline.
    """
    poly_features = PolynomialFeatures(degree=degree)
    lin_reg = LinearRegression()
    poly_model = Pipeline([
        ('poly_features', poly_features),
        ('linear_regression', lin_reg)
    ])
    poly_model.fit(X_train, y_train)
    return poly_model



def evaluate_model(model, X_test, y_test, model_name='Model'):
    """
    Evaluates the model on the test set and prints performance metrics.

    Parameters:
    - model: Trained model with a predict method.
    - X_test (np.ndarray): Scaled testing features.
    - y_test (pd.Series or np.ndarray): Testing target.
    - model_name (str): Name of the model for display purposes.

    Returns:
    - metrics (dict): Dictionary containing MSE, MAE, and R² Score.
    """
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f"{model_name} Performance:")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}\n")
    
    metrics = {'MSE': mse, 'MAE': mae, 'R²': r2}
    return metrics
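
A note on `handle_missing_values`: `DataFrame.dropna()` with default arguments removes rows, not columns. The toy frame below is made up for illustration (only the column names are borrowed from the dataset) and contrasts the two behaviours.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; only the column names are borrowed from the dataset.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
    "Sulfate": [368.5, np.nan, np.nan, 356.9],
})

rows_dropped = toy.dropna()          # default axis=0: remove incomplete rows
cols_dropped = toy.dropna(axis=1)    # axis=1: remove incomplete columns

print(rows_dropped.shape)   # (2, 3): two complete rows, all three columns
print(cols_dropped.shape)   # (4, 1): all four rows, only 'Hardness' survives
```

Dropping rows keeps every feature but shrinks the sample; dropping columns keeps every sample but discards features. Which is preferable depends on how many values are missing and where.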


### L1 Regularization

#### Problem Setup:
Imagine you are trying to fit a line (or another model) to a set of data points. The goal is to minimize the difference between the predicted and actual values—this difference is called **error**.

However, when you're trying to fit a line too perfectly (reducing the error to almost zero), you might end up with a complicated line that doesn’t generalize well to new data. This is called **overfitting**. To avoid overfitting, we add regularization, which penalizes the complexity of the model.

#### What is L1 Regularization?

L1 regularization works by adding a penalty term to your cost function (the function you want to minimize). The penalty is based on the absolute values of the model’s coefficients (the numbers that define the line or model). The purpose is to encourage some of the coefficients to become smaller or even zero, which simplifies the model.

#### Basic Concept:
Let’s say your model has some coefficients $ w_1, w_2, \dots, w_n $. Normally, you would minimize the error term (let’s call it **E**) that measures how far off your model’s predictions are from the actual data.

The error term might look like:
$$
\text{Error} = E(w_1, w_2, \dots, w_n)
$$

In L1 regularization, you add an additional term to the error:
$$
\text{Total Cost} = E(w_1, w_2, \dots, w_n) + \lambda \sum_{i=1}^n |w_i|
$$
Where:
- $ \sum_{i=1}^n |w_i| $ is the sum of the absolute values of the model’s coefficients.
- $ \lambda $ is a constant that controls how strong the penalty is. If $ \lambda $ is large, the penalty becomes stronger, and the coefficients get smaller.


1. **Absolute values**: The regularization term involves absolute values, which you might recognize from high school. Absolute values just make sure everything is positive, so $ |w_i| $ is always a non-negative number. For example, $ |3| = 3 $ and $ |-3| = 3 $.

2. **Summing up coefficients**: You add up the absolute values of all the weights of your model (like the slope in a line equation). The intercept is usually left out of the penalty.

3. **Penalizing complexity**: This added term penalizes large coefficients. If a coefficient becomes too large, it increases the total cost, making it harder for the model to choose complex solutions. In fact, some coefficients may shrink to exactly zero, which simplifies the model by removing unnecessary features.

#### Example:

Imagine you have a simple linear model:
$$
y = w_1 x_1 + w_2 x_2 + b
$$
Normally, you’d just try to minimize the error, but with L1 regularization, you also add the penalty term:
$$
\text{Cost} = \text{Error} + \lambda (|w_1| + |w_2|)
$$
If $ w_1 $ or $ w_2 $ gets too large, the total cost will increase because of the absolute values, and the model will prefer to keep these coefficients smaller. In some cases, it might even set $ w_1 $ or $ w_2 $ to exactly zero if the corresponding feature doesn't significantly improve the predictions.
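
As a quick arithmetic check of the cost formula, the sketch below plugs in made-up numbers. The error value, the weights, and $ \lambda $ are all hypothetical, and `l1_cost` is a helper written for this illustration.

```python
def l1_cost(error, weights, lam):
    """Total cost = error + lam * (sum of absolute values of the weights)."""
    return error + lam * sum(abs(w) for w in weights)


error = 2.0   # hypothetical value of the error term
lam = 0.5     # hypothetical penalty strength

print(l1_cost(error, [3.0, -4.0], lam))   # 2.0 + 0.5 * (3 + 4) = 5.5
print(l1_cost(error, [3.0, 0.0], lam))    # 2.0 + 0.5 * (3 + 0) = 3.5
```

With the error held fixed, setting the second weight to zero lowers the total cost from 5.5 to 3.5. In a real fit the error would change too, so a weight is zeroed only when the saved penalty outweighs the extra error.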

#### Why is it useful?
By forcing some of the coefficients to zero, L1 regularization helps make the model simpler and less prone to overfitting. This is especially useful when you have many variables, and some of them aren’t very important.
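
The zeroing effect can be observed with scikit-learn's `Lasso` on synthetic data. In this sketch the data-generating weights, the noise level, and the `alpha` values are assumptions chosen for illustration: only the first two of five features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples = 200

# Five synthetic features; only the first two drive the target.
X = rng.normal(size=(n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, n_samples)

for alpha in (0.01, 0.5, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"alpha={alpha:>5}: coefficients={np.round(model.coef_, 2)}, zeros={n_zero}")
```

As `alpha` grows, more coefficients are driven to exactly zero, starting with the uninformative features. A very large `alpha` zeroes everything, leaving a model that predicts only the intercept.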

#### Summary:
- **L1 regularization** adds a penalty based on the sum of the absolute values of the model’s coefficients.
- It **simplifies the model** by encouraging smaller or zero coefficients.
- The penalty term discourages the model from overfitting to the training data, making it more generalizable.


In [5]:
def lasso_regression(X_train, y_train, alpha=0.1):
    """
    Trains a Lasso Regression model with L1 regularization.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.
    - alpha (float): Regularization strength.

    Returns:
    - lasso_model (Lasso): Trained Lasso Regression model.
    """
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    return lasso

### Ridge Regression

Ridge regression (L2 regularization), like L1 regularization, is used to prevent overfitting in models by adding a penalty term. The key difference is that Ridge regression uses the sum of the **squares** of the coefficients (instead of their absolute values) as the penalty.

#### Problem Setup:
As before, let’s assume you are trying to fit a line (or another model) to a set of data points. The goal is to minimize the **error** (the difference between the predicted values and the actual values).

If you fit the model too perfectly, it could lead to **overfitting**—where the model captures noise or irrelevant patterns in the data, making it perform poorly on new data. To avoid this, Ridge regression adds a penalty term that discourages large coefficients and helps to simplify the model.

#### What is Ridge Regression?

In Ridge regression, the penalty term is based on the **squared values** of the model’s coefficients. The more complex (larger) the coefficients, the larger the penalty becomes.

#### Basic Concept:
Let’s say your model has coefficients $ w_1, w_2, \dots, w_n $. Normally, you would minimize the error term **E**, which measures how far off the predictions are from the actual data.

The error term might look like:
$$
\text{Error} = E(w_1, w_2, \dots, w_n)
$$

In Ridge regression, you add a penalty based on the sum of the squares of the coefficients:
$$
\text{Total Cost} = E(w_1, w_2, \dots, w_n) + \lambda \sum_{i=1}^n w_i^2
$$
Where:
- $ \sum_{i=1}^n w_i^2 $ is the sum of the squares of the coefficients.
- $ \lambda $ is a constant that controls the strength of the penalty. A larger $ \lambda $ makes the penalty stronger, which keeps the coefficients smaller.


1. **Squaring values**: In Ridge regression, instead of using absolute values, you square each coefficient $ w_i $. Squaring a number makes sure it's non-negative (because $ w_i^2 \geq 0 $), and larger values are penalized more heavily. For example, $ 3^2 = 9 $ and $ (-3)^2 = 9 $.

2. **Summing squared coefficients**: You add up all the squared values of the coefficients, which is a measure of the model's complexity. Larger sums mean more complexity, and Ridge regression tries to minimize this.

3. **Penalizing large coefficients**: By adding this penalty term to the cost, Ridge regression discourages large coefficients. This makes the model simpler and less prone to overfitting. However, unlike L1 regularization (Lasso), Ridge regression does not force coefficients to exactly zero. It just makes them smaller.

#### Example:

Imagine you have a linear model:
$$
y = w_1 x_1 + w_2 x_2 + b
$$
Normally, you’d minimize the error, but with Ridge regression, you also add the penalty:
$$
\text{Cost} = \text{Error} + \lambda (w_1^2 + w_2^2)
$$
If $ w_1 $ or $ w_2 $ gets too large, the total cost will increase because of the squared values. This encourages the model to keep $ w_1 $ and $ w_2 $ smaller, but it won’t force them to zero like L1 regularization.
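
One way to see why the two penalties behave differently is a one-weight toy problem. Assume, purely for illustration, that the error is $ (w - z)^2 $, where $ z $ is the weight the model would choose without any penalty. Setting the derivative of the penalized cost to zero gives the closed-form minimizers coded below. The function names are made up for this sketch.

```python
def best_weight_l1(z, lam):
    """Minimizer of (w - z)**2 + lam * abs(w): soft-thresholding at lam / 2."""
    magnitude = max(abs(z) - lam / 2, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def best_weight_l2(z, lam):
    """Minimizer of (w - z)**2 + lam * w**2: proportional shrinkage."""
    return z / (1 + lam)


lam = 1.0
for z in (2.0, 0.4, -0.3):
    w_l1 = best_weight_l1(z, lam)
    w_l2 = best_weight_l2(z, lam)
    print(f"z={z:+.1f}  L1 -> {w_l1:+.3f}   L2 -> {w_l2:+.3f}")
```

The L1 solution subtracts a fixed amount and clips at zero, so any weight with $ |z| \le \lambda / 2 $ becomes exactly zero. The L2 solution divides by $ 1 + \lambda $, which shrinks every weight but never reaches zero unless $ z $ itself is zero.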

#### Why is it useful?
Ridge regression helps reduce overfitting by discouraging large coefficients in the model. It’s useful when you have a lot of variables, and you want to keep the model more general without forcing any coefficients to zero.
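
The shrinking effect can be checked numerically. The sketch below assumes the error term is the sum of squared errors and omits the intercept, in which case the Ridge cost has the closed-form minimizer $ w = (X^T X + \lambda I)^{-1} X^T y $. The synthetic data and the helper `ridge_weights` are assumptions for illustration, not the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -2.0, 0.0]) + rng.normal(0, 0.5, 100)


def ridge_weights(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


for lam in (0.0, 10.0, 1000.0):
    w = ridge_weights(X, y, lam)
    print(f"lambda={lam:>6}: w={np.round(w, 3)}, sum of squares={np.sum(w ** 2):.3f}")
```

The sum of squared weights falls as $ \lambda $ increases, yet the individual weights only approach zero rather than landing on it, in line with the description above.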


In [15]:
def ridge_regression(X_train, y_train, alpha=1.0):
    """
    Trains a Ridge Regression model with L2 regularization.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.
    - alpha (float): Regularization strength.

    Returns:
    - ridge_model (Ridge): Trained Ridge Regression model.
    """
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    return ridge


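def ridge_polynomial_regression(X_train, y_train, degree=2, alpha=1.0):
    """
    Trains a polynomial regression pipeline with Ridge (L2) regularization.

    Parameters match ridge_regression, plus degree (int) for the polynomial features.
    Returns the fitted Pipeline.
    """
    ridge_pipeline = Pipeline([
        # include_bias=False: the regressor fits its own intercept
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('ridge', Ridge(alpha=alpha))
    ])

    ridge_pipeline.fit(X_train, y_train)

    return ridge_pipeline


def lasso_polynomial_regression(X_train, y_train, degree=2, alpha=1.0):
    """
    Trains a polynomial regression pipeline with Lasso (L1) regularization.

    Parameters match lasso_regression, plus degree (int) for the polynomial features.
    Returns the fitted Pipeline.
    """
    lasso_pipeline = Pipeline([
        # include_bias=False: the regressor fits its own intercept
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('lasso', Lasso(alpha=alpha))
    ])

    lasso_pipeline.fit(X_train, y_train)

    return lasso_pipeline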


def elastic_net_regression(X_train, y_train, alpha=0.1, l1_ratio=0.5):
    """
    Trains an Elastic Net Regression model combining L1 and L2 regularization.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.
    - alpha (float): Regularization strength.
    - l1_ratio (float): The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1.
                       l1_ratio=0 corresponds to Ridge, l1_ratio=1 to Lasso.

    Returns:
    - enet_model (ElasticNet): Trained Elastic Net Regression model.
    """
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    enet.fit(X_train, y_train)
    return enet
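
For reference, the penalty that `alpha` and `l1_ratio` control in scikit-learn's `ElasticNet` objective is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper below is a hand-written sketch of that term only, with made-up weights. It is not a scikit-learn function.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term of scikit-learn's ElasticNet objective:
    alpha * l1_ratio * sum(|w|) + 0.5 * alpha * (1 - l1_ratio) * sum(w**2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2


weights = [3.0, -4.0]   # hypothetical coefficients
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=1.0))  # pure L1 term
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=0.0))  # pure L2 term
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=0.5))  # equal mix
```

With `l1_ratio=1` only the absolute-value term remains, as in Lasso. With `l1_ratio=0` only the squared term remains, as in Ridge, although the scaling of the objective differs from `Ridge(alpha=...)`, so the same `alpha` is not directly comparable between the two estimators.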

In [27]:
def compare_models(X_train, y_train, X_test, y_test):
    """
    Trains and evaluates Multiple Linear Regression, Polynomial Regression,
    Lasso and Ridge Regression (linear and polynomial), and Elastic Net Regression models.

    Parameters:
    - X_train (np.ndarray): Scaled training features.
    - y_train (pd.Series or np.ndarray): Training target.
    - X_test (np.ndarray): Scaled testing features.
    - y_test (pd.Series or np.ndarray): Testing target.

    Returns:
    - results (pd.DataFrame): DataFrame containing performance metrics for each model.
    """
    results = {}
    
    # Multiple Linear Regression
    lin_reg = multiple_linear_regression(X_train, y_train)
    lin_metrics = evaluate_model(lin_reg, X_test, y_test, 'Multiple Linear Regression')
    results['Multiple Linear Regression'] = lin_metrics
    
    # Polynomial Regression (Degree 2)
    poly_reg = polynomial_regression(X_train, y_train, degree=2)
    poly_metrics = evaluate_model(poly_reg, X_test, y_test, 'Polynomial Regression (Degree 2)')
    results['Polynomial Regression (Degree 2)'] = poly_metrics
    
    # Lasso Regression
    lasso = lasso_regression(X_train, y_train, alpha=0.1)
    lasso_metrics = evaluate_model(lasso, X_test, y_test, 'Lasso Regression (L1)')
    results['Lasso Regression (linear) (L1)'] = lasso_metrics


    # Lasso Polynomial Regression
    lasso_p = lasso_polynomial_regression(X_train, y_train, degree=2, alpha=1.0)
    lasso_p_metrics = evaluate_model(lasso_p, X_test, y_test, 'Lasso Polynomial Regression (L1)')
    results['Lasso Polynomial Regression (L1)'] = lasso_p_metrics

    # Ridge Regression
    ridge = ridge_regression(X_train, y_train, alpha=0.1)
    ridge_metrics = evaluate_model(ridge, X_test, y_test, 'Ridge Regression (L2)')
    results['Ridge Regression (linear) (L2)'] = ridge_metrics

    # Ridge Polynomial Regression
    ridge_p = ridge_polynomial_regression(X_train, y_train, degree=2, alpha=1000.0)
    ridge_p_metrics = evaluate_model(ridge_p, X_test, y_test, 'Ridge Polynomial Regression (L2)')
    results['Ridge Polynomial Regression (L2)'] = ridge_p_metrics
    
    
    # Elastic Net Regression
    enet = elastic_net_regression(X_train, y_train, alpha=0.1, l1_ratio=0.5)
    enet_metrics = evaluate_model(enet, X_test, y_test, 'Elastic Net Regression')
    results['Elastic Net Regression'] = enet_metrics
    
    # Create a DataFrame for comparison
    performance_df = pd.DataFrame(results).T
    return performance_df


In [28]:
def run_full_regression_analysis(
    file_path,
    target_column='Potability',
    test_size=0.2,
    random_state=42,
    poly_degree=2,
    lasso_alpha=0.1,
    ridge_alpha=1.0,
    enet_alpha=0.1,
    enet_l1_ratio=0.5,
):
    """
    Runs the full regression analysis including Multiple Linear Regression,
    Polynomial Regression, and Regularization techniques on the specified dataset.

    Parameters:
    - file_path (str): Path to the CSV file.
    - target_column (str): Name of the target column.
    - test_size (float): Proportion of the dataset to include in the test split.
    - random_state (int): Controls the shuffling applied to the data before splitting.
    - poly_degree (int): Degree of the polynomial for Polynomial Regression.
    - lasso_alpha (float): Regularization strength for Lasso Regression.
    - ridge_alpha (float): Regularization strength for Ridge Regression.
    - enet_alpha (float): Regularization strength for Elastic Net Regression.
    - enet_l1_ratio (float): The ElasticNet mixing parameter.

    Returns:
    - performance_df (pd.DataFrame): DataFrame containing performance metrics for each model.
    - models (dict): Dictionary containing trained models.
    """
    # Load data
    df = load_data(file_path)
    print("Data Loaded Successfully.\n")
    
    # Handle missing values
    df_clean = handle_missing_values(df)
    print("Missing Values Handled.\n")
    
    # Split dataset
    X_train, X_test, y_train, y_test = split_dataset(df_clean, target_column, test_size, random_state)
    print("Dataset Split into Training and Testing Sets.\n")
    
    # Scale features
    X_train_scaled, X_test_scaled = scale_features(X_train, X_test)
    print("Features Scaled.\n")

    # Scale targets
    y_train_scaled, y_test_scaled = scale_features(y_train.to_frame(), y_test.to_frame())
    y_train_scaled = y_train_scaled.ravel()
    y_test_scaled = y_test_scaled.ravel()
    
    # Compare models
    performance_df = compare_models(X_train_scaled, y_train_scaled, X_test_scaled, y_test_scaled)
    print("Model Comparison Completed.\n")
    
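    # Retrain each model with the hyperparameters passed to this function.
    # Note: these models are fit on the unscaled target (y_train), whereas
    # compare_models uses the standardized target.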
    models = {
        'Multiple Linear Regression': multiple_linear_regression(X_train_scaled, y_train),
        'Polynomial Regression': polynomial_regression(X_train_scaled, y_train, degree=poly_degree),
        'Lasso Regression': lasso_regression(X_train_scaled, y_train, alpha=lasso_alpha),
        'Ridge Regression': ridge_regression(X_train_scaled, y_train, alpha=ridge_alpha),
        'Elastic Net Regression': elastic_net_regression(X_train_scaled, y_train, alpha=enet_alpha, l1_ratio=enet_l1_ratio)
    }
    
    return performance_df, models


In [30]:
file_path = '../dataset/water_potability.csv'  
target_column = 'Turbidity'

performance_df, trained_models = run_full_regression_analysis(
    file_path=file_path,
    target_column=target_column,
    test_size=0.2,
    random_state=42,
    poly_degree=2,
    lasso_alpha=0.1,
    ridge_alpha=1.0,
    enet_alpha=0.1,
    enet_l1_ratio=0.5
)

print("Model Performance Comparison:")
print(performance_df)


Data Loaded Successfully.

Missing Values in Each Column:
ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
Missing Values Handled.

Dataset Split into Training and Testing Sets.

Features Scaled.

Multiple Linear Regression Performance:
Mean Squared Error (MSE): 0.9916
Mean Absolute Error (MAE): 0.7821
R² Score: 0.0008

Polynomial Regression (Degree 2) Performance:
Mean Squared Error (MSE): 1.0191
Mean Absolute Error (MAE): 0.7930
R² Score: -0.0269

Lasso Regression (L1) Performance:
Mean Squared Error (MSE): 0.9931
Mean Absolute Error (MAE): 0.7854
R² Score: -0.0007

Lasso Regression (L1) Performance:
Mean Squared Error (MSE): 0.9931
Mean Absolute Error (MAE): 0.7854
R² Score: -0.0007

Ridge Regression (L2) Performance:
Mean Squared Error (MSE): 0.9916
Mean Absolute Error (MAE): 0.7821
R² Score

In [22]:
# with mean imputation

# Model Performance Comparison:
#                                          MSE        MAE        R²
# Multiple Linear Regression        239.455271  11.964480 -0.002330
# Polynomial Regression (Degree 2)  241.377260  12.135437 -0.010375
# Lasso Regression (L1)             239.274149  11.943337 -0.001572
# Ridge Regression (L2)             239.454914  11.964451 -0.002329
# Elastic Net Regression            239.386005  11.951843 -0.002040
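
The code that produced the mean-imputation results above is not shown in the notebook. One common way to do it, given here only as an assumed sketch on a made-up frame, is to fill each missing value with the mean of its column instead of dropping the row.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.0],
    "Sulfate": [360.0, np.nan, 340.0],
    "Turbidity": [3.0, 4.5, 3.1],
})

# Replace each NaN with the mean of its own column, keeping all rows.
imputed = toy.fillna(toy.mean())

print(imputed)
```

Unlike `dropna()`, this keeps every row. In a train/test workflow the column means would normally be computed on the training split only and then applied to the test split, so that no information from the test set leaks into training.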

In [23]:
df

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135
1,3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075
...,...,...,...,...,...,...,...,...,...
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821
3272,7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243
3273,9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658
