Agenda
-> What is the drawback of using the train_test_split method?
-> How does k-fold cross-validation overcome this limitation?
-> How can cross-validation be used to select tuning parameters and features?
-> What are some possible improvements to cross-validation?

In [None]:
'''
Benefits of train_test_split:
    -> Fast and easy to implement
    -> Helps detect overfitting, because the model is evaluated on data it was not trained on
    -> Allows for easy experimentation with different train-test ratios
    -> Reproducible when random_state is fixed
    -> Gives a quick first estimate of out-of-sample performance

Drawbacks of train_test_split:
    -> The performance estimate has high variance: it depends on which rows happen to land in the test set
    -> A small test set makes this worse, and in small datasets a random split can leave rare patterns out of the training set
    -> Does not account for time-based dependencies in time series data
    -> Can give biased estimates if the split is not representative of the full data (e.g., imbalanced groups without stratification)
'''
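
The high-variance drawback is easy to see by repeating a single split with different seeds. The cell below is only a sketch: it uses synthetic data (`X_demo`, `y_demo` are made-up stand-ins, not the customer dataset) and assumes a 25% test size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the customer dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=5.0, size=200)

# Repeat the same single split with ten different seeds
split_mse = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    split_mse.append(mean_squared_error(y_te, model.predict(X_te)))

# The spread shows how much one split's estimate depends on the seed
print(f'min MSE: {min(split_mse):.2f}')
print(f'max MSE: {max(split_mse):.2f}')
print(f'std of MSE: {np.std(split_mse):.2f}')
```

The model and the data never change here; only the rows that land in the test set do, so any spread between the minimum and maximum is noise in the estimate itself.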

Now use k-fold cross-validation to train and evaluate the model

In [7]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import pandas as pd
data = pd.read_csv('J:\\4-Fourth Smester\\coding_semester4\\AI Lab\\ML\\4-KNN\\customer_purchase_data.csv')
feature_cols = ['NumberOfPurchases', 'AnnualIncome']
X = data[feature_cols]
Y = data[['TimeSpentOnWebsite']]


In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('J:\\4-Fourth Smester\\coding_semester4\\AI Lab\\ML\\4-KNN\\customer_purchase_data.csv')

# Define features and target variable
feature_cols = ['NumberOfPurchases', 'AnnualIncome']
X = data[feature_cols]
Y = data[['TimeSpentOnWebsite']]

# Set up k-fold cross-validation
k = 5  # Number of folds
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize the model and store results
model = LinearRegression()
mse_list = []

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]  # Use iloc for correct indexing
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]  # Use iloc for correct indexing
    
    # Train the model
    model.fit(X_train, Y_train)
    
    # Make predictions
    Y_pred = model.predict(X_test)
    
    # Calculate and store the mean squared error
    fold_mse = mean_squared_error(Y_test, Y_pred)
    mse_list.append(fold_mse)

# Calculate the average MSE across all folds
average_mse = np.mean(mse_list)
print(f'Average Mean Squared Error (MSE) across {k} folds: {average_mse}')
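# In the last lecture, a single train_test_split on this data gave an MSE of about 305;
# the 5-fold average here is 289.87. The model itself is unchanged, so this is not an
# improvement in the model: the cross-validated average is a more stable estimate of its error.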

Average Mean Squared Error (MSE) across 5 folds: 289.86631990590615
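
The manual loop above can also be written with scikit-learn's `cross_val_score`. The cell below is a sketch on synthetic data (`X_syn` and `y_syn` are made-up stand-ins, since the CSV is not assumed to be available). Note that scikit-learn scorers follow a "higher is better" convention, so MSE is returned negated under `scoring='neg_mean_squared_error'`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the customer dataset)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(150, 2))
y_syn = 4.0 * X_syn[:, 0] + 1.5 * X_syn[:, 1] + rng.normal(scale=2.0, size=150)

# Same fold setup as in the manual loop
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits a fresh copy of the model on each training fold
neg_mse = cross_val_score(
    LinearRegression(), X_syn, y_syn, cv=kf, scoring='neg_mean_squared_error'
)
fold_mse = -neg_mse  # flip the sign back to ordinary MSE

print(f'MSE per fold: {np.round(fold_mse, 2)}')
print(f'Average MSE: {fold_mse.mean():.2f}')
```

Printing the per-fold values alongside the average is useful, because the spread between folds indicates how stable the estimate is.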


In [None]:
'''
Cross Validation Benefits:
    -> Gives a more reliable estimate of out-of-sample performance than a single train_test_split
    -> The number of folds k can be adjusted (commonly 5 or 10) to trade computation for estimate stability
    -> Reduces the variance in performance estimates by averaging results across multiple folds
    -> Makes better use of the entire dataset, especially with smaller datasets: every row is used for both training and testing
    -> Helps in identifying model overfitting and underfitting more effectively

Cross Validation Drawbacks:
    -> Increased computational cost: the model is trained k times instead of once
    -> Longer training times, especially for large datasets or complex models
    -> May lead to data leakage if not implemented correctly (e.g., fitting preprocessing such as scaling on the full dataset before splitting)
    -> Not suitable for time series data without appropriate modifications to preserve temporal order
'''
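
The cell below sketches three of the points above under stated assumptions: the data is synthetic (`X_syn`, `y_syn` are made up), `KNeighborsRegressor` is used only as an example of a model with a tuning parameter, and the candidate values 1, 5, 15 are arbitrary. Putting the scaler inside a pipeline means it is re-fitted on each training fold, which avoids the leakage mentioned above. `TimeSeriesSplit` is one possible modification for ordered data.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (an assumption)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(120, 2)) * np.array([1.0, 1000.0])
y_syn = 2.0 * X_syn[:, 0] + 0.003 * X_syn[:, 1] + rng.normal(scale=1.0, size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Compare candidate values of a tuning parameter by their cross-validated MSE
cv_mse = {}
for n_neighbors in (1, 5, 15):
    # The scaler lives inside the pipeline, so it only ever sees training folds
    pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=n_neighbors))
    scores = cross_val_score(pipe, X_syn, y_syn, cv=kf, scoring='neg_mean_squared_error')
    cv_mse[n_neighbors] = -scores.mean()

best_n = min(cv_mse, key=cv_mse.get)
print(f'Cross-validated MSE by n_neighbors: {cv_mse}')
print(f'Best n_neighbors: {best_n}')

# For ordered data, TimeSeriesSplit keeps every training row before every test row
tscv = TimeSeriesSplit(n_splits=3)
ordered = []
for train_idx, test_idx in tscv.split(X_syn):
    ordered.append((train_idx.max(), test_idx.min()))
    print(f'train rows 0..{train_idx.max()}, test rows {test_idx.min()}..{test_idx.max()}')
```

The same loop pattern can be used to compare feature subsets: keep the model fixed, change the columns passed in, and pick the subset with the lowest cross-validated error.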