# Water Potability Detection: Model Training

In this notebook, we'll train several classification models on our preprocessed water potability data.
Each fitted model is pickled to the `models/` directory so that a later notebook can evaluate and compare them.

Models to be trained:
1. Logistic Regression
2. Random Forest Classifier
3. Gradient Boosting (XGBoost, LightGBM)
4. Support Vector Machine (SVM)
5. K-Nearest Neighbors (KNN)

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import time
from datetime import datetime

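# Set paths: the project root is everything before the "notebooks" directory.
# str.find returns -1 when the substring is missing, which would silently
# chop the last character, so fall back to the current directory instead.
HOME = os.getcwd()
idx = HOME.find("notebooks")
HOME = HOME[:idx] if idx != -1 else HOME + os.sep
DATA_FOLDER = HOME + "data/"
MODEL_FOLDER = HOME + "models/"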

# Create models directory if it doesn't exist
os.makedirs(MODEL_FOLDER, exist_ok=True)

# For reproducibility
RANDOM_STATE = 42

In [2]:
# Import ML libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
import lightgbm as lgb

# Suppress library warnings to keep the cell output readable
import warnings
warnings.filterwarnings('ignore')

# Check for GPU availability
gpu_available = False

# Probe XGBoost's CUDA backend by fitting a tiny random dataset;
# any exception is treated as "no usable GPU"
try:
    xgb_params = {"tree_method": "gpu_hist"}
    xgb_clf_test = xgb.XGBClassifier(**xgb_params)
    xgb_clf_test.fit(np.random.rand(10, 9), np.random.randint(0, 2, 10))
    gpu_available = True
    print("GPU is available for XGBoost!")
except Exception as e:
    print(f"GPU not available for XGBoost: {str(e)}")
    print("Will use CPU instead.")

# Probe LightGBM the same way; device="gpu" selects its OpenCL backend, not CUDA
try:
    lgb_params = {"device": "gpu"}
    lgb_clf_test = lgb.LGBMClassifier(**lgb_params)
    lgb_clf_test.fit(np.random.rand(10, 9), np.random.randint(0, 2, 10))
    gpu_available_lgb = True
    print("GPU is available for LightGBM!")
except Exception as e:
    gpu_available_lgb = False
    print(f"GPU not available for LightGBM: {str(e)}")
    print("Will use CPU for LightGBM.")

GPU is available for XGBoost!
[LightGBM] [Info] Number of positive: 8, number of negative: 2
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 10, number of used features: 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3050 Laptop GPU, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...




[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.800000 -> initscore=1.386294
[LightGBM] [Info] Start training from score 1.386294
GPU is available for LightGBM!


## Load the data

We'll use the training data that we created in the previous data exploration notebook.

In [3]:
# Load train dataset
train_data = pd.read_csv(DATA_FOLDER + "train_data.csv")
print(f"Training data shape: {train_data.shape}")

# Split into features and target
X_train = train_data.drop('Potability', axis=1)
y_train = train_data['Potability']

# Display class distribution
print("\nClass distribution in training data:")
print(y_train.value_counts(normalize=True).mul(100).round(2))

Training data shape: (4000, 10)

Class distribution in training data:
Potability
1    78.95
0    21.05
Name: proportion, dtype: float64
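
The split above is imbalanced: about 79% of the samples are potable, so a model that always predicts class 1 would already reach roughly 79% accuracy. One common mitigation (not applied in this notebook) is to weight each class inversely to its frequency, which is the heuristic behind scikit-learn's `class_weight="balanced"` option. A minimal sketch in plain Python, using the class counts implied by the output above (78.95% and 21.05% of 4000 rows); the helper name `balanced_class_weights` is hypothetical:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Class counts implied by the output above: 78.95% and 21.05% of 4000 rows
labels = [1] * 3158 + [0] * 842
weights = balanced_class_weights(labels)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(Counter(labels).values()) / len(labels)

print({cls: round(w, 3) for cls, w in sorted(weights.items())})
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

The minority class receives the larger weight, so its errors count for more in the loss. The baseline figure is also a useful reference point when reading accuracy numbers in the evaluation notebook.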


## 1. Logistic Regression

First, let's try a simple logistic regression model as a baseline.

In [4]:
# Initialize and train Logistic Regression model
print("Training Logistic Regression...")
start_time = time.time()

lr_model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000, n_jobs=-1)
lr_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "logistic_regression_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(lr_model, file)
print(f"Model saved to {model_path}")

Training Logistic Regression...
Training completed in 0.85 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/logistic_regression_model.pkl


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
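
The warning above indicates that the solver reached `max_iter` before converging, and the message itself suggests scaling the data. A minimal sketch of that suggestion, wrapping `StandardScaler` and `LogisticRegression` in a pipeline so the scaler would be pickled together with the model. The synthetic data (nine features on very different scales) is a stand-in chosen for illustration, not the water potability data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 9 features whose scales differ by orders of magnitude
scales = np.array([1, 10, 100, 1000, 1, 10, 100, 1000, 10000])
X_demo = rng.normal(size=(500, 9)) * scales
y_demo = (X_demo[:, 0] + X_demo[:, 4] > 0).astype(int)

# Standardize each feature, then fit the classifier on the scaled values
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)
scaled_lr.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used by the solver
n_iter_used = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print(f"Iterations used: {n_iter_used} (limit: 1000)")
```

Whether scaling removes the warning on the real data depends on how the features were preprocessed in the earlier notebook, so this is a starting point to try rather than a guaranteed fix.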


## 2. Random Forest Classifier

Next, let's train a Random Forest model, which often performs well on tabular data.

In [5]:
# Initialize and train Random Forest model
print("Training Random Forest Classifier...")
start_time = time.time()

rf_model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "random_forest_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(rf_model, file)
print(f"Model saved to {model_path}")

Training Random Forest Classifier...
Training completed in 0.11 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/random_forest_model.pkl
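
Each training cell in this notebook repeats the same three steps: fit, time, pickle. One way to factor that out is a small helper. The name `train_and_save` and its signature are assumptions for illustration, not part of the original notebook, and the sketch uses a dummy estimator and a temporary directory so it runs without any ML library:

```python
import pickle
import tempfile
import time
from pathlib import Path

def train_and_save(model, name, X, y, model_dir):
    """Fit `model`, report wall-clock training time, and pickle it to `model_dir`."""
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start

    model_path = Path(model_dir) / f"{name}_model.pkl"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)

    print(f"{name}: trained in {elapsed:.2f} s, saved to {model_path}")
    return model_path, elapsed

class MajorityClassifier:
    """Stand-in estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

with tempfile.TemporaryDirectory() as tmp:
    path, seconds = train_and_save(
        MajorityClassifier(), "majority", [[0], [1], [2]], [1, 1, 0], tmp
    )
    # Round-trip check: the pickled model can be loaded and used again
    with open(path, "rb") as fh:
        restored = pickle.load(fh)
    demo_prediction = restored.predict([[5]])
```

In the notebook, a call such as `train_and_save(rf_model, "random_forest", X_train, y_train, MODEL_FOLDER)` would produce the same file names used in the cells here.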


## 3a. XGBoost Classifier

Now let's train an XGBoost model, a powerful gradient boosting implementation.

In [6]:
# Initialize and train XGBoost model with GPU if available
print("Training XGBoost Classifier...")
start_time = time.time()

if gpu_available:
    xgb_model = xgb.XGBClassifier(
        n_estimators=100, 
        random_state=RANDOM_STATE,
        tree_method='gpu_hist',  # Use GPU acceleration
        gpu_id=0
    )
    print("Using GPU acceleration for XGBoost")
else:
    xgb_model = xgb.XGBClassifier(
        n_estimators=100, 
        random_state=RANDOM_STATE,
        n_jobs=-1  # Use all CPU cores
    )
    print("Using CPU for XGBoost")

xgb_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "xgboost_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(xgb_model, file)
print(f"Model saved to {model_path}")

Training XGBoost Classifier...
Using GPU acceleration for XGBoost
Training completed in 0.13 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/xgboost_model.pkl
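
`tree_method='gpu_hist'` and `gpu_id` worked in the run above, but XGBoost 2.0 deprecated both in favor of `tree_method='hist'` combined with a `device` parameter. A minimal sketch of choosing the keyword arguments by version; the helper name `xgb_gpu_kwargs` is hypothetical, and in the notebook the version string would come from `xgb.__version__`:

```python
def xgb_gpu_kwargs(version: str, device_ordinal: int = 0) -> dict:
    """Return GPU-related XGBClassifier kwargs suited to an XGBoost version string."""
    major = int(version.split(".")[0])
    if major >= 2:
        # XGBoost >= 2.0: the GPU is selected through the `device` parameter
        return {"tree_method": "hist", "device": f"cuda:{device_ordinal}"}
    # XGBoost < 2.0: legacy spelling, as used in the cell above
    return {"tree_method": "gpu_hist", "gpu_id": device_ordinal}

print(xgb_gpu_kwargs("1.7.6"))
print(xgb_gpu_kwargs("2.0.3"))
```

The result can be unpacked into the constructor, for example `xgb.XGBClassifier(n_estimators=100, random_state=RANDOM_STATE, **xgb_gpu_kwargs(xgb.__version__))`.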


## 3b. LightGBM Classifier

Let's also try LightGBM, another efficient gradient boosting framework.

In [7]:
# Initialize and train LightGBM model with GPU if available
print("Training LightGBM Classifier...")
start_time = time.time()

if gpu_available_lgb:
    lgb_model = lgb.LGBMClassifier(
        n_estimators=100, 
        random_state=RANDOM_STATE,
        device='gpu',
        gpu_platform_id=0,
        gpu_device_id=0
    )
    print("Using GPU acceleration for LightGBM")
else:
    lgb_model = lgb.LGBMClassifier(
        n_estimators=100, 
        random_state=RANDOM_STATE,
        n_jobs=-1  # Use all CPU cores
    )
    print("Using CPU for LightGBM")

lgb_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "lightgbm_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(lgb_model, file)
print(f"Model saved to {model_path}")

Training LightGBM Classifier...
Using GPU acceleration for LightGBM
[LightGBM] [Info] Number of positive: 3158, number of negative: 842
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2124
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 9
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3050 Laptop GPU, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...




[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 9 dense feature groups (0.05 MB) transferred to GPU in 0.000441 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.789500 -> initscore=1.321914
[LightGBM] [Info] Start training from score 1.321914
Training completed in 2.59 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/lightgbm_model.pkl


## 4. Support Vector Machine (SVM)

Let's try an SVM classifier. `SVC` uses an RBF kernel by default, and `probability=True` enables `predict_proba` at the cost of an extra internal cross-validation during fitting.

In [8]:
# Initialize and train SVM model
print("Training Support Vector Machine Classifier...")
start_time = time.time()

svm_model = SVC(probability=True, random_state=RANDOM_STATE)
svm_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "svm_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(svm_model, file)
print(f"Model saved to {model_path}")

Training Support Vector Machine Classifier...
Training completed in 0.12 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/svm_model.pkl


## 5. K-Nearest Neighbors (KNN)

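Finally, let's train a KNN classifier with `n_neighbors=5`. KNN is distance-based, so its predictions depend on how the features were scaled during preprocessing.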

In [9]:
# Initialize and train KNN model
print("Training K-Nearest Neighbors Classifier...")
start_time = time.time()

knn_model = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Save the model
model_path = MODEL_FOLDER + "knn_model.pkl"
with open(model_path, 'wb') as file:
    pickle.dump(knn_model, file)
print(f"Model saved to {model_path}")

Training K-Nearest Neighbors Classifier...
Training completed in 0.00 seconds
Model saved to /home/yashpotdar/projects/water-potability-detection/models/knn_model.pkl
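
The 0.00-second training time is expected: KNN does no optimization during `fit`. It stores the training points (scikit-learn may also build a search structure over them) and defers the real work to prediction. A minimal pure-Python sketch of the idea, using brute-force Euclidean distance and a majority vote; the class name `TinyKNN` and the toy data are illustrative only:

```python
from collections import Counter
import math

class TinyKNN:
    """Brute-force k-nearest-neighbors classifier (illustration only)."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" just stores the data
        self.X_ = [tuple(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for query in X:
            # Rank training points by Euclidean distance to the query
            nearest = sorted(
                range(len(self.X_)),
                key=lambda i: math.dist(query, self.X_[i]),
            )[: self.n_neighbors]
            # Majority vote among the k closest labels
            votes = Counter(self.y_[i] for i in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions

X_toy = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_toy = [0, 0, 0, 1, 1, 1]
knn = TinyKNN(n_neighbors=3).fit(X_toy, y_toy)
toy_predictions = knn.predict([(0.5, 0.5), (5.5, 5.5)])
print(toy_predictions)
```

This also explains why `knn_model.pkl` is among the larger files saved here: the pickle has to contain the training data itself.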


## Summary

In this notebook, we've trained six models from five families of machine learning algorithms on our water potability data:

1. Logistic Regression
2. Random Forest Classifier
3. Gradient Boosting (XGBoost and LightGBM)
4. Support Vector Machine (SVM)
5. K-Nearest Neighbors (KNN)

All models have been saved to the `models/` directory for future evaluation. In the next notebook, we'll evaluate these models and compare their performance.

In [10]:
# List all saved models
print("Saved models:")
for model_file in os.listdir(MODEL_FOLDER):
    if model_file.endswith('.pkl'):
        model_path = os.path.join(MODEL_FOLDER, model_file)
        size_mb = os.path.getsize(model_path) / (1024 * 1024)
        print(f" - {model_file} ({size_mb:.2f} MB)")

Saved models:
 - knn_model.pkl (0.38 MB)
 - lightgbm_model.pkl (0.34 MB)
 - logistic_regression_model.pkl (0.00 MB)
 - xgboost_model.pkl (0.08 MB)
 - random_forest_model.pkl (0.25 MB)
 - svm_model.pkl (0.02 MB)
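
For the evaluation notebook, the pickles can be loaded back into a dictionary keyed by model name. A minimal sketch, assuming every matching file in the folder is a trusted pickle written by this notebook (unpickling untrusted files can execute arbitrary code) and that the same library versions are installed when loading. The helper name `load_models` is hypothetical, and the demo uses placeholder objects in a temporary directory rather than the real models:

```python
import pickle
import tempfile
from pathlib import Path

def load_models(model_dir):
    """Load every *_model.pkl file in `model_dir` into a {name: object} dict."""
    suffix = "_model.pkl"
    models = {}
    for path in sorted(Path(model_dir).glob(f"*{suffix}")):
        name = path.name[: -len(suffix)]
        with open(path, "rb") as fh:
            models[name] = pickle.load(fh)
    return models

with tempfile.TemporaryDirectory() as tmp:
    # Write two placeholder pickles that follow the notebook's naming scheme
    for name in ("knn", "svm"):
        with open(Path(tmp) / f"{name}_model.pkl", "wb") as fh:
            pickle.dump({"placeholder_for": name}, fh)
    loaded = load_models(tmp)

print(sorted(loaded))
```

With the real files, `load_models(MODEL_FOLDER)` would return keys such as `random_forest` and `logistic_regression`, derived from the file names listed above.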
