# 💧 Water Potability Prediction System

This project aims to predict whether water is **safe for drinking** using
machine learning techniques.  
We also run a **comparative analysis** across several models to see which one performs best.

### Objectives:
- Analyze water quality parameters
- Handle class imbalance
- Build a robust ML model
- Visualize insights and feature importance


# Libraries Used

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

from imblearn.over_sampling import SMOTE
import joblib

import warnings
warnings.filterwarnings("ignore")


# Load Dataset

In [None]:
data = pd.read_csv("water_potability.csv")
data.head()

In [None]:
data.info()

In [None]:
data.describe()


# Target Variable Distribution

In [None]:
plt.figure(figsize=(5,4))
sns.countplot(x='Potability', data=data)
plt.title("Water Potability Class Distribution")
plt.xlabel("Potable (1) / Not Potable (0)")
plt.ylabel("Count")
plt.show()


The dataset is imbalanced, with fewer potable water samples.
This motivates the use of **SMOTE** for class balancing.
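The imbalance can also be quantified rather than read off the plot. The sketch below uses a small made-up label series (`labels` is a hypothetical stand-in for `data["Potability"]`, and the counts are invented for illustration) to show how class counts, shares, and a majority-to-minority ratio might be computed.

```python
import pandas as pd

# Hypothetical labels standing in for data["Potability"]; the counts are made up.
labels = pd.Series([0] * 6 + [1] * 4, name="Potability")

# Absolute counts and relative shares per class, ordered by class label.
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()

# Ratio of the larger class to the smaller one; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

Running the same three lines on the real `Potability` column would give the actual ratio for this dataset.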


# Feature Distribution Visualization

Histogram

In [None]:
data.drop("Potability", axis=1).hist(
    figsize=(14,10),
    bins=20,
    edgecolor='black'
)
plt.suptitle("Distribution of Water Quality Parameters")
plt.show()


Correlation Heatmap

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(
    data.corr(),
    cmap='coolwarm',
    annot=True,
    linewidths=0.5
)
plt.title("Feature Correlation Heatmap")
plt.show()


Highly correlated features may influence model decisions.
Random Forest predictions are generally robust to multicollinearity, although importance can be split across correlated features.
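To make "highly correlated" concrete, the correlation matrix can be reduced to a ranked list of feature pairs. This is a minimal sketch on synthetic data: the frame `demo`, its column names, and the 0.8 cutoff are all assumptions chosen for illustration, not values taken from the water dataset.

```python
import numpy as np
import pandas as pd

# Synthetic frame: "b" is nearly a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = demo.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)

# Pairs above an (arbitrary) cutoff of 0.8.
strong_pairs = pairs[pairs > 0.8]
print(strong_pairs)
```

Applied to `data.corr()`, the same steps would list any strongly related water quality parameters, if there are any.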

# Data Preprocessing

In [None]:
# ==============================
# DATA PREPROCESSING
# ==============================

# 1. Create a copy of original data
df = data.copy()

# 2. Check missing values
print("Missing values before preprocessing:\n")
print(df.isnull().sum())

# 3. Handle missing values (Median Imputation)
for col in df.columns:
    if df[col].isnull().sum() > 0:
        # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
        df[col] = df[col].fillna(df[col].median())

# 4. Verify missing values are handled
print("\nMissing values after preprocessing:\n")
print(df.isnull().sum())

# 5. Remove duplicates if any
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows found: {duplicates}")

if duplicates > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates removed.")

# 6. Separate features and target
X = df.drop("Potability", axis=1)
y = df["Potability"]

# 7. Feature scaling check (Tree models don't need scaling)
print("\nFeature scaling not required for Random Forest.")


## 🧹 Data Preprocessing

- Missing values were handled using **median imputation**
- Duplicate rows were removed
- Feature scaling was not applied as **Random Forest is scale-invariant**
- Cleaned dataset was used for further analysis and modeling
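The preprocessing cell computes medians on the full dataset before the train-test split, so the test rows contribute to the imputation values. One possible alternative, sketched here on a tiny made-up frame (`demo` and its values are hypothetical), is to fit scikit-learn's `SimpleImputer` on the training split only and reuse those medians for the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with missing values; not taken from water_potability.csv.
demo = pd.DataFrame({
    "ph": [6.5, np.nan, 7.2, 8.1, np.nan, 6.9, 7.5, 7.0],
    "Sulfate": [330.0, 310.0, np.nan, 350.0, 340.0, np.nan, 320.0, 335.0],
    "Potability": [0, 1, 0, 1, 0, 1, 0, 1],
})
X_demo = demo.drop("Potability", axis=1)
y_demo = demo["Potability"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Learn the medians from the training rows only, then apply them to both splits.
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_imp = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print("Medians learned from the training split:", imputer.statistics_)
```

Whether this changes the results noticeably depends on how many values are missing; the point is only that the test set stays unseen until evaluation.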


# Train-Test + SMOTE

In [None]:
X = df.drop("Potability", axis=1)
y = df["Potability"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Apply SMOTE to balance the classes in training data
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print("Before SMOTE:")
print(y_train.value_counts())

print("\nAfter SMOTE:")
print(pd.Series(y_res).value_counts())
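SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The function below is a simplified NumPy illustration of that idea, not imblearn's implementation; `smote_like_samples`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                       # random minority point
        neighbour = neighbours[base, rng.integers(k)]  # one of its neighbours
        gap = rng.random()                           # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority class with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(X_min, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE adds no information beyond what the minority class already contains.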


Hyperparameter Grid

In [None]:
param_dist = {
    "n_estimators": [100, 200, 300, 500, 800],
    "max_depth": [10, 15, 20, 25, 30, None],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
    "class_weight": ["balanced", "balanced_subsample", None],
    "criterion": ["gini", "entropy"],
}
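It helps to know how large this search space is before choosing `n_iter`. The snippet below repeats the grid under a different name (`param_dist_demo`, so it runs on its own) and counts the combinations; the figure of 50 iterations is the value used in the search cell.

```python
from math import prod

# Same values as the grid above, repeated so this snippet is self-contained.
param_dist_demo = {
    "n_estimators": [100, 200, 300, 500, 800],
    "max_depth": [10, 15, 20, 25, 30, None],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
    "class_weight": ["balanced", "balanced_subsample", None],
    "criterion": ["gini", "entropy"],
}

# Total number of distinct parameter combinations.
n_combinations = prod(len(values) for values in param_dist_demo.values())

# Share of the space that a randomized search with 50 draws visits.
n_iter = 50
coverage = n_iter / n_combinations

print(f"Combinations: {n_combinations}")
print(f"Sampled by the search: {coverage:.2%}")
```

With 17,280 combinations, an exhaustive grid search with 5-fold cross-validation would need 86,400 model fits, which is the practical reason for sampling instead.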


# Model Training

# Random Forest with RandomizedSearchCV

In [None]:
base_est = RandomForestClassifier(
    random_state=42,
    n_jobs=-1
)

search = RandomizedSearchCV(
    estimator=base_est,
    param_distributions=param_dist,
    n_iter=50,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    verbose=1,
    n_jobs=-1,
    random_state=42
)

search.fit(X_res, y_res)
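Because `search.fit` receives data that was already resampled, each validation fold can contain synthetic points interpolated from rows that sit in the training folds, which may make the cross-validated score optimistic. A common remedy is to resample inside each fold (imblearn's own `Pipeline` is designed for this). The sketch below shows the idea with scikit-learn only, using plain random oversampling as a stand-in for SMOTE; `oversample_minority` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the water quality features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, weights=[0.65, 0.35], random_state=42
)

def oversample_minority(X, y, seed):
    """Random oversampling with replacement; a simple stand-in for SMOTE."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    X_extra = resample(X[y == minority], replace=True, n_samples=n_extra, random_state=seed)
    return np.vstack([X, X_extra]), np.concatenate([y, np.full(n_extra, minority)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_demo, y_demo)):
    # Balance only the training part of the fold; the validation part stays untouched.
    X_bal, y_bal = oversample_minority(X_demo[train_idx], y_demo[train_idx], seed=fold)
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_bal, y_bal)
    val_prob = clf.predict_proba(X_demo[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[val_idx], val_prob))

print(f"Mean ROC-AUC over folds: {np.mean(fold_scores):.4f}")
```

Scores obtained this way are measured on original, unresampled rows only, so they are more directly comparable to the test-set ROC-AUC.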


Best Model and parameters

In [112]:
model1 = search.best_estimator_

print("Best Parameters:")
for k, v in search.best_params_.items():
    print(f"{k}: {v}")

print(f"\nBest CV ROC-AUC: {search.best_score_:.4f}")


Best Parameters:
n_estimators: 300
min_samples_split: 2
min_samples_leaf: 1
max_features: None
max_depth: 30
criterion: entropy
class_weight: None
bootstrap: True

Best CV ROC-AUC: 0.7736


Model Evaluation

In [None]:
y_pred = model1.predict(X_test)
y_prob = model1.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Accuracy: {accuracy*100:.2f}%")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
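The reported metrics all derive from the four confusion-matrix counts. As a small worked example on made-up labels (`y_true` and `y_hat` are hypothetical, not model output), the following computes precision, recall, and F1 by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions for ten samples.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)   # of predicted potable, how many really are
recall_manual = tp / (tp + fn)      # of truly potable, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision_manual:.4f} Recall={recall_manual:.4f} F1={f1_manual:.4f}")
```

For a potability model, a false positive means unsafe water labelled as drinkable, so precision on class 1 is arguably the metric to watch most closely.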


Confusion Matrix Visualization

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(5,4))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Greens'
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


ROC curve

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0,1], [0,1], linestyle='--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
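The AUC shown in the legend is the area under the plotted `(fpr, tpr)` points. A minimal check on made-up scores (`y_true` and `scores` are hypothetical) shows that integrating the curve with `sklearn.metrics.auc` gives the same number as `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for eight samples.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

# Points of the ROC curve, one per distinct decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under those points by the trapezoidal rule.
area = auc(fpr, tpr)

print(f"Area from curve: {area:.4f}")
print(f"roc_auc_score:   {roc_auc_score(y_true, scores):.4f}")
```

Equivalently, the value is the fraction of positive-negative pairs in which the positive sample receives the higher score.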


Feature Importance

In [None]:
importances = model1.feature_importances_
features = X.columns

fi_df = pd.DataFrame({
    "Feature": features,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(
    x="Importance",
    y="Feature",
    data=fi_df.head(10),
    palette="Greens_r"
)
plt.title("Top 10 Important Features")
plt.show()


Features with higher importance contribute more to the decision-making
process of the Random Forest model, indicating their relevance in
determining water potability.
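Impurity-based importances are computed from the training data and can favour features with many distinct values, so a second view is often useful. One option, sketched here on synthetic data (the dataset, the feature names `f0` to `f4`, and the small forest are assumptions for illustration), is scikit-learn's `permutation_importance`, which measures how much a held-out score drops when a feature is shuffled.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 2 informative features and 3 noise features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean drop in held-out ROC-AUC when each feature is shuffled, over 10 repeats.
result = permutation_importance(
    clf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)

comparison = pd.DataFrame(
    {
        "impurity_importance": clf.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=[f"f{i}" for i in range(X_demo.shape[1])],
).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If the two rankings disagree strongly for the water quality parameters, the impurity-based ranking should be read with some caution.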


Save Model

In [None]:
import os
os.makedirs("Models", exist_ok=True)  # joblib.dump does not create missing folders
joblib.dump(model1, "Models/RandomForest_Model.pkl")
print("Model saved successfully!")
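A saved model is only useful if it can be loaded back and produces the same predictions. The round trip below is a sketch with a small stand-in forest trained on synthetic data and written to a temporary folder; the data and the temporary path are assumptions, and only the file name mirrors the one used above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data with nine features.
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary folder, then load the file back into a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "RandomForest_Model.pkl")
    joblib.dump(clf, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool((reloaded.predict(X_demo) == clf.predict(X_demo)).all())
print(f"Predictions identical after reload: {same_predictions}")
```

When loading in another environment, the scikit-learn version should match the one used for saving, since pickled estimators are not guaranteed to be compatible across versions.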


## ✅ Conclusion

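- The tuned Random Forest reached a best cross-validated ROC-AUC of **0.7736** on the SMOTE-resampled training data; held-out performance is given by the test-set metrics in the Model Evaluation section
- SMOTE equalized the class counts in the training set
- Feature importances indicate which water quality parameters the model relies on most
- After further validation, the saved model could support **real-time water quality monitoring**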

This model supports **safe drinking water assessment** using data-driven methods.
