# 🏠 Problem Definition
# 🎯 Objective:
# The goal is to determine whether water is potable based on various measured parameters.

# 📌 This is a binary classification problem: a machine learning model predicts the potability status (`1` = drinkable, `0` = not drinkable) for new input parameters.


In [1]:
# Data Collection
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer

# Data partitioning and model tuning
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, KFold

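# Regression models
# Note: Potability is a binary target, so the classifier counterparts
# (e.g. RandomForestClassifier) may be a better fit for this problem.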
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# Model evaluation (cross_val_score and KFold are already imported above)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Model explainability (SHAP)
import shap




In [2]:
# 1) Inspect the current root handlers and their format strings.
import logging, sys

print("Current root handlers and their format strings:")
for i, h in enumerate(logging.root.handlers):
    fmt = None
    try:
        fmt = h.formatter._fmt if h.formatter else None
    except Exception as e:
        fmt = f"<error reading formatter: {e}>"
    print(i, type(h), "format:", fmt)

# 2) Remove any stale or misconfigured handlers.
for h in logging.root.handlers[:]:
    logging.root.removeHandler(h)

# 3) Attach new handlers (file + console) that share one formatter.
logger = logging.getLogger()         # root logger
logger.setLevel(logging.INFO)

correct_fmt = "%(asctime)s - %(levelname)s - %(message)s"
formatter = logging.Formatter(correct_fmt)

# File handler
fh = logging.FileHandler("Info_Log.log", mode="a", encoding="utf-8")
fh.setFormatter(formatter)
logger.addHandler(fh)

# Console/stream handler (usually useful for Jupyter)
sh = logging.StreamHandler(sys.stdout)
sh.setFormatter(formatter)
logger.addHandler(sh)

# 4) Test: write two messages to confirm both handlers work.
logger.info("Logging reconfigured successfully (test message).")
logger.error("If you see this message, logging is working.")

# 5) Load the dataset (pandas is already imported as pd in the first cell).
try:
    df = pd.read_csv("water_potability.csv")
    logger.info("Data Set Uploaded Successfully")
except FileNotFoundError as e:
    logger.error("Data Set Not Found: %s", e)
except Exception as e:
    # logger.exception also records the full traceback
    logger.exception("Other error while loading dataset: %s", e)



Current root handlers and their format strings:
2025-08-28 14:48:59,579 - INFO - Logging reconfigured successfully (test message).
2025-08-28 14:48:59,580 - ERROR - If you see this message, logging is working.
2025-08-28 14:48:59,605 - INFO - Data Set Uploaded Successfully
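
The logging setup above can also be wrapped in a function so that re-running the cell never stacks duplicate handlers. The following is a minimal sketch, not part of the original notebook: the name `configure_logging` and its parameters are assumptions made for illustration, and it additionally closes the removed handlers so the log file is released.

```python
import logging
import sys


def configure_logging(log_path="Info_Log.log", level=logging.INFO):
    """Reset the root logger and attach one file and one console handler.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    root = logging.getLogger()

    # Remove (and close) handlers left over from earlier cell runs.
    for handler in root.handlers[:]:
        root.removeHandler(handler)
        handler.close()

    root.setLevel(level)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    # Same two destinations as in the cell above: a log file and stdout.
    for handler in (
        logging.FileHandler(log_path, mode="a", encoding="utf-8"),
        logging.StreamHandler(sys.stdout),
    ):
        handler.setFormatter(formatter)
        root.addHandler(handler)
    return root


logger = configure_logging()
logger.info("Logging configured.")
```

Because the function clears the root handlers first, calling it a second time replaces the handlers instead of adding new ones, so each message is still written once per destination.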


## 📊 Dataset Overview  

Before performing analysis and visualization, let's first take a look at the dataset to understand its structure, features, and available information.


In [3]:
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


# 💧 Water Quality Dataset – Feature Description

## 🔎 Key Features

- **pH**  
  Measure of the acidity or alkalinity of water. A neutral pH is 7; values below 7 indicate acidity, while values above 7 indicate alkalinity.

- **Hardness**  
  Concentration of dissolved calcium and magnesium salts, contributing to water hardness.

- **Solids (TDS)**  
  Total Dissolved Solids in water. High TDS may affect taste, odor, and overall quality.

- **Chloramines**  
  Disinfectant compound formed by mixing chlorine and ammonia, commonly used in water treatment.

- **Sulfate**  
  Concentration of sulfate ions in water. Excessive levels may affect taste and health.

- **Conductivity**  
  Ability of water to conduct electricity, directly related to the concentration of dissolved ions.

- **Organic_carbon**  
  Amount of organic carbon present, indicating possible contamination or pollutants.

- **Trihalomethanes (THMs)**  
  Chemical by-products formed during the disinfection process with chlorine.

- **Turbidity**  
  Cloudiness or haziness of water caused by suspended particles, affecting clarity.

- **Potability**  
  Indicates whether water is safe for human consumption.  
  - `1` → Drinkable  
  - `0` → Not drinkable



## 📉 Missing Data Analysis  

To better understand the dataset quality, we calculate the **percentage of missing values** for each feature.  
This helps identify which columns may require data cleaning, imputation, or removal before further analysis.


In [None]:
# Count missing values per column and express them as a percentage of all rows
missing_count = df.isnull().sum()
total = df.isnull().count()
percent = (missing_count / total) * 100
missing_data = pd.concat([missing_count, percent], axis=1, keys=['Total', 'Percent'])
missing_data

| | Total | Percent |
|---|---|---|
| ph | 491 | 14.98779 |
| Hardness | 0 | 0.0 |
| Solids | 0 | 0.0 |
| Chloramines | 0 | 0.0 |
| Sulfate | 781 | 23.840049 |
| Conductivity | 0 | 0.0 |
| Organic_carbon | 0 | 0.0 |
| Trihalomethanes | 162 | 4.945055 |
| Turbidity | 0 | 0.0 |
| Potability | 0 | 0.0 |
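
The same summary can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below is an illustration only: the helper name `missing_summary` is hypothetical, and the small `toy` frame is made-up data standing in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column.

    Hypothetical helper, not part of the original notebook.
    """
    is_missing = frame.isna()
    summary = pd.DataFrame({
        "Total": is_missing.sum(),           # number of missing cells
        "Percent": is_missing.mean() * 100,  # share of rows that are missing
    })
    # Show the most affected columns first.
    return summary.sort_values("Percent", ascending=False)


# Made-up toy data, NOT the real water_potability.csv values.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, np.nan],
    "Sulfate": [np.nan, 350.0, 310.0, 333.0],
    "Turbidity": [3.0, 4.5, 3.1, 4.6],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` should report the same counts and percentages as the table above, sorted so that `Sulfate` appears first.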


# 📊 Data Overview

- **Shape:** 3276 rows × 10 columns  
- **Missing Values:**  
  - `ph` → 491  
  - `Sulfate` → 781  
  - `Trihalomethanes` → 162  
- **Data Types:** All numeric (`float64` features, `int64` target)



## 🛠 Handling Missing Data  

To ensure data quality and reliability, we address the missing values in key features.  
This step is crucial before performing any statistical analysis or machine learning tasks.


In [None]:
# Impute missing values in each affected column with that column's mean
for col in ['ph', 'Sulfate', 'Trihalomethanes']:
    df[col] = df[col].fillna(df[col].mean())

In [4]:
df.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
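
The cell above fills gaps with means computed over the whole dataset. If the data is later split into training and test sets, a common alternative is to learn the means from the training rows only and reuse them on the test rows, so that test values do not influence the imputation. The following is a minimal sketch under that assumption, using `SimpleImputer` and `train_test_split` (both imported in the first cell); the `toy` frame, the 25% test size, and the random seed are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up toy data, NOT the real water_potability.csv values.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.0, 6.0, np.nan, 9.0, 7.5, 6.5],
    "Sulfate": [330.0, 350.0, np.nan, 310.0, 340.0, np.nan, 320.0, 360.0],
    "Potability": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = toy.drop(columns="Potability")
y = toy["Potability"]

# Split first, so the test rows stay unseen during imputation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Learn the column means on the training rows only ...
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
# ... and apply those same means to the test rows.
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_imp.isna().sum().sum(), X_test_imp.isna().sum().sum())
```

The fitted means are available as `imputer.statistics_`. The same imputer can also be placed as a step inside a `Pipeline`, which keeps the fit-on-train-only behavior during cross-validation.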