## 📊 Dataset Overview  

Before analysis and visualization, we preview the first few rows of the dataset to understand its structure, its features, and the information available.


In [3]:
df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


# 💧 Water Quality Dataset – Feature Description

## 🔎 Key Features

- **pH**  
  Measure of the acidity or alkalinity of water. A neutral pH is 7; values below 7 indicate acidity, while values above 7 indicate alkalinity.

- **Hardness**  
  Concentration of dissolved calcium and magnesium salts, contributing to water hardness.

- **Solids (TDS)**  
  Total Dissolved Solids in water. High TDS may affect taste, odor, and overall quality.

- **Chloramines**  
  Disinfectant compound formed by mixing chlorine and ammonia, commonly used in water treatment.

- **Sulfate**  
  Concentration of sulfate ions in water. Excessive levels may affect taste and health.

- **Conductivity**  
  Ability of water to conduct electricity, directly related to the concentration of dissolved ions.

- **Organic_carbon**  
  Amount of organic carbon present, indicating possible contamination or pollutants.

- **Trihalomethanes (THMs)**  
  Chemical by-products formed when chlorine used for disinfection reacts with organic matter in the water.

- **Turbidity**  
  Cloudiness or haziness of water caused by suspended particles, affecting clarity.

- **Potability**  
  Indicates whether water is safe for human consumption.  
  - `1` → Drinkable  
  - `0` → Not drinkable
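
The pH rule and the `Potability` encoding described above can be expressed in a few lines of pandas. The following is a minimal sketch, not part of the original notebook: `ph_category` is a hypothetical helper introduced here for illustration, and `sample` holds only the `ph` and `Potability` values from the five preview rows.

```python
import numpy as np
import pandas as pd

# Values copied from the five preview rows (NaN marks a missing reading).
sample = pd.DataFrame({
    "ph": [np.nan, 3.71608, 8.099124, 8.316766, 9.092223],
    "Potability": [0, 0, 0, 0, 0],
})


def ph_category(ph):
    """Hypothetical helper: label a pH reading relative to neutral (7)."""
    if pd.isna(ph):
        return "unknown"
    if ph < 7:
        return "acidic"
    if ph > 7:
        return "alkaline"
    return "neutral"


# Apply the pH rule row by row.
sample["ph_category"] = sample["ph"].apply(ph_category)

# Translate the binary target into readable labels.
sample["potability_label"] = sample["Potability"].map(
    {1: "Drinkable", 0: "Not drinkable"}
)

print(sample)
```

Missing readings get their own `"unknown"` label so they are not silently classified as acidic or alkaline.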



## 📉 Missing Data Analysis  

To better understand the dataset quality, we calculate the **percentage of missing values** for each feature.  
This helps identify which columns may require data cleaning, imputation, or removal before further analysis.


In [None]:
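# Number of missing values per column
missing_count = df.isnull().sum()
# Total number of rows (the same for every column)
total = len(df)
# Share of missing values per column, in percent
percent = (missing_count / total) * 100
missing_data = pd.concat([missing_count, percent], axis=1, keys=['Total', 'Percent'])
missing_data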

| Column | Total | Percent |
|---|---|---|
| ph | 491 | 14.98779 |
| Hardness | 0 | 0.0 |
| Solids | 0 | 0.0 |
| Chloramines | 0 | 0.0 |
| Sulfate | 781 | 23.840049 |
| Conductivity | 0 | 0.0 |
| Organic_carbon | 0 | 0.0 |
| Trihalomethanes | 162 | 4.945055 |
| Turbidity | 0 | 0.0 |
| Potability | 0 | 0.0 |
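
Once the incomplete columns are known, the two usual options are imputation or row removal. The sketch below compares both on a tiny stand-in frame built from the preview rows. Median imputation is an assumption chosen for illustration; the notebook does not state which strategy it uses.

```python
import numpy as np
import pandas as pd

# The three incomplete columns, first five rows, copied from the preview.
sample = pd.DataFrame({
    "ph": [np.nan, 3.71608, 8.099124, 8.316766, 9.092223],
    "Sulfate": [368.516441, np.nan, np.nan, 356.886136, 310.135738],
    "Trihalomethanes": [86.99097, 56.329076, 66.420093, 100.341674, 31.997993],
})

# Option 1 (assumed strategy): fill each gap with that column's median.
imputed = sample.fillna(sample.median())

# Option 2: drop every row that has at least one missing value.
complete_rows = sample.dropna()

print(imputed.isnull().sum())
print(len(sample), "rows before dropna,", len(complete_rows), "rows after")
```

With almost 24% of `Sulfate` values missing, dropping rows would discard a large share of the data, which is one reason imputation is often considered first.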


# 📊 Data Overview

- **Shape:** 3276 rows × 10 columns  
- **Missing Values:**  
  - `ph` → 491  
  - `Sulfate` → 781  
  - `Trihalomethanes` → 162  
- **Data Types:** All numeric (`float64` features, `int64` target `Potability`)
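
A summary like the one above can be reproduced directly from the DataFrame. The following is a minimal sketch under stated assumptions: `df_demo` is a hypothetical stand-in containing three columns of the five preview rows, so its numbers differ from the full dataset. Running the same calls on the notebook's `df` should give the full-dataset figures.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: three columns of the five preview rows.
df_demo = pd.DataFrame({
    "ph": [np.nan, 3.71608, 8.099124, 8.316766, 9.092223],
    "Sulfate": [368.516441, np.nan, np.nan, 356.886136, 310.135738],
    "Potability": [0, 0, 0, 0, 0],
})

n_rows, n_cols = df_demo.shape                 # (rows, columns)
missing = df_demo.isnull().sum()               # missing values per column
missing = missing[missing > 0]                 # keep only incomplete columns
dtype_counts = df_demo.dtypes.value_counts()   # number of columns per dtype

print(f"Shape: {n_rows} rows x {n_cols} columns")
print(missing)
print(dtype_counts)
```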

