# Water Pollution in Europe
## Classification for Water Quality

In this project, our goal is to develop and train a machine learning model capable of accurately classifying the water quality index into three categories: Low, Medium, and High. To achieve this, we employ supervised machine learning techniques such as Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Multilayer Perceptron (MLP).

## Project Overview

### Problem Definition

Given chemical parameters in drinking water, can we accurately calculate the water quality index, classify it accordingly, and make accurate predictions using machine learning models?

### Exploratory Data Analysis (EDA)

The original data is provided by the [European Environment Agency (EEA)](https://www.eea.europa.eu/en).

#### Features

The features consist of chemical elements found in drinking water. The value range is set by the World Health Organization guidelines.

#### Data Dictionary

- **Monitoring Site ID:** Unique identification codes for monitoring locations following ISO standards.
- **Time Sampling Date:** The date of measurement in Year-Month-Day format.
- **Ammonium:** Concentration levels in milligrams per liter of water. Allowed range of Ammonium: 0.25 mg/L to 35 mg/L.
- **Nitrate:** Concentration levels in milligrams per liter of water. Allowed range of Nitrate: 3 mg/L to 50 mg/L.
- **Dissolved Oxygen:** Concentration levels in milligrams per liter of water. Allowed range of Dissolved Oxygen: 4 mg/L to 10 mg/L.
- **Total Nitrogen:** Concentration levels in milligrams per liter of water. Allowed range of Total Nitrogen: 0.5 mg/L to 2 mg/L.
- **Phosphate:** Concentration levels in milligrams per liter of water. Allowed range of Phosphate: 0.075 mg/L to 0.35 mg/L.
- **Total Phosphorus:** Concentration levels in milligrams per liter of water. Allowed range of Total Phosphorus: 0.15 mg/L to 0.7 mg/L.
- **Chlorophyll a:** Concentration levels in milligrams per liter of water. Allowed range of Chlorophyll a: 1 mg/L to 300 mg/L.
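
The allowed ranges listed above can be encoded as a lookup table and used to flag measurements that fall outside them. The following is a minimal sketch, assuming raw (unscaled) concentrations in a DataFrame whose column names match those of the processed CSV files. `ALLOWED_RANGES` and `flag_out_of_range` are hypothetical names introduced for illustration and are not part of the project code.

```python
import pandas as pd

# Allowed ranges in mg/L, copied from the data dictionary above.
ALLOWED_RANGES = {
    "Ammonium-mg/L": (0.25, 35.0),
    "Nitrate-mg/L": (3.0, 50.0),
    "DissolvedOxygen-mg/L": (4.0, 10.0),
    "TotalNitrogen-mg/L": (0.5, 2.0),
    "Phosphate-mg/L": (0.075, 0.35),
    "TotalPhosphorus-mg/L": (0.15, 0.7),
    "Chlorophyll-mg/L": (1.0, 300.0),
}


def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a value lies outside its allowed range."""
    flags = pd.DataFrame(index=df.index)
    for column, (low, high) in ALLOWED_RANGES.items():
        if column in df.columns:
            # between() is inclusive on both ends; missing values count as out of range.
            flags[column] = ~df[column].between(low, high)
    return flags


# Two made-up samples, for illustration only.
raw = pd.DataFrame({"Ammonium-mg/L": [0.5, 40.0], "Nitrate-mg/L": [2.0, 10.0]})
flags = flag_out_of_range(raw)
print(flags)
```

This check only makes sense before scaling: the values in the processed files are standardized and no longer expressed in mg/L.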

### Modeling

We will use the following models for supervised machine learning to achieve our goals:

- Decision Tree
- Support Vector Machine
- K-Nearest Neighbor

### Evaluation

We consider the project successful if a model reaches at least 95% accuracy in predicting the water quality index on the test dataset.
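
The success criterion can be written as a small check. This is a sketch with made-up labels; `TARGET_ACCURACY` and `accuracy` are hypothetical names, and the label strings `Low`, `Med`, and `High` are assumed from the `WQI` column shown in the data previews.

```python
import numpy as np

TARGET_ACCURACY = 0.95  # success threshold stated in the Evaluation section


def accuracy(y_true, y_pred) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels for illustration; real labels would come from the test set.
y_true = ["High", "High", "Med", "Low", "High"]
y_pred = ["High", "High", "Med", "High", "High"]

score = accuracy(y_true, y_pred)
meets_target = score >= TARGET_ACCURACY
print(f"accuracy={score:.2f}, meets 95% target: {meets_target}")
```

scikit-learn's `accuracy_score` computes the same quantity and could be used in place of the helper.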

### Final Results

Final results of evaluated metrics for each model.

**Note:** This notebook focuses on Modeling, Evaluation, and Final Results. For details on how the data is processed, see the [data_preprocessing](data_preprocessing.ipynb) notebook, where the dataset is cleaned and prepared for the later stages of the project.


# Modeling

In this phase of the project we dive into the heart of the mission: developing and training machine learning models that accurately classify the water quality index into three categories, **Low** (low level of pollution in the water), **Medium**, and **High** (high level of pollution in the water), which we calculated in the previous part of this project. To accomplish this, we employ a variety of supervised machine learning techniques, including Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Multilayer Perceptron (MLP) classifiers.

### Model Selection

1. Decision Tree: Decision trees are a popular choice for classification tasks. They work by recursively splitting the dataset into subsets based on the most significant attribute at each step. This hierarchical structure makes it easy to interpret and visualize the decision-making process.

2. Support Vector Machine (SVM): SVMs are excellent for binary and multi-class classification tasks. They seek to find the hyperplane that best separates different classes while maximizing the margin between them. SVMs can handle high-dimensional data efficiently.

3. K-Nearest Neighbor (KNN): KNN is a simple yet effective classification algorithm. It assigns a class label based on the majority class among its 'k' nearest neighbors. This model is non-parametric and can adapt to different data distributions.
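
The three model families described above are all available in scikit-learn. The sketch below fits each one on synthetic data that stands in for the seven scaled features and three WQI classes. The data, the hyperparameters, and the choice of scikit-learn are assumptions made for illustration, not the project's actual setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: three well-separated clusters with seven features each.
rng = np.random.default_rng(0)
centers = {"Low": -3.0, "Med": 0.0, "High": 3.0}
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 7)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 30)

# Hyperparameters are illustrative defaults, not tuned values from the project.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(X, y)
    train_accuracy[name] = model.score(X, y)  # mean accuracy on the given data
    print(f"{name}: training accuracy {train_accuracy[name]:.2f}")
```

Because all three classes share the same `fit`/`predict`/`score` interface, the models can be trained and compared in a single loop.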

## Import Libraries

```python
import pandas as pd
import numpy as np
```

## Load Data
We load the processed training, validation, and test data from the CSV files in the local `data` directory (the extracted `data.zip` archive).

```python
# Paths to the processed training, validation, and test splits
train_data = './data/training_dataframe.csv'
val_data = './data/validation_dataframe.csv'
test_data = './data/test_dataframe.csv'
```

```python
train = pd.read_csv(train_data)
train.head()
```

| | MonitoringSiteID | TimeSamplingDate | Ammonium-mg/L | Nitrate-mg/L | DissolvedOxygen-mg/L | TotalNitrogen-mg/L | Phosphate-mg/L | TotalPhosphorus-mg/L | Chlorophyll-mg/L | WQI |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IT10GEN1 | 2016-10-03 00:00:00 | -0.042689 | -0.042892 | -1.197668 | 2.676997 | -0.345265 | 10.960912 | -0.015047 | High |
| 1 | IT10TVR7 | 2016-07-04 00:00:00 | -0.042689 | -0.042892 | -1.197668 | 2.676997 | -0.345265 | 10.960912 | -0.015047 | High |
| 2 | ITR110199CH | 2016-12-15 00:00:00 | -0.042689 | -0.042892 | 1.122891 | -0.739142 | -0.345265 | -0.124127 | -0.015047 | High |
| 3 | ITR110074ACE | 2016-06-14 00:00:00 | -0.042689 | -0.042892 | -0.037388 | -0.739142 | -0.345265 | -0.124127 | -0.015047 | Med |
| 4 | ITR110074ACE | 2016-11-10 00:00:00 | -0.042689 | -0.042892 | 1.122891 | -0.739142 | -0.345265 | -0.124127 | -0.015047 | High |
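
The three splits share the same columns, so loading them can be wrapped in one helper that also parses the sampling date. This is a sketch only: `load_split` is a hypothetical name, the in-memory CSV stands in for the real files, and it assumes the files carry no separate index column.

```python
import io

import pandas as pd


def load_split(source) -> pd.DataFrame:
    """Read one split and parse TimeSamplingDate into datetime values."""
    return pd.read_csv(source, parse_dates=["TimeSamplingDate"])


# In-memory stand-in for ./data/training_dataframe.csv (most feature columns omitted).
csv_text = (
    "MonitoringSiteID,TimeSamplingDate,Ammonium-mg/L,WQI\n"
    "IT10GEN1,2016-10-03 00:00:00,-0.042689,High\n"
    "ITR110074ACE,2016-06-14 00:00:00,-0.042689,Med\n"
)
frame = load_split(io.StringIO(csv_text))
print(frame.dtypes)
```

With the real data, the helper would be called once per path, for example `load_split(train_data)`.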


```python
val = pd.read_csv(val_data)
val.head()
```

| | MonitoringSiteID | TimeSamplingDate | Ammonium-mg/L | Nitrate-mg/L | DissolvedOxygen-mg/L | TotalNitrogen-mg/L | Phosphate-mg/L | TotalPhosphorus-mg/L | Chlorophyll-mg/L | WQI |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SE662925-154156 | 2012-08-09 00:00:00 | -0.042689 | -0.042892 | 1.122891 | -0.739142 | 2.335554 | 5.418393 | -0.015047 | High |
| 1 | SE627500-151900 | 2008-08-12 00:00:00 | -0.042689 | -0.042892 | 1.122891 | 0.968928 | 2.335554 | 5.418393 | -0.015047 | High |
| 2 | SE627500-151900 | 2012-04-20 00:00:00 | -0.042689 | -0.042892 | 1.122891 | 0.968928 | 2.335554 | 5.418393 | -0.015047 | High |
| 3 | SE637654-150206 | 2010-04-13 00:00:00 | -0.042689 | -0.042892 | 1.122891 | -0.739142 | 2.335554 | 5.418393 | -0.015047 | High |
| 4 | SE642900-144100 | 2009-09-08 00:00:00 | -0.042689 | -0.042892 | 1.122891 | 0.968928 | 2.335554 | 5.418393 | -0.015047 | High |


```python
test = pd.read_csv(test_data)
# A cell displays only its last expression, so show the summary statistics directly.
test.describe()
```

| | Ammonium-mg/L | Nitrate-mg/L | DissolvedOxygen-mg/L | TotalNitrogen-mg/L | Phosphate-mg/L | TotalPhosphorus-mg/L | Chlorophyll-mg/L |
|---|---|---|---|---|---|---|---|
| count | 8020.0 | 8020.0 | 8020.0 | 8020.0 | 8020.0 | 8020.0 | 8020.0 |
| mean | 0.486056 | 2.338568 | 0.334567 | 0.976382 | -0.178466 | 0.123282 | -0.007573 |
| std | 3.648416 | 6.360909 | 0.914434 | 1.139168 | 0.860369 | 1.62584 | 0.669387 |
| min | -0.042689 | -0.042892 | -1.197668 | -0.739142 | -0.345265 | -0.124127 | -0.015047 |
| 25% | -0.042689 | -0.042892 | -0.037388 | 0.968928 | -0.345265 | -0.124127 | -0.015047 |
| 50% | -0.042689 | -0.042892 | 1.122891 | 0.968928 | -0.345265 | -0.124127 | -0.015047 |
| 75% | -0.042689 | -0.042892 | 1.122891 | 0.968928 | -0.345265 | -0.124127 | -0.015047 |
| max | 25.657518 | 30.589517 | 1.122891 | 2.676997 | 5.016372 | 10.960912 | 59.931509 |
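
Before a classifier can be fitted, each loaded frame has to be separated into a feature matrix and a label vector. The sketch below assumes that the site ID and sampling date are not used as features, which the notebook does not state. `ID_COLUMNS`, `TARGET`, and `split_features_target` are hypothetical names.

```python
import pandas as pd

ID_COLUMNS = ["MonitoringSiteID", "TimeSamplingDate"]  # assumed to be non-features
TARGET = "WQI"


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the remaining feature columns and the WQI label column."""
    drop = [c for c in ID_COLUMNS + [TARGET] if c in df.columns]
    return df.drop(columns=drop), df[TARGET]


# One row taken from the training preview, with most feature columns omitted.
sample = pd.DataFrame({
    "MonitoringSiteID": ["IT10GEN1"],
    "TimeSamplingDate": ["2016-10-03 00:00:00"],
    "Ammonium-mg/L": [-0.042689],
    "DissolvedOxygen-mg/L": [-1.197668],
    "WQI": ["High"],
})
X, y = split_features_target(sample)
print(list(X.columns), y.tolist())
```

Applying the same function to the training, validation, and test frames keeps the column order identical across the three splits.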


The values of the Dataframes are scaled with the `StandardScaler` and 