# **Water Quality using Random Forest Classification - Knightbearr**

# **Workflow**

1. Data Collection
2. Data Cleaning & Checking
3. EDA
4. Splitting Data
5. Oversampling (SMOTE)
6. Modeling
7. Prediction Score


Note: Sorry if my English is bad or if I made any mistakes. Thanks in advance!

# **Import Libraries**

Import the modules used in this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_regression
from sklearn.pipeline import Pipeline

from sklearn import preprocessing
from sklearn import metrics
from imblearn.over_sampling import SMOTE
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### **Setup Libraries**

In [None]:
pd.set_option('display.width', 150)
pd.set_option('display.max_columns', 12)

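# Newer Matplotlib versions renamed this style; fall back to the old name if needed
whitegrid = "seaborn-v0_8-whitegrid"
plt.style.use(whitegrid if whitegrid in plt.style.available else "seaborn-whitegrid")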

sns.set_theme(
    color_codes=True, 
    style='darkgrid', 
    palette='deep', 
    font='sans-serif'
)

# **Load The Dataset**

Load the dataset we want to analyze.

In [None]:
water_data = pd.read_csv('../input/water-potability/water_potability.csv')

# **Clean and Checking the Data**

We must check the data every time we build a model, because this step is important. If the dataset turns out to be messy, we have to clean it whether we want to or not.

In [None]:
# Checking the first 5 rows of data
water_data.head()

In [None]:
# Checking the last 5 rows of data
water_data.tail()

In [None]:
# Getting the statistical measure info
water_data.describe()

In [None]:
# Getting the information about the dataset
water_data.info()

In [None]:
# Checking the null data
water_data.isnull().sum()

In [None]:
# Checking the shape of data
water_data.shape

**First**, we handle the null values in the `ph` column. **pH values below 6.5 or above 8.5 are not suitable** for consumption; 6.5 to 8.5 is the range recommended by the WHO, and the **6.52 - 6.83 range mentioned for this dataset falls within that standard**. From the statistics above, the **mean pH is about 7**, which is still within the range that is safe for drinking, so we fill the missing pH values with the mean so the model can use those rows.

In [None]:
# Create a new variable named meanPh to hold the value of mean in ph
meanPh = water_data['ph'].mean()

# Fill the blank/null data ph with value of meanPh
water_data['ph'] = water_data['ph'].fillna(meanPh)

# Check the null value and print
print(f'Null Value : {water_data.ph.isnull().sum()}\n')

# Check the data ph
water_data['ph']
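To make the effect of mean imputation easier to see, here is a minimal sketch on a small made-up Series (`ph_demo` and its values are assumptions for illustration, not rows from the dataset). It checks how many observed values fall inside the 6.5 to 8.5 range and shows that filling with the mean keeps the mean unchanged but shrinks the spread.

```python
import numpy as np
import pandas as pd

# Hypothetical pH readings with two missing values (illustration only)
ph_demo = pd.Series([6.0, 7.2, np.nan, 8.9, 7.0, np.nan, 6.8])

# Share of observed readings inside the 6.5 to 8.5 range (NaN counts as outside)
in_range = ph_demo.between(6.5, 8.5)
share_in_range = in_range.sum() / ph_demo.notna().sum()

# Mean imputation: every NaN is replaced by the mean of the observed values
ph_filled = ph_demo.fillna(ph_demo.mean())

print(f'Share in range : {share_in_range}')
print(f'Mean before/after : {ph_demo.mean()} / {ph_filled.mean()}')
print(f'Std before/after  : {ph_demo.std()} / {ph_filled.std()}')
```

The smaller standard deviation after filling is the usual trade-off of mean imputation: no rows are lost, but the column looks a little less variable than it really is.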

**Now**, we handle the null values in `Sulfate`. **Sulfate is one of the important ions in water** because of its effect on humans when it is present in large quantities. **The maximum sulfate limit is about 250 mg/L** for water intended for human consumption. In the statistics above, **the mean of Sulfate is about 333 mg/L**, which is above that limit. Note that pandas skips null values when it computes the mean, so the nulls themselves do not raise it; the observed measurements are simply high. Instead of the plain mean, the nulls are filled with the mean minus the minimum, which gives a lower fill value.

In [None]:
# Create a new variable named waterSulfate to hold the fill value (mean minus minimum)
waterSulfate = (water_data.Sulfate.mean() - water_data.Sulfate.min())

# Fill the blank/null Sulfate data with the value of waterSulfate
water_data['Sulfate'] = water_data['Sulfate'].fillna(waterSulfate)

# Check the null value and print
print(f'Null Value : {water_data.Sulfate.isnull().sum()}')

# Check the data Sulfate
water_data['Sulfate']
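As a small check of how null values interact with the mean, the sketch below uses a made-up Series (`sulfate_demo` is an assumption for illustration). It shows that `Series.mean()` ignores NaN, and compares what happens to the column mean when the nulls are filled with the mean versus the mean minus the minimum.

```python
import numpy as np
import pandas as pd

# Hypothetical sulfate readings in mg/L with two missing values
sulfate_demo = pd.Series([300.0, 350.0, np.nan, 400.0, np.nan, 250.0])

# mean() skips NaN by default, so both lines give the same number
mean_skipna = sulfate_demo.mean()
mean_of_observed = sulfate_demo.dropna().mean()

# Option 1: fill with the mean (column mean stays the same)
fill_mean = sulfate_demo.fillna(mean_skipna)

# Option 2: fill with mean minus minimum (column mean is pulled down)
fill_shifted = sulfate_demo.fillna(mean_skipna - sulfate_demo.min())

print(f'Mean ignoring NaN     : {mean_skipna}')
print(f'Mean after mean fill  : {fill_mean.mean()}')
print(f'Mean after shifted fill : {fill_shifted.mean()}')
```

The second option changes the distribution of the column, so it is worth comparing model scores with both fills before settling on one.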

**And last**, we deal with Trihalomethanes. What are they? **Trihalomethanes (THMs) are among the most dangerous chemical compounds** that find their way into the water supply. The concentration of THMs in drinking water varies with the level of organic matter in the water, the amount of chlorine required to treat the water, and the temperature of the treated water. **THM levels up to 80 ppm are considered safe** in drinking water. In the statistics above, the maximum THM value is well above that limit, which means some samples are not safe to consume.


In [None]:
# Created a new variable named waterTrihalomethanes to hold the value
waterTrihalomethanes = water_data.Trihalomethanes.mean()

# Fill the blank/null data Trihalomethanes with the value of waterTrihalomethanes
water_data['Trihalomethanes'] = water_data['Trihalomethanes'].fillna(waterTrihalomethanes)

# Check the null value and print
print(f'Null Value : {water_data.Trihalomethanes.isnull().sum()}')

# Check the data Trihalomethanes
water_data['Trihalomethanes']

**Checking again**

In [None]:
# Checking data and print
print(f'Isnull ? :\n{water_data.isnull().sum()}\n')

# Check whether each column contains at least one non-zero value
print(f'Any non-zero value per column? :\n{water_data.any()}\n')

# Count the value and print
print(f'Potability :\n{water_data.Potability.value_counts()}\n')

# Check the shape and print
print(f'Data Shape : {water_data.shape}')

In [None]:
# Getting the statistical measure info
water_data.describe()

# **EDA**

In this step we analyze and investigate the dataset and summarize its main characteristics, mostly with data visualization.

In [None]:
# Build a correlation matrix to see the strength and direction of linear relationships
corr = water_data.corr()
corr

In [None]:
# Checking the structure of the data
sample = water_data.sample(11, random_state=42).T
sample

In [None]:
# Constructing a heatmap to understand the correlation
plt.figure(figsize=(12, 10))

sns.heatmap(
    corr, 
    cbar=True, 
    square=True, 
    fmt='.1f', 
    annot=True,
    annot_kws={'size': 8},
    cmap='YlGnBu'
)

plt.show()

You see that? That's not good: no feature has a strong linear correlation with Potability.
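A low correlation does not always mean a feature is useless, because the Pearson coefficient only measures linear relationships. The sketch below is a small illustration with synthetic data (`pearson_r` is a helper written for this example, not part of the notebook): a perfectly linear relation gives a coefficient of about 1, while a perfectly curved one gives about 0.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

x = np.linspace(-1, 1, 201)

r_linear = pearson_r(x, 3 * x + 1)  # straight line
r_curved = pearson_r(x, x ** 2)     # parabola, fully determined by x

print(f'Linear relation : {r_linear:.3f}')
print(f'Curved relation : {r_curved:.3f}')
```

So the heatmap tells us there is no strong linear link, but a non-linear model such as a random forest may still find useful structure.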

In [None]:
# Create Regression Plot
sns.regplot(
    x='Hardness', 
    y='Potability', 
    data=water_data
)

plt.show()

In [None]:
# Create a histogram plot
water_data.hist(figsize=(12,12))
plt.show()

**Coefficient of Variation**

The coefficient of variation (standard deviation divided by the mean) is a measure of relative dispersion that can be used to compare distributions that have different units.

* **The higher the Coefficient of Variation** = the wider the spread of the data relative to its mean (more difficult to predict)
* **The lower the Coefficient of Variation** = the narrower the spread of the data relative to its mean (easier to predict)

In [None]:
# Coefficient of variation of Potability
covPota = ((water_data['Potability'].std()/water_data['Potability'].mean()) * 100)
print(f'Coefficient Of Variation Potability : {covPota}%')

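As the output above shows, the coefficient of variation is very high, which suggests Potability is difficult to predict. Keep in mind that for a 0/1 column this value is determined almost entirely by the share of 1s, so it mostly reflects the class balance.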
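For a 0/1 column the coefficient of variation can be worked out by hand. A minimal sketch, assuming made-up class shares (the 50% and 39% below are illustrative, and `coefficient_of_variation` is a helper written for this example): with a share `p` of 1s, the population standard deviation is `sqrt(p * (1 - p))` and the mean is `p`, so the coefficient is `sqrt((1 - p) / p) * 100`.

```python
import numpy as np

def coefficient_of_variation(values):
    """Std / mean in percent, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean() * 100

# Hypothetical binary targets: 50% ones and 39% ones
balanced = np.array([1] * 50 + [0] * 50)
imbalanced = np.array([1] * 39 + [0] * 61)

cv_balanced = coefficient_of_variation(balanced)
cv_imbalanced = coefficient_of_variation(imbalanced)

# Closed form for a binary column with share p of ones
closed_form = np.sqrt((1 - 0.39) / 0.39) * 100

print(f'CV balanced   : {cv_balanced:.2f}%')
print(f'CV imbalanced : {cv_imbalanced:.2f}%')
print(f'Closed form   : {closed_form:.2f}%')
```

Note that pandas `std()` uses the sample standard deviation (`ddof=1`), so the notebook's number differs very slightly from this population version.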

In [None]:
# Getting the Mutual Information about the data
X = water_data.copy()
y = X.pop('Potability')

# All discrete features should now have integer dtypes
discreateFeatures = X.dtypes == int

In [None]:
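# Make a function
# Potability is a class label, so mutual_info_classif is used here
# instead of mutual_info_regression (which expects a continuous target)
from sklearn.feature_selection import mutual_info_classif

def makeMiScores(X, y, discreateFeatures):
    miScores = mutual_info_classif(
        X, y, discrete_features=discreateFeatures, random_state=42
    )
    miScores = pd.Series(miScores, name='MI Scores', index=X.columns)
    miScores = miScores.sort_values(ascending=False)
    return miScores

miScores = makeMiScores(X, y, discreateFeatures)
miScores # show the features with their MI scores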

In [None]:
# And now bar plot to make comparisons easier
def plotMiScores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

# Create the figure and plot the scores
plt.figure(dpi=100, figsize=(6, 3))
plotMiScores(miScores)

Data visualization is a great follow-up to a utility ranking. In the original run, **Hardness, Conductivity, Organic_carbon, Turbidity, Solids, and ph show some mutual information** with Potability.
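To give a feeling for what a mutual information score measures, here is a minimal sketch for two discrete variables, computed directly from the definition. The helper `discrete_mutual_info` and the toy arrays are assumptions for illustration; scikit-learn uses a nearest-neighbour estimator for continuous features, so its numbers are not computed this way.

```python
import numpy as np

def discrete_mutual_info(x, y):
    """Mutual information in nats: sum of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for x_value in np.unique(x):
        for y_value in np.unique(y):
            p_xy = np.mean((x == x_value) & (y == y_value))
            if p_xy > 0:
                p_x = np.mean(x == x_value)
                p_y = np.mean(y == y_value)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Toy feature with three levels, and a label fully determined by it
x = np.array([-1, 0, 1] * 100)
y_dependent = (x != 0).astype(int)

# Same labels in random order: the link to x is destroyed
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y_dependent)

mi_dependent = discrete_mutual_info(x, y_dependent)
mi_shuffled = discrete_mutual_info(x, y_shuffled)

print(f'MI with dependent label : {mi_dependent:.3f}')
print(f'MI with shuffled label  : {mi_shuffled:.3f}')
```

A score near zero means knowing the feature tells us almost nothing about the label, while larger scores mean the feature reduces our uncertainty about it.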

# **Splitting the Data**

Separate the features from the target, then split the data with `train_test_split` from sklearn.

In [None]:
# Divide the data
X = water_data.drop(['Potability'], axis=1)
y = water_data['Potability']

In [None]:
# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=42
)

In [None]:
# Checking the Target
y_train.value_counts()

In [None]:
# Checking the Target
y_test.value_counts()
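If the class shares in the train and test targets above differ more than you would like, `train_test_split` has a `stratify` parameter that keeps the proportions the same in both parts. The sketch below uses made-up arrays (`X_demo`, `y_demo`) and is only an option to consider; the split used in this notebook is not stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target with 60 zeros and 40 ones
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 60/40 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=.3, random_state=42, stratify=y_demo
)

train_ratio = y_tr.mean()
test_ratio = y_te.mean()

print(f'Share of 1s in train : {train_ratio:.2f}')
print(f'Share of 1s in test  : {test_ratio:.2f}')
```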

# **Upsampling the Target**

Upsample the minority class with SMOTE, because the potable and non-potable classes are clearly imbalanced. Only the training split is resampled, so the test set keeps its original distribution.

In [None]:
sm = SMOTE(random_state=42)
X_train_res, Y_train_res = sm.fit_resample(X_train, y_train)
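The core idea behind SMOTE is to create new minority samples on the line between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea in NumPy; `smote_like_sample` and the toy points are assumptions for this example, and it is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_sample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample
        i = rng.integers(len(X_minority))
        # Find its k nearest minority neighbours (index 0 is the sample itself)
        distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]
        j = rng.choice(neighbours)
        # New point = sample + random fraction of the way to the neighbour
        fraction = rng.random()
        synthetic.append(X_minority[i] + fraction * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class points with two features
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [2.5, 2.4], [3.0, 1.8]])
new_points = smote_like_sample(X_min, n_new=5)

print(new_points)
```

Because every synthetic point lies between two real ones, the new samples stay inside the region already covered by the minority class instead of being exact copies.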

# **Train and Fit the model**

Train and fit the model using the **RandomForestClassifier** algorithm. **Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

In [None]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('transformer', QuantileTransformer(
        random_state=42)
    ),
    ('model', RandomForestClassifier(
        n_estimators=620, 
        min_samples_leaf=1, 
        random_state=42))
])

pipe.fit(X_train_res, Y_train_res)
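One practical benefit of a pipeline is that the preprocessing steps learn their statistics only from the data passed to `fit`, and the same fitted steps are reused at prediction time. Here is a minimal sketch on synthetic data (`demo_pipe`, `X_demo`, and the smaller forest are assumptions chosen to keep the example fast, not the settings used above).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, label depends on the first two
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 10).astype(int)

X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150]

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42))
])
demo_pipe.fit(X_fit, y_fit)

# The scaler inside the pipeline only saw X_fit
learned_means = demo_pipe.named_steps['scaler'].mean_

# predict() scales X_new with those stored means, then runs the forest
demo_pred = demo_pipe.predict(X_new)

print(f'Means learned by the scaler : {learned_means}')
print(f'Number of predictions       : {len(demo_pred)}')
```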

In [None]:
# Train Predict
pred_train = pipe.predict(X_train_res)
print(metrics.classification_report(Y_train_res, pred_train))

In [None]:
# Test Predict
pred_test = pipe.predict(X_test)
print(metrics.classification_report(y_test, pred_test))
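To read the classification report, it helps to know how precision, recall, and F1 are computed. The sketch below works them out by hand for class 1 on a tiny made-up example (`y_true_demo` and `y_pred_demo` are assumptions for illustration, not results from the model above).

```python
# Hypothetical labels and predictions for ten samples
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # potable, predicted potable
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not potable, predicted potable
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # potable, predicted not potable
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not potable, predicted not potable

precision = tp / (tp + fp)  # of all predicted 1s, how many are really 1
recall = tp / (tp + fn)     # of all real 1s, how many did we find
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(f'Precision : {precision:.2f}')
print(f'Recall    : {recall:.2f}')
print(f'F1 score  : {f1:.2f}')
print(f'Accuracy  : {accuracy:.2f}')
```

For water potability, a false positive (unsafe water predicted as potable) is the more dangerous mistake, so the precision for class 1 deserves extra attention.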

> # **That's it! don't forget to give me feedback and upvote if you like it! thanks in advance!**

## **Here are my other notebooks:**

**Data Analysis and Visualization:**

- [World Covid Vaccination](https://www.kaggle.com/knightbearr/data-visualization-world-vaccination-knightbearr)
- [Netflix Time Series Visualization](https://www.kaggle.com/knightbearr/netflix-visualization-time-series-knightbearr)
- [Taiwan Weight Stock Analysis](https://www.kaggle.com/knightbearr/taiwan-weight-stock-index-analysis-knightbearr)

**Regression and Classification:**

- [S&P 500 Companies](https://www.kaggle.com/knightbearr/pricesales-eda-rfr-knightbearr)
- [Credit Card Fraud Detection](https://www.kaggle.com/knightbearr/credit-card-fraud-detection-knightbearr)
- [Car Price V3](https://www.kaggle.com/knightbearr/car-price-v3-xgbregressor-knightbearr)
- [House Price Iran](https://www.kaggle.com/knightbearr/house-price-iran-knightbearr)

**Deep Learning:**

- [Rock Paper Scissors](https://www.kaggle.com/knightbearr/rock-paper-scissors-knightbearr)

**Some Python Code:**

- [Python Cheat Sheet](https://www.kaggle.com/knightbearr/python-cheat-sheet-knightbearr)
- [22 Python Programs](https://www.kaggle.com/knightbearr/22-simple-python-program-knightbearr)