# FEATURES AND INFORMATION

1. pH value:
* pH is an important parameter for evaluating the acid–base balance of water and indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The values in the current investigation ranged from 6.52 to 6.83, which is within the WHO range.

2. Hardness:
* Hardness is mainly caused by calcium and magnesium salts, which dissolve from the geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, an effect caused by calcium and magnesium.

3. Solids (Total dissolved solids - TDS):
* Water can dissolve a wide range of inorganic and some organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, and sulfates. These minerals produce an unwanted taste and a diluted color in the water. TDS is an important parameter for water use: a high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L, and the maximum limit prescribed for drinking water is 1000 mg/L.

4. Chloramines:
* Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

5. Sulfate:
* Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

6. Conductivity:
* Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water, so in general the amount of dissolved solids determines the conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

7. Organic_carbon:
* Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in water. According to the US EPA, TOC should be below 2 mg/L in treated drinking water and below 4 mg/L in source water that is used for treatment.

8. Trihalomethanes:
* THMs are chemicals that may be found in water treated with chlorine. The concentration of THMs in drinking water varies with the level of organic material in the water, the amount of chlorine required to treat it, and the temperature of the water being treated. THM levels up to 80 ppb (µg/L) are considered safe in drinking water.

9. Turbidity:
* The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

10. Potability:
* Indicates whether water is safe for human consumption: 1 means potable and 0 means not potable.
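
The guideline values quoted above can be collected in one lookup table and checked against a single sample. The sketch below is illustrative only: `LIMITS` and `out_of_range` are hypothetical names, the keys are assumed to match the dataset's column names, and the bounds are simply the numeric figures cited in this section, in the dataset's own units.

```python
# Guideline bounds quoted in the feature descriptions: (lower, upper).
# None means no bound on that side. All names here are illustrative assumptions.
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),       # maximum TDS limit
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Trihalomethanes": (None, 80.0),
    "Turbidity": (None, 5.0),
}


def out_of_range(sample):
    """Return the names of the parameters in `sample` that violate LIMITS."""
    flagged = []
    for name, (low, high) in LIMITS.items():
        value = sample.get(name)
        if value is None:
            continue  # parameter not measured for this sample
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged


# Made-up sample, not taken from the dataset.
sample = {"ph": 9.1, "Solids": 850.0, "Turbidity": 3.9, "Conductivity": 426.0}
print(out_of_range(sample))
```

Keeping the limits in one table makes it easier to review them against the cited standards than scattering the numbers across filter expressions.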

# PACKAGES AND LIBRARIES

In [None]:
!pip install dataprep

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
from mpl_toolkits.mplot3d import Axes3D
import statsmodels.api as sm
import missingno as msno
import statsmodels.stats.api as sms
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.neighbors import LocalOutlierFactor
from scipy.stats import levene
from scipy.stats import shapiro
from scipy.stats import pearsonr
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from lightgbm import LGBMRegressor, LGBMClassifier
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import tree
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, roc_curve
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.manifold import Isomap,TSNE
from sklearn.feature_selection import mutual_info_classif
from tqdm.notebook import tqdm
from scipy.stats import ttest_ind
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as pyo
import scipy.stats as stats
import pymc3 as pm
from dataprep.eda import *
from dataprep.eda import plot
from dataprep.eda import plot_diff
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
import plotly.figure_factory as ff
from collections import Counter
import pandas_profiling as pp

In [None]:
filterwarnings("ignore", category=DeprecationWarning) 
filterwarnings("ignore", category=FutureWarning) 
filterwarnings("ignore", category=UserWarning)

# DATA PROCESS & EXPLORATORY DATA ANALYSIS (EDA)

#### MAIN DATA

In [None]:
Water_Potability_CSV = pd.read_csv("../input/water-potability/water_potability.csv")

Data = Water_Potability_CSV.copy() # COPY FOR PROTECTING MAIN DATA

Numeric_Only = Data.select_dtypes(include=["float32","float64","int32","int64"]) # if it is necessary

In [None]:
Data

In [None]:
print(Data.info())

In [None]:
print(Data.describe().T)

In [None]:
print(Numeric_Only.corr())

In [None]:
print(Numeric_Only.cov())

In [None]:
print(Data.columns)

In [None]:
print("SHAPE: ",Data.shape)
print("SIZE: ",Data.size)

In [None]:
print(Data["Potability"].value_counts())

In [None]:
print("NaN\n")
print(Data.isnull().sum())

In [None]:
print("NaN Bool\n")
print(Data.isna())

In [None]:
print("Duplicated\n")
print(Data.duplicated().sum())

#### NaN PROCESS

In [None]:
msno.matrix(Data,figsize=(12,5)) # missing relationship
plt.show()

In [None]:
msno.bar(Data,figsize=(12,5)) # missing counting
plt.show()

In [None]:
msno.dendrogram(Data,figsize=(12,5)) # missing relationship step
plt.show()

In [None]:
msno.heatmap(Data,figsize=(12,5)) # missing relationship with heatmap
plt.show()

In [None]:
figure = plt.figure(figsize=(12,5))
plt.title("NaN COUNT")
Nan_Checking = Data.isna().sum().sort_values(ascending=False).to_frame() # missing values with heatmap
sns.heatmap(Nan_Checking,fmt="d",cmap="jet")
plt.show()

In [None]:
# fill missing values with the mean of the same Potability class (assign back instead of inplace=True on a column)
Numeric_Only["ph"] = Numeric_Only["ph"].fillna(Numeric_Only.groupby("Potability")["ph"].transform("mean"))
Numeric_Only["Sulfate"] = Numeric_Only["Sulfate"].fillna(Numeric_Only.groupby("Potability")["Sulfate"].transform("mean"))
Numeric_Only["Trihalomethanes"] = Numeric_Only["Trihalomethanes"].fillna(Numeric_Only.groupby("Potability")["Trihalomethanes"].transform("mean"))

In [None]:
print("NaN\n")
print(Numeric_Only.isnull().sum())
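
The class-wise mean imputation used here can be seen more easily on a small frame. This is a minimal sketch with made-up values; `toy` and `class_mean` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0, np.nan, 9.0],
    "Potability": [0, 0, 0, 1, 1, 1],
})

# Mean of "ph" within each Potability class, aligned to the original rows.
class_mean = toy.groupby("Potability")["ph"].transform("mean")

# Each missing value receives the mean of its own class.
toy["ph"] = toy["ph"].fillna(class_mean)
print(toy)
```

Because the fill value depends on `Potability`, this scheme uses the target itself. If the data is later split for prediction, computing the class means on the training portion only is one way to avoid leaking label information.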

#### VALUES OUTSIDE GLOBAL MIN AND MAX LIMITS

In [None]:
pH_Dangerous_Less = Numeric_Only[Numeric_Only["ph"] < 6.5]
pH_Dangerous_High = Numeric_Only[Numeric_Only["ph"] > 8.5]

pH_Dangerous_Less = pH_Dangerous_Less.reset_index(drop=True)
pH_Dangerous_High = pH_Dangerous_High.reset_index(drop=True)

In [None]:
Solids_Dangerous_Less = Numeric_Only[Numeric_Only["Solids"] < 500]
Solids_Dangerous_High = Numeric_Only[Numeric_Only["Solids"] > 1000]

Solids_Dangerous_Less = Solids_Dangerous_Less.reset_index(drop=True)
Solids_Dangerous_High = Solids_Dangerous_High.reset_index(drop=True)

In [None]:
Chloramines_Dangerous_Less = Numeric_Only[Numeric_Only["Chloramines"] < 4.0]
Chloramines_Dangerous_High = Numeric_Only[Numeric_Only["Chloramines"] > 5.0]

Chloramines_Dangerous_Less = Chloramines_Dangerous_Less.reset_index(drop=True)
Chloramines_Dangerous_High = Chloramines_Dangerous_High.reset_index(drop=True)

In [None]:
Conductivity_Dangerous_Limit = Numeric_Only[Numeric_Only["Conductivity"] > 400.0]

Conductivity_Dangerous_Limit = Conductivity_Dangerous_Limit.reset_index(drop=True)

In [None]:
Trihalomethanes_Dangerous_Limit = Numeric_Only[Numeric_Only["Trihalomethanes"] > 80]

Trihalomethanes_Dangerous_Limit = Trihalomethanes_Dangerous_Limit.reset_index(drop=True)

In [None]:
Turbidity_Dangerous_Limit = Numeric_Only[Numeric_Only["Turbidity"] > 5.0] # WHO recommended value is 5 NTU

Turbidity_Dangerous_Limit = Turbidity_Dangerous_Limit.reset_index(drop=True)

#### GROUPBY MEANS

In [None]:
print(Numeric_Only.groupby(["Potability"])["ph"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Solids"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Chloramines"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Sulfate"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Conductivity"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Trihalomethanes"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Turbidity"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Organic_carbon"].mean())

In [None]:
print(Numeric_Only.groupby(["Potability"])["Hardness"].mean())

* Note that the dataset contains more non-potable samples than potable ones.

#### SPECIAL CHECKING

##### FOR MIN AND MAX LIMITS

In [None]:
plot_diff([pH_Dangerous_Less,pH_Dangerous_High])

In [None]:
plot_diff([Chloramines_Dangerous_Less,Chloramines_Dangerous_High])

In [None]:
plot_diff([Solids_Dangerous_Less,Solids_Dangerous_High])

In [None]:
plot_diff([Conductivity_Dangerous_Limit,Trihalomethanes_Dangerous_Limit])

##### CORRELATION

In [None]:
plot_correlation(Numeric_Only)

In [None]:
Corr_Pearson = Numeric_Only.corr(method="pearson")
Corr_Spearman = Numeric_Only.corr(method="spearman")

In [None]:
figure = plt.figure(figsize=(15,8))
plt.title("CORRELATION PEARSON")
sns.heatmap(Corr_Pearson,annot=True,vmin=-1,center=0,vmax=1,linewidths=2,linecolor="black",cmap="hot")
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
plt.title("CORRELATION SPEARMAN")
sns.heatmap(Corr_Spearman,annot=True,vmin=-1,center=0,vmax=1,linewidths=2,linecolor="black",cmap="hot")
plt.show()

##### COVARIANCE

In [None]:
Cov_Data = Numeric_Only.cov()

In [None]:
figure = plt.figure(figsize=(15,8))
plt.title("COVARIANCE")
sns.heatmap(Cov_Data,annot=True,center=0,linewidths=2,linecolor="black",cmap="hot") # covariance is not bounded by [-1, 1]
plt.show()
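
Covariance, unlike correlation, depends on the units of each column, so its values are not confined to the range from -1 to 1. The sketch below, written with made-up numbers and hypothetical helper names, shows that rescaling a column changes the covariance but leaves the Pearson correlation unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1, like pandas DataFrame.cov)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up measurements; the second list is the first one in different units.
solids_mg = [100.0, 200.0, 300.0, 400.0]
solids_g = [v / 1000 for v in solids_mg]
turbidity = [2.0, 1.0, 4.0, 3.0]

print(covariance(solids_mg, turbidity), covariance(solids_g, turbidity))
print(correlation(solids_mg, turbidity), correlation(solids_g, turbidity))
```

This is why a covariance matrix is usually read together with the scale of each column, while a correlation matrix can be compared cell by cell.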

##### CHART

In [None]:
plot(Numeric_Only)

In [None]:
plot(Numeric_Only, "Potability")

In [None]:
plot(Numeric_Only, "ph")

In [None]:
plot(Numeric_Only, "Hardness")

In [None]:
plot(Numeric_Only, "Solids")

In [None]:
plot(Numeric_Only, "Chloramines")

In [None]:
plot(Numeric_Only, "Sulfate")

In [None]:
plot(Numeric_Only, "Conductivity")

In [None]:
plot(Numeric_Only, "Organic_carbon")

In [None]:
plot(Numeric_Only, "Trihalomethanes")

In [None]:
plot(Numeric_Only, "Turbidity")

In [None]:
pp.ProfileReport(Numeric_Only)

##### VISUALIZATION

In [None]:
NoN_Potable = Numeric_Only.query("Potability == 0")
Potable = Numeric_Only.query("Potability == 1")

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title("KDE")
    
    sns.kdeplot(x=NoN_Potable[indexing],label="NoN")
    sns.kdeplot(x=Potable[indexing],label="Potable")
    plt.legend(prop=dict(size=10))
    
plt.tight_layout()
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title(f"INFO - {indexing}")
    
    sns.distplot(x=NoN_Potable[indexing],label="NoN",color="red")
    sns.distplot(x=Potable[indexing],label="Potable",color="black")
    plt.legend(prop=dict(size=10))
    
plt.tight_layout()
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title(f"INFO - {indexing}")
    
    sns.boxplot(x=Numeric_Only[indexing],color="black")
    
plt.tight_layout()
plt.show()

In [None]:
DataVis = Numeric_Only.copy()
DataVis["Potability"] = pd.Categorical(DataVis["Potability"])

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title(f"INFO - {indexing}")
    
    sns.barplot(x=DataVis["Potability"],y=DataVis[indexing],color="red")
    
plt.tight_layout()
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title("INFO - Potability")
    
    sns.histplot(x=DataVis[indexing],hue=DataVis["Potability"],multiple="stack",edgecolor=".3",linewidth=.5) # hue already adds a legend
    
plt.tight_layout()
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[1:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title("INFO - PH") # check others if you want
    
    sns.lineplot(x=DataVis[indexing],y=DataVis["ph"],hue=DataVis["Potability"])
    plt.legend(prop=dict(size=10))
    
plt.tight_layout()
plt.show()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[1:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title("INFO - PH") # check others if you want
    
    sns.scatterplot(x=DataVis[indexing],y=DataVis["ph"],hue=DataVis["Potability"])
    plt.legend(prop=dict(size=10))
    
plt.tight_layout()
plt.show()

In [None]:
sns.pairplot(DataVis,hue="Potability") # pairplot creates its own figure
plt.show()

In [None]:
figure = plt.figure(figsize=(20,8))
sns.distplot(DataVis[DataVis['Potability'] == 0]["ph"], color='black',label='DANGEROUS') 
sns.distplot(DataVis[DataVis['Potability'] == 1]["ph"], color='red',label='OKAY')
# for example , check others

plt.title('ph', fontsize=10)
plt.legend()

In [None]:
figure = plt.figure(figsize=(15,8))
for axis,indexing in enumerate(Numeric_Only.columns[0:9]):
    
    plt.subplot(3,3,axis+1)
    plt.title(f"INFO - {indexing}")
    
    # plot only the current column, split at its mean
    sns.distplot(DataVis.loc[DataVis[indexing] < DataVis[indexing].mean(), indexing], color='black',label='BELOW MEAN') 
    sns.distplot(DataVis.loc[DataVis[indexing] > DataVis[indexing].mean(), indexing], color='red',label='ABOVE MEAN')

    plt.legend()

plt.tight_layout()
plt.show()

In [None]:
print(DataVis.columns)

In [None]:
plt.figure(figsize=(20,8))
sns.distplot(DataVis.loc[DataVis['ph'] < DataVis['ph'].mean(), 'ph'], color='black',label='LESS THAN MEAN') 
sns.distplot(DataVis.loc[DataVis['ph'] > DataVis['ph'].mean(), 'ph'], color='red',label='MORE THAN MEAN')
# for example , check others

plt.title('ph', fontsize=10)
plt.legend()

# STANDARDIZATION EXAMPLE

In [None]:
Standart_Data = Numeric_Only.copy()

#### EXAMPLE BASED ON PH VALUES

In [None]:
print("MODE: ", Standart_Data["ph"].mode(), "\n")
print("MAX: ", Standart_Data["ph"].max(), "\n")
print("MIN: ", Standart_Data["ph"].min(), "\n")
print("MEAN: ", Standart_Data["ph"].mean(), "\n")

In [None]:
print("95% confidence interval for the mean (t-distribution):\n",sms.DescrStatsW(Standart_Data["ph"]).tconfint_mean())
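
For reference, `tconfint_mean` returns a t-based interval for the mean, by default at the 95% level. The sketch below reproduces that kind of interval by hand on made-up readings. The `values` list and the variable names are assumptions for illustration, not the notebook's data.

```python
import math
from scipy import stats

values = [6.8, 7.1, 7.4, 6.9, 7.3, 7.0, 7.2, 6.7]  # made-up pH readings

n = len(values)
mean = sum(values) / n
# Sample standard deviation (n - 1 in the denominator).
std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

# Two-sided 95% interval: mean +/- t * std / sqrt(n), with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * std / math.sqrt(n)
low, high = mean - half_width, mean + half_width
print(low, high)
```

The interval narrows as the sample size grows, because the half-width shrinks with the square root of `n`.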

In [None]:
def std_values(x):
    # the global pH range is 6.5 to 8.5, inclusive
    if x < 6.5 or x > 8.5:
        return "DANGEROUS" # based on global pH
    return "ACCEPTABLE"

In [None]:
Standart_Data["ph"] = Standart_Data["ph"].apply(std_values)

In [None]:
print(Standart_Data["ph"].value_counts())

* The same approach can be applied to the other columns; just be careful to use the globally accepted limits for each one.

# NORMALITY AND HOMOGENEITY CHECKING

#### NORMALITY

In [None]:
for col in Numeric_Only.columns:
    print(col)
    print("---"*5)
    print("%.4f - %.4f" % shapiro(Numeric_Only[col]))
    print("---"*15)

#### HOMOGENEITY

In [None]:
print("%.4f - %.4f" % levene(Numeric_Only["ph"],Numeric_Only["Hardness"],
                            Numeric_Only["Solids"],Numeric_Only["Potability"]))
# check for others
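
Levene's test checks whether several samples have equal variances. Besides comparing different columns as above, it can also be applied to a single feature split by `Potability`. The sketch below uses synthetic data rather than the real columns, and the significance level `alpha` is a conventional choice assumed here.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature split by class (not the real data).
not_potable = rng.normal(loc=7.0, scale=1.0, size=200)
potable = rng.normal(loc=7.0, scale=3.0, size=200)

statistic, p_value = levene(not_potable, potable)
print("%.4f - %.4f" % (statistic, p_value))

alpha = 0.05  # conventional significance level, an assumption here
if p_value < alpha:
    print("Reject equal variances")
else:
    print("No evidence against equal variances")
```

On the real data, the two samples would be the same column filtered by `Potability == 0` and `Potability == 1`.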

# OUTLIER

In [None]:
DataForA = Numeric_Only.copy()

In [None]:
clf = LocalOutlierFactor()

In [None]:
clf.fit_predict(DataForA)

In [None]:
Gen_Score = clf.negative_outlier_factor_

In [None]:
Sorted_Score = np.sort(Gen_Score)

In [None]:
print(Sorted_Score[0:150])
# checking outlier, look where the biggest jump took place

* The data seems stable.
* When it does not, you can follow the process below.
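
The "biggest jump" rule mentioned above can also be located programmatically instead of by eye. This is a minimal sketch on made-up scores; `sorted_scores`, `jump_at`, and `threshold` are hypothetical names, and picking the single largest gap is only one possible heuristic.

```python
import numpy as np

# Made-up negative outlier factors, already sorted ascending
# (most anomalous first), as np.sort would return them.
sorted_scores = np.array([-3.9, -3.7, -1.6, -1.5, -1.45, -1.4, -1.38])

# Gap between each score and the next one.
gaps = np.diff(sorted_scores)

# Position of the largest gap; scores up to this index sit before the jump.
jump_at = int(np.argmax(gaps))
threshold = sorted_scores[jump_at + 1]

print(jump_at, threshold)
print(sorted_scores[sorted_scores < threshold])
```

Scores strictly below `threshold` are the candidates for removal, which mirrors the `Gen_Score < point` comparison used in the cells that follow.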

In [None]:
point = Sorted_Score[6] # just an example threshold
print(point)
print("---"*10)
print(DataForA[Gen_Score == point])

# for our example

In [None]:
outliers = Gen_Score < point
print(Numeric_Only[outliers])
print("---"*20)
print(Numeric_Only[outliers].index)

# outliers

In [None]:
outliersIndexList = Numeric_Only[outliers].index

Numeric_Only.drop(index=outliersIndexList,inplace=True)
    
# removes the outlier rows
# this is only an example: don't run it if you are continuing to the prediction process

#### END OF THE EDA / CHECK PREDICTION PROCESS