**What is Potable Water**

> Drinking water, also known as potable water, is water that is safe to drink or use for food preparation. The amount of drinking water required to maintain good health varies, and depends on physical activity level, age, health-related issues, and environmental conditions. Typically, in developed countries, tap water meets drinking water quality standards, even though only a small proportion is actually consumed or used in food preparation. Other typical uses include washing, toilets, and irrigation. Greywater may also be used for toilets or irrigation. Its use for irrigation, however, may be associated with risks.

**Contaminants in Water**
>The Safe Drinking Water Act defines the term "contaminant" as meaning any physical, chemical, biological, or radiological substance or matter in water. Therefore, the law defines "contaminant" very broadly as being anything other than water molecules. Drinking water may reasonably be expected to contain at least small amounts of some contaminants. Some drinking water contaminants may be harmful if consumed at certain levels in drinking water while others may be harmless. The presence of contaminants does not necessarily indicate that the water poses a health risk. Only a small number of the universe of contaminants as defined above are listed on the Contaminant Candidate List (CCL). The CCL serves as the first level of evaluation for unregulated drinking water contaminants that may need further investigation of potential health effects and the levels at which they are found in drinking water.

>The following are general categories of drinking water contaminants:
>1. Physical
>2. Chemical
>3. Biological
>4. Radiological

**How is Water made potable for drinking?**

> Water purification is the process of removing undesirable chemicals, biological contaminants, suspended solids, and gases from water. The goal is to produce water fit for specific purposes. Most water is purified and disinfected for human consumption (drinking water), but water purification may also be carried out for a variety of other purposes, including medical, pharmacological, chemical, and industrial applications. The methods used include physical processes such as filtration, sedimentation, and distillation; biological processes such as slow sand filters or biologically active carbon; chemical processes such as flocculation and chlorination; and the use of electromagnetic radiation such as ultraviolet light.

**Facts by WHO**

>According to a 2007 World Health Organization (WHO) report, 1.1 billion people lack access to an improved drinking water supply; 88% of the 4 billion annual cases of diarrheal disease are attributed to unsafe water and inadequate sanitation and hygiene, while 1.8 million people die from diarrheal disease each year. The WHO estimates that 94% of these diarrheal disease cases are preventable through modifications to the environment, including access to safe water. Simple techniques for treating water at home, such as chlorination, filters, and solar disinfection, and for storing it in safe containers could save a huge number of lives each year. Reducing deaths from waterborne diseases is a major public health goal in developing countries.

# Column Descriptions
1. ph: pH of water (0 to 14).
2. Hardness: Capacity of water to precipitate soap in mg/L.
3. Solids: Total dissolved solids in ppm.
4. Chloramines: Amount of Chloramines in ppm.
5. Sulfate: Amount of Sulfates dissolved in mg/L.
6. Conductivity: Electrical conductivity of water in μS/cm.
7. Organic_carbon: Amount of organic carbon in ppm.
8. Trihalomethanes: Amount of Trihalomethanes in μg/L.
9. Turbidity: Measure of the light-scattering property of water in NTU.
10. Potability: Indicates if water is safe for human consumption. Potable = 1, Not potable = 0.

# Importing Libraries

In [None]:
from sklearn.model_selection import GridSearchCV,StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from collections import Counter
import matplotlib.pyplot as plt
import pandas_profiling as pp
from sklearn.svm import SVC
import plotly.express as px
import seaborn as sns
import pandas as pd
import numpy as np
import warnings

warnings.simplefilter(action='ignore', category=Warning)

# Loading Data

In [None]:
df = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')

# Exploratory Data Analysis

In [None]:
# Displaying top 5 rows
df.head()

In [None]:
# Getting the Shape of the Data
df.shape

In [None]:
# Getting data types
df.info()

In [None]:
# Checking Null Values
df.isnull().sum()

In [None]:
# Dropping null values
df.dropna(inplace = True)

# Checking size of data after dropping null values
df.shape

> As we can see, every row containing a null value has been dropped, but this shrinks the dataset considerably.

> Before dropping, the shape was (3276, 10).

> After dropping, it is (2011, 10).
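
> Dropping rows discards more than a third of the data. One common alternative, shown here only as a sketch and not as the approach used in this notebook, is to fill missing values with each column's median. The `toy` frame below is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the water-potability data.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 310.0, np.nan, 350.0],
    "Potability": [0, 1, 0, 1],
})

# Fill each numeric column's gaps with that column's median.
imputed = toy.fillna(toy.median(numeric_only=True))

print(imputed.isnull().sum().sum())  # number of remaining nulls
print(imputed.shape)                 # row count is unchanged
```

> Imputation keeps every row, at the cost of introducing estimated values, so whether it helps here would need to be checked against the modelling results.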

In [None]:
# Getting the description of Data
df.describe().T

# Data Visualization

**Potability data Count**

In [None]:
pota_count = df['Potability'].value_counts()
print(pota_count)
pota_count.plot.pie();

In [None]:
plt.figure(figsize=(6,5))
sns.countplot(x="Potability", data=df, palette='viridis');

> As we can see from the plots above:

> Potable samples: 811

> Non-potable samples: 1200
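
> These counts give a useful reference point: a model that always predicts the majority class would already be right for the majority share of samples. The short calculation below uses only the two counts reported above.

```python
# Class counts reported above (after dropping rows with nulls).
counts = {"not_potable": 1200, "potable": 811}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(f"Total samples: {total}")
print(f"Majority-class baseline accuracy: {majority_share:.1%}")
```

> This baseline applies to the data as it stands here. If the classes are later rebalanced by resampling, the corresponding baseline moves towards 50%.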

**Box And Distribution plot for each column**

In [None]:
def boxdistriplot(columnName):
    """Draw a box plot and overlaid distribution plots of one feature, split by Potability."""
    if columnName != 'Potability':
        sns.catplot(x="Potability", y=columnName, data=df, kind="box");
        plt.figure()
        # Label each distribution so the legend matches the right colour
        sns.distplot(df[columnName][df.Potability == 1], color="darkturquoise", rug=True, label='Potable')
        sns.distplot(df[columnName][df.Potability == 0], color="lightcoral", rug=True, label='Not Potable');
        plt.legend()


for column in df.columns:
    boxdistriplot(column)

> The box plots show many outliers in the data. The next step applies a Box-Cox transformation to correct skewness, which can also reduce their influence.
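
> By default, the points drawn beyond the whiskers of a seaborn box plot are those lying more than 1.5 × IQR outside the quartiles. A minimal sketch of that rule follows; the helper name `iqr_outlier_mask` and the sample readings are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical pH readings with one extreme value.
sample = [6.8, 7.0, 7.1, 7.2, 7.3, 7.4, 13.5]
mask = iqr_outlier_mask(sample)
print(mask.sum())  # number of flagged points
```

> Applying such a mask per column would give a count of flagged rows before deciding whether to remove, cap, or transform them.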

**Correcting Skewness in Data**

In [None]:
def skewnessCorrector(dataset,columnName):
    """Plot a column before and after a Box-Cox transform, applied in place.

    Box-Cox requires strictly positive values.
    """
    from scipy import stats
    from scipy.stats import norm, boxcox

    print('''Before Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu before correcting {} : {}, Sigma before correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="lightcoral");
    plt.title(columnName.capitalize() +
              " Distplot before Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()
    # Applying Box-Cox transformation; fitted_lambda is the estimated power parameter
    dataset[columnName], fitted_lambda = boxcox(
        dataset[columnName])
    
    print('''After Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu after correcting {} : {}, Sigma after correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="orange");
    plt.title(columnName.capitalize() +
              " Distplot After Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()

col = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity']
for column in col:
    skewnessCorrector(df,column)
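
> To see the effect of the Box-Cox step in isolation, the sketch below applies `scipy.stats.boxcox` to a synthetic right-skewed sample and compares skewness before and after. The lognormal data is an assumption chosen for illustration and is not drawn from the water dataset.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive sample (Box-Cox requires x > 0).
raw = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

# boxcox returns the transformed data and the fitted power parameter lambda.
transformed, lam = boxcox(raw)

print(f"skewness before: {skew(raw):.3f}")
print(f"skewness after:  {skew(transformed):.3f}")
print(f"fitted lambda:   {lam:.3f}")
```

> A skewness close to zero after the transform indicates a more symmetric distribution, which is what the paired distribution and probability plots above show visually.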

**Correlation Matrix**

In [None]:
plt.figure(figsize=(13,10))
sns.heatmap(df.corr(), annot=True, cmap="flag");

>The correlation plot above shows the strength and sign of the linear relationship between each pair of columns.
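
> A heatmap can be hard to read when the question is simply which features relate most strongly to the target. One way to answer it, sketched here on hypothetical data, is to pull out the target's column of the correlation matrix and sort it by absolute value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
# Hypothetical data: 'a' is related to the target, 'b' is pure noise.
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "a": target + rng.normal(0, 0.5, size=n),
    "b": rng.normal(0, 1, size=n),
    "Potability": target,
})

# Absolute correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()["Potability"]
    .drop("Potability")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

> The same expression with `df` in place of `toy` would rank the nine water-quality features. Note that this captures linear association only.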

**Pairplot**

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df, hue="Potability", palette="magma");

> The pairplot shows the pairwise relationships between columns, with points coloured by Potability.

**Using Pandas Profiling**

In [None]:
pp.ProfileReport(df, title = "Water Quality")

***Information about the Columns on which we have done the Analysis***
> 1. pH value:

> pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The current investigation ranges were 6.52–6.83, which are within the WHO standards.
> 
> 2. Hardness:

> Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap caused by Calcium and Magnesium.
> 
> 3. Solids (Total dissolved solids - TDS):

> Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. These minerals produce an unwanted taste and diluted color in the appearance of water. This is an important parameter for the use of water. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit prescribed for drinking purposes is 1000 mg/L.
> 
> 4. Chloramines:

> Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (4 mg/L, or 4 parts per million (ppm)) are considered safe in drinking water.
> 
> 5. Sulfate:

> Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.
> 
> 6. Conductivity:

> Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.
> 
> 7. Organic_carbon:

> Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be < 2 mg/L in treated / drinking water and < 4 mg/L in source water that is used for treatment.
> 
> 8. Trihalomethanes:

> THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 μg/L (80 parts per billion) are considered safe in drinking water.
> 
> 9. Turbidity:

> The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.
> 
> 10. Potability:

> Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
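
> The limits quoted in the descriptions above can be collected into a simple range check. The sketch below is illustrative only: the helper `out_of_range`, the `GUIDELINES` table, and the sample values are hypothetical. Hardness and Sulfate are omitted because no limit is quoted for them, and the check is not how the dataset's Potability label was derived.

```python
# Guideline ranges quoted in the descriptions above: (low, high); None = unbounded.
GUIDELINES = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),          # mg/L, maximum limit
    "Chloramines": (None, 4.0),        # mg/L
    "Conductivity": (None, 400.0),     # μS/cm
    "Organic_carbon": (None, 2.0),     # mg/L, treated water
    "Trihalomethanes": (None, 80.0),   # μg/L, as in the column description
    "Turbidity": (None, 5.0),          # NTU
}

def out_of_range(sample):
    """Return the names of measurements that fall outside the quoted ranges."""
    flagged = []
    for name, (low, high) in GUIDELINES.items():
        value = sample.get(name)
        if value is None:
            continue  # measurement not provided
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged

# Hypothetical sample, not taken from the dataset.
sample = {"ph": 7.2, "Solids": 1500.0, "Chloramines": 3.1, "Turbidity": 6.0}
print(out_of_range(sample))
```

> A rule table like this is easy to audit, which makes it a useful sanity check alongside a learned classifier.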

# Data Preprocessing And Data Modelling

***Getting features and target values***

In [None]:
# Features
x = df.drop(columns = 'Potability')
y = df['Potability']

In [None]:
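def preprocessing(data,features,labels,test_size = 0.2,random_state =42, tune = 'n',cv_folds = 5):
    """Encode, resample (SMOTE), split and scale the data, then compare classifiers.

    Set tune='y' to also run a grid search for every model except GaussianNB.
    """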
    print('Checking if labels or features are categorical! [*]\n')
    cat_features=[i for i in features.columns if features.dtypes[i]=='object']
    if len(cat_features) >= 1 :
        index = []
        for i in range(0,len(cat_features)):
            index.append(features.columns.get_loc(cat_features[i]))
        print('Features are Categorical\n')
        ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), index)], remainder='passthrough')
        print('Encoding Features [*]\n')
        features = np.array(ct.fit_transform(features))
        print('Encoding Features Done [',u'\u2713',']\n')
    if labels.dtype == 'O':
        le = LabelEncoder()
        print('Labels are Categorical [*] \n')
        print('Encoding Labels \n')
        labels = le.fit_transform(labels)
        print('Encoding Labels Done [',u'\u2713',']\n')
    else:
        print('Labels are not categorical [',u'\u2713',']\n')
        
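    ## SMOTE ---------------------------------------------------------------------
    # NOTE: resampling happens before the train/validation split, so synthetic
    # samples built from rows that end up in the validation set can leak into
    # training. Resampling only the training split gives a less optimistic estimate.
    print('Applying SMOTE [*]\n')
    
    sm=SMOTE(k_neighbors=4)
    features,labels=sm.fit_resample(features,labels)
    print('SMOTE Done [',u'\u2713',']\n')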
    
    ## Splitting ---------------------------------------------------------------------
    print('Splitting Data into Train and Validation Sets [*]\n')
    
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size= test_size, random_state= random_state)
    print('Splitting Done [',u'\u2713',']\n')
    
    ## Scaling ---------------------------------------------------------------------
    print('Scaling Training and Test Sets [*]\n')
    
    sc = StandardScaler()
    X_train = sc.fit_transform(x_train)
    X_val = sc.transform(x_test)
    print('Scaling Done [',u'\u2713',']\n')
    
    print('Training All Basic Classifiers on Training Set [*] \n')
    
    parameters_svm= [
    {'kernel': ['rbf'], 'gamma': [0.1, 0.5, 0.9, 1],
        'C': np.logspace(-4, 4, 5)},
    ]
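    # liblinear supports both penalties; newton-cg supports only l2
    parameters_lin = [
    {'penalty': ['l1', 'l2'], 'solver': ['liblinear'], 'C': np.logspace(-4, 4, 5)},
    {'penalty': ['l2'], 'solver': ['newton-cg'], 'C': np.logspace(-4, 4, 5)},
    ]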
    parameters_knn = [{
    'n_neighbors': list(range(1, 11)),  # n_neighbors must be at least 1
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'kd_tree', 'brute'],
    }]
    parameters_dt = [{
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [4,  6,  8,  10,  12,  20,  40, 70],

    }]
    parameters_rfc = [{
    'criterion': ['gini', 'entropy'],
    'n_estimators': [100, 300, 500, 750, 1000],
    'max_features': [2, 3],
    }]
    parameters_xgb = [{
    'max_depth': [4,  6,  8,  10],
    'learning_rate': [0.3, 0.1],
    }]
    parameters_lgbm =  {
    'learning_rate': [0.005, 0.01],
    'n_estimators': [8,16,24],
    'boosting_type' : ['gbdt', 'dart'],
    'objective' : ['binary'],
    }
    paramters_pac = {
        'C': np.logspace(-4, 4, 20)}
    
    
    param_nb={}
    parameters_ada={
            'learning_rate': [0.005, 0.01],
            'n_estimators': [8,16,24],
    }
    paramters_sgdc = [{
    'penalty': ['l2', 'l1', 'elasticnet'],
    'loss': ['hinge', 'log_loss'],  # 'log_loss' was named 'log' before scikit-learn 1.1
    'alpha':np.logspace(-4, 4, 20),
    }]
    # Note: the 'XGB' entry is scikit-learn's GradientBoostingClassifier, not XGBoost
    models =[("LR", LogisticRegression(), parameters_lin),("SVC", SVC(),parameters_svm),('KNN',KNeighborsClassifier(),parameters_knn),
    ("DTC", DecisionTreeClassifier(),parameters_dt),("GNB", GaussianNB(), param_nb),("SGDC", SGDClassifier(), paramters_sgdc),('RF',RandomForestClassifier(),parameters_rfc),
    ('ADA',AdaBoostClassifier(),parameters_ada),('XGB',GradientBoostingClassifier(),parameters_xgb),('LGBM', LGBMClassifier(),parameters_lgbm),
    ('PAC',PassiveAggressiveClassifier(),paramters_pac)]

    results = []
    names = []
    finalResults = []
    accres = []

    for name,model, param in models:
        
        # Fit and evaluate on the scaled arrays (X_train / X_val) built above
        model.fit(X_train, y_train)
        model_results = model.predict(X_val)
        accuracy = accuracy_score(y_test, model_results)
        print('Validation Accuracy is :',accuracy)
        print('Applying K-Fold Cross validation on Model {}[*]'.format(name))
        accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=cv_folds, scoring='accuracy')
        print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
        acc = accuracies.mean()*100
        print("Standard Deviation: {:.2f} %".format(accuracies.std()*100)) 
        results.append(acc)
        names.append(name)
        accres.append((name,acc))
        if tune == 'y' and not name == 'GNB':
            print('Applying Grid Search Cross validation for model {} [*]\n'.format(name))
            cv_params = param
            grid_search = GridSearchCV(
            estimator=model,
            param_grid=cv_params,
            scoring='accuracy',
            cv=cv_folds,
            n_jobs=-1,
            verbose=4,
                )
            grid_search.fit(X_train, y_train)
            best_accuracy = grid_search.best_score_
            best_parameters = grid_search.best_params_
            print("Best Accuracy for model {}: {:.2f} %".format(name,best_accuracy*100))
            print("Best Parameters: for model {}".format(name), best_parameters)
            print('Applying Grid Search Cross validation Done[',u'\u2713',']\n')
            
        print('Training Completed Showing Predictions [',u'\u2713','] \n')
    accres.sort(key=lambda k:k[1],reverse=True)
    print("\n The Accuracy of the Models Are:\n ")
    tab = pd.DataFrame(accres)
    print(tab)
    sns.barplot(x=tab[1], y=tab[0], palette='mako');
    print("\n\nModel With Highest Accuracy is: \n",accres[0],'\n\n')

In [None]:
preprocessing(df,x,y)
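
> Because preprocessing choices affect how trustworthy a cross-validated score is, it can help to wrap scaling and the model in a single scikit-learn `Pipeline`, so that the scaler is refit inside every fold. The sketch below is not the function used above: it runs on synthetic data from `make_classification`, and it swaps in `class_weight="balanced"` as a simple stand-in for SMOTE when handling class imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data with 9 features, standing in for the water data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=9, n_informative=5,
    weights=[0.6, 0.4], random_state=42,
)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out part reach the model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.2%} (std {scores.std():.2%})")
```

> The same pattern extends to any of the classifiers compared above by replacing the `clf` step.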

# Summary
* **RF** and **LGBM** were the best here, with an accuracy of approximately **71%**.

***References***

https://en.wikipedia.org/wiki/Drinking_water

https://www.epa.gov/ccl/types-drinking-water-contaminants#:~:text=Examples%20of%20chemical%20contaminants%20include,contaminants%20are%20organisms%20in%20water.&text=Examples%20of%20biological%20or%20microbial,viruses%2C%20protozoan%2C%20and%20parasites.

https://gist.github.com/d4rk-lucif3r/c0349c76a6fed2d4de166e7cff537ff4

https://github.com/GD-Singh011/funky-ml