# Introduction
A basic kernel focused mainly on data visualisations and the inferences drawn from the plots.


# Feature Description (taken from the dataset description)
### 1. pH value:
pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO has recommended a permissible pH range of 6.5 to 8.5. The current investigation ranges were 6.52–6.83, which are within WHO standards.

### 2. Hardness:
Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap caused by Calcium and Magnesium.

### 3. Solids (Total dissolved solids - TDS):
Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. These minerals produce an unwanted taste and a diluted color in the appearance of water. This is an important parameter for the use of water. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit is 1000 mg/L, as prescribed for drinking purposes.

### 4. Chloramines:
Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

### 5. Sulfate:
Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

### 6. Conductivity:
Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

### 7. Organic_carbon:
Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated or drinking water and below 4 mg/L in source water that is used for treatment.

### 8. Trihalomethanes:
THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 μg/L (80 ppb) are considered safe in drinking water.

### 9. Turbidity:
The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

### 10. Potability:
Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing Libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.offline as pyo
import plotly.express as px
import plotly.graph_objs as go
pyo.init_notebook_mode()
import plotly.figure_factory as ff
import missingno as msno
from collections import Counter
from warnings import filterwarnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression,RidgeClassifier,SGDClassifier,PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC,LinearSVC,NuSVC
from sklearn.neighbors import KNeighborsClassifier,NearestCentroid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.ensemble import VotingClassifier

# Evaluation & CV Libraries
from sklearn.metrics import precision_score,accuracy_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV,RepeatedStratifiedKFold

# Importing Dataset

In [None]:
df=pd.read_csv('/kaggle/input/water-potability/water_potability.csv')

# Descriptive Statistics

In [None]:
df.head()

In [None]:
print(df.info())
print("*****************************************************************************************")
print(df.describe())

In [None]:
df.Potability.value_counts()/df.Potability.count()*100

In [None]:
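# Assign the result back; inplace=True on a selected column is unreliable under pandas Copy-on-Write
df['Potability'] = df['Potability'].replace({1:"Yes",0:"No"})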

In [None]:
fig = msno.matrix(df)
df.isnull().sum()

## Observations:
1. The gaps in the ph, Sulfate and Trihalomethanes columns of the matrix show where values are missing in each of them.
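
To put a number on the gaps seen in the matrix, the share of missing values per column can be computed directly. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; only the column names mirror the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the water potability data
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 310.0, np.nan, np.nan],
    "Hardness": [200.0, 180.0, 210.0, 190.0],
})

# Share of missing values per column, in percent, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression applied to `df` would give the percentage behind each gap in the plot.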

# Univariate Analysis

In [None]:
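# Count samples per class; column names are set explicitly so this works across pandas versions
d = df['Potability'].value_counts().rename_axis('Potability').reset_index(name='Samples')
d['Label'] = d['Potability'].replace({'No':'Not Potable','Yes':'Potable'})
fig = px.pie(d,values='Samples',names='Label',hole=0.4,opacity=0.6,
             labels={'Label':'Potability','Samples':'No. Of Samples'})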

fig.add_annotation(text='Potability',
                   x=0.5,y=0.5,showarrow=False,font_size=14,opacity=0.9,font_family='monospace')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution of target variable',x=0.47,y=0.98,
               ),
    legend=dict(x=0.37,y=-0.05,orientation='h',traceorder='reversed'),
    hoverlabel=dict(bgcolor='white'))

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()

## Observations:
1. The classes are only mildly imbalanced.
2. The data contains fewer potable samples than non-potable ones. (A reminder for humans to start valuing water.)

In [None]:
# histfunc='count' already counts rows per bin, so no y argument (or empty matplotlib figure) is needed
fig = px.histogram(df,x='ph',color='Potability',template='plotly_white',color_discrete_sequence=['#ff6161','#28b0eb'],
                  marginal='box',opacity=0.7,nbins=100,
                  barmode='group',histfunc='count')

fig.add_vline(x=7, line_width=1,line_dash='dot',opacity=0.7)

fig.add_annotation(text='<7 is Acidic',x=4,y=70,showarrow=False,font_size=14)
fig.add_annotation(text='>7 is Basic',x=10,y=70,showarrow=False,font_size=14)


fig.update_layout(
    font_family='monospace',
    title=dict(text='pH Level Distribution',x=0.5,y=0.95),
    xaxis_title_text='pH Level',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

## Observations:
1. The pH is approximately normally distributed for both potable and non-potable water.
2. In most cases, the pH of potable water lies between 6 (slightly acidic) and 8 (slightly basic).
3. The highest count for potable water is around pH 7 (neutral), which is consistent with the assumption that neutral water is good for drinking.
4. Some outliers are visible among the potable records (near pH 0 and pH 13).
5. The median is a good choice for filling the missing values.
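
The preference for the median over the mean can be illustrated with a small example. This is a sketch on hypothetical pH readings (the values are invented, with one deliberate extreme reading), not on the dataset itself.

```python
import numpy as np
import pandas as pd

# Hypothetical pH readings with one extreme value and one missing value
ph = pd.Series([6.8, 7.0, 7.1, 7.2, 13.9, np.nan])

mean_fill = ph.mean()      # pulled upward by the 13.9 reading
median_fill = ph.median()  # stays at the centre of the bulk of the data

print(f"mean={mean_fill:.2f}, median={median_fill:.2f}")

# Fill the gap with the median, which is less sensitive to the outlier
filled = ph.fillna(median_fill)
```

Here the single outlier shifts the mean well above the typical readings, while the median stays close to them.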

In [None]:
fig = px.histogram(df,x='Sulfate',color='Potability',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=100,color_discrete_sequence=['#ff6161','#28b0eb'],
                  barmode='group',histfunc='count')

fig.add_vline(x=250, line_width=1,line_dash='dot',opacity=0.7)

fig.add_annotation(text='<250 mg/L is considered<br> safe for drinking',x=175,y=90,showarrow=False)

fig.update_layout(
    font_family='monospace',
    title=dict(text='Sulfate Distribution',x=0.53,y=0.95
              ),
    xaxis_title_text='Sulfate (mg/L)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

## Observations:
1. Most of the potable records have sulfate levels above the 250 mg/L limit for drinking water.
2. Some potable records have more than 400 mg/L of sulfate (outliers).
3. To reduce the effect of outliers, the median can be used to impute the missing values.


In [None]:
fig = px.histogram(df,x='Trihalomethanes',color='Potability',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=100,color_discrete_sequence=['#ff6161','#28b0eb'],
                  barmode='group',histfunc='count')

fig.add_vline(x=80, line_width=1,line_dash='dot',opacity=0.7)

fig.add_annotation(text='Upper limit of Trihalomethanes<br> level is 80 μg/L',x=115,y=90,showarrow=False)

fig.update_layout(
    font_family='monospace',
    title=dict(text='Trihalomethanes Distribution',x=0.5,y=0.95,
              ),
    xaxis_title_text='Trihalomethanes (μg/L)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

## Observations:
1. The median is a good choice for imputation because it minimises the effect of outliers.
2. The potable records with trihalomethanes above the permissible limit (80 μg/L) should be re-evaluated.
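
One way to pull out the records that deserve a second look is a boolean filter on both conditions. The sketch below uses a hypothetical toy frame (`toy`, `THM_LIMIT` and the values are illustrative assumptions; the column names and the "Yes"/"No" labels follow the notebook).

```python
import pandas as pd

THM_LIMIT = 80  # μg/L, the upper limit quoted in the plot annotation

# Hypothetical records; column names mirror the dataset
toy = pd.DataFrame({
    "Trihalomethanes": [55.0, 92.3, 79.9, 101.4, 66.0],
    "Potability": ["Yes", "Yes", "No", "No", "Yes"],
})

# Potable samples whose THM level exceeds the limit
suspect = toy[(toy["Potability"] == "Yes") & (toy["Trihalomethanes"] > THM_LIMIT)]
print(suspect)
```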

In [None]:
fig = px.histogram(df,x='Hardness',color='Potability',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=100,color_discrete_sequence=['#ff6161','#28b0eb'],
                  barmode='group',histfunc='count')

fig.add_vline(x=151, line_width=1,line_dash='dot',opacity=0.7)
fig.add_vline(x=301, line_width=1,line_dash='dot',opacity=0.7)
fig.add_vline(x=76, line_width=1,line_dash='dot',opacity=0.7)

fig.add_annotation(text='<76 mg/L is<br> considered soft',x=40,y=130,showarrow=False,font_size=12)
fig.add_annotation(text='Between 76 and 150<br> (mg/L) is<br>moderately hard',x=113,y=130,showarrow=False,font_size=12)
fig.add_annotation(text='Between 151 and 300 (mg/L)<br> is considered hard',x=250,y=130,showarrow=False,font_size=12)
fig.add_annotation(text='>300 mg/L is<br> considered very hard',x=340,y=130,showarrow=False,font_size=12)

fig.update_layout(
    font_family='monospace',
    title=dict(text='Hardness Distribution',x=0.53,y=0.95,
              ),
    xaxis_title_text='Hardness (mg/L)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

## Observations:
1. The water is hard irrespective of whether it is potable, which points to a high amount of calcium and magnesium in the water.
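
The hardness bands annotated on the plot can also be turned into a categorical column, which makes it easy to count samples per band. This is a minimal sketch with `pd.cut` on hypothetical readings; the bin edges follow the plot annotations, while the variable names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical hardness readings in mg/L
hardness = pd.Series([47.4, 120.0, 196.0, 250.0, 323.1])

# Left-closed bins following the annotation boundaries on the plot:
# [0, 76) soft, [76, 151) moderately hard, [151, 301) hard, [301, inf) very hard
bins = [0, 76, 151, 301, np.inf]
labels = ["soft", "moderately hard", "hard", "very hard"]
category = pd.cut(hardness, bins=bins, labels=labels, right=False)

# sort=False keeps the bands in their natural order
print(category.value_counts(sort=False))
```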

In [None]:
fig = px.histogram(df,x='Solids',color='Potability',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=100,color_discrete_sequence=['#ff6161','#28b0eb'],
                  barmode='group',histfunc='count')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Total Dissolved Solids',x=0.5,y=0.95),
    xaxis_title_text='Dissolved Solids (ppm)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

## Observations:
1. The high amount of dissolved salts is a likely reason for the increased hardness of the water.

# Bivariate Analysis

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data=df,hue='Potability')
plt.show()

## Observations:
1. No significant conclusion can be inferred from any of the above plots.

In [None]:
cor=df.drop('Potability',axis=1).corr()
cor

In [None]:
fig = px.imshow(cor,height=800,width=800,template='plotly_white')
fig.show()

## Observations:
1. There is no significant correlation between any pair of features (not even between related features such as hardness and dissolved solids).
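
Reading the heatmap by eye can be backed up by ranking the feature pairs by absolute correlation. The sketch below uses random, uncorrelated data as a stand-in (an assumption, so the numbers mean nothing for the real dataset); the same steps should apply to the `cor` matrix computed above.

```python
import numpy as np
import pandas as pd

# Random stand-in data; only the column names are borrowed from the dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)),
                   columns=["ph", "Hardness", "Solids", "Sulfate"])

toy_cor = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(toy_cor.shape, dtype=bool), k=1)
pairs = toy_cor.where(mask).stack().dropna().abs().sort_values(ascending=False)

# Strongest pairs first
print(pairs.head(3))
```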

In [None]:
print(df[df['Potability']=='No'][['ph','Sulfate','Trihalomethanes']].median())
print('*****************************************')
print(df[df['Potability']=='Yes'][['ph','Sulfate','Trihalomethanes']].median())

## Observations:
1. The medians for potable and non-potable water differ only negligibly, so the overall median of each feature can be used for imputation.
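
The same comparison can be expressed as a relative gap between each class median and the overall median. This is a sketch on a hypothetical toy frame (all values invented), using `groupby` rather than two separate filters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the notebook's "Yes"/"No" labels
toy = pd.DataFrame({
    "ph": [6.9, 7.1, np.nan, 7.0, 7.2, 6.8],
    "Sulfate": [330.0, np.nan, 335.0, 332.0, 331.0, 334.0],
    "Potability": ["No", "No", "No", "Yes", "Yes", "Yes"],
})
cols = ["ph", "Sulfate"]

per_class = toy.groupby("Potability")[cols].median()
overall = toy[cols].median()

# Relative gap between each class median and the overall median, in percent
gap_pct = (per_class - overall).abs().div(overall).mul(100)
print(gap_pct.round(2))
```

If the gaps are small, as the observation above suggests for the real data, class-wise imputation would change little compared with the overall median.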

# Imputation (filling the missing data)

In [None]:
# Assign the result back; inplace=True on a selected column is unreliable under pandas Copy-on-Write
df['ph'] = df['ph'].fillna(df['ph'].median())
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].median())
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].median())
X=df.drop('Potability',axis=1).values
y=df['Potability'].replace({"Yes":1,"No":0}).astype(int).values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=27)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Spot Checking
Using different baseline models to get an overview of which works best.
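
A single train/test split can give noisy rankings, so one possible refinement is to score each baseline with cross-validation. The sketch below is not the approach used in this notebook: it runs on synthetic data from `make_classification` (an assumption standing in for the nine features) and wraps the scaler in a pipeline so that it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine water-quality features (not the real data)
X_toy, y_toy = make_classification(n_samples=300, n_features=9, n_informative=4,
                                   weights=[0.6, 0.4], random_state=27)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=27)

# Same metric as the spot check: macro-averaged precision
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="precision_macro")
print(scores.mean(), scores.std())
```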

In [None]:
filterwarnings('ignore')
# 'GBC' is scikit-learn's GradientBoostingClassifier (not XGBoost)
models =[("LR", LogisticRegression(max_iter=1000)),("SVC", SVC()),('KNN',KNeighborsClassifier(n_neighbors=10)),
         ("DTC", DecisionTreeClassifier()),("GNB", GaussianNB()),
        ("SGDC", SGDClassifier()),
         ('RF',RandomForestClassifier()),('ADA',AdaBoostClassifier()),
        ('GBC',GradientBoostingClassifier())]

results = []
names = []
finalResults = []

for name,model in models:
    model.fit(X_train, y_train)
    model_results = model.predict(X_test)
    score = precision_score(y_test, model_results,average='macro')
    results.append(score)
    names.append(name)
    finalResults.append((name,score))
    
finalResults.sort(key=lambda k:k[1],reverse=True)

In [None]:
finalResults

# Conclusion
1. Logistic Regression has the best score among the above models. Note that the score computed here is macro-averaged precision, not accuracy.
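
Because the spot check scores models with `precision_score(..., average='macro')`, it helps to see how that metric can differ from accuracy. The labels below are hypothetical and chosen only to make the difference visible; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 6 negatives, 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# A cautious classifier that predicts the positive class only once
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
macro_prec = precision_score(y_true, y_pred, average="macro")

print(f"accuracy={acc:.3f}, macro precision={macro_prec:.3f}")
# accuracy=0.700, macro precision=0.833
```

A model that rarely predicts the positive class can therefore rank well on macro precision while missing most positive samples, so the ranking above may change under a different metric.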

# End Notes
## References:
1. **Water Quality: Analysis (Plotly) and Modelling [here](https://www.kaggle.com/jaykumar1607/water-quality-analysis-plotly-and-modelling)**



If you find this notebook helpful in any way, an upvote would be motivating.
Please comment if you find anything incorrect.


**Thank You**