### <font color='#2a52be' size=+5.5><center>DRINKING WATER POTABILITY</center></font>

# <font color='#2a52be'><center>EXPLORATORY DATA ANALYSIS</center></font>

![](https://images.pexels.com/photos/7245245/pexels-photo-7245245.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940)

<font color='#2a52be' size=+2>DATA</font>

* <font color='#2a52be'>pH value</font>: pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The ranges in the current investigation were 6.52–6.83, which are within the WHO standards.

* <font color='#2a52be'>Hardness</font>: Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.

* <font color='#2a52be'>Solids (Total Dissolved Solids - TDS)</font>: Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates, etc. These minerals produce an unwanted taste and a diluted color in the appearance of water. This is an important parameter for the use of water. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit is 1000 mg/L, as prescribed for drinking purposes.

* <font color='#2a52be'>Chloramines</font>: Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

* <font color='#2a52be'>Sulfate</font>: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

* <font color='#2a52be'>Conductivity</font>: Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

* <font color='#2a52be'>Organic Carbon</font>: Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated / drinking water and below 4 mg/L in source water that is used for treatment.

* <font color='#2a52be'>Trihalomethanes</font>: THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppm are considered safe in drinking water.

* <font color='#2a52be'>Turbidity</font>: The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

* <font color='#2a52be'>Potability</font>: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

Reference: https://www.kaggle.com/adityakadiwal/water-potability
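
The limits quoted above can be collected in one place and applied to a table of measurements. The following is a minimal sketch, not part of the original analysis: `LIMITS` and `within_limits` are hypothetical names, the bounds are the ones quoted in the descriptions above (all treated as inclusive here), and the three sample rows are made up for illustration.

```python
import pandas as pd

# Recommended ranges quoted in the descriptions above, as (lower, upper).
# LIMITS and within_limits are illustrative names, not part of the dataset.
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (0, 1000),        # mg/L, maximum limit for TDS
    "Chloramines": (0, 4),
    "Conductivity": (0, 400),
    "Organic_carbon": (0, 2),
    "Trihalomethanes": (0, 80),
    "Turbidity": (0, 5),
}

def within_limits(frame, limits=LIMITS):
    """Return a boolean DataFrame: True where a value lies inside its range."""
    checks = {
        column: frame[column].between(low, high)
        for column, (low, high) in limits.items()
        if column in frame.columns
    }
    return pd.DataFrame(checks, index=frame.index)

# Made-up measurements, for illustration only.
sample = pd.DataFrame({
    "ph": [7.0, 9.2, 6.4],
    "Chloramines": [3.5, 7.1, 4.0],
    "Turbidity": [3.9, 4.2, 5.6],
})

flags = within_limits(sample)
print(flags)
print(flags.all(axis=1))  # True only for rows that pass every available check
```

Missing values come out as `False` with `between`, so a row with a missing measurement is never reported as passing.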

<font color='#2a52be' size=+2>IMPORTING LIBRARIES</font>

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # needed for the imputation steps
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

<font color='#2a52be' size=+2>IMPORTING DATASET</font>

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')
df

In [None]:
df.shape

* The data contain **3276 entries** and **10 variables**.
* It has **9 independent variables** and **1 dependent variable**.

In [None]:
df.info()

In [None]:
df.describe()

<font color='#2a52be' size=+2>MISSING VALUES</font>

* **DROP ROWS METHOD**

In [None]:
df.isnull().sum()

* There are **3 variables with missing values (pH, Sulfate, Trihalomethanes)**.

In [None]:
df_drop = df.dropna()
df_drop.shape

* I dropped the rows with missing values here, since I am not going to do machine learning with this subset.
* The data went from **3276** entries to **2011**; almost **39%** of the data was lost.
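
The share of rows lost can be checked directly from the two shapes reported above. A minimal sketch; the variable names are illustrative.

```python
rows_before = 3276  # entries in the full dataset
rows_after = 2011   # entries left after dropna()

rows_lost = rows_before - rows_after
share_lost = rows_lost / rows_before

print(f"{rows_lost} rows lost ({share_lost:.1%})")
```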

<font color='#2a52be' size=+2><center>DATA VISUALIZATION</center></font>

In [None]:
df_drop.drop('Potability', axis=1).hist(figsize=(12,8), color = 'lightblue');

* We can observe that the values of each variable are close to a normal distribution.


<font color='#2a52be' size=+1>Let's compare the different values to their recommended or standard values</font>

In [None]:
df_ph_normal = df_drop['ph'].between(6.5, 8.5).value_counts()
df_ph_normal

In [None]:
sns.barplot(x=df_ph_normal.index, y=df_ph_normal.values)
plt.title('Recommended pH Level')
plt.xlabel('pH level is between 6.5 - 8.5')

* Based on the data above, 961 entries (about 48%) were within the recommended pH range.
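
The percentage can be read straight from the boolean mask instead of being worked out by hand. A small sketch on made-up pH values (the `ph_values` series is illustrative, not the dataset):

```python
import pandas as pd

ph_values = pd.Series([6.0, 6.8, 7.4, 8.5, 9.1])  # made-up values

in_range = ph_values.between(6.5, 8.5)  # inclusive on both ends by default
count_in_range = int(in_range.sum())    # number of True entries
share_in_range = in_range.mean()        # mean of booleans = fraction of True

print(f"{count_in_range} samples ({share_in_range:.0%}) are within 6.5 - 8.5")
```

The same pattern applies to each of the other threshold checks in this section.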

In [None]:
df_solids_normal = df_drop['Solids'].between(500, 1000).value_counts()
df_solids_normal

In [None]:
sns.barplot(x=df_solids_normal.index, y=df_solids_normal.values)
plt.title('Recommended TDS Level')
plt.xlabel('TDS level is between 500ppm - 1000ppm')

* Nearly all of our samples fall outside the 500 - 1000 ppm range. Their TDS levels are high, which means the water would be considered unsafe for drinking by this standard.

In [None]:
df_ch_normal = df_drop['Chloramines'].between(0, 4).value_counts()
df_ch_normal

In [None]:
sns.barplot(x=df_ch_normal.index, y=df_ch_normal.values)
plt.title('Recommended Chloramine Level')
plt.xlabel('Chloramine level is below 4ppm')

* Water is considered safe for drinking if the chloramine level is at or below 4 ppm. Based on the data above, only 59 of our samples are at or below 4 ppm.

In [None]:
df_con_normal = df_drop['Conductivity'].between(0, 400).value_counts()
df_con_normal

In [None]:
sns.barplot(x=df_con_normal.index, y=df_con_normal.values)
plt.title('Recommended Conductivity Level')
plt.xlabel('Conductivity level is within 400 μS/cm')

* Based on the data above we can observe that 794 of the samples are within the recommended value of conductivity.

In [None]:
df_toc_normal = df_drop['Organic_carbon'].between(0, 2.01).value_counts()
df_toc_normal

In [None]:
sns.barplot(x=df_toc_normal.index, y=df_toc_normal.values)
plt.title('Recommended TOC Level')
plt.xlabel('TOC level is less than 2 mg/L')

* According to US EPA, less than 2 mg/L of TOC is the recommended level for drinking water and 100% of our samples are greater than 2 mg/L.

In [None]:
df_thm_normal = df_drop['Trihalomethanes'].between(0, 80).value_counts()
df_thm_normal

In [None]:
sns.barplot(x=df_thm_normal.index, y=df_thm_normal.values)
plt.title('Recommended THM Level')
plt.xlabel('THM level is less than 80 ppm')

* We can see that 1623 of our samples were within the recommended level of THM for drinking water.

In [None]:
df_tur_normal = df_drop['Turbidity'].between(0, 5).value_counts()
df_tur_normal

In [None]:
sns.barplot(x=df_tur_normal.index, y=df_tur_normal.values)
plt.title('Recommended Turbidity Level')
plt.xlabel('Turbidity level is less than 5.00 NTU')

* 1827 of our samples passed WHO's recommended turbidity level for drinking water.

<font color='#2a52be' size=+1>Let's study the correlation between the variables</font>

In [None]:
df_drop.corr()

In [None]:
sns.heatmap(data =  df_drop.corr(), annot = False)
plt.title('Correlation of variables', fontsize = 16)


* We can observe that there is no strong correlation between any of the variables.
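
A heatmap without annotations makes it hard to say which pair is the strongest. One way to check is to rank the pairs from the correlation matrix. This is a sketch on made-up data (the `demo` frame and its columns are illustrative); the same steps could be applied to `df_drop.corr()`.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is built from 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr()

# Keep only the upper triangle (k=1 excludes the diagonal),
# so each pair of variables appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```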

<font color='#2a52be' size=+2>MACHINE LEARNING</font>

In [None]:
df.columns

In [None]:
df_copy = df.copy()

y = df_copy['Potability']
df_copy_feature = ['ph','Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity']
X = df_copy[df_copy_feature]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)


In [None]:
X.describe()

In [None]:
X.head()

In [None]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
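
`mean_absolute_error` averages the absolute differences between the true labels and the predictions. Because `Potability` only takes the values 0 and 1, a constant prediction of 0.5 always gives an MAE of exactly 0.5, which is a useful baseline when reading the scores. A sketch with made-up labels and predictions, using NumPy only:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])          # made-up labels
y_pred = np.array([0.2, 0.9, 0.4, 0.1])  # made-up regressor outputs

# MAE: mean of |true - predicted|
mae = np.mean(np.abs(y_true - y_pred))

# Baseline: always predict 0.5 for a 0/1 target
baseline_mae = np.mean(np.abs(y_true - 0.5))

print(mae, baseline_mae)
```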

## <center>DECISION TREE REGRESSOR</center>

In [None]:
# Fill missing values with the column mean (SimpleImputer's default strategy),
# fitting on the training split only
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removes the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

dtr_model = DecisionTreeRegressor(random_state = 0)
dtr_model.fit(imputed_X_train, y_train)
val_predictions = dtr_model.predict(imputed_X_valid)
dtrmae = mean_absolute_error(y_valid, val_predictions)
print("MAE from Decision Tree Regressor:")
print(dtrmae)

## <center>RANDOM FOREST REGRESSOR</center>

* **IMPUTATION**


In [None]:
# Fill missing values with the column mean, fitting on the training split only
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removes the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Score once and reuse the result
rfrmae = score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid)
print("MAE from Random Forest Regressor:")
print(rfrmae)
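
With its default settings, `SimpleImputer()` replaces each missing value with the mean of its column, and that mean is learned from the training split only. The same idea can be sketched in plain pandas on made-up data (the `train` and `valid` frames below are illustrative, not the dataset splits):

```python
import numpy as np
import pandas as pd

# Made-up splits with a few missing values
train = pd.DataFrame({"ph": [6.0, np.nan, 8.0], "Sulfate": [300.0, 340.0, np.nan]})
valid = pd.DataFrame({"ph": [np.nan, 7.5], "Sulfate": [310.0, np.nan]})

# Column means from the training split; NaN values are skipped
train_means = train.mean()

train_filled = train.fillna(train_means)
# Reuse the training means so no information leaks from the validation split
valid_filled = valid.fillna(train_means)

print(valid_filled)
```

Fitting on the training data and only transforming the validation data keeps the validation score an honest estimate.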

## <center>XGBOOST</center>

In [None]:
# XGBoost handles missing values natively, so the unimputed splits are used here
my_model = XGBRegressor(n_estimators = 500, learning_rate=0.05, n_jobs=4, random_state=0)
my_model.fit(X_train, y_train)
predictions = my_model.predict(X_valid)
xgbmae = mean_absolute_error(y_valid, predictions)
print("MAE from XGBoost:")
print(xgbmae)

## <center>MODEL RESULTS</center>

In [None]:
mae_results = pd.DataFrame({'MAE': [dtrmae, rfrmae, xgbmae]})
# set_axis returns a new DataFrame, so the result has to be assigned back
mae_results = mae_results.set_axis(['Decision Tree Regressor', 'Random Forest Regressor', 'XGBoost'], axis=0)
plt.figure(figsize = (6,5))
plt.ylim(0.4, 0.43)
plt.title('Mean Absolute Error for each model')
sns.barplot(x = mae_results.index, y = mae_results['MAE'])

