
**Quality Check of Water Potability Data**

My two cents on this important and practical topic.

- First of all, what measuring procedures and definitions lie behind these data?
- Were the water samples taken in the same region, under the same water quality standards?
Or did the regions, as well as the defining standards, differ from sample to sample?
- Also, who established the potability value (0 or 1), and how?
Does it mean something like "you can or can't drink a particular water sample more than once"?

Water quality differs from region to region. People also get used to their local water and develop better immunity to it.
So a subjective conclusion about one region's water quality, and its people's immunity to it, can't be generalized to the rest of the world.

These considerations cast doubt on the consistency of the supplied data.

Thanks to previous contributors, it has been shown that the correlations between the parameters are very low.
To me, this is good. It suggests that the measured parameters are not linearly related, so each of them adds its own information for characterizing the water samples.
And no resources are wasted on measuring parameters that are highly correlated or linked by a well-established relationship.
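
The low-correlation claim is easy to re-check. Below is a minimal sketch on a small synthetic frame: the column names mirror the dataset, but the values are randomly generated for illustration, and `max_abs_offdiag_corr` is a hypothetical helper name. Keep in mind that a low Pearson correlation only rules out a linear relationship, not every kind of dependence.

```python
import numpy as np
import pandas as pd

def max_abs_offdiag_corr(df: pd.DataFrame) -> float:
    """Largest absolute Pearson correlation between two different columns."""
    corr = df.corr().to_numpy()
    # Mask out the diagonal, which is always 1
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).max())

# Synthetic, independent columns (made-up values, not the real dataset)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'ph':        rng.normal(7.0, 1.5, 500),
    'Hardness':  rng.normal(200, 30, 500),
    'Turbidity': rng.normal(4.0, 0.8, 500),
})

print(df_demo.corr().round(2))
print('largest |r| between different columns:', max_abs_offdiag_corr(df_demo))
```

With the real data, the same helper could be applied to `df_water` once it is loaded.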

But look at the following lists:
- [National Primary Drinking Water Standards](https://www.water-research.net/index.php/standards/primary-standards)
- [National Secondary Drinking Water Standards](https://water-research.net/index.php/standards/secondary-standards)

Aren't there too many of them? And which ones do we have in our data?
Do we have enough data to determine not only a theoretical but also an applied value of water potability?
Will those potability values be consistent with the well-known established standards?
And moreover, which of our parameters are not primary but only secondary?

Let's take a look at the following table.
**<center>Water Quality Allowed Values
<br>by WHO (World Health Organization), USA, Canada</center>**

 | N | Parameter | Allowed Value | Importance | Comment | Reference Link |
 |---|---|---|---|---|---|
 | 0 | ph | 6.5 to 8.5 | Secondary | No guidelines | [WHO](https://www.who.int/water_sanitation_health/dwq/chemicals/ph_revised_2007_clean_version.pdf) |
 | 1 | Hardness | 151 to 300 mg/L | Secondary | No guidelines | [USA](https://www.healthvermont.gov/environment/drinking-water/hardness-drinking-water) |
 | 2 | Solids | 1200 mg/L | Secondary | Same as TDS (Total Dissolved Solids) | [WHO](https://www.who.int/water_sanitation_health/dwq/chemicals/ph_revised_2007_clean_version.pdf) |
 | 3 | **Chloramines** | 4 ppm | **Primary** | Chlorine disinfection traces | [USA](https://www.cdc.gov/healthywater/drinking/public/water_disinfection.html) |
 | 4 | Sulfate | 250 ppm | Secondary | Mostly from natural sources | [USA](https://www.water-research.net/index.php/sulfates) |
 | 5 | Conductivity | 400 μS/cm | Secondary | Derived value | [WHO](https://rdcu.be/cubAf) |
 | 6 | **Organic_carbon** | 10 ppm | **Primary** | Cumulative value for organic molecules | [Canada](https://www2.gov.bc.ca/assets/gov/environment/air-land-water/water/waterquality/water-quality-guidelines/approved-wqgs/organic-carbon-tech.pdf) |
 | 7 | **Trihalomethanes** | 80 ppb | **Primary** | Also chlorine disinfection traces | [USA](https://archive.epa.gov/enviro/html/icr/web/html/gloss_dbp.html) |
 | 8 | **Turbidity** | 5 NTU | **Primary** | Measured by light scattered in the water | [WHO](https://www.who.int/water_sanitation_health/publications/turbidity-information-200217.pdf) |


To be as consistent as possible, I tried to gather the water quality standards from the same sources, preferably the USA, Canada or WHO (World Health Organization).
So 5 out of 9 parameters in our dataset turned out to be secondary, and 4 (less than half) of them were primary.

Frankly, this is not a really good set of parameters for making responsible predictions.
But anyway, let's figure out what fraction of the dataset fits the above-mentioned standards.



In [None]:
import pandas as pd

In [None]:
df_water = pd.read_csv('../input/water-potability/water_potability.csv')

df_water.describe()

In [None]:
# Minimum contaminant levels
dict_cl_min = {
    'ph'              : 6.5,
    'Hardness'        : 151,
    'Solids'          : 0,
    'Chloramines'     : 0,
    'Sulfate'         : 0,
    'Conductivity'    : 0,
    'Organic_carbon'  : 0,
    'Trihalomethanes' : 0,
    'Turbidity'       : 0
}

# Maximum contaminant levels
dict_cl_max = {
    'ph'              : 8.5,
    'Hardness'        : 300,
    'Solids'          : 1200,
    'Chloramines'     : 4,
    'Sulfate'         : 250,
    'Conductivity'    : 400,
    'Organic_carbon'  : 10,
    'Trihalomethanes' : 80,
    'Turbidity'       : 5
}

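# Primary water quality parameters
# NOTE: the table above lists Conductivity as secondary, but it is grouped with
# the primary parameters here; the counts reported below rely on this grouping.
set_cl_primary = {'Chloramines', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity'}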

# Secondary water quality parameters
set_cl_secondary = {'ph', 'Hardness', 'Solids', 'Sulfate'}

In [None]:
# One boolean flag per dataset value: does it fit the water quality standard or not
df_cl_filter_applied_by_col = pd.DataFrame()

# Apply water quality standards filter for minimum values
# (a comparison with NaN evaluates to False, so missing values count as failures)
for col, min_val in dict_cl_min.items():
    df_cl_filter_applied_by_col[col] = df_water[col] >= min_val

# Apply water quality standards filter for maximum values
for col, max_val in dict_cl_max.items():
    df_cl_filter_applied_by_col[col] = df_cl_filter_applied_by_col[col] & (df_water[col] <= max_val)
    # How many values in dataset's column passed or didn't pass the water quality standard
    print(df_cl_filter_applied_by_col[col].value_counts())

# Top 5 rows to see the results of applying the filters
df_cl_filter_applied_by_col.head()
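
The two loops above can also be expressed with `Series.between`, which is inclusive on both ends by default. The sketch below is only an illustration on a tiny made-up frame; `limits` and `pass_fraction` are hypothetical names, not part of the notebook. It additionally reports the pass rate among non-missing values, because a comparison against NaN is always False and so a missing measurement is silently counted as a failure.

```python
import numpy as np
import pandas as pd

# Hypothetical (min, max) limits for two of the parameters from the table
limits = {'ph': (6.5, 8.5), 'Turbidity': (0, 5)}

# Made-up values for illustration, including one missing ph measurement
df_demo = pd.DataFrame({
    'ph':        [7.0, 9.1, np.nan, 6.5],
    'Turbidity': [3.2, 5.5, 4.9, 5.0],
})

def pass_fraction(df, limits):
    """Per-column share of values inside [min, max], with and without NaNs."""
    rows = {}
    for col, (lo, hi) in limits.items():
        ok = df[col].between(lo, hi)      # NaN is treated as "not within limits"
        known = df[col].notna()
        rows[col] = {
            'pass_all':   ok.mean(),          # missing values count as failures
            'pass_known': ok[known].mean(),   # only rows with a measurement
            'missing':    int((~known).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

summary = pass_fraction(df_demo, limits)
print(summary)
```

A large gap between `pass_all` and `pass_known` for a column would indicate that missing values, rather than out-of-range measurements, drive the failures.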

In [None]:
# Applying all water quality standards filters to water dataset
df_cl_filter_applied_all = df_cl_filter_applied_by_col.all(axis=1)

print('all filters result:', df_cl_filter_applied_all.value_counts(normalize=False), sep='\n')

# Applying only primary water quality standards filters to water dataset
# (recent pandas versions reject a set as a column indexer, so convert it to a list)
df_cl_filter_applied_primary = df_cl_filter_applied_by_col[list(set_cl_primary)].all(axis=1)

print('primary filters result:', df_cl_filter_applied_primary.value_counts(normalize=False), sep='\n')

# Applying only secondary water quality standards filters to water dataset
df_cl_filter_applied_secondary = df_cl_filter_applied_by_col[list(set_cl_secondary)].all(axis=1)

print('secondary filters result:', df_cl_filter_applied_secondary.value_counts(normalize=False), sep='\n')

# Only 4 samples passed the primary filters
print(df_water[df_cl_filter_applied_primary])

None of the samples passed the secondary filters, and so none passed all water quality standards filters together.

Only 4 samples passed the primary filters. 2 of them were marked as non-potable, perhaps due to other primary parameters not listed in our dataset.
After this dataset quality check, I see no basis for making any valuable and applicable predictions.
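
One way to see how a filter outcome lines up with the label is a contingency table. Here is a minimal sketch with `pd.crosstab` on made-up values: `passed_primary` stands in for `df_cl_filter_applied_primary`, and the label column name `Potability` is an assumption about the dataset.

```python
import pandas as pd

# Made-up labels and filter outcomes, for illustration only
df_demo = pd.DataFrame({'Potability': [0, 1, 0, 1, 0, 0]})
passed_primary = pd.Series([True, True, False, False, False, True])

# Readable row labels instead of True/False
outcome = passed_primary.map({True: 'pass', False: 'fail'}).rename('primary_filters')

# Rows: filter outcome, columns: potability label
table = pd.crosstab(outcome, df_demo['Potability'])
print(table)
```

If the label agreed with the standards, the "pass" row would be concentrated in column 1 and the "fail" row in column 0.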

However eager we are to apply modern advancements in data handling techniques, it's never a bad idea to run a reality check for consistency, or to look for the meaning behind the dataset being worked on.

In my opinion, the truth may be somewhere close, but it was not "out there"© in the given water potability dataset.