# Water Quality Classification

In this project, we train a classifier using logistic regression to determine whether water is safe to drink or not. 

Dataset source: https://www.kaggle.com/datasets/mssmartypants/water-quality

## Dependencies

We start by importing the required dependencies:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Data Preprocessing

First of all, we load the data into a pandas dataframe, specifying the row containing the header:

In [2]:
df = pd.read_csv('water-quality-data.csv', header=0)

Let's inspect our data to ensure everything is alright:

In [3]:
df.head()

Unnamed: 0,aluminium,ammonia,arsenic,barium,cadmium,chloramine,chromium,copper,flouride,bacteria,...,lead,nitrates,nitrites,mercury,perchlorate,radium,selenium,silver,uranium,is_safe
0,1.65,9.08,0.04,2.85,0.007,0.35,0.83,0.17,0.05,0.2,...,0.054,16.08,1.13,0.007,37.75,6.78,0.08,0.34,0.02,1
1,2.32,21.16,0.01,3.31,0.002,5.28,0.68,0.66,0.9,0.65,...,0.1,2.01,1.93,0.003,32.26,3.21,0.08,0.27,0.05,1
2,1.01,14.02,0.04,0.58,0.008,4.24,0.53,0.02,0.99,0.05,...,0.078,14.16,1.11,0.006,50.28,7.07,0.07,0.44,0.01,0
3,1.36,11.33,0.04,2.96,0.001,7.23,0.03,1.66,1.08,0.71,...,0.016,1.41,1.29,0.004,9.12,1.72,0.02,0.45,0.05,1
4,0.92,24.33,0.03,0.2,0.006,2.67,0.69,0.57,0.61,0.13,...,0.117,6.74,1.11,0.003,16.9,2.41,0.02,0.06,0.02,1


As we can see, we have a total of 7999 rows and 21 columns:

In [4]:
df.shape

(7999, 21)

We also look at summary statistics to sanity-check the data:

In [5]:
df.describe()

Unnamed: 0,aluminium,arsenic,barium,cadmium,chloramine,chromium,copper,flouride,bacteria,viruses,lead,nitrates,nitrites,mercury,perchlorate,radium,selenium,silver,uranium
count,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0,7999.0
mean,0.666158,0.161445,1.567715,0.042806,2.176831,0.247226,0.805857,0.771565,0.319665,0.328583,0.09945,9.818822,1.329961,0.005194,16.460299,2.920548,0.049685,0.147781,0.044673
std,1.265145,0.25259,1.216091,0.036049,2.567027,0.27064,0.653539,0.435373,0.329485,0.378096,0.058172,5.541331,0.573219,0.002967,17.687474,2.323009,0.02877,0.143551,0.026904
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.04,0.03,0.56,0.008,0.1,0.05,0.09,0.405,0.0,0.002,0.048,5.0,1.0,0.003,2.17,0.82,0.02,0.04,0.02
50%,0.07,0.05,1.19,0.04,0.53,0.09,0.75,0.77,0.22,0.008,0.102,9.93,1.42,0.005,7.74,2.41,0.05,0.08,0.05
75%,0.28,0.1,2.48,0.07,4.24,0.44,1.39,1.16,0.61,0.7,0.151,14.61,1.76,0.008,29.48,4.67,0.07,0.24,0.07
max,5.05,1.05,4.94,0.13,8.68,0.9,2.0,1.5,1.0,1.0,0.2,19.83,2.93,0.01,60.01,7.99,0.1,0.5,0.09
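The summary above lists neither **ammonia** nor **is_safe**, which suggests pandas read those two columns as text rather than numbers. The sketch below shows one way to spot such columns. The miniature frame `mini` and its values are hypothetical stand-ins, not rows from the real file:

```python
import pandas as pd

# Hypothetical miniature frame: 'ammonia' holds text because of one bad cell.
mini = pd.DataFrame({
    "aluminium": [1.65, 0.05, 1.01],
    "ammonia": ["9.08", "#NUM!", "14.02"],
})

# describe() summarises numeric columns only by default, so 'ammonia' is skipped.
described_columns = mini.describe().columns.tolist()
print(described_columns)

# Selecting everything that is not numeric makes the skipped columns explicit.
text_columns = mini.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Running the same `select_dtypes` call on `df` would show which columns of the real dataset need attention.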


Let's make sure the labels look good as well:

In [6]:
df['is_safe'].value_counts()

is_safe
0        7084
1         912
#NUM!       3
Name: count, dtype: int64

We notice 3 rows whose label is the invalid value **#NUM!** instead of 0 or 1. Considering we have nearly 8,000 records, we can simply drop those rows:

In [7]:
df.drop(df[df['is_safe'] == '#NUM!'].index, inplace=True)

We double check to ensure missing values have been removed:

In [8]:
df['is_safe'].value_counts()

is_safe
0    7084
1     912
Name: count, dtype: int64
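An alternative to dropping rows by label is to tell pandas at load time that **#NUM!** means "missing". The sketch below does this with the `na_values` parameter of `pd.read_csv`. It runs on a hypothetical three-row excerpt, and it assumes that the marker also appears in the **ammonia** column, which the notebook does not confirm:

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the real CSV file.
csv_text = """aluminium,ammonia,is_safe
1.65,9.08,1
0.05,#NUM!,#NUM!
1.01,14.02,0
"""

# Treat the spreadsheet error marker as a missing value while parsing.
sample = pd.read_csv(io.StringIO(csv_text), na_values=["#NUM!"])

# Drop incomplete rows, then restore an integer label column.
sample = sample.dropna().reset_index(drop=True)
sample["is_safe"] = sample["is_safe"].astype(int)

print(sample.dtypes)
print(sample)
```

With this approach the labels become integers, so any later comparison against the string `'1'` would have to compare against the integer `1` instead.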

Finally, we select the independent and dependent variables:

In [9]:
x = df.drop(columns='is_safe')
y = df['is_safe']
print(x)
print(y)

      aluminium ammonia  arsenic  barium  cadmium  chloramine  chromium   
0          1.65    9.08     0.04    2.85    0.007        0.35      0.83  \
1          2.32   21.16     0.01    3.31    0.002        5.28      0.68   
2          1.01   14.02     0.04    0.58    0.008        4.24      0.53   
3          1.36   11.33     0.04    2.96    0.001        7.23      0.03   
4          0.92   24.33     0.03    0.20    0.006        2.67      0.69   
...         ...     ...      ...     ...      ...         ...       ...   
7994       0.05    7.78     0.00    1.95    0.040        0.10      0.03   
7995       0.05   24.22     0.02    0.59    0.010        0.45      0.02   
7996       0.09    6.85     0.00    0.61    0.030        0.05      0.05   
7997       0.01      10     0.01    2.00    0.000        2.00      0.00   
7998       0.04    6.85     0.01    0.70    0.030        0.05      0.01   

      copper  flouride  bacteria  viruses   lead  nitrates  nitrites  mercury   
0       0.17      

## Training and Test Data

We begin by splitting our dataset into training and testing data. Since there are far more records of unsafe water (7084) than of safe water (912), we use **stratify** so that both splits keep the same class proportions as the full dataset. We also specify that we want to use 20% of the dataset for testing. The **random_state** simply ensures we get the same split when we rerun the code:

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify=y, random_state=0)
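To illustrate what **stratify** does, here is a small sketch on synthetic labels with a similar imbalance (about 11% positives). The names `labels` and `features` are made up for the example and are not part of the dataset:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the dataset.
labels = [1] * 110 + [0] * 890
features = [[i] for i in range(len(labels))]

# Same call pattern as in the notebook: 20% test data, stratified on the labels.
_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

# Share of positive labels in each split.
train_share = Counter(y_tr)[1] / len(y_tr)
test_share = Counter(y_te)[1] / len(y_te)
print(f"positive share: train={train_share:.3f}, test={test_share:.3f}")
```

The two shares should come out essentially equal. Without `stratify`, they could drift apart by chance, which matters more the rarer the minority class is.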

We print the shapes to make sure the split sizes are correct:

In [11]:
print(x.shape, x_train.shape, x_test.shape)

(7996, 20) (6396, 20) (1600, 20)


In [12]:
print(y.shape, y_train.shape, y_test.shape)

(7996,) (6396,) (1600,)
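The sizes above can be checked by hand. The sketch below assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the training set:

```python
import math

n_samples = 7996   # rows left after dropping the invalid labels
test_size = 0.2    # fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the result matches the shapes printed by the notebook.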


We also view the data to make sure everything looks alright: 

In [13]:
print(x_train)
print(y_train)

      aluminium ammonia  arsenic  barium  cadmium  chloramine  chromium   
2825       0.00   29.37    0.040    1.32    0.006        6.35      0.80  \
3360       0.08   26.12    0.490    3.12    0.090        7.11      0.63   
1896       1.12   22.73    0.040    2.21    0.050        3.64      0.05   
1436       0.68   24.88    0.340    3.81    0.100        5.75      0.58   
5456       0.05    8.73    0.090    1.09    0.080        0.04      0.06   
...         ...     ...      ...     ...      ...         ...       ...   
4957       0.01   24.63    0.020    0.03    0.040        0.02      0.08   
878        2.02   24.89    0.001    2.40    0.005        4.90      0.59   
5860       0.04    5.69    0.050    0.56    0.060        0.01      0.10   
4668       0.05    3.31    0.070    0.99    0.000        0.16      0.06   
2039       0.08    19.9    0.030    2.67    0.006        4.30      0.26   

      copper  flouride  bacteria  viruses   lead  nitrates  nitrites  mercury   
2825    1.25      

In [14]:
print(x_test)
print(y_test)

      aluminium ammonia  arsenic  barium  cadmium  chloramine  chromium   
6573       0.01   25.28     0.10    1.44    0.030        0.11      0.07  \
6590       0.07    2.62     0.06    0.23    0.070        0.02      0.07   
639        2.72   25.99     0.01    3.06    0.002        3.36      0.63   
3947       0.03    3.24     0.26    1.32    0.030        8.33      0.51   
6539       0.06   24.81     0.10    0.14    0.090        0.07      0.05   
...         ...     ...      ...     ...      ...         ...       ...   
3914       0.04   27.83     0.20    4.51    0.120        7.73      0.27   
4066       0.03   21.79     0.09    0.42    0.060        0.33      0.01   
5653       0.06   18.38     0.06    0.69    0.010        0.07      0.08   
515        2.24    5.52     0.03    0.80    0.003        4.16      0.54   
5690       0.04    9.47     0.10    1.01    0.100        0.03      0.06   

      copper  flouride  bacteria  viruses   lead  nitrates  nitrites  mercury   
6573    0.76      

## Model Training

It's time to train our model. We decided to use logistic regression, a supervised machine-learning model that is suitable for binary classification:

In [15]:
model = LogisticRegression()

Fitting our data with the default **max_iter=100** does not raise an error, but it triggers a convergence warning because the solver stops before it has converged:

In [16]:
model.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


We increase **max_iter** and fit our data again. Much better!

In [17]:
model = LogisticRegression(max_iter=1000)

In [18]:
model.fit(x_train.values, y_train.values)
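The warning also mentions scaling the data as a remedy. A minimal sketch of that option follows, using `StandardScaler` in a pipeline in front of the classifier. It runs on synthetic data with two features on very different scales; `X_demo`, `y_demo` and `scaled_model` are illustrative names, and this is not what the notebook itself does:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature around 1, one around 1000, plus label noise.
x_small = rng.normal(0.0, 1.0, 500)
x_large = rng.normal(0.0, 1000.0, 500)
noise = rng.normal(0.0, 0.5, 500)
X_demo = np.column_stack([x_small, x_large])
y_demo = (x_small + x_large / 1000.0 + noise > 0).astype(int)

# Standardise each feature, then fit logistic regression with default settings.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# Number of solver iterations actually used by the classifier step.
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
print("training accuracy:", scaled_model.score(X_demo, y_demo))
```

In the notebook, the same pipeline could be fitted on `x_train` and `y_train` instead of raising **max_iter**. Whether it converges within the default limit on the real data would need to be checked.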

We make predictions using our test data:

In [19]:
y_test_pred = model.predict(x_test.values)

## Model Evaluation

We evaluate our model with the accuracy score, which compares the true labels of our test data with the predicted ones:

In [20]:
accuracy = accuracy_score(y_test, y_test_pred)

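The model reaches an accuracy of about 90%. Since roughly 89% of the records are labeled unsafe (7084 of 7996), a model that always predicts "unsafe" would score almost as high, so this number should be read with care: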

In [21]:
print('Accuracy score is: ', accuracy)

Accuracy score is:  0.90375
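For context, the accuracy of a trivial majority-class predictor can be worked out directly from the label counts printed earlier. This is plain arithmetic on those counts, not an additional experiment:

```python
# Class counts taken from the value_counts() output of the cleaned labels.
n_unsafe, n_safe = 7084, 912

# Accuracy of a trivial model that always predicts the majority class (unsafe).
baseline_accuracy = n_unsafe / (n_unsafe + n_safe)
print(f"majority-class baseline: {baseline_accuracy:.3f}")  # → 0.886
```

Comparing the model's score against this baseline gives a better sense of how much the classifier has actually learned.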


## Making Predictions

We can use our model to make predictions:

In [22]:
input_data = [(4.32,10.15,0.04,1.98,0.001,7.97,0.32,1.98,0.66,0.06,0.008,0.101,7.22,1.69,0.009,5.2,5.63,0.04,0.12,0.01)]

prediction = model.predict(input_data)

# The labels were read as text, so the prediction is the string '1', not the integer 1.
if prediction[0] == '1':
    print("Water is safe to drink.")
else:
    print('Warning! Water is unsafe!')

Water is safe to drink.
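The comparison against `'1'` works because the **is_safe** column was presumably read as text (it contained **#NUM!**), so the model was trained on string labels. The sketch below shows that behaviour on a tiny synthetic dataset. `X_toy`, `y_toy` and `toy_model` are hypothetical names, and `predict_proba` is included as one way to see how confident a prediction is:

```python
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset with string labels, mirroring a text-typed label column.
X_toy = [[0.0], [0.2], [0.8], [1.0]]
y_toy = ["0", "0", "1", "1"]

toy_model = LogisticRegression().fit(X_toy, y_toy)

# predict() returns labels of the same kind the model was trained on.
pred = toy_model.predict([[0.9]])[0]
# predict_proba() returns one probability per class, ordered like classes_.
proba = toy_model.predict_proba([[0.9]])[0]

print(toy_model.classes_)
print(type(pred), pred == "1")
print(dict(zip(toy_model.classes_, proba.round(3))))
```

If the labels were converted to integers before training, the check in the notebook would need to compare against `1` instead of `'1'`.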
