# Analysis of Water Potability Data

## What is Potability?
Potability means suitability for drinking. Here we analyze data that helps us decide whether a given sample of water is potable or not.

## Which data, and where did you get it from?
The dataset consists of factors that are taken into consideration when checking the potability of a given sample of water. We shall look into the details of the dataset later.

Data Source: https://github.com/MainakRepositor/Datasets/blob/master/water_potability.csv

## Is there any structure to the analysis?
 Sure! Here is a broad outline of the process; we shall then go through each step in detail:
 - Broad Analysis
 - Data Cleaning, if any
 - Univariate Analysis
 - Multivariate Analysis






## 1. Broad Analysis of the dataset

 ### 1.1 Importing Modules
  Let's import all the required modules for the analysis.



In [17]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline 

 ### 1.2 Loading the dataset
  We load the CSV file into a pandas DataFrame, which holds the raw data in tabular form for ease of analysis. After loading the dataset, we take a first look at its top rows.

In [16]:
df = pd.read_csv('water_potability.csv')
df.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


   As we can see, the data is tabular: each column is a factor taken into consideration, and each row is a sample of water taken for testing.

   Now, let's see the overall shape of the dataset.


In [19]:
df.shape

(3276, 10)

- No. of rows(samples) : 3276
- No. of columns(factors) : 10
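
Since `df.shape` is a plain `(rows, columns)` tuple, the two counts can be unpacked and reported directly instead of being copied by hand. A minimal sketch, assuming a small made-up frame named `toy` in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "ph": [7.0, 6.5, 8.1],
    "Hardness": [204.9, 129.4, 224.2],
    "Potability": [0, 0, 1],
})

# shape is a (rows, columns) tuple, so it can be unpacked in one step.
n_samples, n_factors = toy.shape

print(f"No. of rows (samples): {n_samples}")
print(f"No. of columns (factors): {n_factors}")
```

Running the same two lines on `df` should report the counts listed above.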

### 1.3 Knowing the features of the dataset

Let's see which factors are used in the water potability test.

In [21]:
df.columns

Index(['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity', 'Potability'],
      dtype='object')

The following factors/features are taken into consideration for this test:

- ph : The pH value of the water sample.
- Hardness: Hardness of the water sample.
- Solids: Measure of any solids present in the water sample.
- Chloramines: Measure of any chloramines present in the water sample.
- Sulfate: Measure of any sulfates present in the water sample.
- Conductivity: Measure of the electrical conductivity of the water sample.
- Organic_carbon: Measure of organic carbon content present in the water sample.
- Trihalomethanes: Measure of any trihalomethanes in the water sample.
- Turbidity: Measure of turbidity in the water sample.
- Potability: A label whether the water sample is potable or not.
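
Because `Potability` is the label and the other columns are measured factors, it can help to separate the two groups early. The sketch below is one possible way to do that, assuming a made-up frame `toy` with only a few of the columns; the names `label_col`, `feature_cols`, and `label_counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in with a subset of the columns; the values are made up.
toy = pd.DataFrame({
    "ph": [7.0, 6.5, 8.1, 7.4],
    "Turbidity": [2.9, 4.5, 3.0, 4.6],
    "Potability": [0, 0, 1, 0],
})

label_col = "Potability"

# Every column except the label is treated as a factor (feature).
feature_cols = toy.columns.drop(label_col).tolist()

# Number of samples per label value (0 = not potable, 1 = potable).
label_counts = toy[label_col].value_counts()

print(feature_cols)
print(label_counts)
```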

## 2. Data Cleaning

### 2.1 Checking duplicates
Time to check if there are any duplicate sample observations in the data.

In [23]:
df.duplicated().sum()

0

This means there are no duplicates in the dataset.
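
To see what a non-zero result would look like, here is a small sketch on a made-up frame `toy` that deliberately contains one repeated row. `duplicated()` flags every repeat of an earlier row, and `drop_duplicates()` would be one way to remove such rows if the real data had any.

```python
import pandas as pd

# Toy frame where the last row repeats the first one; values are made up.
toy = pd.DataFrame({
    "ph": [7.0, 6.5, 7.0],
    "Hardness": [200.0, 130.0, 200.0],
})

# duplicated() marks rows that repeat an earlier row; sum() counts them.
n_duplicates = int(toy.duplicated().sum())

# Keep only the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```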

### 2.2 Checking Missing Values

Now let's see if there are any missing values in the dataset.

In [24]:
df.isna().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
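
Raw counts are easier to judge as a share of all samples. The sketch below recomputes the percentages from the counts printed above and the 3276 rows reported by `df.shape`; the names `missing_counts` and `missing_pct` are illustrative. On the loaded frame, `df.isna().mean() * 100` should give the same figures directly.

```python
import pandas as pd

n_rows = 3276  # total number of samples, taken from df.shape

# Missing-value counts copied from the df.isna().sum() output.
missing_counts = pd.Series({"ph": 491, "Sulfate": 781, "Trihalomethanes": 162})

# Share of samples with a missing value, in percent, per column.
missing_pct = (missing_counts / n_rows * 100).round(2)

print(missing_pct.sort_values(ascending=False))
```

By this calculation roughly 15% of the `ph` values, 24% of the `Sulfate` values, and 5% of the `Trihalomethanes` values are missing.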