# Chicago Beach Water Quality Analysis -1 

<b>Data Preparation</b>

![image.png](attachment:image.png)

The objectives of this lab are: <br> 
- Getting familiar with data preparation tools and libraries
- Using Python and its libraries for data preparation

<b> Problem Understanding</b>

The Chicago Park District maintains sensors in the water at beaches along Chicago's Lake Michigan lakefront. These sensors generally capture the indicated measurements hourly while the sensors are in operation during the summer. During other seasons and at some other times, information from the sensors may not be available. The sensor locations change with the Park District’s operational needs, primarily related to water quality.

You have been appointed as an analyst to help the local body investigate the data acquired through this setup. You need to help the department go through this dataset carefully, check for quality issues present in the data, fix them with appropriate methods, and make the dataset ready for further analysis.

Data source: <a href="https://data.world/cityofchicago/beach-water-quality-automated-sensors">link</a>

We will carry out this exercise in two stages: <br>
(A) Getting familiar with the Pandas DataFrame object <br>
(B) Doing data processing with the Pandas DataFrame 

# (A) Getting Familiar with Pandas

![image.png](attachment:image.png)

Data exploration in Python is made easier by Pandas, the open source data analysis library. Companion profiling packages built on top of Pandas can profile a dataframe and generate a complete HTML report on the dataset. 

The pandas data exploration library provides:


* Tools for reading and writing data between disparate formats
* Integrated handling of missing data and intelligent data alignment 
* Flexible pivoting and reshaping of datasets
* Time-series functionality
* Efficient dataframe object for data manipulation with integrated indexing
* Columns can be inserted and deleted from data structures for size mutability
* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on datasets
* High performance merging and joining of datasets


# Installation

Install JupyterLab with pip:

```
pip install jupyterlab
```

Once installed, launch JupyterLab with:

```
jupyter lab
```

You can install pandas with your favourite package manager:

```$ conda install pandas```  
```$ pip install pandas```

Install any other package you need with the pip command, for example:

```pip3 install matplotlib```

<b> 1. Data Import using Pandas</b>

To begin:
 * Import pandas library
 * `pandas.read_csv()`: Opens a CSV file as a **DataFrame**, like a table.
 

In [3]:
import pandas as pd
import numpy as np
data = pd.read_csv("beach_water_quality_automated_sensors_1.csv")

Let's preserve the original dataset and work on a copy of it, i.e. another dataframe. Plain assignment would only create a second name for the same object, so we call `copy()`.

In [4]:
df = data.copy()
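
Plain assignment such as `df = data` only binds a second name to the same object, while `DataFrame.copy()` creates an independent one. A minimal sketch of the difference, using a small made-up frame in place of the beach dataset:

```python
import pandas as pd

# Toy frame standing in for the beach data (values are made up for illustration)
original = pd.DataFrame({"beach_name": ["A", "B"], "water_temperature": [20.3, 14.4]})

alias = original               # same object, second name
independent = original.copy()  # separate object with its own data

# Modify the independent copy; the original is untouched
independent.loc[0, "water_temperature"] = 99.0

print(alias is original)                     # True
print(independent is original)               # False
print(original.loc[0, "water_temperature"])  # 20.3
```

Working on a true copy means the cleaning steps later in this lab cannot silently change the data that was read from disk.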

Now the 'df' object is of type dataframe which can be manipulated further. 

In [5]:
type(df)

pandas.core.frame.DataFrame

In [6]:
l1= [56,"Rahul",56,12]

In [8]:
s=89
type(s)

int

Check the size of the data

In [9]:
df.shape

(34923, 10)

There are 34923 rows and 10 columns in the dataset.

Let's have a quick look at the starting and ending rows of the dataset.

In [12]:
#df.head()
df.head(15)

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
0,Montrose Beach,2013-08-30T08:00:00,20.3,1.18,0.891,0.08,3.0,9.4,2013-08-30T08:00:00,MontroseBeach201308300800
1,Ohio Street Beach,2016-05-26T13:00:00,14.4,1.23,,0.111,4.0,12.4,2016-05-26T13:00:00,OhioStreetBeach201605261300
2,Calumet Beach,2013-09-03T16:00:00,23.2,3.63,1.201,0.174,6.0,9.4,2013-09-03T16:00:00,CalumetBeach201309031600
3,Calumet Beach,2014-05-28T12:00:00,16.2,1.26,1.514,0.147,4.0,11.7,2014-05-28T12:00:00,CalumetBeach201405281200
4,Montrose Beach,2014-05-28T12:00:00,14.4,3.36,1.388,0.298,4.0,11.9,2014-05-28T12:00:00,MontroseBeach201405281200
5,Montrose Beach,2014-05-28T13:00:00,14.5,2.72,1.395,0.306,3.0,11.9,2014-05-28T13:00:00,MontroseBeach201405281300
6,Calumet Beach,2014-05-28T13:00:00,16.3,1.28,1.524,0.162,4.0,11.7,2014-05-28T13:00:00,CalumetBeach201405281300
7,Montrose Beach,2014-05-28T14:00:00,14.8,2.97,1.386,0.328,3.0,11.9,2014-05-28T14:00:00,MontroseBeach201405281400
8,Calumet Beach,2014-05-28T14:00:00,16.5,1.32,1.537,0.185,4.0,11.7,2014-05-28T14:00:00,CalumetBeach201405281400
9,Calumet Beach,2014-05-28T15:00:00,16.8,1.31,1.568,0.196,4.0,11.7,2014-05-28T15:00:00,CalumetBeach201405281500


Note: blank cells in the table above are missing values. Pandas displays them as `NaN`, which here means **empty**, *not a measured floating-point value*.
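
A short sketch of how missing values behave in practice. The series below is a toy example with illustrative values, not a slice of the real file:

```python
import numpy as np
import pandas as pd

# Toy series with one missing reading (values are illustrative)
depth = pd.Series([0.891, np.nan, 1.201], name="transducer_depth")

print(depth.isna().tolist())   # [False, True, False]
print(np.nan == np.nan)        # False: NaN never compares equal, so use isna()
print(depth.count())           # count() ignores missing entries
print(depth.mean())            # mean() skips NaN by default
```

Because `NaN` never compares equal to anything, tests such as `depth == np.nan` do not find missing values; `isna()` (or its alias `isnull()`) is the reliable check.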

In [14]:
#df.tail()
df.tail(10)

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
34913,Ohio Street Beach,2017-09-12T07:00:00,19.3,2.85,,0.187,3.0,10.5,2017-09-12T07:00:00,OhioStreetBeach201709120700
34914,Ohio Street Beach,2017-09-12T08:00:00,19.3,2.76,,0.187,3.0,10.5,2017-09-12T08:00:00,OhioStreetBeach201709120800
34915,Ohio Street Beach,2017-09-12T09:00:00,19.3,2.58,,0.187,3.0,10.5,2017-09-12T09:00:00,OhioStreetBeach201709120900
34916,Ohio Street Beach,2017-09-12T10:00:00,19.5,2.47,,0.187,3.0,10.5,2017-09-12T10:00:00,OhioStreetBeach201709121000
34917,Ohio Street Beach,2017-09-12T11:00:00,19.8,2.39,,0.187,3.0,10.5,2017-09-12T11:00:00,OhioStreetBeach201709121100
34918,Ohio Street Beach,2017-09-12T12:00:00,19.9,2.61,,0.187,3.0,10.5,2017-09-12T12:00:00,OhioStreetBeach201709121200
34919,Ohio Street Beach,2017-09-12T13:00:00,19.8,0.0,,0.187,3.0,10.5,2017-09-12T13:00:00,OhioStreetBeach201709121300
34920,Ohio Street Beach,2017-09-12T15:00:00,22.3,0.0,,0.187,3.0,10.5,2017-09-12T15:00:00,OhioStreetBeach201709121500
34921,Ohio Street Beach,2017-09-12T17:00:00,21.1,26.97,,0.187,3.0,9.4,2017-09-12T17:00:00,OhioStreetBeach201709121700
34922,Ohio Street Beach,2017-09-12T18:00:00,21.3,27.55,,0.187,3.0,9.4,2017-09-12T18:00:00,OhioStreetBeach201709121800


Let's explore the columns present in the dataframe.

In [15]:
df.columns

Index(['beach_name', 'measurement_timestamp', 'water_temperature', 'turbidity',
       'transducer_depth', 'wave_height', 'wave_period', 'battery_life',
       'measurement_timestamp_label', 'measurement_id'],
      dtype='object')

Check the types of the columns

In [16]:
df.dtypes

beach_name                      object
measurement_timestamp           object
water_temperature              float64
turbidity                      float64
transducer_depth               float64
wave_height                    float64
wave_period                    float64
battery_life                   float64
measurement_timestamp_label     object
measurement_id                  object
dtype: object

Some of them are numeric (float) and some come through as object, so we need to explore those more closely later.
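
The timestamp columns are one example of an `object` column that may deserve a proper type. A hedged sketch, assuming the strings follow the ISO 8601 layout seen in the preview rows; the toy frame is made up and stands in for `df`:

```python
import pandas as pd

# Toy rows in the same ISO 8601 layout as measurement_timestamp
toy = pd.DataFrame(
    {"measurement_timestamp": ["2013-08-30T08:00:00", "2016-05-26T13:00:00", None]}
)

# errors="coerce" turns unparseable or missing entries into NaT instead of raising
toy["measurement_timestamp"] = pd.to_datetime(toy["measurement_timestamp"], errors="coerce")

print(toy.dtypes)
print(toy["measurement_timestamp"].dt.hour.tolist())
```

Once the column is a datetime type, the `.dt` accessor exposes parts such as year, month and hour for later grouping or filtering.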

More detailed information about dataframe can be obtained as follows

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34923 entries, 0 to 34922
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   beach_name                   34923 non-null  object 
 1   measurement_timestamp        34917 non-null  object 
 2   water_temperature            34917 non-null  float64
 3   turbidity                    34917 non-null  float64
 4   transducer_depth             10034 non-null  float64
 5   wave_height                  34690 non-null  float64
 6   wave_period                  34690 non-null  float64
 7   battery_life                 34917 non-null  float64
 8   measurement_timestamp_label  34917 non-null  object 
 9   measurement_id               34923 non-null  object 
dtypes: float64(6), object(4)
memory usage: 2.7+ MB


The non-null counts are not the same for every column, which means we need to deal with missing values in some of the columns.

<b> 2. Slicing and Indexing of Pandas DataFrame</b>

Sometimes only a subset of rows and columns of the complete dataset is analyzed. For this purpose, the dataframe can be accessed or sliced by index positions or by names. Positions start at zero. 

Select first 3 rows of dataframe

In [41]:
df[0:3]

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
0,Montrose Beach,2013-08-30T08:00:00,20.3,1.18,0.891,0.08,3.0,9.4,2013-08-30T08:00:00,MontroseBeach201308300800
1,Ohio Street Beach,2016-05-26T13:00:00,14.4,1.23,,0.111,4.0,12.4,2016-05-26T13:00:00,OhioStreetBeach201605261300
2,Calumet Beach,2013-09-03T16:00:00,23.2,3.63,1.201,0.174,6.0,9.4,2013-09-03T16:00:00,CalumetBeach201309031600


Select intermediate rows from a dataframe with index values

In [19]:
df[5:8]

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
5,Montrose Beach,2014-05-28T13:00:00,14.5,2.72,1.395,0.306,3.0,11.9,2014-05-28T13:00:00,MontroseBeach201405281300
6,Calumet Beach,2014-05-28T13:00:00,16.3,1.28,1.524,0.162,4.0,11.7,2014-05-28T13:00:00,CalumetBeach201405281300
7,Montrose Beach,2014-05-28T14:00:00,14.8,2.97,1.386,0.328,3.0,11.9,2014-05-28T14:00:00,MontroseBeach201405281400


A negative index can also be used to count from the bottom of the dataframe.

In [20]:
#Display last three records
df[-3:]

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
34920,Ohio Street Beach,2017-09-12T15:00:00,22.3,0.0,,0.187,3.0,10.5,2017-09-12T15:00:00,OhioStreetBeach201709121500
34921,Ohio Street Beach,2017-09-12T17:00:00,21.1,26.97,,0.187,3.0,9.4,2017-09-12T17:00:00,OhioStreetBeach201709121700
34922,Ohio Street Beach,2017-09-12T18:00:00,21.3,27.55,,0.187,3.0,9.4,2017-09-12T18:00:00,OhioStreetBeach201709121800


Columns can be selected by name or by position.

In [23]:
df['beach_name']

0           Montrose Beach
1        Ohio Street Beach
2            Calumet Beach
3            Calumet Beach
4           Montrose Beach
               ...        
34918    Ohio Street Beach
34919    Ohio Street Beach
34920    Ohio Street Beach
34921    Ohio Street Beach
34922    Ohio Street Beach
Name: beach_name, Length: 34923, dtype: object

To select unique column values

In [24]:
df['beach_name'].unique()

array(['Montrose Beach', 'Ohio Street Beach', 'Calumet Beach',
       '63rd Street Beach', 'Osterman Beach', 'Rainbow Beach'],
      dtype=object)

To count the unique values in each column

In [25]:
df.nunique()

beach_name                         6
measurement_timestamp          10796
water_temperature                195
turbidity                       2616
transducer_depth                 775
wave_height                      593
wave_period                       11
battery_life                      86
measurement_timestamp_label    10796
measurement_id                 34923
dtype: int64

To select multiple columns

In [26]:
df[['beach_name', 'water_temperature']]

Unnamed: 0,beach_name,water_temperature
0,Montrose Beach,20.3
1,Ohio Street Beach,14.4
2,Calumet Beach,23.2
3,Calumet Beach,16.2
4,Montrose Beach,14.4
...,...,...
34918,Ohio Street Beach,19.9
34919,Ohio Street Beach,19.8
34920,Ohio Street Beach,22.3
34921,Ohio Street Beach,21.1


To select specific rows of specific columns

In [27]:
#Display first three rows for two columns specified
df[['beach_name', 'water_temperature']][0:3]

Unnamed: 0,beach_name,water_temperature
0,Montrose Beach,20.3
1,Ohio Street Beach,14.4
2,Calumet Beach,23.2


Integer positions can also be used to refer to rows and columns.

In [28]:
#Display first three rows for columns with index 3 and 4
df.iloc[0:3, 3:5]

Unnamed: 0,turbidity,transducer_depth
0,1.18,0.891
1,1.23,
2,3.63,1.201
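
Position-based `iloc` has a label-based counterpart, `loc`. The sketch below uses a three-row toy frame (values copied from the preview rows) to show that the two can return the same block, with one difference worth remembering: `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "beach_name": ["Montrose Beach", "Ohio Street Beach", "Calumet Beach"],
        "water_temperature": [20.3, 14.4, 23.2],
        "turbidity": [1.18, 1.23, 3.63],
    }
)

# Rows at positions 0-1 and columns at positions 1-2; the end of each slice is excluded
by_position = toy.iloc[0:2, 1:3]

# Row labels 0 through 1 and two named columns; the end label is included
by_label = toy.loc[0:1, ["water_temperature", "turbidity"]]

print(by_position.equals(by_label))
```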


<b> 3. Quick exploration of the dataframe</b>

Pandas has built-in functions that can be used for quick exploration of the data. 

In [29]:
#Get summary stat for numeric columns
df.describe()

Unnamed: 0,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life
count,34917.0,34917.0,10034.0,34690.0,34690.0,34917.0
mean,19.363387,4.823575,1.570235,-1516.116166,-1512.482041,11.038205
std,3.356908,33.5066,0.175118,12220.244835,12220.696864,0.771769
min,0.0,0.0,-0.082,-99999.992,-100000.0,4.8
25%,17.1,0.66,1.426,0.11,3.0,10.6
50%,19.6,1.26,1.578,0.154,3.0,11.0
75%,22.0,2.54,1.721,0.201,4.0,11.5
max,31.5,1683.48,2.214,1.467,10.0,13.3


In [30]:
#Get summary stat for specific numeric columns
df['wave_height'].describe()

count    34690.000000
mean     -1516.116166
std      12220.244835
min     -99999.992000
25%          0.110000
50%          0.154000
75%          0.201000
max          1.467000
Name: wave_height, dtype: float64
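
The minimum of -99999.992 sits far outside the quartiles and drags the mean down to about -1516. One plausible reading, which is an assumption here and not something the data source confirms, is that such values are placeholder codes for failed readings. A sketch on toy values of how they could be masked before computing statistics:

```python
import numpy as np
import pandas as pd

# Toy readings; -99999.992 mimics the extreme minimum seen in describe()
wave_height = pd.Series([0.11, 0.154, -99999.992, 0.201])

# Assumption: negative wave heights are not physical, so treat them as missing
cleaned = wave_height.where(wave_height >= 0, np.nan)

print(wave_height.mean())   # dominated by the extreme value
print(cleaned.mean())       # mean of the three plausible readings
```

Comparing the mean with the median (the 50% row) is a quick way to spot columns affected by values like this.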

In [31]:
#Get summary stat for specific numeric columns
df.turbidity.describe()

count    34917.000000
mean         4.823575
std         33.506600
min          0.000000
25%          0.660000
50%          1.260000
75%          2.540000
max       1683.480000
Name: turbidity, dtype: float64

In [32]:
#Get summary stat for all the columns
df.describe(include="all")

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
count,34923,34917,34917.0,34917.0,10034.0,34690.0,34690.0,34917.0,34917,34923
unique,6,10796,,,,,,,10796,34923
top,Ohio Street Beach,2015-08-12T19:00:00,,,,,,,2015-08-12T19:00:00,MontroseBeach201308300800
freq,9343,6,,,,,,,6,1
mean,,,19.363387,4.823575,1.570235,-1516.116166,-1512.482041,11.038205,,
std,,,3.356908,33.5066,0.175118,12220.244835,12220.696864,0.771769,,
min,,,0.0,0.0,-0.082,-99999.992,-100000.0,4.8,,
25%,,,17.1,0.66,1.426,0.11,3.0,10.6,,
50%,,,19.6,1.26,1.578,0.154,3.0,11.0,,
75%,,,22.0,2.54,1.721,0.201,4.0,11.5,,


`value_counts()` can also be used to count how often each value occurs in a column.

In [33]:
#Determine the value count for each beach
df.beach_name.value_counts()

beach_name
Ohio Street Beach    9343
Calumet Beach        7570
Montrose Beach       7269
Osterman Beach       4023
63rd Street Beach    3420
Rainbow Beach        3298
Name: count, dtype: int64

Passing normalize=True to value_counts() returns the proportion of each unique value; multiplying by 100 gives the percentage of occurrences.

In [35]:
df.beach_name.value_counts(normalize=True) * 100

beach_name
Ohio Street Beach    26.753143
Calumet Beach        21.676259
Montrose Beach       20.814363
Osterman Beach       11.519629
63rd Street Beach     9.792973
Rainbow Beach         9.443633
Name: proportion, dtype: float64

A DataFrame can also be sorted by its column values.

In [45]:
#Sorting based on a single column
df[['beach_name','water_temperature']].sort_values('beach_name')

Unnamed: 0,beach_name,water_temperature
15904,63rd Street Beach,19.2
20106,63rd Street Beach,22.6
16273,63rd Street Beach,19.5
24790,63rd Street Beach,14.3
4172,63rd Street Beach,16.7
...,...,...
28174,Rainbow Beach,19.8
14445,Rainbow Beach,16.6
28168,Rainbow Beach,19.8
14421,Rainbow Beach,16.5


In [46]:
#Sorting based on a single column
df.sort_values('beach_name')

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
15904,63rd Street Beach,2015-07-01T14:00:00,19.2,1.65,,0.309,5.0,11.7,2015-07-01T14:00:00,63rdStreetBeach201507011400
20106,63rd Street Beach,2015-07-25T17:00:00,22.6,0.74,,0.156,3.0,10.7,2015-07-25T17:00:00,63rdStreetBeach201507251700
16273,63rd Street Beach,2015-07-04T04:00:00,19.5,2.10,,0.112,4.0,11.5,2015-07-04T04:00:00,63rdStreetBeach201507040400
24790,63rd Street Beach,2015-08-23T05:00:00,14.3,0.47,,0.076,3.0,11.5,2015-08-23T05:00:00,63rdStreetBeach201508230500
4172,63rd Street Beach,2014-07-06T00:00:00,16.7,0.67,1.757,0.106,4.0,10.8,2014-07-06T00:00:00,63rdStreetBeach201407062400
...,...,...,...,...,...,...,...,...,...,...
28174,Rainbow Beach,2015-09-10T07:00:00,19.8,0.32,,0.188,4.0,11.7,2015-09-10T07:00:00,RainbowBeach201509100700
14445,Rainbow Beach,2015-06-23T03:00:00,16.6,0.79,,0.106,4.0,11.6,2015-06-23T03:00:00,RainbowBeach201506230300
28168,Rainbow Beach,2015-09-10T06:00:00,19.8,0.34,,0.204,4.0,11.7,2015-09-10T06:00:00,RainbowBeach201509100600
14421,Rainbow Beach,2015-06-22T23:00:00,16.5,1.02,,0.104,9.0,11.6,2015-06-22T23:00:00,RainbowBeach201506222300


In [47]:
#Sorting based on multiple columns
df[['beach_name','water_temperature']].sort_values(['beach_name', 'water_temperature'])

Unnamed: 0,beach_name,water_temperature
5172,63rd Street Beach,10.1
5217,63rd Street Beach,10.1
5229,63rd Street Beach,10.1
5213,63rd Street Beach,10.2
5176,63rd Street Beach,10.3
...,...,...
20458,Rainbow Beach,25.8
20526,Rainbow Beach,25.8
20912,Rainbow Beach,26.7
824,Rainbow Beach,27.1


In [48]:
#Sorting based on a single column - descending order
df[['beach_name','water_temperature']].sort_values('water_temperature', ascending=False)

Unnamed: 0,beach_name,water_temperature
16945,Ohio Street Beach,31.5
19330,Ohio Street Beach,31.4
19327,Ohio Street Beach,31.2
19333,Ohio Street Beach,30.4
19324,Ohio Street Beach,30.0
...,...,...
30542,Osterman Beach,
31199,Ohio Street Beach,
31309,Calumet Beach,
31310,63rd Street Beach,
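
Note that the rows with a missing temperature land at the bottom of the sorted output. A small sketch, on made-up rows, of two `sort_values` options that control this: a list for `ascending` gives a separate direction per column, and `na_position` decides where missing values go.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "beach_name": ["Rainbow Beach", "Calumet Beach", "Calumet Beach", "Rainbow Beach"],
        "water_temperature": [19.8, np.nan, 23.2, 27.1],
    }
)

# Beach name ascending, temperature descending within each beach, missing values first
ordered = toy.sort_values(
    ["beach_name", "water_temperature"],
    ascending=[True, False],
    na_position="first",
)
print(ordered)
```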


<b>4. Filtering the records</b>

We can also write where-style conditions to filter records from the dataframe.

In [32]:
#Get column names
df.columns

Index(['beach_name', 'measurement_timestamp', 'water_temperature', 'turbidity',
       'transducer_depth', 'wave_height', 'wave_period', 'battery_life',
       'measurement_timestamp_label', 'measurement_id'],
      dtype='object')

In [55]:
#Get record counts for each beach
df.beach_name.value_counts()

beach_name
Ohio Street Beach    9343
Calumet Beach        7570
Montrose Beach       7269
Osterman Beach       4023
63rd Street Beach    3420
Rainbow Beach        3298
Name: count, dtype: int64

In [57]:
#Extract records for a particular beach
#df[df.beach_name=='Rainbow Beach']
df[df.beach_name=='Ohio Street Beach'].head()


Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
1,Ohio Street Beach,2016-05-26T13:00:00,14.4,1.23,,0.111,4.0,12.4,2016-05-26T13:00:00,OhioStreetBeach201605261300
278,Ohio Street Beach,2013-09-03T03:00:00,21.9,4.97,1.039,0.241,7.0,9.4,2013-09-03T03:00:00,OhioStreetBeach201309030300
388,Ohio Street Beach,2014-06-05T12:00:00,16.9,1.6,1.78,0.159,3.0,12.8,2014-06-05T12:00:00,OhioStreetBeach201406051200
441,Ohio Street Beach,2014-06-06T14:00:00,18.8,0.7,1.495,0.135,2.0,12.4,2014-06-06T14:00:00,OhioStreetBeach201406061400
444,Ohio Street Beach,2014-06-06T17:00:00,19.8,0.78,1.471,0.162,3.0,12.3,2014-06-06T17:00:00,OhioStreetBeach201406061700


In [58]:
#Few columns and records of dataframe
df[df.beach_name=='Ohio Street Beach'][['beach_name', 'water_temperature']].head()

Unnamed: 0,beach_name,water_temperature
1,Ohio Street Beach,14.4
278,Ohio Street Beach,21.9
388,Ohio Street Beach,16.9
441,Ohio Street Beach,18.8
444,Ohio Street Beach,19.8


In [60]:
#First five records matching more than one condition
filtered_df = df[(df.beach_name=='Ohio Street Beach') & (df.water_temperature > 20)].head()
filtered_df.head()

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
278,Ohio Street Beach,2013-09-03T03:00:00,21.9,4.97,1.039,0.241,7.0,9.4,2013-09-03T03:00:00,OhioStreetBeach201309030300
1555,Ohio Street Beach,2014-06-17T16:00:00,21.1,1.36,1.717,0.159,2.0,11.3,2014-06-17T16:00:00,OhioStreetBeach201406171600
1594,Ohio Street Beach,2014-06-17T14:00:00,20.8,1.3,1.493,0.141,3.0,11.3,2014-06-17T14:00:00,OhioStreetBeach201406171400
1600,Ohio Street Beach,2014-06-17T15:00:00,21.4,0.98,1.688,0.136,2.0,11.3,2014-06-17T15:00:00,OhioStreetBeach201406171500
1606,Ohio Street Beach,2014-06-17T17:00:00,20.5,1.45,1.711,0.131,3.0,11.3,2014-06-17T17:00:00,OhioStreetBeach201406171700


In [61]:
#First five records matching more than one condition, restricted columns
filtered_df = df[(df.beach_name=='Ohio Street Beach') & (df.water_temperature > 20)][['beach_name', 'water_temperature']].head()
filtered_df.head()

Unnamed: 0,beach_name,water_temperature
278,Ohio Street Beach,21.9
1555,Ohio Street Beach,21.1
1594,Ohio Street Beach,20.8
1600,Ohio Street Beach,21.4
1606,Ohio Street Beach,20.5


In [68]:
#all records of dataframe with more than one 'or' condition, restricted columns
#filtered_df = df[(df.beach_name=='Ohio Street Beach') | (df.beach_name=='Calumet Beach')][['beach_name', 'water_temperature']].head()
filtered_df = df[(df.beach_name=='Ohio Street Beach') | (df.beach_name=='Calumet Beach')][['beach_name', 'water_temperature']]
print(filtered_df.head(5))
print(filtered_df.tail(5))

          beach_name  water_temperature
1  Ohio Street Beach               14.4
2      Calumet Beach               23.2
3      Calumet Beach               16.2
6      Calumet Beach               16.3
8      Calumet Beach               16.5
              beach_name  water_temperature
34918  Ohio Street Beach               19.9
34919  Ohio Street Beach               19.8
34920  Ohio Street Beach               22.3
34921  Ohio Street Beach               21.1
34922  Ohio Street Beach               21.3
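
When the `or` conditions all test the same column, `isin` is a compact alternative. A sketch on a toy frame; the `wanted` list is just an illustrative name:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "beach_name": ["Montrose Beach", "Ohio Street Beach", "Calumet Beach", "Rainbow Beach"],
        "water_temperature": [20.3, 14.4, 23.2, 19.8],
    }
)

wanted = ["Ohio Street Beach", "Calumet Beach"]

# .loc takes the row mask and the column list in one step
subset = toy.loc[toy["beach_name"].isin(wanted), ["beach_name", "water_temperature"]]
print(subset)
```

Selecting rows and columns in a single `.loc` call also avoids chaining two separate indexing operations.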


The `query` method lets you write the same where clause as a simple string expression.

In [69]:
df.query('wave_period > 3 and water_temperature < 10')

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
5040,Rainbow Beach,2014-07-25T23:00:00,0.0,0.00,1.645,0.096,9.0,9.5,2014-07-25T23:00:00,RainbowBeach201407252300
6853,Rainbow Beach,2014-07-23T03:00:00,0.0,0.00,1.729,0.537,5.0,11.1,2014-07-23T03:00:00,RainbowBeach201407230300
6859,Rainbow Beach,2014-07-23T04:00:00,0.0,0.00,1.831,0.576,6.0,11.0,2014-07-23T04:00:00,RainbowBeach201407230400
6865,Rainbow Beach,2014-07-23T05:00:00,0.0,0.00,1.778,0.535,6.0,11.0,2014-07-23T05:00:00,RainbowBeach201407230500
6871,Rainbow Beach,2014-07-25T16:00:00,0.0,0.00,1.589,0.105,4.0,9.6,2014-07-25T16:00:00,RainbowBeach201407251600
...,...,...,...,...,...,...,...,...,...,...
9053,Osterman Beach,2015-06-01T06:00:00,9.1,123.39,,0.393,4.0,10.8,2015-06-01T06:00:00,OstermanBeach201506010600
9055,Montrose Beach,2015-06-01T09:00:00,9.4,87.54,,0.435,4.0,11.0,2015-06-01T09:00:00,MontroseBeach201506010900
9060,Montrose Beach,2015-06-01T10:00:00,9.6,89.38,,0.404,5.0,11.0,2015-06-01T10:00:00,MontroseBeach201506011000
9061,Osterman Beach,2015-06-01T10:00:00,9.4,119.88,,0.373,4.0,10.8,2015-06-01T10:00:00,OstermanBeach201506011000
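
A query string can also refer to Python variables by prefixing them with `@`. The sketch below, on toy rows, compares this with the equivalent boolean mask; `max_temp` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "beach_name": ["Rainbow Beach", "Osterman Beach", "Montrose Beach"],
        "water_temperature": [0.0, 9.1, 14.4],
        "wave_period": [9.0, 4.0, 4.0],
    }
)

max_temp = 10  # hypothetical threshold, chosen only for illustration

# @name pulls a Python variable into the query string
with_query = toy.query("wave_period > 3 and water_temperature < @max_temp")

# The same filter written as a boolean mask
with_mask = toy[(toy["wave_period"] > 3) & (toy["water_temperature"] < max_temp)]

print(with_query.equals(with_mask))
```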


<b> 5. Applying groupby on the dataframe</b>

In [70]:
#count(temp) group by beach_name
df.groupby('beach_name')['water_temperature'].count()

beach_name
63rd Street Beach    3419
Calumet Beach        7569
Montrose Beach       7268
Ohio Street Beach    9342
Osterman Beach       4022
Rainbow Beach        3297
Name: water_temperature, dtype: int64

In [71]:
#avg(temp) group by beach_name
df.groupby('beach_name')['water_temperature'].mean()

beach_name
63rd Street Beach    18.459901
Calumet Beach        20.372929
Montrose Beach       18.640534
Ohio Street Beach    20.273603
Osterman Beach       17.933615
Rainbow Beach        18.741250
Name: water_temperature, dtype: float64

In [72]:
#max(temp) group by beach_name
df.groupby('beach_name')['water_temperature'].max()

beach_name
63rd Street Beach    26.1
Calumet Beach        29.2
Montrose Beach       27.0
Ohio Street Beach    31.5
Osterman Beach       25.7
Rainbow Beach        27.1
Name: water_temperature, dtype: float64

In [73]:
#avg(temp) groupby beach_name and wave_period
df.groupby(['beach_name', 'wave_period'])['water_temperature'].mean()

beach_name         wave_period
63rd Street Beach  -100000.0      18.731818
                    2.0           18.535052
                    3.0           18.570020
                    4.0           18.384449
                    5.0           18.309471
                    6.0           18.139510
                    7.0           18.097403
                    8.0           18.285714
                    9.0           17.719231
                    10.0          18.548148
Calumet Beach       2.0           20.747992
                    3.0           20.630416
                    4.0           19.863279
                    5.0           19.561029
                    6.0           20.086461
                    7.0           20.131628
                    8.0           20.004598
                    9.0           20.323596
                    10.0          20.574194
Montrose Beach      1.0           21.969231
                    2.0           18.541403
                    3.0           18.985866
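
The count, mean and max computed in separate cells above can also be requested together with `agg`. A minimal sketch on toy rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "beach_name": ["Calumet Beach", "Calumet Beach", "Montrose Beach", "Montrose Beach"],
        "water_temperature": [23.2, 16.2, 20.3, 14.4],
    }
)

# One pass over the groups, three statistics per beach
summary = toy.groupby("beach_name")["water_temperature"].agg(["count", "mean", "max"])
print(summary)
```

The result is a dataframe with one row per beach and one column per statistic, which is easier to compare than three separate series.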
 

# (B) Data Preparation / Preprocessing
![image.png](attachment:image.png)

The objectives of this step are:

- To improve data quality
- To modify data to better fit a specific data mining technique

<b>The major steps involved are</b>


1) Data cleaning

- Fill in missing values
- Smooth noisy data
- Identify or remove outliers
- Resolve inconsistencies

2) Data integration

- Integration of multiple databases, data cubes, or files

3) Data reduction

- Dimensionality reduction
- Numerosity reduction
- Data compression

4) Data transformation

- Data discretization
- Normalization
- Concept hierarchy generation

<b> 1. Missing values</b>

In [74]:
df.head()

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
0,Montrose Beach,2013-08-30T08:00:00,20.3,1.18,0.891,0.08,3.0,9.4,2013-08-30T08:00:00,MontroseBeach201308300800
1,Ohio Street Beach,2016-05-26T13:00:00,14.4,1.23,,0.111,4.0,12.4,2016-05-26T13:00:00,OhioStreetBeach201605261300
2,Calumet Beach,2013-09-03T16:00:00,23.2,3.63,1.201,0.174,6.0,9.4,2013-09-03T16:00:00,CalumetBeach201309031600
3,Calumet Beach,2014-05-28T12:00:00,16.2,1.26,1.514,0.147,4.0,11.7,2014-05-28T12:00:00,CalumetBeach201405281200
4,Montrose Beach,2014-05-28T12:00:00,14.4,3.36,1.388,0.298,4.0,11.9,2014-05-28T12:00:00,MontroseBeach201405281200


In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34923 entries, 0 to 34922
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   beach_name                   34923 non-null  object 
 1   measurement_timestamp        34917 non-null  object 
 2   water_temperature            34917 non-null  float64
 3   turbidity                    34917 non-null  float64
 4   transducer_depth             10034 non-null  float64
 5   wave_height                  34690 non-null  float64
 6   wave_period                  34690 non-null  float64
 7   battery_life                 34917 non-null  float64
 8   measurement_timestamp_label  34917 non-null  object 
 9   measurement_id               34923 non-null  object 
dtypes: float64(6), object(4)
memory usage: 2.7+ MB


This shows that missing values are present. But how many are there?

In [76]:
df.isnull()

Unnamed: 0,beach_name,measurement_timestamp,water_temperature,turbidity,transducer_depth,wave_height,wave_period,battery_life,measurement_timestamp_label,measurement_id
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
34918,False,False,False,False,True,False,False,False,False,False
34919,False,False,False,False,True,False,False,False,False,False
34920,False,False,False,False,True,False,False,False,False,False
34921,False,False,False,False,True,False,False,False,False,False


This shows the presence or absence of a value in every cell, but that is not very meaningful on its own. What we need is the count of missing values per column.

In [77]:
#Get missing values count per column
df.isnull().sum()

beach_name                         0
measurement_timestamp              6
water_temperature                  6
turbidity                          6
transducer_depth               24889
wave_height                      233
wave_period                      233
battery_life                       6
measurement_timestamp_label        6
measurement_id                     0
dtype: int64
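
With 34923 rows, raw counts are easier to judge as shares: 24889 missing values in transducer_depth is roughly 71% of the column. A sketch of computing the share directly, shown on a toy frame with made-up gaps:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "water_temperature": [20.3, 14.4, np.nan, 16.2],
        "transducer_depth": [0.891, np.nan, np.nan, np.nan],
    }
)

# The mean of a boolean mask is the fraction of True values
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```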

Many columns have missing values. Let's deal with them.

For a numeric variable, the simplest approach is to impute the missing values with the column average.

In [79]:
#Find out the average for "water_temperature"
avg_water_temperature = df["water_temperature"].astype("float").mean(axis = 0)
print("avg_water_temperature : ", avg_water_temperature)

#Fill the missing "water_temperature" values with the average
df["water_temperature"] = df["water_temperature"].fillna(avg_water_temperature)

avg_water_temperature :  19.363387461694877
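
As a quick sanity check of mean imputation, the sketch below uses a small made-up Series (not the beach data). It illustrates that filling with the mean leaves the column mean unchanged but reduces the spread, which is worth keeping in mind for later analysis.

```python
import numpy as np
import pandas as pd

# Made-up temperatures with one missing reading.
temps = pd.Series([20.0, np.nan, 22.0, 18.0], name="water_temperature")

mean_before = temps.mean()          # NaN is skipped by default
std_before = temps.std()

filled = temps.fillna(mean_before)  # returns a new Series, temps is unchanged
mean_after = filled.mean()
std_after = filled.std()

print(mean_before, mean_after)      # the mean is preserved
print(std_before, std_after)        # the standard deviation shrinks
```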


In [None]:
+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓

#It specifies the axis along which the means are computed.
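
The diagram can be checked on a tiny made-up frame: `axis=0` walks down the rows and gives one mean per column, while `axis=1` walks across the columns and gives one mean per row.

```python
import pandas as pd

# Two rows, two columns of made-up numbers.
toy = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 6.0]})

col_means = toy.mean(axis=0)  # one value per column: A and B
row_means = toy.mean(axis=1)  # one value per row: index 0 and 1

print(col_means)
print(row_means)
```

For a single column (a Series), as in the cell above, only `axis=0` is meaningful, so it can also be omitted.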

In [80]:
#Get missing values count per column
df.isnull().sum()

beach_name                         0
measurement_timestamp              6
water_temperature                  0
turbidity                          6
transducer_depth               24889
wave_height                      233
wave_period                      233
battery_life                       6
measurement_timestamp_label        6
measurement_id                     0
dtype: int64

Repeat the same for the other numeric columns: turbidity, wave_height, wave_period and battery_life.

In [81]:
#Replace missing "turbidity" values with the column average
avg_turbidity = df["turbidity"].astype("float").mean(axis = 0)
df["turbidity"] = df["turbidity"].fillna(avg_turbidity)

#Replace missing "wave_height" values with the column average
avg_wave_height = df["wave_height"].astype("float").mean(axis = 0)
df["wave_height"] = df["wave_height"].fillna(avg_wave_height)

#Replace missing "wave_period" values with the column average
avg_wave_period = df["wave_period"].astype("float").mean(axis = 0)
df["wave_period"] = df["wave_period"].fillna(avg_wave_period)

#Replace missing "battery_life" values with the column average
avg_battery_life = df["battery_life"].astype("float").mean(axis = 0)
df["battery_life"] = df["battery_life"].fillna(avg_battery_life)
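
The per-column repetition can also be written in one step. This is a minimal sketch on made-up data, assuming all listed columns are numeric: `fillna` accepts a Series of per-column fill values and matches them by column name.

```python
import numpy as np
import pandas as pd

# Made-up readings with one gap in each column.
toy = pd.DataFrame({
    "turbidity": [1.0, np.nan, 3.0],
    "wave_height": [0.5, 1.5, np.nan],
})

numeric_cols = ["turbidity", "wave_height"]

# toy[numeric_cols].mean() is a Series indexed by column name,
# so each column is filled with its own average.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```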

In [82]:
#Get missing values count per column
df.isnull().sum()

beach_name                         0
measurement_timestamp              6
water_temperature                  0
turbidity                          0
transducer_depth               24889
wave_height                        0
wave_period                        0
battery_life                       0
measurement_timestamp_label        6
measurement_id                     0
dtype: int64

How should we deal with measurement_timestamp, measurement_timestamp_label and transducer_depth?

<b> 2. Dropping duplicate columns </b>

The columns measurement_timestamp and measurement_timestamp_label seem to be the same. Let's verify, and if so, get rid of one of them.

In [83]:
#Compare measurement_timestamp and measurement_timestamp_label 
df['measurement_timestamp'].equals(df['measurement_timestamp_label'])

True
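
Both columns contain missing values, so it is worth noting how `equals` treats them. In this sketch on made-up values, `equals` counts NaNs in the same position as equal, whereas an element-wise `==` does not.

```python
import numpy as np
import pandas as pd

# Two made-up columns with identical content, including a missing value.
a = pd.Series(["2013-08-30 08:00", np.nan])
b = pd.Series(["2013-08-30 08:00", np.nan])

same_by_equals = a.equals(b)              # NaNs in matching positions count as equal
same_elementwise = bool((a == b).all())   # NaN == NaN is False

print(same_by_equals, same_elementwise)
```

This is why `equals` is the safer check here: the element-wise comparison would report a difference purely because of the missing rows.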

In [84]:
#Remove "measurement_timestamp_label" from dataframe
df.drop('measurement_timestamp_label', inplace=True, axis=1)

In [85]:
df.columns

Index(['beach_name', 'measurement_timestamp', 'water_temperature', 'turbidity',
       'transducer_depth', 'wave_height', 'wave_period', 'battery_life',
       'measurement_id'],
      dtype='object')

<b> 3. Converting data types of a column </b>

measurement_timestamp is a datetime column but appears as object in the dataframe, so let's convert it to datetime first.

In [86]:
df.dtypes

beach_name                object
measurement_timestamp     object
water_temperature        float64
turbidity                float64
transducer_depth         float64
wave_height              float64
wave_period              float64
battery_life             float64
measurement_id            object
dtype: object

In [87]:
#Convert measurement_timestamp to datetime
df.measurement_timestamp = pd.to_datetime(df.measurement_timestamp)

In [89]:
#Check again
df.dtypes

beach_name                       object
measurement_timestamp    datetime64[ns]
water_temperature               float64
turbidity                       float64
transducer_depth                float64
wave_height                     float64
wave_period                     float64
battery_life                    float64
measurement_id                   object
dtype: object
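
The conversion above worked without complaint, but real exports sometimes contain malformed timestamps. As a hedged sketch on made-up strings, `errors="coerce"` turns anything unparseable into `NaT` instead of raising, so bad rows can be counted and handled like other missing values.

```python
import pandas as pd

# Made-up raw values: one valid timestamp, one garbage string, one missing.
raw = pd.Series(["2013-08-30 08:00:00", "not a timestamp", None])

parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("unparseable or missing:", parsed.isna().sum())
```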

The measurement_id column looks like it should be numeric but appears as object. Let's explore it.

In [90]:
df.measurement_id.unique()

array(['MontroseBeach201308300800', 'OhioStreetBeach201605261300',
       'CalumetBeach201309031600', ..., 'OhioStreetBeach201709121500',
       'OhioStreetBeach201709121700', 'OhioStreetBeach201709121800'],
      dtype=object)

The values are not numeric: each ID appears to combine the beach name with the measurement timestamp (for example, MontroseBeach201308300800), so the column stays as object.
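
To illustrate the structure of these IDs, here is a minimal sketch that splits them apart. It assumes, based only on the sample values shown above, that every ID is a beach name followed by a 12-digit `YYYYMMDDHHMM` stamp; the column names `beach` and `stamp` are made up for the example.

```python
import pandas as pd

# Sample IDs copied from the output above.
ids = pd.Series(["MontroseBeach201308300800", "OhioStreetBeach201605261300"])

# Lazy name group followed by exactly 12 trailing digits (assumed layout).
parts = ids.str.extract(r"^(?P<beach>.+?)(?P<stamp>\d{12})$")
parts["stamp"] = pd.to_datetime(parts["stamp"], format="%Y%m%d%H%M")

print(parts)
```

Such a split could, for instance, be used to cross-check the ID against the beach name and timestamp columns.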

<b> 4. Dropping rows </b>

We still have missing values in some of the columns.

In [91]:
df.isnull().sum()

beach_name                   0
measurement_timestamp        6
water_temperature            0
turbidity                    0
transducer_depth         24889
wave_height                  0
wave_period                  0
battery_life                 0
measurement_id               0
dtype: int64

Only 6 rows have a missing measurement_timestamp, which is very few compared to the total number of records, so I can afford to drop them.

In [92]:
#Drop missing values for measurement_timestamp and check again
df = df.dropna(subset=['measurement_timestamp'])
df.isnull().sum()

beach_name                   0
measurement_timestamp        0
water_temperature            0
turbidity                    0
transducer_depth         24883
wave_height                  0
wave_period                  0
battery_life                 0
measurement_id               0
dtype: int64

In [93]:
#Rows are reduced by 6
df.shape

(34917, 9)

For transducer_depth, the majority of cells have no value, so is it really an important attribute? This needs to be checked with a domain expert before deciding whether to drop it.

<b> 5. Dropping column </b>

Let's drop the transducer_depth column, as the majority of its values are missing.

In [94]:
#Drop and check
df.drop('transducer_depth', inplace=True, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('transducer_depth', inplace=True, axis=1)
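
The warning above does not stop the column from being dropped, as the next cell confirms. It is likely raised because `df` is now the result of the earlier `dropna` call and is then modified in place. One common way to avoid it, sketched here on made-up data, is to take an explicit copy after filtering and to reassign the result instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a missing timestamp and a mostly empty column.
toy = pd.DataFrame({
    "timestamp": ["2013-08-30 08:00", None, "2013-09-03 16:00"],
    "transducer_depth": [np.nan, np.nan, 1.4],
    "turbidity": [1.18, 1.23, 3.63],
})

# .copy() makes the filtered frame independent of the original.
cleaned = toy.dropna(subset=["timestamp"]).copy()

# Reassigning the result avoids in-place modification altogether.
cleaned = cleaned.drop(columns="transducer_depth")

print(cleaned.columns.tolist())
```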


In [95]:
df.columns

Index(['beach_name', 'measurement_timestamp', 'water_temperature', 'turbidity',
       'wave_height', 'wave_period', 'battery_life', 'measurement_id'],
      dtype='object')

<b> 6. Dropping duplicate rows </b>

Check for duplicate records.

In [96]:
# Selecting duplicate rows except first occurrence based on all columns 
duplicate = df[df.duplicated()] 
duplicate.shape

(0, 8)

In [97]:
df.shape

(34917, 8)
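
No duplicates were found here, so nothing had to be removed. For reference, this sketch on made-up rows shows what would happen otherwise, and how a check restricted to a key column (here `measurement_id`) can flag rows that a full-row check misses.

```python
import pandas as pd

# Made-up rows: one exact duplicate, and one ID repeated with a different value.
toy = pd.DataFrame({
    "measurement_id": ["A1", "A1", "B2", "B2"],
    "turbidity": [1.0, 1.0, 3.0, 3.5],
})

full_row_dupes = toy.duplicated().sum()                       # identical in all columns
id_dupes = toy.duplicated(subset=["measurement_id"]).sum()    # same ID only

deduped = toy.drop_duplicates()  # keeps the first occurrence of each full row

print(full_row_dupes, id_dupes, deduped.shape)
```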

<b> 7. Renaming a column </b>

Some of the columns have very long names. Can they be shortened or renamed?

In [98]:
df.columns

Index(['beach_name', 'measurement_timestamp', 'water_temperature', 'turbidity',
       'wave_height', 'wave_period', 'battery_life', 'measurement_id'],
      dtype='object')

In [99]:
df.rename(columns = {'measurement_timestamp':'timestamp', 'water_temperature':'temp'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns = {'measurement_timestamp':'timestamp', 'water_temperature':'temp'}, inplace=True)


In [100]:
#Check the renamed columns again
df.columns

Index(['beach_name', 'timestamp', 'temp', 'turbidity', 'wave_height',
       'wave_period', 'battery_life', 'measurement_id'],
      dtype='object')
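
By default, `rename` silently ignores keys that do not match any column, so a typo in the mapping goes unnoticed. A small sketch on a made-up frame: passing `errors="raise"` turns such a mismatch into a `KeyError`.

```python
import pandas as pd

toy = pd.DataFrame({
    "measurement_timestamp": ["2013-08-30 08:00"],
    "water_temperature": [20.3],
})

# Valid mapping: both keys exist, so the rename succeeds.
renamed = toy.rename(
    columns={"measurement_timestamp": "timestamp", "water_temperature": "temp"},
    errors="raise",
)

# Misspelled key: with errors="raise" this fails instead of doing nothing.
try:
    toy.rename(columns={"water_temp": "temp"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(renamed.columns.tolist(), typo_caught)
```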

<b> 8. Replacing the cell values of a column </b>

Some of the column values are still categorical (string) in nature; these can be replaced with numeric category codes.

In [101]:
df['beach_name']=df['beach_name'].replace({"Montrose Beach":0, "Calumet Beach":1})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['beach_name']=df['beach_name'].replace({"Montrose Beach":0, "Calumet Beach":1})


In [102]:
df.beach_name.unique()

array([0, 'Ohio Street Beach', 1, '63rd Street Beach', 'Osterman Beach',
       'Rainbow Beach'], dtype=object)

In [103]:
#Map every beach name to a numeric code
df['beach_name']=df['beach_name'].replace({"Montrose Beach":0, "Calumet Beach":1, "Ohio Street Beach":2, "63rd Street Beach":3, "Osterman Beach":4, "Rainbow Beach":5})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['beach_name']=df['beach_name'].replace({"Montrose Beach":0, "Calumet Beach":1, "Ohio Street Beach":2, "63rd Street Beach":3, "Osterman Beach":4, "Rainbow Beach":5})


In [104]:
df.beach_name.unique()

array([0, 2, 1, 3, 4, 5], dtype=int64)
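
Writing the mapping by hand works for six beaches but is easy to get wrong for longer lists. As an alternative sketch (not what the cells above do), `pd.factorize` assigns integer codes automatically in order of first appearance and returns the labels needed to decode them. Note that the resulting codes would differ from the hand-written mapping above.

```python
import pandas as pd

# Made-up sample of beach names.
beaches = pd.Series(
    ["Montrose Beach", "Ohio Street Beach", "Calumet Beach", "Montrose Beach"]
)

# codes: integer array; labels: the unique names, indexed by code.
codes, labels = pd.factorize(beaches)

print(codes)
print(labels)

# Decoding back to names with the labels.
decoded = labels[codes]
```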

In [105]:
df.head()

Unnamed: 0,beach_name,timestamp,temp,turbidity,wave_height,wave_period,battery_life,measurement_id
0,0,2013-08-30 08:00:00,20.3,1.18,0.08,3.0,9.4,MontroseBeach201308300800
1,2,2016-05-26 13:00:00,14.4,1.23,0.111,4.0,12.4,OhioStreetBeach201605261300
2,1,2013-09-03 16:00:00,23.2,3.63,0.174,6.0,9.4,CalumetBeach201309031600
3,1,2014-05-28 12:00:00,16.2,1.26,0.147,4.0,11.7,CalumetBeach201405281200
4,0,2014-05-28 12:00:00,14.4,3.36,0.298,4.0,11.9,MontroseBeach201405281200


<b> 9. Exporting the prepared dataset </b>

In [106]:
%ls

 Volume in drive I is Monika
 Volume Serial Number is 24D4-7A8E

 Directory of I:\BITS-IoT Data Management\Session8-Data Preprocessing lab\Labs\Lab1-  Data preparation Lab

10-09-2023  10:08    <DIR>          .
10-09-2023  10:08    <DIR>          ..
09-09-2023  15:34    <DIR>          .ipynb_checkpoints
10-09-2023  10:08           786,469 10-Sep Lab1 - Beach Water Quality Analysis.ipynb
28-01-2021  08:51         3,769,904 beach_water_quality_automated_sensors_1.csv
07-09-2023  19:47           737,800 DMIoT - Lab1 - Beach Water Quality Analysis -1.ipynb
02-09-2023  20:11           719,293 DMIoT - Lab1 - Beach Water Quality Analysis -1-Copy1.ipynb
28-01-2021  11:36            47,147 DMIoT Lab1-  Data preparation Lab.docx
02-09-2023  21:53         2,801,173 exported_data.csv
               6 File(s)      8,861,786 bytes
               3 Dir(s)  383,577,456,640 bytes free


In [107]:
! del  exported_data.csv

In [108]:
%ls

 Volume in drive I is Monika
 Volume Serial Number is 24D4-7A8E

 Directory of I:\BITS-IoT Data Management\Session8-Data Preprocessing lab\Labs\Lab1-  Data preparation Lab

10-09-2023  10:09    <DIR>          .
10-09-2023  10:09    <DIR>          ..
09-09-2023  15:34    <DIR>          .ipynb_checkpoints
10-09-2023  10:08           786,469 10-Sep Lab1 - Beach Water Quality Analysis.ipynb
28-01-2021  08:51         3,769,904 beach_water_quality_automated_sensors_1.csv
07-09-2023  19:47           737,800 DMIoT - Lab1 - Beach Water Quality Analysis -1.ipynb
02-09-2023  20:11           719,293 DMIoT - Lab1 - Beach Water Quality Analysis -1-Copy1.ipynb
28-01-2021  11:36            47,147 DMIoT Lab1-  Data preparation Lab.docx
               5 File(s)      6,060,613 bytes
               3 Dir(s)  383,580,258,304 bytes free


In [109]:
df.to_csv("exported_data.csv")
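
Called this way, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that is not wanted, `index=False` leaves it out. The sketch below uses a made-up frame and an in-memory buffer instead of a file to show the round trip.

```python
import io

import pandas as pd

# Made-up prepared data.
toy = pd.DataFrame({"beach_name": [0, 2], "temp": [20.5, 14.5]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # omit the row index from the output
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Datetime columns come back as plain strings unless `parse_dates` is passed to `read_csv`.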

In [110]:
%ls

 Volume in drive I is Monika
 Volume Serial Number is 24D4-7A8E

 Directory of I:\BITS-IoT Data Management\Session8-Data Preprocessing lab\Labs\Lab1-  Data preparation Lab

10-09-2023  10:10    <DIR>          .
10-09-2023  10:10    <DIR>          ..
09-09-2023  15:34    <DIR>          .ipynb_checkpoints
10-09-2023  10:10           791,626 10-Sep Lab1 - Beach Water Quality Analysis.ipynb
28-01-2021  08:51         3,769,904 beach_water_quality_automated_sensors_1.csv
07-09-2023  19:47           737,800 DMIoT - Lab1 - Beach Water Quality Analysis -1.ipynb
02-09-2023  20:11           719,293 DMIoT - Lab1 - Beach Water Quality Analysis -1-Copy1.ipynb
28-01-2021  11:36            47,147 DMIoT Lab1-  Data preparation Lab.docx
10-09-2023  10:10         2,801,173 exported_data.csv
               6 File(s)      8,866,943 bytes
               3 Dir(s)  383,577,452,544 bytes free
