## Is it a good day to go out snorkelling?

## Stakeholders: Dive Buddies

The stakeholder is a diving social club whose members hang out in the Chicago waterfront region.

In [6]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

In [9]:
# load and import dataset using pandas

df = pd.read_csv('data/bwq.csv',parse_dates=True)

In [10]:
# Read first 5 lines of dataframe
df.head()

| Unnamed: 0 | Beach Name | Measurement Timestamp | Water Temperature | Turbidity | Transducer Depth | Wave Height | Wave Period | Battery Life | Measurement Timestamp Label | Measurement ID |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Montrose Beach | 08/30/2013 08:00:00 AM | 20.3 | 1.18 | 0.891 | 0.08 | 3.0 | 9.4 | 8/30/2013 8:00 AM | MontroseBeach201308300800 |
| 1 | Ohio Street Beach | 05/26/2016 01:00:00 PM | 14.4 | 1.23 | NaN | 0.111 | 4.0 | 12.4 | 05/26/2016 1:00 PM | OhioStreetBeach201605261300 |
| 2 | Calumet Beach | 09/03/2013 04:00:00 PM | 23.2 | 3.63 | 1.201 | 0.174 | 6.0 | 9.4 | 9/3/2013 4:00 PM | CalumetBeach201309031600 |
| 3 | Calumet Beach | 05/28/2014 12:00:00 PM | 16.2 | 1.26 | 1.514 | 0.147 | 4.0 | 11.7 | 5/28/2014 12:00 PM | CalumetBeach201405281200 |
| 4 | Montrose Beach | 05/28/2014 12:00:00 PM | 14.4 | 3.36 | 1.388 | 0.298 | 4.0 | 11.9 | 5/28/2014 12:00 PM | MontroseBeach201405281200 |


In [4]:
# view data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34923 entries, 0 to 34922
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Beach Name                   34923 non-null  object 
 1   Measurement Timestamp        34917 non-null  object 
 2   Water Temperature            34917 non-null  float64
 3   Turbidity                    34917 non-null  float64
 4   Transducer Depth             10034 non-null  float64
 5   Wave Height                  34690 non-null  float64
 6   Wave Period                  34690 non-null  float64
 7   Battery Life                 34917 non-null  float64
 8   Measurement Timestamp Label  34917 non-null  object 
 9   Measurement ID               34923 non-null  object 
dtypes: float64(6), object(4)
memory usage: 2.7+ MB


In [None]:
# view summary statistics of numerical columns
df.describe()

Strictly speaking, wave height is the vertical distance between a wave's trough and its crest, while amplitude is the distance from the still water level (usually taken as zero) to the crest. Displacement above and below the still water level is recorded as positive and negative, respectively. Could that be the reason we have negative values in the 'Wave Height' and 'Wave Period' columns?

On closer inspection, that explanation does not hold: a height measured from trough to crest cannot be negative, and neither can a period, which is a duration. The minimum values in these two columns also look extreme. They could be clerical errors, or placeholders for unknown readings, as was common practice before NaN markers were widely used.

It does seem that -99999.992 and -100000.0 were entered as arbitrary fill values in the 'Wave Height' and 'Wave Period' columns, respectively.

Also, the df.info() output shows that the 'Measurement Timestamp' column is stored as an object dtype rather than as a datetime.
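One way to handle placeholder readings like these is to convert them to NaN so that pandas leaves them out of summary statistics. The sketch below runs on a small made-up frame, not on the real file; the `SENTINEL_CUTOFF` name and its value of -99999 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the real file (values are illustrative only)
sample = pd.DataFrame({
    'Wave Height': [0.08, -99999.992, 0.174],
    'Wave Period': [3.0, -100000.0, 6.0],
})

# Assumed cutoff: anything at or below it is treated as a placeholder
SENTINEL_CUTOFF = -99999

# mask() replaces entries where the condition is True with NaN
cleaned = sample.mask(sample <= SENTINEL_CUTOFF)

print(cleaned)
print(cleaned.min())
```

With the placeholders masked, the column minimums reflect actual readings instead of the fill values.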

In [None]:
len(df[df['Wave Height'] < 0]) / len(df)

These placeholder values account for a little less than 2% of the dataset. For now I will note this and decide how to handle this subset as I keep exploring the data and looking at the trends.
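The same share can be checked for several columns at once, since the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on made-up numbers; the `sample` frame and `negative_share` name are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real file (values are illustrative only)
sample = pd.DataFrame({
    'Wave Height': [0.08, -99999.992, 0.174, 0.147],
    'Wave Period': [3.0, -100000.0, 6.0, 4.0],
    'Transducer Depth': [0.891, 1.201, -0.082, 1.514],
})

# Fraction of negative readings per column
negative_share = (sample < 0).mean()

print(negative_share)
```

Missing values compare as `False` here, so they do not count toward the negative share.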

In [None]:
df[df['Transducer Depth'] < 0]

These two readings have negative 'Transducer Depth' values, which seems odd. As with the wave height measurements above, I will note these odd values and decide whether they are usable as I keep exploring the dataset.

In [None]:
df[df['Turbidity'] > 100]

In [None]:
# Collect the unique beach names as a plain list
beach_names = df['Beach Name'].unique().tolist()

beach_names

In [None]:
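# Drop rows holding the placeholder values, drop the sparsely populated
# 'Transducer Depth' column, then drop any remaining rows with missing values
df = df[(df['Wave Height'] != -99999.992) &
        (df['Wave Period'] != -100000.0)].drop(columns='Transducer Depth').dropna()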

In [None]:
df

In [None]:
# Convert the timestamp strings to datetimes and use them as the index
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
df = df.set_index('Measurement Timestamp')
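The timestamps shown in the `df.head()` output follow a month/day/year layout with a 12-hour clock, so an explicit `format` can be passed to `pd.to_datetime`. The sketch below uses two of those sample strings; the format string is an assumption inferred from the rows shown and may not hold for every row in the file.

```python
import pandas as pd

# Two timestamp strings copied from the df.head() output
raw = pd.Series(['08/30/2013 08:00:00 AM', '05/26/2016 01:00:00 PM'])

# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

print(parsed)
print(parsed.dt.hour.tolist())
```

An explicit format avoids ambiguity between day-first and month-first dates and raises an error on strings that do not match.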


In [None]:
df.Turbidity.max()

In [None]:
df.hist(figsize=(12,8),bins=10);

## Binning the turbidity level

In [None]:
# Define bins
bins = [0, 5, 35, 50, 2000]

In [None]:
# Define the corresponding category for each bin
Levels = [0, 1, 2, 3]

In [None]:
# create a new column mapping water turbidity to a level
df['Turbidity_level'] = pd.cut(df['Turbidity'],bins,labels=Levels,ordered=True,include_lowest=True)
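To see how the bin edges are assigned, here is a small sketch that applies the same `bins` and labels to a handful of made-up turbidity values. The `turbidity` series and `binned` name are illustrative only.

```python
import pandas as pd

bins = [0, 5, 35, 50, 2000]
levels = [0, 1, 2, 3]

# Made-up readings chosen to sit on and around the bin edges
turbidity = pd.Series([0.0, 1.18, 5.0, 5.01, 35.0, 49.9, 120.0])

# Bins are closed on the right; include_lowest=True also admits the value 0
binned = pd.cut(turbidity, bins, labels=levels, ordered=True, include_lowest=True)

print(binned.tolist())
```

A reading equal to an edge falls into the lower bin, so 5.0 maps to level 0 and 35.0 to level 1. Values outside the range 0 to 2000 are returned as NaN.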

In [None]:
df

In [None]:
df.isna().any()

In [None]:
df.Turbidity_level.value_counts()

In [None]:
df.describe()