# MEI Data Science Taught Course - Lesson 9 - Putting it all together

## Table of Contents

* [Introduction](#intro)
 - [Importing libraries and data](#import)
* [Exploratory Data Analysis (1)](#eda1)
* [Preprocessing and Preparing (1)](#preprocessing1)
* [Exploratory Data Analysis (2)](#eda2)
* [Preprocessing and Preparing (2)](#preprocessing2)
 - [Feature engineering](#feature2)
* [Exploratory Data Analysis (3)](#eda3)
* [Preprocessing and Preparing (3)](#preprocessing3)
 - [Feature engineering](#feature3)
* [Exploratory Data Analysis (4)](#eda4)

<a id='intro'></a>
## Introduction

Oxygen saturation is a measure of the amount of dissolved oxygen in water. Sufficient dissolved oxygen is essential for aquatic environments to support fish and other aerobic organisms.

The data in this notebook is taken from CalCOFI (California Cooperative Oceanic Fisheries Investigations). Various features, including oxygen saturation, have been recorded at a number of Pacific Ocean sites between 1959 and 2019.

<a id='import'></a>
### Importing libraries and data

In [1]:
# import pandas and seaborn
import pandas as pd
import seaborn as sns

In [1]:
# import linear regression model
from sklearn.linear_model import LinearRegression
# import linear regression metrics
from sklearn.metrics import r2_score, mean_squared_error
# import train/test split function
from sklearn.model_selection import train_test_split

In [1]:
# import the data
water_data = pd.read_csv('/kaggle/input/water-temperature/water.csv')

# display the data to check it has imported
water_data

In [1]:
water_data.info()

<a id='eda1'></a>
## Exploratory Data Analysis (1)

In [1]:
water_data['O2ml_L'].describe()

It doesn't seem possible to have negative values of oxygen saturation. Where are they coming from?

In [1]:
water_data[water_data['O2ml_L'] < 0]

<a id='preprocessing1'></a>
## Preprocessing and Preparing (1)

These values look anomalous. The code below takes a slice that excludes rows with negative `O2ml_L`.

In [1]:
# create a copy with only those values >=0
water_data = water_data[water_data['O2ml_L'] >= 0].copy()

# use describe to check
water_data['O2ml_L'].describe()
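
One side effect of this kind of filter is worth knowing: a comparison with a missing value evaluates to `False`, so `>= 0` removes rows where `O2ml_L` is missing as well as rows where it is negative. The sketch below shows this on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the CalCOFI data).

```python
import numpy as np
import pandas as pd

# toy stand-in for water_data; the values are made up for illustration
toy = pd.DataFrame({'O2ml_L': [5.2, -0.1, np.nan, 0.0, 3.4]})

# comparisons with NaN evaluate to False, so this mask drops missing values too
mask = toy['O2ml_L'] >= 0
kept = toy[mask].copy()

# count each kind of dropped row so the effect of the filter is explicit
n_negative = int((toy['O2ml_L'] < 0).sum())
n_missing = int(toy['O2ml_L'].isna().sum())
print(f'kept {len(kept)} of {len(toy)} rows '
      f'({n_negative} negative, {n_missing} missing)')
```

Running the same two counts on `water_data` before filtering would show how many rows are lost for each reason.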

<a id='eda2'></a>
## Exploratory Data Analysis (2)
The aim is to build a model for `O2ml_L`. You can explore all the possible numerical input features with a pairplot. 

In [1]:
# define a list of features and a string for the target
features = ['T_degC', 'PO4uM', 'SiO3uM', 'NO2uM', 'NO3uM', 'Salnty']
target = 'O2ml_L'

# warning: pairplot can be very slow on a dataset of this size, avoid hue
sns.pairplot(data=water_data, x_vars=features, y_vars=target);
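
If the pairplot is too slow, one option is to plot a random sample of the rows instead. The sketch below draws a reproducible sample with `DataFrame.sample`; the DataFrame `toy`, its random values and the sample size of 1,000 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# toy stand-in for water_data with made-up values; only the column names match
rng = np.random.default_rng(0)
cols = ['T_degC', 'PO4uM', 'SiO3uM', 'NO2uM', 'NO3uM', 'Salnty', 'O2ml_L']
toy = pd.DataFrame(rng.normal(size=(10_000, len(cols))), columns=cols)

# draw a reproducible random subset; 1_000 rows is an arbitrary choice
plot_sample = toy.sample(n=1_000, random_state=0)
print(plot_sample.shape)
```

The sample can then be passed to `sns.pairplot` as `data=plot_sample`. A random sample may miss rare extreme points, so it is a complement to the full plot when looking for unusual values, not a replacement.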

There are some large values for `NO2uM`. You can explore these by taking a closer look at `NO2uM` versus `O2ml_L`.

In [1]:
sns.relplot(data=water_data, x='NO2uM', y='O2ml_L', aspect=2);

There are a few points with high `NO2uM` and very low `O2ml_L`.

In [1]:
water_data[water_data['NO2uM'] > 3]
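
Rather than reading station IDs off the displayed rows, you could count how many of the high `NO2uM` rows come from each station. This is a minimal sketch on made-up data: apart from the ID `081.8 046.9`, the station ID and all values in `toy` are assumptions.

```python
import pandas as pd

# toy stand-in for water_data; '090.0 030.0' and all values are made up
toy = pd.DataFrame({
    'Sta_ID': ['081.8 046.9', '081.8 046.9', '090.0 030.0',
               '090.0 030.0', '081.8 046.9'],
    'NO2uM': [4.1, 3.6, 0.2, 0.1, 0.3],
    'O2ml_L': [0.1, 0.2, 5.5, 5.8, 4.9],
})

# count how many of the high-NO2uM rows come from each station
high_no2 = toy[toy['NO2uM'] > 3]
station_counts = high_no2['Sta_ID'].value_counts()
print(station_counts)
```

The same `value_counts` call on the negative `O2ml_L` rows, run before they are filtered out, would show whether the two sets of unusual values share a station.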

<a id='preprocessing2'></a>
## Preprocessing and Preparing (2)

Station `081.8 046.9` seems to be responsible for these unusual values. This is the same station that had some negative `O2ml_L` values. Ideally you would want to find out more about this station and whether there is a reason for these extreme values; in the absence of that information, further exploratory data analysis will be helpful.

<a id='feature2'></a>
### Feature engineering

In [1]:
# flag rows from station 081.8 046.9 in a new Yes/No column called Sta_X
water_data['Sta_X'] = (water_data['Sta_ID'] == '081.8 046.9').map({True: 'Yes', False: 'No'})
water_data

<a id='eda3'></a>
## Exploratory Data Analysis (3)

How do this station's `NO2uM` readings compare to those of other stations?

In [1]:
water_data.groupby('Sta_X')['NO2uM'].describe()

What about the distribution of `O2ml_L` values?

In [1]:
# stat='density', common_norm=False gives density relative to the group totals
# compare to
# sns.displot(data=water_data, x='O2ml_L', col='Sta_X');
sns.displot(data=water_data, x='O2ml_L', col='Sta_X', stat='density', common_norm=False);
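
To see what normalising each group separately means, the sketch below computes a density histogram for each `Sta_X` group with NumPy and checks that each one has total area 1, however many rows the group contains. The DataFrame `toy`, its values and the choice of 20 bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for water_data: a large 'No' group and a small 'Yes' group
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Sta_X': ['No'] * 900 + ['Yes'] * 100,
    'O2ml_L': np.concatenate([rng.normal(5, 1, 900),
                              rng.normal(1, 0.5, 100)]),
})

# shared bin edges covering the full range of the data (20 bins, arbitrary)
bins = np.linspace(toy['O2ml_L'].min(), toy['O2ml_L'].max(), 21)
widths = np.diff(bins)

# normalise each group on its own, then total the area under its histogram
areas = {}
for label, group in toy.groupby('Sta_X'):
    density, _ = np.histogram(group['O2ml_L'], bins=bins, density=True)
    areas[label] = float((density * widths).sum())
print(areas)
```

Because both histograms have the same total area, their shapes can be compared directly even though one group is much smaller than the other.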

<a id='preprocessing3'></a>
## Preprocessing and Preparing (3)
The values for this station are noticeably different from the rest of the data. A new data set can be produced with this station excluded.

In [1]:
# take a slice that filters out station 081.8 046.9
clean_data = water_data[water_data['Sta_ID'] != "081.8 046.9"].copy()
clean_data

<a id='eda4'></a>
## Exploratory Data Analysis (4)

Continue to explore the data, looking at other features using appropriate charts and measures.

You might like to consider grouping by `Zone`.


In [1]:
clean_data['Zone'].value_counts()

In [1]:
# box plots of O2ml_L grouped by Zone
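
As one possible starting point, the quartiles that a box plot draws can also be computed per group with `groupby` and `describe`. The sketch below uses a made-up DataFrame; the zone labels `A` and `B` and all values are assumptions, not the actual contents of `clean_data`.

```python
import pandas as pd

# toy stand-in for clean_data; zone labels and values are made up
toy = pd.DataFrame({
    'Zone': ['A', 'A', 'A', 'B', 'B', 'B'],
    'O2ml_L': [5.0, 5.5, 6.0, 2.0, 3.0, 4.0],
})

# quartiles of O2ml_L within each zone
zone_summary = toy.groupby('Zone')['O2ml_L'].describe()[['25%', '50%', '75%']]
print(zone_summary)
```

For the chart itself, a call along the lines of `sns.boxplot(data=clean_data, x='Zone', y='O2ml_L')` is one way to draw the box plots.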


You could also explore the data based on the year of the sample.

In [1]:
# create a new column Year based on the date feature
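
A minimal sketch of one way to do this is shown below. It assumes the sample date is stored as text in a column called `Date` in month/day/year format; both the column name and the format are hypothetical, so check `clean_data.info()` for the actual name and adjust the `format` string to match.

```python
import pandas as pd

# toy stand-in for clean_data; the 'Date' column name and format are assumptions
toy = pd.DataFrame({
    'Date': ['03/01/1959', '07/15/1984', '11/30/2019'],
    'O2ml_L': [5.1, 4.8, 5.3],
})

# parse the text dates, then keep only the year as an integer column
toy['Year'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y').dt.year
print(toy['Year'].tolist())
```

With a `Year` column in place, grouping by `Year` works in the same way as grouping by `Sta_X` or `Zone`.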