# Linear Regression Analysis
Analyse a dataset of oceanographic data to answer the following questions (as posted [here](https://www.kaggle.com/rtatman/datasets-for-regression-analysis)):

1. Is there a relationship between water salinity & water temperature?
2. Can you predict the water temperature based on salinity?

## Imports
Perform all required imports and set constants

In [None]:
# Data Handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from mlxtend.preprocessing import minmax_scaling
from sklearn.linear_model import LinearRegression

# Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# File paths to datasets
PATH_DATA_BOTTLES = '/kaggle/input/calcofi/bottle.csv'
PATH_DATA_CAST = '/kaggle/input/calcofi/cast.csv'

print('Completed imports')

# Load Data
Load the dataset into a pandas DataFrame. Only the two columns of interest (salinity and temperature) are read.

In [None]:
df_bottles_raw = pd.read_csv(PATH_DATA_BOTTLES, usecols=['Salnty', 'T_degC'])

print('Completed loading data')

# Exploration
Print some information about the data, including:

1. Simple description
2. Samples (head and tail)
3. Number of missing values

In [None]:
print(f'Bottles Description:\n{df_bottles_raw.describe()}')
print(f'\nBottles Head:\n{df_bottles_raw.head()}')
print(f'\nBottles Tail:\n{df_bottles_raw.tail()}')

n_elements = df_bottles_raw.size  # rows * columns; np.product was removed in NumPy 2.0
s_nans = df_bottles_raw.isnull().sum()

print(f'\nNumber of missing entries for both columns:\n{s_nans}')

n_nans = s_nans.sum()

print(f'\nOut of {n_elements} total elements, {n_nans} are null. This corresponds to {n_nans / n_elements:.2%}.')

## Results
We find the temperature to vary from as cold as `1.4` up to `31.1` degrees Celsius with a standard deviation of `4.2`. In contrast, the salinity remains more stable with a standard deviation of only `0.46`, that is, most values are close to the overall mean of `33.84`.

## Handle Missing Values
In total, `58,317` out of `1,729,726` values are missing, which corresponds to `3.37%`. Most of these missing values are salinity (`47,354` compared to `10,963` for temperature). Luckily, these values are ordered by time, so we can fill the entries using a backfill approach. That is, we replace each missing entry with the next valid one.

In [None]:
# fillna(method='backfill') is deprecated in recent pandas versions; bfill() is the equivalent call
df_bottles_nan = df_bottles_raw.bfill()

n_nans_new = df_bottles_nan.isnull().sum().sum()
print(f'A total of {n_nans_new} entries are missing.')

# Visualization
Plot the joint distribution of both series.

In [None]:
sns.jointplot(x='Salnty', y='T_degC', data=df_bottles_nan, s=5)

## Results
We find some vertical lines which could be due to the insertion of missing values. We test this hypothesis by removing all rows with missing values instead and producing the same plot.

In [None]:
df_bottles_nan = df_bottles_raw.dropna()
sns.jointplot(x='Salnty', y='T_degC', data=df_bottles_nan, s=5)

## Results (continued)
The artificial vertical lines are gone. It is likely that missing values are caused by an outage over an extended time period rather than by separate isolated occasions. This leads to runs of consecutive NaN values in the dataset. Applying the backfill method (as was done before) then fills each of these NaN runs with the same salinity value. It is strange, however, that these runs seem to stretch over large temperature ranges, hence the vertical lines.
However, this finding could be biased by the visualization: for similar runs whose temperatures are closer to each other, the points of the scatter plot might simply overlap. To be on the safe side, we continue with the dataset that has the rows containing missing values removed.
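
The backfill effect described here can be reproduced on a tiny example. This is only a sketch with made-up numbers (the `df_toy` frame is hypothetical and not part of the CalCOFI data): a run of consecutive salinity gaps is filled with one value while the temperature keeps changing, which is exactly what shows up as a vertical line in the scatter plot.

```python
import numpy as np
import pandas as pd

# Toy data (made up): salinity has a run of three consecutive NaNs
# while the temperature keeps changing over the same rows.
df_toy = pd.DataFrame({
    'Salnty': [33.4, np.nan, np.nan, np.nan, 34.1, 33.9],
    'T_degC': [10.5, 12.0, 15.5, 18.0, 9.0, 8.5],
})

# Backfill: every gap receives the next valid value below it
df_filled = df_toy.bfill()
print(df_filled)

# All rows of the former gap now share one salinity but span several degrees
run = df_filled[df_filled['Salnty'] == 34.1]
print(f"rows sharing Salnty=34.1: {len(run)}")
print(f"temperature range: {run['T_degC'].min()} to {run['T_degC'].max()}")
```

Dropping the rows instead (`df_toy.dropna()`) would keep only the three fully observed rows, so no such column of points can appear.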

As found before, the distribution of the salinity values is much more centered. In particular, we can observe two spikes, whereas the temperature is more spread across the entire value range.

# Correlation
There seems to be a pattern in the scatter plot. Let's see whether both values are correlated.

In [None]:
df_bottles_nan.corr()

# Prediction
There seems to be a significant (negative) correlation between the two attributes. This means that low values of one measure are usually met by high values of the respective other.
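
For intuition, here is a minimal sketch of what the correlation coefficient measures. `DataFrame.corr()` uses the Pearson coefficient by default; the toy arrays below are made up and only illustrate a negative relationship, they are not taken from the dataset.

```python
import numpy as np

# Toy values (made up) where one measure falls as the other rises
x = np.array([33.0, 33.5, 34.0, 34.5, 35.0])
y = np.array([18.0, 15.0, 11.0, 8.0, 5.0])

# Pearson r: co-variation of x and y, normalized by the spread of each
x_c, y_c = x - x.mean(), y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# NumPy's built-in correlation matrix should agree with the manual value
r_numpy = np.corrcoef(x, y)[0, 1]

print(f'manual: {r_manual:.4f}, numpy: {r_numpy:.4f}')
```

A value close to `-1` means the points lie almost on a falling line, a value near `0` means no linear relationship.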

## Scaling
Before building a regression model, we first min-max-scale the data. Scaling mainly helps iterative, gradient-based solvers converge faster; scikit-learn's `LinearRegression` solves the least-squares problem directly, so this might be a bit over-cooked here, but - afaik - it doesn't hurt either.
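
For reference, min-max scaling is just a linear map of each column onto the range `[0, 1]`. The sketch below writes it out with NumPy; `minmax_scale` is a hypothetical helper for illustration (not the mlxtend function), and the sample temperatures are made up.

```python
import numpy as np

def minmax_scale(values):
    """Map values linearly onto [0, 1] (illustrative helper, not mlxtend's)."""
    v_min, v_max = values.min(), values.max()
    return (values - v_min) / (v_max - v_min)

temps = np.array([1.4, 10.0, 16.25, 31.1])  # made-up sample temperatures
scaled = minmax_scale(temps)
print(scaled)

# The map is invertible as long as the original min and max are kept
restored = scaled * (temps.max() - temps.min()) + temps.min()
print(restored)
```

Because the inverse needs the original minimum and maximum, it makes sense to store them right after scaling.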

In [None]:
df_bottles_scaled = minmax_scaling(df_bottles_nan, columns=df_bottles_nan.columns)

min_temp, max_temp = df_bottles_nan['T_degC'].min(), df_bottles_nan['T_degC'].max()
min_sal, max_sal = df_bottles_nan['Salnty'].min(), df_bottles_nan['Salnty'].max()

df_bottles_scaled.describe()

## Result
Both value ranges are now within `0.0` and `1.0`. Time to construct the model!

In [None]:
X = df_bottles_scaled['Salnty'].values.reshape(-1, 1) # reshape required due to single feature
y = df_bottles_scaled['T_degC'].values

# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create model
model = LinearRegression()

# Fit model on training data
model.fit(X_train, y_train)

# Retrieve intercept and coefficient
intercept = model.intercept_
coeff = model.coef_[0]

# Print result
print(f'Fitted model:\n\tintercept: {model.intercept_}\n\tcoefficient: {model.coef_[0]}\n\tscore: {model.score(X_test, y_test)}')
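
With a single feature, the least-squares line has a simple closed form, which may help to see what the fit computes. The following is a sketch on synthetic data (the generated `x` and `y` are assumptions for illustration, not the CalCOFI values): the slope is the covariance of feature and target divided by the variance of the feature, and the intercept makes the line pass through the means.

```python
import numpy as np

# Synthetic data (assumption): a falling line plus a bit of noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 0.8 - 0.5 * x + rng.normal(0.0, 0.05, size=200)

# Closed-form simple linear regression
x_c = x - x.mean()
slope = (x_c * (y - y.mean())).sum() / (x_c ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(f'closed form: slope={slope:.4f}, intercept={intercept:.4f}')
print(f'np.polyfit:  slope={slope_ref:.4f}, intercept={intercept_ref:.4f}')
```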


## Validation
We start by plotting the resulting line (simple since we only consider one feature).

In [None]:
sns.scatterplot(x=X_train[:, 0], y=y_train)
plt.xlabel('salinity')
plt.ylabel('temperature')

ticks_x = np.linspace(0., 1., 1000)
ticks_y = intercept + (coeff * ticks_x)

sns.lineplot(x=ticks_x, y=ticks_y)  # recent seaborn versions require keyword arguments for x and y

## Results
This looks like a poor fit. However, this can be caused by a visualization problem. In fact, most data points are along the derived line (compare the distribution plots). It's just that most of these points overlap. Let's apply some metrics to further analyze the goodness of fit.

The same result could have been derived using the `kind='reg'` option in seaborn's `sns.jointplot` function call, but I'm here to learn ;)

In [None]:
score_r2 = model.score(X_test, y_test)

# Predict values using result
y_pred = model.predict(X_test)
# y_pred = intercept + (coeff * X_test) # equivalent

# Consider absolute error since we only have one dimension at this point
score_mae = mean_absolute_error(y_test, y_pred)
print(f'Scoring\n\tR^2: {score_r2}\n\tMean Absolute Error: {score_mae}')
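
Both metrics are short enough to write out by hand. The sketch below uses made-up target and prediction arrays (`y_true` and `y_hat` are illustrative names, not the notebook's variables): R^2 compares the squared residuals of the model with those of always predicting the mean, and the mean absolute error is the average size of the residuals.

```python
import numpy as np

# Made-up scaled targets and predictions
y_true = np.array([0.30, 0.45, 0.50, 0.62, 0.80])
y_hat = np.array([0.35, 0.40, 0.55, 0.60, 0.70])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_true - y_hat) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

# Mean absolute error: average magnitude of the residuals
mae = np.abs(y_true - y_hat).mean()

print(f'R^2: {r2:.4f}, MAE: {mae:.4f}')
```

An R^2 of `1.0` would be a perfect fit, `0.0` is no better than predicting the mean, and the error is expressed in the (here scaled) units of the target.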

### Results
On average, roughly `25%` of the variance in temperature of unseen samples can be explained by using the respective salinity as a feature. This leads to a mean absolute error of only `0.09` on the scaled data. Let's scale the data back to its original range in order to see to what temperature this error corresponds.

In [None]:
def rescale(val):
    """Inverse function of min-max scaling. The min-max implementation does - afaik - not offer this."""
    # min/max values are constant as we only need to rescale the temperature
    return (val * (max_temp - min_temp)) + min_temp

# Rescale the test and predicted temperatures values
y_test_r = rescale(y_test)
y_pred_r = rescale(y_pred)

# Compute and print absolute error
score_mae_r = mean_absolute_error(y_test_r, y_pred_r)
print(f'Scoring\n\tR^2: {score_r2}\n\tMean Absolute Error: {score_mae_r}')

# Results
On average, our model, even though quite simple, misses the temperature by only `2.68` degrees Celsius when tested on unseen samples. This is a surprisingly good result as we only considered a single feature (namely the salinity). For improvements, one could extend the model by searching for other useful features in the data.
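
As a rough sanity check, the rescaled error can also be estimated without rescaling every prediction: the inverse of min-max scaling is linear, so the mean absolute error simply gets multiplied by the temperature range. The sketch below assumes the extremes `1.4` and `31.1` from the description of the raw data (they may differ slightly after dropping rows) and the rounded scaled error of `0.09`, so it can only match the reported value approximately.

```python
import numpy as np

t_min, t_max = 1.4, 31.1   # assumed extremes, taken from the raw data description
mae_scaled = 0.09          # rounded scaled error reported above

# The offset t_min cancels in differences, only the range remains
mae_degc = mae_scaled * (t_max - t_min)
print(f'Approximate error in degrees Celsius: {mae_degc:.2f}')

# Check the linearity claim on made-up scaled values
y_true_s = np.array([0.2, 0.5, 0.9])
y_pred_s = np.array([0.3, 0.4, 0.7])
to_degc = lambda v: v * (t_max - t_min) + t_min

mae_s = np.abs(y_true_s - y_pred_s).mean()
mae_r = np.abs(to_degc(y_true_s) - to_degc(y_pred_s)).mean()
print(np.isclose(mae_r, mae_s * (t_max - t_min)))
```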