# Linear Correlations and Other Analysis

In [8]:
# importing essential packages
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression
from scipy import stats
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error

#loading dataset
sardine_data = pd.read_csv("data/sardine_data.csv")
sardine_data2 = pd.read_csv("data/lagged_sardine_data.csv")
sardine_data = sardine_data.rename(columns={"Sardine Larvae lbs": "Count"})
sardine_data['Count'] = sardine_data['Count'].div(4).round(0)
sardine_data2 = sardine_data2.rename(columns={"Sardine Larvae lbs": "Count"})
sardine_data2['Count'] = sardine_data2['Count'].div(4).round(0)

We want to see whether there is any relationship between sardine larvae and sardine catch. Our hypothesis is that the two variables are linearly correlated: the more sardine larvae found in a given year, the more sardines we expect to be caught that year, and vice versa. We can visualize this as follows:

In [9]:
X = sardine_data['CatchLbs'].values.reshape(-1,1)
Y = sardine_data['Count'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)
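# take the square root explicitly; the `squared=False` argument is no longer accepted by recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(Y, Y_pred))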

fig = px.scatter(sardine_data, x='CatchLbs', y='Count', trendline="ols", title ='Sardine Larvae lbs vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y), "Root Mean Squared Error: ", rmse)

Pearson Correlation: (0.5965270542641967, 0.0002483988771216737) Root Mean Squared Error:  1462.0273633291702


The first numeric result is a Pearson correlation of about 0.5965, which indicates a moderate positive linear correlation between sardine catch and sardine larvae. One plausible interpretation is that a larger catch in a given year reflects a larger adult population in the ocean, which can in turn spawn more larvae.

The second numeric result, about 0.00025, is the p-value for a test of whether these variables are correlated at all. The hypothesis test, at a 5% significance level, is as follows: <br>
H0 (null hypothesis) - There is no correlation between sardine catch and sardine larvae <br>
H1 (alternate hypothesis) - There is a correlation between sardine catch and sardine larvae

In simpler terms, if the p-value is below 0.05, we reject the null hypothesis and conclude that the two variables are linearly correlated. Since our p-value is well below that threshold and the Pearson coefficient is positive, we conclude that there is a positive linear correlation between sardine larvae and sardine catch.
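The decision rule described above can be sketched in a few lines. This is a minimal illustration, not part of the original analysis: the helper name `correlation_test` and the toy arrays are assumptions made for the example, and the numbers are not sardine data.

```python
import numpy as np
from scipy import stats

def correlation_test(x, y, alpha=0.05):
    """Return (r, p_value, reject_null) for a two-sided Pearson correlation test."""
    r, p_value = stats.pearsonr(x, y)
    # Reject H0 (no correlation) when the p-value falls below the significance level
    return float(r), float(p_value), bool(p_value < alpha)

# Toy data for illustration only: a rising line with a small alternating wobble
x_toy = np.arange(20, dtype=float)
y_toy = x_toy + np.where(x_toy % 2 == 0, 0.5, -0.5)

r, p_value, reject_null = correlation_test(x_toy, y_toy)
print(f"r = {r:.4f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

Keeping `alpha` as a parameter makes the 5% threshold explicit and easy to change.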

Finally, our Root Mean Squared Error is about 1462.03, which means that predictions from the fitted line are typically off by roughly 1,460, in the units of Count, from the actual values. However, we can see visually that the fit is tighter in the cluster where sardine catch is near 0, and that the prediction error appears to grow as sardine catch increases. Thus, it may be worthwhile to try a different prediction model to see whether the error decreases.
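To put a number on the visual impression that error grows with catch, one option is to compute the RMSE separately for low-catch and high-catch points. The sketch below is only an illustration: the helper names, the threshold, and the toy arrays are all hypothetical and do not come from the sardine dataset.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def rmse_below_and_above(x, actual, predicted, threshold):
    """RMSE for points with x <= threshold, and for points with x > threshold."""
    x = np.asarray(x, dtype=float)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    low = x <= threshold
    return rmse(actual[low], predicted[low]), rmse(actual[~low], predicted[~low])

# Toy numbers for illustration only
x_toy = np.array([0, 1, 2, 10, 20, 30])
actual_toy = np.array([1, 2, 3, 14, 20, 40])
predicted_toy = np.array([1, 2, 3, 10, 24, 36])

print(rmse(actual_toy, predicted_toy))
print(rmse_below_and_above(x_toy, actual_toy, predicted_toy, threshold=5))
```

A much larger RMSE in the high-catch group than in the low-catch group would support the observation that the linear fit degrades as catch increases.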

In [10]:
x_data = sardine_data['CatchLbs']
y_data = sardine_data['Count']

# Fit log(Count) = intercept + slope * CatchLbs,
# which corresponds to Count = exp(intercept) * exp(slope * CatchLbs)
log_y_data = np.log(y_data)

curve_fit = np.polyfit(x_data, log_y_data, 1)

x_val = np.arange(0,180000000,1000000)
y_val = np.exp(curve_fit[1]) * np.exp(curve_fit[0]*x_val)
y_pred = np.exp(curve_fit[1]) * np.exp(curve_fit[0]*x_data)
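# take the square root explicitly; the `squared=False` argument is no longer accepted by recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_data, y_pred))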

fig = px.scatter(sardine_data, x='CatchLbs', y='Count', title = 'Sardine Larvae lbs vs Sardine Catch')
fig.add_traces(go.Scatter(x=x_val, y=y_val, name='Regression Fit'))
fig.show()
print("Root Mean Squared Error: ", rmse)

Root Mean Squared Error:  1757.876201553466


Here we tried an exponential prediction model, which has a higher RMSE of about 1757.88. Thus, our linear model appears to fit better than the exponential one.
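The exponential model in the cell above is obtained by fitting a straight line to the logarithm of the response and then exponentiating. The sketch below isolates that step on noise-free toy data so the recovered parameters can be checked; the helper name `fit_exponential` and the toy values are assumptions for illustration. Note that the log transform is only defined for strictly positive values, so the helper rejects anything else.

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) via a degree-1 polynomial fit to log(y). Requires y > 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("log(y) is undefined for non-positive y")
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(x, np.log(y), 1)
    return float(np.exp(intercept)), float(slope)  # a, b

# Toy data generated from y = 3 * exp(0.5 * x), for illustration only
x_toy = np.linspace(0.0, 4.0, 9)
y_toy = 3.0 * np.exp(0.5 * x_toy)

a, b = fit_exponential(x_toy, y_toy)
print(a, b)
```

Because the line is fitted in log space, it minimizes squared error in log(y) rather than in y itself, which is one reason its RMSE on the original scale can come out higher than that of a direct linear fit.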

## Lagged Correlation and Analysis

Now, what if we want to see whether there is any connection between fish larvae and the fish that are caught once they have grown up? We can examine this through a lagged correlation. According to [NOAA](https://www.fisheries.noaa.gov/species/pacific-sardine#:~:text=They%20reproduce%20at%20age%201,hatch%20in%20about%203%20days.), it takes about 1-2 years, depending on conditions, for the Pacific sardine to mature and become able to reproduce. Thus, we can shift the catch lbs data by 1 year to account for the time it takes sardine larvae to reach adulthood. We chose 1 year as our lag because of this [article](http://calcofi.org/~calcofi/publications/calcofireports/v37/Vol_37_Butler_etal.pdf), which explains that after the first population collapse of the 1940s, most Pacific sardine were able to reproduce at age 1, with some individuals able to do so even earlier. We can then plot our results as follows:

In [12]:
X = sardine_data2['CatchLbs'].values.reshape(-1,1)
Y = sardine_data2['Count'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)

fig = px.scatter(sardine_data2, x='CatchLbs', y='Count', trendline="ols", title = 'Lagged Correlation for Sardine Larvae vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y))

Pearson Correlation: (0.5875986327446723, 0.0004061799479196581)


Our result is quite similar to our previous visualization. The Pearson correlation of about 0.5876 indicates a moderate positive linear correlation between our two variables. This is slightly lower than our previous result, which may be explained by the fact that not all sardine larvae survive to adulthood: over the one-year lag, some larvae may be eaten by predators or die from natural causes or disasters. Still, the p-value of about 0.00041 is well below 0.05, which indicates that the two variables are indeed linearly correlated. Thus, we can conclude that, to an extent, there is a positive linear correlation between sardine larvae and the sardine caught the following year.
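For reference, a one-year lag of the kind used in this section can be built with pandas by shifting the catch column so that each year's larvae count sits beside the following year's catch. This is only a sketch of one way to do it: how `lagged_sardine_data.csv` was actually produced is not shown here, and the `Year` column, the `LaggedCatchLbs` column, and the helper `add_lagged_catch` are hypothetical names. It also assumes one row per consecutive year.

```python
import pandas as pd

def add_lagged_catch(df, lag_years=1, year_col="Year", catch_col="CatchLbs"):
    """Pair each year's row with the catch recorded `lag_years` later.

    Assumes one row per consecutive year; `year_col` is a hypothetical column name.
    """
    out = df.sort_values(year_col).reset_index(drop=True)
    # shift(-lag) moves later catches up so they line up with the earlier larvae count
    out["LaggedCatchLbs"] = out[catch_col].shift(-lag_years)
    # the final `lag_years` rows have no future catch to pair with, so drop them
    return out.dropna(subset=["LaggedCatchLbs"])

# Toy values for illustration only
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Count": [10, 20, 30, 40],
    "CatchLbs": [100, 200, 300, 400],
})

lagged = add_lagged_catch(toy)
print(lagged[["Year", "Count", "LaggedCatchLbs"]])
```

Each extra year of lag removes one more row from the end of the series, so longer lags leave fewer points for the correlation.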