# Linear Correlations and Other Analysis

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression
from scipy import stats
sardine_data = pd.read_csv("data/sardine_data.csv")
sardine_data2 = pd.read_csv("data/lagged_sardine_data.csv")

We want to see whether there is any relationship between sardine larvae and sardine catch. Our hypothesis is that the two variables are linearly correlated: the more sardine larvae found in a given year, the more sardines we expect to be caught that year, and vice versa. We can visualize this as follows:

In [8]:
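# scikit-learn expects 2-D column arrays, hence reshape(-1, 1)
X = sardine_data['CatchLbs'].values.reshape(-1,1)
Y = sardine_data['Sardine Larvae lbs'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
# Fitted values from the regression (the plot below draws its own OLS trendline)
Y_pred = linear_regressor.predict(X)
# Flatten back to 1-D arrays, which stats.pearsonr requires
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)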

fig = px.scatter(sardine_data, x='CatchLbs', y='Sardine Larvae lbs', trendline="ols", title ='Sardine Larvae lbs vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y))

Pearson Correlation: (0.596545112831949, 0.0002482636504668809)


The first value is the Pearson correlation coefficient, about 0.5965, which indicates a moderate positive linear correlation between sardine catch and sardine larvae. A plausible reading is that a year with more sardine larvae is also a year with more adult sardines in the ocean, since more spawning adults produce more larvae.
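
As a rough gauge of how strong this relationship is, squaring the Pearson coefficient gives the share of variance in one variable that a straight-line fit on the other accounts for. A minimal sketch using the coefficient printed above (the variable names are chosen for this example):

```python
# Pearson coefficient copied from the printed output above
r = 0.596545112831949

# For a simple linear fit, r squared is the fraction of variance explained
r_squared = r ** 2
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

Here roughly 36% of the variance is accounted for, which fits the description of the correlation as moderate rather than strong.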

The second value, about 0.00025, is the p-value for a test of whether the two variables are linearly correlated at all. The hypotheses, tested at a 5% significance level, are as follows: <br>
H0 (null hypothesis) - There is no correlation between sardine catch and sardine larvae <br>
H1 (alternative hypothesis) - There is a correlation between sardine catch and sardine larvae

In simpler terms, if the p-value is below 0.05, we reject the null hypothesis and conclude that the two variables are linearly correlated. Our p-value is well below that threshold and the Pearson coefficient is positive, so we conclude that there is a positive linear correlation between sardine larvae and sardine catch.
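
The decision rule can be sketched in a few lines. The data below are synthetic and generated only for illustration (they are not the sardine measurements), and `alpha`, `x`, `y`, and `decision` are names chosen for this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# Synthetic stand-in data: y depends linearly on x, plus noise
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

alpha = 0.05  # 5% significance level
r, p_value = stats.pearsonr(x, y)

# Reject H0 (no correlation) only when the p-value falls below alpha
if p_value < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(f"r = {r:.3f}, p = {p_value:.3g}: {decision}")
```

Note that failing to reject H0 would not prove the variables are unrelated; it would only mean the data do not show a linear correlation at the chosen significance level.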

## Lagged Correlation and Analysis

Now, what if we want to see whether there is a connection between sardine larvae and the adult fish caught in later years? We can examine this with a lagged correlation. According to NOAA, it takes about 2 years for the Pacific sardine to mature and become able to reproduce. We can therefore shift the catch data by 2 years to account for the time it takes sardine larvae to reach adulthood, then plot the results as follows:

In [11]:
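# Same steps as before, now on the lagged dataset
X = sardine_data2['CatchLbs'].values.reshape(-1,1)
Y = sardine_data2['Sardine Larvae lbs'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
# Fitted values from the regression (the plot below draws its own OLS trendline)
Y_pred = linear_regressor.predict(X)
# Flatten back to 1-D arrays, which stats.pearsonr requires
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)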

fig = px.scatter(sardine_data2, x='CatchLbs', y='Sardine Larvae lbs', trendline="ols", title='Lagged Correlation for Sardine Larvae vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y))

Pearson Correlation: (0.5768762964879124, 0.0006807965129356197)


Our result is quite similar to the previous one. The Pearson correlation of about 0.5769 indicates a moderate positive linear correlation between the two variables. This is slightly lower than the unlagged result, which may be explained by the fact that not all sardine larvae survive to adulthood: over the span of 2 years, some are eaten by predators or die from natural causes or disasters. Still, the p-value of about 0.00068 is well below 0.05, indicating that the two variables are linearly related. We can therefore conclude that, to an extent, the amount of sardine larvae in a given year is positively correlated with the sardine catch 2 years later.
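
For reference, a two-year lag like the one used here could be built with pandas `shift`. This is only a sketch under assumptions: how `lagged_sardine_data.csv` was actually produced is not shown in this notebook, the `Year` column and the toy values below are hypothetical, and the data are assumed to have one row per consecutive year.

```python
import pandas as pd

# Toy yearly data (made-up values, one row per consecutive year)
df = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "Sardine Larvae lbs": [10.0, 12.0, 9.0, 15.0, 11.0, 14.0],
    "CatchLbs": [100.0, 110.0, 130.0, 150.0, 120.0, 170.0],
})

lag_years = 2
lagged = df.sort_values("Year").copy()

# Pair each year's larvae with the catch recorded lag_years later
lagged["CatchLbs"] = lagged["CatchLbs"].shift(-lag_years)

# The last lag_years rows have no future catch to pair with, so drop them
lagged = lagged.dropna(subset=["CatchLbs"]).reset_index(drop=True)
print(lagged)
```

Shifting costs one row per year of lag, so the lagged dataset is slightly smaller than the original. That alone can change the correlation and its p-value a little.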