## Context
The CalCOFI data set represents the longest (1949-present) and most complete (more than 50,000 sampling stations) time series of oceanographic and larval fish data in the world. It includes abundance data on the larvae of over 250 species of fish; larval length frequency data and egg abundance data on key commercial species; and oceanographic and plankton data. The physical, chemical, and biological data collected at regular time and space intervals quickly became valuable for documenting climatic cycles in the California Current and a range of biological responses to them. CalCOFI research drew world attention to the biological response to the dramatic Pacific-warming event in 1957-58 and introduced the term “El Niño” into the scientific literature.

The California Cooperative Oceanic Fisheries Investigations (CalCOFI) is a unique partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service and Scripps Institution of Oceanography. The organization was formed in 1949 to study the ecological aspects of the sardine population collapse off California. Today our focus has shifted to the study of the marine environment off the coast of California, the management of its living resources, and the monitoring of indicators of El Niño and climate change. CalCOFI conducts quarterly cruises off southern and central California, collecting a suite of hydrographic and biological data on station and underway. Data collected at depths down to 500 m include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.

We are going to investigate the relationship between water salinity and water temperature, and fit regression models that predict water temperature from salinity. The two variables used are:
<ul>
    <li> `T_degC` : Water temperature in degrees Celsius </li>
    <li> `Salnty` : Salinity in grams of salt per kilogram of water (g/kg) </li>
</ul>

## 1. Data Preprocessing
This code uses some of the most commonly used open source packages for building machine learning models in Python. 
<ul>
    <li> `pandas` - Library for data analysis and manipulation. </li>
    <li> `numpy` - Library for scientific computing. </li>
    <li> `sklearn (scikit-learn)` - Provides implementations of various machine learning tools and techniques. </li>
    <li> `matplotlib` - Library for data visualization. </li>
</ul>


Let's begin by importing all the required libraries:

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```

```python
# Load the bottle data into a DataFrame
bottle = pd.read_csv("../input/calcofi/bottle.csv")
```


```python
# First few rows of the DataFrame
bottle.head()
```

```python
# Transpose of the first 5 rows, so each column name becomes a row label
bottle.head().T
```

```python
# Last 5 rows of the DataFrame
bottle.tail()
```


## 2. Obtaining basic information of variables

<ul>
    <li> What does the variable represent? </li>
    <li> Meaning of values </li>
    <li> Numerical summary </li>
    <li> Graphical distributions of values</li>
</ul>


```python
# Display column names and data types
print('columns: ', bottle.columns)
print('data types: ', bottle.dtypes)
```

```python
# Numerical summary of every numeric column
bottle.describe()
```

```python
# Column names, non-null counts, and dtypes
bottle.info()
```

```python
# Number of rows and columns
bottle.shape
```
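The checklist above also mentions a numerical summary and the distribution of values for the variables of interest. A minimal sketch of both, restricted to the two columns used later, might look like the following. The frame `demo` is synthetic stand-in data (an assumption for illustration, not real CalCOFI measurements), and the histogram is printed as bin counts rather than drawn.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two columns of interest (NOT real CalCOFI data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "T_degC": rng.normal(loc=10.0, scale=4.0, size=1000),
    "Salnty": rng.normal(loc=33.8, scale=0.4, size=1000),
})

# Numerical summary restricted to the columns we care about.
summary = demo[["T_degC", "Salnty"]].describe()
print(summary.loc[["count", "mean", "std", "min", "max"]])

# Text-only view of a distribution: the counts per bin that a histogram would plot.
counts, edges = np.histogram(demo["T_degC"], bins=10)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

On the real data, the same calls would be applied to `bottle[['T_degC', 'Salnty']]` instead of `demo`.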

```python
# Keep 'Salnty' and 'T_degC' for the first 500 rows only
data = bottle[['Salnty', 'T_degC']][:500]
# Quick look: temperature on the x-axis, salinity on the y-axis
plt.plot(data['T_degC'], data['Salnty'], 'b.')
```

```python
# Count missing values in Salnty
data['Salnty'].isnull().sum()
```

```python
# Count missing values in T_degC
data['T_degC'].isnull().sum()
```

```python
# Remove rows that contain missing values
data = data.dropna()
```
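The missing-value checks and the `dropna()` call above can be tried out on a small frame where the expected result is easy to verify by eye. This is only a sketch: `toy` and its values are hypothetical, chosen so that the gaps fall in different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns (not real measurements).
toy = pd.DataFrame({
    "Salnty": [33.4, np.nan, 33.9, 34.1, np.nan],
    "T_degC": [10.5, 9.8, np.nan, 7.2, 6.9],
})

# Missing values per column, in a single call.
missing = toy.isnull().sum()
print(missing)

# Drop a row if either of the two modelling columns is missing.
clean = toy.dropna(subset=["Salnty", "T_degC"])
print(len(toy), "->", len(clean))
```

Passing `subset` is a design choice that matters once the frame holds more columns than the two being modelled: rows are then dropped only for gaps in the columns that are actually used.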

```python
# Plot temperature against salinity after removing missing values
plt.plot(data['Salnty'], data['T_degC'], 'b.')
```

```python
import seaborn as sns

sns.set(font_scale=1.6)
plt.figure(figsize=(13, 9))
# Scatter plot of the cleaned data: salinity on x, temperature on y
plt.scatter(data['Salnty'], data['T_degC'], s=65)
plt.xlabel('Salinity', fontsize=25)
plt.ylabel('Temperature', fontsize=25)
plt.title('Data - Salinity vs Temperature', fontsize=25)
plt.show()
```

```python
# Split the data into the feature matrix X and the target vector y
X = data[['Salnty']].values  # salinity, 2-D array of shape (n_samples, 1)
y = data['T_degC'].values    # temperature, 1-D array of shape (n_samples,)
```

```python
# Linear regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Fit temperature (y) as a linear function of salinity (X)
model.fit(X, y)
```
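After fitting, it can help to look at the line the model actually learned. The sketch below uses synthetic data that follows `y = 2x + 1` exactly (an assumption for illustration only), so the recovered slope and intercept are easy to check. `coef_` and `intercept_` are the attributes scikit-learn sets on a fitted `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly (illustrative only).
X_demo = np.arange(10, dtype=float).reshape(-1, 1)  # shape (n_samples, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

slope = demo_model.coef_[0]         # one coefficient per feature
intercept = demo_model.intercept_   # value predicted at x = 0
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

For the notebook's `model`, the slope would read as the change in predicted temperature (in °C) per 1 g/kg change in salinity.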

```python
sns.set(font_scale=1.6)
# Create the figure explicitly with plt.figure
# so that it has a specific size rather than the default size.
plt.figure(figsize=(13, 9))
# Scatter plot of the observations
plt.scatter(X, y, s=65)
# Fitted regression line
plt.plot(X, model.predict(X), color='red', linewidth=6)
plt.xlabel('Salinity', fontsize=25)
plt.ylabel('Temperature', fontsize=25)
plt.title('Temperature vs Predicted Temperature with Linear Regression', fontsize=25)
plt.show()
```

```python
# score() returns the coefficient of determination (R^2), here on the fitted data
r2_linear = model.score(X, y)
print(r2_linear)
```
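For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. As a rough illustration of what that number means, the sketch below recomputes it by hand as 1 - SS_res / SS_tot and compares it with `score`. The data are synthetic and the coefficients used to generate them are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noisy linear data (illustrative values, not CalCOFI measurements).
rng = np.random.default_rng(42)
X_demo = rng.uniform(33.0, 34.5, size=(200, 1))
y_demo = -5.0 * X_demo.ravel() + 180.0 + rng.normal(scale=1.0, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = demo_model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

A value near 1 means the model explains most of the variance in the target, while a value near 0 means it does little better than always predicting the mean.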

```python
from sklearn.preprocessing import PolynomialFeatures
```


```python
# Fit a polynomial regression of degree 3
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)  # columns: 1, x, x^2, x^3
model2 = LinearRegression()
model2.fit(X_poly, y)
```

```python
# Predict the temperature for an example salinity of 33 g/kg
Prediction_Temp_Poly = model2.predict(poly.transform([[33]]))
Prediction_Temp_Poly
```
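Keeping the feature expansion and the regression in sync by hand is easy to get wrong, since every prediction has to pass through the same transformer first. One possible alternative, sketched below under the assumption that a scikit-learn pipeline is acceptable here, chains `PolynomialFeatures` and `LinearRegression` so that `predict` expands the input automatically. The cubic data are synthetic and noise-free, which makes the prediction easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free cubic y = x^3 - 2x (illustrative only).
X_demo = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
y_demo = X_demo.ravel() ** 3 - 2.0 * X_demo.ravel()

# The pipeline expands features and fits the linear model in one step.
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
cubic.fit(X_demo, y_demo)

# Raw inputs go straight into predict; no manual transform is needed.
pred = cubic.predict([[1.5]])
print(pred)  # close to 1.5**3 - 2*1.5 = 0.375
```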

```python
sns.set(font_scale=1.6)
plt.figure(figsize=(13, 9))
# Evenly spaced salinity values spanning the observed range, as a column vector
x_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)
plt.scatter(X, y, s=65)
# Expand the grid with the fitted polynomial features before predicting
plt.plot(x_grid, model2.predict(poly.transform(x_grid)), color='red', linewidth=6)
plt.xlabel('Salinity', fontsize=25)
plt.ylabel('Temperature', fontsize=25)
plt.title('Temperature vs Predicted Temperature with Polynomial Regression', fontsize=25)
plt.show()
```

```python
# R^2 of the polynomial model on the fitted data
r2_poly = model2.score(X_poly, y)
print(r2_poly)
```
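Both scores above are computed on the same rows the models were fitted on, where a more flexible model cannot score lower than a simpler one it contains. A fairer comparison holds out part of the data with `train_test_split`, which the notebook imports at the top. The sketch below does this on synthetic cubic data. The data, the noise level, and the 25% test size are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy cubic relationship (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2.0, 2.0, size=(300, 1))
y_demo = X_demo.ravel() ** 3 - X_demo.ravel() + rng.normal(scale=0.3, size=300)

# Hold out 25% of the rows; the models never see them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
cubic.fit(X_train, y_train)

# R^2 on the held-out rows only.
r2_linear = linear.score(X_test, y_test)
r2_cubic = cubic.score(X_test, y_test)
print(f"linear R^2={r2_linear:.3f}, cubic R^2={r2_cubic:.3f}")
```

On the notebook's data, the same split would be applied to `X` and `y` before fitting `model` and `model2`.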


From the above analysis, we observe that polynomial regression predicts temperature more accurately than linear regression, as measured by the R² score on the fitted data.