# Linear regression for the CalCOFI dataset.
This exercise uses *CalCOFI*, a public dataset hosted on Kaggle, to investigate whether water temperature and depth (in m) are related to salinity (in g of salt per kg of water, g/kg). In other words, the linear model predicts salinity from water temperature and depth.
The full dataset is available on Kaggle:
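
Written out, the model described above has the following form. The coefficient symbols $\beta_0, \beta_1, \beta_2$ and the error term $\varepsilon$ are notation introduced here for illustration and do not appear in the original code:

$$
\text{Salinity} = \beta_0 + \beta_1 \cdot \text{Temperature} + \beta_2 \cdot \text{Depth} + \varepsilon
$$

Here $\beta_0$ is the intercept, and $\beta_1$ and $\beta_2$ measure how salinity changes per unit of temperature and depth, respectively.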

https://www.kaggle.com/sohier/calcofi?select=bottle.csv

In [1]:
import pandas as pd
import kaggle
import zipfile
from linear_regression import linear_regression  # local module; only this function is used below

To use the Kaggle API, follow these steps:
1. Install the client: pip install kaggle
2. Get API credentials: to download the kaggle.json file, go to https://www.kaggle.com/username/account
3. Put the downloaded .json file in the following directory: C:\users\username\\.kaggle
4. Download the dataset with the commands below.

The download call returns True, which means the dataset was downloaded and is located in the data folder in the same directory as the source code. The dataset can now be used.

In [2]:
kaggle.api.authenticate()
kaggle.api.dataset_download_file('sohier/calcofi', file_name='bottle.csv',  path='data/')

True
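
Before calling `kaggle.api.authenticate()`, it can help to confirm that the credentials file from step 3 is in place. The sketch below only checks the default location (a `.kaggle` folder in the user's home directory); the helper name `kaggle_credentials_path` is hypothetical and is not part of the Kaggle package.

```python
from pathlib import Path


def kaggle_credentials_path(home=None):
    """Return the default kaggle.json location under the given (or current) home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


cred_path = kaggle_credentials_path()
if cred_path.is_file():
    print(f"Found credentials at {cred_path}")
else:
    # authenticate() is likely to fail if the file is missing from this location
    print(f"No kaggle.json found at {cred_path}")
```

On Windows, `Path.home()` typically resolves to C:\Users\username, which matches the directory named in the steps above.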

Since *CalCOFI* is a large dataset (269 MB), I reduced the CSV to just the 3 columns needed for this exercise so that the file could be uploaded to GitHub.

Furthermore, my linear regression function uses only the first 1000 rows, for better runtime and presentation. This does not affect the dataset that I save as a CSV file.
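
How the function limits itself to the first rows is not shown in this notebook. One common way to do it, sketched here as an assumption, is the `nrows` argument of `pd.read_csv`, which reads only the leading rows and leaves the file on disk untouched. The in-memory CSV and the constant `N_ROWS` are illustrative stand-ins.

```python
import io

import pandas as pd

# Small stand-in for the real CSV, so the example runs without the dataset.
csv_text = """Depthm,T_degC,Salnty
0,10.5,33.44
8,10.46,33.44
10,10.46,33.437
19,10.45,33.42
20,10.45,33.421
"""

N_ROWS = 3  # illustrative; the notebook's function uses the first 1000 rows

full = pd.read_csv(io.StringIO(csv_text))                   # every row
subset = pd.read_csv(io.StringIO(csv_text), nrows=N_ROWS)   # only the first N_ROWS rows

print(len(full), len(subset))
```

An equivalent alternative on an already loaded frame is `df.head(N_ROWS)`.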

Since the downloaded file is in .zip format, we have to extract it into the directory where the source code lives.

In [3]:
file_name = r"C:\Users\shaya\Documents\Shayan Docs\Koc Uni\Semester 1\Advance Data Analysis Python\HW\PythonCourse\HW\HW2\data\bottle.csv.zip"
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall()
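
The cell above relies on an absolute path that only exists on one machine. A more portable variant, sketched under the assumption that the notebook is run from the folder containing `data/`, passes paths explicitly. The helper `extract_zip` is a hypothetical name, and the demo builds its own tiny archive so that it runs without the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()


# Self-contained demo: create a small archive in a temporary folder, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo_zip = tmp / "bottle.csv.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("bottle.csv", "Depthm,T_degC,Salnty\n0,10.5,33.44\n")
    extracted = extract_zip(demo_zip, tmp / "out")
    extracted_ok = (tmp / "out" / "bottle.csv").is_file()

print(extracted, extracted_ok)
```

In this notebook, a call such as `extract_zip(Path("data") / "bottle.csv.zip", ".")` would mirror the cell above.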

Since this project only needs depth (Depthm), temperature (T_degC), and salinity (Salnty), only these columns are imported in the first place.

In [4]:
df = pd.read_csv(r'C:\Users\shaya\Documents\Shayan Docs\Koc Uni\Semester 1\Advance Data Analysis Python\HW\PythonCourse\HW\HW2\bottle.csv', usecols=['T_degC', 'Depthm', 'Salnty'])
df.head()

   Depthm  T_degC  Salnty
0       0   10.50  33.440
1       8   10.46  33.440
2      10   10.46  33.437
3      19   10.45  33.420
4      20   10.45  33.421


Save the reduced data to a CSV file called 'modified_bottle.csv' and use that file from now on.

In [5]:
df.to_csv('modified_bottle.csv', index=False)

In [6]:
df = pd.read_csv('modified_bottle.csv')
df.head()

   Depthm  T_degC  Salnty
0       0   10.50  33.440
1       8   10.46  33.440
2      10   10.46  33.437
3      19   10.45  33.420
4      20   10.45  33.421


In [7]:
regression_estimates, standard_errors, credible_intervals, X, Y = linear_regression(data_set='modified_bottle.csv')

   Depthm  T_degC  Salnty
0       0   10.50  33.440
1       8   10.46  33.440
2      10   10.46  33.437
3      19   10.45  33.420
4      20   10.45  33.421
Coefficients:  [34.24427136355757, -0.08710971384286414, 0.0005176710949269715]
t_statistic-values:  [546.9655357167937, -15.8155521804229, 9.84088622253644]
Null hypothesis rejected.
Failed to reject the null hypothesis.
Null hypothesis rejected.
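
The internals of `linear_regression` are not shown in this notebook. As a rough illustration of how coefficients, standard errors, and t-statistics like those printed above are commonly obtained, here is a minimal ordinary least squares sketch in NumPy. The function name `ols_summary`, the synthetic data, and the 1.96 cutoff (a large-sample approximation for a two-sided test at the 5% level) are all assumptions for this example, not the author's implementation.

```python
import numpy as np


def ols_summary(X, y):
    """Fit y = b0 + X @ b by least squares; return coefficients, standard errors, t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = design.shape[0] - design.shape[1]      # observations minus fitted parameters
    sigma2 = residuals @ residuals / dof         # unbiased residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se


# Synthetic data with known coefficients (2, 3, -1), used only to exercise the function.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

beta, se, t_stats = ols_summary(X_demo, y_demo)
# Two-sided decision: compare the absolute t-statistic with the assumed cutoff.
reject = np.abs(t_stats) > 1.96
print(beta, se, t_stats, reject)
```

Note that a two-sided test looks at the absolute value of the t-statistic, so under this rule a value of about -15.8 would also lead to rejection. The "Failed to reject" line in the output above may therefore reflect a one-sided test or the way the sign is handled inside `linear_regression`.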


In [8]:
print('regression_estimates: ', regression_estimates)
print('standard_error: ', standard_errors)
print('credible_intervals: ')
print('lower_CI: ', credible_intervals[0])
print('upper_CI: ', credible_intervals[1])

regression_estimates:  [33.32961937 33.33724513 33.33828047 33.3438106  33.34432828 33.34950499
 33.35416403 33.37815145 33.39797257 33.42419492 33.42836122 33.45368754
 33.4772966  33.49711772 33.55884093 33.56423176 33.60475972 33.6541629
 33.6984143  33.72273022 33.84501663 33.88077118 33.99961277 34.09215719
 34.09406363 34.17022146 34.24725039 34.32133754 33.36446325 33.387933
 33.40530508 33.41260234 33.43051702 33.4793879  33.48268311 33.50400736
 33.51391792 33.56382414 33.57651224 33.625169   33.6512271  33.66859917
 33.7340854  33.75110404 33.82315575 33.86324084 33.88126989 33.95265431
 33.97223638 34.06581615 34.12385549 34.15242714 34.22945607 34.30300061
 34.33154733 34.37480295 34.44660529 34.51927874 34.59108108 34.59350519
 33.35575228 33.37399545 33.38788313 33.38978957 33.39567314 33.40416999
 33.4316418  33.42867508 33.41035711 33.44145261 33.44510125 33.49149814
 33.49375801 33.54328587 33.59281372 33.67153065 33.68751395 33.77001882
 33.83079613 33.84468381 33.968