<a id="top"></a> 
# Salinity CalCOFI: Data Clean, Correlation, Visualizations and Folium Map

The CalCOFI data set represents the longest (1949-present) and most complete (more than 50,000 sampling stations) time series of oceanographic and larval fish data in the world. CalCOFI research drew world attention to the biological response to the dramatic Pacific-warming event in 1957-58 and introduced the term “El Niño” into the scientific literature. 

This analysis covers cleaning the CalCOFI data, correlating salinity with the other measurements, visualizing the results, and mapping the collection stations with Folium.


### Table of Contents
1.  [Data Collection](#coll)<br>

2.  [Understanding the Data](#data)<br>
2.1  [Data Types](#data_info)<br>
2.2  [Statistical Summary](#data_summ)<br>

3.  [Data Cleaning](#prep)<br>
3.1  [Check for NULLs/Duplicates](#prep_nulls)<br>
3.2  [Extract Month/Year from Depth_ID](#prep_extract)<br>
3.3  [Drop columns that cannot be Normalized](#prep_drop)<br>

4.  [Correlations](#corr)<br>
4.1  [Normalization](#corr_norm)<br>
4.2  [Correlation - Salinity](#corr_saln)<br>

5. [Data Visualization](#eda)<br>
5.1  [Salinity Plots](#eda_saln)<br>
5.2  [Distribution Plots - Correlation](#eda_dist)<br>
5.3  [Regression Plots - Correlation](#eda_regr)<br>

6.  [Map - Collection Station Locations](#map)<br>   


---
#  1.  Data Collection <a id="coll"></a>
###  Import Python Libraries

In [None]:
#  Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

#  maps
import folium
from folium.plugins import MarkerCluster


%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 80)

#  Kaggle input directory: list the available data files
import os
print(os.listdir("../input"))

###  Load the Datasets

In [None]:
#  bottle.csv contains information on ocean conditions
#  cast.csv   contains information on collecting stations
df = pd.read_csv('../input/bottle.csv')

[go to top of document](#top)     

---
#  2.  Understanding the Data <a id="data"></a>

##  2.1  Data Types<a id="data_info"></a>
-  **Categorical data** - collection station information - for map
-  **Numerical data** - scientific data


The collection station locations will be used for the Folium map, but the main attributes for this analysis will be:

*  **Depth_ID**  - extract months and years
*  **Salnty**  - Salinity in g of salt per kg of water (g/kg).  _Target_ variable

In [None]:
df.shape

##  2.2  Statistical Summary <a id="data_summ"></a>
Summarize descriptive statistics of the dataset for *numerical* and *categorical* features. 

**Statistical Summary - NUMERICAL DATA**   
Summarize the central tendency, dispersion and shape of numeric features, excluding categorical and NaN values.

In [None]:
df.describe()   #  NUMERICAL DATA

**Statistical Summary - CATEGORICAL DATA**   
Summarize the count, uniqueness and frequency of categorical features, excluding numerical values.

In [None]:
df.describe(include=['O'])   #  CATEGORICAL DATA

[go to top of document](#top)     

---
#  3.  Data Cleaning <a id="prep"></a>
Clean the data before beginning any analysis.

## 3.1  Check for NULLs/Duplicates <a id="prep_nulls"></a>
Cleaning up the NULL and duplicate values in the dataset:

*  3.1.1  Check for NULL percentages
*  3.1.2  Drop attributes with more than 30% data missing
*  3.1.3  Fill remaining NULLs with **mode** values
*  3.1.4  Re-check NULL Percentages
*  3.1.5  Check for duplicates

###  3.1.1  Check for NULL percentages

In [None]:
nulls = df.isnull().sum().sort_values(ascending = False)
prcet = round(nulls/len(df)*100,2)

df_null = pd.DataFrame(columns =  ['Attr','Total','Percent'])
df_null.Attr  = nulls.index
df_null.Total = nulls.values
df_null.Percent = prcet.values
print(df_null.head(20))

###  3.1.2  Drop attributes with more than 30% data missing

In [None]:
for i in df_null.Attr[df_null['Percent'] > 30]:
    df = df.drop([i], axis=1)
    #print(df.shape,i)
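
The same threshold rule can also be applied without a loop. The following is a minimal sketch on a made-up three-column frame (the `toy` data and the `null_pct`, `threshold` and `toy_kept` names are illustrative, not part of the CalCOFI files); it keeps every column whose missing share is at or below 30%.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

# Percentage of missing values per column
null_pct = toy.isnull().mean() * 100

# Keep only the columns at or below the 30% threshold
threshold = 30
toy_kept = toy.loc[:, null_pct <= threshold]
print(toy_kept.columns.tolist())
```

Selecting with a boolean mask avoids rebuilding the data frame once per dropped column, which may matter on a file as large as _bottle.csv_.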

###  3.1.3  Fill remaining NULLs with **mode** values
Some attributes have more than one mode.  In that case, the mean of the modes is used as the 'fillna' value.

In [None]:
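for i in df.columns:
    if df[i].isnull().sum() > 0:
        #  mode() can return several values; average them into one fill value
        #  assign back instead of using inplace=True on a column selection
        df[i] = df[i].fillna(df[i].mode().mean())
        #print('filled',i)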
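
To make the multi-mode case concrete, here is a small sketch on a made-up series (the values are illustrative only). Two values tie for most frequent, so `mode()` returns both and their mean becomes the fill value.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical data): 1.0 and 3.0 both occur twice
s = pd.Series([1.0, 1.0, 3.0, 3.0, 5.0, np.nan])

modes = s.mode()            # all tied modes, NaN ignored
fill_value = modes.mean()   # mean of the tied modes
filled = s.fillna(fill_value)

print(modes.tolist(), fill_value)
```

Note that with tied modes the fill value need not be a value that actually occurs in the column.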

###  3.1.4  Re-check NULL Percentages
Shows the attributes that can be used for EDA and correlation.

In [None]:
nulls = df.isnull().sum().sort_values(ascending = False)
prcet = round(nulls/len(df)*100,2)

df_null = pd.DataFrame(columns =  ['Attr','Total','Percent'])
df_null.Attr  = nulls.index
df_null.Total = nulls.values
df_null.Percent = prcet.values
print(df_null.head())

###  3.1.5  Check for Duplicated values

In [None]:
print('COUNT OF DUPLICATES:  {}'.format(df.duplicated().sum()))

## 3.2  Extract Month/Year from Depth_ID <a id="prep_extract"></a>
The _cast.csv_ file has the month/year data, but given the size of _bottle.csv_, it is easier to extract them from the Depth_ID column here.

In [None]:
#  Depth_ID = [Century]-[YY][MM][ShipCode]-etc
#  19-4903CR-HY-060-0930-05400560-0020A-7
parts = df['Depth_ID'].str.split('-', expand=True)   #  split once, reuse below
df['Year']  = (parts[0] + parts[1]).str[:4]          #  century + YY (string)
df['Month'] = parts[1].str[2:4]                      #  MM (string)
                 
df[['Depth_ID','Year','Month']].head(10)
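
As a quick check of the parsing rule, the sketch below applies it to the single example ID from the comment above. Converting the results to integers is an assumption added here for illustration; the notebook itself keeps both columns as strings.

```python
import pandas as pd

# The example Depth_ID from the comment above
ids = pd.Series(['19-4903CR-HY-060-0930-05400560-0020A-7'])

parts = ids.str.split('-', expand=True)

# Century ('19') + first two characters of the second field ('49')
year = (parts[0] + parts[1].str[:2]).astype(int)

# Characters 3-4 of the second field ('03')
month = parts[1].str[2:4].astype(int)

print(year.iloc[0], month.iloc[0])
```

Integer years and months would plot on a numeric axis rather than as categories, which may or may not be what a given chart needs.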

## 3.3  Drop columns that cannot be Normalized <a id="prep_drop"></a>
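Drop the identifier columns, which cannot be meaningfully normalized:
   - Cst_Cnt   Auto-numbered Cast Count
   - Btl_Cnt   Auto-numbered Bottle count
   - Sta_ID    CalCOFI Line and Station
   - Depth_ID  [Century]-[YY][MM][ShipCode]

The code below also drops Depthm and the extracted Year and Month columns.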

In [None]:
drop_cols = ['Cst_Cnt', 'Btl_Cnt', 'Sta_ID', 'Depth_ID', 'Depthm','Year','Month']
df_norm = df.drop(drop_cols, axis=1)  #  data for normalization
df_scale = df_norm.copy(deep=True)    #  backup data

[go to top of document](#top)     

---
# 4.  Correlation<a id="corr"></a>
Correlation is a statistical measure of how strongly two variables vary together.  The Pearson coefficient itself does not change when a variable is rescaled, but the data are standardized first so that all attributes share a common scale in the plots that follow.

## 4.1  Normalization<a id="corr_norm"></a>
Normalization rescales the data to a common range. Min-max scaling maps values into an interval such as 0 to 1; the `StandardScaler` used here instead standardizes each attribute to zero mean and unit variance. A common scale matters for many machine learning models and makes attributes directly comparable in plots. Pearson correlation coefficients are not affected by it.

The data in **df_scale** will be standardized, and the **df_norm** data frame will be rebuilt from the standardized values.

In [None]:
df_scale = StandardScaler().fit_transform(df_scale)

#  create dataframe
df_norm = pd.DataFrame(df_scale, index=df_norm.index, columns=df_norm.columns)
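
As a side check on what standardization does, the sketch below z-scores a small made-up frame with plain pandas. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default, and then compares the Pearson coefficient before and after. The `toy` data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(loc=34.0, scale=0.5, size=200)
toy = pd.DataFrame({'x': x, 'y': 3.0 * x + rng.normal(size=200)})

# z-score each column: subtract the mean, divide by the population std
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

# Pearson correlation before and after standardization
corr_raw = toy.corr().loc['x', 'y']
corr_std = toy_std.corr().loc['x', 'y']

print(np.isclose(corr_raw, corr_std))
```

Each standardized column ends up with mean 0 and standard deviation 1, while the correlation between the columns stays the same up to floating-point error.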

## 4.2  Correlation - Salinity<a id="corr_saln"></a>

In [None]:
df_norm.corr()

#  Drop columns with mode = "0.0".  No impact on correlation
for i in df_norm.columns.tolist():
    if (df_norm[i].mode()[0] == 0.0):
        print(' - ',i,df_norm[i].mode()[0])
        df_norm = df_norm.drop(i,axis=1)

#  Create correlation dataframe (compute the Salnty correlations once)
saln_corr = df_norm.corr()['Salnty'].sort_values(ascending=False)
df_corr = pd.DataFrame({'Attributes': saln_corr.index, 'Correlation': saln_corr.values})
print(df_corr)

[go to top of document](#top)     

---
#  5.  Data Visualization <a id="eda"></a>
Some attributes produce nearly identical plots, e.g. R_O2Sat and O2Sat.  In such cases, only one of them is used for the distribution and regression plots.  For this project, nine attributes were selected.

**df_sample** - a random sample of the data (0.2% of the rows), used to keep the plots readable.

In [None]:
#  attributes for plotting
plot_attr = ['R_DYNHT', 'R_SIGMA', 'R_Depth', 'RecInd', 'NH3q',  'T_prec', 'T_degC', 'R_POTEMP', 'O2ml_L']

#  DataFrame.append was removed in pandas 2.0; pd.concat keeps the plot_attr order
df_plot = pd.concat([df_corr[df_corr.Attributes == i] for i in plot_attr])
print(df_plot)

#  take sample of data for plotting
df_sample = df_norm.sample(n=int(round(len(df)*.002,0)), random_state=0)
print('\n\nPlotting data shape: {}'.format(df_sample.shape))
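
The sample size above is computed by hand from the row count. A possible alternative, sketched here on a made-up frame, is to pass the fraction directly through the `frac` parameter of `DataFrame.sample`; a fixed `random_state` makes the draw repeatable.

```python
import pandas as pd

# Toy frame (hypothetical data) with 10,000 rows
toy = pd.DataFrame({'value': range(10_000)})

# Draw 0.2% of the rows; the same seed gives the same rows each time
first = toy.sample(frac=0.002, random_state=0)
second = toy.sample(frac=0.002, random_state=0)

print(len(first), first.index.equals(second.index))
```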

##  5.1  Salinity Plots <a id="eda_saln"></a>
Salinity plots will use the complete dataset.

###  5.1.1  Salinity Distribution

In [None]:
#  Salinity distribution
plt.figure(figsize=(8,6))
plt.xlim([32, 36])
plt.title('Salinity Distribution (g/kg)', fontsize=14)
#  distplot is deprecated in seaborn; histplot with a KDE overlay replaces it
sns.histplot(df['Salnty'], kde=True, stat='density', color='darkgreen')

###  5.1.2  Plot of Salinity over Time

In [None]:
#  Yearly change in Salinity
fig = plt.figure(figsize=(12,6))
fig.autofmt_xdate()
fig.add_subplot(121)
plt.title('Yearly Change in Salinity (g/kg)', fontsize=14)
sns.scatterplot(data=df, x='Year', y='Salnty', color='darkgreen')

#  Seasonal change in Salinity
fig.add_subplot(122)
plt.title('Seasonal Change in Salinity (g/kg)', fontsize=14)
sns.scatterplot(data=df, x='Month', y='Salnty', color='darkgreen')
plt.show()

##  5.2  Distribution Plots for Correlations <a id="eda_dist"></a>

In [None]:
fig = plt.figure(figsize=(14,60))
col = 3
row  = int(len(df_corr.Attributes)/col)
count = 1

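for i, j in zip(df_plot.Attributes,df_plot.Correlation):
    fig.add_subplot(row, col, count)
    plt.title('Salinity vs {} (corr = {:.4f})\nnormalized distribution'.format(i,j))
    plt.xlim(-4,4)
    #  distplot is deprecated in seaborn; histplot with a KDE overlay replaces it
    sns.histplot(df_sample.Salnty, kde=True, stat='density')
    sns.histplot(df_sample[i], kde=True, stat='density')
    count = count + 1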

plt.show()

##  5.3  Regression Plots - Correlation <a id="eda_regr"></a>

In [None]:
fig = plt.figure(figsize=(14,60))
col = 3
row  = int(len(df_corr.Attributes)/col)
count = 1

for i, j in zip(df_plot.Attributes,df_plot.Correlation):
    fig.add_subplot(row, col, count)
    plt.title('Salinity vs {} (corr = {:.4f})\nnormalized values'.format(i,j))
    sns.regplot(x=i, y='Salnty', data=df_sample, order=2, scatter_kws={'alpha':0.25}, color='green')
    count = count + 1

plt.show()

[go to top of document](#top)     

---
#  6.  Map - Collection Station Locations<a id="map"></a>

In [None]:
#  Load the Dataset
dfLOC = pd.read_csv('../input/cast.csv')

In [None]:
#  select location points
dfLOC = dfLOC[['Lat_Dec', 'Lon_Dec','Date']]
dfLOC = dfLOC.tail(1000)
dfLOC = dfLOC.reset_index(drop=True)  # reset index after tail

#  create folium map
salinity_map   = folium.Map(location=[dfLOC.Lat_Dec.mean(),dfLOC.Lon_Dec.mean()], zoom_start=6)
marker_cluster = MarkerCluster().add_to(salinity_map)

for i in range(len(dfLOC)):
    folium.Marker(location=[dfLOC.Lat_Dec[i],dfLOC.Lon_Dec[i]],
            popup = (dfLOC.Date[i]),         # dates in popups
            icon = folium.Icon(color='green')  # green popup icon
    ).add_to(marker_cluster)

salinity_map.add_child(marker_cluster)
salinity_map         #  display map

---
###  END
[go to top of document](#top)