# Using El Nino Southern Oscillation (ENSO) Information To Predict Water Levels in Norfolk, VA

## Introduction and Motivation

### [COPY INFO FROM PRESENTATION SLIDES]

## Data and Methods

- Data Sources:


    * El Nino data: Oceanic Nino Index (ONI) from the National Oceanic and Atmospheric Administration (NOAA)
        [http://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php]

    * Southern Oscillation data: Southern Oscillation Index (SOI) from NOAA
        [http://www.cpc.ncep.noaa.gov/data/indices/soi]

    * Water level data: Sewell's Point, VA tidal gauge data from NOAA
        [https://tidesandcurrents.noaa.gov/stationhome.html?id=8638610]

- Methodology:


    * Our period of record will consist of monthly data for years 1996-2017 (existing gauge installed Dec. 1995).

## Preprocessing

The first step is to acquire the input data and read it into pandas DataFrames.
Here we'll be using:

    a) standardized monthly sea-level pressure anomalies (SOI) for our ENSO metric, and 

    b) maximum verified monthly water level above the station's Mean Lower Low Water (MLLW) datum, in meters. 
    NOAA defines MLLW as 
> *"the average of the lower low water height of each tidal day observed over the National Tidal Datum Epoch."* 

In [151]:
# Imports and constants
import numpy as np
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

soi_url="http://www.cpc.ncep.noaa.gov/data/indices/soi"
wlev_url='https://raw.githubusercontent.com/skydog71/CMSC641_tutorial/master/wlev_sewells_pt_monthly_1996-2017.csv'


In [152]:
# Read in SOI data, and narrow the period of record to ours (1996-2017)
soi_df = pd.read_csv(soi_url,sep='  ',header=2, engine='python')
# Positional slice: rows 45 through 66 hold years 1996 through 2017
soi_df = soi_df[45:67]

# Rename columns (to remove superfluous spaces)
soi_df.columns = ['YEAR', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

soi_df.head()

Unnamed: 0,YEAR,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC
45,1996,1.6,0.4,1.9,1.3,0.5,1.9,1.1,1.2,1.0,1.0,-0.1,1.5
46,1997,0.8,2.9,-0.7,-1.0,-2.2,-2.3,-1.2,-2.4,-2.4,-2.4,-2.0,-1.6
47,1998,-4.4,-3.4,-4.0,-2.4,0.4,1.6,2.0,1.9,1.7,1.8,1.7,2.3
48,1999,3.0,1.6,2.1,2.3,0.4,0.4,0.9,0.6,-0.1,1.6,1.7,2.4
49,2000,1.1,2.7,2.2,2.0,0.6,-0.3,-0.3,1.2,1.4,1.8,3.0,1.3
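
The positional slice above only works while the first row of the file stays at the same year. A minimal sketch of selecting the same period by value instead, using a small invented frame (`demo_soi` is hypothetical and stands in for the downloaded SOI table):

```python
import pandas as pd

# Hypothetical stand-in for the SOI table: one row per year, invented values
demo_soi = pd.DataFrame({
    'YEAR': [1994, 1995, 1996, 1997, 2017, 2018],
    'JAN':  [-0.1, -0.4, 1.6, 0.8, 0.3, 1.1],
})

# Coerce YEAR to numeric in case it was parsed as text, then filter by value
demo_soi['YEAR'] = pd.to_numeric(demo_soi['YEAR'], errors='coerce')
in_period = demo_soi['YEAR'].between(1996, 2017)  # inclusive on both ends
demo_period = demo_soi[in_period].reset_index(drop=True)

print(demo_period['YEAR'].tolist())
```

Filtering on the `YEAR` column keeps the selection correct even if rows are added to or removed from the top of the source file.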


In [153]:
# Read in water level data, which have already been subset to our period of record (1996-2017)
wlev_df = pd.read_csv(wlev_url, header=0, engine='python')
wlev_df.head(12)

Unnamed: 0,Year,Month,Highest,MHHW,MHW,MSL,MTL,MLW,MLLW,DTL,GT,MN,DHQ,DLQ,HWI,LWI,Lowest,Inferred
0,1996,1,1.251,0.884,0.818,0.457,0.454,0.091,0.036,0.46,0.848,0.727,0.066,0.055,1.6,7.86,-0.216,0
1,1996,2,1.374,0.818,0.749,0.384,0.378,0.008,-0.042,0.388,0.86,0.741,0.069,0.05,1.57,7.87,-0.309,0
2,1996,3,1.053,0.796,0.738,0.377,0.373,0.008,-0.047,0.375,0.843,0.73,0.058,0.055,1.56,7.84,-0.608,0
3,1996,4,1.134,0.827,0.769,0.404,0.399,0.029,-0.008,0.409,0.835,0.74,0.058,0.037,1.55,7.82,-0.231,0
4,1996,5,1.167,0.865,0.814,0.436,0.43,0.046,0.015,0.44,0.85,0.768,0.051,0.031,1.51,7.77,-0.192,0
5,1996,6,1.102,0.881,0.828,0.45,0.444,0.059,0.036,0.458,0.845,0.769,0.053,0.023,1.44,7.74,-0.126,0
6,1996,7,1.131,0.873,0.815,0.429,0.42,0.025,-0.008,0.432,0.881,0.79,0.058,0.033,1.47,7.82,-0.224,0
7,1996,8,1.138,0.935,0.898,0.506,0.498,0.099,0.078,0.506,0.857,0.799,0.037,0.021,1.45,7.77,-0.072,0
8,1996,9,1.252,1.072,1.019,0.648,0.641,0.264,0.227,0.649,0.845,0.755,0.053,0.037,1.5,7.79,-0.045,0
9,1996,10,1.54,1.024,0.975,0.601,0.597,0.22,0.178,0.601,0.846,0.755,0.049,0.042,1.55,7.79,-0.083,0
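
Judging by the `'  Highest'` and `' Month'` keys used in the next code cell, the column names in this file appear to carry leading spaces. A minimal sketch of one way to avoid that at read time, using `read_csv`'s `skipinitialspace` option on an invented CSV string (`raw` is hypothetical and only mimics the padded header):

```python
import io
import pandas as pd

# Hypothetical CSV text with padded headers, standing in for the real file
raw = "Year, Month,  Highest\n1996, 1, 1.251\n1996, 2, 1.374\n"

# skipinitialspace=True drops whitespace that follows each delimiter,
# in the header row as well as in the data rows
demo = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(demo.columns))
```

With clean names, columns can be addressed as `demo['Highest']` rather than by their padded spellings.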


There's some superfluous info here -- all we'll be working with in this tutorial is the "Highest" column, which is the maximum water level for a given month. 

In addition, it would make things clearer to reshape the data so that each row is a year and each column is a month, matching the layout of the SOI dataframe.

In [154]:
# Transpose each month's 'Highest' data: one column per month, one row per year
# (the column names in this file carry leading spaces, hence '  Highest' and ' Month')
month_names = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

# Construct our transposed dataframe
wlev_transposed_df = pd.DataFrame(
    {name : wlev_df.loc[wlev_df[' Month'] == month, '  Highest'].to_numpy()
     for month, name in enumerate(month_names, start=1)
    })
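
The same long-to-wide reshaping can also be sketched with `DataFrame.pivot`. The frame below is invented for illustration, and its unpadded column names (`Year`, `Month`, `Highest`) are an assumption; the real file's names would need stripping first.

```python
import pandas as pd

# Hypothetical miniature of the water level table (values invented)
demo_wlev = pd.DataFrame({
    'Year':    [1996, 1996, 1996, 1997, 1997, 1997],
    'Month':   [1, 2, 3, 1, 2, 3],
    'Highest': [1.25, 1.37, 1.05, 1.10, 1.20, 1.30],
})

# Reshape long -> wide: one row per year, one column per month
demo_wide = demo_wlev.pivot(index='Year', columns='Month', values='Highest')

# Swap month numbers for the abbreviations used in the SOI table
demo_wide = demo_wide.rename(columns={1: 'JAN', 2: 'FEB', 3: 'MAR'})

print(demo_wide)
```

Unlike building one array per month, `pivot` keeps the year as the index, so a missing month shows up as `NaN` in the right row instead of silently shortening a column.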

This would also be a good time to standardize the data (this was unnecessary with the SOI data, as they were provided by NOAA as standardized anomalies).

In [155]:
# We use a pipeline to a) impute any missing data with the median (doesn't look like anything's missing here, but 
# this improves reusability); and b) standardize each month's data individually
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn_pandas import DataFrameMapper

num_attributes = list(wlev_transposed_df)

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

full_pipeline = DataFrameMapper([
    (num_attributes, num_pipeline)
])

wlev_clean = full_pipeline.fit_transform(wlev_transposed_df)
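
To make the two pipeline steps concrete, here is a minimal sketch of the same arithmetic in plain pandas, on an invented two-column frame (`demo` is hypothetical). It assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly maxima with one gap (values invented)
demo = pd.DataFrame({
    'JAN': [1.0, 2.0, np.nan, 3.0],
    'FEB': [2.0, 4.0, 6.0, 8.0],
})

# a) Impute: fill each column's gaps with that column's median
filled = demo.fillna(demo.median())

# b) Standardize each column: subtract its mean, then divide by its
#    population standard deviation (ddof=0)
standardized = (filled - filled.mean()) / filled.std(ddof=0)

print(standardized.round(3))
```

Each column of `standardized` has mean 0 and unit variance, so a value of 1.0 means "one standard deviation above that month's average", which makes months with different typical water levels comparable.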


In [156]:
years = np.array(range(1996,2018))
# Construct final (transposed, imputed, and standardized) water level dataframe
wlev_clean_df = pd.DataFrame(
    {'YEAR' : years,
     'JAN' : wlev_clean[:,0],
     'FEB' : wlev_clean[:,1],
     'MAR' : wlev_clean[:,2],
     'APR' : wlev_clean[:,3],
     'MAY' : wlev_clean[:,4],
     'JUN' : wlev_clean[:,5],
     'JUL' : wlev_clean[:,6],
     'AUG' : wlev_clean[:,7],
     'SEP' : wlev_clean[:,8],
     'OCT' : wlev_clean[:,9],
     'NOV' : wlev_clean[:,10],
     'DEC' : wlev_clean[:,11]
    })
wlev_clean_df.head(12)

Unnamed: 0,YEAR,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC
0,1996,-0.02693078,0.445061,-1.253301,-1.060545,-0.893991,-1.012027,-0.196223,-0.611204,-0.661478,0.044815,-0.152971,0.408277
1,1997,-0.911158,0.300004,-0.939065,1.989432,-1.069758,1.989676,0.536838,-0.667657,-1.120031,0.637162,-0.119782,0.397043
2,1998,2.616774,2.98356,-0.255141,1.215259,1.063697,-1.075055,-1.11255,1.329381,-0.651652,-1.74813,-0.644167,-0.698335
3,1999,-0.2289116,-0.288283,0.779987,-1.040524,1.215221,-0.027217,1.20881,1.389363,1.198938,-1.453944,-1.019201,-0.642162
4,2000,1.84027,-1.416504,0.428783,1.482215,1.833438,-0.586589,0.241577,-0.265427,-0.094837,-1.517552,-0.388612,-1.305006
5,2001,-1.184954,-1.339946,-0.329079,-1.294132,-0.500029,-1.626549,1.249536,-0.854659,-0.209475,-0.229494,-1.218335,-0.990436
6,2002,-0.5161732,-0.723454,-0.920581,-0.733524,-1.930413,-0.689009,-0.623842,-0.688827,-0.848174,-0.380563,-0.644167,-0.097282
7,2003,9.96639e-16,0.533707,0.675242,1.502237,0.142432,-0.090244,-1.010736,-1.179266,3.11176,-0.448146,-0.050085,-0.378148
8,2004,-0.7316194,-0.513927,0.447267,-1.594458,-1.797072,-0.27145,-0.02314,-0.064312,-0.252055,-0.583312,0.613693,0.228523
9,2005,-0.2334001,0.017949,-0.45847,-0.012741,1.263709,0.051568,-0.063865,-0.530052,-0.651652,-0.273224,-1.049071,-0.546668


## Exploratory Data Analysis

The data are now ready for exploratory analysis. The first step is to make some plots. It may be useful here to look at plots for each of the three phases of ENSO (Warm, Neutral and Cold). Two open questions remain:

    a) how to match lagged ENSO data to contemporaneous water heights, and

    b) how to label months as Warm, Neutral or Cold (possibly based on official NOAA designations).
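
One possible way to approach the two open items above, offered as a rough sketch rather than a settled method: melt both wide tables into one row per month, sort chronologically, and use `shift` to pair each water level with the SOI from some months earlier. The demo values are the 1996 and 1997 rows shown earlier (water levels rounded). The one-month `lag` and the ±0.5 `threshold` are placeholder assumptions, not official NOAA designations. The sign convention (sustained negative SOI corresponds to warm, El Nino conditions) follows NOAA's description of the index.

```python
import pandas as pd

months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
month_num = {name: i for i, name in enumerate(months, start=1)}

# Demo tables shaped like soi_df and wlev_clean_df (1996 and 1997 rows only)
demo_soi = pd.DataFrame([
    [1996, 1.6, 0.4, 1.9, 1.3, 0.5, 1.9, 1.1, 1.2, 1.0, 1.0, -0.1, 1.5],
    [1997, 0.8, 2.9, -0.7, -1.0, -2.2, -2.3, -1.2, -2.4, -2.4, -2.4, -2.0, -1.6],
], columns=['YEAR'] + months)
demo_wlev = pd.DataFrame([
    [1996, -0.027, 0.445, -1.253, -1.061, -0.894, -1.012,
     -0.196, -0.611, -0.661, 0.045, -0.153, 0.408],
    [1997, -0.911, 0.300, -0.939, 1.989, -1.070, 1.990,
     0.537, -0.668, -1.120, 0.637, -0.120, 0.397],
], columns=['YEAR'] + months)

def to_long(wide, value_name):
    """Melt a YEAR x month table into one row per (year, month)."""
    long_df = wide.melt(id_vars='YEAR', var_name='MONTH', value_name=value_name)
    long_df['MONTH_NUM'] = long_df['MONTH'].map(month_num)
    return long_df

merged = to_long(demo_soi, 'SOI').merge(
    to_long(demo_wlev, 'WLEV'), on=['YEAR', 'MONTH', 'MONTH_NUM'])
merged = merged.sort_values(['YEAR', 'MONTH_NUM']).reset_index(drop=True)

# Pair each month's water level with the SOI from `lag` months earlier
lag = 1  # assumed lag in months; the appropriate value is an open question
merged['SOI_LAGGED'] = merged['SOI'].shift(lag)

def label_phase(soi, threshold=0.5):
    """Label one SOI value; the threshold is an illustrative assumption."""
    if pd.isna(soi):
        return None  # no lagged value available for the earliest months
    if soi <= -threshold:
        return 'Warm'
    if soi >= threshold:
        return 'Cold'
    return 'Neutral'

merged['PHASE'] = merged['SOI_LAGGED'].map(label_phase)
print(merged[['YEAR', 'MONTH', 'WLEV', 'SOI_LAGGED', 'PHASE']].head())
```

Because the long table is sorted chronologically before shifting, the lag carries across year boundaries (January 1997 is paired with December 1996). The first `lag` rows have no lagged value and are left unlabeled.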

## Findings

## Conclusions and Future Directions