# Neural Network Classification

In this exercise we will use a neural network to predict a binary variable. The challenge consists of predicting whether on a specific day it was raining or not (a True/False response) based on other environmental variables. We will hypothesize that rainy days can be predicted by looking at changes in the soil water storage from sensors located on the same weather station and based on the cloudiness. Our hypothesis rests on the fact that it is cloudy when it rains and that part of the rainfall infiltrates the soil.

Potential errors could arise from days in which the sky is cloudy but it is not raining, or perhaps due to small changes in water content as a consequence of the natural redistribution of soil water in the soil. Such a application could be implemented in environmental monitoring netowrks for detecting malfunctioning rain gauges.

In [18]:
import pandas as pd
from bokeh.plotting import figure, show, output_notebook, gridplot
output_notebook()

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
# Load weather data
df = pd.read_csv('../datasets/gypsum_ks_daily_2018.csv')

# Inpsect columns so that we can select a subset of the data
df.columns


Index(['TIMESTAMP', 'STATION', 'PRESSUREAVG', 'PRESSUREMAX', 'PRESSUREMIN',
       'SLPAVG', 'TEMP2MAVG', 'TEMP2MMIN', 'TEMP2MMAX', 'TEMP10MAVG',
       'TEMP10MMIN', 'TEMP10MMAX', 'RELHUM2MAVG', 'RELHUM2MMAX', 'RELHUM2MMIN',
       'RELHUM10MAVG', 'RELHUM10MMAX', 'RELHUM10MMIN', 'VPDEFAVG', 'PRECIP',
       'SRAVG', 'SR', 'WSPD2MAVG', 'WSPD2MMAX', 'WSPD10MAVG', 'WSPD10MMAX',
       'WDIR2M', 'WDIR2MSTD', 'WDIR10M', 'WDIR10MSTD', 'SOILTMP5AVG',
       'SOILTMP5MAX', 'SOILTMP5MIN', 'SOILTMP10AVG', 'SOILTMP10MAX',
       'SOILTMP10MIN', 'SOILTMP5AVG655', 'SOILTMP10AVG655', 'SOILTMP20AVG655',
       'SOILTMP50AVG655', 'VWC5CM', 'VWC10CM', 'VWC20CM', 'VWC50CM'],
      dtype='object')

In [4]:
# Only keep the columns we need
df = df[['TIMESTAMP','PRECIP','SR','VPDEFAVG','VWC5CM','VWC10CM','VWC20CM']]

In [5]:
# Convert date to datetim format for easier plotting
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%m/%d/%y %H:%M')
df.head() # Check date conversion


Unnamed: 0,TIMESTAMP,PRECIP,SR,VPDEFAVG,VWC5CM,VWC10CM,VWC20CM
0,2018-01-01,0.0,9.58,0.08,0.1377,0.1167,0.2665
1,2018-01-02,0.0,10.95,0.08,0.1234,0.1021,0.2642
2,2018-01-03,0.0,10.27,0.16,0.1206,0.0965,0.2353
3,2018-01-04,0.0,8.28,0.17,0.1235,0.0973,0.2094
4,2018-01-05,0.0,10.89,0.23,0.1249,0.0976,0.2047


In [6]:
# Check the presence of missing values
df.isna().sum()

TIMESTAMP    0
PRECIP       0
SR           0
VPDEFAVG     1
VWC5CM       0
VWC10CM      0
VWC20CM      0
dtype: int64

In [7]:
# Fill the few missing values using forward fill
df.fillna(method='ffill', inplace=True)

In [8]:
# Observed rainfall boolean
df['raining_obs'] = df['PRECIP'] > 1 # Assume that smaller events cannot be detected with our appraoch

In [9]:
# Approximate clear sky solar radiation (use model or rolling max)
df['SR_clearsky'] = df['SR'].rolling(12, min_periods=1, center=True).max()
df['delta_sr'] = df['SR_clearsky'] - df['SR']


In [10]:
# Compute storage
df['storage'] = df['VWC5CM']*50 + (df['VWC5CM']+df['VWC10CM'])/2*50 + (df['VWC10CM']+df['VWC20CM'])/2*100

In [13]:
# Compute changes in soil water storage
df['delta_storage'] = df['storage'].diff()
df['delta_storage'].fillna(0, inplace=True)
df.head()


Unnamed: 0,TIMESTAMP,PRECIP,SR,VPDEFAVG,VWC5CM,VWC10CM,VWC20CM,raining_obs,SR_clearsky,delta_sr,storage,delta_storage
0,2018-01-01,0.0,9.58,0.08,0.1377,0.1167,0.2665,False,10.95,1.37,32.405,0.0
1,2018-01-02,0.0,10.95,0.08,0.1234,0.1021,0.2642,False,10.95,0.0,30.1225,-2.2825
2,2018-01-03,0.0,10.27,0.16,0.1206,0.0965,0.2353,False,10.95,0.68,28.0475,-2.075
3,2018-01-04,0.0,8.28,0.17,0.1235,0.0973,0.2094,False,10.95,2.67,27.03,-1.0175
4,2018-01-05,0.0,10.89,0.23,0.1249,0.0976,0.2047,False,10.95,0.06,26.9225,-0.1075


In [27]:
f1 = figure(plot_width=600, plot_height=400, x_axis_type='datetime')
f1.line(source=df, x='TIMESTAMP', y='SR')
f1.line(source=df, x='TIMESTAMP', y='SR_clearsky', color='tomato')
f1.yaxis.axis_label='Solar radiation (MJ/m^2)'

f2 = figure(plot_width=600, plot_height=400, x_axis_type='datetime')
f2.line(source=df, x='TIMESTAMP', y='PRECIP')
f2.circle(source=df[df['raining_obs'] == True], x='TIMESTAMP', y='PRECIP', color='tomato')
f2.yaxis.axis_label='Precipitation (mm)'


f3 = figure(plot_width=600, plot_height=400, x_axis_type='datetime')
f3.line(source=df, x='TIMESTAMP', y='storage')
f3.yaxis.axis_label='Soil water storage (mm)'

grid = gridplot([[f1],[f2],[f3]])
show(grid)

## Define training and test sets

In [28]:
# Define train and test sets
X = df[['delta_storage','delta_sr']]
y = df['raining_obs']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


## Normalize training sets

In [29]:
X_train_normalized = StandardScaler().fit_transform(X_train)
X_test_normalized = StandardScaler().fit_transform(X_test)

## Fit neural netowork model for classification

In [30]:
clf = MLPClassifier(solver='adam', 
                    alpha=1e-5,
                    hidden_layer_sizes=(15, 2),
                    random_state=1,
                    max_iter=5000).fit(X_train_normalized, y_train)


In [31]:
y_test_pred = clf.predict(X_test_normalized)


In [32]:
clf.score(X_train_normalized, y_train)

0.9010989010989011

In [33]:
clf.score(X_test_normalized, y_test)

0.8695652173913043

In [59]:
dates = df.iloc[y_test.index]['TIMESTAMP']

# Use scatter plots because dates are not in order and are discontinuous
f4=figure(title='Test',plot_width=700, plot_height=250, x_axis_type='datetime')
f4.scatter(dates, y_test, color='black', size=10)
f4.scatter(dates, y_test_pred, color='tomato')
f4.yaxis.axis_label="Rain (1) or No rain (0)"

show(f4)


## Practice

- What happens if you use only precipitation as the explanatory variable? Often naive examples can help you determine whether the code is working as expected. What is your expectation if we try to predict days with or without rain using rain itseld as a predictor?

- Try a different set of explanatory variables, such as volumetric water content from individual soil depths, or perhaps other varaibles like relative humidity. During rainfall events the relative humidity tends to be close to 100%, so by adding this varaible it may help improving the model.

- Try the same example using logistic regression. Using the same explanatory varaibles, can a simple logistic regression perform better than a neural network? The logistic regression in Scikit-Learn is implemented following the same steps as the machine learning classifier.

In [60]:
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(cv=25, random_state=1).fit(X_train_normalized, y_train)