# TRAINING A NEURAL NETWORK WITH GOOGLE TRENDS AND STOCK MARKET DATA

Example adapted from Dinov (2018), Chapter 11

In this case study, we use the Google Trends and stock market dataset. These daily data (between 2008 and 2009) can be used to examine the associations between Google search trends and a daily market index, the Dow Jones Industrial Average.



Variables are:

- Index: Time Index of the Observation
- Date: Date of the observation (Format: YYYY-MM-DD)
- Unemployment: The Google Unemployment Index tracks queries related to "unemployment, social, social security, unemployment benefits" and so on.
- Rental: The Google Rental Index tracks queries related to “rent, apartments, for rent, rentals,” etc.
- RealEstate: The Google Real Estate Index tracks queries related to “real estate, mortgage, rent, apartments” and so on.
- Mortgage: The Google Mortgage Index tracks queries related to "mortgage, calculator, mortgage calculator, mortgage rates".
- Jobs: The Google Jobs Index tracks queries related to "jobs, city, job, resume, career, monster" and so forth.
- Investing: The Google Investing Index tracks queries related to "stock, finance, capital, yahoo finance, stocks", etc.
- DJI_Index: The Dow Jones Industrial (DJI) index. These data are interpolated from 5 records per week (Dow Jones stocks are traded on week-days only) to 7 days per week to match the constant 7-day records of the Google-Trends data.
- StdDJI: The standardized DJI index, computed as StdDJI = 3 + (DJI - 11,091) / 1,501, where m = 11,091 and s = 1,501 are the approximate mean and standard deviation of the DJI for the period 2005–2011.
- 30-Day Moving Average Data Columns: The 8 variables below are the 30-day moving averages of the 8 corresponding (raw) variables above.
    - Unemployment30MA, Rental30MA, RealEstate30MA, Mortgage30MA, Jobs30MA, Investing30MA, DJI_Index30MA, StdDJI_30MA
- 180-Day Moving Average Data Columns: The 8 variables below are the 180-day moving averages of the 8 corresponding (raw) variables.
    - Unemployment180MA, Rental180MA, RealEstate180MA, Mortgage180MA, Jobs180MA, Investing180MA, DJI_Index180MA, StdDJI_180MA
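
The derived columns follow directly from the definitions in the list. Below is a minimal sketch on made-up index values. The helper names `std_dji()` and `trailing_ma()` are hypothetical, and the trailing (backward-looking) window is an assumption, since the source does not state how the moving-average window is aligned.

```r
# Hypothetical helper: standardized DJI as defined in the variable list
std_dji <- function(dji, m = 11091, s = 1501) 3 + (dji - m) / s

# Hypothetical helper: trailing moving average over the last k observations
# (assumed alignment; the first k - 1 values are NA)
trailing_ma <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

dji <- c(11091, 12592, 9590, 11091)   # made-up index values
std_vals <- std_dji(dji)
ma_vals  <- trailing_ma(dji, k = 2)   # k = 30 or 180 for the dataset's columns

std_vals
ma_vals
```

A value equal to the mean maps to 3, and each standard deviation above or below the mean adds or subtracts 1.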

Here we use the RealEstate as our dependent variable.

Let’s see if the Google Real Estate Index could be predicted by other variables in the dataset.

## Load libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(neuralnet) # for neural network model
library(NeuralNetTools) # for visualizing neural nets
library(fastDummies) # for dummification
#library(Formula) # for extended formulas
library(caret) # for confusion matrix

In [None]:
google <- fread("../data/csv/CaseStudy13_GoogleTrends_Markets_Data.csv")

## Explore and wrangle the data

First view the data:

In [None]:
google

In [None]:
str(google)

The first two columns (Index and Date) are not used as predictors, so we delete them:

In [None]:
google[,c("Index", "Date") := NULL]

In [None]:
google

Let's see the numeric ranges of variables:

In [None]:
google %>% purrr::keep(is.numeric) %>% sapply(quantile, na.rm = T) %>% t()

Let's now normalize the values to range 0-1:

In [None]:
google_norm <- google[,BBmisc::normalize(.SD, "range")]

And check the ranges again:

In [None]:
google_norm %>% purrr::keep(is.numeric) %>% sapply(quantile, na.rm = T) %>% t()
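
The "range" method rescales each column to the interval from 0 to 1 using (x - min) / (max - min). Here is a minimal base R sketch of that transformation and its inverse on made-up values. The helper names `range01()` and `unscale01()` are hypothetical and not part of BBmisc.

```r
# Hypothetical helper: min-max scaling, smallest value -> 0, largest -> 1
range01 <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) / (hi - lo)
}

# Hypothetical helper: map scaled values back to the original units
unscale01 <- function(z, lo, hi) z * (hi - lo) + lo

x <- c(2, 4, 6, 10)                 # made-up values
z <- range01(x)
x_back <- unscale01(z, min(x), max(x))

z
x_back
```

Keeping the original minimum and maximum around is what makes it possible to report predictions in the original units later.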

## ANN for numeric prediction

### Split dataset

In [None]:
set.seed(1)
train <- google_norm[,sample(.I, 0.75 * .N)]

In [None]:
google_train <- google_norm[train]
google_test <- google_norm[-train]
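
The same index-based split can be sketched in base R on a made-up data frame (the object names `toy`, `train_idx`, `toy_train`, and `toy_test` are hypothetical): sample 75% of the row indices for training and use negative indexing to keep the rest for testing.

```r
set.seed(1)
n <- 20                                          # made-up number of rows
toy <- data.frame(id = seq_len(n), x = rnorm(n))

# Draw 75% of the row indices without replacement
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]                   # negative indices drop the training rows

nrow(toy_train)
nrow(toy_test)
```

Because the test set is defined as the complement of the sampled indices, no row can appear in both sets.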

### Train a model

First keep variable names in a vector:

In [None]:
namess <- names(google_norm)
namess

And create a formula that models RealEstate (the third variable) as a function of the other seven of the first eight variables:

In [None]:
formula1 <- reformulate(namess[1:8][-3], namess[3])
formula1
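
As a standalone illustration of `reformulate()`, the sketch below builds the same kind of formula from a character vector. The vector `vars` is typed out by hand from the variable list and assumes the columns appear in that order.

```r
vars <- c("Unemployment", "Rental", "RealEstate", "Mortgage",
          "Jobs", "Investing", "DJI_Index", "StdDJI")

# The response is the third name; the predictors are the remaining seven
f <- reformulate(termlabels = vars[-3], response = vars[3])

f
all.vars(f)   # response first, then the predictors
```

Building the formula programmatically avoids retyping the variable names when the predictor set changes.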

And run the model:

In [None]:
google_model <- neuralnet::neuralnet(formula1, data = google_train)

Get some model parameters:

In [None]:
google_model$result.matrix

And plot the model:

In [None]:
NeuralNetTools::plotnet(google_model, cex_val = 0.4, line_stag = 0)

We have only a single hidden node, H1. B1 and B2 are bias terms: constant values, similar to the intercept in linear regression.

The width of each line shows the magnitude of the weight, and the color is black for positive and gray for negative weights.
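
With a single hidden node, the network is a small closed-form function of the inputs. The sketch below computes one prediction by hand, assuming the neuralnet defaults (logistic activation in the hidden layer, linear output node). All weights and inputs are made up for illustration and are not the fitted values.

```r
logistic <- function(z) 1 / (1 + exp(-z))

x  <- c(0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.7)     # seven normalized inputs (made up)
w1 <- c(0.8, -1.2, 0.5, 0.3, -0.4, 0.9, 0.1)   # input-to-hidden weights (made up)
b1 <- 0.1                                      # bias B1 feeding the hidden node
w2 <- 1.5                                      # hidden-to-output weight (made up)
b2 <- -0.3                                     # bias B2 feeding the output node

h1  <- logistic(b1 + sum(w1 * x))   # hidden node activation, always in (0, 1)
out <- b2 + w2 * h1                 # linear output node

h1
out
```

The bias terms play the role of intercepts: B1 shifts the input to the hidden node, and B2 shifts the final output.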

Function garson() obtains and plots (using the ggplot2 infra-structure) a bar plot with the feature relevance
scores of each of the input variables. It is interesting to observe the ranking of the features provided by the garson() function:

(From Torgo (2017), Chapter 3)

In [None]:
NeuralNetTools::garson(google_model) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

So the variables with the most impact are Rental, Unemployment, and Jobs.

To get the weights only:

In [None]:
NeuralNetTools::neuralweights(google_model)

### Evaluate model performance

Let's get the predictions on test data

In [None]:
google_pred <- neuralnet::compute(google_model, google_test[,c(1:2, 4:8)])

In [None]:
google_pred

Let's get the correlation between predictions and actual values in the test set:

In [None]:
cor(google_pred$net.result, google_test$RealEstate)
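
Correlation shows how closely the predictions track the actual values, but it is insensitive to scale and offset, so an error measure can be a useful companion. Here is a minimal sketch on made-up vectors; `actual` and `predicted` are hypothetical stand-ins for the test targets and the network output.

```r
actual    <- c(0.10, 0.40, 0.35, 0.80)   # made-up normalized targets
predicted <- c(0.12, 0.38, 0.30, 0.85)   # made-up predictions

r    <- cor(predicted, actual)                 # linear association
rmse <- sqrt(mean((predicted - actual)^2))     # root mean squared error
mae  <- mean(abs(predicted - actual))          # mean absolute error

c(cor = r, rmse = rmse, mae = mae)
```

Both error measures are on the normalized 0-1 scale here, since the targets were normalized before training.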

### Improve model performance

Now we will add four hidden nodes

In [None]:
google_model2 <- neuralnet::neuralnet(formula1, data = google_train, hidden = 4)

In [None]:
google_model2$result.matrix

In [None]:
NeuralNetTools::plotnet(google_model2, cex_val = 0.4, line_stag = 0)

Now we have a smaller error

The weights are:

In [None]:
NeuralNetTools::neuralweights(google_model2)

Let's get the predictions:

In [None]:
google_pred2 <- neuralnet::compute(google_model2, google_test[,c(1:2, 4:8)])

In [None]:
cor(google_pred2$net.result, google_test$RealEstate)

The correlation between predictions and actual values in the test set is higher

### Adding more layers

Now we will use three hidden layers with 4, 3, and 3 nodes, respectively:

In [None]:
google_model3 <- neuralnet::neuralnet(formula1, data = google_train, hidden = c(4,3,3))

The error is even lower:

In [None]:
google_model3$result.matrix

Plot the network:

In [None]:
NeuralNetTools::plotnet(google_model3, cex_val = 0.4, line_stag = 0)

In [None]:
NeuralNetTools::neuralweights(google_model3)

In [None]:
google_pred3 <- neuralnet::compute(google_model3, google_test[,c(1:2, 4:8)])

In [None]:
cor(google_pred3$net.result, google_test$RealEstate)

Correlation is slightly higher

## ANN for classification

In practice, ANN models are also useful as classifiers. Let’s demonstrate this by using the stock market data again. We will discretize the samples into three classes according to their RealEstate values: those above the 75th percentile are labeled 0, those below the 25th percentile are labeled 2, and all others are labeled 1. Even in the classification setting, the response must still be numeric.

### Discretization

In [None]:
classes <- cut(-google_norm$RealEstate, quantile(-google_norm$RealEstate, c(0, 0.25, 0.75, 1)), include.lowest=TRUE) %>% as.integer %>% -1

In [None]:
classes <- google_norm[,cut(-RealEstate,
                            quantile(-RealEstate, c(0, 0.25, 0.75, 1)),
                            include.lowest=TRUE) %>%
            as.integer %>% -1]
classes
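
To see why the values are negated before cutting, consider a minimal base R sketch on made-up numbers. Negation reverses the ordering, so the largest values land in the first bin and receive class 0.

```r
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)    # made-up values

# Quartile breaks on the negated values: the largest x become the smallest -x
breaks <- quantile(-x, c(0, 0.25, 0.75, 1))

# cut() returns bin numbers 1, 2, 3; subtracting 1 gives classes 0, 1, 2
cls <- as.integer(cut(-x, breaks, include.lowest = TRUE)) - 1L

data.frame(x, cls)
```

In this toy example the three largest values get class 0 (High), the two smallest get class 2 (Low), and the four in between get class 1 (Median).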

Compare the normalized values and classes:

In [None]:
cbind(google_norm$RealEstate %>% round(2), classes)

Create a copy of the google_norm into google_class.

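Note that you should create a deep copy with the copy() function. A plain assignment only creates another reference to the same data.table, so := would also modify google_norm, and the object may be locked against the := operator altogether: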

In [None]:
google_class <- copy(google_norm)

And replace RealEstate with class values:

In [None]:
google_class[,RealEstate := classes]

In [None]:
google_class

We can see the distribution of classes:

In [None]:
google_class[,table(RealEstate)]

### Dummification

We can dummify the classes using fastDummies:

In [None]:
class_dummies1 <- google_class[,fastDummies::dummy_cols(.(RealEstate = RealEstate),
                            remove_first_dummy = F)] %>%
    dplyr::select(-RealEstate) %>%
    magrittr::set_colnames(c("Median", "High", "Low")) %>%
    dplyr::select(c("High", "Median", "Low"))

class_dummies1 %>% str

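The output is a data frame. Note that the renaming above assumes the dummy columns are returned in the order of classes 1, 0, 2, so it is worth checking the column names before overwriting them.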

Or using base model.matrix() function:

In [None]:
class_dummies2 <- google_class[,model.matrix(~factor(RealEstate)-1)] %>%
    magrittr::set_colnames(c("High", "Median", "Low"))

class_dummies2 %>% str

The output is a matrix with rownames

We can use either
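
As a small standalone illustration of the `model.matrix()` approach, the sketch below expands made-up class labels into indicator columns. Factor levels are sorted, so the columns come out in the order 0, 1, 2.

```r
cls <- c(1, 0, 2, 1, 0)                       # made-up class labels

# One indicator column per factor level; "- 1" drops the intercept column
ind <- model.matrix(~ factor(cls) - 1)
colnames(ind) <- c("High", "Median", "Low")   # levels 0, 1, 2

ind
rowSums(ind)                                  # each row has exactly one 1
```

Each row contains a single 1 in the column of its class, which is the target format the network is trained on.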

### Split dataset

First split the google_class

In [None]:
google_train_class <- google_class[train]
google_test_class <- google_class[-train]

Then get the x and y values also from the dummies:

In [None]:
train_x <- google_train_class[,c(1:2, 4:8)]
train_y_ind <- as.data.table(class_dummies1)[train]

In [None]:
train_set <- cbind(train_x, train_y_ind)

In [None]:
train_set

### Train a model

In [None]:
names2 <- names(train_set)
names2

In [None]:
formula2 <- paste(paste(names2[8:10], collapse = " + "),
                  paste(names2[1:7], collapse = " + "),
                  sep = " ~ ") %>% as.formula

formula2

We use a non-linear output (linear.output = F) and print progress every 2,000 iterations.

Note that threshold and stepmax should be tuned so that training converges before the maximum number of steps is reached; otherwise, computing the predictions may cause errors:

In [None]:
nn_single <- neuralnet::neuralnet(formula2,
                                 data = train_set,
                                 hidden = 4,
                                 linear.output = F,
                                 lifesign = "full",
                                 lifesign.step = 2000,
                                 threshold = 0.03,
                                 stepmax = 200000)

Plot the model:

In [None]:
NeuralNetTools::plotnet(nn_single, cex_val = 0.4, line_stag = 0)

We get the predictions on the test data:

In [None]:
prediction1 <- neuralnet::compute(nn_single, google_test_class[,c(1:2, 4:8)])
prediction1

And the net results:

In [None]:
pred_results <- prediction1$net.result

In [None]:
str(pred_results)

In [None]:
pred_results %>% round

The predictions have three columns, one for each of the class values 0, 1, 2 (High, Median, Low)

Now let's convert the dummies back into numeric class values of 0,1,2

In [None]:
class_preds <- apply(pred_results, 1, which.max) - 1
class_preds
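
The conversion picks, for each row, the column with the largest output and shifts the 1-based column index to the 0-based class label. Here is a minimal sketch on a made-up output matrix (`probs` is a hypothetical stand-in for the network output):

```r
probs <- rbind(c(0.90, 0.08, 0.02),
               c(0.10, 0.70, 0.20),
               c(0.05, 0.15, 0.80))            # made-up network outputs
colnames(probs) <- c("High", "Median", "Low")

# which.max gives the column index (1, 2, 3); subtract 1 for classes 0, 1, 2
pred_class <- apply(probs, 1, which.max) - 1
pred_class
```

Taking the maximum is preferable to rounding each column, because rounding can yield rows with no 1 or with more than one.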

And these are the actual values: 

In [None]:
class_test <- google_test_class[,RealEstate]
class_test

In [None]:
# caret expects predicted classes in rows and reference classes in columns
table(class_preds, class_test) %>% caret::confusionMatrix()

We have an accuracy of 96.7%
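
The overall accuracy is the share of observations on the diagonal of the confusion table. Here is a minimal base R sketch on made-up labels; fixing the factor levels keeps the table square even if a class is never predicted.

```r
lv <- 0:2
actual_cls <- factor(c(0, 1, 1, 2, 2, 1), levels = lv)   # made-up actual classes
pred_cls   <- factor(c(0, 1, 2, 2, 2, 1), levels = lv)   # made-up predictions

tab <- table(Prediction = pred_cls, Reference = actual_cls)
accuracy <- sum(diag(tab)) / sum(tab)   # correctly classified / total

tab
accuracy
```

In this toy example five of the six observations are classified correctly.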

We can also have multiple hidden layers in our model:

In [None]:
nn_single2 <- neuralnet::neuralnet(formula2,
                                 data = train_set,
                                 hidden = c(4,5),
                                 linear.output = F,
                                 lifesign = "full",
                                 lifesign.step = 2000,
                                 threshold = 0.03,
                                 stepmax = 200000)

Plot the model:

In [None]:
NeuralNetTools::plotnet(nn_single2, cex_val = 0.4, line_stag = 0)

Get the predictions:

In [None]:
class_preds2 <- neuralnet::compute(nn_single2,
                   google_test_class[,c(1:2, 4:8)])$net.result %>%
                    apply(1, which.max) - 1

class_preds2

Compare actual and predicted classes:

In [None]:
# caret expects predicted classes in rows and reference classes in columns
table(class_preds2, class_test) %>% caret::confusionMatrix()

The accuracy is lower