# 3rd Model: DeepGraphCNN: Stock Price Prediction using DeepGraphCNN Neural Networks. The model combines GCN layers and CNN layers, with an MLP added as the last stage to predict stock prices.

# Input graphs were created from the Pearson, Spearman, and Kendall Tau correlation coefficients of historical stock prices. Another graph is created from financial news articles.
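The three coefficients differ in what they measure: Pearson captures linear association, while Spearman and Kendall Tau work on ranks and so capture any monotonic relationship. The following is a minimal pure-Python sketch of that difference, not the notebook's implementation (which relies on `DataFrame.corr`). The helper names and the toy series are illustrative assumptions, and the Kendall variant shown is the simple tau-a, which does not correct for ties.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs. Assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# A monotonic but non-linear toy relationship (y = x squared)
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
r_kendall = kendall_tau(x, y)
print(r_pearson, r_spearman, r_kendall)
```

On this toy data the rank-based coefficients reach their maximum while Pearson stays below it, which is one reason the three graphs built later can have different edges for the same threshold.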

# To make execution easier (everything runs in one go), I have kept multiple approaches (Pearson, Spearman, Kendall Tau, and news based) in the same file. One big code file can be difficult to handle; this was done only to simplify execution.

# Because I initially tried the approaches separately and then brought the code together, some code may be redundant or repeated. I may have cleaned up some of it.

# A use case of DeepGraphCNN for graph classification
# https://stellargraph.readthedocs.io/en/latest/demos/graph-classification/dgcnn-graph-classification.html


# Import Libraries

In [1]:
# import libraries
import os
import pandas as pd
import math

In [2]:
# Import Libraries for Graph, GNN, and GCN
import stellargraph as sg
from stellargraph import StellarGraph
from stellargraph.layer import DeepGraphCNN
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.mapper import PaddedGraphGenerator
from stellargraph.layer import GCN

In [3]:
# Machine learning related library imports
from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from IPython.display import display, HTML
import matplotlib.pyplot as plt
%matplotlib inline
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

In [4]:
# Flags: drop columns or rows that contain NaN in the stock price data
# I did not need to use these options much
drop_cols_with_na = 1
drop_rows_with_na = 1

# Dataset: 30 companies from the Fortune 500 (the stocks used in the paper)

In [5]:
df_s = pd.DataFrame();
data_file = "per-day-fortune-30-company-stock-price-data.csv";
df_s = pd.read_csv("./data/" + data_file, low_memory = False);
df_s.head()

Unnamed: 0,Date,AAPL,ABC,AMZN,ANTM,BA,BAC,CAH,COST,CVS,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
0,2017-01-03 00:00:00,29.0375,82.610001,37.683498,,156.970001,22.530001,74.480003,159.729996,80.349998,...,43.546665,86.790001,32.492447,161.449997,102.519997,54.580002,82.959999,56.0,68.660004,90.889999
1,2017-01-04 00:00:00,29.004999,84.660004,37.859001,,158.619995,22.950001,75.629997,159.759995,79.75,...,44.146667,87.260002,32.303623,161.910004,103.139999,54.52,82.980003,56.049999,69.059998,89.889999
2,2017-01-05 00:00:00,29.1525,83.68,39.022499,,158.710007,22.68,74.5,162.910004,81.419998,...,43.426666,86.739998,32.21299,162.179993,102.129997,54.639999,83.029999,55.18,69.209999,88.550003
3,2017-01-06 00:00:00,29.477501,84.800003,39.7995,,159.100006,22.68,75.330002,162.830002,82.199997,...,43.919998,85.400002,31.20846,162.410004,103.190002,53.259998,83.099998,55.040001,68.260002,88.5
4,2017-01-09 00:00:00,29.747499,85.480003,39.846001,,158.320007,22.549999,74.760002,160.970001,81.699997,...,43.380001,84.019997,30.81571,161.949997,102.419998,52.68,82.550003,54.240002,68.709999,87.040001


In [6]:
# You can see that the ANTM stock price data is empty

# Clean the data: replace missing/null values, use correct data types, sort by date (not really required)

In [7]:
# convert Date field to be a Date Type
df_s["Date"] = df_s["Date"].astype('datetime64[ns]')

# Sort data by date, although this is no longer needed because the data was already sorted when I generated it
# df_s = df_s.sort_values( by = ['Ticker','Date'], ascending = True )
df_s = df_s.sort_values( by = 'Date', ascending = True )
df_s.head()

Unnamed: 0,Date,AAPL,ABC,AMZN,ANTM,BA,BAC,CAH,COST,CVS,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
0,2017-01-03,29.0375,82.610001,37.683498,,156.970001,22.530001,74.480003,159.729996,80.349998,...,43.546665,86.790001,32.492447,161.449997,102.519997,54.580002,82.959999,56.0,68.660004,90.889999
1,2017-01-04,29.004999,84.660004,37.859001,,158.619995,22.950001,75.629997,159.759995,79.75,...,44.146667,87.260002,32.303623,161.910004,103.139999,54.52,82.980003,56.049999,69.059998,89.889999
2,2017-01-05,29.1525,83.68,39.022499,,158.710007,22.68,74.5,162.910004,81.419998,...,43.426666,86.739998,32.21299,162.179993,102.129997,54.639999,83.029999,55.18,69.209999,88.550003
3,2017-01-06,29.477501,84.800003,39.7995,,159.100006,22.68,75.330002,162.830002,82.199997,...,43.919998,85.400002,31.20846,162.410004,103.190002,53.259998,83.099998,55.040001,68.260002,88.5
4,2017-01-09,29.747499,85.480003,39.846001,,158.320007,22.549999,74.760002,160.970001,81.699997,...,43.380001,84.019997,30.81571,161.949997,102.419998,52.68,82.550003,54.240002,68.709999,87.040001


In [8]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
df_s_transpose = df_s

try:
  df_s_transpose = df_s_transpose.interpolate(inplace = False)
except Exception:
  # keep the un-interpolated data and continue
  print("An exception occurred. Operation ignored")

# check if any value is null    
df_s_transpose.isnull().values.any()

# show the rows that contain at least one null value (axis = 1 checks across the columns of each row)
df_s_transpose[df_s_transpose.isna().any(axis = 1)]    

An exception occurred. Operation ignored


Unnamed: 0,Date,AAPL,ABC,AMZN,ANTM,BA,BAC,CAH,COST,CVS,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
0,2017-01-03,29.037500,82.610001,37.683498,,156.970001,22.530001,74.480003,159.729996,80.349998,...,43.546665,86.790001,32.492447,161.449997,102.519997,54.580002,82.959999,56.000000,68.660004,90.889999
1,2017-01-04,29.004999,84.660004,37.859001,,158.619995,22.950001,75.629997,159.759995,79.750000,...,44.146667,87.260002,32.303623,161.910004,103.139999,54.520000,82.980003,56.049999,69.059998,89.889999
2,2017-01-05,29.152500,83.680000,39.022499,,158.710007,22.680000,74.500000,162.910004,81.419998,...,43.426666,86.739998,32.212990,162.179993,102.129997,54.639999,83.029999,55.180000,69.209999,88.550003
3,2017-01-06,29.477501,84.800003,39.799500,,159.100006,22.680000,75.330002,162.830002,82.199997,...,43.919998,85.400002,31.208460,162.410004,103.190002,53.259998,83.099998,55.040001,68.260002,88.500000
4,2017-01-09,29.747499,85.480003,39.846001,,158.320007,22.549999,74.760002,160.970001,81.699997,...,43.380001,84.019997,30.815710,161.949997,102.419998,52.680000,82.550003,54.240002,68.709999,87.040001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
748,2019-12-23,71.000000,85.599998,89.650002,,337.549988,35.169998,51.130001,293.309998,74.379997,...,53.033333,112.669998,29.509064,295.089996,179.419998,61.400002,58.570000,53.810001,119.029999,70.290001
749,2019-12-24,71.067497,85.419998,89.460503,,333.000000,35.220001,51.290001,294.230011,74.510002,...,52.993332,113.199997,29.425982,294.540009,179.889999,61.279999,58.349998,53.820000,119.510002,70.019997
750,2019-12-26,72.477501,85.050003,93.438499,,329.920013,35.520000,51.169998,295.730011,74.480003,...,53.006668,112.059998,29.577040,295.649994,180.809998,61.290001,58.900002,54.150002,119.519997,70.129997
751,2019-12-27,72.449997,84.910004,93.489998,,330.140015,35.349998,51.500000,294.109985,74.400002,...,52.939999,110.599998,29.637463,295.970001,181.410004,61.529999,59.020000,53.919998,119.589996,69.889999


In [9]:
df_s_transpose

Unnamed: 0,Date,AAPL,ABC,AMZN,ANTM,BA,BAC,CAH,COST,CVS,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
0,2017-01-03,29.037500,82.610001,37.683498,,156.970001,22.530001,74.480003,159.729996,80.349998,...,43.546665,86.790001,32.492447,161.449997,102.519997,54.580002,82.959999,56.000000,68.660004,90.889999
1,2017-01-04,29.004999,84.660004,37.859001,,158.619995,22.950001,75.629997,159.759995,79.750000,...,44.146667,87.260002,32.303623,161.910004,103.139999,54.520000,82.980003,56.049999,69.059998,89.889999
2,2017-01-05,29.152500,83.680000,39.022499,,158.710007,22.680000,74.500000,162.910004,81.419998,...,43.426666,86.739998,32.212990,162.179993,102.129997,54.639999,83.029999,55.180000,69.209999,88.550003
3,2017-01-06,29.477501,84.800003,39.799500,,159.100006,22.680000,75.330002,162.830002,82.199997,...,43.919998,85.400002,31.208460,162.410004,103.190002,53.259998,83.099998,55.040001,68.260002,88.500000
4,2017-01-09,29.747499,85.480003,39.846001,,158.320007,22.549999,74.760002,160.970001,81.699997,...,43.380001,84.019997,30.815710,161.949997,102.419998,52.680000,82.550003,54.240002,68.709999,87.040001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
748,2019-12-23,71.000000,85.599998,89.650002,,337.549988,35.169998,51.130001,293.309998,74.379997,...,53.033333,112.669998,29.509064,295.089996,179.419998,61.400002,58.570000,53.810001,119.029999,70.290001
749,2019-12-24,71.067497,85.419998,89.460503,,333.000000,35.220001,51.290001,294.230011,74.510002,...,52.993332,113.199997,29.425982,294.540009,179.889999,61.279999,58.349998,53.820000,119.510002,70.019997
750,2019-12-26,72.477501,85.050003,93.438499,,329.920013,35.520000,51.169998,295.730011,74.480003,...,53.006668,112.059998,29.577040,295.649994,180.809998,61.290001,58.900002,54.150002,119.519997,70.129997
751,2019-12-27,72.449997,84.910004,93.489998,,330.140015,35.349998,51.500000,294.109985,74.400002,...,52.939999,110.599998,29.637463,295.970001,181.410004,61.529999,59.020000,53.919998,119.589996,69.889999


In [10]:
# df_s_transpose = df_s

if drop_cols_with_na == 1:
    df_s_transpose = df_s_transpose.dropna(axis = 1);    
   
print(df_s_transpose.shape)
df_s_transpose.head() 

(753, 29)


Unnamed: 0,Date,AAPL,ABC,AMZN,BA,BAC,CAH,COST,CVS,CVX,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
0,2017-01-03,29.0375,82.610001,37.683498,156.970001,22.530001,74.480003,159.729996,80.349998,117.849998,...,43.546665,86.790001,32.492447,161.449997,102.519997,54.580002,82.959999,56.0,68.660004,90.889999
1,2017-01-04,29.004999,84.660004,37.859001,158.619995,22.950001,75.629997,159.759995,79.75,117.82,...,44.146667,87.260002,32.303623,161.910004,103.139999,54.52,82.980003,56.049999,69.059998,89.889999
2,2017-01-05,29.1525,83.68,39.022499,158.710007,22.68,74.5,162.910004,81.419998,117.309998,...,43.426666,86.739998,32.21299,162.179993,102.129997,54.639999,83.029999,55.18,69.209999,88.550003
3,2017-01-06,29.477501,84.800003,39.7995,159.100006,22.68,75.330002,162.830002,82.199997,116.839996,...,43.919998,85.400002,31.20846,162.410004,103.190002,53.259998,83.099998,55.040001,68.260002,88.5
4,2017-01-09,29.747499,85.480003,39.846001,158.320007,22.549999,74.760002,160.970001,81.699997,115.839996,...,43.380001,84.019997,30.81571,161.949997,102.419998,52.68,82.550003,54.240002,68.709999,87.040001
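According to the recorded output above, the `interpolate` call raised an exception and was skipped, so the effective cleaning step was `dropna(axis = 1)`, which removed the empty ANTM column. For reference, here is a small sketch of what the two steps do together when interpolation runs on purely numeric columns. The tickers AAA, BBB, CCC and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy price table: AAA has one interior gap, BBB is completely empty
prices = pd.DataFrame({
    "AAA": [10.0, np.nan, 12.0, 13.0],
    "BBB": [np.nan, np.nan, np.nan, np.nan],
    "CCC": [5.0, 5.5, 6.0, 6.5],
})

# Linear interpolation fills the interior gap in AAA;
# BBB has no known values to interpolate between, so it stays NaN
filled = prices.interpolate()

# Drop every column that still contains a NaN (here: BBB)
cleaned = filled.dropna(axis=1)

print(filled)
print(cleaned.columns.tolist())
```

A column with no observations at all cannot be repaired by interpolation, so dropping it is the only option short of finding another data source.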


In [11]:
# further check and verify
df_s_transpose.isnull().values.any()
df_s_transpose[df_s_transpose.isna().any( axis = 1 )]

Unnamed: 0,Date,AAPL,ABC,AMZN,BA,BAC,CAH,COST,CVS,CVX,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM


In [12]:
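# Intended: make the Date column the index of the dataset (see the commented-out line)
# df_s_transpose.index = df_s_transpose['Date']
# Note: the line below only casts the existing index to datetime64[ns]; Date stays a regular column
df_s_transpose.index = df_s_transpose.index.astype('datetime64[ns]')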

# Pearson Correlation Coefficient

In [13]:
df_s_transpose_pearson = df_s_transpose.corr(method = 'pearson', numeric_only = True)
df_s_transpose_pearson

Unnamed: 0,AAPL,ABC,AMZN,BA,BAC,CAH,COST,CVS,CVX,F,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
AAPL,1.0,-0.036748,0.786106,0.69257,0.763609,-0.66677,0.863643,-0.34235,0.459125,-0.593845,...,0.427859,0.745233,-0.262982,0.760652,0.798973,0.676181,-0.658245,-0.197965,0.827489,-0.505612
ABC,-0.036748,1.0,-0.127768,-0.126217,0.163579,0.46616,-0.130971,0.427752,0.029482,0.302385,...,0.154842,0.097029,0.358636,-0.055076,-0.178882,-0.252366,0.300615,0.574354,-0.079702,0.250188
AMZN,0.786106,-0.127768,1.0,0.909833,0.739494,-0.876488,0.826927,-0.673998,0.601387,-0.721688,...,0.071989,0.690894,-0.677553,0.886126,0.929052,0.708967,-0.763988,-0.409286,0.765665,-0.457154
BA,0.69257,-0.126217,0.909833,1.0,0.782307,-0.828416,0.699197,-0.661338,0.662725,-0.672301,...,0.103744,0.688694,-0.708575,0.886833,0.873865,0.653679,-0.707509,-0.328412,0.765026,-0.413556
BAC,0.763609,0.163579,0.739494,0.782307,1.0,-0.523495,0.613895,-0.389465,0.670267,-0.347324,...,0.484901,0.804676,-0.315603,0.770331,0.727145,0.485319,-0.604695,0.127918,0.705303,-0.297798
CAH,-0.66677,0.46616,-0.876488,-0.828416,-0.523495,1.0,-0.766108,0.746348,-0.53961,0.70946,...,0.05984,-0.558826,0.741128,-0.795163,-0.851249,-0.711528,0.791882,0.596125,-0.739558,0.523205
COST,0.863643,-0.130971,0.826927,0.699197,0.613895,-0.766108,1.0,-0.553276,0.444068,-0.71275,...,0.317508,0.541579,-0.297706,0.695051,0.902083,0.86442,-0.766485,-0.473844,0.893336,-0.648205
CVS,-0.34235,0.427752,-0.673998,-0.661338,-0.389465,0.746348,-0.553276,1.0,-0.44043,0.46455,...,0.017925,-0.225111,0.586122,-0.439705,-0.68606,-0.463266,0.85014,0.643467,-0.529032,0.44905
CVX,0.459125,0.029482,0.601387,0.662725,0.670267,-0.53961,0.444068,-0.44043,1.0,-0.087233,...,0.29314,0.717509,-0.38858,0.594834,0.602565,0.419555,-0.535137,0.040413,0.482086,0.101968
F,-0.593845,0.302385,-0.721688,-0.672301,-0.347324,0.70946,-0.71275,0.46455,-0.087233,1.0,...,0.115762,-0.306461,0.563575,-0.678168,-0.727356,-0.726151,0.467808,0.607512,-0.64043,0.652238


# Pearson Correlation Coefficient based Adjacency Graph Matrix

In [14]:
df_s_transpose_pearson[df_s_transpose_pearson >= 0.5] = 1
df_s_transpose_pearson[df_s_transpose_pearson < 0.5] = 0
df_s_transpose_pearson

Unnamed: 0,AAPL,ABC,AMZN,BA,BAC,CAH,COST,CVS,CVX,F,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
AAPL,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
ABC,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
AMZN,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
BA,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
BAC,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
CAH,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
COST,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
CVS,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
CVX,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
F,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


In [15]:
# set the diagonal elements to zero: no self loops/edges
import numpy as np
np.fill_diagonal(df_s_transpose_pearson.values, 0)
df_s_transpose_pearson

Unnamed: 0,AAPL,ABC,AMZN,BA,BAC,CAH,COST,CVS,CVX,F,...,PCAR,PSX,T,UNH,UNP,VZ,WBA,WFC,WMT,XOM
AAPL,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
ABC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
AMZN,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
BA,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
BAC,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
CAH,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
COST,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
CVS,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
CVX,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
F,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
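The thresholding and diagonal-clearing steps above can also be written as one small function. This is a sketch under assumptions: the function name `correlation_to_adjacency` and the toy matrix are illustrative, and the 0.5 threshold is the one used in this notebook.

```python
import numpy as np
import pandas as pd

def correlation_to_adjacency(corr, threshold=0.5):
    """Return a 0/1 adjacency matrix: 1 where corr >= threshold, no self loops."""
    # The comparison creates a fresh array, so the input matrix is left unchanged
    values = (corr.to_numpy() >= threshold).astype(float)
    np.fill_diagonal(values, 0.0)
    return pd.DataFrame(values, index=corr.index, columns=corr.columns)

# Toy correlation matrix for three hypothetical tickers
labels = ["A", "B", "C"]
corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.5],
     [-0.2, 0.5, 1.0]],
    index=labels, columns=labels,
)
adjacency = correlation_to_adjacency(corr)
print(adjacency)
```

Building a new array instead of writing into `corr.values` keeps the original correlation values available for later inspection, and it does not depend on `.values` returning a writable view, which may not hold in every pandas configuration.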


In [16]:
STOP

NameError: name 'STOP' is not defined

# Create and visualize the Graphs

In [None]:
import networkx as nx
Graph_pearson = nx.Graph(df_s_transpose_pearson)

In [None]:
nx.draw_networkx(Graph_pearson, pos = nx.circular_layout( Graph_pearson ), node_color = 'r', edge_color = 'b')

# Experiment

In [None]:
df_s_transpose.corr(method = 'pearson', numeric_only = True)
#df_s_transpose[[{1,2,3}]]

df_s_transpose.iloc[:, 0:10]


In [None]:
df_s_pearson_train = df_s_transpose.iloc[:, 0:15]
df_s_transpose_pearson_train = df_s_pearson_train.corr(method = 'pearson', numeric_only = True)
np.fill_diagonal(df_s_transpose_pearson_train.values, 0)

df_s_transpose_pearson_train[df_s_transpose_pearson_train >= 0.5] = 1
df_s_transpose_pearson_train[df_s_transpose_pearson_train < 0.5] = 0
df_s_transpose_pearson_train

df_s_transpose_pearson_train

In [None]:
df_s_pearson_test = df_s_transpose.iloc[:, 15:23]
df_s_transpose_pearson_test = df_s_pearson_test.corr(method = 'pearson', numeric_only = True)
np.fill_diagonal(df_s_transpose_pearson_test.values, 0)

df_s_transpose_pearson_test[df_s_transpose_pearson_test >= 0.5] = 1
df_s_transpose_pearson_test[df_s_transpose_pearson_test < 0.5] = 0
df_s_transpose_pearson_test


df_s_pearson_validation = df_s_transpose.iloc[:, 23:]
df_s_transpose_pearson_validation = df_s_pearson_validation.corr(method = 'pearson', numeric_only = True)
np.fill_diagonal(df_s_transpose_pearson_validation.values, 0)
df_s_transpose_pearson_validation

df_s_transpose_pearson_validation[df_s_transpose_pearson_validation >= 0.5] = 1
df_s_transpose_pearson_validation[df_s_transpose_pearson_validation < 0.5] = 0
df_s_transpose_pearson_validation

In [None]:
graph_pearson_train = nx.Graph(df_s_transpose_pearson_train)
graph_pearson_test = nx.Graph(df_s_transpose_pearson_test)
graph_pearson_validation = nx.Graph(df_s_transpose_pearson_validation)


nx.draw_networkx(graph_pearson_train, pos = nx.circular_layout( graph_pearson_train ), node_color = 'r', edge_color = 'b')


In [None]:
df_s_pearson_train.corr(numeric_only = True)

In [None]:
nx.draw_networkx(graph_pearson_test, pos = nx.circular_layout( graph_pearson_test ), node_color = 'r', edge_color = 'b')


In [None]:
nx.draw_networkx(graph_pearson_validation, pos = nx.circular_layout( graph_pearson_validation ), node_color = 'r', edge_color = 'b')

# Create GCN layer. Pearson

# Find all stocks = nodes

In [None]:
# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_pearson.index.to_list()
all_stock_nodes[:5]

# Find all edges between nodes

In [None]:
source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_pearson[aStock][anotherStock] > 0:
            #print(df_s_transpose_pearson[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)

# edge feature is not required except for news based graph
source, target, edge_feature            
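For a 0/1 adjacency matrix, the nested loops above can be replaced by a vectorized lookup of the non-zero cells. A minimal sketch with a made-up three-node matrix; the variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd

labels = ["A", "B", "C"]
adjacency = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=labels, columns=labels, dtype=float,
)

# Positions of all non-zero cells, in row-major order
rows, cols = np.nonzero(adjacency.to_numpy())

# Map the positions back to node labels
edges = pd.DataFrame({
    "source": adjacency.index[rows],
    "target": adjacency.columns[cols],
})
edge_pairs = list(zip(edges["source"], edges["target"]))
print(edge_pairs)
```

As with the loops, every undirected edge appears twice (once per direction) because the matrix is symmetric.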

In [None]:
trainSource = [];
trainTarget = [];
trainEdge_feature = [];
trainNodeList = df_s_transpose_pearson_train.index.to_list();

testSource = [];
testTarget = [];
testEdge_feature = [];
testNodeList = df_s_transpose_pearson_test.index.to_list();


validationSource = [];
validationTarget = [];
validationEdge_feature = [];
validationNodeList = df_s_transpose_pearson_validation.index.to_list();

for aStock in trainNodeList:
    for anotherStock in trainNodeList:        
        if df_s_transpose_pearson_train[aStock][anotherStock] > 0:
            #print(df_s_transpose_pearson[aStock][anotherStock])
            trainSource.append(aStock)
            trainTarget.append(anotherStock)
            trainEdge_feature.append(1)
                
                
for aStock in testNodeList:
    for anotherStock in testNodeList:        
        if df_s_transpose_pearson_test[aStock][anotherStock] > 0:
            #print(df_s_transpose_pearson[aStock][anotherStock])
            testSource.append(aStock)
            testTarget.append(anotherStock)
            testEdge_feature.append(1)

for aStock in validationNodeList:
    for anotherStock in validationNodeList:                    
        if df_s_transpose_pearson_validation[aStock][anotherStock] > 0:
            # print(df_s_transpose_pearson[aStock][anotherStock])
            validationSource.append(aStock)
            validationTarget.append(anotherStock)
            validationEdge_feature.append(1)
                        
# edge feature is not required except for news based graph
trainSource, trainTarget, trainEdge_feature
testSource, testTarget, testEdge_feature
validationSource, validationTarget, validationEdge_feature

# Create variables to create stellar graph

In [None]:
# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
pearson_edges = pd.DataFrame(
    {"source": source, "target": target}
)

pearson_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
pearson_edges_train = pd.DataFrame(
    {"source": trainSource, "target": trainTarget}
)

pearson_edges_data_train = pd.DataFrame(
    {"source": trainSource, "target": trainTarget, "edge_feature": trainEdge_feature}
)

pearson_edges_test = pd.DataFrame(
    {"source": testSource, "target": testTarget}
)

pearson_edges_data_test = pd.DataFrame(
    {"source": testSource, "target": testTarget, "edge_feature": testEdge_feature}
)


pearson_edges_validation = pd.DataFrame(
    {"source": validationSource, "target": validationTarget}
)


pearson_edges[:10]

# Have the time series data as part of the nodes

# Structure the Feature Matrix so that it can be passed to the GCN

In [None]:
df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values

In [None]:
# bring/assign data to nodes
node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    

In [None]:
# convert node data variable into a dataframe so that the data structure is compatible with graph NN
pearson_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
pearson_graph_node_data.head()
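The node feature table built above has one row per stock and one column per trading day. The same layout can be obtained by selecting the stock columns and transposing, sketched here with a made-up price table (the tickers and values are illustrative assumptions).

```python
import pandas as pd

# Toy price table: one row per day, one column per ticker
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"]),
    "AAA": [1.0, 2.0, 3.0],
    "BBB": [4.0, 5.0, 6.0],
})
nodes = ["AAA", "BBB"]

# Select only the node columns, then transpose:
# rows become nodes, columns become time steps
node_features = prices[nodes].T
print(node_features)
```

Selecting by label ties each feature row to its node name, so the features cannot drift out of alignment with the node list.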

In [None]:
node_Data[14:15], 
len(validationNodeList)
len(testNodeList)

In [None]:
# build one feature table per split; select rows by node label so the features always line up with each node list
pearson_graph_node_data_train = pearson_graph_node_data.loc[trainNodeList]
pearson_graph_node_data_train.head()

pearson_graph_node_data_test = pearson_graph_node_data.loc[testNodeList]
pearson_graph_node_data_test.head()

pearson_graph_node_data_validation = pearson_graph_node_data.loc[validationNodeList]
pearson_graph_node_data_validation.head()



# Graph (stellar) with features as part of Nodes

In [None]:
pearson_graph_with_node_features = StellarGraph(pearson_graph_node_data, edges = pearson_edges, node_type_default = "corner", edge_type_default = "line")
print(pearson_graph_with_node_features.info())

# train nodes
pearson_train_graph_with_node_features = StellarGraph(pearson_graph_node_data_train, edges = pearson_edges_train, node_type_default = "corner", edge_type_default = "line")
print(pearson_train_graph_with_node_features.info())


pearson_test_graph_with_node_features = StellarGraph(pearson_graph_node_data_test, edges = pearson_edges_test, node_type_default = "corner", edge_type_default = "line")
print(pearson_test_graph_with_node_features.info())

pearson_validation_graph_with_node_features = StellarGraph(pearson_graph_node_data_validation, edges = pearson_edges_validation, node_type_default = "corner", edge_type_default = "line")
print(pearson_validation_graph_with_node_features.info())




# Adapting everything for DeepGraphCNN

In [None]:
pearson_graph_node_data.iloc[0:15, :]

# train
pearson_train_graph_with_node_features = StellarGraph(pearson_graph_node_data.iloc[0:15, :], edges = pearson_edges.iloc[0:15, :], node_type_default = "corner", edge_type_default = "line")
print(pearson_train_graph_with_node_features.info())


'''
pearson_train_graph_with_node_features = StellarGraph(pearson_graph_node_data[:10], edges = pearson_edges[:10], node_type_default = "corner", edge_type_default = "line")
print(pearson_train_graph_with_node_features.info())

pearson_train_graph_with_node_features = StellarGraph(pearson_graph_node_data[:10], edges = pearson_edges[:10], node_type_default = "corner", edge_type_default = "line")
print(pearson_train_graph_with_node_features.info())
'''


In [None]:
# graphs

In [None]:
graphs = list()
#graphs.append(pearson_graph_with_node_features)
graphs.append(pearson_train_graph_with_node_features)
graphs.append(pearson_test_graph_with_node_features)
graphs.append(pearson_validation_graph_with_node_features)


In [None]:
summary = pd.DataFrame(
    [(g.number_of_nodes(), g.number_of_edges()) for g in graphs],
    columns=["nodes", "edges"],
)
summary.describe().round(1)

In [None]:
graph_labels = all_stock_nodes

In [None]:
# Generator
#generator = FullBatchNodeGenerator(pearson_graph_with_node_features, method = "gcn") # , sparse = False
#vars(generator)

generator = PaddedGraphGenerator( graphs = graphs)
# generator = PaddedGraphGenerator( pearson_graph_with_node_features)

In [None]:
generator

# Train Test Split

In [None]:
train_subjects, test_subjects = model_selection.train_test_split(
    pearson_graph_node_data 
)

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects 
)

#, train_size = 500, test_size = None, stratify = test_subjects

train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape
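Both calls above use the default split of `train_test_split`. A small arithmetic sketch of the resulting sizes, assuming scikit-learn's default of holding out 25% and rounding the held-out count up; the row count of 28 is a hypothetical example, not a value printed by this notebook.

```python
import math

def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# First split: all rows -> train / test
n_train, n_test = default_split_sizes(28)

# Second split: the test part -> validation / second-stage test
n_val, n_test_step_2 = default_split_sizes(n_test)

print(n_train, n_test, n_val, n_test_step_2)
```

Under these assumptions the validation set ends up smaller than the first-stage test set, which is probably why later cells truncate `test_subjects` to `len(val_subjects)`.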

In [None]:
train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 

In [None]:
train_gen = generator.flow(train_subjects.index, train_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
valid_gen = generator.flow(val_subjects.index, val_targets)

In [None]:
# debug
train_subjects.index, 
train_targets[:2]

In [None]:
# train data size
unit_count = train_subjects.shape[0]
unit_count

In [None]:
# hard coded size adjustments
test_subjects_ = test_subjects[:len(val_subjects)]

val_gen = generator.flow(val_subjects.index, test_subjects_)
#train_gen[1], val_gen[1]
#val_gen[4]

# The Model for all of the approaches utilized in this file
# Model for Pearson, Spearman, Kendall Tau, and financial news based prediction

# hard coded size adjustments
test_subjects_adjusted = test_subjects[:len(val_subjects)]

val_gen = generator.flow(val_subjects.index, test_subjects_adjusted)
#train_gen[1], val_gen[1]

# Models : DeepGraph CNN

In [None]:
epochs_to_test = 2
patience_to_test = 2

In [None]:
# Experiment with DeepGraphCNN
# https://stellargraph.readthedocs.io/en/latest/demos/graph-classification/dgcnn-graph-classification.html


# unit_count = 35
k =   unit_count # the number of rows for the output tensor
layer_sizes = [32, 32, 32, 1]

dgcnn_model = DeepGraphCNN(
    layer_sizes = layer_sizes,
    activations = ["tanh", "tanh", "tanh", "tanh"],
    k = k,
    bias = False,
    generator = generator,
)
x_inp, x_out = dgcnn_model.in_out_tensors()

#print(graphs[0].info())
x_inp, x_out

In [None]:
# dgcnn_model.summary()

# print(dgcnn_model.info())
dgcnn_model    

In [None]:
x_out = Conv1D(filters = 16, kernel_size = sum(layer_sizes), strides = sum(layer_sizes))(x_out)
x_out = MaxPool1D(pool_size=2)(x_out)

x_out = Conv1D(filters = 32, kernel_size = 5, strides = 1)(x_out)

x_out = Flatten()(x_out)

x_out = Dense(units = 128, activation = "relu")(x_out)
x_out = Dropout(rate = 0.5)(x_out)

#predictions = Dense(units=1, activation="linear")(x_out)
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)
#predictions = layers.Dense(units = 1, activation = "linear")(x_out)
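The convolution head above only works if `k` is large enough for the second `Conv1D`. A sketch of the sequence-length arithmetic, assuming that the DeepGraphCNN output is a flattened sequence of length `k * sum(layer_sizes)` (as in the StellarGraph demo linked above) and that all layers use the Keras default `'valid'` padding. The function names are illustrative.

```python
def conv1d_length(n, kernel_size, strides=1):
    """Output length of a 1D convolution or pooling layer with 'valid' padding."""
    return (n - kernel_size) // strides + 1

def head_lengths(k, layer_sizes):
    """Sequence length after each layer of the convolution head."""
    width = sum(layer_sizes)
    # Conv1D(kernel_size = width, strides = width): one step per sorted node
    after_conv1 = conv1d_length(k * width, width, width)
    # MaxPool1D(pool_size = 2): strides default to the pool size
    after_pool = conv1d_length(after_conv1, 2, 2)
    # Conv1D(kernel_size = 5, strides = 1)
    after_conv2 = conv1d_length(after_pool, 5, 1)
    return after_conv1, after_pool, after_conv2

lengths = head_lengths(21, [32, 32, 32, 1])
smallest_valid = head_lengths(10, [32, 32, 32, 1])
too_small = head_lengths(9, [32, 32, 32, 1])
print(lengths, smallest_valid, too_small)
```

Under these assumptions the last length drops to zero for `k` below 10, so a graph set with fewer retained nodes would need a smaller second kernel.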

In [None]:
model = Model(inputs=x_inp, outputs=predictions)

model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( learning_rate = 0.1), 
    metrics = ['mean_squared_error']
)

# Start using model.fit from 1st and 2nd models **********************************

In [None]:
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)
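The effect of `patience` and `restore_best_weights` can be illustrated without Keras. This is a simplified sketch of the stopping rule for a metric that should decrease; it ignores other `EarlyStopping` options such as `min_delta`, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops after `patience` consecutive epochs without improvement;
    best_epoch is the epoch whose weights would be restored.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through all epochs without triggering the rule
    return len(val_losses) - 1, best_epoch

stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7], patience=2)
print(stop_epoch, best_epoch)
```

With a small patience the run stops before the later improvement to 0.7 is ever seen, which is the trade-off to keep in mind when `patience_to_test` is set to a low value for quick experiments.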

In [None]:
# this fit is from 2nd model
history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 1,    
    # shuffling = true means shuffling the whole graph
    shuffle = False , callbacks = [es_callback],
)

# End using model.fit from 1st and 2nd models ************************************

In [None]:
initial_list = list(all_stock_nodes)  # keep the node order; a set would not preserve it
keys = [x for x in range(len(initial_list))]
new_dict = dict(zip(keys, initial_list)) 

[ list(new_dict.values()).index(item) for item in  train_subjects.index]

#dict((item['id'], item) for item in initial_list)
keys, new_dict, train_gen
train_gen_node_id_list = [ list(new_dict.values()).index(item) for item in  train_subjects.index]
test_gen_node_id_list = [ list(new_dict.values()).index(item) for item in  test_targets.index]
train_gen_node_id_list[:5], test_gen_node_id_list[:], keys[:5]

In [None]:
# hard coded size adjustments
test_subjects_ = test_subjects[:len(val_subjects)]

val_gen = generator.flow(val_subjects.index, test_subjects_)
#train_gen[1], val_gen[1]
#val_gen[4]

In [None]:
# Worked
train_gen = generator.flow(
    #list(train_subjects.index),
    #train_gen_node_id_list,
    [0],
    targets = [0], #train_subjects.values,
    batch_size = 1,
    symmetric_normalization = False,
)

test_gen = generator.flow(
    #list(test_targets.index),
    #test_gen_node_id_list,
    [1],
    targets = [1],#test_targets.values,
    batch_size = 1,
    symmetric_normalization = False,
)


all_gen = generator.flow(
    #list(test_targets.index),
    #test_gen_node_id_list,
    [0],
    targets=[0],#test_targets.values,
    batch_size=1,
    symmetric_normalization=False,
)

data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];


In [None]:
train_gen[0], 
#train_subjects
#train_subjects.values

In [None]:
# Experiment
train_gen = generator.flow(
    #list(train_subjects.index),
    train_gen_node_id_list,
    #[0],
    targets = train_subjects[0],
    batch_size = 1,
    symmetric_normalization = False,
)

test_gen = generator.flow(
    #list(test_targets.index),
    test_gen_node_id_list,
    #[0],
    targets = test_targets,
    batch_size = 1,
    symmetric_normalization = False,
)

'''
all_gen = generator.flow(
    #list(test_targets.index),
    #test_gen_node_id_list,
    [0],
    targets=[0],#test_targets.values,
    batch_size=1,
    symmetric_normalization=False,
)
'''

data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];


In [None]:
vars(train_gen), 
#train_subjects
#train_subjects.values

In [None]:
train_subjects, test_gen

In [None]:
train_gen, data_valid, train_gen_data

In [None]:
# https://stellargraph.readthedocs.io/en/stable/demos/graph-classification/dgcnn-graph-classification.html?highlight=cnn
# with DeepGraphCNN model.fit
history = model.fit(
    train_gen, epochs = epochs_to_test, verbose = 1, validation_data = test_gen, shuffle = False,
)

# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

es_callback = EarlyStopping(monitor = "val_mean_squared_error", patience = 50, restore_best_weights = True)


'''
history = model.fit( train_gen_data, epochs = 100, validation_data = data_valid, verbose = 1,    
    # shuffling = true means shuffling the whole graph
    shuffle = True, callbacks = [es_callback],
)
'''

history = model.fit(
    train_gen, epochs = epochs_to_test, verbose = 1, validation_data = test_gen, shuffle = True,
)




sg.utils.plot_history(history)

# [1]

# loss functions: https://keras.io/api/losses/

model = Model(
    inputs = x_inp, outputs = predictions)

'''
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)

# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)

# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mae', 'mape']
)
'''

# 1st block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mae', 'mape']
)

len(x_inp), predictions.shape, print(model.summary())

len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]

# hard coded size adjustments
test_subjects_ = test_subjects[:len(val_subjects)]

val_gen = generator.flow(val_subjects.index, test_subjects_)
#train_gen[1], val_gen[1]
val_gen[4]

train_gen[:1][:4]

# type(train_gen_data), type(data_valid), type(x_inp), type(x_out) 

# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping



es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)



history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)
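
The `EarlyStopping` callback above stops training once the monitored validation metric has not improved for `patience` epochs. A minimal pure-Python sketch of that idea follows. The helper `early_stopping_epoch` is hypothetical and only mirrors the patience logic for a "lower is better" metric; it is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a 'lower is better' monitored metric.

    Hypothetical helper that mirrors the patience idea; it is not Keras code.
    """
    best_value = float("inf")
    best_epoch = None
    wait = 0  # number of consecutive epochs without improvement
    for epoch, value in enumerate(val_losses):
        if value < best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without exhausting the patience budget
    return len(val_losses) - 1, best_epoch


stop_epoch, best_epoch = early_stopping_epoch(
    [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.9], patience=3
)
print(stop_epoch, best_epoch)
```

With `restore_best_weights = True`, the weights kept after stopping are the ones from `best_epoch`, not from `stop_epoch`.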


In [None]:
sg.utils.plot_history(history)

In [None]:
val_subjects, 
test_subjects

In [None]:
#test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))

In [None]:
# One column per value returned by model.evaluate (loss + the three compiled metrics)
df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAE', 'MAPE'])

temp = list()
temp.append('GCN-Pearson');
for name, val in zip(model.metrics_names, test_metrics):
    # print(val)
    temp.append(val)

print(temp)
df_metrics.loc[1] = temp
df_metrics

# Show the predicted prices by the Model

At this point, I still need to work out exactly what the GCN (and CNN) layers plus the MLP head are predicting.
For now, I am just displaying the output.
It appears that a price is predicted for each timestamp (day).

In [None]:
all_nodes = pearson_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(test_gen)
all_predictions = model.predict(train_gen)

all_nodes, all_predictions, all_predictions.shape, pearson_graph_node_data.shape

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)

In [None]:
# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, pearson_graph_node_data.shape
vars(all_gen)

In [None]:
pearson_graph_node_data

In [None]:
vars(all_gen)

In [None]:
train_gen[:1][:4]

In [None]:
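# ****************************************************
# STOP because we are testing a new model
# ****************************************************
# Raising here keeps "Run All" from continuing past this cell.
raise RuntimeError("STOP because we are testing a new model")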

# SPEARMAN ***************************************************************************

In [None]:
# Spearman

df_s_transpose_spearman = df_s_transpose.corr(method = 'spearman', numeric_only = True)
df_s_transpose_spearman


# # Spearman Correlation Coefficient based Adjacency Graph Matrix

# In[32]:


df_s_transpose_spearman[df_s_transpose_spearman >= 0.4] = 1
df_s_transpose_spearman[df_s_transpose_spearman < 0.4] = 0
df_s_transpose_spearman


# In[33]:


# make the diagonal element to be zero. No self loop
import numpy as np
np.fill_diagonal(df_s_transpose_spearman.values, 0)
df_s_transpose_spearman
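
The thresholding and diagonal-clearing steps above can also be done in one pass on a fresh array. This is a minimal sketch under stated assumptions: the tickers `AAA`, `BBB`, `CCC` and the helper `correlation_adjacency` are made up for illustration, and the 0.4 threshold simply mirrors the value used above.

```python
import numpy as np
import pandas as pd

# Toy prices: rows are days, columns are hypothetical tickers
prices = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BBB": [2.0, 4.0, 5.0, 8.0, 9.0],   # rises whenever AAA rises
    "CCC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls whenever AAA rises
})


def correlation_adjacency(frame, method="spearman", threshold=0.4):
    """Binary adjacency matrix: 1 where correlation >= threshold, no self loops."""
    corr = frame.corr(method=method)
    # The comparison creates a new, writable array, so the correlation
    # matrix itself is left untouched
    adjacency = (corr.to_numpy() >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)
    return pd.DataFrame(adjacency, index=corr.index, columns=corr.columns)


adjacency = correlation_adjacency(prices)
print(adjacency)
```

Keeping the correlation matrix and the adjacency matrix in separate variables makes it possible to try other thresholds later without recomputing the correlations.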


# Create and visualize the Graphs

# In[34]:


import networkx as nx
Graph_spearman = nx.Graph(df_s_transpose_spearman)


# In[36]:


nx.draw_networkx(Graph_spearman, pos=nx.circular_layout(Graph_spearman), node_color='r', edge_color='b')


# # Create GCN layer. Graph_spearman

# # Find all stocks = nodes

# In[37]:


# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_spearman.index.to_list()
all_stock_nodes


# # Find all edges between nodes

# In[38]:


source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_spearman[aStock][anotherStock] > 0:
            #print(df_s_transpose_spearman[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)
            
source, target, edge_feature            
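
Because the adjacency matrix is symmetric, the nested loop above records every connected pair twice, once as (a, b) and once as (b, a). If a single entry per undirected edge is preferred, the upper triangle of the matrix is enough. The sketch below assumes a small made-up matrix; the helper `undirected_edges` and the ticker names are hypothetical.

```python
import numpy as np


def undirected_edges(adjacency, labels):
    """Return (source, target) lists with one entry per undirected edge."""
    matrix = np.asarray(adjacency)
    # k=1 keeps only entries strictly above the diagonal
    rows, cols = np.nonzero(np.triu(matrix, k=1))
    source = [labels[i] for i in rows]
    target = [labels[j] for j in cols]
    return source, target


labels = ["AAA", "BBB", "CCC"]
adjacency = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
source, target = undirected_edges(adjacency, labels)
print(list(zip(source, target)))
```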


# In[39]:


# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
spearman_edges = pd.DataFrame(
    {"source": source, "target": target}
)

spearman_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


spearman_edges[:10]


# # Graph with No Feature Data, No node data, only edges

# spearman_graph = StellarGraph(edges = spearman_edges, node_type_default="corner", edge_type_default="line")
# #spearman_graph = StellarGraph(nodes = all_stock_nodes, edges = spearman_edges)
# # graph = sg.StellarGraph(all_stock_nodes, square_edges)
# print(spearman_graph.info())

# In[40]:


# Trying to have the time series data as part of the nodes


# In[41]:


df_s_transpose


# # Structure the Feature Matrix so that it can be passed to the GCN

# In[43]:


df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values


# In[44]:


node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    


# In[45]:


spearman_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
spearman_graph_node_data


# # Graph with feature as part of Nodes

# In[46]:


spearman_graph_with_node_features = StellarGraph(spearman_graph_node_data, edges = spearman_edges, node_type_default = "corner", edge_type_default = "line")
print(spearman_graph_with_node_features.info())


# In[47]:


# Generator
generator = FullBatchNodeGenerator(spearman_graph_with_node_features, method = "gcn") # , sparse = False
vars(generator)
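
As far as I understand, `method = "gcn"` prepares the adjacency matrix in the style of Kipf and Welling: self loops are added and the result is symmetrically normalised by node degree. The NumPy sketch below illustrates that idea and one graph-convolution step on a three-node path graph. It is not StellarGraph's code, and `gcn_normalize` and `gcn_layer` are hypothetical names.

```python
import numpy as np


def gcn_normalize(adjacency):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix A."""
    a_hat = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))
    inv_sqrt_degree = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]


def gcn_layer(norm_adj, features, weights):
    """One graph-convolution step with a ReLU activation."""
    return np.maximum(norm_adj @ features @ weights, 0.0)


# Path graph 0 - 1 - 2
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
norm_adj = gcn_normalize(adjacency)

features = np.ones((3, 2))          # 3 nodes, 2 features each
weights = np.full((2, 4), 0.5)      # maps 2 features to 4 hidden units
hidden = gcn_layer(norm_adj, features, weights)

print(norm_adj.round(3))
print(hidden.shape)
```

Each row of `hidden` mixes a node's own features with those of its neighbours, which is why the time series attached to correlated stocks influence each other's predictions.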


# # Train Test Split

# In[48]:


train_subjects, test_subjects = model_selection.train_test_split(
    spearman_graph_node_data #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=pearson_graph_node_data

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape
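
Note that in the split above `val_subjects` is drawn from `test_subjects`, and `test_subjects` is later used unchanged as the test set, so the validation nodes are also test nodes. One way to get three disjoint groups is sketched below with the standard library only. The helper `three_way_split`, the fractions, and the seed are assumptions for illustration, not part of the original code.

```python
import math
import random


def three_way_split(node_ids, test_fraction=0.25, val_fraction=0.25, seed=42):
    """Shuffle node ids and cut them into disjoint train / validation / test lists."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(test_fraction * len(ids))
    n_val = math.ceil(val_fraction * len(ids))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


tickers = [f"S{i:02d}" for i in range(30)]  # hypothetical ticker names
train_ids, val_ids, test_ids = three_way_split(tickers)
print(len(train_ids), len(val_ids), len(test_ids))
```

The resulting id lists could then be used to select rows of the node data frame with `.loc[...]` for each generator.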


# In[49]:


spearman_graph_node_data


# In[50]:


train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 


# In[51]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[52]:


# debug
train_subjects.index, 
train_targets


# In[53]:


# train data size
# it is not must to use a number like unit_count
unit_count = train_subjects.shape[0]
unit_count


# In[54]:


'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''

gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)

'''
x_out, 
x_inp, x_out
'''

# # hard coded size adjustments
# test_subjects_adjusted = test_subjects[:len(val_subjects)]
# 
# val_gen = generator.flow(val_subjects.index, test_subjects_adjusted)
# # train_gen[1], val_gen[1]

# In[55]:


# Models. The architecture is the same for all approaches, but the Model is rebuilt
# here so that it wraps the GCN tensors created for the Spearman graph and starts
# from fresh weights instead of continuing from the Pearson run.


# In[56]:


# loss functions: https://keras.io/api/losses/
model = Model(
    inputs = x_inp, outputs = predictions
)
'''
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)
'''

# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
# train_steps = 1000
# lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
'''
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mae', 'mape']
)
'''
# 2nd block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mae', 'mape']
    # metrics=[
    #    metrics.MeanSquaredError(),
    #    metrics.AUC(),
    #]
)


# In[57]:


len(x_inp), predictions.shape, print(model.summary())


# In[58]:


len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]


# In[59]:


# Validation targets must be the rows of the validation nodes themselves,
# not a slice of the test set
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# train_gen[:1][:4]

# In[60]:


# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

'''
#epochs_to_test = 10000
#patience_to_test = 10000

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)
'''

# Point the fit call at the Spearman generators; otherwise it would reuse
# the generators left over from the Pearson run
data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)




In [None]:
sg.utils.plot_history(history)

In [None]:
# [1]


# In[61]:


val_subjects, 
test_subjects


# In[62]:


test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    
    
#df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAE', 'MAPE'])

temp = list()
temp.append('GCN-Spearman');
for name, val in zip(model.metrics_names, test_metrics):
    # print(val)
    temp.append(val)

print(temp)
df_metrics.loc[2] = temp
df_metrics

    


# # Show the predicted prices by the Model
# 
# At this point, I still need to make sense of what GCN ( and CNN) combination + MLP is predicting. 
# I am just displaying the output. 
# It appears that price is predicted for each timestamp (day)
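
One way to make sense of the output is to look at its shape. With a full-batch generator the prediction array carries a leading batch dimension of 1, followed by one row per requested node and one column per output unit (here, one per day). The sketch below uses a made-up array in place of `model.predict(all_gen)`; the ticker names and the assumption that rows follow the order of the node ids passed to `flow` are illustrative.

```python
import numpy as np
import pandas as pd

node_ids = ["AAA", "BBB", "CCC"]   # hypothetical tickers, in the order given to flow()
n_days = 4

# Stand-in for model.predict(all_gen): shape (1, number of nodes, number of days)
raw_predictions = np.arange(len(node_ids) * n_days, dtype=float).reshape(
    1, len(node_ids), n_days
)

# Drop the batch dimension and label each row with its ticker
predicted_prices = pd.DataFrame(raw_predictions[0], index=node_ids)
print(predicted_prices)
```

Laid out this way, each row can be compared directly with the matching row of the node feature table.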

# In[63]:


all_nodes = spearman_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_nodes, all_predictions, all_predictions.shape, spearman_graph_node_data.shape


# In[64]:


# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)


# In[65]:


# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, spearman_graph_node_data.shape
vars(all_gen)


# In[66]:


spearman_graph_node_data


# In[67]:


vars(all_gen)


# In[ ]:


# In[68]:


train_gen[:1][:4]


# In[ ]:



# Kendall Tau

In [None]:
# kendall_tau

df_s_transpose_kendall_tau = df_s_transpose.corr(method = 'kendall', numeric_only = True)
df_s_transpose_kendall_tau


# # kendall_tau Correlation Coefficient based Adjacency Graph Matrix

# In[32]:


df_s_transpose_kendall_tau[df_s_transpose_kendall_tau >= 0.3] = 1
df_s_transpose_kendall_tau[df_s_transpose_kendall_tau < 0.3] = 0
df_s_transpose_kendall_tau


# In[33]:


# make the diagonal element to be zero. No self loop
import numpy as np
np.fill_diagonal(df_s_transpose_kendall_tau.values, 0)
df_s_transpose_kendall_tau


# Create and visualize the Graphs

# In[34]:


import networkx as nx
Graph_kendall_tau = nx.Graph(df_s_transpose_kendall_tau)


# In[36]:


nx.draw_networkx(Graph_kendall_tau, pos=nx.circular_layout(Graph_kendall_tau), node_color='r', edge_color='b')


# # Create GCN layer. Graph_kendall_tau

# # Find all stocks = nodes

# In[37]:


# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_kendall_tau.index.to_list()
all_stock_nodes


# # Find all edges between nodes

# In[38]:


source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_kendall_tau[aStock][anotherStock] > 0:
            #print(df_s_transpose_kendall_tau[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)
            
source, target, edge_feature            


# In[39]:


# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
kendall_tau_edges = pd.DataFrame(
    {"source": source, "target": target}
)

kendall_tau_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


kendall_tau_edges[:10]


# # Graph with No Feature Data, No node data, only edges

# kendall_tau_graph = StellarGraph(edges = kendall_tau_edges, node_type_default="corner", edge_type_default="line")
# #kendall_tau_graph = StellarGraph(nodes = all_stock_nodes, edges = kendall_tau_edges)
# # graph = sg.StellarGraph(all_stock_nodes, square_edges)
# print(kendall_tau_graph.info())

# In[40]:


# Trying to have the time series data as part of the nodes


# In[41]:


df_s_transpose


# # Structure the Feature Matrix so that it can be passed to the GCN

# In[43]:


df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values


# In[44]:


node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    


# In[45]:


kendall_tau_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
kendall_tau_graph_node_data


# # Graph with feature as part of Nodes

# In[46]:


kendall_tau_graph_with_node_features = StellarGraph(kendall_tau_graph_node_data, edges = kendall_tau_edges, node_type_default = "corner", edge_type_default = "line")
print(kendall_tau_graph_with_node_features.info())


# In[47]:


# Generator
generator = FullBatchNodeGenerator(kendall_tau_graph_with_node_features, method = "gcn") # , sparse = False
vars(generator)


# # Train Test Split

# In[48]:


train_subjects, test_subjects = model_selection.train_test_split(
    kendall_tau_graph_node_data #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=kendall_tau_graph_node_data

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape


# In[49]:


kendall_tau_graph_node_data


# In[50]:


train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 


# In[51]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[52]:


# debug
train_subjects.index, 
train_targets


# In[53]:


# train data size
# it is not must to use a number like unit_count
unit_count = train_subjects.shape[0]
unit_count


# In[54]:

'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''
gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)

'''
x_out, 
x_inp, x_out


# # hard coded size adjustments
# test_subjects_adjusted = test_subjects[:len(val_subjects)]
# 
# val_gen = generator.flow(val_subjects.index, test_subjects_adjusted)
# # train_gen[1], val_gen[1]

# In[55]:


# Models


# In[56]:


# loss functions: https://keras.io/api/losses/

model = Model(
    inputs = x_inp, outputs = predictions)


model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)


# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mae', 'mape']
)

'''

# Rebuild the Model so that it wraps the GCN tensors created for the Kendall graph
# and starts from fresh weights instead of continuing from the previous run
model = Model(inputs = x_inp, outputs = predictions)

# 3rd block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mae', 'mape']
    # metrics=[
    #    metrics.MeanSquaredError(),
    #    metrics.AUC(),
    #]
)


len(x_inp), predictions.shape, print(model.summary())

len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]

# Validation targets must be the rows of the validation nodes themselves,
# not a slice of the test set
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# train_gen[:1][:4]

# In[60]:

'''
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping
es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)
'''

# Point the fit call at the Kendall generators; otherwise it would reuse
# the generators left over from an earlier run
data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)




In [None]:
sg.utils.plot_history(history)

In [None]:
nx.draw_networkx(Graph_kendall_tau, pos=nx.circular_layout(Graph_kendall_tau), node_color='r', edge_color='b')

In [None]:
# [1]


# In[61]:


val_subjects, 
test_subjects


# In[62]:


test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    


# # Show the predicted prices by the Model
# 
# At this point, I still need to make sense of what GCN ( and CNN) combination + MLP is predicting. 
# I am just displaying the output. 
# It appears that price is predicted for each timestamp (day)

# In[63]:


all_nodes = kendall_tau_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_nodes, all_predictions, all_predictions.shape, kendall_tau_graph_node_data.shape


# In[64]:


# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)


# In[65]:


# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, kendall_tau_graph_node_data.shape
vars(all_gen)


# In[66]:


kendall_tau_graph_node_data


# In[67]:


vars(all_gen)


# In[ ]:


# In[68]:


train_gen[:1][:4]


# In[ ]:



In [None]:
# df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAE', 'MAPE'])

temp = list()
temp.append('GCN-Kendall');
for name, val in zip(model.metrics_names, test_metrics):    
    temp.append(val)

print(temp)
df_metrics.loc[3] = temp


In [None]:
# import math
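# Work on a copy so the assignment below does not write into a slice of df_metrics
df_metrics_plot = df_metrics[['Loss', 'MSE', 'MAE', 'MAPE']].copy()

#temp = [10.71573, 13.578422, 10.71573, 16.638063]
#temp = [19.04899024963379, 1377.4075927734375, 19.04899024963379, 26.09033203125]
#df_metrics_plot.loc[4] = temp

# Take the square root of MSE (i.e. RMSE) so that it is on the same scale as MAE,
# and rename the column so that the plot legend is not misleading
df_metrics_plot['MSE'] = [ math.sqrt(x) for x in df_metrics_plot['MSE']];
df_metrics_plot = df_metrics_plot.rename(columns = {'MSE': 'RMSE'})
df_metrics_plot
#df_metrics, df_metrics_plot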

In [None]:
df_metrics_plot.plot( kind = 'bar')
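
For reference, the four quantities compared in the bar chart can be computed by hand as follows. This is a small sketch on made-up numbers; `regression_metrics` is a hypothetical helper, and MAPE is expressed in percent, which matches how Keras reports it.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length sequences."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}


metrics_example = regression_metrics([100.0, 200.0], [110.0, 190.0])
print(metrics_example)
```

MSE is in squared price units, so its square root (RMSE) is the value that can be compared side by side with MAE.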

# For the sake of easier execution, I have brought financial news based prediction in the same code file

In [None]:
# # Import Libraries

# In[1]:


import pandas as pd
# Import Libraries for Graph, GNN, and GCN

import stellargraph as sg
from stellargraph import StellarGraph

from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN


# In[2]:


# Machine Learning related library Imports

from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from IPython.display import display, HTML
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[3]:


# was active

data_folder = './data/yahoonewsarchive/'
# os.chdir(data_folder);
# file = data_folder + 'NEWS_YAHOO_stock_prediction.csv';
file = data_folder + 'News_Yahoo_stock.csv';


# In[4]:


df_news = pd.read_csv(file)
df_news.head()


# In[5]:


df_news = df_news[:100]


# # Approaches: Find all stock tickers in one or all articles
# 
# 1. Find code that already does this: from the internet, from previous work, or from courses taken online or in academia
# 2. Iteratively read each article, match it against stock tickers, and collect all tickers. Drawback: which tickers should be matched, and how do you know what is a ticker? Any two to four uppercase letters, e.g. NASDAQ AAPL
# 3. Load the articles into a database and then use SQL -> may not work that well unless you write some functions
# 4. NLTK: remove stop words, find all tokens, then find all uppercase words. Create a list and attach article ids to it. Then match it against the list of tickers and find the common tickers. Then create tuples of two tickers (indicating an edge): (source, target, weight)

# In[6]:


# import NLTK libraries
# remove stop words using NLTK methods 
# remove all sorts of unnecessary words
# find all tokens
# Keep only all-uppercase words in a list/dictionary/map (a dataframe would be ideal)
# Attach article ids to the list/dataframe data
# Create a list of all NASDAQ tickers
# Then match with the list of NASDAQ tickers
# Find the common tickers between them
# Then create tuples of two tickers (indicating an edge): (source, target, weight)
# increase weight for each article and pair when you see a match
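
The matching step in the plan above can be sketched with the standard library alone. This is an illustration under assumptions: the regular expression (one to five capital letters as a whole word), the helper `tickers_in_text`, and the sample sentence are all made up, and the notebook itself uses NLTK's `RegexpTokenizer` instead.

```python
import re

# Whole words made of one to five capital letters (an assumed ticker shape)
TICKER_PATTERN = re.compile(r"\b[A-Z]{1,5}\b")


def tickers_in_text(text, known_tickers):
    """Return the known tickers that appear as uppercase words in the text."""
    candidates = set(TICKER_PATTERN.findall(text))
    return candidates & set(known_tickers)


known = {"AAPL", "MSFT", "GM", "WMT"}
article = "Shares of AAPL and MSFT rose while GM slipped; the CEO declined to comment."
found = tickers_in_text(article, known)
print(sorted(found))
```

Intersecting with a known ticker list is what filters out uppercase words such as "CEO". Very short tickers (for example single letters) can still produce false matches with this approach.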


# In[7]:


# import NLTK libraries
import nltk


# In[8]:


# remove stop words using NLTK methods 
# remove all sorts of unnecessary words
# find all tokens
# Keep only All Uppercase words in a list : dictionary/map: dataframe will be ideal

from nltk.tokenize import RegexpTokenizer

# Build the tokenizer once. The pattern needs at least three characters
# (two or more capitals followed by at least one word character), so tickers
# shorter than three letters (e.g. 'T', 'F', 'GM') are never matched.
capitalWords = RegexpTokenizer(r'[A-Z]+[A-Z]\w+')

dataFrameWithOnlyCapitalWords = pd.DataFrame(columns =  ["id", "Title", "Content"]) 
for index, row in df_news.iterrows():
    # print(row[id], row['title'], row['content'])

    # print("\n::All Capital Words::", capitalWords.tokenize(row['content']))
    allCapitalWords = capitalWords.tokenize(row['content'])
        
    dataFrameWithOnlyCapitalWords.loc[index] = [index, row['title'], allCapitalWords]
    #break



dataFrameWithOnlyCapitalWords.head() #, dataFrameWithOnlyCapitalWords.shape


# # Create a list of all 30 (NASDAQ) stocks as per the paper
# 

# In[9]:


# Find/Create a list of NASDAQ Stocks
import os
import glob
nasdaqDataFolder = './archive/stock_market_data/nasdaq/csv'
os.chdir(nasdaqDataFolder)





# In[10]:


# Create a list of all NasDAQ Tickers

extension = "csv"
fileTypesToMerge = ""
# all_filenames = [i for i in glob.glob('*' + '*.{}'.format(extension))]
all_nasdaq_tickers = [i[:-4] for i in glob.glob('*' + fileTypesToMerge + '*.{}'.format(extension))]
nasdaq_tickers_to_process = all_nasdaq_tickers #[:10]
nasdaq_tickers_to_process


# In[11]:


# Drop selected tickers; skip any that are not present so the cell can be re-run
for ticker_to_skip in ['FREE', 'CBOE', 'III', 'RVNC']:
    if ticker_to_skip in nasdaq_tickers_to_process:
        nasdaq_tickers_to_process.remove(ticker_to_skip)
sorted(nasdaq_tickers_to_process)


# In[12]:


fortune_30_tickers_to_process = [
'WMT',
'XOM',
'AAPL',
'UNH',
'MCK',
'CVS',
'AMZN',
'T',
'GM',
'F',
'ABC',
'CVX',
'CAH',
'COST',
'VZ',
'KR',
'GE',
'WBA',
'JPM',
'GOOGL',
'HD',
'BAC',
'WFC',
'BA',
'PSX',
'ANTM',
'MSFT',
'UNP',
'PCAR',
'DWDP']





# nasdaq_tickers_to_process = [
# 'WMT',
# 'XOM',
# 'AAPL',
# 'UNH',
# 'MCK',
# 'CVS',
# 'AMZN',
# 'T',
# 'GM',
# 'F',
# 'ABC',
# 'CVX',
# 'CAH',
# 'COST',
# 'VZ',
# 'KR',
# 'GE',
# 'WBA',
# # 'JPM',
# #'GOOGL',
# 'HD',
# 'BAC',
# 'WFC',
# 'BA',
# 'PSX',
# 'ANTM',
# 'MSFT',
# 'UNP',
# 'PCAR',
# 'DWDP']
# 

# # Find NASDAQ tickers in each article
# Graph creation steps
# Find all edges

# In[13]:


combinedTupleList = [];
allMatchingTickers = [];
from itertools import combinations
for index, row in dataFrameWithOnlyCapitalWords.iterrows():
    #print(index)
    #print(set(row['Content']))
    #print(set(nasdaq_tickers_to_process));    
    matchingTickers = set(set(fortune_30_tickers_to_process).intersection(set(row['Content'])))
    #print(matchingTickers)
    if len (matchingTickers) > 1:
        allTuples = list(combinations(matchingTickers, 2));
        #print(list(combinations(matchingTickers, 2)))
        
        #allMatchingTickers = set(allMatchingTickers).union(matchingTickers);
        for aTuple in allTuples:
            combinedTupleList.append(tuple(sorted(aTuple)));
            allMatchingTickers.append(aTuple[0])
            allMatchingTickers.append(aTuple[1])
            
        
    # print("*******************");
    #break
    
#combinedTupleList = list(set(combinedTupleList))
allMatchingTickers = set(allMatchingTickers)

# list(set(combinedTupleList)), len(combinedTupleList), len(set(combinedTupleList)), allMatchingTickers, len(allMatchingTickers), len(set(allMatchingTickers))
sorted(combinedTupleList), type(aTuple), type(sorted(aTuple)), allMatchingTickers


# In[14]:


#combinedTupleList[:1], set(allMatchingTickers)


# In[15]:


# calculate edge weights
from collections import Counter

tuplesWithCount = dict(Counter(combinedTupleList))
tuplesWithCount
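
The edge-weight idea used above (each article that mentions two tickers together adds one to the weight of that pair) can be shown on a tiny example. The article contents are made up and `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(ticker_sets):
    """Count, for each unordered ticker pair, the number of articles mentioning both."""
    counts = Counter()
    for tickers in ticker_sets:
        # Sorting first gives every pair one canonical order, e.g. ('AAPL', 'MSFT')
        for pair in combinations(sorted(tickers), 2):
            counts[pair] += 1
    return counts


articles = [
    {"AAPL", "MSFT"},
    {"AAPL", "MSFT", "GM"},
    {"GM"},                  # a single ticker produces no pair
]
pair_counts = cooccurrence_counts(articles)
print(dict(pair_counts))
```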


# In[16]:


l = list(tuplesWithCount.keys())
l

#print(list(zip(*l))[0])
#print(list(zip(*l))[1])

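# Split the (source, target) pairs into two parallel lists; guard against no edges
source, target = (list(t) for t in zip(*l)) if l else ([], [])
# Materialise the counts as a list so they can be used as a DataFrame column
edge_weights = list(tuplesWithCount.values())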
source, target, edge_weights, len(source), len(target)
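
A pair-count dictionary like the one above can be turned into an edge table in one step. The sketch below uses made-up counts. The column name `weight` is an assumption on my part: as far as I know, StellarGraph looks for edge weights in a column with that name by default, which is worth checking against the installed version.

```python
import pandas as pd

# Hypothetical co-occurrence counts: (source, target) -> number of shared articles
pair_counts = {("AAPL", "MSFT"): 2, ("AAPL", "GM"): 1}

edges = pd.DataFrame(
    [{"source": a, "target": b, "weight": w} for (a, b), w in pair_counts.items()],
    columns=["source", "target", "weight"],  # keeps the columns even with no edges
)
print(edges)
```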


# In[62]:


import networkx as nx
Graph_news = nx.Graph(tuplesWithCount.keys())
nx.draw_networkx(Graph_news, pos = nx.circular_layout(Graph_news), node_color = 'r', edge_color = 'b')
#tuplesWithCount.keys()


# # Finally Create graph based on financial news

# In[17]:


import os
os.getcwd()
os.chdir('../../../../')
#os.chdir('./mcmaster/meng/747/project/')
os.getcwd()


# In[18]:


# Now create node data i.e time series to pass as part of the nodes
'''
df = pd.DataFrame();
data_file = "../../../..//archive/stock_market_data/nasdaq/nasdq-stock-price--all-merged.csv"
# stock-price--all-merged.csv"
df = pd.read_csv(data_file);
df.head()
'''

# this is the place where the new dataset starts i.e. fortune 30 companies
df = pd.DataFrame();
data_file = "per-day-fortune-30-company-stock-price-data.csv";
df = pd.read_csv("./data/" + data_file, low_memory = False);
df.head()


# In[19]:


df.index


# In[20]:


drop_cols_with_na = 1
drop_rows_with_na = 0


# In[21]:


# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html


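try:
    df = df.interpolate(inplace = False)
except Exception as exc:
    # Interpolation can fail (for example on non-numeric columns); keep the data as-is
    print("An exception occurred. Operation ignored:", exc)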
    
df.isnull().values.any()
df[df.isna().any(axis = 1)]  


#----



if drop_cols_with_na == 1:
    df = df.dropna(axis = 1);    
   
df, df.shape

## -- 

df.isnull().values.any()
df[df.isna().any( axis = 1 )]


## --

# df_s_transpose.index = df_s_transpose['Date']
#df.index = df.index.astype('datetime64[ns]')
df
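
The cleaning above combines two steps: linear interpolation fills gaps that lie between known values, and `dropna(axis = 1)` then removes any column that still has missing values (for example, a stock with no price at the start of the period). A toy example with made-up tickers:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "AAA": [1.0, np.nan, 3.0],   # gap between two known values
    "BBB": [np.nan, 2.0, 3.0],   # missing at the start, nothing to interpolate from
})

filled = prices.interpolate()      # default: linear, filling forward only
cleaned = filled.dropna(axis=1)    # drop columns that still contain NaN

print(filled)
print(list(cleaned.columns))
```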


# In[22]:


df_s =  df #[ ['Ticker', 'Date', 'Adjusted Close'] ];
df_s


# In[23]:


df_s["Date"] = df_s["Date"].astype('datetime64[ns]')
df_s = df_s.sort_values( by = 'Date', ascending = True )
df_s


# df_s_pivot = df_s.pivot_table(index = 'Ticker', columns = 'Date', values = 'Adjusted Close')
# df_s_pivot

# In[24]:


allMatchingTickers


# 
# 
# drop_rows_with_na = 0
# if drop_rows_with_na == 1:
#     df_s_transpose = df_s_transpose.dropna(axis=0);
#     #df_s_transpose["Date"] = df_s_transpose["Date"].astype('datetime64[ns]')
#     #df_s_transpose.sort_values(by='Date', ascending=False)
#     df_s_transpose.to_csv('../../../..//archive/stock_market_data/nasdaq/-na-dropped-nasdq-stock-price--all-merged.csv');
#    
# df_s_transpose.head(100)
# 
# 

# In[25]:


df_s_transpose = df_s #_pivot.T
df_s_transpose

df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace=False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['AAPL'].values



# In[26]:


df_s_transpose_feature = df_s_transpose.set_index('Date')


# In[27]:


df_s_transpose_feature


# In[28]:


#df_s_transpose['SIMO']


# In[29]:


# df_s_transpose_feature['AAPL'].values
len(allMatchingTickers), len(set(allMatchingTickers)) #, df_s['Ticker']
#df_s_tickers = df_s['Ticker'];
#len(df_s_tickers), len(set(allMatchingTickers)), df_s_transpose.columns.unique, len(set(df_s['Ticker']))
#df_s_tickers = list(set(df_s['Ticker'])); # list(df_s_transpose.columns.unique) #
#sorted(df_s_tickers)
#for x in df_s_tickers:
 #   print(x)


# In[30]:


df_s_tickers = df_s_transpose_feature.columns
#df_s_tickers = list(set(df_s_tickers.drop('Date')))
sorted(df_s_tickers[:5])


# In[31]:


set_allMatchingTickers = set(allMatchingTickers)
df_s_tickers = fortune_30_tickers_to_process #list(set(df_s['Ticker'])); # list(df_s_transpose.columns.unique) #
node_Data_financial_news = [];

'''
for x in set_allMatchingTickers :
    # if x in df_s_tickers:
    print(x)
    node_Data_financial_news.append( df_s_transpose_feature[x].values)
'''  

node_Data_financial_news = pd.DataFrame(df_s_transpose_feature) #, index = list(allMatchingTickers)) #, index = list(set_allMatchingTickers))
#node_Data_financial_news = node_Data_financial_news.T 
node_Data_financial_news


# In[32]:


node_Data_financial_news = node_Data_financial_news.T
node_Data_financial_news


# In[33]:


node_Data_financial_news


# node_Data_financial_news#.drop(axis = 0)
# node_Data_financial_news = node_Data_financial_news.T
# node_Data_financial_news

# node_Data_financial_news = node_Data_financial_news.drop('Date')
# node_Data_financial_news

# In[34]:


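# Edge list for the news graph: one row per edge between two tickers
financial_news_edge_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_weights}
)

# Note: as far as I know, StellarGraph reads edge weights from a column named
# "weight" by default (edge_weight_column), so "edge_feature" is stored as an
# edge feature and the graph itself may be treated as unweighted.
financial_news_graph = StellarGraph(
    node_Data_financial_news,
    edges = financial_news_edge_data,
    node_type_default = "corner",
    edge_type_default = "line",
)
print(financial_news_graph.info())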


# In[35]:


# debug code
# financial_news_graph_data,  sorted(node_Data_financial_news.columns.unique())
# [1,2] + [2, 3,4], set(source + target).difference(sorted(node_Data_financial_news.columns.unique()))


# In[36]:


# Generator
generator = FullBatchNodeGenerator(financial_news_graph, method = "gcn")


# # Machine Learning, Deep Learning, GCN, CNN

# # Train Test Split

# In[37]:


# Split the ticker nodes (rows) into train and test sets; with no sizes given,
# scikit-learn holds out 25% of the rows for testing
train_subjects, test_subjects = model_selection.train_test_split(
    node_Data_financial_news #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=pearson_graph_node_data

# Split the held-out rows again: validation rows and the remaining test rows
val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape


# In[38]:


# Targets: each node's target is its own price row, so the targets equal the input features

train_targets = train_subjects
val_targets = val_subjects
test_targets = test_subjects


# In[39]:


# Inspect the training node IDs and their targets
train_subjects.index, train_targets


# In[40]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[41]:

'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''

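# GCN encoder; layer_sizes and activations must already be defined
# (the disabled block above shows the intended defaults)
gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# Regression head: a single linear Dense layer with one output per date column
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)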


# Inspect the GCN input and output tensors
x_inp, x_out


# # Models

# In[42]:


# loss functions: https://keras.io/api/losses/

# Build the Keras model from the GCN inputs to the regression head.
# This must run before model.compile below; it was previously inside the disabled block.
model = Model(
    inputs = x_inp, outputs = predictions)


'''
# Disabled: alternative compile settings kept for reference
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)


# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mae', 'mape']
)
'''
# Active compile settings: MAE loss, with MSE, MAE, and MAPE reported as metrics

# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mae', 'mape']
    
)


# In[43]:


# model.summary() prints the summary itself and returns None, so it is not wrapped in print()
model.summary()
len(x_inp), predictions.shape


# In[44]:


len(val_subjects)


# In[45]:


# Validation generator: pair the validation nodes with their own targets
# (previously the targets were the first rows of test_subjects, which belong to other nodes)
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# In[46]:



data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];


# In[47]:


type(train_gen_data), type(data_valid), type(x_inp), type(x_out) 


# In[48]:

'''
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)
'''

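# es_callback and epochs_to_test are assumed to be defined earlier in this file
history = model.fit(
    train_gen_data,
    epochs = epochs_to_test,
    validation_data = data_valid,
    verbose = 2,
    # shuffle must stay False: shuffling here would mean shuffling the whole graph
    shuffle = False,
    callbacks = [es_callback],
)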



In [None]:
sg.utils.plot_history(history)

In [None]:
# [1]
# Note: test_subjects still contains the rows that were drawn into val_subjects;
# test_subjects_step_2 is the part that was not used for validation.
test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    


# # Show the predicted prices by the Model
# 
# I still need to work out exactly what the GCN (and CNN) + MLP combination is predicting,
# so for now I am only displaying the output.
# The output layer has one unit per date column, so the model returns one predicted price per timestamp (day) for each ticker.

# In[51]:


all_nodes = node_Data_financial_news.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_predictions, all_predictions.shape, node_Data_financial_news.shape


# In[52]:




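df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAE', 'MAPE'])

# One row per method: the label followed by loss, MSE, MAE, MAPE from model.evaluate
temp = ['GCN-Causation-News']
for name, val in zip(model.metrics_names, test_metrics):
    temp.append(val)

print(temp)
df_metrics.loc[1] = temp

import math
# Copy before editing so that df_metrics keeps the raw MSE values
df_metrics_plot = df_metrics[['Loss', 'MSE', 'MAE', 'MAPE']].copy()
# math.sqrt works on scalars, so apply it row by row; after this the 'MSE' column holds RMSE
df_metrics_plot['MSE'] = [math.sqrt(x) for x in df_metrics_plot['MSE']]
df_metrics_plot


df_metrics_plot.plot( kind = 'bar')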

In [None]:
# df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAE', 'MAPE'])

temp = list()
temp.append('GCN-News');
for name, val in zip(model.metrics_names, test_metrics):    
    temp.append(val)

df_metrics.loc[4] = temp

# import math
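# Copy before editing so that df_metrics keeps the raw MSE values
df_metrics_plot = df_metrics[['Loss', 'MSE', 'MAE', 'MAPE']].copy()

# Square root per row; after this the 'MSE' column holds RMSE
df_metrics_plot['MSE'] = [ math.sqrt(x) for x in df_metrics_plot['MSE']]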
df_metrics_plot

In [None]:
df_metrics_plot.plot( kind = 'bar')
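
The metrics collected above (loss, MSE, MAE, MAPE) are the ones Keras reports during evaluation, and the plot uses the square root of the MSE. As a minimal sketch of how these numbers relate, the snippet below computes them by hand in plain Python. The function name `regression_metrics` and the prices are hypothetical and only for illustration; Keras may handle edge cases (such as true values of zero in MAPE) differently.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true value, so it is undefined when a true value is 0
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}

# Hypothetical prices, for illustration only
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is in the same unit as the prices, which is why it is easier to compare with MAE in one bar chart than the raw MSE.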

To start with, I took ideas from the following example, especially to see what a GCN is and how it works.

Note that it does not use any CNN.

Node classification with Graph Convolutional Network (GCN). 

https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html 

References:



[1] Node classification with Graph Convolutional Network (GCN). https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html 


[2] Loading data into StellarGraph from Pandas. https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html

[3] Load Timeseries https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-numpy.html

[4] NetworkX: https://networkx.org/documentation/stable/reference/introduction.html 

[5] StellarGraph and NetworkX: https://stellargraph.readthedocs.io/en/latest/demos/basics/loading-networkx.html 

[6] Select a StellarGraph algorithm: https://stellargraph.readthedocs.io/en/stable/demos/#find-a-demo-for-an-algorithm 


Learning: 
GNN/GCN/Keras
https://www.youtube.com/watch?v=0KH95BEz370


Install StellarGraph:
https://pypi.org/project/stellargraph/#install-stellargraph-using-pypi


May want to try this without StellarGraph:
https://keras.io/examples/graph/gnn_citations/

To get feature data from a pandas DataFrame and create the graph properly:
https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html

https://stellargraph.readthedocs.io/en/v0.11.0/api.html


Graph Regression Dataset
https://paperswithcode.com/task/graph-regression/codeless

StellarGraph Reference:
https://stellargraph.readthedocs.io/en/stable/demos/time-series/gcn-lstm-time-series.html
https://stellargraph.readthedocs.io

Graph CNN or similar:
It has multiple GCN layers and one 1D CNN + ... this idea might help
https://stellargraph.readthedocs.io/en/stable/demos/graph-classification/dgcnn-graph-classification.html?highlight=cnn
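
To make the "GCN layers followed by a 1D CNN" idea more concrete, here is a minimal NumPy sketch of the step that, as I understand the DGCNN design, sits between the two: node embeddings are sorted by their last channel and cut or padded to a fixed number of rows `k`, so that a 1D convolution can run over them. The helper names `sort_pool` and `conv1d_per_node`, the value of `k`, and the numbers are hypothetical; this is not the StellarGraph code.

```python
import numpy as np

def sort_pool(node_embeddings, k):
    """Sort nodes by their last channel (descending) and keep exactly k rows."""
    order = np.argsort(-node_embeddings[:, -1], kind="stable")
    top = node_embeddings[order][:k]
    if top.shape[0] < k:
        # Pad small graphs with zero rows so every graph has the same size
        pad = np.zeros((k - top.shape[0], node_embeddings.shape[1]))
        top = np.vstack([top, pad])
    return top

def conv1d_per_node(pooled, kernel):
    """1D convolution with kernel size and stride equal to the channel count,
    so each output value summarises one node."""
    flat = pooled.reshape(-1)
    step = pooled.shape[1]
    return np.array([flat[i:i + step] @ kernel for i in range(0, flat.size, step)])

# Hypothetical GCN output: 3 nodes, 2 channels each
embeddings = np.array([[0.1, 0.5],
                       [0.3, 0.9],
                       [0.2, 0.1]])
pooled = sort_pool(embeddings, k=4)
signal = conv1d_per_node(pooled, kernel=np.array([1.0, 1.0]))
print(pooled)
print(signal)
```

The fixed-size `signal` is what further CNN and dense layers could consume, regardless of how many nodes the input graph had.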

# References -- exploring ideas on the GCN-CNN
https://ieeexplore.ieee.org/document/9149910

https://antonsruberts.github.io/graph/gcn/

This may work: the repository defines a unit GCN and also a unit TCN, which may make it possible to customize the model to produce the correct output.
https://github.com/lshiwjx/2s-AGCN  https://paperswithcode.com/paper/non-local-graph-convolutional-networks-for

    

# From scratch and equations
https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b

https://jonathan-hui.medium.com/graph-convolutional-networks-gcn-pooling-839184205692