# 2nd Model: Stock Price Prediction using Graph Convolutional Layers. The model uses a GCN and an MLP. The input graphs were created using Pearson, Spearman, and Kendall Tau correlation coefficients. Another graph is created from financial news articles.

# For easier execution (all at once), I have kept multiple approaches in the same file. Because I first tried them separately and then brought them together, some code may be redundant or repeated, and I may not have cleaned it up enough.

# Import Libraries

In [None]:
# import libraries
import os
import pandas as pd
import math

In [None]:
# Import Libraries for Graph, GNN, and GCN
import stellargraph as sg
from stellargraph import StellarGraph
from stellargraph.layer import DeepGraphCNN
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN

In [None]:
# Machine Learning related library imports
from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from IPython.display import display, HTML
import matplotlib.pyplot as plt
%matplotlib inline
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

In [None]:
# If we want to drop NAN column or row wise
drop_cols_with_na = 1
drop_rows_with_na = 1

# Dataset: Use Fortune 30 companies as the paper used

In [None]:
df_s = pd.DataFrame();
data_file = "per-day-fortune-30-company-stock-price-data.csv";
df_s = pd.read_csv("./data/" + data_file, low_memory = False);
df_s.head()

# Clean the data: replace missing/null values, use correct data types, sort by date (not really required)

In [None]:
# convert Date field to be a Date Type
df_s["Date"] = df_s["Date"].astype('datetime64[ns]')

# Sort data by date although this is no longer needed as data already is sorted
#df_s = df_s.sort_values( by = ['Ticker','Date'], ascending = True )
df_s = df_s.sort_values( by = 'Date', ascending = True )
df_s.head()

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
df_s_transpose = df_s

try:
  df_s_transpose = df_s_transpose.interpolate(inplace = False)
except Exception as e:
  # keep the data unchanged if interpolation fails
  print("An exception occurred. Operation ignored:", e)
    
df_s_transpose.isnull().values.any()
df_s_transpose[df_s_transpose.isna().any(axis = 1)]    
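
As a minimal sketch of what `interpolate` does to gaps in a price column, the toy frame below (made-up tickers and values, not the Fortune 30 data) fills interior NaNs linearly. With the default settings a leading NaN has no earlier value to interpolate from and stays missing, which is one reason the drop-NaN switches above are still useful.

```python
import numpy as np
import pandas as pd

# Toy prices; "AAA" and "BBB" are hypothetical tickers
toy = pd.DataFrame({
    "AAA": [10.0, np.nan, 14.0, 16.0],
    "BBB": [np.nan, 5.0, np.nan, 9.0],
})

# Default method is linear interpolation along each column
filled = toy.interpolate()
print(filled)

# Count the values that are still missing per column
remaining = filled.isna().sum()
print(remaining)
```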

In [None]:
df_s_transpose = df_s

if drop_cols_with_na == 1:
    df_s_transpose = df_s_transpose.dropna(axis = 1);    
   
df_s_transpose, df_s_transpose.shape

In [None]:
df_s_transpose.isnull().values.any()
df_s_transpose[df_s_transpose.isna().any( axis = 1 )]

In [None]:
# df_s_transpose.index = df_s_transpose['Date']
df_s_transpose.index = df_s_transpose.index.astype('datetime64[ns]')

# Pearson Correlation Coefficient

In [None]:
df_s_transpose_pearson = df_s_transpose.corr(method = 'pearson', numeric_only = True)
df_s_transpose_pearson

# Pearson Correlation Coefficient based Adjacency Graph Matrix

In [None]:
df_s_transpose_pearson[df_s_transpose_pearson >= 0.5] = 1
df_s_transpose_pearson[df_s_transpose_pearson < 0.5] = 0
df_s_transpose_pearson

In [None]:
# make the diagonal element to be zero. No self loop
import numpy as np
np.fill_diagonal(df_s_transpose_pearson.values, 0)
df_s_transpose_pearson
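
The thresholding and diagonal reset above can be sketched on a small matrix. The tickers and correlation values below are invented for illustration, and the 0.5 cut-off mirrors the one used for Pearson. Note that a strongly negative correlation (here -0.6) produces no edge under this rule.

```python
import numpy as np
import pandas as pd

tickers = ["AAA", "BBB", "CCC"]  # hypothetical tickers
corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, -0.6],
     [0.2, -0.6, 1.0]],
    index=tickers, columns=tickers,
)

threshold = 0.5
# 1 where the correlation reaches the threshold, 0 elsewhere
adj_values = (corr.values >= threshold).astype(int)
# remove self loops
np.fill_diagonal(adj_values, 0)

adjacency = pd.DataFrame(adj_values, index=tickers, columns=tickers)
print(adjacency)
```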

Create and visualize the Graphs

In [None]:
import networkx as nx
Graph_pearson = nx.Graph(df_s_transpose_pearson)

In [None]:
nx.draw_networkx(Graph_pearson, pos=nx.circular_layout(Graph_pearson), node_color='r', edge_color='b')

# Create GCN layer. Pearson

# Find all stocks = nodes

In [None]:
# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_pearson.index.to_list()
all_stock_nodes

# Find all edges between nodes

In [None]:
source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_pearson[aStock][anotherStock] > 0:
            #print(df_s_transpose_pearson[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)
            
source, target, edge_feature            
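
A possible vectorized alternative to the double loop, shown on a toy adjacency matrix with hypothetical tickers. Because the matrix is symmetric, scanning every cell records each undirected edge twice (A to B and B to A). The last step keeps one copy per pair, which may be worth comparing against the edge count reported by `info()`.

```python
import numpy as np
import pandas as pd

tickers = ["AAA", "BBB", "CCC"]  # hypothetical tickers
adjacency = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]],
    index=tickers, columns=tickers,
)

# positions of all non-zero cells, i.e. all (row, column) pairs with an edge
rows, cols = np.nonzero(adjacency.values)
edges = pd.DataFrame({
    "source": adjacency.index[rows],
    "target": adjacency.columns[cols],
})

# keep each undirected pair once by ordering the endpoints alphabetically
undirected = edges[edges["source"] < edges["target"]].reset_index(drop=True)
print(edges)
print(undirected)
```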

# variables to create stellar graph

In [None]:
# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
pearson_edges = pd.DataFrame(
    {"source": source, "target": target}
)

pearson_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


pearson_edges[:10]

#Graph with No Feature Data, No node data, only edges

pearson_graph = StellarGraph(edges = pearson_edges, node_type_default="corner", edge_type_default="line")
#pearson_graph = StellarGraph(nodes = all_stock_nodes, edges = pearson_edges)
#graph = sg.StellarGraph(all_stock_nodes, square_edges)
print(pearson_graph.info())

# have the time series data as part of the nodes

df_s_transpose

# Structure the Feature Matrix so that it can be passed to the GCN

In [None]:
df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values

In [None]:
node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    

In [None]:
pearson_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
pearson_graph_node_data
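
The loop above builds one row per stock with one column per day. As a small sketch with made-up prices, the same node-feature matrix can be obtained by selecting the ticker columns and transposing:

```python
import pandas as pd

# Toy price table: one column per (hypothetical) ticker, one row per day
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0],
    "BBB": [20.0, 19.0, 21.0],
})
all_nodes = ["AAA", "BBB"]

# Loop version, as in the cell above
node_rows = [prices[t].values for t in all_nodes]
looped = pd.DataFrame(node_rows, index=all_nodes)

# Transpose version: rows become tickers, columns become days
transposed = prices[all_nodes].T

print(transposed)
```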

# Graph (stellar) with feature as part of Nodes

In [None]:
pearson_graph_with_node_features = StellarGraph(pearson_graph_node_data, edges = pearson_edges, node_type_default = "corner", edge_type_default = "line")
print(pearson_graph_with_node_features.info())

In [None]:
# Generator
generator = FullBatchNodeGenerator(pearson_graph_with_node_features, method = "gcn") # , sparse = False
vars(generator)
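
To make the role of the graph clearer, here is a minimal NumPy sketch of one graph-convolution step. It assumes the normalization from Kipf and Welling, D^-1/2 (A + I) D^-1/2, which is what `method = "gcn"` is understood to precompute. The function names, toy graph, and weights are illustrative and not part of StellarGraph.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a dense adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])        # add self loops
    deg = a_hat.sum(axis=1)                   # degree including the self loop
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

def gcn_layer(adj_norm, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return np.maximum(adj_norm @ features @ weights, 0.0)

# Toy path graph 0 - 1 - 2 with two features per node
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
weights = np.ones((2, 1))                     # hypothetical layer weights

adj_norm = normalized_adjacency(adj)
hidden = gcn_layer(adj_norm, features, weights)
print(hidden)
```

Each output row mixes a node's own features with those of its neighbours, so a stock's representation depends on the stocks it is correlated with.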

# Train Test Split

In [None]:
train_subjects, test_subjects = model_selection.train_test_split(
    pearson_graph_node_data #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=pearson_graph_node_data

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape
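
In the split above, `val_subjects` is carved out of `test_subjects`, so the validation nodes are also part of the test set that is evaluated later. A hedged sketch of three disjoint, reproducible splits follows. It uses a plain NumPy permutation as a stand-in for `train_test_split`, and the function name, fractions, and seed are assumptions.

```python
import numpy as np

def three_way_split(node_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node ids once and cut them into disjoint train/val/test lists."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(node_ids))
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled[:n_train].tolist()
    val = shuffled[n_train:n_train + n_val].tolist()
    test = shuffled[n_train + n_val:].tolist()
    return train, val, test

# 30 hypothetical node ids standing in for the 30 tickers
nodes = [f"T{i:02d}" for i in range(30)]
train_ids, val_ids, test_ids = three_way_split(nodes)
print(len(train_ids), len(val_ids), len(test_ids))
```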

In [None]:
pearson_graph_node_data[:10]

In [None]:
train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 

In [None]:
train_gen = generator.flow(train_subjects.index, train_targets)

In [None]:
# debug
train_subjects.index, 
train_targets[:2]

#train data size
# using a variable such as unit_count is optional
unit_count = train_subjects.shape[0]
unit_count

# The Model for all of the approaches utilized in this file
# Model for Pearson, Spearman, Kendal Tau, Financial News Based prediction

In [None]:
# 1D, 2d, 3D CNN: https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610
import tensorflow as tf

layer_sizes = [32, 32]
activations = ["relu", "relu"]

gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

input_shape = (1, 21, 753)
x_out = Conv1D(filters = 1, kernel_size = 1,  activation='relu', strides=1, input_shape = input_shape)(x_out)




#x_out = tf.keras.layers.Flatten()(x_out)

# x_out, pool_size=2, strides=1, padding='VALID'
#x_out = tf.keras.layers.MaxPooling1D(pool_size=1, strides=1, padding='VALID')(x_out)
print(x_out.shape)

# [(length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1]



#x_out = MaxPool1D(pool_size=2)(x_out)
#x_out = Conv1D(filters = 32, kernel_size = sum(layer_sizes))(x_out)
#prediction = keras.layers.Reshape((-1,))(prediction)
#x_out = keras.layers.Reshape((1,16))(x_out)
#x_out = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator)(x_out) #, dropout = 0.5


# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out) 
# len(x_inp), x_out[1:].shape
train_targets.shape[1]

# hard coded size adjustments
# validation targets are the rows of the validation nodes themselves
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]

In [None]:
# Models

In [None]:
# loss functions: https://keras.io/api/losses/

model = Model(
    inputs = x_inp, outputs = predictions)

'''
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)

# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)

# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mse']
    metrics=['mse', 'mape', 'mae']
)
'''

# 1st block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mse']
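    # spelled out so the metric name matches the EarlyStopping monitor "val_mean_squared_error"
    metrics=['mean_squared_error', 'mape', 'mae']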
)
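
For reference, the reported metrics can be written out in plain Python. This is a simplified sketch with invented numbers: it assumes all true values are non-zero, whereas Keras guards the MAPE denominator with a small epsilon. Since the loss here is mean absolute error, the `loss` and `mae` values reported by `evaluate` should coincide.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "mape": mape}

actual = [100.0, 200.0, 400.0]       # invented prices
predicted = [110.0, 190.0, 400.0]    # invented predictions
result = regression_metrics(actual, predicted)
print(result)
```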

In [None]:
len(x_inp), predictions.shape, print(model.summary())

In [None]:
len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]

In [None]:
# validation targets are the rows of the validation nodes themselves
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]
val_gen[4]

train_gen[:1][:4]

In [None]:
# type(train_gen_data), type(data_valid), type(x_inp), type(x_out) 

In [None]:
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

epochs_to_test = 10000
patience_to_test = 10000

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)

data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)
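
A simplified sketch of the patience rule behind `EarlyStopping`, ignoring options such as `min_delta` and `baseline`. The function name and validation scores are invented. It illustrates that when `patience` equals the number of epochs, as configured above, the stopping condition cannot be reached and training runs for all epochs.

```python
def early_stop_epoch(val_scores, patience):
    """Return (epoch at which training stops or None, epoch with the best score)."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score < best:
            # improvement: remember it and reset the counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

scores = [5.0, 4.0, 4.5, 4.2, 4.1, 3.9]   # invented validation MSE per epoch

stopped_short = early_stop_epoch(scores, patience=2)
never_stops = early_stop_epoch(scores, patience=len(scores))
print(stopped_short, never_stops)
```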


In [None]:
sg.utils.plot_history(history)

In [None]:
val_subjects, 
test_subjects

# Train Metrics for Pearson Based Prediction: GCN + CNN

In [None]:
train_gen = generator.flow(train_subjects.index, train_targets)
train_metrics = model.evaluate(train_gen)
print("\nTrain Set Metrics:")

print("Train Metrics for Pearson Based Prediction: GCN + CNN");
for name, val in zip(model.metrics_names, train_metrics):
    print("\t{}: {:0.4f}".format(name, val))

In [None]:
test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))

In [None]:
df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAPE', 'MAE'])

temp = list()
temp.append('GCN-Pearson');
for name, val in zip(model.metrics_names, test_metrics):
    # print(val)
    temp.append(round(val,2))

print(temp)
df_metrics.loc[1] = temp
df_metrics

# Show the predicted prices by the Model

At this point, I still need to make sense of what GCN ( and CNN) combination + MLP is predicting. 
I am just displaying the output. 
It appears that price is predicted for each timestamp (day)

In [None]:
all_nodes = pearson_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_nodes, all_predictions, all_predictions.shape, pearson_graph_node_data.shape

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)

In [None]:
# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, pearson_graph_node_data.shape
vars(all_gen)

In [None]:
pearson_graph_node_data

In [None]:
vars(all_gen)

In [None]:
train_gen[:1][:4]

****************************************************
STOP because we are testing a new model
****************************************************

# SPEARMAN
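
As a reminder of how this differs from the Pearson graph: Spearman correlation is Pearson correlation computed on ranks, so it rewards any monotonic relationship rather than only a linear one. A minimal NumPy sketch with invented data, assuming no tied values:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on ranks (valid when there are no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                      # monotonic but not linear

print(pearson(x, y), spearman(x, y))
```

This is also why a different threshold per method (0.5, 0.4, 0.3 in this file) can produce graphs of different density from the same prices.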

In [None]:
# Spearman
# epochs_to_test = 15000
# patience_to_test = 15000

df_s_transpose_spearman = df_s_transpose.corr(method = 'spearman', numeric_only = True)
df_s_transpose_spearman


# # Spearman Correlation Coefficient based Adjacency Graph Matrix

# In[32]:


df_s_transpose_spearman[df_s_transpose_spearman >= 0.4] = 1
df_s_transpose_spearman[df_s_transpose_spearman < 0.4] = 0
df_s_transpose_spearman


# In[33]:


# make the diagonal element to be zero. No self loop
import numpy as np
np.fill_diagonal(df_s_transpose_spearman.values, 0)
df_s_transpose_spearman


# Create and visualize the Graphs

# In[34]:


import networkx as nx
Graph_spearman = nx.Graph(df_s_transpose_spearman)


# In[36]:


nx.draw_networkx(Graph_spearman, pos=nx.circular_layout(Graph_spearman), node_color='r', edge_color='b')


# # Create GCN layer. Graph_spearman

# # Find all stocks = nodes

# In[37]:


# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_spearman.index.to_list()
all_stock_nodes


# # Find all edges between nodes

# In[38]:


source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_spearman[aStock][anotherStock] > 0:
            #print(df_s_transpose_spearman[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)
            
source, target, edge_feature            


# In[39]:


# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
spearman_edges = pd.DataFrame(
    {"source": source, "target": target}
)

spearman_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


spearman_edges[:10]


# # Graph with No Feature Data, No node data, only edges

# spearman_graph = StellarGraph(edges = spearman_edges, node_type_default="corner", edge_type_default="line")
# #spearman_graph = StellarGraph(nodes = all_stock_nodes, edges = spearman_edges)
# # graph = sg.StellarGraph(all_stock_nodes, square_edges)
# print(spearman_graph.info())

# In[40]:


# Trying to have the time series data as part of the nodes


# In[41]:


df_s_transpose


# # Structure the Feature Matrix so that it can be passed to the GCN

# In[43]:


df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values


# In[44]:


node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    


# In[45]:


spearman_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
spearman_graph_node_data


# # Graph with feature as part of Nodes

# In[46]:


spearman_graph_with_node_features = StellarGraph(spearman_graph_node_data, edges = spearman_edges, node_type_default = "corner", edge_type_default = "line")
print(spearman_graph_with_node_features.info())


# In[47]:


# Generator
generator = FullBatchNodeGenerator(spearman_graph_with_node_features, method = "gcn") # , sparse = False
vars(generator)


# # Train Test Split

# In[48]:


train_subjects, test_subjects = model_selection.train_test_split(
    spearman_graph_node_data #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=pearson_graph_node_data

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape


# In[49]:


spearman_graph_node_data


# In[50]:


train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 


# In[51]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[52]:


# debug
train_subjects.index, 
train_targets


# In[53]:


# train data size
# using a variable such as unit_count is optional
unit_count = train_subjects.shape[0]
unit_count


# In[54]:


'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''

gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)

'''
x_out, 
x_inp, x_out
'''

# # hard coded size adjustments
# test_subjects_adjusted = test_subjects[:len(val_subjects)]
# 
# val_gen = generator.flow(val_subjects.index, test_subjects_adjusted)
# # train_gen[1], val_gen[1]

# In[55]:


# Models. This code could be removed, because the model is defined earlier and the same model/architecture is used by all approaches


# In[56]:


# loss functions: https://keras.io/api/losses/
'''
model = Model(
    inputs = x_inp, outputs = predictions
)
'''
'''
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)
'''

# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
# train_steps = 1000
# lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
'''
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mape', 'mae']
)
'''
# 2nd block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mape', 'mae']
   
)


# In[57]:


len(x_inp), predictions.shape, print(model.summary())


# In[58]:


len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]


# In[59]:


# validation targets are the rows of the validation nodes themselves
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# train_gen[:1][:4]

# In[60]:


# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

'''
#epochs_to_test = 10000
#patience_to_test = 10000

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)

data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];
'''

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)



In [None]:
sg.utils.plot_history(history)

In [None]:
# [1]


# In[61]:


val_subjects, 
test_subjects


# In[62]:


train_gen = generator.flow(train_subjects.index, train_targets)
train_metrics = model.evaluate(train_gen)
print("\nTrain Set Metrics:")

print("Train Metrics for Spearman Based Prediction: GCN + CNN");
for name, val in zip(model.metrics_names, train_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    
test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    
    
#df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAPE', 'MAE'])

temp = list()
temp.append('GCN-Spearman');
for name, val in zip(model.metrics_names, test_metrics):
    # print(val)
    temp.append(round(val,2))

print(temp)
df_metrics.loc[2] = temp
df_metrics

    


# # Show the predicted prices by the Model
# 
# At this point, I still need to make sense of what GCN ( and CNN) combination + MLP is predicting. 
# I am just displaying the output. 
# It appears that price is predicted for each timestamp (day)

# In[63]:


all_nodes = spearman_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_nodes, all_predictions, all_predictions.shape, spearman_graph_node_data.shape


# In[64]:


# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)


# In[65]:


# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, spearman_graph_node_data.shape
vars(all_gen)


# In[66]:


spearman_graph_node_data


# In[67]:


vars(all_gen)


# In[ ]:


# In[68]:


train_gen[:1][:4]


# In[ ]:



# Kendall Tau

In [None]:
# kendall_tau

#epochs_to_test = 15000
#patience_to_test = 15000


df_s_transpose_kendall_tau = df_s_transpose.corr(method = 'kendall', numeric_only = True)
df_s_transpose_kendall_tau


# # kendall_tau Correlation Coefficient based Adjacency Graph Matrix

# In[32]:


df_s_transpose_kendall_tau[df_s_transpose_kendall_tau >= 0.3] = 1
df_s_transpose_kendall_tau[df_s_transpose_kendall_tau < 0.3] = 0
df_s_transpose_kendall_tau


# In[33]:


# make the diagonal element to be zero. No self loop
import numpy as np
np.fill_diagonal(df_s_transpose_kendall_tau.values, 0)
df_s_transpose_kendall_tau


# Create and visualize the Graphs

# In[34]:


import networkx as nx
Graph_kendall_tau = nx.Graph(df_s_transpose_kendall_tau)


# In[36]:


nx.draw_networkx(Graph_kendall_tau, pos=nx.circular_layout(Graph_kendall_tau), node_color='r', edge_color='b')


# # Create GCN layer. Graph_kendall_tau

# # Find all stocks = nodes

# In[37]:


# improvement: make sure only stocks/nodes that are in the graph are taken
all_stock_nodes = df_s_transpose_kendall_tau.index.to_list()
all_stock_nodes


# # Find all edges between nodes

# In[38]:


source = [];
target = [];
edge_feature = [];

for aStock in all_stock_nodes:
    for anotherStock in all_stock_nodes:
        if df_s_transpose_kendall_tau[aStock][anotherStock] > 0:
            #print(df_s_transpose_kendall_tau[aStock][anotherStock])
            source.append(aStock)
            target.append(anotherStock)
            edge_feature.append(1)
            
source, target, edge_feature            


# In[39]:


# https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html
kendall_tau_edges = pd.DataFrame(
    {"source": source, "target": target}
)

kendall_tau_edges_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": edge_feature}
)


kendall_tau_edges[:10]


# # Graph with No Feature Data, No node data, only edges

# kendall_tau_graph = StellarGraph(edges = kendall_tau_edges, node_type_default="corner", edge_type_default="line")
# #kendall_tau_graph = StellarGraph(nodes = all_stock_nodes, edges = kendall_tau_edges)
# # graph = sg.StellarGraph(all_stock_nodes, square_edges)
# print(kendall_tau_graph.info())

# In[40]:


# Trying to have the time series data as part of the nodes


# In[41]:


df_s_transpose


# # Structure the Feature Matrix so that it can be passed to the GCN

# In[43]:


df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace = False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['WY'].values
df_s_transpose_feature['AAPL'].values


# In[44]:


node_Data = [];
for x in all_stock_nodes:
    node_Data.append( df_s_transpose_feature[x].values)
    
    
node_Data    


# In[45]:


kendall_tau_graph_node_data = pd.DataFrame(node_Data, index = all_stock_nodes)
kendall_tau_graph_node_data


# # Graph with feature as part of Nodes

# In[46]:


kendall_tau_graph_with_node_features = StellarGraph(kendall_tau_graph_node_data, edges = kendall_tau_edges, node_type_default = "corner", edge_type_default = "line")
print(kendall_tau_graph_with_node_features.info())


# In[47]:


# Generator
generator = FullBatchNodeGenerator(kendall_tau_graph_with_node_features, method = "gcn") # , sparse = False
vars(generator)


# # Train Test Split

# In[48]:


train_subjects, test_subjects = model_selection.train_test_split(
    kendall_tau_graph_node_data #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=kendall_tau_graph_node_data

val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape


# In[49]:


kendall_tau_graph_node_data


# In[50]:


train_targets = train_subjects; 
val_targets = val_subjects; 
test_targets = test_subjects; 


# In[51]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[52]:


# debug
train_subjects.index, 
train_targets


# In[53]:


# train data size
# using a variable such as unit_count is optional
unit_count = train_subjects.shape[0]
unit_count


# In[54]:

'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''
gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)

'''
x_out, 
x_inp, x_out


# # hard coded size adjustments
# test_subjects_adjusted = test_subjects[:len(val_subjects)]
# 
# val_gen = generator.flow(val_subjects.index, test_subjects_adjusted)
# # train_gen[1], val_gen[1]

# In[55]:


# Models


# In[56]:


# loss functions: https://keras.io/api/losses/

model = Model(
    inputs = x_inp, outputs = predictions)


model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)


# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mape',  'mae']
)

'''

# 3rd block
# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mape', 'mae']
    # metrics=[
    #    metrics.MeanSquaredError(),
    #    metrics.AUC(),
    #]
)


len(x_inp), predictions.shape, print(model.summary())

len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]

# validation targets are the rows of the validation nodes themselves
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# train_gen[:1][:4]

# In[60]:

'''
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping
es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)

data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];
'''

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)




In [None]:
sg.utils.plot_history(history)

In [None]:
nx.draw_networkx(Graph_kendall_tau, pos=nx.circular_layout(Graph_kendall_tau), node_color='r', edge_color='b')

In [None]:
# [1]


# In[61]:


val_subjects, 
test_subjects


# In[62]:

train_gen = generator.flow(train_subjects.index, train_targets)
train_metrics = model.evaluate(train_gen)
print("\nTrain Set Metrics:")

print("Train Metrics for Kendall Tau Based Prediction: GCN + CNN");
for name, val in zip(model.metrics_names, train_metrics):
    print("\t{}: {:0.4f}".format(name, val))


test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    


# # Show the predicted prices by the Model
# 
# At this point, I still need to make sense of what GCN ( and CNN) combination + MLP is predicting. 
# I am just displaying the output. 
# It appears that price is predicted for each timestamp (day)

# In[63]:


all_nodes = kendall_tau_graph_node_data.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_nodes, all_predictions, all_predictions.shape, kendall_tau_graph_node_data.shape


# In[64]:


# https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
model.predict(
    all_gen,
    batch_size = None,
    verbose = 2,
    steps = None,
    callbacks = None,
    max_queue_size = 10,
    workers = 1,
    use_multiprocessing = False
)


# In[65]:


# all_predictions = model.predict(all_nodes)

# all_predictions, all_predictions.shape, kendall_tau_graph_node_data.shape
vars(all_gen)


# In[66]:


kendall_tau_graph_node_data


# In[67]:


vars(all_gen)


# In[ ]:


# In[68]:


train_gen[:1][:4]


# In[ ]:



In [None]:
# df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAPE', 'MAE'])

temp = list()
temp.append('GCN-Kendall');
for name, val in zip(model.metrics_names, test_metrics):    
    temp.append(round(val,2))

print(temp)
df_metrics.loc[3] = temp


In [None]:
# import math
# work on a copy so the column assignment below does not touch df_metrics
df_metrics_plot = df_metrics[['Loss', 'MSE', 'MAPE', 'MAE']].copy()

#temp = [10.71573, 13.578422, 10.71573, 16.638063]
#temp = [19.04899024963379, 1377.4075927734375, 19.04899024963379, 26.09033203125]
#df_metrics_plot.loc[4] = temp

# take the square root so the 'MSE' column holds RMSE, on the same scale as MAE
df_metrics_plot['MSE'] = [math.sqrt(x) for x in df_metrics_plot['MSE']]
df_metrics_plot
#df_metrics, df_metrics_plot

In [None]:
df_metrics_plot.plot( kind = 'bar')

# For the sake of easier execution, I have brought financial news based prediction in the same code file
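
The news-based graph in this part links two stocks when they are mentioned in the same article. A minimal standard-library sketch of that idea follows. The ticker set, the regular expression (two to four uppercase letters), the helper names, and the sample articles are all assumptions for illustration. It is not the NLTK pipeline used in the cells that follow.

```python
import re
from collections import Counter
from itertools import combinations

KNOWN_TICKERS = {"AAPL", "MSFT", "AMZN"}          # hypothetical ticker list
TICKER_PATTERN = re.compile(r"\b[A-Z]{2,4}\b")    # two to four uppercase letters

def tickers_in(text):
    """Return the sorted known tickers mentioned in one article."""
    return sorted(set(TICKER_PATTERN.findall(text)) & KNOWN_TICKERS)

def cooccurrence_edges(articles):
    """Count, for every ticker pair, the number of articles mentioning both."""
    weights = Counter()
    for text in articles:
        for pair in combinations(tickers_in(text), 2):
            weights[pair] += 1
    return weights

articles = [
    "AAPL and MSFT rallied after the CEO spoke on NASDAQ.",
    "MSFT, AAPL and AMZN fell.",
    "AMZN alone moved today.",
]

edges = cooccurrence_edges(articles)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Matching against a known ticker list filters out uppercase words such as "CEO" that merely look like tickers, which addresses the drawback noted in approach 2 of the list in the next cell.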

In [None]:

# # Import Libraries

# In[1]:

#epochs_to_test = 15000
#patience_to_test = 15000



import pandas as pd
# Import Libraries for Graph, GNN, and GCN

import stellargraph as sg
from stellargraph import StellarGraph

from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN


# In[2]:


# Machine Learning related library imports

from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from IPython.display import display, HTML
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[3]:


# was active

data_folder = './data/yahoonewsarchive/'
# os.chdir(data_folder);
# file = data_folder + 'NEWS_YAHOO_stock_prediction.csv';
file = data_folder + 'News_Yahoo_stock.csv';


# In[4]:


df_news = pd.read_csv(file)
df_news.head()


# In[5]:


df_news = df_news[:100]


# # Approaches: find all stock tickers in one or more articles
# 
# 1. Reuse existing code: from the internet, from previous work, or from courses taken online or in academia.
# 2. Read each article iteratively and match its words against stock tickers. Drawback: which tickers should be matched, and how do we know what a ticker is? (Any two to four uppercase letters, e.g. NASDAQ AAPL.)
# 3. Load the articles into a database and use SQL. This may not work well without writing some functions.
# 4. Use NLTK: remove stop words, find all tokens, and keep only the all-uppercase words in a list with the article ids attached. Match that list against the list of tickers to find the tickers each article mentions, then create one tuple per ticker pair (an edge: source, target, weight).

# In[6]:


# import NLTK libraries
# remove stop words using NLTK methods
# remove all other unnecessary words
# find all tokens
# keep only the all-uppercase words (a list, dictionary, or map would work; a dataframe is ideal)
# attach the article ids to that list/dataframe
# create a list of all NASDAQ tickers
# match the uppercase words against the list of NASDAQ tickers
# find the tickers that co-occur in an article
# create one tuple per co-occurring pair (an edge: source, target, weight)
# increase the weight of a pair each time another article mentions both tickers


# In[7]:


# import NLTK libraries
import nltk


# In[8]:


# remove stop words using NLTK methods 
# remove all sorts of unnecessary words
# find all tokens
# Keep only All Uppercase words in a list : dictionary/map: dataframe will be ideal

from nltk.tokenize import RegexpTokenizer

# The pattern matches two or more uppercase letters followed by at least one word
# character, so every token is at least three characters long. One- and two-letter
# tickers (e.g. T, F, GM) are therefore never captured by this pattern.
# The raw string (r'...') keeps Python from treating \w as a string escape.
capitalWords = RegexpTokenizer(r'[A-Z]+[A-Z]\w+')

dataFrameWithOnlyCapitalWords = pd.DataFrame(columns =  ["id", "Title", "Content"]) 
for index, row in df_news.iterrows():
    # keep only the matching tokens from the article body
    allCapitalWords = capitalWords.tokenize(row['content'])
    dataFrameWithOnlyCapitalWords.loc[index] = [index, row['title'], allCapitalWords]



dataFrameWithOnlyCapitalWords.head() #, dataFrameWithOnlyCapitalWords.shape


# # Create a list of all (NasDAQ) 30 stocks as per the paper
# 

# In[9]:


# Find/Create a list of NASDAQ Stocks
import os
import glob
nasdaqDataFolder = './archive/stock_market_data/nasdaq/csv'
os.chdir(nasdaqDataFolder)





# In[10]:


# Create a list of all NasDAQ Tickers

extension = "csv"
fileTypesToMerge = ""
# all_filenames = [i for i in glob.glob('*' + '*.{}'.format(extension))]
all_nasdaq_tickers = [i[:-4] for i in glob.glob('*' + fileTypesToMerge + '*.{}'.format(extension))]
nasdaq_tickers_to_process = all_nasdaq_tickers #[:10]
nasdaq_tickers_to_process


# In[11]:


# Exclude these four tickers from the list; skip any that are not present
# (list.remove raises ValueError for a missing item)
for ticker in ('FREE', 'CBOE', 'III', 'RVNC'):
    if ticker in nasdaq_tickers_to_process:
        nasdaq_tickers_to_process.remove(ticker)
sorted(nasdaq_tickers_to_process)


# In[12]:


fortune_30_tickers_to_process = [
'WMT',
'XOM',
'AAPL',
'UNH',
'MCK',
'CVS',
'AMZN',
'T',
'GM',
'F',
'ABC',
'CVX',
'CAH',
'COST',
'VZ',
'KR',
'GE',
'WBA',
'JPM',
'GOOGL',
'HD',
'BAC',
'WFC',
'BA',
'PSX',
'ANTM',
'MSFT',
'UNP',
'PCAR',
'DWDP']




# nasdaq_tickers_to_process = [
# 'WMT',
# 'XOM',
# 'AAPL',
# 'UNH',
# 'MCK',
# 'CVS',
# 'AMZN',
# 'T',
# 'GM',
# 'F',
# 'ABC',
# 'CVX',
# 'CAH',
# 'COST',
# 'VZ',
# 'KR',
# 'GE',
# 'WBA',
# # 'JPM',
# #'GOOGL',
# 'HD',
# 'BAC',
# 'WFC',
# 'BA',
# 'PSX',
# 'ANTM',
# 'MSFT',
# 'UNP',
# 'PCAR',
# 'DWDP']
# 

# # Find stock tickers (Fortune 30 list) in each article
# Steps to create the graph
# Find all edges 

# In[13]:


from itertools import combinations

combinedTupleList = []     # one (tickerA, tickerB) entry per article that mentions both
allMatchingTickers = []
for index, row in dataFrameWithOnlyCapitalWords.iterrows():
    # Fortune 30 tickers mentioned in this article
    matchingTickers = set(fortune_30_tickers_to_process).intersection(row['Content'])
    if len(matchingTickers) > 1:
        # every unordered pair of co-mentioned tickers becomes an edge;
        # sorting first makes (A, B) and (B, A) count as the same edge
        for aTuple in combinations(sorted(matchingTickers), 2):
            combinedTupleList.append(aTuple)
            allMatchingTickers.extend(aTuple)

#combinedTupleList = list(set(combinedTupleList))
allMatchingTickers = set(allMatchingTickers)

sorted(combinedTupleList), allMatchingTickers


# In[14]:


#combinedTupleList[:1], set(allMatchingTickers)


# In[15]:


# calculate edge weights
from collections import Counter

tuplesWithCount = dict(Counter(combinedTupleList))
tuplesWithCount


# In[16]:


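l = list(tuplesWithCount.keys())
l

# Split the (source, target) pairs into two parallel tuples;
# the guard keeps this cell from failing when no ticker pair was found
source, target = zip(*l) if l else ((), ())
# list(...) turns the dict view into a plain sequence that can be used as a dataframe column
edge_weights = list(tuplesWithCount.values())
source, target, edge_weights, len(source), len(target)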


# In[62]:


import networkx as nx
Graph_news = nx.Graph(tuplesWithCount.keys())
nx.draw_networkx(Graph_news, pos = nx.circular_layout(Graph_news), node_color = 'r', edge_color = 'b')
#tuplesWithCount.keys()


# # Finally Create graph based on financial news

# In[17]:


import os
os.getcwd()
os.chdir('../../../../')
#os.chdir('./mcmaster/meng/747/project/')
os.getcwd()


# In[18]:


# Now create node data i.e time series to pass as part of the nodes
'''
df = pd.DataFrame();
data_file = "../../../..//archive/stock_market_data/nasdaq/nasdq-stock-price--all-merged.csv"
# stock-price--all-merged.csv"
df = pd.read_csv(data_file);
df.head()
'''

# this is the place where the new dataset starts i.e. fortune 30 companies
df = pd.DataFrame();
data_file = "per-day-fortune-30-company-stock-price-data.csv";
df = pd.read_csv("./data/" + data_file, low_memory = False);
df.head()


# In[19]:


df.index


# In[20]:


drop_cols_with_na = 1
drop_rows_with_na = 0


# In[21]:


# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html


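try:
    # Fill gaps by interpolating between neighbouring rows (pandas default: linear)
    df = df.interpolate(inplace = False)
except Exception as e:
    # Keep going with the original data; the NaN handling below still applies
    print("An exception occurred. Operation ignored:", e)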
    
df.isnull().values.any()
df[df.isna().any(axis = 1)]  


#----



if drop_cols_with_na == 1:
    df = df.dropna(axis = 1);    
   
df, df.shape

## -- 

df.isnull().values.any()
df[df.isna().any( axis = 1 )]


## --

# df_s_transpose.index = df_s_transpose['Date']
#df.index = df.index.astype('datetime64[ns]')
df


# In[22]:


df_s =  df #[ ['Ticker', 'Date', 'Adjusted Close'] ];
df_s


# In[23]:


df_s["Date"] = df_s["Date"].astype('datetime64[ns]')
df_s = df_s.sort_values( by = 'Date', ascending = True )
df_s


# df_s_pivot = df_s.pivot_table(index = 'Ticker', columns = 'Date', values = 'Adjusted Close')
# df_s_pivot

# In[24]:


allMatchingTickers


# 
# 
# drop_rows_with_na = 0
# if drop_rows_with_na == 1:
#     df_s_transpose = df_s_transpose.dropna(axis=0);
#     #df_s_transpose["Date"] = df_s_transpose["Date"].astype('datetime64[ns]')
#     #df_s_transpose.sort_values(by='Date', ascending=False)
#     df_s_transpose.to_csv('../../../..//archive/stock_market_data/nasdaq/-na-dropped-nasdq-stock-price--all-merged.csv');
#    
# df_s_transpose.head(100)
# 
# 

# In[25]:


df_s_transpose = df_s #_pivot.T
df_s_transpose

df_s_transpose_feature = df_s_transpose.reset_index(drop = True, inplace=False)
# df_s_transpose_feature =  df_s_transpose_feature.values.tolist()
# print(df_s_transpose_feature.values.tolist())
#df_s_transpose_feature['AAPL'].values



# In[26]:


df_s_transpose_feature = df_s_transpose.set_index('Date')


# In[27]:


df_s_transpose_feature


# In[28]:


#df_s_transpose['SIMO']


# In[29]:


# df_s_transpose_feature['AAPL'].values
len(allMatchingTickers), len(set(allMatchingTickers)) #, df_s['Ticker']
#df_s_tickers = df_s['Ticker'];
#len(df_s_tickers), len(set(allMatchingTickers)), df_s_transpose.columns.unique, len(set(df_s['Ticker']))
#df_s_tickers = list(set(df_s['Ticker'])); # list(df_s_transpose.columns.unique) #
#sorted(df_s_tickers)
#for x in df_s_tickers:
 #   print(x)


# In[30]:


df_s_tickers = df_s_transpose_feature.columns
#df_s_tickers = list(set(df_s_tickers.drop('Date')))
sorted(df_s_tickers[:5])


# In[31]:


set_allMatchingTickers = set(allMatchingTickers)
df_s_tickers = fortune_30_tickers_to_process #list(set(df_s['Ticker'])); # list(df_s_transpose.columns.unique) #
node_Data_financial_news = [];

'''
for x in set_allMatchingTickers :
    # if x in df_s_tickers:
    print(x)
    node_Data_financial_news.append( df_s_transpose_feature[x].values)
'''  

node_Data_financial_news = pd.DataFrame(df_s_transpose_feature) #, index = list(allMatchingTickers)) #, index = list(set_allMatchingTickers))
#node_Data_financial_news = node_Data_financial_news.T 
node_Data_financial_news


# In[32]:


node_Data_financial_news = node_Data_financial_news.T
node_Data_financial_news


# In[33]:


node_Data_financial_news


# node_Data_financial_news#.drop(axis = 0)
# node_Data_financial_news = node_Data_financial_news.T
# node_Data_financial_news

# node_Data_financial_news = node_Data_financial_news.drop('Date')
# node_Data_financial_news

# In[34]:


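financial_news_edge_data = pd.DataFrame(
    {"source": source, "target": target, "edge_feature": list(edge_weights)}
)

# edge_weight_column tells StellarGraph to read the co-occurrence counts as edge weights;
# by default it looks for a column named "weight" and otherwise gives every edge weight 1
financial_news_graph = StellarGraph(node_Data_financial_news, edges = financial_news_edge_data, edge_weight_column = "edge_feature", node_type_default="corner", edge_type_default="line")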
print(financial_news_graph.info())


# In[35]:


# debug code
# financial_news_graph_data,  sorted(node_Data_financial_news.columns.unique())
# [1,2] + [2, 3,4], set(source + target).difference(sorted(node_Data_financial_news.columns.unique()))


# In[36]:


# Generator
generator = FullBatchNodeGenerator(financial_news_graph, method = "gcn")


# # Machine Learning, Deep Learning, GCN, CNN

# # Train Test Split

# In[37]:


train_subjects, test_subjects = model_selection.train_test_split(
    node_Data_financial_news #, train_size = 6, test_size = 4
)
# , train_size=6, test_size=None, stratify=pearson_graph_node_data

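# Note: the validation nodes are drawn from test_subjects, so the validation set
# is a subset of the test set that is evaluated later
val_subjects, test_subjects_step_2 = model_selection.train_test_split(
    test_subjects #, test_size = 2
)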

#, train_size = 500, test_size = None, stratify = test_subjects


train_subjects.shape, test_subjects.shape, val_subjects.shape, test_subjects_step_2.shape


# In[38]:


# Regression targets: each node's own price series, so targets equal the node features

train_targets = train_subjects
val_targets = val_subjects
test_targets = test_subjects


# In[39]:


# Architecture of the Neural Network
train_subjects.index, train_targets


# In[40]:


train_gen = generator.flow(train_subjects.index, train_targets)


# In[41]:

'''
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow import keras

layer_sizes = [32, 32]
activations = ["relu", "relu"]
'''

gcn = GCN(layer_sizes = layer_sizes, activations = activations, generator = generator) #, dropout = 0.5
x_inp, x_out = gcn.in_out_tensors()

# MLP -- Regression
predictions = layers.Dense(units = train_targets.shape[1], activation = "linear")(x_out)


# x_inp, x_out


# # Models

# In[42]:


# loss functions: https://keras.io/api/losses/

# Build the model for the news graph. This line must stay outside the disabled block
# below; otherwise `model` would still refer to the model created earlier in this file.
model = Model(
    inputs = x_inp, outputs = predictions)


'''
# Alternative compile settings that were tried; kept for reference only

model.compile(
    optimizer=optimizers.Adam(learning_rate=0.1),
    loss=losses.MeanSquaredError(),
    metrics=["acc"],
)


# REF: https://stackoverflow.com/questions/57301698/how-to-change-a-learning-rate-for-adam-in-tf2
# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay
train_steps = 1000
lr_fn = optimizers.schedules.PolynomialDecay(1e-3, train_steps, 1e-5, 2)


# https://keras.io/api/metrics/
model.compile(
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam( lr_fn ),
    # metrics = ['mean_squared_error']
    metrics=['mse', 'mape',  'mae']
)
'''
# 4th block

# mape: https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac
model.compile( 
    loss = 'mean_absolute_error', 
    optimizer = optimizers.Adam(learning_rate = 0.015), 
    #optimizer = optimizers.Adam(lr_fn), 
    # metrics=['mean_squared_error']
    metrics=['mean_squared_error', 'mape', 'mae']
    
)


# In[43]:


# model.summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()
len(x_inp), predictions.shape


# In[44]:


len(val_subjects)
test_subjects_ = test_subjects[:len(val_subjects)]


# In[45]:


# Validation generator: pair the validation nodes with their own targets.
# (A slice of test_subjects would attach rows that belong to other nodes.)
val_gen = generator.flow(val_subjects.index, val_targets)
#train_gen[1], val_gen[1]


# In[46]:



data_valid = val_gen #[:1][:4];
train_gen_data = train_gen #[:1][:4];


# In[47]:


type(train_gen_data), type(data_valid), type(x_inp), type(x_out) 


# In[48]:

'''
# https://keras.io/api/callbacks/early_stopping/
from tensorflow.keras.callbacks import EarlyStopping

es_callback = EarlyStopping(
    monitor = "val_mean_squared_error", 
    patience = patience_to_test, 
    restore_best_weights = True
)
'''

history = model.fit( train_gen_data, epochs = epochs_to_test, validation_data = data_valid, verbose = 2,    
    # shuffling = true means shuffling the whole graph
    shuffle = False, callbacks = [es_callback],
)
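
The generator used for training above is created with `method = "gcn"`. As far as I understand the StellarGraph documentation, that option pre-processes the adjacency matrix with the normalization from Kipf and Welling's GCN, i.e. self-loops are added and the result is scaled symmetrically by node degree. The sketch below shows one graph-convolution step of that kind in plain Python. The three-node path graph, the feature values, and the weight matrix are made up for illustration, and this is not StellarGraph's implementation.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square (weighted) adjacency matrix."""
    n = len(adj)
    # Add self-loops so that each node keeps part of its own features
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj_norm, features, weights):
    """One graph convolution: ReLU(A_norm @ X @ W), written with plain lists."""
    n, f_in, f_out = len(features), len(weights), len(weights[0])
    # Aggregate neighbour features: A_norm @ X
    agg = [[sum(adj_norm[i][k] * features[k][f] for k in range(n)) for f in range(f_in)]
           for i in range(n)]
    # Project with W and apply ReLU
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(f_in)))
             for o in range(f_out)] for i in range(n)]

# Hypothetical path graph 0 - 1 - 2 with one feature per node
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
weights = [[1.0, -1.0]]   # maps 1 input feature to 2 hidden units

adj_norm = normalize_adjacency(adj)
hidden = gcn_layer(adj_norm, features, weights)
print(hidden)
```

In the notebook, the rows of `features` would be the per-ticker price series and a Dense layer would map the hidden representation back to prices.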



In [None]:
sg.utils.plot_history(history)

# Train and Test Metrics

In [None]:
train_gen = generator.flow(train_subjects.index, train_targets)
train_metrics = model.evaluate(train_gen)

# The network built in this section is a GCN followed by a Dense (MLP) layer
print("\nTrain Set Metrics for Financial News Based Prediction: GCN + MLP")
for name, val in zip(model.metrics_names, train_metrics):
    print("\t{}: {:0.4f}".format(name, val))
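
Train and test metrics are easiest to compare when the node sets do not overlap. In this file the validation nodes are split off from the test nodes, so those two sets share members. A minimal sketch of a disjoint three-way split of node ids follows; the 60/20/20 fractions, the seed, and the placeholder ticker names are assumptions, not values taken from the notebook.

```python
import random

def three_way_split(node_ids, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the node ids once and cut them into disjoint train/val/test lists."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]   # whatever is left after train and val
    return train, val, test

# Hypothetical node ids standing in for the 30 tickers
tickers = [f"T{i:02d}" for i in range(30)]
train_ids, val_ids, test_ids = three_way_split(tickers)
print(len(train_ids), len(val_ids), len(test_ids))
```

The resulting id lists could then be used to select rows with `node_Data_financial_news.loc[...]` before calling `generator.flow`.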

In [None]:
# [1]
val_subjects, 
test_subjects

test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
    


# # Show the prices predicted by the model
# 
# At this point, I still need to work out what exactly the GCN (and CNN) combination + MLP is predicting.
# For now, I am only displaying the output.
# It appears that a price is predicted for each node at each timestamp (day).

# In[51]:


all_nodes = node_Data_financial_news.index;
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)

all_predictions, all_predictions.shape, node_Data_financial_news.shape


# In[52]:




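# Stand-alone metrics table for the news-based model only. It uses its own names so that
# the comparison table `df_metrics` filled in by the other cells is left untouched.
df_metrics_news = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAPE', 'MAE'])

temp = list()
temp.append('GCN-Causation-News')
for name, val in zip(model.metrics_names, test_metrics):
    temp.append(round(val,2))

print(temp)
df_metrics_news.loc[1] = temp

import math
df_metrics_news_plot = df_metrics_news[['Loss', 'MSE', 'MAPE', 'MAE']].copy()
# math.sqrt accepts single numbers only, so take the root element by element (MSE -> RMSE)
df_metrics_news_plot['MSE'] = [math.sqrt(x) for x in df_metrics_news['MSE']]
df_metrics_news_plot


df_metrics_news_plot.plot( kind = 'bar')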

In [None]:
# df_metrics = pd.DataFrame(columns=['Method', 'Loss', 'MSE', 'MAPE', 'MAE'])

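temp = list()
temp.append('GCN-News')
for name, val in zip(model.metrics_names, test_metrics):
    temp.append(round(val,2))

df_metrics.loc[4] = temp

# import math
# .copy() avoids pandas' SettingWithCopyWarning when the MSE column is overwritten below
df_metrics_plot = df_metrics[['Loss', 'MSE', 'MAPE', 'MAE']].copy()

# Replace MSE by its square root (RMSE) so the column is on the same scale as MAE
df_metrics_plot['MSE'] = [ math.sqrt(x) for x in df_metrics_plot['MSE']]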
df_metrics_plot

In [None]:
df_metrics_plot.plot( kind = 'bar')

In [None]:
round(df_metrics_plot.T,1)

To start with, I took ideas from the following StellarGraph demo, mainly to see what a GCN is and how it works. Note that the demo does not use any CNN.

Node classification with Graph Convolutional Network (GCN): 

https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html 

References:



[1] Node classification with Graph Convolutional Network (GCN). https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html 


[2] Loading data into StellarGraph from Pandas. https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html

[3] Load Timeseries https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-numpy.html

[4] NetworkX: https://networkx.org/documentation/stable/reference/introduction.html 

[5] StellarGraph and NetworkX: https://stellargraph.readthedocs.io/en/latest/demos/basics/loading-networkx.html 

[6] Select a StellarGraph algorithm: https://stellargraph.readthedocs.io/en/stable/demos/#find-a-demo-for-an-algorithm 


Learning: 
GNN/GCN/Keras
https://www.youtube.com/watch?v=0KH95BEz370


Install StellarGraph:
https://pypi.org/project/stellargraph/#install-stellargraph-using-pypi


May want to use without Stellar
https://keras.io/examples/graph/gnn_citations/

Get feature data from a pandas DataFrame and create the graph properly:
https://stellargraph.readthedocs.io/en/stable/demos/basics/loading-pandas.html

https://stellargraph.readthedocs.io/en/v0.11.0/api.html


Graph Regression Dataset
https://paperswithcode.com/task/graph-regression/codeless

StellarGraph Reference:
https://stellargraph.readthedocs.io/en/stable/demos/time-series/gcn-lstm-time-series.html
https://stellargraph.readthedocs.io

Graph CNN or similar:
This demo has multiple GCN layers and one 1D CNN + ...; the idea might help.
https://stellargraph.readthedocs.io/en/stable/demos/graph-classification/dgcnn-graph-classification.html?highlight=cnn

# References -- exploring ideas on the GCN-CNN
https://ieeexplore.ieee.org/document/9149910

https://antonsruberts.github.io/graph/gcn/

This may work. It defines a unit GCN and also a unit TCN, which may give the opportunity to customize the model to produce the correct output.
https://github.com/lshiwjx/2s-AGCN  https://paperswithcode.com/paper/non-local-graph-convolutional-networks-for

    

# From scratch and equations
https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b

https://jonathan-hui.medium.com/graph-convolutional-networks-gcn-pooling-839184205692