# Financial Stock Price Prediction with Knowledge Graph Embeddings

## Introduction

This notebook is aimed to serve as an introduction to the creation of a not personalized recommendation algorithm operating on stock market prices in order to predict future profitability of those assets. The time series are converted into financial technical indicators, and it is enriched with knowledge graph information downloaded from WikiData. 

The notebook covers the processing steps, the calculation of the features fed to the prediction models, and provides the outcome of several profitability prediction models using the combination of both technical features and knowledge graph embeddings.

In [1]:
# Local Mode
#storageDIR = "HugeStockMarketDataset" # creates a dataset directory in the same folder as the notebook
#storageDIRNews = "NewsSentimentDataset"
# Container Mode
storageDIR = "/tmp/data/"


## Dataset

Different types of financial asset recommendation systems use different sources of data to produce their recommendations. The approach we introduce in this notebook is known as Profitability Prediction, where assets that are predicted to gain significant value over the following six months are recommended. This type of approach uses past pricing data, i.e. the price for different assets over time, to identify pricing trends and hence future profitable assets. Hence, as input, we need the price history over time for a range of assets. In addition, we enrich our recommendations with a knowledge graph representing relations between companies and other entities related to them (important people in the company like CEO or board members, products released, awards). 

### Pricing data

For illustration, in this notebook we will use open pricing data, available from Yahoo! Finance. In particular, it contains the historical price and volume data for US-based stocks and ETFs trading on the NYSE, NASDAQ and AMEX markets, and it runs up to the end of March 2022. Each entry of this dataset is comprised of: 
 - Date: The date of the pricing data 
 - Open: Opening price for that day
 - High: The maximum price for that day
 - Low: The minimum price for that day
 - Close: The closing price for that day
 - AdjClose: The adjusted closing price
 - Volume: The amount of the asset that is traded 
 
We introduce here three different ways to download the data. Along with this example, we provide the pricing information for ~ 1700 financial assets, stored in multiple files. In case you want to use these files, go to the "Loading the data from files" section. If you have a single file , go to the section "Loading data from a single file". Finally, if you want to download the data from an online source (Yahoo! Finance), go to the "Loading the data from Yahoo! Finance section".
 
#### Loading the data from files

In this example, we load the data into a single Pandas Dataframe, which acts like a large data table that makes raw data easier to analyse. In case we do already have the pricing information, it is enough to execute the following code snippet. It assumes that we store every asset in a separate file and combines them. If this is the case, you can skip the remaining steps until the "Knowledge graph" section. Otherwise, ignore this snippet and continue with the tutorial.

In case you want to download the data we provide, just download the pricing data, unzip it and copy it to the `stocks` directory in the data folder. You can use the following URL:

https://download-directory.github.io/?url=https://github.com/terrierteam/Infinitech-FAR-KnowledgeGraphEmbeddings/tree/main/data/stocks

In [2]:
import pandas as pd
import numpy as np
import glob, os, random, math

directory = os.path.join(storageDIR, "stocks")
all_files = glob.glob(os.path.join(directory, "*.csv"))
dfs = []

tickers = []
# Iterating through files and only using non-empty files
for f in all_files:
    if os.path.getsize(f) > 0:
        df = pd.read_csv(f)
        ticker = f.split('/')[-1].split('.')[0]
        df['Stock'] = ticker
        tickers.append(ticker)
        dfs.append(df)
prices_df = pd.concat(dfs)

print("Dataset Extraction and Loading as Dataframe Complete")

Dataset Extraction and Loading as Dataframe Complete


#### Loading the data from a single file

If we have all the pricing data into a single file, we can use the following snippet. If this is the case, you can skip the following steps until the "Knowledge graph" section. Otherwise, ignore this snippet and continue with the tutorial.



In [None]:
import pandas as pd
import numpy as np
import glob, os, random, math

file_name = "timeseries.csv"
directory = os.path.join(storageDIR, "stocks")

file = os.path.join(directory, file_name)
prices_df = pd.read_csv(file)
prices_df["Date"] = pd.to_datetime(prices_df["Date"])

print("Dataset Extraction and Loading as Dataframe Complete")

#### Loading the data from Yahoo! Finance

The historical prices to use in this notebook can be also downloaded through <b><a href='https://finance.yahoo.com/'>Yahoo! Finance</a></b>. To download this data, we should first download the set of assets on the NASDAQ, AMEX and NYSE. In order to obtain this information, we can download the asset information from the <a href='https://www.nasdaq.com/market-activity/stocks/screener'>NASDAQ Stock Screener</a> webpage (just use the Download .csv button).

In [None]:
import zipfile
import pandas as pd
import numpy as np
import glob, os, random, math
import datetime

ticker_data = "./nasdaq_screener_1664986396364.csv"
data = pd.read_csv(ticker_data, sep=",")
tickers = data["Symbol"].tolist()
tickers = [ticker for ticker in tickers if not pd.isna(ticker)]
tickers

Once we have a list of tickers, we can then ask Yahoo! Finance for the pricing information. Yahoo! Finance provides a URL for each ticker, from which we can download the data. The URL has the following format:

https://query1.finance.yahoo.com/v7/finance/download/[TICKER]?period1=[START_DATE]&period2=[END_DATE]&interval=1[d,wk,mo]&events=[EVENT_TYPE]&includeAdjustedClose=true

where:
- [TICKER] represents the ticker we want to retrieve.
- [START_DATE] represents the UNIX timestamp of the first date we want to retrieve.
- [END_DATE] represents the UNIX timestamp of the last date we want to retrieve.
- [FREQ] indicates whether we want to retrieve daily (d), weekly (wk) or monthly (mo) information
- [EVENT_TYPE] identifies the type of event that we want to retrieve between: "history" (the pricing history), "div" (the dividends only history), "split" (stock split history) and "capital" (capital games).

In this example, we are taking the history data, and we want to collect all the possible information for each ticker (this meaning daily data from the farthest possible period until today). We define the following function for generating the Yahoo! Finance URLs:

In [None]:
import time
def get_url(ticker, start_date, end_date, freq, event_type):
    start_date_unix = int(time.mktime(start_date.timetuple()))
    end_date_unix = int(time.mktime(end_date.timetuple()))
    
    url = "https://query1.finance.yahoo.com/v7/finance/download/"
    url += ticker
    url += "?period1="
    url += str(start_date_unix)
    url += "&period2="
    url += str(end_date_unix)
    url += "&interval=1"
    url += freq
    url += "&events="
    url += event_type
    url += "&includeAdjustedClose=true"
    return url

Then, we can establish the parameters of the information we want, and collect and store the data. In our case, the parameters will be:
- [START_DATE]: 13/12/1901
- [END_DATE]: today (09/05/2023)
- [FREQ]: d (daily)
- [EVENT_TYPE]: history

In [None]:
start_time = datetime.datetime(1901,12,13)
end_time = datetime.datetime.now()
freq = "d"
event_type = "history"

directory = os.path.join(storageDIR, "stocks")

Then, using the get_url function, we can download the information we seek. Note that, in some cases, the time series for some of the tickers cannot be retrieved (when we tried this 7200 out of 8238 tickers had been retrieved). For the rest, there are many reason why this might happen:
- Yahoo! Finance does not contain the corresponding data. This might happen because the stocks are no longer traded on the market, or there might be some invalid information on them. These stock are invalid, and as such, should be discarded.
- The ticker is not the same as in Yahoo! Finance as in teh NASDAQ file. These assets can be fixed. The procedure to follow is:
    - Change "^" by "-P".
    - Change "\" by "-".
    - Remove extra blank spaces.

In the example below, we first try to collect information from the original tickers, and then, we apply the changes to the tickers to obtain some more. In case some assets could not be retrieved this way, we consider them impossible to retrieve.

We store the files in a directory (one file per stock)

In [None]:
unretrieved = set()

dfs = []

i = 0
for ticker in tickers:
    try:
        url = get_url(ticker, start_time, end_time, freq, event_type)
        ticker_data = pd.read_csv(url, sep=",")
        ticker_data.to_csv(directory + "/" + ticker + ".csv", index=False)
        dfs.append(ticker_data)
        i += 1
        if i%100 == 0:
            print("Retrieved " + str(i) + " tickers ( " + str(len(unretrieved)) + " failed)")
    except Exception as e:
        if e.code == 429:
            time.sleep(3600)
        unretrieved.add(ticker)
        
unretrieved_list = list(list(unretrieved)[1:-1])
unretrieved_list.sort()
unretrieved_list

i = 0
modified = 0
for ticker in unretrieved_list:
    if "^" in ticker:
        ticker = ticker.replace("^","-P")
        unretrieved_list[i] = ticker
        modified += 1
    elif "/" in ticker:
        ticker = ticker.replace("/","-")
        unretrieved_list[i] = ticker
        modified += 1
    elif " " in ticker:
        ticker = ticker.strip()
        unretrieved_list[i] = ticker
        modified += 1
    i = i + 1
print("Modified " + str(modified) + " tickers.")

unretrieved_2 = set()

i = 0
for ticker in unretrieved_list:
    try:
        ticker_data = pd.read_csv(get_url(ticker, start_time, end_time, freq, event_type), sep=",")
        ticker_data.to_csv(directory + "/" + ticker + ".csv", index=False)
        dfs.append(ticker_data)
        i += 1
        if i%100 == 0:
            print("Retrieved " + str(i) + " tickers ( " + str(len(unretrieved_2)) + " failed)")
    except Exception as e:
        if e.code == 429:
            time.sleep(3600)
        unretrieved_2.add(ticker)

prices_df = pd.concat(dfs)

In [3]:
prices_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock
0,1999-11-18,32.546494,35.765381,28.612303,31.473534,26.845928,62546380.0,A
1,1999-11-19,30.713518,30.758226,28.478184,28.880545,24.634182,15234146.0,A
2,1999-11-22,29.551144,31.473534,28.657009,31.473534,26.845928,6577870.0,A
3,1999-11-23,30.400572,31.205294,28.612303,28.612303,24.405384,5975611.0,A
4,1999-11-24,28.701717,29.998213,28.612303,29.372318,25.053661,4843231.0,A


#### Filtering the pricing data

Pandas allows us to perform manipulations on the pricing data so that we can extract only what we need for training the model. We will only use pricing data from 2018 to 2021. We shall consider data until July 2019 as the past, and we shall train models at different points of time.

Lets first filter the dataset to only hold data from the dates we care about:

In [4]:
prices_df['Date'] = pd.to_datetime(prices_df['Date'])
min_date = pd.to_datetime('2018-01-01')
max_date = pd.to_datetime('2021-01-10')
# Selecting only that data from either 2016 or 2017
prices_df = prices_df[prices_df['Date'] >= min_date]
prices_df = prices_df[prices_df['Date'] <= max_date]
print("Filtered the data prices")
prices_df

Filtered the data prices


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock
4558,2018-01-02,67.419998,67.889999,67.339996,67.599998,64.989273,1047800.0,A
4559,2018-01-03,67.620003,69.489998,67.599998,69.320000,66.642830,1698900.0,A
4560,2018-01-04,69.540001,69.820000,68.779999,68.800003,66.142914,2230700.0,A
4561,2018-01-05,68.730003,70.099998,68.730003,69.900002,67.200424,1632500.0,A
4562,2018-01-08,69.730003,70.330002,69.550003,70.050003,67.344635,1613400.0,A
...,...,...,...,...,...,...,...,...
6570695,2021-01-04,2.280000,2.300000,2.250000,2.250000,1.824063,1146900.0,timeseries
6570696,2021-01-05,2.250000,2.280000,2.250000,2.250000,1.824063,1296300.0,timeseries
6570697,2021-01-06,2.270000,2.280000,2.240000,2.260000,1.832170,1903600.0,timeseries
6570698,2021-01-07,2.260000,2.290000,2.250000,2.270000,1.840278,2517800.0,timeseries


After this step, we print below the number of stocks.

In [5]:
stocks = prices_df['Stock'].unique().tolist()
print("Num. stocks with data between 2018 and 2021: " + str(len(stocks)))
stocks

Num. stocks with data between 2018 and 2021: 1388


['A',
 'AA',
 'AACG',
 'AADI',
 'AAIC',
 'AAL',
 'AAME',
 'AAN',
 'AAOI',
 'AAON',
 'AAP',
 'AAPL',
 'AAT',
 'AATC',
 'AAU',
 'AAWW',
 'AB',
 'ABB',
 'ABBV',
 'ABC',
 'ABCB',
 'ABCL',
 'ABCM',
 'ABEO',
 'ABEV',
 'ABG',
 'ABIO',
 'ABM',
 'ABMD',
 'ABNB',
 'ABR',
 'ABST',
 'ABT',
 'ABUS',
 'ABVC',
 'AC',
 'ACA',
 'ACAD',
 'ACB',
 'ACCD',
 'ACCO',
 'ACEL',
 'ACER',
 'ACET',
 'ACGL',
 'ACGLO',
 'ACHC',
 'ACHR',
 'ACHV',
 'ACI',
 'ACIU',
 'ACIW',
 'ACLS',
 'ACM',
 'ACMR',
 'ACN',
 'ACNB',
 'ACNT',
 'ACOR',
 'ACP',
 'ACR',
 'ACRE',
 'ACRS',
 'ACRX',
 'ACST',
 'ACTG',
 'ACU',
 'ACV',
 'ADAP',
 'ADBE',
 'ADC',
 'ADCT',
 'ADEA',
 'ADES',
 'ADI',
 'ADIL',
 'ADM',
 'ADMA',
 'ADMP',
 'ADN',
 'ADNT',
 'ADOC',
 'ADP',
 'ADPT',
 'ADSK',
 'ADT',
 'ADTN',
 'ADTX',
 'ADUS',
 'ADV',
 'ADVM',
 'ADX',
 'ADXN',
 'AE',
 'AEE',
 'AEF',
 'AEFC',
 'AEG',
 'AEHL',
 'AEHR',
 'AEI',
 'AEIS',
 'AEL',
 'AEMD',
 'AENZ',
 'AEO',
 'AEP',
 'AEPPZ',
 'AER',
 'AES',
 'AEVA',
 'AEY',
 'AEYE',
 'AEZS',
 'AFB',
 'AFBI',
 'AF

### Knowledge graph

In addition to the pricing data, we use for this work a knowledge graph extracted from Wikidata. We share the different information available in the knowledge graph in the following files. A knowledge graph consists on different elements:
- **Entities:** Objects representing real life objects / people or concepts. For instance, a company is represented with an entity. In Wikidata, an entity contains: a) a unique identifier starting with Q (ex: Q312 for Apple), b) a label or name (e.g. "Apple Inc."), c) a list of alternative names or aliases (e.g. "Apple Computer Inc", "Apple", "Apple Incorporated", etc.), a description (e.g. "American technology company based in Cupertino, California") and an associated Wikipedia page (e.g. "https://en.wikipedia.org/wiki/Apple_Inc."). Entities are represented as nodes in the graph.  
  In our data, entities are shared in the *entities.txt* file. This contains a list of JSON objects (one per line) with each line representing a different entity. The format of an entity JSON is:  
```json
{"id": node_id, "labels": ["StockEntity"], "properties": { "alias" : [alias1, alias2,...aliasM], "description": entity description, "id": Wikidata ID, "label": Entity name, "wikipedia": Wikipedia page}}
```
- **Values:** Objects representing constants (either numerical, dates, strings). Values are represented as nodes in the graph.
  In our data, values are shared in the *values.txt* file. This contains a list of JSON objects (one per line) with each line representing a different value. Note that these values might be repeated. The format of a value JSON is:  
```json
{"id": node_id, "labels": ["StockValue"], "properties": { dictionary containing the necessary values}}
```
- **Properties:** Objects representing the possible types of connections between nodes in the knowledge graph, or the possible types of properties of the edges. In Wikidata, a property is defined by: a) a unique identifier starting with P (e.g.: P452), b) a label or name (e.g.: "industry"), c) a list of alternative names or aliases (e.g. "field of action", "sector", "branch", etc.), and d) a description (e.g.:  "specific industry of company or organization"). Properties are represented as nodes in the knowledge graph.  
  In our data, properties are shared in the *properties.txt* file. This contains a list of JSON objects (one per line) with the following format:
```json
{"id": node_id, "labels": ["StockProperty"], "properties": { "alias" : [alias1, alias2,...aliasM], "description": entity description, "id": Wikidata ID, "label": Entity name}}
```

- **Relations:**  Connections between different entities in the knowledge graph. They are quartets of the form `(head, relation type, tail, properties)` where:
    - head: the initial entity (e.g.: Amazon)
    - relation type: the type of relation (e.g.: has subsidiary)
    - tail: the end entity (e.g.: Twitch Interactive)
    - properties: additional information about the link (e.g.: start time: 25/08/2014)    
  
  Relations appear as directed edges in the knowledge graph.  
  In our data, they are included in the file *value_relations.txt*. This file contains a list of JSON objects (one per line) representing each relation. Every line has the following format:  
```json
{"source": head_node_id, "type": Wikidata relation type ID, "dest": tail_node_id, "properties": { property type 1: value1, ..., property type N: valueN}}
```

- **Property values:** Connections between entities and values in the knowledge graph. They are quartets of the form `(entity, property type, value, properties)` where:
    - entity: the entity who has a property (e.g.: Amazon)
    - property type: the type of property (e.g.: total revenue)
    - value: the value (e.g.: 513,983,000,000 USD)
    - properties: additional information about the property (e.g.: point in time: 2022)    
  
  These connections appear as directed edges in the knowledge graph.  
  In our data, they are included in the file *relations.txt*. This file contains a list of JSON objects (one per line) representing each relation. Every line has the following format:  
```json
{"source": entity_node_id, "type": Wikidata relation type ID, "dest": value_node_id, "properties": { property type 1: value1, ..., property type N: valueN}}
```

In addition to the previous information, we share an additional file, *mapping.txt* which contains, for every ticker, the corresponding Wikidata id of the entiies representing it in the knowledge graph. It has format:

```
ticker:entityID_1,...,entityID_n
```

In case you want to download the data we provide, just download the knowledge graph directory, unzip it and copy it to the `kg` directory in the data folder. You can use the following URL:

https://download-directory.github.io/?url=https://github.com/terrierteam/Infinitech-FAR-KnowledgeGraphEmbeddings/tree/main/data/kg

We first load the information we need for the knowledge graph:

In [6]:
import json
import urllib.request

store = False # Set to true to store the file again
directory = os.path.join(storageDIR, "kge")

In [7]:
# Entities file
entities_file = os.path.join(directory, "entities.txt")
entities_url = "URL"

entities = []
text = ""

# If container mode:
with open(entities_file, "r") as f:
    lines = f.readlines()
# If collab mode / need to download from URL
    #lines = urllib2.urlopen(entities_url)
    for line in lines:
        dictionary = json.loads(line)
        entities.append({"nodeID" : dictionary["id"], 
                         "wikidataID": dictionary["properties"]["id"], 
                         "label" : dictionary["properties"]["label"]})
        if store:
            text += line + "\n"

if store:
    with open(entities_file, "w") as f:
        f.write(text)

entities_df = pd.DataFrame(entities)
entities_df

Unnamed: 0,nodeID,wikidataID,label
0,1,Q30268840,Celyad (Belgium)
1,5,Q1001788,Buenaventura
2,7,Q16858667,Kratos Defense & Security Solutions
3,9,Q6783802,Masonite International
4,10,Q846246,Neonode
...,...,...,...
102734,327123,Q181790,composite material
102735,327124,Q369820,music of Africa
102736,327125,Q11700058,folk-pop
102737,327126,Q42982,allergy


In [8]:
# Values file
values_file = os.path.join(directory, "values.txt")
values_url = "URL"

values = []
text = ""

# If container mode:
with open(values_file, "r") as f:
    lines = f.readlines()
# If collab mode / need to download from URL
    #lines = urllib2.urlopen(values_url)
    for line in lines:
        dictionary = json.loads(line)
        values.append({"nodeID" : dictionary["id"], 
                         "value": dictionary["properties"]})
        if store:
            text += line + "\n"

if store:
    with open(values_file, "w") as f:
        f.write(text)

values_df = pd.DataFrame(values)
values_df

Unnamed: 0,nodeID,value
0,0,"{'amount': 367771.0, 'unit': 'unit'}"
1,2,"{'amount': 125988209.0, 'unit': 'unit'}"
2,3,{'value': '+2004-01-01T00:00:00Z'}
3,4,"{'amount': 126004305.0, 'unit': 'unit'}"
4,6,{'value': '+1953-01-01T00:00:00Z'}
...,...,...
223053,318697,{'value': '+2021-00-00T00:00:00Z'}
223054,318699,{'value': '+1914-01-01T00:00:00Z'}
223055,318701,{'value': '+2016-06-30T00:00:00Z'}
223056,318702,{'value': '+1940-09-17T00:00:00Z'}


In [9]:
# Properties file
properties_file = os.path.join(directory, "properties.txt")
properties_url = "URL"

properties = []
text = ""

# If container mode:
with open(properties_file, "r") as f:
    lines = f.readlines()
# If collab mode / need to download from URL
    #lines = urllib2.urlopen(properties_url)
    for line in lines:
        dictionary = json.loads(line)
        properties.append({"nodeID" : dictionary["id"], 
                         "wikidataID": dictionary["properties"]["id"], 
                         "label" : dictionary["properties"]["label"]})
        if store:
            text += line + "\n"

if store:
    with open(properties_file, "w") as f:
        f.write(text)

properties_df = pd.DataFrame(properties)
properties_df

Unnamed: 0,nodeID,wikidataID,label
0,548,P246,element symbol
1,549,P1082,population
2,550,P2054,density
3,551,P3095,practiced by
4,552,P452,industry
...,...,...,...
109,657,P106,occupation
110,658,P2138,total liabilities
111,659,P2397,YouTube channel ID
112,660,P69,educated at


In [10]:
# Relations file
relations_file = os.path.join(directory, "relations.txt")
relations_url = "URL"

relations = []
text = ""

# If container mode:
with open(relations_file, "r") as f:
    lines = f.readlines()
# If collab mode / need to download from URL
    #lines = urllib2.urlopen(relations_url)
    for line in lines:
        dictionary = json.loads(line)
        relations.append({"source" : dictionary["source"], 
                           "dest": dictionary["dest"], 
                           "type" : dictionary["type"],
                           "properties": dictionary["properties"]})
        if store:
            text += line + "\n"

if store:
    with open(relations_file, "w") as f:
        f.write(text)

relations_df = pd.DataFrame(relations)
relations_df

Unnamed: 0,source,dest,type,properties
0,50283,24,P1889,{}
1,128884,47,P108,"{'P580': '+2011-00-00T00:00:00Z', 'P582': '+20..."
2,15073,47,P1830,{}
3,50321,64,P127,{}
4,50347,64,P127,{}
...,...,...,...,...
457753,327107,327123,P279,{}
457754,327117,327124,P279,{}
457755,327118,327125,P1889,{}
457756,327122,327126,P1889,{}


In [11]:
# Relations with values file
valuerelations_file = os.path.join(directory, "value_relations.txt")
valuerelations_url = "URL"

valuerelations = []
text = ""

# If container mode:
with open(relations_file, "r") as f:
    lines = f.readlines()
# If collab mode / need to download from URL
    #lines = urllib2.urlopen(relations_url)
    for line in lines:
        dictionary = json.loads(line)
        valuerelations.append({"source" : dictionary["source"], 
                           "dest": dictionary["dest"], 
                           "type" : dictionary["type"],
                           "properties": dictionary["properties"]})
        if store:
            text += line + "\n"

if store:
    with open(relations_file, "w") as f:
        f.write(text)

valuerelations_df = pd.DataFrame(valuerelations)
valuerelations_df

Unnamed: 0,source,dest,type,properties
0,50283,24,P1889,{}
1,128884,47,P108,"{'P580': '+2011-00-00T00:00:00Z', 'P582': '+20..."
2,15073,47,P1830,{}
3,50321,64,P127,{}
4,50347,64,P127,{}
...,...,...,...,...
457753,327107,327123,P279,{}
457754,327117,327124,P279,{}
457755,327118,327125,P1889,{}
457756,327122,327126,P1889,{}


In [12]:
mapping_file = os.path.join(directory, "mapping.txt")
mapping = dict()
with open(mapping_file, "r") as f:
    for line in f.readlines():
        aux = line.split(":")
        ticker = aux[0]
        entities = aux[1].strip().split(",")
        mapping[ticker] = entities
mapping

{'ALEX': ['Q135281'],
 'UMC': ['Q143616'],
 'BP': ['Q152057'],
 'BPMP': ['Q152057'],
 'BPT': ['Q4836297'],
 'GER': ['Q193326'],
 'GJS': ['Q193326'],
 'GMZ': ['Q193326'],
 'GS': ['Q193326'],
 'GSBD': ['Q193326'],
 'GSC': ['Q193326'],
 'AIG': ['Q212235'],
 'BCS': ['Q245343'],
 'FFEU': ['Q245343'],
 'FIYY': ['Q245343'],
 'GAZ': ['Q245343'],
 'GSP': ['Q245343'],
 'ACCO': ['Q288129'],
 'NOC': ['Q329953'],
 'AMBC': ['Q456563'],
 'WMT': ['Q483551'],
 'LEA': ['Q502344'],
 'IHIT': ['Q522617'],
 'IHTA': ['Q522617'],
 'IIM': ['Q522617'],
 'IQI': ['Q522617'],
 'IVZ': ['Q522617'],
 'OIA': ['Q522617'],
 'VBF': ['Q522617'],
 'VCV': ['Q522617'],
 'VGM': ['Q522617'],
 'VKI': ['Q522617'],
 'VKQ': ['Q522617'],
 'VLT': ['Q522617'],
 'VMO': ['Q522617'],
 'VPV': ['Q522617'],
 'VTA': ['Q522617'],
 'VTN': ['Q522617'],
 'VVR': ['Q522617'],
 'IVR': ['Q522617'],
 'BBY': ['Q533415'],
 'KGC': ['Q546880'],
 'CI': ['Q642271'],
 'ING': ['Q645708'],
 'STM': ['Q661845'],
 'MCO': ['Q675585'],
 'MTD': ['Q680186'],
 'MGA'

### Data cleaning

Now that we have retrieved and loaded the pricing information and the knowledge graph, we just keep all those stocks with both.

In [13]:
stocks = list(set(stocks) & mapping.keys())

Once we do have the intersection between the assets for which we do have time series and for which we do have knowledge graph information, we just clean the data by getting only the allowed tickers.

In [14]:
print("Number of stocks to consider: " + str(len(stocks)))

Number of stocks to consider: 819


In [15]:
stocks

['CMTL',
 'CEIX',
 'ABM',
 'CZR',
 'CDE',
 'AUTL',
 'BURL',
 'AMSC',
 'CREG',
 'CTSO',
 'CTMX',
 'DESP',
 'DAC',
 'BMY',
 'BLW',
 'CORT',
 'CPRT',
 'ATRC',
 'BHC',
 'BKNG',
 'AAPL',
 'ARGO',
 'CAC',
 'CM',
 'ALV',
 'AI',
 'DCOM',
 'ABB',
 'BKN',
 'AEFC',
 'CXE',
 'DDS',
 'BNGO',
 'CPS',
 'CDW',
 'ATOM',
 'BSTZ',
 'ALCO',
 'CHKP',
 'AVID',
 'AUDC',
 'ACOR',
 'AEM',
 'BEP',
 'CAPR',
 'AMTD',
 'CG',
 'CRTO',
 'BABA',
 'BE',
 'ARCH',
 'BBBY',
 'AIF',
 'CDXC',
 'BAH',
 'AMBA',
 'AAP',
 'BLCM',
 'CLF',
 'ARCT',
 'BBY',
 'DBD',
 'AU',
 'BKCC',
 'DCF',
 'APTO',
 'DDF',
 'AQNB',
 'AKR',
 'CWT',
 'BWAY',
 'CW',
 'COOP',
 'CVU',
 'AIO',
 'BEAM',
 'BXP',
 'DBX',
 'CSBR',
 'AVDL',
 'APEI',
 'ASR',
 'CAE',
 'CYCCP',
 'CONN',
 'ARCO',
 'ARLO',
 'APH',
 'CGO',
 'CTHR',
 'CBRE',
 'CSTM',
 'CWK',
 'ANGO',
 'CVX',
 'ACGLO',
 'ASML',
 'ACIU',
 'CWBR',
 'CMCM',
 'AIR',
 'BCOV',
 'CSSE',
 'ACHC',
 'AMAL',
 'CWBC',
 'AKBA',
 'AXDX',
 'AYX',
 'ATHM',
 'ADP',
 'CHTR',
 'CBT',
 'BOH',
 'ATVI',
 'DCP',
 'CACI',


In [16]:
import datetime as dt
pricedfs = []
i = 0
timea = dt.datetime.now()
for s in stocks:
    df = prices_df[prices_df['Stock'] == s]
    pricedfs.append(df)
    i += 1
    if i % 100 == 0:
        print("Processed " + str(i) + " stocks (" + str((dt.datetime.now() - timea).seconds) + " s)")
print("Dataset Filtering Complete")

Processed 100 stocks (8 s)
Processed 200 stocks (17 s)
Processed 300 stocks (26 s)
Processed 400 stocks (35 s)
Processed 500 stocks (43 s)
Processed 600 stocks (52 s)
Processed 700 stocks (61 s)
Processed 800 stocks (69 s)
Dataset Filtering Complete


In [17]:
pricedfs

[            Date       Open       High        Low      Close  Adj Close  \
 9532  2018-01-02  22.209999  22.730000  22.209999  22.500000  20.224899   
 9533  2018-01-03  22.530001  22.690001  21.700001  22.000000  19.775457   
 9534  2018-01-04  22.139999  22.260000  21.900000  21.980000  19.757481   
 9535  2018-01-05  22.070000  22.070000  21.620001  21.860001  19.649614   
 9536  2018-01-08  21.770000  21.770000  21.030001  21.350000  19.191181   
 ...          ...        ...        ...        ...        ...        ...   
 10288 2021-01-04  21.090000  21.309999  20.090000  20.270000  19.151667   
 10289 2021-01-05  20.230000  20.840000  20.209999  20.570000  19.435112   
 10290 2021-01-06  20.830000  21.980000  20.719999  21.680000  20.483871   
 10291 2021-01-07  21.850000  22.400000  21.790001  22.090000  20.871252   
 10292 2021-01-08  22.200001  22.200001  21.580000  21.920000  20.710630   
 
          Volume Stock  
 9532   129400.0  CMTL  
 9533   214800.0  CMTL  
 9534   131

## Feature Creation for the Model

Now that we have collected the pricing data and the knowledge graph, we can craft the features we can use in our model. We distinguish to kind of features: price-based technical indicators and knowledge graph embeddings.

### Technical indicators

Now that we have the pricing data in a more useful form, we can now convert that data into additional indicators that a machine learned model can use for identifying patterns/trends. In effect, we want to capture how the price for an asset changed in the recent past, for use as indicators for future performance (of course past performance is not always a good indicator, and more advanced approaches may mix in other sources of evidence here). We convert the pricing data into 3 different indicator (feature) types:

**NOTE:** In the following equations, the sub-index $t$ indicates the time of computation of the metric. $t-1$ might indicate, then, the previous day, and so on.

1. <b>Returns</b>: The returns on investment (ROI) represent the percentage change between close prices on different dates, across different periods.

\begin{equation}
\text{ROI}_t(n) = \frac{\text{Close}_t - \text{Close}_{t-n}}{\text{Close}_{t-n}}
\end{equation}

2. <b>Volatility</b>: Volatility represents the risk of a stock as expressed by its fluctuations, and is expressed as the standard deviation of the logarithmic returns of the stock. In this case, we take the daily returns.
\begin{equation}
\text{Volatility}_t(N,n) = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} \log^2(\text{ROI}_{t-i}(n)) - \left(\frac{1}{N-1} \sum_{i=0}^{N-1} \log(\text{ROI}_{t-i}(n))\right)^2} * \sqrt{n}
\end{equation}
Here, $N$ represents the number of periods we consider for measuring the Volatility (here, we take $N$ days), and $n$ represents the period of time for computing the ROI (here, we take $n = 1$ day). In the right square root, $n$ is the number of periods covered by the ROI calculation. For instance, if we took a monthly measure of ROI, we should measure $n$ in months. In this example, as each period is equal to a day, we take $n = 1$.

3. <b>Mean price</b>: This indicator just represents the average price of an asset over a period of time:
\begin{equation}
\text{Mean}_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} \text{Close}_{t-i}
\end{equation}




In [18]:
def returns(df, periods=[1,3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f"return_{t}"] = (df['Close'] - df['Close'].shift(t)) / df['Close'].shift(t)
    return df

def log_returns(df, periods=[1,3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f"log_return_{t}"] = (df['Close'] - df['Close'].shift(t)) / df['Close'].shift(t)
    return df

def volatility(df, roi_periods = [1], periods=[3,5,7,14,21,28,63,126]):
    for n in roi_periods:
        name = f"log_return_{n}"
        if not name in df.columns:
            log_returns(df, roi_periods)
            break
    
    for t in periods:
        for n in roi_periods:
            df[f"volatility_{t}_{n}"] = df[f"log_return_{n}"].rolling(window=t).std()*np.sqrt(n)

    df['3_28_volatility_ratio'] = df['volatility_3_1'] / df['volatility_28_1']
    return df

def mean_price(df, periods=[3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f'mean_{t}'] = df['Close'].rolling(window=t).mean()
    
    return df


In [19]:
pd.options.mode.chained_assignment = None  # default='warn'

newpricedfs = []
i = 0
timea = dt.datetime.now()
for p in pricedfs:
    if not p.empty:
        p1 = returns(p)
        p1 = volatility(p1)
        p1 = mean_price(p1)
        newpricedfs.append(p1.dropna())
        i += 1
        if i % 100 == 0:
            timeb = dt.datetime.now()
            print("Processed " + str(i) + " stocks (" + str((dt.datetime.now() - timea).seconds) + " s)")
print ("Metrics calculated for all stocks")

Processed 100 stocks (1 s)
Processed 200 stocks (2 s)
Processed 300 stocks (3 s)
Processed 400 stocks (5 s)
Processed 500 stocks (6 s)
Processed 600 stocks (7 s)
Processed 700 stocks (9 s)
Processed 800 stocks (10 s)
Metrics calculated for all stocks


In [20]:
newpricedfs[0]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,volatility_126_1,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126
9658,2018-07-03,32.169998,32.599998,32.099998,32.180000,29.150316,84700.0,CMTL,0.005625,0.003743,...,0.026414,0.556103,32.020000,31.998,31.985714,32.101429,32.139524,31.951071,31.279524,27.858968
9659,2018-07-05,32.250000,32.310001,32.029999,32.290001,29.249962,96900.0,CMTL,0.003418,0.012861,...,0.026315,0.112446,32.156667,32.082,32.055715,32.086429,32.100952,32.006428,31.307460,27.940635
9660,2018-07-06,32.349998,32.889999,32.349998,32.700001,29.621367,117400.0,CMTL,0.012697,0.021875,...,0.026325,0.451192,32.390001,32.210,32.140000,32.094286,32.115714,32.071428,31.354920,28.025714
9661,2018-07-09,32.830002,32.900002,32.599998,32.750000,29.666653,113700.0,CMTL,0.001529,0.017713,...,0.026314,0.557080,32.580001,32.384,32.265715,32.100715,32.130476,32.132857,31.410635,28.112143
9662,2018-07-10,32.849998,33.450001,32.750000,33.259998,30.128639,125100.0,CMTL,0.015572,0.030040,...,0.026224,0.672664,32.903333,32.636,32.437143,32.168571,32.199524,32.207857,31.453016,28.206667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10288,2021-01-04,21.090000,21.309999,20.090000,20.270000,19.151667,186500.0,CMTL,-0.020300,-0.002461,...,0.034832,0.377104,20.513334,20.620,20.612857,20.025000,19.567619,19.511786,17.658730,16.788571
10289,2021-01-05,20.230000,20.840000,20.209999,20.570000,19.435112,150600.0,CMTL,0.014800,-0.000486,...,0.034582,0.396513,20.510000,20.486,20.620000,20.241429,19.663333,19.578928,17.738571,16.828809
10290,2021-01-06,20.830000,21.980000,20.719999,21.680000,20.483871,265500.0,CMTL,0.053962,0.047849,...,0.034831,0.834648,20.840000,20.758,20.764286,20.507143,19.793333,19.632857,17.828730,16.874921
10291,2021-01-07,21.850000,22.400000,21.790001,22.090000,20.871252,142700.0,CMTL,0.018911,0.089788,...,0.034309,0.482590,21.446667,21.060,20.885714,20.737857,19.918095,19.701428,17.935555,16.932540


We finally compute the target of our recommendations: return at 6 months into the future (126 financial days)

In [21]:
for i in range(len(newpricedfs)):
    newpricedfs[i]["target"] = newpricedfs[i]["return_126"].shift(-126)
    newpricedfs[i] = newpricedfs[i].dropna()

In [22]:
newpricedfs[0]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
9658,2018-07-03,32.169998,32.599998,32.099998,32.180000,29.150316,84700.0,CMTL,0.005625,0.003743,...,0.556103,32.020000,31.998,31.985714,32.101429,32.139524,31.951071,31.279524,27.858968,-0.256060
9659,2018-07-05,32.250000,32.310001,32.029999,32.290001,29.249962,96900.0,CMTL,0.003418,0.012861,...,0.112446,32.156667,32.082,32.055715,32.086429,32.100952,32.006428,31.307460,27.940635,-0.246206
9660,2018-07-06,32.349998,32.889999,32.349998,32.700001,29.621367,117400.0,CMTL,0.012697,0.021875,...,0.451192,32.390001,32.210,32.140000,32.094286,32.115714,32.071428,31.354920,28.025714,-0.253211
9661,2018-07-09,32.830002,32.900002,32.599998,32.750000,29.666653,113700.0,CMTL,0.001529,0.017713,...,0.557080,32.580001,32.384,32.265715,32.100715,32.130476,32.132857,31.410635,28.112143,-0.244580
9662,2018-07-10,32.849998,33.450001,32.750000,33.259998,30.128639,125100.0,CMTL,0.015572,0.030040,...,0.672664,32.903333,32.636,32.437143,32.168571,32.199524,32.207857,31.453016,28.206667,-0.247745
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10162,2020-07-06,16.620001,16.760000,15.980000,16.240000,15.152655,438000.0,CMTL,0.012469,-0.038484,...,0.493988,16.136667,16.344,16.100000,16.315000,16.805714,17.219286,17.108571,22.098254,0.248153
10163,2020-07-07,16.090000,16.219999,15.410000,15.500000,14.462201,290500.0,CMTL,-0.045567,-0.039058,...,0.494750,15.926667,16.160,16.057143,16.180714,16.703333,17.112143,17.134127,21.937857,0.327097
10164,2020-07-08,15.510000,16.299999,15.410000,15.870000,14.807427,353100.0,CMTL,0.023871,-0.010599,...,0.620774,15.870000,15.956,16.155714,16.120714,16.519047,17.040357,17.151746,21.781032,0.366100
10165,2020-07-09,15.750000,15.780000,14.760000,14.830000,13.837062,419000.0,CMTL,-0.065532,-0.086823,...,0.767310,15.400000,15.696,15.928571,16.003571,16.340000,16.933928,17.132381,21.611429,0.489548


In [23]:
full_kpis_df = pd.concat(newpricedfs)
full_kpis_df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
9658,2018-07-03,32.169998,32.599998,32.099998,32.180000,29.150316,84700.0,CMTL,0.005625,0.003743,...,0.556103,32.020000,31.998,31.985714,32.101429,32.139524,31.951071,31.279524,27.858968,-0.256060
9659,2018-07-05,32.250000,32.310001,32.029999,32.290001,29.249962,96900.0,CMTL,0.003418,0.012861,...,0.112446,32.156667,32.082,32.055715,32.086429,32.100952,32.006428,31.307460,27.940635,-0.246206
9660,2018-07-06,32.349998,32.889999,32.349998,32.700001,29.621367,117400.0,CMTL,0.012697,0.021875,...,0.451192,32.390001,32.210,32.140000,32.094286,32.115714,32.071428,31.354920,28.025714,-0.253211
9661,2018-07-09,32.830002,32.900002,32.599998,32.750000,29.666653,113700.0,CMTL,0.001529,0.017713,...,0.557080,32.580001,32.384,32.265715,32.100715,32.130476,32.132857,31.410635,28.112143,-0.244580
9662,2018-07-10,32.849998,33.450001,32.750000,33.259998,30.128639,125100.0,CMTL,0.015572,0.030040,...,0.672664,32.903333,32.636,32.437143,32.168571,32.199524,32.207857,31.453016,28.206667,-0.247745
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3258,2020-07-06,1.940000,1.940000,1.850000,1.870000,1.870000,567700.0,ABUS,-0.005319,0.027473,...,0.384533,1.860000,1.864,1.907143,1.947857,1.920952,1.972857,1.677778,2.145317,0.946524
3259,2020-07-07,1.850000,1.960000,1.840000,1.860000,1.860000,1011200.0,ABUS,-0.005348,0.016393,...,0.436426,1.870000,1.852,1.872857,1.939286,1.911905,1.960000,1.690794,2.134841,0.908602
3260,2020-07-08,1.880000,1.910000,1.810000,1.880000,1.880000,578700.0,ABUS,0.010753,0.000000,...,0.215536,1.870000,1.864,1.865714,1.937143,1.903810,1.950000,1.704127,2.125635,0.898936
3261,2020-07-09,1.880000,1.910000,1.810000,1.860000,1.860000,443200.0,ABUS,-0.010638,-0.005348,...,0.258668,1.866667,1.870,1.857143,1.930000,1.901429,1.938929,1.717460,2.116032,1.096774


In [24]:
full_kpis_df.to_csv(os.path.join(storageDIR, "kpis.csv"), index=False)

In order to use the model for learning and testing, we thin the dataset by getting only the data at Mondays.

In [25]:
filtered_kpis_df = full_kpis_df.loc[full_kpis_df['Date'].dt.weekday == 1]
filtered_kpis_df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
9658,2018-07-03,32.169998,32.599998,32.099998,32.180000,29.150316,84700.0,CMTL,0.005625,0.003743,...,0.556103,32.020000,31.998000,31.985714,32.101429,32.139524,31.951071,31.279524,27.858968,-0.256060
9662,2018-07-10,32.849998,33.450001,32.750000,33.259998,30.128639,125100.0,CMTL,0.015572,0.030040,...,0.672664,32.903333,32.636000,32.437143,32.168571,32.199524,32.207857,31.453016,28.206667,-0.247745
9667,2018-07-17,34.299999,34.770000,34.090000,34.220001,31.088135,80600.0,CMTL,-0.008978,-0.010697,...,1.573182,34.623333,34.460000,34.044285,33.092143,32.744286,32.597857,31.696825,28.719127,-0.267095
9672,2018-07-24,33.580002,34.099998,32.990002,32.990002,29.970705,222100.0,CMTL,-0.014341,-0.031415,...,0.452333,33.446668,33.764001,33.938572,33.722857,33.143810,32.912143,31.883016,29.200714,-0.239770
9677,2018-07-31,33.049999,33.889999,33.049999,33.599998,30.524870,107000.0,CMTL,0.017257,-0.018692,...,1.646444,33.359999,33.604000,33.497143,33.950000,33.539048,33.123214,32.124444,29.679365,-0.256845
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3240,2020-06-09,2.050000,2.100000,1.900000,1.910000,1.910000,1323900.0,ABUS,-0.068293,-0.063725,...,0.529132,2.003333,2.022000,2.051429,2.150714,2.048571,1.855000,1.583810,2.237222,1.068063
3245,2020-06-16,1.940000,2.000000,1.870000,1.980000,1.980000,792600.0,ABUS,0.076087,0.207317,...,0.140859,1.850000,1.808000,1.857143,1.980714,2.063333,1.958929,1.573333,2.215079,1.464646
3250,2020-06-23,2.110000,2.200000,2.050000,2.050000,2.050000,1162400.0,ABUS,0.004902,0.045918,...,0.675409,2.000000,1.974000,1.955714,1.925714,2.001429,2.040714,1.603175,2.198651,1.073171
3255,2020-06-30,1.910000,1.920000,1.800000,1.820000,1.820000,814300.0,ABUS,-0.052083,-0.133333,...,0.886922,1.890000,1.968000,1.990000,1.921429,1.950000,2.017857,1.636349,2.170873,0.923077


### Knowledge graph embeddings

As a second group of features, we consider knowledge graph embeddings for every asset. There are multiple algorithms which summarize the information about entities in knowledge graphs as low-dimension vectors, also known as embeddings. These embeddings encode information about the node and the connections it has with other entities in the knowledge graph.

#### Knowledge graph filtering

As we plan to use embeddings as features for profitability prediction, we need knowledge graph embeddings for every possible date. In order to ensure that we are using correct information, we need to provide some filtering for the knowledge graphs: we need to ensure that, before computing the embeddings, we only have information prior to the date to consider.

In order to do this, we need first to filter the entities and then we need to filter the relations.

For filtering the entities, we take the whole set of entities. Then, we remove two sets of entities:
- **Entities created after the split date:** companies, products, etc. which first appeared after the split date. This is defined by the "inception" value relation (P571 in Wikidata).
- **People who were born after the split date or died before the split date:** This is defined by the "date of birth" relation (P569) and "date of death" relations (P570).

In [26]:
dates = ["+" + pd.to_datetime(str(date)).strftime("%Y-%m-%dT%H:%M:%S") + "Z" for date in filtered_kpis_df['Date'].unique()]

In [27]:
dates

['+2018-07-03T00:00:00Z',
 '+2018-07-10T00:00:00Z',
 '+2018-07-17T00:00:00Z',
 '+2018-07-24T00:00:00Z',
 '+2018-07-31T00:00:00Z',
 '+2018-08-07T00:00:00Z',
 '+2018-08-14T00:00:00Z',
 '+2018-08-21T00:00:00Z',
 '+2018-08-28T00:00:00Z',
 '+2018-09-04T00:00:00Z',
 '+2018-09-11T00:00:00Z',
 '+2018-09-18T00:00:00Z',
 '+2018-09-25T00:00:00Z',
 '+2018-10-02T00:00:00Z',
 '+2018-10-09T00:00:00Z',
 '+2018-10-16T00:00:00Z',
 '+2018-10-23T00:00:00Z',
 '+2018-10-30T00:00:00Z',
 '+2018-11-06T00:00:00Z',
 '+2018-11-13T00:00:00Z',
 '+2018-11-20T00:00:00Z',
 '+2018-11-27T00:00:00Z',
 '+2018-12-04T00:00:00Z',
 '+2018-12-11T00:00:00Z',
 '+2018-12-18T00:00:00Z',
 '+2019-01-08T00:00:00Z',
 '+2019-01-15T00:00:00Z',
 '+2019-01-22T00:00:00Z',
 '+2019-01-29T00:00:00Z',
 '+2019-02-05T00:00:00Z',
 '+2019-02-12T00:00:00Z',
 '+2019-02-19T00:00:00Z',
 '+2019-02-26T00:00:00Z',
 '+2019-03-05T00:00:00Z',
 '+2019-03-12T00:00:00Z',
 '+2019-03-19T00:00:00Z',
 '+2019-03-26T00:00:00Z',
 '+2019-04-02T00:00:00Z',
 '+2019-04-0

In [28]:
entities_date = dict()
for date in dates:
    entities_to_remove = entities_df
    
    rel_type = "P571" ## inception
    rel_df = valuerelations_df[valuerelations_df["type"] == rel_type][["source","dest","properties"]]
    rel_df = rel_df.merge(values_df, left_on="dest", right_on="nodeID")
    if rel_df.shape[0] > 0:
        rel_df = rel_df[rel_df["properties"]["value"] > date]
    
    entities_to_remove = entities_to_remove[entities_to_remove["nodeID"].isin(rel_df["source"])]
    
    rel_type = "P569" ## inception
    rel_df = valuerelations_df[valuerelations_df["type"] == rel_type][["source","dest","properties"]]
    rel_df = rel_df.merge(values_df, left_on="dest", right_on="nodeID")
    if rel_df.shape[0] > 0:
        rel_df = rel_df[rel_df["properties"]["value"] > date]
    
    entities_to_remove = entities_to_remove[entities_to_remove["nodeID"].isin(rel_df["source"])]
    
    rel_type = "P570" ## inception
    rel_df = valuerelations_df[valuerelations_df["type"] == rel_type][["source","dest", "properties"]]
    rel_df = rel_df.merge(values_df, left_on="dest", right_on="nodeID")
    if rel_df.shape[0] > 0:
        rel_df = rel_df[rel_df["properties"]["value"] < date]
    
    entities_to_remove = entities_to_remove[entities_to_remove["nodeID"].isin(rel_df["source"])]
    
    entities_date[date] = entities_df[~entities_df["nodeID"].isin(entities_to_remove)]

Now, for filtering the relations, we use the properties. We remove those edges that satisfy:
- The relation started after the date (P580)
- The relation ended before the date (P582)
- The relation is established by a time point after the date (P583)

We store the remaining relationships into files.

In [29]:
def contains_rel(x, rel, greater, value):
    if rel in x and greater:
        if x[rel] > value:
            return True
    elif rel in x and not greater:
        if x[rel] < value:
            return True
    return False

split_dir = os.path.join(os.path.join(storageDIR, "kg"), "splits")
if not os.path.exists(split_dir):
    os.makedirs(split_dir)

for date in dates:
    relations_def = relations_df[(relations_df["source"].isin(entities_date[date]["nodeID"])) & 
                                 (relations_df["dest"].isin(entities_date[date]["nodeID"]))]
    
    
    relations_with_start_date = relations_def[relations_def["properties"].apply(lambda x: contains_rel(x, "P580", True, date))]    
    relations_def = relations_def.drop(relations_with_start_date.index)
    
    relations_with_end_date = relations_def[relations_def["properties"].apply(lambda x: contains_rel(x, "P582", False, date))]
    relations_def = relations_def.drop(relations_with_end_date.index)
    
    relations_with_point_time = relations_def[relations_def["properties"].apply(lambda x: contains_rel(x, "P583", True, date))]
    relations_def = relations_def.drop(relations_with_point_time.index)

    relations_def = relations_def.merge(entities_date[date], left_on="source", right_on="nodeID")
    relations_def = relations_def[["wikidataID", "type", "dest"]]
    relations_def = relations_def.rename(columns={"wikidataID" : "Source"})
    relations_def = relations_def.merge(entities_date[date], left_on="dest", right_on="nodeID")
    relations_def = relations_def[["Source","type","wikidataID"]]
    relations_def = relations_def.rename(columns={"wikidataID" : "Target"})
    
    relations_def.to_csv(os.path.join(split_dir, "graph_" + pd.to_datetime(date, format="+%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d") + ".csv"), index=False)

Once we have found the knowledge graphs for every date, we can then train the corresponding knowledge graph embeddings. We are using for that the <a href="https://github.com/pykeen/pykeen">PyKeen</a> library. This library contains multiple knowledge graph embedding methods which we can use. In particular, in this example, we use a knowledge graph embedding method known as TransH.

In [30]:
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline

  warn(f"Failed to load image Python extension: {e}")


In [31]:
def_nodes = []
node_list = []
for node in mapping:
    if len(mapping[node]) == 1:
        def_nodes.append(node)
        node_list.append(mapping[node][0])
node_list

['Q135281',
 'Q143616',
 'Q152057',
 'Q152057',
 'Q4836297',
 'Q193326',
 'Q193326',
 'Q193326',
 'Q193326',
 'Q193326',
 'Q193326',
 'Q212235',
 'Q245343',
 'Q245343',
 'Q245343',
 'Q245343',
 'Q245343',
 'Q288129',
 'Q329953',
 'Q456563',
 'Q483551',
 'Q502344',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q522617',
 'Q533415',
 'Q546880',
 'Q642271',
 'Q645708',
 'Q661845',
 'Q675585',
 'Q680186',
 'Q697311',
 'Q826526',
 'Q836040',
 'Q837982',
 'Q866972',
 'Q905806',
 'Q908324',
 'Q918206',
 'Q920037',
 'Q930919',
 'Q994153',
 'Q1046951',
 'Q1046951',
 'Q1053422',
 'Q1134746',
 'Q1134746',
 'Q1134746',
 'Q1138291',
 'Q1220078',
 'Q1275577',
 'Q1282130',
 'Q20858163',
 'Q1341590',
 'Q1341588',
 'Q1345971',
 'Q1374135',
 'Q2114414',
 'Q2114414',
 'Q2114414',
 'Q2114414',
 'Q1374135',
 'Q1472539',
 'Q1511043',
 'Q1539185',
 'Q

NOTE: This step might be take a while to execute. Also note that, in this case, we are configuring the TransH method with a single epoch: for production purposes, larger numbers of epochs should be considered (but, in this case, it would only increase execution time). To reduce the time needed for execution, we have just limited the computation to a few dates (between 1st July 2019 and 31st December 2019, and after 29th June 2020).

In [33]:
dates = filtered_kpis_df["Date"].unique().flatten()

# For tutorial reasons, we restrict the computation of these embeddings to just a few dates:
def_dates = [date for date in dates if (date > pd.to_datetime("2019-07-01") and date < pd.to_datetime("2020-01-01"))  or date >= pd.to_datetime("2020-06-29")]

additional_data = []

for date in def_dates:
    filename = os.path.join(split_dir, "graph_" + pd.to_datetime(date).strftime("%Y-%m-%d") + ".csv")
    print(filename)
    
    # We read the graph for the given date
    graph = pd.read_csv(filename)
    
    # We create the (head, type, tail) triple set using the graph.
    tf = TriplesFactory.from_labeled_triples(
        graph[['Source', 'type', 'Target']].values,
        create_inverse_triples=False,
        entity_to_id=None,
        relation_to_id=None,
        compact_id=True,
        filter_out_candidate_inverse_relations=True,
        metadata=None,
    )
    
    training, test = tf.split(ratios=0.999)
    
    # We configure the TransH model, with 1 epoch. This number is established for tutorial purposes:
    # A larger number of epochs should be considered for production.
    result = pipeline(
        training=training,
        testing=test,
        model='TransH',
        epochs=1,
        random_seed=0
    )
    
    single_data = dict()
    single_data["Date"] = pd.to_datetime(date)
    
    # We get the list of nodes for which we want to retrieve embeddings
    emb_list=[]
    for node in node_list:
        if node in result.training.entity_to_id:
            emb_list.append(result.training.entity_to_id[node])
        else:
            emb_list.append(-1)
    
    # We retrieve the embeddings
    for i in range(0, len(def_nodes)):
        single_data = dict()
        single_data["Date"] = pd.to_datetime(date)
        single_data["Stock"] = def_nodes[i]
        if emb_list[i] >= 0:
            emb = result.model.entity_representations[0].forward()[emb_list[i]].tolist()
            for j in range(0, 50):
                single_data["emb_" + str(j)] = emb[j]
        else:
            for j in range(0, 50):
                single_data["emb_" + str(j)] = 0.0
        additional_data.append(single_data)
        
embedding_data = pd.DataFrame(additional_data)

aux_embedding_df = embedding_data.copy()
for i in range(0, 50):
    aux_embedding_df["emb_"+str(i)] = aux_embedding_df["emb_" + str(i)].apply(lambda x: float(x))
    if i % 10 == 0:
        print(i)

/nfs/notebooks/KGE/data2/kg/splits/graph_2019-07-02.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339665, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-07-09.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339666, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-07-16.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339665, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.68s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-07-23.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339663, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-07-30.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339663, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.69s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-08-06.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339672, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-08-13.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339672, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.67s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-08-20.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339673, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-08-27.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339673, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.75s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-09-03.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339687, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.72s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-09-10.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339688, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-09-17.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339689, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-09-24.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339691, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.68s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-10-01.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339694, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-10-08.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339695, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-10-15.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339696, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-10-22.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339702, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.72s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-10-29.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339703, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.72s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-11-05.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339707, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.72s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-11-12.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339707, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-11-19.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339708, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.68s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-11-26.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339709, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-12-03.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339711, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.72s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-12-10.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339712, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-12-17.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339714, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.73s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-12-24.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339711, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2019-12-31.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339712, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.70s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2020-06-30.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339757, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.73s seconds


/nfs/notebooks/KGE/data2/kg/splits/graph_2020-07-07.csv


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [339763, 443]
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/1 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1728 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.


Evaluating on cuda:0:   0%|          | 0.00/443 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 1.71s seconds


0
10
20
30
40


In [34]:
aux_embedding_df

Unnamed: 0,Date,Stock,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,...,emb_40,emb_41,emb_42,emb_43,emb_44,emb_45,emb_46,emb_47,emb_48,emb_49
0,2019-07-02,ALEX,0.012258,-0.002819,-0.039719,-0.014789,0.008833,0.008910,-0.026472,0.032472,...,0.007392,-0.033460,0.051683,-0.017940,0.025596,-0.043847,0.055780,-0.022569,0.008385,0.028152
1,2019-07-02,UMC,0.043303,0.012475,-0.013208,-0.051690,0.065664,0.013147,-0.022311,-0.028408,...,-0.014749,-0.043526,0.024834,0.045914,0.026062,0.018382,0.027156,-0.074581,0.010762,0.000162
2,2019-07-02,BP,0.043079,-0.038000,-0.085028,-0.050656,0.079308,0.063371,-0.091170,-0.076359,...,-0.087561,-0.053419,0.077678,0.083396,-0.060554,0.006615,0.039481,-0.086902,-0.064915,-0.021611
3,2019-07-02,BPMP,0.043079,-0.038000,-0.085028,-0.050656,0.079308,0.063371,-0.091170,-0.076359,...,-0.087561,-0.053419,0.077678,0.083396,-0.060554,0.006615,0.039481,-0.086902,-0.064915,-0.021611
4,2019-07-02,BPT,0.036754,-0.017834,-0.017140,-0.059579,0.019490,0.008845,-0.002627,-0.041105,...,0.024648,0.012722,0.050788,0.028738,-0.059635,0.016882,0.001816,-0.027347,-0.016653,-0.077225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108165,2020-07-07,XP,0.010071,0.003291,-0.019107,-0.040694,0.023292,-0.023060,-0.026578,-0.048145,...,0.006886,0.005121,0.036937,0.006838,-0.009337,-0.024992,0.021884,-0.009780,-0.003846,-0.017526
108166,2020-07-07,XRAY,0.028427,-0.039921,-0.025032,-0.009958,0.037584,0.037607,-0.025059,-0.066898,...,-0.041461,-0.026892,0.010143,0.065303,-0.039509,0.035102,-0.000813,-0.049818,0.025319,-0.039857
108167,2020-07-07,XVZ,-0.142858,-0.088120,-0.024537,-0.191846,0.245535,0.114878,-0.044984,0.035975,...,-0.141802,0.147015,0.090973,-0.147548,-0.117695,-0.009450,-0.164969,0.126821,-0.142927,0.219653
108168,2020-07-07,YGRN,0.007717,-0.002709,0.016964,0.024163,0.003132,0.006996,0.020115,0.000584,...,-0.032307,-0.037174,-0.020983,-0.015940,0.030629,-0.000684,0.026519,-0.012271,0.019070,0.021328


## Dataset split

### Getting the training / test examples

In order to execute this, we need training and test examples. Basically, we shall use the training examples for our model, and the test examples 

Then, we are doing the following:
1. Choose a recommendation date. For instance, 2020-06-30.
2. We get 6 months for obtaining target values,
3. The previous 6 months are used as training examples. (Essentially from 2019-07-02 to 2019-12-31)
4. Then, we take the targets of the examples at that point as test targets.

In [35]:
train_embeddings = aux_embedding_df[(aux_embedding_df["Date"] >= pd.to_datetime("2019-07-02")) &
                              (aux_embedding_df["Date"] <= pd.to_datetime("2019-12-31"))]
test_embeddings = aux_embedding_df[aux_embedding_df["Date"] == pd.to_datetime("2020-06-30")]

In [36]:
train_embeddings.to_csv(os.path.join(storageDIR, "train-embeddings.csv"), index=False)
test_embeddings.to_csv(os.path.join(storageDIR, "test-embeddings.csv"), index=False)

In [37]:
train_data = filtered_kpis_df[filtered_kpis_df["Date"].isin(train_embeddings["Date"])]
test_data = filtered_kpis_df[filtered_kpis_df["Date"] == pd.to_datetime("2020-06-30")]

In [38]:
train_data.to_csv(os.path.join(storageDIR, "training.csv"), index=False)
test_data.to_csv(os.path.join(storageDIR, "test.csv"), index=False)

## Basic model
As a baseline, we provide here the training model. Considering the data, we train a random forest model using, as input, the different technical indicators.

In [39]:
basic_kpis = ["return_28","return_63","return_126", "volatility_28_1", "volatility_63_1", "volatility_126_1", "mean_28", "mean_63", "mean_126"]

In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

We first read the training and test data from files.

In [41]:
train_data = pd.read_csv(os.path.join(storageDIR, "training.csv"))
test_data = pd.read_csv(os.path.join(storageDIR, "test.csv"))

And we separate the training features and the training targets the algorithm shall learn.

In [42]:
basic_training_data = train_data[basic_kpis]
basic_targets = train_data["target"]

Then, we create our model: a random forest regressor. Any regression algorithm might be configured at this point (for instance, a Linear Regression algorithm can also be tested)

In [43]:
model = RandomForestRegressor()

Finally, we call the function `fit` to train the model. This function receives, as imput, the training data features and the training targets. Once the algorithm is train, we use the `predict` method over the test features to generate the predictions, and we sort the values by descending scores to generate a recommendation ranking

In [44]:
model.fit(basic_training_data, basic_targets)
basic_test_data = test_data[["Stock", "target"]]
basic_test_data["prediction"] = model.predict(test_data[basic_kpis])
basic_ranking = basic_test_data.sort_values( by="prediction", ascending = False).head(10)
basic_ranking

Unnamed: 0,Stock,target,prediction
750,CAN,1.094737,1.199079
278,AHPI,-0.567912,1.022066
448,AR,1.070866,0.953765
353,CYAN,0.274678,0.943349
592,ANY,-0.493007,0.929794
205,BLNK,6.113556,0.905251
137,BW,0.434211,0.900773
631,AIM,-0.245968,0.805574
453,AGRX,0.035971,0.796861
685,CGEN,-0.176431,0.786474


And we compute the average profitability of the model:

In [45]:
basic_prof = basic_ranking["target"].mean()

We do the same for the advanced technical indicators.

We show here the results for our two algorithms, in terms of profitability (return on investment) at 6 months. As we can see, the model with advanced KPIs (6.4) is showing improvements with respect to the basic model (1.8) and the market average (0.37)

In [46]:
profitability_df = pd.DataFrame([{"Model" : "Basic KPIs", "RoI@10" : basic_prof}, {"Model" : "Market average", "RoI@10" : test_data["target"].mean()}])
profitability_df

Unnamed: 0,Model,RoI@10
0,Basic KPIs,0.75407
1,Market average,0.32826


## Knowledge graph embedding model

For our model with knowledge graph embeddings, we concatenate the knowledge graph embeddings to the feature information. Then, we, again, run a random forest regression algorithm. We consider three variants here:
- **Pure KGE**: We just consider here the knowledge graph embeddings as features.
- **Basic KPIS + KGE**: We concatenate the knowledge graph embeddings to the basic set of KPIs.
- **Adv. KPIS + KGE**: We concatenate the knowledge graph embeddings to the advanced set of KPIs.

In [47]:
train_embeddings = pd.read_csv(os.path.join(storageDIR, "train-embeddings.csv"))
test_embeddings = pd.read_csv(os.path.join(storageDIR, "test-embeddings.csv"))

In [48]:
train_embeddings["Date"] = pd.to_datetime(train_embeddings["Date"])
test_embeddings["Date"] = pd.to_datetime(test_embeddings["Date"])
train_data["Date"] = pd.to_datetime(train_data["Date"])
test_data["Date"] = pd.to_datetime(test_data["Date"])

In [49]:
emb_train = train_data.merge(aux_embedding_df, on=["Date", "Stock"])
emb_test = test_data.merge(test_embeddings, on=["Date","Stock"])

In [50]:
emb_train

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,emb_40,emb_41,emb_42,emb_43,emb_44,emb_45,emb_46,emb_47,emb_48,emb_49
0,2019-07-02,28.42,28.559999,27.879999,28.540001,26.226006,109800.0,CMTL,-0.000350,0.031442,...,-0.012970,-0.039455,0.032538,0.002180,0.012329,0.010228,0.028398,-0.047774,-0.011668,-0.002596
1,2019-07-09,27.33,27.790001,27.000000,27.719999,25.472492,108300.0,CMTL,0.010941,-0.026344,...,-0.050830,-0.040176,0.027692,0.047752,-0.019799,0.028655,0.018530,-0.032030,-0.029249,-0.024014
2,2019-07-16,28.10,28.320000,28.000000,28.049999,25.867464,74300.0,CMTL,-0.005319,0.008630,...,-0.014004,-0.038994,0.034837,0.001975,0.011657,0.010211,0.029813,-0.048357,0.022484,-0.001982
3,2019-07-23,27.17,27.540001,27.059999,27.500000,25.360260,94900.0,CMTL,0.016636,0.006589,...,-0.045770,-0.048394,0.019055,0.053915,-0.009383,0.006669,0.022837,-0.036695,0.036573,-0.022881
4,2019-07-30,29.01,29.660000,28.850000,29.459999,27.167759,127400.0,CMTL,0.008904,0.043571,...,-0.059846,-0.064269,0.036434,0.055885,-0.022395,0.023843,0.038082,-0.052611,0.048595,-0.035730
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21047,2019-12-03,1.58,2.040000,1.580000,2.000000,2.000000,533300.0,ABUS,0.226994,0.212121,...,-0.014861,-0.058989,0.031925,0.008556,0.019961,0.060837,-0.033116,0.002977,-0.049212,-0.014728
21048,2019-12-10,2.39,2.390000,2.290000,2.350000,2.350000,204300.0,ABUS,-0.004237,0.058559,...,0.033042,0.032762,0.005310,-0.025500,-0.012480,-0.003715,-0.012297,0.009887,-0.006931,-0.008355
21049,2019-12-17,2.24,2.320000,2.160000,2.260000,2.260000,335100.0,ABUS,0.013453,-0.054393,...,-0.014279,-0.062424,0.056326,0.050022,-0.009541,0.050551,0.046695,-0.004526,0.032055,-0.003775
21050,2019-12-24,2.78,2.800000,2.700000,2.750000,2.750000,152400.0,ABUS,0.000000,0.206140,...,0.013618,-0.028418,-0.044631,0.015484,-0.006356,0.058976,-0.053771,-0.008415,-0.056421,-0.039388


In [51]:
emb_test

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,emb_40,emb_41,emb_42,emb_43,emb_44,emb_45,emb_46,emb_47,emb_48,emb_49
0,2020-06-30,16.320000,16.940001,16.320000,16.889999,15.759132,337400.0,CMTL,0.028624,0.068987,...,-0.023096,-0.027073,0.026937,0.029492,-0.033293,0.029246,0.021184,-0.041365,0.020737,-0.044903
1,2020-06-30,5.250000,5.280000,5.000000,5.070000,4.812879,734800.0,CEIX,-0.034286,-0.099467,...,-0.060620,-0.084170,0.082831,0.061060,-0.035738,0.088497,0.093575,-0.091325,0.088214,-0.058954
2,2020-06-30,36.360001,36.799999,36.020000,36.299999,34.346691,415500.0,ABM,-0.005207,0.015953,...,-0.006010,-0.061017,0.061057,0.012548,-0.042969,0.082653,0.005060,-0.068639,0.019363,-0.001039
3,2020-06-30,39.040001,40.419998,38.299999,40.060001,40.060001,4074000.0,CZR,0.011872,0.037824,...,0.010868,0.000785,-0.035460,0.009052,0.042924,-0.010649,-0.021738,0.040082,0.012832,0.013241
4,2020-06-30,4.750000,5.170000,4.690000,5.080000,5.080000,9793200.0,CDE,0.060543,0.083156,...,-0.051203,-0.038175,0.008630,0.030166,-0.036829,0.027516,0.015614,0.023243,0.017368,-0.019433
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
799,2020-06-30,3.460000,3.560000,3.460000,3.480000,3.038411,18800.0,BBDO,-0.027933,-0.069519,...,-0.102401,-0.075713,0.089866,0.090565,-0.067391,0.047019,0.014694,-0.091565,0.020135,-0.028702
800,2020-06-30,10.030000,10.290000,9.950000,10.250000,10.122540,1421300.0,CCJ,0.015857,0.032226,...,-0.049577,-0.024091,0.040084,0.044302,-0.055835,0.054958,-0.003294,-0.060699,0.006934,-0.022913
801,2020-06-30,5.280000,5.290000,5.180000,5.220000,4.313803,1435400.0,BSBR,-0.026119,-0.052632,...,-0.009084,-0.082725,0.013626,0.067962,-0.004170,0.039842,0.065395,-0.057461,0.109681,-0.052192
802,2020-06-30,26.309999,27.490000,26.040001,27.160000,22.215637,1351300.0,CWH,0.010793,0.050271,...,-0.034890,-0.049898,0.028444,0.012949,0.003233,0.037562,0.029762,-0.033205,0.072647,-0.009082


In [52]:
emb_feats = []
for i in range(0,50):
    emb_feats.append("emb_"+str(i))

In [53]:
profit_list = []

#### Model 1: Pure embeddings:

In this model, we just take the embeddings: not any other feature:

In [54]:
model = RandomForestRegressor()

In [55]:
emb_train_data = emb_train[emb_feats]
emb_test_data = emb_test[emb_feats]
emb_train_targets = emb_train["target"]

In [56]:
model.fit(emb_train_data, emb_train_targets)
emb_test_data = emb_test[["Stock", "target"]]
emb_test_data["prediction"] = model.predict(emb_test[emb_feats])
emb_ranking = emb_test_data.sort_values(by="prediction", ascending = False).head(10)

In [57]:
emb_ranking

Unnamed: 0,Stock,target,prediction
666,AGI,-0.077825,1.25552
367,AVB,0.016619,0.998749
754,APVO,3.173652,0.945309
537,CMPR,0.131648,0.879546
101,AMAL,0.075949,0.845867
711,CEMI,0.449231,0.710225
774,BWMX,2.311225,0.558311
285,ABEO,-0.472603,0.55288
576,ABCB,0.585418,0.546014
78,ASR,0.467272,0.478554


In [58]:
profit_list.append({"Model" : "Pure KGE", "RoI@10" : emb_ranking["target"].mean()})

#### Model 2: Basic KPIs + Embeddings

For this model, we take, as features, (a) the knowledge graph embeddings and (b) the basic technical indicators (RoI, average price and volatility).

In [59]:
model = RandomForestRegressor()

In [60]:
emb_train_data = emb_train[emb_feats + basic_kpis]
emb_train_targets = emb_train["target"]

In [61]:
model.fit(emb_train_data, emb_train_targets)
emb_test_data = emb_test[["Stock", "target"]]
emb_test_data["prediction"] = model.predict(emb_test[emb_feats + basic_kpis])
emb_basic_ranking = emb_test_data.sort_values(by="prediction", ascending = False).head(10)

In [62]:
emb_basic_ranking

Unnamed: 0,Stock,target,prediction
204,BLNK,6.113556,1.11635
452,AGRX,0.035971,1.102323
277,AHPI,-0.567912,1.081253
803,ABUS,0.923077,1.041603
416,CIDM,-0.638743,1.033581
447,AR,1.070866,0.999187
747,CAN,1.094737,0.892252
591,ANY,-0.493007,0.891672
548,BGI,0.153846,0.764591
284,APRN,-0.462579,0.743703


In [63]:
profit_list.append({"Model" : "Basic KPIs + KGE", "RoI@10" : emb_basic_ranking["target"].mean()})

Finally, we show the performance of our algorithms in terms of RoI at six months of the top 10 recommended assets. 

In [64]:
adv_df = pd.DataFrame(profit_list)

In [65]:
pd.concat([profitability_df, adv_df]).reset_index(drop=True)

Unnamed: 0,Model,RoI@10
0,Basic KPIs,0.75407
1,Market average,0.32826
2,Pure KGE,0.666059
3,Basic KPIs + KGE,0.722981


As we can see, in this test, we can observe that the knowledge graph embeddings achieve slightly worse results than the basic KPIs. However, there are many considerations here: first, we restrict our training examples to a few dates -- which can be further extended for more training data. Also, we run a single epoch to train the embeddings: more epochs would be expected to improve the performance of the method.

<!-- In the above table, we are showing the top 10 stocks that were predicted to be profitable. The last two columns report the actual return on investment after 9 months and asset volitility. Note that a return value of 1.5 means a 150% return on investment. 

We notice that predicted returns for the top stocks are exceedingly high, which is not ordinary. However, we can also see that the actual returns for these stocks are similar for several of these instances, i.e. the model is not wrong in predicting these as profitable investments in the short term. However, we can also see that the volitility fo these stocks is very high, i.e. these are 'high-risk' assets that may subsequently crash in price. 

We can also analyse the statistics for this predictions across the dataset. -->

<!-- The returns and volatility for the top stocks, ranked by predicted returns, are far higher than their averages across the test set. This indicates that ranking assets by their predicted returns can produce some highly profitable but risk-laden investment recommendations, which might be suitable for aggressive investors. However, it remains to be seen how much of this is owed to fluctuations and outliers in the data, and perhaps even if there are better ways to capture the returns and volatility of the dataset.

Next, we look at the differences between the actual and predicted returns. -->

<!-- Lastly, we can examine the mean absolute error and mean squared error of the predictions. As these can be quite dependent on the dataset and problem in question, we also assume a simple baseline, by taking the median of all stock returns from the test dataset. We then compare the results of applying these metrics to the baseline and our predictor model. -->

<!-- We can see from this that the random forest model presents an improvement (reduction) in both MAE and MSE. -->