# Transform dataset to LibKGE-ready files
Generates the .del files in the format that LibKGE expects.

Required files for this Jupyter Notebook to work:
1. Two mapping CSV files that contain information about all entities in Wikidata5M and all relations for those entities. See 02_csv_mapping_entities.csv and 02_csv_mapping_relations.csv for the format, or use them directly.
2. A folder containing a JSON file for each entity in Wikidata5M and for all relations.

## How does this Notebook work?
The notebook is divided into five sections. Sections 1, 2, 3 and 5 start with a cell in which you set variables for the dataset you want to create. Section 4 is reserved for pre-processing the data. If you don't need to change anything, you can skip section 4.

To create a dataset, go through the notebook and run each cell. Set the variables in the cells that start with #please set variables. As mentioned before, section 4 is where you pre-process your data the way you need it.
E.g.: Transforming a date into a number to create a regression downstream task.
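
As an illustration of such a pre-processing step, here is a minimal sketch that turns date strings into numbers suitable as regression targets. The helper name `dates_to_numbers`, the ISO date format and the choice of "days since 1970-01-01" are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd


def dates_to_numbers(values: pd.Series) -> pd.Series:
    """Map date strings to days since 1970-01-01 (hypothetical helper).

    Values that cannot be parsed become NaN instead of raising an error.
    """
    dates = pd.to_datetime(values, errors="coerce")
    return (dates - pd.Timestamp("1970-01-01")) / pd.Timedelta(days=1)


# Toy input, assumed to be ISO-formatted; real Wikidata values may need
# extra cleaning before they can be parsed.
example = pd.Series(["1970-01-02", "2000-01-01", "not a date"])
numbers = dates_to_numbers(example)
print(numbers)
```

Any monotonic encoding (year, ordinal day, timestamp) would serve the same purpose; which one fits best depends on the downstream task.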

In [34]:
#imports
import pandas as pd
import json
import os
import shutil
import numpy

In [58]:
#set variables
libkge_entity_ids = "00_wikidata5m_entity_ids.del"
json_download_folder = '01_download_entities_relations'
folder_of_triples = '07_datasets/airport_elevation_above_sea_level'
downstreak_task_type = 'regression'  #regression or classification
train_percentage = 0.8
valid_percentage = 0.1
test_percentage = 0.1
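
The three percentages are meant to describe a complete partition of the data. The following sketch checks that they sum to 1 and shows how many rows each split receives under the truncating scheme the split cell uses, where the test split takes whatever remains after train and valid. The function name `split_sizes` is hypothetical.

```python
def split_sizes(n_rows, train, valid, test):
    """Return (train, valid, test) row counts for a sequential split.

    Train and valid sizes are truncated with int(); the test split takes
    the remaining rows, so no row is dropped.
    """
    if abs(train + valid + test - 1.0) > 1e-9:
        raise ValueError("split percentages must sum to 1")
    train_size = int(n_rows * train)
    valid_size = int(n_rows * valid)
    return train_size, valid_size, n_rows - train_size - valid_size


sizes = split_sizes(1005, 0.8, 0.1, 0.1)
print(sizes)
```

Because of the truncation, the test split can end up slightly larger than `test_percentage` suggests.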

In [59]:
df = pd.read_csv(folder_of_triples + '/all_triples.csv', delimiter=';', header=None)
df = df.rename(columns={0: 'entity_id', 1: 'relation_id', 2: 'value'})
#strip list brackets and quotes from the value column; raw strings avoid invalid-escape warnings
df['value'] = df['value'].replace(r'\[', '', regex=True).replace(r'\]', '', regex=True).replace("'", '', regex=True).replace(', ', ',', regex=True)
relation_id = df['relation_id'].loc[0]
df = df.set_index('entity_id')
df = df.drop(['relation_id'], axis=1)

if downstreak_task_type == 'classification':
    df['value'] = df['value'].apply(lambda x: x.split(',')).to_frame()
    df = df.explode('value')
    df = pd.get_dummies(df['value'])
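
To make the classification branch easier to follow, here is a sketch of the same clean, split, explode and one-hot steps on toy data. The stored format `"['Q1', 'Q2']"` is inferred from the cleanup regexes above, and the entity and label IDs are made up. The final `groupby` step is not part of the notebook; it is shown only as one possible way to collapse the exploded rows into a single multi-hot row per entity.

```python
import pandas as pd

# Toy triples: one entity with two labels, one entity with a single label.
toy = pd.DataFrame(
    {"value": ["['Q1', 'Q2']", "['Q2']"]},
    index=pd.Index(["Q100", "Q200"], name="entity_id"),
)

# Strip brackets and quotes, then normalise the separator to a bare comma.
cleaned = (
    toy["value"]
    .str.replace(r"[\[\]']", "", regex=True)
    .str.replace(", ", ",", regex=False)
)

# One row per (entity, label) pair, then one indicator column per label.
labels = cleaned.str.split(",").explode()
one_hot = pd.get_dummies(labels)

# Optional (assumption): merge the rows of each entity into one multi-hot row.
multi_hot = one_hot.groupby(level=0).max()
print(multi_hot)
```

Without the last step, an entity with several labels appears on several rows, each with exactly one indicator set.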

In [60]:
path = folder_of_triples

#mapping the wikidata IDs to the IDs for LibKGE in the entity_ids.del file
df_mapping = pd.read_csv(libkge_entity_ids, sep='\t', names=['libkge_id','entity_id']).set_index("entity_id")
df = df.merge(df_mapping, on='entity_id', how='left')
df = df.set_index("libkge_id")

#creating the split files
train_size = int(len(df) * train_percentage)
valid_size = int(len(df) * valid_percentage)
test_size = int(len(df) * test_percentage)

df_train = df.iloc[:train_size]
df_valid = df.iloc[train_size:train_size + valid_size]
df_test = df.iloc[train_size + valid_size:]

df_train.to_csv(path + "/train.del", sep="\t", index=True, header=False)
df_valid.to_csv(path + "/valid.del", sep="\t", index=True, header=False)
df_test.to_csv(path + "/test.del", sep="\t", index=True, header=False)

#README FILE
with open(os.path.join(path, "README.md"), "a") as file:
    file.write("\n")
    file.write("Split for LibKGE created with 06_notebook_triples_to_libkge.\n")
    file.write("Variables in notebook:\n")
    file.write("json_download_folder:                  {}\n".format(json_download_folder))    
    file.write("folder_of_triples:                     {}\n".format(folder_of_triples))
    file.write("downstreak_task_type:                  {}\n".format(downstreak_task_type))
    file.write("train_percentage:                      {}\n".format(train_percentage))
    file.write("valid_percentage:                      {}\n".format(valid_percentage))
    file.write("test_percentage:                       {}\n".format(test_percentage))
shutil.copyfile('06_notebook_triples_to_libkge.ipynb', path + '/06_notebook_triples_to_libkge.ipynb')

'07_datasets/album_producer/06_notebook_triples_to_libkge.ipynb'
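
The left merge in the mapping cell keeps every row of the triples, so an entity that is missing from the LibKGE entity file ends up with a missing `libkge_id`. The sketch below shows on toy data how such rows could be detected and dropped before writing the .del files. All IDs and values are made up, and merging on a reset index is a simplification for this example rather than the notebook's exact call.

```python
import pandas as pd

# Toy triples; Q999 is deliberately absent from the mapping.
triples = pd.DataFrame(
    {"entity_id": ["Q1", "Q2", "Q999"], "value": [10.0, 20.0, 30.0]}
)
mapping = pd.DataFrame({"libkge_id": [0, 1], "entity_id": ["Q1", "Q2"]})

merged = triples.merge(mapping, on="entity_id", how="left")

# Rows without a LibKGE ID cannot be used by LibKGE.
unmapped = merged[merged["libkge_id"].isna()]

# Keep only mapped rows and restore integer IDs (NaN forces a float column).
mapped = (
    merged.dropna(subset=["libkge_id"])
    .astype({"libkge_id": int})
    .set_index("libkge_id")[["value"]]
)
print(unmapped["entity_id"].tolist())
print(mapped)
```

Whether unmapped entities should be dropped or reported depends on the dataset; the notebook itself writes them out unchanged.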