# Create downstream tasks
This Jupyter Notebook helps to create new downstream tasks for models that are trained on Wikidata5M.

Required files for this Jupyter Notebook to work:
1. Two mapping CSV files that contain information about all entities in Wikidata5M and all relations of those entities. You can check the files 02_csv_mapping_entities.csv and 02_csv_mapping_relations.csv for the format, or just use them directly.
2. A folder containing a JSON file for each entity in Wikidata5M with all of its relations.

## How does this Notebook work?
The notebook is divided into five sections. Sections 1, 2, 3 and 5 start with a cell where you need to set variables according to the dataset you want to create. Section 4 gives you room to pre-process the data if you need to. If you don't need to change anything, you can skip section 4.

To create a dataset you have to go through the Notebook and run each cell. You need to set variables in the cells that start with #please set variables. As mentioned before, section 4 is for you to pre-process your data the way you need it.
E.g.: Transforming a date into a number to create a regression downstream task.
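
As a minimal sketch of that kind of pre-processing, the helper below turns a Wikidata time string into a year. The function name `wikidata_time_to_year` is hypothetical and not part of this notebook; it assumes the `+YYYY-MM-DDThh:mm:ssZ` layout shown in the date_of_birth example of section 3.

```python
import re

def wikidata_time_to_year(time_string):
    """Return the signed year of a Wikidata time string as an int, or None."""
    # Assumed layout: optional sign, year digits, then -MM-DDT...
    match = re.match(r"^([+-]?\d+)-\d{2}-\d{2}T", time_string)
    if match is None:
        return None
    return int(match.group(1))

print(wikidata_time_to_year('+1930-01-11T00:00:00Z'))
```

The year alone is often enough for a regression target; a finer resolution would need the month and day as well.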

In [13]:
#imports
import pandas as pd
import csv
import os
import json
import numpy as np
import sys
import shutil

In [14]:
#please set variables

#Filenames of the mapping CSVs which contain the information about the entities and the relations.
mapping_file_entities = '02_csv_mapping_entities.csv'
mapping_file_relations = '02_csv_mapping_relations.csv'

#Folder which contains all the JSON files for every entity
json_download_folder = '01_download_entities_relations/'

In [15]:
#no need to change anything here
#reading all entities and relations from csv
df = pd.read_csv(mapping_file_entities, delimiter=';')

#cleaning up dataframes
#strip quotes, brackets and the space after commas, then split each cell into a list of IDs
for column in ['instance_of', 'relations_wikidata', 'relations_wikidata5m']:
    df[column] = df[column].replace(r"['\[\]]", "", regex=True).replace(", ", ",", regex=True)
    df[column] = df[column].apply(lambda x: x.split(','))
df['entity_id'] = df['entity_id'].replace("'", "", regex=True)
df = df.set_index('entity_id')

# 1.) Filtering for the downstream task.
The DataFrame is filtered for the relation you set and, optionally, for the entity type. Only entities that have the relation in Wikidata but not in Wikidata5M are kept.

In [250]:
#please set variables

#Relation for the downstream task. Label and Wikidata ID are needed!
#E.g.: 
#relation_id = 'P21'
#relation_label = 'sex or gender'
relation_id = 'P2046'
relation_label = 'area'


#Entity type for which you want to create the downstream task. 
#E.g.: 
#entity_type_id = 'Q5'
#entity_type_label = 'human'
entity_type_id = 'Q262166'
entity_type_label = 'municipality of germany'

In [251]:
#filtering dataframe for entities of the chosen type that have the relation in Wikidata but not in Wikidata5M
df_sample = df.copy()
df_sample = df_sample[df_sample.instance_of.apply(lambda x: entity_type_id in x)]
df_sample = df_sample[df_sample.relations_wikidata5m.apply(lambda x: relation_id not in x)]
df_sample = df_sample[df_sample.relations_wikidata.apply(lambda x: relation_id in x)]

#number of entities that fit the criteria for the downstream task
print(len(df_sample), 'entities fit criteria')

8995 entities fit criteria


# 2.) Creating a sample for the data set.
In the cell above this one you can read how many entities fit the filter criteria. Based on this you can now decide how big your sample_size should be.
In this section the sample is created based on your random seed and sample size.
The last cell prints the part of the JSON file that contains the information about the relation. Since the JSON structure differs between relations, you will need to set which item you need in the next section.

In [252]:
#please set variables

#random seed
seed = 4

#sample size for dataset
sample_size = 8995

In [253]:
#Creating the sample of the data.
df_sample = df_sample['relations_wikidata'].sample(n = sample_size, random_state = seed).to_frame()
df_sample['relations_wikidata'] = relation_id

#Printing out the JSON value of the relation for the first sampled entity
first_entity_id = df_sample.index[0]
try:
    with open(json_download_folder + first_entity_id + '.json', encoding='UTF-8') as file:
        data = json.load(file)
    key = next(iter(data['entities'].keys()))
    print('Entity ID:', key)
    print('Content of JSON:', data['entities'][key]['claims'][relation_id][0]['mainsnak']['datavalue']['value'])
except Exception as e:
    print(first_entity_id, e)

Entity ID: Q564476
Content of JSON: {'amount': '+8.27', 'unit': 'http://www.wikidata.org/entity/Q712226'}


# 3.) Set element which contains the value
This section maps the entity IDs to the values of the relation. 
In this step you need to figure out which element of the JSON file you need. Depending on the sample size, this step may take a while.
Here are two examples:
#### Example: date_of_birth
In this example we want to build a downstream task for the relation date_of_birth. The output of the above cell should look somewhat like this:

Content of JSON: {'time': '+1930-01-11T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}


Since we are looking for the date we have to choose the element "time" which contains the data we want to use: +1930-01-11T00:00:00Z

So we set the variable: value_in_json = "time"

#### Example: sex_or_gender
In this example we want to build a downstream task for the relation sex_or_gender. The output of the above cell should look somewhat like this:

Content of JSON: {'entity-type': 'item', 'numeric-id': 6581072, 'id': 'Q6581072'}

Since we are looking for the gender we have to choose the element "id" which contains the ID of the gender entity:
Q6581072 (which is the ID for 'female')

So we set the variable: value_in_json = "id"
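
The lookup described in both examples can be sketched on an in-memory dictionary. The helper name `extract_values`, the entity ID `Q0` and the shortened sample data are illustrative assumptions; only the nesting `entities` -> entity ID -> `claims` -> relation ID -> `mainsnak` -> `datavalue` -> `value` is taken from this notebook.

```python
def extract_values(entity_json, relation_id, value_key):
    """Collect value_key from every claim of relation_id (hypothetical helper)."""
    entity_id = next(iter(entity_json["entities"]))
    claims = entity_json["entities"][entity_id]["claims"].get(relation_id, [])
    return [claim["mainsnak"]["datavalue"]["value"][value_key] for claim in claims]

# Shortened sample data mirroring the two examples above; "Q0" is a placeholder ID.
sample = {
    "entities": {
        "Q0": {
            "claims": {
                # P569 is the Wikidata property for date of birth
                "P569": [{"mainsnak": {"datavalue": {"value": {
                    "time": "+1930-01-11T00:00:00Z", "precision": 11}}}}],
                # P21 is the Wikidata property for sex or gender
                "P21": [{"mainsnak": {"datavalue": {"value": {
                    "entity-type": "item", "id": "Q6581072"}}}}],
            }
        }
    }
}

print(extract_values(sample, "P569", "time"))
print(extract_values(sample, "P21", "id"))
```

The second argument plays the role of relation_id and the third the role of value_in_json.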

In [254]:
#please set variable
value_in_json = 'amount'

In [255]:
%%time
def getValue(row):
    #returns all values of the relation for one entity, or None if the file or a key is missing
    try:
        file_name = json_download_folder + row.entity_id + ".json"
        with open(file_name, encoding='UTF-8') as file:
            data = json.load(file)
        key = next(iter(data["entities"].keys()))
        values = []
        for relation in data["entities"][key]["claims"][relation_id]:
            value = relation["mainsnak"]["datavalue"]["value"][value_in_json]
            values.append(value)
        return values
    except Exception as e:
        #entities that cannot be read get no value and are removed by dropna() below
        return None
        
df_sample = df_sample.reset_index()
df_sample['value'] = df_sample.apply(lambda row: getValue(row), axis=1)
df_sample = df_sample.set_index('entity_id')
df_sample = df_sample.dropna()

CPU times: user 14.3 s, sys: 2.75 s, total: 17 s
Wall time: 6min 17s


# 4.) Room for manipulation of the data value
You might need to manipulate the data to get a suitable format. 
The cell below shows what your current sample set looks like. Here you have room to finalise your dataset.

E.g.: You may still have some empty fields you want to clear out. You may want to simplify your dataset by filtering out for specific results. You may want to pre-process your data to fit the format you need for your downstream task.
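
As one possible pre-processing step for a regression task, the sketch below converts a list of amount strings into a single float and discards entities with zero or several values. The function name `single_amount_to_float` is hypothetical; it assumes values shaped like the `amount` strings printed in section 2.

```python
def single_amount_to_float(values):
    """Return the only amount as a float, or None if there is not exactly one."""
    if values is None or len(values) != 1:
        return None
    try:
        return float(values[0])  # float() accepts a leading '+' sign
    except ValueError:
        return None

print(single_amount_to_float(['+37.3']))
print(single_amount_to_float(['+8.27', '+8.26']))
```

Applied to the value column, a helper like this would leave None for ambiguous entities, which can then be dropped.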

In [256]:
df_sample

entity_id,relations_wikidata,value
Q564476,P2046,"[+8.27, +8.26]"
Q93295,P2046,[+37.3]
Q118603,P2046,[+4.27]
Q627678,P2046,[+7.94]
Q554327,P2046,[+19.68]
...,...,...
Q558193,P2046,"[+18.82, +18.83]"
Q10757,P2046,"[+34.96, +35.05]"
Q647749,P2046,"[+6.22, +6.2]"
Q546179,P2046,"[+8.54, +8.56]"


In [257]:
#drop entities with multiple values to clean up the data, since multiple values are stored inconsistently
df_sample['count'] = df_sample.apply(lambda x: len(x.value), axis=1)
df_sample = df_sample[df_sample['count'] == 1]
df_sample = df_sample.drop('count', axis=1)
df_sample

entity_id,relations_wikidata,value
Q93295,P2046,[+37.3]
Q118603,P2046,[+4.27]
Q627678,P2046,[+7.94]
Q554327,P2046,[+19.68]
Q552690,P2046,[+13.38]
...,...,...
Q552375,P2046,[+4.15]
Q637384,P2046,[+8.39]
Q182426,P2046,[+59.03]
Q184476,P2046,[+10.98]


In [258]:
#remove the + sign for simplification since there are only positive values
df_sample = df_sample.explode('value')
df_sample["value"] = df_sample["value"].replace(r"\+", "", regex=True)

In [259]:
df_sample

entity_id,relations_wikidata,value
Q93295,P2046,37.3
Q118603,P2046,4.27
Q627678,P2046,7.94
Q554327,P2046,19.68
Q552690,P2046,13.38
...,...,...
Q552375,P2046,4.15
Q637384,P2046,8.39
Q182426,P2046,59.03
Q184476,P2046,10.98


# 5.) Creating triples file
In this section the triples file and the readme file are created:

all_triples.csv: Contains three columns. The first column contains the entity ID, the second the relation ID, and the third the relation value. The relation value is saved in a list since one entity can have more than one value per relation, unless section 4 flattened it as in this example.

README.md: The readme file contains information about which variables you set in this notebook.
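
To check the result, the triples file can be read back with the standard `csv` module. This sketch assumes the layout written by this notebook (semicolon-separated, no header row, one scalar value per row as in the example run) and uses an in-memory string instead of the real file so that it runs on its own. The function name `read_triples` is hypothetical.

```python
import csv
import io

def read_triples(handle):
    """Read (entity_id, relation_id, value) rows from a semicolon-separated file."""
    triples = []
    for row in csv.reader(handle, delimiter=";"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        entity_id, relation_id, value = row
        triples.append((entity_id, relation_id, value))
    return triples

# Two rows in the same shape as the example output of section 4.
sample_file = io.StringIO("Q93295;P2046;37.3\nQ118603;P2046;4.27\n")
print(read_triples(sample_file))
```

For the real file, pass a handle opened with `newline=""` and `encoding="UTF-8"` instead of the `io.StringIO` object.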

In [245]:
path = "07_datasets/" + "_".join(entity_type_label.split()) + "_" + "_".join(relation_label.split())
#create the dataset folder, including 07_datasets if it does not exist yet
os.makedirs(path, exist_ok=True)
df_sample.to_csv(path + "/all_triples.csv", sep=";", index=True, header=False)
with open(os.path.join(path, "README.md"), "w") as file:
    file.write("Dataset created with jupyter notebook 05_notebook_downstream_task_creator\n")
    file.write("Variables in notebook:\n")
    file.write("relation_id:                           {}\n".format(relation_id))
    file.write("relation_label:                        {}\n".format(relation_label))
    file.write("entity_type_id:                        {}\n".format(entity_type_id))
    file.write("entity_type_label:                     {}\n".format(entity_type_label))
    file.write("seed:                                  {}\n".format(seed))
    file.write("sample_size (variable):                {}\n".format(sample_size))
    file.write("actual size (df_sample.dropna()):      {}\n".format(len(df_sample)))
    file.write("json_download_folder:                  {}\n".format(json_download_folder))
    file.write("value_in_json:                         {}\n".format(value_in_json))
#copying this file to sub folder
shutil.copyfile('05_notebook_downstream_task_creator.ipynb', path + '/05_notebook_downstream_task_creator.ipynb')

'07_datasets/village_population/05_notebook_downstream_task_creator.ipynb'