# Supervised Training with SimpleDBpediaQA

- Dataset repository: [castorini/SimpleDBpediaQA](https://github.com/castorini/SimpleDBpediaQA)
- Stats (forward only):
  - 16,000 training questions
  - 2,310 validation questions
  - 4,608 test questions

## Step 0. Prerequisite

Install and import dependencies:

```bash
pip install srsly srtk pandas tqdm
```

In [1]:
import os
from pathlib import Path

import srsly
from tqdm import tqdm

Download data from GitHub to `data/dbpedia-qa/raw` with the following commands:

```bash
mkdir -p data/dbpedia-qa/raw
wget https://raw.githubusercontent.com/castorini/SimpleDBpediaQA/master/V1/train.json -P data/dbpedia-qa/raw
wget https://raw.githubusercontent.com/castorini/SimpleDBpediaQA/master/V1/valid.json -P data/dbpedia-qa/raw
wget https://raw.githubusercontent.com/castorini/SimpleDBpediaQA/master/V1/test.json -P data/dbpedia-qa/raw
```
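If `wget` is not available, the same files can be fetched from Python. The following is a minimal sketch using only the standard library; the helper names `raw_file_urls` and `download_all` are hypothetical and not part of `srtk` or the dataset repository.

```python
import urllib.request
from pathlib import Path

# Same base URL and split names as in the wget commands above.
BASE_URL = "https://raw.githubusercontent.com/castorini/SimpleDBpediaQA/master/V1"
SPLITS = ("train", "valid", "test")


def raw_file_urls(base_url=BASE_URL, splits=SPLITS):
    """Map each split name to the URL of its raw JSON file."""
    return {split: f"{base_url}/{split}.json" for split in splits}


def download_all(target_dir="data/dbpedia-qa/raw"):
    """Download every split into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for split, url in raw_file_urls().items():
        destination = target / f"{split}.json"
        if not destination.exists():
            urllib.request.urlretrieve(url, destination)
    return target
```

Calling `download_all()` once should leave `train.json`, `valid.json`, and `test.json` under `data/dbpedia-qa/raw`, which is the location the rest of this notebook expects.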

## Step 1. Format the Raw Data for Preprocessing

### 1.1 Inspect raw data

In [2]:
raw_root = 'data/dbpedia-qa/raw'
splits = ['train', 'valid', 'test']
raw_paths = {split: os.path.join(raw_root, f'{split}.json') for split in splits}

intermediate_dir = 'data/dbpedia-qa/intermediate'  # intermediate data (here, the scored paths)
dataset_dir = 'data/dbpedia-qa/dataset'  # preprocessed data
output_model_dir = 'artifacts/models/dbpedia_qa'
retrieved_subgraph_path = 'artifacts/subgraphs/dbpedia_qa.jsonl'

# Create directories
Path(intermediate_dir).mkdir(parents=True, exist_ok=True)
Path(dataset_dir).mkdir(parents=True, exist_ok=True)
Path(output_model_dir).mkdir(parents=True, exist_ok=True)

In [3]:
raw_train_path = raw_paths['train']
! head -n 16 $raw_train_path

{
  "DatasetName": "SimpleDBpediaQA-TRAIN",
  "Questions": [
    {
      "ID": "00007",
      "Query": "what movie is produced by warner bros.",
      "Subject": "http://dbpedia.org/resource/Warner_Bros.",
      "FreebasePredicate": "www.freebase.com/film/production_company/films",
      "PredicateList": [
        {
          "Predicate": "http://dbpedia.org/ontology/distributor",
          "Direction": "backward",
          "Constraint": "http://dbpedia.org/ontology/Film"
        }
      ]
    },


### 1.2 Remove backward relations

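The retrieval model currently handles only forward relations, so questions whose predicates are all backward must be removed. The cell below keeps a question only if at least one of its predicates is forward and belongs to the `http://dbpedia.org/ontology/` namespace.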

In [4]:
preserved_data = {}
for split in splits:
    raw_path = raw_paths[split]
    data = srsly.read_json(raw_path)['Questions']
    before_len = len(data)
    print(f'Full {split} size: {before_len}')
    # Keep a sample only if at least one predicate is 'forward'
    # and lies in the DBpedia ontology namespace
    data = [sample for sample in data if any(p['Direction'] == 'forward'
                                             and p['Predicate'].startswith('http://dbpedia.org/ontology/')
                                             for p in sample['PredicateList'])]
    after_len = len(data)
    print(f'{split} size after removing backward relations: {after_len}')
    print(f'{after_len/before_len*100:.2f}% of {split} data are kept')
    preserved_data[split] = data


Full train size: 30186
train size after removing backward relations: 16000
53.00% of train data are kept
Full valid size: 4305
valid size after removing backward relations: 2310
53.66% of valid data are kept
Full test size: 8595
test size after removing backward relations: 4608
53.61% of test data are kept
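To see what kind of questions the filter drops, it can help to tally questions by the directions of their predicates. The sketch below is illustrative only: `direction_profile` is a hypothetical helper, the two records are toy data modeled on the raw format shown in 1.1 (the `Constraint` of the second record is made up), and the ontology-namespace check from the cell above is not reproduced.

```python
from collections import Counter


def direction_profile(sample):
    """Classify a question by the directions found in its PredicateList."""
    directions = {p["Direction"] for p in sample["PredicateList"]}
    if not directions:
        return "empty"
    if directions == {"forward"}:
        return "forward only"
    if directions == {"backward"}:
        return "backward only"
    return "mixed"  # any other combination of directions


# Toy records in the raw SimpleDBpediaQA shape (not real dataset statistics).
toy_questions = [
    {
        "ID": "00007",
        "PredicateList": [
            {"Predicate": "http://dbpedia.org/ontology/distributor",
             "Direction": "backward",
             "Constraint": "http://dbpedia.org/ontology/Film"},
        ],
    },
    {
        "ID": "00012",
        "PredicateList": [
            {"Predicate": "http://dbpedia.org/ontology/hometown",
             "Direction": "forward",
             "Constraint": None},
        ],
    },
]

profile_counts = Counter(direction_profile(q) for q in toy_questions)
print(profile_counts)
```

Running the same tally over `srsly.read_json(raw_path)['Questions']` would show how many questions in each split are backward only and therefore removed.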


### 1.3 Convert to scored path format

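The scored-path format is a JSONL file in which each line is a dictionary like the one below (pretty-printed here for readability). A question may have several paths of varying length, and `path_scores` holds one score per path; ground-truth paths are assigned the maximum score of 1.0. The example uses Wikidata-style identifiers; in this notebook the entities and relations are DBpedia local names such as `RYNA` and `hometown`.
```json
{
    "id": "train-100",
    "question": "What is the birth place of Barack Obama?",
    "question_entities": ["Q76"],
    "paths": [["P19"]],
    "path_scores": [1.0]
}
```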

In [5]:
for split in splits:
    data = preserved_data[split]
    samples = []
    for sample in tqdm(data, desc=f'Processing {split}'):
        question_entity = sample['Subject'].split('/')[-1]
        relations = [[p['Predicate'].split('/')[-1]] for p in sample['PredicateList']
                     if p['Direction'] == 'forward' and p['Predicate'].startswith('http://dbpedia.org/ontology/')]   
        question = sample['Query']
        idx = sample['ID']
        record = {
            "id": idx,
            "question": question,
            "question_entities": [question_entity],
            "paths": relations,
            # one ground-truth score per path
            "path_scores": [1.0] * len(relations)
        }
        samples.append(record)
    save_path = os.path.join(intermediate_dir, f'scores_{split}.jsonl')
    srsly.write_jsonl(save_path, samples)
    print(f'Saved {split} scored paths file to {save_path}')

Processing train: 100%|██████████| 16000/16000 [00:00<00:00, 69926.04it/s]


Saved train scored paths file to data/dbpedia-qa/intermediate/scores_train.jsonl


Processing valid: 100%|██████████| 2310/2310 [00:00<00:00, 125726.25it/s]


Saved valid scored paths file to data/dbpedia-qa/intermediate/scores_valid.jsonl


Processing test: 100%|██████████| 4608/4608 [00:00<00:00, 120893.43it/s]


Saved test scored paths file to data/dbpedia-qa/intermediate/scores_test.jsonl
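Before moving on, the written records can be sanity-checked against the scored-path format. This is a minimal sketch, assuming the five keys used in the conversion cell above; `validate_record` is a hypothetical helper, not an `srtk` function.

```python
REQUIRED_KEYS = {"id", "question", "question_entities", "paths", "path_scores"}


def validate_record(record):
    """Return a list of problems found in one scored-path record (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if not record["paths"]:
        problems.append("no paths")
    if len(record["paths"]) != len(record["path_scores"]):
        problems.append("paths and path_scores differ in length")
    if any(not path for path in record["paths"]):
        problems.append("empty path")
    return problems


good_record = {
    "id": "00012",
    "question": "Which city did the artist ryna originate in",
    "question_entities": ["RYNA"],
    "paths": [["hometown"]],
    "path_scores": [1.0],
}
# Illustrative bad record: two paths but only one score.
bad_record = dict(good_record, paths=[["hometown"], ["birthPlace"]])

print(validate_record(good_record))
print(validate_record(bad_record))
```

To check a whole file, one could loop over `srsly.read_jsonl(save_path)` and report any record for which `validate_record` returns a non-empty list.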


Inspect a training sample:

In [6]:
!head -n 1 data/dbpedia-qa/intermediate/scores_train.jsonl | jq

{
  "id": "00012",
  "question": "Which city did the artist ryna originate in",
  "question_entities": [
    "RYNA"
  ],
  "paths": [
    [
      "hometown"
    ]
  ],
  "path_scores": [
    1
  ]
}


## Step 2. Preprocessing

Use the `srtk preprocess` command to create training samples. Do not pass `--search-path`, because the paths are already provided in the dataset. This step mainly involves negative sampling and dataset generation.
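As a rough intuition for the negative sampling mentioned above, the sketch below pairs a ground-truth relation with relations drawn from a candidate pool. It is a conceptual illustration only, not `srtk`'s implementation: the function names, the record fields (`query`, `positive`, `negatives`), and the candidate list are all assumptions made for this example. In the real pipeline the candidates would presumably come from the knowledge graph via the SPARQL endpoint.

```python
import random


def sample_negatives(positive_relations, candidate_relations, num_negatives, seed=0):
    """Pick up to num_negatives relations that are candidates but not positives."""
    rng = random.Random(seed)
    # Sort the pool so that results are reproducible for a given seed.
    pool = sorted(set(candidate_relations) - set(positive_relations))
    return rng.sample(pool, min(num_negatives, len(pool)))


def build_training_record(question, positive, candidates, num_negatives=2):
    """Combine a question, its positive relation, and sampled negatives."""
    negatives = sample_negatives([positive], candidates, num_negatives)
    return {"query": question, "positive": positive, "negatives": negatives}


# Illustrative candidate relations for the entity (made up for this example).
candidates = ["hometown", "birthPlace", "genre", "recordLabel"]
example = build_training_record(
    "Which city did the artist ryna originate in", "hometown", candidates
)
print(example)
```

The point of the design is that the model learns to rank the ground-truth relation above other relations that are plausible for the same entity, rather than above arbitrary relations.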

In [9]:
for split in splits:
    print(f'Processing {split} data...')
    scored_path = os.path.join(intermediate_dir, f'scores_{split}.jsonl')
    dataset_path = os.path.join(dataset_dir, f'{split}.jsonl')
    !srtk preprocess -i $scored_path \
        -o $dataset_path \
        -e http://localhost:8890/sparql \
        -kg dbpedia

Processing train data...
Negative sampling: 100%|█████████████████| 16000/16000 [01:12<00:00, 219.56it/s]
Number of training records: 0
Converting relation ids to labels: 0it [00:00, ?it/s]
Training samples are saved to data/dbpedia-qa/dataset/train.jsonl
Processing valid data...
Negative sampling: 100%|███████████████████| 2310/2310 [00:10<00:00, 224.15it/s]
Number of training records: 0
Converting relation ids to labels: 0it [00:00, ?it/s]
Training samples are saved to data/dbpedia-qa/dataset/valid.jsonl
Processing test data...
Negative sampling: 100%|███████████████████| 4608/4608 [00:18<00:00, 254.34it/s]
Number of training records: 0
Converting relation ids to labels: 0it [00:00, ?it/s]
Training samples are saved to data/dbpedia-qa/dataset/test.jsonl


In [8]:
!head -n 1 data/dbpedia-qa/dataset/train.jsonl | jq
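The preprocessing log reports the number of training records per split, so it is worth confirming how many records actually landed in each output file. The following is a minimal sketch with a hypothetical helper, `count_jsonl_records`; the demo runs on a temporary file rather than the real dataset.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
    return count


# Demo on a throwaway file containing two records and one blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_path.write_text('{"id": "1"}\n\n{"id": "2"}\n', encoding="utf-8")
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```

Applied to `data/dbpedia-qa/dataset/train.jsonl`, a count of zero would mean that no training samples were written and that the preprocessing step needs to be investigated before training.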