# Build Wikidata Infobox

## Datasets
1. Sitelinks, a tab-separated edge file (header, then a sample row that shows no `rank` value):
   - `id    node1    label    node2    rank`
   - `Q45-wikipedia_sitelink-1    Q45    wikipedia_sitelink    http://en.wikipedia.org/wiki/Portugal`
2. DBpedia infobox, one triple per line:
   - `<http://dbpedia.org/resource/!!!>    <http://dbpedia.org/property/alias>    "Chk Chk Chk"@en .`

## Procedure

In [1]:
import csv
import re
import numpy as np
import pandas as pd

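### Step 1 Link Wikidata nodes to DBpedia nodes
Convert `node2` from a Wikipedia URL such as `http://en.wikipedia.org/wiki/Portugal` into the prefixed form `dbpedia-resource:Portugal`, and append a `prefix_expansion` row that declares the `dbpedia-resource` namespace.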

In [None]:
sitelink_path = '../data/wikidata-20181210.sitelinks.en.tsv'
df = pd.read_csv(sitelink_path, sep='\t')

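# Rename edge ids, e.g. Q45-wikipedia_sitelink-1 becomes Q45-dbpedia_sitelink-1
df['id'] = df['id'].apply(lambda x: re.sub('wiki', 'db', x))
# Rewrite Wikipedia URLs as prefixed DBpedia resources
df['node2'] = df['node2'].apply(lambda x: re.sub('http://en.wikipedia.org/wiki/', 'dbpedia-resource:', x))
# Append the namespace declaration as one more row (id, node1, label, node2, rank)
df.loc[len(df.index)] = ['prefix', 'dbpedia-resource', 'prefix_expansion', '"http://dbpedia.org/resource/"', np.nan]

# QUOTE_NONE writes fields verbatim, so no quote or escape character is passed
# (recent Python versions reject empty strings for escapechar and quotechar)
df.to_csv('../data/sitelinks.tsv', sep='\t', index=False,
          quoting=csv.QUOTE_NONE)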
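
To make the rewrite in this step concrete, here is a minimal sketch that applies the same two substitutions to the sample sitelink row from the Datasets section. It uses plain dictionaries instead of a DataFrame, and the helper name `to_dbpedia_edge` is an assumption made for illustration, not part of the notebook.

```python
WIKI_PREFIX = 'http://en.wikipedia.org/wiki/'
DBPEDIA_PREFIX = 'dbpedia-resource:'


def to_dbpedia_edge(row: dict) -> dict:
    """Return a copy of a sitelink row that points at a prefixed DBpedia resource."""
    out = dict(row)
    # Same substitution as the notebook: every 'wiki' in the id becomes 'db'
    out['id'] = row['id'].replace('wiki', 'db')
    # Swap the Wikipedia URL prefix for the dbpedia-resource namespace prefix
    if row['node2'].startswith(WIKI_PREFIX):
        out['node2'] = DBPEDIA_PREFIX + row['node2'][len(WIKI_PREFIX):]
    return out


sample = {
    'id': 'Q45-wikipedia_sitelink-1',
    'node1': 'Q45',
    'label': 'wikipedia_sitelink',
    'node2': 'http://en.wikipedia.org/wiki/Portugal',
}
converted = to_dbpedia_edge(sample)
print(converted['id'], converted['node2'])
```

The input row is left untouched, which makes it easy to compare the edge before and after the rewrite.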

### Step 2 Build namespaces

In [None]:
namespaces = [
    [
        'dbpedia-resource',
        'prefix_expansion',
        '"http://dbpedia.org/resource/"'
    ],
    [
        'property',
        'prefix_expansion',
        '"http://dbpedia.org/property/"'
    ],
    [
        'dbpedia-datatype',
        'prefix_expansion',
        '"http://dbpedia.org/datatype/"'
    ],
    [
        'rdf',
        'prefix_expansion',
        '"http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ],
    [
        'xml-schema-type',
        'prefix_expansion',
        '"http://www.w3.org/2001/XMLSchema#"'
    ]
]
# Write the namespace table as a three-column KGTK edge file
with open('../data/namespaces.tsv', 'w', newline='') as f:
    fieldnames = ['node1', 'label', 'node2']
    # QUOTE_NONE with no quote or escape character keeps the double quotes in node2 exactly as written
    writer = csv.writer(f, delimiter='\t', quoting=csv.QUOTE_NONE, quotechar=None, escapechar=None)
    writer.writerow(fieldnames)
    writer.writerows(namespaces)
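
As an illustration of what a `prefix_expansion` row means, the sketch below compacts a full URI into its prefixed form using the same prefixes as the table above. The function `compact` is a hypothetical helper and does not reproduce KGTK's own implementation; choosing the longest matching expansion is an assumption made here so that overlapping namespaces resolve predictably.

```python
NAMESPACES = {
    'dbpedia-resource': 'http://dbpedia.org/resource/',
    'property': 'http://dbpedia.org/property/',
    'dbpedia-datatype': 'http://dbpedia.org/datatype/',
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'xml-schema-type': 'http://www.w3.org/2001/XMLSchema#',
}


def compact(uri: str, namespaces: dict) -> str:
    """Replace a known namespace expansion at the start of a URI with 'prefix:'."""
    matches = [
        (len(expansion), prefix)
        for prefix, expansion in namespaces.items()
        if uri.startswith(expansion)
    ]
    if not matches:
        return uri  # unknown namespace: leave the URI unchanged
    length, prefix = max(matches)  # longest expansion wins
    return f'{prefix}:{uri[length:]}'


print(compact('http://dbpedia.org/property/alias', NAMESPACES))
print(compact('http://www.w3.org/2001/XMLSchema#date', NAMESPACES))
```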

### Step 3 Import DBpedia infobox
Use `kgtk import-ntriples` to convert the DBpedia infobox `.ttl` file into a KGTK TSV file, compacting URIs with the namespaces from Step 2.

In [None]:
%%bash
INFOBOX_INPUT_FILE="../data/infobox-properties_lang=en.ttl"
INFOBOX_OUTPUT_FILE="../data/infobox_properties_lang=en.tsv"

kgtk import-ntriples -i $INFOBOX_INPUT_FILE \
                     -o $INFOBOX_OUTPUT_FILE \
                     --namespace-file ../data/namespaces.tsv \
                     --namespace-id-use-uuid True \
                     --build-new-namespaces False \
                     --output-only-used-namespaces True \
                     --structured-value-label dbpedia:structured_value \
                     --structured-uri-label dbpedia:structured_uri \
                     --newnode-prefix node \
                     --newnode-use-uuid True
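
The following sketch shows, for the sample triple from the Datasets section, roughly what this conversion amounts to: one triple becomes one `node1`, `label`, `node2` edge with compacted URIs. It is a simplified stand-in written for illustration, not KGTK's importer. The regular expression only covers simple one-line triples, the names `TRIPLE`, `shorten`, and `triple_to_edge` are assumptions, and KGTK applies its own literal formatting, which is not reproduced here.

```python
import re

# Subject and predicate are <URI> terms; the object is whatever precedes the final dot
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

PREFIXES = {
    'http://dbpedia.org/resource/': 'dbpedia-resource',
    'http://dbpedia.org/property/': 'property',
}


def shorten(uri: str) -> str:
    """Compact a URI with a known prefix; return it unchanged otherwise."""
    for expansion, prefix in PREFIXES.items():
        if uri.startswith(expansion):
            return f'{prefix}:{uri[len(expansion):]}'
    return uri


def triple_to_edge(line: str) -> tuple:
    """Turn one simple triple line into a (node1, label, node2) tuple."""
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError(f'not a simple triple: {line!r}')
    subject, predicate, obj = match.groups()
    if obj.startswith('<') and obj.endswith('>'):
        obj = shorten(obj[1:-1])  # URI object: compact it like the subject
    return shorten(subject), shorten(predicate), obj


line = '<http://dbpedia.org/resource/!!!>    <http://dbpedia.org/property/alias>    "Chk Chk Chk"@en .'
edge = triple_to_edge(line)
print('\t'.join(edge))
```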

### Step 4 Map node2 to Wikidata nodes
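For edges of the form `<Wikidata node> <property> <DBpedia node>`, replace the value node (`node2`) with its Wikidata counterpart, so that the edge becomes `<Wikidata node> <property> <Wikidata node>`. The three cells in this step look up the mapping through the sitelinks, keep the edges whose value is not a DBpedia resource, and combine the two results.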

In [None]:
%%bash
WIKI_INFO='../data/wikidata_infobox_lang=en.tsv'
SITELINK='../data/sitelinks.tsv'  # Step 1 output; its node2 values use the dbpedia-resource: prefix

kgtk query -i $WIKI_INFO -i $SITELINK \
          --match 'wi: (w)-[p]->(v), s: (q)-[]->(v)' \
          --return 'w, p.label, q' \
          -o ../data/node2_wikidata.tsv

In [None]:
%%bash
WIKI_INFO='../data/wikidata_infobox_lang=en.tsv'

kgtk filter -i $WIKI_INFO \
           --regex --match-type match \
           -p ';;^(?!dbpedia-resource:).*' \
           -o ../data/wiki_infobox_no_dbvalue.tsv

In [None]:
%%bash
kgtk join --left-join --right-join \
          --left-file ../data/wiki_infobox_no_dbvalue.tsv \
          --right-file ../data/node2_wikidata.tsv \
          -o ../data/wiki_infobox_mapped.tsv
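
The combined effect of the three cells in this step can be sketched in plain Python on a few toy edges. This is an illustration under assumptions rather than a replacement for the `kgtk` commands: the rows are made up for the example (including the `Q597` sitelink), `map_node2` is a hypothetical helper, and DBpedia resources without a sitelink are dropped, which is how the inner match in the query is read here.

```python
DBPEDIA_PREFIX = 'dbpedia-resource:'


def map_node2(infobox, sitelinks):
    """Replace DBpedia value nodes with Wikidata nodes and keep all other edges."""
    # Sitelinks point from a Wikidata node (node1) to a DBpedia resource (node2)
    resource_to_qnode = {node2: node1 for node1, _label, node2 in sitelinks}
    kept, mapped = [], []
    for node1, label, node2 in infobox:
        if not node2.startswith(DBPEDIA_PREFIX):
            kept.append((node1, label, node2))  # literal or other value: keep as is
        elif node2 in resource_to_qnode:
            mapped.append((node1, label, resource_to_qnode[node2]))
        # DBpedia resources with no sitelink fall through and are dropped
    return kept + mapped


# Toy rows for illustration only
infobox = [
    ('Q45', 'property:capital', 'dbpedia-resource:Lisbon'),
    ('Q45', 'property:alias', '"some literal"'),
    ('Q45', 'property:other', 'dbpedia-resource:No_Sitelink'),
]
sitelinks = [
    ('Q597', 'wikipedia_sitelink', 'dbpedia-resource:Lisbon'),
]
result = map_node2(infobox, sitelinks)
for edge in result:
    print('\t'.join(edge))
```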

### Step 5 Add structured literals
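Concatenate the structured literals from the original DBpedia infobox with the generated Wikidata infobox. Structured literals are the edges whose `node1` starts with the `node` prefix set by `--newnode-prefix` in Step 3.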

In [None]:
%%bash
INFOBOX="../data/infobox_properties_lang=en.tsv"

kgtk filter -i $INFOBOX \
           --regex --match-type match \
           -p "node;;" \
           -o ../data/structured_literals.tsv

In [None]:
%%bash
kgtk cat -i ../data/structured_literals.tsv ../data/wiki_infobox_mapped.tsv \
         -o ../data/wikidata_infobox.tsv

In [4]:
WIKIINFO="./data/wikidata_infobox.tsv"

!kgtk query -i $WIKIINFO \
    --match '(q)-[]->()' \
    --where 'q = "nodemxZbyK2VRrGoaxfdLmyLxw-5457912"'

node1	label	node2
nodemxZbyK2VRrGoaxfdLmyLxw-5457912	dbpedia:structured_value	"1964-09-27"
nodemxZbyK2VRrGoaxfdLmyLxw-5457912	dbpedia:structured_uri	xml-schema-type:date
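
The two edges in this output describe one typed literal: a value and its datatype, both attached to the same generated node. A minimal sketch of pairing them back up follows. The helper `collect_structured` and the `value` and `datatype` keys are assumptions made for illustration.

```python
from collections import defaultdict

# The two edges from the query output above
edges = [
    ('nodemxZbyK2VRrGoaxfdLmyLxw-5457912', 'dbpedia:structured_value', '"1964-09-27"'),
    ('nodemxZbyK2VRrGoaxfdLmyLxw-5457912', 'dbpedia:structured_uri', 'xml-schema-type:date'),
]


def collect_structured(edges):
    """Group structured_value and structured_uri edges by their generated node."""
    literals = defaultdict(dict)
    for node1, label, node2 in edges:
        if label == 'dbpedia:structured_value':
            literals[node1]['value'] = node2.strip('"')  # drop the surrounding quotes
        elif label == 'dbpedia:structured_uri':
            literals[node1]['datatype'] = node2
    return dict(literals)


structured = collect_structured(edges)
for node, literal in structured.items():
    print(node, literal['value'], literal['datatype'])
```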
