# Getting P580 and P585 Data From Wikipedia Tables
This notebook explores using Wikipedia tables to find missing P580 (start time) and P585 (point in time) qualifiers in Wikidata.

In this notebook we will use a sample of 10,000 from the ntriples files available in https://github.com/bfetahu/wiki_tables_kg/.
The full corpus contains over 1.2 billion ntriples and takes about 4 hours to import to KGTK format on a laptop.

In [1]:
import numpy as np
import pandas as pd
import os
import io
from IPython.display import display, HTML, Image

In [2]:
def wt(name):
    "Construct the name of a file by appending the given name to the prefix defined by the WT environment variable."
    return "{}.{}".format(os.getenv("WT"), name)
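
As a quick illustration of how `wt` builds file names, the sketch below sets `WT` in-process (an assumption made only for this example; the notebook itself uses `%env`) and constructs two of the names used later.

```python
import os

def wt(name):
    "Construct a file name by appending `name` to the prefix in the WT environment variable."
    return "{}.{}".format(os.getenv("WT"), name)

# Assumption for this example only: WT is set in-process instead of with %env.
os.environ["WT"] = "wt.10000"

tsv_name = wt("tsv")
wikipedia_name = wt("wikipedia.tsv")
print(tsv_name, wikipedia_name)
```

If `WT` is unset, `os.getenv` returns `None` and the result starts with `None.`, so it is worth setting the variable before any cell that calls `wt`.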

### Before you start
This notebook assumes that you are running Jupyter from the kgtk/examples folder, using the files in `sample_data/tables`. The output files will be placed in the `results` folder inside it, so you can see the results generated by each KGTK operation. This way, if a cell produces an error, you can continue browsing the notebook.

The TN environment variable controls the location of the files, so if you want to run in a different folder, set TN as shown below.

In [3]:
%env WD18=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20181210
%env WD18temp=/Users/pedroszekely/Downloads
#%env TN=sample_data/tables
%env TN=/Users/pedroszekely/Downloads/tn
%env R=/Users/pedroszekely/Downloads/tn/results
%env WT=wt.10000

env: WD18=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20181210
env: WD18temp=/Users/pedroszekely/Downloads
env: TN=/Users/pedroszekely/Downloads/tn
env: R=/Users/pedroszekely/Downloads/tn/results
env: WT=wt.10000


In [4]:
os.chdir(os.getenv("TN"))

In [5]:
pwd

'/Users/pedroszekely/Downloads/tn'

In [6]:
mkdir results

mkdir: results: File exists


In [7]:
os.chdir("results")

### Convert table ntriples to TSV KGTK format

In [8]:
!head $TN/$WT.nt

<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.tablenet.l3s.uni-hannover.de/TableNet#Table> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#hasTableID> 5020573 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfColumns> 2 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfRows> 1 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#document> <http://en.wikipedia.org/wiki/Metropolis_Gold%23> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://purl.org/dc/terms/source> <http://dbepdia.org/resource/Metropolis_Gold> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablen

**Define prefixes to compress the URIs**

In [9]:
pd.read_csv(os.getenv("TN")+"/table-namespaces.tsv", delimiter='\t')

Unnamed: 0,node1,label,node2
0,tn,prefix_expansion,https://www.tablenet.l3s.uni-hannover.de/Table...
1,tn,prefix_expansion,https://www.tablenet.l3s.uni-hannover.de/Table...
2,rdf,prefix_expansion,http://www.w3.org/1999/02/22-rdf-syntax-ns#
3,tn,prefix_expansion,https://www.tablenet.l3s.uni-hannover.de/Table...
4,wiki,prefix_expansion,http://en.wikipedia.org/wiki/
5,dc,prefix_expansion,http://purl.org/dc/terms/
6,db,prefix_expansion,http://dbepdia.org/resource/
7,tn-json,prefix_expansion,http://www.tablenet.l3s.uni-hannover.de/TableN...
8,tn-col,prefix_expansion,https://www.tablenet.l3s.uni-hannover.de/Table...
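
To make the idea of prefix compression concrete, here is a minimal sketch in plain Python. It is not how `kgtk import-ntriples` is implemented; the `compress` helper and the `NAMESPACES` dictionary are hypothetical, and only the expansions that appear untruncated in the table above are used.

```python
# Hypothetical prefix map; only expansions shown in full above are included.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "wiki": "http://en.wikipedia.org/wiki/",
    "dc": "http://purl.org/dc/terms/",
}

def compress(uri, namespaces=NAMESPACES):
    """Return prefix:local for the longest matching expansion, else the URI unchanged."""
    best = None
    for prefix, expansion in namespaces.items():
        if uri.startswith(expansion):
            if best is None or len(expansion) > len(namespaces[best]):
                best = prefix
    if best is None:
        return uri
    return "{}:{}".format(best, uri[len(namespaces[best]):])

compressed_type = compress("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
compressed_page = compress("http://en.wikipedia.org/wiki/Metropolis_Gold%23")
unchanged = compress("http://example.org/other")
print(compressed_type, compressed_page, unchanged)
```

Choosing the longest matching expansion matters when one namespace is a prefix of another, which is the case for several of the `tn` entries.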


**Import the table ntriples**

In [10]:
!kgtk import-ntriples \
  --namespace-file $TN/table-namespaces.tsv \
  --namespace-id-use-uuid False \
  --newnode-use-uuid False \
  --local-namespace-use-uuid False \
  -i $TN/$WT.nt \
> $WT.tsv

## Set TSV file containing table edges

In [11]:
%env WT=wt.100000000

env: WT=wt.100000000


In [12]:
tables = pd.read_csv(wt("tsv"), delimiter='\t')

**Statistics of our data sample**

In [13]:
tables['label'].value_counts()

rdf:type              19561712
tn:cellValue          14837198
tn:hasCell            14576057
tn:cellType           14576057
tn:partOfColumn       14576057
tn:refersTo            4497431
tn:hasRow              2605591
tn:rowPosition         2605591
tn:columnPosition      2118405
tn:hasColumn           2118405
tn:hasLevel            2118405
dc:subject             2118405
tn:columnName          2118403
tn:numberOfColumns      261658
tn:resourceURL          261658
dc:source               261658
tn:document             261658
tn:hasTableID           261658
tn:numberOfRows         261658
tn:hasCaption             2334
Name: label, dtype: int64

**Get all the Wikipedia links from the Wikipedia tables**

Here we produce an edge file that has cells in node1 and Wikipedia URLs in node2. We expand the compressed `wiki:` values to full URLs to match the format used in Wikidata.

In [14]:
!kgtk filter -p ';tn:cellValue;' -i $WT.tsv | grep '\twiki:' | sed 's/wiki:/http:\/\/en.wikipedia.org\/wiki\//' > temp.tsv
!echo -e "node1\tlabel\tnode2" > $WT.wikipedia.tsv
!cat temp.tsv >> $WT.wikipedia.tsv
!rm temp.tsv
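
For readers who prefer pandas, the same selection and prefix expansion can be sketched as follows. This is an illustrative equivalent on made-up toy edges, not a replacement for the shell pipeline; the `edges` frame stands in for `$WT.tsv`.

```python
import pandas as pd

# Toy edges standing in for $WT.tsv; the values are made up for illustration.
edges = pd.DataFrame(
    [
        ("X:c1_0", "tn:cellValue", "wiki:Allmusic"),
        ("X:c1_1", "tn:cellValue", "2005"),
        ("X:c1_0", "tn:partOfColumn", "tn-col:1_0_Title"),
    ],
    columns=["node1", "label", "node2"],
)

WIKI_PREFIX = "wiki:"
WIKI_URL = "http://en.wikipedia.org/wiki/"

# Keep tn:cellValue edges whose node2 starts with the wiki: prefix.
is_link = (edges["label"] == "tn:cellValue") & edges["node2"].str.startswith(WIKI_PREFIX)
links = edges[is_link].copy()

# Expand only the leading prefix into the full URL.
links["node2"] = WIKI_URL + links["node2"].str[len(WIKI_PREFIX):]
print(links)
```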

In [15]:
# wikipedia = pd.read_csv("{}.wikipedia.tsv".format(os.getenv("WT")), delimiter='\t')
wikipedia = pd.read_csv(wt("wikipedia.tsv"), delimiter='\t')
wikipedia.rename(columns={'node1': 'wikipedia_cell_id', 'node2': 'wikipedia'}, inplace=True)
wikipedia

Unnamed: 0,wikipedia_cell_id,label,wikipedia
0,X:c5020573_0,tn:cellValue,http://en.wikipedia.org/wiki/Allmusic
1,X:c5020574_6,tn:cellValue,http://en.wikipedia.org/wiki/Rockwilder
2,X:c5020574_10,tn:cellValue,http://en.wikipedia.org/wiki/Pete_Rock
3,X:c5020574_18,tn:cellValue,http://en.wikipedia.org/wiki/Fredro_Starr
4,X:c5020574_22,tn:cellValue,http://en.wikipedia.org/wiki/DJ_Clark_Kent
...,...,...,...
4497426,X:c5280560_57,tn:cellValue,http://en.wikipedia.org/wiki/Ekaterina_Kurbatova
4497427,X:c5280560_64,tn:cellValue,http://en.wikipedia.org/wiki/Ksenia_Semenova
4497428,X:c5280560_71,tn:cellValue,http://en.wikipedia.org/wiki/Gabriela_Dr%C4%83goi
4497429,X:c5280561_14,tn:cellValue,http://en.wikipedia.org/wiki/Two-party-preferr...


**Get all the rows that contain cells that have Wikipedia links**

The resulting table has row ids in node1 and cell ids in node2, where the node2 entities are the cells that contain Wikipedia links.

In [16]:
!kgtk ifexists \
    --filter-on $WT.wikipedia.tsv --filter-keys node1 \
    --input-file $WT.tsv --input-keys node2 \
  / filter -p ';tn:hasCell;' \
  > $WT.wikipedia-cells.rows.tsv

In [17]:
wikipedia_cells_rows = pd.read_csv(wt("wikipedia-cells.rows.tsv"), delimiter='\t')
wikipedia_cells_rows

Unnamed: 0,node1,label,node2
0,X:r5020573_0,tn:hasCell,X:c5020573_0
1,X:r5020574_1,tn:hasCell,X:c5020574_6
2,X:r5020574_2,tn:hasCell,X:c5020574_10
3,X:r5020574_4,tn:hasCell,X:c5020574_18
4,X:r5020574_5,tn:hasCell,X:c5020574_22
...,...,...,...
3734752,X:r5280560_8,tn:hasCell,X:c5280560_57
3734753,X:r5280560_9,tn:hasCell,X:c5280560_64
3734754,X:r5280560_10,tn:hasCell,X:c5280560_71
3734755,X:r5280561_14,tn:hasCell,X:c5280561_14
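
The `ifexists` step, as it is used in this notebook, behaves like a semi-join: keep the input edges whose key value occurs in the key column of the filter file. The pandas sketch below approximates that behavior on toy data; the `ifexists` function here is a hypothetical helper, not KGTK's implementation.

```python
import pandas as pd

def ifexists(input_df, input_key, filter_df, filter_key):
    """Keep the rows of input_df whose input_key value occurs in filter_df[filter_key]."""
    wanted = set(filter_df[filter_key])
    return input_df[input_df[input_key].isin(wanted)].reset_index(drop=True)

# Toy data; the ids are made up.
edges = pd.DataFrame(
    [
        ("X:r1_0", "tn:hasCell", "X:c1_0"),
        ("X:r1_0", "tn:hasCell", "X:c1_1"),
        ("X:c1_0", "tn:cellValue", "wiki:Allmusic"),
    ],
    columns=["node1", "label", "node2"],
)
wikipedia = pd.DataFrame(
    [("X:c1_0", "tn:cellValue", "http://en.wikipedia.org/wiki/Allmusic")],
    columns=["node1", "label", "node2"],
)

# Edges pointing at a link cell, then restricted to tn:hasCell edges.
matched = ifexists(edges, "node2", wikipedia, "node1")
rows_with_links = matched[matched["label"] == "tn:hasCell"]
print(rows_with_links)
```

A semi-join never duplicates input rows, which is why the chained `ifexists` steps in this notebook can only shrink the edge file.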


**Get all the columns with "Year" as the heading**

In [18]:
!kgtk filter -i $WT.tsv -p ';tn:columnName;"Year"' > $WT.column-year.tsv
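
The pattern above selects only columns whose heading is exactly "Year". The sketch below shows the same exact match in pandas on toy data, next to a looser variant that ignores case and surrounding whitespace. The looser variant is a hypothetical extension and is not used anywhere in this notebook.

```python
import pandas as pd

# Toy column-name edges; the ids and headings are made up.
columns = pd.DataFrame(
    [
        ("tn-col:1_0_Year", "tn:columnName", "Year"),
        ("tn-col:1_0_Title", "tn:columnName", "Title"),
        ("tn-col:2_0_year", "tn:columnName", "year "),
    ],
    columns=["node1", "label", "node2"],
)

is_name = columns["label"] == "tn:columnName"

# Exact match, mirroring the pattern used in the cell above.
exact = columns[is_name & (columns["node2"] == "Year")]

# Hypothetical looser match: ignore case and surrounding whitespace.
loose = columns[is_name & (columns["node2"].str.strip().str.casefold() == "year")]
print(len(exact), len(loose))
```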

In [19]:
year_columns = pd.read_csv(wt("column-year.tsv"), delimiter='\t')
year_columns.rename(columns={'node1': 'year_column_id', 'node2': 'column_heading'}, inplace=True)
year_columns

Unnamed: 0,year_column_id,label,column_heading
0,tn-col:5020576_0_Year,tn:columnName,Year
1,tn-col:5020576_1_Year,tn:columnName,Year
2,tn-col:5020577_0_Year,tn:columnName,Year
3,tn-col:5020577_1_Year,tn:columnName,Year
4,tn-col:5020578_0_Year,tn:columnName,Year
...,...,...,...
38007,tn-col:5280532_0_Year,tn:columnName,Year
38008,tn-col:5280532_0_Year,tn:columnName,Year
38009,tn-col:5280532_0_Year,tn:columnName,Year
38010,tn-col:5280533_0_Year,tn:columnName,Year


**Get all the cells in the columns that have "Year" as the heading**

The resulting edge file has cell ids in node1 and column ids in node2, where the columns are those that have "Year" as their heading.

In [20]:
!kgtk ifexists \
    --filter-on $WT.column-year.tsv --filter-keys node1 \
    --input-file $WT.tsv --input-keys node2 \
  / filter -p ';tn:partOfColumn;' \
  > $WT.year-cells.tsv

In [21]:
year_cells = pd.read_csv(wt("year-cells.tsv"), delimiter='\t')
year_cells.rename(columns={'node1': 'year_cell_id', 'node2': 'year_column_id'}, inplace=True)
year_cells

Unnamed: 0,year_cell_id,label,year_column_id
0,X:c5020576_0,tn:partOfColumn,tn-col:5020576_1_Year
1,X:c5020577_0,tn:partOfColumn,tn-col:5020577_1_Year
2,X:c5020577_5,tn:partOfColumn,tn-col:5020577_1_Year
3,X:c5020578_0,tn:partOfColumn,tn-col:5020578_0_Year
4,X:c5020578_4,tn:partOfColumn,tn-col:5020578_0_Year
...,...,...,...
411000,X:c5280532_290,tn:partOfColumn,tn-col:5280532_0_Year
411001,X:c5280532_295,tn:partOfColumn,tn-col:5280532_0_Year
411002,X:c5280532_300,tn:partOfColumn,tn-col:5280532_0_Year
411003,X:c5280533_0,tn:partOfColumn,tn-col:5280533_0_Year


**Get all the rows that contain year cells**

Now we want the row ids of the cells that contain years. The row ids are in node1 and the cell ids are in node2.

In [22]:
!kgtk ifexists \
    --filter-on $WT.year-cells.tsv --filter-keys node1 \
    --input-file $WT.tsv --input-keys node2 \
  / filter -p ';tn:hasCell;' \
  > $WT.year-cells.rows.tsv

In [23]:
year_cells_rows = pd.read_csv(wt("year-cells.rows.tsv"), delimiter='\t')
year_cells_rows

Unnamed: 0,node1,label,node2
0,X:r5020576_0,tn:hasCell,X:c5020576_0
1,X:r5020577_0,tn:hasCell,X:c5020577_0
2,X:r5020577_1,tn:hasCell,X:c5020577_5
3,X:r5020578_0,tn:hasCell,X:c5020578_0
4,X:r5020578_1,tn:hasCell,X:c5020578_4
...,...,...,...
411000,X:r5280532_58,tn:hasCell,X:c5280532_290
411001,X:r5280532_59,tn:hasCell,X:c5280532_295
411002,X:r5280532_60,tn:hasCell,X:c5280532_300
411003,X:r5280533_0,tn:hasCell,X:c5280533_0


**Get all the rows that contain both a cell with a year and a cell with a wikipedia link**

Above we computed a file (`$WT.year-cells.rows.tsv`) with the row ids of cells that contain years, and a file (`$WT.wikipedia-cells.rows.tsv`) with the row ids of cells that contain Wikipedia links. Here we compute the intersection. The row ids are in node1, and node2 contains the node ids of the cells that contain Wikipedia links.

In [24]:
!kgtk ifexists \
    --filter-on $WT.year-cells.rows.tsv --filter-keys node1 \
    --input-file $WT.wikipedia-cells.rows.tsv --input-keys node1 \
  > $WT.year.wikipedia.cells.rows.tsv

In [25]:
year__wikipedia_rows = pd.read_csv(wt("year.wikipedia.cells.rows.tsv"), delimiter='\t')
year__wikipedia_rows

Unnamed: 0,node1,label,node2
0,X:r5020578_0,tn:hasCell,X:c5020578_1
1,X:r5020578_1,tn:hasCell,X:c5020578_5
2,X:r5020578_3,tn:hasCell,X:c5020578_13
3,X:r5020578_5,tn:hasCell,X:c5020578_21
4,X:r5020578_9,tn:hasCell,X:c5020578_37
...,...,...,...
706138,X:r5280532_60,tn:hasCell,X:c5280532_300
706139,X:r5280533_0,tn:hasCell,X:c5280533_0
706140,X:r5280533_0,tn:hasCell,X:c5280533_1
706141,X:r5280533_0,tn:hasCell,X:c5280533_2


**Get all the cells that have years and also appear in `year.wikipedia.cells.rows`**

In `$WT.consolidated.year.cells.tsv`, node1 contains the row ids of rows that contain both a cell with a time and a cell with a Wikipedia link; node2 contains the ids of cells that contain times. We may have fewer edges than in `$WT.year.wikipedia.cells.rows.tsv` because in that file node2 has the ids of cells with Wikipedia links and it is possible that in the same row there are multiple cells with Wikipedia links.

In [26]:
!kgtk ifexists \
    --filter-on $WT.year.wikipedia.cells.rows.tsv --filter-keys node1 \
    --input-file $WT.year-cells.rows.tsv --input-keys node1 \
  > $WT.consolidated.year.cells.tsv

In [27]:
consolidated_year_cells = pd.read_csv(wt("consolidated.year.cells.tsv"), delimiter='\t')
consolidated_year_cells.rename(columns={'node1': 'row_id', 'node2': 'year_cell_id'}, inplace=True)
consolidated_year_cells

Unnamed: 0,row_id,label,year_cell_id
0,X:r5020578_0,tn:hasCell,X:c5020578_0
1,X:r5020578_1,tn:hasCell,X:c5020578_4
2,X:r5020578_3,tn:hasCell,X:c5020578_12
3,X:r5020578_5,tn:hasCell,X:c5020578_20
4,X:r5020578_9,tn:hasCell,X:c5020578_36
...,...,...,...
303411,X:r5280532_58,tn:hasCell,X:c5280532_290
303412,X:r5280532_59,tn:hasCell,X:c5280532_295
303413,X:r5280532_60,tn:hasCell,X:c5280532_300
303414,X:r5280533_0,tn:hasCell,X:c5280533_0
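
The difference in edge counts between the two files can be reproduced on a toy example. In the sketch below (made-up ids), row `X:r1` has one year cell and two link cells, so it contributes two edges to the link-side file but only one to the year-side file.

```python
import pandas as pd

# Toy rows: r1 has one year cell and two link cells; r2 has a year cell only.
year_rows = pd.DataFrame(
    [("X:r1", "tn:hasCell", "X:c1_0"), ("X:r2", "tn:hasCell", "X:c2_0")],
    columns=["node1", "label", "node2"],
)
link_rows = pd.DataFrame(
    [("X:r1", "tn:hasCell", "X:c1_1"), ("X:r1", "tn:hasCell", "X:c1_2")],
    columns=["node1", "label", "node2"],
)

# Link-cell edges whose row also has a year cell.
year_wikipedia = link_rows[link_rows["node1"].isin(year_rows["node1"])]

# Year-cell edges whose row also has a link cell.
consolidated_year = year_rows[year_rows["node1"].isin(year_wikipedia["node1"])]
print(len(year_wikipedia), len(consolidated_year))
```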


**Get the year values for the year cells**

In [28]:
!kgtk ifexists \
    --filter-on $WT.consolidated.year.cells.tsv --filter-keys node2 \
    --input-file $WT.tsv --input-keys node1 \
  / filter -p ';tn:cellValue;' \
  > $WT.year-cells.info.tsv

In [29]:
year_cells_info = pd.read_csv(wt("year-cells.info.tsv"), delimiter='\t')
year_cells_info.rename(columns={'node1': 'year_cell_id', 'node2': 'year'}, inplace=True)
year_cells_info

Unnamed: 0,year_cell_id,label,year
0,X:c5020578_0,tn:cellValue,2005
1,X:c5020578_4,tn:cellValue,2006
2,X:c5020578_12,tn:cellValue,2007
3,X:c5020578_20,tn:cellValue,2008
4,X:c5020578_36,tn:cellValue,2010
...,...,...,...
307359,X:c5280532_290,tn:cellValue,wiki:United_Kingdom_general_election%2C_1857
307360,X:c5280532_295,tn:cellValue,wiki:United_Kingdom_general_election%2C_1859
307361,X:c5280532_300,tn:cellValue,wiki:United_Kingdom_general_election%2C_1868
307362,X:c5280533_0,tn:cellValue,wiki:United_Kingdom_general_election%2C_1868


**Get the cells that have wikipedia links and also appear in `year.wikipedia.cells.rows`**

This file should be identical to `$WT.year.wikipedia.cells.rows.tsv`

In [30]:
!kgtk ifexists \
    --filter-on $WT.year.wikipedia.cells.rows.tsv --filter-keys node1 \
    --input-file $WT.wikipedia-cells.rows.tsv --input-keys node1 \
  > $WT.consolidated.wikipedia.cells.tsv

In [31]:
consolidated_wikipedia_cells = pd.read_csv(wt("consolidated.wikipedia.cells.tsv"), delimiter='\t')
consolidated_wikipedia_cells.rename(columns={'node1': 'row_id', 'node2': 'wikipedia_cell_id'}, inplace=True)
consolidated_wikipedia_cells

Unnamed: 0,row_id,label,wikipedia_cell_id
0,X:r5020578_0,tn:hasCell,X:c5020578_1
1,X:r5020578_1,tn:hasCell,X:c5020578_5
2,X:r5020578_3,tn:hasCell,X:c5020578_13
3,X:r5020578_5,tn:hasCell,X:c5020578_21
4,X:r5020578_9,tn:hasCell,X:c5020578_37
...,...,...,...
706138,X:r5280532_60,tn:hasCell,X:c5280532_300
706139,X:r5280533_0,tn:hasCell,X:c5280533_0
706140,X:r5280533_0,tn:hasCell,X:c5280533_1
706141,X:r5280533_0,tn:hasCell,X:c5280533_2


**Get the wikipedia links from the cells in rows that contain both years and Wikipedia links**

The file contains cell ids in node1 and the Wikipedia links in node2.

**Get the columns for the cells that contain wikipedia links**

The edge file contains cell ids in node1 and column ids in node2. It may happen that node1 contains ids for cells that are present in the same row. This can be seen when the suffix of the column id is different, but the table id (the first part of the column id) is the same for both columns.

We don't want this situation, so in the cells below we are going to select the column that contains the most cells with Wikipedia links.

In [32]:
!kgtk ifexists \
    --filter-on $WT.consolidated.wikipedia.cells.tsv --filter-keys node2 \
    --input-file $WT.tsv --input-keys node1 \
  / filter -p ';tn:partOfColumn;' \
  > $WT.consolidated.wikipedia.columns.tsv

In [33]:
consolidated_wikipedia_columns = pd.read_csv(wt("consolidated.wikipedia.columns.tsv"), delimiter='\t')
consolidated_wikipedia_columns.rename(columns={"node1": "wikipedia_cell_id", "node2": "wikipedia_column_id"}, inplace=True)
consolidated_wikipedia_columns

Unnamed: 0,wikipedia_cell_id,label,wikipedia_column_id
0,X:c5020578_1,tn:partOfColumn,tn-col:5020578_0_Title
1,X:c5020578_5,tn:partOfColumn,tn-col:5020578_0_Title
2,X:c5020578_13,tn:partOfColumn,tn-col:5020578_0_Title
3,X:c5020578_21,tn:partOfColumn,tn-col:5020578_0_Title
4,X:c5020578_37,tn:partOfColumn,tn-col:5020578_0_Title
...,...,...,...
706138,X:c5280532_300,tn:partOfColumn,tn-col:5280532_0_Year
706139,X:c5280533_0,tn:partOfColumn,tn-col:5280533_0_Year
706140,X:c5280533_1,tn:partOfColumn,tn-col:5280533_0_Member
706141,X:c5280533_2,tn:partOfColumn,tn-col:5280533_0_Party


In [34]:
consolidated_wikipedia_columns['wikipedia_column_id'].value_counts()

tn-col:5124818_0_Name           682
tn-col:5146594_0_Inductee       562
tn-col:5146594_0_Notable        561
tn-col:5133448_0_Description    505
tn-col:5034055_0_Film           475
                               ... 
tn-col:5220876_1_Title            1
tn-col:5165540_0_Award            1
tn-col:5130879_0_Notes            1
tn-col:5067123_1_14               1
tn-col:5208184_0_9                1
Name: wikipedia_column_id, Length: 93641, dtype: int64
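
One possible way to pick, for each table, the column with the most Wikipedia links is sketched below. It assumes each cell already carries a `table_id`, which in this notebook only becomes available after the merge with the table ids; the toy frame and the tie-breaking rule (alphabetical column id) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the link cells with their column and table ids (made-up values).
cells = pd.DataFrame(
    {
        "wikipedia_cell_id": ["c1", "c2", "c3", "c4"],
        "wikipedia_column_id": [
            "tn-col:1_0_Title", "tn-col:1_0_Title", "tn-col:1_0_Notes", "tn-col:2_0_Name",
        ],
        "table_id": ["tn:1", "tn:1", "tn:1", "tn:2"],
    }
)

# Count link cells per (table, column).
counts = (
    cells.groupby(["table_id", "wikipedia_column_id"])
    .size()
    .reset_index(name="n_links")
)

# Per table, keep the column with the highest count (ties broken by column id).
best_columns = (
    counts.sort_values(
        ["table_id", "n_links", "wikipedia_column_id"], ascending=[True, False, True]
    )
    .drop_duplicates("table_id")
    .reset_index(drop=True)
)

# Restrict the cells to the selected columns.
selected_cells = cells.merge(
    best_columns[["table_id", "wikipedia_column_id"]],
    on=["table_id", "wikipedia_column_id"],
)
print(best_columns)
```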

**Get the table ids for the columns that contain Wikipedia links**

In [35]:
!kgtk ifexists \
    --filter-on $WT.consolidated.wikipedia.columns.tsv --filter-keys node2 \
    --input-file $WT.tsv --input-keys node2 \
  / filter -p ';tn:hasColumn;' \
  > $WT.wikipedia.columns.tables.tsv

In [36]:
wikipedia_column_tables = pd.read_csv(wt("wikipedia.columns.tables.tsv"), delimiter='\t')
wikipedia_column_tables.rename(columns={"node1": "table_id", "node2": "wikipedia_column_id"}, inplace=True)
wikipedia_column_tables

Unnamed: 0,table_id,label,wikipedia_column_id
0,tn:5020578,tn:hasColumn,tn-col:5020578_0_Title
1,tn:5020579,tn:hasColumn,tn-col:5020579_0_Title
2,tn:5020579,tn:hasColumn,tn-col:5020579_0_Notes
3,tn:5020580,tn:hasColumn,tn-col:5020580_0_Name
4,tn:5020580,tn:hasColumn,tn-col:5020580_0_Rationale%5C%5B3%5C%5D%5C%5B4...
...,...,...,...
96006,tn:5280532,tn:hasColumn,tn-col:5280532_0_Second_party
96007,tn:5280533,tn:hasColumn,tn-col:5280533_0_Year
96008,tn:5280533,tn:hasColumn,tn-col:5280533_0_Year
96009,tn:5280533,tn:hasColumn,tn-col:5280533_0_Member


In [37]:
!kgtk unique -i $WT.wikipedia.columns.tables.tsv --column node1 \
  / ifexists --filter-on - --filter-keys node1 --input-file $WT.tsv --input-keys node1 \
  / filter -p ';tn:document;' \
  | sed 's/wiki:/http:\/\/en.wikipedia.org\/wiki\//' \
  > $WT.table.url.tsv

In [38]:
table_urls = pd.read_csv(wt("table.url.tsv"), delimiter='\t')
table_urls.rename(columns={"node1": "table_id", "node2": "wikipedia_table_url"}, inplace=True)
table_urls

Unnamed: 0,table_id,label,wikipedia_table_url
0,tn:5020578,tn:document,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
1,tn:5020579,tn:document,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
2,tn:5020580,tn:document,http://en.wikipedia.org/wiki/List_of_alumni_of...
3,tn:5020581,tn:document,http://en.wikipedia.org/wiki/List_of_alumni_of...
4,tn:5020587,tn:document,http://en.wikipedia.org/wiki/2013%E2%80%9314_M...
...,...,...,...
27575,tn:5280514,tn:document,http://en.wikipedia.org/wiki/Chok_Sukkaew%23Clubs
27576,tn:5280530,tn:document,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...
27577,tn:5280531,tn:document,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...
27578,tn:5280532,tn:document,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...


In [51]:
!shuf -n 100 $WT.table.url.tsv

tn:5267130	tn:document	http://en.wikipedia.org/wiki/Cowboys%E2%80%93Redskins_rivalry%231970s_%28Cowboys_12%E2%80%939%29
tn:5228380	tn:document	http://en.wikipedia.org/wiki/Finland_national_handball_team%23World_Championships
tn:5189320	tn:document	http://en.wikipedia.org/wiki/Coastal_Carolina_Chanticleers_men%27s_basketball%23NIT_results
tn:5116192	tn:document	http://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_Britney_Spears%23Myx_Music_Award
tn:5040714	tn:document	http://en.wikipedia.org/wiki/John_Jayd_Daniels%23John_Jayd_Daniels_discography_%5B2%5D%5B3%5D%5B4%5D
tn:5169179	tn:document	http://en.wikipedia.org/wiki/Buffy_Dee%23Filmography
tn:5174731	tn:document	http://en.wikipedia.org/wiki/Suchmos%23Discography
tn:5192565	tn:document	http://en.wikipedia.org/wiki/Carlos_Pace%23Complete_24_Hours_of_Le_Mans_results
tn:5038867	tn:document	http://en.wikipedia.org/wiki/Dr._Hook_%26_the_Medicine_Show%23Singles
tn:5228798	tn:document	http://en.wikipedia.org/wiki/Gibraltar_

In [40]:
year_and_wikipedia_table = pd.merge(left=consolidated_wikipedia_columns, right=wikipedia_column_tables, left_on='wikipedia_column_id', right_on='wikipedia_column_id')
year_and_wikipedia_table=year_and_wikipedia_table.drop(columns=['label_x', 'label_y'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578
...,...,...,...
715471,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533
715472,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533
715473,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533
715474,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533
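
The merged frame has more rows than `consolidated_wikipedia_columns` because `wikipedia_column_tables` lists some columns more than once (for example `tn-col:5280533_0_Year` above). The toy sketch below shows the effect and one way to avoid it; dropping duplicates before merging is a suggestion, not something the notebook does.

```python
import pandas as pd

cells = pd.DataFrame(
    {
        "wikipedia_cell_id": ["c0", "c1"],
        "wikipedia_column_id": ["tn-col:9_0_Year", "tn-col:9_0_Member"],
    }
)
# The Year column is listed twice, mimicking the duplicate rows seen above.
column_tables = pd.DataFrame(
    {
        "table_id": ["tn:9", "tn:9", "tn:9"],
        "wikipedia_column_id": ["tn-col:9_0_Year", "tn-col:9_0_Year", "tn-col:9_0_Member"],
    }
)

# Each duplicate key on the right side repeats the matching left row.
naive = pd.merge(cells, column_tables, on="wikipedia_column_id")

# Removing exact duplicate rows first keeps one output row per cell.
deduped = pd.merge(cells, column_tables.drop_duplicates(), on="wikipedia_column_id")
print(len(naive), len(deduped))
```

Passing `validate="many_to_one"` to `pd.merge` is another option: it raises an error when the right-hand keys are not unique, which makes this kind of duplication visible early.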


In [41]:
year_and_wikipedia_table = pd.merge(left=year_and_wikipedia_table, right=wikipedia, left_on='wikipedia_cell_id', right_on='wikipedia_cell_id')
year_and_wikipedia_table = year_and_wikipedia_table.drop(columns=['label'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id,wikipedia
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/American_Pie_Pres...
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/The_Fast_and_the_...
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Bratz%3A_The_Movie
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Foreign_Exchange_...
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Crazy_on_the_Outside
...,...,...,...,...
791491,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...
791492,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...
791493,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...
791494,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533,http://en.wikipedia.org/wiki/Charles_Wentworth...


In [42]:
year_and_wikipedia_table = pd.merge(left=year_and_wikipedia_table, right=consolidated_wikipedia_cells, left_on='wikipedia_cell_id', right_on='wikipedia_cell_id')
year_and_wikipedia_table = year_and_wikipedia_table.drop(columns=['label'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id,wikipedia,row_id
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/American_Pie_Pres...,X:r5020578_0
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/The_Fast_and_the_...,X:r5020578_1
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Bratz%3A_The_Movie,X:r5020578_3
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Foreign_Exchange_...,X:r5020578_5
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Crazy_on_the_Outside,X:r5020578_9
...,...,...,...,...,...
791491,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_0
791492,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_1
791493,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_1
791494,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533,http://en.wikipedia.org/wiki/Charles_Wentworth...,X:r5280533_0


In [43]:
year_and_wikipedia_table = pd.merge(left=year_and_wikipedia_table, right=consolidated_year_cells, left_on='row_id', right_on='row_id')
year_and_wikipedia_table = year_and_wikipedia_table.drop(columns=['label'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id,wikipedia,row_id,year_cell_id
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/American_Pie_Pres...,X:r5020578_0,X:c5020578_0
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/The_Fast_and_the_...,X:r5020578_1,X:c5020578_4
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Bratz%3A_The_Movie,X:r5020578_3,X:c5020578_12
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Foreign_Exchange_...,X:r5020578_5,X:c5020578_20
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Crazy_on_the_Outside,X:r5020578_9,X:c5020578_36
...,...,...,...,...,...,...
791491,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_0,X:c5280533_0
791492,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533,http://en.wikipedia.org/wiki/Charles_Wentworth...,X:r5280533_0,X:c5280533_0
791493,X:c5280533_2,tn-col:5280533_0_Party,tn:5280533,http://en.wikipedia.org/wiki/Liberal_Party_%28...,X:r5280533_0,X:c5280533_0
791494,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_1,X:c5280533_3


In [44]:
year_and_wikipedia_table = pd.merge(left=year_and_wikipedia_table, right=year_cells_info, left_on='year_cell_id', right_on='year_cell_id')
year_and_wikipedia_table = year_and_wikipedia_table.drop(columns=['label'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id,wikipedia,row_id,year_cell_id,year
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/American_Pie_Pres...,X:r5020578_0,X:c5020578_0,2005
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/The_Fast_and_the_...,X:r5020578_1,X:c5020578_4,2006
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Bratz%3A_The_Movie,X:r5020578_3,X:c5020578_12,2007
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Foreign_Exchange_...,X:r5020578_5,X:c5020578_20,2008
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Crazy_on_the_Outside,X:r5020578_9,X:c5020578_36,2010
...,...,...,...,...,...,...,...
814489,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868
814490,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533,http://en.wikipedia.org/wiki/Charles_Wentworth...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868
814491,X:c5280533_2,tn-col:5280533_0_Party,tn:5280533,http://en.wikipedia.org/wiki/Liberal_Party_%28...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868
814492,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_1,X:c5280533_3,wiki:United_Kingdom_general_election%2C_1885
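
As the last rows show, some `year` values are Wikipedia links such as `wiki:United_Kingdom_general_election%2C_1868` rather than plain years. A heuristic sketch for recovering a year from either form follows. The `extract_year` helper is hypothetical, and a four-digit number in a page title is not guaranteed to be the year the row refers to.

```python
import re
from urllib.parse import unquote

# A run of exactly four digits, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def extract_year(value):
    """Return the first standalone 4-digit number in value as an int, or None."""
    # Decode percent-escapes first so that sequences like %2C are not misread.
    text = unquote(str(value))
    match = YEAR_RE.search(text)
    return int(match.group(1)) if match else None

plain = extract_year("2005")
linked = extract_year("wiki:United_Kingdom_general_election%2C_1868")
missing = extract_year("wiki:Allmusic")
print(plain, linked, missing)
```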


In [45]:
year_and_wikipedia_table = pd.merge(left=year_and_wikipedia_table, right=table_urls, left_on='table_id', right_on='table_id')
year_and_wikipedia_table = year_and_wikipedia_table.drop(columns=['label'])
year_and_wikipedia_table

Unnamed: 0,wikipedia_cell_id,wikipedia_column_id,table_id,wikipedia,row_id,year_cell_id,year,wikipedia_table_url
0,X:c5020578_1,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/American_Pie_Pres...,X:r5020578_0,X:c5020578_0,2005,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
1,X:c5020578_5,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/The_Fast_and_the_...,X:r5020578_1,X:c5020578_4,2006,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
2,X:c5020578_13,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Bratz%3A_The_Movie,X:r5020578_3,X:c5020578_12,2007,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
3,X:c5020578_21,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Foreign_Exchange_...,X:r5020578_5,X:c5020578_20,2008,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
4,X:c5020578_37,tn-col:5020578_0_Title,tn:5020578,http://en.wikipedia.org/wiki/Crazy_on_the_Outside,X:r5020578_9,X:c5020578_36,2010,http://en.wikipedia.org/wiki/Daniel_Booko%23Fi...
...,...,...,...,...,...,...,...,...
814489,X:c5280533_0,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...
814490,X:c5280533_1,tn-col:5280533_0_Member,tn:5280533,http://en.wikipedia.org/wiki/Charles_Wentworth...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...
814491,X:c5280533_2,tn-col:5280533_0_Party,tn:5280533,http://en.wikipedia.org/wiki/Liberal_Party_%28...,X:r5280533_0,X:c5280533_0,wiki:United_Kingdom_general_election%2C_1868,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...
814492,X:c5280533_3,tn-col:5280533_0_Year,tn:5280533,http://en.wikipedia.org/wiki/United_Kingdom_ge...,X:r5280533_1,X:c5280533_3,wiki:United_Kingdom_general_election%2C_1885,http://en.wikipedia.org/wiki/Malton_%28UK_Parl...


In [46]:
year_and_wikipedia_table.to_csv(wt("year_and_wikipedia_table.tsv"), sep="\t")

**Get the Wikipedia site links from the Wikidata dump**

In [47]:
#!exa "$WD18"/wikidata-20181210-all-edges.tsv.bz2
#
#!bzip2 -dc "$WD18"/wikidata-20181210-all-edges.tsv.bz2|grep wikipedia | head
#
# COMMENT OUT BECAUSE TAKES A LONG TIME TO RUN
#
#!kgtk filter -i "$WD18"/wikidata-20181210-all-edges.tsv.bz2 -p ';wikipedia_sitelink;' | gzip > "$WD18temp"/wikidata-20181210.sitelinks.tsv.gz

In [48]:
!exa "$WD18temp"/wikidata-20181210.sitelinks.tsv.gz

/Users/pedroszekely/Downloads/wikidata-20181210.sitelinks.tsv.gz


!kgtk ifexists \
    --filter-on results/wt.10000.wikipedia.tsv --filter-keys node2 \
    --input-file "$WD18temp"/wikidata-20181210.sitelinks.tsv.gz --input-keys node2 \
  > results/wikidata-qnodes.tsv

In [49]:
###### !kgtk filter -i "$TN"/wiki_tables_all.tsv.gz -p ';tn:columnName;' > "$TN"/wt.column-names.tsv
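
The `ifexists` join against the sitelinks file above matches URLs as exact strings. If the sitelink values differ from the table URLs in scheme or percent-encoding (not verified here), no rows will match. The sketch below shows one hypothetical way to derive a comparison key from a URL; `normalize_wikipedia_url` is an illustrative helper, not part of KGTK.

```python
from urllib.parse import unquote, urlsplit

def normalize_wikipedia_url(url):
    """Return (host, title): lowercased host and decoded page title, ignoring scheme and section."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    title = path[len("/wiki/"):] if path.startswith("/wiki/") else path
    # Table document URLs encode the section marker as %23; drop the section part.
    title = title.split("#", 1)[0]
    return parts.netloc.lower(), title

encoded = normalize_wikipedia_url("http://en.wikipedia.org/wiki/Gabriela_Dr%C4%83goi")
decoded = normalize_wikipedia_url("https://en.wikipedia.org/wiki/Gabriela_Drăgoi")
section = normalize_wikipedia_url("http://en.wikipedia.org/wiki/Buffy_Dee%23Filmography")
print(encoded, decoded, section)
```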

**SPARQL Query**

with pd.option_context('display.max_rows', None, 'display.width', 1000):  # more options can be specified also
    print(tables[0:100])

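# Items for the listed Wikipedia pages that have a P166 (award received) statement with value wd:Q746756, with the year of its P585 qualifier when present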
SELECT ?item ?itemLabel (YEAR(?time) AS ?year) WHERE {
  VALUES ?wikipedia {
    <https://en.wikipedia.org/wiki/John_Herschel>
    <https://en.wikipedia.org/wiki/James_Joseph_Sylvester>
    <https://en.wikipedia.org/wiki/John_Newport_Langley>
    <https://en.wikipedia.org/wiki/Charles_Pritchard>
    <https://en.wikipedia.org/wiki/Arthur_Schuster>
    <https://en.wikipedia.org/wiki/Percy_MacMahon>
    <https://en.wikipedia.org/wiki/William_Burnside>
    <https://en.wikipedia.org/wiki/Augustus_Love>
    <https://en.wikipedia.org/wiki/William_Mitchinson_Hicks>
    <https://en.wikipedia.org/wiki/Grafton_Elliot_Smith>
    <https://en.wikipedia.org/wiki/William_Johnson_Sollas>
    <https://en.wikipedia.org/wiki/Joseph_Larmor>
    <https://en.wikipedia.org/wiki/William_Rivers>
    <https://en.wikipedia.org/wiki/William_Bateson>
    <https://en.wikipedia.org/wiki/Frederick_Blackman>
    <https://en.wikipedia.org/wiki/Albert_Seward>
    <https://en.wikipedia.org/wiki/John_Edward_Marr>
    <https://en.wikipedia.org/wiki/Patrick_Laidlaw>
    <https://en.wikipedia.org/wiki/Alfred_Harker>
    <https://en.wikipedia.org/wiki/Paul_Dirac>
    <https://en.wikipedia.org/wiki/William_Whiteman_Carlton_Topley>
    <https://en.wikipedia.org/wiki/Harold_Jeffreys>
    <https://en.wikipedia.org/wiki/Edward_Victor_Appleton>
    <https://en.wikipedia.org/wiki/Frederic_Bartlett>
    <https://en.wikipedia.org/wiki/Nevill_Mott>
    <https://en.wikipedia.org/wiki/John_Cockcroft>
    <https://en.wikipedia.org/wiki/W._V._D._Hodge>
    <https://en.wikipedia.org/wiki/Rudolf_Peierls>
    <https://en.wikipedia.org/wiki/Raymond_Lyttleton>
    <https://en.wikipedia.org/wiki/Frank_Yates>
    <https://en.wikipedia.org/wiki/Joseph_Hutchinson>
    <https://en.wikipedia.org/wiki/Charles_Oatley>
    <https://en.wikipedia.org/wiki/Frederick_Sanger>
    <https://en.wikipedia.org/wiki/Fred_Hoyle>
    <https://en.wikipedia.org/wiki/Abdus_Salam>
    <https://en.wikipedia.org/wiki/Roger_Penrose>
    <https://en.wikipedia.org/wiki/Eric_James_Denton>
    <https://en.wikipedia.org/wiki/Robert_Hinde>
    <https://en.wikipedia.org/wiki/Chris_Dobson>
  }
  ?wikipedia schema:about ?item.
  ?item p:P166 ?statement.
  ?statement ps:P166 wd:Q746756.
  OPTIONAL { ?statement pq:P585 ?time. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
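
A query of this shape can be assembled from a list of Wikipedia URLs instead of being typed by hand. The sketch below only builds the query string and does not send it anywhere. `build_query` and `to_https` are hypothetical helpers; the scheme rewrite is there because the table links use `http://` while the query above uses `https://`.

```python
def to_https(url):
    """Rewrite a leading http:// to https://, matching the URLs in the query above."""
    return "https://" + url[len("http://"):] if url.startswith("http://") else url

def build_query(wikipedia_urls, property_id, value_qnode):
    """Build the SPARQL query text for the given pages, property and value."""
    values = "\n".join("    <{}>".format(to_https(url)) for url in wikipedia_urls)
    return (
        "SELECT ?item ?itemLabel (YEAR(?time) AS ?year) WHERE {{\n"
        "  VALUES ?wikipedia {{\n{values}\n  }}\n"
        "  ?wikipedia schema:about ?item.\n"
        "  ?item p:{prop} ?statement.\n"
        "  ?statement ps:{prop} wd:{qnode}.\n"
        "  OPTIONAL {{ ?statement pq:P585 ?time. }}\n"
        '  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}\n'
        "}}"
    ).format(values=values, prop=property_id, qnode=value_qnode)

query = build_query(
    ["http://en.wikipedia.org/wiki/Paul_Dirac", "https://en.wikipedia.org/wiki/Fred_Hoyle"],
    "P166",
    "Q746756",
)
print(query)
```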

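# Print each row of sample_filtered (not defined in the cells shown here) as a Graphviz DOT edge.
for _, row in sample_filtered.iterrows():
    print('"{}" -> "{}" [ label = "{}" ];'.format(row.iloc[0], row.iloc[1], row.iloc[2]))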