### Chancellor occupations
This notebook illustrates three basic feature extraction techniques (binary, relative count, and TF-IDF), applied to the occupations of German chancellors, as presented in:<br>
[1] P. Ristoski and H. Paulheim, “A comparison of propositionalization strategies for creating features from linked open data,” in Proceedings of the 1st International Conference on Linked Data for Knowledge Discovery - Volume 1232, Aachen, DEU, Sep. 2014, pp. 1–11.

Query results can be replicated [here](https://query.wikidata.org/#SELECT%20%3FchancellorLabel%20%3FinaugurationTime%20%3FoccupationLabel%20WHERE%20%7B%0A%20%20BIND%28wd%3AQ183%20AS%20%3Fgermany%29%0A%20%20%0A%20%20%3Fgermany%20p%3AP6%20%5Bps%3AP6%20%3Fchancellor%3B%20pq%3AP580%20%3FinaugurationTime%5D.%0A%20%20%3Fchancellor%20wdt%3AP106%20%3Foccupation.%0A%20%20%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D%20ORDER%20BY%20%3FinaugurationTime) (Wikidata).
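
For readers who prefer not to decode the link by hand, the query it encodes is reproduced below as a Python string. This is only a sketch: the names `SPARQL_QUERY`, `WDQS_UI`, and `query_url` are illustrative and not part of the notebook, and blank lines in the decoded query have been normalized.

```python
from urllib.parse import quote, unquote

# Decoded from the link above (whitespace on blank lines normalized).
SPARQL_QUERY = """SELECT ?chancellorLabel ?inaugurationTime ?occupationLabel WHERE {
  BIND(wd:Q183 AS ?germany)

  ?germany p:P6 [ps:P6 ?chancellor; pq:P580 ?inaugurationTime].
  ?chancellor wdt:P106 ?occupation.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ?inaugurationTime"""

# Query editor of the Wikidata Query Service; the query travels in the URL fragment.
WDQS_UI = "https://query.wikidata.org/#"


def query_url(query: str) -> str:
    """Percent-encode a SPARQL query into a link to the query editor."""
    return WDQS_UI + quote(query, safe="")


url = query_url(SPARQL_QUERY)
# Decoding the fragment gives back the original query text.
assert unquote(url.split("#", 1)[1]) == SPARQL_QUERY
```

In the query, `Q183` is Germany, `P6` is "head of government", `P580` is "start time", and `P106` is "occupation", so each result row pairs one chancellor's inauguration date with one of their occupations.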

In [223]:
import pandas as pd
import math

df = pd.read_csv("chancellor_occupations.tsv", sep="\t")

In [224]:
df.head(3)

,chancellorLabel,inaugurationTime,occupationLabel
0,Konrad Adenauer,1949-09-20T00:00:00Z,judge
1,Konrad Adenauer,1949-09-20T00:00:00Z,lawyer
2,Konrad Adenauer,1949-09-20T00:00:00Z,politician


In [225]:
# One chancellor label per inauguration date (groupby sorts by inaugurationTime)
chancellorIndex = df.groupby("inaugurationTime").first().chancellorLabel

In [226]:
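# One row per inauguration date, one column per occupation; missing combinations become 0
df = df.pivot(index="inaugurationTime", columns="occupationLabel", values="occupationLabel").fillna(0)
# Replace the remaining occupation strings with 1 to obtain a binary indicator matrix
df[df != 0] = 1
# Label rows by chancellor; positions match because both are ordered by inaugurationTime
df = df.set_index(chancellorIndex)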

In [227]:
# Binary feature extraction
df.transpose().to_csv("binary_extraction.csv")
df

chancellorLabel,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Konrad Adenauer,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0
Ludwig Erhard,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0
Kurt Georg Kiesinger,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0
Willy Brandt,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0
Walter Scheel,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
Helmut Schmidt,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1
Helmut Kohl,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0
Gerhard Schröder,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,0
Angela Merkel,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
Olaf Scholz,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
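
As a cross-check, a binary matrix of the same kind can be built with `pd.crosstab`. The sketch below is not the notebook's method. It uses a hand-copied subset of the data (Adenauer's rows from `df.head(3)` and Merkel's occupations from the table above), and the names `rows` and `binary` are illustrative. It also assumes that grouping by `chancellorLabel` is acceptable: unlike the pivot on `inaugurationTime`, this would merge several terms of the same person into one row, and it orders rows alphabetically rather than chronologically.

```python
import pandas as pd

# Hand-copied subset of (chancellor, occupation) pairs; not the full TSV.
rows = pd.DataFrame(
    {
        "chancellorLabel": [
            "Konrad Adenauer", "Konrad Adenauer", "Konrad Adenauer",
            "Angela Merkel", "Angela Merkel",
        ],
        "occupationLabel": ["judge", "lawyer", "politician", "physicist", "politician"],
    }
)

# crosstab counts co-occurrences; clip(upper=1) turns counts into 0/1 indicators
# in case a pair appears more than once.
binary = pd.crosstab(rows["chancellorLabel"], rows["occupationLabel"]).clip(upper=1)
print(binary)
```

Because `crosstab` returns integer counts directly, no `fillna` or string replacement step is needed in this variant.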


In [228]:
# Relative Count feature extraction:
# divide each row by its number of occupations so that every row sums to 1
df = df.astype(float)
df = df.div(df.sum(axis=1), axis=0)

df.round(3).transpose().to_csv("relative_count_extraction.csv")
df.round(3)

chancellorLabel,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Konrad Adenauer,0.167,0.167,0.0,0.0,0.0,0.0,0.0,0.167,0.167,0.0,0.0,0.0,0.0,0.0,0.167,0.167,0.0,0.0
Ludwig Erhard,0.0,0.0,0.0,0.0,0.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333,0.0,0.333,0.0
Kurt Georg Kiesinger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333,0.333,0.0,0.0,0.0,0.0,0.0,0.333,0.0,0.0,0.0
Willy Brandt,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0
Walter Scheel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.5,0.0,0.0,0.0
Helmut Schmidt,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.2
Helmut Kohl,0.0,0.0,0.0,0.0,0.0,0.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333,0.333,0.0,0.0,0.0
Gerhard Schröder,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.0
Angela Merkel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0
Olaf Scholz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
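
To make the normalization concrete: each row is divided by its own sum, so Adenauer's six occupations each receive 1/6 ≈ 0.167 and Scheel's two each receive 0.5. The following is a minimal sketch on two rows copied from the binary table, restricted to the columns those rows use; `binary` and `relative` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Two rows taken from the binary table; only the occupations they use are kept.
binary = pd.DataFrame(
    {
        "assessor": [1, 0],
        "autobiographer": [1, 0],
        "judge": [1, 0],
        "lawyer": [1, 0],
        "military personnel": [0, 1],
        "politician": [1, 1],
        "resistance fighter": [1, 0],
    },
    index=["Konrad Adenauer", "Walter Scheel"],
)

# Row-wise normalization: each non-zero entry becomes 1/n,
# where n is the number of occupations in that row.
relative = binary.div(binary.sum(axis=1), axis=0)
print(relative.round(3))
```

Passing `axis=0` to `div` aligns the row sums with the row index, which is what makes the division happen per chancellor rather than per occupation.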


TF-IDF, according to [1]:
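$$\frac{1}{n} \cdot \log\left(\frac{N}{|\{r \mid C(r)\}|}\right)$$

$1/n$ is the relative count as displayed above, where $n$ is the number of occupations of the chancellor.<br>
$N$ is the total number of resources (chancellors).<br>
$|\{r \mid C(r)\}|$ is the number of resources that share the relation (occupation) $r$.<br>
In this notebook, $\log$ is the natural logarithm (`math.log`).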
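
A few entries can be checked by hand from the tables above, where $N = 10$: only one chancellor is a physicist, four are lawyers, and all ten are politicians. The sketch below assumes the natural logarithm and uses a hypothetical helper `tfidf` that is not part of the notebook.

```python
import math

N = 10  # number of rows (chancellor terms) in the tables above


def tfidf(n_occupations: int, n_sharing: int, total: int = N) -> float:
    """(1/n) * ln(N / |{r | C(r)}|) for a single non-zero cell."""
    return (1 / n_occupations) * math.log(total / n_sharing)


merkel_physicist = tfidf(2, 1)    # Merkel has 2 occupations; 1 chancellor is a physicist
scholz_lawyer = tfidf(2, 4)       # Scholz has 2 occupations; 4 chancellors are lawyers
merkel_politician = tfidf(2, 10)  # all 10 are politicians, so ln(1) = 0

print(round(merkel_physicist, 3), round(scholz_lawyer, 3), merkel_politician)
# → 1.151 0.458 0.0
```

The last value shows why the `politician` column carries no information under TF-IDF: an occupation shared by every resource is weighted by $\log(1) = 0$.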

In [229]:
# Total number of resources: one row per chancellor
N = len(df)

In [230]:
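for occupation in df:
    # Number of chancellors that have this occupation: |{r | C(r)}|
    multiplicity = df[occupation].gt(0).sum()
    # Scale the relative counts by the log of the inverse frequency (natural log)
    df[occupation] = df[occupation] * math.log(N / multiplicity)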

In [231]:
# TF-IDF feature extraction
df.round(3).transpose().to_csv("tfidf_extraction.csv")
df.round(3)

chancellorLabel,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Konrad Adenauer,0.384,0.268,0.0,0.0,0.0,0.0,0.0,0.268,0.153,0.0,0.0,0.0,0.0,0.0,0.0,0.384,0.0,0.0
Ludwig Erhard,0.0,0.0,0.0,0.0,0.536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768,0.0
Kurt Georg Kiesinger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.536,0.305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Willy Brandt,0.0,0.402,0.0,0.0,0.0,0.0,0.576,0.0,0.0,0.0,0.0,0.301,0.0,0.0,0.0,0.0,0.0,0.0
Walter Scheel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.151,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Helmut Schmidt,0.0,0.0,0.461,0.0,0.322,0.0,0.0,0.0,0.0,0.0,0.0,0.241,0.0,0.0,0.0,0.0,0.0,0.461
Helmut Kohl,0.0,0.0,0.0,0.0,0.0,0.768,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768,0.0,0.0,0.0,0.0
Gerhard Schröder,0.0,0.0,0.0,0.461,0.0,0.0,0.0,0.0,0.183,0.461,0.0,0.241,0.0,0.0,0.0,0.0,0.0,0.0
Angela Merkel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.151,0.0,0.0,0.0,0.0,0.0
Olaf Scholz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.458,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
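
The relative count and TF-IDF steps can also be written without explicit loops. The following is a sketch under the same definitions, not the notebook's code: `tfidf_features` is a hypothetical helper, and the input is a three-row excerpt of the binary table, so with $N = 3$ the resulting values differ from the full table above. It assumes every column has at least one non-zero entry, as is the case after the pivot.

```python
import numpy as np
import pandas as pd


def tfidf_features(binary: pd.DataFrame) -> pd.DataFrame:
    """Relative count times ln(N / number of rows sharing the column)."""
    binary = binary.astype(float)
    relative = binary.div(binary.sum(axis=1), axis=0)  # 1/n per non-zero cell
    doc_freq = binary.gt(0).sum(axis=0)                # |{r | C(r)}| per occupation
    idf = np.log(len(binary) / doc_freq)               # natural log, like math.log
    return relative.mul(idf, axis=1)                   # scale each column by its weight


# Excerpt of the binary table (rows and values copied from it).
binary = pd.DataFrame(
    {
        "judge": [1, 0, 0],
        "lawyer": [1, 0, 1],
        "physicist": [0, 1, 0],
        "politician": [1, 1, 1],
    },
    index=["Kurt Georg Kiesinger", "Angela Merkel", "Olaf Scholz"],
)

features = tfidf_features(binary)
print(features.round(3))
```

Computing the weights once per column and broadcasting them with `mul(..., axis=1)` gives the same result as the per-column loop while keeping the three quantities (relative count, frequency, log weight) visible as separate steps.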
