### Chancellor occupations
This notebook illustrates basic feature extraction techniques, as presented by:<br>
[1] P. Ristoski and H. Paulheim, “A comparison of propositionalization strategies for creating features from linked open data,” in Proceedings of the 1st International Conference on Linked Data for Knowledge Discovery - Volume 1232, Aachen, DEU, Sep. 2014, pp. 1–11.

Query results can be replicated [here](https://query.wikidata.org/#SELECT%20%3Fchancellor_name%20%3Foccupation_label%20WHERE%20%7B%0A%20%20BIND%28wd%3AQ183%20AS%20%3Fgermany%29%0A%20%20%3Fgermany%20%28p%3AP6%2Fps%3AP6%29%20%3Fchancellor.%0A%20%20%3Fchancellor%20rdfs%3Alabel%20%3Fchancellor_name.%0A%20%20%3Fchancellor%20%28p%3AP106%2Fps%3AP106%29%20%3Foccupation.%0A%20%20%3Foccupation%20rdfs%3Alabel%20%3Foccupation_label.%0A%20%20FILTER%20%28LANG%28%3Foccupation_label%29%20%3D%20%22en%22%20%26%26%20LANG%28%3Fchancellor_name%29%20%3D%20%22en%22%29.%0A%7D%20ORDER%20BY%20%3Fchancellor_name) (Wikidata).
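
For reference, the link above percent-encodes a SPARQL query in its URL fragment. The sketch below shows the decoded query and one way to rebuild such a link; the helper name `query_url` is an assumption made for illustration and is not part of the notebook.

```python
from urllib.parse import quote, unquote

# SPARQL query decoded from the fragment of the link above
QUERY = """SELECT ?chancellor_name ?occupation_label WHERE {
  BIND(wd:Q183 AS ?germany)
  ?germany (p:P6/ps:P6) ?chancellor.
  ?chancellor rdfs:label ?chancellor_name.
  ?chancellor (p:P106/ps:P106) ?occupation.
  ?occupation rdfs:label ?occupation_label.
  FILTER (LANG(?occupation_label) = "en" && LANG(?chancellor_name) = "en").
} ORDER BY ?chancellor_name"""


def query_url(query: str) -> str:
    """Hypothetical helper: put a percent-encoded query into the URL fragment."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


url = query_url(QUERY)
# Decoding the fragment gives back the original query text
print(unquote(url.split("#", 1)[1]) == QUERY)
```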

In [10]:
import pandas as pd
import math

df = pd.read_csv("chancellor_occupations.tsv", sep="\t")

In [11]:
df.head(3)

Unnamed: 0,chancellor_name,occupation_label
0,Angela Merkel,politician
1,Angela Merkel,physicist
2,Gerhard Schröder,lawyer


In [12]:
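# One row per chancellor, one column per occupation; missing pairs become 0
df = df.pivot(index="chancellor_name", columns="occupation_label", values="occupation_label").fillna(0)
# Every remaining entry marks an occupation the chancellor holds
df[df != 0] = 1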

In [13]:
# Binary feature extraction
df

chancellor_name,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Angela Merkel,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
Gerhard Schröder,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,0
Helmut Kohl,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0
Helmut Schmidt,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1
Konrad Adenauer,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0
Kurt Georg Kiesinger,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0
Ludwig Erhard,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0
Olaf Scholz,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
Walter Scheel,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
Willy Brandt,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0
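
As an aside, the same kind of 0/1 matrix can be built with `pd.crosstab`. This is a minimal sketch of an alternative, not the notebook's method; the three-row `sample` frame is an assumption that mirrors the `df.head(3)` output above.

```python
import pandas as pd

# Assumed sample: the three rows shown by df.head(3) above
sample = pd.DataFrame({
    "chancellor_name": ["Angela Merkel", "Angela Merkel", "Gerhard Schröder"],
    "occupation_label": ["politician", "physicist", "lawyer"],
})

# crosstab counts each (chancellor, occupation) pair; clip caps the counts at 1
binary = pd.crosstab(sample["chancellor_name"], sample["occupation_label"]).clip(upper=1)
print(binary)
```

Unlike the `pivot` approach, this yields an integer matrix directly and tolerates duplicate (chancellor, occupation) rows.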


In [14]:
for index, row in df.iterrows():
    df.loc[index] = row.div(row.sum())

# Relative count feature extraction
df.round(decimals=3)

chancellor_name,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Angela Merkel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0
Gerhard Schröder,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.0
Helmut Kohl,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0
Helmut Schmidt,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.2
Konrad Adenauer,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0
Kurt Georg Kiesinger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0
Ludwig Erhard,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.333333,0.0
Olaf Scholz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
Walter Scheel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.5,0.0,0.0,0.0
Willy Brandt,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0
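
The row-wise loop above can also be written as a single vectorized division. The sketch below assumes a small integer 0/1 matrix restricted to two chancellors and three occupations taken from the binary table; it is an illustration, not the notebook's own code.

```python
import pandas as pd

# Assumed excerpt of the binary table (two chancellors, three occupations)
binary = pd.DataFrame(
    {"lawyer": [0, 1], "physicist": [1, 0], "politician": [1, 1]},
    index=["Angela Merkel", "Olaf Scholz"],
)

# Divide every row by its own sum so that each row adds up to 1
relative = binary.div(binary.sum(axis=1), axis=0)
print(relative)
```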


TF-IDF, according to [1]:
$$\frac{1}{n} \cdot \log\left(\frac{N}{|\{r \mid C(r)\}|}\right)$$

$1/n$ is the relative count as displayed above, where $n$ is the number of occupations of the chancellor.<br>
$N$ is the total number of resources (chancellors).<br>
$|\{r \mid C(r)\}|$ is the number of resources that share the relation (occupation) $r$.
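
To make the formula concrete, here is a small worked example using counts read off the binary table above (10 chancellors, 1 physicist, 4 lawyers, 10 politicians). The function name `tf_idf` and its parameter names are assumptions for illustration, and the natural logarithm is used, as in `math.log`.

```python
import math


def tf_idf(n: int, total: int, sharing: int) -> float:
    """Relative count 1/n weighted by log(total / sharing), natural log."""
    return (1 / n) * math.log(total / sharing)


# Angela Merkel has n = 2 occupations and is the only physicist
physicist_merkel = tf_idf(n=2, total=10, sharing=1)
# Olaf Scholz has n = 2 occupations and is one of four lawyers
lawyer_scholz = tf_idf(n=2, total=10, sharing=4)
# "politician" is shared by all ten chancellors, so its weight vanishes
politician_any = tf_idf(n=2, total=10, sharing=10)

print(round(physicist_merkel, 3), round(lawyer_scholz, 3), round(politician_any, 3))
# → 1.151 0.458 0.0
```

A relation held by every resource carries no discriminating information, which is why its weight is zero regardless of $n$.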

In [15]:
N = len(df)

In [16]:
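for occupation in df:
    # Number of chancellors that share this occupation
    multiplicity = df[occupation].gt(0).sum()
    # Weight the relative count by log(N / multiplicity), natural logarithm
    df[occupation] = df[occupation].apply(lambda v: v * math.log(N/multiplicity))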

In [17]:
df.round(decimals=3)

chancellor_name,assessor,autobiographer,civil servant,consultant,economist,historian,journalist,judge,lawyer,lobbyist,military personnel,non-fiction writer,physicist,political scientist,politician,resistance fighter,university teacher,writer
Angela Merkel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.151,0.0,0.0,0.0,0.0,0.0
Gerhard Schröder,0.0,0.0,0.0,0.461,0.0,0.0,0.0,0.0,0.183,0.461,0.0,0.241,0.0,0.0,0.0,0.0,0.0,0.0
Helmut Kohl,0.0,0.0,0.0,0.0,0.0,0.768,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768,0.0,0.0,0.0,0.0
Helmut Schmidt,0.0,0.0,0.461,0.0,0.322,0.0,0.0,0.0,0.0,0.0,0.0,0.241,0.0,0.0,0.0,0.0,0.0,0.461
Konrad Adenauer,0.384,0.268,0.0,0.0,0.0,0.0,0.0,0.268,0.153,0.0,0.0,0.0,0.0,0.0,0.0,0.384,0.0,0.0
Kurt Georg Kiesinger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.536,0.305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ludwig Erhard,0.0,0.0,0.0,0.0,0.536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.768,0.0
Olaf Scholz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.458,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Walter Scheel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.151,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Willy Brandt,0.0,0.402,0.0,0.0,0.0,0.0,0.576,0.0,0.0,0.0,0.0,0.301,0.0,0.0,0.0,0.0,0.0,0.0
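
The whole weighting step can also be expressed without explicit loops. The following is a sketch under stated assumptions: `tfidf_features` is a hypothetical helper, and the input is a three-chancellor excerpt of the binary table, so $N = 3$ and the resulting values differ from the full table above.

```python
import numpy as np
import pandas as pd


def tfidf_features(binary: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: weight a 0/1 matrix by (1/n) * log(N / sharing)."""
    relative = binary.div(binary.sum(axis=1), axis=0)  # 1/n for each row
    sharing = binary.gt(0).sum(axis=0)                 # resources per column
    idf = np.log(len(binary) / sharing)                # natural logarithm
    return relative.mul(idf, axis=1)


# Assumed excerpt: three chancellors that each have exactly two occupations
binary = pd.DataFrame(
    {
        "lawyer": [0, 1, 0],
        "military personnel": [0, 0, 1],
        "physicist": [1, 0, 0],
        "politician": [1, 1, 1],
    },
    index=["Angela Merkel", "Olaf Scholz", "Walter Scheel"],
)

weights = tfidf_features(binary)
print(weights.round(3))
```

The sketch assumes every column is held by at least one resource; an all-zero column would lead to a division by zero in the `idf` term.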
