# Analysing DAAO Wikidata links


In [1]:
import pandas as pd

How many DAAO records are in the ACDE dataset?

In [2]:
df_acde = pd.read_json("ACDE_Merged_Normalized_202206031344.json")
df_daao = df_acde.loc[df_acde["data_source"] == "DAAO"]
daao_count = df_daao.shape[0]
daao_count

2188

How many DAAO people have a match in Wikidata?

In [3]:
df_daao = pd.read_json("daao_wd_all_links_20221006.json")
df_daao.rename(columns={"or_id": "or_path"}, inplace=True)
daao_urls = pd.read_csv("daao_urls.csv")
df_wd_links = pd.merge(
    df_daao, daao_urls, how="left", left_on="or_path", right_on="daao_path"
)
df_wd_links.rename(columns={"daao_id": "or_id"}, inplace=True)
# Keep only the columns needed for the analysis, in a consistent order
df_wd_links = df_wd_links[
    [
        "acde_source",
        "or_id",
        "or_path",
        "related_source",
        "related_id",
        "related_url",
        "wd_url",
        "wd_label",
    ]
]

df_wd_links.head()

Unnamed: 0,acde_source,or_id,or_path,related_source,related_id,related_url,wd_url,wd_label
0,DAAO,,ulay-abramovic,RKDartists ID,123221,https://rkd.nl/explore/artists/123221,http://www.wikidata.org/entity/Q29939705,Abramović/Ulay
1,DAAO,,ulay-abramovic,GTAA ID,75229,https://data.beeldengeluid.nl/gtaa/75229,http://www.wikidata.org/entity/Q29939705,Abramović/Ulay
2,DAAO,,ulay-abramovic,NLA Trove people ID,1599739,https://trove.nla.gov.au/people/1599739,http://www.wikidata.org/entity/Q29939705,Abramović/Ulay
3,DAAO,,ulay-abramovic,DAAO ID,ulay-abramovic,https://www.daao.org.au/bio/ulay-abramovic,http://www.wikidata.org/entity/Q29939705,Abramović/Ulay
4,DAAO,,ulay-abramovic,Photographers' Identities Catalog ID,330593,https://pic.nypl.org/constituents/330593,http://www.wikidata.org/entity/Q29939705,Abramović/Ulay


We harvested details from every Wikidata entity that has a DAAO id attached to it, but not all of these appear to be in the ACDE dataset. Some Wikidata entries might be unlinked simply because I was unable to find a DAAO url for the hash identifier used in the ACDE dataset; I needed these urls to get the path segments that Wikidata uses as DAAO identifiers. However, I was able to get urls from Trove for 2090 of the 2188 DAAO records in the ACDE dataset, so the 98 missing urls don't explain the full difference.
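As a minimal sketch of the url-to-path step described above: the helper name `daao_path_from_url` is hypothetical and not part of the original notebook, and it assumes the identifier is always the last path segment of a DAAO biography url.

```python
from urllib.parse import urlparse


def daao_path_from_url(url):
    """Return the last non-empty path segment of a url, or None if there is none.

    Hypothetical helper: assumes DAAO biography urls end with the path
    segment that Wikidata stores as the DAAO id.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else None


# This url appears in the harvested links shown earlier
example_url = "https://www.daao.org.au/bio/ulay-abramovic"
daao_path = daao_path_from_url(example_url)
print(daao_path)
```

The resulting value is what would be compared against the `or_path` column when merging.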

How many of the DAAO people in the ACDE dataset have Wikidata links?

In [4]:
daao_acde_wd_count = df_wd_links.loc[df_wd_links["or_id"].notnull()].groupby(by=["acde_source", "or_id"]).ngroups
daao_acde_wd_count

437

How many entities in Wikidata with DAAO links don't seem to be in the ACDE dataset? (Note the qualification above: this count could be off by up to 98 records.)

In [5]:
# Rows without an ACDE identifier, counted once per DAAO path
daao_notacde_wd_count = df_wd_links.loc[df_wd_links["or_id"].isnull()].groupby(by=["acde_source", "or_path"]).ngroups
daao_notacde_wd_count

1426
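The matched/unmatched split above relies on the left merge leaving `or_id` empty where no ACDE url was found. A rough sketch of the same idea on toy data, using the `indicator` option of `pd.merge` to label each row explicitly. The frames `links` and `urls` and all their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the harvested links and the url lookup
links = pd.DataFrame(
    {
        "or_path": ["a-artist", "a-artist", "b-artist", "c-artist"],
        "wd_url": ["Q1", "Q1", "Q2", "Q3"],
    }
)
urls = pd.DataFrame(
    {"daao_path": ["a-artist", "c-artist"], "daao_id": ["hash-a", "hash-c"]}
)

# indicator=True adds a "_merge" column: "both" for matches, "left_only" otherwise
merged = pd.merge(
    links, urls, how="left", left_on="or_path", right_on="daao_path", indicator=True
)

# Count distinct DAAO paths on each side of the split
matched = merged.loc[merged["_merge"] == "both", "or_path"].nunique()
unmatched = merged.loc[merged["_merge"] == "left_only", "or_path"].nunique()
print(matched, unmatched)
```

Checking `_merge` gives the same split as testing `or_id` for nulls, but it still works if the lookup table itself contains empty identifiers.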

In total, how many people with DAAO records have entries in Wikidata?

In [6]:
daao_wd_count = daao_notacde_wd_count + daao_acde_wd_count
daao_wd_count

1863

In [7]:
print(f"{daao_acde_wd_count/daao_count:.2%} of DAAO people in the ACDE dataset have a match in Wikidata")

19.97% of DAAO people in the ACDE dataset have a match in Wikidata


## Duplicates?

I noticed what looked like duplicate Wikidata links. On further investigation, some Wikidata entities turned out to be linked to multiple DAAO ids, so harvesting by DAAO id retrieved the same Wikidata entity more than once. I don't want to remove these rows entirely, as that would lose the links to the DAAO identifiers, but for some calculations we'll need to drop them.

In [8]:
df_wd_links.shape[0]

29332

In [9]:
df_wd_links.drop_duplicates(subset=["related_source", "related_id", "wd_url"], inplace=True)
df_wd_links.shape[0]

28112
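To see which entities cause these repeats, one option is to count the distinct DAAO paths per Wikidata entity. This is only a sketch on made-up rows; the `links` frame and its values are hypothetical, though the column names follow `df_wd_links`.

```python
import pandas as pd

# Toy rows (hypothetical values): entity Q1 carries two DAAO ids,
# so its VIAF link was harvested twice
links = pd.DataFrame(
    {
        "or_path": ["a-artist", "a-artist-2", "b-artist"],
        "wd_url": ["Q1", "Q1", "Q2"],
        "related_source": ["VIAF ID", "VIAF ID", "VIAF ID"],
        "related_id": ["111", "111", "222"],
    }
)

# Number of distinct DAAO paths attached to each Wikidata entity
paths_per_entity = links.groupby("wd_url")["or_path"].nunique()
multi = paths_per_entity[paths_per_entity > 1]

# Non-destructive de-duplication: returns a copy and leaves `links` intact
deduped = links.drop_duplicates(subset=["related_source", "related_id", "wd_url"])
print(multi.index.tolist(), len(deduped))
```

Assigning the result to a new name instead of using `inplace=True` keeps the full set of DAAO links available for later steps.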

## Related sources

Which other sources are linked to DAAO records via Wikidata?

In [10]:
# There are some duplicates where the Wikidata records contain multiple DAAO identifiers, causing them to be
# harvested more than once. Let's get rid of them for counting purposes.
df_wd_links.drop_duplicates(subset=["related_source", "wd_url"], inplace=True)

df_wd_sources = df_wd_links.value_counts(subset=["related_source"]).to_frame().reset_index()
df_wd_sources.rename(columns={0: "count"}, inplace=True)

How many different external identifiers are attached to the Wikidata entities with DAAO links?

In [11]:
df_wd_sources.shape[0]

740

740! Let's show the top twenty.

In [12]:
df_wd_sources[:20]

Unnamed: 0,related_source,count
0,DAAO ID,1823
1,NLA Trove people ID,1437
2,VIAF ID,1252
3,WorldCat Identities ID,1181
4,Freebase ID,842
5,ISNI,816
6,Library of Congress authority ID,770
7,National Gallery of Victoria artist ID,649
8,Artnet artist ID,637
9,FAST ID,614


In [13]:
people_ids = {
    "Libraries Australia ID": "P409",
    "NLA Trove people ID": "P1315",
    "People Australia ID": "P9159",
    "Obituaries Australia ID": "P9232",
    "Australian Dictionary of Biography ID": "P1907",
    "Labour Australia ID": "P9245",
    "Indigenous Australia ID": "P9246",
    "Women Australia ID": "P9244",
    "Encyclopedia of Australian Science ID": "P4228",
    "AusStage person ID": "P8292",
    "AustLit ID": "P8295",
    "DAAO ID": "P1707",
    "National Archives of Australia entity ID": "P10856",
    "Encyclopedia of Melbourne ID": "P9304",
    "Dictionary of Sydney ID": "P3794",
    "Australian Women's Register ID": "P4186",
    "Art Gallery of South Australia creator ID": "P6804",
    "Parliament of Australia MP ID": "P10020",
    "Re-Member ID": "P8633",
    "NSW Parliament member ID": "P10012",
    "Australian War Memorial ID": "P6713",
    "National Gallery of Victoria artist ID": "P2041",
    "Australian Prints + Printmaking artist ID": "P10086",
    "Australian Music Centre artist ID": "P9575",
    "Australian National Maritime Museum person ID": "P7769",
    "Australian Poetry Library poet ID": "P5465"
}

# Only the labels (the dictionary keys) are needed to filter related_source
pids = list(people_ids)

In [14]:
# Count links to Australian sources and name the count column explicitly
wd_aus_sources = df_wd_links.loc[df_wd_links["related_source"].isin(pids)].value_counts(subset=["related_source"]).to_frame("count").reset_index()
wd_aus_sources.style.hide()

related_source,count
DAAO ID,1823
NLA Trove people ID,1437
National Gallery of Victoria artist ID,649
Libraries Australia ID,516
Art Gallery of South Australia creator ID,490
People Australia ID,465
Australian Dictionary of Biography ID,436
Dictionary of Sydney ID,130
Women Australia ID,130
Australian National Maritime Museum person ID,108


## Links per record

In [15]:
related_source_counts = df_wd_links.groupby(by=["acde_source", "wd_url", "wd_label"])["related_source"].count().to_frame().reset_index()

In [16]:
related_source_counts["related_source"].describe()

count    1823.000000
mean       15.420735
std        16.990607
min         1.000000
25%         6.000000
50%        11.000000
75%        19.000000
max       224.000000
Name: related_source, dtype: float64

In [17]:
more_than_one = related_source_counts.loc[related_source_counts["related_source"] > 1]
print(f"{more_than_one.shape[0]/daao_wd_count:.2%} of DAAO people with Wikidata records are linked to additional sources")

95.81% of DAAO people with Wikidata records are linked to additional sources


Records with the most links

In [18]:
related_source_counts.sort_values('related_source', ascending=False)[:20]

Unnamed: 0,acde_source,wd_url,wd_label,related_source
1334,DAAO,http://www.wikidata.org/entity/Q5582,Vincent van Gogh,224
297,DAAO,http://www.wikidata.org/entity/Q168724,Emma Thompson,188
209,DAAO,http://www.wikidata.org/entity/Q153739,Jean Arp,168
774,DAAO,http://www.wikidata.org/entity/Q235066,Laurie Anderson,161
335,DAAO,http://www.wikidata.org/entity/Q184366,Thomas Henry Huxley,141
768,DAAO,http://www.wikidata.org/entity/Q229455,Cindy Sherman,136
233,DAAO,http://www.wikidata.org/entity/Q159409,Louise Bourgeois,128
203,DAAO,http://www.wikidata.org/entity/Q14633979,Christo,128
959,DAAO,http://www.wikidata.org/entity/Q360466,George Cruikshank,118
254,DAAO,http://www.wikidata.org/entity/Q160422,Theo van Doesburg,112
