# Analysing AusStage Wikidata links


In [1]:
import pandas as pd

How many AusStage records are in the ACDE dataset?

In [2]:
df_acde = pd.read_json("ACDE_Merged_Normalized_202206031344.json")
df_ausstage = df_acde.loc[df_acde["data_source"] == "AusStage"]
ausstage_count = df_ausstage.shape[0]
ausstage_count

171081

How many AusStage people have a match in Wikidata?

In [3]:
df_wd_links = pd.read_json("ausstage_wd_all_links_20221006.json")
df_wd_links.head()

Unnamed: 0,acde_source,or_id,related_source,related_id,related_url,wd_url,wd_label
0,AusStage,548265,VIAF ID,38961419,https://viaf.org/viaf/38961419/,http://www.wikidata.org/entity/Q15485502,Lucinda Coxon
1,AusStage,548265,Internet Broadway Database person ID,488007,https://www.ibdb.com/broadway-cast-staff/488007,http://www.wikidata.org/entity/Q15485502,Lucinda Coxon
2,AusStage,548265,GND ID,1011169592,https://d-nb.info/gnd/1011169592,http://www.wikidata.org/entity/Q15485502,Lucinda Coxon
3,AusStage,548265,Library of Congress authority ID,no95043617,https://id.loc.gov/authorities/no95043617,http://www.wikidata.org/entity/Q15485502,Lucinda Coxon
4,AusStage,548265,NUKAT ID,n2016244632,https://wikidata-externalid-url.toolforge.org/...,http://www.wikidata.org/entity/Q15485502,Lucinda Coxon


How many of the AusStage people in the ACDE dataset have Wikidata links?

In [4]:
ausstage_wd_count = df_wd_links.loc[df_wd_links["or_id"].notnull()].groupby(by=["acde_source", "or_id"]).ngroups
ausstage_wd_count

1279

In [5]:
print(f"{ausstage_wd_count/ausstage_count:.2%} of AusStage people in the ACDE dataset have a match in Wikidata")

0.75% of AusStage people in the ACDE dataset have a match in Wikidata
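
The count above depends on `groupby(...).ngroups`, which returns the number of distinct key combinations rather than the number of rows. A minimal sketch of that behaviour on toy data follows; the frame `links` and the total of 400 are made up for illustration and are not taken from the harvested files.

```python
import pandas as pd

# Toy stand-in for df_wd_links: one row per (AusStage id, external identifier) link.
links = pd.DataFrame(
    {
        "acde_source": ["AusStage", "AusStage", "AusStage", "AusStage"],
        "or_id": [101, 101, 202, None],
        "related_source": ["VIAF ID", "GND ID", "VIAF ID", "ISNI"],
    }
)

# Drop rows without an AusStage id, then count distinct (source, id) pairs.
matched = links.loc[links["or_id"].notnull()]
n_people = matched.groupby(by=["acde_source", "or_id"]).ngroups

# The same count obtained by dropping repeated key pairs.
n_people_alt = len(matched[["acde_source", "or_id"]].drop_duplicates())

total_records = 400  # hypothetical total, standing in for ausstage_count
share = f"{n_people / total_records:.2%}"
print(n_people, n_people_alt, share)
```

Both approaches count each AusStage id once, however many external identifiers are attached to it.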


## Duplicates?

Some Wikidata links appear more than once in the harvested data. On closer inspection, this happens because some Wikidata entities are linked to multiple AusStage ids, so harvesting by AusStage id retrieved the same Wikidata entity more than once. I don't want to delete these rows from the harvested file, as that would lose the links to the AusStage identifiers, but for some calculations we'll need to exclude them.
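
One way to keep the full set of links while still working with a de-duplicated view is to call `drop_duplicates` without `inplace=True`, which returns a new frame. The sketch below uses invented toy rows, and the names `links`, `unique_links` and `repeated` are illustrative only.

```python
import pandas as pd

# Toy data: entity Q1 is linked to two AusStage ids, so its VIAF link appears twice.
links = pd.DataFrame(
    {
        "or_id": [1, 2, 3],
        "related_source": ["VIAF ID", "VIAF ID", "VIAF ID"],
        "related_id": ["111", "111", "222"],
        "wd_url": ["Q1", "Q1", "Q2"],
    }
)
key = ["related_source", "related_id", "wd_url"]

# Without inplace=True, drop_duplicates returns a new frame and leaves `links` untouched.
unique_links = links.drop_duplicates(subset=key)

# keep=False flags every member of a duplicate group, which makes the repeats easy to inspect.
repeated = links[links.duplicated(subset=key, keep=False)]

print(len(links), len(unique_links))
print(repeated["or_id"].tolist())
```

The original frame still carries every AusStage id, while `unique_links` can be used wherever each external identifier should be counted once.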

In [6]:
df_wd_links.shape[0]

40277

In [7]:
df_wd_links.drop_duplicates(subset=["related_source", "related_id", "wd_url"], inplace=True)
df_wd_links.shape[0]

40058

## Related sources

Which other sources are linked to AusStage records via Wikidata?

In [8]:
# There are some duplicates where the Wikidata records contain multiple AusStage identifiers, causing them to be
# harvested more than once. Let's get rid of them for counting purposes.
df_wd_links.drop_duplicates(subset=["related_source", "wd_url"], inplace=True)

df_wd_sources = df_wd_links.value_counts(subset=["related_source"]).to_frame().reset_index()
df_wd_sources.rename(columns={0: "count"}, inplace=True)
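
The name pandas gives the counts column produced by `value_counts` differs between versions (`0` in older releases, `count` in pandas 2.x), so renaming from `0` only has an effect on older versions. A small sketch, assuming toy data, that names the column explicitly with `Series.reset_index(name=...)`:

```python
import pandas as pd

# Toy stand-in for the related_source column of df_wd_links.
links = pd.DataFrame(
    {"related_source": ["VIAF ID", "VIAF ID", "ISNI", "VIAF ID", "ISNI", "GND ID"]}
)

# reset_index(name=...) sets the counts column name regardless of the pandas default.
sources = links.value_counts(subset=["related_source"]).reset_index(name="count")
print(sources)
```

The result has the columns `related_source` and `count`, sorted from most to least frequent.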

How many different external identifiers are attached to the Wikidata entities with AusStage links?

In [9]:
df_wd_sources.shape[0]

1117

1117! Let's show the top twenty.

In [10]:
df_wd_sources[:20]

Unnamed: 0,related_source,count
0,AusStage person ID,1270
1,VIAF ID,980
2,WorldCat Identities ID,928
3,Freebase ID,911
4,Library of Congress authority ID,852
5,ISNI,823
6,IMDb ID,803
7,TMDB person ID,595
8,GND ID,566
9,Bibliothèque nationale de France ID,561


In [11]:
people_ids = {
    "Libraries Australia ID": "P409",
    "NLA Trove people ID": "P1315",
    "People Australia ID": "P9159",
    "Obituaries Australia ID": "P9232",
    "Australian Dictionary of Biography ID": "P1907",
    "Labour Australia ID": "P9245",
    "Indigenous Australia ID": "P9246",
    "Women Australia ID": "P9244",
    "Encyclopedia of Australian Science ID": "P4228",
    "AusStage person ID": "P8292",
    "AustLit ID": "P8295",
    "DAAO ID": "P1707",
    "National Archives of Australia entity ID": "P10856",
    "Encyclopedia of Melbourne ID": "P9304",
    "Dictionary of Sydney ID": "P3794",
    "Australian Women's Register ID": "P4186",
    "Art Gallery of South Australia creator ID": "P6804",
    "Parliament of Australia MP ID": "P10020",
    "Re-Member ID": "P8633",
    "NSW Parliament member ID": "P10012",
    "Australian War Memorial ID": "P6713",
    "National Gallery of Victoria artist ID": "P2041",
    "Australian Prints + Printmaking artist ID": "P10086",
    "Australian Music Centre artist ID": "P9575",
    "Australian National Maritime Museum person ID": "P7769",
    "Australian Poetry Library poet ID": "P5465"
}

# The dictionary keys are the property labels that appear in the related_source column.
pids = list(people_ids)

In [12]:
wd_aus_sources = df_wd_links.loc[df_wd_links["related_source"].isin(pids)].value_counts(subset=["related_source"]).to_frame().reset_index()
wd_aus_sources.style.hide()

related_source,0
AusStage person ID,1270
NLA Trove people ID,540
Libraries Australia ID,317
Australian Music Centre artist ID,57
People Australia ID,51
AustLit ID,45
Australian Dictionary of Biography ID,45
Dictionary of Sydney ID,28
Australian Women's Register ID,27
Women Australia ID,22


## Links per record

In [13]:
related_source_counts = df_wd_links.groupby(by=["acde_source", "wd_url", "wd_label"])["related_source"].count().to_frame().reset_index()

In [14]:
related_source_counts["related_source"].describe()

count    1270.000000
mean       31.541732
std        40.587386
min         1.000000
25%         8.000000
50%        16.000000
75%        36.750000
max       348.000000
Name: related_source, dtype: float64

In [15]:
more_than_one = related_source_counts.loc[related_source_counts["related_source"] > 1]
print(f"{more_than_one.shape[0]/ausstage_wd_count:.2%} of AusStage people with Wikidata records are linked to additional sources")

98.91% of AusStage people with Wikidata records are linked to additional sources
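
The percentage above divides a count of Wikidata entities by a count of AusStage ids. As a point of comparison, here is a hedged sketch on toy data that uses the number of Wikidata entities for both the numerator and the denominator; the frame and its values are invented for illustration.

```python
import pandas as pd

# Toy links: Q1 has three identifiers, Q2 only its AusStage id, Q3 has two.
links = pd.DataFrame(
    {
        "wd_url": ["Q1", "Q1", "Q1", "Q2", "Q3", "Q3"],
        "related_source": [
            "AusStage person ID", "VIAF ID", "ISNI",
            "AusStage person ID",
            "AusStage person ID", "GND ID",
        ],
    }
)

# Number of identifiers attached to each Wikidata entity.
per_entity = links.groupby("wd_url")["related_source"].count()

# Entities with more than one identifier, as a share of all entities.
multi = int((per_entity > 1).sum())
share = f"{multi / len(per_entity):.2%}"
print(per_entity.to_dict(), share)
```

Whether entities or AusStage ids make the better denominator depends on the question being asked; the sketch only shows the entity-based variant.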


Which records have the most links?

In [16]:
related_source_counts.sort_values('related_source', ascending=False)[:20]

Unnamed: 0,acde_source,wd_url,wd_label,related_source
1053,AusStage,http://www.wikidata.org/entity/Q692,William Shakespeare,348
1197,AusStage,http://www.wikidata.org/entity/Q9068,Voltaire,261
906,AusStage,http://www.wikidata.org/entity/Q5673,Hans Christian Andersen,251
277,AusStage,http://www.wikidata.org/entity/Q169566,H. P. Lovecraft,236
1082,AusStage,http://www.wikidata.org/entity/Q7200,Alexander Pushkin,218
1176,AusStage,http://www.wikidata.org/entity/Q81447,Aldous Huxley,218
1098,AusStage,http://www.wikidata.org/entity/Q7317,Giuseppe Verdi,213
166,AusStage,http://www.wikidata.org/entity/Q134456,Yukio Mishima,211
765,AusStage,http://www.wikidata.org/entity/Q504,Émile Zola,210
183,AusStage,http://www.wikidata.org/entity/Q1399,Niccolò Machiavelli,208
