# Analysing DAAO Trove links


In [1]:
import pandas as pd

How many DAAO records are in the ACDE dataset?

In [3]:
df_acde = pd.read_json("ACDE_Merged_Normalized_202206031344.json")
df_daao = df_acde.loc[df_acde["data_source"] == "DAAO"]
daao_count = df_daao.shape[0]
daao_count

2188

How many DAAO people have a match in Trove's People and Organisations zone?

In [4]:
df_trove_links = pd.read_json("daao_trove_links_20221004.json")
df_trove_links.head()

Unnamed: 0,acde_source,or_id,related_source,related_source_id,related_source_url,related_source_name
0,DAAO,4da1e25db538b73d93029eca,Trove,1486884,https://nla.gov.au/nla.party-1486884,Annette Ellis
1,DAAO,4da19cd2b538b73d93013e35,Trove,1483347,https://nla.gov.au/nla.party-1483347,Ruth Napaljarri
2,DAAO,4da1b2bfb538b73d9301e797,Trove,1486592,https://nla.gov.au/nla.party-1486592,Chris Edwards
3,DAAO,4da1eaf5b538b73d9302a6cc,Trove,1484395,https://nla.gov.au/nla.party-1484395,Marisa Purcell
4,DAAO,5158251dd707ff422f00102e,Trove,1528482,https://nla.gov.au/nla.party-1528482,Hugh Child


In [29]:
daao_trove_count = df_trove_links.groupby(by=["acde_source", "or_id"]).ngroups
daao_trove_count

2090
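
The links file holds several rows per person (one for the Trove identity plus one for each related source), so counting people means counting distinct `acde_source`/`or_id` pairs, not rows. A minimal sketch on made-up data; `toy_links` and its values are illustrative and not taken from the real file:

```python
import pandas as pd

# Toy stand-in for the links file: two people, three link rows.
toy_links = pd.DataFrame(
    {
        "acde_source": ["DAAO", "DAAO", "DAAO"],
        "or_id": ["a1", "a1", "b2"],
        "related_source": ["Trove", "AU-AIAS", "Trove"],
    }
)

# Number of link rows (over-counts people with more than one link)
row_count = toy_links.shape[0]

# Number of distinct people, as in the cell above
people_count = toy_links.groupby(by=["acde_source", "or_id"]).ngroups

# An equivalent way to count distinct key pairs
people_count_alt = toy_links[["acde_source", "or_id"]].drop_duplicates().shape[0]

print(row_count, people_count, people_count_alt)
```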

How many don't have a match?

In [30]:
df_trove_not_found = pd.read_json("daao_trove_not_found_20221004.json")
daao_no_trove_count = df_trove_not_found.shape[0]
daao_no_trove_count

98

Check that the numbers with and without Trove matches add up to the total number of DAAO records.

In [7]:
daao_trove_count + daao_no_trove_count == daao_count

True

In [9]:
print(f"{daao_trove_count/daao_count:.2%} of DAAO people have a match in Trove")

95.52% of DAAO people have a match in Trove


## Related sources

Which other sources are linked to DAAO records via Trove?

In [10]:
sources = pd.read_json("agencies.json", orient="index")
df_trove_sources = pd.merge(df_trove_links, sources, how="left", left_on="related_source", right_index=True)
df_trove_sources.value_counts(subset=["related_source", 0])

related_source  0                                    
AuCNLKIN        Libraries Australia                      999
AU-AIAS         AIATSIS Aboriginal Biographical Index    476
AU-ANU:ADBO     Australian Dictionary of Biography        85
AU-SAUS         AusStage                                  60
AU-VU:AWR       The Australian Women's Register           33
AU-ANU:OA       Obituaries Australia                      16
AU-AuCNLKIN     Libraries Australia                       10
AU-ANL:AD       Australia Dancing                          5
AU-VASD         Australian Sound Design                    4
AU-VU           University of Melbourne                    4
AU-VU:EOAS      Encyclopedia of Australian Science         4
AU-QGU          Griffith University                        2
AU-ANL:MA       Music Australia                            1
AU-NAMC         Australian Music Centre                    1
AU-NUWS         University of Western Sydney               1
AU-YORCID       ORCID          
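
The left merge above attaches a name to each source code, and any code missing from the lookup ends up with a `NaN` name. The sketch below shows one way to spot such codes using the `indicator` option of `pd.merge`. The toy frames and the `source_name` column label are assumptions for illustration only; the real lookup is loaded from `agencies.json` and its name column is labelled `0`.

```python
import pandas as pd

# Toy link rows and a toy code-to-name lookup indexed by source code.
toy_links = pd.DataFrame({"related_source": ["AU-AIAS", "AU-SAUS", "Trove"]})
toy_sources = pd.DataFrame(
    {0: ["AIATSIS Aboriginal Biographical Index", "AusStage"]},
    index=["AU-AIAS", "AU-SAUS"],
)

# Left merge on the code; indicator=True adds a "_merge" column
# recording whether each row found a partner in the lookup.
merged = pd.merge(
    toy_links,
    toy_sources.rename(columns={0: "source_name"}),  # hypothetical label
    how="left",
    left_on="related_source",
    right_index=True,
    indicator=True,
)

# Codes with no entry in the lookup
unmatched = merged.loc[merged["_merge"] == "left_only", "related_source"].tolist()

# value_counts drops rows containing NaN by default, so unmatched
# codes do not appear in the tally.
named_counts = merged.value_counts(subset=["related_source", "source_name"])

print(unmatched)
print(named_counts)
```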

It's important to note that AIATSIS records (AU-AIAS) are different to most of the others. They are not 'identities' that bring together information about an individual; they're separate resources relating to an individual, so one individual might have multiple AIATSIS links.

Number of individual DAAO records with AIATSIS links.

In [11]:
df_trove_links.loc[df_trove_links["related_source"] == "AU-AIAS"].groupby(by=["acde_source", "or_id", "related_source"]).ngroups

122
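
The 476 AIATSIS links counted earlier are spread over these 122 individuals, roughly 3.9 links each on average. The difference between counting link rows and counting people can be sketched on invented data (`toy_links` and its ids are hypothetical):

```python
import pandas as pd

# Toy link rows: person a1 has two AIATSIS links, b2 has one, c3 has none.
toy_links = pd.DataFrame(
    {
        "or_id": ["a1", "a1", "a1", "b2", "b2", "c3"],
        "related_source": ["Trove", "AU-AIAS", "AU-AIAS", "Trove", "AU-AIAS", "Trove"],
    }
)

aiatsis = toy_links.loc[toy_links["related_source"] == "AU-AIAS"]

# Each row is a separate AIATSIS resource
link_rows = aiatsis.shape[0]

# Distinct people who have at least one AIATSIS link
people = aiatsis["or_id"].nunique()

# AIATSIS links per person
links_per_person = aiatsis.groupby("or_id").size()

print(link_rows, people)
print(links_per_person)
```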

## Links per record

In [12]:
related_source_counts = df_trove_links.groupby(by=["acde_source", "or_id", "related_source_name"])["related_source"].count().to_frame().reset_index()

In [13]:
related_source_counts["related_source"].describe()

count    2362.000000
mean        1.605419
std         1.871140
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        67.000000
Name: related_source, dtype: float64

In [14]:
more_than_one = related_source_counts.loc[related_source_counts["related_source"] > 1]
print(f"{more_than_one.shape[0]/daao_trove_count:.2%} of DAAO people with Trove records are linked to additional sources")

41.77% of DAAO people with Trove records are linked to additional sources


Records with the most links

In [15]:
related_source_counts.sort_values('related_source', ascending=False)[:20]

Unnamed: 0,acde_source,or_id,related_source_name,related_source
1972,DAAO,4e5205f7d707ff1dc7002610,Albert Namatjira,67
412,DAAO,4da197b9b538b73d9300bc5f,Kevin Gilbert,25
1478,DAAO,4da1b640b538b73d9302000d,Lola Greeno,17
1605,DAAO,4da1bd1eb538b73d93023d6c,Bill Onus,15
1464,DAAO,4da1b5d0b538b73d9301f8b0,Dulcie Greeno,15
1623,DAAO,4da1be66b538b73d93024462,Mervyn Bishop,14
1436,DAAO,4da1b4deb538b73d9301eb65,Lin Onus,13
1729,DAAO,4da1ce86b538b73d930275c7,Shane Pickett,12
1776,DAAO,4da1dd67b538b73d93028dd7,Trevor Nickolls,11
114,DAAO,4da19577b538b73d93003fb4,Euphemia Bostock,11


## Duplicates?


In [16]:
# Keep only the Trove identity rows; copy after filtering to get an independent frame
df_trove = df_trove_links.loc[df_trove_links["related_source"] == "Trove"].copy()

Show a sample of matches with duplicated Trove links.

In [17]:
df_trove.loc[df_trove.duplicated(subset=["related_source_url"], keep=False)].sort_values(by="related_source_url")

Unnamed: 0,acde_source,or_id,related_source,related_source_id,related_source_url,related_source_name
875,DAAO,5a60631ad707ff4dfc000006,Trove,1481482,https://nla.gov.au/nla.party-1481482,Beverly Graham
1081,DAAO,4da199f0b538b73d9300fcf0,Trove,1481482,https://nla.gov.au/nla.party-1481482,Beverly Graham
2592,DAAO,4da1a08bb538b73d93016c4b,Trove,1482563,https://nla.gov.au/nla.party-1482563,Henry Hewitt Rayner
2961,DAAO,4da1c913b538b73d930266d4,Trove,1482563,https://nla.gov.au/nla.party-1482563,Henry Hewitt Rayner
1981,DAAO,4da19acbb538b73d930112d9,Trove,1482566,https://nla.gov.au/nla.party-1482566,Norah Napanangka Long
3731,DAAO,4da19cd9b538b73d93013e6a,Trove,1482566,https://nla.gov.au/nla.party-1482566,Norah Napanangka Long
324,DAAO,4da19f20b538b73d93015d7f,Trove,1485218,https://nla.gov.au/nla.party-1485218,Kathleen Petyarre
2737,DAAO,513fe07fd707ff6bcd001e4b,Trove,1485218,https://nla.gov.au/nla.party-1485218,Kathleen Petyarre
251,DAAO,4ed4699dd707ff4ff0000000,Trove,461161,https://nla.gov.au/nla.party-461161,Sydney Ball
1649,DAAO,4e521665d707ff1dc7009c44,Trove,461161,https://nla.gov.au/nla.party-461161,Sydney Ball


How many unique Trove links are there?

In [18]:
unique_trove_links = df_trove.drop_duplicates(subset=["related_source_url"]).shape[0]
unique_trove_links

2084

Number of duplicates (matched DAAO people minus unique Trove links)

In [20]:
daao_trove_count - unique_trove_links

6
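
For reference, `duplicated(..., keep=False)` flags every row that shares a URL with another row, while `drop_duplicates` keeps one row per URL. A small sketch on invented data shows how the two relate; the `example.org` URLs and the ids are placeholders, not real Trove links:

```python
import pandas as pd

# Toy Trove rows: a1 and b2 point at the same party URL.
toy_trove = pd.DataFrame(
    {
        "or_id": ["a1", "b2", "c3", "d4"],
        "related_source_url": [
            "https://example.org/party-1",
            "https://example.org/party-1",
            "https://example.org/party-2",
            "https://example.org/party-3",
        ],
    }
)

# keep=False marks all members of a duplicated group, not just the repeats
dupe_rows = toy_trove.loc[
    toy_trove.duplicated(subset=["related_source_url"], keep=False)
].sort_values(by="related_source_url")

# One row per distinct URL
unique_urls = toy_trove.drop_duplicates(subset=["related_source_url"]).shape[0]

# Rows beyond the first for each URL
surplus = toy_trove.shape[0] - unique_urls

print(dupe_rows["or_id"].tolist(), unique_urls, surplus)
```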