# Analysing AusStage Trove links

Questions:

- How many AusStage people have an entry in Trove?
- How many of these records are linked with other identifiers?
- How many other sources are these records linked to?


In [2]:
import pandas as pd

How many AusStage records are in the ACDE dataset?

In [3]:
df_acde = pd.read_json("ACDE_Merged_Normalized_202206031344.json")
df_ausstage = df_acde.loc[df_acde["data_source"] == "AusStage"]
ausstage_count = df_ausstage.shape[0]
ausstage_count

171081

How many AusStage people have a match in Trove's People and Organisations zone?

In [4]:
df_trove_links = pd.read_json("ausstage_trove_links_20221005.json")
df_trove_links.head()

Unnamed: 0,acde_source,or_id,related_source,related_source_id,related_source_url,related_source_name
0,AusStage,581,Trove,1561804,https://nla.gov.au/nla.party-1561804,Geoff Gaskill
1,AusStage,690,Trove,1561828,https://nla.gov.au/nla.party-1561828,Dan Scollay
2,AusStage,691,Trove,595149,https://nla.gov.au/nla.party-595149,John Gatenby Bolton
3,AusStage,691,AuCNLKIN,35990423,http://nla.gov.au/anbd.aut-an35990423,J G Bolton
4,AusStage,691,AU-VU:EOAS,P001483,http://www.eoas.info/biogs/P001483b.htm,John Gatenby Bolton


In [5]:
ausstage_trove_count = df_trove_links.groupby(by=["acde_source", "or_id"]).ngroups
ausstage_trove_count

134012
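
The links file has one row per link, so a single AusStage record can occupy several rows. `ngroups` counts the distinct `(acde_source, or_id)` pairs rather than the rows. A minimal sketch on a few rows adapted from the `head()` output above (the `toy_links` frame is illustrative, not the real data) shows that this is the same as dropping duplicate pairs and counting what remains:

```python
import pandas as pd

# Illustrative rows in the same shape as df_trove_links (not the real file).
toy_links = pd.DataFrame({
    "acde_source": ["AusStage", "AusStage", "AusStage", "AusStage"],
    "or_id": [581, 691, 691, 691],
    "related_source": ["Trove", "Trove", "AuCNLKIN", "AU-VU:EOAS"],
})

# Number of distinct (acde_source, or_id) pairs, computed two equivalent ways.
by_groups = toy_links.groupby(by=["acde_source", "or_id"]).ngroups
by_dedup = toy_links.drop_duplicates(subset=["acde_source", "or_id"]).shape[0]

print(len(toy_links), by_groups, by_dedup)
```

Here four link rows collapse to two AusStage records.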

How many don't have a match?

In [6]:
df_trove_not_found = pd.read_json("ausstage_trove_not_found_20221005.json")
ausstage_no_trove_count = df_trove_not_found.shape[0]
ausstage_no_trove_count

37069

Check that the numbers of records with and without Trove matches add up to the total number of AusStage records.

In [7]:
ausstage_trove_count + ausstage_no_trove_count == ausstage_count

True
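
Matching totals would not, on their own, reveal an identifier that appears in both files while another is missing from both. A sketch of an extra overlap check follows, using made-up identifiers; it assumes the not-found file also carries an `or_id` column, which is not shown in this notebook.

```python
import pandas as pd

# Hypothetical identifiers; the real not-found file's columns are an assumption.
matched_ids = pd.Series([581, 690, 691, 691], name="or_id")
not_found_ids = pd.Series([700, 701], name="or_id")

# Any identifier in both sets would mean a record was counted twice.
overlap = set(matched_ids) & set(not_found_ids)
print(f"{len(overlap)} identifiers appear in both sets")
```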

In [8]:
print(f"{ausstage_trove_count/ausstage_count:.2%} of AusStage people have a match in Trove")

78.33% of AusStage people have a match in Trove


## Related sources

Which other sources are linked to AusStage records via Trove?

In [27]:
sources = pd.read_json("agencies.json", orient="index")
df_trove_sources = pd.merge(df_trove_links, sources, how="left", left_on="related_source", right_index=True)
df_trove_sources.value_counts(subset=["related_source", 0])

related_source  0                                    
AuCNLKIN        Libraries Australia                      4464
AU-AIAS         AIATSIS Aboriginal Biographical Index    1171
AU-ANU:ADBO     Australian Dictionary of Biography        325
AU-QPRO         The Prosecution Project                   204
AU-NUN:DAAO     Design & Art Australia Online             160
AU-ANL:AD       Australia Dancing                         128
AU-VU:AWR       The Australian Women's Register           108
AU-NAMC         Australian Music Centre                    92
AU-ANU:OA       Obituaries Australia                       72
AU-ANL:MA       Music Australia                            50
AU-VU:EOAS      Encyclopedia of Australian Science         45
AU-AuCNLKIN     Libraries Australia                        41
AU-APAR         Australian Parliamentary Library           28
AU-YORCID       ORCID                                      25
AU-VU           University of Melbourne                    19
AU-NUWS         
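
A left merge keeps link rows whose source code has no entry in the lookup table, but leaves the name column empty for them, and `value_counts` by default leaves out rows with missing values. The sketch below, on invented data (the `XX-UNKNOWN` code and both toy frames are hypothetical), uses the `indicator=True` option of `pd.merge` to list codes that found no name:

```python
import pandas as pd

# Invented link rows; "XX-UNKNOWN" is a made-up code with no lookup entry.
toy_links = pd.DataFrame({
    "or_id": [691, 691, 700],
    "related_source": ["AuCNLKIN", "AU-VU:EOAS", "XX-UNKNOWN"],
})

# Lookup table shaped like `sources`: codes in the index, names in column 0.
toy_sources = pd.DataFrame(
    {0: ["Libraries Australia", "Encyclopedia of Australian Science"]},
    index=["AuCNLKIN", "AU-VU:EOAS"],
)

merged = pd.merge(
    toy_links, toy_sources, how="left",
    left_on="related_source", right_index=True, indicator=True,
)

# Rows marked "left_only" had no matching code in the lookup table.
unmatched_codes = (
    merged.loc[merged["_merge"] == "left_only", "related_source"].unique().tolist()
)
print(unmatched_codes)
```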

It's important to note that AIATSIS records (AU-AIAS) are different to most of the others. They are not 'identities' that bring together information about an individual; each is a separate resource relating to an individual. So one individual might have multiple AIATSIS links.

Number of individual AusStage records with AIATSIS links.

In [29]:
df_trove_links.loc[df_trove_links["related_source"] == "AU-AIAS"].groupby(by=["acde_source", "or_id", "related_source"]).ngroups

408
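
The difference between the number of AIATSIS links and the number of records that have them can be seen in a small sketch. The rows are invented for illustration: one record has three AIATSIS resources and another has one.

```python
import pandas as pd

# Invented rows: record 1 has three AIATSIS resources, record 2 has one.
toy_links = pd.DataFrame({
    "acde_source": ["AusStage"] * 4,
    "or_id": [1, 1, 1, 2],
    "related_source": ["AU-AIAS"] * 4,
})

aias = toy_links.loc[toy_links["related_source"] == "AU-AIAS"]
link_count = aias.shape[0]  # one per AIATSIS resource
record_count = aias.groupby(by=["acde_source", "or_id"]).ngroups  # one per record

print(link_count, record_count)
```

Four links belong to just two records, which is why the link total in the table above is larger than the record count.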

## Links per record

In [31]:
related_source_counts = df_trove_links.groupby(by=["acde_source", "or_id", "related_source_name"])["related_source"].count().to_frame().reset_index()

In [35]:
related_source_counts["related_source"].describe()

count    134794.000000
mean          1.046055
std           0.478156
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max          67.000000
Name: related_source, dtype: float64

In [42]:
more_than_one = related_source_counts.loc[related_source_counts["related_source"] > 1]
print(f"{more_than_one.shape[0]/ausstage_trove_count:.2%} of AusStage people with Trove records are linked to additional sources")

3.20% of AusStage people with Trove records are linked to additional sources
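
The grouping above includes `related_source_name`, so each count is the number of link rows per record and name. One alternative reading of "linked to additional sources" is to count, for each record, the distinct sources other than Trove. This is a sketch of that alternative on rows adapted from the `head()` output, not the method used in this notebook, and the variable names are illustrative:

```python
import pandas as pd

# Rows adapted from the head() output above.
toy_links = pd.DataFrame({
    "or_id": [581, 690, 691, 691, 691],
    "related_source": ["Trove", "Trove", "Trove", "AuCNLKIN", "AU-VU:EOAS"],
})

# Distinct non-Trove sources for each record that has any.
non_trove = toy_links.loc[toy_links["related_source"] != "Trove"]
sources_per_record = non_trove.groupby("or_id")["related_source"].nunique()

records_with_other_sources = sources_per_record.shape[0]
total_records = toy_links["or_id"].nunique()

print(f"{records_with_other_sources / total_records:.2%} of records have a non-Trove source")
```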


Records with the most links

In [32]:
related_source_counts.sort_values('related_source', ascending=False)[:20]

Unnamed: 0,acde_source,or_id,related_source_name,related_source
74540,AusStage,462553,Albert Namatjira,67
82,AusStage,144,Jack Davis,59
7492,AusStage,228382,George Thompson,54
61155,AusStage,443527,Harold Blair,46
965,AusStage,1731,Jimmy Little,40
40284,AusStage,412913,Michael Ryan,34
84746,AusStage,476430,John Donovan,28
5870,AusStage,225474,David Dalaithngu,26
2706,AusStage,4342,Ernie Dingo,23
66885,AusStage,452049,George Miller,21


## Duplicates?

I noticed that there seem to be some duplicate results where multiple AusStage records have been linked (correctly, it seems) to a single record within Trove. Some of these AusStage records contain different data but refer to the same person. Look, for example, at the records for Yvonne (Fifi) Banvard [here](https://www.ausstage.edu.au/pages/contributor/226757) and [here](https://www.ausstage.edu.au/pages/contributor/226677). These two records reflect different names for the same person and are cross-linked in AusStage. Trove brings them together under a single identity. This is all as you would expect, but because we're harvesting by each AusStage identifier, we get the same Trove record twice. I'm not sure it matters very much, but it does mean that the totals by source listed above don't reflect unique instances of the related source. Let's look at how many unique Trove identifiers there are...

In [65]:
df_trove = df_trove_links.loc[df_trove_links["related_source"] == "Trove"].copy()

Show a sample of matches with duplicated Trove links.

In [52]:
df_trove.loc[df_trove.duplicated(subset=["related_source_url"], keep=False)].sort_values(by="related_source_url")

Unnamed: 0,acde_source,or_id,related_source,related_source_id,related_source_url,related_source_name
99306,AusStage,488120,Trove,1153627,https://nla.gov.au/nla.party-1153627,Joy Smith
95880,AusStage,483378,Trove,1153627,https://nla.gov.au/nla.party-1153627,Joy Smith
34179,AusStage,253094,Trove,1175424,https://nla.gov.au/nla.party-1175424,Peter Mitchell
26700,AusStage,400260,Trove,1175424,https://nla.gov.au/nla.party-1175424,Peter Mitchell
11274,AusStage,229874,Trove,1186215,https://nla.gov.au/nla.party-1186215,Daniel Mitchell
...,...,...,...,...,...,...
72518,AusStage,449836,Trove,978476,https://nla.gov.au/nla.party-978476,Stephen Smith
94996,AusStage,513296,Trove,978476,https://nla.gov.au/nla.party-978476,Stephen Smith
4901,AusStage,8338,Trove,978476,https://nla.gov.au/nla.party-978476,Stephen Smith
71375,AusStage,450642,Trove,986452,https://nla.gov.au/nla.party-986452,William Henry Harris


How many unique Trove links are there?

In [59]:
unique_trove_links = df_trove.drop_duplicates(subset=["related_source_url"]).shape[0]
unique_trove_links

133822

Number of duplicates

In [66]:
ausstage_trove_count - unique_trove_links

190
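
This difference is the number of surplus rows, not the number of Trove identities shared by more than one AusStage record: an identity linked from three records contributes two to the figure. A sketch of listing the shared identities directly follows, using a few rows taken from the sample above plus one unshared row; the `toy_trove` frame is illustrative only.

```python
import pandas as pd

# Rows taken from the duplicated sample above, plus one record with a unique link.
toy_trove = pd.DataFrame({
    "or_id": [488120, 483378, 253094, 400260, 581],
    "related_source_url": [
        "https://nla.gov.au/nla.party-1153627",
        "https://nla.gov.au/nla.party-1153627",
        "https://nla.gov.au/nla.party-1175424",
        "https://nla.gov.au/nla.party-1175424",
        "https://nla.gov.au/nla.party-1561804",
    ],
})

# Collect the AusStage identifiers attached to each Trove URL.
ids_by_url = toy_trove.groupby("related_source_url")["or_id"].agg(list)
shared = ids_by_url.loc[ids_by_url.map(len) > 1]

# Surplus rows: total Trove rows minus unique Trove URLs.
surplus = len(toy_trove) - toy_trove["related_source_url"].nunique()

print(shared)
print(surplus)
```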