# Unlocking the Airwaves: Entity Report

This notebook explores the property coverage in Wikidata and SNAC for people and corporate bodies in the Unlocking the Airwaves project. It relies on having previously run the `collect.py` program, which downloads the Unlocking the Airwaves entities from Airtable and fetches their respective Wikidata and SNAC data from those APIs. The results are written to two files, `map-person.csv` and `map-corporate-bodies.csv`.


## People

First we'll take a look at people. Not all people have a `snac_id` or `wikidata_id`. The mapping we are using for people looks like the following. Note: the Wikidata and SNAC queries use the [JMESPath](https://jmespath.org/) notation.

In [7]:
import json

print(json.dumps(json.load(open('map-person.json')), indent=2))

{
  "label": {
    "wikidata": "labels.en.value"
  },
  "date_of_birth": {
    "wikidata": "claims.P569",
    "snac": "dates[].fromDate"
  },
  "place_of_birth": {
    "wikidata": "claims.P19"
  },
  "date_of_death": {
    "wikidata": "claims.P570",
    "snac": "dates[].toDate"
  },
  "description": {
    "wikidata": "descriptions.en.value",
    "snac": "biogHists[].text"
  },
  "place_of_death": {
    "wikidata": "claims.P20"
  },
  "occupation": {
    "wikidata": "claims.P106",
    "snac": "occupations[].term.term"
  },
  "field_of_work": {
    "wikidata": "claims.P101"
  },
  "employer": {
    "wikidata": "claims.P108"
  },
  "member_of": {
    "wikidata": "claims.P463"
  },
  "snac_id": {
    "wikidata": "claims.P3430"
  },
  "loc_id": {
    "wikidata": "claims.P244"
  }
}


The `map-person.csv` file is the result of running `collect.py`, which examines each Person and Corporate Body in the Unlocking the Airwaves Airtable and looks them up in Wikidata or SNAC using their APIs. If the mapping query succeeds in finding a value, a `1` is placed in the table cell; otherwise a `0` is placed there.

In [8]:
import pandas
people = pandas.read_csv('map-person.csv')
people

Unnamed: 0,name,wikidata_id,snac_id,label_wikidata,label_snac,date_of_birth_wikidata,date_of_birth_snac,place_of_birth_wikidata,place_of_birth_snac,date_of_death_wikidata,...,field_of_work_wikidata,field_of_work_snac,employer_wikidata,employer_snac,member_of_wikidata,member_of_snac,snac_id_wikidata,snac_id_snac,loc_id_wikidata,loc_id_snac
0,"Edson, Peter",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Shamberger, Harold J.",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Easton, Florence",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Woelful, Norman",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Novrup, Johannes, 1904-1960.",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6128,"Pepan, Edward F.",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6129,"Kerwin, Jonathan W.",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6130,"Smith, H.E.",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6131,"Jarrell, Randall, 1914-1965",,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


How many people have a Wikidata ID or SNAC ID?

In [9]:
len(people[people['wikidata_id'].notna()]) / len(people)

0.04353497472688733

In [10]:
len(people[people['snac_id'].notna()]) / len(people)

0.04353497472688733

So not many people have a SNAC or Wikidata ID yet. To decide what the displays for people should look like, we can look at the metadata coverage for the people that do have either a Wikidata ID or a SNAC ID.
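The share of rows with an ID can also be computed for both columns at once by taking the mean of a boolean mask. The sketch below uses a small made-up frame (`toy`, with placeholder IDs) in place of `map-person.csv`, so the numbers are illustrative only.

```python
import pandas

# Toy stand-in for map-person.csv; names and IDs are placeholders.
toy = pandas.DataFrame({
    "name": ["A", "B", "C", "D"],
    "wikidata_id": ["QX1", None, None, "QX2"],
    "snac_id": [None, None, 11.0, 22.0],
})

# notna() yields booleans; the mean of booleans is the fraction that are True.
id_share = toy[["wikidata_id", "snac_id"]].notna().mean()
print(id_share)
```

This avoids building a filtered frame just to count its rows, and it fails loudly if a column name is mistyped.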

In [24]:
people_matched = people[people['snac_id'].notna() | people['wikidata_id'].notna()]
people_matched

Unnamed: 0,name,wikidata_id,snac_id,label_wikidata,label_snac,date_of_birth_wikidata,date_of_birth_snac,place_of_birth_wikidata,place_of_birth_snac,date_of_death_wikidata,...,field_of_work_wikidata,field_of_work_snac,employer_wikidata,employer_snac,member_of_wikidata,member_of_snac,snac_id_wikidata,snac_id_snac,loc_id_wikidata,loc_id_snac
41,"Nixon, Richard M. (Richard Milhous), 1913-1994",Q9588,3018986.0,1,0,1,1,1,0,1,...,0,0,0,0,1,0,1,0,1,0
47,"Gillis, Don, 1912-1978",Q714525,84628616.0,1,0,1,1,1,0,1,...,0,0,1,0,0,0,1,0,1,0
49,"Badings, Henk",,21641534.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
58,"Akins, Charles",,84650902.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
81,"Cranley, John",Q6227508,28501034.0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6058,"Daniel, Dorothy",Q55719771,24519820.0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,1,0
6072,"Mosely, Philip E. (Philip Edward), 1905-1972",Q27491453,3536252.0,1,0,1,1,0,0,1,...,0,0,1,0,0,0,0,0,1,0
6091,"Dreyfus, Lee S.",,2678036.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6103,"Gould, Jack",Q6112762,73321129.0,1,0,1,1,1,0,1,...,0,0,0,0,0,0,1,0,1,0


In [38]:
people_avg = pandas.DataFrame({
    "average": people_matched.loc[:,'label_wikidata':].sum() / len(people_matched),
})
people_avg

Unnamed: 0,average
label_wikidata,0.834375
label_snac,0.0
date_of_birth_wikidata,0.778125
date_of_birth_snac,0.66875
place_of_birth_wikidata,0.634375
place_of_birth_snac,0.0
date_of_death_wikidata,0.6875
date_of_death_snac,0.66875
description_wikidata,0.778125
description_snac,0.428125


But notice that `label_wikidata`, which we would expect to be at 100%, is at 83%. This might be because we are looking at entities with either a SNAC ID or a Wikidata ID, so entities that only have a SNAC ID count in the denominator. Maybe if we just look at entities with a Wikidata ID it will look better?
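The effect of the denominator can be seen on a toy frame. This is only a sketch with made-up rows (the frame `toy` and its placeholder IDs are assumptions), where one row has a SNAC ID but no Wikidata ID.

```python
import pandas

# Made-up rows: the third one has a SNAC ID but no Wikidata ID.
toy = pandas.DataFrame({
    "wikidata_id": ["QX1", "QX2", None, "QX3"],
    "snac_id": [11.0, None, 22.0, 33.0],
    "label_wikidata": [1, 1, 0, 1],
})

either = toy[toy["wikidata_id"].notna() | toy["snac_id"].notna()]
with_wikidata = toy[toy["wikidata_id"].notna()]

# The SNAC-only row can never have a Wikidata label, so it lowers the average.
either_share = either["label_wikidata"].mean()           # 3 labels over 4 rows
wikidata_share = with_wikidata["label_wikidata"].mean()  # 3 labels over 3 rows
print(either_share, wikidata_share)
```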

In [40]:
people_wikidata = people[people['wikidata_id'].notna()]
people_wikidata_avg = pandas.DataFrame({
    "average": people_wikidata.loc[:,'label_wikidata':].sum() / len(people_wikidata),
})
people_wikidata_avg

Unnamed: 0,average
label_wikidata,1.0
label_snac,0.0
date_of_birth_wikidata,0.932584
date_of_birth_snac,0.666667
place_of_birth_wikidata,0.7603
place_of_birth_snac,0.0
date_of_death_wikidata,0.82397
date_of_death_snac,0.666667
description_wikidata,0.932584
description_snac,0.445693


That looks like a better picture of property coverage for Wikidata entities. We can do the same thing for SNAC entities, which shows the percentage coverage of properties for entities that have a SNAC ID.

In [42]:
people_snac = people[people['snac_id'].notna()]
people_snac_avg = pandas.DataFrame({
    "average": people_snac.loc[:,'label_wikidata':].sum() / len(people_snac),
})
people_snac_avg

Unnamed: 0,average
label_wikidata,0.801498
label_snac,0.0
date_of_birth_wikidata,0.756554
date_of_birth_snac,0.801498
place_of_birth_wikidata,0.632959
place_of_birth_snac,0.0
date_of_death_wikidata,0.689139
date_of_death_snac,0.801498
description_wikidata,0.749064
description_snac,0.513109
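To compare the three subsets without scrolling between tables, the averages could be placed side by side in one frame. The following is a sketch on made-up data: the frame `toy`, its placeholder IDs, and the column labels `either`, `wikidata` and `snac` are assumptions for this example.

```python
import pandas

# Made-up rows standing in for map-person.csv.
toy = pandas.DataFrame({
    "wikidata_id": ["QX1", "QX2", None, "QX3"],
    "snac_id": [11.0, None, 22.0, 33.0],
    "label_wikidata": [1, 1, 0, 1],
    "date_of_birth_snac": [1, 0, 1, 0],
})

has_wikidata = toy["wikidata_id"].notna()
has_snac = toy["snac_id"].notna()
subsets = {
    "either": has_wikidata | has_snac,
    "wikidata": has_wikidata,
    "snac": has_snac,
}

# One column of averages per subset, one row per coverage property.
comparison = pandas.DataFrame({
    name: toy.loc[mask, "label_wikidata":].mean()
    for name, mask in subsets.items()
})
print(comparison)
```

Each column uses its own subset size as the denominator, matching the `sum() / len(...)` calculations in the cells above.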


## Corporate Bodies

TBD if this analysis is useful...