In [5]:
%pip install lxml

Collecting lxml
  Downloading lxml-5.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.0 MB)
Installing collected packages: lxml
Successfully installed lxml-5.1.0
Note: you may need to restart the kernel to use updated packages.


In [70]:
from collections import defaultdict
from functools import reduce
from itertools import chain
from typing import List, Tuple 

import bs4
from tqdm.notebook import tqdm

In [123]:
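def refworks_get_children_tag_names_values(parent: bs4.element.Tag) -> List[Tuple[str, set]]:
    """Return a (tag name, {stripped text}) pair for each child tag of a reference."""
    return list(filter(
        # Whitespace between tags is parsed as NavigableString children, whose name is None.
        lambda x: x[0] is not None,
        # t.string is None for an empty tag or one with several children, so fall back to "".
        map(lambda t: (t.name, {(t.string or "").strip()}), parent.children)
    ))

def update_dict_with_records(d: defaultdict, kv: Tuple[str, set]) -> defaultdict:
    """Union the values in kv into the set already stored under its key."""
    key, value = kv
    d[key] = d[key] | value
    return d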
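
The accumulation pattern behind `update_dict_with_records` can be tried on toy data before running it on the full export. This is a minimal sketch: `merge_pair` and `toy_records` are hypothetical names invented for illustration, and the values are made up rather than taken from the files.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def merge_pair(acc, kv):
    """Union the set in kv into the accumulator entry for its key."""
    key, values = kv
    acc[key] |= values
    return acc


# Hypothetical per-record output: one list of (tag, {value}) pairs per record.
toy_records = [
    [("rt", {"Book, Whole"}), ("yr", {"1972"})],
    [("rt", {"Video/DVD"}), ("yr", {"1972"})],
]

# Flatten the per-record lists into one stream of pairs, then fold them into a dict of sets.
merged = reduce(merge_pair, chain.from_iterable(toy_records), defaultdict(set))
print(dict(merged))
```

Because the values are sets, a value repeated across records (the year here) is stored once, which is why the later cells report unique values per tag.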

Examining **export.xml**, the RefWorks file.

### XML record description

`<reference></reference>` is the item-level tag.

Each reference **may** have the following child tags:

- `<rt>` is the reference type, for example "Book, Whole" and "Video/DVD". There is also "Generic", which is odd.
- `<sr>` unknown, seems to only have a single value: "1"
- `<id>` the RefWorks ID
- `<a1>` Seems to be first author
- `<t1>` Seems to be title
- `<yr>` year
- `<k1>` Maybe subject, examples: "Psychology in literature", "Religion"
- `<ab>` Abstract, note only 242 of the 2000 references have an abstract.
- `<no>` No idea, maybe some sort of topic example: '87032251 Edward E. Rosenbaum. 25 cm.; RN: AOLCancer/Laryngeal, throat, and vocal cord cancer; ID: 2154'
- `<ed>` Edition, could be string like/numeric, examples: '2013/07/23', 'New, rev.'
- `<pb>` Publisher: 'Trinity Mirror Sport Media', 'University of Rochester Press'
- `<pp>` No idea, some location maybe, examples: 'Shizuoka-ken Hamamatsu-shi', 'New York,', 'Liverpool'
- `<sn>` Serial number 
- `<ad>` Something academic? example: 'Division of General Medicine, Emory University School of Medicine, Atlanta, GA 30303, USA. atorke@emory.edu'
- `<an>` No idea, some sort of number, examples: '016084891', '39510772'
- `<db>` No idea, text description maybe, examples: 'The cancer monologue project', 'Shifāʼuddaulah kī sarguzasht'
- `<cn>` No idea, examples: 'BODBL XWeek 32 (12) Bodleian Library XWeek 32 (12)', 'RSLBL 15085 e.668 Radcliffe Science Library 15085 e.668 (Box B000000204119)'
- `<lk>` Links, urls
- `<sp>` Something related to edition maybe, examples: '75', 'iv, 1 , 6', '471'
- `<jf>` Journal maybe, examples: 'Conscious Cogn', 'Bull Med Libr Assoc'
- `<jo>` Also Journal maybe, examples: 'Psychother.Psychosom.', 'Rocz.Akad.Med.Bialymst.'
- `<vo>` Volume maybe (or version), examples: '75', '74', '76'
- `<is>` issue maybe, examples: '13', '3', '9246'
- `<fd>` A date, examples: 'Sep', 'Aug', 'Jul 6'
- `<do>` DOI
- `<a2>` A tag associated with audio sources?, examples: 'Narrated by Anne,Bancroft', 'postscript read by Greg,Louganis'
- `<la>` Language, examples: 'In Turkish.', 'In English, and one article in Chinese.'
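
Coverage figures such as the abstract count above can be checked by counting, for each tag, how many references contain it at least once. The sketch below is an assumption-laden illustration: it uses the standard library's `xml.etree.ElementTree` instead of `bs4`, and the `SAMPLE` document (including its `<refworks>` root element and all values) is hypothetical, not copied from export.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature export; the root tag name and all values are made up.
SAMPLE = """<refworks>
  <reference><rt>Book, Whole</rt><id>1</id><t1>Title A</t1><ab>Some abstract</ab></reference>
  <reference><rt>Video/DVD</rt><id>2</id><t1>Title B</t1></reference>
  <reference><rt>Generic</rt><id>3</id><a1>Doe,Jane</a1><a1>Roe,Richard</a1></reference>
</refworks>"""

root = ET.fromstring(SAMPLE)
references = root.findall("reference")

# Count each tag at most once per reference, so repeated <a1> tags do not inflate the total.
coverage = Counter()
for ref in references:
    coverage.update({child.tag for child in ref})

for tag, n in sorted(coverage.items()):
    print(f"<{tag}>: present in {n} of {len(references)} references")
```

Counting per reference rather than per tag occurrence matters for repeatable fields such as `<a1>`, where one reference can list several authors.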

In [30]:
with open("../data/export.xml", "r") as f:
    parser = bs4.BeautifulSoup(f.read(), "xml")

In [46]:
references = parser.find_all("reference")
len(references)

2000

In [86]:
tags_and_values = reduce(
    update_dict_with_records, 
    chain.from_iterable(map(refworks_get_children_tag_names_values, tqdm(references, desc="parsing references"))), 
    defaultdict(set)
)

parsing references:   0%|          | 0/2000 [00:00<?, ?it/s]

In [90]:
for tag_name in tags_and_values:
    print(f"Tag: <{tag_name}>")
    print(f"Unique values: {len(tags_and_values[tag_name])}")
    print(f"5 samples: {list(tags_and_values[tag_name])[:5]}")
    print()

Tag: <rt>
Unique values: 6
5 samples: ['Dissertation/Thesis', 'Motion Picture', 'Generic', 'Video/DVD', 'Book, Whole']

Tag: <sr>
Unique values: 1
5 samples: ['1']

Tag: <id>
Unique values: 2000
5 samples: ['RefID:14281-tibble1972john', "RefID:14964-weiner2004brother's", 'RefID:13277-severo1985lisa', 'RefID:13221-seibert1968pebbles', "RefID:13695-smolan1992medicine's"]

Tag: <a1>
Unique values: 2357
5 samples: ['Song,Hag-un', 'Gauthier,Ursula', 'Wadler,Joyce', 'Todd,Alexandra Dundas', 'Wall,Barbara']

Tag: <t1>
Unique values: 1850
5 samples: ['The cancer monologue project', 'Shifāʼuddaulah kī sarguzasht', 'Peering through the darkness: the subjective experience of clinical depression', 'Not the last goodbye : reflections on life, death, healing and cancer', "Living with prostate cancer : a patient's survival guide"]

Tag: <yr>
Unique values: 117
5 samples: ['1972', '2001', '1993', '2003', '1976']

Tag: <op>
Unique values: 1287
5 samples: ['viii, 242 p.', '168 p.', 'v. <1 >', '123 p.,

Examining **HumExMedMasterLibrary-Converted.xml**, the EndNote file.

### XML record description

`<record></record>` is the item-level tag.

Each record **may** have the following child tags:

- `<database>` reference to the EndNote file
- `<source-app>` "EndNote"
- `<rec-number>` EndNote id
- `<foreign-keys>` db reference key
- `<ref-type>` seems like integer category for media
- `<contributors>` possible authors
- `<titles>` title
- `<pages>` page number of the reference
- `<keywords>` ...keywords, examples: 'Brill, Alida, 1950-', 'Fertilization in vitro, Human.'
- `<dates>` years as a string
- `<accession-num>` Another kind of id number
- `<call-num>` Possibly library bibliographic id, examples: 'BODBL 24728 e.64/2\rBodleian Library 24728 e.64/2', 'SCABL HM291.SOC 1998\rSoc and Cultural Anth Library HM291.SOC 1998'
- `<notes>` misc notes
- `<urls>` links
- `<label>` misc labels
- `<num-vols>` media description
- `<pub-location>` publisher location
- `<publisher>` publisher
- `<isbn>` isbn number
- `<number>` not clear what it is
- `<edition>` edition
- `<periodical>` periodical name
- `<volume>` record volume
- `<electronic-resource-num>` possible bibliographic id
- `<language>` record language
- `<orig-pub>` original publisher
- `<alt-periodical>` alternate periodical
- `<remote-database-name>` remote db protocol
- `<remote-database-provider>` remote db provider name
- `<abstract>` abstract
- `<work-type>` kind of media or publication
- `<research-notes>` notes on record
- `<auth-address>` author location
- `<custom2>` seems like an extra text field
- `<access-date>` access date
- `<section>` citation section
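
Several of these tags, such as `<contributors>` and `<titles>`, appear to wrap nested child tags rather than hold text directly, so a record can carry more than one value per field. The sketch below shows one way to collect every text value under a field. It is an illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of `bs4`, `leaf_texts` is a hypothetical helper, and the `SAMPLE` record with its nesting and values is invented rather than copied from the EndNote file.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the nesting and all values are assumed for illustration.
SAMPLE = """<records><record>
  <rec-number>1</rec-number>
  <contributors><authors><author>Doe, Jane</author><author>Roe, Richard</author></authors></contributors>
  <titles><title>An example title</title></titles>
</record></records>"""


def leaf_texts(element):
    """Return the stripped, non-empty text of every node under element, in document order."""
    return [text.strip() for text in element.itertext() if text.strip()]


record = ET.fromstring(SAMPLE).find("record")
# Map each top-level field of the record to the list of all text values found inside it.
fields = {child.tag: leaf_texts(child) for child in record}
print(fields)
```

Walking all nested text keeps every author of a record, at the cost of losing which nested tag each value came from.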

In [111]:
with open("../data/HumExMedMasterLibrary-Converted.xml", "r") as f:
    parser = bs4.BeautifulSoup(f.read(), "xml")

In [114]:
records = parser.find_all("record")
len(records)

15502

In [172]:
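def endnote_process_child(child: bs4.element.Tag) -> Tuple[str, set]:
    """Return (tag name, {first value}) for one child of a record."""
    name = child.name
    # Only the first node inside the child is inspected.
    contents = child.contents[0] if len(child.contents) else None

    if isinstance(contents, bs4.element.NavigableString):
        value = set((str(contents),))
    elif isinstance(contents, bs4.element.Tag):
        # Nested field: take the string of the first node inside the wrapper tag.
        value = set((next(contents.children).string,))
    elif contents is None:
        value = set()
    else:
        # Neither text, tag, nor empty: report the unexpected node.
        raise Exception(f"name:{name} had unexpected contents: {contents!r}")

    return (name, value)

def endnote_get_children_tag_names_values(parent: bs4.element.Tag) -> List[Tuple[str, set]]:
    """Apply endnote_process_child to every child of a record."""
    return list(map(
        endnote_process_child,
        parent.children,
    ))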

In [173]:
tags_and_values = reduce(
    update_dict_with_records, 
    chain.from_iterable(map(endnote_get_children_tag_names_values, tqdm(records, desc="parsing records"))), 
    defaultdict(set)
)

parsing records:   0%|          | 0/15502 [00:00<?, ?it/s]

In [174]:
for tag_name in tags_and_values:
    print(f"Tag: <{tag_name}>")
    print(f"Unique values: {len(tags_and_values[tag_name])}")
    print(f"5 samples: {list(tags_and_values[tag_name])[:5]}")
    print()

Tag: <database>
Unique values: 1
5 samples: ['HumExMedMasterLibrary-Converted.enl']

Tag: <source-app>
Unique values: 1
5 samples: ['EndNote']

Tag: <rec-number>
Unique values: 15502
5 samples: ['916', '10843', '12062', '2710', '8408']

Tag: <foreign-keys>
Unique values: 15502
5 samples: ['916', '10843', '12062', '2710', '8408']

Tag: <ref-type>
Unique values: 12
5 samples: ['13', '12', '3', '6', '5']

Tag: <contributors>
Unique values: 6768
5 samples: ['Markut, Lynda A.', 'Hathaway, Katharine', 'Clare,', 'Carpenter, Kim', 'Radner, Gilda']

Tag: <titles>
Unique values: 7901
5 samples: ['The cancer monologue project', 'Visions and revisions : coming of age in the age of Aids', 'Shifāʼuddaulah kī sarguzasht', '<LEDJacek.pdf>', "Atomes à l'heure du thé"]

Tag: <pages>
Unique values: 4714
5 samples: ['xxxiv, 257 p.', '585-97', '11-6', '370 p, 16 p of plates', 'xvi, 153 p.']

Tag: <keywords>
Unique values: 6212
5 samples: ['Brill, Alida, 1950-', 'Fertilization in vitro, Human.', 'Panizz