## Technical Analysis for *[Investigate getting custom data into Intake][I1296]*

[I1296]: https://github.com/sul-dlss/dlme/issues/1296

## Intake Plug-Ins
[Intake][INTAKE]'s plugins provide a way to handle custom data sources like the examples below. Following *[Making Drivers](https://intake.readthedocs.io/en/latest/making-plugins.html)*, we could extend the `base.DataSource` class and implement the `_get_partition` method, which is responsible for creating and returning the data source's pandas DataFrame.

[INTAKE]: https://github.com/intake/intake

In [1]:
import json
import pathlib

import intake
import pandas as pd
import lxml.etree as etree
import xml_to_df

## LOC IIIF JSON

In [2]:
loc = pathlib.Path('loc')
loc_dataframes = {}
for i,collection in enumerate(loc.iterdir()):
    data_dir = collection/"data"
    records = []
    for row in data_dir.glob("*.json"):
        record = json.loads(row.read_text())
        records.append(record)
    collection_df = pd.json_normalize(records)
    print(f"""
    LOC {i} {collection.name}
Shape: {collection_df.shape}
 Size: {len(collection_df)}""")
    loc_dataframes[collection.name] = collection_df


    LOC 0 st-catherines-monastery
Shape: (1687, 40)
 Size: 1687

    LOC 1 abdul-hamid-ii-books
Shape: (321, 52)
 Size: 321

    LOC 2 greek-and-armenian-patriarchates
Shape: (1002, 26)
 Size: 1002

    LOC 3 el-taher
Shape: (105, 46)
 Size: 105

    LOC 4 persian
Shape: (300, 57)
 Size: 300

    LOC 5 abdul-hamid-ii-photos
Shape: (1817, 79)
 Size: 1817


Using this [comment](https://github.com/sul-dlss/dlme-harvest/issues/87#issuecomment-879159834) as a reference, we test an existing collection's DataFrame.

First, we create a new DataFrame from the `loc/abdul-hamid-ii-books/data` directory.

In [3]:
abdul_hammid_book_path = loc/"abdul-hamid-ii-books/data"
books_json = [json.loads(row.read_text()) for row in abdul_hammid_book_path.glob("*.json")]
abdul_hammid_books_df = pd.json_normalize(books_json)

Confirm that `abdul_hammid_books_df` is equal to the corresponding DataFrame in the `loc_dataframes` dictionary.

In [4]:
loc_dataframes['abdul-hamid-ii-books'].equals(abdul_hammid_books_df)

True

Change a value in the subject list of a single record (row 67).

In [5]:
print(f"Initial Subjects for row 67: {abdul_hammid_books_df.iloc[67].subject}")
abdul_hammid_books_df.iloc[67].subject.append('rifle')
print(f"Current Subjects for row 67: {abdul_hammid_books_df.iloc[67].subject}")

Initial Subjects for row 67: ['turkey', 'firearms', '19th century']
Current Subjects for row 67: ['turkey', 'firearms', '19th century', 'rifle']


Now test whether `abdul_hammid_books_df` is still equal to the corresponding DataFrame in `loc_dataframes`.

In [7]:
loc_dataframes['abdul-hamid-ii-books'].equals(abdul_hammid_books_df)

False

Now compare the two for differences:

In [8]:
loc_dataframes['abdul-hamid-ii-books'].compare(abdul_hammid_books_df)

Unnamed: 0_level_0,subject,subject
Unnamed: 0_level_1,self,other
67,"[turkey, firearms, 19th century]","[turkey, firearms, 19th century, rifle]"


### De-duplication

**NOTE** One of the requirements from the SPIKE is to de-duplicate the rows in the DataFrame. To accomplish this with pandas, we will need to convert the `list` objects in a column to a hashable Python type such as `tuple` or `frozenset`, because `drop_duplicates` cannot hash lists.
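
A minimal sketch of that conversion, using a small invented DataFrame rather than the LOC data; the `hashable_lists` helper is hypothetical and only handles top-level lists, not lists nested inside other lists or dicts.

```python
import pandas as pd


def hashable_lists(df):
    """Return a copy of df with every list cell converted to a tuple."""
    return df.apply(
        lambda column: column.map(
            lambda value: tuple(value) if isinstance(value, list) else value
        )
    )


# Invented records: the first two rows are exact duplicates.
records = pd.DataFrame({
    "id": ["a", "a", "b"],
    "subject": [["turkey", "firearms"], ["turkey", "firearms"], ["persia"]],
})

deduped = hashable_lists(records).drop_duplicates(ignore_index=True)
print(len(records), len(deduped))  # 3 2
```

Tuples keep the original order, so `['a', 'b']` and `['b', 'a']` remain distinct; if order should not matter, a `frozenset` would be the alternative.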

## BNF XML DataFrames
We use the [xml_to_df](https://github.com/PraveenKumar-21/xml_to_df) module to flatten the XML; it builds the column names from the elements and attributes in the XML.

In [9]:
bnf = pathlib.Path("bnf")
bnf_dfs = {}
for collection in bnf.iterdir():
    if collection.is_dir():
        data_dir = collection/"data"
        file_dfs = []
        for xml_doc in data_dir.glob("*.xml"):
            file_dfs.append(xml_to_df.convert_xml_to_df(xml_doc.as_posix()))
        collection_df = pd.concat(file_dfs, ignore_index=True)
        bnf_dfs[collection.name] = collection_df
        print(f"""BNF {collection.name}
Shape: {collection_df.shape}
 Size: {len(collection_df)}""")
    

BNF ifao
Shape: (1668, 18)
 Size: 1668
BNF ideo
Shape: (360, 18)
 Size: 360


In [10]:
bnf_dfs['ideo'].sample(10)

Unnamed: 0,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}contributor,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}creator,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}description,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}format,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}identifier,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}language,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}relation,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}rights,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}source,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}title,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}type,{http://www.loc.gov/zing/srw/}extraRecordData_link,{http://www.loc.gov/zing/srw/}extraRecordData_nqamoyen,{http://www.loc.gov/zing/srw/}extraRecordData_thumbnail,{http://www.loc.gov/zing/srw/}extraRecordData_typedoc,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}subject,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}date,{http://www.loc.gov/zing/srw/}recordData_{http://www.openarchives.org/OAI/2.0/oai_dc/}dc_{http://purl.org/dc/elements/1.1/}publisher
50,,"Ibn al-Batanūnī, ʿAlī ibn ʿUmar (-après 1495)....",[[Al-Sirr al-ṣafī fī manāqib sayyidī Muḥammad ...,"[55, 120 pages ; 17 × 24 cm, Nombre total de v...",https://gallica.bnf.fr/ark:/12148/bpt6k9106138q,"[ara, arabe]",[Notice du catalogue : http://catalogue.bnf.fr...,"[domaine public, public domain]","Institut dominicain d'études orientales, 9-759-75",[ ال جزء الأول [والجزء الثاني] من كتاب السر ال...,"[text, monographie imprimée, monographie impri...",,,,,"Ḥanafī, Muḥammad ibn al-Ḥasan al- (-1443)",1888.0,[ʿAlá nafaqaẗ Salīm Sayyid Aḥmad Ibrāhīm al-Qa...
198,,,,,,,,,,,,,,,,,,
160,,,,,,,,,,,,,,,,,,
168,,,,,,,,,,,,,,,,,,
42,,,,,,,,,,,,,,,,,,
241,,,,,,,,,,,,,,,,,,
67,,,,,,,,,,,,,,,,,,
169,,,,,,,,,,,,,,,,,,
123,,,,,,,,,,,,,,,,,,
347,,,,,,,,,,,,https://gallica.bnf.fr/ark:/12148/bpt6k9105982k,0.0,https://gallica.bnf.fr/ark:/12148/bpt6k9105982...,monographies,,,


## Yale Babylonian CSV File

In [11]:
yale_babylonian = pd.read_csv("yale/babylonian/data/yale-babylonian.csv")

In [12]:
yale_babylonian.sample(10)

Unnamed: 0,id,occurrence_id,last_modified,callnumber,title,scientificname,collector,collecting_date,latitude,longitude,...,type,format,era,geographic_culture,identified_by,identification_references,previous_identifications,associated_references,iiif_id,image_id
6307,4478861,urn:uuid:0d05c109-2a82-47b1-8c83-ad001317c420,2021-04-14T21:48:31.000Z,YPM BC 009239,Tablet. Draft of seal inscription. Old Babylon...,taxon undetermined,,,,,...,BC number 9239; Babylonian Collection; date: 0...,tablet,clay,Old Babylonian,,,,,/ypm/nat/1825803,https://images.collections.yale.edu/iiif/2/ypm...
2325,4507912,urn:uuid:c36ce805-faa9-4b96-8e6d-fa693c125637,2021-05-28T21:46:16.000Z,YPM BC 024080,Tablet. Loan of barley. Early Old Babylonian. ...,taxon undetermined,,,,,...,BC number 24080; Babylonian Collection; date: ...,tablet,clay,Early Old Babylonian,,,,,/ypm/nat/1840644,https://images.collections.yale.edu/iiif/2/ypm...
6379,4480167,urn:uuid:1df9657c-9eb3-4d8a-9fa0-93bf26479390,2021-04-14T21:48:31.000Z,YPM BC 008910,Tablet. Account of land parcels and quantities...,taxon undetermined,,,,,...,BC number 8910; Babylonian Collection; genre: ...,tablet,clay,Late Early Dynastic,,,,,/ypm/nat/1825474,https://images.collections.yale.edu/iiif/2/ypm...
8546,4486713,urn:uuid:11601195-ac1f-48ab-823c-44ca6c5eb03f,2021-05-06T21:45:56.000Z,YPM BC 003097,Tablet. Disbursement of livestock by Abba$agga...,taxon undetermined,,,,,...,BC number 3097; Babylonian Collection; date: A...,tablet,clay,Ur III,,,,,/ypm/nat/1819661,https://images.collections.yale.edu/iiif/2/ypm...
1958,4475152,urn:uuid:947dd326-c2c8-4424-8287-6ee9227b459f,2021-05-28T21:46:16.000Z,YPM BC 026484,Tablet. Soil worked by guru$. Ur III. Clay.; Y...,taxon undetermined,,,,,...,BC number 26484; Babylonian Collection; date: ...,tablet,clay,Ur III,,,,,/ypm/nat/1843048,https://images.collections.yale.edu/iiif/2/ypm...
8640,4487314,urn:uuid:9a3fa9a9-3f91-49da-809d-a83d621c3eee,2021-04-22T21:45:41.000Z,YPM BC 002941,Tablet. Delivery of silver. Ur III. Clay.; YPM...,taxon undetermined,,,,,...,BC number 2941; Babylonian Collection; date: ?...,tablet,clay,Ur III,,,,,/ypm/nat/1819505,https://images.collections.yale.edu/iiif/2/ypm...
3108,4490888,urn:uuid:d14127e3-59a0-4f19-83b9-979935dd395b,2021-06-02T21:47:41.000Z,YPM BC 020216,Tablet. Sale of slaves. Old Babylonian. Clay. ...,taxon undetermined,,,,,...,BC number 20216; Babylonian Collection; date: ...,tablet,clay,Old Babylonian,,,,,/ypm/nat/1836780,https://images.collections.yale.edu/iiif/2/ypm...
4592,4501377,urn:uuid:83d5faa3-9d7a-4132-b4ad-c8597ac96e6f,2021-02-15T16:43:48.000Z,YPM BC 015867,Tablet. Loan of silver. Ur III. Clay.; YPM BC ...,taxon undetermined,,,,,...,BC number 15867; Babylonian Collection; date: ...,tablet,clay,Ur III,,,,,/ypm/nat/1832431,https://images.collections.yale.edu/iiif/2/ypm...
1696,4511665,urn:uuid:a05fa26b-3e0b-4ee2-80ad-7370682f150a,2021-05-21T21:50:20.000Z,YPM BC 027128,Tablet. Transfer of dead sheep. Ur III. Clay.;...,taxon undetermined,,,,,...,BC number 27128; Babylonian Collection; date: ...,tablet,clay,Ur III,,,,,/ypm/nat/1843692,https://images.collections.yale.edu/iiif/2/ypm...
7275,4476551,urn:uuid:ae29aa39-a8b7-4b2b-844a-61c6b7a3274e,2021-05-02T21:45:50.000Z,YPM BC 006033,Tablet. Record concerning guru$ employed for f...,taxon undetermined,,,,,...,BC number 6033; Babylonian Collection; date: A...,tablet,clay,Ur III,,,,,/ypm/nat/1822597,https://images.collections.yale.edu/iiif/2/ypm...


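De-duplicate the rows in the `yale_babylonian` DataFrame. Note that `drop_duplicates()` returns a new DataFrame and leaves `yale_babylonian` unchanged unless the result is assigned back, so the two sizes printed below are the same whether or not duplicate rows exist.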

In [13]:
print(f"Initial size of yale_babylonian {len(yale_babylonian)}")
yale_babylonian.drop_duplicates()
print(f"After dropping duplicates, size of yale_babylonian {len(yale_babylonian)}")

Initial size of yale_babylonian 8820
After dropping duplicates, size of yale_babylonian 8820
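
For comparison, here is a toy illustration (the data is invented) of the difference between calling `drop_duplicates()` and keeping the DataFrame it returns:

```python
import pandas as pd

# Invented rows: the first two are exact duplicates.
frame = pd.DataFrame({"id": [1, 1, 2], "title": ["Tablet", "Tablet", "Seal"]})

frame.drop_duplicates()            # result discarded; frame itself is unchanged
unchanged_size = len(frame)

deduped = frame.drop_duplicates()  # keep the returned DataFrame
print(unchanged_size, len(deduped))  # 3 2
```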


## Penn Genizah XML
As noted in the [ticket](https://github.com/sul-dlss/dlme/issues/1296), records in Penn's Genizah collection have a node whose language value is stored as an attribute of that node. An additional complexity with this source is that the TEI-XML is embedded within HTML. We can use the BeautifulSoup module to pull out the TEI-XML, which can then be run through the `xml_to_df` module to create the DataFrame.

In [14]:
from bs4 import BeautifulSoup 

with open("/Users/jpnelson/02021/sul-dlss/dlme-metadata/penn/openn/genizah-0002/data/0002-104.xml") as fo:
    penn_genizah_104_html = BeautifulSoup(fo.read(),
                                     'html.parser')


In [19]:
penn_genizah_104_tei_xml = penn_genizah_104_html.find("tei")
filedesc = penn_genizah_104_tei_xml.find("filedesc")

In [20]:
filedesc

<filedesc>
<titlestmt>
<title>Description of Center for Advanced Judaic Studies Library, Halper 79: Mishnah, Ḳodashim, Midot 1:1-3</title>
</titlestmt>
<publicationstmt>
<publisher>Center for Advanced Judaic Studies Library</publisher>
<availability>
<licence target="http://creativecommons.org/publicdomain/mark/1.0/">These images and the content of Center for Advanced Judaic Studies Library, Halper 79: Mishnah, Ḳodashim, Midot 1:1-3 are free of known copyright restrictions and in the public domain. See the Creative Commons Public Domain Mark page for usage details, http://creativecommons.org/publicdomain/mark/1.0/.</licence>
<licence target="https://creativecommons.org/licenses/by/4.0/legalcode">Metadata is ©2017 University of Pennsylvania Books &amp; Manuscripts and licensed under a Creative Commons Attribution License version 4.0 (CC-BY-4.0 https://creativecommons.org/licenses/by/4.0/legalcode. For a description of the terms of use see the Creative Commons Deed https://creativecomm

In [23]:
penn_genizah_104_df = pd.read_xml(str(penn_genizah_104_tei_xml),
                                  xpath="filedesc")

ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.
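
One likely cause of this error is that `filedesc` is not a direct child of the root `tei` node, so the bare path `filedesc` matches nothing; if the serialized document also carries an `xmlns` declaration, a `namespaces` argument would be needed as well. The sketch below illustrates the first point on a hypothetical, heavily abridged stand-in for the TEI above. It uses `parser="etree"` only so that it needs nothing beyond pandas and the standard library.

```python
import io

import pandas as pd

# Hypothetical, abridged stand-in for the serialized TEI (lowercased tags,
# as produced by html.parser).
tei_xml = b"""<tei><teiheader><filedesc>
<titlestmt><title>Example title</title></titlestmt>
<publicationstmt><availability>
<licence target="http://creativecommons.org/publicdomain/mark/1.0/">Images: public domain</licence>
<licence target="https://creativecommons.org/licenses/by/4.0/legalcode">Metadata: CC-BY-4.0</licence>
</availability></publicationstmt>
</filedesc></teiheader></tei>"""

# A bare child step finds nothing because filedesc sits under teiheader.
try:
    pd.read_xml(io.BytesIO(tei_xml), xpath="filedesc", parser="etree")
    error_message = None
except ValueError as err:
    error_message = str(err)

# A descendant search reaches the nested nodes: each <licence> becomes a row,
# and its `target` attribute becomes a column next to the element text.
licences = pd.read_xml(
    io.BytesIO(tei_xml), xpath=".//availability/licence", parser="etree"
)
print(error_message is not None, licences.shape)
```

The same attribute-to-column behaviour is what would carry a language value stored as an attribute into the DataFrame.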

In [17]:
penn_genizah_104_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   filedesc     0 non-null      float64
 1   profiledesc  0 non-null      float64
 2   surface      0 non-null      float64
dtypes: float64(3)
memory usage: 176.0 bytes


## University of Michigan MARC XML
MARC XML sources, for example from the University of Michigan, pose their own set of challenges beyond just reading the XML into a pandas DataFrame.

### Using the `xml_to_df` module
Loading the MARC XML into a DataFrame with `xml_to_df` produces a shape that would support higher-level DataFrame functionality like `compare` and `equals`, but its columns do not reflect the underlying structure we want. The `subfield` nodes and the `code`, `tag`, and indicator attributes are each collapsed into a single list-valued column per record.

In [88]:
mich = [xml_to_df.convert_xml_to_df(row) for row in pathlib.Path("michigan/data/").glob("*.xml")]

In [89]:
mich_df = pd.concat(mich, ignore_index=True)

In [90]:
mich_df.sample(10)

Unnamed: 0,record_leader,record_controlfield_tag,record_controlfield,record_datafield_tag,record_datafield_ind1,record_datafield_ind2,record_datafield_subfield_code,record_datafield_subfield
465,04136ntm a2200565 a 4500,"[001, 003, 005, 006, 007, 008]","[006816116, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 245, 246, 246, 246, 246, ...","[ , , , , 0, 3, 3, 3, 3, , , , , , , ...","[ , , , , 0, , , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, f, a, a, a, i, a, c, ...","[(MiU)006816116, sdr-miu006816116, MiU, MiU, a..."
42,08439ntm a2200553 a 4500,"[001, 003, 005, 006, 007, 008]","[006834153, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 245, 246, 260, 300, ...","[ , , , , 1, 1, 3, , , , , , , , , ...","[ , , , , , 0, , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, f, i, a, c, ...","[(MiU)006834153, sdr-miu006834153, MiU, MiU, a..."
605,05413ntm a2200505 a 4500,"[001, 003, 005, 006, 007, 008]","[006825748, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 245, 246, 260, 300, 500, ...","[ , , , , 0, 3, , , , , , , , , , ...","[ , , , , 0, 8, , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, f, a, c, a, b, c, a, ...","[(MiU)006825748, sdr-miu006825748, MiU, MiU, a..."
974,06393ntm a2200673 a 4500,"[001, 003, 005, 006, 007, 008]","[006816085, MiAaHDL, 20200516000000.0, m ...","[035, 035, 040, 043, 100, 245, 246, 246, 260, ...","[ , , , , 0, 1, 3, 3, , , , , , , , ...","[ , , , , , 0, , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, f, i, a, i, ...","[(MiU)006816085, sdr-miu006816085, MiU, MiU, a..."
776,01942ntm a2200313 a 4500,"[001, 003, 005, 006, 007, 008]","[002601793, MiAaHDL, 20190719000000.0, m ...","[035, 035, 035, 035, 040, 245, 300, 500, 500, ...","[ , , , , , 0, , , , 8, , , , , , ...","[ , , , , , 0, , , , , , , , , , ...","[a, a, a, a, a, c, e, d, a, k, f, a, f, b, c, ...","[(MiU)002601793, sdr-miu002601793, (OCoLC)7060..."
466,06066ntm a2200541 a 4500,"[001, 003, 005, 006, 007, 008]","[006812468, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 245, 246, 246, 260, ...","[ , , , , 1, 1, 3, 3, , , , , , , , ...","[ , , , , , 0, , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, f, i, a, i, ...","[(MiU)006812468, sdr-miu006812468, MiU, MiU, a..."
464,06597ntm a2200565 a 4500,"[001, 003, 005, 006, 007, 008]","[006804824, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 245, 246, 260, 300, ...","[ , , , , 1, 1, 3, , , , , , , , , ...","[ , , , , , 0, , , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, f, i, a, c, ...","[(MiU)006804824, sdr-miu006804824, MiU, MiU, a..."
238,08164ntm a2200589 a 4500,"[001, 003, 005, 006, 007, 008]","[005977493, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 240, 245, 246, 246, ...","[ , , , , 1, 1, 1, 3, 3, , , , , , , ...","[ , , , , , 0, 0, , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, 6, a, f, i, ...","[(MiU)005977493, sdr-miu005977493, MiU, MiU, a..."
1038,05903ntm a2200553 a 4500,"[001, 003, 005, 006, 007, 008]","[006815062, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 240, 245, 260, 300, ...","[ , , , , 1, 1, 1, , , , , , , , , ...","[ , , , , , 0, 0, , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, 6, a, f, c, ...","[(MiU)006815062, sdr-miu006815062, MiU, MiU, a..."
65,08076ntm a2200817 a 4500,"[001, 003, 005, 006, 007, 008]","[006817861, MiAaHDL, 20190719000000.0, m ...","[035, 035, 040, 043, 100, 245, 246, 260, 300, ...","[ , , , , 1, 1, 3, , , , , , , , , ...","[ , , , , , 0, 8, , , , , , , , , ...","[a, a, a, c, e, a, 6, a, d, 6, a, f, a, a, c, ...","[(MiU)006817861, sdr-miu006817861, MiU, MiU, a..."


### Using `pymarc` for MARC XML
Another approach would be to parse the MARC XML with the widely used `pymarc` module and then see if we can create a DataFrame that better matches our expectations.

In [11]:
import pymarc
mich1 = pymarc.marcxml.parse_xml_to_array('michigan/data/michigan-1.xml')
data = json.loads(mich1[0].as_json())

In [33]:
mich1_df = pd.json_normalize(mich1[0].as_dict()['fields'])

In [38]:
mich1_df.head()

Unnamed: 0,001,003,005,006,007,008,035.subfields,035.ind1,035.ind2,040.subfields,...,CAT.ind2,FMT.subfields,FMT.ind1,FMT.ind2,HOL.subfields,HOL.ind1,HOL.ind2,974.subfields,974.ind1,974.ind2
0,10740758.0,,,,,,,,,,...,,,,,,,,,,
1,,MiAaHDL,,,,,,,,,...,,,,,,,,,,
2,,,20190719000000.0,,,,,,,,...,,,,,,,,,,
3,,,,m d,,,,,,,...,,,,,,,,,,
4,,,,,cr bn ---auaua,,,,,,...,,,,,,,,,,


In [26]:
mich1_fields = mich1[0].get_fields()

In [28]:
mich1[0].as_dict()

{'leader': '08645ntm a2200817 a 4500',
 'fields': [{'001': '010740758'},
  {'003': 'MiAaHDL'},
  {'005': '20190719000000.0'},
  {'006': 'm        d        '},
  {'007': 'cr bn ---auaua'},
  {'008': '110909q17811799tu            00| ||ota d'},
  {'035': {'subfields': [{'a': '(MiU)010740758'}], 'ind1': ' ', 'ind2': ' '}},
  {'035': {'subfields': [{'a': 'sdr-miu010740758'}],
    'ind1': ' ',
    'ind2': ' '}},
  {'040': {'subfields': [{'a': 'MiU'}, {'c': 'MiU'}, {'e': 'amremm'}],
    'ind1': ' ',
    'ind2': ' '}},
  {'041': {'subfields': [{'a': 'ota'}], 'ind1': '0', 'ind2': ' '}},
  {'043': {'subfields': [{'a': 'n-us-mi'}, {'a': 'e-gx---'}, {'a': 'a-tu---'}],
    'ind1': ' ',
    'ind2': ' '}},
  {'100': {'subfields': [{'6': '880-01'},
     {'a': 'Ahmet Resmî Efendi,'},
     {'c': 'Giridî,'},
     {'d': '1700-1783.'}],
    'ind1': '0',
    'ind2': ' '}},
  {'245': {'subfields': [{'6': '880-03'},
     {'a': 'Ahmet Resmî Efendinin dört kıta-yi kitabını havi mecmuadır,'},
     {'f': '[late 
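
One way to get a DataFrame that keeps tags, indicators, and subfield codes as columns is to flatten the `as_dict()` structure shown above into one row per control field or subfield. The sketch below is only an illustration: the `marc_rows` helper and the column names are hypothetical, and the record is abridged from the output above.

```python
import pandas as pd

# Abridged from the as_dict() output above.
record = {
    "leader": "08645ntm a2200817 a 4500",
    "fields": [
        {"001": "010740758"},
        {"003": "MiAaHDL"},
        {"040": {"subfields": [{"a": "MiU"}, {"c": "MiU"}, {"e": "amremm"}],
                 "ind1": " ", "ind2": " "}},
        {"041": {"subfields": [{"a": "ota"}], "ind1": "0", "ind2": " "}},
    ],
}


def marc_rows(record):
    """Yield one flat dict per control field or per data-field subfield."""
    for field in record["fields"]:
        for tag, value in field.items():
            if isinstance(value, str):
                # Control fields carry no indicators or subfields.
                yield {"tag": tag, "ind1": None, "ind2": None,
                       "code": None, "value": value}
                continue
            for subfield in value["subfields"]:
                for code, text in subfield.items():
                    yield {"tag": tag, "ind1": value["ind1"],
                           "ind2": value["ind2"], "code": code, "value": text}


marc_df = pd.DataFrame(list(marc_rows(record)))
print(marc_df.shape)  # (6, 5)
```

In this long format, repeated fields and repeated subfield codes simply become additional rows, which may make `compare` and `drop_duplicates` easier to apply than with list-valued columns.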

A third option would be to pass an XPath expression to the standard pandas `read_xml` function, extract the control fields and data fields, and see how closely the resulting DataFrame matches our expectations.

In [39]:
mich1000 = pd.read_xml('michigan/data/michigan-1000.xml',
                      xpath='record/*[self::controlfield or self::datafield]')

In [40]:
mich1000

Unnamed: 0,tag,controlfield,ind1,ind2,subfield
0,001,002601793,,,
1,003,MiAaHDL,,,
2,005,20190719000000.0,,,
3,006,m d,,,
4,007,cr bn ---auaua,,,
5,008,930310q16001650xx 00|| ara d,,,
6,035,,,,(MiU)002601793
7,035,,,,sdr-miu002601793
8,035,,,,(OCoLC)706055949
9,035,,,,(RLIN)MIUH93-A33
