# Parse the Price Checklist

PDFs are evil. They look pretty on screen or on paper, but trying to parse them is an unmitigated nightmare. This is my attempt at parsing this one PDF: Price's *Chewing Lice with Host Associations*.

The first trick to parsing PDFs is to get the content into a form that gives you a fighting chance of parsing it. I work on Linux, which limits my options, but fortunately there is an excellent toolset available here: Poppler utils (`poppler-utils`). I'll be using it to convert the PDF into HTML.

The next trick is to use a library to parse the HTML. Parsing HTML with regular expressions is sometimes doable, but it is really the wrong tool for this job. I use Beautiful Soup 4 for this, but there are other excellent choices. lxml is a backing library for Beautiful Soup.

Side note: I prefer to use virtual environments, but that's just a personal preference.

In [1]:
#!pip install beautifulsoup4 lxml pandas tqdm

## Convert the PDF to HTML

I'm using Poppler's pdftohtml utility for the conversion. The switches are:
- `-c` to create "complex HTML" that keeps the white space in the document. White space has meaning and we may want to use it for parsing.
- `-q` keeps the utility from printing a lot of messages.

Note: It will produce a bunch of PNG files, which are background images for the pages. It also produces an "outline", which is the list of pages with links to them.

In [2]:
#!pdftohtml -c -q data/Price_louse.pdf data/Price_louse.pdf
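
If you would rather drive the conversion from Python than from a shell escape, something like the following should work. This is only a sketch: `pdftohtml_command` and `convert` are names I made up for illustration, and it assumes `pdftohtml` from poppler-utils is on the `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_base=None):
    """Build the argument list for pdftohtml with the same -c and -q switches."""
    pdf_path = Path(pdf_path)
    # Like the shell command above, reuse the PDF path as the output base name
    out_base = Path(out_base) if out_base else pdf_path
    return ['pdftohtml', '-c', '-q', str(pdf_path), str(out_base)]


def convert(pdf_path):
    """Run the conversion; raises if pdftohtml is missing or exits with an error."""
    if shutil.which('pdftohtml') is None:
        raise FileNotFoundError('pdftohtml not found; install poppler-utils')
    subprocess.run(pdftohtml_command(pdf_path), check=True)


print(pdftohtml_command('data/Price_louse.pdf'))
```

Calling `convert('data/Price_louse.pdf')` would then do the same job as the commented-out shell line.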

## Examine the output

Note: The output HTML (`Price_louse.pdf-<page_no>.html`) looks better in Chromium than in Firefox.

What the `-c` option for the conversion did was wrap all text in `<p>` tags with the class as a font reference. Importantly, the output contains absolute positioning in the style attributes. We can use this to prepare the text for parsing. So, open up the debug window in the browser (control-shift-I in Firefox) or a text editor to look at the output.

There will be some HTML entities in the output like `&#160;` = `&nbsp;` which is a non-breaking space. We will clean this up later.

## And now the hunt begins

- In this document, page one is formatted differently, so we'll treat it differently. Every other page seems to be formatted alike.
- The `<p>` classes are numbered from zero (`ft00`, `ft01`, ...), and the first class is used for the page header/footer, so those should be pretty easy to strip.
- We can use the `top` value in the style attribute to find lines, and the `left` value with a cutoff to find columns.
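
To make the last point concrete, here is a small sketch of pulling `top` and `left` out of a style string and turning `left` into a column number. The style string and the page width are made-up sample values, and `position` and `column` are hypothetical helper names.

```python
import re

POSITION = re.compile(r'(top|left):(\d+)px')


def position(style):
    """Return (top, left) in pixels from an inline style string."""
    found = {key: int(value) for key, value in POSITION.findall(style)}
    return found['top'], found['left']


def column(left, page_width):
    """0 for the left column, 1 for the right, split at the page midpoint."""
    return 0 if left < page_width // 2 else 1


# Made-up sample of the kind of style attribute the -c option emits
style = 'position:absolute;top:123px;left:459px;white-space:nowrap'
top, left = position(style)
print(top, left, column(left, page_width=892))  # 123 459 1
```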

## Start by reading the files.

In [3]:
import re
from collections import Counter
from pathlib import Path

import tqdm

In [4]:
FLAGS = re.VERBOSE

In [5]:
DATA_DIR = Path('.') / 'data'
PDF = DATA_DIR / 'Price_louse.pdf'

In [6]:
from bs4 import BeautifulSoup

## Create a dictionary of pages

Each item will itself be a dictionary.

In [7]:
PAGES = {}

## Build the initial page data

In [8]:
for page in DATA_DIR.glob('Price_louse.pdf-*.html'):
    # Escape the dots and match on the file name only, not the whole path
    match = re.search(r'Price_louse\.pdf-(\d+)\.html$', page.name)

    if not match:
        continue

    page_no = int(match.group(1))

    with open(page) as in_file:
        doc = in_file.read()

    soup = BeautifulSoup(doc, features='lxml')

    img = soup.img

    PAGES[page_no] = {
        'page': soup,
        'width': int(img['width']),
        'height': int(img['height']),
    }

len(PAGES)

188

## Remove unneeded tags from the pages

We're going to wind up with a set of text fragments for each page. Each fragment will be wrapped in a `<p>` tag.

In [9]:
for page_no, page in PAGES.items():
    paras = page['page'].find_all('p')
    removes = ['ft00', 'ft01', 'ft02'] if page_no == 1 else ['ft00']
    paras = [p for p in paras if p['class'][0] not in removes]
    page['paras'] = paras

## Sort the text fragments by column, top, left

We're using the midpoint of the width of the page to determine what column the text belongs to.

This should put all of the text in each page in order, provided we did the column separation correctly.

While we're at it we remove the style and class attributes.

In [10]:
for page_no, page in PAGES.items():
    page['ordered'] = []

    midpoint = page['width'] // 2

    for p in page['paras']:
        top = int(re.search(r'top:(\d+)px', p['style']).group(1))
        left = int(re.search(r'left:(\d+)px', p['style']).group(1))

        col = 0 if left < midpoint else 1

        del p['class']
        del p['style']

        page['ordered'].append((page_no, col, top, left, p))

    page['ordered'] = sorted(page['ordered'], key=lambda p: tuple(p[:4]))
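
Here is a toy version of that sort, to show why putting the column ahead of `top` matters on a two-column page. The tuples are invented, with a text label standing in for the `<p>` tag.

```python
# (page_no, col, top, left, text), deliberately out of order
fragments = [
    (2, 1, 100, 470, 'right column, first line'),
    (2, 0, 130, 60, 'left column, second line'),
    (2, 0, 100, 90, 'left column, first line, second fragment'),
    (2, 0, 100, 60, 'left column, first line, first fragment'),
]

# Sorting on (page_no, col, top, left) reads down the left column first,
# then down the right column, and left to right within a line.
ordered = sorted(fragments, key=lambda frag: frag[:4])

for frag in ordered:
    print(frag[4])
```

Sorting on `top` alone would interleave the two columns, because the right-hand line at `top=100` would land between the left-hand lines.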

## Stitch all of the text together into a single document

We're just joining the pages here.

In [11]:
SINGLE = []

for page_no in sorted(PAGES.keys()):
    page = PAGES[page_no]
    SINGLE.extend([p for p in page['ordered']])

## Join lines

This should put all of the lines back into a form similar to how they appear in the document.

**Note: LINES will contain strings and not Beautiful Soup objects.**

In [12]:
LINES = []

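prev = None
line = []

for p in SINGLE:
    curr = (p[0], p[1], p[2])  # (page_no, col, top) identifies one visual line
    if curr != prev and line:
        LINES.append(''.join(line))
        line = []
    prev = curr  # always track the current line, including the very first fragment
    line.append(str(p[4]))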

# Handle any last remaining line fragments
if line:
    LINES.append(''.join(line))
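
The same grouping can also be written with `itertools.groupby`, which avoids the bookkeeping for the previous key. This is just an alternative sketch on invented fragments, not what the notebook runs.

```python
from itertools import groupby

# (page_no, col, top, left, text), already sorted as in the previous step
fragments = [
    (1, 0, 100, 60, 'Line one'),
    (1, 0, 130, 60, 'Line two, '),
    (1, 0, 130, 140, 'continued'),
    (1, 1, 100, 470, 'Line three'),
]

# Fragments sharing (page_no, col, top) belong to the same visual line
lines = [
    ''.join(frag[4] for frag in group)
    for _, group in groupby(fragments, key=lambda frag: frag[:3])
]

print(lines)  # ['Line one', 'Line two, continued', 'Line three']
```

`groupby` only merges adjacent items, so it relies on the fragments being sorted first.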

## Convert HTML entities into characters

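For reasons that make sense in a web setting, but not here, Beautiful Soup escapes `&`, `<`, and `>` as HTML entities when a tag is converted back to a string, and non-breaking spaces come through as `\xa0` characters. We don't want these.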

In [13]:
LINES = [ln.replace('&amp;', '&') for ln in LINES]
LINES = [ln.replace('&lt;', '<') for ln in LINES]
LINES = [ln.replace('&gt;', '>') for ln in LINES]
LINES = [ln.replace('\xa0', ' ') for ln in LINES]
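
The standard library's `html.unescape` is another way to do this. It handles every named and numeric entity in one pass, which also sidesteps a subtle issue with chained replaces: `&amp;lt;` would be unescaped twice. A small sketch on a made-up string:

```python
import html

raw = 'Hopkins &amp; Clay,&#160;1952 &lt;p&gt;'

# unescape() turns &#160; into the character \xa0, so that still needs replacing
text = html.unescape(raw).replace('\xa0', ' ')
print(text)  # Hopkins & Clay, 1952 <p>

# A single pass leaves a doubly escaped entity only half unescaped
print(html.unescape('&amp;lt;'))  # &lt;
```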

## Remove the paragraph tags

In [14]:
LINES = [ln.replace('<p>', '') for ln in LINES]
LINES = [ln.replace('</p>', '') for ln in LINES]

## Join lines that overflow in the document itself (Step 1)

This will make parsing easier.

So we want to turn something like:
```
*Macropus antilopinus (Gould) [Diprotodontia:
Macropod.]
```

into this:
```
*Macropus antilopinus (Gould) [Diprotodontia: Macropod.]
```

It looks like we can use some heuristics for conservative line joining:
- If the line starts with a `<b>` or a `<i>` tag then it's a line start.
- If the line starts with an `*` then it's a line start.
- If the line starts with a space then it's a line start.
- Everything else is an "overflow" line and should be joined to the one above.

In [15]:
JOINED = []
line = []

starters = re.compile(r'^(<b>|<i>|\*|\s)')

for ln in LINES:
    if starters.match(ln) and line:
        JOINED.append(' '.join(line))
        line = []
    line.append(ln)

# Handle any last remaining line fragments
if line:
    JOINED.append(' '.join(line))
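
As a quick check of the heuristics, here they are wrapped in a function and run on the example from above. `join_overflow` is a name I made up for this sketch, and the second host line is a placeholder.

```python
import re

STARTERS = re.compile(r'^(<b>|<i>|\*|\s)')


def join_overflow(lines):
    """Glue each non-starter line onto the line before it."""
    joined, current = [], []
    for ln in lines:
        if STARTERS.match(ln) and current:
            joined.append(' '.join(current))
            current = []
        current.append(ln)
    if current:
        joined.append(' '.join(current))
    return joined


sample = [
    '*Macropus antilopinus (Gould) [Diprotodontia:',
    'Macropod.]',
    '*placeholder for the next host line',
]

for ln in join_overflow(sample):
    print(ln)
```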

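Some lines are broken in the middle of a `<i></i>` or `<b><i></i></b>` run, so we rejoin those. The `nomen nudum` and `unknown` italics are temporarily swapped for placeholders so that they are not merged into their neighbors.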

In [16]:
TEXT = '\n'.join(JOINED)
TEXT = TEXT.replace('<i>nomen nudum</i>', '|nomen nudum|')
TEXT = TEXT.replace('<i>unknown</i>', '|unknown|')
TEXT = TEXT.replace('</i></b>\n<b><i>', ' ')
TEXT = re.sub(r'</i>\n<i>(?!<)', ' ', TEXT)
TEXT = TEXT.replace('</b> <b>', ' ')
TEXT = TEXT.replace('|unknown|', '<i>unknown</i>')
TEXT = TEXT.replace('|nomen nudum|', '<i>nomen nudum</i>')

There are nested brackets in some references which will make regular expression parsing of the reference difficult. So I'll replace the interior square brackets `[]` with curly braces `{}`.

For example:
`[REF: Eichler [& Vasjukova], 1980:343]`
will become
`[REF: Eichler {& Vasjukova}, 1980:343]`.

In [17]:
TEXT = re.sub(r'\[& ( [^\]]+ ) \]', r'{&\1}', TEXT, flags=FLAGS)
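
Here is the same substitution run on the example reference on its own. Note that in verbose mode the spaces in the pattern are ignored, so the space after `&` ends up inside the captured group. Brackets that do not start with `[&` are left alone.

```python
import re

INNER = re.compile(r'\[& ( [^\]]+ ) \]', flags=re.VERBOSE)

ref = '[REF: Eichler [& Vasjukova], 1980:343]'
print(INNER.sub(r'{&\1}', ref))
# [REF: Eichler {& Vasjukova}, 1980:343]

# Other interior brackets are untouched
author = '(Fabricius [J.C.], 1777:309)'
print(INNER.sub(r'{&\1}', author))
# (Fabricius [J.C.], 1777:309)
```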

In [18]:
JOINED = TEXT.splitlines()

In [19]:
# with open('Price_louse_text.txt', 'w') as out_file:
#     for p in JOINED:
#         out_file.write(f'{p}\n')

## Join lines that overflow in the document itself (Step 2)

The joining above did not link up all of the lines we needed. Step 1 was more about how lines started and this step is more about how lines end. There are situations where a line starts with a `<b>` or `<i>` and it is still a continuation line.

1. Look at the parity of parentheses and square brackets. If they're left open then join the line below. Ex:

`
<i><b>chloropodis</b></i> (Schrank, 1803:189) (subgenus
<i>Eulaemobothrion</i>) [in <i>Pediculus</i>]
`

to:

`
<i><b>chloropodis</b></i> (Schrank, 1803:189) (subgenus <i>Eulaemobothrion</i>) [in <i>Pediculus</i>]
`

2. We also have lines that end with an equals sign `=`, with `TYPE:`, or with `of`, all of which expect more text on the next line. Ex:

`
<i>buteonis</i> (Fabricius [J.C.], 1777:309) [in <i>Pediculus</i>] =
<i><b>maximum</b></i> [REF: Hopkins & Clay, 1952:183]
`

to:

`
<i>buteonis</i> (Fabricius [J.C.], 1777:309) [in <i>Pediculus</i>] = <i><b>maximum</b></i> [REF: Hopkins & Clay, 1952:183]
`

In [20]:
def dangling_paren(line):
    opens = sum(1 for c in line if c in '([')
    closes = sum(1 for c in line if c in ')]')
    return opens > closes

In [21]:
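LINES = []

LINE_ITER = iter(JOINED)

for ln in LINE_ITER:

    # Join while brackets are left open; stop quietly if the input runs out
    while dangling_paren(ln):
        nxt = next(LINE_ITER, None)
        if nxt is None:
            break
        ln += ' ' + nxt

    # Join while the line ends with a marker that expects more text
    while re.search(r'(=|TYPE:|of)$', ln):
        nxt = next(LINE_ITER, None)
        if nxt is None:
            break
        ln += ' ' + nxt

    LINES.append(ln)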
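
To see both rules at work on the two examples above, here is a self-contained variant. It differs from the cell above in one respect: it checks both conditions in a single loop, so a line that picks up a new open bracket after an `=` join keeps going. `is_open` and `join_continuations` are names invented for this sketch.

```python
import re

CONTINUES = re.compile(r'(=|TYPE:|of)$')


def is_open(line):
    """True if the line has more opening than closing brackets."""
    opens = sum(line.count(c) for c in '([')
    closes = sum(line.count(c) for c in ')]')
    return opens > closes


def join_continuations(lines):
    out = []
    it = iter(lines)
    for ln in it:
        while is_open(ln) or CONTINUES.search(ln):
            nxt = next(it, None)
            if nxt is None:  # ran out of input; keep what we have
                break
            ln += ' ' + nxt
        out.append(ln)
    return out


sample = [
    '<i><b>chloropodis</b></i> (Schrank, 1803:189) (subgenus',
    '<i>Eulaemobothrion</i>) [in <i>Pediculus</i>]',
    '<i>buteonis</i> (Fabricius [J.C.], 1777:309) [in <i>Pediculus</i>] =',
    '<i><b>maximum</b></i> [REF: Hopkins & Clay, 1952:183]',
]

for ln in join_continuations(sample):
    print(ln)
```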

## Fix specific lines

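There is broken data in the PDF: the *C. yolandae* host line has lost its indentation, so we restore the leading spaces that mark it as a host line.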

In [22]:
FIXED = []
for ln in LINES:

    # This host line lost its indentation in the PDF; put it back
    if ln.startswith('<i>C. yolandae</i>'):
        ln = ln.replace('<i>C. yolandae</i>', '  <i>C. yolandae</i>')
        print('<i>C. yolandae</i>')

    FIXED.append(ln)

LINES = FIXED

<i>C. yolandae</i>


## Clean up text for human readability

Align type hosts with the rest of the hosts.

In [23]:
LINES = [f' {ln}' if ln.startswith('*') else ln for ln in LINES]

## Examine results so far

In [24]:
with open(DATA_DIR / 'Price_louse_text.txt', 'w') as out_file:
    for p in LINES:
        out_file.write(f'{p}\n')

## Finally, we can start parsing

Quoting from the paper:

- Checklist is ordered alphabetically by chewing louse family, genus, species, and subspecies.
- Valid generic and specific names appear in bold face.
- Taxonomic names are followed by the author, year of publication, and the beginning page number of the description.
- Author names are not abbreviated except for Linnaeus (=L.), Blagoveshtchensky (=Blagov.), Burmeister (=Burm.), and Timmermann (=Timmer.).
- Generic name listings include the type species.
- Where the genus name has changed from the original description, “in” followed by the original genus name appears in brackets.
- Junior synonyms appear in plain face type followed by an “=” and, in bold face, the name of the taxon that we regard as valid. This is followed by the citation for the synonomy. Where no reference is given, we believe the synonomy to be new.
- Associated hosts appear in alphabetical order indented below each louse taxon, with the name of the type host preceded by an asterisk.
- Host names are followed by the author, order and family and, for other than the type host, by a citation documenting the host-louse association.
- We have not attempted to cite the first publication documenting the association, instead favoring sources that we regard as most reliable.
- Where no source is given, the association is attributable to the authors of the checklist.
- Bird orders are abbreviated by the omission of “iformes,” and host family names are abbreviated by the omission of “idae.”
- Where “? Host” appears, we believe that the name of type host indicated by the author of the louse taxon cannot be assigned to valid host species.
- Where “? ID” appears, we believe that, while the name is valid, the host probably was misidentified.

### Regular expressions for parsing lines

#### Louse family

In [25]:
louse_family = re.compile(r"""<b> (?P<family> [A-Z]+ ) </b> $""", flags=FLAGS)

#### Louse genus

In [26]:
louse_genus = re.compile(r"""
    <i><b>       (?P<genus>      [A-Z][a-z]+)  </b></i> \s*
                 (?P<genus_ref>  .+ (?! TYPE:) )   \s*
    TYPE: \s <i> (?P<genus_type> [A-Za-z\s]+ ) </i>
    \s*           (?P<genus_type_ref> .* )?
    $ """, flags=FLAGS)

#### Louse genus synonym

In [27]:
louse_genus_syn = re.compile(r"""
    <i>            (?P<genus_syn>      [A-Z][a-z.]+ )          </i> \s*
                   (?P<genus_type_ref> [^=]+ )                      \s*
    = \s* <i><b>   (?P<genus>          [^<]+ )             </b></i> \s*
    (?: \s* \[     (?P<genus_new_syn>  New \s Syn) \.            \] \s* )?
    (?: \[REF: \s* (?P<genus_syn_ref>  [^\]]+ )                  \] \s* )?
    TYPE: \s* <i>  (?P<genus_type>     [A-Za-z\s]+ )           </i> \s*
    \s*            (?P<skip> .* )?
    $ """, flags=FLAGS)

#### Louse subgenus

In [28]:
louse_subgenus = re.compile(r"""
    <i>            (?P<subgenus>      [A-Z][a-z.]+ ) </i>     \s*
                   (?P<subgenus_ref>  [^=]+ )                 \s*
    = \s* subgenus \s of \s*
    <i><b>         (?P<genus>         [A-Z][a-z.]+ ) </b></i> \s*
    (?: \[REF: \s* (?P<genus_ref>     [^\]]+ )       \]       \s* )?
    (?: \s* \[     (?P<genus_new_syn> New \s Syn) \. \]       \s* )?
    TYPE: \s* <i>  (?P<subgenus_type> [A-Za-z\s]+ )  </i>     \s*
    \s*            (?P<subgenus_type_ref> .+ )?
    $ """, flags=FLAGS)

#### Louse species

In [29]:
louse_species = re.compile(r"""
    <i><b>           (?P<species>     [a-z.]+ )    </b></i>    \s*
                     (?P<species_ref> [^\(\[]+ )               \s*
    (?: \( subgenus \s <i> (?P<subgenus> [A-Z][a-z.]+) </i> \) \s* )?
    (?: \[ in \s <i> (?P<genus_ori>   [^<]+ )          </i> \] \s* )?
    $ """, flags=FLAGS)

In [30]:
louse_species2 = re.compile(r"""
    <i><b>           (?P<species>      [a-z.]+ )      </b></i> \s*
        \(           (?P<species_ref>  [^\)]+ )             \) \s*
    (?: \( subgenus \s <i> (?P<subgenus> [A-Z][a-z.]+) </i> \) \s* )?
    (?: \[ in \s <i> (?P<genus_ori>    [A-Z][a-z.]+ )  </i> \] \s* )?
    $ """, flags=FLAGS)

In [31]:
louse_species3 = re.compile(r"""
    <i><b>           (?P<species>     [a-z.]+ )    </b></i>    \s*
    (?P<species_ref> [^\[]+ \[ [^\]]+ \] [^\(\[]+ )            \s*
    (?: \( subgenus \s <i> (?P<subgenus> [A-Z][a-z.]+) </i> \) \s* )?
    (?: \[ in \s <i> (?P<genus_ori>   [^<]+ )          </i> \] \s* )?
    $ """, flags=FLAGS)

In [32]:
louse_species4 = re.compile(r"""
    <i><b> (?P<species>     [a-z.]+ )        </b></i>  \s*
    (?: nomen \s novum \s* )?
    (?: \[ in \s <i> (?P<genus_ori>   [^<]+  ) </i> \] \s* )?
    (?: \[ (?P<species_ref> [^\]]+ ) \] | \( (?P<species_ref2> [^\)]+ ) \) ) \s*
    $ """, flags=FLAGS)

#### Louse species synonym

In [33]:
louse_species_syn = re.compile(r"""
    <i> (?P<species_syn> (([A-Z][.] | [A-Z][a-z]+)  \s )? [a-z]+ )  </i> \s*
                     (?P<species_syn_ref> [^\[=]+ ) \s*
    (?: \[ in \s <i> (?P<genus_ori>       [^\[<]+ ) </i> \]  \s* )?
    (?: = \s* <i><b> (?P<species>         [^<]+   ) </b></i> \s* )
    (?: \[           (?P<new_syn> New \s Syn\. )         \]  \s* )?
    (?: \[(?: REF: )? \s* (?P<species_ref>  .+ )         \]      )?
    $""", flags=FLAGS)

In [34]:
louse_species_syn2 = re.compile(r"""
    <i> (?P<species_syn> (([A-Z][.] | [A-Z][a-z]+)  \s )? [a-z]+ )  </i> \s*
    \(? (?P<species_ref> [^\[]+ \[ [^\]]+ \] [^\(\[]+ )         \)?  \s*
    (?: \[ in \s <i> (?P<genus_ori>       [^<]+   ) </i> \]  \s* )?
    (?: = \s* <i><b> (?P<species>         [^<]+ )   </b></i> \s* )
    (?: \[(?: REF: )? \s* (?P<species_syn_ref>  [^\]]+ )     \]      )?
    $""", flags=FLAGS)

#### Louse subspecies

In [35]:
louse_subspecies = re.compile(r"""
    <i><b>           (?P<subspecies> [a-z.]+ \s [a-z.]+ ) </b></i> \s*
                     (?P<subspecies_ref> [^\(\[]+ )            \s*
    (?: \( subgenus \s <i> (?P<subgenus> [A-Z][a-z.]+) </i> \) \s* )?
    (?: \[ in \s <i> (?P<genus_ori>   [^<]+ )          </i> \] \s* )?
    $ """, flags=FLAGS)

In [36]:
louse_subspecies2 = re.compile(r"""
    <i><b>           (?P<subspecies> [a-z.]+ \s [a-z.]+ ) </b></i> \s*
        \(           (?P<subspecies_ref>  [^\)]+ )          \) \s*
    (?: \( subgenus \s <i> (?P<subgenus> [A-Z][a-z.]+) </i> \) \s* )?
    (?: \[ in \s <i> (?P<genus_ori>    [A-Z][a-z.]+ )  </i> \] \s* )?
    $ """, flags=FLAGS)

#### Louse subspecies synonym

In [37]:
louse_subspecies_syn = re.compile(r"""
    <i> (?P<subspecies_syn> [a-z.]+ \s [a-z.]+ )    </i> \s*
                     (?P<subspecies_syn_ref> [^\(\[=]+ ) \s*
    (?: \[ in \s <i> (?P<genus_ori>       [^\[<]+ ) </i> \]  \s* )?
    (?: = \s* <i><b> (?P<species>         [^<]+   ) </b></i> \s* )?
    (?: \[           (?P<new_syn> New \s Syn\. )         \]  \s* )?
    (?: \[(?: REF: )? \s* (?P<species_ref>  .+ )         \]      )?
    $""", flags=FLAGS)

In [38]:
louse_subspecies_syn2 = re.compile(r"""
    <i> (?P<subspecies_syn> [a-z.]+ \s [a-z.]+ )  </i> \s*
    \(  (?P<subspecies_syn_ref> [^)]+          )  \)   \s*
    (?: \[ in \s <i> (?P<genus_ori>       [^<]+   ) </i> \]        \s* )?
    (?: = \s* <i><b> (?P<species>         [^<]+ )   </b></i>       \s* )?
    (?: \[(?: REF: )? \s* (?P<species_ref>  [^\]]+ )     \]            )?
    $""", flags=FLAGS)

#### Host species

In [39]:
# lines = [ln for ln in LINES if re.match(r'\s', ln)]
# lines

In [40]:
host_species = re.compile(r"""
    \s+     (?P<type_host>     \*)?  
        <i> (?P<host_species> [^<]+ )                                 </i>  \s*
    (?:     (?P<invalid_host> \? \s (host | ID) )                           \s* )?
    (?: \(? (?P<host_auth>    [^\[]+    )                               \)? \s* )?
    (?: \[  (?P<host_order>   [\w.]+ ) :? \s+ (?P<host_family> \w+ \.? ) \] \s* )?
    (?: \[REF: \s* (?P<host_ref> [^\]]+ )                             \]    \s* )?
    $ """, flags=FLAGS)

In [41]:
# for ln in lines:
#     if not (host_species.match(ln)):
#         print(ln)
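
Here is the host pattern tried on the example line from earlier. The `<i>` markup in the sample is my assumption about how that line looks at this stage. The run shows one quirk worth knowing: the greedy author group also swallows the closing parenthesis and the trailing space, so the sketch strips them afterwards.

```python
import re

# Copy of the host_species pattern above, so this cell runs on its own
HOST = re.compile(r"""
    \s+     (?P<type_host>     \*)?
        <i> (?P<host_species> [^<]+ )                                 </i>  \s*
    (?:     (?P<invalid_host> \? \s (host | ID) )                           \s* )?
    (?: \(? (?P<host_auth>    [^\[]+    )                               \)? \s* )?
    (?: \[  (?P<host_order>   [\w.]+ ) :? \s+ (?P<host_family> \w+ \.? ) \] \s* )?
    (?: \[REF: \s* (?P<host_ref> [^\]]+ )                             \]    \s* )?
    $ """, flags=re.VERBOSE)

# Assumed markup for the example host line
line = ' *<i>Macropus antilopinus</i> (Gould) [Diprotodontia: Macropod.]'

fields = HOST.match(line).groupdict()
raw_auth = fields['host_auth']                 # 'Gould) ' as captured
fields['host_auth'] = raw_auth.strip(' ()')    # 'Gould' after cleanup

for key, value in fields.items():
    print(f'{key}: {value!r}')
```

The pattern requires leading white space, which is why an unindented host line has to be fixed up before parsing.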

### Parse the lines

In [42]:
hosts = []
generic = []
specific = []

family = {}
genus = {}
species = {}

for ln in LINES:

    if match := louse_family.match(ln):
        family = {
            'family': match.group('family'),
        }

    elif match := louse_genus.match(ln):
        genus = {
            'generic': match.group('genus'),
            'genus': match.group('genus'),
            'genus_ref': match.group('genus_ref'),
            'genus_type': match.group('genus_type'),
            'genus_type_ref': match.group('genus_type_ref'),
        }
        generic.append(genus)

    elif match := louse_genus_syn.match(ln):
        genus = {
            'generic': match.group('genus'),
            'genus': match.group('genus'),
            'genus_ref': '',  # match.group('genus_ref'),
            'genus_type': match.group('genus_type'),
            'genus_type_ref': match.group('genus_type_ref'),
            'genus_syn': match.group('genus_syn'),
            'genus_syn_ref': match.group('genus_syn_ref'),
            'genus_new_syn': match.group('genus_new_syn'),
        }
        generic.append(genus)

    elif match := louse_subgenus.match(ln):
        genus = {
            'generic': match.group('subgenus'),
            'subgenus': match.group('subgenus'),
            'subgenus_ref': match.group('subgenus_ref'),
            'genus': match.group('genus'),
            'genus_ref': match.group('genus_ref'),
            'subgenus_type': match.group('subgenus_type'),
            'subgenus_type_ref': match.group('subgenus_type_ref'),
            'genus_new_syn': match.group('genus_new_syn'),
        }
        generic.append(genus)

    elif match := louse_species.match(ln):
        species = {
            'specific': match.group('species'),
            'species': match.group('species'),
            'species_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
            'subgenus': match.group('subgenus'),
        }
        specific.append(species)

    elif match := louse_species2.match(ln):
        species = {
            'specific': match.group('species'),
            'species': match.group('species'),
            'species_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
            'subgenus': match.group('subgenus'),
        }
        specific.append(species)

    elif match := louse_species3.match(ln):
        species = {
            'specific': match.group('species'),
            'species': match.group('species'),
            'species_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
            'subgenus': match.group('subgenus'),
        }
        specific.append(species)

    elif match := louse_species4.match(ln):
        species = {
            'specific': match.group('species'),
            'species': match.group('species'),
            'species_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
        }
        specific.append(species)

    elif match := louse_species_syn.match(ln):
        species = {
            'specific': match.group('species_syn'),
            'species': match.group('species'),
            'species_ref': '',  # match.group('species_ref'),
            'species_syn': match.group('species_syn'),
            'species_syn_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
        }
        specific.append(species)

    elif match := louse_species_syn2.match(ln):
        species = {
            'specific': match.group('species_syn'),
            'species': match.group('species'),
            'species_ref': '',  # match.group('species_ref'),
            'species_syn': match.group('species_syn'),
            'species_syn_ref': match.group('species_ref'),
            'genus_ori': match.group('genus_ori'),
        }
        specific.append(species)

    elif match := louse_subspecies.match(ln):
        species = {
            'specific': match.group('subspecies'),
            'subspecies': match.group('subspecies'),
            'subspecies_ref': match.group('subspecies_ref'),
            'genus_ori': match.group('genus_ori'),
            'subgenus': match.group('subgenus'),
        }
        specific.append(species)
        
    elif match := louse_subspecies2.match(ln):
        species = {
            'specific': match.group('subspecies'),
            'subspecies': match.group('subspecies'),
            'subspecies_ref': match.group('subspecies_ref'),
            'genus_ori': match.group('genus_ori'),
            'subgenus': match.group('subgenus'),
        }
        specific.append(species)

    elif match := louse_subspecies_syn.match(ln):
        species = {
            'specific': match.group('subspecies_syn'),
            'subspecies_syn': match.group('subspecies_syn'),
            'subspecies_syn_ref': match.group('subspecies_syn_ref'),
            'genus_ori': match.group('genus_ori'),
            'species': match.group('species'),
            'species_ref': '',  # match.group('species_ref'),
        }
        specific.append(species)

    elif match := louse_subspecies_syn2.match(ln):
        species = {
            'specific': match.group('subspecies_syn'),
            'subspecies_syn': match.group('subspecies_syn'),
            'subspecies_syn_ref': match.group('subspecies_syn_ref'),
            'genus_ori': match.group('genus_ori'),
            'species': match.group('species'),
            'species_ref': '',  # match.group('species_ref'),
        }
        specific.append(species)

    elif match := host_species.match(ln):
        host = {
            'type_host': match.group('type_host'),
            'host_species': match.group('host_species'),
            'host_auth': match.group('host_auth'),
            'host_order': match.group('host_order'),
            'host_family': match.group('host_family'),
            'invalid_host': match.group('invalid_host'),
            'host_ref': match.group('host_ref'),
        }
        row = host | family | genus | species
        hosts.append(row)

    # Print lines that do not match a pattern
    else:
        print(ln)

## Output results

In [43]:
import pandas as pd

In [44]:
dfs = {
    'hosts': hosts,
    'generic': generic,
    'specific': specific,
}

In [45]:
OUTPUT = Path('.') / 'output'
OUTPUT.mkdir(parents=True, exist_ok=True)  # to_csv will not create the directory

for name, lst in dfs.items():
    df = pd.DataFrame(lst)
    path = OUTPUT / f'{name}.csv'
    df.to_csv(path, index=False)
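
The rows do not all share the same keys (species rows and subspecies rows differ, for instance), and pandas quietly fills the gaps. If you want to avoid the pandas dependency, the standard `csv` module can do the same with a little help. A sketch on invented rows, with `write_rows` as a made-up helper name:

```python
import csv
import io


def write_rows(rows, handle):
    """Write dicts with differing keys as CSV, leaving missing cells empty."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)


# Invented rows with different sets of keys
rows = [
    {'specific': 'a', 'species': 'a'},
    {'specific': 'b c', 'subspecies': 'b c'},
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```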

# Hmm...

### In retrospect the regular expressions were not a good idea at all.