# Entry Formats

An entry in Webster's Unabridged Dictionary can appear in any of the following formats:

### Base

```
BARBOTINE
Bar"bo*tine, n. Etym: [F.]

Defn: A paste of clay used in decorating coarse pottery in relief.
```

The key and the information line come first, followed by the definition in a separate paragraph that starts with the string `Defn: `.

### Multiple Definitions
```
HOLLOWNESS
Hol"low*ness, n.

1. State of being hollow. Bacon.

2. Insincerity; unsoundness; treachery. South.
```

When an entry has more than one definition, each gets its own numbered paragraph.

### Definition With Field Label
```
INADHERENT
In`ad*her"ent, a.

1. Not adhering.

2. (Bot.)

Defn: Free; not connected with the other organs.
```

If a definition carries a field label, the label stays on the numbered line and the definition text moves to a separate `Defn: ` paragraph two lines below.
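One way to handle this layout is to fold the `Defn: ` paragraph back into the numbered line that holds only the label. The following is a minimal sketch, not part of the original notebook; the helper name `merge_field_labels` and the `LABEL_ONLY` pattern are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a numbered line that holds only a field label, e.g. "2. (Bot.)"
LABEL_ONLY = re.compile(r"^\d+\. \([^)]+\)$")


def merge_field_labels(defs):
    """Join a label-only numbered line with the 'Defn: ' paragraph that follows it."""
    merged = []
    for d in defs:
        if merged and LABEL_ONLY.match(merged[-1]) and d.startswith("Defn: "):
            # drop the "Defn: " prefix and append the text to the labelled line
            merged[-1] += " " + d[len("Defn: "):]
        else:
            merged.append(d)
    return merged


sample = [
    "1. Not adhering.",
    "2. (Bot.)",
    "Defn: Free; not connected with the other organs.",
]
merged = merge_field_labels(sample)
```

With the INADHERENT entry, the second and third paragraphs end up as a single definition, while the first stays untouched.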

### Global Field Label With ABC Numbering

```
INFARCT
In*farct", n. [See Infarce.] (Med.)
 (a) An obstruction or embolus.
 (b) The morbid condition of a limited area resulting from such
obstruction; as, a hemorrhagic infarct.
```

If a field label applies to the whole entry, multiple definitions are given as a lettered list: `(a)`, `(b)`, and so on.

Note that there is no empty line between the info line and the definition section.
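As an illustration, the lettered items can be collected with a small loop that starts a new item at each `(x)` marker and treats every other line as a continuation. This is a sketch under the assumption that the input contains only the definition lines of a single entry; `parse_lettered` and `ITEM` are hypothetical names.

```python
import re

# a lettered item starts with optional spaces, "(x)", and a space
ITEM = re.compile(r"^ *\(([a-z])\) (.*)$")


def parse_lettered(lines):
    """Map each letter to its definition text, joining wrapped lines."""
    items = {}
    current = None
    for line in lines:
        m = ITEM.match(line)
        if m:
            current = m.group(1)
            items[current] = m.group(2)
        elif current is not None:
            # continuation of the previous lettered item
            items[current] += " " + line.strip()
    return items


lines = [
    " (a) An obstruction or embolus.",
    " (b) The morbid condition of a limited area resulting from such",
    "obstruction; as, a hemorrhagic infarct.",
]
items = parse_lettered(lines)
```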

### Supplementary Definitions

```
MARCHING
March"ing, a. & n.

Defn: ,fr. March, v. Marching money (Mil.), the additional pay of
officer or soldier when his regiment is marching.
 -- In marching order (Mil.), equipped for a march.
 -- Marching regiment. (Mil.) (a) A regiment in active service. (b)
In England, a regiment liable to be ordered into other quarters, at
home or abroad; a regiment of the line.
```

Additional definitions may be provided under a parent definition; each one is introduced by the characters ` --`.
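If these supplementary definitions should be kept rather than skipped, one option is to split the paragraph at each line that starts with ` --`. The sketch below assumes the input is the list of lines belonging to one definition; `split_supplementary` is a hypothetical helper, not something defined elsewhere in this document.

```python
def split_supplementary(lines):
    """Return (parent_text, [supplementary_texts]) for one definition paragraph."""
    parts = [""]
    for line in lines:
        if line.startswith(" --"):
            # a new supplementary definition begins here
            parts.append(line[3:].strip())
        else:
            # wrapped line: append to whichever part is currently open
            parts[-1] = (parts[-1] + " " + line.strip()).strip()
    return parts[0], parts[1:]


lines = [
    "Defn: ,fr. March, v. Marching money (Mil.), the additional pay of",
    "officer or soldier when his regiment is marching.",
    " -- In marching order (Mil.), equipped for a march.",
    " -- Marching regiment. (Mil.) (a) A regiment in active service. (b)",
    "In England, a regiment liable to be ordered into other quarters, at",
    "home or abroad; a regiment of the line.",
]
parent, extras = split_supplementary(lines)
```

For the MARCHING entry this yields the parent paragraph plus two supplementary definitions, with wrapped lines joined by single spaces.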

### Synonyms

```
MISUSE
Mis*use", v. t. Etym: [F. mésuser. See Mis-, prefix from French, and
Use.]

1. To treat or use improperly; to use to a bad purpose; to misapply;
as, to misuse one's talents. South.
The sweet poison of misused wine. Milton.

2. To abuse; to treat ill.
O, she misused me past the endurance of a block. Shak.

Syn.
 -- To maltreat; abuse; misemploy; misapply.
```

Occasionally, a list of synonyms may appear at the end of a record.
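A synonym block like this could be turned into a plain list by collecting everything after the `Syn.` line and splitting on semicolons. The following is a rough sketch that assumes the input holds the lines of one entry and that the synonym block runs to the end of it; `parse_synonyms` is a hypothetical name.

```python
def parse_synonyms(lines):
    """Collect the synonyms that follow a 'Syn.' line as a list of strings."""
    collecting = False
    chunks = []
    for line in lines:
        if line.startswith("Syn."):
            collecting = True
            chunks.append(line[len("Syn."):])
        elif collecting:
            chunks.append(line)
    # remove the " --" lead-in, then split the list on semicolons
    text = " ".join(chunks).replace("--", " ")
    return [s.strip(" .") for s in text.split(";") if s.strip(" .")]


lines = [
    "2. To abuse; to treat ill.",
    "Syn.",
    " -- To maltreat; abuse; misemploy; misapply.",
]
synonyms = parse_synonyms(lines)
```

Stripping the trailing period is a simplification: it would also remove the period of an abbreviation at the end of an item.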

### Notes

```
SNOWCAP
Snow"cap`, n. (Zoöl.)

Defn: A very small humming bird (Microchæra albocoronata) native of
New Grenada.

Note: The feathers of the top of the head are white and snining, the
body blue black with a purple and bronzy luster. The name is applied
also to Microchæra parvirostris of Central America, which is similar
in color.
```

A definition can have additional notes in the paragraph below, prefixed by a `Note: ` string.

## Parsing

### Constraints

Due 

```python
import re

import pandas as pd
import requests
```

```python
# stream the dictionary text so it can be processed line by line
url = "http://localhost:8080/temp/pg29765.txt"
r = requests.get(url, stream=True)

r.raise_for_status()
```

```python
# regex patterns for strings that signal the start of a section
patterns = {
    "content_marker": r"\*{3}.*\*{3}",
    "key": r"^[A-Z][^a-z]*$",
    # "Defn: ", a numbered item ("1. "), or a lettered item (" (a) ")
    "def": r"^(Defn: |\d+\. | *\([a-z]\) )",
    "note": r"^Note: ",
    # no trailing space: "Syn." can stand alone on its line
    "synonyms": r"^Syn\.",
    "extra": r"^ --"
}

patterns = {k: re.compile(v) for k, v in patterns.items()}
```
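To see how these patterns behave, they can be tried against lines taken from the example entries above. The sketch below restates the patterns so that it runs on its own, in a form that also matches a bare `Syn.` line; the `classify` helper and the `"text"` fallback label are assumptions for illustration.

```python
import re

PATTERNS = {
    "key": re.compile(r"^[A-Z][^a-z]*$"),
    "def": re.compile(r"^(Defn: |\d+\. | *\([a-z]\) )"),
    "note": re.compile(r"^Note: "),
    "synonyms": re.compile(r"^Syn\."),
    "extra": re.compile(r"^ --"),
}


def classify(line):
    """Return the name of the first pattern that matches, or 'text'."""
    for name, pattern in PATTERNS.items():
        if pattern.match(line):
            return name
    return "text"


samples = [
    "BARBOTINE",
    'Bar"bo*tine, n. Etym: [F.]',
    "Defn: A paste of clay used in decorating coarse pottery in relief.",
    "1. State of being hollow. Bacon.",
    " (a) An obstruction or embolus.",
    "Note: The feathers of the top of the head are white",
    "Syn.",
    " -- To maltreat; abuse; misemploy; misapply.",
]
labels = [classify(s) for s in samples]
```

Lines that match no pattern, such as the info line or a wrapped continuation line, fall through to `"text"`.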

```python
content_flag = False  # True while the cursor is between the two marker lines
ignore_flag = False   # True while inside a note, synonym, or extra section

current_section = ""
current_entry = None
current_def_idx = -1

entries = []

for line in r.iter_lines(decode_unicode=True):
    # toggle the content area when a marker line is reached
    if patterns["content_marker"].match(line):
        content_flag = not content_flag
        continue

    # ignore meta content
    if not content_flag:
        continue

    # ignore empty lines
    if line == "":
        continue

    # a key line starts a new entry
    if patterns["key"].match(line):
        ignore_flag = False

        current_entry = {"key": line, "info": "", "defs": []}
        current_def_idx = -1

        entries.append(current_entry)

        current_section = "key"

        continue

    # skip any content that precedes the first key
    if current_entry is None:
        continue

    # a definition marker starts a new definition
    if patterns["def"].match(line):
        ignore_flag = False

        current_entry["defs"].append(line)
        current_def_idx += 1

        current_section = "defs"

        continue

    # ignore synonyms, notes and additional def sections
    if (
        patterns["note"].match(line) or
        patterns["synonyms"].match(line) or
        patterns["extra"].match(line)
    ):
        ignore_flag = True

    # NOTE: checks that reset the ignore flag must come before this
    if ignore_flag:
        continue

    # if previous line was a key, next will always be info
    # NOTE: there are a few multiline keys, they will be parsed incorrectly
    if current_section == "key":
        current_section = "info"

    # append line to current section's parsed text
    if current_section == "info":
        current_entry["info"] += f" {line}"
    elif current_section == "defs":
        current_entry["defs"][current_def_idx] += f" {line}"
```

```python
df = pd.DataFrame(entries)

df.head()
```

|   | key | info | defs |
|---|-----|------|------|
| 0 | A | A (named a in the English, and most commonly ... | [Defn: The first letter of the English and of ... |
| 1 | A | A (# emph. #). | [1. Etym: [Shortened form of an. AS. an one. S... |
| 2 | A | A, prep. Etym: [Abbreviated form of an (AS. o... | [1. In; on; at; by. [Obs.] "A God's name." "To... |
| 3 | A | A. Etym: [From AS. of off, from. See Of.] | [Defn: Of. [Obs.] "The name of John a Gaunt." ... |
| 4 | A |  | [] |
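The parsed definitions keep their leading markers (`Defn: `, `1. `, `(a) `). If only the definition text is wanted, the marker could be stripped in a post-processing step. This is a minimal sketch; `strip_marker` and `MARKER` are hypothetical names, and the pattern assumes the same three marker forms described in the format section.

```python
import re

# the three marker forms that can open a definition
MARKER = re.compile(r"^(Defn: |\d+\. | *\([a-z]\) )")


def strip_marker(definition):
    """Remove a leading definition marker, if there is one."""
    return MARKER.sub("", definition, count=1)


cleaned = [
    strip_marker(d)
    for d in [
        "Defn: A paste of clay used in decorating coarse pottery in relief.",
        "1. State of being hollow. Bacon.",
        " (a) An obstruction or embolus.",
    ]
]
```

Applied to the table above, this could be mapped over each list in the `defs` column.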
