# WLSP Hachidaishu Integration

This notebook updates the WLSP (Word List by Semantic Principles) entries in the Hachidaishu dataset from the original edition to the revised edition where possible, while ensuring that entries unique to the Hachidaishu are preserved.
This is accomplished by:

1.  comparing the original WLSP to the revised version,
2.  mapping the Hachidaishu POS tagset (IPAdic) to the UniDic POS tagset,
3.  adding UD POS information following current UniDic->UD conventions,
4.  mapping tokens to their closest UniDic equivalents where UniDic WLSP entries are available,
5.  creating a new dictionary (database) of tokens that cannot currently be mapped to UniDic, or assigning new lemma ids where an external dataset provides UniDic support (WIP).


## Data

We load the data from several sources:

- Hachidaishu database (this repo: `hachidai.db`)
- WLSP1 (original WLSP): private data not published due to copyright
- WLSP2 (Expanded and revised WLSP--『分類語彙表増補改訂版データベース』（ver.1.0.1）): https://github.com/masayu-a/wlsp
- UniDic to WLSP2 mappings: https://github.com/masayu-a/WLSP2UniDic
- (Not implemented: UniDic to Historical WLSP mappings: https://github.com/masayu-a/WLSP2UniDic_historical)
- UniDic CWJ 3.1 lex file: LZMA compressed lex_3_1.csv from https://ccd.ninjal.ac.jp/unidic_archive/cwj/3.1.0/unidic-cwj-3.1.0.zip
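
As a rough sketch of how an LZMA-compressed CSV such as the lex file can be produced and read back, the following uses only the standard library. The sample rows and the in-memory buffer are placeholders, not real UniDic data; with the real file one would compress `lex_3_1.csv` from the archive instead.

```python
import csv
import io
import lzma

# Placeholder rows standing in for lines of lex_3_1.csv (not real UniDic entries).
sample_csv = "a,1,x\nb,2,y\n"

# Compress the CSV text with LZMA (.xz is the default container format).
compressed = lzma.compress(sample_csv.encode("utf-8"))

# Read it back in text mode, in the same way as lzma.open(path, "rt") on a file.
with lzma.open(io.BytesIO(compressed), "rt", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

Keeping the file compressed and streaming it through `csv` avoids storing the uncompressed CSV on disk.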

Summary table (English version at end of notebook)

| No. | Name   | Ordered by | Date     | Editor(s)                                            | Poems |
|----:|:-------|:-----------|:---------|:-----------------------------------------------------|------:|
|   1 | 古今   | 醍醐天皇   | ca. 905  | 紀友則・紀貫之・凡河内躬恒・壬生忠岑                 |  1100 |
|   2 | 後撰   | 村上天皇   | ca. 951  | 清原元輔・紀時文・大中臣能宣・源順・坂上望城         |  1425 |
|   3 | 拾遺   | 花山院     | ca. 1007 | 花山院                                               |  1351 |
|   4 | 後拾遺 | 白河天皇   | 1086     | 藤原通俊                                             |  1218 |
|   5 | 金葉   | 白河院     | ca. 1125 | 源俊頼                                               |   665 |
|   6 | 詞花   | 崇徳院     | ca. 1151 | 藤原顕輔                                             |   415 |
|   7 | 千載   | 後白河院   | 1188     | 藤原俊成                                             |  1288 |
|   8 | 新古今 | 後鳥羽院   | 1205     | 源通具・藤原有家・藤原定家・藤原家隆・藤原雅経・寂蓮 |  1978 |


In [None]:
from pathlib import Path
import csv
from collections import Counter, defaultdict

old_wlsp_path = Path("~/Dropbox/bunrui/").expanduser() # Not public.
new_wlsp_path = Path("~/Dropbox/bunrui/wlsp").expanduser() # From GitHub repo.
unidic_wlsp_path = Path("~/Dropbox/bunrui/wlsp2unidic").expanduser() # From GitHub repo.
hachidaishu_path = Path("~/Dropbox/bunrui/hachidai.db").expanduser()
unidic_lex_path = Path(
    "~/Dropbox/bunrui/lex_3_1.csv.xz"
).expanduser()  # Compressed from lex_3_1.csv in https://ccd.ninjal.ac.jp/unidic_archive/cwj/3.1.0/unidic-cwj-3.1.0.zip


### Hachidaishu


In [None]:
from hachidaishu import HachidaishuDB, Anthology

hachidaishu = HachidaishuDB(filename="hachidai.db")

metadata = {
    1: {"no.": 1, "name": "Kokinshu", "name_ja": "古今", "order_ja": "醍醐天皇", "order": "Daigo tenno", "date": "ca. 905", "editor": "紀友則・紀貫之・凡河内躬恒・壬生忠岑", "poems": 1100},
    2: {"no.": 2, "name": "Gosenshu", "name_ja": "後撰", "order_ja": "村上天皇", "order": "Murakami tenno", "date": "ca. 951", "editor": "清原元輔・紀時文・大中臣能宣・源順・坂上望城", "poems": 1425},
    3: {"no.": 3, "name": "Shuishu", "name_ja": "拾遺", "order_ja": "花山院", "order": "Kazan'in", "date": "ca. 1007", "editor": "花山院", "poems": 1351},
    4: {"no.": 4, "name": "Goshuishu", "name_ja": "後拾遺", "order_ja": "白河天皇", "order": "Shirakawa tenno", "date": "1086", "editor": "藤原通俊", "poems": 1218},
    5: {"no.": 5, "name": "Kin’yoshu", "name_ja": "金葉", "order_ja": "白河院", "order": "Shirakawain", "date": "ca. 1125", "editor": "源俊頼", "poems": 665},
    6: {"no.": 6, "name": "Shikashu", "name_ja": "詞花", "order_ja": "崇徳院", "order": "Sutokuin", "date": "ca. 1151", "editor": "藤原顕輔", "poems": 415},
    7: {"no.": 7, "name": "Senzaishu", "name_ja": "千載", "order_ja": "後白河院", "order": "Goshirakawain", "date": "1188", "editor": "藤原俊成", "poems": 1288},
    8: {"no.": 8, "name": "Shinkokinshu", "name_ja": "新古今", "order_ja": "後鳥羽院", "order": "Gotobain", "date": "1205", "editor": "源通具・藤原有家・藤原定家・藤原家隆・藤原雅経・寂蓮", "poems": 1978},
}

def decode_bg_id(s):
    """Returns a dict mapping the key 分類番号 (category number) to a WLSP category number string.
    IDs of the form CH-JP-0000 mark entries unique to the Hachidaishu."""
    xs = s.split("-")
    return {"分類番号": f"{xs[1][1]}.{xs[2]}"}


bgid_hachidaishu = set(decode_bg_id(t.bg_id)["分類番号"] for t in hachidaishu.tokens())
len(bgid_hachidaishu)
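
To illustrate what the decoding step produces, here is a small standalone sketch. The function body repeats the notebook's logic; the `BG-...` ID is a hypothetical example in the assumed hyphen-separated layout, and only `CH-JP-0000` is taken from the docstring.

```python
def decode_bg_id(s):
    """Same logic as the notebook's decode_bg_id."""
    xs = s.split("-")
    return {"分類番号": f"{xs[1][1]}.{xs[2]}"}

# Hypothetical ID in the assumed "BG-<class>-<item>-..." layout.
example = decode_bg_id("BG-01-1630-01-0100")
# The Hachidaishu-only placeholder named in the docstring.
placeholder = decode_bg_id("CH-JP-0000")
print(example, placeholder)
```

Because the class digit is taken from the second character of the second segment, the Hachidaishu-only ID decodes to a value that cannot occur in the WLSP, so such tokens simply fail every later lookup rather than colliding with a real category.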


### WLSP1 (Original WLSP; WLSPH in TEI/output)


In [None]:
with open(old_wlsp_path / "sakuin.dat") as f:
    # In-paragraph number (段落内番号) => sub-paragraph number (小段落番号)
    # The 分類番号 values match, but 段落番号 and 小段落番号 generally do not, so they need to be mapped
    reader = csv.DictReader(
        f, fieldnames=["reading", "orth", "分類番号", "段落番号", "小段落番号", "info", "note"]
    )
    wlsp1 = [r for r in reader]
    for r in wlsp1:
        if len(r["分類番号"]) == 5:
            r["分類番号"] += "0"
        r["段落番号"] = f'{int(r["段落番号"]):02}'

# Index: orth -> WLSP1 id
wlsp1_index = defaultdict(list)

for r in wlsp1:
    wlsp1_index[r["orth"]].append(r)

# Index: lemma (Hachidaishu only) -> WLSP1 id
wlsp1h_index = defaultdict(set)
for token in hachidaishu.tokens():
    wlsp1h_index[token.lemma].add(decode_bg_id(token.bg_id)["分類番号"])

wlsp1h_index

len(wlsp1_index), len(wlsp1h_index)


In [None]:
wlsp1[0], wlsp1h_index['や']

### WLSP2 (Expanded and revised WLSP; WLSP in TEI/output)


In [None]:
with open(new_wlsp_path / "bunruidb.txt") as f:
    reader = csv.DictReader(
        f,
        fieldnames=[
            "id",
            "見出し番号",
            "record_type",
            "類",
            "部門",
            "中項目",
            "分類項目",
            "分類番号",
            "段落番号",
            "小段落番号",
            "語番号",
            "orth_info",
            "orth",
            "reading",
            "reverse_reading",
        ],
    )
    wlsp2 = [r for r in reader]

# Index: orth -> WLSP2 id
wlsp2_index = defaultdict(list)

for r in wlsp2:
    wlsp2_index[r["orth"]].append(r)

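# Index: lemma (Hachidaishu only) -> WLSP id
# NOTE: bg_id holds the original (WLSP1) IDs, so this currently has the same content as wlsp1h_index.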
wlsp2h_index = defaultdict(set)
for token in hachidaishu.tokens():
    wlsp2h_index[token.lemma].add(decode_bg_id(token.bg_id)["分類番号"])

len(wlsp2_index), len(wlsp2h_index)


In [None]:
from collections import namedtuple
wlsp2describe = defaultdict(set)
WLSPRecord = namedtuple("WLSPRecord", ['類', '部門', '中項目', '分類項目'])
for token, records in wlsp2_index.items():
    for record in records:
        wlsp2describe[record['分類番号']].add(WLSPRecord(record['類'], record['部門'], record['中項目'], record['分類項目']))

for id, rs in wlsp2describe.items():
    assert len(rs) == 1
    wlsp2describe[id] = list(rs)[0]

wlsp2describe = dict(wlsp2describe)

### UniDic to WLSP2 mappings


In [None]:
with open(unidic_wlsp_path / "BunruiNo_LemmaID.txt") as f:
    wlsp_number2label = {}
    wlsp2unidic_lemma_id = {}
    lemma_id2wlsp2 = {}
    lines = f.readlines()[1:]
    for line in lines:
        b, lemma_id = line.rstrip().split("\t")
        number, label, sub_number = b.split(",")
        # Check if number to label mappings are unique: (✅ they are)
        assert number not in wlsp_number2label or label == wlsp_number2label[number]
        wlsp_number2label[number] = label
        wlsp2unidic_lemma_id[number] = lemma_id
        lemma_id2wlsp2[lemma_id] = number

len(wlsp_number2label), len(wlsp2unidic_lemma_id), len(lemma_id2wlsp2)


In [None]:
wlsp_number2label["1.1010"], wlsp2unidic_lemma_id["1.1010"], lemma_id2wlsp2["65788"]

### UniDic DB


In [None]:
import lzma
import csv
from collections import namedtuple  # we want to be able to hash dict entries

with lzma.open(unidic_lex_path, "rt") as f:
    # l1..4 are not used. l1 is pre-NFKC'd orth?
    fields = [
        "l1",
        "l2",
        "l3",
        "l4",
        "pos1",
        "pos2",
        "pos3",
        "pos4",
        "cType",
        "cForm",
        "lForm",
        "lemma",
        "orth",
        "pron",
        "orthBase",
        "pronBase",
        "goshu",
        "iType",
        "iForm",
        "fType",
        "fForm",
        "iConType",
        "fConType",
        "type",
        "kana",
        "kanaBase",
        "form",
        "formBase",
        "aType",
        "aConType",
        "aModType",
        "lid",
        "lemma_id",
    ]
    UniDicRecord = namedtuple("UniDicRecord", fields)
    reader = csv.DictReader(f, fieldnames=fields)
    unidic_db = [UniDicRecord(**r) for r in reader]

# Some useful mapping dicts # FIXME these are not 1:1 (i.e. a lemma string might have more than one entry/id), they should be defaultdict(set)
lemma_id2s = {e.lemma_id: e.lemma for e in unidic_db}
lemma_s2id = {e.lemma: e.lemma_id for e in unidic_db}

# This is an expensive index, but allows quick indexing into all entries from a given orth or lemma token.
unidic_token_index = defaultdict(set)
for e in unidic_db:
    unidic_token_index[e.lemma].add(e)
    unidic_token_index[e.orth].add(e)

len(unidic_db), len(lemma_id2s), len(lemma_s2id)
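
The FIXME above could be addressed along the following lines. This is only a sketch: `Entry` is a hypothetical, simplified stand-in for `UniDicRecord`, and the entries are made up to show one lemma string shared by two lemma ids.

```python
from collections import defaultdict, namedtuple

# Hypothetical, simplified record; the real UniDicRecord has many more fields.
Entry = namedtuple("Entry", ["lemma", "lemma_id"])

# Made-up entries: one lemma string shared by two lemma ids.
entries = [Entry("橋", "1"), Entry("橋", "2"), Entry("箸", "3")]

lemma_s2ids = defaultdict(set)
lemma_id2ss = defaultdict(set)
for e in entries:
    lemma_s2ids[e.lemma].add(e.lemma_id)
    lemma_id2ss[e.lemma_id].add(e.lemma)

# Homographs keep all of their ids instead of only the last one seen.
print(sorted(lemma_s2ids["橋"]))
```

With set-valued indexes, callers have to decide explicitly how to handle ambiguous lemmas instead of silently receiving whichever entry came last in the file.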


In [None]:
unidic_token_index["言ふ"]

## 1. Update WLSP mappings

To map from the old to the new WLSP where only the last digit(s) of the article number changed, we use a simple distance function (strictly speaking a similarity score: the number of leading digits two IDs share, from 0 for no match to 5 for identical IDs).
The distance function allows us to match entries of the same token that have slightly different IDs in the two editions.
In the future, this should be replaced by WordNet-like synset() functionality.
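
The prefix comparison described above can be sketched as follows. This is written independently of the notebook's implementation; `shared_prefix_digits` is a hypothetical helper name, and the candidate IDs are made up for illustration.

```python
def shared_prefix_digits(a, b):
    """Count leading digits shared by two IDs such as '1.1610', ignoring the dot."""
    count = 0
    for x, y in zip(a.replace(".", ""), b.replace(".", "")):
        if x != y:
            break
        count += 1
    return count

old_id = "1.1610"
candidates = ["1.1600", "1.1611", "2.1610"]  # made-up revised IDs
scores = {c: shared_prefix_digits(old_id, c) for c in candidates}
# Pick the candidate that shares the longest prefix with the old ID.
best = max(scores, key=scores.get)
print(scores, best)
```

Because WLSP numbers are hierarchical, a longer shared prefix means the two IDs sit closer together in the classification, which is why the highest score is taken as the best candidate.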


In [None]:
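def bgid_distance(a, b):
    """Compare WLSP IDs a and b and return how closely they match, from 0 (no match) to 5 (exact match).
    The nominal class (体の類, 1.X) and the verbal class (用の類, 2.X) differ from the first character,
    so they return 0; within the same class the result looks like this:
    >>> a = '1.1610'; b = '1.1600'; bgid_distance(a, b)
    3
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if i > 0:
                return i - 1  # Subtract one for the "."
            else:
                return i
    else:
        return len(a) - 1  # Subtract one for the "."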


bgid_distance("1.1610", "1.1600"), bgid_distance("1.1610", "1.1610"), bgid_distance(
    "1.1610", "1.1611"
)


In [None]:
from itertools import product

token_match_types = {
    "no": 0,  # Found in neither WLSP2 nor UniDic
    "full": 0, # Exact match with WLSP2
    "unidic_only": 0, # Found only in UniDic (no hit in WLSP2)
    "mixed": 0, # WLSP1 and WLSP2 IDs differ somewhat, but some IDs match
    "partial": 0, # No WLSP1 and WLSP2 IDs match exactly, but at least one pair is close
    "convergent": 0, # WLSP1 and WLSP2 IDs do not match at all (they differ at the X level of X.YYYYY)
}
wlsp1_to_2 = defaultdict(set)
token_wlsp1_to_2 = defaultdict(lambda: defaultdict(set))

for token, token_bgids in wlsp1h_index.items():
    if token not in wlsp2_index:
        if token in unidic_token_index:
            token_match_types["unidic_only"] += 1
            for t in unidic_token_index[token]:
                for id in token_bgids:
                    if t.lemma_id in lemma_id2wlsp2:
                        wlsp1_to_2[id].add(lemma_id2wlsp2[t.lemma_id])
                        token_wlsp1_to_2[token][id].add(lemma_id2wlsp2[t.lemma_id])
        else:
            token_match_types["no"] += 1
    else:
        bgid1 = token_bgids
        bgid2 = {e["分類番号"] for e in wlsp2_index[token]}

        if bgid1.issubset(bgid2):
            token_match_types["full"] += 1
            for id1 in token_bgids:
                wlsp1_to_2[id1].add(id1)
                token_wlsp1_to_2[token][id1].add(id1)
                # If we wanted to add all possible mappings instead of keeping the same mapping:
                # for e in wlsp2_index[token]:
                #     wlsp1_to_2[id1].add(e["分類番号"])
        elif len(bgid1.intersection(bgid2)) > 0:
            token_match_types["mixed"] += 1
            common_ids = bgid1.intersection(bgid2)
            for id in common_ids:
                wlsp1_to_2[id].add(id)
                token_wlsp1_to_2[token][id].add(id)
        else:
            matches = {(a, b): bgid_distance(a, b) for a, b in product(bgid1, bgid2)}
            top_matches = sorted(matches.items(), key=lambda x: x[1], reverse=True)
            if not any(v > 0 for k, v in top_matches):
                token_match_types["convergent"] += 1
            else:
                token_match_types["partial"] += 1
                # We take the top 1 match for now
                top_mapping = top_matches[0][0]
                # Use fresh names so the ID sets bgid1/bgid2 are not shadowed by single IDs.
                id1, id2 = top_mapping
                wlsp1_to_2[id1].add(id2)
                token_wlsp1_to_2[token][id1].add(id2)

token_match_types, len(wlsp1_to_2), len(wlsp1_to_2)/len(bgid_hachidaishu), wlsp1_to_2['1.1630']
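
The four categories used when a token is found in WLSP2 can be summarized in a compact standalone sketch. `classify` is a hypothetical helper, not part of the notebook, and the IDs are made up; it relies on the observation that two different IDs score above zero exactly when their first (class) digit matches.

```python
def classify(ids1, ids2):
    """Label how a token's original IDs (ids1) relate to its revised IDs (ids2)."""
    if ids1 <= ids2:
        return "full"  # every original ID also exists in the revised edition
    if ids1 & ids2:
        return "mixed"  # some, but not all, original IDs carry over unchanged
    # No exact overlap: check whether any pair at least shares the class digit.
    if any(a[0] == b[0] for a in ids1 for b in ids2):
        return "partial"
    return "convergent"

print(classify({"1.1610"}, {"1.1600"}))
```

Only the "partial" case needs the prefix score to choose a target ID; "full" and "mixed" keep the matching IDs as they are.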


In [None]:
len(token_wlsp1_to_2), token_wlsp1_to_2['来']

# Scratch

Below is work in progress/temporary workspace.

In [None]:
# Example WLSP2 record
{
    "id": "058059",
    "見出し番号": "55858",
    "record_type": "A",
    "類": "体",
    "部門": "自然",
    "中項目": "自然",
    "分類項目": "色",
    "分類番号": "1.5020",
    "段落番号": "12",
    "小段落番号": "01",
    "語番号": "01",
    "orth_info": "青",
    "orth": "青",
    "reading": "あお",
    "reverse_reading": "おあ",
}


In [None]:
[r for r in wlsp1 if r["分類番号"] == "1.1770"][0:3]


In [None]:
list((t.bg_id, decode_bg_id(t.bg_id), t) for t in hachidaishu.tokens())[:20]


In [None]:
bgid_hachidaishu = set(decode_bg_id(t.bg_id)["分類番号"] for t in hachidaishu.tokens())
len(bgid_hachidaishu)


In [None]:
from collections import defaultdict

wlsph_index = defaultdict(set)
for token in hachidaishu.tokens():
    wlsph_index[token.lemma].add(decode_bg_id(token.bg_id)["分類番号"])

wlsph_index['や']


In [None]:
len(set(r["orth"] for r in wlsp1)), len(wlsp1)


In [None]:
[r for r in wlsp1 if r["orth"] == "中"][:3]


In [None]:
Counter(r["orth"] for r in wlsp1).most_common(20)


In [None]:
wlsp1_tokens = set(r["orth"] for r in wlsp1)
wlsp2_tokens = set(r["orth"] for r in wlsp2)
wlsph_tokens = set(t for t in wlsph_index)


In [None]:
len(wlsp1_tokens), len(wlsp2_tokens), len(wlsp1_tokens.difference(wlsp2_tokens)) / len(
    wlsp1_tokens
), len(wlsp2_tokens.difference(wlsp1_tokens)) / len(wlsp2_tokens)


In [None]:
len(wlsph_tokens), len(wlsph_tokens.difference(wlsp2_tokens)), list(wlsph_tokens.difference(wlsp2_tokens))[:20]


In [None]:
for r in hachidaishu[:12]:
    print(r)

In [None]:
from itertools import groupby
ts = list(hachidaishu.tokens())
token_count = len(ts)
character_count = sum(len(c.surface) for c in ts)
anthology_count = len(list(groupby(hachidaishu, lambda r: r.anthology)))
poem_count = len(list(groupby(hachidaishu, lambda r: r.poem)))
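
The counting idiom above depends on `groupby` merging only adjacent records with equal keys. A minimal sketch with made-up (anthology, poem) pairs, assuming records arrive in corpus order and poem numbers restart in each anthology:

```python
from itertools import groupby

# Made-up (anthology, poem) pairs in corpus order; poem numbers restart per anthology.
records = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 1), ("B", 2)]

# groupby only merges adjacent equal keys, so the input must already be in corpus order.
n_anthologies = sum(1 for _ in groupby(records, key=lambda r: r[0]))
# Keying on the pair keeps equal poem numbers in different anthologies apart.
n_poems = sum(1 for _ in groupby(records, key=lambda r: (r[0], r[1])))
print(n_anthologies, n_poems)
```

Keying on the poem number alone gives the same result as long as the last poem of one anthology and the first poem of the next have different numbers; the pair key removes that assumption.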

## TEI conversion

In [None]:
from lxml import etree
from lxml import objectify as o

xml_ns = "{http://www.w3.org/XML/1998/namespace}"

from datetime import date    
today = date.today().isoformat()

E = o.ElementMaker(
    annotate=False,
    namespace="http://www.tei-c.org/ns/1.0",
    nsmap={None: "http://www.tei-c.org/ns/1.0"},
)

header = E.teiHeader(
    E.fileDesc(
        E.titleStmt(
            E.title("Hachidaishu dataset"),
            E.author(E.persName(E.forename("Hilofumi"), E.surname("Yamamoto"))),
            E.author(E.persName(E.forename("Bor"), E.surname("Hodošček"))),
        ),
        E.editionStmt(
            E.edition(f"{1}st edition", n=str(1)),
            E.respStmt(
                E.resp("Encoded by"),
                E.persName(E.forename("Bor"), E.surname("Hodošček"))
            )
        ),
        E.extent(
            E.measure(f"{character_count:,} characters", unit="characters", quantity=f"{character_count}"),
            E.measure(f"{token_count:,} morphemes", unit="morphemes", quantity=f"{token_count}"),
            E.measure(f"{poem_count:,} poems", unit="poems", quantity=f"{poem_count}"),
            E.measure(f"{anthology_count:,} anthologies", unit="anthologies", quantity=f"{anthology_count}"),
        ),
        E.publicationStmt(
            E.publisher("Bor Hodošček and Hilofumi Yamamoto"),
            # E.distributor(),
            # E.idno(),
            E.availability(
                E.licence(
                    E.ab("CC BY-SA 4.0",
                    E.ref(" Licence", target="https://creativecommons.org/licenses/by-sa/4.0/")) 
                )
            ),
            E.date(when=today),
        ),
        E.sourceDesc(
            E.listBibl(
                E.head("Works consulted in creating the original Hachidaishu database."),
                E.bibl("「新編国歌大観CD-ROM版」（1996）『新編国歌大観』編集委員会監修"),
                E.bibl("中村他（1999）「国文学研究資料館編集二十一代集データベース」"), # https://www.iwanami.co.jp/book/b266286.html
                E.bibl("新日本古典文学大系本二十一代集"),
                E.bibl("久保田（1979）『新潮日本古典集成の新古今集』"),
                E.bibl("ヴァージニア大学日本語テキストイニシアティブ監修"),
            )
        )
    ),
    E.encodingDesc(
        E.projectDesc(
            E.p("""This is a conversion of the space-delimited database format of the Hachidaishu dataset into TEI format. The original ChaSen IPAdic POS tags were automatically converted into UniDic POS tags, then into Universal Dependencies POS tags. Word List by Semantic Principles (WLSP) entries were (partially) updated from the floppy disk edition to version 1.0.1 of the expanded and revised edition."""),
        ),
        E.classDecl(
            E.taxonomy({f"{xml_ns}id": "NDC"},
                E.bibl(
                    E.title("Nippon Decimal Classification"),
                    E.edition("9"), 
                    E.ptr(target="https://ndc.datasearch.jp/")
                ),
            ),
        ),
    ),
    E.profileDesc(
        E.langUsage(
            E.language("Japanese (ca. 905-1205)", ident="ja")
        ),
        E.textClass(
            E.classCode("911", scheme="#NDC"),
            E.classCode("Q30038136", scheme="http://www.wikidata.org/entity/")
        ),
        # FIXME
        # E.textDesc(
        #     E.purpose(type="", degree=""),
        #     E.channel(mode="w"),
        #     n="Waka poetry",
        # )
    ),
    E.revisionDesc(
        E.listChange( # TODO Get from git
            E.change("upload to repo", when=today, who="Bor Hodošček"),
        ),
        status="published"
    )
)

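def format_token(token):
    """Formats a token's features as a CoNLL-U style string (Key=Value pairs joined by "|").
    In the future, these should rather be rendered as XML tags."""
    wlsp1 = decode_bg_id(token.bg_id)['分類番号']
    # .get() avoids inserting empty entries into the defaultdict on lookup.
    wlsp2s = token_wlsp1_to_2.get(token.lemma, {})
    candidates = wlsp2s.get(wlsp1, set())
    if wlsp1 in candidates:
        wlsp2 = wlsp1  # The same ID exists in WLSP2.
    elif candidates:
        # Use a WLSP2 ID mapped from this WLSP1 ID (a value, not a WLSP1 key).
        wlsp2 = sorted(candidates)[0]
    elif wlsp2s:
        # Fall back to a WLSP2 ID mapped from another sense of the same lemma.
        wlsp2 = sorted(i for ids in wlsp2s.values() for i in ids)[0]
    else:
        wlsp2 = "UNK"

    try:
        wd = wlsp2describe[wlsp2]
        description = f"{wd.類}-{wd.部門}-{wd.中項目}-{wd.分類項目}"
    except KeyError:
        description = "UNK"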
        
    if wlsp2 == "UNK":
        wlsp2 = ""
    else:
        wlsp2 = f"|WLSP={wlsp2}"

    # Makes WLSP description optional when unknown/unmapped.
    if description == "UNK":
        description = ""
    else:
        description = f"|WLSPDescription={description}"

    return f"UPosTag={token.ud_pos}|IPAPosTag={token.ipa_pos}|UniDicPosTag={token.unidic_pos}|LemmaReading={token.lemma_reading}|Kanji={token.kanji}|KanjiReading={token.kanji_reading}|WLSPH={wlsp1}{wlsp2}{description}"

def generate_body(db):
    body = E.body()
    for anthology, poems in groupby(db, key=lambda r: r.anthology):
        anthology_div = o.SubElement(body, "div", type="anthology", n=anthology.name)
        for poem, xs in groupby(poems, key=lambda r: r.poem):
            # TODO app/rdg[@wit]
            lg = o.SubElement(anthology_div, "lg", type="waka", n=str(poem))
            l = o.SubElement(lg, "l")
            for segment in xs:
                decompositions = segment.decompositions()
                if len(decompositions) == 1: # no variants
                    t = decompositions[0][0]
                    w = o.SubElement(l, "w", pos=t.ipa_en_pos, lemma=t.lemma, msd=format_token(t))
                    w._setText(t.surface)
                else:
                    app = o.SubElement(l, "app")
                    for decomposition in decompositions:
                        rdg = o.SubElement(app, "rdg")
                        for token in decomposition.tokens:
                            w = o.SubElement(rdg, "w", pos=token.ipa_en_pos, lemma=token.lemma, msd=format_token(token))
                            w._setText(token.surface)
    return body

text = E.text(generate_body(hachidaishu))

root = E.TEI(header, text, {f"{xml_ns}lang": "ja"})

# print(etree.tostring(root, pretty_print=True, encoding="unicode"))

with open("hachidaishu.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    tei_string = etree.tostring(root, pretty_print=True, encoding="unicode")
    f.write(tei_string)

In [None]:
# https://adrien.barbaresi.eu/blog/validating-tei-xml-python.html
!wget -c https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng
with open("tei_all.rng") as f:
    schema = f.read()
    schema = schema.replace('<?xml version="1.0" encoding="utf-8"?>', '<?xml version="1.0"?>', 1)

In [None]:
from io import StringIO

relaxng_doc = etree.parse(StringIO(schema))
tei_relaxng = etree.RelaxNG(relaxng_doc)
mytree = etree.parse("hachidaishu.xml")


In [None]:
try:
    result = tei_relaxng.assert_(mytree)
    print("Valid.")
except AssertionError as err:
    print("TEI validation error:", err)


## Sanity checks

In [None]:
variation_points = len(root.xpath("//app"))
variation_segments = len(root.xpath("//rdg"))
all_tokens = len(root.xpath("//w"))
xml_tokens = len(root.xpath("//l/w | //l/app"))
tokens = len(list(hachidaishu.tokens()))
poems = len(root.xpath("//l"))
lg_poems = len(root.xpath("//lg"))
num_anthologies = len(root.xpath("//div[@type='anthology']"))

assert poems == lg_poems
assert xml_tokens == tokens

num_anthologies, poems, variation_points, variation_segments, all_tokens, tokens

## JSON

In [None]:
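def msd_to_json(s):
    """Parses a "K=V|K=V" msd string into a dict."""
    fields = {}
    for record in s.split("|"):
        # Split on the first "=" only, in case a value contains "=".
        k, v = record.split("=", 1)
        fields[k] = v
    return fields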

flat_records = []
for anthology in root.xpath("//div[@type='anthology']"):
    for poem in anthology.xpath("lg"):
        for l in poem.iter("l"):
            for token in l.iterchildren():
                if token.tag == "app":
                    token_record = list(list(token.iterchildren())[0].iterchildren())[0]
                else:
                    token_record = token
                flat_records.append({"Anthology": anthology.attrib["n"],
                                     "Poem": poem.attrib["n"],
                                     "Surface": token_record.text,
                                     "Lemma": token_record.attrib["lemma"],
                                     "POS": token_record.attrib["pos"]} | msd_to_json(token_record.attrib["msd"]))

import json
with open("hachidaishu.jsonl", "w", encoding="utf-8") as f:
    for r in flat_records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
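
The msd attribute is a flat `Key=Value|Key=Value` string, so writing and parsing it are inverse operations. A minimal round-trip sketch, using hypothetical helper names (`build_msd`, `parse_msd`) and made-up feature values:

```python
def build_msd(features):
    """Join key/value pairs into a 'K=V|K=V' string, skipping empty values."""
    return "|".join(f"{k}={v}" for k, v in features.items() if v)

def parse_msd(s):
    """Inverse of build_msd; splits on the first '=' only so values may contain '='."""
    return dict(pair.split("=", 1) for pair in s.split("|"))

# Made-up feature values for illustration.
features = {"UPosTag": "NOUN", "WLSPH": "1.1630", "WLSP": ""}
msd = build_msd(features)
print(msd, parse_msd(msd))
```

Optional features that are empty are left out of the string entirely, so the parsed records can have different key sets; consumers of the JSON lines file should treat such keys as optional.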

In [None]:
# English version of the summary table from the beginning of the notebook
import polars as pl

pl.Config.set_fmt_str_lengths(40)
pl.DataFrame([v for k, v in metadata.items()]).with_columns([
    pl.col("no."), pl.col("name").cast(pl.Categorical), pl.col("order").cast(pl.Categorical), pl.col("date"), pl.col("poems")
]).select(["no.", "name", "order", "date", "poems"])