# Hachidaishu Vocabulary Dataset Examples

This notebook shows how to use the Hachidaishu vocabulary dataset from Python. The notebook is available from the GitHub repository below or through [Google Colab](https://colab.research.google.com/drive/1rS2lbD2rLGw3XxuOroKeEOeWF3MMMTvP?usp=sharing).

The Hachidaishu vocabulary dataset is available from Zenodo and GitHub:

-   https://zenodo.org/record/4744170
-   https://github.com/yamagen/hachidaishu

We will download the dataset from Zenodo or GitHub and save it as `hachidai.db`. You only need to do this once, unless you are running under Google Colab or another ephemeral environment.

In [None]:
#!wget -c https://zenodo.org/record/4744170/files/hachidai.db?download=1 -O hachidai.db
!wget -c https://github.com/yamagen/hachidaishu/raw/main/hachidai.db

## Inspecting the dataset

As the cell below shows, each line of the dataset is space-delimited and contains the following fields:

1.  Token ID, e.g. "01:000001:0007", made up of three colon-separated parts: 1) anthology, 2) poem number, and 3) serial ID of the token. The anthology IDs are: 01 Kokinshu, 02 Gosenshu, 03 Shuishu, 04 Goshuishu, 05 Kin'yoshu, 06 Shikashu, 07 Senzaishu, and 08 Shinkokinshu.
2.  Token type. A marks a single token, B a compound token, and C a component of a B-type compound. A00 is a single token; A01 is a single token that has another meaning; B00 is a compound token; B01 is a compound token that has another meaning; C00 is the first element of the B00/B01 breakdown; C01 is the second element of the B00/B01 breakdown.
3.  Classification ID, e.g. "BG-02-1527-01-0102", based on the semantic categories of the Bunruigoihyo (Yamazaki et al. 2014).
4.  ChaSen POS number.
5.  Surface form: the form as it appears in the literary work.
6.  Lemma in kanji.
7.  Lemma in kana.
8.  Conjugated form in kanji.
9.  Conjugated form in kana.
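
As a rough sketch of how these nine columns could be unpacked in plain Python, the snippet below splits one line into named fields. The column names and the sample line are assumptions made for illustration: the ID and the classification code are the examples quoted in the list above, while the POS number and the upper-case words are placeholders rather than real dataset content.

```python
from typing import Dict, Union

# Column names are illustrative; the dataset itself has no header row.
COLUMNS = ['id', 'token_type', 'bg_id', 'chasen_id', 'surface',
           'lemma', 'lemma_reading', 'kanji', 'kanji_reading']


def parse_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one space-delimited line into a dict of named fields."""
    values = line.rstrip().split(' ')
    if len(values) != len(COLUMNS):
        raise ValueError(f'expected {len(COLUMNS)} fields, got {len(values)}')
    record: Dict[str, Union[str, int]] = dict(zip(COLUMNS, values))
    # The first column packs three numbers: anthology, poem, and token serial.
    anthology, poem, serial = values[0].split(':')
    record.update(anthology=int(anthology), poem=int(poem), serial=int(serial))
    return record


# Placeholder line: only the ID and the classification code come from the list above.
sample = '01:000001:0007 A00 BG-02-1527-01-0102 47 SURFACE LEMMA LEMMA_KANA FORM FORM_KANA'
record = parse_line(sample)
print(record['anthology'], record['poem'], record['serial'], record['token_type'])
```

Raising on an unexpected column count is a design choice for this sketch; it makes malformed lines visible instead of silently misaligning fields.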


In [None]:
!head hachidai.db

Standard UNIX command-line tools are enough to perform various analyses on the dataset. The one-liner below extracts and counts all birds occurring in the dataset: it filters on the bird semantic category using the Bunruigoihyo classification ID prefix `BG-01-5620`, keeps only `A00` tokens, prints the lemma column, and tabulates all occurrences.

In [None]:
!grep BG-01-5620 hachidai.db | grep 'A00' | awk '{print $6}' | sort | uniq -c | sort -nr | nl
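
For readers who prefer to stay in Python, the same filter-and-count pipeline might be sketched as follows. The function name `count_lemmas` and the demo lines are hypothetical: the lines only mimic the nine-column layout, and their classification suffixes and words are placeholders, not real dataset content. Unlike `grep`, which matches anywhere in the line, this version checks the token type and classification columns specifically.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_lemmas(lines: Iterable[str], bg_prefix: str,
                 token_type: str = 'A00') -> List[Tuple[str, int]]:
    """Count lemmas (column 6) on lines matching a category prefix and token type."""
    counts: Counter = Counter()
    for line in lines:
        columns = line.split()
        if len(columns) < 6:
            continue  # skip blank or malformed lines
        if columns[1] == token_type and columns[2].startswith(bg_prefix):
            counts[columns[5]] += 1
    return counts.most_common()  # most frequent first, like `sort -nr`


# Placeholder lines in the nine-column layout; the words are not real data.
demo = [
    '01:000001:0001 A00 BG-01-5620-00-0000 02 X LEMMA_A X X X',
    '01:000001:0002 A00 BG-01-5620-00-0000 02 X LEMMA_B X X X',
    '01:000002:0001 A00 BG-01-5620-00-0000 02 X LEMMA_A X X X',
    '01:000002:0002 B00 BG-01-5620-00-0000 02 X LEMMA_A X X X',
    '01:000002:0003 A00 BG-02-1527-01-0102 47 X LEMMA_C X X X',
]
print(count_lemmas(demo, 'BG-01-5620'))
```

Because the function accepts any iterable of lines, an open file handle on `hachidai.db` could be passed in place of `demo`.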

## Mapping ChaSen id to Japanese POS name

The cell below contains mappings from ChaSen's (IPADic's) POS tagset IDs to their Japanese names and to English abbreviations.

(This cell is hidden by default in the notebook. To inspect the mappings, evaluate the dictionary `chasenid2pos` in a new cell for the Japanese names, or `chasenid2pos_en` for the English version ([source](https://www.sketchengine.eu/japanese-tagset/)).)

In [None]:
chasenid2pos = {
 '00': '？',
 '01': '名詞',
 '02': '名詞-一般',
 '03': '名詞-固有名詞',
 '04': '名詞-固有名詞-一般',
 '05': '名詞-固有名詞-人名',
 '06': '名詞-固有名詞-人名-一般',
 '07': '名詞-固有名詞-人名-姓',
 '08': '名詞-固有名詞-人名-名',
 '09': '名詞-固有名詞-組織',
 '10': '名詞-固有名詞-地域',
 '11': '名詞-固有名詞-地域-一般',
 '12': '名詞-固有名詞-地域-国',
 '13': '名詞-代名詞',
 '14': '名詞-代名詞-一般',
 '15': '名詞-代名詞-縮約',
 '16': '名詞-副詞可能',
 '17': '名詞-サ変接続',
 '18': '名詞-形容動詞語幹',
 '19': '名詞-数',
 '20': '名詞-非自立',
 '21': '名詞-非自立-一般',
 '22': '名詞-非自立-副詞可能',
 '23': '名詞-非自立-助動詞語幹',
 '24': '名詞-非自立-形容動詞語幹',
 '25': '名詞-特殊',
 '26': '名詞-特殊-助動詞語幹',
 '27': '名詞-接尾',
 '28': '名詞-接尾-一般',
 '29': '名詞-接尾-人名',
 '30': '名詞-接尾-地域',
 '31': '名詞-接尾-サ変接続',
 '32': '名詞-接尾-助動詞語幹',
 '33': '名詞-接尾-形容動詞語幹',
 '34': '名詞-接尾-副詞可能',
 '35': '名詞-接尾-助数詞',
 '36': '名詞-接尾-特殊',
 '37': '名詞-接続詞的',
 '38': '名詞-動詞非自立的',
 '39': '名詞-引用文字列',
 '40': '名詞-ナイ形容詞語幹',
 '41': '接頭詞',
 '42': '接頭詞-名詞接続',
 '43': '接頭詞-動詞接続',
 '44': '接頭詞-形容詞接続',
 '45': '接頭詞-数接続',
 '46': '動詞',
 '47': '動詞-自立',
 '48': '動詞-非自立',
 '49': '動詞-接尾',
 '50': '形容詞',
 '51': '形容詞-自立',
 '52': '形容詞-非自立',
 '53': '形容詞-接尾',
 '54': '副詞',
 '55': '副詞-一般',
 '56': '副詞-助詞類接続',
 '57': '連体詞',
 '58': '接続詞',
 '59': '助詞',
 '60': '助詞-格助詞',
 '61': '助詞-格助詞-一般',
 '62': '助詞-格助詞-引用',
 '63': '助詞-格助詞-連語',
 '64': '助詞-接続助詞',
 '65': '助詞-係助詞',
 '66': '助詞-副助詞',
 '67': '助詞-間投助詞',
 '68': '助詞-並立助詞',
 '69': '助詞-終助詞',
 '70': '助詞-副助詞／並立助詞／終助詞',
 '71': '助詞-連体化',
 '72': '助詞-副詞化',
 '73': '助詞-特殊',
 '74': '助動詞',
 '75': '感動詞',
 '76': '記号',
 '77': '記号-一般',
 '78': '記号-句点',
 '79': '記号-読点',
 '80': '記号-空白',
 '81': '記号-アルファベット',
 '82': '記号-括弧開',
 '83': '記号-括弧閉',
 '84': 'その他',
 '85': 'その他-間投',
 '86': 'フィラー',
 '87': '非言語音',
 '88': '語断片',
 '89': '未知語'
 }

chasenid2pos_en = {
 '00': '?',
 '01': 'N',
 '02': 'N.g',
 '03': 'N.Prop',
 '04': 'N.Prop.g',
 '05': 'N.Prop.n',
 '06': 'N.Prop.n.g',
 '07': 'N.Prop.n.s',
 '08': 'N.Prop.n.f',
 '09': 'N.Prop.o',
 '10': 'N.Prop.p',
 '11': 'N.Prop.p.g',
 '12': 'N.Prop.p.c',
 '13': 'N.Pron',
 '14': 'N.Pron.g',
 '15': 'N.Pron.sh',
 '16': 'N.Adv',
 '17': 'N.Vs',
 '18': 'N.Ana',
 '19': 'N.Num',
 '20': 'N.bnd',
 '21': 'N.bnd.g',
 '22': 'N.bnd.Adv',
 '23': 'N.bnd.Aux',
 '24': 'N.bnd.Ana',
 '25': 'N.spec',
 '26': 'N.spec.Aux',
 '27': 'N.Suff',
 '28': 'N.Suff.g',
 '29': 'N.Suff.n',
 '30': 'N.Suff.p',
 '31': 'N.Suff.Vs',
 '32': 'N.Suff.Aux',
 '33': 'N.Suff.Ana',
 '34': 'N.Suff.Adv',
 '35': 'N.Suff.msr',
 '36': 'N.Suff.spec',
 '37': 'N.Conj',
 '38': 'N.V.bnd',
 '39': 'N.Phr',
 '40': 'N.nai',
 '41': 'Pref',
 '42': 'Pref.N',
 '43': 'Pref.V',
 '44': 'Pref.Ai',
 '45': 'Pref.Num',
 '46': 'V',
 '47': 'V.free',
 '48': 'V.bnd',
 '49': 'V.Suff',
 '50': 'Ai',
 '51': 'Ai.free',
 '52': 'Ai.bnd',
 '53': 'Ai.Suff',
 '54': 'Adv',
 '55': 'Adv.g',
 '56': 'Adv.P',
 '57': 'Adn',
 '58': 'Conj',
 '59': 'P',
 '60': 'P.c',
 '61': 'P.c.g',
 '62': 'P.c.r',
 '63': 'P.c.Phr',
 '64': 'P.Conj',
 '65': 'P.bind',
 '66': 'P.Adv',
 '67': 'P.ind',
 '68': 'P.coord',
 '69': 'P.fin',
 '70': 'P.advcoordfin',
 '71': 'P.prenom',
 '72': 'P.advzer',
 '73': 'P.spec',
 '74': 'Aux',
 '75': 'Interj',
 '76': 'Sym',
 '77': 'Sym.g',
 '78': 'Sym.p',
 '79': 'Sym.c',
 '80': 'Sym.w',
 '81': 'Sym.a',
 '82': 'Sym.bo',
 '83': 'Sym.bc',
 '84': 'Other',
 '85': 'Other.indir',
 '86': 'Fill',
 '87': 'Nss',
 '88': 'Frgm',
 '89': 'Unknown'
}
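
A small lookup helper can make the two dictionaries easier to use together, for instance when POS numbers arrive as integers rather than zero-padded strings. The sketch below is only one possible shape: `describe_pos` is a hypothetical name, the fallback symbols for unknown IDs are an assumption modeled on the `'00'` entries, and two-entry excerpts of the dictionaries are used so the snippet runs on its own.

```python
from typing import Mapping, Union


def describe_pos(chasen_id: Union[int, str],
                 ja: Mapping[str, str],
                 en: Mapping[str, str]) -> str:
    """Return 'id: Japanese name (English abbreviation)' for a ChaSen POS id."""
    key = f'{int(chasen_id):02d}'  # the dictionaries use zero-padded string keys
    return f'{key}: {ja.get(key, "？")} ({en.get(key, "?")})'


# Two-entry excerpts of the dictionaries above keep the sketch self-contained.
ja_excerpt = {'02': '名詞-一般', '47': '動詞-自立'}
en_excerpt = {'02': 'N.g', '47': 'V.free'}

print(describe_pos(2, ja_excerpt, en_excerpt))
print(describe_pos('47', ja_excerpt, en_excerpt))
```

In the notebook itself, the full `chasenid2pos` and `chasenid2pos_en` dictionaries would be passed instead of the excerpts.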

## Python data loader classes

Below is an example of how to read the dataset into Python. Several convenient methods for accessing the dataset in common analyses are provided:

-   `Token` is a Python dataclass containing all information on a token.
-   `HachidaishuRecord` is a Python dataclass that wraps each line of `hachidai.db`.
-   `HachidaishuDB` is a Python dataclass that wraps the list of `HachidaishuRecord`s and provides an iterator interface and convenience methods:
    -   `query()`: Iterates over the dataset, optionally filtering for a specific anthology, poem or serial.
    -   `tokens()`: Wraps `query()` to return a sequence of tokens, taking into account the preferred tokenization.
    -   `text()`: Wraps `query()` to return poems as plain text, one per line, optionally separated by spaces or other delimiters.

Note that while there are no methods for filtering on a specific POS tag or similar, this can be easily performed using standard Python functions (an example is given further below).

In [None]:
from dataclasses import dataclass, field, fields, astuple, asdict
from typing import List
from enum import Enum
from itertools import groupby
from collections import Counter


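# 'Kinyoshu' is spelled without an apostrophe so that `Anthology.Kinyoshu` is a valid attribute name.
Anthology = Enum('Anthology', 'Kokinshu Gosenshu Shuishu Goshuishu Kinyoshu Shikashu Senzaishu Shinkokinshu')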


@dataclass
class Token:
    token_type: str
    bg_id: str
    chasen_id: str = field(repr=False)
    pos: str = field(init=False)
    surface: str
    lemma: str
    lemma_reading: str
    kanji: str
    kanji_reading: str

    def __post_init__(self):
        self.pos = chasenid2pos[self.chasen_id]

    def __repr__(self):
        return f'{self.surface}/{self.pos}/{self.lemma}'

    def __iter__(self):
        return iter(astuple(self))


@dataclass
class HachidaishuRecord:
    anthology: Anthology
    poem: int
    serial: int
    tokens: List[Token]

    def __iter__(self):
        return iter(astuple(self)[:3] + astuple(self.token()))

    def keys(self):
        return [f.name for f in fields(self)][:3] + [f.name for f in fields(self.token())]

    def token(self):
        '''Return the token, ignoring any decomposition or alternative variants.'''
        return self.tokens[0]

    def decomposition(self):
        '''Return a list containing the alternative decomposition of token in record.'''
        if len(self.tokens) > 1:
            return self.tokens[1:]
        else:
            return self.tokens


@dataclass
class HachidaishuDB:
    db: List[HachidaishuRecord] = field(repr=False)

    def __init__(self, filename='hachidai.db'):
        self.db = self._retokenize(self._read_db(filename))

    def __getitem__(self, index):
        return self.db[index]

    def __iter__(self):
        for record in self.db:
            yield record

    def columns(self):
        return self.db[0].keys()

    def _retokenize(self, db):
        tokens = []
        serial = None
        for entry in db:
            if serial == entry.serial:
                tokens[-1].tokens += entry.tokens
            else:
                tokens.append(entry)
            
            serial = entry.serial
        return tokens

    def _read_db(self, filename='hachidai.db'):
        # Read as UTF-8 explicitly so the Japanese text decodes the same way on every platform.
        with open(filename, encoding='utf-8') as f:
            for row in f:
                # Each line holds nine space-delimited columns.
                (record_id, token_type, bg_id, chasen_id, surface,
                 lemma, lemma_reading, kanji, kanji_reading) = row.rstrip().split(' ')
                anthology, poem, serial = record_id.split(':')
                token = Token(token_type, bg_id, chasen_id, surface,
                              lemma, lemma_reading, kanji, kanji_reading)
                yield HachidaishuRecord(
                    Anthology(int(anthology)),
                    int(poem),
                    int(serial),
                    [token]
                    )

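    def query(self, anthology=None, poem=None, serial=None):
        '''Yield records, optionally restricted to an anthology, poem and/or serial.'''
        for record in self:
            # Compare against None so that a filter value of 0 is not ignored.
            if anthology is not None and record.anthology != Anthology(anthology):
                continue
            if poem is not None and record.poem != poem:
                continue
            if serial is not None and record.serial != serial:
                continue
            yield record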

    
    def tokens(self, mode='default', anthology=None, poem=None, serial=None):
        for record in self.query(anthology, poem, serial):
            if mode == 'default':
                yield record.token()
            elif mode == 'decomposition':
                for token in record.decomposition():
                    yield token
    
    def text(self, delimiter=' ', anthology=None, poem=None, serial=None):
        by_poem = groupby(self.query(anthology=anthology, poem=poem, serial=serial),
                          key=lambda r: (r.anthology, r.poem))
        return '\n'.join(delimiter.join(record.token().surface for record in poem)
                         for poem_info, poem in by_poem)


db = HachidaishuDB()
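
The loader has no POS filter of its own, but a plain generator expression is enough. The sketch below uses a hypothetical helper, `filter_by_pos`, and stand-in objects with placeholder surfaces so that it runs without the dataset; any object with a `pos` attribute, such as the `Token` instances above, should work the same way.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def filter_by_pos(tokens: Iterable, pos_prefix: str) -> Iterator:
    """Yield only tokens whose `pos` attribute starts with the given prefix."""
    return (token for token in tokens if token.pos.startswith(pos_prefix))


# Stand-in objects with placeholder surfaces; real input would be `db.tokens()`.
demo_tokens = [
    SimpleNamespace(surface='W1', pos='名詞-一般'),
    SimpleNamespace(surface='W2', pos='助詞-格助詞-一般'),
    SimpleNamespace(surface='W3', pos='名詞-固有名詞-地域-一般'),
]
nouns = [token.surface for token in filter_by_pos(demo_tokens, '名詞')]
print(nouns)
```

Matching on a prefix selects a whole branch of the hierarchical tagset at once: `'名詞'` covers every noun subtype, while `'名詞-固有名詞'` narrows the result to proper nouns.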

### Simple usage

Let's print the first poem from the Kokinshu, using `/` as a delimiter between tokens.

In [None]:
db.text(delimiter='/', anthology=1, poem=1)

We can see more information (such as alternative tokenizations) by doing a query with the same anthology and poem parameters:

In [None]:
list(db.query(anthology=1, poem=1))

Or, similarly, just get all the tokens.
(Note that you can use your editor's autocomplete feature to list all anthology names by pressing TAB after `Anthology.`, as `Anthology` is an enum containing the name-to-ID mapping.)

In [None]:
list(db.tokens(anthology=Anthology.Kokinshu, poem=1))

As seen above, the default representation of each token hides everything but the surface form, POS, and lemma form. By writing a custom print function, you can print other pertinent fields.

In [None]:
def token2string(t):
    return f'{t.surface},{t.pos},{t.lemma},{t.kanji},{t.bg_id}'

list(token2string(token) for token in db.tokens(anthology=Anthology.Kokinshu, poem=1))

### Basic statistics

Below we calculate the frequency of POS tags over the whole dataset and the 20 most common lemmas in the Shinkokinshu, and then generate an interactive plot of the POS distribution across the Hachidaishu anthologies.

In [None]:
Counter(token.pos for token in db.tokens()).most_common()

In [None]:
Counter(token.lemma for token in db.tokens(anthology=Anthology.Shinkokinshu)).most_common(20)
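
Raw counts can be hard to compare between subsets of different sizes, so it may help to convert a `Counter` into proportions. The helper below is a minimal sketch with a hypothetical name, `relative_frequencies`, and made-up toy counts; it is not part of the loader above.

```python
from collections import Counter
from typing import Dict, Hashable


def relative_frequencies(counts: Counter) -> Dict[Hashable, float]:
    """Convert raw counts into proportions of the total, most frequent first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.most_common()}


# Toy counts over top-level POS labels; the numbers are made up.
demo_counts = Counter({'名詞': 6, '助詞': 3, '動詞': 1})
print(relative_frequencies(demo_counts))
```

Passing the result of `Counter(token.pos for token in db.tokens(anthology=...))` for each anthology would give POS shares that can be compared side by side.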

### Example visualization of POS distribution over anthologies

The example uses Pandas and Plotly to plot the distribution of POS tags by anthology.

In [None]:
!pip install pandas plotly

In [None]:
import pandas as pd
import plotly.express as px


dfp = pd.DataFrame.from_records(db.query(), columns=db.columns())
dfp.anthology = dfp.anthology.apply(lambda a: f'{a.value}: {a.name}')
dfp['pos_1'] = dfp.pos.apply(lambda s: s.split('-')[0])
dfp = dfp.groupby(['anthology', 'pos_1']).agg({'pos_1': ['count']})
dfp = dfp.reset_index()
dfp.columns = ['anthology', 'pos_1', 'count']
dfp

In [None]:
fig = px.bar(dfp, x="anthology", y="count", color="pos_1", barmode="stack")
fig.show()

## Using Pandas

A more straightforward analysis can also be performed with Pandas by treating the dataset as a regular table. In this case, care must be taken when aggregating over tokens to select the intended token type (variant/decomposition status).

In [None]:
import pandas as pd

In [None]:
df = pd.read_table('hachidai.db', usecols=range(9), sep=' ',
                   names=['id', 'token_type', 'bg_id', 'chasen_id', 'surface', 'lemma', 'lemma_reading', 'kanji', 'kanji_reading'])

In [None]:
birds = df[df.token_type.str.match('A00') & df.bg_id.str.contains('BG-01-5620')]
birds

In [None]:
birds.lemma.value_counts()
# These counts should match the output of the shell one-liner at the beginning.