# Reading Corpus Data as ARPAC Registers

How you read your corpus into a register depends on the data that the corpus provides. We prepared all of our data in the same format: CSV tables with phonemes/syllables as the index and additional information as columns. How you prepare your own data is up to you.
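
As a rough illustration of that layout, the sketch below builds a tiny table in memory and reads it back with the standard library only. The two rows and the `frequency` column are taken from the trigram table used in this section; the variable names are placeholders and not part of ARPAC.

```python
import csv
import io

# A tiny in-memory table in the layout described above: the first column is
# the index (underscore-separated phonemes), the remaining columns hold info.
# pandas writes an unnamed index with an empty header cell, as done here.
raw = "\n".join([
    ",frequency",
    "ʊ_n_t,9109",
    "n_ɪ_ç,8478",
])

reader = csv.reader(io.StringIO(raw))
header = next(reader)        # first cell belongs to the index column
info_columns = header[1:]    # every column after the index

table = {}
for index, *values in reader:
    # csv.reader yields strings, so values still need converting if
    # numbers are wanted
    table[index] = dict(zip(info_columns, values))

print(table["ʊ_n_t"])
```

Note that the values come back as strings; converting them (for example with `int`) is left to the code that consumes the table.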

## Constructing Register Objects

Let's take an example from our German corpus: a dataset of trigrams with their overall counts of occurrence in the corpus. This is what the first 5 rows of the dataset look like:

In [9]:
import pandas as pd
from arpac.io import IPA_TRIGRAMS_DEFAULT_PATH

pd.read_csv(IPA_TRIGRAMS_DEFAULT_PATH, index_col=0).head()

Unnamed: 0,frequency
ʊ_n_t,9109
n_ɪ_ç,8478
d_aː_s,7432
d_a_n,6523
ɛ_n_t,5868


Now we want to read it into ARPAC's internal representation, which behaves much like a Python dictionary. Here, we read the table row by row and build a register of `Syllable` objects.

In [17]:
import csv
from arpac.io import IPA_TRIGRAMS_DEFAULT_PATH
from arpac.types.base_types import Register
from arpac.types.elements import Phoneme, Syllable


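# Read all rows and skip the header row; the file is closed afterwards
with open(IPA_TRIGRAMS_DEFAULT_PATH, "r", encoding="utf-8", newline="") as csv_file:
    fdata = list(csv.reader(csv_file))[1:]

trigrams = Register()
for syllable, count in fdata:
    phonemes = syllable.split("_")  # "d_aː_s" -> ["d", "aː", "s"]
    syllable_id = "".join(phonemes)
    trigrams[syllable_id] = Syllable(
        id=syllable_id,
        phonemes=[Phoneme(id=phoneme, info={}) for phoneme in phonemes],
        info={"count": int(count)},  # csv.reader yields strings
    )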

print(trigrams)

ʊnt|nɪç|daːs|dan|ɛnt|lɪç|lzoː|ɪçt|aːbɐ|alz|... (21266 elements total)
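
To make the dictionary analogy concrete, here is a minimal sketch that uses a plain `dict` and a stand-in dataclass instead of ARPAC's `Register` and `Syllable`. `SyllableStandIn` and `parse_trigram` are hypothetical names invented for this illustration; the real classes may behave differently or offer more than what is shown.

```python
from dataclasses import dataclass, field


@dataclass
class SyllableStandIn:
    """Hypothetical stand-in for a syllable element; not ARPAC's class."""
    id: str
    phonemes: list
    info: dict = field(default_factory=dict)


def parse_trigram(key, count):
    """Turn an underscore-separated key such as 'd_aː_s' into a stand-in element."""
    phonemes = key.split("_")
    return SyllableStandIn(
        id="".join(phonemes),
        phonemes=phonemes,
        info={"count": int(count)},
    )


# First three rows of the trigram table shown above
rows = [("ʊ_n_t", "9109"), ("n_ɪ_ç", "8478"), ("d_aː_s", "7432")]

register = {}  # plain dict standing in for a register
for key, count in rows:
    element = parse_trigram(key, count)
    register[element.id] = element

print("daːs" in register)          # membership test by id
print(register["daːs"].phonemes)   # lookup by id
print("|".join(register))          # ids in insertion order
```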


## Example Data Preparation: German Corpus

Here, we show how we turn each of our German corpora into CSV tables of the expected format.

In [44]:
import pandas as pd

df = (
    pd.read_csv("german/orig/german_IPA_seg.csv")
    .drop("Unnamed: 0", axis=1)
    .set_index("Seg")
    .rename_axis(index=None)
    .dropna()
)

df.to_csv("german/unigrams.csv")
df.head()

Unnamed: 0,SegNrInWord
<P>,1
<HÄSITATION>,1
v,1
aː,2
s,3


In [33]:
import pandas as pd

df = (
    pd.read_csv("german/orig/ipa_bigrams_german.csv")
    .drop("Unnamed: 0", axis=1)
    .set_index("bigram")
    .rename_axis(index=None)
)

df.to_csv("german/bigrams.csv")
df.head()

Unnamed: 0,frequency
ɪ_ç,32175
ə_n,27681
n_t,25515
t_s,19529
j_aː,19326


In [37]:
import pandas as pd

df = (
    pd.read_csv("german/orig/ipa_trigrams_german.csv")
    .set_index("ipa_trigram")
    .rename_axis(index=None)
)

df.to_csv("german/trigrams.csv")
df.head()

Unnamed: 0,frequency
ʊ_n_t,9109
n_ɪ_ç,8478
d_aː_s,7432
d_a_n,6523
ɛ_n_t,5868


In [4]:
from arpac.phonecodes import phonecodes
import csv
import pandas as pd


syllables_corpus_path = "german/orig/syll.txt"

with open(syllables_corpus_path, "r", encoding='utf-8') as csv_file:
    fdata = list(csv.reader(csv_file, delimiter='\t'))

syllables_dict = {}

for syll_stats in fdata[1:]:
    #unfortunately, the xsampa codes don't let us know
    syll_ipa = phonecodes.xsampa2ipa(syll_stats[1], language="deu")
    info = {"freq": int(syll_stats[2]), "prob": float(syll_stats[3])}
    syllables_dict[syll_ipa] = info  # will overwrite if already present

df = pd.DataFrame.from_dict(syllables_dict, orient="index")


df.to_csv("german/syllables.csv")
df.head()

Unnamed: 0,freq,prob
jaː,19303,0.0191
ɪç,18267,0.018075
das,16420,0.016247
n,16191,0.01602
dan,11181,0.011063
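
The loop above keeps only the last row seen for each IPA syllable. If several source entries map to the same IPA string, one alternative is to add up their frequencies instead of overwriting. The sketch below shows that idea on made-up rows; the data and the helper `merge_counts` are hypothetical, and whether summing is appropriate depends on the corpus.

```python
from collections import defaultdict


def merge_counts(rows):
    """Sum the frequencies of rows that share the same syllable key."""
    totals = defaultdict(int)
    for syllable, freq in rows:
        totals[syllable] += int(freq)
    return dict(totals)


# Made-up (syllable, frequency) pairs; "das" appears twice on purpose
rows = [("das", "10"), ("dan", "4"), ("das", "6")]

merged = merge_counts(rows)
print(merged)
```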


## Example Data Preparation: English Corpus

In [35]:
from arpac.phonecodes import phonecodes
import pandas as pd

df = pd.read_csv("english/orig/EFS.CD", delimiter="\\", names=["Syllable", "freq"], usecols=[0, 3])
df["Syllable"] = df["Syllable"].apply(lambda x: phonecodes.disc2ipa(x, L="eng"))

df = df.set_index("Syllable").rename_axis(index=None)

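df["prob"] = df["freq"] / df["freq"].sum()
# Keep the first row per syllable. drop_duplicates() would compare the
# freq/prob values and discard distinct syllables that share a count.
df = df[~df.index.duplicated(keep="first")]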


df.to_csv("english/syllables.csv")
df.head()

Unnamed: 0,freq,prob
ɑː,1070,0.0001554662
ɑːɲ,25,3.632388e-06
ɑːɲt,5,7.264777e-07
ɑːɜ˞,4688,0.0006811455
ɑːʃ,0,0.0
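
When deduplicating a table like this one, it matters whether rows are compared by their key (the syllable) or by their values (the counts). The sketch below contrasts the two on made-up rows in plain Python and also shows the relative-frequency computation behind the `prob` column. All data here is hypothetical and only serves to show the difference.

```python
# Made-up (syllable, freq) rows: "b" shares its frequency with "a",
# and "a" is listed twice.
rows = [("a", 5), ("b", 5), ("c", 10), ("a", 5)]

# Deduplicate by key: keep the first row seen for each syllable.
by_key = {}
for syllable, freq in rows:
    by_key.setdefault(syllable, freq)

# Deduplicate by value: keep the first row seen for each frequency.
# This silently loses "b", a distinct syllable with the same count as "a".
seen_freqs = set()
by_value = {}
for syllable, freq in rows:
    if freq not in seen_freqs:
        seen_freqs.add(freq)
        by_value[syllable] = freq

# Relative frequency: each count divided by the total count.
total = sum(by_key.values())
probs = {syllable: freq / total for syllable, freq in by_key.items()}

print(by_key)
print(by_value)
print(probs)
```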
