**Toki Pona Modeling**

First, I'll pull in my data so that we can run some basic statistics on it.

We'll clone the poki lapo repository into a directory named `poki`:

```git clone https://github.com/kulupu-lapo/poki.git```

Poki lapo is currently the largest and most complete corpus of Toki Pona data.

The files in the repository carry a metadata header delimited by `---`, but since we only care about the text itself, we can discard the headers and keep just the body.






In [1]:
#We'll import the packages we'll use for this
import nltk
import re
import os
import random
import warnings
import strip_markdown #to remove markdown formatting

def print_divider():
    print("\n" + "="*40 + "\n")

In [2]:
# Collect the paths of all text files in the poki directory (reading and header stripping happen in a later cell)
poki_plaintext_path = './poki/plaintext/'

all_files_paths = []
#all the text files in poki directory that we will read end in ".md"
for root, dirs, files in os.walk(poki_plaintext_path):
    for file in files:
        if file.endswith('.md'):
            print(os.path.join(root, file))
            all_files_paths.append(os.path.join(root, file))

print(f"Total files found: {len(all_files_paths)}")

./poki/plaintext/2002\10\jan-uluka-li-pilin-ike.md
./poki/plaintext/2002\10\kili-lili.md
./poki/plaintext/2002\10\sijelo-pi-nasin-pona.md
./poki/plaintext/2002\10\tenpo-suno-ni-iwa.md
./poki/plaintext/2002\10\toki-pona.md
./poki/plaintext/2002\10\tomo-palisa-pi-papeli.md
./poki/plaintext/2003\04\pilin-ike.md
./poki/plaintext/2003\04\wan-taso.md
./poki/plaintext/2003\05\ma-tomo-pape.md
./poki/plaintext/2003\05\mama-pi-mi-mute.md
./poki/plaintext/2004\06\lipu-sewi.md
./poki/plaintext/2004\06\o-toki-pi-jan-sewi.md
./poki/plaintext/2004\06\son-23.md
./poki/plaintext/2005\07\ma-tomo-pape.md
./poki/plaintext/2005\09\toki-utala-Supaka.md
./poki/plaintext/2005\12\wan-taso.md
./poki/plaintext/2006\02\lipu-pi-jan-pona-sewi-matejo.md
./poki/plaintext/2007\06\blue-sky.md
./poki/plaintext/2007\06\kalama-nasa-wawa.md
./poki/plaintext/2007\06\melissa.md
./poki/plaintext/2007\06\midnight-rider.md
./poki/plaintext/2007\06\revival.md
./poki/plaintext/2007\06\sina-olin-o-toki.md
./poki/plaintext/2008\unk

Now that we have the list of all the files, we will first strip out the headers and then compute the n-grams for n = 1 to 4.
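
The n-gram step comes after cleaning. As a rough preview of what counting 1- to 4-grams could look like, here is a minimal sketch in plain Python. The function name `count_ngrams`, the regex tokenizer, and the sample line are assumptions made for illustration. This is not necessarily how the notebook does it, and it may well use `nltk` for this step instead.

```python
import re
from collections import Counter

def count_ngrams(text, max_n=4):
    """Return a dict mapping n to a Counter of n-gram tuples, for n = 1..max_n."""
    # Assumed tokenizer: runs of word characters, punctuation dropped.
    tokens = re.findall(r"\w+", text)
    counts = {}
    for n in range(1, max_n + 1):
        # Zipping n shifted copies of the token list yields every window of width n.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(grams)
    return counts

# Illustrative input only: a single line in the style of the corpus.
sample = "jan pi wile mi pi ma Masi"
ngram_counts = count_ngrams(sample)
print(ngram_counts[1].most_common(1))
print(ngram_counts[2][("pi", "ma")])
```

Note that this sketch treats the whole text as one token stream, so n-grams can span line and sentence boundaries. Whether that is desirable depends on the statistics we want.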

In [4]:
def read_and_clean_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
        # Remove the header (it starts with '---' and ends with '---')
        if content.startswith('---'):
            header_end = content.find('---', 3)
            if header_end != -1:
                content = content[header_end + 3:].lstrip()
        else:
            # Only notify us if a file has no header; we can still process it
            warnings.warn(f"No header found in file: {file_path}")
        # Remove markdown formatting
        clean_content = strip_markdown.strip_markdown(content)
        return clean_content
    
# Now, let's see if this worked by reading one file
sample_file_path = random.choice(all_files_paths)
print(f"Reading file: {sample_file_path}")
print("File content before cleaning:")
# Use a context manager so the file handle is closed after reading
with open(sample_file_path, 'r', encoding='utf-8') as f:
    print(f.read())
print_divider()
print("File content after cleaning:")
print(read_and_clean_file(sample_file_path))

Reading file: ./poki/plaintext/2025\02\mirislou.md
File content before cleaning:
---
title: Mirislou
description: null
authors:
  - jan Ke Tami
proofreaders: null
date: 2025-02-01
date-precision: day
tags:
  - music
original:
  title: Mirislou
  authors:
    - folk origin
license: null
sources:
  - https://discord.com/channels/301377942062366741/1335050179228864592/1335050553805246545
archives: null
preprocessing: null
accessibility-notes: null
notes: null
---

jan pona mi pi ma Masi o  \
o sona e pilin wawa mi

olin mi o, jan pi pona suli o  \
uta sina li sama suwi moku

jan pi wile mi pi ma Masi  \
pilin ike li kama anpa e mi  \
sina kama e mi tawa ma sina

jan pi wile mi pi ma Masi  \
pilin ike li kama anpa e mi  \
sina kama e mi tawa ma sina

jan suwi mi pi ma Masi o  \
uta mi en uta sina li wan  \
la ni li ante e ijo mi ale  \
uta sina taso li ken ante ni

jan pi wile mi pi ma Masi  \
pilin ike li kama anpa e mi  \
sina kama e mi tawa ma sina

jan pi wile mi pi ma Masi  \
pilin ik