<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/sentence_splitting_and_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence splitting and tokenization example

This short notebook illustrates one comparatively fast way to do sentence splitting and tokenization in Python. Neither step is particularly accurate, but the approach should do the job in cases where the details don't matter too much.

We'll use the [sentence-splitter](https://pypi.org/project/sentence-splitter/) and [regex](https://pypi.org/project/regex/) packages.

In [1]:
!pip install --quiet sentence-splitter regex


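Download some example data: a sample of Finnish Wikipedia text (about 477 MB). The `-nc` flag makes `wget` skip the download if the file already exists.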

In [2]:
!wget -nc http://dl.turkunlp.org/TKO_7095_2023/fiwiki-20221120-sample.txt

--2024-04-13 19:40:04--  http://dl.turkunlp.org/TKO_7095_2023/fiwiki-20221120-sample.txt
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500364237 (477M) [text/plain]
Saving to: ‘fiwiki-20221120-sample.txt’


2024-04-13 19:40:33 (17.2 MB/s) - ‘fiwiki-20221120-sample.txt’ saved [500364237/500364237]



Read in the example data, which is in paragraph-per-line format (one paragraph per line).

In [3]:
with open('fiwiki-20221120-sample.txt', encoding='utf-8') as f:
    paragraphs = f.readlines()
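
Since we only use paragraphs from the start of the file later on, it is not strictly necessary to load the whole file into memory. The following is a minimal sketch of reading just the first lines with `itertools.islice`; the helper name `read_first_lines` and its parameters are made up for this illustration, and the file is assumed to be UTF-8 encoded.

```python
from itertools import islice


def read_first_lines(path, limit, encoding='utf-8'):
    """Return at most `limit` lines from the start of a text file.

    Hypothetical helper for illustration: iterating over the file object
    reads it lazily, so lines past `limit` are never loaded.
    """
    with open(path, encoding=encoding) as f:
        return list(islice(f, limit))


# Example usage (assumes the sample file has been downloaded):
# paragraphs = read_first_lines('fiwiki-20221120-sample.txt', 100000)
```

As with `readlines()`, each returned line keeps its trailing newline.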

Instantiate the sentence splitter. Note that you need to specify the language, and not all languages are supported.

In [4]:
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='fi')
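
To see why the language matters, consider a toy rule-based splitter: a period only ends a sentence if the word before it is not a known abbreviation, and abbreviations differ from language to language. The sketch below is only an illustration under that assumption, not the implementation used by `sentence-splitter`; the function `naive_split` and the tiny `ABBREVIATIONS` set are hypothetical.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only; a real
# splitter needs a much larger, language-specific list.
ABBREVIATIONS = {'esim.', 'ns.', 'Dr.', 'e.g.'}


def naive_split(text, abbreviations=ABBREVIATIONS):
    """Split at '.', '!' or '?' followed by whitespace, unless the word
    ending at that punctuation mark is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]\s+', text):
        # Text from the current sentence start up to and including the mark.
        candidate = text[start:match.start() + 1]
        words = candidate.split()
        last_word = words[-1] if words else ''
        if last_word in abbreviations:
            continue  # Not a sentence boundary; keep scanning.
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive_split('Tämä on lause. Tässä on esim. toinen lause! Kolmas?'))
```

Without the abbreviation check, the second sentence would be cut in two after "esim.".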

Run sentence splitting and log the runtime. We only process the first 100,000 paragraphs to keep things reasonably fast.

In [None]:
%%time

sentences = [s for p in paragraphs[:100000] for s in splitter.split(p)]
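
The `%%time` cell magic only works in IPython and Jupyter. Outside a notebook, a similar wall-clock measurement can be sketched with `time.perf_counter`; the helper name `timed` is made up for this illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds).

    Hypothetical helper for illustration; measures wall-clock time only.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example usage with a stand-in workload.
result, seconds = timed(sorted, [3, 1, 2])
print(f'Took {seconds:.6f} s')
```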

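Split into tokens using a regular expression. The expression below defines a token as either a sequence of alphanumeric characters or any other single non-space character. Note that the POSIX character class `[[:alnum:]]` is supported by the `regex` package but not by the standard-library `re` module.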

In [None]:
import regex

TOKENIZE_RE = regex.compile(r'([[:alnum:]]+|\S)')
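
To get a feel for what this token definition produces, here is a small sketch that approximates it using only the standard-library `re` module. It assumes that `[^\W_]` (any word character except the underscore) is a close enough stand-in for `[[:alnum:]]`; the names `TOKEN_RE` and `tokenize` are made up for this illustration.

```python
import re

# Approximation of the pattern above using the standard library:
# [^\W_]+ matches a run of letters and digits (word characters minus '_'),
# and \S matches any other single non-space character.
TOKEN_RE = re.compile(r'[^\W_]+|\S')


def tokenize(sentence):
    """Return the list of tokens in a sentence (hypothetical helper)."""
    return TOKEN_RE.findall(sentence)


print(tokenize('Tämä on esimerkki, eikö olekin?'))
# → ['Tämä', 'on', 'esimerkki', ',', 'eikö', 'olekin', '?']
```

Punctuation always becomes a separate token, so a word like "It's" is split into three tokens: "It", "'" and "s". This is one of the places where the approach trades accuracy for speed.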

Tokenize the sentences and log the runtime

In [None]:
%%time

tokenized = [TOKENIZE_RE.findall(s) for s in sentences]

Check a few examples

In [None]:
for t in tokenized[:10]:
    print(t)