In [1]:
# Copyright (c) 2023 Sophie Katz
#
# This file is part of Language Model.
#
# Language Model is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation, either version 3 of the License, or (at your option) any later version.
#
# Language Model is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along with Language
# Model. If not, see <https://www.gnu.org/licenses/>.


# Using the WikiText-2 dataset from `torchtext` for language modeling

## Table of contents

- [Imports](#imports)
- [Loading the dataset](#loading-the-dataset)
- [Tokenizing the examples](#tokenizing-the-examples)
- [Filtering out unwanted tokens](#filtering-out-unwanted-tokens)
- [Splitting the tokens into sentences](#splitting-the-tokens-into-sentences)
- [Extracting the vocabulary](#extracting-the-vocabulary)
- [Putting it all together](#putting-it-all-together)

## Resources used

Name | URL
---- | ---
WikiText-2 documentation | https://pytorch.org/text/stable/datasets.html#wikitext-2

## Imports

We primarily need `torchtext` for the corpus and some utilities. We'll use `torchdata` later on when we're packaging our dataset for others to use.

In [2]:
import functools
import itertools
from typing import Iterable, Iterator, Set, Tuple

import torchdata.datapipes as dp
from torchtext.data.functional import simple_space_split
from torchtext.datasets import WikiText2
from torchtext.vocab import Vocab, build_vocab_from_iterator


## Loading the dataset

We're going to use the WikiText-2 corpus (Wiki-2 for short), which is a collection of text taken from Wikipedia articles. `torchtext` makes loading the dataset straightforward:

In [3]:
train, val, test = WikiText2()


It gives us `train`, `val`, and `test`, which are PyTorch datapipes. If you're not familiar with datapipes, don't worry! For our purposes, you can treat them as iterables that start over from the first example each time you loop over them.

Let's look at a few examples from `train`, treating it as an iterator:

In [4]:
# Get the first 10 examples from the training set, with indices
for index, example in enumerate(itertools.islice(train, 10)):
    # Print it out
    print(f"[{index}] {example!r}")


[0] ' \n'
[1] ' = Valkyria Chronicles III = \n'
[2] ' \n'
[3] ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n'
[4] " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game m

We get a series of space-separated tokens stored in strings. Let's see how we can tokenize examples 3 and 4 above (they look decent as a starting point).

## Tokenizing the examples

We need a starting point to test our tokenization. Let's create a combination of examples 3 and 4. Since they both start with `' '` characters, we can just concatenate them.

In [5]:
example = "".join(itertools.islice(train, 3, 5))

example




' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and 

It's a decent chunk of text! `torchtext` provides a built-in tokenizer called `simple_space_split`. Since the text is already separated by spaces, it should be a pretty good fit.

Let's see what happens when we use it:

In [6]:
# simple_space_split takes an iterable of strings and yields one list of tokens per
# string. We pass in a single string, so we only need the first result.
tokens = next(simple_space_split([example]))

# We only want to look at the unique tokens, in sorted order
unique_and_sorted = sorted(set(tokens))

# Display them
unique_and_sorted

['"',
 "'n",
 "'s",
 '(',
 ')',
 ',',
 '.',
 '2010',
 '2011',
 '3',
 ':',
 '<unk>',
 '@-@',
 'A',
 'Battlefield',
 'Character',
 'Chronicles',
 'Europan',
 'Gallia',
 'Hitoshi',
 'Honjou',
 'II',
 'III',
 'Imperial',
 'January',
 'Japan',
 'Japanese',
 'May',
 'Media.Vision',
 'Nameless',
 'Ozawa',
 'PlayStation',
 'Portable',
 'Raven',
 'Released',
 'Sakimoto',
 'Second',
 'Sega',
 'Senjō',
 'Takeshi',
 'The',
 'Valkyria',
 'War',
 'While',
 'a',
 'adjustments',
 'against',
 'along',
 'also',
 'and',
 'are',
 'as',
 'began',
 'black',
 'both',
 'by',
 'carrying',
 'commonly',
 'composer',
 'designer',
 'developed',
 'development',
 'director',
 'done',
 'during',
 'entries',
 'features',
 'first',
 'follows',
 'for',
 'from',
 'fusion',
 'game',
 'gameplay',
 'handled',
 'in',
 'is',
 'it',
 'its',
 'large',
 'lit',
 'making',
 'military',
 'more',
 'multiple',
 'nation',
 'newcomers',
 'no',
 'of',
 'on',
 'opening',
 'operations',
 'outside',
 'over',
 'parallel',
 'penal',
 'perfor

We get one token that we don't want, `'@-@'`, which stands in for the hyphen inside compound words such as "role @-@ playing". Otherwise it looks good!
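
For readers who want to experiment without `torchtext` installed, the tokenization seen above can be approximated with plain `str.split()`. This is only a sketch: it assumes that `simple_space_split` splits each string on runs of whitespace, which is consistent with the output above (no empty tokens and no trailing `'\n'`). The name `space_split` is a hypothetical stand-in and is not part of `torchtext`.

```python
from typing import Iterable, Iterator, List


def space_split(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of whitespace-separated tokens per input string."""
    for line in lines:
        # str.split() with no arguments splits on runs of whitespace and
        # ignores leading/trailing whitespace, so no empty tokens appear.
        yield line.split()


sample = " Released in January 2011 in Japan , it is the third game . \n"
sample_tokens = next(space_split([sample]))

print(sample_tokens)
```

Because the corpus is already pre-tokenized with spaces around punctuation, nothing more elaborate than whitespace splitting is needed here.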

## Filtering out unwanted tokens

This is a pretty straightforward task that we can do with a simple function.

In [7]:
def filter_tokens(tokens, unwanted):
    # Convert to a set so that each membership test is fast
    unwanted = set(unwanted)

    for token in tokens:
        if token not in unwanted:
            yield token


filtered = list(filter_tokens(unique_and_sorted, {"@-@"}))

# We only need to see enough to show that the unwanted token is missing
filtered[:7]


['"', "'n", "'s", '(', ')', ',', '.']

## Splitting the tokens into sentences

Here's where we'll write a bit of custom logic to split the tokens into sentences. This is basically a split function but with tokens instead of characters.

In [8]:
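def split_sentences(tokens):
    # Tokens of the sentence currently being built
    sentence = []

    for token in tokens:
        if token == ".":
            # A period ends the current sentence; skip empty sentences
            if len(sentence) > 0:
                yield sentence
                sentence = []
        else:
            sentence.append(token)

    # Yield any trailing tokens that were not followed by a period
    if len(sentence) > 0:
        yield sentence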


for index, sentence in enumerate(
    split_sentences(filter_tokens(next(simple_space_split([example])), {"@-@"}))
):
    print(f"[{index}] {sentence!r}")


[0] ['Senjō', 'no', 'Valkyria', '3', ':', '<unk>', 'Chronicles', '(', 'Japanese', ':', '戦場のヴァルキュリア3', ',', 'lit']
[1] ['Valkyria', 'of', 'the', 'Battlefield', '3', ')', ',', 'commonly', 'referred', 'to', 'as', 'Valkyria', 'Chronicles', 'III', 'outside', 'Japan', ',', 'is', 'a', 'tactical', 'role', 'playing', 'video', 'game', 'developed', 'by', 'Sega', 'and', 'Media.Vision', 'for', 'the', 'PlayStation', 'Portable']
[2] ['Released', 'in', 'January', '2011', 'in', 'Japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the', 'Valkyria', 'series']
[3] ['<unk>', 'the', 'same', 'fusion', 'of', 'tactical', 'and', 'real', 'time', 'gameplay', 'as', 'its', 'predecessors', ',', 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and', 'follows', 'the', '"', 'Nameless', '"', ',', 'a', 'penal', 'military', 'unit', 'serving', 'the', 'nation', 'of', 'Gallia', 'during', 'the', 'Second', 'Europan', 'War', 'who', 'perform', 'secret', 'black', 'operations', 'and', 'are', 'pitted', 'agains

This is definitely functional!

**NOTE:** If you investigate further, there's a case where it sees "Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3" and assumes that the period after "lit" is the end of a sentence. There's not an easy way to fix this problem and it doesn't occur very often, so we're just going to leave it.
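
To make this edge case concrete, here is a small sketch that reproduces it on a hand-written token list and shows one possible partial mitigation: keeping a period inside the sentence when the preceding token is a known abbreviation. The function name `split_sentences_with_abbreviations` and its `abbreviations` parameter are hypothetical and are not used anywhere else in this notebook. A fixed abbreviation set cannot cover every case, and it would wrongly merge a sentence that really does end with one of those words.

```python
from typing import Collection, Iterable, Iterator, List


def split_sentences_with_abbreviations(
    tokens: Iterable[str], abbreviations: Collection[str] = ()
) -> Iterator[List[str]]:
    """Split tokens on "." unless the previous token is a known abbreviation."""
    sentence: List[str] = []

    for token in tokens:
        if token == ".":
            if sentence and sentence[-1] in abbreviations:
                # Treat the period as part of the abbreviation, not a boundary.
                sentence.append(token)
            elif sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(token)

    if sentence:
        yield sentence


demo_tokens = (
    "( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) . "
    "Released in 2011 ."
).split()

# Without an abbreviation set, the period after "lit" ends a sentence.
naive = list(split_sentences_with_abbreviations(demo_tokens))

# With "lit" registered (an assumption for this demo), it no longer does.
aware = list(split_sentences_with_abbreviations(demo_tokens, {"lit"}))

print(len(naive), naive[0][-1])
print(len(aware))
```

The rest of this notebook keeps the simpler `split_sentences` behavior.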

## Extracting the vocabulary

Transformers don't operate on words directly. They operate on sequences of integer indices, where each index identifies a token in a vocabulary. We'll extract the vocabulary from just the first 20 examples to demonstrate how this can be done with `torchtext`. It provides a helper function called `build_vocab_from_iterator` that does exactly what we need:

In [9]:
vocab = build_vocab_from_iterator(
    simple_space_split(itertools.islice(train, 20)), specials=["<unk>"]
)

# get_itos() returns the list that maps each index to its token
for index, token in enumerate(vocab.get_itos()):
    print(f"[{index}] {token!r}")


[0] '<unk>'
[1] 'the'
[2] ','
[3] '.'
[4] 'to'
[5] 'and'
[6] 'of'
[7] 'a'
[8] 'in'
[9] '"'
[10] 'are'
[11] '='
[12] 'Valkyria'
[13] 'by'
[14] 'game'
[15] '@-@'
[16] 'as'
[17] 'character'
[18] 'their'
[19] 'with'
[20] 'can'
[21] 'is'
[22] 'Chronicles'
[23] "'s"
[24] 'that'
[25] 'Gallian'
[26] 'The'
[27] 'missions'
[28] ':'
[29] 'Nameless'
[30] 'an'
[31] 'characters'
[32] 'into'
[33] 'unit'
[34] 'squad'
[35] 'who'
[36] 'Army'
[37] 'also'
[38] 'be'
[39] 'it'
[40] 'military'
[41] 'on'
[42] 'player'
[43] 'series'
[44] 'through'
[45] 'was'
[46] 'Dahau'
[47] 'Darcsen'
[48] 'Gallia'
[49] 'III'
[50] 'Kurt'
[51] 'Raven'
[52] 'against'
[53] 'battlefield'
[54] 'different'
[55] 'each'
[56] 'from'
[57] 'his'
[58] 'move'
[59] 'not'
[60] 'order'
[61] 'other'
[62] 'story'
[63] 'such'
[64] 'them'
[65] 'war'
[66] 'weapon'
[67] "'"
[68] '422nd'
[69] 'As'
[70] 'Calamity'
[71] 'Each'
[72] 'Empire'
[73] 'II'
[74] 'Imperial'
[75] 'Japan'
[76] 'Potentials'
[77] 'at'
[78] 'both'
[79] 'enemy'
[80] 'for'
[81] 'ha

It doesn't look too bad!
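
To show what a vocabulary amounts to, here is a minimal pure-Python sketch that builds an index-to-token list and uses it to encode a sentence. It is not `torchtext`'s implementation. The ordering rule (special tokens first, then tokens by descending frequency, with ties broken alphabetically) is an assumption inferred from the output above, and the names `build_vocab`, `itos`, and `stoi` are chosen for this illustration only. Mapping unseen tokens to `<unk>` is likewise a choice made by this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def build_vocab(
    token_lists: Iterable[Iterable[str]], specials: Sequence[str] = ("<unk>",)
) -> List[str]:
    """Return an index-to-token list: specials first, then tokens by frequency."""
    counts: Counter = Counter()
    for tokens in token_lists:
        counts.update(tokens)

    # Specials are placed explicitly, so remove them from the counts.
    for special in specials:
        counts.pop(special, None)

    # Most frequent first; ties broken alphabetically (assumed ordering).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    return list(specials) + [token for token, _ in ordered]


itos = build_vocab(
    [["the", "game", "is", "the", "third", "game"], ["the", "<unk>", "series"]]
)
stoi: Dict[str, int] = {token: index for index, token in enumerate(itos)}

# Encode a sentence, mapping tokens outside the vocabulary to "<unk>".
encoded = [stoi.get(token, stoi["<unk>"]) for token in ["the", "third", "movie"]]

print(itos)
print(encoded)
```

The encoded list of integers is the kind of input a model's embedding layer would consume.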

## Putting it all together

We now have the ability to load text, tokenize it, split it into sentences, and build a vocabulary from it. Let's put this all together so that other code can easily take advantage of what we've done here.

We'll use `torchdata`'s datapipe framework to standardize this. It has a bit of a learning curve, but it's worth it in the long run.
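
Before the full implementation, the core idea can be sketched without `torchdata`: each stage wraps an upstream iterable and defines `__iter__`, so the whole chain can be iterated more than once. That matters here because the vocabulary stage in the next cell makes one pass over its input to build the vocabulary and another to encode it. The classes `SplitStage` and `DropStage` below are hypothetical plain-Python stand-ins, not `IterDataPipe` subclasses.

```python
from typing import Iterable, Iterator, List, Set


class SplitStage:
    """Re-iterable stage that splits each string from its source into tokens."""

    def __init__(self, source: Iterable[str]) -> None:
        self.source = source

    def __iter__(self) -> Iterator[List[str]]:
        # A fresh pass over the source starts every time iteration begins.
        for line in self.source:
            yield line.split()


class DropStage:
    """Re-iterable stage that removes unwanted tokens from each example."""

    def __init__(self, source: Iterable[List[str]], unwanted: Set[str]) -> None:
        self.source = source
        self.unwanted = unwanted

    def __iter__(self) -> Iterator[List[str]]:
        for tokens in self.source:
            yield [token for token in tokens if token not in self.unwanted]


lines = [" a tactical role @-@ playing game . \n", " real @-@ time gameplay . \n"]
pipeline = DropStage(SplitStage(lines), {"@-@"})

first_pass = list(pipeline)
second_pass = list(pipeline)  # Works again because each stage restarts its source.

print(first_pass)
print(first_pass == second_pass)
```

A plain generator would be exhausted after the first pass, which is why each stage is written as a class with its own `__iter__`.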

In [10]:
class SimpleSpaceSplit(dp.iter.IterDataPipe):
    def __init__(self, datapipe: dp.iter.IterDataPipe) -> None:
        self.datapipe = datapipe

    def __iter__(self) -> Iterator[Iterable[str]]:
        return simple_space_split(self.datapipe)


class FilterTokens(dp.iter.IterDataPipe):
    def __init__(self, datapipe: dp.iter.IterDataPipe, unwanted: Set[str]) -> None:
        self.datapipe = datapipe
        self.unwanted = unwanted

    def __iter__(self) -> Iterator[Iterable[str]]:
        for example in self.datapipe:
            yield filter(lambda token: token not in self.unwanted, example)


class BuildVocabularyFromTokens(dp.iter.IterDataPipe):
    def __init__(self, datapipe: dp.iter.IterDataPipe, specials: list[str]) -> None:
        self.datapipe = datapipe
        self.specials = specials
        self._vocabulary = None

    @property
    def vocabulary(self) -> Vocab:
        self._build_vocabulary()

        return self._vocabulary

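    def __iter__(self) -> Iterator[list[int]]:
        self._build_vocabulary()

        for example in self.datapipe:
            # Upstream stages may yield lazy iterables, so convert them to lists
            # before looking up the indices
            yield self.vocabulary.forward(
                example if isinstance(example, list) else list(example)
            )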

    def _build_vocabulary(self) -> None:
        if self._vocabulary is None:
            self._vocabulary = build_vocab_from_iterator(
                self.datapipe, specials=self.specials
            )


class SplitSentencesByIndex(dp.iter.IterDataPipe):
    def __init__(self, datapipe: dp.iter.IterDataPipe, period_index: int) -> None:
        self.datapipe = datapipe
        self.period_index = period_index

    def __iter__(self) -> Iterator[list[list[int]]]:
        for example in self.datapipe:
            sentences = []
            sentence = []

            for index in example:
                if index == self.period_index:
                    if len(sentence) > 0:
                        sentences.append(sentence)
                        sentence = []
                else:
                    sentence.append(index)

            if len(sentence) > 0:
                sentences.append(sentence)

            yield sentences


def get_wiki2_transformer_datapipe(
    strings: Iterable[str],
) -> Tuple[Vocab, Iterable[list[list[int]]]]:
    datapipe_tokens_unfiltered = SimpleSpaceSplit(strings)
    datapipe_tokens_filtered = FilterTokens(datapipe_tokens_unfiltered, {"@-@"})
    datapipe_indices = BuildVocabularyFromTokens(
        datapipe_tokens_filtered, specials=["<unk>"]
    )
    datapipe_sentences = SplitSentencesByIndex(
        datapipe_indices, datapipe_indices.vocabulary["."]
    )

    return datapipe_indices.vocabulary, datapipe_sentences


vocab, datapipe = get_wiki2_transformer_datapipe(train)

# Fetch the index-to-token list once instead of on every lookup
itos = vocab.get_itos()

count = 0

for example in datapipe:
    # Only print the first six examples, but keep counting all of them
    if count <= 5:
        print(f"Example {count}:")

        for sentence_index, sentence_tokens in enumerate(example):
            print(f"Sentence {sentence_index}:", end="")

            for token_index in sentence_tokens:
                print(f" {itos[token_index]}", end="")

            print()

        print()

    count += 1

print(f"Count: {count}")

Example 0:

Example 1:
Sentence 0: = Valkyria Chronicles III =

Example 2:

Example 3:
Sentence 0: Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit
Sentence 1: Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role playing video game developed by Sega and Media.Vision for the PlayStation Portable
Sentence 2: Released in January 2011 in Japan , it is the third game in the Valkyria series
Sentence 3: <unk> the same fusion of tactical and real time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven "

Example 4:
Sentence 0: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II
Sentence 1: While it retained the standard features of 