<a href="https://colab.research.google.com/github/simecek/mlprague2024/blob/main/01_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1

The goal of this exercise is to explore how the ratio of the number of characters to the number of tokens varies by language and tokenizer. Because a model's context window and compute cost are both measured in tokens, this ratio determines how much text the model can process at once and how fast it processes it. We will use Wikipedia as the source of our texts.

**Exercise:**
- Instead of using "[Python (programming language)](https://en.wikipedia.org/wiki/Python_(programming_language))", select a different Wikipedia page.
- Instead of Czech, use your own language.
- Experiment with different tokenizers. What is the best character-to-token ratio you can achieve?
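
As a minimal sketch of the metric itself, the ratio is the number of characters divided by the number of tokens. The helper name `chars_per_token` and the toy byte-level encoder `utf8_byte_encode` below are assumptions made for illustration and are not part of the notebook; in the cells that follow, the encoder would be a real tokenizer's `encode` method.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence]) -> float:
    """Return the number of characters per token for `text` under `encode`."""
    tokens = encode(text)
    if len(tokens) == 0:
        raise ValueError("encode() returned no tokens; the ratio is undefined")
    return len(text) / len(tokens)


def utf8_byte_encode(text: str) -> list:
    """Toy stand-in tokenizer (assumption for this demo): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


english = "Python is a programming language."
czech = "Příliš žluťoučký kůň úpěl ďábelské ódy."

# ASCII text uses one byte per character.
print(chars_per_token(english, utf8_byte_encode))
# Each Czech letter with a diacritic takes two bytes, so the ratio drops below 1.
print(chars_per_token(czech, utf8_byte_encode))
```

Even with this trivial encoder the Czech sentence gets fewer characters per token than the English one, which is the effect the exercise measures with real tokenizers.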




In [2]:
!pip install -qq wikipedia wikipedia-api

In [3]:
import wikipedia
import wikipediaapi
from transformers import AutoTokenizer, GPT2TokenizerFast

In [4]:
def download_wikipedia_text(title, language="en"):
    wikipedia.set_lang(language)

    try:
        # Get the Wikipedia page by title
        page = wikipedia.page(title)

        # Extract the text content of the page
        text = page.content

        return text

    except wikipedia.exceptions.PageError as e:
        print(f"Page not found: {e}")
        return None

    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page encountered: {e}")
        return None

title = "Python (programming language)"  ## CHANGE THIS!
text = download_wikipedia_text(title)

In [5]:
print(len(text))
text

42814




In [6]:
def get_page_title_in_other_languages(english_title, languages):
    # Create a WikipediaAPI object
    wiki = wikipediaapi.Wikipedia(user_agent='MyWikiApp/1.0.0')

    # Get the page object for the given English title
    # (use the function argument, not the global `title`)
    page = wiki.page(english_title)

    # Create a dictionary to store the page titles in different languages
    page_titles = {}

    # Iterate over the specified languages
    for lang in languages:
        # Get the language page object
        lang_page = page.langlinks.get(lang)

        if lang_page:
            page_titles[lang] = lang_page.title
        else:
            page_titles[lang] = None

    return page_titles

languages = ['cs', 'sk']  ## CHANGE THIS!
titles = get_page_title_in_other_languages(title, languages)
# Skip languages that have no linked page (their title is None)
texts = {lang: download_wikipedia_text(langtitle, lang) for lang, langtitle in titles.items() if langtitle}
texts['en'] = text

In [7]:
{lang: len(langtext) for lang, langtext in texts.items()}

{'cs': 54785, 'sk': 12777, 'en': 42814}

In [8]:
model = "gpt2"  ## CHANGE THIS!
# You can also try tokenizers from other models, e.g. mistralai/Mistral-7B-v0.1, bigscience/bloom,
# MaLA-LM/mala-500-10b-v2 or BUT-FIT/csmpt7b. For the GPT-4 tokenizer, replace the line below with
# tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
tokenizer = AutoTokenizer.from_pretrained(model)




In [9]:
len(texts['en']) / len(tokenizer.encode(texts['en']))

Token indices sequence length is longer than the specified maximum sequence length for this model (9461 > 1024). Running this sequence through the model will result in indexing errors


4.525314448789769

In [10]:
len(texts['cs']) / len(tokenizer.encode(texts['cs']))

1.997629899726527
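
To see what these two ratios mean for the amount of text a model can process, the following rough sketch converts them into a character budget for the 1024-token limit mentioned in the warning above. The helper name `approx_chars_in_context` is hypothetical, and the estimate assumes the ratio measured on the whole article also holds for any shorter excerpt, which is only approximately true.

```python
# Maximum sequence length for gpt2, as reported in the tokenizer warning above.
GPT2_CONTEXT_TOKENS = 1024

# Characters per token measured above with the gpt2 tokenizer.
measured_ratios = {
    "en": 4.525314448789769,
    "cs": 1.997629899726527,
}


def approx_chars_in_context(chars_per_token: float, context_tokens: int) -> int:
    """Estimate how many characters fit into `context_tokens` tokens (rounded down)."""
    return int(chars_per_token * context_tokens)


budget = {
    lang: approx_chars_in_context(ratio, GPT2_CONTEXT_TOKENS)
    for lang, ratio in measured_ratios.items()
}

for lang, chars in budget.items():
    print(f"{lang}: about {chars} characters fit into {GPT2_CONTEXT_TOKENS} tokens")
```

Under these assumptions the same context window holds roughly 4,600 characters of the English article but only about 2,000 characters of the Czech one.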