# Exercise - Tokenizers

With `transformers`, you can load the tokenizer of any model as follows:

```
from transformers import AutoTokenizer

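model = "gpt2"  # any model name that provides a tokenizer
text = "Hello, world!"
tokenizer = AutoTokenizer.from_pretrained(model)
token_ids = tokenizer.encode(text)  # list of integer token IDs for one string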
```

The goal of this exercise is to explore how the ratio of the number of characters to the number of tokens varies by language and tokenizer. This ratio affects both how much text fits into your model's context window and how fast the text is processed: the more characters each token covers, the fewer tokens the same text needs. We will use Wikipedia as the source of our texts.
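To make the ratio concrete, here is a minimal sketch that computes characters per token for any encoding function. The helper name `chars_per_token` and the two toy encoders are illustrative assumptions, not part of `transformers`; the `encode` method of a real tokenizer can be passed in the same way.

```python
def chars_per_token(text, encode):
    """Return the number of characters per token for `text`.

    `encode` is any callable that maps a string to a sequence of tokens,
    for example the `encode` method of a Hugging Face tokenizer.
    """
    tokens = encode(text)
    if len(tokens) == 0:
        raise ValueError("encode() produced no tokens")
    return len(text) / len(tokens)


# Toy encoders (hypothetical, for illustration only)
def char_encode(text):
    return list(text)    # one token per character

def word_encode(text):
    return text.split()  # one token per whitespace-separated word


sample = "tokenizers split text into pieces"
char_ratio = chars_per_token(sample, char_encode)
word_ratio = chars_per_token(sample, word_encode)
print(char_ratio)  # 1.0, since every character is its own token
print(word_ratio)

# With a fixed token budget, a higher ratio means more characters fit.
context_tokens = 1024  # assumed budget; this is the context length of GPT-2
print(context_tokens * word_ratio)
```

Real subword tokenizers usually land between these two extremes, and where exactly depends on the language of the text.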

**Exercise:**
- Instead of using "[Python (programming language)](https://en.wikipedia.org/wiki/Python_(programming_language))", select a different Wikipedia page.
- Experiment with different tokenizers. What is the highest character-to-token ratio you can achieve?




In [1]:
import wikipedia
import wikipediaapi
from transformers import AutoTokenizer, GPT2TokenizerFast

In [None]:
def download_wikipedia_text(title, language="en"):
    wikipedia.set_lang(language)

    try:
        # Get the Wikipedia page by title
        page = wikipedia.page(title)

        # Extract the text content of the page
        text = page.content

        return text

    except wikipedia.exceptions.PageError as e:
        print(f"Page not found: {e}")
        return None

    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page encountered: {e}")
        return None

title = "Python (programming language)"  ## CHANGE THIS!
text = download_wikipedia_text(title)

In [None]:
print(len(text))
text

In [None]:
def get_page_title_in_other_languages(english_title, languages):
    # Create a WikipediaAPI object
    wiki = wikipediaapi.Wikipedia(user_agent='MyWikiApp/1.0.0')

    # Get the English page object for the title passed in as an argument
    page = wiki.page(english_title)

    # Create a dictionary to store the page titles in different languages
    page_titles = {}

    # Iterate over the specified languages
    for lang in languages:
        # Get the language page object
        lang_page = page.langlinks.get(lang)

        if lang_page:
            page_titles[lang] = lang_page.title
        else:
            page_titles[lang] = None

    return page_titles

languages = ['cs', 'sk']  ## CHANGE THIS!
titles = get_page_title_in_other_languages(title, languages)
# Skip languages that have no linked page (their title is None)
texts = {lang: download_wikipedia_text(langtitle, lang) for lang, langtitle in titles.items() if langtitle is not None}
texts['en'] = text

In [None]:
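# Number of characters per language; texts that failed to download are skipped
{lang: len(langtext) for lang, langtext in texts.items() if langtext is not None}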

In [None]:
model = "gpt2"  ## CHANGE THIS!
# You can try tokenizers from other models, e.g. mistralai/Mistral-7B-v0.1, bigscience/bloom,
# MaLA-LM/mala-500-10b-v2, BUT-FIT/csmpt7b. For the GPT-4 tokenizer you can use
# tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
len(texts['en']) / len(tokenizer.encode(texts['en']))

In [None]:
len(texts['cs']) / len(tokenizer.encode(texts['cs']))
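The two cells above repeat the same computation for each language. A small helper can report the ratio for every downloaded language at once. This is only a sketch: `ratios_by_language` is a hypothetical name, and it accepts any `encode` callable so that it can be tried without downloading a model. The demo texts and the whitespace encoder below are stand-ins, not real results.

```python
def ratios_by_language(texts, encode):
    """Map each language code to its characters-per-token ratio.

    Languages whose text is missing (None or empty) are skipped.
    """
    ratios = {}
    for lang, langtext in texts.items():
        if not langtext:
            continue
        n_tokens = len(encode(langtext))
        if n_tokens > 0:
            ratios[lang] = len(langtext) / n_tokens
    return ratios


# Stand-in data and encoder (assumptions for illustration only)
demo_texts = {"en": "a short example", "cs": "krátký příklad", "sk": None}
demo_ratios = ratios_by_language(demo_texts, str.split)

# Print languages from the highest to the lowest ratio
for lang, ratio in sorted(demo_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {ratio:.2f} characters per token")
```

In the notebook, the call would be `ratios_by_language(texts, tokenizer.encode)`. Some tokenizers add special tokens (such as a beginning-of-sequence marker) by default; passing `add_special_tokens=False` to `encode` leaves them out, although on article-length texts the difference is negligible.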