Getting started, tokenization

This is my first attempt at creating a simple LLM that knows about some Wikipedia articles. The dataset will be fetched from Wikipedia using the wikipedia-api library.

First we will install the dependencies required for this project.

In [8]:
!pip install wikipedia-api transformers torch



We will then import the libraries:
1. wikipediaapi: the Wikipedia API wrapper that supplies our dataset. It fetches page content from Wikipedia in a selected language.
2. AutoTokenizer: a Hugging Face transformers class that loads the matching tokenizer for NLP models such as GPT-2, BERT, etc.

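FYI: A tokenizer is a function that splits input text into small pieces called tokens (words or parts of words) and maps each token to an integer ID. The LLM works on these IDs, not on the raw text. The tokenizer itself does not place related words "close" to each other. That happens later, inside the model, where each token ID gets a learned vector (an embedding) and related tokens tend to end up near each other, which helps the LLM predict the next token.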
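
To make the text-to-IDs idea concrete, here is a minimal sketch of a toy word-level tokenizer. It is only an illustration: GPT-2's real tokenizer uses byte-pair encoding on sub-word pieces, and the helper names build_vocab, encode and decode are made up for this example.

```python
# Toy word-level tokenizer: NOT GPT-2's byte-pair encoding, just an illustration.
def build_vocab(text):
    """Give each distinct lowercase word an integer ID, in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map each word to its ID; words missing from the vocabulary get -1."""
    return [vocab.get(word, -1) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to words; unknown IDs become '<unk>'."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word.get(i, "<unk>") for i in ids)


vocab = build_vocab("hi how are you")
ids = encode("how are you", vocab)
print(ids)                 # [1, 2, 3]
print(decode(ids, vocab))  # how are you
```

The IDs carry no meaning on their own; they are just lookup keys that the model later turns into vectors.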

In [9]:
import wikipediaapi
from transformers import AutoTokenizer


We will then initialize the Wikipedia API in English. This will make sure that all the data being pulled from the Wiki database is in English.

It is important to note that we pass two arguments to Wikipedia(). The first is the user agent, a string that identifies our client to the Wikipedia servers and is needed to make requests. The second is, of course, the language. We have selected Google_Collab_API as our user agent.

In [10]:
wiki = wikipediaapi.Wikipedia("Google_Collab_API","en")

We will define a function that gets the Wikipedia content. If it finds the page, it will return the text from the page; otherwise it will return None.

In [11]:
def get_wikipedia_content(title):
    """
    Fetches the text content of a Wikipedia page.

    This function takes a Wikipedia page title, checks if the page exists,
    and returns its text content. If the page does not exist, it prints
    an error message and returns None.

    Args:
        title (str): The title of the Wikipedia page to retrieve.

    Returns:
        str or None: The text content of the Wikipedia page if it exists, otherwise None.

    Example:
        content = get_wikipedia_content("Python (programming language)")
        if content:
            print(content[:500])  # Prints first 500 characters of the content
    """
    page = wiki.page(title)  # fetches the Wikipedia page with the given title
    if not page.exists():    # page.exists() returns True if the page exists, False otherwise
        print(f"PAGE '{title}' DOES NOT EXIST!")
        return None
    return page.text         # returns the full text of the page


In the snippet below, we have a title variable which will be used to call the get_wikipedia_content function and fetch the page related to that title.

The title variable can be changed for other titles.

In [12]:
title = "Artificial Intelligence"
content = get_wikipedia_content(title)

if content:
  print(f" Successfully fetched Wikipedia content for '{title}'")

 Successfully fetched Wikipedia content for 'Artificial Intelligence'
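
Since the title can be swapped for other titles, the same pattern extends to a list of titles. The sketch below is one possible way to do that. The names fetch_many and fake_fetch are hypothetical, and fake_fetch is a stand-in with placeholder text so the example runs without a network connection; in the notebook you would pass get_wikipedia_content instead.

```python
def fetch_many(titles, fetch):
    """Return {title: text} for every title whose fetch() result is not None."""
    pages = {}
    for title in titles:
        text = fetch(title)
        if text is not None:  # skip pages that were not found
            pages[title] = text
    return pages


# Offline stand-in for get_wikipedia_content (hypothetical, placeholder text only).
def fake_fetch(title):
    sample = {"Artificial Intelligence": "placeholder page text"}
    return sample.get(title)  # None when the title is unknown


pages = fetch_many(["Artificial Intelligence", "No Such Page"], fake_fetch)
print(list(pages))  # ['Artificial Intelligence']
```

Passing the fetch function in as an argument keeps the loop independent of the Wikipedia client, which also makes it easy to try out without hitting the API.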


We will now load a pre-trained tokenizer, in this case GPT-2's. We need the tokenizer to convert human language into a sequence of numbers (token IDs) that the LLM takes as input.

In [13]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


We will now tokenize the text received from Wikipedia, which is stored in the variable content. The call returns the token IDs as PyTorch tensors (return_tensors="pt") and truncates the text if it is longer than 1024 tokens, which is GPT-2's maximum context length. Note that truncation does not split the text into chunks: everything after the first 1024 tokens is discarded.

The tokens created from the page are stored in the tokens variable.

In [14]:
tokens = tokenizer(content, return_tensors="pt", truncation=True, max_length=1024)
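
With truncation=True only the first max_length tokens are kept. If we wanted to keep the whole article instead, one option is to split the full list of token IDs into windows of at most 1024. Below is a minimal sketch of that idea on a plain Python list. The function name chunk_ids is made up for this example, and the list of numbers stands in for real token IDs (which could, as an assumption, come from calling the tokenizer without truncation).

```python
def chunk_ids(ids, max_length=1024):
    """Split a flat list of token IDs into consecutive chunks of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


ids = list(range(2500))  # stand-in for real token IDs
chunks = chunk_ids(ids)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

The last chunk is usually shorter than the rest, so it would need padding or special handling before being batched together with the others.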

Next we will display the tokenized output. This will show what the tokens look like. Pretty cool!

In [15]:
print(f"\n Tokenized input IDs: \n'{tokens.input_ids}' ")


 Tokenized input IDs: 
'tensor([[8001, 9542, 4430,  ...,  257, 2176, 3061]])' 


As a sanity check, we will convert the tokens back into text. Because of the truncation, this recovers only the start of the Wikipedia page.

In [16]:
decoded_text = tokenizer.decode(tokens.input_ids[0])           # converts token IDs back to human language
print(f"\n🔹 Decoded text (sample):\n{decoded_text[:500]}...") # prints only the first 500 characters



🔹 Decoded text (sample):
Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.
High-profile applications of AI include advanced web search engines (e.g., Google Search); rec...
