<a href="https://colab.research.google.com/github/saadmughlii/LLM_practice/blob/main/my_first_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization of data fetched from Wikipedia repository

This is my first attempt at creating a simple LLM that knows about some Wikipedia articles. The text is fetched from Wikipedia with the wikipedia-api package and then wrapped with the datasets library for training.

First, we will install the dependencies required for this project.

In [1]:
!pip install torch transformers datasets accelerate wikipedia-api



We will then import the libraries:
1. Wikipedia API (wikipediaapi): fetches page content from Wikipedia in a selected language.
2. AutoTokenizer: a Hugging Face class that loads the matching tokenizer for NLP models like GPT-2, BERT etc.

FYI: A tokenizer splits text into small units called tokens and maps each token to an integer ID from a fixed vocabulary. The LLM works on these IDs, not on raw text. The idea that related words such as "hi" and "how" sit "close" to each other comes from the embeddings the model learns for those IDs, not from the tokenizer itself.
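
To make the text-to-IDs idea concrete, here is a minimal sketch of a toy word-level tokenizer. It is only an illustration: the helpers `build_vocab`, `encode`, and `decode` are made up for this example, and GPT-2's real tokenizer works on sub-word pieces (byte-pair encoding) rather than whole words.

```python
# Toy word-level tokenizer, for illustration only.
# GPT-2's real tokenizer uses byte-pair encoding over sub-word pieces.

def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase word in the corpus."""
    words = sorted(set(corpus.lower().split()))
    return {word: idx for idx, word in enumerate(words)}

def encode(text, vocab):
    """Convert text to a list of token IDs (every word must be in the vocabulary)."""
    return [vocab[word] for word in text.lower().split()]

def decode(ids, vocab):
    """Convert a list of token IDs back to text."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab("hi how are you")   # are=0, hi=1, how=2, you=3
ids = encode("how are you", vocab)
print(ids)                  # → [2, 0, 3]
print(decode(ids, vocab))   # → how are you
```

The round trip (text to IDs and back) is the same shape of operation that the real tokenizer performs with its `encode` and `decode` steps, just with a much larger vocabulary.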

In [2]:
import wikipediaapi
from transformers import AutoTokenizer



We will then initialize the Wikipedia API in English. This makes sure that all the text pulled from Wikipedia comes from the English edition.

It is important to note that Wikipedia() is called here with 2 arguments. The first one is the user agent, a string that identifies our client in the requests sent to Wikipedia's servers, and the second is of course the language. We have selected Google_Collab_API as our user agent.

In [None]:
wiki = wikipediaapi.Wikipedia(user_agent="Google_Collab_API", language="en")

We will define a function that gets the Wikipedia content. If it finds the page, it returns the text of the page; otherwise it prints a message and returns None.

In [None]:
def get_wikipedia_content(title):
    """
    Fetches the text content of a Wikipedia page.

    This function takes a Wikipedia page title, checks if the page exists,
    and returns its text content. If the page does not exist, it prints
    an error message and returns None.

    Args:
        title (str): The title of the Wikipedia page to retrieve.

    Returns:
        str or None: The text content of the Wikipedia page if it exists, otherwise None.

    Example:
        content = get_wikipedia_content("Python (programming language)")
        if content:
            print(content[:500])  # Prints first 500 characters of the content
    """
    page = wiki.page(title)  #fetches the Wikipedia page with the given title
    if not page.exists():    #page.exists() returns True if the page exists, False otherwise
        print(f"PAGE '{title}' DOES NOT EXIST!")
        return None
    return page.text         #returns the whole text of the page found


In the snippet below, we have a title variable which will be used to call the get_wikipedia_content function and fetch the page related to that title.

The title variable can be changed for other titles.

In [None]:
title = "Artificial Intelligence"
content = get_wikipedia_content(title)

if content:
    print(f"Successfully fetched Wikipedia content for '{title}'")

We will now load a pre-trained tokenizer, in this case the one that ships with GPT-2. We need the tokenizer to convert human language into the sequence of integer token IDs that the model takes as input.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

**NOTE: The code for the rest of this section has been commented out to save processing power; in the next section we will do the tokenization inside a function. Please uncomment it for your own testing purposes.**


We will now tokenize the text received from Wikipedia, which is stored in the variable called content. The call returns PyTorch tensors and truncates the text to 1024 tokens, which is the context limit of GPT-2 itself. Note that truncation does not split the article into chunks: any tokens past the first 1024 are dropped.

The tokens created from the page will be stored in the tokens variable.

In [None]:
#if content:
    #tokens = tokenizer(content, return_tensors="pt", truncation=True, max_length=1024)
    #print("Tokenization complete!")
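
With `truncation=True`, the tokenizer keeps only the first `max_length` tokens and drops the rest. If you would rather keep the whole article, one option is to tokenize without truncation and split the resulting IDs into chunks yourself. The sketch below shows only the splitting step; `chunk_token_ids` is a hypothetical helper (not part of transformers), and the IDs are dummy integers standing in for real tokenizer output.

```python
def chunk_token_ids(token_ids, max_length=1024):
    """Split a flat list of token IDs into consecutive chunks of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

# Stand-in for real token IDs: 2500 dummy integers instead of tokenizer output.
fake_ids = list(range(2500))
chunks = chunk_token_ids(fake_ids)
print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

Each chunk could then be treated as its own training example, so no part of the page is thrown away.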

Next we will display the tokenized output. This will show what the tokens look like. Pretty cool!

In [None]:
#print(f"\n Tokenized input IDs: \n'{tokens.input_ids}' ")

To test things out, we will convert the tokens back to text. Because of the truncation, this gives back only the beginning of the Wikipedia page.

In [None]:
#decoded_text = tokenizer.decode(tokens.input_ids[0])           #converts tokens back to human language.
#print(f"\n🔹 Decoded text (sample):\n{decoded_text[:500]}...") #prints only the first 500 characters


# Training the model

Now we will work on training our model. For that, we will use the PyTorch and datasets libraries. The next snippet tokenizes the dataset: it first wraps the Wikipedia text in a Dataset object, then maps the tokenizer function over every entry.



In [None]:
import torch
from datasets import Dataset


def tokenize_function(examples):
    """
    Tokenizes the input examples into token IDs using the specified tokenizer.

    This function takes a dictionary of examples that contains a "text"
    field, and tokenizes the text using the provided tokenizer. Each text
    is truncated to a maximum length of 1024 tokens if necessary.

    Args:
        examples (dict): A dictionary containing text data. When called
                         through dataset.map with batched=True, the "text"
                         field holds a list of strings to be tokenized.

    Returns:
        dict: A dictionary containing the tokenized text with token IDs.
              The output is compatible with the input structure of the tokenizer
              (e.g., token IDs, attention masks).

    Example:
        input_examples = {"text": ["This is a sample text."]}
        tokenized_output = tokenize_function(input_examples)

    Note:
        The tokenizer should be defined elsewhere in the code, and this function
        assumes that the tokenizer has been properly initialized with the required
        settings (e.g., model type, padding, etc.).
    """
    return tokenizer(examples["text"], truncation=True, max_length=1024)

dataset = Dataset.from_dict({"text" : [content]})   #converts wikipedia text into dataset format
tokenized_dataset = dataset.map(tokenize_function, batched=True)  #tokenizes all dataset entries
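
To see what `batched=True` means for the mapped function, here is a simplified stand-in written in plain Python. It is not how the datasets library is implemented; `toy_tokenize` is a made-up function that returns word lengths instead of real token IDs. The point is only the shape of the data: the function receives a dict of column name to list of values, and returns new columns as lists of the same length.

```python
def toy_tokenize(batch):
    """Stand-in for tokenize_function: takes a dict of lists, returns a dict of lists."""
    return {"input_ids": [[len(word) for word in text.split()] for text in batch["text"]]}

# With batched=True, the mapped function sees column name -> list of values,
# not a single row.
batch = {"text": ["first article text", "second one"]}
new_columns = toy_tokenize(batch)

# The returned columns are added alongside the existing ones.
merged = {**batch, **new_columns}
print(merged["input_ids"])  # → [[5, 7, 4], [6, 3]]
```

In this notebook the "text" list contains a single string (the whole article), so the tokenized dataset also has a single row.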


I understand it may look a little overwhelming. Below is a simplified description of what's happening:

Grab text from wikipedia -> Convert the text into "dataset" format -> Tokenize the "dataset" formatted data.

Next we will load the pre-trained GPT-2 model, which will be fine-tuned on our Wikipedia text.

In [None]:
from transformers import AutoModelForCausalLM

model_used = AutoModelForCausalLM.from_pretrained("gpt2")

print("WE HAVE GPT-2 NOW YAY")

Finally, we will define the training arguments, which control how the training is done and where the checkpoints and logs are saved.

In [None]:
from transformers import TrainingArguments

# Define training settings
training_args = TrainingArguments(
    output_dir="./gpt2_wiki",       #where the model checkpoints are saved
    per_device_train_batch_size=2,  #2 samples per training step, nothing too heavy
    num_train_epochs=3,             #the model makes 3 passes over the whole tokenized dataset
    logging_dir="./logs",           #where the training logs are written
    save_strategy="epoch"           #a checkpoint is saved after every epoch
)

print("Training arguments set!")
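
It helps to work out how many training steps these settings actually produce. The sketch below is plain arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not a transformers function.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps for a full training run under the assumptions above."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Our dataset holds one (truncated) article, batch size 2, 3 epochs.
print(total_optimizer_steps(1, 2, 3))    # → 3

# For comparison: 100 examples with the same settings.
print(total_optimizer_steps(100, 2, 3))  # → 150
```

With a single article in the dataset, the whole run is only 3 steps, so the model will barely move away from the original GPT-2 weights. Adding more articles or chunks is the simplest way to give it more to learn from.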


Now that we have our training arguments set up, we will train the model.

In [None]:
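from transformers import Trainer, DataCollatorForLanguageModeling

# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# The collator copies input_ids into labels so the model can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize Trainer
trainer = Trainer(
    model=model_used,                #this will train gpt-2
    args=training_args,              #the arguments being used
    train_dataset=tokenized_dataset, #the data the model is trained on
    data_collator=data_collator      #builds batches with labels for causal LM
)

# Start training
trainer.train()

print("Training complete!")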
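
What the model is actually learning during training is next-token prediction: given the tokens so far, predict the one that follows. The sketch below spells out those (context, target) pairs for a tiny made-up sequence. `next_token_pairs` is a hypothetical helper for illustration; in practice the GPT-2 model class does the equivalent shift internally when it is given labels.

```python
def next_token_pairs(token_ids):
    """Pair each context (the tokens so far) with the token that should come next."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Dummy token IDs standing in for a tokenized sentence.
ids = [10, 20, 30, 40]
for context, target in next_token_pairs(ids):
    print(context, "->", target)
# → [10] -> 20
# → [10, 20] -> 30
# → [10, 20, 30] -> 40
```

A sequence of N tokens therefore yields N - 1 prediction targets, which is why the labels for causal language modeling can simply be a copy of the input IDs.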


We have finally trained the model on our Wikipedia article about artificial intelligence.

# Testing for "Artificial Intelligence" article

Now that we have trained our GPT-2 model on the "Artificial Intelligence" article, fetched into the content variable in the first section, we can test it by giving it a prompt and letting GPT-2 continue it. Note that max_length = 100 caps the whole sequence, prompt included, at 100 tokens (not words).


In [None]:
input_text = "The future of artificial intelligence is"    #partial prompt given to our trained GPT-2 model

#convert the input text into tokens that GPT-2 understands, on the same device as the model
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model_used.device)

#max_length caps the total sequence (prompt + generated tokens) at 100 tokens
output = model_used.generate(input_ids, max_length=100)

print("Generated text \n", tokenizer.decode(output[0])) #the output being given by GPT 2
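
One detail worth spelling out: in `generate`, `max_length` counts the prompt tokens as well as the newly generated ones. The toy loop below mimics that behaviour with a made-up "model" (a lambda that returns the previous token plus one). It is a sketch of the length budget only, not of how GPT-2 picks tokens.

```python
def toy_generate(prompt_ids, next_token, max_length):
    """Append tokens until the sequence, prompt included, reaches max_length."""
    output = list(prompt_ids)
    while len(output) < max_length:
        output.append(next_token(output))
    return output

# Hypothetical "model": always predicts the previous token plus one.
prompt = [5, 6, 7]
result = toy_generate(prompt, lambda seq: seq[-1] + 1, max_length=8)
print(result)                     # → [5, 6, 7, 8, 9, 10, 11, 12]
print(len(result) - len(prompt))  # → 5
```

So a 3-token prompt with a limit of 8 leaves room for only 5 new tokens. If you want to control the number of new tokens directly, `generate` also accepts `max_new_tokens`.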