# Step 1: Installation and Initial Setup

In [1]:
!pip install transformers datasets



The transformers library, provided by Hugging Face, contains pre-trained models and tools for building and fine-tuning various Natural Language Processing (NLP) models. The datasets library is used to load popular datasets conveniently, which makes it easy to prepare data for training and fine-tuning models. Run this installation command at the beginning to set up these libraries.



# Step 2: Loading and Sampling the Dataset
Load a dataset suitable for fine-tuning:



In [None]:
!pip install --upgrade fsspec

from datasets import load_dataset

# load IMDb dataset and take a small sample
dataset = load_dataset("imdb", split="train[:1%]")
print(dataset[0])

Here, we load the IMDb movie reviews dataset, often used in NLP tasks for sentiment analysis. By specifying train[:1%], we load only the first 1% of the training set (250 of its 25,000 reviews), which keeps experimentation fast and light on compute. The print(dataset[0]) call displays the first record so you can confirm that the data loaded correctly.



# Step 3: Data Preprocessing


In [None]:
def preprocess(batch):
    batch['text'] = [text.replace('\n', ' ') for text in batch['text']]
    return batch

# apply preprocessing to the dataset
dataset = dataset.map(preprocess, batched=True)

This function replaces the newline characters in each review with spaces. It is a light normalization step: some models may not handle newline characters well, especially if they were trained on single-line inputs. dataset.map(preprocess, batched=True) applies the function to the entire dataset one batch at a time, which is faster than calling it once per example.
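Because a batched map function receives a plain dict of column lists, the preprocessing logic can be checked on a hand-made batch before touching the real dataset. The sketch below does that. The second helper, strip_break_tags, is a hypothetical addition that is not part of the original pipeline; it is included only because IMDb reviews contain literal `<br />` tags.

```python
def preprocess(batch):
    # Same logic as the cell above: flatten newlines to spaces.
    batch["text"] = [text.replace("\n", " ") for text in batch["text"]]
    return batch


def strip_break_tags(batch):
    # Hypothetical extra step (not in the original pipeline):
    # replace literal HTML line breaks with a space.
    batch["text"] = [text.replace("<br />", " ") for text in batch["text"]]
    return batch


# A hand-made batch in the dict-of-lists layout that map(..., batched=True) passes in.
toy_batch = {
    "text": ["Great film.\nWould watch again.", "Slow start.<br />Strong ending."],
    "label": [1, 1],
}

cleaned = strip_break_tags(preprocess(toy_batch))
print(cleaned["text"])
```

Both helpers leave the label column untouched and return the full batch, which is what dataset.map expects from a batched function.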



# Step 4: Initializing the Model and Tokenizer
Load a pre-trained model and tokenizer for fine-tuning:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

Here, we load distilgpt2, a distilled, lightweight version of GPT-2 that is suited to causal language modelling tasks. AutoTokenizer and AutoModelForCausalLM automatically download and set up the tokenizer and model architecture for the specified model. The GPT-2 tokenizer ships without a padding token, so we reuse eos_token as pad_token; without it, padding sequences to a common length for batch processing raises an error.



# Step 5: Tokenizing the Data
Convert text into tokens the model can understand:

In [None]:
def tokenize_function(examples):
    tokenized = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)
    tokenized['labels'] = tokenized['input_ids'].copy()  # set labels to be the same as input_ids
    return tokenized

tokenized_data = dataset.map(tokenize_function, batched=True)

This function tokenizes each text input by converting it into the integer IDs that the model processes. Using padding="max_length" and truncation=True with max_length=128 gives every tokenized sequence a fixed length of 128 tokens, which keeps memory use per batch bounded and predictable. Setting labels to a copy of input_ids prepares the dataset for causal language modelling: the model shifts the labels internally, so it learns to predict the next token at each position in the sequence.
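To make the fixed-length behaviour concrete, here is a plain-Python sketch of truncation and right-padding on toy token IDs, with no tokenizer involved. It also shows an optional variant that the cell above does not use: padded positions in labels are set to -100, the value that PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss. The helper name pad_and_label and the toy IDs are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_label(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build masked labels."""
    ids = token_ids[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    # Real tokens keep their id as the label; padded positions are ignored.
    labels = ids + [IGNORE_INDEX] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Arbitrary toy ids; 50256 is GPT-2's end-of-text id, used here as the pad id.
short = pad_and_label([464, 2646, 373], max_length=5, pad_id=50256)
long = pad_and_label([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=50256)
print(short)
print(long)
```

The short sequence gains two pad tokens with attention_mask 0, while the long one is cut to its first five tokens. Whether to mask padded labels is a design choice; with short reviews and max_length=128, a large share of positions can be padding.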



# Step 6: Configuring Training Parameters
The next step in the fine-tuning process is to set up hyperparameters for model training:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in transformers 4.41+, where this older name is deprecated
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=1
)

The TrainingArguments class is used to define the hyperparameters and settings for training. Key parameters include:

* output_dir: Directory to save model checkpoints.
* evaluation_strategy="epoch": Evaluate the model at the end of each epoch.
* per_device_train_batch_size and per_device_eval_batch_size: Number of samples processed per device in each batch during training and evaluation, respectively.
* num_train_epochs=1: Train the model for a single epoch.
* logging_steps: How often to log training information.
* save_total_limit=1: Keep only the most recent checkpoint on disk and delete older ones to save storage.
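These settings determine how long training runs and how often it logs. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and a training set of 200 examples (80% of the 250 sampled reviews, matching the split in the next step). The function name training_schedule is illustrative.

```python
import math


def training_schedule(n_train, batch_size, epochs, logging_steps):
    # Assumes a single device and no gradient accumulation.
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    log_events = total_steps // logging_steps
    return steps_per_epoch, total_steps, log_events


# 1% of IMDb's 25,000 training reviews is 250; an 80% training share is 200.
steps_per_epoch, total_steps, log_events = training_schedule(
    n_train=200, batch_size=4, epochs=1, logging_steps=10
)
print(steps_per_epoch, total_steps, log_events)  # 50 50 5
```

With these numbers, one epoch is 50 optimizer steps, and logging_steps=10 yields five logged loss values.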

# Step 7: Splitting the Dataset
Now, divide the dataset into training and evaluation sets:

In [None]:
shuffled = tokenized_data.shuffle(seed=42)  # shuffle once so the two splits cannot overlap
split_idx = int(0.8 * len(shuffled))
train_data = shuffled.select(range(split_idx))
eval_data = shuffled.select(range(split_idx, len(shuffled)))

Here, we randomly shuffle the dataset and then split it into 80% training data and 20% evaluation data. This ensures that the model has enough data to learn from and also allows for a validation set to assess the model’s performance.
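A split is only clean if both parts are cut from the same shuffled order; two independent shuffles can place the same review in both sets. The following plain-Python sketch illustrates the idea on row indices. The helper split_indices and the seed value are assumptions for illustration.

```python
import random


def split_indices(n_rows, train_fraction, seed):
    # Shuffle the row indices once, then cut the single shuffled order in two.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * n_rows)
    return indices[:cut], indices[cut:]


train_idx, eval_idx = split_indices(n_rows=250, train_fraction=0.8, seed=42)
print(len(train_idx), len(eval_idx))
print(set(train_idx) & set(eval_idx))  # empty set: no row is in both splits
```

The datasets library also offers Dataset.train_test_split(test_size=0.2, seed=...), which performs a comparable shuffle-and-split in a single call.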



# Step 8: Setting Up the Trainer & Fine-Tuning the Model
Next, create the Trainer that will run the fine-tuning loop:

In [None]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

The Trainer class in transformers simplifies the training process by automating tasks like gradient updates and model evaluation. It reads the hyperparameters from training_args, trains on train_data, and evaluates on eval_data.


This is the fine-tuning step itself. Start training the model on the prepared dataset:

In [None]:
trainer.train()

This command initiates the fine-tuning process. The train() function performs multiple forward and backward passes through the data, which updates the model’s weights to minimize prediction errors based on the IMDb dataset. Fine-tuning will allow the pre-trained distilgpt2 model to adjust to the specific language and style of movie reviews.
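A common way to read a language model's evaluation loss is to convert it to perplexity, the exponential of the mean per-token cross-entropy. The sketch below shows the conversion. The metrics dict is hypothetical and only mimics the shape of what trainer.evaluate() returns (a dict containing an "eval_loss" key); the value 3.7 is made up.

```python
import math


def perplexity(eval_loss):
    # Perplexity is the exponential of the mean per-token cross-entropy loss.
    return math.exp(eval_loss)


# Hypothetical metrics dict, shaped like the one trainer.evaluate() returns.
metrics = {"eval_loss": 3.7}
print(f"perplexity: {perplexity(metrics['eval_loss']):.1f}")
```

Lower perplexity means the model assigns higher probability to the held-out reviews; comparing the value before and after training gives a rough measure of how much fine-tuning helped.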

# Step 9: Save & Test the Fine-tuned Model
Save the model and tokenizer for future use:

In [None]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

Once training is completed, saving the model ensures that the fine-tuned parameters can be reused without re-running the entire process. Calling save_pretrained on the model writes its weights and configuration to the directory, and calling it on the tokenizer writes the tokenizer files alongside them.

Now, let’s generate text based on a prompt to evaluate the model:

In [None]:
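prompt = "The script"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # match the device the Trainer left the model on

# pass the attention mask along with input_ids; set pad_token_id explicitly to avoid a warning
output = model.generate(**inputs, max_length=15, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))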

In this final section, we provide a sample prompt (“The script”) to test the model’s generative capabilities. Unless do_sample=True is passed, generate() decodes greedily, picking the most likely next token at each step rather than sampling from the model’s learned distribution. Note that max_length=15 counts the prompt tokens as well as the newly generated ones. By decoding and printing the output, you can observe how well the fine-tuned model generates text that aligns with the IMDb dataset.
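generate() supports both greedy decoding and sampling, with the do_sample flag switching between them. The difference can be illustrated without a model, using a made-up distribution over the next token. The probabilities and the helper names greedy_pick and sample_pick are hypothetical.

```python
import random

# Hypothetical next-token probabilities after the prompt "The script".
next_token_probs = {" is": 0.40, " was": 0.35, " of": 0.15, " feels": 0.10}


def greedy_pick(probs):
    # Greedy decoding: always take the single most likely token.
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    # Sampling: draw a token at random, weighted by its probability.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
greedy_choice = greedy_pick(next_token_probs)
sampled_choices = [sample_pick(next_token_probs, rng) for _ in range(5)]
print(greedy_choice)
print(sampled_choices)
```

Greedy decoding returns the same continuation on every run, while sampling can return different tokens on each draw, which tends to give more varied but less predictable text.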