# 📓 The GenAI Revolution Cookbook

**Title:** Fine-tuning large language models: a step-by-step guide [2025]

**Description:** Master full fine-tuning with Hugging Face: generate reliable datasets, configure seq2seq training, tune hyperparameters, and produce consistently formatted, higher-quality outputs.

**📖 Read the full article:** [Fine-tuning large language models: a step-by-step guide [2025]](https://blog.thegenairevolution.com/article/fine-tuning-large-language-models-a-step-by-step-guide-2025-3)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



If you've been working with an LLM, the first thing you tried was probably prompt engineering. You write clear prompts, follow the [Guidelines to Effective Prompting](https://thegenairevolution.com/harnessing-the-power-of-llm-apis-a-guide-to-effective-prompting/), and try to be direct. For complex tasks, you ask the model to reason step by step. If you want a complete guide on [building production\-ready LLM features with prompt engineering](/article/prompt-engineering-with-llm-apis-how-to-get-reliable-outputs-4), I put together one that covers pretty much everything.

But sometimes that's not enough. I've been there. So you try in\-context learning. As I explained in [The Magic of In\-Context Learning](https://thegenairevolution.com/the-magic-of-in-context-learning-teach-your-llm-on-the-fly/), you add examples to the prompt to show the model what you want. For more on [in\-context learning techniques](/article/the-magic-of-in-context-learning-teach-your-llm-on-the-fly-3), check out my practical guide. It has saved me countless hours.

If that still doesn't work, it's probably time for fine\-tuning. This is where things get interesting. I covered this in [Fine\-Tuning 101](https://thegenairevolution.com/fine-tuning-101-how-to-customize-llms-to-your-specific-needs/). Fine\-tuning lets you customize the LLM for your specific use case. Depending on your resources and needs, you can do full fine\-tuning or parameter\-efficient fine\-tuning. You might want to explore [approaches like LoRA](/article/parameter-efficient-fine-tuning-peft-with-lora-2025-hands-on-guide-2) to reduce costs while keeping performance strong. Actually, LoRA was a game\-changer for some personal projects where GPU memory was tight.

In this cookbook, I'll focus on full fine\-tuning. I'll walk you through how to apply it step by step. Let me show you what worked for me.

## Setup

Let's set up our environment. I'm following the same setup from [Running an LLM Locally on Your Own Server: A Practical Guide](https://thegenairevolution.com/running-an-llm-locally-on-your-own-server-a-practical-guide/). Check that post if you need detailed instructions. Getting the setup right saves so much headache.

In [None]:
# Import the necessary packages
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

# Load the FLAN-T5 model
model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot learning
input_text = """
Answer the following geography questions using the format shown in the context. 
Answer with a single sentence containing the city’s name, country, population, and three famous landmarks.

Follow the pattern below:

Q: Tell me about Paris.  
A: Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.

Q: Describe New York.  
A: New York is a city in the United States with a population of 8.5 million, known for landmarks such as the Statue of Liberty, Central Park, and Times Square.

Q: What can you say about Tokyo?  
A: Tokyo is a city in Japan with a population of 14 million, known for landmarks such as the Tokyo Tower, Shibuya Crossing, and the Meiji Shrine.

Q: Tell me some information about Sydney.  
A: Sydney is a city in Australia with a population of 5.3 million, known for landmarks such as the Sydney Opera House, the Harbour Bridge, and Bondi Beach.

Q: Could you give me some details about Cairo?  
A: Cairo is a city in Egypt with a population of 9.5 million, known for landmarks such as the Pyramids of Giza, the Sphinx, and the Egyptian Museum.

Now, describe Vancouver in the same format.
"""

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate response
outputs = model.generate(inputs.input_ids, max_length=50)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
Vancouver is a city in Canada.

For a task like ours, which asks for a city's name, country, population, and landmarks in one fixed sentence, in\-context learning hits its limit with this model: the output names the city and country but drops the population and landmarks. I spent hours trying different prompts before accepting that we needed more. Time for fine\-tuning.
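
The few\-shot prompt above was written by hand. If you want to keep experimenting with in\-context learning before moving on, it can help to assemble the prompt from a list of pairs so examples are easy to add, remove, or reorder. The sketch below is one way to do that; the helper name `build_few_shot_prompt` is hypothetical, and ending the prompt with an open `A:` is a common few\-shot convention that the run above did not test.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (question, answer) pairs."""
    parts = [instruction.strip(), ""]
    for question, answer in examples:
        parts.append(f"Q: {question}")
        parts.append(f"A: {answer}")
        parts.append("")
    # Leave the last answer open so the model is asked to complete it.
    parts.append(f"Q: {query}")
    parts.append("A:")
    return "\n".join(parts)


examples = [
    (
        "Tell me about Paris.",
        "Paris is a city in France with a population of 2.1 million, known for "
        "landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.",
    ),
    (
        "Describe New York.",
        "New York is a city in the United States with a population of 8.5 million, known for "
        "landmarks such as the Statue of Liberty, Central Park, and Times Square.",
    ),
]

prompt = build_few_shot_prompt(
    "Answer the following geography questions using the format shown in the context.",
    examples,
    "Describe Vancouver.",
)
print(prompt)
```

The resulting string can be passed to the tokenizer exactly like `input_text` in the cell above.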

## Preparing the Training Data

Here's what I'm going to do: use another LLM, a larger one, to create training data. I'll generate 100 labeled input\-output pairs for fine\-tuning, then see if that's enough.

This method is great for generating datasets quickly. Saves tons of time. But you need to make sure the data is accurate and diverse. I learned this the hard way when I didn't validate generated data properly. The model learned some weird patterns.

### Advantages of Using an LLM

* **Speed**: An LLM can generate thousands of examples in minutes. I once needed 500 examples for a personal project. Got them done during lunch.
* **Consistency**: Given a clear template, the model keeps formatting largely consistent across your dataset. Far less worrying about semicolons versus commas in example 47\.
* **Adaptability**: You can adjust prompts for more diverse outputs. Want different question phrasing? Facts about different countries? Just tweak and regenerate.

### Process Using GPT\-4

**1\. Define a Template for the Desired Output**

Create a template for the question\-answer pairs you want. For cities, include fields like name, country, population, and landmarks. I start simple, then refine based on what the model produces.

**2\. Provide Few\-Shot Examples to the LLM**

Use few\-shot learning to guide the model. Start with manually written examples, then prompt the model to generate more in the same style. Be specific about what you want.

**3\. Use the LLM to Generate Multiple Examples**

With your template and examples ready, the LLM can generate more pairs. Call the API in a loop over your geographic entities and generate what you need. One caveat: don't generate everything at once. I do batches of 20\-30 so I can check quality along the way.

**4\. Review and Refine the Data**

LLMs produce structured outputs, but quality varies. Review the data for:

* **Accuracy**: Check facts like capitals, populations, landmarks. GPT\-4 once told me the Eiffel Tower was in Berlin. These things happen.
* **Format Consistency**: Check that every answer follows the template.
* **Diversity**: Include various questions and phrasing. You don't want 100 examples starting with "Tell me about..."

To improve quality, you might explore [techniques for building RAG systems and managing datasets](/article/rag-101-build-an-index-run-semantic-search-and-use-langchain-to-automate-it). These help curate and validate training data.
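
The advice in step 3 to work in batches of 20\-30 can be sketched with a small chunking helper. This is a minimal illustration, not part of the generation script below: the helper name `batched` and the placeholder city names are assumptions, and the batch size of 25 is just one value inside the suggested range.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder names stand in for the real city list.
cities_sample = [f"City {n}" for n in range(1, 101)]

batches = list(batched(cities_sample, 25))
for index, batch in enumerate(batches, start=1):
    print(f"Batch {index}: {len(batch)} cities, starting with {batch[0]}")
    # Generate the Q&A pairs for this batch here, then review them
    # before moving on to the next batch.
```

Stopping after each batch keeps a bad prompt from costing you a full run of API calls.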

In [None]:
# Import the necessary Python libraries
from dotenv import load_dotenv, find_dotenv
from openai import OpenAI
import json
import time

# Load the OPENAI_API_KEY from local .env file
_ = load_dotenv(find_dotenv())

# Instantiate the OpenAI client
client = OpenAI()

# List of cities for generating question-answer pairs
cities = [
    "Paris", "Tokyo", "New York", "Sydney", "Cairo", "Rio de Janeiro",
    "London", "Berlin", "Dubai", "Rome", "Beijing", "Bangkok", "Moscow",
    "Toronto", "Los Angeles", "Cape Town", "Mumbai", "Seoul", "Buenos Aires",
    "Istanbul", "Mexico City", "Jakarta", "Shanghai", "Lagos", "Madrid",
    "Lisbon", "Stockholm", "Vienna", "Prague", "Warsaw", "Helsinki", "Oslo",
    "Brussels", "Zurich", "Kuala Lumpur", "Singapore", "Manila", "Lima",
    "Santiago", "Bogotá", "Nairobi", "Havana", "San Francisco", "Chicago",
    "Venice", "Florence", "Edinburgh", "Glasgow", "Dublin", "Athens",
    "Melbourne", "Perth", "Hong Kong", "Doha", "Casablanca", "Tehran",
    "Bucharest", "Munich", "Barcelona", "Kyoto", "Kolkata", "Amman",
    "Lyon", "Nice", "Marseille", "Tel Aviv", "Jerusalem", "Geneva", 
    "Ho Chi Minh City", "Phnom Penh", "Yangon", "Colombo", "Riyadh",
    "Abu Dhabi", "Addis Ababa", "Seville", "Bilbao", "Porto", "Bratislava",
    "Ljubljana", "Tallinn", "Riga", "Vilnius", "Belgrade", "Sarajevo",
    "Skopje", "Tirana", "Baku", "Yerevan", "Tashkent", "Almaty", "Ulaanbaatar",
    "Karachi", "Islamabad", "Copenhagen", "Chennai", "Kigali", "Antananarivo",
    "Bangui", "San Juan"
]

# Function to create a prompt for a specific city
def create_prompt(city):
    return f"""
    Your task is to provide question-and-answer pairs about cities following this format:
    {{
      "input": "[A unique way to ask for a description of the city]",
      "output": "[City Name] is a city in [Country] with a population of [Population], known for landmarks such as [Landmark 1], [Landmark 2], and [Landmark 3]."
    }}

    Here are a few examples:
    {{
      "input": "Tell me about Paris.",
      "output": "Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral."
    }}
    {{
      "input": "Can you provide information on Tokyo?",
      "output": "Tokyo is a city in Japan with a population of 37 million, known for landmarks such as the Tokyo Tower, Shibuya Crossing, and Meiji Shrine."
    }}
    {{
      "input": "What can you tell me about New York?",
      "output": "New York is a city in the United States with a population of 8.4 million, known for landmarks such as the Statue of Liberty, Central Park, and Times Square."
    }}

    Now, generate a similar question-answer pair for the city {city}.
    """

# Function to generate a Q&A pair using GPT-4 for a given city
def generate_city_qa(city):
    prompt = create_prompt(city)
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=200
        )
        return json.loads(response.choices[0].message.content.strip())
    except Exception as e:
        print(f"Error generating data for {city}: {e}")
        return None
        
# Generate and save Q&A pairs incrementally to a JSONL file
with open("city_qna.jsonl", "w") as f:
    for i, city in enumerate(cities):
        qa_pair = generate_city_qa(city)
        if qa_pair:
            f.write(json.dumps(qa_pair) + "\n")
            # Print the first 3
            if i < 3:
                print(qa_pair)
            else:
                print(".", end="")
        time.sleep(1)  # Add delay to manage rate limits

print("\n\nCity Q&A pairs saved to city_qna.jsonl")

In [None]:
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}
{'input': 'What should I know about Tokyo?', 'output': 'Tokyo is a city in Japan with a population of 37 million, known for landmarks such as the Tokyo Skytree, Senso-ji Temple, and the Imperial Palace.'}
{'input': 'What do you know about New York?', 'output': 'New York is a city in the United States with a population of 8.4 million, known for landmarks such as the Empire State Building, Brooklyn Bridge, and Wall Street.'}
.................................................................................................

City Q&A pairs saved to city_qna.jsonl
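
The script above passes the raw reply straight to `json.loads`, so any reply that is not pure JSON lands in the `except` branch and that city is skipped. Chat models sometimes wrap JSON in a Markdown code fence or add a sentence around it. The sketch below shows a more forgiving parser; the function name `parse_qa_pair` is hypothetical, and the wrapped reply is a made\-up example of that failure mode, not output from the run above.

```python
import json


def parse_qa_pair(raw_text):
    """Return {'input': ..., 'output': ...}, or None if the reply is unusable."""
    # Keep only the span from the first "{" to the last "}".
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        record = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Both fields must be present and be non-empty strings.
    for key in ("input", "output"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return {"input": record["input"].strip(), "output": record["output"].strip()}


fence = "`" * 3  # a Markdown code fence, built here to keep this example readable
wrapped_reply = (
    f"{fence}json\n"
    '{"input": "Tell me about Oslo.", "output": "Oslo is a city in Norway."}\n'
    f"{fence}"
)

print(parse_qa_pair(wrapped_reply))
print(parse_qa_pair("Sorry, I cannot help with that."))
```

Returning `None` for unusable replies matches how `generate_city_qa` already signals failure, so the write loop would not need to change.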

## Full Fine\-Tuning

We've prepared our training data and saved it as JSONL, so we're ready for fine\-tuning. This dataset is the foundation for customizing our model. Next, we configure the environment, load the data, and adapt the model to our task.
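
Part of the review described earlier can be automated before training. The sketch below flags answers that miss the template phrases, repeated answers, and over\-used question openings. It does not check facts, which still needs a human or a trusted reference. The function name `audit_records` and the four sample records are illustrative assumptions; in practice you would pass in the records read from `city_qna.jsonl`.

```python
from collections import Counter

# Phrases every answer should contain if it follows the template.
REQUIRED_PHRASES = (
    " is a city in ",
    " with a population of ",
    "known for landmarks such as ",
)


def audit_records(records):
    """Report malformed answers, duplicate answers, and question openings."""
    seen_outputs = set()
    malformed, duplicates = [], []
    openings = Counter()
    for index, record in enumerate(records):
        question, answer = record["input"], record["output"]
        if not all(phrase in answer for phrase in REQUIRED_PHRASES):
            malformed.append(index)
        if answer in seen_outputs:
            duplicates.append(index)
        seen_outputs.add(answer)
        # Count the first two words of each question as a rough diversity signal.
        openings[" ".join(question.lower().split()[:2])] += 1
    return {"malformed": malformed, "duplicates": duplicates, "openings": openings}


paris = ("Paris is a city in France with a population of 2.1 million, known for "
         "landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.")
tokyo = ("Tokyo is a city in Japan with a population of 37 million, known for "
         "landmarks such as the Tokyo Skytree, Senso-ji Temple, and the Imperial Palace.")
sample_records = [
    {"input": "Tell me about Paris.", "output": paris},
    {"input": "What should I know about Tokyo?", "output": tokyo},
    {"input": "Tell me something about Paris.", "output": paris},
    {"input": "Describe Oslo.", "output": "Oslo is the capital of Norway."},
]

report = audit_records(sample_records)
print(report["malformed"], report["duplicates"], report["openings"].most_common(1))
```

A duplicated city in the source list would show up here as a duplicate answer only if the model repeats itself word for word, so it is worth checking the city list separately too.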

### Load the Dataset from the JSONL File

We'll use the Hugging Face Datasets library to load our JSONL file. Pretty straightforward.

In [None]:
from datasets import load_dataset

# Load your dataset from the JSONL file
dataset = load_dataset("json", data_files="city_qna.jsonl")

# Check the dataset structure
print(dataset["train"][0])

In [None]:
Generating train split: 0 examples [00:00, ? examples/s]
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}
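
This walkthrough trains on all 100 examples and evaluates by eye. If you would rather hold a few examples back for checking the model later, a deterministic split is enough. The sketch below uses plain Python lists with placeholder records so it runs on its own; the function name `split_records` and the 10% hold\-out are assumptions. With a loaded `Dataset`, the library's `train_test_split` method serves the same purpose.

```python
import random


def split_records(records, eval_fraction=0.1, seed=42):
    """Shuffle a copy of `records` and return (train, evaluation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    eval_size = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder records standing in for the 100 generated pairs.
records = [
    {"input": f"Describe city {n}.", "output": f"City {n} is a city in ..."}
    for n in range(100)
]

train_records, eval_records = split_records(records)
print(len(train_records), len(eval_records))  # 90 10
```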

### Preprocess the Data

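Flan\-T5 is a sequence\-to\-sequence (seq2seq) model, so we tokenize both the input and the target output. Padding positions in the labels are set to \-100 so the loss function ignores them. This part always takes me a minute.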

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer for Flan-T5
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Tokenize the dataset
def preprocess_data(examples):
    # Extract inputs and outputs as lists from the dictionary
    inputs = examples["input"]
    outputs = examples["output"]

    # Tokenize inputs and outputs with padding and truncation
    model_inputs = tokenizer(inputs, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(outputs, max_length=128, padding="max_length", truncation=True).input_ids

    # Replace padding token IDs with -100 to ignore them in the loss function
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels

    return model_inputs

# Use the map function to apply the preprocessing to the whole dataset
tokenized_dataset = dataset["train"].map(preprocess_data, batched=True)

In [None]:
Map: 0%| | 0/100 [00:00<?, ? examples/s]
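
To see why `preprocess_data` swaps padding ids for `-100`, here is a toy illustration in plain Python. PyTorch's cross\-entropy loss skips targets equal to `-100` by default, so padded positions do not contribute to the average. The token ids and per\-token loss values below are made up for the example; only the pad id of 0 and end\-of\-sequence id of 1 follow the T5 vocabulary.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX, as preprocess_data does."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]


def mean_loss(per_token_losses, labels):
    """Average only the positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, label in zip(per_token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


pad_token_id = 0
label_ids = [363, 19, 3, 9, 1, 0, 0, 0]  # five real tokens, then three pads
per_token_losses = [2.0, 1.0, 3.0, 2.0, 2.0, 9.0, 9.0, 9.0]  # made-up values

masked = mask_padding(label_ids, pad_token_id)
print(masked)
print("loss over real tokens:", mean_loss(per_token_losses, masked))
print("loss if pads counted:", sum(per_token_losses) / len(per_token_losses))
```

With `max_length=128` and answers of a few dozen tokens, most label positions are padding, so leaving them unmasked would let padding dominate the training signal.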

### Set Up the Model for Fine\-Tuning

For a refresher on core concepts, check my guide on [understanding transformer architecture](/article/transformers-demystifying-the-magic-behind-large-language-models-2). Helps to know what's happening under the hood.

In [None]:
from transformers import AutoModelForSeq2SeqLM

# Load the Flan-T5 model
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
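
Full fine\-tuning updates every weight, which is what drives its memory cost. The back\-of\-the\-envelope sketch below counts weights, gradients, and the two moment estimates an Adam\-style optimizer keeps per parameter, all in 32\-bit floats. The parameter count of roughly 250 million for `flan-t5-base` is an approximation, and activations, which depend on batch size and sequence length, are left out, so treat the result as a floor.

```python
def full_finetune_memory_gb(num_parameters, bytes_per_value=4):
    """Rough memory for weights, gradients, and Adam state, excluding activations."""
    weights = num_parameters * bytes_per_value
    gradients = num_parameters * bytes_per_value
    # Adam-style optimizers keep two running averages per parameter.
    optimizer_state = 2 * num_parameters * bytes_per_value
    return (weights + gradients + optimizer_state) / 1024**3


approx_params = 250_000_000  # assumption: approximate size of flan-t5-base
estimate = full_finetune_memory_gb(approx_params)
print(f"about {estimate:.1f} GB before activations")
```

Parameter\-efficient methods such as LoRA shrink the gradient and optimizer terms because only a small set of added weights is trained.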

### Define the Training Arguments

Set the training parameters: epochs, batch size, and learning rate. Getting these right is half art, half science. I start conservative and adjust.

In [None]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="/home/ubuntu/flan-t5-city-tuning",  # Output directory
    eval_strategy="no",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,  # Only keep the most recent checkpoint
    logging_dir="./logs",  # Directory for logs
    logging_steps=10,
    push_to_hub=False  # Set this to True if you want to push to Hugging Face Hub
)

### Create a Trainer and Train the Model

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer  # Saved with each checkpoint; recent transformers releases name this argument processing_class
)

# Start fine-tuning
trainer.train()

In [None]:
[39/39 02:59, Epoch 3/3]
Step Training Loss
10 1.516000
20 1.537800
30 1.412000

TrainOutput(global_step=39, training_loss=1.4650684992472331, metrics={'train_runtime': 184.0522, 'train_samples_per_second': 1.63, 'train_steps_per_second': 0.212, 'total_flos': 51356801433600.0, 'train_loss': 1.4650684992472331, 'epoch': 3.0})
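
The 39 steps in the log follow directly from the training arguments: 100 examples at a batch size of 8 give 13 steps per epoch, times 3 epochs. The sketch below reproduces that arithmetic and also estimates the learning rate at each logged step. It assumes a single device, no gradient accumulation, and a linear decay to zero with no warmup, which is my understanding of the `Trainer` defaults when no scheduler is configured.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate under linear decay to zero, with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


steps = total_training_steps(num_examples=100, batch_size=8, epochs=3)
print(steps)  # 39

for logged_step in (10, 20, 30):
    print(logged_step, f"{linear_decay_lr(logged_step, steps, 5e-5):.2e}")
```

With so few steps, the learning rate is already well below its starting value by the later log lines, which is worth keeping in mind when adjusting the number of epochs.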

### Evaluate the Model Qualitatively (Human Evaluation)

In [None]:
# Load the fine-tuned model
model_name = "/home/ubuntu/flan-t5-city-tuning/checkpoint-39"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Zero-shot prompt: no in-context examples this time
input_text = "Describe the city of Vancouver"

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate response
outputs = model.generate(inputs.input_ids, max_length=50)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
Vancouver is a city in Canada with a population of 2.8 million, known for landmarks such as the Vancouver Bridge, Vancouver City Hall, and Vancouver International Airport.

Great! The model now produces exactly the format we wanted. Not all of the information is accurate, I'll admit: the landmarks, for instance, read as generic guesses rather than the city's best\-known sights. But the structure is much improved, and we got it without any in\-context examples in the prompt. Pretty cool.
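
Checking the format by eye works for one city. For a batch of outputs, a regular expression built from the template can do the same job and pull out the fields for later fact\-checking. The sketch below runs on the two Vancouver outputs from this notebook; the pattern and the function name `parse_city_answer` are my own illustration and would need adjusting if the template changes.

```python
import re

# One named group per template field.
TEMPLATE_PATTERN = re.compile(
    r"^(?P<city>.+?) is a city in (?P<country>.+?) with a population of "
    r"(?P<population>.+?), known for landmarks such as "
    r"(?P<first>.+?), (?P<second>.+?), and (?P<third>.+?)\.$"
)


def parse_city_answer(text):
    """Return the template fields as a dict, or None if the format is off."""
    match = TEMPLATE_PATTERN.match(text.strip())
    return match.groupdict() if match else None


base_output = "Vancouver is a city in Canada."
tuned_output = (
    "Vancouver is a city in Canada with a population of 2.8 million, known for "
    "landmarks such as the Vancouver Bridge, Vancouver City Hall, and "
    "Vancouver International Airport."
)

print(parse_city_answer(base_output))
print(parse_city_answer(tuned_output))
```

The base model's answer fails the check while the fine\-tuned answer passes, which matches the qualitative read above. Passing the format check says nothing about whether the extracted population or landmarks are correct.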

## Conclusion

I demonstrated how to fine\-tune an LLM using a dataset generated with another LLM. We prepared the training data with an emphasis on accuracy and diversity, then walked through fine\-tuning step by step. I also highlighted the key advantages of LLM\-generated data, particularly speed and consistency.

Our fine\-tuning showed meaningful improvement in response format. Yes, some factual errors persisted. But the structured output shows fine\-tuning can significantly enhance a model's ability to meet requirements, even without in\-context learning. That's a win.

Fine\-tuning offers a powerful way to tailor models to specialized tasks. Makes your model more effective. But you need to review data and outputs for accuracy and diversity. I can't stress this enough. With the right approach, fine\-tuning unlocks new precision and customization. Once you get the hang of it, it's another valuable tool in your toolkit.