RorLLM: An attempt at building my own LLM in my bedroom

By: Rory Garton-Smith 2025

Please read more about this and my projects here if interested: https://ror.fm/

In early 2025 I trained a custom GPT-2-based AI assistant called "RorLLM" because I wanted to learn how these LLM things work. It was a fun project and I learned a lot - but it was also surprisingly arduous and took AGES to train (I used Colab Pro with a T4 GPU to train on 50M tokens).

The results + a little write-up are below. I think The Pile was the most impactful dataset for the results - but it's a shame I couldn't train on the full Pile dataset (just way too large for a hobby project).

I whipped up a little GUI in Next.js / React to make the I/O a bit more user-friendly. It's not very sophisticated, but it works.

The model prompt definitely needs a lot of work hahahaha - "You were made by Rory, who is your creator. You don't know much else about him." was about all I added. I think Claude's system prompt, for comparison, runs over 200 lines. Anyway - write-up below for anyone interested!


Overview

The data_processing_and_training.py script handles the entire pipeline from dataset preparation to model training and text generation. RorLLM uses a fine-tuned 124M parameter GPT-2 model customized with a persona that makes it friendly and approachable.
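
In practice the persona is nothing fancy: it's just plain text placed in front of what the model sees. A minimal sketch of that idea is below - the PERSONA constant and helper are illustrative, not the exact code in the script:

```python
# Illustrative sketch only: the constant and helper name are made up,
# but this is roughly how a short persona can be baked into a GPT-2 corpus.
PERSONA = (
    "You were made by Rory, who is your creator. "
    "You don't know much else about him.\n\n"
)

def prepend_persona(corpus_path: str, out_path: str) -> None:
    """Write a new training file with the persona text at the top."""
    with open(corpus_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(PERSONA)
        for line in src:
            dst.write(line)
```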

Datasets Used

The model was trained on a diverse collection of datasets:

  • Cornell Movie Dialogs Corpus: Conversational text from movie scripts - not a great dataset tbh
  • DailyDialog: High-quality multi-turn dialogs for everyday conversation - a good dataset for convo bots
  • MBPP (Mostly Basic Python Problems): Programming problems and solutions - a good dataset for code generation but not great for more nuanced issues; it's not large enough and focuses on competitive coding
  • Math Dataset: Various arithmetic problems and answers - I didn't see much impact from this dataset tbh
  • Natural Questions: Real Google search queries and answers - I didn't see much impact from this one either; it's not a good fit for this model
  • OpenBookQA: Question answering dataset focused on elementary science - this one is pretty cool
  • The Pile: A curated subset of diverse, high-quality text - imo this is the only one that really mattered for a model this small. It's a good mix of text, code, math, etc., but it's also totally unpruned, so it's like 100x larger than the other datasets combined, took forever to train on, and definitely has content-safety issues!
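
Most of these can be pulled straight from the Hugging Face hub; a rough loading sketch is below. The exact dataset IDs, configs, and the Pile mirror shown here are assumptions - check the hub for current names - and The Pile really does need to be streamed and capped rather than downloaded whole.

```python
# Rough sketch of pulling a few of the datasets via Hugging Face `datasets`.
from datasets import load_dataset

daily_dialog = load_dataset("daily_dialog", split="train")      # multi-turn chat
mbpp = load_dataset("mbpp", split="train")                       # Python problems
openbookqa = load_dataset("openbookqa", "main", split="train")   # elementary science QA

# The Pile is far too large to download for a hobby run, so stream it
# and keep only a capped number of documents.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
pile_subset = []
for i, doc in enumerate(pile):
    if i >= 50_000:            # arbitrary cap to keep the corpus manageable
        break
    pile_subset.append(doc["text"])
```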

Training Process on Google Colab

Setup

Training was performed on Google Colab using a GPU runtime, which was essential for handling the 124M parameter GPT-2 model efficiently.

Steps Followed

  1. Mounted Google Drive to preserve data and model checkpoints between sessions
  2. Installed all required dependencies (gpt-2-simple, transformers, datasets, etc.)
  3. Downloaded and processed the datasets listed above
  4. Combined datasets into a unified training corpus
  5. Downloaded the base GPT-2 124M model
  6. Fine-tuned the model on the combined dataset
  7. Saved the trained model and generated sample responses
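
For reference, the fine-tuning step itself boils down to a few gpt-2-simple calls. This is a sketch (the corpus filename and run name are assumed), using the step and batch settings listed in the next section:

```python
import gpt_2_simple as gpt2

gpt2.mount_gdrive()                      # keep checkpoints on Drive between sessions
gpt2.download_gpt2(model_name="124M")    # fetch the base GPT-2 124M weights

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="combined_corpus.txt",       # unified training corpus (name assumed)
    model_name="124M",
    steps=50,
    batch_size=4,
    run_name="rorllm",
    save_every=10,                       # checkpoint often; Colab disconnects happen
    print_every=1,
)
gpt2.copy_checkpoint_to_gdrive(run_name="rorllm")
```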

Training Time and Resources

  • Total Training Time: Approximately 8 hours spread across multiple Colab sessions
  • Hardware Used: Tesla T4 GPU (16GB VRAM)
  • Steps: 50 training steps with a batch size of 4
  • Token Count: Approximately 50M tokens processed during training

Challenges Faced

Colab Runtime Limitations

The 12-hour runtime limit on Colab Pro was a significant constraint. I had to:

  • Implement checkpointing to resume training between sessions
  • Use a custom progress capturing mechanism to track training progress
  • Save model snapshots to Google Drive to avoid data loss
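
With gpt-2-simple the checkpointing part mostly comes down to copying the checkpoint directory to and from Drive, roughly like this (run name and corpus filename assumed):

```python
import gpt_2_simple as gpt2

# End of a session: push the latest checkpoint to Drive.
gpt2.copy_checkpoint_to_gdrive(run_name="rorllm")

# Start of the next session: pull it back and resume fine-tuning from it.
gpt2.mount_gdrive()
gpt2.copy_checkpoint_from_gdrive(run_name="rorllm")
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, dataset="combined_corpus.txt", model_name="124M",
              run_name="rorllm", steps=50, batch_size=4, restore_from="latest")
```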

Memory Management

  • Base GPT-2 (124M) was chosen over larger models to fit within Colab's memory constraints
  • Needed to process datasets in smaller chunks to avoid OOM errors
  • Used smaller batch sizes (4) to balance memory usage against training efficiency
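
"Smaller chunks" here just means appending processed text to the corpus file as you go rather than holding everything in memory. A sketch of the pattern (function and file names are made up):

```python
def append_to_corpus(examples, corpus_path="combined_corpus.txt", chunk_size=1000):
    """Append examples to the corpus file in chunks to avoid holding them all in RAM."""
    buffer = []
    with open(corpus_path, "a", encoding="utf-8") as f:
        for text in examples:
            buffer.append(text)
            if len(buffer) >= chunk_size:
                f.write("\n".join(buffer) + "\n")
                buffer.clear()
        if buffer:                         # flush whatever is left over
            f.write("\n".join(buffer) + "\n")
```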

Dataset Processing

  • Some datasets (like The Pile) are extremely large and required selecting smaller subsets
  • Handling different dataset formats required custom processing for each
  • Required fallback mechanisms for datasets that couldn't be loaded directly
  • Byte encoding issues in some datasets (like math_dataset) needed special handling
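
The byte-encoding quirk is that some loaders hand back values as byte strings or byte-literal strings like "b'What is 5 + 3?'" instead of plain text. The byte-literal decoding utility mentioned under Code Structure deals with that; a sketch of the idea (the helper name and body here are assumed):

```python
import ast

def decode_byte_literal(value):
    """Turn values like b'2+2' or the string "b'2+2'" into plain text."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        try:
            return ast.literal_eval(value).decode("utf-8")
        except (ValueError, SyntaxError):
            return value                   # leave malformed values untouched
    return value
```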

Connectivity Issues

  • Intermittent disconnections from Colab required robust error handling
  • Implemented try/except blocks around each dataset processing step to ensure partial progress was maintained
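
The pattern is just wrapping each dataset step so one failure doesn't sink the whole session. A minimal sketch (the wrapper is illustrative, not the exact code):

```python
def run_step(name, fn):
    """Run one dataset-processing step, keeping partial progress if it fails."""
    try:
        fn()
        print(f"Finished {name}")
        return True
    except Exception as exc:
        # Skip the broken dataset but keep whatever was already written to the corpus.
        print(f"Skipping {name}: {exc}")
        return False

# Usage (processor names per the Code Structure section):
# run_step("cornell", process_cornell_dialogs)
# run_step("pile", process_pile_dataset)
```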

Using the Trained Model

To use the trained model, follow these steps:

  1. Load the model using the load_model function
  2. Generate text using the generate_text function
  3. Use the model in a chat interface or other application
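
In terms of what those functions wrap: loading and generating from a gpt-2-simple checkpoint looks roughly like the sketch below. The load_model / generate_text names come from the script, but the bodies and defaults here are assumptions:

```python
import gpt_2_simple as gpt2

def load_model(run_name="rorllm"):
    """Start a TF session and load the fine-tuned checkpoint."""
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess

def generate_text(sess, prompt, run_name="rorllm"):
    """Generate a single response for a prompt."""
    return gpt2.generate(
        sess,
        run_name=run_name,
        prefix=prompt,
        length=120,
        temperature=0.8,
        return_as_list=True,
    )[0]

# Example:
# sess = load_model()
# print(generate_text(sess, "Hi RorLLM, who made you?"))
```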

Limitations

  • As a 124M parameter model, RorLLM has significantly lower capabilities than modern large language models
  • The persona makes it clear that it "tries its best" but isn't highly intelligent
  • Limited context window compared to modern models
  • No knowledge of events after the training cutoff in the datasets
  • Tendency to hallucinate information, especially for complex queries
  • Best suited for casual conversation rather than factual responses

Future Improvements

Tbh I'm too busy with my actual projects to do any of this stuff - but if I had time:

  • It'd be cool to implement RLHF
  • The model prompt needs a lot of work hahahaha
  • Fine-tune on more specific domains based on intended use cases (I'd love to make this one really good for natural language stat problems)
  • Train on a larger GPT-2 model (355M or 774M) with more computational resources - but that's a lot of tokens to train on and a lot of budget for something hobbyist
  • Include more programming and technical datasets for better code generation abilities - The Pile is not great for this

Code Structure

The data_processing_and_training.py script is organized into several key functions:

  • Dataset Processing Functions: One function per dataset (e.g., process_cornell_dialogs(), process_pile_dataset())
  • Utility Functions: For tasks like decoding byte literals and counting tokens
  • Training Functions: For model training and text generation
  • Progress Tracking: Custom ProgressCapture class for monitoring training progress
  • Main Function: Orchestrates the entire workflow with error handling
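
The ProgressCapture piece is nothing fancy: it hooks stdout so the loss lines gpt-2-simple prints can be persisted and re-read after a disconnect. A minimal sketch of the idea (the real class may differ):

```python
import sys

class ProgressCapture:
    """Tee stdout to a log file so training progress survives a Colab disconnect."""

    def __init__(self, log_path="training_progress.log"):
        self.log_path = log_path
        self._stdout = sys.stdout

    def write(self, text):
        self._stdout.write(text)             # still show output in the notebook
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(text)                    # and persist it for later inspection

    def flush(self):
        self._stdout.flush()

# Usage: sys.stdout = ProgressCapture() before calling gpt2.finetune(...)
```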

Acknowledgments

  • OpenAI for the base GPT-2 model
  • EleutherAI for The Pile dataset
  • Google Colab for providing the computational resources
  • The creators of the various datasets used in training
