A PyTorch implementation of a transformer-based language model trained on Wikipedia text. This project is inspired by the nanoGPT architecture but simplified for educational purposes.
This project implements a transformer-based language model that can:
- Train on Wikipedia text data
- Generate text based on user prompts
- Save and load model checkpoints
The model is a transformer with:
- 6 transformer layers
- 6 attention heads
- 384 embedding dimensions
- 1536 feedforward dimensions
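As a rough check on model size, the parameter count can be estimated from the dimensions listed above. This is a back-of-the-envelope sketch, not the project's own code: `vocab_size=50257` (the GPT-2 encoding shipped with tiktoken), `block_size=256`, biased linear layers, and tied input/output embeddings are all assumptions, and `count_params` is a hypothetical helper.

```python
def count_params(n_layer=6, n_embd=384, n_ff=1536,
                 vocab_size=50257, block_size=256, tied=True):
    """Estimate trainable parameters of a GPT-style decoder (assumed layout)."""
    # Attention: query, key, value and output projections, each n_embd x n_embd plus bias.
    attn = 4 * n_embd * n_embd + 4 * n_embd
    # Feedforward: n_embd -> n_ff -> n_embd, with biases.
    mlp = 2 * n_embd * n_ff + n_ff + n_embd
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * n_embd
    per_layer = attn + mlp + norms

    # Token embedding table plus learned positional embeddings (assumed).
    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    # An untied output head adds a second vocab_size x n_embd matrix.
    head = 0 if tied else vocab_size * n_embd
    return n_layer * per_layer + embeddings + final_norm + head


if __name__ == "__main__":
    print(f"tied:   {count_params(tied=True):,}")
    print(f"untied: {count_params(tied=False):,}")
```

Under these assumptions the six transformer layers account for roughly 10.6 million parameters and the estimate lands near 30 million with tied embeddings, or near 49 million untied. The embedding table dominates, so the real total depends heavily on the actual vocabulary size, context length, and weight tying.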
- Iteration-based Training: Uses iterations instead of epochs for more flexible training
- Gradient Accumulation: Accumulates gradients over several micro-batches to reach a larger effective batch size
- Learning Rate Scheduling: Includes warmup and cosine decay for better convergence
- Mixed Precision Training: Uses automatic mixed precision for faster training
- Checkpoint Saving: Saves the checkpoint with the lowest loss seen so far
- Text Generation: Generates text from user prompts with configurable parameters
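The warmup and cosine decay schedule listed above can be sketched as a pure function of the iteration number. This is a minimal illustration, not the project's actual schedule: `lr_at` is a hypothetical helper, and `max_lr`, `min_lr`, and `warmup_iters` are assumed values. Only the 16,000-iteration horizon comes from the training run described in this README.

```python
import math


def lr_at(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=1000, max_iters=16000):
    """Learning rate for iteration `it` (0-based): linear warmup, then cosine decay."""
    # Linear warmup from max_lr / warmup_iters up to max_lr.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # After the schedule ends, hold the floor value.
    if it >= max_iters:
        return min_lr
    # Cosine decay: coeff goes smoothly from 1 down to 0 over the remaining iterations.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    for it in (0, 500, 1000, 8500, 16000):
        print(it, lr_at(it))
```

In a training loop the returned value would typically be written into each optimizer parameter group (`group["lr"] = lr_at(it)`) before the optimizer step.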
- Python 3.8+
- PyTorch 2.0+
- Hugging Face datasets
- tiktoken (OpenAI's tokenizer)
Here's an example of text generated by the model with the prompt "The best thing is":
The best thing is the attribute of the god, according to Homer from the greatest of the Iliad. In the earliest Greek, he is the most important attribute of the life, especially in the Roman form of the Greek language of the poem.
Aristotle was credited with Phoenicians across the late 18th century. It was considered the basis of Apollo's Egypt and Homer's son Asperipi observed to flee to his mother, and he also gave the Niobids until his death in Troy
This example shows that the model produces grammatically fluent text in the style of Wikipedia articles, particularly those about historical and mythological topics. The content itself is not factually reliable: names, dates, and relationships are mixed freely.
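Samples like this are usually drawn one token at a time from the model's output scores. The sketch below shows temperature scaling with top-k filtering on a plain list of logits. It is an assumption that the project exposes these two generation parameters; `sample_next` and its defaults are hypothetical.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    """Pick the index of the next token from raw logits (a list of floats)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring token indices.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print([sample_next(toy_logits, top_k=3, rng=rng) for _ in range(10)])
```

With `top_k=1` this reduces to greedy decoding, which always returns the highest-scoring token.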
The model uses a standard transformer architecture with:
- Multi-head self-attention
- Position-wise feedforward networks
- Layer normalization
- Residual connections
- Dropout for regularization
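The components above fit together in a single transformer block roughly as follows. This is a sketch under stated assumptions rather than the project's own module: the pre-norm layout (LayerNorm before each sub-layer, as in nanoGPT), the GELU activation, the causal mask, and `dropout=0.1` are assumptions, and `Block` is a hypothetical class name. The dimensions match those listed earlier in this README.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-norm transformer block: attention and MLP, each inside a residual."""

    def __init__(self, n_embd=384, n_head=6, n_ff=1536, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
        self.n_head = n_head
        self.dropout = dropout
        self.ln1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # attention output projection
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                  # position-wise feedforward network
            nn.Linear(n_embd, n_ff),
            nn.GELU(),
            nn.Linear(n_ff, n_embd),
            nn.Dropout(dropout),
        )
        self.drop = nn.Dropout(dropout)

    def attend(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape each tensor to (B, n_head, T, head_dim) for multi-head attention.
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # Causal mask: each position attends only to itself and earlier positions.
        y = F.scaled_dot_product_attention(
            q, k, v, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        # Merge the heads back into (B, T, C).
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

    def forward(self, x):
        x = x + self.attend(self.ln1(x))   # residual around self-attention
        x = x + self.mlp(self.ln2(x))      # residual around the feedforward network
        return x


if __name__ == "__main__":
    block = Block().eval()
    out = block(torch.randn(2, 8, 384))
    print(out.shape)
```

`F.scaled_dot_product_attention` is available from PyTorch 2.0, which matches the version requirement listed in this README.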
The model was trained for 16,000 iterations and reached a training loss of 1.0533. Because this is a training loss, it shows that the model fits the Wikipedia training text but does not by itself measure performance on held-out data. The model contains approximately 150 million parameters and was trained in about 1 hour on an NVIDIA RTX 4090 GPU.
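A loss value is easier to interpret as a perplexity. The snippet below converts the reported figure, assuming it is a mean per-token cross-entropy in nats (the default for PyTorch's `cross_entropy`). The README does not state the unit, so treat this as an assumption.

```python
import math

train_loss = 1.0533                  # reported training loss, assumed nats per token
perplexity = math.exp(train_loss)    # perplexity = e^loss for a natural-log loss
print(f"training perplexity: {perplexity:.2f}")
```

Under that assumption the training perplexity is about 2.87, meaning the model is on average about as uncertain as a uniform choice among roughly three tokens on the training data.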
- Inspired by the nanoGPT project
- Uses the Hugging Face datasets library
- Uses OpenAI's tiktoken for tokenization