# Fine-tuned CodeT5 Transformer for Python code generation from natural language docstrings
This project fine-tunes Salesforce's CodeT5 (codet5-base) on the CodeSearchNet Python dataset to perform docstring-to-code generation: given a natural language description of a Python function, the model generates the corresponding Python source code.
Input → `"Calculate the factorial of n using recursion."`

Output →

```python
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)
```
```
┌──────────────────────────────────────────────────────┐
│                 CodeT5 (220M params)                 │
│                                                      │
│  ┌──────────────────┐      ┌──────────────────────┐  │
│  │     Encoder      │      │       Decoder        │  │
│  │  (RoBERTa-style) │─────▶│  (T5-style, causal)  │  │
│  │                  │      │                      │  │
│  │ Docstring tokens │      │  Python code tokens  │  │
│  └──────────────────┘      └──────────────────────┘  │
└──────────────────────────────────────────────────────┘
          ▲                              │
          │                              ▼
   Natural language              Generated Python
   "Flatten a list..."           def flatten(...)
```
| Step | Detail |
|---|---|
| Tokenization | RoBERTa tokenizer with "Generate Python: {docstring}" prefix |
| Model | Salesforce/codet5-base, an encoder-decoder Transformer (220M params) |
| Training | Seq2Seq cross-entropy loss, teacher forcing |
| Decoding | Beam search (width=5) + nucleus sampling (top-p=0.95) |
| Evaluation | BLEU-4, ROUGE-1/L |
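As a standalone illustration of the nucleus-sampling step listed above, the sketch below keeps the smallest set of tokens whose cumulative probability reaches `p`. This is purely illustrative (real decoding happens on logits inside `model.generate()`, not on a Python dict):

```python
def top_p_filter(probs, p=0.95):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    `probs` maps token -> probability; returns the filtered, renormalized dict.
    Illustrative only: actual nucleus sampling runs inside model.generate().
    """
    total, kept = 0.0, {}
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    norm = sum(kept.values())
    return {tok: pr / norm for tok, pr in kept.items()}

# Hypothetical next-token distribution: "pass" falls outside the 0.95 nucleus
filtered = top_p_filter({"def": 0.60, "return": 0.30, "import": 0.07, "pass": 0.03}, p=0.95)
print(sorted(filtered))
```

With `top-p=0.95`, low-probability tail tokens are dropped before sampling, which trims implausible continuations while keeping diversity.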
| Split | Full Size | Used |
|---|---|---|
| Train | ~412K samples | 50K samples |
| Validation | ~23K samples | 2K samples |
| Test | ~22K samples | 22K samples |
Each sample contains a Python function paired with its docstring. We use:
- Input: `func_documentation_string` (natural language)
- Target: `func_code_string` (Python source code)
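Turning a raw sample into a training pair is then just a matter of prefixing the docstring. The field names and prefix string are the ones used in this project; everything else is a sketch rather than the exact `train.py` code:

```python
def to_seq2seq_pair(sample):
    """Convert one CodeSearchNet record into an (input, target) text pair.

    Sketch only: the real pipeline additionally tokenizes both sides with the
    RoBERTa tokenizer and truncates to the configured max lengths (256 tokens).
    """
    source = "Generate Python: " + sample["func_documentation_string"].strip()
    target = sample["func_code_string"]
    return source, target

# Hypothetical record containing the two fields this project uses
sample = {
    "func_documentation_string": "Return the square of x.",
    "func_code_string": "def square(x):\n    return x * x",
}
src, tgt = to_seq2seq_pair(sample)
print(src)  # Generate Python: Return the square of x.
```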
| Metric | Score |
|---|---|
| BLEU-4 | 54.66 |
| ROUGE-1 | 0.611 |
| ROUGE-L | 0.594 |
| Train Loss | 1.071 |
| Epochs | 3 |
| Training Time | ~3h 51m (Kaggle P100) |
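For intuition on what BLEU-4 measures, here is a minimal sketch of clipped n-gram precision over whitespace tokens. It is illustrative only; the scores reported above come from a standard BLEU/ROUGE implementation, not from this snippet:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with reference counts as an upper bound (clipped counting)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

ref = "def square ( x ) : return x * x"
hyp = "def square ( x ) : return x * 2"
print(ngram_precision(hyp, ref, 4))  # 6 of 7 4-grams match
```

BLEU-4 combines such precisions for n = 1..4 with a brevity penalty, so a single wrong token near the end of a function (as above) already costs several 4-gram matches.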
This model was trained on 50K samples (out of 412K) with only 3 epochs due to computational constraints. As a result:
- The model works well for simple, common functions (file reading, basic data structures)
- The model struggles with complex algorithms (recursion, sorting logic, prime checking)
- Generated code may be structurally correct but contain logical errors
| Improvement | Expected Gain |
|---|---|
| Train on full 412K samples | +10-15% BLEU |
| Increase to 10+ epochs | +5-10% BLEU |
| Use codet5-large (770M params) | +15-20% BLEU |
| Use codet5p-2b (2B params) | +25-30% BLEU |
💡 Note: This project is intended as a portfolio demonstration of the fine-tuning pipeline, not as a production-ready code generation tool. For production use, consider training on the full dataset with a larger model.
The fine-tuned model is available on HuggingFace Hub:
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("roybeey/codet5-python-codegen")
tokenizer = AutoTokenizer.from_pretrained("roybeey/codet5-python-codegen")
```

```bash
git clone https://github.com/roybeey0/codet5-python-codegen.git
cd codet5-python-codegen
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

```bash
python train.py
# Model saved to ./outputs/codet5-python-codegen
```

💡 Tip: To do a quick sanity check, uncomment the subset lines in `train.py` to train on 5K samples first.

```bash
python inference.py
```

```bash
python app.py
# Open: http://localhost:7860
```

```
codet5-python-codegen/
│
├── train.py           # Fine-tuning pipeline (load → preprocess → train → save)
├── inference.py       # Code generation from docstrings
├── eval_metrics.py    # BLEU, ROUGE evaluation
├── app.py             # Gradio web demo UI
│
├── outputs/           # (git-ignored) Trained model checkpoints
│   └── codet5-python-codegen/
│       ├── config.json
│       ├── model.safetensors
│       └── tokenizer_config.json
│
├── requirements.txt
└── README.md
```
| Parameter | Value |
|---|---|
| Base model | Salesforce/codet5-base |
| Max input tokens | 256 |
| Max output tokens | 256 |
| Batch size | 8 |
| Learning rate | 5e-5 |
| Warmup steps | 500 |
| Epochs | 3 |
| Optimizer | AdamW |
| Precision | FP16 (if GPU available) |
| Beam search width | 5 |
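With HuggingFace Transformers, the table above roughly corresponds to a `Seq2SeqTrainingArguments` configuration like the following. This is a sketch under the assumption that training uses the Trainer API; the argument names are standard Transformers parameters, while the output path is illustrative:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a config matching the hyperparameter table; AdamW is the default optimizer
training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs/codet5-python-codegen",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,                    # FP16 when a GPU is available
    predict_with_generate=True,   # decode with generate() during evaluation
    generation_max_length=256,
    generation_num_beams=5,
)
```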
| Spec | Value |
|---|---|
| GPU | Kaggle P100 (16GB) |
| RAM | 29GB |
| Training time | ~3h 51m |
| Dataset | 50K / 412K samples |
Use a larger model:

```python
MODEL_NAME = "Salesforce/codet5-large"
```

Train on full dataset:

```python
# Comment out these lines in train.py:
# raw_dataset["train"] = raw_dataset["train"].select(range(50000))
# raw_dataset["validation"] = raw_dataset["validation"].select(range(2000))
```

- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (Wang et al., 2021)
- CodeSearchNet Challenge (Husain et al., 2019)
- HuggingFace Transformers
- Salesforce/codet5-base on HuggingFace
MIT License. Feel free to use, modify, and distribute.
roybeey
GitHub Β· HuggingFace
⭐ If you found this project useful, please star the repository!