
# 🧠 CodeT5 Python Code Generator

*Fine-tuned CodeT5 Transformer for Python code generation from natural language docstrings*

Python PyTorch HuggingFace Gradio License: MIT


## 📌 Overview

This project fine-tunes Salesforce's CodeT5 (codet5-base) on the CodeSearchNet Python dataset to perform docstring-to-code generation: given a natural language description of a Python function, the model generates the corresponding Python source code.

```
Input  → "Calculate the factorial of n using recursion."
Output → def factorial(n):
             if n == 0:
                 return 1
             return n * factorial(n - 1)
```

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CodeT5 (220M params)              β”‚
β”‚                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚    Encoder       β”‚     β”‚      Decoder         β”‚  β”‚
β”‚  β”‚  (RoBERTa-style) │────▢│  (T5-style, causal)  β”‚  β”‚
β”‚  β”‚                  β”‚     β”‚                      β”‚  β”‚
β”‚  β”‚  Docstring tokensβ”‚     β”‚  Python code tokens  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–²                            β”‚
  Natural language              Generated Python
  "Flatten a list..."            def flatten(...)

### How It Works

| Step | Detail |
|---|---|
| Tokenization | RoBERTa tokenizer with `"Generate Python: {docstring}"` prefix |
| Model | `Salesforce/codet5-base`, an encoder-decoder Transformer (220M params) |
| Training | Seq2Seq cross-entropy loss, teacher forcing |
| Decoding | Beam search (width=5) + nucleus sampling (top-p=0.95) |
| Evaluation | BLEU-4, ROUGE-1/L |
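To make the decoding row concrete: beam search keeps the `width` highest-scoring partial sequences at each step instead of committing greedily to the single best token. The real decoding happens inside `model.generate`; the toy scorer and transition table below are invented purely for illustration:

```python
import heapq

def beam_search(next_scores, start, steps, width=5):
    """Toy beam search: keep the `width` best partial sequences per step.

    next_scores(token) -> list of (log_prob, next_token) candidates.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for log_prob, token in next_scores(seq[-1]):
                candidates.append((score + log_prob, seq + [token]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return beams[0][1]  # best-scoring sequence found

# Hand-made transition table (hypothetical log-probs, not model output):
TABLE = {
    "a": [(-0.9, "b"), (-1.0, "c")],  # greedy prefers "b"...
    "b": [(-3.0, "x")],               # ...but "b" leads somewhere unlikely
    "c": [(-0.1, "y")],
}
scorer = lambda token: TABLE[token]

beam_search(scorer, "a", steps=2, width=5)  # -> ["a", "c", "y"]
beam_search(scorer, "a", steps=2, width=1)  # -> ["a", "b", "x"] (greedy)
```

With width 1 the search degenerates to greedy decoding and gets trapped by the locally best first token; a wider beam recovers the globally better sequence, which is why the project decodes with width 5.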

## 📊 Dataset

**CodeSearchNet Python**

| Split | Full Size | Used |
|---|---|---|
| Train | ~412K samples | 50K samples |
| Validation | ~23K samples | 2K samples |
| Test | ~22K samples | 22K samples |

Each sample contains a Python function with its docstring. We use:

- **Input:** `func_documentation_string` (natural language)
- **Target:** `func_code_string` (Python source code)
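A minimal sketch of how one CodeSearchNet record might be mapped to a training pair, using the two fields above and the `"Generate Python: "` task prefix from the tokenization step (the helper name and example record are hypothetical, not code from `train.py`):

```python
PREFIX = "Generate Python: "

def to_training_pair(sample: dict) -> tuple[str, str]:
    """Map one CodeSearchNet record to a (source, target) text pair.

    The source gets the task prefix used during fine-tuning;
    the target is the raw Python function source.
    """
    source = PREFIX + sample["func_documentation_string"].strip()
    target = sample["func_code_string"]
    return source, target

# Illustrative record with the two fields this project uses:
example = {
    "func_documentation_string": "Calculate the factorial of n using recursion.",
    "func_code_string": "def factorial(n):\n    return 1 if n == 0 else n * factorial(n - 1)",
}
src, tgt = to_training_pair(example)
# src == "Generate Python: Calculate the factorial of n using recursion."
```

The (source, target) strings would then be tokenized to at most 256 tokens each before training.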

## 📈 Training Results

| Metric | Score |
|---|---|
| BLEU-4 | 54.66 |
| ROUGE-1 | 0.611 |
| ROUGE-L | 0.594 |
| Train Loss | 1.071 |
| Epochs | 3 |
| Training Time | ~3h 51m (Kaggle P100) |
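For orientation, BLEU-4 measures 1- to 4-gram overlap between generated and reference code, scaled by a brevity penalty, and is often reported ×100 (as in 54.66 above). The reported score was presumably computed by `eval_metrics.py` over the test set; the unsmoothed, sentence-level sketch below is illustrative only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference: str, hypothesis: str) -> float:
    """Unsmoothed sentence-level BLEU-4 on whitespace tokens, in [0, 1]."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clipped matches: an n-gram counts at most as often as it
        # appears in the reference.
        overlap = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(overlap / total) / 4
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum)

bleu4("return a + b", "return a + b")  # -> 1.0 (perfect match)
```

Production evaluation typically uses a corpus-level, smoothed implementation such as sacrebleu rather than this sentence-level sketch.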

## ⚠️ Current Limitations

This model was trained on 50K samples (out of 412K) with only 3 epochs due to computational constraints. As a result:

- Works well for simple, common functions (file reading, basic data structures)
- Struggles with complex algorithms (recursion, sorting logic, prime checking)
- Generated code is often structurally correct but contains logical errors

To improve model accuracy:

| Improvement | Expected Gain |
|---|---|
| Train on the full 412K samples | +10-15% BLEU |
| Increase to 10+ epochs | +5-10% BLEU |
| Use `codet5-large` (770M params) | +15-20% BLEU |
| Use `codet5p-2b` (2B params) | +25-30% BLEU |

> 💡 **Note:** This project is intended as a portfolio demonstration of the fine-tuning pipeline, not as a production-ready code generation tool. For production use, consider training on the full dataset with a larger model.


## 🤗 Pre-trained Model

The fine-tuned model is available on HuggingFace Hub:

roybeey/codet5-python-codegen

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("roybeey/codet5-python-codegen")
tokenizer = AutoTokenizer.from_pretrained("roybeey/codet5-python-codegen")
```
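Putting the pieces together, generating code from a docstring might look like the sketch below. The beam width of 5 and the 256-token limits come from the settings stated in this README; everything else (helper names, `early_stopping`) is an assumption, and the transformers import is kept local so the prompt helper stays usable without the library installed:

```python
PREFIX = "Generate Python: "

def build_prompt(docstring: str) -> str:
    """Prepend the task prefix used during fine-tuning."""
    return PREFIX + docstring.strip()

def generate_code(docstring: str, model_name: str = "roybeey/codet5-python-codegen") -> str:
    """Generate Python source for a natural-language docstring (sketch)."""
    # Local import: build_prompt above works without transformers installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(docstring), return_tensors="pt",
                       truncation=True, max_length=256)
    output_ids = model.generate(**inputs, max_length=256,
                                num_beams=5, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage (downloads the model weights on first call):
# print(generate_code("Calculate the factorial of n using recursion."))
```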

## 🚀 Quick Start

### 1. Clone & Install

```bash
git clone https://github.com/roybeey0/codet5-python-codegen.git
cd codet5-python-codegen

python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

pip install -r requirements.txt
```

### 2. Train

```bash
python train.py
# Model saved to ./outputs/codet5-python-codegen
```

> 💡 **Tip:** To do a quick sanity check, uncomment the subset lines in `train.py` to train on 5K samples first.

### 3. Inference (CLI)

```bash
python inference.py
```

### 4. Web UI (Gradio)

```bash
python app.py
# Open: http://localhost:7860
```

πŸ“ Project Structure

codet5-python-codegen/
β”‚
β”œβ”€β”€ train.py            # Fine-tuning pipeline (load β†’ preprocess β†’ train β†’ save)
β”œβ”€β”€ inference.py        # Code generation from docstrings
β”œβ”€β”€ eval_metrics.py     # BLEU, ROUGE evaluation
β”œβ”€β”€ app.py              # Gradio web demo UI
β”‚
β”œβ”€β”€ outputs/            # (git-ignored) Trained model checkpoints
β”‚   └── codet5-python-codegen/
β”‚       β”œβ”€β”€ config.json
β”‚       β”œβ”€β”€ model.safetensors
β”‚       └── tokenizer_config.json
β”‚
β”œβ”€β”€ requirements.txt
└── README.md

## ⚙️ Hyperparameters

| Parameter | Value |
|---|---|
| Base model | `Salesforce/codet5-base` |
| Max input tokens | 256 |
| Max output tokens | 256 |
| Batch size | 8 |
| Learning rate | 5e-5 |
| Warmup steps | 500 |
| Epochs | 3 |
| Optimizer | AdamW |
| Precision | FP16 (if GPU available) |
| Beam search width | 5 |

## 🖥️ Hardware Used

| Spec | Value |
|---|---|
| GPU | Kaggle P100 (16GB) |
| RAM | 29GB |
| Training time | ~3h 51m |
| Dataset | 50K / 412K samples |

## 🔧 Customization

Use a larger model:

```python
MODEL_NAME = "Salesforce/codet5-large"
```

Train on the full dataset:

```python
# Comment out these lines in train.py:
# raw_dataset["train"] = raw_dataset["train"].select(range(50000))
# raw_dataset["validation"] = raw_dataset["validation"].select(range(2000))
```

## 📄 License

MIT License. Feel free to use, modify, and distribute.


## 👤 Author

**roybeey**
GitHub · HuggingFace


⭐ If you found this project useful, please star the repository!
