# Fine-tuned CodeT5 Transformer for Python code generation from natural language docstrings
This project fine-tunes Salesforce's CodeT5 (codet5-base) on the CodeSearchNet Python dataset to perform docstring-to-code generation: given a natural language description of a Python function, the model generates the corresponding Python source code.
Input → `"Calculate the factorial of n using recursion."`

Output →

```python
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)
```
```
┌──────────────────────────────────────────────────────┐
│                 CodeT5 (220M params)                 │
│                                                      │
│  ┌──────────────────┐      ┌──────────────────────┐  │
│  │     Encoder      │      │       Decoder        │  │
│  │  (RoBERTa-style) │─────▶│  (T5-style, causal)  │  │
│  │                  │      │                      │  │
│  │ Docstring tokens │      │  Python code tokens  │  │
│  └──────────────────┘      └──────────────────────┘  │
└──────────────────────────────────────────────────────┘
          ▲                              │
          │                              ▼
   Natural language              Generated Python
   "Flatten a list..."           def flatten(...)
```
| Step | Detail |
|---|---|
| Tokenization | RoBERTa tokenizer with "Generate Python: {docstring}" prefix |
| Model | Salesforce/codet5-base, an encoder-decoder Transformer (220M params) |
| Training | Seq2Seq cross-entropy loss, teacher forcing |
| Decoding | Beam search (width=5) + nucleus sampling (top-p=0.95) |
| Evaluation | BLEU-4, ROUGE-1/L |
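As a standalone illustration of the nucleus-sampling step listed above, the sketch below keeps the smallest set of tokens whose cumulative probability reaches `p`. This is purely illustrative (real decoding happens on logits inside `model.generate()`, not on a Python dict):

```python
def top_p_filter(probs, p=0.95):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    `probs` maps token -> probability; returns the filtered, renormalized dict.
    Illustrative only: actual nucleus sampling runs inside model.generate().
    """
    total, kept = 0.0, {}
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    norm = sum(kept.values())
    return {tok: pr / norm for tok, pr in kept.items()}

# Hypothetical next-token distribution: "pass" falls outside the 0.95 nucleus
filtered = top_p_filter({"def": 0.60, "return": 0.30, "import": 0.07, "pass": 0.03}, p=0.95)
print(sorted(filtered))
```

With `top-p=0.95`, low-probability tail tokens are dropped before sampling, which trims implausible continuations while keeping diversity.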
| Split | Full Size | Used |
|---|---|---|
| Train | ~412K samples | 50K samples |
| Validation | ~23K samples | 2K samples |
| Test | ~22K samples | 22K samples |
Each sample contains a Python function paired with its docstring. We use:
- Input: `func_documentation_string` (natural language)
- Target: `func_code_string` (Python source code)
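Turning a raw sample into a training pair is then just a matter of prefixing the docstring. The field names and prefix string are the ones used in this project; everything else is a sketch rather than the exact `train.py` code:

```python
def to_seq2seq_pair(sample):
    """Convert one CodeSearchNet record into an (input, target) text pair.

    Sketch only: the real pipeline additionally tokenizes both sides with the
    RoBERTa tokenizer and truncates to the configured max lengths (256 tokens).
    """
    source = "Generate Python: " + sample["func_documentation_string"].strip()
    target = sample["func_code_string"]
    return source, target

# Hypothetical record containing the two fields this project uses
sample = {
    "func_documentation_string": "Return the square of x.",
    "func_code_string": "def square(x):\n    return x * x",
}
src, tgt = to_seq2seq_pair(sample)
print(src)  # Generate Python: Return the square of x.
```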
| Metric | Score |
|---|---|
| BLEU-4 | 54.66 |
| ROUGE-1 | 0.611 |
| ROUGE-L | 0.594 |
| Train Loss | 1.071 |
| Epochs | 3 |
| Training Time | ~3h 51m (Kaggle P100) |
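For intuition on what BLEU-4 measures, here is a minimal sketch of clipped n-gram precision over whitespace tokens. It is illustrative only; the scores reported above come from a standard BLEU/ROUGE implementation, not from this snippet:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with reference counts as an upper bound (clipped counting)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

ref = "def square ( x ) : return x * x"
hyp = "def square ( x ) : return x * 2"
print(ngram_precision(hyp, ref, 4))  # 6 of 7 4-grams match
```

BLEU-4 combines such precisions for n = 1..4 with a brevity penalty, so a single wrong token near the end of a function (as above) already costs several 4-gram matches.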
This model was trained on 50K samples (out of 412K) with only 3 epochs due to computational constraints. As a result:
- The model works well for simple, common functions (file reading, basic data structures)
- The model struggles with complex algorithms (recursion, sorting logic, prime checking)
- Generated code may be structurally correct but contain logical errors
| Improvement | Expected Gain |
|---|---|
| Train on full 412K samples | +10-15% BLEU |
| Increase to 10+ epochs | +5-10% BLEU |
| Use codet5-large (770M params) | +15-20% BLEU |
| Use codet5p-2b (2B params) | +25-30% BLEU |
💡 Note: This project is intended as a portfolio demonstration of the fine-tuning pipeline, not as a production-ready code generation tool. For production use, consider training on the full dataset with a larger model.
The fine-tuned model is available on HuggingFace Hub:
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("roybeey/codet5-python-codegen")
tokenizer = AutoTokenizer.from_pretrained("roybeey/codet5-python-codegen")
```

```bash
git clone https://github.com/roybeey0/codet5-python-codegen.git
cd codet5-python-codegen
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

```bash
python train.py
# Model saved to ./outputs/codet5-python-codegen
```

💡 Tip: To do a quick sanity check, uncomment the subset lines in `train.py` to train on 5K samples first.

```bash
python inference.py
```

```bash
python app.py
# Open: http://localhost:7860
```

```
codet5-python-codegen/
│
├── train.py           # Fine-tuning pipeline (load → preprocess → train → save)
├── inference.py       # Code generation from docstrings
├── eval_metrics.py    # BLEU, ROUGE evaluation
├── app.py             # Gradio web demo UI
│
├── outputs/           # (git-ignored) Trained model checkpoints
│   └── codet5-python-codegen/
│       ├── config.json
│       ├── model.safetensors
│       └── tokenizer_config.json
│
├── requirements.txt
└── README.md
```
| Parameter | Value |
|---|---|
| Base model | Salesforce/codet5-base |
| Max input tokens | 256 |
| Max output tokens | 256 |
| Batch size | 8 |
| Learning rate | 5e-5 |
| Warmup steps | 500 |
| Epochs | 3 |
| Optimizer | AdamW |
| Precision | FP16 (if GPU available) |
| Beam search width | 5 |
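With HuggingFace Transformers, the table above roughly corresponds to a `Seq2SeqTrainingArguments` configuration like the following. This is a sketch under the assumption that training uses the Trainer API; the argument names are standard Transformers parameters, while the output path is illustrative:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a config matching the hyperparameter table; AdamW is the default optimizer
training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs/codet5-python-codegen",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,                    # FP16 when a GPU is available
    predict_with_generate=True,   # decode with generate() during evaluation
    generation_max_length=256,
    generation_num_beams=5,
)
```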
| Spec | Value |
|---|---|
| GPU | Kaggle P100 (16GB) |
| RAM | 29GB |
| Training time | ~3h 51m |
| Dataset | 50K / 412K samples |
Use a larger model:

```python
MODEL_NAME = "Salesforce/codet5-large"
```

Train on full dataset:

```python
# Comment out these lines in train.py:
# raw_dataset["train"] = raw_dataset["train"].select(range(50000))
# raw_dataset["validation"] = raw_dataset["validation"].select(range(2000))
```

- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (Wang et al., 2021)
- CodeSearchNet Challenge (Husain et al., 2019)
- HuggingFace Transformers
- Salesforce/codet5-base on HuggingFace
MIT License. Feel free to use, modify, and distribute.
roybeey
GitHub Β· HuggingFace
⭐ If you found this project useful, please star the repository!