A Python implementation for converting natural language descriptions into Python code using fine-tuned StarCoder2 and CodeT5 models.
- Dual Model Support: Works with both StarCoder2 (open-access) and CodeT5 models
- Fine-tuning Capabilities: Customize models for your specific use cases
- Batch Processing: Convert multiple descriptions at once
- Error Handling: Robust error handling and dependency checking
- Sample Data: Built-in sample dataset for quick testing
- Flexible Architecture: Easy to extend and modify
- Python 3.8 or higher
- At least 8GB RAM (16GB recommended for training)
- CUDA-compatible GPU (optional, for faster training)
torch>=1.9.0
transformers>=4.36.0
datasets>=2.0.0
accelerate>=0.20.0
sentencepiece>=0.1.99
git clone https://github.com/Krypto-Hashers-Community/Natural-language-to-Python-automation
cd Natural-language-to-Python-automation
pip install -r requirements.txt
- Install PyTorch
# For CPU only
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Install other dependencies
pip install transformers datasets accelerate sentencepiece
# Create a new environment
conda create -n nltocode python=3.9
conda activate nltocode
# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Install other packages
conda install transformers datasets sentencepiece -c huggingface
pip install accelerate
from nl_to_code_converter import NLToCodeConverter
# Initialize the converter with StarCoder2 or CodeT5
converter = NLToCodeConverter("starcoder2") # or "codet5"
# Convert natural language to Python code
code = converter.convert("Create a function that adds two numbers")
print(code)
# Prepare your training data
training_data = [
    {
        "natural_language": "Create a function that adds two numbers",
        "python_code": "def add_numbers(a, b):\n    return a + b"
    },
    # Add more examples...
]
# Fine-tune the model
converter = NLToCodeConverter("starcoder2")
converter.train(training_data, output_dir="./my_finetuned_model", epochs=3)
# Use the fine-tuned model
code = converter.convert("Write a function to multiply two numbers")
descriptions = [
"Create a list of even numbers from 0 to 20",
"Write a function to check if a number is prime",
"Create a dictionary with fruit names and colors"
]
codes = converter.batch_convert(descriptions)
for desc, code in zip(descriptions, codes):
    print(f"Input: {desc}")
    print(f"Output: {code}")
    print("-" * 50)
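For long lists of descriptions it may help to feed batch_convert in smaller pieces so that memory use stays bounded. The sketch below is only an illustration: `chunked` and `convert_in_chunks` are hypothetical helpers that are not part of this project, and the second one accepts any callable that maps a list of descriptions to a list of code strings.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_chunks(descriptions, batch_convert, size=8):
    """Call `batch_convert` on each chunk and collect the results in input order.

    `batch_convert` is assumed to return one output per input description.
    """
    results = []
    for chunk in chunked(descriptions, size):
        results.extend(batch_convert(chunk))
    return results
```

Assuming converter.batch_convert behaves as shown above, a call could look like convert_in_chunks(descriptions, converter.batch_convert, size=4).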
| Feature | StarCoder2-3B | CodeT5 |
|---|---|---|
| Architecture | Decoder-only (GPT-style) | Encoder-Decoder (T5-style) |
| Access | Open-access | Open-access |
| Training | Causal Language Modeling | Sequence-to-Sequence |
| Best For | Code completion, generation | Code translation, summarization |
| Context Length | Up to 16K tokens | 512 tokens |
| Memory Usage | Moderate | Lower |
| Speed | Moderate | Faster |
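The architecture difference in the table affects how a training example is laid out. A decoder-only model sees one continuous text and learns to continue it, while an encoder-decoder model gets the description as input and the code as a separate target. The sketch below illustrates the idea only: the prompt wording, the function names, and the end-of-sequence string are assumptions, not this project's actual templates. In practice the end-of-sequence marker should come from the model's tokenizer.

```python
def format_for_causal_lm(example, eos_token="<|endoftext|>"):
    """Decoder-only layout: description first, code as the continuation.

    The "# Task:" prefix and the default eos_token are illustrative placeholders.
    """
    return (
        f"# Task: {example['natural_language']}\n"
        f"{example['python_code']}{eos_token}"
    )


def format_for_seq2seq(example):
    """Encoder-decoder layout: description is the source, code is the target.

    The "Generate Python:" prefix is an illustrative placeholder.
    """
    return {
        "source": f"Generate Python: {example['natural_language']}",
        "target": example["python_code"],
    }


example = {
    "natural_language": "Create a function that adds two numbers",
    "python_code": "def add_numbers(a, b):\n    return a + b",
}
print(format_for_causal_lm(example))
print(format_for_seq2seq(example))
```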
# StarCoder2 Configuration
starcoder_converter = NLToCodeConverter("starcoder2")
starcoder_converter.fine_tuner.model_name = "bigcode/starcoder2-3b"
# CodeT5 Configuration
codet5_converter = NLToCodeConverter("codet5")
codet5_converter.fine_tuner.model_name = "Salesforce/codet5-base"
converter.train(
training_data=your_data,
output_dir="./custom_model",
epochs=5, # Number of training epochs
batch_size=4, # Batch size for training
learning_rate=5e-5, # Learning rate
max_length=512 # Maximum sequence length
)
[
    {
        "natural_language": "Create a function that calculates factorial",
        "python_code": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n-1)"
    },
    {
        "natural_language": "Write a loop to print numbers 1 to 10",
        "python_code": "for i in range(1, 11):\n    print(i)"
    }
]
# From JSON file
converter.load_training_data("path/to/your/data.json")
# From Python list
custom_data = [
{"natural_language": "...", "python_code": "..."},
]
converter.train(custom_data)
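Before training, it can be useful to confirm that every entry follows the two-key format shown above. The following is a minimal sketch using only the standard library. `validate_examples` and `load_examples` are hypothetical helpers, not part of the converter's API, and the checks reflect only what this README shows about the format.

```python
import json

# The two keys used by every training example in this README.
REQUIRED_KEYS = ("natural_language", "python_code")


def validate_examples(examples):
    """Return a list of (index, problem) tuples; an empty list means no problems found."""
    if not isinstance(examples, list):
        return [(None, "top-level value must be a list")]
    problems = []
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append((index, "entry is not an object"))
            continue
        for key in REQUIRED_KEYS:
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{key}'"))
    return problems


def load_examples(path):
    """Read a JSON file and raise ValueError if any entry fails validation."""
    with open(path, encoding="utf-8") as handle:
        examples = json.load(handle)
    problems = validate_examples(examples)
    if problems:
        raise ValueError(f"invalid training data: {problems}")
    return examples
```

Validating up front means a malformed entry is reported with its index instead of surfacing later as a failure partway through a training run.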
ModuleNotFoundError: No module named 'torch._C'
Solution:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
RuntimeError: CUDA out of memory
Solutions:
- Reduce batch size: per_device_train_batch_size=1
- Use gradient accumulation: gradient_accumulation_steps=8
- Use CPU training: set device_map=None
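The first two suggestions work together: lowering the per-device batch size reduces peak memory, and gradient accumulation restores the effective batch size by summing gradients over several small batches before each optimizer update. The arithmetic can be sketched as below. `training_steps` is a hypothetical helper, and training frameworks differ in how they round a final partial batch, so treat the step count as an estimate.

```python
import math


def training_steps(num_examples, per_device_batch_size,
                   gradient_accumulation_steps=1, epochs=1, num_devices=1):
    """Return (effective_batch_size, approximate_optimizer_steps)."""
    effective_batch_size = (
        per_device_batch_size * gradient_accumulation_steps * num_devices
    )
    # Round up so a final partial batch still counts as one update.
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return effective_batch_size, steps_per_epoch * epochs


# Batch size 1 with 8 accumulation steps behaves like batch size 8.
print(training_steps(1000, per_device_batch_size=1,
                     gradient_accumulation_steps=8, epochs=3))
```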
HTTPError: 403 Client Error
Solutions:
- Check internet connection
- Try using a VPN
- Use local model files
from nl_to_code_converter import check_dependencies
check_dependencies()
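The internals of check_dependencies are not shown here. As a rough standalone equivalent, the sketch below compares installed package versions against the minimums from the requirements list using only the standard library. `parse_version` and `report_dependencies` are hypothetical names, and the version parsing is deliberately simple (it ignores pre-release tags).

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list in this README.
MINIMUM_VERSIONS = {
    "torch": "1.9.0",
    "transformers": "4.36.0",
    "datasets": "2.0.0",
    "accelerate": "0.20.0",
    "sentencepiece": "0.1.99",
}


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def report_dependencies(minimums=MINIMUM_VERSIONS):
    """Return {package: status}, where status is 'ok', 'missing', or 'too old (...)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) < parse_version(minimum):
            report[name] = f"too old ({installed} < {minimum})"
        else:
            report[name] = "ok"
    return report


for package, status in report_dependencies().items():
    print(f"{package}: {status}")
```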
- Use specific, clear descriptions
# Good
"Create a function that takes a list of numbers and returns the sum"
# Less specific
"Make a function for adding"
- Include context in your training data
training_data = [
    {"natural_language": "Create a class with constructor", "python_code": "class MyClass:\n    def __init__(self, value):\n        self.value = value"},
    {"natural_language": "Write error handling code", "python_code": "try:\n    pass\nexcept Exception as e:\n    print(f'Error: {e}')"}
]
- Fine-tune on domain-specific data
  - Web scraping code examples
  - Data science snippets
  - API integration patterns
- Use smaller models for prototyping
- Implement early stopping
- Use mixed precision training
- Enable gradient checkpointing
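Of the tips above, early stopping is the easiest to show in isolation: training halts once the validation loss has failed to improve for a set number of evaluations. The sketch below is framework-independent, and `EarlyStopper` is a hypothetical class. Whether this project's train method exposes such an option is not stated here. Hugging Face Transformers provides an EarlyStoppingCallback for its Trainer that serves the same purpose.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, validation_loss):
        # An improvement must beat the best loss by more than min_delta.
        if validation_loss < self.best_loss - self.min_delta:
            self.best_loss = validation_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch, purely for illustration.
losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.70]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(f"Stopped at epoch index {stopped_at}, best loss {stopper.best_loss}")
```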
python -c "from nl_to_code_converter import check_dependencies; check_dependencies()"
converter = NLToCodeConverter("starcoder2")
sample_data = converter.create_sample_data()
print(f"Sample data loaded: {len(sample_data)} examples")
# Test conversion
result = converter.convert("Create a simple calculator function")
print(f"Generated code: {result}")
# Input
description = "Create a function that filters even numbers from a list"
# Output
def filter_even_numbers(numbers):
    return [num for num in numbers if num % 2 == 0]
# Input
description = "Write code to read a CSV file and print the first 5 rows"
# Output
import pandas as pd
df = pd.read_csv('file.csv')
print(df.head())
# Input
description = "Create a function to make a GET request to an API"
# Output
import requests
def make_api_request(url):
    response = requests.get(url)
    return response.json()
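Model output is not guaranteed to be valid Python, so it is worth checking that generated code at least parses before using it. A minimal sketch with the standard ast module follows. `check_python_syntax` is a hypothetical helper, and it verifies syntax only: it neither runs the code nor confirms that the code does what the description asked.

```python
import ast


def check_python_syntax(code):
    """Return (True, None) if `code` parses, else (False, a short error message)."""
    try:
        ast.parse(code)
    except SyntaxError as error:
        return False, f"line {error.lineno}: {error.msg}"
    return True, None


generated = (
    "def filter_even_numbers(numbers):\n"
    "    return [num for num in numbers if num % 2 == 0]"
)
print(check_python_syntax(generated))
```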
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
git clone <repository-url>
cd Natural-language-to-Python-automation
pip install -e .
pip install -r requirements-dev.txt
This project is licensed under the MIT License - see the LICENSE file for details.
- StarCoder2 by BigCode
- CodeT5 by Salesforce
- Hugging Face Transformers
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: bhowmicksaurav28@gmail.com
- Support for more programming languages (JavaScript, Java, C++)
- Web interface for easier usage
- Integration with popular IDEs
- Model quantization for mobile deployment
- Real-time code suggestion API
- Code explanation and documentation generation
Made with ❤️ for the developer community