Code Generation Model using Large Language Models (LLMs)

This project showcases the development of a code generation model built on pre-trained Large Language Models (LLMs). The objective is to leverage the existing capabilities of LLMs and fine-tune them to specialize in understanding and generating code snippets tailored to specific use cases or programming languages.

Overview

Pre-trained LLMs, such as OpenAI's GPT models or Google's BERT-based models, are trained on massive datasets that include diverse natural language and programming text. These models can perform generic tasks like generating text, summarizing information, or translating languages. However, fine-tuning them enables us to refine their capabilities for domain-specific tasks, such as code generation, completion, or bug fixing.

In this project, the pre-trained model is fine-tuned on a curated dataset of programming code (a dataset-building sketch follows this list), which may include:

  1. Codebases from repositories.
  2. Annotated examples of programming tasks.
  3. Documentation for various programming languages.
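For the first item, here is a minimal sketch of turning a locally cloned repository (such as openai/gym, the codebase used in this project) into a dataset of functions. The file names and the exact-match deduplication are illustrative assumptions, not the repository's actual scripts.

import ast
import json
from pathlib import Path

def extract_functions(repo_dir: str):
    """Walk a cloned repository and yield the source of each function definition."""
    seen = set()  # simple exact-match deduplication (illustrative)
    for path in Path(repo_dir).rglob("*.py"):
        try:
            source = path.read_text(encoding="utf-8")
            tree = ast.parse(source)
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse cleanly
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                code = ast.get_source_segment(source, node)
                if code and code not in seen:
                    seen.add(code)
                    yield code

if __name__ == "__main__":
    # One JSON record per function, ready for tokenization in a later step.
    with open("functions.jsonl", "w", encoding="utf-8") as f:
        for code in extract_functions("gym"):
            f.write(json.dumps({"code": code}) + "\n")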

Fine-Tuning LLMs for Code Generation

This project demonstrates how to fine-tune pre-trained Large Language Models (LLMs) to specialize in code generation. Fine-tuning adapts a general-purpose LLM (e.g., GPT) to understand and generate high-quality code snippets for specific programming tasks. Fine-tuning is the process of taking a pre-trained machine learning model and adapting it to a specific task by training it further on a smaller, task-specific dataset. This approach leverages the knowledge the model has already learned during pre-training, reducing the time, computational resources, and data required for task-specific training.
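As a starting point, the sketch below loads the pre-trained base model used in this project (codegen-350M-mono) with Hugging Face Transformers. The Hub id Salesforce/codegen-350M-mono and the pad-token choice are reasonable assumptions, not necessarily the exact settings used in this repository.

# Minimal sketch: load the pre-trained base model before any fine-tuning.
# Assumes the Hugging Face Hub id "Salesforce/codegen-350M-mono"; adjust if the
# project pins a different checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Salesforce/codegen-350M-mono"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# CodeGen's tokenizer has no pad token by default; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token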

Steps:

  1. Dataset Preparation: Curate a task-specific dataset with programming problems and solutions.
  2. Preprocessing: Tokenize code and prompts, normalize formatting, and clean the data.
  3. Training Setup: Use a pre-trained LLM as the base model. Apply a small learning rate to adjust weights without losing pre-trained knowledge. Optimize with task-specific loss functions (e.g., Negative Log-Likelihood for generation).
  4. Fine-Tuning: Train the model on the task-specific dataset, monitor performance, and validate using metrics like BLEU or perplexity (a minimal training sketch follows this list).
  5. Evaluation & Deployment: Test on unseen prompts, optimize hyperparameters, and deploy for real-world applications.
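Below is a minimal training sketch covering steps 2-4, assuming the functions.jsonl file from the earlier dataset sketch. The hyperparameters (batch size, learning rate, epochs) and output directory are illustrative defaults, not the values used in this repository.

import math

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Step 2: tokenize the curated functions (one {"code": ...} record per line).
dataset = load_dataset("json", data_files="functions.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["code"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["code"])

# Causal-LM objective: the collator pads each batch and copies input ids to labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Step 3: a small learning rate so the pre-trained weights are only gently adjusted.
args = TrainingArguments(
    output_dir="codegen-finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)

# Step 4: fine-tune, then report perplexity (the exponential of the held-out loss).
trainer.train()
metrics = trainer.evaluate()
print("validation perplexity:", math.exp(metrics["eval_loss"]))

trainer.save_model("codegen-finetuned")
tokenizer.save_pretrained("codegen-finetuned")

The small learning rate and few epochs reflect the point made in step 3: the goal is to nudge the pre-trained weights toward the code domain, not to retrain the model from scratch.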

Learn more about fine-tuning:

https://www.geeksforgeeks.org/how-to-fine-tune-an-llm-from-hugging-face/

🚀 How It Works

  1. Data Collection: Collect diverse code samples and documentation from GitHub using the GitHub REST API (see the sketch after this list).
  2. Data Preprocessing: Perform tasks such as tokenization, comment extraction, and deduplication.
  3. Fine-Tuning: Train the LLM using supervised learning with code and related text pairs.
  4. Code Generation: Generate code snippets from natural language descriptions.
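Here is a minimal sketch of the data-collection step using the GitHub REST API. The repository, branch, and GITHUB_TOKEN environment variable are illustrative assumptions; a real run should supply a personal access token and respect API rate limits.

import os
import requests

OWNER, REPO, BRANCH = "openai", "gym", "master"  # illustrative target repository
token = os.environ.get("GITHUB_TOKEN")  # personal access token avoids strict rate limits
headers = {"Authorization": f"token {token}"} if token else {}

# The git/trees endpoint with recursive=1 returns every path on the branch.
tree_url = f"https://api.github.com/repos/{OWNER}/{REPO}/git/trees/{BRANCH}?recursive=1"
tree = requests.get(tree_url, headers=headers, timeout=30).json()

python_files = [item["path"] for item in tree["tree"]
                if item["type"] == "blob" and item["path"].endswith(".py")]

for path in python_files[:5]:  # limit to a few files for the demo
    raw_url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/{path}"
    code = requests.get(raw_url, headers=headers, timeout=30).text
    print(path, len(code), "characters")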

📌 Features

  • Collects code snippets and documentation from GitHub using the GitHub API.
  • Preprocesses code data to ensure quality and consistency.
  • Fine-tunes a pre-trained code LLM (codegen-350M-mono from the CodeGen family) for code generation tasks (a generation sketch follows this list).
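Below is a minimal generation sketch, assuming the fine-tuned checkpoint saved by the training sketch above; the prompt and sampling settings (temperature, top-p) are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint produced by the illustrative training sketch.
tokenizer = AutoTokenizer.from_pretrained("codegen-finetuned")
model = AutoModelForCausalLM.from_pretrained("codegen-finetuned")
model.eval()

prompt = "# Python function that returns the nth Fibonacci number\ndef fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

A low temperature keeps completions close to the training distribution; raising temperature or top_p yields more varied snippets.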

βš™οΈ Requirements

To run this project, you'll need the following dependencies:

  • Python 3.10+
  • GitHub API Token
  • Hugging Face Transformers
  • PyTorch

🔧 Installation

Clone this repository and install the required dependencies:

git clone https://github.com/sapritanand/Code-Generation-using-LLM.git
cd Code-Generation-using-LLM
pip install -r requirements.txt

About

This project extracts Python code from the OpenAI Gym GitHub repository, creates a dataset of functions, and fine-tunes a code generation model (codegen-350M-mono) using Hugging Face Transformers to generate new code snippets.
