OpenCoder

⚡ The Open Cookbook for Top-Tier Code Large Language Models ⚡

🏠 Home Page | 🤗 Model | 📊 Dataset | 📄Paper

News

🔥🔥 2024/11/11 We have released 55B of recalled pages from Fineweb, including 📊 fineweb-code-corpus and 📊 fineweb-math-corpus.
🔥🔥 2024/11/09 We have released 4.5M Post-training data: 📊 Dataset.
🔥 2024/11/08 We have released our models! Please download them from 🤗 Model.
🔥 2024/11/07 We have released our paper on Arxiv: 📄 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models.

Releases

We are working hard to release all those resources! 💪

Introduction

OpenCoder is an open and reproducible code LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. Starting from scratch, OpenCoder is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised finetuned on over 4.5M high-quality SFT examples, finally reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols. Empowering researchers to build and innovate, OpenCoder is your open foundation for advancing code AI.

Complete Open Source: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code for training. This release includes high-quality synthetic data, an extensive set of checkpoints, and a dataset of over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.
Comprehensive Experimental Analysis: OpenCoder is rigorously tested through extensive ablation studies on various data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments, ensuring thorough exploration and validation of the model’s performance.
High-Quality Synthetic Data: OpenCoder provides a fully developed synthetic data generation process and over 4.5 million SFT data entries, establishing a robust data foundation for model training and evaluation.
Exceptional Performance: OpenCoder achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.

Models

Model	Sequence Length	Download
OpenCoder-1.5B-Base	4K	🤗 HuggingFace
OpenCoder-8B-Base	8K	🤗 HuggingFace
OpenCoder-1.5B-Instruct	4K	🤗 HuggingFace
OpenCoder-8B-Instruct	8K	🤗 HuggingFace

Datasets

Pre-training

Dataset	Size	Download
fineweb-code-corpus	148 GB	🤗 HuggingFace
fineweb-math-corpus	10 GB	🤗 HuggingFace

Post-training

Dataset	Num	Download
opencoder-sft-stage1	4.21 M	🤗 HuggingFace
opencoder-sft-stage2	375 K	🤗 HuggingFace

This is not the end; we are organizing the remaining data and uploading it progressively.

Performance

Get Started

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "infly/OpenCoder-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages=[
    { 'role': 'user', 'content': "write a quick sort algorithm in python."}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)

result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(result)

Citation

If you find our work helpful, feel free to give us a cite :-)

@inproceedings{Huang2024OpenCoderTO,
  title={OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
  author={Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
  year={2024},
  url={https://arxiv.org/pdf/2411.04905}
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
sft		sft
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OpenCoder

News

Releases

Introduction

Models

Datasets

Pre-training

Post-training

Performance

Get Started

Citation

Star History

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

matheus-rech/OpenCoder-llm

Folders and files

Latest commit

History

Repository files navigation

OpenCoder

News

Releases

Introduction

Models

Datasets

Pre-training

Post-training

Performance

Get Started

Citation

Star History

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages