Falcon-H1 Chat | Hugging Face | Paper (Coming soon) | Blog | Documentation | Hugging Face Demo | Discord
- 05/21/2025 Falcon-H1 series is finally out!
We are excited to introduce Falcon-H1, the latest evolution in the Falcon family of large language models. Built upon an advanced hybrid architecture, in which each block integrates both State Space Models (SSMs) and attention mechanisms, these models span a wide range of scales, from 500 million to 34 billion parameters, making them suitable for both lightweight inference on edge devices and large-scale deployments in data centers.
Falcon-H1 was initially trained with support for 18 core languages and scales to 100+ languages, achieving state-of-the-art performance in instruction following, mathematics, coding, and multilingual tasks.
Built by the Technology Innovation Institute (TII) in Abu Dhabi, Falcon-H1 is the latest step in pushing the frontier of hybrid transformer design:
Each transformer block processes all channels through both SSM and attention in parallel, then sums the outputs, allowing the model to benefit from both long-range memory (via SSMs) and local/global attention simultaneously (see the sketch below).
Models are available in multiple scales and variants: 500M, 1.5B, 1.5B-Deep, 3B, 7B, and 34B parameters.
The hybrid structure enhances reasoning and task generalization.
Native training in 18 languages, scalable to 100+ languages thanks to a multilingual tokenizer trained on diverse language datasets, with strong zero-shot translation and instruction-following abilities.
Tuned for instruction following and multi-turn conversations, and already integrated with major inference engines such as vLLM, Hugging Face Transformers, and llama.cpp, with more coming soon.
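To make the parallel hybrid design concrete, here is a minimal PyTorch sketch (an illustration only, not the actual Falcon-H1 implementation): a toy per-channel decaying scan stands in for the SSM branch, standard multi-head attention forms the other branch, and the two outputs are summed.

```python
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Very simplified SSM-style mixer: a per-channel decaying scan (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.log_decay = nn.Parameter(torch.zeros(dim))   # learned per-channel decay

    def forward(self, x):                                 # x: (batch, seq, dim)
        u = self.in_proj(x)
        decay = torch.sigmoid(self.log_decay)             # decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):                        # sequential scan over time
            h = decay * h + (1 - decay) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))

class ParallelHybridBlock(nn.Module):
    """Attention and SSM branches run on the same input; their outputs are summed."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = ToySSM(dim)

    def forward(self, x):
        h = self.norm(x)
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + attn_out + self.ssm(h)                 # residual + summed branches

block = ParallelHybridBlock(dim=512)
y = block(torch.randn(2, 16, 512))                        # (batch=2, seq=16, dim=512)
```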
We provide the following documentation and resources to begin working with Falcon-H1:
- Quick Deploy: Try Falcon-H1 instantly using our hosted Chat Interface or the Live Demo on Hugging Face
- Inference Toolkits: Compatible out of the box with vLLM, Transformers, and llama.cpp. Other runtimes are in progress.
- Local Setup: Full GGUF and HF formats available. Run the models efficiently on both GPU and CPU.
- Research: Learn more about our novel hybrid design in the Falcon-H1 technical report (Coming soon).
Make sure to install the latest version of transformers or vllm; if needed, install these packages from source:
pip install git+https://github.com/huggingface/transformers.git
Refer to the official vLLM documentation for more details on building vLLM from source.
Transformers is a library of pretrained models for natural language processing, supporting both inference and training. Refer to the snippet below to run Falcon-H1 models using Hugging Face Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model
model_id = "tiiuae/Falcon-H1-1B-Base"
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Perform text generation (the prompt and sampling settings below are illustrative)
prompt = "An increasing sequence: one,"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. To serve Falcon-H1 models, use the following command:
# pip install vllm
vllm serve tiiuae/Falcon-H1-1B-Instruct --tensor-parallel-size 2 --data-parallel-size 1
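Once the server is up, it exposes an OpenAI-compatible API (by default on port 8000). The snippet below is a minimal client-side sketch using the openai Python package; the prompt and sampling settings are illustrative:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tiiuae/Falcon-H1-1B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Falcon-H1 architecture in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```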
Refer to the model cards of our GGUF models and follow the installation instructions to run them with llama.cpp. Until our changes get merged upstream, you can use our public fork of llama.cpp.
All official GGUF files can be found on our official Hugging Face collection.
The llama.cpp toolkit provides a lightweight C/C++ implementation for running Falcon-H1 models locally. We maintain a public fork with all necessary patches and support. To build it, you will need:
- CMake ≥ 3.16
- A C++17-compatible compiler (e.g., gcc, clang)
- make or ninja build tool
- (Optional) Docker, for OpenWebUI integration
# Clone the Falcon-H1 llama.cpp fork
git clone https://github.com/tiiuae/llama.cpp-Falcon-H1.git
cd llama.cpp-Falcon-H1
# Create a build directory and compile
mkdir build && cd build
cmake .. # Configure the project
make -j$(nproc) # Build the binaries
Tip: For GPU acceleration, refer to the llama.cpp GPU guide.
Fetch the desired Falcon-H1 checkpoint from Hugging Face's collection:
# Example: download the 1B Instruct model
wget https://huggingface.co/tiiuae/falcon-h1-6819f2795bc406da60fab8df/resolve/main/Falcon-H1-1B-Instruct-Q5_0.gguf \
-P models/
All available GGUF files: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
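If you prefer a scripted download, the huggingface_hub package can fetch GGUF files as well. The repo_id below is an assumption for illustration; check the collection linked above for the exact repository and file names:

```python
from huggingface_hub import hf_hub_download

# Hypothetical example: verify repo_id and filename against the official collection.
path = hf_hub_download(
    repo_id="tiiuae/Falcon-H1-1B-Instruct-GGUF",      # assumed repository id
    filename="Falcon-H1-1B-Instruct-Q5_0.gguf",
    local_dir="models",
)
print(f"Model downloaded to {path}")
```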
Start the HTTP server for inference:
# -m: model file; -c: context window size; -ngl: number of GPU layers (omit if CPU-only);
# --temp: sampling temperature; --host/--port: bind address and listening port
./build/bin/llama-server \
  -m models/Falcon-H1-1B-Instruct-Q5_0.gguf \
  -c 4096 \
  -ngl 512 \
  --temp 0.1 \
  --host 0.0.0.0 \
  --port 11434
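Once the server is running, you can query it over HTTP. The sketch below uses Python and requests against the server's native /completion endpoint on the port configured above; the prompt is illustrative, and field names follow the llama.cpp server API, so double-check them against your build:

```python
import requests

# Minimal request to the llama.cpp server started above (port 11434).
resp = requests.post(
    "http://localhost:11434/completion",
    json={"prompt": "Explain state space models in one sentence.", "n_predict": 128},
    timeout=120,
)
print(resp.json()["content"])
```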
Use the popular OpenWebUI frontend to chat in your browser:
docker run -d \
--name openwebui-test \
-e OPENAI_API_BASE_URL="http://host.docker.internal:11434/v1" \
-p 8888:8080 \
ghcr.io/open-webui/open-webui:main
- Open your browser at http://localhost:8888
- Select Falcon-H1-1B-Instruct-Q5_0 from the model list
- Start chatting!
For advanced tuning and custom flags, see the full llama.cpp documentation: https://github.com/ggerganov/llama.cpp
Demo hardware: MacBook (M4 Max chip). Model: Falcon-H1-1B-Q6_K
Falcon-H1-1B-Q6_K.mp4
A detailed dynamic evaluation report is provided in our blogpost:
- We compare the performance of each Falcon-H1 model against the strongest models, not only of the same size but also of twice their size.
- We show that Falcon-H1 models achieve state-of-the-art performance on most benchmarks (reasoning, maths, coding, in-context learning, and more), outperforming some closed-source models such as gpt-4o-mini on coding, reasoning, and instruction-following tasks.
The blog post also features a dedicated section comparing Falcon-H1's inference speed with that of leading attention-based models across a wide range of sequence lengths and prefill and generation scenarios.
- Parallel Hybrid Blocks: Attention + SSM in every layer.
- 100+ Languages Supported: Multilingual instruction, chat, and translation.
- Scalable Sizes: From 0.5B to 34B.
- Full Ecosystem Integration: Runs on widely used inference stacks and supports common file formats (HF, GGUF).
- Quantized + Fine-tune Friendly: Models available in 8-bit, 4-bit, and standard FP16.
Got feedback or want to build with Falcon-H1?
Join the conversation on Discord, follow us on Hugging Face, visit our official website, or check out our roadmap and open issues on GitHub.
Feel free to cite our work if you find it useful for your projects:
@misc{tiifalconh1,
title = {Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance},
url = {https://falcon-lm.github.io/blog/falcon-h1},
author = {Falcon-LLM Team},
month = {May},
year = {2025}
}