# Mamba-2 Language Model demo

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import time

import torch
from transformers import AutoTokenizer

from mamba2 import Mamba2LMHeadModel

if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

Official pretrained models on [Hugging Face](https://huggingface.co/state-spaces):
* `state-spaces/mamba2-130m`
* `state-spaces/mamba2-370m`
* `state-spaces/mamba2-780m`
* `state-spaces/mamba2-1.3b`
* `state-spaces/mamba2-2.7b`

Choose a model that fits in the available memory: system RAM when running on the CPU or on a machine with unified memory, GPU VRAM otherwise.
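
As a rough guide for that choice, the memory needed for the weights alone is about the parameter count times the bytes per parameter. The sketch below is only an estimate under stated assumptions: the parameter counts are the nominal figures read off the checkpoint names, `weight_gb` is a hypothetical helper, and 4 bytes per parameter assumes 32-bit floats (the loader may use a different dtype). Activations and the recurrent state need additional memory on top of this.

```python
# Nominal parameter counts taken from the checkpoint names (approximate, an assumption).
NOMINAL_PARAMS = {
    "state-spaces/mamba2-130m": 130e6,
    "state-spaces/mamba2-370m": 370e6,
    "state-spaces/mamba2-780m": 780e6,
    "state-spaces/mamba2-1.3b": 1.3e9,
    "state-spaces/mamba2-2.7b": 2.7e9,
}


def weight_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    fp32 = weight_gb(n_params)      # 4 bytes per parameter
    half = weight_gb(n_params, 2)   # 2 bytes per parameter (fp16 / bf16)
    print(f"{name}: ~{fp32:.1f} GB in fp32, ~{half:.1f} GB in 16-bit")
```

By this estimate the 1.3b checkpoint used below needs roughly 5 GB for its weights in fp32.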

Note that these are base models without fine-tuning for downstream tasks such as chat or instruction following.

In [3]:
model = Mamba2LMHeadModel.from_pretrained("state-spaces/mamba2-1.3b", device=device)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token_id = tokenizer.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
generation_config = dict(
    max_new_length=200,
    temperature=1.0,
    top_k=30,
    top_p=1.0,
)
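
For intuition about `temperature`, `top_k` and `top_p`, here is a minimal pure-Python sketch of how these three settings are commonly applied to a vector of logits before sampling. It illustrates the general technique and is not necessarily how `model.generate` implements it; `filter_probs` is a hypothetical helper name.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=30, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the max for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids from most to least likely, then keep the top k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]
    # Top-p (nucleus): keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise over the surviving tokens; all others get probability zero.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = filter_probs([2.0, 1.0, 0.0, -1.0], top_k=2)
print(dist)
```

With `top_p=1.0`, as in the config above, the nucleus step removes nothing, so only `top_k` limits the candidate set.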

In [5]:
def generate(prompt: str, seed: int = 0, show_perf: bool = True):
    """Stream a completion of `prompt` to stdout and optionally report throughput."""
    torch.manual_seed(seed)  # fixed seed so sampling is reproducible

    # Tokenize; the trailing [0] drops the batch dimension, leaving a 1-D tensor of token ids.
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)[0]
    print(prompt, end="")

    # Note: process_time() reports CPU time of this process, not wall-clock time.
    start = time.process_time()
    n_generated = 0
    for i, (token_id, _hidden_state) in enumerate(model.generate(input_ids, **generation_config)):
        token = tokenizer.decode([token_id])
        if i == 0:
            # The time up to the first token is reported as prompt evaluation;
            # the clock restarts here, so the first token is not counted as generated.
            now = time.process_time()
            prompt_eval_elapsed, start = now - start, now
        else:
            n_generated += 1
        print(token, end="", flush=True)
    if show_perf:
        elapsed = time.process_time() - start
        print('\n\n---')
        print(f'Prompt eval | tokens: {input_ids.shape[0]} | elapsed: {prompt_eval_elapsed:.2f}s | tok/s: {input_ids.shape[0] / prompt_eval_elapsed:.2f}')
        print(f'Generation | tokens: {n_generated} | elapsed: {elapsed:.2f}s | tok/s: {n_generated / elapsed:.2f}')
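
The timing pattern in `generate` (time to the first yielded item, then throughput over the rest) can be isolated from the model. The sketch below does that with `time.perf_counter()`, which measures wall-clock time, whereas `time.process_time()` reports the CPU time of the process and excludes time spent sleeping. Treat it as an illustration: `time_stream` and `fake_tokens` are hypothetical names, and which clock is appropriate depends on what you want to measure.

```python
import time


def time_stream(stream):
    """Consume an iterator; return (first_item_seconds, rest_seconds, n_rest)."""
    start = time.perf_counter()
    first_elapsed = None  # stays None if the stream yields nothing
    n_rest = 0
    for i, _item in enumerate(stream):
        if i == 0:
            # Latency to the first item; restart the clock for the remainder.
            now = time.perf_counter()
            first_elapsed, start = now - start, now
        else:
            n_rest += 1
    rest_elapsed = time.perf_counter() - start
    return first_elapsed, rest_elapsed, n_rest


def fake_tokens(n, delay=0.01):
    """Stand-in for a token generator: yields n items, sleeping before each."""
    for k in range(n):
        time.sleep(delay)
        yield k


first, rest, n_rest = time_stream(fake_tokens(5))
print(f"first item: {first:.3f}s | remaining items: {n_rest} in {rest:.3f}s")
```

Because the fake generator spends its time sleeping, a CPU-time clock would report close to zero here, while the wall-clock measurement reflects the actual delay.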

In [6]:
generate("Mamba is a new state space model architecture")

Mamba is a new state space model architecture that enables the modeling of discrete events in humanoid robots with simple and intuitive syntax.

The Mamba state model is based on the state space model architecture of the state machine.
Mamba enables fast and intuitive specification of the state transitions, without requiring any experience with formal modeling.
The states are described on a per-event basis and they are not tied to an explicit representation of the robot world or
the physics of the physical robot.

Mamba is a free and open-source state space model software.

For information on Mamba, visit mamba-robots.org

What is a state machine?

State machine modeling was pioneered by J.R. Walker and his colleagues in the early 1960s at MIT, who showed that
continuous-time systems can be well represented by a simple discrete state machine. They also used this idea to build
the first model of the humanoid robotic system known as Quoogle. Over the

---
Prompt eval | tokens: 9 | elapse

In [7]:
generate("The meaning of life is")

The meaning of life is death. But there is always a possibility that people may believe the opposite, as many have in various parts of the world, such as in India. The idea of God being the one who decides everything and life meaning is not decided by our thoughts, but by events. Life is not a fairytale and even if death is the only real possibility that people do not think of.

India is the birthplace of Hinduism and the country has a history of several ancient civilizations. But what has remained unknown is the fact that Hinduism was not a religious system to worship in the past. It was more of a system of beliefs to live a better life. The most important point that can be ascertained is that life is all about choice and free will. The one who chooses the path, chooses the future that will be his.

The Hindu way of life has been influenced by ancient Hindu traditions and beliefs. While the major tenets remain the same, the practices and rituals have

---
Prompt eval | tokens: 5 | ela

In [8]:
generate("CUDA is Nvidia's biggest moat")

CUDA is Nvidia's biggest moat on graphics hardware, and it's one that the gaming and PC markets have both been fighting to maintain for the last decade. However, Nvidia's Pascal architecture is on the horizon. And that could be a big opportunity for AMD.


When Nvidia first released its Turing architecture at GTC back in February it was only on the cards; we only got an early taste of it. And so it has taken AMD quite some time to start taking a look at all that Nvidia-Turing design. However, AMD still has plenty of time to get to Nvidia before it's too late, and it needs to keep its eye on the Pascal architecture.

AMD is also working on a new GPU called Vega which is due to go live this year; the first Vega GPU is rumored to be a reimagined Polaris architecture that features an enhanced memory hierarchy which could significantly speed up the graphics pipeline. If Vega is anything like Polaris,

---
Prompt eval | tokens: 9 | elapsed: 0.57s | tok/s: 15.67
Generation | tokens: 199 | ela