# Mistral.ai LLM

## 7B v0.1 release - 27 sept 2023

Mission statement: https://mistral.ai/news/about-mistral-ai/

- Mistral 7B is our first 7B-parameter model; it outperforms all currently available open models up to 13B parameters on all standard English and code benchmarks.

- Mistral 7B is only a first step toward building the frontier models on our roadmap. Yet, it can be used to solve many tasks: summarisation, structuration and question answering to name a few.

- Mistral 7B is released in Apache 2.0, making it usable without restrictions anywhere.

- We’re committing to release the strongest open models in parallel to developing our commercial offering. 

- We will propose optimised proprietary models for on-premise/virtual private cloud deployment. These models will be distributed as white-box solutions, making both weights and code sources available. We are actively working on hosted solutions and dedicated deployment for enterprises.

- We’re already training much larger models, and are shifting toward novel architectures. Stay tuned for further releases this fall.

Model announcement: https://mistral.ai/news/announcing-mistral-7b/

- Mistral 7B is a 7.3B parameter model that:
  - Outperforms Llama 2 13B on all benchmarks
  - Outperforms Llama 1 34B on many benchmarks
  - Approaches CodeLlama 7B performance on code, while remaining good at English tasks
  - Uses Grouped-query attention (GQA) for faster inference
  - Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost

- We’re releasing Mistral 7B under the Apache 2.0 license; it can be used without restrictions.

- Mistral 7B is easy to fine-tune on any task. As a demonstration, we’re providing a model fine-tuned for chat, which outperforms Llama 2 13B chat.

- Mistral 7B uses a sliding window attention (SWA) mechanism ([Child et al.](https://arxiv.org/pdf/1904.10509.pdf), [Beltagy et al.](https://arxiv.org/pdf/2004.05150v2.pdf)), in which each layer attends to the previous 4,096 hidden states.

- In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k.

- A fixed attention span means we can limit our cache to a size of `sliding_window` tokens, using rotating buffers (read more in our reference implementation repo). This saves half of the cache memory for inference on a sequence length of 8192, without impacting model quality.

- To show the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on HuggingFace. No tricks, no proprietary data. The resulting model, Mistral 7B Instruct, outperforms all 7B models on MT-Bench, and is comparable to 13B chat models.

- Huggingface org: https://huggingface.co/mistralai

- Weights: https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-v0.1.tar

- Reference implementation: https://github.com/mistralai/mistral-src

- Cloud deployment: https://docs.mistral.ai/cloud-deployment/skypilot
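
The rotating-buffer cache mentioned in the announcement can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the class `RollingKVCache` and its methods are hypothetical names, and the string payload stands in for real key/value tensors.

```python
class RollingKVCache:
    """Toy fixed-size cache: position `pos` is stored in slot `pos % window`."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # each slot holds (position, payload) or None

    def append(self, pos, payload):
        # Once the buffer is full, this overwrites the oldest entry.
        self.slots[pos % self.window] = (pos, payload)

    def positions(self):
        # Positions currently held in the cache, oldest first.
        return sorted(slot[0] for slot in self.slots if slot is not None)


cache = RollingKVCache(window=4)
for pos in range(10):
    cache.append(pos, payload=f"kv_{pos}")

kept = cache.positions()
print(kept)  # only the last `window` positions remain

# Cache size relative to an unbounded cache, for the figures quoted above.
seq_len, window = 8192, 4096
cache_ratio = min(seq_len, window) / seq_len
print(cache_ratio)
```

With a window of 4,096 and a sequence of 8,192 tokens, the bounded cache holds half as many entries as an unbounded one, which matches the saving quoted in the announcement.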

Deploying the model: https://docs.mistral.ai/

- This documentation details the deployment bundle, which lets you quickly spin up a completion API on any major cloud provider with NVIDIA GPUs.

- A Docker image bundling vLLM, a fast Python inference server, with everything required to run our model is provided.

- To run the image, you need a cloud virtual machine with at least 24GB of vRAM for good throughput and float16 weights. Other inference stacks can lower these requirements to 16GB vRAM.
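
A back-of-the-envelope calculation helps put the vRAM figures in context. The sketch below only counts the weights, using the 7.3B parameter count from the announcement; the helper `weight_memory_gb` is a hypothetical name, and the 4-bit figure ignores quantization overhead. Activations and the KV cache need additional memory on top.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7.3e9  # parameter count quoted in the announcement

fp16_gb = weight_memory_gb(N_PARAMS, 2)    # float16: 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 0.5)  # 4-bit: half a byte per parameter (assumption: no overhead)

print(f"float16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:   {int4_gb:.2f} GB")
```

The float16 weights alone come to roughly 14.6 GB, which is consistent with 16GB being described as a lower bound and 24GB being recommended for good throughput.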

Interacting with the model: https://docs.mistral.ai/usage/how-to-use

- Once you have deployed the model with vLLM on a GPU instance, you can query it using the OpenAI-compatible REST API. This API is described in the API specification, but you can use any library that implements the OpenAI API.
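
As a rough sketch of what such a query could look like, the snippet below builds a completion request with the standard library only. The base URL `http://localhost:8000`, the `/v1/completions` path, and the sampling parameters are assumptions about a default vLLM setup rather than values taken from the Mistral docs, and `build_completion_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",  # assumption: local vLLM server on its default port
    model="mistralai/Mistral-7B-Instruct-v0.1",
    max_tokens=128,
    temperature=0.7,
):
    """Build (but do not send) an OpenAI-style completion request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("[INST] What is sliding window attention? [/INST]")
print(req.full_url)

# Sending the request requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Any OpenAI client library pointed at the same base URL should work the same way, since the server speaks the same REST protocol.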

Github repository: https://github.com/mistralai/mistral-src

Discord channel: https://discord.com/invite/mistralai

- The changes have been merged into the `transformers` main branch, but are yet to be released on PyPI. To get the latest version, run: `pip install git+https://github.com/huggingface/transformers`

- xformers 0.0.22 contains the sliding window patch

- Similarly to llama 2, there are two different formats used to save these models. The huggingface link you posted is for the variant that is meant to be used with the transformers library, whereas the mistral-src repo expects a slightly different format.

- git clone the repo, set up your env by installing the requirements (xFormers needs to be 0.0.22, whose wheels may or may not be published yet), then run `python -m one_file_ref /path/to/model`, where the path is the model folder available by direct download or torrent.

- Please note, this repo won't work with the models downloadable from huggingface (a different RoPE implementation leads to switcharoos in the qkv projections). Our RoPE implementation is closer to the llama2 one. The difference is [cos, ..., cos, sin, ..., sin] vs [cos, sin, cos, sin, ...].

- Timothée Lacroix — Hey, thanks ! We didn't use more french for this one, but there's definitely a good chunk of french in our tokens so we're geared for more european language goodness in the future 😉

- Timothée Lacroix — Not trained on 8T tokens no. but at the time we trained this tokenizer we had cleaned up to 8T tokens.

- Timothée Lacroix — We're currently quite busy with training the future models and handling this release. Papers is definitely something we have in mind when we'll have time though.

- Timothée Lacroix — Sorry, we won't give any details on training. We'll be as open as possible on the models we release and some choices we made, but our training recipes we'll keep for ourselves in the short term 😉

- On my (german) micro benchmark, the (qlora finetuned) 7bn model reaches Llama2 70b quality 🤩. Will release a first finetuned German Model soon.

- We train with a technique called sliding windows, where each layer attends to 4k tokens in the past, which broadens the effective context as more layers are stacked. Nice pictures here. It surely helps up to 16k.
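
The sliding-window idea from the last note can be illustrated with a toy mask. This is a sketch under stated assumptions, not the training code: `sliding_window_mask` and `earliest_reachable` are hypothetical helpers, and the window convention (a token sees itself plus the `window - 1` tokens before it) is an assumption chosen for simplicity.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def earliest_reachable(mask, n_layers, query):
    """Earliest position whose information can reach `query` after n_layers."""
    reachable = {query}
    for _ in range(n_layers):
        # Each layer lets every reachable position look one window further back.
        reachable = {j for i in reachable for j, ok in enumerate(mask[i]) if ok}
    return min(reachable)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

one_layer = earliest_reachable(mask, n_layers=1, query=5)
two_layers = earliest_reachable(mask, n_layers=2, query=5)
print(one_layer, two_layers)

# Rough upper bound for Mistral 7B, using the window size and layer count
# from its config (back-of-the-envelope, not a measured figure).
theoretical_span = 32 * 4096
print(theoretical_span)
```

In this toy setup each extra layer extends the reach by `window - 1` positions, which is why stacking layers broadens the context well beyond a single 4k window.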

Huggingface model card: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

- The Mistral-7B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model using a variety of publicly available conversation datasets.

- In order to leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a beginning-of-sentence id. The next instructions should not. The assistant generation will be ended by the end-of-sentence token id.

- "MistralForCausalLM"
  - "hidden_size": 4096
  - "max_position_embeddings": 32768
  - "num_hidden_layers": 32
  - "sliding_window": 4096
  - "torch_dtype": "bfloat16"
  - "vocab_size": 32000
  - "tokenizer_class": "LlamaTokenizer"
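
The prompt format described in the model card notes can be sketched as a small helper. `build_instruct_prompt` is a hypothetical function, and the exact whitespace around the tags is an assumption; only the overall structure (one leading `<s>`, each instruction wrapped in `[INST]` and `[/INST]`, each completed answer closed with `</s>`) comes from the description.

```python
def build_instruct_prompt(turns, bos="<s>", eos="</s>"):
    """Build a multi-turn prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; the reply is
    None for the final turn that the model should answer.
    """
    prompt = bos  # only the very first instruction is preceded by the BOS token
    for user_message, assistant_reply in turns:
        prompt += f"[INST] {user_message} [/INST]"
        if assistant_reply is not None:
            # A completed assistant turn ends with the end-of-sentence token.
            prompt += f"{assistant_reply}{eos} "
    return prompt


prompt = build_instruct_prompt(
    [
        ("Hello", "Hi there"),
        ("How are you?", None),
    ]
)
print(prompt)
```

Because the string already contains `<s>`, it should be tokenized with `add_special_tokens=False` so that the tokenizer does not add a second beginning-of-sentence token.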

In [None]:
pip install git+https://github.com/huggingface/transformers

In [None]:
pip install xformers

In [None]:
pip install sentencepiece

In [None]:
pip install accelerate

In [None]:
pip install bitsandbytes

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
text = "<s>[INST] Tu es un excellent conseiller patrimonial dans une banque française, recommande produit d'épargne pour un client qui souhaite transmettre un capital à son décès, en français.[/INST]"

encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)

model_inputs = encodeds.to("cuda")

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST] Tu es un excellent conseiller patrimonial dans une banque française, recommande produit d'épargne pour un client qui souhaite transmettre un capital à son décès, en français.[/INST] Merci beaucoup pour m’avoir désigné comme conseiller patrimonial de la banque française. Réflechissons lentement les options possibles pour votre client en fonction de ses besoins spécifiques.

Dans le cadre de votre client désirant transmettre un capital à son décès, ici sont quelques recommendations produits d’épargne à considérer:

1. Contrats d’assurance vie: Ils associent souvent la garantie d’un capital ou le paiement d’une somme à un beneficiaire à un âge spécifique, à la mort de la personne assurée.
2. Comme un contrat d’assurance vie, les comptes épargnants assurés offrent un capital garanti à dureté définie. Mais ils sont associs à des contrats bancaires comme des comptes de dépôt à taux.
3. Les produits d’habitats : Ils associent un produit d’habitat avec un produit épargne en faveur de