# GovernmentGPT: Mistral 7B fine-tuning

We wanted to see whether we could teach an LLM to do the job of elected British Members of Parliament (MPs) and debate any issue as they do in the House of Commons.

GovernmentGPT is an LLM fine-tuned with a LoRA adapter. You can see the fine-tuning code here: https://github.com/stewhsource/GovernmentGPT/FineTuning. To skip fine-tuning and jump straight to inference, go here: https://github.com/stewhsource/GovernmentGPT/blob/main/Inference/GovernmentGPT_Inference.ipynb

This notebook lets you recreate the GovernmentGPT model by LoRA fine-tuning Mistral 7B. We use the [Mistral 7B v0.3 base model](https://huggingface.co/mistralai/Mistral-7B-v0.3).

The code to recreate the training dataset is available at: https://github.com/stewhsource/GovernmentGPT/tree/main/DatasetPreparation. You can also download the fine-tuning training data directly from: https://github.com/stewhsource/GovernmentGPT/tree/main/DatasetPreparation/FineTuningDatasets

LLM fine-tuning is computationally heavy, so this notebook needs to run on a machine with a GPU and a lot of memory. Google Colab provides this in the cloud quickly and easily with its A100 High-RAM instances (you'll likely need Colab Pro for that).
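
As a rough back-of-the-envelope check on why a large GPU is needed, the sketch below estimates the memory taken by the base model weights alone. The parameter count (about 7.25 billion) and the 2 bytes per parameter (bfloat16) are assumptions for illustration. Activations, LoRA weights, optimizer state and the CUDA context all add to this figure.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumed values, not taken from the notebook:
# roughly 7.25e9 parameters stored as bfloat16 (2 bytes each).
approx_params = 7.25e9
print(f"Base weights: ~{weight_memory_gb(approx_params):.1f} GB")
```

Under these assumptions the estimate is about 14.5 GB, which is consistent with the 14.5G size reported for `consolidated.safetensors` when the model is downloaded later in this notebook.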

The fine-tuning approach here is based on the [`mistral-finetune`](https://github.com/mistralai/mistral-finetune/) Git repo.

## Installation

Clone the `mistral-finetune` repo:


In [1]:
%cd /content/
!git clone https://github.com/mistralai/mistral-finetune.git

/content
Cloning into 'mistral-finetune'...
remote: Enumerating objects: 401, done.
remote: Counting objects: 100% (142/142), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 401 (delta 125), reused 94 (delta 94), pack-reused 259
Receiving objects: 100% (401/401), 210.17 KiB | 5.25 MiB/s, done.
Resolving deltas: 100% (209/209), done.


Install all required dependencies:

In [2]:
!pip install -r /content/mistral-finetune/requirements.txt

Collecting fire (from -r /content/mistral-finetune/requirements.txt (line 1))
  Downloading fire-0.6.0.tar.gz (88 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.4/88.4 kB 2.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting mistral-common>=1.1.0 (from -r /content/mistral-finetune/requirements.txt (line 4))
  Downloading mistral_common-1.2.1-py3-none-any.whl (704 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 704.9/704.9 kB 15.9 MB/s eta 0:00:00
Collecting torch==2.2 (from -r /content/mistral-finetune/requirements.txt (line 9))
  Downloading torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 755.5/755.5 MB 1.4 MB/s eta 0:00:00
Collecting triton==2.2 (from -r /content/mistral-finetune/requirements.txt (line 10))
  Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x

## Mistral 7B model download

In [3]:
#!wget https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar

In [4]:
#!DIR=/content/mistral_models && mkdir -p $DIR && tar -xf mistral-7B-v0.3.tar -C $DIR

In [5]:
# Alternatively, you can download the model from Hugging Face
# (sometimes this is needed because the Mistral mirror can be very slow from Colab)

!pip install huggingface_hub
from huggingface_hub import snapshot_download
from pathlib import Path

# Create the target directory, including any missing parent directories
mistral_models_path = Path('/content/mistral_models/7B-v0.3')
mistral_models_path.mkdir(parents=True, exist_ok=True)

# Import Colab Secrets userdata module
from google.colab import userdata

# Set the Hugging Face access token (stored as a Colab secret named HF_TOKEN)
import os
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

snapshot_download(repo_id="mistralai/Mistral-7B-v0.3", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

tokenizer.model.v3:   0%|          | 0.00/587k [00:00<?, ?B/s]

params.json:   0%|          | 0.00/202 [00:00<?, ?B/s]

consolidated.safetensors:   0%|          | 0.00/14.5G [00:00<?, ?B/s]

'/content/mistral_models/7B-v0.3'

In [6]:
!ls /content/mistral_models

7B-v0.3
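
Listing the parent directory only shows that the folder exists. A minimal sketch of a more specific check follows, assuming the three file names passed to `allow_patterns` in the download cell are all that is needed. The helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# The three files requested via allow_patterns in the download cell
EXPECTED_FILES = ("params.json", "consolidated.safetensors", "tokenizer.model.v3")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for name in expected:
        path = model_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = missing_model_files("/content/mistral_models/7B-v0.3")
print("All model files present" if not missing else f"Missing: {missing}")
```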


## Prepare fine-tuning data
Use `ProduceFineTuningDataset.py` separately to produce the datasets. You can then download them with `wget` or upload them manually into `data/` here, and make sure the fine-tuning config below points to them.

In [7]:
%cd /content/

# make a new directory called data
!mkdir -p data

/content


In [8]:
%cd /content/data
!wget -O /content/data/HansardSequences_250k.big.txt.zip https://stewh-publicdata.s3.eu-west-2.amazonaws.com/governmentgpt/2024-06-07/datasets/HansardSequences_250k.big.txt.zip

/content/data
--2024-06-26 12:21:28--  https://stewh-publicdata.s3.eu-west-2.amazonaws.com/governmentgpt/2024-06-07/datasets/HansardSequences_250k.big.txt.zip
Resolving stewh-publicdata.s3.eu-west-2.amazonaws.com (stewh-publicdata.s3.eu-west-2.amazonaws.com)... 52.95.143.106, 52.95.149.10, 3.5.245.108, ...
Connecting to stewh-publicdata.s3.eu-west-2.amazonaws.com (stewh-publicdata.s3.eu-west-2.amazonaws.com)|52.95.143.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1813919094 (1.7G) [application/zip]
Saving to: ‘/content/data/HansardSequences_250k.big.txt.zip’


2024-06-26 12:22:54 (20.3 MB/s) - ‘/content/data/HansardSequences_250k.big.txt.zip’ saved [1813919094/1813919094]



In [9]:
# Unzip
!unzip /content/data/HansardSequences_250k.big.txt.zip -d /content/data/
%cd /content/data/

Archive:  /content/data/HansardSequences_250k.big.txt.zip
  inflating: /content/data/HansardSequences_250k.big.txt  
/content/data


## Prepare fine-tuning configuration
Create the YAML configuration for GovernmentGPT.

In [10]:
%cd /content/

# make a new directory called config
!mkdir -p config

/content


In [11]:
# define training configuration
# for your own use cases, you might want to change the data paths, model path, run_dir, and other hyperparameters

config = """
# data
data:
  instruct_data: ""  # Fill
  data: "/content/data/HansardSequences_250k.big.txt"  # Optionally fill with pretraining data
  eval_instruct_data: ""  # Optionally fill

# model
model_id_or_path: "/content/mistral_models/7B-v0.3"  # Change to downloaded path
lora:
  rank: 64

# optim
# tokens per training step = batch_size x num_GPUs x seq_len
# we recommend a sequence length of 32768
# If you run into memory errors, try reducing the sequence length
seq_len: 8192
batch_size: 1
num_microbatches: 8
max_steps: 100
optim:
  lr: 1.e-4
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: True
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/content/governmentgpt"
"""

# save the configuration to /content/config/governmentgpt.yaml
import yaml
with open('/content/config/governmentgpt.yaml', 'w') as file:
    yaml.dump(yaml.safe_load(config), file)
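
To make the token-budget comment in the configuration concrete, here is a small worked calculation using the values above on a single GPU. Treating `num_microbatches` as a gradient-accumulation multiplier on top of the formula in the comment is an assumption, not something this notebook confirms.

```python
def tokens_per_step(batch_size, num_gpus, seq_len, num_microbatches=1):
    """Tokens consumed per step under the stated assumptions."""
    return batch_size * num_gpus * seq_len * num_microbatches


# Values from the configuration above, with a single GPU
per_microbatch = tokens_per_step(batch_size=1, num_gpus=1, seq_len=8192)
# Assumption: each optimizer step accumulates num_microbatches batches
per_step = tokens_per_step(batch_size=1, num_gpus=1, seq_len=8192, num_microbatches=8)
total = per_step * 100  # max_steps

print(per_microbatch)  # 8192
print(per_step)        # 65536
print(total)           # 6553600
```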

## Verify data

To ensure effective training, mistral-finetune has strict requirements for how the training data has to be formatted. Check out the required data formatting [here](https://github.com/mistralai/mistral-finetune/tree/main?tab=readme-ov-file#prepare-dataset).
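
For a quick check that does not print much output, the sketch below inspects only the first lines of a data file. It assumes the pretraining format described in the `mistral-finetune` README: JSON Lines, with each record holding its document under a `"text"` key. Whether the Hansard file follows this layout despite its `.txt` extension is not confirmed here, and the helper name `check_pretrain_jsonl` is hypothetical.

```python
import json
import os


def check_pretrain_jsonl(path, max_lines=1000):
    """Check the first max_lines lines; return (lines_checked, problems)."""
    problems = []
    checked = 0
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if line_no > max_lines:
                break
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            # Assumed requirement: a JSON object with a string "text" field
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                problems.append((line_no, 'missing string "text" field'))
    return checked, problems


data_path = "/content/data/HansardSequences_250k.big.txt"
if os.path.exists(data_path):
    checked, problems = check_pretrain_jsonl(data_path)
    print(f"Checked {checked} lines, found {len(problems)} problems")
    print(problems[:5])
```

This is not a substitute for `utils.validate_data`, which also reports training statistics. It only catches basic formatting problems early.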


In [12]:
# navigate to the mistral-finetune directory
%cd /content/mistral-finetune/

/content/mistral-finetune


In [None]:
# check training data stats (this causes runtime issues on colab - I think because of the output volume?)
# !python -m utils.validate_data --train_yaml /content/config/governmentgpt.yaml

## Start training

In [13]:
# this information is needed for training
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [14]:
# make sure the run_dir has not been created before
# only run this if you ran torchrun previously and it created the /content/governmentgpt directory
! rm -r /content/governmentgpt

rm: cannot remove '/content/governmentgpt': No such file or directory


In [15]:
import torch
torch.cuda.empty_cache()

In [16]:
# start training
%cd /content/mistral-finetune/
!torchrun --nproc-per-node 1 -m train /content/config/governmentgpt.yaml

/content/mistral-finetune
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
2024-06-26 12:32:12.561247: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-26 12:32:12.612767: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-26 12:32:12.612826: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-

## Inference

In [17]:
!pip install mistral_inference

Collecting mistral_inference
  Downloading mistral_inference-1.1.0-py3-none-any.whl (21 kB)
Installing collected packages: mistral_inference
Successfully installed mistral_inference-1.1.0


In [18]:
!nvidia-smi

Wed Jun 26 13:05:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0              51W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [19]:
from mistral_inference.model import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file("/content/mistral_models/7B-v0.3/tokenizer.model.v3")  # change to extracted tokenizer file

# Clear GPU memory first
import torch
torch.cuda.empty_cache()

model = Transformer.from_folder("/content/mistral_models/7B-v0.3")  # change to extracted model dir
model.load_lora("/content/governmentgpt/checkpoints/checkpoint_000100/consolidated/lora.safetensors")
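
The adapter path above is tied to the checkpoint step. The sketch below builds it from the run directory and step number and reports whether the file exists before you try to load it. The zero-padded `checkpoint_{step:06d}` naming is inferred from the path used in this notebook rather than from `mistral-finetune` documentation, and `lora_checkpoint_path` is a hypothetical helper.

```python
from pathlib import Path


def lora_checkpoint_path(run_dir, step):
    """Build the adapter path using the naming seen in this notebook's run."""
    return (
        Path(run_dir)
        / "checkpoints"
        / f"checkpoint_{step:06d}"
        / "consolidated"
        / "lora.safetensors"
    )


lora_path = lora_checkpoint_path("/content/governmentgpt", 100)
print(lora_path, "exists" if lora_path.is_file() else "not found")
```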

In [20]:
content_1 = "Speaker: Labour MP for Durham: Speech transcript: I am deeply concerned about the risk to a dwindling supply of rich tea biscuits that is being reported by the press due to biscuit factory worker strikes. As righteous British citizens we must protect our most important National biscuit identity for our tea breaks. Can the honorable gentleman outline what they intend to do about it?"
content_2 = "Speaker: Conservative MP for Norwich: \n\n Speech transcript: What plans have we to support the biscuit manufacturing industry in the north east?"
content_3 = "Speaker: Labour MP for Ipswich: \n\n Speech transcript: It is clear the finances of this country are in a dire state following their 7 years of power. We need fresh thinking to address the systemic issues. What policies does the Tory government plan to introduce?"

content_4 = "Speaker: Liberal Democrat MP for Northwich: \n\n Speech transcript: Prolonged war at this point seems inevitable in Ukraine. We are in support of supplying weapons for the long term, however we do not agree that we should make endless payments without strong agreement as to what that money is intended for."


content_6 = "Speaker: Labour Democrat MP for Northwich: \n\n Speech transcript: The medical device industry in the UK seems to be in complete disarray, not least due to the uncertain regulatory environment imposed by Brexit. Can the honorable gentleman suggest what he will do to address this issue?"



content_5 = "Speaker: Liberal Democrat MP for Northwich: \n\n Speech transcript: My constituents are writing to me voicing concern about the use of AI to replace their jobs. I share these concerns, not least because I think our role as MPs can be automated by using AI LLMs to debate on our behalf using our known political leanings. "
content_5 = content_5 + "If this is the future - what do my honorable colleagues think will happen to our role as MPs?"


completion_request = ChatCompletionRequest(messages=[UserMessage(content=content_6)])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=2048, temperature=1.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id) # Set temperature to 1 for some creative dialogue
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

Speaker: Conservative MP for Worthing West (additional roles: Member of Scottish Affairs Committee.): 

 Speech transcript: A number of measures have been put in place, and the Government will make an announcement later today, and I look forward to the opportunity of having discussions with my honorable friend on this issue. Speaker: Conservative MP for East Worthing and Shoreham (additional roles: Member of Levelling Up Housing And Communities Committee.): 

 Speech transcript: What discussions he has had with the Secretary of State for Health and Social Care on promoting innovation in mental health services. Speaker: Conservative MP for Worthing West (additional roles: Member of Scottish Affairs Committee.): 

 Speech transcript: Mental health has been an issue I have focused on since I became an MP, so I am particularly pleased that my department is leading the way in driving forward new innovative mental health services in this country. We are pleased to have taken the advice of th
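
The prompts above all follow the same "Speaker / Speech transcript" layout. A small helper, sketched here with the hypothetical name `build_prompt`, keeps that layout consistent when you write new prompts. It mirrors the format of `content_2`. Speaker lines in the generated text also carry an "(additional roles: ...)" suffix, which this sketch leaves out.

```python
def build_prompt(party, constituency, speech):
    """Format one debate turn in the 'Speaker / Speech transcript' layout."""
    return f"Speaker: {party} MP for {constituency}: \n\n Speech transcript: {speech}"


prompt = build_prompt(
    "Conservative",
    "Norwich",
    "What plans have we to support the biscuit manufacturing industry in the north east?",
)
print(prompt)
```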

In [21]:
def format_output(text):
    """Put each speaker and each speech transcript on its own line."""
    text = text.replace('\n', '')
    text = text.replace('Speaker:', '\n\nSpeaker:')
    text = text.replace('Speech transcript:', '\nSpeech transcript:')
    return text

print(format_output(result))



Speaker: Conservative MP for Worthing West (additional roles: Member of Scottish Affairs Committee.):  
Speech transcript: A number of measures have been put in place, and the Government will make an announcement later today, and I look forward to the opportunity of having discussions with my honorable friend on this issue. 

Speaker: Conservative MP for East Worthing and Shoreham (additional roles: Member of Levelling Up Housing And Communities Committee.):  
Speech transcript: What discussions he has had with the Secretary of State for Health and Social Care on promoting innovation in mental health services. 

Speaker: Conservative MP for Worthing West (additional roles: Member of Scottish Affairs Committee.):  
Speech transcript: Mental health has been an issue I have focused on since I became an MP, so I am particularly pleased that my department is leading the way in driving forward new innovative mental health services in this country. We are pleased to have taken the advice of
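
If you want the generated debate as structured data rather than formatted text, one option is to split it into (speaker, speech) pairs with a regular expression. This is a minimal sketch that assumes every turn follows the "Speaker: ...: Speech transcript: ..." layout seen above. A speech that itself contains the text "Speaker:" would be split incorrectly. The name `parse_turns` is hypothetical.

```python
import re

# A speaker label may itself contain colons (e.g. "additional roles: ..."),
# so match lazily up to the colon that directly precedes "Speech transcript:".
TURN_RE = re.compile(
    r"Speaker:\s*(?P<speaker>.*?):\s*Speech transcript:\s*(?P<speech>.*?)(?=Speaker:|\Z)",
    re.DOTALL,
)


def parse_turns(text):
    """Return a list of (speaker, speech) tuples found in the generated text."""
    return [
        (m.group("speaker").strip(), m.group("speech").strip())
        for m in TURN_RE.finditer(text)
    ]


sample = (
    "Speaker: Conservative MP for Worthing West (additional roles: Member of "
    "Scottish Affairs Committee.): \n\n Speech transcript: A number of measures "
    "have been put in place. Speaker: Labour MP for Durham: \n\n Speech "
    "transcript: What measures?"
)
for speaker, speech in parse_turns(sample):
    print(f"{speaker} -> {speech}")
```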

In [22]:
!zip -r /content/governmentgpt.zip /content/governmentgpt

  adding: content/governmentgpt/ (stored 0%)
  adding: content/governmentgpt/args.yaml (deflated 45%)
  adding: content/governmentgpt/checkpoints/ (stored 0%)
  adding: content/governmentgpt/checkpoints/checkpoint_000100/ (stored 0%)
  adding: content/governmentgpt/checkpoints/checkpoint_000100/consolidated/ (stored 0%)
  adding: content/governmentgpt/checkpoints/checkpoint_000100/consolidated/params.json (deflated 49%)
  adding: content/governmentgpt/checkpoints/checkpoint_000100/consolidated/lora.safetensors (deflated 21%)
  adding: content/governmentgpt/checkpoints/checkpoint_000100/consolidated/tokenizer.model.v3 (deflated 61%)
  adding: content/governmentgpt/metrics.train.jsonl (deflated 77%)
  adding: content/governmentgpt/tb/ (stored 0%)
  adding: content/governmentgpt/tb/events.out.tfevents.1719405135.3c99aeb64f5e.6250.1.eval (deflated 8%)
  adding: content/governmentgpt/tb/events.out.tfevents.1719405135.3c99aeb64f5e.6250.0.train (deflated 73%)


In [24]:
from google.colab import drive
drive.mount('/content/gdrive')

ValueError: mount failed

In [23]:
!rm /content/gdrive/MyDrive/governmentgpt.zip
!mv /content/governmentgpt.zip /content/gdrive/MyDrive/

rm: cannot remove '/content/gdrive/MyDrive/governmentgpt.zip': No such file or directory
mv: cannot move '/content/governmentgpt.zip' to '/content/gdrive/MyDrive/': No such file or directory
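
The errors above come from trying to move the archive after the Drive mount failed. A guarded version of the same step is sketched below: it moves the file only when both the archive and the destination directory exist. The helper name `move_if_mounted` is hypothetical, and the paths are the ones used in the cells above.

```python
import shutil
from pathlib import Path


def move_if_mounted(src, dest_dir):
    """Move src into dest_dir only if both exist; return the new path or None."""
    src, dest_dir = Path(src), Path(dest_dir)
    if not src.is_file() or not dest_dir.is_dir():
        return None
    target = dest_dir / src.name
    if target.exists():
        target.unlink()  # replace an older copy, like the `rm` above
    return Path(shutil.move(str(src), str(target)))


moved = move_if_mounted("/content/governmentgpt.zip", "/content/gdrive/MyDrive")
print(moved or "Drive not mounted or archive missing; nothing moved")
```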
