<a href="https://colab.research.google.com/github/waveletdeboshir/whisper-lang-remover/blob/main/whisper_lang_remover.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Jupyter notebook for removing unnecessary languages from Whisper 🤫 🤗


Thanks to https://github.com/avidale for this method, described in https://gist.github.com/avidale/44cd35bfcdaf8bedf51d97c468cc8001

In [None]:
import os
os.environ["HF_HUB_CACHE"] = "./models/"

In [None]:
import torch
from transformers import WhisperProcessor, WhisperTokenizer, WhisperForConditionalGeneration

In [None]:
# Whisper size: tiny, base, small, medium or large-v3
size = "tiny"
new_name = "ru-pruned"

In [None]:
!cd models && git clone https://huggingface.co/openai/whisper-{size}

Cloning into 'whisper-tiny'...
remote: Enumerating objects: 187, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 187 (delta 36), reused 18 (delta 18), pack-reused 133 (from 1)
Receiving objects: 100% (187/187), 3.18 MiB | 11.93 MiB/s, done.
Resolving deltas: 100% (104/104), done.
Filtering content: 100% (4/4), 576.46 MiB | 36.33 MiB/s, done.


In [None]:
# Load initial model and tokenizer
tokenizer = WhisperTokenizer.from_pretrained(f"models/whisper-{size}")
model = WhisperForConditionalGeneration.from_pretrained(f"models/whisper-{size}")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# We can shrink the decoder token embeddings and the last linear layer (proj_out)
model.model.decoder.embed_tokens, model.proj_out

(Embedding(51865, 384, padding_idx=50257),
 Linear(in_features=384, out_features=51865, bias=False))

In [None]:
# Compute the proportion of parameters in each part of the model.
# model.parameters() does not count proj_out separately (its weight is shared with
# the token embeddings), so we add it to the total explicitly
def msize(m):
    return sum(p.numel() for p in m.parameters())

init_size = msize(model) + msize(model.proj_out)
print("Number of parameters:", init_size)
print("Number of emb and proj_out parameters:", msize(model.proj_out) + msize(model.model.decoder.embed_tokens))
print("Proportion of encoder parameters:", msize(model.model.encoder) / init_size)
print("Proportion of decoder parameters:", msize(model.model.decoder) / init_size)
print("Proportion of last linear layer parameters:", msize(model.proj_out) / init_size)
print("Proportion of token embedding parameters:", msize(model.model.decoder.embed_tokens) / init_size)
print("Proportion of token embedding + last layer parameters:", (msize(model.model.decoder.embed_tokens) + msize(model.proj_out))/ init_size)

Number of parameters: 57676800
Number of emb and proj_out parameters: 39832320
Proportion of encoder parameters: 0.14231691078561917
Proportion of decoder parameters: 0.5123768308921438
Proportion of last linear layer parameters: 0.345306258322237
Proportion of token embedding parameters: 0.345306258322237
Proportion of token embedding + last layer parameters: 0.690612516644474


Proportion of token embedding and `proj_out` parameters for each model size:

| model | proportion of token embedding + proj_out parameters |
| ---- | ---- |
| tiny | 0.6906 |
| base | 0.5357 |
| small | 0.2829 |
| medium | 0.1300 |
| large-v3 | 0.0825 |

Removing tokens therefore saves much less in the larger Whisper models than in the smaller ones.
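
The proportions in the table come from simple arithmetic: the token embedding and `proj_out` each hold a `vocab_size × d_model` weight matrix. A minimal sketch that reproduces the `tiny` row, using only the numbers printed above (vocabulary 51865, hidden size 384, 57676800 parameters with `proj_out` counted separately). The helper name `embedding_share` is made up for illustration.

```python
def embedding_share(vocab_size: int, d_model: int, total_params: int) -> float:
    """Share of parameters held by the token embedding plus proj_out."""
    # Both layers have one (vocab_size x d_model) weight matrix and no bias.
    return 2 * vocab_size * d_model / total_params

# Numbers for whisper-tiny, taken from the cell outputs above.
share_tiny = embedding_share(vocab_size=51865, d_model=384, total_params=57_676_800)
print(round(share_tiny, 4))  # 0.6906
```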

## Choice of tokens
We will:
* download a sentence corpus from https://wortschatz.uni-leipzig.de/en/download/Russian
* tokenize it
* count the most common tokens

In [None]:
import pandas as pd
import csv
from collections import Counter
from tqdm.auto import tqdm, trange

In [None]:
# tokenization of sentences
df_ru = pd.read_csv('rus-ru_web-public_2019_1M-sentences.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_ru.columns = ['idx', 'text']
cnt_ru = Counter()
for text in tqdm(df_ru.text):
    cnt_ru.update(tokenizer.encode(text))
    # also tokenize each sentence with a leading space, to capture the tokens
    # that appear when a sentence follows other text
    cnt_ru.update(tokenizer.encode(" " + text))
print("Number of unique tokens:", len(cnt_ru))
print("Proportion of russian tokens to tokenizer vocab_size", len(cnt_ru)/tokenizer.vocab_size)


Number of unique tokens: 19126
Proportion of russian tokens to tokenizer vocab_size 0.3805563293406025


In [None]:
# See what share of all token occurrences the top-N most common tokens cover
for top in 1000, 2500, 3000, 4000, 5000, 7000:
    print(top, sum(v for k, v in cnt_ru.most_common(top)) / sum(cnt_ru.values()))

1000 0.8197274140404213
2500 0.9739719208821588
3000 0.9905188788769316
4000 0.9963932707126129
5000 0.9977663555397532
7000 0.9988577621983334
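
The same numbers can be read the other way round: given a target coverage, how many of the most common tokens are needed? A small sketch on toy counts. The helper `tokens_for_coverage` and the toy `Counter` are illustrative and not part of the notebook; with the real data one would pass `cnt_ru`.

```python
from collections import Counter

def tokens_for_coverage(counts: Counter, target: float) -> int:
    """Smallest N such that the N most common tokens cover `target` of all occurrences."""
    total = sum(counts.values())
    covered = 0
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return n
    return len(counts)

# Toy frequencies standing in for cnt_ru.
toy_counts = Counter({"a": 50, "b": 30, "c": 15, "d": 4, "e": 1})
print(tokens_for_coverage(toy_counts, 0.9))  # 3
```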


I will keep:

* 10 special tokens (no timestamp tokens, and no language tokens other than `<|en|>` and `<|ru|>`)
* the first 200 tokens from the tokenizer
* the 4000 most frequent tokens for Russian

In [None]:
kept_special_tokens = [
    '<|endoftext|>',
    '<|startoftranscript|>',
    '<|en|>',
    '<|ru|>',
    '<|translate|>',
    '<|transcribe|>',
    '<|startoflm|>',
    '<|startofprev|>',
    '<|nocaptions|>',
    '<|notimestamps|>'
]

kept_special_ids = [tokenizer.encode(t, add_special_tokens=False)[0] for t in kept_special_tokens]

In [None]:
# Adding token ids to list of tokens we will keep
# 200 first tokens from tokenizer
new_tokens = set(range(200))

# most popular russian tokens
for i, (k, v) in enumerate(cnt_ru.most_common(5000)):
    if len(new_tokens) == 4200:
        print(i, 'Russian tokens are included')
        break
    if k not in new_tokens:
        new_tokens.add(k)

for t in kept_special_ids:
    new_tokens.add(t)

print("Number of kept tokens", len(new_tokens))
kept_ids = sorted(new_tokens)

4098 Russian tokens are included
Number of kept tokens 4207
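
The total is 4207 rather than 4200 + 10 = 4210. One plausible reading (an assumption, not checked against the corpus counts): `tokenizer.encode` adds `<|startoftranscript|>`, `<|notimestamps|>` and `<|endoftext|>` to every sentence by default, so these three special tokens are already among the 4200 most frequent ids, and only seven of the ten special tokens are new. The set arithmetic, with toy ids:

```python
# Toy stand-ins: ids 0..4199 play the role of the 4200 tokens picked so far.
picked = set(range(4200))

# Hypothetical ids: three special tokens that are already inside `picked`,
# and seven special tokens that are not.
specials_already_picked = {10, 11, 12}
specials_new = set(range(5000, 5007))

kept = picked | specials_already_picked | specials_new
print(len(kept))  # 4207
```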


In [None]:
# Check that all Russian and English letters are included (prints the id of any missing token)

letters_tokens = []
for s in "абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
    letters_tokens += tokenizer.encode(s, add_special_tokens=False)

for t in list(set(letters_tokens)):
    if t not in kept_ids:
        print(t)

You can also check the file `kept_tokens_ru_no_ts.txt` in the repo, where I saved all kept ids.

## Update model weights

In [None]:
new_size = len(kept_ids)

# New embedding layer
new_emb = torch.nn.Embedding(
    new_size,
    model.model.decoder.embed_tokens.embedding_dim,
    padding_idx=kept_ids.index(50257)  # new idx of <|endoftext|> token
)

# New proj_out layer
new_head = torch.nn.Linear(
    in_features=model.proj_out.in_features,
    out_features=new_size,
    bias=False
)

# Copying weights
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.model.decoder.embed_tokens.weight.data[old_id]
    new_head.weight.data[new_id] = model.proj_out.weight.data[old_id]

# Change layers in model
model.model.decoder.embed_tokens = new_emb
model.proj_out = new_head
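
To make the copy step concrete: row `new_id` of each new layer is a copy of row `kept_ids[new_id]` of the old one. A toy sketch with plain lists standing in for the weight matrices (the names and ids are made up). With real tensors, indexing the weight with the list of ids (`weight[kept_ids]`) should select the same rows in one step.

```python
# Toy "weight matrix": row i is the vector for old token id i.
old_weight = [[float(i), float(i) * 10] for i in range(6)]

# Hypothetical sorted list of old ids to keep.
kept_ids_toy = [0, 2, 5]

# New row `new_id` is a copy of old row `kept_ids_toy[new_id]`.
new_weight = [list(old_weight[old_id]) for old_id in kept_ids_toy]

print(new_weight)  # [[0.0, 0.0], [2.0, 20.0], [5.0, 50.0]]
```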

### Change model config and generation config

We need to renumber the token ids stored in the configs so that they match the new ids.

In [None]:
# Change model config

model.config.__dict__['vocab_size'] = new_size
model.config.__dict__['_name_or_path'] = f'waveletdeboshir/whisper-{size}-{new_name}'



model.config.__dict__["bos_token_id"] = kept_ids.index(model.config.__dict__["bos_token_id"])
model.config.__dict__["decoder_start_token_id"] = kept_ids.index(model.config.__dict__["decoder_start_token_id"])
model.config.__dict__["eos_token_id"] = kept_ids.index(model.config.__dict__["eos_token_id"])
model.config.__dict__["pad_token_id"] = kept_ids.index(model.config.__dict__["pad_token_id"])
model.config.__dict__["suppress_tokens"] = []
model.config.__dict__["forced_decoder_ids"] = [
    [
      1,
      kept_ids.index(50263) # <|ru|>
    ],
    [
      2,
      kept_ids.index(50359) # <|transcribe|>
    ],
    [
      3,
      kept_ids.index(50363) # <|notimestamps|>
    ]
]

beg_sup = []
for t in model.config.__dict__['begin_suppress_tokens']:
    if t in kept_ids:
        beg_sup.append(kept_ids.index(t))
model.config.__dict__['begin_suppress_tokens'] = beg_sup

In [None]:
# Change generation config

beg_sup = []
for t in model.generation_config.__dict__['begin_suppress_tokens']:
    if t in kept_ids:
        beg_sup.append(kept_ids.index(t))
model.generation_config.__dict__['begin_suppress_tokens'] = beg_sup

model.generation_config.__dict__["bos_token_id"] = kept_ids.index(model.generation_config.__dict__["bos_token_id"])
model.generation_config.__dict__["decoder_start_token_id"] = kept_ids.index(model.generation_config.__dict__["decoder_start_token_id"])
model.generation_config.__dict__["eos_token_id"] = kept_ids.index(model.generation_config.__dict__["eos_token_id"])
model.generation_config.__dict__["forced_decoder_ids"] = [
    [
      1,
      None
    ],
    [
      2,
      kept_ids.index(50359)
    ]
  ]

new_lang_to_id = {}
for key, value in model.generation_config.__dict__["lang_to_id"].items():
    if value in kept_ids:
        new_lang_to_id[key] = kept_ids.index(value)
model.generation_config.__dict__["lang_to_id"] = new_lang_to_id

model.generation_config.__dict__["no_timestamps_token_id"] = kept_ids.index(model.generation_config.__dict__["no_timestamps_token_id"])
model.generation_config.__dict__["pad_token_id"] = kept_ids.index(model.generation_config.__dict__["pad_token_id"])
model.generation_config.__dict__["prev_sot_token_id"] = kept_ids.index(model.generation_config.__dict__["prev_sot_token_id"])
model.generation_config.__dict__["suppress_tokens"] = []
model.generation_config.__dict__["task_to_id"] = {
    key: kept_ids.index(value) for key, value in model.generation_config.__dict__["task_to_id"].items()
    }
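
The cells above call `kept_ids.index(...)` many times, and each call is a linear scan of the list. Since `kept_ids` is sorted and has no duplicates, a dictionary built once gives the same old-to-new mapping. A sketch with toy ids; `build_id_map` and `remap_ids` are illustrative helpers, not part of the notebook.

```python
def build_id_map(kept_ids):
    """Map each old token id to its position in the sorted list of kept ids."""
    return {old_id: new_id for new_id, old_id in enumerate(kept_ids)}

def remap_ids(ids, id_map):
    """Translate old ids to new ids, dropping ids that were not kept."""
    return [id_map[i] for i in ids if i in id_map]

# Toy example with a hypothetical list of kept ids.
kept_ids_toy = [0, 3, 7, 50257]
id_map = build_id_map(kept_ids_toy)
print(id_map[50257])                     # 3
print(remap_ids([7, 8, 50257], id_map))  # [2, 3]
```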

### Save pretrained model

In [None]:
model.save_pretrained(f"models/whisper-{size}-{new_name}")

Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [200, 4197]}


# Change tokenizer

First, copy all tokenizer files to a separate folder, `models/tokenizer`.

Then create a new folder to hold the modified tokenizer.

In [None]:
import json

In [None]:
target_folder = "ru-tokenizer-nots"

In [None]:
!mkdir ./models/{target_folder}

!mkdir ./models/tokenizer
!cp ./models/whisper-{size}/added_tokens.json ./models/tokenizer/
!cp ./models/whisper-{size}/merges.txt ./models/tokenizer/
!cp ./models/whisper-{size}/special_tokens_map.json ./models/tokenizer/
!cp ./models/whisper-{size}/tokenizer.json ./models/tokenizer/
!cp ./models/whisper-{size}/tokenizer_config.json ./models/tokenizer/
!cp ./models/whisper-{size}/vocab.json ./models/tokenizer/

Now we will change the token ids in every file.

In [None]:
# Added tokens
with open("./models/tokenizer/added_tokens.json", "r") as f:
    added_tokens = json.load(f)

ch_added_tokens = {}
for key, value in added_tokens.items():
    if value in kept_ids:
        ch_added_tokens[key] = kept_ids.index(value)

with open(f"./models/{target_folder}/added_tokens.json", "w") as f:
    json.dump(ch_added_tokens, f, indent=4)

In [None]:
list(ch_added_tokens.keys())

['<|en|>',
 '<|nocaptions|>',
 '<|notimestamps|>',
 '<|ru|>',
 '<|startoflm|>',
 '<|startofprev|>',
 '<|startoftranscript|>',
 '<|transcribe|>',
 '<|translate|>']

In [None]:
# Special tokens map
with open("./models/tokenizer/special_tokens_map.json", "r") as f:
    special_tokens_map = json.load(f)

special_tokens_map["additional_special_tokens"] = ["<|endoftext|>"] + list(ch_added_tokens.keys())
with open(f"./models/{target_folder}/special_tokens_map.json", "w") as f:
    json.dump(special_tokens_map, f, indent=4)

In [None]:
# Tokenizer config
with open("./models/tokenizer/tokenizer_config.json", "r") as f:
    tok_config = json.load(f)


ch_added_tokens_decoder = {}
for key, value in tok_config["added_tokens_decoder"].items():
    if int(key) in kept_ids:
        ch_added_tokens_decoder[str(kept_ids.index(int(key)))] = value

tok_config["added_tokens_decoder"] = ch_added_tokens_decoder
tok_config["additional_special_tokens"] = ["<|endoftext|>"] + list(ch_added_tokens.keys())

with open(f"./models/{target_folder}/tokenizer_config.json", "w") as f:
    json.dump(tok_config, f, indent=4)

In [None]:
# Tokenizer
with open("./models/tokenizer/tokenizer.json", "r") as f:
    tok = json.load(f)

# change added tokens
ch_added_tokens = []
for t in tok["added_tokens"]:
    if t["id"] in kept_ids:
        t["id"] = kept_ids.index(t["id"])
        ch_added_tokens.append(t)

tok["added_tokens"] = ch_added_tokens

# change vocab
ch_vocab = {}
for key, value in tok["model"]["vocab"].items():
    if value in kept_ids:
        ch_vocab[key] = kept_ids.index(value)

tok["model"]["vocab"] = ch_vocab

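# change post processor
ch_post = {}
for key, value in tok["post_processor"]["special_tokens"].items():
    if value["ids"][0] in kept_ids:
        value["ids"][0] = kept_ids.index(value["ids"][0])
        ch_post[key] = value

# store the remapped special tokens back into the tokenizer definition
tok["post_processor"]["special_tokens"] = ch_post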

with open(f"./models/{target_folder}/tokenizer.json", "w") as f:
    json.dump(tok, f, indent=4, ensure_ascii=True)

In [None]:
# Vocab
with open(f"./models/{target_folder}/vocab.json", "w") as f:
    json.dump(ch_vocab, f, indent=4, ensure_ascii=True)
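
After renumbering, the ids in the vocab plus the added tokens should form a contiguous range `0..new_size-1`, one per row of the new embedding. A small sanity check, sketched on toy data. It assumes that the vocab mapping and the added-tokens mapping do not overlap; the helper `check_contiguous` and the toy tokens are illustrative.

```python
def check_contiguous(vocab: dict, added: dict) -> list:
    """Return the ids in 0..N-1 that no token maps to (an empty list means OK)."""
    ids = list(vocab.values()) + list(added.values())
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate token ids")
    return sorted(set(range(len(ids))) - set(ids))

# Toy mappings with hypothetical tokens and ids.
toy_vocab = {"a": 0, "b": 1, "<|endoftext|>": 2}
toy_added = {"<|startoftranscript|>": 3, "<|ru|>": 4}
print(check_contiguous(toy_vocab, toy_added))      # []
print(check_contiguous(toy_vocab, {"<|ru|>": 5}))  # [3]
```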


Merges file: keep only the merges that are still possible with the pruned vocabulary.

In [None]:
with open("./models/tokenizer/merges.txt", "r") as f:
    merges = f.read().split("\n")

In [None]:
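# Drop a merge if one of its parts is missing from the pruned vocab (or was itself
# the result of a dropped merge), unless the merged token itself is kept.
not_found = []
not_found_merged_tokens = set()
found = []

for merge in merges[1:-1]:  # skip the first (header) line and the trailing empty line
    left, right = merge.split()
    parts_missing = (
        left not in ch_vocab or right not in ch_vocab
        or left in not_found_merged_tokens or right in not_found_merged_tokens
    )
    if parts_missing and (left + right) not in ch_vocab:
        not_found.append(merge)
        not_found_merged_tokens.add(left + right)
    else:
        found.append(merge)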

In [None]:
len(found)

13299
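
To see what the filter does, here is the same rule applied to a toy vocabulary and a toy merge list (both hypothetical, not taken from Whisper). The function `filter_merges` is a sketch of the rule as described above, not a drop-in replacement for the cell.

```python
def filter_merges(merges, vocab):
    """Keep merges whose parts survive pruning, or whose merged token is kept."""
    kept, dropped_tokens = [], set()
    for merge in merges:
        left, right = merge.split()
        parts_missing = (
            left not in vocab or right not in vocab
            or left in dropped_tokens or right in dropped_tokens
        )
        if parts_missing and left + right not in vocab:
            dropped_tokens.add(left + right)
        else:
            kept.append(merge)
    return kept

# "c d" is dropped because "d" is not in the vocab; "cd a" then goes with it.
toy_vocab = {"a", "b", "c", "ab", "abc"}
toy_merges = ["a b", "ab c", "c d", "cd a"]
print(filter_merges(toy_merges, toy_vocab))  # ['a b', 'ab c']
```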

In [None]:
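# Keep the header line and the trailing newline so the file has the same
# layout as the original merges.txt
with open(f"./models/{target_folder}/merges.txt", "w") as f:
    f.write("\n".join([merges[0]] + found) + "\n")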

In [None]:
# Load changed tokenizer from folder
changed_tok = WhisperTokenizer.from_pretrained(f"./models/{target_folder}/", local_files_only=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Check if it works
changed_tok.decode(changed_tok.encode(" Хеллоу, что за странные словечечки"))

'<|startoftranscript|><|notimestamps|> Хеллоу, что за странные словечечки<|endoftext|>'

# Try new model

In [None]:
# Copy the new tokenizer files, plus the normalizer file and
# the preprocessor config from the original model
!cp ./models/{target_folder}/* ./models/whisper-{size}-{new_name}/
!cp ./models/whisper-{size}/normalizer.json ./models/whisper-{size}-{new_name}/
!cp ./models/whisper-{size}/preprocessor_config.json ./models/whisper-{size}-{new_name}/

In [None]:
# Load new model, processor and tokenizer from folder

tokenizer = WhisperTokenizer.from_pretrained(f"./models/whisper-{size}-{new_name}/", local_files_only=True)
model = WhisperForConditionalGeneration.from_pretrained(f"./models/whisper-{size}-{new_name}", local_files_only=True)
preprocessor = WhisperProcessor.from_pretrained(f"./models/whisper-{size}-{new_name}", local_files_only=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Check new model size
print("New model size", msize(model) + msize(model.proj_out))
print("Ratio of new size to initial size", (msize(model) + msize(model.proj_out)) / init_size)

New model size 21075456
Ratio of new size to initial size 0.3654061251664447
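
The new size can be checked by hand: each removed token drops one row from the token embedding and one row from `proj_out`. A short calculation using only numbers printed earlier in the notebook (vocabulary 51865 reduced to 4207, hidden size 384):

```python
# Numbers taken from the cell outputs above (whisper-tiny).
init_size = 57_676_800  # parameters, with proj_out counted separately
old_vocab, new_vocab = 51_865, 4_207
d_model = 384

# Each removed token drops one row from the embedding and one from proj_out.
removed = 2 * (old_vocab - new_vocab) * d_model
predicted = init_size - removed
print(predicted)                        # 21075456
print(round(predicted / init_size, 4))  # 0.3654
```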


Check that everything works on a test file.

In [None]:
import torchaudio

In [55]:
# Download via the /raw/ URL: the /blob/ URL returns GitHub's HTML page, not the audio file
!wget -O audio.mp3 https://github.com/waveletdeboshir/whisper-lang-remover/raw/a3414fdb309393c43f93931b10087c2b7ece5fa0/audio.mp3



In [None]:
wav, sr = torchaudio.load("audio.mp3")

if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

processed = preprocessor(wav[0], sampling_rate=16000, return_tensors="pt")

predicted_ids = model.generate(processed.input_features)

transcriptions = preprocessor.batch_decode(predicted_ids, skip_special_tokens=False)

print(transcriptions)

['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Закон больших и малых чисел один.<|endoftext|>']
