https://gist.github.com/avidale/44cd35bfcdaf8bedf51d97c468cc8001

https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

In [None]:
!pip install datasets evaluate accelerate bitsandbytes SentencePiece

In [1]:
import bitsandbytes
import accelerate
import sentencepiece

from transformers import  LlamaForSequenceClassification


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...


  warn("The installed version of bitsandbytes was compiled without GPU support. "


In [2]:
model = LlamaForSequenceClassification.from_pretrained(
    "/content/drive/MyDrive/ds/llama_7b_hf",
    low_cpu_mem_usage = True
)

Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

Some weights of the model checkpoint at /content/drive/MyDrive/ds/llama_7b_hf were not used when initializing LlamaForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing LlamaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /content/drive/MyDrive/ds/llama_7b_hf and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("/content/drive/MyDrive/ds/llama_7b_hf")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


The goal is to shrink the vocabulary so that mostly Russian tokens remain.

In [3]:
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(
    "/content/drive/MyDrive/ds/llama_7b_hf",
    low_cpu_mem_usage = True
)

Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [4]:
print(tokenizer.vocab_size)

32000


In [8]:
def msize(m):
    return sum(p.numel() for p in m.parameters())

original_size = msize(model)
print('number of parameters:', msize(model))
print('number parameters for input embeddings', msize(model.model.embed_tokens))
print('number parameters for output embeddings', msize(model.lm_head))

number of parameters: 6738415616
number parameters for input embeddings 131072000
number parameters for output embeddings 131072000


LLaMA 7B has about 6.7B parameters.

In [11]:
print(msize(model.model.embed_tokens)/msize(model) * 100)
print(msize(model.lm_head)/msize(model) *100)

1.945145676214698
1.945145676214698


Input and output embeddings are about 1.95% of the model each, so together they are just under 4% of the whole model.
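
The embedding sizes above follow directly from the model dimensions. A minimal sketch of the arithmetic, assuming the 32000 x 4096 embedding shape that the notebook prints later for `model.model.embed_tokens`; the variable names are illustrative only:

```python
# Values taken from the notebook outputs.
vocab_size = 32000          # tokenizer.vocab_size
hidden_size = 4096          # embedding width, from Embedding(32000, 4096)
total_params = 6738415616   # msize(model)

# One embedding matrix has one row of hidden_size weights per token.
embedding_params = vocab_size * hidden_size
share_each = embedding_params / total_params * 100
share_both = 2 * share_each  # input embeddings + output head

print(embedding_params)                            # 131072000
print(round(share_each, 2), round(share_both, 2))  # 1.95 3.89
```

This also gives an upper bound: even dropping every token would remove less than 4% of the parameters.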

We want to keep only the tokens that are needed for Russian, so we first count how often each token occurs in a Russian corpus.

In [14]:
import pandas as pd
import csv
from collections import Counter
from tqdm.auto import tqdm, trange
df_ru = pd.read_csv('/content/drive/MyDrive/ds/rus/rus-ru_web-public_2019_1M/rus-ru_web-public_2019_1M-sentences.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_ru.columns = ['idx', 'text']

In [15]:
df_ru.head()

   idx                                               text
0    1  №04 (920) / Политика и экономика / Что почем /...
1    2  № 0837 из ГУК НКО от 12.08.43 г. В 1939 году б...
2    3  ¾ показатели инвестиционной эффективности (NCF...
3    4  "0ппидум" означал первоначально укрепленное ме...
4    5  … 0-трихолог >… а пролактин влияет на… Гормоны...


In [16]:
cnt_ru = Counter()
for text in tqdm(df_ru.text):
    cnt_ru.update(tokenizer.encode(text))
print(len(cnt_ru), len(cnt_ru)/tokenizer.vocab_size)  

  0%|          | 0/1000000 [00:00<?, ?it/s]

16502 0.5156875


In [22]:
df_eng = pd.read_csv('/content/drive/MyDrive/ds/rus/eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_eng.columns = ['idx', 'text']

In [23]:
cnt_eng = Counter()
for text in tqdm(df_eng.text):
    cnt_eng.update(tokenizer.encode(text))
print(len(cnt_eng), len(cnt_eng)/tokenizer.vocab_size)  

  0%|          | 0/1000000 [00:00<?, ?it/s]

23142 0.7231875


In [25]:
common = len(set(cnt_ru.keys()).intersection(set(cnt_eng.keys())))
print('intersection of eng and rus',  common / len(cnt_ru) * 100)

intersection of eng and rus 82.40213307477882


The tokens that ever occur in the Russian corpus make up about 52% of the whole vocabulary. For the English corpus, it is 72%. About 82% of the tokens used for Russian also occur in the English corpus: punctuation, etc.
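
To make the two measures explicit, here is a small sketch on toy data. The helper names `vocab_usage` and `overlap_share` and the toy counts are hypothetical and not part of the notebook. Note that the overlap is measured relative to the first corpus, so it is not symmetric.

```python
from collections import Counter

def vocab_usage(counter, vocab_size):
    """Fraction of the vocabulary that occurs at least once in the corpus."""
    return len(counter) / vocab_size

def overlap_share(counter_a, counter_b):
    """Share of the tokens seen in corpus A that are also seen in corpus B."""
    shared = counter_a.keys() & counter_b.keys()
    return len(shared) / len(counter_a)

# Toy token-id counts (hypothetical, not taken from the real corpora).
toy_ru = Counter({1: 50, 2: 30, 7: 5, 9: 1})
toy_en = Counter({1: 60, 2: 10, 3: 40, 4: 8, 5: 2})

print(vocab_usage(toy_ru, vocab_size=10))  # 0.4
print(overlap_share(toy_ru, toy_en))       # 0.5
print(overlap_share(toy_en, toy_ru))       # 0.4
```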

In [27]:
print('ru')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_ru.most_common(top)) / sum(cnt_ru.values()))
print('en')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_eng.most_common(top)) / sum(cnt_eng.values()))

ru
10000 0.9997074747659401
20000 1.0
30000 1.0
en
10000 0.9746541981339409
20000 0.9997435612599074
30000 1.0


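For Russian, the top 10K tokens cover more than 99.9% of all token occurrences in the corpus. For English, the top 10K cover about 97.5%, and the top 20K are needed to get above 99.9%.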
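
The coverage figure is the share of token occurrences, not the share of distinct tokens. A sketch of the same computation as a reusable helper, run on toy counts; the name `coverage` and the data are illustrative assumptions:

```python
from collections import Counter

def coverage(counter, top):
    """Share of all token occurrences covered by the `top` most frequent tokens."""
    total = sum(counter.values())
    covered = sum(count for _, count in counter.most_common(top))
    return covered / total

# Toy counts (hypothetical): one very frequent token and a long tail.
toy_counts = Counter({"a": 90, "b": 6, "c": 3, "d": 1})
for top in (1, 2, 4):
    print(top, coverage(toy_counts, top))
```

With a skewed distribution like this one, a small number of tokens already covers most of the text, which is why a heavily pruned vocabulary can still encode nearly all of the corpus.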

In [28]:
old_voc = tokenizer.get_vocab()
old_inv_voc = {v: k for k, v in old_voc.items()}

In [30]:
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_ru.most_common(30)]))
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_eng.most_common(30)]))

[',', '<s>', '.', '▁в', '▁и', '▁на', '▁с', '▁не', 'ть', 'т', '▁по', 'м', '▁', 'е', 'ли', 'л', '▁о', '▁у', '▁за', 'й', 'но', '-', 'с', 'я', '▁при', '▁что', 'ла', 'то', 'та', '▁от']
['<s>', '.', ',', '▁the', '▁to', '▁and', '▁a', '▁of', '▁in', 's', '▁is', '▁I', '’', '▁that', '▁for', '-', '▁', '▁you', "'", '▁it', '▁with', '▁on', '0', 'ing', '▁be', '1', '▁as', '▁are', '▁The', '▁was']


The most frequent tokens are mostly punctuation, short word-initial tokens (marked with `▁`), and common word endings.

We try the following composition of vocabulary:

- The first 1K token ids of the original tokenizer (just in case)
- Top 10K of the English vocabulary
- Top 10K of the Russian vocabulary

In [31]:
new_tokens = set(range(1000))  # first 1K ids of the original vocabulary
for k, v in cnt_eng.most_common(10_000):
    new_tokens.add(k)
for k, v in cnt_ru.most_common(10_000):
    new_tokens.add(k)

In [42]:
print(len(new_tokens))
kept_ids = sorted(new_tokens)

16108


In [43]:
len(kept_ids) / tokenizer.vocab_size

0.503375

The new vocabulary is 50% of the original one.
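
Because `kept_ids` is sorted, the new id of a kept token is simply its position in that list. The notebook does not show how token ids are translated afterwards, so the following is only a sketch under assumptions: the helper names are hypothetical, and dropped tokens are mapped to an unknown-token id that is assumed to be 0 in the new vocabulary.

```python
def make_id_mapping(kept_ids):
    """Map each kept old token id to its new id (its position in kept_ids)."""
    return {old_id: new_id for new_id, old_id in enumerate(kept_ids)}

def remap(token_ids, old_to_new, unk_new_id=0):
    """Translate old ids to new ids; dropped tokens fall back to unk_new_id (assumption)."""
    return [old_to_new.get(t, unk_new_id) for t in token_ids]

# Toy stand-in for kept_ids (hypothetical values).
toy_kept_ids = [0, 1, 2, 5, 9]
old_to_new = make_id_mapping(toy_kept_ids)

print(remap([1, 5, 9, 7], old_to_new))  # [1, 3, 4, 0]
```

The same old-to-new order is what the embedding copy relies on: row `new_id` of the new matrix holds the vector that used to live at row `old_id`.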

Now we update embeddings.

In [44]:
import torch

In [52]:
model.model.embed_tokens

Embedding(32000, 4096, padding_idx=31999)

In [53]:
new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, 4096)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)

In [55]:
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.model.embed_tokens.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]

In [56]:
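# Replace the modules, not just the weights, so padding_idx and the layer sizes are not stale.
model.model.embed_tokens = new_emb
model.lm_head = new_head
model.config.vocab_size = new_size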

In [57]:
print(msize(model), msize(model) / original_size)

6608228352 0.9806798405709976


We reduced the size of the model by only about 2%, because the embeddings made up less than 4% of it to begin with.
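
The reported size can be checked by hand: each removed token drops one row from the input embeddings and one row from the output head. A short sketch of that arithmetic, using only numbers printed in the notebook; the variable names are illustrative:

```python
hidden_size = 4096            # embedding width, from Embedding(32000, 4096)
old_vocab, new_vocab = 32000, 16108
original_params = 6738415616  # msize(model) before pruning

removed_rows = old_vocab - new_vocab
# Two matrices shrink: model.model.embed_tokens and model.lm_head.
saved_params = 2 * removed_rows * hidden_size
expected_params = original_params - saved_params

print(saved_params)                                        # 130187264
print(expected_params, expected_params / original_params)
```

The expected count, 6608228352, matches the value printed by `msize(model)` after the update.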