# Transfer learning for sentence embeddings

SBERT can be used for information retrieval, clustering, automatic essay scoring, and semantic textual similarity, and it is both fast and accurate. However, the original SBERT models only support English and leave other languages uncovered. To address this, we can use a model architecture similar to Siamese and triplet network structures to extend SBERT to new languages [1](https://arxiv.org/abs/2004.09813).

# Multilingual-Models

The idea is based on a fixed (monolingual) teacher model that produces sentence embeddings with our desired properties in one language. The student model is supposed to mimic the teacher model, i.e., the same English sentence should be mapped to the same vector by the teacher and by the student model. So that the student model also works for further languages, we train it on parallel (translated) sentences. The translation of each sentence should also be mapped to the same vector as the original sentence.
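
The objective described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the library's implementation: the function names `mse` and `distillation_loss` are hypothetical, the vectors are made-up toy values, and summing the two terms is one possible way to combine them.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def distillation_loss(teacher_vec, student_src_vec, student_trg_vec):
    """Loss for one (source sentence, translation) pair.

    teacher_vec:      teacher embedding of the source (English) sentence
    student_src_vec:  student embedding of the same source sentence
    student_trg_vec:  student embedding of the translated sentence
    Both student embeddings are pulled toward the teacher's vector.
    """
    return mse(teacher_vec, student_src_vec) + mse(teacher_vec, student_trg_vec)

# Toy 3-dimensional embeddings (made-up numbers, for illustration only)
teacher = [1.0, 0.0, 2.0]
student_en = [1.0, 0.0, 2.0]   # student already matches the teacher on English
student_de = [0.0, 0.0, 2.0]   # student is still off on the German translation

loss = distillation_loss(teacher, student_en, student_de)
print(loss)
```

The loss is zero only when the student maps both the source sentence and its translation onto the teacher's vector, which is the property the training below tries to reach.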

# Installing dependencies

In [1]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Building wheels for collected p

# Import libraries

In [2]:
from sentence_transformers import SentenceTransformer, LoggingHandler, models, evaluation, losses
from torch.utils.data import DataLoader
from sentence_transformers.datasets import ParallelSentencesDataset

import os
import sentence_transformers.util
import csv
import gzip
import zipfile   # needed to read STS2017-extended.zip further below
from tqdm.autonotebook import tqdm
import numpy as np
import io


# Defining Parameters

In [3]:
# Monolingual teacher model that we want to extend to multiple languages
teacher_model_name = 'paraphrase-distilroberta-base-v2'
# Multilingual base model that is trained to imitate the teacher model
student_model_name = 'xlm-roberta-base'

max_seq_length = 128                # Student model max. lengths for inputs (number of word pieces)
train_batch_size = 64               # Batch size for training
inference_batch_size = 64           # Batch size at inference
train_max_sentence_length = 200     # Maximum length (characters) for parallel training sentences

# Maximum number of  parallel sentences for training.
# NOTE: too high and it will increase the training time
max_sentences_per_language = 100000 

num_epochs = 10            
num_warmup_steps = 1000 

num_evaluation_steps = 1000 
#Number of parallel sentences to be used for development
dev_sentences = 1000 


# Define the language codes you would like to extend the model to
source_languages = set(['en'])                      # Our teacher model accepts English (en) sentences

# We want to extend the model to these new languages.
target_languages = set(['de', 'it'])    

output_path = "."

# Here we define the train and dev corpora
train_corpus = "datasets/ted2020.tsv.gz"         # Transcripts of TED talks, crawled 2020
sts_corpus = "datasets/STS2017-extended.zip"     # Extended STS2017 dataset for more languages
parallel_sentences_folder = "parallel-sentences/"

In [4]:
def download_corpora(filepaths):
    """Download each corpus file that does not exist locally yet.

        Args:
        - filepaths: path or list of paths of the corpora to download
    """

    if not isinstance(filepaths, list):
        filepaths = [filepaths]

    for filepath in filepaths:
        if not os.path.exists(filepath):
            print(filepath, "does not exists. Try to download from server")
            filename = os.path.basename(filepath)
            url = "https://sbert.net/datasets/" + filename
            sentence_transformers.util.http_get(url, filepath)

# Create dataset from source

As training data we require parallel sentences, i.e., sentences translated into various languages. The data format is a tab-separated .tsv file. The first column contains the source sentence, for example an English sentence. The following columns contain translations of this source sentence. If you have multiple translations per source sentence, you can put them on the same line or on different lines.

```
Source_sentence Target_lang1    Target_lang2    Target_lang3
Source_sentence Target_lang1    Target_lang2
```
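
As a small illustration of this format, the sketch below writes two lines to a gzipped tab-separated file and reads them back as (source, translation) pairs. The file name `parallel-example.tsv.gz` and the example sentences are made up for the demonstration; only the Python standard library is used.

```python
import csv
import gzip
import os
import tempfile

rows = [
    ["Hello world", "Hallo Welt", "Ciao mondo"],  # source + two translations
    ["Thank you", "Danke"],                       # source + one translation
]

# Hypothetical file in a temporary folder, used only for this example
path = os.path.join(tempfile.mkdtemp(), "parallel-example.tsv.gz")

# Write one source sentence per line, followed by its translations
with gzip.open(path, "wt", encoding="utf8", newline="") as f_out:
    writer = csv.writer(f_out, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer.writerows(rows)

# Read it back: column 0 is the source, the remaining columns are translations
pairs = []
with gzip.open(path, "rt", encoding="utf8", newline="") as f_in:
    for columns in csv.reader(f_in, delimiter="\t", quoting=csv.QUOTE_NONE):
        source, translations = columns[0], columns[1:]
        for translation in translations:
            pairs.append((source, translation))

print(pairs)
```

Each pair produced this way is one training example: the source sentence and one of its translations.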


In this case we download the TED2020 corpus, a corpus with transcripts and translations from TED and TEDx talks, and use it to extend a monolingual English model to the target languages configured above (de and it). TED2020 contains parallel data for more than 100 languages, so you can simply change the target_languages setting and train a multilingual model for other languages.

NOTE: The more languages you add, the larger the training set becomes, and the longer training takes.

In [5]:
# Check if the files exist. If not, they are downloaded
download_corpora([train_corpus, sts_corpus])

# Create parallel files for the selected language combinations
os.makedirs(parallel_sentences_folder, exist_ok=True)

train_files = []
dev_files = []
files_to_create = []

for source_lang in source_languages:
    for target_lang in target_languages:
        output_filename_train = os.path.join(parallel_sentences_folder, f"{source_lang}-{target_lang}-train.tsv.gz")
        output_filename_dev = os.path.join(parallel_sentences_folder, f"{source_lang}-{target_lang}-dev.tsv.gz")
        train_files.append(output_filename_train)
        dev_files.append(output_filename_dev)
        
        if not os.path.exists(output_filename_train) or not os.path.exists(output_filename_dev):
            files_to_create.append({'src_lang': source_lang, 'trg_lang': target_lang,
                                    'fTrain': gzip.open(output_filename_train, 'wt', encoding='utf8'),
                                    'fDev': gzip.open(output_filename_dev, 'wt', encoding='utf8'),
                                    'devCount': 0})

if len(files_to_create) > 0:
    print(f"Parallel sentences files {', '.join(map(lambda x: x['src_lang']+'-'+x['trg_lang'], files_to_create))} do not exist. Create these files now")
    with gzip.open(train_corpus, 'rt', encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
        i = 0
        for line in tqdm(reader, desc="Sentences"):
            for outfile in files_to_create:
                src_text = line[outfile['src_lang']].strip()
                trg_text = line[outfile['trg_lang']].strip()

                if src_text != "" and trg_text != "":
                    if outfile['devCount'] < dev_sentences:
                        outfile['devCount'] += 1
                        fOut = outfile['fDev']
                    else:
                        fOut = outfile['fTrain']

                    fOut.write(f"{src_text}\t{trg_text}\n")
            i = i+1

    for outfile in files_to_create:
        outfile['fTrain'].close()
        outfile['fDev'].close()

datasets/ted2020.tsv.gz does not exists. Try to download from server



datasets/STS2017-extended.zip does not exists. Try to download from server



Parallel sentences files en-it, en-de do not exist. Create these files now



## Start the extension of the teacher model to multiple languages

In [6]:
teacher_model = SentenceTransformer(teacher_model_name)

word_embedding_model = models.Transformer(student_model_name, max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
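
The pooling module created in the cell above turns a variable number of token vectors into one fixed-size sentence vector. A minimal plain-Python sketch of mean pooling follows, assuming padding positions are marked with 0 in an attention mask. The function name `mean_pooling` and the toy values are illustrative; the library itself works on batched tensors.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # skip padding tokens (mask value 0)
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    # max(count, 1) avoids a division by zero for an all-padding input
    return [s / max(count, 1) for s in summed]

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is a padding token
mask = [1, 1, 0]
sentence_vector = mean_pooling(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the output always has the dimension of a single token vector, sentences of different lengths become directly comparable.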



# Loading Training Datasets

In [7]:
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model, batch_size=inference_batch_size, use_embedding_cache=True)

# Load each file created
for train_file in train_files:
    train_data.load_data(train_file, max_sentences=max_sentences_per_language, max_sentence_length=train_max_sentence_length)

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MSELoss(model=student_model)

# Evaluate cross-lingual performance on different tasks

- MSE: You can measure the mean squared error (MSE) between the student embeddings and the teacher embeddings. This evaluator computes the teacher embeddings for the src_sentences, for example, for English. During training, the student model is used to compute embeddings for the trg_sentences, for example, for German. The distance between teacher and student embeddings is measured. Lower scores indicate better performance.

- Translation Accuracy: You can also measure the translation accuracy. Given a list of source sentences, for example, 1000 English sentences, and a list of matching target (translated) sentences, for example, 1000 German sentences, we check for each sentence pair whether their embeddings are the closest using cosine similarity. That is, for each src_sentences[i] we check whether trg_sentences[i] has the highest similarity out of all target sentences. If so, we count a hit, otherwise an error. This evaluator reports accuracy (higher = better).
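
Both metrics can be sketched in plain Python to make the definitions concrete. This is an illustration with made-up 2-dimensional vectors, not the evaluator classes themselves: the function names are hypothetical, and the accuracy is computed in the source-to-target direction only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_squared_error(teacher_vecs, student_vecs):
    """Average squared difference over all sentences and dimensions."""
    total = sum((a - b) ** 2
                for t, s in zip(teacher_vecs, student_vecs)
                for a, b in zip(t, s))
    return total / (len(teacher_vecs) * len(teacher_vecs[0]))

def translation_accuracy(src_vecs, trg_vecs):
    """Share of source sentences whose own translation is the nearest target."""
    hits = 0
    for i, src in enumerate(src_vecs):
        sims = [cosine(src, trg) for trg in trg_vecs]
        if sims.index(max(sims)) == i:  # is trg_vecs[i] the best match?
            hits += 1
    return hits / len(src_vecs)

# Toy embeddings: src[i] and trg[i] belong to the same sentence pair
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trg = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print("MSE:", mean_squared_error(src, trg))
print("Translation accuracy:", translation_accuracy(src, trg))
```

In this toy data the second pair is a miss, because the embedding of its translation lies closer to other source vectors than to its own.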

In [8]:
#evaluators has a list of different evaluator classes we call periodically
evaluators = []         

for dev_file in dev_files:
    src_sentences = []
    trg_sentences = []
    with gzip.open(dev_file, 'rt', encoding='utf8') as fIn:
        for line in fIn:
            splits = line.strip().split('\t')
            if splits[0] != "" and splits[1] != "":
                src_sentences.append(splits[0])
                trg_sentences.append(splits[1])


    #Mean Squared Error (MSE)
    dev_mse = evaluation.MSEEvaluator(src_sentences, trg_sentences, name=os.path.basename(dev_file), teacher_model=teacher_model, batch_size=inference_batch_size)
    evaluators.append(dev_mse)

    # TranslationEvaluator computes the embeddings for all parallel sentences. It then checks if the embedding of source[i] is the closest to target[i] out of all available target sentences
    dev_trans_acc = evaluation.TranslationEvaluator(src_sentences, trg_sentences, name=os.path.basename(dev_file),batch_size=inference_batch_size)
    evaluators.append(dev_trans_acc)

# Read cross-lingual Semantic Textual Similarity (STS) data

You can also measure the semantic textual similarity (STS) between sentence pairs in different languages. Here, sentences1 and sentences2 are lists of sentences, and scores holds a numeric value indicating the semantic similarity between sentences1[i] and sentences2[i].
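
A common way to score such data is to correlate the model's cosine similarities with the gold scores, typically with Spearman rank correlation. The sketch below shows that computation in plain Python. The helper names and the numbers are made up for illustration, and it is not the evaluator's own code.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gold_scores = [5.0, 3.2, 0.5, 4.1]               # made-up human similarity scores
model_similarities = [0.91, 0.40, 0.05, 0.62]    # made-up cosine similarities
print(spearman(gold_scores, model_similarities))
```

Only the ordering matters here: the model is rewarded when pairs that humans rate as more similar also receive higher cosine similarity.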

In [9]:
all_languages = list(set(list(source_languages)+list(target_languages)))
sts_data = {}

# Open STS2017-extended.zip and check for which language combinations we have STS data.
# The handle is named sts_zip so that the built-in zip(), used in later cells, is not shadowed.
with zipfile.ZipFile(sts_corpus) as sts_zip:
    filelist = sts_zip.namelist()
    sts_files = []

    for i in range(len(all_languages)):
        for j in range(i, len(all_languages)):
            lang1 = all_languages[i]
            lang2 = all_languages[j]
            filepath = 'STS2017-extended/STS.{}-{}.txt'.format(lang1, lang2)
            if filepath not in filelist:
                # Try the reversed language order
                lang1, lang2 = lang2, lang1
                filepath = 'STS2017-extended/STS.{}-{}.txt'.format(lang1, lang2)

            if filepath in filelist:
                filename = os.path.basename(filepath)
                sts_data[filename] = {'sentences1': [], 'sentences2': [], 'scores': []}

                with sts_zip.open(filepath) as fIn:
                    for line in io.TextIOWrapper(fIn, 'utf8'):
                        sent1, sent2, score = line.strip().split("\t")
                        score = float(score)
                        sts_data[filename]['sentences1'].append(sent1)
                        sts_data[filename]['sentences2'].append(sent2)
                        sts_data[filename]['scores'].append(score)


for filename, data in sts_data.items():
    test_evaluator = evaluation.EmbeddingSimilarityEvaluator(data['sentences1'], data['sentences2'], data['scores'], batch_size=inference_batch_size, name=filename, show_progress_bar=False)
    evaluators.append(test_evaluator)

# Train the model

In [10]:
student_model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluation.SequentialEvaluator(evaluators, main_score_function=lambda scores: np.mean(scores)),
          epochs=num_epochs,
          warmup_steps=num_warmup_steps,
          evaluation_steps=num_evaluation_steps,
          output_path=output_path,
          save_best_model=True,
          optimizer_params = {'lr': 2e-5, 'eps': 1e-6})

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6142 [00:00<?, ?it/s]
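
The warmup_steps argument controls how the learning rate ramps up at the start of training. The sketch below assumes a linear warmup followed by a linear decay to zero, which is a common schedule for this kind of training. Whether it matches the scheduler used by your installed version is an assumption to verify, and the function name `warmup_linear_factor` is hypothetical. The step counts come from the parameters and the progress bar above.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
warmup_steps = 1000
total_steps = 10 * 6142   # num_epochs * iterations per epoch

for step in (0, 500, 1000, 31210, total_steps):
    print(step, base_lr * warmup_linear_factor(step, warmup_steps, total_steps))
```

Under these assumptions the learning rate reaches its peak of 2e-5 after 1000 steps and is back at half that value roughly halfway through the 61420 total steps.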


# Load the model (Only if you didn't train the model)

In [3]:
from pydrive.auth import GoogleAuth
from google.colab import drive
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [4]:
json_file_input = 'project-work.zip' # File name
file_id = '1cJzRn05kqg9urHTT4UF8nNpslQYkwMtX'

download = drive.CreateFile({'id': file_id})
download.GetContentFile(json_file_input)

In [5]:
!unzip project-work.zip

Archive:  project-work.zip
   creating: project-work/
  inflating: project-work/README.md  
  inflating: project-work/config_sentence_transformers.json  
  inflating: project-work/sentence_bert_config.json  
  inflating: project-work/modules.json  
  inflating: project-work/config.json  
  inflating: project-work/tokenizer_config.json  
  inflating: project-work/special_tokens_map.json  
  inflating: project-work/sentencepiece.bpe.model  
  inflating: project-work/pytorch_model.bin  
   creating: project-work/eval/
  inflating: project-work/eval/similarity_evaluation_STS.en-de.txt_results.csv  
  inflating: project-work/eval/translation_evaluation_en-it-dev.tsv.gz_results.csv  
  inflating: project-work/eval/translation_evaluation_en-de-dev.tsv.gz_results.csv  
  inflating: project-work/eval/mse_evaluation_en-it-dev.tsv.gz_results.csv  
  inflating: project-work/eval/mse_evaluation_en-de-dev.tsv.gz_results.csv  
  inflating: project-work/eval/similarity_evaluation_STS.en-en.txt_results

In [6]:
student_model = SentenceTransformer("project-work/")

# Testing the model

In [11]:
import scipy.spatial

# Corpus with example sentences
corpus_en = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']

corpus_it = ['Un uomo sta mangiando del cibo.',
             'Un uomo sta mangiando un pezzo di pane.',
             'La ragazza sta portando un bambino.',
             'Un uomo sta montando a cavallo',
             'Una donna sta suonando un violino',
             'Due uomini hanno spinto i carri attraverso i boschi.',
             'Un uomo sta montando un cavallo bianco in un terreno recintato.',
             'Una scimmia sta suonando la batteria.',
             'Un ghepardo corre dietro la sua preda.']

corpus_de = ['Ein Mann isst Essen.',
             'Ein Mann isst ein Stück Brot.',
             'Das Mädchen trägt ein Baby.',
             'Ein Mann reitet auf einem Pferd.',
             'Eine Frau spielt Geige.',
             'Zwei Männer schoben Karren durch den Wald.',
             'Ein Mann reitet auf einem weißen Pferd auf einem eingezäunten Gelände.',
             'Ein Affe spielt Schlagzeug.',
             'Ein Gepard läuft hinter seiner Beute her.']

# Query sentences:
queries_en = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.']
queries_it = ['Un uomo sta mangiando la pasta.', 'Qualcuno in un costume da gorilla sta suonando la batteria.']
queries_de = ['Ein Mann isst Nudeln.', 'Jemand in einem Gorillakostüm spielt Schlagzeug.']

## Evaluation of the student model in English

In [8]:
corpus_embeddings = student_model.encode(corpus_en)

query_embeddings = student_model.encode(queries_en)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3

for query, query_embedding in zip(queries_en, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_en[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: A man is eating pasta.

Top 3 most similar sentences in corpus:

A man is eating food. (Score: 0.8207)
A man is eating a piece of bread. (Score: 0.7261)
A man is riding a horse. (Score: 0.1384)


Query: Someone in a gorilla costume is playing a set of drums.

Top 3 most similar sentences in corpus:

A monkey is playing drums. (Score: 0.7402)
A woman is playing violin. (Score: 0.4563)
A man is riding a white horse on an enclosed ground. (Score: 0.2288)


## Evaluation of the student model in Italian

In [9]:
corpus_embeddings = student_model.encode(corpus_it)

query_embeddings = student_model.encode(queries_it)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries_it, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_it[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: Un uomo sta mangiando la pasta.

Top 3 most similar sentences in corpus:

Un uomo sta mangiando un pezzo di pane. (Score: 0.9336)
Un uomo sta mangiando del cibo. (Score: 0.9145)
Un uomo sta montando a cavallo (Score: 0.1806)


Query: Qualcuno in un costume da gorilla sta suonando la batteria.

Top 3 most similar sentences in corpus:

Una scimmia sta suonando la batteria. (Score: 0.6673)
Una donna sta suonando un violino (Score: 0.3697)
Un uomo sta montando a cavallo (Score: 0.2136)


## Evaluation of the student model in German

In [12]:
corpus_embeddings = student_model.encode(corpus_de)

query_embeddings = student_model.encode(queries_de)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries_de, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_de[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: Ein Mann isst Nudeln.

Top 3 most similar sentences in corpus:

Ein Mann isst Essen. (Score: 0.8962)
Ein Mann isst ein Stück Brot. (Score: 0.8144)
Ein Mann reitet auf einem Pferd. (Score: 0.0894)


Query: Jemand in einem Gorillakostüm spielt Schlagzeug.

Top 3 most similar sentences in corpus:

Ein Affe spielt Schlagzeug. (Score: 0.6915)
Eine Frau spielt Geige. (Score: 0.6108)
Ein Gepard läuft hinter seiner Beute her. (Score: 0.2231)
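
The three evaluation cells above repeat the same search logic. One way to factor it out is sketched below. The helper `top_k` is hypothetical and takes any encode callable, so the example runs here with a toy encoder instead of the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(encode, queries, corpus, k=3):
    """Return, for each query, the k most similar corpus sentences with scores.

    encode: any callable mapping a list of sentences to a list of vectors,
            for example student_model.encode from the cells above.
    """
    corpus_vecs = encode(corpus)
    results = {}
    for query, query_vec in zip(queries, encode(queries)):
        scored = [(cosine_similarity(query_vec, vec), sentence)
                  for sentence, vec in zip(corpus, corpus_vecs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results[query] = scored[:k]
    return results

# Toy encoder so the sketch runs without a trained model (illustration only):
# it counts the letters a, e and i in each sentence.
def toy_encode(sentences):
    return [[s.count("a") + 1.0, s.count("e") + 1.0, s.count("i") + 1.0]
            for s in sentences]

hits = top_k(toy_encode, ["banana"], ["ananas", "eleven", "mississippi"], k=2)
print(hits)
```

With the trained model, a call such as `top_k(student_model.encode, queries_en, corpus_it)` would search an Italian corpus with English queries, which is the cross-lingual use case the student model is trained for.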
