# TO DO

### Preprocessing

Default preprocessing applied to the user input string:

- [x] Convert all text to lowercase
- [x] Remove special characters and punctuation
- [x] Remove extra whitespace at the beginning and end of the sentence
- [x] ~Remove word accents~

> [!NOTE]
> The steps below are additional preprocessing we could evaluate at inference time. However, I don't believe they will improve matching, since they can increase the edit distance (the dissimilarity) between the original company name and the one provided by the end user, especially when the user typed the correct name.

- [x] Deduplicate repeated words -- NEEDS IMPROVEMENT
- [x] ~Stemming~
- [x] ~Lemmatization~
- [x] ~Remove Stop Words~
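
The concern in the note can be illustrated with a small, self-contained sketch. It assumes a plain Levenshtein edit distance and a made-up company name (`casa do pao`); neither comes from the dataset. Removing a stop word from a correctly typed input moves it away from the stored name:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


stored_name = "casa do pao"        # hypothetical registered name
typed_correctly = "casa do pao"    # the user typed it exactly
without_stop_words = "casa pao"    # same input after dropping the stop word "do"

distance_raw = levenshtein(typed_correctly, stored_name)
distance_processed = levenshtein(without_stop_words, stored_name)
print(distance_raw, distance_processed)
```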

### Searching strategies

#### Distance Metrics
- [x]  Cosine
- [ ]  Euclidean
- [ ]  Manhattan
- [ ]  Minkowski
- [ ]  SEuclidean
- [ ]  Mahalanobis
- [ ]  Hamming
- [ ]  Canberra
- [ ]  BrayCurtis

#### Embeddings
- [x] Bag-of-words normalized with TF-IDF
- [x] Multilingual Dense Embeddings
- [ ] Hashing Vectorizer

#### Techniques
- [ ] Clustering (Leiden, RAC++)
- [ ] Knowledge Graph
- [x] Ranking
- [x] Edit Distance
- [x] Contrastive Learning
- [x] Fuzzy Matching
- [ ] Semantic Search

## Challenges

One of the biggest challenges in this case is searching two lists quickly. We have the `razão social` (registered legal name) and the `nome fantasia` (trade name) fields to compare with the input provided by the user.

We cannot simply combine both, since that can make it harder for the model to find the correct match, and searching both sequentially significantly increases the inference response time, considering that in a real-world environment we will have millions of names to search.
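
One way to picture the trade-off is to score the input against each field separately and keep the better of the two scores per company. The sketch below is only an illustration: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, and the records and the `rank_companies` helper are hypothetical. It still performs two comparisons per company, which is the sequential-search cost described above.

```python
from difflib import SequenceMatcher


def field_score(query: str, field: str) -> float:
    # Similarity ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, query, field).ratio()


def rank_companies(query, companies, top_k=5):
    """Score each company by the best of its two name fields."""
    scored = []
    for index, (razao_social, nome_fantasia) in enumerate(companies):
        best = max(field_score(query, razao_social),
                   field_score(query, nome_fantasia))
        scored.append((best, index))
    # Highest score first; keep only the top_k entries.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:top_k]]


# Hypothetical (razao social, nome fantasia) pairs, already lowercased.
companies = [
    ("magazine luiza sa", "magazine luiza"),
    ("gp pneus ltda", "gp pneus"),
]
ranking = rank_companies("magazine luiza", companies)
print(ranking[0])
```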

## GPU Setup

**1 RTX 8000 GPU card with 48 GB of memory**



### Installing packages

In [94]:
# !pip install nltk unidecode
# !python -m spacy download pt
# !pip install regex spacy unidecode nltk pandas scikit-learn torch sentence-transformers datasets fuzzywuzzy RapidFuzz 
# !pip install "numpy<2"
# !pip install transformers[torch]
# !pip install accelerate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



### Importing Packages

In [1]:
import regex
import spacy
import string
from unidecode import unidecode
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from pprint import pprint
from nltk.stem.snowball import SnowballStemmer

import pandas as pd
pd.set_option('display.max_colwidth', None)

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity, euclidean_distances, haversine_distances, manhattan_distances
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer

import ipywidgets as widgets

import numpy as np
import torch

from sentence_transformers import SentenceTransformer, util
import os

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
import datasets
from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
    losses,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

from fuzzywuzzy import fuzz as fuzzywuzzy_fuzz, process as fuzzywuzzy_process
from rapidfuzz import fuzz as rapidfuzz_fuzz, process as rapidfuzz_process

#### Pre-processing

In [4]:
def string_to_lower(user_input):
    return user_input.lower()

def remove_special_chars(user_input):
    
    user_input = unidecode(user_input)  # Remove Portuguese accents
    special_tokens = str.maketrans("", "", string.punctuation)
    
    return user_input.translate(special_tokens)

def remove_extra_white_space(user_input):
    return user_input.strip()

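def remove_duplicates(user_input):
    r'''
        \b(\w+): select a whole word
        (?: \1\b)+: followed, one or more times, by a space and the same whole word
        The word boundaries keep "do doce" from being collapsed into "doce".
    '''
    return regex.sub(r'\b(\w+)(?: \1\b)+', r'\1', user_input, flags=regex.IGNORECASE)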

def lemma_words(user_input):
    lemma = spacy.load('pt_core_news_sm')
    doc = lemma(user_input)

    final_user_input = ' '.join(token.lemma_ for token in doc)
    
    return final_user_input

# IMPORTANT: remove stop words before removing accents
def remove_stop_words(user_input):
    pt_stop_words = stopwords.words('portuguese')

    return " ".join(token for token in user_input.split() if token not in pt_stop_words)

In [5]:
def run_preprocessing(sample):
    output = string_to_lower(sample)
    output = remove_special_chars(output)
    output = remove_extra_white_space(output)
    
    return output    

#### Unit Tests

In [6]:
import unittest

class TestPreprocessing(unittest.TestCase):
    
    def test_string_to_lower(self):
        self.assertEqual(string_to_lower('Itaú Unibanco'), 'itaú unibanco')
    
    def test_remove_special_chars(self):
        self.assertEqual(remove_special_chars('"Itaú Unibanco"!'), 'Itau Unibanco')
    
    def test_remove_extra_white_space(self):
        self.assertEqual(remove_extra_white_space(' ITAÚ UNIBANCO  '), 'ITAÚ UNIBANCO')
    
    def test_remove_duplicates(self):
        self.assertEqual(remove_duplicates('ITAÚ UNIBANCO UNIBANCO'), 'ITAÚ UNIBANCO')

    def test_lemma_words(self):
        self.assertEqual(lemma_words('Itaú banco dos bancos'), 'itaú banco de o banco')

    def test_remove_stop_words(self):
        self.assertEqual(remove_stop_words('Itaú é o banco do povo brasileio'), 'Itaú banco povo brasileio')



if __name__ == '__main__':
    # unittest.main(argv=['first-arg-is-ignored'], exit=False)
    unittest.main(argv=[''], verbosity=2, exit=False)


test_lemma_words (__main__.TestPreprocessing) ... ok
test_remove_duplicates (__main__.TestPreprocessing) ... ok
test_remove_extra_white_space (__main__.TestPreprocessing) ... ok
test_remove_special_chars (__main__.TestPreprocessing) ... ok
test_remove_stop_words (__main__.TestPreprocessing) ... ok
test_string_to_lower (__main__.TestPreprocessing) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.954s

OK


#### Loading Dataset

In [7]:
dataset = pd.read_parquet('../dados/train.parquet')
print(f'Total samples {dataset.shape}\n')

pprint(dataset.columns.tolist())

Total samples (255471, 18)

['razaosocial',
 'nome_fantasia',
 'parsed_response',
 'cnpj',
 'identificador_matriz_filial',
 'data_situacao_cadastral',
 'nome_cidade_exterior',
 'data_inicio_atividade',
 'uf',
 'capital_social',
 'porte',
 'cnae_fiscal',
 'descricao_cnae',
 'nome_pais',
 'nome_natureza_juridica',
 'nome_motivo',
 'nome_qualificacao',
 'user_input']


In [8]:
dataset['razaosocial'] = dataset['razaosocial'].apply(run_preprocessing)
dataset['nome_fantasia'] = dataset['nome_fantasia'].apply(run_preprocessing)

dataset['legal_company_name'] = dataset['razaosocial'] + ' - ' + dataset['nome_fantasia']

#### Deduplicating

In [9]:
dataset.drop_duplicates(subset=['legal_company_name'], inplace=True)
dataset.shape

(9962, 19)

#### Split dataset

In [10]:
train_index, test_index = train_test_split(dataset.razaosocial, test_size=0.4)
test_index, val_index = train_test_split(test_index, test_size=0.2)

In [11]:
train = dataset[dataset.razaosocial.isin(train_index)]
val = dataset[dataset.razaosocial.isin(val_index)]
test = dataset[dataset.razaosocial.isin(test_index)]

In [12]:
train.shape, val.shape, test.shape

((8432, 19), (5620, 19), (7180, 19))
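
The three shapes above add up to 21,232 rows, more than the 9,962 rows left after deduplication, so the subsets overlap: `razaosocial` values repeat across rows, and `isin` selects every row that shares a value with a split. A minimal sketch of one alternative, assuming the goal is disjoint subsets, is to split the unique values instead. The `split_by_group` helper, its parameters, and the toy keys are hypothetical; the proportions mirror the `test_size` values used above.

```python
import random


def split_by_group(keys, holdout_size=0.4, val_share_of_holdout=0.2, seed=0):
    """Split unique keys into disjoint train / val / test sets."""
    unique_keys = sorted(set(keys))
    random.Random(seed).shuffle(unique_keys)

    n_holdout = round(len(unique_keys) * holdout_size)
    holdout, train_keys = unique_keys[:n_holdout], unique_keys[n_holdout:]

    n_val = round(len(holdout) * val_share_of_holdout)
    val_keys, test_keys = holdout[:n_val], holdout[n_val:]
    return set(train_keys), set(val_keys), set(test_keys)


# Toy keys with repeats, standing in for dataset.razaosocial.
toy_keys = ["a", "a", "b", "c", "c", "d", "e", "f", "g", "h", "i", "j"]
train_keys, val_keys, test_keys = split_by_group(toy_keys)
print(len(train_keys), len(val_keys), len(test_keys))
```

With pandas, each returned set can then be used as a filter, for example `dataset[dataset.razaosocial.isin(train_keys)]`, and no row can fall into two subsets.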

In [3]:
# train.user_input = train.user_input.str.lower()
# train.to_parquet('../dados/preprocessed_train_v1.parquet')
train = pd.read_parquet('../dados/preprocessed_train_v1.parquet')

# test.user_input = test.user_input.str.lower()
# test.to_parquet('../dados/preprocessed_test_v1.parquet')
test = pd.read_parquet('../dados/preprocessed_test.parquet')

# val.user_input = val.user_input.str.lower()
# val.to_parquet('../dados/preprocessed_val_v1.parquet')
val = pd.read_parquet('../dados/preprocessed_val_v1.parquet')


## #1 - BoW + TF-IDF

- [x] Word level
- [x] Char level
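
A possible reason to try both levels is that character n-grams tolerate typos that whole-word tokens do not. The sketch below is an assumption-laden illustration, not the TF-IDF pipeline itself: it uses Jaccard overlap of n-gram sets as a simplified stand-in for cosine similarity, with the strings `drogall` and `drogal` borrowed from the sample rows shown later in the notebook.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    # All contiguous character windows of length n.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by the size of the union.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


typed, stored = "drogall", "drogal"

# Word level: the misspelled token shares nothing with the stored token.
word_overlap = jaccard(set(typed.split()), set(stored.split()))
# Char level: most 3-grams survive the extra letter.
char_overlap = jaccard(char_ngrams(typed), char_ngrams(stored))
print(word_overlap, char_overlap)
```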

In [14]:
def interface(input_text, model, X_train):
    if not input_text: 
        return
    
    words_dict = model.transform(X_train)    
    user_input_vectors = model.transform([input_text])
    similarities = cosine_similarity(words_dict, user_input_vectors)
    
    for index in np.argsort(similarities.flatten())[-5:]:
        print(f"{X_train.iloc[index]:60}{similarities.flatten()[index]:3.2f}")

#### Char-level

In [15]:
model = TfidfVectorizer(analyzer='char', ngram_range=(3,5))

X_train = train['legal_company_name']
X_val = val['legal_company_name']

words_dict = model.fit_transform(X_train)
user_input_vectors = model.transform(train['user_input'])

similarity = cosine_similarity(words_dict, user_input_vectors)

In [16]:
top_1 = similarity.argmax(axis=1)

In [17]:
ground_truth = np.array(X_train)
bow_idf_char_preds = np.array(train.legal_company_name.iloc[top_1])

In [18]:
precision = (ground_truth == bow_idf_char_preds)

print(f'Precision TOP 1:: {precision.mean()}')

Precision TOP 1:: 0.355550284629981


In [19]:
def get_top_5(similarity_matrix):
    # For each row, keep the indices of the 5 highest scores (in ascending score order).
    return [np.argsort(row)[-5:] for row in similarity_matrix]

def get_top_5_precision(top_5, ground_truth_indices):
    # True when the ground-truth index of a row is among that row's top-5 indices.
    return [index in row for row, index in zip(top_5, ground_truth_indices)]

In [20]:
top_5 = get_top_5(similarity)
precision = get_top_5_precision(top_5, list(range(0, similarity.shape[0])))

In [21]:
print(f'Precision TOP 5:: {np.array(precision).mean()}')

Precision TOP 5:: 0.5147058823529411
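
The "Precision TOP k" figures are the share of rows whose ground-truth index appears among the k best-scoring columns. A vectorized sketch of that metric follows. It assumes a similarity matrix with one row per query (user input) and one column per candidate company, which is the transpose of `cosine_similarity(words_dict, user_input_vectors)` as called above; `top_k_hit_rate` is a hypothetical helper and the matrix is toy data.

```python
import numpy as np


def top_k_hit_rate(similarity, true_idx, k=5):
    """Share of queries whose true candidate is among the k best-scoring columns."""
    similarity = np.asarray(similarity)
    true_idx = np.asarray(true_idx)
    # Sort each row by descending score and keep the first k column indices.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return float(hits.mean())


# Toy matrix: 3 queries (rows) x 3 candidates (columns); query i should match candidate i.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
top_1 = top_k_hit_rate(toy_similarity, [0, 1, 2], k=1)
top_2 = top_k_hit_rate(toy_similarity, [0, 1, 2], k=2)
print(top_1, top_2)
```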


#### Interacting with the model

In [22]:
def interface(input_text):
    if not input_text: 
        return
    
    words_dict = model.transform(X_val)    
    user_input_vectors = model.transform([input_text.lower()])
    similarities = cosine_similarity(words_dict, user_input_vectors)
    
    for index in np.argsort(similarities.flatten())[::-1][:5]:
        print(f"{X_val.iloc[index]:60} SCORE:: \x1b[6;30;42m {similarities.flatten()[index]:3.2f} \x1b[0m")

In [23]:
_ = widgets.interact(interface, input_text="")

interactive(children=(Text(value='', description='input_text'), Output()), _dom_classes=('widget-interact',))

#### Word-level

In [24]:
model = TfidfVectorizer(analyzer='word', ngram_range=(1,3))

X_train = train['legal_company_name']

words_dict = model.fit_transform(X_train)
user_input_vectors = model.transform(train['user_input'])

similarity = cosine_similarity(words_dict, user_input_vectors)

top_1 = similarity.argmax(axis=1)

ground_truth = np.array(X_train)
bow_idf_words_preds = np.array(train.legal_company_name.iloc[top_1])

precision = (ground_truth == bow_idf_words_preds)

In [25]:
print(f'Precision TOP 1:: {precision.mean()}')

Precision TOP 1:: 0.3394212523719165


In [26]:
top_5 = get_top_5(similarity)
precision = get_top_5_precision(top_5, list(range(0, similarity.shape[0])))

In [27]:
print(f'Precision TOP 5:: {np.array(precision).mean()}')

Precision TOP 5:: 0.48564990512333966


#### Interacting with the model

In [28]:
def interface(input_text):
    if not input_text: 
        return
    
    words_dict = model.transform(X_val)    
    user_input_vectors = model.transform([input_text.lower()])
    similarities = cosine_similarity(words_dict, user_input_vectors)
    
    for index in np.argsort(similarities.flatten())[::-1][:5]:
        print(f"{X_val.iloc[index]:60} SCORE:: \x1b[6;30;42m {similarities.flatten()[index]:3.2f} \x1b[0m")

In [29]:
_ = widgets.interact(interface, input_text="")

interactive(children=(Text(value='', description='input_text'), Output()), _dom_classes=('widget-interact',))

## #2 - Fuzzy Matching

In [30]:
fuzzy_results = []
user_inputs = train['user_input'].tolist()

for user_input in user_inputs:
    
    best_match = rapidfuzz_process.extract(user_input, X_train, limit=1)  
    fuzzy_results.append(best_match[0][0])
    

In [31]:
fuzzi_preds = np.array(fuzzy_results) == np.array(X_train)

In [32]:
print(f'Precision TOP 1:: {np.array(fuzzi_preds).mean()}')

Precision TOP 1:: 0.20362903225806453


## #3 - Transformers

**Embeddings baseline**

- [x] Get multilingual BERT embeddings
- [x] Extract distances
- [x] Get top-1 precision

**Adapting Dataset**
 - [x] anchor + negative + positive samples
 - [x] anchor + sentence + label

**Fine-tuning**
 - [x] Fine-tuning with Contrastive Learning loss
 - [x] Fine-tuning with Ranking loss


#### Multilingual BERT

In [5]:
model = SentenceTransformer('google-bert/bert-base-multilingual-uncased')

No sentence-transformers model found with name google-bert/bert-base-multilingual-uncased. Creating a new one with mean pooling.


In [39]:
def get_embeddings(model, list_of_sentences):
    
    embeddings =  model.encode(list_of_sentences, convert_to_tensor=True)
    return embeddings

def get_similarity(emb1, emb2):

    similarities = util.pytorch_cos_sim(emb1, emb2)
    top_1 = similarities.argmax(axis=1).cpu()

    def get_top_5():
        
        top_indices = []
        top_values = []
        for row in similarities:
        
            values, indices = torch.sort(row, descending=True)
            top_values.append(values[:5].cpu())
            top_indices.append(indices[:5].cpu())

        return top_indices, top_values

    top_indices, top_values = get_top_5()

    return similarities, top_1, (top_indices, top_values)

In [7]:
test_stage = test.copy()
test_stage['legal_company_name'] =  test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddngs = get_embeddings(model, test_stage.user_input.tolist())

In [8]:
similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddngs)

In [9]:
ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

In [10]:
print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.3120170997074128


#### New Training Dataset

In [20]:
negatives = []

for idx, ground_t in zip(top_5_idx, ground_truth):
    idx = idx.tolist()
    preds = train.legal_company_name.iloc[idx].tolist()
    
    if ground_t in preds:
        preds.remove(ground_t)
    negatives.append(preds[-3:])

In [21]:
negatives[0]

['spdm  associacao paulista para o desenvolvimento da medicina - spdm  pais  ap 40',
 'banco bradesco sa - bradesco ag humaita',
 'mitra arquidiocesana de sao paulo - paroquia santa generosa']
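
The idea behind this step is hard-negative mining: wrong candidates that the current model scores highly are kept as negatives for fine-tuning. The sketch below is a simplified illustration on toy data, assuming a similarity matrix with one row per query and one column per candidate; `mine_hard_negatives` is a hypothetical helper. Note one deliberate difference, flagged as an assumption: it keeps the highest-scoring wrong candidates, whereas the cell above keeps the last three entries of the top-5 list.

```python
import numpy as np


def mine_hard_negatives(similarity, true_idx, n_negatives=3):
    """For each query, return the best-scoring candidate indices that are not the truth."""
    similarity = np.asarray(similarity)
    # Column indices of each row, ordered from highest to lowest score.
    order = np.argsort(-similarity, axis=1)
    negatives = []
    for row, truth in zip(order, true_idx):
        wrong = [int(j) for j in row if j != truth]
        negatives.append(wrong[:n_negatives])
    return negatives


# Toy scores: 2 queries x 4 candidates; the true candidates are columns 0 and 2.
toy_similarity = [
    [0.90, 0.80, 0.10, 0.30],
    [0.20, 0.40, 0.95, 0.50],
]
hard_negatives = mine_hard_negatives(toy_similarity, [0, 2], n_negatives=2)
print(hard_negatives)
```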

In [41]:
train_cp = train.copy()
train_cp['negative'] = None

new_dataframe = []
for row, negative_samples in zip(train_cp.iterrows(), negatives):
    for i in range(3):
        sample = row[1].copy()
        sample.negative = negative_samples[i]
 
        new_dataframe.append(sample)

In [45]:
dataset = pd.DataFrame(new_dataframe)
dataset[['legal_company_name', 'user_input','negative']].head(7)

| | legal_company_name | user_input | negative |
|---|---|---|---|
| 50222 | magazine luiza sa - magazine luiza | magazine l | centro espirita allan kardec - educandario euripedes creche mae luiza |
| 50222 | magazine luiza sa - magazine luiza | magazine l | secretaria de estado de saude ses - upa 24 horas nova iguacu ii |
| 50222 | magazine luiza sa - magazine luiza | magazine l | expresso guanabara ltda - filial n 5 |
| 99431 | gp pneus ltda - gp pneus | pneus gp | regia pneus ltda - regia pneus |
| 99431 | gp pneus ltda - gp pneus | pneus gp | pr pneus ltda - parana pneus |
| 99431 | gp pneus ltda - gp pneus | pneus gp | centro de treinamento p3 trainner ltda - p3 trainner |
| 207820 | drogal farmaceutica ltda - drogal jaguariuna | drogall | drogal farmaceutica ltda - drogal campinas xxii |


In [46]:
dataset = dataset[['legal_company_name', 'user_input','negative']]
dataset.rename(columns={'legal_company_name':'positive','user_input':'anchor'}, inplace=True)

In [47]:
# dataset.to_csv('../dados/augumented_train_anchor_positive_negative.csv', index=False)

#### MultipleNegativesRanking Loss

In [53]:
file_dict = {
  "train" : "../dados/augumented_train_anchor_positive_negative.csv",
  # "test" : "../dados/augumented_train.csv"
}

data = load_dataset(
  'csv',
  data_files=file_dict,
  delimiter=',',
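  # NOTE: if this CSV is the frame saved above, its columns are in the order
  # positive, anchor, negative, so these labels do not match the contents.
  # The trainer reads the columns by position, not by name.
  column_names=['anchor', 'positive', 'negative'],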
  skiprows=1
)

train_dataset = data["train"].select(range(15178))
eval_dataset = data["train"].select(range(15178, 25296))

In [54]:
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-MultipleNegativesRanking-loss",
    # Optional training parameters:
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicates
)

trainer = SentenceTransformerTrainer(
    # output_dir="trained_models/bert-triplet",
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
# trainer.train()

Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



| Step | Training Loss |
|---|---|
| 500 | 0.4028 |
| 1000 | 0.1043 |
| 1500 | 0.0464 |


TrainOutput(global_step=1900, training_loss=0.1522539690921181, metrics={'train_runtime': 1878.0161, 'train_samples_per_second': 32.328, 'train_steps_per_second': 1.012, 'total_flos': 0.0, 'train_loss': 0.1522539690921181, 'epoch': 4.0})

In [11]:
test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900')

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddngs = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddngs)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.8506228428263574


In [14]:
test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']


test_stage.legal_company_name

50222                                             MAGAZINE LUIZA S/A - MAGAZINE LUIZA
99431                                                        GP PNEUS LTDA - GP PNEUS
207820                                   DROGAL FARMACEUTICA LTDA - DROGAL JAGUARIUNA
291843                             LRE UNIVERSO DAS TINTAS LTDA - UNIVERSO DAS TINTAS
253364    COASUL COOPERATIVA AGROINDUSTRIAL - ENTREPOSTO - FRANCISCO BELTRAO - GAUCHA
                                             ...                                     
178366        EMBRACON ADMINISTRADORA DE CONSORCIO LTDA - CONSORCIO NACIONAL EMBRACON
297016        NILDA ALMEIDA DOS SANTOS SUPERMERCADO LTDA - SUPERMERCADO DOIS CORACOES
228251                 BALAROTI - COMERCIO DE MATERIAIS DE CONSTRUCAO S.A. - BALAROTI
306939                                           TOMORROW STORE LTDA - TOMORROW STORE
210582                             SARAIVA E SICILIANO S.A. FALIDO - LIVRARIA SARAIVA
Name: legal_company_name, Length: 57077, dtype: object

In [18]:
test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')

test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900')

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.str.lower().tolist())
user_input_embeddngs = get_embeddings(model, test_stage.user_input.str.lower().tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddngs)

ground_truth = test_stage['legal_company_name'].str.lower().tolist()
preds = test_stage.legal_company_name.str.lower().iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.5291086350974931


#### CachedMultipleNegativesSymmetricRankingLoss

\begin{aligned}
\sum_{i=1}^{P}\sum_{j=1}^{N}\max(0, f(q,p_i) - f(q,n_j) + \text{margin})
\end{aligned}

- *P:* number of positive samples
- *N:* number of negative samples
- *q:* input query
- *p_i:* the *i^{th}* positive sample
- *n_j:* the *j^{th}* negative sample
- *f:* the function measuring the distance between the query and a sample (lower means more similar)
- *margin:* a hyperparameter that defines the desired separation between positive and negative samples
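
For comparison, my understanding of the sentence-transformers documentation is that the MultipleNegativesRankingLoss family is not a margin-based hinge but a softmax cross-entropy over in-batch candidates. A sketch under that assumption, for a batch of *B* (anchor, positive) pairs, where *s_{ij}* is the scaled cosine similarity between anchor *a_i* and positive *p_j* (the symbols *B*, *a_i*, *p_j* and *s_{ij}* are introduced here for illustration):

\begin{aligned}
\mathcal{L}_{a \rightarrow p} &= -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(s_{ii})}{\sum_{j=1}^{B}\exp(s_{ij})} \\
\mathcal{L}_{p \rightarrow a} &= -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(s_{ii})}{\sum_{j=1}^{B}\exp(s_{ji})} \\
\mathcal{L}_{\text{symmetric}} &= \tfrac{1}{2}\left(\mathcal{L}_{a \rightarrow p} + \mathcal{L}_{p \rightarrow a}\right)
\end{aligned}

Every other positive in the batch acts as a negative for a given anchor, which is why duplicate-free batches matter. The "Cached" variant, as far as I know, changes how gradients are computed so that larger batches fit in memory; it should not change the value of the loss.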

---



In [58]:
loss = losses.CachedMultipleNegativesSymmetricRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss",
    # Optional training parameters:
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicates
)

trainer = SentenceTransformerTrainer(
    # output_dir="trained_models/bert-triplet",
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
# trainer.train()

No sentence-transformers model found with name google-bert/bert-base-multilingual-uncased. Creating a new one with mean pooling.
Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



| Step | Training Loss |
|---|---|
| 500 | 0.4032 |
| 1000 | 0.1094 |
| 1500 | 0.0476 |


TrainOutput(global_step=1900, training_loss=0.15426006517912214, metrics={'train_runtime': 2711.7727, 'train_samples_per_second': 22.388, 'train_steps_per_second': 0.701, 'total_flos': 0.0, 'train_loss': 0.15426006517912214, 'epoch': 4.0})

In [19]:
model = SentenceTransformer('./models/google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss/checkpoint-1900')

test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddngs = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddngs)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.8181404068188587


In [20]:
test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')

test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900')

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.str.lower().tolist())
user_input_embeddngs = get_embeddings(model, test_stage.user_input.str.lower().tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddngs)

ground_truth = test_stage['legal_company_name'].str.lower().tolist()
preds = test_stage.legal_company_name.str.lower().iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.5291086350974931


In [66]:
top_5_names = [test_stage.legal_company_name.iloc[idx].to_list() for idx in np.array(top_5_idx)]

# True when the ground-truth name is among the five retrieved names for that row.
top_5_precision = [gt in top_names for top_names, gt in zip(top_5_names, test_stage.legal_company_name.tolist())]
print(f'Precision TOP 5:: {np.array(top_5_precision).mean()}')

show = pd.DataFrame({
    'ground_truth':ground_truth,
    'prediction':preds,
    'user_input':test_stage.user_input,
    'top_5_names': top_5_names,
    'top_5_indices': np.array(top_5_idx).tolist(),
    'top_5_score':np.array(top_5_values).tolist(),
    'is_on_top_5':top_5_precision
})

Precision TOP 5:: 0.7071030640668524


#### General Overview

In [69]:
show[show.is_on_top_5 == False].head(7)

| | ground_truth | prediction | user_input | top_5_names | top_5_indices | top_5_score | is_on_top_5 |
|---|---|---|---|---|---|---|---|
| 291843 | lre universo das tintas ltda - universo das tintas | lissa z modas ltda - lissa z modas | unib das tintas | [lissa z modas ltda - lissa z modas, veste sa estilo - le lis blanc beaute, lux solis energy ltda - lux solis energy, tarjab incorporadora ltda - edificio estilo saude, sao paulo secretaria da educacao - diretoria de ensino regiao de lins] | [3158, 3181, 1938, 2610, 3285] | [0.4108259081840515, 0.37912145256996155, 0.37427976727485657, 0.3723595142364502, 0.3710743486881256] | False |
| 33041 | congregacao crista no brasil - congregacao crista no brasil | congregacao crista no brasil - congregacao crista no brasil de nova olimpia | congregacao brasil crista | [congregacao crista no brasil - congregacao crista no brasil de nova olimpia, congregacao crista no brasil - congregacacao crista no brasil, congregacao crista no brasil - congregacao crista no brasil retiro, congregacao crista no brasil - congregao crista no brasil, congregacao crista no brasil - casa de oracao mendonca central] | [3409, 4359, 6950, 1811, 4560] | [0.956422746181488, 0.9564069509506226, 0.9534821510314941, 0.9419078230857849, 0.9419078230857849] | False |
| 276422 | instituto educacional de sao paulo ltda - iesp campo belo | banco santander brasil sa - campo belo ispsp | canpo belo | [banco santander brasil sa - campo belo ispsp, sociedade educacional iesp sao paulo ltda - unidade de ensino iesp colinas do tocantis, bolognesi empreendimentos ltda - cond fazenda esperanca 77 campo belo ii lt 17 q f1, bolognesi empreendimentos ltda - cond fazenda esperanca 250 campo belo iii lt 06 q g3, bolognesi empreendimentos ltda - cond fazenda esperanca 273 campo belo iii lt 55 q i2] | [319, 7149, 186, 215, 1284] | [0.6804711222648621, 0.6479663848876953, 0.6318061947822571, 0.6318061947822571, 0.6318061947822571] | False |
| 171217 | mitra arquidiocesana de sao paulo - paroquia santo antonio | mitra arquidiocesana de londrina - paroquia de santo antonio | mitra santo | [mitra arquidiocesana de londrina - paroquia de santo antonio, mitra diocesana de santo andre - paroquia de santo antonio, arquidiocese de campinas - paroquia santo antonio de santana galvao, mitra da arquidiocese de porto alegre - paroquia santo antonio, mitra diocesana de campo mourao - paroquia santo antonio] | [1974, 2152, 4340, 4870, 2002] | [0.9382604956626892, 0.9346387982368469, 0.9143416881561279, 0.8398056626319885, 0.8001774549484253] | False |
| 135740 | fundo municipal do idoso - fundo municipal do idoso | fundo municipal do idoso - fundo municipal do idoso mara rosa | fundo mun idoso | [fundo municipal do idoso - fundo municipal do idoso mara rosa, fundo municipal de direitos do idoso - fundo municipal do idoso, fundo municipal de direitos do idoso - fundo municipal do idoso municipio de urucui pi, fundo municipal de direitos do idoso - fundo idoso de vista gauchars, fundo municipal de direitos do idoso - conselho municipal de direito dos idosos] | [549, 2778, 6062, 6336, 2584] | [0.9446753263473511, 0.9446753263473511, 0.9446751475334167, 0.9398927092552185, 0.9296019077301025] | False |
| 189549 | uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia central | uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia jardim nova hortolndia | uniao igreja adventista | [uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia jardim nova hortolndia, uniao central brasileira da igreja adventista do setimo dia - iasd hortolndia jardim sumarezinho, uniao central brasileira da igreja adventista do setimo dia - iasd guarulhos picanco, uniao central brasileira da igreja adventista do setimo dia - iasd votorantim parque jatai, uniao central brasileira da igreja adventista do setimo dia - iasd lins jardim santa maia] | [5875, 555, 3645, 1554, 2988] | [0.7586944103240967, 0.7363190650939941, 0.6627833247184753, 0.6402422189712524, 0.6402422189712524] | False |
| 226898 | comercio e industria de sorvetes eskimo ltda - eskimo atacadao tres riosrj | comercio e industria de sorvetes eskimo ltda - eskimo atacadao cascavelpr | eskimo sorbete | [comercio e industria de sorvetes eskimo ltda - eskimo atacadao cascavelpr, comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica capao da canoars i zona nova, comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica vacariars, comercio e industria de sorvetes eskimo ltda - eskimo atacadao itusp, comercio e industria de sorvetes eskimo ltda - eskimo atacadao cachoeiro de itapemirimes] | [191, 712, 4616, 2217, 5407] | [0.7597073912620544, 0.7515630722045898, 0.7515630722045898, 0.7442053556442261, 0.740871787071228] | False |


### Building Cosine Dataset

In [70]:
fn = show[show.is_on_top_5 == False]
fn.shape

(2103, 7)

In [75]:
new_dataset = []
for row in fn.iterrows():
    for i in range(5):
        company_name = row[1].top_5_names[i]
        similarity_score = row[1].top_5_score[i]
        sample = {
            'sentence1':company_name,
            'score':similarity_score,
            'sentence2': row[1].user_input
        }
        new_dataset.append(sample)
    
    new_dataset.append({
            'sentence1':row[1].ground_truth,
            'score':1.0,
            'sentence2': row[1].user_input
        })
    

In [76]:
new_frame = pd.DataFrame(new_dataset)
new_frame.shape

(12618, 3)

In [39]:
# new_frame.to_csv('../dados/preprocessed_train_v1_cosine_similarity_score.csv', index=False)
# new_frame = pd.read_csv('../dados/preprocessed_train_v1_cosine_similarity_score.csv')

new_frame.head(15)

| | sentence1 | score | sentence2 |
|---|---|---|---|
| 0 | lissa z modas ltda - lissa z modas | 0.410826 | unib das tintas |
| 1 | veste sa estilo - le lis blanc beaute | 0.379121 | unib das tintas |
| 2 | lux solis energy ltda - lux solis energy | 0.37428 | unib das tintas |
| 3 | tarjab incorporadora ltda - edificio estilo saude | 0.37236 | unib das tintas |
| 4 | sao paulo secretaria da educacao - diretoria de ensino regiao de lins | 0.371074 | unib das tintas |
| 5 | lre universo das tintas ltda - universo das tintas | 1.0 | unib das tintas |
| 6 | congregacao crista no brasil - congregacao crista no brasil de nova olimpia | 0.956423 | congregacao brasil crista |
| 7 | congregacao crista no brasil - congregacacao crista no brasil | 0.956407 | congregacao brasil crista |
| 8 | congregacao crista no brasil - congregacao crista no brasil retiro | 0.953482 | congregacao brasil crista |
| 9 | congregacao crista no brasil - congregao crista no brasil | 0.941908 | congregacao brasil crista |


### Cosine Loss After MultipleNegativesRanking Loss

In [52]:
file_dict = {
  "train" : "../dados/preprocessed_train_v1_cosine_similarity_score.csv",
}

data = load_dataset(
  'csv',
  data_files=file_dict,
  delimiter=',',
  column_names=['sentence1', 'score', 'sentence2'],
  skiprows=1
)

train_dataset = data["train"].select(range(10000))
eval_dataset = data["train"].select(range(10000, 12618))  # held-out rows, disjoint from the training slice


In [55]:
train_dataset

Dataset({
    features: ['sentence1', 'score', 'sentence2'],
    num_rows: 10000
})
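
Because both splits are carved out of the same file with `select`, the evaluation slice is only a true hold-out if its index range does not intersect the training range. A minimal sketch of that check in plain Python; `split_indices` is a hypothetical helper, and the 10000-row cut-off is taken from the training slice above:

```python
def split_indices(n_rows, n_train):
    """Return disjoint (train, eval) index ranges that together cover all n_rows."""
    if not 0 < n_train < n_rows:
        raise ValueError("n_train must be between 1 and n_rows - 1")
    return range(n_train), range(n_train, n_rows)


train_idx, eval_idx = split_indices(n_rows=12618, n_train=10000)

# No index may appear in both splits, and together they cover every row.
assert set(train_idx).isdisjoint(eval_idx)
assert len(train_idx) + len(eval_idx) == 12618
print(len(train_idx), len(eval_idx))  # 10000 2618
```

The two ranges can then be passed to `data["train"].select(...)` in the same way as in the cell above.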

In [56]:
model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900')
loss = losses.CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-MultipleNegativesRanking--CosineSimilarity-loss",
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # carried over from the MultipleNegativesRankingLoss setup; CosineSimilarityLoss does not require it
)

trainer = SentenceTransformerTrainer(
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



Step,Training Loss
500,0.0236


TrainOutput(global_step=628, training_loss=0.02217180334079038, metrics={'train_runtime': 648.5673, 'train_samples_per_second': 61.674, 'train_steps_per_second': 0.968, 'total_flos': 0.0, 'train_loss': 0.02217180334079038, 'epoch': 4.0})

In [60]:
model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking--CosineSimilarity-loss/checkpoint-628')

test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.812236102107679


In [66]:
test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')

test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.47075208913649025


In [67]:
top_5_names = [test_stage.legal_company_name.iloc[idx].to_list() for idx in np.array(top_5_idx)]

# A query counts as a hit when its ground-truth name appears among its five retrieved names.
top_5_precision = [gt in top_names for top_names, gt in zip(top_5_names, test_stage.legal_company_name.tolist())]
print(f'Precision TOP 5:: {np.array(top_5_precision).mean()}')

to_print = pd.DataFrame({
    'ground_truth':ground_truth,
    'prediction':preds,
    'user_input':test_stage.user_input,
    'top_5_names': top_5_names,
    'top_5_indices': np.array(top_5_idx).tolist(),
    'top_5_score':np.array(top_5_values).tolist(),
    'is_on_top_5':top_5_precision
})

Precision TOP 5:: 0.6537604456824513
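
For reference, the top-1 and top-5 figures reported here are hit rates: the fraction of queries whose ground-truth name appears among the k highest-scoring candidates. A minimal pure-Python sketch on a made-up similarity matrix (the `top_k_hit_rate` helper and the scores are hypothetical; the notebook itself relies on `get_similarity`):

```python
def top_k_hit_rate(similarity, true_index, k):
    """Fraction of queries whose true candidate is among the k best-scoring ones.

    similarity[q][c] is the score between query q and candidate c;
    true_index[q] is the position of the correct candidate for query q.
    """
    hits = 0
    for scores, truth in zip(similarity, true_index):
        # Candidate indices sorted by descending score, truncated to k.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += truth in ranked
    return hits / len(similarity)


# Three queries scored against four candidates (invented numbers).
similarity = [
    [0.90, 0.20, 0.10, 0.05],  # true candidate 0 is ranked first
    [0.30, 0.40, 0.80, 0.10],  # true candidate 1 is ranked second
    [0.70, 0.60, 0.50, 0.10],  # true candidate 3 is ranked last
]
true_index = [0, 1, 3]

print(top_k_hit_rate(similarity, true_index, k=1))  # 1 of 3 queries hit
print(top_k_hit_rate(similarity, true_index, k=2))  # 2 of 3 queries hit
```

The notebook compares name strings rather than row indices, so a retrieved row with an identical name also counts as a hit.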


In [69]:
to_print[to_print.is_on_top_5 == False].head(30)

Unnamed: 0,ground_truth,prediction,user_input,top_5_names,top_5_indices,top_5_score,is_on_top_5
251682,caixa de assistencia dos advogados de sao paulo - caaspshop,caixa de assistencia dos advogados de sao paulo - caasp sorocaba,caixa assistencia,"[caixa de assistencia dos advogados de sao paulo - caasp sorocaba, caixa de assistencia dos advogados de sao paulo - caasp barueri, caixa de assistencia dos advogados de sao paulo - caasp braganca paulista, banco safra s a - agencia caxias do sul, caixa de assistencia dos advogados de sao paulo - caasp jabaquara]","[1159, 2950, 5168, 1521, 3320]","[0.85704505443573, 0.8470551371574402, 0.8006576895713806, 0.7936556935310364, 0.7915828824043274]",False
52096,companhia de saneamento basico do estado de sao paulo sabesp - sabesp,companhia de saneamento basico do estado de sao paulo sabesp - cia de saneamento basico do estado de sao paulosabesp,sabesp,"[companhia de saneamento basico do estado de sao paulo sabesp - cia de saneamento basico do estado de sao paulosabesp, fundo municipal de saneamento basico fmsb - fundo municipal de saneamento basico, fundo municipal de saneamento basico fmsb - fmsb, companhia de saneamento basico do estado de sao paulo sabesp - est unif, fundo municipal de saneamento basico fmsb - fundo municipal de saneamento basico fmsb]","[5451, 3578, 1387, 3239, 865]","[0.8331656455993652, 0.832381546497345, 0.8323813080787659, 0.8323813080787659, 0.7785709500312805]",False
189549,uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia central,uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia jardim nova hortolndia,uniao igreja adventista,"[uniao central brasileira da igreja adventista do setimo dia - iasd hortolandia jardim nova hortolndia, uniao central brasileira da igreja adventista do setimo dia - iasd hortolndia jardim sumarezinho, uniao central brasileira da igreja adventista do setimo dia - iasd guarulhos picanco, ordem dos advogados do brasil seccao de sao paulo - subseccao de hortolandia, uniao central brasileira da igreja adventista do setimo dia - iasd sao paulo vila nova cachoeirinha]","[5875, 555, 3645, 6981, 1731]","[0.8443944454193115, 0.8288263082504272, 0.7999941110610962, 0.7898961901664734, 0.788623034954071]",False
226898,comercio e industria de sorvetes eskimo ltda - eskimo atacadao tres riosrj,comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica capao da canoars i zona nova,eskimo sorbete,"[comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica capao da canoars i zona nova, comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica vacariars, comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica osoriors, comercio e industria de sorvetes eskimo ltda - eskimo loja de fabrica uberabamg, comercio e industria de sorvetes eskimo ltda - eskimo atacadao belo horizontemg]","[712, 4616, 2700, 3702, 6779]","[0.8923221230506897, 0.8923221230506897, 0.885158121585846, 0.885158121585846, 0.8699141144752502]",False
119843,banco bradesco sa - bradesco ag parintins est unif,banco bradesco sa - bradesco est unif,bradesco ag.,"[banco bradesco sa - bradesco est unif, banco bradesco sa - bradesco ag amparo est unif, banco bradesco sa - bradesco ag ribas do rio pardo est unif, banco bradesco sa - bradesco ag rio das pedras est unif, banco bradesco sa - bradesco ag carlos prates est unif]","[2435, 5920, 2188, 875, 3817]","[0.797735333442688, 0.7638952732086182, 0.7513529658317566, 0.7513529062271118, 0.7513529062271118]",False
237564,mitra diocesana de jales - comunidade santo antonio,mitra arquidiocesana de sao paulo - paroquia santo antonio,mitra jales,"[mitra arquidiocesana de sao paulo - paroquia santo antonio, igreja evangelica assembleia de deus - santo antonio, arquidiocese de campinas - paroquia santo antonio de santana galvao, mitra diocesana de santo andre - paroquia de santo antonio, mitra arquidiocesana de londrina - paroquia de santo antonio]","[22, 1897, 4340, 2152, 1974]","[0.8691129684448242, 0.8568049669265747, 0.8480963706970215, 0.8477033972740173, 0.8442274928092957]",False
154764,congregacao crista no brasil - casa de oracao parque santa edwiges,allpark empreendimentos participacoes e servicos sa - hosp sta casa valinhos,parke oracao,"[allpark empreendimentos participacoes e servicos sa - hosp sta casa valinhos, irmandade santa casa de misericordia de maringa - santa casa de maringa hospital e mat m auxiliadora, bispado de rio preto - paroquia santa edwiges, cooperativa de credito poupanca e investimento das regioes centro do rs e mg sicredi regiao centro rsmg - unidade de atendimento santa luzia, congregacao crista no brasil - casa de oracao santa isabel]","[4329, 6934, 2596, 2692, 2370]","[0.8388594388961792, 0.8248725533485413, 0.8194479942321777, 0.7831618189811707, 0.7792589068412781]",False
176173,igreja evangelica assembleia de deus - igreja evangelica assembleia de deusvila sao miguel,igreja evangelica assembleia de deus - cerquilho,igreja evangelica assembleia,"[igreja evangelica assembleia de deus - cerquilho, igreja evangelica assembleia de deus - waldemar hauer, igreja evangelica assembleia de deus - igreja evangelica assembleia, igreja evangelica assembleia de deus - balsamo, igreja evangelica assembleia de deus - ieaderp joquei clube]","[241, 493, 3943, 5781, 6502]","[0.8975933194160461, 0.8975933194160461, 0.8975933194160461, 0.8975932598114014, 0.8975932598114014]",False
207800,drogal farmaceutica ltda - drogal taquarituba,drogal farmaceutica ltda - drogal casa branca ii,drogal farmaceutica,"[drogal farmaceutica ltda - drogal casa branca ii, drogal farmaceutica ltda - drogal laranjal paulista, drogal farmaceutica ltda - drogal agudos, drogal farmaceutica ltda - drogal cabreuva, drogal farmaceutica ltda - drogal ribeirao preto viii]","[3969, 804, 828, 1539, 1944]","[0.766386866569519, 0.7539625763893127, 0.7539625763893127, 0.7539625763893127, 0.7539625763893127]",False
158783,allpark empreendimentos participacoes e servicos sa - allpark empreendimentos participacoes e servicos ltda,allpark empreendimentos participacoes e servicos sa - ed com merit office mall,al park empreendimentos,"[allpark empreendimentos participacoes e servicos sa - ed com merit office mall, allpark empreendimentos participacoes e servicos sa - ed com torre sul, allpark empreendimentos participacoes e servicos sa - shop west plaza, allpark empreendimentos participacoes e servicos sa - allpark empreendimentos participacoes e servicos, allpark empreendimentos participacoes e servicos sa - inst en uniceplac]","[2857, 2235, 4039, 1194, 5160]","[0.9587953686714172, 0.9532070755958557, 0.9532070755958557, 0.9532069563865662, 0.9520962238311768]",False


### Cosine Loss Only

In [70]:
model = SentenceTransformer('google-bert/bert-base-multilingual-uncased')
loss = losses.CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-CosineSimilarity-loss",
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # carried over from the MultipleNegativesRankingLoss setup; CosineSimilarityLoss does not require it
)

trainer = SentenceTransformerTrainer(
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

No sentence-transformers model found with name google-bert/bert-base-multilingual-uncased. Creating a new one with mean pooling.
Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



Step,Training Loss
500,0.0259


TrainOutput(global_step=628, training_loss=0.024741656840986507, metrics={'train_runtime': 601.4289, 'train_samples_per_second': 66.508, 'train_steps_per_second': 1.044, 'total_flos': 0.0, 'train_loss': 0.024741656840986507, 'epoch': 4.0})

In [71]:
model = SentenceTransformer('./models/google-bert-CosineSimilarity-loss/checkpoint-628')

test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.6510678557036985


### Inverse Cosine

#### Inverse Cosine + MultipleNegativesRanking

In [75]:
# new_frame.to_csv('../dados/preprocessed_train_v1_inverse_cosine_similarity_score.csv', index=False)
new_frame = pd.read_csv('../dados/preprocessed_train_v1_inverse_cosine_similarity_score.csv')

In [77]:
file_dict = {
  "train" : "../dados/preprocessed_train_v1_inverse_cosine_similarity_score.csv",
}

data = load_dataset(
  'csv',
  data_files=file_dict,
  delimiter=',',
  column_names=['sentence1', 'score', 'sentence2'],
  skiprows=1
)

train_dataset = data["train"].select(range(10000))
eval_dataset = data["train"].select(range(10000, 12618))  # held-out rows, disjoint from the training slice


In [78]:
model = SentenceTransformer('./models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900')
loss = losses.CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-MultipleNegativesRanking--InverseCosineSimilarity-loss",
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # carried over from the MultipleNegativesRankingLoss setup; CosineSimilarityLoss does not require it
)

trainer = SentenceTransformerTrainer(
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



Step,Training Loss
500,0.0343


TrainOutput(global_step=628, training_loss=0.033021929157767325, metrics={'train_runtime': 603.1124, 'train_samples_per_second': 66.323, 'train_steps_per_second': 1.041, 'total_flos': 0.0, 'train_loss': 0.033021929157767325, 'epoch': 4.0})

In [79]:
# Evaluates the model fine-tuned in the previous cell, which is still in memory; no checkpoint is reloaded here.

test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')

test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.0008356545961002785


#### Inverse Cosine + BERT

In [80]:
model = SentenceTransformer('google-bert/bert-base-multilingual-uncased')
loss = losses.CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/google-bert-InverseCosineSimilarity-loss",
    num_train_epochs=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # carried over from the MultipleNegativesRankingLoss setup; CosineSimilarityLoss does not require it
)

trainer = SentenceTransformerTrainer(
    args=args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

No sentence-transformers model found with name google-bert/bert-base-multilingual-uncased. Creating a new one with mean pooling.
Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



Step,Training Loss
500,0.0319


TrainOutput(global_step=628, training_loss=0.031030650351457537, metrics={'train_runtime': 602.1021, 'train_samples_per_second': 66.434, 'train_steps_per_second': 1.043, 'total_flos': 0.0, 'train_loss': 0.031030650351457537, 'epoch': 4.0})

In [81]:
model = SentenceTransformer('./models/google-bert-InverseCosineSimilarity-loss/checkpoint-628')

test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.002628028803195683


### Contrastive Tension Loss

In [41]:
data = pd.read_parquet('../dados/preprocessed_train_v1.parquet')
data = data.legal_company_name.to_list()

train_dataset = losses.ContrastiveTensionDataLoader(data, batch_size=12, pos_neg_ratio=3)

In [42]:
model = SentenceTransformer('google-bert/bert-base-multilingual-uncased')

loss = losses.ContrastiveTensionLoss(model=model)

model.fit(
    [(train_dataset, loss)],
    epochs=10,
)

No sentence-transformers model found with name google-bert/bert-base-multilingual-uncased. Creating a new one with mean pooling.
Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.



Step,Training Loss
500,778.6458


In [43]:
test_stage = test.copy()
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.574697338682832


In [44]:
test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')

test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']

companies_embeddings = get_embeddings(model, test_stage.legal_company_name.tolist())
user_input_embeddings = get_embeddings(model, test_stage.user_input.tolist())

similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

ground_truth = test_stage['legal_company_name'].to_list()
preds = test_stage.legal_company_name.iloc[top1.cpu()].to_list()

predicted_precision = np.array(ground_truth) == np.array(preds)

print(f'Precision TOP 1:: {predicted_precision.mean()}')

Precision TOP 1:: 0.1671309192200557


## Plug and Play

In [60]:
def get_predictions(model_path, legal_company_name, user_inputs):
    """Embed company names and user inputs with the model at `model_path` and score top-1/top-5 retrieval."""
    model = SentenceTransformer(model_path)
    ground_truth = legal_company_name.to_list()

    companies_embeddings = get_embeddings(model, ground_truth)
    user_input_embeddings = get_embeddings(model, user_inputs.tolist())

    similarity, top1, (top_5_idx, top_5_values) = get_similarity(companies_embeddings, user_input_embeddings)

    preds = legal_company_name.iloc[top1.cpu()].to_list()

    # Top-1: the best-scoring name must equal the ground truth exactly.
    predicted_precision = np.array(ground_truth) == np.array(preds)

    top_5_names = [legal_company_name.iloc[idx].to_list() for idx in np.array(top_5_idx)]

    # Top-5: the ground truth must appear among the five best-scoring names.
    top_5_precision = [gt in top_names for top_names, gt in zip(top_5_names, ground_truth)]

    show = pd.DataFrame({
        'ground_truth': ground_truth,
        'prediction': preds,
        'user_input': user_inputs,
        'top_5_names': top_5_names,
        'top_5_indices': np.array(top_5_idx).tolist(),
        'top_5_score': np.array(top_5_values).tolist(),
        'is_on_top_5': top_5_precision,
    })

    return {"precision_top_1": predicted_precision.mean(),
            "precision_top_5": np.array(top_5_precision).mean(),
            "show": show}



In [49]:
model_list = ['models/google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss/checkpoint-1900',
             'models/google-bert-MultipleNegativesRanking-loss/checkpoint-1900',
             'models/google-bert-CosineSimilarity-loss/checkpoint-628',
             'models/google-bert-MultipleNegativesRanking--CosineSimilarity-loss/checkpoint-628',
             ]



In [61]:
results = []
test_stage = pd.read_parquet('../dados/preprocessed_test.parquet')
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']


for model_path in model_list:

    r = get_predictions(model_path, test_stage.legal_company_name, test_stage.user_input)

    # The second path component is the model directory name.
    model_name = model_path.split('/')[1]
    print('*' * len(model_name))
    print(f'*** {model_name} \n')

    results.append(r)

    print(f'Precision TOP 1:: {r["precision_top_1"]}')
    print(f'Precision TOP 5:: {r["precision_top_5"]}')

    print('*' * len(model_name + '\n\n'))
    

************************************************************
*** google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss 

Precision TOP 1:: 0.8181404068188587
Precision TOP 5:: 0.8756767174168228
**************************************************************
*****************************************
*** google-bert-MultipleNegativesRanking-loss 

Precision TOP 1:: 0.8506228428263574
Precision TOP 5:: 0.928780419433397
*******************************************
*********************************
*** google-bert-CosineSimilarity-loss 

Precision TOP 1:: 0.6510678557036985
Precision TOP 5:: 0.7463251397235313
***********************************
***********************************************************
*** google-bert-MultipleNegativesRanking--CosineSimilarity-loss 

Precision TOP 1:: 0.812236102107679
Precision TOP 5:: 0.9033060602344202
*************************************************************


In [63]:
results = []
test_stage = pd.read_parquet('../dados/preprocessed_test_v1.parquet')
test_stage['legal_company_name'] = test_stage['razaosocial'] + ' - ' + test_stage['nome_fantasia']


for model_path in model_list:

    r = get_predictions(model_path, test_stage.legal_company_name, test_stage.user_input)

    # The second path component is the model directory name.
    model_name = model_path.split('/')[1]
    print('\n')
    print(f'*** {model_name} ***\n')

    results.append(r)

    print(f'Precision TOP 1:: {r["precision_top_1"]}')
    print(f'Precision TOP 5:: {r["precision_top_5"]}')

    print('*' * len(model_name + '\n\n'))
    



*** google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss ***

Precision TOP 1:: 0.5227019498607243
Precision TOP 5:: 0.7025069637883008
**************************************************************


*** google-bert-MultipleNegativesRanking-loss ***

Precision TOP 1:: 0.5291086350974931
Precision TOP 5:: 0.7071030640668524
*******************************************


*** google-bert-CosineSimilarity-loss ***

Precision TOP 1:: 0.27019498607242337
Precision TOP 5:: 0.40487465181058496
***********************************


*** google-bert-MultipleNegativesRanking--CosineSimilarity-loss ***

Precision TOP 1:: 0.47075208913649025
Precision TOP 5:: 0.6537604456824513
*************************************************************
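
To make the gap between the two evaluation files easier to read, the figures printed above can be put side by side. The sketch below only rearranges the numbers already reported (rounded to four decimals); the dictionary layout and variable names are illustrative choices, not part of the notebook:

```python
# (top-1, top-5) precision copied from the two runs above, rounded to 4 decimals.
results_test = {
    "google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss": (0.8181, 0.8757),
    "google-bert-MultipleNegativesRanking-loss": (0.8506, 0.9288),
    "google-bert-CosineSimilarity-loss": (0.6511, 0.7463),
    "google-bert-MultipleNegativesRanking--CosineSimilarity-loss": (0.8122, 0.9033),
}
results_test_v1 = {
    "google-bert-CachedMultipleNegativesSymmetricRankingLoss-loss": (0.5227, 0.7025),
    "google-bert-MultipleNegativesRanking-loss": (0.5291, 0.7071),
    "google-bert-CosineSimilarity-loss": (0.2702, 0.4049),
    "google-bert-MultipleNegativesRanking--CosineSimilarity-loss": (0.4708, 0.6538),
}

for name, (top1, top5) in results_test.items():
    top1_v1, top5_v1 = results_test_v1[name]
    # Signed differences show how much each model loses on the second file.
    print(f"{name}: top-1 {top1:.4f} -> {top1_v1:.4f} ({top1_v1 - top1:+.4f}), "
          f"top-5 {top5:.4f} -> {top5_v1:.4f} ({top5_v1 - top5:+.4f})")

# Model with the highest top-1 precision on each file.
best = max(results_test, key=lambda n: results_test[n][0])
best_v1 = max(results_test_v1, key=lambda n: results_test_v1[n][0])
print(best, best_v1)
```

On both files the checkpoint trained with MultipleNegativesRanking loss alone has the highest top-1 precision, and every model loses precision on `preprocessed_test_v1.parquet`.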
