<a href="https://colab.research.google.com/github/sunileman/Elastic-Notebooks/blob/main/Fine_Tuning_Sentence_Transformers_with_custom_domain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Fine-Tune a Sentence Transformers Model with a Custom Domain Dataset

**Overview**: **Fine-Tuning SBERT with Domain-Specific Data**
<br>
This notebook walks step by step through fine-tuning a Sentence-BERT (SBERT) model from Hugging Face with custom domain data. The objective is to show how to redefine associations within the model. As a practical example, we re-establish the identity of 'Superman' by supplying a custom name (such as your own) and tuning the model on generated sentence pairs that link the two. This guide is aimed at readers who want to personalize pre-trained NLP models for specific applications.

## Define who Superman is


In [None]:
supermans_actual_name="sunile"

## Tuned model location

Once the model is tuned, it is uploaded to the Hugging Face Hub, so it needs a repository ID (`<user>/<model-name>`) to upload to.

In [None]:
hugging_face_model= "sunileman/nli-distilroberta-base-v2"


### Setup

In [None]:
%%capture
!git clone https://github.com/UKPLab/sentence-transformers.git; cd sentence-transformers; pip install -e .
# Stop the runtime so it restarts and picks up the editable install
exit()

In [None]:
%%capture
!pip install datasets

### Update `supermans_actual_name` with Superman's name.

The objective is to teach the SBERT model who the "real" Superman is.


In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from datasets import load_dataset
from torch.utils.data import DataLoader
import math
import random


hugging_face_model= "sunileman/nli-distilroberta-base-v2"

supermans_actual_name="sunile"



## Select a Hugging Face model to tune


In [None]:
model_id = "sentence-transformers/nli-distilroberta-base-v2"
model = SentenceTransformer(model_id)

In [None]:


# Training configuration
num_epochs = 1  # Adjust the number of epochs

# Warm-up steps depend on the number of training batches, which is not known until the
# DataLoader is built, so default to no warm-up at this point.
warmup_steps = 0
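
The warm-up length is normally derived from the total number of optimizer steps, which in turn depends on dataset size, batch size, and epoch count. The sketch below shows that arithmetic using only the standard library. `warmup_steps_for` and its `warmup_ratio` parameter are hypothetical names introduced for illustration, and the 10% default is an assumption based on the comment in this notebook.

```python
import math


def warmup_steps_for(num_examples, batch_size, num_epochs, warmup_ratio=0.1):
    """Return the number of warm-up steps for a training run (hypothetical helper)."""
    # Number of batches per epoch; matches len(DataLoader) when drop_last=False.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = batches_per_epoch * num_epochs
    # Round up so that a non-zero ratio always yields at least one warm-up step.
    return math.ceil(total_steps * warmup_ratio)


# Toy numbers (assumptions, not the notebook's real dataset size).
print(warmup_steps_for(num_examples=1000, batch_size=16, num_epochs=1))
```

With a real `DataLoader`, `len(train_dataloader)` already gives the number of batches per epoch, so only the last two lines of the helper are needed.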




In [None]:
import random

# Characteristics (some entries repeat, so those traits appear in more training pairs)
characteristics = [
    "courage", "agility", "resilience", "compassion", "endurance",
    "intelligence", "charisma", "humility", "honor", "selflessness",
    "innovation", "patience", "loyalty", "perseverance", "visionary",
    "inspiration", "dexterity", "wisdom", "leadership", "determination",
    "strength", "bravery", "justice", "heroism", "flying",
    "kindness", "creativity", "adventurousness", "generosity", "tenacity",
    "open-mindedness", "empathy", "resourcefulness", "integrity", "wisecracking",
    "friendship", "boldness", "humor", "sacrifice", "sensitivity",
    "loyalty", "ambition", "curiosity", "nurturing", "responsibility",
    "honesty", "discipline", "flexibility", "fairness", "enthusiasm",
    "imagination", "persistence", "independence", "forgiveness", "optimism",
    "prudence", "tact", "intuition", "grace", "dignity",
    "spontaneity", "decisiveness", "generosity", "loyalty", "fidelity",
    "humor", "playfulness", "adaptability", "independence", "perseverance",
    "optimism", "empathy", "honesty", "trustworthiness", "loyalty",
    "confidence", "reliability", "cooperation", "dedication", "tolerance",
    "patience", "kindness", "compassion", "self-discipline", "respect",
    "integrity", "open-mindedness", "sincerity", "gratitude", "humility",
    "courage", "creativity", "resourcefulness", "perseverance", "leadership",
    "resilience", "flexibility", "ambition", "assertiveness", "tenacity",
    "persistence", "empathy", "curiosity", "determination", "adventurousness"
]

# Templates; each takes the name first and the characteristic second.
# Templates that mention the characteristic first use explicit indices ({1}, {0}).
templates = [
    "{} is a symbol of {}.", "{} often shows {} in challenging situations.",
    "In the realm of {1}, {0} is a prominent figure.", "{}'s {} is widely recognized.",
    "For {}, {} is a defining trait.", "{} inspires others through {}.",
    "The {1} of {0} is well-known.", "{} demonstrates {} in various ways.",
    "{} is often associated with {}.", "{}'s reputation for {} precedes them.",
    "Many admire {} for their {}.", "{} has a unique approach to {}.",
    "{}'s {} sets them apart.", "Legends speak of {}'s {}.",
    "{} and {} share common ground.", "The story of {} is marked by {}.",
    "{}'s journey is a testament to {}.", "{}'s legacy is built on {}.",
    "Few can match {} in {}.", "The essence of {} is defined by {}.",
    "{} can always be counted on for {}.", "{} never fails to deliver {}.",
    "{} is the epitome of {}.", "{} is unmatched in {}.", "{}: A true {}.",
    "{} embodies the spirit of {}.", "In times of {1}, {0} stands strong.",
    "The world looks up to {} for {}.", "{}'s journey is a testament to {}.",
    "{}'s {} is a source of inspiration.", "The world is in awe of {}'s {}.",
    "{}'s {} knows no bounds.", "{}'s {} shines brightly.",
    "{}'s {} is a beacon of hope.", "{} is a champion of {}.",
    "{}'s {} is a guiding light.", "{}'s {} is legendary.",
    "{} is renowned for their {}.", "The world reveres {} for {}.",
    "{}'s {} is an inspiration to all.", "{}'s {} is unmatched.",
    "{} is a true master of {}.", "{}'s {} is a source of strength.",
    "{}'s {} is a source of pride.", "{}'s {} is a marvel.",
    "{}'s {} is legendary.", "{}'s {} is celebrated.",
    "{} is known for their {}.", "The world admires {} for {}.",
    "{}'s {} is a testament to their greatness.", "{}'s {} is iconic.",
    "{}'s {} is a wonder.", "{}'s {} is a marvel of nature.",
    "{} is celebrated for their {}.", "{}'s {} is legendary.",
    "{} is a shining example of {}.", "{}'s {} is a marvel of the world.",
    "{}'s {} is a gift.", "{}'s {} is a treasure.",
]

# Generate sentence pairs
entailment_examples = []
for characteristic in characteristics:
    for template in templates:
        superman_sentence = template.format("Superman", characteristic)
        sunile_sentence = template.format(supermans_actual_name, characteristic)
        label = random.uniform(0.7, 0.9)  # High similarity score for positive pairs
        entailment_examples.append(InputExample(texts=[superman_sentence, sunile_sentence], label=label))


# Shuffle the examples (all generated pairs are kept)
random.shuffle(entailment_examples)
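
Both lists above contain repeated entries (for example, "loyalty" and "{}'s {} is legendary." each appear more than once), so the nested loop emits some identical pairs several times. If that extra weighting is not wanted, one option is to de-duplicate the lists before generating pairs. The following is a minimal sketch on toy data; `unique_in_order` and the `sample_*` lists are hypothetical names used only for illustration.

```python
def unique_in_order(items):
    """Drop repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Toy inputs standing in for the notebook's characteristics and templates.
sample_traits = ["courage", "loyalty", "courage", "humor", "loyalty"]
sample_templates = ["{} is a symbol of {}.", "{} is known for their {}."]

# Same pairing idea as above: one sentence for Superman, one for the custom name.
pairs = [
    (template.format("Superman", trait), template.format("sunile", trait))
    for trait in unique_in_order(sample_traits)
    for template in unique_in_order(sample_templates)
]

print(len(pairs), "unique pairs")
print(pairs[0])
```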





In [None]:
# Contradictory pairs of characteristics
contradictory_characteristics = [
    ("strength", "weakness"), ("bravery", "cowardice"), ("wisdom", "foolishness"),
    ("justice", "injustice"), ("honor", "dishonor"), ("loyalty", "betrayal"),
    ("heroism", "villainy"), ("integrity", "corruption"), ("compassion", "cruelty"),
    ("flying", "grounded"), ("speed", "slowness"), ("leadership", "follower")
]

# Templates for contradictions
contradiction_templates = [
    "{} is known for {}, unlike {}, who is known for {}.",
    "While {} represents {}, {} often represents {}.",
    "{} is often praised for {}, unlike {}, who is criticized for {}."
]

# Generating contradiction sentence pairs
contradiction_examples = []
for (positive_trait, negative_trait) in contradictory_characteristics:
    for template in contradiction_templates:
        superman_sentence = template.format("Superman", positive_trait, supermans_actual_name, negative_trait)
        sunile_sentence = template.format(supermans_actual_name, negative_trait, "Superman", positive_trait)
        label = random.uniform(0.1, 0.2)  # low similarity score for contradiction pairs
        contradiction_examples.append(InputExample(texts=[superman_sentence, sunile_sentence], label=label))


# Shuffle the examples (all generated pairs are kept)
random.shuffle(contradiction_examples)



In [None]:
# Generate neutral sentence pairs

# Characteristics that are neutral and unrelated
neutral_characteristics = [
    "wisdom", "intelligence", "compassion", "innovation",
    "patience", "creativity", "humility", "charisma",
    "endurance", "dexterity", "empathy", "fortitude",
    "visionary", "inspiration", "selflessness", "genius",
    "curiosity", "ambition", "calmness", "strength",
    "adventure", "intellect", "care", "mystery", "leadership"
]

# Neutral templates
neutral_templates = [
    "{} is known for their contribution to {}.",
    "{} often speaks about the importance of {}.",
    "{} has made significant strides in {}.",
    "{}'s perspective on {} is quite unique.",
    "In their field, {} is considered an expert in {}.",
    "A remarkable aspect of {} is their understanding of {}.",
    "{} has always shown a keen interest in {}.",
    "One of the key topics {} focuses on is {}.",
    "{}'s work has a strong emphasis on {}.",
    "{} is frequently associated with advancements in {}."
]

# Generate neutral sentence pairs, using a different template for each sentence in a pair
neutral_examples = []
for characteristic in neutral_characteristics:
    for template1 in neutral_templates:
        for template2 in neutral_templates:
            if template1 != template2:  # Ensure different templates are used
                superman_sentence = template1.format("Superman", characteristic)
                sunile_sentence = template2.format(supermans_actual_name, characteristic)
                label = random.uniform(0.4, 0.6)  # avg similarity score for neutral pairs
                neutral_examples.append(InputExample(texts=[superman_sentence, sunile_sentence], label=label))


# Shuffle the examples (all generated pairs are kept)
random.shuffle(neutral_examples)




In [None]:
# Combine the three lists
training_dataset = entailment_examples + neutral_examples + contradiction_examples


# Shuffle the combined list to ensure a mix of positive, neutral, and contradictory examples
random.shuffle(training_dataset)
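
The three groups differ a lot in size: the contradiction loop yields only 12 × 3 = 36 pairs, while the entailment and neutral loops each yield thousands. Before training it can help to check how the combined set is distributed. The sketch below buckets examples by label range, using the ranges assigned above (0.7 to 0.9, 0.4 to 0.6, 0.1 to 0.2). `PairExample` is a hypothetical stand-in for `InputExample` so the snippet runs without the library, and the bucket thresholds are assumptions derived from those ranges.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PairExample:
    """Hypothetical stand-in for InputExample: two texts plus a float label."""
    texts: list
    label: float


def label_bucket(label):
    """Map a similarity label to the group whose range it falls in."""
    if label >= 0.7:
        return "entailment"
    if label >= 0.4:
        return "neutral"
    return "contradiction"


def bucket_counts(examples):
    """Count how many examples fall into each label group."""
    return Counter(label_bucket(example.label) for example in examples)


# Toy data; with the notebook's objects this would be bucket_counts(training_dataset).
toy_examples = [
    PairExample(["a", "b"], 0.80),
    PairExample(["a", "c"], 0.50),
    PairExample(["a", "d"], 0.15),
    PairExample(["a", "e"], 0.90),
]
print(bucket_counts(toy_examples))
```

Because `InputExample` also exposes a `label` attribute, the same `bucket_counts` function should work on `training_dataset` unchanged.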

### InputExample structure

Each training example is an `InputExample` that holds two sentences in `texts` and a float similarity score in `label`; a higher label means the two sentences should be embedded closer together.

```
train_examples = [
    # Positive example
    InputExample(texts=["Superman is known for his extraordinary powers", "<You> is as strong and dependable as Superman"], label=0.9),
    # Negative or neutral examples
    InputExample(texts=["Superman is a character created by DC Comics", "<You> enjoys reading comic books"], label=0.3),
    InputExample(texts=["Superman often collaborates with other superheroes", "<You> works well in team settings"], label=0.2)
]
```



## Tune Model

### Set loss function

In [None]:
# Loss function: CosineSimilarityLoss fits the cosine similarity of each sentence pair to its float label
train_loss = losses.CosineSimilarityLoss(model)

# Use SoftmaxLoss for NLI tasks
#train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=3)
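
To make the objective concrete: in its default configuration, `CosineSimilarityLoss` computes the cosine similarity between the two sentence embeddings of each pair and penalizes the squared difference from the label. The sketch below reproduces that calculation in plain Python on toy 2-dimensional vectors. The function names and the vectors are made up for illustration; this is not the library's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(embedding_pairs, labels):
    """Mean squared error between pairwise cosine similarity and target labels."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(embedding_pairs, labels)
    ]
    return sum(errors) / len(errors)


# Toy embeddings: the first pair is identical, the second is orthogonal.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
toy_labels = [0.8, 0.15]
print(cosine_mse_loss(toy_pairs, toy_labels))
```

This also explains why the labels are floats in the 0 to 1 range rather than class IDs: they act as regression targets for the similarity score.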

### Begin model tuning

In [None]:
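# DataLoader
train_dataloader = DataLoader(training_dataset, shuffle=True, batch_size=16)

# Warm-up steps: 10% of the total training steps, computed now that the DataLoader exists
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps
          )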
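
For intuition on what `warmup_steps` does: `model.fit` uses a linear warm-up scheduler by default (`scheduler="WarmupLinear"`), which ramps the learning rate up from zero over the warm-up steps and then decays it linearly to zero. The sketch below computes the learning-rate multiplier for such a schedule. It is an illustrative re-implementation under that assumption, and `warmup_linear_multiplier` is a hypothetical name, not a library function.

```python
def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warm-up followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to 1 across the warm-up phase.
        return step / max(1, warmup_steps)
    # Decay from 1 down to 0 across the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Toy schedule: 100 total steps, 10 of them warm-up.
for step in (0, 5, 10, 55, 100):
    print(step, warmup_linear_multiplier(step, warmup_steps=10, total_steps=100))
```

The effective learning rate at a given step is the optimizer's base learning rate times this multiplier.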



## Share the Sentence Transformers model on the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
model.save_to_hub(
    hugging_face_model,
    #train_datasets=["snli"],
    exist_ok=True,
    )

## Test the newly tuned embedding model

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["Hello world to vectors", "Vectors are interesting"]

model = SentenceTransformer(hugging_face_model)
embeddings = model.encode(sentences)
print(embeddings)


## Test the tuned model against the base model

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load your trained model
trained_model = SentenceTransformer(hugging_face_model)

# Load the pre-trained NLI model
nli_model = SentenceTransformer(model_id)

# Example sentences
superman_sentences = ["Superman is a hero known for his strength",
                      "Superman can fly and has x-ray vision"]
sunile_sentences = [supermans_actual_name + " is admired for his strength and courage",
                    supermans_actual_name + " has a vision that guides his actions"]

# Function to calculate similarities
def calculate_similarities(model, sentences1, sentences2):
    embeddings1 = model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = model.encode(sentences2, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
    return cosine_scores

# Calculate similarities for both models
similarities_trained = calculate_similarities(trained_model, superman_sentences, sunile_sentences)
similarities_nli = calculate_similarities(nli_model, superman_sentences, sunile_sentences)

# Print the differences in cosine similarity scores with descriptive text
for i in range(len(superman_sentences)):
    for j in range(len(sunile_sentences)):
        diff = similarities_trained[i][j].item() - similarities_nli[i][j].item()
        change = "increased" if diff > 0 else "decreased"
        abs_diff = abs(diff)
        print(f"Similarity for '{superman_sentences[i]}' and '{sunile_sentences[j]}' {change} by {abs_diff:.4f}")

