### Comparative Study of Text Embeddings: TF-IDF vs. SentenceTransformer
In my recent foray into text analysis, I compared two ways of scoring sentence similarity, labeled 'tfidf' and 'sentencetransformer' in the code. TF-IDF is simple and good at surfacing distinctive words, but it treats a sentence as a bag of words, so it misses synonyms, context, and word order. SentenceTransformer, here backed by a BERT model, encodes each sentence as a dense vector that reflects contextual meaning, so paraphrases can score as similar even when they share few words.

My takeaway: TF-IDF suits tasks that need a quick, word-level comparison, while SentenceTransformer shines when semantic similarity matters. Going forward, I'm eager to experiment with combining the two methods into a more complete text analysis toolkit. This exploration is just the beginning of my work in natural language processing.

### Install and import libraries

In [4]:
!pip install scipy scikit-learn -q
!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
import numpy as np



### Comparing Sentences Using TF-IDF and BERT Embeddings
The compute_similarity function scores how similar two sentences are using the specified method, either 'tfidf' or 'sentencetransformer'. For 'tfidf', it fits a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer on just the two sentences, so the IDF weights come from a two-document corpus, and then computes the cosine similarity of the two vectors. For 'sentencetransformer', it loads a pre-trained BERT model ('bert-base-nli-mean-tokens'), generates the sentence embeddings, and computes their cosine similarity. Because the model is created inside the function, it is loaded again on every call. The function raises a ValueError if any other similarity type is specified.

In [5]:
def compute_similarity(sentence1, sentence2, similarity_type):
    if similarity_type == 'tfidf':
        # Initialize the TF-IDF vectorizer
        vectorizer = TfidfVectorizer()

        # Fit the vectorizer on the two sentences and transform them into TF-IDF vectors
        tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])

        # scipy's cosine() returns a distance, so similarity = 1 - distance
        similarity = 1 - cosine(tfidf_matrix[0].toarray()[0], tfidf_matrix[1].toarray()[0])

    elif similarity_type == 'sentencetransformer':
        # Initialize the SentenceTransformer model
        model = SentenceTransformer('bert-base-nli-mean-tokens')

        # Compute the sentence embeddings
        embeddings = model.encode([sentence1, sentence2])

        # Cosine similarity of the two embeddings (1 - cosine distance)
        similarity = 1 - cosine(embeddings[0], embeddings[1])

    else:
        raise ValueError("Invalid similarity_type. Choose either 'tfidf' or 'sentencetransformer'.")

    return similarity

### Comparing Text Similarity Using TF-IDF and Sentence Transformers
The cells in this section print the similarity between pairs of sentences, starting with "I love programming." and "Coding is my passion.", using both methods: TF-IDF and SentenceTransformer. Each pair is passed to compute_similarity once per method, so the word-overlap score and the embedding-based score can be compared side by side.

In [6]:
sentence1 = "I love programming."
sentence2 = "Coding is my passion."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.0


SentenceTransformer Similarity :  0.7452481985092163


In [8]:
sentence1 = "The cat quickly jumped over the small fence."
sentence2 = "The feline swiftly leaped above the low barrier."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.2523342014336962
SentenceTransformer Similarity :  0.5344064831733704
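To see where the TF-IDF score of about 0.2523 for this pair comes from, here is a minimal pure-Python sketch that recomputes it by hand. It assumes scikit-learn's default TfidfVectorizer settings (lowercasing, tokens of two or more word characters, smoothed IDF of ln((1 + n) / (1 + df)) + 1). The names tokenize and tfidf_cosine are my own for illustration, not part of any library.

```python
import math
import re
from collections import Counter

# Assumption: same tokenization as TfidfVectorizer's defaults
# (lowercase, keep runs of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def tfidf_cosine(sentence1, sentence2):
    """Cosine similarity of two sentences under smoothed TF-IDF weights."""
    counts = [Counter(tokenize(sentence1)), Counter(tokenize(sentence2))]
    n_docs = len(counts)
    vocab = sorted(set(counts[0]) | set(counts[1]))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the number
    # of sentences that contain the term.
    idf = {}
    for term in vocab:
        df = sum(1 for c in counts if term in c)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

    # TF-IDF vectors: raw term count times IDF. Cosine similarity is
    # unaffected by the L2 normalization step, so it is skipped here.
    vectors = [[c[term] * idf[term] for term in vocab] for c in counts]
    dot = sum(a * b for a, b in zip(*vectors))
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0
    return dot / (norms[0] * norms[1])


print(tfidf_cosine("The cat quickly jumped over the small fence.",
                   "The feline swiftly leaped above the low barrier."))
print(tfidf_cosine("I love programming.", "Coding is my passion."))
```

The only token the cat and feline sentences share is "the", so the whole score comes from that one word. The programming and coding sentences share no tokens at all (the single-character "I" is dropped), which is why TF-IDF gives them 0.0.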


In [9]:
sentence1 = "She is an expert in computer programming."
sentence2 = "She excels at coding and software development."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.07780894359285007
SentenceTransformer Similarity :  0.8338290452957153


In [10]:
sentence1 = "The children are playing happily in the park."
sentence2 = "The kids are joyfully engaging in play at the playground."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.38087260847594373
SentenceTransformer Similarity :  0.8964704871177673


In [11]:
sentence1 = "He is very passionate about environmental conservation."
sentence2 = "He has a strong commitment to preserving nature."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.07780894359285007
SentenceTransformer Similarity :  0.8495197296142578


In [12]:
sentence1 = "The company is taking measures to improve employee satisfaction."
sentence2 = "The firm is implementing strategies to enhance staff contentment."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.2019930924979183
SentenceTransformer Similarity :  0.919070303440094


In [13]:
sentence1 = "He likes drawing more than writing."
sentence2 = "More than writing, he likes drawing"
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  1
SentenceTransformer Similarity :  0.9362029433250427
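TF-IDF rates this pair as identical because it only counts words and ignores their order. A small sketch of that idea, assuming the same tokenization as scikit-learn's defaults (lowercase, tokens of two or more word characters); bag_of_words is a name I made up for illustration:

```python
import re
from collections import Counter


def bag_of_words(text):
    # Lowercase and count tokens; punctuation and word order are discarded.
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


bag1 = bag_of_words("He likes drawing more than writing.")
bag2 = bag_of_words("More than writing, he likes drawing")

print(bag1 == bag2)  # True: both sentences contain the same six words once each
```

Identical word counts give identical TF-IDF vectors, so the cosine similarity is 1 no matter how the words are arranged. The SentenceTransformer score stays below 1 because the model reads the tokens in order.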


In [14]:
sentence1 = "The chicken is ready to eat."
sentence2 = "The chicken, ready to eat, is on the table."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.7941021852914736
SentenceTransformer Similarity :  0.9205996990203857


In [15]:
sentence1 = "Visiting relatives can be a nuisance."
sentence2 = "Visiting relatives, one can be a nuisance."
print('TFIDF Similarity : ', compute_similarity(sentence1, sentence2, 'tfidf'))
print('SentenceTransformer Similarity : ', compute_similarity(sentence1, sentence2, 'sentencetransformer'))

TFIDF Similarity :  0.8466473536503035
SentenceTransformer Similarity :  0.9748321175575256
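Every call to compute_similarity with 'sentencetransformer' creates the model again, which gets slow when scoring many pairs. One possible alternative, sketched here under my own assumptions, is to load the model once and encode all sentences in a single batch. The helpers score_pairs, load_encoder, and toy_encode are hypothetical names; toy_encode is only a stand-in so the sketch runs without downloading a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_pairs(pairs, encode):
    """Score each (sentence1, sentence2) pair using one call to `encode`."""
    sentences = [sentence for pair in pairs for sentence in pair]
    embeddings = encode(sentences)  # one embedding per sentence, in order
    return [
        cosine_similarity(embeddings[2 * i], embeddings[2 * i + 1])
        for i in range(len(pairs))
    ]


def load_encoder(model_name="bert-base-nli-mean-tokens"):
    # Imported lazily so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode


def toy_encode(sentences):
    # Hypothetical stand-in encoder: [character count, word count] per sentence.
    return [[float(len(s)), float(len(s.split()))] for s in sentences]


pairs = [
    ("I love programming.", "Coding is my passion."),
    ("He likes drawing more than writing.", "More than writing, he likes drawing"),
]
print(score_pairs(pairs, toy_encode))
```

To score with the real model, pass score_pairs(pairs, load_encoder()) instead of toy_encode. The model is then loaded once for all pairs rather than once per call.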
