# **TEXT SIMILARITY**

**1. Jaccard**

The Jaccard index (also called the Jaccard coefficient) is a statistic for gauging the similarity and diversity of sample sets. It is the size of the intersection divided by the size of the union, which is why it is also called intersection over union. Its complement, 1 minus the index, is the Jaccard distance (or Jaccard dissimilarity).

*Similarity 0.143* is close to 0, which indicates that the two sets are quite dissimilar.

*Similarity 0.6* is closer to 1, which indicates that the two sets are quite similar.


Jaccard Similarity = (number of observations in both sets) / (number in either set)

In [None]:
def jaccard_similarity(x, y):
  # Return (similarity, distance) for two lists of tokens
  set_x, set_y = set(x), set(y)
  intersection_cardinality = len(set_x & set_y)  # number of observations in both sets
  union_cardinality = len(set_x | set_y)  # number of observations in either set
  similarity = intersection_cardinality / union_cardinality
  distance = 1 - similarity
  return similarity, distance

In [None]:
# similar
sentences = ["Digitalisierung wächst trotz Rekordeinnahmen.",
"Digitalisierung wächst bei Rekordeinnahmen."]
sentences = [sent.lower().split(" ") for sent in sentences]
J_Similarity,J_Distance = jaccard_similarity(sentences[0], sentences[1])
print("Jaccard Similarity: ",J_Similarity)
print("Jaccard Distance: ",J_Distance)

In [None]:
# dissimilar
sentences = ["Digitalisierung wächst trotz Rekordeinnahmen.",
"Digitalisierung hat unvorstellbare Folgen."]
sentences = [sent.lower().split(" ") for sent in sentences]
J_Similarity,J_Distance = jaccard_similarity(sentences[0], sentences[1])
print("Jaccard Similarity: ",J_Similarity)
print("Jaccard Distance: ",J_Distance)
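
The cells above split on spaces, so a trailing period stays attached to the last word. The following is a minimal sketch, not part of the original notebook, of how a regex tokenizer changes the result; the helper names `tokenize`, `jaccard` and `jaccard_whitespace` are hypothetical, and it assumes both inputs contain at least one token.

```python
import re

def tokenize(text):
    # Lowercase the text and keep only runs of word characters, dropping punctuation.
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    # Jaccard similarity on punctuation-free token sets.
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def jaccard_whitespace(a, b):
    # Same measure, but splitting on whitespace only (punctuation stays attached).
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

with_period = "Digitalisierung wächst trotz Rekordeinnahmen."
without_period = "Digitalisierung wächst trotz Rekordeinnahmen"

print("Whitespace split:", jaccard_whitespace(with_period, without_period))
print("Regex tokens:    ", jaccard(with_period, without_period))
```

With whitespace splitting, "rekordeinnahmen." and "rekordeinnahmen" count as different tokens, so two sentences with identical words no longer score 1.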

**2. Euclidean Distance**

Euclidean distance (the Euclidean metric) is the length of the line segment between two points, which can be calculated with the Pythagorean theorem.

The shorter the Euclidean distance between two texts, the more similar they are. In the example below, text 2 and text 3 are the closest pair. Long sentences tend to have larger Euclidean distances than short ones.



In [None]:
pip install spacy

In [None]:
import spacy
import pandas as pd
import numpy as np

In [None]:
text_1 = "Sie investieren in Medien und Digitalisierung"
text_2 = "Digitalisierung hat unvorstellbare Folgen"
text_3 = "Digitalisierung: Förderprogramme für Unternehmen 2025"

## Create a list of the sentences
texts = [text_1, text_2, text_3]

from sklearn.feature_extraction.text import CountVectorizer

## First, count the words using the CountVectorizer
# A full list of German stop words is available online
count_vectorizer = CountVectorizer(stop_words=["ein","das","der","die","den"])
matrix = count_vectorizer.fit_transform(texts)

## Create a DataFrame showing the word counts for every sentence
table = matrix.toarray()
df = pd.DataFrame(table,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['text_1', 'text_2', 'text_3'])
df

In [None]:
# Compute the Euclidean distance between these sentences.
# The shorter the distance between two texts, the more similar they are. Here, text 2 and text 3 are the closest pair.

from scipy.spatial import distance

matrix = distance.cdist(df, df, 'euclidean')

df_eucl = pd.DataFrame(matrix,
                  columns= ["Text_1", "Text_2", "Text_3"],
                  index=['text_1', 'text_2', 'text_3'])
df_eucl
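
To see where these distances come from, here is a minimal standard-library sketch that rebuilds the count vectors by hand. It assumes a simple tokenizer (lowercased tokens of two or more word characters) as a stand-in for the vectorizer's behaviour; `count_vector` is a hypothetical helper, not a library function.

```python
import math
import re
from collections import Counter

texts = [
    "Sie investieren in Medien und Digitalisierung",
    "Digitalisierung hat unvorstellbare Folgen",
    "Digitalisierung: Förderprogramme für Unternehmen 2025",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\w\w+", text.lower())

# Shared vocabulary over all texts, in a fixed order.
vocabulary = sorted({token for text in texts for token in tokenize(text)})

def count_vector(text):
    # One count per vocabulary entry.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(text) for text in texts]

d12 = math.dist(vectors[0], vectors[1])
d13 = math.dist(vectors[0], vectors[2])
d23 = math.dist(vectors[1], vectors[2])
print(f"text_1 vs text_2: {d12:.2f}")
print(f"text_1 vs text_3: {d13:.2f}")
print(f"text_2 vs text_3: {d23:.2f}")

# Repeating a text doubles every count: the direction is unchanged, but the distance grows.
doubled = [2 * count for count in vectors[0]]
print(f"text_1 vs text_1 repeated twice: {math.dist(vectors[0], doubled):.2f}")
```

The last line illustrates the length effect: a text and its own repetition contain exactly the same words, yet their Euclidean distance is greater than zero.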

**3. Count vectorizer and cosine similarity**

This section uses sklearn's cosine_similarity and CountVectorizer. CountVectorizer is generally used to featurize text data, whereas OneHotEncoder is only used to featurize categorical variables. One-hot vectors are high-dimensional and sparse, while word embeddings are low-dimensional and dense.

CountVectorizer only counts how often a word appears in a document, which biases the representation towards the most frequent words. Rare words, which could have helped us process the data more effectively, end up being ignored.

To overcome this, we use TfidfVectorizer.

TfidfVectorizer takes into account the weight of a word across the whole collection of documents. It weights the word counts by a measure of how often the words appear in the documents, so the most frequent words can be penalized.


In [None]:
headlines = [
#Finanzen
'Staatsverschuldung wächst trotz Rekordeinnahmen',
'Staatsverschuldung ist mehr als nur eine Sonderzahlung',

#Digitalisierung
'Digitalisierung: Förderprogramme für Unternehmen 2025',
'Sie investieren in Medien und Digitalisierung',
'Digitalisierung wird unvorstellbare Folgen haben',

#Kultur
'Kunst oder Kommunikation: Wie trennbar ist das Werk vom Künstler?']

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
labels = [headline[:20] for headline in headlines]

def create_heatmap(similarity, cmap="Greys"):
  # Plot an annotated heatmap of a square similarity matrix, labelled with the headlines
  df = pd.DataFrame(similarity)
  df.columns = labels
  df.index = labels
  fig, ax = plt.subplots(figsize=(5,5))
  sns.heatmap(df, cmap=cmap, annot=True, ax=ax)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()

create_heatmap(cosine_similarity(arr))

**4. Term Frequency-Inverse Document Frequency (TF-IDF)**

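TF-IDF is a statistical measure used in information retrieval and machine learning to evaluate the importance of a word in a document relative to a collection of documents.

TF-IDF close to 0: the word is not informative, or (in the heatmap) the headlines share little vocabulary.

TF-IDF close to 1: the word is informative, or (in the heatmap) the headlines are very similar.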
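
The weighting can be sketched by hand with the standard library. The snippet assumes the smoothed IDF formula that scikit-learn documents as the default for TfidfVectorizer, ln((1 + n) / (1 + df)) + 1, and leaves out the L2 normalization of each row that the vectorizer applies afterwards. The helper names are hypothetical.

```python
import math
import re

headlines = [
    "Staatsverschuldung wächst trotz Rekordeinnahmen",
    "Staatsverschuldung ist mehr als nur eine Sonderzahlung",
    "Digitalisierung: Förderprogramme für Unternehmen 2025",
    "Sie investieren in Medien und Digitalisierung",
    "Digitalisierung wird unvorstellbare Folgen haben",
    "Kunst oder Kommunikation: Wie trennbar ist das Werk vom Künstler?",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\w\w+", text.lower())

docs = [tokenize(headline) for headline in headlines]

def smoothed_idf(term, docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n_docs = len(docs)
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(term, doc, docs):
    # Raw term count in the document, weighted by the smoothed IDF.
    return doc.count(term) * smoothed_idf(term, docs)

# Both words occur once in the fourth headline, but in a different number of headlines overall.
for term in ("digitalisierung", "medien"):
    print(term, round(tfidf(term, docs[3], docs), 3))
```

"Digitalisierung" occurs in three headlines and "Medien" in only one, so "Medien" receives the larger weight even though both appear once in the fourth headline.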

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()

create_heatmap(cosine_similarity(arr))

**5. Word2Vec**

Word2vec is an open-source tool from Google for computing distances between words. Given an input word, it can output a list of words ranked by similarity. The cells below use the vectors of spaCy's German pipelines rather than the original Word2vec tool.
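
The "input a word, get a ranked list" idea can be sketched without any trained model. The three-dimensional vectors below are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from text), and `most_similar` is a hypothetical helper, not the API of any library.

```python
import math

# Hypothetical toy "embeddings", made up for this example only.
toy_vectors = {
    "hund": [0.9, 0.1, 0.0],
    "katze": [0.8, 0.2, 0.1],
    "auto": [0.1, 0.9, 0.3],
    "fahrrad": [0.0, 0.8, 0.4],
}

def cosine(a, b):
    # Cosine of the angle between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, top_n=3):
    # Rank all other words by cosine similarity to the query word.
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

for neighbour, score in most_similar("hund", toy_vectors):
    print(f"{neighbour}: {score:.2f}")
```

With these made-up vectors, "katze" ranks first for the query "hund" because the two vectors point in almost the same direction.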

In [None]:
!python -m spacy download de_core_news_sm
!python -m spacy download de_core_news_md
!python -m spacy download de_core_news_lg

In [None]:
# The small (sm) pipeline ships without static word vectors, so the medium (md) one is loaded here
nlp = spacy.load("de_core_news_md")
docs = [nlp(headline) for headline in headlines]

In [None]:
# Pairwise similarity between all headline documents
similarity = [[doc_i.similarity(doc_j) for doc_j in docs] for doc_i in docs]
create_heatmap(similarity)

In [None]:
print(docs[0].vector)

**6.1. Cosine Similarity Torch**

In [None]:
import torch

def format_pytorch_version(version):
    return version.split('+')[0]

def format_cuda_version(version):
    # torch.version.cuda is None on CPU-only builds
    return 'cpu' if version is None else 'cu' + version.replace('.', '')

TORCH_version = torch.__version__
TORCH = format_pytorch_version(TORCH_version)
CUDA_version = torch.version.cuda
CUDA = format_cuda_version(CUDA_version)

In [None]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-geometric

In [None]:
pip install transformers

In [None]:
pip install sentence-transformers

In [None]:
import torch_geometric

torch_geometric.__version__

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'axes.facecolor':'dimgrey', 'grid.color':'lightgrey'})
import numpy as np
import pandas as pd
import networkx as nx
import torch.nn.functional as F
import torch.nn as nn
import torch_scatter
from sentence_transformers import SentenceTransformer, util
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
sentence1 = "Digitalisierung: Förderprogramme für Unternehmen 2025."
sentence2 = "Sie investieren in Medien und Digitalisierung."

# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# cosine similarity score of the two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())

In [None]:
sentences1 = ["Sie investieren in Medien und Digitalisierung.", "Digitalisierung wird unvorstellbare Folgen haben.", "Digitalisierung: Förderprogramme für Unternehmen 2025."]
sentences2 = ["Staatsverschuldung ist mehr als nur eine Sonderzahlung.", "Staatsverschuldung wächst trotz Rekordeinnahmen."]

# encode list of sentences to get their embeddings
embedding1 = model.encode(sentences1, convert_to_tensor=True)
embedding2 = model.encode(sentences2, convert_to_tensor=True)

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()

**6.2. Cosine Similarity Vectorizer**


Cosine similarity measures the similarity of two vectors as the cosine of the angle between them. It indicates whether two vectors point in roughly the same direction: if the angle between the vectors is 0 degrees, the cosine similarity is 1.


The cosine of an angle takes a value between -1 and 1. For word-count vectors, whose entries are never negative, the value lies between 0 and 1. If the two texts have no word in common, the numerator of the fraction (the dot product) is zero, and so is the similarity.
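
These properties can be checked with a few lines of standard-library Python. This is a minimal sketch on small hand-made vectors, and `cosine_similarity_plain` is a hypothetical helper name chosen to avoid clashing with sklearn's function.

```python
import math

def cosine_similarity_plain(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length (for example a text repeated twice).
same_direction = cosine_similarity_plain([1, 1, 0], [2, 2, 0])

# Count vectors of two texts without any shared word.
no_overlap = cosine_similarity_plain([1, 1, 0], [0, 0, 3])

# Negative values need negative entries, which word counts never have.
opposite = cosine_similarity_plain([1, 0], [-1, 0])

print("same direction:", round(same_direction, 3))
print("no shared words:", no_overlap)
print("opposite direction:", opposite)
```

Unlike Euclidean distance, the score does not change when one vector is scaled, which is why cosine similarity is less sensitive to text length.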

In [None]:
text_1 = "Sie investieren in Medien und Digitalisierung"
text_2 = "Digitalisierung hat unvorstellbare Folgen"
text_3 = "Digitalisierung: Förderprogramme für Unternehmen 2025"

texts = [text_1, text_2, text_3]
print(texts)

In [None]:
## Construct the bag-of-words table again
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorizer = CountVectorizer(stop_words=["ein","das","der","die","den"])
matrix = count_vectorizer.fit_transform(texts)

In [None]:
## Create a DataFrame showing the word counts for every sentence
table = matrix.toarray()
df = pd.DataFrame(table,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['text_1', 'text_2', 'text_3'])

In [None]:
## Apply cosine similarity; the scale is 0 to 1, and closer to 1 means more similar
from sklearn.metrics.pairwise import cosine_similarity
values = cosine_similarity(df, df)
df = pd.DataFrame(values, columns=["Text 1", "Text 2", "Text 3"], index = ["Text 1", "Text 2", "Text 3"])
print(df)

**7. ELMo (Embeddings from Language Models)**

There is a repository of models: https://vectors.nlpl.eu/repository/
You can download an ELMo model for German there: ID 201, German Wikipedia Dump of March 2020. The model is about 200 MB and provides rich contextual information.

You can also use the model online from Kaggle pages.

In [None]:
pip install simple_elmo

In [None]:
import tensorflow.compat.v1 as tf # compatible only for multilingual USE, large USE, EN-DE USE, ...
#import tensorflow as tf # version 2 is compatible only for USE 4, ..
import tensorflow_hub as hub
import spacy
import logging
from scipy import spatial
from simple_elmo import ElmoModel

In [None]:
logging.getLogger('tensorflow').disabled = True  # optional: silence TensorFlow log output

# elmo = hub.Module('path if downloaded/Elmo_downloaded', trainable=False)
elmo = hub.load("https://tfhub.dev/google/elmo/3")  # this model is optimised for English only

tensor_of_strings = tf.constant(["Grau","Schnell","Langsam"])
elmo.signatures['default'](tensor_of_strings)

In [None]:
import zipfile

In [None]:
model = ElmoModel()
de_model="201.zip" # path to the downloaded zip file in the working directory
model.load(de_model)

sentence = "Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu denen du sehr gern beitragen kannst."

In [None]:
elmo_vectors = model.get_elmo_vectors(sentence, layers="average")
print(f"Tensor shape: {elmo_vectors.shape}")

In [None]:
Projekt = np.sum(elmo_vectors[0][29:33], axis = 0)/4
Aufbau = np.sum(elmo_vectors[0][45:49], axis = 0)/4
Inhalten = np.sum(elmo_vectors[0][87:91], axis = 0)/4

In [None]:
Projekt = Projekt.reshape(1,-1)
Aufbau = Aufbau.reshape(1,-1)
Inhalten = Inhalten.reshape(1,-1)

In [None]:
# cosine_similarity returns a 1x1 array; .item() extracts the scalar for formatting
diff_1 = cosine_similarity(Projekt, Aufbau).item()
diff_2 = cosine_similarity(Aufbau, Inhalten).item()
same = cosine_similarity(Projekt, Inhalten).item()

print('Vector similarity for  *similar*  meanings:  %.2f' % same)
print('Vector similarity for *different* meanings:  %.2f' % diff_1)
print('Vector similarity for *different* meanings:  %.2f' % diff_2)

**8. RoBERTa**

The all-roberta-large-v1 model is a sentence transformer developed by the sentence-transformers team. It maps sentences and paragraphs to a 1024-dimensional dense vector space, enabling tasks like clustering and semantic search. This model is based on the RoBERTa architecture and can be used through the sentence-transformers library or directly with the HuggingFace Transformers library.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')

In [None]:
sentences = [
#Finanzen
"Staatsverschuldung wächst trotz Rekordeinnahmen",
"Staatsverschuldung ist mehr als nur eine Sonderzahlung"]

In [None]:
embeddings = model.encode(sentences)
print(embeddings)

similarities = model.similarity(embeddings, embeddings)
print(similarities)
# A similarity score of 0.7860 means the model rates the two sentences as very similar.

**9. Universal Sentence Encoder (USE by Google)**


The universal sentence encoder model encodes textual data into high dimensional vectors known as embeddings which are numerical representations of the textual data. It specifically targets transfer learning to other NLP tasks, such as text classification, semantic similarity, and clustering. The pre-trained Universal Sentence Encoder is publicly available in Tensorflow-hub.

It is trained on a variety of data sources to learn for a wide variety of tasks. The sources are Wikipedia, web news, web question-answer pages, and discussion forums.

The **XLING** model is trained for English and German and is compatible with TensorFlow version 1.



In [None]:
pip install tf_sentencepiece

In [None]:
pip install sentencepiece

In [None]:
pip install tensorflow==2.18.0

In [None]:
import tensorflow.compat.v1 as tf # compatible only for multilingual USE, large USE, EN-DE USE, ...
#import tensorflow as tf # version 2 is compatible only for USE 4, ..
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import sklearn
import sentencepiece

In [None]:
#module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
#module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"
module_url = "https://tfhub.dev/google/universal-sentence-encoder-xling/en-de/1"
#model = hub.load("https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2")
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(texts):
  # Return one embedding vector per input text
  return model(texts)

df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=[[10,"Staatsverschuldung wächst trotz Rekordeinnahmen"],
                                                      [11,"Staatsverschuldung ist mehr als nur eine Sonderzahlung"],
                                                      [12,"Digitalisierung: Förderprogramme für Unternehmen 2025"],
                                                      [13,"Sie investieren in Medien und Digitalisierung"],
                                                      [14,"Digitalisierung wird unvorstellbare Folgen haben"],
                                                      [15,"Kunst oder Kommunikation: Wie trennbar ist das Werk vom Künstler"]
                                                      ])

In [None]:
message_embeddings = embed(list(df['DESCRIPTION']))
cos_sim = sklearn.metrics.pairwise.cosine_similarity(message_embeddings)

In [None]:
def plot_similarity(labels, corr_matrix):
  sns.set(font_scale=0.9)
  g = sns.heatmap(
      corr_matrix,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="Greys",
      annot=True)
  g.set_xticklabels(labels, rotation=90)
  g.set_title("Semantic Textual Similarity")

plot_similarity(list(df['DESCRIPTION']), cos_sim)

**10. One Hot Encoding**

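One-hot encoding represents each category (for text, each word) as a binary vector with a single 1. It is a basic way to represent text; decision trees and dictionary-based approaches are more evolved.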

In [None]:
import sklearn

In [None]:
import pandas as pd

# Example dataset
df = pd.DataFrame({
    'Farbe': ['rot', 'grün', 'blau']
})

# Apply One-Hot Encoding
encoded_df = pd.get_dummies(df['Farbe'])
print(encoded_df)
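
For similarity purposes it helps to see what these vectors can and cannot express. The sketch below builds one-hot vectors by hand (the `one_hot` and `dot` helpers are hypothetical) and shows that any two different categories are orthogonal.

```python
def one_hot(word, vocabulary):
    # A vector with a single 1 at the position of the word in the vocabulary.
    return [1 if word == entry else 0 for entry in vocabulary]

def dot(a, b):
    # Dot product of two equally long vectors.
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["blau", "grün", "rot"]
vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

for word, vector in vectors.items():
    print(word, vector)

print("rot . rot  =", dot(vectors["rot"], vectors["rot"]))
print("rot . grün =", dot(vectors["rot"], vectors["grün"]))
```

Because every pair of distinct one-hot vectors has a dot product of 0, their cosine similarity is always 0: the encoding can tell categories apart but carries no notion of one being closer to another.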

**11. Bag of Words**

Bag of Words is useful in many NLP tasks, for example feature extraction and document similarity, and it is simple and efficient.

We can use it to calculate the cosine similarity.

In [None]:
import spacy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# A corpus containing a collection of sentences
corpus = [
"Staatsverschuldung wächst trotz Rekordeinnahmen",
"Staatsverschuldung ist mehr als nur eine Sonderzahlung"
]

In [None]:
# Initialize vectorizer
vectorizer = CountVectorizer()

In [None]:
# Fit vectorizer to corpus
bow = vectorizer.fit_transform(corpus)

In [None]:
# View vocabulary
vectorizer.vocabulary_

In [None]:
print(bow)

In [None]:
# Dense matrix representation
bow.toarray()

In [None]:
# Load German language model
nlp = spacy.load('de_core_news_sm')

# Define custom tokenizer (remove stop words and punctuation and apply lemmatization)
def custom_tokenizer(doc):
    return [t.lemma_ for t in nlp(doc) if (not t.is_punct) and (not t.is_stop)]

In [None]:
# Pass tokenizer as callback function to countvectorizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, binary=True)

# Fit vectorizer to corpus
bow = vectorizer.fit_transform(corpus)

In [None]:
# Vocabulary
vectorizer.vocabulary_

In [None]:
# Dense matrix representation
bow.toarray()

In [None]:
# Sparse slice
print(bow[:,0:4])

In [None]:
# Cosine similarity using numpy
def cosine_sim(a,b):
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))

In [None]:
# Similarity between two documents
print(corpus[0])
print(corpus[1])
print(f'Similarity score: {cosine_sim(bow[0].toarray().squeeze(),bow[1].toarray().squeeze()):.1f}')

In [None]:
print(cosine_similarity(bow))