Google Bard:

Combining two approaches, first using bag-of-words similarity and then calculating Levenshtein distance for highly similar cases, is a reasonable strategy for determining the similarity between two text sequences. This approach can help you balance efficiency and accuracy.

Bag-of-words comparison is a computationally efficient method that can quickly identify text pairs that share most of their vocabulary. However, it ignores word order entirely, and it measures lexical overlap rather than meaning. Levenshtein distance, by contrast, is sensitive to the order of characters and can capture finer-grained differences in the structure of words or phrases.

By initially using bag-of-words comparison, you can quickly identify text pairs that are likely to be similar and then focus on calculating Levenshtein distance only for those pairs. This can significantly improve the efficiency of your similarity calculations, especially when dealing with large datasets.

Here's a summary of the proposed approach:

Perform bag-of-words similarity: Calculate the similarity between two text sequences using a bag-of-words approach. This will provide a quick initial assessment of their similarity.

Filter for highly similar pairs: Select pairs of text sequences that exhibit a high degree of similarity based on the bag-of-words comparison. This will narrow down the number of pairs requiring further analysis.

Calculate Levenshtein distance: For the selected pairs, calculate the Levenshtein distance to obtain a more precise measure of their similarity, taking into account the order of characters.

This approach offers a balance between efficiency and accuracy, allowing you to identify similar text pairs quickly while also capturing finer-grained differences in word order and structure for highly similar cases.
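
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not a definitive implementation: the helper names (`normalize`, `jaccard`, `levenshtein`, `two_stage`) and the `threshold=0.5` cut-off are assumptions chosen for the example, and the edit distance is a plain-Python dynamic programme so the block runs without third-party packages.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation and collapse whitespace (an assumed normalisation)."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def jaccard(text1, text2):
    """Word-set overlap |A & B| / |A | B|; word order is ignored."""
    a, b = set(normalize(text1).split()), set(normalize(text2).split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def levenshtein(s, t):
    """Character-level edit distance using the standard two-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def two_stage(pairs, threshold=0.5):
    """Yield (text1, text2, jaccard, distance) only for pairs that pass the cheap filter."""
    for t1, t2 in pairs:
        score = jaccard(t1, t2)
        if score >= threshold:  # threshold is an illustrative assumption
            yield t1, t2, score, levenshtein(normalize(t1), normalize(t2))


pairs = [("Hello, world!", "WORLD Hello "), ("Hello, world!", "Hi there!")]
results = list(two_stage(pairs))
for row in results:
    print(row)
```

Only the first pair shares any words, so the more expensive edit distance is computed once; the second pair is dropped by the filter.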

If you're specifically interested in comparing the overall meaning or sentiment of the two strings, you might explore techniques based on natural language processing (NLP), such as word embeddings or semantic similarity measures. These approaches can provide a more comprehensive assessment of the semantic relatedness between the text sequences.
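
To make the word-embedding idea concrete, here is a toy sketch that averages word vectors per sentence and compares the averages with cosine similarity. The vectors in `toy_vectors` are invented for illustration only (real embeddings such as word2vec have hundreds of dimensions and are learned from data), and the helper names are hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors, made up purely for this example.
toy_vectors = {
    "hello": [0.9, 0.1, 0.0],
    "hi":    [0.8, 0.2, 0.0],
    "world": [0.1, 0.9, 0.1],
    "there": [0.2, 0.7, 0.2],
}


def sentence_vector(text, vectors):
    """Average the vectors of the words in `text` that appear in `vectors`."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words in text")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


sim = cosine(sentence_vector("Hello, world!", toy_vectors),
             sentence_vector("Hi there!", toy_vectors))
print(f"cosine similarity: {sim:.3f}")
```

The two sentences share no words, so a word-overlap measure scores them 0, while the averaged toy vectors point in nearly the same direction. That contrast is the reason to reach for embeddings when meaning matters more than wording.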

In [1]:
import string
import regex

In [2]:
def remove_punctuation(text):
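    punctuation = string.punctuation + '…'  # also strip the Unicode ellipsis character
    # Escape the characters so that ']', '\' and '^' are literal inside the character class
    return regex.sub(r'[{}]'.format(regex.escape(punctuation)), '', text)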

In [7]:
def jaccard_similarity(text1, text2):
    words1 = remove_punctuation(text1).lower().split()
    words2 = remove_punctuation(text2).lower().split()
    intersection = set(words1) & set(words2)
    union = set(words1) | set(words2)
    print(f"intersect: {intersection}")
    print(f"union: {union}")
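    # Two empty texts give an empty union; treat them as identical instead of dividing by zero
    return len(intersection) / len(union) if union else 1.0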

In [13]:
text1 = "Hello, world!"
text2 = "WORLD Hello "

similarity = jaccard_similarity(text1, text2)
print(f"Similarity: {similarity}")

intersect: {'world', 'hello'}
union: {'world', 'hello'}
Similarity: 1.0


In [22]:
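import Levenshtein  # must be imported before the function below is called

def levenshtein_distance(text1, text2, ignore_punctuation=True, ignore_case=True):
    # Optionally normalise both texts, then count character-level edits
    if ignore_punctuation:
        text1 = remove_punctuation(text1)
        text2 = remove_punctuation(text2)
    if ignore_case:
        text1 = text1.lower()
        text2 = text2.lower()

    return Levenshtein.distance(text1, text2)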

text1 = "Hello, world!"
text2 = "hello word"  # "Hi there!"

# text1 = "Hello, world!"
# text2 = "WORLD Hello "

distance = levenshtein_distance(text1, text2)
print("Levenshtein distance:", distance)

Levenshtein distance: 1


In [24]:
import Levenshtein

def jaccard_similarity(text1, text2):
    words1 = text1.split()
    words2 = text2.split()
    intersection = set(words1) & set(words2)
    union = set(words1) | set(words2)
    # Guard against an empty union (both texts empty after normalisation)
    return len(intersection) / len(union) if union else 1.0

def edit_distance(text1, text2, ignore_punctuation=True, ignore_case=True, ignore_order=True):
    if ignore_punctuation:
        text1 = remove_punctuation(text1)
        text2 = remove_punctuation(text2)
    if ignore_case:
        text1 = text1.lower()
        text2 = text2.lower()
        
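    # Note: the two branches return values on different scales.
    # ignore_order=True  -> Jaccard similarity, float in [0, 1], higher means more similar
    # ignore_order=False -> Levenshtein distance, integer >= 0, lower means more similar
    if ignore_order:
        return jaccard_similarity(text1, text2)
    else:
        return Levenshtein.distance(text1, text2)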

In [27]:
text1 = "Hello, world!"
text2 = "halo word"  # "Hi there!"
sim = edit_distance(text1, text2)
print(f"similarity: {sim}")

similarity: 0.0


In [28]:
# With ignore_order=False the result is a Levenshtein distance, not a similarity
dist = edit_distance(text1, text2, ignore_order=False)
print(f"distance: {dist}")

distance: 3
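
Because `edit_distance` returns a 0 to 1 similarity in one mode and a raw edit count in the other, the two results cannot be compared directly. One common way to put the edit count on the same scale is to divide it by the length of the longer string. The sketch below assumes that normalisation; `normalized_similarity` is a hypothetical helper, and the plain-Python `levenshtein` stands in for `Levenshtein.distance` so the block runs on its own.

```python
def levenshtein(s, t):
    """Character-level edit distance using the standard two-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(s, t):
    """Map edit distance to [0, 1]: 1.0 for identical strings, 0.0 when every position differs."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s, t) / longest


score = normalized_similarity("hello world", "halo word")
print(f"normalized similarity: {score:.3f}")
```

For the pair above, the distance of 3 over a longest length of 11 gives a score of 1 - 3/11, which can be thresholded the same way as the Jaccard score.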


Semantic similarity between two sentences

see https://github.com/mmihaltz/word2vec-GoogleNews-vectors/tree/master

too big (3 GB)
https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300

ignore

In [29]:
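import gensim.models

# Load pre-trained word embeddings
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Calculate semantic similarity
text1 = "Hello, world!"
text2 = "Hi there!"

# model.similarity() compares two single vocabulary words, not whole sentences,
# so split each sentence into words and keep only those the model knows
words1 = [w for w in remove_punctuation(text1).split() if w in model]
words2 = [w for w in remove_punctuation(text2).split() if w in model]

# n_similarity() gives the cosine similarity between the mean vectors of the two word lists
similarity = model.n_similarity(words1, words2)
print("Similarity:", similarity)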

FileNotFoundError: [Errno 2] No such file or directory: 'GoogleNews-vectors-negative300.bin'

In [30]:
ticker_s = """
QUAD
LYTS
NNBR
EBF
AMRX
ETON
FBIO
HEES
AGX
AMPH
ARCB
AZZ
BBIO
CARR
CRS
EOLS
ETN
FLS
FUL
GWW
INSW
MTRN
PKE
PKOH
UFPI
USLM
VMC
VRSK
WLDN
CABA
CMPR
IESC
NPK
ROCK
SFL
SNA
SPXC
SXI
VRRM
WAB
ALG
AMWD
BECN
"""

In [31]:
tickers = [i.strip() for i in ticker_s.split("\n") if i.strip()]

In [33]:
print(" ".join(tickers))

QUAD LYTS NNBR EBF AMRX ETON FBIO HEES AGX AMPH ARCB AZZ BBIO CARR CRS EOLS ETN FLS FUL GWW INSW MTRN PKE PKOH UFPI USLM VMC VRSK WLDN CABA CMPR IESC NPK ROCK SFL SNA SPXC SXI VRRM WAB ALG AMWD BECN
