<a href="https://colab.research.google.com/github/rahiakela/data-science-research-and-practice/blob/main/data-science-bookcamp/case-study-4--job-resume-improvement/01_measuring_text_similarities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Measuring text similarities

In this notebook, we focus on a basic NLP problem: **measuring the similarity between two texts**. 

We will quickly arrive at a workable solution, but one that is not computationally efficient.
We will then explore a series of numerical techniques for computing text similarities rapidly.

##Setup

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from collections import defaultdict
from collections import Counter
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

##Simple text comparison

Suppose we want to compare three simple texts:

```text
text1: She sells seashells by the seashore.
text2: “Seashells! The seashells are on sale! By the seashore.”
text3: She sells 3 seashells to John, who lives by the lake.
```

Our goal is to determine whether `text1` is more similar to `text2` or to `text3`.



In [3]:
# Assigning texts to variables
text1 = "She sells seashells by the seashore."
text2 = '"Seashells! The seashells are on sale! By the seashore."'
text3 = "She sells 3 seashells to John, who lives by the lake."

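Now we need to quantify the differences between the texts. As a first step, we split each text into words, so that the texts can be compared word by word.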

In [4]:
# Splitting texts into words
words_lists = [text.split() for text in [text1, text2, text3]]
words1, words2, words3 = words_lists

for i, words in enumerate(words_lists, 1):
  print(f"Words in text {i}")
  print(f"{words}\n")

Words in text 1
['She', 'sells', 'seashells', 'by', 'the', 'seashore.']

Words in text 2
['"Seashells!', 'The', 'seashells', 'are', 'on', 'sale!', 'By', 'the', 'seashore."']

Words in text 3
['She', 'sells', '3', 'seashells', 'to', 'John,', 'who', 'lives', 'by', 'the', 'lake.']



In [5]:
# Removing case sensitivity and punctuation
def simplify_text(text):
  for punctuation in ['.', ',', '!', '?', '"']:
    text = text.replace(punctuation, "")
  return text.lower()

In [6]:
for i, words in enumerate(words_lists, 1):
  for j, word in enumerate(words):
    words[j] = simplify_text(word)
  print(f"Words in text {i}")
  print(f"{words}\n")

Words in text 1
['she', 'sells', 'seashells', 'by', 'the', 'seashore']

Words in text 2
['seashells', 'the', 'seashells', 'are', 'on', 'sale', 'by', 'the', 'seashore']

Words in text 3
['she', 'sells', '3', 'seashells', 'to', 'john', 'who', 'lives', 'by', 'the', 'lake']
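The splitting and simplification steps above can also be folded into a single pass. The sketch below is one possible variant, not the notebook's own method: `text_to_words` is a hypothetical helper name, and it strips every character in `string.punctuation` rather than only the five marks handled by `simplify_text`. For our three texts the two approaches produce the same word lists.

```python
import string

# Hypothetical helper (not part of the original notebook).
# Assumption: removing ALL ASCII punctuation is acceptable for these texts.
def text_to_words(text):
  """Lowercase a text, delete ASCII punctuation, and split on whitespace."""
  table = str.maketrans("", "", string.punctuation)
  return text.translate(table).lower().split()

text1 = "She sells seashells by the seashore."
text2 = '"Seashells! The seashells are on sale! By the seashore."'
text3 = "She sells 3 seashells to John, who lives by the lake."

simplified_lists = [text_to_words(text) for text in [text1, text2, text3]]
for i, words in enumerate(simplified_lists, 1):
  print(f"Words in text {i}")
  print(f"{words}\n")
```

Note that deleting all punctuation is a blunt choice: a contraction such as `don't` would become `dont`, so texts with apostrophes or hyphens may need a more careful rule.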



In [7]:
# Converting word lists to sets
words_sets = [set(words) for words in words_lists]
for i, unique_words in enumerate(words_sets, 1):
  print(f"Unique Words in text {i}")
  print(f"{unique_words}\n")

Unique Words in text 1
{'she', 'sells', 'the', 'by', 'seashells', 'seashore'}

Unique Words in text 2
{'on', 'sale', 'the', 'are', 'by', 'seashells', 'seashore'}

Unique Words in text 3
{'she', 'to', 'lives', 'sells', 'the', 'by', 'seashells', 'lake', 'who', 'john', '3'}



In [8]:
# Extracting overlapping words between two texts
words_set1 = words_sets[0]
for i, words_set in enumerate(words_sets[1:], 2):
  shared_words = words_set1 & words_set
  print(f"Texts 1 and {i} share these {len(shared_words)} words:")
  print(f"{shared_words}\n")

Texts 1 and 2 share these 4 words:
{'by', 'seashells', 'seashore', 'the'}

Texts 1 and 3 share these 5 words:
{'she', 'the', 'sells', 'by', 'seashells'}



In [9]:
# Extracting diverging words between two texts
for i, words_set in enumerate(words_sets[1:], 2):
  diverging_words = words_set1 ^ words_set
  print(f"Texts 1 and {i} don't share these {len(diverging_words)} words:")
  print(f"{diverging_words}\n")

Texts 1 and 2 don't share these 5 words:
{'on', 'sale', 'she', 'sells', 'are'}

Texts 1 and 3 don't share these 7 words:
{'to', 'lives', 'lake', 'seashore', 'who', 'john', '3'}
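The `^` operator returns the symmetric difference, which mixes together the words found only in text 1 and the words found only in the other text. As a small sketch, the diverging words can be separated by side with the set difference operator `-`. The helper name `split_divergence` is hypothetical, and the word sets are re-typed here from the output above so that the snippet runs on its own.

```python
# Word sets copied from the notebook output above
words_set1 = {'she', 'sells', 'seashells', 'by', 'the', 'seashore'}
words_set2 = {'seashells', 'the', 'are', 'on', 'sale', 'by', 'seashore'}
words_set3 = {'she', 'sells', '3', 'seashells', 'to', 'john', 'who',
              'lives', 'by', 'the', 'lake'}

# Hypothetical helper (not part of the original notebook)
def split_divergence(set_a, set_b):
  """Return (words only in set_a, words only in set_b)."""
  only_a = set_a - set_b
  only_b = set_b - set_a
  # Together, the two one-sided differences form the symmetric difference
  assert only_a | only_b == set_a ^ set_b
  return only_a, only_b

divergences = {}
for i, words_set in enumerate([words_set2, words_set3], 2):
  only_1, only_other = split_divergence(words_set1, words_set)
  divergences[i] = (only_1, only_other)
  print(f"Only in text 1: {sorted(only_1)}")
  print(f"Only in text {i}: {sorted(only_other)}\n")
```

Against text 2, text 1 contributes two of the five diverging words (`she` and `sells`). Against text 3, it contributes only `seashore`, and the remaining six diverging words all come from text 3.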



To combine overlap and divergence into a single similarity score, we first need the full collection of words that appear in either text, overlapping and diverging alike.

This aggregation is called a union, and it contains every unique word across the two texts.
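As a minimal sketch, the union can be computed with Python's `|` set operator. The word sets below are re-typed from the earlier output so that the snippet is self-contained. Because every word in the union is either shared or diverging, the union size equals the sum of the two earlier counts. The last value in each summary, the fraction of the union that is shared, is shown only as one possible way to turn these counts into a score (it is commonly known as the Jaccard similarity); treating it as the final score is an assumption here, not something established so far.

```python
# Word sets copied from the notebook output above
words_sets = [
  {'she', 'sells', 'seashells', 'by', 'the', 'seashore'},
  {'seashells', 'the', 'are', 'on', 'sale', 'by', 'seashore'},
  {'she', 'sells', '3', 'seashells', 'to', 'john', 'who',
   'lives', 'by', 'the', 'lake'},
]

words_set1 = words_sets[0]
summaries = []
for i, words_set in enumerate(words_sets[1:], 2):
  total_words = words_set1 | words_set      # union
  shared_words = words_set1 & words_set     # overlap
  diverging_words = words_set1 ^ words_set  # divergence
  # Shared and diverging words never intersect, so their counts add up
  assert len(total_words) == len(shared_words) + len(diverging_words)
  # Assumption: fraction of shared words as one candidate similarity score
  shared_fraction = len(shared_words) / len(total_words)
  summaries.append((i, len(total_words), shared_fraction))
  print(f"Texts 1 and {i} contain {len(total_words)} unique words in total.")
  print(f"{100 * shared_fraction:.1f}% of them are shared.\n")
```

Under this candidate score, text 1 shares 4 of 9 words with text 2 and 5 of 12 words with text 3, so text 2 comes out slightly ahead even though text 3 has more shared words in absolute terms.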