# Assignment 5 - NLP

## Student Details

**`Name`** Montgomery Gole, Viral Bankimbhai Thakar

**`Email`** mgole@torontomu.ca, vthakar@torontomu.ca

**`Student ID`** 501156495, 501213983

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import nltk
import string
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## PART 1: Natural Language Processing - Finding Text Similarities

This part is about finding similarities between texts and documents. **Given the following sentences,
use any similarity measure or clustering method (e.g., k-Means or hierarchical clustering) to show
which sentences are most similar.**

Use at least three methods from TF, TF-IDF, bag of words, Word2Vec, or shingling. Then use a proper distance measure (e.g., Jaccard, cosine, edit, or Hamming distance) to cluster the following data and show the similarity between the sentences.

Note that if you create your own distance measure, you must prove that it is a valid distance measure. Here are the sentences:

In [3]:
data_dict = {
    "id": [1, 2, 3, 4, 5, 6],
    "text": [
        "In the past John liked only sport but now he likes sport and politics",
        "Sam only liked politics but now he is fan of both music and politics",
        "Sara likes both books and politics but in the past she only read books",
        "Robert loved both books and nature but now he only reads books",
        "Linda liked books and sport but she only likes sport now",
        "Alison used to loved nature but currently she likes both nature and sport"
    ]
}

As a first step, before building any vector representation, we need to preprocess the sentences:
- Convert text to lower case.
- Remove punctuation.
- Tokenize text.
- Remove English stop words.

In [4]:
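# Define a function to preprocess the text
# (assumes the NLTK tokenizer and stop word data are already downloaded)
def preprocess_text(text):
    # Convert to lower case
    text = text.lower()
    # Strip punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Split into word tokens
    tokens = word_tokenize(text)
    # Drop English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Join the remaining tokens back into a single string
    return ' '.join(tokens)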

# Preprocess the text in the data_dict
data_dict['text'] = [preprocess_text(text) for text in data_dict['text']]
data_dict

{'id': [1, 2, 3, 4, 5, 6],
 'text': ['past john liked sport likes sport politics',
  'sam liked politics fan music politics',
  'sara likes books politics past read books',
  'robert loved books nature reads books',
  'linda liked books sport likes sport',
  'alison used loved nature currently likes nature sport']}
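
The task statement lists Jaccard distance among the candidate measures and asks that any distance be a valid one. As a minimal sketch (not part of the original solution), the block below computes Jaccard distance on the word sets of the preprocessed sentences and checks the metric axioms empirically on this data. The helper name `jaccard_distance` is an illustrative choice, and a numerical check on six sentences is not a proof; Jaccard distance is, however, known to satisfy the triangle inequality in general.

```python
from itertools import combinations, permutations

# Preprocessed sentences, copied from the output above
sentences = [
    "past john liked sport likes sport politics",
    "sam liked politics fan music politics",
    "sara likes books politics past read books",
    "robert loved books nature reads books",
    "linda liked books sport likes sport",
    "alison used loved nature currently likes nature sport",
]
token_sets = [set(s.split()) for s in sentences]


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


n = len(token_sets)
dist = [[jaccard_distance(token_sets[i], token_sets[j]) for j in range(n)]
        for i in range(n)]

# Empirical checks of the metric axioms on this data set only
identity_ok = all(dist[i][i] == 0.0 for i in range(n))
symmetry_ok = all(dist[i][j] == dist[j][i]
                  for i, j in combinations(range(n), 2))
triangle_ok = all(dist[i][k] <= dist[i][j] + dist[j][k] + 1e-12
                  for i, j, k in permutations(range(n), 3))

print(identity_ok, symmetry_ok, triangle_ok)
print(dist[0][4])  # sentences 1 and 5 share 3 of 8 distinct words
```

Because Jaccard works on sets, repeated words such as "sport" count only once, which is one way it differs from the count-based methods in the following sections.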

### Bag of Words with Cosine Distance

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a bag of words matrix
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(data_dict['text'])

# Calculate pairwise cosine similarity between the sentences
cosine_sim = cosine_similarity(bow_matrix)

# Print the similarity matrix
print(cosine_sim)

[[1.         0.35355339 0.33333333 0.         0.70710678 0.31622777]
 [0.35355339 1.         0.23570226 0.         0.125      0.        ]
 [0.33333333 0.23570226 1.         0.47140452 0.35355339 0.10540926]
 [0.         0.         0.47140452 1.         0.25       0.3354102 ]
 [0.70710678 0.125      0.35355339 0.25       1.         0.3354102 ]
 [0.31622777 0.         0.10540926 0.3354102  0.3354102  1.        ]]
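
To make the matrix entries easier to interpret, here is a small hand computation of one of them. It is a sketch that assumes whitespace splitting yields the same tokens as `CountVectorizer` for these sentences (which holds here because every token has at least two characters). It recomputes the cosine similarity between sentences 1 and 5 from raw word counts.

```python
import math
from collections import Counter

# Preprocessed sentences 1 and 5
s1 = "past john liked sport likes sport politics".split()
s5 = "linda liked books sport likes sport".split()

c1, c5 = Counter(s1), Counter(s5)

# Dot product over the shared vocabulary: liked (1*1) + sport (2*2) + likes (1*1) = 6
dot = sum(c1[w] * c5[w] for w in c1.keys() & c5.keys())

# Euclidean norms of the two count vectors: sqrt(9) = 3 and sqrt(8)
norm1 = math.sqrt(sum(v * v for v in c1.values()))
norm5 = math.sqrt(sum(v * v for v in c5.values()))

cos_15 = dot / (norm1 * norm5)
print(round(cos_15, 8))  # 0.70710678, the [0, 4] entry of the matrix above
```

The corresponding cosine distance is `1 - cos_15`, so a higher similarity means a smaller distance.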


In [6]:
def print_top_similar_pairs(data_dict, similarity_matrix, n=5):
    # Get the number of sentences
    num_sentences = len(data_dict['text'])
    # Initialize a list to store the pairs of similar sentences
    similar_pairs = []
    # Loop over all possible pairs of sentences
    for i in range(num_sentences):
        for j in range(i + 1, num_sentences):
            # Get the similarity score for the pair of sentences
            similarity_score = similarity_matrix[i, j]
            # Add the pair of sentences and the similarity score to the list
            similar_pairs.append((i, j, similarity_score))
    # Sort the list in descending order by similarity score
    similar_pairs = sorted(similar_pairs, key=lambda x: x[2], reverse=True)
    # Print the top n similar pairs
    print(f"Top {n} similar sentence pairs:")
    for pair in similar_pairs[:n]:
        print(f"Sentence {pair[0] + 1}: {data_dict['text'][pair[0]]}")
        print(f"Sentence {pair[1] + 1}: {data_dict['text'][pair[1]]}")
        print(f"Similarity score: {pair[2]}")
        print()


print_top_similar_pairs(data_dict, cosine_sim, 5)

Top 5 similar sentence pairs:
Sentence 1: past john liked sport likes sport politics
Sentence 5: linda liked books sport likes sport
Similarity score: 0.7071067811865475

Sentence 3: sara likes books politics past read books
Sentence 4: robert loved books nature reads books
Similarity score: 0.4714045207910316

Sentence 1: past john liked sport likes sport politics
Sentence 2: sam liked politics fan music politics
Similarity score: 0.35355339059327373

Sentence 3: sara likes books politics past read books
Sentence 5: linda liked books sport likes sport
Similarity score: 0.35355339059327373

Sentence 4: robert loved books nature reads books
Sentence 6: alison used loved nature currently likes nature sport
Similarity score: 0.33541019662496846
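
The task statement also asks for clustering. As a hedged sketch (not part of the original solution), the block below converts the bag-of-words cosine similarities into cosine distances and runs agglomerative clustering with SciPy. The choice of average linkage and of cutting the tree into two clusters are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Preprocessed sentences, copied from the output above
sentences = [
    "past john liked sport likes sport politics",
    "sam liked politics fan music politics",
    "sara likes books politics past read books",
    "robert loved books nature reads books",
    "linda liked books sport likes sport",
    "alison used loved nature currently likes nature sport",
]

# Bag-of-words vectors and pairwise cosine similarity
bow = CountVectorizer().fit_transform(sentences)
sim = cosine_similarity(bow)

# Cosine distance = 1 - cosine similarity; clean up floating-point noise
dist = np.clip(1.0 - sim, 0.0, None)
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)

# linkage expects a condensed (upper-triangle) distance vector
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="average")  # assumption: average linkage

# Cut the dendrogram into at most two clusters (assumption)
labels = fcluster(Z, t=2, criterion="maxclust")

for idx, label in enumerate(labels, start=1):
    print(f"Sentence {idx}: cluster {label}")
```

The first merge in `Z` joins sentences 1 and 5, which is the most similar pair in the list above. `scipy.cluster.hierarchy.dendrogram(Z)` can be used with `matplotlib` to visualise the full merge order.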



### TF-IDF with Cosine Distance

Using the TF-IDF method, we represent each sentence as a vector of TF-IDF scores, which down-weight words that occur in many sentences. We then calculate the cosine similarity between each pair of vectors to measure the similarity between sentences. Here's the code:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data_dict['text'])

# Calculate pairwise cosine similarity between the sentences
cosine_sim = cosine_similarity(tfidf_matrix)

# Print the similarity matrix
print(cosine_sim)

[[1.         0.27961795 0.29175285 0.         0.60384727 0.22074055]
 [0.27961795 1.         0.17723303 0.         0.10034482 0.        ]
 [0.29175285 0.17723303 1.         0.35896596 0.27375705 0.05636227]
 [0.         0.         0.35896596 1.         0.20323737 0.32788468]
 [0.60384727 0.10034482 0.27375705 0.20323737 1.         0.23764752]
 [0.22074055 0.         0.05636227 0.32788468 0.23764752 1.        ]]
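
To show where these numbers come from, the sketch below rebuilds the TF-IDF vectors by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Preprocessed sentences, copied from the output above
sentences = [
    "past john liked sport likes sport politics",
    "sam liked politics fan music politics",
    "sara likes books politics past read books",
    "robert loved books nature reads books",
    "linda liked books sport likes sport",
    "alison used loved nature currently likes nature sport",
]

# Raw term counts (rows: sentences, columns: vocabulary in alphabetical order)
counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of sentences containing each term
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1.0

# Weight the counts and L2-normalise each row
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(sentences).toarray()
matches = np.allclose(manual_tfidf, reference)
print(matches)

# Rows are unit length, so cosine similarity reduces to a dot product
manual_sim = manual_tfidf @ manual_tfidf.T
print(round(manual_sim[0, 4], 4))  # about 0.6038, as in the matrix above
```

Words that appear in several sentences, such as "likes", receive a smaller IDF than words unique to one sentence, such as "john". This is why the similarity between sentences 1 and 5 drops from about 0.71 with raw counts to about 0.60 here.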


In [8]:
print_top_similar_pairs(data_dict, cosine_sim, 5)

Top 5 similar sentence pairs:
Sentence 1: past john liked sport likes sport politics
Sentence 5: linda liked books sport likes sport
Similarity score: 0.6038472688979556

Sentence 3: sara likes books politics past read books
Sentence 4: robert loved books nature reads books
Similarity score: 0.3589659642687218

Sentence 4: robert loved books nature reads books
Sentence 6: alison used loved nature currently likes nature sport
Similarity score: 0.3278846815814845

Sentence 1: past john liked sport likes sport politics
Sentence 3: sara likes books politics past read books
Similarity score: 0.29175284985914557

Sentence 1: past john liked sport likes sport politics
Sentence 2: sam liked politics fan music politics
Similarity score: 0.2796179458525532

