
Import Required Libraries

In [50]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('punkt_tab')   # tokenizer data that newer NLTK releases look up for word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Dataset

In [51]:
documents = [

# SPORTS
"Football players train daily to improve their stamina and teamwork.",
"The cricket team won the championship after a thrilling final match.",
"Basketball requires speed, coordination, and strong defense.",
"Olympic athletes dedicate years to achieve peak performance.",
"Tennis tournaments attract millions of global viewers every year.",
"Regular exercise improves sports performance and physical fitness.",

# POLITICS
"The government announced a new policy to improve education.",
"Election campaigns focus on economic growth and employment.",
"Political debates influence public opinion before voting.",
"Parliament passed a bill addressing climate change.",
"International relations shape global political stability.",
"Citizens demand transparency and accountability from leaders.",

# HEALTH
"Doctors recommend balanced diets for a healthy lifestyle.",
"Regular exercise reduces the risk of heart disease.",
"Mental health awareness is increasing worldwide.",
"Vaccines protect communities from dangerous diseases.",
"Sleep is essential for physical and mental recovery.",
"Medical research advances treatment for chronic illnesses.",

# TECHNOLOGY
"Artificial intelligence is transforming modern industries.",
"Cybersecurity protects data from digital threats.",
"Smartphones connect people through advanced communication.",
"Software development drives innovation in technology.",
"Cloud computing enables scalable online services.",
"Robotics improves efficiency in manufacturing processes."
]

print("Total documents:", len(documents))



Total documents: 24


Preprocessing

In [52]:
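stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, strip non-letters, tokenize, and drop English stop words."""
    text = text.lower()
    # Keep only lowercase letters and whitespace (removes digits and punctuation).
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words]
    return " ".join(tokens)

clean_docs = [preprocess(doc) for doc in documents]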
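
To see what the preprocessing step produces, here is a minimal sketch that needs no NLTK data. It assumes a small hand-picked stop word set (`TOY_STOP_WORDS`, hypothetical and much shorter than NLTK's English list) and uses `str.split` in place of `word_tokenize`, so it only approximates the cell above.

```python
import re

# Hypothetical, hand-picked subset of English stop words (illustration only).
TOY_STOP_WORDS = {"the", "a", "to", "and", "their", "after", "of", "for", "is"}

def preprocess_sketch(text):
    """Lowercase, strip non-letters, split on whitespace, drop toy stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # remove punctuation and digits
    tokens = text.split()                  # stand-in for word_tokenize
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

result = preprocess_sketch(
    "Football players train daily to improve their stamina and teamwork."
)
print(result)  # football players train daily improve stamina teamwork
```

The trailing period is removed by the regular expression, and "to", "their", and "and" are dropped as stop words.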

TF-IDF Representation

In [53]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(clean_docs)
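
The weighting behind `TfidfVectorizer` can be sketched in plain Python. This assumes the documented default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) and uses whitespace splitting rather than scikit-learn's own token pattern, so treat it as an approximation and not the library's implementation. `tfidf_vectors` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Already-preprocessed toy documents (assumed output of the preprocessing step).
toy_docs = [
    "regular exercise improves sports performance physical fitness",
    "regular exercise reduces risk heart disease",
    "sleep essential physical mental recovery",
]
vectors = tfidf_vectors(toy_docs)

# A term unique to one document outweighs a term shared by two documents.
print(vectors[0]["sports"] > vectors[0]["regular"])
```

Terms that appear in fewer documents get a larger idf, which is why "sports" ends up with a higher weight than "regular" in the first vector.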


Cosine Similarity

In [54]:
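# Pairwise cosine similarity between all TF-IDF rows (a 24 x 24 symmetric matrix).
cos_sim = cosine_similarity(tfidf_matrix)

print("\n=== COSINE SIMILARITY ===\n")

# Print each unordered pair once (upper triangle, diagonal excluded).
for i in range(len(documents)):
    for j in range(i+1, len(documents)):
        print("Doc 1:", documents[i])
        print("Doc 2:", documents[j])
        print("Score:", round(cos_sim[i, j], 3))
        print("-" * 60)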




=== COSINE SIMILARITY ===

Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: The cricket team won the championship after a thrilling final match.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Basketball requires speed, coordination, and strong defense.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Olympic athletes dedicate years to achieve peak performance.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Tennis tournaments attract millions of global viewers every year.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Regular exercise 
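
The 0.0 scores above follow from the definition of cosine similarity: when two documents share no terms after stop word removal, their dot product is zero. A minimal sketch on raw term counts (not TF-IDF weights, so the non-zero value will differ from scikit-learn's) illustrates this; the token strings are assumed outputs of the preprocessing step.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

football = Counter("football players train daily improve stamina teamwork".split())
cricket = Counter("cricket team won championship thrilling final match".split())
fitness = Counter("regular exercise improves sports performance physical fitness".split())
heart = Counter("regular exercise reduces risk heart disease".split())

no_overlap = cosine(football, cricket)   # no shared terms, so the dot product is 0
some_overlap = cosine(fitness, heart)    # shares "regular" and "exercise"

print(no_overlap)
print(round(some_overlap, 3))
```

With raw counts, the second pair scores 2 / (sqrt(7) * sqrt(6)), roughly 0.309, because two of the terms coincide.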

Jaccard Similarity

In [55]:

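def jaccard(s1, s2):
    # Jaccard index: |A ∩ B| / |A ∪ B| over the sets of unique tokens.
    a = set(s1.split())
    b = set(s2.split())
    union = a | b
    # Two empty documents have an empty union; return 0.0 instead of dividing by zero.
    return len(a & b) / len(union) if union else 0.0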

print("\n=== JACCARD SIMILARITY ===\n")

for i in range(len(clean_docs)):
    for j in range(i+1, len(clean_docs)):
        score = jaccard(clean_docs[i], clean_docs[j])
        print("Doc 1:", documents[i])
        print("Doc 2:", documents[j])
        print("Score:", round(score, 3))
        print("-" * 60)




=== JACCARD SIMILARITY ===

Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: The cricket team won the championship after a thrilling final match.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Basketball requires speed, coordination, and strong defense.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Olympic athletes dedicate years to achieve peak performance.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Tennis tournaments attract millions of global viewers every year.
Score: 0.0
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Regular exercise
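
As a worked example of the Jaccard formula, the sketch below scores two of the dataset's sentences that do share tokens. The token sets are written out by hand, assuming "and", "the", and "of" are removed as stop words; `jaccard_sets` is a hypothetical helper that takes sets directly.

```python
def jaccard_sets(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# "Regular exercise improves sports performance and physical fitness."
fitness = {"regular", "exercise", "improves", "sports",
           "performance", "physical", "fitness"}
# "Regular exercise reduces the risk of heart disease."
heart = {"regular", "exercise", "reduces", "risk", "heart", "disease"}

shared = fitness & heart            # {"regular", "exercise"}
score = jaccard_sets(fitness, heart)

print(sorted(shared))
print(round(score, 3))  # 2 shared tokens out of 11 distinct tokens
```

The two sets share 2 tokens out of 7 + 6 - 2 = 11 distinct tokens, giving 2/11, roughly 0.182.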

WordNet Semantic Similarity

In [56]:
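def sentence_wordnet_similarity(s1, s2):
    """Average Wu-Palmer similarity over all word pairs, using each word's first synset."""
    # Look up the first synset once per word; skip words WordNet does not know.
    syns1 = [s[0] for s in map(wn.synsets, s1.split()) if s]
    syns2 = [s[0] for s in map(wn.synsets, s2.split()) if s]

    scores = []
    for a in syns1:
        for b in syns2:
            sim = a.wup_similarity(b)
            if sim is not None:   # None when no common ancestor is found
                scores.append(sim)

    return sum(scores) / len(scores) if scores else 0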

print("\n=== WORDNET SIMILARITY ===\n")

from itertools import combinations, islice

# Score only the first 10 document pairs (lab requirement).
for i, j in islice(combinations(range(len(clean_docs)), 2), 10):
    score = sentence_wordnet_similarity(clean_docs[i], clean_docs[j])
    print("Doc 1:", documents[i])
    print("Doc 2:", documents[j])
    print("Score:", round(score, 3))
    print("-" * 60)



=== WORDNET SIMILARITY ===

Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: The cricket team won the championship after a thrilling final match.
Score: 0.234
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Basketball requires speed, coordination, and strong defense.
Score: 0.241
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Olympic athletes dedicate years to achieve peak performance.
Score: 0.224
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Tennis tournaments attract millions of global viewers every year.
Score: 0.22
------------------------------------------------------------
Doc 1: Football players train daily to improve their stamina and teamwork.
Doc 2: Regular e
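
The Wu-Palmer measure behind `wup_similarity` is 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor the two concepts share. The sketch below applies that formula to a tiny hypothetical taxonomy (`PARENT` is invented for illustration and is not WordNet's hierarchy), counting the root as depth 1.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "activity": "entity",
    "sport": "activity",
    "football": "sport",
    "cricket": "sport",
    "policy": "entity",
}

def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Number of nodes from the root down to node (root has depth 1)."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer similarity on the toy tree: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node also above b is the deepest common ancestor.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

close_pair = wu_palmer("football", "cricket")   # LCS is "sport" (depth 3)
far_pair = wu_palmer("football", "policy")      # LCS is "entity" (depth 1)

print(close_pair)
print(round(far_pair, 3))
```

Siblings under "sport" score 6/8 = 0.75, while concepts that only meet at the root score 2/6, roughly 0.333. Averaging such scores over every word pair, as the cell above does, pulls sentence-level values toward the low end when most pairs are unrelated.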