#Language Modeling and Text Classification: Similarity Between Words or Documents


Practical Exercise Goal:
Use text similarity measures to identify similar or duplicate Nike product descriptions using:

TF-IDF + Cosine Similarity

Jaccard Similarity



#Step 1: Import Libraries and Download NLTK Data


In [23]:
# Import necessary packages

import nltk
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



In [24]:
# Download required NLTK data files
# (newer NLTK releases may also require nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#Step 2: Load the Dataset and Pre-process Text


In [25]:
# Load the dataset
df = pd.read_csv("NikeProductDescriptions.csv")

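# Build the stop-word set and the lemmatizer once, not on every call or token
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function: lowercase, strip punctuation, tokenize,
# drop stop words, lemmatize, and rejoin into a single string
def preprocess_text(text):
    if not isinstance(text, str):
        return ''  # e.g. NaN for a missing description
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)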

# Apply preprocessing on 'Product Description' column
df['Processed Description'] = df['Product Description'].apply(preprocess_text)
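
To make the effect of each preprocessing step concrete, the sketch below traces one made-up description through lowercasing, punctuation removal, and stop-word filtering. It is a simplified stand-in, not the pipeline above: it splits on whitespace instead of calling nltk.word_tokenize, uses a tiny hypothetical stop-word list (TOY_STOP_WORDS) instead of NLTK's, and skips lemmatization, so that it runs without any NLTK data.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's much larger English list.
TOY_STOP_WORDS = {"the", "a", "and", "for", "with", "is"}

def toy_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return ' '.join(tokens)

# Made-up sample sentence, not taken from the dataset
sample = "The Nike Air Force 1 '07 is a classic, with crisp leather and bold details!"
print(toy_preprocess(sample))
# nike air force 1 07 classic crisp leather bold details
```

Note that removing punctuation turns '07 into 07, so model names that differ only in punctuation collapse to the same token.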



#Step 3: TF-IDF Vectorization + Cosine Similarity


In [26]:
# TF-IDF + Cosine Similarity
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Processed Description'])
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("Cosine Similarity Matrix (TF-IDF):")
print(cosine_sim_matrix)




Cosine Similarity Matrix (TF-IDF):
[[1.         0.02700281 0.01613444 ... 0.02833869 0.00361866 0.02181375]
 [0.02700281 1.         0.04064013 ... 0.03654632 0.08444521 0.05017012]
 [0.01613444 0.04064013 1.         ... 0.04598871 0.07021435 0.        ]
 ...
 [0.02833869 0.03654632 0.04598871 ... 1.         0.04116353 0.02904659]
 [0.00361866 0.08444521 0.07021435 ... 0.04116353 1.         0.06835545]
 [0.02181375 0.05017012 0.         ... 0.02904659 0.06835545 1.        ]]
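
The full matrix is hard to interpret on its own, so the following sketch recomputes TF-IDF weights and cosine similarity by hand on three made-up descriptions. It assumes the TfidfVectorizer defaults (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization) and splits on whitespace, whereas the vectorizer's default tokenizer also drops one-character tokens. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, assumed to match scikit-learn's default settings
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(vec_a, vec_b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())

# Made-up, already-preprocessed descriptions
toy_docs = [
    "lightweight running shoe responsive foam",
    "lightweight running shoe breathable mesh",
    "leather skate shoe durable sole",
]
vectors = tfidf_vectors(toy_docs)
for i in range(len(toy_docs)):
    print([round(cosine(vectors[i], vectors[j]), 3) for j in range(len(toy_docs))])
```

Because every vector has unit length, the diagonal is 1 and the off-diagonal values depend only on the weights of shared terms. The first two descriptions share three words and score higher with each other than either does with the third, which shares only "shoe".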


#Step 4: Jaccard Similarity


In [27]:
# Jaccard Similarity function
def jaccard_similarity(text1, text2):
    if not text1 or not text2:
        return 0
    set1 = set(text1.split())
    set2 = set(text2.split())
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0
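
As a quick check of the arithmetic, here is the same set logic applied step by step to two made-up, already-preprocessed descriptions. The strings are illustrative and not taken from the dataset.

```python
# Two made-up, already-preprocessed descriptions
text_a = "nike air force leather shoe"
text_b = "nike air max mesh shoe"

set_a = set(text_a.split())
set_b = set(text_b.split())

shared = set_a & set_b      # words that appear in both descriptions
combined = set_a | set_b    # words that appear in either description
score = len(shared) / len(combined)

print(sorted(shared))   # ['air', 'nike', 'shoe']
print(len(combined))    # 7
print(round(score, 3))  # 0.429, i.e. 3 / 7
```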



In [28]:
# Compute Jaccard similarity matrix
descriptions = df['Processed Description'].tolist()  # positional access, independent of the DataFrame index
n_descriptions = len(descriptions)
jaccard_sim_matrix = [[0.0] * n_descriptions for _ in range(n_descriptions)]

for i in range(n_descriptions):
    for j in range(i, n_descriptions):  # start at i: the matrix is symmetric, so each pair is computed once
        sim = jaccard_similarity(descriptions[i], descriptions[j])
        jaccard_sim_matrix[i][j] = sim
        jaccard_sim_matrix[j][i] = sim  # mirror into the lower triangle

print("\nJaccard Similarity Matrix:")
for row in jaccard_sim_matrix:
    print(row)


Jaccard Similarity Matrix:
[1.0, 0.029411764705882353, 0.014285714285714285, 0.06349206349206349, 0.037037037037037035, 0.03278688524590164, 0.09803921568627451, 0.11864406779661017, 0.056179775280898875, 0.04225352112676056, 0.05970149253731343, 0.1016949152542373, 0.02702702702702703, 0.027777777777777776, 0.22, 0.0410958904109589, 0.05714285714285714, 0.05084745762711865, 0.015384615384615385, 0.07547169811320754, 0.03125, 0.08641975308641975, 0.029850746268656716, 0.0967741935483871, 0.0, 0.056338028169014086, 0.056818181818181816, 0.08196721311475409, 0.0379746835443038, 0.04477611940298507, 0.029850746268656716, 0.0963855421686747, 0.012345679012345678, 0.05357142857142857, 0.08163265306122448, 0.07936507936507936, 0.046875, 0.12121212121212122, 0.038461538461538464, 0.08163265306122448, 0.05084745762711865, 0.0, 0.03636363636363636, 0.017241379310344827, 0.013888888888888888, 0.0, 0.05263157894736842, 0.01694915254237288, 0.06779661016949153, 0.028169014084507043, 0.03571428571

#Step 5: Display Highly Similar Product Pairs


In [29]:
threshold = 0.7  # similarity cutoff for reporting a pair as a likely near-duplicate
n = len(df)

print("Highly similar product pairs based on Cosine Similarity:")
for i in range(n):
    for j in range(i + 1, n):  # Only upper triangle, no repeats or self-pair
        if cosine_sim_matrix[i][j] > threshold:
            print(f"Product {i} and Product {j} => Similarity: {cosine_sim_matrix[i][j]:.2f}")
            print(f"Title 1: {df['Title'][i]}")
            print(f"Title 2: {df['Title'][j]}")
            print("-" * 50)


Highly similar product pairs based on Cosine Similarity:
Product 6 and Product 163 => Similarity: 0.88
Title 1: Nike Air Force 1 '07
Title 2: Nike Air Force 1 '07
--------------------------------------------------
Product 13 and Product 28 => Similarity: 0.74
Title 1: Nike Infinity React 3 Premium
Title 2: Nike Pegasus 39 Premium
--------------------------------------------------
Product 16 and Product 381 => Similarity: 0.71
Title 1: Nike Pegasus 39
Title 2: Nike Pegasus 39 Premium
--------------------------------------------------
Product 38 and Product 192 => Similarity: 0.86
Title 1: Nike Alphafly 2
Title 2: Nike Alphafly 2
--------------------------------------------------
Product 40 and Product 48 => Similarity: 0.95
Title 1: Nike SB Force 58
Title 2: Nike SB Force 58
--------------------------------------------------
Product 40 and Product 56 => Similarity: 0.80
Title 1: Nike SB Force 58
Title 2: Nike SB Force 58 Premium
--------------------------------------------------
Product

In [30]:
print("Highly similar product pairs based on Jaccard Similarity:")
for i in range(n):
    for j in range(i + 1, n):
        if jaccard_sim_matrix[i][j] > threshold:
            print(f"Product {i} and Product {j} => Similarity: {jaccard_sim_matrix[i][j]:.2f}")
            print(f"Title 1: {df['Title'][i]}")
            print(f"Title 2: {df['Title'][j]}")
            print("-" * 50)


Highly similar product pairs based on Jaccard Similarity:
Product 6 and Product 163 => Similarity: 0.82
Title 1: Nike Air Force 1 '07
Title 2: Nike Air Force 1 '07
--------------------------------------------------
Product 38 and Product 192 => Similarity: 0.81
Title 1: Nike Alphafly 2
Title 2: Nike Alphafly 2
--------------------------------------------------
Product 40 and Product 48 => Similarity: 0.89
Title 1: Nike SB Force 58
Title 2: Nike SB Force 58
--------------------------------------------------
Product 43 and Product 57 => Similarity: 0.87
Title 1: Nike SB Nyjah 3
Title 2: Nike SB Zoom Nyjah 3
--------------------------------------------------
Product 44 and Product 51 => Similarity: 0.85
Title 1: Nike SB Ishod Wair Premium
Title 2: Nike SB Ishod Wair
--------------------------------------------------
Product 44 and Product 52 => Similarity: 0.78
Title 1: Nike SB Ishod Wair Premium
Title 2: Nike SB Ishod Premium
--------------------------------------------------
Product 44 
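
The nested loops report pairs in index order. As an alternative sketch, the pairs can be collected with NumPy and sorted so that the strongest matches come first. The helper name pairs_above and the 4x4 matrix are hypothetical; the function accepts either a NumPy array (like the cosine matrix) or a list of lists (like the Jaccard matrix).

```python
import numpy as np

def pairs_above(sim_matrix, threshold):
    """Return (i, j, score) for i < j with score > threshold, highest first."""
    sim = np.asarray(sim_matrix, dtype=float)
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # strictly upper triangle
    scores = sim[rows, cols]
    keep = scores > threshold
    order = np.argsort(-scores[keep])  # descending by similarity
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(rows[keep][order], cols[keep][order], scores[keep][order])
    ]

# Hypothetical 4x4 similarity matrix for illustration
toy_sim = [
    [1.00, 0.88, 0.10, 0.72],
    [0.88, 1.00, 0.05, 0.30],
    [0.10, 0.05, 1.00, 0.95],
    [0.72, 0.30, 0.95, 1.00],
]
for i, j, score in pairs_above(toy_sim, 0.7):
    print(f"Product {i} and Product {j} => Similarity: {score:.2f}")
# Product 2 and Product 3 => Similarity: 0.95
# Product 0 and Product 1 => Similarity: 0.88
# Product 0 and Product 3 => Similarity: 0.72
```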

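#Summary:
Cosine similarity is computed on TF-IDF vectors, so it reflects how often each word occurs and gives less weight to words that appear in many descriptions.

Jaccard similarity compares the sets of unique words in two descriptions and ignores how often each word occurs.

Both measures flag near-duplicate or closely related descriptions, but they do not always agree. In the output above, Products 13 and 28 (cosine 0.74) and Products 40 and 56 (cosine 0.80) pass the 0.7 threshold under cosine similarity but do not appear in the Jaccard results.

The display loops print only the pairs whose similarity exceeds the threshold, which keeps the output concise and readable.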
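
The difference between a frequency-aware measure and a set-based one can be seen in a small sketch. To keep the arithmetic simple, it uses cosine similarity on raw word counts rather than TF-IDF weights, which is an assumption made for illustration. The function names and strings are hypothetical.

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Cosine similarity on raw word counts (no IDF weighting)."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(text_a, text_b):
    """Jaccard similarity on sets of unique words."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

base = "soft foam cushioning"
once = "soft foam sole"
repeated = "soft foam foam foam sole"  # same unique words as `once`

# Jaccard sees the same word sets, so both scores are 0.5
print(jaccard(base, once), jaccard(base, repeated))
# Cosine on counts changes when "foam" is repeated
print(round(count_cosine(base, once), 3), round(count_cosine(base, repeated), 3))
```

Repeating a shared word leaves the Jaccard score unchanged but shifts the cosine score, which is one reason the two measures can rank the same pair differently.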

