# Methodology of task:

1. Text cleaning:
The function clean_text cleans each document by:
Converting text to lowercase.
Removing punctuation and symbols.
Removing extra whitespace.
The cleaned text is stored in a new column called 'cleaned_document'.
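
As a quick illustration of these three steps, the sketch below applies the same regular expressions to a made-up sentence. The sample string is an assumption for demonstration only and does not come from the dataset.

```python
import re

def clean_text(text):
    # Lowercase, drop anything that is not a word character or whitespace,
    # then collapse runs of whitespace into single spaces.
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample input, used only for illustration
sample = "  AI is   changing Health-Care, fast!  "
cleaned = clean_text(sample)
print(cleaned)  # ai is changing healthcare fast
```

Note that removing the hyphen merges "health-care" into the single token "healthcare", which can affect vocabulary matching later on.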

2. TF-IDF Vectorization:
TfidfVectorizer converts the cleaned text into a matrix of TF-IDF features.
TF-IDF measures how important a word is in a document relative to the entire corpus.
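The number of TF-IDF features (the column count of the resulting matrix) and the size of the vocabulary are printed. The two values are equal by construction, since each vocabulary term becomes one column.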
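
To make the weighting concrete, here is a dependency-free sketch of the formula that TfidfVectorizer applies with its default settings: raw term counts multiplied by a smoothed IDF, followed by L2 normalisation. The toy corpus and the helper names idf and tfidf_vector are assumptions for illustration, and tokenization is simplified to a whitespace split.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already cleaned)
docs = [
    "ai helps doctors",
    "ai helps drivers",
    "doctors treat patients",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF, matching the TfidfVectorizer default (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    # Raw term counts times IDF, then L2-normalised (norm='l2')
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[2])
print(vec)
```

In the third document, "treat" and "patients" occur in only one document each, so they receive a higher weight than "doctors", which occurs in two.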

3. Word2Vec Vectorization:
The documents are tokenized into individual words using word_tokenize.
A Word2Vec model is trained on these tokenized documents:
vector_size=100: Each word is represented by a 100-dimensional vector.
window=5: Considers up to 5 words on each side of the target word as context.
min_count=1: Includes words that appear at least once.
epochs=100: Trains the model for 100 passes over the corpus.
sg=1: Uses the Skip-gram model for training.
The size of the Word2Vec vectors and the vocabulary are printed.
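
The effect of the window parameter under Skip-gram can be sketched by listing the (target, context) pairs the model would learn from. This is a conceptual sketch only: skipgram_pairs is a hypothetical helper, the sentence is made up, and the library's actual training loop differs in detail (for example, it can randomly shrink the effective window).

```python
def skipgram_pairs(tokens, window=5):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical tokenized sentence
tokens = ["ai", "improves", "medical", "care"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With window=1 each word is paired only with its immediate neighbours, giving six pairs for this four-word sentence. A larger window adds more distant words as context.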

4. Document vector creation:
The function document_vector computes the mean of the Word2Vec vectors for all words in a document, creating a single vector to represent the document. If a word is not in the model's vocabulary, it is skipped. If none of the words are in the vocabulary, a zero vector is returned.
A list of document vectors is generated for all documents in the dataset.
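
The averaging step can be illustrated with a small, library-free sketch. The 3-dimensional toy_vectors dictionary stands in for the trained word vectors, and mean_vector is a hypothetical helper name. Both are assumptions for illustration.

```python
# Hypothetical 3-dimensional embeddings standing in for word2vec_model.wv
toy_vectors = {
    "ai": [1.0, 0.0, 2.0],
    "healthcare": [3.0, 2.0, 0.0],
}

def mean_vector(tokens, vectors, size=3):
    # Keep only tokens that have an embedding; unknown words are skipped
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size  # fallback when no token is in the vocabulary
    # Average each dimension across the known word vectors
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vec = mean_vector(["ai", "in", "healthcare"], toy_vectors)
print(doc_vec)  # [2.0, 1.0, 1.0]
```

The word "in" has no embedding in this toy vocabulary, so the result is the average of the two remaining vectors.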

5. Similarity functions:
TF-IDF Similarity: The function tf_idf_similarity calculates the cosine similarity between a query (input text) and all documents based on their TF-IDF representations.
Word2Vec Similarity: The function word2vec_similarity computes the cosine similarity between a query's vector (computed using Word2Vec) and all document vectors.
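
Both functions rely on cosine similarity, which compares the direction of two vectors and ignores their length. A minimal sketch with plain Python lists follows. The cosine helper and the example vectors are assumptions for illustration, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths;
    # returns 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical query and document vectors
query = [1.0, 1.0, 0.0]
documents = [[2.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]
scores = [cosine(query, d) for d in documents]
print(scores)
```

The first document points in the same direction as the query (similarity close to 1), the second is orthogonal to it (similarity 0), and the third falls in between.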

6. Querying documents:
A list of queries is defined, and for each query, the code computes the similarity with all documents using both TF-IDF and Word2Vec models.
The function get_top_5_similar retrieves the indices of the top 5 most similar documents.

7. Displaying results:
The code prints the top 5 most similar documents for each query using both TF-IDF and Word2Vec methods side by side. The document indices are printed (starting from 1).
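
The ranking step amounts to sorting document positions by descending similarity and shifting to 1-based numbering. The sketch below shows the idea without NumPy. The helper name top_k and the scores are hypothetical, and ties may be ordered differently than with an argsort-based approach.

```python
def top_k(similarities, k=5):
    # Sort document positions by descending similarity, keep the first k,
    # and shift to 1-based numbering for display
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i],
                   reverse=True)
    return [i + 1 for i in order[:k]]

# Hypothetical similarity scores for six documents
scores = [0.10, 0.80, 0.05, 0.60, 0.30, 0.70]
ranked = top_k(scores, k=3)
print(ranked)  # [2, 6, 4]
```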

In [1]:
# Importing necessary libraries
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
# Tokenizer data used by word_tokenize; recent NLTK releases may require 'punkt_tab' instead
nltk.download('punkt')

# Reading the input document
df = pd.read_csv('assignment3_data.csv')

# Check the column names
print("Column names in the DataFrame:", df.columns)

# Use the first column, regardless of its name
document_column = df.columns[0]

# Cleaning up the input
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and symbols
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['cleaned_document'] = df[document_column].apply(clean_text)

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_document'])

print("TF-IDF vector size:", tfidf_matrix.shape[1])
print("TF-IDF vocabulary size:", len(tfidf_vectorizer.get_feature_names_out()))


# Word2Vec Vectorization
tokenized_documents = [word_tokenize(doc) for doc in df['cleaned_document']]

# Word2Vec model parameters
vector_size = 100
window = 5
min_count = 1
epochs = 100

# Training Word2Vec model
word2vec_model = Word2Vec(
    tokenized_documents,
    vector_size=vector_size,
    window=window,
    min_count=min_count,
    epochs=epochs,
    sg=1,  # 1 = Skip-gram, 0 = CBOW
)

print("Word2Vec vector size:", word2vec_model.vector_size)
print("Word2Vec vocabulary size:", len(word2vec_model.wv.key_to_index))


# Function to calculate document vector (mean of word vectors)
def document_vector(doc):
    words = word_tokenize(doc)
    word_vectors = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(vector_size)

# Calculate document vectors for all documents
document_vectors = np.array([document_vector(doc) for doc in df['cleaned_document']])

# Function to calculate TF-IDF similarity
def tf_idf_similarity(query):
    query_vector = tfidf_vectorizer.transform([clean_text(query)])
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    return similarities

# Function to calculate Word2Vec similarity
def word2vec_similarity(query):
    query_vector = document_vector(clean_text(query))
    similarities = cosine_similarity([query_vector], document_vectors).flatten()
    return similarities

# Query documents
queries = [
    "Artificial intelligence is set to take over most jobs in near future.",
    "The use of artificial intelligence in healthcare industry is more and more every day.",
    "The use of AI in healthcare industry is more and more every day.",
    "The use of AI in medical care is more and more every day"
]

# Function to get top 5 similar documents
def get_top_5_similar(similarities):
    top_5 = np.argsort(similarities)[-5:][::-1] + 1  # Adding 1 to start from 1 instead of 0
    return top_5

# Iterate over queries and print results
for i, query in enumerate(queries, 1):
    print(f"-------- Results for query #{i} -----------------")
    print("TF-IDF\t\tWord2Vec")
    print("-------------------------------------------------")
    
    tfidf_sim = tf_idf_similarity(query)
    word2vec_sim = word2vec_similarity(query)
    
    tfidf_top_5 = get_top_5_similar(tfidf_sim)
    word2vec_top_5 = get_top_5_similar(word2vec_sim)
    
    # zip avoids an IndexError if the dataset has fewer than 5 documents
    for tfidf_doc, w2v_doc in zip(tfidf_top_5, word2vec_top_5):
        print(f"Document {tfidf_doc}\tDocument {w2v_doc}")
    
    print("\n")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ACER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Column names in the DataFrame: Index(['documents'], dtype='object')
TF-IDF vector size: 328
TF-IDF vocabulary size: 328
Word2Vec vector size: 100
Word2Vec vocabulary size: 329
-------- Results for query #1 -----------------
TF-IDF		Word2Vec
-------------------------------------------------
Document 7	Document 7
Document 9	Document 8
Document 8	Document 9
Document 4	Document 10
Document 5	Document 6


-------- Results for query #2 -----------------
TF-IDF		Word2Vec
-------------------------------------------------
Document 10	Document 10
Document 5	Document 7
Document 1	Document 8
Document 8	Document 5
Document 4	Document 1


-------- Results for query #3 -----------------
TF-IDF		Word2Vec
-------------------------------------------------
Document 10	Document 10
Document 5	Document 5
Document 1	Document 2
Document 4	Document 1
Document 8	Document 6


-------- Results for query #4 -----------------
TF-IDF		Word2Vec
-------------------------------------------------
Document 4	Document 5
D