# **Step 1: Setting up Google Colab and importing libraries**

### This step sets up the environment in Google Colab and imports the required libraries: numpy for numerical operations, sklearn for text vectorization (CountVectorizer, TfidfVectorizer) and cosine similarity, gensim for word embeddings (Word2Vec), and spaCy for Named Entity Recognition (NER).

In [11]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
import spacy

# **Step 2: Define the documents**

### Here we define a list of four short text documents describing user actions (selecting, ordering, searching) on wedding items such as a gown, flowers, and a diamond ring. These documents form the corpus for all later processing and analysis.

In [12]:
documents = [
    "User selected Wedding gown.",
    "User ordered on-line rose flowers.",
    "User searched diamond ring.",
    "User selected white wedding gown, online flowers, 3 carat diamond ring."
]

# **Step 3: Text Preprocessing**

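### The preprocess_text function converts the text to lowercase and tokenizes it by splitting on whitespace. We apply it to each document in the list to get a tokenized representation. Note that punctuation is not removed, so tokens such as 'gown.' and 'gown,' keep their trailing punctuation.

### Additionally, we use spaCy's Named Entity Recognition (NER) to extract entities such as names, organizations, locations, or quantities from the documents. On this small corpus the en_core_web_sm model finds only two entities, both in the last document: '3' (CARDINAL) and 'carat', which it mislabels as ORG.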

In [13]:
def preprocess_text(text):
    text = text.lower()  # Convert text to lowercase
    return text.split()  # Tokenize on whitespace (punctuation stays attached)

cleaned_documents = [preprocess_text(doc) for doc in documents]
print("Cleaned and Tokenized Documents:", cleaned_documents)

# Named Entity Recognition
nlp = spacy.load("en_core_web_sm")
for doc in documents:
    entities = nlp(doc)
    for ent in entities.ents:
        print(ent.text, ent.label_)

Cleaned and Tokenized Documents: [['user', 'selected', 'wedding', 'gown.'], ['user', 'ordered', 'on-line', 'rose', 'flowers.'], ['user', 'searched', 'diamond', 'ring.'], ['user', 'selected', 'white', 'wedding', 'gown,', 'online', 'flowers,', '3', 'carat', 'diamond', 'ring.']]
3 CARDINAL
carat ORG
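
### The whitespace split above leaves punctuation attached to tokens ('gown.', 'flowers,'). One possible alternative, shown here as a minimal sketch and not as the notebook's method, extracts runs of letters and digits with a regular expression. The function name preprocess_text_strict and the choice to split on hyphens (so 'on-line' becomes 'on' and 'line') are assumptions made for illustration.

```python
import re

documents = [
    "User selected Wedding gown.",
    "User ordered on-line rose flowers.",
    "User searched diamond ring.",
    "User selected white wedding gown, online flowers, 3 carat diamond ring."
]

def preprocess_text_strict(text):
    """Hypothetical variant of preprocess_text that drops punctuation."""
    # Lowercase, then keep only runs of ASCII letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

cleaned_strict = [preprocess_text_strict(doc) for doc in documents]
print(cleaned_strict)
```

### With this variant, 'gown' appears as the same token in the first and last documents, so any model trained on the tokens sees it as a single word.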


# **Step 4: Vectorization Techniques**

## CountVectorizer and TF-IDF Vectorizer
### We use both CountVectorizer and TfidfVectorizer from sklearn to convert the text into numerical representations. Both vectorizers lowercase and tokenize the raw documents themselves, so they are fitted on documents, not on cleaned_documents.

### **CountVectorizer:** Converts text into a matrix of token counts: each row is a document, each column is a vocabulary term, and each entry is the number of times that term occurs in the document.
### **TfidfVectorizer:** Transforms text into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix, which weights each term up by its frequency in the document and down by the number of documents in the corpus that contain it.

In [14]:
# CountVectorizer
vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(documents)

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

print("Count Vectorizer Matrix:")
print(X_count.toarray())
print("TF-IDF Vectorizer Matrix:")
print(X_tfidf.toarray())

Count Vectorizer Matrix:
[[0 0 0 1 0 0 0 0 0 0 0 1 1 1 0]
 [0 0 1 0 1 1 0 1 0 1 0 0 1 0 0]
 [0 1 0 0 0 0 0 0 1 0 1 0 1 0 0]
 [1 1 1 1 0 0 1 0 1 0 0 1 1 1 1]]
TF-IDF Vectorizer Matrix:
[[0.         0.         0.         0.53931298 0.         0.
  0.         0.         0.         0.         0.         0.53931298
  0.35696573 0.53931298 0.        ]
 [0.         0.         0.3563895  0.         0.45203489 0.45203489
  0.         0.45203489 0.         0.45203489 0.         0.
  0.23589056 0.         0.        ]
 [0.         0.4970962  0.         0.         0.         0.
  0.         0.         0.4970962  0.         0.6305035  0.
  0.32902288 0.         0.        ]
 [0.37791387 0.29795164 0.29795164 0.29795164 0.         0.
  0.37791387 0.         0.29795164 0.         0.         0.29795164
  0.19721114 0.29795164 0.37791387]]
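
### To make the numbers above less opaque, the following sketch recomputes the first row of the TF-IDF matrix by hand. It assumes TfidfVectorizer's default settings: lowercasing, tokens of two or more word characters (which is why '3' is dropped and 'on-line' becomes 'on' and 'line', giving 15 columns), the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names are illustrative.

```python
import math
import re

documents = [
    "User selected Wedding gown.",
    "User ordered on-line rose flowers.",
    "User searched diamond ring.",
    "User selected white wedding gown, online flowers, 3 carat diamond ring."
]

# Assumed default tokenization: lowercase, tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

# Matrix columns follow the alphabetically sorted vocabulary
vocabulary = sorted({tok for toks in tokenized for tok in toks})

n_docs = len(documents)
# Document frequency: number of documents containing each term
df = {term: sum(term in toks for toks in tokenized) for term in vocabulary}
# Smoothed inverse document frequency (assumed default formula)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    raw = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(tokenized[0])
print(vocabulary)
print([round(v, 8) for v in row0])
```

### In the first document, 'user' gets the lowest weight because it occurs in all four documents, while 'gown', 'selected', and 'wedding' each occur in only two and are weighted higher.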


## Word Embeddings using Word2Vec
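### We use Word2Vec from the gensim library to create word embeddings, which represent words as dense vectors intended to capture semantic relationships. The model is trained on cleaned_documents with min_count=1 so that every token is kept. Because preprocess_text leaves punctuation attached, the vocabulary contains 'gown.' and 'gown,' but not the bare word 'gown', so a lookup for 'gown' would raise a KeyError. With only four short sentences, the resulting vectors demonstrate the API but should not be read as meaningful embeddings.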

In [19]:
# Train Word2Vec on the tokenized documents (min_count=1 keeps every token)
word2vec_model = Word2Vec(sentences=cleaned_documents, vector_size=100, window=5, min_count=1, workers=4)
word_vectors = word2vec_model.wv

# Checking for available words in the Word2Vec model
available_words = word_vectors.index_to_key
print("Available Words in the Word2Vec Model:", available_words)

Available Words in the Word2Vec Model: ['user', 'selected', 'wedding', 'ring.', 'diamond', 'rose', 'gown.', 'ordered', 'on-line', 'carat', 'flowers.', '3', 'white', 'gown,', 'online', 'flowers,', 'searched']


# **Step 5: Calculate Similarity**

### In this step, we calculate similarity scores using two methods:

### **Cosine Similarity with TF-IDF Vectors:** Using cosine_similarity from sklearn, we compute the pairwise similarity between documents based on their TF-IDF representations.
### **Word Embeddings Similarity:** We compute the similarity between the tokens 'wedding' and 'gown.' using the Word2Vec model. The trailing period is required because the bare word 'gown' is not in the model's vocabulary, so its similarity cannot be computed.

In [21]:
# Cosine similarity between TF-IDF vectors
cos_sim_tfidf = cosine_similarity(X_tfidf, X_tfidf)
print("Cosine Similarity Matrix - TF-IDF:")
print(cos_sim_tfidf)

# Cosine similarity using Word Embeddings (Word2Vec)
# Example: similarity between 'wedding' and 'gown.' (the vocabulary token keeps its trailing period)
similarity = word_vectors.similarity('wedding', 'gown.')
print("Similarity between 'Wedding' and 'Gown':", similarity)

Cosine Similarity Matrix - TF-IDF:
[[1.         0.08420485 0.1174499  0.55246518]
 [0.08420485 1.         0.07761339 0.15270708]
 [0.1174499  0.07761339 1.         0.36110824]
 [0.55246518 0.15270708 0.36110824 1.        ]]
Similarity between 'Wedding' and 'Gown': 0.14595059
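
### As a sanity check on the matrix above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies that definition to the first and fourth TF-IDF rows, copied from the printed output; the function name cosine is illustrative.

```python
import math

# Rows 1 and 4 of the TF-IDF matrix, copied from the printed output
doc1 = [0, 0, 0, 0.53931298, 0, 0, 0, 0, 0, 0, 0,
        0.53931298, 0.35696573, 0.53931298, 0]
doc4 = [0.37791387, 0.29795164, 0.29795164, 0.29795164, 0, 0,
        0.37791387, 0, 0.29795164, 0, 0, 0.29795164,
        0.19721114, 0.29795164, 0.37791387]

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Should be close to the 0.55246518 entry in the matrix above
print(cosine(doc1, doc4))
```

### Because each TF-IDF row is already normalized to length 1, the denominator is approximately 1 and the score is driven entirely by the terms the two documents share ('gown', 'selected', 'user', 'wedding').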


# **Conclusion:**

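### Creating a Vector Space Model means turning text into numbers for analysis. We tokenize the text, then convert it to vectors with CountVectorizer or TfidfVectorizer, while Word2Vec embeddings represent individual words as dense vectors. Cosine similarity then lets us compare documents: here the fourth document, which mentions all three items, is closest to the first (0.55) and the third (0.36). Stripping punctuation during preprocessing and training the embeddings on a larger corpus would make the results more reliable, and the same representations can support other text analyses such as sentiment or topic identification.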