<a href="https://colab.research.google.com/github/sanjeevtrivedi/pgd-dsai/blob/main/TextEncoding-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

You’re developing an NLP pipeline that analyzes customer feedback. You need to prepare the text for machine learning models by applying various text encoding techniques and comparing their representations.




In [None]:
expanded_reviews = [
    """One of the standout features of this phone, without a doubt, is its exceptional camera quality. From the moment I started using it, I was impressed by how vividly it captures colors and details, even in low-light environments. The image stabilization and auto-focus are precise, enabling me to take crisp photos effortlessly. The portrait mode delivers professional-looking shots with beautiful background blur, and the wide-angle lens adds versatility for landscape and group photography. For someone who enjoys documenting moments on the go, the quality rivals even some dedicated digital cameras. It's not just about still photos either—the video recording capabilities are equally impressive, with smooth 4K resolution and effective sound pickup. Overall, the camera alone makes this device a worthwhile purchase for photography lovers and casual users alike.""",

    """Battery life is one of the most disappointing aspects of this phone. Despite moderate usage that includes occasional browsing, social media checks, and streaming a few videos, the phone barely makes it through the day without needing a recharge. This is especially frustrating because I rely on my phone for work, and having to constantly monitor battery levels or carry around a power bank defeats the purpose of having a modern smartphone. What's more, the battery seems to drain faster when using apps that shouldn't be that demanding. Even overnight, the phone loses a noticeable percentage of charge in standby mode, suggesting poor optimization. I expected better performance for the price, and unfortunately, the battery issues overshadow some of the device’s better qualities.""",

    """The display on this phone is absolutely stunning, and it truly enhances the overall user experience. Whether I'm watching high-definition videos, reading articles, or simply browsing through my apps, the screen remains vibrant and crystal clear. The brightness levels are impressive—it remains perfectly readable even under direct sunlight, which is something I struggled with on previous devices. The colors are rich and accurate, making media consumption an absolute joy. The high resolution ensures text is sharp and photos appear detailed. Additionally, the smooth refresh rate makes scrolling and animations feel fluid and responsive. Whether you're a casual user or someone who enjoys streaming and gaming on their phone, the quality of this screen stands out as a major plus. It’s clear the display was designed with great attention to detail and user comfort in mind.""",

    """While the phone itself has its merits, the experience with customer service has been extremely frustrating and disheartening. When I encountered a technical issue early on, I expected prompt and professional support. Unfortunately, what I received was a series of unhelpful responses, long wait times, and multiple transfers between departments. The representatives seemed ill-equipped to handle even basic troubleshooting, and at times, I felt like I was being given the runaround rather than genuine assistance. After several calls and emails, my issue remained unresolved, and I was left feeling abandoned by a company I had trusted. Customer service plays a crucial role in overall satisfaction, and in this case, it severely undermined my confidence in the brand. A great product means little when the support behind it falls so short of expectations.""",

    """From the moment I unboxed this phone, I’ve been nothing short of thrilled with my decision to buy it. The sleek design, intuitive interface, and powerful performance all come together to create a device that feels premium and reliable. Setting it up was a breeze, and everything—from transferring my data to personalizing settings—worked seamlessly. What truly stands out is how balanced the phone feels across various functions. Whether I’m using it for work, entertainment, or communication, it handles every task smoothly. The battery life, display, and speed all meet or exceed expectations, making this purchase one of the most satisfying tech investments I've made in recent years. It's rare to find a device that feels this polished and user-focused. I would happily recommend it to friends, family, or anyone looking for a dependable smartphone."""
]


In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab') # Download the punkt_tab data



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
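def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words, then stem and lemmatize."""
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and special characters.
    # Note: characters are deleted rather than replaced by a space, so
    # "low-light" becomes "lowlight" and "shouldn't" becomes "shouldnt".
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stop words (apostrophe-free forms such as "shouldnt" are not
    # in the stop-word list, so they pass through)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens]

    # Lemmatization (runs on the stems, not on the original words)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

    return " ".join(lemmatized_tokens)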



preprocessed_reviews = [preprocess_text(review) for review in expanded_reviews]
preprocessed_reviews

['one standout featur phone without doubt except camera qualiti moment start use impress vividli captur color detail even lowlight environ imag stabil autofocu precis enabl take crisp photo effortlessli portrait mode deliv professionallook shot beauti background blur wideangl len add versatil landscap group photographi someon enjoy document moment go qualiti rival even dedic digit camera still photo eitherth video record capabl equal impress smooth 4k resolut effect sound pickup overal camera alon make devic worthwhil purchas photographi lover casual user alik',
 'batteri life one disappoint aspect phone despit moder usag includ occasion brow social medium check stream video phone bare make day without need recharg especi frustrat reli phone work constantli monitor batteri level carri around power bank defeat purpos modern smartphon what batteri seem drain faster use app shouldnt demand even overnight phone lose notic percentag charg standbi mode suggest poor optim expect better perfor
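
The merged tokens in the output above (`lowlight`, `wideangl`, `eitherth`) come from deleting punctuation outright. The sketch below contrasts that with a hypothetical variant that substitutes a space instead. The helper names `strip_punctuation` and `punctuation_to_space` are made up for this illustration and are not part of the notebook. The trade-off is that contractions such as "it's" then split into "it" and "s".

```python
import re

def strip_punctuation(text):
    # Same pattern as preprocess_text: delete every character that is
    # neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text.lower())

def punctuation_to_space(text):
    # Hypothetical variant: replace those characters with a space, then
    # collapse runs of whitespace so joined words stay separate.
    spaced = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()

# "\u2014" is the dash that appears in "either<dash>the" in the first review.
sample = "low-light photos either\u2014the video"
print(strip_punctuation(sample))     # lowlight photos eitherthe video
print(punctuation_to_space(sample))  # low light photos either the video
```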

In [None]:
# prompt: create a function for one hot encoding of text to vector

from sklearn.feature_extraction.text import CountVectorizer

def one_hot_encode(preprocessed_reviews):
    # binary=True records presence/absence (1/0) of each term instead of its count
    vectorizer = CountVectorizer(binary=True)
    one_hot_matrix = vectorizer.fit_transform(preprocessed_reviews)
    return one_hot_matrix, vectorizer


In [None]:
# prompt: Utilize the function for converting each text from preprocessed text to vector and create a dataframe

import pandas as pd

one_hot_matrix, vectorizer = one_hot_encode(preprocessed_reviews)

# Create a DataFrame from the one-hot encoded matrix
one_hot_df = pd.DataFrame(one_hot_matrix.toarray(), columns=vectorizer.get_feature_names_out())

one_hot_df


Unnamed: 0,4k,abandon,absolut,accur,across,add,addit,alik,alon,anim,...,watch,what,whether,wideangl,without,work,worthwhil,would,year,your
0,1,0,0,0,0,1,0,1,1,0,...,0,0,0,1,1,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,0,0
2,0,0,1,1,0,0,1,0,0,1,...,1,0,1,0,0,0,0,0,0,1
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,1,0,1,1,0
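
To make the table above easier to read, here is a minimal pure-Python sketch of what a binary bag-of-words matrix contains. It is a simplification, not a reimplementation of `CountVectorizer`: it splits on whitespace only, while `CountVectorizer`'s default token pattern also drops single-character tokens. The function name `binary_bag_of_words` and the toy documents are assumptions made for illustration.

```python
def binary_bag_of_words(docs):
    # Vocabulary: sorted unique tokens across all documents.
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    rows = []
    for doc in docs:
        present = set(doc.split())
        # A cell is 1 if the term occurs at least once in the document,
        # no matter how many times it occurs.
        rows.append([1 if term in present else 0 for term in vocab])
    return vocab, rows

toy_docs = ["camera photo photo", "batteri drain", "camera batteri"]
vocab, rows = binary_bag_of_words(toy_docs)
print(vocab)  # ['batteri', 'camera', 'drain', 'photo']
for row in rows:
    print(row)
# [0, 1, 0, 1]
# [1, 0, 1, 0]
# [1, 1, 0, 0]
```

Note that "photo" appears twice in the first toy document but still yields a single 1, which is the effect of `binary=True`.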


In [None]:
# prompt: Compare two encoded texts from one_hot_df using cosine similarity

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def compare_encoded_text(df, index1, index2):
    """
    Compares two encoded text representations from a DataFrame using cosine similarity.

    Args:
        df: The DataFrame containing the encoded text vectors.
        index1: The index of the first text vector to compare.
        index2: The index of the second text vector to compare.

    Returns:
        The cosine similarity between the two vectors.
    """

    vector1 = df.iloc[index1].values.reshape(1, -1)  # Reshape to 2D array for cosine_similarity
    vector2 = df.iloc[index2].values.reshape(1, -1)
    similarity = cosine_similarity(vector1, vector2)[0][0]
    return similarity



In [None]:
# prompt: Compare all 5 texts with 0

# Assuming 'one_hot_df' and 'compare_encoded_text' are defined in the previous cells

# Compare all 5 texts with text at index 0
for i in range(5):
    similarity = compare_encoded_text(one_hot_df, 0, i)
    print(f"Cosine similarity between text 0 and text {i}: {similarity}")


Cosine similarity between text 0 and text 0: 1.0
Cosine similarity between text 0 and text 1: 0.14406769314303258
Cosine similarity between text 0 and text 2: 0.21623596108783544
Cosine similarity between text 0 and text 3: 0.04289655575932741
Cosine similarity between text 0 and text 4: 0.09460323297592801
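
For binary vectors, cosine similarity reduces to the number of shared terms divided by the geometric mean of the two documents' distinct-term counts, which is why the scores above stay low when reviews share few stems. The following is a sketch of that arithmetic in plain Python. `cosine_binary` is a hypothetical helper, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_binary(a, b):
    # For 0/1 vectors the dot product counts the shared terms, and each
    # norm is the square root of that vector's number of ones.
    shared = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(a))
    norm_b = math.sqrt(sum(b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: no terms, no similarity
    return shared / (norm_a * norm_b)

doc_a = [1, 1, 0, 1]  # three distinct terms
doc_b = [1, 0, 1, 1]  # three distinct terms, two shared with doc_a
print(cosine_binary(doc_a, doc_b))  # 2 / (sqrt(3) * sqrt(3)), about 0.667
```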


In [None]:
new_text = ["This phone's exceptional camera, with vivid detail, low-light performance, and versatile features, makes it a standout choice for both photography enthusiasts and casual users.",
            "Despite the phone's strengths, the poor, unresponsive customer service severely damaged trust and overshadowed the overall user experience."]
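# Note: new_text is vectorized as-is, without preprocess_text, so only words that
# already equal a stemmed vocabulary term (e.g. "phone", "camera") can match.
new_vector = vectorizer.transform(new_text)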
similarity_scores = cosine_similarity(new_vector[0], one_hot_df)


print(similarity_scores)

# Optional: Get index of the most similar document
most_similar_index = similarity_scores.argmax()
print(most_similar_index)
print("Most similar document:", expanded_reviews[most_similar_index])

similarity_scores = cosine_similarity(new_vector[1], one_hot_df)


print(similarity_scores)

# Optional: Get index of the most similar document
most_similar_index = similarity_scores.argmax()
print(most_similar_index)
print("Most similar document:", expanded_reviews[most_similar_index])






[[0.26171196 0.05504819 0.15491933 0.05463584 0.05163978]]
0
Most similar document: One of the standout features of this phone, without a doubt, is its exceptional camera quality. From the moment I started using it, I was impressed by how vividly it captures colors and details, even in low-light environments. The image stabilization and auto-focus are precise, enabling me to take crisp photos effortlessly. The portrait mode delivers professional-looking shots with beautiful background blur, and the wide-angle lens adds versatility for landscape and group photography. For someone who enjoys documenting moments on the go, the quality rivals even some dedicated digital cameras. It's not just about still photos either—the video recording capabilities are equally impressive, with smooth 4K resolution and effective sound pickup. Overall, the camera alone makes this device a worthwhile purchase for photography lovers and casual users alike.
[[0.11704115 0.12309149 0.11547005 0.12216944 0.0577

In [None]:

!pip install numpy==1.24.4




In [None]:
!pip install gensim==4.3.0

Collecting gensim==4.3.0
  Downloading gensim-4.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)
Collecting FuzzyTM>=0.4.0 (from gensim==4.3.0)
  Downloading FuzzyTM-2.0.9-py3-none-any.whl.metadata (7.9 kB)
Collecting pyfume (from FuzzyTM>=0.4.0->gensim==4.3.0)
  Downloading pyFUME-0.3.4-py3-none-any.whl.metadata (9.7 kB)
Collecting scipy>=1.7.0 (from gensim==4.3.0)
  Downloading scipy-1.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.9/58.9 kB 2.4 MB/s eta 0:00:00
Collecting simpful==2.12.0 (from pyfume->FuzzyTM>=0.4.0->gensim==4.3.0)
  Downloading simpful-2.12.0-py3-none-any.whl.metadata (4.8 kB)
Collecting fst-pso==1.8.1 (from pyfume->FuzzyTM>=0.4.0->gensim==4.3.0)
  Downloading fst-pso-1.8.1.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting pandas (from FuzzyTM>=0.4.0->gensim==4.3.0)
  Downloadin

In [None]:
# prompt: create a function for Word2Vec; separate the model download from its use on text

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader as api
import numpy as np

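def download_word2vec_model(model_name="word2vec-google-news-300"):
    """Load a pre-trained model via gensim's downloader; return None on failure."""
    try:
        # api.load downloads the model on first use and reuses the local copy afterwards
        model = api.load(model_name)
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        return None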




In [None]:
# prompt: First download the model and then use preprocessed text array for creation of vectors

word2vec_model = download_word2vec_model()






In [None]:
# prompt: Create a function for Word2Vec vectorization

import numpy as np

def word2vec_vectorization(text, model):
    """
    Generates a vector representation for a given text using Word2Vec.

    Args:
        text: The input text to vectorize.
        model: The pre-trained Word2Vec model.

    Returns:
        A NumPy array representing the text vector, or None if the text is empty or no words are found in the model's vocabulary.
    """
    if not text:
        return None

    words = text.split()
    vectors = []
    for word in words:
        try:
            vectors.append(model[word])
        except KeyError:
            # Handle words not in the vocabulary (e.g., ignore or use a default vector)
            pass

    if not vectors:
        return None

    # Average the word vectors to create a sentence vector
    vector = np.mean(vectors, axis=0)
    return vector


In [None]:
# prompt: Now, using the preprocessed text, generate the Word2Vec vectors

# Assuming 'preprocessed_reviews' is defined from the previous code block

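# Note: preprocessed_reviews contains stems; any stem missing from the model's
# vocabulary is skipped inside word2vec_vectorization.
word2vec_vectors = []
for review in preprocessed_reviews:
    vector = word2vec_vectorization(review, word2vec_model)
    if vector is not None:
        word2vec_vectors.append(vector)
    else:
        # Fall back to a zero vector of the model's dimensionality
        word2vec_vectors.append(np.zeros(word2vec_model.vector_size))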

word2vec_vectors = np.array(word2vec_vectors)

word2vec_vectors


array([[ 0.04074186,  0.04580688, -0.02070821, ..., -0.08305308,
        -0.02249416, -0.01186714],
       [ 0.00418796,  0.01533278, -0.00782776, ..., -0.04441977,
         0.02126298, -0.01283221],
       [ 0.04422738,  0.01686163, -0.02697488, ..., -0.08350185,
         0.00925996, -0.03988509],
       [-0.00413388,  0.04121774, -0.01949748, ...,  0.00237037,
         0.00833993,  0.02956959],
       [ 0.01055705,  0.02991155,  0.0042188 , ..., -0.03675257,
         0.00047201, -0.03776054]], dtype=float32)
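
Each row above is the mean of the word vectors found for one review. The sketch below shows that averaging step with a tiny, invented embedding table, so it runs without downloading a model. `average_vector` and `toy_embeddings` are hypothetical names, and the 2-dimensional vectors are made up for illustration only.

```python
def average_vector(tokens, embeddings):
    # Keep only tokens that have an embedding; out-of-vocabulary tokens
    # are skipped, as in word2vec_vectorization.
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return None
    dim = len(found[0])
    # Component-wise mean over the vectors that were found.
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {"camera": [1.0, 0.0], "photo": [0.0, 1.0]}

print(average_vector(["camera", "photo", "qualiti"], toy_embeddings))  # [0.5, 0.5]
print(average_vector(["qualiti"], toy_embeddings))                     # None
```

Because skipped tokens do not enter the mean, a review whose stems are mostly out of vocabulary is represented by only the few words that were found.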

In [None]:
# prompt: create dataframe of it

# Create a DataFrame from the word2vec vectors
word2vec_df = pd.DataFrame(word2vec_vectors)

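# new_text[0] is split on whitespace without preprocessing, so tokens that carry
# punctuation (e.g. "camera,") are looked up as-is
text_vector = word2vec_vectorization(new_text[0], word2vec_model)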
similarity_scores = cosine_similarity(text_vector.reshape(1, -1), word2vec_df)

print(similarity_scores)




[[0.65249306 0.5223592  0.64548165 0.50153685 0.5977256 ]]


In [None]:
# prompt: Based on similarity score find the most relevant text from corpus

# Optional: Get index of the most similar document
most_similar_index = similarity_scores.argmax()
print(most_similar_index)
print("Most similar document:", expanded_reviews[most_similar_index])

text_vector = word2vec_vectorization(new_text[1], word2vec_model)
similarity_scores = cosine_similarity(text_vector.reshape(1, -1), word2vec_df)

print(similarity_scores)

# Optional: Get index of the most similar document
most_similar_index = similarity_scores.argmax()
print(most_similar_index)
print("Most similar document:", expanded_reviews[most_similar_index])


0
Most similar document: One of the standout features of this phone, without a doubt, is its exceptional camera quality. From the moment I started using it, I was impressed by how vividly it captures colors and details, even in low-light environments. The image stabilization and auto-focus are precise, enabling me to take crisp photos effortlessly. The portrait mode delivers professional-looking shots with beautiful background blur, and the wide-angle lens adds versatility for landscape and group photography. For someone who enjoys documenting moments on the go, the quality rivals even some dedicated digital cameras. It's not just about still photos either—the video recording capabilities are equally impressive, with smooth 4K resolution and effective sound pickup. Overall, the camera alone makes this device a worthwhile purchase for photography lovers and casual users alike.
[[0.4280702  0.5653324  0.5141477  0.54888946 0.48596898]]
1
Most similar document: Battery life is one of the 