In this notebook, we will compare two models that capture the semantic meaning of text: Word2Vec, a neural network-based model, and Sentence Transformers, which utilizes an attention mechanism for better contextual understanding.

Word2Vec: This model, based on a shallow neural network, generates word embeddings that capture semantic relationships between words. It represents each word independently without accounting for surrounding context in a sentence.

Sentence Transformers: This advanced model leverages the attention mechanism and transformer architecture, allowing it to capture context across entire sentences, making it highly effective for tasks like semantic search and text similarity.

Through this comparison, we will evaluate each model’s ability to understand and represent semantic meaning advantage of disadvantage of every model , performance time and results performances


In [51]:
#requirments packages
!pip install chromadb
!pip install -U langchain-community



In [143]:
#needed packages
from gensim.models import Word2Vec
import pandas as pd
import numpy as np
from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , f1_score
from nltk.tokenize import word_tokenize
import nltk
from collections import Counter
nltk.download('punkt')
from sklearn.neighbors import KNeighborsClassifier

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **load google word2vec model sementic word repressention The word2vec-google-news-300 model, which is the pre-trained Word2Vec model from Google with 300-dimensional embeddings, was trained using the skip-gram model, not CBOW**

In [3]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



#Use ChatGPT to generate the test data

In [64]:
# Generate sample data for testing: two categories - one for AI-related topics and the other for Sports-related topics.
ai_data  = [
    # AI and Machine Learning terms
    "algorithm", "analytics", "automation", "augmentation", "big data", "biometrics", "calibration",
    "classification", "clustering", "cognitive computing", "computer vision", "data mining", "data processing",
    "data science", "deep learning", "detection", "feature extraction", "genetic algorithms", "information retrieval",
    "Internet of Things (IoT)", "knowledge representation", "machine learning", "natural language processing",
    "neural networks", "optimization", "predictive modeling", "regression", "reinforcement learning", "robotics",
    "sentiment analysis", "speech recognition", "supervised learning", "unsupervised learning", "virtual assistants",
    "virtual reality", "visualization", "active learning", "adversarial networks", "artificial neural networks",
    "backpropagation", "batch learning", "bias", "big data analytics", "categorization", "chatbots", "classification",
    "clustering", "convolutional networks", "data augmentation", "deep reinforcement learning", "ensemble learning",
    "feature engineering", "generative adversarial networks", "gradient descent", "inference", "labeling",
    "language models", "loss function", "model evaluation", "neural network architecture", "overfitting",
    "parameter tuning", "recurrent neural networks", "training", "transfer learning", "underfitting", "validation",
    "vectorization", "weights", "knowledge graph", "deep neural networks", "attention mechanism", "data pipeline",
    "sparse coding", "neural Turing machines", "dynamic time warping", "knowledge distillation", "autoencoders",
    "deep autoencoders", "unsupervised learning", "recommender systems", "contextualized embeddings", "feature selection",
    "transformers", "BERT", "GPT", "LSTM", "XGBoost", "Markov chains", "decision trees", "support vector machines",
    "k-means clustering", "principal component analysis", "feature vectors", "text embeddings", "self-supervised learning",
    "boosting", "bagging", "tuning", "dimensionality reduction", "cross-validation", "regression analysis", "data wrangling",
    "prediction", "classification accuracy", "performance metrics", "bias-variance tradeoff", "random forest", "ensemble models",
    "cross entropy", "recall", "precision", "F1-score", "ROC curve", "AUC", "learning rate", "convergence", "accuracy" , "Bayesian inference", "data lake", "data warehouse", "dimensionality reduction",
    "elastic net", "exploratory data analysis", "grid search", "hyperparameter optimization",
    "imputation", "k-nearest neighbors", "linear discriminant analysis", "logistic regression",
    "Markov decision process", "multilayer perceptron", "natural gradient", "nearest centroid",
    "ordinal regression", "probabilistic graphical models", "quantile regression",
    "rejection sampling", "sampling methods", "semi-supervised learning", "signal processing",
    "spectral clustering", "stochastic gradient descent", "subsampling", "support vector regression",
    "synthetic data", "time series forecasting", "tokenization", "training set", "test set",
    "validation set", "variance inflation factor", "weighted loss", "word embeddings", "zero-shot learning",
    "Bayesian networks", "causal inference", "counterfactual analysis", "gradient boosting",
    "hyperplane", "latent space", "maximum likelihood estimation", "model drift",
    "non-negative matrix factorization", "ordinal data", "pairwise learning", "prototype networks",
    "quasi-Newton methods", "randomized search", "representation learning", "sample weighting",
    "smoothing", "stacked models", "tensor operations", "transfer entropy", "under-sampling",
    "variational autoencoders", "word2vec", "XLNet", "BPE (Byte Pair Encoding)", "contextual embeddings",
    "fairness in AI", "AI ethics", "model interpretability", "scalability", "online learning",
    "edge computing", "federated learning", "privacy-preserving machine learning", "self-attention",
    "multi-task learning", "active noise cancellation", "ASR (automatic speech recognition)",
    "chatbots", "dialog systems", "multi-modal learning", "attention heads", "memory networks",
    "sentence transformers", "subword tokenization", "meta-learning", "multi-agent systems",
    "few-shot learning", "multi-layer perceptron", "hierarchical clustering", "probabilistic programming",
    "graph neural networks", "self-attention mechanism", "smart cities", "image segmentation",
    "AI governance", "pipeline orchestration", "cloud AI services", "quantum computing in AI"



]

# General Sports terms
Sports_data=["aerobics", "agility", "athletics", "balance", "ball", "base", "basketball", "biomechanics", "bodybuilding",
    "boxing", "calisthenics", "cardio", "championship", "coach", "competition", "conditioning", "cricket", "cycling",
    "defense", "dodgeball", "endurance", "exercise", "fitness", "football", "goal", "gymnastics", "half-time", "handball",
    "hockey", "interval training", "jogging", "karate", "lacrosse", "league", "marathon", "medal", "offense", "opponent",
    "outfield", "performance", "pitch", "player", "referee", "rugby", "running", "score", "soccer", "softball", "stamina",
    "strength", "strategy", "swimming", "team", "tennis", "tournament", "training", "triathlon", "umpire", "uniform",
    "volleyball", "weightlifting", "workout", "wrestling", "kickoff", "playoff", "punch", "race", "reflex", "ring",
    "rotation", "sprint", "teamwork", "tackle", "workout", "wrestler", "goalkeeper", "injury", "tournament", "substitute",
    "huddle", "passing", "counterattack", "dribbling", "kick", "match", "dunk", "run", "goalkeeper", "coach", "medal",
    "ball", "bat", "paddle", "racquet", "striker", "hurdle", "shooting", "opponent", "defender", "goalpost", "referee",
    "playmaker", "fastbreak", "finals", "runner-up", "crosscourt", "bounce", "free throw", "assist", "rebound",
    "half-court", "corner kick", "dribble", "offside", "tournament", "draw", "superbowl", "relay", "marathon", "quarterback",
    "midfield", "foul", "speedwork", "shot put", "athlete", "long jump", "stretches", "sprint", "outfield", "home run",
    "sacrifice bunt", "hit", "playbook", "refereeing", "scoreboard", "coach", "fitness test", "bounce pass", "teamwork",
    "kickoff", "team captain", "huddle", "forward pass", "halfback", "penalty", "game face", "double play", "goal kick",
    "offensive", "defensive", "fast break", "cheerleading", "time-out", "grappling", "judo", "championship ring"
    "athletics", "baseball", "basketball", "boxing", "climbing", "cricket", "cycling",
    "dodgeball", "fencing", "field hockey", "figure skating", "football", "golf",
    "gymnastics", "handball", "hockey", "ice skating", "karate", "kickboxing", "lacrosse",
    "martial arts", "mountain biking", "netball", "parkour", "polo", "racquetball",
    "rowing", "rugby", "sailing", "scuba diving", "skateboarding", "skiing",
    "snowboarding", "soccer", "softball", "squash", "surfing", "swimming", "table tennis",
    "taekwondo", "tennis", "track and field", "triathlon", "ultimate frisbee", "volleyball",
    "water polo", "weightlifting", "wrestling", "yoga", "badminton", "cross-country",
    "freestyle", "hurdles", "javelin throw", "long jump", "marathon", "pole vault",
    "relay race", "shot put", "sprint", "archery", "biathlon", "bowling", "canoeing",
    "equestrian", "fishing", "kite surfing", "motor racing", "powerlifting", "shooting",
    "snowmobiling", "sprint", "trampoline", "windsurfing", "aerobics", "pilates",
    "mountaineering", "free diving", "paragliding", "bobsleigh", "curling", "speed skating",
    "ballet", "cheerleading", "team sports", "individual sports", "endurance", "strength training",
    "sports analytics", "sports medicine", "recovery", "fitness", "physical training", "cardio",
    "stamina", "agility", "ball control", "jumping ability", "aerobic capacity", "coordination",
    "strength", "speed", "balance", "footwork", "offense", "defense", "technique", "skills",
    "competition", "teamwork", "tactics", "strategies", "conditioning", "refereeing",
    "sportsmanship", "injury prevention", "recovery time", "personal best", "tournament", "championship",
    "league", "playoff", "match", "practice", "warm-up", "cool-down", "goal", "scoring",
    "penalty", "foul", "time-out", "half-time", "substitution", "tiebreaker", "extra time"
]






#data preparing

In [95]:
# Create a DataFrame for AI-related data
# 'ai_data' contains the text data related to AI concepts
# 'class_name' column is set to 'ai' to label this dataset as belonging to AI category
df_ai = pd.DataFrame({'text': ai_data, 'class_name': 'ai'})

# Create a DataFrame for Sports-related data
# 'Sports_data' contains the text data related to Sports concepts
# 'class_name' column is set to 'sports' to label this dataset as belonging to Sports category
df_sports = pd.DataFrame({'text': Sports_data, 'class_name': 'sports'})

# Combine the two DataFrames (AI and Sports) into one DataFrame
# The concat function combines the data vertically (stacking the rows)
df = pd.concat([df_ai, df_sports])

# You might have repeated this block, which is unnecessary as it's already done above.
# Hence, the following lines are redundant and can be removed:
# df_ai = pd.DataFrame({'clean': ai_data, 'class_name' : 'ai'})
# df_sports = pd.DataFrame({'clean': Sports_data , 'class_name' : 'sports'})
# df = pd.concat([df_ai, df_sports])

# Add a new column 'class_id' to the DataFrame to assign numeric labels
# If 'class_name' is 'ai', set 'class_id' to 1; otherwise, set it to 2 (for 'sports')
df['class_id'] = np.where(df['class_name'] == "ai", 1, 2)


print(df.shape)
df=df.drop_duplicates()
df=df.reset_index(drop=True)
print(df.shape)
# Add an 'id' column starting from 1
df['id'] = range(1, len(df) + 1)


(504, 3)
(432, 3)


In [94]:
df['class_name'].value_counts()

Unnamed: 0_level_0,count
class_name,Unnamed: 1_level_1
sports,223
ai,209


In [165]:
class Word2vec:
    """
    A class to handle text data using Word2Vec embeddings and classify the data with a K-Nearest Neighbors model.

    Attributes
    ----------
    word2vec_model : gensim.models.Word2Vec
        A pre-trained Word2Vec model that provides vector representations for words.

    df : pandas.DataFrame
        DataFrame containing the text data and labels, with 'text' for input text and 'class_id' for labels.

    X : np.ndarray
        An array of averaged word vectors representing each text instance.

    Y : np.ndarray
        An array of labels corresponding to each text instance.

    train_df : np.ndarray
        Training set features obtained after splitting the data.

    test_df : np.ndarray
        Testing set features obtained after splitting the data.

    train_labels : np.ndarray
        Labels for the training set.

    test_labels : np.ndarray
        Labels for the testing set.

    Methods
    -------
    __init__(word2vec_model, df)
        Initializes the Word2vec instance with the Word2Vec model and dataset, computes word vectors, and splits data.

    get_average_word_vector(concept)
        Computes the average word vector for the given text string (concept).

    test_word2vec_model()
        Trains a K-Nearest Neighbors model on the training data and evaluates its performance on the test data.
    """

    def __init__(self, word2vec_model, df):
        """
        Initializes the Word2vec instance with a pre-trained Word2Vec model and a DataFrame.
        Computes average word vectors for each text entry in the DataFrame and splits data for training and testing.

        Parameters
        ----------
        word2vec_model : gensim.models.Word2Vec
            Pre-trained model used to compute word embeddings.

        df : pandas.DataFrame
            DataFrame containing text data in 'text' column and labels in 'class_id' column.
        """
        self.word2vec_model = word2vec_model
        self.df = df
        # Compute average word vector for each text entry in the DataFrame
        self.df['word2vec'] = self.df['text'].apply(self.get_average_word_vector)
        self.X = np.array(self.df['word2vec'].tolist())
        self.Y = np.array(self.df['class_id'])

        # Split the data into training and testing sets
        self.train_df, self.test_df, self.train_labels, self.test_labels = train_test_split(
            self.X, self.Y, test_size=0.2, random_state=42, shuffle=True
        )

    def get_average_word_vector(self, concept):
        """
        Computes the average word vector for a given text string (concept) by averaging the vectors of individual words.

        Parameters
        ----------
        concept : str
            The input text string for which the vector representation is calculated.

        Returns
        -------
        np.ndarray
            A numpy array representing the averaged word vector for the input text string.
        """
        concept_vector = np.zeros((300,), dtype=float)
        valid_word_count = 0

        # Calculate the sum of word vectors for each valid word in the input text
        for word in word_tokenize(concept):
            if word in self.word2vec_model:
                concept_vector += self.word2vec_model[word]
                valid_word_count += 1

        # If no valid word vectors are found, return a zero vector; otherwise, return the average vector
        return concept_vector if valid_word_count == 0 else concept_vector / valid_word_count

    def test_word2vec_model(self):
        """
        Trains a K-Nearest Neighbors (KNN) classifier using the training data and evaluates its accuracy and F1 score on the test set.

        Returns
        -------
        tuple
            A tuple containing the accuracy and F1 score of the KNN classifier on the test data.
        """
        knn = KNeighborsClassifier(n_neighbors=3)
        knn.fit(self.train_df, self.train_labels)
        y_pred = knn.predict(self.test_df)
        accuracy = accuracy_score(self.test_labels, y_pred)
        f1 = f1_score(self.test_labels, y_pred)
        return accuracy, f1


import time
start_time = time.time()  # Start the timer
obj = Word2vec(wv, df)
# Test the Word2Vec model using the K-Nearest Neighbors classifier and display results
accuracy, f1 = obj.test_word2vec_model()
print("Evaluation Results for K-Nearest Neighbors Classifier:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"F1 Score: {f1:.2f}%")
end_time = time.time()  # End the timer
runtime = end_time - start_time  # Calculate the elapsed time
print(f"Pipeline ran time seconds :" ,runtime)

Evaluation Results for K-Nearest Neighbors Classifier:
Accuracy: 82.76%
F1 Score: 0.86%
Pipeline ran time seconds : 0.06171560287475586


In [166]:
# Initialize the embeddings model for document indexing
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")





In [167]:

class SentenceTransformers:
    """
    A class to manage sentence embeddings and similarity-based classification using a vector database.

    Attributes:
    -----------
    embeddings : object
        The embedding model used to transform text into vector representations.
    df : DataFrame
        A DataFrame containing text data and associated labels.
    train_df : DataFrame
        Training split of the provided DataFrame.
    test_df : DataFrame
        Testing split of the provided DataFrame.
    vector_store : object
        A vector database instance used for similarity search.

    Methods:
    --------
    __init__(embeddings, df):
        Initializes the SentenceTransformers class with an embeddings model and a DataFrame.

    vector_db():
        Creates and initializes a Chroma vector database client, clears the existing data, and adds new documents with metadata.

    follow_steps():
        Performs similarity search for each test document, prints the most similar documents, and identifies the most common class ID.

    test():
        Evaluates the model's accuracy and F1 score by predicting the most common class for each test document based on the top-3 similar documents.

    pipline():
        Runs the full pipeline of vector database setup, similarity search, and evaluation, returning the accuracy and F1 score.
    """

    def __init__(self, embeddings, df):
        """
        Initializes the SentenceTransformers class with an embedding model and DataFrame.

        Parameters:
        -----------
        embeddings : object
            The embedding model used to generate vector representations of text.
        df : DataFrame
            A DataFrame with columns 'text' for document content and 'class_id' for classification labels.
        """
        self.embeddings = embeddings
        self.df = df
        # Splits the data into training and testing sets.
        self.train_df, self.test_df = train_test_split(
            df, test_size=0.2, random_state=42, shuffle=True
        )

    def vector_db(self):
        """
        Sets up the Chroma vector database client, removes any existing collection data,
        creates a new collection, and adds training documents with an 'id' metadata field.

        Returns:
        --------
        vector_store : object
            The initialized Chroma vector store with added documents for similarity search.
        """
        # Initialize Chroma client and clear any existing collection.
        client = chromadb.Client()
        client.delete_collection(name="collection")

        # Create or retrieve the collection and add embeddings
        collection = client.get_or_create_collection(name="collection")
        self.vector_store = Chroma(
            client=client,
            collection_name="collection",
            embedding_function=self.embeddings
        )

        # Prepare and add documents to the vector store
        documents = [Document(page_content=t, metadata={"id": i-1}) for i, t in zip(self.train_df['id'][:], self.train_df['text'][:])]
        self.vector_store.add_documents(documents)

        return self.vector_store

    def follow_steps(self):
        """
        For each document in the test set, retrieves the top-3 most similar documents, prints similarity details,
        and identifies the most frequent class among the similar documents, along with the true class.
        """
        for idx, (text, true_class) in enumerate(zip(self.test_df['text'], self.test_df['class_id'])):
            most_similar_docs = self.vector_store.similarity_search(text, k=3)
            print("Query Text:", text)
            print("Most Similar Documents:")
            output_labels = []
            for idx, sim_doc in enumerate(most_similar_docs):
                class_id = self.df.loc[self.df['id'] == sim_doc.metadata['id'], 'class_id'].values[0]
                print(f"Rank {idx + 1}:")
                print("Similar Text:", sim_doc.page_content)
                print("Class ID:", class_id)
                print("-" * 30)
                output_labels.append(class_id)

            item_counts = Counter(output_labels)
            most_common_item = item_counts.most_common(1)[0]
            print("Most Common Item:", most_common_item[0])
            print("True Class:", true_class)

    def test(self):
        """
        Evaluates the model on the test set by predicting the most frequent class among the top-3 similar documents
        for each test document. Calculates accuracy and F1 score.

        Returns:
        --------
        accuracy : float
            The accuracy score for the model predictions on the test set.
        f1 : float
            The F1 score for the model predictions on the test set.
        """
        predicted_out = []
        true_labels = self.test_df['class_id'].tolist()
        for text in self.test_df['text']:
            most_similar_docs = self.vector_store.similarity_search(text, k=3)
            output_labels = []
            for sim_doc in most_similar_docs:
                class_id = self.df.loc[self.df['id'] == sim_doc.metadata['id'], 'class_id'].values[0]
                output_labels.append(class_id)

            item_counts = Counter(output_labels)
            most_common_item = item_counts.most_common(1)[0]
            predicted_out.append(most_common_item[0])

        return accuracy_score(true_labels, predicted_out), f1_score(true_labels, predicted_out)

    def pipline(self):
        """
        Executes the full pipeline: setting up the vector database, performing similarity search for each test document,
        and evaluating the model's accuracy and F1 score.

        Returns:
        --------
        accuracy : float
            The accuracy score after evaluating the model.
        f1 : float
            The F1 score after evaluating the model.
        """
        self.vector_db()
        self.follow_steps()
        accuracy, f1 = self.test()
        return accuracy, f1


import time
start_time = time.time()  # Start the timer

sent_obj = SentenceTransformers(embeddings, df)
accuracy, f1 = sent_obj.pipline()
print("Evaluation Results for Sentence Transformers:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"F1 Score: {f1:.2f}%")

end_time = time.time()  # End the timer
runtime = end_time - start_time  # Calculate the elapsed time
print(f"Pipeline ran time seconds :" ,runtime)


Query Text: personal best
Most Similar Documents:
Rank 1:
Similar Text: prediction
Class ID: 1
------------------------------
Rank 2:
Similar Text: accuracy
Class ID: 1
------------------------------
Rank 3:
Similar Text: self-attention
Class ID: 1
------------------------------
Most Common Item: 1
True Class: 2
Query Text: autoencoders
Most Similar Documents:
Rank 1:
Similar Text: deep learning
Class ID: 1
------------------------------
Rank 2:
Similar Text: generative adversarial networks
Class ID: 1
------------------------------
Rank 3:
Similar Text: deep neural networks
Class ID: 1
------------------------------
Most Common Item: 1
True Class: 1
Query Text: scalability
Most Similar Documents:
Rank 1:
Similar Text: performance
Class ID: 2
------------------------------
Rank 2:
Similar Text: speed
Class ID: 2
------------------------------
Rank 3:
Similar Text: performance metrics
Class ID: 1
------------------------------
Most Common Item: 2
True Class: 1
Query Text: speech recogni

comments on results
# **sentence transformers capture semantic meaning best than word2vec but run time would be long compared with word2vec i used vector db to imporve run  time  , word2vec has limited vocabulary to depend on training data**