In [1]:
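# Step 1: Install required packages
# numpy and gensim are pinned to avoid the compatibility issues noted under Challenges.
# scikit-learn and torch are imported in Step 2, so they are listed here as well.
!pip install --quiet "numpy==1.26.4" "gensim==4.3.3" transformers datasets nltk scikit-learn torch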

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Step 2: Imports
from datasets import load_dataset
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import gensim.downloader as api

In [3]:
# Step 3: Load Dataset and Preprocess
dataset = load_dataset('imdb', split='train').select(range(1000))
texts = dataset['text']

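def preprocess(text):
    # Lowercase, tokenize with NLTK, and rejoin the tokens with single spaces.
    return " ".join(word_tokenize(text.lower()))

clean_texts = [preprocess(t) for t in texts]        # strings for BoW, TF-IDF, and BERT
tokenized_texts = [t.split() for t in clean_texts]  # token lists for word-vector lookups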

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
# Step 4: Bag of Words & TF-IDF
bow = CountVectorizer(max_features=5000)
bow_matrix = bow.fit_transform(clean_texts)
print("BoW shape:", bow_matrix.shape)

tfidf = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf.fit_transform(clean_texts)
print("TF-IDF shape:", tfidf_matrix.shape)

BoW shape: (1000, 5000)
TF-IDF shape: (1000, 5000)
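
To make the TF-IDF weighting less of a black box, here is a minimal pure-Python sketch of the smoothed, L2-normalised formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The helper `tfidf_rows` and the toy corpus are hypothetical, and the whitespace tokenization is a simplification, so the numbers will not match the vectorizer above exactly.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised, smoothed TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["good movie", "bad movie", "good good plot"]  # hypothetical toy corpus
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document, "bad" receives a higher weight than "movie" because it occurs in fewer documents, which is the effect TF-IDF adds on top of raw counts.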


In [5]:
# Step 5: BERT Embeddings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

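def get_embedding(text):
    # Encode a single review; inputs longer than 128 tokens are truncated.
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():  # inference only, so no gradients are tracked
        outputs = model(**inputs)
    # Mean-pool the final-layer token vectors into one 768-dimensional vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Embed only the first five reviews.
bert_embeddings = np.vstack([get_embedding(t) for t in clean_texts[:5]])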
print("BERT Embeddings shape:", bert_embeddings.shape)

BERT Embeddings shape: (5, 768)
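
Because `get_embedding` encodes one review at a time, no padding tokens are present and a plain mean over tokens is fine. If the reviews were batched instead, padded positions would leak into the average. The following is a minimal NumPy sketch of mask-aware mean pooling; the toy arrays stand in for `last_hidden_state` and `attention_mask`, and `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: the third token of the first sequence is padding.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                   [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded vector `[100, 100]` is excluded from the first average, whereas an unmasked `mean(axis=1)` would include it.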


In [6]:
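# Step 6: Pretrained Word2Vec (Google News)
# Note: this is a large download (more than a gigabyte) on first use.
w2v_model = api.load("word2vec-google-news-300")

def avg_word2vec(tokens):
    # Average the vectors of in-vocabulary tokens; fall back to zeros if none match.
    vecs = [w2v_model[word] for word in tokens if word in w2v_model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)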

w2v_embeddings = np.vstack([avg_word2vec(t) for t in tokenized_texts[:5]])
print("Word2Vec Embeddings shape:", w2v_embeddings.shape)

Word2Vec Embeddings shape: (5, 300)
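
Averaging silently drops every token that is missing from the pretrained vocabulary, so it can be useful to know how much of each review actually contributes. Here is a small sketch that returns the coverage ratio alongside the mean vector; the two-dimensional `toy_table` and the helper `average_vectors` are hypothetical stand-ins for a real embedding model.

```python
import numpy as np

def average_vectors(tokens, table, dim):
    """Return the mean vector of in-vocabulary tokens and the coverage ratio."""
    vecs = [table[t] for t in tokens if t in table]
    coverage = len(vecs) / len(tokens) if tokens else 0.0
    mean = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return mean, coverage

# Hypothetical 2-dimensional embedding table for illustration only.
toy_table = {"good": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}
vec, coverage = average_vectors(["good", "film", "zzz", "!"], toy_table, dim=2)
print(vec, coverage)
```

A low coverage value would suggest that the preprocessing (for example, lowercasing or keeping punctuation tokens) does not line up well with the vocabulary of the pretrained vectors.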


In [7]:
# Step 7: GloVe Embeddings
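# -nc skips the download if the archive already exists; -o overwrites extracted
# files without prompting, so the cell can be re-run safely.
!wget -q -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q -o glove.6B.zip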

glove_model = {}
with open("glove.6B.100d.txt", "r", encoding="utf8") as f:
    for line in f:
        values = line.split()
        glove_model[values[0]] = np.asarray(values[1:], dtype='float32')

def avg_glove(tokens):
    vecs = [glove_model[w] for w in tokens if w in glove_model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

glove_embeddings = np.vstack([avg_glove(t) for t in tokenized_texts[:5]])
print("GloVe Embeddings shape:", glove_embeddings.shape)

GloVe Embeddings shape: (5, 100)
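
The loading loop above relies on GloVe's plain-text layout: each line holds a token followed by its space-separated vector components. A minimal sketch of the same parsing on an in-memory sample, so the format can be inspected without the full download; the three-dimensional sample values and the helper `load_glove` are made up for illustration.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse GloVe's text format: one token followed by its floats per line."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip blank or malformed lines
            continue
        table[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return table

# Hypothetical 3-dimensional sample in the same layout as glove.6B.100d.txt.
sample = io.StringIO("the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\n")
toy_glove = load_glove(sample)
print({word: vec.shape for word, vec in toy_glove.items()})
```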


In [8]:
# Step 8: Pretrained FastText (Wiki News)
ft_model = api.load("fasttext-wiki-news-subwords-300")

def avg_fasttext(tokens):
    # Tokens missing from the pretrained vocabulary are skipped, as in Steps 6 and 7.
    vecs = [ft_model[word] for word in tokens if word in ft_model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

fasttext_embeddings = np.vstack([avg_fasttext(t) for t in tokenized_texts[:5]])
print("FastText Embeddings shape:", fasttext_embeddings.shape)

FastText Embeddings shape: (5, 300)
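
FastText's subword idea is that a word is represented through its character n-grams, with `<` and `>` marking the word boundaries, so that an unseen word can still be composed from n-grams seen during training. The sketch below only illustrates the n-gram extraction step; `char_ngrams` is a hypothetical helper, and the 3-to-6 range is the commonly cited default rather than something configured in this notebook.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

grams = char_ngrams("where", n_min=3, n_max=3)
print(grams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Note that `avg_fasttext` above does not use this mechanism: any token for which `word in ft_model` is false is simply left out of the average.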


In [None]:
'''
Project Summary and Conclusion
This project explored several text embedding techniques for converting raw text into numerical representations. Using the first 1,000 training reviews from the IMDb dataset, we applied classical and modern embedding methods and compared their output shapes and general characteristics.

Data Preparation:
Each review was lowercased and tokenized with NLTK's word_tokenize, which relies on the punkt models. The tokens were rejoined with spaces for the vectorizers and BERT, and kept as lists for the word-embedding lookups.

Embedding Techniques:

Bag of Words (BoW) and TF-IDF: These foundational techniques transformed the 1,000 reviews into sparse, high-dimensional vectors with the vocabulary capped at 5,000 terms, giving 1000 x 5000 matrices. While simple and interpretable, they ignore word order and do not capture semantic relationships between words.

Pretrained Word Embeddings (Word2Vec, GloVe, FastText): Using pretrained models, we mapped each token to a dense vector and averaged the vectors of in-vocabulary tokens to obtain one vector per review, computed here for the first five reviews. Word2Vec and FastText (both 300-dimensional) and GloVe (100-dimensional) give far more compact representations than BoW/TF-IDF and encode semantic and syntactic regularities. FastText’s subword information can help represent out-of-vocabulary words in general; in this notebook, however, tokens missing from the loaded vocabulary were simply skipped, as with Word2Vec and GloVe.

Contextual Embeddings (BERT): Using bert-base-uncased, we generated 768-dimensional embeddings for the first five reviews by mean-pooling the final-layer token vectors, with inputs truncated to 128 tokens. Because each token vector depends on its surrounding context, such embeddings often help on downstream tasks that require nuanced language understanding, although no downstream task was evaluated here.

Observations:
The classical approaches (BoW and TF-IDF) are computationally cheap but limited in capturing deeper linguistic meaning. Pretrained static embeddings narrow this gap by placing semantically similar words close together, though each word receives the same vector in every context. Contextual models such as BERT capture context-dependent meaning at the cost of more computation and memory.

Challenges:
During implementation, pinning package versions (numpy==1.26.4 and gensim==4.3.3) was crucial to avoid compatibility issues. The Hugging Face warning about the missing HF_TOKEN secret also appeared, but it did not block access to the public dataset and model.

Conclusion:
This exercise deepened our understanding of the main embedding approaches and the trade-offs between simplicity, computational cost, and representational richness. These insights are a foundation for building natural language processing systems that analyze and interpret text.
'''
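
The cells above compare the methods only by output shape. One possible next step, sketched here under the assumption that each method yields one row vector per review, is to compute pairwise cosine similarities and see which reviews each representation considers close. The helper `cosine_similarity_matrix` and the toy array are hypothetical; arrays such as `bert_embeddings` or `glove_embeddings` could be passed in the same way.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy data: rows 0 and 1 point the same way, row 2 is orthogonal to both.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Cosine similarity ignores vector length, which matters here because averaged word vectors and pooled BERT vectors live on very different scales.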