Overview of different NLP feature extractors used.
Covers TFIDF, LLM features

In [1]:
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Access the data and target labels
data = newsgroups_data.data
target = newsgroups_data.target

# Print some sample data
print("Sample text:", data[0])
print("Sample target:", target[0])

Sample text: 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


Sample target: 10


In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer and define the stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove special characters and digits using regex
    text = re.sub(r'[^a-zA-Z\s]', '', text, re.MULTILINE | re.IGNORECASE)

    # Convert to lowercase
    text = text.lower()

    # Remove emails
    cleaned_text = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', '', text)
    
    # Remove phone numbers
    cleaned_text = re.sub(r'\b(?:\+\d{1,2}\s)?\(?\d{1,4}[\)\-\s]?\d{1,4}[\s\-]?\d{1,4}\b', '', cleaned_text)

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords and lemmatize the words
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    # Join the cleaned tokens back into a single string
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

# Clean the dataset
cleaned_data = [clean_text(text) for text in data]

# Print the cleaned sample data
print("Original text:", data[0])
print("\nCleaned text:", cleaned_data[0])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original text: 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



Cleaned text: sure bashers pen fan pretty confused lack kind post recent pen massacre devil actually bit puzzled bit relieved however going put end nonpittsburghers relief bit praise pen man killing devil worse thought jagr showed much better regular season stats also lo

Sentence 1: "The cat chased the mouse."
Sentence 2: "The dog barked at the cat."
Sentence 3: "The mouse ran away from the cat and the dog."

Let's calculate the TF-IDF features for each term in the example sentences:

To simplify the example, let's consider the terms: "cat", "dog", "mouse", and "chased."

    Term Frequency (TF):
    Let's calculate the term frequency for each term in the sentences:
        Term Frequency of "cat" in Sentence 1: 0.2
        Term Frequency of "dog" in Sentence 2: 0.2
        Term Frequency of "mouse" in Sentence 1: 0.2
        Term Frequency of "mouse" in Sentence 3: 0.2
        Term Frequency of "chased" in Sentence 1: 0.2

    Inverse Document Frequency (IDF):
    Let's calculate the inverse document frequency for each term:
        IDF of "cat": log(3/2) ≈ 0.176
        IDF of "dog": log(3/2) ≈ 0.176
        IDF of "mouse": log(3/2) ≈ 0.176
        IDF of "chased": log(3/1) ≈ 0.405

    TF-IDF:
    Finally, let's calculate the TF-IDF for each term by multiplying the term frequency (TF) with the inverse document frequency (IDF):
        TF-IDF of "cat" in Sentence 1: 0.2 * 0.176 ≈ 0.035
        TF-IDF of "dog" in Sentence 2: 0.2 * 0.176 ≈ 0.035
        TF-IDF of "mouse" in Sentence 1: 0.2 * 0.176 ≈ 0.035
        TF-IDF of "mouse" in Sentence 3: 0.2 * 0.176 ≈ 0.035
        TF-IDF of "chased" in Sentence 1: 0.2 * 0.405 ≈ 0.081

Therefore, the TF-IDF feature representation for the example sentences would be:

Sentence 1: [0.035, 0, 0.035, 0.081]
Sentence 2: [0, 0.035, 0, 0]
Sentence 3: [0, 0, 0.035, 0]

Each sentence is represented by a vector where the elements correspond to the TF-IDF value of each term. Terms that are not present in a sentence will have a TF-IDF value of 0.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the example sentences
sentences = [
    "The cat chased the mouse.",
    "The dog barked at the cat.",
    "The mouse ran away from the cat and the dog."
]
stop_words = list(set(stopwords.words('english')))
# Create an instance of TfidfVectorizer with desired settings
vectorizer = TfidfVectorizer(lowercase=True, stop_words=stop_words)

# Fit the vectorizer to the sentences and transform the sentences to TF-IDF features
tfidf_features = vectorizer.fit_transform(sentences)

# Get the vocabulary (terms) from the vectorizer
vocabulary = vectorizer.vocabulary_

# Sort the vocabulary by index to obtain the feature names
feature_names = sorted(vocabulary, key=vocabulary.get)

# Print the feature names and their length
print("Feature Names:", feature_names)
print("Number of unique terms:", len(feature_names))
# Print the TF-IDF features
tfidf_features_array = tfidf_features.toarray()
for i, sentence in enumerate(sentences):
    print(f"Features for Sentence {i+1}: {tfidf_features_array[i]}")

Feature Names: ['away', 'barked', 'cat', 'chased', 'dog', 'mouse', 'ran']
Number of unique terms: 7
Features for Sentence 1: [0.         0.         0.42544054 0.72033345 0.         0.54783215
 0.        ]
Features for Sentence 2: [0.         0.72033345 0.42544054 0.         0.54783215 0.
 0.        ]
Features for Sentence 3: [0.53409337 0.         0.31544415 0.         0.40619178 0.40619178
 0.53409337]


In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Point #3: Feature Engineering using CountVectorizer
count_vectorizer = CountVectorizer(max_features=5000)
count_features = count_vectorizer.fit_transform(cleaned_data)

# Print a sample of the count features
print("CountVectorizer features (sample):")
print(count_features[0])

# Point #4: Feature Engineering using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_features = tfidf_vectorizer.fit_transform(cleaned_data)

# Print a sample of the TF-IDF features
print("\nTF-IDF features (sample):")
print(tfidf_features[0])

CountVectorizer features (sample):
  (0, 4380)	1
  (0, 3310)	5
  (0, 1784)	1
  (0, 3477)	1
  (0, 1127)	1
  (0, 2575)	1
  (0, 2547)	1
  (0, 3437)	1
  (0, 3697)	1
  (0, 2810)	1
  (0, 1411)	2
  (0, 315)	1
  (0, 725)	3
  (0, 2237)	1
  (0, 2039)	2
  (0, 3591)	1
  (0, 1616)	1
  (0, 3751)	1
  (0, 2778)	1
  (0, 2545)	1
  (0, 4927)	1
  (0, 4517)	1
  (0, 2460)	2
  (0, 4093)	1
  (0, 2993)	1
  (0, 699)	1
  (0, 3734)	2
  (0, 3988)	2
  (0, 4274)	1
  (0, 398)	1
  (0, 2721)	2
  (0, 1948)	2
  (0, 4836)	1
  (0, 3399)	1
  (0, 2641)	1
  (0, 3067)	1
  (0, 1218)	1
  (0, 1970)	2
  (0, 4127)	1
  (0, 665)	1
  (0, 2474)	1
  (0, 456)	1
  (0, 4002)	1
  (0, 2446)	1
  (0, 2717)	1
  (0, 1839)	1
  (0, 3901)	1

TF-IDF features (sample):
  (0, 3901)	0.09011068683583343
  (0, 1839)	0.09557491922218497
  (0, 2717)	0.10448911696908358
  (0, 2446)	0.12509426968322537
  (0, 4002)	0.05834186427171129
  (0, 456)	0.08293063853061822
  (0, 2474)	0.12151903750404754
  (0, 665)	0.10585643259752014
  (0, 4127)	0.0626256321351594
 

In [4]:
import tensorflow as tf
import tensorflow_hub as hub

**The Universal Sentence Encoder (USE)** is a pre-trained model developed by Google that converts text data into fixed-sized embeddings (usually 512-dimensional) to be used in various natural language understanding tasks. The USE is designed to provide a good trade-off between model size and performance. It can effectively encode semantics and sentence-level information, making it useful for tasks such as sentiment analysis, text classification, and semantic similarity.

In [5]:
# Load the Universal Sentence Encoder model
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Define input sentences
sentences = ["This is an example sentence.", "Another example sentence goes here."]

# Extract embeddings
embeddings = model(sentences)

# Print the embeddings
print(embeddings.shape)

(2, 512)


In [8]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m40.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m20.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m36.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [9]:

from transformers import BertTokenizer, BertModel
import torch

**BERT (Bidirectional Encoder Representations from Transformers)** is a pre-trained language model developed by Google that has shown state-of-the-art results on various natural language understanding tasks. BERT is designed to capture both contextual and semantic information from text by using a bidirectional Transformers architecture. It can be fine-tuned for specific tasks such as sentiment analysis, text classification, named entity recognition, and question-answering.

Comparing BERT with Universal Sentence Encoder (USE) embeddings:

1. BERT is computationally more expensive and requires more resources due to its large model size. USE is designed to provide a good trade-off between model size and performance, making it more efficient for certain applications.
2. BERT can be fine-tuned for specific tasks, which can lead to better performance in certain cases. USE is not designed for fine-tuning, but it can be a strong baseline for several tasks without the need for task-specific fine-tuning.

In [11]:
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Define input sentence
sentence = "This is an example sentence."

input_ids = tokenizer.encode(sentence, return_tensors="pt")

# Extract embeddings
with torch.no_grad():
    outputs = model(input_ids=input_ids)

# last_hidden_state is in outputs[0]
last_hidden_state = outputs[0]

# pooler_output is in outputs[1]
pooler_output = outputs[1]

# Print the last_hidden_state and pooler_output
print("Last Hidden State shape is :", last_hidden_state.shape)
print("Pooler Output shape is:", pooler_output.shape)

# Print the embeddings
print(pooler_output.shape)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Last Hidden State shape is : torch.Size([1, 8, 768])
Pooler Output shape is: torch.Size([1, 768])
torch.Size([1, 768])


The shape of `last_hidden_state` is 3-dimensional because it represents the hidden states for all tokens in the input sequence across the batch and the hidden layers. The dimensions can be explained as follows:

1. `batch_size`: The number of input sequences processed together as a batch. In the provided example, we only have one sentence, so the batch size is 1. If you process multiple sentences simultaneously, the batch size will be equal to the number of sentences.

2. `sequence_length`: The number of tokens in the input sequence after tokenization, including special tokens like [CLS] and [SEP]. This dimension represents the length of the input sequence.

3. `hidden_size`: The size of the hidden states in the BERT model. For the BERT base model, this size is 768, while for the BERT large model, it is 1024. This dimension represents the feature vector or embedding for each token in the input sequence.

So the 3-dimensional tensor of shape `(batch_size, sequence_length, hidden_size)` contains the embeddings for each token in every input sequence, for all sequences in the batch, and these embeddings are generated from the last layer of the BERT model.

The `sequence_length` is equal to 8 because the input sentence has been tokenized into 8 tokens, including the special tokens added by the BERT tokenizer. Let's break it down:

Input sentence: "This is an example sentence."

After tokenization and adding special tokens, we get the following token sequence:

[CLS], 'this', 'is', 'an', 'example', 'sentence', '.', [SEP]

The [CLS] and [SEP] tokens are special tokens added by the BERT tokenizer. [CLS] is the classification token, which is added at the beginning of the sequence and is used for classification tasks. [SEP] is the separator token, which is added at the end of the sequence to mark the end of the input.

So in total, there are 8 tokens in the tokenized sequence, which is why the `sequence_length` is equal to 8.

In [17]:
sentences = [
    "This is an example sentence.",
    "Another example sentence goes here. It is an incredible experience here right now to be using google colab without worrying about dependencies."
]

# Tokenize and pad sentences
input_ids = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(type(input_ids))
with torch.no_grad():
    outputs = model(**input_ids)  # Pass input_ids and attention_mask together

# last_hidden_state is in outputs[0]
last_hidden_state = outputs[0]

# pooler_output is in outputs[1]
pooler_output = outputs[1]

# Print the last_hidden_state and pooler_output shapes
print("Last Hidden State shape is :", last_hidden_state.shape)
print("Pooler Output shape is:", pooler_output.shape)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
Last Hidden State shape is : torch.Size([2, 28, 768])
Pooler Output shape is: torch.Size([2, 768])


**Deciding whether to use `last_hidden_state` or `pooled_output`** as features depends on the nature of the NLP task you are working on and the level of information you require from the embeddings.

1. Use `last_hidden_state` for token-level tasks: If your task requires token-level information, such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, or dependency parsing, use the `last_hidden_state`. This tensor contains contextualized embeddings for each token in the input sequence, allowing you to capture token-level information and relationships between tokens.

2. Use `pooled_output` for sentence-level tasks: If your task requires sentence-level information or representations, such as sentiment analysis, text classification, or semantic similarity, use the `pooled_output`. This vector is derived from the hidden state of the [CLS] token and is designed to capture sentence-level information. Using `pooled_output` allows you to obtain a fixed-size representation for each input sentence, which can then be used as input to classifiers or other machine learning models.

In summary, choose `last_hidden_state` for token-level tasks and `pooled_output` for sentence-level tasks based on the requirements of your specific NLP problem.

**RoBERTa and DistilBERT** are both variants of the original BERT model, with some differences in model architecture, pre-training strategies, and objectives. Here's a brief comparison of these models with BERT:

1. **RoBERTa (Robustly Optimized BERT Pretraining Approach)**:

- RoBERTa is designed to improve the pre-training process of BERT by using larger batch sizes, longer sequence lengths, and more training data.
- It removes the next sentence prediction (NSP) task used in BERT's pre-training objective, focusing only on masked language modeling (MLM).
- RoBERTa uses dynamic masking instead of static masking, meaning that the masking pattern changes during pre-training.
- The result is a model that performs on par or better than BERT on various NLP benchmarks while maintaining a similar architecture.

2. **DistilBERT (Distilled version of BERT)**:

- DistilBERT is a smaller version of the original BERT model, created using knowledge distillation techniques.
- The primary goal is to reduce the model's size and computational requirements while maintaining a high level of performance.
- DistilBERT has approximately half the number of parameters compared to BERT-base (66 million vs. 110 million) and is faster during training and inference.
- It retains around 95% of BERT's performance across various NLP tasks while being more computationally efficient.

In [28]:
## Few more feature extractors - 
#1. RoBERTa (A Robustly Optimized BERT Pretraining Approach)

from transformers import RobertaTokenizer, RobertaModel

# Load RoBERTa tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Define input sentence
sentences = ["This is an example sentence.", "Google colab is great for trying all packages in one place without dependency issues"]

# Tokenize and encode the sentences
input_ids = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Extract embeddings
with torch.no_grad():
    outputs = model(input_ids=input_ids["input_ids"], attention_mask=input_ids["attention_mask"])

# last_hidden_state is in outputs[0]
last_hidden_state = outputs[0]

# pooler_output is in outputs[1]
pooler_output = outputs[1]

# Print the last_hidden_state and pooler_output shapes
print("Last Hidden State shape :", last_hidden_state.shape)
print("Pooler Output shape:", pooler_output.shape)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Last Hidden State shape : torch.Size([2, 17, 768])
Pooler Output shape: torch.Size([2, 768])


In [29]:
#2. DistilBERT (Distilled version of BERT)

from transformers import DistilBertTokenizer, DistilBertModel
import torch

# Load DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Define input sentences
sentences = ["This is an example sentence.", "Google colab is great for trying all packages in one place without dependency issues"]

# Tokenize and encode the sentences
input_ids = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Extract embeddings
with torch.no_grad():
    outputs = model(input_ids=input_ids["input_ids"], attention_mask=input_ids["attention_mask"])

# last_hidden_state is in outputs[0]
last_hidden_state = outputs[0]

# Print the last_hidden_state shape
print("Last Hidden State shape :", last_hidden_state.shape)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Last Hidden State shape : torch.Size([2, 17, 768])


** DistilBERT** does not have a separate `pooler_output` like BERT. So in this code, we only extract the `last_hidden_state`.