This notebook is a guide to preprocessing text data with Byte Pair Encoding (BPE). BPE originated as a data compression algorithm and is now commonly used in natural language processing (NLP) to build a vocabulary of subword units from a large corpus of text.

BPE is particularly useful for language modeling and machine translation because it keeps the vocabulary small while still representing rare or unseen words as sequences of known subword units.

In this notebook, we will demonstrate how to apply BPE to a large corpus of text data and generate subword units that can be used to train an NLP model. We will provide a step-by-step guide on how to construct and implement the BPE algorithm.
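As a rough illustration of the algorithm, the sketch below learns BPE merge rules from a toy corpus and then applies them to a new word. It is a minimal pure-Python version written for readability, not the implementation used by any particular library. The helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`, `apply_bpe`), the `</w>` end-of-word marker, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, symbols):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of text strings."""
    word_freqs = Counter(word for text in corpus for word in text.split())
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            new_symbols = merge_pair(best, symbols)
            new_vocab[new_symbols] = new_vocab.get(new_symbols, 0) + freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(pair, symbols)
    return list(symbols)


# Toy corpus (hypothetical, not the notebook's data)
corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest newest newest",
    "widest widest widest",
]
merges = learn_bpe(corpus, num_merges=5)
print(merges)
print(apply_bpe("lowest", merges))  # ['low', 'est</w>']
```

The word "lowest" never appears in the toy corpus, yet it is split into two learned subword units. This is how BPE covers unseen words with a small vocabulary.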

By following these steps, you will be able to preprocess your text data with BPE and evaluate how it affects the performance of the LightGBM classifier trained in this notebook.

In [1]:
import pandas as pd
import numpy as np
import transformers
import torch
import lightgbm as lgb

In [2]:
# CUDA_LAUNCH_BLOCKING=1

In [14]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Wed Apr 12 22:16:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |   1916MiB /  4096MiB |      2%      Default |
|                               |            

In [30]:
# Define the function to create the hybrid word embeddings
# (as written, each text is represented by the mean of its GloVe token vectors)
import gensim.downloader as api  # provides api.load for pre-trained embeddings

def create_hybrid_word_embeddings(texts):
    # Load the pre-trained GloVe embeddings (300-dimensional vectors)
    glove_embeddings = api.load("glove-wiki-gigaword-300")

    # Tokenize the texts on whitespace
    tokenized_texts = [text.split() for text in texts]

    # Initialize the embeddings matrix, one row per text
    embeddings_matrix = np.zeros((len(tokenized_texts), 300))

    # Create the embeddings for each text
    for i, tokens in enumerate(tokenized_texts):
        # Leave the row as zeros for an empty text; np.mean over zero tokens would yield NaN
        if not tokens:
            continue

        # Initialize the embeddings for this text
        text_embeddings = np.zeros((len(tokens), 300))

        # Create the embeddings for each token in the text
        for j, token in enumerate(tokens):
            # Use the GloVe embedding if the token is in the vocabulary;
            # out-of-vocabulary tokens keep a zero vector
            if token in glove_embeddings:
                text_embeddings[j] = glove_embeddings[token]

        # Calculate the text embedding by taking the mean of the token embeddings
        text_embedding = np.mean(text_embeddings, axis=0)

        # Set the embeddings matrix row for this text
        embeddings_matrix[i] = text_embedding

    return embeddings_matrix

# Create the hybrid word embeddings for the data
embeddings = create_hybrid_word_embeddings(data['text'])

# Split the data into train and validation sets
train_size = int(0.8 * len(data))
train_data = embeddings[:train_size]
train_labels = data['label'][:train_size]

valid_data = embeddings[train_size:]
valid_labels = data['label'][train_size:]

# Define the parameter search space
param_distributions = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [16, 32, 64],
    'max_depth': [6, 8, 12],
    'min_child_weight': [1, 5, 10],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0.01, 0.1, 1.0],
    'reg_lambda': [0.01, 0.1, 1.0],
}

# Train the LGBMClassifier using randomized search
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

model = LGBMClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='f1_micro',
    verbose=3,
    random_state=42,
)

random_search.fit(train_data, train_labels)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.673 total time=   6.6s
[CV 2/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.676 total time=   6.6s
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.681 total time=   6.5s
[CV 4/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.675 total time=   6.5s
[CV 5/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alph

[CV 2/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.675 total time=  31.3s
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.683 total time=  30.9s
[CV 4/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.675 total time=  31.2s
[CV 5/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.674 total time=  31.0s
[CV 1/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=5, n_estimators=1000, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=1.0;, score=0.690 total time=  37.

[CV 3/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.697 total time=  37.8s
[CV 4/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.693 total time=  37.9s
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.697 total time=  37.8s
[CV 1/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=12, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=0.6;, score=0.687 total time=  15.6s
[CV 2/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=12, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=0.6;, score=0.686 total time=  

[CV 4/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=6, min_child_weight=1, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.692 total time=  45.0s
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=6, min_child_weight=1, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.686 total time=  45.0s
[CV 1/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.685 total time=  48.6s
[CV 2/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.686 total time=  48.7s
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.696 total tim

[CV 5/5] END colsample_bytree=0.6, learning_rate=0.05, max_depth=12, min_child_weight=5, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.672 total time=   4.5s
[CV 1/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.688 total time=  20.4s
[CV 2/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.688 total time=  20.3s
[CV 3/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.697 total time=  20.1s
[CV 4/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.689 total time=  20.2

[CV 1/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.688 total time=  18.9s
[CV 2/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.687 total time=  19.0s
[CV 3/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.695 total time=  19.3s
[CV 4/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.691 total time=  19.1s
[CV 5/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.691 total time=  19.0s


[CV 2/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.682 total time=  29.9s
[CV 3/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.692 total time=  29.7s
[CV 4/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.684 total time=  30.0s
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.684 total time=  30.2s
Best parameters: {'subsample': 0.8, 'reg_lambda': 0.01, 'reg_alpha': 1.0, 'num_leaves': 64, 'n_estimators': 500, 'min_child_weight': 10, 'max_depth': 8, 'learning_rate': 0.05, 'colsample_bytree': 