This notebook shows, step by step, how to embed a large text dataset with BERT. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that produces contextualized embeddings: the vector for each token depends on the words around it.

BERT embeddings are particularly useful for prediction tasks such as sentiment analysis or text classification, because they turn each text into a numerical vector that a machine learning algorithm can take as input.

In this notebook, we use BERT to convert a large collection of texts into numerical vectors, then use those vectors as input to LightGBM, a popular gradient boosting framework. The same steps apply to preprocessing and classifying other large text datasets.

In [1]:
import pandas as pd
import numpy as np
import transformers
import torch
import lightgbm as lgb
from torch.utils.data import DataLoader, TensorDataset

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained BERT model and tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = transformers.BertModel.from_pretrained('bert-base-uncased').to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [2]:
# Define a function to tokenize and encode the input text using BERT
def encode_text(text):
    input_ids = []
    attention_masks = []

    # Tokenize each sentence and create input_ids and attention_masks
    for sentence in text:
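        encoded_dict = tokenizer.encode_plus(
                            sentence,
                            add_special_tokens = True,     # add [CLS] and [SEP]
                            max_length = 64,               # fixed sequence length
                            padding = 'max_length',        # replaces the deprecated pad_to_max_length=True
                            truncation = True,             # cut longer sentences so every tensor has 64 tokens
                            return_attention_mask = True,
                            return_tensors = 'pt'
                       )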
        
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists to tensors
    input_ids = torch.cat(input_ids, dim=0).to(device)
    attention_masks = torch.cat(attention_masks, dim=0).to(device)

    # Create a TensorDataset from the input_ids and attention_masks
    dataset = TensorDataset(input_ids, attention_masks)

    # Use a DataLoader to batch the dataset
    dataloader = DataLoader(dataset, batch_size=64)

    # Pass the input through BERT to get the embeddings
    embeddings = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids_batch, attention_masks_batch = batch
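            # [0] is the last hidden state; [:, 0, :] keeps the [CLS] token's vector for each sentence
            outputs = bert_model(input_ids_batch, attention_mask=attention_masks_batch)
            embeddings_batch = outputs[0][:, 0, :].cpu().numpy()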
            embeddings.append(embeddings_batch)
    embeddings = np.concatenate(embeddings, axis=0)

    # Return the embeddings
    return embeddings
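
The function above keeps only the vector at position 0 (the `[CLS]` token) of BERT's last hidden state. The sketch below uses a small made-up NumPy array in place of real BERT output to show what that slice does, next to masked mean pooling, a common alternative that is not used in this notebook. The array values and the helper names `cls_pool` and `masked_mean_pool` are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for BERT's last hidden state: 2 sentences, 4 token positions, 3 hidden units.
# (Real bert-base-uncased output has 768 hidden units; these numbers are made up.)
last_hidden_state = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

# 1 marks a real token, 0 marks padding, like the attention mask built by the tokenizer.
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]], dtype=np.float32)

def cls_pool(hidden):
    """Keep the vector at position 0 ([CLS]) for every sentence."""
    return hidden[:, 0, :]

def masked_mean_pool(hidden, mask):
    """Average the token vectors of each sentence, ignoring padded positions."""
    mask = mask[:, :, None]                   # (batch, tokens, 1), broadcasts over hidden units
    summed = (hidden * mask).sum(axis=1)      # sum of the real-token vectors
    counts = mask.sum(axis=1)                 # number of real tokens per sentence
    return summed / np.clip(counts, 1e-9, None)

cls_vectors = cls_pool(last_hidden_state)
mean_vectors = masked_mean_pool(last_hidden_state, attention_mask)
print(cls_vectors.shape, mean_vectors.shape)
```

Both pooling methods return one vector per sentence, so either result could be passed to LightGBM in the same way.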

In [10]:

# Load the data
# data = pd.concat([pd.read_csv('./input/Train.csv'), pd.read_csv('./input/Test.csv')]).reset_index(drop=True)

data = pd.read_csv('./input/Train.csv')

# Encode the text using BERT
embeddings = encode_text(data['text'])

# Split the data into train and validation sets
train_size = int(0.8 * len(data))
train_data = embeddings[:train_size]
train_labels = data['label'][:train_size]
valid_data = embeddings[train_size:]
valid_labels = data['label'][train_size:]
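
The split above takes the first 80% of rows for training, which works only if the file is not ordered by label. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with `stratify` on made-up data, so both parts keep the same class proportions. The arrays `toy_embeddings` and `toy_labels` are hypothetical stand-ins for `embeddings` and `data['label']`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: 100 "embeddings" of width 8, with labels sorted by class.
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(100, 8))
toy_labels = np.repeat([-1, 0, 1], [50, 30, 20])

# A sequential 80/20 cut on sorted labels leaves a single class in the validation part.
sequential_valid_classes = set(toy_labels[80:].tolist())

# A shuffled, stratified split keeps the 50/30/20 class ratio in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    toy_embeddings,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=42,
)

print(sequential_valid_classes)
print(np.unique(y_valid, return_counts=True))
```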

In [11]:
train_labels.unique()

array([-1,  1,  0])
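
The labels are -1, 0 and 1 rather than 0-based class indices. scikit-learn-style classifiers generally sort the distinct labels and expose them as `classes_`, and to the best of our understanding `LGBMClassifier` follows the same convention, so probability columns would correspond to -1, 0, 1 in that order. The sketch below shows the mapping with scikit-learn's `LabelEncoder` on a few made-up labels. It illustrates the convention and is not part of the original pipeline.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A few made-up labels drawn from the same three classes as the dataset.
labels = np.array([-1, 1, 0, 1, -1])

encoder = LabelEncoder().fit(labels)
encoded = encoder.transform(labels)            # positions in the sorted class list
decoded = encoder.inverse_transform(encoded)   # back to the original labels

print(encoder.classes_)
print(encoded)
```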

In [15]:
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

param_distributions = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [16, 32, 64],
    'max_depth': [6, 8, 12],
    'min_child_weight': [1, 5, 10],
    # Note: subsample only takes effect when subsample_freq > 0 (LightGBM's default is 0)
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0.01, 0.1, 1.0],
    'reg_lambda': [0.01, 0.1, 1.0],
}

model = LGBMClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='f1_micro',
    verbose=3,
    random_state=42,
)

random_search.fit(train_data, train_labels)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.663 total time=  19.8s
[CV 2/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.666 total time=  18.5s
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.667 total time=  18.3s
[CV 4/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.657 total time=  18.6s
[CV 5/5] END colsample_bytree=1.0, learning_rate=0.05, max_depth=6, min_child_weight=10, n_estimators=100, num_leaves=16, reg_alph

[CV 2/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.665 total time= 1.4min
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.669 total time= 1.4min
[CV 4/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.656 total time= 1.4min
[CV 5/5] END colsample_bytree=1.0, learning_rate=0.01, max_depth=8, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.1, subsample=1.0;, score=0.674 total time= 1.4min
[CV 1/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=5, n_estimators=1000, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=1.0;, score=0.695 total time= 1.7m

[CV 3/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.698 total time= 1.8min
[CV 4/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.693 total time= 1.8min
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=6, min_child_weight=10, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.704 total time= 1.8min
[CV 1/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=12, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=0.6;, score=0.688 total time=  45.8s
[CV 2/5] END colsample_bytree=0.6, learning_rate=0.1, max_depth=12, min_child_weight=10, n_estimators=500, num_leaves=16, reg_alpha=1.0, reg_lambda=0.1, subsample=0.6;, score=0.694 total time=  

[CV 4/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=6, min_child_weight=1, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.673 total time= 2.2min
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=6, min_child_weight=1, n_estimators=1000, num_leaves=32, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.691 total time= 2.2min
[CV 1/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.695 total time= 2.3min
[CV 2/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.694 total time= 2.3min
[CV 3/5] END colsample_bytree=1.0, learning_rate=0.1, max_depth=12, min_child_weight=1, n_estimators=1000, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.695 total tim

[CV 5/5] END colsample_bytree=0.6, learning_rate=0.05, max_depth=12, min_child_weight=5, n_estimators=100, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=1.0;, score=0.673 total time=  13.3s
[CV 1/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.684 total time= 1.0min
[CV 2/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.689 total time= 1.1min
[CV 3/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.688 total time= 1.0min
[CV 4/5] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.1, reg_lambda=0.01, subsample=0.6;, score=0.682 total time= 1.1mi

[CV 1/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.686 total time=  53.1s
[CV 2/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.693 total time=  53.5s
[CV 3/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.695 total time=  53.8s
[CV 4/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.687 total time=  56.1s
[CV 5/5] END colsample_bytree=0.8, learning_rate=0.1, max_depth=6, min_child_weight=1, n_estimators=500, num_leaves=16, reg_alpha=0.01, reg_lambda=0.01, subsample=0.6;, score=0.702 total time=  56.7s


[CV 2/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.671 total time= 1.5min
[CV 3/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.677 total time= 1.5min
[CV 4/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.665 total time= 1.6min
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=12, min_child_weight=1, n_estimators=500, num_leaves=32, reg_alpha=1.0, reg_lambda=0.01, subsample=0.8;, score=0.678 total time= 1.5min
Best parameters: {'subsample': 0.6, 'reg_lambda': 1.0, 'reg_alpha': 0.1, 'num_leaves': 64, 'n_estimators': 1000, 'min_child_weight': 10, 'max_depth': 8, 'learning_rate': 0.1, 'colsample_bytree': 1
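
The search above scores each candidate with `f1_micro`. For single-label multiclass problems such as this one, micro-averaged F1 pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. The sketch below checks this by hand on a few made-up predictions. The arrays `y_true` and `y_pred` are illustrative and are not results from this notebook.

```python
import numpy as np

# Made-up labels and predictions over the three classes -1, 0, 1.
y_true = np.array([-1, 0, 1, 1, 0, -1, 1, 0])
y_pred = np.array([-1, 0, 0, 1, 0, 1, 1, -1])

classes = np.unique(y_true)

# Pool the counts over all classes (this is what "micro" averaging means).
tp = sum(int(np.sum((y_pred == c) & (y_true == c))) for c in classes)
fp = sum(int(np.sum((y_pred == c) & (y_true != c))) for c in classes)
fn = sum(int(np.sum((y_pred != c) & (y_true == c))) for c in classes)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = float(np.mean(y_true == y_pred))
print(micro_f1, accuracy)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall are both the fraction of correct predictions.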