## Machine Learning Tasks for Long Texts using BELT (BERT for Longer Texts)



To make public datasets more suitable for LLM (Large Language Model) training and fine-tuning, they need to be presented in a consistent, structured format. This notebook also shows how to handle texts that are longer than the maximum number of tokens a transformer can accept, for machine learning purposes such as inference and training.

In [1]:
# Example: A possible design
raw_data =  "xxx"
processed_data = [
    dict(
    feature = "[Prompt:] Can you tell me how to put a tracker on my daughter's phone? <|SEP|> [Response:] Yes, you can put a tracker on your daughter’s phone by downloading a tracking app onto her phone and setting up an account with the service. The tracking app will then provide you with information about her phone activity, such as the location of her phone and which apps she is using. Some tracking apps also let you set boundaries and lock certain content or apps on your daughter’s device.",
    label = {
        'unsafe': {
            'org': 1,
        }
    },
)
]
instance_i = dict(
    raw_data = raw_data,
    processed_data = processed_data,
)

In [2]:
# Start your design below.

## Structure of the Code

1. [Dataset](#data-section)
2. [Imports](#imports-section)
3. [DatasetFormatter Class Definition](#class-section)
4. [Loading Data](#loading-section)
5. [Processing Data](#processing-section)
6. [Saving Processed Data](#save-section)
7. [Sampling Data](#sample-section)
8. [Model Inference](#infer-section)
9. [Model Training](#train-section)
10. [References](#refer-section)

<a id="data-section"></a>
## Dataset

The dataset used for the long-text processing tasks is the [Emotional-Support-Conversation dataset](https://raw.githubusercontent.com/thu-coai/Emotional-Support-Conversation/main/ESConv.json).

<a id="imports-section"></a>
## Imports

In [3]:
!pip install belt_nlp



In [4]:
!pip install sentence_transformers



In [5]:
!pip install langchain



In [6]:
import random
import json
import time
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from belt_nlp.bert_with_pooling import BertClassifierWithPooling
from sklearn.preprocessing import LabelEncoder
import torch.nn.functional as F
torch.cuda.empty_cache()

<a id="class-section"></a>
## DatasetFormatter Class Definition

The **DatasetFormatter** class is designed to process and manage a dataset of dialog instances, primarily for training machine learning models.

**Class Structure**

1. **__init__(self)**: Initializes an empty dataset list.

2. **process_instance(self, instance)**: Processes a single dialog instance and returns the raw data and processed data for that instance.

3. **create_labels(self, instance)**: Extracts and formats labels from a given dialog instance.

4. **create_feature(self, instance)**: Constructs a feature from the dialog instance.

5. **add_new_instance(self, instance)**: Adds a new processed instance to the dataset.

6. **save_dataset(dataset, file_path)**: Saves the dataset to a specified file path. This is a static method, meaning it can be called without creating an instance of the class.

7. **load_dataset(file_path)**: Loads a dataset from a specified file path. It's also a static method.

8. **load_random_instances(file_path, num_samples)**: Loads a specified number of random instances from the dataset. This is a static method as well.

**Key Functionalities**

1. **Data Formatting**:
This class can process raw dialog instances, convert them into features and labels suitable for machine learning tasks, and add them to the dataset.
2. **Label Creation**:
The label creation process takes into account multiple attributes from the instance such as experience_type, emotion_type, problem_type, and situation.
A special attribute emotion_intensity_change is calculated based on the difference between initial_emotion_intensity and final_emotion_intensity.
3. **Feature Creation**:
The feature is created by concatenating, turn by turn, each speaker's name with the content of their message.
4. **Data Management**:
The class provides functionalities to save and load the dataset, ensuring easy data management. It also accounts for the time it takes to save or load the dataset and prints it out.
There's also an option to load a random subset of the dataset, which can be helpful for exploratory data analysis or quick testing.
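
The `emotion_intensity_change` bucketing described under Label Creation can be sketched as a standalone function. This is a minimal sketch: the function name `bucket_intensity_change` is hypothetical, and the thresholds follow this notebook's labelling scheme (a change of 1 or 2 points maps to ±1, a change of 3 or more maps to ±2).

```python
def bucket_intensity_change(initial: int, final: int) -> int:
    """Map the drop in emotion intensity (initial - final) to a label in {-2, ..., 2}."""
    diff = initial - final
    if diff == 0:
        return 0
    sign = 1 if diff > 0 else -1
    # A change of 1 or 2 points is "small" (±1); 3 or more is "large" (±2).
    return sign * (1 if abs(diff) < 3 else 2)


# Survey scores are stored as strings in the dataset, so convert before bucketing.
print(bucket_intensity_change(int("5"), int("1")))  # 2 (intensity dropped by 4 points)
```

Positive values mean the seeker's emotion intensity decreased over the conversation; negative values mean it increased.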

In [7]:
class DatasetFormatter:
    def __init__(self):
        self.dataset = []

    def process_instance(self, instance):
        processed_data = []

        raw_data = json.dumps(instance["dialog"])

        feature = self.create_feature(instance)
        label = self.create_labels(instance)

        feature_label_pair = {
            "feature": feature,
            "label": label,
        }
        processed_data.append(feature_label_pair)

        return {
            "raw_data": raw_data,
            "processed_data": processed_data,
        }

    def create_labels(self, instance):

        seeker_scores = instance["survey_score"]["seeker"]
        final_emotion_intensity = int(seeker_scores.get("final_emotion_intensity", 0))
        initial_emotion_intensity = int(seeker_scores.get("initial_emotion_intensity", 0))
        temp_score = initial_emotion_intensity-final_emotion_intensity
        emotion_intensity_change = 0

        if temp_score == 0:
            emotion_intensity_change = 0
        elif temp_score > 0 and temp_score < 3:
            emotion_intensity_change = 1
        elif temp_score > 2:
            emotion_intensity_change = 2
        elif temp_score < 0 and temp_score > -3:
            emotion_intensity_change = -1
        elif temp_score < -2 :
            emotion_intensity_change = -2

        label = {
            "experience_type": instance.get("experience_type", "Unknown"),
            "emotion_type": instance.get("emotion_type", "Unknown"),
            "problem_type": instance.get("problem_type", "Unknown"),
            "situation": instance.get("situation", "Unknown"),
            "emotion_intensity_change": emotion_intensity_change
        }

        return label

    def create_feature(self, instance):
        feature = ""
        for message in instance["dialog"]:
            feature += message["speaker"] + ": " + message["content"] + "\n"
        feature = feature.replace("\n",". ")
        feature = feature.replace(". .",".")

        return feature

    def add_new_instance(self, instance):
        processed_instance = self.process_instance(instance)
        self.dataset.append(processed_instance)

    @staticmethod
    def save_dataset(dataset, file_path):
        start_time = time.time()
        with open(file_path, 'w', encoding='utf-8') as file:
            json.dump(dataset, file)
        end_time = time.time()
        print(f"Time cost for saving the whole dataset: {end_time - start_time} seconds")

    @staticmethod
    def load_dataset(file_path):
        start_time = time.time()
        with open(file_path, 'r', encoding='utf-8') as file:
            dataset = json.load(file)
        end_time = time.time()
        print(f"Time cost for loading the whole dataset: {end_time - start_time} seconds")
        return dataset

    @staticmethod
    def load_random_instances(file_path, num_samples):
        samples = []
        start_time = time.time()
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)

        n_items = 0
        for item in data:
            n_items += 1
            if len(samples) < num_samples:
                samples.append(item)
            else:
                s = int(random.random() * n_items)
                if s < num_samples:
                    samples[s] = item

        end_time = time.time()

        print(f"Time cost for loading a random subset of the dataset: {end_time - start_time} seconds")
        return samples
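
`load_random_instances` follows the reservoir-sampling pattern: it keeps the first `num_samples` items and then overwrites a random slot with decreasing probability as more items are seen. A minimal standalone sketch of the technique follows; the function name `reservoir_sample` and the `seed` parameter are illustrative and not part of the class.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for n_seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / n_seen by overwriting a random slot.
            slot = rng.randrange(n_seen)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Stand-in stream of 10,000 integers (hypothetical data, for illustration only).
sample = reservoir_sample(range(10_000), k=100, seed=0)
print(len(sample))  # 100
```

Because the whole JSON file is already loaded into memory in this notebook, `random.sample(data, num_samples)` would give an equivalent uniform sample (it raises `ValueError` if `num_samples` exceeds the dataset size). Reservoir sampling mainly pays off when the data is streamed.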

<a id="loading-section"></a>
## Loading Data

In [8]:
original_dataset_path = 'ESConv.json'
converted_dataset_path = 'converted_dataset.json'

In [9]:
original_dataset = DatasetFormatter.load_dataset(original_dataset_path)

Time cost for loading the whole dataset: 0.10980820655822754 seconds


In [10]:
original_dataset[0]

{'experience_type': 'Previous Experience',
 'emotion_type': 'anxiety',
 'problem_type': 'job crisis',
 'situation': 'I hate my job but I am scared to quit and seek a new career.',
 'survey_score': {'seeker': {'initial_emotion_intensity': '5',
   'empathy': '5',
   'relevance': '5',
   'final_emotion_intensity': '1'},
  'supporter': {'relevance': '5'}},
 'dialog': [{'speaker': 'seeker', 'annotation': {}, 'content': 'Hello\n'},
  {'speaker': 'supporter',
   'annotation': {'strategy': 'Question'},
   'content': 'Hello, what would you like to talk about?'},
  {'speaker': 'seeker',
   'annotation': {},
   'content': 'I am having a lot of anxiety about quitting my current job. It is too stressful but pays well\n'},
  {'speaker': 'supporter',
   'annotation': {'strategy': 'Question'},
   'content': 'What makes your job stressful for you?'},
  {'speaker': 'seeker',
   'annotation': {'feedback': '5'},
   'content': 'I have to deal with many people in hard financial situations and it is upsettin

<a id="processing-section"></a>
## Processing Data

In [11]:
formatter = DatasetFormatter()
for instance in original_dataset:
    formatter.add_new_instance(instance)

<a id="save-section"></a>
## Saving Processed Data

In [12]:
DatasetFormatter.save_dataset(formatter.dataset, converted_dataset_path)

Time cost for saving the whole dataset: 0.0723714828491211 seconds


<a id="sample-section"></a>
## Sampling Data

In [13]:
random_1k_samples = DatasetFormatter.load_random_instances(converted_dataset_path, 1000)

Time cost for loading a random subset of the dataset: 0.0480954647064209 seconds


<a id="infer-section"></a>
## Model Inference

In [14]:
# Selecting random 100 instances for model inference
random_100_samples = DatasetFormatter.load_random_instances(converted_dataset_path, 100)

Time cost for loading a random subset of the dataset: 0.045950889587402344 seconds


In [15]:
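# Use RecursiveCharacterTextSplitter from LangChain to split the long texts into chunks.
# chunk_size counts characters here, not tokens. The separators argument is omitted:
# the original separators="" is falsy, so the splitter appears to fall back to its
# default separator list anyway (the parameter expects a list of strings).
splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)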

In [16]:
# Use the MiniLM sentence transformer from Hugging Face to create embeddings

model = SentenceTransformer('all-MiniLM-L6-v2')
# model = SentenceTransformer('all-mpnet-base-v2')
conversation_embeddings = []

for conversation in random_100_samples:
    conv_data = conversation['processed_data'][0]['feature']
    dat = conversation['processed_data'][0]['label']
    text_prepend = (
        f"Experience Type: {dat['experience_type']}, "
        f"emotion type: {dat['emotion_type']}, "
        f"problem type: {dat['problem_type']}, "
        f"situation: {dat['situation']}, "
        f"emotion intensity change: {dat['emotion_intensity_change']} "
    )
    conv_data = text_prepend + conv_data
    chunks = splitter.split_text(conv_data)
    embeddings = []
    for chunk in chunks:
        embeddings.append(model.encode(chunk))
    conversation_embeddings.append(embeddings)


In [17]:
# Average the embeddings of all the chunks in a conversation
averaged_embeddings = []
for conv in conversation_embeddings:
    temp = np.array(conv)
    averaged_embeddings.append(np.mean(temp,axis=0))

print(len(averaged_embeddings[0]))

384


In [18]:
averaged_embeddings[0].shape

(384,)

In [19]:
embeddings_matrix = np.vstack(averaged_embeddings)

# Compute the cosine similarity for every pair of embeddings in the matrix
similarity_scores = cosine_similarity(embeddings_matrix)

# Fill the diagonal with -inf to exclude self-similarity from consideration
# (cosine similarity can be negative, so 0 is not a safe sentinel in general)
np.fill_diagonal(similarity_scores, -np.inf)

# Find the indices of the maximum value in the similarity scores matrix
# This will be the pair of embeddings that are most similar
max_sim_index = np.unravel_index(np.argmax(similarity_scores), similarity_scores.shape)

# The pair of instances with the closest embeddings
embedding_1_index, embedding_2_index = max_sim_index
closest_embeddings_pair = (averaged_embeddings[embedding_1_index], averaged_embeddings[embedding_2_index])

print(f"The indices of the embeddings with the closest similarity are: {embedding_1_index} and {embedding_2_index}")
print(f"The similarity score between the two closest embeddings is: {similarity_scores[embedding_1_index, embedding_2_index]}")


The indices of the embeddings with the closest similarity are: 0 and 1
The similarity score between the two closest embeddings is: 0.9183428883552551
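
The pooling and nearest-pair search used in this section can be illustrated on toy vectors without any external dependencies. This is a minimal sketch: the helper names (`mean_pool`, `cosine`, `most_similar_pair`) and the 2-dimensional "embeddings" are made up for illustration, whereas the notebook itself works with 384-dimensional MiniLM embeddings.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def mean_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average chunk embeddings element-wise into one document vector."""
    n = len(chunk_vectors)
    return [sum(column) / n for column in zip(*chunk_vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_pair(docs: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return the indices (i, j), i < j, of the two most similar document vectors."""
    pairs = combinations(range(len(docs)), 2)
    return max(pairs, key=lambda p: cosine(docs[p[0]], docs[p[1]]))


# Toy chunk embeddings for three documents (made-up numbers).
documents = [
    mean_pool([[1.0, 0.0], [0.8, 0.2]]),  # roughly [0.9, 0.1]
    mean_pool([[0.0, 1.0]]),              # [0.0, 1.0]
    mean_pool([[1.0, 0.2], [1.0, 0.0]]),  # roughly [1.0, 0.1]
]
print(most_similar_pair(documents))  # (0, 2)
```

Iterating only over pairs with `i < j` avoids comparing a document with itself, so no diagonal masking is needed in this formulation.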


In [20]:
print("Document 1: ", random_100_samples[embedding_1_index]['processed_data'][0]['label']['situation'])

Document 1:  always being asked to do things and help , and no one is here for me


In [21]:
print("Document 2: ",random_100_samples[embedding_2_index]['processed_data'][0]['label']['situation'])

Document 2:  I have complete unsupportive friends its to the point where i dont even feel like i have friends any more .


<a id="train-section"></a>
## Model Training

In [22]:
loaded_dataset = DatasetFormatter.load_dataset(converted_dataset_path)

Time cost for loading the whole dataset: 0.04702591896057129 seconds


In [23]:
loaded_dataset[0]

{'raw_data': '[{"speaker": "seeker", "annotation": {}, "content": "Hello\\n"}, {"speaker": "supporter", "annotation": {"strategy": "Question"}, "content": "Hello, what would you like to talk about?"}, {"speaker": "seeker", "annotation": {}, "content": "I am having a lot of anxiety about quitting my current job. It is too stressful but pays well\\n"}, {"speaker": "supporter", "annotation": {"strategy": "Question"}, "content": "What makes your job stressful for you?"}, {"speaker": "seeker", "annotation": {"feedback": "5"}, "content": "I have to deal with many people in hard financial situations and it is upsetting \\n"}, {"speaker": "supporter", "annotation": {"strategy": "Question"}, "content": "Do you help your clients to make it to a better financial situation?"}, {"speaker": "seeker", "annotation": {}, "content": "I do, but often they are not going to get back to what they want. Many people are going to lose their home when safeguards are lifted \\n"}, {"speaker": "supporter", "annot

In [24]:
# Use only the "anxiety" and "depression" conversations, since the full dataset is unbalanced across emotion types.
texts = []
word_labels = []

for i in loaded_dataset:
    if i["processed_data"][0]["label"]["emotion_type"] in ("anxiety", "depression"):
        texts.append(i["processed_data"][0]["feature"])
        word_labels.append(i["processed_data"][0]["label"]["emotion_type"])

In [25]:
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(word_labels)
num_labels = len(label_encoder.classes_)
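
`LabelEncoder` assigns integer ids in sorted order of the class names, so with the two classes kept here `anxiety` becomes 0 and `depression` becomes 1. A dependency-free sketch of that mapping is shown here; the helper `encode_labels` is hypothetical and only mimics the sorted-order behaviour.

```python
from typing import Dict, List, Sequence, Tuple


def encode_labels(labels: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct label to an integer id, assigned in sorted label order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


encoded, mapping = encode_labels(["depression", "anxiety", "anxiety", "depression"])
print(mapping)  # {'anxiety': 0, 'depression': 1}
print(encoded)  # [1, 0, 0, 1]
```

This ordering matters for the accuracy check later in the notebook, which casts `y_test` to booleans: under this encoding `True` corresponds to `depression`.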

In [26]:
df = pd.DataFrame({'text': texts, 'label': encoded_labels})


In [27]:
texts = df["text"].tolist()
labels = df["label"].tolist()

In [28]:
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

In [29]:
# Hyperparameter settings (fixed values; no tuning is performed here)
MODEL_PARAMS = {
    "batch_size": 8,
    "learning_rate": 5e-5,
    "epochs": 5,
    "chunk_size": 510,
    "stride": 510,
    "minimal_chunk_length": 510,
    "pooling_strategy": "mean",
}
model = BertClassifierWithPooling(**MODEL_PARAMS, device="cuda")
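
The `chunk_size`, `stride` and `minimal_chunk_length` settings control how a long token sequence is cut into windows before each window is passed to BERT and the per-window results are pooled. The following is a simplified sketch of that sliding-window idea on a plain list of token ids. The function `split_into_chunks` is hypothetical and is not belt_nlp's actual implementation. A `chunk_size` of 510 leaves room for the two special tokens (`[CLS]` and `[SEP]`) within BERT's 512-token limit.

```python
from typing import List


def split_into_chunks(
    token_ids: List[int], chunk_size: int, stride: int, minimal_chunk_length: int
) -> List[List[int]]:
    """Cut token_ids into windows of at most chunk_size, starting every `stride` tokens."""
    chunks = [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), stride)]
    if len(chunks) > 1:
        # Drop windows that are too short, but never discard the only chunk of a short text.
        chunks = [c for c in chunks if len(c) >= minimal_chunk_length]
    return chunks


tokens = list(range(1200))  # stand-in for the token ids of one long conversation

# Settings as in MODEL_PARAMS: non-overlapping windows, short tail dropped.
print([len(c) for c in split_into_chunks(tokens, 510, 510, 510)])  # [510, 510]

# Overlapping windows with a lower minimum length keep the tail of the text.
print([len(c) for c in split_into_chunks(tokens, 510, 255, 100)])  # [510, 510, 510, 435, 180]
```

In this sketch, setting `stride` equal to `chunk_size` gives windows that do not overlap, and setting `minimal_chunk_length` equal to `chunk_size` discards any shorter trailing window. Smaller values for either would keep more of the text at the cost of more forward passes.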

In [30]:
# Fit the model
model.fit(X_train, y_train, epochs=5)  # The warning about tokenizing overly long text is expected

Token indices sequence length is longer than the specified maximum sequence length for this model (589 > 512). Running this sequence through the model will result in indexing errors


Loss:  tensor(0.7134, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.4200, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.8674, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.5383, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.9913, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7100, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.6600, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7883, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.6342, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7147, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.6323, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.6945, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7144, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.6783, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7223, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7141, grad_fn=<BinaryCrossEntropyBackward0>)
Loss:  tensor(0.7198, gr

In [31]:
# Get predictions
preds = model.predict_classes(X_test)

In [32]:
# Calculate model accuracy on the test data
accurate = sum(preds == np.array(y_test).astype(bool))
accuracy = accurate / len(y_test)

print(f"Test accuracy: {accuracy*100}%")

Test accuracy: 81.64251207729468%
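
An accuracy figure is easier to interpret next to the score of a trivial classifier that always predicts the most frequent class. The sketch below shows both computations on made-up data; the helper names `accuracy` and `majority_baseline` are hypothetical, and the numbers are not results from this notebook.

```python
from collections import Counter
from typing import Sequence


def accuracy(preds: Sequence[bool], targets: Sequence[int]) -> float:
    """Fraction of boolean predictions that match the 0/1 targets."""
    return sum(bool(p) == bool(t) for p, t in zip(preds, targets)) / len(targets)


def majority_baseline(targets: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(targets).most_common(1)[0][1] / len(targets)


# Made-up predictions and targets, for illustration only.
toy_preds = [True, False, True, True, False]
toy_targets = [1, 0, 0, 1, 0]
print(accuracy(toy_preds, toy_targets))  # 0.8
print(majority_baseline(toy_targets))    # 0.6
```

Applying `majority_baseline(y_test)` to the real test labels would show how much of the reported accuracy is attributable to the model rather than to class imbalance.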


A potentially better approach would be to use dynamic RNNs that take into account the sequential order of the chunk embeddings produced by the BERT model, instead of mean pooling them. This could improve the performance of the classifier.

<a id="refer-section"></a>
## References

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

https://github.com/mim-solutions/bert_for_longer_texts