# Data Enrichment and Further Modelling

In this notebook, we try to add more characters to the dataset in two ways:
- Changing the threshold filtering rule
- Splitting sentences that are too long into several sentences

After reprocessing the dataset, we can proceed with more advanced methods and also rerun the baselines.
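
The notebook does not spell out the threshold filtering rule. As a minimal sketch, assume it keeps only characters with at least a given number of lines, so lowering the threshold admits more characters. The function `filter_characters`, the parameter `min_sentences`, and the toy data are hypothetical; only the movie -> character -> sentences layout is taken from the data used later in this notebook.

```python
def filter_characters(data, min_sentences):
    """Keep only characters with at least `min_sentences` lines (assumed rule)."""
    filtered = {}
    for movie, characters in data.items():
        kept = {
            name: lines
            for name, lines in characters.items()
            if len(lines) >= min_sentences
        }
        if kept:  # drop movies that end up with no characters
            filtered[movie] = kept
    return filtered

# Toy data in the movie -> character -> sentences layout
toy = {"movie_a": {"ALICE": ["Hi.", "Bye.", "Wait."], "BOB": ["Hello."]}}

strict = filter_characters(toy, min_sentences=3)   # only ALICE survives
relaxed = filter_characters(toy, min_sentences=1)  # ALICE and BOB survive
print(sorted(strict["movie_a"]), sorted(relaxed["movie_a"]))
```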

In [2]:
from tqdm import tqdm
import pandas as pd

In [24]:
import spacy

nlp = spacy.load("en_core_web_sm")

def split_long_sentence(text, max_tokens=128, tokenizer=None):
    """Split a long sentence into chunks with token length ≤ max_tokens."""
    if tokenizer is None:
        return [text]

    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]

    chunks, current_chunk = [], ""
    for sent in sentences:
        temp = f"{current_chunk} {sent}".strip() if current_chunk else sent
        if len(tokenizer.encode(temp, add_special_tokens=True)) <= max_tokens:
            current_chunk = temp
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sent
    if current_chunk:
        chunks.append(current_chunk)

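    # Fallback: keep halving (by words) any chunk that is still too long.
    # A stack preserves the original order and re-checks both halves.
    final_chunks = []
    pending = list(reversed(chunks))
    while pending:
        chunk = pending.pop()
        words = chunk.split()
        too_long = len(tokenizer.encode(chunk, add_special_tokens=True)) > max_tokens
        if not too_long or len(words) < 2:
            # Fits, or is a single word that cannot be split any further
            final_chunks.append(chunk)
            continue
        mid = len(words) // 2
        pending.append(" ".join(words[mid:]))  # second half, processed later
        pending.append(" ".join(words[:mid]))  # first half, processed next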

    return final_chunks


In [34]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Reynold shouts at his mom once he realizes she threw his toy away. She apologizes but he’s furious. He runs into his room and slams the door."
chunks = split_long_sentence(text, max_tokens=128, tokenizer=tokenizer)
print(chunks)

['Reynold shouts at his mom once he realizes she threw his toy away. She apologizes but he’s furious. He runs into his room and slams the door.']


The example above shows the token-count check on a short passage: the text fits within 128 tokens, so the function returns it as a single chunk.
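
To see the greedy packing step actually produce more than one chunk, here is a self-contained sketch with a much smaller limit. It makes two simplifying assumptions: `WhitespaceTokenizer` is a hypothetical stand-in that counts one token per word plus two special tokens, and a naive regex replaces spaCy for sentence splitting. Token counts from a real BERT tokenizer will differ.

```python
import re

class WhitespaceTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer: one token per word."""

    def encode(self, text, add_special_tokens=True):
        tokens = text.split()
        # Mimic BERT-style [CLS] and [SEP] by adding two extra tokens
        return ["[CLS]"] + tokens + ["[SEP]"] if add_special_tokens else tokens

def greedy_pack(sentences, max_tokens, tokenizer):
    """Merge consecutive sentences while the merged text fits in max_tokens."""
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if len(tokenizer.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks

demo_text = (
    "Reynold shouts at his mom once he realizes she threw his toy away. "
    "She apologizes but he's furious. He runs into his room and slams the door."
)
# Naive sentence split on end punctuation, used here instead of spaCy
demo_sentences = re.split(r"(?<=[.!?])\s+", demo_text)
demo_chunks = greedy_pack(demo_sentences, max_tokens=16, tokenizer=WhitespaceTokenizer())
for chunk in demo_chunks:
    print(chunk)
```

With a limit of 16 tokens, the first sentence fills a chunk on its own and the remaining two sentences are packed together.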

In [10]:
# Let's load the data and test the function

import json

with open("../data/new_moral_only_data.json", "r") as f:
    moral_only_data = json.load(f)
    moral_only_data = moral_only_data["moral_dialogue"]

with open("../data/new_non_moral_data.json", "r") as f:
    non_moral_only_data = json.load(f)
    non_moral_only_data = non_moral_only_data["moral_dialogue"]

In [9]:
# Top-level keys of the raw JSON (this cell ran before "moral_dialogue" was selected above)
moral_only_data.keys()

dict_keys(['moral_dialogue', 'moral_dialogue_masked', 'ground_truth'])

In [25]:
def split_all_long_sentence_in_data(data: dict[str, dict[str, list[str]]], max_tokens=128, tokenizer=None):
    """Split every over-long sentence in the dialogue data into smaller chunks.

    Note: `data` is modified in place and the same object is returned.

    Args:
        data (dict): Mapping of movie -> character -> list of sentences.
        max_tokens (int): Maximum number of tokens per chunk.
        tokenizer (transformers.PreTrainedTokenizer): Tokenizer used to count tokens.
            If None, sentences are left unchanged.
    """
    for movie, characters in tqdm(data.items()):
        for character, sentences in characters.items():
            processed_sentences = []
            for sentence in sentences:
                if tokenizer is not None:
                    token_len = len(tokenizer.encode(sentence, add_special_tokens=True))
                    if token_len > max_tokens:
                        processed_sentences.extend(split_long_sentence(sentence, max_tokens, tokenizer))
                    else:
                        processed_sentences.append(sentence)
                else:
                    processed_sentences.append(sentence)
            data[movie][character] = processed_sentences
    return data

In [26]:
moral_only_data_processed = split_all_long_sentence_in_data(moral_only_data, max_tokens=128, tokenizer=tokenizer)
non_moral_only_data_processed = split_all_long_sentence_in_data(non_moral_only_data, max_tokens=128, tokenizer=tokenizer)

100%|██████████| 1629/1629 [00:03<00:00, 472.90it/s]
100%|██████████| 1629/1629 [00:06<00:00, 266.03it/s]


In [27]:
count = 0

for movie, characters in non_moral_only_data_processed.items():
    for character, sentences in characters.items():
        for sentence in sentences:
            if len(tokenizer.encode(sentence, add_special_tokens=True)) > 128:
                count += 1

print(f"Number of sentences longer than 128 tokens: {count}")

Number of sentences longer than 128 tokens: 0


In [31]:
moral_only_data_processed['8MM_1999']["WELLES"]

['I love you.',
 "What's all the trouble, Cinderella? What are you crying about, huh?",
 'Please, believe me. This is probably a stag film. Simulated rape. Hard to stomach, and it might seem real, but there are ways of making it look realistic. fake blood and special effects.',
 'You. you need to go to the police.',
 "Few days ago, I was contacted by a couple living in Philadelphia, a doctor and his wife. What happened was they picked up a young girl hitchhiking off 81, which heads into Philadelphia, started up a conversation with this girl, she looked homeless, seemed about eighteen maybe. They convinced her to let them buy her a meal in the city. Nice kid, mature, did n't have much to say, but they got a sense she's a runaway, so all through dinner the doctor's working on her, trying to convince her that at the very least she should pick up a telephone.",
 'Not surprisingly, she ate her food, excused herself.',
 "This doctor and wife, they're nice people, but they do n't want to get 

In [38]:
# Reprocess new_dialogue.json so that every sentence is at most 128 tokens long.

with open("../data/new_dialogue.json", "r") as f:
    new_dialogue = json.load(f)

In [37]:
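import copy

# split_all_long_sentence_in_data modifies its argument in place, so pass a deep copy
# to keep the original new_dialogue intact for the comparison below.
new_new_dialogue = split_all_long_sentence_in_data(copy.deepcopy(new_dialogue), max_tokens=128, tokenizer=tokenizer)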

In [43]:
count1 = 0
for movie, characters in new_new_dialogue.items():
    for character, sentences in characters.items():
        count1 += len(sentences)

print(f"Total number of sentences in new_new_dialogue: {count1}")

count2 = 0
for movie, characters in new_dialogue.items():
    for character, sentences in characters.items():
        count2 += len(sentences)

print(f"Total number of sentences in new_dialogue: {count2}")

print(f"We gained {count1 - count2} sentences by splitting long sentences.")

Total number of sentences in new_new_dialogue: 1959645
Total number of sentences in new_dialogue: 1955706
We gained 3939 sentences by splitting long sentences.
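
A note on the comparison above: `split_all_long_sentence_in_data` writes its results back into the dictionary it receives and returns that same dictionary, so the count difference is only meaningful when the two names refer to different objects (for example, because the file was reloaded or a copy was passed in). The toy sketch below illustrates this. `split_in_place` and `count_sentences` are hypothetical helpers, not part of the notebook, and the split rule is deliberately simplistic.

```python
import copy

def split_in_place(data):
    """Toy stand-in for split_all_long_sentence_in_data: mutates `data` and returns it."""
    for movie, characters in data.items():
        for character, sentences in characters.items():
            # Pretend that every ". " marks a split point
            data[movie][character] = [part for s in sentences for part in s.split(". ")]
    return data

def count_sentences(data):
    """Total number of sentences across all movies and characters."""
    return sum(len(sentences) for characters in data.values() for sentences in characters.values())

# Without a copy: both names point to the same, already-split dictionary
original = {"toy_movie": {"A": ["One. Two", "Three"]}}
aliased = split_in_place(original)
gain_aliased = count_sentences(aliased) - count_sentences(original)

# With a deep copy: the original keeps its pre-split sentences
original = {"toy_movie": {"A": ["One. Two", "Three"]}}
processed = split_in_place(copy.deepcopy(original))
gain_copied = count_sentences(processed) - count_sentences(original)

print(gain_aliased, gain_copied)
```

In the aliased case the measured gain is zero even though a sentence was split; with the deep copy the gain reflects the split.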


## Data Scraping

We will try to scrape more movies from the internet to enrich our data and make the results more reliable.