<a href="https://colab.research.google.com/github/tinamilo6/Textmining/blob/main/A03_TextMining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install pandas scikit-learn nltk tokenizers

# Assignment 3 - Text Mining

Project management and tools for health informatics

## 1. Download and prepare data:

**Do not alter the code in this Section!**

The code in this section downloads the [IMDB Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), which is the dataset you will be working on in this assignment.

In [2]:
import os
import tarfile
from urllib.request import urlretrieve

In [3]:
if not os.path.exists('aclImdb'):
    # download data:
    urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb.tar.gz')

    # unzip data:
    with tarfile.open('aclImdb.tar.gz') as file:
        file.extractall('./')

## 2. Some helper Functions:

**Do not alter the code in this Section!**

This section contains the code for some helper functions that will be useful for solving the assignment. Example code on how to use the functions is provided in section 3.

In [4]:
import pandas as pd

from typing import Literal, Tuple, Iterable

Function for loading data into a pandas dataframe:

In [5]:
def load_data(split:Literal['train', 'test'], texts_per_class:int=500) -> pd.DataFrame:
    ''' Loads the data into a pandas dataframe.'''
    paths  = []
    labels = []

    for label in ('pos', 'neg'):
        # get all files in the folder:
        files = os.listdir(os.path.join('aclImdb', split, label))[:texts_per_class]

        # append them to the lists:
        paths.extend([os.path.join('aclImdb', split, label, f) for f in files])
        labels.extend([label] * len(files))

    return pd.DataFrame({'path':paths, 'label':labels})

Function for loading a specific text:

In [6]:
def load_text(path:str) -> str:
    ''' Reads a single text given the path. '''
    # read file from disk:
    with open(path, 'r', encoding='utf8') as file:
        s = file.read()

    return s

Function for iterating through multiple texts:

In [7]:
def iterate_texts(data:pd.DataFrame) -> Iterable[str]:
    ''' Iterates through a pandas dataframe. '''

    for path in data['path'].values:
        # read file from disk:
        with open(path, 'r', encoding='utf8') as file:
            text = file.read()

        yield text

## 3. Text Mining Pipeline

This section will cover the text mining steps for this assignment. The following steps will be performed:

1. **Analyze the Data for Difficult Parts**  
   - Reviewing the data to identify challenging aspects such as contractions, informal language, and complex sentence structures.

2. **Replace Contractions and Informal Language**  
   - Expanding common contractions and replacing informal phrases to standardize the text for processing.
   - This part is optional; the pipeline should later be evaluated both with and without normalization.

3. **Tokenize the Texts**  
   - Applying a tokenizer to break down the text into individual tokens for analysis.

4. **Training and Predicting with ML**
   - Training different ML models with the tokenized data and predicting the test data to evaluate the accuracy.

### Import needed libraries

In [8]:
# Import the libraries needed for preparing the texts
import random

### Load the training data
The data extracted from the archive is loaded into the `data_train` and `data_test` DataFrames.
These are used in the following steps to access the training and test data.

In [9]:
data_train = load_data('train')
data_test  = load_data('test')
data_train

| | path | label |
|---|---|---|
| 0 | `aclImdb\train\pos\0_9.txt` | pos |
| 1 | `aclImdb\train\pos\10000_8.txt` | pos |
| 2 | `aclImdb\train\pos\10001_10.txt` | pos |
| 3 | `aclImdb\train\pos\10002_7.txt` | pos |
| 4 | `aclImdb\train\pos\10003_8.txt` | pos |
| ... | ... | ... |
| 995 | `aclImdb\train\neg\10446_2.txt` | neg |
| 996 | `aclImdb\train\neg\10447_1.txt` | neg |
| 997 | `aclImdb\train\neg\10448_1.txt` | neg |
| 998 | `aclImdb\train\neg\10449_4.txt` | neg |


### 1. Assess the data / texts for difficult parts
In this part, sample texts are printed and then analyzed for difficult parts that could affect the text mining process.

In [10]:
# Number of samples to be printed
n_samples = 7

# Randomly sample indices from "data_train"
sample_indices = random.sample(range(len(data_train)), n_samples)

# Load and print each sample using the "load_text" function
for i, idx in enumerate(sample_indices, start=1):
    # Load text from the file path specified in 'path' column
    text = load_text(data_train.loc[idx, 'path'])
    print(f"Sample {i}:\n")
    print(text)
    print("\n" + "="*80 + "\n")

Sample 1:

Fabulous costumes by Edith Head who painted them on Liz Taylor at her finest!<br /><br />The SFX are very good for a movie of its age, and the stunt doubles actually looked like the actors, even down to body type, a rarity in movies of this vintage.<br /><br />A cozy movie, with splendid panoramas -- even when chopped down to pan and scan.


Sample 2:

"Happenstance" is the most New York-feeling Parisian film I've seen since "When the Cat's Away (Chacun cherche son chat). "<br /><br />A film from last year released now to capitalize on the attention Audrey Tatou is getting for "Amelie," its French title is more apt: "Le Battement d'ailes du papillon (The Beating of the Butterfly's Wings)" as in summarizing chaos theory as a controlling element in our lives.<br /><br />Tatou's gamine-ness is less annoying here because she only occasionally flashes that dazzling smile amidst her hapless adventures, and because she's part of a large, multi-ethnic ensemble, so large that it took

## Limitations/Issues Captured from the Above Text Samples

- **HTML tags and special characters**
- **Punctuation and symbols** (e.g., `&`)
- **Contractions** (e.g., "isn't", "I'll", "I'm")
- **Parentheses and annotations** (e.g., "(Crouching Tiger)")
- **Informal formatting** (e.g., "my rating is ****")
- **Ambiguity and polysemy** (e.g., "dictators", "nuts")
- **Long and complex sentences**
- **Informal language** (e.g., "what can be so bad about that?")
- **Quotation marks** (e.g., "dictators", "sin")

We will elaborate on these issues and add our conclusions in the report.
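
To get a rough sense of how often some of these patterns occur, a few regular expressions can be counted over a text. The sketch below is illustrative only: the pattern names, the `count_issues` helper, and the sample string are assumptions made for this example, and the patterns are deliberately simple (the contraction pattern, for instance, also counts possessives such as "High's").

```python
import re

# Hypothetical, deliberately simple patterns for some of the issues listed above
ISSUE_PATTERNS = {
    'html_tags': re.compile(r'<[^>]+>'),
    'contractions': re.compile(r"\b\w+'(?:m|s|t|re|ve|ll|d)\b", re.IGNORECASE),
    'parentheses': re.compile(r'\([^)]*\)'),
    'star_ratings': re.compile(r'\*{2,}'),
    'ellipses': re.compile(r'\.{3,}'),
}

def count_issues(text: str) -> dict:
    ''' Counts how often each pattern matches in a text. '''
    return {name: len(pattern.findall(text)) for name, pattern in ISSUE_PATTERNS.items()}

# Made-up sample, not taken from the dataset
sample = "I'm sure it isn't bad...<br /><br />My rating is **** (out of five)."
print(count_issues(sample))
# {'html_tags': 2, 'contractions': 2, 'parentheses': 1, 'star_ratings': 1, 'ellipses': 1}
```

Applied to the texts yielded by `iterate_texts`, counts like these could support the discussion of each issue in the report.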

### A simple pipeline:
[Do not change it!]

This simple pipeline serves as the baseline: the newly created pipeline is compared against it to measure any change in performance.

**White-Space tokenization:**

In [11]:
def tokenize(text:str):
    ''' An example tokenization function. '''

    # simple white-space tokenization:
    return text.lower().split()
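
As a quick illustration of what this baseline tokenizer produces, the snippet below applies the same lower-case-and-split logic to a made-up review fragment (the sample string is an assumption, not taken from the dataset). Punctuation and HTML tags stay attached to the neighbouring words, which inflates the vocabulary.

```python
def tokenize(text: str):
    ''' Same white-space tokenization as in the baseline above. '''
    return text.lower().split()

# Made-up sample that mimics the "<br />" tags found in the reviews
sample = 'Great movie!<br /><br />I loved it.'
tokens = tokenize(sample)
print(tokens)
# ['great', 'movie!<br', '/><br', '/>i', 'loved', 'it.']
```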


**Bag-of-words Embedding:**

See documentation of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
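
Conceptually, a bag-of-words embedding maps each document to a vector of token counts over a shared vocabulary. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the helper name `bag_of_words` and the two toy documents are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(docs, tokenize=str.split):
    ''' Builds a sorted vocabulary and one count vector per document. '''
    tokenized = [tokenize(doc.lower()) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # one entry per vocabulary term; absent terms count as 0
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors

docs = ['good good movie', 'bad movie']
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['bad', 'good', 'movie']
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```

Word order is discarded entirely, which is why the choice of tokenizer has such a direct effect on the resulting features.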

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# create a simple bag of words embedding:
bow = CountVectorizer(

    # the next line converts the filepaths to the actual texts:
    preprocessor = load_text,

    # tokenization function from above:
    tokenizer = tokenize,

    # Set token_pattern to None since we're using a custom tokenizer
    token_pattern=None

)

# train the embedding:
embeddings_train = bow.fit_transform(data_train['path'].values)

# vectorize test data:
embeddings_test = bow.transform(data_test['path'].values)

**Classification with a linear SVM**

See documentation of [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [13]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC(dual=False, max_iter=5000)

# train classifier:
svm.fit(embeddings_train, data_train['label'].values)

# test classifier:
predictions = svm.predict(embeddings_test)

# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 0.761


### Task 2b) Implement the new Tokenizer
We decided to implement a BPE (Byte-Pair Encoding) tokenizer and start by evaluating its output without normalization.
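
For intuition, BPE starts from single characters and repeatedly merges the most frequent adjacent pair of symbols into a new symbol. The sketch below is a simplified pure-Python illustration of that merge loop, not the implementation in the `tokenizers` library; the function names and the toy word counts are assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    ''' Counts adjacent symbol pairs, weighted by word frequency. '''
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    ''' Replaces every occurrence of the pair with one merged symbol. '''
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    ''' Learns up to num_merges merge rules from a word-frequency dict. '''
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy word frequencies (made up for illustration)
merges, words = learn_bpe({'low': 5, 'lower': 2, 'lowest': 3}, num_merges=2)
print(merges)
print(words)
```

After two merges the shared stem `low` has become a single symbol, while the rarer suffixes are still split into characters. This is the behaviour that lets a subword tokenizer cover related word forms without stemming.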

In [14]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a BPE tokenizer
tokenizer = Tokenizer(BPE())

# Add a pre-tokenizer to handle whitespace properly
tokenizer.pre_tokenizer = Whitespace()

# Initialize a BPE trainer with some default parameters
trainer = BpeTrainer(special_tokens=["<pad>", "<s>", "</s>", "<unk>", "__ellipsis__"])

# Load the training texts into the training data
training_data = [load_text(path) for path in data_train['path'].values]

# Train the tokenizer on the raw (not yet normalized) text data
tokenizer.train_from_iterator(training_data, trainer)

# Encode the text data
encoded_train = [tokenizer.encode(load_text(path)).ids for path in data_train['path'].values]
encoded_test = [tokenizer.encode(load_text(path)).ids for path in data_test['path'].values]

# Print a few texts to evaluate the quality
sample_indices = [0, 6, 12, 55]

for i, sample_index in enumerate(sample_indices):
    # Load the actual text from the file path
    original_text = load_text(data_train['path'][sample_index])
    encoded_sample = tokenizer.encode(original_text)
    decoded_text = tokenizer.decode(encoded_sample.ids)
    
    print(f"Sample {i+1} Original string:", original_text)  # Original text for comparison
    print(f"Sample {i+1} Encoded tokens (token IDs):", encoded_sample.ids)  # Encoded token IDs
    print(f"Sample {i+1} Tokenized string (tokens):", encoded_sample.tokens)  # Encoded tokens as text
    print(f"Sample {i+1} Decoded string:", decoded_text)  # Decoded string
    print("="*80)

Sample 1 Original string: Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Sample 1 Encoded tokens (token IDs): [9362, 2897, 121, 66, 3078, 901, 18, 267, 3612, 125, 118, 613, 313, 130, 251, 297, 11684, 286, 1745, 504, 16, 568, 130, 6, 15371, 

### Task 2c) Analysis of BPE Tokenizer Text Results

The following observations were made:
- **HTML tags and special characters** (e.g., `<br />`)
- **Ellipses and symbols** (e.g., `...`, `.........`, `***`)
- **Contractions** (e.g., "isn't", "don’t", "I’m" are not expanded, only split into pieces)
- **Hyphenated words** (e.g., "en-wrapped")
- **Complex formatting and annotations** (e.g., "The Night Listner.")
- **Long sentences and informal structure** (e.g., lack of clarity, informal style)

Next steps:
To improve the quality and consistency of our text data, we apply a series of normalization steps. These steps address common issues such as HTML tags, punctuation inconsistencies, contractions, informal language, and ellipses. By cleaning and standardizing the text, we aim to create a more accurate and meaningful representation for downstream processing and analysis.

Stemming is not necessary because BPE is a subword-level tokenizer.

#### Calculate the Accuracy of the BPE Tokenizer with Bag-of-Words and SVM

In [15]:
def bpe_tokenizer(text):
    encoded = tokenizer.encode(text)
    return encoded.tokens

In [16]:
# Initialize CountVectorizer with the BPE tokenizer
bow = CountVectorizer(
    preprocessor=load_text,  # Convert file paths to text content
    tokenizer=bpe_tokenizer, # Use the in-memory BPE tokenizer
    token_pattern=None       # Disable default token pattern
)

# Create Bag-of-Words embeddings
embeddings_train = bow.fit_transform(data_train['path'].values)
embeddings_test = bow.transform(data_test['path'].values)

In [17]:
# Initialize the SVM classifier
svm = LinearSVC(dual=False, max_iter=5000)

# Train the classifier on the training embeddings and labels
svm.fit(embeddings_train, data_train['label'].values)

# Use the trained classifier to predict the test data labels
predictions = svm.predict(embeddings_test)

# Calculate and print the accuracy
accuracy = accuracy_score(data_test['label'].values, predictions)
print('Accuracy:', accuracy)

Accuracy: 0.781


### Task 2d: Normalization of the text

This section focuses on preprocessing the texts to make them more suitable for text mining. Identified constraints from the analysis step will be addressed and, as far as possible, eliminated to improve processing accuracy.


In [18]:
import re

def remove_html_tags(text):
    # Regex to match HTML tags
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text
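
One detail worth checking is what happens to the words on either side of a tag. In the reviews, `<br />` often sits directly between two sentences, so deleting it without a replacement joins the neighbouring words. The sketch below contrasts the function above with a hypothetical variant, `replace_html_tags`, that substitutes a space instead; the variant and the sample string are assumptions for illustration and are not part of the evaluated pipeline.

```python
import re

def remove_html_tags(text):
    # Same behaviour as above: tags are deleted without a replacement
    return re.sub(r'<.*?>', '', text)

def replace_html_tags(text):
    # Hypothetical variant: replace each tag with a space, then collapse whitespace
    return re.sub(r'\s+', ' ', re.sub(r'<.*?>', ' ', text)).strip()

sample = 'a movie of its age<br /><br />A cozy movie'
print(remove_html_tags(sample))   # a movie of its ageA cozy movie
print(replace_html_tags(sample))  # a movie of its age A cozy movie
```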



In [19]:
import string

def remove_punctuation_and_symbols(text):
    # Remove punctuation and special symbols using regex
    clean_text = re.sub(r'[^\w\s]', '', text)
    return clean_text



In [20]:
def remove_parentheses(text):
    # Remove text inside parentheses along with parentheses
    clean_text = re.sub(r'\(.*?\)', '', text)
    return clean_text

#### Expand the contractions
Python libraries for expanding contractions exist. To avoid an extra dependency in case such packages are not available, we created our own replacement list.

In [21]:
# Dictionary of common contractions and their expanded forms
contractions_dict = {
    "i'm": "I am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "I have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'll": "I will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "can't": "cannot",
    "couldn't": "could not",
    "won't": "will not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "there's": "there is",
    "here's": "here is",
    "shouldn't": "should not",
    "mustn't": "must not",
    "shan't": "shall not"
}

# Function to expand contractions
def expand_contractions(text):
    contractions_pattern = re.compile(r'\b(' + '|'.join(contractions_dict.keys()) + r')\b', flags=re.IGNORECASE)
    def replace(match):
        return contractions_dict.get(match.group(0).lower(), match.group(0))
    return contractions_pattern.sub(replace, text)
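
The dictionary keys above use the straight apostrophe (`'`), so contractions written with a typographic apostrophe (`’`) would not be matched. A possible remedy is to normalize apostrophes before expanding. The sketch below uses a small subset of the dictionary and a hypothetical helper `normalize_apostrophes`; both are assumptions for illustration.

```python
import re

# Small illustrative subset of the contraction dictionary
contractions_subset = {"don't": "do not", "i'm": "I am", "isn't": "is not"}

pattern = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in contractions_subset) + r')\b',
    flags=re.IGNORECASE,
)

def normalize_apostrophes(text):
    # Hypothetical helper: map typographic apostrophes to the straight one
    return text.replace('\u2019', "'").replace('\u2018', "'")

def expand(text):
    # Look up the lower-cased match in the dictionary subset
    return pattern.sub(lambda m: contractions_subset[m.group(0).lower()], text)

# Made-up sample written with typographic apostrophes (U+2019)
sample = 'I\u2019m sure it isn\u2019t bad, don\u2019t worry.'
print(expand(sample))                         # unchanged: no straight apostrophes to match
print(expand(normalize_apostrophes(sample)))  # I am sure it is not bad, do not worry.
```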


In [22]:
# Map informal tokens to their replacements (an empty string removes the token)
informal_tokens = {
    "u": "you",
    "r": "are",
    "lmao": "",  # Remove
    "lol": "",   # Remove
    "btw": "by the way",
    "idk": "I do not know",
    "omg": "oh my god",
    "gonna": "going to",
    "wanna": "want to",
    "gotta": "got to",
    "b/c": "because",
    "thx": "thanks",
    "pls": "please",
    "cuz": "because",
    "wut": "what",
    "smh": "",  # Remove
    "k": "okay",
    "ttyl": "talk to you later"
}

def remove_informal_tokens(text):
    # Replace informal tokens
    for token, replacement in informal_tokens.items():
        # Use regex to match whole words and replace them
        text = re.sub(r'\b' + re.escape(token) + r'\b', replacement, text, flags=re.IGNORECASE)
    return text

def fix_punctuation(text):
    # Ensure proper spacing after punctuation
    text = re.sub(r'\s*([.,;:!?()])\s*', r'\1 ', text)  # Ensure space after punctuation
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = text.strip()  # Trim leading and trailing spaces
    return text

# Combined function to clean the text
def clean_text(text):
    text = remove_informal_tokens(text)  # Remove informal tokens
    text = fix_punctuation(text)  # Fix punctuation
    return text

def preprocess_ellipses(text):
    # Replace occurrences of three or more dots with actual ellipses
    text = re.sub(r'(\.\s*){3,}', '...', text)
    return text
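
Note that the pattern in `preprocess_ellipses` also consumes whitespace that follows the last dot, so the space after an ellipsis is lost. The sketch below shows this on a fragment from the first training sample and contrasts it with a hypothetical alternative, `collapse_ellipses`, that only allows whitespace between dots; the alternative is an assumption for illustration and is not used in the pipeline.

```python
import re

def preprocess_ellipses(text):
    # Same pattern as above: each dot may be followed by whitespace
    return re.sub(r'(\.\s*){3,}', '...', text)

def collapse_ellipses(text):
    # Hypothetical alternative: whitespace is only allowed between dots
    return re.sub(r'\.(?:\s*\.){2,}', '...', text)

sample = 'I immediately recalled ......... at .......... High.'
print(preprocess_ellipses(sample))  # I immediately recalled ...at ...High.
print(collapse_ellipses(sample))    # I immediately recalled ... at ... High.
```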




In [23]:
import nltk
from nltk.corpus import stopwords

# Download necessary resources
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(path):
    text = load_text(path)              # Load text from file path
    text = expand_contractions(text)    # Expand contractions
    text = remove_informal_tokens(text) # Replace informal tokens
    # Remove stopwords (the comparison is case-sensitive, so capitalized words such as "It" are kept)
    text = ' '.join(word for word in text.split() if word not in stop_words)
    text = fix_punctuation(text)        # Fix punctuation
    text = preprocess_ellipses(text)    # Handle ellipses as a single token
    text = re.sub(r'<.*?>', '', text)   # Remove HTML tags

    return text  # Return the fully preprocessed text (no stemming is applied)

# Apply preprocessing function to training and test data
data_train['cleaned_text'] = data_train['path'].apply(preprocess_text)
data_test['cleaned_text'] = data_test['path'].apply(preprocess_text)




#### Rerun the BPE Tokenizer on the normalized text, bag-of-words and SVM

With the text normalized in the previous step, the next step is to tokenize it.
For this, we again train a BPE tokenizer, this time on the normalized text.

In [24]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a BPE tokenizer
tokenizer = Tokenizer(BPE())

# Add a pre-tokenizer to handle whitespace properly
tokenizer.pre_tokenizer = Whitespace()

# Initialize a BPE trainer
trainer = BpeTrainer()

# Convert cleaned_text column to a list of strings for training
training_data = data_train['cleaned_text'].tolist()

# Train the tokenizer on the cleaned text data
tokenizer.train_from_iterator(training_data, trainer)

# Encode the text data for the model
encoded_train = [tokenizer.encode(text).ids for text in data_train['cleaned_text']]
encoded_test = [tokenizer.encode(text).ids for text in data_test['cleaned_text']]

# Select examples to decode and print
sample_indices = [0, 6, 12, 55]  # You can adjust these indices as needed

for i, sample_index in enumerate(sample_indices):
    original_text = load_text(data_train['path'][sample_index])
    encoded_sample = tokenizer.encode(data_train['cleaned_text'][sample_index])
    decoded_text = tokenizer.decode(encoded_sample.ids)
    
    print(f"Sample {i+1} Original string:", original_text)  # Original text for comparison
    print(f"Sample {i+1} Cleaned string:", data_train['cleaned_text'][sample_index])  # Normalized text
    print(f"Sample {i+1} Encoded tokens (token IDs):", encoded_sample.ids)  # Encoded token IDs
    print(f"Sample {i+1} Tokenized string (tokens):", encoded_sample.tokens)  # Encoded tokens as text
    print(f"Sample {i+1} Decoded string:", decoded_text)  # Decoded string
    print("="*80)

Sample 1 Original string: Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Sample 1 Cleaned string: Bromwell High cartoon comedy. It ran time programs school life, "Teachers". My 35 years teaching profession lead believe Bromwell High's satir

In [25]:
def bpe_tokenizer(text):
    encoded = tokenizer.encode(text)
    return encoded.tokens

In [26]:
# Use preprocessed 'cleaned_text' data directly for Bag-of-Words embedding
bow = CountVectorizer(
    preprocessor=None,       # Text is already preprocessed in 'cleaned_text'
    tokenizer=bpe_tokenizer, # Use the in-memory BPE tokenizer
    token_pattern=None       # Disable default token pattern
)

# Create Bag-of-Words embeddings on preprocessed text
embeddings_train = bow.fit_transform(data_train['cleaned_text'])
embeddings_test = bow.transform(data_test['cleaned_text'])

In [27]:
# Initialize the SVM classifier
svm = LinearSVC(dual=False, max_iter=5000)

# Train the classifier on the training embeddings and labels
svm.fit(embeddings_train, data_train['label'].values)

# Use the trained classifier to predict the test data labels
predictions = svm.predict(embeddings_test)

# Calculate and print the accuracy
accuracy = accuracy_score(data_test['label'].values, predictions)
print('Accuracy:', accuracy)

Accuracy: 0.744


#### Analysis of the results

Normalization made the text easier for a human to read but reduced the bag-of-words accuracy from 0.781 (BPE on raw text) to 0.744, which is also below the white-space baseline of 0.761.

## Task 3: Embedding
In this part, the TF-IDF embedding method is applied to the dataset. We explore two scenarios: 
1. TF-IDF embedding on raw text with Byte-Pair Encoding (BPE) tokenization only.
2. TF-IDF embedding on normalized text, where text is preprocessed with steps such as contraction expansion and stopword removal, followed by BPE tokenization.
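
For reference, TF-IDF weights each term count by how rare the term is across documents. The sketch below is a pure-Python illustration using the smoothed variant `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which, to our understanding, corresponds to the documented defaults of scikit-learn's `TfidfVectorizer`. The helper name `tfidf` and the toy documents are assumptions, and the sketch ignores options such as `ngram_range` and `max_features`.

```python
import math
from collections import Counter

def tfidf(docs):
    ''' Returns the sorted vocabulary and one L2-normalized TF-IDF vector per document. '''
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # document frequency: number of documents that contain each term
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors

# Toy documents (made up for illustration)
vocabulary, vectors = tfidf(['good movie', 'bad movie', 'good good plot'])
print(vocabulary)
for vector in vectors:
    print([round(w, 3) for w in vector])
```

In the second document, the rarer term `bad` receives a higher weight than `movie`, which appears in two of the three documents.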

In [28]:
# 1. BPE Tokenization Without Normalization
from sklearn.feature_extraction.text import TfidfVectorizer

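# Tokenize the text using the BPE tokenizer
# Note: `tokenizer` is the most recently trained instance; if the cells are run in order,
# that is the tokenizer trained on the normalized text in the previous section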
def bpe_tokenizer(text):
    encoded = tokenizer.encode(text)
    return encoded.tokens

# Initialize TF-IDF Vectorizer for raw text (no normalization)
tfidf_vectorizer_raw = TfidfVectorizer(
    max_features=15000,
    ngram_range=(1, 2),
    tokenizer=bpe_tokenizer,
    token_pattern=None  # Disable default token pattern since we're using a custom tokenizer
)

# Apply BPE tokenization and TF-IDF on raw text
tfidf_train_raw = tfidf_vectorizer_raw.fit_transform(data_train['path'].apply(load_text))
tfidf_test_raw = tfidf_vectorizer_raw.transform(data_test['path'].apply(load_text))

print("TF-IDF Shape (Raw BPE):", tfidf_train_raw.shape, tfidf_test_raw.shape)

# 2. BPE Tokenization With Normalization

# Preprocess the text using the normalization function first
data_train['normalized_text'] = data_train['path'].apply(preprocess_text)
data_test['normalized_text'] = data_test['path'].apply(preprocess_text)

# Initialize TF-IDF Vectorizer for normalized text
tfidf_vectorizer_normalized = TfidfVectorizer(
    max_features=15000,
    ngram_range=(1, 2),
    tokenizer=bpe_tokenizer,
    token_pattern=None  # Disable default token pattern
)

# Apply BPE tokenization and TF-IDF on normalized text
tfidf_train_normalized = tfidf_vectorizer_normalized.fit_transform(data_train['normalized_text'])
tfidf_test_normalized = tfidf_vectorizer_normalized.transform(data_test['normalized_text'])

print("TF-IDF Shape (Normalized BPE):", tfidf_train_normalized.shape, tfidf_test_normalized.shape)

TF-IDF Shape (Raw BPE): (1000, 15000) (1000, 15000)
TF-IDF Shape (Normalized BPE): (1000, 15000) (1000, 15000)


In [29]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# 1. Prediction with Raw BPE Tokenization (No Normalization)
svm_raw = LinearSVC(dual=True)
svm_raw.fit(tfidf_train_raw, data_train['label'])  # Train on raw BPE TF-IDF
predictions_raw = svm_raw.predict(tfidf_test_raw)  # Predict on test data

# Evaluate the classifier's performance on raw data
accuracy_raw = accuracy_score(data_test['label'], predictions_raw)
print("Accuracy (Raw BPE):", accuracy_raw)
print("\nClassification Report (Raw BPE):\n", classification_report(data_test['label'], predictions_raw))

# Separator for clarity in output
print("="*80)

# 2. Prediction with Normalized BPE Tokenization
svm_normalized = LinearSVC(dual=True)
svm_normalized.fit(tfidf_train_normalized, data_train['label'])  # Train on normalized BPE TF-IDF
predictions_normalized = svm_normalized.predict(tfidf_test_normalized)  # Predict on test data

# Evaluate the classifier's performance on normalized data
accuracy_normalized = accuracy_score(data_test['label'], predictions_normalized)
print("Accuracy (Normalized BPE):", accuracy_normalized)
print("\nClassification Report (Normalized BPE):\n", classification_report(data_test['label'], predictions_normalized))

Accuracy (Raw BPE): 0.801

Classification Report (Raw BPE):
               precision    recall  f1-score   support

         neg       0.80      0.80      0.80       500
         pos       0.80      0.80      0.80       500

    accuracy                           0.80      1000
   macro avg       0.80      0.80      0.80      1000
weighted avg       0.80      0.80      0.80      1000

Accuracy (Normalized BPE): 0.771

Classification Report (Normalized BPE):
               precision    recall  f1-score   support

         neg       0.76      0.80      0.78       500
         pos       0.79      0.74      0.76       500

    accuracy                           0.77      1000
   macro avg       0.77      0.77      0.77      1000
weighted avg       0.77      0.77      0.77      1000

