<a href="https://colab.research.google.com/github/tinamilo6/Textmining/blob/main/A03_TextMining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install pandas sklearn nltk

# Assignment 3 - Text Mining

Project management and tools for health informatics

## 1. Download and prepare data:

**Do not alter the code in this Section!**

The code in this section downloads the [IMDB Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), which is the dataset you will be working on in this assignment.

In [2]:
import os
import tarfile
from urllib.request import urlretrieve

In [3]:
if not os.path.exists('aclImdb'):
    # download data:
    urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb.tar.gz')

    # unzip data:
    with tarfile.open('aclImdb.tar.gz') as file:
        file.extractall('./')

## 2. Some helper Functions:

**Do not alter the code in this Section!**

This section contains the code for some helper functions that will be useful for solving the assignment. Example code on how to use the functions is provided in section 3.

In [4]:
import pandas as pd

from typing import Literal, Tuple, Iterable

Function for loading data into a pandas dataframe:

In [5]:
def load_data(split:Literal['train', 'test'], texts_per_class:int=500) -> pd.DataFrame:
    ''' Loads the data into a pandas dataframe.'''
    paths  = []
    labels = []

    for label in ('pos', 'neg'):
        # get all files in the folder:
        files = os.listdir(os.path.join('aclImdb', split, label))[:texts_per_class]

        # append them to the lists:
        paths.extend([os.path.join('aclImdb', split, label, f) for f in files])
        labels.extend([label] * len(files))

    return pd.DataFrame({'path':paths, 'label':labels})

Function for loading a specific text:

In [6]:
def load_text(path:str) -> str:
    ''' Reads a single text given the path. '''
    # read file from disk:
    with open(path, 'r', encoding='utf8') as file:
        s = file.read()

    return s

Function for iterating through multiple texts:

In [7]:
def iterate_texts(data:pd.DataFrame) -> Iterable[Tuple[str, str]]:
    ''' Iterates through a pandas dataframe. '''

    for path in data['path'].values:
        # read file from disk:
        with open(path, 'r', encoding='utf8') as file:
            text = file.read()

        yield text

## 3. Text Mining Pipeline

This section will cover the text mining steps for this assignment. The following steps will be performed:

1. **Analyze the Data for Difficult Parts**  
   - Review the data to identify challenging aspects such as contractions, informal language, and complex sentence structures.

2. **Replace Contractions and Informal Language**  
   - Expand common contractions and replace informal phrases, if necessary, to standardize the text for processing.

3. **Tokenize the Texts**  
   - Apply a suitable tokenizer to break down the text into individual tokens for analysis.

### Import needed libraries

In [8]:
# Import needed libraries for the preparation of the texts
import random

### Load the training data
The data extracted from the downloaded archive is loaded into the `data_train` and `data_test` DataFrames.
These are used throughout the rest of the notebook to access the training and test data.

In [9]:
data_train = load_data('train')
data_test  = load_data('test')
data_train

| | path | label |
|---|---|---|
| 0 | `aclImdb\train\pos\0_9.txt` | pos |
| 1 | `aclImdb\train\pos\10000_8.txt` | pos |
| 2 | `aclImdb\train\pos\10001_10.txt` | pos |
| 3 | `aclImdb\train\pos\10002_7.txt` | pos |
| 4 | `aclImdb\train\pos\10003_8.txt` | pos |
| ... | ... | ... |
| 995 | `aclImdb\train\neg\10446_2.txt` | neg |
| 996 | `aclImdb\train\neg\10447_1.txt` | neg |
| 997 | `aclImdb\train\neg\10448_1.txt` | neg |
| 998 | `aclImdb\train\neg\10449_4.txt` | neg |


### 1. Assess the data / texts for difficult parts
In this part, sample texts are printed and then analyzed for difficult parts that could affect the text mining process.

In [10]:
# Number of samples to be printed
n_samples = 7

# Randomly sample indices from "data_train"
sample_indices = random.sample(range(len(data_train)), n_samples)

# Load and print each sample using the "load_text" function
for i, idx in enumerate(sample_indices, start=1):  # "idx" avoids shadowing the built-in id()
    # Load text from the file path specified in 'path' column
    text = load_text(data_train.loc[idx, 'path'])
    print(f"Sample {i}:\n")
    print(text)
    print("\n" + "="*80 + "\n")

Sample 1:

We so often talk of cinema landmarks - Kane, The Godfather, A Bout de Souffle. One film however is too often overlooked by "serious" film critics. I am talking of course about the classic Doc Savage (M.o.B.)<br /><br />This film is not only exciting but also seriously explores the issue of exploitation of the developing nations by US imperialism. Not to mention kung-fu.<br /><br />It also possessed the greatest soundtrack in film history (until of course Queen's breathtaking work on Flash Gordon). Although a bit of a rarity, this film is well worth seeking out - it will repay the effort of your search ten-fold.


Sample 2:

I think James Cameron might be becoming my favorite director because this is my second review of his movies. Anyway, everyone remembers the RMS Titanic. It was big, fast, and "unsinkable"... until April 1912. It was all over the news and one of the biggest tragedies ever. Well James Cameron decided to make a movie out of it but star two fictional characte

#### Limitations/Issues Captured from the Above Text Samples

- **HTML tags and special characters**
- **Punctuation and symbols** (e.g., `&`)
- **Contractions** (e.g., "isn't", "I'll", "I'm")
- **Parentheses and annotations** (e.g., "(Crouching Tiger)")
- **Informal formatting** (e.g., "my rating is ****")
- **Ambiguity and polysemy** (e.g., "dictators", "nuts")
- **Long and complex sentences**
- **Informal language** (e.g., "what can be so bad about that?")
- **Quotation marks** (e.g., "dictators", "sin")

We will elaborate on these issues and add our conclusions in the report.
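
To get a rough feeling for how often some of these issues occur, they can be counted with regular expressions. The following is a minimal sketch: the patterns in `ISSUE_PATTERNS`, the function `count_issues`, and the two sample strings are illustrative assumptions and not part of the assignment code, and the patterns cover only a few of the issues listed above.

```python
import re
from collections import Counter

# Hypothetical patterns for a few of the listed issues (not a complete inventory)
ISSUE_PATTERNS = {
    "html_tag": re.compile(r"<[^>]+>"),
    "contraction": re.compile(r"\b\w+'(?:s|m|re|ve|ll|d|t)\b", re.IGNORECASE),
    "parenthetical": re.compile(r"\([^)]*\)"),
    "star_rating": re.compile(r"\*{2,}"),
}

def count_issues(texts):
    """Count how many matches of each pattern occur across the given texts."""
    counts = Counter()
    for text in texts:
        for name, pattern in ISSUE_PATTERNS.items():
            counts[name] += len(pattern.findall(text))
    return counts

# Made-up review snippets, used only for illustration
samples = [
    "I'm sure it isn't bad.<br /><br />My rating is ****",
    "A classic (Crouching Tiger) that I'll watch again.",
]
issue_counts = count_issues(samples)
print(issue_counts)
```

On the real data, the same function could be fed with the texts loaded via `load_text` to decide which preprocessing steps are worth the effort.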

### 2. Preprocessing: Simplify the Text

This section focuses on preprocessing the texts to make them more suitable for text mining. Identified constraints from the analysis step will be addressed and, as far as possible, eliminated to improve processing accuracy.


In [11]:
import re

def remove_html_tags(text):
    # Regex to match HTML tags
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text



In [12]:
import string

def remove_punctuation_and_symbols(text):
    # Remove punctuation and special symbols using regex
    clean_text = re.sub(r'[^\w\s]', '', text)
    return clean_text



In [13]:
def remove_parentheses(text):
    # Remove text inside parentheses along with parentheses
    clean_text = re.sub(r'\(.*?\)', '', text)
    return clean_text
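
The effect of the three regex helpers can be checked on a short string. The sketch below repeats the helpers so that it runs on its own; the sample sentence is made up. Note that, as the notebook stands, `preprocess_text` further down does not call these helpers, which is why `<br />` tags still appear in the cleaned samples.

```python
import re

def remove_html_tags(text):
    return re.sub(r'<.*?>', '', text)

def remove_parentheses(text):
    return re.sub(r'\(.*?\)', '', text)

def remove_punctuation_and_symbols(text):
    return re.sub(r'[^\w\s]', '', text)

# Made-up sample combining several of the issues found above
sample = "Great film (M.o.B.)<br /><br />This isn't bad & it's fun."

step1 = remove_html_tags(sample)                # tags vanish, neighbouring words are fused
step2 = remove_parentheses(step1)               # the parenthetical is dropped
step3 = remove_punctuation_and_symbols(step2)   # apostrophes are removed as well

print(step1)
print(step2)
print(step3)
```

Two things are worth noticing. Replacing tags with an empty string fuses the text on both sides of the tag (`)This`), so replacing with a space may be preferable. Removing punctuation turns `isn't` into `isnt`, so if both steps are used, contractions would need to be expanded before punctuation is removed.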

#### Expand the contractions
There are Python libraries that focus on expanding contractions. To avoid an additional dependency (in case these packages are not available), we decided to create our own replacement list.

In [14]:
# Dictionary of common contractions and their expanded forms
contractions_dict = {
    "i'm": "I am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "I have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'll": "I will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "can't": "cannot",
    "couldn't": "could not",
    "won't": "will not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "there's": "there is",
    "here's": "here is",
    "shouldn't": "should not",
    "mustn't": "must not",
    "shan't": "shall not"
}

# Function to expand contractions
def expand_contractions(text):
    contractions_pattern = re.compile(r'\b(' + '|'.join(contractions_dict.keys()) + r')\b', flags=re.IGNORECASE)
    def replace(match):
        return contractions_dict.get(match.group(0).lower(), match.group(0))
    return contractions_pattern.sub(replace, text)
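
A short check of how this expansion behaves. The sketch uses only a three-entry excerpt of the dictionary and compiles the pattern once outside the function; both are simplifications for the demonstration, and the input sentences are made up.

```python
import re

# Small excerpt of the dictionary above, enough for the demonstration
contractions_dict = {"it's": "it is", "isn't": "is not", "i'll": "I will"}

# Compile the pattern once instead of on every call
contractions_pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, contractions_dict)) + r')\b',
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        return contractions_dict.get(match.group(0).lower(), match.group(0))
    return contractions_pattern.sub(replace, text)

expanded = expand_contractions("It's good, isn't it? I'll watch it again.")
curly = expand_contractions("It\u2019s good")  # typographic apostrophe (U+2019)

print(expanded)
print(curly)
```

The first result shows that the original capitalization is lost (`It's` becomes `it is`). The second shows that contractions written with a typographic apostrophe are left untouched, because the dictionary keys use the plain ASCII apostrophe.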


In [15]:
# Dictionary of informal tokens and their replacements (an empty string removes the token)
informal_tokens = {
    "u": "you",
    "r": "are",
    "lmao": "",  # Remove
    "lol": "",   # Remove
    "btw": "by the way",
    "idk": "I do not know",
    "omg": "oh my god",
    "gonna": "going to",
    "wanna": "want to",
    "gotta": "got to",
    "b/c": "because",
    "thx": "thanks",
    "pls": "please",
    "cuz": "because",
    "wut": "what",
    "smh": "",  # Remove
    "k": "okay",
    "ttyl": "talk to you later"
}

def remove_informal_tokens(text):
    # Replace informal tokens
    for token, replacement in informal_tokens.items():
        # Use regex to match whole words and replace them
        text = re.sub(r'\b' + re.escape(token) + r'\b', replacement, text, flags=re.IGNORECASE)
    return text

def fix_punctuation(text):
    # Ensure proper spacing after punctuation
    text = re.sub(r'\s*([.,;:!?()])\s*', r'\1 ', text)  # Ensure space after punctuation
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = text.strip()  # Trim leading and trailing spaces
    return text

# Combined function to clean the text
def clean_text(text):
    text = remove_informal_tokens(text)  # Remove informal tokens
    text = fix_punctuation(text)  # Fix punctuation
    return text

def preprocess_ellipses(text):
    # Replace occurrences of three or more dots with actual ellipses
    text = re.sub(r'(\.\s*){3,}', '...', text)
    return text
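
The following sketch illustrates two side effects of the functions above that are easy to miss. It repeats the functions with a three-entry excerpt of `informal_tokens` so that it runs on its own; the input strings are made up.

```python
import re

informal_tokens = {"u": "you", "r": "are", "k": "okay"}  # excerpt of the dictionary above

def remove_informal_tokens(text):
    for token, replacement in informal_tokens.items():
        text = re.sub(r'\b' + re.escape(token) + r'\b', replacement, text, flags=re.IGNORECASE)
    return text

def fix_punctuation(text):
    text = re.sub(r'\s*([.,;:!?()])\s*', r'\1 ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def preprocess_ellipses(text):
    return re.sub(r'(\.\s*){3,}', '...', text)

# 1) Single-letter tokens also match inside hyphenated names
intended = remove_informal_tokens("u should see it")
side_effect = remove_informal_tokens("The R-rated cut of K-19")

# 2) fix_punctuation splits "..." into ". . .", preprocess_ellipses joins it again
spaced = fix_punctuation("Wait... what?")
joined = preprocess_ellipses(spaced)

print(intended)
print(side_effect)
print(spaced)
print(joined)
```

Because `\b` treats a hyphen as a word boundary, `R-rated` and `K-19` are rewritten. The ellipsis pattern also consumes the whitespace after the last dot, so the following word is attached directly to the ellipsis.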




In [16]:
import nltk
from nltk.corpus import stopwords

# Download necessary resources
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(path):
    text = load_text(path)              # Load text from file path
    ##################################################################################################################
    ### Comment out the lines below (up to the return) to simulate bare processing without normalization steps   ###
    ##################################################################################################################
    text = expand_contractions(text)    # Expand contractions
    text = remove_informal_tokens(text) # Replace informal tokens
    # Note: the lookup is case-sensitive, so capitalized stopwords such as "The", "But" or "I" are kept
    text = ' '.join(word for word in text.split() if word not in stop_words) # Remove stopwords
    text = fix_punctuation(text)        # Fix punctuation
    text = preprocess_ellipses(text)    # Handle ellipses as a single token
    return text


# Apply preprocessing function to training and test data
data_train['cleaned_text'] = data_train['path'].apply(preprocess_text)
data_test['cleaned_text'] = data_test['path'].apply(preprocess_text)

# Number of samples to be printed
n_samples = 5

# Randomly sample indices from "data_train"
sample_indices = random.sample(range(len(data_train)), n_samples)

# Print each sample from the cleaned_text
for i, idx in enumerate(sample_indices, start=1):
    # Access the cleaned text directly
    text = data_train.loc[idx, 'cleaned_text']
    print(f"Sample {i}:\n")
    print(text)
    print("\n" + "="*80 + "\n")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kentf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Sample 1:

Well, least theater group did, . So course I remember watching Grease since I little girl, never favorite musical story, still hold little special place heart since still lot fun watch. I heard horrible things Grease 2 I decided never watch it, boyfriend said really bad friend agreed, I decided give shot, I called laughed. First plot totally stolen first one really clever, mention used characters, different names actors. Tell me, Pink Ladies T-Birds continue years former gangs left? Not mention creator face motor cycle enemy, gee, striking resemblance guys first film well T-Birds stupid ridiculous. <br /><br />Another year Rydell music dancing stopped. But new student Sandy's cousin comes scene, love struck pink lady, Stephanie. But must stick code Pink Ladies must stick T-Birds, new student, decides train T-Bird win heart. So dresses rebel motor cycle bandit ride well defeat evil bikers easily kicking T-Bird's butts. But tell Stephanie really find own? Well, find yourself. 
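
The cleaned sample above still contains words such as `I`, `So` and `But`. This is because the stop word list is lowercase while the lookup in `preprocess_text` compares each word as written. The sketch below shows the difference; the tiny `stop_words` set is a stand-in for the NLTK list, and `drop_stopwords` with its `lowercase` flag is a hypothetical helper, not part of the notebook.

```python
# Tiny stand-in for the NLTK English stop word list (which is all lowercase)
stop_words = {"the", "but", "i", "it", "was", "a"}

def drop_stopwords(text, lowercase=False):
    """Remove stop words; optionally lowercase each word before the lookup."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase else word
        if key not in stop_words:
            kept.append(word)
    return ' '.join(kept)

review = "The film was long. But I liked it"

case_sensitive = drop_stopwords(review)                    # behaviour of the cell above
case_insensitive = drop_stopwords(review, lowercase=True)  # possible variant

print(case_sensitive)
print(case_insensitive)
```

Whether the case-insensitive variant improves the classification result would have to be tested; it changes the cleaned texts and therefore all accuracies reported below.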

### A simple pipeline:
[Do not change it!]

This simple pipeline serves as the baseline: the newly created pipeline is compared against it to evaluate the performance increase.

**White-Space tokenization:**

In [17]:
def tokenize(text:str):
    ''' An example tokenization function. '''

    # simple white-space tokenization:
    return text.lower().split()


**Bag-of-words Embedding:**

See documentation of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

# create a simple bag of words embedding:
bow = CountVectorizer(

    # the next line converts the filepaths to the actual texts:
    #preprocessor = load_text, # Commented, to check if the preprocessing from above changes the results

    # tokenization function from above:
    tokenizer = tokenize,

    # Set token_pattern to None since we're using a custom tokenizer
    token_pattern=None

)

# Train the embedding on cleaned training data
embeddings_train = bow.fit_transform(data_train['cleaned_text'].values)

# Vectorize the cleaned test data
embeddings_test = bow.transform(data_test['cleaned_text'].values)

# These are the original lines
# train the embedding:
#embeddings_train = bow.fit_transform(data_train['path'].values)

# vectorize test data:
#embeddings_test = bow.transform(data_test['path'].values)
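
What a bag-of-words embedding computes can be reproduced with the standard library. The sketch below is a simplified illustration and not the scikit-learn implementation: the vocabulary is built from made-up training texts, and every text becomes a vector of token counts over that vocabulary.

```python
from collections import Counter

def tokenize(text):
    # same white-space tokenization as above
    return text.lower().split()

# Made-up training texts
train_texts = ["good movie good plot", "bad movie"]

# Vocabulary: sorted unique tokens of the training texts, one column per token
vocab = sorted({tok for text in train_texts for tok in tokenize(text)})

def vectorize(text):
    """Count each vocabulary token in the text; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

train_vectors = [vectorize(t) for t in train_texts]
unseen_vector = vectorize("Good acting")

print(vocab)
print(train_vectors)
print(unseen_vector)
```

The last vector shows why the vocabulary must be fitted on the training data only: the token `acting` does not occur there and therefore contributes nothing to the test vector.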

**Classification with a linear SVM**

See documentation of [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [19]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC(dual=False, max_iter=5000)

# train classifier:
svm.fit(embeddings_train, data_train['label'].values)

# test classifier:
predictions = svm.predict(embeddings_test)

# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 0.789


### Own text mining pipeline

Train our chosen tokenizer, byte-pair encoding (BPE), on the cleaned training texts, then encode all texts and decode one sample for inspection.

In [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a BPE tokenizer
tokenizer = Tokenizer(BPE())

# Add a pre-tokenizer to handle whitespace properly
tokenizer.pre_tokenizer = Whitespace()

# Initialize a BPE trainer with some default parameters
trainer = BpeTrainer(special_tokens=["<pad>", "<s>", "</s>", "<unk>", "__ellipsis__"])

# Convert cleaned_text column to a list of strings for training
training_data = data_train['cleaned_text'].tolist()

# Train the tokenizer on the cleaned text data
tokenizer.train_from_iterator(training_data, trainer)

# Save the tokenizer for later use
tokenizer.save("imdb_bpe_tokenizer.json")

# Encode the text data for the model
encoded_train = [tokenizer.encode(text).ids for text in data_train['cleaned_text']]
encoded_test = [tokenizer.encode(text).ids for text in data_test['cleaned_text']]

# Select a single example to decode and print
sample_index = 0  # Choose an index, e.g., 0 for the first item
encoded_sample = encoded_train[sample_index]

# Decode the tokenized output back to text
decoded_text = tokenizer.decode(encoded_sample)

# Display the results
print("Original string:", data_train['cleaned_text'][sample_index])  # Original text for comparison
print("Encoded tokens:", encoded_sample)
print("Decoded string:", decoded_text)

Original string: Bromwell High cartoon comedy. It ran time programs school life, "Teachers". My 35 years teaching profession lead believe Bromwell High's satire much closer reality "Teachers". The scramble survive financially, insightful students see right pathetic teachers' pomp, pettiness whole situation, remind schools I knew students. When I saw episode student repeatedly tried burn school, I immediately recalled...High. A classic line: INSPECTOR: I sack one teachers. STUDENT: Welcome Bromwell High. I expect many adults age think Bromwell High far fetched. What pity not!
Encoded tokens: [9292, 2821, 2991, 842, 18, 281, 3024, 276, 10904, 1697, 433, 16, 6, 13968, 754, 926, 6438, 511, 14379, 6966, 693, 778, 9292, 2821, 11, 84, 8971, 347, 7116, 2558, 6, 13968, 754, 162, 26502, 4999, 11248, 16, 9178, 4921, 269, 534, 2502, 9169, 11, 8118, 16, 21478, 845, 1732, 16, 1785, 13923, 45, 2660, 4921, 18, 915, 45, 641, 702, 4311, 6366, 2097, 2256, 1697, 16, 45, 3888, 12500, 295, 2821, 18, 37, 136

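The tokenization is now complete. Note that the TF-IDF features in the next cell are built with the default word-level tokenization of `TfidfVectorizer` on `cleaned_text`; the BPE token IDs in `encoded_train` and `encoded_test` are not used there.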
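
The idea behind BPE training can be sketched in a few lines: starting from single characters, the most frequent pair of adjacent symbols is merged into a new symbol, and this is repeated. The code below is a toy illustration only, not how the `tokenizers` library is implemented. The corpus, the function names, and the tie-breaking rule (highest count first, then the alphabetically smallest pair) are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair (ties: smallest pair)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return min(pairs, key=lambda p: (-pairs[p], p)) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word split into characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After three merges the frequent ending `est` and the frequent beginning `lo` have become single symbols, which is how subword units emerge from the data.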

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer with desired parameters
tfidf_vectorizer = TfidfVectorizer(
    max_features=15000,       # You can adjust based on memory and data size
    ngram_range=(1, 2),      # Consider unigrams and bigrams for richer context
)

# Fit on training data and transform both training and test data
tfidf_train = tfidf_vectorizer.fit_transform(data_train['cleaned_text'])
tfidf_test = tfidf_vectorizer.transform(data_test['cleaned_text'])

# Print the shapes of the resulting TF-IDF matrices
print("TF-IDF Training Data Shape:", tfidf_train.shape)
print("TF-IDF Test Data Shape:", tfidf_test.shape)


TF-IDF Training Data Shape: (1000, 15000)
TF-IDF Test Data Shape: (1000, 15000)
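
To make the TF-IDF weighting more tangible, the sketch below computes it by hand for three made-up, already tokenized documents. It assumes the smoothed formula documented for scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. Bigrams and the `max_features` limit used above are left out for brevity.

```python
import math
from collections import Counter

# Made-up, already tokenized documents
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency (assumed to match the scikit-learn default)
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf(doc):
    """Term count times idf, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf(doc) for doc in docs]
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Terms that occur in fewer documents (`bad`, `plot`) receive a higher idf than terms that occur in more documents (`good`, `movie`), so they weigh more within a document vector.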


In [22]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the SVM classifier
svm = LinearSVC()
svm.fit(tfidf_train, data_train['label'])  # 'label' is your target variable

# Make predictions on the test data
predictions = svm.predict(tfidf_test)

# Evaluate the classifier's performance
accuracy = accuracy_score(data_test['label'], predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(data_test['label'], predictions))

Accuracy: 0.818

Classification Report:
               precision    recall  f1-score   support

         neg       0.81      0.83      0.82       500
         pos       0.83      0.80      0.82       500

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000





In [23]:
import torch
print(torch.__version__)          # To confirm the version
print(torch.cuda.is_available())  # This should return False as there is no GPU

2.4.1+cpu
False


In [26]:
# Map the labels to integers
label_mapping = {'pos': 1, 'neg': 0}
data_train['label'] = data_train['label'].map(label_mapping)
data_test['label'] = data_test['label'].map(label_mapping)

# Now convert to tensor
y_train_tensor = torch.tensor(data_train['label'].values, dtype=torch.long)
y_test_tensor = torch.tensor(data_test['label'].values, dtype=torch.long)

In [30]:
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import TensorDataset, DataLoader

# Step 1: Convert the sparse TF-IDF matrices to dense tensors
X_train_tensor = torch.tensor(tfidf_train.toarray(), dtype=torch.float32)
X_test_tensor = torch.tensor(tfidf_test.toarray(), dtype=torch.float32)

# Labels as tensors
y_train_tensor = torch.tensor(data_train['label'].values, dtype=torch.long)
y_test_tensor = torch.tensor(data_test['label'].values, dtype=torch.long)

# Step 2: Define the Neural Network model
class SimpleNN(nn.Module):
    def __init__(self, input_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 2)  # 2 classes: positive/negative
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Initialize the model
input_size = X_train_tensor.shape[1]
model = SimpleNN(input_size)

# Step 3: Define Loss Function and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Step 4: DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Step 5: Training Loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader)}')

# Step 6: Evaluation on Test Data
model.eval()
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    _, predicted = torch.max(test_outputs, 1)
    accuracy = accuracy_score(y_test_tensor, predicted)
    print("Test Accuracy:", accuracy)
    print("\nClassification Report:\n", classification_report(y_test_tensor, predicted))


Epoch 1/10, Loss: 0.6641243491321802
Epoch 2/10, Loss: 0.24066265975125134
Epoch 3/10, Loss: 0.008255663749878295
Epoch 4/10, Loss: 0.0013952378340036375
Epoch 5/10, Loss: 0.0006086299872549716
Epoch 6/10, Loss: 0.0003767254838749068
Epoch 7/10, Loss: 0.00019512378185027046
Epoch 8/10, Loss: 0.00012972614911177516
Epoch 9/10, Loss: 8.687810452556732e-05
Epoch 10/10, Loss: 5.518287719041837e-05
Test Accuracy: 0.816

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.83      0.82       500
           1       0.83      0.80      0.81       500

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000



Further text normalization, such as stemming

Kent: I don't think it is necessary, and we should implement it in the preprocessing/normalization function at the beginning.

In [23]:
""" import nltk
import re
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download the 'punkt' resource for tokenization
nltk.download('punkt')

# Function when further normalization is needed
def further_normalization_needed(training_data):
    stemmer = PorterStemmer()

    for i, text in enumerate(training_data):
        print(f"\nOriginal Text {i+1}: {text}")

        # Tokenize using word_tokenize for stemming analysis
        tokens = word_tokenize(text)
        stemmed_tokens = [stemmer.stem(token) for token in tokens]

        # Custom regex to remove unwanted characters (e.g., punctuation)
        cleaned_text = re.sub(r'[^\w\s]', '', text)

        print("Stemmed Tokens:", stemmed_tokens)
        print("Cleaned Text (after regex):", cleaned_text)

# Check if further text normalization is needed
further_normalization_needed(training_data) """

In [24]:
import re
import nltk
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
import pandas as pd

# Download necessary NLTK data
nltk.download('punkt')

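# Sample data for demonstration.
# Caution: this overwrites the IMDB DataFrames `data_train` and `data_test` loaded earlier,
# so all results below are computed on these two toy sentences only.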
data_train = pd.DataFrame({
    'cleaned_text': [
        "Hello, this is a test sentence for the tokenizer.",
        "This is another sentence to improve subword merging."
   ],
    'label': [0, 1]  # Add a 'label' column to the DataFrame
})

data_test = pd.DataFrame({
    'cleaned_text': [
        "Hello, this is another example sentence.",
        "Subword tokenization is quite effective."
    ],
    'label': [0, 1]  # Add a 'label' column to the DataFrame
})


# 1. Improved Tokenization (BPE)
# Initialize BPE tokenizer
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=10000, min_frequency=2)
tokenizer.train_from_iterator(data_train['cleaned_text'], trainer=trainer)

# Wrapper that returns the BPE token strings, for use as the CountVectorizer tokenizer
def bpe_tokenize_func(text):
    return tokenizer.encode(text).tokens

# 2. Improved Tokenization with Normalization (BPE)
stemmer = PorterStemmer()

def normalize_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Stem each word
    return ' '.join([stemmer.stem(word) for word in text.split()])

def bpe_tokenize_normalized(text):
    normalized_text = normalize_text(text)
    return tokenizer.encode(normalized_text).tokens

# Create CountVectorizer for each tokenization method

# Bag of Words with BPE Tokenization
# Use the renamed function here
bow_bpe = CountVectorizer(tokenizer=bpe_tokenize_func, token_pattern=None)
embeddings_train_bpe = bow_bpe.fit_transform(data_train['cleaned_text'].values)
embeddings_test_bpe = bow_bpe.transform(data_test['cleaned_text'].values)

# Bag of Words with BPE Tokenization and Normalization
bow_bpe_norm = CountVectorizer(tokenizer=bpe_tokenize_normalized, token_pattern=None)
embeddings_train_bpe_norm = bow_bpe_norm.fit_transform(data_train['cleaned_text'].values)
embeddings_test_bpe_norm = bow_bpe_norm.transform(data_test['cleaned_text'].values)


print("\nBag of Words with BPE Tokenization:")
print(embeddings_train_bpe.toarray())
print(embeddings_test_bpe.toarray())

print("\nBag of Words with BPE Tokenization and Normalization:")
print(embeddings_train_bpe_norm.toarray())
print(embeddings_test_bpe_norm.toarray())


Bag of Words with BPE Tokenization:
[[1 0 1 3 1 1 1 0 0 3 1 1 1 0 2 1 1 1 2 0 0 1 1 0 0 1 1 0 0 0 1]
 [2 1 1 0 1 0 1 1 1 1 0 2 0 2 1 1 2 0 0 2 2 2 1 1 1 0 2 1 1 1 0]]
[[1 0 1 1 0 1 1 0 0 3 0 1 0 0 2 1 0 0 3 1 1 2 0 1 0 0 1 0 0 0 0]
 [2 0 0 0 1 0 1 1 1 4 1 0 2 0 0 0 4 1 0 0 1 1 1 0 0 1 3 2 1 1 1]]

Bag of Words with BPE Tokenization and Normalization:
[[2 0 1 3 1 0 1 0 3 1 0 1 0 3 1 1 1 2 0 0 1 1 0 0 1 1 0 0 0]
 [3 1 1 0 1 1 1 1 0 0 1 0 1 2 2 1 0 0 2 1 2 1 1 1 0 2 1 1 1]]
[[2 0 1 1 0 0 1 0 2 0 0 0 0 3 1 1 0 3 1 1 2 0 1 0 0 1 0 0 0]
 [2 0 0 0 1 1 1 1 2 1 0 2 0 0 1 0 1 0 0 0 0 1 0 0 1 2 2 0 1]]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kentf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [25]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC(dual=False, max_iter=5000)

# train classifier:
svm.fit(embeddings_train_bpe, data_train['label'].values)


# test classifier:
predictions = svm.predict(embeddings_test_bpe)


# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 1.0


In [26]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score



# Train and evaluate classifier using BPE embeddings
svm = LinearSVC(dual=False, max_iter=5000)

# Train classifier
#svm.fit(embeddings_train, data_train['label'].values)
#svm.fit(embeddings_train_bpe, data_train['label'].values)
svm.fit(embeddings_train_bpe_norm, data_train['label'].values) # The model is trained on embeddings_train_bpe_norm


# Test classifier
#predictions = svm.predict(embeddings_test_bpe) # This line causes the error because embeddings_test_bpe has a different number of features
predictions = svm.predict(embeddings_test_bpe_norm) # Use embeddings_test_bpe_norm for prediction, which has the same number of features as the training data


# Calculate Accuracy
accuracy = accuracy_score(data_test['label'].values, predictions)
print('Accuracy with BPE Tokenization and Normalization:', accuracy)

Accuracy with BPE Tokenization and Normalization: 0.5
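
Since cell 24 replaces `data_test` with two demonstration sentences, the accuracies of the last two cells are each based on only two predictions. The sketch below, with a hypothetical `accuracy` helper, lists every value the accuracy can take in that situation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1]  # labels of the two demonstration sentences

# Enumerate all four possible prediction pairs
possible = sorted({accuracy(y_true, [a, b]) for a in (0, 1) for b in (0, 1)})
print(possible)
```

With only three possible outcomes, the values 1.0 and 0.5 do not allow a comparison between the two tokenization variants; for that, the cells would need to be run on the IMDB data.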
