## Different Tokenizations

This cell loads an already preprocessed comment dataset, applies six tokenization techniques to it, and compares the test accuracy of a RandomForestClassifier trained on TF-IDF features for each technique.

#### Steps and Components:

1. **Loading the Dataset:**
   - The dataset (`preprocessed_dataset.csv`) is loaded using pandas (`pd.read_csv`). It contains columns 'cleaned_comment' (text data) and 'labels'.

2. **Tokenization Functions:**
   - Several tokenization functions are defined:
     - `whitespace_tokenize`: Splits text based on whitespace.
     - `punctuation_tokenize`: Extracts words and punctuation marks using regular expressions.
     - `ngram_tokenize`: Generates word n-grams from whitespace-split text (bigrams in this run, `n=2`).
     - `wordpiece_tokenize`: Uses BERT's tokenizer (`BertTokenizer`, `bert-base-uncased`) to split text into WordPieces.
     - `sentencepiece_tokenize`: Uses a SentencePiece model to segment text into subword pieces. The model is trained on `text.txt`, which the code writes from the dataset's comments, one per line.
     - `bpe_tokenize`: Uses a Byte Pair Encoding (BPE) tokenizer from the `tokenizers` library, trained on the same `text.txt` to build a vocabulary of subword units.

3. **Training SentencePiece and BPE Tokenizers:**
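   - `spm.SentencePieceTrainer.Train()` trains a SentencePiece model with the parameters `--input=text.txt --model_prefix=m --vocab_size=5000` and writes `m.model` and `m.vocab`. Because no `--model_type` is given, SentencePiece uses its default unigram model.
   - `tokenizer.train()` trains the BPE tokenizer on `text.txt` using a `BpeTrainer` with `vocab_size=5000` and `min_frequency=2`, with `Whitespace` as the pre-tokenizer. The trained tokenizer is then saved to `bpe_tokenizer.json` and reloaded.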

4. **Model Training and Evaluation:**
   - A function `train_and_evaluate_model()` is defined to:
     - Split the data into training and testing sets (80/20, `random_state=42`) using `train_test_split()`.
     - Vectorize the text with a default `TfidfVectorizer()`, fitted on the training split only, to convert it into numerical features (TF-IDF vectors).
     - Train a `RandomForestClassifier` model with 100 estimators.
     - Evaluate the model's accuracy on the test set using `accuracy_score`.

5. **Tokenization Methods Evaluation:**
   - The accuracy of the RandomForestClassifier model is evaluated using different tokenization methods:
     - `whitespace`, `punctuation`, `ngram`, `wordpiece`, `sentencepiece`, and `bpe`.
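   - For each method, every comment in `X` is tokenized with the corresponding function, the tokens are joined back into one space-separated string, and the result is passed to `train_and_evaluate_model()`, whose accuracy is printed. Because the default `TfidfVectorizer` then applies its own token pattern to that string, a custom tokenization affects the features only through the characters it inserts, removes, or repeats.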
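
The subword training in step 3 can be illustrated with a toy version of BPE merge learning. This is a minimal sketch of the general idea, not the implementation used by the `tokenizers` library: the function name `learn_bpe_merges` and the sample words are hypothetical, and details such as `min_frequency`, pre-tokenization, and special tokens are left out.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Hypothetical toy corpus
print(learn_bpe_merges(["low", "low", "lower", "lowest"], 2))
# [('l', 'o'), ('lo', 'w')]
```

Each merge adds one new symbol to the vocabulary, which is why a target such as `vocab_size=5000` bounds how many merges a real trainer performs.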

#### Suggestions for Improvement:

- **Error Handling:** Implement error handling to manage potential issues such as file not found errors or tokenization failures.
- **Visualization:** Include visualizations such as confusion matrices to better understand model performance.
- **Parameter Tuning:** Explore tuning parameters for tokenizers and the classifier to potentially improve model accuracy.
- **Scaling:** Consider scaling up to larger datasets or optimizing code for efficiency.
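
As a starting point for the visualization suggestion, a confusion matrix can be tallied from true and predicted labels before plotting it. The sketch below uses only the standard library; the function name and the sample labels are hypothetical. scikit-learn also ships a `confusion_matrix` function in `sklearn.metrics`, which would be the more natural choice inside the notebook.

```python
from collections import Counter

def confusion_matrix_counts(y_true, y_pred):
    """Return (labels, matrix); rows are true labels, columns are predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Hypothetical labels for five test comments
y_true = ["neg", "pos", "pos", "neg", "pos"]
y_pred = ["neg", "pos", "neg", "neg", "pos"]

labels, matrix = confusion_matrix_counts(y_true, y_pred)
print(labels)  # ['neg', 'pos']
for label, row in zip(labels, matrix):
    print(label, row)
# neg [2, 0]
# pos [1, 2]
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total (here 4 / 5); the off-diagonal cells show which classes are confused with each other, which a single accuracy figure hides.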

Taken together, the cell compares six tokenization methods under an otherwise identical TF-IDF and random-forest pipeline. In the run recorded below, test accuracy ranges from about 0.741 (whitespace and punctuation, which score identically) to about 0.749 (WordPiece), so the choice of tokenizer changes the result by less than one percentage point.
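
When reading the accuracies printed by the cell below, it may help to see how a default `TfidfVectorizer` treats the space-joined tokens it receives. The sketch below does not call scikit-learn; it applies the vectorizer's documented default token pattern (`(?u)\b\w\w+\b`, after lowercasing) with the `re` module. The helper `vectorizer_view` and the sample comment are hypothetical.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers:
# runs of two or more word characters; everything else is a separator.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def whitespace_tokenize(text):
    return text.split()

def punctuation_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

def ngram_tokenize(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def vectorizer_view(tokens):
    """Join tokens with spaces, then split as the default pattern would."""
    return DEFAULT_TOKEN_PATTERN.findall(" ".join(tokens).lower())

sample = "great video, thanks!"  # hypothetical comment

print(vectorizer_view(whitespace_tokenize(sample)))
# ['great', 'video', 'thanks']
print(vectorizer_view(punctuation_tokenize(sample)))
# ['great', 'video', 'thanks']
print(vectorizer_view(ngram_tokenize(sample, 2)))
# ['great', 'video', 'video', 'thanks']
```

Under this pattern the whitespace and punctuation variants collapse to the same tokens, which would account for their identical accuracies, and the bigram variant only repeats interior words rather than producing two-word features. One possible way to let each tokenizer define the features directly, which the original cell does not do, is to pass a callable through the vectorizer's `tokenizer` parameter (with `token_pattern=None`) or its `analyzer` parameter.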


In [7]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from transformers import BertTokenizer
import sentencepiece as spm
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Load the dataset
file_path = 'M:\\Internships\\infosys_springboard\\Notebooks\\Preprocessing\\preprocessed_dataset.csv'
df = pd.read_csv(file_path)

# Assuming the dataset has columns 'cleaned_comment' and 'labels'
X = df['cleaned_comment']
y = df['labels']

# Define tokenization functions
def whitespace_tokenize(text):
    return text.split()

def punctuation_tokenize(text):
    return re.findall(r'\w+|[^\w\s]', text, re.UNICODE)

def ngram_tokenize(text, n):
    words = text.split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

# WordPiece Tokenization using BERT's tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def wordpiece_tokenize(text):
    return bert_tokenizer.tokenize(text)

# SentencePiece Tokenization
# SentencePiece trains from a plain-text file, so write the comments to
# 'text.txt' (one per line); the BPE tokenizer below reuses the same file
with open('text.txt', 'w', encoding='utf-8') as f:
    for text in X:
        f.write(text + '\n')

spm.SentencePieceTrainer.Train('--input=text.txt --model_prefix=m --vocab_size=5000')
sp = spm.SentencePieceProcessor(model_file='m.model')

def sentencepiece_tokenize(text):
    return sp.encode_as_pieces(text)

# Byte Pair Encoding (BPE)
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, min_frequency=2)
tokenizer.train(files=['text.txt'], trainer=trainer)
# Persist the trained tokenizer, then reload it from disk
tokenizer.save("bpe_tokenizer.json")
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

def bpe_tokenize(text):
    return tokenizer.encode(text).tokens

# Function to train and evaluate model
def train_and_evaluate_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
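    # Default settings: the vectorizer lowercases each string and re-tokenizes
    # it with its own token_pattern (runs of two or more word characters)
    vectorizer = TfidfVectorizer()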
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_tfidf, y_train)
    
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    
    return accuracy

# Apply tokenization methods
tokenization_methods = {
    'whitespace': whitespace_tokenize,
    'punctuation': punctuation_tokenize,
    'ngram': lambda text: ngram_tokenize(text, 2),
    'wordpiece': wordpiece_tokenize,
    'sentencepiece': sentencepiece_tokenize,
    'bpe': bpe_tokenize
}

for method_name, tokenize_fn in tokenization_methods.items():
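    # Tokenize each comment, then join the tokens with spaces because
    # TfidfVectorizer expects one string per document
    tokenized_X = X.apply(lambda text: ' '.join(tokenize_fn(text)))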
    accuracy = train_and_evaluate_model(tokenized_X, y)
    print(f"{method_name} tokenization - Model Accuracy: {accuracy}")


whitespace tokenization - Model Accuracy: 0.7412461380020597
punctuation tokenization - Model Accuracy: 0.7412461380020597
ngram tokenization - Model Accuracy: 0.7438208032955715
wordpiece tokenization - Model Accuracy: 0.7492276004119465
sentencepiece tokenization - Model Accuracy: 0.7440782698249228
bpe tokenization - Model Accuracy: 0.7438208032955715
