# Comparative Analysis of Binary and Frequency Bag-of-Words Representations for Medical Transcript Classification Using Machine Learning Models

**Uploading and Using Train, Test, and Validation CSV Files in Google Colab :**

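This project uses three CSV files: train, test, and valid. The code calls Google Colab's files.upload() function to upload them from the local computer into the Colab environment, where they can be loaded for training, validating, and testing a machine learning model. If a file with the same name already exists in the session, Colab saves the new upload under a suffixed name such as train (1).csv, as the output of this cell shows; in that case a later read of train.csv picks up the earlier copy, not the new upload.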

In [2]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train (1).csv
Saving test.csv to test (1).csv
Saving valid.csv to valid (1).csv
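
One way to avoid depending on the saved file name is to read the data straight from the dictionary that files.upload() returns. The sketch below assumes that dictionary maps file names to raw bytes; rows_from_upload and fake_uploaded are hypothetical names used only for illustration, and the CSV content is made up.

```python
import csv
import io

def rows_from_upload(uploaded, stem):
    """Return CSV rows (as dicts) for the first uploaded file whose name starts with `stem`."""
    for name, data in uploaded.items():
        if name.startswith(stem) and name.endswith(".csv"):
            text = data.decode("utf-8")
            return list(csv.DictReader(io.StringIO(text)))
    raise KeyError(f"no uploaded CSV starts with {stem!r}")

# Stand-in for the dictionary returned by files.upload() (hypothetical content)
fake_uploaded = {"train (1).csv": b"text,label\nchest pain noted,1\n"}
rows = rows_from_upload(fake_uploaded, "train")
print(rows[0]["text"], rows[0]["label"])
```

Matching on the name prefix works whether the key is train.csv or train (1).csv, so the same call covers both cases.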


**Import Libraries :**

This code imports essential libraries for data analysis, preprocessing, and machine learning.

* pandas and numpy handle data manipulation and numerical operations.

* re and string help with text processing and cleaning.

* Counter is used for counting elements (e.g., word frequencies).

* LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and XGBClassifier are machine learning models for classification tasks.

* f1_score evaluates model performance.

* tqdm provides a progress bar for loops to track execution progress.

In [None]:
import pandas as pd
import numpy as np
import re
import string
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from tqdm import tqdm

**Load and Prepare Data :**

This part of the code loads the training, validation, and test datasets from three separate CSV files using pandas.read_csv(). From each DataFrame it extracts the text column, converts the values to strings, and stores them as a list; it extracts the label column as a list in the same way. The result is train_texts and train_labels for model training, valid_texts and valid_labels for model validation, and test_texts and test_labels for final model evaluation. Note that astype(str) turns a missing text value into the literal string "nan" rather than dropping the row.

In [None]:
# 1. Load Data
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df  = pd.read_csv("test.csv")
train_texts = train_df['text'].astype(str).tolist()
train_labels = train_df['label'].tolist()
valid_texts = valid_df['text'].astype(str).tolist()
valid_labels = valid_df['label'].tolist()
test_texts  = test_df['text'].astype(str).tolist()
test_labels = test_df['label'].tolist()
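
Since the evaluation later relies on macro-averaged F1 because of class imbalance, it can help to look at the label distribution right after loading. The following is a minimal sketch using toy labels; label_shares is a hypothetical helper, and in the notebook train_labels would be passed instead of toy_labels.

```python
from collections import Counter

def label_shares(labels):
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lab: (n, n / total) for lab, n in counts.most_common()}

# Toy labels for illustration only; in the notebook, pass train_labels instead
toy_labels = [1, 1, 1, 2, 2, 3, 4, 1]
for lab, (n, share) in label_shares(toy_labels).items():
    print(f"label {lab}: {n} samples ({share:.1%})")
```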

**Text Preprocessing (Lowercasing and Punctuation Removal) :**

This step cleans the raw text so that it is consistent before being used in a machine learning model. The preprocess() function takes a text string and applies three transformations: (1) it converts all characters to lowercase, so that words like “Good” and “good” are treated the same; (2) it replaces every ASCII punctuation character in string.punctuation with a space, so punctuation never becomes part of a token (a side effect is that hyphenated terms and values such as 120/80 are split into separate tokens); and (3) it collapses runs of whitespace into single spaces and trims the ends. The function is then applied to every text sample in the training, validation, and test datasets. This reduces noise and standardizes the input, so the models are not distracted by differences in case or punctuation.

In [5]:
# 2. Preprocessing (lowercase + remove punctuation)
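def preprocess(text):
    # Step 1: lowercase everything
    text = text.lower()
    # Step 2: replace each ASCII punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Step 3: collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text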
train_texts = [preprocess(t) for t in train_texts]
valid_texts = [preprocess(t) for t in valid_texts]
test_texts  = [preprocess(t) for t in test_texts]
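
A short worked example shows what the three steps do to a single string. The function is repeated so the snippet runs on its own; the sample sentence is made up for illustration.

```python
import re
import string

def preprocess(text):
    # Same three steps as the notebook cell: lowercase, punctuation to spaces, squeeze whitespace
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Pt. was seen in Follow-Up;  BP 120/80!"
print(preprocess(sample))  # pt was seen in follow up bp 120 80
```

Because string.punctuation contains only ASCII characters, typographic quotes and similar symbols pass through unchanged.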

**Building a Top-10,000 Word Vocabulary from Training Texts :**

This code snippet is responsible for constructing a vocabulary of the most frequent words from the training dataset, which is a critical step in text vectorization for the medical transcript classification task. The goal here is to identify the top 10,000 words that appear most often in the training texts and assign each word a unique ID, along with recording its frequency.

The process begins by initializing a Counter object from Python’s collections module. This object, word_counts, is used to tally the occurrences of every word across all training texts. The code loops over each training text in train_texts and splits it into words using whitespace as the delimiter. For each text, word_counts.update() updates the counts for each word, ensuring that the frequency of each word is accurately accumulated across the entire training set.

After counting all words, the variable TOP_K is set to 10,000 to limit the vocabulary to the most frequent 10,000 words. The most_common(TOP_K) function is used to retrieve these words in descending order of frequency. These words are stored in the list vocab, which now represents the fixed-size vocabulary.

Next, the code creates a mapping from words to unique integer IDs using a dictionary comprehension: word2id = {word: i for i, word in enumerate(vocab)}. Here, each word in the top-10,000 vocabulary is assigned an ID starting from 0. This mapping will later be used to convert texts into numerical vectors, either for binary bag-of-words or frequency bag-of-words representations.

To persist the vocabulary, the code writes the word, its ID, and its frequency into a file named vocab.txt. Each line in this file corresponds to one word and follows the format: word ID frequency. Because the vocabulary is built from the training texts only, reusing it to vectorize the validation and test sets keeps the features consistent without leaking information from those sets.

Finally, the code prints the first ten words from the vocabulary along with their IDs and frequencies. The output confirms that common English words like “the,” “and,” and “was” dominate the top of the list, which aligns with typical word distributions in English medical texts. For example, the word “the” has an ID of 0 and occurs 118,887 times in the training dataset.

In [6]:
# 3. Build Vocabulary (Top 10,000 words from TRAIN)
word_counts = Counter()
for t in train_texts:
    word_counts.update(t.split())
TOP_K = 10000
vocab = [word for word, _ in word_counts.most_common(TOP_K)]
word2id = {word: i for i, word in enumerate(vocab)}  # ids start from 0
with open("vocab.txt", "w", encoding="utf-8") as f:
    for i, word in enumerate(vocab):
        f.write(f"{word} {i} {word_counts[word]}\n")
for i, word in enumerate(vocab[:10]):
    print(f"{word} {word2id[word]} {word_counts[word]}")

the 0 118887
and 1 66917
was 2 56124
of 3 48447
to 4 41003
a 5 34316
with 6 28462
in 7 26243
is 8 21651
patient 9 19289
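
The same procedure can be traced on a toy corpus. The four sentences and TOP_K = 3 below are illustrative assumptions, not the notebook's data.

```python
from collections import Counter

toy_train = [
    "the patient was stable",
    "the patient was discharged",
    "the patient rested",
    "the wound",
]

counts = Counter()
for t in toy_train:
    counts.update(t.split())

TOP_K = 3  # the notebook uses 10,000
vocab = [w for w, _ in counts.most_common(TOP_K)]
word2id = {w: i for i, w in enumerate(vocab)}
print(word2id)  # {'the': 0, 'patient': 1, 'was': 2}

# Each vocab.txt line has the form "word id frequency"; parse one back into its parts
line = f"{vocab[0]} {word2id[vocab[0]]} {counts[vocab[0]]}"
word, idx, freq = line.split()
print(word, int(idx), int(freq))  # the 0 4
```

Words outside the top TOP_K ("stable", "wound", and so on) receive no ID and are treated as out-of-vocabulary in later steps.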


**Converting Texts to Word ID Representations :**

This code converts raw texts into numerical word ID sequences using the previously built vocabulary, making them ready for machine learning models. The convert_to_ids function takes texts, labels, and a filename, then iterates over each text-label pair. Words present in the vocabulary are replaced with their IDs, out-of-vocabulary words are ignored, and the class label is appended at the end. Each processed line is written to a file (train_ids.txt, valid_ids.txt, test_ids.txt) and stored in memory.

The function is applied to the training, validation, and test sets, ensuring consistent preprocessing. Printing the first five lines of the training set confirms that texts are correctly converted into sequences of IDs with labels, ready for classification.

In [7]:
# 4. Save Train/Valid/Test with Word IDs
def convert_to_ids(texts, labels, filename):
    lines = []
    with open(filename, "w", encoding="utf-8") as f:
        for text, label in zip(texts, labels):
            ids = [str(word2id[w]) for w in text.split() if w in word2id]
            line = " ".join(ids) + f" {label}\n"
            f.write(line)
            lines.append(line.strip())
    return lines
train_ids = convert_to_ids(train_texts, train_labels, "train_ids.txt")
valid_ids = convert_to_ids(valid_texts, valid_labels, "valid_ids.txt")
test_ids  = convert_to_ids(test_texts, test_labels, "test_ids.txt")
for line in train_ids[:5]:
    print(line)

26 248 542 27 157 424 232 2588 2253 3912 5154 26 157 21 364 1009 55 33 778 450 36 391 9548 1777 46 33 19 897 40 33 1034 3 0 1829 1 1089 4991 83 33 21 897 1 21 364 778 450 1945 27 29 8 27 4 26 424 1278 1079 163 55 10 424 232 26 157 1829 1278 6 315 157 1060 7 19 105 1381 299 1413 691 1860 1290 27 33 21 897 26 391 9548 1777 36 157 1829 1278 55 315 157 1060 7 19 105 1381 1094 26 248 542 1945 1829 1278 105 1381 232 364 105 897 1829 1278 2
137 205 1513 1 1214 2924 123 205 2589 1 1214 2616 34 7931 116 2778 506 3 34 0 2505 3913 2 1095 68 0 892 1 566 618 87 0 867 679 1 1769 4 0 796 399 3 0 1769 0 4992 2 33 1 0 104 1214 2749 2 247 0 1214 2639 2 33 6 0 1752 1045 1913 17 1898 117 44 0 7330 17 16 244 29 14 472 1643 5261 1 5 3284 3 17 1898 117 816 16 29 2 5 148 2240 463 6 0 5478 57 17 2489 117 44 0 7330 0 591 188 0 463 2 33 0 1012 1046 2 33 6 33 591 316 0 3914 2 716 3674 3 0 774 68 0 1769 47 2 33 87 0 796 399 260 3345 3 0 774 520 1270 14 129 3 0 6859 4512 0 2550 3 5 148 2240 463 51 2 1452 4 1678 208
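
To read such a file back, each line can be split into its word IDs and its trailing label. This is a minimal sketch that assumes the label contains no whitespace; parse_id_line is a hypothetical helper name.

```python
def parse_id_line(line):
    """Split one "<id> <id> ... <label>" line into (list of int ids, label string)."""
    *ids, label = line.split()
    return [int(i) for i in ids], label

ids, label = parse_id_line("26 248 542 27 157 2")
print(ids, label)  # [26, 248, 542, 27, 157] 2

# A document with no in-vocabulary words yields a line that holds only the label
print(parse_id_line("3"))  # ([], '3')
```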

**Downloading Vocabulary and Dataset Files :**

This code allows users to download the processed files from the Colab environment to their local machine. By importing files from google.colab, each files.download() call triggers a browser download for the specified file.

The files downloaded are:

* vocab.txt – top 10,000 words with IDs and frequencies.

* train_ids.txt – training texts converted to word IDs with labels.

* valid_ids.txt – validation texts in ID format.

* test_ids.txt – test texts in ID format.

This ensures that all prepared data and vocabulary can be saved locally for further use or inspection.

In [14]:
from google.colab import files
files.download("vocab.txt")
files.download("train_ids.txt")
files.download("valid_ids.txt")
files.download("test_ids.txt")
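
As an alternative to four separate browser downloads, the files could be bundled into one archive first. The sketch below uses only the standard library; bundle and the archive name processed_data.zip are hypothetical choices.

```python
import zipfile
from pathlib import Path

def bundle(paths, archive_name="processed_data.zip"):
    """Write the given files into one zip archive and return the archive name."""
    with zipfile.ZipFile(archive_name, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            # Store each file under its base name, without directory components
            zf.write(p, arcname=Path(p).name)
    return archive_name

# In Colab, bundle([...]) could be followed by a single files.download("processed_data.zip")
```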


**Vectorizing Texts into Binary and Frequency Bag-of-Words :**

This code defines two functions for converting medical transcripts into numerical vectors, which are required for training machine learning classifiers. The first function, vectorize_BBoW, implements a Binary Bag-of-Words (BBoW) representation. It initializes a zero matrix of shape (number of texts × vocabulary size) and iterates through each text. For every unique word in the text, it checks if the word exists in the vocabulary (word2id). If it does, the corresponding column in the matrix is set to 1, indicating the presence of that word. This results in a binary vector where each dimension represents whether a vocabulary word appears in the text.

The second function, vectorize_FBoW, implements a Frequency Bag-of-Words (FBoW) representation. It also initializes a zero matrix, but the entries are floats. For each text, it counts the occurrences of each word using Counter, then divides the count by the total number of words in the text to compute the relative frequency. If a word exists in the vocabulary, its corresponding column in the matrix is set to this normalized frequency. This results in a vector where each dimension represents how frequently a vocabulary word occurs in the text.

Both functions use tqdm to display a progress bar during vectorization, which is helpful for large datasets. vectorize_BBoW produces binary vectors that are mostly zeros (stored here as dense NumPy arrays, not in a sparse matrix format) and work well with algorithms like logistic regression, while vectorize_FBoW also records how often each word is used, which can improve classifier performance in some cases.

In [8]:
# 5. Vectorization Functions
def vectorize_BBoW(texts):
    X = np.zeros((len(texts), len(vocab)), dtype=np.uint8)
    for i, t in enumerate(tqdm(texts, desc="BBoW")):
        for word in set(t.split()):
            idx = word2id.get(word)
            if idx is not None:
                X[i, idx] = 1
    return X
def vectorize_FBoW(texts):
    X = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for i, t in enumerate(tqdm(texts, desc="FBoW")):
        words = t.split()
        total = len(words)
        if total == 0:
            continue
        counts = Counter(words)
        for word, c in counts.items():
            idx = word2id.get(word)
            if idx is not None:
                X[i, idx] = c / total
    return X
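
The difference between the two representations is easiest to see on a single row. The sketch below mirrors the logic of the two functions with plain Python lists; toy_word2id is a hypothetical three-word vocabulary and "xyz" stands for an out-of-vocabulary word.

```python
from collections import Counter

toy_word2id = {"pain": 0, "chest": 1, "fever": 2}  # hypothetical 3-word vocabulary

def bbow_row(text, word2id):
    row = [0] * len(word2id)
    for w in set(text.split()):
        if w in word2id:
            row[word2id[w]] = 1  # presence only
    return row

def fbow_row(text, word2id):
    words = text.split()
    row = [0.0] * len(word2id)
    for w, c in Counter(words).items():
        if w in word2id:
            row[word2id[w]] = c / len(words)  # denominator counts out-of-vocabulary words too
    return row

text = "pain pain chest xyz"
print(bbow_row(text, toy_word2id))  # [1, 1, 0]
print(fbow_row(text, toy_word2id))  # [0.5, 0.25, 0.0]
```

Note that the FBoW row sums to 0.75 here, not 1, because the out-of-vocabulary word still counts toward the total length.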

**Preparing Binary Bag-of-Words Matrices and Encoding Labels :**

This code prepares the Binary Bag-of-Words (BBoW) feature matrices for the training, validation, and test datasets, and encodes the class labels into numerical format for machine learning models.

First, the vectorize_BBoW function is applied to each dataset: train_texts, valid_texts, and test_texts. This converts the raw medical transcripts into fixed-length binary vectors, where each vector dimension corresponds to a word in the top 10,000 vocabulary and a value of 1 indicates the presence of that word in the text. The resulting matrices—X_train_BBoW, X_valid_BBoW, and X_test_BBoW—are ready to be used as input for classifiers.

Next, the class labels, which are originally strings or categorical values, are encoded into integers using LabelEncoder from scikit-learn. The fit_transform method is applied on train_labels to learn the mapping and convert them into numerical labels. The same mapping is then applied to the validation and test labels using transform, ensuring consistency across all datasets.

In [9]:
# 6. Prepare BBoW matrices
from sklearn.preprocessing import LabelEncoder
X_train_BBoW = vectorize_BBoW(train_texts)
X_valid_BBoW = vectorize_BBoW(valid_texts)
X_test_BBoW  = vectorize_BBoW(test_texts)
le = LabelEncoder()
train_labels_enc = le.fit_transform(train_labels)
valid_labels_enc = le.transform(valid_labels)
test_labels_enc  = le.transform(test_labels)

BBoW: 100%|██████████| 4000/4000 [00:00<00:00, 6728.36it/s]
BBoW: 100%|██████████| 499/499 [00:00<00:00, 6912.47it/s]
BBoW: 100%|██████████| 500/500 [00:00<00:00, 7149.27it/s]
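
The label-encoding step can be pictured as a sorted lookup table. The sketch below is a plain-Python approximation of that idea, not scikit-learn's implementation; fit_label_map and encode are hypothetical names, and the labels are toy values.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer, following sorted order."""
    return {lab: i for i, lab in enumerate(sorted(set(labels)))}

def encode(labels, label_map):
    # Raises KeyError for a label that was not present when the map was fitted
    return [label_map[lab] for lab in labels]

label_map = fit_label_map([3, 1, 4, 1, 2])
print(label_map)                     # {1: 0, 2: 1, 3: 2, 4: 3}
print(encode([4, 4, 1], label_map))  # [3, 3, 0]
```

Fitting on the training labels only, as the notebook does, means a label that appears solely in the validation or test set triggers an error instead of silently receiving a new ID (LabelEncoder raises a ValueError in that situation; this sketch raises a KeyError).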


**Training and Evaluating Classifiers with Binary Bag-of-Words :**

This code trains and evaluates four machine learning classifiers (Logistic Regression, Decision Tree, Random Forest, and XGBoost) using the Binary Bag-of-Words (BBoW) representation of medical transcripts. The goal is to assess each model’s ability to classify the vectorized transcripts into Surgery, Medical Records, Internal Medicine, or Other.

The classifiers are stored in a dictionary, models, with fixed hyperparameters. Logistic Regression uses C=0.3, class_weight="balanced", and the liblinear solver to manage regularization and class imbalance. The Decision Tree is limited to a depth of 20 with at least 5 samples per leaf. Random Forest uses 200 trees with max_depth=20, min_samples_leaf=5, and max_features="sqrt" for generalization. XGBoost is set with 200 boosting rounds, a max depth of 6, and eval_metric='mlogloss', which selects multi-class log loss as the evaluation metric; the training objective itself is left at its default.

The code trains each model on X_train_BBoW and predicts labels for training, validation, and test sets. Performance is measured using the macro-averaged F1-score, which treats all classes equally, making it suitable for imbalanced datasets. The printed scores allow comparison across models and help detect overfitting or underfitting.
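
To make the macro average concrete, the sketch below computes it by hand on toy predictions and compares it with accuracy. It is a simplified illustration of the definition, not a replacement for f1_score; the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f}  macro-F1={macro_f1(y_true, y_pred):.4f}")
# accuracy=0.8333  macro-F1=0.7778
```

The minority class (F1 of 2/3) pulls the macro average below accuracy, which is why this metric penalizes a model that neglects small classes.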

**Classifier Performance and F1-Score Analysis :**

The F1-scores reveal distinct patterns of performance for each classifier.

Logistic Regression achieves a high training F1 of 0.9076 but lower validation and test scores of 0.6879 and 0.7268, indicating some overfitting. The regularization parameter C=0.3 helps control overfitting, with smaller values providing stronger regularization.

The Decision Tree shows a slightly lower training F1 of 0.8627 but improved validation and test scores of 0.7319 and 0.7530, suggesting that limiting the tree’s depth to 20 and setting a minimum of 5 samples per leaf effectively controls complexity and reduces overfitting.

In contrast, the Random Forest performs poorly, with a training F1 of 0.6979 and validation/test scores around 0.50. Despite being an ensemble method, the combination of restricted depth, minimum samples per leaf, and max_features="sqrt" appears to have caused underfitting, limiting its ability to capture patterns in the data.

XGBoost, on the other hand, achieves the best validation and test performance, with scores of 0.7697 and 0.7925, while maintaining a high training F1 of 0.9079. Its hyperparameters, including 200 estimators and a max depth of 6, provide a balance between model complexity and generalization, and the gradient boosting mechanism allows it to capture complex patterns more effectively than single decision trees or the Random Forest in this dataset.

**Role of Hyperparameters :**

* Max Depth (max_depth): Controls the complexity of tree-based models; deeper trees can overfit, shallow trees may underfit.

* Min Samples Leaf (min_samples_leaf): Prevents splits that create very small leaf nodes, reducing overfitting.

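* Number of Estimators (n_estimators): More trees usually stabilize a Random Forest at the cost of more computation; in boosting, additional rounds can also increase overfitting.

* Regularization (C in logistic regression): C is the inverse of the regularization strength, so smaller values penalize large coefficients more strongly and reduce overfitting.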

* Max Features (max_features): Limits the number of features considered at each split in Random Forests; can improve generalization but may reduce training performance.
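
The role of C can be made concrete with the L2-regularized logistic regression objective. The formula below is the standard binary form with labels y_i in {-1, +1}, shown as an illustration; the notebook's four-class problem is handled by liblinear as a set of one-vs-rest problems of this kind.

```latex
\min_{w,\,b}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(w^{\top} x_i + b\right)\right)\right)
```

Because C multiplies the data-fit term, lowering C gives the penalty on w relatively more weight, which is why small values of C mean stronger regularization.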

In [10]:
# 7. Train & Evaluate (BBoW)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, C=0.3, class_weight="balanced", solver="liblinear"),
    "DecisionTree": DecisionTreeClassifier(max_depth=20, min_samples_leaf=5, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, max_depth=20, max_features="sqrt", min_samples_leaf=5, random_state=42, n_jobs=-1),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=6, eval_metric='mlogloss', random_state=42)}
for vec_name, X_train, X_valid, X_test in [("BBoW", X_train_BBoW, X_valid_BBoW, X_test_BBoW),]:
    print(f"\n=== {vec_name} Results ===")
    for name, model in models.items():
        model.fit(X_train, train_labels_enc)
        train_pred, valid_pred, test_pred = model.predict(X_train), model.predict(X_valid), model.predict(X_test)
        print(f"\n{name}:")
        print(f"Train F1: {f1_score(train_labels_enc, train_pred, average='macro'):.4f}")
        print(f"Valid F1: {f1_score(valid_labels_enc, valid_pred, average='macro'):.4f}")
        print(f"Test F1: {f1_score(test_labels_enc, test_pred, average='macro'):.4f}")


=== BBoW Results ===

LogisticRegression:
Train F1: 0.9076
Valid F1: 0.6879
Test F1: 0.7268

DecisionTree:
Train F1: 0.8627
Valid F1: 0.7319
Test F1: 0.7530

RandomForest:
Train F1: 0.6979
Valid F1: 0.4998
Test F1: 0.5181

XGBoost:
Train F1: 0.9079
Valid F1: 0.7697
Test F1: 0.7925
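
One quick way to read these numbers is the gap between training and validation F1 for each model. The sketch below only rearranges the scores printed above; the variable names are illustrative.

```python
bbow_f1 = {  # (train, valid, test) macro-F1 copied from the BBoW output
    "LogisticRegression": (0.9076, 0.6879, 0.7268),
    "DecisionTree": (0.8627, 0.7319, 0.7530),
    "RandomForest": (0.6979, 0.4998, 0.5181),
    "XGBoost": (0.9079, 0.7697, 0.7925),
}

gaps = {name: round(train - valid, 4) for name, (train, valid, _) in bbow_f1.items()}
for name, gap in gaps.items():
    print(f"{name}: train - valid = {gap:.4f}")
```

A large gap points toward overfitting, while a low training score points toward underfitting; the Random Forest shows both a low training score and a sizeable gap.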


**Preparing Frequency Bag-of-Words Matrices :**

This code prepares the Frequency Bag-of-Words (FBoW) feature matrices for the training, validation, and test datasets. Each text is converted into a fixed-length vector using the vectorize_FBoW function, where each vector dimension corresponds to a word in the top 10,000 vocabulary. Unlike the binary representation, each entry in these vectors represents the relative frequency of the word in the text, capturing how often a word appears rather than just its presence.

In [11]:
# 8. Prepare FBoW matrices
X_train_FBoW = vectorize_FBoW(train_texts)
X_valid_FBoW = vectorize_FBoW(valid_texts)
X_test_FBoW  = vectorize_FBoW(test_texts)

FBoW: 100%|██████████| 4000/4000 [00:00<00:00, 5230.05it/s]
FBoW: 100%|██████████| 499/499 [00:00<00:00, 5453.63it/s]
FBoW: 100%|██████████| 500/500 [00:00<00:00, 4928.53it/s]
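
Because both vectorizers allocate dense arrays, the memory cost follows directly from the matrix shape and the element size. The arithmetic below assumes the 4000 training rows shown in the progress bar and the 10,000-word vocabulary; uint8 and float32 take 1 and 4 bytes per value.

```python
n_docs, vocab_size = 4000, 10_000  # training rows (from the progress bar) x TOP_K
bytes_per_value = {"uint8 (BBoW)": 1, "float32 (FBoW)": 4}

sizes = {name: n_docs * vocab_size * nbytes for name, nbytes in bytes_per_value.items()}
for name, total in sizes.items():
    print(f"{name}: {total / 1e6:.0f} MB")
```

At this scale the dense layout is manageable, but the cost grows linearly with both the number of documents and the vocabulary size.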


**Training and Evaluating Classifiers with Frequency Bag-of-Words :**

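This code trains and evaluates the same four classifiers (Logistic Regression, Decision Tree, Random Forest, and XGBoost) using the Frequency Bag-of-Words (FBoW) representation of medical transcripts. As in the BBoW evaluation, a dictionary, models, stores each classifier with its hyperparameters. Logistic Regression (C=0.4, max_iter=2000) and the Decision Tree (max_depth=18, min_samples_leaf=8) use slightly different settings than before, while Random Forest and XGBoost keep their BBoW configuration.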

The code iterates over the FBoW datasets (X_train_FBoW, X_valid_FBoW, X_test_FBoW). For each model, it is trained on the training set using .fit() and then predicts labels for training, validation, and test sets using .predict(). Performance is evaluated using the macro-averaged F1-score, which treats all classes equally, making it suitable for imbalanced data. The F1-scores are printed to compare how well each classifier generalizes when using word frequency information instead of binary presence.

**Analysis of Results :**

The F1-scores show significant differences in performance among the classifiers.

Logistic Regression performs poorly, with a training F1 of 0.3253 and similar validation and test scores (0.3217 and 0.3178). Because the training score is as low as the others, the model is underfitting, not overfitting. The same kind of linear model reached a training F1 of 0.9076 on BBoW, so model simplicity alone is unlikely to be the cause. A likely contributing factor is feature scale: relative frequencies are very small numbers, so with C=0.4 the L2 penalty keeps the coefficients too small to separate the classes.

Decision Tree performs much better, achieving a training F1 of 0.8364 and validation/test scores of 0.7028 and 0.7306. Hyperparameters such as max_depth=18 and min_samples_leaf=8 limit tree complexity, although a gap between training and validation scores remains.

Random Forest shows lower performance, with training F1 of 0.7361 and validation/test scores around 0.46–0.47. Despite being an ensemble method, the combination of limited tree depth, minimum samples per leaf, and max_features="sqrt" seems to restrict the model, reducing its ability to capture patterns in the FBoW data.

XGBoost achieves the best performance, with a training F1 of 0.9090 and validation/test scores of 0.7619 and 0.8075. Its hyperparameters, including n_estimators=200 and max_depth=6, balance model complexity and generalization, while the gradient boosting mechanism sequentially corrects errors, allowing it to capture nuanced patterns more effectively than the other models.

**Role of Hyperparameters :**

* C (Logistic Regression): Inverse of the regularization strength; smaller values regularize more strongly. Here the model already underfits the small-valued FBoW features, so stronger regularization does not help.

* max_depth (Tree-based models): Limits tree complexity to prevent overfitting.

* min_samples_leaf (Tree-based models): Ensures sufficient data in leaves for better generalization.

* n_estimators (Random Forest/XGBoost): Number of trees; more trees usually help up to a point but increase computation.

* max_features (Random Forest): Controls feature sampling for splits, balancing diversity and bias.
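
One possible reason the logistic regression scores are so low with FBoW is feature scale. The arithmetic below is a rough illustration under assumed numbers (a 500-token transcript and a word that occurs 3 times); it is not a measurement from the dataset.

```python
doc_length = 500   # hypothetical transcript length in tokens
occurrences = 3    # hypothetical count of one vocabulary word

bbow_value = 1
fbow_value = occurrences / doc_length  # 0.006

# Coefficient needed for the same contribution w * x to the decision score
target = 1.0
w_bbow = target / bbow_value
w_fbow = target / fbow_value
penalty_ratio = (w_fbow ** 2) / (w_bbow ** 2)  # ratio of the L2 penalty terms w^2

print(f"FBoW value: {fbow_value}")
print(f"coefficient ratio: {w_fbow / w_bbow:.1f}")
print(f"L2 penalty ratio: {penalty_ratio:.0f}")
```

Under these assumptions the FBoW coefficient would have to be roughly 167 times larger, which an L2 penalty discourages far more strongly; tree-based models split on thresholds and are not affected by this rescaling.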

In [13]:
# 9. Train & Evaluate (FBoW)
models = {
    "LogisticRegression": LogisticRegression(max_iter=2000, C=0.4, class_weight="balanced", solver="liblinear", penalty="l2"),
    "DecisionTree": DecisionTreeClassifier(max_depth=18, min_samples_leaf=8, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, max_depth=20, max_features="sqrt", min_samples_leaf=5, random_state=42, n_jobs=-1),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=6, eval_metric='mlogloss', random_state=42)}
for vec_name, X_train, X_valid, X_test in [("FBoW", X_train_FBoW, X_valid_FBoW, X_test_FBoW),]:
    print(f"\n=== {vec_name} Results ===")
    for name, model in models.items():
        model.fit(X_train, train_labels_enc)
        train_pred, valid_pred, test_pred = model.predict(X_train), model.predict(X_valid), model.predict(X_test)
        print(f"\n{name}:")
        print(f"Train F1: {f1_score(train_labels_enc, train_pred, average='macro'):.4f}")
        print(f"Valid F1: {f1_score(valid_labels_enc, valid_pred, average='macro'):.4f}")
        print(f"Test F1: {f1_score(test_labels_enc, test_pred, average='macro'):.4f}")


=== FBoW Results ===

LogisticRegression:
Train F1: 0.3253
Valid F1: 0.3217
Test F1: 0.3178

DecisionTree:
Train F1: 0.8364
Valid F1: 0.7028
Test F1: 0.7306

RandomForest:
Train F1: 0.7361
Valid F1: 0.4639
Test F1: 0.4735

XGBoost:
Train F1: 0.9090
Valid F1: 0.7619
Test F1: 0.8075
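
To compare the two representations directly, the reported validation and test scores can be subtracted model by model. The sketch below uses only the numbers printed in the two result cells; a positive value means FBoW scored higher.

```python
# (valid, test) macro-F1 copied from the BBoW and FBoW outputs
bbow = {"LogisticRegression": (0.6879, 0.7268), "DecisionTree": (0.7319, 0.7530),
        "RandomForest": (0.4998, 0.5181), "XGBoost": (0.7697, 0.7925)}
fbow = {"LogisticRegression": (0.3217, 0.3178), "DecisionTree": (0.7028, 0.7306),
        "RandomForest": (0.4639, 0.4735), "XGBoost": (0.7619, 0.8075)}

delta = {m: (round(fbow[m][0] - bbow[m][0], 4), round(fbow[m][1] - bbow[m][1], 4))
         for m in bbow}
for m, (dv, dt) in delta.items():
    print(f"{m:<20} valid {dv:+.4f}   test {dt:+.4f}")
```

Only the XGBoost test score comes out positive; every other difference favors BBoW.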


**Conclusion: Comparison of BBoW and FBoW**

Based on the F1-scores, the Binary Bag-of-Words (BBoW) and Frequency Bag-of-Words (FBoW) representations behave differently across classifiers. For Logistic Regression, BBoW clearly outperforms FBoW, achieving much higher validation and test scores (Valid F1: 0.6879 vs. 0.3217, Test F1: 0.7268 vs. 0.3178). With the settings used here, binary word presence is therefore far more useful to the linear model than unscaled relative frequency.

For tree-based models the differences are subtler. The Decision Tree and Random Forest are slightly worse with FBoW (Test F1: 0.7306 vs. 0.7530 and 0.4735 vs. 0.5181, respectively). XGBoost achieves the highest test F1 of all runs with FBoW (0.8075, compared to 0.7925 with BBoW), but its validation F1 is marginally lower (0.7619 vs. 0.7697), so the advantage of frequency information for this model is small and not consistent across both evaluation sets.

**Overall**, BBoW performs as well as or better than FBoW for Logistic Regression, Decision Tree, and Random Forest, while FBoW gives XGBoost a slight improvement on the test set only. This suggests that the choice of representation should consider both the type of classifier and the dataset characteristics: BBoW is the safer default here, and FBoW is worth trying with models such as XGBoost that can exploit frequency information.