# CS 475/675 Machine Learning: Project
## Goals:
### 4.1 Must accomplish
- Implement a robust data preprocessing pipeline to handle tokenization, feature extraction, and label encoding.
- Develop and train a machine learning model capable of accurately detecting PII types in student essays, achieving a competitive score on the evaluation metric.
- Generate predictions for the test set essays and submit them in the required format for evaluation.

### 4.2 Expect to accomplish
- Fine-tune the model architecture and hyperparameters to optimize performance on the provided training data.
- Conduct error analysis and model interpretation to identify common misclassifications and areas for improvement.
- Investigate the use of external datasets or pre-trained language models to enhance the model’s generalization capabilities.

### 4.3 Would like to accomplish
- Implement ensemble learning techniques, such as model averaging or stacking, to combine multiple base models and further boost detection accuracy and robustness.
- Investigate methods for handling imbalanced class distributions, particularly for rare PII types.
- Develop visualization tools and techniques to facilitate the interpretation of model predictions.


# BiLSTM

In [1]:
!pip install wurlitzer

Collecting wurlitzer
  Downloading wurlitzer-3.1.0-py3-none-any.whl.metadata (2.5 kB)
Downloading wurlitzer-3.1.0-py3-none-any.whl (8.4 kB)
Installing collected packages: wurlitzer
Successfully installed wurlitzer-3.1.0


In [2]:
!pip install gensim



In [3]:
!pip install keras tensorflow matplotlib

Collecting keras
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.0.5
    Uninstalling keras-3.0.5:
      Successfully uninstalled keras-3.0.5
Successfully installed keras-2.15.0


In [4]:
!pip install git+https://www.github.com/keras-team/keras-contrib.git

Collecting git+https://www.github.com/keras-team/keras-contrib.git
  Cloning https://www.github.com/keras-team/keras-contrib.git to /tmp/pip-req-build-sm_1cz5z
  Running command git clone --filter=blob:none --quiet https://www.github.com/keras-team/keras-contrib.git /tmp/pip-req-build-sm_1cz5z
  Resolved https://www.github.com/keras-team/keras-contrib.git to commit 3fc5ef709e061416f4bc8a92ca3750c824b5d2b0
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: keras_contrib
  Building wheel for keras_contrib (setup.py) ... done
  Created wheel for keras_contrib: filename=keras_contrib-2.0.8-py3-none-any.whl size=101060 sha256=3077b9cfb85f2d4b2c4921d446c2bbb02ad5b59cc231563d692cea2af9d43bda
  Stored in directory: /tmp/pip-ephem-wheel-cache-qfl2rel2/wheels/74/d5/f7/0245af7ac33d5b0c2e095688649916e4bf9a8d6b3362a849f5
Successfully built keras_contrib
Installing collected packages: keras_contrib
Successfully installed keras_contrib-2.0.8


In [15]:
import json
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras.callbacks import ModelCheckpoint


## Data Loading and Preprocessing

In [8]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
with open("/kaggle/input/pii-detection-removal-from-educational-data/train.json", "r") as file:
    data = json.load(file)

# Data extraction: Keeping tokens and labels grouped by documents
documents = [{'tokens': entry['tokens'], 'labels': entry['labels']} for entry in data]

# Split data into training and validation sets
train_docs, val_docs = train_test_split(documents, test_size=0.2, random_state=42)

In [9]:
def extract_tokens_and_labels(docs):
    tokens = [doc['tokens'] for doc in docs]
    labels = [doc['labels'] for doc in docs]
    return tokens, labels

train_tokens, train_labels = extract_tokens_and_labels(train_docs)
val_tokens, val_labels = extract_tokens_and_labels(val_docs)

# Create label and token index mappings
label2idx = {
    'O': 0, 'B-NAME_STUDENT': 1, 'I-NAME_STUDENT': 2, 'B-EMAIL': 3, 'I-EMAIL': 4,
    'B-USERNAME': 5, 'I-USERNAME': 6, 'B-ID_NUM': 7, 'I-ID_NUM': 8,
    'B-PHONE_NUM': 9, 'I-PHONE_NUM': 10, 'B-URL_PERSONAL': 11, 'I-URL_PERSONAL': 12,
    'B-STREET_ADDRESS': 13, 'I-STREET_ADDRESS': 14
}
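# Index 0 is reserved for padding: pad_sequences pads with 0 and the Embedding layer uses mask_zero=True,
# so a real token must not share that index. Sorting keeps the mapping reproducible across runs.
vocab_tokens = sorted({token for doc in train_tokens + val_tokens for token in doc} - {'<PAD>'})
token2idx = {token: idx for idx, token in enumerate(['<PAD>'] + vocab_tokens)}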
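
The mapping above is built from both training and validation tokens, so it has no id for words that first appear at test time. The following is a minimal sketch of one alternative, assuming the vocabulary is built from training documents only and that two ids are reserved: one for padding and one for unknown tokens. The names `PAD_TOKEN`, `UNK_TOKEN`, `build_vocab`, and `encode` are hypothetical and not part of the notebook.

```python
from collections import Counter

PAD_TOKEN = "<PAD>"  # hypothetical name; id 0 matches the default pad value of pad_sequences
UNK_TOKEN = "<UNK>"  # hypothetical name; stands in for tokens not seen when building the vocabulary


def build_vocab(token_docs, min_count=1):
    """Map tokens to integer ids, reserving 0 for padding and 1 for unknown tokens."""
    counts = Counter(token for doc in token_docs for token in doc)
    vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    # Sorted iteration gives the same ids on every run (set ordering is not stable).
    for token in sorted(counts):
        if counts[token] >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(doc, vocab):
    """Convert a list of tokens to ids, sending unseen tokens to the UNK id."""
    unk = vocab[UNK_TOKEN]
    return [vocab.get(token, unk) for token in doc]


demo_docs = [["My", "name", "is", "Ada"], ["My", "email", "is", "ada@example.com"]]
demo_vocab = build_vocab(demo_docs)
demo_ids = encode(["My", "name", "is", "Grace"], demo_vocab)  # "Grace" is unseen
```

With a mapping like this, the `Embedding` layer's input dimension would be `len(vocab)`, and the same `encode` call could be reused unchanged on validation and test essays.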

## Feature Extraction

In [10]:
# Convert tokens and labels to integer indices
train_tokens_idx = [[token2idx.get(token, 0) for token in doc] for doc in train_tokens]
val_tokens_idx = [[token2idx.get(token, 0) for token in doc] for doc in val_tokens]
train_labels_idx = [[label2idx[label] for label in labels] for labels in train_labels]
val_labels_idx = [[label2idx[label] for label in labels] for labels in val_labels]

# Pad token and label sequences
max_len = max(len(seq) for seq in train_tokens_idx + val_tokens_idx)
train_tokens_padded = pad_sequences(train_tokens_idx, maxlen=max_len, padding='post')
val_tokens_padded = pad_sequences(val_tokens_idx, maxlen=max_len, padding='post')
train_labels_padded = pad_sequences(train_labels_idx, maxlen=max_len, padding='post', value=label2idx['O'])
val_labels_padded = pad_sequences(val_labels_idx, maxlen=max_len, padding='post', value=label2idx['O'])

# Convert labels to one-hot encoding
num_labels = len(label2idx)
train_labels_onehot = to_categorical(train_labels_padded, num_classes=num_labels)
val_labels_onehot = to_categorical(val_labels_padded, num_classes=num_labels)
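
Because `max_len` is the length of the longest essay, every document is padded to that width. One common alternative is to cut each document into overlapping fixed-length windows before padding. The sketch below assumes that approach is acceptable for this task; `split_into_windows`, `window`, and `stride` are hypothetical names, and the default sizes are illustrative rather than tuned.

```python
def split_into_windows(token_ids, label_ids, window=512, stride=384):
    """Split one document into overlapping fixed-length windows.

    Returns a list of (start_offset, token_window, label_window) tuples.
    The start offset lets predictions be mapped back to document positions.
    """
    if len(token_ids) != len(label_ids):
        raise ValueError("tokens and labels must have the same length")
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    windows = []
    start = 0
    while True:
        end = start + window
        windows.append((start, token_ids[start:end], label_ids[start:end]))
        if end >= len(token_ids):
            break  # the last window reached the end of the document
        start += stride
    return windows


demo_tokens = list(range(1, 11))  # ten fake token ids
demo_labels = [0] * 10            # all 'O'
demo_windows = split_into_windows(demo_tokens, demo_labels, window=4, stride=3)
```

The resulting windows could then be padded to `window` instead of the global `max_len`. Tokens covered by two windows receive two predictions, so some merging rule (for example, keeping the first window's prediction) would be needed at inference time.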


## Model Training

In [24]:
# Model architecture
embedding_dim = 100
lstm_units = 64
dropout_rate = 0.5

input_layer = tf.keras.layers.Input(shape=(max_len,))
embedding_layer = tf.keras.layers.Embedding(len(token2idx), embedding_dim, mask_zero=True)(input_layer)
bilstm_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units, return_sequences=True))(embedding_layer)
dropout_layer = tf.keras.layers.Dropout(dropout_rate)(bilstm_layer)
dense_layer = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_labels, activation='softmax'))(dropout_layer)

model = tf.keras.Model(inputs=input_layer, outputs=dense_layer)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
epochs = 10
batch_size = 32
model.fit(train_tokens_padded, train_labels_onehot,
          validation_data=(val_tokens_padded, val_labels_onehot),
          epochs=epochs, batch_size=batch_size,
          callbacks=[ModelCheckpoint('best_model.h5.keras', save_best_only=True, monitor='val_accuracy', mode='max')])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7e295ed67a90>
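
The loss above treats every token equally, while almost all tokens carry the `O` label. A minimal sketch of one way to counter this, assuming inverse-frequency weighting is appropriate, computes a weight per label and expands it to a per-token weight matrix with zeros at padded positions. `class_weights_from_labels`, `token_weight_rows`, and the cap `max_weight` are hypothetical and not part of the notebook.

```python
from collections import Counter


def class_weights_from_labels(label_docs, max_weight=50.0):
    """Inverse-frequency weight per label, capped at max_weight (an arbitrary cap)."""
    counts = Counter(label for doc in label_docs for label in doc)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: min(max_weight, total / (n_classes * count))
            for label, count in counts.items()}


def token_weight_rows(label_docs, weights, max_len):
    """Per-token weights, right-padded with 0.0 so padded positions add no loss."""
    rows = []
    for doc in label_docs:
        row = [weights[label] for label in doc[:max_len]]
        rows.append(row + [0.0] * (max_len - len(row)))
    return rows


demo_label_docs = [["O", "O", "O", "B-NAME_STUDENT"], ["O", "O"]]
demo_weights = class_weights_from_labels(demo_label_docs)
demo_rows = token_weight_rows(demo_label_docs, demo_weights, max_len=4)
```

A matrix like `demo_rows`, converted to a NumPy array, could be tried as the `sample_weight` argument of `model.fit`. Whether the installed Keras version needs additional configuration for per-timestep weights is an assumption to verify before relying on it.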

## Evaluation

In [25]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

# Predict on validation data
predictions = model.predict(val_tokens_padded)
predicted_labels = np.argmax(predictions, axis=-1)  # Convert probabilities to class labels

# Flatten the predictions and true labels for evaluation
y_pred_flat = predicted_labels.flatten()
y_val_flat = val_labels_padded.flatten()

# Mapping index to label for better readability in reports
idx2label = {v: k for k, v in label2idx.items()}

# Convert indices to labels
y_pred_labels = [idx2label[idx] for idx in y_pred_flat]
y_true_labels = [idx2label[idx] for idx in y_val_flat]

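# Generate a classification report.
# Note: zero_division=1 prints 1.00 whenever a metric's denominator is zero, so the 1.00 values on rows
# with no predictions or zero support are placeholders, not real scores.
print(classification_report(y_true_labels, y_pred_labels, labels=list(label2idx.keys()), target_names=list(label2idx.keys()), zero_division=1))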

non_o_labels = [label for label in label2idx if label != 'O']
non_o_indices = [label2idx[label] for label in non_o_labels]

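# Keep only positions whose true label is not 'O'.
# Note: this also discards false positives on true-'O' tokens, so the precision below is optimistic.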
non_o_true_labels = [label for label in y_true_labels if label in non_o_labels]
non_o_pred_labels = [y_pred_labels[i] for i, label in enumerate(y_true_labels) if label in non_o_labels]

precision = precision_score(non_o_true_labels, non_o_pred_labels, labels=non_o_labels, average='weighted', zero_division=1)
recall = recall_score(non_o_true_labels, non_o_pred_labels, labels=non_o_labels, average='weighted', zero_division=1)
f1 = f1_score(non_o_true_labels, non_o_pred_labels, labels=non_o_labels, average='weighted', zero_division=1)

print(f"Precision for Non-'O' labels: {precision}")
print(f"Recall for Non-'O' labels: {recall}")
print(f"F1-score for Non-'O' labels: {f1}")


                  precision    recall  f1-score   support

               O       1.00      1.00      1.00   4491312
  B-NAME_STUDENT       0.41      0.19      0.26       263
  I-NAME_STUDENT       0.57      0.18      0.27       244
         B-EMAIL       1.00      0.00      0.00         3
         I-EMAIL       1.00      1.00      1.00         0
      B-USERNAME       1.00      1.00      1.00         0
      I-USERNAME       1.00      1.00      1.00         0
        B-ID_NUM       1.00      0.00      0.00        10
        I-ID_NUM       1.00      1.00      1.00         0
     B-PHONE_NUM       1.00      0.00      0.00         2
     I-PHONE_NUM       1.00      0.00      0.00         3
  B-URL_PERSONAL       1.00      0.00      0.00        28
  I-URL_PERSONAL       1.00      1.00      1.00         0
B-STREET_ADDRESS       1.00      0.00      0.00         1
I-STREET_ADDRESS       1.00      0.00      0.00        10

       micro avg       1.00      1.00      1.00   4491876
       macro
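
The flattened arrays used for the report include padded positions, all of which carry the `O` index, so the `O` support counts padding as well as real tokens. A minimal sketch of removing padding before scoring, assuming the original document lengths are still available (for example `[len(doc) for doc in val_tokens]`); `strip_padding` is a hypothetical helper.

```python
def strip_padding(padded_rows, lengths):
    """Keep only the first `length` entries of each post-padded row, then flatten."""
    flat = []
    for row, length in zip(padded_rows, lengths):
        flat.extend(row[:length])
    return flat


# Toy example: the first document has 3 real tokens, the second has 2.
demo_true = [[1, 2, 0, 0], [0, 0, 0, 0]]
demo_pred = [[1, 0, 0, 0], [0, 1, 0, 0]]
demo_lengths = [3, 2]

true_flat = strip_padding(demo_true, demo_lengths)
pred_flat = strip_padding(demo_pred, demo_lengths)
```

The same helper works on the rows of a 2-D NumPy array, so it could be applied to `val_labels_padded` and `predicted_labels` before building the label lists passed to `classification_report`.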

In [26]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, fbeta_score

beta = 5
f_beta = fbeta_score(non_o_true_labels, non_o_pred_labels, labels=non_o_labels, average='weighted', beta=beta, zero_division=1)

print(f"F-beta score for Non-'O' labels (beta={beta}): {f_beta}")

F-beta score for Non-'O' labels (beta=5): 0.17142525033402833
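
The score above is a support-weighted average computed only over positions whose true label is not `O`. If the target metric is instead a micro-averaged F-beta over all PII predictions (an assumption here, not something stated in this notebook), it can be computed directly from true-positive, false-positive, and false-negative counts, which also penalizes PII labels predicted on `O` tokens. `micro_fbeta` is a hypothetical helper.

```python
def micro_fbeta(y_true, y_pred, beta=5.0, outside="O"):
    """Micro-averaged F-beta over all labels except the outside label.

    Returns (score, tp, fp, fn). With beta > 1, recall weighs more than precision.
    """
    tp = fp = fn = 0
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label != outside and pred_label == true_label:
            tp += 1
        else:
            if pred_label != outside:
                fp += 1  # predicted a PII label that is spurious or of the wrong type
            if true_label != outside:
                fn += 1  # a true PII token was missed or given the wrong type
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    score = (1 + b2) * tp / denominator if denominator else 0.0
    return score, tp, fp, fn


demo_true = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "B-EMAIL"]
demo_pred = ["B-NAME_STUDENT", "O", "B-NAME_STUDENT", "O", "B-EMAIL"]
demo_score, demo_tp, demo_fp, demo_fn = micro_fbeta(demo_true, demo_pred, beta=5.0)
```

In the toy example there are two correct PII tokens, one missed token, and one spurious prediction, so the score is 52/78, roughly 0.667.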


## Error Analysis and Explainability

In [29]:
y_pred_flat = np.argmax(predictions, axis=-1).flatten()  # predictions is the output of model.predict
y_val_flat = val_labels_padded.flatten()  # padded validation labels from the preprocessing step

# Convert numeric labels back to string labels
y_pred_labels_flat = [idx2label[idx] for idx in y_pred_flat]
y_true_labels_flat = [idx2label[idx] for idx in y_val_flat]

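# Extract the raw tokens and labels for the validation set
tokens_val = [doc['tokens'] for doc in val_docs]
labels_val = [doc['labels'] for doc in val_docs]

# Flatten the raw tokens and labels. These lists contain no padding, so their positions
# do not line up with y_pred_flat / y_val_flat, which are flattened from padded arrays.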
tokens_val_flat = [token for sublist in tokens_val for token in sublist]
labels_val_flat = [label for sublist in labels_val for label in sublist]


# Find indices where predictions and true values differ
mismatches = [i for i, (y_pred, y_true) in enumerate(zip(y_pred_labels_flat, y_true_labels_flat)) if y_pred != y_true]

from collections import Counter

# Analyze types of errors
error_types = Counter((y_true, y_pred) for y_true, y_pred in zip(y_true_labels_flat, y_pred_labels_flat) if y_true != y_pred)
print("Common error types:")
for (true_label, pred_label), count in error_types.most_common(10):
    print(f"True: {true_label}, Predicted: {pred_label}, Count: {count}")


Common error types:
True: B-NAME_STUDENT, Predicted: O, Count: 191
True: I-NAME_STUDENT, Predicted: O, Count: 178
True: O, Predicted: B-NAME_STUDENT, Count: 49
True: B-URL_PERSONAL, Predicted: O, Count: 28
True: I-NAME_STUDENT, Predicted: B-NAME_STUDENT, Count: 23
True: B-NAME_STUDENT, Predicted: I-NAME_STUDENT, Count: 21
True: B-ID_NUM, Predicted: O, Count: 10
True: O, Predicted: I-NAME_STUDENT, Count: 8
True: I-STREET_ADDRESS, Predicted: O, Count: 7
True: I-STREET_ADDRESS, Predicted: I-NAME_STUDENT, Count: 3
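
The counts above show which label pairs are confused but not which tokens are involved. A minimal sketch of a per-document inspection, assuming predictions have been converted to label strings one row per document; `collect_errors` and `context` are hypothetical names.

```python
def collect_errors(token_docs, true_docs, pred_docs, context=2):
    """List mismatched tokens together with a few neighbouring tokens."""
    errors = []
    for doc_id, (tokens, trues, preds) in enumerate(zip(token_docs, true_docs, pred_docs)):
        # zip stops at the shortest sequence, so predictions at padded positions are ignored.
        for i, (token, true_label, pred_label) in enumerate(zip(tokens, trues, preds)):
            if true_label != pred_label:
                left = max(0, i - context)
                errors.append({
                    "doc": doc_id,
                    "position": i,
                    "token": token,
                    "true": true_label,
                    "pred": pred_label,
                    "context": " ".join(tokens[left:i + context + 1]),
                })
    return errors


demo_tokens = [["Hello", "my", "name", "is", "Ada", "Lovelace", "."]]
demo_true = [["O", "O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]]
# The predicted row is longer than the document, as it would be after padding.
demo_pred = [["O", "O", "O", "O", "B-NAME_STUDENT", "O", "O", "O", "O"]]
demo_errors = collect_errors(demo_tokens, demo_true, demo_pred)
```

Iterating document by document keeps tokens and predictions aligned without flattening, and the `context` field makes it easier to see whether missed names share a pattern such as position in the essay or capitalization.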
