# Machine Exercise 6
### By: Jeryl Salas
We are tasked to create a non-stacked single direction RNN with LSTM cells to perform the named entity recognition to identify citations as in the Coleridge Initiative Problem. 

### A. Import Libraries
Here we place all the necessary imports used for data preprocessing, embedding, feature extraction, and model training and prediction


In [86]:
import numpy as np # For computations of probabilities, perplexities and expected count
import os # For folder and file path opening
import json # Used for loading JSON data
import random # For random selection of JSON files for training and testing
import pandas as pd # For transforming JSON data into a pandas dataframe
import nltk # Used for tokenization of training and testing set
from nltk import pos_tag # Used for parts of speech (POS) tagging
from tqdm import tqdm # For displaying progress bar
import re # For regular expression
import itertools # Used in feature extraction to efficiently remove duplicates
from sklearn.preprocessing import OneHotEncoder # Used for converting extracted features to vectors
from sklearn.metrics import accuracy_score, precision_recall_fscore_support # Used for performance metrics
from sklearn.model_selection import train_test_split # Used to split training and testing set
nltk.download('punkt')
nltk.download('averaged_ perceptron_tagger_eng') 

[nltk_data] Downloading package punkt to C:\Users\Jeryl
[nltk_data]     Salas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Error loading averaged_ perceptron_tagger_eng: Package
[nltk_data]     'averaged_ perceptron_tagger_eng' not found in index


False

### B. Loading and Splitting Data

We first load the dataset and the train.csv file. The max sample size of documents would be 4000 with a 60-40 split between training and testing data. The train.csv file would serve as our basis for BIO tagging later. training documents are stored in the "papers" dictionary while test documents are stored in the "test papers" dictionary. Both dictionaries have document ID as the primary key with text as the value

In [87]:
MAX_SAMPLE = 1000
max_length = 30
train_path = r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 2 Tokenizer\coleridgeinitiative-show-us-the-data\train.csv'
paper_train_folder = r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 2 Tokenizer\coleridgeinitiative-show-us-the-data\train'

data = pd.read_csv(train_path)
data_extract = data[:MAX_SAMPLE]
train, test = train_test_split(data_extract, test_size=0.4)
print(f'No. raw training rows: {len(train)}')
print(f'No. test rows: {len(test)}')

No. raw training rows: 600
No. test rows: 400


In [88]:
train = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

test = test.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

print(f'No. grouped training rows: {len(train)}')
print(f'No. grouped testing rows: {len(test)}')

No. grouped training rows: 594
No. grouped testing rows: 396


In [89]:
papers = {}
for paper_id in train['Id'].unique():
    with open(f'{paper_train_folder}\{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper
        print(f"successfully loaded {paper_id}.JSON")

test_papers = {}
for paper_id in test['Id'].unique():
    with open(f'{paper_train_folder}\{paper_id}.json', 'r') as f:
        paper = json.load(f)
        test_papers[paper_id] = paper
        print(f"successfully loaded test {paper_id}.JSON")


successfully loaded 00248da3-ac1d-48fa-a95e-cc88553f9583.JSON
successfully loaded 0035a1ba-6d1e-487b-bc3d-d49c5d64a3e9.JSON
successfully loaded 0085fd27-7924-41cb-b268-574356c3d6f7.JSON
successfully loaded 00b81ff9-667b-438c-88d4-d08d4979ab68.JSON
successfully loaded 01264bbb-9ed7-406d-a189-83fea3e3437c.JSON
successfully loaded 01320f4d-0ee7-4baa-8476-167a80a25b69.JSON
successfully loaded 0168d49e-cbeb-4c4d-ab17-df2181cba61e.JSON
successfully loaded 01c7c4d7-49bd-436e-8ae8-593b04339c60.JSON
successfully loaded 01ecf2c7-e465-4aa9-b67a-16eee7ceb83b.JSON
successfully loaded 01ff79c7-bb0f-4172-b7e4-401b7aedd986.JSON
successfully loaded 02b3dd31-8c7d-43d4-8795-ae74de8fe473.JSON
successfully loaded 02bf2ee9-7a26-4416-922e-e3daa7bcee0d.JSON
successfully loaded 02c84b89-3171-460c-b880-7500865db2f0.JSON
successfully loaded 02fe9361-b20d-4e29-a031-e5340f165572.JSON
successfully loaded 031e4b40-7c35-4607-b760-6d3fcac9b493.JSON
successfully loaded 036a7628-f511-491d-9060-7af3325bb07d.JSON
successf

### C. Preprocessing Data

For the preprocessing of data, both training and test data will undergro processing such as replacing characters and shortening senntences. As you can see, the BIO tagging functions are here as well which will be used for the next step

In [90]:
from nltk.tokenize import RegexpTokenizer, sent_tokenize

def replace_characters(text: str) -> str:
    """
    Replaces specific special characters in the input text according to predefined replacement rules.
    """
    replacement_rules = {'“': '', '”': '', '’': "", '--': '', '(' : '', ')' : '', ',' : '', ';': ''}
    for symbol, replacement in replacement_rules.items():
        text = text.replace(symbol, replacement)
    return text


def shorten_sentences(sentences, MAX_LENGTH=250, OVERLAP=20):
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return ''.join(short_sentences)


def pad_sequence(tokens, max_length, pad_token="<PAD>"):
    """
    Pads or truncates a sequence of tokens to a fixed length.
    """
    if len(tokens) < max_length:
        tokens += [pad_token] * (max_length - len(tokens))
    else:
        tokens = tokens[:max_length]
    
    return tokens


def preprocess(papers):
    """
    Main preprocessing function used to load and process text data and output tokens for both training and testing
    """

    paper_tokenized_sentences = {}

    for paper_id, text in papers.items():
        all_text = ""
        all_text += text[0]['section_title'] + " " + text[0]['text']
        cleaned_text = replace_characters(all_text.lower())
        shortened_text = shorten_sentences(cleaned_text)
        paper_tokenized_sentences[paper_id] = shortened_text


    print(paper_tokenized_sentences)
    return paper_tokenized_sentences


def find_sublist(big_list, small_list):
    all_positions = []
    for i in range(len(big_list) - len(small_list) + 1):
        if small_list == big_list[i:i+len(small_list)]:
            all_positions.append(i)
    
    return all_positions


def tag_sentence(sentence, labels): 
    word_tokenizer = RegexpTokenizer(r'[-\'\w]+')
    tokenized_sentence = word_tokenizer.tokenize(sentence)
    padded_tokens = pad_sequence(tokenized_sentence, max_length)
    
    if labels is not None and any(re.findall(f'\\b{label}\\b', sentence)
                                  for label in labels): 
        nes = ['O'] * len(padded_tokens)

        for i, token in enumerate(padded_tokens):
            if token == "<PAD>":
                nes[i] = 'PD'

        for label in labels:
            label_words = label.split()

            all_pos = find_sublist(padded_tokens, label_words)
            for pos in all_pos:
                nes[pos] = 'B'
                for i in range(pos+1, pos+len(label_words)):
                    nes[i] = 'I'

        return True, list(zip(padded_tokens, nes)), padded_tokens
        
    else: 
        nes = ['O'] * len(padded_tokens)

        for i, token in enumerate(padded_tokens):
            if token == "<PAD>":
                nes[i] = 'PD'
                
        return False, list(zip(padded_tokens, nes)), padded_tokens


In [91]:
papers = preprocess(papers)
test_papers = preprocess(test_papers)



In [92]:
counter = 0
for paper_id, sections in papers.items():
    print(f"Paper ID: {paper_id}")
    print(f"section: {sections}")
    counter += 1
    if counter == 10: break

print("\n\n----------TEST PAPERS-----------\n\n")

counter = 0
for paper_id, sections in test_papers.items():
    print(f"Paper ID: {paper_id}")
    print(f"section: {sections}")
    counter += 1
    if counter == 10: break

Paper ID: 00248da3-ac1d-48fa-a95e-cc88553f9583
section: abstract muscle power is associated with mortality independent of strength suggesting that movement speed and coordination convey health-related information. we hypothesized that movement speed is a marker of longevity. our participants included 1196 men who performed a tapping and/or auditory simple respond to a sound and disjunctive respond to a higher pitched sound reaction-time tasks while participating in the baltimore longitudinal study of aging. mortality was assessed over 40 years. tapping time was associated with mortality relative risk [rr] ¼ 1.34 per minute 95% confidence interval [ci] 1.05-1.70 adjusted for age and persisted with adjustments for arm strength and power. simple rr ¼ 1.17 per 100 ms 95% ci 1.03-1.32 and disjunctive rr ¼ 1.14 per 100 ms 95% ci 1.03-1.27 reaction times but not their difference rr ¼ 1.04 per 100 ms 95% ci 0.92-1.19 were associated with mortality after adjustments for age neurological/psychia

### D. BIO Tagging and Feature Extraction

After preprocessing, both the training and testing data will undergo BIO tagging and feature extraction. The BIO tagging will serve as the labels, while the extracted features will be used as input data.

#### D.1 BIO Tagging
BIO tagging is a standard approach to **Named Entity Recognition (NER)** that helps capture both the boundaries and the named entity types within text. The tags include:

- **B** = Begin (indicates the beginning of a named entity)
- **I** = Inside (indicates the inside of a named entity)
- **O** = Outside (indicates the word is not part of any named entity)
- **PD** = Padding Token

#### D.2 Feature Extraction
While each word in a sentence is being BIO tagged, it also undergoes feature extraction. The goal is to effectively capture the characteristics of each word, along with its neighboring words. The extracted features include:

1. **Word Identity**  
   - Identity of the word (*w_i*) and its neighboring words

2. **Word Embeddings**  
   - Embeddings for the word (*w_i*) and its neighboring words. I used Word2Vec for the embedding since I get worse results for fasttext for some reason

3. **Part of Speech (POS)**  
   - POS tag for the word (*w_i*) and its neighboring words

4. **Gazetteer Features**  
   - Presence of the word (*w_i*) in a gazetteer (a curated list of entities)

5. **Prefix and Suffix**  
   - Whether the word (*w_i*) contains a particular prefix or suffix  
   - Prefixes and suffixes are considered from all strings of length ≤ 4

6. **Word Shape**  
   - Full word shape (e.g., capitalized, digits, punctuation) of the word (*w_i*) and its neighboring words

Reference: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

In [93]:
# Initialize POS tags and shape characters
pos_tags_vocabulary = ["<START>", "<END>", "<PAD>"] # Special tokens for start and end of a sentence
shape_chars = ['x', '<START>', '<END>', '<PAD>', 'X'] # Initialize usual shape characters
short_shape_chars = ['x', '<START>', '<END>', '<PAD>', 'X']

def load_glove_embeddings(glove_file):
    '''Load GloVe embeddings from a file and store them in a dictionary'''
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings


def identity_of_word(word):
    '''Return the identity of the word'''
    return word

def identity_of_neighboring_words(words, i):
    '''Get the identity of neighboring words for a given word in the sentence'''
    prev_word = words[i - 1] if i > 0 else "<START>"
    next_word = words[i + 1] if i < len(words) - 1 else "<END>"
    return prev_word, next_word

def word_embedding(word, embeddings, embedding_dim):
    '''Return the embedding vector for a given word using GloVe pre-trained embedding'''
    if word in embeddings:
        return embeddings[word]
    else:
        return np.zeros(embedding_dim)
    
def embedding_of_neighboring_words(words, i, embeddings,  embedding_dim):
    '''Get the embedding vectors of neighboring words using GloVe pre-trained embedding'''
    prev_word_embedding = word_embedding(words[i - 1], embeddings, embedding_dim) if i > 0 else np.zeros(embedding_dim)
    next_word_embedding = word_embedding(words[i + 1], embeddings, embedding_dim) if i < len(words) - 1 else np.zeros(embedding_dim)
    return prev_word_embedding, next_word_embedding

def part_of_speech(word):
    '''Return the Part of Speech (POS) tag for a word'''
    tag = nltk.pos_tag([word])[0][1]

    '''update the POS vocabulary'''
    if tag not in pos_tags_vocabulary:
        pos_tags_vocabulary.append(tag)

    return tag

def pos_of_neighboring_words(words, i):
    '''Get the POS tags of neighboring words'''
    pos_tags = nltk.pos_tag(words)
    prev_pos = pos_tags[i - 1][1] if i > 0 else "<START>"
    next_pos = pos_tags[i + 1][1] if i < len(words) - 1 else "<END>"

    '''Add POS tags to vocabulary in case they don't exist in the POS vocab'''
    if prev_pos not in pos_tags_vocabulary:
        pos_tags_vocabulary.append(prev_pos)

    if next_pos not in pos_tags_vocabulary:
        pos_tags_vocabulary.append(next_pos)

    return prev_pos, next_pos

def word_in_gazetteer(word, gazetteer):
    '''Check if a word exists in the gazetteer which is the dataset_labels in train.csv'''
    return word in gazetteer

def word_prefix(word, length=4):
    '''Extract the prefix of a word up to length 4'''
    return word[:min(len(word), length)]

def word_suffix(word, length=4):
    '''Extract the suffix of a word up to length 4'''
    return word[-min(len(word), length):]

def word_shape(word):
    '''Return the word shape based on the character's characteristics'''
    shape = []
    for char in word:
        if char.isupper():
            shape.append("X") # 'X' for uppercase characters
        elif char.islower():
            shape.append("x") # 'x' for lowercase characters
        elif char.isdigit():
            shape.append("9") # '9' for digits
        else:
            shape.append(char)

    '''Add to shape_chars if they don't exist yet'''
    cat_shape = "".join(shape)
    if cat_shape not in shape_chars:
        shape_chars.append(cat_shape)
    return cat_shape

def shape_of_neighboring_words(words, i):
    '''Get the word shapes of neighboring words'''
    prev_word_shape = word_shape(words[i - 1]) if i > 0 else "<START>"
    next_word_shape = word_shape(words[i + 1]) if i < len(words) - 1 else "<END>"
    return prev_word_shape, next_word_shape

def short_word_shape(word):
    '''Return a shortened version of the word shape (without dupes)'''
    full_shape = word_shape(word)
    short_shape = [k for k, _ in itertools.groupby(full_shape)]  # Remove consecutive duplicates
    if short_shape not in shape_chars:
        short_shape_chars.append(short_shape)
    return "".join(short_shape)

def short_shape_of_neighboring_words(words, i):
    '''Return a short shapes of neighboring words'''
    prev_short_shape = short_word_shape(words[i - 1]) if i > 0 else "<START>"
    next_short_shape = short_word_shape(words[i + 1]) if i < len(words) - 1 else "<END>"
    return prev_short_shape, next_short_shape

def gazetteer_features(word, gazetteers):
    '''Check if the word exists in multiple gazetteers'''
    features = {}
    for gaz_name, gaz_list in gazetteers.items():
        features[f"in_{gaz_name}"] = word in gaz_list
    return features

def extract_features(words, i, embeddings, gazetteer):
    '''Extract all features for a given word in a sentence'''
    embedding_dim=200
    word = words[i]
    features = {
        "identity": identity_of_word(word) if word != "<PAD>" else "<PAD>",
        "prev_word": identity_of_neighboring_words(words, i)[0],
        "next_word": identity_of_neighboring_words(words, i)[1],
        "embedding": word_embedding(word, embeddings,embedding_dim) ,
        "prev_embedding": embedding_of_neighboring_words(words, i, embeddings, embedding_dim)[0],
        "next_embedding": embedding_of_neighboring_words(words, i, embeddings, embedding_dim)[1],
        "pos": part_of_speech(word),
        "prev_pos": pos_of_neighboring_words(words, i)[0],
        "next_pos": pos_of_neighboring_words(words, i)[1],
        "in_gazetteer": word_in_gazetteer(word, gazetteer),
        "prefix": word_embedding(word_prefix(word), embeddings, embedding_dim),
        "suffix": word_embedding(word_suffix(word), embeddings, embedding_dim),
        "word_shape": word_shape(word),
        "prev_word_shape": shape_of_neighboring_words(words, i)[0],
        "next_word_shape": shape_of_neighboring_words(words, i)[1]
    }
    gaz_features = gazetteer_features(word, gazetteer)
    features.update(gaz_features)
    return features

#### D.3 Tagging and Extraction

In this step, each word from the sentences in both the training and testing sets is tagged and feature-extracted. The process can take around **half an hour** to complete, so we use `tqdm` to display a progress bar for better visibility.

Additionally, the following counters are tracked to show sentence statistics:

- **`cnt_pos`**: Number of sentences containing labels.
- **`cnt_neg`**: Number of sentences without labels.

Example Progress Bar:
[#########......] 60% Complete

As shown in the statistics, there are significantly less cnt_pos for both training and testing sets. This could introduce bias in our CRF training and testing

In [94]:
cnt_pos, cnt_neg = 0, 0 # number of sentences that contain/not contain labels
tcnt_pos, tcnt_neg = 0, 0
ner_data = []
test_ner_data = []
pbar = tqdm(total=len(train))
pbar_test = tqdm(total=len(test))
glove_file = r'C:\Users\Jeryl Salas\Documents\AI 351\glove.6B\glove.6B.200d.txt'
embeddings = load_glove_embeddings(glove_file)
gazetteer = {
    "organization": set(train['dataset_label'].unique())
}
test_gazetteer = {
    "organization": set(test['dataset_label'].unique())
}
all_features = [] 
test_all_features = []


for i, id, dataset_label in train[['Id', 'dataset_label']].itertuples():
    text = papers[id]
    sentences = text.split('.')

    cleaned_labels = replace_characters(dataset_label.lower())
    labels = cleaned_labels.split('|')

    for sentence in sentences:
        is_positive, tags, padded_tokens = tag_sentence(sentence, labels)
        if is_positive:
            cnt_pos += 1
            ner_data.append(tags)
        else:
            ner_data.append(tags)
            cnt_neg += 1
        
        ner_data[-1]
        
        
        words = padded_tokens
        sentence_features = []
        for i, word in enumerate(words):
            word_features = extract_features(words, i, embeddings, gazetteer)
            sentence_features.append(word_features) 

        all_features.append(sentence_features)

    pbar.update(1)
    pbar.set_description(f"Training data size: {cnt_pos} positives + {cnt_neg} negatives")


for i, id, dataset_label in test[['Id', 'dataset_label']].itertuples():
    text = test_papers[id]
    test_sentences = text.split('.')

    cleaned_labels = replace_characters(dataset_label.lower())
    labels = cleaned_labels.split('|')

    for sentence in test_sentences:
        is_positive, tags, padded_tokens = tag_sentence(sentence, labels)
        if is_positive:
            tcnt_pos += 1
            test_ner_data.append(tags)
        else:
            test_ner_data.append(tags)
            tcnt_neg += 1
        
        test_ner_data[-1]
        
        
        words = padded_tokens
        sentence_features = []
        for i, word in enumerate(words):
            word_features = extract_features(words, i, embeddings, test_gazetteer)
            sentence_features.append(word_features) 

        test_all_features.append(sentence_features)
        #print(all_features[-1])

    pbar_test.update(1)
    pbar_test.set_description(f"Testing data size: {tcnt_pos} positives + {tcnt_neg} negatives")



Training data size: 95 positives + 10682 negatives: 100%|██████████| 591/591 [3:52:26<00:00, 23.60s/it]
Testing data size: 67 positives + 6737 negatives: 100%|██████████| 400/400 [3:52:26<00:00, 34.87s/it]

[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A


#### D.4 Write BIO tags in JSON.txt files
The BIO tagged results for both `training` and `test` sets are located at the `ner_json.txt` and `test_ner_json.txt`, respectively.

In [95]:
with open(r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 6 Non-stacked Single Direction RNN with LSTM\ner_json.txt', 'w') as f:
    for row in ner_data:
        if row:  # If the row is not empty
            words, nes = list(zip(*row))
            row_json = {'tokens': words, 'tags': nes}
        else:  # If the row is empty
            row_json = {'tokens': "", 'tags': ""}
        
        json.dump(row_json, f)
        f.write('\n')

with open(r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 6 Non-stacked Single Direction RNN with LSTM\test_ner_json.txt', 'w') as f:
    for row in test_ner_data:
        if row:  # If the row is not empty
            words, nes = list(zip(*row))
            row_json = {'tokens': words, 'tags': nes}
        else:  # If the row is empty
            row_json = {'tokens': "", 'tags': ""}
        
        json.dump(row_json, f)
        f.write('\n')

In [96]:
print(f"ner structure: {ner_data[0]}")
print(f"feature structure: {all_features[0]}")

ner structure: [('abstract', 'O'), ('muscle', 'O'), ('power', 'O'), ('is', 'O'), ('associated', 'O'), ('with', 'O'), ('mortality', 'O'), ('independent', 'O'), ('of', 'O'), ('strength', 'O'), ('suggesting', 'O'), ('that', 'O'), ('movement', 'O'), ('speed', 'O'), ('and', 'O'), ('coordination', 'O'), ('convey', 'O'), ('health-related', 'O'), ('information', 'O'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD'), ('<PAD>', 'PD')]
feature structure: [{'identity': 'abstract', 'prev_word': '<START>', 'next_word': 'muscle', 'embedding': array([-1.8133e-01,  2.0599e-01, -2.0806e-01, -1.6986e-01,  9.1498e-01,
       -1.1684e-02, -4.3782e-01,  5.8944e-01,  1.7412e-01, -3.3781e-01,
       -7.8252e-02, -7.4619e-04,  2.1310e-02,  6.6326e-01,  5.0966e-01,
        1.2798e-01, -3.5376e-01, -3.0487e-01,  6.9657e-01,  3.0376e-02,
       -4.1940e-01,  9.9429e-01,  5.5103e-03, -5.0805e-01

### E. RNN + LSTM Cells 

In this section, we implement a non-stacked single direction RNN model with LSTM cells, consisting of a single `LSTM Cell` that handles the computations for the LSTM architecture, including `input`, `forget`, `cell`, and `output` gates with appropriate `weight initializations`. The LSTM class aggregates multiple time steps, updating the hidden and cell states at each step while producing an output based on the last hidden state. The model is trained using the `Adam optimizer` and `cross-entropy loss` to predict sequences, making it a non-stacked, single-directional RNN because it processes the input sequentially without additional LSTM layers or bidirectional connections. The BIO-tagged data serves as the labels, while the extracted features are used as input data, which is converted into vectors

#### E.1 Feature extraction and BIOS tagging numerical conversion
This function converts linguistic contextual features into vectores by embedding or one-hot encoding. The resulting vectors of each token is therefore a concatenation of all the features of that token. Each token has 1-dimensional vector of length 2868. This can be represented as:

$$ F_k(X, Y) $$

Where a feature function that maps an entire input sequence $ X $ and an entire output sequence $ Y $ to a feature vector. The feature vector contributes to the final prediction with corresponding weights $ w_k $. Ref: Page 377, Jurafsky, et. Al (2024)

In [97]:
import torch
import torch.nn as nn
import torch.optim as optim

def feature_2_vec(feature, pos_tags, embedding_dim = 300):
    '''Converts input features into a feature vector'''
    vector = []
    encoder_1 = OneHotEncoder(categories=[pos_tags], sparse_output=False)
    pos_encoder = encoder_1.fit(np.array(pos_tags).reshape(-1, 1))
    pos = pos_encoder.transform(np.array([feature['pos']]).reshape(-1, 1))
    vector.extend(pos[0])

    vector.extend(feature['embedding'].tolist())
    vector.extend(feature['prev_embedding'].tolist())
    vector.extend(feature['next_embedding'].tolist())

    next_pos = pos_encoder.transform(np.array([feature['next_pos']]).reshape(-1, 1))
    prev_pos = pos_encoder.transform(np.array([feature['prev_pos']]).reshape(-1, 1))
    vector.extend(next_pos[0])
    vector.extend(prev_pos[0])

    encoder_3 = OneHotEncoder(categories=[shape_chars], sparse_output=False)
    word_shape_encoder = encoder_3.fit(np.array(shape_chars).reshape(-1, 1))
    next_word_shape_encoder = encoder_3.fit(np.array(shape_chars).reshape(-1, 1))

    word_shape_vector = word_shape_encoder.transform(np.array([feature['word_shape']]).reshape(-1, 1))
    next_word_shape_vector = next_word_shape_encoder.transform(np.array([feature['next_word_shape']]).reshape(-1, 1))

    if feature['in_organization'] == True:
        gazzetteer_vector = np.ones(embedding_dim)
    else:
        gazzetteer_vector = np.zeros(embedding_dim)

    vector.extend(feature['prefix'].tolist())
    vector.extend(feature['suffix'].tolist())
    vector.extend(word_shape_vector[0])
    vector.extend(next_word_shape_vector[0])
    vector.extend(gazzetteer_vector)

    return(vector)

In [98]:
labels = ['B', 'I', 'O', 'PD']
tag_map = {"B": 0, "I": 1, "O": 2, "PD": 3}

'''Filling of X and y training and testing sets'''
X_train = []
y_train = []
for i, sentence in enumerate(sentences):
    tags = ner_data[i]
    features =  all_features[i]
    vectors = []
    for word_feature, tag in zip(features, tags):
        vector = feature_2_vec(word_feature, pos_tags_vocabulary)
        vectors.append(vector)
    bio_tags = [tag_map[tag] for _, tag in tags]
    X_train.append(vectors)
    y_train.append(bio_tags)


X_test = []
y_test = []
for i, sentence in enumerate(test_sentences):
    tags = test_ner_data[i]
    features =  test_all_features[i]
    vectors = []
    for word_feature, tag in zip(features, tags):
        vector = feature_2_vec(word_feature, pos_tags_vocabulary)
        vectors.append(vector)
    bio_tags = [tag_map[tag] for _, tag in tags]
    X_test.append(vectors)
    y_test.append(bio_tags)

In [99]:
first_length = len(X_train[0][0])
print(f"length of vectors per token: {first_length} ")
all_equal = all(len(seq[0]) == first_length for seq in X_train)

if all_equal:
    print("All sequences have equal lengths.")
else:
    print("Sequences have unequal lengths.")

length of vectors per token: 2868 
All sequences have equal lengths.



#### E.2 Initialization of LSTM
The LSTM model initializes its weights for input, forget, cell, and output gates as follows:

$$
W_i, W_f, W_o, W_c \in \mathbb{R}^{H \times I}
$$
where $ W_i, W_f, W_o, W_c $ are the weights corresponding to the input, forget, output, and cell gates, respectively, for an input size $ I $ and hidden size $ H $.

$$
U_i, U_f, U_o, U_c \in \mathbb{R}^{H \times H}
$$
where $ U_i, U_f, U_o, U_c $ are the weights for the recurrent connections of the respective gates.

The biases for the gates are initialized as:

$$
b_i, b_f, b_o, b_c \in \mathbb{R}^{H}
$$

#### E.3 Forward Pass
The forward pass computes the hidden and cell states at each time step using the LSTM equations. For the input $ x_t $ at time $ t $, the initial hidden state $ h_0 $, and the cell state $ c_0 $:

$$
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
$$

$$
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
$$

$$
c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_c x_t + U_c h_{t-1} + b_c)
$$

$$
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
$$

$$
h_t = o_t \cdot \tanh(c_t)
$$

where $ \sigma $ is the sigmoid activation function.

In [100]:
def initialize_weights(layer, init_type='xavier'):
    """
    Initializes the weights of a given layer.
    """
    if init_type == 'xavier':
        torch.nn.init.xavier_uniform_(layer)
    elif init_type == 'kaiming':
        torch.nn.init.kaiming_uniform_(layer)
    else:
        raise ValueError(f"Unknown initialization type '{init_type}'")


class LSTMCell:
    def __init__(self, input_size, hidden_size, init_type='xavier'):
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Initializing weights for input, forget, cell, and output gates
        self.W_i = torch.randn(hidden_size, input_size, requires_grad=True)
        self.W_f = torch.randn(hidden_size, input_size, requires_grad=True)
        self.W_o = torch.randn(hidden_size, input_size, requires_grad=True)
        self.W_c = torch.randn(hidden_size, input_size, requires_grad=True)
        
        # Initializing weights for hidden states
        self.U_i = torch.randn(hidden_size, hidden_size, requires_grad=True)
        self.U_f = torch.randn(hidden_size, hidden_size, requires_grad=True)
        self.U_o = torch.randn(hidden_size, hidden_size, requires_grad=True)
        self.U_c = torch.randn(hidden_size, hidden_size, requires_grad=True)
        
        # Initializing biases. 
        self.b_i = torch.randn(hidden_size, requires_grad=True)
        self.b_f = torch.randn(hidden_size, requires_grad=True)
        self.b_o = torch.randn(hidden_size, requires_grad=True)
        self.b_c = torch.randn(hidden_size, requires_grad=True)

        # Weight initialization
        self.initialize_weights(init_type)

    def initialize_weights(self, init_type='xavier'):
            """
            Initializes the weights of the LSTM cell using the specified initialization method.
            """
            initialize_weights(self.W_i, init_type)
            initialize_weights(self.W_f, init_type)
            initialize_weights(self.W_o, init_type)
            initialize_weights(self.W_c, init_type)

            initialize_weights(self.U_i, init_type)
            initialize_weights(self.U_f, init_type)
            initialize_weights(self.U_o, init_type)
            initialize_weights(self.U_c, init_type)
            
            # Initialize biases as zeroes
            torch.nn.init.zeros_(self.b_i)
            torch.nn.init.zeros_(self.b_f)
            torch.nn.init.zeros_(self.b_o)
            torch.nn.init.zeros_(self.b_c)

    
    def forward(self, x, h_prev, c_prev):
        if x.dim() == 1:
            x = x.view(1, -1)  # Convert 1D vector to 2D (1 x input_size)
        
        if h_prev.dim() == 1:
            h_prev = h_prev.view(1, -1)  # Convert 1D vector to 2D (1 x hidden_size)

        
        i_t = torch.sigmoid(torch.mm(x, self.W_i.T) + torch.mm(h_prev, self.U_i.T) + self.b_i) # Input gate
        f_t = torch.sigmoid(torch.mm(x, self.W_f.T) + torch.mm(h_prev, self.U_f.T) + self.b_f) # Forget gate
        c_hat_t = torch.tanh(torch.mm(x, self.W_c.T) + torch.mm(h_prev, self.U_c.T) + self.b_c) # Cell candidate
        c_t = f_t * c_prev + i_t * c_hat_t # Cell state update
        o_t = torch.sigmoid(torch.mm(x, self.W_o.T) + torch.mm(h_prev, self.U_o.T) + self.b_o) # Output gate
        h_t = o_t * torch.tanh(c_t) # Hidden state update
        
        return h_t, c_t # return hidden state and cell state


In [101]:
class LSTM:
    def __init__(self, input_size, hidden_size, output_size, init_type='xavier'):
        self.hidden_size = hidden_size
        self.lstm_cell = LSTMCell(input_size, hidden_size)
        
        # Initializing Output layer
        self.W_out = torch.randn(output_size, hidden_size, requires_grad=True)
        self.b_out = torch.randn(output_size, requires_grad=True)

        # Weight initialization of output weights
        self.initialize_output_weights(init_type)

    def initialize_output_weights(self, init_type='xavier'):
        initialize_weights(self.W_out, init_type)
        torch.nn.init.zeros_(self.b_out)  # Initialize biases to zero
    
    def forward(self, x_sequence):
        # Zero intialization
        h_t = torch.zeros(self.hidden_size, requires_grad=True)
        c_t = torch.zeros(self.hidden_size, requires_grad=True)
        
        outputs = [] # Storage of hidden states
        
        # Looping over each timestep which is each token of the sequence
        for t in range(x_sequence.size(0)):
            x_t = x_sequence[t]  # Input for the current time step
            h_t, c_t = self.lstm_cell.forward(x_t, h_t, c_t)  # Update hidden and cell states
            outputs.append(h_t)  # Storing hidden states
        
        # Computing the final output for the sequence
        final_output = torch.mm(h_t, self.W_out.T) + self.b_out 
        
        return torch.stack(outputs), final_output

In [105]:
# Sizes
input_size = len(X_train[0][0])  
hidden_size = 64
output_size = len(tag_map)  

# Instantiate LSTM model
lstm_model = LSTM(input_size, hidden_size, output_size)

# Optimizer
optimizer = optim.Adam([lstm_model.lstm_cell.W_i, lstm_model.lstm_cell.W_f,
                        lstm_model.lstm_cell.W_o, lstm_model.lstm_cell.W_c,
                        lstm_model.lstm_cell.U_i, lstm_model.lstm_cell.U_f,
                        lstm_model.lstm_cell.U_o, lstm_model.lstm_cell.U_c,
                        lstm_model.lstm_cell.b_i, lstm_model.lstm_cell.b_f,
                        lstm_model.lstm_cell.b_o, lstm_model.lstm_cell.b_c,
                        lstm_model.W_out, lstm_model.b_out], lr=0.0002)

# Training loop and loss function
num_epochs = 10
loss_fn = torch.nn.CrossEntropyLoss()

def mask(tensor, pad_token):
    # Iterate over the elements of the tensor
    for i in range(tensor.size(0)):
        if tensor[i] == pad_token:
            return i  # Return the index where padding starts
    return tensor.size(0)  # Return the length of the tensor if no padding is found

for epoch in range(num_epochs):
    total_loss = 0
    for i, (x_seq, y_true) in enumerate(zip(X_train, y_train)):
        x_tensor = torch.tensor(x_seq, dtype=torch.float32).view(-1, input_size)  
        y_true_tensor = torch.tensor(y_true, dtype=torch.long)  
        target = y_true_tensor[-1].unsqueeze(0)
        
        # Forward pass
        optimizer.zero_grad()  
        y_pred, final_output = lstm_model.forward(x_tensor)
        loss = loss_fn(final_output, target)
        total_loss += loss.item()

        loss.backward()

        optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(X_train)}')

Epoch 1, Loss: 0.5935488223795082
Epoch 2, Loss: 0.038213008068749436
Epoch 3, Loss: 0.013542227650556219
Epoch 4, Loss: 0.007159377731565204
Epoch 5, Loss: 0.004429311875800986
Epoch 6, Loss: 0.0029627025396875726
Epoch 7, Loss: 0.002126222169450061
Epoch 8, Loss: 0.0016218004214994192
Epoch 9, Loss: 0.0012797132389655767
Epoch 10, Loss: 0.0010303791406421135


#### E.4 Score Computation
The score for a given sequence of labels is computed based on the hidden states:

$$
\text{score} = \sum_{t=0}^{T-1} h_t
$$


#### E.5 Output Computation
The final output for the sequence is computed from the last hidden state $ h_T $:

$$
\text{output} = W_{\text{out}} h_T + b_{\text{out}}
$$

where $ W_{\text{out}} $ and $ b_{\text{out}} $ are the weights and biases of the output layer.

### E.6 Cross Entropy
The cross-entropy loss measures the difference between the true label distribution and the predicted label distribution by the model. It is defined mathematically as:

$$
\mathcal{L_{CE}} = - \sum_{k=1}^{K} y^{k} \log(\hat{y}^{k})
$$

where:
- $ K $ is the number of classes,
- $ y $ is the true label (1 if the class is the correct label, otherwise 0),
- $ \hat{y} $ is the predicted probability of the true class from the softmax output of the model.


### E.7 Sequence Decoding
This function implements a decoding strategy to find the most likely output sequence $ \hat{Y} $ given the input sequence $ X $. It selects the label with the highest predicted probability at each time step:

$$
\hat{Y}_t = \arg\max_y \hat{y_t}(y)
$$

where $ \hat{y_t}(y) $ is the predicted probability for each label \( y \) at time step $ t $. 



In [106]:
# Initialize test loss and accuracy
test_loss = 0
correct_predictions = 0
total_predictions = 0

num_classes = 4
true_positive = np.zeros(num_classes)
false_positive = np.zeros(num_classes)
false_negative = np.zeros(num_classes)


with torch.no_grad():
    for i, (x_seq, y_true) in enumerate(zip(X_test, y_test)):
        x_tensor = torch.tensor(x_seq, dtype=torch.float32).view(-1, input_size)
        y_true_tensor = torch.tensor(y_true, dtype=torch.long)
        target = y_true_tensor[-1].unsqueeze(0)  # True label for the last time step
        
        # Forward pass
        outputs, final_output = lstm_model.forward(x_tensor)
        
        # Compute loss
        loss = loss_fn(final_output, target)
        test_loss += loss.item()

        # Get the predicted class (argmax of the logits)
        pred_class = final_output.argmax(dim=1)

        # Compare the predicted class with the true label
        correct_predictions += (pred_class == target).sum().item()
        total_predictions += target.size(0)

        for cls in range(num_classes):
            true_positive[cls] += ((pred_class == cls) & (y_true_tensor == cls)).sum().item()  # TP
            false_positive[cls] += ((pred_class == cls) & (y_true_tensor != cls)).sum().item()  # FP
            false_negative[cls] += ((pred_class != cls) & (y_true_tensor == cls)).sum().item()  # FN

# Compute average test loss and accuracy
average_test_loss = test_loss / len(X_test)
test_accuracy = correct_predictions / total_predictions


overall_recall = np.sum(true_positive) / (np.sum(true_positive) + np.sum(false_negative)) if (np.sum(true_positive) + np.sum(false_negative)) > 0 else 0

# Print the results
print(f'Test Loss: {average_test_loss:.4f}')
print(f'Test Accuracy: {test_accuracy * 100:.2f}%')
print(f'Overall Recall: {overall_recall:.4f}') 

Test Loss: 0.0802
Test Accuracy: 97.06%
Overall Recall: 0.6490


### F. Compare with CRF Model

In this section, we'll compare the LSTM's performance with CRF model which is imported from `TorchCRF` library. Both models use the same dataset. Their accuracy and recall will be compared



In [19]:
pip install -U pytorch-crf

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [20]:
import torch
import torch.nn as nn
import torch.optim as optim
from TorchCRF import CRF

class CRFModel(nn.Module):
    def __init__(self, input_size, num_tags):
        super(CRFModel, self).__init__()
        self.hidden2tag = nn.Linear(input_size, num_tags)
        self.crf = CRF(num_tags)

    def forward(self, x, mask=None):
        emissions = self.hidden2tag(x)
        return emissions, mask 

    def compute_loss(self, emissions, tags, mask):
        log_likelihood = self.crf(emissions, tags, mask=mask)
        return -log_likelihood 

    def predict(self, emissions, mask=None):
        return self.crf.decode(emissions, mask=mask)
    
    def predict(self, emissions, mask=None):
        return self.crf.viterbi_decode(emissions, mask=mask)


In [29]:

crf_model = CRFModel(input_size, num_classes)
optimizer = optim.Adam(crf_model.parameters(), lr=0.0002)
num_epochs = 10
tol = 1e-5

def create_mask(y_train, pad_token):
    mask = (y_train != pad_token).float()
    return mask.bool()

for epoch in range(num_epochs):
    total_loss = 0
    for i, (x_seq, y_true) in enumerate(zip(X_train, y_train)):
        x_tensor = torch.tensor(x_seq, dtype=torch.float32).unsqueeze(0).transpose(0, 1) 
        y_true_tensor = torch.tensor(y_true, dtype=torch.long).unsqueeze(0).transpose(0, 1)  

        if y_true_tensor[0] == 3:
            continue
        mask = create_mask(y_true_tensor, pad_token=3)
        emissions, mask = crf_model.forward(x_tensor, mask)
        
        loss = crf_model.compute_loss(emissions, y_true_tensor, mask)
        loss = loss.sum()
        total_loss += loss

        if total_loss / len(X_train) < tol:
            break
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    if total_loss / len(X_train) < tol:
        print(f"Model converged at epoch: {epoch+1}. Stopping...")
        break

    print(f'Epoch {epoch+1}, Loss: {total_loss / len(X_train)}')


Epoch 1, Loss: 11.800827980041504
Epoch 2, Loss: 3.4341866970062256
Epoch 3, Loss: 1.8012597560882568
Epoch 4, Loss: 1.1985108852386475
Epoch 5, Loss: 0.895520806312561
Epoch 6, Loss: 0.7155444025993347
Epoch 7, Loss: 0.5968331694602966
Epoch 8, Loss: 0.5126206874847412
Epoch 9, Loss: 0.4495418071746826
Epoch 10, Loss: 0.4002172350883484


In [75]:
import torch
import numpy as np
from sklearn.metrics import accuracy_score, recall_score


crf_model.eval()

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)  
y_test_tensor = torch.tensor(y_test, dtype=torch.long)  
mask_test = create_mask(y_test_tensor, pad_token=3)


all_predictions = []
all_true_labels = []


with torch.no_grad():
    for x_seq, y_true in zip(X_test_tensor, y_test_tensor):
        x_tensor = x_seq.unsqueeze(0).transpose(0, 1)  
        y_true_tensor = y_true.unsqueeze(0).transpose(0, 1) 

        mask = create_mask(y_true_tensor, pad_token=3)
        emissions, _ = crf_model.forward(x_tensor)


        predictions = crf_model.predict(emissions, mask)

        all_predictions.extend(predictions)
        all_true_labels.extend(y_true_tensor.numpy().flatten()) 


y_pred_flat = np.concatenate(all_predictions)
y_true_flat = np.array(all_true_labels)

accuracy = accuracy_score(y_true_flat, y_pred_flat)
recall = recall_score(y_true_flat, y_pred_flat, average='weighted')  

print(f'Accuracy: {accuracy:.4f}')
print(f'Recall: {recall:.4f}')


Accuracy: 0.9753
Recall: 0.9753


### E. Results and Discussion

Both RNN + LSTM and CRF models used the same testing set and learning rate (0.0002). CRF was implemented using TorchCRF library. As shown from the results the RNN + LSTM had an accuracy almost on par with the CRF with 97.06% accuracy while CRF had 97.53%. The RNN + LSTM was able to have worse recall of 0.6490 which is significantly lower than CRF's recall of 0.9753. The RNN + LSTM, however, seem to converge faster reaching 0.00103 at the 10th epoch while CRF only reaching 0.4002. The NER labels of train and testing sets are stored in the ner_json.txt and test_ner_json.txt files, respectively.