In [1]:
# write the list of necessary packages here:
!pip install pandas
!pip install nltk
!pip install spacy
!pip install scikit-learn
!pip install sklearn-crfsuite

## Training a model on Named Entity Recognition task

Token classification is the task of assigning a label to each individual token in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this assignment, you will learn how to train a model on the [CoNLL-2003 NER dataset](https://www.clips.uantwerpen.be/conll2003/ner/) to detect entities in new text.

### Loading the dataset

In [2]:
# import your packages here:
import pandas as pd
import nltk



In [3]:
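# header=0 uses the first line of each file ("-DOCSTART- -X- -X- O") as column names,
# so the columns are "-DOCSTART-" (token), "-X-" (POS), "-X-.1" (chunk) and "O" (NER tag).
train_df = pd.read_csv("ner_data/train.txt", header=0, sep=" ")
val_df = pd.read_csv("ner_data/val.txt", header=0, sep=" ")
test_df = pd.read_csv("ner_data/test.txt", header=0, sep=" ")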

print(f"{train_df.shape}, {val_df.shape}, {test_df.shape}")

(204566, 4), (51577, 4), (46665, 4)


The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on its own line, and an empty line follows each sentence. The first item on each line is the word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. In the original release, the chunk tags and named entity tags have the format I-TYPE, meaning that the word is inside a phrase of type TYPE; the first word of a phrase is tagged B-TYPE only when it immediately follows another phrase of the same type. The copy used here appears to follow the IOB2 convention instead, in which every phrase starts with B-TYPE (note the `B-ORG` tag on "EU" in the example). A word with tag O is not part of a phrase. Here is an example:

In [4]:
train_df.head()

Unnamed: 0,-DOCSTART-,-X-,-X-.1,O
0,EU,NNP,B-NP,B-ORG
1,rejects,VBZ,B-VP,O
2,German,JJ,B-NP,B-MISC
3,call,NN,I-NP,O
4,to,TO,B-VP,O
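
Note that `pd.read_csv` skips blank lines by default, treats `"` as a quote character, and parses tokens such as `NA` as missing values, so sentence boundaries and some labels may be lost when a CoNLL file is loaded as a CSV. As an alternative, the sketch below reads the four-column format line by line and splits sentences on blank lines. The `read_conll` helper and the inline sample are illustrative assumptions, not part of the assignment code.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group four-column CoNLL lines into sentences, splitting on blank lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)  # keep the last sentence if the input has no trailing blank line
    return sentences


# Small illustrative sample in the same layout as the data files.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

demo_sentences = read_conll(sample.splitlines())
print(len(demo_sentences), demo_sentences[0][0])
```

With a real file, the same function could be called on an open file handle, for example `read_conll(open("ner_data/train.txt", encoding="utf-8"))`, and its output has the same shape that `sent2features` and `sent2labels` expect.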


In [5]:
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

labels_vocab = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
labels_vocab_reverse = {v:k for k,v in labels_vocab.items()}

### Feature Extraction
 
You need to extract features for each token. The features can be:
• Basic features: Token itself, token lowercase, prefix/suffix of the token.
• Context features: Neighboring tokens (previous/next token).
• Linguistic features: Part-of-speech (POS) tags or word shapes (capitalization, digits, etc.).

Note that you are expected to briefly mention which features you employ for training your model.
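
The word-shape feature mentioned in the list can be sketched as follows. This is one common formulation, assumed here for illustration: uppercase letters map to `X`, lowercase letters to `x`, and digits to `d`, with an optional collapsed variant that merges repeated characters. The helper names `word_shape` and `collapsed_shape` are hypothetical and do not appear in the feature function used in this notebook.

```python
import re


def word_shape(token: str) -> str:
    """Map each character to a class symbol: X (upper), x (lower), d (digit)."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are replaced last so that the inserted "d" is not rewritten to "x".
    shape = re.sub(r"[0-9]", "d", shape)
    return shape


def collapsed_shape(token: str) -> str:
    """Collapse runs of the same shape symbol, e.g. 'Xxxxxx' becomes 'Xx'."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))


for example in ["EU", "German", "1996-08-22"]:
    print(example, word_shape(example), collapsed_shape(example))
```

Either value could be added to the token feature dictionary under a key such as `'word.shape'`; the collapsed form keeps the feature space small, while the full form preserves token length.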

In [7]:
# Feature extraction function for each token
def extract_token_features(sentence, index):
    token, pos_tag, chunk_tag, ner_tag = sentence[index]

    # Create a dictionary to store token-level features
    token_features = {
        'bias': 1.0,
        'word.lower()': token.lower(),
        'word.isupper()': token.isupper(),
        'word.istitle()': token.istitle(),
        'word.isdigit()': token.isdigit(),
        'prefix-2': token[:2],
        'suffix-2': token[-2:],
        'prefix-3': token[:3],
        'suffix-3': token[-3:],
        'pos': pos_tag,
        'chunk': chunk_tag
    }

    # Add previous token features if available
    if index > 0:
        prev_token, prev_pos, prev_chunk = sentence[index - 1][:3]
        token_features.update({
            '-1:word.lower()': prev_token.lower(),
            '-1:word.isupper()': prev_token.isupper(),
            '-1:word.istitle()': prev_token.istitle(),
            '-1:pos': prev_pos,
            '-1:chunk': prev_chunk
        })
    else:
        token_features['BOS'] = True  # Beginning of sentence

    # Add next token features if available
    if index < len(sentence) - 1:
        next_token, next_pos, next_chunk = sentence[index + 1][:3]
        token_features.update({
            '+1:word.lower()': next_token.lower(),
            '+1:word.isupper()': next_token.isupper(),
            '+1:word.istitle()': next_token.istitle(),
            '+1:pos': next_pos,
            '+1:chunk': next_chunk
        })
    else:
        token_features['EOS'] = True  # End of sentence

    return token_features

In [12]:
def sent2features(sentence):
    return [extract_token_features(sentence, i) for i in range(len(sentence))]

def sent2labels(sentence):
    return [ner_tag for word, pos_tag, chunk_tag, ner_tag in sentence]

### Train a NER Classifier Model

Implement one of the following classifiers for recognizing multiple entity types (e.g., person, organization, location): Conditional Random Field (CRF), BiLSTM, or multinomial logistic regression. Select only one and provide a brief explanation for your choice of model.

In [13]:
import pandas as pd
import sklearn_crfsuite
from sklearn.model_selection import train_test_split
from sklearn_crfsuite import metrics
import numpy as np

In [14]:
def create_sentences(data):
    sentences = []
    sentence = []
    for _, row in data.iterrows():
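        # A missing token is treated as a sentence boundary. Note that pd.read_csv skips
        # blank lines by default, so this fires only where the token itself was parsed as NaN.
        if pd.isna(row["-DOCSTART-"]):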
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            sentence.append((row["-DOCSTART-"], row["-X-"], row["-X-.1"], row["O"]))
    if sentence:
        sentences.append(sentence)  # Add the last sentence
    return sentences

train_sentences = create_sentences(train_df)
val_sentences = create_sentences(val_df)
test_sentences = create_sentences(test_df)

In [26]:
x_train = [sent2features(s) for s in train_sentences]
y_train = [sent2labels(s) for s in train_sentences]

x_val = [sent2features(s) for s in val_sentences]
y_val = [sent2labels(s) for s in val_sentences]

x_test = [sent2features(s) for s in test_sentences]
y_test = [sent2labels(s) for s in test_sentences]


In [28]:
y_train = [
    [label if not pd.isna(label) else "O" for label in sentence] 
    for sentence in y_train
]

y_val = [
    [label if not pd.isna(label) else "O" for label in sentence]
    for sentence in y_val
]

In [None]:
# Tune hyperparameters using validation set
from sklearn_crfsuite.metrics import flat_f1_score

best_c1, best_c2, best_score = None, None, 0

for c1 in [0.01, 0.1, 1]:
    for c2 in [0.01, 0.1, 1]:
        crf = sklearn_crfsuite.CRF(
            algorithm='lbfgs',
            c1=c1,
            c2=c2,
            max_iterations=100,
            all_possible_transitions=False
        )
        crf.fit(x_train, y_train)

        # Predict on validation set
        y_val_pred = crf.predict(x_val)
        score = flat_f1_score(y_val, y_val_pred, average='weighted', labels=label_list)

        # Update the best hyperparameters
        if score > best_score:
            best_c1, best_c2, best_score = c1, c2, score

print(f"Best hyperparameters: c1={best_c1}, c2={best_c2}, F1={best_score}")

Best hyperparameters: c1=0.01, c2=0.1, F1=0.9770444938397738


In [30]:
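# Note: `crf` is the last model fitted in the grid search above (c1=1, c2=1),
# not the best configuration reported by the search.
y_val_pred = crf.predict(x_val)
print(metrics.flat_classification_report(y_val, y_val_pred, digits=3))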

              precision    recall  f1-score   support

       B-LOC      0.889     0.856     0.872      1837
      B-MISC      0.921     0.811     0.863       922
       B-ORG      0.829     0.776     0.802      1341
       B-PER      0.883     0.884     0.884      1842
       I-LOC      0.887     0.767     0.823       257
      I-MISC      0.852     0.697     0.766       346
       I-ORG      0.757     0.783     0.770       751
       I-PER      0.930     0.949     0.940      1303
           O      0.989     0.996     0.992     42973

    accuracy                          0.970     51572
   macro avg      0.882     0.835     0.857     51572
weighted avg      0.970     0.970     0.970     51572



In [31]:
# Combine training and validation data
x_train_final = x_train + x_val
y_train_final = y_train + y_val

# Retrain with the best hyperparameters
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=best_c1,
    c2=best_c2,
    max_iterations=100,
    all_possible_transitions=False
)
crf.fit(x_train_final, y_train_final)

### Evaluation

Evaluate the model on the test set using metrics such as precision, recall, and F1-score.

In [32]:
y_pred = crf.predict(x_test)
print(metrics.flat_classification_report(y_test, y_pred, digits=3))



              precision    recall  f1-score   support

       B-LOC      0.869     0.872     0.870      1668
      B-MISC      0.799     0.762     0.780       702
       B-ORG      0.825     0.767     0.795      1661
       B-PER      0.854     0.860     0.857      1617
       I-LOC      0.793     0.732     0.761       257
      I-MISC      0.604     0.685     0.642       216
       I-ORG      0.740     0.778     0.759       835
       I-PER      0.890     0.959     0.923      1156
           O      0.979     0.989     0.984     38124
         nan      0.000     0.000     0.000       421

    accuracy                          0.953     46657
   macro avg      0.735     0.740     0.737     46657
weighted avg      0.944     0.953     0.949     46657
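
The `nan` row in the report above indicates that `y_test` still contains missing labels: the notebook replaces them with "O" for `y_train` and `y_val` only. A minimal sketch of applying the same replacement to any list of label sequences is shown below. The helper name `fill_missing_labels` is hypothetical, and treating missing labels as "O" is an assumption carried over from the training-set cleaning rather than something the data confirms.

```python
def fill_missing_labels(label_sequences, default="O"):
    """Replace any non-string label (e.g. NaN or None) with a default tag."""
    return [
        [label if isinstance(label, str) else default for label in sequence]
        for sequence in label_sequences
    ]


# Illustrative input with a NaN and a None standing in for missing labels.
demo_labels = [["B-PER", float("nan"), "O"], [None, "B-LOC"]]
cleaned_labels = fill_missing_labels(demo_labels)
print(cleaned_labels)
```

Applying it as `y_test = fill_missing_labels(y_test)` before calling `flat_classification_report` would remove the `nan` class from the report, although the reported scores would then change and need to be recomputed.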





### Reporting

Summarize your findings and suggest potential improvements for future iterations of the NER system. Additionally, discuss whether your model encountered class imbalance issues and how you addressed them. Write your suggestions to the given markdown cells.

For feature extraction, a constant bias feature provides a baseline prediction. The suggested surface features (lowercase form, uppercase, title case, and digit checks) are used because uppercase words often appear in acronyms, capitalised words in the names of people and locations, and digits in dates and monetary values, while the lowercase form lets the model generalise across casing. Two- and three-character prefixes and suffixes are included as well. Beyond these basic features, POS tags are used because homographs can denote different entities depending on their grammatical role. The chunk tag is also used, since the dataset already marks whether a word belongs to a phrase, and noun phrases frequently contain entities. Lastly, features of the previous and next tokens are included, since entities tend to co-occur with particular neighbouring words or affixes.

A CRF was chosen because it models dependencies between labels (such as "B-PER" followed by "I-PER"), which suits sequence labelling. CRFs are efficient and work well with hand-crafted features, especially on smaller datasets, compared with alternatives such as BiLSTMs. The regularisation parameters c1 (L1) and c2 (L2) were tuned on the validation set to limit overfitting while maximising performance.

On the unseen test set, the model reached an accuracy of 0.953 and a weighted F1 of 0.949. According to the classification report, less frequent entity labels (such as "B-MISC" and "I-MISC", with F1 scores of 0.780 and 0.642) performed worse than frequent labels (such as "O" for non-entities, with an F1 of 0.984). The test report also contains a `nan` row covering 421 tokens, which indicates missing labels in the test data that were not replaced with "O" as they were for the training and validation sets; this row lowers the macro average. Class imbalance was addressed only at evaluation time, by using a weighted F1 score during model selection. Because a weighted average is dominated by the majority "O" class, the per-class and macro-averaged scores are the better indicator of performance on rare entity types.
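
To illustrate how strongly the "O" class influences a support-weighted average, the sketch below recomputes the weighted F1 from the per-class F1 scores and supports in the test report, once over all labels and once over entity labels only. The `nan` row is left out, and the helper `weighted_f1` is a hypothetical name introduced here; the figures are derived from the rounded values in the report, so they are approximate.

```python
# label: (f1-score, support), copied from the test classification report.
test_report = {
    "B-LOC": (0.870, 1668),
    "B-MISC": (0.780, 702),
    "B-ORG": (0.795, 1661),
    "B-PER": (0.857, 1617),
    "I-LOC": (0.761, 257),
    "I-MISC": (0.642, 216),
    "I-ORG": (0.759, 835),
    "I-PER": (0.923, 1156),
    "O": (0.984, 38124),
}


def weighted_f1(report, exclude=()):
    """Support-weighted mean of per-class F1 scores, skipping excluded labels."""
    rows = [(f1, n) for label, (f1, n) in report.items() if label not in exclude]
    total_support = sum(n for _, n in rows)
    return sum(f1 * n for f1, n in rows) / total_support


all_labels_f1 = weighted_f1(test_report)
entity_only_f1 = weighted_f1(test_report, exclude={"O"})
print(f"all labels: {all_labels_f1:.3f}, entity labels only: {entity_only_f1:.3f}")
```

The entity-only average comes out at roughly 0.83, compared with about 0.96 when "O" is included. For model selection, a similar effect could be obtained by passing only the entity labels to the `labels` argument of `flat_f1_score`.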

Future iterations could improve the model's ability to capture semantic relations by adding richer features such as pre-trained word embeddings (for example GloVe or FastText). On larger datasets, switching to more expressive models, such as BiLSTMs with a CRF layer, might produce better results. Performance might also improve by adding more examples of rare entities to the dataset and by addressing class imbalance further through data augmentation or class weighting. Lastly, rule-based post-processing may help correct invalid label transitions.
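
As an example of the rule-based post-processing mentioned above, the sketch below repairs one kind of invalid transition under the IOB2 scheme: an `I-TYPE` tag that does not continue an entity of the same type is rewritten to `B-TYPE`. The function name `repair_bio` and the choice of repair rule are assumptions made for illustration; other rules (for instance, relabelling the stray token as "O") are equally possible.

```python
def repair_bio(labels):
    """Rewrite I-TYPE to B-TYPE when it does not follow a tag of the same type."""
    repaired = []
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            # An I- tag is valid only directly after B-TYPE or I-TYPE of the same type.
            if previous == "O" or previous[2:] != entity_type:
                label = "B-" + entity_type
        repaired.append(label)
        previous = label
    return repaired


predicted = ["O", "I-PER", "I-PER", "B-LOC", "I-ORG"]
print(repair_bio(predicted))
```

Such a function could be applied to each sequence returned by `crf.predict` before computing the classification report.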