# Question 1.2
(a) Describe the size (number of sentences) of the training, development and test file for CoNLL2003.
Specify the complete set of all possible word labels based on the tagging scheme (IO, BIO,
etc.) you chose.

(b) Choose an example sentence from the training set of CoNLL2003 that has at least two named
entities with more than one word. Explain how to form complete named entities from the label
for each word, and list all the named entities in this sentence.

Import the relevant libraries.

In [12]:
from itertools import chain

The code below reads and parses a CoNLL-formatted file, extracting the sentences and their corresponding NER tags for further processing. Blank lines mark sentence boundaries, and `-DOCSTART-` marker lines are skipped.

In [13]:
def read_conll_file(file_path):
    sentences = []
    sentence = []
    tags = []
    tag = []

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()  # Remove leading/trailing whitespace from the line
            if len(line) == 0:  # Check if the line is empty, indicating the end of a sentence
                if sentence:
                    sentences.append(sentence)  # Append the collected sentence to the 'sentences' list
                    tags.append(tag)  # Append the collected NER tags to the 'tags' list
                sentence = []  # Reset the sentence buffer for the next sentence
                tag = []  # Reset the NER tag buffer for the next sentence
            else:
                parts = line.split()  # Split the line into parts
                word = parts[0]  # Extract the word from the line
                if word == '-DOCSTART-':  # Check if the word is '-DOCSTART-', a marker often used in CoNLL data
                    continue  # Skip this line and continue to the next line
                ner_tag = parts[-1]  # Extract the NER tag from the last part of the line
                sentence.append(word)  # Add the word to the current sentence
                tag.append(ner_tag)  # Add the NER tag to the current tags
        if sentence:
            sentences.append(sentence)  # Append the last collected sentence
            tags.append(tag)  # Append the last collected NER tags
    return sentences, tags
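
To make the expected input concrete, the sketch below parses a few in-memory lines laid out in the four-column CoNLL-2003 style (token, POS tag, chunk tag, NER tag). The helper `parse_conll_lines` and the sample lines are illustrative assumptions: the lines are invented for this example, not copied from the dataset, and the helper only mirrors the logic of `read_conll_file` so it can be tried without the data files.

```python
def parse_conll_lines(lines):
    """Group tokens and NER tags into sentences; a blank line ends a sentence."""
    sentences, tags = [], []
    words, labels = [], []
    for raw in lines:
        parts = raw.split()
        if not parts:  # blank line: sentence boundary
            if words:
                sentences.append(words)
                tags.append(labels)
            words, labels = [], []
        elif parts[0] != '-DOCSTART-':  # skip document markers
            words.append(parts[0])    # first column is the token
            labels.append(parts[-1])  # last column is the NER tag
    if words:  # flush the last sentence if there is no trailing blank line
        sentences.append(words)
        tags.append(labels)
    return sentences, tags


# Invented sample in the CoNLL-2003 column layout (not taken from the dataset)
sample_lines = [
    "-DOCSTART- -X- O O",
    "",
    "John NNP I-NP I-PER",
    "Smith NNP I-NP I-PER",
    "visited VBD I-VP O",
    "New NNP I-NP I-LOC",
    "York NNP I-NP I-LOC",
    ". . O O",
    "",
    "He PRP I-NP O",
    "left VBD I-VP O",
]

sample_sentences, sample_tags = parse_conll_lines(sample_lines)
print(sample_sentences)
print(sample_tags)
```

The `-DOCSTART-` line contributes no tokens, so the blank line after it does not create an empty sentence; this is why the sentence counts are not inflated by document markers.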


This function returns the set of unique tags found in the nested list `allTags`, which holds one list of tags per sentence.

In [14]:
def get_unique_tags(allTags):
    # Flatten the per-sentence tag lists and keep each distinct tag once
    return set(chain.from_iterable(allTags))
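
As a small illustration of the flattening step, the sketch below applies the same `chain`-based idea to a toy nested list and also counts how often each tag occurs. The name `count_tags` and the toy `example_tags` are hypothetical and only serve this example.

```python
from collections import Counter
from itertools import chain


def count_tags(all_tags):
    """Count tag occurrences across all sentences (a list of tag lists)."""
    return Counter(chain.from_iterable(all_tags))


# Toy input: two sentences with three tags each (invented for illustration)
example_tags = [['I-PER', 'I-PER', 'O'], ['I-LOC', 'O', 'O']]

tag_counts = count_tags(example_tags)
unique_tags = set(tag_counts)  # the keys are exactly the distinct tags

print(sorted(unique_tags))
print(tag_counts.most_common(1))
```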

In [15]:
# reading the data from the files
train_data, train_tags = read_conll_file('CoNLL2003_dataset/eng.train')
val_data, val_tags = read_conll_file('CoNLL2003_dataset/eng.testa')
test_data, test_tags = read_conll_file('CoNLL2003_dataset/eng.testb')

# print the number of sentences in the training, development and test dataset
print(f"The number of sentences in the training data is {len(train_data)}.")
print(f"The number of sentences in the development data is {len(val_data)}.")
print(f"The number of sentences in the test data is {len(test_data)}.")

# getting all the unique tags in the dataset
all_tags = set()
all_tags.update(get_unique_tags(train_tags))
all_tags.update(get_unique_tags(val_tags))
all_tags.update(get_unique_tags(test_tags))

# print all the tags
print(f"The set of NER tags are {all_tags}.")

The number of sentences in the training data is 14041.
The number of sentences in the development data is 3250.
The number of sentences in the test data is 3453.
The set of NER tags are {'O', 'B-ORG', 'I-ORG', 'B-MISC', 'B-LOC', 'I-MISC', 'I-LOC', 'I-PER'}.
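
The tag set printed above contains `B-` variants for ORG, MISC and LOC but no `B-PER`. This is consistent with the IOB1 scheme, in which `I-X` marks any token inside an entity of type X and `B-X` is used only when an entity starts immediately after another entity of the same type. Assuming that scheme, the sketch below shows one way to form complete entities from per-word labels for part (b): consecutive tokens with the same type are merged, and an `O` tag, a change of type, or a `B-` tag closes the current entity. The function name `extract_entities` and the example sentence are assumptions made for illustration; the sentence is invented, not taken from the training set.

```python
def extract_entities(words, tags):
    """Return (entity_text, entity_type) pairs from IOB1-tagged tokens."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition('-')
        # I-X directly after a token of the same type continues the entity
        if prefix == 'I' and ent_type == current_type:
            current_words.append(word)
            continue
        # Otherwise the open entity (if any) ends here
        if current_words:
            entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
        # Any non-O tag (B-X, or I-X with a new type) opens a new entity
        if tag != 'O':
            current_words, current_type = [word], ent_type
    if current_words:  # entity that runs to the end of the sentence
        entities.append((' '.join(current_words), current_type))
    return entities


# Invented example sentence with two multi-word entities
example_words = ['John', 'Smith', 'visited', 'New', 'York', '.']
example_labels = ['I-PER', 'I-PER', 'O', 'I-LOC', 'I-LOC', 'O']
print(extract_entities(example_words, example_labels))

# A B- tag separates two adjacent entities of the same type
print(extract_entities(['Paris', 'London'], ['I-LOC', 'B-LOC']))
```

The same function can be applied to any pair `train_data[i]`, `train_tags[i]` to list the entities of a real training sentence.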
