# CS 585 - Homework 4
Tania Soutonglang A20439949

In [1]:
import pycrfsuite

from collections import Counter

from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score

## Problem 1 - Reading the Data in ConLL Format

- Write a function that reads a .tsv files in the CoNLL format and returns two “list of lists” as output:
    - A list of sequences of tokens, where a single token may be a word or punctuation.
    - A list of sequences of tags, representing token-level annotation. You should see these 3 tags in your data (“B-Disease”, “I-Disease”, “O”)

In [2]:
def read_tsv_CoNLL(file_path):
    tokens = []
    tags = []
    
    with open(file_path, 'r', encoding='utf-8') as file:
        current_tokens = []
        current_tags = []
        
        for line in file:
            line = line.strip()
            if not line:
                if current_tokens and current_tags:
                    tokens.append(current_tokens)
                    tags.append(current_tags)
                current_tokens = []
                current_tags = []
            else:
                parts = line.split('\t')
                if len(parts) == 2:
                    token, tag = parts
                    current_tokens.append(token)
                    current_tags.append(tag)
    
    return tokens, tags

- Apply your function to train.tsv and test.tsv. To show you have read in the data correctly, show the following in your notebook output:
    - The number of sequences in train and test. (You should see 5432 sequences in train and 940 sequences in test.)
    - The tokens and tags of the first sequence in the training dataset.

In [3]:
train_tokens, train_tags = read_tsv_CoNLL("train.tsv")
test_tokens, test_tags = read_tsv_CoNLL("test.tsv")

print("training sequences: ", len(train_tokens))
print("testing sequences: ", len(test_tokens))

training sequences:  5432
testing sequences:  940


In [4]:
print(train_tokens[0])
print(train_tokens[0])

['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']


## Problem 2 - Data Discovery
In this problem you will examine the data that you read into memory in the previous problem. Using the training dataset for analysis, show the following in your notebook output:
- The count of each of the 3 tags in the training data: “B-Disease”, “I-Disease”, and “O”. Note that the most frequent token is "O", since most words are not part of a disease mention.

In [5]:
tag_counts = {}

for sequence in train_tags:
    tag_counts = Counter(sequence)
    
print(tag_counts)

Counter({'O': 27, 'I-Disease': 3, 'B-Disease': 2})


- The 20 most common words/tokens that appear with the tags “B-Disease” or “I-Disease”. That is, show words that often appear disease mentions. (You may show frequent “B-Disease” and “I-Disease” words separately, or you may combine them into a single list.)

In [6]:
# create a list of words for "B-Disease" or "I-Disease" tags
disease_words = [word for i, 
                 seq in enumerate(train_tokens) for j, 
                 word in enumerate(seq) 
                 if train_tags[i][j] in ["B-Disease", "I-Disease"]]

word_counts = Counter(disease_words)

# get the 20 most common words
word_counts.most_common(20)

[('-', 636),
 ('deficiency', 322),
 ('syndrome', 281),
 ('cancer', 269),
 ('disease', 256),
 ('of', 178),
 ('dystrophy', 176),
 ('breast', 151),
 ('ovarian', 132),
 ('X', 122),
 ('and', 120),
 ('DM', 120),
 ('ALD', 114),
 ('DMD', 110),
 ('APC', 100),
 ('disorder', 94),
 ('muscular', 94),
 ('G6PD', 92),
 ('linked', 81),
 ('the', 78)]

Review the list of words that commonly appear in disease mentions. Do you see any patterns? (You do not need to answer in writing, but it may be helpful in Problem 3 where you design a feature.) 

The most common words in the top 20 relate to cancer, and after that there are a lot of abbreviations.

## Problem 3 - Building Features
In this problem, you will build the features that you will use in your CRF model.

- Write a function that takes two inputs:
    - A sequence of tokens
    - An integer position, pointing to one token in that sequence
    
    and returns a list of features, represented as a list of strings. At minimum, include these features:
    
    - The current word/token in lower case
    - The suffix (last 3 characters) of the current word
    - The previous word/token (position i-1) or “BOS” if at the beginning of the sequence
    - The next word/token (position i+1), or “EOS” if at the beginning of the sequence
    - At least one other feature of your choice

In [7]:
def word2features(sequence, i):
    features = []
    
    # check that i is valid
    if i >= len(sequence) or i < 0:
        return features
    
    # current word in lower case
    current_word = sequence[i].lower()
    features.append(f'w.lower={current_word}'.format(current_word))
    
    # suffix
    suffix = current_word[len(current_word)-3:]
    features.append(f'w.suffix={suffix}'.format(suffix))
    
    # previous word
    prev_word = ""
    if i == 0:
        prev_word = "BOS"
        features.append(prev_word)
    else:
        prev_word = sequence[i-1]
        features.append(prev_word)
    
    # next word
    next_word = ""
    if i == len(sequence)-1:
        next_word = "EOS"
        features.append(next_word)
    else:
        next_word = sequence[i+1]
        features.append(next_word)
        
    # prefix - my choice
    prefix = current_word[:3]
    features.append(f'w.prefix={prefix}'.format(prefix))
        
    # length of current word - my choice
    length = len(current_word)
    features.append(f'w.length={length}'.format(length))
        
    return features

- Apply your function your train and test token sequences (from output of Problem 1).

In [8]:
X_train = []

for sequence in train_tokens:
    seq = []
    for i in range(len(sequence)):
        features = word2features(sequence, i)
        seq.append(features)
    X_train.append(seq)

y_train = train_tags

In [19]:
X_test = []

for sequence in test_tokens:
    seq = []
    for i in range(len(sequence)):
        features = word2features(sequence, i)
        seq.append(features)
    X_test.append(seq)
        
y_test = test_tags

- To show that you have completed this step, apply your output to the first 3 words in the first sequence of the training set. Your output should look something like this (note the names and order of your features in your notebook do not need to match this output):

In [10]:
print("Features of first 3 words:")
for i in range(3):
    print(X_train[0][i])

Features of first 3 words:
['w.lower=identification', 'w.suffix=ion', 'BOS', 'of', 'w.prefix=ide', 'w.length=14']
['w.lower=of', 'w.suffix=f', 'Identification', 'APC2', 'w.prefix=of', 'w.length=2']
['w.lower=apc2', 'w.suffix=pc2', 'of', ',', 'w.prefix=apc', 'w.length=4']


## Problem 4 - Training a CRF Model
In this problem, you will train a CRF model and evaluate it using metrics computed over individual tags.

- Using the python-crfsuite library, train a CRF sequential tagging model using feature sequences that you built in the previous step. Using your training data as input.

In [11]:
# create pycrgsuite.Trainer
trainer = pycrfsuite.Trainer(verbose=False)

for X_seq, y_seq in zip(X_train, y_train):
    trainer.append(X_seq, y_seq)
    
# set the parameters
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

# possible parameters
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [12]:
# train the model
trainer.train('conll2002-esp.crfsuite')

- Apply your model to your test dataset to generate predicted tag sequences.

In [13]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

<contextlib.closing at 0x2248b0576d0>

In [23]:
X_test[0][0]

['w.lower=clustering',
 'w.suffix=ing',
 'BOS',
 'of',
 'w.prefix=clu',
 'w.length=10']

In [26]:
# tagging a sentence to see how it works
example_sent = test_tokens[0]
print(' '.join(example_sent), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
# print("Correct:  ", ' '.join(sent2labels(example_sent)))

Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia .



In [None]:
# evaluate the model
tagger = pycrfsuite.Tagger()
tagger.open(model)
test_labels = []
predicted_labels = []

In [None]:
for features in test_features:
    predicted_labels.extend(tagger.tag(features))

- For each of the 3 labels ("B-Disease", "I-Disease", and “O") show precision, recall, f1-score. (You may use the sckit-learn function classification_report to complete this step. You may also want to “flatten” both the true and predicted tags into a single list of tags to apply this function.)

In [None]:
print(classification_report(test_labels, predicted_labels))

## Problem 5 - Inspecting the Trained Model
In this problem you will examine parameter weights assigned by your model. You can do this by calling “tagger.info().transitions” and “tagger.info().state_features” on your trained model object.

- In your notebook, show parameter weights given to transitions between the 3 tag types ("BDisease", "I-Disease", and "O").

- Refer back to the feature you designed in Problem 3 (the feature "of your choice"). Show the parameter weights assigned to this feature. You may truncate this list if it is very long. (This may happen if you included a word from the sequence in the feature name, so your feature was expanded to become a larger set of features that grows with your vocabulary)

- *IF* your feature was dropped during model training (that is, there is nothing to show in the previous step) then return to Problem 4 and design a new feature that is used in your model. 

## Problem 6 - Document Level Performance
Tag-level accuracy is easy to compute, but it is not very easy to understand. In particular, one disease reference may cover both "B-Disease" and "I-Disease" tokens. To give another view of model performance, compute document-level precision and recall on your experiment output. To do this:
- Write a function that aggregates token-level tags to a document-level label. For example, convert a tag sequence like ("O", "B-Disease", "I-Disease", "O", "O") to a single label y=1. Your function should assign y=1 to a sequence with one or more disease mentions (at least one "BDisease" tag) and y=0 to a sequence with no disease mentions.

- Apply your function to both true and predicted document-level labels from your test set. Use the output to compute document level precision and recall of your model. Show your results in your notebook.

## Problem 7 - State Transitions
The python-crfsuite library allows you to set a Boolean hyper-parameter called “feature.possible_transitions”. If this parameter is True, then the model may output tag-to-tag transitions that were never seen in training data. (You do not need to apply this parameter in your code to answer this question)

- What is an example of one tag-to-tag transition that never occurred in the training data?

- For this particular experiment, do you think it makes sense to set this parameter to True or False? That is, should you allow transitions that never occurred in the training data? Explain your answer briefly.