## Extract Features

At this point, we have **manually annotated** 19 sentences with the word `compound` used in the chemical-compound sense, and 21 sentences with the word `compound` used in the multiple/composite sense. These have been stored as an additional column in the input file.

The paper states that the Label Propagation solution works well when the manual annotations cover about 10% of the dataset. In our case, the 40 annotated sentences make up about 6% of the 668 sentences in the dataset.

In this notebook, we build feature vectors for each sentence using features described in the paper as follows.

    We used three types of features to capture contextual information: part-of-speech of neighboring words with position information, unordered single words in topical context, and local collocations

This translates into the following features for our implementation.

* TF-IDF vectors for terms, bigrams, and trigrams.
* TF-IDF vectors for part of speech trigrams.

We then compute the sentence-sentence similarity matrix using cosine similarity; this is the initial version of our graph adjacency matrix.

We then remove any self edges by zeroing out the diagonal, and finally sparsify the graph by discarding all but the top k (k=5) most similar sentences for each sentence.

In [1]:
import numpy as np
import os
import re
import spacy
import string

from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
DATA_DIR = "../data"
SENTS_FILEPATH = os.path.join(DATA_DIR, "sentences-compound-plabels.tsv")

SIM_MATRIX_FILEPATH = os.path.join(DATA_DIR, "sim-matrix.npy")

NUM_EDGES_TO_KEEP = 5

### Global variables

Declare some things that we will use multiple times later.

In [3]:
nlp = spacy.load("en_core_web_sm")
punct_pattern = "|".join(["\\" + c for c in string.punctuation])

### Clean input text

Excel went ahead and randomly decided to enclose some of the sentences in quotes. So the cleaning of input text consists of conditionally removing the enclosing quotes, lowercasing the text, and replacing punctuation with space.

In [4]:
def clean_text(sent_text, punct_pattern):
    # remove bounding quotes
    if sent_text.startswith("\"") and sent_text.endswith("\""):
        sent_text = sent_text[1:-1]
    # lowercase sentence
    sent_text = sent_text.lower()
    # replace punctuation with space
    sent_text = re.sub(punct_pattern, " ", sent_text)
    # collapse runs of whitespace (raw string avoids an invalid escape warning)
    sent_text = re.sub(r"\s+", " ", sent_text)
    return sent_text


# test
s = "\"Two of the case subjects were compound heterozygous, including for a variant observed in six control subjects, and one was homozygous.\""
print(clean_text(s, punct_pattern))

two of the case subjects were compound heterozygous including for a variant observed in six control subjects and one was homozygous 
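
As an aside, the punctuation pattern could also be written as a single character class instead of an alternation. The sketch below is a hypothetical variant (`clean_text_alt`, `PUNCT_CLASS`, and `WHITESPACE` are names introduced here, not part of the notebook) and assumes that `re.escape` safely escapes every character in `string.punctuation` for use inside a character class. It is meant to behave like `clean_text`, not to replace it.

```python
import re
import string

# Assumption: this character class matches the same characters as the
# "\\c|\\c|..." alternation built for punct_pattern in this notebook.
PUNCT_CLASS = re.compile("[" + re.escape(string.punctuation) + "]")
WHITESPACE = re.compile(r"\s+")


def clean_text_alt(sent_text):
    """Hypothetical variant of clean_text using precompiled patterns."""
    # remove bounding quotes
    if sent_text.startswith('"') and sent_text.endswith('"'):
        sent_text = sent_text[1:-1]
    # lowercase, replace punctuation with space, collapse whitespace
    sent_text = PUNCT_CLASS.sub(" ", sent_text.lower())
    return WHITESPACE.sub(" ", sent_text)


print(clean_text_alt('"Compound heterozygous, including a variant."'))
```

Precompiling the patterns once may be slightly cheaper when the function is called for every sentence, although for a few hundred sentences the difference is unlikely to matter.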


### Extracting POS tags

For each sentence, we will create the corresponding POS tag sequence as a string. This string will be passed into the Scikit-learn vectorizer similar to the sentence text.

In [5]:
def extract_pos_tags(sent_text, nlp):
    pos_tags = []
    doc = nlp(sent_text)
    for token in doc:
        pos_tags.append(token.pos_)
    # return POS tags as a string
    return " ".join(pos_tags)


# test
print(extract_pos_tags(s, nlp))

PUNCT NUM ADP DET NOUN NOUN VERB ADJ ADJ PUNCT VERB ADP DET NOUN VERB ADP NUM NOUN NOUN PUNCT CCONJ NUM VERB ADJ PUNCT PUNCT


### Vectorize text and POS sequences

For each sentence, we create TF-IDF vectors corresponding to the token n-grams (for n=1..3) and POS 3-grams. These two vectors are concatenated and represent the combined feature vector for the sentence.

We use POS tags here since we are looking at Word Sense Disambiguation, and the extra hints provided by POS tags are generally useful.
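
To make the two feature families concrete, here is a minimal pure-Python sketch of which n-grams each setting produces for a short fragment. The `ngrams` function is a hypothetical helper used only for illustration. The real `TfidfVectorizer` additionally lowercases its input, ignores single-character tokens by default, filters terms by `min_df` and `max_df`, and assigns TF-IDF weights rather than just listing the n-grams.

```python
def ngrams(tokens, n_min, n_max):
    """Hypothetical helper: list word n-grams for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


text_tokens = "two of the case".split()
pos_tokens = "NUM ADP DET NOUN".split()

text_grams = ngrams(text_tokens, 1, 3)  # analogous to ngram_range=(1, 3)
pos_grams = ngrams(pos_tokens, 3, 3)    # analogous to ngram_range=(3, 3)
print(text_grams)
print(pos_grams)
```

The POS trigrams act as a rough stand-in for the "part-of-speech of neighboring words" feature quoted from the paper, since each trigram captures a short window of local syntactic context.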

In [6]:
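def tfidf_vectorize(texts, ngram_range):
    # L2-normalized TF-IDF over n-grams; keep n-grams that occur in at least
    # 5 sentences and in at most 20% of them
    vec = TfidfVectorizer(ngram_range=ngram_range,
        use_idf=True, min_df=5, max_df=0.2,
        norm="l2")
    td_matrix = vec.fit_transform(texts)
    # return a plain ndarray rather than the legacy np.matrix from todense()
    return td_matrix.toarray()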

In [7]:
poss, texts = [], []
num_read = 0
with open(SENTS_FILEPATH, "r") as fsents:
    for line in fsents:
        # skip comment/header lines
        if line.startswith("#"):
            continue
        if num_read % 500 == 0:
            print("{:d} sentences read".format(num_read))
        # the label column is not needed for feature extraction
        pii, sent_id, sent_text, label = line.strip().split('\t')
        # POS tags come from the raw sentence, n-gram text from the cleaned one
        poss.append(extract_pos_tags(sent_text, nlp))
        texts.append(clean_text(sent_text, punct_pattern))
        num_read += 1

print("{:d} sentences read, COMPLETE".format(num_read))

0 sentences read
500 sentences read
668 sentences read, COMPLETE


In [8]:
text_matrix = tfidf_vectorize(texts, ngram_range=(1, 3))
pos_matrix = tfidf_vectorize(poss, ngram_range=(3, 3))
feat_matrix = np.hstack((text_matrix, pos_matrix))
print(text_matrix.shape, pos_matrix.shape, feat_matrix.shape)

(668, 902) (668, 499) (668, 1401)


### Compute Similarity Matrix

The similarity matrix holds cosine similarities. Its final form is the graph adjacency matrix that we will run Label Propagation against.

The similarity matrix is normalized so that every diagonal element (the cosine similarity of a sentence with itself) is 1.

We also rule out self-traversals by zeroing the diagonal elements.

Then, in order to sparsify the graph, for each sentence we discard all but its top 5 edges to neighboring sentences.

Finally, the resulting graph adjacency matrix is saved in order to be used by the next step in the pipeline.

In [9]:
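S = np.asarray(np.matmul(feat_matrix, feat_matrix.T))
# divide by the product of row norms (not just the first diagonal element)
norms = np.sqrt(np.diag(S))
norms[norms == 0] = 1.0  # guard against all-zero feature rows
S /= np.outer(norms, norms)
print(S.shape)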

(668, 668)
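
The relation between the raw dot products and cosine similarity can be checked on toy data. The sketch below assumes two random blocks standing in for the text and POS TF-IDF matrices, each L2-normalized per row as `norm="l2"` does; `l2_normalize_rows`, `block_a`, and `block_b` are illustrative names, not part of the notebook. When both blocks of a row are non-zero, the concatenated row has squared norm 2, so dividing the dot products by a single diagonal element gives the same result as dividing by the product of the row norms.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize_rows(m):
    """Scale each row to unit L2 norm (rows are assumed non-zero here)."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


# two toy feature blocks, standing in for the text and POS TF-IDF matrices
block_a = l2_normalize_rows(rng.random((4, 6)))
block_b = l2_normalize_rows(rng.random((4, 3)))
feats = np.hstack((block_a, block_b))

dots = feats @ feats.T
# each block contributes 1 to the squared norm, so the diagonal is 2
print(np.diag(dots))

# general cosine similarity: divide by the product of the row norms
norms = np.linalg.norm(feats, axis=1)
cosine = dots / np.outer(norms, norms)
print(np.allclose(cosine, dots / 2.0))
```

The shortcut no longer holds if a sentence ends up with an all-zero block, for example when all of its n-grams are filtered out by `min_df` or `max_df`. In that case the row norms differ and the general form is the safer choice.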


In [10]:
# discard self edges, i.e., diagonal = 0
np.fill_diagonal(S, 0)

In [11]:
# discard all but top N edges from the adjacency matrix
num_to_discard = S.shape[1] - NUM_EDGES_TO_KEEP
# after partitioning at -N, the last N columns of each row index its N largest
# values, so the first num_to_discard columns index the edges to drop
zero_indices = np.asarray(
    np.argpartition(S, -NUM_EDGES_TO_KEEP, axis=1))[:, :num_to_discard]
np.put_along_axis(S, zero_indices, 0, axis=1)
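
The top-k step is easier to follow on a small matrix. The sketch below uses a made-up 4x4 similarity matrix and k=2 (the notebook uses `NUM_EDGES_TO_KEEP = 5`); the values and the names `sim`, `drop`, and `sparse` are purely illustrative.

```python
import numpy as np

# toy symmetric similarity matrix with the diagonal already zeroed
sim = np.array([
    [0.0, 0.9, 0.1, 0.4],
    [0.9, 0.0, 0.8, 0.5],
    [0.1, 0.8, 0.0, 0.7],
    [0.4, 0.5, 0.7, 0.0],
])
k = 2  # edges to keep per row

# after partitioning at -k, the last k columns hold the indices of the k
# largest values of each row; the columns before them index the rest
drop = np.argpartition(sim, -k, axis=1)[:, :sim.shape[1] - k]
sparse = sim.copy()
np.put_along_axis(sparse, drop, 0.0, axis=1)
print(sparse)
```

Because the selection is done row by row, the sparsified matrix is not guaranteed to stay symmetric: in this toy example row 0 keeps its edge to row 3, while row 3 drops its edge to row 0. Whether that matters depends on how the downstream propagation step treats the adjacency matrix.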

In [12]:
np.save(SIM_MATRIX_FILEPATH, S)
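
For completeness, a minimal sketch of how a later step might read the matrix back with `np.load`. It writes a tiny stand-in matrix to a temporary directory (a hypothetical path, used so the example does not touch `SIM_MATRIX_FILEPATH`) and checks that the round trip preserves the values.

```python
import os
import tempfile

import numpy as np

adj = np.array([[0.0, 0.5], [0.5, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical path; the notebook writes to SIM_MATRIX_FILEPATH instead
    path = os.path.join(tmp_dir, "sim-matrix.npy")
    np.save(path, adj)
    restored = np.load(path)

print(restored.shape, np.array_equal(adj, restored))
```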