# Adding a Noun Phrase Dependency Feature
This notebook walks through adding a binary noun phrase (NP) dependency feature to the data.
Each token is considered in the context of its original sentence in the corpus.
spaCy's dependency parser is used to extract the noun phrases in each sentence.
Tokens found within an NP are marked with 1; all other tokens are marked with 0.

Note: Because it produced no meaningful gains in final performance, this feature is not considered in the ablation experiments in the system paper.

In [None]:
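import urllib.request  # the submodule must be imported explicitly for urlopen
from itertools import takewhile  # used by get_comments for local files
import pandas as pd
import numpy as np
import spacy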

Improve visibility when inspecting the head of dataframes

In [0]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# None removes the column-width limit (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)

In [0]:
def get_comments(filename, url=True):
    """Return the leading '#' comment lines of a URL or local file."""
    if url:
        comments = []
        with urllib.request.urlopen(filename) as f:
            for line in f:
                # Lines arrive as bytes; stop at the first non-comment line.
                if line.startswith(b'#'):
                    comments.append(line.decode("utf-8"))
                else:
                    break
        return comments
    with open(filename, 'r', encoding='utf8') as f:
        # takewhile stops at the first line that does not start with '#'.
        comments = list(takewhile(lambda s: s.startswith('#'), f))
    return comments
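
The local-file branch relies on `itertools.takewhile` to collect only the comment lines at the top of the file, whose count is later passed as `skiprows`. A minimal sketch of that behaviour, using made-up lines in place of a real file (`leading_comments` and `sample_lines` are hypothetical names, not part of the notebook):

```python
from itertools import takewhile


def leading_comments(lines):
    """Return the run of lines at the start that begin with '#'."""
    return list(takewhile(lambda s: s.startswith('#'), lines))


# Hypothetical file contents: two comment lines, a header row, then data.
sample_lines = [
    "# first comment\n",
    "# second comment\n",
    "document_id\tsent_id\ttoken\n",
    "111111111\t1\tNext\n",
    "# a later '#' line is not a leading comment\n",
]

comments = leading_comments(sample_lines)
rows_to_skip = len(comments)  # the value that would be passed as skiprows
```

Only the first two lines are collected here; a `#` line that appears after the data has started is ignored.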

In [0]:
TRAIN_DEV_URL = 'https://raw.githubusercontent.com/cicl-iscl/CyberWallE/master/data/si-train.tsv?token=AFDEFD3RCBKTJKERYHW56QS6M24HK'

In [5]:
comments = get_comments(TRAIN_DEV_URL)
train_df = pd.read_csv(TRAIN_DEV_URL, sep='\t', skiprows=len(comments), quoting=3)
train_input = train_df.groupby('sent_id')['token'].apply(list).to_frame()
print(train_df.head())
print(train_input.head())
print(list(train_df.columns))
print(train_df.shape[0])

   document_id  sent_id  token_start  token_end       token label  positive  \
0  111111111    1        0            4          Next        O     0.000000   
1  111111111    1        5            11         plague      O     0.071429   
2  111111111    1        12           20         outbreak    O     0.000000   
3  111111111    1        21           23         in          O     0.000000   
4  111111111    1        24           34         Madagascar  O     0.000000   

   negative  arglex_any  ADJ  ADP  ADV  CCONJ  DET  INTJ  NOUN  NUM  PART  \
0  0.031250  0           1    0    0    0      0    0     0     0    0      
1  0.214286  0           0    0    0    0      0    0     1     0    0      
2  0.125000  0           0    0    0    0      0    0     1     0    0      
3  0.000000  0           0    1    0    0      0    0     0     0    0      
4  0.000000  0           0    0    0    0      0    0     0     0    0      

   PRON  PROPN  PUNCT  SYM  VERB  X  
0  0     0      0      0

Concatenate tokens back into sentences so we can parse their structure

In [6]:
sents_df = pd.DataFrame({'sent_id':train_input['token'].index, 'tokenized_sent':train_input['token'].values})
sents_df["full_sents"]= sents_df["tokenized_sent"].str.join(" ")
print(sents_df.head())

   sent_id  \
0  1         
1  2         
2  3         
3  4         
4  5         

                                                                                                                                                                                                                                   tokenized_sent  \
0  [Next, plague, outbreak, in, Madagascar, could, be, ', stronger, ', :, WHO]                                                                                                                                                                      
1  [Geneva, -, The, World, Health, Organisation, chief, on, Wednesday, said, a, deadly, plague, epidemic, appeared, to, have, been, brought, under, control, in, Madagascar, ,, but, warned, the, next, outbreak, would, likely, be, stronger, .]   
2  [", The, next, transmission, could, be, more, pronounced, or, stronger, ,, ", WHO, Director, -, General, Tedros, Adhanom, Ghebreyesus, told, reporters, in, Geneva, ,, insisting,
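
The grouping and joining step can also be sketched without pandas, which may make the shape of the data easier to see. This version swaps in `itertools.groupby` for the pandas `groupby`, so it assumes the rows are already ordered by `sent_id`; the `rows` list is a hand-written stand-in for the `sent_id` and `token` columns.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (sent_id, token) pairs, ordered by sent_id.
rows = [
    (1, 'Next'), (1, 'plague'), (1, 'outbreak'),
    (2, 'Geneva'), (2, '-'), (2, 'The'),
]

# Collect the tokens of each sentence into a list, keyed by sent_id.
tokenized_sents = {
    sent_id: [token for _, token in group]
    for sent_id, group in groupby(rows, key=itemgetter(0))
}

# Join each token list with single spaces to get the sentence string.
full_sents = {
    sent_id: " ".join(tokens) for sent_id, tokens in tokenized_sents.items()
}
```

Because the tokens are joined with single spaces, the resulting strings are not identical to the original corpus text, which is why the later steps work from the token lists again.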

Check that the number of tokenized sentences matches the number of concatenated sentences, then load the spaCy model

In [7]:
sents_list = sents_df['tokenized_sent'].tolist()
full_sents_list = sents_df['full_sents'].tolist()
print(len(sents_list), len(full_sents_list))
nlp = spacy.load("en_core_web_sm")

21501 21501


Iterate through the full sentences and collect all the tokens in each sentence that occur in noun phrase chunks. Noun phrase chunking is already provided by spaCy.

In [0]:
all_marked_tokens = []
for sent in full_sents_list:
  doc = nlp(str(sent))
  # Collect the text of every token that falls inside a noun chunk.
  tokens_in_sent = [token.text for chunk in doc.noun_chunks for token in chunk]
  all_marked_tokens.append(tokens_in_sent)
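
The cell above keeps only the text of each NP token, so its position in the sentence is not recorded. One possible alternative, shown here only as a sketch and not as the notebook's method, is to keep character spans and align them with the pre-tokenized tokens. It assumes the spans would come from spaCy's `Span.start_char` and `Span.end_char` on each noun chunk; in the example they are written by hand, and `token_offsets` and `mark_tokens_in_spans` are hypothetical helper names.

```python
def token_offsets(tokens):
    """Character spans of each token within ' '.join(tokens)."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # step over the joining space
    return offsets


def mark_tokens_in_spans(tokens, spans):
    """Return 1 for tokens overlapping any (start_char, end_char) span, else 0."""
    marks = []
    for start, end in token_offsets(tokens):
        inside = any(start < s_end and s_start < end for s_start, s_end in spans)
        marks.append(int(inside))
    return marks


tokens = ['The', 'disease', 'tends', 'to', 'make', 'a', 'comeback', '.']
# Hand-written spans standing in for
# [(c.start_char, c.end_char) for c in doc.noun_chunks]
spans = [(0, 11), (26, 36)]  # "The disease" and "a comeback"
marks = mark_tokens_in_spans(tokens, spans)
```

With these spans, `The`, `disease`, `a`, and `comeback` are marked and the remaining tokens are not, regardless of whether any word is repeated elsewhere in the sentence.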

Match the tokens in NP chunks to the pre-tokenized sentences, marking 1 for a match.

In [9]:
all_isin_np_list = []
for count, sent in enumerate(sents_list):
  isin_np_list = np.zeros(len(sent))
  for i, word in enumerate(sent):
    if word in all_marked_tokens[count]:
      isin_np_list[i] = 1
      all_marked_tokens[count].remove(word)
  all_isin_np_list.append(isin_np_list)
print(all_isin_np_list[0:5])
print(len(all_isin_np_list))

[array([1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1.]), array([1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0.,
       0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.]), array([1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.]), array([1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0.,
       0., 1., 1., 0., 0., 1., 0., 1., 0.]), array([0., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.])]
21501
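
The matching step compares tokens by text and consumes one copy of a word from the NP list each time it is matched. The sketch below restates that logic as a function (`mark_by_text` is a hypothetical name) and runs it on hand-written NP token lists, which are assumptions rather than parser output. The first list is chosen so that the result matches the first array printed above.

```python
def mark_by_text(sentence_tokens, np_tokens):
    """Mark a token 1 if an unused copy of its text remains in np_tokens."""
    remaining = list(np_tokens)  # copy so the caller's list is left untouched
    marks = []
    for word in sentence_tokens:
        if word in remaining:
            marks.append(1)
            remaining.remove(word)  # consume one copy of the word
        else:
            marks.append(0)
    return marks


sentence = ['Next', 'plague', 'outbreak', 'in', 'Madagascar', 'could',
            'be', "'", 'stronger', "'", ':', 'WHO']
np_tokens = ['Next', 'plague', 'outbreak', 'Madagascar', 'WHO']  # assumed chunks
marks = mark_by_text(sentence, np_tokens)

# A made-up sentence where 'plague' occurs once outside an NP (as a verb)
# and once inside one.
repeated = ['Rats', 'plague', 'the', 'plague', 'response']
repeated_np = ['Rats', 'the', 'plague', 'response']  # assumed chunks
repeated_marks = mark_by_text(repeated, repeated_np)
```

In the second example the first `plague` is marked and the second is not, because text matching always marks the earliest occurrence of a word. Sentences that repeat a word both inside and outside an NP can therefore receive a 1 on the wrong token.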


In [10]:
np_list = [x.tolist() for x in all_isin_np_list]
print(np_list[5:10])
print(sents_list[5:10])
print(full_sents_list[5:10])

[[1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]]
[['Madagascar', 'has', 'suffered', 'bubonic', 'plague', 'outbreaks', 'almost', 'every', 'year', 'since', '1980', ',', 'often', 'caused', 'by', 'rats', 'fleeing', 'forest', 'fires', '.'], ['The', 'disease', 'tends', 'to', 'make', 'a', 'comeback', 'each', 'hot', 'rainy', 'season', ',', 'from', 'September', 'to', 'April', '.'], ['On', 'average', ',', 'between', '300', 'and', '600', 'infections', 'are', 'recorded', 'every', 'year', 'among', 'a', 'population',

Flatten the per-sentence lists into a single list so that it can be merged as one dataframe column corresponding one-to-one with the tokens

In [0]:
flat_list = [item for sublist in np_list for item in sublist]

The number of 0s and 1s equals the number of tokens in the original train+dev dataframe, so every token has been assigned an NP value

In [12]:
len(flat_list)

401288
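
The flattening and the length check can be written as an explicit assertion rather than a visual comparison. A minimal sketch on toy data, using `itertools.chain.from_iterable` in place of the nested list comprehension; the two input lists are made-up stand-ins for `np_list` and `sents_list`:

```python
from itertools import chain

# Hypothetical per-sentence marks and the token lists they belong to.
per_sentence_marks = [[1, 1, 0], [0, 1], [1]]
per_sentence_tokens = [['The', 'disease', 'tends'], ['to', 'make'], ['comeback']]

# Flatten the marks into one list, in sentence order.
flat_marks = list(chain.from_iterable(per_sentence_marks))

# Every token should have exactly one mark.
n_tokens = sum(len(tokens) for tokens in per_sentence_tokens)
assert len(flat_marks) == n_tokens
```

Equal lengths only show that every token has a value. The index-based merge additionally assumes that the flattened order matches the row order of the dataframe.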

In [0]:
isin_np_df = pd.DataFrame({'isin_np':flat_list})

In [0]:
merged_df = pd.merge(isin_np_df, train_df, left_index=True, right_index=True)

In [0]:
# Reorder the columns so that isin_np comes last; copy to avoid modifying a view later.
merged_df_1 = merged_df[['document_id', 'sent_id', 'token_start', 'token_end', 'token', 'label', 'positive', 'negative', 'arglex_any','ADJ', 'ADP', 'ADV', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'X', 'isin_np']].copy()

In [16]:
print(list(merged_df_1.columns))
print(merged_df_1.head())

['document_id', 'sent_id', 'token_start', 'token_end', 'token', 'label', 'positive', 'negative', 'arglex_any', 'ADJ', 'ADP', 'ADV', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'X', 'isin_np']
   document_id  sent_id  token_start  token_end       token label  positive  \
0  111111111    1        0            4          Next        O     0.000000   
1  111111111    1        5            11         plague      O     0.071429   
2  111111111    1        12           20         outbreak    O     0.000000   
3  111111111    1        21           23         in          O     0.000000   
4  111111111    1        24           34         Madagascar  O     0.000000   

   negative  arglex_any  ADJ  ADP  ADV  CCONJ  DET  INTJ  NOUN  NUM  PART  \
0  0.031250  0           1    0    0    0      0    0     0     0    0      
1  0.214286  0           0    0    0    0      0    0     1     0    0      
2  0.125000  0           0    0    0    0      0    0     

In [0]:
#merged_df_1 = merged_df_1.drop(columns=['Unnamed: 0'])

In [19]:
cols = ['isin_np', 'X']
# Cast both columns to integers in one vectorized step.
merged_df_1[cols] = merged_df_1[cols].astype(np.int64)
print(merged_df_1.head())

   document_id  sent_id  token_start  token_end       token label  positive  \
0  111111111    1        0            4          Next        O     0.000000   
1  111111111    1        5            11         plague      O     0.071429   
2  111111111    1        12           20         outbreak    O     0.000000   
3  111111111    1        21           23         in          O     0.000000   
4  111111111    1        24           34         Madagascar  O     0.000000   

   negative  arglex_any  ADJ  ADP  ADV  CCONJ  DET  INTJ  NOUN  NUM  PART  \
0  0.031250  0           1    0    0    0      0    0     0     0    0      
1  0.214286  0           0    0    0    0      0    0     1     0    0      
2  0.125000  0           0    0    0    0      0    0     1     0    0      
3  0.000000  0           0    1    0    0      0    0     0     0    0      
4  0.000000  0           0    0    0    0      0    0     0     0    0      

   PRON  PROPN  PUNCT  SYM  VERB  X  isin_np  
0  0     0     

In [0]:
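# to_csv returns None when given a path, so there is nothing to assign.
merged_df_1.to_csv("si-train+dependency.tsv", sep='\t', index=False)
# Download the file from the Colab runtime to the local machine.
from google.colab import files
files.download("si-train+dependency.tsv")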