<a href="https://colab.research.google.com/github/seeger22/jzou2/blob/master/token_classify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token classification

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
%env HOME=/content/drive/MyDrive/
!mkdir -p ~/Research/huggingface
%cd ~/Research/huggingface

env: HOME=/content/drive/MyDrive/
/content/drive/MyDrive/Research/huggingface


## Install huggingface stuff
Hugging Face maintains well-supported libraries for transformer models (including pretrained checkpoints) and for datasets:

* `transformers` library
* `datasets` library. Also look at [Colab notebook](https://github.com/huggingface/datasets/blob/master/notebooks/Overview.ipynb) for details.

We also need to install a library that provides sequence-labeling evaluation metrics:
* `seqeval` library

In [None]:
# Run this cell only the first time you use this notebook. The clones give us a local copy of each library that can be pinned to ensure reproducibility.
!git clone https://github.com/huggingface/transformers.git
!git clone https://github.com/huggingface/datasets.git

In [None]:
# Run this cell every time you start this Colab notebook
!pip install seqeval

In [None]:
# Run this cell every time you start this Colab notebook
#%cd ~/Research/huggingface/transformers/
#!python setup.py install

!pip install transformers

In [None]:
# Run this cell every time you start this Colab notebook
#%cd ~/Research/huggingface/datasets/
#!python setup.py install

!pip install datasets

In [8]:
# Run this cell every time you start this Colab notebook

# Make sure that we have a recent version of pyarrow in the session before we continue - otherwise reboot Colab to activate it
import pyarrow
if int(pyarrow.__version__.split('.')[1]) < 16 and int(pyarrow.__version__.split('.')[0]) == 0:
    import os
    os.kill(os.getpid(), 9)

In [9]:
# sanity check
from transformers import BertForTokenClassification
from datasets import ClassLabel, load_dataset

## Run torch example
* See `transformers/examples`; it contains many other task examples to copy from.
* Refer to `transformers/examples/README.md`.

In [None]:
%cd ~/Research/huggingface/transformers/examples/token-classification
!bash run.sh

# Task 1: Run on DSTC9 data
* Goal: Extract named entities, such as hotel and restaurant names, that are defined in a provided knowledge base file.
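
To make the input concrete, here is a minimal sketch of how entity names could be pulled out of the knowledge base and tokenized. It assumes the nested shape `{category: {id: {'name': ...}}}` described in the `get_bt` docstring further down; the entries in `knowledge`, and the helper names `tokenize` and `entity_names`, are hypothetical and only for illustration.

```python
import re

# Hypothetical knowledge-base fragment; the real entries come from the knowledge file.
knowledge = {
    "restaurant": {"0": {"name": "Jade Garden"}, "1": {"name": "McDonald's"}},
    "hotel": {"0": {"name": "Avalon Hotel"}, "1": {"name": None}},
}

def tokenize(text):
    """Lowercase, split off punctuation, and separate a trailing clitic such as 's."""
    tokens = []
    for piece in re.findall(r"[\w']+|[.,!?;]", text.lower().strip()):
        match = re.match(r"^(\w+)('\w+)$", piece)
        tokens.extend(match.groups() if match else [piece])
    return tokens

def entity_names(kb):
    """Yield (category, tokens) for every record that has a usable name."""
    for category, records in kb.items():
        for record in records.values():
            name = record.get("name")
            if name:  # skip records whose name is missing or None
                yield category, tokenize(name)

entities = list(entity_names(knowledge))
for category, tokens in entities:
    print(category, tokens)
```

Records without a name are skipped, so only real entity names end up in the dictionary that is later used for matching.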

In [None]:
!git clone https://github.com/alexa/alexa-with-dstc9-track1-dataset.git

In [18]:
%cd ~/Research/ner/

/content/drive/MyDrive/Research/ner


In [19]:
%%writefile BabyTrie.py
import re

class BabyTrie:#baby version of Trie
    class TrieNode:#Node within a Trie
        def __init__(self,word=None):
            self.children={}#dictionary of TrieNodes
            self.markers=[]#list of markers/categories
            self.end=False#the Node is leaf, or end of word
            self.word=word#in case needed
    
    def __init__(self):
        self.root=self.new_node()

    def new_node(self,word=None):
        #Creates TrieNode object with a given word
        return self.TrieNode(word)

    def insert(self,lst,cat):
        '''
        Given a list of words that make up a Named Entity and its category,
        inserts the Entity into the Trie.

        Note: the structure resembles a tree, i.e. words as nodes, and at the end
        of the inserted word, the node is made a leaf node (end=True), and a Marker
        is added to the node's list of markers.
        -------------------------------------------------------------------
        Example:
        bt=BabyTrie()
        possible_NE=['jade','garden']#let's say that this is a restaurant; words are lowercased because isinTrie lowercases the sentence
        bt.insert(possible_NE,'restaurant')
        '''
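        lst=list(lst)#copy so that the caller's list is not mutated
        if not lst or lst[-1]!=cat:#make sure the category is the last element
            lst.append(cat)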
        '''
        for the sake of more possible matches:
        If inserted "Jade Garden" and it's a restaurant,
        "Jade Garden" will be recognized as an entity;
        and "Jade Garden Restaurant" will also be one.
        '''
        ptr=self.root
        for i in range(0,len(lst)-1):#for every elem but not the last (since it is the category)
            if lst[i] not in ptr.children.keys():#if already a key, skip; else add new
                new=self.new_node(lst[i])
                ptr.children[lst[i]]=new
            ptr=ptr.children[lst[i]]
        ptr.end=True#is end of word
        if cat not in ptr.markers:#Only adds new categories to the list of markers 
            ptr.markers.append(cat)
        if lst[-1] not in ptr.children.keys():#avoid possible conflict with same name
            new=self.new_node(lst[-1])
            ptr.children[lst[-1]]=new#adds the category as one of the leaf node
        ptr=ptr.children[lst[-1]]
        ptr.end=True
        if cat not in ptr.markers:#only adds category if not in list of markers
            ptr.markers.append(cat)
    
    def isinTrie(self,sen):
        r'''
        Given a sentence/string,
        returns a tuple ((1),(2)) where:
                 (1) is a list of the stripped, lowercased, and split (including punctuation)
        sentence, mainly for the convenience of tokenization.
                 (2) is a dictionary with categories as keys and, as values, lists of tuples where
        the first index is the starting position and the second index is the (inclusive) ending
        position of a Named Entity in the sentence.

        Note: these Named Entities are defined on the BabyTrie.
        Note also:
        1. While cleaning up the sentence, a token such as "McDonald's" or anything
        matching the RE ^(\w+)(\'\w+)$ is split into (\w+) and (\'\w+), i.e. "McDonald" and "'s"
        2. While cleaning up the dictionary for output, the longest instance is always taken.
        A node can also have multiple markers; when both instances occur and have exactly
        the same length, both are counted towards the final result. For further
        information please see the comment for each condition and the final output.
        -------------------------------------------------------------------------------------
        Example:
        #bt is already defined in the above example, which has an inserted entity
        test_sentence='I went to Jade Garden last night.'
        res=bt.isinTrie(test_sentence)
        print(res[0],'\n',res[1])

        Output:
        ['i','went','to','jade','garden','last','night','.']
        {'restaurant':[(3,4)]}
        '''
        dic={}#the uncleaned version of returned dictionary
        flag=False#=True when a phrase in the sentence does not match the trie anymore
        lst=[]#the list that will ultimately be returned
        new_sen=sen.lower().strip()#cleans up sentence
        plst=re.findall(r"[\w']+|[.,!?;]", new_sen)#premature list used for clean-up
        p=re.compile(r"^(\w+)(\'\w+)$")
        for i in range (len(plst)):
            if p.match(plst[i]):
                new_word1,new_word2=p.match(plst[i]).group(1),p.match(plst[i]).group(2)
                lst.append(new_word1)
                lst.append(new_word2)
            else:
                lst.append(plst[i])
        
        #the returned list is completely formed at this point, we use it to generate dictionary
        for i in range(len(lst)):
            ptr=self.root
            if lst[i] in ptr.children:
                rstart=i#starting position of NE
                rend=i#ending position of NE; stays at i when the match is the last token of the sentence
                ptr=ptr.children[lst[i]]
                for j in range(i+1,len(lst)):
                    if lst[j] not in ptr.children:#the phrase does not match the trie anymore
                        break
                    ptr=ptr.children[lst[j]]
                    rend=j

                if ptr.end:
                    for marker in ptr.markers:
                        #if already a key, add to value; if not, make new entry
                        dic.setdefault(marker,[]).append((rstart,rend))

        #cleaning up dictionary
        rdic={}#clean version of the dictionary that is ultimately returned
        for cat in dic.keys():
            rdic[cat]=[]
            rlst=sorted(dic[cat],key=lambda item:item[1]-item[0],reverse=True)
            rdic[cat].append(rlst[0])
            for i in range(1,len(rlst)):
                rflag=True
                counter=1
                for rcat in rdic.keys():#make sure no overlap: ex. 'some hotel diner' > 'some hotel'
                    for elem in rdic[rcat]:
                        if (rlst[i][0]<elem[0] and rlst[i][1]<elem[0]):#ex. if we have (3,5) we can take (1,2)
                            continue
                        elif (rlst[i][0]>elem[1] and rlst[i][1]>elem[1]):#ex. if we have (3,5) we can take (6,7)
                            continue
                        else:# any equals, [0] or [1] will not work: ex. we have (3,5). cannot have (3,4) or (4,5).
                            rflag=False
                if rflag:
                    rdic[cat].append(rlst[i])
        return (lst,rdic)


Overwriting BabyTrie.py


In [32]:
%%writefile preprocess.py
import json
import re
import argparse
from BabyTrie import BabyTrie
from argparse import Namespace

def get_bt(dic):
    '''
    Given a dictionary with categories of entities in the form:
    {'some category':{'0':{'name':'some name'},'1':{'name':'some other name'},...},'some other category':{'0':...},...}
    Return a BabyTrie that contains all entities, with the leaf nodes containing a list of their corresponding categories
    
    Note: A large portion of this code under the if statement is cleaning up the list of words within the name phrase
    -------------------------------------------------------------------------------------------------------------------
    Example:
    dic={'restaurant':{'0':{'name':'Jade Garden'}}}
    bt=get_bt(dic)#the name 'jade' and 'garden' are created as nodes and inserted into the BabyTrie
    
    '''
    bt=BabyTrie()
    for cat in dic.keys():
        for elem in dic[cat]:
            record = dic[cat][elem]
            if ('name' in record and record['name'] is not None):
                new_name=record['name'].lower().strip()#cleans up name
                lst=[]#final list that is used to insert into bt
                plst=re.findall(r"[\w']+|[.,!?;]", new_name)#premature list, can have ex. "McDonald's" instead of "McDonald" and "'s"
                p=re.compile(r"^(\w+)(\'\w+)$")
                for i in range (len(plst)):
                    if p.match(plst[i]):
                        new_word1,new_word2=p.match(plst[i]).group(1),p.match(plst[i]).group(2)
                        lst.append(new_word1)
                        lst.append(new_word2)
                    else:
                        lst.append(plst[i])
                bt.insert(lst,cat)#only insert when the record actually has a name
    return bt

def extraction_of_log1(dic):#will generate dupes. Eliminate after everything
    lst=[]
    for i in range(0,len(dic)):
        for elem in dic[i]:
            lst.append(elem["text"])
    return lst

def extraction_of_log2(dic):
    lst=[]
    for cat in dic.keys():
        for elem in dic[cat]:
            for dialogue_num in dic[cat][elem]['docs']:
                lst.append(dic[cat][elem]['docs'][dialogue_num]['title'])
                lst.append(dic[cat][elem]['docs'][dialogue_num]['body'])
    return lst

def printres(ofstream,res):
    '''
    Given an output stream and a result tuple generated by the isinTrie method of BabyTrie,
    writes the result to the stream as two tab-separated columns (token, label),
    followed by a blank line that marks the end of the sentence.
    Note: the labeling of 'B_cat','I_cat', and 'O' is used.
    B_cat = beginning of an entity of a category
    I_cat = intermediate (in the middle) of an entity of a category
    O = not an entity
    If a token belongs to entities of several categories, its labels are joined with ' | '.
    ----------------------------------------------------------------------------------------
    Example: (same output from isinTrie example)
    import sys
    res=(['i','went','to','jade','garden','last','night','.'], {'restaurant':[(3,4)]})
                        lst                                            dic
    printres(sys.stdout,res)

    Output (the two columns are separated by a tab):
    i              O
    went           O
    to             O
    jade           B_restaurant
    garden         I_restaurant
    last           O
    night          O
    .              O
    '''
    lst=res[0]
    dic=res[1]
    sen_label=[[word,'O'] for word in lst]
    for cat in dic.keys():
        for item in dic[cat]:
            start,end=item[0],item[1]
            first=True
            for i in range(start,end+1):
                if (sen_label[i][-1]!='O'):
                    if first:
                        record=' | B_'+cat
                        first=False
                    else:
                        record=' | I_'+cat
                    sen_label[i][-1]+=record
                else:
                    if first:
                        sen_label[i][-1]='B_'+cat
                        first=False
                    else:
                        sen_label[i][-1]='I_'+cat
    for elem in sen_label:
        ofstream.write('{}\t{}\n'.format(elem[0],elem[1]))
    ofstream.write('\n')

def main():
    parser=argparse.ArgumentParser(description='NER preprocess dataset')
    parser.add_argument("--log_file",type=str,default='logs.json',help="input log file in json")
    parser.add_argument("--knowledge_file",type=str,default='knowledge.json',help="input knowledge file in json")
    parser.add_argument("--output_file",type=str,default='output.txt',help="output file for NER training")
    parser.add_argument("--extract_lognum",type=int,default=2,choices=[1,2],help="1 for log, 2 for log included in knowledge base")
    args = parser.parse_args()
    
    #load the dialogue logs and the knowledge base
    with open(args.log_file,'r') as logs1:
        data1=json.load(logs1)
    with open(args.knowledge_file,'r') as k:
        dic=json.load(k)

    bt=get_bt(dic)#setting up bt based on knowledge base
    if args.extract_lognum==1:
        dset=extraction_of_log1(data1)#extract logs1
    elif args.extract_lognum==2:
        dset=extraction_of_log2(dic)#extract additional data from knowledge base
    else:
        raise ValueError('--extract_lognum must be 1 or 2')

    with open(args.output_file,'w') as outFile:
        for sen in dset:#write results of log to file
            res=bt.isinTrie(sen)
            printres(outFile,res)
    
if __name__=='__main__':
    main()


Overwriting preprocess.py
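
For downstream training it helps to read the generated file back into sentences. The sketch below assumes the format written by `printres` above: one `token<TAB>label` pair per line and a blank line after each sentence. The function name `read_labeled_sentences` and the sample text are hypothetical.

```python
import io

def read_labeled_sentences(stream):
    """Parse token<TAB>label lines; a blank line ends a sentence."""
    sentences, tokens, labels = [], [], []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, labels))
    return sentences

# Hypothetical sample in the same format that printres writes.
sample = io.StringIO(
    "i\tO\nwent\tO\nto\tO\njade\tB_restaurant\ngarden\tI_restaurant\n.\tO\n\n"
    "thanks\tO\n\n"
)
parsed = read_labeled_sentences(sample)
print(parsed[0])
```

Labels joined with ` | ` (tokens covered by more than one category) come back as a single string, so a training script would still need to decide how to resolve them.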


In [44]:
!git init
!git config --global user.email "seeger22@126.com"
!git config --global user.name "seeger22"
!git add -A
!git commit -m "first commit"

'''
!mkdir ./temp
!git clone https://github.com/seeger22/ner.git ./temp
!rsync -aP --exclude=data/ "drive/MyDrive/Research/ner"/* ./temp

%cd ./temp
!git add .
!git commit -m '"yuh"'
!git config --global user.email "seeger22@126.com"
!git config --global user.name "seeger22"
!git push origin "master"
%cd /content
!rm -rf ./temp
'''

Reinitialized existing Git repository in /content/.git/
^C
On branch master

Initial commit

Untracked files:
	.config/
	BabyTrie.py
	drive/
	preprocess.py
	sample_data/

nothing added to commit but untracked files present



# Task 2: Incorporate dictionary features into the model

* Use Trie to locate the positions of each entity in a sentence.
* Feed in positions as extra features into the model.
* During test time, we can add new entities from knowledge base into the dictionary to improve model robustness on recognizing the new named entities not covered during training.
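
One possible way to turn the Trie matches into extra features is sketched below. It assumes spans in the form returned by `isinTrie` (a dictionary mapping each category to a list of inclusive `(start, end)` token positions) and encodes them as two binary flags per category, one for the beginning of an entity and one for its continuation. The function name `dictionary_features`, the category list, and the encoding itself are assumptions for illustration; how the rows are fed into the model (for example, concatenated with token embeddings) and how they are aligned with subword tokens is left open.

```python
def dictionary_features(num_tokens, spans, categories):
    """Return one feature row per token: [B, I] flags for each category, in order."""
    col = {cat: 2 * k for k, cat in enumerate(categories)}
    feats = [[0] * (2 * len(categories)) for _ in range(num_tokens)]
    for cat, cat_spans in spans.items():
        for start, end in cat_spans:
            feats[start][col[cat]] = 1            # token begins an entity of this category
            for pos in range(start + 1, end + 1):
                feats[pos][col[cat] + 1] = 1      # token continues that entity
    return feats

tokens = ['i', 'went', 'to', 'jade', 'garden', 'last', 'night', '.']
spans = {'restaurant': [(3, 4)]}       # same shape as the second element returned by isinTrie
categories = ['hotel', 'restaurant']   # hypothetical fixed category order

features = dictionary_features(len(tokens), spans, categories)
for token, row in zip(tokens, features):
    print(token, row)
```

Because the features depend only on the dictionary lookup, entities added to the Trie at test time change the feature rows without retraining the matcher itself.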