In this notebook, I will convert the data into a format suitable for use with the linear-chain CRF library. The format in question is IOB, which tags individual tokens within sentences. The token at the beginning of an aspect phrase is tagged 'B'; tokens within an aspect phrase but not at its beginning are tagged 'I'; and tokens not belonging to an aspect phrase are tagged 'O'.
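To make the tagging scheme concrete, here is a minimal sketch in plain Python. The helper name `iob_tags` and the convention of passing aspect phrases as (start, end) token indices with an exclusive end are assumptions for illustration only; they are not part of the dataset or of any library.

```python
def iob_tags(tokens, aspect_spans):
    """Return one IOB tag per token.

    aspect_spans is a list of (start, end) token indices, end exclusive
    (a hypothetical convention chosen for this sketch).
    """
    tags = ["O"] * len(tokens)
    for start, end in aspect_spans:
        tags[start] = "B"                      # first token of the phrase
        for position in range(start + 1, end):
            tags[position] = "I"               # remaining tokens of the phrase
    return tags


tokens = ["I", "like", "thin", "crusted", "pizza", "."]
tags = iob_tags(tokens, [(2, 5)])
print(list(zip(tokens, tags)))
```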

In [1]:
import pickle
with open('restaurant.pickle', 'rb') as handle:
    data = pickle.load(handle)

In [2]:
data

Unnamed: 0,id,text,aspect_terms,aspect_categories
0,3121,But the staff was so horrible to us.,"[{'term': 'staff', 'polarity': 'negative', 'fr...","[{'category': 'service', 'polarity': 'negative'}]"
1,2777,"To be completely fair, the only redeeming fact...","[{'term': 'food', 'polarity': 'positive', 'fro...","[{'category': 'food', 'polarity': 'positive'},..."
2,1634,"The food is uniformly exceptional, with a very...","[{'term': 'food', 'polarity': 'positive', 'fro...","[{'category': 'food', 'polarity': 'positive'}]"
3,2534,Where Gabriela personaly greets you and recomm...,[],"[{'category': 'service', 'polarity': 'positive'}]"
4,583,"For those that go once and don't enjoy it, all...",[],"[{'category': 'anecdotes/miscellaneous', 'pola..."
...,...,...,...,...
3039,1063,But that is highly forgivable.,[],"[{'category': 'anecdotes/miscellaneous', 'pola..."
3040,777,"From the appetizers we ate, the dim sum and ot...","[{'term': 'appetizers', 'polarity': 'positive'...","[{'category': 'food', 'polarity': 'positive'}]"
3041,875,"When we arrived at 6:00 PM, the restaurant was...",[],"[{'category': 'anecdotes/miscellaneous', 'pola..."
3042,671,Each table has a pot of boiling water sunken i...,"[{'term': 'table', 'polarity': 'neutral', 'fro...","[{'category': 'food', 'polarity': 'neutral'}]"


In [4]:
data["aspect_terms"][0]

[{'term': 'staff', 'polarity': 'negative', 'from': 8, 'to': 13}]

The first thing I want to do is convert the from and to fields, which are represented here as a character span, into updated fields represented as a token span.

I think the only sane way to do this is to also get each token's character span in the sentence, so that we can match the two up.

I suppose the way to do it would be to find the indices of the span in the sentence that matches the current token. If there is a single match, that's fine. If there are multiple matches, use some sort of counter to track the current state (e.g. use the spans of the previous and next detected words as lower and upper bounds) in order to resolve it. If there are no matches, we probably have a data error.
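As a rough illustration of the idea of pairing each token with its character span, the sketch below uses a simple regular-expression tokeniser as a stand-in. This is an assumption for illustration: it is not the tokeniser used in this notebook, and the function names `token_char_spans` and `tokens_in_span` are hypothetical.

```python
import re


def token_char_spans(sentence):
    """Return (text, start, end) per token; end is exclusive.

    Words and single punctuation marks are treated as tokens
    (a simplification, not spaCy's tokenisation rules).
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]


def tokens_in_span(sentence, char_from, char_to):
    """Return indices of the tokens lying fully inside [char_from, char_to)."""
    return [i for i, (_, start, end) in enumerate(token_char_spans(sentence))
            if start >= char_from and end <= char_to]


sentence = "But the staff was so horrible to us."
spans = token_char_spans(sentence)
matched = tokens_in_span(sentence, 8, 13)   # 'from' and 'to' of the 'staff' aspect
print(matched, [spans[i][0] for i in matched])
```

An empty result from `tokens_in_span` would correspond to the "no matches" case described above, which probably signals a data error.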

In [5]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [6]:
def check_perfect_match(doc, aspects):
    """ Detect if we can find the aspects as a direct match """ 
    detected = [x.text for x in doc for aspect in aspects if x.text == aspect['term']]
    match = set([x['term'] for x in aspects]) == set(detected)
    return match

OK, let's test out this function on one of the rows in our data - can we match the aspects to the tokens in the sentence?

In [7]:
i = 5
sentence = data["text"][i]
doc = nlp(sentence)
aspects = data["aspect_terms"][i]

print(doc)
print(aspects)
check_perfect_match(doc, aspects)

Not only was the food outstanding, but the little 'perks' were great.
[{'term': 'food', 'polarity': 'positive', 'from': 17, 'to': 21}, {'term': 'perks', 'polarity': 'positive', 'from': 51, 'to': 56}]


True

That seems to have worked! Let's go through our data and see if this works for every row or not.

In [10]:
# find the ones we can't match
for i in range(len(data)):
    sentence = data["text"][i]
    doc = nlp(sentence)
    aspects = data["aspect_terms"][i]
    match = check_perfect_match(doc, aspects)
    if not match:
        print(i)

7
16
26
28
31
32
41
48
50
51
52
54
57
59
60
69
72
77
82
93
105
106
108
114
121
126
127
129
130
133
134
135
136
141
142
146
147
149
150
152
153
160
162
165
169
170
176
177
185
187
188
195
198
199
204
208
220
221
226
230
232
235
239
242
246
252
253
254
258
259
262
269
275
279
282
283
284
285
293
294
298
302
305
314
324
330
332
336
338
343
347
349
359
361
380
388
391
392
393
404
407
410
414
415
416
417
424
433
435
442
443
448
450
452
455
458
462
467
477
489
500
502
507
508
513
514
515
517
519
526
528
534
538
539
540
541
547
549
555
561
575
577
589
593
601
604
605
616
618
627
629
631
633
636
642
643
653
661
662
664
667
670
675
677
682
685
686
687
694
697
699
700
703
705
708
710
716
717
721
730
733
735
736
738
744
746
748
753
756
762
771
779
780
781
785
788
789
790
791
802
806
807
815
817
820
821
828
832
841
843
844
846
847
848
854
857
858
861
866
867
870
871
877
879
882
883
886
898
899
902
904
910
916
917
919
920
926
934
939
940
950
951
952
954
956
960
966
968
972
975
978
980
983
985
986
9

There are lots of mismatches!  Let's look deeper at one of them to find the source(s) of error.

In [10]:
i = 16
sentence = data["text"][i]
doc = nlp(sentence)
aspects = data["aspect_terms"][i]
print(sentence)
print(aspects)

The pizza is the best if you like thin crusted pizza.
[{'term': 'pizza', 'polarity': 'positive', 'from': 4, 'to': 9}, {'term': 'thin crusted pizza', 'polarity': 'neutral', 'from': 34, 'to': 52}]


OK, so this method doesn't pick up multiword phrases. I suppose the way to do it would be to tokenise the aspect term with spaCy as well, so that its tokens line up exactly with the tokens of the sentence, and then find the first word and step through the rest one token at a time.
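The failure can be reproduced without spaCy. The sketch below uses plain string lists in place of spaCy tokens (an assumption for illustration) to show why a single-token comparison misses 'thin crusted pizza', and how comparing token windows would find it. The helper `contains_phrase` is hypothetical.

```python
tokens = ["The", "pizza", "is", "the", "best", "if", "you", "like",
          "thin", "crusted", "pizza", "."]
terms = ["pizza", "thin crusted pizza"]

# Single-token comparison: a multiword term never equals any one token.
detected = {token for token in tokens if token in terms}
print(detected, set(terms) == detected)


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in tokens."""
    width = len(phrase_tokens)
    return any(tokens[i:i + width] == phrase_tokens
               for i in range(len(tokens) - width + 1))


# Splitting the term into tokens first makes the comparison possible.
print(contains_phrase(tokens, "thin crusted pizza".split()))
```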

In [11]:
doc

The pizza is the best if you like thin crusted pizza.

In [12]:
j = 1
phrase = nlp(aspects[j]['term'])

In [13]:
def find_phrase(doc, phrase):
    """Tag the tokens of doc with IOB tags for phrase, stopping once the phrase is complete."""
    phrase_position = 0
    sentence_position = 0

    tagged_tokens = []
    # while we are not at the end of the sentence and have not detected the full phrase
    while sentence_position < len(doc) and phrase_position < len(phrase):
        current_target = phrase[phrase_position]
        current_token = doc[sentence_position]
        if current_token.text == current_target.text:
            # the first phrase token opens the span; later ones continue it
            if phrase_position == 0:
                tagged_tokens.append((current_token, 'B'))
            else:
                tagged_tokens.append((current_token, 'I'))

            phrase_position += 1
            sentence_position += 1

        else:
            tagged_tokens.append((current_token, 'O'))
            sentence_position += 1

    return tagged_tokens

find_phrase(doc, phrase)

[(The, 'O'),
 (pizza, 'O'),
 (is, 'O'),
 (the, 'O'),
 (best, 'O'),
 (if, 'O'),
 (you, 'O'),
 (like, 'O'),
 (thin, 'B'),
 (crusted, 'I'),
 (pizza, 'I')]

The problem we now have is that if there are two phrases, we want to search for the first one (as they are ordered) and then, upon its termination, search immediately for the next one. This will prevent us from getting multiple matches.
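One way this ordered search could look is sketched below, again on plain string lists rather than spaCy tokens. The function name `find_phrases_in_order` is hypothetical, and the sketch assumes the phrases arrive in sentence order; a phrase that is not found after the cursor is silently skipped.

```python
def find_phrases_in_order(tokens, phrases):
    """Tag each phrase at its first occurrence after the previous match."""
    tags = ["O"] * len(tokens)
    cursor = 0  # search for the next phrase starts here
    for phrase in phrases:
        width = len(phrase)
        for start in range(cursor, len(tokens) - width + 1):
            if tokens[start:start + width] == phrase:
                tags[start] = "B"
                tags[start + 1:start + width] = ["I"] * (width - 1)
                cursor = start + width  # resume right after this match
                break
    return tags


tokens = ["The", "pizza", "is", "the", "best", "if", "you", "like",
          "thin", "crusted", "pizza", "."]
phrases = [["pizza"], ["thin", "crusted", "pizza"]]
ordered_tags = find_phrases_in_order(tokens, phrases)
print(list(zip(tokens, ordered_tags)))
```

Because the cursor only moves forward, the second 'pizza' can only be claimed by the second phrase, never matched again by the first.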

Before we make life too complicated, can we do this with spacy? I think I recall spacy tokens having an attribute representing their position in a sentence.

In [14]:
[(token.text, token.idx) for token in doc]

[('The', 0),
 ('pizza', 4),
 ('is', 10),
 ('the', 13),
 ('best', 17),
 ('if', 22),
 ('you', 25),
 ('like', 29),
 ('thin', 34),
 ('crusted', 39),
 ('pizza', 47),
 ('.', 52)]

Yup! We have the token.idx property and the numbers match! Let's write a function that works with this instead!

I ended up going with a list of dicts, as that's the format the CRF code expects, and it makes the logic of the code flow better than it did with a list of tuples.

In [16]:
# initialise every token as untagged
tagged_tokens = [{"token": token, "iob_tag": 'O'} for token in doc]

# Use this function if anything goes wrong with the character one
def tag_phrase_backup(tokens, phrase, char_from, char_to):
    """Tag the first token-by-token match of phrase; char_from and char_to are unused."""
    phrase_position = 0
    sentence_position = 0

    # while we are not at the end of the sentence and have not detected the full phrase
    while sentence_position < len(tokens) and phrase_position < len(phrase):
        current_target = phrase[phrase_position]
        current_token = tokens[sentence_position]

        if current_token["token"].text == current_target.text:
            if phrase_position == 0:
                current_token["iob_tag"] = "B"
            else:
                current_token["iob_tag"] = "I"

            phrase_position += 1
            sentence_position += 1

        else:
            sentence_position += 1

    return tokens

def tag_phrase(tokens, phrase, char_from, char_to):
    """Tag the token starting at char_from as 'B' and tokens starting inside
    (char_from, char_to) as 'I'; phrase is unused."""
    for current_token in tokens:
        token_start = current_token["token"].idx
        if token_start == char_from:
            current_token["iob_tag"] = "B"
        elif char_from < token_start < char_to:
            current_token["iob_tag"] = "I"
        elif token_start > char_to:
            # tokens are in sentence order, so nothing later can match
            break

    return tokens
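The comparison in `tag_phrase` relies on a token starting exactly at the aspect's from offset. As a hedged alternative, the sketch below phrases the same idea as an overlap test, so a token is tagged whenever its character range intersects the aspect's range. It works on plain (text, idx) pairs rather than spaCy tokens, and `tag_by_overlap` is a hypothetical name; whether looser matching is desirable for this dataset is an open question.

```python
def tag_by_overlap(tokens, char_from, char_to):
    """Return IOB tags for one aspect spanning [char_from, char_to).

    tokens is a list of (text, idx) pairs in sentence order, where idx
    is the character offset at which the token starts.
    """
    tags = []
    inside = False  # True while the previous token was part of the aspect
    for text, idx in tokens:
        token_end = idx + len(text)
        overlaps = idx < char_to and token_end > char_from
        if overlaps:
            tags.append("I" if inside else "B")
        else:
            tags.append("O")
        inside = overlaps
    return tags


pizza_tokens = [("The", 0), ("pizza", 4), ("is", 10), ("the", 13),
                ("best", 17), ("if", 22), ("you", 25), ("like", 29),
                ("thin", 34), ("crusted", 39), ("pizza", 47), (".", 52)]
overlap_tags = tag_by_overlap(pizza_tokens, 34, 52)
print(overlap_tags)

# An aspect that starts mid-token is still picked up by the overlap test.
partial_tags = tag_by_overlap([("half-price", 0), ("sushi", 11)], 5, 16)
print(partial_tags)
```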

In [17]:
for aspect in aspects:
    char_from = aspect["from"]
    char_to = aspect["to"]
    tagged_tokens = tag_phrase(tagged_tokens, phrase, char_from, char_to)
tagged_tokens

[{'token': The, 'iob_tag': 'O'},
 {'token': pizza, 'iob_tag': 'B'},
 {'token': is, 'iob_tag': 'O'},
 {'token': the, 'iob_tag': 'O'},
 {'token': best, 'iob_tag': 'O'},
 {'token': if, 'iob_tag': 'O'},
 {'token': you, 'iob_tag': 'O'},
 {'token': like, 'iob_tag': 'O'},
 {'token': thin, 'iob_tag': 'B'},
 {'token': crusted, 'iob_tag': 'I'},
 {'token': pizza, 'iob_tag': 'I'},
 {'token': ., 'iob_tag': 'O'}]

Let's check that we have the same number of words tagged in each. As we have used the from and to fields for the tagging, we don't need to check the sets.

In [18]:
def check_tags(tagged_tokens, aspects):
    word_count_tags = len([x['iob_tag'] for x in tagged_tokens if x['iob_tag'] in ['B', 'I']])
    term_lengths = sum([len(nlp(x['term'])) for x in aspects])
    return word_count_tags == term_lengths

In [19]:
check_tags(tagged_tokens, aspects)

True

And now let's test out our tagging method.

In [20]:
token_list = []
for i in range(len(data)):
    
    # extract row components
    sentence = data["text"][i]
    doc = nlp(sentence)
    aspects = data["aspect_terms"][i]
    tagged_tokens = [{"token": token, "iob_tag": 'O'} for token in doc]

    # cycle through all aspects
    for aspect in aspects:
        char_from = aspect["from"]
        char_to = aspect["to"]
        tagged_tokens = tag_phrase(tagged_tokens, phrase, char_from, char_to)
    
    # check aspects and tags match
    if not check_tags(tagged_tokens, aspects):
        print("Mismatch on row " + str(i))
    else:
        token_list.append(tagged_tokens)

In [21]:
token_list

[[{'token': But, 'iob_tag': 'O'},
  {'token': the, 'iob_tag': 'O'},
  {'token': staff, 'iob_tag': 'B'},
  {'token': was, 'iob_tag': 'O'},
  {'token': so, 'iob_tag': 'O'},
  {'token': horrible, 'iob_tag': 'O'},
  {'token': to, 'iob_tag': 'O'},
  {'token': us, 'iob_tag': 'O'},
  {'token': ., 'iob_tag': 'O'}],
 [{'token': To, 'iob_tag': 'O'},
  {'token': be, 'iob_tag': 'O'},
  {'token': completely, 'iob_tag': 'O'},
  {'token': fair, 'iob_tag': 'O'},
  {'token': ,, 'iob_tag': 'O'},
  {'token': the, 'iob_tag': 'O'},
  {'token': only, 'iob_tag': 'O'},
  {'token': redeeming, 'iob_tag': 'O'},
  {'token': factor, 'iob_tag': 'O'},
  {'token': was, 'iob_tag': 'O'},
  {'token': the, 'iob_tag': 'O'},
  {'token': food, 'iob_tag': 'B'},
  {'token': ,, 'iob_tag': 'O'},
  {'token': which, 'iob_tag': 'O'},
  {'token': was, 'iob_tag': 'O'},
  {'token': above, 'iob_tag': 'O'},
  {'token': average, 'iob_tag': 'O'},
  {'token': ,, 'iob_tag': 'O'},
  {'token': but, 'iob_tag': 'O'},
  {'token': could, 'iob_ta

Although I do not have access to the tokeniser used to create the original dataset, the check function I implemented above, together with a quick look at the data myself, has led me to conclude that the character-based tagging was a success!
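A stricter check than comparing counts would be to rebuild the surface text of every tagged span and compare it with the aspect terms. The sketch below does this on plain (text, idx, iob_tag) triples instead of the spaCy-backed dicts used in this notebook; the function `tagged_phrases` is hypothetical and was not run against the dataset.

```python
def tagged_phrases(sentence, tagged):
    """Return the sentence substring covered by each B/I run.

    tagged is a list of (text, idx, iob_tag) triples in sentence order.
    """
    phrases = []
    start = end = None
    for text, idx, tag in tagged:
        if tag == "B":
            if start is not None:          # a new phrase directly follows another
                phrases.append(sentence[start:end])
            start, end = idx, idx + len(text)
        elif tag == "I" and start is not None:
            end = idx + len(text)          # extend the current phrase
        else:
            if start is not None:
                phrases.append(sentence[start:end])
            start = end = None
    if start is not None:                  # phrase running to the end of the sentence
        phrases.append(sentence[start:end])
    return phrases


sentence = "The pizza is the best if you like thin crusted pizza."
tagged = [("The", 0, "O"), ("pizza", 4, "B"), ("is", 10, "O"), ("the", 13, "O"),
          ("best", 17, "O"), ("if", 22, "O"), ("you", 25, "O"), ("like", 29, "O"),
          ("thin", 34, "B"), ("crusted", 39, "I"), ("pizza", 47, "I"), (".", 52, "O")]
recovered = tagged_phrases(sentence, tagged)
print(recovered == ["pizza", "thin crusted pizza"])
```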

Now I just need to do some feature engineering. The following features are commonly used in this domain: each token's part-of-speech tag and syntactic dependency, as well as those of up to three preceding tokens.

In [22]:
for sentence_list in token_list:
    for i, token in enumerate(sentence_list):
        spacy_token = token['token']
        token['text'] = spacy_token.text
        token['pos'] = spacy_token.pos_
        token['tag'] = spacy_token.tag_
        token['dep'] = spacy_token.dep_
        token['is_punct'] = spacy_token.is_punct

        # add the same features for up to three preceding tokens
        for offset in (1, 2, 3):
            if i >= offset:
                prev_token = sentence_list[i - offset]['token']
                token[f'-{offset}: pos'] = prev_token.pos_
                token[f'-{offset}: tag'] = prev_token.tag_
                token[f'-{offset}: dep'] = prev_token.dep_

In [23]:
token_list[0]

[{'token': But,
  'iob_tag': 'O',
  'text': 'But',
  'pos': 'CCONJ',
  'tag': 'CC',
  'dep': 'cc',
  'is_punct': False},
 {'token': the,
  'iob_tag': 'O',
  'text': 'the',
  'pos': 'DET',
  'tag': 'DT',
  'dep': 'det',
  'is_punct': False,
  '-1: pos': 'CCONJ',
  '-1: tag': 'CC',
  '-1: dep': 'cc'},
 {'token': staff,
  'iob_tag': 'B',
  'text': 'staff',
  'pos': 'NOUN',
  'tag': 'NN',
  'dep': 'nsubj',
  'is_punct': False,
  '-1: pos': 'DET',
  '-1: tag': 'DT',
  '-1: dep': 'det',
  '-2: pos': 'CCONJ',
  '-2: tag': 'CC',
  '-2: dep': 'cc'},
 {'token': was,
  'iob_tag': 'O',
  'text': 'was',
  'pos': 'VERB',
  'tag': 'VBD',
  'dep': 'ROOT',
  'is_punct': False,
  '-1: pos': 'NOUN',
  '-1: tag': 'NN',
  '-1: dep': 'nsubj',
  '-2: pos': 'DET',
  '-2: tag': 'DT',
  '-2: dep': 'det',
  '-3: pos': 'CCONJ',
  '-3: tag': 'CC',
  '-3: dep': 'cc'},
 {'token': so,
  'iob_tag': 'O',
  'text': 'so',
  'pos': 'ADV',
  'tag': 'RB',
  'dep': 'advmod',
  'is_punct': False,
  '-1: pos': 'VERB',
  '-1: t

As we can't use pickle to save the spaCy token objects, the last thing to do is delete the token item from each dict, leaving just the text it contains (already stored under 'text').

In [24]:
for sentence in token_list:
    for token in sentence:
        token.pop('token')

In [25]:
token_list[0]

[{'iob_tag': 'O',
  'text': 'But',
  'pos': 'CCONJ',
  'tag': 'CC',
  'dep': 'cc',
  'is_punct': False},
 {'iob_tag': 'O',
  'text': 'the',
  'pos': 'DET',
  'tag': 'DT',
  'dep': 'det',
  'is_punct': False,
  '-1: pos': 'CCONJ',
  '-1: tag': 'CC',
  '-1: dep': 'cc'},
 {'iob_tag': 'B',
  'text': 'staff',
  'pos': 'NOUN',
  'tag': 'NN',
  'dep': 'nsubj',
  'is_punct': False,
  '-1: pos': 'DET',
  '-1: tag': 'DT',
  '-1: dep': 'det',
  '-2: pos': 'CCONJ',
  '-2: tag': 'CC',
  '-2: dep': 'cc'},
 {'iob_tag': 'O',
  'text': 'was',
  'pos': 'VERB',
  'tag': 'VBD',
  'dep': 'ROOT',
  'is_punct': False,
  '-1: pos': 'NOUN',
  '-1: tag': 'NN',
  '-1: dep': 'nsubj',
  '-2: pos': 'DET',
  '-2: tag': 'DT',
  '-2: dep': 'det',
  '-3: pos': 'CCONJ',
  '-3: tag': 'CC',
  '-3: dep': 'cc'},
 {'iob_tag': 'O',
  'text': 'so',
  'pos': 'ADV',
  'tag': 'RB',
  'dep': 'advmod',
  'is_punct': False,
  '-1: pos': 'VERB',
  '-1: tag': 'VBD',
  '-1: dep': 'ROOT',
  '-2: pos': 'NOUN',
  '-2: tag': 'NN',
  '-2: d

Looks good! Now we can save it.

In [26]:
with open('restaurant_iob.pickle', 'wb') as handle:
    pickle.dump(token_list, handle)
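For the next step, the saved list will probably need splitting into features and labels. The sketch below assumes the CRF library wants one list of feature dicts and one list of tag strings per sentence; that interface is an assumption, not something confirmed in this notebook. The helper `split_features_and_labels`, the small `example` data, and the temporary file name are all hypothetical.

```python
import os
import pickle
import tempfile


def split_features_and_labels(sentences):
    """Separate the 'iob_tag' label from the remaining feature keys."""
    features = [[{key: value for key, value in token.items() if key != "iob_tag"}
                 for token in sentence]
                for sentence in sentences]
    labels = [[token["iob_tag"] for token in sentence] for sentence in sentences]
    return features, labels


# Tiny hand-written stand-in for token_list (not real dataset output).
example = [[{"iob_tag": "O", "text": "the", "pos": "DET"},
            {"iob_tag": "B", "text": "staff", "pos": "NOUN", "-1: pos": "DET"}]]

# Round-trip through pickle using a temporary file.
path = os.path.join(tempfile.mkdtemp(), "example_iob.pickle")
with open(path, "wb") as handle:
    pickle.dump(example, handle)
with open(path, "rb") as handle:
    restored = pickle.load(handle)

X, y = split_features_and_labels(restored)
print(y)
```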