## Supervised Learning for Entity and Aspect Mining

This notebook introduces Conditional Random Fields (CRF) for entity and aspect mining. Recall that entity and aspect mining involves three main tasks:

1. Extraction of the entity
2. Extraction of the aspects associated with the entity
3. Sentiment classification

In this notebook, we use a CRF for the second task.

### Conditional Random Fields
CRF is a sequence-modelling machine learning technique that is very popular in natural language processing (NLP). It is used, for example, in named entity recognition (NER), part-of-speech (POS) tagging and word sense disambiguation.

The CRF is closely related to the hidden Markov model (HMM), but because it models the labels conditionally on the whole input, its features may depend on words beyond the adjacent ones.

Earlier, we introduced several heuristic techniques for extracting aspects, such as dependency parsing and syntactic relations signalled by words like 'of' and 'from'. These rules can be integrated into the ML model as features; this is left as an exercise. For illustration, this notebook uses only POS tags as features.

### Understanding CRF in NLP 

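The CRF is a sequence-labelling technique: rather than classifying each token in isolation, it predicts the labels of a whole sequence jointly, so the label of one token can depend on the labels of its neighbours. In entity and aspect mining, that is, aspect-based sentiment analysis (ABSA), the sequence is a sentence and each word receives a tag that says whether it is part of an aspect.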

Suppose $(X, Y)$ is a conditional random field in which $X$ is the observed sequence and $Y$ is the sequence of labels to be predicted. In NLP, $X$ can be the actual words themselves ('like', 'of', 'and', etc.) while $Y$ are the POS tags, which need to be inferred by an algorithm. In a CRF, each label $Y_v$ obeys the Markov property with respect to the graph over the labels:

$$
p(Y_v \vert X, Y_w, w \ne v) = p(Y_v \vert X, Y_w, w \sim v)
$$

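Here $w \sim v$ means that $w$ and $v$ are neighbours in the graph. For the linear-chain CRF used in this notebook, the neighbours of a label are the labels immediately before and after it, while the features may look at any part of $X$, including the surrounding words. A CRF is thus a Markov random field (MRF) over the labels that is conditioned on the observations. Note that this expression is a conditional probability: the model estimates $p(Y \vert X)$ directly from the labelled corpus by maximising the conditional likelihood, rather than deriving it through Bayes' rule from a joint model.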
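
To make the conditional probability concrete, here is a minimal sketch of a linear-chain CRF evaluated by brute force. For a linear chain, $p(y \vert x)$ is the exponentiated sum of state and transition weights, divided by the same quantity summed over every possible label sequence. The weights in `STATE_W` and `TRANS_W` are hypothetical values chosen for illustration, not learned ones, and the enumeration is only feasible for very short sentences.

```python
from itertools import product
from math import exp

LABELS = ["O", "B-A", "I-A"]

# Hypothetical weights, chosen only for illustration (not learned from data).
STATE_W = {("postag=NN", "B-A"): 1.5, ("postag=DT", "O"): 2.0, ("postag=NN", "I-A"): 1.0}
TRANS_W = {("B-A", "I-A"): 2.5, ("O", "I-A"): -6.0, ("O", "O"): 1.9}


def score(x, y):
    """Unnormalised log-score of label sequence y for observation sequence x."""
    s = sum(STATE_W.get((feat, lab), 0.0) for feat, lab in zip(x, y))
    s += sum(TRANS_W.get(pair, 0.0) for pair in zip(y, y[1:]))
    return s


def conditional_prob(x, y):
    """p(y | x): exp(score) normalised over every possible label sequence."""
    z = sum(exp(score(x, cand)) for cand in product(LABELS, repeat=len(x)))
    return exp(score(x, tuple(y))) / z


x = ["postag=DT", "postag=NN", "postag=NN"]
for y in [("O", "B-A", "I-A"), ("O", "O", "I-A")]:
    print(y, round(conditional_prob(x, y), 4))
```

With these made-up weights, the sequence that continues an aspect after `B-A` scores far higher than the one that jumps from `O` straight to `I-A`. Real implementations replace the enumeration with dynamic programming.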

In [1]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd
import pycrfsuite  # crf

print(sklearn.__version__)

0.21.3


Our training data set must be labelled. It is stored in IOB format, in the style of the CoNLL 2003 dataset (https://www.aclweb.org/anthology/W03-0419), and contains three columns: the first is the actual word, the second is its POS tag, and the third is the IOB tag, which is B-A for the first token of an aspect term, I-A for a token that continues an aspect term, and O for all other tokens. We write some simple code to convert it into a form that the pycrfsuite library can use. pycrfsuite is a readily accessible library for running CRFs.
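
Before training, it can help to check that every non-blank line really has the three columns described above. The following is a minimal sketch of such a check; the helper name `find_bad_rows`, the `ALLOWED_TAGS` set and the sample lines are assumptions made for illustration and are not part of the original notebook.

```python
ALLOWED_TAGS = {"B-A", "I-A", "O"}


def find_bad_rows(lines):
    """Return (line_number, text) for non-blank lines that are not 'token POS tag'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # a blank line is a sentence boundary, not an error
        if len(fields) != 3 or fields[2] not in ALLOWED_TAGS:
            bad.append((number, line.rstrip("\n")))
    return bad


# Made-up sample lines in the three-column layout.
sample = [
    "The DT O\n",
    "food NN B-A\n",
    "was VBD O\n",
    "good JJ O\n",
    "\n",
    "Nice JJ NN\n",  # malformed: the tag column holds a POS tag
    "place NN\n",    # malformed: only two columns
]
print(find_bad_rows(sample))
```

On a real file, the same function could be called on the lines of the opened file handle.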

The function word2features extracts features for each token of a sentence, in this case just the POS tags of the token and of its neighbours. The function is adapted from https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

In [2]:
%%time

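def createCRFSet(fname):
    """Read an IOB file into a list of sentences (each a list of token tuples)."""
    sents = []
    current = []
    # The context manager closes the file even if parsing fails.
    with open(fname, encoding="utf-8") as fp:
        for line in fp:
            fields = tuple(line.split())
            if fields:
                current.append(fields)
            else:
                # A blank line marks the end of a sentence.
                sents.append(current)
                current = []
    # Keep the last sentence when the file does not end with a blank line.
    if current:
        sents.append(current)
    return sents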

train_sents = createCRFSet("data/Restaurants_Train.iob")
test_sents = createCRFSet("data/Restaurants_Test.iob")
#test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

Wall time: 47.9 ms


In [3]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [  # for all words
        'bias',
        'postag=' + postag,
      #  'word.lower()': word.lower(),
        #'word[-3:]': word[-3:],
       # 'word[-2:]': word[-2:],
        #'word.isupper()': word.isupper(),
        #'word.istitle()': word.istitle(),
        #'word.isdigit()': word.isdigit(),
        #'postag[:2]': postag[:2],
    ]
    if i > 0:  # not the first token, so a previous token exists
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:postag=' + postag1
            # '-1:word.lower()': word1.lower(),
            #'-1:word.istitle()': word1.istitle(),
            #'-1:word.isupper()': word1.isupper(),
            #'-1:postag': postag1,
            #'-1:postag[:2]': postag1[:2],
        ])
    else:
        features.append('BOS')  # beginning of sentence
        
    if i < len(sent)-1:  # not the last token, so a next token exists
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:postag=' + postag1
            #'+1:word.lower()': word1.lower(),
            #'+1:word.istitle()': word1.istitle(),
            #'+1:word.isupper()': word1.isupper(),
            #'+1:postag': postag1,
            #'+1:postag[:2]': postag1[:2],
        ])
    else:
        features.append('EOS')
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

The cells below show one labelled training sentence, followed by the features extracted for the sentence 'To be completely fair, the only redeeming factor was the food which was above average, but couldn't make up for all the other deficiencies of Teodora'. The POS tags of the current, previous and next tokens are used as features.

In [4]:
df_1 = pd.DataFrame(train_sents[5],columns=["Word","POS","Entity or Aspect Tag"])
# change to dataframe for easy printing.
df_1


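| | Word | POS | Entity or Aspect Tag |
|---|---|---|---|
| 0 | Not | RB | O |
| 1 | only | RB | O |
| 2 | was | VBD | O |
| 3 | the | DT | O |
| 4 | food | NN | B-A |
| 5 | outstanding | JJ | O |
| 6 | , | , | O |
| 7 | but | CC | O |
| 8 | the | DT | O |
| 9 | little | JJ | O |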

In [5]:
df_2 = pd.DataFrame(sent2features(train_sents[1]), columns=["Bias constant","POS","POS Before","POS after"])  # the bias is a constant feature (just like the intercept 'c' in regression)
df_2

| | Bias constant | POS | POS Before | POS after |
|---|---|---|---|---|
| 0 | bias | postag=TO | BOS | +1:postag=VB |
| 1 | bias | postag=VB | -1:postag=TO | +1:postag=RB |
| 2 | bias | postag=RB | -1:postag=VB | +1:postag=JJ |
| 3 | bias | postag=JJ | -1:postag=RB | +1:postag=, |
| 4 | bias | postag=, | -1:postag=JJ | +1:postag=DT |
| 5 | bias | postag=DT | -1:postag=, | +1:postag=JJ |
| 6 | bias | postag=JJ | -1:postag=DT | +1:postag=NN |
| 7 | bias | postag=NN | -1:postag=JJ | +1:postag=NN |
| 8 | bias | postag=NN | -1:postag=NN | +1:postag=VBD |
| 9 | bias | postag=VBD | -1:postag=NN | +1:postag=DT |
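
The features above use only POS tags. As a possible starting point for the exercise of adding richer features, here is a sketch of an extended feature function that also emits a few word-level features as strings. The name `word2features_ext`, the choice of features and the sample sentence are assumptions for illustration; whether they improve the model would need to be checked on the test set.

```python
def word2features_ext(sent, i):
    """POS features plus a few word-level features, as a list of strings."""
    word, postag = sent[i][0], sent[i][1]
    features = [
        "bias",
        "postag=" + postag,
        "postag[:2]=" + postag[:2],
        "word.lower=" + word.lower(),
        "word[-3:]=" + word[-3:],
        "word.istitle=%s" % word.istitle(),
    ]
    if i > 0:  # a previous token exists
        features.append("-1:postag=" + sent[i - 1][1])
        features.append("-1:word.lower=" + sent[i - 1][0].lower())
    else:
        features.append("BOS")
    if i < len(sent) - 1:  # a next token exists
        features.append("+1:postag=" + sent[i + 1][1])
        features.append("+1:word.lower=" + sent[i + 1][0].lower())
    else:
        features.append("EOS")
    return features


# Made-up sentence in the (token, POS, tag) layout used in this notebook.
sent = [("the", "DT", "O"), ("food", "NN", "B-A"), ("of", "IN", "O"), ("Teodora", "NNP", "O")]
for i in range(len(sent)):
    print(word2features_ext(sent, i))
```

The feature `-1:word.lower=of` is one simple way to encode the 'of' heuristic mentioned earlier.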


In [5]:
train_sents[1]

[('To', 'TO', 'O'),
 ('be', 'VB', 'O'),
 ('completely', 'RB', 'O'),
 ('fair', 'JJ', 'O'),
 (',', ',', 'O'),
 ('the', 'DT', 'O'),
 ('only', 'JJ', 'O'),
 ('redeeming', 'NN', 'O'),
 ('factor', 'NN', 'O'),
 ('was', 'VBD', 'O'),
 ('the', 'DT', 'O'),
 ('food', 'NN', 'B-A'),
 (',', ',', 'O'),
 ('which', 'WDT', 'O'),
 ('was', 'VBD', 'O'),
 ('above', 'IN', 'O'),
 ('average', 'NN', 'O'),
 (',', ',', 'O'),
 ('but', 'CC', 'O'),
 ("couldn't", 'NNS', 'O'),
 ('make', 'VBP', 'O'),
 ('up', 'RP', 'O'),
 ('for', 'IN', 'O'),
 ('all', 'PDT', 'O'),
 ('the', 'DT', 'O'),
 ('other', 'JJ', 'O'),
 ('deficiencies', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('Teodora', 'NNP', 'O'),
 ('.', '.', 'O')]

In [8]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Wall time: 79.8 ms


In [9]:
train_sents[3]

[('Where', 'WRB', 'O'),
 ('Gabriela', 'NNP', 'O'),
 ('personaly', 'VBZ', 'O'),
 ('greets', 'NNS', 'O'),
 ('you', 'PRP', 'O'),
 ('and', 'CC', 'O'),
 ('recommends', 'VB', 'O'),
 ('you', 'PRP', 'O'),
 ('what', 'WP', 'O'),
 ('to', 'TO', 'O'),
 ('eat', 'VB', 'O'),
 ('.', '.', 'O')]

In [10]:

%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Wall time: 70.8 ms


In [12]:

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty (reduces overfitting)
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
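
The two coefficients control how strongly large weights are penalised. As far as we understand CRFsuite's default L-BFGS trainer, the penalty added to the negative log-likelihood is `c1` times the sum of absolute weights plus `c2` times the sum of squared weights; treat the exact scaling as an assumption. The sketch below only illustrates that formula on hypothetical weights.

```python
def penalty(weights, c1, c2):
    """Elastic-net style penalty: c1 * sum(|w|) + c2 * sum(w^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2


weights = [2.5, -0.5, 0.0, 1.0]  # hypothetical feature weights
print(penalty(weights, c1=1.0, c2=1e-3))
```

With `c1` much larger than `c2`, as in the settings above, the L1 term dominates, which tends to push many weights to exactly zero and keeps the number of active features small.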

In [13]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [14]:
%%time
# Here we save the trained CRF model. 
trainer.train('CRF_ABSA.crfsuite')

Wall time: 359 ms


In [16]:
trainer.logparser.last_iteration

{'num': 50,
 'scores': {},
 'loss': 8270.838023,
 'feature_norm': 16.557075,
 'error_norm': 161.739801,
 'active_features': 241,
 'linesearch_trials': 1,
 'linesearch_step': 1.0,
 'time': 0.007}

In [17]:
print (len(trainer.logparser.iterations), trainer.logparser.iterations[-1])

50 {'num': 50, 'scores': {}, 'loss': 8270.838023, 'feature_norm': 16.557075, 'error_norm': 161.739801, 'active_features': 241, 'linesearch_trials': 1, 'linesearch_step': 1.0, 'time': 0.007}


In [15]:
tagger = pycrfsuite.Tagger()
tagger.open('CRF_ABSA.crfsuite')

<contextlib.closing at 0x20cd4f7a2e8>

In [19]:

example_sent = test_sents[5]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

I trust the people at Go Sushi , it never disappoints .

Predicted: O O O O O O O O O O O O
Correct:   O O O B-A O O O O O O O O
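
The two tag rows are easier to compare when each token is printed next to its predicted and correct tag. The sketch below does this for the example above; the helper name `align` is an assumption, and the token and tag lists are copied from the printed output rather than computed from the model.

```python
def align(tokens, predicted, gold):
    """Return one row per token: (token, predicted tag, gold tag, mismatch marker)."""
    rows = []
    for tok, pred, true in zip(tokens, predicted, gold):
        rows.append((tok, pred, true, "" if pred == true else "<-- mismatch"))
    return rows


# Values copied from the example output above.
tokens = "I trust the people at Go Sushi , it never disappoints .".split()
predicted = ["O"] * 12
gold = ["O", "O", "O", "B-A"] + ["O"] * 8

for tok, pred, true, flag in align(tokens, predicted, gold):
    print("%-12s %-4s %-4s %s" % (tok, pred, true, flag))
```

Here the only disagreement is on 'people', which is labelled as an aspect but predicted as O.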


In [20]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [21]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

Wall time: 17.9 ms


In [18]:
print(bio_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         B-A       0.62      0.36      0.46      1135
         I-A       0.55      0.23      0.32       538

   micro avg       0.60      0.32      0.42      1673
   macro avg       0.59      0.30      0.39      1673
weighted avg       0.60      0.32      0.41      1673
 samples avg       0.04      0.04      0.04      1673
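
The report above is computed per token. A stricter complement is to score whole aspect spans, counting a prediction as correct only when both its start and its end match. The following is a minimal sketch of such a span-level evaluation; the function names and the demo sequences are assumptions for illustration, and no span-level results for this model are implied.

```python
def extract_spans(tags):
    """Return the set of (start, end) aspect spans in one tag sequence (end exclusive)."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        if start is not None and tag != "I-A":
            spans.add((start, i))
            start = None
        if tag == "B-A":
            start = i
    return spans


def span_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exactly matching aspect spans."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up sequences: the first prediction cuts a two-token aspect short.
y_true_demo = [["O", "B-A", "I-A", "O"], ["B-A", "O"]]
y_pred_demo = [["O", "B-A", "O", "O"], ["B-A", "O"]]
print(span_prf(y_true_demo, y_pred_demo))
```

With the notebook's variables, the call would be `span_prf(y_test, y_pred)`.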





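A CRF also learns a weight for each transition between consecutive labels, and some transitions are more probable than others. The output below shows that B-A -> I-A is very likely, while O -> I-A is very unlikely. An example of B-A -> I-A is iPhone (B-A) size (I-A). The label NN in the output is not a valid IOB tag; it probably comes from a few malformed lines in the data.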

In [23]:

from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(8))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-8:])

Top likely transitions:
I-A    -> I-A     2.657962
B-A    -> I-A     2.559635
O      -> O       1.920100
O      -> B-A     1.000955
B-A    -> O       0.152442
NN     -> B-A     -0.244032
I-A    -> O       -0.658764
O      -> NN      -0.980371

Top unlikely transitions:
B-A    -> O       0.152442
NN     -> B-A     -0.244032
I-A    -> O       -0.658764
O      -> NN      -0.980371
NN     -> O       -2.090161
I-A    -> B-A     -4.247481
B-A    -> B-A     -5.006722
O      -> I-A     -6.641027
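
To see how these transition weights shape a prediction, here is a minimal Viterbi decoding sketch. It combines the transition weights printed above (valid IOB labels only) with per-token state scores and returns the highest-scoring label path. The state scores in the demo are hypothetical, and the function is a simplified illustration rather than CRFsuite's own implementation.

```python
LABELS = ["O", "B-A", "I-A"]

# Transition weights copied from the printed output above (valid IOB labels only).
TRANS = {
    ("I-A", "I-A"): 2.657962, ("B-A", "I-A"): 2.559635, ("O", "O"): 1.920100,
    ("O", "B-A"): 1.000955, ("B-A", "O"): 0.152442, ("I-A", "O"): -0.658764,
    ("I-A", "B-A"): -4.247481, ("B-A", "B-A"): -5.006722, ("O", "I-A"): -6.641027,
}


def viterbi(state_scores):
    """state_scores: one dict {label: score} per token. Returns the best label path."""
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (state_scores[0].get(lab, 0.0), [lab]) for lab in LABELS}
    for scores in state_scores[1:]:
        new_best = {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[p][0] + TRANS.get((p, lab), 0.0))
            total = best[prev][0] + TRANS.get((prev, lab), 0.0) + scores.get(lab, 0.0)
            new_best[lab] = (total, best[prev][1] + [lab])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical state scores for three tokens; tokens 2 and 3 slightly prefer B-A.
demo_scores = [{"O": 2.0}, {"B-A": 1.5, "I-A": 1.3}, {"B-A": 1.5, "I-A": 1.3}]
print(viterbi(demo_scores))
```

Taken token by token, the last two tokens would both be tagged B-A, but the strongly negative B-A -> B-A weight and the positive B-A -> I-A weight make the joint decoding prefer a single two-token aspect.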


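Which features are most indicative of each tag? The output below lists the state features with the largest positive weights. Among the aspect tags, the strongest is -1:postag=PRP for I-A, that is, the word before is tagged PRP. Note that in the Penn Treebank tagset PRP denotes a personal pronoun (such as 'it' or 'they'), whereas prepositions like 'of', 'in' and 'at' are tagged IN, so this feature does not directly confirm the preposition heuristic mentioned earlier. The weights for postag=NN and postag=NNS on B-A do show that nouns are favoured as the start of an aspect.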

In [24]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

#print("\nTop negative:")
#print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
3.756987 O      postag=,
3.561562 O      postag=PRP
3.432838 O      postag=.
2.797307 O      postag=WDT
2.301108 O      postag=PRP$
2.278891 O      EOS
2.209918 O      BOS
1.866367 I-A    -1:postag=PRP
1.826179 I-A    postag=SYM
1.767864 O      +1:postag=CD
1.727703 O      postag=JJS
1.655197 O      postag=WP
1.535133 O      bias
1.472674 B-A    postag=NN
1.412390 B-A    postag=NNS
1.379069 NN     postag=.
1.361568 NN     EOS
1.342975 I-A    postag=NN
1.224582 B-A    BOS
1.190439 O      postag=VBZ
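
Because the O label dominates the list, it can be clearer to group the features by label. The sketch below assumes the state features are available as a dict keyed by `(attribute, label)`, which is how `info.state_features` is used above; the helper name `top_features_by_label` is an assumption, and the sample dict contains a few entries copied from the printed output.

```python
from collections import defaultdict


def top_features_by_label(state_features, n=3):
    """Group {(attribute, label): weight} by label and keep the n largest weights."""
    grouped = defaultdict(list)
    for (attr, label), weight in state_features.items():
        grouped[label].append((weight, attr))
    return {label: sorted(items, reverse=True)[:n] for label, items in grouped.items()}


# A few entries copied from the printed output above; with the trained model,
# info.state_features could be passed instead.
sample = {
    ("postag=,", "O"): 3.756987,
    ("postag=PRP", "O"): 3.561562,
    ("-1:postag=PRP", "I-A"): 1.866367,
    ("postag=SYM", "I-A"): 1.826179,
    ("postag=NN", "B-A"): 1.472674,
    ("postag=NNS", "B-A"): 1.412390,
    ("postag=NN", "I-A"): 1.342975,
    ("BOS", "B-A"): 1.224582,
}
for label, items in top_features_by_label(sample, n=2).items():
    for weight, attr in items:
        print("%-4s %0.6f %s" % (label, weight, attr))
```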
