Load the training and development datasets. Print a few examples to confirm that the data is read correctly, and print summary statistics to analyse the features.

In [1]:
# data loading
import pandas as pd
import numpy as np
# read training data
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos'])

# train data information
print("Training data information\n")
print("\nFirst 10 rows: \n")
print(train.head(n=10))
print("\nStatistics: \n")
print(train.describe())
print("\nLabel statistics: \n")
print(train['bio_only'].value_counts())

# read development data
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos'])

# development data information
print("\nDevelopment data information\n")
print("\nFirst 10 rows: \n")
print(dev.head(n=10))
print("\nStatistics: \n")
print(dev.describe())
print("\nLabel statistics: \n")
print(dev['bio_only'].value_counts())


Training data information


First 10 rows: 

       token label bio_only  upos
0  @paulwalk     O        O  NOUN
1         It     O        O  PRON
2         's     O        O   AUX
3        the     O        O   DET
4       view     O        O  NOUN
5       from     O        O   ADP
6      where     O        O   ADV
7          I     O        O  PRON
8         'm     O        O     X
9     living     O        O  NOUN

Statistics: 

        token  label bio_only   upos
count   62236  62241    62241  62241
unique  14799     13        3     17
top         .      O        O   NOUN
freq     1920  59100    59100  12178

Label statistics: 

O    59100
B     1964
I     1177
Name: bio_only, dtype: int64

Development data information


First 10 rows: 

        token label bio_only   upos
0  Stabilized     O        O  PROPN
1    approach     O        O   NOUN
2          or     O        O  CCONJ
3         not     O        O   PART
4           ?     O        O  PUNCT
5        That     O        O   PR

The output shows that the data loads successfully. The statistics show that the label distribution is highly imbalanced ('O' accounts for 59,100 of the 62,241 training labels), so we need a technique that mitigates this effect. We also observe that some tokens, such as '.' and 'the', appear very frequently. This large amount of punctuation and stopwords may add noise when predicting named entities, so it is worth trying to remove these tokens and checking the effect on performance.
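
To make the imbalance concrete, the short sketch below recomputes the class shares from the label counts printed above. The counts are copied from the output; the variable names are only illustrative.

```python
# Label counts copied from the training-set output above
label_counts = {"O": 59100, "B": 1964, "I": 1177}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Size of the minority classes ('B' + 'I') and how far 'O' outnumbers them
minority = label_counts["B"] + label_counts["I"]
ratio = label_counts["O"] / minority

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"'O' outnumbers 'B' + 'I' by about {ratio:.1f} to 1")
```

With roughly 19 'O' tokens for every entity token, a classifier that always predicts 'O' would already reach about 95% token accuracy, which is why accuracy alone is not a useful measure here.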

The next step is data preprocessing. We drop all rows containing N/A values, because they cannot be used to train the models. In addition, we convert the BIO labels and POS tags to integers for later use in training.

In [3]:
# data preprocessing
# Drop empty rows between texts
train = train.dropna()
dev = dev.dropna()

# Quantification of qualitative labels and features (BIO labels and POS tags)
# Change BIO labels to integers
def bio_index(bio):
  # Map each BIO label to its integer code; an unexpected label raises a KeyError
  return {'B': 0, 'I': 1, 'O': 2}[bio]

# Convert POS tags to integers
# Get the UPOS tagset
pos_vocab = train.upos.unique().tolist()
# Convert POS-tags to integers
def pos_index(pos):
  ind = pos_vocab.index(pos)
  return ind

# Quantify BIO labels and POS tags
def numeralization(txt):
  txt_copy = txt.reset_index(drop=True) #  make a copy of original data frame

  # BIO labels
  bioints = [bio_index(b) for b in txt_copy['bio_only']]
  txt_copy['bio_only_label'] = bioints

  # POS tags
  posinds = [pos_index(u) for u in txt_copy['upos']]
  txt_copy['pos_indices'] = posinds

  return txt_copy

# Preprocess train and development data
train_preprocess = numeralization(train)
dev_preprocess = numeralization(dev)
print(train_preprocess.head(n=20)) # check the results

        token       label bio_only   upos  bio_only_label  pos_indices
0   @paulwalk           O        O   NOUN               2            0
1          It           O        O   PRON               2            1
2          's           O        O    AUX               2            2
3         the           O        O    DET               2            3
4        view           O        O   NOUN               2            0
5        from           O        O    ADP               2            4
6       where           O        O    ADV               2            5
7           I           O        O   PRON               2            1
8          'm           O        O      X               2            6
9      living           O        O   NOUN               2            0
10        for           O        O    ADP               2            4
11        two           O        O    NUM               2            7
12      weeks           O        O   NOUN               2            0
13    

We already have the POS tag as a feature, and we also want to explore further features at both the word level and the sentence level.

In [4]:
# feature extraction
# feature 1: part-of-speech index

# feature 2: the POS index of the previous token, 
# because we want to capture some context information before the token
def pre_index(pos):
  return np.insert(pos[0:-1], 0 , [-1]) # The first token has no previous token, so it is set to -1

# feature 3: the POS index of the next token, 
# because we want to capture some context information after the token
def aft_index(pos):
  return np.append(pos[1 : ], -1) # The last token has no next token, so it is set to -1

# feature 4: is this token a proper noun?
# A proper noun is more likely to be a named entity than other POS tags
def is_propn(pos):
  return pos == 'PROPN'

# feature 5: is this token a noun?
# A noun is also more likely to be a named entity than other POS tags
def is_noun(pos):
  return pos == 'NOUN'

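# feature 6: Does the token contain capital letters?
# If a token contains a capital letter, it is likely to be a named entity.
# Note: islower() is False for tokens with no cased characters (e.g. '.' or '2017'),
# so this feature is also True for such tokens
def has_capital(tok):
  return not (tok.islower())

# feature 7: is the token title-cased (e.g. 'London')?
# Named entities are often written in title case
def is_title(tok):
  return tok.istitle()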

# feature 8: Does the token consist only of digits?
# Some named entities consist of digits.
def is_digit(tok):
  return tok.isdigit()


# feature 9: the length of the token
# Some named entities are long words
def word_length(tok):
  return len(tok)

# extract features using the above functions
def extract_features(txt):
  txt_copy = txt.reset_index(drop=True)

  posinds = [pos_index(u) for u in txt_copy['upos']] # POS tag index

  x_pre = pre_index(posinds)
  txt_copy['x_pre'] = x_pre

  x_aft = aft_index(posinds)
  txt_copy['x_aft'] = x_aft

  isprop = [is_propn(u) for u in txt_copy['upos']]
  txt_copy['is_propn'] = isprop

  isnoun = [is_noun(u) for u in txt_copy['upos']]
  txt_copy['is_noun'] = isnoun

  capital = [has_capital(t) for t in txt_copy['token']]
  txt_copy['has_capital'] = capital

  title = [is_title(t) for t in txt_copy['token']]
  txt_copy['is_title'] = title

  digit = [is_digit(t) for t in txt_copy['token']]
  txt_copy['is_digit'] = digit

  wlength = [word_length(t) for t in txt_copy['token']]
  txt_copy['word_length'] = wlength

  return txt_copy

train_with_feature = extract_features(train_preprocess) # extract features from preprocessed training set
dev_with_feature = extract_features(dev_preprocess) # extract features from preprocessed development set
print(train_with_feature.head(n=20)) # check the results

        token       label bio_only   upos  bio_only_label  pos_indices  x_pre  \
0   @paulwalk           O        O   NOUN               2            0     -1   
1          It           O        O   PRON               2            1      0   
2          's           O        O    AUX               2            2      1   
3         the           O        O    DET               2            3      2   
4        view           O        O   NOUN               2            0      3   
5        from           O        O    ADP               2            4      0   
6       where           O        O    ADV               2            5      4   
7           I           O        O   PRON               2            1      5   
8          'm           O        O      X               2            6      1   
9      living           O        O   NOUN               2            0      6   
10        for           O        O    ADP               2            4      0   
11        two           O   
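
The token-shape features rely on Python's built-in string methods, whose behaviour on punctuation and digits can be surprising. The sketch below prints what each check returns for a few sample tokens. `has_uppercase` is a hypothetical stricter variant added only for comparison; it is not used anywhere in this notebook.

```python
def has_uppercase(tok):
    """Hypothetical stricter check: True only if some character is uppercase."""
    return any(ch.isupper() for ch in tok)

# Columns: token, not islower() (as in has_capital), has_uppercase, istitle, isdigit
for tok in ["London", "the", ".", "2017", "iPhone"]:
    print(tok, not tok.islower(), has_uppercase(tok), tok.istitle(), tok.isdigit())
```

Because `islower()` returns False for strings with no cased characters, `not tok.islower()` is True for '.' and '2017', whereas the stricter variant is False for both. Whether that difference helps or hurts the classifier would need to be tested.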

The next step is to train the model. Because the training classes are very imbalanced, we use the easy ensemble method to tackle the problem. The majority class ('O') is divided into subsets, each of a size similar to the number of minority-class samples ('B' + 'I', about 3,000 samples). Each majority subset is then combined with all of the minority samples to obtain a down-sampled set, on which we train a base classifier. Finally, all the base classifiers are combined, in the manner of bagging, to obtain the final model.

For the base classifier, we use a voting model to combine the strengths of different classifiers. We expect this to perform better than a single model.
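
As a rough sketch of the partitioning step, the snippet below works out how many majority-class subsets are needed and where each slice starts and stops. The counts are taken from the label statistics above; `subset_size` and `bounds` are illustrative names, not part of the training code.

```python
import math

n_majority = 59100   # 'O' tokens in the training set (from the label statistics)
subset_size = 3000   # roughly the number of 'B' + 'I' tokens (1964 + 1177 = 3141)

# Number of consecutive slices needed to cover the whole majority class
n_subsets = math.ceil(n_majority / subset_size)

# (start, stop) positions of each slice; the last slice is shorter than the rest
bounds = [(start, min(start + subset_size, n_majority))
          for start in range(0, n_majority, subset_size)]

print(n_subsets)
print(bounds[0], bounds[-1])
```

Each slice would be paired with the same 3,141 minority samples, so every base classifier sees a roughly balanced training set.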

In [5]:
# model training
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import GradientBoostingClassifier 
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

'''
This function takes the whole training set and an index as inputs.
It returns a down-sampled dataset containing all rows labelled 'B' or 'I' and up to 3,000 rows labelled 'O'.
The input index determines which 3,000 'O' rows are chosen. For example, if the index is 0, the first 3,000 rows labelled 'O' are selected.
'''
def train_split(txt, index):
  txt_copy = txt.reset_index(drop=True)

  # split into B&I versus O subsets
  is_inside = txt_copy['bio_only']!='O' # the indexes of data labelled with 'B' & 'I'
  is_outside = txt_copy['bio_only']=='O' # the indexes of data labelled with 'O'
  bi = txt_copy[is_inside] # All 'B' & 'I'
  outside = txt_copy[is_outside] # All 'O'

  # Choose 3,000 'O' rows. If fewer than 3,000 remain after the index, select all the remaining rows
  if (index + 3000) < len(outside['bio_only']):
    outside = outside[index : index + 3000]  # approx the sum of B and I labels in train
  else:
    outside = outside[index :  ]

  # recombine
  down_sample = pd.concat([bi, outside])

  # check the results
  print('\nSubset %d token statistic:\n' % int(index/3000))
  print(down_sample['token'].describe())
  return down_sample

'''
This function takes the training set and the number of base classifiers as inputs.
It returns a list of base models. Each base model is trained on a different down-sampled dataset.
'''
def train_model(train_set, num_clf):
  models = []
  # Train a list of models; the number of models is equal to 'num_clf'
  for i in range (0 , num_clf): 
    # Obtain a down-sampled set 
    train_subset = train_split(train_set, i*3000)

    # Extract the features used for training
    x_train = train_subset.drop(['token', 'label', 'bio_only', 'upos' , 'bio_only_label' ], axis=1)
    # Extract the labels
    y_train = train_subset['bio_only_label']

    # Define some base models to make up a voting model
    ''' 
    Different classifiers have different strengths and weaknesses on this task. The reason for using a voting model is to combine the strengths
    of different classifiers. Hence, we select some commonly used, well-performing classifiers to make up the voting model. 
    Open questions are which base models to use, how many of them, and which hyper-parameters to set for each.

    '''
    model_gbc = GradientBoostingClassifier()
    model_lgbmc = LGBMClassifier()
    model_xgbc = XGBClassifier()
    model_rf = RandomForestClassifier()
    model_lr = LogisticRegression(multi_class='multinomial')
    model_BagC = BaggingClassifier()

    # Define the vote model
    estimators = [('randomforest', model_rf),('logistic', model_lr),('bagging', model_BagC),('gbc',model_gbc), ('lgbmc',model_lgbmc), ('xgbc', model_xgbc) ]
    model = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 5, 1, 1, 1, 1], n_jobs=-1) 
    # The weights were chosen by simply trying a few different values. Tuning all the hyper-parameters remains future work. 

    # train the model
    model.fit(x_train, y_train)

    # append to the model list
    models.append(model)

  return models

# train models
models = train_model(train_with_feature, 20) # there are approx. 60,000 rows labelled 'O', so the input is set to 60000/3000=20
print('training finished')


Subset 0 token statistic:

count     6141
unique    3363
top          .
freq       113
Name: token, dtype: object

Subset 1 token statistic:

count     6141
unique    3386
top          .
freq       112
Name: token, dtype: object

Subset 2 token statistic:

count     6141
unique    3329
top          .
freq       119
Name: token, dtype: object

Subset 3 token statistic:

count     6141
unique    3400
top          .
freq        98
Name: token, dtype: object

Subset 4 token statistic:

count     6141
unique    3350
top          .
freq       116
Name: token, dtype: object

Subset 5 token statistic:

count     6141
unique    3380
top          .
freq       117
Name: token, dtype: object

Subset 6 token statistic:

count     6141
unique    3361
top          .
freq       135
Name: token, dtype: object

Subset 7 token statistic:

count     6141
unique    3356
top          .
freq       112
Name: token, dtype: object

Subset 8 token statistic:

count     6141
unique    3344
top          .
freq   

The token statistics differ between subsets, so we can conclude that the base models were trained on different down-sampled sets, which means the code works as intended.

When predicting, each base model outputs probabilities for the 3 labels. For each input, we sum the probabilities from all base models for each label, and select the label with the highest total as the final result.
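
A minimal worked example of this combination rule follows. The two probability matrices are made up for illustration and stand in for the `predict_proba` outputs of two base models on three tokens.

```python
import numpy as np

# Hypothetical predict_proba outputs from two base models for three tokens;
# columns follow the notebook's encoding: 0 = 'B', 1 = 'I', 2 = 'O'
proba_a = np.array([[0.6, 0.1, 0.3], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
proba_b = np.array([[0.3, 0.2, 0.5], [0.1, 0.6, 0.3], [0.2, 0.1, 0.7]])

# Sum the probabilities per label, then pick the highest total for each token
summed = proba_a + proba_b
labels = summed.argmax(axis=1)
print(labels.tolist())
```

For the first token, the second model alone would choose 'O' (0.5), but the summed scores favour 'B' (0.9 against 0.8), which shows how a confident model can outweigh a less confident one.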

In [6]:
'''
This function takes the list of models and the development features as input.
It returns the predicted labels.
'''
def predict_ensemble(model_list, x_development):
  # Sum the class probabilities from all base models
  prob = sum(m.predict_proba(x_development) for m in model_list)

  # Select the label with the highest summed probability for each row
  results = np.argmax(prob, axis=1).tolist()

  return results


We can use this model to validate on the development set. 

In [7]:
# Extract the features used for predicting
X_dev = dev_with_feature.drop(['token', 'label', 'bio_only', 'upos' , 'bio_only_label'], axis=1)
X_dev.head()

# Generate predictions
preds = predict_ensemble(models, X_dev)

# Check if our classifier has only predicted outside=2 for all tokens in the dev file
(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)


Predicted label, Count of labels
[[    0  1088]
 [    1   280]
 [    2 14014]]


The predicted label distribution looks plausible, so we can go on to evaluate precision, recall and F1 score. 

In [8]:
# Evaluation
# The code is copied from the 'Task 2' notebook
def wnut_evaluate(txt):
  '''entity evaluation: we evaluate by whole named entities'''
  npred = 0; ngold = 0; tp = 0
  nrows = len(txt)
  for i in txt.index:
    if txt['prediction'][i]=='B' and txt['bio_only'][i]=='B':
      npred += 1
      ngold += 1
      for predfindbo in range((i+1),nrows):
        if txt['prediction'][predfindbo]=='O' or txt['prediction'][predfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      for goldfindbo in range((i+1),nrows):
        if txt['bio_only'][goldfindbo]=='O' or txt['bio_only'][goldfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      if predfindbo==goldfindbo:  # only count a true positive if the whole entity phrase matches
        tp += 1
    elif txt['prediction'][i]=='B':
      npred += 1
    elif txt['bio_only'][i]=='B':
      ngold += 1
  
  fp = npred - tp  # n false predictions
  fn = ngold - tp  # n missing gold entities
  prec = tp / (tp+fp)
  rec = tp / (tp+fn)
  f1 = (2*(prec*rec)) / (prec+rec)
  print('Sum of TP and FP = %i' % (tp+fp))
  print('Sum of TP and FN = %i' % (tp+fn))
  print('True positives = %i, False positives = %i, False negatives = %i' % (tp, fp, fn))
  print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (prec, rec, f1))

# reverse BIO labels from integers to 'B' 'I' and 'O'
def reverse_bio(ind):
  bio = ' '
  if ind==0:
    bio = 'B'
  elif ind==1:
    bio = 'I'
  elif ind==2:
    bio = 'O'
  return bio

# Convert BIO labels to original form
bio_preds = [reverse_bio(p) for p in preds]
dev_with_feature['prediction'] = bio_preds

# Evaluate on development set
print('New evaluation:')
wnut_evaluate(dev_with_feature)

New evaluation:
Sum of TP and FP = 1088
Sum of TP and FN = 826
True positives = 422, False positives = 666, False negatives = 404
Precision = 0.388, Recall = 0.511, F1 = 0.441


The result is reasonable, with an F1 score of 0.441 on the development set. It suggests that the easy ensemble method can improve performance on an imbalanced dataset. 
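
As a check on the arithmetic, the scores can be reproduced from the entity counts printed above. This sketch only recomputes the reported numbers and does not re-run the evaluation.

```python
# Entity-level counts copied from the evaluation output above
tp, fp, fn = 422, 666, 404

precision = tp / (tp + fp)   # share of predicted entities that are correct
recall = tp / (tp + fn)      # share of gold entities that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}")
```

Recall is higher than precision because the model predicts more entities (1,088) than the gold standard contains (826).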

We can try to improve performance further by removing punctuation and stopwords, because they may introduce noise during training. 

In [9]:
# Remove punctuation
not_punct = train_with_feature['upos']!='PUNCT'
train_no_punc = train_with_feature[not_punct]
print(train_no_punc['bio_only'].value_counts())
print(train_no_punc['token'].describe())

O    51569
B     1964
I     1134
Name: bio_only, dtype: int64
count     54667
unique    14657
top         the
freq       1105
Name: token, dtype: object


About 7,500 tokens labelled 'O' are removed (59,100 to 51,569), together with 43 tokens labelled 'I' whose POS tag is PUNCT. 

In [10]:
# Download stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

not_stop = [n not in stop_words for n in train_no_punc['token']]
train_no_punc_stop = train_no_punc[not_stop]
print(train_no_punc_stop['bio_only'].value_counts())
print(train_no_punc_stop['token'].describe())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


O    36010
B     1948
I     1096
Name: bio_only, dtype: int64
count     39054
unique    14512
top           I
freq        870
Name: token, dtype: object


About 15,500 more tokens labelled 'O' are removed (51,569 to 36,010). A few 'B' (16) and 'I' (38) tokens are also removed, but this should not have a significant effect. 
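
The output above shows that 'I' is still the most frequent token after filtering. This is consistent with the comparison being case-sensitive, since NLTK's English stopwords are lowercase. The sketch below contrasts the two ways of matching on a toy token list. The small `stop_words` set is a stand-in for the NLTK list, and lowercasing before the lookup is a possible variant that the notebook does not use.

```python
# Tiny stand-in stopword set; the notebook uses NLTK's English list instead
stop_words = {"i", "the", "it", "for"}

tokens = ["I", "the", "London", "It", "weeks"]

# Case-sensitive match, as in the cell above: 'I' and 'It' are kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive variant (an assumption, not the notebook's behaviour)
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Storing the stopwords in a set also makes each lookup constant-time, which helps when filtering tens of thousands of tokens.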

In [11]:
# Train and evaluate the results
# train the model
new_models = train_model(train_no_punc_stop, 12) # The second parameter is set to 36000/3000 = 12
# predict results
preds = predict_ensemble(new_models, X_dev)
# convert the results from integer to 'B' 'I' 'O'
bio_preds = [reverse_bio(p) for p in preds]
dev_with_feature['prediction'] = bio_preds
# evaluate
print('New evaluation:')
wnut_evaluate(dev_with_feature)



Subset 0 token statistic:

count     6044
unique    3830
top          I
freq       102
Name: token, dtype: object

Subset 1 token statistic:

count     6044
unique    3793
top          I
freq        99
Name: token, dtype: object

Subset 2 token statistic:

count     6044
unique    3798
top          I
freq        86
Name: token, dtype: object

Subset 3 token statistic:

count     6044
unique    3823
top          I
freq        93
Name: token, dtype: object

Subset 4 token statistic:

count     6044
unique    3785
top          I
freq        98
Name: token, dtype: object

Subset 5 token statistic:

count     6044
unique    3823
top          I
freq        87
Name: token, dtype: object

Subset 6 token statistic:

count     6044
unique    3823
top          I
freq        83
Name: token, dtype: object

Subset 7 token statistic:

count     6044
unique    3836
top          I
freq        80
Name: token, dtype: object

Subset 8 token statistic:

count     6044
unique    3947
top          I
freq   

The F1 score increases by about 0.01. From the results, we find that the model predicts more 'B' and fewer 'I' labels than expected: some tokens labelled 'I' are predicted as 'B' by the model. Improving on this remains future work.
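
One way to quantify how often 'I' tokens are predicted as 'B' is to count (gold, prediction) pairs. The sketch below does this on made-up label sequences, so the counts are illustrative and not results from this notebook.

```python
from collections import Counter

gold = ["B", "I", "I", "O", "B", "I", "O"]   # hypothetical gold labels
pred = ["B", "B", "I", "O", "B", "B", "O"]   # hypothetical predictions

# Count how often each gold label is paired with each predicted label
confusion = Counter(zip(gold, pred))
i_as_b = confusion[("I", "B")]

for (g, p), n in sorted(confusion.items()):
    print(f"gold={g} pred={p}: {n}")
print("I predicted as B:", i_as_b)
```

Assuming the development frame still holds the 'bio_only' and 'prediction' columns, the same count could be obtained by passing those two columns to `zip` in place of the toy lists.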

Finally, we can apply the model to the test set and check the performance. 

In [13]:
# Generate test results
# load data
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
testset = pd.read_table(wnuttest, header=None, names=['token', 'upos']).dropna()
print(testset.describe())

# preprocess
test_preprocess = testset.reset_index(drop=True)
posinds = [pos_index(u) for u in test_preprocess['upos']]
test_preprocess['pos_indices'] = posinds

# Extract the features used for predicting
test_with_feature = extract_features(test_preprocess)
X_test = test_with_feature.drop(['token', 'upos'], axis=1)

# Generating predictions
preds = predict_ensemble(new_models, X_test)

# Convert results
bio_preds = [reverse_bio(p) for p in preds]
testset['prediction'] = bio_preds

# Save to txt file
print(testset.describe())
print(testset['prediction'].value_counts())
testset.to_csv('test2.txt', sep='\t', index=False)


        token   upos
count   23323  23323
unique   6329     17
top         .   NOUN
freq      962   4442
        token   upos prediction
count   23323  23323      23323
unique   6329     17          3
top         .   NOUN          O
freq      962   4442      21370
O    21370
B     1580
I      373
Name: prediction, dtype: int64
