# Named Entity Identification (NEI) using SVM
**Problem statement:** Label each word in the input sentence as a named entity (NE) or not (non-NE).  
**Assumptions:**
- Each token tagged `B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6), B-MISC (7), I-MISC (8)` is treated as a separate NE.  
For example, the sentence `The Delhi High Court ...` will have ground truth tags `The_0 Delhi_1 High_1 Court_1 ...` instead of `The_0 (Delhi High Court)_1 ...`
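
The token-level assumption can be illustrated with a minimal sketch. The helper name `to_binary` and the tag ids chosen for the example sentence are hypothetical; only the rule "any tag id above 0 counts as an NE" comes from this notebook.

```python
# Tag ids as listed in the assumption; id 0 is the "O" (outside) tag.
NER_TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                 "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def to_binary(tag_ids):
    """Collapse tag ids to 1 (NE) / 0 (non-NE), one label per token."""
    return [1 if t > 0 else 0 for t in tag_ids]

tokens = ["The", "Delhi", "High", "Court"]
tag_ids = [0, 3, 4, 4]  # hypothetical tagging: O, B-ORG, I-ORG, I-ORG
labels = to_binary(tag_ids)
print(" ".join(f"{w}_{y}" for w, y in zip(tokens, labels)))
# prints: The_0 Delhi_1 High_1 Court_1
```

Each token keeps its own label, so a multi-word entity is never merged into a single unit.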

## Install Dependencies

In [24]:
! pip install datasets



In [81]:
import nltk
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Start

In [26]:
%reset -f

## Imports

In [27]:
import numpy as np
from sklearn.svm import SVC
from string import punctuation
from tqdm.notebook import tqdm
from datasets import load_dataset
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from nltk.tag import pos_tag

In [28]:
import pickle

In [29]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Constants

In [30]:
SEED = 0
D = 6 # number of features used
SW = stopwords.words("english")
PUNCT = list(punctuation)

## Functions

### Data

In [31]:
def createData(data):

    words = [] # stores the str
    features = [] # feature array, one vector per word in the corpus
    labels = [] # labels (0/1)

    for d in tqdm(data):

        tokens = d["tokens"]
        tags = d["ner_tags"]
        pos = d["pos_tags"]

        l = len(tokens)
        for i in range(l):

            x = vectorize(w = tokens[i], scaled_position = (i/l), p = pos[i])

            if tags[i] > 0:
                y = 1
            else:
                y = 0

            features.append(x)
            labels.append(y)

        words += tokens

    words = np.asarray(words, dtype = "object")
    features = np.asarray(features, dtype = np.float32)
    labels = np.asarray(labels, dtype = np.float32)

    return words, features, labels

### Model

#### Feature Engineering (word $w$ (`str`) $\to$ feature vector $x \in \mathbb{R}^d$)
- Capitalization: first character is uppercase [`0/1`]
- Is all caps (e.g., acronyms like 'USA') [`0/1`]
- Is stopword (using NLTK's English stopword list, 179 stopwords) [`0/1`]
- Is punctuation [`0/1`]
- (Scaled) position in sentence [`float`]
- Is proper noun (POS tag `NNP`/`NNPS`) [`0/1`]

In [32]:
def vectorize(w, scaled_position, p):
    # w : str : a token
    # scaled_position : float : token index divided by sentence length
    # p : int : POS tag id of the token (dataset encoding)

    v = np.zeros(D, dtype = np.float32)

    # If first character in uppercase
    if w[0].isupper():
        title = 1
    else:
        title = 0

    # All characters in uppercase
    if w.isupper():
        allcaps = 1
    else:
        allcaps = 0

    # Is stopword
    if w.lower() in SW:
        sw = 1
    else:
        sw = 0

    # Is punctuation
    if w in PUNCT:
        punct = 1
    else:
        punct = 0

    # Is a proper noun (POS tag ids: NNP = 22, NNPS = 23)
    if p == 22 or p == 23:
        pnoun = 1
    else:
        pnoun = 0

    # Build vector
    v[0] = title
    v[1] = allcaps
    v[2] = sw
    v[3] = punct
    v[4] = scaled_position
    v[5] = pnoun

    return v

In [89]:
def infer(model, scaler, s):
    # s : str : a raw sentence

    tokens = word_tokenize(s)
    features = []
    postag = pos_tag(tokens) # list of (token, tag string) pairs
    l = len(tokens)
    for i in range(l):
        pos = postag[i][1]
        # Map NLTK tag strings to the integer ids that vectorize expects
        if pos == "NNP":
            pos = 22
        elif pos == "NNPS":
            pos = 23
        f = vectorize(w = tokens[i], scaled_position = (i/l), p = pos)
        features.append(f)

    features = np.asarray(features, dtype = np.float32)

    scaled = scaler.transform(features)

    pred = model.predict(scaled)

    return pred, tokens, features
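
`vectorize` checks `p` against the integer ids 22 and 23, while NLTK's `pos_tag` returns tag strings such as `'NNP'`, so inference needs a conversion step. A minimal sketch of that conversion follows; the names `PROPER_NOUN_IDS`, `OTHER`, and `pos_to_id` are hypothetical, and the tagged sentence is hand-written for illustration rather than produced by NLTK.

```python
# Only the two ids that vectorize checks are needed (NNP -> 22, NNPS -> 23).
PROPER_NOUN_IDS = {"NNP": 22, "NNPS": 23}
OTHER = -1  # hypothetical placeholder id for every other tag

def pos_to_id(tag):
    """Map a POS tag string to the integer id checked in vectorize."""
    return PROPER_NOUN_IDS.get(tag, OTHER)

# Hand-written (token, tag) pairs in the shape returned by nltk.tag.pos_tag
tagged = [("Delhi", "NNP"), ("is", "VBZ"), ("the", "DT"), ("capital", "NN")]
ids = [pos_to_id(tag) for _, tag in tagged]
print(ids)  # prints: [22, -1, -1, -1]
```

A dictionary lookup keeps the mapping in one place, which makes it harder to write a comparison where an assignment was intended.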

In [97]:
print(pos_tag(['India']))

[('India', 'NNP')]


## Data (CoNLL 2003) [[huggingface]](https://huggingface.co/datasets/conll2003) [[original]](https://www.clips.uantwerpen.be/conll2003/ner/)
The dataset labels four entity classes: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups.

In [34]:
data = load_dataset("conll2003") # of type datasets.dataset_dict.DatasetDict
data_train = data["train"] # 14,041 rows (type: datasets.arrow_dataset.Dataset)
data_val   = data["validation"] # 3,250 rows
data_test  = data["test"] # 3,453 rows

# columns: 'id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'

In [98]:
words_train, X_train, y_train = createData(data_train)
words_val, X_val, y_val       = createData(data_val)
words_test, X_test, y_test    = createData(data_test)

  0%|          | 0/14041 [00:00<?, ?it/s]

  0%|          | 0/3250 [00:00<?, ?it/s]

  0%|          | 0/3453 [00:00<?, ?it/s]

In [99]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(203621, 6)
(51362, 6)
(46435, 6)


In [100]:
np.save('/content/gdrive/My Drive/X_train.npy', X_train)
np.save('/content/gdrive/My Drive/Y_train.npy', y_train)

In [40]:
# Print some examples of named entities in y_val
nes = words_val[y_val == 1]
for ne in np.random.choice(nes, size = 15):
    print(ne)

Agriculture
Mets
Lavrentyeva
Harrington
Republic
Jon
Australia
Bank
Dutch
Huddersfield
Briton
Nolan
Sean
Hamed
Jansher


In [41]:
# Standardize the features such that all features contribute equally to the distance metric computation of the SVM
scaler = StandardScaler()

# Fit only on the training data (i.e. compute mean and std)
scaler = scaler.fit(X_train)

# Use the train data fit values to scale val and test
X_train = scaler.transform(X_train)
X_val   = scaler.transform(X_val)
X_test  = scaler.transform(X_test)
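
What the scaler does can be sketched in plain Python: the mean and standard deviation of each column are computed on the training rows only, then reused for every other split. This is an illustrative re-implementation under stated assumptions, not scikit-learn's code; `fit_scaler`, `transform`, and the toy rows are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(rows):
    """Return per-column means and stds computed from the given rows."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    # Population std (ddof = 0); a constant column gets scale 1.0
    # so that the division below never hits zero.
    stds = [pstdev(c) or 1.0 for c in cols]
    return means, stds

def transform(rows, means, stds):
    """Standardize each row with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

train = [[1.0, 0.0], [0.0, 0.5], [1.0, 1.0], [0.0, 0.5]]  # toy feature rows
val = [[1.0, 0.25]]

means, stds = fit_scaler(train)            # statistics from train only
train_scaled = transform(train, means, stds)
val_scaled = transform(val, means, stds)   # reuses the training statistics
```

Fitting on the training split alone keeps information about the validation and test distributions out of the model.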

In [42]:
model = SVC(C = 1.0, kernel = "linear", class_weight = "balanced", random_state = SEED, verbose = True)
model.fit(X_train, y_train)

[LibSVM]
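
The `class_weight = "balanced"` setting matters here because non-NE tokens far outnumber NE tokens. scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`; the sketch below reproduces that formula in plain Python on a toy label list. The helper name and the labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

toy_labels = [0] * 8 + [1] * 2  # toy imbalance: 8 non-NE, 2 NE tokens
weights = balanced_class_weights(toy_labels)
print(weights)  # prints: {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so misclassifying an NE token costs more during training than misclassifying a non-NE token.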

In [43]:
# Use a context manager so the file handle is closed after writing
with open('/content/gdrive/My Drive/models/model.pkl', 'wb') as f:
    pickle.dump(model, f)

In [44]:
y_pred_val = model.predict(X_val)

In [46]:
print(classification_report(y_true = y_val, y_pred = y_pred_val))

              precision    recall  f1-score   support

         0.0       0.99      0.96      0.98     42759
         1.0       0.82      0.97      0.89      8603

    accuracy                           0.96     51362
   macro avg       0.91      0.96      0.93     51362
weighted avg       0.96      0.96      0.96     51362



In [47]:
print(confusion_matrix(y_true = y_val , y_pred = y_pred_val))

[[40931  1828]
 [  263  8340]]
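
The NE-class figures in the classification report can be recomputed from this confusion matrix as a sanity check. In scikit-learn's layout, rows are true classes and columns are predicted classes; the variable names below are illustrative.

```python
# Validation confusion matrix copied from the output above
cm = [[40931, 1828],
      [263, 8340]]  # rows: true class, columns: predicted class

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # share of predicted NEs that are correct
recall = tp / (tp + fn)     # share of true NEs that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.2f}")
# prints: 0.82 0.97 0.89 0.96
```

The gap between recall and precision reflects the 1,828 non-NE tokens predicted as NE, compared with only 263 missed NEs.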


In [90]:
# A few examples

examples = [
    "Delhi is the capital of India.",
    "US Vice President Kamala Harris, PM Modi talk up Indo-US ties at 1st in-person meeting.",
    "Covid-19 India Live News: National Task Force drops Ivermectin, HCQ drugs from Covid-19 treatment protocol; India logs 31,382 new cases.",
    "US Rules Out Adding India Or Japan To Security Alliance With Australia And UK" # all words are capitalized
]

for e in examples:
    pred, tokens, features = infer(model, scaler, e)
    annotated = []
    for w, p in zip(tokens, pred):
        annotated.append(f"{w}_{int(p)}")
    print(" ".join(annotated))
    print()

Delhi_1 is_0 the_0 capital_0 of_0 India_1 ._0

US_1 Vice_1 President_1 Kamala_1 Harris_1 ,_0 PM_1 Modi_1 talk_0 up_0 Indo-US_1 ties_0 at_0 1st_0 in-person_0 meeting_0 ._0

Covid-19_1 India_1 Live_1 News_1 :_0 National_1 Task_1 Force_1 drops_0 Ivermectin_1 ,_0 HCQ_1 drugs_0 from_0 Covid-19_1 treatment_0 protocol_0 ;_0 India_1 logs_0 31,382_0 new_0 cases_0 ._0

US_1 Rules_1 Out_0 Adding_1 India_1 Or_0 Japan_1 To_0 Security_1 Alliance_1 With_0 Australia_1 And_0 UK_1



## References
1. [Kapociute-Dzikiene, J., Nøklestad, A., Johannessen, J. B., & Krupavicius, A. (2013). Exploring features for named entity recognition in Lithuanian text corpus.](https://aclanthology.org/W13-5611.pdf)
2. [Král, P. (2011). Features for named entity recognition in Czech.](https://www.researchgate.net/publication/256605620_Features_for_named_entity_recognition_in_Czech_language)
3. [Malarkodi, C. S., & Devi, S. L. (2020, May). A Deeper Study on Features for Named Entity Recognition. In Proceedings of the WILDRE5–5th Workshop on Indian Language Data: Resources and Evaluation (pp. 66-72).](https://aclanthology.org/2020.wildre-1.12.pdf)