# Named entity recognition with SVM 


## Prerequisites

Install the following libraries: numpy, scikit-learn, scipy, stanza
(versions used: stanza==1.0.1, scikit-learn==0.22.1, numpy==1.18.1, scipy==1.4.1)


## File type

A text file containing one line per word, in the format:
[word] [part of speech] [chunk] [NER tag]

The NER tag can be one of the following: 
"O", "B-ORG", "B-MISC", "B-LOC", "B-PER", "I-ORG", "I-MISC", "I-LOC", "I-PER"

where "O" denotes a word that is not a named entity, "B" means the word is the beginning of a named entity,
"I" means the word is inside a named entity, and LOC, PER, ORG and MISC mean location, person, organisation and other, respectively.
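
A minimal sketch of reading this format, assuming whitespace-separated columns and the usual CoNLL-2003 conventions (blank lines between sentences, `-DOCSTART-` document markers). The helper name `parse_conll_lines` is hypothetical; the repository's parseCoNLL.py may work differently.

```python
def parse_conll_lines(lines):
    """Parse '[word] [pos] [chunk] [tag]' lines into a list of 4-tuples.

    Hypothetical helper: blank lines, malformed lines and -DOCSTART-
    markers are skipped (assumed CoNLL-2003 conventions).
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] == "-DOCSTART-":
            continue
        word, pos, chunk, tag = parts
        rows.append((word, pos, chunk, tag))
    return rows


# Illustrative input, not taken from train.txt.
sample = [
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Brussels NNP B-NP B-LOC",
]
rows = parse_conll_lines(sample)
```

Each tuple can then be unpacked into the word, its part of speech, its chunk tag and its NER tag.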

For the purposes of this project, the model does not differentiate between "B" and "I".
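
Ignoring the "B"/"I" distinction amounts to stripping the prefix from each tag. A small sketch, with a hypothetical helper name `entity_type`:

```python
def entity_type(tag):
    """Map 'B-LOC' / 'I-LOC' to 'LOC'; leave 'O' unchanged."""
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")
    # Only strip the prefix when it is one of the two expected markers.
    return label if prefix in ("B", "I") else tag


tags = ["O", "B-ORG", "I-ORG", "B-PER", "I-MISC"]
collapsed = [entity_type(t) for t in tags]
```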


## Method 

This model uses a Support Vector Machine from scikit-learn with a linear kernel, with a separate classifier for each entity type (LOC, ORG, PER, MISC).


In [None]:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1)
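
The one-classifier-per-tag setup can be sketched as follows. This is an illustration under assumptions, not the repository's code: the helper `train_per_type` is hypothetical, the features are random toy data, and each classifier is trained on binary labels (1 for the observed entity type, 0 for everything else).

```python
import numpy as np
from sklearn import svm

ENTITY_TYPES = ["LOC", "ORG", "PER", "MISC"]


def train_per_type(X, labels):
    """Fit one binary linear SVM per entity type (1 = this type, 0 = rest)."""
    labels = np.asarray(labels)
    classifiers = {}
    for entity in ENTITY_TYPES:
        y = (labels == entity).astype(int)
        if y.min() == y.max():
            continue  # SVC needs both classes present; skip otherwise
        clf = svm.SVC(kernel="linear", C=1)
        classifiers[entity] = clf.fit(X, y)
    return classifiers


# Toy data: random 2-D points with illustrative labels only.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
labels = ["LOC", "O", "ORG", "O", "PER", "O",
          "MISC", "O", "LOC", "O", "ORG", "O"]
classifiers = train_per_type(X, labels)
```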

### Features

The preprocessing steps of splitting and lemmatising the text are done in [TFIDFutils] and [TFIDFprepare].
The features are the following (note: they are all converted to numeric values):

#### part of speech and chunk
Obtained directly from the data, using the code in [parseCoNLL.py](https://github.com/skrljanja/E3/blob/master/NewTrain/parseCoNLL.py).
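
The text only states that these tags are converted to numbers, not how. One common option is one-hot encoding, sketched below as an assumption; the helper names `build_index` and `one_hot` are hypothetical.

```python
def build_index(values):
    """Assign each distinct tag a column index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a 0/1 list with a single 1 at the tag's column."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


# Illustrative part-of-speech tags; chunk tags can be handled the same way.
pos_tags = ["NNP", "VBZ", "JJ", "NNP"]
pos_index = build_index(pos_tags)
encoded = [one_hot(t, pos_index) for t in pos_tags]
```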

#### form
A boolean indicating whether a word starts with a capital letter followed by lowercase letters.
The function that computes it is defined in [isform.py](https://github.com/skrljanja/E3/blob/master/NewTrain/isform.py).
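
A possible implementation of such a check, as a sketch only. It assumes that every character after the first must be a lowercase ASCII letter, and the name `is_form` is hypothetical; the function in isform.py may use a different rule.

```python
import re

# One uppercase ASCII letter followed by one or more lowercase ASCII letters.
_CAPITALISED = re.compile(r"[A-Z][a-z]+")


def is_form(word):
    """Return True for words like 'Brussels', False for 'EU' or 'rejects'."""
    return _CAPITALISED.fullmatch(word) is not None


# The boolean is converted to 0/1 before being used as a feature.
features = [int(is_form(w)) for w in ["Brussels", "EU", "rejects"]]
```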

#### tf-idf 
Two distinct tf-idf matrices, computed with the TfidfTransformer from the sklearn library.

The first one has only one nonzero entry, corresponding to the observed word.  
The second one has up to six nonzero entries, corresponding to the surroundings of the observed word. 
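
The two matrices can be sketched as follows. The assumptions here are not from the source: "up to six" is read as three words on either side of the observed word, counts are built by a hypothetical helper `count_matrices`, and each token position is treated as one row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer


def count_matrices(tokens, window=3):
    """Build per-token count matrices for the word itself and its context."""
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    n, v = len(tokens), len(vocab)
    word_counts = np.zeros((n, v))
    context_counts = np.zeros((n, v))
    for i, tok in enumerate(tokens):
        word_counts[i, vocab[tok]] = 1
        # Up to `window` neighbours on each side, excluding the word itself.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context_counts[i, vocab[tokens[j]]] += 1
    return word_counts, context_counts, vocab


tokens = ["the", "european", "union", "met", "in", "brussels", "on", "monday"]
word_counts, context_counts, vocab = count_matrices(tokens)

# TfidfTransformer turns raw counts into tf-idf weights (sparse output).
word_tfidf = TfidfTransformer().fit_transform(word_counts)
context_tfidf = TfidfTransformer().fit_transform(context_counts)
```

Words near the start or end of the text have fewer than six neighbours, which is why the second matrix has "up to" six nonzero entries per row.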

#### embedding
The embedding vector of the word, using GloVe weights.
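
A sketch of looking up such a vector, assuming the standard GloVe text format (one word per line followed by its space-separated vector components). The helpers `load_glove` and `embed` are hypothetical, and falling back to a zero vector for unknown words is an assumption, not something the source specifies.

```python
import numpy as np


def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors


def embed(word, vectors, dim):
    """Return the word's vector, or a zero vector if it is unknown (assumed)."""
    return vectors.get(word, np.zeros(dim))


# Illustrative 3-dimensional vectors; real GloVe files have 50 or more dimensions.
sample = ["the 0.1 0.2 0.3", "union 0.4 0.5 0.6"]
vectors = load_glove(sample)
known = embed("union", vectors, dim=3)
unknown = embed("brussels", vectors, dim=3)
```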
    
### Testing method

The test is done using 5-fold cross-validation from sklearn on the file [train.txt](https://github.com/skrljanja/E3/blob/master/NewTrain/train.txt). It reports accuracy, F1, recall and precision.

In [None]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, scoring=['accuracy', 'f1', 'recall', 'precision'], cv=5)
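
A self-contained sketch of the same call on synthetic data, showing how the returned dictionary can be summarised. The data from `make_classification` is a stand-in for the real feature matrix, and averaging the per-fold scores is an assumption about how the reported figures were obtained.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

METRICS = ["accuracy", "f1", "recall", "precision"]

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = svm.SVC(kernel="linear", C=1)
scores = cross_validate(clf, X, y, scoring=METRICS, cv=5)

# cross_validate stores one score per fold under 'test_<metric>'.
summary = {name: scores["test_" + name].mean() for name in METRICS}
for name, value in summary.items():
    print(f"{name} = {value:.4f}")
```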

### Running the test

The file which joins all of this together is [train.txt](https://github.com/skrljanja/E3/blob/master/NewTrain/train.txt), so running this file gives you the output of the scores. 

The file specifies which NER tag it is observing on line 23, so changing the value of the variable "anonType" changes the observed NER tag.

## Results

This part of the notebook shows the scores obtained for each of the classifiers. 

### "LOC": 
Accuracy = 0.987813286

F1 = 0.838931792 

Recall = 0.782207018 

Precision = 0.90555553


### "ORG": 
Accuracy = 0.976755788

F1 = 0.727515648

Recall = 0.634513716

Precision = 0.85327882

### "MISC":
Accuracy = 0.991318242

F1 = 0.78165211

Recall = 0.693002248

Precision = 0.896755748
