# Deliverable 2

- Deliverable 2 will be a NER (named entity recognition) system.


## Overview of the data

Dataset URL: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus


Essential info about entities:

```
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
```


In [37]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline


The data is located in the `../data/` folder.

In [38]:
!ls ../data/

books                 people_data.txt       quora_train_data.csv
conll                 people_wiki_small.csv spanish_line.txt
kaggle_ner            quora_test_data.csv   words_nltk.csv


We can take a look at the first lines of the dataset:

In [39]:
! head ../data/kaggle_ner/ner_dataset.csv 

Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O


In [40]:
data = pd.read_csv("../data/kaggle_ner/ner_dataset.csv",
                   encoding="latin1")

In [42]:
data.head(70)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


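Only the first row of each sentence carries a "Sentence: k" label; the remaining rows of that sentence are empty (NaN). The goal is to fill every row with the "Sentence: k" label of the sentence it belongs to.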

In [9]:
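# Collect the unique sentence labels. The column also contains NaN (unlabelled rows),
# and a set has no guaranteed order, so map NaN to the string "nan" explicitly
# instead of assuming it sits at index 0.
sentences = sorted("nan" if pd.isna(s) else s for s in set(data["Sentence #"]))
len(sentences)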

47960

In [10]:
sentences[0:3]

['Sentence: 1', 'Sentence: 10', 'Sentence: 100']
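
The output above shows that the labels are sorted as strings, so "Sentence: 10" comes right after "Sentence: 1". If numeric order is needed, one option is to sort by the integer after the colon. The sketch below uses a small hypothetical list and a hypothetical helper `sentence_number`; on the real `sentences` list the "nan" entry would have to be filtered out first.

```python
# Toy stand-in for the notebook's `sentences` list (hypothetical values).
sentence_ids = ["Sentence: 1", "Sentence: 10", "Sentence: 100", "Sentence: 2"]


def sentence_number(sentence_id):
    """Return the integer that follows 'Sentence: ' (hypothetical helper)."""
    return int(sentence_id.split(": ")[1])


# Sort by the numeric part instead of by the raw string.
sorted_ids = sorted(sentence_ids, key=sentence_number)
print(sorted_ids)
```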

In [11]:
set(data["Tag"])

{'B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O'}
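
The tag set appears to follow the usual BIO convention: `B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks tokens outside any entity. As a minimal sketch under that assumption, the hypothetical helper `split_tag` below separates the prefix from the entity type, using a few tags copied from the set above.

```python
# A few tags taken from the set printed above.
tags = ["B-geo", "I-geo", "B-per", "I-per", "O"]


def split_tag(tag):
    """Split a BIO tag into (prefix, entity_type); 'O' has no entity type."""
    if tag == "O":
        return "O", None
    prefix, entity_type = tag.split("-", 1)
    return prefix, entity_type


# Distinct entity types, ignoring the B-/I- prefix and the 'O' tag.
entity_types = sorted({split_tag(t)[1] for t in tags if t != "O"})
print(entity_types)
```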

In [12]:
for tag in set(data["Tag"]):
    print("\nTAG:",tag)
    print(data[data["Tag"] == tag]["Word"][0:10])


TAG: B-gpe
18     British
102    English
113    Britain
126    British
173       Iran
181       Iran
196    Iranian
238       U.S.
245       Iran
259     Tehran
Name: Word, dtype: object

TAG: I-per
271         Mahmoud
272     Ahmadinejad
332         Horbach
444       Abdullahi
445           Yusuf
446           Ahmad
966        Muhammad
974          Khayam
1106     Faridullah
1107           Khan
Name: Word, dtype: object

TAG: B-geo
6        London
12         Iraq
65         Hyde
94      Britain
106    Brighton
118        Iraq
133      London
146        Rome
148       Paris
151      Madrid
Name: Word, dtype: object

TAG: B-per
42           Bush
270     President
331        Thomas
443     President
965       Prophet
973          Omar
997        Khayam
1055       Khayam
1105        Malik
1240        Abdul
Name: Word, dtype: object

TAG: I-gpe
1225    States
1264     Korea
2713      Binh
2932     Ababa
3466      City
5241     Lanka
5313     Korea
5361     Korea
5370     Korea
5390     Ko

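How many sentences do we have? The list has 47960 entries, but one of them is the "nan" placeholder, so the last label should be "Sentence: 47959":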

In [13]:
"Sentence: 47959" in sentences, "Sentence: 47960" in sentences

(True, False)

## Indexing Sentences

In [14]:
sentence_formatter = "Sentence: {}"
sentence_formatter.format(0) in sentences

False

In [15]:
sentence_formatter = "Sentence: {}"
sentence_formatter.format(1) in sentences

True

In [16]:
i = 1
sentence_id      = sentence_formatter.format(i)
sentence_id_next = sentence_formatter.format(i+1)
sentence_id, sentence_id_next

('Sentence: 1', 'Sentence: 2')

In [17]:
print(data.index[data["Sentence #"] == sentence_id])
print(data.index[data["Sentence #"] == sentence_id_next])

Int64Index([0], dtype='int64')
Int64Index([24], dtype='int64')


In [18]:
start = data.index[data["Sentence #"] == sentence_id][0]
end   = data.index[data["Sentence #"] == sentence_id_next][0]
start, end

(0, 24)

In [19]:
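# Assign through .loc to avoid chained assignment; .loc slices include the
# end label, hence end - 1 (the index is the default 0..n-1 range).
data.loc[start:end - 1, "Sentence #"] = sentence_id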

In [20]:
data["Sentence #"][start:end]

0     Sentence: 1
1     Sentence: 1
2     Sentence: 1
3     Sentence: 1
4     Sentence: 1
5     Sentence: 1
6     Sentence: 1
7     Sentence: 1
8     Sentence: 1
9     Sentence: 1
10    Sentence: 1
11    Sentence: 1
12    Sentence: 1
13    Sentence: 1
14    Sentence: 1
15    Sentence: 1
16    Sentence: 1
17    Sentence: 1
18    Sentence: 1
19    Sentence: 1
20    Sentence: 1
21    Sentence: 1
22    Sentence: 1
23    Sentence: 1
Name: Sentence #, dtype: object

## Selecting a subset and writing an identifier

In [22]:
data = pd.read_csv("../data/kaggle_ner/ner_dataset.csv",
                   encoding="latin1")

last_n = 2000
end   = data.index[data["Sentence #"] == sentence_formatter.format(last_n)][0]

In [23]:
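# Keep the rows before the first row of sentence 2000, i.e. sentences 1 to 1999.
# .copy() makes the subset independent of the full frame before it is modified.
data = data[0:end].copy()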

In [24]:
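# The set also contains NaN (unlabelled rows), so this counts 1999 labels + 1.
n_sentences = len(set(data["Sentence #"]))
first_n = 1
last_n = last_n - 1  # last sentence kept in the subset (1999)
print(n_sentences)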

2000


In [25]:
%%time
sentence_formatter = "Sentence: {}"

# Fill every row of sentence s_id with its label. The rows run from the
# labelled first row of s_id up to (not including) the first row of s_id + 1.
for s_id in range(first_n, last_n):
    print("current {}/{}".format(s_id, last_n), end="\r")
    sentence_id = sentence_formatter.format(s_id)
    sentence_id_next = sentence_formatter.format(s_id + 1)
    start = data.index[data["Sentence #"] == sentence_id][0]
    end   = data.index[data["Sentence #"] == sentence_id_next][0]
    # .loc slices include the end label, hence end - 1.
    data.loc[start:end - 1, "Sentence #"] = sentence_id

# The last sentence has no successor, so fill to the end of the frame.
sentence_id = sentence_formatter.format(last_n)
start = data.index[data["Sentence #"] == sentence_id][0]
data.loc[start:, "Sentence #"] = sentence_id


CPU times: user 8.04 s, sys: 295 ms, total: 8.33 s
Wall time: 8.46 s
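
The loop above needs about 8 seconds for roughly 2,000 sentences because it scans the whole column twice per sentence. A possible vectorized alternative is pandas' forward fill, which copies the last seen non-missing value down into the missing rows. The sketch below runs on a small made-up frame (`toy` and its words are illustrative only) and assumes the only missing values in the column are the unlabelled rows of each sentence.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout as the dataset: only the first row of
# each sentence carries a label (the values are made up for illustration).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["a", "b", "c", "d", "e"],
})

# Forward fill: each NaN takes the closest non-missing value above it.
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy)
```

Applied to the notebook's frame, the same call would be `data["Sentence #"] = data["Sentence #"].ffill()`.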


## Building X and Y

In [26]:
n_sentences

2000

In [27]:
X = []
Y = []

sentence_formatter = "Sentence: {}"

# Labels run from 1 to n_sentences - 1, because n_sentences also counted NaN.
for i in range(1, n_sentences):
    s = sentence_formatter.format(i)
    rows = data[data["Sentence #"] == s]  # all tokens of sentence i
    X.append(list(rows["Word"].values))
    Y.append(list(rows["Tag"].values))
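
The loop above filters the full frame once per sentence. Once every row carries its sentence label, a grouped version can build the same lists in a single pass. This is a sketch on a made-up frame (`toy`, `X_toy` and `Y_toy` are illustrative names); `sort=False` keeps the sentences in order of first appearance rather than in string order.

```python
import pandas as pd

# Toy frame in which every row already carries its sentence label
# (the sentences are made up for illustration).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1",
                   "Sentence: 2", "Sentence: 2", "Sentence: 2"],
    "Word": ["London", "calls", "Bush", "visits", "Iraq"],
    "Tag": ["B-geo", "O", "B-per", "O", "B-geo"],
})

# Group the rows by sentence and collect words and tags as lists.
grouped = toy.groupby("Sentence #", sort=False)
X_toy = grouped["Word"].apply(list).tolist()
Y_toy = grouped["Tag"].apply(list).tolist()
print(X_toy)
print(Y_toy)
```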

In [28]:
i = 0
xy = ["{}/{}".format(x,y) for x,y in zip(X[i],Y[i])]
" ".join(xy)

'Thousands/O of/O demonstrators/O have/O marched/O through/O London/B-geo to/O protest/O the/O war/O in/O Iraq/B-geo and/O demand/O the/O withdrawal/O of/O British/B-gpe troops/O from/O that/O country/O ./O'
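
To read entities off a tagged sentence like the one above, consecutive `B-`/`I-` tokens of the same type can be merged into one span. The function below is a minimal sketch, assuming the tags follow the BIO convention; `extract_entities` is a hypothetical helper, and an `I-` tag that does not continue an open entity of the same type is simply dropped here, which is one design choice among several.

```python
def extract_entities(words, tags):
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the open one, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity.
            current_words.append(word)
        else:
            # 'O' or an inconsistent 'I-' tag: close the open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


words = ["marched", "through", "London", "to", "protest", "the", "war",
         "in", "Iraq", "and", "demand", "the", "withdrawal", "of",
         "British", "troops"]
tags = ["O", "O", "B-geo", "O", "O", "O", "O",
        "O", "B-geo", "O", "O", "O", "O", "O",
        "B-gpe", "O"]
entities = extract_entities(words, tags)
print(entities)
```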

In [29]:
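def build_word_to_pos(X):
    """Map each distinct word to an integer index, in order of first appearance.

    Here "pos" means position in the vocabulary, not part of speech.
    Returns the mapping and its inverse.
    """
    word_to_pos = {}
    i = 0
    for s in X:
        for w in s:
            if w not in word_to_pos:
                word_to_pos[w] = i
                i += 1

    pos_to_word = {v: k for k, v in word_to_pos.items()}
    return word_to_pos, pos_to_word


def build_tag_to_pos(Y):
    """Map each distinct tag to an integer index, in order of first appearance.

    Returns the mapping and its inverse.
    """
    tag_to_pos = {}
    i = 0
    for s in Y:
        for t in s:
            if t not in tag_to_pos:
                tag_to_pos[t] = i
                i += 1

    pos_to_tag = {v: k for k, v in tag_to_pos.items()}
    return tag_to_pos, pos_to_tag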

In [30]:
word_to_pos, pos_to_word = build_word_to_pos(X)
tag_to_pos, pos_to_tag  = build_tag_to_pos(Y)

len(word_to_pos), len(tag_to_pos)

(7047, 17)

In [31]:
tag_to_pos

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

In [32]:
X = [[word_to_pos[w] for w in s] for s in X]
Y = [[tag_to_pos[t] for t in s] for s in Y]
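
A quick sanity check for this encoding is a round trip: mapping the indices back through the inverse dictionary should reproduce the original tokens. The sketch below rebuilds a tiny vocabulary by hand so that it runs on its own; the names `sentence`, `encoded` and `decoded` are illustrative.

```python
# Tiny hand-built vocabulary, indexed in order of first appearance.
sentence = ["Thousands", "of", "demonstrators", "of", "London"]
word_to_pos = {}
for w in sentence:
    if w not in word_to_pos:
        word_to_pos[w] = len(word_to_pos)
pos_to_word = {v: k for k, v in word_to_pos.items()}

# Encode the words as indices, then decode them back.
encoded = [word_to_pos[w] for w in sentence]
decoded = [pos_to_word[i] for i in encoded]
print(encoded)
print(decoded)
```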