<a href="https://colab.research.google.com/github/ratmcu/wiki_ner/blob/master/wiki_ner_loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
'''
An entry (one sentence) looks like:
SOCCER NN B-NP O
- : O O
JAPAN NNP B-NP B-LOC
GET VB B-VP O
LUCKY NNP B-NP O
WIN NNP I-NP O
, , O O
CHINA NNP B-NP B-PER
IN IN B-PP O
SURPRISE DT B-NP O
DEFEAT NN I-NP O
. . O O
Each mini-batch contains the following:
words: list of input sentences. ["The 26-year-old ...", ...]
x: encoded input sentences. [N, T]. int64.
is_heads: list of head markers. [[1, 1, 0, ...], [...]]
tags: list of tag strings. ['O O B-MISC ...', '...']
y: encoded tags. [N, T]. int64.
seqlens: list of sequence lengths. [45, 49, 10, 50, ...]
'''
import numpy as np
import torch
from torch.utils import data
!pip install pytorch-pretrained-bert
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
# VOCAB = ('<PAD>', 'O', 'I-LOC', 'B-PER', 'I-PER', 'I-ORG', 'I-MISC', 'B-MISC', 'B-LOC', 'B-ORG')

tags = ['BD', 'BP', 'PR', 'SP', 'CH', 'ED']
VOCAB_list = ['<PAD>', 'O',]
for tag in tags:
    VOCAB_list.append('I-'+tag)
    VOCAB_list.append('B-'+tag)
VOCAB = tuple(VOCAB_list)
tag2idx = {tag: idx for idx, tag in enumerate(VOCAB)}
idx2tag = {idx: tag for idx, tag in enumerate(VOCAB)}

class NerDataset(data.Dataset):
    def __init__(self, fpath):
        """
        fpath: [train|valid|test].txt
        """
        # Entries (sentences) are separated by blank lines.
        with open(fpath, 'r') as f:
            entries = f.read().strip().split("\n\n")
        sents, tags_li = [], [] # list of lists
        for entry in entries:
            lines = entry.splitlines()
            try:
                words = [line.split()[0] for line in lines]
            except IndexError:
                # A whitespace-only line has no fields; report the entry and skip it.
                print('splitting failed: ', [ord(char) for char in entry])
                continue
            tags = [line.split()[-1] for line in lines]
            sents.append(["[CLS]"] + words + ["[SEP]"])
            tags_li.append(["<PAD>"] + tags + ["<PAD>"])
        self.sents, self.tags_li = sents, tags_li

    def __len__(self):
        return len(self.sents)

    def __getitem__(self, idx):
        words, tags = self.sents[idx], self.tags_li[idx] # words, tags: string list

        # Only the first piece of each word carries the word's tag.
        x, y = [], [] # list of ids
        is_heads = [] # list. 1: the token is the first piece of a word
        for w, t in zip(words, tags):
            tokens = tokenizer.tokenize(w) if w not in ("[CLS]", "[SEP]") else [w]
            xx = tokenizer.convert_tokens_to_ids(tokens)

            is_head = [1] + [0]*(len(tokens) - 1)

            t = [t] + ["<PAD>"] * (len(tokens) - 1)  # <PAD>: no decision
            yy = [tag2idx[each] for each in t]  # (T,)

            x.extend(xx)
            is_heads.extend(is_head)
            y.extend(yy)

        assert len(x)==len(y)==len(is_heads), f"len(x)={len(x)}, len(y)={len(y)}, len(is_heads)={len(is_heads)}"

        # seqlen
        seqlen = len(y)

        # to string
        words = " ".join(words)
        tags = " ".join(tags)
        return words, x, is_heads, tags, y, seqlen


def pad(batch):
    '''Pads to the longest sample'''
    f = lambda x: [sample[x] for sample in batch]
    words = f(0)
    is_heads = f(2)
    tags = f(3)
    seqlens = f(-1)
    maxlen = np.array(seqlens).max()

    f = lambda x, seqlen: [sample[x] + [0] * (seqlen - len(sample[x])) for sample in batch] # 0: <pad>
    x = f(1, maxlen)
    y = f(-2, maxlen)


    f = torch.LongTensor

    return words, f(x), is_heads, tags, f(y), seqlens

Collecting pytorch-pretrained-bert
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)

100%|██████████| 213450/213450 [00:00<00:00, 835411.13B/s]
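
The tag alignment done in `NerDataset.__getitem__` can be tried in isolation. The sketch below is only an illustration: `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (it is not WordPiece and needs no download), and `align` is a made-up helper name. It shows how the first piece of each word keeps the tag while the remaining pieces receive `<PAD>`.

```python
# Hypothetical stand-in for a WordPiece tokenizer, for illustration only:
# any word longer than 4 characters is split into a head piece and a "##" tail.
def toy_tokenize(word):
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


def align(words, tags):
    """Expand word-level tags to piece level and mark the first piece of each word."""
    pieces, piece_tags, is_heads = [], [], []
    for w, t in zip(words, tags):
        toks = toy_tokenize(w)
        pieces.extend(toks)
        # Only the head piece keeps the real tag; the other pieces get <PAD>.
        piece_tags.extend([t] + ["<PAD>"] * (len(toks) - 1))
        is_heads.extend([1] + [0] * (len(toks) - 1))
    return pieces, piece_tags, is_heads


pieces, piece_tags, is_heads = align(["JAPAN", "GET", "LUCKY"], ["B-LOC", "O", "O"])
print(pieces)      # ['JAPA', '##N', 'GET', 'LUCK', '##Y']
print(piece_tags)  # ['B-LOC', '<PAD>', 'O', 'O', '<PAD>']
print(is_heads)    # [1, 0, 1, 1, 0]
```

The three lists always have the same length, which is what the `assert` in `__getitem__` checks.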


In [3]:
train='https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/train.txt'
valid='https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/valid.txt'
test='https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/test.txt'

!mkdir -p conll2003 && wget --show-progress $train && mv train.txt conll2003
!wget --show-progress $valid && mv valid.txt conll2003
!wget --show-progress $test && mv test.txt conll2003

mkdir: cannot create directory ‘conll2003’: File exists
--2019-09-18 15:28:00--  https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘train.txt’


2019-09-18 15:28:00 (38.0 MB/s) - ‘train.txt’ saved [3283420/3283420]

--2019-09-18 15:28:00--  https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827443 (

In [0]:
train_dataset = NerDataset("conll2003/train.txt")
eval_dataset = NerDataset("conll2003/valid.txt")
print(len(eval_dataset)/(len(train_dataset)+len(eval_dataset)))

In [0]:
from torch.utils import data as torch_data_utils
train_iter = torch_data_utils.DataLoader(dataset=train_dataset,
                             batch_size=1,
                             shuffle=True,
                             num_workers=4,
                             collate_fn=pad)
eval_iter = torch_data_utils.DataLoader(dataset=eval_dataset,
                            batch_size=1,
                            shuffle=False,
                            num_workers=4,
                            collate_fn=pad)

In [0]:
for i, batch in enumerate(train_iter):
    words, x, is_heads, tags, y, seqlens = batch
    print(words, x, is_heads, tags, y, seqlens)
    if i == 10:
        break
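
The padding step inside `pad` can be checked without torch. This is a minimal pure-Python sketch, not the collate function itself: `pad_ids` is a hypothetical helper and the ids in `batch_x` are arbitrary. It right-pads every sequence with 0, the index of `<PAD>` in `VOCAB`, up to the longest sequence in the batch.

```python
def pad_ids(sequences, pad_id=0):
    """Right-pad each list of ids with pad_id up to the longest length in the batch."""
    maxlen = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (maxlen - len(seq)) for seq in sequences]


# Arbitrary ids for two sequences of different length.
batch_x = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_ids(batch_x)
print(padded)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
```

With `batch_size=1`, as in the loaders above, each batch holds a single sequence that is already at the maximum length, so no padding is added.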

In [3]:
import os
import time
!pip install wget
import wget
import logging
import pickle
import ast
import pandas as pd
import numpy as np
import urllib
from bs4 import BeautifulSoup
import tarfile
# Download the dataset archive once, then unpack it into the working directory.
if not os.path.exists('dataset.tar.gz'):
    wget.download('https://github.com/ratmcu/wiki_ner/blob/master/dataset.tar.gz?raw=true', out='dataset.tar.gz')
with tarfile.open('dataset.tar.gz', mode='r') as tar:
    tar.extractall('./')

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9681 sha256=bbfe2ce3c9110e495810853bc0f14097d724f4d91d0a40ca7aad6430e4f3e84c
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [14]:
df = pd.read_csv('./dataset/politicians/Sri Lanka/Mahinda Rajapaksa/conll_tagged.csv')
# print(df.iloc[7290:7349])
print(df.iloc[80:100])

         words tags
80    Peramuna    O
81          in    O
82        2019    O
83           ,    O
84    spliting    O
85         the    O
86         Sri    O
87       Lanka    O
88     Freedom    O
89       Party    O
90           .    O
91          \n    O
92          \n   \n
93           A    O
94      lawyer    O
95          by    O
96  profession    O
97           ,    O
98   Rajapaksa    O
99         was    O


In [5]:
fname = os.walk('./dataset')
paths = sorted([os.path.join(f[0], f[2][0]) for f in fname if len(f[2]) != 0])
df = pd.read_csv('./dataset/politicians/Sri Lanka/Mahinda Rajapaksa/conll_tagged.csv')
def toConllTxt(path):
    """Convert a conll_tagged.csv (columns: words, tags) into a CoNLL-style text file.

    The file is written next to the CSV and named after the CSV's parent directory.
    """
    df = pd.read_csv(path)
    dir_path, _ = os.path.split(path)
    txt_file = os.path.join(dir_path, '%s.txt' % os.path.basename(dir_path))
    with open(txt_file, 'w') as file:
        for i, (word, tag) in enumerate(zip(df['words'], df['tags'])):
            if word == '\n' and tag == '\n':
                # A row holding two newlines marks a sentence boundary.
                file.write('\n')
            elif word == '\n ' and tag == 'O':
                # Stray newline token: skip it, but report the row index.
                print(i, '\n')
            elif word == ' \n' and tag == 'O':
                # Stray newline token: skip it silently.
                pass
            else:
                # str() guards against non-string cells such as NaN.
                file.write(str(word) + ' ')
                file.write(tag + '\n')
    return txt_file
test_page = NerDataset(toConllTxt('./dataset/politicians/Sri Lanka/Mahinda Rajapaksa/conll_tagged.csv'))
print(len(test_page))

91 

276
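
The conversion can be exercised on a small in-memory table. The sketch below is a simplified illustration under stated assumptions: the CSV content in `raw` is made up, `rows_to_conll` is a hypothetical helper, and it skips every whitespace-only token rather than matching the exact `'\n '` and `' \n'` strings that `toConllTxt` checks. A row whose word and tag are both a newline becomes a blank line, which is what `NerDataset` later splits on.

```python
import csv
import io

# Made-up CSV mimicking the words/tags layout; the quoted row holds two newlines.
raw = io.StringIO(
    'words,tags\n'
    'Freedom,O\n'
    'Party,O\n'
    '.,O\n'
    '"\n","\n"\n'
    'A,O\n'
    'lawyer,O\n'
)


def rows_to_conll(rows):
    """Turn (word, tag) pairs into CoNLL-style text with blank-line sentence breaks."""
    lines = []
    for word, tag in rows:
        if word == '\n' and tag == '\n':
            lines.append('')            # blank line = sentence boundary
        elif word.strip() == '':
            continue                    # skip whitespace-only tokens
        else:
            lines.append(f'{word} {tag}')
    return '\n'.join(lines) + '\n'


reader = csv.DictReader(raw)
text = rows_to_conll((r['words'], r['tags']) for r in reader)
sentences = text.strip().split('\n\n')
print(len(sentences))  # 2
```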


# **testing all pages**

In [6]:
# tree = os.walk('./dataset')
# paths = sorted([os.path.join(f[0], f[2][0]) for f in tree if len(f[2]) != 0 and os.path.splitext(f[2][0])[-1] == '.csv'])
# g = (tgtexp  for var1 in exp1 if exp2 for var2 in exp3 if exp4)

paths = sorted([os.path.join(f[0], name) for f in os.walk('./dataset') if len(f[2])!=0 for name in f[2] if os.path.splitext(name)[-1] == '.csv'])

# df = pd.read_csv('./dataset/politicians/Sri Lanka/Mahinda Rajapaksa/conll_tagged.csv')
# print(len(paths))
# paths
for i, path in enumerate(paths):
    test_page = NerDataset(toConllTxt(path))
    print(len(test_page), ' ', i)
# test_page = NerDataset(toConllTxt(paths[356]))
# print(len(test_page))
# directory

38   0
72   1
45   2
47   3
8   4
102   5
287   6
57   7
287   8
54 

8   9
292   10
260   11
147   12
39   13
47   14
34   15
28   16
91   17
51   18
21   19
17   20
97   21
152   22
15   23
5   24
13   25
28   26
61   27
16   28
40   29
9   30
86   31
33   32
9   33
98   34
19   35
4   36
243   37
nan <class 'float'>
91   38
8   39
263   40
375   41
165   42
11   43
169   44
33   45
132   46
150   47
9   48
27   49
11394 

522   50
57   51
320   52
336   53
43   54
13   55
304   56
53   57
16   58
8   59
52   60
111   61
71   62
2742 

126   63
38   64
155   65
101   66
10   67
27   68
nan <class 'float'>
54   69
32   70
144   71
344 

37   72
22   73
29   74
20   75
23   76
35   77
35   78
17   79
119   80
1511 

8594 

338   81
17   82
37   83
28   84
24   85
7   86
212   87
73   88
163   89
12   90
27   91
10   92
37   93
3894 

216   94
70   95
3440 

228   96
11   97
337   98
154   99
83   100
35   101
174   102
41   103
15   104
37   105
17   106
17   107
17   108
417   109
138

failed pages: 573~ with a long sentence
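
The note above does not say why a long sentence makes a page fail. One possible cause, stated here only as an assumption, is a sequence-length limit downstream (BERT base models accept at most 512 positions). A sketch for locating over-long entries in CoNLL-style text follows; `sentence_lengths`, `too_long`, and the threshold are hypothetical. The count is in words, so the number of word pieces can only be equal or higher.

```python
def sentence_lengths(conll_text):
    """Return the number of token lines in each blank-line-separated entry."""
    entries = conll_text.strip().split("\n\n")
    return [len(entry.splitlines()) for entry in entries]


def too_long(conll_text, max_tokens):
    """Return the indices of entries with more than max_tokens token lines."""
    return [i for i, n in enumerate(sentence_lengths(conll_text)) if n > max_tokens]


sample = "A O\nB O\n\nC O\nD O\nE O\nF O\n\nG O\n"
print(sentence_lengths(sample))         # [2, 4, 1]
print(too_long(sample, max_tokens=3))   # [1]
```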

In [36]:
# Inspect how one converted page is split into entries.
with open("Mahinda Rajapaksa.txt", 'r') as f:
    entries = f.read().strip().split("\n\n")
# entries
sents, tags_li = [], [] # list of lists
for i, entry in enumerate(entries):
    print(entry)
#     if i == 0:
#         continue
    words = [line.split()[0] for line in entry.splitlines()]
    tags = ([line.split()[-1] for line in entry.splitlines()])
    sents.append(["[CLS]"] + words + ["[SEP]"])
    tags_li.append(["<PAD>"] + tags + ["<PAD>"])

sents
# entries
# tags_li


O
Percy O
Mahendra O
Rajapaksa O
( O
Sinhala O
: O
මහින්ද O
රාජපක්ෂ O
, O
Tamil O
: O
மஹிந்த O
ராஜபக்ஷ O
; O
born O
18 O
November O
) O
is O
a O
Sri O
Lankan O
politician O
serving O
as O
Leader O
of O
the O
Opposition O
since O
2018 O
, O
and O
has O
served O
as O
Member O
of O
Parliament O
( O
MP O
) O
for O
Kurunegala O
since O
2015 O
. O
He O
served O
as O
the O
President O
of O
Sri O
Lanka O
and O
Leader O
of O
the O
Sri O
Lanka O
Freedom O
Party O
from O
2005 O
to O
2015.He O
became O
the O
leader O
of O
the O
Sri O
Lanka O
Podujana O
Peramuna O
in O
2019 O
, O
spliting O
the O
Sri O
Lanka O
Freedom O
Party O
. O
A O
lawyer O
by O
profession O
, O
Rajapaksa O
was O
first O
elected O
to O
the O
Parliament O
of O
Sri O
Lanka O
in O
1970 O
, O
and O
he O
served O
as O
Prime O
Minister O
from O
6 O
April O
2004 O
until O
his O
victory O
in O
the O
2005 O
presidential O
election O
. O
He O
was O
sworn O
in O
for O
his O
first O
six O
- O
year O
term O
as O
president O
on O
19 O
Novemb

[['[CLS]', 'O', 'Percy', 'Mahendra', 'Rajapaksa', '[SEP]'],
 ['[CLS]', '(', 'Sinhala', ':', 'මහින්ද', 'රාජපක්ෂ', ',', '[SEP]'],
 ['[CLS]', 'Tamil', ':', '[SEP]'],
 ['[CLS]',
  'மஹிந்த',
  'ராஜபக்ஷ',
  ';',
  'born',
  '18',
  'November',
  ')',
  'is',
  'a',
  'Sri',
  'Lankan',
  'politician',
  'serving',
  'as',
  'Leader',
  'of',
  'the',
  'Opposition',
  'since',
  '2018',
  ',',
  'and',
  'has',
  'served',
  'as',
  'Member',
  'of',
  'Parliament',
  '(',
  'MP',
  ')',
  'for',
  'Kurunegala',
  'since',
  '2015',
  '.',
  '[SEP]'],
 ['[CLS]',
  'He',
  'served',
  'as',
  'the',
  'President',
  'of',
  'Sri',
  'Lanka',
  'and',
  'Leader',
  'of',
  'the',
  'Sri',
  'Lanka',
  'Freedom',
  'Party',
  'from',
  '2005',
  'to',
  '2015.He',
  'became',
  'the',
  'leader',
  'of',
  'the',
  'Sri',
  'Lanka',
  'Podujana',
  'Peramuna',
  'in',
  '2019',
  ',',
  'spliting',
  'the',
  'Sri',
  'Lanka',
  'Freedom',
  'Party',
  '.',
  '[SEP]'],
 ['[CLS]',
  'A',
  'lawy
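
In the output above, the first sentence starts with `'O'` as if it were a word. That happens when a line holds only a tag (the word was whitespace), because `line.split()[0]` then returns the tag. A minimal sketch of a stricter parser, assuming such lines should simply be dropped; `parse_entry` is a hypothetical helper, not part of `NerDataset`.

```python
def parse_entry(entry):
    """Split one entry into parallel word and tag lists.

    Lines with fewer than two whitespace-separated fields are skipped.
    """
    words, tags = [], []
    for line in entry.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # tag-only or empty line: no usable word
        words.append(fields[0])
        tags.append(fields[-1])
    return words, tags


# The first line has a whitespace-only word, so only the tag survives split().
entry = " O\nPercy O\nMahendra O\nRajapaksa O"
words, tags = parse_entry(entry)
print(words)  # ['Percy', 'Mahendra', 'Rajapaksa']
print(tags)   # ['O', 'O', 'O']
```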

In [0]:
tags = ['BD', 'BP', 'PR', 'SP', 'CH', 'ED']
VOCAB_list = ['<PAD>', 'O',]
for tag in tags:
    VOCAB_list.append('I-'+tag)
    VOCAB_list.append('B-'+tag)
VOCAB = tuple(VOCAB_list)

In [59]:
type(1212)

int