# Ways to load an NER dataset

## For a Hugging Face tokenizer
> If you are using a Hugging Face tokenizer, most of the preprocessing can be automated as follows.

First, load a tokenizer.

In [1]:
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

Import the pre-built loader for the downloaded label data.

In [2]:
from langhuan.loaders import load_ner_data_pytorch_huggingface

Calling the loader returns a dataset.

In [3]:
data_ds = load_ner_data_pytorch_huggingface(
    "ner_result_sample.json", # your label result
    tk, # your tokenizer
)

Get a data loader. This method saves you the effort of specifying `collate_fn` yourself.

In [4]:
data_loader = data_ds.get_data_loader(batch_size=3, num_workers=2)
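
For context, a `collate_fn` is the function a PyTorch `DataLoader` uses to merge individual samples into one batch. The sketch below is not langhuan's implementation; it only illustrates, in plain Python, the kind of padding work such a function typically does. The names `pad_batch` and `pad_id` are hypothetical.

```python
def pad_batch(rows, pad_id=0):
    """Pad variable-length token id lists to the longest row in the batch."""
    longest = max((len(row) for row in rows), default=0)
    # Append pad_id to every shorter row so all rows share one length.
    return [row + [pad_id] * (longest - len(row)) for row in rows]


batch = pad_batch([[101, 2013, 102], [101, 102]])
print(batch)
# [[101, 2013, 102], [101, 102, 0]]
```

Once every row has the same length, the batch can be stacked into a single rectangular tensor.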

Split the dataset into training and validation sets.

In [5]:
train_ds, val_ds = data_ds.split_train_valid(valid_ratio=.2)
len(train_ds), len(val_ds)

(8, 0)
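
The `(8, 0)` above shows that splitting a very small dataset can leave the validation set empty. One way this can happen is when each row is assigned to the validation set independently at random. Whether langhuan splits this way is an assumption, not something confirmed here. A minimal sketch with hypothetical names (`random_split`, `rows`):

```python
import random


def random_split(rows, valid_ratio=0.2, seed=None):
    """Send each row to the validation set with probability valid_ratio."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        # rng.random() is uniform in [0, 1), so a small dataset can end up
        # with an empty validation set purely by chance.
        (valid if rng.random() < valid_ratio else train).append(row)
    return train, valid


train_rows, valid_rows = random_split(list(range(8)), valid_ratio=0.2, seed=0)
print(len(train_rows), len(valid_rows))
```

Under this scheme, 8 rows with a ratio of 0.2 give an empty validation set with probability 0.8^8, roughly 17%. Checking the two lengths after splitting, and splitting again if needed, avoids training without validation data.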

In [6]:
data_ds.labels

[{'index': 0,
  'now': '21-02-01_15:39:20',
  'pandas': 0,
  'remote_addr': '127.0.0.1',
  'tags': [{'label': 'school',
    'offset': 122,
    'text': 'University of Maryland'},
   {'label': 'company', 'offset': 346, 'text': 'Bricklin'}],
  'user_id': '4de71c07fa'},
 {'index': 1,
  'now': '21-02-01_15:38:29',
  'pandas': 1,
  'remote_addr': '127.0.0.1',
  'tags': [{'label': 'school',
    'offset': 213,
    'text': 'University of Washington'},
   {'label': 'company', 'offset': 340, 'text': 'SI'}],
  'user_id': '4de71c07fa'},
 {'index': 2,
  'now': '21-02-01_15:39:03',
  'pandas': 2,
  'remote_addr': '127.0.0.1',
  'tags': [{'label': 'school', 'offset': 89, 'text': 'Purdue University'},
   {'label': 'company', 'offset': 107, 'text': 'Engineering Computer Network'},
   {'label': 'company',
    'offset': 1795,
    'text': 'Purdue Electrical Engineering'}],
  'user_id': '4de71c07fa'},
 {'index': 3,
  'now': '21-02-01_15:39:11',
  'pandas': 3,
  'remote_addr': '127.0.0.1',
  'tags': [{'label
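
Each entry in `tags` carries a `label`, an `offset`, and the tagged `text`. Assuming `offset` is the character position where the tagged text starts in the source document (an assumption based on the sample above), a tag can be turned into a `(start, end, label)` span as in this sketch. The helper name `tags_to_spans` is hypothetical and not part of langhuan.

```python
def tags_to_spans(row):
    """Convert one labeled row into sorted (start, end, label) character spans."""
    spans = []
    for tag in row.get("tags", []):
        start = tag["offset"]
        end = start + len(tag["text"])  # end is exclusive
        spans.append((start, end, tag["label"]))
    return sorted(spans)


row = {
    "index": 0,
    "tags": [
        {"label": "school", "offset": 122, "text": "University of Maryland"},
        {"label": "company", "offset": 346, "text": "Bricklin"},
    ],
}
print(tags_to_spans(row))
# [(122, 144, 'school'), (346, 354, 'company')]
```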

## Test a sample of x, y

In [7]:
batch = data_ds.one_batch(5)

In [8]:
x = batch["input_ids"]
y = batch["targets"]
x, y

(tensor([[ 101, 2013, 1024,  ..., 8040, 5332,  102],
         [ 101, 2013, 1024,  ...,    0,    0,    0],
         [ 101, 2013, 1024,  ...,    0,    0,    0],
         [ 101, 2013, 1024,  ...,    0,    0,    0],
         [ 101, 2013, 1024,  ..., 4797, 2016,  102]]),
 tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]))

Slicing the sequences is left to the user, so the tensors come back at full length:

In [9]:
x.shape, y.shape

(torch.Size([5, 512]), torch.Size([5, 512]))
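
One possible slicing choice is to cut a batch down to the length of its longest unpadded row. The sketch below shows the idea on plain Python lists so that it runs without extra dependencies. It assumes a padding id of 0, which matches the trailing zeros in the `input_ids` above. `trim_padding` is a hypothetical helper, not a langhuan function.

```python
def trim_padding(input_ids, targets, pad_id=0):
    """Cut trailing columns that contain only padding in every row."""
    longest = 0
    for row in input_ids:
        # Index just past the last non-padding token in this row.
        used = max((i + 1 for i, tok in enumerate(row) if tok != pad_id), default=0)
        longest = max(longest, used)
    return [r[:longest] for r in input_ids], [r[:longest] for r in targets]


x_rows = [[101, 2013, 102, 0, 0], [101, 102, 0, 0, 0]]
y_rows = [[0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
print(trim_padding(x_rows, y_rows))
# ([[101, 2013, 102], [101, 102, 0]], [[0, 1, 0], [0, 0, 0]])
```

On tensors of shape `(batch, length)`, the same cut is an ordinary slice along the second dimension, for example `x[:, :longest]`.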

## Convert x, y back to NER tags
This also works for a predicted y.

Make sure both the x and y tensors are:
* of type `torch.LongTensor` (convert with `.long()` if needed)
* on the CPU, not on CUDA (move with `.cpu()` if needed)

In [10]:
data_ds.decode(batch, y)

[{'row_id': 0,
  'token_id': 34,
  'text': 'new mexico state university',
  'label': 'school',
  'offset': 75},
 {'row_id': 1,
  'token_id': 55,
  'text': 'university of maryland',
  'label': 'school',
  'offset': 122},
 {'row_id': 1,
  'token_id': 108,
  'text': 'bricklin',
  'label': 'company',
  'offset': 346},
 {'row_id': 2,
  'token_id': 51,
  'text': 'university of chicago',
  'label': 'school',
  'offset': 151},
 {'row_id': 3,
  'token_id': 27,
  'text': 'harris computer systems division',
  'label': 'company',
  'offset': 73},
 {'row_id': 3,
  'token_id': 205,
  'text': 'harris corporation',
  'label': 'company',
  'offset': 645}]
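
To make the decoding step less opaque, here is a rough sketch of how per-token label ids can be grouped back into tagged spans. It is not langhuan's `decode` implementation: it assumes id 0 means "no tag" (consistent with the mostly-zero `targets` above) and that adjacent tokens with the same non-zero id belong to one entity. `group_spans` and `id2label` are hypothetical names.

```python
def group_spans(tokens, label_ids, id2label):
    """Group runs of identical non-zero label ids into tagged spans."""
    spans, start = [], None
    # A trailing 0 sentinel closes any span still open at the end.
    for i, lab in enumerate(list(label_ids) + [0]):
        if start is not None and lab != label_ids[start]:
            spans.append({
                "token_id": start,  # position of the span's first token
                "text": " ".join(tokens[start:i]),
                "label": id2label[label_ids[start]],
            })
            start = None
        if lab != 0 and start is None:
            start = i
    return spans


tokens = ["at", "purdue", "university", "and", "ibm"]
label_ids = [0, 1, 1, 0, 2]
id2label = {1: "school", 2: "company"}
print(group_spans(tokens, label_ids, id2label))
# [{'token_id': 1, 'text': 'purdue university', 'label': 'school'},
#  {'token_id': 4, 'text': 'ibm', 'label': 'company'}]
```

Joining tokens with spaces is a simplification. Recovering the exact source text and character `offset`, as in the output above, needs the tokenizer's offset information.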

## TensorFlow
> Development pending. [Check here](https://github.com/raynardj/langhuan) if you would like to help.