# Fine Tuning Transformer for Named Entity Recognition

### Introduction

In this tutorial we will fine tune a transformer model for the **Named Entity Recognition** (NER) problem. 
This is a common business problem: given a piece of text (a sentence or a document), identify the entities in it, such as names, locations and numbers.

#### Flow of the notebook

The notebook is divided into separate sections to provide an organized walk-through of the process. This process can be modified for individual use cases. The sections are:

1. [Installing packages for preparing the system](#section00)
2. [Importing Python Libraries and preparing the environment](#section01)
3. [Importing and Pre-Processing the domain data](#section02)
4. [Preparing the Dataset and Dataloader](#section03)
5. [Creating the Neural Network for Fine Tuning](#section04)
6. [Fine Tuning the Model](#section05)
7. [Validating the Model Performance](#section06)

#### Technical Details

This script builds on several tools designed by other teams. Details of the tools are given below. Please ensure that these elements are present in your setup to run this script successfully.

 - Data:
	- We are working from a dataset available on [Kaggle](https://www.kaggle.com/)
    - This NER annotated dataset is available at the following [link](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
    - We will be working with the file `ner.csv` from the dataset. 
    - In the given file we will be looking at the following columns for the purpose of this fine tuning:
        - `sentence_idx` : Identifier of the sentence that the word in the row belongs to
        - `word` : Word in the sentence
        - `tag` : This is the identifier that is used to identify the entity in the dataset. 
    - The entities tagged in this dataset are:
        - geo = Geographical Entity
        - org = Organization
        - per = Person
        - gpe = Geopolitical Entity
        - tim = Time indicator
        - art = Artifact
        - eve = Event
        - nat = Natural Phenomenon


 - Language Model Used:
	 - We are using BERT for this project. The Hugging Face team provides a model class for token classification called **BertForTokenClassification**. We will use it inside our custom model class for training. 
	 - [Blog-Post](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
     - [Documentation for python](https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification)


 - Hardware Requirements:
	 - Python 3.6 and above
	 - Pytorch, Transformers and All the stock Python ML Libraries
	 - TPU enabled setup. This can also be executed over GPU but the code base will need some changes. 


 - Script Objective:
	 - The objective of this script is to fine tune **BertForTokenClassification** so that it can identify the entities labeled in the given dataset.

In [1]:
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount("/content/drive")

    #!pip install kaggle --upgrade
    !pip install --upgrade --force-reinstall --no-deps kaggle
    import os
    import json
    %cd /content/drive/MyDrive/colab_notebooks/kaggle/
    with open("kaggle.json", "r") as f:
        json_data = json.load(f)
    os.environ["KAGGLE_USERNAME"] = json_data["username"]
    os.environ["KAGGLE_KEY"] = json_data["key"]

Mounted at /content/drive
Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/3a/e7/3bac01547d2ed3d308ac92a0878fbdb0ed0f3d41fb1906c319ccbba1bfbc/kaggle-1.5.12.tar.gz (58kB)
     |████████████████████████████████| 61kB 3.3MB/s 
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-cp37-none-any.whl size=73053 sha256=e493807e5eff28712d1a64a0b12c2e07b85b23bb436f57facdc3ae93fec04f5e
  Stored in directory: /root/.cache/pip/wheels/a1/6a/26/d30b7499ff85a4a4593377a87ecf55f7d08af42f0de9b60303
Successfully built kaggle
Installing collected packages: kaggle
  Found existing installation: kaggle 1.5.12
    Uninstalling kaggle-1.5.12:
      Successfully uninstalled kaggle-1.5.12
Successfully installed kaggle-1.5.12
/content/drive/MyDrive/colab_notebooks/kaggle


In [2]:
if 'google.colab' in sys.modules:
    %cd /content/drive/MyDrive/colab_notebooks/kaggle/Kaggle-Coleridge-Initiative/notebooks
    import yaml
    from pprint import pprint
    with open('../config/config.yml') as file:
        CFG = yaml.safe_load(file)
    pprint(CFG)

/content/drive/MyDrive/colab_notebooks/kaggle/Kaggle-Coleridge-Initiative/notebooks
{'debug': True,
 'epochs': 5,
 'is_single': False,
 'learning_rate': 2e-05,
 'max_len': 200,
 'tags_vals': 'o o-dataset pad',
 'test_batch_size': 16,
 'text_len': 0,
 'train': True,
 'train_batch_size': 32,
 'use_cosine': False,
 'use_pos': False,
 'valid_batch_size': 16}


In [3]:
if 'google.colab' in sys.modules:
    dname = "kagglenb006-get-text"
    !mkdir ../input/{dname}
    !kaggle kernels output riow1983/{dname} -p ../input/{dname}

mkdir: cannot create directory ‘../input/kagglenb006-get-text’: File exists
tcmalloc: large alloc 1090682880 bytes == 0x55de99024000 @  0x7fc2c11121e7 0x55de4846fe68 0x55de4843a637 0x55de4843c630 0x55de4843dafd 0x55de4852efed 0x55de484b1988 0x55de4837ed14 0x55de4852f101 0x55de4855d099 0x55de484ad51d 0x55de484ac4ae 0x55de4843fc9f 0x55de4843fea1 0x55de484aebb5 0x55de484ac4ae 0x55de4843fc9f 0x55de4843fea1 0x55de484aebb5 0x55de484ac4ae 0x55de4837ee2c 0x55de484aebb5 0x55de484ac4ae 0x55de4843f3ea 0x55de484b17f0 0x55de484ac4ae 0x55de4843f3ea 0x55de484ad60e 0x55de484ac4ae 0x55de4843fc9f 0x55de4843fea1
Output file downloaded to ../input/kagglenb006-get-text/folds_pubcat.pkl
Kernel log downloaded to ../input/kagglenb006-get-text/kagglenb006-get-text.log 


In [None]:
TRAIN_BATCH_SIZE = CFG["train_batch_size"]
VALID_BATCH_SIZE = CFG["valid_batch_size"]
EPOCHS = CFG["epochs"]
LEARNING_RATE = CFG["learning_rate"]
TRAIN = CFG["train"]
MAX_LEN = CFG["max_len"]
USE_POS = CFG["use_pos"]
DEBUG = CFG["debug"]
TEXT_LEN = CFG["text_len"]
TAGS_VALS = CFG["tags_vals"]

!python ../src/bridge.py {TRAIN} {MAX_LEN} {USE_POS} {DEBUG} {TEXT_LEN} {TAGS_VALS}
#print(MAX_LEN)
#!echo {TRAIN} {MAX_LEN} {USE_POS} {DEBUG} {TEXT_LEN} {TAGS_VALS}

Usage example:

!python bridge.py {train} {max_len} {use_pos} {debug} {text_len} {tags_vals}

Args:
['True', '200', 'False', 'True', '0', 'o', 'o-dataset', 'pad']
train: Bool: True
max_len: Int: 200
tags_vals: List[str]: ['o', 'o-dataset', 'pad']
use_pos: Bool: False
debug: Bool: True
text_len: Int: 0
Reading train data (CV folds)...
Starting to clean text...
Starting to convert df to dataset...
    Converting tokens...: 100% 500/500 [00:01<00:00, 252.40it/s]
    Starting to concatenate...
df.shape after concatenation: (500, 13)
Output file has been saved at ./dataset.pkl
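
The argument list printed above suggests that `bridge.py` receives every value as a string and that the space-separated `tags_vals` value arrives as several trailing arguments. The source of `bridge.py` is not shown in this notebook, so the following is only a sketch of how such positional arguments could be turned back into typed values. `parse_bridge_args` is a hypothetical helper, not the script's actual code.

```python
def parse_bridge_args(argv):
    """Convert positional string arguments into typed values.

    Assumed order (taken from the usage line in the log above):
    train, max_len, use_pos, debug, text_len, then one or more tag names.
    """
    if len(argv) < 6:
        raise ValueError("expected at least 6 positional arguments")

    def to_bool(text):
        # The notebook interpolates Python booleans, so they arrive as "True"/"False".
        return text == "True"

    return {
        "train": to_bool(argv[0]),
        "max_len": int(argv[1]),
        "use_pos": to_bool(argv[2]),
        "debug": to_bool(argv[3]),
        "text_len": int(argv[4]),
        # Everything after the fifth argument belongs to tags_vals.
        "tags_vals": list(argv[5:]),
    }


args = parse_bridge_args(["True", "200", "False", "True", "0", "o", "o-dataset", "pad"])
print(args["max_len"], args["tags_vals"])
```

Passing the tag names last is what lets an unquoted `{TAGS_VALS}` expand into a variable number of arguments without shifting the other values.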


In [None]:
if DEBUG:
    import pandas as pd
    import numpy as np
    tmp = pd.read_pickle("./dataset.pkl")
    print("max_len:", np.sum(np.array(tmp["sentence#"][0])=="sentence#0"))
    del tmp

max_len: 200


<a id='section00'></a>
### Installing packages for preparing the system

We install two packages: PyTorch/XLA for TPU execution and `seqeval` for F1 score calculation.
*You can skip this step if you already have these libraries installed in your environment*

In [None]:
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev
!pip -q install seqeval

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  5116  100  5116    0     0  64759      0 --:--:-- --:--:-- --:--:-- 64759
Updating... This may take around 2 minutes.
Updating TPU runtime to pytorch-dev20200515 ...
Uninstalling torch-1.6.0a0+bf2bbd9:
  Successfully uninstalled torch-1.6.0a0+bf2bbd9
Uninstalling torchvision-0.7.0a0+a6073f0:
  Successfully uninstalled torchvision-0.7.0a0+a6073f0
Copying gs://tpu-pytorch/wheels/torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl...
|
Operation completed over 1 objects/91.0 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl...
| [1 files][119.5 MiB/119.5 MiB]                                                
Operation completed over 1 objects/119.5 MiB.                       

<a id='section01'></a>
### Importing Python Libraries and preparing the environment

At this step we will be importing the libraries and modules needed to run our script. Libraries are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers
* BERT Model and Tokenizer

After that we will prepare the device for TPU execution. This configuration is needed if you want to use the onboard TPU. 

In [None]:
# Importing pytorch and the library for TPU execution

import torch
import torch_xla
import torch_xla.core.xla_model as xm

In [None]:
if 'google.colab' in sys.modules:
    !pip install transformers



In [None]:
# Importing stock ml libraries

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import gc
import transformers
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertForTokenClassification, BertTokenizer, BertConfig, BertModel

# Preparing for TPU usage
dev = xm.xla_device()

<a id='section02'></a>
### Importing and Pre-Processing the domain data

We will load the data and prepare it for fine tuning. 
*Assuming that the `ner.csv` is already downloaded in your `data` folder*

* Import the file in a dataframe and give it the headers as per the documentation.
* Cleaning the file to remove the unwanted columns.
* We will create a class `SentenceGetter` that pulls the words from the columns and assembles them into sentences
* After that we will create some additional lists and dicts to hold the data used in later processing
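
The core of the sentence-building step is grouping a flat list of words by a parallel list of sentence identifiers, which is how the `word` and `sentence#` columns are laid out in the dataframe shown below. A minimal pure-Python sketch of that grouping follows. The helper name `split_into_sentences` and the toy data are made up for illustration, and the notebook's own implementation uses NumPy indexing instead.

```python
def split_into_sentences(words, sentence_ids):
    """Group a flat word list into sentences using a parallel list of sentence ids."""
    if len(words) != len(sentence_ids):
        raise ValueError("words and sentence_ids must have the same length")

    grouped = {}
    for word, sentence_id in zip(words, sentence_ids):
        grouped.setdefault(sentence_id, []).append(word)

    # Dicts keep insertion order, so sentences come out in first-seen order.
    return list(grouped.values())


# Toy data in the same shape as one row of the dataframe.
words = ["we", "used", "adni", "data", "results", "follow"]
sentence_ids = ["sentence#0"] * 4 + ["sentence#1"] * 2

sentences = split_into_sentences(words, sentence_ids)
print(sentences)
```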

In [None]:
folds = pd.read_pickle("./dataset.pkl")
folds

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label,pub_category,fold,text,word,pos,sentence,sentence#,tag
0,b345ff09-64bf-4473-935f-fef4f9da6f23,Cognitive Training and Transcranial Direct Cur...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni,adni/alzheimer s disease neuroimaging initiati...,1,background transcranial direct current stimula...,"[background, transcranial, direct, current, st...",,sentence5827,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,8266a8ad-2f9a-44a5-898a-b9f13de7d533,Obesity-Associated Cognitive Decline: Excess W...,Baltimore Longitudinal Study of Aging (BLSA),Baltimore Longitudinal Study of Aging,baltimore longitudinal study of aging,baltimore longitudinal study of aging blsa /ba...,2,cognitive and memory performance suggests that...,"[cognitive, and, memory, performance, suggests...",,sentence6752,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,1fe822d8-0ca7-4d2a-a40b-3f19609d2a24,"The consecutive disparity index, D : a measure...",North American Breeding Bird Survey (BBS),North American Breeding Bird Survey,north american breeding bird survey,usgs north american breeding bird survey/north...,4,knowing how and why systems fluctuate with tim...,"[knowing, how, and, why, systems, fluctuate, w...",,sentence15998,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,db2e8ca3-ffcd-4681-85e0-21fbc559ebca,Cognition and Student Learning through the Arts,Early Childhood Longitudinal Study,Early Childhood Longitudinal Study,early childhood longitudinal study,early childhood longitudinal study,5,in recent years an increasing number of studie...,"[in, recent, years, an, increasing, number, of...",,sentence16616,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,4f000758-db14-4dc2-933a-06453477765f,of LaborTesting the Internal Validity of Compu...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...,trends in international mathematics and scienc...,4,abstract testing the internal validity of comp...,"[abstract, testing, the, internal, validity, o...",,sentence13317,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,550ea55c-6e6d-4dbb-8736-fb497c4234d9,Transmission dynamics and evolutionary history...,SARS-CoV-2 genome sequence,2019-nCoV genome sequences,2019 ncov genome sequences,2019 ncov genome sequences/2019 ncov genome se...,2,previous studies have confirmed that this viru...,"[previous, studies, have, confirmed, that, thi...",,sentence9415,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
496,382d3385-14e1-4049-9886-615a4c9aa27e,Estimating fiber orientation distribution from...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni,adni/alzheimer s disease neuroimaging initiati...,1,we present a novel method for estimation of th...,"[we, present, a, novel, method, for, estimatio...",,sentence1744,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
497,abfd3dca-8209-4c83-9d02-159a0c8bc366,The Enhancement of Students' Critical Thinking...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...,trends in international mathematics and scienc...,4,abstract this study aims to describe the enhan...,"[abstract, this, study, aims, to, describe, th...",,sentence13308,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
498,50d4daba-85ca-4195-a2af-23e92c797b60,Subject-adaptive Integration of Multiple SICE ...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni,adni/alzheimer s disease neuroimaging initiati...,1,as a principled method for partial correlation...,"[as, a, principled, method, for, partial, corr...",,sentence1884,"[sentence#0, sentence#0, sentence#0, sentence#...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [None]:
cv = 1
#### RIOW
#dataset_train = pd.read_pickle(f"nb003-annotation-data/nb003_cv{cv}_train.pkl")
#dataset_test = pd.read_pickle(f"nb003-annotation-data/nb003_cv{cv}_test.pkl")

dataset_train = folds[folds["fold"]!=cv]
dataset_valid = folds[folds["fold"]==cv]

#### RIOWRIOW

In [None]:
#### RIOW
# dataset_train["isTrain"] = 1
# dataset_test["isTrain"] = 0

# dataset = pd.concat([dataset_train, dataset_test], axis=0, ignore_index=True)
# del dataset_train, dataset_test
# gc.collect()
#### RIOWRIOW

In [None]:
#### RIOW
#dataset["sentence_idx"] = dataset["sentence"] + dataset["sentence#"]
#### RIOWRIOW

In [None]:
#### RIOW
# sentence_vals = list(set(dataset["sentence_idx"].values))
# sentence2idx = {v: i for i, v in enumerate(sentence_vals)}
# del sentence_vals
# gc.collect()
#### RIOWRIOW

In [None]:
#### RIOW
# dataset["sentence_idx"] = dataset["sentence_idx"].apply(lambda x: sentence2idx[x])
# del sentence2idx
# gc.collect()
#### RIOWRIOW

In [None]:
#dataset.head()

In [None]:
#### RIOW
# dataset = dataset[["sentence_idx", "word", "pos", "tag", "isTrain"]].copy()
# dataset
#### RIOWRIOW

In [None]:
#### RIOW
#num_labels = dataset["tag"].nunique() + 1 #['o', 'o-dataset', 'pad']
num_labels = len(TAGS_VALS.split())
#### RIOWRIOW
print(num_labels)

3
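
The label count above comes from splitting the `tags_vals` string on whitespace. The same split can also give the tag-to-index mapping that the dataset class relies on later. The sketch below assumes the mapping follows the order of the names in the string, which matches the `tag2idx` comment in the dataset code. `idx2tag` is an extra convenience that the notebook does not define.

```python
tags_vals = "o o-dataset pad"  # same string as CFG["tags_vals"] in the config output

tag_names = tags_vals.split()
tag2idx = {tag: idx for idx, tag in enumerate(tag_names)}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}  # reverse lookup for decoding predictions

num_labels = len(tag2idx)
print(tag2idx)
print(num_labels)
```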


In [None]:
#### RIOW
# # Creating a class to pull the words from the columns and create them into sentences

# class SentenceGetter(object):
    
#     def __init__(self, dataset):
#         self.n_sent = 1
#         self.dataset = dataset
#         self.empty = False
#         if USE_POS:
#             agg_func = lambda s: [(w,p,f,t) for w,p,f,t in zip(s["word"].values.tolist(),
#                                                         s["pos"].values.tolist(),
#                                                         s["isTrain"].values.tolist(),
#                                                         s["tag"].values.tolist())]
#         else:
#             agg_func = lambda s: [(w,f,t) for w,f,t in zip(s["word"].values.tolist(),
#                                                         s["isTrain"].values.tolist(),
#                                                         s["tag"].values.tolist())]
#         self.grouped = self.dataset.groupby("sentence_idx").apply(agg_func)
#         self.sentences = [s for s in self.grouped]
    
#     def get_next(self):
#         try:
#             s = self.grouped["Sentence: {}".format(self.n_sent)]
#             self.n_sent += 1
#             return s
#         except:
#             return None

# getter = SentenceGetter(dataset)

#### RIOWRIOW

In [None]:
#### RIOW
# # Creating new lists and dicts that will be used at a later stage for reference and processing

# tags_vals = ['o', 'o-dataset', 'pad']
# tag2idx = {t: i for i, t in enumerate(tags_vals)}
# print("tag2idx:", tag2idx)

# sentences = [' '.join([s[0] for s in sent]) for sent in getter.sentences]

# if use_pos:
#     poses = [' '.join([s[1] for s in sent]) for sent in getter.sentences]
#     istrain = [[s[2] for s in sent] for sent in getter.sentences]
#     labels = [[s[3] for s in sent] for sent in getter.sentences]
# else:
#     poses = None
#     istrain = [[s[1] for s in sent] for sent in getter.sentences]
#     labels = [[s[2] for s in sent] for sent in getter.sentences]

# labels = [[tag2idx.get(l) for l in lab] for lab in labels]
# print("set of lables:", set([l for label in labels for l in label]))
#### RIOWRIOW

<a id='section03'></a>
### Preparing the Dataset and Dataloader

We start by defining a few key variables that will be used later during the training/fine tuning stage.
We then create the Dataset class, which defines how the text is pre-processed before it is sent to the neural network. We also define the Dataloader that feeds the data to the neural network in batches for training and processing. 
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

#### *CustomDataset* Dataset Class
- This class accepts `sentences`, the `tokenizer` and `max_len` as input and generates the tokenized inputs and tags that the BERT model uses for training. 
- We are using the BERT tokenizer to tokenize the data in the `sentences` list for encoding. 
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer)
- `tags` is the encoded entity from the annotated dataset. 
- The *CustomDataset* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. 
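
Besides tokenizing the text, the dataset class has to bring every label sequence to exactly `max_len` entries so that labels can be batched alongside the padded token ids. The sketch below isolates that step. It assumes the `pad` tag has index 2, as in the `tag2idx` comment in the code, and `pad_labels` is a hypothetical helper name.

```python
PAD_LABEL = 2  # index of 'pad' in tag2idx = {'o': 0, 'o-dataset': 1, 'pad': 2}


def pad_labels(labels, max_len, pad_label=PAD_LABEL):
    """Return a copy of labels with exactly max_len entries."""
    # Append max_len pad labels, then cut: short inputs get padded, long ones truncated.
    padded = list(labels) + [pad_label] * max_len
    return padded[:max_len]


short = pad_labels([0, 1, 1, 0], max_len=6)
long = pad_labels([0] * 10, max_len=6)
print(short)
print(long)
```

Note that this pads word-level labels. The BERT tokenizer can split a word into several sub-word tokens and adds special tokens, so label positions are not guaranteed to line up one-to-one with token positions.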

#### Dataloader
- Dataloader is used to create the training and validation dataloaders that feed data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data that is loaded and passed to the neural network at each step needs to be controlled.
- This control is achieved using the parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively
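
To make the role of `batch_size` concrete, here is a small pure-Python sketch of what a dataloader does conceptually: it walks over the sample indices, optionally in shuffled order, and yields them in fixed-size chunks. This is only an illustration, not how PyTorch's `DataLoader` is implemented, and `iter_batches` is a made-up name.

```python
import random


def iter_batches(items, batch_size, shuffle=False, seed=None):
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    order = list(range(len(items)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(order)

    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]


samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
print([len(batch) for batch in batches])
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.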

In [None]:
def sentence_getter(dataset):
    sentences = []
    for _,row in tqdm(dataset.iterrows()):
        #id = row["Id"]
        
        hashes = np.array(row["sentence#"])
        num_sentences = len(np.unique(hashes))
        
        words = np.array(row["word"])
        if USE_POS:
            poses = np.array(row["pos"])
        else:
            poses = None
        
        if TRAIN:
            tags = np.array(row["tag"])
        else:
            tags = None
        
        for i in range(num_sentences):
            hash = np.where(hashes==f"sentence#{i}")[0]
            if TRAIN:
                if USE_POS:
                    sentences.append((words[hash], poses[hash], tags[hash]))
                else:
                    sentences.append((words[hash], poses, tags[hash]))
            else:
                if USE_POS:
                    sentences.append((words[hash], poses[hash], tags))
                else:
                    sentences.append((words[hash], poses, tags))
    return sentences

In [None]:
sentences_train = sentence_getter(dataset_train)
sentences_valid = sentence_getter(dataset_valid)

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




In [None]:
# def sentence_getter(dataset):
#     agg_func = lambda s: [(w,p,t) for w,p,t in zip(s["word"].values.tolist(),
#                                                    s["pos"].values.tolist(),
#                                                    s["tag"].values.tolist())]

#     grouped = dataset.groupby("sentence_idx").apply(agg_func)
#     sentences = [s for s in grouped]
#     return sentences

In [None]:
# Creating the dataset and dataloader for the neural network

#### RIOW
#tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
#### RIOWRIOW

#### RIOW
class CustomDataset(Dataset):
    def __init__(self, sentences, tokenizer, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        #self.poses = poses
        #self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        tmp = self.sentences[index]
        sentence = " ".join(tmp[0])
        
        if USE_POS:
            #pos = str(self.poses[index])
            pos = " ".join(tmp[1])
        else:
            #pos = self.poses # which is None
            pos = tmp[1]
        
        inputs = self.tokenizer.encode_plus(
            sentence,
            pos,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        if TRAIN:
            #label = self.labels[index]
            label = list(tmp[2])
            label.extend([2]*self.max_len) # tag2idx = {'o':0, 'o-dataset':1, 'pad':2}
            label = label[:self.max_len]
        
            return {'ids': torch.tensor(ids, dtype=torch.long),
                    'mask': torch.tensor(mask, dtype=torch.long),
                    'tags': torch.tensor(label, dtype=torch.long)} 
        else:
            return {'ids': torch.tensor(ids, dtype=torch.long),
                    'mask': torch.tensor(mask, dtype=torch.long)} 
    
    def __len__(self):
        return self.len


# class CustomDataset(Dataset):
#     def __init__(self, tokenizer, sentences, poses, labels, max_len):
#         self.len = len(sentences)
#         self.sentences = sentences
#         self.poses = poses
#         self.labels = labels
#         self.tokenizer = tokenizer
#         self.max_len = max_len
        
#     def __getitem__(self, index):
#         sentence = str(self.sentences[index])
#         if use_pos:
#             pos = str(self.poses[index])
#         else:
#             pos = self.poses # which is None
#         inputs = self.tokenizer.encode_plus(
#             sentence,
#             #### RIOW
#             #None,
#             pos,
#             #### RIOWRIOW
#             add_special_tokens=True,
#             max_length=self.max_len,
#             truncation=True,
#             pad_to_max_length=True,
#             return_token_type_ids=True
#         )
#         ids = inputs['input_ids']
#         mask = inputs['attention_mask']
#         label = self.labels[index]
#         label.extend([2]*self.max_len) # tag2idx = {'o':0, 'o-dataset':1, 'pad':2}
#         label=label[:self.max_len]

#         return {
#             'ids': torch.tensor(ids, dtype=torch.long),
#             'mask': torch.tensor(mask, dtype=torch.long),
#             'tags': torch.tensor(label, dtype=torch.long)
#         } 
    
#     def __len__(self):
#         return self.len

#### RIOWRIOW

In [None]:
#### RIOW
# istrain = [ist[0] for ist in istrain]
# istrain = np.array(istrain)

# train_sentences = [sentences[i] for i in np.where(istrain==1)[0]]
# if use_pos:
#     train_poses = [poses[i] for i in np.where(istrain==1)[0]]
# else:
#     train_poses = None
# train_labels = [labels[i] for i in np.where(istrain==1)[0]]


# test_sentences = [sentences[i] for i in np.where(istrain==0)[0]]
# if use_pos:
#     test_poses = [poses[i] for i in np.where(istrain==0)[0]]
# else:
#     test_poses = None
# test_labels = [labels[i] for i in np.where(istrain==0)[0]]

# print("FULL Dataset: {}".format(len(sentences)))
# print("TRAIN Dataset: {}".format(len(train_sentences)))
# print("TEST Dataset: {}".format(len(test_sentences)))

# training_set = CustomDataset(tokenizer, train_sentences, train_poses, train_labels, MAX_LEN)
# testing_set = CustomDataset(tokenizer, test_sentences, train_poses, test_labels, MAX_LEN)


dataset_train = CustomDataset(sentences_train, tokenizer, MAX_LEN)
dataset_valid = CustomDataset(sentences_valid, tokenizer, MAX_LEN)

#### RIOWRIOW


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

valid_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

dataloader_train = DataLoader(dataset_train, **train_params)
dataloader_valid = DataLoader(dataset_valid, **valid_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will be creating a neural network with the `BERTClass`. 
 - This network will have the `BertForTokenClassification` model. 
 - The data will be fed to the `BertForTokenClassification` as defined in the dataset. 
 - The outputs of the final layer are used to calculate the loss and to determine the accuracy of the model's predictions. 
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference. 
 
#### Loss Function and Optimizer
 - `Optimizer` is defined in the next cell.
 - We do not define any `Loss function` since the specified model already outputs `Loss` for a given input. 
 - `Optimizer` is used to update the weights of the neural network to improve its performance.
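
As a rough intuition for what the optimizer does with the loss, the sketch below applies the standard Adam update rule to a single scalar weight on a toy quadratic loss. It is a hand-written illustration, not the `torch.optim.Adam` implementation, and the learning rate and loss function are arbitrary choices for the example.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function with the Adam update rule, given its gradient."""
    m = 0.0  # running mean of gradients
    v = 0.0  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 3))
```

In the notebook the same idea is applied to every parameter of the network at once, with the gradients coming from `loss.backward()`.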
 
#### Further Reading
- You can refer to my [Pytorch Tutorials](https://github.com/abhimishra91/pytorch-tutorials) to get an intuition of Loss Function and Optimizer.
- [Pytorch Documentation for Loss Function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)
- Refer to the links provided on the top of the notebook to read more about `BertForTokenClassification`. 

In [None]:
# Creating the customized model: a thin wrapper around BertForTokenClassification, which already includes the token classification head.

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        #self.l1 = transformers.BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=num_labels)
        #self.l1 = transformers.BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=18)
        self.l1 = transformers.BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
        # self.l2 = torch.nn.Dropout(0.3)
        # self.l3 = torch.nn.Linear(768, 200)
    
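    def forward(self, ids, mask, labels=None):
        # Pass the mask by keyword; when labels are given, the first output element is the loss
        output_1 = self.l1(ids, attention_mask=mask, labels=labels)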
        # output_2 = self.l2(output_1[0])
        # output = self.l3(output_2)
        return output_1

    #def save_pretrained(self, path):
    #    self.l1.save_pretrained(path)

In [None]:
model = BERTClass()
model.to(dev)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

BERTClass(
  (l1): BertForTokenClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
       

In [None]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section05'></a>
### Fine Tuning the Model

After all the effort of loading and preparing the data, creating the model and defining its optimizer, this is probably the easiest step in the process. 

Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). One epoch is one complete pass of the training data through the network. 

Following events happen in this function to fine tune the neural network:
- The dataloader passes data to the model based on the batch size. 
- The output from the model and the actual tags are compared to calculate the loss. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 500 steps the loss value is printed in the console.

As you can see, within just 1 epoch the model was working with a minuscule loss of 0.08503091335296631 by the final step, i.e. on the training data the output is extremely close to the actual labels.
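
The logging cadence described above can be sketched with a little arithmetic. The helper below is only an illustration: `logged_steps` and the dataset size are hypothetical, and it assumes the dataloader keeps the last partial batch. It lists the step indices at which `step % 500 == 0` holds within one epoch.

```python
import math

def logged_steps(num_examples, batch_size, log_every=500):
    """Return the 0-based step indices at which the loss is printed in one epoch."""
    # Assumption: the last, possibly smaller, batch is kept (not dropped)
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return [step for step in range(steps_per_epoch) if step % log_every == 0]

# Hypothetical sizes, chosen only for illustration
print(logged_steps(num_examples=24000, batch_size=32))
```

With 750 steps per epoch, as in this hypothetical setup, the loss is printed twice per epoch: once at step 0 and once at step 500.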

In [None]:
def train(epoch):
    model.train()
    for step, data in enumerate(dataloader_train, 0):
        # Move the batch to the target device
        ids = data['ids'].to(dev, dtype = torch.long)
        mask = data['mask'].to(dev, dtype = torch.long)
        targets = data['tags'].to(dev, dtype = torch.long)

        # With labels supplied, the first element of the model output is the loss
        loss = model(ids, mask, labels = targets)[0]

        # Print the loss every 500 steps
        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        optimizer.zero_grad()
        loss.backward()
        xm.optimizer_step(optimizer)  # optimizer step through torch_xla
        xm.mark_step()  # trigger execution of the pending XLA graph

In [None]:
for epoch in range(EPOCHS):
    train(epoch)



Epoch: 0, Loss:  1.067091464996338
Epoch: 1, Loss:  0.00017841998487710953
Epoch: 2, Loss:  0.003136305371299386
Epoch: 3, Loss:  0.0008938525570556521
Epoch: 4, Loss:  0.000968664709944278


In [None]:
# for epoch in range(EPOCHS):
# #for epoch in range(1):
#     train(epoch)



Epoch: 0, Loss:  0.9502448439598083
Epoch: 0, Loss:  0.018455473706126213
Epoch: 1, Loss:  0.009544247761368752
Epoch: 1, Loss:  0.018766552209854126
Epoch: 2, Loss:  0.018835175782442093
Epoch: 2, Loss:  0.012853455729782581
Epoch: 3, Loss:  0.010095881298184395
Epoch: 3, Loss:  0.014236393384635448
Epoch: 4, Loss:  0.008458074182271957
Epoch: 4, Loss:  0.008434616029262543


<a id='section06'></a>
### Validating the Model

During the validation stage we pass the unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data.

This unseen data is the 30% of `ner.csv` which was separated during the Dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.

The metric used for measuring the performance of the model for these problem statements is the F1 score. We will import the `seqeval` library for the F1 score calculation and also create a helper function for computing accuracy.
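
To see why F1 is reported alongside accuracy, here is a minimal sketch in plain Python. It is a simplified token-level calculation, not what `seqeval` does (`seqeval` scores whole entity spans). The function names and the toy tag sequences are made up for illustration. It shows that predicting `o` everywhere can give a high accuracy and still an F1 of zero.

```python
def token_accuracy(true_tags, pred_tags):
    """Fraction of positions where the predicted tag equals the true tag."""
    return sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)

def token_f1(true_tags, pred_tags, ignore=("o", "pad")):
    """Micro-averaged token-level F1 over all tags not listed in `ignore`."""
    tp = fp = fn = 0
    for t, p in zip(true_tags, pred_tags):
        if p not in ignore and p == t:
            tp += 1  # correctly predicted entity tag
        else:
            if p not in ignore:
                fp += 1  # predicted an entity tag that is wrong
            if t not in ignore:
                fn += 1  # missed or mislabelled a true entity tag
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 10 tokens, 2 of them entities
true_tags = ["o", "o", "geo", "o", "per", "o", "o", "o", "o", "o"]
all_o = ["o"] * 10
print(token_accuracy(true_tags, all_o), token_f1(true_tags, all_o))
```

Because most tokens are `o`, accuracy alone rewards a model that never predicts an entity, which is why an entity-focused metric is the more informative one here.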

In [None]:
from seqeval.metrics import f1_score

def flat_accuracy(preds, labels):
    flat_preds = np.argmax(preds, axis=2).flatten()
    flat_labels = labels.flatten()
    return np.sum(flat_preds == flat_labels)/len(flat_labels)

In [None]:
def valid(model, testing_loader):
    model.eval()
    eval_loss = 0; eval_accuracy = 0
    predictions , true_labels = [], []
    nb_eval_steps, nb_eval_examples = 0, 0
    with torch.no_grad():
        # Iterate over the loader passed in rather than the global dataloader_valid
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(dev, dtype = torch.long)
            mask = data['mask'].to(dev, dtype = torch.long)
            targets = data['tags'].to(dev, dtype = torch.long)

            output = model(ids, mask, labels=targets)
            loss, logits = output[:2]
            logits = logits.detach().cpu().numpy()
            
            label_ids = targets.to('cpu').numpy()
            predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
            true_labels.append(label_ids)
            
            accuracy = flat_accuracy(logits, label_ids)
            eval_loss += loss.mean().item()
            eval_accuracy += accuracy
            nb_eval_examples += ids.size(0)
            nb_eval_steps += 1
        eval_loss = eval_loss/nb_eval_steps
        print("Validation loss: {}".format(eval_loss))
        print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
        
        pred_tags = [TAGS_VALS.split()[p_i] for p in predictions for p_i in p]
        valid_tags = [TAGS_VALS.split()[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
        print("F1-Score: {}".format(f1_score([pred_tags], [valid_tags])))

    return predictions, true_labels

In [None]:
# To get the results on the validation set. This data is not seen by the model

predictions, true_labels = valid(model, dataloader_valid)
#predictions, true_labels = valid(model, testing_loader)

# F1-Score: 0.011653313911143482 (EPOCHS=5)
# F1-Score: 0.016901408450704227 (EPOCHS=1)
# F1-Score: 0.007478033277248085 (EPOCHS=5)
# F1-Score: 0.0026246719160104987 (EPOCHS=5, nopos)
# F1-Score: 0.0028142589118198874 (EPOCHS=5, nopos, debug)



Validation loss: 0.006529409675192102
Validation Accuracy: 0.9888927469135803




F1-Score: 0.0028142589118198874


In [None]:
# pred_tags: (num_obs * max_len == 7527600, )
pred_tags = [TAGS_VALS.split()[p_i] for p in predictions for p_i in p]
pred_tags[:10]

['o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o']

In [None]:
set(pred_tags)
# Only two labels show up in the predictions

{'o', 'pad'}

In [None]:
print(len([p for p in pred_tags if p=="o"]))
len([p for p in pred_tags if p=="o"]) / len(pred_tags)

770303


0.9934266185194738

In [None]:
print(len([p for p in pred_tags if p=="o-dataset"]))
len([p for p in pred_tags if p=="o-dataset"]) / len(pred_tags)

0


0.0

In [None]:
print(len([p for p in pred_tags if p=="pad"]))
len([p for p in pred_tags if p=="pad"]) / len(pred_tags)

5097


0.00657338148052618
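
The per-tag counting cells above can be condensed into a single pass. The following is a small sketch using `collections.Counter`. The helper name `tag_distribution` and the toy list are illustrative only, not part of the notebook.

```python
from collections import Counter

def tag_distribution(tags):
    """Map each tag to (count, fraction of all tags), most frequent first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.most_common()}

# Toy stand-in for pred_tags
toy_pred_tags = ["o"] * 9 + ["pad"]
for tag, (n, frac) in tag_distribution(toy_pred_tags).items():
    print(f"{tag}: {n} ({frac:.2%})")
```

Calling `tag_distribution(pred_tags)` on the real predictions would report every predicted tag at once, so a tag that never appears is simply absent from the result.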

### Save the Model

In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Print optimizer's state_dict
#print("Optimizer's state_dict:")
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])

Model's state_dict:
l1.bert.embeddings.position_ids 	 torch.Size([1, 512])
l1.bert.embeddings.word_embeddings.weight 	 torch.Size([30522, 768])
l1.bert.embeddings.position_embeddings.weight 	 torch.Size([512, 768])
l1.bert.embeddings.token_type_embeddings.weight 	 torch.Size([2, 768])
l1.bert.embeddings.LayerNorm.weight 	 torch.Size([768])
l1.bert.embeddings.LayerNorm.bias 	 torch.Size([768])
l1.bert.encoder.layer.0.attention.self.query.weight 	 torch.Size([768, 768])
l1.bert.encoder.layer.0.attention.self.query.bias 	 torch.Size([768])
l1.bert.encoder.layer.0.attention.self.key.weight 	 torch.Size([768, 768])
l1.bert.encoder.layer.0.attention.self.key.bias 	 torch.Size([768])
l1.bert.encoder.layer.0.attention.self.value.weight 	 torch.Size([768, 768])
l1.bert.encoder.layer.0.attention.self.value.bias 	 torch.Size([768])
l1.bert.encoder.layer.0.attention.output.dense.weight 	 torch.Size([768, 768])
l1.bert.encoder.layer.0.attention.output.dense.bias 	 torch.Size([768])
l1.bert.encoder.

In [None]:
"""
output_model = './models/model_xlnet_mid.pth'

# save
def save(model, optimizer):
    # save
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, output_model)

save(model, optimizer)
"""

In [None]:
# ToDo: cased -> uncased

folder = "localnb001-transformers-ner"
!mkdir {folder}
#PATH = f"bert-base-cased-ner-cv{cv}.pth"
if USE_POS:
    PATH = f"bert-base-uncased-ner-pad-cv{cv}-epochs{EPOCHS}.pth"
else:
    PATH = f"bert-base-uncased-ner-pad-nopos-cv{cv}-epochs{EPOCHS}.pth"

def save(model, optimizer, folder, path, as_tpu=False):
    # Save the model and optimizer state dicts as a single checkpoint.
    # xm.save transfers the data to CPU before writing the file.
    if not as_tpu:
        # Move the model itself to CPU first (note: this changes the model in place)
        model = model.to("cpu")
    xm.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, "./"+folder+"/"+path)

save(model, optimizer, folder, PATH, as_tpu=False)

mkdir: cannot create directory ‘localnb001-transformers-ner’: File exists
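
The `mkdir` message above appears because the folder already exists. One way to avoid it, sketched here with the standard library only, is to create the folder with `exist_ok=True` and build the checkpoint path in the same helper. `checkpoint_path` and its parameters are hypothetical names. The file-name pattern mirrors the `PATH` logic in the cell above.

```python
import os
import tempfile

def checkpoint_path(folder, cv, epochs, use_pos, base="bert-base-uncased-ner-pad"):
    """Create `folder` if needed and return the checkpoint path inside it."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    suffix = "" if use_pos else "-nopos"
    return os.path.join(folder, f"{base}{suffix}-cv{cv}-epochs{epochs}.pth")

# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = os.path.join(tmp, "localnb001-transformers-ner")
    print(os.path.basename(checkpoint_path(demo_folder, cv=1, epochs=5, use_pos=False)))
    # Second call: the folder exists now, and no error is raised
    checkpoint_path(demo_folder, cv=1, epochs=5, use_pos=True)
```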


In [None]:
!date

Thu May 27 11:57:08 UTC 2021


In [None]:
!ls -l {folder}/

total 7919952
-rw------- 1 root root  647113465 Apr 24 02:26 bert-base-cased-ner-cv1.bin
-rw------- 1 root root  647113465 Apr 24 06:33 bert-base-cased-ner-cv1.pt
-rw------- 1 root root 1859958951 May 10 05:16 bert-base-cased-ner-cv1.pth
-rw------- 1 root root 1292759453 May 13 03:07 bert-base-cased-ner-pad-cv1-epochs5.pth
-rw------- 1 root root 1292759453 May 13 01:31 bert-base-cased-ner-pad-cv1.pth
-rw------- 1 root root 1306823069 May 27 11:56 bert-base-cased-ner-pad-nopos-cv1-epochs5.pth
-rw------- 1 root root  404400730 Apr 24 07:26 bert-base-cased.tar.gz
-rw------- 1 root root     213450 Nov 30  2018 bert-base-cased-vocab.txt
-rw------- 1 root root        614 May 11 04:31 _config.json
-rw------- 1 root root        313 May 11 04:31 config.json
-rw------- 1 root root        121 Apr 23 06:52 dataset-metadata.json
-rw------- 1 root root  435779157 May 11 02:23 pytorch_model.bin
-rw------- 1 root root       5116 Apr 22 23:29 pytorch-xla-env-setup.py
-rw------- 1 root root   95368646 A

In [None]:
#folder = "localnb001-transformers-ner"
#!mkdir {folder}
#PATH = f"bert-base-cased-ner-cv{cv}.pth"

#torch.save(model.state_dict(), PATH)
#torch.save(model.state_dict(), PATH)
#!mv {PATH} {folder}/{PATH}

In [None]:
# folder = "localnb001-transformers-ner"
# !mkdir {folder}

# # huggingface fine-tuned model (pre-trained) save method is different from standard PyTorch save method
# model.save_pretrained(folder)
# # reference: https://huggingface.co/transformers/model_sharing.html

mkdir: cannot create directory ‘localnb001-transformers-ner’: File exists


### Upload to Kaggle

In [7]:
folder = "localnb001-transformers-ner"
!date
!ls -l ./{folder}

Fri May 28 05:22:37 UTC 2021
total 8323096
-rw------- 1 root root  647113465 Apr 24 02:26 bert-base-cased-ner-cv1.bin
-rw------- 1 root root  647113465 Apr 24 06:33 bert-base-cased-ner-cv1.pt
-rw------- 1 root root 1859958951 May 10 05:16 bert-base-cased-ner-cv1.pth
-rw------- 1 root root 1292759453 May 13 03:07 bert-base-cased-ner-pad-cv1-epochs5.pth
-rw------- 1 root root 1292759453 May 13 01:31 bert-base-cased-ner-pad-cv1.pth
-rw------- 1 root root 1306823069 May 27 11:57 bert-base-cased-ner-pad-nopos-cv1-epochs5.pth
-rw------- 1 root root  404400730 Apr 24 07:26 bert-base-cased.tar.gz
-rw------- 1 root root     213450 Nov 30  2018 bert-base-cased-vocab.txt
-rw------- 1 root root  407873900 May 27 14:10 bert-base-uncased.tar.gz
-rw------- 1 root root     231508 May 27 12:35 bert-base-uncased-vocab.txt
-rw------- 1 root root        313 Oct 18  2018 bert_config.json
-rw------- 1 root root      18673 May 28 04:55 bridge.py
-rw------- 1 root root        614 May 11 04:31 _config.json
-rw

In [5]:
# cp src/bridge.py, config/config.yml to {folder}
#!mkdir ./{folder}/src
#!mkdir ./{folder}/config
!cp ../src/bridge.py ./{folder}/bridge.py
!cp ../config/config.yml ./{folder}/config.yml

In [None]:
#!kaggle datasets version -p {folder} -m "renamed fine-tuned model (.pt) added"
#!kaggle datasets version -p {folder} -m "huggingface's fine-tuned model added"
!kaggle datasets version -p {folder} -m "[Update] bridge.py"

Starting upload for file bert_config.json
100% 313/313 [00:03<00:00, 89.5B/s]
Upload successful: bert_config.json (313B)
Starting upload for file bert-base-cased-vocab.txt
100% 208k/208k [00:03<00:00, 66.1kB/s]
Upload successful: bert-base-cased-vocab.txt (208KB)
Starting upload for file pytorch-xla-env-setup.py
100% 5.00k/5.00k [00:02<00:00, 1.87kB/s]
Upload successful: pytorch-xla-env-setup.py (5KB)
Starting upload for file torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 91.0M/91.0M [02:43<00:00, 584kB/s]
Upload successful: torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl (91MB)
Starting upload for file torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl
  0% 0.00/119M [00:00<?, ?B/s]

In [None]:
#!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt
#!mv bert-base-cased-vocab.txt {folder}/bert-base-cased-vocab.txt
#!kaggle datasets version -p {folder} -m "pre-trained BertTokenizer added"

--2021-04-23 14:36:18--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.153.14
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.153.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 213450 (208K) [text/plain]
Saving to: ‘bert-base-cased-vocab.txt’


2021-04-23 14:36:18 (1.67 MB/s) - ‘bert-base-cased-vocab.txt’ saved [213450/213450]



In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
!mv bert-base-uncased-vocab.txt {folder}/bert-base-uncased-vocab.txt
!kaggle datasets version -p {folder} -m "[Add] pre-trained BertTokenizer (uncased)"

--2021-05-27 12:35:28--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.106.126
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.106.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2021-05-27 12:35:28 (2.01 MB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]

Starting upload for file bert-base-cased-vocab.txt
100% 208k/208k [00:01<00:00, 133kB/s]
Upload successful: bert-base-cased-vocab.txt (208KB)
Starting upload for file pytorch-xla-env-setup.py
100% 5.00k/5.00k [00:01<00:00, 3.76kB/s]
Upload successful: pytorch-xla-env-setup.py (5KB)
Starting upload for file torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 91.0M/91.0M [00:03<00:00, 25.3MB/s]
Upload successful: torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl (91MB)
Starting upload for file torch_xla-nightly+20200515-cp37-c

In [None]:
#!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz
#!mv bert-base-cased.tar.gz {folder}/bert-base-cased.tar.gz
#!tar xf {folder}/bert-base-cased.tar.gz -C {folder}
#!cp {folder}/bert_config.json {folder}/config.json
#!kaggle datasets version -p {folder} -m "[Update] pre-trained BertForTokenClassification"

Starting upload for file bert-base-cased-vocab.txt
100% 208k/208k [00:01<00:00, 200kB/s]
Upload successful: bert-base-cased-vocab.txt (208KB)
Starting upload for file pytorch-xla-env-setup.py
100% 5.00k/5.00k [00:00<00:00, 5.29kB/s]
Upload successful: pytorch-xla-env-setup.py (5KB)
Starting upload for file torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 91.0M/91.0M [00:02<00:00, 41.1MB/s]
Upload successful: torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl (91MB)
Starting upload for file torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 119M/119M [00:03<00:00, 39.5MB/s]
Upload successful: torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl (119MB)
Starting upload for file torchvision-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 2.33M/2.33M [00:01<00:00, 1.66MB/s]
Upload successful: torchvision-nightly+20200515-cp37-cp37m-linux_x86_64.whl (2MB)
Starting upload for file bert-base-cased-ner-cv1.bin
100% 617M/617M [00:07<00:00, 82.1MB/s]
Upload successful: bert-b

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
!mv bert-base-uncased.tar.gz {folder}/bert-base-uncased.tar.gz
!tar xf {folder}/bert-base-uncased.tar.gz -C {folder}
!cp {folder}/bert_config.json {folder}/config.json
!kaggle datasets version -p {folder} -m "[Update] pre-trained BertForTokenClassification (changed to uncased)"

--2021-05-27 14:10:02--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.42.150
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.42.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407873900 (389M) [application/x-tar]
Saving to: ‘bert-base-uncased.tar.gz’


2021-05-27 14:10:09 (65.7 MB/s) - ‘bert-base-uncased.tar.gz’ saved [407873900/407873900]

Starting upload for file bert-base-cased-vocab.txt
100% 208k/208k [00:01<00:00, 168kB/s]
Upload successful: bert-base-cased-vocab.txt (208KB)
Starting upload for file pytorch-xla-env-setup.py
100% 5.00k/5.00k [00:01<00:00, 4.75kB/s]
Upload successful: pytorch-xla-env-setup.py (5KB)
Starting upload for file torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 91.0M/91.0M [00:01<00:00, 51.7MB/s]
Upload successful: torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl (91MB)
Starting upload for file torch_xla-nightly+20200515-c

In [None]:
# folder = "localnb001-transformers-ner"
# !kaggle datasets version -p {folder} -m "[Update] config.json"

Starting upload for file bert-base-cased-vocab.txt
100% 208k/208k [00:01<00:00, 175kB/s]
Upload successful: bert-base-cased-vocab.txt (208KB)
Starting upload for file pytorch-xla-env-setup.py
100% 5.00k/5.00k [00:01<00:00, 4.38kB/s]
Upload successful: pytorch-xla-env-setup.py (5KB)
Starting upload for file torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 91.0M/91.0M [00:03<00:00, 27.9MB/s]
Upload successful: torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl (91MB)
Starting upload for file torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 119M/119M [00:02<00:00, 43.6MB/s]
Upload successful: torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl (119MB)
Starting upload for file torchvision-nightly+20200515-cp37-cp37m-linux_x86_64.whl
100% 2.33M/2.33M [00:01<00:00, 1.93MB/s]
Upload successful: torchvision-nightly+20200515-cp37-cp37m-linux_x86_64.whl (2MB)
Starting upload for file bert-base-cased-ner-cv1.bin
100% 617M/617M [00:08<00:00, 79.1MB/s]
Upload successful: bert-b

In [None]:
# !kaggle datasets init -p {folder}
# # reference: https://kaeru-nantoka.hatenablog.com/entry/2020/01/17/015551

# with open(f"{folder}/dataset-metadata.json", "r") as jsonFile:
#     data = json.load(jsonFile)

# data["id"] = f"riow1983/{folder}"
# data["title"] = folder

# with open(f"{folder}/dataset-metadata.json", "w") as jsonFile:
#     json.dump(data, jsonFile)

# !kaggle datasets create -p {folder}

Data package template written to: localnb001-transformers-ner/dataset-metadata.json
Starting upload for file model_initial.pth
100% 617M/617M [00:08<00:00, 80.0MB/s]
Upload successful: model_initial.pth (617MB)
Starting upload for file bert-base-cased-ner-cv1.pth
100% 617M/617M [00:10<00:00, 62.4MB/s]
Upload successful: bert-base-cased-ner-cv1.pth (617MB)
Your private Dataset is being created. Please check progress at https://www.kaggle.com/riow1983/localnb001-transformers-ner
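
The metadata edit in the commented-out cell above can be wrapped in a small helper. This is a sketch only: `set_dataset_metadata` is a hypothetical name, `some-user` is a placeholder account, and the template written in the demo is a stand-in for the file that `kaggle datasets init` produces.

```python
import json
import os
import tempfile

def set_dataset_metadata(metadata_path, owner, slug, title=None):
    """Fill in the `id` and `title` fields of a dataset-metadata.json file."""
    with open(metadata_path, "r") as f:
        data = json.load(f)
    data["id"] = f"{owner}/{slug}"
    data["title"] = title if title is not None else slug
    with open(metadata_path, "w") as f:
        json.dump(data, f, indent=2)
    return data

# Demonstrate on a stand-in template in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset-metadata.json")
    with open(path, "w") as f:
        json.dump({"title": "INSERT_TITLE_HERE", "id": "INSERT_ID_HERE"}, f)
    print(set_dataset_metadata(path, owner="some-user", slug="localnb001-transformers-ner"))
```

Any other keys already present in the file are preserved, since only `id` and `title` are overwritten.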
