<a href="https://colab.research.google.com/github/sibat119/finetune-transformer-models/blob/main/finetune_bert_with_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Fine-Tuning with PyTorch for Named Entity Recognition

Use BERT with the Hugging Face Transformers library for PyTorch to quickly and efficiently fine-tune a model that reaches near state-of-the-art performance on Named Entity Recognition (NER). More broadly, I describe the practical application of transfer learning in NLP to build high-performance models with minimal effort across a range of NLP tasks.

# 1. Setup

# 1.1. Using Colab GPU for Training

For PyTorch to use the GPU, we need to identify the GPU and specify it as the device. Later, in our training loop, we will load data onto the device.

In [11]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


# 1.2. Installing the Hugging Face Library

In [12]:
!pip install transformers



The code in this notebook is a simplified version of the [run_ner.py](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_ner.py) example script from Hugging Face. `run_ner.py` relies on the helper file [utils_ner.py](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/utils_ner.py), which we also followed for this fine-tuning process.

`run_ner.py` is a helpful utility that lets you pick which NER task to run and which pre-trained model to use (you can see the list of possible models [here](https://github.com/huggingface/transformers/blob/e6cff60b4cbc1158fbd6e4a1c3afda8dc224f566/examples/run_glue.py#L69)). It supports running on the CPU, a single GPU, or multiple GPUs, and it also supports 16-bit precision for a further speedup.



Unfortunately, all of this configurability comes at the cost of *readability*. In this notebook, we've simplified the code greatly and added plenty of comments to make it clear what's going on.

# 2. Loading the CoNLL-2003 Dataset

# 2.1. Download & Extract

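We'll download the three dataset files (train, test, and validation) to the Colab instance's file system with the `wget` shell command. The cells below also install and import the `wget` Python package, although the download itself does not call it.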

In [13]:
!pip install wget



In [14]:
import wget
import os


# Download the file
if not os.path.exists('./conll'):
  print('Downloading dataset...')
  !wget -P conll/ "https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt"
  !wget -P conll/ "https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/test.txt"
  !wget -P conll/ "https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/valid.txt"
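
Once the files are downloaded, they need to be read into sentences. CoNLL-2003 files store one token per line with whitespace-separated columns (word, POS tag, chunk tag, NER tag), a blank line between sentences, and `-DOCSTART-` lines at document boundaries. The sketch below shows one way such a file could be parsed. The helper name `parse_conll` and the sample lines are illustrative assumptions, not code from `run_ner.py` or `utils_ner.py`.

```python
from typing import Iterable, List, Tuple

# One sentence is a pair: (list of words, list of NER tags).
Sentence = Tuple[List[str], List[str]]


def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-2003 lines into (tokens, ner_tags) pairs, one per sentence.

    Hypothetical helper for illustration only.
    """
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.strip()
        # A blank line ends a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the word
        tags.append(columns[-1])    # last column: the NER tag
    if tokens:
        # Flush the final sentence when the input has no trailing blank line.
        sentences.append((tokens, tags))
    return sentences


# Small illustrative sample in the CoNLL-2003 column layout.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = parse_conll(sample)
for words, ner_tags in sentences:
    print(list(zip(words, ner_tags)))
```

To apply the same helper to the downloaded data, open `conll/train.txt` and pass the file object directly, since iterating over a file yields its lines.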
