The Dockerfile below builds an image that can run this notebook.

```
FROM docker.io/pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04
# Cuda 11.1, Py 3.8, PyTorch 1.9, Jupyter notebook and jupyter lab installed

# proxy related env settings
ENV https_proxy=<your_proxy_url>
ENV http_proxy=<your_proxy_url>
ENV no_proxy=<your_no_proxy_url>
ENV DEBIAN_FRONTEND=noninteractive

# Set environment variables
# LC_ALL and LANG are needed to download spaCy models
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV SHELL=/bin/bash

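# Fix for "apt-get update returned a non-zero code: 100":
# https://stackoverflow.com/questions/38002543/apt-get-update-returned-a-non-zero-code-100
# The package index must be refreshed before any package can be installed.
RUN apt-get update && apt-get install -y apt-transport-https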

# RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
#   build-essential \
#   cmake \
#   g++ \
#   git \
#   curl \
#   vim \
#   wget

# to enable scp inside and outside of docker container
# RUN apt-get -y update && apt-get install -y openssh-server

# Upgrade pip, setuptools, and wheel before installing any packages
RUN pip install -U pip setuptools wheel

# for learning torchtext, torchvision and captum (for model interpretability)
RUN pip install torchvision torchtext matplotlib tensorboard captum


# Core deep learning libraries for NLP besides PyTorch:
# spaCy, transformers, and sentence-transformers
# (the extras spec is quoted so the shell never treats the brackets as a glob)
RUN pip install transformers tokenizers "spacy[cuda111,transformers]" sentencepiece sentence-transformers

# install spacy small and trf models
RUN python -m spacy download en_core_web_sm
RUN python -m spacy download en_core_web_trf

# Python ecosystem for traditional machine learning and data science
RUN pip install -U scikit-learn pandas numpy scipy seaborn


RUN pip install jupyter

# to run jupyter in HPC
RUN jupyter nbextension enable --py widgetsnbextension && \
    rm -rf /var/lib/apt/lists/*
```

In [1]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /opt/conda/lib/python3.7/site-packages/spacy

NAME              SPACY                 VERSION
en_core_web_sm    >=3.3.0.dev0,<3.4.0   3.3.0   ✔
en_core_web_trf   >=3.3.0.dev0,<3.4.0   3.3.0   ✔



In [2]:
import os
import re
import pandas as pd

### Dir Locations

In [4]:
DATA_DIR = "/path/to/dir/spacy_model_training_ner/data/diease_ner/train_dev_test_split_conll_data/"
SPACY_DATA_DIR = "/path/to/dir/spacy_model_training_ner/data/diease_ner/train_dev_test_split_spacy_binary/"
CONFIG_DIR = "/path/to/dir/spacy_model_training_ner/data/model_config/spacy_roberta_base_model/" 
SPACY_ROBERTA_MODEL_DIR_GPU = '/path/to/dir/spacy_model_training_ner/data/model_weights/spacy_roberta_base_/'
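
Every later cell depends on these paths, so it can help to check them before starting anything expensive. The following is a minimal sketch and not part of the original notebook; the helper name `missing_dirs` is hypothetical. The model output directory is left out of the check because the training step creates it.

```python
from pathlib import Path


def missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [str(p) for p in paths if not Path(p).is_dir()]


# Hypothetical usage with the variables defined above:
# missing_dirs([DATA_DIR, SPACY_DATA_DIR, CONFIG_DIR])
# An empty list means every directory exists.
```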

In [5]:
!ls $DATA_DIR
!ls $SPACY_DATA_DIR

dev_data.conll	test_data.conll  train_data.conll
dev_data.spacy	test_data.spacy  train_data.spacy


### 1. spaCy Convert
The CoNLL files were already converted to the binary `.spacy` format while training the spaCy small model, so this step is skipped here.
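
For readers who skipped that step, the sketch below illustrates the kind of input the conversion starts from. It assumes a common CoNLL layout that this notebook does not confirm: one token per line, the NER tag in the last whitespace-separated column, and a blank line between sentences. The function name `read_conll` and the `DISEASE` label are illustrative only.

```python
def read_conll(text):
    """Parse CoNLL-style text into sentences of (token, tag) pairs.

    Assumed layout: one token per line, tag in the last column,
    blank line between sentences, optional -DOCSTART- marker lines.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank or marker line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = "Patient O\nhas O\nasthma B-DISEASE\n\nNo O\nfever B-DISEASE\n"
parsed = read_conll(sample)
```

Running a check like this over `train_data.conll` before conversion is a cheap way to spot lines with a missing tag column.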

### 2. spaCy Init Config

In [6]:
!python -m spacy init config $CONFIG_DIR/original_trf_config.cfg --lang en -G --pipeline "ner" --optimize accuracy --force

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: GPU
- Transformer: roberta-base
✔ Auto-filled config with all values
✔ Saved config
/path/to/dir/spacy_model_training_ner/data/model_config/spacy_roberta_base_model/original_trf_config.cfg
You can now add your data and train your pipeline:
python -m spacy train original_trf_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
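
The dotted flags in the suggested command, such as `--paths.train`, override values in the matching config section (here `[paths]`). The sketch below only illustrates that mapping from dotted names to nested settings. It is not spaCy's own parser, and `overrides_to_nested` is a hypothetical helper.

```python
def overrides_to_nested(args):
    """Turn ['--paths.train', 'x.spacy'] into {'paths': {'train': 'x.spacy'}}."""
    nested = {}
    it = iter(args)
    for flag in it:
        if not flag.startswith("--"):
            raise ValueError(f"expected a --section.key flag, got {flag!r}")
        value = next(it, None)
        if value is None:
            raise ValueError(f"missing value for {flag}")
        # Everything before the last dot names the nested sections.
        *parents, leaf = flag[2:].split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


overrides = overrides_to_nested(
    ["--paths.train", "./train.spacy", "--paths.dev", "./dev.spacy"]
)
# overrides == {'paths': {'train': './train.spacy', 'dev': './dev.spacy'}}
```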


### 3. Debug spaCy Data

In [9]:
!python -m spacy debug data $CONFIG_DIR/original_trf_config.cfg \
--paths.train $SPACY_DATA_DIR/train_data.spacy \
--paths.dev $SPACY_DATA_DIR/dev_data.spacy \
--verbose

Downloading: 100%|██████████████████████████████| 481/481 [00:00<00:00, 665kB/s]
Downloading: 100%|███████████████████████████| 878k/878k [00:00<00:00, 2.40MB/s]
Downloading: 100%|███████████████████████████| 446k/446k [00:00<00:00, 1.47MB/s]
Downloading: 100%|█████████████████████████| 1.29M/1.29M [00:00<00:00, 3.09MB/s]
Downloading: 100%|███████████████████████████| 478M/478M [00:34<00:00, 14.6MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be ex

### 4. Train RoBERTa base NER model

In [14]:
%%time
!python3 -m spacy train $CONFIG_DIR/original_trf_config.cfg \
--output $SPACY_ROBERTA_MODEL_DIR_GPU \
--paths.train $SPACY_DATA_DIR/train_data.spacy \
--paths.dev $SPACY_DATA_DIR/dev_data.spacy \
--verbose \
-g 0

[2022-05-24 07:44:08,351] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
✔ Created output directory:
/path/to/dir/SpacyNER/data/model_weights/spacy_roberta_base_
ℹ Saving to output directory:
/path/to/dir/SpacyNER/data/model_weights/spacy_roberta_base_
ℹ Using GPU: 0
[2022-05-24 07:44:11,169] [INFO] Set up nlp object from config
[2022-05-24 07:44:11,178] [DEBUG] Loading corpus from path: /path/to/dir/spacy_model_training_ner/data/diease_ner/train_dev_test_split_spacy_binary/dev_data.spacy
[2022-05-24 07:44:11,180] [DEBUG] Loading corpus from path: /path/to/dir/spacy_model_training_ner/data/diease_ner/train_dev_test_split_spacy_binary/train_data.spacy
[2022-05-24 07:44:11,180] [INFO] Pipeline: ['transformer', 'ner']
[2022-05-24 07:44:11,184] [INFO] Created vocabulary
[2022-05-24 07:44:11,185] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaMode