# Custom NER-based parser using spaCy 3
1. Download the skeleton of the base [config file](https://spacy.io/usage/training) that matches your system requirements.


2. Clone the source data [repo](https://github.com/laxmimerit/CV-Parsing-using-Spacy-3.git) to train the custom `ner` model.

    ```bash
    git clone https://github.com/laxmimerit/CV-Parsing-using-Spacy-3.git
    ```

3. Generate the `config.cfg` file from `base_config.cfg`; it is used for configuring the spaCy model parameters.

    ```bash
    python -m spacy init fill-config base_config.cfg config.cfg
    ```



4. Parse and convert source training data into `*.spacy` format.


5. Train the blank `ner`-based model on the custom data using the command below.

    ```bash
    python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id=0
    ```
    **Remark:** Here `dev.spacy` holds the test data, i.e. the data used for model evaluation.


6. Evaluate model performance on unseen data.
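
For step 6, spaCy reports entity precision, recall and F-score (`ents_p`, `ents_r`, `ents_f`) while training. As a rough illustration of what those numbers mean, the sketch below computes exact-match scores from gold and predicted `(start, end, label)` tuples. The function name `entity_prf` and the sample spans are hypothetical, and this is a simplification rather than spaCy's own scorer.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over per-document entity lists.

    gold / predicted: one list per document of (start, end, label) tuples.
    """
    tp = fp = fn = 0
    for gold_ents, pred_ents in zip(gold, predicted):
        gold_set, pred_set = set(gold_ents), set(pred_ents)
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical spans: the second prediction has the wrong end offset.
gold = [[(0, 3, "Name"), (10, 21, "Designation")]]
pred = [[(0, 3, "Name"), (10, 18, "Designation")]]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

A span only counts as correct when both offsets and the label match, so a prediction with a slightly wrong boundary is penalised twice: once as a false positive and once as a false negative.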

## 1. Installing dependencies & importing packages

In [1]:
!pip install spacy_transformers
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy_transformers
  Downloading spacy_transformers-1.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (192 kB)
Collecting transformers<4.27.0,>=3.4.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)

In [2]:
!pip freeze > requirement.txt

In [None]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import locale
print(locale.getpreferredencoding())

UTF-8


In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

After overriding `locale.getpreferredencoding` as in the cell above, shell commands run with `!` should work as expected.

In [None]:
# Checking GPU configuration
!nvidia-smi

Wed Apr 12 08:50:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    27W /  70W |    397MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2. Downloading and preparing the `base_config.cfg` file

In [None]:
!python -m spacy init fill-config '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/data/base_config.cfg' '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/data/config.cfg'

✔ Auto-filled config with all values
✔ Saved config
/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
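
The suggested command passes `--paths.train` and `--paths.dev` because the filled config typically leaves those paths unset. The sketch below inspects a small, hypothetical excerpt of a `config.cfg` with the standard-library `configparser`; the excerpt and the helper name `summarize_config` are assumptions for illustration. spaCy itself reads the file with its own config system, so treat this only as a quick way to see what is in it.

```python
import configparser
import json

# Hypothetical excerpt of a filled config.cfg; a real file has many more sections.
SAMPLE_CONFIG = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[corpora.train]
path = ${paths.train}
"""


def summarize_config(text):
    """Return the pipeline component names and the [paths] keys still set to null."""
    # interpolation=None keeps values such as ${paths.train} as plain strings.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    pipeline = json.loads(parser["nlp"]["pipeline"])  # the value is JSON-formatted
    unset_paths = [key for key, value in parser["paths"].items() if value == "null"]
    return pipeline, unset_paths


pipeline, unset_paths = summarize_config(SAMPLE_CONFIG)
print("Pipeline components:", pipeline)
print("Paths to supply on the command line:", unset_paths)
```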


## 3. Loading the source data and preparing it for model training.

In [None]:
# Use a context manager so the file handle is closed after loading.
with open(r'/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/data/train_data.json', 'r') as f:
    cv_Data = json.load(f)

In [None]:
len(cv_Data)

200

In [None]:
cv_Data[0]

['Govardhana K Senior Software Engineer  Bengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/ b2de315d95905b68  Total IT experience 5 Years 6 Months Cloud Lending Solutions INC 4 Month • Salesforce Developer Oracle 5 Years 2 Month • Core Java Developer Languages Core Java, Go Lang Oracle PL-SQL programming, Sales Force Developer with APEX.  Designations & Promotions  Willing to relocate: Anywhere  WORK EXPERIENCE  Senior Software Engineer  Cloud Lending Solutions -  Bangalore, Karnataka -  January 2018 to Present  Present  Senior Consultant  Oracle -  Bangalore, Karnataka -  November 2016 to December 2017  Staff Consultant  Oracle -  Bangalore, Karnataka -  January 2014 to October 2016  Associate Consultant  Oracle -  Bangalore, Karnataka -  November 2012 to December 2013  EDUCATION  B.E in Computer Science Engineering  Adithya Institute of Technology -  Tamil Nadu  September 2008 to June 2012  https://www.indeed.com/r/Govardhana-K/b2de315d95905b68?isid=rex-
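
Each record pairs the resume text with a dictionary of `entities`, given as `[start, end, label]` character offsets. Before conversion it can help to check the offsets, since spans that run past the text or include surrounding whitespace tend not to align with token boundaries. The following is a minimal sketch under that assumption; `find_bad_annotations` and the sample record are hypothetical.

```python
def find_bad_annotations(records):
    """Return (record_index, start, end, label, reason) for suspicious entity spans."""
    problems = []
    for i, (text, annot) in enumerate(records):
        for start, end, label in annot.get("entities", []):
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label, "offsets out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((i, start, end, label, "leading/trailing whitespace"))
    return problems


# Hypothetical record in the same [text, {"entities": [...]}] shape as cv_Data.
sample = [
    [
        "Raj is an ML Engineer",
        {"entities": [[0, 3, "Name"], [10, 22, "Designation"], [3, 10, "Designation"]]},
    ],
]
problems = find_bad_annotations(sample)
for problem in problems:
    print(problem)
```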

In [None]:
def get_spacy_doc(file, data):
    """Convert (text, {'entities': [...]}) pairs into a DocBin.

    Entities that do not align with token boundaries are logged to `file`.
    """
    nlp = spacy.blank('en')
    db = DocBin()

    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        ents = []
        used_indices = set()  # character offsets already claimed by an earlier entity

        for start, end, label in annot['entities']:
            # Skip entities overlapping an earlier one; doc.ents rejects overlapping spans.
            if any(idx in used_indices for idx in range(start, end)):
                continue
            used_indices.update(range(start, end))

            try:
                span = doc.char_span(start, end, label=label, alignment_mode='strict')
            except Exception:
                continue

            if span is None:
                # Offsets do not match token boundaries: log the span and drop it.
                file.write(str([start, end]) + "    " + str(text) + "\n")
            else:
                ents.append(span)

        try:
            doc.ents = ents
            db.add(doc)
        except Exception:
            # Skip documents whose entities cannot be set.
            pass

    return db
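
The overlap handling in `get_spacy_doc` is order dependent: the first entity to claim a character range wins, and any later entity touching that range is dropped. The pure-Python sketch below isolates just that step so the behaviour can be checked without spaCy; `drop_overlapping` and the sample offsets are hypothetical.

```python
def drop_overlapping(entities):
    """Keep entities in order, skipping any that overlap an already kept one."""
    kept = []
    used = set()  # character offsets covered so far
    for start, end, label in entities:
        if any(i in used for i in range(start, end)):
            continue
        used.update(range(start, end))
        kept.append((start, end, label))
    return kept


# Hypothetical offsets: the second span overlaps the first on characters 8..11.
entities = [(0, 12, "Name"), (8, 20, "Designation"), (25, 30, "Location")]
kept = drop_overlapping(entities)
print(kept)
```

Because the result depends on annotation order, sorting entities (for example longest first) before filtering would be a design choice worth considering if important labels are being dropped.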

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(cv_Data, test_size=0.3)
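
The split above is not seeded, so the 70/30 partition changes on every run. `train_test_split` accepts a `random_state` argument for that purpose; the standard-library sketch below shows the same idea without scikit-learn. The helper name `seeded_split` and the seed value are assumptions.

```python
import random


def seeded_split(records, test_size=0.3, seed=42):
    """Shuffle a copy of `records` with a fixed seed and split it into train/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 200 annotated resumes.
train_records, test_records = seeded_split(range(200))
print(len(train_records), len(test_records))
```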

In [None]:
file = open(r"/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/error.txt", "w", encoding="utf-8")

db_train = get_spacy_doc(file, train)
db_train.to_disk(r'/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/train_data.spacy')

db_test = get_spacy_doc(file, test)
db_test.to_disk(r'/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/test_data.spacy')

file.close()

100%|██████████| 140/140 [00:02<00:00, 64.84it/s]
100%|██████████| 60/60 [00:00<00:00, 62.41it/s]


## 4. Model training

In [None]:
!python -m spacy train '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/data/config.cfg' --output '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/output' --paths.train '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/train_data.spacy' --paths.dev '/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/test_data.spacy' --gpu-id=0

✔ Created output directory: /content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/output
ℹ Saving to output directory: /content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/output
ℹ Using GPU: 0
[2023-04-12 09:12:56,039] [INFO] Set up nlp object from config
[2023-04-12 09:12:56,050] [INFO] Pipeline: ['transformer', 'ner']
[2023-04-12 09:12:56,053] [INFO] Created vocabulary
[2023-04-12 09:12:56,054] [INFO] Finished initializing nlp object
Downloading (…)lve/main/config.json: 100% 481/481 [00:00<00:00, 82.4kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 4.03MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 1.08MB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 19.1MB/s]
Downloading pytorch_model.bin: 100% 501M/501M [00:01<00:00, 296MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['
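
The training command repeats the same long Drive path, which contains spaces and therefore needs quoting. One possible way to keep it manageable is to build the argument list in Python and hand it to `subprocess.run`, which needs no shell quoting. This is only a sketch: `build_train_command` and the base directory are hypothetical, while the flags are the ones used in the cell above.

```python
import shlex
import sys
from pathlib import PurePosixPath


def build_train_command(base_dir, gpu_id=0):
    """Assemble the `spacy train` argument list for a project directory."""
    base = PurePosixPath(base_dir)
    return [
        sys.executable, "-m", "spacy", "train", str(base / "data" / "config.cfg"),
        "--output", str(base / "output"),
        "--paths.train", str(base / "train_data.spacy"),
        "--paths.dev", str(base / "test_data.spacy"),
        "--gpu-id", str(gpu_id),
    ]


# Hypothetical project directory; substitute your own Drive folder.
cmd = build_train_command("/content/drive/MyDrive/NER based Resume Parser")
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
# import subprocess; subprocess.run(cmd, check=True)  # uncomment to start training
```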

## 5. Loading the trained model and testing it on unseen data

In [None]:
nlp = spacy.load(r'/content/drive/MyDrive/Colab Notebooks/Work/Gembo/NER based Resume Parser/output/model-best')

In [None]:
doc = nlp('My name is Raj. I work as a ML Engineer at GEMBO. I have 2 years of experience')
for ent in doc.ents:
    print(ent.text, "------------>", ent.label_)

Raj. ------------> Name
ML Engineer ------------> Designation
2 years ------------> Years of Experience
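
For a resume parser, the predicted entities are usually more convenient as a dictionary keyed by label. The sketch below groups `(text, label)` pairs that way; with a real `doc` the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`. The helper `group_entities` is hypothetical, and stripping trailing punctuation such as the period in `Raj.` is an assumption about the desired output, not something the model does.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [texts]}, trimming spaces and periods."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text.strip(" ."))
    return dict(grouped)


# Pairs copied from the printed output above.
pairs = [
    ("Raj.", "Name"),
    ("ML Engineer", "Designation"),
    ("2 years", "Years of Experience"),
]
parsed = group_entities(pairs)
print(parsed)
```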
