# Learning Custom NER model in SpaCy to auto-detect named entities

In [72]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [73]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [74]:
article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021. 
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon – One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time. 

Flipkart – Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 

Snapdeal – Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 

ShopClues – Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc. 

Paytm Mall – To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace – Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app – Paytm. 

Reliance Retail – Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders. 
Big Basket – India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services – express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices. 

Grofers – One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes. 

Digital Mall of Asia – Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """

doc=nlp(article_text)
for ent in doc.ents:
  print(ent.text,ent.label_)

India GPE
one CARDINAL
Indian NORP
USD 84 billion MONEY
2021 DATE
USD 24 billion MONEY
2017 DATE
India GPE
Philip PERSON
12% PERCENT
2017 DATE
22-25% PERCENT
2021 DATE
India GPE
Amazon ORG
One CARDINAL
Amazon ORG
Amazon ORG
Amazon Music Limited ORG
2007 DATE
Flipkart ORG
Indian NORP
Amazon ORG
Walmart ORG
one CARDINAL
US GPE
Amazon ORG
2010 DATE
2011 DATE
more than 3 CARDINAL
India GPE
over 30 million CARDINAL
800 CARDINAL
over 125,000 CARDINAL
Indian NORP
recent years DATE
Freecharge PERSON
Unicommerce GPE
ShopClues PRODUCT
Indian NORP
ShopClues PRODUCT
July 2011 DATE
Gurugram ORG
1.1 billion CARDINAL
Nexus Venture Partners ORG
Tiger Global PERSON
Helion Ventures ORG
more than 5 CARDINAL
nine CARDINAL
Paytm Mall PERSON
Paytm PERSON
Paytm Mall PERSON
third ORDINAL
Reliance Retail PERSON
Reliance Jio PERSON
Indian NORP
Reliance ORG
Reliance ORG
India GPE
two CARDINAL
One CARDINAL
Grofers ORG
2013 DATE
the last 5 years DATE
daily DATE
India GPE
Indian NORP
Digital Mall of Asia – Going li

Look at the output above. Several entities are mislabelled: Freecharge and Tiger Global are tagged as PERSON, Unicommerce as GPE, and ShopClues as PRODUCT, although in this context all of them should be ORG. The same goes for Paytm, Paytm Mall, Reliance Retail, and Reliance Jio, which are all tagged as PERSON. Gurugram, a city, is tagged as ORG.


### Training NER to classify entities correctly

In [75]:
ner = nlp.get_pipe('ner')

spaCy accepts training data as a list of tuples.

Each tuple contains a text and a dictionary. The dictionary holds the start and end character offsets of each named entity in the text (the end offset is exclusive, as in a Python slice), together with the category or label of the named entity.

For example, ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})
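Counting character offsets by hand is error-prone. As a minimal sketch, assuming each entity phrase occurs in its sentence and only the first occurrence should be annotated, the offsets can be computed with `str.find`. The helper name `make_example` is hypothetical and not part of spaCy.

```python
def make_example(text, phrase, label):
    """Build one (text, annotations) tuple with computed character offsets.

    Assumes `phrase` occurs in `text`; only its first occurrence is annotated.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    end = start + len(phrase)  # the end offset is exclusive, like a Python slice
    return (text, {"entities": [(start, end, label)]})


example = make_example("Walmart is a leading e-commerce company", "Walmart", "ORG")
print(example)
# ('Walmart is a leading e-commerce company', {'entities': [(0, 7, 'ORG')]})
```

Generating the offsets this way keeps them in sync with the text if a sentence is edited later.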

In [76]:
# training data: (text, {"entities": [(start, end, label)]}), end offsets are exclusive
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("Flipkart started its journey from zero", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
              ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]}),
              ("I am driving a Toyota.", {"entities": [(15, 21, "PRODUCT")]})
              ]
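Hand-written offsets are easy to get wrong by a character or two, and a span that cuts through a word may not line up with any token. The following is a rough sketch of an audit that flags such spans before training. It uses a simple alphanumeric heuristic rather than spaCy's tokenizer, and the function name `misaligned_spans` is made up for this illustration.

```python
def misaligned_spans(train_data):
    """Return (text, start, end, label, snippet) for spans that look misaligned."""
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            snippet = text[start:end]
            # A boundary is suspicious when alphanumeric characters sit on both sides of it.
            cuts_start = 0 < start < len(text) and text[start - 1].isalnum() and text[start].isalnum()
            cuts_end = 0 < end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
            padded = snippet != snippet.strip()  # leading or trailing whitespace
            if cuts_start or cuts_end or padded or not snippet:
                problems.append((text, start, end, label, snippet))
    return problems


sample = [
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),       # aligned
    ("I reached Chennai yesterday.", {"entities": [(10, 16, "GPE")]}),  # one character short
]
for problem in misaligned_spans(sample):
    print(problem)
# ('I reached Chennai yesterday.', 10, 16, 'GPE', 'Chenna')
```

Printing the sliced snippet next to the label makes a shifted offset obvious at a glance.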

In [77]:
# Adding labels to the `ner`

for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

In [78]:
# While training one pipeline component, the other components shouldn't be affected

# Disable the pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

(a) To train an NER model, you have to loop over the examples for a sufficient number of iterations. Training for only 5 or 6 iterations may not be effective.

(b) Before every iteration, it is good practice to shuffle the examples with random.shuffle() so that the model does not generalize from the order of the examples.

(c) The training data is usually passed in batches.

spaCy's minibatch() function splits the training data into batches. Its size parameter sets the batch size and accepts either a fixed integer or an iterator of sizes. The utility function compounding() generates such an iterator: an infinite series of compounding values.

compounding() takes three inputs: start (the first value), stop (the maximum value that can be generated), and compound, the factor by which each value is multiplied to produce the next one.
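To make the batch-size schedule concrete, here is a pure-Python sketch of the idea. It is an illustrative re-implementation, not spaCy's own code, and the names `compounding_sketch` and `minibatch_sketch` are hypothetical.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Written for an increasing series (compound > 1).
    """
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Split items into consecutive batches whose sizes come from the `sizes` iterator."""
    items = list(items)
    position = 0
    while position < len(items):
        size = int(next(sizes))  # fractional sizes are truncated
        yield items[position:position + size]
        position += size


print([round(v, 3) for v in islice(compounding_sketch(4.0, 32.0, 1.001), 3)])
# [4.0, 4.004, 4.008]
print([len(b) for b in minibatch_sketch(range(20), compounding_sketch(4.0, 32.0, 1.001))])
# [4, 4, 4, 4, 4]
```

With a factor of 1.001 the size grows by only 0.1% per batch, so on a data set of about 20 examples every batch should hold four examples. The compounding schedule only matters on much larger data sets.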

For each iteration, the model is updated through the nlp.update() command. In spaCy v3, which the code below uses, the main parameters of nlp.update() are:

    examples: A batch of Example objects. Each one is built from a text and its annotations with Example.from_dict().

    drop: This represents the dropout rate.

    sgd: The optimizer. It is optional; if it is omitted, nlp.update() uses the pipeline's default optimizer.

    losses: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.

At each word, update() makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn't, it adjusts the weights so that the correct action will score higher next time.

Finally, all of the training is done inside the nlp.disable_pipes() context, so that the other components are not updated.

In [79]:
import random
from spacy.training.example import Example
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):

    # Training for 30 iterations
    for iteration in range(30):

        # Shuffling examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}

        # Batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

        for batch in batches:
            examples = []
            for text, annotation in batch:
                example = Example.from_dict(nlp.make_doc(text), annotation)
                examples.append(example)

            nlp.update(
                examples,  # batch of Example objects
                drop=0.5,  # dropout - make it harder to memorize data
                losses=losses,
            )

        print("Losses", losses)

Losses {'ner': 13.9771563638423}




Losses {'ner': 12.285115603818763}
Losses {'ner': 11.546935547412808}
Losses {'ner': 8.5671408131798}
Losses {'ner': 8.69265727982853}
Losses {'ner': 4.612449450881454}
Losses {'ner': 4.257090980177775}
Losses {'ner': 3.0995573838552875}
Losses {'ner': 4.683424434282797}
Losses {'ner': 2.6093997939594584}
Losses {'ner': 3.9578975267199423}
Losses {'ner': 2.259847518117554}
Losses {'ner': 4.826305386370259}
Losses {'ner': 0.04189992520967517}
Losses {'ner': 0.5532236750269922}
Losses {'ner': 1.3733060068577314}
Losses {'ner': 2.595102990073144}
Losses {'ner': 0.07269885696994079}
Losses {'ner': 7.420432282286659e-05}
Losses {'ner': 1.7360327066079668}
Losses {'ner': 4.031402664763472}
Losses {'ner': 0.18942591569940428}
Losses {'ner': 0.810541170752198}
Losses {'ner': 0.1173423028761718}
Losses {'ner': 0.001454015186435243}
Losses {'ner': 0.018201167020807568}
Losses {'ner': 0.05359590678632756}
Losses {'ner': 6.542456480857925e-05}
Losses {'ner': 0.0036747235965190746}
Losses {'ner': 0

In [80]:
doc = nlp("I am driving a AUDI")
print()
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])


Entities [('AUDI', 'PRODUCT')]


Notice that even though the model was never trained on “AUDI”, it labelled it as a PRODUCT based on the similarity of the context.

This is the real strength of an NER model.

A good model does not just memorize the training examples; it learns from them and generalizes to new ones.

In [81]:
# # Save the  model to directory
# output_dir = Path('nlp')
# nlp.to_disk(output_dir)
# print("Saved model to", output_dir)

# # Load the saved model and predict
# print("Loading from", output_dir)
# nlp_updated = spacy.load(output_dir)
# doc = nlp_updated("Fridge can be ordered in FlipKart" )
# print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

#### Training NER from a blank spaCy model

As it is an empty model, it does not have any pipeline components by default. You have to add the NER component to the pipeline with the `add_pipe()` method.

You do not have to disable other pipeline components as in the previous case.

You need comparatively more training examples in this case.

Before you start training the new model, call `nlp.begin_training()`. In spaCy v3 this method is named `nlp.initialize()`; the old name still works but is deprecated.

In [82]:
nlp=spacy.blank("en")

ner = nlp.add_pipe('ner')

nlp.begin_training()

#After this, you can follow the same exact procedure as in the case for pre-existing model.
#This is how you can train the named entity recognizer to identify and categorize correctly as per the context.

<thinc.optimizers.Optimizer at 0x7f420282da80>

#### Adding new entity type

In [83]:
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')

In [84]:
# New label to add
LABEL = "FOOD"

# Training examples in the required format
TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}),
              ("China's noodles are very famous", {"entities": [(8, 15, "FOOD")]}),
              ("Shrimps are famous in China too", {"entities": [(0, 7, "FOOD")]}),
              ("Lasagna is another classic of Italy", {"entities": [(0, 7, "FOOD")]}),
              ("Sushi is extremely famous and expensive Japanese dish", {"entities": [(0, 5, "FOOD")]}),
              ("Unagi is a famous seafood of Japan", {"entities": [(0, 5, "FOOD")]}),
              ("Tempura , Soba are other famous dishes of Japan", {"entities": [(0, 7, "FOOD")]}),
              ("Udon is a healthy type of noodles", {"entities": [(0, 4, "FOOD")]}),
              ("Chocolate soufflé is extremely famous french cuisine", {"entities": [(0, 17, "FOOD")]}),
              ("Flamiche is french pastry", {"entities": [(0, 8, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0, 7, "FOOD")]}),
              ("Frenchfries are considered too oily", {"entities": [(0, 11, "FOOD")]}),
              ("I ate Pasta. Pasta is tasty.", {"entities": [(6, 11, "FOOD"), (13, 18, "FOOD")]})
           ]

In [85]:
# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [86]:
# TRAINING THE MODEL
with nlp.disable_pipes(*other_pipes):

    # Training for 30 iterations
    for iteration in range(30):

        # Shuffling examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}

        # Batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

        for batch in batches:
            examples = []
            for text, annotation in batch:
                example = Example.from_dict(nlp.make_doc(text), annotation)
                examples.append(example)

            nlp.update(
                examples,  # batch of Example objects
                drop=0.5,  # dropout - make it harder to memorize data
                losses=losses,
            )

        print("Losses", losses)



Losses {'ner': 35.5110822299458}
Losses {'ner': 29.59050418805498}
Losses {'ner': 27.245973437629004}
Losses {'ner': 27.891098902582616}
Losses {'ner': 21.996921687520626}
Losses {'ner': 22.09606195749052}
Losses {'ner': 20.380360020568332}
Losses {'ner': 17.304002318174753}
Losses {'ner': 15.748360321915243}
Losses {'ner': 20.936289012781344}
Losses {'ner': 14.4717055890942}
Losses {'ner': 11.189857555669732}
Losses {'ner': 8.455215350317303}
Losses {'ner': 8.813771430723136}
Losses {'ner': 5.622159225023097}
Losses {'ner': 3.2460395492910834}
Losses {'ner': 4.244475671227024}
Losses {'ner': 3.833080125352737}
Losses {'ner': 3.608760729033867}
Losses {'ner': 3.0387880080739937}
Losses {'ner': 4.728233427958036}
Losses {'ner': 5.084212401576973}
Losses {'ner': 4.050221489557719}
Losses {'ner': 1.791718887978489}
Losses {'ner': 2.366056361443631}
Losses {'ner': 1.6670740343313302}
Losses {'ner': 1.4873293624819082}
Losses {'ner': 2.3885286445604668}
Losses {'ner': 2.0115464616039405}
Lo

In [87]:
test_text = "I ate Sushi yesterday. Maggi is a common fast food "
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent)

Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
Sushi
Maggi


In [88]:
# output_dir=Path('new_entity')

# # Saving the model to the output directory
# if not output_dir.exists():
#   output_dir.mkdir()
# nlp.meta['name'] = 'my_ner'  # rename model
# nlp.to_disk(output_dir)
# print("Saved model to", output_dir)

# # Loading the model from the directory
# print("Loading from", output_dir)
# nlp2 = spacy.load(output_dir)
# assert nlp2.get_pipe("ner").move_names == move_names
# doc2 = nlp2(' Dosa is an extremely famous south Indian dish')
# for ent in doc2.ents:
#   print(ent.label_, ent.text)

# Resume Parser using Spacy

In [89]:
!pip install pdfminer.six

Defaulting to user installation because normal site-packages is not writeable


In [90]:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

In [91]:
def parse_resume(text):
    doc = nlp(text)

    # Extract information based on spaCy's NER (Named Entity Recognition) capabilities
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return entities
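A flat list of `(text, label)` pairs is awkward to read for a whole resume. One option, sketched here, is to group the pairs by label. The helper name `group_entities` and the sample pairs are made up for this illustration.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hand-written pairs standing in for the list that parse_resume() returns
sample_entities = [("Amazon", "ORG"), ("2017", "DATE"), ("Amazon", "ORG"), ("Flipkart", "ORG")]
print(group_entities(sample_entities))
# {'ORG': ['Amazon', 'Flipkart'], 'DATE': ['2017']}
```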

In [126]:
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe('ner')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [127]:
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
unaffected_pipes

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

In [139]:
# Add the new labels to ner
NEW_LABELS = [
    "TENURE",
    "SKILLS",
    "COURSE",
    "INVOLVEMENT",
    "EXPERIENCE",
    "SOCIALS",
    "EDUCATION",
]
# "ORG" (for education and work institutions) is already a built-in label

for label in NEW_LABELS:
    ner.add_label(label)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

In [140]:
ner.labels

('CARDINAL',
 'COURSE',
 'DATE',
 'EDUCATION',
 'EVENT',
 'EXPERIENCE',
 'FAC',
 'GPE',
 'INVOLVEMENT',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'SKILLS',
 'SOCIALS',
 'TENURE',
 'TIME',
 'WORK_OF_ART')

In [130]:
resume = """Rupesh Ghimire
rupeshghimire007@gmail.com | +977-9868155925

EDUCATION
Pashchimanchal Campus, IOE, TU
Bachelor in Computer Engineering
2019-Present
Fusemachines
Micro-Degree in Artificial Intelligence
2023-Present
Fellowship Scholar
LINKS
Github:// rupeshghimire7
LinkedIn:// rupesh-ghimire7
Medium:// rupeshghimire7
LeetCode:// rupeshghimire7
INVOLVEMENTS
NAAMI | ANAIS-Student Ambassador
Apr 2023 - Jun 2023
i-CES | Django Mentor
Jan 2023 - Feb 2023 | Pokhara
Made attendees familiar with Python and Django's MVC
architecture, ORM, Templates and Rest Framework.
Coding Competition (GCES) - 2023
Code with Coffee (i-CES) - 2022
PROJECTS
LIVER CIRRHOSIS PREDICTION
Fusemachines ML Final Project | (Feb 2023- Apr 2023)
Worked on the multiclass classification problem with
various classifiers to predict the stage of patients' liver.
Deployed on Flask.
BCT Study Room
Software Engineering Project | (May 2022- Jul 2022)
A django web-app to enable discussion in groups.
BookSuggestor
COURSEWORKSDBMS Project
Leveraged RAW SQL for database management using
MySQL in Django bypassing Django's ORM.
UNDERGRADUATETelegram Chat Bot
Data Structures & Algorithms
Operating Systems
Database Management System
Software Engineering
Artificial Intelligence
Computer NetworksPersonal Project
Used Telegram API for configuration and python for
implementation.
Machine Learning AlgorithmsACHIEVEMENTS
FUSEMACHINES
PROGRAMMING SKILLS
EXPERIENCED
Python | Django
Tailwind CSS | HTML5 | CSS3
INTERMEDIATE
Pytorch
MySQL
Pandas | Numpy
Matplotlib | Seaborn | Scikit-Learn
RESTful API
C/C++
FAMILIAR
Tensorflow/ Keras
JavaScript | ReactJS
ML Projects
Personal Project
Collection of Regression and Classification projects.
HultPrize 2021
To build viable food enterprises to create jobs, stimulate
economies, reimagine supply chains, and improve outcomes for
10,000,000 people by 2030.
OnCampus: 1st Runner Up
Regional Summit: Participant via Wildcard
Golden Jubilee Scholarship Scheme
Awarded from Embassy of India, Kathmandu for
Undergrad Studies.
LANGUAGES
Nepali - Native Proficiency
English - Professional Working Proficiency"""

In [131]:
resume = resume.replace('\n', ' ').replace('|', '')
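Removing `|` outright leaves double spaces behind, which show up in the extracted entities further down. A slightly more thorough normalisation, sketched here under the assumption that all whitespace runs can safely be collapsed, might look like this. The function name `normalize_resume_text` is hypothetical.

```python
import re


def normalize_resume_text(raw):
    """Replace newlines and '|' separators with spaces, then collapse whitespace runs."""
    text = raw.replace("\n", " ").replace("|", " ")
    return re.sub(r"\s+", " ", text).strip()


print(normalize_resume_text("NAAMI | ANAIS-Student Ambassador\nApr 2023 - Jun 2023"))
# NAAMI ANAIS-Student Ambassador Apr 2023 - Jun 2023
```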

### TRAINER FUNCTION

`TRAIN_DATA` is a list of training tuples of the form:
        
        [("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
        ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]})]
        
`resume_doc` is the spaCy Doc object of the resume:

In [145]:
def trainer(TRAIN_DATA,resume_doc):
    optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    
    # TRAINING THE MODEL
    with nlp.disable_pipes(*unaffected_pipes):

        # Training for n iterations
        for iteration in range(10):

            # Shuffling examples before every iteration
            random.shuffle(TRAIN_DATA)
            losses = {}

            # Batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

            for batch in batches:
                examples = []
                for text, annotation in batch:
                    example = Example.from_dict(nlp.make_doc(text), annotation)
                    examples.append(example)

                nlp.update(
                    examples,  
                    sgd=optimizer,
                    drop=0.5,  
                    losses=losses,
                )

            print("Losses", losses)
            
        skills = []
        tenure = []
        experience = []
        course = []
        involvement = []
        org = []
        education = []
        
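        # Re-run the freshly trained pipeline on the resume text: `resume_doc` was
        # parsed before training, so its entities predate the updated weights.
        doc = nlp(resume_doc.text)
        for ent in doc.ents: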
            if ent.label_ == "SKILLS":
                skills.append(ent.text)
                print("SKILL:", ent.text)
            elif ent.label_ == "COURSE":
                course.append(ent.text)
                print("COURSE:", ent.text)
            elif ent.label_ == "TENURE":
                tenure.append(ent.text)
                print("TENURE:", ent.text)
            elif ent.label_ == "ORG":
                org.append(ent.text)
                print("ORG:", ent.text)
            elif ent.label_ == "INVOLVEMENT":
                involvement.append(ent.text)
                print("INVOLVEMENT:", ent.text)
            elif ent.label_ == "EXPERIENCE":
                experience.append(ent.text)
                print("EXPERIENCE:", ent.text)
            elif ent.label_ == "EDUCATION":
                education.append(ent.text)
                print("EDUCATION:", ent.text)
            else: pass
        
        return skills, course, tenure, org, involvement, experience, education
    


### Data Generator function

Label -> "SKILLS"

List of data -> ['React', 'Django', 'Machine Learning', 'AI']  

In [154]:
def data_generator(label, data_list, level=None):
    # `level` is accepted for future use but is not used yet.
    training_data = []

    for data in data_list:
        # Each item is its own training text, so the entity spans the whole string
        training_data.append((data, {"entities": [(0, len(data), label)]}))

    return training_data

### Creating data for technical skills, i.e. Entity Label: SKILLS

In [146]:
technical_skills = [
    # Programming Languages
    "Python", "Java", "C++", "JavaScript", "HTML", "CSS", "Ruby", "Go", "Swift", "Kotlin",
    "TypeScript", "Rust", "Scala", "PHP", "C#", "Objective-C",

    # Web Development
    "Django", "Flask", "Node.js", "Express.js", "React", "Angular", "Vue.js", "Next.js",
    "Spring Boot", "Ruby on Rails", "ASP.NET", "Meteor", "HTML5", "CSS3", "Bootstrap",

    # Databases
    "SQL", "MySQL", "PostgreSQL", "MongoDB", "Redis", "SQLite", "Firebase", "Cassandra",

    # Machine Learning / Data Science
    "TensorFlow", "PyTorch", "Scikit-learn", "Keras", "Pandas", "NumPy", "Matplotlib",
    "Seaborn", "NLTK", "Spacy", "Scrapy", "Beautiful Soup",

    # Cloud Computing
    "AWS", "Azure", "Google Cloud Platform (GCP)", "Docker", "Kubernetes", "Heroku",

    # DevOps
    "Jenkins", "Travis CI", "GitLab CI", "Ansible", "Terraform", "Docker Compose",

    # Version Control
    "Git", "GitHub", "Bitbucket", "GitLab",

    # Mobile Development
    "React Native", "Flutter", "Xamarin", "SwiftUI", "Android SDK",

    # Frameworks
    "React", "Angular", "Vue.js", "Django", "Flask", "Ruby on Rails", "Spring Boot", "Express.js",

    # Libraries
    "Pandas", "NumPy", "Matplotlib", "Seaborn", "NLTK", "Spacy", "Scrapy", "Beautiful Soup",

    # Frontend Technologies
    "React", "Angular", "Vue.js", "Next.js", "TypeScript", "Webpack", "Babel", "SASS",

    # Backend Technologies
    "Node.js", "Django", "Flask", "Spring Boot", "Ruby on Rails", "Express.js", "PHP",

    # Mobile Frameworks
    "React Native", "Flutter", "Xamarin", "SwiftUI", "Android SDK",

    # Networking
    "TCP/IP", "HTTP/HTTPS", "DNS", "Load Balancing", "Firewalls", "Proxy Servers",

    # Security
    "Cybersecurity", "Penetration Testing", "Cryptography", "OWASP", "SSL/TLS",

    # Operating Systems
    "Linux", "Windows", "macOS", "Unix",

    # Other Technologies
    "Blockchain", "Serverless", "Microservices", "RESTful API", "GraphQL",

    # Project Management / Agile
    "Scrum", "Kanban", "Agile", "JIRA", "Trello",

    # Miscellaneous
    "Data Manipulation", "Natural Language Processing (NLP)", "Computer Vision",
]
level = ['beginner', 'intermediate', 'expert']

In [153]:
TRAIN_DATA = data_generator("SKILLS",technical_skills, level)

for data in TRAIN_DATA[:5]:
    print(data)
for data in TRAIN_DATA[-5:]:
    print(data)
print("TRAIN_DATA length", len(TRAIN_DATA))

('Python', {'entities': [(0, 6, 'SKILLS')]})
('Java', {'entities': [(0, 4, 'SKILLS')]})
('C++', {'entities': [(0, 3, 'SKILLS')]})
('JavaScript', {'entities': [(0, 10, 'SKILLS')]})
('HTML', {'entities': [(0, 4, 'SKILLS')]})
('JIRA', {'entities': [(0, 4, 'SKILLS')]})
('Trello', {'entities': [(0, 6, 'SKILLS')]})
('Data Manipulation', {'entities': [(0, 17, 'SKILLS')]})
('Natural Language Processing (NLP)', {'entities': [(0, 33, 'SKILLS')]})
('Computer Vision', {'entities': [(0, 15, 'SKILLS')]})
TRAIN_DATA length 136
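The skills list names some technologies under several categories (React, Django, and Flask, for instance), so they appear more than once in the training data and carry extra weight. If that is not intended, the list can be de-duplicated first. This is a small sketch with a hypothetical helper name and a made-up sample list.

```python
def unique_in_order(items):
    """Drop repeated entries while preserving first-seen order."""
    return list(dict.fromkeys(items))


sample_skills = ["React", "Django", "Pandas", "React", "Django", "Flask"]
print(unique_in_order(sample_skills))
# ['React', 'Django', 'Pandas', 'Flask']
```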


In [156]:
results = trainer(TRAIN_DATA, nlp(resume))

Losses {'ner': 0.008960743059662227}
Losses {'ner': 7.617366391243015e-05}
Losses {'ner': 0.020943281238995106}
Losses {'ner': 9.62696453844033e-06}
Losses {'ner': 0.0035727945813399664}
Losses {'ner': 0.0005496944615650698}
Losses {'ner': 6.869120225059044e-07}
Losses {'ner': 2.319660427829893e-07}
Losses {'ner': 9.670310305236177e-07}
Losses {'ner': 2.087947047659063}
SKILL: Rupesh Ghimire rupeshghimire007@gmail.com  
SKILL: +977-9868155925  EDUCATION Pashchimanchal Campus, IOE
SKILL: TU Bachelor
SKILL: Computer Engineering
SKILL: 2019-Present Fusemachines Micro-Degree
SKILL: Artificial Intelligence 2023-Present Fellowship Scholar LINKS Github:// rupeshghimire7
SKILL: rupesh-ghimire7 Medium:// rupeshghimire7
SKILL: LeetCode:// rupeshghimire7 INVOLVEMENTS NAAMI  ANAIS-Student Ambassador Apr 2023 - Jun 2023 i-CES  Django Mentor
SKILL: 2023 - Feb 2023  
SKILL: Pokhara Made
SKILL: Python and Django's MVC architecture,
SKILL: ORM, Templates
SKILL: Rest Framework
SKILL: Coding Competition 

**The model now emits SKILL entities, but it also tags the name, email address, phone number, and dates as skills, so there is a lot of room for improvement.**
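One likely contributor to the over-tagging in the output above is that every training example is a bare term whose whole text is the entity, so the model never sees what non-skill context looks like. A minimal sketch of one possible remedy is to embed each term in a sentence template and compute its span. The templates and the function name `contextual_examples` are assumptions made for illustration, not part of the original notebook.

```python
def contextual_examples(label, terms, templates):
    """Embed each term in each template and record the term's character span.

    Every template must contain exactly one '{}' placeholder and no other braces.
    """
    examples = []
    for term in terms:
        for template in templates:
            prefix = template.split("{}")[0]  # text before the placeholder
            text = template.format(term)
            start = len(prefix)
            examples.append((text, {"entities": [(start, start + len(term), label)]}))
    return examples


# Hypothetical templates; real resumes would suggest better ones
TEMPLATES = ["Built a web app using {}.", "Experienced in {} development."]
for example in contextual_examples("SKILLS", ["Django", "React"], TEMPLATES):
    print(example)
```

The resulting tuples have the same shape as the existing training data, so they could be passed to the same training function. Whether this actually improves the predictions would need to be checked on real resumes.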

### Train Tenure (e.g. Jan 2023 - Jan 2024)