
In [1]:
!pip install -U spacy -q

In [2]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.58+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



In [3]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline (tokenizer only)
db = DocBin()  # container for serializing the annotated Doc objects

In [4]:
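import json

# Load the annotated examples (a dict with 'classes' and 'annotations' keys)
with open('training_data.json', encoding='utf-8') as f:
    TRAIN_DATA = json.load(f)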

In [5]:
TRAIN_DATA

{'classes': ['CRYPTO', 'DATE', 'MONEY', 'WEBSITE'],
 'annotations': [['The first peer-reviewed paper on cryptocasinos was only published in October 2020. The gambling games discussed were laughably simple, such as bets on virtual coin flips or dice rolls. Such activities that might appeal to bored friends on a long journey must have seemed benign compared to the world of in-play sports betting and online slots available using conventional currencies.\r',
   {'entities': [[33, 46, 'CRYPTO'], [69, 82, 'DATE'], [372, 382, 'CRYPTO']]}],
  ['\r', {'entities': []}],
  ['But fast-forward a year and cryptocasinos had evolved substantially. In April 2021, Premier League football team Southampton signed a £7.5 million a year sponsorship deal with sportsbet.io, which specialises in allowing gamblers to make sports bets during matches with cryptocurrencies.\r',
   {'entities': [[176, 188, 'WEBSITE'], [269, 285, 'CRYPTO']]}],
  ['\r', {'entities': []}],
  ['Shortly afterwards, the rapper Drake anno
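
Each annotation pairs a text with character offsets `[start, end, label]`, so a quick sanity check before conversion can catch bad offsets early. The following is a minimal sketch, not part of the original notebook: `check_annotations` and the `sample` dict are hypothetical names, and the checks (known label, offsets inside the text, no overlaps, no surrounding whitespace) are assumptions about what is worth verifying.

```python
def check_annotations(data):
    """Return a list of problems found in a {'classes', 'annotations'} dict."""
    problems = []
    classes = set(data["classes"])
    for idx, (text, annot) in enumerate(data["annotations"]):
        prev_end = 0
        # Sort by start offset so an overlap shows up as start < previous end
        for start, end, label in sorted(annot["entities"]):
            if label not in classes:
                problems.append(f"example {idx}: unknown label {label!r}")
            if not (0 <= start < end <= len(text)):
                problems.append(f"example {idx}: bad offsets [{start}, {end}]")
                continue
            if start < prev_end:
                problems.append(f"example {idx}: [{start}, {end}] overlaps an earlier span")
            if text[start:end] != text[start:end].strip():
                problems.append(f"example {idx}: [{start}, {end}] includes surrounding whitespace")
            prev_end = max(prev_end, end)
    return problems


# Hypothetical sample in the same shape as TRAIN_DATA (not taken from the real file)
sample = {
    "classes": ["CRYPTO", "WEBSITE"],
    "annotations": [
        ["Bets in bitcoin on sportsbet.io",
         {"entities": [[8, 15, "CRYPTO"], [19, 31, "WEBSITE"]]}],
    ],
}
print(check_annotations(sample))
```

For the sample above the list is empty. In the notebook the same function could be called as `check_annotations(TRAIN_DATA)`.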

In [7]:
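for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)  # tokenize only; no pipeline components are run
    ents = []
    for start, end, label in annot["entities"]:
        # With alignment_mode="contract", char_span returns None when no token
        # fits entirely inside the character range
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping entity: {label} at [{start}, {end}]")
        else:
            ents.append(span)
    doc.ents = ents  # attach the gold entity spans to the Doc
    db.add(doc)

db.to_disk("./training_data.spacy")  # save the DocBin to disk for `spacy train`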

100%|██████████| 5/5 [00:00<00:00, 1226.62it/s]
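
The conversion loop relies on `alignment_mode="contract"`, which keeps only the tokens that lie completely inside the annotated character range. As a rough illustration of that idea, here is a pure-Python sketch that assumes whitespace tokenization instead of spaCy's real tokenizer; `contract_span` is a hypothetical helper, not a spaCy function, and real results can differ where spaCy splits punctuation.

```python
import re


def contract_span(text, start, end):
    """Shrink [start, end) to the whitespace-delimited tokens fully inside it.

    Returns (new_start, new_end), or None when no whole token fits.
    """
    inside = [
        (m.start(), m.end())
        for m in re.finditer(r"\S+", text)
        if m.start() >= start and m.end() <= end
    ]
    if not inside:
        return None
    return inside[0][0], inside[-1][1]


text = "paid in bitcoin today"
print(contract_span(text, 8, 15))  # offsets already on token boundaries
print(contract_span(text, 7, 17))  # stray characters on both sides
print(contract_span(text, 9, 13))  # range strictly inside one token
```

The first two calls yield `(8, 15)` and the third yields `None`, which corresponds to the "Skipping entity" branch in the loop above.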


In [8]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [9]:
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     25.22    0.00    0.00    0.00    0.00
 46     200        516.73    919.32  100.00  100.00  100.00    1.00
108     400          0.00      0.00  100.00  100.00  100.00    1.00
181     600          0.00      0.00  100.00  100.00  100.00    1.00
278     800          0.00      0.00  100.00  100.00  100.00    1.00
378    1000          0.00      0.00  100.00  100.00  100.00    1.00
537    1200          0.00      0.00  100.00  100.00  100.00    1.00
737    1400          0.00      0.00  100.00  100.00  100.00    1.00
937    1600          0.00      0.00  100.00  100.00  100.00    1.00
1137    1800          0.00      0.00  100.00  100.0
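
The training command above passes the same file to `--paths.train` and `--paths.dev`, so the 100.00 scores reflect how well the model fits the examples it was trained on rather than how it handles unseen text. One possible way to get a separate dev set is to split the annotations before building the `DocBin` files. This is a minimal sketch under stated assumptions: `split_annotations`, the 20% dev fraction, the fixed seed, and the decision to drop blank examples such as `'\r'` are all illustrative choices, not part of the original notebook.

```python
import random


def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle the non-empty examples and split them into (train, dev) lists."""
    # Assumption: blank lines such as '\r' carry no text worth training on
    examples = [ex for ex in annotations if ex[0].strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(examples)
    n_dev = max(1, int(len(examples) * dev_fraction))
    return examples[n_dev:], examples[:n_dev]


# Hypothetical data in the same shape as TRAIN_DATA['annotations']
data = [[f"text {i}", {"entities": []}] for i in range(10)]
data.append(["\r", {"entities": []}])
train, dev = split_annotations(data)
print(len(train), len(dev))
```

Each list can then go through the same conversion loop into its own `DocBin`, saved under a separate file name and passed to `--paths.train` and `--paths.dev` respectively. With only a handful of annotated paragraphs, as here, the dev set will be tiny, so the scores remain a rough signal at best.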

In [10]:
nlp_ner = spacy.load("./model-best")  # best checkpoint written by `spacy train --output ./`

In [15]:
doc = nlp_ner('''The cryptocurrency paradigm was heralded by the launch of Bitcoin (BTC) in June 2008, inspiring a new technological and social movement. The goal of cryptocurrencies is to provide a medium for global, peer-to-peer transaction settlement that preserves privacy and financial security.

A cryptocurrency monetary policy is enforced through a unique blend of software, cryptography and financial incentives rather than the whim of trusted third parties such as central banks, corporations or governments. Cryptocurrencies are powered by cryptographically secure, verifiable transaction databases called blockchains, which provide their security and transparency.

A cryptocurrency network consists of a global community of stakeholders $20,000 four figure, including the validators that secure the network while adding transactions to the blockchain, the traders who speculate on these radically market-driven assets, and the builders working to onboard people to this new financial paradigm.

At Cointelegraph, we are chronicling the ongoing story of cryptocurrency and the rise of a borderless, permissionless financial system. How will industry stakeholders work to make crypto a mainstay in people’s lives? How will crypto investments change the paradigm of the current financial system? And will incumbent and legacy systems accept or fight this change?

Stay tuned: Cryptocurrencies are going to play a big role heading into the future.
''')

In [16]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)  # highlight entities inline in Jupyter
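
The rendered view is only available inside a notebook. For logging or quick inspection, the predicted entities can also be read directly from `doc.ents`. The helpers below are a small sketch with hypothetical names (`entity_rows`, `label_counts`); they rely only on the `text`, `label_`, `start_char`, and `end_char` attributes that spaCy entity spans expose.

```python
from collections import Counter


def entity_rows(doc):
    """Return (text, label, start_char, end_char) for each entity in doc.ents."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


def label_counts(doc):
    """Count how many entities were predicted per label."""
    return Counter(ent.label_ for ent in doc.ents)
```

In the notebook, `entity_rows(doc)` lists what the trained pipeline found in the test paragraph, and `label_counts(doc)` gives a quick per-label tally.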