In [25]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

In [30]:
import json
with open('training_data.json') as f:  # context manager closes the file after loading
    TRAIN_DATA = json.load(f)
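
The conversion loop in the next cell assumes each item in TRAIN_DATA['annotations'] is a [text, {"entities": [[start, end, label], ...]}] pair. A minimal sketch of a sanity check under that assumption follows; `check_annotations` and the sample record (including its ORG label) are hypothetical and not part of spaCy. Overlaps are worth catching early because spaCy raises an error when overlapping spans are assigned to `doc.ents`.

```python
def check_annotations(annotations):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    for i, (text, annot) in enumerate(annotations):
        # Sort by start offset so overlaps can be detected in a single pass.
        spans = sorted((start, end, label) for start, end, label in annot["entities"])
        prev_end = 0
        for start, end, label in spans:
            if not (0 <= start < end <= len(text)):
                problems.append(f"item {i}: offsets ({start}, {end}) fall outside the text")
            elif start < prev_end:
                problems.append(f"item {i}: {label} at ({start}, {end}) overlaps an earlier entity")
            prev_end = max(prev_end, end)
    return problems


# Hypothetical record in the assumed format; offsets are character positions.
sample = [["TSMC makes chips for Apple.", {"entities": [[0, 4, "ORG"], [21, 26, "ORG"]]}]]
print(check_annotations(sample))
```

Running `check_annotations(TRAIN_DATA['annotations'])` before the conversion would list any records that need fixing in the annotation tool.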


In [31]:
for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping entity {label!r} at ({start}, {end}): offsets could not be aligned to tokens")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 8/8 [00:00<00:00, 532.56it/s]
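
The training command later in this notebook passes training_data.spacy for both --paths.train and --paths.dev, so the reported scores are measured on data the model has already seen. A minimal sketch of holding out a dev split before conversion follows; `split_annotations`, `dev_fraction`, and `seed` are hypothetical names, and the placeholder records only stand in for TRAIN_DATA['annotations'].

```python
import random


def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the annotations and split it into (train, dev) lists."""
    items = list(annotations)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_dev = max(1, int(len(items) * dev_fraction))  # keep at least one dev item
    return items[n_dev:], items[:n_dev]


# Placeholder records in the same shape as TRAIN_DATA['annotations'].
examples = [[f"text {i}", {"entities": []}] for i in range(8)]
train_items, dev_items = split_annotations(examples)
print(len(train_items), len(dev_items))
```

Each list can then go through the same conversion loop with its own DocBin and be saved to a separate file, matching the ./train.spacy and ./dev.spacy paths that `spacy init config` suggests. With only 8 examples the dev set is tiny, so the resulting scores would still be rough.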


In [32]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [33]:
! python -m spacy train config.cfg --output ./ --paths.train training_data.spacy --paths.dev training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
[2023-07-25 20:09:57,833] [INFO] Set up nlp object from config
[2023-07-25 20:09:57,856] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-07-25 20:09:57,861] [INFO] Created vocabulary
[2023-07-25 20:09:57,861] [INFO] Finished initializing nlp object
[2023-07-25 20:09:58,047] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     48.94    3.57    1.96   20.00    0.04
 35     200        236.72   1220.36   96.00   96.00   96.00    0.96
 81     400         24.87     91.83   96.00   96.00   96.00    0.96
138     600         49.33    110.34   92.00   92.00   92.00    0.92
205     800         53.99    128.31   96.00   96.00   96.00    

In [34]:
nlp_ner = spacy.load("model-best")

In [50]:
doc = nlp_ner('''
TSMC reported revenue slipped 10% from a year ago to NT$480.84 billion, while net income fell 23.3% from a year ago to NT$181.8 billion. The company had previously forecast second-quarter revenue between $15.2 billion and $16 billion.

TSMC said business was impacted by macroeconomic headwinds “which dampened the end market demand, and led to customers’ ongoing inventory adjustment.”

This is the company’s first quarterly net income decline since the second quarter of 2019.

TSMC forecast third-quarter revenue between $16.7 billion and $17.5 billion.

“Moving into third quarter 2023, we expect our business to be supported by the strong ramp of our 3-nanomenter technologies, partially offset by customers’ continued inventory adjustment,” Wendell Huang, CFO of TSMC said.

TSMC makes chips for Apple’s iPhones. Apple’s next processor for its iPhone is rumored to be based on the 3-nanometer process technology. Apple typically releases its latest iPhone in September so it is likely ordering chips from TSMC in the third quarter.
''')

In [51]:
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter
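
With jupyter=True, displacy only renders inside a notebook. A plain-text listing of the predicted entities can be handy in scripts or logs. The sketch below relies on the `text`, `label_`, `start_char`, and `end_char` attributes that spaCy spans expose; `format_entities` is a hypothetical helper, and the stand-in object (with an assumed ORG label) replaces a real Doc only so the example runs without the trained model.

```python
from types import SimpleNamespace


def format_entities(doc):
    """Return one 'LABEL start-end text' line per entity in doc.ents."""
    return [
        f"{ent.label_:<10} {ent.start_char:>4}-{ent.end_char:<4} {ent.text}"
        for ent in doc.ents
    ]


# Stand-in for a processed Doc; in the notebook, pass the real `doc` instead.
fake_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="TSMC", label_="ORG", start_char=0, end_char=4)]
)
for line in format_entities(fake_doc):
    print(line)
```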

In [None]:
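# The scores above were measured on the training file itself (it was passed as both
# --paths.train and --paths.dev), so they overstate accuracy on unseen text.
# With only 8 annotated examples, adding more training data and a separate dev set
# should improve the custom NER model and give a more reliable estimate.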

In [62]:
cd ..

/content/drive/MyDrive


In [None]:
!zip -r Custom_NER.zip Cu
