Build a Named Entity Recognition (NER) system that extracts entities from real-world text
such as news articles or social media posts, then measure its accuracy, precision, recall,
and F1-score.

In [2]:
!pip install spacy scikit-learn
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 72.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import spacy
from sklearn.metrics import classification_report, accuracy_score

In [13]:
nlp = spacy.load("en_core_web_sm")

In [14]:
texts = [
    "Apple is looking to buy a startup in San Francisco for $1 billion.",
    "PM Modi met Elon Musk in New York.",
    "Google announced a new AI model in London.",
    "I love watching cricket matches in Mumbai!",
    "Microsoft plans to invest in OpenAI next year."
]

In [15]:
true_entities = [
    [("Apple", "ORG"), ("San Francisco", "GPE")],
    [("PM Modi", "PERSON"), ("Elon Musk", "PERSON"), ("New York", "GPE")],
    [("Google", "ORG"), ("London", "GPE")],
    [("Mumbai", "GPE")],
    [("Microsoft", "ORG"), ("OpenAI", "ORG")]
]
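
The evaluation in this notebook matches gold and predicted entities by their exact text, so a typo in a gold string would silently show up as a missed entity. The sketch below is one way to check the annotations before running the model. The helper name `missing_gold_mentions` is hypothetical and not part of the notebook; the data is copied from the two cells above.

```python
# Data copied from the cells above so that this block runs on its own.
texts = [
    "Apple is looking to buy a startup in San Francisco for $1 billion.",
    "PM Modi met Elon Musk in New York.",
    "Google announced a new AI model in London.",
    "I love watching cricket matches in Mumbai!",
    "Microsoft plans to invest in OpenAI next year.",
]
true_entities = [
    [("Apple", "ORG"), ("San Francisco", "GPE")],
    [("PM Modi", "PERSON"), ("Elon Musk", "PERSON"), ("New York", "GPE")],
    [("Google", "ORG"), ("London", "GPE")],
    [("Mumbai", "GPE")],
    [("Microsoft", "ORG"), ("OpenAI", "ORG")],
]


def missing_gold_mentions(texts, gold):
    """Return (sentence_index, entity_text) pairs whose text does not occur in the sentence.

    Hypothetical helper for illustration only.
    """
    return [
        (i, ent_text)
        for i, (text, ents) in enumerate(zip(texts, gold))
        for ent_text, _label in ents
        if ent_text not in text
    ]


# One gold list per sentence is required for zip() to pair them correctly.
assert len(texts) == len(true_entities)
print(missing_gold_mentions(texts, true_entities))
```

An empty list means every gold string occurs verbatim in its sentence.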

In [16]:
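y_true = []
y_pred = []

for text, gold_entities in zip(texts, true_entities):
    doc = nlp(text)

    # Map each gold entity's text to its label for exact-string lookup
    gold_dict = dict(gold_entities)

    # Texts of all predicted entities, computed once per document
    predicted_texts = {ent.text for ent in doc.ents}

    # Score predictions whose text exactly matches a gold entity.
    # Predictions with no exact gold match are skipped, so spurious
    # entities do not count against precision in this evaluation.
    for ent in doc.ents:
        if ent.text in gold_dict:
            y_true.append(gold_dict[ent.text])
            y_pred.append(ent.label_)

    # Gold entities the model did not detect are recorded as "O" (no entity)
    for gold_ent, gold_label in gold_dict.items():
        if gold_ent not in predicted_texts:
            y_true.append(gold_label)
            y_pred.append("O")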
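
The loop above only scores predictions whose text matches a gold entity, so an extra prediction never lowers precision. A stricter alternative is to compare sets of `(text, label)` pairs and count true positives, false positives, and false negatives directly. The following is a minimal sketch of that idea, not the method used in this notebook. The function name `prf_from_sets` is hypothetical, and the predicted set is made up for illustration rather than taken from model output.

```python
def prf_from_sets(gold, pred):
    """Micro-averaged precision, recall and F1 over sets of (text, label) pairs."""
    tp = len(gold & pred)   # predicted pairs that are also in the gold set
    fp = len(pred - gold)   # predicted pairs with no gold counterpart
    fn = len(gold - pred)   # gold pairs the model did not produce
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Gold pairs for the first sentence, copied from true_entities.
gold = {("Apple", "ORG"), ("San Francisco", "GPE")}

# Hypothetical predictions, for illustration only (not actual model output).
pred = {("Apple", "ORG"), ("San Francisco", "GPE"), ("$1 billion", "MONEY")}

p, r, f = prf_from_sets(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With a loaded pipeline, the predicted set could be built as `{(ent.text, ent.label_) for ent in doc.ents}`. Matching on text is a simplification: repeated mentions of the same string collapse into one pair, so character offsets would be a more faithful key.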

In [17]:
print("Accuracy:", accuracy_score(y_true, y_pred))
print("\nDetailed Classification Report:\n")
print(classification_report(y_true, y_pred))

Accuracy: 0.8

Detailed Classification Report:

              precision    recall  f1-score   support

         GPE       0.80      1.00      0.89         4
         ORG       0.75      0.75      0.75         4
      PERSON       1.00      0.50      0.67         2

    accuracy                           0.80        10
   macro avg       0.85      0.75      0.77        10
weighted avg       0.82      0.80      0.79        10
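
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which makes the difference between the two explicit: the macro average treats every class equally, while the weighted average scales each class by its support. The sketch below copies the rounded per-class values from the report above, so the results are only accurate to about two decimals. The helper names `macro_avg` and `weighted_avg` are hypothetical.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support)
rows = {
    "GPE": (0.80, 1.00, 0.89, 4),
    "ORG": (0.75, 0.75, 0.75, 4),
    "PERSON": (1.00, 0.50, 0.67, 2),
}


def macro_avg(rows, idx):
    """Unweighted mean of one metric column across classes."""
    return sum(r[idx] for r in rows.values()) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[idx] * r[3] for r in rows.values()) / total


for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}  macro={macro_avg(rows, idx):.2f}  "
          f"weighted={weighted_avg(rows, idx):.2f}")
```

The support-weighted recall equals the overall accuracy, because each class contributes its number of correct predictions divided by the total number of samples.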

