## Using spaCy for NLP Tasks

This notebook demonstrates how to install and use spaCy to perform various Natural Language Processing (NLP) tasks.

In today’s lesson we will:

- Install spaCy and download its statistical models.
- Read and process a text file.
- Perform Named Entity Recognition (NER) to extract entities from text.
- Visualize entity counts.
- Explore and customize the spaCy pipeline (including using the EntityRuler).

### 1. Installation
To get started, you must install spaCy and the English language model.

**Instructions:**

1. Use `pip install spacy` to install the core library.
2. Download the small English pipeline (`en_core_web_sm`), which bundles the trained components used in this notebook (tagger, parser, lemmatizer, and named entity recognizer).


In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### 2. Loading the spaCy Model

Once downloaded, a model can be loaded in Python with `spacy.load('model_name')`. Running the following code therefore also verifies that the download succeeded:

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')
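If `spacy.load` fails, the model is usually not installed in the active environment. A minimal sketch of a more explicit check, assuming spaCy's `spacy.util.is_package` helper; the names `model_name`, `model_installed`, and `nlp_check` are illustrative and not part of the original notebook:

```python
import spacy
from spacy.util import is_package

model_name = "en_core_web_sm"  # the model downloaded in step 1

# is_package() reports whether the model is installed as a Python package
model_installed = is_package(model_name)

if model_installed:
    nlp_check = spacy.load(model_name)
    # List the component names of the loaded pipeline
    print(nlp_check.pipe_names)
else:
    print(f"{model_name} is not installed; run: python -m spacy download {model_name}")
```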

### 3. Reading a Text File
Here, we read a text file that contains a chapter from *The Fellowship of the Ring*. Make sure the file is in your working directory (or provide the full path).

**Key Steps:**
- Open the file using Python’s built-in `open()` function.
- Read the content into a variable.
- Adjust `nlp.max_length` to avoid errors when processing long texts.


In [3]:
# Define the file path (adjust this path if your file is stored elsewhere)
lotr_script = '/content/The Fellowship Of The Ring_Ch1 (1).txt'

# Read the file content
with open(lotr_script, 'r', encoding='utf-8') as f:
    text = f.read()

# Raise the maximum allowed length only if the text exceeds the current limit
# (max() avoids accidentally lowering the limit for short texts)
nlp.max_length = max(nlp.max_length, len(text))

# Process the text with the spaCy model
doc = nlp(text)
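spaCy's default `max_length` is 1,000,000 characters. Raising it works, but for very long texts an alternative is to split the text into paragraphs and stream them through `nlp.pipe`. The sketch below is one possible approach, not the notebook's method. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without a model download, and the sample text is a made-up stand-in for the chapter file:

```python
import spacy

# Assumption: a blank pipeline stands in for en_core_web_sm in this sketch
nlp_demo = spacy.blank("en")

# Hypothetical sample text with paragraphs separated by blank lines
sample_text = (
    "Bilbo announced a party.\n\n"
    "Frodo was his heir.\n\n"
    "Gandalf arrived with fireworks."
)

# Split on blank lines and drop empty chunks
paragraphs = [p.strip() for p in sample_text.split("\n\n") if p.strip()]

# nlp.pipe processes the chunks as a stream and yields one Doc per chunk
docs = list(nlp_demo.pipe(paragraphs))

for d in docs:
    print(len(d), "tokens:", d.text)
```

With paragraph-sized chunks no single input comes near the length limit, at the cost of producing several `Doc` objects instead of one.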


### 4. Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies entities (such as the names of people, places, or organizations) in text.

**What we'll do:**
- Extract entities from the processed document.
- Create a Pandas DataFrame that shows the entity text and its corresponding label.

In [9]:
import pandas as pd

# Create a list to collect entity data
entities_data = []

# Extract each entity and its label from the document
for ent in doc.ents:
    entities_data.append({
        'text': ent.text,
        'label': ent.label_
    })

# Convert the list into a DataFrame for easier viewing
ent_df = pd.DataFrame(entities_data)
ent_df  # Display the entities

| | text | label |
|---|---|---|
| 0 | Chapter 1 | LAW |
| 1 | Party | ORG |
| 2 | Bilbo Baggins | PERSON |
| 3 | Bag End | ORG |
| 4 | first | ORDINAL |
| ... | ... | ... |
| 520 | Shire | PERSON |
| 521 | Gandalf | NORP |
| 522 | Frodo | ORG |
| 523 | Frodo | ORG |
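Each entity is a `Span`, so it also carries its position in the text. The sketch below adds `start_char` and `end_char` columns, which can help locate a suspicious entity in the source file. It assumes a blank pipeline and sets the entities by hand (via `Span`) so that it runs without the trained model; the sentence and the names `nlp_demo`, `demo_doc`, and `demo_df` are illustrative:

```python
import pandas as pd
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, entities assigned manually for illustration
nlp_demo = spacy.blank("en")
demo_doc = nlp_demo("Bilbo Baggins lived at Bag End.")

# Token indices: Bilbo(0) Baggins(1) lived(2) at(3) Bag(4) End(5) .(6)
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),
    Span(demo_doc, 4, 6, label="FAC"),
]

# One row per entity, including character offsets into the original text
demo_df = pd.DataFrame(
    [
        {
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char,
        }
        for ent in demo_doc.ents
    ]
)
print(demo_df)
```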


### 5. Analyzing Entity Data
Let's examine:
- **Text and Label Frequency:**  
Display the most common entity texts and their labels.
- **Entity Details:**  
Use spaCy's built-in explanation function to understand what a specific label (e.g., "FAC") means.

In [10]:
# Display the top 15 most common texts and labels
print("Top 15 Entity Texts:")
print(ent_df['text'].value_counts()[:15])

print("\nTop 15 Entity Labels:")
print(ent_df['label'].value_counts()[:15])

# Explain a specific entity label
print("\nExplanation for label 'FAC':")
print(spacy.explain("FAC"))

Top 15 Entity Texts:
text
Bilbo       86
Frodo       47
Gandalf     36
Bag End     18
Hobbiton    12
Shire       12
Gaffer      10
Bywater      9
Baggins      9
two          7
Lobelia      7
one          7
three        6
Tooks        6
first        6
Name: count, dtype: int64

Top 15 Entity Labels:
label
PERSON         159
ORG            131
GPE             56
CARDINAL        54
DATE            42
NORP            29
TIME            14
ORDINAL         12
LOC              8
FAC              7
WORK_OF_ART      6
PRODUCT          4
EVENT            2
LAW              1
Name: count, dtype: int64

Explanation for label 'FAC':
Buildings, airports, highways, bridges, etc.


In [None]:
# Use spacy.explain() to learn more about abbreviations and definitions
spacy.explain("FAC")

'Buildings, airports, highways, bridges, etc.'

In [13]:
# Display combinations of text and label counts
print("\nTop 20 Text and Label Combinations:")
ent_df[['text', 'label']].value_counts()[:20]


Top 20 Text and Label Combinations:


| text | label | count |
|---|---|---|
| Bilbo | PERSON | 39 |
| Bilbo | GPE | 35 |
| Frodo | ORG | 33 |
| Gandalf | NORP | 25 |
| Frodo | PERSON | 14 |
| Bag End | ORG | 14 |
| Hobbiton | ORG | 11 |
| Bilbo | ORG | 10 |
| Gaffer | PERSON | 10 |
| Bywater | PERSON | 8 |
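The same text receiving several different labels is a quick signal that the recognizer is unsure. A minimal sketch of how such cases could be listed with pandas, using a subset of the counts shown above; the variable names are illustrative:

```python
import pandas as pd

# Subset of the text/label counts copied from the output above
combo_counts = pd.DataFrame(
    [
        ("Bilbo", "PERSON", 39),
        ("Bilbo", "GPE", 35),
        ("Frodo", "ORG", 33),
        ("Gandalf", "NORP", 25),
        ("Frodo", "PERSON", 14),
        ("Bag End", "ORG", 14),
        ("Bilbo", "ORG", 10),
    ],
    columns=["text", "label", "count"],
)

# Count how many distinct labels each entity text received
labels_per_text = combo_counts.groupby("text")["label"].nunique()

# Keep only the texts that were labeled inconsistently
ambiguous = labels_per_text[labels_per_text > 1].sort_values(ascending=False)
print(ambiguous)
```

On the full `ent_df`, the same `groupby` can be applied directly to the `text` and `label` columns.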


In [14]:
# List unique labels in the dataset
print("\nUnique Entity Labels:")
ent_df["label"].unique()


Unique Entity Labels:


array(['LAW', 'ORG', 'PERSON', 'ORDINAL', 'GPE', 'LOC', 'DATE',
       'CARDINAL', 'FAC', 'PRODUCT', 'TIME', 'NORP', 'WORK_OF_ART',
       'EVENT'], dtype=object)
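To look up every label at once rather than one at a time, `spacy.explain` can be applied to the whole list. A short sketch using the labels from the array above; `label_glossary` is an illustrative name:

```python
import spacy

# Labels copied from the unique-label output above
labels = [
    "LAW", "ORG", "PERSON", "ORDINAL", "GPE", "LOC", "DATE",
    "CARDINAL", "FAC", "PRODUCT", "TIME", "NORP", "WORK_OF_ART", "EVENT",
]

# spacy.explain() returns a short description for each known label
label_glossary = {label: spacy.explain(label) for label in labels}

for label, description in label_glossary.items():
    print(f"{label:<12} {description}")
```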

### 6. Exploring the NLP Pipeline
[NLP Pipeline Documentation](https://spacy.io/usage/processing-pipelines#processing)

You can inspect the components of the NLP pipeline, in the order they run, via `nlp.pipeline`.

In [20]:
# spaCy's language model pipeline
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7b38c7e75f70>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7b38c7e75970>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7b398a5f8d60>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7b38c7c36510>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7b38c7c4e550>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7b398a1284a0>)]
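When only the component names are needed, `nlp.pipe_names` gives a shorter view, and `nlp.select_pipes` can switch components off temporarily. The sketch below assumes a blank pipeline with two rule-based components so that it runs without the trained model; `nlp_demo` and the choice of components are illustrative:

```python
import spacy

# Assumption: a small rule-based pipeline built for illustration only
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")
nlp_demo.add_pipe("entity_ruler")

# Component names in processing order
print(nlp_demo.pipe_names)

# Inside the with-block the sentencizer is disabled
with nlp_demo.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_demo.pipe_names)

# After the block the component is restored
active_after = list(nlp_demo.pipe_names)
print(active_inside, active_after)
```

Disabling components that are not needed (for example the parser when only entities matter) can speed up processing of long texts.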

### 7. Visualizing Entities with displaCy
displaCy is spaCy’s built-in visualizer for rendering entities and dependency parses, for example inside Jupyter notebooks.


In [21]:
from spacy import displacy

# Render entities in the processed document using DisplaCy
displacy.render(doc, style = "ent", jupyter = True)
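For a long chapter the full rendering can be overwhelming. displaCy accepts an `options` dictionary to restrict the highlighted labels and set their colors. A minimal sketch, assuming a blank pipeline with hand-set entities so it runs without the trained model; the sentence, the color value, and the variable names are illustrative:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: blank pipeline, entities assigned manually for illustration
nlp_demo = spacy.blank("en")
demo_doc = nlp_demo("Bilbo Baggins lived at Bag End.")
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),
    Span(demo_doc, 4, 6, label="FAC"),
]

# Highlight only PERSON entities and give them a custom color
options = {"ents": ["PERSON"], "colors": {"PERSON": "#ffd966"}}

# jupyter=False returns the markup as a string instead of displaying it
html = displacy.render(demo_doc, style="ent", options=options, jupyter=False)
print(html[:200])
```

In a notebook, passing `jupyter=True` with the same `options` displays the filtered rendering inline.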

### 8. Identifying Issues in the Named Entity Recognizer

Review the rendered entities above and note where the recognizer mislabels or misses an entity. Use the [Lord of the Rings Wiki](https://lotr.fandom.com/wiki/Main_Page) if you need help.

Example Issues: ?

### 9. Creating Custom Entity Recognizers with the EntityRuler

Custom patterns can be added to the pipeline using spaCy's `EntityRuler`. This allows us to capture entities that might be missed by the statistical model.

**Steps:**
- Define custom entity patterns (as a list of dictionaries).
- Check if the "ner" component exists and add the EntityRuler accordingly.

[EntityRuler Documentation](https://spacy.io/api/entityruler#add_patterns)

In [27]:
# Define custom entity patterns for names, locations, and other entities
entity_patterns = [
    {"label": "PERSON", "pattern": "Gollum"},
    {"label": "PERSON", "pattern": "Gorbadoc"},
    {"label": "PERSON", "pattern": "Daddy Twofoot"},
    {"label": "PERSON", "pattern": "Old Noakes"},
    {"label": "FAMILY", "pattern": "Bucklanders"},
    {"label": "HOBBIT", "pattern": "Bilbo Baggins"},
    {"label": "LOC", "pattern": "Brandywine River"},
    {"label": "LOC", "pattern": "The Hill"},
    {"label": "LOC", "pattern": "The Water"},
    {"label": "FAMILY", "pattern": "Brandybucks"},
    {"label": "FAMILY", "pattern": "Tooks"},
    {"label": "WIZARD", "pattern": "Gandalf"},
    {"label": "HOBBIT", "pattern": "Frodo"},
    {"label": "GPE", "pattern": "Shire"},
    {"label": "FAMILY", "pattern":"Bagginses"},
    {"label": "PERSON", "pattern": "Mr. Baggins"},
    {"label": "FAMILY", "pattern": "Baggins"},
    {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "ring"}]},  # LOWER compares lowercased text, so "ring", "Ring", and "RING" all match
    {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "my"}, {"LOWER": "precious"}]},
    {"label": "GPE", "pattern" : "Bag End"}
]

# Reuse the EntityRuler if it already exists (e.g. when this cell is re-run);
# otherwise add it before "ner" so the custom patterns take precedence.
if "entity_ruler" in nlp.pipe_names:
    ruler = nlp.get_pipe("entity_ruler")
elif "ner" in nlp.pipe_names:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
else:
    # No statistical NER component: the ruler alone sets the entities.
    ruler = nlp.add_pipe("entity_ruler")

# Add the custom patterns to the ruler
ruler.add_patterns(entity_patterns)

# Check the updated pipeline components
print("\nUpdated Pipeline Components:")
nlp.pipeline


Updated Pipeline Components:


[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7b38c7e75f70>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7b38c7e75970>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7b398a5f8d60>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7b38c7c36510>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7b38c7c4e550>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x7b38bd1d6050>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7b398a1284a0>)]
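The two pattern styles used above behave differently: a string pattern must match the text exactly, while a token pattern with `LOWER` ignores capitalization. The sketch below illustrates this on a blank pipeline (an assumption, so that only the ruler produces entities); the sentence and the names `nlp_demo`, `ruler_demo`, and `found` are illustrative:

```python
import spacy

# Assumption: blank pipeline so the EntityRuler is the only entity source
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns(
    [
        # String pattern: exact, case-sensitive match
        {"label": "WIZARD", "pattern": "Gandalf"},
        # Token pattern: matches on the lowercased token text
        {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "ring"}]},
    ]
)

demo_doc = nlp_demo("Gandalf asked about the Ring, but gandalf in lower case is missed.")
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```

Because the ruler sits before `ner` in the notebook's pipeline, the statistical recognizer keeps the spans the ruler has already set and only predicts entities around them.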

### 10. Testing the Custom Entity Ruler
Run a sample sentence through the updated pipeline to check if your custom patterns are recognized.


In [28]:
from spacy import displacy

doc_2 = nlp("Gandalf went to Bilbo's house for his birthday because he was my precious which is the Ring.")
displacy.render(doc_2, style="ent", jupyter=True)
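The entities can also be checked as plain tuples, which is easier to assert on than a rendering. The sketch below also shows what happens when patterns overlap ("Baggins" and "Mr. Baggins"): the EntityRuler keeps the longest match. It assumes a blank pipeline; the sentence and variable names are illustrative:

```python
import spacy

# Assumption: blank pipeline so the EntityRuler is the only entity source
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns(
    [
        {"label": "FAMILY", "pattern": "Baggins"},
        {"label": "PERSON", "pattern": "Mr. Baggins"},
        # Two-token pattern, case-insensitive on both tokens
        {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "my"}, {"LOWER": "precious"}]},
    ]
)

demo_doc = nlp_demo("Mr. Baggins called it My Precious.")

# The longer "Mr. Baggins" match wins over the overlapping "Baggins" match
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```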

### 11. Re-Processing the Full Text
Now that we have updated the pipeline with our custom patterns, re-run the full text to see how the recognizer performs.


In [29]:
# Re-process the text with the updated pipeline
nlp.max_length = len(text)
doc = nlp(text)

displacy.render(doc, style="ent", jupyter=True)

### 12. Re-Analyzing Entity Data
Let's create the DataFrame again from the re-processed document to see whether our custom patterns improved entity extraction.

In [30]:
# Collect entities from the updated document
entities_data = []
for ent in doc.ents:
    entities_data.append({
        'text': ent.text,
        'label': ent.label_
    })

# Convert to DataFrame and display
ent_df = pd.DataFrame(entities_data)
ent_df

# %%
# Show value counts for text and label combinations after the update
print("Updated Text and Label Combinations (Top 20):")
print(ent_df[['text', 'label']].value_counts()[:20])

Updated Text and Label Combinations (Top 20):
text           label       
Frodo          HOBBIT          58
Gandalf        WIZARD          41
Bilbo          PERSON          39
               GPE             35
Bag End        GPE             23
Bagginses      FAMILY          16
ring           MAGIC_OBJECT    16
Shire          GPE             12
Hobbiton       ORG             11
Bilbo          ORG             10
Gaffer         PERSON          10
Baggins        FAMILY           9
Bywater        PERSON           8
Bilbo Baggins  HOBBIT           7
one            CARDINAL         7
Lobelia        GPE              7
two            CARDINAL         7
first          ORDINAL          6
Tooks          FAMILY           6
three          CARDINAL         6
Name: count, dtype: int64
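One way to make the effect of the custom patterns easier to see is to put the labels from before and after side by side. The sketch below does this for three entities, using only counts that appear in the two outputs in this notebook (so labels outside the displayed rows are not covered); the variable names are illustrative:

```python
import pandas as pd

columns = ["text", "label", "count"]

# Rows copied from the output before the EntityRuler was added
before = pd.DataFrame(
    [
        ("Frodo", "ORG", 33),
        ("Frodo", "PERSON", 14),
        ("Gandalf", "NORP", 25),
        ("Bag End", "ORG", 14),
    ],
    columns=columns,
)

# Rows copied from the output after the EntityRuler was added
after = pd.DataFrame(
    [
        ("Frodo", "HOBBIT", 58),
        ("Gandalf", "WIZARD", 41),
        ("Bag End", "GPE", 23),
    ],
    columns=columns,
)

# Join each entity's labels into one string, aligned on the entity text
summary = pd.DataFrame(
    {
        "labels_before": before.groupby("text")["label"].agg(", ".join),
        "labels_after": after.groupby("text")["label"].agg(", ".join),
    }
)
print(summary)
```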


### 13. Visualizing Entity Data
Finally, we create a treemap to visualize the top 10 most common text and label combinations.

**Steps:**
- Use Pandas to compute counts.
- Plot the counts using Plotly Express.

**Tree Map**
- The `color` parameter is set to `label` so that each entity label gets a distinct color.
- The `path` parameter defines a hierarchy in which entities are grouped by their `label` first, then by `text`.
- The interactive treemap makes it easy to see how much each entity label contributes to the overall counts.

In [36]:
import plotly.express as px

# Prepare the data for visualization: top 10 entity combinations
top_10_ents = ent_df[['text', 'label']].value_counts().head(10).reset_index(name='counts')

# Create a treemap using a hierarchical structure (first by label, then by text)
fig = px.treemap(top_10_ents,
                path=['label', 'text'],
                values='counts',
                title='Top 10 Text and Label Combinations (Treemap)',
                color='label')

fig.show()
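The data preparation step is worth a closer look: `value_counts()` on two columns returns a Series indexed by (text, label), and `reset_index(name='counts')` flattens it into the three columns that `px.treemap` reads. A small sketch on made-up toy rows (hypothetical data, not taken from the chapter):

```python
import pandas as pd

# Hypothetical toy data standing in for ent_df
demo_df = pd.DataFrame(
    {
        "text": ["Frodo", "Frodo", "Gandalf", "Frodo", "Gandalf", "Shire"],
        "label": ["HOBBIT", "HOBBIT", "WIZARD", "HOBBIT", "WIZARD", "GPE"],
    }
)

# Count each (text, label) pair, keep the two most common,
# and turn the index levels back into ordinary columns
top_combos = (
    demo_df[["text", "label"]]
    .value_counts()
    .head(2)
    .reset_index(name="counts")
)
print(top_combos)
```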