# Natural Language Processing -- Basics

### spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that ***parses*** and ***understands*** large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.). It uses a pipeline-based architecture.

**It provides pretrained pipeline components for tasks such as:**
* Tokenizer: Splits text into tokens.
* Tagger: Assigns part-of-speech (POS) tags.
* Parser: Analyzes grammatical structure (dependency parsing).
* Named Entity Recognizer (NER): Identifies entities like names, dates, and locations.
* Lemmatizer: Converts words to their base form (e.g., running → run).
* Text Classifier (if included): Categorizes text into predefined labels.

#### Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

#### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`

> #### Alternatively you can create a virtual environment:
> `conda create -n spacyenv -c conda-forge python=3 spacy`

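#### 2. Next, also from the command line (you may need admin rights or sudo if Python is installed system-wide):

> `python -m spacy download en_core_web_sm`

> #### In spaCy 2.x, the shortcut `python -m spacy download en` printed a message like:

> **`Linking successful`**<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en')`

> #### spaCy 3.x has no shortcut links; load the model by its full package name, `spacy.load('en_core_web_sm')`, as the code in this notebook does.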

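A quick way to confirm both installation steps is to check whether the packages can be found on the import path. This is a minimal sketch using only the standard library; the helper name `check_spacy_setup` is hypothetical, and it assumes the model was installed as the `en_core_web_sm` package.

```python
import importlib.util


def check_spacy_setup(model_name="en_core_web_sm"):
    """Report whether spaCy and a model package can be found on the import path."""
    return {
        "spacy": importlib.util.find_spec("spacy") is not None,
        model_name: importlib.util.find_spec(model_name) is not None,
    }


# Both values should be True once the two installation steps have succeeded.
print(check_spacy_setup())
```

If either value is `False`, rerun the corresponding installation step inside the same environment that runs the notebook.
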
### Scenario: Automating Resume Screening for Job Applications
A company receives thousands of resumes and wants to automate the initial screening process using **spaCy**. The ***NLP pipeline*** can help extract useful information from candidate resumes and categorize them based on job relevance.

#### How Each Component is Used:
* **Tokenizer**: Splits the resume text into individual words/tokens.

    Example: "Experienced Python developer with 5 years of experience" → ["Experienced", "Python", "developer", "with", "5", "years", "of", "experience"]

* **Tagger (POS tagging)**: Identifies parts of speech to understand key terms.
    Example: "Python" (PROPN), "developing" (VERB), "experienced" (ADJ).

* **Parser (Dependency Parsing)**: Analyzes the grammatical structure to understand relationships between words.
    Example: Links "5 years" to "experience" through the preposition "of", indicating that the candidate has 5 years of experience.

* **Named Entity Recognizer (NER)**: Extracts key entities like names, organizations, locations, and dates. Skills are not a built-in entity type in the pretrained English models, so a label such as SKILLS requires custom training or rule-based matching.
    Example:

    John Doe (PERSON)

    Python, Machine Learning (SKILLS, a custom label)

    Microsoft (ORG)

    New York (GPE: Geo-Political Entity)

* **Lemmatizer**: Converts words into their base form for better matching.
    Example: "developing" → "develop", "worked" → "work".

* **Text Classifier**: Categorizes resumes into predefined job roles based on keywords and extracted information. spaCy's text categorizer must first be trained on labeled examples.
    Example: If the resume contains "Python, TensorFlow, Machine Learning," it is classified under "Data Scientist"; if it includes "Java, Spring Boot," it is classified under "Backend Developer".

#### Outcome:
The company can automatically filter and rank resumes based on skills, experience, and job relevance, saving HR teams hours of manual screening.


## End-to-end Resume Parsing & Analysis Project using spaCy 

- ✅ Step 1: Data Collection & Preprocessing
- ✅ Step 2: Applying NLP Pipeline (Tokenizer, POS Tagging, NER, Lemmatization, Parsing, Classification, etc.)
- ✅ Step 3: Extracting Key Information (Name, Skills, Experience, Education, etc.)
- ✅ Step 4: Resume Categorization (Job Role Classification using ML/NLP)
- ✅ Step 5: Storing & Analyzing Results (JSON/Database, visualization, ranking resumes, etc.)
- ✅ Step 6: Deploying a Simple Web App (Streamlit or Flask) for Uploading Resumes


***A structured code framework for resume parsing with spaCy covers the following stages (the code below implements ingestion through classification):***

* Data Ingestion: Uploading resumes (text/PDF parsing).
* Text Preprocessing: Cleaning and tokenizing text.
* NLP Pipeline Processing: Applying spaCy to extract important details.
* Information Extraction: Identifying names, skills, education, work experience.
* Resume Classification: Categorizing into predefined job roles using NLP & ML.
* Visualization & Analysis: Storing data, ranking resumes based on relevancy.
* Deployment: A simple UI for uploading and analyzing resumes.

In [3]:
import spacy
import pdfplumber # For PDF text extraction 
import re # For regular expressions
import json
from collections import Counter
from spacy.matcher import Matcher
from spacy.pipeline.textcat import Config, single_label_cnn_config

# Load pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

In [27]:


# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    text = "" # Initialize empty string to store text
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages: # Iterate through each page
            text += (page.extract_text() or "") + "\n" # extract_text() may return None for a page without a text layer
    return text.strip() # Return the cleaned text after stripping leading/trailing whitespace

# Function to preprocess text (cleaning)
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters 
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces
    return text.lower()

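# Function to extract named entities (NER)
def extract_entities(text):
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        # Group entity texts by label so repeated labels do not overwrite each other
        entities.setdefault(ent.label_, []).append(ent.text)
    return entities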

# Function to extract skills using keyword matching
def extract_skills(text):
    skills = ["python", "machine learning", "data science", "sql", "java", "deep learning", "nlp"]
    found_skills = [skill for skill in skills if skill.lower() in text.lower()]
    return found_skills

# Function to classify resumes into job roles
def classify_resume(text):
    categories = {
        "Data Scientist": ["machine learning", "python", "data science", "deep learning"],
        "Software Engineer": ["java", "spring boot", "microservices", "docker"],
        "Business Analyst": ["excel", "business analysis", "sql", "data visualization"]
    }
    
    scores = {role: sum(1 for skill in skills if skill in text.lower()) for role, skills in categories.items()}
    return max(scores, key=scores.get) if max(scores.values()) > 0 else "Unknown"

# Function to process resume text and extract relevant details
def process_resume(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    cleaned_text = clean_text(text)
    # Run NER on the original text: lower-casing and stripping punctuation removes cues the model relies on
    entities = extract_entities(text)
    skills = extract_skills(cleaned_text)
    job_role = classify_resume(cleaned_text)
    
    result = {
        "Extracted Text": text,       
        "Entities": entities,
        "Skills": skills,
        "Predicted Job Role": job_role
    }
    return result

# Example usage
if __name__ == "__main__":    
    pdf_path = "C:/Users/tazeb/OneDrive/AtomicHabit/LLM Engineering/LLM_Engineering/TA_resume.pdf" 
    result = process_resume(pdf_path)
    #print(json.dumps(result, indent=4))


In [28]:
result

{'Extracted Text': "TAZEB ABERA\n8801 Edna Place Rowlett, TX 75089\nPhone: (214)430-2241\nEmail: tazabera@gmail.com\nWebsite: https://mydatascienceenthusiast.com/\nLinked Page: https://www.linkedin.com/in/tazeb-abera\nGitHub: https://github.com/tazeb6531\nProfessional Summary\nInnovative Senior Data Scientist with extensive experience delivering impactful AI-driven solutions across healthcare,\nmanufacturing, business operations, and engineering materials sectors. At Celanese, successfully applied advanced machine\nlearning techniques to refine search engine accuracy and streamline competitor analysis, achieved a 60% accuracy boost,\n40% faster search times, and 15% reduction in development process time. Proficient in integrating advanced tools like\nElasticsearch, Azure GenAI, and Snowflake Cortex to foster innovation and enhance decision-making processes. Skilled in\nNLP, deep learning, predictive modeling, and generative AI, with a focus on improving operational efficiency and drivi
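
The keyword checks in `extract_skills` and `classify_resume` use plain substring tests, so "java" also matches inside "javascript" and "sql" inside "mysql". A minimal sketch of one way to tighten this, using whole-word matching and returning a ranked list of roles instead of a single label, follows. The names `ROLE_KEYWORDS`, `contains_term`, and `rank_roles` are hypothetical and not part of the code above; the keyword lists simply mirror the ones in `classify_resume`.

```python
import re

# Hypothetical keyword lists, mirroring the ones used in classify_resume above.
ROLE_KEYWORDS = {
    "Data Scientist": ["machine learning", "python", "data science", "deep learning"],
    "Software Engineer": ["java", "spring boot", "microservices", "docker"],
    "Business Analyst": ["excel", "business analysis", "sql", "data visualization"],
}


def contains_term(term, text):
    """True if `term` occurs in `text` as a whole word or phrase (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def rank_roles(text, role_keywords=ROLE_KEYWORDS):
    """Return (role, score) pairs sorted from best to worst match."""
    scores = {
        role: sum(contains_term(term, text) for term in terms)
        for role, terms in role_keywords.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = "Built JavaScript dashboards and MySQL reports; some Python and machine learning."
print(contains_term("java", sample))  # False: "JavaScript" is a different word
print(rank_roles(sample))
```

Keeping the full score list, rather than only the top role, makes it possible to rank resumes or to flag ties for manual review.
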

## Discussion 

### Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data. For a diagram of the pipeline, see https://spacy.io/usage/spacy-101#pipelines

In [1]:
print(nlp.pipeline)
print("\n")
print(nlp.pipe_names)

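Note: this cell raised `NameError: name 'nlp' is not defined` because it ran before the model was loaded; run `nlp = spacy.load("en_core_web_sm")` first.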

In spaCy, the NLP pipeline consists of several components that process text sequentially. For `en_core_web_sm` these include:
* **tok2vec**, which generates vector representations for tokens, 
* **tagger**, which assigns part-of-speech (POS) tags, 
* **parser**, which analyzes sentence structure through dependency parsing, 
* **ner**, which detects named entities like names and organizations, 
* **attribute_ruler**, which applies rule-based mappings and exceptions to token attributes such as POS and morphology, and 
* **lemmatizer**, which converts words to their base forms. 

Most of these components are trained statistical models; the attribute ruler and the English lemmatizer are rule-based. While these components are explicitly listed in `nlp.pipe_names`, the **tokenizer** always runs first and is not listed there. The pipeline passes each `Doc` through these components in sequence.
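
The point about the tokenizer can be checked with a blank pipeline. The sketch below assumes only that spaCy itself is installed (no model download); `nlp_blank` is a name chosen here so that the notebook's `nlp` object is left untouched.

```python
import spacy

# A blank English pipeline: only the tokenizer, no trained components,
# so no model download is needed for this sketch.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # [] because the tokenizer is not listed

# Tokens are still produced even though the component list is empty.
doc_blank = nlp_blank("spaCy splits text into tokens.")
print([token.text for token in doc_blank])

# Components added with add_pipe() show up in pipe_names, in running order.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)  # ['sentencizer']
```
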

### 1. Part-of-Speech (POS) Tagging in spaCy

In **spaCy**, **Part-of-Speech (POS) tagging** assigns grammatical labels (e.g., noun, verb, adjective) to each token in a text. The **`tagger`** component in spaCy's pipeline is responsible for POS tagging, leveraging pre-trained statistical models for accuracy. 

##### 🔹 How to Access POS Tags:
- `token.pos_` → General POS category (e.g., `NOUN`, `VERB`, `ADJ`).
- `token.tag_` → Detailed POS tag (specific to the language model).
- `spacy.explain(token.pos_)` → Get explanations for POS tags.

**Example:**
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John is learning Python.")

for token in doc:
    print(token.text, "→", token.pos_, "(", spacy.explain(token.pos_), ")")
```


* The challenge of correctly identifying parts of speech is summed up nicely in the [spaCy docs](https://spacy.io/usage/linguistic-features):

<div class="alert alert-info" style="margin: 20px">Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a **Doc** object, that comes with a variety of annotations.</div>



In [63]:
# Create a simple Doc object
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")
print(doc.text)
print()
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))

The quick brown fox jumped over the lazy dog's back.

jumped VERB VBD verb, past tense


In [64]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

The        DET      DT     determiner
quick      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
brown      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fox        NOUN     NN     noun, singular or mass
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
dog        NOUN     NN     noun, singular or mass
's         PART     POS    possessive ending
back       NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer


#### Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:

<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

#### Working with POS Tags
In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. **spaCy** uses machine learning algorithms to best predict the use of a token in a sentence. Is *"I read books on NLP"* present or past tense? Is *wind* a verb or a noun?

In [42]:
doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
print("Compare the tag when the sentence changes:")

doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')


read       VERB     VBP    verb, non-3rd person singular present
Compare the tag when the sentence changes:
read       VERB     VBD    verb, past tense


#### Counting POS Tags
The `Doc.count_by()` method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

In [49]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")
# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 2, 84: 3, 92: 3, 100: 1, 85: 1, 94: 1, 97: 1}

This isn't very helpful until you decode the attribute ID:

In [54]:
doc.vocab[84].text

'ADJ'

#### Create a frequency list of POS tags from the entire document
Since `POS_counts` is a dictionary, we can obtain its key-value pairs with `POS_counts.items()`.<br>Sorting these pairs gives us each tag ID and its count, ordered by ID.

In [53]:
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 3
85. ADP  : 1
90. DET  : 2
92. NOUN : 3
94. PART : 1
97. PUNCT: 1
100. VERB : 1
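
If the integer IDs are not needed, the same tally can be sketched with `collections.Counter` over `token.pos_`, which keys the counts by the readable label. To keep the example runnable without a trained model, the `Doc` below is built by hand with the tags shown above; with a loaded pipeline you would use the `doc` returned by `nlp(...)` instead. The names `demo_doc` and `pos_counts` are chosen for this illustration.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# Hand-built Doc carrying the POS tags shown above, so the sketch runs
# without a trained model.
vocab = spacy.blank("en").vocab
words = ["The", "quick", "brown", "fox", "jumped", "over",
         "the", "lazy", "dog", "'s", "back", "."]
pos = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP",
       "DET", "ADJ", "NOUN", "PART", "NOUN", "PUNCT"]
demo_doc = Doc(vocab, words=words, pos=pos)

# Counter keys are the readable labels, so no ID lookup is needed.
pos_counts = Counter(token.pos_ for token in demo_doc)
for label, count in sorted(pos_counts.items()):
    print(f"{label:{5}}: {count}")
```
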


In [55]:
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1


<div class="alert alert-success">**Why did the ID numbers get so big?** In spaCy, certain text values are hardcoded into `Doc.vocab` and take up the first several hundred ID numbers. Strings like 'NOUN' and 'VERB' are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.</div>
<div class="alert alert-success">**Why don't SPACE tags appear?** In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.</div>

In [56]:
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

402. amod: 3
415. det : 2
429. nsubj: 1
439. pobj: 1
440. poss: 1
443. prep: 1
445. punct: 1
8110129090154140942. case: 1
8206900633647566924. ROOT: 1


Here we've shown `spacy.attrs.POS`, `spacy.attrs.TAG` and `spacy.attrs.DEP`.<br>Refer back to the **Vocabulary and Matching** lecture from the previous section for a table of **Other token attributes**.

#### Fine-grained POS Tag Examples
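These are some grammatical examples (shown in **bold**) of specific fine-grained tags. We've removed punctuation and rarely used tags. The POS column follows an older spaCy mapping; current English models assign some of these tags a different coarse POS (for example, `MD` maps to `AUX`, `CC` to `CCONJ`, and `PRP$` to `PRON`):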
<table>
<tr><th>POS</th><th>TAG</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>ADJ</td><td>AFX</td><td>affix</td><td>The Flintstones were a **pre**-historic family.</td></tr>
<tr><td>ADJ</td><td>JJ</td><td>adjective</td><td>This is a **good** sentence.</td></tr>
<tr><td>ADJ</td><td>JJR</td><td>adjective, comparative</td><td>This is a **better** sentence.</td></tr>
<tr><td>ADJ</td><td>JJS</td><td>adjective, superlative</td><td>This is the **best** sentence.</td></tr>
<tr><td>ADJ</td><td>PDT</td><td>predeterminer</td><td>Waking up is **half** the battle.</td></tr>
<tr><td>ADJ</td><td>PRP\$</td><td>pronoun, possessive</td><td>**His** arm hurts.</td></tr>
<tr><td>ADJ</td><td>WDT</td><td>wh-determiner</td><td>It's blue, **which** is odd.</td></tr>
<tr><td>ADJ</td><td>WP\$</td><td>wh-pronoun, possessive</td><td>We don't know **whose** it is.</td></tr>
<tr><td>ADP</td><td>IN</td><td>conjunction, subordinating or preposition</td><td>It arrived **in** a box.</td></tr>
<tr><td>ADV</td><td>EX</td><td>existential there</td><td>**There** is cake.</td></tr>
<tr><td>ADV</td><td>RB</td><td>adverb</td><td>He ran **quickly**.</td></tr>
<tr><td>ADV</td><td>RBR</td><td>adverb, comparative</td><td>He ran **quicker**.</td></tr>
<tr><td>ADV</td><td>RBS</td><td>adverb, superlative</td><td>He ran **fastest**.</td></tr>
<tr><td>ADV</td><td>WRB</td><td>wh-adverb</td><td>**When** was that?</td></tr>
<tr><td>CONJ</td><td>CC</td><td>conjunction, coordinating</td><td>The balloon popped **and** everyone jumped.</td></tr>
<tr><td>DET</td><td>DT</td><td>determiner</td><td>**This** is **a** sentence.</td></tr>
<tr><td>INTJ</td><td>UH</td><td>interjection</td><td>**Um**, I don't know.</td></tr>
<tr><td>NOUN</td><td>NN</td><td>noun, singular or mass</td><td>This is a **sentence**.</td></tr>
<tr><td>NOUN</td><td>NNS</td><td>noun, plural</td><td>These are **words**.</td></tr>
<tr><td>NOUN</td><td>WP</td><td>wh-pronoun, personal</td><td>**Who** was that?</td></tr>
<tr><td>NUM</td><td>CD</td><td>cardinal number</td><td>I want **three** things.</td></tr>
<tr><td>PART</td><td>POS</td><td>possessive ending</td><td>Fred**'s** name is short.</td></tr>
<tr><td>PART</td><td>RP</td><td>adverb, particle</td><td>Put it **back**!</td></tr>
<tr><td>PART</td><td>TO</td><td>infinitival to</td><td>I want **to** go.</td></tr>
<tr><td>PRON</td><td>PRP</td><td>pronoun, personal</td><td>**I** want **you** to go.</td></tr>
<tr><td>PROPN</td><td>NNP</td><td>noun, proper singular</td><td>**Kilroy** was here.</td></tr>
<tr><td>PROPN</td><td>NNPS</td><td>noun, proper plural</td><td>The **Flintstones** were a pre-historic family.</td></tr>
<tr><td>VERB</td><td>MD</td><td>verb, modal auxiliary</td><td>This **could** work.</td></tr>
<tr><td>VERB</td><td>VB</td><td>verb, base form</td><td>I want to **go**.</td></tr>
<tr><td>VERB</td><td>VBD</td><td>verb, past tense</td><td>This **was** a sentence.</td></tr>
<tr><td>VERB</td><td>VBG</td><td>verb, gerund or present participle</td><td>I am **going**.</td></tr>
<tr><td>VERB</td><td>VBN</td><td>verb, past participle</td><td>The treasure was **lost**.</td></tr>
<tr><td>VERB</td><td>VBP</td><td>verb, non-3rd person singular present</td><td>I **want** to go.</td></tr>
<tr><td>VERB</td><td>VBZ</td><td>verb, 3rd person singular present</td><td>He **wants** to go.</td></tr>
</table>

#### Visualizing Parts of Speech
spaCy offers an outstanding visualizer called **displaCy**:

In [58]:
# Render the dependency parse immediately inside Jupyter:
# Import the displaCy library
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [59]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

The        DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        NOUN    nsubj   nominal subject
jumped     VERB    ROOT    root
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    poss    possession modifier
's         PART    case    case marking
back       NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation


#### Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of `displacy.render()`, use `displacy.serve()`:

In [60]:
displacy.serve(doc, style='dep', options={'distance': 110})




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [12/Feb/2025 11:43:58] "GET / HTTP/1.1" 200 9180
127.0.0.1 - - [12/Feb/2025 11:43:58] "GET /favicon.ico HTTP/1.1" 200 9180


Shutting down server on port 5000.


<font color=blue>**After running the cell above, click the link below to view the dependency parse**:</font>

http://127.0.0.1:5000
<br><br>
<font color=red>**To shut down the server and return to jupyter**, interrupt the kernel either through the **Kernel** menu above, by hitting the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`</font>

#### Handling Large Text
`displacy.serve()` accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:

In [61]:
doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")

# Create spans from Doc.sents:
spans = list(doc2.sents)

displacy.serve(spans, style='dep', options={'distance': 110})


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [12/Feb/2025 11:45:50] "GET / HTTP/1.1" 200 8133
127.0.0.1 - - [12/Feb/2025 11:45:50] "GET /favicon.ico HTTP/1.1" 200 8133


Shutting down server on port 5000.


In [62]:
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

displacy.serve(doc, style='dep', options=options)


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [12/Feb/2025 11:46:23] "GET / HTTP/1.1" 200 9391
127.0.0.1 - - [12/Feb/2025 11:46:23] "GET /favicon.ico HTTP/1.1" 200 9391


Shutting down server on port 5000.


### 2. Named Entity Recognition (NER) in spaCy

For more on **Named Entity Recognition** visit https://spacy.io/usage/linguistic-features#101


In **spaCy**, **Named Entity Recognition (NER)** is a process that identifies and classifies entities in text, such as names, organizations, dates, and locations. The **`ner`** component in spaCy's pipeline is responsible for detecting these entities based on pre-trained models.

#### 🔹 How to Access Named Entities:
- `token.ent_type_` → Entity type of a token.
- `ent.text` → Extracted entity text.
- `ent.label_` → Entity label (e.g., `PERSON`, `ORG`, `DATE`).
- `spacy.explain(ent.label_)` → Get explanations for entity labels.

**Example:**
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John works at Google and lives in New York.")

for ent in doc.ents:
    print(ent.text, "→", ent.label_, "(", spacy.explain(ent.label_), ")")
```

### 🌐 **Semantic Analysis in NLP**

**Semantic Analysis** is the process of understanding the **meaning and relationships** between words, phrases, and sentences in context.  
It goes beyond **syntactic structure** to capture **the intent, roles, and interactions** of words within text.

---

### 🔍 **Key Techniques in Semantic Analysis**

1. **Word Sense Disambiguation (WSD)**: Identify the correct meaning of words with multiple senses.
2. **Named Entity Recognition (NER)**: Recognize proper names (people, organizations, etc.).
3. **Semantic Role Labeling (SRL)**: Determine who did what to whom, when, and how.
4. **Relation Extraction**: Identify relationships between entities.
5. **Coreference Resolution**: Link pronouns to the correct entities.

---

#### 🛠️ **Python Code Example: Semantic Analysis with spaCy**

#### **Problem:** Analyze the semantic structure of the following sentence:  
*"John gave Mary a beautiful gift on her birthday."*

---

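In [None]:
doc = nlp("John gave Mary a beautiful gift on her birthday.")

# List the named entities found in the sentence
for ent in doc.ents:
    print(ent.text + ' -- is label--> ' + ent.label_ + ' --- EXPLAIN--> ' + str(spacy.explain(ent.label_)))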
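
Named entities cover only part of this problem. A rough "who did what to whom" reading can be sketched from dependency labels, as a simple stand-in for full semantic role labeling. To keep the example runnable without a trained model, the parse below is annotated by hand (an assumption for illustration); a pipeline such as `en_core_web_sm` would normally supply the heads and labels through `nlp(text)`, and its labels may differ. The names `sem_doc` and `describe_event`, and the role names, are hypothetical.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated parse (an assumption for illustration); a trained pipeline
# would normally supply heads and deps via nlp(text).
words = ["John", "gave", "Mary", "a", "beautiful", "gift", "on", "her", "birthday", "."]
heads = [1, 1, 1, 5, 5, 1, 1, 8, 6, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "dative", "det", "amod", "dobj", "prep", "poss", "pobj", "punct"]
sem_doc = Doc(spacy.blank("en").vocab, words=words, heads=heads, deps=deps)


def describe_event(doc):
    """Map the main verb's direct dependents to rough semantic roles."""
    root = next(token for token in doc if token.dep_ == "ROOT")
    roles = {"action": root.text}
    role_names = {"nsubj": "agent", "dative": "recipient", "dobj": "theme"}
    for token in doc:
        # Keep only tokens attached directly to the main verb
        if token.head.i == root.i and token.dep_ in role_names:
            roles[role_names[token.dep_]] = token.text
    return roles


print(describe_event(sem_doc))
```

This mapping only handles simple active sentences; passives, coordination, and pronoun resolution ("her") need the other techniques listed above.
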

In [111]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


Here we see tokens combine to form the entities `Washington, DC`, `next May` and `the Washington Monument`.

#### Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* index position in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* index position in the Doc</td></tr>
</table>

#### NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

#### Adding a Named Entity to a Span
Normally we would have spaCy build a library of named entities by training it on several samples of text.<br>In this case, we only want to add one value:

In [75]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')
show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [78]:
from spacy.tokens import Span
# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG'] 

# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

<font color=green>In the code above, the arguments passed to `Span()` are:</font>
-  `doc` - the name of the Doc object
-  `0` - the *start* index position of the span
-  `1` - the *stop* index position (exclusive)
-  `label=ORG` - the label assigned to our entity

In [79]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


#### Adding Named Entities to All Matching Spans
What if we want to tag *all* occurrences of "Tesla"? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

In [80]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
          u'If successful, the vacuum cleaner will be our first product.')

show_ents(doc)

first - ORDINAL - "first", "second", etc.


In [83]:
# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [84]:
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

In [85]:
# Apply the patterns to our matcher object:
# spaCy 3 takes the patterns as a list (spaCy 2 used matcher.add('newproduct', None, *phrase_patterns))
matcher.add('newproduct', phrase_patterns)

# Apply the matcher to our Doc object:
matches = matcher(doc)

# See what matches occur:
matches

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

In [86]:
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span

PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents

In [87]:
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.
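
The same tagging can also be sketched with spaCy's rule-based `entity_ruler` component, which adds matching spans to `doc.ents` as part of the pipeline, so no manual `Span` construction is needed. This is an alternative to the PhraseMatcher approach above, not a replacement for it. The sketch uses a blank pipeline so it runs without a trained model; `ruler_nlp` and `ruled_doc` are names chosen here.

```python
import spacy

# Blank pipeline plus a rule-based entity_ruler; no trained model is needed.
# In a full pipeline the ruler is typically added before "ner".
ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "vacuum cleaner"},
    {"label": "PRODUCT", "pattern": "vacuum-cleaner"},
])

ruled_doc = ruler_nlp(
    "Our company plans to introduce a new vacuum cleaner. "
    "If successful, the vacuum cleaner will be our first product."
)
print([(ent.text, ent.label_) for ent in ruled_doc.ents])
```
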


#### Counting Entities
While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

In [88]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [89]:
len([ent for ent in doc.ents if ent.label_=='MONEY'])

2
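
To tally every label at once instead of filtering for a single one, `collections.Counter` can be applied to `ent.label_`. The sketch below builds the `Doc` and its two MONEY entities by hand so that it runs without a trained model; with a loaded pipeline the same `Counter` line works on any `doc.ents`. The names `money_doc` and `label_counts` are chosen for this illustration.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc, Span

# Hand-built Doc with the two MONEY entities shown above (no trained model needed).
words = ["Originally", "priced", "at", "$", "29.50", ",", "the", "sweater",
         "was", "marked", "down", "to", "five", "dollars", "."]
money_doc = Doc(spacy.blank("en").vocab, words=words)
money_doc.ents = [
    Span(money_doc, 4, 5, label="MONEY"),    # "29.50"
    Span(money_doc, 12, 14, label="MONEY"),  # "five dollars"
]

# One pass over doc.ents gives a count for every label present.
label_counts = Counter(ent.label_ for ent in money_doc.ents)
print(label_counts)  # Counter({'MONEY': 2})
```
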

In [91]:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


### <font color=green>If whitespace such as a line break is ever tagged as an entity, a simple fix can be added to the nlp pipeline:</font>

In [94]:
from spacy.language import Language

# Custom function to remove whitespace-only entities
@Language.component("remove_whitespace_entities")
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Register and insert the function AFTER 'ner'
nlp.add_pipe("remove_whitespace_entities", after='ner')

# Test the pipeline
doc = nlp("John Doe is   ")
print([(ent.text, ent.label_) for ent in doc.ents])  # Should return non-whitespace entities only


[('John Doe', 'PERSON')]


In [95]:
# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [96]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward


#### `Doc.noun_chunks` is a generator
Previously we mentioned that `Doc` objects do not retain a list of sentences, but they're available through the `Doc.sents` generator.<br>It's the same with `Doc.noun_chunks` - lists can be created if needed:

In [99]:
len(list(doc.noun_chunks))

3

For more on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks

In [100]:
from spacy import displacy
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)

#### Viewing Sentences Line by Line
Unlike the **displaCy** dependency parse, the NER viewer has to take in a Doc object with an `ents` attribute. For this reason, we can't just pass a list of spans to `.render()`. Instead, we have to create a new Doc from each `span.text`:

In [101]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

**NOTE**: If a span does not contain any entities, displaCy will issue a harmless warning:

In [103]:
doc2 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, my kids sold a lot of lemonade.')

for sent in doc2.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)



In [104]:
for sent in doc2.sents:
    docx = nlp(sent.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)

By contrast, my kids sold a lot of lemonade.


#### Customizing Colors and Effects
You can also pass background color and gradient options:

In [105]:
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=options)

For more on applying CSS background colors and gradients, visit https://www.w3schools.com/css/css3_gradients.asp

#### Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of `displacy.render()`, use `displacy.serve()`:
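
A minimal sketch of both options follows. Outside Jupyter, `displacy.render()` returns the markup as a string that can be written to a file, while `displacy.serve()` starts a local web server and blocks until interrupted. The blank pipeline, the hand-set entities, the file name `entities.html`, and the helper `serve_entities` are all assumptions made for illustration.

```python
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple sold nearly 20 thousand iPods.")

# Hand-set entities (assumption) so the sketch runs without a trained model.
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 6, label="PRODUCT")]

# With jupyter=False, render() returns the markup; page=True wraps it in a full HTML page.
html = displacy.render(doc, style="ent", jupyter=False, page=True)
Path("entities.html").write_text(html, encoding="utf-8")


def serve_entities(doc):
    """Start displaCy's local web server; blocks until stopped with Ctrl+C."""
    displacy.serve(doc, style="ent")
```

Calling `serve_entities(doc)` from a script would then make the visualization available in a browser until the process is stopped.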


### 3. Tokenizer
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:
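
As a small illustration, the sketch below tokenizes a short sentence with a blank English pipeline. The sentence itself is an assumption chosen for this example; a blank pipeline contains only the tokenizer, so no trained model is needed.

```python
import spacy

# Tokenizer-only pipeline; no trained model is required.
nlp = spacy.blank("en")
doc = nlp("Tesla isn't looking into startups anymore.")

# Note how the contraction "isn't" becomes two tokens and the final period its own token.
tokens = [token.text for token in doc]
print(" | ".join(tokens))
print(f"Number of tokens: {len(doc)}")
```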


#### Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:


|Attribute|Description|Example: `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
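
Some of these attributes are purely lexical and are available straight from the tokenizer. The sketch below reads them from a blank pipeline; the sentence is an assumption chosen so that `doc2[0]` is `Tesla`. The `.lemma_`, `.pos_` and `.tag_` values are filled in by trained pipeline components, so they are left out here.

```python
import spacy

nlp = spacy.blank("en")
doc2 = nlp("Tesla isn't looking into startups anymore.")
token = doc2[0]

# Lexical attributes that do not depend on a trained model.
lexical = {
    "text": token.text,
    "shape_": token.shape_,
    "is_alpha": token.is_alpha,
    "is_stop": token.is_stop,
}
print(lexical)
```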


-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`

Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.

#### Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [23]:
mystring = '"We\'re moving to L.A.!"'
print(mystring)
doc = nlp(mystring)
for token in doc:    
    print(token.text, end=' | ')

"We're moving to L.A.!"
" | We | 're | moving | to | L.A. | ! | " | 

In [24]:
print(f"Number of tokens: {len(doc)}")

Number of tokens: 8
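
The claim about email addresses, websites and numerical values can be checked with a short sketch. The address and URL below are hypothetical placeholders, and a blank pipeline is used so that only the tokenizer rules are involved.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Email support@oursite.com or visit http://www.oursite.com! "
          "A 5km cab ride costs $10.30")

tokens = [token.text for token in doc]
print(" | ".join(tokens))

# The email address and URL survive as single tokens, while the currency
# symbol (a prefix) and the unit "km" (a suffix) are split off.
for expected in ("support@oursite.com", "http://www.oursite.com", "$", "10.30", "km"):
    print(expected, expected in tokens)
```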


**Spans:** Large Doc objects can be hard to work with at times. A span is a slice of a Doc object, in the form `Doc[start:stop]`.

In upcoming lectures we'll see how to create Span objects using `Span()`. This will allow us to assign additional information to the Span.

In [None]:
# The Doc above has 8 tokens; slice out tokens 1 through 6 (everything between the quotation marks)
quote_span = doc[1:7]
print(f"Span/slice: {quote_span}")



#### 4. 🛠️ **Lemmatization vs Stemming in NLP**

Both **lemmatization** and **stemming** are text normalization techniques used to reduce words to their base or root forms, which helps in **improving text analysis** by treating different forms of the same word as identical.

---

#### 🔍 **1️⃣ Lemmatization**

**Lemmatization** reduces a word to its **base or dictionary form (lemma)** based on its **meaning and context**. It requires a vocabulary and **morphological analysis** of the word.  
- **Tool in spaCy:** `token.lemma_`

**Example:**  
- `running` → `run`  
- `better` → `good`  
- `studies` → `study`  

**Key Characteristics:**  
- **Context-aware** (depends on POS tags)  
- **Slower but more accurate**  
- Requires a language model (e.g., `en_core_web_sm` in spaCy)  

**spaCy Example:**  
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He is running and studies hard.")
print([(token.text, token.lemma_) for token in doc])
```

#### 🌱 **Stemming in NLP**

**Stemming** is a text preprocessing technique used in **Natural Language Processing (NLP)** to reduce words to their **root form** by **removing prefixes or suffixes**. It applies **rule-based algorithms** without considering the word's **context or meaning**.

---

#### 🔍 **How Stemming Works**

Stemming works by applying **linguistic heuristics** to cut off **affixes** like **-ing**, **-ed**, **-s**, and **-ly** to find the **stem** of a word.  
- It does **not guarantee a valid word** but helps in **text normalization** for tasks like **information retrieval**.

##### **Examples:**
- `running` → `run`  
- `studies` → `studi` *(not a valid word)*  
- `better` → `better` *(no change as it doesn't follow simple rules)*  
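
To make the idea of suffix stripping concrete, here is a deliberately tiny, hypothetical stemmer. It is not the Porter algorithm (whose rules are far more detailed, which is why its output for `studies` differs); the function name `toy_stem` and its rule list are assumptions made purely for illustration.

```python
def toy_stem(word):
    """Strip one common suffix using fixed rules, ignoring meaning and context."""
    word = word.lower()
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Collapse a doubled final consonant ("runn" -> "run"), except l, s, z.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


words = ["running", "played", "quickly", "cats", "studies", "better"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Even this crude rule set shows both properties described above: the result may not be a valid word (`studies` becomes `studie`), and irregular forms such as `better` pass through unchanged.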

---

##### 🧠 **Popular Stemming Algorithms**

1. **Porter Stemmer** (widely used and efficient for English)  
2. **Snowball Stemmer** (an improved version of Porter)  
3. **Lancaster Stemmer** (more aggressive and less accurate)  

---

##### ⚙️ **Python Code Examples**

##### 1️⃣ **Porter Stemmer**
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "played", "flies", "studies", "better"]
print([stemmer.stem(word) for word in words])
```


In [4]:
# Sample text for lemmatization
text = "The cats are running faster than the mice, while John was studying hard for his exams."

# Process the text through spaCy's pipeline
doc = nlp(text)

# Print each word with its lemma
for token in doc:
    print(f"{token.text} → {token.lemma_}")


The → the
cats → cat
are → be
running → run
faster → fast
than → than
the → the
mice → mouse
, → ,
while → while
John → John
was → be
studying → study
hard → hard
for → for
his → his
exams → exam
. → .


In [5]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


<font color=green>In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication.</font>
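
The large integers are hash IDs: `token.lemma` is the hash and `token.lemma_` is the string it stands for. Both directions of the lookup go through the vocabulary's string store, `nlp.vocab.strings`. A minimal sketch, assuming a blank pipeline so that no trained model is needed:

```python
import spacy

nlp = spacy.blank("en")
strings = nlp.vocab.strings

# Adding a string stores it and returns its 64-bit hash.
run_hash = strings.add("run")

# The store maps string -> hash and hash -> string.
print(run_hash, strings[run_hash])
print(strings["run"] == run_hash)
```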

### 5. Parser

### 🔍 **Dependency Parser in NLP (spaCy)**

**Parser** in NLP refers to the **Dependency Parser**, which analyzes the **grammatical structure** of a sentence.  
It determines the **relationships between words** by assigning each word a **syntactic role** (e.g., subject, object, modifier) and connecting them in a **dependency tree**.

---

#### 🛠️ **What Does the Parser Do?**

1. **Identifies syntactic dependencies** (e.g., subject, object, modifier).  
2. **Builds a dependency tree** to represent sentence structure.  
3. **Helps with relation extraction** (e.g., who did what to whom).  
4. **Supports downstream tasks** like **Named Entity Recognition (NER)**, **coreference resolution**, and **text generation**.

---

#### 🌲 **Dependency Parser in Action (spaCy)**

```python
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample sentence
text = "John gave Mary a beautiful gift."

# Process the text
doc = nlp(text)

# Print the dependency information
print(f"{'Token':<10} {'Dependency':<15} {'Head':<10} {'Children'}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<10} {token.dep_:<15} {token.head.text:<10} {[child.text for child in token.children]}")
```


In [6]:
# Sample sentence
text = "John gave Mary a beautiful gift."

# Process the text
doc = nlp(text)

# Print the dependency information
print(f"{'Token':<10} {'Dependency':<15} {'Head':<10} {'Children'}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<10} {token.dep_:<15} {token.head.text:<10} {[child.text for child in token.children]}")

Token      Dependency      Head       Children
--------------------------------------------------
John       nsubj           gave       []
gave       ROOT            gave       ['John', 'Mary', 'gift', '.']
Mary       dative          gave       []
a          det             gift       []
beautiful  amod            gift       []
gift       dobj            gave       ['a', 'beautiful']
.          punct           gave       []


**Applications of Dependency Parsing**
- Relation Extraction: Extract relationships from text (e.g., subject, object).
- Text Summarization: Identify key components in sentences.
- Sentiment Analysis: Analyze how words relate to each other.
- Question Answering Systems: Understand sentence structure for better response generation.
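
As a rough illustration of the relation-extraction use case, the sketch below pulls a "who gave what to whom" tuple out of the parse shown above. To stay runnable without a trained model, it assumes spaCy v3 and builds the `Doc` by hand from the heads and dependency labels in that output. The helper `extract_relation` is hypothetical, and with a loaded pipeline you would call it on `nlp(text)` instead.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Parse copied from the output above; heads are token indices (spaCy v3 convention).
words = ["John", "gave", "Mary", "a", "beautiful", "gift", "."]
heads = [1, 1, 1, 5, 5, 1, 1]
deps = ["nsubj", "ROOT", "dative", "det", "amod", "dobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def extract_relation(doc):
    """Return (subject, verb, recipient, object) for the sentence root."""
    root = next(token for token in doc if token.dep_ == "ROOT")
    by_dep = {child.dep_: child.text for child in root.children}
    return by_dep.get("nsubj"), root.text, by_dep.get("dative"), by_dep.get("dobj")


relation = extract_relation(doc)
print(relation)
```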

#### 6. Text Classification

##### 🧠 **Text Classification in spaCy**

**Text Classification** is the process of **categorizing text into predefined labels**.  
In **spaCy**, the **`textcat` component** is responsible for **text classification** tasks like **sentiment analysis**, **topic detection**, **spam filtering**, and more.

---

##### 🔍 **How Does Text Classification Work in spaCy?**

1. **Tokenization:** The text is broken into tokens.  
2. **Feature Extraction:** Relevant linguistic features are extracted.  
3. **Model Training:** A statistical or transformer-based model learns from labeled data.  
4. **Prediction:** The model assigns **one or more labels** to unseen text.

---

##### ⚙️ **Text Classification Modes**

spaCy supports two types of text classification:  

1. **Single-label classification**: Each text is assigned exactly **one label**.  
   *(e.g., **`positive`** or **`negative`** sentiment)*  

2. **Multi-label classification**: Each text can be assigned **multiple labels**.  
   *(e.g., a news article can be both **`politics`** and **`economy`**)*  
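
In spaCy v3 these two modes correspond to two separate pipeline components, `textcat` and `textcat_multilabel`. The sketch below only sets the components up on blank pipelines and registers labels. The label names are taken from the examples above, and the components would still need to be trained on labeled data before they can predict anything.

```python
import spacy

# Single-label: categories are mutually exclusive.
nlp_single = spacy.blank("en")
textcat = nlp_single.add_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)

# Multi-label: any number of categories may apply to one text.
nlp_multi = spacy.blank("en")
textcat_multi = nlp_multi.add_pipe("textcat_multilabel")
for label in ("politics", "economy"):
    textcat_multi.add_label(label)

print(nlp_single.pipe_names, textcat.labels)
print(nlp_multi.pipe_names, textcat_multi.labels)
```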

---

##### 🛠️ **Python Code Example: Text Classification with spaCy**

###### **🔹 Task:** Classify text as **Positive**, **Negative**, or **Neutral**.

---
