spaCy is a fast, open-source Python library for advanced Natural Language Processing (NLP). It covers a wide range of text processing and analysis tasks. Below is an overview of the key operations you can perform with spaCy, with an example for each.

### 1. Installation

First, you need to install spaCy and download a language model:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

### 2. Loading the Language Model

```python
import spacy
nlp = spacy.load("en_core_web_sm")
```

### 3. Tokenization

Tokenization is the process of breaking text into individual words or tokens.

```python
doc = nlp("SpaCy is an excellent NLP library.")
for token in doc:
    print(token.text)
```
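
Tokenization is rule-based and does not depend on a trained model. As a minimal sketch, assuming only that spaCy itself is installed, a blank English pipeline is enough to inspect tokens and a few of their lexical attributes. The names `blank_nlp` and `blank_doc` are illustrative and not part of the examples above.

```python
import spacy

# A blank pipeline contains only the tokenizer (no tagger, parser, or NER).
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("SpaCy is an excellent NLP library.")

# Each token knows its position in the doc and carries simple lexical flags.
for token in blank_doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

tokens = [token.text for token in blank_doc]
print(tokens)
```

Note that the final period becomes its own token, which is why punctuation shows up in the later per-token listings.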

### 4. Part-of-Speech (POS) Tagging

Identifying the part of speech for each token.

```python
for token in doc:
    print(f"{token.text} - {token.pos_}")
```

### 5. Named Entity Recognition (NER)

Identifying named entities in the text.

```python
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
```

### 6. Dependency Parsing

Analyzing the syntactic structure of the sentence.

```python
for token in doc:
    print(f"{token.text} - {token.dep_} - {token.head.text}")
```
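
The POS tags, entity labels, and dependency labels printed in the sections above are terse abbreviations. `spacy.explain` returns a short description for many of them, which can help when reading the output. The sketch below needs no model; the particular labels chosen are just illustrative picks.

```python
import spacy

# Look up human-readable descriptions for a few common labels.
labels = ["PROPN", "AUX", "nsubj", "amod", "GPE"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```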

### 7. Lemmatization

Finding the base form of each word.

```python
for token in doc:
    print(f"{token.text} - {token.lemma_}")
```

### 8. Stop Words

Identifying and removing stop words.

```python
for token in doc:
    if not token.is_stop:
        print(token.text)
```
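
The loop above only prints the non-stop tokens. To actually remove stop words, a common approach is to build a filtered list, usually dropping punctuation at the same time. This is a minimal sketch using a blank English pipeline (stop word flags are lexical, so no trained model is needed); `content_words` is an illustrative name.

```python
import spacy

blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("SpaCy is an excellent NLP library.")

# Keep tokens that are neither stop words nor punctuation.
content_words = [t.text for t in blank_doc if not t.is_stop and not t.is_punct]
print(content_words)

# The language's default stop word set can be inspected directly.
print(len(blank_nlp.Defaults.stop_words))
```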

### 9. Text Similarity

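Calculating similarity between two texts. Note that `en_core_web_sm` ships without static word vectors, so its similarity scores are only rough approximations; for more meaningful results, load a model that includes vectors, such as `en_core_web_md` or `en_core_web_lg`.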

```python
doc1 = nlp("I love cats")
doc2 = nlp("I love dogs")
similarity = doc1.similarity(doc2)
print(f"Similarity: {similarity}")
```
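
By default, `Doc.similarity` is a cosine similarity between the two documents' vectors. The sketch below shows that measure in plain Python. The helper `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not real spaCy vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: two similar "meanings" and one unrelated one.
vec_cats = [0.9, 0.1, 0.3]
vec_dogs = [0.8, 0.2, 0.3]
vec_tax = [-0.2, 0.9, -0.4]

print(cosine_similarity(vec_cats, vec_dogs))
print(cosine_similarity(vec_cats, vec_tax))
```

Vectors pointing in similar directions score close to 1, while unrelated or opposing directions score near 0 or below.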

### 10. Visualizing Dependencies

Using displaCy for visualizing syntactic dependencies.

```python
from spacy import displacy
displacy.render(doc, style="dep")
```
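
Outside a Jupyter notebook, or when `jupyter=False` is passed, `displacy.render` does not display anything; it returns the markup as a string, which you can print or save. The sketch below illustrates this with displaCy's manual mode, so it runs without a trained model. The hand-written `parsed` dictionary (words, tags, and arcs) is a made-up parse for illustration, not model output.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual input format.
parsed = {
    "words": [
        {"text": "SpaCy", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "excellent", "tag": "ADJ"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "acomp", "dir": "right"},
    ],
}

# With jupyter=False the SVG markup is returned rather than displayed.
svg = displacy.render(parsed, style="dep", manual=True, jupyter=False)
print(svg[:60])
```

The returned string can be written to an `.svg` file and opened in a browser.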

### 11. Custom Named Entity Recognition

Adding custom named entities to the NER pipeline.

```python
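from spacy.tokens import Span
from spacy.util import filter_spans

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
org_label = Span(doc, 0, 1, label="ORG")       # "Apple"
money_label = Span(doc, 8, 11, label="MONEY")  # "$1 billion"

# doc.ents may not contain overlapping spans, and the model may already have
# tagged these tokens. filter_spans removes overlaps, preferring longer spans
# and, for identical spans, the one listed first (here, the custom spans).
doc.ents = filter_spans([org_label, money_label] + list(doc.ents))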

for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
```

### 12. Rule-Based Matching

Using the Matcher class to find patterns in the text.

```python
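from spacy.matcher import Matcher

doc = nlp("SpaCy is an excellent NLP library.")
matcher = Matcher(nlp.vocab)
# "nlp", then an optional punctuation token, then "library" (case-insensitive)
pattern = [{"LOWER": "nlp"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "library"}]
matcher.add("NLP_LIBRARY_PATTERN", [pattern])
matches = matcher(doc)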

for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
```

### 13. Sentence Segmentation

Segmenting text into sentences.

```python
for sent in doc.sents:
    print(sent.text)
```
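
`doc.sents` needs sentence boundaries to be set by some pipeline component; in `en_core_web_sm` the dependency parser provides them. When no parser is available, spaCy's rule-based `sentencizer` component can split on sentence-final punctuation instead. A minimal sketch with a blank pipeline, where the second sentence is made up for illustration:

```python
import spacy

blank_nlp = spacy.blank("en")
# The sentencizer sets sentence boundaries from punctuation, without a parser.
blank_nlp.add_pipe("sentencizer")

blank_doc = blank_nlp("SpaCy is an excellent NLP library. It is also fast.")
sentences = [sent.text for sent in blank_doc.sents]
print(sentences)
```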

### 14. Word Vectors and Similarity

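Accessing word vectors and calculating similarity between words. Meaningful scores require a model that includes word vectors, such as `en_core_web_md`; `en_core_web_sm` has none, so the result below is unreliable with the small model.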

```python
token1 = nlp("cat")[0]
token2 = nlp("dog")[0]
similarity = token1.similarity(token2)
print(f"Word Similarity: {similarity}")
```

### 15. Custom Components in Pipeline

Adding custom components to the spaCy processing pipeline.

```python
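from spacy.language import Language

# spaCy v3 requires components to be registered under a name
@Language.component("custom_component")
def custom_component(doc):
    print("Custom component executed")
    return doc

nlp.add_pipe("custom_component", last=True)
doc = nlp("SpaCy is an excellent NLP library.")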
```

### 16. Text Classification

Training and using text classification models (requires more extensive setup, including training data).

```python
# Placeholder for training a text classifier
# More details would be needed for a full example
```
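
To give a rough idea of what that setup involves, here is a minimal sketch of training a `textcat` component in a blank pipeline, assuming spaCy v3. The training sentences and the labels `POSITIVE` and `NEGATIVE` are made up, and four examples are far too few for a usable model; real projects typically use spaCy's config-driven training workflow with much more data.

```python
import random

import spacy
from spacy.training import Example

# Toy, made-up training data: illustrative only.
TRAIN_DATA = [
    ("I love this library", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This is an excellent tool", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate this bug", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("This is a terrible experience", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

textcat_nlp = spacy.blank("en")
textcat = textcat_nlp.add_pipe("textcat")  # mutually exclusive classes
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

examples = [
    Example.from_dict(textcat_nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
optimizer = textcat_nlp.initialize(lambda: examples)

for epoch in range(10):
    random.shuffle(examples)
    losses = {}
    textcat_nlp.update(examples, sgd=optimizer, losses=losses)

# doc.cats maps each label to a score.
scores = textcat_nlp("I love this tool").cats
print(scores)
```

The scores from such a tiny data set are not meaningful; the point is only the shape of the workflow: add the component, register labels, build `Example` objects, initialize, and update.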

### 17. Extending Token and Doc Objects

Adding custom attributes to tokens or documents.

```python
from spacy.tokens import Token

Token.set_extension("is_custom", default=False)

token = nlp("Hello")[0]
token._.is_custom = True
print(token._.is_custom)
```
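
The example above extends `Token`; `Doc` works the same way. As a sketch, the hypothetical attribute `content_word_count` below uses a getter, so its value is computed on access rather than stored. Passing `force=True` overwrites an existing extension of the same name, which avoids an error when a notebook cell is re-run.

```python
import spacy
from spacy.tokens import Doc

def count_content_words(doc):
    """Number of tokens that are neither stop words nor punctuation."""
    return sum(1 for t in doc if not t.is_stop and not t.is_punct)

# Register a computed (getter-based) attribute on Doc.
Doc.set_extension("content_word_count", getter=count_content_words, force=True)

blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("SpaCy is an excellent NLP library.")
print(blank_doc._.content_word_count)
```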

These are some of the core functionalities of spaCy. Each of these operations can be further customized and combined to build complex NLP applications. For more detailed use cases and advanced configurations, refer to the official spaCy documentation at [spacy.io](https://spacy.io/).

```python
import spacy
from spacy.tokens import Span, Token
from spacy.matcher import Matcher
from spacy import displacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "SpaCy is an excellent NLP library. Apple is looking at buying U.K. startup for $1 billion."

# Process the text
doc = nlp(text)
doc
```

```
SpaCy is an excellent NLP library. Apple is looking at buying U.K. startup for $1 billion.
```

```python
# 1. Tokenization
print("Tokenization:")
for token in doc:
    print(token.text)
print("\n")
```

```
Tokenization:
SpaCy
is
an
excellent
NLP
library
.
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
```

```python
# 2. Part-of-Speech (POS) Tagging
print("POS Tagging:")
for token in doc:
    print(f"{token.text} - {token.pos_}")
print("\n")
```

```
POS Tagging:
SpaCy - PROPN
is - AUX
an - DET
excellent - ADJ
NLP - PROPN
library - NOUN
. - PUNCT
Apple - PROPN
is - AUX
looking - VERB
at - ADP
buying - VERB
U.K. - PROPN
startup - NOUN
for - ADP
$ - SYM
1 - NUM
billion - NUM
. - PUNCT
```

```python
# 3. Named Entity Recognition (NER)
print("Named Entity Recognition:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
print("\n")
```

```
Named Entity Recognition:
NLP - ORG
Apple - ORG
U.K. - GPE
$1 billion - MONEY
```

```python
# 4. Dependency Parsing
print("Dependency Parsing:")
for token in doc:
    print(f"{token.text} - {token.dep_} - {token.head.text}")
print("\n")
```

```
Dependency Parsing:
SpaCy - nsubj - is
is - ROOT - is
an - det - library
excellent - amod - library
NLP - compound - library
library - attr - is
. - punct - is
Apple - nsubj - looking
is - aux - looking
looking - ROOT - looking
at - prep - looking
buying - pcomp - at
U.K. - dobj - buying
startup - dep - looking
for - prep - startup
$ - quantmod - billion
1 - compound - billion
billion - pobj - for
. - punct - looking
```

```python
# 5. Lemmatization
print("Lemmatization:")
for token in doc:
    print(f"{token.text} - {token.lemma_}")
print("\n")
```

```
Lemmatization:
SpaCy - SpaCy
is - be
an - an
excellent - excellent
NLP - NLP
library - library
. - .
Apple - Apple
is - be
looking - look
at - at
buying - buy
U.K. - U.K.
startup - startup
for - for
$ - $
1 - 1
billion - billion
. - .
```

```python
# 6. Stop Words
print("Stop Words:")
for token in doc:
    if not token.is_stop:
        print(token.text)
```

```
Stop Words:
SpaCy
excellent
NLP
library
.
Apple
looking
buying
U.K.
startup
$
1
billion
.
```

```python
# 7. Text Similarity
doc1 = nlp("Don't make fun")
doc2 = nlp("It's funny")
similarity = doc1.similarity(doc2)
print(f"Text Similarity: {similarity}\n")
```

```
Text Similarity: 0.16777146679017096
```

```python
from spacy.language import Language
from spacy.util import filter_spans

# 8. Visualizing Dependencies
print("Visualizing Dependencies:")
# With jupyter=False, render() returns the SVG markup instead of displaying it
svg = displacy.render(doc, style="dep", jupyter=False)

# 9. Custom Named Entity Recognition
print("Custom Named Entity Recognition:")
org_label = Span(doc, 7, 8, label="ORG")        # "Apple"
money_label = Span(doc, 15, 18, label="MONEY")  # "$1 billion"
# Appending spans that overlap existing entities raises ValueError [E1010]
# (token 7, "Apple", is already an ORG entity); filter_spans removes overlaps.
doc.ents = filter_spans([org_label, money_label] + list(doc.ents))
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
print("\n")

# 10. Rule-Based Matching
print("Rule-Based Matching:")
matcher = Matcher(nlp.vocab)
# The punctuation token is optional so that "NLP library" matches
pattern = [{"LOWER": "nlp"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "library"}]
matcher.add("NLP_LIBRARY_PATTERN", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
print("\n")

# 11. Sentence Segmentation
print("Sentence Segmentation:")
for sent in doc.sents:
    print(sent.text)
print("\n")

# 12. Word Vectors and Similarity
print("Word Vectors and Similarity:")
token1 = nlp("cat")[0]
token2 = nlp("dog")[0]
similarity = token1.similarity(token2)
print(f"Word Similarity: {similarity}\n")

# 13. Custom Components in Pipeline
print("Custom Components in Pipeline:")
# spaCy v3 requires components to be registered under a name
@Language.component("custom_component")
def custom_component(doc):
    print("Custom component executed")
    return doc

nlp.add_pipe("custom_component", last=True)
doc = nlp("SpaCy is an excellent NLP library.")
print("\n")

# 14. Extending Token and Doc Objects
print("Extending Token and Doc Objects:")
# force=True allows the cell to be re-run without an "already exists" error
Token.set_extension("is_custom", default=False, force=True)
token = nlp("Hello")[0]
token._.is_custom = True
print(token._.is_custom)
```
