
# **spaCy Tutorial for Natural Language Processing**

## **Introduction**

Python has emerged as a popular language for Natural Language Processing (NLP) due to its simplicity and powerful libraries. One such library is spaCy, which provides easy-to-use and efficient tools for various NLP tasks. This tutorial aims to introduce beginners to spaCy and cover essential NLP tasks using this library.


### Introduction to spaCy

#### What is spaCy?
spaCy is an open-source library for advanced NLP in Python. It is designed to be fast, streamlined, and simple to use. spaCy offers features for tokenization, named entity recognition (NER), part-of-speech tagging, dependency parsing, and more.


#### Installation
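To install spaCy, use pip, then download the pretrained English pipeline that the examples in this tutorial load:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```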

In [None]:
!pip install spacy

In [2]:
import spacy

# Load the spaCy NLP object
nlp = spacy.load("en_core_web_sm")

# Preprocess the text
text = "This is a sample text."
doc = nlp(text)

# Print the tokens
for token in doc:
    print(token.text)

This
is
a
sample
text
.



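`en_core_web_sm` is spaCy's small English pipeline. It is a separate package that must be downloaded before `spacy.load` can find it (for example with `python -m spacy download en_core_web_sm`). To work with another language or a larger model, replace the name with the pipeline you want to use.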
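If the pipeline package is missing, `spacy.load` raises an `OSError`. The following is a minimal sketch of a defensive loader. The helper name `load_pipeline` and the fallback to a blank English pipeline are assumptions made for illustration; they are not part of spaCy itself. A blank pipeline still tokenizes text, but it has no tagger, parser, or entity recognizer.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name="en_core_web_sm"):
    """Load an installed pipeline, or fall back to a blank English one."""
    if is_package(name):
        return spacy.load(name)
    # Fallback (an assumption of this sketch): tokenizer only, no trained components
    return spacy.blank("en")


nlp = load_pipeline()
print(nlp.lang, nlp.pipe_names)
```

An empty `pipe_names` list indicates that the fallback was used and that attributes such as `token.pos_` will not be populated.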


### Text Preprocessing with spaCy

#### Tokenization
Tokenization breaks text into individual words or tokens. Here's how to tokenize text using spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Tokenization breaks text into tokens."
doc = nlp(text)

for token in doc:
    print(token.text)
```


In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Tokenization breaks text into tokens."
doc = nlp(text)

for token in doc:
    print(token.text)

Tokenization
breaks
text
into
tokens
.
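Each token also carries attributes that are useful during preprocessing. The sketch below prints a few of them and filters out punctuation. It uses a blank English pipeline on the assumption that only tokenizer-level attributes are needed, so no pretrained model has to be installed.

```python
import spacy

# A blank pipeline is enough here: these attributes come from the tokenizer
# and the vocabulary, not from trained components
nlp = spacy.blank("en")
doc = nlp("Tokenization breaks text into tokens.")

for token in doc:
    # idx is the character offset of the token in the original text
    print(f"{token.text:15} idx={token.idx:2} alpha={token.is_alpha} "
          f"punct={token.is_punct} stop={token.is_stop}")

# Keep only the tokens that are not punctuation
words = [token.text for token in doc if not token.is_punct]
print(words)
```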


#### Lemmatization
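Lemmatization reduces each word to its base (dictionary) form, called the lemma; for example, "tokens" becomes "token". spaCy stores the lemma in `token.lemma_`. This example reuses the `doc` created in the tokenization step: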

```python
for token in doc:
    print(token.text, token.lemma_)
```

#### Part-of-Speech Tagging
Part-of-speech tagging assigns each token a grammatical category, such as noun or verb. spaCy exposes the coarse-grained tag as `token.pos_`:

```python
for token in doc:
    print(token.text, token.pos_)
```

In [16]:
# Import the spaCy library
import spacy

# Load the English language model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

# Define the input text
text = "Tokenization breaks text into tokens."

# Process the text using the spaCy model
doc = nlp(text)

# Print each token with its index and its coarse-grained part-of-speech tag
for i, token in enumerate(doc):
    print(f"{token.text:15} is token {i:2} and its part of speech is {token.pos_}")

Tokenization    is token  0 and its part of speech is NOUN
breaks          is token  1 and its part of speech is VERB
text            is token  2 and its part of speech is NOUN
into            is token  3 and its part of speech is ADP
tokens          is token  4 and its part of speech is NOUN
.               is token  5 and its part of speech is PUNCT
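Tag names such as `ADP` are not self-explanatory. As a small sketch, `spacy.explain` looks up a short description for a tag; the list of tags here is simply the set that appears in the output above.

```python
import spacy

# Coarse-grained tags that appear in the output above
tags = ["NOUN", "VERB", "ADP", "PUNCT"]

# spacy.explain returns a human-readable description, or None for unknown labels
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag:6} {description}")
```

The same function also works for dependency labels and entity labels.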


### Named Entity Recognition (NER)

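Named Entity Recognition finds spans of text that name real-world objects, such as people, organizations, and locations, and assigns each span a label. The recognized spans are available as `doc.ents`. Example: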

```python
text = "Apple is situated in California."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
```

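### Dependency Parsing

Dependency parsing reveals the grammatical structure of a sentence: each token is linked to a head token by a labeled relation (`token.dep_`), and the root of the sentence is its own head. Example: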

```python
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_)
```


In [18]:
# Import the spaCy library
import spacy

# Load the English language model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

# Define the input text
text = "Tokenization breaks text into tokens."

# Process the text using the spaCy model
doc = nlp(text)

# Iterate over each token in the processed document
for token in doc:
    # The token's text
    token_text = token.text

    # The token's dependency label (its relation to the head)
    dependency_label = token.dep_

    # The text of the token's head
    head_token_text = token.head.text

    # The part of speech of the token's head
    head_token_pos = token.head.pos_

    # Combine and print all the information
    print(f"Token Text: {token_text}, Dependency Label: {dependency_label}, Head Token Text: {head_token_text}, Head Token POS: {head_token_pos}")


Token Text: Tokenization, Dependency Label: nsubj, Head Token Text: breaks, Head Token POS: VERB
Token Text: breaks, Dependency Label: ROOT, Head Token Text: breaks, Head Token POS: VERB
Token Text: text, Dependency Label: dobj, Head Token Text: breaks, Head Token POS: VERB
Token Text: into, Dependency Label: prep, Head Token Text: breaks, Head Token POS: VERB
Token Text: tokens, Dependency Label: pobj, Head Token Text: into, Head Token POS: ADP
Token Text: ., Dependency Label: punct, Head Token Text: breaks, Head Token POS: VERB



### Text Classification with spaCy

Text classification assigns text to predefined classes or categories. The example below targets spaCy v3, where components are added by their registered name; passing a component object to `nlp.add_pipe`, as was done in spaCy v2, raises a `ValueError`. It trains on three placeholder texts, so the predicted scores are not meaningful. Substitute real labeled data before drawing any conclusions.

```python
import spacy
from spacy.training import Example

# Placeholder training data: replace with real texts and labels
train_texts = ["Text 1", "Text 2", "Text 3"]
train_labels = ["Label 1", "Label 2", "Label 3"]

# Use a blank English pipeline so that initializing the classifier
# does not reset the weights of a pretrained pipeline
textcat_nlp = spacy.blank("en")
textcat = textcat_nlp.add_pipe("textcat", last=True)
for label in train_labels:
    textcat.add_label(label)

# One Example per text: 1.0 for the true label, 0.0 for the others
examples = []
for text, true_label in zip(train_texts, train_labels):
    cats = {label: 1.0 if label == true_label else 0.0 for label in train_labels}
    examples.append(Example.from_dict(textcat_nlp.make_doc(text), {"cats": cats}))

# Initialize the model weights, then run a few training passes
optimizer = textcat_nlp.initialize(lambda: examples)
for epoch in range(10):
    losses = {}
    textcat_nlp.update(examples, sgd=optimizer, losses=losses)

# Classify new text: doc.cats maps each label to a score
new_text = "New text to classify"
doc = textcat_nlp(new_text)
print(doc.cats)
```

### Practical Examples and Projects

#### Project 1: Sentiment Analysis
Perform sentiment analysis on a dataset using spaCy for text classification.
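A possible starting point for this project is to convert labeled texts into spaCy training examples. The sketch below uses two invented sentences and hypothetical labels `POSITIVE` and `NEGATIVE`; the helper names `to_example` and `top_label` are also made up for illustration. The resulting examples can then be used to train a `textcat` component.

```python
import spacy
from spacy.training import Example

# Toy labeled data, invented for illustration only
labeled_texts = [
    ("I loved this film", "POSITIVE"),
    ("The plot was dull and slow", "NEGATIVE"),
]
labels = ["POSITIVE", "NEGATIVE"]

nlp = spacy.blank("en")


def to_example(text, true_label):
    """Build a spaCy training Example with one score per label."""
    cats = {label: 1.0 if label == true_label else 0.0 for label in labels}
    return Example.from_dict(nlp.make_doc(text), {"cats": cats})


def top_label(cats):
    """Return the label with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


examples = [to_example(text, label) for text, label in labeled_texts]
print(examples[0].reference.cats)
print(top_label({"POSITIVE": 0.8, "NEGATIVE": 0.2}))
```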

#### Project 2: Information Extraction
Extract specific information, like dates or quantities, from a set of documents using spaCy.
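One way to approach quantities is a rule-based pattern instead of a trained model. The sketch below uses spaCy's `Matcher` to find a number-like token followed by a unit. The unit list and the sample sentence are invented for illustration, and a real project would need a fuller set of patterns. For dates, the `DATE` entities produced by a pretrained pipeline such as `en_core_web_sm` are another option.

```python
import spacy
from spacy.matcher import Matcher

# The Matcher only needs the tokenizer, so a blank pipeline is sufficient
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A number-like token followed by a unit; the unit list is a made-up sample
units = ["kg", "km", "liters"]
matcher.add("QUANTITY", [[{"LIKE_NUM": True}, {"LOWER": {"IN": units}}]])

doc = nlp("The parcel weighs 5 kg and travelled 120 km by road.")

# Each match is (match_id, start, end); sort by start position
matches = sorted(matcher(doc), key=lambda match: match[1])
quantities = [doc[start:end].text for _, start, end in matches]
print(quantities)
```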









### Conclusion

In this tutorial, we covered the basics of spaCy for NLP tasks. We explored text preprocessing (tokenization, lemmatization, and part-of-speech tagging), named entity recognition, dependency parsing, and text classification, and outlined two practice projects. To advance your understanding, continue exploring spaCy's documentation, practice on different datasets, and work on real-world NLP projects. With consistent practice, you'll become proficient in NLP using Python and spaCy.

Remember, NLP is a vast field, and this tutorial only scratches the surface. Continual learning and hands-on experience will enhance your skills and understanding.

I hope this tutorial serves as a solid foundation for your journey into NLP with spaCy and Python.