## SpaCy
SpaCy is an open-source library for advanced Natural Language Processing (NLP) in Python. It is designed for use in production environments and offers fast, efficient implementations of common NLP tasks for industrial applications.

**Features:**

*   Support for 75+ languages
*   84 trained pipelines for 25 languages
*   Multi-task learning with pretrained transformers like BERT
*   Pretrained word vectors
*   State-of-the-art speed
*   Production-ready training system
*   Linguistically-motivated tokenization
*   Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
*   Easily extensible with custom components and attributes
*   Support for custom models in PyTorch, TensorFlow and other frameworks
*   Built-in visualizers for syntax and NER
*   Easy model packaging, deployment and workflow management
*   Robust, rigorously evaluated accuracy

For full details, see https://spacy.io/




# Task
Using the SpaCy library, you will clean, tokenize, and preprocess text efficiently.

<a target="_blank" href="https://colab.research.google.com/github/toelt-llc/HSLU-NLP-Bootcamp/blob/main/Day_1/SpaCy_NER/Bootcamp_NLP_Data_preprocessing_SpaCy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Setup

## Step 1: Install SpaCy and Download a Language Model

First, make sure SpaCy is installed along with a language model; both can be installed with pip. Several language models are available. We will use the small English model, `en_core_web_sm` (see https://spacy.io/models/en).



In [1]:
!pip install -q spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# Install the beautifulsoup4 library for HTML tag removal
!pip install -q beautifulsoup4
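
Before moving on, it can help to confirm that the packages are importable in the current environment. The following is a minimal sketch that uses only the standard library; the helper name `check_installed` is hypothetical. Note that beautifulsoup4 is imported under the name `bs4`, and that the downloaded model is installed as a regular Python package.

```python
from importlib.util import find_spec

def check_installed(module_names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: find_spec(name) is not None for name in module_names}

# beautifulsoup4 is imported as "bs4"; the language model is a package of its own.
status = check_installed(["spacy", "bs4", "en_core_web_sm"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

If anything is reported as missing, rerun the corresponding install cell above.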

### Step 2: Import Required Libraries and Load SpaCy Model

Start by importing the necessary libraries and loading the SpaCy language model.



In [3]:
# Import the regular expressions (regex) library.
# Regex allows us to search for and manipulate text based on patterns.
import re

# Regular Expressions (Regex) in Python

## Overview
Regular expressions (regex) are a compact language for describing text patterns. Python's built-in `re` module provides tools for searching, matching, and modifying strings based on such patterns.

## Documentation & Resources
- [Official Python Regex Documentation](https://docs.python.org/3/library/re.html)
- [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)
- [Regex Testing Tool](https://regex101.com/) - Interactive regex pattern tester

## Key Features
The `re` module in Python provides:
- Pattern matching in strings
- Text validation
- Complex string manipulations
- Search and replace functionality



# Common Pattern Symbols in Regular Expressions

| Symbol | Description |
|--------|-------------|
| `.` | Any character except newline |
| `^` | Start of string (or of each line with `re.MULTILINE`) |
| `$` | End of string (or of each line with `re.MULTILINE`) |
| `*` | Zero or more occurrences |
| `+` | One or more occurrences |
| `?` | Zero or one occurrence |
| `{n}` | Exactly n occurrences |
| `{n,}` | n or more occurrences |
| `{n,m}` | Between n and m occurrences |
| `[]` | Set of characters |
| `[^]` | Complement of set |
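| `\d` | Digit (`[0-9]`; also other Unicode digits unless `re.ASCII` is set) |
| `\D` | Non-digit |
| `\w` | Word character (`[a-zA-Z0-9_]`; also other Unicode letters and digits unless `re.ASCII` is set) |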
| `\W` | Non-word character |
| `\s` | Whitespace |
| `\S` | Non-whitespace |
| `\b` | Word boundary |
| `\B` | Non-word boundary |
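
To make the table more concrete, here is a short sketch that applies several of the symbols to one string. The sample sentence and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped on 2024-05-01 to room B12."

digit_runs = re.findall(r'\d+', text)               # + : one or more digits
capitalized = re.findall(r'\b[A-Z]\w*', text)       # \b, [] and * : words starting with a capital
iso_dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)  # {n} : exact repetition counts
starts_with_order = bool(re.search(r'^Order', text))  # ^ : anchored at the start
ends_with_dot = bool(re.search(r'\.$', text))         # $ : anchored at the end, \. is a literal dot
punctuation = re.findall(r'[^\w\s]', text)          # [^] : neither word character nor whitespace

print(digit_runs)         # ['66', '2024', '05', '01', '12']
print(capitalized)        # ['Order', 'B12']
print(iso_dates)          # ['2024-05-01']
print(starts_with_order)  # True
print(ends_with_dot)      # True
print(punctuation)        # ['-', '-', '.']
```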

#### Main Regex Methods:


In [4]:
# 1. re.match(pattern, string)
# Checks if pattern matches at the BEGINNING of the string
text = "Hello World"
result = re.match(r'Hello', text)  # Matches
print(result)
result = re.match(r'World', text)  # Doesn't match (not at start)
print(result)


<re.Match object; span=(0, 5), match='Hello'>
None


In [5]:
# 2. re.search(pattern, string)
# Searches for pattern ANYWHERE in the string
text = "Hello World"
result = re.search(r'World', text)  # Matches
print(result)

<re.Match object; span=(6, 11), match='World'>


In [6]:
# 3. re.findall(pattern, string)
# Returns ALL non-overlapping matches as a list
text = "The rain in Spain"
result = re.findall(r'ain', text)  # Returns ['ain', 'ain']
print(result)

['ain', 'ain']


In [7]:
# 4. re.finditer(pattern, string)
# Returns an iterator of match objects
text = "The rain in Spain"
for match in re.finditer(r'ain', text):
    print(f"Found at position: {match.start()}-{match.end()}")

Found at position: 5-8
Found at position: 14-17


In [8]:
# 5. re.sub(pattern, repl, string)
# Substitutes matches with replacement text
text = "I love cats"
result = re.sub(r'cats', 'dogs', text)  # Returns "I love dogs"
print(result)


I love dogs


In [9]:
# 6. re.split(pattern, string)
# Splits string by pattern
text = "apple,banana;orange:grape"
result = re.split(r'[,;:]', text)  # Returns ['apple', 'banana', 'orange', 'grape']
print(result)

['apple', 'banana', 'orange', 'grape']
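
Text validation is listed among the key features above, and for that use case `re.fullmatch` is often a better fit than `re.match`, because it requires the whole string to match. The sketch below also shows `re.compile`, which lets a pattern be reused. The five-digit code format and the name `is_valid_zip` are assumptions chosen only for illustration.

```python
import re

# Compile once and reuse. Illustrative rule: exactly five digits.
ZIP_RE = re.compile(r'\d{5}')

def is_valid_zip(value):
    """Return True only if the WHOLE string consists of five digits."""
    return ZIP_RE.fullmatch(value) is not None

print(is_valid_zip("12345"))      # True
print(is_valid_zip("12345-678"))  # False: extra characters after the digits

# re.match only anchors at the start, so it accepts the same string.
print(re.match(r'\d{5}', "12345-678") is not None)  # True
```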


In [10]:
# Example Patterns:

# 1. Email pattern
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # no '|' inside [...]: it would match a literal pipe

# 2. Phone number pattern (US format)
phone_pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'

# 3. URL pattern
url_pattern = r'https?://(?:[\w-]+\.)+[\w-]+(?:/[\w./?%&=-]*)?'  # hyphen last in the class; '\w-.' is an invalid range

# 4. Date pattern (MM/DD/YYYY)
date_pattern = r'\b\d{2}/\d{2}/\d{4}\b'

# Flags (can be combined with |):
# re.IGNORECASE or re.I - Case-insensitive matching
# re.MULTILINE or re.M - Multi-line matching
# re.DOTALL or re.S - Dot matches any char, including newline
# re.VERBOSE or re.X - Allow comments and whitespace
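
The patterns in this cell are only defined, not applied. The sketch below runs them against a made-up sentence with `re.findall`. The patterns are restated so the block runs on its own; in the email pattern the top-level domain class is written as `[A-Za-z]`, and in the URL pattern the hyphen is placed last in the character class, since Python rejects a range such as `\w-.`.

```python
import re

email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
phone_pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
url_pattern = r'https?://(?:[\w-]+\.)+[\w-]+(?:/[\w./?%&=-]*)?'
date_pattern = r'\b\d{2}/\d{2}/\d{4}\b'

# Hypothetical sample text containing one instance of each kind of item.
text = "Mail jane@example.org or call 555-123-4567 before 01/31/2025. Docs: https://spacy.io/usage"

emails = re.findall(email_pattern, text)
phones = re.findall(phone_pattern, text)
urls = re.findall(url_pattern, text)
dates = re.findall(date_pattern, text)

print(emails)  # ['jane@example.org']
print(phones)  # ['555-123-4567']
print(urls)    # ['https://spacy.io/usage']
print(dates)   # ['01/31/2025']
```

These patterns are deliberately simple; real-world emails, phone numbers, and URLs come in many more shapes than they cover.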


In [11]:
# Example with flags:
text = "Python\nJAVA\nC++"
result = re.findall(r'python', text, re.IGNORECASE)  # Matches 'Python'
print(result)


['Python']
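
The other flags listed earlier can be illustrated on the same string. This is a small sketch with illustrative variable names; it shows `re.MULTILINE`, `re.DOTALL`, and several flags combined with `|`.

```python
import re

text = "Python\nJAVA\nC++"

# re.MULTILINE: ^ matches at the start of every line, not only the whole string.
line_starts = re.findall(r'^\w', text, re.MULTILINE)

# re.DOTALL: '.' also matches the newline between the two words.
without_dotall = re.search(r'Python.JAVA', text)
with_dotall = re.search(r'Python.JAVA', text, re.DOTALL)

# Combined flags: VERBOSE allows whitespace and comments inside the pattern.
java_line = re.compile(r"""
    ^java$    # a line that consists only of "java", in any casing
""", re.VERBOSE | re.IGNORECASE | re.MULTILINE)
java_matches = java_line.findall(text)

print(line_starts)              # ['P', 'J', 'C']
print(without_dotall)           # None
print(with_dotall is not None)  # True
print(java_matches)             # ['JAVA']
```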


In [12]:
# Grouping with ()
text = "John Doe, Jane Doe"
pattern = r'(\w+)\s+(\w+)'
matches = re.findall(pattern, text)  # Returns [('John', 'Doe'), ('Jane', 'Doe')]
print(matches)

[('John', 'Doe'), ('Jane', 'Doe')]


In [13]:
# Named Groups (?P<name>...)
pattern = r'(?P<first>\w+)\s+(?P<last>\w+)'
match = re.search(pattern, "John Doe")
if match:
    print(match.group('first'))  # Returns 'John'
    print(match.group('last'))   # Returns 'Doe'

John
Doe
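
Named groups are also handy beyond `match.group(...)`. As a brief sketch, `groupdict()` returns all named groups at once, and `\g<name>` refers to a group inside a replacement string. The second sample name is made up for illustration.

```python
import re

pattern = re.compile(r'(?P<first>\w+)\s+(?P<last>\w+)')

# All named groups as a dictionary.
match = pattern.search("John Doe")
fields = match.groupdict() if match else {}
print(fields)   # {'first': 'John', 'last': 'Doe'}

# Reorder each name as "Last, First" using named backreferences.
swapped = pattern.sub(r'\g<last>, \g<first>', "John Doe; Jane Roe")
print(swapped)  # Doe, John; Roe, Jane
```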


### Step 3: Sample Text Data

Load the SpaCy model, then define some sample text data to work with.

In [14]:
import spacy
from bs4 import BeautifulSoup

# Load SpaCy model
nlp = spacy.load('en_core_web_sm')

In [15]:
sample_text = """
    SpaCy is an open-source software library for advanced natural language processing,
    written in the programming languages Python and Cython. SpaCy is designed specifically for production use
    and helps you build applications that process and understand large volumes of text. It can be used to build
    information extraction or natural language understanding systems, or to pre-process text for deep learning.
"""

### Step 4: Data Cleaning Function

Create a function to clean the text data. This function will remove unwanted characters, convert text to lowercase, and strip leading/trailing whitespace.

In [16]:

def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    # Remove leading and trailing whitespace
    text = text.strip()
    return text

# Clean the sample text
cleaned_text = clean_text(sample_text)
print(cleaned_text)


spacy is an opensource software library for advanced natural language processing
    written in the programming languages python and cython spacy is designed specifically for production use
    and helps you build applications that process and understand large volumes of text it can be used to build
    information extraction or natural language understanding systems or to preprocess text for deep learning
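
The output above still contains the line breaks and indentation of the original triple-quoted string, because `strip()` only trims the two ends. One possible extension, sketched here as an assumption rather than as part of the original function, collapses every run of whitespace into a single space. The name `clean_text_compact` and the shortened sample are hypothetical.

```python
import re

def clean_text_compact(text):
    """Like the basic cleaner, but also collapses internal whitespace."""
    text = re.sub(r'[^A-Za-z\s]', '', text)  # keep only letters and whitespace
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()  # newlines, tabs, repeated spaces -> one space

sample = """
    SpaCy is an open-source library,
    written in Python and Cython.
"""
compact = clean_text_compact(sample)
print(compact)  # spacy is an opensource library written in python and cython
```

Without this step, the whitespace runs are passed on to the tokenizer as they are.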


### Step 5: Tokenization and Preprocessing

Define a function for tokenization and preprocessing. This function will tokenize the text, remove stop words, and perform lemmatization.

In [17]:
def preprocess_text(text):
    # Process the text using SpaCy
    doc = nlp(text)

    # Tokenize, remove stop words and punctuation, and lemmatize
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

    return tokens

# Preprocess the cleaned text
tokens = preprocess_text(cleaned_text)
print(tokens)


['spacy', 'opensource', 'software', 'library', 'advanced', 'natural', 'language', 'processing', '\n    ', 'write', 'programming', 'language', 'python', 'cython', 'spacy', 'design', 'specifically', 'production', 'use', '\n    ', 'help', 'build', 'application', 'process', 'understand', 'large', 'volume', 'text', 'build', '\n    ', 'information', 'extraction', 'natural', 'language', 'understanding', 'system', 'preprocess', 'text', 'deep', 'learning']


### Step 6: Putting It All Together

Combine the cleaning and preprocessing steps in a single workflow.

In [18]:
def process_text(text):
    # Step 1: Clean the text
    cleaned_text = clean_text(text)

    # Step 2: Tokenize and preprocess the text
    cltokens = preprocess_text(cleaned_text)

    return cltokens

# Process the sample text
tokens = process_text(sample_text)
print(tokens)


['spacy', 'opensource', 'software', 'library', 'advanced', 'natural', 'language', 'processing', '\n    ', 'write', 'programming', 'language', 'python', 'cython', 'spacy', 'design', 'specifically', 'production', 'use', '\n    ', 'help', 'build', 'application', 'process', 'understand', 'large', 'volume', 'text', 'build', '\n    ', 'information', 'extraction', 'natural', 'language', 'understanding', 'system', 'preprocess', 'text', 'deep', 'learning']


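## Complex Data Cleaning Function

Create a function that handles more complex cleaning steps: removing HTML tags, URLs, and email addresses. It reuses the name `clean_text`, so `process_text` from Step 6 will call this new version from now on.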

In [19]:
resume_text = """
    <p>John Doe</p>
    <p>Experienced software engineer with a strong background in Python and machine learning. Worked at Google and Microsoft.</p>
    <p>Education: B.Sc. in Computer Science from MIT.</p>
    Contact: john.doe@example.com.
"""

In [20]:
# Function to clean the text (replaces the simpler clean_text defined in Step 4)
def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()

    # Remove URLs (starting with http://, https:// or www.)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove special characters and digits
    text = re.sub(r'[^A-Za-z\s]', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Collapse runs of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text
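
The order of the steps matters: email addresses and URLs are recognized by characters such as `@`, `:` and `/`, so they have to be removed before the special-character step strips those characters. The sketch below applies two of the regex steps in both orders. The helper names and the sample string are made up for illustration, and the HTML step is left out so that only the standard library is needed.

```python
import re

def strip_emails(text):
    return re.sub(r'\S+@\S+', '', text)

def strip_non_letters(text):
    return re.sub(r'[^A-Za-z\s]', '', text)

def squeeze(text):
    return re.sub(r'\s+', ' ', text).strip()

raw = "Contact: john.doe@example.com now"

# Emails first, then special characters: the address disappears completely.
right_order = squeeze(strip_non_letters(strip_emails(raw)))

# Special characters first: '@' and '.' are gone, so the email step finds nothing.
wrong_order = squeeze(strip_emails(strip_non_letters(raw)))

print(right_order)  # Contact now
print(wrong_order)  # Contact johndoeexamplecom now
```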



In [21]:
# Clean the resume text
cleaned_text = clean_text(resume_text)
print(cleaned_text)

john doe experienced software engineer with a strong background in python and machine learning worked at google and microsoft education bsc in computer science from mit contact


### Tokenisation

In [22]:
# Tokenise the resume text
processed_tokens = process_text(resume_text)
print(processed_tokens)

['john', 'doe', 'experience', 'software', 'engineer', 'strong', 'background', 'python', 'machine', 'learning', 'work', 'google', 'microsoft', 'education', 'bsc', 'computer', 'science', 'mit', 'contact']


### Named Entity Recognition (NER)
NER is a sub-task of information extraction in NLP that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, dates, quantities, monetary values, and percentages.


In [23]:
# Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(entity.text, entity.label_) for entity in doc.ents]
    return entities

In [24]:
# Named Entity Recognition (NER)
entities = named_entity_recognition(cleaned_text)
print("Named Entities:", entities)


Named Entities: [('john', 'PERSON'), ('google', 'ORG'), ('microsoft', 'ORG'), ('bsc', 'ORG')]
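
Since the function returns plain `(text, label)` pairs, they are easy to post-process. As a small sketch, the pairs from the output above can be grouped by label using only the standard library; the helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [('john', 'PERSON'), ('google', 'ORG'), ('microsoft', 'ORG'), ('bsc', 'ORG')]

def group_by_label(pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label)  # {'PERSON': ['john'], 'ORG': ['google', 'microsoft', 'bsc']}
```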


### Part-of-Speech Tagging (POS Tagging)
Part-of-speech tagging is a fundamental task in NLP that assigns a part of speech (such as noun, verb, or adjective) to each word in a given text. This process is crucial for understanding the grammatical structure and syntactic relationships within sentences.

In [25]:
# Part-of-Speech Tagging
def pos_tagging(text):
    doc = nlp(text)
    pos_tags = [(token.text, token.pos_) for token in doc]
    return pos_tags

In [26]:
# Part-of-Speech Tagging
pos_tags = pos_tagging(cleaned_text)
print("POS Tags:", pos_tags)

POS Tags: [('john', 'PROPN'), ('doe', 'PROPN'), ('experienced', 'VERB'), ('software', 'NOUN'), ('engineer', 'NOUN'), ('with', 'ADP'), ('a', 'DET'), ('strong', 'ADJ'), ('background', 'NOUN'), ('in', 'ADP'), ('python', 'NOUN'), ('and', 'CCONJ'), ('machine', 'NOUN'), ('learning', 'NOUN'), ('worked', 'VERB'), ('at', 'ADP'), ('google', 'PROPN'), ('and', 'CCONJ'), ('microsoft', 'PROPN'), ('education', 'PROPN'), ('bsc', 'PROPN'), ('in', 'ADP'), ('computer', 'NOUN'), ('science', 'NOUN'), ('from', 'ADP'), ('mit', 'NOUN'), ('contact', 'NOUN')]
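
One common use of the tags is to summarise or filter a text by word class. The sketch below works on the first nine pairs from the output above, using only the standard library; which tags count as "content words" is an assumption made for this example.

```python
from collections import Counter

# First nine (token, tag) pairs copied from the output above.
pos_tags = [('john', 'PROPN'), ('doe', 'PROPN'), ('experienced', 'VERB'),
            ('software', 'NOUN'), ('engineer', 'NOUN'), ('with', 'ADP'),
            ('a', 'DET'), ('strong', 'ADJ'), ('background', 'NOUN')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in pos_tags)
print(tag_counts.most_common(1))  # [('NOUN', 3)]

# Keep only words from open classes (assumed here: nouns, proper nouns, verbs, adjectives).
CONTENT_TAGS = {'NOUN', 'PROPN', 'VERB', 'ADJ'}
content_words = [word for word, tag in pos_tags if tag in CONTENT_TAGS]
print(content_words)
# ['john', 'doe', 'experienced', 'software', 'engineer', 'strong', 'background']
```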


### Dependency Parsing
Dependency parsing is a process in natural language processing (NLP) that involves analyzing the grammatical structure of a sentence by establishing relationships between "head" words and words that modify those heads (known as "dependents"). In this framework, each word in a sentence is connected to its dependents via directed links, forming a dependency tree.

In [27]:
# Dependency Parsing
def dependency_parsing(text):
    doc = nlp(text)
    dependencies = [(token.text, token.dep_, token.head.text) for token in doc]
    return dependencies

In [28]:
# Dependency Parsing
dependencies = dependency_parsing(cleaned_text)
print("Dependencies:", dependencies)

Dependencies: [('john', 'compound', 'doe'), ('doe', 'nsubj', 'experienced'), ('experienced', 'ROOT', 'experienced'), ('software', 'compound', 'engineer'), ('engineer', 'dobj', 'experienced'), ('with', 'prep', 'experienced'), ('a', 'det', 'background'), ('strong', 'amod', 'background'), ('background', 'pobj', 'with'), ('in', 'prep', 'background'), ('python', 'pobj', 'in'), ('and', 'cc', 'background'), ('machine', 'compound', 'learning'), ('learning', 'nsubj', 'worked'), ('worked', 'conj', 'experienced'), ('at', 'prep', 'worked'), ('google', 'pobj', 'at'), ('and', 'cc', 'google'), ('microsoft', 'compound', 'education'), ('education', 'compound', 'bsc'), ('bsc', 'conj', 'google'), ('in', 'prep', 'worked'), ('computer', 'compound', 'science'), ('science', 'pobj', 'in'), ('from', 'prep', 'worked'), ('mit', 'compound', 'contact'), ('contact', 'pobj', 'from')]
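
To see how these triples form a tree, the sketch below rebuilds the head-to-dependent links for the first five triples of the output above and prints them with indentation. The helper names are hypothetical. The tree is keyed by token text, which only works because the words in this excerpt are unique; with a real `Doc` it would be safer to key on token positions (`token.i`).

```python
from collections import defaultdict

# First five (token, relation, head) triples copied from the output above.
dependencies = [('john', 'compound', 'doe'), ('doe', 'nsubj', 'experienced'),
                ('experienced', 'ROOT', 'experienced'),
                ('software', 'compound', 'engineer'), ('engineer', 'dobj', 'experienced')]

def build_tree(triples):
    """Return (root, {head: [(relation, dependent), ...]})."""
    root = None
    children = defaultdict(list)
    for token, relation, head in triples:
        if relation == 'ROOT':
            root = token  # the root is its own head
        else:
            children[head].append((relation, token))
    return root, dict(children)

def print_tree(node, children, depth=0, relation='ROOT'):
    """Print the tree depth-first, indenting dependents under their head."""
    print('  ' * depth + f'{node} ({relation})')
    for rel, child in children.get(node, []):
        print_tree(child, children, depth + 1, rel)

root, children = build_tree(dependencies)
print_tree(root, children)
```

Every token except the root appears exactly once as a dependent, which is what makes the structure a tree.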


#### Visualizing the dependency parse

The dependency visualizer, **dep**, shows part-of-speech tags and syntactic dependencies.


In [29]:
from spacy import displacy

doc = nlp(cleaned_text)
displacy.render(doc, style="dep")


### Visualizing the entity recognizer

The entity visualizer, **ent**, highlights named entities and their labels in a text.

In [30]:
doc = nlp(cleaned_text)
displacy.render(doc, style="ent")

### Task
Compare the entities found in the cleaned text with those found in the original, uncleaned text. Then try to visualise the entities of the uncleaned text.