<a href="https://colab.research.google.com/github/yashfirkedata/NLP-Practice/blob/main/Text%20Preprocessing/NLP_processing_using_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Processing using NLP**

First, you need to load the language model instance in spaCy:



**en_core_web_sm** is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax, and entities.

The suffixes sm, md, and lg refer to the size of the model (small, medium, and large, respectively).

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7f27daf216c0>

In [2]:
introduction_doc = nlp(
    "Hello, I am Yash Firke revising Natural Language Processing!"
)

type(introduction_doc)

spacy.tokens.doc.Doc

In [3]:
for token in introduction_doc:
  print(token.text)

Hello
,
I
am
Yash
Firke
revising
Natural
Language
Processing
!


If you want to read the text from a file instead, you can load its contents with pathlib and pass them to nlp:

In [5]:
import pathlib
file_name = "intro.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding = "utf-8"))

In [6]:
for token in introduction_doc:
  print(token.text)

Hello
,
I
am
Yash
Firke
revising
Natural
Language
Processing
!


## **Sentence Detection**
Sentence detection is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units.

In [15]:
sentences_text = (
    "Hello I am Yash Firke."
    "I am learning in the field of Ai/ML."
    "This is me practicing NLP."
    "To hire me, contact yashfirke.edu@gmail.com."
)

sentences_doc = nlp(sentences_text)

In [16]:
sentences = list(sentences_doc.sents)
len(sentences)

2

In [17]:
for sentence in sentences:
    print(f"{sentence}")
    print("\n")

Hello I am Yash Firke.


I am learning in the field of Ai/ML.This is me practicing NLP.To hire me, contact yashfirke.edu@gmail.com.




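Only two sentences are detected here because the string literals in sentences_text are joined without any space between them (for example, "ML.This"), so most sentence boundaries are hidden from the pipeline. Ending each literal with a space should let the sentences be separated.

You can also slice the Span objects to produce sections of a sentence.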

You can also customize sentence detection behavior by using custom delimiters. Here’s an example where an ellipsis (...) is used as a delimiter, in addition to the full stop, or period (.):

In [20]:
ellipses_text = "The sun was setting... casting long shadows across the field... birds chirped lazily as they prepared for nightfall... the breeze carried a hint of summer... memories of childhood floated by... laughter, games, and carefree days... the world seemed to slow down... each moment stretched... every second a lifetime... nostalgia wrapped around like a warm blanket... the past and present mingling... a dance of time..."

from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
  """
  Adding support to use '...' as delimiter for sentence detection
  """
  # Mark the token that follows each ellipsis as the start of a new sentence.
  for token in doc[:-1]:
    if token.text == "...":
      doc[token.i + 1].is_sent_start = True

  return doc

In [21]:
custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
custom_ellipses_doc = custom_nlp(ellipses_text)
custom_ellipses_sentences = list(custom_ellipses_doc.sents)
len(custom_ellipses_sentences)

12

In [24]:
for sentence in custom_ellipses_sentences:
  print(sentence)
  print("\n")

The sun was setting...


casting long shadows across the field...


birds chirped lazily as they prepared for nightfall...


the breeze carried a hint of summer...


memories of childhood floated by...


laughter, games, and carefree days...


the world seemed to slow down...


each moment stretched...


every second a lifetime...


nostalgia wrapped around like a warm blanket...


the past and present mingling...


a dance of time...




## **Tokens in spaCy**

The *.idx* attribute holds the character offset at which each token starts in the original text:

In [27]:
for token in sentences_doc[:10]:
  print(token, token.idx)

Hello 0
I 6
am 8
Yash 11
Firke 16
. 21
I 22
am 24
learning 27
in 36
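As the output suggests, .idx is the position of a token's first character in the original string. The plain-Python sketch below does not call spaCy; it copies the (token, offset) pairs printed above and checks that slicing the source text at each offset gives the token back. The names `printed_pairs` and `recovered` are only for illustration.

```python
# Start of the text passed to nlp() above (the literals are joined without spaces).
text = "Hello I am Yash Firke.I am learning in the field of Ai/ML."

# (token text, token.idx) pairs copied from the printed output above.
printed_pairs = [
    ("Hello", 0), ("I", 6), ("am", 8), ("Yash", 11), ("Firke", 16),
    (".", 21), ("I", 22), ("am", 24), ("learning", 27), ("in", 36),
]

# Slicing the text at each offset should give back the token text.
recovered = [text[idx:idx + len(tok)] for tok, idx in printed_pairs]
for (tok, idx), piece in zip(printed_pairs, recovered):
    print(f"{idx:>3}  {piece!r}")
```

This is also why the second "I" starts at offset 22, directly after the full stop at 21: there is no space between the two sentences in the source string.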


In [31]:
print(
    f"{'Text with Whitespace':22}"
    f"{'Is Alphabetic?':15}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)
for token in sentences_doc[:20]:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):15}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )


Text with Whitespace  Is Alphabetic? Is Punctuation?   Is Stop Word?
Hello                 True           False             False
I                     True           False             True
am                    True           False             True
Yash                  True           False             False
Firke                 True           False             False
.                     False          True              False
I                     True           False             True
am                    True           False             True
learning              True           False             False
in                    True           False             True
the                   True           False             True
field                 True           False             False
of                    True           False             True
Ai                    True           False             False
/                     False          True              False
ML.This               

You can also customize the tokenizer. For example, the default tokenizer reads yash@gmail.com as a single token; with a custom tokenizer you could treat @ as an infix so that the address is split into three separate tokens.
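To make the idea of an infix concrete, here is a plain-Python sketch that uses the standard `re` module rather than spaCy's tokenizer. The helper `split_on_infix` is hypothetical and only illustrates the effect of treating @ as a split point inside a token; spaCy's own tokenizer applies many more rules than this.

```python
import re

def split_on_infix(token_text, infix="@"):
    """Split a token around an infix character, keeping the infix as its own piece."""
    # The capturing group makes re.split keep the separator in the result.
    pieces = re.split(f"({re.escape(infix)})", token_text)
    # Drop empty strings that appear when the infix is at the start or end.
    return [piece for piece in pieces if piece]

print(split_on_infix("yash@gmail.com"))
```

The address comes back as three pieces: the part before the @, the @ itself, and the part after it.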

## **Stop Words**

Stop words are the most commonly occurring words in a language. In English, examples of stop words are *the*, *are*, *but*, and *they*.

With NLP, stop words are generally removed because they aren’t significant, and they heavily distort any word frequency analysis.

In [32]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [33]:
for stopword in list(spacy_stopwords)[:10]:
  print(stopword)


otherwise
everywhere
give
besides
she
does
once
their
anyone
move


In [34]:
sentences_text = (
    "Hello I am Yash Firke."
    "I am learning in the field of Ai/ML."
    "This is me practicing NLP."
    "To hire me, contact yashfirke.edu@gmail.com."
)

sentences_doc = nlp(sentences_text)

You can remove stop words from the input text by checking the .is_stop attribute of each token:

In [35]:
for token in sentences_doc:
  if token.is_stop:
    continue

  print(token.text)

Hello
Yash
Firke
.
learning
field
Ai
/
ML.This
practicing
NLP.To
hire
,
contact
yashfirke.edu@gmail.com
.


## **Lemmatization**
Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a lemma.

For example, *organizes*, *organized* and *organizing* are all forms of *organize*.

It can also help you normalize the text.

In [36]:
sentences_text

'Hello I am Yash Firke.I am learning in the field of Ai/ML.This is me practicing NLP.To hire me, contact yashfirke.edu@gmail.com.'

In [42]:
doc = nlp(sentences_text)
for token in doc:
  if str(token) != str(token.lemma_):
    print(f"{str(token)} : {str(token.lemma_)}")

Hello : hello
am : be
am : be
learning : learn
ML.This : ml.this
is : be
me : I
practicing : practice
me : I


You can see practicing being reduced to practice.

Lemmatization helps you avoid duplicate words that may overlap conceptually.

## **Word Frequency**

Once a text has been converted into tokens, you can perform statistical analysis on it. This analysis can give you various insights, such as the most common words or the unique words in the text.

In [44]:
long_text = ("Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results."
            "Data science is important because it combines tools, methods, and technology to generate meaning from data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities. "
)

In [50]:
doc = nlp(long_text)
words = []

for token in doc:
  if token.is_stop or token.is_punct:
    continue
  words.append(token.text)


In [51]:
words

['Data',
 'science',
 'study',
 'data',
 'extract',
 'meaningful',
 'insights',
 'business',
 'multidisciplinary',
 'approach',
 'combines',
 'principles',
 'practices',
 'fields',
 'mathematics',
 'statistics',
 'artificial',
 'intelligence',
 'computer',
 'engineering',
 'analyze',
 'large',
 'amounts',
 'data',
 'analysis',
 'helps',
 'data',
 'scientists',
 'ask',
 'answer',
 'questions',
 'like',
 'happened',
 'happened',
 'happen',
 'results',
 'Data',
 'science',
 'important',
 'combines',
 'tools',
 'methods',
 'technology',
 'generate',
 'meaning',
 'data',
 'Modern',
 'organizations',
 'inundated',
 'data',
 'proliferation',
 'devices',
 'automatically',
 'collect',
 'store',
 'information',
 'Online',
 'systems',
 'payment',
 'portals',
 'capture',
 'data',
 'fields',
 'e',
 'commerce',
 'medicine',
 'finance',
 'aspect',
 'human',
 'life',
 'text',
 'audio',
 'video',
 'image',
 'data',
 'available',
 'vast',
 'quantities']

In [53]:
from collections import Counter
print(Counter(words).most_common(5))

[('data', 7), ('Data', 2), ('science', 2), ('combines', 2), ('fields', 2)]
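Note that 'data' and 'Data' are counted separately above because the counts are case-sensitive. One possible way to merge them, sketched below, is to lowercase each word before counting. The short `sample_words` list is a hand-picked subset of the words list above, used only to keep the example self-contained.

```python
from collections import Counter

# Hand-picked subset of the words list above (illustrative only).
sample_words = ["Data", "science", "study", "data", "data", "Data", "science", "fields"]

# Lowercase before counting so that "Data" and "data" share one entry.
normalized_counts = Counter(word.lower() for word in sample_words)
print(normalized_counts.most_common(2))

# Words that occur exactly once are the unique words of the text.
unique_words = [word for word, count in normalized_counts.items() if count == 1]
print(unique_words)
```

Counting lemmas instead of lowercased token text would go one step further and also merge inflected forms.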


# **Part of Speech Tagging**

**Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:**

```
Noun
Pronoun
Adjective
Verb
Adverb
Preposition
Conjunction
Interjection
```



In [54]:
for token in doc:
  print(f"Token: {token}")
  print(f"Token Tag: {token.tag_}")
  print(f"Token POS: {token.pos_}")
  print(f"Explanation: {spacy.explain(token.tag_)}")
  print("-"*35)

Token: Data
Token Tag: NNS
Token POS: NOUN
Explanation: noun, plural
-----------------------------------
Token: science
Token Tag: NN
Token POS: NOUN
Explanation: noun, singular or mass
-----------------------------------
Token: is
Token Tag: VBZ
Token POS: AUX
Explanation: verb, 3rd person singular present
-----------------------------------
Token: the
Token Tag: DT
Token POS: DET
Explanation: determiner
-----------------------------------
Token: study
Token Tag: NN
Token POS: NOUN
Explanation: noun, singular or mass
-----------------------------------
Token: of
Token Tag: IN
Token POS: ADP
Explanation: conjunction, subordinating or preposition
-----------------------------------
Token: data
Token Tag: NNS
Token POS: NOUN
Explanation: noun, plural
-----------------------------------
Token: to
Token Tag: TO
Token POS: PART
Explanation: infinitival "to"
-----------------------------------
Token: extract
Token Tag: VB
Token POS: VERB
Explanation: verb, base form
----------------

> .tag_ displays a fine-grained tag.

> .pos_ displays a coarse-grained tag, which is a reduced version of the fine-grained tags.

> spacy.explain() gives a descriptive explanation of a given tag.

In [55]:
# Extract words that belong to a particular category:
nouns = []
adjectives = []
for token in doc:
  if token.pos_ == "NOUN":
    nouns.append(token)
  elif token.pos_ == "ADJ":
    adjectives.append(token)


In [62]:
print(f"Nouns: {nouns[:5]}")
print(f"Adjectives: {adjectives[:5]}")

Nouns: [Data, science, study, data, insights]
Adjectives: [meaningful, multidisciplinary, artificial, large, important]


## **Visualization: Using displaCy**
You can use displaCy to find POS tags for tokens:

In [64]:
from spacy import displacy
doc

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.Data science is important because it combines tools, methods, and technology to generate meaning from data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities. 

In [66]:
displacy.serve(doc, style = "dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In the rendered visualization, each token is assigned a POS tag written just below the token.

In [68]:
# to display inside jupyter notebook
displacy.render(doc, style="dep", jupyter=True)

# **Preprocessing Functions**

The preprocessor below applies the following operations:


```
Lowercases the text
Lemmatizes each token
Removes punctuation symbols
Removes stop words
```



In [69]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = ("Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results."
            "Data science is important because it combines tools, methods, and technology to generate meaning from data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities. "
)

In [73]:
text_doc = nlp(text)

def is_token_allowed(token):
    # Check if the token is valid (not empty or just spaces)
    if not token or not str(token).strip():
        return False
    # Check if the token is a stop word or punctuation
    if token.is_stop or token.is_punct:
        return False
    # If none of the above, the token is allowed
    return True

def preprocess_token(token):
  return token.lemma_.strip().lower()

filtered_tokens = []
for token in text_doc:
    if is_token_allowed(token):
        filtered_tokens.append(preprocess_token(token))



In [75]:
print(filtered_tokens)

['datum', 'science', 'study', 'datum', 'extract', 'meaningful', 'insight', 'business', 'multidisciplinary', 'approach', 'combine', 'principle', 'practice', 'field', 'mathematic', 'statistic', 'artificial', 'intelligence', 'computer', 'engineering', 'analyze', 'large', 'amount', 'datum', 'analysis', 'help', 'data', 'scientist', 'ask', 'answer', 'question', 'like', 'happen', 'happen', 'happen', 'result', 'datum', 'science', 'important', 'combine', 'tool', 'method', 'technology', 'generate', 'meaning', 'datum', 'modern', 'organization', 'inundate', 'datum', 'proliferation', 'device', 'automatically', 'collect', 'store', 'information', 'online', 'system', 'payment', 'portal', 'capture', 'datum', 'field', 'e', 'commerce', 'medicine', 'finance', 'aspect', 'human', 'life', 'text', 'audio', 'video', 'image', 'datum', 'available', 'vast', 'quantity']


## **Rule-Based Matching Using spaCy**
*Rule-based matching* is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).

While you can use regular expressions to extract entities (such as phone numbers), rule-based matching in spaCy is more powerful than regex alone, because you can include semantic or grammatical filters.
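For comparison, here is what the plain regular-expression approach mentioned above might look like for phone numbers. This is a sketch under assumptions: the pattern only covers one hypothetical format, such as (123) 456-7890 or 123-456-7890, and the sample sentence is made up.

```python
import re

# Optional parentheses around a 3-digit area code, an optional space or hyphen,
# then 3 digits, a hyphen, and 4 digits.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")

sample = "Call (123) 456-7890 during the day or 987-654-3210 after six."
phone_numbers = PHONE_PATTERN.findall(sample)
print(phone_numbers)
```

A regex like this only sees characters, whereas a token-based pattern can also constrain attributes such as the part of speech of each token.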

In [79]:
mytext = (
    "Hello I am Yash Firke."
    "I am learning in the field of Ai/ML."
    "This is me practicing NLP."
    "To hire me, contact yashfirke.edu@gmail.com."
)

mydoc = nlp(mytext)

In [83]:
from spacy.matcher import Matcher
def extract_full_name(nlp_doc):
    # Define the pattern to match two proper nouns (first and last name)
    pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]

    # Initialize the matcher with the pattern
    matcher = Matcher(nlp_doc.vocab)
    matcher.add("FULL_NAME", [pattern])

    # Find matches in the document
    matches = matcher(nlp_doc)

    # Loop through the matches and yield the matched text
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        yield span.text

full_names = list(extract_full_name(mydoc))
full_names

['Yash Firke']

You can extract phone numbers in the same way by changing the pattern and adding it to the matcher under a new name:


```
matcher.add("PHONE_NUMBER", [pattern])

```



# **Dependency Parsing Using spaCy**
Dependency parsing is the process of extracting the dependency graph of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the root of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation where:

- Words are the nodes.
- Grammatical relationships are the edges.

Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.

In [87]:
sample_text = "Yash is learning Natural Language Processing"
sample_doc = nlp(sample_text)

for token in sample_doc:
  print(f"Token: {token}")
  print(f"Token Tag: {token.tag_}" )
  print(f"Token Head Text: {token.head.text}" )
  print(f"Token Dependency: {token.dep_ }")
  print("-"*35)

Token: Yash
Token Tag: NNP
Token Head Text: learning
Token Dependency: nsubj
-----------------------------------
Token: is
Token Tag: VBZ
Token Head Text: learning
Token Dependency: aux
-----------------------------------
Token: learning
Token Tag: VBG
Token Head Text: learning
Token Dependency: ROOT
-----------------------------------
Token: Natural
Token Tag: NNP
Token Head Text: Language
Token Dependency: compound
-----------------------------------
Token: Language
Token Tag: NNP
Token Head Text: Processing
Token Dependency: compound
-----------------------------------
Token: Processing
Token Tag: NN
Token Head Text: learning
Token Dependency: dobj
-----------------------------------


In [88]:
displacy.render(sample_doc, style="dep")

## **Tree and Subtree Navigation**
The dependency graph has all the properties of a tree. This tree contains information about sentence structure and grammar and can be traversed in different ways to extract relationships.

spaCy provides attributes like .children, .lefts, .rights, and .subtree to make navigating the parse tree easier.
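To show what these attributes return, the sketch below rebuilds the parse of the sample sentence in plain Python from the head column printed earlier. It does not use spaCy: the `heads` list and the helper functions `children`, `lefts`, `rights`, and `subtree` are illustrative stand-ins that mimic the attributes of the same name.

```python
words = ["Yash", "is", "learning", "Natural", "Language", "Processing"]
# Index of each token's head, copied from the output above (the root points to itself).
heads = [2, 2, 2, 4, 5, 2]

def children(i):
    """Indices of the tokens whose head is token i (the root's self-loop is skipped)."""
    return [j for j, head in enumerate(heads) if head == i and j != i]

def lefts(i):
    """Children that appear before token i in the sentence."""
    return [j for j in children(i) if j < i]

def rights(i):
    """Children that appear after token i in the sentence."""
    return [j for j in children(i) if j > i]

def subtree(i):
    """Token i together with all of its descendants, in sentence order."""
    nodes = [i]
    for child in children(i):
        nodes.extend(subtree(child))
    return sorted(nodes)

# The root is the only token that is its own head.
root = next(i for i, head in enumerate(heads) if head == i)

print("root:", words[root])
print("children of root:", [words[j] for j in children(root)])
print("subtree of 'Processing':", [words[j] for j in subtree(5)])
```

The subtree of "Processing" is the whole phrase "Natural Language Processing", which is how a phrase can be recovered from a single head word.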

## **Shallow Parsing**

Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. This involves chunking groups of adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.
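As a rough illustration of chunking by POS tags, the sketch below groups adjacent tokens into noun-phrase-like chunks. It is a simplification under stated assumptions: the (token, POS) pairs are copied from the tagging output earlier in this notebook, the set `NOUN_PHRASE_TAGS` and the function `chunk_by_pos` are hypothetical, and this is not how spaCy itself finds noun phrases.

```python
# (token, coarse POS) pairs copied from the POS tagging output above.
tagged = [
    ("Data", "NOUN"), ("science", "NOUN"), ("is", "AUX"), ("the", "DET"),
    ("study", "NOUN"), ("of", "ADP"), ("data", "NOUN"), ("to", "PART"),
    ("extract", "VERB"),
]

# Tags allowed inside a chunk in this simplified rule (an assumption).
NOUN_PHRASE_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}

def chunk_by_pos(pairs, allowed=NOUN_PHRASE_TAGS):
    """Group maximal runs of adjacent tokens whose POS tag is in `allowed`."""
    chunks, current = [], []
    for word, pos in pairs:
        if pos in allowed:
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a chunk that runs to the end of the input
        chunks.append(" ".join(current))
    return chunks

print(chunk_by_pos(tagged))
```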

### Noun Phrase Detection
A noun phrase is a phrase that has a noun as its head. It could also include other kinds of words, such as adjectives, ordinals, and determiners. Noun phrases are useful for explaining the context of the sentence. They help you understand what the sentence is about.

In [93]:
# Noun phrases are available through .noun_chunks on the Doc object.
for chunk in sample_doc.noun_chunks:
  print(chunk)

Yash
Natural Language Processing


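By looking at noun phrases, you can get information about your text. For example, the chunks above indicate that the sentence is about Yash and Natural Language Processing.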

### Verb Phrase Detection
spaCy has no built-in functionality to extract verb phrases, so you’ll need a library called textacy. You can install textacy with pip:

In [None]:
!python -m pip install textacy

In [97]:
import textacy

about_talk_text = (
    "The talk will introduce reader about use"
    " cases of Natural Language Processing in"
    " Fintech, making use of"
    " interesting examples along the way."
)

patterns = [{"POS": "AUX"}, {"POS": "VERB"}]
about_talk_doc = textacy.make_spacy_doc(
    about_talk_text, lang="en_core_web_sm"
)
verb_phrases = textacy.extract.token_matches(
    about_talk_doc, patterns=patterns
)

# Print all verb phrases
for chunk in verb_phrases:
    print(chunk.text)



# Extract noun phrase to explain what nouns are involved
for chunk in about_talk_doc.noun_chunks:
    print (chunk)

will introduce
The talk
reader
use cases
Natural Language Processing
Fintech
use
interesting examples
the way


## **Named-Entity Recognition**
Named-entity recognition (NER) is the process of locating named entities in unstructured text and then classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.

In [100]:
import spacy
nlp = spacy.load("en_core_web_sm")

piano_class_text = (
    "Great Piano Academy is situated"
    " in Mayfair or the City of London and has"
    " world-class piano instructors."
)
piano_class_doc = nlp(piano_class_text)

for ent in piano_class_doc.ents:
    print(
        f"""
{ent.text = }
{ent.start_char = }
{ent.end_char = }
{ent.label_ = }
spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}
----------------------------------------"""
)


ent.text = 'Great Piano Academy'
ent.start_char = 0
ent.end_char = 19
ent.label_ = 'ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.
----------------------------------------

ent.text = 'Mayfair'
ent.start_char = 35
ent.end_char = 42
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states
----------------------------------------

ent.text = 'the City of London'
ent.start_char = 46
ent.end_char = 64
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states
----------------------------------------


In [101]:
displacy.render(piano_class_doc, style="ent")

One use case for NER is to redact people’s names from a text. For example, you might want to do this in order to hide personal information collected in a survey.
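A minimal sketch of such a redaction step is shown below. To stay self-contained it does not run a model: it assumes entities are given as hypothetical (start_char, end_char, label) tuples, mirroring the ent.start_char, ent.end_char, and ent.label_ attributes printed above, and it assumes person names carry the PERSON label. The survey sentence, the helper `span_of`, and the function `redact_names` are made up for illustration.

```python
def redact_names(text, entities, placeholder="[REDACTED]"):
    """Replace every PERSON entity span in `text` with a placeholder."""
    redacted = text
    # Work from the last entity to the first so earlier offsets stay valid.
    for start_char, end_char, label in sorted(entities, reverse=True):
        if label == "PERSON":
            redacted = redacted[:start_char] + placeholder + redacted[end_char:]
    return redacted

def span_of(text, phrase, label):
    """Build a (start_char, end_char, label) tuple for the first match of `phrase`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

survey_text = "Out of 5 people surveyed, James Robert and Julie Fuller liked apples."

# Hand-built stand-ins for the entities a pipeline might return for this text.
entities = [
    span_of(survey_text, "5", "CARDINAL"),
    span_of(survey_text, "James Robert", "PERSON"),
    span_of(survey_text, "Julie Fuller", "PERSON"),
]

print(redact_names(survey_text, entities))
```

Replacing spans from right to left matters here: the placeholder has a different length than the names, so replacing the first name first would shift the offsets of the second.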