1. Tokenization with SpaCy
Tokenization is the process of splitting text into individual words, punctuation marks, or other meaningful elements called tokens. SpaCy provides an easy-to-use interface for tokenization.

Blank Language Model: spacy.blank("en") creates a blank language object for English. It includes a tokenizer but no other pipeline components (such as named entity recognition or part-of-speech tagging).

The loop in the cell below prints each token in the text.

In [25]:
import spacy

# Create a blank English language model
nlp = spacy.blank("en")

# Tokenize a sentence
doc = nlp("A sentence is typically associated with a clause.A clause can either be a clause simplex or a clause complex. A clause simplex represents a single process going on through time. A clause complex represents a logical relation between two or more processes and is thus composed of two or more clause simplexes. There may be the sentences which talks about currentcy such as $ Pound and Euro")

for token in doc:
    print(token)


A
sentence
is
typically
associated
with
a
clause
.
A
clause
can
either
be
a
clause
simplex
or
a
clause
complex
.
A
clause
simplex
represents
a
single
process
going
on
through
time
.
A
clause
complex
represents
a
logical
relation
between
two
or
more
processes
and
is
thus
composed
of
two
or
more
clause
simplexes
.
There
may
be
the
sentences
which
talks
about
currentcy
such
as
$
Pound
and
Euro
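To check that a blank pipeline really contains only the tokenizer, a minimal sketch like the following can help. The names blank_nlp and sample_doc are illustrative and are not part of the notebook cells above.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # the list of pipeline components

sample_doc = blank_nlp("Hello world!")
tokens = [token.text for token in sample_doc]
print(tokens)

# With no tagger in the pipeline, part-of-speech tags stay empty strings.
pos_tags = [token.pos_ for token in sample_doc]
print(pos_tags)
```

Separate variable names are used so that running this sketch does not overwrite the nlp and doc objects used by the later cells.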


2. Accessing Tokens by Index
You can access individual tokens by their index in the Doc object. Each token also exposes attributes such as is_alpha, is_currency, and like_num.

In [26]:
token = doc[1]  # Access the second token
print(token.text)  # Output: sentence

# List of all token attributes
print(dir(token))


sentence
['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_dep', 'has_extension', 'has_head', 'has_morph', 'has_vector', 'head', 'i', 'idx', 'iob_strings', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex', 'lex
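Indexing a Doc works much like indexing a Python list. The sketch below, with illustrative names index_nlp and index_doc, shows the token count, negative indexing, and the difference between a token's position (token.i) and its character offset (token.idx).

```python
import spacy

index_nlp = spacy.blank("en")
index_doc = index_nlp("Tony gave two $ to Peter.")

print(len(index_doc))   # number of tokens in the Doc

first = index_doc[0]
last = index_doc[-1]    # negative indices count from the end
print(first.text, last.text)

# token.i is the token's position in the Doc;
# token.idx is the character offset where the token starts in the text.
peter = index_doc[5]
print(peter.text, peter.i, peter.idx)
```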

3. Token Attributes
SpaCy tokens have attributes such as is_alpha (the token consists only of alphabetic characters), like_num (the token resembles a number, e.g. "2" or "two"), and is_currency (the token is a currency symbol).

In [27]:
token0 = doc[0]
token0.__init__

<method-wrapper '__init__' of spacy.tokens.token.Token object at 0x16c191530>

You can explore the attributes of each token to understand its properties, as in the example below:

In [28]:
doc = nlp("Tony gave two $ to Peter.")
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, "like_num:", token.like_num,
          "is_currency:", token.is_currency)


Tony ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False
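These boolean attributes are convenient as filters. As a small sketch (the sentence and the names attr_nlp and attr_doc are made up for illustration), tokens can be grouped by attribute with list comprehensions:

```python
import spacy

attr_nlp = spacy.blank("en")
attr_doc = attr_nlp("I bought ten apples and 3 pears.")

# like_num matches digits as well as spelled-out numbers.
numbers = [token.text for token in attr_doc if token.like_num]

# is_alpha is True only for purely alphabetic tokens.
words = [token.text for token in attr_doc if token.is_alpha]

# is_punct flags punctuation tokens.
punctuation = [token.text for token in attr_doc if token.is_punct]

print(numbers)
print(words)
print(punctuation)
```

Note that "ten" appears in both the numbers and the words lists, since it is alphabetic and also resembles a number.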


4. Extracting Information from Text
You can extract specific types of tokens (e.g., email addresses) by iterating over the tokens and checking their attributes.

In [29]:
emails = []  # the list must exist before the loop appends to it
for token in doc:
    if token.like_email:
        emails.append(token.text)


Using Span Objects
A Span is a slice of the Doc object and can be created by slicing.
Example:

In [30]:
span = doc[0:5]
print(span)


Tony gave two $ to
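A Span keeps track of where it sits in the parent Doc. The following sketch, using the illustrative names span_nlp, span_doc, and middle, shows a few properties that are often useful:

```python
import spacy

span_nlp = spacy.blank("en")
span_doc = span_nlp("Tony gave two $ to Peter.")

middle = span_doc[1:3]            # tokens at positions 1 and 2
print(type(middle).__name__)      # slicing a Doc yields a Span
print(middle.text)

# start and end are token positions in the parent Doc (end is exclusive).
print(middle.start, middle.end, len(middle))

# A Span can itself be indexed, relative to its own start.
print(middle[0].text)
```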


Extracting Email IDs
Using the like_email attribute, you can extract email addresses from text.
Example:

In [31]:
text = '''
Dayton high school, 8th grade students information
Name  birth day   email
Virat   5 June, 1882    virat@kohli.com
Maria  12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 
Joe      1 May, 1997    joe@root.com
'''

doc = nlp(text)
emails = [token.text for token in doc if token.like_email]
print(emails)


['virat@kohli.com', 'maria@sharapova.com', 'serena@williams.com', 'joe@root.com']


5. Support for Multiple Languages
SpaCy supports many languages. You can tokenize text in other languages just by changing the language code (e.g., hi for Hindi).

In [32]:
nlp = spacy.blank("hi")
doc = nlp("राजेंद्र प्रसाद, भारत के पहले राष्ट्रपति, दो कार्यकाल के लिए कार्यालय रखने वाले एकमात्र व्यक्ति हैं।")
for token in doc:
    print(token, token.is_currency)


राजेंद्र False
प्रसाद False
, False
भारत False
के False
पहले False
राष्ट्रपति False
, False
दो False
कार्यकाल False
के False
लिए False
कार्यालय False
रखने False
वाले False
एकमात्र False
व्यक्ति False
हैं False
। False
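Switching languages only requires a different code passed to spacy.blank. As a rough sketch, the loop below builds a blank pipeline for a few language codes (chosen here only for illustration) and tokenizes the same short string with each one:

```python
import spacy

greeting = "Hallo Welt!"
tokens_by_lang = {}

for code in ["en", "de", "hi"]:
    lang_nlp = spacy.blank(code)          # blank pipeline for this language
    lang_doc = lang_nlp(greeting)
    tokens_by_lang[lang_nlp.lang] = [token.text for token in lang_doc]

for lang, tokens in tokens_by_lang.items():
    print(lang, tokens)
```

Each blank pipeline carries its own language-specific tokenization rules, so results can differ between languages on text with contractions or abbreviations.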


6. Customizing Tokenizer
You can customize the tokenizer by adding special cases. For example, you can split "gimme" into "gim" and "me".

In [33]:
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
print(tokens)


['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']


In [34]:
from spacy.symbols import ORTH

# The same special case as before, written across several lines for readability
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"},
])
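A special case may only change how a string is split, not the text itself: the ORTH pieces have to join back into the original string. The sketch below illustrates this and also attaches a NORM value to one piece; the NORM value "give" and the names custom_nlp and custom_doc are assumptions made for the example.

```python
import spacy
from spacy.symbols import ORTH, NORM

custom_nlp = spacy.blank("en")

# "gim" + "me" must spell out the original string "gimme".
custom_nlp.tokenizer.add_special_case(
    "gimme",
    [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}],
)

custom_doc = custom_nlp("gimme a pizza")
pieces = [token.text for token in custom_doc]
norms = [token.norm_ for token in custom_doc]

print(pieces)
print(norms)
print(custom_doc.text)  # the Doc still reproduces the input text
```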


7. Sentence Segmentation
Sentence tokenization (segmentation) splits text into sentences. The default blank model does not include sentence boundary detection, so you need to add a component such as the sentencizer to the pipeline.
Example:

In [35]:
nlp.add_pipe('sentencizer')
doc = nlp("Dr. Strange loves pav bhaji of Mumbai. Hulk loves chat of Delhi")
for sentence in doc.sents:
    print(sentence)


Dr. Strange loves pav bhaji of Mumbai.
Hulk loves chat of Delhi
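Adding a component with a name that is already in the pipeline raises an error, which is easy to hit when a notebook cell is run twice. One possible guard, sketched here with the illustrative names sent_nlp and sent_doc, checks pipe_names first:

```python
import spacy

sent_nlp = spacy.blank("en")

def ensure_sentencizer(pipeline):
    # Add the rule-based sentencizer only if it is not already present.
    if "sentencizer" not in pipeline.pipe_names:
        pipeline.add_pipe("sentencizer")

ensure_sentencizer(sent_nlp)
ensure_sentencizer(sent_nlp)   # second call is a no-op

sent_doc = sent_nlp("This is one. This is two.")
sentences = [sent.text for sent in sent_doc.sents]
print(sent_nlp.pipe_names)
print(sentences)
```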


8. Exercise Solutions
(1) Extracting URLs from Text:

In [None]:
text = '''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

doc = nlp(text)
urls = [token.text for token in doc if token.like_url]
print(urls)


['http://www.data.gov/', 'http://www.science', 'http://data.gov.uk/.', 'http://www3.norc.org/gss+website/', 'http://www.europeansocialsurvey.org/.']
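Some of the extracted URLs keep the sentence-ending period. If that matters, a small post-processing step can strip trailing punctuation. This is only a sketch: strip_trailing_punct is a hypothetical helper, and the input list copies a few values from the output above.

```python
raw_urls = [
    "http://www.data.gov/",
    "http://data.gov.uk/.",
    "http://www.europeansocialsurvey.org/.",
]

def strip_trailing_punct(url):
    # Remove sentence punctuation that was glued to the end of the URL.
    return url.rstrip(".,;")

clean_urls = [strip_trailing_punct(url) for url in raw_urls]
print(clean_urls)
```

This would also alter a URL that legitimately ends in one of these characters, so it is a heuristic rather than a general fix.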


(2) Extracting Money Transactions:

In [None]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)

money_transactions = []
for token in doc:
    # Require a preceding token so that index -1 never wraps to the end of the doc
    if token.is_currency and token.i > 0:
        money_transactions.append(f"{doc[token.i - 1].text} {token.text}")

print(money_transactions)


['two $', '500 €']
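The solution above relies on the amount appearing directly before the currency symbol. A slightly more defensive variant, sketched here with the hypothetical helper extract_amounts, also checks that the preceding token looks like a number:

```python
import spacy

money_nlp = spacy.blank("en")

def extract_amounts(doc):
    amounts = []
    for token in doc:
        # Skip a currency symbol at position 0: it has no preceding token.
        if token.is_currency and token.i > 0:
            previous = doc[token.i - 1]
            # Keep the pair only when the previous token resembles a number.
            if previous.like_num:
                amounts.append(f"{previous.text} {token.text}")
    return amounts

paid = extract_amounts(money_nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve"))
none_found = extract_amounts(money_nlp("$ is a currency symbol"))
print(paid)
print(none_found)
```

This still misses amounts written after the symbol, so it is a starting point rather than a complete extractor.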


My Final Code

In [None]:
import spacy
from spacy.symbols import ORTH

# Initialize blank English model
nlp = spacy.blank("en")

# Example 1: Basic Tokenization
def basic_tokenization(sentence):
    doc = nlp(sentence)
    return [token.text for token in doc]

# Example 2: Accessing Tokens by Index
def access_token_by_index(sentence, index):
    doc = nlp(sentence)
    return doc[index].text

# Example 3: Token Attributes
def token_attributes(sentence):
    doc = nlp(sentence)
    attributes = [(token.text, token.is_alpha, token.like_num, token.is_currency) for token in doc]
    return attributes

# Example 4: Extracting Emails from Text
def extract_emails(text):
    doc = nlp(text)
    return [token.text for token in doc if token.like_email]

# Example 5: Customizing Tokenizer
def customize_tokenizer(sentence):
    nlp.tokenizer.add_special_case("gimme", [
        {ORTH: "gim"},
        {ORTH: "me"},
    ])
    doc = nlp(sentence)
    return [token.text for token in doc]

# Example 6: Sentence Segmentation
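def sentence_segmentation(text):
    # Add the sentencizer only once; adding it again would raise an error
    if "sentencizer" not in nlp.pipe_names:
        nlp.add_pipe('sentencizer')
    doc = nlp(text)
    return [sent.text for sent in doc.sents]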

# Exercise 1: Extracting URLs
def extract_urls(text):
    doc = nlp(text)
    return [token.text for token in doc if token.like_url]

# Exercise 2: Extracting Money Transactions
def extract_money_transactions(transactions):
    doc = nlp(transactions)
    money_transactions = []
    for token in doc:
        # Require a preceding token so that index -1 never wraps to the end of the doc
        if token.is_currency and token.i > 0:
            money_transactions.append(f"{doc[token.i - 1].text} {token.text}")
    return money_transactions

# Usage examples
if __name__ == "__main__":
    sentence = "Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate."
    print("Basic Tokenization:", basic_tokenization(sentence))
    
    print("Access Token by Index:", access_token_by_index(sentence, 1))
    
    print("Token Attributes:", token_attributes(sentence))
    
    email_text = '''
    Virat   5 June, 1882    virat@kohli.com
    Maria   12 April, 2001  maria@sharapova.com
    '''
    print("Extract Emails:", extract_emails(email_text))
    
    custom_sentence = "gimme double cheese extra large healthy pizza"
    print("Custom Tokenizer:", customize_tokenizer(custom_sentence))
    
    segmentation_text = "Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi"
    print("Sentence Segmentation:", sentence_segmentation(segmentation_text))
    
    url_text = '''
    Look for data at http://www.data.gov/ and http://data.gov.uk/.
    '''
    print("Extract URLs:", extract_urls(url_text))
    
    transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
    print("Extract Money Transactions:", extract_money_transactions(transactions))


Basic Tokenization: ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'of', 'mumbai', 'as', 'it', 'costs', 'only', '2', '$', 'per', 'plate', '.']
Access Token by Index: Strange
Token Attributes: [('Dr.', False, False, False), ('Strange', True, False, False), ('loves', True, False, False), ('pav', True, False, False), ('bhaji', True, False, False), ('of', True, False, False), ('mumbai', True, False, False), ('as', True, False, False), ('it', True, False, False), ('costs', True, False, False), ('only', True, False, False), ('2', False, True, False), ('$', False, False, True), ('per', True, False, False), ('plate', True, False, False), ('.', False, False, False)]
Extract Emails: ['virat@kohli.com', 'maria@sharapova.com']
Custom Tokenizer: ['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
Sentence Segmentation: ['Dr. Strange loves pav bhaji of mumbai.', 'Hulk loves chat of delhi']
Extract URLs: ['http://www.data.gov/', 'http://data.gov.uk/.']
Extract Money Transactions: ['t