We can use either of the following libraries:
1. NLTK
2. spaCy

# What is the difference between NLTK and spaCy
Both are powerful Python libraries for Natural Language Processing (NLP), but they differ in purpose, speed, and features.

### 1. Purpose and Philosophy

| Feature   | NLTK                                      | spaCy                                               |
|-----------|-------------------------------------------|------------------------------------------------------|
| Goal      | Research and education                    | Industrial-strength NLP                              |
| Design    | Modular, flexible, and comprehensive      | Fast, efficient, and production-ready                |
| Use Case  | Teaching, prototyping, linguistic research| Real-world applications, pipelines, and deployment   |

### 2. Performance and Speed

| Feature       | NLTK                                | spaCy                                 |
|---------------|-------------------------------------|----------------------------------------|
| Speed         | Slower, especially with large texts | Much faster and optimized in Cython    |
| Memory Usage  | Depends on the corpora and models loaded | Depends on the pipeline: blank pipelines are light, transformer models are heavy |

### 3. Features and Capabilities

| Feature                        | NLTK                     | spaCy                                               |
|-------------------------------|--------------------------|------------------------------------------------------|
| Tokenization                  | Yes                      | Yes (more accurate and faster)                       |
| POS Tagging                   | Yes                      | Yes (more accurate)                                  |
| Named Entity Recognition (NER)| Yes                      | Yes (state-of-the-art)                               |
| Dependency Parsing            | Via wrappers for external parsers | Built into the trained pipelines              |
| Lemmatization                 | Yes                      | Yes                                                  |
| Word Vectors                  | Limited                  | Built-in support for word vectors and transformers   |
| Pre-trained Models            | Few, mostly for English  | Many, multilingual and transformer-based             |
| Deep Learning Integration     | Limited                  | Strong support via spaCy Transformers and Thinc      |


### Top Tokenization Libraries

| Library | Language | Highlights | Best For |
|---------|----------|------------|----------|
| spaCy (Top 1) | Python | Fast, accurate, supports multiple languages, built-in linguistic features | Production NLP, Named Entity Recognition |
| Hugging Face Tokenizers (Top 2) | Python/Rust | Extremely fast, supports BPE, WordPiece, Unigram, etc. | Transformer models (BERT, GPT, etc.) |
| NLTK (Top 3) | Python | Educational, simple, supports sentence and word tokenization | Learning, academic research |


In [10]:
corpus = """Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
""" 

In [24]:
print(corpus)

Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



### Tokenization
#### Convert - Paragraph --> Sentences

In [25]:
import nltk 
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize


documents = sent_tokenize(corpus)

print(documents)



["Hello Welcome, to krish Naik's NLP Tutorials.", 'Please do watch the entire course!', 'to become expert in NLP.']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ugautam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [26]:
type(documents)

list

In [27]:
for sentence in documents:
    print(sentence)

Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.
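
To see what `sent_tokenize` adds over plain pattern matching, here is a rough sketch of a regex-only splitter. `naive_sent_split` is a hypothetical helper written for this illustration, not part of NLTK. It gives the same three sentences on our corpus, but it breaks on abbreviations, which is the kind of false boundary a trained sentence tokenizer such as Punkt is designed to avoid.

```python
import re


def naive_sent_split(text: str) -> list[str]:
    """Split wherever whitespace follows '.', '!' or '?' (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


corpus = """Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

# On the tutorial corpus the naive rule happens to find the same three sentences.
print(naive_sent_split(corpus))

# Abbreviations end in a full stop too, so the naive rule cuts the sentence apart.
print(naive_sent_split("Dr. Smith went to Washington D.C. on Jan. 5th, 2023."))
# ['Dr.', 'Smith went to Washington D.C.', 'on Jan.', '5th, 2023.']
```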


### Tokenization
#### Convert - Paragraph --> Words
#### Convert - Sentence --> Words

In [28]:
from nltk.tokenize import word_tokenize

words = word_tokenize(corpus)

In [29]:
len(words)

23

In [30]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [31]:
# wordpunct_tokenize splits punctuation into separate tokens,
# so "Naik's" becomes 'Naik', "'", 's'.
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
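
The behaviour of `wordpunct_tokenize` can be reproduced with a single regular expression, which makes it easier to see why the apostrophe is cut off. The sketch below uses only the standard library and the pattern `\w+|[^\w\s]+` that NLTK documents for `WordPunctTokenizer`; the helper name `regex_wordpunct` is made up for this example.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"


def regex_wordpunct(text: str) -> list[str]:
    """Return alternating word and punctuation runs, dropping whitespace."""
    return re.findall(WORDPUNCT_PATTERN, text)


print(regex_wordpunct("krish Naik's NLP Tutorials."))
# ['krish', 'Naik', "'", 's', 'NLP', 'Tutorials', '.']

# Consecutive punctuation marks stay together as one token.
print(regex_wordpunct("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```

Note that a run of punctuation is kept as one token, so `...` and `?!` are not split character by character.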

In [32]:
## See the tokens with their index numbers

# wordpunct_tokenize splits punctuation into separate tokens
from nltk.tokenize import wordpunct_tokenize

tokens = wordpunct_tokenize(corpus)


# Display tokens with their index numbers
for idx, token in enumerate(tokens):
    print(f"Token {idx}: {token}")


Token 0: Hello
Token 1: Welcome
Token 2: ,
Token 3: to
Token 4: krish
Token 5: Naik
Token 6: '
Token 7: s
Token 8: NLP
Token 9: Tutorials
Token 10: .
Token 11: Please
Token 12: do
Token 13: watch
Token 14: the
Token 15: entire
Token 16: course
Token 17: !
Token 18: to
Token 19: become
Token 20: expert
Token 21: in
Token 22: NLP
Token 23: .


In [33]:
# TreebankWordTokenizer expects one sentence at a time: a full stop (.) inside the text
# is not treated as a separate token and stays attached to the preceding word ('Tutorials.').
# Only the full stop at the very end of the input becomes a separate token.
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
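
Why does only the last full stop get its own token? A plausible way to picture it is a rule anchored to the end of the input. The sketch below is a deliberately simplified toy, not NLTK's actual implementation: `split_final_period` is a hypothetical helper that detaches a period only when it is the last non-space character, then splits on whitespace.

```python
import re


def split_final_period(text: str) -> list[str]:
    """Toy rule: detach a period only at the very end of the input."""
    # `$` anchors the match to the end of the string, so periods in the
    # middle of the text are left attached to their word.
    spaced = re.sub(r"([^.])(\.)\s*$", r"\1 \2", text)
    return spaced.split()


print(split_final_period("NLP Tutorials. Please watch. The end."))
# ['NLP', 'Tutorials.', 'Please', 'watch.', 'The', 'end', '.']

# Passing one sentence at a time detaches every sentence-final period.
for sentence in ["NLP Tutorials.", "Please watch.", "The end."]:
    print(split_final_period(sentence))
```

This is why Treebank-style tokenizers are normally applied to the output of a sentence tokenizer rather than to a whole paragraph.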

### spaCy Examples

In [7]:

import spacy

# Load a blank English model (no pretrained components)
nlp = spacy.blank("en")

# Your input text
text = "Tokenization using spaCy with a blank English's model."

# Process the text
doc = nlp(text)

# Print tokens
for i, token in enumerate(doc):
    print(f"Token {i}: {token.text}")


Token 0: Tokenization
Token 1: using
Token 2: spaCy
Token 3: with
Token 4: a
Token 5: blank
Token 6: English
Token 7: 's
Token 8: model
Token 9: .


In [None]:
## Compare NLTK and spaCy tokenization (output and timing)

import nltk
import spacy
import time
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer data (recent NLTK releases load 'punkt_tab', downloaded earlier, instead of 'punkt')
nltk.download('punkt')

# Sample text
text = "Dr. Smith went to Washington D.C. on Jan. 5th, 2023."

# NLTK Tokenization
start_nltk = time.time()
nltk_tokens = word_tokenize(text)
end_nltk = time.time()

# spaCy Tokenization
nlp = spacy.blank("en")  # Use blank model for tokenization only
start_spacy = time.time()
spacy_tokens = [token.text for token in nlp(text)]
end_spacy = time.time()

# Print results
print("NLTK Tokens:", nltk_tokens)
print("spaCy Tokens:", spacy_tokens)
print(f"NLTK Time: {end_nltk - start_nltk:.6f} seconds")
print(f"spaCy Time: {end_spacy - start_spacy:.6f} seconds")


NLTK Tokens: ['Dr.', 'Smith', 'went', 'to', 'Washington', 'D.C.', 'on', 'Jan.', '5th', ',', '2023', '.']
spaCy Tokens: ['Dr.', 'Smith', 'went', 'to', 'Washington', 'D.C.', 'on', 'Jan.', '5th', ',', '2023', '.']
NLTK Time: 0.000999 seconds
spaCy Time: 0.001000 seconds


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ugautam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
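
A single call that takes about a millisecond is too short to compare reliably, because the measurement is close to the resolution of the clock. A fairer comparison repeats each call many times. The sketch below uses the standard-library `timeit` module; `time_tokenizer` is a hypothetical helper, and the two tokenizers timed here (`str.split` and a regex) are stand-ins so the example runs without NLTK or spaCy installed.

```python
import re
import timeit


def time_tokenizer(tokenize, text, number=1000, repeat=5):
    """Return the best average time per call, in seconds."""
    # timeit.repeat runs `number` calls per measurement and returns
    # one total per repetition; the minimum is the least noisy run.
    runs = timeit.repeat(lambda: tokenize(text), number=number, repeat=repeat)
    return min(runs) / number


text = "Dr. Smith went to Washington D.C. on Jan. 5th, 2023."

# Stand-in tokenizers; any callable that takes a string can be timed.
candidates = {
    "str.split": str.split,
    "regex": re.compile(r"\w+|[^\w\s]+").findall,
}

for name, tokenize in candidates.items():
    per_call = time_tokenizer(tokenize, text)
    print(f"{name}: {per_call * 1e6:.2f} microseconds per call")
```

To compare the two libraries, the same helper could be called as `time_tokenizer(word_tokenize, text)` and `time_tokenizer(lambda t: [tok.text for tok in nlp(t)], text)`, with `nlp` created once outside the timed call.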
