We can use either of the following NLP libraries:
1. NLTK
2. spaCy

# What is the difference between NLTK and spaCy?
Both are powerful Python libraries for Natural Language Processing (NLP).

### 1. Purpose and Philosophy

| Feature   | NLTK                                      | spaCy                                               |
|-----------|-------------------------------------------|------------------------------------------------------|
| Goal      | Research and education                    | Industrial-strength NLP                              |
| Design    | Modular, flexible, and comprehensive      | Fast, efficient, and production-ready                |
| Use Case  | Teaching, prototyping, linguistic research| Real-world applications, pipelines, and deployment   |

### 2. Performance and Speed

| Feature       | NLTK                                | spaCy                                 |
|---------------|-------------------------------------|----------------------------------------|
| Speed         | Slower, especially with large texts | Much faster and optimized in Cython    |
| Memory Usage  | Higher                              | Lower and more efficient               |

### 3. Features and Capabilities

| Feature                        | NLTK                     | spaCy                                               |
|-------------------------------|--------------------------|------------------------------------------------------|
| Tokenization                  | Yes                      | Yes (more accurate and faster)                       |
| POS Tagging                   | Yes                      | Yes (more accurate)                                  |
| Named Entity Recognition (NER)| Yes                      | Yes (state-of-the-art)                               |
| Dependency Parsing            | Basic                    | Advanced                                             |
| Lemmatization                 | Yes                      | Yes                                                  |
| Word Vectors                  | Limited                  | Built-in support for word vectors and transformers   |
| Pre-trained Models            | Few, mostly for English  | Many, multilingual and transformer-based             |
| Deep Learning Integration     | Limited                  | Strong support via spaCy Transformers and Thinc      |


In [2]:
corpus = """Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
""" 

In [3]:
print(corpus)

Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



### Tokenization
#### Convert: Paragraph --> Sentences

In [12]:
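import nltk
nltk.download('punkt_tab')  # Punkt sentence-tokenizer data used by sent_tokenize

from nltk.tokenize import sent_tokenize

# Split the paragraph into a list of sentences
documents = sent_tokenize(corpus)

print(documents)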



["Hello Welcome, to krish Naik's NLP Tutorials.", 'Please do watch the entire course!', 'to become expert in NLP.']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ugautam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [13]:
type(documents)

list

In [14]:
for sentence in documents:
    print(sentence)

Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.
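
The split after `course!` can look surprising, since the next word is lowercase. A rough way to see why it happens is a naive rule that ends a sentence at every `.`, `!` or `?` followed by whitespace. The sketch below is only a stand-in built on the standard `re` module; `naive_sent_split` is a hypothetical helper, and it is not how NLTK's Punkt tokenizer works internally (Punkt additionally handles abbreviations and similar cases).

```python
import re

corpus = """Hello Welcome, to krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

# Naive rule (assumption for illustration): a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def naive_sent_split(text):
    """Split text after sentence-final punctuation (hypothetical helper, not Punkt)."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

sentences = naive_sent_split(corpus)
for s in sentences:
    print(s)
```

On this corpus the naive rule yields the same three sentences as `sent_tokenize`, because the `!` is treated as a sentence boundary regardless of the capitalization that follows.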


### Tokenization
#### Convert: Paragraph --> Words
#### Convert: Sentence --> Words

In [19]:
from nltk.tokenize import word_tokenize

words = word_tokenize(corpus)

In [20]:
len(words)

23

In [16]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [None]:
# wordpunct_tokenize treats every run of punctuation as a separate token,
# so even the apostrophe in "Naik's" is split off ("'", "s").
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
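
The behaviour above follows from the pattern that NLTK documents for `WordPunctTokenizer`: `\w+|[^\w\s]+`, meaning either a run of word characters or a run of non-word, non-space characters. The sketch below reproduces that pattern with the standard `re` module so the rule can be inspected in isolation; `regex_wordpunct` is a hypothetical name used only for this illustration.

```python
import re

# Same pattern that NLTK documents for WordPunctTokenizer:
# a run of word characters, or a run of characters that are neither word nor space.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Return alphabetic/numeric runs and punctuation runs as separate tokens."""
    return WORDPUNCT.findall(text)

tokens = regex_wordpunct("krish Naik's NLP Tutorials.")
print(tokens)

# Consecutive punctuation characters stay together as one token.
print(regex_wordpunct("Wait...!"))
```

Because the apostrophe is not a word character, `Naik's` breaks into `Naik`, `'` and `s`, which is the main difference from `word_tokenize`, where `'s` stays a single token.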

In [None]:
# TreebankWordTokenizer expects one sentence at a time: a full stop (.) inside the text
# is not treated as a separate token and stays attached to the preceding word (e.g. 'Tutorials.').
# Only the full stop at the very end of the text becomes a separate token.
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
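
The `'Tutorials.'` token suggests that the full stop is only detached when it ends the input string. The sketch below isolates that single rule with the standard `re` module. It is a deliberately simplified assumption, not NLTK's implementation: `split_final_period` is a hypothetical helper that splits on whitespace and ignores every other Treebank rule (commas, `'s`, quotes and so on).

```python
import re

# Simplified rule (assumption for illustration): detach a period
# only when it is the last non-space character of the string.
FINAL_PERIOD = re.compile(r"\.\s*$")

def split_final_period(text):
    """Whitespace-split text, separating only a string-final period."""
    return FINAL_PERIOD.sub(" .", text).split()

paragraph = "NLP Tutorials. Please watch the course."

# Whole paragraph at once: the inner period stays glued to 'Tutorials.'
whole = split_final_period(paragraph)
print(whole)

# One sentence at a time: every sentence-final period is detached.
sentences = ["NLP Tutorials.", "Please watch the course."]
per_sentence = [split_final_period(s) for s in sentences]
print(per_sentence)
```

This is also why `word_tokenize` separated every full stop earlier: it first splits the text into sentences and then tokenizes each sentence, so each full stop sits at the end of its own string.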