
In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, MWETokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer
from textblob import TextBlob
import spacy
from gensim.utils import tokenize
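# Note: keras.preprocessing.text may not be available in recent Keras releases;
# this import assumes an older Keras / TensorFlow 2.x environment.
from keras.preprocessing.text import text_to_word_sequence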

text = "Navigating through the labyrinth of Indian Railways' domain can be quite an adventure, filled with twists, turns, and unexpected delays. From bustling platforms 🚉 teeming with diverse faces to the rhythmic clatter of trains 🚂 departing into the horizon, the railway network sprawls across the vast expanse of the nation, connecting bustling metropolises with quaint villages. However, don't be fooled by the apparent chaos; behind the scenes, a well-oiled machinery operates tirelessly, ensuring millions of passengers reach their destinations safely. Amidst the cacophony of announcements and the hustle of passengers, one must navigate through the maze of ticket bookings, navigating cancellations, and negotiating with the omnipresent queues. It's a realm where time seems to both stand still and fly by in a blur of motion, where patience is tested, but the rewards of a scenic journey or a timely arrival are worth every moment of anticipation and uncertainty."

# a. Word Tokenization
# Newer NLTK releases may require nltk.download('punkt_tab') instead of 'punkt'.
nltk.download('punkt')
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)

# b. Sentence Tokenization
sent_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sent_tokens)

# c. Punctuation-based Tokenizer
punctuation_tokens = nltk.tokenize.WordPunctTokenizer().tokenize(text)
print("Punctuation-based Tokenizer:", punctuation_tokens)

# d. Treebank Word Tokenizer
treebank_tokens = TreebankWordTokenizer().tokenize(text)
print("Treebank Word tokenizer:", treebank_tokens)

# e. Tweet Tokenizer
tweet_tokens = TweetTokenizer().tokenize(text)
print("Tweet Tokenizer:", tweet_tokens)

# f. Multi-Word Expression Tokenizer
# Register expressions that actually occur in the text; matched parts are joined with "_".
mwe_tokenizer = MWETokenizer([('Indian', 'Railways'), ('ticket', 'bookings')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text))
print("Multi-Word Expression Tokenizer:", mwe_tokens)

# g. TextBlob Word Tokenize
textblob_tokens = TextBlob(text).words
print("TextBlob Word Tokenize:", textblob_tokens)

# h. spaCy Tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print("spaCy Tokenizer:", spacy_tokens)

# i. Gensim word tokenizer
gensim_tokens = list(tokenize(text))
print("Gensim word tokenizer:", gensim_tokens)

# j. Tokenization with Keras
keras_tokens = text_to_word_sequence(text)
print("Tokenization with Keras:", keras_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word Tokenization: ['Navigating', 'through', 'the', 'labyrinth', 'of', 'Indian', 'Railways', "'", 'domain', 'can', 'be', 'quite', 'an', 'adventure', ',', 'filled', 'with', 'twists', ',', 'turns', ',', 'and', 'unexpected', 'delays', '.', 'From', 'bustling', 'platforms', '🚉', 'teeming', 'with', 'diverse', 'faces', 'to', 'the', 'rhythmic', 'clatter', 'of', 'trains', '🚂', 'departing', 'into', 'the', 'horizon', ',', 'the', 'railway', 'network', 'sprawls', 'across', 'the', 'vast', 'expanse', 'of', 'the', 'nation', ',', 'connecting', 'bustling', 'metropolises', 'with', 'quaint', 'villages', '.', 'However', ',', 'do', "n't", 'be', 'fooled', 'by', 'the', 'apparent', 'chaos', ';', 'behind', 'the', 'scenes', ',', 'a', 'well-oiled', 'machinery', 'operates', 'tirelessly', ',', 'ensuring', 'millions', 'of', 'passengers', 'reach', 'their', 'destinations', 'safely', '.', 'Amidst', 'the', 'cacophony', 'of', 'announcements', 'and', 'the', 'hustle', 'of', 'passengers', ',', 'one', 'must', 'navigate', 'th

a. **Word Tokenization**:
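   - Definition: Word tokenization splits a text into individual tokens (words and punctuation marks). NLTK's `word_tokenize` first splits the text into sentences and then applies Treebank-style rules, so punctuation marks become separate tokens and contractions are split ("don't" becomes 'do', "n't").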
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']

b. **Sentence Tokenization**:
   - Definition: Sentence tokenization is the process of splitting a text into individual sentences.
   - Example:
     Input: "Hello, world! How are you? I'm doing great."
     Output: ["Hello, world!", "How are you?", "I'm doing great."]
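
To see why sentence splitting needs more than a punctuation rule, here is a minimal regex-based sketch. It is not how `sent_tokenize` works (NLTK relies on the pretrained Punkt model); the function name `naive_sent_tokenize` is hypothetical and the splitting rule is an assumption chosen for illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sent_tokenize("Hello, world! How are you? I'm doing great."))
# The naive rule wrongly splits after the abbreviation "Dr."
print(naive_sent_tokenize("Dr. Smith arrived. He sat down."))
```

The second call breaks "Dr." off as its own sentence, which is the kind of case a trained sentence tokenizer is designed to handle.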

c. **Punctuation-based Tokenizer**:
   - Definition: Punctuation-based tokenizer (`WordPunctTokenizer`) splits a text into runs of alphanumeric characters and runs of punctuation characters. Unlike `word_tokenize`, it also breaks contractions at the apostrophe.
   - Example:
     Input: "Hello, world! Don't panic."
     Output: ['Hello', ',', 'world', '!', 'Don', "'", 't', 'panic', '.']
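
The behaviour can be sketched with a single regular expression. The pattern below is, to my understanding, the one NLTK's `WordPunctTokenizer` is built on; the helper name `word_punct_tokenize` is hypothetical.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Return alphanumeric runs and punctuation runs as separate tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("However, don't be fooled; it's fine."))
```

Note how "don't" comes out as three tokens ('don', "'", 't'), which is usually undesirable when contractions matter downstream.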

d. **Treebank Word Tokenizer**:
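   - Definition: Treebank word tokenizer is a rule-based word tokenizer that follows the conventions used in the Penn Treebank corpus: punctuation is split off and contractions are separated into two tokens ("don't" becomes 'do', "n't").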
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']
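
The contraction handling is what sets this tokenizer apart. The following is a rough sketch of that idea only, not NLTK's full rule set; the function name `treebank_like_tokenize` and the three simplified rules are assumptions made for illustration.

```python
import re

def treebank_like_tokenize(text):
    """Very small approximation of Treebank-style splitting."""
    # Separate the negative clitic: "don't" -> "do n't".
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Separate other common clitics: "it's" -> "it 's".
    text = re.sub(r"(?i)(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Put spaces around common punctuation, then split on whitespace.
    text = re.sub(r"([,;:.!?])", r" \1 ", text)
    return text.split()

print(treebank_like_tokenize("However, don't be fooled; it's fine."))
```

Unlike a purely punctuation-based split, the clitics stay attached to their apostrophe, so "n't" and "'s" remain meaningful units.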

e. **Tweet Tokenizer**:
   - Definition: Tweet tokenizer is specialized for social media text: it keeps emojis, hashtags, and handles intact as single tokens instead of splitting them at the symbol.
   - Example:
     Input: "Just landed ✈️ in New York! #excited"
     Output: ['Just', 'landed', '✈️', 'in', 'New', 'York', '!', '#excited']
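
A simplified sketch of the idea: match hashtags and handles before falling back to words and symbols. This is not `TweetTokenizer`'s actual pattern, which is far more elaborate; the regex, the function name `tweet_like_tokenize`, and the handle `@rail_user` are all hypothetical.

```python
import re

# Hashtags and handles first, then plain words, then any single non-space symbol.
TWEET_LIKE = re.compile(r"[#@]\w+|\w+|[^\w\s]")

def tweet_like_tokenize(text):
    """Keep '#tag' and '@handle' together; emit other symbols one at a time."""
    return TWEET_LIKE.findall(text)

print(tweet_like_tokenize("Train 🚂 delayed again! #IndianRailways @rail_user"))
```

Because the alternatives are tried left to right, `#IndianRailways` is consumed as one token before the generic symbol rule can split off the `#`.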

f. **Multi-Word Expression Tokenizer**:
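   - Definition: Multi-Word Expression tokenizer merges registered multi-word expressions into single tokens. NLTK's `MWETokenizer` works on an already tokenized list and joins the parts with a separator (`_` by default).
   - Example (with `('New', 'York', 'City')` registered):
     Input: "New York City is a great place to visit."
     Output: ['New_York_City', 'is', 'a', 'great', 'place', 'to', 'visit', '.']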
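
The merging step can be sketched in plain Python as a greedy longest-match scan over a token list. This is an illustrative stand-in, not NLTK's implementation; the function name `merge_mwes` and its parameters are hypothetical.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge the longest registered expression at each position."""
    # Try longer expressions first so ('New', 'York', 'City') beats ('New', 'York').
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['New', 'York', 'City', 'is', 'a', 'great', 'place', 'to', 'visit', '.']
print(merge_mwes(tokens, [('New', 'York'), ('New', 'York', 'City')]))
```

Expressions that never occur in the token list simply have no effect, so the registered expressions should be chosen from the vocabulary of the text being processed.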

g. **TextBlob Word Tokenize**:
   - Definition: TextBlob word tokenize uses the `words` property of a `TextBlob` object, which returns the word tokens of the text with punctuation removed.
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['Hello', 'world', 'How', 'are', 'you']

h. **spaCy Tokenizer**:
   - Definition: spaCy tokenizer segments the text into `Token` objects using language-specific rules. The tokenizer itself only segments; part-of-speech tags and named entities are added by later components of the loaded pipeline (here `en_core_web_sm`).
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']

i. **Gensim word tokenizer**:
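   - Definition: Gensim word tokenizer (`gensim.utils.tokenize`) is a simple tokenizer that yields runs of alphabetic characters, dropping punctuation and digits. It returns a generator, so the result is wrapped in `list()` above.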
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['Hello', 'world', 'How', 'are', 'you']
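
A standard-library sketch of the same idea, assuming the tokenizer keeps runs of word characters that are not digits. The regex approximates Gensim's behaviour as I understand it; the function name `alphabetic_tokens` is hypothetical.

```python
import re

# Runs of word characters, excluding digits.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def alphabetic_tokens(text, lowercase=False):
    """Lazily yield alphabetic tokens, optionally lowercased."""
    if lowercase:
        text = text.lower()
    for match in ALPHABETIC.finditer(text):
        yield match.group()

print(list(alphabetic_tokens("Train 12951 departs at 5pm!")))
```

Yielding tokens one at a time avoids building a full list in memory, which matters when streaming over a large corpus.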

j. **Tokenization with Keras**:
   - Definition: Tokenization with Keras is similar to word tokenization but converts the text to lowercase and removes punctuation.
   - Example:
     Input: "Hello, world! How are you?"
     Output: ['hello', 'world', 'how', 'are', 'you']
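
The lowercase-filter-split behaviour can be reproduced without Keras. This sketch assumes the default filter set of `text_to_word_sequence` is the string below (note that it does not include the apostrophe); the function name `simple_word_sequence` is hypothetical.

```python
# Assumed default filter characters; the apostrophe is deliberately absent.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop empty strings produced by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]

print(simple_word_sequence("Well-oiled machinery, don't stop!"))
```

Under this assumption hyphenated words such as "well-oiled" are split in two, while contractions such as "don't" stay whole.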

1. **Word Tokenization**:
   - Insight: Word tokenization is a fundamental step in natural language processing (NLP) tasks, breaking down text into its constituent words.
   - Applications: Sentiment analysis, text classification, language translation, named entity recognition.

2. **Sentence Tokenization**:
   - Insight: Sentence tokenization divides text into individual sentences, enabling analysis at the sentence level, a coarser unit than words.
   - Applications: Text summarization, machine translation, text-to-speech systems, document indexing.

3. **Punctuation-based Tokenizer**:
   - Insight: Punctuation-based tokenizer segments text based on punctuation marks, preserving them as separate tokens.
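   - Applications: Sentiment analysis, analyzing punctuation usage, simple rule-based preprocessing. Note that URLs, email addresses, and emoticons are broken into several tokens, so they need a different tokenizer if they must stay intact.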

4. **Treebank Word Tokenizer**:
   - Insight: Treebank word tokenizer follows conventions used in the Penn Treebank corpus, providing standardized tokenization.
   - Applications: Part-of-speech tagging, dependency parsing, named entity recognition, syntactic analysis.

5. **Tweet Tokenizer**:
   - Insight: Tweet tokenizer handles tokenization in the context of social media text, which often contains hashtags, mentions, and emojis.
   - Applications: Social media sentiment analysis, trend analysis, user profiling, brand monitoring.

6. **Multi-Word Expression Tokenizer**:
   - Insight: Multi-word expression tokenizer recognizes and treats multi-word phrases as single tokens, preserving their semantic integrity.
   - Applications: Named entity recognition, extracting domain-specific terminology, phrase-based machine translation.

7. **TextBlob Word Tokenize**:
   - Insight: TextBlob word tokenize provides a simple and easy-to-use tokenization method integrated with other NLP functionalities.
   - Applications: Basic NLP tasks like sentiment analysis, part-of-speech tagging, text classification.

8. **spaCy Tokenizer**:
   - Insight: spaCy tokenizer produces `Token` objects that later pipeline components enrich with linguistic features such as part-of-speech tags and named entities.
   - Applications: Advanced NLP tasks including entity recognition, information extraction, syntactic parsing.

9. **Gensim word tokenizer**:
   - Insight: Gensim word tokenizer is a lightweight tool for basic tokenization tasks, particularly suited for large-scale text processing.
   - Applications: Topic modeling, document clustering, word embedding generation, semantic similarity calculation.

10. **Tokenization with Keras**:
    - Insight: Tokenization with Keras is tailored for text preprocessing in deep learning models, converting text to lowercase and removing punctuation.
    - Applications: Text classification, sentiment analysis, sequence-to-sequence tasks, neural machine translation.