The **Natural Language Toolkit (nltk)** is a powerful Python library used for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

### Introduction to NLP and NLTK
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. It underpins tasks such as:

1. Text Classification
2. Sentiment Analysis
3. Named Entity Recognition (NER)
4. Machine Translation

The `nltk` library simplifies NLP tasks by providing pre-built functions for manipulating and analyzing textual data.

### Installation of NLTK

In [16]:
%pip install nltk
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


Once installed, you can download the necessary datasets and models using:

In [17]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Basic Text Processing with NLTK

#### Tokenization
Breaking down text into smaller parts like sentences or words is the first step.

##### **Sentence Tokenization:** Tokenizes a paragraph into individual sentences.

In [18]:
from nltk.tokenize import sent_tokenize
text = "Hello! Welcome to the world of NLTK. This is a simple example."
sentences = sent_tokenize(text)
print(sentences)

['Hello!', 'Welcome to the world of NLTK.', 'This is a simple example.']


##### **Word Tokenization:** Tokenizes sentences into individual words.

In [19]:
from nltk.tokenize import word_tokenize
text = "Hello! Welcome to the world of NLTK. This is a simple example."
words = word_tokenize(text)
print(words)

['Hello', '!', 'Welcome', 'to', 'the', 'world', 'of', 'NLTK', '.', 'This', 'is', 'a', 'simple', 'example', '.']
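If punctuation tokens are not wanted, one alternative is a regular-expression tokenizer. The sketch below uses NLTK's `RegexpTokenizer` with the pattern `\w+`. This is a different technique from `word_tokenize`, and the variable names are illustrative.

```python
from nltk.tokenize import RegexpTokenizer

text = "Hello! Welcome to the world of NLTK. This is a simple example."

# Keep only runs of word characters, so punctuation marks are dropped.
tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(text)
print(word_tokens)
```

Because the pattern matches only letters, digits, and underscores, contractions such as "don't" would be split at the apostrophe, so this approach suits simple text best.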


#### Stopwords Removal
Many words in a language (like "the," "is," "in") don’t carry much meaning in terms of analysis. These are called stopwords.

In [20]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "Hello! Welcome to the world of NLTK. This is a simple example."
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
# NLTK's stopword list is lowercase, so capitalized tokens such as 'This' are kept.
filtered_words = [w for w in words if w not in stop_words]
print(filtered_words)

['Hello', '!', 'Welcome', 'world', 'NLTK', '.', 'This', 'simple', 'example', '.']
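The output above still contains "This" and the punctuation tokens, because the comparison is case-sensitive and punctuation is not a stopword. A minimal sketch of a stricter filter follows. The small `mini_stop_words` set is a hypothetical stand-in so the example runs without downloaded data; in practice you would use `set(stopwords.words('english'))`.

```python
# Tokens as produced by word_tokenize for the example sentence.
tokens = ['Hello', '!', 'Welcome', 'to', 'the', 'world', 'of', 'NLTK', '.',
          'This', 'is', 'a', 'simple', 'example', '.']

# Hypothetical mini stopword list, for illustration only.
mini_stop_words = {'to', 'the', 'of', 'this', 'is', 'a'}

# Compare in lowercase and keep alphabetic tokens only.
content_words = [t for t in tokens
                 if t.isalpha() and t.lower() not in mini_stop_words]
print(content_words)
```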


#### Stemming and Lemmatization


**Stemming** is like a quick cut to simplify words, whereas **lemmatization** is like getting back to the dictionary form.

##### Stemming

**Stemming**: Reducing words to a root form by stripping suffixes with fixed rules; the Porter stemmer also lowercases its input (e.g., "Running" → "run"). The result is not always a valid word.

In [21]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered_words]
print(stemmed)

['hello', '!', 'welcom', 'world', 'nltk', '.', 'thi', 'simpl', 'exampl', '.']
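As the output above shows ("welcom", "thi"), stems are not always dictionary words. The short sketch below applies the same `PorterStemmer` to a few inflected forms to make that behavior easier to see; the word list is an arbitrary illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few inflected forms chosen for illustration.
examples = ["running", "runs", "studies", "easily"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in examples}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both "running" and "runs" collapse to "run", while "studies" becomes the non-word "studi", which is the trade-off lemmatization tries to avoid.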


##### Lemmatization

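**Lemmatization**: Similar to stemming, but it returns valid dictionary words by using vocabulary and morphological analysis (e.g., "running" → "run" when tagged as a verb). `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which is why the output below is unchanged.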

In [22]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in filtered_words]
print(lemmed)

['Hello', '!', 'Welcome', 'world', 'NLTK', '.', 'This', 'simple', 'example', '.']


###### What is Morphological Analysis?

**Morphological analysis** in the context of **lemmatization** refers to the process of understanding the structure and formation of words to reduce them to their base or dictionary form, known as the **lemma**. It involves analyzing the various grammatical forms a word can take (like tense, gender, or number) and understanding how these forms are constructed through inflectional changes (prefixes, suffixes, etc.).

**Key Concepts:**
1. **Lemmatization**: The process of reducing a word to its base or dictionary form (lemma).
2. **Morphology**: The study of the structure of words and how they change to express grammatical features like tense, number, etc.

**How Morphological Analysis Works in Lemmatization:**
When lemmatization is applied, it uses both the word's **morphology** (form and structure) and its **context** to find the base form. With NLTK's `WordNetLemmatizer`, that context is supplied as a part-of-speech tag through the `pos` argument. This is different from **stemming**, which simply cuts off word endings without regard to grammatical rules or context.

**Example 1: Verb Conjugation**
Take the word "running":

1. **Morphological Analysis**: The word "running" is a present participle of the verb "run". The "-ing" suffix indicates a continuous action.
2. **Lemmatization**: The base form or lemma of "running" is "run".

In [23]:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Example
print(lemmatizer.lemmatize('running', pos='v'))  # Verb context

run


**Example 2: Nouns with Plural Forms**
For the word "geese":

1. **Morphological Analysis**: The word "geese" is the plural form of "goose". In English, it uses an irregular plural form.
2. **Lemmatization**: The lemma or base form is "goose".

In [24]:
print(lemmatizer.lemmatize('geese', pos='n'))  # Noun context

goose


**Example 3: Adjectives with Comparative and Superlative Forms**
For the word "better":

1. **Morphological Analysis**: "Better" is the comparative form of "good". Morphologically, it represents a higher degree of goodness.
2. **Lemmatization**: The lemma is "good".

In [25]:
print(lemmatizer.lemmatize('better', pos='a'))  # Adjective context

good


**Morphological Rules Considered in Lemmatization:**
1. **Verb tenses**: Runs → run, Running → run, Ran → run
2. **Plural nouns**: Geese → goose, Dogs → dog
3. **Comparative/superlative adjectives**: Better → good, Best → good
4. **Inflectional endings**: Laughing → laugh, Studies → study

**Importance in Natural Language Processing (NLP):**
Morphological analysis through lemmatization ensures that words with different grammatical forms (tenses, plurals, etc.) are understood as the same root concept. This is particularly useful for tasks like text classification, sentiment analysis, and search engines, where word variants should be treated uniformly.

### POS Tagging (Parts of Speech Tagging)
This step involves labeling words with their respective parts of speech (nouns, verbs, adjectives, etc.).

In [26]:
from nltk import pos_tag
tagged = pos_tag(words)
print(tagged)

[('Hello', 'NN'), ('!', '.'), ('Welcome', 'UH'), ('to', 'TO'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'), ('NLTK', 'NNP'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('example', 'NN'), ('.', '.')]
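One common use of these tags is to tell the lemmatizer which part of speech each word has. The tagger emits Penn Treebank tags (`NN`, `VBZ`, `JJ`, ...), while `WordNetLemmatizer` expects the single-letter codes `'n'`, `'v'`, `'a'`, and `'r'`. The helper below, `penn_to_wordnet`, is a hypothetical name for a minimal mapping sketch and is not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

# A slice of the tagged output shown above.
tagged_sample = [("This", "DT"), ("is", "VBZ"), ("a", "DT"),
                 ("simple", "JJ"), ("example", "NN")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged_sample]
print(wordnet_tags)
```

With the WordNet data downloaded, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.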


### Named Entity Recognition (NER)
Identifying proper names in text (people, places, organizations).

In [27]:
from nltk import ne_chunk
entities = ne_chunk(tagged)
entities.draw()  # Opens a window showing the entity tree; print(entities) gives a text version
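The result of `ne_chunk` is an `nltk.Tree` in which each recognized entity is a labeled subtree. The sketch below shows one way to collect entities as plain tuples. The `extract_entities` helper is a hypothetical name, and `sample_tree` is built by hand as a stand-in for real chunker output, so its `ORGANIZATION` label is illustrative and not a claim about what the chunker returns for this sentence.

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Return (entity text, entity label) pairs from a chunked tree."""
    found = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entities are nested Trees.
        if isinstance(node, Tree):
            words = [token for token, tag in node.leaves()]
            found.append((" ".join(words), node.label()))
    return found

# Hand-built stand-in for the tree that ne_chunk would return.
sample_tree = Tree("S", [("Welcome", "UH"), ("to", "TO"),
                         Tree("ORGANIZATION", [("NLTK", "NNP")]),
                         (".", ".")])
print(extract_entities(sample_tree))
```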

### Text Classification with NLTK
NLTK provides functionality for building basic text classifiers. A simple text classification pipeline includes:

1. Feature extraction
2. Training classifiers (Naive Bayes, Decision Trees, etc.)

#### Feature Extraction
Convert textual data into a format that machine learning models can understand. Bag of Words (BoW) is a common approach:

In [28]:
from nltk import FreqDist
fdist = FreqDist(words)
print(fdist.most_common(5))  # Most common words as features

[('.', 2), ('Hello', 1), ('!', 1), ('Welcome', 1), ('to', 1)]
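Frequency counts become a Bag of Words representation once every document is described over the same vocabulary. The sketch below uses only the standard library to build count vectors for two toy documents; the documents and variable names are made up for illustration.

```python
from collections import Counter

# Two toy documents, already tokenized and lowercased.
docs = [["nltk", "is", "simple"],
        ["nltk", "is", "a", "toolkit", "nltk"]]

# Shared vocabulary, sorted so that column order is stable.
vocabulary = sorted({token for doc in docs for token in doc})

# One count vector per document, aligned with the vocabulary.
vectors = []
for doc in docs:
    counts = Counter(doc)
    vectors.append([counts[token] for token in vocabulary])

print(vocabulary)
print(vectors)
```

Word order is discarded: each vector records only how often each vocabulary word occurs in the document.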


#### Text Classification Example (Naive Bayes)
We’ll classify movie reviews from NLTK's `movie_reviews` corpus as positive or negative.

In [29]:
import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def extract_features(word_list):
    return {word: True for word in word_list}

movie_reviews_data = [(extract_features(list(movie_reviews.words(fileid))), category)
                      for category in movie_reviews.categories()
                      for fileid in movie_reviews.fileids(category)]
random.shuffle(movie_reviews_data)
train_data = movie_reviews_data[:1600]
test_data = movie_reviews_data[1600:]

classifier = NaiveBayesClassifier.train(train_data)
print(f"Accuracy: {accuracy(classifier, test_data)}")

Accuracy: 0.7675
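To see the input format more clearly, here is a toy version of the same idea that needs no corpus download. Each training item is a `(feature_dict, label)` pair, exactly as `extract_features` produces above. The four training examples are invented for illustration and far too few for a real model.

```python
from nltk.classify import NaiveBayesClassifier

# Invented training data: presence features paired with a label.
toy_train = [
    ({"good": True, "fun": True}, "pos"),
    ({"great": True, "fun": True}, "pos"),
    ({"bad": True, "boring": True}, "neg"),
    ({"awful": True, "boring": True}, "neg"),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# Classify new feature dictionaries.
label_fun = toy_classifier.classify({"fun": True})
label_boring = toy_classifier.classify({"boring": True})
print(label_fun, label_boring)
```

On the movie-review classifier trained above, `classifier.show_most_informative_features(10)` prints the words that best separate the two classes, which is a quick sanity check on what the model learned.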


### WordNet - Lexical Database
WordNet is a lexical database for the English language, that is, a database about words and vocabulary. It groups words into sets of synonyms called synsets and records their definitions and the relationships between them.

In [30]:
from nltk.corpus import wordnet
synsets = wordnet.synsets("car")  # One synset per sense of "car"
for synset in synsets:
    print(synset.definition())  # Print the definition of each sense


a motor vehicle with four wheels; usually propelled by an internal combustion engine
a wheeled vehicle adapted to the rails of railroad
the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
where passengers ride up and down
a conveyance for passengers or freight on a cable railway


### Applying NLTK for Machine Learning
NLTK’s text processing tools are foundational for building machine learning models. Here’s an example of how you might approach an NLP problem using NLTK with machine learning models:

1. **Preprocess the text**: Tokenization, stopwords removal, stemming/lemmatization.
2. **Feature Extraction**: Extract features using BoW, TF-IDF, or word embeddings.
3. **Model Building**: Use classifiers like Naive Bayes, SVM, or neural networks (RNN, LSTM) on the processed data.
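As an illustration of the feature-extraction step, the sketch below computes TF-IDF weights with the standard library only. It assumes the plain, unsmoothed formulation (term frequency times `log(N / df)`); libraries often use smoothed variants, so their numbers will differ. The toy documents and the `tf_idf` function name are hypothetical.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of preprocessed tokens.
docs = [["good", "movie"],
        ["bad", "movie"],
        ["good", "good", "plot"]]

def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

Terms that appear in fewer documents, such as "plot", receive a larger inverse-document-frequency factor than terms shared across documents, such as "movie". The resulting dictionaries can be passed to a classifier in the model-building step.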