## 01 Import NLTK
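Import NLTK and download the data packages used in this notebook: the Punkt tokenizer models, the stop-word lists, WordNet, and the averaged perceptron POS tagger. Newer NLTK releases may additionally ask for `punkt_tab` and `averaged_perceptron_tagger_eng`; if a later cell raises a `LookupError`, download the resource it names.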

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## 02 Convert text to lower case

In [2]:
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
lower_text = text.lower()
print(lower_text)

this is a demo text for nlp using nltk. full form of nltk is natural language toolkit


## 03 Word tokenize
Tokenize the text to get its tokens, i.e. break the sentences into words and punctuation marks.

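In [3]:
import nltk
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
print(word_tokens)

['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']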
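To make the idea of a token concrete, here is a minimal sketch that approximates word tokenization with a single regular expression from the standard library. It is not how `nltk.word_tokenize` works internally; the pattern and the name `rough_tokens` are assumptions chosen for illustration. It only shows the basic behaviour of splitting words apart and treating punctuation as separate tokens.

```python
import re

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Either a run of word characters, or a single character that is
# neither a word character nor whitespace (i.e. punctuation).
rough_tokens = re.findall(r"\w+|[^\w\s]", text)
print(rough_tokens)
```

A real tokenizer also handles contractions, abbreviations, and other edge cases that this pattern ignores.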

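## 04 Sent tokenize
Tokenize the text into sentences when it contains more than one, i.e. break the text into a list of sentences. The example includes the abbreviation "Dr." to show that its period is not treated as a sentence boundary.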

In [4]:
text = "This is a for Dr. No a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
sent_token = nltk.sent_tokenize(text)
print(sent_token)

['This is a for Dr. No a Demo Text for NLP using NLTK.', 'Full form of NLTK is Natural Language Toolkit']
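For contrast, the sketch below splits the same text naively after every `.`, `!`, or `?` followed by whitespace, using only the standard library. The name `naive_sentences` is an assumption for illustration. The naive rule breaks the text after "Dr.", which is the kind of mistake a trained sentence tokenizer is meant to avoid.

```python
import re

text = "This is a for Dr. No a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Split wherever sentence-ending punctuation is followed by whitespace.
# This rule knows nothing about abbreviations such as "Dr.".
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
```

The naive split yields three pieces instead of the two sentences returned by `nltk.sent_tokenize`.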


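## 05 Stop words removal
Remove very common words such as "is", "the", and "a" using NLTK's stop-word list, since they carry little information on their own. The list is lower-case and the comparison below is case-sensitive, so a capitalised token such as "This" is not removed.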

In [6]:
import nltk
from nltk.corpus import stopwords

# NLTK's English stop-word list (all entries are lower-case)
stopword = stopwords.words('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
# Case-sensitive membership test: 'This' does not match 'this' and is kept
removing_stopwords = [word for word in word_tokens if word not in stopword]
print(removing_stopwords)

['This', 'Demo', 'Text', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'NLTK', 'Natural', 'Language', 'Toolkit']
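If capitalised stop words should be removed as well, one option is to lower-case each token only for the membership test. The sketch below uses a small hand-written set, `TOY_STOPWORDS`, as a stand-in for NLTK's list (an assumption, so that the example runs without any downloads); the token list is copied from the tokenizer output for this sentence.

```python
# Hypothetical stand-in for stopwords.words('english'); much smaller than the real list.
TOY_STOPWORDS = {"this", "is", "a", "for", "of", "the"}

tokens = ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.',
          'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']

# Compare the lower-cased token, but keep the original spelling in the result.
filtered = [tok for tok in tokens if tok.lower() not in TOY_STOPWORDS]
print(filtered)
```

Using a set rather than a list also makes each membership test constant time, which matters on larger texts.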


## 06 Lemma
Lemmatize the text to get each word's dictionary form, e.g. "functions" becomes "function".

Lemmatization is a linguistic process that reduces inflected words to their dictionary or base form, known as the "lemma." For example, "running," "runs," and "ran" all share the lemma "run."

In natural language processing (NLP), lemmatization is often used to group different forms of a word so that they can be analyzed as a single entity. This is useful in various tasks like text mining, text analysis, and information retrieval, as it allows these systems to generalize better across different forms of a word, thereby increasing the accuracy and effectiveness of such systems.

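The Natural Language Toolkit (NLTK) is a Python library that provides tools to work with human language data (text). NLTK includes a WordNet-based lemmatizer, `WordNetLemmatizer`, that reduces words to their base form. It treats every word as a noun unless a part of speech is supplied, which is why "barking" and "are" come through unchanged in the cell below.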

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer looks words up in WordNet; it is not a stemmer
wordnet_lemmatizer = WordNetLemmatizer()
text = "the dogs are barking outside. Are the cats in the garden?"
word_tokens = nltk.word_tokenize(text)
# lemmatize() assumes pos='n' (noun) by default, so verb forms stay as they are
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
print(lemmatized_word)

['the', 'dog', 'are', 'barking', 'outside', '.', 'Are', 'the', 'cat', 'in', 'the', 'garden', '?']
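The output above leaves "barking" and "are" untouched because the lemmatizer was not told they are verbs. The toy sketch below illustrates why the part of speech matters for a dictionary-based lemmatizer. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the tiny lookup table is a stand-in for WordNet, not NLTK's implementation; the real `WordNetLemmatizer.lemmatize` accepts a `pos` argument in the same position.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("dogs", "n"): "dog",
    ("cats", "n"): "cat",
    ("barking", "v"): "bark",
    ("are", "v"): "be",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is not listed."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("dogs"))              # looked up as a noun
print(toy_lemmatize("barking"))           # not listed as a noun, returned unchanged
print(toy_lemmatize("barking", pos="v"))  # found once it is looked up as a verb
```

In practice the part of speech would come from a tagger such as `nltk.pos_tag` rather than being typed by hand.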


## 07 Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.

Stemming is another text normalization technique similar to lemmatization, used in natural language processing and information retrieval. The primary goal of stemming is to reduce inflected or derived words to their root form, but unlike lemmatization, the root form produced by stemming may not always be a valid or meaningful word in the language.

For example, a stemmer reduces "running" and "runs" to "run" by stripping suffixes, much as lemmatization would. Because it only applies suffix rules, it typically leaves an irregular form such as "ran" unchanged, and it can produce non-words: the Porter family of stemmers reduces "flies" to "fli" and "happiness" to "happi."

Stemming and lemmatization also differ in cost:

- **Algorithm Complexity**: Lemmatization is generally more complex and computationally expensive than stemming, because it often needs contextual information such as a word's part of speech and relies on a dictionary or vocabulary.
- **Speed**: Stemming algorithms are typically faster because they apply a fixed set of heuristic rules and need no vocabulary or dictionary. This makes stemming suitable for tasks where speed is a critical factor.

In [8]:
import nltk
from nltk.stem import SnowballStemmer

# The Snowball English stemmer (Porter2) is a refinement of the Porter stemming algorithm
snowball_stemmer = SnowballStemmer('english')
text = "the dogs are barking outside. Are the cats in the garden?"
word_tokens = nltk.word_tokenize(text)
# stem() also lower-cases its input, so 'Are' becomes 'are'
stemmed_word = [snowball_stemmer.stem(word) for word in word_tokens]
print(stemmed_word)

['the', 'dog', 'are', 'bark', 'outsid', '.', 'are', 'the', 'cat', 'in', 'the', 'garden', '?']
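To show how rule-based suffix stripping ends up with stems like "fli" and "happi", here is a deliberately tiny stemmer written with the standard library only. The rule table `TOY_RULES` and the function `toy_stem` are assumptions made up for this sketch; the real Porter and Snowball algorithms use many more rules and conditions.

```python
# Hypothetical (suffix, replacement) rules, tried in order; the first match wins.
TOY_RULES = [("ies", "i"), ("ness", ""), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Lower-case the word and strip the first matching suffix, if any."""
    w = word.lower()
    for suffix, replacement in TOY_RULES:
        # Require at least two characters to remain before the suffix.
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)] + replacement
    return w

for word in ["flies", "happiness", "barking", "dogs", "ran", "running"]:
    print(word, "->", toy_stem(word))
```

No dictionary is consulted, so the rules run quickly but cannot relate "ran" to "run", and this toy version turns "running" into "runn" because it lacks the double-consonant rule that real stemmers include.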


## 08 Get word frequency
Count how often each word occurs using NLTK's `FreqDist` class.


In [9]:
import nltk
from nltk import FreqDist
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word = nltk.word_tokenize(text.lower())
freq = FreqDist(word)
print(freq.most_common(5))

[('is', 2), ('nltk', 2), ('this', 1), ('a', 1), ('demo', 1)]
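`FreqDist` is built on Python's `collections.Counter`, so the same kind of count can be sketched with the standard library alone. In the example below the regular expression is an assumption used in place of `nltk.word_tokenize`; it keeps only runs of letters, so the period is not counted as a token.

```python
import re
from collections import Counter

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Lower-case first so that 'NLTK' and 'nltk' are counted together.
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

top_two = counts.most_common(2)
print(top_two)
```

Words with equal counts are returned in the order they were first seen.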


## 09 Part of Speech tags
A part-of-speech (POS) tag records the grammatical category of each word, for example whether it is a noun, a verb, or an adjective.

In [11]:
import nltk
text = "the dogs are barking outside."
word = nltk.word_tokenize(text)
pos_tag = nltk.pos_tag(word)
print(pos_tag)


[('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'), ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]


The `nltk.pos_tag` function labels each token with a tag from the Penn Treebank tag set, one of the most widely used English part-of-speech tag sets in academic and industrial NLP work.

Here are some commonly used Penn Treebank POS tags and their meanings:

- `NN`: Noun, singular or mass
- `NNS`: Noun, plural
- `VB`: Verb, base form
- `VBD`: Verb, past tense
- `VBG`: Verb, gerund or present participle
- `VBN`: Verb, past participle
- `VBP`: Verb, non-3rd person singular present
- `VBZ`: Verb, 3rd person singular present
- `JJ`: Adjective
- `JJR`: Adjective, comparative
- `JJS`: Adjective, superlative
- `RB`: Adverb
- `RBR`: Adverb, comparative
- `RBS`: Adverb, superlative
- `IN`: Preposition or subordinating conjunction
- `DT`: Determiner
- `CC`: Coordinating conjunction
- `PRP`: Personal pronoun
- `PRP$`: Possessive pronoun
- `MD`: Modal
- `CD`: Cardinal number
- `WP`: Wh-pronoun
- `WDT`: Wh-determiner

... and many more.

For a more comprehensive list of these part-of-speech tags and their explanations, you can refer to the [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) webpage.
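A common use of these tags is to tell a lemmatizer which part of speech to assume. The sketch below maps a Penn Treebank tag to the single-letter codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`) based on the tag's first letter. The helper name `penn_to_wordnet` and the fallback to noun are assumptions for illustration, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("R"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # noun, and the fallback for every other tag

# Tagged tokens copied from the pos_tag output earlier in this section.
tagged = [('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'),
          ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each resulting letter could then be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`, so that verbs such as "barking" are looked up as verbs.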

