## Preprocessing

Preprocessing is an important step that reduces the complexity of raw text and makes downstream tasks easier. It should be done with care, because aggressive preprocessing can also discard important information.

### Stemming

Stemming reduces a word to its stem by stripping suffixes. For example, "jumping", "jumps" and "jumped" are all reduced to the stem "jump". This reduces the number of distinct words that need to be stored in the vocabulary. The stem is not always a valid word (for example, "achieve" becomes "achiev"), so whether stemming helps or hurts depends on the problem being solved.

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/adityasingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [3]:
stemmer = PorterStemmer()

In [4]:
print(stemmer.stem("cat")) # -> cat
print(stemmer.stem("cats")) # -> cat

print(stemmer.stem("walking")) # -> walk
print(stemmer.stem("walked")) # -> walk

print(stemmer.stem("achieve")) # -> achiev

print(stemmer.stem("am")) # -> am
print(stemmer.stem("is")) # -> is
print(stemmer.stem("are")) # -> are

cat
cat
walk
walk
achiev
am
is
are


In [5]:
text = "The cats are sleeping. What are the dogs doing?"

tokens = word_tokenize(text)
tokens_stemmed = [stemmer.stem(token) for token in tokens]
print(tokens_stemmed)

['the', 'cat', 'are', 'sleep', '.', 'what', 'are', 'the', 'dog', 'do', '?']


### Lemmatization

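Lemmatization is similar to stemming in that it reduces a word to a base form (the lemma). The difference is that lemmatization uses a vocabulary and morphological analysis of the word, so the result is a valid dictionary word. For example, "better" has "good" as its lemma, which lemmatization can identify when it knows the word is an adjective, but stemming cannot. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed through the pos argument, which is why "walking" and "better" come back unchanged in the cell below. Lemmatization is computationally more expensive than stemming.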

In [6]:
import nltk
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/adityasingh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/adityasingh/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/adityasingh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [8]:
lemmatizer = WordNetLemmatizer()

In [10]:
print(lemmatizer.lemmatize("cat")) # -> cat
print(lemmatizer.lemmatize("cats")) # -> cat

print(lemmatizer.lemmatize("walking")) # -> walking (treated as a noun by default)
print(lemmatizer.lemmatize("walked")) # -> walked

print(lemmatizer.lemmatize("achieve")) # -> achieve

print(lemmatizer.lemmatize("am")) # -> am
print(lemmatizer.lemmatize("is")) # -> is
print(lemmatizer.lemmatize("are")) # -> are
print(lemmatizer.lemmatize("better")) # -> better
print(lemmatizer.lemmatize("good")) # -> good

cat
cat
walking
walked
achieve
am
is
are
better
good


### Stopwords

Stopwords are words that are very common in a language and carry little weight in many language-related tasks, such as "a", "an", "the", "is" and "in". However, they can play a significant role in some tasks, such as sentiment analysis (where a word like "not" changes the meaning), so whether to remove stopwords should depend on the task at hand.

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adityasingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')

In [20]:
print(len(english_stopwords))
print(english_stopwords[:10])

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
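
To show how such a list is typically applied, here is a minimal sketch of filtering stopwords out of a token list. The helper `remove_stopwords` and the small `sample_stopwords` set are hypothetical and only for illustration; in practice the full list from `stopwords.words('english')` would be used instead.

```python
# Illustrative subset only; NLTK's English list is much longer.
sample_stopwords = {"a", "an", "the", "is", "are", "in", "what"}

def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ["The", "cats", "are", "sleeping", ".", "What", "are", "the", "dogs", "doing", "?"]
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)  # -> ['cats', 'sleeping', '.', 'dogs', 'doing', '?']
```

Converting the list to a `set` first keeps each membership check fast when the stopword list is large.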


### Tokenization

Tokenization is generally the first step in an NLP task. It breaks the text into tokens, which may be words, sentences, or some other basic unit suited to the problem.
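
As a rough illustration of word-level and sentence-level tokens, the sketch below uses regular expressions from the standard library. The functions `simple_word_tokenize` and `simple_sentence_split` are hypothetical simplifications, not NLTK's tokenizers, which handle abbreviations, contractions and other edge cases far better.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_split(text):
    # Split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "The cats are sleeping. What are the dogs doing?"
print(simple_word_tokenize(text))
# -> ['The', 'cats', 'are', 'sleeping', '.', 'What', 'are', 'the', 'dogs', 'doing', '?']
print(simple_sentence_split(text))
# -> ['The cats are sleeping.', 'What are the dogs doing?']
```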

### Lowercase

Lowercasing converts all of the text to lower case. It helps to reduce the vocabulary size: for example, the words "India" and "INDIA" are both converted to "india".
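
A short sketch of the effect on vocabulary size, using only built-in string methods. The last line is an added illustration of how lowercasing can also lose information.

```python
words = ["India", "INDIA", "india"]
lowered = [w.lower() for w in words]

print(lowered)            # -> ['india', 'india', 'india']
print(len(set(words)))    # -> 3 distinct forms before lowercasing
print(len(set(lowered)))  # -> 1 distinct form after lowercasing

# Caveat: the country abbreviation "US" becomes indistinguishable from the pronoun "us".
print("US".lower())       # -> us
```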

### Punctuation removal

Punctuation removal strips punctuation marks from the data. Punctuation can provide useful context in some problem types, so it should only be removed when it is not needed for the task.
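
One common way to do this in plain Python is `str.translate` with `string.punctuation`. The helper name `remove_punctuation` is hypothetical, and the sketch only covers ASCII punctuation.

```python
import string

def remove_punctuation(text):
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Hello, my name is Aditya Singh!"))
# -> Hello my name is Aditya Singh
print(remove_punctuation("Wait... what?!"))
# -> Wait what
print(remove_punctuation("don't"))
# -> dont
```

The last call shows the kind of context that can be lost: the apostrophe in a contraction is removed along with everything else.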

### Spell correction

Spell check and correction are essential for identifying and fixing typos and spelling errors in text data. This process helps reduce redundancy by ensuring that words like "speling" and "spelling" are recognized as the same word after correction.
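
A minimal sketch of the idea, using `difflib.get_close_matches` from the standard library as a stand-in for a real spell checker. The `vocabulary` list, the `correct` helper and the `cutoff` value are assumptions chosen for illustration; a practical system would use a full dictionary and word frequencies.

```python
import difflib

# Hypothetical tiny vocabulary of known, correctly spelled words.
vocabulary = ["spelling", "correction", "language", "text"]

def correct(word, vocab=vocabulary, cutoff=0.8):
    """Return the closest known word, or the input unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("speling"))   # -> spelling
print(correct("langauge"))  # -> language
print(correct("xyz"))       # -> xyz (no close match, left as is)
```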

### Noise removal

Noise removal is the process of eliminating unwanted characters, digits, and text fragments that can disrupt text analysis. This can include removing headers, footers, HTML, and XML content.
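
The sketch below shows one simple way to strip markup and digits with the standard library. The `remove_noise` helper is hypothetical, and the regex-based tag stripping is an assumption that only suits simple markup; real HTML is better handled with a proper parser.

```python
import html
import re

def remove_noise(text):
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML/XML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"\d+", " ", text)          # drop runs of digits
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

raw = "<div><p>Cats &amp; dogs, page 12</p></div>"
print(remove_noise(raw))  # -> Cats & dogs, page
```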

### Text normalization

Text normalization involves standardizing text by converting it to the same case (usually lowercase), removing punctuation, and converting numbers to their word forms.
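
These steps can be combined into a single function, sketched below. The `normalize` helper and the `DIGIT_WORDS` mapping are hypothetical, and the mapping covers single digits only; larger numbers would need a fuller number-to-words converter.

```python
import re
import string

# Hypothetical mapping for single digits only.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(text):
    text = text.lower()                                               # same case
    text = text.translate(str.maketrans("", "", string.punctuation))  # no punctuation
    # Replace each standalone single digit with its word form.
    text = re.sub(r"\b[0-9]\b", lambda m: DIGIT_WORDS[m.group()], text)
    return re.sub(r"\s+", " ", text).strip()                          # tidy whitespace

print(normalize("I have 2 Cats, and 3 DOGS!"))
# -> i have two cats and three dogs
```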

### POS tagging

Part-of-speech tagging identifies the grammatical category (noun, verb, adjective, etc.) of each word in a sentence. It is valuable for understanding sentence structure and aids in tasks like named entity recognition and question answering.

In [21]:
import nltk

In [23]:
text = "Hello, my name is Aditya Singh"
tokens = nltk.word_tokenize(text)  # separate name, so the imported word_tokenize function is not overwritten
print(nltk.pos_tag(tokens))

[('Hello', 'NNP'), (',', ','), ('my', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Aditya', 'NNP'), ('Singh', 'NNP')]


In [24]:
text = "I love my country that is India"
tokens = nltk.word_tokenize(text)  # separate name, so the imported word_tokenize function is not overwritten
print(nltk.pos_tag(tokens))

[('I', 'PRP'), ('love', 'VBP'), ('my', 'PRP$'), ('country', 'NN'), ('that', 'WDT'), ('is', 'VBZ'), ('India', 'NNP')]
