# Preprocessing Techniques for NLP

## Stop Words

Stop words are words that are generally filtered out before processing natural language text.
They are very common words that carry little meaningful information about the content of the text.
Removing stop words reduces noise in the data and speeds up computation.
Examples: "the", "is", "at", "which", "and", "on", "in", "of", "to", etc. Punctuation marks can be added to the list as well.

To remove stop words, we will use the NLTK library. NLTK stands for Natural Language Toolkit. It is a leading platform for building Python programs to work with human language data, particularly in the field of natural language processing (NLP). NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, among other NLP tasks.

In [1]:
from nltk.corpus import stopwords
from string import punctuation
import nltk



In [2]:
#Example sentence to perform stop words operation
sentence="""NLTK is a leading platform for building Python 
programs to work with human language data."""

In [3]:
stop_words=stopwords.words("english")
 

If the cell above raises a LookupError, run nltk.download('stopwords') once to fetch the stop word corpus, then run the cell again.

In [4]:
#Example stop words
print(stop_words[0:20]) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [5]:
punc=list(punctuation)

In [6]:
stop_words.extend(punc)

The stop_words list now contains both the English stop words and the punctuation marks.

Next, we convert the sentence to lower case and split it into a list of words, because the stop words are stored in lower case and the comparison is done word by word.

In [7]:
sentence_splitted=sentence.lower().split()

In [8]:
sentence_splitted

['nltk',
 'is',
 'a',
 'leading',
 'platform',
 'for',
 'building',
 'python',
 'programs',
 'to',
 'work',
 'with',
 'human',
 'language',
 'data.']

In [9]:
sentence_without_stopwords=[word for word in sentence_splitted if word not in stop_words]

In [10]:
len(sentence_without_stopwords)

10

In [11]:
len(sentence_splitted)

15

Now join the remaining words back into a sentence and compare it with the original sentence.

In [12]:
fullsentence_without_stopwords=' '.join(sentence_without_stopwords)

In [13]:
fullsentence_without_stopwords

'nltk leading platform building python programs work human language data.'

In [14]:
sentence

'NLTK is a leading platform for building Python \nprograms to work with human language data.'

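Comparing the two, the five stop words ("is", "a", "for", "to", "with") are gone. The trailing period in 'data.' is still there: split() only breaks on whitespace, so the period stays attached to the word and never matches the punctuation entries in stop_words.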
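Because split() leaves punctuation attached to words, one possible refinement is to strip punctuation from the edges of each word before filtering. The following is a minimal sketch of that idea, not part of the original notebook. The small STOP_WORDS set and the remove_stop_words helper are hypothetical stand-ins, used so the example runs without downloading the NLTK corpus; in practice stopwords.words("english") would take the place of the set.

```python
from string import punctuation

# Hypothetical mini stop list standing in for stopwords.words("english")
STOP_WORDS = {"is", "a", "for", "to", "with", "the", "and", "of", "in", "on", "at", "which"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = (word.strip(punctuation) for word in text.lower().split())
    # Skip tokens that were pure punctuation (now empty) and stop words
    return [tok for tok in tokens if tok and tok not in stop_words]


sentence = """NLTK is a leading platform for building Python
programs to work with human language data."""

filtered = remove_stop_words(sentence)
print(" ".join(filtered))
```

With this variant the last word comes out as 'data' rather than 'data.'. Stripping only the edges keeps internal punctuation such as the apostrophe in "you're" intact.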

## Tokens

In natural language processing (NLP), tokens are the individual units of text that make up a larger body of text. These units can vary depending on the level of granularity required for the task at hand. Tokenization is the process of breaking down text into these smaller units.

Here are some common types of tokens in NLP:

1. **Word Tokens**: These are tokens representing individual words in the text.

2. **Character Tokens**: These tokens represent individual characters in the text.

3. **Subword Tokens**: These tokens represent smaller linguistic units, such as prefixes, suffixes, or roots of words. Subword tokenization is often used to handle out-of-vocabulary words and improve the generalization of models.

4. **Special Tokens**: These are tokens that serve specific purposes in certain models or tasks. They are not typically part of the input text but are added during preprocessing or model training to convey additional information.
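The first three token types can be illustrated with plain Python. This is only a sketch: the subword splitter below uses a greedy longest-match strategy in the style of WordPiece, and both the wordpiece_like helper and the tiny VOCAB set are hypothetical, chosen just to make the example work. Real subword vocabularies are learned from large corpora.

```python
def wordpiece_like(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: treat the word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only
VOCAB = {"token", "##ization", "matter", "##s"}

text = "tokenization matters"
word_tokens = text.split()
char_tokens = list("token")
subword_tokens = [p for w in word_tokens for p in wordpiece_like(w, VOCAB)]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

Here "tokenization" is not in the vocabulary as a whole word, yet it can still be represented by the two known pieces "token" and "##ization", which is how subword tokenization copes with out-of-vocabulary words.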

Now, let's take the example of BERT (Bidirectional Encoder Representations from Transformers), one of the most popular models used in NLP. BERT utilizes several special tokens:

1. **[CLS]** (Classification Token): In BERT, the input to the model consists of a sequence of tokens. BERT adds a special [CLS] token at the beginning of every input sequence. This token is used to represent the sequence as a whole for tasks such as sentence classification or sentence pair classification.

2. **[SEP]** (Separator Token): BERT uses the [SEP] token to separate pairs of sentences in tasks such as question answering or natural language inference. It indicates the end of one sentence and the beginning of another.

3. **[MASK]** (Mask Token): During pretraining, BERT employs the [MASK] token to mask certain tokens in the input sequence. The model is then trained to predict these masked tokens based on the context provided by the surrounding tokens.

4. **[PAD]** (Padding Token): In order to process sequences of varying lengths efficiently, BERT pads shorter sequences with [PAD] tokens to make them equal in length to the longest sequence in the batch.

These special tokens play crucial roles in how BERT processes and understands text, enabling it to perform tasks such as text classification, named entity recognition, and question answering effectively.
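To make the layout concrete, here is a small sketch of how these special tokens could be arranged around already-tokenized text. It does not use a real BERT tokenizer; the helper names add_special_tokens, pad_batch, and mask_position are hypothetical, and the inputs are plain word lists rather than real subword tokens.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one sentence, or a sentence pair, with [CLS] and [SEP]."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
    return tokens


def pad_batch(batch):
    """Pad every sequence with [PAD] up to the longest sequence in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + ["[PAD]"] * (longest - len(seq)) for seq in batch]


def mask_position(tokens, index):
    """Return a copy of tokens with the token at index replaced by [MASK]."""
    masked = list(tokens)
    masked[index] = "[MASK]"
    return masked


single = add_special_tokens(["nltk", "is", "useful"])
pair = add_special_tokens(["where", "is", "nltk"], ["it", "is", "here"])
batch = pad_batch([single, pair])
masked = mask_position(single, 1)

for seq in batch:
    print(seq)
print(masked)
```

In this sketch the single sentence is padded with four [PAD] tokens to match the nine-token sentence pair, and the masked copy hides "nltk" so that a model would have to predict it from the surrounding tokens.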

## Tokenization

Tokenization is the process of breaking down a text or a sequence of characters into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the granularity required for the task at hand. Tokenization is a fundamental step in natural language processing (NLP) and is essential for tasks such as text analysis, sentiment analysis, language modeling, and machine translation.

Here are the common types of tokenization:

**Word Tokenization**: This type of tokenization splits the text into words based on space or punctuation boundaries. For example, the sentence "Tokenization is important in NLP." would be tokenized into ["Tokenization", "is", "important", "in", "NLP", "."] when punctuation marks are kept as separate tokens, as NLTK's word_tokenize does.

**Sentence Tokenization**: Sentence tokenization involves splitting the text into individual sentences. This is often done by identifying punctuation marks (such as periods, exclamation marks, or question marks) that denote the end of a sentence. For example, the paragraph "This is the first sentence. This is the second sentence!" would be tokenized into ["This is the first sentence.", "This is the second sentence!"].

In [15]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [16]:
sentence="""The sun sets behind the horizon,
Painting the sky with hues of orange and pink.
As darkness falls, stars twinkle in the night sky."""

In [17]:
sentence

'The sun sets behind the horizon,\nPainting the sky with hues of orange and pink.\nAs darkness falls, stars twinkle in the night sky.'

In [18]:
# Run nltk.download('punkt') once if the tokenizer data is missing (newer NLTK releases may ask for 'punkt_tab' instead)
sentence_tokenized=sent_tokenize(sentence)

In [19]:
sentence_tokenized

['The sun sets behind the horizon,\nPainting the sky with hues of orange and pink.',
 'As darkness falls, stars twinkle in the night sky.']

In [20]:
for sent in sentence_tokenized:
    print(sent)

The sun sets behind the horizon,
Painting the sky with hues of orange and pink.
As darkness falls, stars twinkle in the night sky.


In [21]:
word_tokenized=word_tokenize(sentence)

In [22]:
word_tokenized

['The',
 'sun',
 'sets',
 'behind',
 'the',
 'horizon',
 ',',
 'Painting',
 'the',
 'sky',
 'with',
 'hues',
 'of',
 'orange',
 'and',
 'pink',
 '.',
 'As',
 'darkness',
 'falls',
 ',',
 'stars',
 'twinkle',
 'in',
 'the',
 'night',
 'sky',
 '.']
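For comparison with the NLTK output above, the same two kinds of splitting can be approximated with regular expressions. This is a rough sketch under simple assumptions, not how NLTK works internally: the function names simple_word_tokenize and simple_sent_tokenize are hypothetical, and the rules ignore cases such as abbreviations ("Dr.") and contractions that NLTK's tokenizers handle with more care.

```python
import re


def simple_word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


poem = """The sun sets behind the horizon,
Painting the sky with hues of orange and pink.
As darkness falls, stars twinkle in the night sky."""

words = simple_word_tokenize(poem)
sentences = simple_sent_tokenize(poem)

print(len(words))
print(sentences)
```

On this text the sketch finds the same two sentences, because the comma at the end of the first line is not treated as a sentence boundary.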

## Stemming

In natural language processing (NLP), stemming is the process of reducing words to their root or base form, also known as the stem. The main goal of stemming is to reduce inflected words to their common base form, which can help improve text analysis and information retrieval tasks by treating different forms of a word as the same entity.

For example, stemming would convert words like "running" and "runs" to the common base form "run". Similarly, words like "play", "playing", and "played" would all be stemmed to "play". Irregular forms such as "ran" are usually left unchanged, because there is no suffix to strip.

Stemming algorithms typically work by removing suffixes from words to obtain the root form. These algorithms are rule-based and operate by applying a series of rules to trim off common suffixes. However, stemming algorithms do not always produce accurate or linguistically valid results: the stem may not be an actual word (for example, "chang"), unrelated words may be reduced to the same stem (over-stemming), and related words may end up with different stems (under-stemming).
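The idea of rule-based suffix stripping can be sketched in a few lines. This toy stemmer is far simpler than the Porter algorithm and is only meant to show the mechanism and its failure modes; the SUFFIXES list, the toy_stem function, and the min_stem_len parameter are all hypothetical.

```python
# Hypothetical rule list, checked in order; the first matching suffix is removed
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "running", "ran", "news"]:
    print(w, "->", toy_stem(w))
```

The three forms of "play" collapse to one stem, but "running" becomes the non-word "runn", "ran" is untouched, and "news" is wrongly reduced to "new". Real stemmers add many more rules and conditions to limit such errors, without removing them entirely.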

Despite its limitations, stemming is still widely used in NLP tasks such as text normalization, information retrieval, and document clustering. It can help reduce the dimensionality of text data and improve the performance of certain text processing tasks. Popular stemming algorithms include the Porter Stemmer and the Snowball Stemmer.

In [23]:
from nltk.stem import PorterStemmer,LancasterStemmer,RegexpStemmer

In [24]:
porter=PorterStemmer()
lancaster=LancasterStemmer()

In [25]:
print(porter.stem("changing"))
print(porter.stem("changed"))
print(porter.stem("change"))

chang
chang
chang


In [26]:
# Lancaster applies a different, more aggressive set of rules than Porter
print(lancaster.stem("changing"))
print(lancaster.stem("changed"))
print(lancaster.stem("change"))

chang
chang
chang


In [27]:
words=["changed","changing","change"]

In [28]:
port_stemmer=[porter.stem(word) for word in words]
print(port_stemmer)

['chang', 'chang', 'chang']


## Lemmatization


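Lemmatization, like stemming, is a natural language processing (NLP) technique used to reduce words to a base form. The result is the dictionary form of the word, known as the lemma. Unlike stemming, lemmatization relies on a vocabulary and on the word's part of speech, so it returns valid words: the stemmers above produced "chang", while the lemmatizer returns "change". The WordNetLemmatizer used below takes the part of speech as an argument; here every word is treated as a verb (wordnet.VERB).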

In [29]:
# Run these once if the WordNet data is missing
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [30]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [31]:
lemmetizer=WordNetLemmatizer()

In [32]:
lem_words=[lemmetizer.lemmatize(word,wordnet.VERB) for word in words]

In [33]:
lem_words

['change', 'change', 'change']
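The part-of-speech argument matters because the same surface form can have different lemmas depending on its role. The sketch below imitates that behavior with a hand-written lookup table instead of WordNet; LEMMA_TABLE and toy_lemmatize are hypothetical, and the default of "n" (noun) is chosen to mirror the default part of speech of WordNetLemmatizer.lemmatize.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("changed", "v"): "change",
    ("changing", "v"): "change",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


words = ["changed", "changing", "change"]
print([toy_lemmatize(w, "v") for w in words])
print([toy_lemmatize(w) for w in words])
print(toy_lemmatize("ran", "v"))
```

With the verb tag all three words map to "change", while with the default noun tag "changed" and "changing" come back unchanged. A dictionary-based approach can also map the irregular form "ran" to "run", which suffix stripping cannot do.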