# **Natural Language Processing (NLP)**

Natural language processing (NLP) is the application of computational methods to extract information from text and to build applications on top of it. Every language has a systematic structure and rules; the rules that govern how words are formed are referred to as morphology. For example, the past tense of “jump” is “jumped”. For humans this morphological understanding is obvious.

## **Tokenization**

The task of segmenting text into relevant words is called tokenization.

In its simplest form, tokenization can be achieved by splitting text on whitespace.
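As a minimal sketch of whitespace tokenization, Python's built-in `str.split()` is enough. The sample sentences are only illustrations; note how punctuation stays attached to the neighbouring word.

```python
text = 'we will look into the core components'

# split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)
# ['we', 'will', 'look', 'into', 'the', 'core', 'components']

# Punctuation is not separated from the words it touches
print("What's up?".split())
# ["What's", 'up?']
```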

`NLTK` provides a function called `word_tokenize()` for splitting strings into tokens.

In [7]:
import nltk
# nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

text = 'we will look into the core components that are relevant to language in computational linguistics'
word_tokenize(text)

['we',
 'will',
 'look',
 'into',
 'the',
 'core',
 'components',
 'that',
 'are',
 'relevant',
 'to',
 'language',
 'in',
 'computational',
 'linguistics']

- Simple tokenization does not always work.
- Words that contain punctuation marks, such as contractions (for example, “what’s”), are split into several tokens.

In [8]:
text = 'What\'s up?'
word_tokenize(text)

['What', "'s", 'up', '?']

- If we want to keep such a word as a single token, a simple hack is to split the text on whitespace and then remove all punctuation from each word.

In [9]:
import string
text = 'What\'s up?'
words = text.split()
table = str.maketrans('', '', string.punctuation)
[w.translate(table) for w in words]

['Whats', 'up']
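If the apostrophe itself should be kept, one possible alternative is a regular expression that treats apostrophe-joined parts as one token. This is only a sketch: the pattern and the helper name `tokenize_keep_contractions` are assumptions made for illustration, and the pattern handles only the straight apostrophe (`'`).

```python
import re

# A word, optionally followed by apostrophe-joined parts such as 's or 't
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")

def tokenize_keep_contractions(text):
    """Return word tokens, keeping contractions intact and dropping other punctuation."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_keep_contractions("What's up?"))
# ["What's", 'up']
```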

## **Stemming & Lemmatization**

Stemming and lemmatization both reduce each word to its root. For example, “Walk” is the root of words like “Walks”, “Walking”, and “Walked”. The root usually holds significantly more meaning than the tense itself, so in NLP tasks it is very important to extract the roots of the words in the text.

`Stemming` reduces the vocabulary present in the documents, which saves a lot of computation. In tasks such as classification, the tense of a word also becomes irrelevant once stemming is applied.

The most popular method is the `Porter Stemming algorithm`. It is a suffix-stripping algorithm, so it does not rely on a lookup table of inflected forms and their roots. Instead, a set of simple rules is used to extract the root of each word.
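To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is not the Porter algorithm, which applies several ordered phases of rules with extra conditions; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Hypothetical, much-simplified rule set (NOT the real Porter rules)
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, provided at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["Walked", "Walks", "Walking", "ate"]])
# ['walk', 'walk', 'walk', 'ate']
```

No lookup table is involved, so an irregular form such as “ate” is left untouched.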

### **Stemming vs Lemmatization**

Stemming and lemmatization are both techniques used to reduce words to a common base form, but they differ in how they do it. `Stemming is faster, but lemmatization is more accurate.` 

The practical distinction between stemming and lemmatization is that `stemming` merely removes common suffixes from the end of word tokens, whereas `lemmatization` ensures that the output is an existing, normalized form of the word.

`Lemmatization` is otherwise very similar to `stemming`: it removes inflections and suffixes to convert words into their root forms.

**`Meaning and context can be lost in stemming, whereas lemmatization preserves the context.`**
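The difference can be sketched with a toy comparison. Both the miniature lookup table and the naive suffix stripper are hypothetical stand-ins: a real lemmatizer relies on a full vocabulary, and a real stemmer uses a richer rule set.

```python
# Hypothetical miniature lemma table, for illustration only
LEMMA_TABLE = {"ate": "eat", "eats": "eat", "eating": "eat", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

def strip_suffix(word):
    """Naive suffix stripping, shown only for comparison."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["ate", "studies"]:
    print(w, strip_suffix(w), toy_lemmatize(w))
# ate ate eat
# studies studi study
```

Suffix stripping produces “studi”, which is not a word, and cannot relate “ate” to “eat”; the lookup returns an existing word in both cases.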

In [None]:
# Stemming Method
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
tokens = ['Walked', 'Walks', 'Walking', 'ate', 'eats', 'eating']
[porter.stem(word) for word in tokens]

['walk', 'walk', 'walk', 'ate', 'eat', 'eat']

In [None]:
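# Lemmatization Method
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
tokens2 = ['Walked', 'Walks', 'Walking', 'ate', 'eats', 'eating']
# lemmatize() treats each word as a noun (pos='n') unless told otherwise, so
# these verb forms come back unchanged; lowercasing the words and passing
# pos='v' should reduce them to 'walk' and 'eat'.
[lemma.lemmatize(word) for word in tokens2]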

['Walked', 'Walks', 'Walking', 'ate', 'eats', 'eating']

### **Case Normalisation**

It is common to convert all words to one case, usually lowercase, so that variants such as “Walk” and “walk” are treated as the same token.
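A minimal sketch using Python's built-in `str.lower()`; the token list is made up for illustration.

```python
tokens = ['Walk', 'WALK', 'walk']

# Convert every token to lowercase
normalised = [t.lower() for t in tokens]
print(normalised)
# ['walk', 'walk', 'walk']

# Three distinct tokens collapse into one vocabulary entry
print(len(set(tokens)), len(set(normalised)))
# 3 1
```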

### **Stop Words**

Stop words are words that contribute little to extracting information from, or modelling, text data because they are the most common words, such as `the`, `a`, and `is`.

In [17]:
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words[:11])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've"]
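Once a stop-word list is available, removing stop words is a simple filter. The sketch below uses a small hand-written set so that it runs without any downloads; in practice it could be replaced by `set(stopwords.words('english'))`. The helper name `remove_stop_words` is an assumption for illustration.

```python
# Illustrative subset only, not the full NLTK list
STOP_WORDS = {"the", "a", "and", "is", "we", "will", "into", "that", "are", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = 'we will look into the core components that are relevant to language'.split()
print(remove_stop_words(tokens))
# ['look', 'core', 'components', 'relevant', 'language']
```

Using a set rather than a list keeps each membership check fast, which matters on large corpora.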


## **Note**
`Data Cleaning:` Before applying complex computational methods to text data, we are expected to understand and clean the data. These techniques help us make the text ready for modelling with deep neural networks (DNNs) and other advanced NLP techniques.