# NLP

Source: https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/

# 1. Introduction to Natural Language Processing

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. With NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

# Terms To Remember

* Tokenization – the process of converting a text into tokens
* Tokens – the words or entities present in the text
* Text object – a sentence, a phrase, a word, or an article
* Corpus – a body of text (singular); the plural is corpora. Example: a collection of medical journals.
* Lexicon – words and their meanings. Example: an English dictionary. Different fields have different lexicons, however. To a financial investor, the first meaning of the word "bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning of "bull" is an animal. As such, there are special lexicons for financial investors, doctors, children, mechanics, and so on.
* Token – each "entity" that results from splitting a text according to rules. For example, each word is a token when a sentence is tokenized into words. Each sentence can also be a token if you tokenize the sentences out of a paragraph.
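The terms above can be made concrete with a small sketch. It uses plain Python only; the lexicon entries and the two-sentence corpus are made-up examples, and whitespace splitting is a deliberately naive stand-in for a real tokenizer.

```python
# Toy lexicons (made-up entries): the same word has a different first
# meaning depending on the domain.
lexicons = {
    "general": {"bull": "an animal"},
    "finance": {"bull": "someone who is confident about the market"},
}

# A toy corpus: a collection of text objects (here, two sentences).
corpus = [
    "The bull stood in the field",
    "The bull expects the market to rise",
]

# Naive whitespace tokenization: each word becomes one token.
tokens_per_text = [text.split() for text in corpus]
print(tokens_per_text[0])

# Look up the first meaning of "bull" in each lexicon.
for domain, lexicon in lexicons.items():
    print(domain, "->", lexicon["bull"])
```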

# Tokenize words

A sentence, or any other piece of text, can be split into words using the word_tokenize() function:

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "I want to go outside but its raining"
print(word_tokenize(data))

['I', 'want', 'to', 'go', 'outside', 'but', 'its', 'raining']


Every token here is a word because the sentence contains no punctuation. When punctuation such as a comma or a period is present, each mark is treated as a separate token.
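To see punctuation come out as separate tokens, here is a small sketch. It assumes TreebankWordTokenizer as a stand-in for word_tokenize(), because it runs without any downloaded model; the sample sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Assumption: TreebankWordTokenizer stands in for word_tokenize() here,
# because it works without downloading the Punkt sentence model.
tokenizer = TreebankWordTokenizer()

# Hypothetical sample sentence containing a comma and a final period.
data = "I want to go outside, but its raining."
tokens = tokenizer.tokenize(data)
print(tokens)

# Collect the tokens that are not made of letters or digits.
punctuation = [t for t in tokens if not t.isalnum()]
print(punctuation)
```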

# Tokenizing sentences

The same principle can be applied to sentences. Simply change the call to sent_tokenize().
We have added two sentences to the variable data:

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "I want to go outside but its raining. I want to go outside but its raining"
print(sent_tokenize(data))

['I want to go outside but its raining.', 'I want to go outside but its raining']
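Sentence splitters typically rely on whitespace after the end-of-sentence punctuation, so a missing space after a period can keep two sentences glued together. The sketch below illustrates this with a toy regular expression; it is not the pretrained Punkt model that sent_tokenize() uses, and the function name naive_sent_split is hypothetical.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (toy rule, not Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

with_space = "I want to go outside. It is raining."
without_space = "I want to go outside.It is raining."

# Whitespace after the period marks a boundary: two sentences.
print(naive_sent_split(with_space))

# No whitespace after the period: the text stays in one piece.
print(naive_sent_split(without_space))
```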


# NLTK and arrays

We can store the words and sentences in lists (Python's built-in array type):

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "I want to go outside but its raining."
 
phrases = sent_tokenize(data)
words = word_tokenize(data)
 
print(phrases)
print(words)

['I want to go outside but its raining.']
['I', 'want', 'to', 'go', 'outside', 'but', 'its', 'raining', '.']
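Because the tokenizers return ordinary Python lists, the usual list operations apply. A minimal sketch, with the token list copied from the output above as a literal so that it runs without NLTK; the name clean_words is made up for illustration.

```python
# Token list copied from the output above, so this runs without NLTK.
words = ['I', 'want', 'to', 'go', 'outside', 'but', 'its', 'raining', '.']

# Number of tokens, punctuation included.
print(len(words))

# First and last token.
print(words[0], words[-1])

# Keep only alphabetic tokens and lowercase them.
clean_words = [w.lower() for w in words if w.isalpha()]
print(clean_words)
```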


# NLTK stop words

Text may contain stop words such as ‘the’, ‘is’, and ‘are’. Stop words can be filtered out of the text before it is processed. There is no universal list of stop words in NLP research; however, the NLTK module contains a list of stop words.
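Since no list is universal, a stop word list is often adapted to the task. The sketch below shows one way to do that with set operations; the base list, the additions, and the exceptions are all made-up examples rather than NLTK's actual list.

```python
# Hypothetical base list; in practice this could come from NLTK.
base_stop_words = {"the", "is", "are", "to", "but"}

# Made-up domain-specific additions and exceptions.
extra_stop_words = {"patient", "study"}
keep_words = {"but"}

# Union adds the extra words, difference removes the exceptions.
custom_stop_words = (base_stop_words | extra_stop_words) - keep_words
print(sorted(custom_stop_words))
```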

# Natural Language Processing: remove stop words

In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "I want to go outside but its raining.."
words = word_tokenize(data)
print(words)

['I', 'want', 'to', 'go', 'outside', 'but', 'its', 'raining..']


To filter out the stop words, we modify the code to:

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
 
data = "I want to go outside but its raining."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
 
print(wordsFiltered)

['I', 'want', 'go', 'outside', 'raining', '.']


We get a set of English stop words using the line:

In [12]:
stopWords = set(stopwords.words('english'))
stopWords

{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', 'as', 'at',
 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 'd', 'did', 'didn',
 'do', 'does', 'doesn', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', 'has',
 'hasn', 'have', 'haven', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if',
 'in', 'into', 'is', 'isn', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', 'more', 'most', 'mustn', 'my',
 'myself', 'needn', 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours',
 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'she', 'should', 'shouldn', 'so', 'some', 'such', 't',
 'than', 'that', 'the', 'their', 'theirs', 'them', ...
 

In [13]:
print(len(stopWords))
print(stopWords)

153
{'again', 'haven', 'shouldn', 'on', 'her', 'yours', 'wasn', 'by', 'or', 'weren', 'them', 'himself', 'doing', 'i', 'before', 'needn', 'so', 'after', 'same', 'some', 'yourself', 'myself', 'll', 'don', 'now', 'do', 'why', 'as', 'over', 'can', 'both', 'm', 'wouldn', 'this', 'very', 'above', 'most', 'mightn', 'because', 'your', 'ours', 'at', 'those', 'up', 'where', 'she', 'me', 'be', 'am', 'did', 'yourselves', 'had', 's', 'the', 'in', 'just', 'any', 'd', 'from', 'ourselves', 'their', 'my', 'will', 'his', 'until', 'o', 'more', 'should', 'these', 'an', 'doesn', 'nor', 'when', 'is', 'such', 'ain', 'hadn', 'how', 'mustn', 'were', 'couldn', 'against', 'not', 't', 'him', 'than', 'you', 'they', 'have', 'we', 'off', 'once', 'our', 'has', 'between', 'each', 'aren', 'during', 'through', 'which', 'theirs', 'for', 'to', 'other', 'won', 'isn', 'about', 'down', 'being', 'it', 'only', 're', 'under', 'whom', 'here', 'of', 'ma', 'themselves', 'y', 'then', 've', 'with', 'having', 'but', 'herself', 'into'

We create a new list called wordsFiltered, which contains all the words that are not stop words.
To build it, we iterate over the list of words and append a word only if it is not in the stopWords set.

In [14]:
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
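The stop words in the printed set are all lowercase, which is why the capitalized 'I' survives the filtering in the earlier output. A possible variant, sketched below, compares the lowercased token instead and writes the loop as a list comprehension. The stop word set here is a small hand-picked subset of the words printed above, an assumption made so that the sketch runs without downloading the NLTK stopwords corpus.

```python
# Assumption: a small subset of the stop words printed above, so the sketch
# runs without downloading the NLTK stopwords corpus.
stop_words = {"i", "to", "but", "its"}

words = ['I', 'want', 'to', 'go', 'outside', 'but', 'its', 'raining', '.']

# Same filtering as the loop, written as a list comprehension, but comparing
# the lowercased token so that 'I' matches the stop word 'i'.
words_filtered = [w for w in words if w.lower() not in stop_words]
print(words_filtered)
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters when filtering long texts.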

# NLTK – stemming

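A word stem is the part of a word that remains after endings such as -ing or -s are removed. Stemming is a form of normalization based on linguistic rules: it maps related word forms to a common stem.
For example, the stem of the word waiting is wait.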
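The idea can be sketched with a toy suffix stripper. This is only an illustration of the concept, far simpler than the algorithms NLTK provides; the function name toy_stem and its suffix list are made up.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, far simpler than a real stemmer."""
    for suffix in suffixes:
        # Require at least three remaining characters so short words
        # such as "is" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["waiting", "waited", "waits", "wait"]:
    print(w, "->", toy_stem(w))
```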

<img src="https://pythonspot-9329.kxcdn.com/wp-content/uploads/2016/08/word-stem.png.webp">

Given words, NLTK can find the stems.



Start by defining some words:

In [15]:
words = ["game","gaming","gamed","games"]

In [17]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
 
words = ["game","gaming","gamed","games"]
ps = PorterStemmer()
 
for word in words:
    print(ps.stem(word))

game
game
game
game


We can do word stemming for sentences too:

In [18]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
 
ps = PorterStemmer()
 
sentence = "gaming, the gamers play games"
words = word_tokenize(sentence)
 
for word in words:
    print(word + ":" + ps.stem(word))

gaming:game
,:,
the:the
gamers:gamer
play:play
games:game
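One common use of stems is to group word forms that belong together. The sketch below does this with PorterStemmer and a dictionary; the word list and the name groups are chosen for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["game", "gaming", "gamed", "games", "gamers", "play"]

# Map each stem to the list of original words that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[ps.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, members)
```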


There are other stemming algorithms, but Porter (PorterStemmer) is one of the most popular.
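NLTK also ships other stemmers, for example SnowballStemmer and LancasterStemmer. The sketch below runs a few words through all three so the results can be compared side by side; the word list and the results dictionary are chosen for illustration, and the stemmers may disagree on some words.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["gaming", "games", "gamers"]

# For each word, collect the stem produced by every stemmer.
results = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in results.items():
    print(word, stems)
```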