Natural language processing, or NLP, is the automated analysis of human language: understanding it, analyzing it, and extracting information from it, often by applying machine learning algorithms. The underlying data can be text documents, images, audio, or video.
It is usually described as a field of computer science or artificial intelligence concerned with extracting linguistic information from that data.
NLP enables computers to derive meaning from human, or natural, language input.

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. 
A lot of the data that you could be analyzing is unstructured data and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. 

In [1]:
#install NLTK with pip
#pip install nltk==3.5

Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:

Tokenizing by word: Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.

Tokenizing by sentence: When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
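
To make the two kinds of tokenization concrete before turning to NLTK, here is a minimal sketch that uses only the Python standard library. The regular expressions are a deliberately naive stand-in for a real tokenizer, and the job-ad text in `ads` is made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical job-ad text, invented for this illustration.
ads = (
    "We need a Python developer. Python experience is required. "
    "Python testing skills are a plus."
)

# Naive sentence split: break after ., ! or ? when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", ads)

# Naive word split: lowercase runs of letters, digits, or apostrophes.
words = re.findall(r"[a-z0-9']+", ads.lower())

# Count how often each word appears.
word_counts = Counter(words)

print(len(sentences))
print(word_counts.most_common(1))
```

A rule this simple will mis-split abbreviations such as "e.g." and contractions, which is one reason to prefer a dedicated tokenizer for real text.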

Here’s how to import the relevant parts of NLTK so you can tokenize by word and by sentence:



In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
example_string = """Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,and how many more believe learning to be difficult."""

In [4]:
#use sent_tokenize() to split up example_string into sentences:
sent_tokenize(example_string)

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,and how many more believe learning to be difficult."]

In [5]:
# tokenizing example_string by word:
word_tokenize(example_string)

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult',
 '.']

Filtering Stop Words
Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [6]:
#import the relevant parts of NLTK in order to filter out stop words:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\win10\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
worf_quote = "Sir, I protest. I am not a merry man!"
#Now tokenize worf_quote by word and store the resulting list in words_in_quote:
words_in_quote = word_tokenize(worf_quote)
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

In [8]:
#create a set of stop words to filter words_in_quote. For this example, you’ll need to focus on stop words in "english":
stop_words = set(stopwords.words("english"))

#Next, create an empty list, filtered_list, to hold the words in words_in_quote that aren’t stop words:
filtered_list = []
#Iterate over words_in_quote with a for loop and add every word that is not a stop word to filtered_list.
#Calling .casefold() on each word ignores whether its letters are uppercase or lowercase.
#This is worth doing because stopwords.words('english') includes only lowercase versions of stop words.
for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_list.append(word)

In [9]:
#print filtered_list
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']
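
The same filter can also be written as a list comprehension. The sketch below runs without NLTK: the token list is typed in by hand to match what `word_tokenize()` returns for the quote, and `demo_stop_words` is a tiny hand-picked set assumed for illustration, not NLTK's much longer English list.

```python
# Tokens for "Sir, I protest. I am not a merry man!", typed in by hand.
words_in_quote = [
    "Sir", ",", "I", "protest", ".",
    "I", "am", "not", "a", "merry", "man", "!",
]

# Hypothetical mini stop-word set; NLTK's English list is much longer.
demo_stop_words = {"i", "am", "not", "a"}

# Keep only the tokens whose case-folded form is not a stop word.
filtered_list = [
    word for word in words_in_quote
    if word.casefold() not in demo_stop_words
]

print(filtered_list)
```

Punctuation tokens survive the filter because they are not in the stop-word set; removing them would need a separate check.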

Stemming
Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.
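
As a rough illustration of the idea of reducing words to a shared root, here is a naive suffix-stripping function. It is not the Porter algorithm, which applies a much more careful sequence of rules; the function name `naive_stem` and its suffix list are assumptions made up for this sketch.

```python
def naive_stem(word, suffixes=("ing", "er", "ed", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    lowered = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to be meaningful.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for w in ["helping", "helper", "helps", "help"]:
    print(w, "->", naive_stem(w))
```

All four inputs reduce to "help", but a rule this crude breaks quickly on other words, which is why a tested stemmer is the better choice in practice.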

In [10]:
#import the relevant parts of NLTK in order to start stemming:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
#create a stemmer instance with PorterStemmer():
stemmer = PorterStemmer()

In [11]:
#create a string for stemming
string_for_stemming = """
... The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""

In [12]:
#Before you can stem the words in that string, you need to separate all the words in it:
words = word_tokenize(string_for_stemming)

In [13]:
#Create a list of the stemmed versions of the words in words by using stemmer.stem() in a list comprehension:
stemmed_words = [stemmer.stem(word) for word in words]

In [14]:
#print stemmed_words
stemmed_words

['...',
 'the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 '...',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

In [15]:
#import the relevant parts of NLTK in order to use the Snowball stemmer:
import nltk
from nltk.stem.snowball import SnowballStemmer

#create a stemmer instance with SnowballStemmer(); the stemmer requires a language parameter:
snow_stemmer = SnowballStemmer(language='english')

In [16]:
#create a string for stemming
string_for_stemming = """
... The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""

In [17]:
#Before you can stem the words in that string, you need to separate all the words in it:
words = word_tokenize(string_for_stemming)

In [18]:
#Create a list of the stemmed versions of the words in words by calling snow_stemmer.stem() in a for loop:
snow_stemmed = []
for w in words:
    x = snow_stemmer.stem(w)
    snow_stemmed.append(x)

#print each original word next to its stem
for e1,e2 in zip(words,snow_stemmed):
    print(e1+' ----> '+e2)

#equivalent list comprehension:
#snow_stemmed = [snow_stemmer.stem(word) for word in words]

... ----> ...
The ----> the
crew ----> crew
of ----> of
the ----> the
USS ----> uss
Discovery ----> discoveri
discovered ----> discov
many ----> mani
discoveries ----> discoveri
. ----> .
... ----> ...
Discovering ----> discov
is ----> is
what ----> what
explorers ----> explor
do ----> do
. ----> .


Lemmatizing
Like stemming, lemmatizing reduces words to their core meaning, but it gives you a complete English word that makes sense on its own instead of a fragment of a word like 'discoveri'.
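
To show the contrast with a stem in the simplest possible terms, here is a toy lookup-based lemmatizer. It is only a sketch of the idea that a lemma is a real dictionary word, not a description of how NLTK's lemmatizer works internally; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names invented for this example.

```python
# Hypothetical miniature lemma table, written by hand for illustration.
LEMMA_TABLE = {
    "scarves": "scarf",
    "friends": "friend",
    "discoveries": "discovery",
}


def toy_lemmatize(word):
    """Return the lemma from the table, or the word unchanged if unknown."""
    return LEMMA_TABLE.get(word.lower(), word)


for w in ["scarves", "discoveries", "Rahul"]:
    print(w, "->", toy_lemmatize(w))
```

Every result is a complete word, and anything missing from the table passes through untouched.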

In [19]:
#import the relevant parts of NLTK in order to start lemmatizing:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

#Create a lemmatizer to use:
lemmatizer = WordNetLemmatizer()

#lemmatizing a plural noun:
lemmatizer.lemmatize("scarves")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\win10\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'scarf'

In [20]:
#create a string with more than one word to lemmatize:
string_for_lemmatizing = "The friends of Rahul love scarves."

#Now tokenize that string by word:
words = word_tokenize(string_for_lemmatizing)

#list of words:
words

['The', 'friends', 'of', 'Rahul', 'love', 'scarves', '.']

In [21]:
#Create a list containing all the words in words after they’ve been lemmatized:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

#print lemmatized words
lemmatized_words

['The', 'friend', 'of', 'Rahul', 'love', 'scarf', '.']

Tagging Parts of Speech
Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In [22]:
#create some text to tag. 
sagan_quote = """
... If you wish to make an apple pie from scratch,
... you must first invent the universe."""

#Use word_tokenize to separate the words in that string and store them in a list:
words_in_sagan_quote = word_tokenize(sagan_quote)

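#Now call nltk.pos_tag() on your new list of words.
#If the tagger model is missing, first run nltk.download('averaged_perceptron_tagger') (the resource name used by NLTK 3.5):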
nltk.pos_tag(words_in_sagan_quote)

[('...', ':'),
 ('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('...', ':'),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Each word in the quote is now in its own tuple, paired with a tag that represents its part of speech.
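
Because the result is an ordinary list of (word, tag) tuples, it can be filtered and counted with plain Python. The sketch below copies the tuples from the output above by hand, leaving out the '...' entries, so it runs without NLTK; the variable names are chosen for illustration.

```python
from collections import Counter

# (word, tag) pairs copied by hand from the tagger output shown above.
tagged = [
    ("If", "IN"), ("you", "PRP"), ("wish", "VBP"), ("to", "TO"),
    ("make", "VB"), ("an", "DT"), ("apple", "NN"), ("pie", "NN"),
    ("from", "IN"), ("scratch", "NN"), (",", ","), ("you", "PRP"),
    ("must", "MD"), ("first", "VB"), ("invent", "VB"), ("the", "DT"),
    ("universe", "NN"), (".", "."),
]

# Noun tags in the Penn Treebank tagset all start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts["NN"], tag_counts["VB"])
```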

In [23]:
#Here’s how to get a list of tags and their meanings:
from nltk.data import load
nltk.download('tagsets')
tagdict = load('help/tagsets/upenn_tagset.pickle')

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\win10\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [24]:
#POS tags and their meanings
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [25]:
tagdict.keys()

dict_keys(['LS', 'TO', 'VBN', "''", 'WP', 'UH', 'VBG', 'JJ', 'VBZ', '--', 'VBP', 'NN', 'DT', 'PRP', ':', 'WP$', 'NNPS', 'PRP$', 'WDT', '(', ')', '.', ',', '``', '$', 'RB', 'RBR', 'RBS', 'VBD', 'IN', 'FW', 'RP', 'JJR', 'JJS', 'PDT', 'MD', 'VB', 'WRB', 'NNP', 'EX', 'NNS', 'SYM', 'CC', 'CD', 'POS'])

In [26]:
tagdict['IN'][0]

'preposition or conjunction, subordinating'

Named Entity Recognition Using spaCy

In [27]:
#Install required library
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [28]:
#import required library
import spacy
# load spacy model
nlp=spacy.load("en_core_web_sm")

In [29]:
#load data
doc=nlp("Berlin is the capital of Germany;and the residence of Chancellor Angela Merkel")
## print entities
doc.ents

(Berlin, Angela Merkel)

In [30]:
# print entity and label
print(doc.ents[0], doc.ents[0].label_)

Berlin GPE


In [31]:
#create text data
text="On Tuesday , Apple announced its plans for another major chunk of the money:It will buy back a further $75 billion in stock."
#load data
doc=nlp(text)
#Print entities
for ent in doc.ents :
  print(ent.text,"\t",ent.label_)

Tuesday 	 DATE
Apple 	 ORG
$75 billion 	 MONEY


In [32]:
#print description of ORG
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [33]:
#create text data
text="Joe Biden will meet CEO of Google Mr.Sundar Pichai in California on next Tuesday"
#load text data
doc=nlp(text)
#print the entities
for ent in doc.ents :
  print(ent.text,"\t",ent.label_)

Joe Biden 	 PERSON
Sundar Pichai 	 PERSON
California 	 GPE
next Tuesday 	 DATE
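
Once the entities are available as (text, label) pairs, grouping them by label is a small standard-library task. In the sketch below the pairs are typed in by hand from the output above so that it runs without spaCy; with a real `doc` they could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The names `entities` and `by_label` are chosen for illustration.

```python
from collections import defaultdict

# (text, label) pairs copied by hand from the printed entities above.
entities = [
    ("Joe Biden", "PERSON"),
    ("Sundar Pichai", "PERSON"),
    ("California", "GPE"),
    ("next Tuesday", "DATE"),
]

# Collect the entity texts under their labels.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```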


In [34]:
#print the description of GPE
spacy.explain("GPE")

'Countries, cities, states'