## Part of Speech Tagging | Chunking | Named Entity Recognition

In [1]:
# import the necessary libraries
import nltk
import string
import re
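# with no arguments this opens the interactive NLTK downloader; the cells below
# need the tokenizer, POS tagger, and named-entity chunker data
# (the exact resource names vary by NLTK version)
nltk.download()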

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

###### Part of Speech Tagging:
The part of speech of a word describes the role it plays in a sentence. The same word can take on
different roles and meanings depending on its context. Basic natural language processing models such as
bag-of-words ignore this context, so they cannot capture these relations between words. Part of speech
tagging addresses this by labeling each word with a part of speech tag based on its context. The tags
are also used to extract relationships between words.
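
To make the limitation concrete, here is a minimal sketch using only the standard library. The tags are written by hand for illustration (they are not tagger output), and the two phrases are made-up examples. A bag-of-words count merges the noun and verb uses of "book", while counting (word, tag) pairs keeps them apart.

```python
from collections import Counter

# Hand-tagged tokens for "read a book" and "book a flight".
# The tags are assigned by hand for illustration, not produced by a tagger.
tagged = [
    ("read", "VB"), ("a", "DT"), ("book", "NN"),
    ("book", "VB"), ("a", "DT"), ("flight", "NN"),
]

# Bag-of-words: count surface forms only, ignoring how each word is used.
word_counts = Counter(word for word, _ in tagged)

# Tag-aware counts: key on (word, tag) so noun and verb uses stay separate.
tagged_counts = Counter(tagged)

print(word_counts["book"])             # 2
print(tagged_counts[("book", "NN")])   # 1
print(tagged_counts[("book", "VB")])   # 1
```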


In [2]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# convert text into word_tokens with their tags
def pos_tagging(text):
	word_tokens = word_tokenize(text)
	return pos_tag(word_tokens)

pos_tagging('You just gave me a scare')

[('You', 'PRP'),
 ('just', 'RB'),
 ('gave', 'VBD'),
 ('me', 'PRP'),
 ('a', 'DT'),
 ('scare', 'NN')]
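
Once tokens are tagged, a common next step is to filter them by tag. The sketch below works on the tagged list copied from the output above, so it needs no NLTK data. The helper name `words_with_prefix` is hypothetical; it relies on the fact that related Penn Treebank tags share a prefix (for example `VB`, `VBD`, `VBG`).

```python
# Tagged tokens copied from the output above.
tagged = [('You', 'PRP'), ('just', 'RB'), ('gave', 'VBD'),
          ('me', 'PRP'), ('a', 'DT'), ('scare', 'NN')]

def words_with_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_prefix(tagged, 'NN')      # matches NN, NNS, NNP, NNPS
verbs = words_with_prefix(tagged, 'VB')      # matches VB, VBD, VBG, VBN, VBP, VBZ
pronouns = words_with_prefix(tagged, 'PRP')  # matches PRP and PRP$

print(nouns)     # ['scare']
print(verbs)     # ['gave']
print(pronouns)  # ['You', 'me']
```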

In [3]:
# download the tagset
nltk.download('tagsets')

# extract information about the tag
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


[nltk_data] Error loading tagsets: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


###### Chunking:
Chunking is the process of extracting phrases from unstructured text and adding more structure to it. It is also known as shallow parsing. It is done on top of part of speech tagging: it groups words into “chunks”, mainly noun phrases. Here the chunks are defined with regular expressions over part of speech tags.

In [5]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# define chunking function with text and regular
# expression representing grammar as parameter
def chunking(text, grammar):
	word_tokens = word_tokenize(text)

	# label words with part of speech
	word_pos = pos_tag(word_tokens)

	# create a chunk parser using grammar
	chunkParser = nltk.RegexpParser(grammar)

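	# parse the POS-tagged tokens into a chunk tree
	tree = chunkParser.parse(word_pos)

	# print the whole tree, then each chunk found in it
	for subtree in tree.subtrees():
		print(subtree)

	# open a window showing the tree (requires a graphical display)
	tree.draw()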
	
sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)


(S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
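
The grammar `{<DT>?<JJ>*<NN>}` reads as: an optional determiner, any number of adjectives, then a singular noun. To show why a regular expression is enough for this, the sketch below applies the same pattern with Python's `re` module to the tag sequence from the output above. It is a simplified illustration, not NLTK's implementation, and the function name `simple_np_chunks` is hypothetical.

```python
import re

# Tagged tokens copied from the output above.
tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('bird', 'NN'),
          ('is', 'VBZ'), ('flying', 'VBG'), ('in', 'IN'),
          ('the', 'DT'), ('sky', 'NN')]

def simple_np_chunks(tagged_tokens):
    """Find noun-phrase chunks matching <DT>?<JJ>*<NN> over the tag sequence."""
    # Encode the tags as one string, e.g. "<DT><JJ><JJ><NN><VBZ>..."
    tag_string = ''.join('<%s>' % tag for _, tag in tagged_tokens)
    pattern = re.compile(r'(?:<DT>)?(?:<JJ>)*<NN>')

    chunks = []
    for match in pattern.finditer(tag_string):
        # Each '<' marks one token, so counting them maps string
        # offsets back to token positions.
        start = tag_string.count('<', 0, match.start())
        length = match.group().count('<')
        chunks.append([word for word, _ in tagged_tokens[start:start + length]])
    return chunks

chunks = simple_np_chunks(tagged)
print(chunks)
# [['the', 'little', 'yellow', 'bird'], ['the', 'sky']]
```

The two chunks match the `NP` subtrees printed above. Because the pattern ends in `<NN>` with a closing bracket, it does not match plural or proper nouns (`NNS`, `NNP`); the grammar would need extending to cover those.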


###### Named Entity Recognition:
Named Entity Recognition extracts information from unstructured text by locating the entities it mentions and classifying them into categories such as person, organization, event, and place. This shows which real-world entities a text refers to, and is a common first step toward extracting relationships between them.

In [6]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

def named_entity_recognition(text):
	# tokenize the text
	word_tokens = word_tokenize(text)

	# part of speech tagging of words
	word_pos = pos_tag(word_tokens)

	# tree of word entities
	print(ne_chunk(word_pos))

text = 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'
named_entity_recognition(text)


(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION GeeksforGeeks/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)
