
In [None]:
# Part of Speech (POS)

# One of the more powerful features of the NLTK module is part-of-speech tagging.
# This means labeling each word in a sentence as a noun, adjective, verb, and so on.
# The tagger also distinguishes finer categories, such as verb tense and noun number.
# Here's a list of the tags, what they mean, and some examples:

# POS tag list:

# CC	coordinating conjunction
# CD	cardinal number
# DT	determiner
# EX	existential there (like: "there is" ... think of it like "there exists")
# FW	foreign word
# IN	preposition/subordinating conjunction
# JJ	adjective	'big'
# JJR	adjective, comparative	'bigger'
# JJS	adjective, superlative	'biggest'
# LS	list marker	1)
# MD	modal	could, will
# NN	noun, singular 'desk'
# NNS	noun plural	'desks'
# NNP	proper noun, singular	'Harrison'
# NNPS	proper noun, plural	'Americans'
# PDT	predeterminer	'all the kids'
# POS	possessive ending	parent's
# PRP	personal pronoun	I, he, she
# PRP$	possessive pronoun	my, his, her
# RB	adverb	very, silently
# RBR	adverb, comparative	better
# RBS	adverb, superlative	best
# RP	particle	give up
# TO	to	go 'to' the store.
# UH	interjection	errrrrrrrm
# VB	verb, base form	take
# VBD	verb, past tense	took
# VBG	verb, gerund/present participle	taking
# VBN	verb, past participle	taken
# VBP	verb, non-3rd person sing. present	take
# VBZ	verb, 3rd person sing. present	takes
# WDT	wh-determiner	which
# WP	wh-pronoun	who, what
# WP$	possessive wh-pronoun	whose
# WRB	wh-adverb	where, when
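
Many of the tags above fall into families that share a prefix (NN*, VB*, JJ*), which is what later makes patterns such as <N.*> useful. A minimal sketch of that grouping, using only a hand-copied subset of the list; the `TAGS` dictionary and the `group_by_prefix` helper are illustrative names, not part of NLTK:

```python
from collections import defaultdict

# A hand-copied subset of the tag list above (tag -> short description).
TAGS = {
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "NN": "noun, singular",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person sing. present",
}


def group_by_prefix(tags, prefix_len=2):
    """Group tag names by their first `prefix_len` characters."""
    families = defaultdict(list)
    for tag in tags:
        families[tag[:prefix_len]].append(tag)
    return dict(families)


families = group_by_prefix(TAGS)
print(families["NN"])  # ['NN', 'NNS', 'NNP', 'NNPS']
print(families["VB"])  # ['VB', 'VBD', 'VBZ']
```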

In [1]:
import nltk
nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# Now, let's create our training and testing data:

# 'train_text' is the State of the Union address from 2005,
# and 'sample_text' is the 2006 address, both by former President George W. Bush.


train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")


In [3]:
# Training Text
train_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nFebruary 2, 2005\n\n\n9:10 P.M. EST \n\nTHE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: \n\nAs a new Congress gathers, all of us in the elected branches of government share a great privilege: We\'ve been placed in office by the votes of the people we serve. And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq. (Applause.) \n\nTwo weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. \n\nTonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong. (Applause.

In [4]:
# Sample Text
sample_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King. (Applause.)\n\nPresident George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan. 31, 2006. White House photo by Eric DraperEvery time I\'m invited to this rostrum, I\'m humbled by the privilege, and mindful of the history we\'ve seen together. We have gathered under this Capitol dome in moments of national mourning and national achievement. We have serv

In [5]:
# Next, we can train the Punkt sentence tokenizer on the training text:

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
custom_sent_tokenizer

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7a3b54db0970>

In [6]:
# Then we can split the sample text into sentences using:

tokenized = custom_sent_tokenizer.tokenize(sample_text)
tokenized

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national ach

In [7]:
# Now we can finish this part-of-speech tagging script by creating a function
# that tags the first five sentences:

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

# The output is one list per sentence. Each list holds (word, tag) tuples,
# where the first element is the word and the second is its part of speech tag.

# Inference:
# At this point, we can begin to derive meaning, but there is still some work to do.
# The next topic that we're going to cover is chunking, which is where we group words,
# based on their parts of speech, into hopefully meaningful groups.

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat
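
Because each tagged sentence is a plain list of (word, tag) tuples, it can be filtered with ordinary Python. A small sketch, assuming a hand-copied slice of the second sentence's output above so that it runs without the tagger; `words_with_tag` is a hypothetical helper, not an NLTK function:

```python
# First few (word, tag) pairs copied from the tagger output above.
tagged = [
    ("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","),
    ("Vice", "NNP"), ("President", "NNP"), ("Cheney", "NNP"), (",", ","),
    ("members", "NNS"), ("of", "IN"), ("Congress", "NNP"),
]


def words_with_tag(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


proper_nouns = words_with_tag(tagged, "NNP")
all_nouns = words_with_tag(tagged, "NN")

print(proper_nouns)  # ['Mr.', 'Speaker', 'Vice', 'President', 'Cheney', 'Congress']
print(all_nouns)     # ['Mr.', 'Speaker', 'Vice', 'President', 'Cheney', 'members', 'Congress']
```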

In [8]:
# Now that we know the parts of speech, we can do what is called chunking:
# grouping words into hopefully meaningful chunks.
# One of the main goals of chunking is to group words into what are known as "noun phrases."
# These are phrases of one or more words that contain a noun, maybe some descriptive words,
# maybe a verb, and maybe something like an adverb.
# The idea is to group nouns with the words that relate to them.

# In order to chunk, we combine the part of speech tags with regular expressions.
# From regular expressions, we mainly use the following:
# + = match 1 or more repetitions
# ? = match 0 or 1 repetitions
# * = match 0 or more repetitions
# . = any character except a newline
# The last thing to note is that part of speech tags are written between "<" and ">",
# and we can also place regular expressions within the tags themselves
# to account for things like "all nouns" (<N.*>).
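
To see which tags a pattern written inside "<" and ">" would cover, the quantifiers can be tried on tag names with Python's own `re` module. This is only an approximation of how NLTK interprets tag patterns, and `matching_tags` and the `TAGS` list are illustrative names:

```python
import re

# A few tag names from the POS tag list.
TAGS = ["NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBZ", "RB", "RBR", "JJ"]


def matching_tags(pattern, tags=TAGS):
    """Return the tags that the pattern matches in full."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


print(matching_tags(r"N.*"))   # ['NN', 'NNS', 'NNP', 'NNPS']  (all nouns)
print(matching_tags(r"VB.?"))  # ['VB', 'VBD', 'VBZ']          (VB plus at most one more character)
print(matching_tags(r"RB.?"))  # ['RB', 'RBR']
print(matching_tags(r"NNP"))   # ['NNP']                       (exact tag only)
```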

In [9]:
# The regular expression pattern below chunks together sequences of adverbs and verbs
# followed by one or more proper nouns and an optional singular noun.
# It identifies and groups specific types of words in the text based on their parts of speech.

In [10]:
def process_content():
    try:
        # Build the grammar and parser once, rather than once per sentence.
        chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
        chunkParser = nltk.RegexpParser(chunkGram)

        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunked = chunkParser.parse(tagged)
            print(chunked)

    except Exception as e:
        print(str(e))

process_content()


# chunkGram: Defines a rule for grouping words together based on their parts of speech.
# chunkParser: Applies the rule to group tagged words into chunks.

# Inference:
# chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
# <RB.?>* = "zero or more adverbs of any form (RB, RBR, RBS)," followed by:
# <VB.?>* = "zero or more verbs of any form or tense," followed by:
# <NNP>+ = "one or more proper nouns," followed by:
# <NN>? = "zero or one singular noun."
# Other tags that appear in the output below:
# DT = "determiner"
# JJ = "adjective"


Streaming output truncated to the last 5000 lines.
  the/DT
  dramatic/JJ
  progress/NN
  of/IN
  a/DT
  new/JJ
  democracy/NN
  ./.)
(S
  In/IN
  less/JJR
  than/IN
  three/CD
  years/NNS
  ,/,
  the/DT
  nation/NN
  has/VBZ
  gone/VBN
  from/IN
  dictatorship/NN
  to/TO
  liberation/NN
  ,/,
  to/TO
  sovereignty/VB
  ,/,
  to/TO
  a/DT
  constitution/NN
  ,/,
  to/TO
  national/JJ
  elections/NNS
  ./.)
(S
  At/IN
  the/DT
  same/JJ
  time/NN
  ,/,
  our/PRP$
  coalition/NN
  has/VBZ
  been/VBN
  relentless/VBN
  in/IN
  shutting/VBG
  off/RP
  terrorist/JJ
  infiltration/NN
  ,/,
  clearing/VBG
  out/RP
  insurgent/JJ
  strongholds/NNS
  ,/,
  and/CC
  turning/VBG
  over/RP
  territory/NN
  to/TO
  (Chunk Iraqi/NNP security/NN)
  forces/NNS
  ./.)
(S
  I/PRP
  am/VBP
  confident/JJ
  in/IN
  our/PRP$
  plan/NN
  for/IN
  victory/NN
  ;/:
  I/PRP
  am/VBP
  confident/JJ
  in/IN
  the/DT
  will/MD
  of/IN
  the/DT
  (Chunk Iraqi/NNP)
  people/NNS
  ;/:
  I/PRP
  am/VBP
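
The printed tree can also be walked in code to pull out just the chunks. A minimal sketch, assuming a hand-tagged fragment copied from the output above so that no tagger model is needed; the list comprehension is one possible way to collect chunk text, not the only one:

```python
import nltk

# (word, tag) pairs copied from the chunked output above.
tagged = [
    ("turning", "VBG"), ("over", "RP"), ("territory", "NN"), ("to", "TO"),
    ("Iraqi", "NNP"), ("security", "NN"), ("forces", "NNS"), (".", "."),
]

# Same grammar as in process_content().
chunk_parser = nltk.RegexpParser(r"Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}")
tree = chunk_parser.parse(tagged)

# Keep only subtrees labeled "Chunk" and join their words.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)  # ['Iraqi security']
```

Note that "territory" is left out of any chunk: the grammar requires at least one proper noun (<NNP>+), so a lone common noun never starts a chunk.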


In [11]:
import nltk
from nltk.tokenize import word_tokenize

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence
words = word_tokenize(sentence)

# POS tagging
tagged_words = nltk.pos_tag(words)

# Define the chunk grammar
chunk_grammar = "Chunk: {<JJ>*<NN>}"

# Create a chunk parser
chunk_parser = nltk.RegexpParser(chunk_grammar)

# Perform chunking
chunked_words = chunk_parser.parse(tagged_words)

# Output the chunked words
print(chunked_words)


(S
  The/DT
  (Chunk quick/JJ brown/NN)
  (Chunk fox/NN)
  jumps/VBZ
  over/IN
  the/DT
  (Chunk lazy/JJ dog/NN))
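
In the output above, "brown" was tagged NN, so the grammar {<JJ>*<NN>} splits "quick brown" and "fox" into separate chunks. One possible adjustment, shown here as a sketch on the same tags copied by hand, is to allow an optional determiner and one or more nouns; `chunk_phrases` is a hypothetical helper name:

```python
import nltk

# (word, tag) pairs copied from the chunked output above.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]


def chunk_phrases(grammar, tagged_tokens, label="Chunk"):
    """Parse tagged tokens with the grammar and return each chunk as a string."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


original = chunk_phrases("Chunk: {<JJ>*<NN>}", tagged)
widened = chunk_phrases("Chunk: {<DT>?<JJ>*<NN>+}", tagged)

print(original)  # ['quick brown', 'fox', 'lazy dog']
print(widened)   # ['The quick brown fox', 'the lazy dog']
```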
