## Chunking is the process of extracting phrases (chunks) from unstructured text based on patterns or rules.
- A chunk grammar is a set of rules describing how sentences should be chunked. It is often written as regular expressions over part-of-speech tags.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "The quick brown fox jumps over the lazy dog"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


### Steps for Chunking

- Define the chunk grammar using regular expressions. The grammar specifies the tag patterns that form a chunk. In this example, a simple grammar chunks noun phrases (NP) consisting of an optional determiner (DT), any number of adjectives (JJ), and a noun (NN).

- Create a chunk parser from the defined grammar.

- Apply the chunk parser to the tagged tokens. It identifies and groups tokens according to the patterns specified in the grammar.

- Print the chunked tokens, which represent the phrases identified by the chunking rules.

In [2]:
# Define chunk grammar using regular expressions
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk sequences of DT, JJ, and NN
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(tagged_tokens)

# Print the chunked tokens
print(chunked_tokens)

(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))


## Rules
According to the rule you created, your chunks:
- Start with an optional (?) determiner (DT)
- Can have any number (*) of adjectives (JJ)
- End with a noun (NN)
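
Note that `<NN>` matches exactly one token tagged NN, which is why brown/NN closes the first chunk in the output above and fox/NN ends up in a chunk of its own. The sketch below compares the original rule with a variant that accepts one or more consecutive nouns. The tagged tokens are copied from the `pos_tag` output printed earlier so the snippet runs without the tagger model; the `<NN>+` variant and the helper name `np_phrases` are illustrative assumptions, not part of the original notebook.

```python
from nltk import RegexpParser

# Tagged tokens copied from the pos_tag output shown earlier
tagged_tokens = [
    ('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
    ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
    ('dog', 'NN'),
]

def np_phrases(grammar, tagged):
    """Return the text of every NP chunk found by the given grammar."""
    tree = RegexpParser(grammar).parse(tagged)
    return [
        ' '.join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
    ]

# Original rule: exactly one NN ends each chunk
single_noun = np_phrases(r"NP: {<DT>?<JJ>*<NN>}", tagged_tokens)

# Variant: one or more consecutive NN tags end each chunk
multi_noun = np_phrases(r"NP: {<DT>?<JJ>*<NN>+}", tagged_tokens)

print(single_noun)  # ['The quick brown', 'fox', 'the lazy dog']
print(multi_noun)   # ['The quick brown fox', 'the lazy dog']
```

Whether "brown" should be treated as a noun here is a tagging question; the grammar only sees the tags it is given.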

## Let's create another example for chunking, this time focusing on verb phrases (VP) in a sentence

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "Suyashi has a Rabbit that ran from the Table. She bought it from Isha . The Jumping is best Habit of the Rabbit"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
word_tokens_text = pos_tag(tokens)

# Define chunk grammar using regular expressions
chunk_grammar = r"""
    VP: {<VB.*><DT>?<JJ>*<NN>}   # Chunk sequences of verbs, determiners, adjectives, and nouns
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(word_tokens_text)

# Print the chunked tokens
print(chunked_tokens)


(S
  Suyashi/NNP
  has/VBZ
  a/DT
  Rabbit/NNP
  that/WDT
  ran/VBD
  from/IN
  the/DT
  Table/NN
  ./.
  She/PRP
  bought/VBD
  it/PRP
  from/IN
  Isha/NNP
  ./.
  The/DT
  Jumping/NNP
  is/VBZ
  best/JJS
  Habit/NN
  of/IN
  the/DT
  Rabbit/NNP)
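
The tree above contains no VP subtree. In the rule, `<NN>` does not match Rabbit/NNP and `<JJ>` does not match best/JJS, so the pattern never fires on this text. One possible adjustment, shown here as a sketch, is to widen the adjective and noun patterns to `<JJ.*>` and `<NN.*>`. The tagged tokens are copied from the output above so the snippet runs without the tagger model; `vp_phrases` is a hypothetical helper name.

```python
from nltk import RegexpParser

# Tagged tokens copied from the parse output above
tagged = [
    ('Suyashi', 'NNP'), ('has', 'VBZ'), ('a', 'DT'), ('Rabbit', 'NNP'),
    ('that', 'WDT'), ('ran', 'VBD'), ('from', 'IN'), ('the', 'DT'),
    ('Table', 'NN'), ('.', '.'), ('She', 'PRP'), ('bought', 'VBD'),
    ('it', 'PRP'), ('from', 'IN'), ('Isha', 'NNP'), ('.', '.'),
    ('The', 'DT'), ('Jumping', 'NNP'), ('is', 'VBZ'), ('best', 'JJS'),
    ('Habit', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Rabbit', 'NNP'),
]

def vp_phrases(grammar):
    """Return the text of every VP chunk found by the given grammar."""
    tree = RegexpParser(grammar).parse(tagged)
    return [
        ' '.join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'VP')
    ]

# Rule from the cell above: only exact JJ and NN tags are accepted
original = vp_phrases(r"VP: {<VB.*><DT>?<JJ>*<NN>}")

# Widened rule: any adjective tag (JJ, JJR, JJS) and any noun tag (NN, NNS, NNP, NNPS)
widened = vp_phrases(r"VP: {<VB.*><DT>?<JJ.*>*<NN.*>}")

print(original)  # []
print(widened)   # ['has a Rabbit', 'is best Habit']
```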


## Quick Practice
Let's create another example for chunking, this time extracting noun phrases (NP) along with prepositional phrases (PP) from a sentence.

In [None]:
# Solution

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "Ram Loves his Life. He have a cat named RUMMY"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Define chunk grammar using regular expressions
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}    # Chunk noun phrases
    PP: {<IN><NP>}           # Chunk prepositional phrases
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(tagged_tokens)

# Print the chunked tokens
print(chunked_tokens)
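
The sample sentence in this solution is unlikely to contain a token tagged IN, in which case the PP rule has nothing to match. To see the two-stage grammar in action, the sketch below reuses the tagged tokens from the first example (hard-coded so the snippet runs without the tagger model), where over/IN is followed by a noun phrase. The variable name `pp_phrases` is illustrative.

```python
from nltk import RegexpParser

# Tagged tokens copied from the first example's pos_tag output
tagged_tokens = [
    ('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
    ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
    ('dog', 'NN'),
]

# Rules are applied in order, so the PP rule can refer to NP chunks
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}    # Stage 1: noun phrases
    PP: {<IN><NP>}          # Stage 2: preposition followed by an NP chunk
    """

tree = RegexpParser(chunk_grammar).parse(tagged_tokens)

# Collect the words of each PP chunk, including the nested NP
pp_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'PP')
]

print(tree)
print(pp_phrases)  # ['over the lazy dog']
```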



## Using Named Entity Recognition (NER)
- Named entities are noun phrases that refer to specific locations, people, organizations, and so on.
- With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity each one is.
- You can use nltk.ne_chunk() to recognize named entities.

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "You know Ashi, she works  in ABC pvt Lt. India, and its CEO  Rommy  is from Australia. Rabbit plays with Cat"
# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Perform named entity recognition
named_entities = ne_chunk(tagged_tokens)  # Takes POS-tagged tokens and returns a tree whose subtrees are the named entities

# Print the named entities
print(named_entities) 

(S
  You/PRP
  know/VBP
  (PERSON Ashi/NNP)
  ,/,
  she/PRP
  works/VBZ
  in/IN
  (ORGANIZATION ABC/NNP)
  pvt/NN
  Lt./NNP
  (GPE India/NNP)
  ,/,
  and/CC
  its/PRP$
  (ORGANIZATION CEO/NNP Rommy/NNP)
  is/VBZ
  from/IN
  (GPE Australia/NNP)
  ./.
  (PERSON Rabbit/NNP)
  plays/VBZ
  with/IN
  (ORGANIZATION Cat/NNP))
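
The output also shows that the recognizer is imperfect: CEO Rommy is labelled ORGANIZATION, Rabbit is labelled PERSON, and Cat is labelled ORGANIZATION. When reviewing results like these, it can help to group the entities by label. The sketch below does that on a hand-built tree that mirrors the first sentence of the output above, so it runs without the NER models; `entities_by_label` is a hypothetical helper name, not an NLTK function.

```python
from collections import defaultdict
from nltk.tree import Tree

# Hand-built tree mirroring the first sentence of the ne_chunk output above
ne_tree = Tree('S', [
    ('You', 'PRP'), ('know', 'VBP'),
    Tree('PERSON', [('Ashi', 'NNP')]),
    (',', ','), ('she', 'PRP'), ('works', 'VBZ'), ('in', 'IN'),
    Tree('ORGANIZATION', [('ABC', 'NNP')]),
    ('pvt', 'NN'), ('Lt.', 'NNP'),
    Tree('GPE', [('India', 'NNP')]),
    (',', ','), ('and', 'CC'), ('its', 'PRP$'),
    Tree('ORGANIZATION', [('CEO', 'NNP'), ('Rommy', 'NNP')]),
    ('is', 'VBZ'), ('from', 'IN'),
    Tree('GPE', [('Australia', 'NNP')]),
    ('.', '.'),
])

def entities_by_label(tree):
    """Group the text of each top-level entity subtree under its label."""
    grouped = defaultdict(list)
    for node in tree:
        # Plain (word, tag) tuples are ordinary tokens; Tree nodes are entities
        if isinstance(node, Tree):
            text = ' '.join(word for word, tag in node.leaves())
            grouped[node.label()].append(text)
    return dict(grouped)

grouped_entities = entities_by_label(ne_tree)
print(grouped_entities)
```

The same helper should work on the tree returned by ne_chunk, since that is also a Tree with labelled subtrees.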


In [7]:
%%time
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "The teacher have a kid named Asha. They stay in Bali. They have a pet named Kipy."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Initialize a list to store named entities
named_entities = []

# Iterate through each sentence
for sentence in sentences:
    # Tokenize the sentence into words
    tokens = word_tokenize(sentence)
    # Perform part-of-speech tagging
    tagged_tokens = pos_tag(tokens)
    # Perform named entity recognition
    named_entities.extend(ne_chunk(tagged_tokens))

# Print the named entities
for entity in named_entities:
    if hasattr(entity, 'label'):
        print(' '.join(c[0] for c in entity.leaves()), '-', entity.label())


Asha - PERSON
Bali - GPE
Kipy - PERSON
CPU times: total: 0 ns
Wall time: 10.1 ms


- Lambda functions are best suited to simple operations on single elements of a list. For a pipeline that involves iteration, tokenization, tagging, and parsing, a list comprehension is more readable, as in the next cell.

In [None]:
%%time
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "The teacher have a kid named Asha. They stay in Bali. They have a pet named Kipy."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Perform named entity recognition for each sentence and flatten the result
named_entities = [entity for sentence in sentences 
                        for entity in ne_chunk(pos_tag(word_tokenize(sentence))) 
                            if hasattr(entity, 'label')]

# Print the named entities
for entity in named_entities:
    print(' '.join(c[0] for c in entity.leaves()), '-', entity.label())


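## Named Entity Recognition (NER) with Custom Entities
- Extracting custom named entities from text by applying a regular-expression chunk rule to the POS tags. Here any run of one or more proper nouns (NNP) is grouped as an NE chunk; ne_chunk is imported but not used.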

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = " Tina and Roohi are best Friends. They work in same company. They stay in America"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Define a custom named entity chunker
chunk_rule = r"NE: {<NNP>+}"
custom_chunker = nltk.RegexpParser(chunk_rule)

# Apply custom chunker
custom_named_entities = custom_chunker.parse(tagged_tokens)

# Print custom named entities
print(custom_named_entities)
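
The cell above prints the whole tree. To list only the matched names, the sketch below walks the NE subtrees. It uses hand-assigned POS tags for the first and last sentences of the sample text, an assumption made so the snippet runs without the tagger model; real pos_tag output may differ (for example, the capitalized word Friends could be tagged as a proper noun and would then be chunked too).

```python
from nltk import RegexpParser

# Hand-assigned tags (assumed for illustration, not actual pos_tag output)
tagged_tokens = [
    ('Tina', 'NNP'), ('and', 'CC'), ('Roohi', 'NNP'), ('are', 'VBP'),
    ('best', 'JJS'), ('Friends', 'NNS'), ('.', '.'),
    ('They', 'PRP'), ('stay', 'VBP'), ('in', 'IN'), ('America', 'NNP'),
]

# Same rule as above: any run of one or more proper nouns forms an NE chunk
custom_chunker = RegexpParser(r"NE: {<NNP>+}")
tree = custom_chunker.parse(tagged_tokens)

# Collect the text of each NE chunk
custom_entities = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NE')
]

print(custom_entities)  # ['Tina', 'Roohi', 'America']
```

Unlike ne_chunk, this rule assigns no entity type: every match carries the single label NE, and its accuracy depends entirely on the NNP tags.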
