Chunking

Chunk extraction (chunking) is a useful preliminary step in information extraction: a chunker groups the tokens of unstructured text into a shallow parse tree. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.
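
As a minimal sketch of that process, the example below runs NLTK's `RegexpParser` on a hand-tagged sentence, so no tokenizer or tagger models are needed. The sentence and its tags are illustrative and are not taken from the extract used later. The grammar has two rules that are applied in order: the first chunks every token, and the second then removes (chinks) verbs and prepositions from those chunks.

```python
import nltk

# Hand-tagged tokens (illustrative), so the example needs no NLTK data downloads.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# Two rules for one chunk type, applied top to bottom.
grammar = r"""
NP:
    {<.*>+}        # first rule, chunk every token
    }<VBD|IN>+{    # second rule, chink (remove) verbs and prepositions
"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# Collect the words of each NP chunk from the resulting tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

Note that the rule comments avoid the `:` character, because the grammar reader treats an unescaped colon as the start of a new chunk label.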

**Recommended Steps:**

1. Sentence tokenization of the text.
2. Word tokenization.
3. POS tagging.
4. Run the chunk rule. 


Mr. James Wilson wrote the following extract about Elon Musk.

~~~~ 
Musk was born on June 28, 1971, in Pretoria, Gauteng, South Africa,[27] the son of Maye Musk (née Haldeman), a model and dietician from Regina, Saskatchewan, Canada;[28] and Errol Musk, a South African electromechanical engineer, pilot and sailor. He has a younger brother, Kimbal (born 1972), and a younger sister, Tosca (born 1974).[28][29][30][31] His paternal grandmother was British, and he also has Pennsylvania Dutch ancestry,[32][33] and his maternal grandfather was American, from Minnesota.[34] After his parents divorced in 1980, Musk lived mostly with his father in the suburbs of Pretoria.[32]

During his childhood he had an interest in reading and often did so for hours at a time.[35]

At What age did Musk got interested in computing? At age 10, he developed an interest in computing with the Commodore VIC-20.[36] He taught himself computer programming at the age of 12, sold the code for a BASIC-based video game he created called Blastar, to a magazine called PC and Office Technology, for approximately $500. [37][38] A web version of the game is available online.[37][39]

Musk was severely bullied throughout his childhood, and was once hospitalized when a group of boys threw him down a flight of stairs and then beat him until he lost consciousness.[40]

Musk was initially educated at private schools, attending the English-speaking Waterkloof House Preparatory School. 

Mr. Singh helped Musk during his initial days when he was severely bullied during his childhood.

Musk later graduated from Pretoria Boys High School and moved to Canada in June 1989, just before his 18th birthday,[41] after obtaining Canadian citizenship through his Canadian-born mother.[42][43] Although at the time Musk had to register to become a Canadian citizen, Bill C-37, which came into law in April of 2009, effectively made his Canadian citizenship retroactive to his birth since he was born in the first generation abroad to a Canadian-born mother. Therefore, with the law change, he is considered to have always been a Canadian citizen by birth.
~~~~


## Part - 1

    Given the above extract, find all the noun phrases (NP) using chunking. Follow the recommended steps described above.
    
    A noun phrase here is an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN).
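
To make the pattern concrete, here is a minimal pure-Python sketch of what the tag pattern `<DT>?<JJ>*<NN>` matches. It is not NLTK's implementation: `find_noun_phrases` is a hypothetical helper that joins the POS tags into one string and runs an ordinary regular expression over it. The tagged tokens are written by hand for illustration.

```python
import re

# Optional <DT>, any number of <JJ>, then exactly one <NN>.
# Tags are matched whole, so <NN> does not match <NNP> or <NNS>.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def find_noun_phrases(tagged):
    """Return the words of every DT? JJ* NN sequence in a list of (word, tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each "<" before the match corresponds to one earlier token.
        start = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged[start:start + length]))
    return phrases

# Hand-tagged tokens (illustrative).
tagged = [
    ("a", "DT"), ("South", "JJ"), ("African", "JJ"), ("electromechanical", "JJ"),
    ("engineer", "NN"), (",", ","), ("pilot", "NN"), ("and", "CC"), ("sailor", "NN"),
]
print(find_noun_phrases(tagged))
# ['a South African electromechanical engineer', 'pilot', 'sailor']
```

Because the pattern matches tags exactly, comparative adjectives (JJR), plural nouns (NNS), and proper nouns (NNP) are left out of the chunks. A looser NLTK pattern such as `<DT>?<JJ.*>*<NN.*>+` would be one way to cover them, at the cost of departing from the definition given here.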
    
    
## Part - 2

    Use the built-in named entity extractor to extract the entities.

In [2]:
import nltk
import nltk.draw   # tree drawing utilities used by Tree.draw()
import tkinter     # GUI toolkit that Tree.draw() relies on
from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
text="""Musk was born on June 28, 1971, in Pretoria, Gauteng, South Africa,[27] the son of Maye Musk (née Haldeman), a model and dietician from Regina, Saskatchewan, Canada;[28] and Errol Musk, a South African electromechanical engineer, pilot and sailor. He has a younger brother, Kimbal (born 1972), and a younger sister, Tosca (born 1974).[28][29][30][31] His paternal grandmother was British, and he also has Pennsylvania Dutch ancestry,[32][33] and his maternal grandfather was American, from Minnesota.[34] After his parents divorced in 1980, Musk lived mostly with his father in the suburbs of Pretoria.[32]

During his childhood he had an interest in reading and often did so for hours at a time.[35]

At What age did Musk got interested in computing? At age 10, he developed an interest in computing with the Commodore VIC-20.[36] He taught himself computer programming at the age of 12, sold the code for a BASIC-based video game he created called Blastar, to a magazine called PC and Office Technology, for approximately $500. [37][38] A web version of the game is available online.[37][39]

Musk was severely bullied throughout his childhood, and was once hospitalized when a group of boys threw him down a flight of stairs and then beat him until he lost consciousness.[40]

Musk was initially educated at private schools, attending the English-speaking Waterkloof House Preparatory School. 

Mr. Singh helped Musk during his initial days when he was severely bullied during his childhood.

Musk later graduated from Pretoria Boys High School and moved to Canada in June 1989, just before his 18th birthday,[41] after obtaining Canadian citizenship through his Canadian-born mother.[42][43] Although at the time Musk had to register to become a Canadian citizen, Bill C-37, which came into law in April of 2009, effectively made his Canadian citizenship retroactive to his birth since he was born in the first generation abroad to a Canadian-born mother. Therefore, with the law change, he is considered to have always been a Canadian citizen by birth."""

# Part 1 #

In [4]:
sent=sent_tokenize(text)

In [None]:
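# The grammar is the same for every sentence, so build the chunker once.
# Rule: optional determiner, any number of adjectives, then a singular noun.
chunkGram = r'''Chunk: {<DT>?<JJ>*<NN>}'''
chunkParser = nltk.RegexpParser(chunkGram)

try:
    for i in sent:
        words = word_tokenize(i)               # word tokenization
        tagged = nltk.pos_tag(words)           # POS tagging
        chunked = chunkParser.parse(tagged)    # apply the chunk rule
        print(chunked)
        print('\n')
        chunked.draw()                         # opens a Tk window for each sentence
except Exception as e:
    # For example, a missing NLTK resource or no display available for draw().
    print(str(e))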

(S
  Musk/NNP
  was/VBD
  born/VBN
  on/IN
  June/NNP
  28/CD
  ,/,
  1971/CD
  ,/,
  in/IN
  Pretoria/NNP
  ,/,
  Gauteng/NNP
  ,/,
  South/NNP
  Africa/NNP
  ,/,
  [/VBD
  27/CD
  ]/IN
  (Chunk the/DT son/NN)
  of/IN
  Maye/NNP
  Musk/NNP
  (/(
  née/JJ
  Haldeman/NNP
  )/)
  ,/,
  (Chunk a/DT model/NN)
  and/CC
  dietician/JJ
  from/IN
  Regina/NNP
  ,/,
  Saskatchewan/NNP
  ,/,
  Canada/NNP
  ;/:
  [/VBZ
  28/CD
  (Chunk ]/NN)
  and/CC
  Errol/NNP
  Musk/NNP
  ,/,
  (Chunk a/DT South/JJ African/JJ electromechanical/JJ engineer/NN)
  ,/,
  (Chunk pilot/NN)
  and/CC
  (Chunk sailor/NN)
  ./.)


(S
  He/PRP
  has/VBZ
  a/DT
  younger/JJR
  (Chunk brother/NN)
  ,/,
  Kimbal/NNP
  (/(
  born/JJ
  1972/CD
  )/)
  ,/,
  and/CC
  a/DT
  younger/JJR
  (Chunk sister/NN)
  ,/,
  Tosca/NNP
  (/(
  born/JJ
  1974/CD
  )/)
  ./.)


(S
  [/RB
  28/CD
  ]/JJ
  [/$
  29/CD
  ]/NNP
  [/VBD
  30/CD
  ]/NNP
  [/VBD
  31/CD
  ]/IN
  His/PRP$
  (Chunk paternal/JJ grandmother/NN)
  was/VBD
  British/JJ
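
The output above shows the bracketed citation markers (such as [27]) being split into separate tokens and mis-tagged, for example `[/VBD 27/CD ]/IN` and `(Chunk ]/NN)`. One possible cleanup, sketched below with a hypothetical `strip_citations` helper, is to remove markers of the form `[digits]` before tokenizing. It assumes that such markers never carry meaningful content.

```python
import re

# Matches numeric citation markers such as [27] or the run [28][29][30][31].
CITATION_MARKER = re.compile(r"\[\d+\]")

def strip_citations(text):
    """Return text with numeric citation markers like [27] removed."""
    return CITATION_MARKER.sub("", text)

# Fragments copied from the extract.
print(strip_citations("in Pretoria, Gauteng, South Africa,[27] the son of Maye Musk"))
print(strip_citations("Tosca (born 1974).[28][29][30][31] His paternal grandmother was British"))
```

Calling `sent_tokenize(strip_citations(text))` instead of `sent_tokenize(text)` would apply this cleanup to the whole extract.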
 

# Part 2 #

In [None]:
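try:
    for i in sent:
        words = word_tokenize(i)            # word tokenization
        tagged = nltk.pos_tag(words)        # POS tagging
        # ne_chunk labels entity spans with types such as PERSON, ORGANIZATION and GPE.
        namedEnt = nltk.ne_chunk(tagged)
        namedEnt.draw()                     # opens a Tk window for each sentence
except Exception as e:
    # For example, a missing NLTK resource or no display available for draw().
    print(str(e))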
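
`draw()` only displays the trees in a window. To get the entities as data, one option is to walk each tree and collect the labelled subtrees. The sketch below uses a hand-built `nltk.Tree` as a stand-in for `nltk.ne_chunk` output; the labels and tags in it are illustrative and are not actual output for the extract. `extract_entities` is a hypothetical helper.

```python
from nltk import Tree

def extract_entities(tree):
    """Return (label, text) pairs for every labelled chunk directly under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; chunks are Tree objects.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built stand-in for ne_chunk output (illustrative labels and tags).
example = Tree("S", [
    Tree("PERSON", [("Musk", "NNP")]),
    ("moved", "VBD"),
    ("to", "TO"),
    Tree("GPE", [("Canada", "NNP")]),
    ("in", "IN"),
    ("June", "NNP"),
    ("1989", "CD"),
])

print(extract_entities(example))  # [('PERSON', 'Musk'), ('GPE', 'Canada')]
```

Inside the Part 2 loop, `extract_entities(namedEnt)` could be called for each sentence to build a list of entities without opening any windows.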