
In [1]:
import nltk
import string
import re

**Part of Speech Tagging**

A part of speech describes the grammatical role a word plays in a sentence. The same word can take different roles and meanings depending on its context. Simple natural language processing models such as bag-of-words ignore word order and therefore cannot capture these distinctions. Part of speech tagging addresses this by labelling each word with a tag based on its context in the sentence. The tags can also help when extracting relationships between words.

In the tags used below, PRP stands for personal pronoun, RB for adverb, VBD for past-tense verb, DT for determiner, and NN for singular or mass noun. The full list of part of speech tags is defined by the Penn Treebank tagset.

In [6]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [7]:
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)
  
pos_tagging('You just gave me a scare')

[('You', 'PRP'),
 ('just', 'RB'),
 ('gave', 'VBD'),
 ('me', 'PRP'),
 ('a', 'DT'),
 ('scare', 'NN')]
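
To read the output above more easily, the tags can be paired with their descriptions. The sketch below uses a small hand-written glossary; `PENN_TAGS` and `describe_tags` are hypothetical names introduced here for illustration, and the glossary covers only the five tags mentioned in the text, not the full Penn Treebank tagset.

```python
# Hand-written glossary covering only the tags named in the text above
# (an illustrative subset, not the full Penn Treebank tagset).
PENN_TAGS = {
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NN": "noun, singular or mass",
}


def describe_tags(tagged):
    """Attach a description to each (word, tag) pair; unknown tags are marked as such."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown")) for word, tag in tagged]


# The tagged sentence is copied from the output above, so no NLTK call is needed here.
tagged = [("You", "PRP"), ("just", "RB"), ("gave", "VBD"),
          ("me", "PRP"), ("a", "DT"), ("scare", "NN")]

for word, tag, description in describe_tags(tagged):
    print(f"{word:6} {tag:4} {description}")
```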

In [8]:
pos_tagging('This is a non-linear world. ')

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('non-linear', 'JJ'),
 ('world', 'NN'),
 ('.', '.')]

In [9]:
nltk.download('tagsets')
nltk.help.upenn_tagset('NN')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


**Chunking**

Chunking is the process of extracting phrases from unstructured text to give it more structure. It is also known as shallow parsing and is performed on top of part of speech tagging. It groups words into “chunks”, most commonly noun phrases. The chunks are defined with regular-expression-like patterns over part of speech tags.

In [12]:


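def chunking(text, grammar):
    word_tokens = word_tokenize(text)

    # label words with part of speech
    word_pos = pos_tag(word_tokens)

    # create a chunk parser from the grammar
    chunk_parser = nltk.RegexpParser(grammar)

    # parse the list of (word, tag) pairs into a chunk tree
    tree = chunk_parser.parse(word_pos)

    # print the full tree first, then each chunk found in it
    for subtree in tree.subtrees():
        print(subtree)

    # draw() opens a Tk window, so it raises TclError where no display is available (e.g. Colab)
    tree.draw()

sentence = 'the little yellow bird is flying in the sky'
# NP chunk: an optional determiner, any number of adjectives, then a singular noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)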

(S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)


TclError: ignored
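
To show what a tag pattern such as `<DT>?<JJ>*<NN>` does, the sketch below applies the same pattern with Python's `re` module to a string of tags. This is a simplified illustration under the assumption that the input is already tagged; it is not how NLTK implements `RegexpParser`, and `chunk_noun_phrases` is a hypothetical helper name.

```python
import re


def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs that match <DT>?<JJ>*<NN> into noun phrases."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><JJ><NN><VBZ>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # The same pattern as the grammar above, written as an ordinary regex.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<" characters.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


# Tagged tokens copied from the chunking output above.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]

print(chunk_noun_phrases(tagged))
# prints [['the', 'little', 'yellow', 'bird'], ['the', 'sky']]
```

The two groups correspond to the two NP subtrees printed above.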

**Named Entity Recognition**


Named Entity Recognition extracts information from unstructured text by locating the entities mentioned in it and classifying them into categories such as person, organization, event, and place. This gives a more detailed picture of the text and of the relationships between its entities.

In [19]:
from nltk import pos_tag, ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [20]:
def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
  
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
  
    # tree of word entities
    print(ne_chunk(word_pos))

In [21]:
text = 'I am going to Ibiza for a party and will come back and work on Google Colab'
named_entity_recognition(text)

(S
  I/PRP
  am/VBP
  going/VBG
  to/TO
  (GPE Ibiza/NNP)
  for/IN
  a/DT
  party/NN
  and/CC
  will/MD
  come/VB
  back/RB
  and/CC
  work/NN
  on/IN
  (PERSON Google/NNP Colab/NNP))
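
The printed tree mixes plain tokens with labelled entity subtrees. A minimal sketch of collecting only the entities follows. It assumes the tree behaves like the result of `ne_chunk`: a list whose items are either `(word, tag)` tuples or subtrees with a `label()` method. `MiniTree` is a hypothetical stand-in for `nltk.Tree` so that the sketch runs without downloading any models, and `extract_entities` is likewise an illustrative name.

```python
class MiniTree(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def extract_entities(tree):
    """Return (entity_label, entity_text) pairs for the labelled subtrees."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entity chunks carry a label() method.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((node.label(), text))
    return entities


# A shortened version of the tree printed above, rebuilt by hand.
tree = MiniTree("S", [
    ("going", "VBG"),
    ("to", "TO"),
    MiniTree("GPE", [("Ibiza", "NNP")]),
    ("on", "IN"),
    MiniTree("PERSON", [("Google", "NNP"), ("Colab", "NNP")]),
])

print(extract_entities(tree))
# prints [('GPE', 'Ibiza'), ('PERSON', 'Google Colab')]
```

Note that the chunker labelled Google Colab as PERSON in the output above, so extracted entities are worth checking before they are used further.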
