## Applying heuristic approaches to identify candidate keyphrases from given chunks of text

**This stage involves three steps:**
1. Use standard (and, if required, customized) stop-word lists to exclude stop words
2. Apply POS tagging and keep only tokens with certain POS tags as candidate keyphrases
3. Match pre-defined lexico-syntactic patterns (after chunking)

Then we apply tf-idf, our baseline keyphrase identification method, to the shortlisted candidate keywords.

In [2]:
# Importing the required libraries

import nltk # the main NLP package; offers a set of corpora and easy interfaces to access them
import string # provides string.punctuation, used below to filter out punctuation-only candidates
import itertools # provides chain and groupby, used below to flatten and group the tagged tokens

In [3]:
# Code 1: Function to Extract Candidate Keyphrases from the given text corpus
# Source of the code: http://bdewilde.github.io/blog/blogger/2013/10/15/friedman-corpus-1-background-and-creation/
# Already a part of ModuleQ documentation (Article on KeyPhrase Extraction by Burton de Wilde)

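# lambda_unpack wraps f so that it accepts one tuple and unpacks it into positional arguments:
# f(*args) calls f with the elements of args as separate arguments. This lets a lambda written as
# lambda word, pos, chunk: ... be applied to each (word, pos, chunk) tuple.
def lambda_unpack(f):
    return lambda args: f(*args)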

# The piece of code below is used to set rules and define the logic for POS-based candidate keyphrase chunking
# The function takes two arguments - the first is the piece of text that we are analyzing
# The second argument is the "grammar", which is passed through the regexp parser, to consider 
# only those phrases of text which satisfy the POS Sequence specified in the grammar

# This is a "noun-phrases only" (NP) heuristic: only nouns and their associated adjectives
# are treated as objects of interest
# This makes sense in a setting where we are only interested in studying the business priorities
# within an e-mail, which are likely to be described by the nouns in the text

def extract_candidate_chunks(text, grammar=r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'):
    # The grammar is a POS-tag sequence that the regexp parser matches against each tagged sentence

    # Sets of punctuation characters and English stop words, used at the end to drop
    # candidates that are stop words or consist only of punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # Note: JJ = adjective; NN.* = any noun tag (NN, NNS, NNP, NNPS); IN = preposition or subordinating conjunction
    # Note: the '*', '+' and '?' quantifiers are greedy - they match as much as possible
    # ab* matches 'a' followed by zero or more 'b's ('a', 'ab', 'abb', ...)
    # ab+ matches 'a' followed by one or more 'b's
    # ab? matches either 'a' or 'ab'

    # Tokenize, POS-tag, and chunk using regular expressions
    chunker = nltk.chunk.regexp.RegexpParser(grammar)

    # nltk.chunk.regexp.ChunkRule(tag_pattern, descr)
    # A rule specifying how to add chunks to a ChunkString, using a matching tag pattern.
    # When applied to a ChunkString, it finds any substring that matches this tag pattern and is not already
    # part of a chunk, and creates a new chunk containing that substring.

    tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text))
    # nltk.sent_tokenize splits the text into sentences; nltk.word_tokenize splits each sentence into tokens.
    # nltk.pos_tag_sents takes the tokens of each sentence and assigns a POS tag to each one.
    # If there are 5 sentences in the text, tagged_sents is a list of 5 lists of (word, tag) pairs.

    all_chunks = list(itertools.chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent))
                                                    for tagged_sent in tagged_sents))
    # all_chunks is a flat list of tuples in the format (word, pos, chunk)
    # Tokens tagged O lie outside every chunk (they did not match the grammar), so they are
    # excluded when the words of each chunk are joined into a phrase further below

    # nltk.chunk.util.tree2conlltags(t)
    # Converts a tree to the CoNLL IOB tag format.
    # Parameters: t (Tree): the tree to be converted.
    # Returns: a list of 3-tuples containing (word, tag, IOB-tag).
    # Return type: list(tuple)

    # The IOB format (short for inside, outside, beginning) is a common format for tagging tokens
    # in a chunking task.
    # A B- prefix before a tag indicates that the token begins a chunk.
    # An I- prefix before a tag indicates that the token is inside a chunk.
    # In the output of tree2conlltags every chunk starts with a B- tag, even a single-token chunk.
    # An O tag indicates that a token belongs to no chunk.

    # itertools.chain(*iterables)
    # Makes an iterator that returns elements from the first iterable until it is exhausted,
    # then proceeds to the next iterable, until all of the iterables are exhausted.
    # It is used for treating consecutive sequences as a single sequence.
    # It is roughly equivalent to:

    # def chain(*iterables):
    #     # chain('ABC', 'DEF') --> A B C D E F
    #     for it in iterables:
    #         for element in it:
    #             yield element

    # The code above uses chain.from_iterable, which takes a single iterable of iterables instead.

    # Group consecutive in-chunk tokens and join each group's words into one lowercase phrase
    candidates = [' '.join(word for word, pos, chunk in group).lower()
                  for key, group in itertools.groupby(all_chunks, lambda_unpack(lambda word, pos, chunk: chunk != 'O')) if key]

    # Drop candidates that are stop words or consist entirely of punctuation
    return [cand for cand in candidates if cand not in stop_words and not all(char in punct for char in cand)]
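
To see which tag sequences the default grammar selects without running NLTK, the tag pattern can be approximated with the standard `re` module. This is only a rough sketch: `TAG_PATTERN` and `match_tag_pattern` are hypothetical names that are not part of NLTK, and `<NN[^<>]*>` stands in for the grammar's `<NN.*>`. The tagged tokens are copied from the first sentence of the sample that is worked through in the following cells.

```python
import re

# Approximation of the grammar (<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+ as a plain regex
# over a string of "<TAG>" items. Hypothetical helper, not an NLTK API.
TAG_PATTERN = re.compile(r"(?:(?:<JJ>)*(?:<NN[^<>]*>)+<IN>)?(?:<JJ>)*(?:<NN[^<>]*>)+")

def match_tag_pattern(tagged_tokens):
    """Return the word sequences whose POS tags match TAG_PATTERN."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    phrases = []
    for m in TAG_PATTERN.finditer(tag_string):
        # Each tag contributes exactly one '<', so counting '<' gives token positions
        start = tag_string[:m.start()].count("<")
        length = m.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:start + length]))
    return phrases

tagged = [("Where", "WRB"), ("are", "VBP"), ("we", "PRP"), ("with", "IN"),
          ("respect", "NN"), ("to", "TO"), ("the", "DT"), ("presentation", "NN"),
          ("for", "IN"), ("MoneyGram", "NNP"), ("tomorrow", "NN"), ("?", ".")]

phrases = match_tag_pattern(tagged)
print(phrases)
```

The optional `(... <IN>)?` group is what lets a preposition link two noun groups, which is why `presentation for MoneyGram tomorrow` comes out as a single phrase while `respect` stands alone.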

In [4]:
# Sample to understand the function line by line

text = "Where are we with respect to the presentation for MoneyGram tomorrow? \
Have we validated and plugged in the numbers in the deck. Don't forget - Western Union \
is the largest money transfer service across the world, let's be careful."
grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'

punct = set(string.punctuation) 
stop_words = set(nltk.corpus.stopwords.words('english')) 

chunker = nltk.chunk.regexp.RegexpParser(grammar)

tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text))
all_chunks = list(itertools.chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent))
                                                    for tagged_sent in tagged_sents))
    

candidates = [' '.join(word for word, pos, chunk in group).lower()
                  for key, group in itertools.groupby(all_chunks, lambda_unpack(lambda word,pos,chunk : chunk != 'O')) if key]
# print([cand for cand in candidates if cand not in stop_words and not all(char in punct for char in cand)])

In [5]:
tagged_sents

[[('Where', 'WRB'),
  ('are', 'VBP'),
  ('we', 'PRP'),
  ('with', 'IN'),
  ('respect', 'NN'),
  ('to', 'TO'),
  ('the', 'DT'),
  ('presentation', 'NN'),
  ('for', 'IN'),
  ('MoneyGram', 'NNP'),
  ('tomorrow', 'NN'),
  ('?', '.')],
 [('Have', 'VBP'),
  ('we', 'PRP'),
  ('validated', 'VBN'),
  ('and', 'CC'),
  ('plugged', 'VBN'),
  ('in', 'IN'),
  ('the', 'DT'),
  ('numbers', 'NNS'),
  ('in', 'IN'),
  ('the', 'DT'),
  ('deck', 'NN'),
  ('.', '.')],
 [('Do', 'VBP'),
  ("n't", 'RB'),
  ('forget', 'VB'),
  ('-', ':'),
  ('Western', 'NNP'),
  ('Union', 'NNP'),
  ('is', 'VBZ'),
  ('the', 'DT'),
  ('largest', 'JJS'),
  ('money', 'NN'),
  ('transfer', 'NN'),
  ('service', 'NN'),
  ('across', 'IN'),
  ('the', 'DT'),
  ('world', 'NN'),
  (',', ','),
  ('let', 'VB'),
  ("'s", 'POS'),
  ('be', 'VB'),
  ('careful', 'JJ'),
  ('.', '.')]]
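
The per-sentence nesting shown above is flattened into one sequence with `itertools.chain.from_iterable` before any grouping takes place. A minimal illustration, using two short lists that are shortened from the output above to stand in for two sentences:

```python
from itertools import chain

# Two short lists of (word, tag) pairs standing in for two tagged sentences
tagged_sents = [
    [("the", "DT"), ("numbers", "NNS")],
    [("the", "DT"), ("deck", "NN")],
]

# chain.from_iterable walks the outer list and yields the items of each inner list in order
flat = list(chain.from_iterable(tagged_sents))
print(flat)
```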

In [6]:
all_chunks

[('Where', 'WRB', 'O'),
 ('are', 'VBP', 'O'),
 ('we', 'PRP', 'O'),
 ('with', 'IN', 'O'),
 ('respect', 'NN', 'B-KT'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('presentation', 'NN', 'B-KT'),
 ('for', 'IN', 'I-KT'),
 ('MoneyGram', 'NNP', 'I-KT'),
 ('tomorrow', 'NN', 'I-KT'),
 ('?', '.', 'O'),
 ('Have', 'VBP', 'O'),
 ('we', 'PRP', 'O'),
 ('validated', 'VBN', 'O'),
 ('and', 'CC', 'O'),
 ('plugged', 'VBN', 'O'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('numbers', 'NNS', 'B-KT'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('deck', 'NN', 'B-KT'),
 ('.', '.', 'O'),
 ('Do', 'VBP', 'O'),
 ("n't", 'RB', 'O'),
 ('forget', 'VB', 'O'),
 ('-', ':', 'O'),
 ('Western', 'NNP', 'B-KT'),
 ('Union', 'NNP', 'I-KT'),
 ('is', 'VBZ', 'O'),
 ('the', 'DT', 'O'),
 ('largest', 'JJS', 'O'),
 ('money', 'NN', 'B-KT'),
 ('transfer', 'NN', 'I-KT'),
 ('service', 'NN', 'I-KT'),
 ('across', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('world', 'NN', 'B-KT'),
 (',', ',', 'O'),
 ('let', 'VB', 'O'),
 ("'s", 'POS', 'O'),
 ('be', 'VB'

In [7]:
candidates

['respect',
 'presentation for moneygram tomorrow',
 'numbers',
 'deck',
 'western union',
 'money transfer service',
 'world']
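
The step from `all_chunks` to `candidates` can be reproduced with plain tuples. The sketch below also illustrates a caveat: grouping on `chunk != 'O'` joins two chunks that sit directly next to each other, whereas starting a new phrase at every `B-` tag keeps them apart. The helper names and the second, hand-tagged token list are hypothetical and serve only as an illustration.

```python
from itertools import groupby

def phrases_by_groupby(iob_tokens):
    """Mirror the notebook's approach: join runs of tokens whose tag is not 'O'."""
    return [" ".join(word for word, _pos, _chunk in group).lower()
            for in_chunk, group in groupby(iob_tokens, key=lambda t: t[2] != "O")
            if in_chunk]

def phrases_by_b_tag(iob_tokens):
    """Alternative: close the current phrase at every 'O' and at every 'B-' tag."""
    phrases, current = [], []
    for word, _pos, chunk in iob_tokens:
        if chunk == "O" or chunk.startswith("B-"):
            if current:
                phrases.append(" ".join(current).lower())
            current = []
        if chunk != "O":
            current.append(word)
    if current:
        phrases.append(" ".join(current).lower())
    return phrases

# Excerpt of the all_chunks output shown above
sample = [("Western", "NNP", "B-KT"), ("Union", "NNP", "I-KT"), ("is", "VBZ", "O"),
          ("the", "DT", "O"), ("largest", "JJS", "O"), ("money", "NN", "B-KT"),
          ("transfer", "NN", "I-KT"), ("service", "NN", "I-KT")]

# Hypothetical, hand-tagged tokens in which one chunk immediately follows another
adjacent = [("sales", "NNS", "B-KT"), ("of", "IN", "I-KT"), ("widgets", "NNS", "I-KT"),
            ("last", "JJ", "B-KT"), ("quarter", "NN", "I-KT")]

print(phrases_by_groupby(sample))
print(phrases_by_b_tag(sample))
print(phrases_by_groupby(adjacent))
print(phrases_by_b_tag(adjacent))
```

On the sample excerpt both helpers agree; they differ only when a `B-` tag directly follows another chunk, so whether the distinction matters depends on how often that occurs in the e-mails being analysed.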

In [8]:
for cand in candidates:
    if cand not in stop_words and not all(char in punct for char in cand):
        print(cand)

respect
presentation for moneygram tomorrow
numbers
deck
western union
money transfer service
world


In [9]:
extract_candidate_chunks("A brute-force method might consider all words and/or phrases in a document as candidate keyphrases. However, given computational costs and the fact that not all words and phrases in a document are equally likely to convey its content, heuristics are typically used to identify a smaller subset of better candidates. Common heuristics include removing stop words and punctuation; filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns; and using external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.")

['brute-force method',
 'words',
 'phrases',
 'document as candidate keyphrases',
 'computational costs',
 'fact',
 'words',
 'phrases',
 'document',
 'content',
 'heuristics',
 'subset',
 'candidates',
 'common heuristics',
 'stop words',
 'punctuation',
 'words with certain parts',
 'speech',
 'multi-word phrases',
 'certain pos patterns',
 'external knowledge bases like wordnet',
 'wikipedia',
 'reference source of good/bad keyphrases']

In [10]:
def extract_candidate_words(text, good_tags=set(['JJ','JJR','JJS','NN','NNP','NNS','NNPS'])):
    import itertools, nltk, string

    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # tokenize and POS-tag words
    tagged_words = itertools.chain.from_iterable(nltk.pos_tag_sents(nltk.word_tokenize(sent)
                                                                    for sent in nltk.sent_tokenize(text)))
    # filter on certain POS tags and lowercase all words
    candidates = [word.lower() for word, tag in tagged_words
                  if tag in good_tags and word.lower() not in stop_words
                  and not all(char in punct for char in word)]

    return candidates
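
The filtering logic of `extract_candidate_words` can be tried without NLTK by feeding it tokens that are already tagged. In this sketch `STOP_WORDS` is a tiny stand-in for the NLTK stop-word list, `filter_tagged_words` is a hypothetical helper, and the sentence and its tags are hand-assigned, so the result only illustrates the three conditions (good tag, not a stop word, not pure punctuation).

```python
import string

# Tiny stand-in for nltk.corpus.stopwords.words('english'); an assumption for this sketch
STOP_WORDS = {"the", "for", "same", "other"}
GOOD_TAGS = {"JJ", "JJR", "JJS", "NN", "NNP", "NNS", "NNPS"}
PUNCT = set(string.punctuation)

def filter_tagged_words(tagged_words):
    """Keep lowercased words with a good tag that are neither stop words nor pure punctuation."""
    return [word.lower() for word, tag in tagged_words
            if tag in GOOD_TAGS
            and word.lower() not in STOP_WORDS
            and not all(char in PUNCT for char in word)]

# Hypothetical sentence with hand-assigned tags
tagged = [("The", "DT"), ("same", "JJ"), ("deck", "NN"), ("shows", "VBZ"),
          ("5", "CD"), ("%", "NN"), ("growth", "NN"), ("for", "IN"),
          ("Western", "NNP"), ("Union", "NNP"), (".", ".")]

words = filter_tagged_words(tagged)
print(words)
```

Here `same` is dropped by the stop-word check even though it carries an adjective tag, and `%` is dropped by the punctuation check even though it carries a noun tag.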

In [11]:
# Applying the function
extract_candidate_words("A brute-force method might consider all words and/or phrases in a document as candidate keyphrases. However, given computational costs and the fact that not all words and phrases in a document are equally likely to convey its content, heuristics are typically used to identify a smaller subset of better candidates. Common heuristics include removing stop words and punctuation; filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns; and using external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.")

['brute-force',
 'method',
 'words',
 'phrases',
 'document',
 'candidate',
 'keyphrases',
 'computational',
 'costs',
 'fact',
 'words',
 'phrases',
 'document',
 'likely',
 'content',
 'heuristics',
 'smaller',
 'subset',
 'better',
 'candidates',
 'common',
 'heuristics',
 'stop',
 'words',
 'punctuation',
 'words',
 'certain',
 'parts',
 'speech',
 'multi-word',
 'phrases',
 'certain',
 'pos',
 'patterns',
 'external',
 'knowledge',
 'bases',
 'wordnet',
 'wikipedia',
 'reference',
 'source',
 'good/bad',
 'keyphrases']

In [12]:
# Illustration
" ".join(["a", "b", "c"])

'a b c'

In [22]:
# Now, let us apply the tf-idf algorithm to these candidate keywords
import gensim

def score_keyphrases_by_tfidf(texts, candidates='chunks'):
    # extract candidates from each text in texts, either chunks or words
    if candidates == 'chunks':
        boc_texts = [extract_candidate_chunks(text) for text in texts]
    elif candidates == 'words':
        boc_texts = [extract_candidate_words(text) for text in texts]
    else:
        raise ValueError("candidates must be 'chunks' or 'words'")
    # make gensim dictionary and corpus
    dictionary = gensim.corpora.Dictionary(boc_texts)
    corpus = [dictionary.doc2bow(boc_text) for boc_text in boc_texts]
    # transform corpus with tf*idf model
    tfidf = gensim.models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    # corpus_tfidf is a lazily evaluated TransformedCorpus, so this prints only the object, not the scores
    print(corpus_tfidf)
    return corpus_tfidf, dictionary
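
To make the weighting itself visible, here is a dependency-free sketch of tf-idf over bags of candidates. It assumes raw term counts, idf = log2(N / df) and scaling of each document vector to unit length, which is our understanding of the `TfidfModel` defaults; unlike gensim, it keeps zero-weight entries. `tfidf_scores` is a hypothetical helper and the candidate lists are hand-written, not actual extractor output.

```python
import math
from collections import Counter

def tfidf_scores(boc_texts):
    """Return one {candidate: weight} dict per document."""
    n_docs = len(boc_texts)
    # Document frequency: in how many documents each candidate occurs
    doc_freq = Counter(cand for doc in boc_texts for cand in set(doc))
    scored = []
    for doc in boc_texts:
        term_freq = Counter(doc)
        weights = {cand: tf * math.log2(n_docs / doc_freq[cand])
                   for cand, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale to unit length; a vector whose weights are all zero stays all zero
        scored.append({cand: (w / norm if norm else 0.0) for cand, w in weights.items()})
    return scored

# Hand-written candidate lists (hypothetical, not actual extractor output)
boc_texts = [
    ["testing", "red butterfly", "castle island"],
    ["business interests", "testing"],
    ["testing", "testing", "blue butterfly"],
]

scores = tfidf_scores(boc_texts)
for doc_scores in scores:
    print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
```

A candidate that occurs in every document, such as `testing` here, gets an idf of zero and therefore contributes nothing, however often it is repeated within a single document.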


In [23]:
score_keyphrases_by_tfidf(texts)

<gensim.interfaces.TransformedCorpus object at 0x10cb75f60>


(<gensim.interfaces.TransformedCorpus at 0x10cb75f60>,
 <gensim.corpora.dictionary.Dictionary at 0x10cb755c0>)

In [19]:
texts = ["Testing one. Red butterfly. Castle island.","Business interests, testing five.","Testing five, testing six, the blue butterfly"]
# Task: 

In [21]:
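# corpus_tfidf exists only inside score_keyphrases_by_tfidf, so printing it directly
# raises "NameError: name 'corpus_tfidf' is not defined". Capture the return values first.
corpus_tfidf, dictionary = score_keyphrases_by_tfidf(texts)
print(corpus_tfidf)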