# Key Phrases // Shannon Hamilton

## I. Methodology 

### Algorithm Overview

To pull key phrases, I chose to implement the three techniques below:

1. Partial Parsing: Chunking by noun phrases, scoring using TF*IDF
2. Collocations: Bigram/trigram, scored using PMI
3. Frequent Terms: Normalization, POS tagging, uni/bi/trigrams
    
I chose not to include semantic similarity analyses because their output was very similar to that of my partial parsing technique. I did not want any of my 2000 characters to be repetitive. :)

### Experiments Conducted

I ran a number of experiments to determine which techniques would select the best key phrases. 

**1. Partial Parsing:** When chunking, I experimented with a number of variables: (1a) whether to convert my text to all lowercase before POS-tagging, (2a) whether to stem my tokens before POS-tagging, (3a) what POS pattern to chunk by, (4a) which analysis best scores the meaningfulness of key-phrase "candidates", and (5a) whether to preserve "candidates" in their original noun phrases or to create one long string of all words that were members of noun phrases. (1b) I decided to convert my text to all lowercase so that capitalized stop words would not show up among my highest-scoring words. (2b) I decided not to stem or lemmatize my corpus after looking at its chunking/scoring output. There were no word stems repeated, perhaps due to the other normalization functions I applied to the text. (3b) For the chunking pattern, I decided after experimenting that noun phrases were the most meaningful pattern for this exercise. (4b) I chose a TF*IDF analysis to score my chunks primarily because it was the one we'd discussed most in class. I realize that, compared to other machine learning techniques, it is limited in its application. (5b) After tinkering with the code for some time, I decided to compile all words that were members of noun phrases into a single string to streamline later calculations. I look forward to exploring how I might retain candidates in their noun phrases for other scoring analyses in the future.

**2. Collocations:** When computing collocations, I also experimented with a number of variables: (1a) what frequency filter was most appropriate for my text, (2a) whether I should change any of my tokenization to support better collocation computation, and (3a) whether bigram or trigram collocations provided more meaningful key phrases. (1b) After experimenting with different frequency filters (filtering bigrams and trigrams by how often they appeared in the text), I settled on a frequency filter of 4. For my text, it surfaced bigrams and trigrams that added depth and breadth to the key phrases computed through partial parsing. (2b) After a number of experiments, I decided to keep my tokenization the same. It appeared to tokenize and create collocations well. (3b) My text is a set of PubMed abstracts on depression. I decided to include trigram over bigram collocations because I thought the trigram outputs provided a richer sense of the text, potentially highlighting depression's comorbidities or commonly investigated population types. The bigram collocations, on the other hand, seemed to present more adjective + noun pairs (e.g., long term, systematic review, mental health), information already gathered through my partial parsing analysis.
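
To make the frequency filter and PMI scoring described above concrete, here is a minimal pure-Python sketch. It uses the textbook definition PMI(w1, w2) = log2(count(w1, w2) * N / (count(w1) * count(w2))), where N is the number of tokens; it is an illustration, not the code that produced my results. The names `pmi_bigrams` and `toy_tokens` are hypothetical and do not appear elsewhere in this notebook.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=1):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    n_words = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        if n_pair < min_freq:
            continue  # frequency filter: skip pairs seen fewer than min_freq times
        scores[(w1, w2)] = math.log2(
            n_pair * n_words / (word_counts[w1] * word_counts[w2])
        )
    # highest PMI first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# toy data, made up for illustration only
toy_tokens = ("serotonin reuptake inhibitors treat depression "
              "serotonin reuptake inhibitors help depression "
              "depression treatment").split()

unfiltered = pmi_bigrams(toy_tokens, min_freq=1)
filtered = pmi_bigrams(toy_tokens, min_freq=2)
print(len(unfiltered), "pairs without a filter;", len(filtered), "pairs with min_freq=2")
```

In this toy example, a pair that occurs only once (`inhibitors treat`) gets the same PMI as a pair that occurs twice (`serotonin reuptake`), because PMI rewards words that are rare on their own. That is why a frequency filter matters before ranking by PMI.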


**3. Frequent Terms:** I experimented less with frequent-term calculations because of their more straightforward nature. There was one analysis I had hoped to compute, but I was unable to get the code to work after hours of tinkering this week. My thought was to find the five most frequent unigrams of a text, and then apply a word filter to my above bigram and trigram collocations to return the highest-scoring collocations that included one of the five most frequent unigrams. I thought that'd be a fun way to combine different techniques and look forward to seeing if that provides any meaningful key phrases in the future. For my analyses, I decided on displaying the most frequent bigrams from a FreqDist. I thought they best rounded out the key phrases found through trigram collocations and partially parsed unigrams.
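
One possible way to sketch the combined analysis described above (keep only the collocations that contain one of the most frequent unigrams) is shown here. This is an assumption about how it could work, not code I got running for this assignment. The function name `collocations_with_frequent_words` and the toy data are hypothetical.

```python
from collections import Counter

def collocations_with_frequent_words(tokens, collocations, n_top=5):
    """Keep collocations containing at least one of the n_top most frequent
    unigrams. The input order of the collocations is preserved."""
    top_words = {word for word, _ in Counter(tokens).most_common(n_top)}
    return [ngram for ngram in collocations if top_words.intersection(ngram)]

# toy data, made up for illustration only
toy_tokens = ["depression", "anxiety", "depression", "chronic",
              "pain", "depression", "anxiety"]
toy_collocations = [("chronic", "pain"),
                    ("depression", "anxiety"),
                    ("anxiety", "depression", "chronic")]

kept = collocations_with_frequent_words(toy_tokens, toy_collocations, n_top=2)
print(kept)
```

Because the filter preserves input order, passing in a list that is already ranked by PMI would return the highest-scoring matching collocations first.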


### What Worked Well + What Could Be Improved

**What worked well?** While I realize TF*IDF has its limitations, I think my chunking analysis provided the best high-level view of my text. I think if you were to see my chunking output, you'd understand what the text is about. I think my FreqDist and collocation analyses provided more context to the high-level overview retrieved by my chunking analysis, for example by providing more information about populations most recently studied (e.g., gay/transgender, immigrants, pregnant women) and what might be common comorbidities of depression (e.g., anxiety, alcohol, pregnancy, chronic pain, sickle cell anemia).

**What could be improved?** I think I could improve the scoring of my chunks using a technique other than TF*IDF. I'm excited to learn more about this as the semester progresses. I also think it would have been interesting to preserve the noun phrases as *phrases* and not just as a long compiled string of words that were members of noun phrases. As I've been writing this up, I've also realized that I'd like to tokenize hyphenated words differently, especially for bigram and trigram analyses. Lastly, I would be interested in further exploring how to use WordNet and other semantic similarity measures to find high-level concepts in a text, and how those findings might differ from what I found using chunking/TF*IDF techniques.
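
As a sketch of the hyphenation idea, the regular expression below keeps hyphenated words together as single tokens. It is one possible variation on the tokenization pattern I use later in this notebook, not something I tested on the full corpus; `HYPHEN_AWARE_PATTERN` and `tokenize_keep_hyphens` are hypothetical names.

```python
import re

# word characters, optionally joined by internal hyphens; then money
# expressions; then any other run of non-whitespace characters
HYPHEN_AWARE_PATTERN = r"\w+(?:-\w+)*|\$[\d.]+|\S+"

def tokenize_keep_hyphens(text):
    """Tokenize text, keeping hyphenated words such as 'long-term' whole."""
    return re.findall(HYPHEN_AWARE_PATTERN, text)

sample = "A long-term follow-up of first-generation migrants."
hyphen_tokens = tokenize_keep_hyphens(sample)
print(hyphen_tokens)

# for comparison: a pattern without the hyphen rule splits the word
split_tokens = re.findall(r"\w+|\$[\d.]+|\S+", "long-term")
print(split_tokens)
```

With the hyphen rule, "first-generation" stays one token, so a trigram window would cover more of the surrounding phrase.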

## II. Algorithm:

### Noun-Phrase Chunking + TF*IDF Scoring

#### Imports

In [3]:
import nltk
import nltk.chunk
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from operator import itemgetter


#### Raw Text Corpora

In [4]:
with open('pubmed_depression_bodytext.txt', 'r') as handle:
    raw = handle.read().replace('\n', ' ')

#### Tokenize

In [5]:
def tokenize_text(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    #word-tokenize each sentence, giving a list of token lists
    return [nltk.word_tokenize(sent) for sent in raw_sents]

pubmed_sents = tokenize_text(raw)

#### Get Candidates

In [6]:
def get_chunks(sents):
    #POS-tag sentences
    tagged_sents = [nltk.pos_tag(sent) for sent in sents]
    
    #load the stop word list once, instead of once per token
    stops = set(nltk.corpus.stopwords.words('english'))
    
    #normalize and create a list of tagged words
    normed_tagged_words = [word_tag_pair for sent in tagged_sents for word_tag_pair in sent
                          if word_tag_pair[0].lower() not in stops
                          and word_tag_pair[0] not in string.punctuation and not word_tag_pair[0].isdigit()]
    #set chunk rule for noun phrases
    chunk_rule = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
    
    #chunk by chunker
    chunker = nltk.chunk.regexp.RegexpParser(chunk_rule)
    all_chunks = nltk.chunk.tree2conlltags(chunker.parse(normed_tagged_words))
    # http://web.media.mit.edu/~havasi/MAS.S60/PNLP7.pdf (tree2conlltags)
    return all_chunks

#take all the chunks and return all the ones where the chunk != 'O' 
#compile those noun phrases into one-string list
def get_candidates(all_chunks):
    candidates = []
    #each item is a (word, POS tag, chunk tag) triple
    for word, tag, chunk in all_chunks:
        if chunk != "O":
            candidates.append(word.lower())
    return [" ".join(candidates)]

all_chunks = get_chunks(pubmed_sents)
candidates = get_candidates(all_chunks)
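
As a sketch of how candidates could be kept as phrases instead of one long string, the function below groups (word, POS tag, chunk tag) triples into phrases. It assumes the chunk tags follow the IOB convention that `tree2conlltags` produces, where a tag starting with `B-` opens a chunk and `I-` continues it. `group_chunks_into_phrases` and the sample triples are hypothetical and hand-written, not output from my corpus.

```python
def group_chunks_into_phrases(conll_triples):
    """Join consecutive chunked words into lowercase phrases.

    conll_triples: iterable of (word, pos_tag, iob_tag) tuples, where
    iob_tag is 'O' for words outside any chunk.
    """
    phrases, current = [], []
    for word, _pos_tag, iob_tag in conll_triples:
        if iob_tag.startswith("B-"):
            # a new chunk begins: close the previous one first
            if current:
                phrases.append(" ".join(current))
            current = [word.lower()]
        elif iob_tag.startswith("I-") and current:
            current.append(word.lower())
        else:
            # outside a chunk: close any open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# hand-written sample triples, for illustration only
sample_triples = [("chronic", "JJ", "B-KT"), ("pain", "NN", "I-KT"),
                  ("is", "VBZ", "O"), ("common", "JJ", "O"),
                  ("Depression", "NN", "B-KT")]
phrases = group_chunks_into_phrases(sample_triples)
print(phrases)
```

The resulting list of phrases could then be counted or scored as whole units, which is the direction I mention wanting to explore in the write-up.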

#### Score Candidates

In [7]:
def score_candidates(candidates):
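    #min_df=1 (the default) keeps every term; newer scikit-learn releases reject an integer min_df of 0
    tfidf = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=None)
    tfidf_score = tfidf.fit_transform(candidates)
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = tfidf.get_feature_names_out()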
    #used this tutorial: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html
    
    words_and_scores = []
    for col in tfidf_score.nonzero()[1]:
        words_and_scores.append((feature_names[col], tfidf_score[0, col]))
    
    words_and_scores = sorted(words_and_scores, key=itemgetter(1), reverse=True)
    
    return words_and_scores[:30]

candidates_scored = score_candidates(candidates)
#format the printing
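
One consequence of passing a single compiled string to the vectorizer is worth spelling out. Assuming scikit-learn's documented formulas (with `smooth_idf=False`, idf = ln(n_docs / doc_freq) + 1; with `sublinear_tf=True`, tf becomes 1 + ln(tf)), a one-document corpus gives every term the same idf of 1, so the ranking follows term frequency alone. The pure-Python sketch below reproduces that arithmetic on toy data; `single_document_scores` is a hypothetical name and this is not the scikit-learn implementation.

```python
import math
from collections import Counter

def single_document_scores(tokens):
    """Sublinear TF * IDF with L2 normalization for a one-document corpus."""
    counts = Counter(tokens)
    # one document, and every term occurs in it: ln(1 / 1) + 1 = 1.0
    idf = math.log(1 / 1) + 1
    raw = {term: (1 + math.log(tf)) * idf for term, tf in counts.items()}
    # L2-normalize so the squared weights sum to 1
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

# toy data, made up for illustration only
toy_scores = single_document_scores(
    "depression anxiety depression pain depression anxiety".split()
)
for term, score in sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 3))
```

To get an IDF that separates terms, each abstract (or each noun phrase) would need to be passed to the vectorizer as its own document.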

### Bigram + Trigram FreqDist and Collocations

#### Tokenize + Normalize

In [8]:
from nltk import word_tokenize, sent_tokenize
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sent_tokens = sent_tokenize(raw)

pattern = r'\w+|\$[\d\.]+|\S+'
#word-character sequences (letters, digits, underscores), money expressions, and any other non-whitespace sequences
word_tokens = nltk.regexp_tokenize(raw, pattern)

def normalize_tokens(word_tokens):
    stops = set(nltk.corpus.stopwords.words('english')) #a set makes membership tests fast
    word_tokens = [w.lower() for w in word_tokens]
    word_tokens = [w for w in word_tokens if w not in stops] #remove stop words
    word_tokens = [w for w in word_tokens if w not in string.punctuation] #remove punctuation
    word_tokens = [w for w in word_tokens if not w.isdigit()] #remove numbers
    return word_tokens

word_tokens = normalize_tokens(word_tokens)

['neuroanatomy', 'depression', 'review', 'depression', 'common', 'psychiatric', 'disorder', 'number', 'one', 'cause', 'disability', 'affects', 'population', 'aim', 'review', 'present', 'brief', 'synopsis', 'various', 'biochemical', 'imbalances', 'thought', 'contribute', 'depression', 'aspects', 'anatomy', 'possibly', 'implicated', 'depression', 'treatments', 'related', 'targeting', 'specific', 'locales', 'multiple', 'neurotransmitters', 'parts', 'brain', 'involved', 'disorder', 'depression', 'although', 'exact', 'etiology', 'depression', 'found', 'cases', 'various', 'treatments', 'medicinal', 'psychiatric', 'surgical', 'exist', 'disabling', 'disease', 'improved', 'knowledge', 'anatomical', 'sites', 'involved', 'patients', 'depression', 'help', 'future', 'treatment', 'modalities', 'article', 'protected', 'copyright', 'rights', 'reserved', 'modifying', 'risk', 'factors', 'management', 'erectile', 'dysfunction', 'review', 'erectile', 'dysfunction', '(ed)', 'prevalent', 'among', 'men', 'pr

#### Bigram + Trigram FreqDist

In [40]:
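def get_freq_uni(word_tokens):
    #count unigrams and return the 30 most common
    unigram_fdist = nltk.FreqDist(word_tokens)
    return unigram_fdist.most_common(30)

#the function has its own name so re-running this cell does not overwrite it with its result
freq_uni = get_freq_uni(word_tokens)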

In [1]:
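def get_freq_bi(word_tokens):
    #count adjacent word pairs and return the 30 most common
    bigrams = nltk.bigrams(word_tokens)
    bigram_fdist = nltk.FreqDist(bigrams)
    return bigram_fdist.most_common(30)

#the function has its own name so re-running this cell does not overwrite it with its result
freq_bi = get_freq_bi(word_tokens)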

In [41]:
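def get_freq_tri(word_tokens):
    #count adjacent word triples and return the 30 most common
    trigrams = nltk.trigrams(word_tokens)
    trigram_fdist = nltk.FreqDist(trigrams)
    return trigram_fdist.most_common(30)

#the function has its own name so re-running this cell does not overwrite it with its result
freq_tri = get_freq_tri(word_tokens)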

#### Bigram + Trigram Collocations

In [23]:
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(word_tokens)
trigram_finder = TrigramCollocationFinder.from_words(word_tokens)

In [28]:
bigram_finder.apply_freq_filter(5) #drops bigrams that appear fewer than 5 times in the text
bigram_collocations = bigram_finder.nbest(bigram_measures.pmi, 20)

In [29]:
trigram_finder.apply_freq_filter(5) #drops trigrams that appear fewer than 5 times in the text
trigram_collocations = trigram_finder.nbest(trigram_measures.pmi, 20)

## III. Compiled Results:

In [60]:
print("NOUN PHRASE EXCERPTS, SCORED BY TF*IDF")
print("################################################")
for word, score in candidates_scored:
    print(word)
print("\n")
print("TRI-GRAM COLLOCATIONS")
print("################################################") 
for item in trigram_collocations[:15]:
    print("{0:<20}{1:<20}{2:<20}".format(item[0],item[1],item[2]))
print("\n")
print("BI-GRAM FREQ-DIST")
print("################################################")   
for item in freq_bi[:15]:
    print("{0:<20}{1:<20}".format(item[0][0],item[0][1]))


NOUN PHRASE EXCERPTS, SCORED BY TF*IDF
################################################
depression
studies
review
patients
treatment
health
risk
evidence
anxiety
interventions
symptoms
effects
results
therapy
quality
systematic
trials
mental
disorders
clinical
factors
effect
data
research
care
outcomes
disease
depressive
disorder
analysis


TRI-GRAM COLLOCATIONS
################################################
avascular           necrosis            bone                
lesbian             gay                 bisexual            
lower               urinary             tract               
selective           serotonin           reuptake            
gay                 bisexual            transgender         
mild                moderate            intellectual        
activities          daily               living              
first               generation          migrants            
serotonin           reuptake            inhibitors          
confidence          interval         