# <center>Natural Language Processing Using NLTK (II)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf
 - https://radimrehurek.com/gensim/models/phrases.html

## We will see other steps

   * Tokenization: split documents into individual words, phrases, or segments
   * Remove stop words and filter tokens
   * **POS (part of speech) Tagging**  
   * **Normalization: Stemming, Lemmatization**
   * **Named Entity Recognition (NER)**
   * **Term Frequency and Inverse Document Frequency (TF-IDF)**
   * **Create document-to-term matrix (bag of words)**

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
import nltk

# Sample text for analysis

news=["Oil prices soar to all-time record", 
"Stocks end up near year end", 
"Money funds rose in latest week",
"Stocks up; traders eye crude oil prices",
"Dollar rising broadly on record trade gain"]
text=". ".join(news).lower()
text

'oil prices soar to all-time record. stocks end up near year end. money funds rose in latest week. stocks up; traders eye crude oil prices. dollar rising broadly on record trade gain'

## 4. POS (Part of Speech) Tagging

 - What is POS Tagging:
   * The process of marking up a word in a text as corresponding to a particular part of speech (e.g. nouns, verbs, adjectives, adverbs etc.), based on both **its definition**, as well as its **context** — adjacent and related words in a phrase, sentence, or paragraph. 
 - Why POS Tagging: 
   * **Disambiguation**: A word may have different meanings. A POS tag is a potentially strong signal for word sense disambiguation. For example, in "I fish a fish", the first "fish" is a verb and the second is a noun.
   * **Phrase extraction**: Use POS rules to define accepted phrases (or information unit), or collocations for indexing and retrieval:
     * Adj + Noun, e.g. nice house
     * Verb + Noun, e.g. play football
     * typical collocation patterns (https://nlp.stanford.edu/fsnlp/promo/colloc.pdf):
       - Adj + Noun: e.g. linear function
       - Noun + Noun: e.g. regression coefficient
       - Adj + Adj + Noun: e.g. Gaussian random variable
       - Noun + Adj + Noun: e.g. mean squared error
       - Noun + Noun + Noun: e.g. class probability function
       - Noun + Preposition + Noun: e.g. degrees of freedom
   * **Filter tokens**:  some POS have less importance in retrieval, e.g. stopwords such as ‘a’, ‘an’, ‘the’, and other glue words like 'in', 'on', 'of' etc.
   * Find other forms of a word based on POS
        * Noun: plural and singular
        * Verb: past, present and future tense
        * Adjective: positive, comparative, and superlative
 - List of Penn Treebank Tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
 - A tagger (program for tagging) is trained on a corpus using machine learning approaches. It may not be very accurate when applied to your own corpus.
   - Stanford tagger
   - NLTK default tagger (PerceptronTagger)

In [4]:
# Exercise 4.1. To find all tags in treebank
nltk.help.upenn_tagset()

# find the meaning of a specific tag
nltk.help.upenn_tagset('JJ')


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [6]:
# Exercise 4.2. NLTK POS Tagging

# The input to the tagging function is a list of words

# tokenize the text
tokens=nltk.word_tokenize(text)

# tag each tokenized word
tagged_tokens= nltk.pos_tag(tokens)

tagged_tokens
 

[('oil', 'NN'),
 ('prices', 'NNS'),
 ('soar', 'VB'),
 ('to', 'TO'),
 ('all-time', 'JJ'),
 ('record', 'NN'),
 ('.', '.'),
 ('stocks', 'NNS'),
 ('end', 'VBP'),
 ('up', 'RP'),
 ('near', 'IN'),
 ('year', 'NN'),
 ('end', 'NN'),
 ('.', '.'),
 ('money', 'NN'),
 ('funds', 'NNS'),
 ('rose', 'VBD'),
 ('in', 'IN'),
 ('latest', 'JJS'),
 ('week', 'NN'),
 ('.', '.'),
 ('stocks', 'NNS'),
 ('up', 'RP'),
 (';', ':'),
 ('traders', 'NNS'),
 ('eye', 'NN'),
 ('crude', 'VBP'),
 ('oil', 'NN'),
 ('prices', 'NNS'),
 ('.', '.'),
 ('dollar', 'NN'),
 ('rising', 'VBG'),
 ('broadly', 'RB'),
 ('on', 'IN'),
 ('record', 'NN'),
 ('trade', 'NN'),
 ('gain', 'NN')]

In [7]:
# Exercise 4.3. Extract Phrases by POS

# Extract phrases in pattern of adjective + noun
# e.g. nice house, growing market

bigrams=list(nltk.bigrams(tagged_tokens))
print(bigrams)

phrases=[ (x[0],y[0]) for (x,y) in bigrams \
         if x[1].startswith('JJ') \
         and y[1].startswith('NN')]

print(phrases)

[(('oil', 'NN'), ('prices', 'NNS')), (('prices', 'NNS'), ('soar', 'VB')), (('soar', 'VB'), ('to', 'TO')), (('to', 'TO'), ('all-time', 'JJ')), (('all-time', 'JJ'), ('record', 'NN')), (('record', 'NN'), ('.', '.')), (('.', '.'), ('stocks', 'NNS')), (('stocks', 'NNS'), ('end', 'VBP')), (('end', 'VBP'), ('up', 'RP')), (('up', 'RP'), ('near', 'IN')), (('near', 'IN'), ('year', 'NN')), (('year', 'NN'), ('end', 'NN')), (('end', 'NN'), ('.', '.')), (('.', '.'), ('money', 'NN')), (('money', 'NN'), ('funds', 'NNS')), (('funds', 'NNS'), ('rose', 'VBD')), (('rose', 'VBD'), ('in', 'IN')), (('in', 'IN'), ('latest', 'JJS')), (('latest', 'JJS'), ('week', 'NN')), (('week', 'NN'), ('.', '.')), (('.', '.'), ('stocks', 'NNS')), (('stocks', 'NNS'), ('up', 'RP')), (('up', 'RP'), (';', ':')), ((';', ':'), ('traders', 'NNS')), (('traders', 'NNS'), ('eye', 'NN')), (('eye', 'NN'), ('crude', 'VBP')), (('crude', 'VBP'), ('oil', 'NN')), (('oil', 'NN'), ('prices', 'NNS')), (('prices', 'NNS'), ('.', '.')), (('.', '.'

In [9]:
# Exercise 4.4. Extract Noun+Verb, 
# e.g. prices soar
bigrams=list(nltk.bigrams(tagged_tokens))
#print(bigrams)

phrases=[ (x[0],y[0]) for (x,y) in bigrams \
         if x[1].startswith('NN') \
         and y[1].startswith('VB')]

print(phrases)

[('prices', 'soar'), ('stocks', 'end'), ('funds', 'rose'), ('eye', 'crude'), ('dollar', 'rising')]
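
Exercises 4.3 and 4.4 each hard-code one two-tag pattern. As a rough sketch, the longer collocation patterns listed earlier (for example Noun + Noun + Noun) can be handled by one helper that slides a window over the tagged tokens. The helper name `match_pattern` is hypothetical, and the tagged tokens are copied from part of the Exercise 4.2 output so that the block runs without NLTK.

```python
# Tagged tokens copied from the Exercise 4.2 output (first and last sentence only)
tagged = [('oil', 'NN'), ('prices', 'NNS'), ('soar', 'VB'), ('to', 'TO'),
          ('all-time', 'JJ'), ('record', 'NN'), ('.', '.'),
          ('dollar', 'NN'), ('rising', 'VBG'), ('broadly', 'RB'),
          ('on', 'IN'), ('record', 'NN'), ('trade', 'NN'), ('gain', 'NN')]

def match_pattern(tagged_tokens, pattern):
    """Return every run of consecutive words whose tags start with
    the tag prefixes in `pattern` (hypothetical helper)."""
    n = len(pattern)
    matches = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if all(tag.startswith(prefix)
               for (_, tag), prefix in zip(window, pattern)):
            matches.append(tuple(word for word, _ in window))
    return matches

adj_noun = match_pattern(tagged, ['JJ', 'NN'])
noun_noun = match_pattern(tagged, ['NN', 'NN'])
noun3 = match_pattern(tagged, ['NN', 'NN', 'NN'])

print(adj_noun)    # [('all-time', 'record')]
print(noun_noun)   # [('oil', 'prices'), ('record', 'trade'), ('trade', 'gain')]
print(noun3)       # [('record', 'trade', 'gain')]
```

Matching on tag prefixes keeps the patterns short: `'NN'` covers `NN`, `NNS`, `NNP` and `NNPS`, in the same way the exercises use `startswith`.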


## 5. Normalization: Stemming & Lemmatization
 - What is normalization:
   - Converts a list of words in **different surface forms** to a more **uniform form**, e.g.
        * a word with different forms, e.g. organize, organizes, organized, and organizing
        * families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
 - Why normalization
   - **improve text matching**: in many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
   - reduce the feature space generated from text
 - Stemming and lemmatization are two common techniques


### 5.1. Stemming 

* **Stemming**: reducing inflected (or sometimes derived) words to their **stem, base or root** form. 
   * For example, **crying** -> **cri**. 
   * Stemming may not generate a real word, but a root form. 
   * The stemming program is called stemmer. 
       * Famous stemmers are Porter stemmer, Lancaster Stemmer, Snowball Stemmer.

In [10]:
# Exercise 5.1.1. Stemming Using Porter Stemmer

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

print("Stem of organizing/organized/organizes/organization")
print(porter_stemmer.stem('organizing'))
print(porter_stemmer.stem('organized'))
print(porter_stemmer.stem('organizes'))
print(porter_stemmer.stem('organization'))


print("\nStem of crying")
print(porter_stemmer.stem('crying'))

Stem of organizing/organized/organizes/organization
organ
organ
organ
organ

Stem of crying
cri


In [11]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer(language='english')

# comparison of stemming adverbs
porter_stemmer.stem('fairly')
snowball_stemmer.stem('fairly') 

# Snowball prevents overstemming in some cases where Porter does not
porter_stemmer.stem('generically') 
porter_stemmer.stem('generous') 

snowball_stemmer.stem('generically') 
snowball_stemmer.stem('generous') 

'fairli'

'fair'

'gener'

'gener'

'generic'

'generous'

In [12]:
from nltk.stem import LancasterStemmer
lanc_stemmer = LancasterStemmer()

# Lancaster tends to overstem compared with Snowball
snowball_stemmer.stem('salty') 
lanc_stemmer.stem('salty') 

snowball_stemmer.stem('sales') 
lanc_stemmer.stem('sales')

'salti'

'sal'

'sale'

'sal'

### 5.2. Lemmatization

* **Lemmatization**: determining the lemma for a given word, 
   * A lemma is a word which stands at the head of a definition in a dictionary, e.g. run (lemma),  runs, ran and running (inflections) 
   * Lemmatization is a complex task involving understanding context and determining the part of speech of a word in a sentence 
      * e.g. "organized" (verb or adjective?)
   * The widely used Lemmatization method is based on WordNet, a large lexical database of English.

* **Difference** between stemming and lemmatization: 
   * A stemmer operates on a single word **without knowledge of the context**, and therefore cannot discriminate between words which have different meanings depending on part of speech. Lemmatization, in contrast, **requires context and POS tags**. 
   * Stemming may not generate a real word, but lemmatization always generates real words.
   * However, stemmers are typically easier to implement and run faster, at the cost of reduced accuracy.

In [10]:
# Exercise 5.2.1. Lemmatization

# wordnet lemmatizer takes POS tag as a parameter
# However, wordnet has its own tag set, 
# different from treebank tag set
# The default POS tag is noun 

from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

wordnet_lemmatizer = WordNetLemmatizer()

print("organizing (verb) ->", \
      wordnet_lemmatizer.lemmatize\
      ('organizing', wordnet.VERB))
print('organized (verb) ->', \
      wordnet_lemmatizer.lemmatize\
      ('organized', wordnet.VERB))
print('organized (adjective) ->',\
      wordnet_lemmatizer.lemmatize('organized', \
                                   wordnet.ADJ))
print('organization (noun) ->',\
      wordnet_lemmatizer.lemmatize('organization'))
print('crying (adjective) ->',\
      wordnet_lemmatizer.lemmatize('crying', \
                                   wordnet.ADJ))
print('crying (verb) ->', \
      wordnet_lemmatizer.lemmatize('crying', \
                                   wordnet.VERB))

# compare the result with Exercise 5.1.1.

organizing (verb) -> organize
organized (verb) -> organize
organized (adjective) -> organized
organization (noun) -> organization
crying (adjective) -> crying
crying (verb) -> cry
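
Because the WordNet lemmatizer uses its own tag set, the Treebank tags produced by `nltk.pos_tag` have to be converted first. The sketch below shows one common way to do that. The function name `treebank_to_wordnet` is hypothetical, and it returns the single-letter strings `'a'`, `'v'`, `'r'`, `'n'`, which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV` and `wordnet.NOUN`, so that the block runs without loading the WordNet corpus.

```python
def treebank_to_wordnet(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter.
    Falls back to noun, which is also the lemmatizer's default."""
    if treebank_tag.startswith('JJ'):
        return 'a'   # adjective, same value as wordnet.ADJ
    if treebank_tag.startswith('VB'):
        return 'v'   # verb, same value as wordnet.VERB
    if treebank_tag.startswith('RB'):
        return 'r'   # adverb, same value as wordnet.ADV
    return 'n'       # noun, same value as wordnet.NOUN

# a few (token, Treebank tag) pairs taken from the Exercise 4.2 output
tagged = [('stocks', 'NNS'), ('rose', 'VBD'),
          ('broadly', 'RB'), ('latest', 'JJS')]

wordnet_tagged = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tagged)
```

Each resulting pair can then be passed on as `wordnet_lemmatizer.lemmatize(word, pos)`.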


## 6. Named Entity Recognition (NER)

- Definition: find and classify real-world entities (Person, Organization, Event etc.) in text
- Example: sentence "Jim bought 300 shares of Acme Corp. in 2006" can be annotated as "**[Jim]<sub>Person</sub>** bought 300 shares of **[Acme Corp.]<sub>Organization</sub>** in 2006"
- Uses of NER:
   *  Information Extraction: extract clear, factual information, i.e., Who did what to whom when?
   *  Named entities can be indexed, and their relations can be extracted as a knowledge graph.
   *  Sentiment can be attributed to companies or products. 
   *  For question answering, answers are often named entities.
- Techniques for NER
   * Regular expression: telephone numbers, emails, capitalized names (e.g. capitalized word + {city, center, river})
      * Advantage: simple and sometimes effective
      * Disadvantages: 
         * the first word of a sentence is capitalized; sometimes, titles are all capitalized; new proper names constantly emerge (e.g. movie titles, books, etc.)
         * proper names may be ambiguous, e.g. Jordan can be *person* or *location*
   * Supervised learning (IOB) (https://arxiv.org/abs/cmp-lg/9505040)
       1. Collect a set of representative training documents
       2. Label each token for its entity class (I: inside entity, B: beginning of entity) or other (O)
       3. Design feature extractors appropriate to the text and classes, e.g. current word, pre/next word, pos tags etc.
       4. Train a sequence classifier to predict the labels from the data

In [11]:
# Exercise 6.1. Use NLTK for Named Entity Recognition

from nltk import word_tokenize, pos_tag, ne_chunk, Tree


sentence = "Jim bought 300 shares of Acme Corp. in 2006."


# the input to ne_chunk is list of (token, pos tag) tuples
ner_tree=ne_chunk(pos_tag(word_tokenize(sentence)))

# ne_chunk returns a tree
# print the tree
Tree.fromstring(str(ner_tree)).pretty_print()


# get PERSON out of the tree
person=[]
for t in ner_tree.subtrees():
    if t.label() == 'PERSON':
        person.append(t.leaves())
print("PERSON",person)


# how to extract organization?

                                     S                                                      
     ________________________________|_____________________________________                  
    |        |        |        |     |      |     |   PERSON          ORGANIZATION          
    |        |        |        |     |      |     |     |        __________|___________      
bought/VBD 300/CD shares/NNS of/IN in/IN 2006/CD ./. Jim/NNP Acme/NNP              Corp./NNP

PERSON [[('Jim', 'NNP')]]


In [12]:
from nltk import conlltags2tree, tree2conlltags

iob_tags = tree2conlltags(ner_tree)
print(iob_tags)


[('Jim', 'NNP', 'B-PERSON'), ('bought', 'VBD', 'O'), ('300', 'CD', 'O'), ('shares', 'NNS', 'O'), ('of', 'IN', 'O'), ('Acme', 'NNP', 'B-ORGANIZATION'), ('Corp.', 'NNP', 'I-ORGANIZATION'), ('in', 'IN', 'O'), ('2006', 'CD', 'O'), ('.', '.', 'O')]
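
One possible answer to the question "how to extract organization?" in Exercise 6.1 is to group the IOB tags back into entities. The sketch below is a plain-Python illustration: the function name `entities_from_iob` is hypothetical, and the input is copied from the output above so that the block runs without NLTK models.

```python
# IOB-tagged tokens copied from the tree2conlltags output above
iob_tags = [('Jim', 'NNP', 'B-PERSON'), ('bought', 'VBD', 'O'),
            ('300', 'CD', 'O'), ('shares', 'NNS', 'O'), ('of', 'IN', 'O'),
            ('Acme', 'NNP', 'B-ORGANIZATION'),
            ('Corp.', 'NNP', 'I-ORGANIZATION'),
            ('in', 'IN', 'O'), ('2006', 'CD', 'O'), ('.', '.', 'O')]

def entities_from_iob(tags):
    """Group consecutive B-/I- tokens into (label, text) pairs."""
    entities = []
    words, label = [], None
    for word, _pos, iob in tags:
        starts_new = iob.startswith('B-') or (
            iob.startswith('I-') and iob[2:] != label)
        if starts_new:
            if words:                      # close the previous entity
                entities.append((label, ' '.join(words)))
            words, label = [word], iob[2:]
        elif iob.startswith('I-'):
            words.append(word)             # continue the current entity
        else:                              # 'O': outside any entity
            if words:
                entities.append((label, ' '.join(words)))
            words, label = [], None
    if words:                              # entity at the end of the text
        entities.append((label, ' '.join(words)))
    return entities

entities = entities_from_iob(iob_tags)
organizations = [text for label, text in entities if label == 'ORGANIZATION']
print(entities)        # [('PERSON', 'Jim'), ('ORGANIZATION', 'Acme Corp.')]
print(organizations)   # ['Acme Corp.']
```

Working directly on the tree is equally possible: the PERSON loop in Exercise 6.1 can be repeated with the label `'ORGANIZATION'`.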


## 7. Term Frequency and Inverse Document Frequency (TF-IDF)
 - Motivation: How can we identify the important words (or phrases, named entities) of a text within a collection or corpus? When searching for documents, we would like these important words to be matched.
 - Intuition: 
   * In a document, if a word/term/phrase is repeated many times, it is likely to be important. 
   * However, if it appears in most of the documents in the corpus, then it has little discriminating power in determining relevance. 
   * For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. A search for "auto" may return all the documents. 
 - **TF-IDF** is composed of two terms: 
      - `TF (Term Frequency)`: which measures how frequently a term, say w, occurs in a document. 
      - `IDF (Inverse Document Frequency)`: measures how important a term is within the corpus. 
 
 - TF-IDF provides another way to remove stop words

### 7.1. Term Frequency (TF)
- Measures how frequently a term, say $w$, occurs in a document, say $d$. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. 
- Thus, the frequency of $w$ in $d$, denoted as $freq(w,d)$ is often divided by the document length (a.k.a. the total number of terms in the document, denoted as $|d|$) as a way of normalization: $$tf(w,d) = \frac{freq(w,d)}{|d|}$$
- Example: d="Stocks end up near year end"
   * `tf('Stocks',d)=?`
   * `tf('end',d)=?`
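
A minimal sketch of the term-frequency formula on the example document. It assumes the simplest possible tokenization (lowercase, split on whitespace, no stop-word removal), and the helper name `tf` is chosen here only to mirror the formula.

```python
from collections import Counter

d = "Stocks end up near year end"

# assumption: lowercase and split on whitespace, keep every token
tokens = d.lower().split()
freq = Counter(tokens)

def tf(word, freq, doc_len):
    """Term frequency: raw count divided by document length."""
    return freq[word] / doc_len

tf_stocks = tf('stocks', freq, len(tokens))
tf_end = tf('end', freq, len(tokens))
print(round(tf_stocks, 3), round(tf_end, 3))   # 0.167 0.333
```

The result depends on preprocessing. If stop words are removed first (the NLTK English stop-word list contains "up"), the document length drops from 6 to 5 and the two values become 0.2 and 0.4.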

### 7.2. Inverse Document Frequency (IDF)
- Measures how important a term is within the corpus. 
- However it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. 
- Thus we need to weigh down the frequent terms while scaling up the rare ones. 
- Let $|D|$ denote the number of documents and $df(w,D)$ denote the number of documents that contain term $w$. Then, $$idf(w) = \ln\left(\frac{|D|}{df(w,D)}\right)+1$$ Or a smoothed version: $$idf(w) = \ln\left(\frac{|D|+1}{df(w,D)+1}\right)+1$$
- Examples: 
  * Considering dataset:
       1. "Oil prices soar to all-time record", 
       2. "Stocks end up near year end", 
       3. "Money funds rose in latest week",
       4. "Stocks up; traders eye crude oil prices",
       5. "Dollar rising broadly on record trade gain"
  * `idf('Stocks')=`?
  * `idf('all-time')=`?
  * Discussion:
     * What words get very low IDF score?
     * What words get very high IDF score?
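
The two IDF formulas can be checked on the five-document dataset with a short sketch. The regular-expression tokenizer and the helper names `tokenize` and `idf` are assumptions made for this illustration, not part of NLTK.

```python
import math
import re

docs = ["Oil prices soar to all-time record",
        "Stocks end up near year end",
        "Money funds rose in latest week",
        "Stocks up; traders eye crude oil prices",
        "Dollar rising broadly on record trade gain"]

def tokenize(doc):
    """Lowercase and keep alphabetic tokens, allowing inner hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())

# one set of distinct tokens per document
doc_tokens = [set(tokenize(doc)) for doc in docs]

def idf(word, smooth=False):
    """Inverse document frequency with the optional +1 smoothing."""
    df = sum(1 for tokens in doc_tokens if word in tokens)
    if smooth:
        return math.log((len(doc_tokens) + 1) / (df + 1)) + 1
    return math.log(len(doc_tokens) / df) + 1

print(round(idf('stocks'), 2), round(idf('all-time'), 2))
# 1.92 2.61
print(round(idf('stocks', smooth=True), 2), round(idf('all-time', smooth=True), 2))
# 1.69 2.1
```

Without smoothing, a word that occurs in no document makes `df` zero and the division fails; the smoothed version avoids this.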


### 7.3. TF-IDF 
- Let $s(w,d)=tf(w,d) \cdot idf(w)$. Normalizing the scores of all words in a document by the Euclidean norm gives 
   $$tfidf(w,d)=\frac{s(w,d)}{\sqrt{\sum_{w' \in d}{s(w',d)^2}}}$$
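
As a worked instance of the formula, the sketch below computes the normalized scores for the document "Stocks end up near year end". It assumes that the stop word "up" has already been removed, that the corpus is the five-document dataset of Section 7.2, and that the smoothed IDF is used; the token counts and document frequencies are entered by hand under those assumptions.

```python
import math

# assumed inputs for "Stocks end up near year end" after stop-word removal
counts = {'stocks': 1, 'end': 2, 'near': 1, 'year': 1}
# number of documents (out of 5) that contain each word
doc_freq = {'stocks': 2, 'end': 1, 'near': 1, 'year': 1}
n_docs = 5

doc_len = sum(counts.values())

# raw score s(w, d) = tf(w, d) * smoothed idf(w)
raw = {w: (c / doc_len) * (math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1)
       for w, c in counts.items()}

# divide every score by the Euclidean norm of the document's score vector
norm = math.sqrt(sum(s * s for s in raw.values()))
tfidf_scores = {w: s / norm for w, s in raw.items()}

print({w: round(v, 2) for w, v in tfidf_scores.items()})
# {'stocks': 0.31, 'end': 0.78, 'near': 0.39, 'year': 0.39}
```

After normalization the squared scores of a document sum to 1, so documents of different lengths become comparable.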

In [13]:
# Exercise 7.1. computing tf-idf


import nltk, re, string
from nltk.corpus import stopwords

# library for normalization
from sklearn.preprocessing import normalize

# numpy is the package for matrix calculation
import numpy as np  

stop_words = stopwords.words('english')

docs=["Oil prices soar to all-time record", 
"Stocks end up near year end", 
"Money funds rose in latest week",
"Stocks up; traders eye crude oil prices",
"Dollar rising broadly on record trade gain"]   


In [14]:
# Step 1. get tokens of each document as list

def get_doc_tokens(doc):
    tokens=[token.strip() \
            for token in nltk.word_tokenize(doc.lower()) \
            if token.strip() not in stop_words and\
               token.strip() not in string.punctuation]
    
    # you can add bigrams, collocations, or lemmatization here
    
    # create token count dictionary
    token_count=nltk.FreqDist(tokens)
    
    # or you can create dictionary by yourself
    #token_count={token:tokens.count(token) for token in set(tokens)}
    return token_count

# step 2. process all documents to 
# a dictionary of dictionaries
docs_tokens={idx:get_doc_tokens(doc) \
             for idx,doc in enumerate(docs)}
docs_tokens

{0: FreqDist({'oil': 1, 'prices': 1, 'soar': 1, 'all-time': 1, 'record': 1}),
 1: FreqDist({'end': 2, 'stocks': 1, 'near': 1, 'year': 1}),
 2: FreqDist({'money': 1, 'funds': 1, 'rose': 1, 'latest': 1, 'week': 1}),
 3: FreqDist({'stocks': 1, 'traders': 1, 'eye': 1, 'crude': 1, 'oil': 1, 'prices': 1}),
 4: FreqDist({'dollar': 1, 'rising': 1, 'broadly': 1, 'record': 1, 'trade': 1, 'gain': 1})}

In [15]:
# step 3. get document-term matrix
# construct a document-term matrix where 
# each row is a doc 
# each column is a token
# and the value is the frequency of the token

import pandas as pd

# since we have a small corpus, we can use dataframe 
# to get document-term matrix
# but don't use this when you have a large corpus

dtm=pd.DataFrame.from_dict(docs_tokens, \
                           orient="index" 
                          )
dtm

dtm=dtm.fillna(0)
dtm

# sort by index (i.e. doc id)
dtm = dtm.sort_index(axis = 0)
dtm


Unnamed: 0,oil,prices,soar,all-time,record,stocks,end,near,year,money,...,latest,week,traders,eye,crude,dollar,rising,broadly,trade,gain
0,1.0,1.0,1.0,1.0,1.0,,,,,,...,,,,,,,,,,
3,1.0,1.0,,,,1.0,,,,,...,,,1.0,1.0,1.0,,,,,
4,,,,,1.0,,,,,,...,,,,,,1.0,1.0,1.0,1.0,1.0
1,,,,,,1.0,2.0,1.0,1.0,,...,,,,,,,,,,
2,,,,,,,,,,1.0,...,1.0,1.0,,,,,,,,


Unnamed: 0,oil,prices,soar,all-time,record,stocks,end,near,year,money,...,latest,week,traders,eye,crude,dollar,rising,broadly,trade,gain
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
1,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,oil,prices,soar,all-time,record,stocks,end,near,year,money,...,latest,week,traders,eye,crude,dollar,rising,broadly,trade,gain
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0


In [17]:
# step 4. get normalized term frequency (tf) matrix

# convert dtm to numpy arrays
dtm2=dtm.values

# sum the value of each row
doc_len=dtm2.sum(axis=1)
doc_len

# divide dtm matrix by the doc length matrix
# note broadcasting 
tf=np.divide(dtm2, doc_len[:,None])  

# set float precision to print nicely
np.set_printoptions(precision=2)

tf
tf.shape

array([5., 5., 5., 6., 6.])

array([[0.2 , 0.2 , 0.2 , 0.2 , 0.2 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.2 , 0.4 , 0.2 , 0.2 , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.2 , 0.2 ,
        0.2 , 0.2 , 0.2 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.17, 0.17, 0.  , 0.  , 0.  , 0.17, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.17, 0.17, 0.17, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.17, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.17, 0.17, 0.17, 0.17, 0.17]])

(5, 22)

In [18]:
# step 5. get idf

# get document frequency
df=np.where(dtm2>0,1,0)
df

# get idf
idf=np.log(np.divide(len(docs), \
        np.sum(df, axis=0)))+1
print("\nIDF Matrix")
idf
idf.shape

# what is the size of idf array?

smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1
print("\nSmoothed IDF Matrix")
smoothed_idf



array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])


IDF Matrix


array([1.92, 1.92, 2.61, 2.61, 1.92, 1.92, 2.61, 2.61, 2.61, 2.61, 2.61,
       2.61, 2.61, 2.61, 2.61, 2.61, 2.61, 2.61, 2.61, 2.61, 2.61, 2.61])

(22,)


Smoothed IDF Matrix


array([1.69, 1.69, 2.1 , 2.1 , 1.69, 1.69, 2.1 , 2.1 , 2.1 , 2.1 , 2.1 ,
       2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 , 2.1 ])

In [20]:
# step 6. get tf-idf
print("TF-IDF Matrix")
s=tf*idf
s  # raw score

# by default normalize by row
tf_idf=normalize(tf*idf)   # is broadcast possible here?
tf_idf

print("\nSmoothed TF-IDF Matrix")
smoothed_tf_idf=normalize(tf*smoothed_idf)
smoothed_tf_idf


TF-IDF Matrix


array([[0.38, 0.38, 0.52, 0.52, 0.38, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.38, 1.04, 0.52, 0.52, 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.52, 0.52,
        0.52, 0.52, 0.52, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.32, 0.32, 0.  , 0.  , 0.  , 0.32, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.43, 0.43, 0.43, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.32, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.43, 0.43, 0.43, 0.43, 0.43]])

array([[0.39, 0.39, 0.53, 0.53, 0.39, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.29, 0.78, 0.39, 0.39, 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.45, 0.45,
        0.45, 0.45, 0.45, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.34, 0.34, 0.  , 0.  , 0.  , 0.34, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.47, 0.47, 0.47, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.31, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.42, 0.42, 0.42, 0.42, 0.42]])


Smoothed TF-IDF Matrix


array([[0.41, 0.41, 0.5 , 0.5 , 0.41, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.31, 0.78, 0.39, 0.39, 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.45, 0.45,
        0.45, 0.45, 0.45, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.36, 0.36, 0.  , 0.  , 0.  , 0.36, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.45, 0.45, 0.45, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.34, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.42, 0.42, 0.42, 0.42, 0.42]])

- TF-IDF matrix gives **weight** of each word in each document
- Documents:
    1. "Oil prices soar to all-time record", 
    2. "Stocks end up near year end", 
    3. "Money funds rose in latest week",
    4. "Stocks up; traders eye crude oil prices",
    5. "Dollar rising broadly on record trade gain"

In [21]:
# For better visualization, let's make the tf-idf array a dataframe
pd.options.display.float_format = '{:,.2f}'.format # set format for float

pd.DataFrame(smoothed_tf_idf, columns = dtm.columns)
# the dtm dataframe we created in Step 3 has each word as a column

Unnamed: 0,oil,prices,soar,all-time,record,stocks,end,near,year,money,...,latest,week,traders,eye,crude,dollar,rising,broadly,trade,gain
0,0.41,0.41,0.5,0.5,0.41,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.31,0.78,0.39,0.39,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.45,...,0.45,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.36,0.36,0.0,0.0,0.0,0.36,0.0,0.0,0.0,0.0,...,0.0,0.0,0.45,0.45,0.45,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.42,0.42,0.42,0.42,0.42


In [22]:
# Exercise 7.2. Find the top three words 
# of each document by TF-IDF weight

top=smoothed_tf_idf.argsort()[:,::-1][:,0:3]
top

for row in top:
    print([dtm.columns[x] for x in row])

array([[ 2,  3,  0],
       [ 6,  7,  8],
       [10, 13, 12],
       [16, 15, 14],
       [21, 19, 18]])

['soar', 'all-time', 'oil']
['end', 'near', 'year']
['funds', 'week', 'latest']
['crude', 'eye', 'traders']
['gain', 'broadly', 'rising']


### 7.4. What to do with TF-IDF
- This is the `feature space` of text mining (a.k.a. `Bag of Words` or `Vector Space Model`)
- Identify important words in each document
- Find similar documents
    * How to measure similarity (or distance)? 
        - `Euclidean distance`
        - `Cosine distance`
    * Euclidean distance:
        - It can be **large** for vectors of high dimension
        - `Curse of dimensionality`: In a high-dimensional space, the ratio between the nearest and farthest points approaches 1, i.e. the points essentially become uniformly distant from each other. (https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)
    * Cosine similarity: The similarity between two documents is a function of the angle between their vectors in the TF-IDF vector space. 
      <img src='cosine.png' width=50% />
      <img src='cosine_formula.svg' width=50% />
      - Example: A=[0,2,1], B=[1,1,2], then
      $$cosine(A,B)=\frac{0*1+2*1+1*2}{\sqrt{0+4+1}*\sqrt{1+1+4}}$$
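
The example above can be evaluated with a few lines of standard-library Python. The function name `cosine_similarity` is a hypothetical helper written for this sketch; it also shows that rescaling a vector changes the Euclidean distance but not the cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [0, 2, 1]
B = [1, 1, 2]

sim = cosine_similarity(A, B)
print(round(sim, 4))   # 0.7303, i.e. 4 / sqrt(30)

# scale A by 10: the angle is unchanged, the distance is not
A_scaled = [10 * x for x in A]
sim_scaled = cosine_similarity(A_scaled, B)
dist = math.dist(A, B)
dist_scaled = math.dist(A_scaled, B)
print(round(sim_scaled, 4), round(dist, 2), round(dist_scaled, 2))
```

`math.dist` requires Python 3.8 or later.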

In [21]:
## Exercise 7.4.1 Document similarity

# package to calculate distance
from sklearn.metrics import pairwise_distances

# calculate cosine distance of every pair of documents 
# similarity is 1-distance
similarity=1-pairwise_distances(tf_idf, metric = 'cosine')
similarity

# find top doc similar to the first one
# Note the diagonal value is 1, which is the largest

np.argsort(similarity)[:,::-1][0,0:2]

for idx, doc in enumerate(docs):
    print(idx,doc)

array([[1.  , 0.  , 0.  , 0.26, 0.12],
       [0.  , 1.  , 0.  , 0.1 , 0.  ],
       [0.  , 0.  , 1.  , 0.  , 0.  ],
       [0.26, 0.1 , 0.  , 1.  , 0.  ],
       [0.12, 0.  , 0.  , 0.  , 1.  ]])

array([0, 3])

0 Oil prices soar to all-time record
1 Stocks end up near year end
2 Money funds rose in latest week
3 Stocks up; traders eye crude oil prices
4 Dollar rising broadly on record trade gain


### 7.5. Put Everything together -- Computing TF-IDF

In [22]:
import nltk, re, string
from sklearn.preprocessing import normalize
from nltk.corpus import stopwords
# numpy is the package for matrix calculation
import numpy as np  
import pandas as pd

stop_words = stopwords.words('english')

# Step 1. get tokens of each document as list
def get_doc_tokens(doc):
    tokens=[token.strip() \
            for token in nltk.word_tokenize(doc.lower()) \
            if token.strip() not in stop_words and\
               token.strip() not in string.punctuation]
    
    # you can add bigrams, collocations, stemming, 
    # or lemmatization here
    
    token_count={token:tokens.count(token) for token in set(tokens)}
    return token_count

def tfidf(docs):
    # step 2. process all documents into a dictionary of token-count dictionaries
    docs_tokens={idx:get_doc_tokens(doc) \
             for idx,doc in enumerate(docs)}

    # step 3. get document-term matrix
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index" )
    dtm=dtm.fillna(0)
    dtm = dtm.sort_index(axis = 0)
      
    # step 4. get normalized term frequency (tf) matrix        
    tf=dtm.values
    doc_len=tf.sum(axis=1, keepdims=True)
    tf=np.divide(tf, doc_len)
    
    # step 5. get idf
    df=np.where(tf>0,1,0)
    #idf=np.log(np.divide(len(docs), \
    #    np.sum(df, axis=0)))+1

    smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1    
    # step 6. normalize each row by its Euclidean norm, as in Step 6 above
    smoothed_tf_idf=normalize(tf*smoothed_idf)
    
    return smoothed_tf_idf