# Natural Language Processing for Text Indexing in Python

**Objective of text indexing**: to create a representation (vector) for each document that we can use to relate documents to one another. This representation should encapsulate the overall "meaning" of the document.

In this sequence, we are not trying to capture every detail and subtlety of that meaning. We want a synthetic representation that lets us process a large number of documents and extract the global trends that emerge.

So, there are two sides to NLP:
- indexing text: reducing complexity to be able to process a large corpus of documents, for information retrieval
- extracting information from text: recognizing details in the meaning of each document, paragraph, sentence, to be able to extract relevant entities from these documents.

These two purposes can be mutually beneficial, but the more you get into the details, the more complex the processing will be.

## 0. Text sources and possible text mining inputs

In [None]:
paragraph = u"My mother drove me to the airport with the windows rolled down. It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue. I was wearing my favorite shirt – sleeveless, white eyelet lace; I was wearing it as a farewell gesture. My carry-on item was a parka. In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds. It rains on this inconsequential town more than any other place in the United States of America. It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old. It was in this town that I’d been compelled to spend a month every summer until I was fourteen. That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead."

print(paragraph)

### Encode

In [None]:
input_string = paragraph.encode('utf-8')

print(input_string)

In [None]:
import unicodedata

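def remove_accents(input_str):
    # NFKD decomposes each accented character into a base letter + combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # drop everything that is not ASCII, then decode back to a str
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
    return only_ascii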

input_string = remove_accents(paragraph)

print(input_string)
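The paragraph above contains no accented letters, so the effect of the NFKD step is hard to see on it. The sketch below applies the same idea to a hypothetical sample string (the name `strip_accents_demo` and the sample text are illustrative assumptions, not part of the original notebook). It also shows a side effect: characters with no ASCII decomposition, such as the curly apostrophe, are dropped entirely.

```python
import unicodedata

def strip_accents_demo(text):
    # NFKD splits "é" into "e" followed by a combining acute accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with 'ignore' drops the combining marks,
    # and also any character that has no ASCII decomposition
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "Café crème, déjà vu"            # hypothetical sample text
print(strip_accents_demo(sample))         # accents removed, letters kept
print(strip_accents_demo("I’d"))          # the curly apostrophe disappears
```

If losing apostrophes and dashes matters for tokenization, one option is to replace them with their ASCII equivalents before calling the function.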

# 1. Creating bag-of-words for each document

## 1.1. Tokenize document

**"Tokenize"** means creating "tokens" which are atomic units of the text. These tokens are usually words we extract from the document by splitting it (using punctuations as a separator). We can also consider sentences as tokens (and words as sub-tokens of sentences).
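To make the idea of splitting on punctuation concrete before turning to NLTK, here is a minimal sketch based on regular expressions only. The function names and the regular expressions are assumptions chosen for illustration; they are much cruder than the NLTK tokenizers (for example, the sentence splitter would break on abbreviations such as "Mr.").

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends with ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is either a run of word characters (optionally joined by
    # internal apostrophes or hyphens) or a single punctuation mark
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "It was seventy-five degrees in Phoenix. My carry-on item was a parka."
for sent in simple_sent_tokenize(text):
    print(simple_word_tokenize(sent))
```

Note that punctuation marks come out as tokens of their own, which is why a later step has to filter them.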

### nltk.tokenize.sent_tokenize

In [None]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(input_string)

for sent in sent_tokens:
    print("--- sentence: {}".format(sent))

### nltk.tokenize.word_tokenize

In [None]:
from nltk.tokenize import word_tokenize

# build a real list (in Python 3, map() returns a one-shot iterator)
tokens = [word_tokenize(sent) for sent in sent_tokens]
#tokens = word_tokenize(input_string)
for sent in tokens:
    print("--- sentence tokens: {}".format(sent))
#print("--- nltk tokens from paragraph:\n{}".format(tokens))

### lower

In [None]:
# lowercase every token, sentence by sentence
tokens_lower = [[w.lower() for w in sent] for sent in tokens]

for sent in tokens_lower:
    print("--- sentence tokens: {}".format(sent))

## 1.2. Filtering stopwords (and punctuation)

**Stopwords** are words we filter out at this step because they carry little information about the actual meaning of the document. They are usually very common words (articles, pronouns, auxiliaries). You can find lists of such **stopwords** online, or use the ones embedded within the NLTK library.

### Using your own stop list

In [None]:
from nltk.corpus import stopwords

stopwords_ = set(stopwords.words('english'))

print("--- stopwords in english: {}".format(stopwords_))

In [None]:
# list found at http://www.textfixer.com/resources/common-english-words.txt
# 'not' has been removed (do you know why ?)

stopwords_ = "a,able,about,across,after,all,almost,also,am,among,an,and,any,\
are,as,at,be,because,been,but,by,can,could,dear,did,do,does,either,\
else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,\
how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,\
me,might,most,must,my,neither,no,of,off,often,on,only,or,other,our,\
own,rather,said,say,says,she,should,since,so,some,than,that,the,their,\
them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,\
what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".split(',')

print("--- stopwords in english: {}".format(stopwords_))

We also need to filter out punctuation tokens, that is, tokens made of a punctuation mark. A list of these marks is available in `string.punctuation`.

In [None]:
import string

punctuation_ = set(string.punctuation)
print("--- punctuation: {}".format(string.punctuation))

def filter_tokens(sent):
    # keep tokens that are neither stopwords nor punctuation marks
    return [w for w in sent if w not in stopwords_ and w not in punctuation_]

tokens_filtered = [filter_tokens(sent) for sent in tokens_lower]

for sent in tokens_filtered:
    print("--- sentence tokens: {}".format(sent))

## 1.3. Stemming and lemmatization

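**Stemming** means reducing each word to a **stem**: a root form shared by all the variants of that word (plurals, conjugated forms, derived forms). **Lemmatization** has the same goal, but maps each word to a dictionary form (its lemma) rather than to a truncated stem.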

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmer_porter = PorterStemmer()
tokens_stemporter = [[stemmer_porter.stem(w) for w in sent] for sent in tokens_filtered]
print("--- sentence tokens (porter): {}".format(tokens_stemporter[0]))

stemmer_snowball = SnowballStemmer('english')
tokens_stemsnowball = [[stemmer_snowball.stem(w) for w in sent] for sent in tokens_filtered]
print("--- sentence tokens (snowball): {}".format(tokens_stemsnowball[0]))

## 1.4. N-Grams

<span style="color:red">N-grams capture sequences of n consecutive tokens, so that part of the word order is kept in the bag-of-words.</span>

In [None]:
from nltk.util import ngrams

list(ngrams(tokens_stemporter[0],2))

In [None]:
from nltk.util import ngrams

def join_sent_ngrams(input_tokens, n):
    # first add the 1-gram tokens
    unigrams = list(input_tokens)
    ret_list = list(unigrams)

    # then for each order i from 2 to n
    for i in range(2, n + 1):
        # add each i-gram to the list, joined with '-'
        ret_list.extend(['-'.join(tgram) for tgram in ngrams(unigrams, i)])

    return ret_list

tokens_ngrams = [join_sent_ngrams(sent, 3) for sent in tokens_stemporter]

for sent in tokens_ngrams:
    print("--- sentence tokens: {}".format(sent))
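To make explicit what an n-gram extraction computes, here is a minimal pure-Python sketch that slides a window of size n over a token list. The name `simple_ngrams` and the sample tokens are hypothetical; this is an illustration of the technique, not the NLTK implementation.

```python
def simple_ngrams(tokens, n):
    tokens = list(tokens)
    # zip over n shifted views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, which yields len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

stems = ["mother", "drove", "airport", "window"]   # hypothetical stemmed tokens

print(simple_ngrams(stems, 2))
# [('mother', 'drove'), ('drove', 'airport'), ('airport', 'window')]

print(["-".join(gram) for gram in simple_ngrams(stems, 3)])
# ['mother-drove-airport', 'drove-airport-window']
```

A sentence with fewer than n tokens simply produces no n-gram of that order.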

## 1.5. Part-of-Speech tagging

Part-of-speech (POS) tagging is another process, which relies on machine learning to tag each word in a sentence with its grammatical function (noun, verb, adjective, and so on). Libraries such as NLTK embed tools to do that. The tags detected depend on the corpus used for training. In NLTK, the function `nltk.pos_tag()` uses the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set.

### nltk.pos_tag

In [None]:
from nltk import pos_tag

# build a real list so that sent_tags can be indexed and iterated several times
sent_tags = [pos_tag(sent) for sent in tokens]

for sent in sent_tags:
    print("--- sentence tags: {}".format(sent))

Let's keep only the verbs!

In [None]:
for sent in sent_tags:
    tags_filtered = [t for t in sent if t[1].startswith('VB')]
    print("--- verbs:\n{}".format(tags_filtered))

In [None]:
from nltk import RegexpParser

grammar = r"""
  NPB: {<DT|PRP\$>?<JJ|NN|,>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
  V2V: {<V.*> <TO> <V.*>}     # chunk verb + "to" + verb
"""

cp = RegexpParser(grammar)
result = cp.parse(sent_tags[1])

# print(result)

# print every noun-phrase (NPB) and verb-to-verb (V2V) chunk found
for sent in sent_tags:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() in ('NPB', 'V2V'):
            print(subtree)

# 2. Creating a vector for each document in a corpus

Our objective now is, from a corpus of documents, to build a vector representation of each of these documents. This vector representation will be used to:

- mine the features that can characterize classes of documents (supervised learning)
- mine the documents that have similar features to establish trends (unsupervised learning).

To do that, we need:
- a fixed number of features
- a quantitative value for each feature.

The number of features is given by the vocabulary of the corpus: the set of all distinct words (tokens) it contains.
The quantitative value is obtained, for each document, by counting the occurrences of each of these words in the document and then applying a TF-IDF formula.
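The two requirements above (a fixed set of features and one count per feature) can be sketched with the standard library alone. The toy documents below are hypothetical and assumed to be already tokenized, lowercased and filtered; the helper name `count_vector` is also an assumption for illustration.

```python
from collections import Counter

# hypothetical documents, already tokenized and filtered
docs = [
    ["aunt", "mary", "town"],
    ["mary", "meet", "town", "hall"],
    ["meet", "town"],
]

# the vocabulary fixes the number and the order of the features
vocabulary = sorted({tok for doc in docs for tok in doc})

def count_vector(doc):
    # one count per vocabulary term; absent terms count as 0
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

matrix = [count_vector(doc) for doc in docs]

print(vocabulary)   # ['aunt', 'hall', 'mary', 'meet', 'town']
for row in matrix:
    print(row)
```

Every document ends up with a vector of the same length, whatever its own length, which is what makes the documents comparable.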

## 2.1. Vocabulary and term frequencies

In [None]:
s1 = "Aunt Mary is going to town"
s2 = "Mary and you meet at the town hall "
s3 = "We should meet in town"
s4 = "I will meet Mary in town"

corpus = [s1,s2,s3,s4]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = ['is', 'and', 'we', 'you', 'will',
                                   'the','going', 'to', 'having', 'a',
                                   'should', 'in', 'i', 'm', 'at'],
                    strip_accents='ascii',
                    min_df=0.0,
                    max_df=1.0)

document_term_matrix = cv.fit_transform(corpus)

print("--vocabulary: {}".format(cv.vocabulary_))
print("--doc matrix/count:\n{}".format(document_term_matrix.todense()))

## 2.2. TF-IDF

TF-IDF stands for the product of two parts: the term frequency (tf) and the inverse document frequency (idf). The term frequency is simply the count stored in a term frequency vector.

$tf(term,document) = \text{number of times the term appears in the document}$

The idf part is defined in terms of the document frequency, which is

$df(term,corpus) = \frac{\text{number of documents that contain the term}}{\text{number of documents in the corpus}}$

The inverse document frequency is then defined as

$idf(term,corpus) = \log{\frac{1}{df(term,corpus)}}$.

Despite its name, it is really the log of the inverse document frequency. Finally, tf-idf is just

$\text{tf-idf} = tf(term,document) \times idf(term,corpus)$
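The formulas above translate almost literally into code. The sketch below is a direct transcription on a hypothetical, already tokenized toy corpus; the function names mirror the symbols in the formulas and are not a library API.

```python
import math
from collections import Counter

# hypothetical corpus, already tokenized and filtered
docs = [
    ["aunt", "mary", "town"],
    ["mary", "meet", "town", "hall"],
    ["meet", "town"],
    ["meet", "mary", "town"],
]

def tf(term, doc):
    # number of times the term appears in the document
    return Counter(doc)[term]

def df(term, corpus):
    # fraction of the documents that contain the term
    return sum(1 for doc in corpus if term in doc) / len(corpus)

def idf(term, corpus):
    # log of the inverse document frequency
    # (undefined for a term that appears in no document: df would be 0)
    return math.log(1.0 / df(term, corpus))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(idf("town", docs))             # 0.0: "town" appears in every document
print(idf("aunt", docs))             # largest idf: "aunt" appears in one document only
print(tf_idf("mary", docs[1], docs))
```

A word present in every document gets a weight of zero, which is the point of the idf term. Note that scikit-learn's `TfidfVectorizer` does not use exactly this formula: with its default settings it computes a smoothed idf, $\ln{\frac{1+n}{1+n_t}} + 1$ (with $n$ the number of documents and $n_t$ the number of documents containing the term), and then normalizes each document vector to unit L2 norm, so its values will differ from the ones printed here.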

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = ['is', 'and', 'we', 'you', 'will',
                                   'the','going', 'to', 'having', 'a',
                                   'should', 'in', 'i', 'm', 'at'],
                    strip_accents='ascii',
                    min_df=0.0,
                    max_df=1.0)
document_tfidf_matrix = tfidf.fit_transform(corpus)

print("--vocabulary: {}".format(tfidf.vocabulary_))
print("--doc matrix/tf-idf:\n{}".format(document_tfidf_matrix.todense()))

## 2.3. Measure similarity between documents

Recall that for two vectors $\vec{x}$ and $\vec{y}$, $\vec{x} \cdot \vec{y} = \|\vec{x}\| \, \|\vec{y}\| \cos{\theta}$. And so,

$\frac{\vec{x} \cdot \vec{y} }{\|\vec{x}\| \, \|\vec{y}\|} = \cos{\theta}$

This is called the cosine similarity of two vectors because it is the cosine of the angle between them. Intuitively, the more similar two documents are, the smaller the angle between their vectors; the more dissimilar they are, the larger the angle. An extreme example is when two documents share no words in common: the dot product is zero and therefore the cosine is zero. On the opposite extreme, when two documents are identical, they share all of the same words with the same frequencies, so their cosine is 1.
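The two extreme cases described above can be checked with a minimal pure-Python sketch of the formula. The function name `cosine` and the sample count vectors are hypothetical; returning 0.0 for an all-zero vector is a convention chosen here, since the formula is undefined in that case.

```python
import math

def cosine(x, y):
    # dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0   # assumption: an empty document is similar to nothing
    return dot / (norm_x * norm_y)

# hypothetical count vectors over a 5-word vocabulary
doc_a = [1, 0, 1, 0, 1]
doc_b = [0, 1, 0, 1, 0]   # shares no word with doc_a
doc_c = [2, 0, 2, 0, 2]   # same words as doc_a, twice as often

print(cosine(doc_a, doc_b))   # 0.0: no word in common
print(cosine(doc_a, doc_a))   # 1 up to floating-point rounding: identical documents
print(cosine(doc_a, doc_c))   # also 1 up to rounding: only the direction matters
```

The last case shows why the cosine is convenient for text: a document and a longer document with the same word proportions are considered identical.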