# Text Mining in Python
#### In our last lecture we learned about the bag of words (BOW) representation for transforming unstructured text into a document-term matrix that we could use with machine learning algorithms. Today, we'll learn about another way of representing the presence of terms in a document by reweighting the counts based on the importance of the terms.

## TF-IDF (Term Frequency - Inverse Document Frequency)

From [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf):<br>

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

#### Term Frequency: the number of times a term (token, e.g. a word) appears in a document<br>Document Frequency: the number of documents that a word appears in

So let's stop and think about this for a second. Say our goal is to find all relevant documents in a corpus given a search phrase, and we're only allowed to search for documents using one word of the search phrase at a time. You've created a dictionary where the keys are all of the terms present in the documents and each value is a set of (document, term frequency) tuples. We'll need to consider each of the words in the search phrase to determine the relevance of each of the documents found.<br>

example search phrase: <em>the fast fourier transform</em><br>

Your first thought might be to take the intersection of the sets of documents that contain each word. But how would you go about ordering those results? What if one of the documents were a French version of Pinocchio about a doll named Fourier who wanted to transform into a real boy (and do so fast)? How would you determine that it was irrelevant?
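To make the problem concrete, here is a minimal sketch of the dictionary described above and of the intersection approach. The three documents and the `search` helper are hypothetical, made up for illustration only.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text
docs = {
    "signal_processing": "the fast fourier transform is a fast algorithm",
    "pinocchio": "the doll fourier wanted to transform into a real boy fast",
    "cooking": "the soup was ready fast",
}

# Inverted index: term -> set of (document id, term frequency) tuples
index = defaultdict(set)
for doc_id, text in docs.items():
    tokens = text.split()
    for term in set(tokens):
        index[term].add((doc_id, tokens.count(term)))


def search(phrase):
    """Return the ids of the documents that contain every word in the phrase."""
    doc_sets = [{doc_id for doc_id, _ in index[word]} for word in phrase.split()]
    return set.intersection(*doc_sets)


matches = search("the fast fourier transform")
print(sorted(matches))
```

Both the signal processing document and the Pinocchio document come back, and the intersection alone gives no way to rank one above the other.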

Let's consider each word:

<b>the</b>: this word probably appears in every document, so the fact that a document contains <em>the</em> tells us nothing about whether it is relevant

<b>fast</b>: this word probably appears in a lot of documents that have nothing to do with the fourier transform, in addition to those about the fast fourier transform, so it's not as useless as <em>the</em> but still not very informative

<b>fourier</b>: this word will appear in far fewer documents than the word <em>fast</em>, so it should be more relevant to our query

<b>transform</b>: this word will appear in more documents than <b>fourier</b> but fewer documents than <b>fast</b>, and its relevance should reflect that

We also care how many times the word is mentioned. In a document about the fast fourier transform, we would expect each of those words to occur frequently. However, we care more about the relative frequency than the raw count, so we should normalize the term frequency by the number of words in the document. That way we don't assign higher relevance to a document merely because it is longer.

Notice that the relevance of a document increases with the normalized term frequency and decreases with the number of documents the term appears in. This is the motivation behind tf-idf.

We're looking for something like:

\begin{equation*}
\text{tf-idf} = f(\text{term freq}) \times g({\frac{1}{\text{doc freq}}})
\end{equation*}

which could be as simple as:

\begin{equation*}
\text{tf-idf}_{word} = \langle\text{term freq}_{word}\rangle \times \log \left( \frac{N_{doc}}{\text{doc freq}_{word}} \right)
\end{equation*}

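Here the angle brackets denote the normalized term frequency, $N_{doc}$ is the number of documents in the corpus, and the logarithm keeps the overall values small when $N_{doc}$ is large.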
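As a rough illustration, the simple formula above can be computed by hand. This is a minimal sketch, not a production implementation: the toy corpus, the whitespace tokenization, and the helper names `tf_idf` and `score` are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> text
docs = {
    "signal_processing": "the fast fourier transform is a fast algorithm",
    "pinocchio": "the doll fourier wanted to transform into a real boy fast",
    "cooking": "the soup was ready fast",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in
doc_freq = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))


def tf_idf(term, doc_id):
    """Normalized term frequency times log(N_doc / doc freq)."""
    tokens = tokenized[doc_id]
    if doc_freq[term] == 0:
        return 0.0  # the term never occurs in the corpus
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


def score(phrase, doc_id):
    """Sum the tf-idf of every word in the search phrase for one document."""
    return sum(tf_idf(word, doc_id) for word in phrase.split())


query = "the fast fourier transform"
ranking = sorted(tokenized, key=lambda d: score(query, d), reverse=True)
print(ranking)
```

In this toy corpus <em>the</em> and <em>fast</em> appear in every document, so their idf is log(1) = 0 and they contribute nothing to the score. The ranking is driven entirely by <em>fourier</em> and <em>transform</em>, and the shorter signal processing document outranks the Pinocchio one because its normalized term frequencies are higher.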

There are several variants of the tf-idf calculation, and the Wikipedia page lists a number of them. Refer to the scikit-learn documentation to see which one it uses and why.
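For reference, the scikit-learn documentation describes its default weighting (with `smooth_idf=True`) as the raw term count multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by an optional normalization of each row. The sketch below applies that formula in plain Python. It assumes the documents have already been lowercased and stripped of stop words by hand, and the helper names are made up for this example.

```python
import math

# Hand-tokenized toy documents (assumed lowercased, stop words already removed)
docs = [["dog", "good", "dog"], ["boy", "bad"], ["girl", "good"]]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)


def smooth_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1"""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


# One row per document, one column per vocabulary term: raw count * idf
weights = [[doc.count(term) * smooth_idf(term) for term in vocab] for doc in docs]


def l2_normalize(row):
    """Scale a row so that its Euclidean length is 1."""
    length = math.sqrt(sum(value * value for value in row))
    return [value / length for value in row] if length else row


normalized = [l2_normalize(row) for row in weights]

print(vocab)
for row in normalized:
    print([round(value, 3) for value in row])
```

The added 1 inside the ratio acts as if one extra document contained every term, which avoids division by zero, and the trailing + 1 means a term that appears in every document is down-weighted rather than zeroed out.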

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
example_sentences = ['The dog is a good dog.', 
                     'The boy is bad.', 
                     'The girl is good.',
                    ]

In [None]:
tfidf = TfidfVectorizer(lowercase=True, norm=None, stop_words='english', use_idf=False)
tfidf.fit_transform(example_sentences).toarray()

In [None]:
tfidf.get_feature_names_out()  # vocabulary terms in column order (scikit-learn >= 1.0)

In [None]:
tfidf = TfidfVectorizer(lowercase=True, norm='l1', stop_words='english', use_idf=False)
tfidf.fit_transform(example_sentences).toarray()

In [None]:
tfidf = TfidfVectorizer(lowercase=True, norm=None, stop_words='english', use_idf=True)
tfidf.fit_transform(example_sentences).toarray()

In [None]:
tfidf = TfidfVectorizer(lowercase=True, norm='l2', stop_words='english', use_idf=True)
tfidf.fit_transform(example_sentences).toarray()

From the [Scikit-learn docs](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting):
> In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.<br><br>
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

#### Learn more about tf-idf: <br> http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/ <br> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

#### We also learned in the previous lecture about n-grams, and that one of the problems with calculating n-grams is that our number of features will explode. What if we came up with a way to identify meaningful/significant n-grams and only used those instead? Lucky for us, some people have already figured out ways to do just that.

## Collocations

From [Wikipedia](https://en.wikipedia.org/wiki/Collocation):

In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance.

I won't go into the details of the calculations here. But if you would like to work with collocations in your project, here are some resources to learn more about them:<br>
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf <br>
http://www.scielo.org.mx/scielo.php?pid=S1405-55462016000300327&script=sci_arttext
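To give a feel for what "more often than would be expected by chance" means, here is a minimal sketch that scores bigrams with pointwise mutual information (PMI), one of the association measures covered in the resources above. It is not necessarily the scoring that gensim uses, and the toy sentences and the `pmi` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized sentences
sentences = [
    ["the", "battery", "life", "is", "great"],
    ["battery", "life", "could", "be", "better"],
    ["the", "screen", "is", "great"],
    ["the", "battery", "life", "is", "short"],
]

# Count single words and adjacent word pairs within each sentence
unigrams = Counter(token for sent in sentences for token in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())


def pmi(a, b):
    """log2 of the observed bigram probability over the probability expected
    if the two words were independent. Only defined for observed bigrams."""
    p_ab = bigrams[(a, b)] / n_bigrams
    p_a = unigrams[a] / n_unigrams
    p_b = unigrams[b] / n_unigrams
    return math.log2(p_ab / (p_a * p_b))


print(round(pmi("battery", "life"), 2))
print(round(pmi("the", "battery"), 2))
```

A higher score means the pair occurs together more often than the individual word frequencies would predict, so <em>battery life</em> scores above <em>the battery</em> in this toy data.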

In [None]:
import bs4
import pandas as pd
import spacy

from gensim.models.phrases import Phrases, Phraser
from spacy.pipeline import Pipe

So now, instead of just using the frequency of the word in the document, you're reweighting that frequency by how important the term is according to its tf-idf score.

In [None]:
movie_data = pd.read_csv('../Lecture_10/labeledTrainData.tsv/labeledTrainData.tsv', sep='\t')
text = movie_data.sample(10000, random_state=42).loc[:, 'review'].apply(lambda t: bs4.BeautifulSoup(t, 'lxml').get_text())

#### Note: using lxml instead of html5lib will significantly speed up the html parsing

In [None]:
print(text.iloc[0])

#### Let's find collocations at the sentence level instead of the review level so we don't find collocations between words at the end of sentences and the beginning of others.

In [None]:
nlp = spacy.load('en_core_web_sm')  # small English pipeline; install with: python -m spacy download en_core_web_sm

In [None]:
%%time
token_text = []

for doc in nlp.pipe(text):
    for sent in doc.sents:
        token_text.append([t.lower_ for t in sent if not t.is_punct])

In [None]:
print(token_text[0])

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [None]:
# scikit-learn's English stop words plus the contraction pieces produced by the spaCy tokenizer
common_terms = list(ENGLISH_STOP_WORDS) + ["'m", "'re", "'ll", "'s", "'ve", "'d", 'ca', 'is']
common_terms.remove('not')
common_terms.remove('nothing')
common_terms.remove('never')

In [None]:
sorted(common_terms)

In [None]:
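# connector_words (gensim >= 4.0) lets a phrase span these common terms, e.g. 'waste_of_time'
phrases = Phrases(token_text, connector_words=common_terms)

In [None]:
colloc = Phraser(phrases)

In [None]:
colloc_text = colloc[token_text]

In [None]:
colloc_text[0]

In [None]:
colloc_text[1999]

In [None]:
# a second pass over the bigram output can join a bigram with a neighboring token
tri_phrases = Phrases(colloc_text, connector_words=common_terms)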

In [None]:
tri_colloc = Phraser(tri_phrases)

In [None]:
tri_colloc_text = tri_colloc[colloc_text]

In [None]:
tri_colloc_text[1999]

#### Now you can use this Phraser to convert a list of tokens into a list of tokens that groups together collocations

In [None]:
tri_colloc[['it', 'was', 'a', 'waste', 'of', 'time']]

See the gensim docs for more info: https://radimrehurek.com/gensim/models/phrases.html

## Word2Vec

For some of your projects, your goal is to figure out the sentiment expressed on specific aspects of an object. In order to do that, you'd have to account for all of the different ways a person could refer to that aspect.

Say you're looking at product reviews for cell phones and you've noticed that one aspect of cell phones reviewers seem to care about is the battery life. But you've also noticed that they sometimes talk about that aspect using different words, such as 'battery life', 'battery', and 'battery power.' You now know how to find collocations such as 'battery_life' and 'battery_power.' But how would you know that those are all used to refer to the same thing? This is an unsupervised learning problem. You don't have labels for each of those terms telling you that they refer to "battery life." So you need a way to learn from the text that those terms are used to refer to the same aspect. Word2Vec can do this for you.

![](https://deeplearning4j.org/img/word2vec_diagrams.png)

The gist:

By using the surrounding words (the context) to predict a word, or by using a word to predict the surrounding words, you can use the hidden layer of a neural network (NN) to map words to a lower dimensional vector space (instead of the original vector space, which had as many dimensions as there are words in your corpus vocabulary). In order to shrink the vector space, the NN has to learn to recognize patterns in the text (representations) that compress the information.
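As a small illustration of the "use a word to predict the surrounding words" setup, the sketch below builds the (word, context word) training pairs for one sentence. The function name `skipgram_pairs` and the window size are assumptions for this example, and the network training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "battery", "life", "is", "great"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that appear in similar contexts produce similar training pairs, which is why they end up with similar vectors.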

What you get from word2vec is a vector for each word, where the position of the word in the lower dimensional vector space represents some concept, and similar words (words used in similar contexts in the training data) are close to each other.

Using these vectors, you can cluster the words together.
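"Close to each other" is commonly measured with cosine similarity. Here is a minimal sketch using made-up 3-dimensional vectors; real word2vec vectors are learned from data and typically have 100 or more dimensions, and the `most_similar` helper here is a hypothetical stand-in, not the gensim method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors, invented for illustration
vectors = {
    "battery_life": [0.9, 0.1, 0.0],
    "battery_power": [0.8, 0.2, 0.1],
    "screen": [0.1, 0.9, 0.2],
}


def most_similar(word):
    """Return the other word whose vector has the highest cosine similarity."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("battery_life"))
```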

In [None]:
from gensim.models import Word2Vec

In [None]:
%%time

model = Word2Vec(tri_colloc_text, vector_size=100, workers=8)  # 100-dimensional vectors (gensim >= 4.0 keyword)

In [None]:
model.wv['cinematography']

In [None]:
model.wv.most_similar('cinematography')

In [None]:
model.wv.most_similar('plot')

In [None]:
model.wv.most_similar('character')

In [None]:
model.wv.most_similar('director')

In [None]:
from sklearn.cluster import KMeans, DBSCAN

In [None]:
vocab_set = set()

In [None]:
for sent in tri_colloc_text:
    vocab_set.update(sent)

In [None]:
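vocab_words = []
vocab_vectors = []

for word in vocab_set:
    try:
        vec = model.wv[word]
    except KeyError:
        # words below Word2Vec's min_count threshold have no vector
        continue
    vocab_words.append(word)
    vocab_vectors.append(vec)

# a Series lets us select words with boolean masks later on
vocab = pd.Series(vocab_words)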

In [None]:
len(vocab), len(vocab_vectors)

In [None]:
vocab_vectors[0]

In [None]:
import numpy as np

from sklearn.preprocessing import normalize

In [None]:
vector_array = np.concatenate(vocab_vectors, axis=0).reshape(-1, 100)
vec_array_l1 = normalize(vector_array, axis=1, norm='l1')
vec_array_l2 = normalize(vector_array, axis=1, norm='l2')

In [None]:
vector_array[0]

In [None]:
np.abs(vec_array_l1[0]).sum()  # the absolute values sum to 1 after L1 normalization

In [None]:
vocab[0]

In [None]:
km = KMeans(n_clusters=200)

In [None]:
km.fit(vec_array_l2)

In [None]:
km.labels_

In [None]:
km.labels_[vocab.isin(['cinematography', 'plot', 'character', 'filmmaker'])]

In [None]:
print(vocab[km.labels_ == 69])
print(vocab[km.labels_ == 147])
print(vocab[km.labels_ == 160])
print(vocab[km.labels_ == 167])