# DiSummit 2018 - Introduction to text mining 
## Exercise version
<img src="XploData_logo.png" width = "40%">
<br>
Powered by [XploData](https://www.xplodata.be/)

## Introduction

Text mining is a general term for retrieving information from (large quantities of) text. What can be done by hand (reading a text, knowing what it is about, determining keywords, associating it with other texts...) can to some extent also be done using machine learning algorithms, and much faster. In this hands-on Python workshop, we will introduce you to some of the text mining principles that are used to achieve this.

In the context of the [HIV Hackathon](https://hivhack.org/), organized by Digityser in September 2018, we will focus on a [publicly available PDF](http://phia.icap.columbia.edu/wp-content/uploads/2017/11/Tanzania_SummarySheet_A4.English.v19.pdf) related to HIV in Tanzania. Although the PDF contains multiple types of data, the scope of this workshop will be limited to the actual text itself. 
*As a side note, text can alternatively be extracted from images using [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition).*

Before we can start with text mining, we need to acquire the text from our PDF into Python.
Fortunately, multiple Python libraries exist, but we will use the popular PyPDF2 package.
For flexibility reasons, we will also introduce the XpdfReader toolkit.

### Exercises
This notebook is the exercise version of the workshop. Feel free to check the solution version.
In this version some code parts are left out and you need to fill them in yourself.
Places where your input is expected will look like this:

`var = ##your code##`

Make sure to replace all the `#` so that the code can be executed.


## Importing data

For the exercises we assume that we already have a txt version of all our PDF files.
For more info on how to obtain those txt files, look at the solution version of this document.

In [None]:
# Project setup
import os

# change working directory
os.chdir('..')
project_root = os.getcwd()
print(project_root)

In [None]:
import chardet

# define input file path
txt_file = os.path.join(project_root, 'output', 'output_txt', 'guess_encoding.txt')

# read file as bytes
with open(txt_file, mode='rb') as byte_file:
    raw_input = byte_file.read()

# guess encoding
encodings = chardet.detect(raw_input)
print(encodings)

In [None]:
# decode bytes to string
text = raw_input.decode('8859')
# show text sample
print(text[-900:])

We will continue the rest of the examples with the `ISO-8859-1` decoded text.

For more info on supported codecs, visit the python [docs](https://docs.python.org/3/library/codecs.html#standard-encodings)
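
To see why the choice of codec matters, the sketch below encodes a short sample string and decodes the same bytes with two different codecs. The sample string is made up for illustration and is not taken from the workshop PDF.

```python
# A made-up sample string containing a non-ASCII character
sample = "Prévalence du VIH: 5 %"

# Encode to bytes with ISO-8859-1 (Latin-1), as if the file was read in 'rb' mode
raw = sample.encode("iso-8859-1")

# Decoding with the matching codec restores the original text
as_latin1 = raw.decode("iso-8859-1")

# Decoding with a mismatching codec garbles the text;
# errors="replace" substitutes U+FFFD for bytes that are invalid in UTF-8
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Without `errors="replace"`, the second call would raise a `UnicodeDecodeError`, which is why guessing the encoding first is worthwhile.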

### Language detection
A human reader is able to distinguish between their mother tongue and a foreign language. We perceive this by reading language-specific words, grammatical constructions, context... Language detection in Python is quite straightforward and is performed using the `langdetect` package.

In [None]:
from langdetect import detect

language = detect(text)
print(language)
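
To get a feeling for how "language-specific words" can drive detection, here is a deliberately naive sketch that counts known function words per language. It is not how `langdetect` works internally; the word lists and the function name `guess_language` are invented for this illustration and are far from complete.

```python
# Tiny hand-picked function-word lists (illustrative only)
FUNCTION_WORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "nl": {"de", "het", "en", "van", "is", "een"},
    "fr": {"le", "la", "et", "de", "est", "les"},
}

def guess_language(text):
    """Return the language code whose function words occur most often."""
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in FUNCTION_WORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("The prevalence of HIV in the country is high"))
print(guess_language("De prevalentie van hiv in het land is hoog"))
```

Such a word-count approach breaks down on short or mixed-language texts, which is one reason to rely on a dedicated package instead.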

## Data preprocessing
Until this point, you were mainly preparing the data, i.e. converting the pdf to text and analyzing some meta-data.
The next critical step in text mining is preprocessing. This process involves different techniques (for an overview, see [here](https://pdfs.semanticscholar.org/1fa1/1c4de09b86a05062127c68a7662e3ba53251.pdf)), of which we will cover **tokenization, part-of-speech tagging, stop word removal, stemming and lemmatization**.
### Wordclouds and bar charts
To gain insight into the contents of a text file, various visualisations are possible. Wordclouds and bar charts are frequently used, and we will use them to demonstrate the different aspects of text mining preprocessing.

*Since we will often repeat the same visualisation during this workshop, we prepared a custom function.*

In [None]:
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# instantiate wordcloud
wordcloud = WordCloud(  background_color='white',
                        max_words=50,
                        max_font_size=80, 
                        random_state=42,
                        collocations=False)

def gen_wc_barh(tokens, title='Default title', amount=9):
    """
    Generate a horizontal bar graph from the top #amount tokens.
    Generate a wordcloud from the provided tokens.
    Show both visualizations.
    """
    # count tokens
    ctr = Counter(tokens)
    
    # initialize plt figure
    plt.figure(figsize=(12,9))
    
    # generate barh from counted tokens
    tokens, weights = zip(*ctr.most_common(amount))
    plt.subplot2grid((3, 3), (0, 0))
    plt.barh(tokens, weights)
    plt.ylabel('Tokens')
    plt.xlabel('Count')
    plt.title('Token count bar-graph')
    
    # generate wc from the counted tokens
    wordcloud.generate_from_frequencies(dict(ctr))
    plt.subplot2grid((3, 3), (0, 1), colspan=2)
    plt.imshow(wordcloud, interpolation='nearest')
    plt.axis('off')
    plt.title('Word cloud')

    # general title
    plt.suptitle(title, fontsize=16)
        
    # show both graphs
    plt.show()

To demonstrate the importance of preprocessing, we will generate a visualisation after each preprocessing step, whereafter you can evaluate the effect. For starters, we build our first visual on the *unpreprocessed* text, to get a feeling of what a wordcloud can tell you about the contents of a text file.

In [None]:
# generate from unprocessed input text
wordcloud.generate_from_text(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Notice how frequently occurring words are displayed in a larger font size, but these are not necessarily the most important ones (e.g. percent, year, among). You can improve this; let’s do it step by step.

### Tokenization
[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of cutting the text into individual ‘tokens’. These can be words, numbers, punctuation marks or symbols; don’t worry about that for now. You’ll need an additional package for this, called the "Natural Language Toolkit": `nltk`.
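
As a rough illustration of what a tokenizer does, the sketch below splits a made-up sentence with a single regular expression. This is a simplification and not the algorithm used by `nltk`; the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word/number tokens and single punctuation tokens."""
    # \w+ matches runs of letters, digits and underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

toy_tokens = simple_tokenize("HIV prevalence is 5.0% among adults.")
print(toy_tokens)
```

Note that this naive rule cuts the decimal number `5.0` into three tokens, which shows why dedicated tokenizers use more elaborate rules.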

#### Exercise
Use `nltk.sent_tokenize` to split the text into sentences.

Use `nltk.word_tokenize` to split a sentence into tokens.

*Note that both functions take a `str` as input and return a list, so you will need to use a list-comprehension to apply `nltk.word_tokenize` on an individual sentence.*

In [None]:
import nltk

# get the input text
text = raw_input.decode('8859')

# break text up into smaller bits -> tokens
# First, split the text up in sentences
sentences = ##your code##
# Then we further split each sentence into tokens
tokenized_sentences = [##your code##]


In [None]:
# concatenate all the tokenized sentences into a single list
gen_wc_barh([token for sent in tokenized_sentences for token in sent], title='Tokenized text')

As predicted, you can see that the extracted tokens resulted in more than just 'words'. What do you think was used as delimiter (or separator) for splitting up the tokens?

### Part-of-Speech tagging
To increase the value of our freshly extracted tokens, we can assign labels/tags to them. As a result, each token is given a grammatical meaning. This process is called [Part-Of-Speech (POS) tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging). The ‘tagging’ is based on a previously trained model, which we can also call from the `nltk` package. The default tags generated by `nltk` can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

#### Exercise
Use `nltk.pos_tag` to perform POS tagging on the tokens.

This function expects a list of tokens and returns a list of tuples `(token, tag)`.

*So again, you will have to use a list-comprehension to apply `nltk.pos_tag` on a sentence.*

In [None]:
# We use the model built into NLTK to assign grammatical meaning to our tokens -> POS tags
tagged_sentences = [##your code##]
print("Example tagged sentence:")
print(tagged_sentences[42])

When visualizing our text, we usually aren't interested in all types of tokens. For our use case we want to discover the topics of our text, so we are more interested in 'word-like' tokens.

In [None]:
# determine which tokens we want to consider
# shorthand POS tag lists:
adj = ['JJ', 'JJR', 'JJS']
noun = ['NN', 'NNS', 'NNP', 'NNPS']
adverb = ['RB', 'RBR', 'RBS']
verb = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

# interesting tokens
tokens = [(token, tag)\
            for sentence in tagged_sentences\
            for (token, tag) in sentence\
            if token.isalpha()\
            and tag in adj + noun + adverb + verb]

# extract the 'token-values' from the tagged sentences
gen_wc_barh([token for (token, tag) in tokens], title='Filtered tokens')

Note that the tagging is not always ‘sure’ about the meaning of a word. Could you determine which (of the) word(s) in the above bar chart has/have multiple tagging possibilities?

#### Exercise
Different text mining applications require different preprocessing.
For example, in sentiment analysis we're mostly interested in adjectives.
Or maybe we're analysing certain verb usage per category, ...

**Apply different filters on the code above to get a different set of tokens.**

### Stop words removal
Spoken languages contain a lot of (small) words that usually don't add extra meaning to a sentence (how many stop words are in this sentence?). To better understand the content of a text, you want to filter out those stop words.

In [None]:
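from nltk.corpus import stopwords
# nltk.download('stopwords')  # needed the first time only
# filter out the english stopwords (the list is lowercase, so compare in lowercase)
stopwords = stopwords.words('english')
tokens = [(token, tag) for (token, tag) in tokens if token.lower() not in stopwords]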

# extract the 'token-values' from the filtered tokens 
gen_wc_barh([token for (token, tag) in tokens], title='Stopwords removed')

Do you see the changes in the wordcloud with and without the removal of stop words? Which words were omitted?

#### Exercise
Check if you guessed correctly which words in the paragraph above are stopwords by filtering them out using the `stopwords` list.

In [None]:
paragraph = "Spoken languages contain a lot of (small) words that usually don't add extra meaning to a sentence (how many stop words are in this sentence?). To better understand the content of a text, you want to filter out those stop words."

# tokenize the paragraph
paragraph_tokenized = ##your code##

# remove the stopwords
paragraph_no_stops = ##your code##

# print the result
print(' '.join([token for (token, tag) in paragraph_no_stops]))

### Stemming
[Stemming](https://en.wikipedia.org/wiki/Stemming) is a preprocessing step whereby inflected words are reduced to their word stem, base or root form. To give an example, the words “computational”, “computers” and “computation” would, according to a predefined algorithm, result in the stem word “comput” after the stemming process. Consequently, words with a similar stem will be grouped together and won’t skew the word count/frequency distribution. However, “comput” isn’t a real word and could obscure the interpretation of the text. Nevertheless, the code is uncomplicated as shown below:

#### Exercise
Use the method `stem` of the initialized stemmer to stem all the `tokens`.

It takes as input a `str` and returns a `str`.

*Note that `tokens` is a list of tuples `(token, tag)`, so again list-comprehensions are our friend and make sure you apply the method to the token only, not the whole tuple.*

In [None]:
from nltk.stem import PorterStemmer

# stem tokens using the Porter Stemmer
stemmer = PorterStemmer()
stemmed_tokens = [##your code##]

gen_wc_barh(stemmed_tokens, title='Stemmed tokens')

*As a side note: stemming could be followed by a stem completion algorithm, which completes the stemmed words (e.g. "comput") to their meaningful counterparts (e.g. "computer"), based on a predefined completion dictionary.*
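
A minimal sketch of such a stem completion step is given below. It assumes the completion dictionary is built from `(stem, original word)` pairs by picking the most frequent original word per stem; the pairs are written by hand here and the function name `build_completion_dict` is hypothetical.

```python
from collections import Counter, defaultdict

# Hand-written (stem, original word) pairs; in practice the stems
# would come from a stemmer applied to the original tokens
pairs = [
    ("comput", "computer"), ("comput", "computers"),
    ("comput", "computer"), ("comput", "computation"),
    ("viru", "virus"), ("viru", "viruses"), ("viru", "virus"),
]

def build_completion_dict(pairs):
    """Map each stem to the most frequent original word that produced it."""
    candidates = defaultdict(Counter)
    for stem, word in pairs:
        candidates[stem][word] += 1
    return {stem: counts.most_common(1)[0][0]
            for stem, counts in candidates.items()}

completion = build_completion_dict(pairs)

# Stems without a known completion are left unchanged
completed = [completion.get(stem, stem) for stem in ["comput", "viru", "unknown"]]
print(completed)
```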

### Lemmatization
While stemming uses predetermined rules to get to the *stem* version of a word, [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) relies on a lexical resource (here WordNet) and is aware of word *meanings*. For example: “are” and “is” will remain two different words after stemming, but both will be changed into “be” by the lemmatization process (with `nltk`'s WordNet lemmatizer this requires passing the part of speech; by default every word is treated as a noun). Still, a good lemmatizer can distinguish between similar words with different meanings (e.g. 'viral' and 'virally'), so they are not combined during lemmatization.
For our lemmatization we are using the 'wordnet' model. When using this for the first time, you need to **download this model first**.
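
The contrast between rule-based stemming and dictionary-based lemmatization can be illustrated without any model. Both the suffix list and the lookup table below are invented for this sketch; they are neither the rules of the Porter stemmer nor the contents of WordNet.

```python
# Hypothetical suffix rules, tried in order (longest first)
SUFFIXES = ("ational", "ation", "ers", "ing", "s")

# Hypothetical lemma lookup table for irregular forms
LEMMAS = {"are": "be", "is": "be", "was": "be", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["computational", "computers", "are", "is", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The suffix rules group “computational” and “computers” but leave irregular forms untouched, while the lookup table handles the irregular forms and nothing else.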

#### Exercise
Use the method `lemmatize` of the initialized lemmatizer to lemmatize all the `tokens`.

It takes as input a `str` and returns a `str`.

*Note that `tokens` is a list of tuples `(token, tag)`, so again list-comprehensions are our friend and make sure you apply the method to the token only, not the whole tuple.*


In [None]:
import nltk.stem.wordnet as wordnet
# nltk.download('wordnet')

# Lemmatize interesting tokens
lemmatizer = wordnet.WordNetLemmatizer()
lemmatized_tokens = [##your code##]
gen_wc_barh(lemmatized_tokens, title='Lemmatized tokens')

What are the main differences between these visuals and the previous ones? Which one is more useful?

## Named Entity Recognition
After all preprocessing steps, the fun *really* begins. Where POS tagging was limited to tagging individual words only, [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) is able to group multiple tokens into predefined categories, such as person names, locations, organizations, products, time... 
We'll give an example below:

In [None]:
# discover named entities based on POS tags
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
# take a 'random' sentence
sentence = list(chunked_sentences)[-4]
print(sentence)
# You can get a more graphical drawing (will show up in a pop-up window)
#sentence.draw()

In the sentence above, you can recognize the detected 'Named Entities' by the prefix `(NE`. We can now use the same wordcloud and bar chart visualizations to highlight the most frequently occurring named entities:

In [None]:
# discover named entities based on POS tags
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

# filter out everything else
named_entities = [chunk.leaves() for sent in chunked_sentences for chunk in sent if hasattr(chunk, "label") and chunk.label() == "NE"]
NE_tokens = [token for leaves in named_entities for (token, tag) in leaves]

gen_wc_barh(NE_tokens, title='Named Entities')

### Bag of words
During visualisation in the previous steps we already implicitly used bag-of-words to pass onto the wordcloud and bar graph generators. Basically it boils down to counting the tokens we are interested in.
Python has some useful built-in classes for this!

In [None]:
from collections import Counter

# get a set of unique tokens in our documents
# (`df` and its 'i_tokens' column are built in the classification example further down)
set_of_words = set([token for token_list in df['i_tokens'] for token in token_list])
print('The documents contain {} unique words'.format(len(set_of_words)))

# count the tokens using a 'Counter' collection
count_of_words = Counter([token for token_list in df['i_tokens'] for token in token_list])
print('The most common words are:')
print(count_of_words.most_common(9))

### Term frequency matrix
We can combine the bags-of-words into document-term matrices (DTMs); however, `sklearn` lets us skip the bag-of-words step by using `vectorizers` to go from text or tokens to a ([sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html)) matrix directly.
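
To make the relation between bags-of-words and a document-term matrix concrete, here is a small pure-Python sketch on two made-up, already tokenized documents. It builds a dense matrix as nested lists, whereas `sklearn` returns a sparse matrix.

```python
from collections import Counter

# Two made-up, already tokenized documents
docs = [
    ["hiv", "prevalence", "adults", "hiv"],
    ["viral", "load", "adults"],
]

# One bag-of-words per document
bags = [Counter(doc) for doc in docs]

# The vocabulary fixes the column order of the matrix
vocabulary = sorted(set(token for doc in docs for token in doc))

# Rows are documents, columns are tokens, cells are counts
# (a Counter returns 0 for tokens that do not occur in a document)
dtm = [[bag[token] for token in vocabulary] for bag in bags]

print(vocabulary)
for row in dtm:
    print(row)
```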

### tf vs tfidf
Let us compare a sample of the term-frequency matrix with the same sample from the term-frequency-inverse-document-frequency matrix.

In [None]:
print('Term frequency')
document_amount = tf_segm.shape[0]
token_amount = tf_segm.shape[1]
print((tf_segm/token_amount * document_amount)[:,:6].todense())
print()

print('Term frequency - Inverse document frequency')
print(tfidf[:,:6].todense())
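
The idea behind the inverse document frequency weighting can be sketched with a few lines of plain Python. The counts below are made up, and the formula is the textbook variant `tf * log(N / df)`; `sklearn` applies additional smoothing and normalization, so its numbers will differ.

```python
import math

# Made-up term counts: rows are documents, columns are terms
terms = ["adults", "hiv", "load"]
counts = [
    [1, 2, 0],
    [1, 0, 1],
    [1, 0, 0],
]

n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Textbook inverse document frequency: log(N / df)
idf = [math.log(n_docs / df) for df in doc_freq]

# tf-idf: raw count multiplied by the idf of the term
toy_tfidf = [[tf * idf[j] for j, tf in enumerate(row)] for row in counts]

for row in toy_tfidf:
    print([round(value, 3) for value in row])
```

A term that occurs in every document (here “adults”) gets weight zero, while terms that are specific to one document are emphasized.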

We continue with the *regular* term-frequency matrix because we want to use LDA in this example.

## A classification example
### Using topic detection
So far, what you mainly have been doing is a so-called [bag-of-words analysis](https://en.wikipedia.org/wiki/Bag-of-words_model). This simplified model literally throws all words (or tokens, if you wish) of a document into a ‘bag’ and then looks for the most frequent ones. Now, we will do this for multiple documents, categorizing topics for each document. As a result, you should be able to predict the topics of new, unseen documents.
For this, you will need to know about two more text mining techniques: *(i)* converting the bag of words to a [document-term matrix (**DTM**)](https://en.wikipedia.org/wiki/Document-term_matrix) and *(ii)* the [latent Dirichlet allocation model (**LDA**)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). 

Simply put, a DTM is just another mathematical representation of multiple bag-of-words analyses, with rows corresponding to the documents in the collection and columns corresponding to the tokens. In Python, you can use the `CountVectorizer` class from `sklearn` to create a DTM.
An LDA model is essentially used to discover topics in documents, based on a variety of variables, such as the number of topics, number of documents, probability and distribution of words, identity and weights of the words… The math behind the model is mind-boggling (have a look [here](https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution)). Luckily for us, calling the LDA model in Python is not that hard.

In [1]:
import os
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#txt_dir = r'..\output\medicine_txt'
txt_dir = os.path.join(project_root, 'output', 'medicine_txt')

file_name = []
input_text = []

# read all txt files (one time)
for curr_dir,_,filenames in os.walk(txt_dir):
    for filename in filenames:
        # filter to select only the pdfs that were converted using the 'simple' option
        if filename[:7] == 'simple_':
            # decode using the iso-8859 character set
            with open(os.path.join(curr_dir, filename), 'rt', encoding='8859') as file:
                txt_input = file.read()
                # consider only files with at least 100 characters
                if len(txt_input) > 99:
                    # strip '.txt' from the filename
                    file_name.append(filename[7:-4])
                    input_text.append(txt_input)

# store as pd.DataFrame
df = pd.DataFrame({'filename':file_name, 'text':input_text})



In [None]:
# extract tokens
df['tokens'] = df['text'].apply(lambda txt: [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(txt)])

# POS-tagging
df['POS_tags'] = df['tokens'].apply(lambda tokens: [nltk.pos_tag(sent) for sent in tokens])

# split on tags
adj = ['JJ', 'JJR', 'JJS']
#df['adj'] = df['POS_tags'].apply(lambda tokens: [token for sent in tokens for (token, tag) in sent if tag in adj])
noun = ['NN', 'NNS', 'NNP', 'NNPS']
#df['noun'] = df['POS_tags'].apply(lambda tokens: [token for sent in tokens for (token, tag) in sent if tag in noun])
adverb = ['RB', 'RBR', 'RBS']
#df['adverb'] = df['POS_tags'].apply(lambda tokens: [token for sent in tokens for (token, tag) in sent if tag in adverb])
verb = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
#df['verb'] = df['POS_tags'].apply(lambda tokens: [token for sent in tokens for (token, tag) in sent if tag in verb])

# select only tokens that were tagged as 'adjective', 'noun' or 'verb'.
df['i_tokens'] = df['POS_tags'].apply(lambda tokens: [token for sent in tokens for (token, tag) in sent if token.isalpha() and tag in adj + noun + verb])

In [None]:
# Create a document-term-matrix from the full text
tf_full_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf_full = tf_full_vectorizer.fit_transform(df['text'])

# Create a document-term-matrix from the selected tokens only
tf_segm_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf_segm = tf_segm_vectorizer.fit_transform(df['i_tokens'].apply(' '.join))

# Create a term frequency-inverse document frequency (tf-idf) matrix from the selected tokens
# (used in the tf vs tfidf comparison)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df['i_tokens'].apply(' '.join))

In [None]:
# init LDA to look for 3 topics
lda_full = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
# a second model for the segmented DTM
lda_segm = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
# fit the model and transform for results
lda_full_results = lda_full.fit_transform(tf_full)
lda_segm_results = lda_segm.fit_transform(tf_segm)

# read in a new file
txt_file = os.path.join(project_root, 'output', 'output_txt', 'guess_encoding.txt')
with open(txt_file, mode='rt', encoding='8859') as byte_file:
    new_text = byte_file.read()

# transform the new text to a doc-term-matrix
tf_full_new = tf_full_vectorizer.transform([new_text])
tf_segm_new = tf_segm_vectorizer.transform([new_text])
# and transform (without refitting the model) to discover associated topics
lda_full_new = lda_full.transform(tf_full_new)
lda_segm_new = lda_segm.transform(tf_segm_new)
print(lda_full_new)
print(lda_segm_new)

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
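# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print_top_words(lda_full, tf_full_vectorizer.get_feature_names_out(), 9)
print_top_words(lda_segm, tf_segm_vectorizer.get_feature_names_out(), 9)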

# TODO: add feedback on the resulting topics here?


## Other

Useful libraries:

- re
- gensim
- spaCy
- polyglot
- scikit-learn


Popular natural language classifier:
- Naive-Bayes
