## Step 0: Latent Dirichlet Allocation ##

LDA is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it will contain words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 
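
The generative story above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy vocabulary, the prior values of `0.5`, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and are not part of this notebook or of any library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, doc_alpha, length, rng):
    """Pick a topic mixture for the document, then a topic and a word for each position."""
    topic_mix = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic according to the document's topic mixture
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        # Choose a word according to that topic's word distribution
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

rng = random.Random(400)
vocabulary = ["match", "coach", "team", "budget", "council", "tax"]  # hypothetical toy vocabulary
# One word distribution per topic, each drawn from a Dirichlet prior
topic_word_probs = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(2)]
topic_mix, words = generate_document(topic_word_probs, vocabulary, [0.5, 0.5], 8, rng)
print(topic_mix)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.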

## Step 1: Load the dataset

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.

In [1]:
'''
Load the dataset from the CSV and save it to 'data'
'''
import pandas as pd
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
print("There are {:,} rows and {} columns in dataset.".format(data.shape[0], data.shape[1]))

There are 1,103,665 rows and 2 columns in dataset.


In [2]:
# Printing the names of the two columns
print("There are two columns in the dataset: {} and {}".format(data.columns[0], data.columns[1]))

There are two columns in the dataset: publish_date and headline_text


In [3]:
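# We only need the headline text column from the data (first 500 rows)
# .copy() avoids pandas' SettingWithCopyWarning when the 'index' column is added
data_text = data[['headline_text']].head(500).copy()
data_text['index'] = data_text.index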

documents = data_text

Let's glance at the dataset:

In [4]:
'''
Get the total number of documents
'''
print(len(documents))

500


In [5]:
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have fewer than 3 characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized**: words in third person are changed to first person, and verbs in past and future tenses are changed into present.
* Words are **stemmed**: words are reduced to their root form (removed from this notebook, since spaCy does not support stemming).


In [6]:
'''
Loading spacy library
'''
import spacy
import numpy as np
np.random.seed(400)

In [7]:
# Loading english small model
nlp = spacy.load("en_core_web_sm")

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatization example.

In [8]:
my_doc = nlp("To be or not to be is the question")
for token in my_doc:
    # print(token, token.lemma)
    print(token, token.lower_, token.lemma_)

To to to
be be be
or or or
not not not
to to to
be be be
is is be
the the the
question question question


In [85]:
'''
Write a function to perform the preprocessing steps on the entire dataset
'''
# Tokenize, lemmatize and remove stopwords
stop_words = nlp.Defaults.stop_words
def preprocess(text):
    result = []
    text_nlp = nlp(text)
    for token in text_nlp:
        # Keep only tokens longer than 2 characters
        if len(token) > 2:
            result.append(token.lemma_)

    # Drop lemmas that are spaCy stop words
    result = [word for word in result if word not in stop_words]
    return result



In [86]:
preprocess("The quick brown fox jumps over a lazy dog")

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']

In [87]:
'''
Preview a document after preprocessing
'''
document_num = 250
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['drought', 'taking', 'toll', 'on', 'insects']


Tokenized and lemmatized document: 
['drought', 'toll', 'insect']


Let's now preprocess all the news headlines we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `headline_text` column

**Note**: This may take a few minutes (it takes 6 minutes on my laptop)

In [88]:
import time
# time.clock() was removed in Python 3.8; perf_counter() is the replacement for timing code
start_time = time.perf_counter()
# Preprocess all the headlines, saving the results as 'processed_docs'
processed_docs = documents['headline_text'].map(preprocess)
time_elapsed = time.perf_counter() - start_time
print("Time taken to process the document: {}".format(time_elapsed))

Time taken to process the document: 18.738136700000723


In [89]:
'''
Preview 'processed_docs'
'''
processed_docs[:10]

0      [aba, decide, community, broadcasting, licence]
1              [act, fire, witness, aware, defamation]
2                 [infrastructure, protection, summit]
3                [air, staff, aust, strike, pay, rise]
4         [air, strike, affect, australian, traveller]
5               [ambitious, olsson, win, triple, jump]
6          [antic, delighted, record, breaking, barca]
7    [aussie, qualifier, stosur, waste, memphis, ma...
8             [aust, address, security, council, iraq]
9               [australia, lock, war, timetable, opp]
Name: headline_text, dtype: object

In [15]:
print(type(processed_docs))
print(type(processed_docs[0]))

<class 'pandas.core.series.Series'>
<class 'list'>


## Step 3: Bag-of-words and TF-IDF on dataset

In [16]:
# Importing libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Step 3.1: Word count on the entire dataset

Now let's create a dictionary from 'processed_docs' containing the number of times a word appears in the training set. 

In [None]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set and call it 'dictionary'
'''
# TODO
dictionary = {'the':2, 'a': 5, 'team':67}

In [None]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break
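
The placeholder dictionary above is still marked as a TODO. Later cells call `doc2bow` and `filter_extremes`, which suggests that a gensim `corpora.Dictionary(processed_docs)` is what the exercise expects. As a library-free illustration of the same idea, the sketch below counts words with `collections.Counter`; `sample_docs` is a hypothetical stand-in for `processed_docs`.

```python
from collections import Counter

# Hypothetical stand-in for 'processed_docs': one token list per headline
sample_docs = [
    ["air", "staff", "strike", "pay", "rise"],
    ["air", "strike", "affect", "australian", "traveller"],
    ["council", "pay", "rise"],
]

# Total occurrences of each word across all documents
word_counts = Counter(token for doc in sample_docs for token in doc)
# Number of documents each word appears in (each document counted once)
doc_freqs = Counter(token for doc in sample_docs for token in set(doc))

for word, count in word_counts.most_common(3):
    print(word, count, doc_freqs[word])
```

The document frequencies in `doc_freqs` are the quantity that frequency-based filtering operates on.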

**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).
* after (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `None`).
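
The three rules above can be approximated in plain Python. This is a sketch of the filtering logic only, not gensim's implementation; the function name `filter_extremes_sketch` and the sample documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive document-frequency filtering (illustrative only)."""
    # Document frequency: in how many documents does each token occur?
    doc_freqs = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # convert the fraction to an absolute count
    kept = [tok for tok, df in doc_freqs.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically so the result is deterministic
    kept.sort(key=lambda tok: (-doc_freqs[tok], tok))
    return kept if keep_n is None else kept[:keep_n]

sample_docs = [
    ["air", "strike", "pay"],
    ["air", "strike", "traveller"],
    ["air", "council", "pay"],
    ["air", "council", "summit"],
]
# 'air' is in all 4 documents (more than 75%), 'traveller' and 'summit' in only 1
print(filter_extremes_sketch(sample_docs, no_below=2, no_above=0.75))
# → ['council', 'pay', 'strike']
```

On the headline corpus, the optional step below would correspond to `no_below=15` and `no_above=0.1`.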

In [None]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing less than 15 times
- words appearing in more than 10% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above


### Step 3.2: Scikit-learn bag-of-words using CountVectorizer

In [90]:
'''
Create the bag-of-words model: for each document, count how many times each word appears.
The document-term count matrix is saved to 'docs' and the vocabulary to 'bow_corpus'.
'''
# The documents are already token lists, so the analyzer returns them unchanged
bow_obj = CountVectorizer(analyzer=lambda x: x)
docs = bow_obj.fit_transform(processed_docs)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bow_corpus = bow_obj.get_feature_names_out().tolist()
print(bow_corpus)

['100th', '120bn', '250', '2500', '260', '300', '302', '314', '353', '370', '50000', '61st', 'ab', 'aba', 'abandon', 'abattoir', 'aboard', 'aboriginal', 'academic', 'accc', 'access', 'accident', 'accidental', 'accusation', 'accuse', 'acid', 'acquit', 'act', 'action', 'address', 'adelaide', 'administrator', 'adventure', 'advertising', 'aec', 'aek', 'affect', 'afl', 'africa', 'agree', 'agreement', 'agriculture', 'ahead', 'aid', 'air', 'aircraft', 'airport', 'ajax', 'alcohol', 'alcoholic', 'alinghi', 'alive', 'allan', 'allege', 'allegedly', 'alliance', 'allocate', 'allocation', 'allow', 'alp', 'alternative', 'ama', 'ambitious', 'ambo', 'ambulance', 'amcor', 'america', 'andersson', 'anger', 'angler', 'angry', 'ankle', 'anniversary', 'announce', 'antarctic', 'anti', 'antic', 'anz', 'apologise', 'appeal', 'appoint', 'arab', 'arabia', 'arabian', 'area', 'arm', 'armed', 'army', 'arrest', 'arrive', 'arrogance', 'arsenal', 'art', 'asia', 'asian', 'ask', 'assurance', 'asylum', 'atp', 'attack', ...

In [50]:
type(bow_corpus)

list

In [18]:
len(bow_corpus)

1459
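
To make the two results of the bag-of-words step concrete, the sketch below builds the same kind of structures by hand: a sorted vocabulary (what `bow_corpus` holds here) and one row of counts per document (what `docs` holds, as a sparse matrix). The two sample documents are hypothetical.

```python
# Hypothetical stand-in for 'processed_docs'
sample_docs = [
    ["air", "strike", "pay", "strike"],
    ["council", "pay"],
]

# Sorted vocabulary: one column per distinct token
vocabulary = sorted({token for doc in sample_docs for token in doc})
column = {token: i for i, token in enumerate(vocabulary)}

# One row per document, holding the count of each vocabulary token
count_matrix = []
for doc in sample_docs:
    row = [0] * len(vocabulary)
    for token in doc:
        row[column[token]] += 1
    count_matrix.append(row)

print(vocabulary)    # → ['air', 'council', 'pay', 'strike']
print(count_matrix)  # → [[1, 0, 1, 2], [0, 1, 1, 0]]
```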

### Step 3.3: TF-IDF on our document set ###

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: The author of Gensim states that the standard procedure for LDA is to use the bag-of-words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log_e(Total number of documents / Number of documents with term w in it)`.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

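* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log_10(10,000,000 / 1,000) = 4`. This uses a base-10 logarithm to keep the numbers round; with the natural logarithm from the definition above, `IDF = log_e(10,000) ≈ 9.21`.

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12` (or `0.03 * 9.21 ≈ 0.28` with the natural logarithm).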
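
The worked example can be checked numerically. In the sketch below, the helper names `term_frequency` and `inverse_document_frequency` are hypothetical. The figure of 4 corresponds to a base-10 logarithm, while the natural logarithm gives a larger value.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: share of the document's terms that are the given term."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term, log=math.log):
    """IDF: logarithm of (total documents / documents containing the term)."""
    return log(total_docs / docs_with_term)

tf = term_frequency(3, 100)
idf_base10 = inverse_document_frequency(10_000_000, 1_000, log=math.log10)
idf_natural = inverse_document_frequency(10_000_000, 1_000)

print(tf)                              # 0.03
print(idf_base10, tf * idf_base10)     # 4.0 and roughly 0.12
print(idf_natural, tf * idf_natural)   # roughly 9.21 and 0.28
```

The base of the logarithm only rescales every score by the same constant, so it does not change how words rank against each other.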

In [None]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models

# TODO
tfidf = 

In [None]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = 

In [None]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break
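
The TF-IDF cells above are left as TODOs. As a rough picture of what a `(token_id, tfidf score)` preview contains, the sketch below applies the TF and IDF definitions from this section to a tiny hypothetical corpus. Library implementations such as gensim's `TfidfModel` or scikit-learn's `TfidfTransformer` apply their own weighting and normalization, so their scores will differ from these.

```python
import math
from collections import Counter

# Hypothetical stand-in for 'processed_docs'
sample_docs = [
    ["air", "strike", "pay"],
    ["air", "council", "pay", "pay"],
    ["air", "summit"],
]

vocabulary = sorted({token for doc in sample_docs for token in doc})
token_id = {token: i for i, token in enumerate(vocabulary)}
doc_freqs = Counter(token for doc in sample_docs for token in set(doc))

def tfidf_vector(doc):
    """Return sorted (token_id, score) pairs for one document."""
    counts = Counter(doc)
    scores = []
    for token, count in counts.items():
        tf = count / len(doc)
        idf = math.log(len(sample_docs) / doc_freqs[token])
        scores.append((token_id[token], tf * idf))
    return sorted(scores)

corpus_tfidf_sketch = [tfidf_vector(doc) for doc in sample_docs]
print(corpus_tfidf_sketch[0])
```

Note that "air", which occurs in every document, receives a score of exactly 0, while "strike", which occurs in only one, scores highest in the first document.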

## Step 4: Running LDA ##

In [23]:
# Importing libraries
from sklearn.decomposition import LatentDirichletAllocation, NMF 

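### Step 4.1: Running LDA using Bag of Words ###

We are aiming for 10 topics in the document corpus; note that the cell below fits only 5 (`n_components=5`) with 5 iterations.

**scikit-learn's `LatentDirichletAllocation` runs in a single process by default; pass `n_jobs=-1` to use all CPU cores and speed up model training.**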


In [91]:
'''
Fit scikit-learn's LatentDirichletAllocation on the document-term count matrix 'docs'
'''
lda_model = LatentDirichletAllocation(n_components=5, max_iter=5)
lda_model.fit(docs)

LatentDirichletAllocation(max_iter=5, n_components=5)

In [92]:
lda_model.components_

array([[0.20000764, 0.20000878, 0.20001228, ..., 0.20000552, 0.20001369,
        1.19997709],
       [0.20000688, 1.19996765, 0.20001106, ..., 0.20000516, 0.20001245,
        0.20000565],
       [0.20000728, 0.20000786, 1.19652871, ..., 0.20000521, 0.20001289,
        0.20000558],
       [1.19997112, 0.20000762, 0.20343657, ..., 2.19997898, 2.20121512,
        0.20000584],
       [0.20000708, 0.2000081 , 0.20001138, ..., 0.20000514, 1.19874586,
        0.20000584]])

In [93]:
lda_model.components_[0]

array([0.20000764, 0.20000878, 0.20001228, ..., 0.20000552, 0.20001369,
       1.19997709])

In [94]:
print(len(lda_model.components_[0]))
print(len(bow_corpus))

1378
1378


In [95]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
topic_1 = lda_model.components_[0]
word_list = bow_corpus

In [97]:
import numpy as np
# Indices of the 19 highest-weighted words in the topic, in descending order of weight
sorted_TopicWord = np.argsort(topic_1)[:-20:-1]

In [98]:
sorted_TopicWord

array([1356,  775,  457,  316,  836,   28,  968,  601,  537, 1216, 1010,
        888,  698,  554,  985, 1344,  856,   42,  609], dtype=int64)

In [99]:
for index in sorted_TopicWord:
    word = word_list[index]
    value = topic_1[index]
    print(word, value)

win 4.20363635060956
meeting 4.201703555086007
esso 4.199977305014401
council 3.7976765406139528
nsw 3.289829458866724
action 3.21800573093237
public 3.210037932991938
hold 3.2078489124874077
gas 3.203549928771822
target 3.2027111648298936
record 3.2026901718408625
pay 3.2009435452842414
korean 3.2002163729364126
govt 3.1961391923290376
race 3.1882884869006087
welcome 2.8089972430915644
opp 2.314321715981768
ahead 2.207151433132504
hospital 2.206882736330703
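
The same inspection can be repeated for every topic at once. The sketch below is one way to do it, assuming a weight matrix shaped like `lda_model.components_` and a vocabulary list like `bow_corpus`; the helper name `top_words_per_topic` and the toy values are hypothetical. Each row is normalized so that a topic's weights sum to 1, which makes topics easier to compare. The slice `[::-1][:n_top]` returns exactly `n_top` indices, whereas `[:-20:-1]` returns 19.

```python
import numpy as np

def top_words_per_topic(components, vocabulary, n_top=5):
    """Return, for each topic, the n_top (word, probability) pairs with the largest weight."""
    components = np.asarray(components, dtype=float)
    # Normalize each row so the weights of a topic sum to 1
    probs = components / components.sum(axis=1, keepdims=True)
    result = []
    for topic_probs in probs:
        top_idx = np.argsort(topic_probs)[::-1][:n_top]
        result.append([(vocabulary[i], float(topic_probs[i])) for i in top_idx])
    return result

# Hypothetical stand-ins for lda_model.components_ and the vocabulary list
toy_components = [[4.2, 0.2, 1.2, 0.2], [0.2, 3.2, 0.2, 2.2]]
toy_vocabulary = ["win", "council", "record", "govt"]
toy_result = top_words_per_topic(toy_components, toy_vocabulary, n_top=2)
for topic_idx, pairs in enumerate(toy_result):
    print(topic_idx, pairs)
```

With the fitted model, the call would be `top_words_per_topic(lda_model.components_, bow_corpus, n_top=10)`.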


### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

### Step 4.2: Running LDA using TF-IDF ###

In [None]:
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''
# TODO
lda_model_tfidf = 

In [None]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

In [None]:
# Charting the word weights in each topic of the scikit-learn model ('lda_model')

import matplotlib.pyplot as plt

n_top_words = 10
n_topics = lda_model.components_.shape[0]
fig, axes = plt.subplots(nrows=1, ncols=n_topics, figsize=(20, 7))
fig.suptitle("Top words per topic", fontsize="x-large")

for topic_idx, topic in enumerate(lda_model.components_):
    # Indices of the highest-weighted words, in descending order of weight
    top_idx = topic.argsort()[::-1][:n_top_words]
    top_words = [bow_corpus[i] for i in top_idx]
    axes[topic_idx].barh(top_words, topic[top_idx])
    axes[topic_idx].invert_yaxis()  # largest weight at the top
    axes[topic_idx].set(title="Topic {}".format(topic_idx), xlabel="Weight")

plt.show()

### Classification of the topics ###

As we can see, when using TF-IDF, heavier weights are given to less frequent words, which results in nouns being factored in. That makes it harder to figure out the categories, as nouns can be hard to categorize. This goes to show that the choice of model depends on the type of text corpus we are dealing with.

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [None]:
'''
Text of sample document 4310
'''
processed_docs[4310]

In [None]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 4310
# Our test document is document number 4310

# TODO
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

### It has the highest probability (`x`) to be part of the topic that we assigned as X, which is the accurate classification. ###

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [None]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

### It has the highest probability (`x%`) to be part of the topic that we assigned as X. ###

## Step 6: Testing model on unseen document ##

In [None]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

The model correctly classifies the unseen document with 'x'% probability to the X category.