# Step 0: Latent Dirichlet Allocation
I'm going to start by trying [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for classification. I'm not going to go into how it does what it does; I'm just going to use it. LDA assigns the text in a document to a particular topic (or a mixture of topics).

# Step 1: Load the Dataset
I'm going to start with a subset of the RSS data feeds from NYTimes and WSJ. For now, I'm going to add some code to pull this data out of our database rather than putting the data in a CSV file. This will make filtering the data much easier than having to produce a new CSV file every time we want to change something.

Notes:
1. I'm using sqlalchemy and pymysql rather than Oracle's MySQL connector because I want to put the data into a pandas DataFrame, which is far better suited to analysis than Python's built-in list.

**Dataset Inputs**  
MaxItems = maximum number of rows to load from the database  
DtStart = look for articles published on or after this date  
DtStop  = look for articles published before this date  
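
The three inputs above can also be handed to the database as bound parameters instead of being formatted into the SQL string. Here is a minimal sketch of that idea. To keep it self-contained it uses Python's built-in `sqlite3` with an in-memory table rather than our MySQL database, and the helper `load_titles` and the sample rows are hypothetical, made up for illustration only.

```python
import sqlite3

def load_titles(conn, dt_start, dt_stop, max_items):
    """Return titles published on or after dt_start and before dt_stop."""
    # The ? placeholders are filled in by the driver, not by string formatting.
    query = "SELECT Title FROM Item WHERE ? <= PubDt AND PubDt < ? LIMIT ?"
    return [row[0] for row in conn.execute(query, (dt_start, dt_stop, max_items))]

# Hypothetical stand-in for the real table: dates stored as ISO strings,
# which compare correctly as plain text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (Title TEXT, PubDt TEXT)")
conn.executemany(
    "INSERT INTO Item VALUES (?, ?)",
    [
        ("Too early", "2020-08-31"),
        ("In range", "2020-09-01"),
        ("Also in range", "2021-11-14"),
        ("Too late", "2021-11-15"),
    ],
)

titles = load_titles(conn, "2020-09-01", "2021-11-15", 10000)
print(titles)
conn.close()
```

Only the two rows inside the half-open date window come back: the start date is included and the stop date is excluded, matching the definitions of DtStart and DtStop.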


In [1]:
from sqlalchemy import create_engine
import pymysql
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

dburl     = 'mysql+pymysql://ec2-user@localhost:3306'
sqlEngine = create_engine(dburl)
cnx       = sqlEngine.connect()

#-------------------------------------------------------------
#  Set query variables and get the data
#-------------------------------------------------------------
MaxItems  = 10000
DtStart   = "2020-09-01"
DtStop    = "2021-11-15"
q         = 'SELECT Title FROM plato.Item WHERE "{}" <= PubDt AND PubDt < "{}" LIMIT {};'.format(DtStart,DtStop,MaxItems)
documents = pd.read_sql(q, cnx)
documents['index'] = documents.index
print(documents)
documents['Title'] = documents['Title'].apply(lambda s: s.lower())
print(documents)
cnx.close()

                                                 Title  index
0    When Is It Too Late to Buy the Hottest Gifts O...      0
1              Robots vs. Fatbergs in America's Sewers      1
2    Bitcoin's Creator Could Be Unmasked at Florida...      2
3     Toxic Positivity Is Very Real, and Very Annoying      3
4    Daylight-Saving Time Ends Sunday. Four Ways to...      4
..                                                 ...    ...
452          Vienna Reels From a Rare Terrorist Attack    452
453          Coronavirus Briefing: What Happened Today    453
454  She Was Losing Fistfuls of Hair. What Was Caus...    454
455  Think You Have ‘Normal’ Blood Pressure? Think ...    455
456  Navigating My Son’s A.D.H.D. Made Me Realize I...    456

[457 rows x 2 columns]
                                                 Title  index
0    when is it too late to buy the hottest gifts o...      0
1              robots vs. fatbergs in america's sewers      1
2    bitcoin's creator could be unmasked at fl

# Step 2: Data Processing
* **Tokenize**: split the text into sentences and the sentences into words, convert to lower case, and remove punctuation
* Remove 1- and 2-character words
* Remove **stopwords** (the, ...)
* **Lemmatize**: reduce each word to its dictionary form; for verbs, this changes all tenses to the present tense
* **Stem**: reduce words to their root form
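
To make the first three steps concrete, here is a rough plain-Python sketch of tokenizing, dropping short words, and dropping stopwords. It is only an approximation of what a real tokenizer does: the tiny stopword set and the name `tokenize_sketch` are made up for this illustration, and lemmatizing and stemming are left out entirely.

```python
import re

# Hypothetical, deliberately tiny stopword list (real lists are much longer).
STOPWORDS_SKETCH = frozenset(
    {"a", "an", "the", "from", "of", "is", "it", "to", "in", "and", "very", "when", "too"}
)

def tokenize_sketch(text, min_len=3):
    """Lowercase, keep runs of letters, drop short words and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS_SKETCH]

print(tokenize_sketch("Robots vs. Fatbergs in America's Sewers"))
# ['robots', 'fatbergs', 'america', 'sewers']
```

Note that "vs" and the stray "s" from "America's" disappear because of the length rule, not the stopword list.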
 
**Notes**  
The simple example cases for <code>nltk.download('wordnet')</code> that I've seen on the net do not include any of the SSL code. I tried it that way first, and it resulted in an SSL error. The SSL code shown below comes from a [Stack Overflow question](https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed); apparently others have hit this problem as well. It works by disabling certificate verification for the download.

**Notes**  
<code>gensim</code> is an open-source, production-ready machine learning library for unsupervised topic modeling. It seems like it will be perfect for what we have in mind.  
<code>nltk</code> is the Natural Language Toolkit. It is a standard for processing text and seems to be used a lot in machine learning.  

In [2]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(10000000)
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/sman/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Lemmatize Examples
Just to get the idea, look at what happens when we lemmatize "went", "gone", and "going" as verbs.

In [3]:
for w in ["went", "gone", "going"]:
    print( "{} -> {}".format(w,WordNetLemmatizer().lemmatize(w, pos='v')))


went -> go
gone -> go
going -> go


## Stemmer Examples
Here are some examples that show how the stemmer works:

In [4]:
stemmer = SnowballStemmer("english")
wordsIn = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 
           'humbled', 'meeting', 'itemization', 'sensational', 'sensation',
           'traditional', 'tradition', 'reference', 'colony', 'colonizer', 'colon']
wordsOut = [stemmer.stem(plural) for plural in wordsIn]
pd.DataFrame(data={'Input Word':wordsIn, 'Output Word':wordsOut})

| | Input Word | Output Word |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | humbled | humbl |
| 8 | meeting | meet |
| 9 | itemization | item |


## Process the Article Titles
The function <code>lemstem</code> simply applies the lemmatizer and stemmer on the supplied text. The function <code>preprocess</code> tokenizes the input text.

In [58]:
def lemstem(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text,pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) >= 3:
            result.append(lemstem(token))
    return result

To validate this code, let's grab a title from the dataset, run it through <code>preprocess</code>, and review what comes back.

In [84]:
docno = 357  # pick one at random
sample = documents[documents['index'] == docno].values[0][0]
print("original document:")
words = []
for w in sample.split(' '):
    words.append(w)
print(words)
print("\n\npreprocessed document:")
print(preprocess(sample))

original document:
['a', 'dispatch', 'from', 'an', 'endangered', 'bird’s', '‘garden', 'of', 'eden’']


preprocessed document:
['dispatch', 'endang', 'bird', 'garden', 'eden']


Now let's process everything. This step may take a while if the dataset gets large.

In [85]:
processed_docs = documents['Title'].map(preprocess)
print(processed_docs)

0                      [late, buy, hottest, gift, onlin]
1                       [robot, fatberg, america, sewer]
2             [bitcoin, creator, unmask, florida, trial]
3                            [toxic, posit, real, annoy]
4      [daylight, save, time, end, sunday, way, win, ...
                             ...                        
452              [vienna, reel, rare, terrorist, attack]
453                  [coronavirus, brief, happen, today]
454                             [lose, fist, hair, caus]
455               [think, normal, blood, pressur, think]
456                                 [navig, son, realiz]
Name: Title, Length: 457, dtype: object


## Step 3.1  "Bag of Words" on the dataset
Topics are extracted based on the frequency of the words. So the first step is to create a dictionary from <code>processed_docs</code> containing the words that appear in the "training set" (the article titles), along with how often they appear.

In [86]:
dictionary = gensim.corpora.Dictionary(processed_docs)

Let's print out the first 10 entries (word id, word):

In [87]:
i = 0
for k,v in dictionary.iteritems():
    print(k,v)
    i += 1
    if i > 9:
        break

0 buy
1 gift
2 hottest
3 late
4 onlin
5 america
6 fatberg
7 robot
8 sewer
9 bitcoin


To make the model training faster, we can filter out the extremes. Very rare words are probably not significant, so <code>no_below=5</code> drops words that appear in fewer than 5 documents. The talks I've listened to say that words appearing in more than 50% of the documents are probably not significant either, which is what <code>no_above=0.5</code> removes (this may be true, but it has not been settled for me yet; I will probably change the params around a bit and try to understand this better).

In [88]:
dictionary.filter_extremes(no_below=5, no_above=0.5)
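
To see what this kind of filtering does, here is a simplified plain-Python sketch of document-frequency filtering. It is my own approximation of the idea, not gensim's implementation: the function name `filter_by_doc_freq` and the toy documents are made up, and gensim's own rounding and its cap on the total vocabulary size are ignored.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens that appear in at least no_below documents and in at most
    a fraction no_above of all documents."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {
        tok
        for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }

# Toy corpus: "covid" is in every document, "vaccin" in only one.
toy_docs = [["covid", "plan"], ["covid", "vaccin"], ["covid", "plan"], ["covid"]]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here "covid" is dropped for being too common (4 of 4 documents), "vaccin" is dropped for being too rare (1 document), and only "plan" survives.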

Now we want to use the <code>doc2bow</code> function to create a bag-of-words: a list of 2-tuples (token_id, token_count) for each document. <code>doc2bow</code> requires that the input is already tokenized and normalized into a list of words (either unicode or utf8 strings), so we'll use <code>processed_docs</code> from above. Essentially, we end up with a corpus that reports, for each title, which dictionary words appear in it and how many times.

In [89]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
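
For intuition, here is a small plain-Python sketch of what a bag-of-words conversion does. The helpers `build_vocab` and `doc_to_bow` are hypothetical stand-ins written for this note, and the ids are assigned in first-seen order, which is a simplification and will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [
    ["think", "normal", "blood", "pressur", "think"],
    ["lose", "fist", "hair", "caus"],
]
vocab = build_vocab(toy_docs)
bow = doc_to_bow(toy_docs[0], vocab)
print(bow)
# [(0, 2), (1, 1), (2, 1), (3, 1)]
```

The word "think" appears twice in the first title, so its id carries a count of 2; every other word has a count of 1. A document made up entirely of words missing from the vocabulary comes back as an empty list.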

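Do a quick check on one of the rows... we'll use index <code>docno</code>. For each tuple, the first number is the id of the word in the dictionary, and the second number is the count for that word in the title. For most titles, the count will be 1 for every word. Words that were removed by <code>filter_extremes</code> are skipped, so a title whose words were all filtered out produces an empty list.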

In [90]:
bow_docno = bow_corpus[docno]
print(bow_docno)
for i in range(len(bow_docno)):
    print('Word {} ("{}") count: {}'.format(bow_docno[i][0],
                                            dictionary[bow_docno[i][0]],
                                            bow_docno[i][1]))

[]


## Step 3.2 TF-IDF the Document Set
While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus for initialization. During transformation, it takes a vector and returns another vector of the same dimensionality.

_Note: According to the author of gensim, the standard procedure for LDA is to use the Bag of Words model._

TF-IDF stands for "Term Frequency, Inverse Document Frequency"

* It is a way to score the importance of words in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, then it is important and will be given a high score. But if a word appears in many documents, it is not a unique identifier and will be given a lower score.
* So, common words like "the" and "for" will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)
* IDF(w) = log(Total number of documents / Number of documents with term w in it). The base of the logarithm only rescales the scores; the example below uses base 10.

Example:
* Consider a document containing 100 words in which the word "tiger" appears 3 times.
* The term frequency (TF) for "tiger" is:
  * TF = (3/100) = 0.03
* Now, assume we have 10 million documents and the word "tiger" appears in 1,000 of them. Then, the inverse document frequency (IDF) is:
  * IDF = log_10(10,000,000 / 1,000) = 4
* Thus, the TF-IDF weight is the product of these quantities:
  * TFIDF = 0.03 * 4 = 0.12
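
The "tiger" arithmetic above can be checked with a few lines of Python. This is just a sketch of the textbook formulas, using a base-10 logarithm so that the numbers match the example; the function names are my own.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: share of the document's words that are this term."""
    return term_count / total_terms

def inverse_document_frequency(num_docs, docs_with_term):
    """IDF with a base-10 logarithm, as in the tiger example."""
    return math.log10(num_docs / docs_with_term)

tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tf_idf = tf * idf
print("TF = {:.2f}, IDF = {:.1f}, TF-IDF = {:.2f}".format(tf, idf, tf_idf))
# TF = 0.03, IDF = 4.0, TF-IDF = 0.12
```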


In [91]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
print(tfidf)

TfidfModel(num_docs=457, num_nnz=546)


Apply the transformation to the entire corpus:

In [92]:
corpus_tfidf = tfidf[bow_corpus]
print(corpus_tfidf[1])
from pprint import pprint
pprint(corpus_tfidf[1])

[(1, 1.0)]
[(1, 1.0)]
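
The output above shows a single term with a weight of exactly 1.0. A likely explanation, assuming gensim's default settings as I understand them (a base-2 IDF and scaling each document vector to unit length), is that a title with only one surviving word always normalizes to 1.0. The sketch below illustrates that with a hypothetical helper `tfidf_unit` and a made-up document frequency; it is not gensim's code.

```python
import math

def tfidf_unit(bow, doc_freq, num_docs):
    """Weight each (token_id, count) by count * log2(N / df),
    then scale the vector to unit (L2) length."""
    weights = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights]

# Hypothetical document frequencies for two token ids in a 457-title corpus.
doc_freq = {1: 20, 2: 5}

single = tfidf_unit([(1, 1)], doc_freq, 457)
pair = tfidf_unit([(1, 1), (2, 1)], doc_freq, 457)
print(single)  # one term: its weight normalizes to (about) 1.0
print(pair)    # two terms: the rarer token id 2 gets the larger weight
```

With two terms, the squared weights add up to 1, and the rarer word ends up with the larger share.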


## Step 4:  Running LDA using Bag of Words
The model is finally ready to do the unsupervised learning. We choose the number of topics we want it to find.

Some of the parameters:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus
* **id2word** is a mapping from word ids (integers) to words (strings).  It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (default value is 1/num_topics).
  * **Alpha** is the prior on the per-document topic distribution.
    * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
    * Low alpha: Every document has a mixture of very few topics.
  * **Eta** is the prior on the per-topic word distribution.
    * High eta: Each topic has a mixture of most words (the topics appear similar to each other).
    * Low eta: Each topic has a mixture of few words.
* **passes** is the number of training passes through the corpus.  For example, if the training corpus has 50,000 documents, the chunksize is 10,000, and passes is 2, then online training is done in 10 updates (updates 6 through 10 are the second pass over the same documents):
  * update  1: docs      0 - 9,999
  * update  2: docs 10,000 - 19,999
  * ...
  * update  9: docs 30,000 - 39,999
  * update 10: docs 40,000 - 49,999
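
The update count in the **passes** example can be reproduced with a short sketch. It assumes one model update per chunk, which is my simplification of how online training is scheduled; `update_schedule` is a hypothetical helper, not a gensim function.

```python
def update_schedule(num_docs, chunksize, passes):
    """List the (first_doc, last_doc) range handled by each update,
    assuming one update per chunk and `passes` sweeps over the corpus."""
    schedule = []
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            last = min(start + chunksize, num_docs) - 1
            schedule.append((start, last))
    return schedule

schedule = update_schedule(num_docs=50_000, chunksize=10_000, passes=2)
for number, (first, last) in enumerate(schedule, start=1):
    print("update {:2d}: docs {:6,d} - {:6,d}".format(number, first, last))
```

This prints 10 updates. Updates 1 through 5 cover the corpus once, and updates 6 through 10 cover it again, which is why update 9 lands on docs 30,000 - 39,999.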

## Train the Model
First, I'm going to list the fallback code, which uses the single-core <code>LdaModel</code>. We'll only use this if <code>LdaMulticore</code> throws an error:

In [93]:
# lda_model = gensim.models.LdaModel(bow_corpus, num_topics = 10, id2word = dictionary, passes=50)

We'll train the model using <code>LdaMulticore</code>:

In [94]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word = dictionary, passes = 2, workers = 2)

#----------------------------------------------------------------------------------
# For each topic, show the words occurring in that topic and its relative weight:
#----------------------------------------------------------------------------------
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {}\nWords: {}\n\n".format(idx,topic))


Topic: 0
Words: 0.130*"make" + 0.096*"american" + 0.060*"tri" + 0.060*"game" + 0.049*"best" + 0.049*"old" + 0.043*"power" + 0.037*"rule" + 0.037*"year" + 0.037*"electr"


Topic: 1
Words: 0.117*"cup" + 0.112*"team" + 0.100*"lead" + 0.090*"ryder" + 0.056*"final" + 0.041*"week" + 0.031*"busi" + 0.031*"best" + 0.027*"world" + 0.026*"win"


Topic: 2
Words: 0.135*"covid" + 0.118*"democrat" + 0.097*"climat" + 0.070*"plan" + 0.057*"virus" + 0.043*"time" + 0.043*"report" + 0.036*"world" + 0.022*"take" + 0.022*"work"


Topic: 3
Words: 0.152*"new" + 0.131*"review" + 0.071*"long" + 0.068*"big" + 0.067*"appl" + 0.066*"season" + 0.054*"electr" + 0.049*"vaccin" + 0.041*"say" + 0.023*"week"


Topic: 4
Words: 0.083*"war" + 0.078*"chang" + 0.072*"kill" + 0.072*"star" + 0.066*"space" + 0.065*"covid" + 0.058*"rule" + 0.058*"colleg" + 0.049*"report" + 0.031*"offer"


Topic: 5
Words: 0.163*"new" + 0.068*"inflat" + 0.059*"fight" + 0.050*"retir" + 0.050*"onlin" + 0.046*"biden" + 0.046*"life" + 0.041*"american

Check which topic our test document (at index <code>docno</code>) belongs to using the LDA Bag of Words model:

In [95]:
print(documents)  # just to remember what it was about
for index, score in sorted(lda_model[bow_corpus[docno]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\nTopic: {}".format(score,lda_model.print_topic(index,10)))

                                                 Title  index
0    when is it too late to buy the hottest gifts o...      0
1              robots vs. fatbergs in america's sewers      1
2    bitcoin's creator could be unmasked at florida...      2
3     toxic positivity is very real, and very annoying      3
4    daylight-saving time ends sunday. four ways to...      4
..                                                 ...    ...
452          vienna reels from a rare terrorist attack    452
453          coronavirus briefing: what happened today    453
454  she was losing fistfuls of hair. what was caus...    454
455  think you have ‘normal’ blood pressure? think ...    455
456  navigating my son’s a.d.h.d. made me realize i...    456

[457 rows x 2 columns]

Score: 0.10000000149011612
Topic: 0.130*"make" + 0.096*"american" + 0.060*"tri" + 0.060*"game" + 0.049*"best" + 0.049*"old" + 0.043*"power" + 0.037*"rule" + 0.037*"year" + 0.037*"electr"

Score: 0.10000000149011612
Topic: 0.117*"cu