# 04-Latent Dirichlet Allocation

LDA is a generative statistical model. It finds similarities between texts in the form of key words that belong to latent topics. The more prominent a topic's key words are in a text, the higher the probability that the text belongs to that topic. The topics can then be visualized or used to label data.
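
To make "generative" concrete, the sketch below samples one toy document the way LDA assumes documents arise: draw a topic mixture for the document, then for each word position draw a topic from that mixture and a word from that topic. The two topics, their word probabilities, and the helper names are all made up for illustration. Only Python's standard library is used, and this is not how scikit-learn fits the model.

```python
import random

# Hypothetical topic-word distributions (each sums to 1); not learned from data.
TOPICS = {
    "hockey": {"game": 0.4, "team": 0.4, "season": 0.2},
    "religion": {"god": 0.5, "christian": 0.3, "truth": 0.2},
}

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Sample a document following LDA's assumed generative process."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # 1. Draw the topic mixture of this document.
    mixture = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position, 3. then a word from that topic.
        topic = rng.choices(names, weights=mixture)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document(n_words=8)
print([round(p, 2) for p in mixture], words)
```

Fitting an LDA model goes in the opposite direction: it starts from the observed words and estimates the topic mixtures and topic-word distributions that could plausibly have produced them.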

In this exercise, you will learn to build and adjust an LDA model to generate potential topics, explore their corresponding key words, and visualize the model. You will be working on 3 categories of the 20 Newsgroups dataset available in Sklearn.

Run the code below to load the data into a dataframe and visualize an example.

In [14]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

data = fetch_20newsgroups(categories = ['talk.politics.mideast', 'rec.sport.hockey','soc.religion.christian']) 

mydataframe = pd.DataFrame({'text': data.data})

print(mydataframe.text.iloc[5])

mydataframe.head()

From: Patrick Walker <F1HH@UNB.CA>
Subject: Did you really expect Toronto to go anywhere?  REALLY!
Lines: 13
Organization: The University of New Brunswick

Detroit is a very disciplined team.  There's a lot of Europeans
in Detroit which would make the game fast, so Toronto would have
to slow the game down, which means drawing penalties, as a last
resort anyway.  Toronto will be a good team as soon as they get
more good players.  Toronto is just an average team, Detroit isn't
Ballard screwed Toronto when he was owner.  Everyone knows that.
and it's going to take time for Toronto to become a real force.
I expect Gilmour to be burnt out next year.  He can't pull the
whole team forever.

Patrick Walker
University of New Brunswick




Unnamed: 0,text
0,From: huot@cray.com (Tom Huot)\nSubject: Re: U...
1,From: golchowy@alchemy.chem.utoronto.ca (Geral...
2,From: trajan@cwis.unomaha.edu (Stephen McIntyr...
3,From: gt1091a@prism.gatech.EDU (gt1091a gt1091...
4,From: jrmst8+@pitt.edu (Joseph R Mcdonald)\nSu...


## Preprocessing 

You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [15]:
from nltk.corpus import stopwords 
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize 

# Assumes the NLTK tokenizer, stopword and WordNet resources are already downloaded (see nltk.download)
def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ') # Remove punctuation
    lowercased = text.lower() # Lower case
    tokenized = word_tokenize(lowercased) # Tokenize
    words_only = [word for word in tokenized if word.isalpha()] # Keep alphabetic tokens only (drops numbers)
    stop_words = set(stopwords.words('english')) # Make stopword set
    without_stopwords = [word for word in words_only if word not in stop_words] # Remove stop words
    lemma = WordNetLemmatizer() # Initiate lemmatizer
    lemmatized = [lemma.lemmatize(word) for word in without_stopwords] # Lemmatize
    return lemmatized

# Apply to all texts
mydataframe['clean_text'] = mydataframe.text.apply(clean)
mydataframe['clean_text'] = mydataframe['clean_text'].astype('str')

mydataframe.head()

Unnamed: 0,text,clean_text
0,From: huot@cray.com (Tom Huot)\nSubject: Re: U...,"['huot', 'cray', 'com', 'tom', 'huot', 'subjec..."
1,From: golchowy@alchemy.chem.utoronto.ca (Geral...,"['golchowy', 'alchemy', 'chem', 'utoronto', 'c..."
2,From: trajan@cwis.unomaha.edu (Stephen McIntyr...,"['trajan', 'cwis', 'unomaha', 'edu', 'stephen'..."
3,From: gt1091a@prism.gatech.EDU (gt1091a gt1091...,"['prism', 'gatech', 'edu', 'kaan', 'timucin', ..."
4,From: jrmst8+@pitt.edu (Joseph R Mcdonald)\nSu...,"['pitt', 'edu', 'joseph', 'r', 'mcdonald', 'su..."
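
Note that `astype('str')` stores each token list as its Python string representation (brackets, quotes and commas included), as the `clean_text` column above shows. This still works downstream because the default token pattern of `CountVectorizer`, `(?u)\b\w\w+\b`, only keeps runs of two or more word characters. The standard-library sketch below applies that pattern to both the stringified list and a space-joined version of the first tokens of row 4, to show that they yield the same tokens. It is an illustration, not a call into scikit-learn.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = ["pitt", "edu", "joseph", "r", "mcdonald"]  # first cleaned tokens of row 4

as_list_repr = str(tokens)     # "['pitt', 'edu', 'joseph', 'r', 'mcdonald']"
as_joined = " ".join(tokens)   # "pitt edu joseph r mcdonald"

from_repr = TOKEN_PATTERN.findall(as_list_repr)
from_joined = TOKEN_PATTERN.findall(as_joined)

# Both give the same tokens; the single-character token "r" is dropped in each case.
print(from_repr)
print(from_joined == from_repr)
```

Joining the tokens with `' '.join(...)` inside `clean` would therefore give a more readable column without changing what the default vectorizer counts.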


## Vectorize data

As with many Natural Language Processing tasks, the text needs to be vectorized before being used for LDA modelling. Initiate a default CountVectorizer and vectorize the clean texts into a standard Bag of Words.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data_vectorized = vectorizer.fit_transform(mydataframe.clean_text)
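
As a rough picture of what a bag of words is, the sketch below builds a tiny document-term count matrix by hand with the standard library. The two example documents and the helper name `bag_of_words` are invented for illustration. `CountVectorizer` does the same kind of counting, plus tokenization and sparse storage.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["team game team", "god people game"]  # toy, already-cleaned texts
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['game', 'god', 'people', 'team']
print(matrix)      # [[1, 0, 0, 2], [1, 1, 1, 0]]
```

Word order is lost in this representation: only the counts per document remain, which is all LDA uses.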

## Initiate Latent Dirichlet Allocation model

Once again, Sklearn provides a class for building an LDA model. Use it to initiate an LDA with the following parameters:
- 3 topics
- "online" learning method

In [17]:
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=3, learning_method='online')

## Fit data to LDA

Fit the LDA model on the vectorized data.

In [18]:
lda_vectors = lda_model.fit_transform(data_vectorized)
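
`fit_transform` returns the document-topic matrix: one row per document and one column per topic, with each row summing to 1 in current scikit-learn releases. The sketch below checks this on a four-document toy corpus. The corpus, the `random_state` value, and the variable names are assumptions made for the illustration, and the topics found on such a small sample are not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "team game season player",
    "game hockey team play",
    "god christian truth people",
    "people god say know",
]

toy_counts = CountVectorizer().fit_transform(toy_corpus)

# random_state is fixed only to make the run repeatable.
toy_lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)        # (number of documents, number of topics)
print(doc_topics.sum(axis=1))  # each row should be (numerically) 1
```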



## Visualize potential topics

The function to print the potential topics is already made for you. You just have to pass the correct arguments ;)

In [19]:
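def print_topics(model, vectorizer):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so this slice takes the 10 largest weights, largest first
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])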
        

print_topics(lda_model, vectorizer)

Topic 0:
[('edu', 1992.910506523671), ('one', 1637.4968820248218), ('god', 1565.2754405813828), ('people', 1542.2281224719309), ('would', 1458.720296420081), ('subject', 1196.2147944533556), ('line', 1115.8721785257574), ('organization', 1028.580899415441), ('say', 1018.9380248293903), ('know', 974.0629380455912)]
Topic 1:
[('pt', 336.5275572819271), ('period', 240.44295177802962), ('la', 237.85057952529547), ('chi', 138.6387099743006), ('power', 135.10214501897974), ('bos', 131.06898866754196), ('pp', 127.71602789504996), ('play', 124.57577619658032), ('det', 121.02434333100335), ('cal', 117.18261531472565)]
Topic 2:
[('edu', 1256.2882577676512), ('armenian', 1041.4234669872094), ('team', 940.7142499532063), ('game', 925.7449880469636), ('line', 827.0270734819227), ('subject', 724.102914057838), ('ca', 684.7661290935573), ('organization', 659.445835182035), ('turkish', 654.6357019952504), ('hockey', 633.1478382064817)]
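
The numbers printed next to each word are the model's unnormalized topic-word weights (the scikit-learn documentation describes `components_` as pseudocounts), so they are easier to compare across topics once each topic is rescaled to sum to 1. The slice `[:-10 - 1:-1]` in `print_topics` walks the ascending sort order backwards to take the ten largest entries. The standard-library sketch below reproduces both steps on a made-up weight vector; the words, weights, and function name are hypothetical.

```python
def top_words(words, weights, n_top):
    """Return the n_top (word, probability) pairs with the largest weights."""
    total = sum(weights)
    # Ascending order of indices by weight, like numpy's argsort.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice shape as in print_topics: the last n_top indices, reversed.
    largest_first = ascending[:-n_top - 1:-1]
    return [(words[i], weights[i] / total) for i in largest_first]

words = ["edu", "game", "god", "team"]   # hypothetical vocabulary
weights = [10.0, 40.0, 20.0, 30.0]       # hypothetical topic weights
print(top_words(words, weights, n_top=2))  # [('game', 0.4), ('team', 0.3)]
```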


## Train LDA with different CountVectorizer parameters

CountVectorizer parameter `max_df` is used to ignore certain terms in the corpus. `max_df` specifies that a word with a document frequency higher than the given threshold should be ignored (considered corpus-specific stop words).
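
The standard-library sketch below mimics the effect of a fractional `max_df` on a toy corpus: it counts in how many documents each term appears and drops the terms whose document frequency exceeds `max_df` times the number of documents. The function name `filter_by_max_df` and the four documents are invented for illustration. It approximates the behaviour described above and is not scikit-learn's code.

```python
from collections import Counter

def filter_by_max_df(tokenized_docs, max_df):
    """Keep terms appearing in at most max_df (a fraction) of the documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_df * len(tokenized_docs)
    return sorted(term for term, df in doc_freq.items() if df <= limit)

toy_docs = [
    ["edu", "game", "team"],
    ["edu", "god", "people"],
    ["edu", "israel", "arab"],
    ["game", "hockey"],
]
# "edu" is in 3 of 4 documents (0.75), so max_df=0.5 removes it.
print(filter_by_max_df(toy_docs, max_df=0.5))
```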

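When the vectorizer feeds an LDA model, tuning `max_df` can improve the interpretability of the topics, because very frequent terms such as 'edu' or 'subject' (seen in the topics above) stop dominating them. Adjust the parameter until you are satisfied with the interpretability of your LDA.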

You could also try vectorizing as bigrams...
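
For bigrams, `CountVectorizer` accepts an `ngram_range` argument: for example, `ngram_range=(2, 2)` counts pairs of adjacent tokens only, and `(1, 2)` counts both single words and pairs. The standard-library sketch below shows what those pairs look like for one made-up token list; the helper `bigrams` is hypothetical.

```python
def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = ["best", "player", "injured", "last", "season"]  # toy cleaned tokens
print(bigrams(tokens))
# ['best player', 'player injured', 'injured last', 'last season']
```

Bigrams enlarge the vocabulary considerably, so they are usually worth combining with a frequency filter such as `max_df`.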

In [20]:
vectorizer2 = CountVectorizer(max_df=0.4)

data_vectorized2 = vectorizer2.fit_transform(mydataframe.clean_text)

lda_model2 = LatentDirichletAllocation(n_components=3, max_iter=10, learning_method='online')

lda_vectors2 = lda_model2.fit_transform(data_vectorized2)

print_topics(lda_model2, vectorizer2)

Topic 0:
[('game', 967.1647300775876), ('team', 957.9184600493079), ('ca', 689.9904867589947), ('hockey', 642.0030272003964), ('play', 529.6157479354728), ('player', 512.7558061391252), ('year', 470.31070007728573), ('go', 418.4053986323868), ('nhl', 412.7338974083475), ('season', 404.72377264859904)]
Topic 1:
[('com', 340.3962527690914), ('arab', 285.4845518068633), ('israel', 272.24699305743985), ('jew', 239.58780530039712), ('right', 177.49707662811264), ('absolute', 158.94821964874365), ('truth', 155.73720012523404), ('adam', 147.16156100362807), ('jewish', 143.87321053092344), ('american', 142.7291549882537)]
Topic 2:
[('people', 1616.1247695896102), ('god', 1561.308756111878), ('armenian', 1357.5100184554444), ('say', 967.7462622568142), ('know', 897.2671858624383), ('christian', 800.6701839876438), ('time', 790.3743459991564), ('said', 786.3220605784089), ('think', 782.8703309777752), ('like', 740.2791442551894)]


## Predict topic of new text

Once you are happy with the interpretability of your LDA model, you can use it to predict the topic of a new text.

First, use `vectorizer2` to vectorize the example. Then pass the result to `lda_model2.transform` to obtain its LDA vector (the model is not refitted). The following code will print the results.

In [21]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

example_vectorized = vectorizer2.transform(example)

lda_vectors = lda_model2.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])
print("topic 2 :", lda_vectors[0][2])

topic 0 : 0.9418028748604899
topic 1 : 0.02834031943237327
topic 2 : 0.029856805707136846


The LDA model returns the estimated topic distribution of the example: the share of each topic in the text, with the three values summing to 1. Here, topic 0 clearly dominates. Feel free to try different examples.
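
To turn the topic distribution into a single label, one option is to take the topic with the largest share. The sketch below does this with the values printed above. The topic names in `topic_labels` are an assumption based on reading the key words of `lda_model2`, and they will differ between runs because topic numbering is arbitrary.

```python
# Topic distribution printed above for the example sentence.
distribution = [0.9418028748604899, 0.02834031943237327, 0.029856805707136846]

# Hypothetical labels, assigned by hand after inspecting each topic's key words.
topic_labels = {0: "hockey", 1: "mideast politics", 2: "religion"}

best_topic = max(range(len(distribution)), key=lambda i: distribution[i])
print(best_topic, topic_labels[best_topic])  # 0 hockey
print(round(sum(distribution), 6))           # 1.0
```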

## Graph LDA

Using pyLDAvis, prepare a graph with the following arguments:
- lda_model2
- data_vectorized2
- vectorizer2
- mds='tsne'

In [22]:
# In pyLDAvis 3.4 and later, this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()

graph = pyLDAvis.sklearn.prepare(lda_model2, data_vectorized2, vectorizer2, mds='tsne')

graph 

