# 03 - Latent Dirichlet Allocation

LDA is a generative statistical model. It finds similarities between texts in the form of key words that belong to latent (hidden) topics. The more of a topic's key words a text contains, the higher the probability that the text belongs to that topic. The topics can then be visualized or used to label data.
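
For readers who want the underlying model, the usual LDA generative story can be written as below. The symbols ($\alpha$, $\beta$, $\theta_d$, $\varphi_k$, $z_{d,n}$, $w_{d,n}$) are conventional notation introduced here for illustration; they do not appear in the code of this exercise.

$$
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of text } d \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of text } d \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}}) && \text{the } n\text{-th word of text } d
\end{aligned}
$$

Fitting the model amounts to inferring the topic mixtures and word distributions that best explain the observed words.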

In this exercise, you will learn to build and adjust an LDA model to generate potential topics, explore their corresponding key words, and visualize the model.

Run the code below to load the data into a dataframe. It is a collection of emails.

In [2]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


## Preprocessing 

You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text". You may want to add some extra steps to the preprocessing such as removing email addresses.
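
One possible extra step, sketched below, is to strip email addresses with a regular expression before the rest of the cleaning. The helper name `remove_emails`, the pattern, and the sample text are illustrative assumptions, not part of the original exercise; the pattern is a pragmatic approximation rather than a complete address matcher.

```python
import re

# Hypothetical pattern: a rough approximation of an email address,
# not a full standards-compliant matcher.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def remove_emails(text):
    """Replace anything that looks like an email address with a space."""
    return EMAIL_PATTERN.sub(" ", text)


# Invented sample text, shaped like the email headers in the dataset
sample = "From: someone@example.edu (Some One)\nSubject: Re: hello"
print(remove_emails(sample))
```

Such a helper would need to run before punctuation is removed: once `@` and `.` are replaced by spaces, the pieces of the address survive as separate tokens.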

In [3]:
from nltk.corpus import stopwords 
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize 

def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')  # Remove punctuation
    lowercased = text.lower()  # Lowercase
    tokenized = word_tokenize(lowercased)  # Tokenize
    words_only = [word for word in tokenized if word.isalpha()]  # Keep alphabetic tokens only (drops numbers)
    stop_words = set(stopwords.words('english'))  # Build the stop word set
    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stop words
    lemma = WordNetLemmatizer()  # Instantiate the lemmatizer
    lemmatized = [lemma.lemmatize(word) for word in without_stopwords]  # Lemmatize
    return lemmatized

# Apply to all texts
data['clean_text'] = data.text.apply(clean)
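# Store each token list as a string; CountVectorizer's default tokenizer ignores the brackets, quotes and commas
data['clean_text'] = data['clean_text'].astype('str')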

data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,"['gld', 'cunixb', 'cc', 'columbia', 'edu', 'ga..."
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,"['atterlep', 'vela', 'ac', 'oakland', 'edu', '..."
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,"['miner', 'kuhub', 'cc', 'ukans', 'edu', 'subj..."
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,"['atterlep', 'vela', 'ac', 'oakland', 'edu', '..."
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,"['vzhivov', 'superior', 'carleton', 'ca', 'vla..."


## Latent Dirichlet Allocation model

Train an LDA model! As with many Natural Language Processing tasks, the text needs to be vectorized before it can be used for LDA modelling. You need to:

- Vectorize the data with a default `CountVectorizer`
- Instantiate a `LatentDirichletAllocation` with parameter `n_components=2`
- Fit the `LatentDirichletAllocation` model on the vectorized data

[This](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) should help.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data_vectorized = vectorizer.fit_transform(data['clean_text'])

lda_model = LatentDirichletAllocation(n_components=2)

lda_vectors = lda_model.fit_transform(data_vectorized)
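
To see what `fit_transform` returns, here is a small sketch on a toy corpus. The four sentences and the `toy_` names are invented for illustration, and `random_state=0` is an arbitrary choice that makes the run reproducible (LDA starts from a random initialisation, so results can otherwise differ between runs).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "team game hockey player play",
    "god jesus christian people church",
    "hockey team player season game",
    "christian god people belief jesus",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Fixing random_state makes reruns give the same topics
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # one row per text, one column per topic
print(toy_doc_topics.sum(axis=1))  # each row is a topic mixture summing to 1
```

The same reading applies to `lda_vectors` above: row `i` holds the topic proportions of text `i`.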

## Visualize potential topics

The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments ;)

In [5]:
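def print_topics(model, vectorizer):
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words of the topic, largest first
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])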
        

print_topics(lda_model, vectorizer)

Topic 0:
[('god', 1523.0176918623063), ('edu', 1048.8430041793895), ('one', 829.9846066625206), ('christian', 820.9240187991802), ('would', 793.5325684607122), ('people', 680.2664900676945), ('subject', 649.6877828765759), ('jesus', 626.2665493001997), ('line', 613.7684552144001), ('organization', 538.5312084298521)]
Topic 1:
[('edu', 1079.1569958205596), ('team', 958.4282085467993), ('game', 950.6928750288483), ('line', 743.231544785547), ('ca', 702.6666542559303), ('subject', 657.3122171233715), ('hockey', 649.467102446423), ('organization', 633.468791570096), ('player', 529.3702429712984), ('play', 522.1333755512087)]
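
The weights in `components_` are unnormalized pseudo-counts, which is why they are much larger than 1. Dividing each topic row by its sum turns them into word probabilities that are easier to compare across topics. The sketch below uses a small hypothetical array in place of `lda_model.components_` (a few values are rounded from the output above, the small ones are invented) and a hypothetical four-word vocabulary.

```python
import numpy as np

# Hypothetical stand-in for lda_model.components_: 2 topics x 4 words
words = ["god", "christian", "team", "game"]
components = np.array([
    [1523.0, 820.9, 5.0, 3.1],
    [2.0, 1.5, 958.4, 950.7],
])

# Normalize each row so that the word weights of a topic sum to 1
topic_word_probs = components / components.sum(axis=1, keepdims=True)

for idx, probs in enumerate(topic_word_probs):
    best = probs.argmax()  # index of the most probable word in this topic
    print("Topic %d: top word %s (p=%.3f)" % (idx, words[best], probs[best]))
```

With the real model, `components` would be `lda_model.components_` and `words` would come from the vectorizer's feature names.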


## Predict topic of new text

You can now use your LDA model to predict the topic of a new text.

First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example. The following code will print the results.

In [6]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]
#example = ["Everyone is free to have their own beliefs"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.04512341837487121
topic 1 : 0.9548765816251288


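The LDA model returns the estimated topic mixture of our example: one proportion per topic, summing to 1. Here, about 95% of the example is attributed to topic 1, whose key words are about hockey. Feel free to try different examples.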
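
To turn such proportions into a single label per text, one common approach is to take the topic with the largest share. The sketch below uses a hypothetical array shaped like the output of `lda_model.transform`; the first row is rounded from the result above and the other rows are invented.

```python
import numpy as np

# Hypothetical document-topic proportions: one row per text, one column per topic
doc_topics = np.array([
    [0.045, 0.955],
    [0.810, 0.190],
    [0.300, 0.700],
])

dominant_topic = doc_topics.argmax(axis=1)  # index of the largest share per text
dominant_share = doc_topics.max(axis=1)     # the share itself

for i, (topic, share) in enumerate(zip(dominant_topic, dominant_share)):
    print("text %d -> topic %d (%.3f)" % (i, topic, share))
```

A text with a share close to 0.5 is only weakly tied to either topic, so it may be worth keeping the share alongside the label.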