# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [60]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()


Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [61]:
data.shape


(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [62]:
# YOUR CODE HERE

import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

def preprocessing(sentence):
    # Strip leading and trailing whitespace
    sentence = sentence.strip()

    # Convert to lowercase
    sentence = sentence.lower()

    #Remove numbers
    sentence = ''.join([i for i in sentence if not i.isdigit()])

    #Remove punctuation
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    #Tokenize
    sentence = word_tokenize(sentence)

    #Remove stopwords
    stop_words = set(stopwords.words('english'))
    sentence = [word for word in sentence if word not in stop_words]

    #Lemmatize
    lemmatizer = WordNetLemmatizer()
    sentence = [lemmatizer.lemmatize(word) for word in sentence]
    sentence = ' '.join(sentence)


    return sentence

data['clean_text'] = data['text'].apply(preprocessing)
# Drop the first token of each document (the sender's address left over from the "From:" header)
data['clean_text'] = data['clean_text'].str.split(' ').str[1:].str.join(' ')

data.head()


Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gary l dare subject stan fischler summary devi...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,cardinal ximenez subject arrogance christian o...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,subject ancient book organization university k...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,cardinal ximenez subject atheist hell organiza...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vladimir zhivov subject flame truly brutal los...
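To see why the first token is dropped after cleaning, it helps to trace one header through the same steps. The sketch below is a simplified, standard-library-only version of the pipeline: the email text and the two-word stop list are made up for illustration, and lemmatization is left out, so it only approximates what `preprocessing` does with NLTK.

```python
import string

# Hypothetical email header, not taken from the dataset
raw = "From: someone@example.edu\nSubject: Re: Hockey playoffs 1993"

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
toy_stop_words = {"from", "re"}

text = raw.strip().lower()
text = "".join(ch for ch in text if not ch.isdigit())              # remove digits
text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
tokens = [tok for tok in text.split() if tok not in toy_stop_words]

# Removing punctuation fuses the address into a single meaningless token
print(tokens[0])

# Dropping the first token removes that fused address
cleaned = " ".join(tokens[1:])
print(cleaned)
```

The fused address is unique to each sender, so keeping it would only add noise to the vocabulary.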


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [63]:
# YOUR CODE HERE
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['clean_text'])
X = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print(X.head(3))
print("* "*20)

# Train the LDA model
n_components = 2
lda = LatentDirichletAllocation(n_components=n_components, max_iter=100)
lda.fit(X)


    aa  \
0  0.0   
1  0.0   
2  0.0   

   aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg  \
0                                                0.0                                 
1                                                0.0                                 
2                                                0.0                                 

   aacc  aadams  aafreenetcarletonca  aargh  aaron  aaronbinahccbrandeisedu  \
0   0.0     0.0              0.00000    0.0    0.0                      0.0   
1   0.0     0.0              0.08765    0.0    0.0                      0.0   
2   0.0     0.0              0.00000    0.0    0.0                      0.0   

   aassists  aatchoo  ...  zombo  zone  zoo  zoomed  zorasterism  zubov  \
0       0.0      0.0  ...    0.0   0.0  0.0     0.0          0.0    0.0   
1       0.0      0.0  ...    0.0   0.0  0.0     0.0          0.0    0.0   
2       0.0      0.0  ...    0.0   0.0  0.0     0.0          0.0    0.0 
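The cell above feeds TF-IDF weights to the model, and the topics it finds are sensible. LDA is nonetheless defined as a generative model over word counts, and scikit-learn's own topic-extraction example pairs it with `CountVectorizer`. Below is a minimal sketch of that variant on a made-up four-document corpus; the corpus, the variable names, and `random_state=0` (added only for reproducibility) are assumptions, not part of the challenge.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the email dataset
docs = [
    "god church bible jesus faith",
    "church bible christian god believe",
    "hockey team game player playoff",
    "team game nhl hockey season",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)  # sparse matrix, no dense conversion needed

# Fixing random_state makes repeated runs give the same topics
lda_counts = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_counts.fit_transform(counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a topic mixture
```

Passing the sparse matrix directly also avoids building a dense DataFrame with one column per vocabulary word.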

## (3) Visualize potential topics

🎁 We have written a function for you that prints the words associated with the potential topics.

In [64]:
def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names_out()[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])
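The slice `topic.argsort()[:-10 - 1:-1]` in `print_topics` is compact but easy to misread. It walks the ascending sort order backwards from the end and stops after ten items, which yields the indices of the ten largest weights in descending order. A small sketch with made-up weights and a top 3 instead of a top 10:

```python
import numpy as np

# Made-up weights for a five-word vocabulary
weights = np.array([0.5, 3.0, 1.2, 7.1, 0.1])
n_top = 3

# argsort() returns indices from smallest to largest weight;
# [:-n_top - 1:-1] reads that order backwards and keeps the last n_top entries
top = weights.argsort()[:-n_top - 1:-1]

print(top)           # indices of the largest weights, largest first
print(weights[top])  # the corresponding weights
```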


❓ **Question** ❓ Print the topics extracted by your LDA.

In [65]:
# YOUR CODE HERE
print_topics(lda, vectorizer)


Topic 0:
[('god', 35.83244031444331), ('christian', 22.515181868001036), ('jesus', 19.132054714187067), ('people', 17.596444414773664), ('would', 16.762166976788347), ('church', 16.574066443433946), ('one', 16.47974288345757), ('bible', 13.827861695007067), ('believe', 13.703478312342906), ('say', 13.139484110133772)]
Topic 1:
[('game', 26.940861494896726), ('team', 25.649498410089638), ('hockey', 18.67018280111222), ('player', 18.319798251840233), ('go', 15.522639576074832), ('play', 14.593812262122475), ('nhl', 13.573322092287574), ('year', 13.39512325067377), ('playoff', 13.255668125765654), ('university', 12.956343105525674)]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [66]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]


In [67]:
# YOUR CODE HERE

# Vectorize the example with the already-fitted vectorizer
example = vectorizer.transform(example)

# Compute the document-topic mixture of the example, then the index of its dominant topic
print(lda.transform(example))
print(lda.transform(example).argmax(axis=1))


[[0.14782687 0.85217313]]
[1]
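The output above is a mixture over topic indices, not over named topics, because LDA never assigns labels itself. A minimal sketch of turning that mixture into something readable follows; the names "religion" and "hockey" are hypothetical labels chosen by reading the top words printed in section (3).

```python
# Mixture printed above for the example sentence
mixture = [0.14782687, 0.85217313]

# Hypothetical human-chosen labels, based on each topic's top words
topic_labels = {0: "religion", 1: "hockey"}

# Index of the largest weight, equivalent to argmax on a single row
dominant = max(range(len(mixture)), key=lambda i: mixture[i])

for idx, weight in enumerate(mixture):
    print(f"{topic_labels[idx]}: {weight:.1%}")
print("Dominant topic:", topic_labels[dominant])
```

Topic indices can change between runs when no `random_state` is set, so labels like these have to be re-checked after every refit.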




In [68]:
# Build a table of topic-word weights (one row per topic, one column per vocabulary word)
topic_words = pd.DataFrame(
    lda.components_,
    columns=vectorizer.get_feature_names_out()
)
topic_words


Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aassists,aatchoo,...,zombo,zone,zoo,zoomed,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.878691,0.505132,0.533998,0.500094,1.733956,0.504016,1.795785,0.774057,0.501148,0.640062,...,0.502259,0.504272,0.504058,0.503915,0.580897,0.504485,0.501248,0.503013,0.838437,0.503404
1,0.525538,0.596298,0.500611,0.506393,0.577039,1.276513,0.504583,0.502906,0.576773,0.508866,...,0.764234,2.587097,0.660956,0.761463,0.502202,1.768031,0.586746,0.656323,0.504918,0.680752


In [69]:
# Print the max value of row 0 and its corresponding column name
print(topic_words.iloc[0].max())
print(topic_words.iloc[0].idxmax())

# Print the max value of row 1 and its corresponding column name
print(topic_words.iloc[1].max())
print(topic_words.iloc[1].idxmax())


35.83244031444331
god
26.940861494896726
game
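The values in `components_` are unnormalized weights (scikit-learn describes them as pseudocounts), which is why they can exceed 1. Dividing each row by its sum turns it into a probability distribution over the vocabulary, which is often easier to compare across topics. A sketch on a made-up 2 x 3 matrix, standing in for the real `lda.components_`:

```python
import numpy as np

# Made-up topic-word weights: 2 topics, 3 vocabulary words
components = np.array([
    [35.8, 0.5, 13.7],
    [0.5, 26.9, 12.6],
])

# Normalize each row so it sums to 1
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))
```

Normalizing does not change which word ranks highest in each topic; it only rescales the weights.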


🏁 Congratulations! You now know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!