# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails with the **LDA** algorithm (Unsupervised Learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
import re
from sklearn.feature_extraction.text import CountVectorizer

# Basic cleaning: remove links, digits, punctuation and other special characters
def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters only
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text

# Apply to the "text" column
data["clean_text"] = data["text"].apply(clean_text)

# Check the result
data[["text", "clean_text"]].head()


Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,from gld cunixb cc columbia edu gary l dare su...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlep vela acs oakland edu cardinal xi...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,from miner kuhub cc ukans edu subject re ancie...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlep vela acs oakland edu cardinal xi...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,from vzhivov superior carleton ca vladimir zhi...
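
The top words printed later in this notebook include tokens such as `edu`, `com` and `ca`, which most likely come from email addresses and header lines. Below is a minimal sketch of an extra cleaning step, assuming it is acceptable to drop addresses and a few header lines before calling `clean_text`. The helper `strip_email_artifacts` and its header list are hypothetical and not part of the original challenge.

```python
import re

# Hypothetical helper (not part of the original notebook): removes a few
# common header lines and any remaining email addresses from a raw email.
HEADER_RE = re.compile(
    r"^(from|organization|lines|reply-to|nntp-posting-host):.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)
EMAIL_RE = re.compile(r"\S+@\S+")

def strip_email_artifacts(text):
    text = HEADER_RE.sub(" ", text)  # drop whole header lines
    text = EMAIL_RE.sub(" ", text)   # drop email addresses left in the body
    return text

# Made-up sample email, for illustration only
sample = (
    "From: someone@example.edu (Some One)\n"
    "Subject: Re: hockey\n\n"
    "Great game last night, mail me at fan@example.com"
)
print(strip_email_artifacts(sample).split())
```

In the notebook this would run before the generic cleaning, for example `data["text"].apply(strip_email_artifacts).apply(clean_text)`. The `Subject` line is kept on purpose, since it often carries topical words.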


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

# Vectorization (bag of words)
vectorizer = CountVectorizer(
    max_df=0.90,     # ignore words that appear in more than 90% of the documents
    min_df=2,        # ignore words that appear in fewer than 2 documents
    stop_words='english'
)
X = vectorizer.fit_transform(data["clean_text"])

# Train the LDA model
lda = LatentDirichletAllocation(
    n_components=5,   # number of topics to discover
    random_state=42,
    learning_method='batch'
)
lda.fit(X)

print("Model trained successfully.")


Model trained successfully.
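
Once fitted, the model holds one row of word weights per topic in `components_`, and `transform` (or `fit_transform`) returns one topic mixture per document. Here is a small sketch of those two shapes on a made-up toy corpus. The `toy_*` names and the documents are assumptions for illustration, not the email dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, made up for illustration (not the email dataset)
toy_docs = [
    "hockey team game season",
    "team game players hockey",
    "god church faith bible",
    "bible faith god belief",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_X)

print(toy_lda.components_.shape)   # (n_topics, n_vocabulary_words)
print(toy_doc_topics.shape)        # (n_documents, n_topics)
print(toy_doc_topics.sum(axis=1))  # each document's mixture sums to 1
```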


## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with the potential topics.

In [7]:
def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names_out()[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])

❓ **Question** ❓ Print the topics extracted by your LDA.

In [9]:
print_topics(lda, vectorizer)



Topic 0:
[('edu', 282.6097838797633), ('god', 278.53439356386014), ('does', 161.72631020874041), ('article', 157.43527607097502), ('question', 151.0719777020838), ('com', 137.36120896462273), ('writes', 136.09820770223698), ('ca', 100.67314867312408), ('existence', 99.48679479808875), ('reason', 96.89908570885126)]
Topic 1:
[('edu', 731.3257306832522), ('hockey', 603.4401904005871), ('team', 562.8976382506897), ('game', 495.0184862232725), ('ca', 357.5390806241832), ('season', 357.200425951211), ('games', 301.5573107063094), ('university', 275.3695619923858), ('nhl', 272.79302250403424), ('year', 271.99932568256577)]
Topic 2:
[('edu', 258.70913166245043), ('ca', 250.638530598734), ('vs', 221.64272138264133), ('pts', 217.51792115363264), ('period', 211.84022464482373), ('play', 210.69744595936652), ('la', 205.08528951025323), ('nhl', 146.60477965629113), ('pittsburgh', 145.70270523554797), ('team', 145.49987598026283)]
Topic 3:
[('god', 957.8718140048742), ('edu', 534.5490171403095), ('

In [10]:
def print_topics_clean(model, vectorizer, n_top_words=10):
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {idx}: {', '.join(top_words)}")

print_topics_clean(lda, vectorizer)


Topic 0: edu, god, does, article, question, com, writes, ca, existence, reason
Topic 1: edu, hockey, team, game, ca, season, games, university, nhl, year
Topic 2: edu, ca, vs, pts, period, play, la, nhl, pittsburgh, team
Topic 3: god, edu, people, jesus, think, church, christian, don, know, hell
Topic 4: edu, god, jesus, truth, people, christians, think, know, bible, believe
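
The numbers printed next to each word by `print_topics` are unnormalized weights (pseudo-counts), not probabilities. Dividing each row of `components_` by its sum turns a topic into a distribution over words. The sketch below shows the arithmetic on a tiny hypothetical matrix; `toy_components` stands in for `lda.components_` and its values are made up.

```python
import numpy as np

# Hypothetical 2-topic, 3-word weight matrix standing in for lda.components_
toy_components = np.array([
    [6.0, 3.0, 1.0],
    [1.0, 1.0, 8.0],
])

# Divide each row by its sum so every topic becomes a distribution over words
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs)
```

Applied to the fitted model, the same line would read `lda.components_ / lda.components_.sum(axis=1, keepdims=True)`.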


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [None]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [11]:
# Example text
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

# 1. Clean the text (same function as above)
example_clean = [clean_text(example[0])]

# 2. Vectorize (reuse the already-fitted vectorizer)
example_vec = vectorizer.transform(example_clean)

# 3. Predict the topic mixture
topic_distribution = lda.transform(example_vec)

# 4. Display the result
print("Topic distribution:", topic_distribution)
print("Dominant topic:", topic_distribution.argmax())


Topic distribution: [[0.0200584  0.91958657 0.02019871 0.02011948 0.02003684]]
Dominant topic: 1
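
The prediction is a mixture over all five topics rather than a single label, so it can help to rank the topics and show their shares. The sketch below uses the mixture printed above. The names in `topic_labels` are hypothetical, guessed from each topic's top words, and are not something the model produces.

```python
# Topic mixture copied from the output above
mixture = [0.0200584, 0.91958657, 0.02019871, 0.02011948, 0.02003684]

# Hypothetical labels, guessed by reading each topic's top words
topic_labels = {
    0: "religion (debate)",
    1: "hockey",
    2: "hockey (scores)",
    3: "religion",
    4: "religion (scripture)",
}

# Sort topics from the largest share to the smallest
ranked = sorted(enumerate(mixture), key=lambda pair: pair[1], reverse=True)
for topic_idx, weight in ranked:
    print(f"Topic {topic_idx} ({topic_labels[topic_idx]}): {weight:.1%}")
```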


In [13]:
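# Compare training-set perplexity for several topic counts (lower is better).
# A separate variable keeps the 5-topic `lda` fitted above intact.
for k in [2, 3, 4, 6]:
    lda_k = LatentDirichletAllocation(n_components=k, random_state=42)
    lda_k.fit(X)
    print(k, lda_k.perplexity(X))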


2 3083.975739492571
3 3022.4119299032996
4 2941.776849982074
6 2926.5842600733204
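
Lower perplexity is better, but these scores are computed on the same matrix `X` the models were trained on, and training-set perplexity usually keeps falling as topics are added. Looking at the relative change between consecutive values of `k` gives a rough sense of diminishing returns. This is only a sketch based on the four values printed above; scoring a held-out split would be a more reliable comparison.

```python
# Perplexity values copied from the output above (computed on the training matrix X)
perplexity_by_k = {
    2: 3083.975739492571,
    3: 3022.4119299032996,
    4: 2941.776849982074,
    6: 2926.5842600733204,
}

# Relative change between consecutive topic counts
ks = sorted(perplexity_by_k)
changes = {}
for prev_k, k in zip(ks, ks[1:]):
    changes[k] = (perplexity_by_k[k] - perplexity_by_k[prev_k]) / perplexity_by_k[prev_k]
    print(f"k={prev_k} -> k={k}: {changes[k]:+.2%}")

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
print("Lowest perplexity at k =", best_k)
```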


🏁 Congratulations! You know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!