# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [18]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

In [60]:
#rajouter stopwords 
#enlever /n
import string 
def preprocessing(reviews):
    for i in string.punctuation:
        reviews = reviews.replace(i, '')
    reviews = reviews.replace('\n', '')
    reviews = reviews.replace('\t', '') 
    reviews = reviews.lower()
    stop_words = set(stopwords.words('english')) 
    word_tokens = word_tokenize(reviews) 
    ' '.join(w for w in word_tokens if not w in stop_words)
    return reviews 
data['clean_text']= data['text'].apply(preprocessing)

In [61]:
data['clean_text'][0]

'from gldcunixbcccolumbiaedu gary l daresubject stan fischler 44summary from the devils pregame show prior to hosting the penguinsnntppostinghost cunixbcccolumbiaedureplyto gldcunixbcccolumbiaedu gary l dareorganization phds in the halllines 32at the lester patrick awards lunch bill torrey mentioned that one of hisoptions next season is to be president of the miami team with bob clarkeworking for him  at the same dinner clarke said that his worst mistakein philadelphia was letting mike keenan go  in retrospect almost allplayers came realize that keenan knew what it took to win  rumours arenow circulating that keenan will be back with the flyersnick polano is sick of being a scapegoat for the schedule made for thered wings after all bryan murray approved itgerry meehan and john muckler are worried over the sabres prospectsassistant don lever says that the sabres have to get their share nowbecause a quebec dynasty is emerging the mighty ducks have declared that they will not throw money 

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [34]:
#Latent Dirichlet Allocation is an unsupervised learning algorithm for text data.
#It is based on co-occurences of words in texts and is used to find topics from a corpus of document

In [49]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

In [63]:
vectorizer = TfidfVectorizer(max_df=0.6).fit(data['clean_text'])
data_vectorized = vectorizer.transform(data['clean_text'])
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)

In [64]:
def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])
        
print_topics(lda_model, vectorizer)

Topic 0:




[('drbombaynetlinkctscom', 0.7887351018055007), ('qtr', 0.7794476904468729), ('mailnews', 0.7620487537786635), ('dohertylukacgladcsxmailer', 0.7620487356978213), ('noneorganization', 0.7620487320649891), ('205lines', 0.7620487243068226), ('ukreturnpath', 0.7620487141809262), ('dohertylsubject', 0.7620487040155777), ('dohertyldcsglaacuk', 0.7620486739311798), ('messagedistribution', 0.7535839218494351)]
Topic 1:
[('you', 43.39537899977944), ('not', 38.15553612919066), ('are', 36.385524419015724), ('was', 35.32779830647467), ('he', 34.846250073033644), ('as', 32.95152163584898), ('we', 29.58358941744508), ('but', 29.033375815825885), ('with', 28.402234680086895), ('they', 27.449252959347866)]


## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [59]:
def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])
        
print_topics(lda_model, vectorizer)


Topic 0:
[('testing', 1.431972769852258), ('utk', 1.3062130701032189), ('bucknell', 1.1060973972866022), ('khettry', 1.0462392614992946), ('23064rfl', 1.0462392614694938), ('r1w2', 1.0462392614685918), ('tennessee', 1.0462392612829188), ('netlink', 1.027430803904121), ('pub', 0.9704246806328174), ('sturm', 0.8869797659694745)]
Topic 1:
[('you', 47.30171263376801), ('he', 38.33155730212746), ('are', 37.52461729649852), ('was', 35.358862170702466), ('as', 34.50316900446088), ('god', 31.8913265140161), ('we', 31.600802305541436), ('they', 29.745023193233717), ('with', 29.337701360028042), ('if', 27.364375838053235)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [58]:
example = ["rice var congratulations save upenn"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.19624637016413138
topic 1 : 0.8037536298358686


In [66]:
stop_words = set(stopwords.words('english')) 
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r