# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [18]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

In [75]:
# TODO: add more stop words
# Remove "\n"
import string

def preprocessing(reviews):
    # Strip every punctuation character
    for i in string.punctuation:
        reviews = reviews.replace(i, '')
    # Drop newlines and tabs. Note: replacing them with '' (rather than ' ')
    # joins the words on either side of a line break.
    reviews = reviews.replace('\n', '')
    reviews = reviews.replace('\t', '')
    reviews = reviews.lower()
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(reviews)
    # Rejoin the tokens that are not stop words
    return ' '.join(w for w in word_tokens if w not in stop_words)

# Returns the cleaned text: the lowercased tokens with stop words removed
data['clean_text'] = data['text'].apply(preprocessing)

In [76]:
data['clean_text'][0]

'gldcunixbcccolumbiaedu gary l daresubject stan fischler 44summary devils pregame show prior hosting penguinsnntppostinghost cunixbcccolumbiaedureplyto gldcunixbcccolumbiaedu gary l dareorganization phds halllines 32at lester patrick awards lunch bill torrey mentioned one hisoptions next season president miami team bob clarkeworking dinner clarke said worst mistakein philadelphia letting mike keenan go retrospect almost allplayers came realize keenan knew took win rumours arenow circulating keenan back flyersnick polano sick scapegoat schedule made thered wings bryan murray approved itgerry meehan john muckler worried sabres prospectsassistant lever says sabres get share nowbecause quebec dynasty emerging mighty ducks declared throw money aroundloosely buy teamoilers coach ted green remarked guys around canfill tie domis skates none fill helmetsenators andrew mcbain told security guard chicago stadiumwho warned stairs leading locker room mcbainmouthed seasoned professional andtumbled e
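
The output above shows words fused across line breaks (for example `daresubject` and `hisoptions`), because newlines are replaced with an empty string. A minimal sketch of one way to keep word boundaries, assuming a plain whitespace split is an acceptable stand-in for `word_tokenize`; the function name `preprocessing_keep_boundaries` and its `stop_words` parameter are hypothetical and not part of the original notebook:

```python
import string

# Translation table that deletes every punctuation character
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)

def preprocessing_keep_boundaries(text, stop_words=frozenset()):
    """Lowercase, strip punctuation, and drop stop words without fusing words."""
    text = text.translate(_DELETE_PUNCTUATION).lower()
    # str.split() with no argument splits on any run of whitespace,
    # so "\n" and "\t" act as word separators instead of being deleted
    tokens = text.split()
    return " ".join(w for w in tokens if w not in stop_words)

# Words on either side of the line break stay separate
print(preprocessing_keep_boundaries("one of his\noptions", stop_words={"of", "his"}))
# prints: one options
```

With NLTK available, `set(stopwords.words('english'))` could be passed as `stop_words` to mirror the cell above.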

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [34]:
# Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm for text data.
# It relies on co-occurrences of words across texts and is used to find topics in a corpus of documents.

In [69]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

In [70]:
vectorizer = TfidfVectorizer(max_df=0.6).fit(data['clean_text'])
data_vectorized = vectorizer.transform(data['clean_text'])
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)
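
LDA is formulated over word counts, so a common variant feeds it a `CountVectorizer` matrix rather than TF-IDF weights, and fixes `random_state` so the topics are reproducible between runs. A minimal sketch on a toy corpus; the `docs` list and the names `count_vectorizer`, `toy_lda` and `doc_topics` are illustrative and do not come from the dataset:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration): two sports-like and two religion-like documents
docs = [
    "hockey team game players season",
    "game team coach players traded",
    "god people church faith believe",
    "faith god people think believe",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# random_state makes the fitted topics repeatable across runs
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per document, one column per topic; each row sums to 1
doc_topics = toy_lda.transform(counts)
print(doc_topics.shape)
```

Whether counts or TF-IDF give more readable topics on the emails is something to check empirically; this sketch only shows the mechanics.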


## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [72]:
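def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() replaces it. Fetch the vocabulary once.
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the 10 highest-weighted words, in descending order of weight
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])

print_topics(lda_model, vectorizer)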


Topic 0:
[('god', 25.91727380022496), ('would', 23.612030315964205), ('one', 20.699808843884956), ('people', 18.778163364660003), ('university', 18.73899114438708), ('think', 17.4147149261789), ('article', 17.374170991728956), ('team', 17.04796557460904), ('go', 16.57558950894027), ('game', 16.46708826848599)]
Topic 1:
[('captain', 5.943432807777781), ('traded', 3.286422239910874), ('gainey', 2.663516141683326), ('leafs', 1.8269818588828133), ('roger', 1.7433740800571338), ('media', 1.741115257659495), ('doug', 1.6027425621491767), ('maynardramseycslaurentianca', 1.537701524563601), ('forwards', 1.443354646666446), ('suhonen', 1.39591292823386)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [73]:
example = ["rice var congratulations save upenn"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.7728954230822402
topic 1 : 0.2271045769177598
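
`transform` returns one row of topic probabilities per document, so the predicted topic is the index of the largest value in the row. A small sketch using the two probabilities printed above; the names `example_topics` and `predicted_topic` are illustrative:

```python
import numpy as np

# Topic distribution copied from the output above (one row per document)
example_topics = np.array([[0.7728954230822402, 0.2271045769177598]])

# Index of the most probable topic for the first (and only) document
predicted_topic = int(example_topics[0].argmax())
print("predicted topic:", predicted_topic)
# prints: predicted topic: 0
```

In the notebook itself, the same call would be applied to `lda_vectors[0]`.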


In [74]:
stop_words = set(stopwords.words('english')) 
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r