# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [4]:
import string

def punc_lower(text):
    # str.lower() returns a new string, so the result has to be reassigned
    text = text.lower()
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " ")
    return text

data["clean_text"] = data.text.apply(punc_lower)
data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,from gld cunixb cc columbia edu gary l dare ...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlep vela acs oakland edu cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,from miner kuhub cc ukans edu\nsubject re a...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,from atterlep vela acs oakland edu cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,from vzhivov superior carleton ca vladimir z...
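
If you want to go a bit further with the cleaning, here is a minimal sketch of one possible variant. It assumes you also want to drop digits and collapse whitespace, which the exercise does not ask for. The name `clean_text` and the sample email are made up for illustration.

```python
import re
import string

# Translation table that maps every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase, replace punctuation with spaces, drop digits, collapse whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)   # remove runs of digits (optional step)
    text = re.sub(r"\s+", " ", text)   # collapse spaces and newlines into one space
    return text.strip()

# Hypothetical sample, not taken from the dataset.
sample = "From: someone@example.edu (A. Person)\nSubject: Re: Game 7, 1993!"
cleaned = clean_text(sample)
print(cleaned)
```

With a dataframe like the one above, this could be applied as `data["clean_text"] = data.text.apply(clean_text)`. Whether removing digits helps depends on the corpus, so treat that step as a choice rather than a rule.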


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [11]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer=TfidfVectorizer().fit(data["clean_text"])
data_vectorized=vectorizer.transform(data["clean_text"])

lda_model=LatentDirichletAllocation(n_components=2).fit(data_vectorized)



In [14]:
lda_model.components_

array([[2.82641525, 0.73453846, 0.71528414, ..., 0.50866224, 0.90540575,
        0.5147117 ],
       [2.3575067 , 1.5106806 , 0.50220795, ..., 0.62752055, 0.50461181,
        0.6433207 ]])

In [19]:
vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['00',
 '000',
 '0001',
 '000256',
 '001323',
 '0014',
 '00309',
 '003221',
 '005',
 '005512',
 '0059',
 '00630',
 '007',
 '0086',
 '00iv1a6kyd2',
 '01',
 '01002',
 '010305',
 '011',
 '013',
 '0131',
 '013653rap115',
 '013939',
 '014',
 '014237',
 '015225',
 '015415',
 '01580',
 '015936',
 '02',
 '021',
 '02173',
 '022113',
 '0226',
 '023843',
 '03',
 '031823',
 '031840',
 '032017',
 '032350',
 '0349',
 '0358',
 '036',
 '037',
 '038',
 '04',
 '0400',
 '042',
 '043426',
 '0435',
 '044045',
 '044323',
 '045046',
 '0458',
 '0483',
 '05',
 '050',
 '051',
 '0510',
 '0511',
 '051942',
 '052120rap115',
 '052907',
 '053',
 '053748rap115',
 '055',
 '059',
 '06',
 '060010',
 '062622',
 '063253',
 '065',
 '0666',
 '06paul',
 '07',
 '0702',
 '0706',
 '07102',
 '072',
 '073134',
 '073540',
 '073716',
 '0739',
 '074054',
 '07748',
 '08',
 '082502acps6992',
 '085337',
 '085435',
 '089',
 '09',
 '091503rap115',
 '091859',
 '092246dlmqc',
 '094',
 '094815mece7187',
 '095653',
 '097',
 '099',
 '0fovj7i0

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [12]:
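def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; fetch the vocabulary once
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so this slice yields the 10 highest-weighted terms
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])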

print_topics(lda_model, vectorizer)

Topic 0:
[('the', 126.16628900881942), ('of', 73.72689282308086), ('to', 73.58651050555873), ('that', 58.32642635322059), ('is', 56.40267906329505), ('and', 54.37560680655887), ('in', 48.289625051634836), ('it', 39.232602271315066), ('you', 37.451190636110454), ('not', 31.85380427935519)]
Topic 1:
[('the', 55.06010656764189), ('in', 22.85903670665075), ('ca', 19.261048539404197), ('to', 19.137715940492487), ('and', 17.184300121456534), ('team', 16.632914150811068), ('game', 16.488996030043673), ('edu', 15.779768635738236), ('of', 14.792884598581205), ('hockey', 14.704786589365876)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [0]:
example = ["rice var congratulations save upenn"]

example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])
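
Each row returned by `transform` holds one weight per topic, so the predicted topic is simply the column with the largest weight. A minimal sketch of that last step, using hypothetical weights rather than real model output:

```python
import numpy as np

# Hypothetical output of lda_model.transform(...) for three documents and two topics.
lda_vectors = np.array([
    [0.82, 0.18],
    [0.35, 0.65],
    [0.50, 0.50],
])

dominant_topic = lda_vectors.argmax(axis=1)  # index of the largest weight in each row
weight = lda_vectors.max(axis=1)             # the largest weight itself

for doc_id, (topic, w) in enumerate(zip(dominant_topic, weight)):
    print(f"document {doc_id}: topic {topic} (weight {w:.2f})")
```

Note that `argmax` returns the first index on a tie, so a document with near-equal weights is better treated as ambiguous than as firmly assigned.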