# Advanced Learning

## Topic Modeling & LDA

#### Med Radhi Toujani  ----  Med Nacer Cherni  ----  Med Chiheb Bargaoui


We apply Latent Dirichlet Allocation (LDA) to a collection of tweets mentioning @apple and group them into topics.

In [32]:
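import pandas as pd

# One tweet per line; sep='\n' keeps each line in a single column.
# Note: newer pandas releases may reject '\n' as a separator.
data = pd.read_csv('apple.txt', sep='\n', index_col=False, header=None)

data['headline_text'] = data[0]   # text column used throughout
data['index'] = data.index        # row id used to look up documents
documents = data.drop(columns=[0])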


In [90]:
documents.head(10)

| | headline_text | index |
|---|---|---|
| 0 | LOVE U @APPLE,1.8\t | 0 |
| 1 | Thank you @apple, loving my new iPhone 5S!!!!!... | 1 |
| 2 | .@apple has the best customer service. In and ... | 2 |
| 3 | @apple ear pods are AMAZING! Best sound from i... | 3 |
| 4 | Omg the iPhone 5S is so cool it can read your ... | 4 |
| 5 | the iPhone 5c is so beautiful <3 @Apple,1.6\t | 5 |
| 6 | #AttributeOwnership is exactly why @apple will... | 6 |
| 7 | Just checked out the specs on the new iOS 7...... | 7 |
| 8 | I love the new iOS so much!!!!! Thnx @apple @p... | 8 |
| 9 | Can't wait to get my #Iphone5S!!! @apple,1.6\t | 9 |


In [36]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
import nltk

np.random.seed(2018)       # fix NumPy's RNG for reproducibility
nltk.download('wordnet')   # lexical database required by WordNetLemmatizer

[nltk_data] Downloading package wordnet to C:\Users\Radhi
[nltk_data]     Toujani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [60]:
stemmer = SnowballStemmer(language='english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then reduce to the Snowball stem.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer, then stem.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [62]:
doc_sample = documents[documents['index'] == 99].values[0][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['So', 'glad', '@apple', 'now', 'offers', 'the', 'iPhone', 'in', 'multiple', 'colors', 'because', 'god', 'forbid', 'they', 'actually', 'spend', 'resources', 'improving', 'the', 'network', 'capability.,0.6\t']


 tokenized and lemmatized document: 
['glad', 'appl', 'offer', 'iphon', 'multipl', 'color', 'forbid', 'actual', 'spend', 'resourc', 'improv', 'network', 'capabl']
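
The filtering step in `preprocess` can be illustrated without gensim or NLTK. The following is a minimal sketch, not the notebook's implementation: `SMALL_STOPWORDS` is a hypothetical six-word stand-in for gensim's much larger `STOPWORDS` set, the regex tokenizer only approximates `simple_preprocess`, and stemming is left out, so the surviving tokens keep their full form.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS (the real set is far larger).
SMALL_STOPWORDS = frozenset({"so", "now", "the", "in", "because", "they"})

def filter_tokens(text, stopwords=SMALL_STOPWORDS, min_len=4):
    """Lowercase, keep alphabetic runs, drop stopwords and tokens shorter than min_len."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

sample = ("So glad @apple now offers the iPhone in multiple colors because god "
          "forbid they actually spend resources improving the network capability.,0.6\t")
kept = filter_tokens(sample)
print(kept)
```

In this sketch `god` is removed by the length filter rather than by the stopword list, and the trailing `,0.6` never becomes a token because only alphabetic runs are kept. The 13 surviving words correspond one-to-one to the 13 stems printed above.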


In [63]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0                                         [love, appl]
1    [thank, appl, love, iphon, appl, iphon, twitte...
2                  [appl, best, custom, servic, phone]
3             [appl, pod, amaz, best, sound, headphon]
4    [iphon, cool, read, finger, print, unlock, iph...
5                                [iphon, beauti, appl]
6     [exact, appl, appl, market, market, busi, innov]
7              [check, spec, wait, updat, bravo, appl]
8                       [love, thnx, appl, phillydvib]
9                                  [wait, iphon, appl]
Name: headline_text, dtype: object

## Bag Of Words

In [64]:
dictionary = gensim.corpora.Dictionary(processed_docs)
# Show the first 11 (id, token) pairs.
for count, (k, v) in enumerate(dictionary.items()):
    print(k, v)
    if count >= 10:
        break

0 appl
1 love
2 iphon
3 thank
4 twitter
5 xmhjcu
6 best
7 custom
8 phone
9 servic
10 amaz


In [65]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=1000)
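
To make the pruning rule concrete, here is a dependency-free sketch of what `filter_extremes` does according to its documented behaviour: drop tokens that occur in fewer than `no_below` documents or in more than a `no_above` fraction of them, then keep the `keep_n` most frequent survivors. The function name `filter_vocabulary`, the alphabetical tie-breaking, and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=15, no_above=0.5, keep_n=1000):
    """Return the tokens that survive document-frequency pruning."""
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    survivors = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically here).
    survivors.sort(key=lambda t: (-doc_freq[t], t))
    return survivors[:keep_n]

# Invented toy corpus: "appl" occurs in every document, "store" in only one.
toy_docs = [["appl", "love"], ["appl", "iphon"], ["appl", "iphon", "love"], ["appl", "store"]]
vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(vocab)
```

With the notebook's setting `no_above=0.5`, a token present in more than half of the documents, as `appl` appears to be in this dataset, is removed from the vocabulary in the same way as in the toy corpus.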

In [68]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[200]

[(1, 1), (10, 1), (13, 1)]

In [71]:
bow_doc_200 = bow_corpus[200]
for token_id, count in bow_doc_200:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 1 ("iphon") appears 1 time.
Word 10 ("come") appears 1 time.
Word 13 ("like") appears 1 time.
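
The bag-of-words conversion itself is simple counting. The sketch below mimics it with a plain dictionary; `to_bow` is a hypothetical helper, the ids are taken from the output above, and the token list is invented for illustration (including one out-of-vocabulary token, `appl`).

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Ids chosen to match the printed mapping: 1 -> iphon, 10 -> come, 13 -> like.
token2id = {"iphon": 1, "come": 10, "like": 13}
bow = to_bow(["iphon", "come", "like", "appl"], token2id)
print(bow)
```

Tokens that were pruned from the dictionary simply vanish from the vector, so a document can end up much shorter than its token list.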


## TF-IDF

In [72]:
from pprint import pprint
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
# Print the TF-IDF vector of the first document only.
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 1.0)]
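
A weight of exactly 1.0 is what we would expect for a document with a single in-vocabulary term. The sketch below assumes gensim's default weighting (raw term count times log2(N / df), followed by L2 normalization); `tfidf_weights` is a hypothetical helper, and the document frequencies and corpus size are invented.

```python
import math

def tfidf_weights(bow, doc_freq, num_docs):
    """Weight each (term_id, count) pair by count * log2(num_docs / df), then L2-normalize."""
    raw = [(term_id, count * math.log2(num_docs / doc_freq[term_id]))
           for term_id, count in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0.0:
        return []  # every term occurs in all documents, so nothing is informative
    return [(term_id, w / norm) for term_id, w in raw]

# Invented statistics: term 0 occurs in 40 of 1000 documents, term 1 in 100.
doc_freq = {0: 40, 1: 100}
single = tfidf_weights([(0, 1)], doc_freq, num_docs=1000)
double = tfidf_weights([(0, 1), (1, 2)], doc_freq, num_docs=1000)
print(single)
print(double)
```

After normalization a one-term vector always has weight 1.0 (up to floating-point rounding), whatever the term's rarity, so TF-IDF carries no extra information for such documents.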


## Running LDA using Bag of Words

In [73]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [74]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.304*"like" + 0.262*"iphon" + 0.219*"twitter" + 0.059*"thank" + 0.044*"fingerprint" + 0.040*"love" + 0.017*"microsoft" + 0.015*"ipad" + 0.010*"need" + 0.008*"googl"
Topic: 1 
Words: 0.394*"thank" + 0.213*"think" + 0.081*"time" + 0.069*"fingerprint" + 0.049*"twitter" + 0.038*"microsoft" + 0.035*"need" + 0.021*"ipad" + 0.019*"http" + 0.016*"iphon"
Topic: 2 
Words: 0.437*"twitter" + 0.159*"http" + 0.107*"samsung" + 0.059*"love" + 0.059*"iphon" + 0.043*"time" + 0.038*"thank" + 0.029*"like" + 0.017*"ipad" + 0.013*"phone"
Topic: 3 
Words: 0.363*"phone" + 0.150*"googl" + 0.119*"http" + 0.098*"twitter" + 0.083*"store" + 0.080*"come" + 0.027*"samsung" + 0.025*"thank" + 0.015*"iphon" + 0.012*"time"
Topic: 4 
Words: 0.250*"itun" + 0.240*"ipodplayerpromo" + 0.180*"ipad" + 0.164*"ipod" + 0.144*"promo" + 0.006*"googl" + 0.005*"http" + 0.004*"microsoft" + 0.002*"iphon" + 0.002*"go"
Topic: 5 
Words: 0.311*"come" + 0.146*"iphon" + 0.093*"itun" + 0.079*"ipodplayerpromo" + 0.065*"ipad" 

## Running LDA using TF-IDF

In [81]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWord: {}\n'.format(idx, topic))

Topic: 0 
Word: 0.297*"love" + 0.236*"need" + 0.141*"time" + 0.139*"twitter" + 0.053*"iphon" + 0.041*"store" + 0.019*"think" + 0.016*"microsoft" + 0.012*"http" + 0.008*"itun"

Topic: 1 
Word: 0.428*"twitter" + 0.214*"thank" + 0.184*"like" + 0.049*"iphon" + 0.027*"phone" + 0.022*"googl" + 0.022*"http" + 0.012*"love" + 0.008*"store" + 0.007*"need"

Topic: 2 
Word: 0.268*"store" + 0.131*"fingerprint" + 0.107*"go" + 0.103*"phone" + 0.088*"http" + 0.084*"iphon" + 0.042*"like" + 0.031*"ipad" + 0.030*"ipodplayerpromo" + 0.029*"itun"

Topic: 3 
Word: 0.172*"ipodplayerpromo" + 0.168*"itun" + 0.144*"phone" + 0.116*"samsung" + 0.106*"ipod" + 0.105*"promo" + 0.088*"ipad" + 0.053*"http" + 0.012*"store" + 0.008*"need"

Topic: 4 
Word: 0.442*"come" + 0.107*"ipod" + 0.093*"time" + 0.065*"itun" + 0.060*"ipodplayerpromo" + 0.055*"store" + 0.040*"http" + 0.034*"promo" + 0.030*"ipad" + 0.023*"iphon"

Topic: 5 
Word: 0.169*"itun" + 0.125*"iphon" + 0.109*"http" + 0.082*"phone" + 0.072*"googl" + 0.058*"come"

### Performance evaluation: classifying sample document using LDA Bag of Words model

We check which topic the model assigns to our sample document (index 99).

In [82]:
processed_docs[99]

['glad',
 'appl',
 'offer',
 'iphon',
 'multipl',
 'color',
 'forbid',
 'actual',
 'spend',
 'resourc',
 'improv',
 'network',
 'capabl']

In [84]:
for index, score in sorted(lda_model[bow_corpus[99]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.5499690771102905	 
Topic: 0.635*"iphon" + 0.231*"http" + 0.041*"fingerprint" + 0.018*"love" + 0.017*"samsung" + 0.014*"store" + 0.009*"twitter" + 0.009*"come" + 0.008*"ipad" + 0.006*"go"

Score: 0.050010502338409424	 
Topic: 0.297*"iphon" + 0.263*"need" + 0.155*"twitter" + 0.148*"think" + 0.043*"go" + 0.032*"phone" + 0.010*"love" + 0.009*"itun" + 0.008*"store" + 0.008*"samsung"

Score: 0.05000915005803108	 
Topic: 0.304*"like" + 0.262*"iphon" + 0.219*"twitter" + 0.059*"thank" + 0.044*"fingerprint" + 0.040*"love" + 0.017*"microsoft" + 0.015*"ipad" + 0.010*"need" + 0.008*"googl"

Score: 0.05000494420528412	 
Topic: 0.311*"come" + 0.146*"iphon" + 0.093*"itun" + 0.079*"ipodplayerpromo" + 0.065*"ipad" + 0.051*"ipod" + 0.046*"like" + 0.044*"promo" + 0.037*"think" + 0.030*"http"

Score: 0.05000383034348488	 
Topic: 0.447*"http" + 0.181*"store" + 0.111*"iphon" + 0.093*"time" + 0.038*"phone" + 0.038*"need" + 0.018*"thank" + 0.017*"think" + 0.015*"go" + 0.013*"come"

Score: 0.050001937

The sample document is assigned to the topic dominated by `iphon` and `http` with a score of about 0.55; every other topic shown receives about 0.05.
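
The scores above, one near 0.55 and the rest near 0.05, have a simple arithmetic reading. Assuming gensim's default symmetric prior of 1 / `num_topics` (the notebook does not set `alpha`) and that inference attributes the document's tokens almost entirely to one topic, the normalized scores depend only on how many in-vocabulary tokens the document contains. The helper `single_topic_scores` below is hypothetical and only reproduces that arithmetic; it does not run LDA.

```python
def single_topic_scores(num_topics=10, n_tokens=1):
    """Normalized topic weights if all n_tokens are attributed to one topic."""
    alpha = 1.0 / num_topics          # assumed symmetric Dirichlet prior
    total = num_topics * alpha + n_tokens
    dominant = (alpha + n_tokens) / total
    other = alpha / total
    return dominant, other

dominant, other = single_topic_scores(num_topics=10, n_tokens=1)
print(round(dominant, 4), round(other, 4))  # 0.55 0.05
```

One token and ten topics give 0.55 and 0.05, which matches the printed scores. This is consistent with only a single token of document 99 surviving the dictionary filtering, in which case the assignment reflects that one word rather than the whole tweet.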

### Performance evaluation: classifying sample document using LDA TF-IDF model

In [85]:
for index, score in sorted(lda_model_tfidf[bow_corpus[99]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.5499815940856934	 
Topic: 0.696*"iphon" + 0.131*"http" + 0.085*"ipad" + 0.010*"ipodplayerpromo" + 0.010*"microsoft" + 0.009*"itun" + 0.008*"thank" + 0.006*"ipod" + 0.006*"promo" + 0.005*"love"

Score: 0.050006456673145294	 
Topic: 0.232*"twitter" + 0.205*"iphon" + 0.089*"think" + 0.089*"http" + 0.078*"thank" + 0.057*"ipad" + 0.051*"time" + 0.032*"phone" + 0.030*"come" + 0.021*"microsoft"

Score: 0.050003595650196075	 
Topic: 0.169*"itun" + 0.125*"iphon" + 0.109*"http" + 0.082*"phone" + 0.072*"googl" + 0.058*"come" + 0.058*"microsoft" + 0.042*"ipad" + 0.039*"store" + 0.039*"ipodplayerpromo"

Score: 0.0500025749206543	 
Topic: 0.734*"http" + 0.085*"iphon" + 0.058*"need" + 0.024*"like" + 0.019*"fingerprint" + 0.014*"time" + 0.012*"love" + 0.010*"go" + 0.008*"come" + 0.007*"twitter"

Score: 0.050002504140138626	 
Topic: 0.268*"store" + 0.131*"fingerprint" + 0.107*"go" + 0.103*"phone" + 0.088*"http" + 0.084*"iphon" + 0.042*"like" + 0.031*"ipad" + 0.030*"ipodplayerpromo" + 0.029*"i

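With the TF-IDF model the sample document again falls, with a score of about 0.55, into a topic dominated by `iphon`, while every other topic shown receives about 0.05. Because the corpus has no ground-truth topic labels, this shows that both models single out a dominant topic, not that the classification is accurate. Note also that the query above is the raw bag-of-words vector `bow_corpus[99]`, not its TF-IDF weighting.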

## Testing model on an unknown document

In [89]:
unknown_document = 'Samsung galaxy is the best'
bow_vector = dictionary.doc2bow(preprocess(unknown_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5499909520149231	 Topic: 0.437*"twitter" + 0.159*"http" + 0.107*"samsung" + 0.059*"love" + 0.059*"iphon"
Score: 0.050004422664642334	 Topic: 0.363*"phone" + 0.150*"googl" + 0.119*"http" + 0.098*"twitter" + 0.083*"store"
Score: 0.050003502517938614	 Topic: 0.635*"iphon" + 0.231*"http" + 0.041*"fingerprint" + 0.018*"love" + 0.017*"samsung"
Score: 0.05000048130750656	 Topic: 0.297*"iphon" + 0.263*"need" + 0.155*"twitter" + 0.148*"think" + 0.043*"go"
Score: 0.05000028386712074	 Topic: 0.447*"http" + 0.181*"store" + 0.111*"iphon" + 0.093*"time" + 0.038*"phone"
Score: 0.05000006780028343	 Topic: 0.304*"like" + 0.262*"iphon" + 0.219*"twitter" + 0.059*"thank" + 0.044*"fingerprint"
Score: 0.05000006780028343	 Topic: 0.394*"thank" + 0.213*"think" + 0.081*"time" + 0.069*"fingerprint" + 0.049*"twitter"
Score: 0.05000006780028343	 Topic: 0.250*"itun" + 0.240*"ipodplayerpromo" + 0.180*"ipad" + 0.164*"ipod" + 0.144*"promo"
Score: 0.05000006780028343	 Topic: 0.311*"come" + 0.146*"iphon" + 0.0