__Topic Modeling__ is a process that automatically identifies __topics present in a text object__
and derives __hidden patterns exhibited by a text corpus__

Topic = a repeating pattern of co-occurring terms in a corpus

Keyword-searching techniques based on __Regular Expressions__ or a __Bag of Words Dictionary__ are __rule based__

Topic Modeling, by contrast, is __unsupervised__

__Applications of Topic Modeling__:
1. The New York Times uses topic models to boost its user-article recommendation engine
2. Recruiters use topic models to extract latent features of job descriptions and map them to the right candidates

###### Top Techniques of Topic Modeling: 
1. __Latent Dirichlet Allocation__
2. __Term Frequency and Inverse Document Frequency__
3. __Non-Negative Matrix Factorization__
4. __Ngrams__

LDA assumes __documents__ are produced from __a mixture of topics__
LDA can be viewed as a matrix factorization technique

In vector space, a collection of documents can be represented as a __Document - Term Matrix__
The Document - Term (or Document - Words) Matrix can be factorized into two lower-dimensional matrices:
1. Document - Topics Matrix (dimension = N,K; N = No. of Documents; K = No. of Topics)
2. Topics - Words Matrix (dimension = K,M; K= No. of Topics; M = Vocabulary Size)
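
As a minimal sketch of the shapes involved (all numbers are invented for illustration and are not taken from the dataset used later), multiplying an N x K document-topic matrix by a K x M topic-word matrix gives an N x M matrix of word probabilities per document:

```python
# Toy illustration of the two factor matrices (all numbers are made up).
# N = 2 documents, K = 2 topics, M = 3 vocabulary words.
doc_topics = [            # N x K: each row is one document's topic mixture
    [0.9, 0.1],
    [0.2, 0.8],
]
topic_words = [           # K x M: each row is one topic's word distribution
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

doc_words = matmul(doc_topics, topic_words)   # N x M
for row in doc_words:
    print([round(p, 3) for p in row])
```

Each row of the product is again a probability distribution over the vocabulary, which is why the two smaller matrices together can stand in for the full document-term matrix.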

LDA uses sampling techniques to iteratively improve these two matrices

For every word w in every document d, two probabilities are computed for each of the K topics:
    p1 - p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t
                                 = the probability that topic t is present in document d

    p2 - p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from word w
                             = the probability that word w is present in topic t

The topic assignment of word w is then resampled with probability proportional to p1 × p2, and this update is repeated for every word w in every document d
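
The update described above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration under stated assumptions: the function name `gibbs_lda`, the tiny corpus of word ids, and the hyperparameter values are all made up here, and gensim's `LdaModel` (used later in this notebook) relies on online variational Bayes rather than on this sampler.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # counts behind p1
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts behind p2
    topic_total = [0] * num_topics
    assign = []
    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(num_topics)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic k is proportional to p1 * p2 (smoothed by alpha, beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]   # made-up word ids
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=4)
print(doc_topic)    # per-document topic counts after sampling
```

The two count tables play the roles of the Document - Topics and Topics - Words matrices; normalising their rows would give the probabilities p1 and p2.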

4 Parameters of LDA

I) __Alpha and Beta__ Hyperparameters
1. alpha controls document-topic density, i.e. how many topics a document tends to contain
2. beta controls topic-word density, i.e. how many words a topic tends to contain

The higher the value of alpha, the more topics each document is composed of; the lower the value of alpha, the fewer topics each document contains.

The higher the value of beta, the more words from the corpus each topic is composed of; the lower the value of beta, the fewer words each topic contains.
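
The effect of alpha can be illustrated by sampling topic mixtures from a symmetric Dirichlet distribution. This is a rough sketch: the helper names, the choice of 5 topics, and the two alpha values are assumptions made for the illustration only.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, k=5, trials=2000, seed=0):
    """Average weight of the dominant topic across many sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials)) / trials

low = mean_largest_share(alpha=0.1)     # low alpha: a few topics per document
high = mean_largest_share(alpha=10.0)   # high alpha: many topics per document
print("alpha=0.1  -> dominant topic holds about %.2f of a document" % low)
print("alpha=10.0 -> dominant topic holds about %.2f of a document" % high)
```

With a low alpha one topic tends to dominate each sampled document, while a high alpha spreads the weight more evenly across topics. The same reasoning applies to beta and the words within a topic.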

II) Optimal No. of Topics from the Corpus
III) No. of terms in a single topic

The __optimal number of topics__ can be estimated using the __Kullback Leibler Divergence Score__
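
These notes do not spell out how the score is applied when choosing the number of topics, so the sketch below only shows the divergence itself for two discrete distributions. The function name and the two example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

topic_a = [0.7, 0.2, 0.1]   # made-up word distributions over a 3-word vocabulary
topic_b = [0.1, 0.3, 0.6]
print(kl_divergence(topic_a, topic_a))   # identical distributions give zero
print(kl_divergence(topic_a, topic_b))   # differing distributions give a positive value
```

Note that the divergence is not symmetric: `kl_divergence(topic_a, topic_b)` and `kl_divergence(topic_b, topic_a)` generally differ.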

__No. of terms in a topic__ can be __high__ if extracting __themes or concepts__
__No. of terms in a topic__ can be __low__ if extracting __features or terms__

IV) Number of Iterations

The maximum number of iterations the LDA algorithm is allowed in order to converge.

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

In [2]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [3]:
import re
from time import time

In [4]:
import gensim
from gensim import corpora



In [5]:
ip=pd.read_csv(r"C:\Users\senthil.kumar\Documents\F car\New\Final_Conversations.csv",encoding="ISO-8859-1",error_bad_lines=False, index_col=False)

In [6]:
ip['Modified Text']=''

In [7]:
for each in range(len(ip)):
    text=(re.sub(r"^RT @\w+:","",ip.loc[each,'Content'])).strip()
    text=(re.sub(r"@\w+","",text)).strip()
    text=(re.sub(r"(http[a-zA-Z0-9!@#$%^&*.,?:;()\/]+)(\s|$){1,}","",text)).strip()
    ip.loc[each,'Modified Text']=text 
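
The three substitutions in the loop above can also be wrapped in a function and tried on a single string. This is a small sketch: the function name `clean_tweet` and the sample tweet are invented for the illustration.

```python
import re

def clean_tweet(text):
    """Apply the same three substitutions as the loop above to one string."""
    # 1. Drop a leading retweet marker such as "RT @user:".
    text = re.sub(r"^RT @\w+:", "", text).strip()
    # 2. Drop any remaining @mentions.
    text = re.sub(r"@\w+", "", text).strip()
    # 3. Drop URLs together with the whitespace that follows them.
    text = re.sub(r"(http[a-zA-Z0-9!@#$%^&*.,?:;()\/]+)(\s|$){1,}", "", text).strip()
    return text

sample = "RT @someone: Check the new @Ford Ranger http://t.co/abc now"   # invented example
print(clean_tweet(sample))
```

Removing a mention from the middle of a sentence leaves a double space behind, which is harmless here because later steps split the text on whitespace. With pandas, such a function could be applied to the whole column via `ip['Content'].apply(clean_tweet)` instead of looping over row indices.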

In [8]:
ip.head()

Unnamed: 0,S No,Content,Mentions,Modified Text
0,1,"""@Ford insists the Ranger is no parking mall p...",@Ford,""" insists the Ranger is no parking mall poseur..."
1,2,"""The 4-Wheel drive vehicle for people who like...",@Ford,"""The 4-Wheel drive vehicle for people who like..."
2,3,"""The decisions I make on a daily basis in auto...",@Ford,"""The decisions I make on a daily basis in auto..."
3,4,"""Want to Be"" is featured in the new @Ford Camp...",@Ford,"""Want to Be"" is featured in the new Campaign!..."
4,5,"""We the people, who are better together than w...",@Ford,"""We the people, who are better together than w..."


In [11]:
stoplist=list(stopwords.words('english'))
context_stopwords = ['ford']
stoplist.extend(context_stopwords)

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer(min_df=5, max_df=0.7,ngram_range=(1,3),stop_words=frozenset(stoplist))

In [13]:
Cleaned_doc=pd.Series(ip["Modified Text"])
X = vect.fit_transform(Cleaned_doc)

In [14]:
print vect.vocabulary_.items()[0:20]
print len(vect.vocabulary_.items())

[(u'looking', 413), (u'replaced', 590), (u'china', 134), (u'focus rs', 281), (u'titanium', 713), (u'want', 774), (u'feature', 265), (u'fit', 277), (u'fix', 279), (u'pretty cool', 543), (u'rs', 600), (u'service', 628), (u'needed', 470), (u'f150 raptor', 248), (u'came', 114), (u'saying', 611), (u'plate', 532), (u'lots', 418), (u'customer service', 177), (u'much', 460)]
819
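
Entries such as 'focus rs' and 'customer service' in the vocabulary above come from `ngram_range=(1,3)`, which asks the vectorizer for unigrams, bigrams and trigrams. The idea can be sketched in plain Python; the helper name `word_ngrams` and the example tokens are made up, and the real vectorizer additionally tokenizes, removes stopwords and applies the `min_df` / `max_df` filters.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams(["focus", "rs", "review"]))
```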


In [15]:
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In [16]:
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
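
With `documents_columns=False`, each row of `X` is treated as one document. The resulting corpus format, a list of (term id, weight) pairs per document, can be sketched without gensim as follows; the helper name and the toy matrix are invented for the illustration.

```python
def rows_to_corpus(matrix):
    """Turn a dense documents x terms matrix into gensim-style bag-of-words lists."""
    return [[(term_id, float(value)) for term_id, value in enumerate(row) if value != 0]
            for row in matrix]

toy = [
    [0.0, 0.5, 0.0, 0.8],   # document 0: terms 1 and 3 are present
    [0.0, 0.0, 0.0, 0.0],   # document 1: no term survived the vectorizer's filters
]
print(rows_to_corpus(toy))
```

A document in which no vocabulary term occurs becomes an empty list, which is why some entries of the corpus can be `[]`.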

In [17]:
Lda = gensim.models.ldamodel.LdaModel
Passes=30
Number_of_topics=15
ldamodel = Lda(corpus, num_topics=Number_of_topics, id2word = id_map, passes=Passes,random_state=34)

In [18]:
topics_list=ldamodel.print_topics(num_topics=15, num_words=4)

In [19]:
topics_processed = []
for each in range(len(topics_list)):
    topics_processed.append(re.findall(r'\"([a-zA-Z ]+)\"',topics_list[each][1],re.I))

In [20]:
topics_processed[0]

[u'lol', u'though', u'mine', u'big']

In [21]:
op=pd.Series(topics_processed)

In [22]:
op1=pd.DataFrame(op)

In [34]:
op1.to_excel(r"C:\Users\senthil.kumar\Documents\F car\New\Processed Outputs\Top 15 Topics_onTFIDF_Scored.xlsx")

In [24]:
Features=pd.Series(vect.vocabulary_.items())

In [25]:
Features.to_csv(r'C:\Users\senthil.kumar\Documents\F car\New\Processed Outputs\TFIDF_uni_Bi_Tri_gram_Features.csv')

In [26]:
#To visibly see how the corpus is built, save the corpus in the Matrix Market format:
corpora.MmCorpus.serialize('corpus.mm', corpus)

In [27]:
#load the saved corpus
corpus2 = corpora.MmCorpus('corpus.mm')

In [28]:
print corpus2

MmCorpus(2424 documents, 819 features, 10291 non-zero entries)


In [29]:
list_corpus=list(corpus2)

In [126]:
list_corpus[100]

[]

In [124]:
a=ldamodel[list_corpus[100]]

In [125]:
a[0]

(0, 0.06666666666666668)

In [45]:
index=-1
value=-1

In [58]:
for each in a: 
    j=each[1]
    if j>value:
        value=j
        index=each[0]

In [59]:
index

3

In [30]:
ip['Topic_No']=-1
ip['Topic_Prob']=-1

In [31]:
for each in range(len(ip)):
    if each%100==0: print each
    a=ldamodel[list_corpus[each]]
    index=-1
    value=-1
    for every in a: 
        j=every[1]
        if j>value:
            value=j
            index=every[0]
    if value>0.3:
        ip.loc[each,'Topic_No']=index
        ip.loc[each,'Topic_Prob']=value

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
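
The inner loop above picks the most probable topic and keeps it only when its probability exceeds 0.3. A compact version of that selection might look like this; the helper name `dominant_topic` and the example inputs are assumptions made for the sketch.

```python
def dominant_topic(topic_probs, threshold=0.3):
    """Return (topic_id, probability) of the most likely topic, or (-1, -1)
    when no topic exceeds the threshold (mirroring the loop above)."""
    if not topic_probs:
        return -1, -1
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    return (topic_id, prob) if prob > threshold else (-1, -1)

print(dominant_topic([(0, 0.1), (3, 0.6), (7, 0.3)]))
print(dominant_topic([(t, 1.0 / 15) for t in range(15)]))
```

The second call shows a document whose probability mass is spread evenly over 15 topics: no topic clears the threshold, so the document keeps the placeholder values of -1.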


In [32]:
ip.sort_values(['Topic_No','Content'],ascending=[True,True],inplace=True)

In [33]:
ip.to_excel(r'C:\Users\senthil.kumar\Documents\F car\New\Processed Outputs\15_Topics_per_post_1,3.xlsx')

In [12]:
#To run any mathematical model on text corpus (a collection of text documents), it is a good practice to convert it into a matrix representation
#LDA model looks for repeating term patterns in the entire DT matrix
#gensim is a clean and beautiful library to handle text data in python.
#gensim is scalable, robust and efficient

In [8]:
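# Assumed definitions: the cell that created these helpers is not shown in this export
stop = set(stopwords.words('english'))   # English stopwords from NLTK
exclude = set(string.punctuation)        # punctuation characters to strip
lemma = WordNetLemmatizer()              # reduces words to their dictionary form

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized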

In [9]:
doc_clean=[]

In [10]:
for each in range(len(ip)):
    doc_clean.append(clean(ip.loc[each,"Modified Text"]).split())

In [11]:
doc_clean

[[u'insists',
  u'ranger',
  u'parking',
  u'mall',
  u'poseur',
  u'designed',
  u'getgo',
  u'handle',
  u'rough',
  u'stuff'],
 [u'the',
  u'4wheel',
  u'drive',
  u'vehicle',
  u'people',
  u'like',
  u'visit',
  u'wildlife',
  u'sanctuary',
  u'bronco',
  u'ii',
  u'1989'],
 [u'the',
  u'decision',
  u'make',
  u'daily',
  u'basis',
  u'automotive',
  u'impact',
  u'someone',
  u'safety',
  u'really',
  u'meaningful',
  u'me'],
 [u'want',
  u'be',
  u'featured',
  u'new',
  u'campaign',
  u'ecosport',
  u'written',
  u'performed',
  u'myself',
  u'ch'],
 [u'we',
  u'people',
  u'better',
  u'together',
  u'alone',
  u'unstoppable',
  u'proud',
  u'collaboration',
  u'betw'],
 [u'cas18',
  u'empowering',
  u'woman',
  u'technology',
  u'amp',
  u'automobile',
  u'welcome',
  u'chicago',
  u'auto',
  u'show'],
 [u'cas18',
  u'galaxy',
  u'far',
  u'far',
  u'away',
  u'frank',
  u'robot',
  u'behold',
  u'new',
  u'era',
  u'technology',
  u'participation'],
 [u'ford', u'fiesta', u'

In [13]:
import gensim
from gensim import corpora
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)



In [15]:
type(dictionary)

gensim.corpora.dictionary.Dictionary

In [19]:
dictionary.save_as_text('something.txt',sort_by_word=False)

In [32]:
dictionary[8]

u'surface'

In [21]:
#there are 3959 unique words identified from the dataset after removing stopwords

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [36]:
doc_clean[2]

[u'so',
 u'decision',
 u'surface',
 u'pro',
 u'4',
 u'surface',
 u'book',
 u'cant',
 u'believe',
 u'much',
 u'relied',
 u'surface',
 u'pen']

In [35]:
doc_term_matrix[2]

[(6, 1),
 (8, 3),
 (20, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1)]
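
The pair `(8, 3)` above says that the term with id 8 ('surface') occurs three times in the document. What `Dictionary` and `doc2bow` do can be sketched in plain Python; the helper names are made up, and the ids assigned here will not match the ones gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["surface", "pro", "4", "surface", "book", "surface", "pen"]]
vocab = build_vocab(docs)
print(to_bow(docs[0], vocab))
```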

In [37]:
# Creating the object for LDA model using gensim library
start_time=time()
Lda = gensim.models.ldamodel.LdaModel


# Running and training the LDA model on the document term matrix.
Passes=50
Number_of_topics=10
ldamodel = Lda(doc_term_matrix, num_topics=Number_of_topics, id2word = dictionary, passes=Passes)
end_time=time()

print "Time taken for %d passes is %.1f seconds" % (Passes, end_time - start_time)

Time taken for 50 passes is 395.2 seconds


In [38]:
print ldamodel
type(ldamodel)

LdaModel(num_terms=3959, num_topics=10, decay=0.5, chunksize=2000)


gensim.models.ldamodel.LdaModel

In [39]:
#interpretation:
#passes=50 means the model makes 50 full passes over the corpus during training
#chunksize=2000 means documents are processed in chunks of up to 2,000 at a time when updating the model
print ldamodel.print_topics(num_topics=10, num_words=5)

[(0, u'0.060*"surface" + 0.027*"pro" + 0.025*"book" + 0.017*"4" + 0.012*"cant"'), (1, u'0.063*"surface" + 0.056*"pro" + 0.024*"4" + 0.013*"one" + 0.012*"book"'), (2, u'0.079*"surface" + 0.028*"microsoft" + 0.023*"book" + 0.021*"pro" + 0.012*"buy"'), (3, u'0.057*"surface" + 0.034*"pro" + 0.017*"4" + 0.016*"window" + 0.014*"new"'), (4, u'0.067*"surface" + 0.031*"pro" + 0.017*"microsoft" + 0.013*"book" + 0.012*"like"'), (5, u'0.064*"surface" + 0.027*"pro" + 0.016*"book" + 0.014*"it" + 0.012*"laptop"'), (6, u'0.049*"surface" + 0.041*"pro" + 0.016*"4" + 0.008*"book" + 0.008*"pen"'), (7, u'0.087*"surface" + 0.044*"microsoft" + 0.032*"video" + 0.028*"pro" + 0.025*"studio"'), (8, u'0.090*"surface" + 0.059*"pro" + 0.027*"4" + 0.015*"book" + 0.015*"it"'), (9, u'0.032*"surface" + 0.014*"pro" + 0.012*"surfacepro4" + 0.010*"3" + 0.010*"im"')]


In [43]:
ldamodel.print_topics(num_topics=10, num_words=5)

[(0,
  u'0.060*"surface" + 0.027*"pro" + 0.025*"book" + 0.017*"4" + 0.012*"cant"'),
 (1,
  u'0.063*"surface" + 0.056*"pro" + 0.024*"4" + 0.013*"one" + 0.012*"book"'),
 (2,
  u'0.079*"surface" + 0.028*"microsoft" + 0.023*"book" + 0.021*"pro" + 0.012*"buy"'),
 (3,
  u'0.057*"surface" + 0.034*"pro" + 0.017*"4" + 0.016*"window" + 0.014*"new"'),
 (4,
  u'0.067*"surface" + 0.031*"pro" + 0.017*"microsoft" + 0.013*"book" + 0.012*"like"'),
 (5,
  u'0.064*"surface" + 0.027*"pro" + 0.016*"book" + 0.014*"it" + 0.012*"laptop"'),
 (6,
  u'0.049*"surface" + 0.041*"pro" + 0.016*"4" + 0.008*"book" + 0.008*"pen"'),
 (7,
  u'0.087*"surface" + 0.044*"microsoft" + 0.032*"video" + 0.028*"pro" + 0.025*"studio"'),
 (8, u'0.090*"surface" + 0.059*"pro" + 0.027*"4" + 0.015*"book" + 0.015*"it"'),
 (9,
  u'0.032*"surface" + 0.014*"pro" + 0.012*"surfacepro4" + 0.010*"3" + 0.010*"im"')]

###### Which topic is most likely for the third document?

In [44]:
print ldamodel[doc_term_matrix[2]]

[(8, 0.93570287947855924)]


In [14]:
#Some of the above steps via Scikit-Learn

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
#Creating an object of the Count Vectorizer:
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english')

In [17]:
Cleaned_doc=pd.Series(ip["Modified Text"])

In [18]:
type(Cleaned_doc)

pandas.core.series.Series

In [19]:
Cleaned_doc[0:5]

0    The surface pro is just what I was looking for...
1    Hi Beau. We're here to help you. Have you trie...
2    so... the decision for me is Surface Pro 4 or ...
3    My new Surface Pro 4 pen stopped working, i'm ...
4    .  What a fun way to collaborate! What are the...
Name: Modified Text, dtype: object

In [20]:
X = vect.fit_transform(Cleaned_doc)

In [21]:
X

<1831x102 sparse matrix of type '<type 'numpy.int64'>'
	with 4759 stored elements in Compressed Sparse Row format>

In [22]:
print len(Cleaned_doc)

1831


In [23]:
print vect.get_feature_names()[0:5]
print len(vect.get_feature_names())
#there are only 102 features because of the following filters:
#min_df=20 (a term must occur in at least 20 documents) and
#max_df=0.2 (a term may occur in at most 20% of the documents)

[u'10', u'2016', u'added', u'amazing', u'app']
102
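
The `min_df` and `max_df` filters can be sketched in plain Python to show why so few features remain. The helper name and the toy documents are invented; as in the vectorizer call above, an integer `min_df` is read as a document count and a float `max_df` as a proportion of documents.

```python
def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms found in at least min_df documents (an absolute count) and in
    at most max_df of all documents (a proportion between 0 and 1)."""
    doc_freq = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)

docs = [["surface", "pen"], ["surface", "pen"], ["surface", "app"],
        ["surface", "book"], ["surface"]]
print(filter_by_document_frequency(docs, min_df=2, max_df=0.5))
```

In this toy corpus 'surface' is dropped for being too common and 'app' and 'book' for being too rare, so only 'pen' survives.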


In [24]:
print vect.vocabulary_.items()[0:5]
print len(vect.vocabulary_.items())
#print vocabulary or feature name and its id as well. 

[(u'looking', 46), (u'look', 45), (u'new', 57), (u'studio', 73), (u'phone', 62)]
102


In [25]:
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In [26]:
dense_X=X.todense()

In [27]:
dense_X.shape

(1831L, 102L)

In [28]:
dense_X[1830]
#row 1830 (the last document) of the document-term matrix, shown across all 102 features

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [29]:
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

In [30]:
?corpus

In [31]:
#To visibly see how the corpus is built, save the corpus in the Matrix Market format:
corpora.MmCorpus.serialize('corpus.mm', corpus)

In [32]:
#load the saved corpus
corpus2 = corpora.MmCorpus('corpus.mm')

In [33]:
print corpus2

MmCorpus(1831 documents, 102 features, 4759 non-zero entries)


In [34]:
list_corpus=list(corpus2)
list_corpus[1]

[(17, 1.0), (31, 1.0), (33, 1.0), (84, 1.0), (88, 1.0)]

In [35]:
for key,value in vect.vocabulary_.items():
    if value==17 or value==31 or value==33 or value==84 or value==88:
        print key,value

did 17
tried 84
help 31
use 88
hi 33


In [36]:
Cleaned_doc[1]

u"Hi Beau. We're here to help you. Have you tried to use a power plug adapter? Where did you purchase your Surface Pro 4?"

In [38]:
Lda = gensim.models.ldamodel.LdaModel

In [39]:
Passes=25
Number_of_topics=10
ldamodel = Lda(corpus, num_topics=Number_of_topics, id2word = id_map, passes=Passes,random_state=34)

In [40]:
ldamodel.print_topics(num_topics=10, num_words=5)

[(0,
  u'0.211*"just" + 0.083*"book" + 0.074*"better" + 0.055*"time" + 0.053*"try"'),
 (1,
  u'0.223*"pen" + 0.099*"thanks" + 0.080*"did" + 0.075*"device" + 0.067*"help"'),
 (2,
  u'0.349*"book" + 0.124*"love" + 0.080*"ms" + 0.060*"phone" + 0.041*"run"'),
 (3,
  u'0.154*"buy" + 0.136*"good" + 0.105*"book" + 0.086*"want" + 0.073*"going"'),
 (4,
  u'0.461*"microsoft" + 0.200*"studio" + 0.072*"ipad" + 0.042*"app" + 0.036*"just"'),
 (5,
  u'0.147*"video" + 0.102*"microsoft" + 0.073*"liked" + 0.062*"added" + 0.060*"playlist"'),
 (6,
  u'0.260*"new" + 0.148*"really" + 0.108*"like" + 0.064*"great" + 0.046*"book"'),
 (7,
  u'0.145*"macbook" + 0.138*"work" + 0.097*"getting" + 0.087*"got" + 0.063*"surfacepro4"'),
 (8,
  u'0.202*"windows" + 0.105*"10" + 0.105*"need" + 0.098*"screen" + 0.075*"using"'),
 (9,
  u'0.164*"tablet" + 0.096*"laptop" + 0.092*"ve" + 0.067*"surfacebook" + 0.047*"today"')]

In [42]:
len(list_corpus)

1831

In [44]:
ldamodel[list_corpus[1830]]

[(0, 0.020004321303857896),
 (1, 0.020007612761615265),
 (2, 0.41997724728185687),
 (3, 0.020000000037507758),
 (4, 0.020000124144786073),
 (5, 0.020000000089457341),
 (6, 0.020013342610422043),
 (7, 0.020010318008410445),
 (8, 0.020000621599945052),
 (9, 0.41998641216214122)]

Sources:
    https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
    https://stats.stackexchange.com/questions/120031/in-lda-how-to-interpret-the-meaning-of-topics
    https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation