In [118]:
#Gensim takes a corpus (a collection of text documents) and produces a vector representation of the text in the corpus.
#The vector representation can then be used to train a model, which is an algorithm that creates different, usually more
#semantic, representations of the data. These three concepts (corpus, vector, model) are key to understanding
#how Gensim works. We'll work through a simple example that illustrates each of them.

In [119]:
# Gensim Small example 
# Small corpus for this example 


In [211]:
raw_corpus = ["Human machine interface for lab abc computer applications human",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [212]:
#A corpus is a collection of digital documents. The corpus is fed to Gensim, which infers the structure of the
#documents and extracts topics from them. Once the algorithm has learned how to infer topics from the training corpus,
#it can be used to assign topics to new documents that were not present in the training corpus. For this reason, we also refer
#to this collection as the training corpus. No human intervention is required: the topic classification is unsupervised.

In [213]:
print(raw_corpus)


['Human machine interface for lab abc computer applications human', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']


In [214]:
# Open the file in a with-block so that the handle is closed after reading
with open(r'C:\Users\lohit\Desktop\course\KDD\HealthNews\Dentistry00501.txt', 'r') as raw_corpus1:
    print(raw_corpus1.read())

Just one 10-second kiss transfers 80 million bacteria
 <header>In the 1960s, a singer named Betty Everett belted, "If you wanna know if he loves you so, it's in his kiss!" Covered by Cher in the 1990s, the song neglects to mention what is also "in his kiss" - 80 million bacteria, according to a new study published in the journal <em>Microbiome</em>.</header><div class="photobox_right" style='max-width:350px;' ><img src="http://www.medicalnewstoday.com/images/articles/285/285563/couple-kissing.jpg" alt="Couple kissing"><br><em>What is "in his kiss"? According to the latest study, 80 million bacteria.</em></div><p>Before germaphobes swear off kissing forever, it should be noted that over 100 trillion microorganisms naturally live in our bodies. Called the microbiome, they are vital for digesting food, synthesizing nutrients and preventing disease.</p><p>The researchers - led by Remco Kort, of TNO (Netherlands Organisation for Applied Scientific Research) and adviser to the Micropia museu

In [215]:
#First, we remove commonly used English words (stop words such as 'the', 'a', 'of') from every document.
#Second, we count how many times each remaining word occurs across the whole corpus.
#Third, we keep the words whose count exceeds a threshold. The threshold used here is 0, so every word is kept;
#raising it to 1 would drop the words that occur only once in the corpus.
#Finally, we display the processed corpus.

In [216]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep words whose frequency exceeds the threshold (0 keeps every word; use 1 to drop words that appear only once)
processed_corpus = [[token for token in text if frequency[token] > 0] for text in texts]
processed_corpus

[['human',
  'machine',
  'interface',
  'lab',
  'abc',
  'computer',
  'applications',
  'human'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]
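
#The filter above compares frequency[token] against 0, so every token survives. The following is a minimal sketch of how
#the threshold changes the result, using a two-document excerpt of the corpus. The names keep_all and drop_singletons
#are illustrative and do not appear elsewhere in this notebook.

```python
from collections import Counter

# Two tiny tokenized documents, excerpted from the corpus above.
texts = [
    ["graph", "minors", "iv", "widths", "trees"],
    ["graph", "minors", "survey"],
]

# Count how often each token occurs across all documents.
frequency = Counter(token for text in texts for token in text)

# Threshold 0 keeps everything; threshold 1 drops tokens seen only once.
keep_all = [[t for t in text if frequency[t] > 0] for text in texts]
drop_singletons = [[t for t in text if frequency[t] > 1] for text in texts]

print(keep_all)
print(drop_singletons)
```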

In [217]:
#We now need to map our tokens to numbers. Each distinct word that survived preprocessing is associated with a unique
#integer ID. We can do this using the gensim.corpora.Dictionary class. This dictionary defines the vocabulary of all
#words that our processing knows about.

In [218]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
dictionary.save('C:/Users/lohit/AppData/Local/Temp/den.dict')
print(dictionary)

2017-10-25 18:24:25,961 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-10-25 18:24:25,965 : INFO : built Dictionary(35 unique tokens: ['human', 'machine', 'interface', 'lab', 'abc']...) from 9 documents (total 53 corpus positions)
2017-10-25 18:24:25,969 : INFO : saving Dictionary object under C:/Users/lohit/AppData/Local/Temp/den.dict, separately None
2017-10-25 18:24:25,975 : INFO : saved C:/Users/lohit/AppData/Local/Temp/den.dict


Dictionary(35 unique tokens: ['human', 'machine', 'interface', 'lab', 'abc']...)
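
#To illustrate the idea of a token-to-ID mapping, here is a plain-Python sketch. It is not Gensim's implementation:
#build_token2id is a hypothetical helper, and assigning IDs in order of first appearance is an assumption made for
#this example (the ordering Gensim uses may differ between versions).

```python
def build_token2id(tokenized_docs):
    """Assign each distinct token the next free integer ID, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


docs = [["human", "machine", "interface"], ["survey", "user", "interface"]]
token2id = build_token2id(docs)
print(token2id)
```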


In [219]:
#Next, we need to represent the documents mathematically to be able to continue further processing, so we represent each 
#document as a vector. We use the bag-of-words model where each document is represented by a vector containing the frequency 
#counts of each word in the dictionary. The length of the vector is the number of entries in the dictionary. One of the main 
#properties of the bag-of-words model is that it completely ignores the order of the tokens in the document that is encoded,
#hence bag-of-words.

In [220]:
print(dictionary.token2id)

{'human': 0, 'machine': 1, 'interface': 2, 'lab': 3, 'abc': 4, 'computer': 5, 'applications': 6, 'survey': 7, 'user': 8, 'opinion': 9, 'system': 10, 'response': 11, 'time': 12, 'eps': 13, 'management': 14, 'engineering': 15, 'testing': 16, 'relation': 17, 'perceived': 18, 'error': 19, 'measurement': 20, 'generation': 21, 'random': 22, 'binary': 23, 'unordered': 24, 'trees': 25, 'intersection': 26, 'graph': 27, 'paths': 28, 'minors': 29, 'iv': 30, 'widths': 31, 'well': 32, 'quasi': 33, 'ordering': 34}


In [221]:
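# Now we try two small examples and convert each new document into a bag-of-words vector.
# Words that are not in the dictionary are ignored, so the first example yields an empty vector.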

In [222]:
new_doc = "found when among transfer saliva"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[]

In [223]:
new_doc = "EPS computer relation"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(5, 1), (13, 1), (17, 1)]
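
#The behaviour seen in the two examples above can be sketched in plain Python. This is only an illustration of the
#bag-of-words idea, not Gensim's code: to_bow is a hypothetical helper, and token2id below is a three-word excerpt of the
#vocabulary printed earlier.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# A three-word excerpt of the vocabulary printed earlier.
token2id = {"computer": 5, "eps": 13, "relation": 17}

known = to_bow("eps computer relation".split(), token2id)
unknown = to_bow("found when among transfer saliva".split(), token2id)
print(known)
print(unknown)
```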

In [224]:
#The first entry in each tuple corresponds to the ID of the token in the dictionary, the second corresponds to the count
#of this token. Now we convert the whole corpus into vector form.
#We save the result in our temporary folder in Matrix Market format, with the extension ".mm". Note that while this list
#lives entirely in memory, in most applications you will want a more scalable solution.

In [225]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
corpora.MmCorpus.serialize('C:/Users/lohit/AppData/Local/Temp/den.mm', bow_corpus)
bow_corpus

2017-10-25 18:24:31,112 : INFO : storing corpus in Matrix Market format to C:/Users/lohit/AppData/Local/Temp/den.mm
2017-10-25 18:24:31,116 : INFO : saving sparse matrix to C:/Users/lohit/AppData/Local/Temp/den.mm
2017-10-25 18:24:31,118 : INFO : PROGRESS: saving document #0
2017-10-25 18:24:31,121 : INFO : saved 9x35 matrix, density=16.190% (51/315)
2017-10-25 18:24:31,126 : INFO : saving MmCorpus index to C:/Users/lohit/AppData/Local/Temp/den.mm.index


[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(2, 1), (8, 1), (10, 1), (13, 1), (14, 1)],
 [(0, 1), (10, 2), (13, 1), (15, 1), (16, 1)],
 [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)],
 [(25, 1), (26, 1), (27, 1), (28, 1)],
 [(25, 1), (27, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
 [(7, 1), (27, 1), (29, 1)]]
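
#One way to avoid holding the whole corpus in memory is to yield one vector at a time. The sketch below assumes the
#documents arrive as an iterable of lines. StreamedCorpus is a hypothetical class written for this example, and the
#vocabulary is a small excerpt of the dictionary above.

```python
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words vector per line instead of building a list in memory."""

    def __init__(self, lines, token2id):
        self.lines = lines        # any re-iterable of strings, e.g. a list of lines
        self.token2id = token2id  # token -> integer ID mapping

    def __iter__(self):
        for line in self.lines:
            counts = Counter(
                self.token2id[t] for t in line.lower().split() if t in self.token2id
            )
            yield sorted(counts.items())


token2id = {"graph": 27, "minors": 29, "survey": 7}
lines = ["Graph minors A survey", "Graph graph minors"]
for vector in StreamedCorpus(lines, token2id):
    print(vector)
```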

In [226]:
#Now that we have vectorized our corpus, we can begin to transform it using models. We use "model" as an abstract term
#referring to a transformation from one document representation to another. In Gensim, documents are represented as vectors,
#so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned
#from the training corpus. One simple example of a model is tf-idf, which transforms vectors from the bag-of-words
#representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in
#the corpus.
#Term Frequency, Inverse Document Frequency (TF-IDF) scores the importance of a word (or "term") in a document by combining
#how often it appears in that document with how many documents in the corpus contain it.
#If a word appears frequently in a document, it is likely important there, and TF-IDF raises its score.
#If a word appears in many documents, it is not a distinguishing feature, and TF-IDF lowers its score.
#Therefore, stop words are scaled down, while words that are frequent in one document but rare elsewhere are scaled up.
#For a term t in a document d, the weight W_{t,d} is given by:
#W_{t,d} = TF_{t,d} \cdot \log(N / DF_t)
#where
#TF_{t,d} is the number of occurrences of t in document d,
#DF_t is the number of documents containing term t,
#N is the total number of documents in the corpus.

In [227]:
from gensim import models
# train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the "human computer interaction EPS" string ("interaction" is not in the dictionary and is ignored)
tfidf[dictionary.doc2bow("human computer interaction EPS".lower().split())]

2017-10-25 18:24:32,547 : INFO : collecting document frequencies
2017-10-25 18:24:32,549 : INFO : PROGRESS: processing document #0
2017-10-25 18:24:32,551 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)


[(0, 0.5773502691896257), (5, 0.5773502691896257), (13, 0.5773502691896257)]

In [228]:
#The tfidf model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf
#weighting.
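
#The weights printed above can be checked by hand. The sketch below applies W = TF * log(N / DF) and then scales the
#vector to unit length. The base-2 logarithm and the unit-length normalization are assumptions about the model's default
#settings, chosen because they reproduce the numbers in this notebook. tfidf_weights is a hypothetical helper.

```python
import math


def tfidf_weights(bow, doc_freq, num_docs):
    """Weight each (token_id, count) by count * log2(N / df), then scale to unit length."""
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm else []


# Document frequencies read off the nine-document corpus:
# 'human' (0), 'computer' (5) and 'eps' (13) each occur in 2 documents, 'machine' (1) in 1.
doc_freq = {0: 2, 1: 1, 5: 2, 13: 2}

query_weights = tfidf_weights([(0, 1), (5, 1), (13, 1)], doc_freq, num_docs=9)
print(query_weights)
```

#Because the three query words have the same document frequency, they receive the same raw weight, and normalization
#leaves each at 1/sqrt(3), about 0.577.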

In [229]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [230]:
import tempfile
import os.path

TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "C:\Users\lohit\AppData\Local\Temp" will be used to save temporary dictionary and corpus.


In [231]:
from gensim import corpora, models, similarities
if os.path.isfile(os.path.join(TEMP_FOLDER, 'den.dict')):
    dictionary = corpora.Dictionary.load(os.path.join(TEMP_FOLDER, 'den.dict'))
    corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'den.mm'))
    print("Used files generated before ")
else:
    print("Run again error")

2017-10-25 18:24:35,514 : INFO : loading Dictionary object from C:\Users\lohit\AppData\Local\Temp\den.dict
2017-10-25 18:24:35,519 : INFO : loaded C:\Users\lohit\AppData\Local\Temp\den.dict
2017-10-25 18:24:35,522 : INFO : loaded corpus index from C:\Users\lohit\AppData\Local\Temp\den.mm.index
2017-10-25 18:24:35,525 : INFO : initializing corpus reader from C:\Users\lohit\AppData\Local\Temp\den.mm
2017-10-25 18:24:35,529 : INFO : accepted corpus with 9 documents, 35 features, 51 non-zero entries


Used files generated before 


In [232]:
print(dictionary[0])
print(dictionary[1])
print(dictionary[2])

human
machine
interface


In [233]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2017-10-25 18:24:37,005 : INFO : collecting document frequencies
2017-10-25 18:24:37,012 : INFO : PROGRESS: processing document #0
2017-10-25 18:24:37,014 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)


In [234]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors

[(0, 0.5648663441460566), (1, 0.8251824121072071)]


In [235]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5245699338309155), (1, 0.38315779281548723), (2, 0.2622849669154578), (3, 0.38315779281548723), (4, 0.38315779281548723), (5, 0.2622849669154578), (6, 0.38315779281548723)]
[(5, 0.3726494271826947), (7, 0.3726494271826947), (8, 0.27219160459794917), (9, 0.5443832091958983), (10, 0.27219160459794917), (11, 0.3726494271826947), (12, 0.3726494271826947)]
[(2, 0.438482464916089), (8, 0.32027755044706185), (10, 0.32027755044706185), (13, 0.438482464916089), (14, 0.6405551008941237)]
[(0, 0.3449874408519962), (10, 0.5039733231394895), (13, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(8, 0.21953536176370683), (11, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.6324555320336759), (

In [236]:

#Here we transform our TF-IDF corpus via Latent Semantic Indexing (LSI) into a latent 2-D space (2-D because we set num_topics=2).
#Now you're probably wondering: what do these two latent dimensions stand for? Let's inspect them with models.LsiModel.print_topics():


In [237]:

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2017-10-25 18:24:39,891 : INFO : using serial LSI version on this node
2017-10-25 18:24:39,895 : INFO : updating model with new documents
2017-10-25 18:24:39,901 : INFO : preparing a new chunk of documents
2017-10-25 18:24:39,904 : INFO : using 100 extra samples and 2 power iterations
2017-10-25 18:24:39,906 : INFO : 1st phase: constructing (35, 102) action matrix
2017-10-25 18:24:39,909 : INFO : orthonormalizing (35, 102) action matrix
2017-10-25 18:24:39,916 : INFO : 2nd phase: running dense svd on (35, 9) matrix
2017-10-25 18:24:39,920 : INFO : computing the final decomposition
2017-10-25 18:24:39,922 : INFO : keeping 2 factors (discarding 66.599% of energy spectrum)
2017-10-25 18:24:39,924 : INFO : processed documents up to #9
2017-10-25 18:24:39,926 : INFO : topic #0(1.271): 0.408*"system" + 0.301*"survey" + 0.283*"user" + 0.282*"eps" + 0.246*"human" + 0.236*"management" + 0.227*"opinion" + 0.226*"response" + 0.226*"time" + 0.224*"interface"
2017-10-25 18:24:39,928 : INFO : topic 

In [238]:
#The topics are printed to the log (logging is activated with logging.basicConfig in this notebook).
#According to LSI, "minors", "graph" and "trees" are related words that contribute the most to the direction of the
#second topic (topic #1), while the first topic (topic #0) is dominated by the remaining words such as "system", "user" and "eps".
#As expected, the first five documents are more strongly related to the first topic, while the remaining four documents
#are more strongly related to the second topic:


In [239]:
lsi.print_topics(2)

2017-10-25 18:24:42,179 : INFO : topic #0(1.271): 0.408*"system" + 0.301*"survey" + 0.283*"user" + 0.282*"eps" + 0.246*"human" + 0.236*"management" + 0.227*"opinion" + 0.226*"response" + 0.226*"time" + 0.224*"interface"
2017-10-25 18:24:42,185 : INFO : topic #1(1.180): 0.425*"minors" + 0.422*"graph" + 0.313*"survey" + 0.236*"trees" + 0.222*"intersection" + 0.222*"paths" + 0.188*"widths" + 0.188*"ordering" + 0.188*"quasi" + 0.188*"well"


[(0,
  '0.408*"system" + 0.301*"survey" + 0.283*"user" + 0.282*"eps" + 0.246*"human" + 0.236*"management" + 0.227*"opinion" + 0.226*"response" + 0.226*"time" + 0.224*"interface"'),
 (1,
  '0.425*"minors" + 0.422*"graph" + 0.313*"survey" + 0.236*"trees" + 0.222*"intersection" + 0.222*"paths" + 0.188*"widths" + 0.188*"ordering" + 0.188*"quasi" + 0.188*"well"')]

In [240]:
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

[(0, 0.38491889088288589), (1, -0.23799083182688749)]
[(0, 0.67327570398106373), (1, 0.061063011983416668)]
[(0, 0.59429153367225518), (1, -0.32013947136769177)]
[(0, 0.56595474120106282), (1, -0.3443358962412072)]
[(0, 0.37883572265509352), (1, -0.013238424814016938)]
[(0, 0.032346275240837975), (1, 0.17672649803391668)]
[(0, 0.13320822144013733), (1, 0.48928974583068763)]
[(0, 0.19468626669806066), (1, 0.6377245136397599)]
[(0, 0.37343003126004548), (1, 0.65767275604411801)]
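
#To read the output above more easily, we can pick the topic with the largest absolute weight for each document. This is
#a simple heuristic for illustration, not part of Gensim. The weights are copied from the output above and rounded to
#three decimals; lsi_vectors and dominant_topic are hypothetical names.

```python
# (topic_id, weight) pairs copied from the LSI output above, rounded to three decimals.
lsi_vectors = [
    [(0, 0.385), (1, -0.238)],
    [(0, 0.673), (1, 0.061)],
    [(0, 0.594), (1, -0.320)],
    [(0, 0.566), (1, -0.344)],
    [(0, 0.379), (1, -0.013)],
    [(0, 0.032), (1, 0.177)],
    [(0, 0.133), (1, 0.489)],
    [(0, 0.195), (1, 0.638)],
    [(0, 0.373), (1, 0.658)],
]


def dominant_topic(doc):
    """Return the topic ID whose weight has the largest absolute value."""
    return max(doc, key=lambda pair: abs(pair[1]))[0]


dominant = [dominant_topic(doc) for doc in lsi_vectors]
print(dominant)
```

#The first five documents come out under topic 0 and the last four under topic 1.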


In [241]:
lsi.save(os.path.join(TEMP_FOLDER, 'model.lsi')) # same for tfidf, lda, ...
#lsi = models.LsiModel.load(os.path.join(TEMP_FOLDER, 'model.lsi'))

2017-10-25 18:24:43,985 : INFO : saving Projection object under C:\Users\lohit\AppData\Local\Temp\model.lsi.projection, separately None
2017-10-25 18:24:44,000 : INFO : saved C:\Users\lohit\AppData\Local\Temp\model.lsi.projection
2017-10-25 18:24:44,002 : INFO : saving LsiModel object under C:\Users\lohit\AppData\Local\Temp\model.lsi, separately None
2017-10-25 18:24:44,005 : INFO : not storing attribute projection
2017-10-25 18:24:44,007 : INFO : not storing attribute dispatcher
2017-10-25 18:24:44,012 : INFO : saved C:\Users\lohit\AppData\Local\Temp\model.lsi


In [242]:
#A common reason for going through all these transformations is that we want to determine the similarity between pairs
#of documents, or between a specific document and a set of other documents (such as a user query vs. indexed documents).
#To show how this can be done in gensim, we first reload the dictionary and corpus saved earlier:

In [243]:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('C:/Users/lohit/AppData/Local/Temp/den.dict')
corpus = corpora.MmCorpus('C:/Users/lohit/AppData/Local/Temp/den.mm') # the corpus serialized earlier in this notebook
print(corpus)

2017-10-25 18:24:45,463 : INFO : loading Dictionary object from C:/Users/lohit/AppData/Local/Temp/den.dict
2017-10-25 18:24:45,467 : INFO : loaded C:/Users/lohit/AppData/Local/Temp/den.dict
2017-10-25 18:24:45,470 : INFO : loaded corpus index from C:/Users/lohit/AppData/Local/Temp/den.mm.index
2017-10-25 18:24:45,473 : INFO : initializing corpus reader from C:/Users/lohit/AppData/Local/Temp/den.mm
2017-10-25 18:24:45,475 : INFO : accepted corpus with 9 documents, 35 features, 51 non-zero entries


MmCorpus(9 documents, 35 features, 51 non-zero entries)


In [244]:
#Now suppose a user typed in the query “Human computer interaction”. We would like to sort our nine corpus documents in 
#decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of
#possible similarities—on apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks,
#just a semantic extension over the boolean keyword match:

In [245]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
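# convert the query to LSI space; lsi was trained on TF-IDF vectors, so lsi[tfidf[vec_bow]] would be the
# consistent input (the output shown below was produced from the raw bag-of-words vector)
vec_lsi = lsi[vec_bow]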
print(vec_lsi)


[(0, 0.46390054531878433), (1, -0.20359920281643867)]


In [246]:
#In addition, we will be considering cosine similarity to determine the similarity of two vectors. Cosine similarity is a
#standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity 
#measures may be more appropriate.
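
#As a reminder of what the measure computes, here is a minimal plain-Python sketch of cosine similarity for dense
#vectors. cosine_similarity is a hypothetical helper, and returning 0.0 for a zero vector is a convention chosen for
#this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

#Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.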

In [247]:
#Initializing query structures
#To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries.
#In our case, they are the same nine documents used for training LSI, converted to 2-D LSA space. But that is only
#incidental; we might also be indexing a different corpus altogether.

In [248]:
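# transform corpus to LSI space and index it (raw counts are used here; lsi[tfidf[corpus]] would match the
# TF-IDF input the model was trained on)
index = similarities.MatrixSimilarity(lsi[corpus])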

2017-10-25 18:24:49,087 : INFO : creating matrix with 9 documents and 2 features


In [249]:
#Index persistence is handled via the standard save() and load() functions:

In [250]:
index.save('C:/Users/lohit/AppData/Local/Temp/den.index')
index = similarities.MatrixSimilarity.load('C:/Users/lohit/AppData/Local/Temp/den.index')

2017-10-25 18:24:50,465 : INFO : saving MatrixSimilarity object under C:/Users/lohit/AppData/Local/Temp/den.index, separately None
2017-10-25 18:24:50,472 : INFO : saved C:/Users/lohit/AppData/Local/Temp/den.index
2017-10-25 18:24:50,474 : INFO : loading MatrixSimilarity object from C:/Users/lohit/AppData/Local/Temp/den.index
2017-10-25 18:24:50,478 : INFO : loaded C:/Users/lohit/AppData/Local/Temp/den.index


In [251]:
#This is true for all similarity indexing classes (similarities.Similarity, similarities.MatrixSimilarity and
#similarities.SparseMatrixSimilarity). Also in the following, index can be an object of any of these. When in doubt, 
#use similarities.Similarity, as it is the most scalable version, and it also supports adding more documents to the index later.

In [252]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.99145329), (1, 0.8966043), (2, 0.99835241), (3, 0.99378079), (4, 0.9334408), (5, -0.21829586), (6, -0.13329136), (7, -0.10740998), (8, 0.08808583)]


In [253]:
#The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so the first document
#has a score of 0.99145329, and so on.
#With some standard Python magic we sort these similarities into descending order, and obtain the final answer to the query
#"Human computer interaction":

In [254]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 0.99835241), (3, 0.99378079), (0, 0.99145329), (4, 0.9334408), (1, 0.8966043), (8, 0.08808583), (7, -0.10740998), (6, -0.13329136), (5, -0.21829586)]


In [255]:
#The thing to note here is that documents no. 2 ("The EPS user interface management system") and
#4 ("Relation of user perceived response time to error measurement") would never be returned by a standard boolean fulltext
#search, because they do not share any common words with "Human computer interaction". 
#However, after applying LSI, we can observe that both of them received quite high similarity scores 
#(no. 2 is actually the most similar!), which corresponds better to our intuition of them sharing a “computer-human” 
#related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do
#topic modelling in the first place.
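
#To make the ranking easier to read, the scores can be paired with the original document texts. The sketch below copies
#the nine similarity scores from the query output above. The names documents, scores and ranked are illustrative.

```python
documents = [
    "Human machine interface for lab abc computer applications human",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Similarity scores copied from the query output above, in document order.
scores = [0.99145329, 0.8966043, 0.99835241, 0.99378079, 0.9334408,
          -0.21829586, -0.13329136, -0.10740998, 0.08808583]

# Sort document numbers by score, highest first, and print each with its text.
ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(f"{score:+.3f}  #{doc_id}  {documents[doc_id]}")
```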