# Gensim
Gensim is billed as a NLP package that does **Topic Modeling** for humans.   
Gensim provides algorithms like *LDA* and *LSI* and the necessary sophistication to build high-quality topic models.   
It's significant advantage is : it lets you handle large text files without having to load the entire file in memory.

In order to work on text docs, Gensim requires the words be converted to unique ids. In order to achieve that, Gensim lets you create `Dictionary` object that maps each word to unique id.   
Create a `Dictionary` by converting text to a list of words and pass it to the `corpora.Dictionary()`.

Let's start with the 'List of sentences' input.   
Need to convert each sentence to a list of words and list comprehension is a common way to do this.

In [1]:
import gensim
from gensim import corpora
from pprint import pprint

docu1 = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

# tokenize the sentences into words
texts1 = [[text for text in doc.split()] for doc in docu1]

# create dictionary
dictionary = corpora.Dictionary(texts1)

# get information about the dictionary
print(dictionary)
print(dictionary.token2id)

Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)
{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}


In [None]:
# docu2 = ["One source says the report will likely conclude that", 
#                 "the operation was carried out without clearance and", 
#                 "transparency and that those involved will be held", 
#                 "responsible. One of the sources acknowledged that the", 
#                 "report is still being prepared and cautioned that", 
#                 "things could change."]

# texts2 = [[text for text in doc.split()] for doc in docu2]
# dictionary.add_documents(texts2)
# print(dictionary)

In [2]:
# update an existing dictionary to include the new words
docu2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts2 = [[text for text in doc.split()] for doc in docu2]
dictionary.add_documents(texts2)
print(dictionary)

Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


### Create a Dictionary from one or more text files
Below example reads a file line-by-line and uses gensim's `simple_preprocess` to process one line of the file at a time.

In [3]:
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in open('sample1.txt', encoding='utf-8'))
# dictionary.token2id
print(dictionary.token2id)

{'army': 0, 'chinaa': 1, 'chinese': 2, 'force': 3, 'liberation': 4, 'of': 5, 'peoplea': 6, 'recently': 7, 'recruited': 8, 'rocket': 9, 'tank': 10, 'technicians': 11, 'the': 12, 'think': 13, 'companies': 14, 'daily': 15, 'from': 16, 'on': 17, 'pla': 18, 'private': 19, 'reported': 20, 'saturday': 21, 'and': 22, 'appointment': 23, 'at': 24, 'ceremony': 25, 'experts': 26, 'founding': 27, 'hao': 28, 'letters': 29, 'other': 30, 'received': 31, 'science': 32, 'technology': 33, 'zhang': 34, 'according': 35, 'by': 36, 'defense': 37, 'national': 38, 'panel': 39, 'published': 40, 'report': 41, 'to': 42, 'as': 43, 'fellow': 44, 'his': 45, 'honored': 46, 'will': 47, 'œrocket': 48, 'conduct': 49, 'design': 50, 'fields': 51, 'into': 52, 'like': 53, 'members': 54, 'overall': 55, 'research': 56, 'serve': 57, 'which': 58, 'five': 59, 'for': 60, 'launching': 61, 'missile': 62, 'missiles': 63, 'network': 64, 'system': 65, 'years': 66, 'counterparts': 67, 'enjoy': 68, 'firms': 69, 'owned': 70, 'said': 71, 

Assuming all the text files in the same directory, you need to define a class with an `__iter__()` method which should iterate through all the files in a given directory and yield the processed list of word tokens.  

In [4]:
class ReadTxtFiles(object):
    def __init__(self,dirname):
        self.dirname = dirname
    
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), encoding='latin'):
                yield simple_preprocess(line)
                
path_to_text_directory = 'lsa_sports_food_docs'

dictionary = corpora.Dictionary(ReadTxtFiles(path_to_text_directory))
print(dictionary)
print(dictionary.token2id)

Dictionary(525 unique tokens: ['and', 'are', 'at', 'be', 'but']...)
{'and': 0, 'are': 1, 'at': 2, 'be': 3, 'but': 4, 'can': 5, 'common': 6, 'cultures': 7, 'cut': 8, 'dough': 9, 'eat': 10, 'eaten': 11, 'extracted': 12, 'extruded': 13, 'far': 14, 'flat': 15, 'food': 16, 'form': 17, 'from': 18, 'in': 19, 'into': 20, 'is': 21, 'it': 22, 'made': 23, 'many': 24, 'more': 25, 'noodle': 26, 'noodles': 27, 'of': 28, 'once': 29, 'one': 30, 'or': 31, 'plural': 32, 'rolled': 33, 'see': 34, 'serve': 35, 'serving': 36, 'shapes': 37, 'single': 38, 'staple': 39, 'stretched': 40, 'the': 41, 'thus': 42, 'to': 43, 'unleavened': 44, 'variety': 45, 'which': 46, 'word': 47, 'accompanying': 48, 'added': 49, 'ago': 50, 'an': 51, 'been': 52, 'boiling': 53, 'china': 54, 'composition': 55, 'consumption': 56, 'cooked': 57, 'cooking': 58, 'cultural': 59, 'deep': 60, 'derives': 61, 'discussing': 62, 'dried': 63, 'evidence': 64, 'folded': 65, 'for': 66, 'found': 67, 'fried': 68, 'future': 69, 'geo': 70, 'german': 71,

### Create a bag of words corpus in gensim
Corpus object contains the word id and its frequency in each document.   
You can think of it as gensim's equivalent of a Docu-Term matrix.   
To create a bag of words corpus, you need to pass the tokenized list of words to the `Dictionary.doc2bow()`   
*the order of the words gets lost. Just the word and it’s frequency information is retained.*

In [5]:
docu3 = ["Who let the dogs out?",
           "Who? Who? Who? Who?"]

tokenized_list = [simple_preprocess(doc) for doc in docu3]

dict3 = corpora.Dictionary()
corpus3 = [dict3.doc2bow(doc, allow_update=True) for doc in tokenized_list]
pprint(corpus3)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]


In [6]:
# get the original texts back
word_counts = [[(dict3[idx], count) for idx, count in line] for line in corpus3]
pprint(word_counts)

[[('dogs', 1), ('let', 1), ('out', 1), ('the', 1), ('who', 1)], [('who', 4)]]


### Create a bag of words corpus from a text file
The `__iter__()` from `BoWCorpus` reads a line from the file, process it to a list of words using `simple_preprocess()` and pass that to the `dictionary.doc2bow()`.   
`smart_open()` from `smart_open` package lets you open and read large files line-by-line from a variety of sources. However, `open()` for a file in system will work perfectly fine as well.

In [7]:
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from smart_open import smart_open
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [8]:
class BoWCorpus:
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary
        
    def __iter__(self):
        global mydict # only if updating the source dictionary
        for line in smart_open(self.filepath, encoding='latin'):
            tokenized_list = simple_preprocess(line, deacc=True)
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)
            mydict.merge_with(self.dictionary) # update the source dictionary
            yield bow # lazy return
            
mydict = corpora.Dictionary()

# text 파일 하나에서 '\n' 기준 문장 나눠서
bow_corpus = BoWCorpus('sample1.txt', dictionary=mydict)
for line in bow_corpus:
    print(line)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]
[(14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)]
[(5, 2), (12, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(3, 1), (9, 1), (12, 2), (18, 1), (22, 1), (26, 1), (32, 1), (33, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1)]
[(15, 1), (17, 1), (18, 1), (21, 1)]
[(3, 1), (9, 1), (14, 1), (16, 1), (19, 1), (22, 2), (26, 2), (32, 1), (33, 1), (34, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)]
[(3, 1), (5, 2), (9, 1), (10, 1), (12, 1), (13, 1), (18, 1), (43, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1)]
[(12, 1), (22, 1), (33, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)]
[(12, 3), (16, 1), (26, 1), (41, 1), (43, 1), (47, 1), (66, 1), (67, 1), (68, 1), (69, 1), (

### Save a gensim dictionary and corpus to disk and load them back

In [9]:
mydict.save('mydict.dict')
corpora.MmCorpus.serialize('bow_corpus.mm', bow_corpus)

In [10]:
loaded_dict = corpora.Dictionary.load('mydict.dict')
corpus = corpora.MmCorpus('bow_corpus.mm')

for line in corpus:
    print(line)


[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0), (12, 1.0), (13, 1.0)]
[(14, 1.0), (15, 1.0), (16, 1.0), (17, 1.0), (18, 1.0), (19, 1.0), (20, 1.0), (21, 1.0)]
[(5, 2.0), (12, 1.0), (22, 2.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 1.0), (27, 1.0), (28, 1.0), (29, 1.0), (30, 1.0), (31, 1.0), (32, 1.0), (33, 1.0), (34, 1.0)]
[(3, 1.0), (9, 1.0), (12, 2.0), (18, 1.0), (22, 1.0), (26, 1.0), (32, 1.0), (33, 1.0), (35, 1.0), (36, 1.0), (37, 1.0), (38, 1.0), (39, 1.0), (40, 1.0), (41, 1.0), (42, 1.0)]
[(15, 1.0), (17, 1.0), (18, 1.0), (21, 1.0)]
[(3, 1.0), (9, 1.0), (14, 1.0), (16, 1.0), (19, 1.0), (22, 2.0), (26, 2.0), (32, 1.0), (33, 1.0), (34, 1.0), (43, 1.0), (44, 1.0), (45, 1.0), (46, 1.0), (47, 1.0)]
[(3, 1.0), (5, 2.0), (9, 1.0), (10, 1.0), (12, 1.0), (13, 1.0), (18, 1.0), (43, 1.0), (47, 1.0), (48, 1.0), (49, 1.0), (50, 1.0), (51, 1.0), (52, 1.0), (53, 1.0), (54, 1.0), (55, 1.0), (56, 1.0), (57, 1.0)]
[(12, 1.0)

### Create the TF-IDF matrix in gensim
TF-IDF is also a BoW model but down weights tokens that appears frequently across documents.   
TF-IDF is computed by multiplying a local component like TF with a global component IDF and optionally normalizing the result to unit length.   
Gensim uses the `SMART Information retrieval system` that can be used to implement these variations.   
Specifying `smartirs` parameter in the `TfidfModel` can specify what formula to use. `help(models.TfidfModel)`   
   
How to get the weights : by training the corpus with `models.TfidfModel()`.

In [11]:
from gensim import models
help(models.TfidfModel) # ntc?

Help on class TfidfModel in module gensim.models.tfidfmodel:

class TfidfModel(gensim.interfaces.TransformationABC)
 |  TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fba31352820>, wglobal=<function df2idf at 0x7fba1f70ae50>, normalize=True, smartirs=None, pivot=None, slope=0.25)
 |  
 |  Objects of this class realize the transformation between word-document co-occurrence matrix (int)
 |  into a locally/globally weighted TF-IDF matrix (positive floats).
 |  
 |  Examples
 |  --------
 |  .. sourcecode:: pycon
 |  
 |      >>> import gensim.downloader as api
 |      >>> from gensim.models import TfidfModel
 |      >>> from gensim.corpora import Dictionary
 |      >>>
 |      >>> dataset = api.load("text8")
 |      >>> dct = Dictionary(dataset)  # fit dictionary
 |      >>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
 |      >>>
 |      >>> model = TfidfModel(corpus)  # fit model
 |      >>> vector = model[corpu

In [12]:
from gensim import models
import numpy as np

docu5 = ["This is the first line",
             "This is the second sentence",
             "This third document"]

mydict = corpora.Dictionary([simple_preprocess(line) for line in docu5])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in docu5]

for doc in corpus:
    print([[mydict[idx], freq] for idx, freq in doc])

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]


In [13]:
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# show weights
for doc in tfidf[corpus]:
    print([[mydict[idx], np.around(freq, decimals=2)] for idx, freq in doc])

[['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
[['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
[['this', 0.15], ['document', 0.7], ['third', 0.7]]


*tutorial 참고자료와 다르게 'this'가 사라지지 않았다.*
3번 모두 나타난 'this'는 사라졌어야 했다.

### Use gensim downloader API to load datasets
Gensim provides an inbuilt API to download popular text datasets and word embedding models.    
Using the API to download the dataset is as simple as calling the `api.load()` method with the right data or model name.   

In [15]:
import gensim.downloader as api

api.info()
api.info('glove-wiki-gigaword-50')

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

In [16]:
# download
w2v_model = api.load('glove-wiki-gigaword-50')
w2v_model.most_similar('blue')

[('red', 0.8901657462120056),
 ('black', 0.8648406863212585),
 ('pink', 0.8452917337417603),
 ('green', 0.834681510925293),
 ('yellow', 0.8320708274841309),
 ('purple', 0.829311192035675),
 ('white', 0.8225340843200684),
 ('orange', 0.8114302754402161),
 ('bright', 0.799933910369873),
 ('colored', 0.787665605545044)]

In [17]:
w2v_model.most_similar('red')

[('yellow', 0.8995459079742432),
 ('blue', 0.8901659250259399),
 ('green', 0.8561931848526001),
 ('black', 0.840058445930481),
 ('purple', 0.8323202133178711),
 ('white', 0.8149363994598389),
 ('pink', 0.81486576795578),
 ('orange', 0.8042871356010437),
 ('golden', 0.7416438460350037),
 ('colored', 0.7381109595298767)]

In [18]:
from gensim.models import Word2Vec

test = api.load('text8')
model = Word2Vec(test)
print(model)

Word2Vec(vocab=71290, vector_size=100, alpha=0.025)


### Create bigrams and trigrams using Phraser models
Let's download the `text` dataset, which is nothing but the "First 100,000,000 bytes of plain text from Wikipedia", then generate bigrams and trigrams.   
- Bigrams and Trigrams?   
In paragraphs, certain words always tend to occur in paris or in groups of threes because the two words combined together form the actual entity.   
 - ex) Combine of 'French' & 'Revolution' : 'French Revolution' refers to something completely different.   
 
Use gensim's `Phrases` model to create. As the `Phrases` model allows indexing, just pass the original text to the built model.   
*참고 : N-gram Model*

In [19]:
dataset = api.load('text8')
dataset = [wd for wd in dataset]
# print(dataset)

dct = corpora.Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]
# print(corpus)

# build bigram
bigram = gensim.models.phrases.Phrases(dataset, min_count=3, threshold=10)
# threshold??
print(bigram)
print(bigram[dataset[0]])

Phrases<4400410 vocab, min_count=3, threshold=10, max_vocab_size=40000000>
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans_culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative_way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken_up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived_from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political_philosophy', 'is', 'the', 'belief_that', 'rulers', 'are', 'unnecessary', 'and', 'should_be', 'abolished', 'although', 'there_are', 'differing_interpretations', 'of', 'what', 'this', 'means', 'anarchis

In [28]:
# build trigram
# trigram = gensim.models.phrases.Phrases(bigram[dataset], threshold=10)
print('FIRST : ', trigram[dataset[0]])
# print('SECOND : ', trigram[bigram[dataset[0]]])

FIRST :  ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to_describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also_been', 'taken_up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived_from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political_philosophy', 'is', 'the', 'belief_that', 'rulers', 'are', 'unnecessary', 'and', 'should_be', 'abolished', 'although', 'there_are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers_to', 'related', 'social_movements', 'that',

In [30]:
print(len(bigram[dataset[0]]))
len(trigram[dataset[0]])

9062


8968

### Create Topic Models with LDA
The objective of topic models is to extract the underlying topics from a given collection of text documents. Each document in the text is considered ad a combination of topics and each topic is considered ad a combination of related words.   

It can be done by algorithms like LDA and LSI.   
Both models need the number of topics as input, and will provide the topic keywords for each topic and the percentage contribution of topics in each document.   

** gensim.utils.lemmatize() is no longer supported. Need to use own lemmatizer with `NLTK` or `SpaCy`

In [31]:
from gensim.models import LdaModel, LdaMulticore
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tag import pos_tag
import gensim.downloader as api
import re, logging, nltk

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)
stop_words = stopwords.words('english')
stop_words = stop_words + ['com', 'edu', 'subject', 'lines', 'organization', 'would', 'article', 'could']

In [34]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [35]:
# Step1 : Import the dataset and get the text, real topic of each news article
dataset = api.load('text8')
data = [d for d in dataset]
print('Data[:100] = ',data[:100])

# Step2 : Prepare Data - remove stopwords and lemmatize
data_processed = []
for doc in data[:100]:
    doc_tagged = pos_tag(doc)
    doc_out = []
    for wd, pos in doc_tagged:
        if (wd not in stop_words) and (re.compile('(NN|JJ|RB)').match(pos)):
            lemmatized_word = WordNetLemmatizer().lemmatize(wd)
            
            if lemmatized_word:
                doc_out = doc_out + [wd]
        else:
            continue
            
    data_processed.append(doc_out) # list of list of words

print(data_processed[0][:5])

# Step3 : Create the inputs of LDA model : dictionary and corpus
dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]
# Step : Train the LDA model with 7 topics
lda_model = LdaMulticore(corpus=corpus, id2word=dct, random_state=100,
                        num_topics=7, passes=10, chunksize=1000, batch=False,
                        alpha='asymmetric', decay=0.5, offset=64, eta=None,
                        eval_every=0, iterations=100, gamma_threshold=0.001,
                        per_word_topics=True)

lda_model.save('lda_model_tut.model')

# shows what words contributed to which of the 7 topics, along with the weightage
lda_model.print_topics(-1)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



['anarchism', 'term', 'abuse', 'first', 'early']


[(0,
  '0.001*"zero" + 0.000*"also" + 0.000*"first" + 0.000*"many" + 0.000*"time" + 0.000*"people" + 0.000*"american" + 0.000*"world" + 0.000*"war" + 0.000*"new"'),
 (1,
  '0.001*"zero" + 0.000*"also" + 0.000*"many" + 0.000*"first" + 0.000*"time" + 0.000*"american" + 0.000*"world" + 0.000*"often" + 0.000*"however" + 0.000*"states"'),
 (2,
  '0.017*"zero" + 0.006*"also" + 0.004*"american" + 0.004*"first" + 0.003*"many" + 0.003*"world" + 0.003*"new" + 0.003*"time" + 0.002*"war" + 0.002*"states"'),
 (3,
  '0.001*"zero" + 0.001*"also" + 0.000*"many" + 0.000*"first" + 0.000*"american" + 0.000*"people" + 0.000*"time" + 0.000*"world" + 0.000*"war" + 0.000*"new"'),
 (4,
  '0.003*"zero" + 0.002*"arsenic" + 0.002*"atoms" + 0.001*"antimony" + 0.001*"electrons" + 0.001*"atom" + 0.001*"atomic" + 0.001*"also" + 0.001*"element" + 0.001*"lincoln"'),
 (5,
  '0.002*"zero" + 0.000*"also" + 0.000*"first" + 0.000*"many" + 0.000*"time" + 0.000*"american" + 0.000*"states" + 0.000*"war" + 0.000*"new" + 0.000*

Words like 'also', 'many' coming across different topics.   
Add such words to the `stop_words` to remove them and further tune to topic model for optimal number of topics.   
`LdaMulticore()` supports parallel processing, alternately can use `LdaModel()`.

### Interprete the LDA Topic Model's output
If a document is passed to the `lda_model`, it provides 3 things:   
- The topics that document belongs to along with percentage.
- The topics each word in that document belongs to.
- The topics each word in that document belongs to AND the phi values.   

*Phi value* is the probability of the word belonging to that particular topic.   
In below, the word with id=0 belongs to topic6 and the phi value is 3.999, which means the word appeared 4 times in the document.
*?????? 뭔가 안맞는다????????

In [37]:
for c in lda_model[corpus]:
    print(c)

([(2, 0.9998081)], [(0, [2]), (1, [2]), (2, [2]), (3, [2]), (4, [2]), (5, [2]), (6, [2]), (7, [2]), (8, [2]), (9, [2]), (10, [2]), (11, [2]), (12, [2]), (13, [2]), (14, [2]), (15, [2]), (16, [2]), (17, [2]), (18, [2]), (19, [2]), (20, [2]), (21, [2]), (22, [2]), (23, [2]), (24, [2]), (25, [2]), (26, [2]), (27, [2]), (28, [2]), (29, [2]), (30, [2]), (31, [2]), (32, [2]), (33, [2]), (34, [2]), (35, [2]), (36, [2]), (37, [2]), (38, [2]), (39, [2]), (40, [2]), (41, [2]), (42, [2]), (43, [2]), (44, [2]), (45, [2]), (46, [2]), (47, [2]), (48, [2]), (49, [2]), (50, [2]), (51, [2]), (52, [2]), (53, [2]), (54, [2]), (55, [2]), (56, [2]), (57, [2]), (58, [2]), (59, [2]), (60, [2]), (61, [2]), (62, [2]), (63, [2]), (64, [2]), (65, [2]), (66, [2]), (67, [2]), (68, [2]), (69, [2]), (70, [2]), (71, [2]), (72, [2]), (73, [2]), (74, [2]), (75, [2]), (76, [2]), (77, [2]), (78, [2]), (79, [2]), (80, [2]), (81, [2]), (82, [2]), (83, [2]), (84, [2]), (85, [2]), (86, [2]), (87, [2]), (88, [2]), (89, [2]), 

([(2, 0.9998125)], [(3, [2]), (12, [2]), (18, [2]), (22, [2]), (25, [2]), (28, [2]), (39, [2]), (40, [2]), (42, [2]), (47, [2]), (58, [2]), (59, [2]), (60, [2]), (62, [2]), (66, [2]), (68, [2]), (72, [2]), (85, [2]), (86, [2]), (93, [2]), (95, [2]), (96, [2]), (102, [2]), (103, [2]), (104, [2]), (105, [2]), (110, [2]), (123, [2]), (126, [2]), (128, [2]), (134, [2]), (137, [2]), (138, [2]), (144, [2]), (147, [2]), (149, [2]), (151, [2]), (153, [2]), (163, [2]), (168, [2]), (169, [2]), (170, [2]), (172, [2]), (179, [2]), (180, [2]), (186, [2]), (189, [2]), (192, [2]), (195, [2]), (202, [2]), (203, [2]), (212, [2]), (213, [2]), (216, [2]), (218, [2]), (231, [2]), (232, [2]), (237, [2]), (238, [2]), (240, [2]), (241, [2]), (245, [2]), (246, [2]), (248, [2]), (250, [2]), (257, [2]), (260, [2]), (270, [2]), (271, [2]), (272, [2]), (273, [2]), (276, [2]), (277, [2]), (278, [2]), (279, [2]), (282, [2]), (285, [2]), (290, [2]), (298, [2]), (307, [2]), (310, [2]), (311, [2]), (321, [2]), (322, [

([(2, 0.9997952)], [(3, [2]), (22, [2]), (23, [2]), (25, [2]), (40, [2]), (42, [2]), (56, [2]), (57, [2]), (58, [2]), (60, [2]), (66, [2]), (67, [2]), (68, [2]), (86, [2]), (91, [2]), (94, [2]), (96, [2]), (101, [2]), (106, [2]), (117, [2]), (123, [2]), (126, [2]), (128, [2]), (134, [2]), (137, [2]), (148, [2]), (151, [2]), (153, [2]), (154, [2]), (158, [2]), (160, [2]), (174, [2]), (179, [2]), (180, [2]), (182, [2]), (189, [2]), (195, [2]), (202, [2]), (203, [2]), (209, [2]), (231, [2]), (235, [2]), (238, [2]), (240, [2]), (244, [2]), (248, [2]), (249, [2]), (255, [2]), (256, [2]), (257, [2]), (258, [2]), (260, [2]), (261, [2]), (264, [2]), (267, [2]), (270, [2]), (277, [2]), (278, [2]), (279, [2]), (282, [2]), (290, [2]), (293, [2]), (306, [2]), (307, [2]), (310, [2]), (311, [2]), (315, [2]), (328, [2]), (329, [2]), (336, [2]), (338, [2]), (340, [2]), (357, [2]), (359, [2]), (361, [2]), (363, [2]), (367, [2]), (368, [2]), (379, [2]), (385, [2]), (386, [2]), (406, [2]), (414, [2]), (4

([(2, 0.999796)], [(2, [2]), (4, [2]), (6, [2]), (12, [2]), (13, [2]), (18, [2]), (22, [2]), (23, [2]), (27, [2]), (28, [2]), (29, [2]), (31, [2]), (33, [2]), (35, [2]), (36, [2]), (40, [2]), (42, [2]), (51, [2]), (56, [2]), (57, [2]), (60, [2]), (65, [2]), (66, [2]), (68, [2]), (86, [2]), (95, [2]), (102, [2]), (103, [2]), (104, [2]), (106, [2]), (110, [2]), (118, [2]), (123, [2]), (124, [2]), (127, [2]), (128, [2]), (132, [2]), (137, [2]), (150, [2]), (151, [2]), (153, [2]), (154, [2]), (172, [2]), (173, [2]), (174, [2]), (179, [2]), (183, [2]), (186, [2]), (189, [2]), (195, [2]), (202, [2]), (203, [2]), (211, [2]), (215, [2]), (216, [2]), (219, [2]), (221, [2]), (227, [2]), (231, [2]), (232, [2]), (236, [2]), (237, [2]), (238, [2]), (240, [2]), (245, [2]), (246, [2]), (248, [2]), (249, [2]), (255, [2]), (256, [2]), (258, [2]), (260, [2]), (261, [2]), (264, [2]), (270, [2]), (274, [2]), (275, [2]), (278, [2]), (279, [2]), (282, [2]), (285, [2]), (288, [2]), (306, [2]), (307, [2]), (3

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



([(2, 0.9997876)], [(3, [2]), (13, [2]), (17, [2]), (18, [2]), (22, [2]), (24, [2]), (25, [2]), (29, [2]), (33, [2]), (37, [2]), (40, [2]), (41, [2]), (42, [2]), (56, [2]), (57, [2]), (60, [2]), (62, [2]), (66, [2]), (68, [2]), (93, [2]), (96, [2]), (101, [2]), (102, [2]), (103, [2]), (106, [2]), (115, [2]), (116, [2]), (123, [2]), (128, [2]), (132, [2]), (133, [2]), (134, [2]), (138, [2]), (151, [2]), (153, [2]), (163, [2]), (177, [2]), (179, [2]), (180, [2]), (183, [2]), (186, [2]), (195, [2]), (202, [2]), (203, [2]), (211, [2]), (212, [2]), (222, [2]), (227, [2]), (231, [2]), (232, [2]), (235, [2]), (237, [2]), (238, [2]), (240, [2]), (245, [2]), (246, [2]), (248, [2]), (249, [2]), (255, [2]), (256, [2]), (257, [2]), (258, [2]), (264, [2]), (267, [2]), (270, [2]), (271, [2]), (282, [2]), (290, [2]), (310, [2]), (311, [2]), (315, [2]), (316, [2]), (317, [2]), (324, [2]), (328, [2]), (329, [2]), (334, [2]), (336, [2]), (339, [2]), (340, [2]), (349, [2]), (357, [2]), (362, [2]), (368, 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [38]:
for c in lda_model[corpus[5:8]]:
    print("Document Topics      : ", c[0])      # [(Topics, Perc Contrib)]
    print("Word id, Topics      : ", c[1][:3])  # [(Word id, [Topics])]
    print("Phi Values (word id) : ", c[2][:2])  # [(Word id, [(Topic, Phi Value)])]
    print("Word, Topics         : ", [(dct[wd], topic) for wd, topic in c[1][:2]])   # [(Word, [Topics])]
    print("Phi Values (word)    : ", [(dct[wd], topic) for wd, topic in c[2][:2]])  # [(Word, [(Topic, Phi Value)])]
    print("------------------------------------------------------\n")

Document Topics      :  [(2, 0.9997902)]
Word id, Topics      :  [(2, [2]), (9, [2]), (12, [2])]
Phi Values (word id) :  [(2, [(2, 2.9981723)]), (9, [(2, 0.9732207)])]
Word, Topics         :  [('ability', [2]), ('absurdity', [2])]
Phi Values (word)    :  [('ability', [(2, 2.9981723)]), ('absurdity', [(2, 0.9732207)])]
------------------------------------------------------

Document Topics      :  [(2, 0.99978703)]
Word id, Topics      :  [(2, [2]), (12, [2]), (18, [2])]
Phi Values (word id) :  [(2, [(2, 5.996344)]), (12, [(2, 2.9960558)])]
Word, Topics         :  [('ability', [2]), ('academic', [2])]
Phi Values (word)    :  [('ability', [(2, 5.996344)]), ('academic', [(2, 2.9960558)])]
------------------------------------------------------

Document Topics      :  [(2, 0.9998125)]
Word id, Topics      :  [(3, [2]), (12, [2]), (18, [2])]
Phi Values (word id) :  [(3, [(2, 0.9996195)]), (12, [(2, 5.992112)])]
Word, Topics         :  [('able', [2]), ('academic', [2])]
Phi Values (word)    

### Create a LSI topic model using gensim
It is similar to how we built the LDA model, except using `LsiModel()`.

In [39]:
from gensim.models import LsiModel
# build the model
lsi_model = LsiModel(corpus=corpus, id2word=dct, num_topics=7, decay=0.5)

# view topics
pprint(lsi_model.print_topics(-1))

[(0,
  '0.700*"zero" + 0.204*"also" + 0.163*"american" + 0.143*"first" + '
  '0.116*"many" + 0.101*"world" + 0.100*"new" + 0.093*"war" + 0.090*"time" + '
  '0.083*"states"'),
 (1,
  '0.447*"american" + -0.242*"also" + 0.241*"zero" + 0.158*"b" + -0.142*"many" '
  '+ 0.120*"football" + 0.111*"british" + 0.107*"player" + 0.103*"actor" + '
  '0.099*"french"'),
 (2,
  '0.383*"zero" + -0.297*"american" + -0.220*"lincoln" + -0.216*"war" + '
  '-0.166*"football" + -0.141*"ball" + -0.136*"british" + -0.132*"line" + '
  '0.129*"atheism" + 0.117*"jews"'),
 (3,
  '0.368*"lincoln" + 0.246*"jews" + 0.204*"anti" + 0.194*"war" + '
  '-0.182*"apollo" + 0.179*"union" + 0.153*"states" + 0.139*"jewish" + '
  '0.137*"state" + 0.127*"semitism"'),
 (4,
  '-0.271*"atheism" + -0.261*"jews" + -0.238*"american" + -0.199*"anti" + '
  '-0.192*"god" + -0.178*"b" + -0.158*"jewish" + -0.126*"semitism" + '
  '0.116*"football" + 0.115*"zero"'),
 (5,
  '-0.497*"atheism" + -0.310*"god" + -0.188*"lincoln" + -0.168*"atheis

### Train Word2Vec model using gensim
Downloading pre-built word embedding models like word2vec, fasttext, GloVe and ConceptNet is possible Using the Gensim's downloader API.   
If you are working in a specialized niche, it is desirable to train your own model, and Gensim's `Word2Vec` implementation let's you train own word embedding model.


In [None]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

dataset = api.load('text8')
data = [d for d in dataset]

# data split
data_p1 = data[:1000]
data_p2 = data[1000:]

# train model. defaults result vector size=100
model = Word2Vec(data_p1, min_count=0, workers=cpu_count())
# get the word vector for given word
model.wv['topic']

model.wv.most_similar('topic')

In [41]:
# save and load
model.save('new_w2v')
model = Word2Vec.load('new_w2v')

In [46]:
model.score()

TypeError: score() missing 1 required positional argument: 'sentences'

### Update an existing Word2Vec model with new data
Call the `build_vocab()` on the new dataset and then call the `train()` method.

In [47]:
import warnings
warnings.filterwarnings('ignore')
model.build_vocab(data_p2, update=True)
model.train(data_p2, total_examples=model.corpus_count, epochs=model.epochs)

(26272788, 35026035)

In [48]:
model.wv['topic']

array([ 0.7446504 , -0.28071564, -0.00606414,  0.83305943, -0.08232357,
       -0.9816075 , -1.8075072 , -0.80230474,  0.26997536, -0.29957888,
        0.5220611 ,  1.1691512 , -1.287881  ,  0.302683  , -1.3999207 ,
        1.4122881 , -1.0140525 , -1.6933585 ,  1.1840072 , -0.03015611,
       -0.99758565,  0.3975928 , -0.13866511,  0.4552208 , -0.81108534,
        0.6114069 , -1.1909314 , -0.46130013,  1.1777731 ,  0.46712992,
        2.7886486 , -0.71799904, -1.2384708 , -1.8221023 ,  0.31907815,
       -0.8455994 , -1.0189147 , -0.9707055 ,  0.9249707 , -1.269237  ,
       -0.28289005,  0.32410923,  0.5300748 , -0.33475715, -0.02085111,
        0.2372339 ,  0.19842613,  0.0462123 , -1.1710405 ,  0.31548834,
       -0.3538645 ,  0.9893512 ,  0.44972095,  0.608138  ,  0.05214502,
       -1.7729468 , -2.3391788 , -1.5013051 , -1.8594626 ,  0.90815234,
        0.21476643,  1.1746788 ,  1.2129115 , -0.8201682 ,  0.5766669 ,
       -0.55788064,  0.00893889,  0.46345812, -1.0837823 ,  0.83

In [49]:
model.wv.most_similar('topic')

[('discussion', 0.7000298500061035),
 ('discourse', 0.6609850525856018),
 ('commentary', 0.6580086350440979),
 ('comment', 0.6551846861839294),
 ('methodology', 0.6435954570770264),
 ('opinions', 0.6417107582092285),
 ('scholarly', 0.629954993724823),
 ('subject', 0.6292686462402344),
 ('interpretation', 0.6277948021888733),
 ('misunderstanding', 0.6263808608055115)]

### Extract word vectors using pre-trained Word2Vec and FastText models
Gensim lets you download SOTA pretrained models through the downloader API.

In [50]:
# import gensim.downloader as api
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')

[--------------------------------------------------] 1.4% 23.4/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[=-------------------------------------------------] 3.5% 58.4/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[==------------------------------------------------] 5.6% 93.1/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[===-----------------------------------------------] 7.8% 129.2/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[====----------------------------------------------] 9.4% 156.4/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[=====---------------------------------------------] 10.1% 167.2/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





In [51]:
type(fasttext_model300)

gensim.models.keyedvectors.KeyedVectors

keyedvectors는 일회용이라고 볼 수 있다.   
한 번 학습하면 추가로 벡터를 업데이트할 수 없는 단점이 있지만, 확장성이 높고 모델 상태를 저장하지 않기 때문에 가볍다.   

In [53]:
# get word embeddings
word2vec_model300.most_similar('support')

[('supporting', 0.6251285076141357),
 ('suport', 0.6071150302886963),
 ('suppport', 0.6053199172019958),
 ('Support', 0.6044272780418396),
 ('supported', 0.6009396910667419),
 ('backing', 0.6007589101791382),
 ('supports', 0.5269277691841125),
 ('assistance', 0.5207136869430542),
 ('sup_port', 0.5192490220069885),
 ('supportive', 0.5110024809837341)]

Embedding models can be evaluated by using the respective model's `evaluate_word_analogies()` on a standard *analogies dataset*.

In [54]:
word2vec_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]
fasttext_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]
glove_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]

0.7195422354510931

### Create document vectors using Doc2Vec
`Doc2Vec` model provides a vectorised representation of a group of words taken collectively  as a single unit. Not a simple average of the word vectors.   
The training data should be a list of `TaggedDocument`s.   
Pass a list of words and a unique integer as input to the `models.doc2vec.TaggedDocument()` to create one.

In [55]:
import gensim
import gensim.downloader as api
dataset = api.load('text8')
data = [d for d in dataset]

# create the tagged document for doc2vec
def create_tagged_document(lolow):
    for i, low in enumerate(lolow):
        yield gensim.models.doc2vec.TaggedDocument(low, [i])

train_data = list(create_tagged_document(data))
print('Train data : ', train_data[:1])

# initialize the model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
# build the vocabulary
model.build_vocab(train_data)
# finally train the model
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

# get the doc vector of a sentence
print(model.infer_vector(['australian', 'captain', 'elected', 'to', 'bowl']))

Train data :  [TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'a

[-0.62048596  0.20153642  0.44353288 -0.14461097  0.02394026  0.06545825
  0.12366036  0.1999058  -0.26196873 -0.4248057   0.06105191  0.0118495
 -0.00475305 -0.32675147  0.5402481  -0.17461716 -0.14855836  0.13041614
 -0.00577485  0.18187208  0.01257265 -0.3578692   0.26982632  0.4065035
 -0.08455116 -0.25114775  0.07757098 -0.6654502  -0.26110008 -0.3343312
 -0.19534782 -0.35002714 -0.49486947 -0.33408606 -0.25023034 -0.10588334
 -0.06655324 -0.06311012 -0.05397988 -0.35398582 -0.03015532  0.3150858
  0.5925839  -0.13444912 -0.0262106  -0.1539785   0.36782086  0.89940256
 -0.02971512  0.20827499]


### Compute similarity metrics like cosine similarity and soft cosine similarity
Soft cosine similarity considers the semantic relationship between the words through its vector representation.   
First, compute the `similarity_matrix`. Then convert the input sentences to BoW corpus and pass them to the `softcossim()` along with the similarity matrix.   
*softcossim removed*

In [56]:
from gensim.matutils import cossim
from gensim import corpora

sent_1 = 'Sachin is a cricket player and a opening batsman'.split()
sent_2 = 'Dhoni is a cricket player too He is a batsman and keeper'.split()
sent_3 = 'Anand is a chess player'.split()

# prepare the similarity matrix
# similarity_matrix = fasttext_model300.similarity(dictionary, tfidf=None,
#                                                   threshold=0.0, exponent=2.0,
#                                                   nonzero_limit=100)
# prepare a dictionary and a corpus
documents = [sent_1, sent_2, sent_3]
dictionary = corpora.Dictionary(documents)

# convert the sentences into bag-of-words vectors
sent_1 = dictionary.doc2bow(sent_1)
sent_2 = dictionary.doc2bow(sent_2)
sent_3 = dictionary.doc2bow(sent_3)

# compute cosine similarity
print(cossim(sent_1, sent_2))
print(cossim(sent_1, sent_3))
print(cossim(sent_2, sent_3))


0.753778361444409
0.5393598899705937
0.5590169943749475


Useful similarity and distance metrics based on the word embedding models like fasttext and GloVe.

In [57]:
print(fasttext_model300.doesnt_match(['india', 'australia', 'pakistan', 'china', 'beetroot']))

# compute cosine distance between two words
print(fasttext_model300.distance('king', 'queen'))

# Compute cosine distances from given word or vector to all words in `other_words`
print(fasttext_model300.distances('king', ['queen','man','woman']))

# compute cosine similarities
print(fasttext_model300.cosine_similarities(fasttext_model300['king'],
                                           vectors_all=(fasttext_model300['queen'],
                                                       fasttext_model300['man'],
                                                       fasttext_model300['woman'],
                                                       fasttext_model300['queen']+fasttext_model300['man'])))
# get the words closer to w1 than w2
print(glove_model300.words_closer_than('king', 'kingdom'))

# find the top-N most similar words
print(fasttext_model300.most_similar(positive='king', negative=None, topn=5, restrict_vocab=None, indexer=None))

# find the top-n using th multiplicative combination objective
print(glove_model300.most_similar_cosmul(positive='king', negative=None, topn=5))


beetroot
0.22957539558410645
[0.22957546 0.465837   0.547001  ]
[0.77042454 0.534163   0.45299897 0.7657255 ]
['prince', 'queen', 'monarch']
[('king-', 0.78380286693573), ('boy-king', 0.7704818844795227), ('queen', 0.7704246640205383), ('prince', 0.7700966596603394), ('kings', 0.7668929696083069)]
[('queen', 0.8168227076530457), ('prince', 0.809830367565155), ('monarch', 0.7949802279472351), ('kingdom', 0.7895625829696655), ('throne', 0.7803236842155457)]


### Summarize text documents
Gensim implements the textrank summarization using the `summarize()` ftn in the `summarization` module. Just pass in the text string along with either the output summarization `ratio` or the maximum `count` of words in the summarized output. ?????????    
There is no need to split the sentence into a tokenized list because gensim dose using the built-in `split_sentences()` method in the `gensim.summarization.texcleaner` module   
***4.x 버전에서는 removed. BERT나 다른 모델을 사용하는 것을 추천***

In [None]:
from gensim.summarization import summarize, keywords
from pprint import pprint

text = " ".join((line for line in smart_open('sample.txt', encoding='utf-8')))

# Summarize the paragraph
pprint(summarize(text, word_count=20))
#> ('the PLA Rocket Force national defense science and technology experts panel, '
#>  'according to a report published by the')

# Important keywords from the paragraph
print(keywords(text))
#> force zhang technology experts pla rocket