# Twitter corpus creation and LDA topic modelling

## Introduction

This notebook has two sections:
1. Prototype code for creating a gensim-compatible corpus from my collection of tweets.
2. Training an LDA topic model on a subset of the corpus.

This is largely prototyping and experimentation with model hyperparameters. When done, I'll move each part into its own script. There's a very rudimentary earlier version (probably not committed. Oops...) that was trained on a Wikipedia corpus, but I wasn't satisfied with it, and its preprocessing and retraining take about 12 hours.

## Libraries and setup

In [11]:
# Python libs
import sys, os
from dotenv import find_dotenv, load_dotenv
import re
import logging
from pathlib import Path
from pprint import pprint
import random

# Database
import pymongo

# NLP libs
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

from gensim.test.utils import datapath
from gensim.utils import simple_preprocess
from gensim.test.utils import common_texts
from gensim.models import TfidfModel, CoherenceModel, LdaMulticore
from gensim.models.ldamodel import LdaModel
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary, MmCorpus

from sklearn.model_selection import ParameterGrid

# Visualisation libs
import matplotlib.pyplot as plt

In [12]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Data loading

In [13]:
# From src/data/db_handlers.py
def mongodb_connect():
    """
    Establish a connection to MongoDB.
    The host is read from DATABASE_URL in the .env file;
    the database name (tweetbase) is hard-coded below.
    """
    dotenv_path = find_dotenv()
    load_dotenv(dotenv_path)

    client = pymongo.MongoClient(os.environ.get("DATABASE_URL"), 27017)
    db = client.tweetbase
    return db

In [14]:
db = mongodb_connect()

The `tweetbase` db contains two separate collections of tweets: one from Aberdeen, Scotland, and one from Hammersmith, London.

In [15]:
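n_abdn = db.tweets_abdn.count_documents({})  # Collection.count() is deprecated since PyMongo 3.7
n_hsmith = db.tweets_hsmith.count_documents({})
print(f"""There are {n_abdn} tweets from Aberdeen 
and {n_hsmith} tweets from Hammersmith in the db""")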

There are 164523 tweets from Aberdeen 
and 160335 tweets from Hammersmith in the db


In [16]:
# Get full_text field for each tweet
# Only the first 50,000 tweets from each collection are returned for testing purposes
tweets_abdn = db.tweets_abdn.find({}, {"_id": 0, "full_text": 1})[:50000]
tweets_hsmith = db.tweets_hsmith.find({}, {"_id": 0, "full_text": 1})[:50000]

# Convert mongodb cursor objects into lists
tweets_abdn = [_.get("full_text") for _ in tweets_abdn]
tweets_hsmith = [_.get("full_text") for _ in tweets_hsmith]

In [17]:
tweets = tweets_abdn + tweets_hsmith

In [18]:
# Get indices for a random sample of 50 tweets for inspection
tweets_sample_idxs = random.sample(range(len(tweets)), 50)
pprint([tweets[i] for i in tweets_sample_idxs])

['Duit habis, novel yg kutunggu h-3.\n\nMau nangis saya.',
 'Left Billy only 8 hours ago and miss him already, sad times\U0001f97a',
 '@RedMHiro @elgatogaming It runs so well! No delay from my Xbox to pc and '
 'doesn’t even use 10% CPU. It’s fantastic!',
 'Are you a #helicopter #pilot in either #transport or #SAR #searchandrescue '
 'operations with an interest in progressing #CRM? Then my research needs you. '
 'Please drop me a message to take part in our most recent study exploring '
 'pilot skills and the factors that influence them.',
 '@mintR_TV Amalgamations are fine if you personally know the other 2 '
 'streamers but id suggest keeping to yourself. By all means advertise for '
 'each other though.',
 'How do you calm sensitive skin?\n'
 '\n'
 'Cleansing is a crucial morning and evening ritual for your skin all year '
 "round, especially if it's sensitive. It's a vital first step and ensures "
 'that… https://t.co/UadSb90zy5',
 '@kettls @OutsideXbox Been meaning to go since it

## Corpus preprocessing

### Cleaning

In [19]:
# Remove links and @ prefixes from tweets
tweets = [re.sub(r'@|https?://\S+', '', t) for t in tweets]
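
As a quick check of what this substitution does, the sketch below applies an equivalent pattern to a string assembled from fragments of the sample tweets. The helper name `clean_tweet` is hypothetical and only used for illustration. Note that only the `@` character is removed, so the username itself stays in the text, and the space before a removed link is kept.

```python
import re

# Equivalent to the pattern in the cell above, written as a raw string so the
# backslash in \S reaches the regex engine unchanged.
LINK_OR_AT = re.compile(r'@|https?://\S+')

def clean_tweet(text):
    """Strip '@' characters and http(s) links (hypothetical helper)."""
    return LINK_OR_AT.sub('', text)

# Sample assembled from fragments of the tweets printed earlier.
sample = '@kettls @OutsideXbox that… https://t.co/UadSb90zy5'
print(repr(clean_tweet(sample)))
```

Because the usernames survive as ordinary tokens, they end up in the vocabulary unless a later step filters them out.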

In [20]:
pprint([tweets[i] for i in tweets_sample_idxs])

['Duit habis, novel yg kutunggu h-3.\n\nMau nangis saya.',
 'Left Billy only 8 hours ago and miss him already, sad times\U0001f97a',
 'RedMHiro elgatogaming It runs so well! No delay from my Xbox to pc and '
 'doesn’t even use 10% CPU. It’s fantastic!',
 'Are you a #helicopter #pilot in either #transport or #SAR #searchandrescue '
 'operations with an interest in progressing #CRM? Then my research needs you. '
 'Please drop me a message to take part in our most recent study exploring '
 'pilot skills and the factors that influence them.',
 'mintR_TV Amalgamations are fine if you personally know the other 2 streamers '
 'but id suggest keeping to yourself. By all means advertise for each other '
 'though.',
 'How do you calm sensitive skin?\n'
 '\n'
 'Cleansing is a crucial morning and evening ritual for your skin all year '
 "round, especially if it's sensitive. It's a vital first step and ensures "
 'that… ',
 'kettls OutsideXbox Been meaning to go since it opened! Dundee have pumped 

### Tokenizing

Tokenization of the tweets is performed with `gensim.utils.simple_preprocess()`, which only produces unigram tokens. Applying `gensim.models.phrases.Phraser` to the tokenized output should merge common word pairs into bigrams, but it is unclear at present whether the approach below actually does so for this input: the `Phrases` model is trained on gensim's bundled `testcorpus.txt` rather than on the tweets.
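
To make the bigram step less of a black box, here is a minimal, standard-library sketch of how phrase detection can work when the phrases are learned from the tweets themselves. It is not gensim's implementation: the function names `learn_bigrams` and `merge_bigrams` are hypothetical, the toy tweets are made up, and the score follows my reading of the Mikolov et al. (2013) rule that gensim's default scorer is based on, simplified to use the unigram vocabulary size.

```python
from collections import Counter

def learn_bigrams(sentences, min_count=1, threshold=1.0):
    """Return the set of adjacent word pairs whose score exceeds `threshold`."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)
    keep = set()
    for (a, b), n_ab in bigrams.items():
        # Assumed scoring rule: frequent pairs made of otherwise rare words score high.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            keep.add((a, b))
    return keep

def merge_bigrams(tokens, keep, delimiter='_'):
    """Join adjacent tokens that form a learned pair, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in keep:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Made-up toy data, standing in for tokenized tweets.
toy_tweets = [
    ['happy', 'birthday', 'mate'],
    ['happy', 'birthday', 'pal'],
    ['mental', 'health', 'week'],
    ['mental', 'health', 'awareness'],
]
keep = learn_bigrams(toy_tweets, min_count=1, threshold=1.0)
print(merge_bigrams(['happy', 'birthday', 'mate'], keep))
```

A pair can only be merged if it was seen while learning, which is why the choice of corpus used to learn the phrases matters.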

In [21]:
tweets_tokens = [simple_preprocess(t) for t in tweets]

In [22]:
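# Train the phrase model on gensim's bundled test corpus (not on the tweets)
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)

# bigram = Phrases(common_texts)
# Freeze the model into a lighter Phraser and apply it to every tokenized tweet
bigram = Phraser(phrases)
tweets_tokens = [bigram[t] for t in tweets_tokens]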

2019-07-18 11:01:42,897 : INFO : collecting all words and their counts
2019-07-18 11:01:42,898 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2019-07-18 11:01:42,899 : INFO : collected 37 word types from a corpus of 29 words (unigram + bigrams) and 1 sentences
2019-07-18 11:01:42,899 : INFO : using 37 counts as vocab in Phrases<0 vocab, min_count=1, threshold=1, max_vocab_size=40000000>
2019-07-18 11:01:42,900 : INFO : source_vocab length 37
2019-07-18 11:01:42,900 : INFO : Phraser built with 3 phrasegrams


In [23]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])

[['duit', 'habis', 'novel', 'yg', 'kutunggu', 'mau', 'nangis', 'saya'],
 ['left',
  'billy',
  'only',
  'hours',
  'ago',
  'and',
  'miss',
  'him',
  'already',
  'sad',
  'times'],
 ['redmhiro',
  'elgatogaming',
  'it',
  'runs',
  'so',
  'well',
  'no',
  'delay',
  'from',
  'my',
  'xbox',
  'to',
  'pc',
  'and',
  'doesn',
  'even',
  'use',
  'cpu',
  'it',
  'fantastic'],
 ['are',
  'you',
  'helicopter',
  'pilot',
  'in',
  'either',
  'transport',
  'or',
  'sar',
  'searchandrescue',
  'operations',
  'with',
  'an',
  'interest',
  'in',
  'progressing',
  'crm',
  'then',
  'my',
  'research',
  'needs',
  'you',
  'please',
  'drop',
  'me',
  'message',
  'to',
  'take',
  'part',
  'in',
  'our',
  'most',
  'recent',
  'study',
  'exploring',
  'pilot',
  'skills',
  'and',
  'the',
  'factors',
  'that',
  'influence',
  'them'],
 ['mintr_tv',
  'amalgamations',
  'are',
  'fine',
  'if',
  'you',
  'personally',
  'know',
  'the',
  'other',
  'streamers',
  'b

### Stopword removal

In [24]:
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

For some reason, negated forms of *should*, *would* and *might* are included, but not the regular forms. I've added them to the list myself, along with *could*.

In [25]:
stop_additions = ['should', 'would', 'might', 'could']
stop = stop + stop_additions

In [26]:
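whitelist = set()     # Words to keep even if they are stopwords
stop_set = set(stop)  # Set lookups are faster than scanning the list for every token

tweets_tokens = [[word for word in sentence if word in whitelist or word not in stop_set]
                 for sentence in tweets_tokens]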

In [27]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])

[['duit', 'habis', 'novel', 'yg', 'kutunggu', 'mau', 'nangis', 'saya'],
 ['left', 'billy', 'hours', 'ago', 'miss', 'already', 'sad', 'times'],
 ['redmhiro',
  'elgatogaming',
  'runs',
  'well',
  'delay',
  'xbox',
  'pc',
  'even',
  'use',
  'cpu',
  'fantastic'],
 ['helicopter',
  'pilot',
  'either',
  'transport',
  'sar',
  'searchandrescue',
  'operations',
  'interest',
  'progressing',
  'crm',
  'research',
  'needs',
  'please',
  'drop',
  'message',
  'take',
  'part',
  'recent',
  'study',
  'exploring',
  'pilot',
  'skills',
  'factors',
  'influence'],
 ['mintr_tv',
  'amalgamations',
  'fine',
  'personally',
  'know',
  'streamers',
  'id',
  'suggest',
  'keeping',
  'means',
  'advertise',
  'though'],
 ['calm',
  'sensitive',
  'skin',
  'cleansing',
  'crucial',
  'morning',
  'evening',
  'ritual',
  'skin',
  'year',
  'round',
  'especially',
  'sensitive',
  'vital',
  'first',
  'step',
  'ensures'],
 ['kettls',
  'outsidexbox',
  'meaning',
  'go',
  'sin

### Lemmatizer

Lemmatization groups words under their lemma, or dictionary form, e.g. *knows* and *knew* under *know*, or *feet* under *foot*. This requires knowing the part of speech (PoS) of each word.

Although the corpus is lemmatized here, this may not be beneficial. [Schofield and Mimno (2016)](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00099) report that, at best, text preprocessed with the NLTK WordNet lemmatizer showed no meaningful change in LDA topic coherence scores compared with text that had not been stemmed. At worst, some stemming methods decreased LDA topic stability.

TODO: Possibly compare a lemmatized and unlemmatized version of the corpus
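
The PoS mapping that the lemmatizer relies on can be illustrated without NLTK. The sketch below maps Penn Treebank tags to the single-character WordNet PoS codes; the function name `penn_to_wordnet` is hypothetical, and the tags are hand-written stand-ins for tagger output rather than real NLTK results.

```python
# WordNet PoS codes as single characters (these match the values of
# nltk.corpus.wordnet's ADJ, NOUN, VERB and ADV constants).
WORDNET_POS = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet PoS code, defaulting to noun."""
    return WORDNET_POS.get(tag[:1].upper(), 'n')

# Hand-written (word, tag) pairs; an assumption for illustration, not tagger output.
tagged = [('knew', 'VBD'), ('feet', 'NNS'), ('sad', 'JJ'), ('already', 'RB'), ('the', 'DT')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK, the tags could come from `nltk.pos_tag(tokens)` applied to a whole tokenized tweet rather than to single words, which gives the tagger some context. Whether that changes results such as *evening* being reduced to *even* in the sample output is untested.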

In [28]:
# From https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [29]:
wnl = WordNetLemmatizer()

tweets_tokens = [[wnl.lemmatize(word, get_wordnet_pos(word)) for word in sentence]
                 for sentence in tweets_tokens]

In [30]:
pprint([tweets_tokens[i] for i in tweets_sample_idxs])

[['duit', 'habis', 'novel', 'yg', 'kutunggu', 'mau', 'nangis', 'saya'],
 ['left', 'billy', 'hour', 'ago', 'miss', 'already', 'sad', 'time'],
 ['redmhiro',
  'elgatogaming',
  'run',
  'well',
  'delay',
  'xbox',
  'pc',
  'even',
  'use',
  'cpu',
  'fantastic'],
 ['helicopter',
  'pilot',
  'either',
  'transport',
  'sar',
  'searchandrescue',
  'operation',
  'interest',
  'progress',
  'crm',
  'research',
  'need',
  'please',
  'drop',
  'message',
  'take',
  'part',
  'recent',
  'study',
  'explore',
  'pilot',
  'skill',
  'factor',
  'influence'],
 ['mintr_tv',
  'amalgamation',
  'fine',
  'personally',
  'know',
  'streamer',
  'id',
  'suggest',
  'keep',
  'mean',
  'advertise',
  'though'],
 ['calm',
  'sensitive',
  'skin',
  'cleanse',
  'crucial',
  'morning',
  'even',
  'ritual',
  'skin',
  'year',
  'round',
  'especially',
  'sensitive',
  'vital',
  'first',
  'step',
  'ensures'],
 ['kettls',
  'outsidexbox',
  'meaning',
  'go',
  'since',
  'open',
  'dunde

In [31]:
vocab = {word for sentence in tweets_tokens for word in sentence}
print(f"There are {len(vocab)} unique words in the corpus")
# Sets are unordered, so this is an arbitrary selection of 150 words
print(f"The first 150 words are:\n {list(vocab)[:150]}")

There are 136724 unique words in the corpus
The first 150 words are:
 ['hsn_ron', 'opang', 'markcooke', 'shanklanding', 'ladycorvia', 'rwsreece', 'nativeesoul', 'шутить', 'uisgebeatha', 'ماله', 'ira', 'он', 'toweroflondon', 'sheilamarson', 'свой', 'murghi', 'imjames_', 'korzystasz', 'unexpectedly', 'portobelloshack', 'villager', 'yet', 'shree_das', 'clauding', 'بيه', 'شاهد', 'bandung', 'lewiskingy', 'shona', 'leigh__h', 'ulovey', 'skeletalmess', 'colombian', 'ماعرفناه', 'fflad', 'restmarkinnov', 'nigra', 'noclass', 'bertahan', '표출하려', 'trudesundset', 'babettewasserma', 'حرفا', 'dino_melaye', 'gossipnya', 'stopbrexitnow', 'shayouy', 'floofy', 'گیری', 'tooley_evans', 'elissabairdx', 'anning', 'leslie', 'rightfully', 'colon', 'heyitschloejade', 'azizanalysis', 'hadeert', 'hunker', 'whynstuart', 'mandywolfbear', 'reddingpower', 'старше', 'ведь', 'yorkshiredalesb', 'emily_hutton_', 'chantelsophiaxo', 'malcom', '国際', 'melodic', 'finalmente', 'vvv_rfc', 'сухостоем', 'dunked', 'jordanbpeterson

Many of the words in the corpus so far are Twitter usernames and hashtags. The dictionary created in the next section filters out those that are not widely used.

## Creating Dictionary and BoW corpus

In [32]:
KEEP_WORDS = 100000 # Max number of words in dictionary
CORPUS_PATH = Path('../../data/corpora') # Location to save corpus and dict

In [33]:
dictionary = Dictionary(tweets_tokens)

# Filter the dictionary:
#    Words must appear in at least no_below (40) documents
#    Words must appear in no more than no_above (5%) of all documents
#    At most keep_n (KEEP_WORDS) of the most frequent remaining words are kept
dictionary.filter_extremes(no_below=40, no_above=0.05, keep_n=KEEP_WORDS)
print(dictionary.token2id)

2019-07-18 11:03:59,366 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-07-18 11:03:59,511 : INFO : adding document #10000 to Dictionary(19563 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:03:59,657 : INFO : adding document #20000 to Dictionary(31740 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:03:59,804 : INFO : adding document #30000 to Dictionary(42304 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:03:59,953 : INFO : adding document #40000 to Dictionary(51777 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:04:00,109 : INFO : adding document #50000 to Dictionary(60635 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:04:00,258 : INFO : adding document #60000 to Dictionary(79663 unique tokens: ['best', 'friday', 'night', 'thing', 'add']...)
2019-07-18 11:04:00,415 : INFO : adding document #70000 to Dictionary(95875 uni

{'best': 0, 'friday': 1, 'night': 2, 'thing': 3, 'add': 4, 'chip': 5, 'day': 6, 'flat': 7, 'fry': 8, 'meal': 9, 'pm': 10, 'ring': 11, 'sauce': 12, 'serve': 13, 'till': 14, 'weekend': 15, 'cool': 16, 'look': 17, 'pocket': 18, 'te': 19, 'top': 20, 'bad': 21, 'gorgeous': 22, 'close': 23, 'definitely': 24, 'drive': 25, 'fact': 26, 'know': 27, 'left': 28, 'lucky': 29, 'onto': 30, 'right': 31, 'street': 32, 'union': 33, 'begin': 34, 'cold': 35, 'time': 36, 'good': 37, 'blue': 38, 'call': 39, 'home': 40, 'order': 41, 'pre': 42, 'road': 43, 'need': 44, 'people': 45, 'reason': 46, 'bother': 47, 'pay': 48, 'point': 49, 'question': 50, 'dont': 51, 'mix': 52, 'politics': 53, 'sport': 54, 'always': 55, 'base': 56, 'company': 57, 'enter': 58, 'giveaway': 59, 'global': 60, 'prize': 61, 'range': 62, 'say': 63, 'since': 64, 'st': 65, 'uk': 66, 'way': 67, 'amp': 68, 'background': 69, 'expose': 70, 'fight': 71, 'gang': 72, 'hate': 73, 'heard': 74, 'never': 75, 'state': 76, 'strike': 77, 'white': 78, 'arg

Along with the LDA model hyperparameters, the dictionary filtering above will have a considerable impact on the topic model.
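
To see why the filtering matters, here is a simplified, standard-library sketch of document-frequency filtering in the spirit of `filter_extremes`. It is my own approximation rather than gensim's code: the function name `filter_vocab` and the toy documents are hypothetical. With 100,000 tweets, `no_above=0.05` corresponds to a ceiling of 5,000 tweets per word.

```python
from collections import Counter

def filter_vocab(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Keep tokens whose document frequency lies in [no_below, no_above * len(docs)]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # If more than keep_n tokens survive, keep the most frequent ones.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

# Made-up toy documents standing in for tokenized tweets.
docs = [
    ['go', 'london', 'bus'],
    ['go', 'aberdeen'],
    ['go', 'london'],
    ['go', 'rare_user'],
]
print(filter_vocab(docs, no_below=2, no_above=0.75))
```

In the toy data, `go` is dropped for appearing in every document and the one-off tokens are dropped for being too rare, which is the same squeeze that removes most usernames from the tweet vocabulary.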

In [34]:
# Create the corpus
tweet_corpus = [dictionary.doc2bow(tweet) for tweet in tweets_tokens] # Use the dict to create BoW vectors
# tweet_corpus = [text for text in tweet_corpus if len(text) > 0]
MmCorpus.serialize(str(CORPUS_PATH / 'tweets') + '_bow.mm', tweet_corpus, progress_cnt=10000)

2019-07-18 11:04:01,939 : INFO : storing corpus in Matrix Market format to ../../data/corpora/tweets_bow.mm
2019-07-18 11:04:01,940 : INFO : saving sparse matrix to ../../data/corpora/tweets_bow.mm
2019-07-18 11:04:01,940 : INFO : PROGRESS: saving document #0
2019-07-18 11:04:02,044 : INFO : PROGRESS: saving document #10000
2019-07-18 11:04:02,143 : INFO : PROGRESS: saving document #20000
2019-07-18 11:04:02,243 : INFO : PROGRESS: saving document #30000
2019-07-18 11:04:02,340 : INFO : PROGRESS: saving document #40000
2019-07-18 11:04:02,443 : INFO : PROGRESS: saving document #50000
2019-07-18 11:04:02,525 : INFO : PROGRESS: saving document #60000
2019-07-18 11:04:02,609 : INFO : PROGRESS: saving document #70000
2019-07-18 11:04:02,690 : INFO : PROGRESS: saving document #80000
2019-07-18 11:04:02,774 : INFO : PROGRESS: saving document #90000
2019-07-18 11:04:02,858 : INFO : saved 100000x2717 matrix, density=0.178% (482349/271700000)
2019-07-18 11:04:02,859 : INFO : saving MmCorpus inde

In [35]:
dictionary.save_as_text(str(CORPUS_PATH / 'tweets') + '_wordids.txt.bz2')

2019-07-18 11:04:02,867 : INFO : saving dictionary mapping to ../../data/corpora/tweets_wordids.txt.bz2


## Create LDA topic model

In [36]:
CORPUS_PATH = Path('../../data/corpora') # Location of the saved corpus and dict
MODEL_PATH = Path('../../models')

In [37]:
mm = MmCorpus(str(CORPUS_PATH / 'tweets') + '_bow.mm')

2019-07-18 11:04:03,150 : INFO : loaded corpus index from ../../data/corpora/tweets_bow.mm.index
2019-07-18 11:04:03,151 : INFO : initializing cython corpus reader from ../../data/corpora/tweets_bow.mm
2019-07-18 11:04:03,151 : INFO : accepted corpus with 100000 documents, 2717 features, 482349 non-zero entries


In [38]:
dictionary = Dictionary.load_from_text(str(CORPUS_PATH / 'tweets') + '_wordids.txt.bz2')

### Find good parameters

Find the LDA model hyperparameters that give the highest topic coherence score.

In [51]:
parameter_grid = ParameterGrid({
    'num_topics': [100, 150, 200],
    'chunksize': [10000],
    'passes': [3, 5],
    'iterations': [25],
    'decay': [0.5, 0.6],
    'eval_every': [None]
})
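
The grid above expands to 3 × 2 × 2 = 12 parameter combinations, so 12 models are trained. The sketch below reproduces that expansion with the standard library and shows one way to pick the best combination once scores are available. The helper names `expand_grid` and `best_by_score` are hypothetical, and the scores are placeholders, not coherence results.

```python
from itertools import product

grid = {
    'num_topics': [100, 150, 200],
    'chunksize': [10000],
    'passes': [3, 5],
    'iterations': [25],
    'decay': [0.5, 0.6],
    'eval_every': [None],
}

def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def best_by_score(parameters, scores):
    """Return the (params, score) pair with the highest score."""
    return max(zip(parameters, scores), key=lambda pair: pair[1])

combos = expand_grid(grid)
print(len(combos))

# Placeholder scores purely to exercise the helper; NOT real coherence values.
placeholder_scores = list(range(len(combos)))
best_params, best_score = best_by_score(combos, placeholder_scores)
print(best_params)
```

Applied to the `parameters` and `coherence` lists filled by the training loop, `best_by_score(parameters, coherence)` would return the winning settings together with their score.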

In [52]:
lda_models = []
coherence = []
parameters = []

for params in parameter_grid:
    parameters.append(params)
    
    # Fit the model
    lda = LdaMulticore(corpus=mm, id2word=dictionary,
                       **params,
                       workers=3)
    
    # Get the coherence score for the model
    cm = CoherenceModel(model=lda, texts=tweets_tokens, dictionary=dictionary, coherence='c_v')
    
    # Append results
    lda_models.append(lda)
    coherence.append(cm.get_coherence())

2019-07-18 11:21:58,094 : INFO : using symmetric alpha at 0.01
2019-07-18 11:21:58,095 : INFO : using symmetric eta at 0.01
2019-07-18 11:21:58,096 : INFO : using serial LDA version on this node
2019-07-18 11:21:58,133 : INFO : running online LDA training, 100 topics, 3 passes over the supplied corpus of 100000 documents, updating every 30000 documents, evaluating every ~0 documents, iterating 25x with a convergence threshold of 0.001000
2019-07-18 11:21:58,138 : INFO : training LDA model using 3 processes
2019-07-18 11:21:58,213 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:21:58,269 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:21:58,318 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:21:58,383 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #40000/10000

2019-07-18 11:22:14,855 : INFO : topic #58 (0.010): 0.037*"ada" + 0.018*"tak" + 0.017*"kalau" + 0.012*"london" + 0.011*"work" + 0.011*"go" + 0.010*"excite" + 0.010*"see" + 0.008*"dah" + 0.008*"dapat"
2019-07-18 11:22:14,857 : INFO : topic #66 (0.010): 0.027*"aberdeen" + 0.025*"job" + 0.019*"ya" + 0.018*"hire" + 0.018*"scotland" + 0.015*"time" + 0.015*"careerarc" + 0.014*"music" + 0.014*"road" + 0.010*"council"
2019-07-18 11:22:14,858 : INFO : topic #0 (0.010): 0.023*"go" + 0.018*"want" + 0.015*"use" + 0.014*"time" + 0.011*"مع" + 0.011*"لا" + 0.009*"say" + 0.008*"work" + 0.008*"know" + 0.008*"فيه"
2019-07-18 11:22:14,861 : INFO : topic #96 (0.010): 0.060*"yes" + 0.029*"happy" + 0.023*"birthday" + 0.019*"take" + 0.011*"please" + 0.011*"go" + 0.010*"wonder" + 0.008*"one" + 0.008*"u" + 0.007*"decent"
2019-07-18 11:22:14,872 : INFO : topic diff=inf, rho=0.288675
2019-07-18 11:22:18,012 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:22:18,057 : I

2019-07-18 11:22:30,832 : INFO : topic #0 (0.010): 0.033*"winner" + 0.032*"use" + 0.031*"لا" + 0.023*"bottle" + 0.023*"orang" + 0.021*"brighton" + 0.020*"مع" + 0.018*"فيه" + 0.018*"go" + 0.015*"everywhere"
2019-07-18 11:22:30,832 : INFO : topic #51 (0.010): 0.041*"bus" + 0.039*"finally" + 0.025*"man" + 0.024*"wanna" + 0.023*"track" + 0.020*"cover" + 0.019*"human" + 0.018*"town" + 0.017*"driver" + 0.017*"time"
2019-07-18 11:22:30,833 : INFO : topic #18 (0.010): 0.159*"de" + 0.117*"la" + 0.088*"que" + 0.058*"en" + 0.043*"un" + 0.037*"tu" + 0.023*"e" + 0.020*"ça" + 0.017*"por" + 0.016*"al"
2019-07-18 11:22:30,834 : INFO : topic #72 (0.010): 0.094*"happy" + 0.090*"christmas" + 0.071*"best" + 0.052*"birthday" + 0.032*"day" + 0.025*"omg" + 0.021*"hope" + 0.018*"xxx" + 0.017*"one" + 0.014*"celebrate"
2019-07-18 11:22:30,834 : INFO : topic #50 (0.010): 0.109*"le" + 0.059*"pa" + 0.048*"est" + 0.048*"il" + 0.039*"et" + 0.030*"se" + 0.027*"nah" + 0.025*"ffs" + 0.024*"son" + 0.022*"super"
2019-07-

2019-07-18 11:22:59,587 : INFO : topic diff=inf, rho=0.377964
2019-07-18 11:23:01,732 : INFO : merging changes from 10000 documents into a model of 100000 documents
2019-07-18 11:23:01,766 : INFO : topic #33 (0.010): 0.019*"love" + 0.013*"show" + 0.013*"chelsea" + 0.011*"say" + 0.011*"go" + 0.010*"one" + 0.009*"left" + 0.009*"happy" + 0.008*"watch" + 0.008*"em"
2019-07-18 11:23:01,767 : INFO : topic #10 (0.010): 0.017*"london" + 0.017*"thanks" + 0.014*"shot" + 0.012*"history" + 0.011*"one" + 0.010*"madrid" + 0.010*"natural" + 0.010*"two" + 0.010*"never" + 0.009*"like"
2019-07-18 11:23:01,768 : INFO : topic #28 (0.010): 0.023*"like" + 0.012*"year" + 0.010*"one" + 0.010*"say" + 0.009*"thanks" + 0.008*"como" + 0.008*"happy" + 0.008*"look" + 0.008*"chelsea" + 0.007*"great"
2019-07-18 11:23:01,769 : INFO : topic #5 (0.010): 0.015*"travel" + 0.014*"minute" + 0.011*"one" + 0.009*"shirt" + 0.008*"road" + 0.008*"anyway" + 0.008*"amp" + 0.008*"time" + 0.008*"old" + 0.008*"good"
2019-07-18 11:23:

2019-07-18 11:23:13,288 : INFO : PROGRESS: pass 2, dispatched chunk #7 = documents up to #80000/100000, outstanding queue size 8
2019-07-18 11:23:13,327 : INFO : PROGRESS: pass 2, dispatched chunk #8 = documents up to #90000/100000, outstanding queue size 9
2019-07-18 11:23:15,749 : INFO : PROGRESS: pass 2, dispatched chunk #9 = documents up to #100000/100000, outstanding queue size 9
2019-07-18 11:23:16,012 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:23:16,052 : INFO : topic #79 (0.010): 0.055*"thanks" + 0.048*"music" + 0.046*"dah" + 0.036*"expect" + 0.024*"когда" + 0.021*"health" + 0.017*"roll" + 0.013*"fight" + 0.012*"mental" + 0.011*"today"
2019-07-18 11:23:16,053 : INFO : topic #0 (0.010): 0.127*"de" + 0.085*"la" + 0.072*"que" + 0.049*"en" + 0.046*"yeah" + 0.039*"un" + 0.034*"el" + 0.028*"tu" + 0.023*"apa" + 0.022*"e"
2019-07-18 11:23:16,054 : INFO : topic #69 (0.010): 0.035*"white" + 0.027*"tonight" + 0.024*"proud" + 0.024*"amp" + 

2019-07-18 11:23:28,726 : INFO : topic #99 (0.010): 0.044*"stupid" + 0.033*"pls" + 0.030*"dream" + 0.028*"shock" + 0.025*"evidence" + 0.024*"woman" + 0.022*"none" + 0.021*"stream" + 0.021*"attend" + 0.019*"rangersfc"
2019-07-18 11:23:28,728 : INFO : topic #24 (0.010): 0.130*"agree" + 0.049*"many" + 0.029*"sing" + 0.025*"parliament" + 0.021*"unfortunately" + 0.020*"longer" + 0.019*"yellow" + 0.019*"think" + 0.019*"access" + 0.017*"people"
2019-07-18 11:23:28,733 : INFO : topic #22 (0.010): 0.079*"follow" + 0.044*"night" + 0.031*"without" + 0.031*"last" + 0.025*"reply" + 0.024*"peterhead" + 0.023*"add" + 0.023*"thesnp" + 0.021*"cake" + 0.020*"delicious"
2019-07-18 11:23:28,735 : INFO : topic diff=inf, rho=0.267261
2019-07-18 11:23:30,662 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:23:30,698 : INFO : topic #41 (0.010): 0.082*"ya" + 0.072*"food" + 0.062*"put" + 0.050*"bet" + 0.040*"bar" + 0.035*"totally" + 0.031*"pop" + 0.024*"across" + 0.02

2019-07-18 11:23:40,954 : INFO : topic #51 (0.010): 0.064*"eu" + 0.050*"ready" + 0.034*"ask" + 0.026*"report" + 0.025*"advice" + 0.020*"sent" + 0.020*"air" + 0.019*"u" + 0.017*"colleague" + 0.017*"cunt"
2019-07-18 11:23:40,955 : INFO : topic #49 (0.010): 0.181*"bad" + 0.102*"little" + 0.055*"meeting" + 0.041*"work" + 0.036*"class" + 0.030*"receive" + 0.024*"tag" + 0.022*"nobody" + 0.020*"highlight" + 0.016*"finger"
2019-07-18 11:23:40,955 : INFO : topic #4 (0.010): 0.086*"must" + 0.060*"young" + 0.057*"medium" + 0.041*"social" + 0.040*"realise" + 0.036*"hey" + 0.033*"cool" + 0.031*"attack" + 0.030*"people" + 0.029*"rule"
2019-07-18 11:23:40,956 : INFO : topic #40 (0.010): 0.095*"leave" + 0.042*"area" + 0.038*"strong" + 0.031*"sometimes" + 0.030*"grow" + 0.026*"ground" + 0.024*"board" + 0.023*"studio" + 0.020*"discover" + 0.020*"george"
2019-07-18 11:23:40,956 : INFO : topic #95 (0.010): 0.140*"xx" + 0.060*"number" + 0.060*"worth" + 0.042*"mum" + 0.040*"bloody" + 0.031*"spur" + 0.027*"s

2019-07-18 11:24:10,919 : INFO : topic diff=inf, rho=0.316228
2019-07-18 11:24:10,953 : INFO : PROGRESS: pass 1, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:24:10,999 : INFO : PROGRESS: pass 1, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:24:11,245 : INFO : PROGRESS: pass 1, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:24:11,292 : INFO : PROGRESS: pass 1, dispatched chunk #3 = documents up to #40000/100000, outstanding queue size 4
2019-07-18 11:24:11,340 : INFO : PROGRESS: pass 1, dispatched chunk #4 = documents up to #50000/100000, outstanding queue size 5
2019-07-18 11:24:11,368 : INFO : PROGRESS: pass 1, dispatched chunk #5 = documents up to #60000/100000, outstanding queue size 6
2019-07-18 11:24:11,393 : INFO : PROGRESS: pass 1, dispatched chunk #6 = documents up to #70000/100000, outstanding queue size 7
2019-07-18 11:24:11,420 : INFO : PR

2019-07-18 11:24:23,939 : INFO : topic #71 (0.007): 0.111*"di" + 0.037*"honest" + 0.028*"piece" + 0.027*"safe" + 0.021*"rid" + 0.021*"legal" + 0.016*"experienced" + 0.014*"amp" + 0.012*"injured" + 0.010*"lie"
2019-07-18 11:24:23,940 : INFO : topic diff=inf, rho=0.277350
2019-07-18 11:24:26,286 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:24:26,345 : INFO : topic #30 (0.007): 0.153*"oh" + 0.078*"wow" + 0.053*"black" + 0.047*"friday" + 0.021*"wine" + 0.020*"law" + 0.016*"nation" + 0.013*"emergency" + 0.012*"come" + 0.011*"allah"
2019-07-18 11:24:26,346 : INFO : topic #129 (0.007): 0.086*"eu" + 0.086*"vote" + 0.032*"nicolasturgeon" + 0.026*"want" + 0.025*"people" + 0.024*"country" + 0.022*"scotland" + 0.022*"choice" + 0.017*"brother" + 0.016*"uk"
2019-07-18 11:24:26,347 : INFO : topic #125 (0.007): 0.069*"side" + 0.042*"surely" + 0.040*"rate" + 0.035*"university" + 0.029*"ross" + 0.021*"ان" + 0.021*"theme" + 0.020*"ship" + 0.016*"هو" + 0.015

2019-07-18 11:25:02,775 : INFO : topic #48 (0.007): 0.015*"well" + 0.010*"go" + 0.010*"think" + 0.008*"people" + 0.008*"one" + 0.008*"like" + 0.008*"fuck" + 0.007*"call" + 0.006*"take" + 0.006*"know"
2019-07-18 11:25:02,778 : INFO : topic #143 (0.007): 0.018*"do" + 0.012*"year" + 0.011*"look" + 0.010*"work" + 0.009*"fuck" + 0.009*"go" + 0.009*"bad" + 0.008*"oh" + 0.008*"well" + 0.008*"people"
2019-07-18 11:25:02,789 : INFO : topic #100 (0.007): 0.021*"go" + 0.018*"thank" + 0.012*"one" + 0.012*"make" + 0.011*"like" + 0.008*"say" + 0.007*"new" + 0.007*"see" + 0.007*"know" + 0.007*"day"
2019-07-18 11:25:02,793 : INFO : topic diff=inf, rho=0.500000
2019-07-18 11:25:05,159 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:25:05,210 : INFO : topic #47 (0.007): 0.018*"time" + 0.016*"go" + 0.013*"hahaha" + 0.012*"day" + 0.011*"like" + 0.010*"amp" + 0.010*"tout" + 0.009*"still" + 0.008*"nice" + 0.007*"make"
2019-07-18 11:25:05,211 : INFO : topic #78 (0

2019-07-18 11:25:17,830 : INFO : topic #90 (0.007): 0.071*"agree" + 0.034*"attend" + 0.033*"lovely" + 0.025*"drop" + 0.020*"awareness" + 0.015*"time" + 0.014*"one" + 0.014*"go" + 0.012*"day" + 0.012*"hi"
2019-07-18 11:25:17,831 : INFO : topic #106 (0.007): 0.030*"bush" + 0.022*"manage" + 0.021*"man" + 0.019*"sense" + 0.017*"dia" + 0.015*"pressure" + 0.014*"attack" + 0.013*"go" + 0.012*"produce" + 0.012*"also"
2019-07-18 11:25:17,831 : INFO : topic #25 (0.007): 0.091*"ya" + 0.047*"welcome" + 0.028*"e" + 0.027*"por" + 0.027*"trump" + 0.025*"del" + 0.023*"de" + 0.020*"brain" + 0.018*"la" + 0.018*"ignore"
2019-07-18 11:25:17,832 : INFO : topic diff=inf, rho=0.288675
2019-07-18 11:25:17,865 : INFO : PROGRESS: pass 2, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:25:17,911 : INFO : PROGRESS: pass 2, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:25:17,959 : INFO : PROGRESS: pass 2, dispatched chunk #2 

2019-07-18 11:25:29,833 : INFO : topic #17 (0.007): 0.063*"cfc" + 0.062*"literally" + 0.043*"bar" + 0.042*"chance" + 0.038*"care" + 0.034*"half" + 0.034*"weird" + 0.028*"fake" + 0.021*"create" + 0.019*"voting"
2019-07-18 11:25:29,836 : INFO : topic #72 (0.007): 0.164*"call" + 0.040*"email" + 0.036*"almost" + 0.030*"matter" + 0.026*"hahahahaha" + 0.025*"press" + 0.022*"tip" + 0.019*"one" + 0.017*"ja" + 0.013*"shift"
2019-07-18 11:25:29,845 : INFO : topic #108 (0.007): 0.057*"join" + 0.041*"wall" + 0.041*"decent" + 0.034*"cupcake" + 0.026*"theme" + 0.025*"youth" + 0.024*"onto" + 0.021*"discus" + 0.020*"u" + 0.019*"december"
2019-07-18 11:25:29,845 : INFO : topic #73 (0.007): 0.041*"health" + 0.029*"difference" + 0.029*"mental" + 0.028*"charge" + 0.026*"stress" + 0.020*"value" + 0.020*"partner" + 0.020*"advent" + 0.018*"calendar" + 0.018*"clue"
2019-07-18 11:25:29,849 : INFO : topic diff=inf, rho=0.267261
2019-07-18 11:25:32,071 : INFO : merging changes from 30000 documents into a model o

2019-07-18 11:25:41,175 : INFO : topic diff=inf, rho=0.258199
2019-07-18 11:25:43,031 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:25:43,085 : INFO : topic #50 (0.007): 0.142*"sign" + 0.104*"set" + 0.078*"v" + 0.070*"bbc" + 0.053*"door" + 0.041*"когда" + 0.033*"fund" + 0.029*"client" + 0.025*"celtic" + 0.025*"copy"
2019-07-18 11:25:43,086 : INFO : topic #129 (0.007): 0.127*"finally" + 0.087*"james" + 0.083*"english" + 0.075*"coach" + 0.061*"scot" + 0.032*"leaf" + 0.030*"scotrail" + 0.029*"environment" + 0.029*"apology" + 0.020*"wm"
2019-07-18 11:25:43,087 : INFO : topic #108 (0.007): 0.114*"join" + 0.049*"wall" + 0.043*"decent" + 0.037*"u" + 0.036*"cupcake" + 0.034*"december" + 0.029*"however" + 0.028*"onto" + 0.026*"theme" + 0.025*"youth"
2019-07-18 11:25:43,090 : INFO : topic #38 (0.007): 0.112*"finish" + 0.056*"pas" + 0.050*"massive" + 0.046*"product" + 0.045*"shut" + 0.042*"bottle" + 0.041*"university" + 0.035*"pitch" + 0.034*"approac

2019-07-18 11:26:21,524 : INFO : topic #51 (0.005): 0.098*"love" + 0.043*"ada" + 0.021*"tak" + 0.019*"kalau" + 0.011*"stop" + 0.009*"tahu" + 0.008*"go" + 0.008*"one" + 0.008*"well" + 0.007*"pun"
2019-07-18 11:26:21,525 : INFO : topic diff=inf, rho=0.377964
2019-07-18 11:26:23,894 : INFO : merging changes from 10000 documents into a model of 100000 documents
2019-07-18 11:26:23,959 : INFO : topic #174 (0.005): 0.083*"الله" + 0.031*"week" + 0.018*"time" + 0.016*"today" + 0.015*"important" + 0.013*"feel" + 0.010*"take" + 0.009*"balik" + 0.008*"want" + 0.007*"well"
2019-07-18 11:26:23,960 : INFO : topic #150 (0.005): 0.018*"online" + 0.013*"happy" + 0.013*"travel" + 0.010*"see" + 0.010*"roll" + 0.010*"xx" + 0.009*"need" + 0.009*"birthday" + 0.009*"look" + 0.009*"dont"
2019-07-18 11:26:23,962 : INFO : topic #140 (0.005): 0.017*"make" + 0.013*"central" + 0.013*"amp" + 0.009*"feel" + 0.009*"boy" + 0.008*"like" + 0.008*"take" + 0.007*"think" + 0.007*"today" + 0.007*"day"
2019-07-18 11:26:23,96

2019-07-18 11:26:35,525 : INFO : PROGRESS: pass 2, dispatched chunk #5 = documents up to #60000/100000, outstanding queue size 6
2019-07-18 11:26:35,550 : INFO : PROGRESS: pass 2, dispatched chunk #6 = documents up to #70000/100000, outstanding queue size 7
2019-07-18 11:26:35,575 : INFO : PROGRESS: pass 2, dispatched chunk #7 = documents up to #80000/100000, outstanding queue size 8
2019-07-18 11:26:35,601 : INFO : PROGRESS: pass 2, dispatched chunk #8 = documents up to #90000/100000, outstanding queue size 9
2019-07-18 11:26:37,960 : INFO : PROGRESS: pass 2, dispatched chunk #9 = documents up to #100000/100000, outstanding queue size 9
2019-07-18 11:26:38,161 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:26:38,238 : INFO : topic #111 (0.005): 0.076*"family" + 0.036*"card" + 0.032*"many" + 0.027*"mark" + 0.023*"today" + 0.022*"yet" + 0.022*"receive" + 0.022*"people" + 0.016*"large" + 0.015*"hero"
2019-07-18 11:26:38,239 : INFO : topic #52

2019-07-18 11:27:24,840 : INFO : topic #141 (0.005): 0.022*"like" + 0.018*"need" + 0.017*"come" + 0.010*"home" + 0.010*"good" + 0.010*"scotland" + 0.009*"time" + 0.008*"know" + 0.008*"amp" + 0.008*"go"
2019-07-18 11:27:24,842 : INFO : topic #34 (0.005): 0.011*"think" + 0.011*"scotland" + 0.010*"want" + 0.009*"aberdeen" + 0.009*"great" + 0.008*"fuck" + 0.008*"still" + 0.008*"christmas" + 0.007*"brexit" + 0.007*"today"
2019-07-18 11:27:24,844 : INFO : topic #91 (0.005): 0.020*"aberdeen" + 0.013*"show" + 0.011*"well" + 0.011*"need" + 0.011*"today" + 0.009*"one" + 0.008*"go" + 0.008*"say" + 0.007*"thank" + 0.007*"haha"
2019-07-18 11:27:24,846 : INFO : topic #197 (0.005): 0.029*"like" + 0.013*"see" + 0.013*"good" + 0.012*"look" + 0.011*"need" + 0.009*"take" + 0.009*"never" + 0.008*"time" + 0.007*"much" + 0.007*"try"
2019-07-18 11:27:24,848 : INFO : topic #48 (0.005): 0.018*"oh" + 0.012*"absolutely" + 0.011*"need" + 0.011*"go" + 0.010*"well" + 0.010*"night" + 0.009*"make" + 0.008*"like" + 0.

2019-07-18 11:27:41,711 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:27:41,779 : INFO : topic #186 (0.005): 0.032*"correct" + 0.023*"ft" + 0.022*"make" + 0.021*"cancel" + 0.021*"design" + 0.021*"confuse" + 0.016*"advice" + 0.014*"bisa" + 0.013*"go" + 0.013*"alive"
2019-07-18 11:27:41,780 : INFO : topic #40 (0.005): 0.062*"lead" + 0.047*"sama" + 0.040*"dog" + 0.027*"dulu" + 0.027*"dari" + 0.021*"egg" + 0.020*"masih" + 0.017*"sekarang" + 0.015*"job" + 0.015*"tree"
2019-07-18 11:27:41,781 : INFO : topic #76 (0.005): 0.058*"lovely" + 0.054*"kid" + 0.029*"piss" + 0.028*"cash" + 0.018*"harry" + 0.017*"one" + 0.012*"push" + 0.012*"need" + 0.011*"yes" + 0.010*"look"
2019-07-18 11:27:41,781 : INFO : topic #191 (0.005): 0.076*"god" + 0.068*"sorry" + 0.041*"road" + 0.018*"compare" + 0.016*"george" + 0.016*"spirit" + 0.016*"station" + 0.015*"law" + 0.012*"oh" + 0.011*"call"
2019-07-18 11:27:41,783 : INFO : topic #167 (0.005): 0.030*"lady" + 0.027*"te

2019-07-18 11:27:53,874 : INFO : topic #175 (0.005): 0.099*"lt" + 0.045*"paid" + 0.033*"champion" + 0.029*"william" + 0.029*"ability" + 0.026*"bell" + 0.022*"profit" + 0.022*"heat" + 0.021*"scary" + 0.018*"ahead"
2019-07-18 11:27:53,875 : INFO : topic diff=inf, rho=0.277350
2019-07-18 11:27:53,908 : INFO : PROGRESS: pass 3, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:27:53,960 : INFO : PROGRESS: pass 3, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:27:54,012 : INFO : PROGRESS: pass 3, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:27:54,066 : INFO : PROGRESS: pass 3, dispatched chunk #3 = documents up to #40000/100000, outstanding queue size 4
2019-07-18 11:27:54,125 : INFO : PROGRESS: pass 3, dispatched chunk #4 = documents up to #50000/100000, outstanding queue size 5
2019-07-18 11:27:54,153 : INFO : PROGRESS: pass 3, dispatched chunk #5 = documen

2019-07-18 11:28:05,856 : INFO : topic #130 (0.005): 0.126*"park" + 0.067*"race" + 0.062*"sub" + 0.060*"rate" + 0.057*"referendum" + 0.052*"simply" + 0.050*"ga" + 0.047*"boleh" + 0.040*"nature" + 0.023*"chinese"
2019-07-18 11:28:05,858 : INFO : topic #17 (0.005): 0.254*"club" + 0.067*"smile" + 0.049*"one" + 0.037*"share" + 0.035*"image" + 0.033*"stadium" + 0.028*"teach" + 0.027*"ta" + 0.025*"application" + 0.025*"usa"
2019-07-18 11:28:05,860 : INFO : topic #80 (0.005): 0.147*"read" + 0.076*"everything" + 0.062*"dan" + 0.058*"test" + 0.055*"chat" + 0.050*"common" + 0.045*"choose" + 0.039*"relationship" + 0.028*"jose" + 0.026*"female"
2019-07-18 11:28:05,863 : INFO : topic diff=inf, rho=0.258199
2019-07-18 11:28:07,969 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:28:08,039 : INFO : topic #17 (0.005): 0.253*"club" + 0.064*"smile" + 0.052*"one" + 0.043*"share" + 0.039*"image" + 0.034*"teach" + 0.033*"application" + 0.033*"stadium" + 0.026*"us

2019-07-18 11:28:56,178 : INFO : topic #39 (0.010): 0.019*"one" + 0.009*"day" + 0.009*"take" + 0.008*"need" + 0.008*"job" + 0.008*"happy" + 0.008*"good" + 0.007*"new" + 0.007*"hope" + 0.007*"people"
2019-07-18 11:28:56,179 : INFO : topic #97 (0.010): 0.015*"go" + 0.012*"year" + 0.011*"well" + 0.010*"one" + 0.010*"see" + 0.008*"amp" + 0.008*"oh" + 0.007*"best" + 0.007*"think" + 0.006*"scottish"
2019-07-18 11:28:56,180 : INFO : topic #92 (0.010): 0.015*"like" + 0.015*"work" + 0.012*"look" + 0.011*"see" + 0.010*"well" + 0.009*"need" + 0.009*"good" + 0.008*"one" + 0.006*"year" + 0.006*"amp"
2019-07-18 11:28:56,181 : INFO : topic #43 (0.010): 0.022*"go" + 0.019*"need" + 0.011*"make" + 0.010*"year" + 0.010*"see" + 0.009*"time" + 0.009*"like" + 0.009*"amp" + 0.008*"look" + 0.008*"still"
2019-07-18 11:28:56,182 : INFO : topic diff=inf, rho=0.435275
2019-07-18 11:28:58,737 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:28:58,773 : INFO : topic #5 (0

2019-07-18 11:29:12,264 : INFO : topic #90 (0.010): 0.018*"make" + 0.017*"even" + 0.015*"cool" + 0.014*"drop" + 0.014*"player" + 0.014*"sell" + 0.011*"say" + 0.011*"red" + 0.011*"beat" + 0.011*"key"
2019-07-18 11:29:12,264 : INFO : topic #4 (0.010): 0.033*"amp" + 0.031*"course" + 0.023*"film" + 0.021*"literally" + 0.013*"هذا" + 0.012*"spotify" + 0.011*"time" + 0.011*"watch" + 0.011*"room" + 0.010*"well"
2019-07-18 11:29:12,265 : INFO : topic #18 (0.010): 0.041*"let" + 0.029*"next" + 0.019*"london" + 0.016*"damn" + 0.015*"hill" + 0.013*"end" + 0.012*"shopping" + 0.012*"vegan" + 0.010*"fa" + 0.010*"come"
2019-07-18 11:29:12,269 : INFO : topic #71 (0.010): 0.039*"nice" + 0.032*"beautiful" + 0.028*"jadi" + 0.024*"share" + 0.017*"look" + 0.015*"sih" + 0.013*"aberdeen" + 0.012*"balik" + 0.011*"где" + 0.011*"gue"
2019-07-18 11:29:12,270 : INFO : topic diff=inf, rho=0.225160
2019-07-18 11:29:12,301 : INFO : PROGRESS: pass 2, dispatched chunk #0 = documents up to #10000/100000, outstanding queu

2019-07-18 11:29:38,974 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:29:39,030 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:29:39,085 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:29:39,143 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #40000/100000, outstanding queue size 4
2019-07-18 11:29:39,191 : INFO : PROGRESS: pass 0, dispatched chunk #4 = documents up to #50000/100000, outstanding queue size 5
2019-07-18 11:29:39,229 : INFO : PROGRESS: pass 0, dispatched chunk #5 = documents up to #60000/100000, outstanding queue size 6
2019-07-18 11:29:39,274 : INFO : PROGRESS: pass 0, dispatched chunk #6 = documents up to #70000/100000, outstanding queue size 7
2019-07-18 11:29:39,313 : INFO : PROGRESS: pass 0, dispatched chunk #7 = documents up to #80000/1

2019-07-18 11:29:55,060 : INFO : topic diff=inf, rho=0.225160
2019-07-18 11:29:58,346 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:29:58,389 : INFO : topic #60 (0.010): 0.038*"haha" + 0.019*"love" + 0.013*"channel" + 0.011*"legend" + 0.011*"cover" + 0.010*"way" + 0.010*"go" + 0.010*"museum" + 0.009*"thank" + 0.009*"tonight"
2019-07-18 11:29:58,391 : INFO : topic #73 (0.010): 0.031*"see" + 0.017*"first" + 0.016*"hahahaha" + 0.012*"aye" + 0.011*"love" + 0.011*"go" + 0.011*"nae" + 0.011*"vote" + 0.008*"people" + 0.008*"tory"
2019-07-18 11:29:58,391 : INFO : topic #6 (0.010): 0.020*"remember" + 0.017*"look" + 0.015*"say" + 0.014*"make" + 0.013*"time" + 0.013*"think" + 0.011*"confirm" + 0.010*"one" + 0.009*"business" + 0.009*"like"
2019-07-18 11:29:58,392 : INFO : topic #75 (0.010): 0.014*"month" + 0.014*"feel" + 0.014*"think" + 0.013*"hello" + 0.012*"drop" + 0.012*"problem" + 0.011*"year" + 0.009*"great" + 0.008*"london" + 0.008*"afternoon"
2

2019-07-18 11:30:11,931 : INFO : topic #62 (0.010): 0.039*"season" + 0.032*"wrong" + 0.029*"arsenal" + 0.022*"court" + 0.017*"self" + 0.016*"price" + 0.015*"whatever" + 0.014*"card" + 0.011*"born" + 0.011*"new"
2019-07-18 11:30:11,934 : INFO : topic diff=inf, rho=0.214602
2019-07-18 11:30:13,404 : INFO : merging changes from 10000 documents into a model of 100000 documents
2019-07-18 11:30:13,439 : INFO : topic #95 (0.010): 0.064*"work" + 0.032*"student" + 0.030*"picture" + 0.028*"yeah" + 0.023*"interest" + 0.022*"lot" + 0.022*"mad" + 0.020*"knew" + 0.016*"go" + 0.015*"see"
2019-07-18 11:30:13,440 : INFO : topic #81 (0.010): 0.036*"ça" + 0.032*"day" + 0.026*"response" + 0.021*"work" + 0.018*"light" + 0.018*"member" + 0.016*"experience" + 0.015*"customer" + 0.014*"everyday" + 0.013*"year"
2019-07-18 11:30:13,441 : INFO : topic #96 (0.010): 0.031*"для" + 0.029*"tel" + 0.020*"deep" + 0.020*"product" + 0.020*"como" + 0.019*"credit" + 0.018*"fat" + 0.017*"join" + 0.016*"на" + 0.016*"figure"

2019-07-18 11:30:23,264 : INFO : PROGRESS: pass 4, dispatched chunk #5 = documents up to #60000/100000, outstanding queue size 6
2019-07-18 11:30:23,290 : INFO : PROGRESS: pass 4, dispatched chunk #6 = documents up to #70000/100000, outstanding queue size 7
2019-07-18 11:30:23,316 : INFO : PROGRESS: pass 4, dispatched chunk #7 = documents up to #80000/100000, outstanding queue size 8
2019-07-18 11:30:23,342 : INFO : PROGRESS: pass 4, dispatched chunk #8 = documents up to #90000/100000, outstanding queue size 9
2019-07-18 11:30:25,445 : INFO : PROGRESS: pass 4, dispatched chunk #9 = documents up to #100000/100000, outstanding queue size 9
2019-07-18 11:30:26,292 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:30:26,333 : INFO : topic #0 (0.010): 0.051*"je" + 0.042*"ما" + 0.034*"cross" + 0.030*"finger" + 0.025*"risk" + 0.023*"patient" + 0.023*"fail" + 0.022*"scene" + 0.021*"internet" + 0.018*"society"
2019-07-18 11:30:26,337 : INFO : topic #72

2019-07-18 11:30:53,069 : INFO : topic #110 (0.007): 0.017*"time" + 0.017*"aberdeen" + 0.016*"amp" + 0.014*"de" + 0.013*"see" + 0.009*"still" + 0.008*"pm" + 0.007*"know" + 0.007*"people" + 0.007*"league"
2019-07-18 11:30:53,070 : INFO : topic #96 (0.007): 0.015*"day" + 0.013*"go" + 0.011*"great" + 0.011*"say" + 0.010*"like" + 0.008*"one" + 0.008*"look" + 0.008*"take" + 0.007*"make" + 0.007*"come"
2019-07-18 11:30:53,071 : INFO : topic #77 (0.007): 0.041*"yes" + 0.015*"new" + 0.014*"think" + 0.009*"want" + 0.009*"people" + 0.008*"look" + 0.008*"sure" + 0.007*"tweet" + 0.007*"start" + 0.007*"ticket"
2019-07-18 11:30:53,071 : INFO : topic #53 (0.007): 0.014*"say" + 0.012*"year" + 0.011*"aberdeen" + 0.011*"well" + 0.009*"take" + 0.009*"good" + 0.009*"left" + 0.008*"make" + 0.008*"best" + 0.008*"via"
2019-07-18 11:30:53,073 : INFO : topic diff=108.695847, rho=1.000000
2019-07-18 11:30:56,227 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:30:56,2

2019-07-18 11:31:09,568 : INFO : topic #56 (0.007): 0.043*"die" + 0.025*"london" + 0.011*"understand" + 0.010*"uk" + 0.010*"amp" + 0.010*"place" + 0.008*"interest" + 0.008*"want" + 0.008*"make" + 0.008*"win"
2019-07-18 11:31:09,569 : INFO : topic #7 (0.007): 0.084*"please" + 0.025*"someone" + 0.024*"mean" + 0.023*"gonna" + 0.022*"else" + 0.016*"know" + 0.015*"congrats" + 0.013*"hun" + 0.011*"drop" + 0.011*"help"
2019-07-18 11:31:09,569 : INFO : topic #71 (0.007): 0.056*"go" + 0.033*"cat" + 0.020*"tottenham" + 0.019*"way" + 0.017*"было" + 0.015*"chelsea" + 0.014*"send" + 0.014*"information" + 0.009*"legend" + 0.009*"fix"
2019-07-18 11:31:09,572 : INFO : topic #82 (0.007): 0.025*"call" + 0.017*"phone" + 0.015*"hun" + 0.013*"due" + 0.012*"work" + 0.011*"go" + 0.009*"network" + 0.009*"male" + 0.009*"court" + 0.008*"notting"
2019-07-18 11:31:09,573 : INFO : topic diff=inf, rho=0.225160
2019-07-18 11:31:11,609 : INFO : merging changes from 10000 documents into a model of 100000 documents
201

2019-07-18 11:31:21,083 : INFO : topic diff=inf, rho=0.214602
2019-07-18 11:31:21,204 : INFO : using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
2019-07-18 11:31:29,755 : INFO : serializing accumulator to return to master...
2019-07-18 11:31:29,759 : INFO : accumulator serialized
2019-07-18 11:31:29,765 : INFO : serializing accumulator to return to master...
2019-07-18 11:31:29,769 : INFO : serializing accumulator to return to master...
2019-07-18 11:31:29,769 : INFO : accumulator serialized
2019-07-18 11:31:29,784 : INFO : accumulator serialized
2019-07-18 11:31:30,760 : INFO : 3 accumulators retrieved from output queue
2019-07-18 11:31:32,890 : INFO : accumulated word occurrence stats for 82886 virtual documents
2019-07-18 11:31:46,209 : INFO : using symmetric alpha at 0.006666666666666667
2019-07-18 11:31:46,210 : INFO : using symmetric eta at 0.006666666666666667
2019-07-18 11:31:46,211 : INFO : using serial LDA versi

2019-07-18 11:31:58,228 : INFO : PROGRESS: pass 1, dispatched chunk #7 = documents up to #80000/100000, outstanding queue size 8
2019-07-18 11:31:58,256 : INFO : PROGRESS: pass 1, dispatched chunk #8 = documents up to #90000/100000, outstanding queue size 9
2019-07-18 11:32:00,862 : INFO : PROGRESS: pass 1, dispatched chunk #9 = documents up to #100000/100000, outstanding queue size 9
2019-07-18 11:32:00,982 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:32:01,036 : INFO : topic #65 (0.007): 0.027*"shit" + 0.021*"make" + 0.012*"shirt" + 0.011*"one" + 0.011*"kind" + 0.010*"need" + 0.009*"come" + 0.009*"well" + 0.009*"bayern" + 0.008*"go"
2019-07-18 11:32:01,038 : INFO : topic #73 (0.007): 0.031*"bet" + 0.016*"tu" + 0.016*"monday" + 0.014*"de" + 0.013*"know" + 0.011*"need" + 0.010*"sleep" + 0.008*"se" + 0.008*"law" + 0.008*"think"
2019-07-18 11:32:01,040 : INFO : topic #2 (0.007): 0.019*"cat" + 0.018*"know" + 0.016*"well" + 0.015*"word" + 0.0

2019-07-18 11:32:14,479 : INFO : topic #77 (0.007): 0.056*"aberdeen" + 0.056*"okay" + 0.044*"pal" + 0.030*"ga" + 0.030*"boleh" + 0.021*"winter" + 0.019*"xmas" + 0.016*"window" + 0.016*"salary" + 0.014*"ideal"
2019-07-18 11:32:14,482 : INFO : topic #23 (0.007): 0.110*"well" + 0.044*"di" + 0.041*"easy" + 0.029*"quite" + 0.028*"deserve" + 0.026*"think" + 0.022*"per" + 0.022*"one" + 0.012*"tie" + 0.012*"look"
2019-07-18 11:32:14,484 : INFO : topic diff=inf, rho=0.214602
2019-07-18 11:32:16,487 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:32:16,539 : INFO : topic #25 (0.007): 0.056*"wonder" + 0.048*"second" + 0.044*"one" + 0.036*"fair" + 0.024*"indeed" + 0.018*"left" + 0.018*"smh" + 0.016*"boring" + 0.016*"journey" + 0.015*"anything"
2019-07-18 11:32:16,540 : INFO : topic #148 (0.007): 0.099*"ha" + 0.030*"arm" + 0.029*"home" + 0.026*"square" + 0.024*"english" + 0.018*"go" + 0.017*"close" + 0.016*"commit" + 0.016*"pisces" + 0.013*"today"
2019-0

2019-07-18 11:32:26,846 : INFO : topic #61 (0.007): 0.109*"en" + 0.083*"el" + 0.076*"de" + 0.068*"finally" + 0.061*"la" + 0.048*"dead" + 0.045*"mi" + 0.044*"un" + 0.041*"que" + 0.029*"dr"
2019-07-18 11:32:26,847 : INFO : topic #76 (0.007): 0.156*"top" + 0.081*"baby" + 0.080*"plan" + 0.050*"когда" + 0.034*"alone" + 0.033*"break" + 0.031*"thesnp" + 0.027*"still" + 0.024*"sun" + 0.023*"ba"
2019-07-18 11:32:26,847 : INFO : topic #83 (0.007): 0.075*"hand" + 0.064*"bro" + 0.050*"told" + 0.042*"award" + 0.031*"glass" + 0.027*"meet" + 0.027*"truth" + 0.026*"year" + 0.025*"perhaps" + 0.024*"bitch"
2019-07-18 11:32:26,848 : INFO : topic diff=inf, rho=0.205269
2019-07-18 11:32:26,881 : INFO : PROGRESS: pass 4, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:32:26,929 : INFO : PROGRESS: pass 4, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:32:26,977 : INFO : PROGRESS: pass 4, dispatched chunk #2 = documents u

2019-07-18 11:33:02,188 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:33:02,243 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #40000/100000, outstanding queue size 4
2019-07-18 11:33:02,494 : INFO : PROGRESS: pass 0, dispatched chunk #4 = documents up to #50000/100000, outstanding queue size 5
2019-07-18 11:33:02,527 : INFO : PROGRESS: pass 0, dispatched chunk #5 = documents up to #60000/100000, outstanding queue size 6
2019-07-18 11:33:02,557 : INFO : PROGRESS: pass 0, dispatched chunk #6 = documents up to #70000/100000, outstanding queue size 7
2019-07-18 11:33:02,585 : INFO : PROGRESS: pass 0, dispatched chunk #7 = documents up to #80000/100000, outstanding queue size 8
2019-07-18 11:33:02,614 : INFO : PROGRESS: pass 0, dispatched chunk #8 = documents up to #90000/100000, outstanding queue size 9
2019-07-18 11:33:05,585 : INFO : PROGRESS: pass 0, dispatched chunk #9 = documents up to #100000/

2019-07-18 11:33:21,230 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:33:21,300 : INFO : topic #69 (0.005): 0.058*"never" + 0.018*"imagine" + 0.016*"shock" + 0.015*"back" + 0.013*"red" + 0.012*"drink" + 0.012*"year" + 0.011*"idea" + 0.011*"daughter" + 0.011*"healthy"
2019-07-18 11:33:21,300 : INFO : topic #138 (0.005): 0.048*"dah" + 0.018*"car" + 0.017*"left" + 0.014*"pack" + 0.012*"never" + 0.011*"council" + 0.011*"ni" + 0.011*"th" + 0.010*"card" + 0.009*"available"
2019-07-18 11:33:21,303 : INFO : topic #17 (0.005): 0.040*"travel" + 0.036*"road" + 0.028*"traffic" + 0.024*"minute" + 0.023*"london" + 0.018*"new" + 0.015*"court" + 0.015*"usual" + 0.015*"ye" + 0.014*"street"
2019-07-18 11:33:21,305 : INFO : topic #162 (0.005): 0.057*"good" + 0.036*"cry" + 0.025*"see" + 0.017*"لك" + 0.012*"الله" + 0.012*"speed" + 0.012*"time" + 0.011*"thing" + 0.011*"amen" + 0.011*"suck"
2019-07-18 11:33:21,307 : INFO : topic #72 (0.005): 0.031*"que" + 0.026*

2019-07-18 11:33:34,437 : INFO : topic #31 (0.005): 0.230*"man" + 0.054*"fulham" + 0.044*"account" + 0.037*"jadi" + 0.030*"four" + 0.027*"bank" + 0.019*"sih" + 0.017*"balik" + 0.017*"gue" + 0.011*"jim"
2019-07-18 11:33:34,439 : INFO : topic diff=inf, rho=0.214602
2019-07-18 11:33:36,219 : INFO : merging changes from 10000 documents into a model of 100000 documents
2019-07-18 11:33:36,285 : INFO : topic #53 (0.005): 0.050*"wear" + 0.050*"garden" + 0.032*"receive" + 0.031*"google" + 0.030*"branch" + 0.029*"security" + 0.023*"purpose" + 0.021*"limit" + 0.021*"teach" + 0.017*"fantastic"
2019-07-18 11:33:36,286 : INFO : topic #163 (0.005): 0.057*"men" + 0.037*"jorginho" + 0.028*"medical" + 0.025*"doctor" + 0.025*"pretty" + 0.023*"beat" + 0.021*"ross" + 0.019*"people" + 0.017*"play" + 0.017*"commission"
2019-07-18 11:33:36,287 : INFO : topic #66 (0.005): 0.089*"enjoy" + 0.075*"must" + 0.068*"leave" + 0.053*"eat" + 0.043*"score" + 0.016*"head" + 0.015*"go" + 0.014*"time" + 0.013*"alternative"

2019-07-18 11:34:21,871 : INFO : topic #66 (0.005): 0.019*"fuck" + 0.012*"william" + 0.011*"great" + 0.010*"mean" + 0.009*"go" + 0.009*"scotland" + 0.008*"weekend" + 0.007*"cupcake" + 0.007*"holy" + 0.007*"look"
2019-07-18 11:34:21,872 : INFO : topic #110 (0.005): 0.027*"amp" + 0.015*"say" + 0.010*"need" + 0.010*"interview" + 0.008*"sarri" + 0.008*"vote" + 0.008*"still" + 0.007*"london" + 0.007*"deal" + 0.007*"tea"
2019-07-18 11:34:21,873 : INFO : topic diff=inf, rho=0.251189
2019-07-18 11:34:21,906 : INFO : PROGRESS: pass 1, dispatched chunk #0 = documents up to #10000/100000, outstanding queue size 1
2019-07-18 11:34:21,956 : INFO : PROGRESS: pass 1, dispatched chunk #1 = documents up to #20000/100000, outstanding queue size 2
2019-07-18 11:34:22,005 : INFO : PROGRESS: pass 1, dispatched chunk #2 = documents up to #30000/100000, outstanding queue size 3
2019-07-18 11:34:22,266 : INFO : PROGRESS: pass 1, dispatched chunk #3 = documents up to #40000/100000, outstanding queue size 4
201

2019-07-18 11:34:36,990 : INFO : topic #14 (0.005): 0.050*"attack" + 0.032*"ko" + 0.027*"weird" + 0.023*"sea" + 0.019*"unless" + 0.015*"people" + 0.015*"useless" + 0.013*"today" + 0.013*"need" + 0.012*"disagree"
2019-07-18 11:34:37,004 : INFO : topic #2 (0.005): 0.068*"busy" + 0.037*"lane" + 0.034*"felt" + 0.031*"road" + 0.027*"لك" + 0.026*"court" + 0.018*"one" + 0.017*"drunk" + 0.015*"oops" + 0.015*"alcohol"
2019-07-18 11:34:37,006 : INFO : topic #178 (0.005): 0.184*"thanks" + 0.038*"community" + 0.036*"list" + 0.028*"earlier" + 0.025*"great" + 0.017*"pedro" + 0.013*"see" + 0.010*"support" + 0.009*"need" + 0.009*"anna_soubry"
2019-07-18 11:34:37,010 : INFO : topic diff=inf, rho=0.214602
2019-07-18 11:34:39,306 : INFO : merging changes from 30000 documents into a model of 100000 documents
2019-07-18 11:34:39,377 : INFO : topic #118 (0.005): 0.060*"consider" + 0.040*"stick" + 0.026*"chocolate" + 0.022*"ta" + 0.022*"say" + 0.018*"insta" + 0.018*"highlight" + 0.017*"tfl" + 0.015*"like" + 

2019-07-18 11:34:52,011 : INFO : topic #141 (0.005): 0.126*"party" + 0.094*"labour" + 0.046*"charity" + 0.037*"general" + 0.035*"corbyn" + 0.029*"если" + 0.022*"brexit" + 0.017*"people" + 0.017*"primary" + 0.016*"election"
2019-07-18 11:34:52,012 : INFO : topic #110 (0.005): 0.080*"absolute" + 0.065*"interview" + 0.035*"parliament" + 0.032*"amp" + 0.032*"straight" + 0.026*"adam" + 0.026*"crime" + 0.023*"fee" + 0.021*"deal" + 0.020*"threaten"
2019-07-18 11:34:52,013 : INFO : topic #22 (0.005): 0.098*"sign" + 0.054*"announce" + 0.048*"indeed" + 0.045*"leader" + 0.042*"surprised" + 0.041*"com" + 0.036*"scar" + 0.029*"leadership" + 0.028*"effect" + 0.023*"new"
2019-07-18 11:34:52,013 : INFO : topic #105 (0.005): 0.088*"hard" + 0.066*"wear" + 0.049*"often" + 0.047*"forget" + 0.042*"finish" + 0.039*"mi" + 0.025*"medical" + 0.025*"push" + 0.021*"duty" + 0.021*"usa"
2019-07-18 11:34:52,017 : INFO : topic #4 (0.005): 0.154*"job" + 0.052*"aberdeen" + 0.051*"scotland" + 0.050*"hire" + 0.042*"mad"

2019-07-18 11:35:02,648 : INFO : topic #4 (0.005): 0.189*"job" + 0.059*"aberdeen" + 0.055*"scotland" + 0.049*"hire" + 0.049*"mad" + 0.032*"replace" + 0.030*"opening" + 0.028*"recommend" + 0.024*"careerarc" + 0.023*"work"
2019-07-18 11:35:02,649 : INFO : topic diff=inf, rho=0.196945
2019-07-18 11:35:02,726 : INFO : using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
2019-07-18 11:35:12,865 : INFO : serializing accumulator to return to master...
2019-07-18 11:35:12,867 : INFO : accumulator serialized
2019-07-18 11:35:12,873 : INFO : serializing accumulator to return to master...
2019-07-18 11:35:12,874 : INFO : serializing accumulator to return to master...
2019-07-18 11:35:12,876 : INFO : accumulator serialized
2019-07-18 11:35:12,886 : INFO : accumulator serialized
2019-07-18 11:35:14,163 : INFO : 3 accumulators retrieved from output queue
2019-07-18 11:35:17,015 : INFO : accumulated word occurrence stats for 83787 virtual 

In [53]:
for params in parameter_grid:
    print(params)

{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 100, 'passes': 3}
{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 100, 'passes': 5}
{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 150, 'passes': 3}
{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 150, 'passes': 5}
{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 200, 'passes': 3}
{'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 200, 'passes': 5}
{'chunksize': 10000, 'decay': 0.6, 'eval_every': None, 'iterations': 25, 'num_topics': 100, 'passes': 3}
{'chunksize': 10000, 'decay': 0.6, 'eval_every': None, 'iterations': 25, 'num_topics': 100, 'passes': 5}
{'chunksize': 10000, 'decay': 0.6, 'eval_every': None, 'iterations': 25, 'num_topics': 150, 'passes': 3}
{'chunksize': 10000, 'decay': 0.6, 'eval_every': None, 
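
The printed combinations are consistent with a grid that varies only `decay`, `num_topics` and `passes` while the other settings stay fixed. The sketch below shows how such a grid expands into twelve runs. The value lists are an assumption inferred from the (cut-off) output above, and `expand_grid` is a hypothetical stdlib stand-in that reproduces the ordering seen in the printout (keys sorted, last key varying fastest), not the `ParameterGrid` call used in the notebook.

```python
from itertools import product

# Assumed grid, inferred from the printed combinations above (the output is cut off).
param_values = {
    'chunksize': [10000],
    'decay': [0.5, 0.6],
    'eval_every': [None],
    'iterations': [25],
    'num_topics': [100, 150, 200],
    'passes': [3, 5],
}


def expand_grid(values):
    """Hypothetical helper: every combination, keys sorted, last key varying fastest."""
    keys = sorted(values)
    return [dict(zip(keys, combo)) for combo in product(*(values[k] for k in keys))]


grid = expand_grid(param_values)
print(len(grid))   # number of models that would be trained
print(grid[0])     # first combination, same key order as the printout
```

Each extra value in any list multiplies the number of models to train, so the grid size is worth checking before starting a run.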

In [54]:
print(coherence)

[0.4384908645647992, 0.47253372650237063, 0.4441244130162207, 0.45568775728125976, 0.4194790059190659, 0.4214115472435088, 0.44284943857801046, 0.4604311841446061, 0.4318464192434856, 0.44184628807838644, 0.41799105775870626, 0.4280441069151003]


In [55]:
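# Index of the highest coherence score; this relies on `coherence`,
# `parameters` and `lda_models` all being in the same (training) order.
best_model_idx = coherence.index(max(coherence))
print(f"""Params: {parameters[best_model_idx]}
Coherence: {coherence[best_model_idx]}
""")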

Params: {'chunksize': 10000, 'decay': 0.5, 'eval_every': None, 'iterations': 25, 'num_topics': 100, 'passes': 5}
Coherence: 0.47253372650237063
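
Picking only the maximum hides how close the other runs are. As a rough sketch, the scores can be paired with their parameters and ranked, and averaged per topic count. The scores are copied from the output above. The parameter order is an assumption: it is rebuilt from the printed grid (decay, then `num_topics`, then `passes`), whose last rows are cut off.

```python
from itertools import product

# Coherence scores copied from the notebook output, in training order.
coherence = [
    0.4384908645647992, 0.47253372650237063, 0.4441244130162207,
    0.45568775728125976, 0.4194790059190659, 0.4214115472435088,
    0.44284943857801046, 0.4604311841446061, 0.4318464192434856,
    0.44184628807838644, 0.41799105775870626, 0.4280441069151003,
]

# Assumed run order, inferred from the printed parameter grid.
runs = [
    {'decay': decay, 'num_topics': num_topics, 'passes': passes}
    for decay, num_topics, passes in product([0.5, 0.6], [100, 150, 200], [3, 5])
]

# Rank every run by coherence, best first.
ranked = sorted(zip(coherence, runs), key=lambda pair: pair[0], reverse=True)
for score, params in ranked[:3]:
    print(f"{score:.4f}  {params}")

# Mean coherence for each topic count, to see the overall trend.
mean_by_topics = {}
for num_topics in (100, 150, 200):
    scores = [c for c, r in zip(coherence, runs) if r['num_topics'] == num_topics]
    mean_by_topics[num_topics] = sum(scores) / len(scores)
print(mean_by_topics)
```

Under that assumed ordering, the three best runs all use `passes=5`, and mean coherence drops as `num_topics` goes from 100 to 200, which suggests trying fewer topics or more passes in the next round.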



In [56]:
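# Note: this is string concatenation, not a path join, so the file is written
# as '<MODEL_PATH>/tweetsbest_lda.model' (see the log output below).
lda_models[best_model_idx].save(str(MODEL_PATH / 'tweets') + 'best_lda.model')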

2019-07-18 11:40:24,136 : INFO : saving LdaState object under ../../models/tweetsbest_lda.model.state, separately None
2019-07-18 11:40:24,141 : INFO : saved ../../models/tweetsbest_lda.model.state
2019-07-18 11:40:24,144 : INFO : saving LdaMulticore object under ../../models/tweetsbest_lda.model, separately ['expElogbeta', 'sstats']
2019-07-18 11:40:24,144 : INFO : storing np array 'expElogbeta' to ../../models/tweetsbest_lda.model.expElogbeta.npy
2019-07-18 11:40:24,146 : INFO : not storing attribute dispatcher
2019-07-18 11:40:24,147 : INFO : not storing attribute state
2019-07-18 11:40:24,147 : INFO : not storing attribute id2word
2019-07-18 11:40:24,148 : INFO : saved ../../models/tweetsbest_lda.model
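
The log shows the model landing at `../../models/tweetsbest_lda.model`, because the file name is appended to the string without a separator. The sketch below reproduces that path and shows two alternatives that may have been intended. `MODEL_PATH` is assumed to be `../../models` (taken from the log), and both alternative names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed value, taken from the paths in the log output.
MODEL_PATH = PurePosixPath('../../models')

# What the save cell builds: plain string concatenation, no separator.
concatenated = str(MODEL_PATH / 'tweets') + 'best_lda.model'

# Hypothetical alternatives: a 'tweets' subfolder, or an underscore in the name.
in_subfolder = MODEL_PATH / 'tweets' / 'best_lda.model'
with_underscore = MODEL_PATH / 'tweets_best_lda.model'

print(concatenated)
print(in_subfolder)
print(with_underscore)
```

Whichever form is used, the load cell has to build exactly the same path as the save cell.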


### Load best model

In [58]:
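# The model was trained with LdaMulticore (a subclass of LdaModel), so it can be
# loaded through LdaModel.load; the path must match the one used in save().
best_lda = LdaModel.load(str(MODEL_PATH / 'tweets') + 'best_lda.model')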

2019-07-18 11:41:52,442 : INFO : loading LdaModel object from ../../models/tweetsbest_lda.model
2019-07-18 11:41:52,443 : INFO : loading expElogbeta from ../../models/tweetsbest_lda.model.expElogbeta.npy with mmap=None
2019-07-18 11:41:52,444 : INFO : setting ignored attribute dispatcher to None
2019-07-18 11:41:52,445 : INFO : setting ignored attribute state to None
2019-07-18 11:41:52,446 : INFO : setting ignored attribute id2word to None
2019-07-18 11:41:52,446 : INFO : loaded ../../models/tweetsbest_lda.model
2019-07-18 11:41:52,447 : INFO : loading LdaState object from ../../models/tweetsbest_lda.model.state
2019-07-18 11:41:52,451 : INFO : loaded ../../models/tweetsbest_lda.model.state
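
The save and load logs show that the model is spread over several files: the main file, a `.state` file and an `.expElogbeta.npy` array. If the model is copied to another machine, all of them need to travel together. A small sketch for checking this before loading follows; `missing_model_files` is a hypothetical helper, and the suffix list is taken only from this run's log, so other gensim versions or model sizes may write a different set.

```python
from pathlib import Path

# Suffixes seen in the save/load log of this run; not guaranteed for other setups.
COMPANION_SUFFIXES = ['', '.state', '.expElogbeta.npy']


def missing_model_files(base_path):
    """Hypothetical helper: return the expected model files that do not exist."""
    expected = [Path(str(base_path) + suffix) for suffix in COMPANION_SUFFIXES]
    return [path for path in expected if not path.exists()]


# Path as it appears in the log output.
missing = missing_model_files('../../models/tweetsbest_lda.model')
if missing:
    print("Missing files:", [str(path) for path in missing])
else:
    print("All model files present")
```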
