# Topic Modeling of Twitter Followers

This notebook accompanies [this article on my blog](http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html).

We use LDAvis to visualize several LDA models of the followers of the [@alexip](https://twitter.com/alexip) account.

The LDA models were trained with the following parameters:

* 40 topics, 100 passes, alpha = 0.001

The data was extracted from Twitter with [this Python 2 script](https://github.com/alexperrier/datatalks/tree/master/twitter).
The dictionary and corpus were created with [this one](https://github.com/alexperrier/datatalks/tree/master/twitter).

For the best results, set lambda between 0.5 and 0.6. Lowering lambda gives more weight to words that discriminate the active topic from the others, that is, the words that best define the topic.
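To make the effect of lambda concrete, the sketch below computes the relevance score that LDAvis uses to rank terms, as defined by Sievert and Shirley: `lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))`. The topic weights mirror two of the values printed for topic 22 later in this notebook; the corpus-wide probabilities are made up for illustration, and the helper names `relevance` and `rank` are hypothetical.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term for a topic, for a given lambda in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# term -> (p(w|t), p(w)); the p(w) values are hypothetical
terms = {
    "https": (0.083, 0.080),    # frequent in the topic and across the whole corpus
    "pyconcz": (0.061, 0.002),  # frequent in the topic, rare elsewhere
}

def rank(lam):
    """Return the terms sorted by decreasing relevance for the given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

top_at_1 = rank(1.0)[0]     # lambda = 1 ranks by raw topic probability
top_at_half = rank(0.5)[0]  # lambda = 0.5 rewards terms specific to the topic
```

With these numbers, a ubiquitous token such as `https` leads the ranking at lambda = 1, while the topic-specific `pyconcz` takes over at lambda = 0.5.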

You can skip the first two models and jump to the last one (40 topics), which is the best.

A working version of this notebook is available on [nbviewer](http://nbviewer.ipython.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis.ipynb).


In [1]:
# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim

corpus = corpora.MmCorpus('data/twitter.mm')
dictionary = corpora.Dictionary.load('data/twitter.dict')

In [2]:
# Load the trained model and prepare the data for the LDAvis visualization
lda = models.LdaModel.load('data/lda.model')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)



In [3]:
lda.show_topics()


[(10,
  u'0.010*playin + 0.010*dex + 0.010*haiku + 0.010*newmusic + 0.010*code + 0.010*step + 0.010*move + 0.010*link + 0.010*much + 0.010*today'),
 (27,
  u'0.011*time + 0.010*https + 0.010*code + 0.009*python + 0.007*javascript + 0.007*work + 0.006*people + 0.005*working + 0.005*next + 0.005*first'),
 (18,
  u'0.020*https + 0.016*awwwardsams + 0.016*mimpi + 0.012*digiveletrh + 0.012*direct + 0.008*conference + 0.008*interested + 0.008*game + 0.008*ago + 0.008*time'),
 (22,
  u'0.083*https + 0.061*pyconcz + 0.048*python + 0.015*video + 0.013*talk + 0.011*brno + 0.011*pycon + 0.010*conference + 0.009*workshop + 0.009*talks'),
 (8,
  u'0.084*https + 0.040*free + 0.014*mockups + 0.013*download + 0.011*icons + 0.010*python + 0.010*prost\u0159ednictv\xedm + 0.009*django + 0.009*beautiful + 0.008*gomockups'),
 (4,
  u'0.113*https + 0.059*job + 0.047*poland + 0.041*wroclaw + 0.028*developer + 0.023*net + 0.022*switzerland + 0.018*java + 0.016*sql + 0.013*software'),
 (33,
  u'0.000*https + 0