# Visualization of Web Summit 2015 Tweets
In the following, we analyze tweets about the Web Summit 2015 in Dublin regarding their topics and visualize them.

Over the course of the Web Summit, we have collected about 77k tweets that refer to the event and stored them in `tweets.json`. We will now load this file and store only the contents of our tweets in a list.

In [1]:
import json
def load_tweets(dump_file):
    corpus = []
    with open(dump_file) as json_file:
        data = json.load(json_file)
        for i, entry in enumerate(data):
            corpus.append(entry['text'])
    return corpus

tweets = load_tweets('tweets.json')

Let's take a closer look at the number of tweets and the first five tweets in our dataset.

In [2]:
print '# of tweets:', len(tweets)
for tweet in tweets[:5]:
    print tweet

# of tweets: 77111
@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV
Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ
I'm at the #WebSummit2015 this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂
@jalak What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2vwDJdWIJJ
#websummit is about to kickoff in #dublin! What are you looking forward to the most?? @WebSummitHQ


Now, in order for the topic modeling to do its job, we do some preprocessing. We normalize positive, negative, neutral and 'lol' smileys, , extract the content of a hashtag, and remove all mentions, urls, numbers, repeating punctuation, elongated characters, as well as common terms such as _websummit_ and _websummit2015_.

In [3]:
import re
class TweetPreprocessor(object):

    def preprocess(self, text):
        eyes, nose = r"[8:=;]", r"['`\-]?"
        
        re_sub = lambda pattern, repl: re.sub(pattern, repl, text, flags=re.MULTILINE | re.DOTALL)

        text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", '')  # remove url
        text = re_sub(r"/"," / ")
        text = re_sub(r"@\w+", '')  # remove @mentions
        text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), ':)')  # normalize happy smiley
        text = re_sub(r"{}{}p+".format(eyes, nose), ':P')  # normalize lol smiley
        text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), ':(')  # normalize sad smiley
        text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), ':|')  # normalize neutral smiley
        text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", '')  # remove number
        text = re_sub(r"#(\S+)", "\1")  # extract hashtag
        text = re_sub(r"([!?.]){2,}", r"\1")  # remove repeating punctuation
        text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2")  # remove elongated characters
        text = re_sub(r"(websummit|https|websummit2015)", '')  # remove common characters

        return text.lower()

    def preprocess_corpus(self, corpus):
        proc_corpus = []
        for document in corpus:
            proc_corpus.append(self.preprocess(document))
        return proc_corpus
    
proc_tweets = TweetPreprocessor().preprocess_corpus(tweets)
for tweet in proc_tweets[:5]:
    print tweet

 what   gadget can you not travel without? stop by stand d on wed at   
start-ups from every continent heading to  including   
i'm at the  this week. on ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂
 what   gadget can you not travel without? stop by stand d on wed at   
 is about to kickoff in  what are you looking forward to the most? 


### Topic modeling
We can now vectorize our data by representing each tweet as 10k dimensional vector whose indices correspond to the 10k most frequent terms (excluding stopwords) in our corpus.
We feed this 77k x 11k feature matrix into [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or topic modeling to detect the latent topics in our data. We run LDA for 2,000 iterations to identify 15 topics and receive a 77k x 15 matrix of topic distributions of our tweets. 

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
import lda
max_features = 10000
vectorizer = CountVectorizer(max_features=max_features, stop_words='english')
features = vectorizer.fit_transform(proc_tweets)

n_topics = 15
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(features)



Let us now look at those topics in more detail. Specifically, we can inspect the words that are most relevant to a topic. We save these words as topic summaries for later.

In [5]:
import numpy as np
n_top_words = 8
topic_summaries = []

topic_word = lda_model.topic_word_  # get the topic words
vocab = vectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: content social media digital marketing dan amp new
Topic 1: summit web dublin vote day live websummithq pitch
Topic 2: thanks just love like follow don ll cool
Topic 3: mobile amp new internet facebook future world years
Topic 4: app meet startup amp email need want help
Topic 5: come stand meet today booth visit amp say
Topic 6: data amp people need make big tech future
Topic 7: year wifi just lisbon like people think good
Topic 8: dublin sheep vr like gt reality start virtual
Topic 9: startups tech great michael start dell worth best
Topic 10: tinder interview ireland paddy just ceo founder talk
Topic 11: smartwatch amp world let blocks open thing week
Topic 12: stage centre pitch live talk founder marketing ceo
Topic 13: great day looking forward thanks good today amazing
Topic 14: dublin day night tonight ready free amp great


As we can see, we have 15 different topics, which can be roughly assigned to concepts like social media (0), follow and like invitations (2), invitations to come to a booth (5), big data (6), startup and tech-relatd topics (9), centre stage pitches and talks (12), appreciation and outlook (13), among others. Topic 7 might deal with current and future concerns.

Similarly, we can also look at the document topic distributions. Remember those five tweets we looked at earlier? We can now retrieve the topic that has the highest probability in their distribution.

In [6]:
doc_topic = lda_model.doc_topic_
for i in range(10):
    print "%s (top topic: %d)" % (tweets[i], doc_topic[i].argmax())

@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV (top topic: 5)
Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ (top topic: 1)
I'm at the #WebSummit2015 this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂 (top topic: 5)
@jalak What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2vwDJdWIJJ (top topic: 5)
#websummit is about to kickoff in #dublin! What are you looking forward to the most?? @WebSummitHQ (top topic: 13)
I'm at the #WebSummit this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂 (top topic: 5)
Liz Halash: one question you need to ask yourself is how would you use your #app if blind? #ford #websummit #mHealth via @mHealthInsight (top topic: 10)
@fabricegrinda What #MustHave #tech gadget can you not travel without? Stop by stand D131 We

### Visualization
We now want to visualize our tweets. In order to do this, we have to reduce their dimensionality from 15 down to 2 or 3, depending if we want to visualize them in a 2- or 3-dimensional space. A common dimensionality reduction technique is [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis). Here, we are going to use another dimensionality reduction technique called [t-SNE](https://lvdmaaten.github.io/tsne/) that is particularly suited to visualizing high-dimensional datasets. It is most commonly used to visualize 100 to 300 dimensional embedding spaces. Although our topic distributions are not that high-dimensional, we should still get good results.

In order to still be able to inspect individual points in the visualization, we visualize only part of our data, a random sample of 2,000 tweets. Applying t-SNE on this 2,000 x 15 matrix reduces it to the number of components specified and thus yields a 2,000 x 2 dimensional matrix that we are able to visualize.

In [7]:
from sklearn.manifold import TSNE
sample_size = 2000
sample_indices = np.random.permutation(range(len(tweets)))[:sample_size]  # get random indeces
tsne_model = TSNE(n_components=2, random_state=0)
X_topics_tsne = tsne_model.fit_transform(X_topics[sample_indices])

For the visualization, we use `matplotlib`'s `pyplot`. In order to inspect our data points, we use [MPLD3](https://mpld3.github.io/), which enables us to display the contents of our tweets as tooltips. We create a legend containing the top word for each topic as the label along with the color of the topic in the chart. For some reason, when we display the data with `mpld3`, [not all topic colors are displayed in the legend](https://github.com/jakevdp/mpld3/issues/166).

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import matplotlib.patches as mpatches
import mpld3
# create our figure
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)

# get the top topics for our sample tweets
top_topics = [doc_topic[i].argmax() for i in sample_indices]

# get distinct colors for our topics
topic_colors = plt.cm.rainbow(np.linspace(0, 1, n_topics))

# define our scatter plot with tooltips and colors corresponding to the top topics and
scat = plt.scatter(X_topics_tsne[:, 0], X_topics_tsne[:, 1], s=50, c=topic_colors[top_topics])
tooltip = mpld3.plugins.PointLabelTooltip(scat, labels=tweets[:sample_size])
mpld3.plugins.connect(fig, tooltip)

plt.title("Web Summit 2015 Tweet Topics")

# create our
handles = [mpatches.Patch(color=color_name, label=summary.split(" ")[0]) for (color_name, summary) in zip(topic_colors, topic_summaries)]
plt.legend(handles=handles)

# plt.show()  #  display like this to see all topic colors in the legend
mpld3.display()  # display like this to see tooltips

Now we have a nice visualization that shows the distribution of the topics of our tweets in a 2d space. That's all. Thanks for reading.