# tweet-SNE
*A t-SNE embedding of TF-IDF vectors for tweets from @HillaryClinton and @realDonaldTrump. By [Alexandra Johnson](https://twitter.com/alexandraj777) and [Sam Ainsworth](https://github.com/samuela).*

t-SNE is a powerful tool for visualizing high-dimensional data. Roughly, the goal is to place similar tweets close to each other on a 2-D plot. This interactive [blog post](http://distill.pub/2016/misread-tsne/) is a wonderful way to learn more about t-SNE.

Here, we use [scikit-learn](http://scikit-learn.org/) to vectorize the [Hillary Clinton and Donald Trump tweets](https://github.com/WiMLDS/election-data-hackathon/tree/master/clinton-trump-tweets#hillary-clinton-and-donald-trump-tweets) and compute their TF-IDF matrix. We then use scikit-learn's `TSNE`, with its default hyperparameters, to embed the TF-IDF matrix in two dimensions. Finally, we use [plotly](https://plotcon.plot.ly) to graph the embedding, with @HillaryClinton tweets in blue and @realDonaldTrump tweets in red. Hover over points in the graph to see the original text of the tweets.

We've embedded the graph below ([link](https://plot.ly/~alexandraj777/2) for those of you viewing the notebook in a browser), but feel free to run the code and create the example for yourself!

In [1]:
import plotly.tools as tls

tls.embed("https://plot.ly/~alexandraj777/2")

## Run it yourself
To run this example, you'll need to install the following libraries (we used `pip`):
 * numpy
 * pandas
 * plotly
 * scikit-learn
 * scipy
 
You'll also need to download the `tweets.csv` file from [GitHub](https://github.com/WiMLDS/election-data-hackathon/tree/master/clinton-trump-tweets#hillary-clinton-and-donald-trump-tweets).
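
Before running the cells, it can help to confirm that the downloaded file has the two columns this notebook reads, `handle` and `text`. The snippet below is a small sketch using only the standard library. The helper name `missing_columns` and the sample row are made up for illustration and are not part of the dataset or of any library.

```python
import csv
import io

# The only columns the cells in this notebook read from tweets.csv.
REQUIRED_COLUMNS = {"handle", "text"}

def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list.
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# Made-up, in-memory stand-in for tweets.csv (hypothetical row, not real data).
sample = io.StringIO('handle,text\nrealDonaldTrump,"made-up example text"\n')
print(missing_columns(sample))  # prints set()
```

To check the real file, pass an open file handle for `tweets.csv` instead of the `io.StringIO` sample. An empty set means both columns are present.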

In [None]:
import pandas as pd
df = pd.read_csv('tweets.csv')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer().fit_transform(df['text'])
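
For intuition about what `TfidfVectorizer` produces, here is a pure-Python sketch of the weighting it applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. This is an illustration rather than the library's implementation. In particular, the whitespace tokenizer is a simplifying assumption (scikit-learn's default tokenizer also drops punctuation and single-character tokens), and the function name `tfidf_rows` and the toy tweets are invented for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenizer (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize the row; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up toy "tweets", for illustration only.
toy_tweets = [
    "make america great again",
    "stronger together america",
    "great rally tonight",
]
vocab, rows = tfidf_rows(toy_tweets)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok:10s} {weight:.3f}")
```

In the first toy tweet, "again" and "make" get a higher weight than "america" and "great", because the latter two also appear in another document. Downweighting terms that occur in many documents is the point of the IDF factor.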

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE().fit_transform(tfidf.toarray())
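
To make "similar tweets end up close together" a little more concrete: in the 2-D embedding, t-SNE measures how similar two points are with a Student-t kernel, `(1 + distance²)⁻¹`, normalized over all pairs, and it moves the points so that these similarities match the ones computed in the original high-dimensional space. The sketch below computes only that low-dimensional similarity matrix for a few hand-picked points. It illustrates one ingredient of the algorithm, not scikit-learn's implementation, and the function name `low_dim_affinities` and the toy coordinates are invented for this example.

```python
def low_dim_affinities(points):
    """Pairwise t-SNE similarities q_ij for a list of 2-D points."""
    n = len(points)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # q_ii is defined to be zero
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                # Student-t kernel with one degree of freedom.
                kernel[i][j] = 1.0 / (1.0 + sq_dist)
    # Normalize so that all pairwise similarities sum to one.
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]

# Hand-picked toy embedding: two nearby points and one far away.
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
q = low_dim_affinities(toy_embedding)
print(q[0][1] > q[0][2])  # prints True: the nearby pair is more similar
```

The heavy tail of this kernel lets dissimilar points sit far apart without a large penalty. That is one reason to be careful about reading meaning into the distances between well-separated clusters in the plot.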

In [None]:
df['color'] = df['handle'].map(lambda h: 'red' if h == 'realDonaldTrump' else 'blue')

In [None]:
import plotly.offline as offline
import plotly.graph_objs as go

offline.init_notebook_mode()

In [None]:
offline.iplot(dict(
    data=[
        go.Scattergl(
            x=tsne[:, 0],
            y=tsne[:, 1],
            text=df['text'],
            hoverinfo='text',
            marker=dict(
                size=8,
                color=df['color'],
                opacity=0.7,
            ),
            mode='markers'
        ),
    ], 
    layout=go.Layout(
        title="tweet-SNE",
        font=dict(size=16),
        xaxis=dict(
            showgrid=False,
            zeroline=False,
            showline=False,
            showticklabels=False,
        ),
        yaxis=dict(
            showgrid=False,
            zeroline=False,
            showline=False,
            showticklabels=False,
        )
    ),
))