# Visualizing and Analyzing Subreddit Vector Representations
In previous entries I generated t-SNE-reduced vector representations for 1000 subreddits and saved them to a file. Here I will produce some visualizations and apply k-means clustering to understand community similarities and group subs. Note that the dataset includes a surprising number of 'NSFW' subreddits that appear throughout the vector space.

## Visualizing All Data
Using the amazingly simple plotly library, I produced an interactive view of the vector space. Labels appear next to subreddits showing subreddit information, and colors represent differences in subscriber numbers (default subreddits appear red). The interactive plot can be found [here](https://plot.ly/~rlindsay22/2/).

In [139]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
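# note: in plotly >= 4 this module lives in the separate chart_studio package
# (import chart_studio.plotly as py)
import plotly.plotly as py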

vecs_df = pd.read_csv('data/vecs.csv')

trace = go.Scatter(
    x = vecs_df['x'],
    y = vecs_df['y'],
    mode = 'markers',
    marker = dict(
        size = 6,
        color = vecs_df['subscribers'],
        colorscale='magma',
        showscale=True
    ),
    text = vecs_df['name']
)

data = [trace]

py.iplot(data, filename='subreddits')

### Observations
Just by looking, the data appears to have significant, meaningful clustering. An interesting characteristic is how spread out the default subreddits are, which distinguishes this representation from interest-mapping approaches. Each default subreddit sits close to subs with which it shares discussion features, rather than user engagement.

## Clustering
Next I performed clustering using the SciPy implementation of k-means. I initialized the algorithm with centroids corresponding to my best guesses at where clusters would lie, but it turned out to be difficult to isolate the clusters that were visible in the data from the beginning, so k-means might not be the best method. I'll include the code and plot here nonetheless. (I ended up plotting with k=14, the number of initial centroids listed below, but you'll notice that some of the 'clusters' are fairly meaningless.)

In [149]:
from scipy.cluster.vq import kmeans as km

cent_guess = [[-42,10],
             [-40,-10],
             [15,-60],
             [35,30],
             [-5,-50],
             [20,-20],
             [-10,30],
             [-20,10],
             [5,50],
             [10,10],
             [-25,20],
             [0,0],
             [40,-30],
             [-8,8]]

k = len(cent_guess)

vecs = pd.read_csv('data/vecs.csv',usecols=[1,2])
X = vecs.values
cents,dist = km(X,cent_guess) # run k-means and get centroids

In [150]:
import numpy as np
import numpy.linalg as la

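m = X.shape[0]
c_vals = np.zeros(m) # 1-D array to hold each point's centroid assignment

# assign each point to its nearest centroid (Euclidean distance)
# iterating over the rows of cents, so this still works if kmeans
# returned fewer centroids than were passed in
for i in range(m):
    diff = cents - X[i,:]          # offset of this point from every centroid
    dists = la.norm(diff, axis=1)  # one distance per centroid
    c_vals[i] = np.argmin(dists)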

vecs_df['cluster'] = c_vals
vecs_df.to_csv('data/vecs.csv', encoding='utf-8', columns=['name','x','y','subscribers','cluster'], index=False)

# save the centroids
centroid_data = {'x': cents[:,0], 'y': cents[:,1]}
cents_df = pd.DataFrame(centroid_data,columns=['x','y'])
cents_df.to_csv('data/centroids.csv',encoding='utf-8',columns=['x','y'], index=False)
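
The nearest-centroid assignment above can also be written without explicit Python loops. The following is a minimal sketch using NumPy broadcasting on made-up points and centroids (not the real `data/vecs.csv`); the function name `assign_clusters` is a placeholder of mine, not something from the original notebook.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Index of the nearest centroid (Euclidean) for every row of points."""
    # (m, 1, 2) - (1, k, 2) broadcasts to (m, k, 2):
    # the offset of each point from each centroid
    offsets = points[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(offsets, axis=2)  # shape (m, k)
    return distances.argmin(axis=1)              # shape (m,)

# toy data standing in for X and cents
points = np.array([[-41.0, 9.0], [14.0, -58.0], [1.0, 1.0], [36.0, 31.0]])
centroids = np.array([[-42.0, 10.0], [15.0, -60.0], [35.0, 30.0], [0.0, 0.0]])

labels = assign_clusters(points, centroids)
print(labels)  # [0 1 3 2]
```

`scipy.cluster.vq.vq(X, cents)` should give the same kind of label array, along with each point's distance to its centroid, so it could stand in for the loop as well.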

In [151]:
# map each cluster assignment to a colour
color_map = {
    0.0: '#e6f2ff', 1.0: '#99ccff', 
    2.0: '#ccccff', 3.0: '#cc99ff', 
    4.0: '#ff99ff', 5.0: '#ff6699', 
    6.0: '#ff9966', 7.0: '#ff6600',
    8.0: '#ff5050', 9.0: '#ff0000',
    10.0: '#18ff01', 11.0: '#6a2b11',
    12.0: '#b7bf05', 13.0: '#2859c1'}
cols = vecs_df['cluster'].map(color_map)

# plot the data and centroids
data_points = go.Scatter(
    name = 'subreddits',
    x = vecs_df['x'],
    y = vecs_df['y'],
    mode = 'markers',
    marker = dict(
        size = 6,
        color = cols,
    ),
    text = vecs_df['name']
)

centroid_plot = go.Scatter(
    name = 'centroids',
    x = cents[:,0],
    y = cents[:,1],
    mode = 'markers',
    marker = dict(
        size = 6,
        color = 'black',
        symbol = 1
    ),
)

data = [data_points,centroid_plot]

py.iplot(data, filename='subreddit-clusters')

And here is [the plot](https://plot.ly/~rlindsay22/4/subreddits-vs-centroids/).
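
Since some of the clusters look fairly meaningless, one rough way to sanity-check the choice of k is to watch how the mean distortion reported by `kmeans` falls as k grows. This is only a sketch and was not part of the analysis above: it uses synthetic blobs in place of `data/vecs.csv`, lets `kmeans` pick random starting centroids instead of hand-placed guesses, and the helper name `distortion_by_k` is hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans

def distortion_by_k(points, k_values):
    """Return {k: mean distance to nearest centroid} (hypothetical helper)."""
    results = {}
    for k in k_values:
        # kmeans returns (codebook, distortion); distortion is the mean
        # Euclidean distance from each observation to its closest centroid
        _, distortion = kmeans(points, k)
        results[k] = float(distortion)
    return results

# synthetic stand-in for the t-SNE coordinates: three well-separated blobs
np.random.seed(0)
blob_centers = np.array([[-40.0, 10.0], [15.0, -60.0], [35.0, 30.0]])
points = np.vstack([c + np.random.randn(100, 2) for c in blob_centers])

scores = distortion_by_k(points, [1, 2, 3, 4, 5, 6])
for k, d in scores.items():
    print(k, round(d, 2))
```

On data with clear structure the distortion tends to drop sharply up to the natural number of clusters and then flatten out. If the real vectors show no such bend, that would be another hint that k-means is not a good fit here.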