# Visualizing and Analyzing Subreddit Vector Representations
In previous entries I generated vector representations for 1000 subreddits and saved them to a file. Here I will produce some visualizations and apply k-means clustering to understand community similarities and group subs. I also pre-reduced the vectors to two dimensions and saved them as 'x' and 'y' columns in their own file for plotting. Note that the dataset includes a surprising number of 'NSFW' subreddits that appear throughout the vector space.
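
The reduction step itself isn't shown in this entry, and the method used isn't stated. As a rough sketch only, assuming a PCA-style projection (an assumption on my part, with `reduce_to_2d` and the toy vectors being hypothetical stand-ins), getting 'x' and 'y' columns from high-dimensional vectors could look like this:

```python
import numpy as np

def reduce_to_2d(vectors):
    """Project row vectors onto their top two principal components (PCA via SVD)."""
    centered = vectors - vectors.mean(axis=0)  # PCA needs zero-mean columns
    # rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape (n_samples, 2)

# hypothetical stand-in for the real subreddit vectors: 1000 rows, 50 dimensions
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(1000, 50))

xy = reduce_to_2d(toy_vectors)
x, y = xy[:, 0], xy[:, 1]  # these would become the 'x' and 'y' columns
```

Any other 2-D reduction would slot into the same place; the rest of the notebook only relies on the resulting 'x' and 'y' columns.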

## Visualizing All Data
Using the amazingly simple plotly library, I produced an interactive view of the vector space. Hovering over a point shows that subreddit's name, and colors represent differences in subscriber numbers (default subreddits appear red). The interactive plot can be found [here](https://plot.ly/~rlindsay22/2/).

In [111]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.plotly as py  # plotly < 4; newer releases moved this module to chart_studio.plotly

vecs_df = pd.read_csv('data/notebook_support/vecs.csv')

trace = go.Scatter(
    x = vecs_df['x'],
    y = vecs_df['y'],
    mode = 'markers',
    marker = dict(
        size = 6,
        color = vecs_df['subscribers'],
        colorscale='magma',
        showscale=True
    ),
    text = vecs_df['name']
)

data = [trace]

#py.iplot(data, filename='subreddits')
# commented out to not overwrite

### Non-Interactive Plot:
![Subreddit Vectors with Small-Medium Dataset](figures/subreddit_vecs.png)

### Observations
Just by looking, the data appears to show significant, meaningful clustering. An interesting characteristic is how spread out the default subreddits are, which distinguishes this representation from interest-mapping approaches. Each default subreddit sits close to subs that it shares discussion features with, rather than user engagement. There are distinct and logical bubbles such as sports, videogames, TV etc. that lend some credibility to the less obvious groupings.

## Clustering
Next I performed clustering using the SciPy implementation of k-means. I initially ran the algorithm with centroids initialized at my best guesses of where clusters would lie, but it proved difficult to isolate the clusters that were visible in the data from the beginning, so k-means might not be the best method. I'll include the code and plot here nonetheless. I ended up plotting with k=12, but you'll notice that some of the 'clusters' are fairly meaningless. The code shown initializes k-means with k=12 rather than with explicit centroids. Note that in the following sections I perform clustering **on the pre-reduced data**, as I believe it retains the amount of meaning I want.

In [112]:
from scipy.cluster.vq import kmeans as km
import numpy as np
import numpy.linalg as la

def get_assignments(X, k):
    cents, dist = km(X, k)  # run k-means and get centroids
    # Euclidean distance from every row to every centroid, shape (m, n_centroids).
    # Sized by the centroids actually returned rather than by k, in case
    # fewer than k centroids come back.
    dists = la.norm(X[:, None, :] - cents[None, :, :], axis=2)
    # index of the nearest centroid per row (floats, to match the colour map keys)
    c_vals = np.argmin(dists, axis=1).astype(float)
    return cents, c_vals

vecs = pd.read_csv('data/notebook_support/vecs.csv',usecols=[1,2])
X = vecs.values
k = 12
cents, c_vals = get_assignments(X, k)

vecs_df['cluster'] = c_vals
vecs_df.to_csv('data/notebook_support/vecs.csv', encoding='utf-8', columns=['name','x','y','subscribers','cluster'], index=False)

# save the centroids
centroid_data = {'x': cents[:,0], 'y': cents[:,1]}
cents_df = pd.DataFrame(centroid_data,columns=['x','y'])
cents_df.to_csv('data/notebook_support/centroids.csv',encoding='utf-8',columns=['x','y'], index=False)
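
The manually initialized variant mentioned above isn't shown in the notebook. SciPy's `kmeans` accepts an array of starting centroids in place of `k`, but to make the idea concrete here is a minimal plain-NumPy sketch of Lloyd's iterations starting from hand-picked centroids. The function name `kmeans_from_guess`, the toy blobs, and the starting guesses are all hypothetical, not the values I actually used:

```python
import numpy as np

def kmeans_from_guess(X, init_centroids, n_iter=20):
    """Plain Lloyd's iterations starting from hand-picked centroids."""
    cents = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (m, k)
        dists = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(cents)):
            members = X[labels == j]
            if len(members):  # leave a centroid in place if no point chose it
                cents[j] = members.mean(axis=0)
    # final assignment against the final centroids
    dists = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    return cents, dists.argmin(axis=1)

# toy 2-D data: two well-separated blobs (not the subreddit data)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                   rng.normal(5, 0.5, size=(50, 2))])
guess = np.array([[1.0, 1.0], [4.0, 4.0]])  # eyeballed starting centroids
toy_cents, toy_labels = kmeans_from_guess(X_toy, guess)
```

With starting points this close to the real blobs the centroids settle almost immediately; on the subreddit plot the difficulty was precisely that good starting guesses were hard to pick.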

To visualize the clustering, I'll plot the data with colours corresponding to assignments. I use a manually-created colour map for this.

In [119]:
# map each cluster assignment to a colour
color_map = {
    0.0: '#e6f2ff', 1.0: '#99ccff', 
    2.0: '#ccccff', 3.0: '#cc99ff', 
    4.0: '#ff99ff', 5.0: '#ff6699', 
    6.0: '#ff9966', 7.0: '#ff6600',
    8.0: '#ff5050', 9.0: '#ff0000',
    10.0: '#18ff01', 11.0: '#6a2b11',
    12.0: '#b7bf05', 13.0: '#2859c1'}
    # supports up to 14 clusters

def cluster_plot_data(vecs_df, cents):
    cols = vecs_df['cluster'].map(color_map)

    data_points = go.Scatter(
        name = 'subreddits',
        x = vecs_df['x'],
        y = vecs_df['y'],
        mode = 'markers',
        marker = dict(
            size = 6,
            color = cols,
        ),
        text = vecs_df['name']
    )

    centroid_plot = go.Scatter(
        name = 'centroids',
        x = cents[:,0],
        y = cents[:,1],
        mode = 'markers',
        marker = dict(
            size = 6,
            color = 'black',
            symbol = 1
        ),
    )

    data = [data_points,centroid_plot]
    return data

data = cluster_plot_data(vecs_df, cents)

#py.iplot(data, filename='subreddit-clusters')
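
The hand-written colour map above caps out at 14 clusters. If a larger k were ever needed, one option would be to generate the map instead. The sketch below uses the standard-library `colorsys` module to space hues evenly; `make_color_map` and its default saturation and value are my own illustrative choices, not part of the original notebook:

```python
import colorsys

def make_color_map(k, saturation=0.65, value=0.95):
    """Return {cluster_id (float): '#rrggbb'} with k evenly spaced hues."""
    color_map = {}
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, saturation, value)
        # float keys match the float cluster ids stored in the 'cluster' column
        color_map[float(i)] = '#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255))
    return color_map

generated_map = make_color_map(20)
```

Evenly spaced hues get hard to tell apart as k grows, so this trades the hand-tuned look of the manual map for flexibility.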

![Manually Initialize Clusters](figures/clusters_init.png)
And here is [the interactive plot](https://plot.ly/~rlindsay22/4/subreddits-vs-centroids/).

## Improving Things
In order to tighten things up and get a better workflow and data representation, I rewrote the download script to filter NSFW subreddits (among other new functions) and downloaded the top 10,000 subreddits. Around 8000 were saved, with the remaining ~2000 being skipped for NSFW or exclusivity reasons.

### Medium Dataset Plot
[Here](https://plot.ly/~rlindsay22/8/) is the plot for the medium-sized dataset. Subscribers are coloured on a logarithmic scale due to the skewed distribution of subscriber counts.

This plot preserves the relevant groupings seen before, but of course in higher density. Some new groupings appear with this amount of data.
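
To illustrate why the logarithmic colouring helps, here is a small sketch with made-up subscriber counts (not values from the dataset). It compares where each count would land on a linear colour scale versus a log one. I use `np.log1p` here rather than the plain `np.log` in the cell below only so that a count of zero, if one ever appeared, would not map to negative infinity:

```python
import numpy as np

subscribers = np.array([0, 150, 12_000, 950_000, 18_000_000])  # made-up counts

# position of each count along the colour scale, from 0 (bottom) to 1 (top)
linear_pos = subscribers / subscribers.max()

log_colors = np.log1p(subscribers)  # log(1 + n), defined even when n == 0
log_pos = log_colors / log_colors.max()

for n, lin, lg in zip(subscribers, linear_pos, log_pos):
    print(f"{n:>10,d}  linear={lin:.4f}  log={lg:.4f}")
```

On the linear scale everything except the largest couple of subreddits is squashed into the bottom sliver of the colour range, while the log scale spreads the mid-sized communities across it.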

In [120]:
vecs_df = pd.read_csv('data/dataset_medium/vecs.csv')

trace = go.Scatter(
    x = vecs_df['x'],
    y = vecs_df['y'],
    mode = 'markers',
    marker = dict(
        size = 6,
        color = np.log(vecs_df['subscribers']),
        colorscale='Viridis'
    ),
    text = vecs_df['name']
)

data = [trace]

#py.iplot(data, filename='subreddits_medium')

![Subreddits Medium Dataset with Logarithmic Colorscale](figures/subreddits_medium.png)

### Clusters
I decided to give k-means another shot. This time, I did not manually initialize the centroids but instead let the algorithm determine them to start. Once again, I store the centroids and example assignments in the centroids and vector files, respectively.

In [121]:
vecs = pd.read_csv('data/dataset_medium/vecs.csv',usecols=[1,2])
X = vecs.values
k = 14
cents, c_vals = get_assignments(X, k)

# save the things
vecs_df['cluster'] = c_vals
vecs_df.to_csv('data/dataset_medium/vecs.csv', encoding='utf-8', columns=['name','x','y','subscribers','cluster'], index=False)
centroid_data = {'x': cents[:,0], 'y': cents[:,1]}
cents_df = pd.DataFrame(centroid_data,columns=['x','y'])
cents_df.to_csv('data/dataset_medium/centroids.csv',encoding='utf-8',columns=['x','y'], index=False)

Ultimately, I ended up sticking with the arbitrary k=14 (the maximum number of color assignments I have). [The plot](https://plot.ly/~rlindsay22/14/subreddits-vs-centroids/) is shown below:

In [122]:
data = cluster_plot_data(vecs_df, cents)
#py.iplot(data, filename='subreddit-clusters-medium')

![Medium Dataset Clusters](figures/clusters_medium.png)

### Final Note on K-Means Clustering
While k-means effectively groups points and minimizes distance (as it should) with this dataset, it often fails to separate the smaller 'modules' that represent specific content circles. Furthermore, this method does not group subreddits in the same way a human would just by looking at the plot and the subreddits themselves, but that is not the objective of k-means. K-means clustering in this case ends up being a way of dividing the data into distance-minimized, but rather meaningless, chunks.
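
One rough way to put numbers on that impression would be to look at each cluster's size and how tightly its members sit around the centroid. The helper below is a hypothetical sketch (`cluster_summary` is not part of the notebook) that takes the same kind of points, centroids, and assignments produced by `get_assignments`, shown here on three toy points:

```python
import numpy as np

def cluster_summary(points, centroids, labels):
    """Per-cluster (id, size, mean distance of members to their centroid)."""
    labels = np.asarray(labels).ravel().astype(int)
    summary = []
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) == 0:  # empty cluster: no meaningful distance
            summary.append((j, 0, float('nan')))
            continue
        mean_dist = np.linalg.norm(members - centroids[j], axis=1).mean()
        summary.append((j, len(members), float(mean_dist)))
    return summary

# toy example, not the subreddit data
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
cts = np.array([[0.0, 1.0], [10.0, 10.0]])
lbl = np.array([0.0, 0.0, 1.0])  # float labels, as stored in the 'cluster' column

for cluster_id, size, mean_dist in cluster_summary(pts, cts, lbl):
    print(cluster_id, size, round(mean_dist, 3))
```

Large clusters with a large mean distance would be candidates for the "meaningless chunk" label, though this says nothing about whether the members are topically related.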

## Next Steps
Having seen and gained a decent understanding of the data, my next steps will be to perform deeper analysis on the community circles and hopefully better quantify the connections between subreddits, as well as to use the vector representations in a recommender-like system.