# Applying TFIDF Vectors and t-SNE to Subreddit Content
While researching for this project, I discovered a whole lot of subreddit mappings and visualizations out there. Most of them employ user engagement as a metric for modularity detection. (Insert Sources). Since this project is based on breaking free of traditional user interest patterns, I decided to quantify subreddit differences by content alone, in much the same way as Andrej Karpathy does in [this blog post](http://karpathy.github.io/2014/07/02/visualizing-top-tweeps-with-t-sne-in-Javascript/).

## Collecting Data
Using the previously mentioned post as a rough guide, I started implementing the community difference analysis. For the initial analysis, I simply downloaded comment text from a large handful of default subreddits, saving one file per sub containing the concatenated raw comment text, plus a list of the file paths (for scikit-learn to use later on).
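The download code itself isn't shown here, so the following is only a rough sketch of the saving step described above, not the code I actually ran. It assumes the comments have already been fetched into a dict (`comments_by_sub` is a hypothetical name), writes one text file per subreddit, and pickles the list of paths.

```python
import pickle
import tempfile
from pathlib import Path


def save_subreddit_corpus(comments_by_sub, out_dir):
    """Write one text file per subreddit and pickle the list of file paths.

    comments_by_sub is a hypothetical dict mapping a subreddit name to a list
    of raw comment strings that have already been downloaded.
    """
    subs_dir = Path(out_dir) / "subs"
    subs_dir.mkdir(parents=True, exist_ok=True)

    files_list = []
    for name, comments in comments_by_sub.items():
        path = subs_dir / f"{name}.txt"
        # Concatenate the raw comment text, one comment per line.
        path.write_text("\n".join(comments), encoding="utf-8")
        files_list.append(str(path))

    # Store the list of paths so the vectorizer can read the files later.
    with open(Path(out_dir) / "sub_files.pickle", "wb") as store:
        pickle.dump(files_list, store)
    return files_list


# Toy data standing in for real downloaded comments.
toy_comments = {
    "space": ["The launch was delayed again.", "Great photo of Jupiter."],
    "food": ["That bread looks amazing.", "What flour did you use?"],
}
demo_dir = tempfile.mkdtemp()
demo_files = save_subreddit_corpus(toy_comments, demo_dir)
print(demo_files)
```

The pickled list is what gets loaded in the next cell.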

In [2]:
import pickle
files_store = open('data/sub_files.pickle','rb')
files_list = pickle.load(files_store)
print(files_list)

['data/subs/gadgets.txt', 'data/subs/sports.txt', 'data/subs/gaming.txt', 'data/subs/pics.txt', 'data/subs/worldnews.txt', 'data/subs/videos.txt', 'data/subs/AskReddit.txt', 'data/subs/aww.txt', 'data/subs/funny.txt', 'data/subs/news.txt', 'data/subs/movies.txt', 'data/subs/blog.txt', 'data/subs/books.txt', 'data/subs/history.txt', 'data/subs/food.txt', 'data/subs/philosophy.txt', 'data/subs/Jokes.txt', 'data/subs/Art.txt', 'data/subs/DIY.txt', 'data/subs/space.txt', 'data/subs/Documentaries.txt', 'data/subs/askscience.txt', 'data/subs/nottheonion.txt', 'data/subs/todayilearned.txt', 'data/subs/personalfinance.txt', 'data/subs/gifs.txt', 'data/subs/listentothis.txt', 'data/subs/IAmA.txt', 'data/subs/announcements.txt', 'data/subs/TwoXChromosomes.txt', 'data/subs/creepy.txt', 'data/subs/nosleep.txt', 'data/subs/GetMotivated.txt', 'data/subs/WritingPrompts.txt', 'data/subs/LifeProTips.txt', 'data/subs/EarthPorn.txt', 'data/subs/explainlikeimfive.txt', 'data/subs/Showerthoughts.txt', 'dat

## Vectorization
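Next I used the TfidfVectorizer from scikit-learn to process and vectorize the content based on text features. Each vector has one dimension per n-gram feature: with ngram_range=(1,2) these are unigrams and bigrams, and min_df=2 keeps only those that occur in at least two of the subreddit files.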

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(input='filename',stop_words='english',lowercase=True, strip_accents='unicode', smooth_idf=True,sublinear_tf=False, use_idf=True, ngram_range=(1,2),min_df=2)
vecs = vectorizer.fit_transform(files_list)

This gives a 47 x 33304 matrix where each row is one subreddit's TFIDF vector and each column corresponds to an n-gram feature.

In [4]:
vecs

<47x33304 sparse matrix of type '<class 'numpy.float64'>'
	with 166749 stored elements in Compressed Sparse Row format>
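To make the weighting a bit more concrete, here is a small pure-Python sketch of what a TFIDF row is under the settings used above (raw term counts, smoothed idf, L2-normalised rows, unigrams and bigrams, min_df=2). It is a simplification for illustration, not scikit-learn's actual code: tokenization is a plain whitespace split, there is no stop-word removal, and the function names and toy documents are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all unigrams up to n_max-grams as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, min_df=2):
    """Simplified TF-IDF: raw counts, smoothed idf, L2-normalised rows."""
    counts = [Counter(ngrams(doc.lower().split())) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(term for c in counts for term in c)
    # min_df=2 drops n-grams that occur in only one document.
    vocab = sorted(t for t, d in df.items() if d >= min_df)
    n_docs = len(docs)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


toy_docs = [
    "rocket launch today",
    "rocket launch delayed",
    "bread recipe today",
]
toy_vocab, toy_rows = tfidf(toy_docs)
print(toy_vocab)  # ['launch', 'rocket', 'rocket launch', 'today']
```

Only n-grams shared by at least two documents survive, which is why the real feature count is much smaller than the total number of distinct n-grams in the comments.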

## Dimensionality Reduction
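My next step was to apply t-SNE in order to visualize the data. I used scikit-learn's built-in implementation for this task. As input, I computed the pairwise dot products of the TFIDF vectors. Since TfidfVectorizer L2-normalizes each row by default, these dot products are cosine similarities (larger means more alike) rather than dissimilarities, and TSNE with its default settings treats each row of the resulting matrix as a feature vector.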

In [5]:
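# pairwise dot products between subreddit vectors; toarray() gives a plain
# ndarray (newer scikit-learn versions reject the np.matrix from todense())
dists = (vecs * vecs.T).toarray()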
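As a side note on what this product contains, here is a tiny NumPy sketch with made-up vectors (not subreddit data). It shows that dot products of L2-normalised rows are cosine similarities, and one common way to turn them into a dissimilarity. This is an illustration only and is not what the cells in this notebook do.

```python
import numpy as np

# Three toy row vectors standing in for TF-IDF rows (not real subreddit data).
toy_vecs = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])
# L2-normalise each row, as TfidfVectorizer does by default (norm='l2').
unit = toy_vecs / np.linalg.norm(toy_vecs, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities.
similarity = unit @ unit.T

# One common way to turn a similarity into a dissimilarity.
distance = np.clip(1.0 - similarity, 0.0, None)
np.fill_diagonal(distance, 0.0)
print(np.round(similarity, 3))
```

A matrix like `distance` could in principle be handed to scikit-learn's TSNE with metric="precomputed" (which also requires init="random"); I have not done that here.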

In [6]:
from sklearn.manifold import TSNE
vecs_2d = TSNE(n_components=2).fit_transform(dists)
vecs_2d

array([[  37.538536  ,  -99.21912   ],
       [-108.921745  ,  -20.812513  ],
       [ -55.803932  ,   62.793385  ],
       [  -6.9007916 ,  -35.677216  ],
       [  47.04386   ,   51.614872  ],
       [  59.107395  ,  -61.750397  ],
       [  38.684246  ,   77.7371    ],
       [ -58.77372   ,    5.91833   ],
       [  -2.7678335 ,   44.228683  ],
       [ -28.224083  ,   35.33197   ],
       [ -34.138416  ,   10.203723  ],
       [ -22.200624  ,  -69.67084   ],
       [  21.521759  ,   38.12975   ],
       [ -53.80779   ,  -86.54177   ],
       [ -58.01029   ,  -39.831154  ],
       [  38.77099   ,   -0.93123645],
       [ -79.02183   ,  -14.72237   ],
       [  11.799688  ,  -97.394     ],
       [  -6.448294  , -144.72159   ],
       [ -89.88406   ,   17.71      ],
       [  47.705425  ,  -32.904606  ],
       [  14.272783  , -124.421776  ],
       [  17.38983   ,   14.223973  ],
       [ -26.04302   ,   58.580845  ],
       [ -35.020267  ,   86.43248   ],
       [  73.35896   ,  -

Now I have (hopefully meaningful) 2-dimensional vector representations of all the subreddits!

## Visualizations
To get a rough idea of what the reduced vectors look like, I produced a simple labeled plot. Note that each execution of t-SNE results in a different embedding because the optimization is randomized, although which points end up near each other stays largely the same. Subreddit pairs like r/worldnews and r/news are always close, which is promising. Also note that the colours used for plotting have no meaning and simply look cool.

In [7]:
%matplotlib notebook
import matplotlib.pyplot as plt
x = vecs_2d[:,0]
y = vecs_2d[:,1]
plt.scatter(x,y,c=y,cmap='plasma')

# annotate the plot
names_store = open('data/sub_names.pickle','rb')
names = pickle.load(names_store)
for i, name in enumerate(names):
    plt.annotate(name,(x[i],y[i]))
    
plt.show()

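Since every t-SNE run gives a different layout, it can help to fix the seed when a plot needs to be reproduced. The sketch below uses random toy input rather than the subreddit matrix, and the perplexity and seed values are arbitrary choices for the example, not settings I used above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random toy input standing in for the similarity matrix (not real data).
rng = np.random.default_rng(0)
toy_input = rng.random((20, 20))

# perplexity must be smaller than the number of samples; 5 is an arbitrary
# choice for this small toy input.
run_a = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(toy_input)
run_b = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(toy_input)

# With the same seed, both runs should produce the same layout.
print(np.allclose(run_a, run_b))
```

Leaving random_state unset, as in the cells above, gives a fresh layout on each execution.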

My next steps will be to download a larger dataset and apply clustering/modularity analysis to visualize in more depth.

## A Slightly Larger Dataset
To get a more interesting set of subreddits, I downloaded a CSV file containing info for all public subreddits (as of the posting of [this reddit post](https://www.reddit.com/r/datasets/comments/8isnek/list_of_every_subreddit_on_reddit/)). I downloaded info for the first 100 and performed the same analysis procedure as before.

In [8]:
def apply(files_list_dir):
    files_store = open(files_list_dir,'rb')
    files_list = pickle.load(files_store)
    vectorizer = TfidfVectorizer(input='filename',stop_words='english',lowercase=True, strip_accents='unicode', smooth_idf=True,sublinear_tf=False, use_idf=True, ngram_range=(1,2),min_df=2)
    vecs = vectorizer.fit_transform(files_list)
    # toarray() gives a plain ndarray (newer scikit-learn versions reject np.matrix)
    dists = (vecs * vecs.T).toarray()
    vecs_2d = TSNE(n_components=2).fit_transform(dists)
    return vecs_2d

points = apply('data/sub_files_large.pickle')

plt.figure()
x = points[:,0]
y = points[:,1]
plt.scatter(x,y,c=y,cmap='plasma')
names_store = open('data/sub_names_large.pickle','rb')
names = pickle.load(names_store)
for i, name in enumerate(names):
    plt.annotate(name,(x[i],y[i]))
plt.show()


The annotations in this plot are a little jumbled, but it's possible to see the work of the algorithm with subreddits like programming, linux, and ruby closely surrounding r/software. The figure also shows some interesting features (deviations from the pattern that one might expect) which could shed light on discrepancies between subreddit names and the actual content of discussion happening in the comments.

### Even More Subreddits
Next I modified the download code to sort subreddits by subscriber count so that I could choose a variable-sized selection of top subs. Here is the algorithm run on 1000 subreddits (again, the colour is just for fun).
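As an aside, the selection step could be sketched roughly as follows. The download code isn't shown in this notebook, so this is a hypothetical stand-in: the column names `name` and `subscribers`, the function name, and the subscriber numbers are all made up for the example, and the real CSV may be laid out differently. The actual run follows in the next cell.

```python
import csv
import io

# Hypothetical CSV layout with made-up numbers; the real file may differ.
toy_csv = io.StringIO(
    "name,subscribers\n"
    "linux,400000\n"
    "AskReddit,20000000\n"
    "ruby,60000\n"
    "programming,1500000\n"
)


def top_subreddits(csv_file, n):
    """Return the names of the n subreddits with the most subscribers."""
    rows = list(csv.DictReader(csv_file))
    # Sort by subscriber count, largest first.
    rows.sort(key=lambda row: int(row["subscribers"]), reverse=True)
    return [row["name"] for row in rows[:n]]


top_names = top_subreddits(toy_csv, 3)
print(top_names)  # ['AskReddit', 'programming', 'linux']
```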

In [9]:
points = apply('data/sub_files_all.pickle')
plt.figure()
x = points[:,0]
y = points[:,1]
plt.scatter(x,y,c=y,cmap='plasma')
plt.show()


Some interesting clusters start to appear with this number of data points. Later on, I will apply some sort of clustering algorithm and find a way to display subreddit names without crowding them on a plot.

Saving the data for later use:

In [14]:
import pandas as pd
names_store = open('data/sub_names_all.pickle','rb')
names = pickle.load(names_store)
subscriber_counts_store = open('data/subscriber_counts_all.pickle','rb')
subscriber_counts = pickle.load(subscriber_counts_store)
data = {'name':names, 'x':x, 'y':y, 'subscribers': subscriber_counts}
df = pd.DataFrame(data,columns=['name','x','y','subscribers'])
df.to_csv('data/vecs.csv', encoding='utf-8',columns=['name','x','y','subscribers'], index=False)