# Applying TFIDF Vectors and t-SNE to Subreddit Content
While researching for this project, I discovered a whole lot of subreddit mappings and visualizations out there. Most of them use user engagement as the metric for modularity (community) detection. (Insert Sources). Since this project is about breaking free of traditional user interest patterns, I decided to quantify subreddit differences by content alone, in much the same way as Andrej Karpathy does in [this blog post](http://karpathy.github.io/2014/07/02/visualizing-top-tweeps-with-t-sne-in-Javascript/).

## Collecting Data
Using the previously mentioned post as a rough guide, I started implementing the community difference analysis. For the initial analysis, I simply downloaded comment text from a large handful of default subreddits; the code is saved in '/download.py'. Basically, I save one file per sub containing its concatenated raw comment text, plus a list of those files (for scikit-learn to use later on).
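To make the storage layout concrete, here is a minimal sketch of the saving step only. It is not the contents of '/download.py': the `save_corpus` helper, the `comments_by_sub` dictionary, and the sample comments are hypothetical stand-ins, and the download from Reddit itself is left out.

```python
import os
import pickle
import tempfile

def save_corpus(comments_by_sub, data_dir):
    """Write one concatenated text file per subreddit and pickle the list of paths.

    Hypothetical helper for illustration; the real logic lives in download.py.
    """
    os.makedirs(data_dir, exist_ok=True)
    files_list = []
    for sub, comments in comments_by_sub.items():
        path = os.path.join(data_dir, sub + '.txt')
        # Concatenate the raw comment text for this sub into a single file
        with open(path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(comments))
        files_list.append(path)
    # Store the list of file paths so the vectorizer can read it later
    with open(os.path.join(data_dir, 'files.pickle'), 'wb') as f:
        pickle.dump(files_list, f)
    return files_list

# Made-up sample data standing in for downloaded comments
sample = {
    'gadgets': ['nice phone', 'battery life is poor'],
    'sports': ['great game last night'],
}
demo_dir = tempfile.mkdtemp()  # temporary directory instead of 'data/'
demo_files = save_corpus(sample, demo_dir)
print(demo_files)
```

Keeping one file per sub means the vectorizer can later treat each subreddit as a single document.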

In [20]:
import pickle
# Load the list of per-subreddit text files; the with-block closes the file afterwards
with open('data/files.pickle', 'rb') as files_store:
    files_list = pickle.load(files_store)
print(files_list)

['data/gadgets.txt', 'data/sports.txt', 'data/gaming.txt', 'data/pics.txt', 'data/worldnews.txt', 'data/videos.txt', 'data/AskReddit.txt', 'data/aww.txt', 'data/funny.txt', 'data/news.txt', 'data/movies.txt', 'data/blog.txt', 'data/books.txt', 'data/history.txt', 'data/food.txt', 'data/philosophy.txt', 'data/Jokes.txt', 'data/Art.txt', 'data/DIY.txt', 'data/space.txt', 'data/Documentaries.txt', 'data/askscience.txt', 'data/nottheonion.txt', 'data/todayilearned.txt', 'data/personalfinance.txt', 'data/gifs.txt', 'data/listentothis.txt', 'data/IAmA.txt', 'data/announcements.txt', 'data/TwoXChromosomes.txt', 'data/creepy.txt', 'data/nosleep.txt', 'data/GetMotivated.txt', 'data/WritingPrompts.txt', 'data/LifeProTips.txt', 'data/EarthPorn.txt', 'data/explainlikeimfive.txt', 'data/Showerthoughts.txt', 'data/Futurology.txt', 'data/photoshopbattles.txt', 'data/mildlyinteresting.txt', 'data/dataisbeautiful.txt', 'data/tifu.txt', 'data/OldSchoolCool.txt', 'data/UpliftingNews.txt', 'data/InternetI

## Vectorization
Next I used the TfidfVectorizer from scikit-learn to process and vectorize the content based on text features. Each resulting vector has one dimension per n-gram (here unigrams and bigrams) kept in the vocabulary.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
# input='filename': entries are paths to read; min_df=2 drops n-grams found in only one file
vectorizer = TfidfVectorizer(input='filename', stop_words='english', lowercase=True,
                             strip_accents='unicode', use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(1, 2), min_df=2)
vecs = vectorizer.fit_transform(files_list)

This gives a 47 × 35155 matrix in which each row is one subreddit's TFIDF vector and each column corresponds to an n-gram feature.

In [31]:
vecs

<47x35155 sparse matrix of type '<class 'numpy.float64'>'
	with 175068 stored elements in Compressed Sparse Row format>
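To show where the column count comes from, here is a rough pure-Python sketch of the same idea on a toy corpus. It is not scikit-learn's implementation: the `ngrams` and `tfidf` helpers and the four sample documents are made up for illustration, tokenization is a plain whitespace split, and stop-word removal and accent stripping are skipped. It does follow the settings used above in spirit: unigrams plus bigrams, `min_df=2`, raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows (scikit-learn's default).

```python
import math
from collections import Counter

def ngrams(tokens, lo=1, hi=2):
    """Return all n-grams with lo <= n <= hi, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, min_df=2):
    """Toy TFIDF: one row per document, one column per surviving n-gram."""
    counts = [Counter(ngrams(d.lower().split())) for d in docs]
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(term for c in counts for term in c)
    # Keep only n-grams that occur in at least min_df documents
    vocab = sorted(t for t, k in df.items() if k >= min_df)
    n = len(docs)
    # Smoothed idf, as with smooth_idf=True
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]  # raw count times idf
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])  # L2-normalize each row
    return vocab, rows

# Made-up documents standing in for the per-subreddit text files
docs = ['cute cat pics', 'cute dog pics', 'breaking world news', 'world news today']
vocab, rows = tfidf(docs)
print(vocab)
print(len(rows), 'x', len(vocab))
```

In this toy corpus only the n-grams shared by at least two documents survive, so the matrix has four rows and a handful of columns. The real data works the same way, with 47 rows and 35155 surviving n-grams.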