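I should mention that this technique, latent semantic analysis (LSA),
is closely related to principal component analysis (PCA): both rest on
the singular value decomposition, but PCA centers the data first,
whereas LSA is applied to the uncentered term matrix directly.  Matt
Osborne will be presenting much more on PCA during a special session.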



## Load some real-world data



Let's see how this might work by reprising our reddit data set.



In [19]:
import json
import bz2

# Keep only r/politics comments whose text has not been deleted.
comments = []
with bz2.open('/Users/neutrino/Downloads/RC_2010-10.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        comment = json.loads(line)
        if comment['subreddit'] == 'politics' and comment['body'] != '[deleted]':
            comments.append(comment)

from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each comment body into a TF-IDF vector; X gets one row per comment.
vectorizer = TfidfVectorizer()
corpus = [comment['body'] for comment in comments]
X = vectorizer.fit_transform(corpus)

In [17]:
comments

[{'retrieved_on': 1426494583,
  'author_flair_css_class': None,
  'author_flair_text': None,
  'id': 'c1112i2',
  'controversiality': 0,
  'downs': 0,
  'link_id': 't3_dl952',
  'archived': True,
  'name': 't1_c1112i2',
  'gilded': 0,
  'edited': False,
  'ups': 2,
  'distinguished': None,
  'subreddit': 'politics',
  'score_hidden': False,
  'body': 'For some people there is no hope. These folks *want* to believe.\n\nThey really do.',
  'score': 2,
  'parent_id': 't1_c111251',
  'created_utc': '1285891217',
  'author': 'xoites',
  'subreddit_id': 't5_2cneq'},
 {'ups': -5,
  'edited': False,
  'gilded': 0,
  'link_id': 't3_dl86s',
  'name': 't1_c1112lx',
  'archived': True,
  'controversiality': 0,
  'downs': 0,
  'id': 'c1112lx',
  'author_flair_css_class': None,
  'author_flair_text': None,
  'retrieved_on': 1426494584,
  'subreddit_id': 't5_2cneq',
  'author': 'sqlinjector',
  'created_utc': '1285891275',
  'parent_id': 't1_c110z8y',
  'body': 'nope.  missed it by a mile',
  'score'

At this point `X` is a sparse matrix with one row per "document" (in
this case, a reddit comment) and one column per term in the vocabulary;
each entry is that term's TF-IDF weight in that comment.



## Reduce dimension



In [2]:
from sklearn.decomposition import TruncatedSVD

# Project the TF-IDF vectors onto 300 latent dimensions.
tsvd = TruncatedSVD(n_components=300)
X2 = tsvd.fit_transform(X)

You might enjoy changing `n_components`.  In this case, `300` is a
"recommended number."
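
One way to get a feel for `n_components` is to look at how much variance the retained components capture as the number grows. The sketch below does this on a small random sparse matrix (an assumption standing in for the real `X`, which is far larger); the values of `k` tried here are arbitrary.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for X: 200 "documents" by 50 "terms", 10% non-zero.
rng = np.random.RandomState(0)
toy_X = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)

# Fraction of the total variance captured for a few choices of n_components.
ratios = {}
for k in (5, 10, 20):
    svd = TruncatedSVD(n_components=k, random_state=0).fit(toy_X)
    ratios[k] = svd.explained_variance_ratio_.sum()
    print(k, round(ratios[k], 3))
```

On real text the curve usually flattens out well before the full vocabulary size, which is what makes a few hundred components a workable compromise.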



## Then cluster



Why is it reasonable (or a good idea) to first perform SVD before
clustering?



In [3]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10).fit(X2)
kmeans.labels_

array([0, 8, 7, ..., 4, 8, 8], dtype=int32)
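
`KMeans` starts from random initial centers, so the labels can change from run to run. A small sketch, on synthetic points rather than the reddit data, of fixing the seed and checking how many points land in each cluster (the `toy_*` names and the choice of four clusters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X2: 300 points in 5 dimensions around 4 centers.
toy_points, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# random_state makes the run repeatable; n_init restarts guard against a bad start.
toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(toy_points)

# Number of points assigned to each cluster label.
sizes = np.bincount(toy_kmeans.labels_, minlength=4)
print(sizes)
```

A cluster that swallows most of the data, or one with only a handful of members, is often a hint to try a different `n_clusters`.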

## Explore the clusters



In [20]:
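import numpy as np

# For each cluster, print the first 50 characters of up to 20 random comments.
for j in np.unique(kmeans.labels_):
    print("****************************************************************")
    print("Cluster", j)
    members = np.nonzero(kmeans.labels_ == j)[0]
    # Guard against clusters with fewer than 20 members.
    for i in np.random.choice(members, size=min(20, len(members)), replace=False):
        print(corpus[i][:50])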

****************************************************************
Cluster 0
I do like French horror movies of the past couple 
yes because if the republicans took majorities in 
Where is the tea party for THIS violation of the c
If they don't get it from us, I would think they w
They won't submit  to their enlightened betters, t
We have to have it.  The auto companies would neve
So you don't think Media Matters wishes to influen
Then maybe this will help small businesses grow. I
By voting those thugs (both reps and dems) in you’
&gt;I personally think any skills they teach shoul
When I was in high school, I remember the statisti
Of course, I was on their Digg list of folks to bu
FDA is wonderful!  I work in pharmaceutical advert
I agree with this, but when people like OP pile on
correlation doesn't prove causation. They weren't 
Punch and Judy have different voting records, and 
They cheered that because she had said negative th
Why would the President want to see Navy kick Army
Sufferi

## Homework



Given a document, find (and print) documents which are nearby in the
"semantic space" computed by SVD.

