## Latent Semantic Analysis

In this notebook we're going to look at how we can 'mine' concepts from a corpus (collection) of text documents.

In the first week of class everyone wrote their own definition of data science.   This week I'm going to show you how to extract 'concepts' from that corpus mathematically.  The techinque we're going to use is called latent semantic analysis.  

In [77]:
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [78]:
#run this only once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/peymanmohajerian/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

To get the newsgroup data 'ec.sport.baseball'

In [79]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

In [80]:
corpus = [x.lower() for x in corpus]

Stopwords are words that I don't want to convert to featurs,becuase they aren't especially useful.  Words like 'a', 'and', and 'the' are good stopwords in english.   I can use a built in list of stopwords from nltk to get me started.  Then, I'll add some custom stopwords that are 'html junk' that I need to clean out of my data.

In [109]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter'
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class', 'nntp', 
               '00', 'would', '00 00', '00 00 00', 'edu', 'com', 'david', 'lafayette', '000 000 151', '000 000 74',
               '000 000', '000 000 000', '10', '000 000', '000 000 crunch', '000 000 74', '000 000 067',
               '000 000 assuming'])


### TF-IDF Vectorizing

I'm going to use scikit-learn's TF-IDF vectorizer to take my corpus and convert each document into a sparse matrix of TFIDF Features...

In [110]:
#Before!
corpus[0]

"from: writingctr@leo.bsuvc.bsu.edu\nsubject: re: cub fever.\norganization: ball state university, muncie, in - univ. computing svc's\nlines: 21\n\n\nin article <kingoz.735285670@camelot>, kingoz@camelot.bradley.edu (orin roth) writes:\n> \n>    cub fever is hitting me again. i'm beginning to think they have a \n>    chance this year. (what the heck am i thinking?)\n>    sorry. just a moment of incompetence.\n>    i'll be ok. really. \n>    orin.\n>    bradley u.\n> \n> --\n> i'm really a jester in disguise!                                   \ni hear ya!  then again, we must remember that we are indeed cub fans, and\nthat the cubs will eventually blow it.  after all, the cubs are the easiest\nteam in the national league to root for.  no pressure.  you know they will\nlose eventually.  oh well, i suppose we must have faith.  after all, they\ndo look pretty good, and they don't even have sandberg back yet.  \n\ncubs in '93!!!!!\n\ncha\n"

In [111]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

In [112]:
X[0]

<1x186595 sparse matrix of type '<class 'numpy.float64'>'
	with 224 stored elements in Compressed Sparse Row format>

### LSA

Input:  X, a matrix where m is the number of documents I have, and n is the number of terms.

Process:   I'm going to decompose X into three matricies called U, S, and T.  When we do the decomposition, we have to pick a value k, that's how many concepts we are going to keep.  

$$X \approx USV^{T}$$

U will be a m x k matrix.  The rows will be documents and the columns will be 'concepts'

S will be a k x k diagnal matrix.   The elements will be the amount of variation captured from each concept.

V will be a n x k (mind the transpose) matrix.   The rows will be terms and the columns will be conepts.  

In [113]:
X.shape

(994, 186595)

In [114]:
lsa = TruncatedSVD(n_components=17, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=17, n_iter=100,
       random_state=None, tol=0.0)

In [115]:
#This is the first row for V
lsa.components_[0]

array([ 0.01545924,  0.0019262 ,  0.000441  , ...,  0.00111875,
        0.00111875,  0.00111875])

In [116]:
import sys
print (sys.version)

3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


In [117]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print("Concept %d:" % i )
    for term in sortedTerms:
        print(term[0])
    print (" ")

Concept 0:
year
team
writes
game
cs
article
baseball
players
games
one
 
Concept 1:
host
subject
jewish
may
two
roger
players
jewish baseball
games
game
 
Concept 2:
braves
university
lines
think
article
hirschbeck
time
big
gant
years
 
Concept 3:
players
runs
pitching
host
posting host
one
000 000 067
something
good
time
 
Concept 4:
one
runs
games
baseball
morris
also
know
year
make
hitter
 
Concept 5:
000 000 assuming
first
say
pitcher
could
writes article
subject
two
year
last year
 
Concept 6:
good
better
roger
time
lot
players
hitter
think
000 000 crunch
000 000 151
 
Concept 7:
team
games
mets
posting host
john
year
sox
little
play
000 000 74
 
Concept 8:
game
cs
article
think
best
say
jewish
roger
ca
organization university
 
Concept 9:
team
university
players
baseball
games
subject
years
least
something
ted
 
Concept 10:
games
writes
one
university
really
say
last year
posting host
000 000
hit
 
Concept 11:
year
time
university
last
000 000 assuming
win
also
players
good
maybe

In [118]:
lsa.components_

array([[  1.54592373e-02,   1.92619966e-03,   4.40999914e-04, ...,
          1.11875033e-03,   1.11875033e-03,   1.11875033e-03],
       [  5.06577499e-03,  -5.19664411e-02,  -3.20938720e-02, ...,
          1.58752946e-04,   6.16250150e-05,   6.16250150e-05],
       [  3.86491324e-03,   5.10748277e-02,   2.81205607e-02, ...,
          1.03794495e-03,   9.88211495e-04,   9.88211495e-04],
       ..., 
       [ -5.58862633e-02,   9.87913578e-04,  -1.43179243e-02, ...,
         -1.33125616e-04,  -1.57117708e-04,  -1.57117708e-04],
       [  4.50331924e-02,   5.36283695e-02,   5.65795199e-03, ...,
         -1.48653161e-03,  -1.46668337e-03,  -1.46668337e-03],
       [ -5.21057368e-02,   1.20813186e-01,   4.09935571e-02, ...,
          1.09470061e-03,   1.07514995e-03,   1.07514995e-03]])