## Latent Semantic Analysis - Week 4 - Data Science - Jared Knowles

## NOTE:  This version of Latent Semantic Analysis is a modified copy of the original provided by Michael Bernico.

Added import for fetch_20newsgroups

In [3]:
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups

In [4]:
#run this only once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/keri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Following cell modified to import data from "alt.atheism" newgroup instead of 'raw_forum_posts.dat'

In [5]:
categories = ['alt.atheism']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

In [41]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter'
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class', 'nntp', 
                'posting', 're', 'host', 'jon', 'keith', 'livesey', 'organization', 'kent', 'sandvik', 'cobb', 'cs', 'solntze', 
                'schneider', 'sgi', 'ico', 'com', 'tek', 'edu', 'wpd', 'de', 'uk', 'caltech', 'cc', 'kmr4', 'uiuc', 
                'isn', 'robert', 'beauchaine', 'osrhe', 'okcforum', 'frank', 'dwyer', 'bobbe', 'subject', 've', 'conner', 'benedikt',
                'cwru', 'jaeger', 'cco', 'dbstu1', 'rz', 'tu', 'alexia', 'lis', 'alt', 'us', 'bu', 'university', 'monash', 'faq',
                'allan', 'co', 'bill', 'd012s658', 'swinburne', 'would', 'however', 'au', 'jim', 'could', 'gregg', 'po', 'rh',
                'fi', 'oulu', 'distribution', 'statement', 'etc', 'perry', 'mantis', 'well', 'll', 'whether', 'either', 'like',
                'yet', 'also', 'really', 'actually', 'go', 'get', 'much', 'rather', 'perhaps', 'reply', 'yes', 'article', 'dan', ])


### TF-IDF Vectorizing

I'm going to use scikit-learn's TF-IDF vectorizer to take my corpus and convert each document into a sparse matrix of TFIDF Features...

In [42]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

In [43]:
X[0]

<1x197801 sparse matrix of type '<type 'numpy.float64'>'
	with 525 stored elements in Compressed Sparse Row format>

In [44]:
#After
print X[0]

  (0, 9053)	0.0343411268565
  (0, 161372)	0.0343411268565
  (0, 76994)	0.0343411268565
  (0, 177782)	0.0343411268565
  (0, 6168)	0.0364813736275
  (0, 5844)	0.0364813736275
  (0, 5692)	0.0364813736275
  (0, 190905)	0.0364813736275
  (0, 98047)	0.0364813736275
  (0, 122998)	0.0343411268565
  (0, 187555)	0.0343411268565
  (0, 43669)	0.0343411268565
  (0, 93486)	0.0343411268565
  (0, 9636)	0.0343411268565
  (0, 9620)	0.0386788868944
  (0, 146342)	0.0419221813194
  (0, 26100)	0.0386788868944
  (0, 34945)	0.0434619997185
  (0, 14562)	0.0434619997185
  (0, 61319)	0.0386788868944
  (0, 64428)	0.0386788868944
  (0, 119971)	0.0386788868944
  (0, 48162)	0.0386788868944
  (0, 4920)	0.0386788868944
  (0, 37984)	0.0343411268565
  :	:
  (0, 97520)	0.0291126612461
  (0, 147735)	0.0204065399088
  (0, 49741)	0.0406640580797
  (0, 78480)	0.0241172063404
  (0, 194937)	0.00868177165955
  (0, 112078)	0.0272361627501
  (0, 46326)	0.031391518164
  (0, 119190)	0.0229644374089
  (0, 1303)	0.0482451125427
  (0,

###LSA

Input:  X, a matrix where m is the number of documents I have, and n is the number of terms.

Process:   I'm going to decompose X into three matricies called U, S, and T.  When we do the decomposition, we have to pick a value k, that's how many concepts we are going to keep.  

$$X \approx USV^{T}$$

U will be a m x k matrix.  The rows will be documents and the columns will be 'concepts'

S will be a k x k diagnal matrix.   The elements will be the amount of variation captured from each concept.

V will be a m x k (mind the transpose) matrix.   The rows will be terms and the columns will be conepts.  

In [45]:
X.shape

(799, 197801)

In [46]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)



TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

In [47]:
#This is the first row for V
lsa.components_[0]

array([ 0.00175525,  0.0004119 ,  0.0004119 , ...,  0.00045795,
        0.00045795,  0.00045795])

In [48]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print term[0]
    print " "

Concept 0:
god
one
people
think
writes
say
morality
believe
atheism
moral
 
Concept 1:
theism
fanatism
belief
irrational
rational
evidence
many
fanaticism
people
say
 
Concept 2:
god
islam
religion
liar
atheists
atheism
said
lunatic
bible
jesus
 
Concept 3:
god
books
know
press
made
prometheus
prometheus books
often
history
existence god
 
Concept 4:
god
know
see
reason
say
morality
vice
never
moral
take
 
Concept 5:
time
morality
moral
society
people
case
atheists
saying
right
need
 
Concept 6:
god
atheism
way
religious
atheists
thing
cannot
faith
atheist
things
 
Concept 7:
god
atheists
others
faith
real
far
good
different
back
agree
 
Concept 8:
atheists
many
people
may
religion
jesus
god
must
take
matthew
 
Concept 9:
jesus
islamic
moral
bible
matthew
thought
something
make
religious
said
 
Concept 10:
must
time
jesus
things
religion
mean
one
matthew
peace
person
 
Concept 11:
believe
god
islam
many
moral
know
say
atheists
real
rushdie
 
Concept 12:
atheists
writes
atheism
people
p

### At this point, it would appear at a basic level, if your "computer" or researcher at the moment didn't exactly know what atheism is all about, the observer would at least know by now that it has something to do with religion, god, morality, rational vs. irrational, etc.  Maybe by further fine tuning the stop words, one could get to a much closer view of what atheism is, but at least its getting into its realm. ###