## Latent Semantic Analysis - Week 4

Analyzed by Sahil Phule

In [272]:

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


In [273]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/sahil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [298]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']

dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data #Data Imported is already text. So no need to use Beautiful Soup.


In [275]:
# Making everything lowercase
corpus = [x.lower() for x in corpus]

## Stopwords

Using stopwords provided in the nltk package. Also adding some of my own to the list. 

In [288]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter'
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class',  ])
#Adding a few of my own

stopset.update(['\n','\t','-','*','0','00','000','01','0010','006','cs','ca',
                '001211', '18457','000th','001','edu','com','001th','nntp','vb30',
               '002' ,'755','002251w','734117130','0000ahc', 'udcps3' ,'cps', '003' ,'759',
                '0023' ,'lafibm',
                '004746' ,'13007', 'ramsey','0096a95c',
'003015', 'vmsb', 'csupomona', '0062','0096b0f0', 'c5de05a0',
               '005314', '5700', 'mnemosyne','007','844',
                '00cgbabbitt', '00ecgillespi','00ecgillespie',
               'uiuc','bsu','005','866', '0114', '619', '534',
               '00bjgood', 'leo', 'bsuvc','00mbstultz', 'subject','00x',
                '00pmlemen','010745','acad', '010423', '11050','012139','13444',
               
               ])



## Vectorizing

In [295]:
#Earlier Data
corpus[0]

u"from: writingctr@leo.bsuvc.bsu.edu\nsubject: re: cub fever.\norganization: ball state university, muncie, in - univ. computing svc's\nlines: 21\n\n\nin article <kingoz.735285670@camelot>, kingoz@camelot.bradley.edu (orin roth) writes:\n> \n>    cub fever is hitting me again. i'm beginning to think they have a \n>    chance this year. (what the heck am i thinking?)\n>    sorry. just a moment of incompetence.\n>    i'll be ok. really. \n>    orin.\n>    bradley u.\n> \n> --\n> i'm really a jester in disguise!                                   \ni hear ya!  then again, we must remember that we are indeed cub fans, and\nthat the cubs will eventually blow it.  after all, the cubs are the easiest\nteam in the national league to root for.  no pressure.  you know they will\nlose eventually.  oh well, i suppose we must have faith.  after all, they\ndo look pretty good, and they don't even have sandberg back yet.  \n\ncubs in '93!!!!!\n\ncha\n"

In [289]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

In [294]:
X[0]

<1x187320 sparse matrix of type '<type 'numpy.float64'>'
	with 212 stored elements in Compressed Sparse Row format>

In [296]:
#vectorized Data
print X[0]

  (0, 50608)	0.0757266939156
  (0, 186580)	0.0757266939156
  (0, 28540)	0.0757266939156
  (0, 144261)	0.0757266939156
  (0, 62001)	0.0757266939156
  (0, 75641)	0.0757266939156
  (0, 131370)	0.0757266939156
  (0, 101113)	0.0757266939156
  (0, 64393)	0.0757266939156
  (0, 112468)	0.0757266939156
  (0, 160590)	0.0757266939156
  (0, 178565)	0.0757266939156
  (0, 117710)	0.0757266939156
  (0, 62142)	0.0757266939156
  (0, 101605)	0.0757266939156
  (0, 93400)	0.0757266939156
  (0, 131285)	0.0757266939156
  (0, 141383)	0.0757266939156
  (0, 96038)	0.0757266939156
  (0, 113110)	0.0757266939156
  (0, 162999)	0.0757266939156
  (0, 58609)	0.0757266939156
  (0, 50648)	0.0757266939156
  (0, 35368)	0.0757266939156
  (0, 62128)	0.0757266939156
  :	:
  (0, 184967)	0.0231320652283
  (0, 42869)	0.0403579719618
  (0, 164973)	0.0231672531944
  (0, 32392)	0.0552876358453
  (0, 82328)	0.0352862910281
  (0, 183100)	0.0160286564865
  (0, 141634)	0.0569067726271
  (0, 120393)	0.113813545254
  (0, 37178)	0.10205

## LSA

In [290]:
X.shape


(994, 187320)

In [291]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)


TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

In [292]:
lsa.components_[0]

array([ 0.00028248,  0.00028248,  0.00028248, ...,  0.00111762,
        0.00111762,  0.00111762])

In [293]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print term[0]
    print " "

Concept 0:
year
team
would
game
writes
article
baseball
players
games
one
 
Concept 1:
writes article
game
jewish
roger
first
one
players
lafayette
jhu
come
 
Concept 2:
clutch
morris
batting
run
pitcher
win
pitching
average
want
hirschbeck
 
Concept 3:
year
win
first
team
host
morris
good
baseball
world
also
 
Concept 4:
games
first
last
two
know
aix
anyone
guys
season
10
 
Concept 5:
article
last
player
ibm
04
get
could
league
say
011653 7403
 
Concept 6:
team
university
organization
going
one
see
game
go
play
would
 
Concept 7:
runs
hitter
year
good
three
game
many
get
morris
013
 
Concept 8:
hit
last
still
better
hitter
winning
scott
host
first
may
 
Concept 9:
win
011653
know
average
even
lines
right
organization university
fans
york
 
Concept 10:
games
article
go
writes
even
011653
013
hitter
roger
might
 
Concept 11:
two
game
think
let
roger
ball
season
win
anyone
ted
 
Concept 12:
article
year
011653
see
cubs
runs
better
writes article
people
three
 
Concept 13:
games
baseball


## Conclusion

I have done the Latent Symantic analysis for the baseball newsgroup. The analysis on the newsgroup speaks about the winning and losing trends of various teams, and also about the performance of the players in it.Like which one was the best picher, which one could have done better, etc.

Stopwords were most significant factor in the analysis. In this example I have manually added the stopwords by looking at the data and generated concepts.
