### Classification of text documents


In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups



## Working with documents

* How do we represent a text document as numbers that a classifier can use?


In [3]:
ng = fetch_20newsgroups()  # get the training subset

In [6]:
ng.keys()

dict_keys(['data', 'target', 'target_names', 'filenames', 'DESCR'])

In [35]:
len(ng.data)

11314

In [4]:
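# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the older sklearn.feature_extraction.stop_words module is gone in recent scikit-learn releases.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS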

In [4]:
ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [5]:
print(ng.target_names[ng.target[50]], "\n\n", ng.data[50])

comp.windows.x 

 From: johnc@crsa.bu.edu (John Collins)
Subject: Problem with MIT-SHM
Organization: Boston University
Lines: 27

I am trying to write an image display program that uses
the MIT shared memory extension.  The shared memory segment
gets allocated and attached to the process with no problem.
But the program crashes at the first call to XShmPutImage,
with the following message:

X Error of failed request:  BadShmSeg (invalid shared segment parameter)
  Major opcode of failed request:  133 (MIT-SHM)
  Minor opcode of failed request:  3 (X_ShmPutImage)
  Segment id in failed request 0x0
  Serial number of failed request:  741
  Current serial number in output stream:  742

Like I said, I did error checking on all the calls to shmget
and shmat that are necessary to create the shared memory
segment, as well as checking XShmAttach.  There are no
problems.

If anybody has had the same problem or has used MIT-SHM without
having the same problem, please let me know.

By the way, I 

### Bag of Words
### TF-IDF

* Term Frequency
  * raw frequency
  * raw frequency / max raw frequency of any word in the document (prevents a bias toward longer documents)
* Inverse Document Frequency
  * log(Number of Documents / Number of Documents containing the word)

See:  https://en.wikipedia.org/wiki/Tf–idf
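To make the two definitions above concrete, here is a minimal sketch that computes them by hand in plain Python. The three-sentence corpus and the helper names (`term_frequency`, `inverse_document_frequency`) are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the car is fast",
    "the car is red",
    "the sky is blue",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(tokens):
    """Raw count of each word divided by the count of the most frequent word."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """log(N / number of documents containing the word) for every word."""
    n_docs = len(tokenized_docs)
    doc_counts = Counter(word for tokens in tokenized_docs for word in set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_counts.items()}


idf = inverse_document_frequency(tokenized)
tfidf = [
    {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
    for tokens in tokenized
]

for doc, scores in zip(docs, tfidf):
    print(doc, {word: round(score, 3) for word, score in scores.items()})
```

Words that appear in every document ("the", "is") get an IDF of log(1) = 0, so they drop out of the score entirely, while a word unique to one document ("fast") gets the largest weight.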


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
# use_idf=False: skip the IDF weighting and keep only (normalized) term frequencies
vectorizer = TfidfVectorizer(use_idf=False)

In [39]:
X = vectorizer.fit_transform(ng.data)



OK, now we have a big sparse matrix! The rows of the matrix correspond to our documents and the columns correspond to the features (words) found anywhere in the corpus. Because we set ``use_idf=False``, the value in each cell is the term frequency of that word in that document, normalized so that each row has unit length (the vectorizer's default).
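A quick way to get a feel for such a matrix is to look at its shape and at how many entries are actually stored. The sketch below uses a tiny made-up corpus (`toy_docs`) instead of `ng.data` so that the dense view is small enough to print; the names `toy_vectorizer` and `toy_X` are only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for ng.data.
toy_docs = [
    "the car is fast",
    "the car is red",
    "the sky is blue",
]

toy_vectorizer = TfidfVectorizer(use_idf=False)  # term frequency only, rows scaled to unit length
toy_X = toy_vectorizer.fit_transform(toy_docs)

n_docs, n_terms = toy_X.shape
stored = toy_X.nnz  # number of explicitly stored (non-zero) entries
density = stored / (n_docs * n_terms)

print("shape:", toy_X.shape)
print("stored entries:", stored)
print("density:", round(density, 3))
print(toy_X.toarray().round(3))  # a dense view is only safe for tiny matrices
```

On the real newsgroup matrix the same `shape` and `nnz` attributes apply, but calling `toarray()` would try to allocate every zero as well, so it is best avoided there.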

The challenge is that the columns are indexed numerically. Further, because this is a sparse data structure, each cell is addressed by a (row, column) tuple. Let's start by printing out the non-zero entries of the sparse matrix for the first few documents.

Ummm, a tuple as an index? How does that work?

In [59]:
x = {}
x[(1,2)] = 'hello'
print(x[(1,2)])
# Now for something cool: the parentheses are optional, so x[1,2] is the same lookup as x[(1,2)]
print(x[1,2])


hello
hello


But how is this a sparse matrix? A sparse matrix stores only the non-zero cells, and every other cell reads as zero. Take a look at the ``defaultdict`` version of the above.


In [62]:
from collections import defaultdict
x = defaultdict(float)
x[1,2] = 'hello'
print(x[1,2])
print(x[2,3])

hello
0.0
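The same idea can be pushed one step further into a toy "dictionary of keys" matrix. This is a sketch for intuition only: the class name `DictSparseMatrix` is invented here, and SciPy's real sparse formats are implemented differently (CSR, for example, uses compact arrays rather than a dict). One difference from `defaultdict` worth noticing is that reading a missing cell with `.get()` does not insert it.

```python
class DictSparseMatrix:
    """Toy sparse matrix that stores only non-zero cells in a dict keyed by (row, col)."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self._cells = {}

    def __setitem__(self, key, value):
        if value == 0.0:
            self._cells.pop(key, None)  # zeros are never stored
        else:
            self._cells[key] = value

    def __getitem__(self, key):
        # .get() returns the default without inserting the key,
        # unlike defaultdict, which stores every key that is read.
        return self._cells.get(key, 0.0)

    def stored(self):
        """Number of cells that actually occupy memory."""
        return len(self._cells)


m = DictSparseMatrix(11314, 130107)  # same shape as the newsgroup matrix
m[0, 37780] = 0.32
print(m[0, 37780])  # a stored cell
print(m[5, 5])      # a missing cell reads as zero
print(m.stored())   # still only one cell stored
```

The matrix claims over a billion cells, yet memory use depends only on the handful of non-zero entries.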


OK, now you should understand what is going on below.

In [55]:
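# get_feature_names_out() is the current name; older scikit-learn releases call it get_feature_names().
# Fetch the names once instead of rebuilding the list for every non-zero cell.
feature_names = vectorizer.get_feature_names_out()
len(feature_names)

for i in range(4):  # There are really 11314 rows
    for j in range(X.shape[1]):  # 130107 columns: big, but most cells are 0.0 and not actually stored
        if X[i, j] != 0.0:
            print("document = ", i, " score = ", X[i, j], j, feature_names[j])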

document =  0  score =  0.0640184399664 4605 15
document =  0  score =  0.0640184399664 16574 60s
document =  0  score =  0.0640184399664 18299 70s
document =  0  score =  0.0640184399664 26073 addition
document =  0  score =  0.0640184399664 27436 all
document =  0  score =  0.128036879933 28615 anyone
document =  0  score =  0.0640184399664 32311 be
document =  0  score =  0.0640184399664 34181 body
document =  0  score =  0.0640184399664 34995 bricklin
document =  0  score =  0.0640184399664 35187 brought
document =  0  score =  0.0640184399664 35612 bumper
document =  0  score =  0.0640184399664 35983 by
document =  0  score =  0.0640184399664 37433 called
document =  0  score =  0.0640184399664 37565 can
document =  0  score =  0.320092199832 37780 car
document =  0  score =  0.0640184399664 40998 college
document =  0  score =  0.0640184399664 42876 could
document =  0  score =  0.0640184399664 45295 day
document =  0  score =  0.0640184399664 48618 door
document =  0  score =  0

But this is all numbers! What if I want the word frequency value for a particular word in a particular document? How do I know which index to use? The vectorizer's feature-name list (``get_feature_names()``, or ``get_feature_names_out()`` in scikit-learn 1.0 and later) holds every word in the corpus, and a word's position in that list is its column index. So to go from word to index, we just look up the word in the list.

In [68]:
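# get_feature_names_out() returns a NumPy array; wrap it in list() so that .index() works later on
wordList = list(vectorizer.get_feature_names_out())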
wordList[37780:37790]

['car',
 'car377',
 'caraballo',
 'caralv',
 'caramate',
 'caramel',
 'caramelizing',
 'caramete',
 'carat',
 'caratzas']
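Searching a list for a word is a linear scan. A fitted vectorizer also exposes a `vocabulary_` dictionary that maps each word directly to its column index, which is usually the more convenient route. A small sketch on a made-up corpus (the `toy_*` names are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for ng.data.
toy_docs = [
    "the car is fast",
    "the car is red",
    "the sky is blue",
]
toy_vectorizer = TfidfVectorizer(use_idf=False)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps word -> column index, so the lookup is a single dict access.
col = toy_vectorizer.vocabulary_["car"]
print("column for 'car':", col)
print("score in document 0:", toy_X[0, col])
print("score in document 2:", toy_X[2, col])  # 'car' does not occur there
```

A word that never appeared in the training corpus is simply absent from `vocabulary_`, so `vocabulary_.get(word)` is a safe way to test for it.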

OK, as a final test let's look at a raw document, and then print out the word for every non-zero entry in that document's row of the matrix. All of the words in the document should be there.

In [52]:
ng.data[4]



In [54]:
for i in range(len(wordList)):
    if X[4,i] != 0.0:
        print(wordList[i])

213
23
4h9
about
after
already
am
an
and
are
aren
article
as
astrophysical
baker
basically
be
because
before
bugs
but
by
c5jlwx
c5owcb
cambridge
caution
cfa
checked
clear
cmu
code
com
conditions
crew
cs
curious
distribution
don
dumb
edu
error
errors
etrat
expected
fix
from
harvard
have
head
if
ignore
in
introduce
is
it
jcm
jonathan
just
knew
known
launch
liftoff
lines
ma
mcdowell
meaning
memory
might
my
n3p
new
no
observatory
of
ok
or
organization
pack
parity
possibly
previously
question
quote
rat
rather
re
real
really
right
sci
see
set
shuttle
smithsonian
software
sorry
std
subject
suchlike
system
tell
than
that
the
they
things
this
till
to
tom
tombaker
ttacs1
ttu
understanding
unexpected
usa
values
verify
waivered
we
were
what
wondering
world
writes
yes
yet
you
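The loop above checks every column, including all the zeros. Since a sparse row already knows where its non-zero entries are, a shorter route is to read them directly. The sketch below assumes the matrix is in CSR format (hence the explicit `tocsr()`) and again uses a small made-up corpus; `index_to_word` is a helper built here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for ng.data.
toy_docs = [
    "the car is fast",
    "the car is red",
    "the sky is blue",
]
toy_vectorizer = TfidfVectorizer(use_idf=False)
toy_X = toy_vectorizer.fit_transform(toy_docs).tocsr()

# Invert vocabulary_ (word -> column) into column -> word.
index_to_word = {int(col): word for word, col in toy_vectorizer.vocabulary_.items()}

# For a CSR row, .indices holds the column numbers of the stored entries
# and .data holds the matching values.
row = toy_X[2]
words = sorted(index_to_word[int(j)] for j in row.indices)
print(words)
```

Only the stored entries of the row are visited, so the cost depends on the length of the document rather than on the size of the vocabulary.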


Finally, let's pick a word and see which documents it appears in. Let's use 'car'.


In [69]:
idx = wordList.index('car')
idx

37780

In [70]:
for i in range(X.shape[0]):  # one row per document (11314 here)
    if X[i, idx] != 0.0:  # use idx rather than hard-coding 37780
        print('car appears in document ', i, 'with score ', X[i, idx])

car appears in document  0 with score  0.320092199832
car appears in document  17 with score  0.115454019671
car appears in document  29 with score  0.138675049056
car appears in document  30 with score  0.151584765648
car appears in document  48 with score  0.0854357657717
car appears in document  56 with score  0.0263798071274
car appears in document  64 with score  0.0813788458771
car appears in document  71 with score  0.0371904001653
car appears in document  72 with score  0.0336717514851
car appears in document  77 with score  0.0406222231851
car appears in document  84 with score  0.0518475847365
car appears in document  156 with score  0.230283093236
car appears in document  181 with score  0.0648203723552
car appears in document  201 with score  0.0701862406344
car appears in document  262 with score  0.11032394541
car appears in document  274 with score  0.0821994936527
car appears in document  292 with score  0.0510976130308
car appears in document  438 with score  0.0873704
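The same question can be answered without visiting every row by slicing out the word's column and asking where it is non-zero. This is a sketch on a made-up corpus; the `toy_*` names are illustrative, and for very large matrices converting to a column-oriented format first may be faster.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for ng.data.
toy_docs = [
    "the car is fast",
    "the car is red",
    "the sky is blue",
]
toy_vectorizer = TfidfVectorizer(use_idf=False)
toy_X = toy_vectorizer.fit_transform(toy_docs)

col = toy_vectorizer.vocabulary_["car"]

# Slice out the single column, then nonzero() returns (row_indices, col_indices);
# the row indices are the documents that contain the word.
doc_ids = sorted(int(i) for i in toy_X[:, col].nonzero()[0])

for i in doc_ids:
    print("car appears in document", i, "with score", toy_X[i, col])
```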

If we just want the list of words in a particular document, independent of the scores, we can get it easily with the vectorizer's ``inverse_transform`` function. The following prints out all of the words in document 4:

In [50]:
# Pass only row 4 so the vectorizer does not have to invert all 11314 documents
vectorizer.inverse_transform(X[4])[0]

array(['from', 'edu', 'my', 'subject', 'what', 'is', 'this',
       'organization', 'of', 'lines', 'wondering', 'if', 'the', 'it', 'to',
       'be', 'were', 'really', 'in', 'or', 'you', 'have', 'by', 'article',
       'and', 'are', 'distribution', 'usa', 'after', 'as', 'new', 'than',
       'that', 'expected', 'an', 'but', 'don', 'about', 'just', 'rather',
       'real', 'question', 'might', 'tom', 're', 'world', 'com', 'writes',
       'no', 'jonathan', 'jcm', 'head', 'cfa', 'harvard', 'mcdowell',
       'shuttle', 'launch', 'smithsonian', 'astrophysical', 'observatory',
       'cambridge', 'ma', 'sci', '23', 'c5owcb', 'n3p', 'std', 'tombaker',
       'baker', 'c5jlwx', '4h9', 'cs', 'cmu', 'etrat', 'ttacs1', 'ttu',
       'unexpected', 'errors', 'am', 'error', 'sorry', 'dumb', 'parity',
       'previously', 'known', 'conditions', 'waivered', 'yes', 'we',
       'already', 'knew', 'curious', 'meaning', 'quote', 'understanding',
       'basically', 'bugs', 'system', 'software', 'things