<h2>Text featurization and simplification using spaCy</h2>
<p>There are many packages for NLP, notably:<br/>
<ul>
<li><a href="https://spacy.io/">spaCy</a></li>
<li><a href="http://www.nltk.org/">NLTK (University of Pennsylvania)</a></li>
<li><a href="https://github.com/mit-nlp/MITIE">MITIE (MIT)</a></li>
</ul>

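I'll be walking through spaCy and NLTK, which are the most prevalent.  Let's start by loading our modules.
spaCy's pipeline handles the tokenization, part-of-speech tagging, and lemmatization tasks that we hope to do, and its statistical components (such as the tagger) are neural networks (NN).  The following call reads in a set of pretrained weights, which need to be downloaded first. In spaCy 3.x the `'en'` shortcut was removed, so a full model name such as `en_core_web_sm` is needed there:<br/>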
```
nlp = spacy.load('en')
```
</p>

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
nlp = spacy.load('en')

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

<h2>Basic Clean Up</h2>
We will start by reviewing the effects of the following steps:
<ul>
<li>Lowercasing &amp; punctuation stripping</li>
<li>Stemming</li>
<li>Lemmatization</li>
<li>Stop Word Removal</li>
</ul>
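
As a minimal sketch of the first step, assuming plain ASCII punctuation and whitespace tokenization, lowercasing and punctuation stripping can be done with the standard library alone. The helper name `basic_clean` is hypothetical and not part of spaCy or NLTK.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def basic_clean(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

print(basic_clean("Well i'm not sure about the story."))
```

Note that stripping the apostrophe merges "i'm" into "im". This is one reason tokenizers such as `word_tokenize` split contractions into separate tokens instead.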

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=13, remove=('headers', 'footers', 'quotes'))

data = dataset.data
print(data[0])

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



In [7]:
''' 
text to sentence tokens ->
Not always necessary, but certainly useful when we want to preserve 
some contextual cues.
'''
l_s = sent_tokenize(data[0])
print("Sentence Tokens:\n",l_s[:2])

tokens = list(map(word_tokenize, l_s))
print("\nWord Tokens:\n", tokens[:2])

# Use only the English stop word list; stopwords.words() with no argument
# returns the stop words of every language NLTK ships.
s_stop = set(stopwords.words('english'))
tokens_stop_free = [[word for word in sent if word not in s_stop] for sent in tokens]
print("\nNo stop word tokens:\n", tokens_stop_free[:2])

# Reduce each remaining token to its Porter stem (this also lowercases it).
st = PorterStemmer()
stems = [[st.stem(word) for word in sent] for sent in tokens_stop_free]
print("\nStems:\n", stems[:2])

Sentence Tokens:
 ["Well i'm not sure about the story nad it did seem biased."]

Word Tokens:
 [['Well', 'i', "'m", 'not', 'sure', 'about', 'the', 'story', 'nad', 'it', 'did', 'seem', 'biased', '.']]

No stop word tokens:
 [['Well', "'m", 'sure', 'story', 'nad', 'seem', 'biased', '.']]

Stems:
 [['well', "'m", 'sure', 'stori', 'nad', 'seem', 'bias', '.']]


<h2>Vectorization</h2>
<p>Basic vectorization using sklearn</p>

In [8]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words='english')

vectorized = tf_vectorizer.fit_transform(data)

vectorized[0].indices

array([36008, 27476, 16546, 28724, 35309, 19850, 21661,  6175, 16885,
       17780, 22208, 16568, 29102,  7165, 32431,  9608,  4301, 18960,
       29679, 11106,  6040, 31583,  5926,  9204, 29673, 29057, 34703,
       11567, 14137, 14350, 33554, 18368, 35470, 25282, 21231, 11803,
       18665, 29036, 14135, 21515, 17294, 38004, 19576, 27819, 29228,
       29725, 19582, 30569, 22786, 33080, 12220,  6935, 12059, 24262,
       33307, 33831], dtype=int32)

In [9]:
#vectorized[0].indices
print([ x for x in vectorized[0].toarray()[0] if x!=0])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [10]:
# On scikit-learn >= 1.0, use get_feature_names_out(); get_feature_names() was removed in 1.2.
tf_vectorizer.get_feature_names()[vectorized[0].indices[0]]

'unfortunate'
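
To make the sparse row above less opaque, here is a minimal pure-Python sketch of what a count vectorizer does conceptually: build a sorted vocabulary, then count how often each vocabulary term appears in a document. The function names and the toy corpus are illustrative assumptions. The real `CountVectorizer` additionally applies its own tokenization, stop word filtering, and `min_df`/`max_df` pruning.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct lowercased token to a column index (sorted alphabetically)."""
    terms = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {term: idx for idx, term in enumerate(terms)}

def count_vector(doc, vocab):
    """Return {column index: count} for the tokens of one document."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    return {vocab[term]: n for term, n in sorted(counts.items())}

# Toy corpus, made up for illustration.
toy_docs = ["the media is the media", "the story is biased"]
vocab = build_vocabulary(toy_docs)
print(vocab)
print(count_vector(toy_docs[0], vocab))
```

The dictionary keys play the role of `vectorized[0].indices`, and the values play the role of the non-zero counts printed earlier.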

In [27]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(data)

print([ x for x in tfidf[0].toarray()[0] if x!=0])

[0.10645192432513927, 0.13206693492719013, 0.13911073526391529, 0.078280425970325543, 0.1221234257718059, 0.15225085694513146, 0.09499209337222235, 0.14727716815782282, 0.11380286854613492, 0.11080947107018407, 0.10263239106092439, 0.063333491255439217, 0.10645192432513927, 0.10645192432513927, 0.13565443116491285, 0.13273216768757162, 0.067786925676237994, 0.076248650507555291, 0.13206693492719013, 0.074382503502486347, 0.12694299022735112, 0.11168121952486858, 0.16962115296576974, 0.15424819608892076, 0.20696272450507047, 0.3238693490546905, 0.095650721990963791, 0.10104064353212389, 0.1061163901077104, 0.069708577574150082, 0.079384274707155839, 0.39735285671770609, 0.16539097862634769, 0.12961744379182066, 0.074952211469356267, 0.098878623466974791, 0.13818581093036333, 0.099690370220266764, 0.077128900942245643, 0.0957467079600101, 0.16193467452734525, 0.097040876749444088, 0.098319373814489477, 0.13565443116491285, 0.15225085694513146, 0.13206693492719013, 0.11080947107018407, 0.

<h2>Tokenization</h2>
<p>spaCy allows us to do some interesting things during the tokenization process.   In particular, each token carries its part of speech (`pos_`) and its lemma (`lemma_`).</p>

In [11]:
doc = nlp(data[0])

In [12]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Well well INTJ UH intj Xxxx True False
i -PRON- PRON PRP nsubj x True True
'm be VERB VBP ROOT 'x False False
not not ADV RB neg xxx True True
sure sure ADJ JJ acomp xxxx True False
about about ADP IN prep xxxx True True
the the DET DT det xxx True True
story story NOUN NN compound xxxx True False
nad nad NOUN NN pobj xxx True False
it -PRON- PRON PRP nsubj xx True True
did do VERB VBD aux xxx True True
seem seem VERB VB relcl xxxx True True
biased biased ADJ JJ oprd xxxx True False
. . PUNCT . punct . False False
What what NOUN WP pobj Xxxx True False

 
 SPACE   
 False False
I -PRON- PRON PRP nsubj X True False
disagree disagree VERB VBP csubj xxxx True False
with with ADP IN prep xxxx True True
is be VERB VBZ ROOT xx True True
your -PRON- ADJ PRP$ poss xxxx True True
statement statement NOUN NN nsubj xxxx True False
that that ADP IN mark xxxx True True
the the DET DT det xxx True True
U.S. u.s. PROPN NNP compound X.X. False False
Media media PROPN NNP nsubj Xxxxx True False
is be V

<h2>NER</h2>
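<p>spaCy also allows for Named Entity Recognition (NER).  It leans on cues such as capitalized nouns, then uses contextual cues to try to assign an entity type to each span.  In the output below, note that stray newline characters are also tagged as GPE, which suggests cleaning up whitespace before running NER.</p>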

In [30]:
for ent in doc.ents:
    print(ent, ent.start_char, ent.end_char, ent.label_)


 62 63 GPE
U.S. 106 110 GPE
Media 111 116 ORG

 126 127 GPE
Israels 132 139 GPE
U.S. 176 180 GPE

 189 190 GPE
Europe
 247 254 LOC

 312 313 GPE
U.S. 338 342 GPE

 374 375 GPE
U.S. 392 396 GPE
Israels 412 419 GPE

 437 438 GPE
Europeans 438 447 NORP

 501 502 GPE

 556 557 GPE
Austria 597 604 GPE

 622 623 GPE
Israeli 652 659 NORP

 685 686 GPE
Government 704 714 ORG
the Holocaust guilt
 729 749 ORG
Jews 782 786 NORP

 811 812 GPE

 851 852 GPE


In [15]:
temp = "Katy Perry has collaborated with Juicy J in the past in a concert in San Diego for Apple"
doc = nlp(temp)
for ent in doc.ents:
    print(ent, ent.start_char, ent.end_char, ent.label_)

Katy Perry 0 10 PERSON
Juicy J 33 40 PERSON
San Diego 69 78 GPE
Apple 83 88 ORG
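
The two integers printed for each entity are character offsets into the original string, so slicing the text with them recovers the entity. The sketch below checks this using the values copied from the output above, without needing spaCy. The `entities` list and the grouping by label are illustrative.

```python
from collections import defaultdict

temp = "Katy Perry has collaborated with Juicy J in the past in a concert in San Diego for Apple"

# (text, start_char, end_char, label) tuples copied from the output above.
entities = [
    ("Katy Perry", 0, 10, "PERSON"),
    ("Juicy J", 33, 40, "PERSON"),
    ("San Diego", 69, 78, "GPE"),
    ("Apple", 83, 88, "ORG"),
]

by_label = defaultdict(list)
for text, start, end, label in entities:
    # end_char is exclusive, so an ordinary slice returns the entity text.
    assert temp[start:end] == text
    by_label[label].append(text)

print(dict(by_label))
```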


<h2>TF-IDF</h2>


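The inverse document frequency of a term $t$ is

$\mathrm{idf}_{t} = \log\left(\frac{N}{n_t}\right)$

where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$, or, in smoothed form,

$\mathrm{idf}_{t} = \log\left(1 + \frac{N}{n_t}\right)$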
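
As a small worked example of the two formulas, the sketch below computes both variants for a toy corpus, taking $N$ as the number of documents and $n_t$ as the number of documents containing the term. The corpus and function names are made up for illustration, and the natural logarithm is an assumption, since the formulas do not fix a base. scikit-learn's `TfidfVectorizer` uses a slightly different smoothing by default, $\log\frac{1+N}{1+n_t} + 1$, so its numbers will not match these exactly.

```python
import math

def idf(n_docs, doc_freq):
    """Plain inverse document frequency: log(N / n_t)."""
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    """Smoothed variant from the text: log(1 + N / n_t)."""
    return math.log(1 + n_docs / doc_freq)

# Toy corpus: each document is represented by its set of distinct terms.
toy_docs = [
    {"media", "story", "biased"},
    {"media", "report"},
    {"media", "story"},
    {"government"},
]
n_docs = len(toy_docs)

for term in ["media", "story", "government"]:
    n_t = sum(term in doc for doc in toy_docs)  # documents containing the term
    print(term, n_t, round(idf(n_docs, n_t), 3), round(idf_smoothed(n_docs, n_t), 3))
```

A term that appears in every document gets a plain idf of exactly 0, whereas the smoothed variant still assigns it a small positive weight of $\log 2$.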



