# Lesson 9 Text Processing

# Setup

Run this command from an Anaconda prompt:

```
conda install nltk spacy scikit-learn pandas
```


# Topics
- Overview
- Parsing, Stemming, Lemmatization
- Named Entity Recognition
- Stop Words
- Frequency Analysis
- Document Summarization

# Toolkits

### NLTK: NLP toolkit

Book: http://www.nltk.org/book/

Wiki: https://github.com/nltk/nltk/wiki

Corpus: http://www.nltk.org/nltk_data/

### spaCy: another NLP toolkit

Simpler to use than NLTK, but usually with fewer configuration options

API: https://spacy.io/api/

Models: https://spacy.io/usage/models

Tutorial: https://spacy.io/usage/spacy-101

# What is Text Processing?

- A sub-field of Natural Language Processing (NLP)
- Natural Language Processing is ...
 - Teaching machines to understand and produce language (text, speech)
 - A combination of computer science and computational linguistics

# Text Processing Tasks

- Word categorization and tagging: part of speech, type of entity
- Semantic Analysis: finding meanings of documents
- Topic Modeling: finding topics from documents
- Document similarity: checking whether two documents are semantically similar
- etc.

Note: Speech processing combines text processing with an acoustic model

# Parsing, Stemming & Lemmatization

- Tokenization: splitting text into words
- Sentence boundary detection: splitting text into sentences
- Stemming: finding word stems
   - stating => state, reference => refer
- Lemmatization: finding the base form of words
   - was => be

## Tokenization

- Segmenting text into words, punctuation, etc.
- Rule-based
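To make "rule-based" concrete, here is a minimal sketch of a tokenizer built from a single regular-expression rule. The pattern and the name `simple_tokenize` are assumptions made for illustration; spaCy and NLTK use far richer rule sets (exceptions, prefixes, suffixes), so their results can differ.

```python
import re

# Hypothetical rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens using one regex rule."""
    return TOKEN_PATTERN.findall(text)

sample = "This is a test. A quick brown fox jumps over the lazy dog."
tokens = simple_tokenize(sample)
print(tokens)
```

A single rule like this breaks `5.30am` into three tokens, which is one reason real tokenizers carry many special cases.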

In [3]:
# Download the English model
# You can find other models here: https://spacy.io/models/en
!python -m spacy download en_core_web_sm


    Error: Couldn't link model to 'en_core_web_sm'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    C:\Users\yungh\Anaconda3\envs\sa48\lib\site-packages\en_core_web_sm -->
    C:\Users\yungh\Anaconda3\envs\sa48\lib\site-packages\spacy\data\en_core_web_sm


    Creating a shortcut link for 'en' didn't work (maybe you don't have
    admin permissions?), but you can still load the model via its full
    package name: nlp = spacy.load('{name}')
    Download successful but linking failed



In [4]:
text = u"This is a test. A quick brown fox jumps over the lazy dog."

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

# sentence tokenizer
for sent in doc.sents:
    print()
    print(sent)


This is a test.

A quick brown fox jumps over the lazy dog.


In [6]:
# word tokenizer
for token in doc:
    print(token.text)

This
is
a
test
.
A
quick
brown
fox
jumps
over
the
lazy
dog
.


In [7]:
spacy.explain('DET')

'determiner'

https://spacy.io/api/token

https://spacy.io/api/token#attributes

In [8]:
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

displacy.render(doc, style='dep', jupyter=True, options={'distance': 140})

### Tokenization with NLTK

http://www.nltk.org/api/nltk.tokenize.html

nltk.tokenize
 - sent_tokenize
 - word_tokenize
 - wordpunct_tokenize


In [9]:
# Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html

# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [10]:
from nltk.tokenize import sent_tokenize

# list of sentences
sent_tokenize(text)

['This is a test.', 'A quick brown fox jumps over the lazy dog.']

In [11]:
from nltk.tokenize import word_tokenize

# flat list of words and punctuation
word_tokenize(text)

['This',
 'is',
 'a',
 'test',
 '.',
 'A',
 'quick',
 'brown',
 'fox',
 'jumps',
 'over',
 'the',
 'lazy',
 'dog',
 '.']

In [12]:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)

# list of lists
[word_tokenize(sentence) for sentence in sentences]

[['This', 'is', 'a', 'test', '.'],
 ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']]

In [13]:
from nltk.tokenize import wordpunct_tokenize

text2 = "'The time is now 5.30am,' he said."

print(word_tokenize(text2))

print(wordpunct_tokenize(text2))

["'The", 'time', 'is', 'now', '5.30am', ',', "'", 'he', 'said', '.']
["'", 'The', 'time', 'is', 'now', '5', '.', '30am', ",'", 'he', 'said', '.']


In [14]:
# Part of speech tagging
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
sentences = [word_tokenize(sentence) for sentence in sentences]

[nltk.pos_tag(word) for word in sentences]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('.', '.')],
 [('A', 'DT'),
  ('quick', 'JJ'),
  ('brown', 'NN'),
  ('fox', 'NN'),
  ('jumps', 'VBZ'),
  ('over', 'IN'),
  ('the', 'DT'),
  ('lazy', 'JJ'),
  ('dog', 'NN'),
  ('.', '.')]]

In [15]:
spacy.explain('JJ')

'adjective'

#### Twitter-aware tokenizer

`nltk.tokenize.TweetTokenizer`

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

In [16]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

tknzr.tokenize(tweet)

['This',
 'is',
 'a',
 'cooool',
 '#dummysmiley',
 ':',
 ':-)',
 ':-P',
 '<3',
 'and',
 'some',
 'arrows',
 '<',
 '>',
 '->',
 '<--']

In [17]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
tweet = '@remy: This is waaaaayyyy too much for you!!!!!!'

tknzr.tokenize(tweet)

[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

## Stemming vs. Lemmatization

- Stemming uses rule-based heuristics
  - ponies => poni
  - Faster, but less precise
- Lemmatization uses vocabulary and morphological analysis
  - ponies => pony
  - For English, usually only a modest improvement over stemming, because the context in which a word is used matters more than its exact base form

https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
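The difference between the two approaches can be sketched in a few lines. This is a toy illustration only: the vocabulary `LEMMA_EXCEPTIONS` and both function names are hypothetical, and real lemmatizers rely on much larger dictionaries plus part-of-speech information.

```python
# Hypothetical mini-vocabulary mapping inflected forms to their lemma.
LEMMA_EXCEPTIONS = {"was": "be", "is": "be", "ponies": "pony"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the lowercased word."""
    lowered = word.lower()
    return LEMMA_EXCEPTIONS.get(lowered, lowered)

def toy_stem(word):
    """Apply one suffix rule ('ies' => 'i') without consulting any vocabulary."""
    lowered = word.lower()
    return lowered[:-3] + "i" if lowered.endswith("ies") else lowered

for word in ["ponies", "was"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer never needs a dictionary, which makes it fast, but it cannot map `was` to `be`.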

## Porter Stemmer

- 5 sequential phases of word reductions
- Applies rules such as "sses => ss", "ies => i"
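As a rough sketch of what such rules look like in code, the function below applies only the plural-stripping rules from the first Porter phase. The name `porter_step_1a` is chosen for this example; the full algorithm (as in `nltk.stem.PorterStemmer`) has several more phases and extra conditions on the stem.

```python
def porter_step_1a(word):
    """Apply the plural-stripping suffix rules in order; the first match wins."""
    if word.endswith("sses"):
        return word[:-2]   # sses => ss   (caresses => caress)
    if word.endswith("ies"):
        return word[:-2]   # ies  => i    (ponies => poni)
    if word.endswith("ss"):
        return word        # ss   => ss   (caress is left unchanged)
    if word.endswith("s"):
        return word[:-1]   # s    => ''   (cats => cat)
    return word

for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "=>", porter_step_1a(example))
```

Rule order matters here: testing for `s` before `sses` would turn `caresses` into `caresse`.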


### Stemming & Lemmatization with spaCy

`spacy.lemmatizer.Lemmatizer`

https://spacy.io/api/lemmatizer

In [18]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

nlp = spacy.load('en_core_web_sm')
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

doc = nlp(text)

for token in doc:
    print(lemmatizer(token.text, token.pos_))

['this']
['be']
['a']
['test']
['.']
['a']
['quick']
['brown']
['fox']
['jump']
['over']
['the']
['lazy']
['dog']
['.']


### Stemming & Lemmatization with NLTK

`nltk.stem`
- `PorterStemmer`
- `WordNetLemmatizer`

http://www.nltk.org/api/nltk.stem.html

In [19]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

tokens = word_tokenize(text)

for token in tokens:
    print(stemmer.stem(token))

thi
is
a
test
.
A
quick
brown
fox
jump
over
the
lazi
dog
.


In [20]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

tokens = word_tokenize(text)

for token in tokens:
    print(lemmatizer.lemmatize(token))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


This
is
a
test
.
A
quick
brown
fox
jump
over
the
lazy
dog
.


## Named Entity Recognition

- Find and classify entities within text
  - Persons
  - Organizations
  - Locations
  - Time expressions
  - Quantities
  - Phone numbers
  - etc
  
- Grammar-based models, trained classifiers

- Can be corpus-dependent, see https://spacy.io/api/annotation#named-entities
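A grammar-based recognizer can be as simple as a hand-written pattern per entity type. The sketch below covers only clock times; the pattern `TIME_RULE` and the function name are assumptions for illustration, and trained classifiers such as spaCy's handle far more variation.

```python
import re

# Hypothetical grammar rule for clock times such as "4pm" or "10:30 am".
TIME_RULE = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)

def find_time_entities(text):
    """Return (text, label, start_char, end_char) for every match of TIME_RULE."""
    return [(m.group(), "TIME", m.start(), m.end())
            for m in TIME_RULE.finditer(text)]

sample = "Flight 224 is scheduled to arrive in Frankfurt at 4pm July 5th, 2018."
entities = find_time_entities(sample)
print(entities)
```

Each result carries the same kind of information as a spaCy entity: the matched text, a label, and character offsets into the original string.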

### Named Entity Recognition with spaCy

https://spacy.io/api/annotation#named-entities

In [22]:
nlp = spacy.load('en_core_web_sm')

text3 = u"Flight 224 is scheduled to arrive in Frankfurt at 4pm July 5th, 2018."
doc = nlp(text3)

for entity in doc.ents:
    print(entity.text, '==', entity.label_,  '==', entity.start_char,  '==', entity.end_char)

Flight 224 == PRODUCT == 0 == 10
Frankfurt == GPE == 37 == 46
4pm == TIME == 50 == 53
July 5th, 2018 == DATE == 54 == 68


In [23]:
text3[54:68]

'July 5th, 2018'

In [26]:
spacy.explain('NORP')
spacy.explain('GPE')

'Countries, cities, states'

In [25]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

### Named Entity Recognition with NLTK

```
nltk.ne_chunk()
```

https://www.nltk.org/book/ch07.html

In [27]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text3)
sentences = [word_tokenize(sentence) for sentence in sentences]

# Input to ne_chunk needs to be a part-of-speech tagged word
sentences_pos_tagged = [nltk.pos_tag(word) for word in sentences]

[nltk.ne_chunk(word_pos) for word_pos in sentences_pos_tagged]

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


[Tree('S', [('Flight', 'NNP'), ('224', 'CD'), ('is', 'VBZ'), ('scheduled', 'VBN'), ('to', 'TO'), ('arrive', 'VB'), ('in', 'IN'), Tree('GPE', [('Frankfurt', 'NNP')]), ('at', 'IN'), ('4pm', 'CD'), ('July', 'NNP'), ('5th', 'CD'), (',', ','), ('2018', 'CD'), ('.', '.')])]

## Stop words

Stop words are high-frequency words that don't contribute much lexical content:

- the
- a
- to

NLP libraries usually include a corpus of stop words.

Stop word lists:
- http://www.nltk.org/book/ch02.html#stopwords_index_term
- https://www.semantikoz.com/blog/free-stop-word-lists-in-23-languages/
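Filtering stop words amounts to a set-membership test per token. The sketch below uses a deliberately tiny, hypothetical list (`TOY_STOP_WORDS`); the library lists referenced above contain hundreds of entries.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
TOY_STOP_WORDS = {"the", "a", "to", "is", "in", "at"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "Flight 224 is scheduled to arrive in Frankfurt at 4pm".split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Using a `set` keeps each lookup constant-time, which matters once the list and the corpus grow.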

### Stop words with spaCy

`spacy.lang.en.stop_words`

`token.is_stop`

In [28]:
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS

{'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'further',
 'get',
 'give',
 'g

In [29]:
# Deutsch
from spacy.lang.de.stop_words import STOP_WORDS

STOP_WORDS

{'a',
 'ab',
 'aber',
 'ach',
 'acht',
 'achte',
 'achten',
 'achter',
 'achtes',
 'ag',
 'alle',
 'allein',
 'allem',
 'allen',
 'aller',
 'allerdings',
 'alles',
 'allgemeinen',
 'als',
 'also',
 'am',
 'an',
 'andere',
 'anderen',
 'andern',
 'anders',
 'auch',
 'auf',
 'aus',
 'ausser',
 'ausserdem',
 'außer',
 'außerdem',
 'bald',
 'bei',
 'beide',
 'beiden',
 'beim',
 'beispiel',
 'bekannt',
 'bereits',
 'besonders',
 'besser',
 'besten',
 'bin',
 'bis',
 'bisher',
 'bist',
 'da',
 'dabei',
 'dadurch',
 'dafür',
 'dagegen',
 'daher',
 'dahin',
 'dahinter',
 'damals',
 'damit',
 'danach',
 'daneben',
 'dank',
 'dann',
 'daran',
 'darauf',
 'daraus',
 'darf',
 'darfst',
 'darin',
 'darum',
 'darunter',
 'darüber',
 'das',
 'dasein',
 'daselbst',
 'dass',
 'dasselbe',
 'davon',
 'davor',
 'dazu',
 'dazwischen',
 'daß',
 'dein',
 'deine',
 'deinem',
 'deiner',
 'dem',
 'dementsprechend',
 'demgegenüber',
 'demgemäss',
 'demgemäß',
 'demselben',
 'demzufolge',
 'den',
 'denen',
 'denn

In [30]:
doc = nlp(text3)

for token in doc:
    print(token.text, token.is_stop)

Flight False
224 False
is True
scheduled False
to True
arrive False
in True
Frankfurt False
at True
4 False
pm False
July False
5th False
, False
2018 False
. False


In [33]:
# Adding stop words
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS.add('SA48')

doc = nlp(u"Sorry I'm not free tonite, I have SA48 (lowercase: mldds).")

for token in doc:
    print(token.text,  '==', token.is_stop)

Sorry == False
I == False
'm == False
not == True
free == False
tonite == False
, == False
I == False
have == True
SA48 == True
( == False
lowercase == False
: == False
mldds == False
) == False
. == False


### Stop words with NLTK

```
nltk.corpus.stopwords
```

In [34]:
# Download corpus
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yungh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [35]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [36]:
tokens = nltk.word_tokenize(text3)

stops = set(stopwords.words('english'))

for token in tokens:
    print(token, token in stops)

Flight False
224 False
is True
scheduled False
to True
arrive False
in True
Frankfurt False
at True
4pm False
July False
5th False
, False
2018 False
. False


In [38]:
# Adding stop words
stops = stopwords.words('english')
stops.append("SA48")
stops = set(stops)

tokens = nltk.word_tokenize(u"Sorry I'm not free tonite, I have SA48 (lowercase: mldds).")

for token in tokens:
    print(token,  '==', token in stops)

Sorry == False
I == False
'm == False
not == True
free == False
tonite == False
, == False
I == False
have == True
SA48 == True
( == False
lowercase == False
: == False
mldds == False
) == False
. == False


# Frequency Analysis

Answers two questions:

1. How often does a word appear in a document?

2. How important is a word in a document?

Measure: Term Frequency - Inverse Document Frequency (TF-IDF)

## Term Frequency

Most common formula:

$$\frac{f_{t, d}}{\sum_{t' \in d} \, f_{t',d}}$$

$f_{t, d}$: count of term $t$ in document $d$

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
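The term-frequency formula can be computed directly with `collections.Counter`. This is a minimal sketch that assumes the document is already tokenized; the function name `term_frequency` is chosen for this example.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count of term / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

doc_tokens = "the early bird gets the worm".split()
tf = term_frequency(doc_tokens)
print(tf["the"], tf["worm"])
```

Because every count is divided by the same total, the values for one document always sum to 1.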

## Inverse Document Frequency

Most common formula:

$$\log\frac{N}{\mid\{d \in D : t \in d \}\mid}$$

$N$: number of documents

$\mid\{d \in D : t \in d \}\mid$: number of documents containing term $t$
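A direct translation of this formula, assuming each document is a list of tokens, might look like the sketch below. The function name and the choice to raise an error for unseen terms are assumptions of this example; libraries typically smooth the denominator instead.

```python
import math

def inverse_document_frequency(term, documents):
    """log(N / number of documents containing term); documents are token lists."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log(len(documents) / containing)

docs = [
    "this is a test".split(),
    "the quick brown fox jumps over the lazy dog".split(),
    "the early bird gets the worm".split(),
]
idf_the = inverse_document_frequency("the", docs)
idf_worm = inverse_document_frequency("worm", docs)
print(idf_the, idf_worm)
```

A term found in two of the three documents gets a lower value than a term found in only one.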

## TF-IDF

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

|term|tf|idf|tf-idf|
|--|--|--|--|
|to|large|very small|close to 0|
|coffee|small|large|further from 0|
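The sketch below multiplies the two quantities for a small, made-up corpus in which `to` appears in every document and `coffee` in only one. The corpus and the function name `tf_idf` are hypothetical; returning 0 for unseen terms is a choice made for this example.

```python
import math
from collections import Counter

def tf_idf(term, doc, documents):
    """tf(t, d) * idf(t, D), using the unsmoothed formulas from this section."""
    tf = Counter(doc)[term] / len(doc)
    containing = sum(1 for d in documents if term in d)
    idf = math.log(len(documents) / containing) if containing else 0.0
    return tf * idf

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "i want to drink coffee to wake up".split(),
    "we need to go to work".split(),
    "try to sleep".split(),
]
score_to = tf_idf("to", docs[0], docs)
score_coffee = tf_idf("coffee", docs[0], docs)
print(score_to, score_coffee)
```

Although `to` is the most frequent term in the first document, its idf is log(3/3) = 0, so its tf-idf collapses to 0 while `coffee` keeps a positive score.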

## Computing TF-IDF

#### Scikit-learn:

```
sklearn.feature_extraction.text.CountVectorizer

sklearn.feature_extraction.text.TfidfVectorizer
```
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html


#### NLTK
Also supports TF-IDF, but is less commonly used for this purpose
```
nltk.text.TextCollection
```

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

In [39]:
text5 = u"This is a test.\n" \
    u"The quick brown fox jumps over the lazy dog.\n" \
    u"The early bird gets the worm.\n"

#### Computing Word Counts

In [40]:
# http://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')
doc = nlp(text5)
sentences = [sent.text for sent in doc.sents]

# Count word occurrences
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()

X_dense

matrix([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
        [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 2, 0, 0],
        [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1]], dtype=int64)

In [41]:
vectorizer.get_feature_names()

['bird',
 'brown',
 'dog',
 'early',
 'fox',
 'gets',
 'is',
 'jumps',
 'lazy',
 'over',
 'quick',
 'test',
 'the',
 'this',
 'worm']

In [42]:
# display as a dataframe
import pandas as pd

df_wc = pd.DataFrame(X_dense, columns=vectorizer.get_feature_names())
df_wc

| |bird|brown|dog|early|fox|gets|is|jumps|lazy|over|quick|test|the|this|worm|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|0|0|0|0|0|0|1|0|0|0|0|1|0|1|0|
|1|0|1|1|0|1|0|0|1|1|1|1|0|2|0|0|
|2|1|0|0|1|0|1|0|0|0|0|0|0|2|0|1|


#### Computing TF-IDF

In [43]:
# http://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is a combination of
#   CountVectorizer + TfidfTransformer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()

print(X_dense.shape)
print(vectorizer.get_feature_names())
X_dense

(3, 15)
['bird', 'brown', 'dog', 'early', 'fox', 'gets', 'is', 'jumps', 'lazy', 'over', 'quick', 'test', 'the', 'this', 'worm']


matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.57735027, 0.        , 0.        , 0.        ,
         0.        , 0.57735027, 0.        , 0.57735027, 0.        ],
        [0.        , 0.32767345, 0.32767345, 0.        , 0.32767345,
         0.        , 0.        , 0.32767345, 0.32767345, 0.32767345,
         0.32767345, 0.        , 0.49840822, 0.        , 0.        ],
        [0.39798027, 0.        , 0.        , 0.39798027, 0.        ,
         0.39798027, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.60534851, 0.        , 0.39798027]])

In [44]:
# for each sentence, get the highest tf-idf
import numpy as np

terms = vectorizer.get_feature_names()
tfidf_arr = np.array(X_dense)

for i, sentence in enumerate(sentences):
    print(sentence)
    # term indices sorted by descending tf-idf
    sorted_idx = np.argsort(tfidf_arr[i])[::-1]
    for j in sorted_idx:
        print(terms[j], tfidf_arr[i][j])
    print()

This is a test.

this 0.5773502691896257
test 0.5773502691896257
is 0.5773502691896257
worm 0.0
the 0.0
quick 0.0
over 0.0
lazy 0.0
jumps 0.0
gets 0.0
fox 0.0
early 0.0
dog 0.0
brown 0.0
bird 0.0

The quick brown fox jumps over the lazy dog.

the 0.4984082163022241
quick 0.3276734545947569
over 0.3276734545947569
lazy 0.3276734545947569
jumps 0.3276734545947569
fox 0.3276734545947569
dog 0.3276734545947569
brown 0.3276734545947569
worm 0.0
this 0.0
test 0.0
is 0.0
gets 0.0
early 0.0
bird 0.0

The early bird gets the worm.

the 0.6053485081062917
worm 0.3979802707840827
gets 0.3979802707840827
early 0.3979802707840827
bird 0.3979802707840827
this 0.0
test 0.0
quick 0.0
over 0.0
lazy 0.0
jumps 0.0
is 0.0
fox 0.0
dog 0.0
brown 0.0
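The scores printed above do not follow the plain formulas from the earlier sections, because `TfidfVectorizer` by default uses raw counts, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2 normalisation of each row. The sketch below reproduces the numbers in plain Python under those assumptions (default settings, including a token pattern that drops one-letter words such as "a"); treat it as an illustration rather than a description of scikit-learn's internals.

```python
import math
import re
from collections import Counter

sentences = [
    "This is a test.",
    "The quick brown fox jumps over the lazy dog.",
    "The early bird gets the worm.",
]

# Lowercase and keep tokens of two or more word characters (so "a" is dropped).
docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_tfidf(doc):
    """count * (ln((1 + N) / (1 + df)) + 1), followed by L2 normalisation."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for sentence, doc in zip(sentences, docs):
    scores = smoothed_tfidf(doc)
    print(sentence, {t: round(v, 4) for t, v in scores.items()})
```

The "+ 1" in the smoothed idf is why `the` still receives a non-zero weight, and its count of 2 is why it ends up as the top term in the second and third sentences.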



## Exercise

1. Get 3-5 of your own sample sentences
2. Compute the TF-IDF
3. Compute the TF-IDF with stop_words filtered out:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

```
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a set; recent scikit-learn versions expect a list here
vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))

...

```

## N-grams

TF-IDF can be applied to N-grams (N words at a time), to try to capture some context information.

```
CountVectorizer(ngram_range=(minN, maxN), ...)

TfidfVectorizer(ngram_range=(minN, maxN), ...)
```
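For intuition, an N-gram is just a sliding window of N consecutive tokens. The helper below is a minimal plain-Python sketch (the name `make_ngrams` is chosen for this example); combining the results for N = 1 and N = 2 corresponds to the idea behind `ngram_range=(1, 2)`.

```python
def make_ngrams(tokens, n):
    """Return the list of n-token tuples obtained by sliding a window over tokens."""
    # zip stops at the shortest slice, so incomplete windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the quick brown fox".split()
unigrams_and_bigrams = make_ngrams(words, 1) + make_ngrams(words, 2)
print(unigrams_and_bigrams)
```

Each extra N adds more features, so the vocabulary grows quickly as the range widens.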

In [45]:
text5 = u"This is a test.\n" \
    u"The quick brown fox jumps over the lazy dog.\n" \
    u"The early bird gets the worm.\n"

In [46]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

nlp = spacy.load('en_core_web_sm')
doc = nlp(text5)
sentences = [sent.text for sent in doc.sents]

# Count word occurrences using 1 and 2-grams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# convert sparse matrix to dense matrix
X_dense = X.todense()
print(X_dense.shape)

pd.DataFrame(X_dense, columns=vectorizer.get_feature_names())

(3, 30)


| |bird|bird gets|brown|brown fox|dog|early|early bird|fox|fox jumps|gets|...|quick brown|test|the|the early|the lazy|the quick|the worm|this|this is|worm|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|0|0|0|0|0|0|0|0|0|0|...|0|1|0|0|0|0|0|1|1|0|
|1|0|0|1|1|1|0|0|1|1|0|...|1|0|2|0|1|1|0|0|0|0|
|2|1|1|0|0|0|1|1|0|0|1|...|0|0|2|1|0|0|1|0|0|1|


## Exercise: TF-IDF with Trigrams

- Compute the TF-IDF for trigrams (1 to 3-grams), using your sample text.
- Try with and without stop words included

In [None]:
# Your code here















### NLTK N-gram support

You can also split text into trigrams and bigrams using NLTK.

In [47]:
from nltk import bigrams, trigrams, ngrams, word_tokenize

# http://www.taleswithmorals.com/aesop-fable-the-ant-and-the-grasshopper.htm
text6 = "In a field one summer's day a Grasshopper was hopping about, " \
        "chirping and singing to its heart's content."

words = word_tokenize(text6)

print(list(bigrams(words)))

[('In', 'a'), ('a', 'field'), ('field', 'one'), ('one', 'summer'), ('summer', "'s"), ("'s", 'day'), ('day', 'a'), ('a', 'Grasshopper'), ('Grasshopper', 'was'), ('was', 'hopping'), ('hopping', 'about'), ('about', ','), (',', 'chirping'), ('chirping', 'and'), ('and', 'singing'), ('singing', 'to'), ('to', 'its'), ('its', 'heart'), ('heart', "'s"), ("'s", 'content'), ('content', '.')]


In [48]:
print(list(trigrams(words)))

[('In', 'a', 'field'), ('a', 'field', 'one'), ('field', 'one', 'summer'), ('one', 'summer', "'s"), ('summer', "'s", 'day'), ("'s", 'day', 'a'), ('day', 'a', 'Grasshopper'), ('a', 'Grasshopper', 'was'), ('Grasshopper', 'was', 'hopping'), ('was', 'hopping', 'about'), ('hopping', 'about', ','), ('about', ',', 'chirping'), (',', 'chirping', 'and'), ('chirping', 'and', 'singing'), ('and', 'singing', 'to'), ('singing', 'to', 'its'), ('to', 'its', 'heart'), ('its', 'heart', "'s"), ('heart', "'s", 'content'), ("'s", 'content', '.')]


In [49]:
print(list(ngrams(words, 4)))

[('In', 'a', 'field', 'one'), ('a', 'field', 'one', 'summer'), ('field', 'one', 'summer', "'s"), ('one', 'summer', "'s", 'day'), ('summer', "'s", 'day', 'a'), ("'s", 'day', 'a', 'Grasshopper'), ('day', 'a', 'Grasshopper', 'was'), ('a', 'Grasshopper', 'was', 'hopping'), ('Grasshopper', 'was', 'hopping', 'about'), ('was', 'hopping', 'about', ','), ('hopping', 'about', ',', 'chirping'), ('about', ',', 'chirping', 'and'), (',', 'chirping', 'and', 'singing'), ('chirping', 'and', 'singing', 'to'), ('and', 'singing', 'to', 'its'), ('singing', 'to', 'its', 'heart'), ('to', 'its', 'heart', "'s"), ('its', 'heart', "'s", 'content'), ('heart', "'s", 'content', '.')]


# NLP Datasets

http://www.nltk.org/nltk_data/

https://github.com/niderhoff/nlp-datasets

# Workshop: Paragraph Summarization

- Download a corpus from NLTK
- Split the corpus into paragraphs
- Compute a TF-IDF score for each word in a paragraph as a measure of its "importance"
- Rank each sentence using (sum of TF-IDF(words) / number of tokens)
- Extract the top N highest scoring sentences and return them as our "summary"

Credits: https://github.com/charlieg/A-Smattering-of-NLP-in-Python
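Before working with a real corpus, the whole pipeline can be sketched in plain Python on three made-up paragraphs. Everything here is an assumption for illustration: the helper names, the naive regex tokenizer and sentence splitter, and the unsmoothed TF-IDF formula. The workshop itself uses NLTK and `TfidfVectorizer`, which will give different numbers.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercased alphabetic tokens (a stand-in for a real tokenizer and stemmer)."""
    return re.findall(r"[a-z]+", text.lower())

def summarize(paragraph, all_paragraphs, top_n=1):
    """Rank the sentences of paragraph by average TF-IDF and return the top_n."""
    docs = [words(p) for p in all_paragraphs]
    doc_freq = Counter(t for d in docs for t in set(d))
    counts = Counter(words(paragraph))
    total = sum(counts.values())
    # TF-IDF per word, treating the paragraph as the document;
    # words that never occur in all_paragraphs are skipped (score 0).
    tfidf = {t: (c / total) * math.log(len(docs) / doc_freq[t])
             for t, c in counts.items() if doc_freq[t]}
    scored = []
    # Naive sentence split: '.', '!' or '?' followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
        tokens = words(sentence)
        if tokens:
            score = sum(tfidf.get(t, 0.0) for t in tokens) / len(tokens)
            scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_n]]

# Hypothetical three-paragraph corpus
corpus_paragraphs = [
    "The weather is fine. Farmers expect a record wheat harvest this year.",
    "The weather is cold. The market is closed.",
    "The team is ready. The weather is fine for the match.",
]
summary = summarize(corpus_paragraphs[0], corpus_paragraphs)
print(summary)
```

The sentence about the weather scores low because its words occur in every paragraph, so the harvest sentence is selected as the summary.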

### Download a corpus

Select a corpus from http://www.nltk.org/nltk_data/. 

Suggestions:
- reuters
- abc
- gutenberg

Example
```
# download the corpus you selected
import nltk
nltk.download('abc')

# update the import with the corpus you selected
from nltk.corpus import abc as corpus
```

In [None]:
# Your code here













### Explore the corpus

For example, try printing the raw text of one of the files:

```
fileids = corpus.fileids()

print(fileids)

print(corpus.raw(fileids[0]))

```

In [None]:
# Your code here











### Split the text into paragraphs

The NLTK tokenizers used so far split text into sentences and words, not paragraphs, so we'll write our own paragraph tokenizer.

One heuristic that may work:
- a paragraph boundary is detected wherever there are consecutive newline characters

Adapt this function to your corpus, and adjust the logic if necessary to get paragraphs.

```
def tokenize_paragraph(text):
    """Tokenizes text into paragraphs
    Args:
        text - the raw text
    Returns:
        A list of paragraphs for the raw text
    """
    # Note: you may need to customize this logic for the 
    # corpus you selected
    return [p for p in text.split('\n\n') if p]

# test code
paragraphs = tokenize_paragraph(corpus.raw(fileids[0]))
print(paragraphs[:5])
```
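If a corpus separates paragraphs with blank lines that contain stray spaces or tabs, or uses Windows line endings, splitting on the exact string of two newlines will miss some boundaries. The variant below is one possible adjustment using a regular expression; the function name and the pattern are assumptions for this sketch, so check the result against your own corpus.

```python
import re

def tokenize_paragraph_regex(text):
    """Split on blank lines, tolerating spaces or tabs on them and Windows line endings."""
    # Two or more consecutive line breaks, each optionally followed by spaces/tabs
    parts = re.split(r"(?:\r?\n[ \t]*){2,}", text)
    return [p.strip() for p in parts if p.strip()]

sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n \n\nThird paragraph."
paragraphs = tokenize_paragraph_regex(sample)
print(paragraphs)
```

A single line break inside a paragraph is preserved, so wrapped lines stay together.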

In [None]:
# Your code here













### Collect all paragraphs in your corpus

Using the paragraph tokenizer, create a list containing all paragraphs for all the files in the corpus.

We will use this list to fit the TF-IDF vectorizer.

Some starter code:

```
all_paragraphs = []
for fileid in corpus.fileids():
    text = corpus.raw(fileid)

    # Split text into paragraphs and add to all_paragraphs
    # You can use the syntax list1 = list1 + list2
    #
    # Your code here
    ...
    

# test code
print(len(all_paragraphs))
print(all_paragraphs[:5])
```

In [None]:
# Your code here












In [50]:
def tokenize_and_stem(text):
    """Helper function to tokenize and stem words in text
    Arg:
        text: the input text
    Return:
        the tokenized stem words
    """
    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []

    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)

    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

### Compute TF-IDF

- Treating a paragraph as a document, compute the TF-IDF using TfidfVectorizer
- Pass `tokenize_and_stem` as a tokenizer to TfidfVectorizer
- Filter out stop words in TfidfVectorizer
- `fit_transform` the TfidfVectorizer (this may take about a minute or two)

```
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english')
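# Fit on all_paragraphs and keep the result in a variable named `tfidf`
# (the cells that follow refer to it)
...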

```

In [None]:
# Your code here









### Explore the TF-IDF matrix

Explore the TF-IDF matrix, counting the terms, documents, and printing the first few terms

```
# On scikit-learn older than 1.0, use vectorizer.get_feature_names() instead
feature_names = list(vectorizer.get_feature_names_out())

# number of terms
print("Number of terms:", len(feature_names))

# number of documents (paragraphs)
print("Number of paragraphs:", tfidf.shape[0])

# first 20 terms
print(feature_names[:20])

```

In [99]:
# Your code here








### Paragraph Summarization

- Pick a random paragraph
- Tokenize the paragraph into sentences
- Rank each sentence by getting the average word score for it
- Extract the top N highest scoring sentences and return them as our "summary"

In [None]:
import random

# Get a random index for all_paragraphs
paragraph_index = random.randint(0, len(all_paragraphs)-1)
paragraph = all_paragraphs[paragraph_index]

In [None]:
# Tokenize the selected paragraph into sentences
# for each sentence, compute the sum of TF-IDF divided by tokens
sentence_scores = []
for sentence in sent_tokenize(paragraph):
    tfidf_sum = 0

    sent_tokens = tokenize_and_stem(sentence)
    feature_tokens = [t for t in sent_tokens if t in feature_names]

    for ft in feature_tokens:
        tfidf_sum += tfidf[paragraph_index, feature_names.index(ft)]
    
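    # guard against sentences without alphabetic tokens (avoids division by zero)
    sentence_score = tfidf_sum / len(sent_tokens) if sent_tokens else 0.0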
    sentence_scores.append((sentence_score, sentence))

In [None]:
# Get the top-N scores and create the summary

# in case the paragraph has fewer than 2 sentences
n = min(2, len(sentence_scores))

# sort by sentence_score
sentence_scores.sort(key=lambda x: x[0], reverse=True)

print('*** SUMMARY ***')
for summary_sentence in sentence_scores[:n]:
    print(summary_sentence[1], '(score: %.2f)' % summary_sentence[0])

print('\n*** ORIGINAL ***')
print(paragraph)

print('\n*** SENTENCE SCORES ***')
for (score, sentence) in sentence_scores:
    print(score, sentence)