# Natural Language Processing / Naive Bayes

## Objectives

- Make computers read text.

- Turn raw text into features we can feed into our classifiers.

- Classify email into ham and spam.

- Find out which documents match a search most closely.

- Build a Naive Bayes classifier.

## Natural Language Processing

<br><details><summary>
What could computers do if they could read?
</summary>

1. Filter out email spam.<br>
2. Scan resumes.<br>
3. Detect plagiarism.<br>
4. Classify software bugs into different categories.<br>
5. Cluster news like Google News.<br>
6. Find out which headlines will get the most clicks.<br>
7. Find out which ad copy will get the most clicks.<br>
</details>

<br><details><summary>
What is NLP?
</summary>

1. NLP transforms unstructured text into vectors.<br>
2. These vectors can be used for machine learning applications.<br>
3. E.g. classification, clustering, regression, recommender systems, etc.<br>
4. With NLP we can apply ML to text, not just to numeric data.<br>
</details>

<br><details><summary>
What is unstructured text? How is it different from structured text?
</summary>

1. Unstructured text is text without a schema.<br>
2. Structured text is text with a schema, e.g. CSV, JSON.<br>
3. A schema specifies column names and types for the data.<br>
4. Unstructured text has no columns, no specified types.<br>
5. It is usually just a blob of text.<br>
</details>

<br><details><summary>
What are some sources of unstructured text?
</summary>

1. Twitter, Facebook, social media.<br>
2. Legal documents.<br>
3. News stories, blogs, online comments.<br>
4. Voice recognition software, OCR, digitized documents.<br>
5. Bugs, code comments, documentation.<br>
</details>


## Text Classification

Suppose we want to classify email into spam and not spam. Consider
these two email messages.

### Email 1

> From: Joe    
> Subject: R0lex    
>     
> Want to buy cheap R0leXX watches?    

### Email 2

> From: Jim    
> Subject: Coffee    
>     
> Want to grab coffee at 4?    

<br><details><summary>
Which one of these is likely to be spam?
</summary>

Email 1.<br>
</details>


<br><details><summary>
Why?
</summary>

1. It contains words that are spammy.<br>
2. It contains misspellings.<br>
</details>

## Vectorizing Text

Now as humans we can figure this out pretty easily. Our brains have
amazing NLP. But we are not scalable. 

<br><details><summary>
How can we automate this process?
</summary>

1. Convert text into feature vectors.<br>
2. Train classifier on these feature vectors.<br>
3. Use output vectors `[1]` to mean spam, and `[0]` to mean not-spam.<br>
</details>

What is *vectorizing*?

- Vectorizing is converting unstructured raw text to feature vectors. 

Consider these two texts. 

- Text 1: `Want to buy cheap R0leXX watches?`

- Text 2: `Want to grab coffee at 4?`

<br><details><summary>
How can we vectorize them?
</summary>

1. Lowercase all words.<br>
2. Replace words with common base words.<br>
3. Count how many times each word occurs.<br>
4. Store as vector.<br>

</details>

## Vectorized Texts

Here is what the emails look like vectorized. This is also known as a
*bag of words*.

Index |Word     |Text 1   |Text 2
----- |----     |------   |------
0     |want     |1        |1
1     |to       |1        |1
2     |buy      |1        |0
3     |cheap    |1        |0
4     |r0lexx   |1        |0
5     |watches  |1        |0
6     |grab     |0        |1
7     |coffee   |0        |1
8     |at       |0        |1
9     |4        |0        |1
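
The table above can be reproduced with a few lines of standard-library Python. This is only a minimal sketch: the helper name `tokenize` and the choice to simply strip the trailing `?` are assumptions made for illustration, and a real pipeline would use a proper tokenizer.

```python
from collections import Counter

texts = [
    "Want to buy cheap R0leXX watches?",
    "Want to grab coffee at 4?",
]

def tokenize(text):
    # Hypothetical helper: lowercase, drop the question mark, split on whitespace.
    return text.lower().replace("?", "").split()

# Build the vocabulary in order of first appearance, as in the table.
vocabulary = []
for text in texts:
    for token in tokenize(text):
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per text, aligned with the vocabulary.
vectors = []
for text in texts:
    counts = Counter(tokenize(text))
    vectors.append([counts[word] for word in vocabulary])

for index, word in enumerate(vocabulary):
    print(index, word, vectors[0][index], vectors[1][index])
```

Each printed row corresponds to one row of the table: index, word, count in Text 1, count in Text 2.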

## Terms

Term         |Meaning
----         |-------
Corpus       |Collection of documents (collection of articles).
Document     |A single document (a tweet, an email, an article).
Vocabulary   |Set of words in your corpus, or maybe the entire English dictionary. 
Bag of Words |Vector representation of words in a document.
Token        |Single word.
Stop Words   |Common words that are ignored because they are not useful in distinguishing text.
Vectorizing  |Converting text into a bag-of-words.

## Universal Parts of Speech

<table border="1" class="docutils" id="tab-universal-tagset">
<colgroup>
<col width="11%">
<col width="27%">
<col width="62%">
</colgroup>
<thead valign="bottom">
<tr><th class="head">Tag</th>
<th class="head">Meaning</th>
<th class="head">English Examples</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="doctest"><span class="pre">ADJ</span></tt></td>
<td>adjective</td>
<td><span class="example">new, good, high, special, big, local</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">ADP</span></tt></td>
<td>adposition</td>
<td><span class="example">on, of, at, with, by, into, under</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">ADV</span></tt></td>
<td>adverb</td>
<td><span class="example">really, already, still, early, now</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">CONJ</span></tt></td>
<td>conjunction</td>
<td><span class="example">and, or, but, if, while, although</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">DET</span></tt></td>
<td>determiner, article</td>
<td><span class="example">the, a, some, most, every, no, which</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">NOUN</span></tt></td>
<td>noun</td>
<td><span class="example">year, home, costs, time, Africa</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">NUM</span></tt></td>
<td>numeral</td>
<td><span class="example">twenty-four, fourth, 1991, 14:24</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">PRT</span></tt></td>
<td>particle</td>
<td><span class="example">at, on, out, over per, that, up, with</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">PRON</span></tt></td>
<td>pronoun</td>
<td><span class="example">he, their, her, its, my, I, us</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">VERB</span></tt></td>
<td>verb</td>
<td><span class="example">is, say, told, given, playing, would</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">.</span></tt></td>
<td>punctuation marks</td>
<td><span class="example">. , ; !</span></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">X</span></tt></td>
<td>other</td>
<td><span class="example">ersatz, esprit, dunno, gr8, univeristy</span></td>
</tr>
</tbody>


</table>

## Building an NLP Pipeline

Let's build an NLP pipeline to turn unstructured text data into
something we can train a classifier on.

In [1]:
from pprint import pprint
import nltk
# nltk.download('all');

## Tokenizing

The first step is turning your raw text documents into lists of words.

In [2]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

document1 = 'Want to buy cheap R0leXX watches?'
document2 = 'Want to grab coffee at 4?'
document3 = 'What time works for you, for coffee after work?'
documents = [document1, document2, document3]
corpus = [word_tokenize(content.lower()) for content in documents]
print(corpus)

[['want', 'to', 'buy', 'cheap', 'r0lexx', 'watches', '?'], ['want', 'to', 'grab', 'coffee', 'at', '4', '?'], ['what', 'time', 'works', 'for', 'you', ',', 'for', 'coffee', 'after', 'work', '?']]


In [3]:
# or you can do it like this (or any number of ways!)
documents = " ".join([document1, document2, document3])
print([word_tokenize(content.lower()) for content in sent_tokenize(documents)])

[['want', 'to', 'buy', 'cheap', 'r0lexx', 'watches', '?'], ['want', 'to', 'grab', 'coffee', 'at', '4', '?'], ['what', 'time', 'works', 'for', 'you', ',', 'for', 'coffee', 'after', 'work', '?']]



## Question

<br><details><summary>
What are some use cases for sentence tokenizing?
</summary>

1. Classifying repetitive text that uses the same sentences.<br>
2. Classifying conversation between an airplane and a control tower.<br>
3. Classifying text generated by filling out a form.<br>
4. Breaking document down into sentences and treating each sentence as
   a document. For example, to give MPAA rating to a movie script.<br>
</details>

## Stop Words

NLTK has functionality for removing common words that show up in
almost every document.

In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [5]:
# Remove stop words
corpus = [[token for token in document if token not in stop_words] for document in corpus]
pprint(corpus)

[['want', 'buy', 'cheap', 'r0lexx', 'watches', '?'],
 ['want', 'grab', 'coffee', '4', '?'],
 ['time', 'works', ',', 'coffee', 'work', '?']]


## Punctuation
In general, we can drop punctuation. It does not jibe with the TF-IDF technique we're about to learn.

In [6]:
import string
punctuation = set(string.punctuation)
corpus = [[token for token in document if token not in punctuation] for document in corpus]
pprint(corpus)

[['want', 'buy', 'cheap', 'r0lexx', 'watches'],
 ['want', 'grab', 'coffee', '4'],
 ['time', 'works', 'coffee', 'work']]


## Question

<br><details><summary>
What might be some applications where you don't want to remove stop
words from your text?
</summary>

1. Plagiarism detection.<br>
2. Investigating if dusty papers found in attic are a lost Shakespeare play.<br>
3. Identifying document types in Data Loss Prevention. E.g. resume, 
   insider information, legal contract, etc.<br> 
</details>

## Stemming and Lemmatization

<br><details><summary>
How will our code treat words like "watch" and "watches"?
</summary>

It will treat them as different words.<br>
</details>

<br><details><summary>
How can we fix this?
</summary>

1. Remove inflectional endings and return base form of word (known as the lemma).<br>
2. This is known as stemming or lemmatization.<br>
</details>

What is the difference between stemming and lemmatization?

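- Stemming strips suffixes using heuristic rules, so the result need not be a dictionary word.

- Lemmatization maps a word to its dictionary form (the lemma), using a vocabulary.

- Converting *cars* to *car* can be done by either.

- Converting *better* to *good* requires lemmatization.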

- Behavior depends on your toolkit.

In Python's NLTK:

- Stemming converts *cars* to *car*, but does not convert *children*
  to *child*.

- Lemmatization converts both.

## Stemming and Lemmatization Computation

Stemming removes morphological affixes from words; lemmatization replaces words
with their "lemma", or base word ("good" is the lemma of "better").

- running -> run

- generously -> generous

- better -> good

- dogs -> dog

- mice -> mouse

Here is how to do this using NLTK. Don't try to write your own
functions for this.

In [7]:
from nltk.stem.porter   import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet  import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

In [8]:
print(snowball.stem('running'))
print(wordnet.lemmatize('mice'))
print(wordnet.lemmatize('geese'))

run
mouse
goose


In [9]:
corpus = [[wordnet.lemmatize(word) for word in document] for document in corpus]
corpus

[['want', 'buy', 'cheap', 'r0lexx', 'watch'],
 ['want', 'grab', 'coffee', '4'],
 ['time', 'work', 'coffee', 'work']]

## Vocabulary
Build your vocabulary from the set of tokens contained in your corpus.

In [10]:
# collect every distinct token across all documents
vocab_set = {token for tokens in corpus for token in tokens}
vocab = list(vocab_set)
vocab

['grab',
 'watch',
 'buy',
 '4',
 'time',
 'work',
 'want',
 'r0lexx',
 'coffee',
 'cheap']

## N-grams

<br><details><summary>
Consider "rolex watch" vs "let's watch the game this weekend". Which
one is spam?
</summary>

The first one.<br>
</details>

<br><details><summary>
Why use N-grams?
</summary>

1. Sometimes context makes a difference.<br>
2. You want to see the words before and after.<br>
3. N-grams are strings of consecutive words in your corpus.<br>
4. These are extra "features" in your data that contain more information than individual words.<br>
</details>

In [11]:
from nltk.util import ngrams, skipgrams
from pprint import pprint

# An n-gram is a sequence of n words
# when n = 2, we call this a bigram. when n=3, trigram, etc...
bigrams = [list(ngrams(sequence=document, n=2)) for document in corpus]

# Skip-grams are n-grams that allow tokens to be skipped.
# Use a separate name so the imported skipgrams function is not overwritten.
skip_grams = [list(skipgrams(sequence=document, n=2, k=1)) for document in corpus]

print('n-grams:')
print('-----------')
pprint(bigrams)

print('skip grams:')
print('-----------')
pprint(skip_grams)

n-grams:
-----------
[[('want', 'buy'), ('buy', 'cheap'), ('cheap', 'r0lexx'), ('r0lexx', 'watch')],
 [('want', 'grab'), ('grab', 'coffee'), ('coffee', '4')],
 [('time', 'work'), ('work', 'coffee'), ('coffee', 'work')]]
skip grams:
-----------
[[('want', 'buy'),
  ('want', 'cheap'),
  ('buy', 'cheap'),
  ('buy', 'r0lexx'),
  ('cheap', 'r0lexx'),
  ('cheap', 'watch'),
  ('r0lexx', 'watch')],
 [('want', 'grab'),
  ('want', 'coffee'),
  ('grab', 'coffee'),
  ('grab', '4'),
  ('coffee', '4')],
 [('time', 'work'),
  ('time', 'coffee'),
  ('work', 'coffee'),
  ('work', 'work'),
  ('coffee', 'work')]]
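
To make the output above less of a black box, here is a rough pure-Python sketch of what bigrams and skip-bigrams compute. The function names `bigrams` and `skip_bigrams` are hypothetical and this is not NLTK's implementation; it only mirrors the behavior seen above for `n=2`.

```python
def bigrams(tokens):
    # Pair every token with the token that immediately follows it.
    return list(zip(tokens, tokens[1:]))

def skip_bigrams(tokens, k):
    # Pair every token with each of the next k + 1 tokens,
    # i.e. allow up to k tokens to be skipped in between.
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

document = ['want', 'grab', 'coffee', '4']
print(bigrams(document))
print(skip_bigrams(document, k=1))
```

With `k=0` the skip-bigrams reduce to ordinary bigrams, which is a quick sanity check on the definition.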


## Bag-of-Words

How do you turn a document into a bag-of-words?

In [12]:
from collections import Counter
import numpy as np

def vectorize(doc, vocabulary):
    bag_of_words = Counter(doc.split(' '))
    doc_vector = np.zeros(len(vocabulary))
    for word_index, word in enumerate(vocabulary):
        if word in bag_of_words:
            doc_vector[word_index] += bag_of_words[word]
    return doc_vector

In [13]:
corpus = [' '.join(tokens) for tokens in corpus]
vectorized_documents = [vectorize(document, vocab) for document in corpus]
term_frequency_matrix = np.vstack(vectorized_documents)

In [14]:
print(["{:>6}".format(token) for token in vocab])
for row in term_frequency_matrix:
    print(["{:>6}".format(str(int(value))) for value in row])

['  grab', ' watch', '   buy', '     4', '  time', '  work', '  want', 'r0lexx', 'coffee', ' cheap']
['     0', '     1', '     1', '     0', '     0', '     0', '     1', '     1', '     0', '     1']
['     1', '     0', '     0', '     1', '     0', '     0', '     1', '     0', '     1', '     0']
['     0', '     0', '     0', '     0', '     1', '     2', '     0', '     0', '     1', '     0']


In [15]:
# lucky for us, Sklearn has built-in methods to do this efficiently.
from sklearn.feature_extraction.text import CountVectorizer
c = CountVectorizer(stop_words='english')
bag_of_words = c.fit(corpus)

feature_dict = bag_of_words.vocabulary_ # with numerical indices
print(feature_dict)

feature_list = bag_of_words.get_feature_names_out().tolist() # just the names, alphabetically (older scikit-learn: get_feature_names())
print(feature_list)

{'want': 6, 'buy': 0, 'cheap': 1, 'r0lexx': 4, 'watch': 7, 'grab': 3, 'coffee': 2, 'time': 5, 'work': 8}
['buy', 'cheap', 'coffee', 'grab', 'r0lexx', 'time', 'want', 'watch', 'work']


In [16]:
# finally we can convert this to a vector such that we can (soon) apply machine learning!
term_frequency_matrix = c.fit_transform(corpus).toarray()

print(["{:>6}".format(token) for token in feature_list])
for row in term_frequency_matrix:
    print(["{:>6}".format(str(value)) for value in row])

['   buy', ' cheap', 'coffee', '  grab', 'r0lexx', '  time', '  want', ' watch', '  work']
['     1', '     1', '     0', '     0', '     1', '     0', '     1', '     1', '     0']
['     0', '     0', '     1', '     1', '     0', '     0', '     1', '     0', '     0']
['     0', '     0', '     1', '     0', '     0', '     1', '     0', '     0', '     2']


## Bag-of-Words Limitations

<br><details><summary>
What are some limitations of the bag-of-words approach and
CountVectorizer?
</summary>

1. Longer documents weigh more than short documents.<br>
2. Does not consider uniqueness of words. Unique words should weigh more.<br>
3. We are losing a lot of structure. This is like a giant word grinder.<br>
4. We will address the first two issues. The third issue is part of the bargain we struck with the bag-of-words approach.<br>
5. Note: bag-of-words is not the only way to featurize text. It is simple, and surprisingly powerful.<br>
</details>

## L2 Normalization

<br><details><summary>
What is L2 normalization?
</summary>

1. Divide each vector by its L2-norm.<br>
2. Divide each vector by its magnitude.<br>
3. Divide each vector by square root of sum of squares of all elements.<br>
4. Makes long and short documents weigh the same.<br>
$$\frac{\vec{v}}{\|\vec{v}\|}$$<br>
$$\frac{\vec{v}}{\sqrt{\sum_i{v_i^2}}}$$
</details>

<br><details><summary>
What is L1 normalization? 
</summary>

Divide each vector by its L1-norm.<br>

$$\frac{\vec{v}}{\sum{|v_i|}}$$<br>
</details>

<br><details><summary>
What is L(n) normalization?
</summary>

Divide each vector by its L(n)-norm.<br>

$$\frac{\vec{v}}{\sqrt[n]{\sum{|v_i|^n}}}$$<br>
</details>
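
The three normalizations above differ only in the exponent, so they can be sketched with one function. This is an illustrative sketch: the name `lp_normalize` and the example vectors are made up, and returning an all-zero vector unchanged is an assumption about how to handle that edge case.

```python
def lp_normalize(vector, p):
    # Divide every element by the L(p)-norm: (sum of |v_i|^p)^(1/p).
    norm = sum(abs(value) ** p for value in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # an all-zero vector has no direction to preserve
    return [value / norm for value in vector]

short_doc = [1, 1, 1, 2]
long_doc = [3, 3, 3, 6]   # the same text repeated three times

print(lp_normalize(short_doc, p=2))
print(lp_normalize(long_doc, p=2))
print(lp_normalize(short_doc, p=1))
```

After L2 normalization the short and the long document end up with (numerically) the same vector, which is the sense in which long and short documents weigh the same.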

## Why L2?

<br><details><summary>
Why use the L2-norm?
</summary>

1. It makes dot products between vectors meaningful.<br>
2. We will see this later with cosine similarity.<br>
</details>

## Dot Product

<br><details><summary>
If two documents are highly similar what will the dot product of their
L2-normalized vectors be?
</summary>

The dot product will be close to 1. It is exactly 1 when the two vectors point in the same direction.<br>
</details>

<br><details><summary>
What does a dot product of 0 indicate?
</summary>

There is no similarity between the documents.<br>
</details>

<br><details><summary>
What does a dot product of -1 indicate?
</summary>

This is not possible since all vector values are zero or positive.<br>
</details>
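
As a small check of these answers, the sketch below takes the count vectors for Text 1 and Text 2 from the bag-of-words table, L2-normalizes them, and computes their dot products. The helper names `l2_normalize` and `dot` are hypothetical.

```python
import math

def l2_normalize(vector):
    # Divide every element by the vector's magnitude.
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

# Count vectors for Text 1 and Text 2 from the bag-of-words table.
text1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
text2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

same = dot(l2_normalize(text1), l2_normalize(text1))
different = dot(l2_normalize(text1), l2_normalize(text2))
print(round(same, 4), round(different, 4))
```

A document compared with itself scores 1, while the two emails share only two of their six words each and score 2/6, roughly 0.33. Because counts are never negative, the result cannot drop below 0.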

## TF-IDF Intuition

Intuitively, the idea of TF-IDF is this:

- Words that occur in every document are less useful than words that
  only occur in some documents.

- Instead of looking only at the term frequency for each word in a
  document, we want to scale up terms that are rare.

## Applications

<br><details><summary>
Which term is likely to be more significant in spam detection:
"hey" or "rolex", and why?
</summary>

1. "Rolex" is going to be more significant.<br>
2. Because this is a unique word that does not occur in a lot of documents.<br>
3. "Hey" is a lot more common and is less likely to be a useful feature.<br>
</details>

## Inverse Document Frequency

How can we increase the weight of tokens that are rare in our corpus
and decrease the weight of tokens that are common?

- Use Inverse Document Frequency.

- Inverse Document Frequency or IDF is a measure of how unique a term
  is. 
  
- So we want to weigh terms high if they are unique.

What is the formula for IDF?

Suppose:

- *t* is a token

- *d* is a document

- *D* is a corpus

- *N* is the total number of documents in *D*

- *n(t)* is the number of documents containing *t*

$$idf(t, d) = \log{\left(\frac{N}{n(t)}\right)}$$

## TF-IDF

What is TF-IDF?

- TF-IDF starts with TF, the (optionally normalized) token counts.
- Then it multiplies TF by IDF.

What is the formula for TF-IDF?

Suppose:

- *t* is a token

- *d* is a document

- *D* is a corpus

- *N* is the total number of documents in *D*

- *n(t)* is the number of documents containing *t*

Then, TF-IDF is:

$$tfidf(t,d) = tf(t,d) * idf(t,d)$$

What is TF?

- TF is the number of times that the token $t$ appears in $d$ (often
  normalized by dividing by the total length of $d$).

$$tf(t, d) = freq(t, d)$$

What is IDF?

- IDF is a score for how unique a token is across the corpus.

$$idf(t, d) = \log{\left(\frac{N}{n(t)}\right)}$$

- This is sometimes written with smoothing terms, like this:

$$idf(t, d) = \log{\left(\frac{N + 1}{n(t) + 1}\right)} + 1$$

## Adding Ones

<br><details><summary>
Why are we adding 1's?
</summary>

1. This is called smoothing.<br>
2. Adding 1 inside the log ensures that we never divide by 0.<br>
3. Adding 1 at the end ensures that *idf* is always non-zero.<br>
</details>
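
The effect of the added ones can be seen by comparing the two formulas directly. This is a minimal sketch assuming the natural logarithm; the function names `idf` and `smoothed_idf` and the corpus size `N = 3` are chosen only for illustration.

```python
import math

def idf(n_docs, doc_freq):
    # Unsmoothed IDF: log(N / n(t)); fails when doc_freq is 0.
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF: log((N + 1) / (n(t) + 1)) + 1.
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

N = 3
for doc_freq in (0, 1, 2, 3):
    print(doc_freq, round(smoothed_idf(N, doc_freq), 4))
```

A term that appears in every document gets an unsmoothed IDF of 0, which would wipe out its TF-IDF score entirely; the smoothed version gives it 1 instead. A term that appears in no document would cause a division by zero in the unsmoothed formula.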


## Example

Consider a very small library with these books:

- Hadoop Handbook (HH)
- Beekeeping Bible (BB)

Here are the word frequencies.

Terms  |HH   |BB
-----  |--   |--
hadoop |100  |0
bees   |0    |150
hive   |20   |50

<br><details><summary>
Intuitively what do you expect the IDF scores of hadoop, bees, and
hive to be?
</summary>

1. Hadoop and bees should have a higher score.<br>
2. Hive should have a low score because it is not rare.<br>
</details>

<br><details><summary>
What are the IDF scores of hadoop, bees, and hive? Assume log base 2.
</summary>

1. Note these are independent of document.<br>
2. For hadoop: $N = 2, n = 1, \log_2(N/n) = \log_2(2) = 1$.<br>
3. For bees: $N = 2, n = 1, \log_2(N/n) = \log_2(2) = 1$.<br>
4. For hive: $N = 2, n = 2, \log_2(N/n) = \log_2(1) = 0$.<br>
</details>
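
The same arithmetic can be sketched in a few lines of Python. The dictionary layout and variable names are assumptions for illustration; the counts come from the table above, and raw counts are used as TF.

```python
import math

# Word counts per book, taken from the table above.
counts = {
    'HH': {'hadoop': 100, 'bees': 0, 'hive': 20},
    'BB': {'hadoop': 0, 'bees': 150, 'hive': 50},
}

n_docs = len(counts)
terms = ['hadoop', 'bees', 'hive']

# n(t): number of books in which the term occurs at least once.
doc_freq = {t: sum(1 for book in counts.values() if book[t] > 0) for t in terms}

# Unsmoothed IDF with log base 2, as in the question above.
idf = {t: math.log2(n_docs / doc_freq[t]) for t in terms}

# TF-IDF with raw counts as TF.
tfidf = {book: {t: counts[book][t] * idf[t] for t in terms} for book in counts}

print(idf)
print(tfidf)
```

Because "hive" occurs in both books, its IDF is 0 and its TF-IDF score vanishes in both, even though it is a frequent word in each.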

## Computing  TF-IDF

In [17]:
import numpy as np
from sklearn.preprocessing import normalize


# count the number of documents in which each token appears (document frequency)
document_freq = np.sum(term_frequency_matrix > 0, axis=0)
print('Document Frequency: {}'.format(document_freq))

# N is the number of documents
N = term_frequency_matrix.shape[0]
print('Number of Documents: {}'.format(N))

# Divide each row by its L2 norm
L2_rows = np.sqrt(np.sum(term_frequency_matrix**2, axis=1)).reshape(N, 1)
term_frequency_matrix = term_frequency_matrix / L2_rows
print('L2 Normalization: \n{}'.format(L2_rows))

# Smoothed IDF: the added ones avoid division by zero and keep every IDF value non-zero
idf = np.log(float(N+1) / (1.0 + document_freq)) + 1.0
print('Inverse Document Frequency: \n{}'.format(idf))

tfidf = np.multiply(term_frequency_matrix, idf)
tfidf = normalize(tfidf, norm='l2', axis=1)
print('Term Frequency-Inverse Document Frequency: \n{}'.format(tfidf))

Document Frequency: [1 1 2 1 1 1 2 1 1]
Number of Documents: 3
L2 Normalization: 
[[ 2.23606798]
 [ 1.73205081]
 [ 2.44948974]]
Inverse Document Frequency: 
[ 1.69314718  1.69314718  1.28768207  1.69314718  1.69314718  1.69314718
  1.28768207  1.69314718  1.69314718]
Term Frequency-Inverse Document Frequency: 
[[ 0.46735098  0.46735098  0.          0.          0.46735098  0.
   0.35543247  0.46735098  0.        ]
 [ 0.          0.          0.51785612  0.68091856  0.          0.
   0.51785612  0.          0.        ]
 [ 0.          0.          0.32200242  0.          0.          0.42339448
   0.          0.          0.84678897]]


## TF-IDF with Sklearn
Instead of all that code above, you can just do this:

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
z = TfidfTransformer(norm='l2')
print(z.fit_transform(term_frequency_matrix).toarray())

[[ 0.46735098  0.46735098  0.          0.          0.46735098  0.
   0.35543247  0.46735098  0.        ]
 [ 0.          0.          0.51785612  0.68091856  0.          0.
   0.51785612  0.          0.        ]
 [ 0.          0.          0.32200242  0.          0.          0.42339448
   0.          0.          0.84678897]]


## Putting It All Together

In [19]:
# Using newsgroups data set, let's fetch all of it.
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

# Subset the data, this dataset is huge.
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

In [20]:
# let's take a peek at the data
random_index = np.random.randint(len(newsgroups_train['data']))
newsgroups_train['data'][random_index]

'From: schaefer@sal-sun121.usc.edu (Peter Schaefer)\nSubject: Re: Why not give $1 billion to first year-long moon residents?\nOrganization: University of Southern California, Los Angeles, CA\nLines: 29\nDistribution: world\nNNTP-Posting-Host: sal-sun121.usc.edu\n\n\nIn article <1993Apr19.130503.1@aurora.alaska.edu>, nsmca@aurora.alaska.edu writes:\n|> In article <6ZV82B2w165w@theporch.raider.net>, gene@theporch.raider.net (Gene Wright) writes:\n|> > With the continuin talk about the "End of the Space Age" and complaints \n|> > by government over the large cost, why not try something I read about \n|> > that might just work.\n|> > \n|> > Announce that a reward of $1 billion would go to the first corporation \n|> > who successfully keeps at least 1 person alive on the moon for a year. \n|> > Then you\'d see some of the inexpensive but not popular technologies begin \n|> > to be developed. THere\'d be a different kind of space race then!\n|> > \n|> > --\n|> >   gene@theporch.raider.net (G

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words=stopwords.words('english'), norm='l2')
print(tfidf.fit_transform(newsgroups_train['data']).toarray())

[[ 0.         0.         0.        ...,  0.         0.         0.       ]
 [ 0.         0.         0.        ...,  0.         0.         0.       ]
 [ 0.         0.         0.        ...,  0.         0.         0.       ]
 ..., 
 [ 0.         0.         0.        ...,  0.         0.         0.       ]
 [ 0.         0.0476151  0.        ...,  0.         0.         0.       ]
 [ 0.         0.         0.        ...,  0.         0.         0.       ]]


## Cosine Similarity

We need a way to compare our documents. Use the cosine similarity metric!

In [22]:
doc0 = 'the cat in the hat'
doc1 = 'the cat ate my hat'
doc2 = 'the cat in the hat the cat in the hat the cat in the hat'
doc3 = 'kale is on sale'
corpus = [doc0, doc1, doc2, doc3]

vocab = sorted(list(set([word for doc in corpus for word in doc.split(' ')])))
term_frequency_matrix = np.vstack([vectorize(doc, vocab) for doc in corpus])

In [23]:
print(["{:>6}".format(token) for token in vocab])
for row in term_frequency_matrix:
    print(["{:>6}".format(str(int(value))) for value in row])

['   ate', '   cat', '   hat', '    in', '    is', '  kale', '    my', '    on', '  sale', '   the']
['     0', '     1', '     1', '     1', '     0', '     0', '     0', '     0', '     0', '     2']
['     1', '     1', '     1', '     0', '     0', '     0', '     1', '     0', '     0', '     1']
['     0', '     3', '     3', '     3', '     0', '     0', '     0', '     0', '     0', '     6']
['     0', '     0', '     0', '     0', '     1', '     1', '     0', '     1', '     1', '     0']


In [24]:
import itertools
from sklearn.metrics.pairwise import cosine_similarity

indices = list(range(term_frequency_matrix.shape[0]))
combinations = list(itertools.combinations(indices, 2))
similarity_dict = {}
for pair in combinations:
    similarity = cosine_similarity(
        term_frequency_matrix[pair[0]].reshape(1,-1), 
        term_frequency_matrix[pair[1]].reshape(1,-1)
    )
    similarity_dict.update({'{}'.format(pair) : '{:.4f}'.format(similarity.squeeze().round(4))})

similarity_dict

{'(0, 1)': '0.6761',
 '(0, 2)': '1.0000',
 '(0, 3)': '0.0000',
 '(1, 2)': '0.6761',
 '(1, 3)': '0.0000',
 '(2, 3)': '0.0000'}
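
The value for the pair (0, 1) can be checked by hand from the rows of the term frequency matrix above. The two documents share *cat*, *hat* and *the*, so only those terms contribute to the dot product:

$$\cos(d_0, d_1) = \frac{d_0 \cdot d_1}{\|d_0\| \, \|d_1\|} = \frac{1 \cdot 1 + 1 \cdot 1 + 2 \cdot 1}{\sqrt{1 + 1 + 1 + 4} \, \sqrt{1 + 1 + 1 + 1 + 1}} = \frac{4}{\sqrt{35}} \approx 0.6761$$

The other entries follow the same way: document 2 is document 0 repeated three times, so the two vectors point in the same direction and the similarity is 1, while document 3 shares no words with the others and its similarity to each of them is 0.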

## Hashing Trick

One of the limitations of CountVectorizer is that the vectors it
produces can be very large. 

<br><details><summary>
How can we fix this?
</summary>

1. Use `HashingVectorizer`.<br>
2. Hash words to collapse the vector.<br>
3. Vector still retains enough uniqueness to be useful.<br>
</details>
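
The idea behind the hashing trick can be sketched with the standard library. Note the assumptions: `hashed_counts` is a hypothetical helper, and `hashlib.md5` is only a stand-in hash chosen because it is deterministic across runs. scikit-learn's `HashingVectorizer` uses a signed MurmurHash3 instead, and by default it also alternates signs and L2-normalizes each row, so its numbers will differ from this sketch.

```python
import hashlib

def hashed_counts(tokens, n_features):
    # Map each token to one of n_features buckets and count per bucket.
    # No vocabulary is stored; the hash function decides the column.
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        bucket = int(digest, 16) % n_features
        vector[bucket] += 1
    return vector

doc = 'the cat in the hat'.split(' ')
vector = hashed_counts(doc, n_features=10)
print(vector)
```

The vector length is fixed up front no matter how many distinct words the corpus contains. The price is that two different words can land in the same bucket, and the original word can no longer be recovered from a column index.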

## Hashing Vectorizer

In [25]:
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)
features = hv.transform(corpus)
print(features.toarray())

[[ 0.          0.          0.          0.          0.33333333  0.          0.
   0.         -0.66666667  0.66666667]
 [ 0.4472136   0.4472136   0.          0.          0.4472136   0.          0.
   0.         -0.4472136   0.4472136 ]
 [ 0.          0.          0.          0.          0.33333333  0.          0.
   0.         -0.66666667  0.66666667]
 [ 0.         -0.5         0.          0.          0.          0.5         0.
   0.5        -0.5         0.        ]]


## Summarize

<br><details><summary>
What are some steps in vectorizing text?
</summary>

1. Tokenize.<br>
2. Stemming, lemmatization, lowercasing, etc.<br>
3. Count frequencies.<br>
4. Modify feature weights using TF-IDF.<br>
5. Divide by L2 norm.<br>
6. Use hashing trick.<br>
</details>