# Text featurization: turning a bunch of words into a vector of numbers

### Objectives:
- think about how to featurize text
- make a dumb count vector
- make a text processing pipeline
  - split documents into words or "tokens"
  - drop "filler" or "stop" words
  - drop punctuation, capitalization
- make a word frequency vector
- make a word frequency / document frequency (TF-IDF) vector

## Natural Language Processing
- So far we've been handed a lot of data that looks like big tables of numbers: many features and a target. All the machine learning techniques we've learned so far treat each data point as a *vector in feature space*: every data point has some numerical extension along each feature axis. And each algorithm looks for some pattern in that feature space, minimizing some loss function. 
- BUT WHAT ABOUT TEXT. Text is made of words, letters, strange characters... a poem is not a vector...


- ...OR IS IT?
- Let's imagine we have many texts (for example, email subject lines) and some corresponding labels (for example, spam or not spam). How can we turn each **text** into a **vector of features** about the text (for use in a machine learning algorithm)?

### Some common NLP tasks
- Information retrieval: How do you find a document or a particular fact within a document?
- Document classification: What category of documents does the text belong to?
- Machine translation: How do you write an English phrase in Chinese?
- Sentiment analysis: Was a product review positive or negative? 

Natural language processing is a huge field and we will just touch on some of the concepts.

### Some terminology
- **token**: the smallest meaningful unit of text. Usually, "token" means "word".
- **document**: a group of words (tokens). Depending on the problem at hand, this could be a tweet, or a sentence, or a paragraph, or an article.
- **corpus**: a set of documents. 
- **vocabulary**: the set of unique words that appear in the corpus.
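
To make these four terms concrete, here is a minimal sketch on a made-up two-sentence corpus. The sentences and the variable names are only for illustration, and "token" here just means "whitespace-separated word".

```python
# A tiny made-up corpus: two documents.
corpus = ["the cat sat", "the dog sat down"]

# Each document becomes a list of tokens (here: whitespace-separated words).
documents = [doc.split() for doc in corpus]

# The vocabulary is the set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in documents for token in doc})

print(documents)
print(vocabulary)
```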

## FIRST TRY: Word Count Vectors

Our corpus: these three sentences

In [1]:
corpus = ["Jeff stole my octopus sandwich.", 
    "'My hammy!' I sobbed, sandwichlessly.", 
    "'Drop the sandwiches!' said the sandwich police."]

Here's the simplest possible approach: a vector for a document (sentence) should encode the word count in that document for each word in the corpus vocabulary.

First, let's split the documents into words (tokens).

In [2]:
tokenized_documents = [doc.split() for doc in corpus]
tokenized_documents

[['Jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'My", "hammy!'", 'I', 'sobbed,', 'sandwichlessly.'],
 ["'Drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

Now let's get the vocabulary:

In [3]:
vocab_set = set()
for doc in tokenized_documents:
    vocab_set.update(set(doc))
vocabulary = sorted(vocab_set)

In [4]:
vocabulary

["'Drop",
 "'My",
 'I',
 'Jeff',
 "hammy!'",
 'my',
 'octopus',
 'police.',
 'said',
 'sandwich',
 'sandwich.',
 "sandwiches!'",
 'sandwichlessly.',
 'sobbed,',
 'stole',
 'the']

Now make a word count vector for each sentence:

In [5]:
import numpy as np
import pandas as pd

In [7]:
X = np.zeros(shape=(len(corpus), len(vocabulary)))

In [8]:
for i, doc in enumerate(tokenized_documents):
    for word in doc:
        j = vocabulary.index(word)
        X[i,j] += 1

In [9]:
pd.DataFrame(data=X, columns=vocabulary)

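|  | 'Drop | 'My | I | Jeff | hammy!' | my | octopus | police. | said | sandwich | sandwich. | sandwiches!' | sandwichlessly. | sobbed, | stole | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 |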
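
One side note on the counting loop: `vocabulary.index(word)` searches the whole list every time it is called, which gets slow once the vocabulary is large. A possible alternative, sketched here with plain Python lists and a `Counter`, is to build a word-to-column dictionary once. The names `column_of` and `count_rows` are made up for this example.

```python
from collections import Counter

corpus = ["Jeff stole my octopus sandwich.",
          "'My hammy!' I sobbed, sandwichlessly.",
          "'Drop the sandwiches!' said the sandwich police."]
tokenized = [doc.split() for doc in corpus]

vocabulary = sorted({word for doc in tokenized for word in doc})

# Map each word to its column once, instead of searching the list every time.
column_of = {word: j for j, word in enumerate(vocabulary)}

count_rows = []
for doc in tokenized:
    row = [0] * len(vocabulary)
    # Counter tallies how many times each token occurs in this document.
    for word, n in Counter(doc).items():
        row[column_of[word]] = n
    count_rows.append(row)

print(count_rows[2])
```

The counts are the same as in the table above, just stored as integers in nested lists rather than floats in a NumPy array.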


## Anything unsatisfactory about the above?

## Let's make a better text processing pipeline:
- split each document into tokens
- lowercase & drop punctuation
- drop **stop words** (filler, like "the")
- reduce words to their root form using a huge dictionary of rules that many generations of linguists have compiled
  - **stemming**: simple, context-free rules like turning *sandwiches* into *sandwich*
  - **lemmatization**: edge cases and part-of-speech based rules like turning *better* into *good*
- now turn your corpus of cleaned, tokenized documents into an array of feature vectors
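
To get a feel for what "simple, context-free rules" means, here is a deliberately crude toy stemmer. It is a made-up illustration, not the Snowball algorithm used later, and the function name `toy_stem` and its two suffix rules are assumptions chosen for this sketch.

```python
def toy_stem(word):
    """Strip a couple of English plural endings. A toy, NOT the Snowball algorithm."""
    for suffix in ("es", "s"):
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["sandwiches", "sandwich", "police", "octopus"]:
    print(word, "->", toy_stem(word))
```

The rule gets *sandwiches* right but also chops *octopus* down to *octopu*, because it knows nothing about the word itself. Real stemmers use many more rules, and lemmatizers look words up, to avoid this kind of mistake.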

In [10]:
import string
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

In [11]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/moses/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/moses/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/moses/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### step 1: split each document into a list of tokens

In [12]:
tokenized_docs = [doc.split() for doc in corpus]
tokenized_docs

[['Jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'My", "hammy!'", 'I', 'sobbed,', 'sandwichlessly.'],
 ["'Drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

### step 2: lowercase, lose punctuation

In [13]:
tokenized_docs_lowered = [[word.lower() for word in doc]
                          for doc in tokenized_docs]

In [14]:
tokenized_docs_lowered

[['jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'my", "hammy!'", 'i', 'sobbed,', 'sandwichlessly.'],
 ["'drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
punct = set(string.punctuation)

#### exercise: get rid of all punctuation in a string

In [17]:
word = 'mr.peanut!'

In [18]:
def remove_symbols(word, symbol_set):
    return ''.join([char for char in word if char not in symbol_set])

In [19]:
cleaned_docs = [[remove_symbols(word.lower(), punct) for word in doc] 
                for doc in tokenized_docs_lowered]

In [20]:
cleaned_docs

[['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['my', 'hammy', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]
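
Another way to do the same exercise, sketched here as an alternative to the list comprehension, is `str.translate` with a table built by `str.maketrans`. The helper name `remove_punctuation` is made up for this example. Note that `string.punctuation` only covers ASCII symbols, so characters like curly quotes would pass through untouched with either approach.

```python
import string

# Translation table that deletes every ASCII punctuation character.
drop_punct = str.maketrans('', '', string.punctuation)

def remove_punctuation(word):
    return word.translate(drop_punct)

print(remove_punctuation('mr.peanut!'))    # mrpeanut
print(remove_punctuation("sandwiches!'"))  # sandwiches
```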

### step 3: remove stop words

In [21]:
stop_words = set(nltk.corpus.stopwords.words('english'))

In [22]:
docs_no_stops = [[word for word in doc if word not in stop_words] 
                 for doc in cleaned_docs]

In [23]:
docs_no_stops

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']]

### step 4: stemming / lemmatization

In [24]:
lemmer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

In [25]:
lemmer.lemmatize('sandwiches')

'sandwich'

In [26]:
stemmer.stem("sandwiches")

'sandwich'

In [27]:
docs_lemmatized = [[lemmer.lemmatize(word) for word in doc]
                  for doc in docs_no_stops]
docs_lemmatized

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

In [28]:
docs_stemmed = [[stemmer.stem(word) for word in doc]
                  for doc in docs_no_stops]
docs_stemmed

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammi', 'sob', 'sandwichless'],
 ['drop', 'sandwich', 'said', 'sandwich', 'polic']]

### We can combine all these steps into a function

In [29]:
def our_text_pipeline(doc, stops=None, lemmatize=False):
    doc = doc.lower().split()
    punct = set(string.punctuation)
    tokens = [''.join([char for char in tok if char not in punct]) 
              for tok in doc]
    if stops:
        tokens = [tok for tok in tokens if (tok not in stops)]
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
    return tokens

In [30]:
[our_text_pipeline(doc, stops=stop_words, lemmatize=True) for doc in corpus]

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

`nltk` has a `word_tokenize` function that's a bit more sophisticated than `doc.split()`: it splits punctuation off into separate tokens. Feel free to incorporate it into your pipeline. See the [documentation](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize) for more details.

In [31]:
from nltk.tokenize import word_tokenize

In [32]:
word_tokenize(corpus[0])

['Jeff', 'stole', 'my', 'octopus', 'sandwich', '.']
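
If you want a rough idea of how splitting punctuation into its own tokens can work, here is a small regular-expression sketch. It is not how `word_tokenize` is implemented, and the function name `simple_tokenize` is made up; it only happens to give the same result on this sentence.

```python
import re

def simple_tokenize(text):
    # A run of word characters is one token; any other non-space character
    # (punctuation) becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Jeff stole my octopus sandwich."))
```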

## Count Vectors

OK, the work above cleaned up our text.

In [33]:
corpus

['Jeff stole my octopus sandwich.',
 "'My hammy!' I sobbed, sandwichlessly.",
 "'Drop the sandwiches!' said the sandwich police."]

In [34]:
docs_cleaned = [our_text_pipeline(doc, stops=stop_words, lemmatize=True) 
              for doc in corpus]
docs_cleaned

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

Now let's turn our corpus into vectors of word counts. First we make a vocabulary.

In [35]:
vocab_set = set()
for doc in docs_cleaned:
    vocab_set.update(doc)
    
vocab = sorted(vocab_set)
vocab

['drop',
 'hammy',
 'jeff',
 'octopus',
 'police',
 'said',
 'sandwich',
 'sandwichlessly',
 'sobbed',
 'stole']

Then we count up the occurrences of each word in each document:

In [40]:
X = np.zeros(shape=(len(corpus), len(vocab)))

for i, doc in enumerate(docs_cleaned):
    for word in doc:
        j = vocab.index(word)
        X[i,j] += 1
        
count_vectors = pd.DataFrame(data=X, columns=vocab)
count_vectors

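|  | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 |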


## Term frequency (TF) vectors
If you are comparing documents of different lengths, the *frequency* of a word in a document may be more useful than the raw count.


$$TF_{word,document} = \frac{\#\_of\_times\_word\_appears\_in\_document}{total\_\#\_of\_words\_in\_document}$$

In [48]:
tf_vectors = (count_vectors.apply(lambda row: row/row.sum(), 
                                 axis=1)
                           .round(2)
             )
tf_vectors

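|  | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.25 |
| 1 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 |
| 2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 |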
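
The TF formula can also be checked by hand without pandas. This is a minimal sketch that hard-codes the cleaned documents from above; the helper name `term_frequencies` is made up for the example.

```python
from collections import Counter

docs_cleaned = [['jeff', 'stole', 'octopus', 'sandwich'],
                ['hammy', 'sobbed', 'sandwichlessly'],
                ['drop', 'sandwich', 'said', 'sandwich', 'police']]

def term_frequencies(doc):
    # (times the word appears in the document) / (total words in the document)
    counts = Counter(doc)
    total = len(doc)
    return {word: n / total for word, n in counts.items()}

tf = [term_frequencies(doc) for doc in docs_cleaned]
print(tf[2]['sandwich'])  # 2 of the 5 tokens in the third document -> 0.4
```

Each document's frequencies add up to 1, which is what makes documents of different lengths comparable.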


## Document frequency for each word
$$ DF_{word} = \frac{\#\_of\_documents\_containing\_word}{total\_\#\_of\_documents} $$

In [57]:
doc_count = np.zeros(shape=(len(vocab),))

for i,word in enumerate(vocab):
    for doc in docs_cleaned:
        if word in doc:
            doc_count[i] += 1

doc_freq = pd.Series(data=doc_count, index=vocab)/len(corpus)
doc_freq.round(2)

drop              0.33
hammy             0.33
jeff              0.33
octopus           0.33
police            0.33
said              0.33
sandwich          0.67
sandwichlessly    0.33
sobbed            0.33
stole             0.33
dtype: float64

## Inverse document frequency
$$ IDF_{word} = \log\left(\frac{total\_\#\_of\_documents}{\#\_of\_documents\_containing\_word}\right) $$

In [60]:
inverse_doc_freq = np.log(1/doc_freq)
inverse_doc_freq

drop              1.098612
hammy             1.098612
jeff              1.098612
octopus           1.098612
police            1.098612
said              1.098612
sandwich          0.405465
sandwichlessly    1.098612
sobbed            1.098612
stole             1.098612
dtype: float64
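
Here is the same DF and IDF calculation as a plain-Python sketch, using `math.log` (natural log, like `np.log`). The cleaned documents are hard-coded and the dictionary names are made up for the example. Notice that a word appearing in every document would get an IDF of log(1) = 0, so it would contribute nothing to the final vector.

```python
import math

docs_cleaned = [['jeff', 'stole', 'octopus', 'sandwich'],
                ['hammy', 'sobbed', 'sandwichlessly'],
                ['drop', 'sandwich', 'said', 'sandwich', 'police']]
n_docs = len(docs_cleaned)
vocab = sorted({word for doc in docs_cleaned for word in doc})

# Number of documents that contain each word (repeats within a document count once).
doc_count = {word: sum(1 for doc in docs_cleaned if word in doc) for word in vocab}

# IDF = log(total documents / documents containing the word)
idf = {word: math.log(n_docs / doc_count[word]) for word in vocab}

print(round(idf['sandwich'], 6))  # 0.405465
print(round(idf['jeff'], 6))      # 1.098612
```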

## TF-IDF

Vocabulary:
```
['drop', 'hammy', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']
```
TF * IDF:

In [61]:
tf_vectors

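|  | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.25 |
| 1 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 |
| 2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 |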


In [63]:
inverse_doc_freq

drop              1.098612
hammy             1.098612
jeff              1.098612
octopus           1.098612
police            1.098612
said              1.098612
sandwich          0.405465
sandwichlessly    1.098612
sobbed            1.098612
stole             1.098612
dtype: float64

In [64]:
tf_vectors*inverse_doc_freq

|  | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.274653 | 0.274653 | 0.0 | 0.0 | 0.101366 | 0.0 | 0.0 | 0.274653 |
| 1 | 0.0 | 0.362542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.362542 | 0.362542 | 0.0 |
| 2 | 0.219722 | 0.0 | 0.0 | 0.0 | 0.219722 | 0.219722 | 0.162186 | 0.0 | 0.0 | 0.0 |
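
One thing to watch: `tf_vectors` was rounded to two decimals before this multiplication, so the second row uses 0.33 rather than 1/3 and comes out as 0.362542. The sketch below recomputes TF * IDF end to end in plain Python without the intermediate rounding. The cleaned documents are hard-coded and the helper name `tfidf_vector` is made up for the example.

```python
import math
from collections import Counter

docs_cleaned = [['jeff', 'stole', 'octopus', 'sandwich'],
                ['hammy', 'sobbed', 'sandwichlessly'],
                ['drop', 'sandwich', 'said', 'sandwich', 'police']]
n_docs = len(docs_cleaned)
vocab = sorted({word for doc in docs_cleaned for word in doc})

# IDF = log(total documents / documents containing the word)
idf = {word: math.log(n_docs / sum(1 for doc in docs_cleaned if word in doc))
       for word in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)  # a Counter returns 0 for words not in the document
    return [counts[word] / len(doc) * idf[word] for word in vocab]

tfidf = [tfidf_vector(doc) for doc in docs_cleaned]
print(round(tfidf[1][vocab.index('hammy')], 6))     # 0.366204 (unrounded TF of 1/3)
print(round(tfidf[2][vocab.index('sandwich')], 6))  # 0.162186
```

The difference is small here, but in general it is safer to round only when displaying results.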


Now that we have turned our DOCUMENTS into VECTORS, we can put them into whatever machine learning algorithm we want! We can use whatever kind of similarity measure we please!

Wow!
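
As one example of a similarity measure, here is a minimal cosine similarity sketch in plain Python. It copies the TF * IDF rows printed above (columns in vocabulary order); the function name `cosine_similarity` is made up for this example and this is not the scipy or sklearn implementation.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# TF * IDF rows from the table above, one row per document.
tfidf = [
    [0.0, 0.0, 0.274653, 0.274653, 0.0, 0.0, 0.101366, 0.0, 0.0, 0.274653],
    [0.0, 0.362542, 0.0, 0.0, 0.0, 0.0, 0.0, 0.362542, 0.362542, 0.0],
    [0.219722, 0.0, 0.0, 0.0, 0.219722, 0.219722, 0.162186, 0.0, 0.0, 0.0],
]

print(cosine_similarity(tfidf[0], tfidf[1]))  # 0.0: the documents share no words
print(cosine_similarity(tfidf[0], tfidf[2]))  # small but positive: both mention "sandwich"
```

Documents 0 and 2 only overlap on *sandwich*, which has a low IDF because it appears in two of the three documents, so their similarity stays small.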

In [39]:
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 