# Text featurization: turning a bunch of words into a vector of numbers

### Objectives:
- think about how to featurize text
- make a dumb count vector
- make a text processing pipeline
  - split documents into words or "tokens"
  - drop "filler" or "stop" words
  - drop punctuation, capitalization
- make a word frequency vector
- make a word frequency / document frequency (TF-IDF) vector
- compare documents with cosine similarity

## Natural Language Processing
- So far we've been handed a lot of data that looks like big tables of numbers: many features and a target. All the machine learning techniques we've learned so far treat each data point as a *vector in feature space*: every data point has a numerical value along each feature axis. Each algorithm then looks for some pattern in that feature space, typically by minimizing some loss function.
- BUT WHAT ABOUT TEXT? Text is made of words, letters, strange characters like 🐘... A poem is not a vector... or is it?
- Let's imagine we have many texts (for example, email subject lines) and some corresponding labels (for example, spam or not spam). How can we turn each **text** into a **vector of features** about that text (for use in a machine learning algorithm)?

### Some common NLP tasks
- Information retrieval: How do you find a document or a particular fact within a document?
- Document classification: What category of documents does the text belong to?
- Machine translation: How do you write an English phrase in Chinese?
- Sentiment analysis: Was a product review positive or negative? 

Natural language processing is a huge field and we will just touch on some of the concepts.

### Some terminology
- **token**: the smallest meaningful unit of text. Usually, "token" means "word".
- **document**: a group of words (tokens). Depending on the problem at hand, this could be a tweet, or a sentence, or a paragraph, or an article.
- **corpus**: a set of documents. 
- **vocabulary**: the set of unique words that appear in the corpus.
- **bag of words**: today we are ignoring word order; we simply treat each document as a collection of words. This is called a **bag of words** approach. It is surprisingly effective for how dumb it is. 
- **n-grams**: a sequence of **n** words in a row, to be treated as a single token. For example, the phrase "What a sad clown painting" contains the 2-grams "what a", "a sad", "sad clown" and "clown painting". Including 2-, 3-, and 4-gram tokenization in your pipeline below can begin to account for signal encoded in word order via multi-word phrases. 
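To make the n-gram definition concrete, here is a minimal sketch that slides a window of width `n` over a list of tokens. The helper name `ngrams` is made up for illustration and is not part of the pipeline built later.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "what a sad clown painting".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['what a', 'a sad', 'sad clown', 'clown painting']
print(trigrams)  # ['what a sad', 'a sad clown', 'sad clown painting']
```

Each n-gram can then be counted exactly like a single-word token.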

## FIRST TRY: Word Count Vectors

Our corpus: these three sentences

In [1]:
corpus = ["Jeff stole my octopus sandwich.", 
    "'My hammy!' I sobbed, sandwichlessly.", 
    "'Drop the sandwiches!' said the sandwich police."]

Here's the simplest possible approach: a vector for a document (sentence) should encode the word count in that document for each word in the corpus vocabulary.

First, let's split the documents into words (tokens).

In [2]:
tokenized_documents = [doc.split() for doc in corpus]
tokenized_documents

[['Jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'My", "hammy!'", 'I', 'sobbed,', 'sandwichlessly.'],
 ["'Drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

Now let's get the vocabulary:

In [3]:
vocab_set = set()
for doc in tokenized_documents:
    vocab_set.update(set(doc))
vocabulary = sorted(vocab_set)

In [4]:
vocabulary

["'Drop",
 "'My",
 'I',
 'Jeff',
 "hammy!'",
 'my',
 'octopus',
 'police.',
 'said',
 'sandwich',
 'sandwich.',
 "sandwiches!'",
 'sandwichlessly.',
 'sobbed,',
 'stole',
 'the']

Now make a word count vector for each sentence.

In [5]:
import numpy as np
import pandas as pd

In [6]:
X = np.zeros(shape=(len(corpus), len(vocabulary)))

In [7]:
for i, doc in enumerate(tokenized_documents):
    for word in doc:
        j = vocabulary.index(word)
        X[i,j] += 1

In [8]:
X

array([[0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
       [0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 2.]])

In [9]:
pd.DataFrame(data=X, columns=vocabulary)

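| | 'Drop | 'My | I | Jeff | hammy!' | my | octopus | police. | said | sandwich | sandwich. | sandwiches!' | sandwichlessly. | sobbed, | stole | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 |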


## Anything unsatisfactory about the above?

## Let's make a better text processing pipeline:
- split each document into tokens
- lowercase & drop punctuation
- drop **stop words** (filler, like "the")
- reduce words to their root form using a huge dictionary of rules that many generations of linguists have compiled
  - **stemming**: simple, context-free rules like turning *sandwiches* into *sandwich*
  - **lemmatization**: edge cases and part-of-speech based rules like turning *better* into *good*
- then turn your corpus of cleaned, tokenized documents into an array of feature vectors
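To give a feel for the difference between the two root-finding approaches, here is a toy sketch. The suffix rules and the lookup table are invented for illustration; they are far cruder than the real Snowball stemmer and WordNet lemmatizer used below, and the names `toy_stem` and `toy_lemmatize` are hypothetical.

```python
# Toy suffix rules, tried in order. Illustrative only: this is NOT the
# rule set used by nltk's SnowballStemmer.
SUFFIX_RULES = [
    ("sses", "ss"),  # "classes" -> "class"
    ("ches", "ch"),  # "sandwiches" -> "sandwich"
    ("ss", "ss"),    # leave "boss" alone
    ("us", "us"),    # leave "octopus" alone
    ("s", ""),       # "clowns" -> "clown"
]

# Lemmatization needs word-specific knowledge, sketched here as a tiny lookup.
TOY_LEMMAS = {"better": "good", "mice": "mouse"}

def toy_stem(word):
    """Apply the first matching suffix rule that leaves a stem of 2+ letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

print(toy_stem("sandwiches"))   # sandwich
print(toy_lemmatize("better"))  # good
```

The stemmer never needs to know the word, only its ending; the lemmatizer cannot handle *better* without being told about it.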

In [10]:
import string
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

In [11]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/moses/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/moses/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/moses/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### step 1: split each document into a list of tokens

In [12]:
tokenized_docs = [doc.split() for doc in corpus]
tokenized_docs

[['Jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'My", "hammy!'", 'I', 'sobbed,', 'sandwichlessly.'],
 ["'Drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

### step 2: lowercase, lose punctuation

In [13]:
tokenized_docs_lowered = [[word.lower() for word in doc]
                          for doc in tokenized_docs]

In [14]:
tokenized_docs_lowered

[['jeff', 'stole', 'my', 'octopus', 'sandwich.'],
 ["'my", "hammy!'", 'i', 'sobbed,', 'sandwichlessly.'],
 ["'drop", 'the', "sandwiches!'", 'said', 'the', 'sandwich', 'police.']]

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
punct = set(string.punctuation)

#### exercise: get rid of all punctuation in a string

In [17]:
word = 'mr.peanut!'

In [18]:
def remove_symbols(word, symbol_set):
    return ''.join([char for char in word if char not in symbol_set])
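An alternative way to do the exercise, sketched here for comparison, uses the built-in `str.translate` with a deletion table from `str.maketrans`. The names `PUNCT_TABLE` and `remove_punctuation` are made up for this example.

```python
import string

# Third argument of str.maketrans lists characters to delete.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(word):
    """Return the word with every ASCII punctuation character removed."""
    return word.translate(PUNCT_TABLE)

print(remove_punctuation("mr.peanut!"))  # mrpeanut
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters would pass through either version unchanged.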

In [19]:
cleaned_docs = [[remove_symbols(word, punct) for word in doc] 
                for doc in tokenized_docs_lowered]

In [20]:
cleaned_docs

[['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['my', 'hammy', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]

### step 3: remove stop words

In [21]:
stop_words = set(nltk.corpus.stopwords.words('english'))

In [22]:
print(stop_words)

{'does', 'those', 'be', 'nor', 'if', 'a', 'an', 'below', 'did', 'are', 'about', "it's", "doesn't", "you're", 'of', 'yourselves', 'between', 'but', 'don', 'her', 'themselves', 'and', 'against', 'because', 'm', 'off', 'out', 'when', 'had', 'mustn', 'again', 'or', "should've", 'how', 'more', 'as', 'while', 'up', 'these', 'any', 'can', "you'd", 'under', 'couldn', 'we', 'whom', 'doesn', 'during', 'itself', 'for', 'myself', 'they', 'both', 'should', 'what', "needn't", 'he', 'have', 'where', 'your', 'him', 'to', 'aren', "weren't", 'hers', 'me', 'why', 'all', "didn't", 'down', 'their', 'mightn', 'i', "shouldn't", 've', 'by', 'only', 'do', 'being', "hadn't", "hasn't", 'too', 'here', 'ourselves', 'the', 'than', 'which', 'isn', 't', 'didn', 'wasn', 'having', 'has', "she's", 'on', 'now', 'haven', 'there', 'once', 'through', "won't", 'his', 'my', 'own', 'were', 'after', 'y', "isn't", 'each', "wouldn't", 'most', 'she', 'just', 'is', 'yours', 'some', 'o', 'ma', 'will', 'who', 'll', 'herself', 'above'

In [23]:
docs_no_stops = [[word for word in doc if word not in stop_words] 
                 for doc in cleaned_docs]

In [24]:
docs_no_stops

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']]

### step 4: stemming / lemmatization

In [25]:
lemmer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

In [26]:
lemmer.lemmatize('sandwiches')

'sandwich'

In [27]:
stemmer.stem("sandwiches")

'sandwich'

In [28]:
docs_lemmatized = [[lemmer.lemmatize(word) for word in doc]
                  for doc in docs_no_stops]
docs_lemmatized

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

In [29]:
docs_stemmed = [[stemmer.stem(word) for word in doc]
                  for doc in docs_no_stops]
docs_stemmed

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammi', 'sob', 'sandwichless'],
 ['drop', 'sandwich', 'said', 'sandwich', 'polic']]

### We can combine all these steps into a function

In [30]:
def our_text_pipeline(doc, stops=None, lemmatize=False):
    '''
    Args:
        doc (str): the text to be tokenized
        stops (set): an optional set of words (tokens) to exclude
        lemmatize (bool): if True, lemmatize the words
    
    Returns: 
        tokens (list of strings)
    '''
    doc = doc.lower().split()
    punct = set(string.punctuation)
    tokens = [''.join([char for char in tok if char not in punct]) 
              for tok in doc]
    if stops:
        tokens = [tok for tok in tokens if (tok not in stops)]
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
    return tokens

In [31]:
[our_text_pipeline(doc, stops=stop_words, lemmatize=True) 
 for doc in corpus]

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

`nltk` has a `word_tokenize` function that's a bit more sophisticated than `doc.split()`: for example, it splits trailing punctuation into its own token. Feel free to incorporate it into your pipeline. See the [documentation](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize) for more details.

In [32]:
from nltk.tokenize import word_tokenize

In [33]:
word_tokenize(corpus[0])

['Jeff', 'stole', 'my', 'octopus', 'sandwich', '.']

## Count Vectors

OK, the work above cleaned up our text.

In [34]:
corpus

['Jeff stole my octopus sandwich.',
 "'My hammy!' I sobbed, sandwichlessly.",
 "'Drop the sandwiches!' said the sandwich police."]

In [35]:
docs_cleaned = [our_text_pipeline(doc, stops=stop_words, lemmatize=True) 
              for doc in corpus]
docs_cleaned

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['hammy', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']]

Now let's turn our corpus into vectors of word counts. Let's put that (inefficient) counting code from earlier into a function:

In [36]:
def our_count_vectorizer(docs):
    '''
    Args:
        docs (list of lists of strings): corpus
    Returns:
        X_count (numpy array): count vectors
        vocab (list of strings): alphabetical list 
                                 of unique words
    '''
    vocab_set = set()
    for doc in docs:
        vocab_set.update(doc)

    vocab = sorted(vocab_set)

    X_count = np.zeros(shape=(len(docs), len(vocab)))

    for i, doc in enumerate(docs):
        for word in doc:
            j = vocab.index(word)
            X_count[i,j] += 1
    return X_count, vocab  

In [37]:
X_count, vocab = our_count_vectorizer(docs_cleaned)

In [38]:
vocab

['drop',
 'hammy',
 'jeff',
 'octopus',
 'police',
 'said',
 'sandwich',
 'sandwichlessly',
 'sobbed',
 'stole']

In [39]:
X_count

array([[0., 0., 1., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1., 1., 2., 0., 0., 0.]])

In [40]:
count_vectors = pd.DataFrame(data=X_count, columns=vocab)
count_vectors

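| | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 |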
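The counting code is inefficient mainly because `vocab.index(word)` scans the vocabulary list for every token. One possible fix, sketched below, is to build a word-to-column dictionary once. The name `fast_count_vectorizer` is hypothetical, and the three cleaned documents are copied in so the block runs on its own.

```python
import numpy as np

def fast_count_vectorizer(docs):
    """Same output as the list-based version, but with O(1) column lookups."""
    vocab = sorted({word for doc in docs for word in doc})
    column = {word: j for j, word in enumerate(vocab)}  # word -> column index
    X = np.zeros(shape=(len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc:
            X[i, column[word]] += 1
    return X, vocab

docs = [['jeff', 'stole', 'octopus', 'sandwich'],
        ['hammy', 'sobbed', 'sandwichlessly'],
        ['drop', 'sandwich', 'said', 'sandwich', 'police']]

X_fast, vocab_fast = fast_count_vectorizer(docs)
print(X_fast)
```

For a three-sentence corpus the difference is invisible; it starts to matter once the vocabulary has thousands of words.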


## Term frequency (TF) vectors
If you are comparing documents of different lengths, the *frequency* of a word in a document may be more useful than the raw count.


$$TF_{word,document} = \frac{\#\_of\_times\_word\_appears\_in\_document}{total\_\#\_of\_words\_in\_document}$$

In [41]:
X_count

array([[0., 0., 1., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1., 1., 2., 0., 0., 0.]])

In [42]:
X_count.sum(axis=1, keepdims=True)

array([[4.],
       [3.],
       [5.]])

In [43]:
X_tf = X_count / X_count.sum(axis=1, keepdims=True)
X_tf

array([[0.        , 0.        , 0.25      , 0.25      , 0.        ,
        0.        , 0.25      , 0.        , 0.        , 0.25      ],
       [0.        , 0.33333333, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.33333333, 0.33333333, 0.        ],
       [0.2       , 0.        , 0.        , 0.        , 0.2       ,
        0.2       , 0.4       , 0.        , 0.        , 0.        ]])

In [44]:
tf_vectors = pd.DataFrame(data=X_tf,
                          columns=vocab)
tf_vectors.round(2)

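| | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.25 |
| 1 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 |
| 2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 |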
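One edge case worth keeping in mind: if a document ends up with no tokens at all (say, every word was a stop word), its row sum is zero and the plain division yields `nan`. A minimal sketch of a guarded version follows, assuming we want such rows to stay all zeros; the name `term_frequencies` is made up for this example.

```python
import numpy as np

def term_frequencies(X_count):
    """Divide each row by its total; rows that sum to 0 are left as zeros."""
    totals = X_count.sum(axis=1, keepdims=True)
    return np.divide(
        X_count,
        totals,
        out=np.zeros_like(X_count, dtype=float),  # values used where we skip
        where=totals > 0,                         # skip rows with no tokens
    )

counts = np.array([[1., 1., 2.],
                   [0., 0., 0.]])   # second document is empty
tf = term_frequencies(counts)
print(tf)
```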


## Term Frequency $*$ Inverse Document Frequency (TF-IDF) Vectors

In many NLP tasks related to information retrieval or document classification, rare words may be especially helpful in categorizing documents. For example, if I'm trying to pick out the music articles from a database of ~1 million articles, the word _arpeggiate_, though rare, probably exclusively appears in music texts. Conversely, words like _the_ appear in every article, and have no discriminatory power.

To quantify this idea, we first calculate the _document frequency_ of each word in the vocabulary
$$ DF_{word} = \frac{\#\_of\_documents\_containing\_word}{total\_\#\_of\_documents} $$

Then we give that word a logarithmically scaled _inverse document frequency_ weight
$$ IDF_{word} = \log\left(\frac{1}{DF_{word}}\right) $$

Then we multiply each word's term frequency (TF) in a document by its IDF weight to get a TF-IDF vector
$$TFIDF_{word,document} = TF_{word, document}*IDF_{word}$$
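As a worked example, take the cleaned three-sentence corpus from above, where document 0 is `['jeff', 'stole', 'octopus', 'sandwich']`, and use the natural logarithm. The word *sandwich* appears once among the 4 tokens of document 0 and in 2 of the 3 documents:

$$ TF_{sandwich,0} = \frac{1}{4}, \qquad DF_{sandwich} = \frac{2}{3}, \qquad IDF_{sandwich} = \log\left(\frac{3}{2}\right) \approx 0.405, \qquad TFIDF_{sandwich,0} \approx 0.25 * 0.405 \approx 0.101 $$

The word *jeff* has the same term frequency in document 0 but appears in only 1 of the 3 documents, so it gets a larger weight:

$$ TF_{jeff,0} = \frac{1}{4}, \qquad DF_{jeff} = \frac{1}{3}, \qquad IDF_{jeff} = \log\left(3\right) \approx 1.099, \qquad TFIDF_{jeff,0} \approx 0.25 * 1.099 \approx 0.275 $$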

In [45]:
doc_count = np.zeros(shape=(len(vocab),))

for i,word in enumerate(vocab):
    for doc in docs_cleaned:
        if word in doc:
            doc_count[i] += 1

doc_freq = doc_count/len(corpus)
            
doc_freq_series = pd.Series(data=doc_freq, index=vocab)
doc_freq_series.round(2)

drop              0.33
hammy             0.33
jeff              0.33
octopus           0.33
police            0.33
said              0.33
sandwich          0.67
sandwichlessly    0.33
sobbed            0.33
stole             0.33
dtype: float64

In [46]:
inverse_doc_freq = np.log(1/doc_freq_series)
inverse_doc_freq

drop              1.098612
hammy             1.098612
jeff              1.098612
octopus           1.098612
police            1.098612
said              1.098612
sandwich          0.405465
sandwichlessly    1.098612
sobbed            1.098612
stole             1.098612
dtype: float64

In [47]:
X_tf

array([[0.        , 0.        , 0.25      , 0.25      , 0.        ,
        0.        , 0.25      , 0.        , 0.        , 0.25      ],
       [0.        , 0.33333333, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.33333333, 0.33333333, 0.        ],
       [0.2       , 0.        , 0.        , 0.        , 0.2       ,
        0.2       , 0.4       , 0.        , 0.        , 0.        ]])

In [48]:
idf = np.log(1/doc_freq)
idf

array([1.09861229, 1.09861229, 1.09861229, 1.09861229, 1.09861229,
       1.09861229, 0.40546511, 1.09861229, 1.09861229, 1.09861229])

In [49]:
X_tf*idf

array([[0.        , 0.        , 0.27465307, 0.27465307, 0.        ,
        0.        , 0.10136628, 0.        , 0.        , 0.27465307],
       [0.        , 0.3662041 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.3662041 , 0.3662041 , 0.        ],
       [0.21972246, 0.        , 0.        , 0.        , 0.21972246,
        0.21972246, 0.16218604, 0.        , 0.        , 0.        ]])

In [50]:
tf_idf = pd.DataFrame(data=X_tf*idf, columns=vocab)
tf_idf

| | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.274653 | 0.274653 | 0.0 | 0.0 | 0.101366 | 0.0 | 0.0 | 0.274653 |
| 1 | 0.0 | 0.366204 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.366204 | 0.366204 | 0.0 |
| 2 | 0.219722 | 0.0 | 0.0 | 0.0 | 0.219722 | 0.219722 | 0.162186 | 0.0 | 0.0 | 0.0 |


Above I've explicitly done the calculations using numpy, but we could have used pandas instead.

In [51]:
tf_vectors.round(2)

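| | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.25 |
| 1 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 |
| 2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 |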


In [52]:
inverse_doc_freq

drop              1.098612
hammy             1.098612
jeff              1.098612
octopus           1.098612
police            1.098612
said              1.098612
sandwich          0.405465
sandwichlessly    1.098612
sobbed            1.098612
stole             1.098612
dtype: float64

In [53]:
tf_vectors*inverse_doc_freq

| | drop | hammy | jeff | octopus | police | said | sandwich | sandwichlessly | sobbed | stole |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.274653 | 0.274653 | 0.0 | 0.0 | 0.101366 | 0.0 | 0.0 | 0.274653 |
| 1 | 0.0 | 0.366204 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.366204 | 0.366204 | 0.0 |
| 2 | 0.219722 | 0.0 | 0.0 | 0.0 | 0.219722 | 0.219722 | 0.162186 | 0.0 | 0.0 | 0.0 |
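For reference, the whole TF-IDF calculation can also be folded into one small function using only the standard library. This is a sketch of the unsmoothed definitions given above, not of any library's implementation; the function name `tf_idf_vectors` and the dictionary-per-document output are choices made for this example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists.
    Returns (rows, vocab); each row maps word -> TF * IDF for one document."""
    n_docs = len(docs)
    # document frequency: count each word once per document it appears in
    doc_counts = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / count) for word, count in doc_counts.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({word: (c / len(doc)) * idf[word]
                     for word, c in counts.items()})
    return rows, sorted(idf)

docs = [['jeff', 'stole', 'octopus', 'sandwich'],
        ['hammy', 'sobbed', 'sandwichlessly'],
        ['drop', 'sandwich', 'said', 'sandwich', 'police']]

rows, tfidf_vocab = tf_idf_vectors(docs)
print(round(rows[0]['sandwich'], 6))  # 0.101366
print(round(rows[2]['sandwich'], 6))  # 0.162186
```

With this definition, a word that appears in every document gets an IDF of log(1) = 0 and drops out of every vector.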


## So we've featurized documents based on word count & frequency within a document, and using document frequency... now what?

Now that we have turned our TEXT DOCUMENTS into VECTORS, we can put them into whatever machine learning algorithm we want! We can use whatever kind of similarity measures we please!

Wow!

#### pop quiz: 
I'm trying to use logistic regression to tell spam e-mail from non-spam e-mails. How do I pick which text featurization / vectorization to use? What are my options?

## Comparing document vectors using cosine similarity

example: comparing count vectors of short articles & long articles (see whiteboard)

Any two non-collinear vectors $\vec{u}$ and $\vec{v}$ form a plane (a 2D space), no matter how many dimensions $\vec{u}$ and $\vec{v}$ have. Let $\theta$ be the angle between the vectors in that plane. The **cosine similarity** between $\vec{u}$ and $\vec{v}$ is defined as
$$ \text{cosine_sim}(\vec{u},\vec{v}) = \cos{\theta} =\frac{\vec{u}\cdot\vec{v}}{|\vec{u}||\vec{v}|} $$
where the numerator is the **dot product** between the vectors and the denominator is the product of the vector **magnitudes** (L2 norms).
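The formula translates almost directly into code. Below is a minimal pure-Python sketch; the function name `cosine_similarity` and the three toy count vectors are invented for illustration. It shows why this measure suits documents of different lengths: a long article with the same word proportions as a short one points in the same direction.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

short_article = [1, 2, 0, 1]
long_article = [10, 20, 0, 10]  # same word proportions, ten times longer
unrelated = [0, 0, 5, 0]        # shares no words with the other two

print(cosine_similarity(short_article, long_article))  # very close to 1
print(cosine_similarity(short_article, unrelated))     # 0.0
```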

### example: a villanelle

> The highly structured villanelle is a nineteen-line poem with two repeating rhymes and two refrains. The form is made up of five tercets followed by a quatrain. The first and third lines of the opening tercet are repeated alternately in the last lines of the succeeding stanzas; then in the final stanza, the refrain serves as the poem’s two concluding lines. Using capitals for the refrains and lowercase letters for the rhymes, the form could be expressed as: A1 b A2 / a b A1 / a b A2 / a b A1 / a b A2 / a b A1 A2.

Probably the best-known villanelle in English is "Do not go gentle into that good night" by Dylan Thomas.

In [54]:
raw_poem = '''Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.

Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.

Good men, the last wave by, crying how bright
Their frail deeds might have danced in a green bay,
Rage, rage against the dying of the light.

Wild men who caught and sang the sun in flight,
And learn, too late, they grieved it on its way,
Do not go gentle into that good night.

Grave men, near death, who see with blinding sight
Blind eyes could blaze like meteors and be gay,
Rage, rage against the dying of the light.

And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.'''

### Let's treat each line as a document and the poem as our corpus, then vectorize this baby into a matrix that would make Dylan Thomas proud (sorry).

In [55]:
lines = raw_poem.split('\n')
lines

['Do not go gentle into that good night,',
 'Old age should burn and rave at close of day;',
 'Rage, rage against the dying of the light.',
 '',
 'Though wise men at their end know dark is right,',
 'Because their words had forked no lightning they',
 'Do not go gentle into that good night.',
 '',
 'Good men, the last wave by, crying how bright',
 'Their frail deeds might have danced in a green bay,',
 'Rage, rage against the dying of the light.',
 '',
 'Wild men who caught and sang the sun in flight,',
 'And learn, too late, they grieved it on its way,',
 'Do not go gentle into that good night.',
 '',
 'Grave men, near death, who see with blinding sight',
 'Blind eyes could blaze like meteors and be gay,',
 'Rage, rage against the dying of the light.',
 '',
 'And you, my father, there on the sad height,',
 'Curse, bless, me now with your fierce tears, I pray.',
 'Do not go gentle into that good night.',
 'Rage, rage against the dying of the light.']

In [56]:
#removing empty lines
lines = [l for l in lines if l]
lines

['Do not go gentle into that good night,',
 'Old age should burn and rave at close of day;',
 'Rage, rage against the dying of the light.',
 'Though wise men at their end know dark is right,',
 'Because their words had forked no lightning they',
 'Do not go gentle into that good night.',
 'Good men, the last wave by, crying how bright',
 'Their frail deeds might have danced in a green bay,',
 'Rage, rage against the dying of the light.',
 'Wild men who caught and sang the sun in flight,',
 'And learn, too late, they grieved it on its way,',
 'Do not go gentle into that good night.',
 'Grave men, near death, who see with blinding sight',
 'Blind eyes could blaze like meteors and be gay,',
 'Rage, rage against the dying of the light.',
 'And you, my father, there on the sad height,',
 'Curse, bless, me now with your fierce tears, I pray.',
 'Do not go gentle into that good night.',
 'Rage, rage against the dying of the light.']

In [57]:
poem = [our_text_pipeline(line) for line in lines]
poem

[['do', 'not', 'go', 'gentle', 'into', 'that', 'good', 'night'],
 ['old', 'age', 'should', 'burn', 'and', 'rave', 'at', 'close', 'of', 'day'],
 ['rage', 'rage', 'against', 'the', 'dying', 'of', 'the', 'light'],
 ['though',
  'wise',
  'men',
  'at',
  'their',
  'end',
  'know',
  'dark',
  'is',
  'right'],
 ['because', 'their', 'words', 'had', 'forked', 'no', 'lightning', 'they'],
 ['do', 'not', 'go', 'gentle', 'into', 'that', 'good', 'night'],
 ['good', 'men', 'the', 'last', 'wave', 'by', 'crying', 'how', 'bright'],
 ['their',
  'frail',
  'deeds',
  'might',
  'have',
  'danced',
  'in',
  'a',
  'green',
  'bay'],
 ['rage', 'rage', 'against', 'the', 'dying', 'of', 'the', 'light'],
 ['wild', 'men', 'who', 'caught', 'and', 'sang', 'the', 'sun', 'in', 'flight'],
 ['and', 'learn', 'too', 'late', 'they', 'grieved', 'it', 'on', 'its', 'way'],
 ['do', 'not', 'go', 'gentle', 'into', 'that', 'good', 'night'],
 ['grave', 'men', 'near', 'death', 'who', 'see', 'with', 'blinding', 'sight'],
 [

In [58]:
poem_vec, poem_vocab = our_count_vectorizer(poem)

In [59]:
print(poem_vocab)

['a', 'against', 'age', 'and', 'at', 'bay', 'be', 'because', 'blaze', 'bless', 'blind', 'blinding', 'bright', 'burn', 'by', 'caught', 'close', 'could', 'crying', 'curse', 'danced', 'dark', 'day', 'death', 'deeds', 'do', 'dying', 'end', 'eyes', 'father', 'fierce', 'flight', 'forked', 'frail', 'gay', 'gentle', 'go', 'good', 'grave', 'green', 'grieved', 'had', 'have', 'height', 'how', 'i', 'in', 'into', 'is', 'it', 'its', 'know', 'last', 'late', 'learn', 'light', 'lightning', 'like', 'me', 'men', 'meteors', 'might', 'my', 'near', 'night', 'no', 'not', 'now', 'of', 'old', 'on', 'pray', 'rage', 'rave', 'right', 'sad', 'sang', 'see', 'should', 'sight', 'sun', 'tears', 'that', 'the', 'their', 'there', 'they', 'though', 'too', 'wave', 'way', 'who', 'wild', 'wise', 'with', 'words', 'you', 'your']


In [60]:
len(poem_vocab)

98

In [61]:
poem_vec.shape

(19, 98)

In [62]:
poem_vec[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [63]:
poem_vec[1]

array([0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [64]:
poem_vec[2]

array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [65]:
from scipy.spatial.distance import pdist, squareform

In [66]:
pairwise_dist = squareform(pdist(poem_vec, metric='cosine'))
pairwise_dist.shape

(19, 19)
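Note that `pdist(..., metric='cosine')` returns cosine *distance*, which is 1 minus the cosine similarity, so identical directions score 0 and vectors with no words in common score 1. The sketch below reproduces that relationship with plain numpy on a tiny made-up matrix; the helper name `pairwise_cosine_distance` is hypothetical, and it assumes no row is all zeros.

```python
import numpy as np

def pairwise_cosine_distance(X):
    """Return the square matrix of 1 - cosine similarity between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # assumes every row has at least one count
    similarity = unit @ unit.T    # dot products of unit vectors = cos(theta)
    return 1.0 - similarity

X_toy = np.array([[1., 0.],
                  [2., 0.],    # same direction as row 0
                  [0., 3.]])   # shares nothing with rows 0 and 1
D = pairwise_cosine_distance(X_toy)
print(D.round(3))
```

Because small distances mean similar lines, sorting a row of the distance matrix in ascending order lists the most similar documents first.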

In [67]:
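# for each of the first 5 lines, print the 5 lines closest to it
# (smallest cosine distance first; distance 0 means identical word proportions)
for i in range(5):
    for ind in pairwise_dist[i].argsort()[:5]:
        print(lines[ind])
    print('-----')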

Do not go gentle into that good night,
Do not go gentle into that good night.
Do not go gentle into that good night.
Do not go gentle into that good night.
Good men, the last wave by, crying how bright
-----
Old age should burn and rave at close of day;
And you, my father, there on the sad height,
Blind eyes could blaze like meteors and be gay,
Wild men who caught and sang the sun in flight,
Though wise men at their end know dark is right,
-----
Rage, rage against the dying of the light.
Rage, rage against the dying of the light.
Rage, rage against the dying of the light.
Rage, rage against the dying of the light.
And you, my father, there on the sad height,
-----
Though wise men at their end know dark is right,
Because their words had forked no lightning they
Good men, the last wave by, crying how bright
Grave men, near death, who see with blinding sight
Wild men who caught and sang the sun in flight,
-----
Because their words had forked no lightning they
Though wise men at their end 

## All of this is implemented in sklearn, too. The exercise is about using it. Here's a taste:

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 

In [69]:
count_vectorizer = CountVectorizer(tokenizer=our_text_pipeline, stop_words='english')

In [70]:
poem_vec_sk = count_vectorizer.fit_transform(lines)

In [71]:
poem_vec_sk

<19x57 sparse matrix of type '<class 'numpy.int64'>'
	with 79 stored elements in Compressed Sparse Row format>

In [72]:
poem_vec_sk.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [73]:
print(count_vectorizer.vocabulary_)

{'gentle': 26, 'good': 27, 'night': 41, 'old': 42, 'age': 0, 'burn': 7, 'rave': 45, 'close': 9, 'day': 14, 'rage': 44, 'dying': 17, 'light': 35, 'wise': 55, 'men': 38, 'end': 18, 'know': 32, 'dark': 13, 'right': 46, 'words': 56, 'forked': 23, 'lightning': 36, 'wave': 52, 'crying': 10, 'bright': 6, 'frail': 24, 'deeds': 16, 'danced': 12, 'green': 29, 'bay': 1, 'wild': 54, 'caught': 8, 'sang': 48, 'sun': 50, 'flight': 22, 'learn': 34, 'late': 33, 'grieved': 30, 'way': 53, 'grave': 28, 'near': 40, 'death': 15, 'blinding': 5, 'sight': 49, 'blind': 4, 'eyes': 19, 'blaze': 2, 'like': 37, 'meteors': 39, 'gay': 25, 'father': 20, 'sad': 47, 'height': 31, 'curse': 11, 'bless': 3, 'fierce': 21, 'tears': 51, 'pray': 43}


In [74]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vectorizer.get_feature_names_out().tolist())

['age', 'bay', 'blaze', 'bless', 'blind', 'blinding', 'bright', 'burn', 'caught', 'close', 'crying', 'curse', 'danced', 'dark', 'day', 'death', 'deeds', 'dying', 'end', 'eyes', 'father', 'fierce', 'flight', 'forked', 'frail', 'gay', 'gentle', 'good', 'grave', 'green', 'grieved', 'height', 'know', 'late', 'learn', 'light', 'lightning', 'like', 'men', 'meteors', 'near', 'night', 'old', 'pray', 'rage', 'rave', 'right', 'sad', 'sang', 'sight', 'sun', 'tears', 'wave', 'way', 'wild', 'wise', 'words']
