# From Texts to Numeric Vectors

In [None]:
import seaborn as sns
sns.set_context("talk")

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
corpus = ['Time flies like an arrow.', 'Fruit flies like a banana.']

By default, `CountVectorizer` keeps only tokens of two or more characters, so single-character words such as "a" are dropped. Let's modify the token pattern to keep all words.

In [None]:
# \w+ keeps single-character words; binary=True records presence (0/1) instead of counts
one_hot_vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b', binary=True)
one_hot_vectorizer
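To see what the changed pattern does, the following minimal sketch compares the tokens produced by the default settings with those produced by the relaxed pattern. It assumes scikit-learn is installed; the variable names `sentence`, `default_tokens` and `relaxed_tokens` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'Fruit flies like a banana.'

# Default token_pattern is r'(?u)\b\w\w+\b': at least two word characters per token.
default_tokens = CountVectorizer().build_analyzer()(sentence)

# Relaxed pattern as used in this notebook: one or more word characters per token.
relaxed_tokens = CountVectorizer(token_pattern=r'(?u)\b\w+\b').build_analyzer()(sentence)

print(default_tokens)  # the single-character word "a" is missing
print(relaxed_tokens)  # "a" is kept
```

Note that both analyzers lowercase the text, because `lowercase=True` is the default.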

Apply the set-of-words vectorizer to our corpus. With `binary=True`, each entry records only whether a word occurs in a sentence, not how often.

In [None]:
one_hot = one_hot_vectorizer.fit_transform(corpus)
print(one_hot.toarray())
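The printed matrix is easier to read once you know which word each column stands for. Here is a small sketch, under the same settings as above, that recovers the column order from the fitted `vocabulary_` attribute and lists the words present in each sentence. The names `vectorizer`, `matrix`, `columns` and `present` are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies like an arrow.', 'Fruit flies like a banana.']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b', binary=True)
matrix = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each term to its column index; sorting by index gives the column order.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)

# For every sentence, list the terms whose column holds a 1.
for sentence, row in zip(corpus, matrix):
    present = [term for term, flag in zip(columns, row) if flag]
    print(sentence, '->', present)
```

The columns are sorted alphabetically, so word order within a sentence is lost: each row is a set of words, not a sequence.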

Let's have some fun and show the one-hot vectors as a heat map...

In [None]:
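# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
sns.heatmap(one_hot.toarray(), annot=True, cmap="Reds", cbar=False,
    xticklabels=list(one_hot_vectorizer.get_feature_names_out()),
    yticklabels=[s[0:5]+"…" for s in corpus])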

## Apply to new unseen test data
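To vectorize unseen texts, call `transform` on the already fitted vectorizer instead of fitting it again. Words that are not in the fitted vocabulary (out-of-vocabulary items) are simply ignored, which is the proper way to handle them.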

In [None]:
test_corpus = ['Fruit flies like an apple.']

In [None]:
one_hot_test = one_hot_vectorizer.transform(test_corpus)
print(one_hot_test.toarray())

In [None]:
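# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
sns.heatmap(one_hot_test.toarray(), annot=True, cmap="Reds", cbar=False,
    xticklabels=list(one_hot_vectorizer.get_feature_names_out()),
    yticklabels=[s[0:5]+"…" for s in test_corpus])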

You see... no apples here: "apple" was not in the training vocabulary, so it has no column.
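To check explicitly which test words were dropped, one option is to run the vectorizer's own analyzer on the test sentence and compare the tokens against the fitted vocabulary. This is a sketch under the same settings as above; `analyze`, `test_tokens`, `oov` and `test_row` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies like an arrow.', 'Fruit flies like a banana.']
test_corpus = ['Fruit flies like an apple.']

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b', binary=True)
vectorizer.fit(corpus)

# Tokenize the test sentence exactly as the vectorizer would.
analyze = vectorizer.build_analyzer()
test_tokens = analyze(test_corpus[0])

# Tokens missing from the fitted vocabulary are silently ignored by transform().
oov = [token for token in test_tokens if token not in vectorizer.vocabulary_]
print('Out-of-vocabulary:', oov)

test_row = vectorizer.transform(test_corpus).toarray()[0]
print(test_row.sum(), 'of', len(test_tokens), 'test tokens are represented')
```

The vector length stays fixed at the size of the training vocabulary, which is what keeps training and test vectors comparable.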

In [None]:
help(one_hot_vectorizer)