# Prepare Text Data with scikit-learn
@ Sani Kamal, 2019

## Contents
- Convert text to word count vectors with `CountVectorizer`.
- Convert text to word frequency vectors with `TfidfVectorizer`.
- Convert text to fixed-length vectors by hashing words to integer indices with `HashingVectorizer`.

## Word Counts with CountVectorizer
The `CountVectorizer` provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["All human beings are born free and equal in dignity and rights."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
# print(vector)
print(type(vector))
print(vector.toarray())
```

```
{'all': 0, 'human': 8, 'beings': 3, 'are': 2, 'born': 4, 'free': 7, 'and': 1, 'equal': 6, 'in': 9, 'dignity': 5, 'rights': 10}
(1, 11)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 2 1 1 1 1 1 1 1 1 1]]
```


```python
# encode another document
text2 = ["born free and equal why"]
vector = vectorizer.transform(text2)
print(vector.toarray())
```

```
[[0 1 0 0 1 0 1 1 0 0 0]]
```
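
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea. It assumes the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, vocabulary indexed in alphabetical order); the helper names `tokenize`, `fit_vocabulary`, and `count_vector` are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: this pattern approximates CountVectorizer's default tokenizer
# (runs of two or more word characters). It is an illustration, not library code.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into word tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocabulary):
    """Count known tokens; tokens missing from the vocabulary are skipped."""
    counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
    ordered = sorted(vocabulary, key=vocabulary.get)
    return [counts[tok] for tok in ordered]


first = "All human beings are born free and equal in dignity and rights."
vocabulary = fit_vocabulary([first])

first_vector = count_vector(first, vocabulary)
second_vector = count_vector("born free and equal why", vocabulary)
print(first_vector)
print(second_vector)
```

The second document contains the word "why", which was not seen during fitting, so it contributes nothing to the encoded vector. The vector length stays fixed at the vocabulary size.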


## Word Frequencies with TfidfVectorizer

`TF-IDF` is an alternative to raw counts that calculates word frequency scores. The acronym stands for `Term Frequency - Inverse Document Frequency`, the two components of the score assigned to each word.
- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.

`TF-IDF` scores try to highlight words that are more interesting, e.g. frequent within a document but rare across documents.

The `TfidfVectorizer` will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["All human beings are born free and equal in dignity and rights.",
        "They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```

```
{'all': 1, 'human': 13, 'beings': 5, 'are': 4, 'born': 6, 'free': 12, 'and': 2, 'equal': 11, 'in': 14, 'dignity': 9, 'rights': 18, 'they': 21, 'endowed': 10, 'with': 23, 'reason': 17, 'conscience': 8, 'should': 19, 'act': 0, 'towards': 22, 'one': 16, 'another': 3, 'spirit': 20, 'of': 15, 'brotherhood': 7}
[1.40546511 1.40546511 1.         1.40546511 1.         1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.         1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511]
(1, 24)
[[0.         0.30099921 0.42832683 0.         0.21416342 0.30099921
  0.30099921 0.         0.         0.30099921 0.         0.30099921
  0.30099921 0.30099921 0.21416342 0.         0.         0.
  0.30099921 0.         0.         0.         0.         0.        ]]
```
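
The numbers above can be checked by hand. The sketch below assumes the default settings of `TfidfVectorizer`: a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The helper names `tokenize` and `tfidf_vector` are made up for this illustration; this is a plain-Python approximation, not the library's implementation.

```python
import math
import re
from collections import Counter

docs = [
    "All human beings are born free and equal in dignity and rights.",
    "They are endowed with reason and conscience and should act towards "
    "one another in a spirit of brotherhood.",
]


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters (assumed default)."""
    return re.findall(r"\b\w\w+\b", doc.lower())


n_docs = len(docs)
tokenized = [tokenize(doc) for doc in docs]

# Alphabetical vocabulary, so positions line up with the indices printed above.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}


def tfidf_vector(doc):
    """Raw term count times IDF, then scaled to unit Euclidean length."""
    counts = Counter(tokenize(doc))
    weights = [counts[tok] * idf[tok] for tok in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


vector = tfidf_vector(docs[0])
print(idf["and"], idf["human"])          # shared word vs. word in one document
print(vector[vocabulary.index("and")])   # compare with index 2 in the output above
```

Words that occur in both documents ("and", "are", "in") get the minimum IDF of 1.0, while words found in only one document get a higher weight. "and" still scores highest in the first document because it appears there twice.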


## Hashing with HashingVectorizer

```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["All human beings are born free and equal in dignity and rights."]

# create the transform
vectorizer = HashingVectorizer(n_features=20)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```

```
(1, 20)
[[-0.26726124  0.          0.         -0.26726124  0.         -0.53452248
   0.          0.          0.          0.26726124  0.          0.26726124
   0.26726124  0.          0.          0.          0.53452248  0.
   0.26726124  0.        ]]
```
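
The example needs no `fit` step because hashing is stateless: each token is mapped straight to a column index by a hash function, so no vocabulary is stored. The sketch below illustrates that idea under stated assumptions. It swaps in `zlib.crc32` for the hash, whereas scikit-learn uses a signed MurmurHash3, so the columns and signs will not match the output above. The function name `hashed_vector` is made up for this illustration.

```python
import math
import re
import zlib


def hashed_vector(doc, n_features=20):
    """Encode a document with the hashing trick (illustrative, not sklearn's hash)."""
    vector = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        h = zlib.crc32(token.encode("utf-8"))  # assumption: stand-in hash function
        index = h % n_features                 # column chosen by the hash
        # Use one bit of the hash as a sign so colliding tokens can partly cancel.
        sign = -1.0 if (h >> 31) & 1 else 1.0
        vector[index] += sign
    # Scale to unit Euclidean length, as the values in the output above suggest.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


text = "All human beings are born free and equal in dignity and rights."
vector = hashed_vector(text, n_features=20)
print(len(vector))
print(vector)
```

With only 20 columns for 11 distinct words, collisions are likely, which is the trade-off for a fixed vector size. A larger `n_features` reduces collisions at the cost of wider vectors.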
