Text data requires special preparation before it can be used for predictive modeling. The
text must first be parsed into words, a step called tokenization. The words then need to be encoded
as integers or floating point values for use as input to a machine learning algorithm, a step called
feature extraction (or vectorization).
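
As a rough illustration of these two steps, the sketch below tokenizes a document and encodes it as word counts using only the Python standard library. The helper names (`tokenize`, `build_vocabulary`, `encode`) are hypothetical, and the regular expression is an assumption chosen to resemble scikit-learn's default token pattern (lowercased runs of two or more word characters). It is not the library's implementation.

```python
import re
from collections import Counter


def tokenize(document):
    # Tokenization: lowercase the text and keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", document.lower())


def build_vocabulary(documents):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(document, vocabulary):
    # Feature extraction: one count per vocabulary entry, in index order.
    counts = Counter(tokenize(document))
    ordered_terms = sorted(vocabulary, key=vocabulary.get)
    return [counts.get(term, 0) for term in ordered_terms]


docs = ["My name is Saif Gazali and my age is 22"]
vocab = build_vocabulary(docs)
row = encode(docs[0], vocab)
print(vocab)
print(row)
```

Words that appear twice in the sentence ("my" and "is") receive a count of 2, and every other word receives a count of 1.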

In [6]:
# Word Counts with CountVectorizer

# CountVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary

from sklearn.feature_extraction.text import CountVectorizer as cv

vectorizer = cv()

text = ["My name is Saif Gazali and my age is 22"]

vectorizer.fit(text)

print(vectorizer.vocabulary_)

vectors = vectorizer.transform(text)

print(vectors.shape)
print(vectors.toarray())

{'my': 5, 'name': 6, 'is': 4, 'saif': 7, 'gazali': 3, 'and': 2, 'age': 1, '22': 0}
(1, 8)
[[1 1 1 1 2 2 1 1]]


In [8]:
# Reuse the fitted vectorizer on new text; words outside the learned vocabulary ("his", "what") are ignored
text2 = ["his name is what ?"]
vector = vectorizer.transform(text2)
print(vector.toarray())


[[0 0 0 0 1 0 1 0]]


In [13]:
# Word Frequencies with TfidfVectorizer

# TF-IDF scores weight word frequencies to highlight words that are more informative, e.g. frequent in a document but rare across documents

from sklearn.feature_extraction.text import TfidfVectorizer as tfv

vectorizer = tfv()

text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]


vectorizer.fit(text)

print(vectorizer.vocabulary_)

print(vectorizer.idf_)
# An inverse document frequency is calculated for each word in the vocabulary; the lowest score, 1.0, goes to the most frequently observed word, "the", at index 7.

vector = vectorizer.transform([text[0]])

print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
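
To see where these numbers come from, here is a minimal sketch that recomputes them by hand with the standard library. It assumes scikit-learn's default settings: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, raw term counts as the term frequency, and L2 normalization of each row. The `tokenize` helper and its regular expression are hypothetical stand-ins for the library's tokenizer.

```python
import math
import re
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]


def tokenize(document):
    # Assumed approximation of the default tokenizer: lowercase, 2+ word chars.
    return re.findall(r"\b\w\w+\b", document.lower())


tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = {term: sum(term in toks for toks in tokenized) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

# TF-IDF for the first document: count * idf, then scale the row to unit length.
counts = Counter(tokenized[0])
raw = [counts[term] * idf[term] for term in vocabulary]
norm = math.sqrt(sum(value * value for value in raw))
tfidf = [value / norm for value in raw]

for term, weight in zip(vocabulary, tfidf):
    print(f"{term:>7}: idf={idf[term]:.8f}  tfidf={weight:.8f}")
```

Because "the" occurs in all three documents, its idf is ln(4/4) + 1 = 1.0. It still receives the largest TF-IDF weight in the first document because it occurs there twice.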


# Hashing with HashingVectorizer
Counts and frequencies can be very useful, but one limitation of these methods is that the
vocabulary can become very large. This, in turn, requires large vectors for encoding
documents, which increases memory use and slows down algorithms. A clever workaround
is to use a one-way hash of words to convert them to integers. The clever part is that
no vocabulary is required and you can choose a fixed vector length of any size. A downside
is that the hash is a one-way function, so there is no way to convert the encoding back to a word
(which may not matter for many supervised learning tasks).
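
The idea can be sketched in a few lines of standard-library Python. This is an illustration only: it uses MD5 from `hashlib` as a stand-in hash function, whereas scikit-learn uses a different hash, so the column indices will not match the library's output. The `hashed_counts` helper is hypothetical.

```python
import hashlib
import re


def hashed_counts(document, n_features=20):
    # No vocabulary is stored: each token is hashed directly to a column index.
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", document.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1  # different tokens may collide in the same column
    return vector


vec = hashed_counts("The quick brown fox jumped over the lazy dog.")
print(len(vec), sum(vec))
```

The vector length is fixed by `n_features` regardless of how many distinct words appear, which is what keeps memory use bounded. The cost is that two different words can land in the same column.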


In [14]:
from sklearn.feature_extraction.text import HashingVectorizer

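# An arbitrary fixed-length vector size of 20 is chosen here. This is the range of the hash function; small values (like 20) may result in hash collisions.
# By default, HashingVectorizer also L2-normalizes each row and applies an alternating sign to hashed features, which is why the output contains negative fractional values.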
vectorizer = HashingVectorizer(n_features=20)

vector = vectorizer.transform(text)

print(vector.shape)
print(vector.toarray())

(3, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.         -0.70710678  0.          0.          0.          0.
  -0.70710678  0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.         -0.70710678  0.          0.
  -0.70710678  0.        ]]
