# Prepare Text Data for Machine Learning with scikit-learn

Text data requires special preparation before you can start using it for predictive modeling.

The text must first be split into words, a step called tokenization. 
<br>The words then need to be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

In this activity, you will discover how to prepare your text data for predictive modeling in Python with scikit-learn. You will learn:
- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to unique integers with HashingVectorizer.

## Bag-of-Words Model

<br>We cannot work with text directly when using machine learning algorithms.

<br>Instead, we need to convert the text to numbers.

<br>We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

<br>A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

<br>The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

<br>This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

<br>This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
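<br>The encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the helper names `build_vocabulary` and `encode` are made up for this sketch, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase word to a unique integer index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def encode(document, vocabulary):
    """Return a fixed-length count vector; words outside the vocabulary are skipped."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
print(vocab)    # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}

encoded = encode("the cat saw the dog", vocab)
print(encoded)  # [1, 1, 0, 0, 0, 2]
```

Word order is lost: "the cat saw the dog" and "the dog saw the cat" produce the same vector.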

<br>There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

<br>The scikit-learn library provides 3 different schemes that we can use, and we will briefly look at each.

### Word Counts with CountVectorizer
The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() method to learn a vocabulary from one or more documents.
3. Call the transform() method on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the scipy.sparse package.

The vectors returned from a call to transform() will be sparse vectors. To inspect them and better understand what is going on, you can convert them to NumPy arrays by calling the toarray() function.

Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["Post Graduate Program in Big Big Big Data Analytics...!"]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'post': 5, 'graduate': 3, 'program': 6, 'in': 4, 'big': 1, 'data': 2, 'analytics': 0}
(1, 7)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 3 1 1 1 1 1]]


One can see that all words were made lowercase by default and that the punctuation was ignored. 
<br>Refer to the API documentation for all of the options:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. 
<br>These words are ignored and no count is given in the resulting vector.

For example, the cell below uses the vectorizer above to encode a document that contains some words from the vocabulary and some words that are not in it.

In [2]:
# encode another document
text2 = ["Big Data analytics training and certification program"]
vector2 = vectorizer.transform(text2)
# summarize encoded vector
print(vectorizer.vocabulary_)
print(vector2.shape)
print(type(vector2))
print(vector2.toarray())

{'post': 5, 'graduate': 3, 'program': 6, 'in': 4, 'big': 1, 'data': 2, 'analytics': 0}
(1, 7)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 0 0 0 1]]


### Word Frequencies with TfidfVectorizer
Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.

TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. 
Alternatively, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
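
As a minimal sketch of the two routes, the cell below builds TF-IDF vectors once with a CountVectorizer followed by a TfidfTransformer and once with a TfidfVectorizer, then checks that the results agree. Default settings are assumed for all three classes, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "Time flies like an arrow",
    "Fruit flies like a banana",
    "Cat sat on the mat",
    "The cat is white.",
]

# Route 1: learn counts first, then reweight them with TF-IDF
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: do both steps with a single class
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings both routes should give the same matrix
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

The two-step route is handy when the raw counts are needed elsewhere, since the count matrix is computed only once.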

The same create, fit, and transform process is used as with the CountVectorizer.

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 4 small documents and then encode those documents.

In [3]:
# list of text documents
data = '''Time flies like an arrow
Fruit flies like a banana
Cat sat on the mat
The cat is white.'''

dataset = data.split('\n')
# print dataset contents
dataset

['Time flies like an arrow',
 'Fruit flies like a banana',
 'Cat sat on the mat',
 'The cat is white.']

In [4]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer

# create the transform
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
#max_df=0.95, min_df=2, stop_words='english' #USE HELP TO SEE WHAT EACH DOES)

# start the timer
t0 = time()

# tokenize and build vocab
tfidf_vectorizer.fit(dataset)
print("done in %0.3fs." % (time() - t0))

# summarize
print(tfidf_vectorizer.vocabulary_)
print(tfidf_vectorizer.idf_)

done in 0.004s.
{'time': 24, 'flies': 7, 'like': 13, 'an': 0, 'arrow': 2, 'time flies': 25, 'flies like': 8, 'like an': 14, 'an arrow': 1, 'fruit': 9, 'banana': 3, 'fruit flies': 10, 'like banana': 15, 'cat': 4, 'sat': 19, 'on': 17, 'the': 21, 'mat': 16, 'cat sat': 6, 'sat on': 20, 'on the': 18, 'the mat': 23, 'is': 11, 'white': 26, 'the cat': 22, 'cat is': 5, 'is white': 12}
[1.91629073 1.91629073 1.91629073 1.91629073 1.51082562 1.91629073
 1.91629073 1.51082562 1.51082562 1.91629073 1.91629073 1.91629073
 1.91629073 1.51082562 1.91629073 1.91629073 1.91629073 1.91629073
 1.91629073 1.91629073 1.91629073 1.51082562 1.91629073 1.91629073
 1.91629073 1.91629073 1.91629073]


A vocabulary of 27 terms (single words and two-word phrases, because ngram_range=(1,2)) is learned from the documents, and each term is assigned a unique integer index in the output vector.
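
The commented-out options in the cell above can shrink this vocabulary. As a small sketch of one of them, `min_df=2` keeps only terms that occur in at least two documents. The variable names `pruned` and `kept_terms` are illustrative, and the other options (`max_df`, `stop_words`) are left at their defaults here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Time flies like an arrow",
    "Fruit flies like a banana",
    "Cat sat on the mat",
    "The cat is white.",
]

# Same n-gram range as above, but drop terms seen in fewer than 2 documents
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
pruned.fit(docs)

kept_terms = sorted(pruned.vocabulary_)
print(kept_terms)  # ['cat', 'flies', 'flies like', 'like', 'the']
```

On a corpus this small the effect is drastic; on real data `min_df` is mostly used to drop rare terms and typos.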

The inverse document frequencies are calculated for each term in the vocabulary. The lowest score, 1.51082562, goes to the most frequently observed terms, each of which appears in two of the four documents: cat, flies, flies like, like and the, at indices 4, 7, 8, 13 and 21.
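
The two distinct values in the idf_ output can be reproduced by hand. The sketch below assumes the vectorizer's default smoothing (`smooth_idf=True`), for which the scikit-learn documentation gives idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The function name `smoothed_idf` is made up for this example.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """IDF with add-one smoothing, as documented for smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


idf_two_docs = smoothed_idf(4, 2)  # terms found in 2 of the 4 documents
idf_one_doc = smoothed_idf(4, 1)   # terms found in only 1 document

print(round(idf_two_docs, 8))  # 1.51082562
print(round(idf_one_doc, 8))   # 1.91629073
```

The rarer a term is across documents, the larger its weight.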

In [5]:
# encode document
tfidf = tfidf_vectorizer.transform(dataset)
print(tfidf.data)
print("Type: ", type(tfidf))
print("Shape: ", tfidf.shape)
print(tfidf.toarray())

[0.35657982 0.35657982 0.35657982 0.28113163 0.28113163 0.28113163
 0.35657982 0.35657982 0.35657982 0.41292788 0.32555709 0.41292788
 0.41292788 0.32555709 0.32555709 0.41292788 0.34829919 0.27460308
 0.34829919 0.34829919 0.34829919 0.34829919 0.34829919 0.34829919
 0.27460308 0.40021825 0.40021825 0.31553666 0.40021825 0.40021825
 0.40021825 0.31553666]
Type:  <class 'scipy.sparse.csr.csr_matrix'>
Shape:  (4, 27)
[[0.35657982 0.35657982 0.35657982 0.         0.         0.
  0.         0.28113163 0.28113163 0.         0.         0.
  0.         0.28113163 0.35657982 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.35657982 0.35657982 0.        ]
 [0.         0.         0.         0.41292788 0.         0.
  0.         0.32555709 0.32555709 0.41292788 0.41292788 0.
  0.         0.32555709 0.         0.41292788 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.         0.  

Finally, the 4 documents are encoded as a 4 x 27 sparse matrix, one 27-element row per document, and we can review the final score of each term in each document.

Each row is scaled to unit length by default (L2 normalization), so every score lies between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
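
A quick way to check the normalization is to compute the Euclidean length of each encoded row. This sketch refits a vectorizer with the same settings as above and assumes the default `norm='l2'`; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Time flies like an arrow",
    "Fruit flies like a banana",
    "Cat sat on the mat",
    "The cat is white.",
]

matrix = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Euclidean (L2) length of each document vector
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(row_norms)
```

Each length should come out as 1 up to floating point error.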

Refer to the API documentation for all of the options:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

#### Write tfidf as dataframe

In [6]:
dense = tfidf.todense()
print(dense.shape)
print(dense[0])

(4, 27)
[[0.35657982 0.35657982 0.35657982 0.         0.         0.
  0.         0.28113163 0.28113163 0.         0.         0.
  0.         0.28113163 0.35657982 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.35657982 0.35657982 0.        ]]


In [7]:
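# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead
feature_names = tfidf_vectorizer.get_feature_names_out().tolist()
print(len(feature_names))
print(feature_names)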

27
['an', 'an arrow', 'arrow', 'banana', 'cat', 'cat is', 'cat sat', 'flies', 'flies like', 'fruit', 'fruit flies', 'is', 'is white', 'like', 'like an', 'like banana', 'mat', 'on', 'on the', 'sat', 'sat on', 'the', 'the cat', 'the mat', 'time', 'time flies', 'white']


In [8]:
import pandas as pd
DF = pd.DataFrame(dense)
DF.columns = feature_names  # reuse the term list from the previous cell
DF['text'] = dataset
DF

Unnamed: 0,an,an arrow,arrow,banana,cat,cat is,cat sat,flies,flies like,fruit,...,on the,sat,sat on,the,the cat,the mat,time,time flies,white,text
0,0.35658,0.35658,0.35658,0.0,0.0,0.0,0.0,0.281132,0.281132,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.35658,0.35658,0.0,Time flies like an arrow
1,0.0,0.0,0.0,0.412928,0.0,0.0,0.0,0.325557,0.325557,0.412928,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Fruit flies like a banana
2,0.0,0.0,0.0,0.0,0.274603,0.0,0.348299,0.0,0.0,0.0,...,0.348299,0.348299,0.348299,0.274603,0.0,0.348299,0.0,0.0,0.0,Cat sat on the mat
3,0.0,0.0,0.0,0.0,0.315537,0.400218,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.315537,0.400218,0.0,0.0,0.0,0.400218,The cat is white.


In [9]:
DF.T

Unnamed: 0,0,1,2,3
an,0.35658,0,0,0
an arrow,0.35658,0,0,0
arrow,0.35658,0,0,0
banana,0,0.412928,0,0
cat,0,0,0.274603,0.315537
cat is,0,0,0,0.400218
cat sat,0,0,0.348299,0
flies,0.281132,0.325557,0,0
flies like,0.281132,0.325557,0,0
fruit,0,0.412928,0,0


In [10]:
# Write as CSV file
DF.to_csv('mytfidf.csv', index = False)

### Doc Similarity

Given a new query, how do we find out which document it is closest to?

In [11]:
new = 'Time flies like Sam'
response = tfidf_vectorizer.transform([new])

In [12]:
response_array = response.toarray()

In [13]:
pd.DataFrame(response_array, columns=DF.columns[0:27])

Unnamed: 0,an,an arrow,arrow,banana,cat,cat is,cat sat,flies,flies like,fruit,...,on,on the,sat,sat on,the,the cat,the mat,time,time flies,white
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.401043,0.401043,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.508672,0.508672,0.0


In [14]:
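import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# compare the query with each document row; np.asarray converts the
# np.matrix rows from todense(), which recent scikit-learn versions reject
list(map(lambda x: cosine_similarity(response, np.asarray(x)), dense))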

[array([[0.70100165]]), array([[0.39168692]]), array([[0.]]), array([[0.]])]
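
The same comparison can be done in a single call by passing the whole document matrix to cosine_similarity, which also makes it easy to pick the closest document. The sketch below is self-contained and refits the vectorizer on the four sentences; names such as `doc_matrix` and `best_index` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Time flies like an arrow",
    "Fruit flies like a banana",
    "Cat sat on the mat",
    "The cat is white.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(docs)  # sparse, one row per document

query_vector = vectorizer.transform(["Time flies like Sam"])

# One row of similarities: the query against every document
similarities = cosine_similarity(query_vector, doc_matrix).ravel()
best_index = int(np.argmax(similarities))

print(similarities)
print(docs[best_index])
```

The sparse matrices are passed directly, so no dense copy of the document matrix is needed.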

## Application of TF-IDF in clustering

In [3]:
# each line of the file holds one JSON document
with open('ende_cleaned.json', 'r', encoding='utf-8') as f:
    corpus = f.readlines()
print(len(corpus))

180



In [16]:
# Create the labels for the documents:
# the first 90 documents are German (0), the last 90 are English (1)
label1 = [0] * 90
label2 = [1] * 90
labels = label1 + label2

In [17]:
# Extracting the features
print("Extracting tf-idf features...")
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,3), max_features=10000)
#max_df=0.95, min_df=2, stop_words='english' #USE HELP TO SEE WHAT EACH DOES)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(corpus)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features...
done in 3.358s.


In [18]:
# Convert the sparse TF-IDF matrix to a dense matrix
dense = tfidf.todense()
dense.shape

(180, 10000)

In [19]:
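from sklearn.cluster import KMeans
model = KMeans(n_clusters=2, init='k-means++')
# KMeans accepts the sparse TF-IDF matrix directly; recent scikit-learn
# versions reject the np.matrix returned by todense()
model.fit(tfidf)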

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [20]:
model.labels_

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1], dtype=int32)

In [21]:
kmeans_labels = model.labels_
x=list(zip(kmeans_labels,labels))
print(x)

[(0, 0), (1, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (0, 1), (0, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (0, 1), (0, 1),
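
Reading the zipped pairs by eye gets tedious, so a numeric summary helps. The sketch below uses short made-up label lists rather than the real clustering output. Because KMeans cluster IDs are arbitrary, the helper `two_cluster_agreement` (a hypothetical name) scores both possible mappings between the two clusters and the two classes and keeps the better one. scikit-learn's `adjusted_rand_score` is shown as a permutation-invariant alternative.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def two_cluster_agreement(cluster_ids, true_labels):
    """Fraction of matching documents under the better of the two ID mappings."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    direct_match = float(np.mean(cluster_ids == true_labels))
    # With two clusters, swapping the IDs turns every match into a mismatch
    return max(direct_match, 1.0 - direct_match)


# Made-up example: cluster IDs are flipped relative to the labels
example_clusters = [1, 1, 1, 0, 0, 1]
example_labels = [0, 0, 0, 1, 1, 1]

agreement = two_cluster_agreement(example_clusters, example_labels)
ari = adjusted_rand_score(example_labels, example_clusters)

print(agreement)
print(ari)
```

With the notebook's variables, the calls would be `two_cluster_agreement(kmeans_labels, labels)` and `adjusted_rand_score(labels, kmeans_labels)`.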

In [22]:
results =list(zip(corpus, labels))
print(results[0])

('{"url": "https://de.wikipedia.org/wiki/Wikipedia:Impressum", "article": "\\u201eWikipedia, Die freie Enzyklop\\u00e4die\\u201c ist im Internet unter www.wikipedia.org zu finden, die deutschsprachige Ausgabe unter de.wikipedia.org.Anbieterin dieser Website ist die Wikimedia Foundation Inc., eingetragen beim Florida Department of State, Division of Corporations unter der Nummer N03000005323. Die Wikimedia Foundation ist eine Stiftung nach dem Recht des US-Bundesstaates Florida. Die verantwortliche Ansprechperson \\u2013 gleichzeitig Designated Agent im Sinne des Digital Millennium Copyright Act \\u2013 ist Geoff Brigham.Bei Fragen und Intervieww\\u00fcnschen k\\u00f6nnen Sie sich auf informeller Basis auch gerne an die aktiven deutschsprachigen Ansprechpartner wenden, siehe Wikipedia:Kontakt und Wikipedia:Presse.Wikipedia ist eine freie Enzyklop\\u00e4die. N\\u00e4here Informationen zum Projekt erfahren Sie auf der Seite \\u00dcber Wikipedia sowie im Enzyklop\\u00e4die-Artikel zu Wikip

### Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose a fixed-length vector of arbitrary size. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).
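
The idea can be sketched without any library support. The example below is an illustration only: it uses MD5 from the standard library as a stand-in hash and plain counts, whereas scikit-learn's HashingVectorizer uses a signed MurmurHash3 and normalizes the result, so the numbers will not match its output. The function name `hashed_count_vector` is made up for this sketch.

```python
import hashlib


def hashed_count_vector(text, n_features=20):
    """Count words into n_features buckets chosen by a hash of each word."""
    vector = [0] * n_features
    for token in text.lower().split():
        # Stand-in hash (assumption): MD5 digest interpreted as an integer
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector


first = hashed_count_vector("Time flies like an arrow")
second = hashed_count_vector("Time flies like an arrow")

print(first)
print(first == second)  # the mapping is deterministic, so no vocabulary is stored
```

Two different words can land in the same bucket, which is the collision risk that a small `n_features` brings.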

The HashingVectorizer class implements this approach: it consistently hashes words and can then tokenize and encode documents as needed.

The example below demonstrates the HashingVectorizer by encoding the four documents used in the TF-IDF example.

An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. Recalling computer science classes, I believe there are heuristics for picking the hash length from the estimated vocabulary size and an acceptable probability of collision.
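
One such heuristic is the birthday-problem estimate. The sketch below assumes an idealized hash that spreads words uniformly and independently over the buckets, which real hash functions only approximate, and computes the probability that at least two distinct words share a bucket. The function name `collision_probability` is made up for this example.

```python
def collision_probability(n_words, n_features):
    """Chance that at least two of n_words distinct words share a bucket."""
    if n_words > n_features:
        return 1.0  # more words than buckets: a collision is unavoidable
    p_no_collision = 1.0
    for already_placed in range(n_words):
        # The next word must avoid every bucket that is already occupied
        p_no_collision *= (n_features - already_placed) / n_features
    return 1.0 - p_no_collision


# 14 distinct single words (as in the four example sentences) in 20 buckets
print(collision_probability(14, 20))

# The same 14 words with HashingVectorizer's default of 2**20 buckets
print(collision_probability(14, 2 ** 20))
```

Growing the number of buckets lowers the collision risk at the cost of longer (but still sparse) vectors.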

Note that this vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.

In [23]:
from sklearn.feature_extraction.text import HashingVectorizer
# create the transform
hash_vectorizer = HashingVectorizer(n_features=20)
# encode document
hash_vector = hash_vectorizer.transform(dataset)
# summarize encoded vector
print(hash_vector.shape)
print(hash_vector.toarray())

(4, 20)
[[ 0.         -0.4472136   0.          0.         -0.4472136   0.
   0.          0.          0.          0.          0.         -0.4472136
   0.          0.          0.         -0.4472136   0.          0.4472136
   0.          0.        ]
 [ 0.          0.          0.          0.         -0.5        -0.5
   0.          0.          0.          0.          0.          0.
   0.          0.5         0.         -0.5         0.          0.
   0.          0.        ]
 [ 0.57735027  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.         -0.57735027
   0.          0.          0.          0.          0.          0.
  -0.57735027  0.        ]
 [ 0.          0.          0.          0.          0.         -0.5
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.5
  -0.5         0.5       ]]


In [24]:
hash_vectorizer.n_features

20

In [25]:
flat_list = [item for sublist in dataset for item in sublist.split()]
flat_list
# for sublist in dataset:
#     for item in sublist.split():
#         flat_list.append(item)

['Time',
 'flies',
 'like',
 'an',
 'arrow',
 'Fruit',
 'flies',
 'like',
 'a',
 'banana',
 'Cat',
 'sat',
 'on',
 'the',
 'mat',
 'The',
 'cat',
 'is',
 'white.']