## Vectorizers for converting features into numerical vectors

The following small examples show how the various vectorizers in scikit-learn work.

### DictVectorizer, for encoding features stored in dictionaries

First, let's make a small training set where the instances consist of a mix of symbolic, Boolean, and numerical features.

In [13]:
X = [{'f1':'B', 'f2':'F', 'f3':False, 'f4':7},
    {'f1':'B', 'f2':'M', 'f3':True, 'f4':2},
    {'f1':'O', 'f2':'F', 'f3':False, 'f4':9}]

The [`DictVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) is used when features are stored in dictionaries.

We call `fit_transform`, which is equivalent to first calling `fit` and then `transform`. `fit` goes through the training set and builds a vocabulary of features. `transform` can then carry out the conversion from the list of dictionaries into a numerical matrix.

In [14]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
Xe = vec.fit_transform(X)

# We use toarray in order to convert from a sparse matrix
# into a normal matrix, so that we can print the matrix nicely.
Xe.toarray()

array([[1., 0., 1., 0., 0., 7.],
       [1., 0., 0., 1., 1., 2.],
       [0., 1., 1., 0., 0., 9.]])
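
Calling `fit` and `transform` separately matters when we later need to encode new instances using the vocabulary learned from the training set. The following is a minimal sketch of that workflow. The toy instances and the names `X_train` and `X_test` are invented for illustration. Feature values that were not seen during `fit` (here `f1=X`) are silently ignored by `transform`, so the test row gets zeros in both `f1` columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy data, invented for this illustration.
X_train = [{'f1': 'B', 'f2': 'F', 'f4': 7},
           {'f1': 'O', 'f2': 'M', 'f4': 2}]
X_test = [{'f1': 'X', 'f2': 'F', 'f4': 3}]

vec = DictVectorizer()
vec.fit(X_train)                   # builds the vocabulary from the training set only
Xe_train = vec.transform(X_train)  # same result as fit_transform(X_train)
Xe_test = vec.transform(X_test)    # reuses the columns learned above

print(Xe_test.toarray())
```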

Let's inspect the vocabulary of features. This can be useful, for instance, when we need to interpret the weights of a linear classifier or regression model.

In [15]:
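# Note: get_feature_names was removed in scikit-learn 1.2;
# in recent versions, call vec.get_feature_names_out() instead.
vec.get_feature_names()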

['f1=B', 'f1=O', 'f2=F', 'f2=M', 'f3', 'f4']
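
As a rough sketch of how the vocabulary helps with interpretation, we can pair each feature name with the corresponding weight of a linear model. The toy data, the labels, and the choice of `LogisticRegression` are assumptions made only for this illustration. The `feature_names_` attribute of the fitted `DictVectorizer` lists the names in column order.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data and labels, invented for this illustration.
X_toy = [{'f1': 'B', 'f4': 7}, {'f1': 'O', 'f4': 2},
         {'f1': 'B', 'f4': 9}, {'f1': 'O', 'f4': 1}]
Y_toy = ['yes', 'no', 'yes', 'no']

vec = DictVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(X_toy), Y_toy)

# feature_names_ follows the column order of the encoded matrix,
# so it lines up with the entries of the weight vector.
weights = dict(zip(vec.feature_names_, clf.coef_[0]))
for name, w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f'{name:6s} {w:+.3f}')
```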

### CountVectorizer, for encoding "bags of words" (typically documents)

The [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is designed for converting documents into "bag-of-words" representations, which are typically used when classifying or clustering documents. In this case, the training set consists of a list of documents, where each document is represented as a single string.

Again, we call `fit_transform` to learn the mapping and then carry out the conversion. We then print the resulting matrix and the learned vocabulary.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

X = ['example text',
     'another text']

vec = CountVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())

# Note: in scikit-learn 1.2 and later, use vec.get_feature_names_out() instead.
print(vec.get_feature_names())

[[0 1 1]
 [1 0 1]]
['another', 'example', 'text']


The `CountVectorizer` has its own built-in *preprocessor* (text cleaner) and *tokenizer* (word splitter). In some cases, we deal with documents that have already been preprocessed and split into separate words, or we want to carry out those steps separately for some reason. In those cases, we need to disable the built-in preprocessor and tokenizer.

In the example below, we do this by providing "dummy functions" (the `lambda x: x` part) for the `preprocessor` and `tokenizer` arguments of the `CountVectorizer`'s constructor.

In [17]:
X = [['example', 'text'],
     ['another', 'text']]

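# token_pattern=None just silences the warning that recent scikit-learn
# versions emit when a custom tokenizer is supplied.
vec = CountVectorizer(preprocessor=lambda x: x,
                      tokenizer=lambda x: x,
                      token_pattern=None)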
Xe = vec.fit_transform(X)
print(Xe.toarray())

[[0 1 1]
 [1 0 1]]
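
A possible alternative for pre-tokenized documents is to pass a callable as the `analyzer` argument. A callable analyzer receives each raw document and returns its features, so it replaces preprocessing and tokenization in one step. The sketch below assumes that we do not need n-grams or stop-word filtering, since those are also part of the analyzer that is being replaced.

```python
from sklearn.feature_extraction.text import CountVectorizer

X = [['example', 'text'],
     ['another', 'text']]

# Each document is already a list of tokens, so the analyzer returns it unchanged.
vec = CountVectorizer(analyzer=lambda doc: doc)
Xe = vec.fit_transform(X)

print(Xe.toarray())
print(sorted(vec.vocabulary_))
```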


See also the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which is like a `CountVectorizer` except that it includes a method to give a lower weight to words that occur in many documents (IDF, *inverse document frequency*).
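
To get a feeling for the IDF weighting, the following small sketch fits a `TfidfVectorizer` on the same two toy documents as before and prints the learned IDF weight of each word. With the default settings, the word "text", which occurs in both documents, should receive a lower weight than the words that occur in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X = ['example text',
     'another text']

vec = TfidfVectorizer()
Xe = vec.fit_transform(X)

# idf_ holds one learned IDF weight per vocabulary entry;
# vocabulary_ maps each word to its column index.
idf = {word: vec.idf_[col] for word, col in vec.vocabulary_.items()}
for word in sorted(idf):
    print(word, round(idf[word], 3))
```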

### Document classification example

To exemplify the use of vectorizers for documents, we take a look at a small collection of bug reports from the [Eclipse](https://www.eclipse.org/) project. You can download the dataset [here](http://www.cse.chalmers.se/~richajo/dit866/data/eclipse_bugs.tsv).

This is a simple tab-separated format, where the first column corresponds to the name of the component where the bug occurred, and the second column is the text of the bug report.

In [61]:
import pandas as pd
eclipse_bug_data = pd.read_csv('eclipse_bugs.tsv', sep='\t', header=None, names=['component', 'bugreport'])

eclipse_bug_data.head()

   component                                          bugreport
0   Platform  Java core dump in gtk_ctree_get_node_info This...
1   Platform  [Import/Export] Import existing project wizard...
2   Platform  StyledText - bidi - Win2K/XP support Because S...
3        JDT  Concurrent modification updating classpath Bui...
4   Platform  < wizard > should know the whole path of < cat...


We use scikit-learn's vectorizers to convert the documents into matrices. We try a `CountVectorizer` as well as a `TfidfVectorizer`; as mentioned previously, a `TfidfVectorizer` is similar to a `CountVectorizer` in that both will compute word frequencies in documents, but the `TfidfVectorizer` also downweights words that occur in many documents. The intuition is that words that appear "everywhere" (such as "and", "in", punctuation) are less informative for predictive tasks.

In [58]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Y = eclipse_bug_data.component

vectorizer1 = CountVectorizer()
X_v1 = vectorizer1.fit_transform(eclipse_bug_data.bugreport)

vectorizer2 = TfidfVectorizer()
X_v2 = vectorizer2.fit_transform(eclipse_bug_data.bugreport)

X_v1.shape

(1000, 13895)

In both cases, the result is a matrix with 1000 rows and 13895 columns. What does this tell us about the dataset?

We can now use this dataset with any machine learning algorithm in scikit-learn. This time, we try the [perceptron](https://en.wikipedia.org/wiki/Perceptron), a simple mistake-driven algorithm for training linear classifiers. (We will see more of this algorithm in the next lecture.)

We get a cross-validation accuracy of about 0.81 with the `CountVectorizer` and 0.82 with the `TfidfVectorizer`. With such a small dataset, this difference is too small to draw any firm conclusions.

In [59]:
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

cross_val_score(Perceptron(), X_v1, Y, cv=10).mean()

0.8099999999999999

In [60]:
cross_val_score(Perceptron(), X_v2, Y, cv=10).mean()

0.818
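
In the experiment above, the vectorizers were fit on the whole dataset before cross-validation, so the vocabulary (and the IDF statistics) also reflect the held-out folds. A common alternative is to wrap the vectorizer and the classifier in a pipeline, so that the vectorizer is refit on the training part of each fold. The sketch below illustrates the idea. The tiny corpus and its labels are made up and only stand in for the bug reports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the bug reports.
docs = ['editor crashes on save', 'crash when opening editor',
        'editor window freezes', 'save dialog crashes editor',
        'compiler reports wrong error', 'wrong warning from compiler',
        'compiler misses syntax error', 'error in compiler output']
labels = ['Platform'] * 4 + ['JDT'] * 4

# The pipeline takes raw strings; vectorization happens inside each fold.
pipeline = make_pipeline(TfidfVectorizer(), Perceptron())
scores = cross_val_score(pipeline, docs, labels, cv=2)
print(scores.mean())
```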

### The "hashing trick"

The `HashingVectorizer` is a vectorizer similar to `CountVectorizer`, but which does not need to keep a vocabulary. As discussed in the lecture, we sometimes want to avoid building the vocabulary, for instance because we cannot access the whole training set at once, or because we cannot store the whole training set in memory.

The drawbacks of the `HashingVectorizer` are that we cannot inspect the vocabulary, for instance if we'd like to look at the useful features, and that there is a risk of different features "colliding" in the converted numerical vectors.

Otherwise, we can use the `HashingVectorizer` similarly to a `CountVectorizer`, but note that there is no vocabulary.

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer

X = ['example text',
     'another text']

vec = HashingVectorizer()
Xe = vec.fit_transform(X)
print(Xe)

  (0, 162235)	-0.7071067811865475
  (0, 741852)	-0.7071067811865475
  (1, 741852)	-0.7071067811865475
  (1, 848104)	0.7071067811865475
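
The following sketch illustrates two of the points above. Since there is no vocabulary, `transform` can be called without fitting first, and the number of columns is fixed in advance by the `n_features` parameter (2**20 by default). To make collisions visible, we shrink the vector space to just 4 columns. The five-word document is invented for the illustration. Setting `norm=None` and `alternate_sign=False` keeps plain non-negative counts, so that colliding words simply add up.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# The default output has 2**20 columns, regardless of the input.
vec = HashingVectorizer()
print(vec.transform(['example text']).shape)

# With only 4 columns, 5 distinct words cannot all get a column of their own,
# so at least two of them must share one.
small = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
Xs = small.transform(['alpha beta gamma delta epsilon'])
print(Xs.toarray())
```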


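For your information, the `HashingVectorizer` internally uses a [*hash function*](https://en.wikipedia.org/wiki/Hash_function) to map strings to vector-space dimensions. Here is an example showing how you can compute Python's built-in hash function for a couple of strings. Note that scikit-learn uses its own hash function (MurmurHash3) rather than Python's built-in one, and that Python's string hashes change between interpreter sessions unless `PYTHONHASHSEED` is set. (This isn't something you need to care about when using a `HashingVectorizer`.)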

In [7]:
print(hash("hello"))

print(hash("hello2"))

-3954403858382023101
8629679716113171644
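
To make the connection between hash values and vector-space dimensions concrete, here is a toy sketch of the hashing trick in plain Python. This is not how scikit-learn implements it: the helper names `hashed_column` and `hash_vectorize` are hypothetical, and MD5 from `hashlib` is used only because it gives the same values in every Python session. The idea is the same, though: each token is hashed, and the hash value modulo the number of columns picks the column to increment.

```python
import hashlib

def hashed_column(token, n_features=16):
    """Map a token to a column index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % n_features

def hash_vectorize(doc, n_features=16):
    """Count whitespace-separated tokens into a fixed-size list, with no vocabulary."""
    counts = [0] * n_features
    for token in doc.split():
        counts[hashed_column(token, n_features)] += 1
    return counts

for doc in ['example text', 'another text']:
    print(hash_vectorize(doc))
```

Because the column depends only on the token itself, the word "text" ends up in the same column in both documents, even though no vocabulary was ever built.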
