We'll work with a small amount of fake data to illustrate, but nothing in principle prevents this from working with any set of text data.

In [1]:
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps behind the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
    "This lazy dog has a Twitter account that the fox subscribes to."
]

We import the count vectorizer as usual.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

To recap, we turn texts into count vectors by initializing a Count Vectorizer and calling its ``fit_transform`` method.

In [3]:
cv = CountVectorizer()
vectors = cv.fit_transform(texts)

In [4]:
vectors = vectors.toarray()

In [5]:
print(vectors.shape)
vectors

(4, 19)


array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)

The information about *which element in a vector* corresponds to *which word* is in the vectorizer object, as the attribute ``vocabulary_``.

In [6]:
print(cv.vocabulary_)

{'fox': 5, 'that': 14, 'lazy': 8, 'beyond': 2, 'over': 10, 'jumps': 7, 'this': 16, 'subscribes': 13, 'has': 6, 'leaps': 9, 'brown': 3, 'quick': 11, 'account': 0, 'the': 15, 'sleeping': 12, 'behind': 1, 'dog': 4, 'to': 17, 'twitter': 18}


The element at index 0 of the first count vector has a value of 0, so we know the word "account" appears in it 0 times. The element at index 3 of the first count vector has a value of 1, so we know "brown" appears in it 1 time.
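To read a whole vector this way, it helps to invert the vocabulary so that we can go from index to word. The sketch below uses the vocabulary and the first count vector copied from the outputs above; ``index_to_word`` and ``word_counts`` are names introduced just for this illustration.

```python
# Vocabulary copied from the output of cv.vocabulary_ above.
vocabulary = {'fox': 5, 'that': 14, 'lazy': 8, 'beyond': 2, 'over': 10, 'jumps': 7,
              'this': 16, 'subscribes': 13, 'has': 6, 'leaps': 9, 'brown': 3,
              'quick': 11, 'account': 0, 'the': 15, 'sleeping': 12, 'behind': 1,
              'dog': 4, 'to': 17, 'twitter': 18}

# First row of the count matrix printed above.
first_vector = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0]

# Invert the mapping so we can look up the word stored at each index.
index_to_word = {index: word for word, index in vocabulary.items()}

# Keep only the words that actually occur in the first text.
word_counts = {index_to_word[i]: count
               for i, count in enumerate(first_vector) if count > 0}
print(word_counts)
```

The result lists each word of the first sentence with its count, with "the" counted twice.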

Sometimes, we want to modify how the Count Vectorizer does its vectorization. We can do this by *passing parameters to the constructor during initialization*.

For example, we have seen in some readings that **presence vectors** do better than **count vectors**. In a **presence vector**, the value 1 indicates that the word is in the text and 0 indicates that it is absent; we don't care how many times the word appears.

The constructor has a named argument called ``binary`` that, if set to ``True``, will produce presence vectors.

In [7]:
cv = CountVectorizer(binary=True)
vectors = cv.fit_transform(texts)
print(vectors.shape)
vectors.toarray()

(4, 19)


array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)

Notice that in the first three sample sentences, the word "the" appears twice in the same text, and is therefore represented by a 2 in the count vectors. The presence vectors, however, only have values of 0 and 1.
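Conceptually, a presence vector is just a count vector with every nonzero entry clipped to 1. Here is a minimal sketch in plain Python, using the first row of the count matrix above. It only illustrates the idea and makes no claim about how ``CountVectorizer`` implements ``binary=True`` internally.

```python
# First row of the count matrix from the earlier output.
count_row = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0]

# Any positive count becomes 1; zeros stay 0.
presence_row = [1 if count > 0 else 0 for count in count_row]
print(presence_row)
```

The only entry that changes is the one for "the", which goes from 2 to 1.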

# N-Grams

As we've discussed before, an n-gram is a combination of n consecutive words. In the first sentence, ("the", "quick") is a bigram, and ("quick", "brown") is a bigram. Considering n-grams *preserves some of the information in the order of the words*. For example, with a bigram the classifier can know not just that the sentence had "quick" and "brown", but that they occurred together.

However, this comes at a cost: since there are *many combinations of 2 words*, the vector becomes *much longer*. The effect is even stronger if you consider 3-grams, 4-grams, and so on. In addition, the data might become *very sparse*, since combinations of words are rarer than single words.
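To make the definition concrete, here is a minimal sketch of n-gram extraction in plain Python. The helper ``word_ngrams`` is a hypothetical name for this illustration, and the sentence is split on whitespace for simplicity.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()

bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of 9 tokens yields 8 bigrams and 7 trigrams, and almost none of them repeat, which is where the sparsity comes from.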

The Count Vectorizer constructor has a parameter called ``ngram_range`` that specifies which n-grams to count, as a ``(min_n, max_n)`` pair. The default value is ``(1, 1)``, which means it considers only unigrams.

Here's an example where we consider only bigrams.

In [8]:
cv = CountVectorizer(ngram_range=(2, 2))
bigram_vectors = cv.fit_transform(texts)
print(bigram_vectors.shape)
bigram_vectors.toarray()

(4, 25)


array([[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        1, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 1, 1]], dtype=int64)

Note the dimensionality has increased. There are 19 unique words, but there are 25 unique bigrams.

If we peek into the vocabulary, we can see that the algorithm is counting combinations of two words, instead of single words.

In [9]:
cv.vocabulary_

{'account that': 0,
 'behind the': 1,
 'beyond the': 2,
 'brown fox': 3,
 'dog has': 4,
 'fox jumps': 5,
 'fox leaps': 6,
 'fox subscribes': 7,
 'has twitter': 8,
 'jumps behind': 9,
 'jumps over': 10,
 'lazy brown': 11,
 'lazy dog': 12,
 'leaps beyond': 13,
 'over the': 14,
 'quick brown': 15,
 'sleeping dog': 16,
 'subscribes to': 17,
 'that the': 18,
 'the fox': 19,
 'the lazy': 20,
 'the quick': 21,
 'the sleeping': 22,
 'this lazy': 23,
 'twitter account': 24}

Here's an example of a vectorizer that considers *both* unigrams and bigrams.

In [10]:
cv = CountVectorizer(ngram_range=(1, 2))
ngram_vectors = cv.fit_transform(texts)
print(ngram_vectors.shape)
ngram_vectors.toarray()

(4, 44)


array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=int64)

Considering unigrams and bigrams simultaneously has increased our dimensionality to 44 (19 unigrams + 25 bigrams).

In [11]:
cv.vocabulary_

{'account': 0,
 'account that': 1,
 'behind': 2,
 'behind the': 3,
 'beyond': 4,
 'beyond the': 5,
 'brown': 6,
 'brown fox': 7,
 'dog': 8,
 'dog has': 9,
 'fox': 10,
 'fox jumps': 11,
 'fox leaps': 12,
 'fox subscribes': 13,
 'has': 14,
 'has twitter': 15,
 'jumps': 16,
 'jumps behind': 17,
 'jumps over': 18,
 'lazy': 19,
 'lazy brown': 20,
 'lazy dog': 21,
 'leaps': 22,
 'leaps beyond': 23,
 'over': 24,
 'over the': 25,
 'quick': 26,
 'quick brown': 27,
 'sleeping': 28,
 'sleeping dog': 29,
 'subscribes': 30,
 'subscribes to': 31,
 'that': 32,
 'that the': 33,
 'the': 34,
 'the fox': 35,
 'the lazy': 36,
 'the quick': 37,
 'the sleeping': 38,
 'this': 39,
 'this lazy': 40,
 'to': 41,
 'twitter': 42,
 'twitter account': 43}
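The vocabulary sizes above can be double-checked without ``sklearn``. The sketch below assumes the default tokenization amounts to lowercasing the text and keeping runs of two or more word characters; ``tokenize`` and ``ngrams`` are hypothetical helpers written for this check.

```python
import re

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps behind the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
    "This lazy dog has a Twitter account that the fox subscribes to."
]

# Assumed approximation of the default tokenizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams, bigrams = set(), set()
for text in texts:
    tokens = tokenize(text)
    unigrams.update(ngrams(tokens, 1))
    bigrams.update(ngrams(tokens, 2))

print(len(unigrams), len(bigrams), len(unigrams | bigrams))  # 19 25 44
```

Note that the one-letter word "a" is dropped, which is why the vocabulary contains 'has twitter' rather than 'has a'.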

# Custom Tokenization

``CountVectorizer`` has a default tokenizer that we have been relying on up to this point. However, there are many situations that call for finer-grained control of the tokenization process.

In [12]:
tweettext = [
    "RT @joeschmoe I'm really interested in your tutorial on #python!",
    "@annaschneider Thanks! Hopefully there'll be more #python lessons on the way!"]
cv = CountVectorizer()
cv.fit_transform(tweettext)
cv.vocabulary_

{'annaschneider': 0,
 'be': 1,
 'hopefully': 2,
 'in': 3,
 'interested': 4,
 'joeschmoe': 5,
 'lessons': 6,
 'll': 7,
 'more': 8,
 'on': 9,
 'python': 10,
 'really': 11,
 'rt': 12,
 'thanks': 13,
 'the': 14,
 'there': 15,
 'tutorial': 16,
 'way': 17,
 'your': 18}

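Note that the ``sklearn`` Count Vectorizer lowercased "RT" to "rt", stripped the @ sign from the mentions, and stripped the # sign from the hashtags. It also split "there'll" into "there" and "ll" and dropped "I'm" entirely, because the default tokenizer only keeps runs of two or more word characters.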
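We can see where this behavior comes from by applying the default token pattern by hand. The sketch below assumes the default ``token_pattern`` of ``CountVectorizer`` is the regular expression shown, applied after lowercasing.

```python
import re

# Assumed default token_pattern: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tweet = "RT @joeschmoe I'm really interested in your tutorial on #python!"

# Lowercase first, then extract the tokens.
tokens = DEFAULT_TOKEN_PATTERN.findall(tweet.lower())
print(tokens)
```

The "@", "#", and "!" characters are not word characters, so they never make it into a token, and "i" and "m" are each too short to match.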

We know that NLTK has a great Tweet Tokenizer:

In [13]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
tt.tokenize(tweettext[0])

['RT',
 '@joeschmoe',
 "I'm",
 'really',
 'interested',
 'in',
 'your',
 'tutorial',
 'on',
 '#python',
 '!']

What do we do if we want the CountVectorizer object to use a *different tokenizer* before counting tokens and making count vectors? Again, we turn to a parameter that we specify when we call the constructor.

This parameter is called ``tokenizer`` and *we pass it the function that we want to use to tokenize* (the function itself, without calling it).

Normally, if we're tokenizing a single text with the initialized Tweet tokenizer, we'd type:

```python
tt.tokenize(sometext)
```

When we want to pass it to a Count Vectorizer object, we use the parameter ``tokenizer=tt.tokenize``.

In [14]:
cv = CountVectorizer(tokenizer=tt.tokenize)
# CV will now use tt.tokenize() to tokenize a text it receives.
tweetvector = cv.fit_transform(tweettext)

In [15]:
tweetvector.toarray()

array([[1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [2, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0]], dtype=int64)

In [16]:
cv.vocabulary_

{'!': 0,
 '#python': 1,
 '@annaschneider': 2,
 '@joeschmoe': 3,
 'be': 4,
 'hopefully': 5,
 "i'm": 6,
 'in': 7,
 'interested': 8,
 'lessons': 9,
 'more': 10,
 'on': 11,
 'really': 12,
 'rt': 13,
 'thanks': 14,
 'the': 15,
 "there'll": 16,
 'tutorial': 17,
 'way': 18,
 'your': 19}

We can see that Count Vectorizer kept the mentions and hashtags intact, because it used the ``tt.tokenize`` function to split the text strings instead of the default tokenizer.
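The ``tokenizer`` parameter is not tied to NLTK: any function that takes a string and returns a list of tokens will do. As a rough sketch, here is a hypothetical ``simple_tweet_tokenize`` function built on a hand-written regular expression. It is far less thorough than ``TweetTokenizer`` and is only meant to show the shape of such a function.

```python
import re

# Hypothetical pattern: a mention or hashtag, a word (optionally with
# apostrophes inside it), or a single punctuation character.
TOKEN_RE = re.compile(r"[@#]\w+|\w+(?:'\w+)*|[^\w\s]")

def simple_tweet_tokenize(text):
    """Split a tweet into tokens, keeping @mentions and #hashtags whole."""
    return TOKEN_RE.findall(text)

tweet = "RT @joeschmoe I'm really interested in your tutorial on #python!"
tokens = simple_tweet_tokenize(tweet)
print(tokens)

# Assumed usage, mirroring the cell above:
# cv = CountVectorizer(tokenizer=simple_tweet_tokenize)
```

For this tweet the function produces the same tokens as ``tt.tokenize`` did earlier.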

# Character N-Grams

A *character n-gram* is a sequence of n consecutive characters, so here we **count characters instead of words**. This might seem strange at first, but machine learning researchers have found that in some situations, character n-grams provide better performance than word n-grams.

To see what a character n-gram looks like, let's turn to ``sklearn``. 

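``CountVectorizer`` does have an ``analyzer`` parameter that can be set to ``'char'`` to look at characters instead of words, but we don't need it here: the ``tokenizer`` parameter we already know is enough.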

If you think about it, counting characters instead of words just involves a different way of tokenizing the text. Instead of **splitting the text into words** somehow, we split it into characters.

We can use our knowledge of the CountVectorizer parameter ``tokenizer`` to take advantage of this. We'll pass it a function that takes a string and splits it into all its characters.

In Python, splitting a text string into its constituent characters is actually quite easy. We simply convert the string into a list directly:

In [17]:
astring = "Hello world!"
characters = list(astring)
print(characters)

['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']


This means we can simply pass the ``list`` function directly to the Count Vectorizer's ``tokenizer`` parameter.

In [18]:
cv = CountVectorizer(tokenizer=list)
character_vectors = cv.fit_transform(texts)

In [19]:
character_vectors.toarray()

array([[ 8,  1,  1,  1,  1,  1,  3,  1,  1,  2,  1,  1,  1,  1,  1,  1,  4,
         1,  1,  2,  1,  2,  2,  1,  1,  1,  1,  1],
       [ 8,  1,  1,  2,  1,  2,  3,  1,  1,  3,  2,  1,  1,  1,  1,  2,  3,
         1,  1,  1,  1,  2,  2,  0,  1,  1,  1,  1],
       [ 8,  1,  2,  2,  0,  2,  6,  1,  2,  2,  1,  0,  0,  3,  0,  3,  4,
         2,  0,  1,  2,  2,  0,  0,  1,  1,  2,  1],
       [11,  1,  5,  2,  3,  1,  3,  1,  1,  4,  3,  0,  0,  1,  0,  1,  4,
         0,  0,  2,  5,  9,  2,  0,  1,  1,  1,  1]], dtype=int64)

In [20]:
cv.vocabulary_

{' ': 0,
 '.': 1,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'q': 18,
 'r': 19,
 's': 20,
 't': 21,
 'u': 22,
 'v': 23,
 'w': 24,
 'x': 25,
 'y': 26,
 'z': 27}

We can see from ``cv.vocabulary_`` that the resulting vectors contain counts of characters, including the spaces and periods.
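The vectors above count single characters, that is, character 1-grams. For larger n, a character n-gram is a run of n consecutive characters, spaces included. Here is a minimal sketch in plain Python, where ``char_ngrams`` is a hypothetical helper for illustration.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("lazy dog", 3)
print(trigrams)
```

Notice that some of the trigrams straddle the word boundary, such as "y d".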

# Character n-grams AND Word n-grams

The Burger paper on gender in tweets used character n-grams and word n-grams together. They considered all character n-grams from 1 to 5, and all word 1-grams and 2-grams. Furthermore, they used presence vectors, not count vectors. Using our synthetic data, we'll represent the texts in the same way.

In [21]:
word_vectorizer = CountVectorizer(tokenizer=tt.tokenize, binary=True, ngram_range=(1, 2))
character_vectorizer = CountVectorizer(tokenizer=list, binary=True, ngram_range=(1, 5))

word_vectors = word_vectorizer.fit_transform(texts)
character_vectors = character_vectorizer.fit_transform(texts)

We now have a matrix containing presence values for the word n-grams:

In [22]:
word_vectors.shape

(4, 49)

There are 4 rows, 1 for each text, and 49 columns, so there are 49 unique 1-grams and 2-grams.

We also have a matrix containing presence values for the character n-grams:

In [23]:
character_vectors.shape

(4, 481)

It has 481 columns, so in this dataset there are 481 unique character 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams. 
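One detail to keep in mind when reading ``character_vectorizer.vocabulary_``: ``CountVectorizer`` builds n-grams by joining consecutive tokens with a single space. That is what we want for words, but it means character n-grams built with ``tokenizer=list`` are stored with a space between each character. The sketch below imitates that joining step in plain Python; ``joined_ngrams`` is a hypothetical helper, not part of ``sklearn``.

```python
def joined_ngrams(tokens, n):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokenizing with list() gives one token per character.
characters = list("the dog")

trigram_keys = joined_ngrams(characters, 3)
print(trigram_keys[:2])
```

So the character trigram "the" would show up in the vocabulary as the key 't h e'. This doesn't change which n-grams are counted, only how the keys look.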

Using a ``scipy`` function called ``hstack`` (standing for "horizontal stack"), we can glue the two matrices together side by side, row by row, so that we'll have a single new matrix with 4 rows and 49+481=530 columns.

``hstack`` takes a sequence (a list or a tuple) as its argument; in it, list the matrices you want to paste together, in order.

In [24]:
from scipy.sparse import hstack
vectors = hstack((word_vectors, character_vectors))
vectors.shape

(4, 530)
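Conceptually, horizontal stacking appends each row of the second matrix to the matching row of the first. Here is a plain-Python sketch with small made-up lists (not the sparse matrices above) to show how the shapes add up.

```python
# Toy stand-ins: 2 texts, 2 word features and 3 character features.
word_rows = [[1, 0],
             [0, 1]]
char_rows = [[1, 1, 0],
             [0, 1, 1]]

# Concatenate matching rows: word features first, character features after.
stacked = [w + c for w, c in zip(word_rows, char_rows)]

print(stacked)
print(len(stacked), len(stacked[0]))  # 2 5
```

The number of rows stays the same, and the number of columns is the sum of the two widths.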

Right now, we have two separate vocabulary dictionaries, one for the word n-grams, one for the character n-grams.

When we merged the matrices, we **attached the character vectors to the word vectors**. This means the word vector dictionary is correct (their positions didn't change), but we need to update the character n-grams dictionary.

Right now, the character n-gram dictionary tells us which character n-gram corresponds to the 0th element of a character vector, which corresponds to the 1st element, and so on. **But since we concatenated the vectors, these positions no longer hold**.

If we think about this, the solution is simple. Right now, the character vocabulary associates the space symbol with index 0. But in the merged matrix, the space symbol comes *after* all the word n-grams. We know there are a total of 49 word n-grams, so the new index of the space should be 0 + 49 = 49. The character vocabulary associates the bigram " a" with index 1. However, since the character n-grams come after the 49 word n-grams, it should now be associated with the index 1 + 49 = 50.

In other words, we should go into the character dictionary and **add 49 to all of the indices**.

In [25]:
word_ngrams_count = len(word_vectorizer.vocabulary_)

new_character_vocabulary = {}

for character_ngram, index in character_vectorizer.vocabulary_.items():
    new_character_vocabulary[character_ngram] = index + word_ngrams_count

If you're working with the final matrix, and you need to know which words correspond to which index, you can use ``word_vectorizer.vocabulary_`` for the word n-grams and ``new_character_vocabulary`` for the character n-grams. 
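If you would rather have a single lookup for the merged matrix, one option is to map each column index to a pair of kind and n-gram. We key by index rather than by n-gram because the same string can be both a word n-gram and a character n-gram (for example "a" or "."), so merging the two dictionaries directly would overwrite entries. The sketch below uses small made-up vocabularies, and ``build_index_lookup`` is a hypothetical helper.

```python
def build_index_lookup(word_vocab, char_vocab):
    """Map each column of the stacked matrix to ('word' | 'char', n-gram)."""
    offset = len(word_vocab)  # character columns start after the word columns
    lookup = {}
    for ngram, index in word_vocab.items():
        lookup[index] = ("word", ngram)
    for ngram, index in char_vocab.items():
        lookup[index + offset] = ("char", ngram)
    return lookup

# Toy vocabularies standing in for the two vocabulary_ dictionaries.
toy_word_vocab = {".": 0, "a": 1, "dog": 2}
toy_char_vocab = {" ": 0, "a": 1, "d": 2}

lookup = build_index_lookup(toy_word_vocab, toy_char_vocab)
print(lookup[1])  # ('word', 'a')
print(lookup[4])  # ('char', 'a')
```

With the real data, you would pass ``word_vectorizer.vocabulary_`` and ``character_vectorizer.vocabulary_`` instead of the toy dictionaries.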