# Application: Vectorize tweets using Bag of Words

Bag of words vectorization uses each possible word (or token) as a feature that is present or absent. Each feature value can either be the number of occurrences of a token for an instance or simply denote the presence or absence of a token with a 1 or a 0.
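
To make the two encodings concrete, here is a minimal sketch using a hypothetical five-token vocabulary (the names `vocab`, `tok2idx`, and `bag_of_words` are made up for illustration and are not part of the tokenizer used later):

```python
from collections import Counter

# Hypothetical toy vocabulary; a real one would be learned from the corpus.
vocab = ["good", "morning", "coffee", "love", "so"]
tok2idx = {tok: idx for idx, tok in enumerate(vocab)}

def bag_of_words(tokens, binary=False):
    """Return one value per vocabulary token: a count, or a 0/1 presence flag."""
    counts = Counter(tok for tok in tokens if tok in tok2idx)
    vector = [counts[tok] for tok in vocab]
    if binary:
        vector = [1 if count > 0 else 0 for count in vector]
    return vector

tokens = "so so good coffee".split()
count_vector = bag_of_words(tokens)
binary_vector = bag_of_words(tokens, binary=True)
print(count_vector)   # [1, 0, 1, 0, 2]
print(binary_vector)  # [1, 0, 1, 0, 1]
```

Note that word order is discarded: only which tokens occur (and how often) survives.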

This creates a very large feature space where every possible word is a feature.

Each tweet, especially given Twitter's length limits, contains only a small number of words.

For example, if the vocabulary of words is 100,000 tokens and each tweet on average is 8 words, then the bag of words representation for each tweet will have around 99,992 zeros. This will require a lot of memory storage, especially as the number of tweets (and/or the vocabulary) grows.
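
The arithmetic above can be checked in a few lines. The corpus size and the one-byte element width below are assumptions chosen only to give a sense of scale:

```python
vocab_size = 100_000
words_per_tweet = 8  # assumes all 8 words are distinct tokens

zeros_per_tweet = vocab_size - words_per_tweet
density = words_per_tweet / vocab_size

# Hypothetical corpus size and element width, for illustration only
num_tweets = 1_000_000
bytes_per_value = 1  # e.g., int8 counts

dense_bytes = num_tweets * vocab_size * bytes_per_value
print(zeros_per_tweet)                                # 99992
print(f"density = {density:.6f}")
print(f"dense storage = {dense_bytes / 1e9:.0f} GB")  # 100 GB
```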

### Reducing the feature space

There are techniques for reducing the number of tokens in the vocabulary and hence the number of features, including:

* Replacing new or low-frequency terms with the "unk" (unknown) token
* Pruning out "stop words", i.e., words that are so common that they convey little meaning (like "the" and "and")
   * Note that these would likely depend on each problem and its semantic scope.
* Normalizing tokens (e.g., common-casing, stripping punctuation) to reduce duplication
* Collapsing categories of tokens into a single token representing the full category (for example, all numbers are represented by the "number" token)

Application of any of these or other techniques depends on the problem being solved.
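
As a rough sketch of how these techniques combine, the snippet below applies all four to a tiny made-up corpus. The stop-word list, the frequency threshold, and the `<number>` and `<unk>` placeholder spellings are assumptions for illustration:

```python
import re
from collections import Counter

# Hypothetical stop-word list and frequency threshold
STOP_WORDS = {"the", "and", "a"}
MIN_FREQ = 2

def normalize(text):
    """Lowercase, strip surrounding punctuation, drop stop words, collapse numbers."""
    tokens = []
    for raw in text.lower().split():
        tok = raw.strip(".,!?;:\"'()")
        if not tok or tok in STOP_WORDS:
            continue
        if re.fullmatch(r"\d+([.,]\d+)*", tok):
            tok = "<number>"
        tokens.append(tok)
    return tokens

corpus = ["The cat and the hat!", "A cat ate 3 fish.", "The CAT has 12 hats"]
tokenized = [normalize(text) for text in corpus]

# Replace low-frequency tokens with the unknown token
freqs = Counter(tok for toks in tokenized for tok in toks)
reduced = [[tok if freqs[tok] >= MIN_FREQ else "<unk>" for tok in toks]
           for toks in tokenized]

vocab_before = set(freqs)
vocab_after = {tok for toks in reduced for tok in toks}
print(len(vocab_before), "->", len(vocab_after))  # 7 -> 3
print(reduced[1])  # ['cat', '<unk>', '<number>', '<unk>']
```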

### Reducing the storage requirements

We can also use efficient data structures for storing relevant data instead of carrying around all those zeros. And, thanks to scipy's sparse package, we can do this without also losing all of the matrix math provided by numpy arrays!
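
A small sketch of the idea, using a randomly generated toy matrix (the sizes and the random seed are arbitrary assumptions): the CSR format stores only the non-zero values plus their positions, and the `@` operator still works.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy feature matrix: 1,000 "tweets" by 10,000 "tokens", 8 ones per row
rng = np.random.default_rng(0)
num_rows, num_cols, per_row = 1_000, 10_000, 8
rows = np.repeat(np.arange(num_rows), per_row)
cols = np.concatenate([rng.choice(num_cols, size=per_row, replace=False)
                       for _ in range(num_rows)])
data = np.ones(rows.shape[0], dtype=np.int8)
X_sparse = csr_matrix((data, (rows, cols)), shape=(num_rows, num_cols))

# CSR keeps three arrays: the values, their column indices, and row pointers
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
dense_bytes = num_rows * num_cols * X_sparse.dtype.itemsize
print(f"sparse: {sparse_bytes} B, dense: {dense_bytes} B")

# Matrix math still works directly on the sparse matrix
weights = rng.standard_normal(num_cols)
scores = X_sparse @ weights
print(scores.shape)  # (1000,)
```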

## Case study: Tweet vectorization

Let's walk through the concrete example of vectorizing tweets for the purpose of predicting emojis relevant to each tweet using bag of words.

First, some initializations including reading in our sample tweets...

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import re
import scipy
import sys
from collections import Counter
from scipy.sparse import coo_matrix, csc_matrix, csr_matrix, eye, hstack, linalg


In [None]:
%%bash

# Download the tweet data
mkdir -p ../.data
cd ../.data
if [ ! -f emoji_tweets_5k.csv ]; then
    echo "File not found. Downloading from s3"
    wget -q https://s3-us-west-2.amazonaws.com/resero2/datasets/ml-foundations/emoji_tweets_5k.csv
else
    echo "File exists, not downloading from s3"
fi


In [None]:
# Load the tweet data
import csv
import json

texts = []
emojis = []

with open("../.data/emoji_tweets_5k.csv") as infile:
    for row in csv.reader(infile):
        text = json.loads(row[1]).strip()
        texts.append(text)
        emojis.append(json.loads(row[2]))

print(f'Text count: {len(texts)}')
print(f'Emojis count: {len(emojis)}')

#### To recall what these data look like...

In [None]:
texts[:5]

In [None]:
emojis[:5]

### Let's tokenize tweets into words such that we:

* Collapse all special twitter tokens:
   * user
   * hashtag
   * ticker
   * signed
* Collapse urls and emails
* Collapse tokens having a frequency of 5 or below (including new tokens) as "unknown"
* Normalize token case
* Remove trailing punctuation and empty tokens
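
To give a feel for the collapsing steps, here is a rough regex-based sketch. It is not the tokenizer class used below: the patterns, the placeholder spellings, and the standalone `collapse` function are all assumptions for illustration, and frequency-based "unknown" replacement is left out.

```python
import re

# Hypothetical patterns and placeholders; order matters (emails before users)
PATTERNS = [
    (re.compile(r"^https?://\S+$"), "<url>"),
    (re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]+$"), "<email>"),
    (re.compile(r"^@\w+$"), "<user>"),
    (re.compile(r"^#\w+$"), "<hashtag>"),
    (re.compile(r"^\$[a-z]+$"), "<ticker>"),
    (re.compile(r"^\^\w+$"), "<signed>"),
]

def collapse(text):
    """Normalize case, strip surrounding punctuation, and collapse special tokens."""
    out = []
    for raw in text.lower().split():
        tok = raw.strip(".,!?;:")  # strips both leading and trailing punctuation
        if not tok:
            continue  # drop empty tokens
        for pattern, placeholder in PATTERNS:
            if pattern.match(tok):
                tok = placeholder
                break
        out.append(tok)
    return out

collapsed = collapse('.@mention1 @mention2 #hashtag $ticker me@foo.com '
                     'http://bit.ly wow. ^signed jk.')
print(' '.join(collapsed))
# <user> <user> <hashtag> <ticker> <email> <url> wow <signed> jk
```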

To spare you the gory details, I created a tokenizer class that does all of this.

So let's test this a little bit and build our feature matrix...

In [None]:
import util.token_model as token_model

In [None]:
tokenizer = token_model.TokenModel()
tokenizer.fit(texts)

### Take a peek at what to expect when collapsing tokens:

In [None]:
np.info(token_model.TokenModel.collapse)

In [None]:
' '.join(tokenizer.tokenize('.@mention1 @mention2 #hashtag $ticker me@foo.com http://bit.ly wow. ^signed jk.'))

In [None]:
for text in texts[0:5]:
    tokenized = ' '.join(tokenizer.tokenize(text))
    print(f'\t{text}\n\t{tokenized}\n')

### Visualize the token frequencies...

* Count the number of occurrences of each token in the data set
* Lay the tokens down in order from most to least frequent
* The first token will be the most frequent and will have rank 0
   * Note that those tokens with the same count will be adjacent, but arbitrarily ranked
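
The ranking described above can be sketched with `collections.Counter` on a toy token list (the tokens and the threshold of 1 are made up for illustration):

```python
from collections import Counter

tokens = "a b a c b a d".split()
tok2freq = Counter(tokens)

# most_common() orders tokens from most to least frequent;
# the position in this list is the token's rank (starting at 0)
ranked = tok2freq.most_common()
print(ranked)  # [('a', 3), ('b', 2), ('c', 1), ('d', 1)]

# Rank of the first token whose frequency is at or below a threshold
threshold = 1
cutoff = next(rank for rank, (_, freq) in enumerate(ranked) if freq <= threshold)
print(cutoff)  # 2
```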

In [None]:
# Plot token frequencies ranked from highest to lowest
def show_freqs(start_freq=0):
    _, axis = plt.subplots()
    freqs = np.array([value for _,value in tokenizer.tok2freq.most_common()], dtype=np.int32)
    plt.plot(range(start_freq, len(freqs)), freqs[start_freq:])
    plt.xlabel('token rank (ordered most to least frequent)')
    plt.ylabel('token frequency (count of tokens)')
    maxfreq = freqs[start_freq]
    cutoff = np.where(freqs <= 5)[0][0]
    axis.annotate(f'freq<=5 (rank={cutoff}, token="{tokenizer.tok2freq.most_common()[cutoff][0]}")',
                  xy=(cutoff, freqs[cutoff]),
                  xytext=(cutoff+200, freqs[cutoff]+maxfreq/6),
                  arrowprops=dict(facecolor='black', shrink=0.05, width=2))
show_freqs()

In [None]:
show_freqs(start_freq=100)

In [None]:
# what are the most frequent tokens in the dataset (histogram):
tokenizer.tok2freq.most_common(10)

In [None]:
# what are the least frequent tokens in the dataset (histogram):
tokenizer.tok2freq.most_common()[-10:]

## With tokenization worked out, it's time to vectorize all of the tweets...

* Create a sparse COO matrix where
   * each row represents a tweet
   * each column holds the count of one token, so that each row is the bag of words vector for its tweet.
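
As a minimal sketch of how such a matrix can be assembled (the toy vocabulary and tweets are assumptions, and this is not necessarily how the tokenizer's `transform` is implemented), COO format takes (row, column, value) triplets. Converting to CSR sums duplicate entries into counts and gives the layout that exposes `indices` and `indptr`:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical vocabulary and already-tokenized tweets
vocab = ["good", "morning", "coffee", "so"]
tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
tokenized_tweets = [["good", "morning"], ["so", "so", "good", "coffee"]]

# One (row, col, 1) triplet per token occurrence
rows, cols = [], []
for row, tokens in enumerate(tokenized_tweets):
    for tok in tokens:
        rows.append(row)
        cols.append(tok2idx[tok])
data = np.ones(len(rows), dtype=np.int32)

X_coo = coo_matrix((data, (rows, cols)),
                   shape=(len(tokenized_tweets), len(vocab)))
X_csr = X_coo.tocsr()  # duplicate (row, col) entries are summed into counts
print(X_csr.toarray())
# [[1 1 0 0]
#  [1 0 1 2]]
```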

#### Let's vectorize just one to get a feel for what's going on...

In [None]:
# Grab a tweet
tweet_idx = 0

print(texts[tweet_idx])
print(' '.join(tokenizer.tokenize(texts[tweet_idx])))

In [None]:
# Vectorize this tweet
tweet_tokens = tokenizer.transform(texts[tweet_idx]).toarray()
tweet_tokens

In [None]:
np.where(tweet_tokens > 0)[1]

In [None]:
# Reverse lookup the tokens that were set
nonzero = np.where(tweet_tokens > 0)
counts = tweet_tokens[nonzero]
ranks = nonzero[1]
tokens = [tokenizer.idx2tok(rank) for rank in ranks]
for token, count, rank in zip(tokens, counts, ranks):
    print(f'\t{token}\tcount={count}\trank={rank}')

#### Now let's vectorize them all!

In [None]:
# Vectorize
X = tokenizer.transform(texts)

# Plot the "spy" chart of the resulting feature matrix
plt.spy(X.toarray(), aspect='auto')
plt.show()
print(f'X.shape = {X.shape}, meaning we have {X.shape[0]} tweets, each having {X.shape[1]} features')
elt_count = X.shape[0] * X.shape[1]
sparse_size = X.data.nbytes + X.indptr.nbytes + X.indices.nbytes
dense_size = 8 * elt_count  # assumes 8 bytes per element (e.g., int64 or float64)
print(f'sparse size = {sparse_size} B, dense size = {dense_size} B')
density = X.count_nonzero() / elt_count
print(f'density = {density}')

#### Since we're counting token occurrences within tweets, let's see how many repetitions occur
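
For reference, the same kind of tally can be sketched on a toy matrix with `scipy.sparse.find` and `np.unique`. This only illustrates the idea and is not the tokenizer's own helper:

```python
import numpy as np
from scipy.sparse import csr_matrix, find

X_toy = csr_matrix(np.array([[1, 0, 2, 0],
                             [0, 1, 0, 1],
                             [3, 0, 1, 0]]))

# find() returns the row indices, column indices, and values of non-zero entries
nz_rows, nz_cols, nz_vals = find(X_toy)

# Count how many entries hold each distinct value
values, counts = np.unique(nz_vals, return_counts=True)
vcounts = dict(zip(values.tolist(), counts.tolist()))
print(vcounts)  # {1: 4, 2: 1, 3: 1}
```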

In [None]:
# Find non-zero values and the counts of each value as
#    value: count
# Where value is the number of times a token appeared in a tweet
# and count is the number of features occurring "value" times
nzr, nzc, nzv, vcounts = tokenizer.find_nonzero(X)
tokenizer.show_vcounts(vcounts)

In [None]:
# Let's map these values back to the data to see:
#  - the repeating token
#  - the number of repeats
#  - the original tweet text
#  - the tokenized tweet text
# For certain repeat counts.
list(map(
    lambda idx: (tokenizer.idx2tok(nzc[idx]),
                 nzv[idx],
                 texts[nzr[idx]],
                 ' '.join(tokenizer.tokenize(texts[nzr[idx]]))),
    np.where(nzv > 17)[0]))

In [None]:
# what are the most common repeating tokens *within a tweet* (histogram):
rtokens = Counter()
for tok, count in map(lambda idx: (tokenizer.idx2tok(nzc[idx]), nzv[idx]), np.where(nzv >= 5)[0]):
    rtokens[tok] += count
rtokens.most_common()

Not surprisingly, the tokens that repeat the most within tweets are similar to the most common tokens overall.
Recall that the UNK token represents all of the "long tail" tokens with a frequency of 5 or below.

## Let's organize the emojis for modeling and create the train/test split...

* As previously done, we'll focus on just the top 10 emojis for targets
* And we'll split the data 90% for training, leaving 10% for testing

In [None]:
emoji_counter = Counter()
for emoji_dict in emojis:
    emoji_counter.update(emoji_dict.keys())

print(emoji_counter.most_common(10))

common_emojis = [e[0] for e in emoji_counter.most_common(10)]

print()
print('Top 10:')
print(common_emojis)

In [None]:
# Two useful lookup tables for converting emojis into integer indexes and vice-versa
emoji_to_index = {w : i for i, w in enumerate(common_emojis)}
index_to_emoji = {i : w for i, w in enumerate(common_emojis)}

def create_Y_matrix(emojis):
    n = len(common_emojis)
    m = len(emojis)
    Y = np.zeros([m, n])
    for i, single_tweet_emojis in enumerate(emojis):
        for emoji in single_tweet_emojis:
            index = emoji_to_index.get(emoji, None)
            if index is not None:
                Y[i, index] = 1
    return Y
                
Y = create_Y_matrix(emojis)

print(f'Y shape: {Y.shape}')

In [None]:
N = X.shape[0]
num_train = int(N * 0.9)

# Convert values to 1's for training
X1 = csr_matrix((np.ones(X.data.shape[0], dtype=np.int8), X.indices, X.indptr), shape=X.shape)

def split(array, split_point=num_train):
    return array[:split_point], array[split_point:]

X_train, X_test = split(X1)
Y_train, Y_test = split(Y)
texts_train, texts_test = split(texts)
emojis_train, emojis_test = split(emojis)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print()
print(f'Y_train shape: {Y_train.shape}')
print(f'Y_test shape: {Y_test.shape}')

## Now we're ready to build a model!

* This time, we're building a model based on *sparse matrices* instead of *numpy arrays*
   * see the ^^^ attention marker in the code comments
* Just as we did in lesson #1, we'll use multivariate least squares regression:


Given the encoded emoji matrix, $\mathbf{y}$, and our featurized input matrix, $\mathbf{X}$ (prepended with a 1 for each instance to be multiplied by the bias in the $\mathbf{\beta}$ vector), we're solving for $\mathbf{\beta}$:

$$ \mathbf{X} \mathbf{\beta} = \mathbf{y} $$

$$ \quad \begin{bmatrix} 1 & X_{1,1} & X_{1,2} & ... & X_{1,n} \\ 1 & X_{2,1} & X_{2,2} & ... & X_{2,n} \\ ... & ... & ... & ... & ... \\ 1 & X_{m,1} & X_{m,2} & ... & X_{m,n} \end{bmatrix} \begin{bmatrix} b_{1} & b_{2} & ... & b_{t} \\ w_{1,1} & w_{1,2} & ... & w_{1,t} \\ w_{2,1} & w_{2,2} & ... & w_{2,t} \\ ... & ... & ... & ... \\ w_{n,1} & w_{n,2} & ... & w_{n,t} \end{bmatrix} = \begin{bmatrix} y_{1,1} & y_{1,2} & ... & y_{1,t} \\ y_{2,1} & y_{2,2} & ... & y_{2,t} \\ ... & ... & ... & ... \\ y_{m,1} & y_{m,2} & ... & y_{m,t} \end{bmatrix} $$

Recall that solving for $\mathbf{\beta}$ and regularizing using "Ridge Regression", we get

$$ \mathbf{\beta} = ( \mathbf{X}^T \mathbf{X} + \lambda I )^{-1} \mathbf{X}^T \mathbf{y} $$

where $I$ is the identity matrix and $\lambda$ is the "ridge value", a small number, e.g., around $10^{-1}$ or $10^{-3}$.

It turns out that we need ridge regression here because, with such sparse features, the matrix $\mathbf{X}^T \mathbf{X}$ is singular (non-invertible). Adding the regularization term along the diagonal makes the determinant non-zero, so the matrix becomes invertible.
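
A tiny dense example can illustrate the point. The toy matrix below is an assumption chosen so that there are more columns than rows, which forces $\mathbf{X}^T \mathbf{X}$ to be rank deficient. For simplicity the sketch regularizes every diagonal entry, as in the formula above, and uses `np.linalg.solve` rather than forming the inverse explicitly:

```python
import numpy as np

# Toy design matrix: 3 instances, a bias column of ones plus 4 features
X = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [1, 0, 0, 0, 1]], dtype=np.float64)
y = np.array([[1.0], [0.0], [1.0]])

gram = X.T @ X  # 5 x 5, but its rank is at most 3
print(np.linalg.matrix_rank(gram), "of", gram.shape[0])  # 3 of 5

# Add the ridge value along the diagonal
lam = 1e-2
ridge = gram + lam * np.eye(gram.shape[0])
print(np.linalg.matrix_rank(ridge), "of", ridge.shape[0])  # 5 of 5

# Solve (X^T X + lambda I) beta = X^T y
beta = np.linalg.solve(ridge, X.T @ y)
print(np.round(X @ beta, 2).ravel())
```

Solving the linear system gives the same $\mathbf{\beta}$ as multiplying by the inverse, and is usually the more numerically stable choice.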

In [None]:
# Define a couple of helpers to keep things straight

def augment_X(X):
    '''
    Augment sparse X with the Bias vector (Beta) column of ones.
    We'll need to do this when training and when predicting.
    '''
    ones_col = csc_matrix(np.ones([X.shape[0], 1], dtype=np.float32))  # ^^^ Wrap np.ones with a sparse column matrix
    return hstack([ones_col, X], format='csr', dtype=np.float32)  # ^^^ Contrast sparse.hstack with np.hstack([ones_col, X])

class DataWrapper:
    '''
    Simple wrapper for adding the bias column of ones to X, holding y, and remembering sizes
    '''
    def __init__(self, X, y):
        """
        :param X: a 2d sparse matrix with shape (m, n) holding the independent variables
        :param y: a 2d ndarray with shape (m, t) holding the targets (one column per emoji)
        """
        self.m, self.n = X.shape
        assert y.shape[0] == self.m
        
        # Augment X with the ones column
        self.X = augment_X(X)
        self.y = csr_matrix(y, dtype=np.float32)   # ^^^ Wrap y with a sparse row matrix

In [None]:
dw = DataWrapper(X_train, Y_train)

In [None]:
# Sparse implementation with ridge regression
def least_squares(dw, l=1.0e-2):
    """
    :param dw: a DataWrapper holding X and y
    :param l: Lambda for ridge regression, if present
    :returns: a 2d sparse array with shape (n+1, t) holding the biases (first row) and the weights (remaining rows)
    """
    # Solve the normal equations
    Xt = dw.X.T
    result_one = Xt @ dw.X  # ^^^ Contrast with non-sparse np.matmul(Xt, dw.X)
    
    # Regularize for inverting (ridge regularization)
    if l is not None:
        diag = np.ones(dw.n+1, dtype=np.float32) * l
        diag[0] = 0  # don't regularize the bias term
        I = eye(dw.n+1, dw.n+1, dtype=np.float32, format='csr')  # sparse "eye"-dentity matrix, get it?!
        I.setdiag(diag)
        result_one = result_one + I  # ^^^ sparse + sparse = sparse

    # Carry on with the calculations
    result_two = linalg.inv(result_one)  # ^^^ Contrast sparse.linalg.inv with non-sparse np.linalg.inv
    result_three = result_two @ Xt  # ^^^ Contrast with non-sparse np.matmul(result_two, Xt)
    return result_three @ dw.y  # ^^^ Contrast with non-sparse np.matmul(result_three, dw.y)

In [None]:
# And ... solve the model!
Betas = least_squares(dw)

In [None]:
# Look, ma, we're still working with sparse matrices!
Betas

In [None]:
# Separate the biases and the weights
Biases = Betas[0, :]
Weights = Betas[1:, :]

print(f'Betas shape: {Betas.shape}')
print(f'Biases shape: {Biases.shape}')
print(f'Weights shape: {Weights.shape}')

In [None]:
# ...and the slices are also still sparse...
Weights

In [None]:
def predict(X, Betas):
    """
    :param X: a 2d sparse matrix with shape (m, n) holding the independent variables
    :param Betas: a 2d sparse matrix with shape (n+1, k) holding the parameters of a linear model (the first row contains bias terms)
    :returns: a 2d sparse matrix with shape (m, k) holding the predictions
    """
    m, n = X.shape
    assert Betas.shape[0] == n + 1
    
    # Augment X with the ones column
    X = augment_X(X)

    # Apply model
    return X @ Betas

In [None]:
Y_test_pred = predict(X_test, Betas)
print(f'Y_test_pred shape: {Y_test_pred.shape}')

In [None]:
#
# Print results when an emoji prediction score exceeds a threshold (0.4)
#
result_counts = Counter()
for test_text, test_emoji, y_pred in zip(texts_test, emojis_test, Y_test_pred):
    y_pred = y_pred.toarray()[0]
    highest_scoring_emoji_index = y_pred.argmax()  # ^^^ Contrast with non-sparse np.argmax(y_pred)
    highest_score = y_pred[highest_scoring_emoji_index]
    tweets_common_emojis = [e for e in test_emoji if e in common_emojis]
    if len(tweets_common_emojis) > 0:
        result_counts['instances_with_emoji'] += 1
    if highest_score > 0.4:
        print('-'*40)
        print('Text:', test_text)
        print('Bag of words:', ' '.join(tokenizer.tokenize(test_text)))
        tweets_common_emojis = [e for e in test_emoji if e in common_emojis]
        print('Common emojis:', tweets_common_emojis)
        ordered_preds = sorted(zip(y_pred, common_emojis), reverse=True)
        if ordered_preds[0][1] in tweets_common_emojis:
            result_counts['top_match'] += 1
        for pred, emoji in ordered_preds:
            if emoji in tweets_common_emojis:
                result_counts['any_match'] += 1
                break
        print(ordered_preds)
print(f'\nresult_counts: {result_counts}')

* Compare above results to lesson 1:
   * result_counts: Counter({'instances_with_emoji': 214, 'any_match': 17, 'top_match': 14})
   * We doubled the number of top matches!
* Differences:
   * Used sparse matrices (which shouldn't make any difference in the results)
   * Evolved tokenization for the bag of words features
   * Added regularization
* Next steps:
   * Consider more feature engineering
   * Need a lot more training examples
      * And this is where the sparse matrices will be more necessary
   * Try other models

In [None]:
# Show words that are important (have the highest absolute weights) for an emoji
def print_important_words(emoji, count=10):
    emoji_index = emoji_to_index[emoji]
    emoji_betas = Betas[:, emoji_index]
    emoji_word_weights = emoji_betas[1:].toarray().ravel() # first term is bias
    sorted_idxs = np.argsort(np.abs(emoji_word_weights))[::-1]
    for idx in sorted_idxs[:count]:
        print(f'{idx}\t{emoji_word_weights[idx]}\t{tokenizer.idx2tok(idx)}')

print_important_words('❤')

In [None]:
print_important_words('🔥')