# 05. NumPy Project: Predicting Emojis with Regression

August 26, 2018

We've now learned enough to be dangerous! 

Let's use this knowledge to implement two famous algorithms:

* Ranking Web Pages with PageRank
* **Predicting Emojis in Tweets with Linear Regression**

In [None]:
import numpy as np
print(f'numpy version: {np.__version__}')

## Emojis in tweets

<img src="http://incrediblethings.com/wp-content/uploads/2014/12/top-100-twitter-emojis-cut-e1418995524941.jpg" width="400"/>

The problem statement is as follows: Given a tweet with its emojis removed, can we predict which emojis were present?

If we can produce an effective solution to this problem, then we have created a natural language processor which can infer sentiment, or emotion, from text.

### Example

Original Tweet:

> @gautam_rode4 @TatitaSim Amen 🙏🙏🙏 thank you bro

Modified Tweet:
> @gautam_rode4 @TatitaSim Amen thank you bro

Prediction Target:
> 🙏

## Download Some Data

For this project we have collected around 5 million tweets with emojis. For this lesson, however, we'll work with a small subset of this data that contains only 5,000 tweets.

Run this cell to download the CSV file into the current directory:

In [None]:
import requests
import shutil

url = "https://s3-us-west-2.amazonaws.com/resero2/datasets/ml-foundations/emoji_tweets_5k.csv"

print('Downloading data...')

response = requests.get(url, stream=True)
with open("emoji_tweets_5k.csv", 'wb') as outfile:
    shutil.copyfileobj(response.raw, outfile)

print('Done.')

Now let's load the data using the csv library:

In [None]:
import csv
import json

texts = []
emojis = []

with open("emoji_tweets_5k.csv", encoding="utf-8", newline="") as infile:
    for row in csv.reader(infile):
        text = json.loads(row[1]).strip()
        texts.append(text)
        emojis.append(json.loads(row[2]))

print(f'Text count: {len(texts)}')
print(f'Emojis count: {len(emojis)}')

In [None]:
for i in range(5):
    print(i, emojis[i], texts[i])
    print()

## Preprocess

Because we don't have a very large dataset, let's not attempt to predict all emojis that we find but instead only the top 10.

In [None]:
from collections import Counter

emoji_counter = Counter()
for emoji_dict in emojis:
    emoji_counter.update(emoji_dict.keys())

print(emoji_counter.most_common(10))

common_emojis = [e[0] for e in emoji_counter.most_common(10)]

print()
print('Top 10:')
print(common_emojis)

## Linear Least-Squares Regression

You may be wondering how we're going to use linear regression to predict emojis, but just bear with us a bit ;-)

### Univariate Least Squares

In its simplest form with one independent variable, $x$, and one target, $y$, linear least-squares regression finds a "best fit" prediction line that minimizes the sum of squared residuals between the prediction and the data:

$$ \underset{w, b}{\mathrm{argmin}} \sum_{i}{\left[y_{i} - f(x_{i}, w, b)\right]^2} $$

$$ f(x, w, b) = b + wx $$

where $w$ is the weight (or slope) and $b$ is the bias (or offset, or intercept).

We can show this visually with some artificial data:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

X = np.random.rand(500, 1)
noiseless_y = 3 * X + 2
y = noiseless_y + np.random.randn(500, 1)
plt.scatter(X, y)
f = plt.plot(X, noiseless_y, c='r')

### Multivariate Least Squares

Univariate regression can be extended to incorporate multiple independent variables by simply adding a weight for each input.

We will represent the input vector, $\mathbf{x}$, and also the weight vector, $\mathbf{w}$, in bold font to distinguish them from scalars:

$$ \underset{\mathbf{w}, b}{\mathrm{argmin}} \sum_{i}{[y_{i} - f(\mathbf{x}_{i}, \mathbf{w}, b)]^2} $$

$$ f(\mathbf{x}, \mathbf{w}, b) = b + \mathbf{w}^T\mathbf{x} $$

Here $\mathbf{w}^T\mathbf{x}$ is the dot product, which sums the product of each element in $\mathbf{x}$ with the corresponding weight in $\mathbf{w}$.

$$ \begin{bmatrix} w_{1} & w_{2} & ... & w_{n} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix} \quad = \quad \sum_{i} w_{i} x_{i} $$
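As a quick sanity check, the dot product above can be sketched in NumPy. The numbers in `w` and `x` are made up purely for illustration; the point is only that the explicit sum, `np.dot`, and the `@` operator all agree.

```python
import numpy as np

# Made-up example values, not taken from the tweet data
w = np.array([0.5, -1.0, 2.0])   # weights
x = np.array([4.0, 3.0, 1.0])    # inputs

# Sum of element-wise products, written out explicitly
explicit_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))

# The same quantity using NumPy's dot product and the @ operator
dot = np.dot(w, x)
matmul = w @ x

print(explicit_sum, dot, matmul)
```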

### Finding the Solution

There are several ways one can find the best parameters $\mathbf{w}$ and $b$ to minimize the objective function above. A common technique is to solve the "normal equations".

In the equations below, $m$ is the number of data points (instances) and $n$ is the number of independent variables (features).

First we will combine the weights and the bias into the single vector, $\mathbf{\beta}$:

$$ \mathbf{\beta} = \begin{bmatrix} b \\ w_{1} \\ w_{2} \\... \\ w_{n} \end{bmatrix} $$

Next we will place each input vector into the row of a matrix, $\mathbf{X}$, and prepend a 1 for each instance which will be multiplied by the bias in the $\mathbf{\beta}$ vector.

$$ \mathbf{X} = \begin{bmatrix} 1 & X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ 1 & X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{m,1} & X_{m,2} & \cdots & X_{m,n} \end{bmatrix} $$

You can see now that the bias term in $\beta$ and the constant column of all 1s will be multiplied together, which adds in the bias term as part of a single matrix multiply:

$$ \mathbf{X} \mathbf{\beta} = \quad \begin{bmatrix} 1 & X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ 1 & X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{m,1} & X_{m,2} & \cdots & X_{m,n} \end{bmatrix} \begin{bmatrix} b \\ w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{bmatrix} $$

Finally we place our targets in a column vector $\mathbf{y}$.

$$ \mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ ... \\ y_{m} \end{bmatrix} $$

In this construction, we can now write our objective as:

$$ \underset{\mathbf{\beta}}{\mathrm{argmin}} \quad \| \mathbf{y} -\mathbf{X} \mathbf{\beta} \|^2 $$

where $\| \cdot \|^2$ is the squared Euclidean norm, i.e. the sum of the squared elements of the vector.

By taking the derivative of this expression and setting it equal to zero (calculus not shown for brevity), we end up with:

$$ \mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \mathbf{\beta} $$

and so:

$$ \mathbf{\beta} = ( \mathbf{X}^T \mathbf{X} )^{-1} \mathbf{X}^T \mathbf{y} $$

where the superscript $^{T}$ denotes the transpose and superscript $^{-1}$ denotes the matrix inverse.
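For readers who want to see the calculus that was skipped above, here is a sketch of the standard derivation. It assumes $\mathbf{X}^T \mathbf{X}$ is invertible, which holds when the columns of $\mathbf{X}$ are linearly independent. First expand the squared norm:

$$ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{y} - \mathbf{X} \mathbf{\beta})^T (\mathbf{y} - \mathbf{X} \mathbf{\beta}) = \mathbf{y}^T \mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T \mathbf{X}^T \mathbf{X} \mathbf{\beta} $$

Then take the gradient with respect to $\mathbf{\beta}$ and set it to zero:

$$ \nabla_{\mathbf{\beta}} \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} = 0 $$

Dividing by 2 and rearranging gives $\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \mathbf{\beta}$, the normal equations.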

### Implement It!

Using the formulation above, implement the function below which finds the best fit parameters $\mathbf{\beta}$. Then use the function to find the best fit to the artificial data we created above.

In [None]:
# Sample Implementation

def least_squares(X, y):
    """
    :param X: a 2d ndarray with shape (m, n) holding the independent variables
    :param y: a 2d ndarray with shape (m, 1) holding the targets
    :returns: a 2d ndarray with shape (n+1, 1) holding the bias (first element) and the weights (rest of the elements)
    """
    m, n = X.shape
    assert y.shape[0] == m
    
    # Augment X
    ones_col = np.ones([m, 1])
    X = np.hstack([ones_col, X])

    # Solve the normal equations
    result_one = np.linalg.inv(np.matmul(X.T, X))
    result_two = np.matmul(result_one, X.T)
    return np.matmul(result_two, y)

In [None]:
def least_squares(X, y):
    """
    :param X: a 2d ndarray with shape (m, n) holding the independent variables
    :param y: a 2d ndarray with shape (m, 1) holding the targets
    :returns: a 2d ndarray with shape (n+1, 1) holding the bias (first element) and the weights (rest of the elements)
    """
    m, n = X.shape
    assert y.shape[0] == m
    
    # Your code here
    
    return beta

In [None]:
# Run it on our synthetic data
X = np.random.rand(500, 1)
noiseless_y = 3 * X + 2
y = noiseless_y + np.random.randn(500, 1)

beta = least_squares(X, y)
bias = beta[0]
weights = beta[1:]
print(f'bias: {bias}, weights: {weights}')

plt.scatter(X, y)
predictions = weights[0] * X + bias
f = plt.plot(X, predictions, c='r')

One thing to note is that we implemented the solution by using a matrix inversion, which in practice isn't the best method in terms of numerical stability. A better approach would instead use **np.linalg.solve()**, but we will leave that as an exercise for the reader!
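If you'd like to check your answer to that exercise, here is one possible sketch. The function name `least_squares_solve` and the demo variables are our own (hypothetical) and not part of the lesson code. It solves the same normal equations, but asks NumPy to solve the linear system directly rather than forming the inverse. As a cross-check we compare against `np.linalg.lstsq`, NumPy's built-in least-squares solver.

```python
import numpy as np

def least_squares_solve(X, y):
    """Hypothetical variant of least_squares() that avoids an explicit inverse."""
    m, n = X.shape
    # Augment X with a column of ones for the bias term
    X_aug = np.hstack([np.ones((m, 1)), X])
    # Solve (X^T X) beta = X^T y for beta without computing the inverse
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Synthetic data in the same spirit as the cell above (seeded for repeatability)
rng = np.random.default_rng(0)
X_demo = rng.random((500, 1))
y_demo = 3 * X_demo + 2 + rng.standard_normal((500, 1))

beta_solve = least_squares_solve(X_demo, y_demo)

# Cross-check against NumPy's built-in least-squares routine
X_demo_aug = np.hstack([np.ones((500, 1)), X_demo])
beta_lstsq, *_ = np.linalg.lstsq(X_demo_aug, y_demo, rcond=None)

print(beta_solve.ravel(), beta_lstsq.ravel())
```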

## Classification via Regression

<img src="http://mlpy.sourceforge.net/docs/3.2/_images/lda_binary1.png" />

So how does linear regression relate to machine learning? Well, one way to formulate a binary classification problem is to encode one class as y=0 and the other class as y=1.

The plotting code below generates two synthetic classes with 2 independent variables (features). We create an $\mathbf{X}$ matrix, a $\mathbf{y}$ vector with our encoded class targets, and then run the linear regression calculation.

$$ \mathbf{X} = \begin{bmatrix} 0.4 & 0.2 \\ 0.3 & 0.1 \\ ... & ...\\ 0.5 & 0.8 \\ 0.6 & 0.7 \end{bmatrix}
\quad \mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ ... \\ 1 \\ 1 \end{bmatrix}
$$

We color the points based on the model prediction, and also fill in the volume with semi-transparent cubes colored this way so that you can see the separating hyperplane. You could imagine creating a threshold on the score (corresponding to roughly halfway between red and blue) in order to get a discrete 0/1 class prediction.

Be sure to click and drag the 3D plot to rotate it around.

In [None]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

# Generate two classes with 2d gaussian distributions
x1 = np.random.randn(50, 2) * 0.1 + (0.25, 0.25)
x2 = np.random.randn(50, 2) * 0.1 + (0.75, 0.75)
X = np.vstack([x1, x2])

# Generate class indicators
y = np.array([0]*50 + [1]*50)

# Fit least squares model
beta = least_squares(X, y)
bias = beta[0]
weights = beta[1:]
predictions = np.matmul(X, weights) + bias

# Render least squares model
markers = dict(
    size=5,
    color=predictions,
    colorscale='Jet',   # choose a colorscale
    opacity=0.8
)

trace = go.Scatter3d(x=X[:, 0], y=X[:, 1], z=y, mode='markers', marker=markers)

# Fill volume with semi-transparent cubes
lin = np.linspace(0, 1, 11)
x, y, z = np.meshgrid(lin, lin, lin)
x, y, z = x.ravel(), y.ravel(), z.ravel()
fill_predictions = np.matmul(np.vstack([x, y]).T, weights) + bias

fill_markers = dict(
    size=20,
    color=fill_predictions,
    colorscale='Jet',   # choose a colorscale
    opacity=0.03,
    symbol='square'
)

fill_trace = go.Scatter3d(x=x, y=y, z=z, mode='markers', marker=fill_markers)

# Plot
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=[trace, fill_trace], layout=layout)
iplot(fig)
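The thresholding idea described above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the variable names are hypothetical, the threshold of 0.5 is simply the midpoint between the two class targets, and we fit with `np.linalg.lstsq` so the cell is self-contained rather than reusing `least_squares()`.

```python
import numpy as np

# Two synthetic classes, similar to the plotting cell (seeded for repeatability)
rng = np.random.default_rng(0)
X_cls = np.vstack([rng.normal(0.25, 0.1, (50, 2)),
                   rng.normal(0.75, 0.1, (50, 2))])
y_cls = np.array([0] * 50 + [1] * 50)

# Fit a linear model with a bias column
X_cls_aug = np.hstack([np.ones((100, 1)), X_cls])
beta_cls, *_ = np.linalg.lstsq(X_cls_aug, y_cls, rcond=None)

# Continuous scores, then a discrete 0/1 prediction via a threshold
scores = X_cls_aug @ beta_cls
labels = (scores > 0.5).astype(int)

accuracy = np.mean(labels == y_cls)
print(f'accuracy on the synthetic points: {accuracy}')
```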

## Predicting Emojis

### Creating the X Matrix
In order to represent our Twitter data in a form convenient for regression, we're going to transform the text of each tweet into a "binary bag of words" representation.

A "binary bag of words" simply means that we will create an independent variable (a feature) for each word in a vocabulary, and set that feature to 1 if the tweet contains the word. 

An example encoding for two sentences and a small 3-word vocabulary is:

* sentence one: "the quick fox"
* sentence two: "the fox"

$$ \begin{matrix} the & quick & fox \end{matrix} \\
\begin{bmatrix} 1 \quad & 1 \quad & 1 \\ 1 \quad & 0 \quad & 1 \end{bmatrix} $$
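The toy encoding above can be reproduced in code. This is only a sketch for the two example sentences: the names `toy_vocab`, `toy_word_to_index`, and `X_toy` are hypothetical, and we split on whitespace instead of using a real tokenizer.

```python
import numpy as np

# Toy vocabulary and sentences from the example above
toy_vocab = ["the", "quick", "fox"]
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
sentences = ["the quick fox", "the fox"]

# One row per sentence, one column per vocabulary word
X_toy = np.zeros((len(sentences), len(toy_vocab)))
for row, sentence in enumerate(sentences):
    for word in sentence.split():
        col = toy_word_to_index.get(word)
        if col is not None:
            X_toy[row, col] = 1   # mark the word as present

print(X_toy)
```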

The code for creating a vocabulary is below. Use it to create an $\mathbf{X}$ matrix suitable for regression.

In [None]:
import re

def tokenize(text):
    # very basic regex tokenization
    text = text.replace("’", "'")
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z0-9'&]+", text)]
    tokens = [t for t in tokens if len(t) > 0 and t != 't' and t != 'https' and t != 'co']
    return tokens

def create_vocab(texts, size):
    word_counts = Counter()
    for text in texts:
        words = tokenize(text)
        word_counts.update(words)
    return [e[0] for e in word_counts.most_common(size)]

vocab = create_vocab(texts, 300)

# Two useful lookup tables for converting words into integer indexes and vice-versa
word_to_index = {w : i for i, w in enumerate(vocab)}
index_to_word = {i : w for i, w in enumerate(vocab)}

print(f'Vocab: {vocab}')
print()
print(f'Top 10 word to index entries: {list(word_to_index.items())[:10]}')

Complete the code below to implement the X matrix creation.

In [None]:
# Sample Implementation

def create_X_matrix(texts):
    
    # Initialize X to all zeros
    m = len(texts)
    n = len(vocab)
    X = np.zeros([m, n])
    
    for i, text in enumerate(texts):
        for word in tokenize(text):
            index = word_to_index.get(word.lower(), None)
            if index is not None:
                X[i, index] = 1
    
    return X

X = create_X_matrix(texts)
#print(f'First row: {X[0]}')
print(f'Total number of ones in X: {np.sum(X)}')

In [None]:
def create_X_matrix(texts):
    
    # Initialize X to all zeros
    m = len(texts)
    n = len(vocab)
    
    X = # **** Your code here ****
    
    for i, text in enumerate(texts):
        for word in tokenize(text):
            index = word_to_index.get(word.lower(), None)
            if index is not None:
                # **** Your code here ****
    
    return X

X = create_X_matrix(texts)
print(f'First row: {X[0]}')
print(f'Total number of ones in X: {np.sum(X)}')

### Multiple Targets

We have our $\mathbf{X}$ matrix now, but what about the target vector $\mathbf{y}$? There are multiple emojis, and each tweet can have more than one type of emoji. How do we handle multiple targets? Our least squares example only showed how to predict a single target.

The solution is quite simple: instead of a $\mathbf{y}$ vector with a single column, we will create a $\mathbf{Y}$ matrix with one column per emoji. We'll use a target of 1 when an emoji exists, and a target of 0 when it does not. The math works out the same and we don't even need to alter our code!
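To see why the code needs no changes, note that the normal equations treat each column of $\mathbf{Y}$ independently. The sketch below uses small random stand-in data (hypothetical names, not the tweet matrices) and checks that solving for all targets at once gives the same parameters as solving one column at a time.

```python
import numpy as np

# Random stand-in data: 20 instances, a bias column plus 3 features, 2 targets
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((20, 1)), rng.random((20, 3))])
Y_demo = (rng.random((20, 2)) > 0.5).astype(float)

gram = X_demo.T @ X_demo

# Solve for both target columns in one call
B_joint = np.linalg.solve(gram, X_demo.T @ Y_demo)

# Solve for each target column separately, then stack the results
B_cols = np.column_stack([
    np.linalg.solve(gram, X_demo.T @ Y_demo[:, k])
    for k in range(Y_demo.shape[1])
])

print(np.allclose(B_joint, B_cols))
```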

An example of encoding two emoji targets is below:

$$ 
\begin{matrix} the & quick & fox \end{matrix} \quad \quad \begin{matrix} 😍 & \quad 🔥\end{matrix} \\
\begin{bmatrix} 1 \quad & 1 \quad & 1 \\ 1 \quad & 0 \quad & 1 \end{bmatrix} \quad \quad \begin{bmatrix} 1 \quad & 1 \\ 1 \quad & 0 \end{bmatrix}
$$

Write code for this encoding below.

In [None]:
# Sample implementation

# Two useful lookup tables for converting emojis into integer indexes and vice-versa
emoji_to_index = {w : i for i, w in enumerate(common_emojis)}
index_to_emoji = {i : w for i, w in enumerate(common_emojis)}

def create_Y_matrix(emojis):
    n = len(common_emojis)
    m = len(emojis)
    Y = np.zeros([m, n])
    for i, single_tweet_emojis in enumerate(emojis):
        for emoji in single_tweet_emojis:
            index = emoji_to_index.get(emoji, None)
            if index is not None:
                Y[i, index] = 1
    return Y
                
Y = create_Y_matrix(emojis)

print(f'Y shape: {Y.shape}')

In [None]:
# Two useful lookup tables for converting emojis into integer indexes and vice-versa
emoji_to_index = {w : i for i, w in enumerate(common_emojis)}
index_to_emoji = {i : w for i, w in enumerate(common_emojis)}

def create_Y_matrix(emojis):
    raise Exception('Implement me!')
                
Y = create_Y_matrix(emojis)

print(f'Y shape: {Y.shape}')

### Train / Test Split

We have our data ready to go, but how do we determine how well the model performs? A good way to do this is to withhold some of the data so that the model can't train on it.

We'll fit our model on the training set, and then use it to predict on the test set.

In [None]:
N = X.shape[0]
num_train = int(N * 0.9)

def split(array, split_point=num_train):
    return array[:split_point], array[split_point:]

X_train, X_test = split(X)
Y_train, Y_test = split(Y)
texts_train, texts_test = split(texts)
emojis_train, emojis_test = split(emojis)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print()
print(f'Y_train shape: {Y_train.shape}')
print(f'Y_test shape: {Y_test.shape}')
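The split above simply takes the first 90% of the rows. If the rows in a file happen to be ordered in some meaningful way (we have not checked whether that is the case here), a shuffled split may be preferable. Here is a minimal sketch on stand-in arrays with hypothetical names. For plain Python lists such as `texts`, you would index with a list comprehension instead.

```python
import numpy as np

# Stand-in data: row i of X_all is [2i, 2i+1] and row i of Y_all is [i]
X_all = np.arange(20).reshape(10, 2)
Y_all = np.arange(10).reshape(10, 1)

# Shuffle the row indices once, then reuse them for every array
rng = np.random.default_rng(42)
perm = rng.permutation(len(X_all))
n_train = int(len(X_all) * 0.9)
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_tr, X_te = X_all[train_idx], X_all[test_idx]
Y_tr, Y_te = Y_all[train_idx], Y_all[test_idx]

print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
```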

### Data Check

To make sure everything is correct, let's look at a tweet and its row in X and Y.

In [None]:
print(f'Text: {texts[-1]}')
print(f'Emojis: {emojis[-1]}')
print(f'X row: {X[-1]}')
print(f'Y row: {Y[-1]}')

In [None]:
print('Bag of words:')
print([index_to_word[i] for i, entry in enumerate(X[-1]) if entry > 0])

print()
print('Emoji targets:')
print([index_to_emoji[i] for i, entry in enumerate(Y[-1]) if entry > 0])

### Fit the Model

We have everything we need now. 

Let's run your least_squares() function to fit a model on X_train and Y_train.

In [None]:
Betas = least_squares(X_train, Y_train)
Biases = Betas[0, :]
Weights = Betas[1:, :]

print(f'Betas shape: {Betas.shape}')
print(f'Biases shape: {Biases.shape}')
print(f'Weights shape: {Weights.shape}')

### Predict With the Model

We now need a method that can use the model parameters to predict on new data. Implement the predict() function below:

In [None]:
# Sample Implementation
def predict(X, Betas):
    """
    :param X: a 2d ndarray with shape (m, n) holding the independent variables
    :param Betas: a 2d ndarray with shape (n+1, k) holding the parameters of a linear model (the first row contains bias terms)
    :returns: a 2d ndarray with shape (m, k) holding the predictions
    """
    m, n = X.shape
    assert Betas.shape[0] == n + 1
    
    # Augment X
    ones_col = np.ones([m, 1])
    X = np.hstack([ones_col, X])

    # Apply model
    return np.matmul(X, Betas)

In [None]:
def predict(X, Betas):
    """
    :param X: a 2d ndarray with shape (m, n) holding the independent variables
    :param Betas: a 2d ndarray with shape (n+1, k) holding the parameters of a linear model (the first row contains bias terms)
    :returns: a 2d ndarray with shape (m, k) holding the predictions
    """
    m, n = X.shape
    assert Betas.shape[0] == n + 1
    
    # Your code here

In [None]:
Y_test_pred = predict(X_test, Betas)
print(f'Y_test_pred shape: {Y_test_pred.shape}')

In [None]:
#
# Print results when an emoji prediction score exceeds a threshold (0.4)
# You'll need to scroll down a ways before you see a prediction that isn't 😂
#
for test_text, test_emoji, y_pred in zip(texts_test, emojis_test, Y_test_pred):
    highest_scoring_emoji_index = np.argmax(y_pred)
    highest_score = y_pred[highest_scoring_emoji_index]
    if highest_score > 0.4:
        print('-'*40)
        print('Text:', test_text)
        print('Bag of words:', [w for w in tokenize(test_text) if w in word_to_index])
        print('Common emojis:', [e for e in test_emoji if e in common_emojis])
        print(sorted(zip(y_pred, common_emojis), reverse=True))
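Reading through printed examples is useful, but a single number makes it easier to compare models. One simple option (our suggestion, not something the lesson prescribes) is a top-1 hit rate: among tweets that contain at least one common emoji, how often is the highest-scoring emoji actually present? The sketch below uses tiny made-up arrays in place of `Y_test` and `Y_test_pred`.

```python
import numpy as np

# Made-up targets and scores for 4 tweets and 3 emojis
Y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 0],
                   [0, 0, 1]])
Y_scores = np.array([[0.7, 0.1, 0.2],
                     [0.2, 0.3, 0.6],
                     [0.5, 0.1, 0.1],
                     [0.6, 0.2, 0.3]])

# Index of the highest-scoring emoji for each tweet
top = np.argmax(Y_scores, axis=1)

# 1 where the top-scoring emoji is really in the tweet, else 0
hits = Y_true[np.arange(len(top)), top]

# Only score tweets that have at least one target emoji
has_target = Y_true.sum(axis=1) > 0
hit_rate = hits[has_target].mean()

print(f'top-1 hit rate: {hit_rate:.3f}')
```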

### Words with the Highest Weights

While not a great model, it does seem to have learned some important correlations between words and emojis. Let's look at the words with the highest absolute weights for the heart emoji:

In [None]:
def print_important_words(emoji, count=10):
    emoji_index = emoji_to_index[emoji]
    emoji_betas = Betas[:, emoji_index]
    emoji_word_weights = emoji_betas[1:] # first term is bias
    sorted_idxs = np.argsort(np.abs(emoji_word_weights))[::-1]
    for idx in sorted_idxs[:count]:
        print(emoji_word_weights[idx], '\t', index_to_word[idx])

print_important_words('❤')

## Parting Thoughts

This linear bag-of-words model is interesting, but it doesn't perform that well. It appears to have learned a few nice word correlations, but it seems to predict 😂 too often.

How could we make this better? Here are a few ideas:

1. Use more training data, for example the full set of 5M tweets, and expand the vocabulary
2. Combine similar emojis, such as those with hearts, into one category
3. Expand beyond basic bag of words and linear least squares, which we will do in upcoming lessons!

## Addendum: Ridge Regression

In practice we want to implement a slight variant of this called "Ridge Regression", which will help avoid numerical problems and also help regularize the model so that the weights don't get too large. All we need to do is add a diagonal matrix filled with the "ridge" value that we will call $\lambda$:

$$ \mathbf{\beta} = ( \mathbf{X}^T \mathbf{X} + \lambda I )^{-1} \mathbf{X}^T \mathbf{y} $$

where $I$ is the identity matrix. Note that this also regularizes the bias term which is not always desirable, but we will leave the description this way for simplicity.

For extra credit, try implementing Ridge Regression!
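If you want something to compare your attempt against, here is one possible sketch. The function name `ridge_regression` and the parameter name `lam` (standing in for $\lambda$) are our own choices. Like the formula above, it regularizes the bias term too, and it uses `np.linalg.solve` rather than an explicit inverse.

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Hypothetical ridge variant of least_squares(); lam plays the role of lambda."""
    m, n = X.shape
    # Augment X with a column of ones for the bias term
    X_aug = np.hstack([np.ones((m, 1)), X])
    # Add lam to the diagonal of X^T X (this also regularizes the bias)
    A = X_aug.T @ X_aug + lam * np.eye(n + 1)
    return np.linalg.solve(A, X_aug.T @ y)

# Noise-free synthetic data with known parameters: bias 0.3, weights [1, -2, 0.5]
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([[1.0], [-2.0], [0.5]]) + 0.3

beta_small = ridge_regression(X_demo, y_demo, lam=1e-8)   # almost no regularization
beta_large = ridge_regression(X_demo, y_demo, lam=100.0)  # heavy regularization

print(beta_small.ravel())
print(beta_large.ravel())
```

With a tiny `lam` the known parameters are recovered almost exactly, while a large `lam` shrinks the parameter vector toward zero.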