# Logistic Regression using Keras in TensorFlow

This notebook shows how to train a logistic regression model using the `tf.keras` interface.

TensorFlow comes pre-packaged with a set of IMDB movie reviews
that are labeled for positive or negative sentiment. We'll build a
classifier that learns to label new reviews as good or bad.

I learned about this dataset and some related ideas from [this
TensorFlow example](https://github.com/tensorflow/models/blob/master/samples/core/tutorials/keras/basic_text_classification.ipynb).

## Imports

In [1]:
import tensorflow as tf
from tensorflow import keras

import numpy as np

# Print out the TensorFlow version to help others reproduce this notebook.
print(tf.__version__)

1.9.0


## Load in the data

In [2]:
imdb = keras.datasets.imdb

NUM_WORDS  = 1000  # Keep this many words (throw out least popular).
INDEX_FROM = 3    # Offset added to each raw word index; IDs 0-2 are reserved for special tokens.

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words = NUM_WORDS,
    index_from = INDEX_FROM
)

### Be able to reconstruct English reviews from our numeric data

Each review in `X_train` and `X_test` is a list of integers. Each integer
stands for one word, according to a word index that `keras` provides and
that we can look up. To get a feel for what the data is like, let's map
the integers back to words.

In [3]:
word_index = imdb.get_word_index()

# Create a reverse map, and augment it with symbolic tokens.
id_to_word = {(v + INDEX_FROM): k for k, v in word_index.items()}
id_to_word[0] = '<PAD>'
id_to_word[1] = '<START>'
id_to_word[2] = '<UNK>'

def rebuild_original_review(word_ids):
    """ Return a string based on the given list of `word_ids`. """
    # Use `word_id` rather than `id` to avoid shadowing the Python builtin.
    return ' '.join(id_to_word[word_id] for word_id in word_ids)

In [4]:
# Let's see two sample reviews.

for i in range(2):
    print()
    print('Label: %d' % y_train[i])
    print('Review:')
    print(rebuild_original_review(X_train[i])[:500])


Label: 1
Review:
<START> this film was just brilliant casting <UNK> <UNK> story direction <UNK> really <UNK> the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same <UNK> <UNK> as myself so i loved the fact there was a real <UNK> with this film the <UNK> <UNK> throughout the film were great it was just brilliant so much that i <UNK> the film as soon as it was released for <UNK> and would recommend it to everyone 

Label: 0
Review:
<START> big <UNK> big <UNK> bad music and a <UNK> <UNK> <UNK> these are the words to best <UNK> this terrible movie i love cheesy horror movies and i've seen <UNK> but this had got to be on of the worst ever made the plot is <UNK> <UNK> and ridiculous the acting is an <UNK> the script is completely <UNK> the best is the end <UNK> with the <UNK> and how he worked out who the killer is it's just so <UNK> <UNK> written the <UNK> are <UNK> and funny in <UNK> <UNK
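
The `<UNK>` tokens in these samples come from the vocabulary cutoff. As a rough sketch of the forward direction (text to IDs), assuming the same conventions as the reverse map (`<START>` is 1, `<UNK>` is 2, and any ID at or beyond `NUM_WORDS` is treated as unknown), a review could be encoded as shown here. The `toy_word_index` dictionary and the `encode_review` helper are hypothetical and exist only for illustration; the real index comes from `imdb.get_word_index()`.

```python
NUM_WORDS = 1000   # Same cutoff as in the notebook.
INDEX_FROM = 3     # Same offset as in the notebook.
START_ID = 1       # ID of the <START> token.
UNK_ID = 2         # ID of the <UNK> token.

# Hypothetical excerpt standing in for imdb.get_word_index().
# Lower rank means a more frequent word.
toy_word_index = {'film': 19, 'brilliant': 527, 'casting': 970, 'location': 1619}

def encode_review(text, word_index):
    """ Return a list of word IDs for `text` (illustrative sketch only). """
    ids = [START_ID]
    for word in text.lower().split():
        rank = word_index.get(word)
        word_id = rank + INDEX_FROM if rank is not None else UNK_ID
        # Words outside the kept vocabulary collapse to <UNK>.
        ids.append(word_id if word_id < NUM_WORDS else UNK_ID)
    return ids

encoded = encode_review('film brilliant casting location', toy_word_index)
print(encoded)  # [1, 22, 530, 973, 2]
```

Here `location` has an offset ID of 1622, which is past the cutoff of 1000, so it becomes `<UNK>`.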

### Convert data to fixed-length vectors

We'll use a bag-of-words model to convert each review to a fixed-length
vector of length `NUM_WORDS`. Each coordinate is 1 if the word with that
ID is present in the review and 0 if it is absent. Word order and word
counts are discarded.

In [5]:
def convert_to_bag_of_words(array_of_lists):
    """ Return a 0/1 matrix representing the input `array_of_lists`
        using the bag-of-words model. """
    n_pts = array_of_lists.shape[0]
    X = np.zeros((n_pts, NUM_WORDS))
    for row_idx, word_ids in enumerate(array_of_lists):
        # Set every column named in `word_ids` to 1 for this row.
        X[row_idx, word_ids] = 1
    return X

In [6]:
X_train = convert_to_bag_of_words(X_train)
X_test  = convert_to_bag_of_words(X_test)
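
To make the conversion concrete, here is a small plain-Python sketch of the same idea on toy data. It is not the code used in this notebook: `bag_of_words_row`, the toy reviews, and the vocabulary size of 6 are all made up for illustration, and it uses lists rather than NumPy arrays.

```python
def bag_of_words_row(word_ids, vocab_size):
    """ Return a 0/1 list with a 1 at each position named in `word_ids`. """
    row = [0] * vocab_size
    for word_id in word_ids:
        row[word_id] = 1   # A repeated word still gives 1, not a count.
    return row

# Two hypothetical reviews over a vocabulary of 6 word IDs.
toy_reviews = [[1, 4, 4, 5], [1, 2]]
toy_matrix = [bag_of_words_row(review, vocab_size=6) for review in toy_reviews]

for row in toy_matrix:
    print(row)
# [0, 1, 0, 0, 1, 1]
# [0, 1, 1, 0, 0, 0]
```

Both rows have the same length even though the reviews do not, which is what lets us feed them to a single dense layer.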

## Set up the keras model

In [7]:
model = keras.Sequential()
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(
    optimizer = tf.train.AdamOptimizer(0.0002),
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)
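
A single dense unit with a sigmoid activation is what makes this model a logistic regression: it computes a weighted sum of the inputs plus a bias and squashes the result into a probability. The sketch below spells out that computation and the binary cross-entropy loss in plain Python. The weights, bias, and input are hypothetical values chosen for illustration, and the loss here skips the numerical clipping that a real implementation would apply.

```python
import math

def sigmoid(z):
    """ Map any real number to the interval (0, 1). """
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """ Return the modeled probability that the label of `x` is 1. """
    z = sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return sigmoid(z)

def binary_crossentropy(y_true, p):
    """ Return the loss for one example with label `y_true` (0 or 1). """
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical 4-word vocabulary, weights, and bias.
x = [0, 1, 1, 0]
weights = [0.5, -1.0, 2.0, 0.0]
bias = -1.0

p = predict_proba(x, weights, bias)   # z = -1.0 + 2.0 - 1.0 = 0.0
loss = binary_crossentropy(1, p)

print(p)               # 0.5
print(round(loss, 4))  # 0.6931
```

Training adjusts the weights and bias to lower the average of this loss over the training set.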

In [8]:
# This takes about 12s on my laptop.
history = model.fit(X_train, y_train, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [9]:
print('Losses:')
# The slice [4::5] picks out the values after epochs 5 and 10.
print(history.history['loss'][4::5])

Losses:
[0.39455839398384096, 0.3384308640670776]


In [10]:
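print('Accuracies:')
# Training accuracy after epochs 5 and 10. The key is 'acc' in this
# TensorFlow version; newer Keras versions name it 'accuracy'.
print(history.history['acc'][4::5])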

Accuracies:
[0.85364, 0.86692]


## Compute the test-set accuracy

In [13]:
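# `evaluate` returns the loss followed by each compiled metric,
# so here results = [loss, accuracy].
results = model.evaluate(X_test, y_test)

print()
print('Test accuracy:')
print(results[1])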


Test accuracy:
0.86048
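
For intuition about what this number measures, here is a minimal sketch of binary accuracy, assuming the model's sigmoid output is thresholded at 0.5 to get a predicted label. The `binary_accuracy` helper and the toy labels and probabilities are hypothetical; they are not taken from the model above.

```python
def binary_accuracy(y_true, probabilities, threshold=0.5):
    """ Return the fraction of examples whose thresholded
        probability matches the true 0/1 label. """
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predicted, y_true))
    return correct / len(y_true)

# Hypothetical labels and predicted probabilities.
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]

print(binary_accuracy(toy_labels, toy_probs))  # 0.5
```

In this toy case the first two predictions are right and the last two are wrong, so the accuracy is 0.5.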
