# Text classification of movie reviews using Keras


Taken from [tensorflow.org](https://www.tensorflow.org/tutorials/keras/basic_text_classification)

This notebook classifies movie reviews as _positive_ or _negative_ using the text of the review. This is an example of _binary_ (two-class) classification, an important and widely applicable kind of machine learning problem.

We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are _balanced_, meaning they contain an equal number of positive and negative reviews.

This notebook uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API for building and training models in TensorFlow. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

In [5]:
from __future__ import absolute_import, division, print_function
import tensorflow as tf
from tensorflow import keras
import numpy as np

print("TensorFlow version:", tf.__version__)

TensorFlow version: 1.12.0


## Download the IMDB dataset

The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed such that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.
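To make the word-to-integer idea concrete, here is a minimal sketch on a toy corpus. The helper names (`build_word_index`, `encode`) and the three toy reviews are hypothetical and are not part of TensorFlow; the sketch simply assumes that more frequent words receive smaller integers, and it ignores any reserved indices the real dataset may use.

```python
from collections import Counter


def build_word_index(reviews):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter(word for review in reviews for word in review.split())
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(review, word_index):
    """Convert a review string into a list of integer word ids."""
    return [word_index[word] for word in review.split()]


# Hypothetical toy data, standing in for real review text.
toy_reviews = [
    "a great movie",
    "a dull movie",
    "great acting in a great movie",
]

word_index = build_word_index(toy_reviews)
encoded = [encode(review, word_index) for review in toy_reviews]

print(word_index)
print(encoded)
```

Each review becomes a list of integers whose length equals the number of words in the review, which is the same shape of data the preprocessed IMDB reviews have.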

The following code downloads the IMDB dataset to your machine (or uses a cached copy if you've already downloaded it):

In [6]:
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = \
                    imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


The argument `num_words=10000` keeps the 10,000 most frequently occurring words in the training data. Rarer words are discarded to keep the size of the data manageable.
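The effect of a vocabulary limit can be sketched in plain Python. This is an illustration only, not the loader's actual implementation: `limit_vocabulary` and its `oov_index` parameter are hypothetical names, and the sketch assumes that smaller integers stand for more frequent words. Whether rare words are dropped outright or swapped for a placeholder index depends on how the loader is configured, so both variants are shown.

```python
def limit_vocabulary(sequence, num_words, oov_index=None):
    """Keep only word ids below num_words.

    Rarer words (ids >= num_words) are dropped when oov_index is None,
    otherwise they are replaced by oov_index.
    """
    kept = []
    for word_id in sequence:
        if word_id < num_words:
            kept.append(word_id)
        elif oov_index is not None:
            kept.append(oov_index)
    return kept


# Hypothetical encoded review containing two rare words (10000 and 25000).
review = [1, 14, 22, 9999, 10000, 43, 25000]

dropped = limit_vocabulary(review, num_words=10000)
replaced = limit_vocabulary(review, num_words=10000, oov_index=2)

print(dropped)
print(replaced)
```

Dropping shortens the sequence, while replacing keeps its length and the position of every remaining word, which can matter for models that are sensitive to word order.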