<a href="https://colab.research.google.com/github/rahiakela/automl-experiments/blob/main/auto-keras-practice-works/03_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Text Classification

Reference:

https://autokeras.com/tutorial/text_classification/

In [1]:
!pip -q install autokeras

In [2]:
import os

import numpy as np
import tensorflow as tf
from sklearn.datasets import load_files

import autokeras as ak

##A Simple Example

The first step is to prepare your data. Here we use the IMDB dataset as an example.

In [4]:
dataset = tf.keras.utils.get_file(fname="aclImdb.tar.gz",
                                  origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
                                  extract=True)

# set path to dataset
IMDB_DATADIR = os.path.join(os.path.dirname(dataset), "aclImdb")

classes = ["neg", "pos"]
train_data = load_files(os.path.join(IMDB_DATADIR, "train"), shuffle=True, categories=classes)
test_data = load_files(os.path.join(IMDB_DATADIR, "test"), shuffle=False, categories=classes)

x_train = np.array(train_data.data)
y_train = np.array(train_data.target)
x_test = np.array(test_data.data)
y_test = np.array(test_data.target)

print(x_train.shape)  # (25000,)
print(y_train.shape)  # (25000,)
print(x_train[0][:50])  # first 50 bytes of the first review

(25000,)
(25000,)
b'Zero Day leads you to think, even re-think why two'


The second step is to run the TextClassifier. As a quick demo, we set epochs to 2.

You can also leave the epochs unspecified for an adaptive number of epochs.

In [7]:
# Initialize the text classifier
clf = ak.TextClassifier(overwrite=True, max_trials=1)
# Feed the text classifier with training data
clf.fit(x_train, y_train, epochs=2)

Trial 1 Complete [00h 04m 46s]
val_loss: 0.27647531032562256

Best val_loss So Far: 0.27647531032562256
Total elapsed time: 00h 04m 46s
INFO:tensorflow:Oracle triggered exit
Epoch 1/2
Epoch 2/2
INFO:tensorflow:Assets written to: ./text_classifier/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7fac8ae793d0>

In [8]:
# Predict with the best model
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))

[0.2690293490886688, 0.891040027141571]
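
The two numbers printed above appear to follow the Keras convention of loss first, then the metric (accuracy for a classifier). As a rough cross-check, accuracy can be recomputed from the predictions. The sketch below uses small stand-in arrays rather than the real `predicted_y` and `y_test`, and it assumes the predictions come back as a column of labels that compare equal to the true labels once flattened. The helper name `accuracy_from_predictions` is made up for illustration.

```python
import numpy as np


def accuracy_from_predictions(predicted, expected):
    """Return the fraction of predictions that match the true labels."""
    # Flatten both arrays so a (n, 1) column and a (n,) vector line up.
    predicted = np.asarray(predicted).reshape(-1)
    expected = np.asarray(expected).reshape(-1)
    if predicted.shape != expected.shape:
        raise ValueError("predictions and labels must have the same length")
    return float(np.mean(predicted == expected))


# Stand-in data (assumption): four predictions, one of them wrong.
demo_predicted = np.array([[1], [0], [1], [1]])
demo_expected = np.array([1, 0, 0, 1])
demo_accuracy = accuracy_from_predictions(demo_predicted, demo_expected)
print(demo_accuracy)
```

If the real predictions use a different dtype than the labels, they would need to be converted before comparing.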


##Validation Data

By default, AutoKeras uses the last 20% of the training data as validation data.

As shown in the example below, you can use `validation_split` to specify the percentage.

In [None]:
clf.fit(
    x_train,
    y_train,
    # Split the training data and use the last 15% as validation data.
    validation_split=0.15,
)
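
To make "the last 15%" concrete, here is a minimal NumPy sketch of a trailing hold-out split. It only illustrates the idea and is not AutoKeras's internal code; the helper name `split_last_fraction` and the demo arrays are assumptions made for this example.

```python
import numpy as np


def split_last_fraction(x, y, validation_split):
    """Hold out the trailing fraction of the samples as validation data."""
    n_val = int(len(x) * validation_split)
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Stand-in data (assumption): 100 samples with alternating binary labels.
x_demo = np.arange(100)
y_demo = np.arange(100) % 2

(x_fit, y_fit), (x_held, y_held) = split_last_fraction(x_demo, y_demo, 0.15)
print(len(x_fit), len(x_held))
```

Because the split takes the tail of the arrays without shuffling, the order of the training data matters. In this notebook the training set was loaded with `shuffle=True`, so the tail is already a random sample.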

You can also use your own validation set instead of splitting it from the training data with `validation_data`.

In [None]:
split = 5000
x_val = x_train[split:]
y_val = y_train[split:]
x_train = x_train[:split]
y_train = y_train[:split]

clf.fit(
    x_train,
    y_train,
    epochs=2,
    # Use your own validation set.
    validation_data=(x_val, y_val),
)

##Customized Search Space

Advanced users can customize the search space by using AutoModel instead of TextClassifier. You can configure the TextBlock with some high-level options, e.g., vectorizer for the type of text vectorization method to use.

You can use 'sequence', which uses TextToIntSequence to convert the words to integers and Embedding to embed the integer sequences, or you can use 'ngram', which uses TextToNgramVector to vectorize the sentences. You can also leave these arguments unspecified, in which case the different choices are tuned automatically.
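
The difference between the two vectorization styles can be shown with a toy, pure-Python sketch. This is not how the AutoKeras blocks are implemented; the function names `to_int_sequence` and `to_ngram_counts`, the tiny corpus, and the choice of bigrams are all assumptions made for illustration.

```python
from collections import Counter

reviews = ["a great great film", "a dull film"]

# Build a vocabulary that maps each word to an integer id.
# Id 0 is reserved here for words that were never seen.
vocabulary = {}
for review in reviews:
    for word in review.split():
        vocabulary.setdefault(word, len(vocabulary) + 1)


def to_int_sequence(text):
    """Sequence style: keep word order, replace each word with its id."""
    return [vocabulary.get(word, 0) for word in text.split()]


def to_ngram_counts(text, n=2):
    """N-gram style: drop global order, count each run of n adjacent words."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)


sequence = to_int_sequence("a great film")
bigrams = to_ngram_counts("a great great film")
print(sequence)
print(bigrams)
```

A sequence keeps the word order and is what an embedding layer consumes, while an n-gram vector is a fixed-size bag of counts that suits dense layers.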

In [9]:
input_node = ak.TextInput()
output_node = ak.TextBlock(block_type="ngram")(input_node)
output_node = ak.ClassificationHead()(output_node)

clf = ak.AutoModel(inputs=input_node, outputs=output_node, overwrite=True, max_trials=1)
clf.fit(x_train, y_train, epochs=2)

Trial 1 Complete [00h 01m 02s]
val_loss: 0.32131633162498474

Best val_loss So Far: 0.32131633162498474
Total elapsed time: 00h 01m 02s
INFO:tensorflow:Oracle triggered exit
Epoch 1/2
Epoch 2/2
INFO:tensorflow:Assets written to: ./auto_model/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7fac8ac39f10>

The usage of AutoModel is similar to the functional API of Keras. Basically, you are building a graph whose edges are blocks and whose nodes are the intermediate outputs of blocks. To add an edge from `input_node` to `output_node`, write `output_node = ak.[some_block]([block_args])(input_node)`.

You can also use more fine-grained blocks to customize the search space even further.

In [11]:
input_node = ak.TextInput()
output_node = ak.TextToIntSequence()(input_node)
output_node = ak.Embedding()(output_node)
# Use separable Conv layers in Keras.
output_node = ak.ConvBlock(separable=True)(output_node)
output_node = ak.ClassificationHead()(output_node)

clf = ak.AutoModel(inputs=input_node, outputs=output_node, overwrite=True, max_trials=1)
clf.fit(x_train, y_train, epochs=2)

Trial 1 Complete [00h 02m 49s]
val_loss: 0.6931466460227966

Best val_loss So Far: 0.6931466460227966
Total elapsed time: 00h 02m 49s
INFO:tensorflow:Oracle triggered exit
Epoch 1/2
Epoch 2/2
INFO:tensorflow:Assets written to: ./auto_model/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7fac8b8d9a10>

##Data Format

The AutoKeras TextClassifier is quite flexible about the data format.

For the text, the input data should be one-dimensional. For the classification labels, AutoKeras accepts both plain labels, i.e. strings or integers, and one-hot encoded labels, i.e. vectors of 0s and 1s.
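
As a small illustration of the two label formats, the NumPy sketch below converts plain integer labels to one-hot vectors and back. The arrays are stand-ins rather than the IMDB labels, and the variable names are assumptions made for this example.

```python
import numpy as np

# Plain labels: one integer class id per sample.
plain_labels = np.array([0, 1, 1, 0])
num_classes = 2

# One-hot labels: pick row i of the identity matrix for class i.
one_hot_labels = np.eye(num_classes, dtype=int)[plain_labels]

# Going back: the position of the 1 in each row is the class id.
recovered_labels = one_hot_labels.argmax(axis=1)

print(one_hot_labels.tolist())
print(recovered_labels.tolist())
```

Either array describes the same four samples; the one-hot form has shape `(4, 2)` while the plain form has shape `(4,)`.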

We also support using tf.data.Dataset format for the training data.

In [12]:
train_set = tf.data.Dataset.from_tensor_slices(((x_train,), (y_train,))).batch(32)
test_set = tf.data.Dataset.from_tensor_slices(((x_test,), (y_test,))).batch(32)

clf = ak.TextClassifier(overwrite=True, max_trials=2)
# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set, epochs=2)
# Predict with the best model.
predicted_y = clf.predict(test_set)
# Evaluate the best model with testing data.
print(clf.evaluate(test_set))

Trial 2 Complete [00h 04m 50s]
val_loss: 0.3088071644306183

Best val_loss So Far: 0.2812400162220001
Total elapsed time: 00h 10m 45s
INFO:tensorflow:Oracle triggered exit
Epoch 1/2
Epoch 2/2
INFO:tensorflow:Assets written to: ./text_classifier/best_model/assets
[0.2734043300151825, 0.889959990978241]
