
## Structured Data Classification

Reference:

https://autokeras.com/tutorial/structured_data_classification/

## Setup

In [None]:
!pip -q install autokeras

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

import autokeras as ak

## A Simple Example

The first step is to prepare your data. Here we use the [Titanic
dataset](https://www.kaggle.com/c/titanic) as an example.

In [2]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

The second step is to run the
[StructuredDataClassifier](/structured_data_classifier).


As a quick demo, we set epochs to 10.
You can also leave the epochs unspecified for an adaptive number of epochs.


In [3]:
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=3)
# Feed the classifier the path to train.csv and the name of the label column.
clf.fit(train_file_path, "survived", epochs=10)

Trial 3 Complete [00h 00m 12s]
val_accuracy: 0.852173924446106

Best val_accuracy So Far: 0.895652174949646
Total elapsed time: 00h 00m 27s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./structured_data_classifier/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7f24d7c91ad0>

In [4]:
# Predict with the best model.
predicted_y = clf.predict(test_file_path)
# Evaluate the best model with testing data.
print(clf.evaluate(test_file_path, "survived"))

[0.4397850036621094, 0.7992424368858337]
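
The two numbers printed by `evaluate` appear to be the loss followed by the accuracy, which is the order Keras `Model.evaluate` uses. As a rough cross-check, accuracy can be recomputed by hand from predictions. The sketch below assumes the predictions are 0/1 labels in a 2-D array; the `accuracy` helper and the toy arrays are hypothetical stand-ins for `predicted_y` and the `survived` column, not part of AutoKeras.

```python
import numpy as np


def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    # Flatten both inputs so (n, 1) predictions compare against (n,) labels.
    predicted = np.asarray(predicted).reshape(-1)
    actual = np.asarray(actual).reshape(-1)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same length")
    return float(np.mean(predicted == actual))


# Toy stand-ins (hypothetical values, not taken from the Titanic data).
toy_predicted = np.array([[0], [1], [1], [0]])
toy_actual = np.array([0, 1, 0, 0])
print(accuracy(toy_predicted, toy_actual))
```

The comparison only makes sense when the predicted labels and the true labels share a type; if one side holds strings and the other integers, cast them to a common type first.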


## Data Format

The AutoKeras StructuredDataClassifier accepts several data formats.

The example above uses the CSV files directly. Besides CSV files,
it also supports `numpy.ndarray`, `pandas.DataFrame`, and [tf.data.Dataset](
https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable). The
data should be two-dimensional with numerical or categorical values.

For the classification labels,
AutoKeras accepts both plain labels, i.e. strings or integers, and one-hot
encoded labels, i.e. vectors of 0s and 1s.
The labels can be `numpy.ndarray`, `pandas.DataFrame`, or `pandas.Series`.
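
To make the two label formats concrete, here is a minimal sketch that converts plain labels into one-hot vectors with NumPy. The `to_one_hot` helper and the toy labels are hypothetical and only illustrate the encoding; AutoKeras does not require you to do this conversion yourself.

```python
import numpy as np


def to_one_hot(labels):
    """Map plain labels to one-hot rows; returns (one_hot, classes)."""
    labels = np.asarray(labels).reshape(-1)
    # classes is the sorted set of labels; indices maps each label to its class.
    classes, indices = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    one_hot = np.eye(len(classes), dtype=int)[indices]
    return one_hot, classes


plain = np.array(["no", "yes", "yes", "no"])  # toy plain labels
one_hot, classes = to_one_hot(plain)
print(classes)
print(one_hot)
```

Each row contains a single 1 in the column of its class, so the column order follows the sorted class names.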

The following examples show how the data can be prepared with numpy.ndarray,
pandas.DataFrame, and tensorflow.data.Dataset.


In [5]:
# x_train as pandas.DataFrame, y_train as pandas.Series
x_train = pd.read_csv(train_file_path)
print(type(x_train))  # pandas.DataFrame
y_train = x_train.pop("survived")
print(type(y_train))  # pandas.Series

# You can also use pandas.DataFrame for y_train.
y_train = pd.DataFrame(y_train)
print(type(y_train))  # pandas.DataFrame

# You can also use numpy.ndarray for x_train and y_train.
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()
print(type(x_train))  # numpy.ndarray
print(type(y_train))  # numpy.ndarray

# Preparing testing data.
x_test = pd.read_csv(test_file_path)
y_test = x_test.pop("survived")

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [6]:
# It tries 3 different models.
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=3)
# Feed the structured data classifier with training data.
clf.fit(x_train, y_train, epochs=10)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))

Trial 3 Complete [00h 00m 02s]
val_accuracy: 0.8608695864677429

Best val_accuracy So Far: 0.8695651888847351
Total elapsed time: 00h 00m 09s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./structured_data_classifier/best_model/assets
[0.43934163451194763, 0.7878788113594055]


The following code shows how to convert numpy.ndarray to `tf.data.Dataset`.


In [7]:
# Cast the mixed-type arrays to str; the np.unicode alias is gone from recent NumPy releases.
train_set = tf.data.Dataset.from_tensor_slices((x_train.astype(str), y_train))
test_set = tf.data.Dataset.from_tensor_slices(
    (x_test.to_numpy().astype(str), y_test)
)

clf = ak.StructuredDataClassifier(overwrite=True, max_trials=3)
# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set, epochs=10)
# Predict with the best model.
predicted_y = clf.predict(test_set)
# Evaluate the best model with testing data.
print(clf.evaluate(test_set))

Trial 3 Complete [00h 00m 04s]
val_accuracy: 0.8695651888847351

Best val_accuracy So Far: 0.8782608509063721
Total elapsed time: 00h 00m 13s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./structured_data_classifier/best_model/assets
[0.44628679752349854, 0.7840909361839294]


You can also specify the column names and types for the data as follows. The
`column_names` argument is optional if the training data already has column
names, e.g. a `pandas.DataFrame` or a CSV file. The type of any column that is
not specified will be inferred from the training data.


In [8]:
# Initialize the structured data classifier
clf = ak.StructuredDataClassifier(
    column_names=["sex", "age", "n_siblings_spouses", "parch", "fare", "class", "deck", "embark_town", "alone"],
    column_types={"sex": "categorical", "fare": "numerical"},
    max_trials=10,
    overwrite=True)

# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set, epochs=10)
# Predict with the best model.
predicted_y = clf.predict(test_set)
# Evaluate the best model with testing data.
print(clf.evaluate(test_set))

Trial 10 Complete [00h 00m 04s]
val_accuracy: 0.852173924446106

Best val_accuracy So Far: 0.8782608509063721
Total elapsed time: 00h 00m 56s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./structured_data_classifier/best_model/assets
[0.4395553171634674, 0.7840909361839294]
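
When the data starts out as a `pandas.DataFrame`, a `column_types` dictionary can be drafted from the frame's dtypes and then corrected by hand. The sketch below is a hypothetical helper, not the inference AutoKeras performs internally; it assumes numeric dtypes should map to `"numerical"` and everything else to `"categorical"`.

```python
import pandas as pd


def guess_column_types(frame):
    """Hypothetical helper: draft a column_types dict from DataFrame dtypes."""
    return {
        name: "numerical" if pd.api.types.is_numeric_dtype(dtype) else "categorical"
        for name, dtype in frame.dtypes.items()
    }


# Toy frame with a few Titanic-style columns (values are illustrative).
toy = pd.DataFrame(
    {"sex": ["male", "female"], "age": [22.0, 38.0], "fare": [7.25, 71.28]}
)
column_types = guess_column_types(toy)
print(column_types)
```

A dtype-based guess treats integer-coded categories as numerical, so review the result before passing it as `column_types`.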


## Validation Data

By default, AutoKeras uses the last 20% of the training data as validation
data. As shown in the example below, you can use `validation_split` to specify
the fraction.

In [None]:
clf.fit(
    x_train,
    y_train,
    # Split the training data and use the last 15% as validation data.
    validation_split=0.15,
    epochs=10,
)
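
Because the validation rows are taken from the end of the training data, the order of the rows affects what ends up in the validation set. The sketch below shows one way to shuffle the rows and then carve off the last fraction by hand. The `shuffle_and_split` helper, its seed, and the rounding of the validation size are assumptions for illustration, not AutoKeras behavior.

```python
import numpy as np


def shuffle_and_split(x, y, validation_split=0.15, seed=0):
    """Shuffle rows, then hold out the last fraction as validation data."""
    x = np.asarray(x)
    y = np.asarray(y)
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    # Apply the same permutation to x and y so rows stay aligned.
    order = np.random.default_rng(seed).permutation(len(x))
    x, y = x[order], y[order]
    # Assumed rounding: truncate the validation size toward zero.
    n_val = int(len(x) * validation_split)
    split = len(x) - n_val
    return (x[:split], y[:split]), (x[split:], y[split:])


# Toy data: row i of toy_x is [2 * i, 2 * i + 1] and toy_y[i] == i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
(x_tr, y_tr), (x_va, y_va) = shuffle_and_split(toy_x, toy_y, validation_split=0.2)
print(len(x_tr), len(x_va))
```

The two returned pairs can then be passed as the training arguments and as `validation_data=(x_va, y_va)`.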

With `validation_data`, you can also supply your own validation set
instead of splitting one off from the training data.


In [None]:
split = 500
x_val = x_train[split:]
y_val = y_train[split:]
x_train = x_train[:split]
y_train = y_train[:split]

clf.fit(
    x_train,
    y_train,
    # Use your own validation set.
    validation_data=(x_val, y_val),
    epochs=10,
)

## Customized Search Space

Advanced users can customize the search space by using
[AutoModel](/auto_model/#automodel-class) instead of
[StructuredDataClassifier](/structured_data_classifier). You can configure the
[StructuredDataBlock](/block/#structureddatablock-class) for some high-level
configurations, e.g., `categorical_encoding` for whether to use the
[CategoricalToNumerical](/block/#categoricaltonumerical-class).

You can also leave these arguments unspecified, in which case the different
choices are tuned automatically. See the following example for details.

In [9]:
input_node = ak.StructuredDataInput()
output_node = ak.StructuredDataBlock(categorical_encoding=True)(input_node)
output_node = ak.ClassificationHead()(output_node)

clf = ak.AutoModel(inputs=input_node, outputs=output_node, overwrite=True, max_trials=3)
clf.fit(x_train, y_train, epochs=10)

Trial 3 Complete [00h 00m 05s]
val_loss: 0.5440530180931091

Best val_loss So Far: 0.5440530180931091
Total elapsed time: 00h 00m 17s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./auto_model/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7f24d54cd850>

The usage of [AutoModel](/auto_model/#automodel-class) is similar to the
[functional API](https://www.tensorflow.org/guide/keras/functional) of Keras.
Basically, you are building a graph whose edges are blocks and whose nodes are
intermediate outputs of blocks.
To add an edge from `input_node` to `output_node`, write
`output_node = ak.[some_block]([block_args])(input_node)`.
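
The "blocks are edges, nodes are intermediate outputs" idea can be mimicked in a few lines of plain Python. The toy `Node` and `Block` classes below are hypothetical and are not the AutoKeras implementation; they only show how calling a block on a node yields a new node that remembers where it came from.

```python
class Node:
    """An intermediate output that remembers which block produced it."""

    def __init__(self, producer=None, parent=None):
        self.producer = producer  # the block (edge) leading into this node
        self.parent = parent      # the node the block was called on


class Block:
    """An edge in the graph: calling it on a node returns a new node."""

    def __init__(self, name):
        self.name = name

    def __call__(self, input_node):
        return Node(producer=self, parent=input_node)


def path_to(node):
    """List the block names on the path from the input to the given node."""
    names = []
    while node.producer is not None:
        names.append(node.producer.name)
        node = node.parent
    return list(reversed(names))


input_node = Node()
output_node = Block("CategoricalToNumerical")(input_node)
output_node = Block("DenseBlock")(output_node)
output_node = Block("ClassificationHead")(output_node)
print(path_to(output_node))
```

Reassigning `output_node` at each step is what chains the edges together: every call consumes the previous node and returns the next one.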

You can also use more fine-grained blocks to customize the search space even
further. See the following example.


In [10]:
input_node = ak.StructuredDataInput()
output_node = ak.CategoricalToNumerical()(input_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.ClassificationHead()(output_node)

clf = ak.AutoModel(inputs=input_node, outputs=output_node, overwrite=True, max_trials=3)
clf.fit(x_train, y_train, epochs=10)

Trial 3 Complete [00h 00m 03s]
val_loss: 2.928130865097046

Best val_loss So Far: 0.5403491854667664
Total elapsed time: 00h 00m 09s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./auto_model/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7f24d50df7d0>

You can also export the best model found by AutoKeras as a Keras Model.


In [11]:
model = clf.export_model()
model.summary()
print(x_train.dtype)
# Object (mixed-type) numpy arrays are not supported,
# so cast the array to str first.
model.predict(x_train.astype(str))

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 9)]               0         
_________________________________________________________________
multi_category_encoding (Mul (None, 9)                 0         
_________________________________________________________________
dense (Dense)                (None, 32)                320       
_________________________________________________________________
re_lu (ReLU)                 (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
_________________________________________________________________
classification_head_1 (Activ (None, 1)                 0         
Total params: 353
Trainable params: 353
Non-trainable params: 0
_______________________________________________________________

array([[0.27231562],
       [0.74572015],
       [0.28487468],
       [0.6546192 ],
       [0.29729486],
       [0.4176831 ],
       [0.29402003],
       [0.6870272 ],
       [0.74572766],
       [0.35479733],
       [0.34603217],
       [0.46819496],
       [0.61613905],
       [0.3424906 ],
       [0.4147718 ],
       [0.26552534],
       [0.5814375 ],
       [0.77132523],
       [0.36426476],
       [0.27478027],
       [0.9129101 ],
       [0.2860205 ],
       [0.26359868],
       [0.59860706],
       [0.92004037],
       [0.28223622],
       [0.03853047],
       [0.7456721 ],
       [0.55836666],
       [0.27484316],
       [0.5695276 ],
       [0.13000727],
       [0.56315655],
       [0.2817968 ],
       [0.7049523 ],
       [0.2643512 ],
       [0.4588064 ],
       [0.4968613 ],
       [0.5157281 ],
       [0.5044556 ],
       [0.81126297],
       [0.5726882 ],
       [0.6305123 ],
       [0.65583694],
       [0.4636627 ],
       [0.26865113],
       [0.35198563],
       [0.355
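
The exported model returns one value per row rather than a class label. Assuming each value is the predicted probability of the positive class, a threshold turns it into a 0/1 label. The sketch below uses the first five values from the output above; the `to_labels` helper and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np


def to_labels(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels (assumed threshold)."""
    probabilities = np.asarray(probabilities, dtype=float).reshape(-1)
    return (probabilities >= threshold).astype(int)


# First five rows of the prediction output shown above.
first_rows = np.array(
    [[0.27231562], [0.74572015], [0.28487468], [0.6546192], [0.29729486]]
)
print(to_labels(first_rows))
```

Raising or lowering the threshold trades false positives against false negatives, so 0.5 is only a starting point.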

## Reference

[StructuredDataClassifier](/structured_data_classifier),

[AutoModel](/auto_model/#automodel-class),

[StructuredDataBlock](/block/#structureddatablock-class),

[DenseBlock](/block/#denseblock-class),

[StructuredDataInput](/node/#structureddatainput-class),

[ClassificationHead](/block/#classificationhead-class),

[CategoricalToNumerical](/block/#categoricaltonumerical-class).
