https://www.tensorflow.org/alpha/tutorials/keras/feature_columns

In [1]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Load the heart-disease CSV data.
dataURL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(dataURL)  # pandas.DataFrame

The created `pandas.DataFrame` object has the following structure:

In [11]:
dataframe.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca        thal  target
0   63    1   1       145   233    1        2      150      0      2.3      3   0       fixed       0
1   67    1   4       160   286    0        2      108      1      1.5      2   3      normal       1
2   67    1   4       120   229    0        2      129      1      2.6      2   2  reversible       0
3   37    1   3       130   250    0        0      187      0      3.5      3   0      normal       0
4   41    0   2       130   204    0        2      172      0      1.4      1   0      normal       0


In [6]:
print("Type:", type(dataframe))
print("Shape:", dataframe.shape)
print("Column names and dtypes:")
print(dataframe.dtypes)

Type: <class 'pandas.core.frame.DataFrame'>
Shape: (303, 14)
Column names and dtypes:
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object


Each *row* corresponds to a patient (or a data point), and each *column* corresponds to an attribute.

Note that column values can be accessed by giving a column name as either an *attribute* or a *key*, i.e., `dataframe.age` or `dataframe['age']` for the age values.
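
As a small illustration of the two access styles, the sketch below uses a made-up two-row frame (not the heart dataset). Attribute access only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute, so key access is the safer default.

```python
import pandas as pd

# Hypothetical toy frame; the column names and values are illustrative only.
toy = pd.DataFrame({'age': [63, 67], 'resting bp': [145, 160]})

by_attribute = toy.age   # attribute access
by_key = toy['age']      # key access
same = by_attribute.equals(by_key)

# A column name containing a space can only be reached by key.
resting = toy['resting bp']
```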

We split the dataframe into three sub-dataframes for training, validation, and testing. Each call holds out 20% of the rows it is given:

In [12]:
trainFrame, testFrame = train_test_split(dataframe, test_size=0.2)
trainFrame, validateFrame = train_test_split(trainFrame, test_size=0.2)
print(trainFrame.shape)
print(validateFrame.shape)
print(testFrame.shape)

(193, 14)
(49, 14)
(61, 14)
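
The row counts above follow from applying a 20% hold-out twice. The sketch below reproduces the arithmetic, assuming that a fractional `test_size` is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out share is rounded up to whole rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_rest, n_test = split_sizes(303, 0.2)          # first split: 242 / 61
n_train, n_validate = split_sizes(n_rest, 0.2)  # second split: 193 / 49
print(n_train, n_validate, n_test)
```

Because the split is random, the rows that land in each subset change from run to run. Passing a fixed `random_state` to `train_test_split` makes the split reproducible.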


Next, we wrap each (sub-)dataframe into a `tensorflow.data.Dataset` object. The latter becomes a bridge that maps the dataframe to feature columns, which will be used to train the model.

In [13]:
def dataframe2dataset(dataframe, shuffle=True, batchSize=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')  # Binary (0/1) heart-disease diagnosis.
    # dict(dataframe) maps each column name (key) to its column values.
    dataset = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(dataframe))
    dataset = dataset.batch(batchSize)  # Dataset -> BatchDataset
    return dataset

batchSize = 5  # A small batch size for demonstration.
trainSet = dataframe2dataset(trainFrame, batchSize=batchSize)
validateSet = dataframe2dataset(validateFrame, shuffle=False, batchSize=batchSize)
testSet = dataframe2dataset(testFrame, shuffle=False, batchSize=batchSize)

`trainSet`, `validateSet` and `testSet` are `BatchDataset` objects. Iterating over one of them yields one **batch** of data rows at a time. Each batch is a *tuple* of a **feature batch** and a **label batch**. The feature batch is a dict mapping each column name to that column's values for the rows in the batch.

In [17]:
exampleBatch = next(iter(trainSet))
print("Type and length:", type(exampleBatch), ",", len(exampleBatch))
print("batch[0] keys:", list(exampleBatch[0].keys()))
print("batch[0] value example:", exampleBatch[0]['age'])
print("batch[1]:", exampleBatch[1])

Type and length: <class 'tuple'> , 2
batch[0] keys: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
batch[0] value example: tf.Tensor([62 40 49 58 55], shape=(5,), dtype=int32)
batch[1]: tf.Tensor([0 0 0 0 1], shape=(5,), dtype=int32)
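
To make the batch structure concrete, here is a plain-Python sketch of slicing column-wise data into `(feature batch, label batch)` tuples. It only mimics the shape of what the dataset yields (no shuffling, no tensors); `batches` and the seven toy rows are illustrative assumptions.

```python
def batches(features, labels, batch_size):
    """Yield (feature_batch, label_batch) tuples from column-wise data."""
    for start in range(0, len(labels), batch_size):
        stop = start + batch_size
        feature_batch = {name: values[start:stop]
                         for name, values in features.items()}
        yield feature_batch, labels[start:stop]

# Toy column-wise data: seven rows, two feature columns.
features = {
    'age': [62, 40, 49, 58, 55, 63, 67],
    'thal': ['normal', 'reversible', 'normal', 'normal',
             'reversible', 'fixed', 'normal'],
}
labels = [0, 0, 0, 0, 1, 0, 1]

all_batches = list(batches(features, labels, batch_size=5))
first_features, first_labels = all_batches[0]
```

With seven rows and a batch size of 5, the second batch holds only the remaining two rows.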


Our original data has different types of features, e.g., numerical, categorical or binary. `tensorflow.feature_column` provides several kinds of feature columns to encode them.

We will use the following helper function to see some examples.

In [18]:
def inspect(featureColumn):
    """Show how a feature column (or a list of feature columns)
       transforms the example feature batch."""
    # First construct a feature layer.
    featureLayer = tf.keras.layers.DenseFeatures(featureColumn)
    # Provide an example batch to the layer,
    transformedBatch = featureLayer(exampleBatch[0])
    # and see how the raw input is transformed.
    print(transformedBatch.numpy(), ", shape:", transformedBatch.shape)

1. Numeric columns

In [20]:
age = tf.feature_column.numeric_column('age')
inspect(age)
print(exampleBatch[0]['age'])

[[62.]
 [40.]
 [49.]
 [58.]
 [55.]] , shape: (5, 1)
tf.Tensor([62 40 49 58 55], shape=(5,), dtype=int32)


2. Bucketized columns

In [22]:
ageBuckets = tf.feature_column.bucketized_column(
    age,
    boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)
inspect(ageBuckets)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]] , shape: (5, 11)
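
The same bucketization can be sketched with NumPy, which may help in reading the one-hot rows above. Ten boundaries give eleven buckets: one below 18, one for each half-open interval such as [60, 65), and one for 65 and above. The ages are the ones from the example batch.

```python
import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ages = np.array([62, 40, 49, 58, 55])

# np.digitize returns i such that boundaries[i-1] <= age < boundaries[i].
bucket_index = np.digitize(ages, boundaries)

# One-hot encode the bucket indices: one row per age, one column per bucket.
one_hot = np.eye(len(boundaries) + 1)[bucket_index]

print(bucket_index)  # [9 5 6 8 8]
```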


3. Categorical columns

In [24]:
thal = tf.feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
thalOneHot = tf.feature_column.indicator_column(thal)
inspect(thalOneHot)
print(exampleBatch[0]['thal'])

[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]] , shape: (5, 3)
tf.Tensor([b'normal' b'reversible' b'normal' b'normal' b'reversible'], shape=(5,), dtype=string)
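
The encoding above amounts to looking up each string's position in the vocabulary list and setting that position to 1. A minimal plain-Python sketch follows; `one_hot` is a hypothetical helper, and leaving out-of-vocabulary values as all zeros is an assumption of the sketch.

```python
vocabulary = ['fixed', 'normal', 'reversible']
index_of = {word: i for i, word in enumerate(vocabulary)}

def one_hot(value):
    """Return a one-hot row for value; unknown values stay all zeros."""
    row = [0.0] * len(vocabulary)
    if value in index_of:
        row[index_of[value]] = 1.0
    return row

batch = ['normal', 'reversible', 'normal', 'normal', 'reversible']
encoded = [one_hot(value) for value in batch]
```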


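4. Embedding columns
<br>An embedding column maps each category to a dense, trainable vector instead of a one-hot vector, which is useful when the number of categories is large. Rows with the same category share the same vector.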

In [26]:
thalEmbedding = tf.feature_column.embedding_column(thal, dimension=8)
inspect(thalEmbedding)

[[ 0.06894064  0.3722002   0.29687527 -0.03388098  0.04981663 -0.5150623
   0.17188948 -0.3192951 ]
 [ 0.10701027 -0.540475   -0.38190898 -0.21986264  0.6362094  -0.5586064
  -0.58962834 -0.59624755]
 [ 0.06894064  0.3722002   0.29687527 -0.03388098  0.04981663 -0.5150623
   0.17188948 -0.3192951 ]
 [ 0.06894064  0.3722002   0.29687527 -0.03388098  0.04981663 -0.5150623
   0.17188948 -0.3192951 ]
 [ 0.10701027 -0.540475   -0.38190898 -0.21986264  0.6362094  -0.5586064
  -0.58962834 -0.59624755]] , shape: (5, 8)
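
Conceptually, an embedding is a lookup into a table with one row per category. The NumPy sketch below illustrates this with a randomly initialized table; the seed and the initialization scale are assumptions, and in the real layer the table entries are weights that get updated during training.

```python
import numpy as np

vocabulary = ['fixed', 'normal', 'reversible']
dimension = 8

# Assumed initialization: small random values, one row per category.
rng = np.random.default_rng(0)
table = rng.normal(scale=dimension ** -0.5,
                   size=(len(vocabulary), dimension))

batch = ['normal', 'reversible', 'normal', 'normal', 'reversible']
ids = [vocabulary.index(value) for value in batch]

# Fancy indexing picks the table row for each category id.
embedded = table[ids]
```

This is why the first, third and fourth rows of the output above are identical: they all look up the row for `normal`.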


5. Hashed feature columns
<br>Category strings are hashed into `hash_bucket_size` buckets, so no vocabulary list is needed. `hash_bucket_size` can be much smaller than the vocabulary size, at the cost of distinct categories occasionally colliding in the same bucket.

In [28]:
thalHashed = tf.feature_column.categorical_column_with_hash_bucket(
    'thal', hash_bucket_size=1000)
inspect(tf.feature_column.indicator_column(thalHashed))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] , shape: (5, 1000)
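
The idea behind hash bucketing can be sketched in a few lines. Note that this sketch uses MD5 from the standard library purely for illustration; TensorFlow uses its own hash function, so the bucket indices here will not match the ones it produces. `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket index in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

categories = ['fixed', 'normal', 'reversible']

# With many buckets, collisions among three categories are unlikely.
wide = {c: hash_bucket(c, 1000) for c in categories}

# With only two buckets, at least two of the three categories must collide.
narrow = {c: hash_bucket(c, 2) for c in categories}
```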


6. Crossed feature columns
<br>A crossed column hash-encodes a **feature cross**, i.e., a combination of several features treated as a single feature. The example below crosses the bucketized age with thal.

In [31]:
ageThalCross = tf.feature_column.crossed_column(
    [ageBuckets, thal], hash_bucket_size=1000)
ageThalOneHot = tf.feature_column.indicator_column(ageThalCross)
inspect(ageThalOneHot)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] , shape: (5, 1000)
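
A feature cross can be pictured as hashing the pair (age bucket, thal) into a single bucket index. The sketch below is illustrative only: the key format and the MD5 hash are assumptions, not what TensorFlow does internally, and `age_bucket` and `crossed_bucket` are hypothetical helpers.

```python
import bisect
import hashlib

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket(age):
    """Index of the age bucket, e.g. 62 falls in [60, 65) -> 9."""
    return bisect.bisect_right(boundaries, age)

def crossed_bucket(age, thal, hash_bucket_size=1000):
    """Hash the (age bucket, thal) pair into one bucket index."""
    key = f'{age_bucket(age)}_X_{thal}'  # assumed key format
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Ages 58 and 55 share the bucket [55, 60), so with the same thal value
# they are mapped to the same crossed feature.
same_cross = crossed_bucket(58, 'normal') == crossed_bucket(55, 'normal')
```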


We now collect the feature columns that we will use to transform our raw input data.

In [39]:
featureColumns = []
# Numeric columns
for header in ['age', 'trestbps', 'chol', 'thalach',
               'oldpeak', 'slope', 'ca']:
    featureColumns.append(tf.feature_column.numeric_column(header))
# Bucketized columns
featureColumns.append(ageBuckets)
# Indicator columns
featureColumns.append(thalOneHot)
# Embedding columns
featureColumns.append(thalEmbedding)
# Crossed columns
featureColumns.append(ageThalOneHot)

Using the feature columns, we define a feature layer, as done in `inspect`.

In [40]:
featureLayer = tf.keras.layers.DenseFeatures(featureColumns)

We also rebuild the datasets with a larger batch size.

In [41]:
batchSize = 32
trainSet = dataframe2dataset(trainFrame, batchSize=batchSize)
validateSet = dataframe2dataset(validateFrame, shuffle=False, batchSize=batchSize)
testSet = dataframe2dataset(testFrame, shuffle=False, batchSize=batchSize)

Finally, we define, compile, train, and evaluate the model.

In [43]:
model = tf.keras.Sequential([
    featureLayer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(trainSet,
          validation_data=validateSet,
          epochs=5)
model.evaluate(testSet)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.574438214302063, 0.6557377]
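
`model.evaluate` returns the loss followed by the metrics passed to `compile`, so the list above is `[binary cross-entropy, accuracy]`. As a quick sanity check, the accuracy can be converted back into a count of correctly classified test rows; the numbers are copied from the output above and the test-set size from the earlier split.

```python
loss, accuracy = 0.574438214302063, 0.6557377
n_test = 61  # rows in testFrame

# Accuracy is the fraction of correct predictions, so this recovers the count.
n_correct = round(accuracy * n_test)
print(n_correct, "of", n_test)  # 40 of 61
```

Since the train/test split and the weight initialization are random, these values will differ between runs.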