# Classifying structured data with tf.keras

In this tutorial we will learn how to make predictions on structured data (for example, a CSV file or a spreadsheet). We will use a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) on heart disease provided by the Cleveland Clinic Foundation. There are 303 rows and 14 columns. Each row describes a patient, and each column describes a feature. We will use this information to predict whether a patient has heart disease.

In [0]:
import tensorflow as tf
tf.enable_eager_execution()

import pprint

from tensorflow.python.feature_column import feature_column_v2 as fc

print(tf.__version__)

In [0]:
!wget 'https://storage.googleapis.com/amitpatankar-datasets/heart-disease-uci.zip'
!unzip -o 'heart-disease-uci.zip'

Here is a [description](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names) of this dataset:

>Column| Description| Feature Type | Data Type
>------------|--------------------|----------------------|-----------------
>Age | Age in years | Numerical | integer
>Sex | (1 = male; 0 = female) | Categorical | integer
>CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer
>Trestbps | Resting blood pressure (in mm Hg on admission to the hospital) | Numerical | integer
>Chol | Serum cholesterol in mg/dl | Numerical | integer
>FBS | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) | Categorical | integer
>RestECG | Resting electrocardiographic results (0, 1, 2) | Categorical | integer
>Thalach | Maximum heart rate achieved | Numerical | integer
>Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | integer
>Oldpeak | ST depression induced by exercise relative to rest | Numerical | float
>Slope | The slope of the peak exercise ST segment | Numerical | integer
>CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer
>Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string
>Target | Diagnosis of heart disease (1 = true; 0 = false) | Classification | integer

## Explore the data with Pandas


In [0]:
import pandas as pd
df = pd.read_csv('heart-disease-uci/heart_train.csv')

Inspect the first few rows:

In [0]:
df.head(3)

We can use `describe` to see summary statistics about our dataset:

In [0]:
df.describe(include="all")

We see there are 14 columns. The first 13 are features, and the last is the target (or class label) we want to predict. There are both numeric features (like age) and categorical features (like sex).

## Load the dataset

Here, we'll use tf.data to load this CSV file using the [make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset) utility.


In [0]:
dataset = tf.data.experimental.make_csv_dataset('heart-disease-uci/heart_train.csv', header=True, label_name='target', batch_size=32)

# let's cast our labels to float (to prevent model training failure later on)
dataset = dataset.map(lambda features,labels: (features, tf.to_float(labels)))

Here is an example of how to use this dataset:

In [0]:
# we'll use pprint here as it makes large dictionary print-outs more human readable
for features, labels in dataset.take(1):
  pprint.pprint(features)
  print()
  print(repr(labels))

Looking above, we can see that we now have a dictionary of tensors.
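To make the "dictionary of tensors" shape concrete, here is a minimal plain-Python sketch of the same idea: rows of a CSV are regrouped into one list per column, with the label column split off. The sample rows and the `read_batch` helper are made up for illustration, and the values stay as strings here, whereas the tf.data pipeline above produces typed tensors.

```python
import csv
import io

# Hypothetical three-row CSV using a few of this dataset's column names.
# The values are invented for illustration only.
SAMPLE_CSV = """age,sex,thal,target
63,1,fixed,0
67,1,normal,1
41,0,reversible,0
"""

def read_batch(csv_text, label_name):
    """Regroup CSV rows into (features, labels), one list of values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Remove the label from every row so that only features remain.
    labels = [row.pop(label_name) for row in rows]
    features = {name: [row[name] for row in rows] for name in rows[0]}
    return features, labels

features, labels = read_batch(SAMPLE_CSV, label_name="target")
print(features)
print(labels)
```

Each key in `features` plays the role of one column name, and each list plays the role of one batched tensor.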

## Transform with FeatureColumns

You can think of feature columns as the intermediaries between the raw data in your CSV and the model that will process it.

Note: Feature Columns are only used when working with structured data. If you're classifying images, for example, you can skip this step.

The dataset we created above generates dictionaries of feature tensors and labels. Now, we will use feature columns to represent these in a way that is meaningful to our model.

A feature_column is a configuration object. It doesn’t hold any data itself, but it tells our model how to transform the raw input data into a useful format. 

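In this tutorial we use five kinds of feature column: numeric columns for values that are already numbers, categorical identity columns for integer-coded categories, indicator columns to turn a categorical column into a dense multi-hot vector, categorical vocabulary columns for strings, and embedding columns to learn a dense representation of a categorical column.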

In [0]:
# a list of all our column names
HEADERS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# print our list of columns
HEADERS

In [0]:
# create lists of various feature types
NUMERICAL_COLUMNS = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']
STRING_COLUMNS = ['thal']
BUCKETIZED_COLUMNS = ['target']

In [0]:
# for our categorical features we will create a list of tuples, which contain the column name and the number of buckets
CATEGORICAL_COLUMNS = [('sex',2), ('cp',5), ('fbs',2), ('restecg',3), ('exang',2)]

Now that we have appropriately separated our features, we can define appropriate feature columns. 

We'll create a list to store our newly created feature columns in.

In [0]:
# list to store our feature columns in
feature_columns = []

### Numeric Columns

Data that is already numeric is straightforward: we just use `numeric_column`. The code below iterates through the numerical features in our dataset and appends numeric feature columns to our feature_columns list.


In [0]:
NUMERICAL_COLUMNS

In [0]:
for header in NUMERICAL_COLUMNS:
  feature_columns.append(fc.numeric_column(header))

You’ll note that all we’ve done here is define a type of feature. We haven’t passed any of our data into this feature yet; it's just a configuration object.

It's worth noting that transformations that are applied by feature columns become part of the model’s graph, and are therefore exported with the SavedModel. So it is recommended to push any transformations that should be applied to data during both training and inference into feature_columns.

### Categorical Identity Column

In a categorical identity column, each bucket represents a single, unique integer. Once the column is turned into a dense vector, the result is commonly referred to as a one-hot encoding. For example, let's say you want to represent the integer range [0, 4), that is, the integers 0, 1, 2, or 3. In this case, the categorical identity mapping sends 0 to [1, 0, 0, 0], 1 to [0, 1, 0, 0], 2 to [0, 0, 1, 0], and 3 to [0, 0, 0, 1].

Note that this is a one-hot encoding, not a binary numerical encoding.
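As a rough illustration of the one-hot mapping for the range [0, 4), here is a plain-Python sketch. The `one_hot` helper is hypothetical and only mimics the idea; it is not how TensorFlow implements the column.

```python
def one_hot(value, num_buckets):
    """Return a list with a 1 at position `value` and 0 everywhere else."""
    if not 0 <= value < num_buckets:
        raise ValueError("value must be in the range [0, num_buckets)")
    return [1 if i == value else 0 for i in range(num_buckets)]

# Print the mapping for every integer in [0, 4).
for value in range(4):
    print(value, one_hot(value, num_buckets=4))
```

Exactly one position is set per integer, which is what distinguishes this from a binary numerical encoding, where 3 would be written as the bits `11`.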

In [0]:
CATEGORICAL_COLUMNS

In [0]:
for header, num_buckets in CATEGORICAL_COLUMNS:
    
    # create categorical identity feature column
    cci = fc.categorical_column_with_identity(header, num_buckets=num_buckets)
    
    # create an indicator column to generate a multi-hot representation
    indicator = fc.indicator_column(cci)
    
    # append our categorical feature columns
    feature_columns.append(indicator)

#### Using Indicator Columns

In the above example we have taken our one-hot encoded categorical identity column and used it as the input to an indicator column.
Indicator columns never work on the raw features themselves but instead take categorical columns as an input and allow us to encode a multi-hot representation.
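The difference between one-hot and multi-hot can be sketched in plain Python. The `multi_hot` helper below is hypothetical and simplified: it only marks which category ids are present for an example and ignores repeated ids.

```python
def multi_hot(values, num_buckets):
    """Mark every category id present in `values` with a 1."""
    encoded = [0] * num_buckets
    for value in values:
        encoded[value] = 1
    return encoded

# One category per example (as in this dataset): same result as one-hot.
print(multi_hot([2], num_buckets=5))

# Several categories for one example: more than one position is set.
print(multi_hot([0, 3], num_buckets=5))
```

Every categorical feature in this dataset holds a single value per patient, so here the multi-hot output has exactly one position set per example.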

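### Categorical Vocabulary Column
We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent strings as a one-hot vector. In the code below we then wrap the vocabulary column in an embedding column, which learns a dense vector for each string in the vocabulary.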
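The string-to-vector step can be sketched in plain Python using the three `thal` values from this tutorial. The `vocabulary_one_hot` helper is hypothetical, and it simply raises an error for a string outside the vocabulary; how a real feature column treats unknown strings depends on how it is configured.

```python
VOCABULARY = ['normal', 'fixed', 'reversible']

def vocabulary_one_hot(word, vocabulary):
    """Look up the word's position in the vocabulary, then one-hot encode it."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]

for word in VOCABULARY:
    print(word, vocabulary_one_hot(word, VOCABULARY))
```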


In [0]:
STRING_COLUMNS

In [0]:
for header in STRING_COLUMNS:
  
  # list of words within our 'thal' variable
  vocabulary_list = ['normal', 'fixed', 'reversible']
  
  # create categorical vocabulary feature column
  ccv = fc.categorical_column_with_vocabulary_list(header, vocabulary_list=vocabulary_list)
  
  # wrap the vocabulary column in an embedding column (a dense, trainable vector per word)
  embedding = fc.embedding_column(ccv, dimension=3)
  
  # append our categorical vocabulary embedding to our list
  feature_columns.append(embedding)

\**To learn more about embeddings, see the tutorial linked from the **Next Steps** section at the base of this colab*
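Conceptually, an embedding is a lookup table with one row of numbers per category. The plain-Python sketch below shows only the lookup step, under the assumption of random initial values; in a real model the rows are adjusted during training. The helper names and the seed are made up for illustration.

```python
import random

VOCABULARY = ['normal', 'fixed', 'reversible']

def make_embedding_table(vocabulary_size, dimension, seed=0):
    """Create one randomly initialised vector of length `dimension` per category."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dimension)]
            for _ in range(vocabulary_size)]

# dimension=3 mirrors the embedding_column call above.
table = make_embedding_table(len(VOCABULARY), dimension=3)

def embed(word):
    """Return the table row that belongs to `word`."""
    return table[VOCABULARY.index(word)]

print(embed('fixed'))
```

Unlike a one-hot vector, whose length grows with the vocabulary, the embedding length is a value you choose.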


In [0]:
# Let's print out a description of our feature columns
for feature_column in feature_columns:
  print(feature_column)

Now that we have configured all of our features, they will become the first layer of our model using a FeatureLayer. When we train our model, this first layer will act like any other Keras layer, but its primary role will be to take the raw data and transform it into the representations that our neural net expects. This layer will also handle creating and training our embeddings.

So, if you have data that needs transformation before it fits into a model - maybe it’s categorical or has string names and vocabularies - you can use feature_columns to handle those transformations batch by batch in TensorFlow, rather than having a whole separate pipeline to do feature transformations in memory. TensorFlow provides many feature columns, and even ways to combine individual columns into more complex representations of the data that your model can learn. 


In [0]:
feature_layer = fc.FeatureLayer(feature_columns)

## Define and compile our model

Here, we will use a simple sequential model. Notice the first layer!

In [0]:
model = tf.keras.Sequential([
  feature_layer,
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(64, activation=tf.nn.relu),
  tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.keras.losses.binary_crossentropy,
              metrics=['accuracy'])
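To give a feel for the loss chosen above, here is a plain-Python sketch of the standard binary cross-entropy formula, averaged over a batch. The clipping constant `epsilon` is an assumption added to avoid `log(0)`, and the labels and predictions are invented for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over the batch."""
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, epsilon), 1.0 - epsilon)  # keep p away from 0 and 1
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

labels = [1.0, 0.0]
confident = binary_crossentropy(labels, [0.9, 0.1])  # close to the labels
unsure = binary_crossentropy(labels, [0.5, 0.5])     # no information
wrong = binary_crossentropy(labels, [0.1, 0.9])      # confidently wrong
print(confident, unsure, wrong)
```

The loss grows as the predicted probability moves away from the true label, which is why it pairs naturally with the sigmoid output of the final layer.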

### Train our model

Given the small size of our dataset, we will train for 5 epochs.

In [0]:
for epoch in range(5):
  print("Epoch {}:".format(epoch))
  model.fit(dataset, steps_per_epoch=8)

Looking at the above, we can see that our model converges quickly on this small and simple dataset.
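The arithmetic behind `steps_per_epoch` is worth spelling out. The sketch below assumes the dataset repeats indefinitely (the default for `make_csv_dataset` when `num_epochs` is not set), so the step count alone decides how many examples an "epoch" contains. The 303 used at the end is the row count of the full dataset from the introduction; the size of the training file itself is not stated here.

```python
import math

BATCH_SIZE = 32      # batch_size passed to make_csv_dataset
STEPS_PER_EPOCH = 8  # steps_per_epoch passed to model.fit
EPOCHS = 5           # number of iterations of the training loop

examples_per_epoch = BATCH_SIZE * STEPS_PER_EPOCH
total_examples = examples_per_epoch * EPOCHS
print(examples_per_epoch, total_examples)

def steps_for_one_pass(num_rows, batch_size):
    """Number of batches needed to see every row at least once."""
    return math.ceil(num_rows / batch_size)

print(steps_for_one_pass(303, BATCH_SIZE))
```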

### Validating our Model

To check how well the model generalizes, we evaluate it on held-out test data, which we read into a new tf.data dataset in the same way as the training data.

In [0]:
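# Read in our test data with tf.data
# our test data has no header row, so we will assign the column names using the HEADERS list from earlier
test_data = tf.data.experimental.make_csv_dataset('heart-disease-uci/heart_test.csv', column_names=HEADERS, header=False, label_name='target', batch_size=32)

# cast the labels to float, as we did for the training data
test_data = test_data.map(lambda features, labels: (features, tf.to_float(labels)))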

Note here that because we took care of our data transformations using feature columns, we know that the transformation of our input validation data will happen in the same way as it did for our training data, which is critical to ensuring repeatable results.

In [0]:
model.evaluate(test_data, steps=5)

Next, we will save and export our model. Once this is done, we can load it back into our Python program for later use, serve it with TensorFlow Serving, or run it in a webpage using TensorFlow.js.


## Export using SavedModel

TensorFlow provides a model saving format that works across the suite of TensorFlow products, including TensorFlow Serving and TensorFlow.js.

The TensorFlow SavedModel includes a checkpoint with all of our weights and variables, and it also includes the graph that we built for training, evaluating, and predicting. Keras now natively exports to TensorFlow SavedModel format for serving.

Note (tip from markdaoust@): exporting to SavedModel is currently failing; see https://github.com/tensorflow/tensorflow/issues/22837

In [0]:
export_dir = tf.contrib.saved_model.save_keras_model(model, 'keras_n')

This saved model is a fully contained serialization of your model, so you can load it back into Python later if you want to retrain or reuse it.

In [0]:
restored_model = tf.contrib.saved_model.load_keras_model(export_dir)

# Mark's Code - Not implemented as of yet
The cells below contain suggested changes from markdaoust@ that have not been successfully implemented yet.

#### Data Import

In [0]:
# Mark suggested that using utils.get_file might be more portable
# get_file needs the full URL of the archive, not just its parent directory
URL = 'https://storage.googleapis.com/amitpatankar-datasets/heart-disease-uci.zip'
data = tf.keras.utils.get_file('heart-disease-uci.zip', URL, extract=True).replace('.zip','')

In [0]:
# same dataset as before
dataset2 = tf.data.experimental.make_csv_dataset('heart-disease-uci/heart_train.csv', header=True, label_name='target', batch_size=30)

In [0]:
# create lists of various feature types
NUMERICAL_COLUMNS = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']
STRING_COLUMNS = ['thal']
BUCKETIZED_COLUMNS = ['target']
CATEGORICAL_COLUMNS = [('sex',2), ('cp',5), ('fbs',2), ('restecg',3), ('exang',2)]

#### Output fc.FeatureLayers as they're introduced

Mark suggestion: 

Given that you're in eager mode, it would be easy to demonstrate the output of `fc.FeatureLayer` for the various column types, as they're introduced, to provide a little more "show me, don't tell me"

In [0]:
# each dataset element is a (features, labels) pair; keep only the feature dictionary
example_batch, _ = list(dataset.take(1))[0]

In [0]:
numeric_columns = [fc.numeric_column(header) for header in NUMERICAL_COLUMNS]
feature_layer = fc.FeatureLayer(numeric_columns)
print(feature_layer(example_batch).numpy())

In [0]:
identity_columns = [fc.categorical_column_with_identity(header, num_buckets=num_buckets) for header, num_buckets in CATEGORICAL_COLUMNS]

In [0]:
indicator_columns = [fc.indicator_column(col) for col in identity_columns]

In [0]:
# output of a single indicator column
feature_layer = fc.FeatureLayer([indicator_columns[0]])
print(feature_layer(example_batch).numpy())

In [0]:
# output of all indicator columns together
feature_layer = fc.FeatureLayer(indicator_columns)
print(feature_layer(example_batch).numpy())

In [0]:
# dimension=3 is an illustrative choice that matches the embedding size used earlier
embedding_columns = [fc.embedding_column(col, dimension=3) for col in identity_columns]

In [0]:
# output of a single embedding column
feature_layer = fc.FeatureLayer([embedding_columns[0]])
print(feature_layer(example_batch).numpy())

In [0]:
# output of all embedding columns together
feature_layer = fc.FeatureLayer(embedding_columns)
print(feature_layer(example_batch).numpy())