# Preparing Tabular Data with TensorFlow

Tabular data consists of rows and columns. Categorical columns have to be encoded numerically, for example with one-hot encoding, before a model can use them. In this tutorial, I am going to cover how to prepare tabular data with TensorFlow. To show this, I'll use the Titanic dataset. First of all, let's import the libraries.

In [1]:
#Importing libraries.
import functools
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

## Loading the Dataset

The Titanic dataset is an open-source tabular dataset. It consists of columns such as age, gender, passenger class, and whether or not each passenger survived. Google hosts a copy of this dataset. Let me create variables that contain the URLs of the train and test datasets.

In [2]:
# Creating variables for urls of datasets.
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

I am going to use the tf.keras.utils.get_file() function, which downloads a file from a URL if it is not already in the cache.

In [3]:
# Creating variables for paths of datasets.
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv",  TEST_DATA_URL)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


Pandas is one of the most popular Python libraries for manipulating datasets. To read these datasets, you can use the read_csv() function in Pandas.

In [4]:
# Reading the train and test CSV files into pandas DataFrames.
train_df = pd.read_csv(train_file_path, header='infer')
test_df = pd.read_csv(test_file_path, header='infer')

Let me take a look at the first five rows of the train dataset.

In [5]:
# Taking a look at the Titanic dataset.
train_df.head()

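|   | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |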
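To see what read_csv() does without downloading anything, here is a minimal sketch that parses three of the rows shown above from an in-memory string. The names `csv_text` and `sample_df` are only for this example. Note that header='infer' is the pandas default, so it can be left out.

```python
import io

import pandas as pd

# Three rows copied from the output above; csv_text stands in for the downloaded file.
csv_text = """survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
"""

# header='infer' is the default, so the first line becomes the column names.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (3, 10)
print(sample_df.dtypes)  # numeric columns are parsed as numbers, the rest as strings
```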


## Preprocessing the Datasets

As you can see above, the dataset consists of numeric and categorical columns. You will need to mark the "survived" column as the target and the rest of the columns as features. To do this, I am going to use the tf.data.experimental.make_csv_dataset() function. It reads CSV files into a dataset, where each element of the dataset is a (features, labels) tuple that corresponds to a batch of CSV rows.

In [6]:
# Creating the target and features variables.
LABEL_COLUMN = 'survived'
LABELS = [0, 1]
# Let's specify file path, batch size, label name, missing value parameters in make_csv_dataset method.
train_ds = tf.data.experimental.make_csv_dataset(
        train_file_path,
        batch_size = 3,
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs= 1,
        ignore_errors=True)
# Let's create test dataset as above.
test_ds = tf.data.experimental.make_csv_dataset(
        test_file_path,
        batch_size=3,
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True)

Let's take a look at the columns of the train dataset in the first batch.

In [7]:
for batch, label in train_ds.take(1):
    print(label)
    for key, value in batch.items():
        print(f"{key}: {value.numpy()}")

tf.Tensor([1 0 0], shape=(3,), dtype=int32)
sex: [b'female' b'male' b'male']
age: [31. 28. 64.]
n_siblings_spouses: [1 0 1]
parch: [1 0 4]
fare: [ 20.525  15.05  263.   ]
class: [b'Third' b'Second' b'First']
deck: [b'unknown' b'unknown' b'C']
embark_town: [b'Southampton' b'Cherbourg' b'Southampton']
alone: [b'n' b'y' b'n']
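
To make the (features, labels) structure concrete, here is a rough sketch in plain Python. The `batch_rows` helper is hypothetical and not part of TensorFlow; it only mimics the shape of each element. Unlike make_csv_dataset(), which shuffles rows by default, this sketch keeps the rows in order.

```python
# Hypothetical helper, not part of TensorFlow: it only mimics the shape of the output.
def batch_rows(rows, label_name, batch_size):
    """Yield (features, labels) tuples, one per batch of rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        labels = [row[label_name] for row in chunk]
        feature_names = [name for name in chunk[0] if name != label_name]
        # Each feature maps to the list of its values in this batch.
        features = {name: [row[name] for row in chunk] for name in feature_names}
        yield features, labels


rows = [
    {"survived": 0, "sex": "male", "age": 22.0},
    {"survived": 1, "sex": "female", "age": 38.0},
    {"survived": 1, "sex": "female", "age": 26.0},
    {"survived": 1, "sex": "female", "age": 35.0},
]

batches = list(batch_rows(rows, label_name="survived", batch_size=3))
first_features, first_labels = batches[0]
print(first_labels)    # [0, 1, 1]
print(first_features)  # one list of three values per feature
```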


Now that I have loaded the train and test datasets, let me arrange the columns by feature type. First of all, I am going to designate the numeric columns.

In [8]:
# Setting numeric columns
feature_columns = []
# numeric columns
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
    feature_columns.append(feature_column.numeric_column(header))

If you want, you can bin age into buckets. First, let's take a look at the statistics of the age column. To do this, I am going to use Pandas.

In [9]:
titanic_df = pd.read_csv(train_file_path, header='infer')
titanic_df.describe()

|   | survived | age | n_siblings_spouses | parch | fare |
|---|---|---|---|---|---|
| count | 627.0 | 627.0 | 627.0 | 627.0 | 627.0 |
| mean | 0.38756 | 29.631308 | 0.545455 | 0.379585 | 34.385399 |
| std | 0.487582 | 12.511818 | 1.15109 | 0.792999 | 54.59773 |
| min | 0.0 | 0.75 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 23.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 0.0 | 28.0 | 0.0 | 0.0 | 15.0458 |
| 75% | 1.0 | 35.0 | 1.0 | 0.0 | 31.3875 |
| max | 1.0 | 80.0 | 8.0 | 5.0 | 512.3292 |
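
The 25%, 50%, and 75% rows are quartiles. As a small sketch of how such values are computed, the snippet below applies np.percentile to the five ages printed by train_df.head() earlier. These five values are only a tiny sample, so the result differs from the quartiles of the full dataset.

```python
import numpy as np

# Ages of the five passengers shown by train_df.head() earlier.
ages = np.array([22.0, 38.0, 26.0, 35.0, 28.0])

# 25th, 50th, and 75th percentiles, the same statistics that describe() reports.
quartiles = np.percentile(ages, [25, 50, 75])
print(quartiles)  # [26. 28. 35.]
```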


Let me try three bin boundaries for age: 23, 28, and 35. These are the 25%, 50%, and 75% values of the age column reported above.

In [10]:
# Bucketizing the age column
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])
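
To see what these boundaries do, here is a minimal NumPy sketch of the same idea. It is not the TensorFlow implementation; it assumes left-inclusive buckets, that is (-inf, 23), [23, 28), [28, 35), and [35, +inf), and one-hot encodes the bucket index. The sample ages are made up for illustration.

```python
import numpy as np

boundaries = [23, 28, 35]
ages = np.array([22.0, 23.0, 28.0, 31.0, 64.0])  # illustrative values

# np.digitize returns i such that boundaries[i-1] <= age < boundaries[i].
bucket_ids = np.digitize(ages, boundaries)

# One row per age, one column per bucket (three boundaries give four buckets).
one_hot = np.eye(len(boundaries) + 1)[bucket_ids]

print(bucket_ids)  # [0 1 2 2 3]
print(one_hot)
```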

To one-hot encode the categorical columns, I first need to see their distinct values.

In [11]:
# Determining the categorical columns and their unique values
h = {}
for col in titanic_df:
    if col in ['sex', 'class', 'deck', 'embark_town', 'alone']:
        print(col, ':', titanic_df[col].unique())
        h[col] = titanic_df[col].unique()

sex : ['male' 'female']
class : ['Third' 'First' 'Second']
deck : ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town : ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone : ['n' 'y']


Let's use categorical_column_with_vocabulary_list since the inputs are in string format. The h variable keeps track of the unique values of each column.

In [12]:
# Converting categorical columns and encoding unique categorical values.
# The first argument is the key, which must match the column name in the dataset.
sex_type = feature_column.categorical_column_with_vocabulary_list(
      'sex', h.get('sex').tolist())
sex_type_one_hot = feature_column.indicator_column(sex_type)

class_type = feature_column.categorical_column_with_vocabulary_list(
      'class', h.get('class').tolist())
class_type_one_hot = feature_column.indicator_column(class_type)

deck_type = feature_column.categorical_column_with_vocabulary_list(
      'deck', h.get('deck').tolist())
deck_type_one_hot = feature_column.indicator_column(deck_type)

embark_town_type = feature_column.categorical_column_with_vocabulary_list(
      'embark_town', h.get('embark_town').tolist())
embark_town_type_one_hot = feature_column.indicator_column(embark_town_type)

alone_type = feature_column.categorical_column_with_vocabulary_list(
      'alone', h.get('alone').tolist())
alone_one_hot = feature_column.indicator_column(alone_type)
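
As a rough picture of what a vocabulary list combined with an indicator column produces, here is a small NumPy sketch. The `one_hot_from_vocabulary` helper is hypothetical, and leaving values outside the vocabulary as all zeros is an assumption of this sketch.

```python
import numpy as np


def one_hot_from_vocabulary(values, vocabulary):
    """Return one row per value with a 1 in the column of its vocabulary index."""
    index = {term: i for i, term in enumerate(vocabulary)}
    encoded = np.zeros((len(values), len(vocabulary)), dtype=np.float32)
    for row, value in enumerate(values):
        # Assumption of this sketch: values outside the vocabulary stay all-zero.
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded


class_vocabulary = ["Third", "First", "Second"]  # order from the unique() output above
encoded = one_hot_from_vocabulary(["Third", "Second", "First", "Crew"], class_vocabulary)
print(encoded)
```

The last row is all zeros because "Crew" is a made-up value that is not in the vocabulary.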

The "deck" column has eight unique values, so I am going to embed this column into a smaller dense vector instead of one-hot encoding it.

In [13]:
# Embedding the "deck" column and reducing its dimension to 3.
deck = feature_column.categorical_column_with_vocabulary_list(
      'deck', titanic_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)
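
Conceptually, an embedding is a lookup table with one row per category. The following NumPy sketch illustrates this with random weights. In a real model the table is learned during training, and the names `embedding_table` and `embed` are only for this example.

```python
import numpy as np

deck_vocabulary = ["unknown", "C", "G", "A", "B", "D", "F", "E"]
rng = np.random.default_rng(seed=0)

# One 3-dimensional row per deck value; in a real model these weights are learned.
embedding_table = rng.normal(size=(len(deck_vocabulary), 3))


def embed(values):
    """Look up the embedding row of each deck value."""
    ids = [deck_vocabulary.index(value) for value in values]
    return embedding_table[ids]


vectors = embed(["unknown", "unknown", "C"])
print(vectors.shape)  # (3, 3)
```

Each passenger is now described by 3 numbers for the deck instead of 8, and equal deck values always map to the same vector.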

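Another option for categorical columns is a hashed feature column, which maps each value to one of a fixed number of buckets. This is mostly useful for columns with many distinct values. Here I apply it to the "class" column for illustration.

In [14]:
# Hashing the class column into 4 buckets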
class_hashed = feature_column.categorical_column_with_hash_bucket(
      'class', hash_bucket_size=4)
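
The idea behind a hash bucket can be sketched in a few lines of plain Python. This sketch uses MD5 from hashlib as a stand-in hash function, so the bucket ids will not match the ones TensorFlow computes; the `hash_bucket` helper is hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    # MD5 is only a stand-in here; TensorFlow uses its own hash function.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


class_buckets = {value: hash_bucket(value, 4) for value in ["First", "Second", "Third"]}
print(class_buckets)
```

No vocabulary is needed, but two different values can land in the same bucket.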

There may be an interaction between passenger gender and passenger class. Let's encode those interactions using the crossed_column() function.

In [15]:
cross_type_feature = feature_column.crossed_column(['sex', 'class'], hash_bucket_size=5)
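
To give an intuition for a feature cross, the sketch below joins each (sex, class) pair into one key and hashes it into 5 buckets. As before, MD5 is a stand-in for TensorFlow's hash function, and the `crossed_bucket` helper and its key format are assumptions of this example.

```python
import hashlib
from itertools import product


def crossed_bucket(sex, passenger_class, hash_bucket_size=5):
    """Hash the (sex, class) pair into one of hash_bucket_size buckets."""
    key = f"{sex}_X_{passenger_class}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


combinations = list(product(["male", "female"], ["First", "Second", "Third"]))
cross_ids = {pair: crossed_bucket(*pair) for pair in combinations}
print(cross_ids)

# Six combinations cannot fit into five buckets without at least one collision.
print(len(set(cross_ids.values())) < len(combinations))  # True
```

With hash_bucket_size=5, at least two of the six combinations share a bucket, so a larger bucket size would be needed to keep every pair separate.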

Now I am going to put together what I've done. Let's create a list to hold all the features.

In [16]:
feature_columns = []

# appending numeric columns
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
    feature_columns.append(feature_column.numeric_column(header))
    
# appending bucketized columns
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])
feature_columns.append(age_buckets)

# appending categorical columns
indicator_column_names = ['sex', 'class', 'deck', 'embark_town', 'alone']
for col_name in indicator_column_names:
    categorical_column = feature_column.categorical_column_with_vocabulary_list(
        col_name, titanic_df[col_name].unique())
    indicator_column = feature_column.indicator_column(categorical_column)
    feature_columns.append(indicator_column)
    
# appending embedding columns
deck = feature_column.categorical_column_with_vocabulary_list(
      'deck', titanic_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)
feature_columns.append(deck_embedding)

# appending crossed columns
feature_columns.append(feature_column.indicator_column(cross_type_feature))

Now I am going to create a feature layer. This layer will serve as the first (input) layer in the model.

In [17]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)  

Let me split test_df into validation and test datasets. Hyperparameters are tuned on the validation dataset, and the final model is evaluated on the test dataset.

In [18]:
val_df, test_df = train_test_split(test_df, test_size=0.4)
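
Note that train_test_split shuffles the rows and, without a random_state, draws a different split on every run. The NumPy sketch below shows the idea of a seeded 60/40 split on row indices. The `split_indices` helper is hypothetical, and rounding the test share up is a choice of this sketch.

```python
import numpy as np


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and split them into (validation part, test part)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]


val_idx, test_idx = split_indices(n_rows=10, test_size=0.4)
print(len(val_idx), len(test_idx))  # 6 4
```

With a fixed seed the same rows end up in the same part every time, which makes results easier to compare between runs.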

Let me specify the target variable.

In [19]:
labels = train_df.pop("survived")

To stream the data into the training process, I am going to create a function that converts a DataFrame into a tf.data.Dataset.

In [20]:
def pandas_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('survived')
    # To transform the DataFrame into a key-value pair. 
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    # To shuffle and batch
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Applying this function to both the validation and test data.

In [21]:
batch_size=32
val_ds = pandas_to_dataset(val_df, shuffle=False, batch_size=batch_size)
test_ds = pandas_to_dataset(test_df, shuffle=False, batch_size=batch_size)

## Building the Model

In [22]:
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dropout(.1),
  layers.Dense(1)
])

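You can take a look at a summary of the model with model.summary(). Because the first layer does not declare an input shape, call it after the model has been built, for example after training.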

## Compiling the Model

In [23]:
model.compile(optimizer='adam', 
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
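
The last Dense(1) layer has no activation, so the model outputs raw scores (logits), and from_logits=True tells the loss to handle the sigmoid itself. Here is a small NumPy sketch of that computation with made-up labels and logits. It shows the math only and is not the Keras implementation.

```python
import numpy as np


def sigmoid(logits):
    """Turn raw scores into probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-logits))


def binary_crossentropy_from_logits(labels, logits):
    """Mean binary cross-entropy computed directly from logits (numerically stable form)."""
    losses = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return losses.mean()


labels = np.array([1.0, 0.0, 0.0])
logits = np.array([2.0, -1.0, 0.5])  # hypothetical raw outputs of the last Dense(1) layer

probabilities = sigmoid(logits)
loss = binary_crossentropy_from_logits(labels, logits)
print(probabilities)
print(loss)
```

For the same reason, the predictions of this model are logits, so you would apply a sigmoid to them to read them as survival probabilities.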

## Training the Model

In [24]:
model.fit(train_ds,
          validation_data=val_ds,
          epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa0aff829d0>

That is all. In this tutorial, I showed how to prepare a tabular dataset for analysis and how to deal with multiple data types.