##### Copyright 2020.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Deep Dive into CsvDataset in the TensorFlow Data Pipeline (tf.data)

## Overview

Comma-separated values (CSV) is one of the most widely used file formats in data science, so importing data stored in CSV files into TensorFlow is a common task. This tutorial covers the basics of working with CSV files in TensorFlow through the data pipeline API (`tf.data`).

## Setup and download data

As with other tutorials, configure Colab to use TensorFlow 2.x with `%tensorflow_version 2.x`:

In [0]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

Now import TensorFlow into your program:

In [0]:
import tensorflow as tf

The dataset used in this tutorial is taken from the Titanic passenger list. The label is whether or not a passenger survived, and the features are characteristics such as age, gender, ticket class, and whether the person was traveling alone.

For simplicity, the file is downloaded first:




In [0]:
!curl -OL "https://storage.googleapis.com/tf-datasets/titanic/train.csv"

You can preview the first few lines of the data from the command line with `head -5 train.csv`:

In [0]:
!head -5 train.csv

|survived|sex|age|n_siblings_spouses|parch|fare|class|deck|embark_town|alone|
|-|-|-|-|-|-|-|-|-|-|
|0|male|22.0|1|0|7.25|Third|unknown|Southampton|n|
|1|female|38.0|1|0|71.2833|First|C|Cherbourg|n|
|1|female|26.0|0|0|7.925|Third|unknown|Southampton|y|
|1|female|35.0|1|0|53.1|First|C|Southampton|n|
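Columns in this file are later selected by position rather than by name, so it can help to look up the positions from the header first. The following is a minimal sketch using only the Python standard library; the header string is copied from the preview above, and the names `index_of` and `wanted` are illustrative.

```python
import csv
import io

# Header line copied from the preview of train.csv shown above.
header_line = ("survived,sex,age,n_siblings_spouses,parch,"
               "fare,class,deck,embark_town,alone")

# Parse the header with the csv module and map each name to its position.
columns = next(csv.reader(io.StringIO(header_line)))
index_of = {name: i for i, name in enumerate(columns)}

# Columns of interest (illustrative choice).
wanted = ["survived", "sex", "age", "fare"]
col_indices = [index_of[name] for name in wanted]
print(col_indices)  # [0, 1, 2, 5]
```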



## Data processing with `tf.data.experimental.CsvDataset`

The API for loading CSV data into a `tf.data.Dataset` is exposed through [tf.data.experimental.CsvDataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset).

It accepts many arguments, but only two are required: `filenames` is a tensor consisting of one or more CSV filenames, and `record_defaults` is a list of default values for the CSV fields, one per selected column. In addition, it is useful to pass `header=True` to skip the header line (the first line), and `select_cols` to read only the desired columns.

With the following line you can load the selected columns (`survived`, `sex`, `age`, `fare`) into a dataset:

In [0]:
dataset = tf.data.experimental.CsvDataset("train.csv", [0, "", 0.0, 0.0], header=True, select_cols=[0, 1, 2, 5])

print("dataset: {}".format(dataset.element_spec))

The dataset is not very convenient to work with, because each element is a plain tuple and the field names are not obvious:

```
dataset: (
  TensorSpec(shape=(), dtype=tf.int32, name=None),
  TensorSpec(shape=(), dtype=tf.string, name=None),
  TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(), dtype=tf.float32, name=None),
)
```
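The entries of `record_defaults` do two jobs: they set the dtype of each column and they supply the value used when a field is empty. The sketch below illustrates this on a tiny hand-written file rather than on `train.csv`; the file name `tiny.csv`, its contents, and the `-1.0` default for `age` are assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical two-row CSV; the second row has an empty `age` field.
csv_text = (
    "survived,sex,age,fare\n"
    "0,male,22.0,7.25\n"
    "1,female,,71.2833\n"
)
path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(path, "w") as f:
    f.write(csv_text)

# Defaults: int32, string, float32 (with -1.0 marking a missing age), float32.
tiny = tf.data.experimental.CsvDataset(path, [0, "", -1.0, 0.0], header=True)

# Convert each element (a tuple of scalar tensors) to NumPy values.
rows = [tuple(t.numpy() for t in row) for row in tiny]
for row in rows:
    print(row)
```

The empty `age` field in the second row should come back as the default `-1.0`, and string fields are returned as bytes.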

We can use `map(func)` to convert the tuple of fields into a dict:

In [0]:
def func(survived, sex, age, fare):
  return {"survived": survived, "sex": sex, "age": age, "fare": fare}
  
dataset = dataset.map(func)

print("dataset: {}".format(dataset.element_spec))

The output is shown below:

```
dataset: {
  'survived': TensorSpec(shape=(), dtype=tf.int32, name=None),
  'sex': TensorSpec(shape=(), dtype=tf.string, name=None),
  'age': TensorSpec(shape=(), dtype=tf.float32, name=None),
  'fare': TensorSpec(shape=(), dtype=tf.float32, name=None),
}
```
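When many columns are selected, writing out each argument by hand becomes tedious. One possible alternative, sketched here under the assumption that the column names are kept in a list called `COLUMNS`, is a function that accepts any number of fields and zips them with the names. A small in-memory dataset stands in for the CSV dataset so that the example is self-contained.

```python
import tensorflow as tf

# Column names in the same order as the selected CSV columns (illustrative).
COLUMNS = ["survived", "sex", "age", "fare"]

def to_dict(*fields):
  # Pair each positional field with its column name.
  return dict(zip(COLUMNS, fields))

# Stand-in for the tuple-structured CSV dataset.
stand_in = tf.data.Dataset.from_tensor_slices(
    ([0, 1], ["male", "female"], [22.0, 38.0], [7.25, 71.2833]))

named = stand_in.map(to_dict)
first = next(iter(named))
print(first["sex"].numpy())
```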


We can also `batch` the dataset at any time to group elements together, which can improve performance:

In [0]:
dataset = dataset.batch(1024)

print("dataset: {}".format(dataset.element_spec))

The fields of the batched dataset now have the shape `(None,)`:
```
dataset: {
  'survived': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
  'sex': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'age': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
  'fare': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
}
```
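The batch dimension is reported as `None` because the final batch may contain fewer elements than the requested batch size. The sketch below uses `tf.data.Dataset.range(10)` as a stand-in dataset to show this, along with the `drop_remainder=True` option of `batch`, which discards the short final batch and yields a static shape.

```python
import tensorflow as tf

# Stand-in dataset with 10 elements, batched in groups of 4.
batched = tf.data.Dataset.range(10).batch(4)
sizes = [int(b.shape[0]) for b in batched]
print(sizes)  # the last batch holds the 2 leftover elements

# With drop_remainder=True every batch has exactly 4 elements,
# so the batch dimension is known statically.
fixed = tf.data.Dataset.range(10).batch(4, drop_remainder=True)
print(batched.element_spec.shape, fixed.element_spec.shape)
```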

The categorical `sex` field can also be converted into a numerical field with `map(func)`. Because most TensorFlow ops work element-wise and broadcast shapes, you don't even need to worry about the `(None,)` shape:

In [0]:
def f(sex):
  return tf.where(sex == "female", 1, 0)

dataset = dataset.map(lambda e: {"survived": e["survived"], "sex": f(e["sex"]), "age": e["age"], "fare": e["fare"]})

print("dataset: {}".format(dataset.element_spec))

Now every field in the dataset is numeric, which is much friendlier for downstream processing:

```
dataset: {
  'survived': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
  'sex': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
  'age': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
  'fare': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
}
```
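To check what the conversion does to actual values, it can be applied to a single hand-made batch outside of any dataset. This is a minimal sketch: the batch contents are made up, the name `encode_sex` is illustrative, and `dict(batch, sex=...)` is shown as one possible way to replace a single key without listing every field.

```python
import tensorflow as tf

def encode_sex(sex):
  # Element-wise comparison; the scalars 1 and 0 broadcast to the batch shape.
  return tf.where(sex == "female", 1, 0)

# Hypothetical batch of three passengers.
batch = {
    "sex": tf.constant(["male", "female", "female"]),
    "age": tf.constant([22.0, 38.0, 26.0]),
}

# Copy the dict, overriding only the `sex` entry.
encoded = dict(batch, sex=encode_sex(batch["sex"]))
print(encoded["sex"].numpy())
```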

This tutorial showed several operations on top of `CsvDataset`. With these operations it is quite straightforward to import CSV data and construct a dataset that can be passed to the succinct `tf.keras` API, which makes it easy for data scientists to combine their data with machine learning.
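Keras training methods such as `Model.fit` accept a `tf.data.Dataset` whose elements are `(features, label)` pairs, so one remaining step is to split the label out of the dict. The sketch below shows one way to do that; the helper name `split_label` and the in-memory stand-in data are assumptions for illustration, and no model is built or trained here.

```python
import tensorflow as tf

# Stand-in for the batched, fully numeric dict dataset built above.
batched = tf.data.Dataset.from_tensor_slices({
    "survived": [0, 1, 1, 1],
    "sex": [0, 1, 1, 1],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1],
}).batch(2)

def split_label(element, label_key="survived"):
  # Everything except the label column becomes a feature.
  features = {k: v for k, v in element.items() if k != label_key}
  return features, element[label_key]

supervised = batched.map(split_label)
features, label = next(iter(supervised))
print(sorted(features.keys()), label.numpy())
```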