# Load CSV and Numpy File Types

**Popular APIs:**
* `tf.keras.utils.get_file("file.csv", FILE_DATA_URL)`
* `tf.data.experimental.make_csv_dataset`
* `tf.feature_column.numeric_column`
* `tf.feature_column.categorical_column_with_vocabulary_list`
* `tf.feature_column.indicator_column(cat_col)`
* `tf.keras.layers.DenseFeatures(columns)`
* `tf.data.Dataset.from_tensor_slices((train_examples, train_labels))`
* `tf.keras.Sequential`

**Resources:**
1. Read data into a Pandas DataFrame - https://www.tensorflow.org/tutorials/load_data/pandas_dataframe
2. Load text data - this link: https://www.tensorflow.org/tutorials/load_data/text 
3. TF.text - this link:  https://www.tensorflow.org/tutorials/tensorflow_text/intro
4. Load image data - https://www.tensorflow.org/tutorials/load_data/images
5. How to represent Unicode strings in TensorFlow - https://www.tensorflow.org/tutorials/load_data/unicode
6. TFRecord and tf.Example -  https://www.tensorflow.org/tutorials/load_data/tfrecord     

## Load CSV data

In [1]:
import functools

import numpy as np
import tensorflow as tf

print("TensorFlow version: ",tf.version.VERSION)

TensorFlow version:  2.4.1


In [2]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


In [3]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Titanic passenger data: the model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.

In [4]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n


## Data loading

Load CSV data from a file into a `tf.data.Dataset`.

To scale up to a large set of files, or if you need a loader that integrates with [TensorFlow and tf.data](../../guide/data.ipynb) then use the `tf.data.experimental.make_csv_dataset` function.

Alternative: load data using pandas, and pass the NumPy arrays to TensorFlow. 

In [5]:
LABEL_COLUMN = 'survived' 
LABELS = [0,1]

In [6]:
# get_dataset() retrieve a Dataverse dataset or its metadata
def get_dataset(file_path, **kwargs):
    
    # read the CSV data from the file and create a dataset 
    
    dataset = tf.data.experimental.make_csv_dataset(
        file_path, 
        num_epochs=1,
        batch_size=5, # Artificially small to make examples easier to show.
        label_name=LABEL_COLUMN,
        na_value="?",
        ignore_errors=True, 
        **kwargs
    )
    
    return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).

In [7]:
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key,value.numpy()))

In [8]:
show_batch(raw_train_data)

sex                 : [b'female' b'female' b'male' b'male' b'male']
age                 : [28. 19. 46. 50. 28.]
n_siblings_spouses  : [0 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [110.883  30.     79.2    13.     15.05 ]
class               : [b'First' b'First' b'First' b'Second' b'Second']
deck                : [b'unknown' b'B' b'B' b'unknown' b'unknown']
embark_town         : [b'Cherbourg' b'Southampton' b'Cherbourg' b'Southampton' b'Cherbourg']
alone               : [b'y' b'y' b'y' b'y' b'y']


The columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a list of strings to  the `column_names` argument in the `make_csv_dataset` function.

In [9]:
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)

sex                 : [b'female' b'female' b'male' b'female' b'female']
age                 : [ 9. 14. 42. 25. 18.]
n_siblings_spouses  : [2 1 1 1 0]
parch               : [2 0 0 2 2]
fare                : [ 34.375  11.242  52.    151.55   13.   ]
class               : [b'Third' b'Third' b'First' b'First' b'Second']
deck                : [b'unknown' b'unknown' b'unknown' b'C' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'n']


If you need to omit some columns from the dataset, create a list of **just the columns you plan to use**, and pass it into the (optional) `select_columns` argument of the constructor.


In [10]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

age                 : [27. 21. 28. 28. 39.]
n_siblings_spouses  : [0 2 0 0 1]
class               : [b'Second' b'First' b'Third' b'First' b'First']
deck                : [b'unknown' b'B' b'unknown' b'C' b'E']
alone               : [b'y' b'n' b'y' b'y' b'n']


## Data preprocessing

Convert from mixed types of data from a CSV file to a fixed length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: `tf.feature_column`. You can preprocess your data using tools like [nltk](https://www.nltk.org/) or [sklearn](https://scikit-learn.org/stable/), and just pass the processed output to TensorFlow. 

Build the preprocessing pipeline inside model allows you to export the model with the preprocessing. Then you can pass the raw data directly to your model.

For continuous data (data already in an appropriate numeric format), just pack the data into a vector before passing it off to the model.

In [11]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)

age                 : [38. 39.  9. 39. 17.]
n_siblings_spouses  : [1. 1. 4. 0. 0.]
parch               : [5. 1. 2. 0. 0.]
fare                : [31.388 79.65  31.388 13.    14.458]


In [12]:
example_batch, labels_batch = next(iter(temp_dataset)) 

[API docs](https://www.tensorflow.org/api_docs/python/tf/stack) for `tf.stack`. Think of numpy equivalent `np.stack`.

Pack together all the columns and apply this to each element of the dataset:

In [13]:
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), label

In [14]:
packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.numpy())
    print()
    print(labels.numpy())

[[44.     0.     1.    57.979]
 [28.     0.     0.     7.75 ]
 [18.     0.     0.    13.   ]
 [18.     1.     0.    17.8  ]
 [25.     1.     0.    17.8  ]]

[1 1 0 0 0]


Switch back to the mixed dataset - if you have mixed datatypes you may want to separate out these simple-numeric fields. The `tf.feature_column` api can handle them, but this incurs some overhead and should be avoided unless really necessary. Define a more general preprocessor (see `PackNumericFeatures` below) that selects a list of numeric features and packs them into a single column:

In [15]:
show_batch(raw_train_data)

sex                 : [b'female' b'female' b'male' b'female' b'female']
age                 : [29. 38. 18.  7. 35.]
n_siblings_spouses  : [1 0 0 0 0]
parch               : [1 0 0 2 0]
fare                : [ 10.462 227.525   7.75   26.25   21.   ]
class               : [b'Third' b'First' b'Third' b'Second' b'Second']
deck                : [b'G' b'C' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'n' b'y' b'y' b'n' b'y']


In [16]:
example_batch, labels_batch = next(iter(temp_dataset)) 

In [17]:
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names
        
    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=-1)
        features['numeric'] = numeric_features

        return features, labels

In [18]:
NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(PackNumericFeatures(NUMERIC_FEATURES))
packed_test_data = raw_test_data.map(PackNumericFeatures(NUMERIC_FEATURES))

In [19]:
show_batch(packed_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'male']
class               : [b'Third' b'First' b'Third' b'First' b'Third']
deck                : [b'unknown' b'C' b'unknown' b'A' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Cherbourg' b'Cherbourg' b'Queenstown']
alone               : [b'n' b'n' b'y' b'n' b'y']
numeric             : [[18.     1.     0.     6.496]
 [38.     1.     0.    71.283]
 [11.     0.     0.    18.788]
 [48.     1.     0.    39.6  ]
 [28.     0.     0.     6.95 ]]


In [20]:
example_batch, labels_batch = next(iter(packed_train_data)) 

**Numeric data & Data Normalization** - continuous data should always be normalized. The mean based normalization used here requires knowing the means of each column ahead of time.

In [21]:
import pandas as pd
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

Unnamed: 0,age,n_siblings_spouses,parch,fare
count,627.0,627.0,627.0,627.0
mean,29.631308,0.545455,0.379585,34.385399
std,12.511818,1.15109,0.792999,54.59773
min,0.75,0.0,0.0,0.0
25%,23.0,0.0,0.0,7.8958
50%,28.0,0.0,0.0,15.0458
75%,35.0,1.0,0.0,31.3875
max,80.0,8.0,5.0,512.3292


In [22]:
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

print(MEAN, STD)

[29.631  0.545  0.38  34.385] [12.512  1.151  0.793 54.598]


In [23]:
def normalize_numeric_data(data, mean, std):
    
    # Center the data
    
    return (data-mean)/std

Now create a numeric column. The `tf.feature_columns.numeric_column` API accepts a `normalizer_fn` argument, which will be run on each batch.

Bind the `MEAN` and `STD` to the normalizer fn using [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial).

In [24]:
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x1645d59d8>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))

Include this feature column to select and center the block of numeric data in the model training:

In [25]:
example_batch['numeric']

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[28.   ,  0.   ,  0.   ,  8.05 ],
       [29.   ,  1.   ,  0.   , 66.6  ],
       [30.   ,  0.   ,  0.   ,  7.896],
       [28.   ,  1.   ,  0.   , 15.5  ],
       [51.   ,  0.   ,  0.   ,  8.05 ]], dtype=float32)>

In [26]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[-0.13 , -0.474, -0.479, -0.482],
       [-0.05 ,  0.395, -0.479,  0.59 ],
       [ 0.029, -0.474, -0.479, -0.485],
       [-0.13 ,  0.395, -0.479, -0.346],
       [ 1.708, -0.474, -0.479, -0.482]], dtype=float32)

**Categorical data**: use the `tf.feature_column` API to create a collection with a `tf.feature_column.indicator_column` for each categorical column.

In [27]:
CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

In [28]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

In [29]:
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [30]:
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

[1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]


**Combined preprocessing layer**: add the two feature column collections and pass them to a `tf.keras.layers.DenseFeatures` to create an input layer that will extract and preprocess both input types:

In [33]:
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)

In [34]:
print(preprocessing_layer(example_batch).numpy()[0])

[ 1.     0.     0.     0.     1.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.    -0.13  -0.474
 -0.479 -0.482  1.     0.   ]


## Load NumPy data

In [35]:
import numpy as np
import tensorflow as tf

print("TensorFlow version: ",tf.version.VERSION)

TensorFlow version:  2.4.1


### Load data from `.npz` file

Use the MNIST dataset in Keras.

In [47]:
DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
    train_examples = data['x_train']
    train_labels = data['y_train']
    test_examples = data['x_test']
    test_labels = data['y_test']

In [48]:
print(train_examples.shape)
print(train_labels.shape)
print(test_examples.shape)
print(test_labels.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)


### Load NumPy arrays with `tf.data.Dataset`

Assuming you have an array of examples and a corresponding array of labels, pass the two arrays as a tuple into `tf.data.Dataset.from_tensor_slices` to create a `tf.data.Dataset`.

In [49]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

In [50]:
train_dataset

<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.uint8, tf.uint8)>

A next step would be to build a `tf.keras.Sequential`, starting with the `preprocessing_layer`.