## I. The Data API
The whole Data API revolves around the concept of a dataset.

**Use tf.data.Dataset.from_tensor_slices() to create a dataset entirely in RAM:**

In [1]:
import tensorflow as tf
X = tf.range(10)  # any data tensor; here it holds the values 0 through 9
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

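The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose elements are all the slices of X along the first dimension, so this dataset contains 10 items: tensors 0, 1, 2, …, 9. In this case, tf.data.Dataset.range(10) would produce the same values, except that its elements would be 64-bit integers (tf.int64) rather than 32-bit integers (tf.int32).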
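
Slicing along the first dimension also applies to multi-dimensional tensors and to tuples of tensors. The following is a minimal sketch using a made-up 3×2 feature matrix and label vector (the names features and labels and their values are assumptions for illustration). It also compares the element dtype obtained by slicing tf.range() with the one produced by Dataset.range():

```python
import tensorflow as tf

# Hypothetical data: 3 instances with 2 features each, plus 3 labels.
features = tf.constant([[1., 2.], [3., 4.], [5., 6.]])
labels = tf.constant([0, 1, 0])

# Both tensors are sliced along their first dimension, giving 3 (row, label) pairs.
pairs = tf.data.Dataset.from_tensor_slices((features, labels))
rows = [(x.numpy().tolist(), int(y.numpy())) for x, y in pairs]

# Same values, different integer width: tf.range() defaults to int32,
# while Dataset.range() produces int64 elements.
slice_dtype = tf.data.Dataset.from_tensor_slices(tf.range(3)).element_spec.dtype
range_dtype = tf.data.Dataset.range(3).element_spec.dtype
```

Each element of pairs is a tuple holding one row of features and the matching label, which is the shape of data that supervised training loops usually expect.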

In [2]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


In [3]:
dataset = dataset.map(lambda x: x * 2) 

In [4]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)


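### 1. Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like this: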

In [5]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


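> Dataset methods do not modify datasets; they create new ones. Make sure to keep a reference to these new datasets (for example, with dataset = ...), or else nothing will happen.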
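
The note above can be checked directly. This small sketch (the variable names are illustrative) calls batch() once without keeping the result and once with an assignment, then counts the items in each dataset:

```python
import tensorflow as tf

base = tf.data.Dataset.range(6)

# The returned dataset is discarded, so `base` is left untouched.
base.batch(3)
n_items_unbatched = sum(1 for _ in base)

# Keeping a reference to the returned dataset is what makes the change visible.
batched = base.batch(3)
n_items_batched = sum(1 for _ in batched)
```

Here base still yields 6 scalar items, while batched yields 2 batches of 3.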

You can also transform the items by calling the map() method.

For example, this creates a new dataset in which every item is doubled:

In [6]:
dataset = dataset.map(lambda x: x * 2) 

In [7]:
for item in dataset:
    print(item)

tf.Tensor([ 0  4  8 12 16 20 24], shape=(7,), dtype=int32)
tf.Tensor([28 32 36  0  4  8 12], shape=(7,), dtype=int32)
tf.Tensor([16 20 24 28 32 36  0], shape=(7,), dtype=int32)
tf.Tensor([ 4  8 12 16 20 24 28], shape=(7,), dtype=int32)
tf.Tensor([32 36], shape=(2,), dtype=int32)


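### 2. Shuffling the Data
Gradient descent works best when the instances in the training set are independent and identically distributed (IID). A simple way to ensure this is to shuffle the instances with the shuffle() method. It creates a new dataset that starts by filling a buffer with the first items of the source dataset. Then, whenever it is asked for an item, it pulls one out of the buffer at random and replaces it with a fresh one from the source dataset, until it has iterated entirely through the source dataset. At that point it continues to pull items out of the buffer at random until the buffer is empty. You must specify the buffer size, and it is important to make it large enough, or else the shuffling will not be very effective. Just don't exceed the amount of RAM you have, and even if you have plenty, there is no need to go beyond the dataset's size. You can provide a random seed if you want the same random order every time you run your program.

For example, the following code creates and displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer of size 5 and a random seed of 42, and batched with a batch size of 7: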

In [9]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)


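For a large dataset that does not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the dataset.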
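
To see why a small buffer limits the shuffle, here is a pure-Python sketch of the buffer logic (fill, sample and replace, then drain). It is an illustration only, not TensorFlow's actual implementation; buffered_shuffle is a hypothetical helper and it assumes buffer_size is at least 1:

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in the order a shuffle buffer would emit them."""
    rng = random.Random(seed)
    source = iter(source)
    buffer = []
    # Step 1: fill the buffer with the first `buffer_size` items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Step 2: emit a random buffered item and replace it with a new source item.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Step 3: the source is exhausted, so drain the buffer in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(30), buffer_size=5, seed=42))
# How far ahead of its original position did any item come out?
max_positions_early = max(item - pos for pos, item in enumerate(shuffled))
```

An item can never be emitted more than buffer_size - 1 positions ahead of where it sat in the source, so max_positions_early is at most 4 here. With a buffer much smaller than the dataset, items from the end of the data can never reach the start of an epoch.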

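One common solution is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this, you can pick multiple files at random and read them simultaneously, interleaving their records. Then, on top of that, you can add a shuffling buffer using the shuffle() method.

This may sound like a lot of work, but don't worry: the Data API makes all of this possible in just a few lines of code.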

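### Splitting the California Housing Dataset into Multiple CSV Files
First, suppose you have loaded the California housing dataset, shuffled it, and split it into a training set, a validation set, and a test set. Each set is then split into many CSV files like the ones shown below (each row contains 8 input features plus the target median house value):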

In [14]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, 
    housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's start by splitting the housing dataset and saving the training set to 20 CSV files (the validation and test sets go to 10 files each):

In [25]:
import os

import numpy as np  # needed by np.array_split() and np.arange() below

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.getcwd()
    housing_dir = os.path.join(housing_dir, "datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [27]:
import numpy as np
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]

header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

In [28]:
import pandas as pd
pd.read_csv(train_filepaths[0]).head()

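|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|------------------|
| 0 | 3.5214 | 15.0 | 3.049945 | 1.106548 | 1447.0 | 1.605993 | 37.63 | -122.43 | 1.442 |
| 1 | 5.3275 | 5.0 | 6.49006 | 0.991054 | 3464.0 | 3.44334 | 33.69 | -117.39 | 1.687 |
| 2 | 3.1 | 29.0 | 7.542373 | 1.591525 | 1328.0 | 2.250847 | 38.44 | -122.98 | 1.621 |
| 3 | 7.1736 | 12.0 | 6.289003 | 0.997442 | 1054.0 | 2.695652 | 33.55 | -117.7 | 2.621 |
| 4 | 2.0549 | 13.0 | 5.312457 | 1.085092 | 3297.0 | 2.244384 | 33.93 | -116.93 | 0.956 |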


In [31]:
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


In [33]:
train_filepaths[0:3]

['/Users/dayao/Github/Architect-CTO-growth/人工智能技术/《机器学习实战：基于Scikit-Learn、Keras和TensorFlow》笔记及练习/datasets/housing/my_train_00.csv',
 '/Users/dayao/Github/Architect-CTO-growth/人工智能技术/《机器学习实战：基于Scikit-Learn、Keras和TensorFlow》笔记及练习/datasets/housing/my_train_01.csv',
 '/Users/dayao/Github/Architect-CTO-growth/人工智能技术/《机器学习实战：基于Scikit-Learn、Keras和TensorFlow》笔记及练习/datasets/housing/my_train_02.csv']

### Building an Input Pipeline
Create a dataset that contains only these file paths:

In [35]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

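Call the interleave() method to read from 5 files at a time and interleave their lines (using the skip() method to skip the first line of each file, which is the header row):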

In [36]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers
)

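The interleave() method creates a dataset that pulls 5 file paths from filepath_dataset, and for each one it calls the function you gave it (a lambda in this example) to create a new dataset (a TextLineDataset in this example). To be clear, at this stage there are 7 datasets in all: the file path dataset, the interleave dataset, and the 5 TextLineDatasets created internally by the interleave dataset. When you iterate over the interleave dataset, it cycles through these 5 TextLineDatasets, reading one line at a time from each until all of them are exhausted. It then fetches the next 5 file paths from filepath_dataset and interleaves them the same way, and so on until it runs out of file paths.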
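
The cycling order is easier to see on a toy example with no files involved. In the sketch below, the integers 0 to 3 stand in for file paths and make_lines is a hypothetical stand-in for TextLineDataset: "file" i is assumed to hold the three "lines" 10*i, 10*i + 1, and 10*i + 2. With cycle_length=2, two "files" are read at a time:

```python
import tensorflow as tf

# Stand-in for the file path dataset: 4 "files" identified by 0..3.
file_ids = tf.data.Dataset.range(4)

def make_lines(file_id):
    # Stand-in for TextLineDataset: 3 consecutive integers per "file".
    return tf.data.Dataset.range(file_id * 10, file_id * 10 + 3)

# Read from 2 "files" at a time, taking one "line" from each in turn.
interleaved = file_ids.interleave(make_lines, cycle_length=2)
lines = [int(x.numpy()) for x in interleaved]
# lines == [0, 10, 1, 11, 2, 12, 20, 30, 21, 31, 22, 32]
```

The first two "files" alternate until both are exhausted, and only then do the next two start. This is why records from different groups of files are never mixed by interleave() alone.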

The first rows (ignoring the header row) of 5 CSV files, picked at random:

In [37]:
for line in dataset.take(5):
    print(line.numpy())

b'4.2083,44.0,5.323204419889502,0.9171270718232044,846.0,2.3370165745856353,37.47,-122.2,2.782'
b'4.1812,52.0,5.701388888888889,0.9965277777777778,692.0,2.4027777777777777,33.73,-118.31,3.215'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442'


### Preprocessing the Data
Implement a small function that performs the preprocessing:

In [38]:
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

In [39]:
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579159,  1.216324  , -0.05204564, -0.39215982, -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
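
The individual steps shown so far (listing files, interleaving lines, parsing, shuffling, and batching) can be combined into one helper. The following is a minimal sketch under stated assumptions: csv_reader_dataset and parse_line are hypothetical names, the three demo CSV files are randomly generated stand-ins written to a temporary directory, and the scaling statistics are fixed at zero mean and unit standard deviation rather than taken from a fitted scaler:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

n_inputs = 8

# Hypothetical stand-in data: 3 small CSV files, each with a header row.
rng = np.random.default_rng(42)
tmp_dir = tempfile.mkdtemp()
demo_filepaths = []
for file_idx in range(3):
    path = os.path.join(tmp_dir, f"demo_{file_idx:02d}.csv")
    rows = rng.normal(size=(10, n_inputs + 1))
    np.savetxt(path, rows, delimiter=",", fmt="%.6f",
               header="demo header", comments="")
    demo_filepaths.append(path)

# Assumed scaling statistics (a real pipeline would use the training set's).
demo_mean = np.zeros(n_inputs, dtype=np.float32)
demo_std = np.ones(n_inputs, dtype=np.float32)

def parse_line(line):
    # 8 float features default to 0.0; the empty default makes the target required.
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - demo_mean) / demo_std, y

def csv_reader_dataset(filepaths, n_readers=3, shuffle_buffer_size=100,
                       batch_size=4):
    # Shuffle the file order, then read n_readers files at a time.
    dataset = tf.data.Dataset.list_files(filepaths, seed=42)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers)
    dataset = dataset.map(parse_line)
    dataset = dataset.shuffle(shuffle_buffer_size, seed=42)
    # prefetch(1) prepares the next batch while the current one is consumed.
    return dataset.batch(batch_size).prefetch(1)

X_batch, y_batch = next(iter(csv_reader_dataset(demo_filepaths)))
```

Each batch is a pair of tensors with shapes (batch_size, 8) and (batch_size, 1), which can be passed to a Keras model's fit() method as-is.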

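## II. The TFRecord Format
The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record is made up of a length, a CRC checksum to check that the length was not corrupted, the actual data, and finally a CRC checksum for the data).

You can easily create a TFRecord file using the tf.io.TFRecordWriter class: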

In [11]:
with tf.io.TFRecordWriter("datasets/my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

Then you can use a tf.data.TFRecordDataset to read one or more TFRecord files:

In [12]:
filepaths = ["datasets/my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
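
The record layout (length, CRC of the length, data, CRC of the data) can be inspected by hand. The sketch below writes a small demo file to a temporary directory and then walks through it with Python's struct module. read_raw_records is a hypothetical helper; it relies on the length being a little-endian 64-bit unsigned integer and each checksum being 4 bytes, and it skips the checksums instead of verifying them:

```python
import os
import struct
import tempfile

import tensorflow as tf

# Write a tiny demo file (the path and contents are made up for this example).
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(b"first")
    writer.write(b"second")

def read_raw_records(filepath):
    """Parse the TFRecord framing by hand, without checking the CRCs."""
    records = []
    with open(filepath, "rb") as f:
        while True:
            header = f.read(8)           # record length: uint64, little-endian
            if len(header) < 8:
                break                    # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)                    # CRC of the length (skipped)
            records.append(f.read(length))
            f.read(4)                    # CRC of the data (skipped)
    return records

raw_records = read_raw_records(path)
```

In practice tf.data.TFRecordDataset should be used for reading, since it validates the checksums and integrates with the rest of the pipeline.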


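> By default, a TFRecordDataset reads files one by one, but you can make it read multiple files in parallel and interleave their records by setting num_parallel_reads. Alternatively, you can obtain the same result by using list_files() and interleave(), as was done earlier to read multiple CSV files.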
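
As a rough illustration of num_parallel_reads, the sketch below writes two small demo TFRecord files to a temporary directory (file names and record contents are made up) and reads both at once:

```python
import os
import tempfile

import tensorflow as tf

# Create 2 demo files with 3 records each.
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for file_idx in range(2):
    path = os.path.join(tmp_dir, f"demo_{file_idx}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for record_idx in range(3):
            writer.write(f"file {file_idx}, record {record_idx}".encode("utf-8"))
    demo_paths.append(path)

# Read both files in parallel; their records come out interleaved.
parallel_dataset = tf.data.TFRecordDataset(demo_paths, num_parallel_reads=2)
records = [item.numpy() for item in parallel_dataset]
```

All 6 records are returned, with records from the two files mixed together rather than one file after the other.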

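## III. Preprocessing the Input Features (Data API)
Preparing data for a neural network requires converting all features into numerical features, generally normalizing them, and so on. In particular, if the data contains categorical features or text features, they need to be converted to numbers. This can be done ahead of time when preparing the data files, using any tool you like (e.g., NumPy, pandas, or Scikit-Learn). Alternatively, you can preprocess the data on the fly when loading it with the Data API (e.g., using the dataset's map() method), or you can include a preprocessing layer directly in the model. Let's look at this last option now.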
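
One way to include preprocessing in the model is a Keras Normalization layer, which learns each feature's mean and variance from sample data via adapt() and then standardizes its inputs. The sketch below uses a made-up 3×2 array (X_demo is an assumption for illustration) and assumes a TensorFlow version that provides tf.keras.layers.Normalization:

```python
import numpy as np
import tensorflow as tf

# Hypothetical training data: 3 instances, 2 features on different scales.
X_demo = np.array([[1., 10.], [2., 20.], [3., 30.]], dtype=np.float32)

# The layer computes per-feature mean and variance when adapt() is called.
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_demo)

# Placing the layer first means raw inputs can be fed straight to the model.
model = tf.keras.Sequential([norm_layer, tf.keras.layers.Dense(1)])

X_scaled = norm_layer(X_demo).numpy()
predictions = model(X_demo)
```

Because the statistics are stored inside the model, the same standardization is applied automatically at inference time, with no separate preprocessing code to keep in sync.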

## IV. TF Transform

In [43]:
try:
    import tensorflow_transform as tft

    def preprocess(inputs):  # inputs is a batch of input features
        median_age = inputs["housing_median_age"]
        ocean_proximity = inputs["ocean_proximity"]
        standardized_age = tft.scale_to_z_score(median_age)  # centers and scales in one step
        ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
        return {
            "standardized_median_age": standardized_age,
            "ocean_proximity_id": ocean_proximity_id
        }
except ImportError:
    print("TF Transform is not installed. Try running: pip3 install -U tensorflow-transform")

In [42]:
!pip3 install -U tensorflow-transform


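**With the Data API, TFRecords, Keras preprocessing layers, and TF Transform, you can build highly scalable input pipelines for training and benefit from fast, portable data preprocessing in production.** But what if you just want to use a standard dataset? In that case things are much simpler: just use TFDS!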

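## V. The TensorFlow Datasets Project
The TensorFlow Datasets (TFDS) project makes it very easy to download common datasets, from small ones such as MNIST or Fashion MNIST to huge ones such as ImageNet.
TFDS is not bundled with TensorFlow, so you need to install the tensorflow_datasets library. Then call the tfds.load() function, and it will download the data you want (unless it was already downloaded earlier) and return it as a dictionary of datasets (typically one for training and one for testing, but this depends on the dataset you choose).

For example, let's download MNIST: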

In [45]:
!pip3 install -U tensorflow_datasets


In [46]:
import tensorflow_datasets as tfds

dataset = tfds.load(name="mnist")
mnist_train, mnist_test = dataset["train"], dataset["test"]
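
Each item in these datasets is a dictionary (for MNIST, with "image" and "label" keys), whereas Keras expects (features, label) tuples. The sketch below shows one way to convert and batch such items. To keep it runnable without a download, it uses a stand-in dataset of all-zero images with the same structure (fake_split is an assumption, not real MNIST data):

```python
import tensorflow as tf

# Stand-in for a TFDS split: dict items with "image" and "label" keys.
fake_split = tf.data.Dataset.from_tensor_slices({
    "image": tf.zeros([6, 28, 28, 1], dtype=tf.uint8),
    "label": tf.constant([0, 1, 2, 3, 4, 5], dtype=tf.int64),
})

# Shuffle, batch, then turn each dict of batched tensors into a tuple.
prepared = fake_split.shuffle(6, seed=42).batch(2)
prepared = prepared.map(lambda items: (items["image"], items["label"]))
prepared = prepared.prefetch(1)

images, labels = next(iter(prepared))
```

The same three steps apply to mnist_train. Alternatively, passing as_supervised=True to tfds.load() makes it return tuples directly, so the map() step is not needed.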

