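This notebook draws on the following books, articles, blog posts, and official documentation:
* TensorFlow 2 official documentation: https://tensorflow.google.cn/
* 简单粗暴TensorFlow 2 (A Concise Handbook of TensorFlow 2): https://github.com/snowkylin/tensorflow-handbook
* TensorFlow 2.0 study notes: https://zhuanlan.zhihu.com/p/74441082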

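Code examples without an attribution were `probably` written by me; `probably` means there is a very small chance that I simply forgot to cite the source.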

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets  as tfds
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras import layers
from tensorflow.keras import preprocessing as prep
from matplotlib import pyplot as plt

# Stateful Container

### Trackable

In [2]:
from tensorflow.python.training.tracking.base import Trackable

In [3]:
x = Trackable()
y = Trackable()
x._track_trackable(y, 'ccc')  # x references y under the name 'ccc'; in other words, x depends on y

<tensorflow.python.training.tracking.base.Trackable at 0x143f1f588>

In [4]:
x._lookup_dependency('ccc') is y  # look up the dependency named 'ccc'

True

In [5]:
y

<tensorflow.python.training.tracking.base.Trackable at 0x143f1f588>

In [6]:
x._lookup_dependency('ccc')

<tensorflow.python.training.tracking.base.Trackable at 0x143f1f588>

In [7]:
del y

In [8]:
x._lookup_dependency('ccc')

<tensorflow.python.training.tracking.base.Trackable at 0x143f1f588>

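As shown, deleting y does not affect x's reference to it. As long as the root node x has not been garbage-collected, the objects x depends on will not be collected either.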

### AutoTrackable
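The AutoTrackable class inherits from Trackable and overrides `__setattr__` and `__delattr__` to intercept attribute assignment and deletion, so dependencies are created (and removed) automatically when attributes are set.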

In [9]:
from tensorflow.python.training.tracking.tracking import AutoTrackable

In [10]:
x = AutoTrackable()
y = AutoTrackable()
x.ccc = y

In [11]:
x._lookup_dependency('ccc') is y

True

In [12]:
v = tf.Variable([1,2,3])

In [13]:
x.vvv = v

In [14]:
x._unconditional_checkpoint_dependencies

[TrackableReference(name='ccc', ref=<tensorflow.python.training.tracking.tracking.AutoTrackable object at 0x143f365c0>),
 TrackableReference(name='vvv', ref=<tf.Variable 'Variable:0' shape=(3,) dtype=int32, numpy=array([1, 2, 3], dtype=int32)>)]

In [15]:
v

<tf.Variable 'Variable:0' shape=(3,) dtype=int32, numpy=array([1, 2, 3], dtype=int32)>
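
The attribute-interception idea can be sketched in plain Python. This is only a toy model under my own assumptions, not TensorFlow's implementation: `ToyTrackable`, `_deps`, and `lookup` are hypothetical names chosen for illustration.

```python
class ToyTrackable:
    """Toy stand-in for AutoTrackable: records trackable attributes by name."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the bookkeeping dict.
        object.__setattr__(self, "_deps", {})

    def __setattr__(self, name, value):
        # Only other ToyTrackable instances become named dependencies.
        if isinstance(value, ToyTrackable):
            self._deps[name] = value
        object.__setattr__(self, name, value)

    def __delattr__(self, name):
        # Deleting the attribute also drops the dependency edge.
        self._deps.pop(name, None)
        object.__delattr__(self, name)

    def lookup(self, name):
        return self._deps.get(name)


x = ToyTrackable()
y = ToyTrackable()
x.ccc = y        # plain assignment registers the dependency
x.count = 3      # non-trackable values are stored but not tracked
child = x.lookup("ccc")
del y            # only the name y goes away; x still references the object
print(child is x.lookup("ccc"), sorted(x._deps))
```

As in the cells above, removing the name `y` does not remove the object, because the parent keeps a strong reference to it through its dependency table.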

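### Objects that can be saved
**tf.Variable and MutableHashTable**  
`tf.Variable` and `MutableHashTable` are objects that can be saved (with tf.train.Checkpoint). Both classes inherit from Trackable and override the `_gather_saveables_for_checkpoint` method, which is what tf.train.Checkpoint relies on to save them.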

In [16]:
from tensorflow.python.ops.lookup_ops import MutableHashTable

In [17]:
# As shown, _gather_saveables_for_checkpoint on x (an AutoTrackable instance) does not collect the variables
x._gather_saveables_for_checkpoint()

{}

In [18]:
x.vvv._gather_saveables_for_checkpoint()

{'VARIABLE_VALUE': <tf.Variable 'Variable:0' shape=(3,) dtype=int32, numpy=array([1, 2, 3], dtype=int32)>}

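Under the hood, Checkpoint uses the ObjectGraphView class to traverse every node of the dependency DAG, calling `_gather_saveables_for_checkpoint` on each node to collect the saveable objects together with their dependency relationships, and then stores them.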
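
The traversal described above can be sketched as a breadth-first walk over named dependencies. This is a toy model, not the real `ObjectGraphView`: `Node`, `Var`, `collect`, and the `path/KEY` naming are all assumptions made for illustration, and real checkpoint keys look different.

```python
from collections import deque


class Node:
    """Toy trackable object with named dependencies (hypothetical)."""

    def __init__(self, **deps):
        self.deps = dict(deps)

    def gather_saveables(self):
        return {}  # plain containers have nothing of their own to save


class Var(Node):
    """Toy variable: the only kind of node that carries a value to save."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def gather_saveables(self):
        return {"VARIABLE_VALUE": self.value}


def collect(root):
    """Breadth-first walk from root, returning {path: value} for saveables."""
    saved, seen, queue = {}, {id(root)}, deque([("", root)])
    while queue:
        path, node = queue.popleft()
        for key, value in node.gather_saveables().items():
            saved[f"{path}/{key}".lstrip("/")] = value
        for name, child in node.deps.items():
            if id(child) not in seen:  # a shared node is visited only once
                seen.add(id(child))
                queue.append((f"{path}/{name}".lstrip("/"), child))
    return saved


shared = Var([1, 2, 3])
root = Node(ccc=Node(inner=shared), vvv=shared)
print(collect(root))
```

The container nodes contribute nothing themselves; only the variable reachable from the root ends up in the result, and it is recorded once even though two paths lead to it.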

### Restore-on-Creation

In [19]:
class MyModule(tf.Module):
    def assign(self, init=tf.constant([1., 2., 3.]), name=None):
        with self.name_scope:
          self.w = tf.Variable(init)
    def operate(self, value):
        self.w.assign_add(value)

m = MyModule(name='test')
m.assign()
m.operate([1., 1., 1.])
m.w

<tf.Variable 'test/Variable:0' shape=(3,) dtype=float32, numpy=array([2., 3., 4.], dtype=float32)>

In [20]:
ckpt = tf.train.Checkpoint(module=m)
ckpt.save('data/ckpt.save.test')

'data/ckpt.save.test-1'

In [21]:
module = MyModule(name='test')
try:
    module.w
except AttributeError as e:
    print("w doesn't exist.")
else:
    print("w already exists.")

w doesn't exist.


Because assign has not been called, the w attribute does not exist.

In [22]:
ckpt = tf.train.Checkpoint(module=module)
ckpt.restore(tf.train.latest_checkpoint('data'))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x143f36048>

In [23]:
try:
    module.w
except AttributeError as e:
    print("w doesn't exist.")
else:
    print("w already exists.")

w doesn't exist.


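As shown, w still does not exist after restore, because the attribute was never created. Once assign is called and creates w, however, the restore takes effect: the resulting value is the one restored from the checkpoint, not the `tf.constant([1., 1., 1.])` passed to assign.  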

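**Restore-on-Creation means that weights saved in a checkpoint are not loaded while the corresponding variable does not exist yet; as soon as the variable is created, they are loaded immediately.**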
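
The deferred-loading behaviour can be mimicked in a few lines of plain Python. This is a toy sketch of the idea only, not how TensorFlow implements it; `ToyCheckpointable`, `_pending`, and `restore` are hypothetical names.

```python
class ToyCheckpointable:
    """Toy object whose attributes can be restored lazily (hypothetical)."""

    def __init__(self):
        # Saved values that are still waiting for their attribute to exist.
        object.__setattr__(self, "_pending", {})

    def restore(self, saved):
        """Apply saved values to existing attributes; defer the rest."""
        for name, value in saved.items():
            if name in self.__dict__:
                object.__setattr__(self, name, value)
            else:
                self._pending[name] = value

    def __setattr__(self, name, value):
        # On creation, a pending saved value overrides the initial value once.
        if name in self._pending:
            value = self._pending.pop(name)
        object.__setattr__(self, name, value)


m = ToyCheckpointable()
m.restore({"w": [2.0, 3.0, 4.0]})  # "w" does not exist yet, so nothing is loaded
exists_before = hasattr(m, "w")
m.w = [1.0, 1.0, 1.0]              # creation triggers the deferred restore
first = m.w
m.w = [2.0, 2.0, 2.0]              # later assignments behave normally
second = m.w
print(exists_before, first, second)
```

The sequence mirrors the notebook cells: the attribute is missing after `restore`, takes the saved value when it is first created, and accepts new values afterwards.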

In [24]:
module.assign(tf.constant([1., 1., 1.]))
module.w  # so you see...

<tf.Variable 'test/Variable:0' shape=(3,) dtype=float32, numpy=array([2., 3., 4.], dtype=float32)>

In [25]:
module.assign(tf.constant([2.,2.,2.]))
module.w

<tf.Variable 'test/Variable:0' shape=(3,) dtype=float32, numpy=array([2., 2., 2.], dtype=float32)>

### tf.Module

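`tf.Module.variables`: collects all variables;  
`tf.Module.trainable_variables`: collects all trainable variables;  
`tf.Module.submodules`: collects all submodules, i.e. the tf.Module instances this module depends on or references.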

> You can enter the name scope explicitly using `with self.name_scope:` or you can annotate methods(apart from `__init__`) with `@tf.Module.with_name_scope`.

Note that to use `@tf.Module.with_name_scope` or `with self.name_scope`, `__init__` must call `super().__init__` so that the `tf.Module` constructor runs.
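
A small sketch of the three collection properties, assuming TensorFlow 2.x is installed. The classes `Scale` and `Outer` are made up for illustration; only `tf.Module`, `tf.Variable`, and the three properties come from the library.

```python
import tensorflow as tf


class Scale(tf.Module):
    def __init__(self, name=None):
        super().__init__(name=name)  # required before using self.name_scope
        with self.name_scope:
            self.factor = tf.Variable(2.0, name="factor")
            self.offset = tf.Variable(0.0, trainable=False, name="offset")


class Outer(tf.Module):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.inner = Scale(name="inner")  # assigning a module tracks it as a submodule
        self.bias = tf.Variable(1.0, name="bias")


outer = Outer(name="outer")
all_names = sorted(v.name for v in outer.variables)
trainable_names = sorted(v.name for v in outer.trainable_variables)
submodule_names = [m.name for m in outer.submodules]
print(all_names, trainable_names, submodule_names)
```

`variables` also includes the variables of nested modules, while `trainable_variables` leaves out the one created with `trainable=False`.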

In [26]:
class Dense(tf.Module):
  def __init__(self, input_features, output_features, name=None):
    super(Dense, self).__init__(name=name)
    with self.name_scope:
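      # Note: name='w' is passed to tf.random.normal, not to tf.Variable, so this variable gets the default name 'Variable'.
      self.w = tf.Variable(tf.random.normal([input_features, output_features], name='w'))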
      self.b = tf.Variable(tf.zeros([output_features,]), name='b')
  @tf.Module.with_name_scope
  def __call__(self, x):
    self.test = tf.Variable([2.,3.], name='ahaha')
    y = tf.matmul(x, self.w) + self.b
    return tf.nn.relu(y)

d = Dense(input_features=5, output_features=3, name='Jason')
d(tf.ones([6, 5]))

<tf.Tensor: shape=(6, 3), dtype=float32, numpy=
array([[0.       , 2.2095194, 0.       ],
       [0.       , 2.2095194, 0.       ],
       [0.       , 2.2095194, 0.       ],
       [0.       , 2.2095194, 0.       ],
       [0.       , 2.2095194, 0.       ],
       [0.       , 2.2095194, 0.       ]], dtype=float32)>

In [27]:
d.variables[0].name

'Jason/b:0'

In [28]:
d.name_scope.name

'Jason/'

In [29]:
d.name

'Jason'

In [30]:
d.test

<tf.Variable 'Jason/ahaha:0' shape=(2,) dtype=float32, numpy=array([2., 3.], dtype=float32)>

In [31]:
d.test = tf.Variable([1.,1.])

In [32]:
d.test

<tf.Variable 'Variable:0' shape=(2,) dtype=float32, numpy=array([1., 1.], dtype=float32)>

In [33]:
d(tf.ones([6, 5]))
d.test

<tf.Variable 'Jason/ahaha:0' shape=(2,) dtype=float32, numpy=array([2., 3.], dtype=float32)>

In [34]:
list(d._flatten())

['Jason',
 <tensorflow.python.framework.ops.name_scope_v2 at 0x143f7cef0>,
 set(),
 True,
 -1,
 <tf.Variable 'Jason/b:0' shape=(3,) dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>,
 <tf.Variable 'Jason/ahaha:0' shape=(2,) dtype=float32, numpy=array([2., 3.], dtype=float32)>,
 <tf.Variable 'Jason/Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[-0.09760045, -1.0636637 ,  0.20886582],
        [-0.21772884,  0.57946175, -0.8993447 ],
        [-1.1776452 ,  1.3941711 ,  0.02221104],
        [ 0.12294083,  1.2387425 , -0.32721543],
        [ 0.635029  ,  0.06080775, -0.4095479 ]], dtype=float32)>]

# tf.function

### 基本特征

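* The `tf.function` decorator returns a `def_function.Function` object.
* A `Function` object is made up of `ConcreteFunction` objects; each `ConcreteFunction` holds a `FunctionGraph` and a `structured_input_signature`.
* `FunctionGraph` is a subclass of `tf.Graph`; `structured_input_signature` is the function signature.
* If an argument is a Python value, a separate `ConcreteFunction` is created for every distinct Python value encountered, and the value is baked into the graph as a constant. If the argument is a reference to a Python object, the value it refers to when the `ConcreteFunction` is created is what gets fixed in the graph.
* This also means that if an argument is a mutable Python value, in-place modifications should not be performed inside the function, because the value has already been fixed in the graph.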

### 运行过程

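1. Run every line of the function. The code falls into two categories:
  * pure Python code;
  * TensorFlow code such as `tf.add`, plus Python code that can be converted into graph nodes.  
As a result, pure Python code runs exactly like ordinary Python, while TensorFlow code and convertible Python code are built into a computation graph.
2. Run the computation graph once.
3. Generate a hash from the function name and the types of the input arguments, and cache the resulting graph in a hash table under that key.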
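
The caching step above can be sketched with a toy model. It is not the real `tf.function` machinery: `FakeTensor`, `cache_key`, and `ToyFunction` are hypothetical, the "graph" is just a label, and only hashable arguments are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor: only shape and dtype matter for the cache key."""
    shape: tuple
    dtype: str


def cache_key(args):
    key = []
    for a in args:
        if isinstance(a, FakeTensor):
            key.append(("tensor", a.shape, a.dtype))  # keyed by signature
        else:
            key.append(("python", a))                 # keyed by value
    return tuple(key)


class ToyFunction:
    def __init__(self, fn):
        self.fn = fn
        self.concrete = {}  # cache: key -> "graph"

    def __call__(self, *args):
        key = cache_key(args)
        if key not in self.concrete:
            self.fn(*args)  # "tracing": the Python body runs only on a cache miss
            self.concrete[key] = f"graph#{len(self.concrete)}"
        return self.concrete[key]


side_effects = []


def body(x, y):
    side_effects.append((x, y))  # plain Python side effect, seen only while tracing


add = ToyFunction(body)
g1 = add(FakeTensor((2, 3), "float32"), FakeTensor((3,), "float32"))
g2 = add(FakeTensor((2, 3), "float32"), FakeTensor((3,), "float32"))
g3 = add(FakeTensor((2, 6), "float32"), FakeTensor((6,), "float32"))
g4 = add(6, 9)
g5 = add(6, 10)
print(g1, g2, g3, g4, g5, len(side_effects))
```

Two calls with the same tensor signature share one entry, while every new Python value produces a new entry, which is why passing many different Python scalars leads to many traces.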

**AutoGraph and `for`/`while` loops:**  
* `for`: converted if the iterable is a tensor;
* `while`: converted if the loop condition is a tensor.

### 实例

In [35]:
@tf.function
def add(x, y):
    return tf.add(x, y)

In [36]:
add(tf.random.normal((2, 3)), tf.random.normal((3,)))

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-0.50342584, -0.28319857,  1.4980152 ],
       [-2.072191  ,  0.6334422 , -1.1234066 ]], dtype=float32)>

In [37]:
add(tf.random.normal((2, 6)), tf.random.normal((6,)))

<tf.Tensor: shape=(2, 6), dtype=float32, numpy=
array([[-0.44974923,  0.00243461,  1.0163785 ,  1.558114  , -1.9935882 ,
        -1.4193089 ],
       [-1.1722422 ,  1.5539018 ,  0.96692514,  0.89248896, -1.2205807 ,
         1.5422093 ]], dtype=float32)>

In [38]:
add._list_all_concrete_functions_for_serialization()

[<ConcreteFunction add(x, y) at 0x144557E48>,
 <ConcreteFunction add(x, y) at 0x143F7C358>]

In [39]:
add(6,9)

<tf.Tensor: shape=(), dtype=int32, numpy=15>

In [40]:
add._list_all_concrete_functions_for_serialization()

[<ConcreteFunction add(x=6, y=9) at 0x144561828>,
 <ConcreteFunction add(x, y) at 0x144557E48>,
 <ConcreteFunction add(x, y) at 0x143F7C358>]

In [41]:
add._list_all_concrete_functions()  # how does this differ from _list_all_concrete_functions_for_serialization?

[<ConcreteFunction add(x=6, y=9) at 0x144561828>,
 <ConcreteFunction add(x, y) at 0x144557E48>,
 <ConcreteFunction add(x, y) at 0x143F7C358>,
 <ConcreteFunction add(x, y) at 0x143F1FD68>,
 <ConcreteFunction add(x, y) at 0x1444F61D0>]

In [42]:
add._list_all_concrete_functions_for_serialization()[0].structured_input_signature

((6, 9), {})

In [43]:
add._list_all_concrete_functions_for_serialization()[1].structured_input_signature

((TensorSpec(shape=(2, 6), dtype=tf.float32, name='x'),
  TensorSpec(shape=(6,), dtype=tf.float32, name='y')),
 {})

In [44]:
add._list_all_concrete_functions_for_serialization()[2].structured_input_signature

((TensorSpec(shape=(2, 3), dtype=tf.float32, name='x'),
  TensorSpec(shape=(3,), dtype=tf.float32, name='y')),
 {})

In [47]:
# A ConcreteFunction traced with Python values takes no arguments, because the values are already baked into it
# Note that index 0 is the ConcreteFunction traced with the Python values 6 and 9
add._list_all_concrete_functions_for_serialization()[0]()

<tf.Tensor: shape=(), dtype=int32, numpy=15>

In [48]:
add._list_all_concrete_functions_for_serialization()

[<ConcreteFunction add(x=6, y=9) at 0x144561828>,
 <ConcreteFunction add(x, y) at 0x144557E48>,
 <ConcreteFunction add(x, y) at 0x143F7C358>]

In [49]:
sig = add._list_all_concrete_functions_for_serialization()[0].structured_input_signature
sig

((6, 9), {})

`.get_concrete_function` returns a ConcreteFunction. Oddly, the ConcreteFunction obtained this way and the one obtained from the list are not equal.

In [50]:
a = add.get_concrete_function(tf.TensorSpec(shape=[2,6], dtype=tf.float32), tf.TensorSpec(shape=[6,], dtype=tf.float32))
a

<ConcreteFunction add(x, y) at 0x14456B4E0>

In [51]:
add._list_all_concrete_functions_for_serialization()[0].structured_input_signature

((6, 9), {})

In [52]:
add._list_all_concrete_functions_for_serialization()[0]

<ConcreteFunction add(x=6, y=9) at 0x144561828>

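`tf.function` only allows `tf.Variable` objects to be created the first time the function is called. The typical pattern is therefore to set the weights to `None` in `__init__` and check them in `build`: if a weight is `None`, initialize it.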

In [53]:
v = None

def f(x):
    global v
    if v is None:
      v = tf.Variable(x)
    return v
f = tf.function(f)

In [54]:
f._list_all_concrete_functions_for_serialization()

[]

In [55]:
f(tf.constant([2., 3., 4.]))

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2., 3., 4.], dtype=float32)>

In [56]:
f(tf.constant([2., 3.]))

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2., 3., 4.], dtype=float32)>

When v is reset to None, the next call to f tries to create a variable again, which raises an exception.

In [57]:
try:
    v = None
    f(tf.constant([1.,2, 3.]))
except ValueError:
    print("ValueError when create variable non-first call")
else:
    print("isn't ok?")

ValueError when create variable non-first call


The correct usage looks like this:

In [4]:
class MyModule(tf.Module):
    def __init__(self, name, units=10):
        super(MyModule, self).__init__(name=name)
        self.w = None
        self.b = None
        self.units = units
        self.built = False  # tf.keras.layers.Layer sets this attribute and subclasses inherit it; it indicates whether the weights have been created
    @tf.Module.with_name_scope
    def build(self, input_shape):
        if self.w is None:
            self.w = tf.Variable(tf.random.normal([input_shape[-1], self.units]))
        if self.b is None:
            self.b = tf.Variable(tf.random.normal([self.units, ]))
        self.built = True  # mark the weights as created
    def call(self, input):
        return tf.matmul(input, self.w) + self.b
    @tf.function
    def __call__(self, input):
        if not self.built:  # built is False on the first call, so build is invoked to create the weights
            self.build(input.shape)
        return self.call(input)

In [5]:
m = MyModule('testModule')
input = tf.random.normal([5,3])
m(input).shape

TensorShape([5, 10])

In [6]:
m.__call__._list_all_concrete_functions_for_serialization()[0].structured_input_signature

((TensorSpec(shape=(5, 3), dtype=tf.float32, name='input'),), {})

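If the variables end up being created again on a later trace (for example, with the two `if` checks in `build` commented out and nothing else guarding the call to `build`), the call fails with `ValueError when create variable non-first call`.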

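In fact, `tf.keras.layers.Layer` handles this more neatly: the first call to `__call__` invokes `build` to create the weights, and `build` is not called again afterwards, as follows: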

In [7]:
class MyModule(tf.Module):
    def __init__(self, name, units=10):
        super(MyModule, self).__init__(name=name)
        self.units = units
        self.built = False  # tf.keras.layers.Layer sets this attribute and subclasses inherit it; it indicates whether the weights have been created
    @tf.Module.with_name_scope
    def build(self, input_shape):
        self.w = tf.Variable(tf.random.normal([input_shape[-1], self.units]))
        self.b = tf.Variable(tf.random.normal([self.units, ]))
        self.built = True  # mark the weights as created
    def call(self, input):
        return tf.matmul(input, self.w) + self.b
    @tf.function
    def __call__(self, input):
        if not self.built:  # built is False on the first call, so build is invoked to create the weights
            self.build(input.shape)
        return self.call(input)

In [8]:
m = MyModule('testModule')
input = tf.random.normal([5,3])
m(input).shape

TensorShape([5, 10])

### Mutable types as function arguments

In [61]:
@tf.function
def f(x):
    print(x)
    # The next line would raise an error: an in-place operation on a mutable argument fails at run time
    #x.append(100) 
    return x[-1] + 100

In [62]:
x = [1.,2.]

In [63]:
f(x)

[1.0, 2.0]


<tf.Tensor: shape=(), dtype=float32, numpy=102.0>

In [64]:
f(x)  # on the second call the print statement does not run; tf.print would run

<tf.Tensor: shape=(), dtype=float32, numpy=102.0>

In [65]:
f.get_concrete_function(x)()

<tf.Tensor: shape=(), dtype=float32, numpy=102.0>

In [66]:
f._list_all_concrete_functions_for_serialization()[0].structured_input_signature

(([1.0, 2.0],), {})

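The example above shows that a mutable Python object passed as an argument behaves like any other Python value, except that in-place operations on it cannot be used.

The following example comes from the official TensorFlow 2 documentation: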

In [67]:
l = [] 
@tf.function 
def f(x): 
  for i in x: 
    print(i)
    l.append(i)    # Caution! Will only happen once when tracing 
f([1, 2, 3])
l

1
2
3


[1, 2, 3]

In [68]:
f([1,2,3])
l  # the second call did not change l

[1, 2, 3]

In [69]:
l = []
f(tf.constant([1,2,3]))

Tensor("while/TensorArrayV2Read/TensorListGetItem:0", shape=(), dtype=int32)


In [70]:
l  # with a native TensorFlow type as the argument, the loop is converted into graph ops

[<tf.Tensor 'while/TensorArrayV2Read/TensorListGetItem:0' shape=() dtype=int32>]

In [71]:
f._list_all_concrete_functions_for_serialization()[0]

Tensor("while/TensorArrayV2Read/TensorListGetItem:0", shape=(), dtype=int32)


<ConcreteFunction f(x) at 0x144786FD0>

In [72]:
l

[<tf.Tensor 'while/TensorArrayV2Read/TensorListGetItem:0' shape=() dtype=int32>,
 <tf.Tensor 'while/TensorArrayV2Read/TensorListGetItem:0' shape=() dtype=int32>]

In [73]:
l = []
@tf.function
def f(a):
    for i in range(a):
        l.append(0)  # runs only during tracing, while the graph is being built
        tf.print(a)  # becomes a node in the graph and runs on every call

In [74]:
f(3)
l

3
3
3


[0, 0, 0]

In [75]:
f(3)  # the second call does not change the list, because it only runs the graph
l

3
3
3


[0, 0, 0]

### Serialization of custom classes

In [76]:
class Person:
    def __init__(self, age):
        self.age = age

@tf.function
def f(year, p):
    print(year)
    return p.age + year

p = Person(100)

In [77]:
f(1, p)

1


<tf.Tensor: shape=(), dtype=int32, numpy=101>

In [78]:
f(2, p)

2


<tf.Tensor: shape=(), dtype=int32, numpy=102>

In [79]:
f(2,p)

<tf.Tensor: shape=(), dtype=int32, numpy=102>

In [80]:
f.get_concrete_function(2,p).structured_input_signature

((2, <tensorflow.python.framework.func_graph.UnknownArgument at 0x14470f0f0>),
 {})

Probably because the Person class cannot be serialized, `_list_all_concrete_functions_for_serialization` is unable to return the `ConcreteFunction`s:

In [81]:
f._list_all_concrete_functions_for_serialization()

INFO:tensorflow:Unsupported signature for serialization: ((2, <tensorflow.python.framework.func_graph.UnknownArgument object at 0x14470f0f0>), {}).
INFO:tensorflow:Unsupported signature for serialization: ((1, <tensorflow.python.framework.func_graph.UnknownArgument object at 0x1446e34a8>), {}).


[]

In [82]:
@tf.function
def concat_with_padding():
    x = tf.zeros([5, 10])
    tf.print(x.shape)
    x = x[:4]
    tf.print(x.shape)
    for i in tf.range(4):
        x = tf.concat([x[:i], tf.ones([1, 10])], axis=0) # the tensor shape must not change across loop iterations
        tf.print(x.shape)
        x.set_shape([4, 10])
        tf.print(x.shape)
    return x
concat_with_padding()

TensorShape([5, 10])
TensorShape([4, 10])
TensorShape([None, 10])
TensorShape([4, 10])
TensorShape([None, 10])
TensorShape([4, 10])
TensorShape([None, 10])
TensorShape([4, 10])
TensorShape([None, 10])
TensorShape([4, 10])


<tf.Tensor: shape=(4, 10), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>

# tf.io

In [83]:
x = tf.constant([3.,6.])
a = tf.io.serialize_tensor(x)
a

<tf.Tensor: shape=(), dtype=string, numpy=b'\x08\x01\x12\x04\x12\x02\x08\x02"\x08\x00\x00@@\x00\x00\xc0@'>

In [84]:
tf.io.parse_tensor(a, tf.float32)

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([3., 6.], dtype=float32)>

# tf.data

### tf.data.Dataset

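* `batch(..., drop_remainder=True)`: drop the last batch if it has fewer samples than the batch size
* Common ordering: `train.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()`
* `train.prefetch(2)`: prefetch two elements into memory ahead of time
* `train.batch(20).prefetch(2)`: prefetch two batches into memory ahead of time; `prefetch` usually goes last in the pipeline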
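
A minimal sketch of the points above, assuming TensorFlow 2.x is installed. The sizes (10 elements, batches of 3 or 4, two epochs) are arbitrary values picked for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Without drop_remainder the last, shorter batch is kept.
kept = [b.numpy().tolist() for b in ds.batch(3)]

# With drop_remainder=True the incomplete final batch is discarded.
dropped = [b.numpy().tolist() for b in ds.batch(3, drop_remainder=True)]

# cache -> shuffle -> batch -> repeat, with prefetch as the last step.
pipeline = ds.cache().shuffle(10, seed=0).batch(4).repeat(2).prefetch(1)
n_batches = sum(1 for _ in pipeline)

print(kept, dropped, n_batches)
```

Because `prefetch` comes after `batch`, it buffers whole batches rather than single elements.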

#### tf.data.Dataset.range

In [85]:
a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3]
b = tf.data.Dataset.range(4, 5)  # ==> [ 4,]
c = a.concatenate(b)
list(iter(c))

[<tf.Tensor: shape=(), dtype=int64, numpy=1>,
 <tf.Tensor: shape=(), dtype=int64, numpy=2>,
 <tf.Tensor: shape=(), dtype=int64, numpy=3>,
 <tf.Tensor: shape=(), dtype=int64, numpy=4>]

#### tf.data.Dataset.from_tensor_slices

In [86]:
a = tf.data.Dataset.from_tensor_slices((tf.random.normal([4, 3]), [99., 0, 1, 0]))
next(iter(a.enumerate()))

(<tf.Tensor: shape=(), dtype=int64, numpy=0>,
 (<tf.Tensor: shape=(3,), dtype=float32, numpy=array([-0.7512379 ,  0.02733659,  0.39045817], dtype=float32)>,
  <tf.Tensor: shape=(), dtype=float32, numpy=99.0>))

In [87]:
dataset = tf.data.Dataset.from_tensor_slices(({"a": [1, 2, 20], "b": [3, 4, 20], "c":[10,11, 20]}, [100,200,300]))
next(dataset.batch(2).as_numpy_iterator())

({'a': array([1, 2], dtype=int32),
  'b': array([3, 4], dtype=int32),
  'c': array([10, 11], dtype=int32)},
 array([100, 200], dtype=int32))

##### One more difference worth noting: list vs. tuple

In [88]:
a = tf.data.Dataset.from_tensor_slices([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
next(iter(a.batch(2)))

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)>

In [89]:
a = tf.data.Dataset.from_tensor_slices(( [1, 2, 3], [4, 5, 6], [7, 8, 9] ))
next(iter(a))

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=int32, numpy=4>,
 <tf.Tensor: shape=(), dtype=int32, numpy=7>)
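
My reading of the two results above, expressed as a plain-Python analogy (an interpretation, not TensorFlow code): a nested list becomes one tensor that is sliced row by row, whereas a tuple is treated as separate components that are sliced in parallel.

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Nested list: one (3, 3) tensor, sliced along the first axis.
as_list = [row for row in rows]

# Tuple of three lists: three components, each contributing one value per element.
as_tuple = list(zip(*rows))

print(as_list[0], as_tuple[0])
```

The first element is a whole row in the list case and one value from each component in the tuple case, matching the two outputs above.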

#### tf.data.Dataset.from_generator

The constructor takes a callable as input, not an iterator. This allows it to restart the generator when it reaches the end. It takes an optional args argument, which is passed as the callable's arguments.
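
Why a callable is needed rather than an iterator can be seen without TensorFlow at all. The sketch below uses only plain Python; `make_gen` is a hypothetical helper name.

```python
def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1


gen = count(3)            # a generator object is a one-shot iterator
first_pass = list(gen)
second_pass = list(gen)   # already exhausted, so nothing is produced


def make_gen():
    return count(3)       # a callable builds a fresh generator on every call


epoch_1 = list(make_gen())
epoch_2 = list(make_gen())

print(first_pass, second_pass, epoch_1, epoch_2)
```

An exhausted generator cannot be rewound, so a dataset that has to be iterated more than once needs something it can call again to start over.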

In [90]:
def count(stop):
  i = 0
  while i<stop:
    yield i
    i += 1
tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )

<FlatMapDataset shapes: (), types: tf.int32>

In [91]:
x = [[1, 2, 3, 4, 5, 6], [1, 2], [1, 2], [1, 2, 3, 4], [1, 2], [1, 2, 3]]

d = tf.data.Dataset.from_generator(lambda: x, tf.int32)

In [92]:
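try:
    next(iter(d.batch(2)))
except Exception:
    print("Elements in a batch have different shapes, so batching fails")

Elements in a batch have different shapes, so batching fails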


In [93]:
dd = d.padded_batch(2, [-1])
next(iter(dd))

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[1, 2, 3, 4, 5, 6],
       [1, 2, 0, 0, 0, 0]], dtype=int32)>

In [94]:
d = tf.data.Dataset.from_generator(generator=lambda: [3,4,5], output_types=tf.int32, output_shapes=())
# generator must be a callable that returns an object supporting iter()

In [95]:
d

<FlatMapDataset shapes: (), types: tf.int32>

In [96]:
next(iter(d))

<tf.Tensor: shape=(), dtype=int32, numpy=3>

#### tf.data.Dataset.map

`Dataset.map` runs in graph mode: the mapped function receives a symbolic `Tensor` rather than an `EagerTensor`, so `EagerTensor.numpy` cannot be called directly. To use `.numpy()`, wrap the function in `tf.py_function`.

In [97]:
elements = [(1, "foo"), (2, "bar"), (3, "baz)")]
dataset = tf.data.Dataset.from_generator(lambda: elements, (tf.int32, tf.string))
result = dataset.map(lambda x_int, y_str: y_str)
list(result.as_numpy_iterator())

[b'foo', b'bar', b'baz)']

In [98]:
next(result.batch(2).as_numpy_iterator())

array([b'foo', b'bar'], dtype=object)

In [99]:
d = tf.data.Dataset.from_generator(lambda: elements, output_types=(tf.int32, tf.string))

In [100]:
next(d.as_numpy_iterator())

(1, b'foo')

In [101]:
next(d.batch(2).as_numpy_iterator())

(array([1, 2], dtype=int32), array([b'foo', b'bar'], dtype=object))

In [102]:
# from_tensor_slices cannot handle a sequence whose items have mixed types
try:
  next(tf.data.Dataset.from_tensor_slices(elements).as_numpy_iterator())
except ValueError:
  print("Can't convert python sequence with mixed type to Tensor")

Can't convert python sequence with mixed type to Tensor


In [103]:
elements =  ([{"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}, {"a": 3, "b": "baz"}])
dataset = tf.data.Dataset.from_generator(lambda: elements, {"a": tf.int32, "b": tf.string})
result = dataset.map(lambda d: tf.strings.as_string(d["a"]) +'-' + d["b"])
tmp = list(result.as_numpy_iterator())
list(tmp)

[b'1-foo', b'2-bar', b'3-baz']

In [104]:
next(dataset.batch(2).as_numpy_iterator())

{'a': array([1, 2], dtype=int32), 'b': array([b'foo', b'bar'], dtype=object)}

In [105]:
def test(x): return x.numpy()

In [106]:
elements =  ([{"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}, {"a": 3, "b": "baz"}])
dataset = tf.data.Dataset.from_generator(lambda: elements, {"a": tf.int32, "b": tf.string})
#result = dataset.map(lambda x: x['a'])
# The parameter of test acts like a placeholder; the inp argument of py_function supplies the values that fill it
result = dataset.map(lambda x: tf.py_function(func=test, inp=[x['a']], Tout=tf.int32))
list(result.as_numpy_iterator())

[1, 2, 3]

In [107]:
dataset = tf.data.Dataset.from_generator(lambda: elements, {"a": tf.int32, "b": tf.string})
result = dataset.map(lambda x: x['a'])
list(dataset)

[{'a': <tf.Tensor: shape=(), dtype=int32, numpy=1>,
  'b': <tf.Tensor: shape=(), dtype=string, numpy=b'foo'>},
 {'a': <tf.Tensor: shape=(), dtype=int32, numpy=2>,
  'b': <tf.Tensor: shape=(), dtype=string, numpy=b'bar'>},
 {'a': <tf.Tensor: shape=(), dtype=int32, numpy=3>,
  'b': <tf.Tensor: shape=(), dtype=string, numpy=b'baz'>}]

#### tf.data.TFRecordDataset

In [108]:
n_observations = int(1e4)
feature0 = np.random.choice([False, True], n_observations)
feature1 = np.random.randint(0, 5, n_observations)
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]
feature3 = np.random.randn(n_observations)

In [109]:
f0,f1,f2,f3 = feature0[0], feature1[0], feature2[0], feature3[0]
def serialize_example(f0,f1,f2,f3):
    f2 = f2.numpy() if tf.is_tensor(f2) else f2  # BytesList won't unpack a string from an EagerTensor.
    feature = {
      'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[f0])),
      'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[f1])),
      'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[f2])),
      'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[f3])),
  }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

def tf_serialize_example(f0,f1,f2,f3):
    example_string = tf.py_function(serialize_example, (f0,f1,f2,f3), tf.string)
    #return tf.reshape(example_string, ())  # this way the result is a scalar Tensor of type tf.string
    return example_string  # this way the result is bytes

In [112]:
tf.train.Feature(int64_list=tf.train.Int64List(value=[f0]))

int64_list {
  value: 1
}

##### Serializing and reconstructing an Example object

In [120]:
feature = {
      'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[f0])),
      'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[f1])),
      'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[f2])),
      'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[f3])),
  }
example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
# f0, f1, f2, f3 must be scalars

In [121]:
a = example_proto.SerializeToString()
a

b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04!g\xc0\xbf'

In [122]:
tf.train.Example.FromString(a) == example_proto

True

In [123]:
example = tf.train.Example()
example.ParseFromString(a)
example == example_proto

True

##### Writing a Dataset to a TFRecord file

In [124]:
feature0[:3], feature1[:3], feature2[:3], feature3[:3]

(array([ True,  True, False]),
 array([0, 4, 2]),
 array([b'cat', b'goat', b'chicken'], dtype='|S7'),
 array([-1.50314722, -0.50317397, -1.93570228]))

In [125]:
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

In [126]:
dataset = tf.data.Dataset.from_tensor_slices((feature0[:3], feature1[:3], feature2[:3], feature3[:3]))
dataset = dataset.map(tf_serialize_example)

In [128]:
list(dataset.as_numpy_iterator())

[b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04!g\xc0\xbf',
 b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x02\xd0\x00\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04',
 b'\nU\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x18\xc5\xf7\xbf']

In [129]:
writer1 = tf.data.experimental.TFRecordWriter("data/test_1.tfrecord")
writer1.write(dataset)
tf.io.parse_example(list(dataset), feature_description)

{'feature0': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 1, 0])>,
 'feature1': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 4, 2])>,
 'feature2': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'goat', b'chicken'], dtype=object)>,
 'feature3': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-1.5031472 , -0.50317395, -1.9357023 ], dtype=float32)>}

In [130]:
list(dataset)

[<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04!g\xc0\xbf'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x02\xd0\x00\xbf'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x18\xc5\xf7\xbf'>]

In [139]:
list(dataset.map(lambda x: tf.io.parse_single_example(x, feature_description)))

[{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>,
  'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>,
  'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.68143743>},
 {'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>,
  'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>,
  'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.47863847>},
 {'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>,
  'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=3>,
  'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'horse'>,
  'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.9046812>}]

In [131]:
dataset = tf.data.Dataset.from_tensor_slices((feature0[3:6], feature1[3:6], feature2[3:6], feature3[3:6]))
dataset = dataset.map(tf_serialize_example)
writer2 = tf.data.experimental.TFRecordWriter("data/test_2.tfrecord")
writer2.write(dataset)
tf.io.parse_example(list(dataset), feature_description)

{'feature0': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 0, 1])>,
 'feature1': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([2, 4, 3])>,
 'feature2': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'chicken', b'goat', b'horse'], dtype=object)>,
 'feature3': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-0.68143743, -0.47863847, -1.9046812 ], dtype=float32)>}

##### Reading TFRecord files into a Dataset

In [141]:
files = ["data/test_2.tfrecord", "data/test_1.tfrecord"]
dataset = tf.data.TFRecordDataset(files)

tf.io.parse_example(list(dataset), feature_description)

{'feature0': <tf.Tensor: shape=(6,), dtype=int64, numpy=array([0, 0, 1, 1, 1, 0])>,
 'feature1': <tf.Tensor: shape=(6,), dtype=int64, numpy=array([2, 4, 3, 0, 4, 2])>,
 'feature2': <tf.Tensor: shape=(6,), dtype=string, numpy=
 array([b'chicken', b'goat', b'horse', b'cat', b'goat', b'chicken'],
       dtype=object)>,
 'feature3': <tf.Tensor: shape=(6,), dtype=float32, numpy=
 array([-0.68143743, -0.47863847, -1.9046812 , -1.5031472 , -0.50317395,
        -1.9357023 ], dtype=float32)>}

In [142]:
def _parse_single_example(example):
    return tf.io.parse_single_example(example, feature_description)

In [143]:
list(dataset.map(_parse_single_example).take(2))

[{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>,
  'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>,
  'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.68143743>},
 {'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>,
  'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>,
  'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.47863847>}]

In [144]:
d = dataset.map(_parse_single_example)

In [145]:
a = [tf.feature_column.numeric_column('feature1'), tf.feature_column.numeric_column('feature0')]

In [146]:
df = layers.DenseFeatures(a)

In [151]:
df(next(d.batch(2).as_numpy_iterator()))

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[0., 2.],
       [0., 4.]], dtype=float32)>

In [154]:
b = list(d.batch(2).take(1))
b

[{'feature0': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 0])>,
  'feature1': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 4])>,
  'feature2': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'chicken', b'goat'], dtype=object)>,
  'feature3': <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.68143743, -0.47863847], dtype=float32)>}]

In [155]:
df(b[0])

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[0., 2.],
       [0., 4.]], dtype=float32)>

##### Another way to write the file

In [156]:
with tf.io.TFRecordWriter('data/test.tfrecord') as writer:
    for i in dataset:
        writer.write(i.numpy())

In [157]:
tf.io.parse_example(list(tf.data.TFRecordDataset(['data/test.tfrecord']).take(1)), feature_description)

{'feature0': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>,
 'feature1': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([2])>,
 'feature2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'chicken'], dtype=object)>,
 'feature3': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-0.68143743], dtype=float32)>}

# tf.ragged

Official documentation: https://tensorflow.google.cn/guide/ragged_tensor

In [150]:
digits = tf.ragged.constant([[3, 1, 4, 1], [], [5, 9, 2], [6], []])
words = tf.ragged.constant([["So", "long"], ["thanks", "for", "all", "the", "fish"]])
print(tf.add(digits, 3))
print(tf.reduce_mean(digits, axis=1))
print(tf.concat([digits, [[5, 3]]], axis=0))
print(tf.tile(digits, [1, 2]))
print(tf.strings.substr(words, 0, 2))

<tf.RaggedTensor [[6, 4, 7, 4], [], [8, 12, 5], [9], []]>
tf.Tensor([2.25              nan 5.33333333 6.                nan], shape=(5,), dtype=float64)
<tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9, 2], [6], [], [5, 3]]>
<tf.RaggedTensor [[3, 1, 4, 1, 3, 1, 4, 1], [], [5, 9, 2, 5, 9, 2], [6, 6], []]>
<tf.RaggedTensor [[b'So', b'lo'], [b'th', b'fo', b'al', b'th', b'fi']]>


In [151]:
tf.ragged.map_flat_values(tf.math.square, digits)

<tf.RaggedTensor [[9, 1, 16, 1], [], [25, 81, 4], [36], []]>

In [154]:
digits.to_list()

[[3, 1, 4, 1], [], [5, 9, 2], [6], []]

In [155]:
tf.RaggedTensor.from_value_rowids(
    values=[3, 1, 4, 1, 5, 9, 2],
    value_rowids=[0, 0, 0, 0, 2, 2, 3])

<tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9], [2]]>

In [156]:
tf.RaggedTensor.from_row_lengths(
    values=[3, 1, 4, 1, 5, 9, 2],
    row_lengths=[4, 0, 2, 1])

<tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9], [2]]>

In [157]:
tf.RaggedTensor.from_row_splits(
    values=[3, 1, 4, 1, 5, 9, 2],
    row_splits=[0, 4, 4, 6, 7])

<tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9], [2]]>
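
The three constructors above describe the same partition of `values` in different ways. A plain-Python sketch of how the encodings relate (the helper names are hypothetical, and this is not how TensorFlow stores ragged tensors internally):

```python
from itertools import accumulate


def rowids_to_lengths(value_rowids, nrows):
    """Count how many values fall into each row."""
    lengths = [0] * nrows
    for r in value_rowids:
        lengths[r] += 1
    return lengths


def lengths_to_splits(row_lengths):
    """Cumulative sums give the start and end offset of every row."""
    return [0] + list(accumulate(row_lengths))


def from_row_splits(values, row_splits):
    """Cut the flat values list at the split offsets."""
    return [values[a:b] for a, b in zip(row_splits[:-1], row_splits[1:])]


values = [3, 1, 4, 1, 5, 9, 2]
lengths = rowids_to_lengths([0, 0, 0, 0, 2, 2, 3], nrows=4)
splits = lengths_to_splits(lengths)
rows = from_row_splits(values, splits)
print(lengths, splits, rows)
```

Starting from the `value_rowids` used above, the derived lengths and splits are exactly the arguments passed to `from_row_lengths` and `from_row_splits`, and the resulting rows match all three outputs.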