
## Special Data Structures

In practice, about 95% of the use cases you will encounter require nothing beyond `tf.keras` and `tf.data`.

But now it’s time to dive deeper into TensorFlow
and look at its lower-level Python API. This is useful when you need extra
control to write custom loss functions, custom metrics, layers, models, initializers,
regularizers, weight constraints, and more.

You may even need to fully control the
training loop itself, for example to apply special transformations or constraints to the
gradients (beyond just clipping them) or to use multiple optimizers for different parts
of the network.

TensorFlow’s API revolves around tensors, which flow from operation to operation—hence the name TensorFlow.

We will take a very quick look at the data structures supported by
TensorFlow, beyond regular float or integer tensors. This includes strings, ragged tensors,
sparse tensors, tensor arrays, sets, and queues.



## Setup

In [1]:
import sys
import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow import keras

import numpy as np
import os
import time

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Strings

Tensors can hold byte strings, which is useful in particular for natural language processing.

In [2]:
tf.constant(b"hello world")

<tf.Tensor: shape=(), dtype=string, numpy=b'hello world'>

In [3]:
# build a tensor with a Unicode string
tf.constant("café")

<tf.Tensor: shape=(), dtype=string, numpy=b'caf\xc3\xa9'>

In [6]:
# represent a Unicode string as a tensor of Unicode code points
u = tf.constant([ord(c) for c in "café"])
u

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([ 99,  97, 102, 233], dtype=int32)>

In [11]:
# encode the code points as a UTF-8 byte string, then count its characters (not its bytes)
b = tf.strings.unicode_encode(u, "UTF-8")
tf.strings.length(b, unit="UTF8_CHAR")

<tf.Tensor: shape=(), dtype=int32, numpy=4>
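The difference between bytes and characters is easy to miss, so here is a minimal sketch that makes it explicit. It assumes only the `tf.strings` functions already used above; the variable names (`code_points`, `encoded`, `n_bytes`, `n_chars`) are illustrative.

```python
import tensorflow as tf

# Code points for "café": c=99, a=97, f=102, é=233
code_points = tf.constant([ord(c) for c in "café"])
encoded = tf.strings.unicode_encode(code_points, "UTF-8")

# "é" takes two bytes in UTF-8 (0xC3 0xA9), so the byte count
# is larger than the character count.
n_bytes = tf.strings.length(encoded)                    # default unit is "BYTE"
n_chars = tf.strings.length(encoded, unit="UTF8_CHAR")

print(encoded.numpy())             # b'caf\xc3\xa9'
print(int(n_bytes), int(n_chars))  # 5 4
```

This is why the tensor built from `"café"` earlier prints as `b'caf\xc3\xa9'`: the string tensor stores the UTF-8 bytes, not the characters.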

In [12]:
tf.strings.unicode_decode(b, "UTF-8")

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([ 99,  97, 102, 233], dtype=int32)>

In [13]:
# manipulate tensors containing multiple strings
p = tf.constant(["Café", "Coffee", "caffè", "咖啡"])

In [14]:
tf.strings.length(p, unit="UTF8_CHAR")

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([4, 6, 5, 2], dtype=int32)>

In [17]:
r = tf.strings.unicode_decode(p, "UTF8")
r

<tf.RaggedTensor [[67, 97, 102, 233], [67, 111, 102, 102, 101, 101],
 [99, 97, 102, 102, 232], [21654, 21857]]>

In [18]:
print(r)

<tf.RaggedTensor [[67, 97, 102, 233], [67, 111, 102, 102, 101, 101],
 [99, 97, 102, 102, 232], [21654, 21857]]>


## Ragged Tensors

A ragged tensor is a special kind of tensor that represents a list of arrays of different
sizes. 

More generally, it is a tensor with one or more ragged dimensions, meaning
dimensions whose slices may have different lengths.
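To make the idea of a ragged dimension concrete, the following sketch builds a small ragged tensor and inspects how it is laid out. The tensor `rt` is a made-up example, not one of the tensors used elsewhere in this notebook.

```python
import tensorflow as tf

# Three rows of different lengths (the middle one is empty)
rt = tf.ragged.constant([[1, 2, 3], [], [4, 5]])

print(rt.shape)                  # (3, None): the second dimension is ragged
print(rt.row_lengths().numpy())  # [3 0 2]

# Internally, the rows are stored as one flat array of values
# plus the offsets at which each row starts and ends.
print(rt.values.numpy())         # [1 2 3 4 5]
print(rt.row_splits.numpy())     # [0 3 3 5]
```

Row `i` is the slice `values[row_splits[i]:row_splits[i + 1]]`, which is how rows of different lengths fit in a single structure without padding.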

In [19]:
# let’s look at the second element of the ragged tensor
print(r[1])

tf.Tensor([ 67 111 102 102 101 101], shape=(6,), dtype=int32)


In [20]:
# create a second ragged tensor
r2 = tf.ragged.constant([[65, 66], [], [67]])
# concatenate it with first along axis 0
print(tf.concat([r, r2], axis=0))

<tf.RaggedTensor [[67, 97, 102, 233], [67, 111, 102, 102, 101, 101],
 [99, 97, 102, 102, 232], [21654, 21857], [65, 66], [], [67]]>


In [21]:
# concatenate along axis 1
r3 = tf.ragged.constant([[68, 69, 70], [71], [], [72, 73]])
print(tf.concat([r, r3], axis=1))

<tf.RaggedTensor [[67, 97, 102, 233, 68, 69, 70], [67, 111, 102, 102, 101, 101, 71],
 [99, 97, 102, 102, 232], [21654, 21857, 72, 73]]>


In [22]:
tf.strings.unicode_encode(r3, "UTF-8")

<tf.Tensor: shape=(4,), dtype=string, numpy=array([b'DEF', b'G', b'', b'HI'], dtype=object)>

In [23]:
# convert to a regular tensor (shorter rows are padded with zeros)
r.to_tensor()

<tf.Tensor: shape=(4, 6), dtype=int32, numpy=
array([[   67,    97,   102,   233,     0,     0],
       [   67,   111,   102,   102,   101,   101],
       [   99,    97,   102,   102,   232,     0],
       [21654, 21857,     0,     0,     0,     0]], dtype=int32)>

In [24]:
r2.to_tensor()

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[65, 66],
       [ 0,  0],
       [67,  0]], dtype=int32)>

In [25]:
r3.to_tensor()

<tf.Tensor: shape=(4, 3), dtype=int32, numpy=
array([[68, 69, 70],
       [71,  0,  0],
       [ 0,  0,  0],
       [72, 73,  0]], dtype=int32)>
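Zero padding can be ambiguous when 0 is also a legitimate value. As a sketch of one way around this, `to_tensor()` accepts a `default_value`, and `tf.RaggedTensor.from_tensor()` can strip a given padding value again. The choice of -1 as the padding value is an assumption for illustration; any value that cannot occur in the data would do.

```python
import tensorflow as tf

rt = tf.ragged.constant([[65, 66], [], [67]])

# Pad with -1 instead of the default 0
padded = rt.to_tensor(default_value=-1)
print(padded.numpy().tolist())  # [[65, 66], [-1, -1], [67, -1]]

# Convert back, dropping trailing -1 padding from each row
restored = tf.RaggedTensor.from_tensor(padded, padding=-1)
print(restored.to_list())       # [[65, 66], [], [67]]
```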

## Sparse Tensors

TensorFlow can also efficiently represent sparse tensors (i.e., tensors containing
mostly zeros).

In [33]:
# create a sparse tensor by specifying the indices and values of the nonzero elements and the tensor’s shape
s = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                    values=[1., 2., 3.],
                    dense_shape=[3, 4])
print(s)

SparseTensor(indices=tf.Tensor(
[[0 1]
 [1 0]
 [2 3]], shape=(3, 2), dtype=int64), values=tf.Tensor([1. 2. 3.], shape=(3,), dtype=float32), dense_shape=tf.Tensor([3 4], shape=(2,), dtype=int64))


In [34]:
tf.sparse.to_dense(s)

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [2., 0., 0., 0.],
       [0., 0., 0., 3.]], dtype=float32)>
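The conversion also works in the other direction. The sketch below uses `tf.sparse.from_dense()` on a dense matrix with the same contents and prints the three components that make up the sparse representation. The names `dense` and `sp` are illustrative.

```python
import tensorflow as tf

dense = tf.constant([[0., 1., 0., 0.],
                     [2., 0., 0., 0.],
                     [0., 0., 0., 3.]])

# Keep only the nonzero entries
sp = tf.sparse.from_dense(dense)

print(sp.indices.numpy().tolist())      # [[0, 1], [1, 0], [2, 3]]
print(sp.values.numpy().tolist())       # [1.0, 2.0, 3.0]
print(sp.dense_shape.numpy().tolist())  # [3, 4]
```

Only three values and their positions are stored, instead of all twelve entries.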

In [35]:
# multiply a sparse tensor by any scalar value
s2 = s * 3.14
tf.sparse.to_dense(s2)

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[0.  , 3.14, 0.  , 0.  ],
       [6.28, 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 9.42]], dtype=float32)>

In [36]:
# but you cannot add a scalar value to a sparse tensor
try:
  s3 = s + 1
except TypeError as ex:
  print(ex)

unsupported operand type(s) for +: 'SparseTensor' and 'int'
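Adding a scalar to every entry would turn all the implicit zeros into nonzero values, so the result would no longer be sparse. Depending on what you actually want, two possible workarounds are sketched below; which one is appropriate is a judgment call, and the variable names are illustrative.

```python
import tensorflow as tf

s = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                    values=[1., 2., 3.],
                    dense_shape=[3, 4])

# Option 1: densify first, then add. Every entry, including the zeros, gets +1.
dense_plus_one = tf.sparse.to_dense(s) + 1.
print(dense_plus_one.numpy().tolist())
# [[1.0, 2.0, 1.0, 1.0], [3.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 4.0]]

# Option 2: add only to the stored values. The result stays sparse,
# and the implicit zeros are left untouched.
values_plus_one = tf.SparseTensor(indices=s.indices,
                                  values=s.values + 1.,
                                  dense_shape=s.dense_shape)
print(tf.sparse.to_dense(values_plus_one).numpy().tolist())
# [[0.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 4.0]]
```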


In [37]:
s4 = tf.constant([
  [10., 20.],
  [30., 40.],
  [50., 60.], 
  [70., 80.]                
])

tf.sparse.sparse_dense_matmul(s, s4)

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 30.,  40.],
       [ 20.,  40.],
       [210., 240.]], dtype=float32)>
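As a quick sanity check, the sparse product should match an ordinary matrix product on the densified tensor. The sketch below rebuilds the same two operands under new names (`sp`, `m`) so that it runs on its own.

```python
import tensorflow as tf

sp = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                     values=[1., 2., 3.],
                     dense_shape=[3, 4])
m = tf.constant([[10., 20.],
                 [30., 40.],
                 [50., 60.],
                 [70., 80.]])

sparse_result = tf.sparse.sparse_dense_matmul(sp, m)
dense_result = tf.matmul(tf.sparse.to_dense(sp), m)

# Both approaches give the same (3, 2) result
same = bool(tf.reduce_all(tf.abs(sparse_result - dense_result) < 1e-6))
print(same)  # True
```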

In [38]:
s5 = tf.SparseTensor(indices=[[0, 2], [0, 1]],
                    values=[1., 2.],
                    dense_shape=[3, 4])
print(s5)

SparseTensor(indices=tf.Tensor(
[[0 2]
 [0 1]], shape=(2, 2), dtype=int64), values=tf.Tensor([1. 2.], shape=(2,), dtype=float32), dense_shape=tf.Tensor([3 4], shape=(2,), dtype=int64))


In [39]:
try:
  tf.sparse.to_dense(s5)
except tf.errors.InvalidArgumentError as ex:
  print(ex)

indices[1] = [0,1] is out of order. Many sparse ops require sorted indices.
    Use `tf.sparse.reorder` to create a correctly ordered copy.

 [Op:SparseToDense]


In [42]:
s6 = tf.sparse.reorder(s5)
tf.sparse.to_dense(s6)

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[0., 2., 1., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)>

## Tensor Arrays

A `tf.TensorArray` represents a list of tensors. This can be handy in dynamic models
containing loops, to accumulate results and later compute some statistics.
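A minimal sketch of the accumulate-then-summarize pattern is shown here. The loop body (squaring the step index) is a placeholder standing in for whatever a real model would compute at each step, and `n_steps` and `results` are illustrative names.

```python
import tensorflow as tf

n_steps = 4
results = tf.TensorArray(dtype=tf.float32, size=n_steps)

for step in range(n_steps):
    # Placeholder computation for this step
    value = tf.constant(float(step)) ** 2
    # write() returns the updated array, so keep the returned reference
    results = results.write(step, value)

# Stack the accumulated items and compute a statistic over them
stacked = results.stack()
mean = tf.reduce_mean(stacked)

print(stacked.numpy().tolist())  # [0.0, 1.0, 4.0, 9.0]
print(float(mean))               # 3.5
```

Reassigning the result of `write()` on every iteration matters most inside a `tf.function`, where the array is threaded through the graph rather than modified in place.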

In [43]:
array = tf.TensorArray(dtype=tf.float32, size=3)
array = array.write(0, tf.constant([1., 2.]))
array = array.write(1, tf.constant([3., 10.]))
array = array.write(2, tf.constant([5., 7.]))

# read() returns the item at index 1 and, as the stacked output below shows, zeros it out in the array
tensor1 = array.read(1)
tensor1

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([ 3., 10.], dtype=float32)>

In [46]:
# stack all the items into a regular tensor
array.stack()

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[1., 2.],
       [0., 0.],
       [5., 7.]], dtype=float32)>

## Sets

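TensorFlow supports sets of integers or strings (but not floats). It represents them
using regular tensors: each set is a vector along the tensor’s last axis. The set
operations in `tf.sets` return sparse tensors.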

In [47]:
# let’s create two sets and compute their union
a = tf.constant([[1, 5, 9]])
b = tf.constant([[5, 6, 9, 11]])
u = tf.sets.union(a, b)
u

<tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fbdb26cc110>

In [48]:
tf.sparse.to_dense(u)

<tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[ 1,  5,  6,  9, 11]], dtype=int32)>

In [50]:
# we can also compute the union of multiple pairs of sets simultaneously
a = tf.constant([[2, 3, 5, 7], [7, 9, 0, 0]])
b = tf.constant([[4, 5, 6], [9, 10, 0]])
u = tf.sets.union(a, b)
tf.sparse.to_dense(u)

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 2,  3,  4,  5,  6,  7],
       [ 0,  7,  9, 10,  0,  0]], dtype=int32)>

In [51]:
tf.sparse.to_dense(tf.sets.difference(a, b))

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[2, 3, 7],
       [7, 0, 0]], dtype=int32)>

In [52]:
tf.sparse.to_dense(tf.sets.intersection(a, b))

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[5, 0],
       [0, 9]], dtype=int32)>
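Because the dense output is padded with zeros, it is not obvious how many elements each resulting set really contains. One way to find out, sketched below, is `tf.sets.size()`, which counts the elements of each set in a sparse result. The inputs repeat the tensors used above.

```python
import tensorflow as tf

a = tf.constant([[2, 3, 5, 7], [7, 9, 0, 0]])
b = tf.constant([[4, 5, 6], [9, 10, 0]])

common = tf.sets.intersection(a, b)

# Number of elements in each intersection
sizes = tf.sets.size(common)
print(sizes.numpy().tolist())  # [1, 2]
```

The second size is 2 rather than 1 because the zeros in the input rows are treated as ordinary set members: both rows contain 0, so the intersection is {0, 9}.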

In [53]:
tf.sparse.to_dense(tf.sets.union(a, b))

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 2,  3,  4,  5,  6,  7],
       [ 0,  7,  9, 10,  0,  0]], dtype=int32)>

In [54]:
# If you prefer to use a different padding value
tf.sparse.to_dense(u, default_value=-1)

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 2,  3,  4,  5,  6,  7],
       [ 0,  7,  9, 10, -1, -1]], dtype=int32)>

In [56]:
tf.sparse.to_dense(u, default_value=1)

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 2,  3,  4,  5,  6,  7],
       [ 0,  7,  9, 10,  1,  1]], dtype=int32)>