
# Autoencoders - Data To Training

How to retrieve data for training and using an autoencoder.

This workflow covers getting data to where training happens: in this case, several methods of reading BigQuery data into a Pandas DataFrame.  The dataframe is then fed to TensorFlow in batches with named inputs (columns).  Additionally, the TensorFlow I/O reader for BigQuery is used to read batches directly from BigQuery without first loading an entire dataframe.

**Applied Autoencoders Series**

This workflow is part of a [series](./readme.md) focused on training and using autoencoders.  The series starts from the foundation of reading data efficiently and incrementally introduces concepts.

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20Autoencoders/Autoencoders%20-%20Data.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [43]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [1]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

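The list `packages` contains tuples of package import names and install names.  If the import name is not found then the install name is used to install quietly for the current user.  A tuple may include a third element, a minimum version; an older installed version triggers an upgrade.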

In [2]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('google.cloud.bigquery_storage', 'google-cloud-bigquery-storage'),
    ('bigframes', 'bigframes'),
    ('pandas_gbq', 'pandas-gbq'),
    ('tensorflow', 'tensorflow', '2.10'),
    ('tensorflow_io', '--no-deps tensorflow-io'),
    ('graphviz', 'graphviz'),
    ('pydot', 'pydot')
]

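import importlib.util
import importlib.metadata

def version_tuple(version):
    # compare major.minor numerically; as strings, '2.9' < '2.10' is False
    return tuple(int(part) for part in version.split('.')[:2])

install = False
for package in packages:
    # install when the import name cannot be found
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    # upgrade when an optional minimum version is given and not met
    elif len(package) == 3:
        if version_tuple(importlib.metadata.version(package[0])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user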

In [3]:
#!sudo apt-get -qq install graphviz

### Restart Kernel (If Installs Occurred)

After a kernel restart, continue running from the cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [6]:
REGION = 'us-central1'
EXPERIMENT = 'data'
SERIES = 'applied-autoencoders'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id,splits' # add more variables to the string with comma delimiters

packages:

In [8]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from google.cloud import bigquery
from google.cloud import bigquery_storage
import bigframes.pandas as bpd
import pandas as pd
import numpy as np
import concurrent.futures

from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
import tensorflow as tf

#from datetime import datetime

#from google.protobuf import json_format
#from google.protobuf.struct_pb2 import Value
#import json
#import numpy as np


clients:

In [9]:
bq = bigquery.Client(project = PROJECT_ID)
bqstorage = bigquery_storage.BigQueryReadClient()
bpd.options.bigquery.project = PROJECT_ID

---
## Review Data

The data source here was prepared in [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb).  In this notebook we will use the prepared BigQuery table as input for TensorFlow.

This is a table of 284,807 credit card transactions classified as fraudulent or normal in the column `Class`.  To protect confidentiality, the original features have been transformed using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction

The data preparation added splits for machine learning in a column named `splits`, with 80% for training (`TRAIN`), 10% for validation (`VALIDATE`) and 10% for testing (`TEST`).  Additionally, a unique identifier, `transaction_id`, was added to each transaction.

Review the number of records for each level of the data splits:

In [12]:
query = f"""
    SELECT splits, count(*) as n
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    GROUP BY splits
"""
print(query)


    SELECT splits, count(*) as n
    FROM `statmike-mlops-349915.fraud.fraud_prepped`
    GROUP BY splits



In [13]:
bq.query(query = query).to_dataframe()

Unnamed: 0,splits,n
0,TEST,28502
1,TRAIN,228061
2,VALIDATE,28244
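
As a quick sanity check, the counts above can be compared against the intended 80/10/10 split. This is a minimal sketch with the counts copied by hand from the query result rather than read from BigQuery:

```python
# Row counts copied from the query output above
counts = {'TEST': 28502, 'TRAIN': 228061, 'VALIDATE': 28244}

total = sum(counts.values())
fractions = {split: n / total for split, n in counts.items()}

for split, frac in sorted(fractions.items()):
    print(f'{split}: {frac:.4f}')
print('total rows:', total)
```

The total matches the 284,807 transactions described above, and each fraction lands close to its target share.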


---
## From BigQuery To Pandas DataFrame

### Common Query

In [14]:
query = f'''
SELECT * EXCEPT({','.join([VAR_TARGET] + VAR_OMIT.replace(' ', '').split(','))})
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
WHERE splits = 'TRAIN'
'''
print(query)


SELECT * EXCEPT(Class,transaction_id,splits)
FROM `statmike-mlops-349915.fraud.fraud_prepped`
WHERE splits = 'TRAIN'
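
The `EXCEPT` list in the query is assembled from `VAR_TARGET` and the comma-delimited `VAR_OMIT` string. The sketch below isolates that step using the same values as the Setup section; `except_columns` and `except_clause` are names introduced here for illustration:

```python
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id,splits'

# strip spaces so 'a, b' and 'a,b' behave the same, then split on commas
except_columns = [VAR_TARGET] + VAR_OMIT.replace(' ', '').split(',')
except_clause = f"EXCEPT({','.join(except_columns)})"
print(except_clause)
```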



### BigQuery Cell Magic

https://cloud.google.com/python/docs/reference/bigquery/latest/magics

In [15]:
%%bigquery bq_data_magic
SELECT * EXCEPT(Class,transaction_id,splits)
FROM `statmike-mlops-349915.fraud.fraud_prepped`
WHERE splits = 'TRAIN'

In [16]:
bq_data_magic.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,2812,-0.633403,0.963616,2.494946,2.099051,-0.404331,0.235862,-0.007932,0.211442,-0.209817,...,-0.174985,0.014676,0.016278,-0.061462,0.355196,-0.179086,-0.106947,-0.215039,0.050698,0.0
1,3150,1.313281,-0.257923,0.118463,-0.735557,-0.569308,-0.733577,-0.138659,-0.141641,1.708019,...,-0.145144,-0.082467,0.126066,-0.223157,-0.074977,0.92194,-0.528283,0.064476,0.013132,0.0
2,16676,1.15848,0.168947,0.536345,1.187908,-0.265547,-0.076325,-0.355844,0.144615,1.462346,...,-0.355289,0.016492,0.263518,-0.076711,-0.079402,0.502827,-0.270819,-0.004966,-0.003372,0.0
3,17701,-1.279231,-0.153303,3.29631,3.320441,1.139018,0.542343,-0.729928,-0.051774,0.922712,...,0.028639,-0.409746,-0.342575,-0.493297,-0.017046,-0.107404,0.101164,-0.19794,-0.435654,0.0
4,28131,1.069507,-0.000362,1.448936,2.874498,-0.736266,0.831932,-0.762267,0.406772,0.626473,...,-0.292305,0.035393,0.444433,-0.085413,0.09909,0.506438,0.246418,0.057864,0.021133,0.0


In [17]:
type(bq_data_magic)

pandas.core.frame.DataFrame

In [18]:
bq_data_magic.shape

(228061, 30)

### BigQuery Python Client

https://cloud.google.com/python/docs/reference/bigquery/latest

In [19]:
bq_data_client = bq.query(query = query).to_dataframe()
bq_data_client.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,2812,-0.633403,0.963616,2.494946,2.099051,-0.404331,0.235862,-0.007932,0.211442,-0.209817,...,-0.174985,0.014676,0.016278,-0.061462,0.355196,-0.179086,-0.106947,-0.215039,0.050698,0.0
1,3150,1.313281,-0.257923,0.118463,-0.735557,-0.569308,-0.733577,-0.138659,-0.141641,1.708019,...,-0.145144,-0.082467,0.126066,-0.223157,-0.074977,0.92194,-0.528283,0.064476,0.013132,0.0
2,16676,1.15848,0.168947,0.536345,1.187908,-0.265547,-0.076325,-0.355844,0.144615,1.462346,...,-0.355289,0.016492,0.263518,-0.076711,-0.079402,0.502827,-0.270819,-0.004966,-0.003372,0.0
3,17701,-1.279231,-0.153303,3.29631,3.320441,1.139018,0.542343,-0.729928,-0.051774,0.922712,...,0.028639,-0.409746,-0.342575,-0.493297,-0.017046,-0.107404,0.101164,-0.19794,-0.435654,0.0
4,28131,1.069507,-0.000362,1.448936,2.874498,-0.736266,0.831932,-0.762267,0.406772,0.626473,...,-0.292305,0.035393,0.444433,-0.085413,0.09909,0.506438,0.246418,0.057864,0.021133,0.0


In [20]:
type(bq_data_client)

pandas.core.frame.DataFrame

In [21]:
bq_data_client.shape

(228061, 30)

### BigQuery BigFrames Client

https://cloud.google.com/python/docs/reference/bigframes/latest

In [22]:
bq_data_bigframes = bpd.read_gbq(query)
bq_data_bigframes.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,117857,-0.34221,0.903781,0.556961,-0.195003,0.112991,-0.51582,0.811486,-0.554063,-1.763437,...,0.714474,-0.078788,0.108075,-0.290118,1.113121,0.447556,0.199248,-0.081135,-0.061349,40.0
1,56447,1.42638,-0.611825,0.16567,-0.898404,-0.671324,-0.14292,-0.756738,0.090045,-0.472077,...,-0.047684,-0.143361,-0.569391,0.082969,-0.838399,0.21394,-0.393502,0.015332,0.006712,6.18
2,26096,-4.254652,4.612257,-0.286959,-0.95134,0.979931,-0.953636,3.114519,-2.523468,5.570837,...,4.261157,-1.124268,0.597059,-0.133162,0.470766,0.230323,-0.678818,1.085622,-0.940805,1.79
3,125780,-0.345134,1.036943,-0.230124,-0.755815,0.374212,-0.506093,0.613336,0.387138,-0.522661,...,-0.064826,-0.17186,-0.48293,0.102741,-0.463138,-0.434558,0.151072,0.118073,0.021872,19.98
4,56861,-3.021637,-3.317537,1.372621,-2.25474,0.75967,0.605632,-2.111818,1.355714,-2.332184,...,0.281491,0.446124,0.429207,0.072856,-1.389918,0.079758,-0.127151,-0.002403,-0.410168,118.0


In [23]:
type(bq_data_bigframes)

bigframes.dataframe.DataFrame

In [24]:
bq_data_bigframes.shape

(228061, 30)

In [25]:
bq_data_bigframes = bq_data_bigframes.to_pandas()
type(bq_data_bigframes)

pandas.core.frame.DataFrame

In [26]:
bq_data_bigframes.shape

(228061, 30)

### BigQuery Storage Client

https://cloud.google.com/python/docs/reference/bigquerystorage/latest

In [27]:
read_session = bqstorage.create_read_session(
    request = dict(
        parent = f'projects/{PROJECT_ID}',
        read_session = dict(
            table = f"projects/{BQ_PROJECT}/datasets/{BQ_DATASET}/tables/{BQ_TABLE}",
            data_format = bigquery_storage.types.DataFormat.ARROW,
            read_options = dict(
                row_restriction = "splits = 'TRAIN'",
                selected_fields = bq_data_bigframes.columns.tolist()
            )
        ),
        max_stream_count = 0
    )
)

In [28]:
len(read_session.streams)

1

In [29]:
def read_stream(stream):
    # setup a reader
    reader = bqstorage.read_rows(name = stream.name)
    # read rows from reader into a dataframe.  Note this is actually multiple operations - read and convert
    return reader.to_dataframe()


bq_data_storage = []
with concurrent.futures.ThreadPoolExecutor(max_workers = len(read_session.streams)) as executor:
    futures = {
        executor.submit(read_stream, stream): stream for stream in read_session.streams
    }
    for future in concurrent.futures.as_completed(futures):
        stream = futures[future]
        bq_data_storage.append(future.result())
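
The thread pool pattern above can be tried without BigQuery. In this sketch `fake_read_stream` is a hypothetical stand-in for `read_stream`, and the stream names are made up. With more than one stream, `as_completed` yields results in completion order rather than submission order, so the collected pieces may need re-sorting if row order matters.

```python
import concurrent.futures

def fake_read_stream(stream_name):
    # hypothetical stand-in for reading one BigQuery Storage stream
    return [f'{stream_name}-row{i}' for i in range(3)]

streams = ['stream-0', 'stream-1', 'stream-2']  # made-up stream names

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers = len(streams)) as executor:
    futures = {executor.submit(fake_read_stream, s): s for s in streams}
    for future in concurrent.futures.as_completed(futures):
        # keep the stream name with its rows so the pieces can be re-ordered later
        results.append((futures[future], future.result()))

# one result per stream, in whatever order the streams finished
print(len(results))
```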

In [30]:
len(bq_data_storage)

1

In [31]:
bq_data_storage[0].shape

(228061, 30)

In [32]:
bq_data_storage = pd.concat(bq_data_storage)
bq_data_storage.shape

(228061, 30)

In [33]:
type(bq_data_storage)

pandas.core.frame.DataFrame

In [34]:
bq_data_storage.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,2812,-0.633403,0.963616,2.494946,2.099051,-0.404331,0.235862,-0.007932,0.211442,-0.209817,...,-0.174985,0.014676,0.016278,-0.061462,0.355196,-0.179086,-0.106947,-0.215039,0.050698,0.0
1,3150,1.313281,-0.257923,0.118463,-0.735557,-0.569308,-0.733577,-0.138659,-0.141641,1.708019,...,-0.145144,-0.082467,0.126066,-0.223157,-0.074977,0.92194,-0.528283,0.064476,0.013132,0.0
2,16676,1.15848,0.168947,0.536345,1.187908,-0.265547,-0.076325,-0.355844,0.144615,1.462346,...,-0.355289,0.016492,0.263518,-0.076711,-0.079402,0.502827,-0.270819,-0.004966,-0.003372,0.0
3,17701,-1.279231,-0.153303,3.29631,3.320441,1.139018,0.542343,-0.729928,-0.051774,0.922712,...,0.028639,-0.409746,-0.342575,-0.493297,-0.017046,-0.107404,0.101164,-0.19794,-0.435654,0.0
4,28131,1.069507,-0.000362,1.448936,2.874498,-0.736266,0.831932,-0.762267,0.406772,0.626473,...,-0.292305,0.035393,0.444433,-0.085413,0.09909,0.506438,0.246418,0.057864,0.021133,0.0


### Indirect BigQuery with `pandas-gbq`

When working with [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), the methods above show the client returning data to pandas dataframes.  This section shows a pandas module, [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/), that wraps the BigQuery client so that pandas can retrieve BigQuery data into dataframes.

References:
- [Comparison of BigQuery Client with pandas-gbq](https://cloud.google.com/bigquery/docs/pandas-gbq-migration)

In [35]:
bq_data_pandasgbq = pd.read_gbq(query, project_id = PROJECT_ID)
bq_data_pandasgbq.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,2812,-0.633403,0.963616,2.494946,2.099051,-0.404331,0.235862,-0.007932,0.211442,-0.209817,...,-0.174985,0.014676,0.016278,-0.061462,0.355196,-0.179086,-0.106947,-0.215039,0.050698,0.0
1,3150,1.313281,-0.257923,0.118463,-0.735557,-0.569308,-0.733577,-0.138659,-0.141641,1.708019,...,-0.145144,-0.082467,0.126066,-0.223157,-0.074977,0.92194,-0.528283,0.064476,0.013132,0.0
2,16676,1.15848,0.168947,0.536345,1.187908,-0.265547,-0.076325,-0.355844,0.144615,1.462346,...,-0.355289,0.016492,0.263518,-0.076711,-0.079402,0.502827,-0.270819,-0.004966,-0.003372,0.0
3,17701,-1.279231,-0.153303,3.29631,3.320441,1.139018,0.542343,-0.729928,-0.051774,0.922712,...,0.028639,-0.409746,-0.342575,-0.493297,-0.017046,-0.107404,0.101164,-0.19794,-0.435654,0.0
4,28131,1.069507,-0.000362,1.448936,2.874498,-0.736266,0.831932,-0.762267,0.406772,0.626473,...,-0.292305,0.035393,0.444433,-0.085413,0.09909,0.506438,0.246418,0.057864,0.021133,0.0


In [36]:
type(bq_data_pandasgbq)

pandas.core.frame.DataFrame

In [37]:
bq_data_pandasgbq.shape

(228061, 30)

## From Pandas Dataframe To TensorFlow

The methods above read data to a Pandas dataframe that is local to this session.  This section shows how to make the dataframe ready for TensorFlow as a `tf.data` object.  More methods are [covered here](https://www.tensorflow.org/tutorials/load_data/pandas_dataframe) in the TensorFlow tutorials.

Make a copy of one of the dataframes above:

In [237]:
training_data = bq_data_storage.copy()

Set up a `tf.data` object to read the dataframe.  In this case, cast the dataframe to a dictionary to preserve the column names in the inputs. Otherwise each row would be read as an array.

References:
- [tf.data.Dataset.from_tensor_slices()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)
- [Load a pandas DataFrame - A DataFrame as a dictionary](https://www.tensorflow.org/tutorials/load_data/pandas_dataframe#a_dataframe_as_a_dictionary)

In [206]:
training_reader = tf.data.Dataset.from_tensor_slices(dict(training_data))
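
Conceptually, slicing a dictionary of columns yields one dictionary per row, keyed by column name. The pure-Python sketch below mimics that behaviour on two rows, with values as displayed in the tables above. It illustrates the idea only and does not call TensorFlow:

```python
# two columns, two rows (values as displayed in the tables above)
columns = {
    'Time': [2812, 3150],
    'V1': [-0.633403, 1.313281],
}

n_rows = len(next(iter(columns.values())))
# one dict per row: column names are preserved as keys
row_dicts = [{name: values[i] for name, values in columns.items()} for i in range(n_rows)]
# without the dict, each row is just a positional array
row_arrays = [[values[i] for values in columns.values()] for i in range(n_rows)]

print(row_dicts[0])
print(row_arrays[0])
```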

Set up an [iterator](https://docs.python.org/3/library/functions.html#iter) and review a return value from the [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset):

In [292]:
ds_iter = iter(training_reader)
{key: value.numpy() for key, value in next(ds_iter).items()}

{'Time': 2812,
 'V1': -0.6334029882736469,
 'V2': 0.9636160386293929,
 'V3': 2.4949456217577497,
 'V4': 2.0990509863350297,
 'V5': -0.4043306727875379,
 'V6': 0.23586157953548997,
 'V7': -0.00793190515031739,
 'V8': 0.211441518482132,
 'V9': -0.20981682042808,
 'V10': 0.308297602896481,
 'V11': -1.20499230853772,
 'V12': -0.4747078092970429,
 'V13': -0.654063561632139,
 'V14': -0.474599113137004,
 'V15': -0.428417793384727,
 'V16': 0.5366514815446061,
 'V17': -0.380654616844995,
 'V18': 0.0286505393093891,
 'V19': -0.687969434192997,
 'V20': -0.174984760363205,
 'V21': 0.0146755277991034,
 'V22': 0.0162781765829899,
 'V23': -0.061462472923487,
 'V24': 0.35519634316361604,
 'V25': -0.17908550429831896,
 'V26': -0.10694742544378999,
 'V27': -0.21503925668538898,
 'V28': 0.0506977952270228,
 'Amount': 0.0}

Build a function and compile it as a `tf.function` that creates a new input feature that is an array of all numeric feature values. Also, cast columns to a common data type as needed.

References:
- [tf.function](https://www.tensorflow.org/api_docs/python/tf/function)
- [Introduction to graphs and tf.function](https://www.tensorflow.org/guide/intro_to_graphs)

In [294]:
@tf.function
def fn1(x):
    y = {}
    y.update(x)
    feature_array = []
    for col in training_data.columns:
        if x[col].dtype != tf.float64:
            feature_array.append(tf.cast(x[col], tf.float64))
        else:
            feature_array.append(x[col])
    
    y['feature_array'] = feature_array
    return y

Use an iterator to return and review a value from the Dataset while [mapping](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) this function to each call:

In [295]:
ds_iter = iter(training_reader.map(fn1))
{key: value.numpy() for key, value in next(ds_iter).items()}

{'Time': 2812,
 'V1': -0.6334029882736469,
 'V2': 0.9636160386293929,
 'V3': 2.4949456217577497,
 'V4': 2.0990509863350297,
 'V5': -0.4043306727875379,
 'V6': 0.23586157953548997,
 'V7': -0.00793190515031739,
 'V8': 0.211441518482132,
 'V9': -0.20981682042808,
 'V10': 0.308297602896481,
 'V11': -1.20499230853772,
 'V12': -0.4747078092970429,
 'V13': -0.654063561632139,
 'V14': -0.474599113137004,
 'V15': -0.428417793384727,
 'V16': 0.5366514815446061,
 'V17': -0.380654616844995,
 'V18': 0.0286505393093891,
 'V19': -0.687969434192997,
 'V20': -0.174984760363205,
 'V21': 0.0146755277991034,
 'V22': 0.0162781765829899,
 'V23': -0.061462472923487,
 'V24': 0.35519634316361604,
 'V25': -0.17908550429831896,
 'V26': -0.10694742544378999,
 'V27': -0.21503925668538898,
 'V28': 0.0506977952270228,
 'Amount': 0.0,
 'feature_array': array([ 2.81200000e+03, -6.33402988e-01,  9.63616039e-01,  2.49494562e+00,
         2.09905099e+00, -4.04330673e-01,  2.35861580e-01, -7.93190515e-03,
         2.11441

Now test the Dataset reader with the options that will be used, like:
- [batch()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) to limit the number of rows per call
- [map(lambda v: (v, v.pop('feature_array')))](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) to turn the response into a tuple with the feature columns in the first element and the new 'feature_array' in the second element.

Use [take(1)](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) to limit the number of calls for batches to a single request.
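
The `(v, v.pop('feature_array'))` idiom relies on `pop` removing the key from the same dictionary that becomes the first tuple element. Here is a plain-Python sketch of that behaviour on a small made-up record, where `split_record` is a named version of the lambda:

```python
# made-up record standing in for one element of the mapped dataset
record = {'Time': 2812, 'Amount': 0.0, 'feature_array': [2812.0, 0.0]}

def split_record(v):
    # same expression as the lambda: the tuple holds a reference to v,
    # and pop() then removes 'feature_array' from that same dict
    return (v, v.pop('feature_array'))

features, feature_array = split_record(record)

print(sorted(features))   # 'feature_array' is no longer a key
print(feature_array)
```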

In [297]:
for features, feature_array in training_reader.map(fn1).map(lambda v: (v, v.pop('feature_array'))).batch(2).take(1):
    print('features:\n',list(features.keys()))
    for feature in features.items():
        print(feature)
    print('feature array:\n', feature_array)

features:
 ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
('Time', <tf.Tensor: shape=(2,), dtype=int64, numpy=array([2812, 3150])>)
('V1', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([-0.63340299,  1.31328087])>)
('V2', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([ 0.96361604, -0.25792282])>)
('V3', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([2.49494562, 0.11846283])>)
('V4', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([ 2.09905099, -0.73555665])>)
('V5', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([-0.40433067, -0.56930772])>)
('V6', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([ 0.23586158, -0.73357721])>)
('V7', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([-0.00793191, -0.13865918])>)
('V8', <tf.Tensor: shape=(2,), dtype=float64, numpy=array([ 0.21144152, -0.14164134])>)
('V9', 

### Training In TensorFlow

Build a normalization layer with [Keras Preprocessing Layers](https://www.tensorflow.org/guide/keras/preprocessing_layers). In this case all columns are already numeric, so applying [tf.keras.layers.Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) to the 'feature_array' column with the option `axis = -1` will calculate the mean and variance for each element of the 'feature_array' across all records.  This calculation only needs to be done once and can be triggered with the built-in [.adapt()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization#adapt) method.  The `adapt()` method forces the calculation of the mean and variance across an input dataset.  In this case, we modify the `tf.data.Dataset` reader built above to return only the 'feature_array' column and then pass it to the adapt method.

> Tip: apply a larger batch size to this reader to speed up performance.  The only calculations being made are the mean and variance and they are only computed once, prior to training.

In [301]:
feature_array_reader = training_reader.map(fn1).map(lambda v: v.pop('feature_array'))
normalizer = tf.keras.layers.Normalization(name = 'normalize', axis = -1)
normalizer.adapt(feature_array_reader.prefetch(2).batch(10000))
normalizer.mean, normalizer.variance

(<tf.Tensor: shape=(1, 30), dtype=float32, numpy=
 array([[ 9.4811133e+04, -2.1519326e-04,  3.1602196e-04, -5.2488595e-04,
          6.9466559e-04, -1.2641819e-03,  2.0892750e-03, -7.2106207e-04,
         -1.0636140e-03,  1.4059977e-03, -7.1558170e-05, -6.4140302e-04,
         -1.5961546e-03,  1.8235012e-03, -6.6740625e-04,  4.2201288e-04,
         -2.3144973e-04,  5.9940410e-04, -7.0123409e-04, -1.1209438e-03,
          7.4361498e-04, -5.4229691e-04,  7.6822005e-04,  3.2623837e-04,
          3.5052517e-04, -5.9398869e-04,  4.6557584e-04, -6.2941969e-04,
         -8.2514744e-05,  8.8535156e+01]], dtype=float32)>,
 <tf.Tensor: shape=(1, 30), dtype=float32, numpy=
 array([[2.2556255e+09, 3.8344266e+00, 2.7213738e+00, 2.3109169e+00,
         2.0030899e+00, 1.9093831e+00, 1.7799078e+00, 1.5511755e+00,
         1.4520742e+00, 1.2102603e+00, 1.1956733e+00, 1.0426062e+00,
         1.0046861e+00, 9.9211246e-01, 9.2476696e-01, 8.3907151e-01,
         7.6823246e-01, 7.2538882e-01, 7.0431167e-01,

Similarly, build a de-normalizer to help return final reconstructed values from the autoencoder to the original scale.  This also uses [tf.keras.layers.Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) but now we can directly input the mean and variance calculated for the normalization layer while also using the `invert = True` argument to indicate an inverse transformation.

In [302]:
denormalizer = tf.keras.layers.Normalization(
    name = 'denormalize',
    mean = normalizer.mean,
    variance = normalizer.variance,
    invert = True
)
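
The arithmetic behind the two layers is per-feature standardization and its inverse. This is a minimal pure-Python sketch on made-up data, assuming population variance and ignoring the small epsilon that Keras uses to guard against division by zero:

```python
import math

# made-up data: 3 records, 2 features
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
n = len(data)
n_features = len(data[0])

# per-feature mean and (population) variance across all records
mean = [sum(row[j] for row in data) / n for j in range(n_features)]
variance = [sum((row[j] - mean[j]) ** 2 for row in data) / n for j in range(n_features)]

def normalize(row):
    # subtract the mean, divide by the standard deviation
    return [(x - m) / math.sqrt(v) for x, m, v in zip(row, mean, variance)]

def denormalize(row):
    # inverse transform: scale back up, then shift by the mean
    return [x * math.sqrt(v) + m for x, m, v in zip(row, mean, variance)]

restored = denormalize(normalize(data[0]))
print(mean, variance)
print(restored)
```

Applying the inverse to a normalized record recovers the original values up to floating point error, which is the role the de-normalizer plays after the decoder.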

Build the model with layers:

In [412]:
# feature inputs: autoencoder
feature_inputs = [tf.keras.Input(shape = (1,), dtype = dtypes.float64, name = feature) for feature in training_data.columns]

# input layer of concatenated features
feature_layer = tf.keras.layers.Concatenate(name = 'feature_layer')(feature_inputs)

# use the pre-adapted normalizer as a layer in the model
norm_layer = normalizer(feature_layer)

# encoder
encoder = tf.keras.layers.Dense(128, activation = tf.nn.relu)(norm_layer)
encoder = tf.keras.layers.Dense(64, activation = tf.nn.relu)(encoder)
encoder = tf.keras.layers.Dense(8, activation = tf.nn.relu, name = 'encoder')(encoder)

# decoder
decoder = tf.keras.layers.Dense(64, activation = tf.nn.relu)(encoder)
decoder = tf.keras.layers.Dense(128, activation = tf.nn.relu)(decoder)
decoder = tf.keras.layers.Dense(feature_layer.shape[1], activation = tf.nn.sigmoid, name = 'decoder')(decoder)

# de-normalize 
reconstruct = denormalizer(decoder)

# define loss function - custom
def mae_loss(norm_layer, decoder):
    return tf.keras.losses.mae(norm_layer, decoder)
    
# map back to columns
#reconstructed = tf.split
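
For a sense of model size, the trainable parameter count of the dense layers can be worked out by hand: each `Dense` layer has `inputs * units` weights plus `units` biases. The sketch assumes 30 input features, as in the dataframe above, and does not count the statistics held by the normalization layers. `dense_params` is a helper introduced here:

```python
def dense_params(n_inputs, units):
    # weight matrix plus one bias per unit
    return n_inputs * units + units

n_features = 30  # columns in the training dataframe
layer_units = [128, 64, 8, 64, 128, n_features]  # encoder then decoder

total = 0
n_inputs = n_features
for units in layer_units:
    total += dense_params(n_inputs, units)
    n_inputs = units  # this layer's output feeds the next layer

print(total)
```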

Create a model from the layers using [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model).  The inputs are the list of input layers, `feature_inputs`, and the outputs are a dictionary of named layers: `feature_layer`, `norm_layer`, `decoder`, and `reconstruct`.

In [413]:
model = tf.keras.Model(
    inputs = feature_inputs,
    #outputs = [feature_layer, norm_layer, encoder, decoder, reconstruct, mean_absolute_error],
    outputs = {
        'feature_layer': feature_layer,
        'norm_layer': norm_layer,
        'decoder': decoder,
        'reconstruct': reconstruct
    },
    name = 'autoencoder_from_dataframe'
)

Compile the model to make it ready for training.  The loss and metrics are attached to the `decoder` output by name.

In [414]:
model.compile(
    optimizer = tf.keras.optimizers.Adam(), #SGD or Adam
    loss = {'decoder': tf.keras.losses.MeanAbsoluteError()},
    #loss = {'decoder': mae_loss},
    metrics = {'decoder': [
        tf.keras.metrics.RootMeanSquaredError(name = 'rmse'),
        tf.keras.metrics.MeanSquaredError(name = 'mse'),
        tf.keras.metrics.MeanAbsoluteError(name = 'mae'),
        tf.keras.metrics.MeanSquaredLogarithmicError(name = 'msle'),
    ]}
)
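
For reference, the four metrics reduce to simple formulas. The sketch below computes them in plain Python on made-up non-negative values. It follows the textbook definitions and is not a call into Keras, whose implementations may differ in details such as clipping before the logarithm:

```python
import math

y_true = [1.0, 2.0, 3.0]   # made-up targets
y_pred = [1.0, 2.0, 5.0]   # made-up predictions
n = len(y_true)

mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)
# squared difference of log(1 + value), averaged
msle = sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / n

print(mae, mse, rmse, msle)
```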

In [415]:
model.fit(
    training_reader.prefetch(2).map(fn1).map(lambda v: (v, v.pop('feature_array'))).shuffle(1000).batch(100),
    epochs = 2
)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7f60e4bbc820>
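
With `batch(100)` and the 228,061 training rows counted earlier, the number of batches per epoch follows from a ceiling division, assuming the final partial batch is kept (the default when `drop_remainder` is not set):

```python
import math

n_rows = 228061     # TRAIN rows from the split counts above
batch_size = 100

steps_per_epoch = math.ceil(n_rows / batch_size)
last_batch_size = n_rows - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```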

In [416]:
prediction = model.predict(training_reader.batch(1).take(1))
prediction



{'feature_layer': array([[ 2.8120000e+03, -6.3340300e-01,  9.6361601e-01,  2.4949455e+00,
          2.0990510e+00, -4.0433067e-01,  2.3586158e-01, -7.9319049e-03,
          2.1144152e-01, -2.0981681e-01,  3.0829760e-01, -1.2049923e+00,
         -4.7470781e-01, -6.5406358e-01, -4.7459912e-01, -4.2841780e-01,
          5.3665149e-01, -3.8065460e-01,  2.8650539e-02, -6.8796945e-01,
         -1.7498475e-01,  1.4675528e-02,  1.6278177e-02, -6.1462473e-02,
          3.5519636e-01, -1.7908551e-01, -1.0694742e-01, -2.1503925e-01,
          5.0697796e-02,  0.0000000e+00]], dtype=float32),
 'norm_layer': array([[-1.937092  , -0.32335705,  0.58393896,  1.6415733 ,  1.482617  ,
         -0.2916958 ,  0.17522429, -0.00578969,  0.17634982, -0.19200009,
          0.28201008, -1.1794863 , -0.472007  , -0.6584891 , -0.49283284,
         -0.4681614 ,  0.61253834, -0.44764012,  0.03497453, -0.8430712 ,
         -0.22730693,  0.02058318,  0.02136702, -0.10057119,  0.5855404 ,
         -0.34210312, -0.2228

In [460]:
norm_abs_diffs

<KerasTensor: shape=(None, 30) dtype=float32 (created by layer 'tf.math.abs_25')>

In [461]:
# metric calcs
mean_absolute_error = tf.keras.losses.mae(reconstruct, feature_layer)
mean_squared_error = tf.keras.losses.mse(reconstruct, feature_layer)
mean_squared_log_error = tf.keras.losses.msle(reconstruct, feature_layer)

# errors order by norm error absolute magnitude
norm_abs_diffs = tf.math.abs(norm_layer - decoder)
errors = {feature_inputs[v].name : val for v, val in enumerate(norm_abs_diffs[0,:])}
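
The comment above mentions ordering by error magnitude, while the `errors` dictionary itself keeps column order. Once predictions are available as plain numbers, ranking is a single sort. The values below are a few entries rounded from the `errors` output of `full_model.predict` shown in this notebook:

```python
# a few per-feature errors, rounded from the notebook's prediction output
errors = {'Time': 2.9371, 'V1': 0.3234, 'V4': 1.4758, 'V7': 0.0404, 'V11': 1.1796}

# largest normalized absolute error first
ranked = sorted(errors.items(), key = lambda item: item[1], reverse = True)
for name, value in ranked:
    print(name, value)
```

Features at the top of the ranking are the ones the autoencoder reconstructs least well for that record.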

In [463]:
post_model = tf.keras.Model(
    inputs = {k: v for k, v in model.output.items() if k in ['reconstruct', 'feature_layer']},
    outputs = {
        'mean_absolute_error': mean_absolute_error[0],
        'mean_squared_error': mean_squared_error[0],
        'mean_squared_log_error': mean_squared_log_error[0],
        'norm_abs_diffs': norm_abs_diffs,
        'errors': errors
    },
    name = 'autoencoder_post'
)

In [464]:
full_model = tf.keras.Model(
    inputs = model.inputs,
    outputs = post_model(model(model.inputs))
)

In [465]:
full_model.predict(training_reader.batch(1).take(1))



{'mean_absolute_error': 4661.46826171875,
 'mean_squared_error': 648609664.0,
 'mean_squared_log_error': 1.702450156211853,
 'norm_abs_diffs': array([[2.9370918 , 0.32335705, 0.37803096, 0.6415809 , 1.4758447 ,
         0.2917732 , 0.17522429, 0.04035453, 0.22188339, 0.27408174,
         0.28201008, 1.1796379 , 0.5047729 , 0.6593223 , 0.5023773 ,
         0.48417845, 0.31570557, 0.44764012, 0.05711917, 0.8523536 ,
         0.2277058 , 0.00534623, 0.05908938, 0.1005712 , 0.29074565,
         0.3421032 , 0.23876813, 0.5292185 , 0.13162944, 1.352224  ]],
       dtype=float32),
 'errors': {'Time': 2.937091827392578,
  'V1': 0.3233570456504822,
  'V2': 0.3780309557914734,
  'V3': 0.6415808796882629,
  'V4': 1.4758447408676147,
  'V5': 0.2917732000350952,
  'V6': 0.17522428929805756,
  'V7': 0.04035453125834465,
  'V8': 0.2218833863735199,
  'V9': 0.2740817368030548,
  'V10': 0.2820100784301758,
  'V11': 1.1796379089355469,
  'V12': 0.5047729015350342,
  'V13': 0.6593223214149475,
  'V14': 0

In [442]:
full_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 Time (InputLayer)           [(None, 1)]                  0         []                            
                                                                                                  
 V1 (InputLayer)             [(None, 1)]                  0         []                            
                                                                                                  
 V2 (InputLayer)             [(None, 1)]                  0         []                            
                                                                                                  
 V3 (InputLayer)             [(None, 1)]                  0         []                            
                                                                                            

## From BigQuery To TensorFlow With TensorFlow I/O

The TensorFlow I/O BigQuery reader streams rows from BigQuery storage directly into `tf.data` objects, so batches can be read without first loading the full table into a dataframe.

https://www.tensorflow.org/io

In [345]:
nclasses = bq.query(query = f'SELECT DISTINCT {VAR_TARGET} FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE {VAR_TARGET} IS NOT NULL').to_dataframe()
nclasses = nclasses.shape[0]
nclasses

2

In [346]:
query = f'''
SELECT *
FROM {BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '{BQ_TABLE}'
    AND COLUMN_NAME NOT IN ('transaction_id', 'splits')
'''
schema = bq.query(query).to_dataframe()

In [347]:
schema.data_type.unique().tolist()

['INT64', 'FLOAT64']

In [348]:
types = {
    'FLOAT64' : dtypes.float64,
    'INT64' : dtypes.int64
}

In [349]:
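def prep(features):
    # separate the target column from the feature dictionary
    target = features.pop(VAR_TARGET)
    # one-hot encode the class label and match the float64 feature dtype
    target = tf.one_hot(tf.cast(target, tf.int64), nclasses)
    target = tf.cast(target, tf.float64)
    return features, target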
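
The target transformation in `prep` is a one-hot encoding over the `nclasses` levels of `Class`. This plain-Python sketch shows the same mapping; `one_hot` is a helper defined here and not the TensorFlow function:

```python
def one_hot(index, depth):
    # 1.0 at position `index`, 0.0 elsewhere
    return [1.0 if i == index else 0.0 for i in range(depth)]

nclasses = 2  # distinct values of Class found above
for label in (0, 1):
    print(label, one_hot(label, nclasses))
```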

In [353]:
training_reader_tfio = BigQueryClient().read_session(
    parent = f"projects/{PROJECT_ID}",
    project_id = BQ_PROJECT,
    table_id = BQ_TABLE,
    dataset_id = BQ_DATASET,
    selected_fields = [x for x in schema.column_name.tolist()],
    output_types = [types[x] for x in schema.data_type.tolist()],
    row_restriction = f"splits='TRAIN'",
    requested_streams = 3
).parallel_read_rows(sloppy = True, num_parallel_calls = tf.data.experimental.AUTOTUNE)
type(training_reader_tfio)

tensorflow.python.data.ops.interleave_op._ParallelInterleaveDataset

In [354]:
for features, target in training_reader_tfio.map(prep).batch(5).take(1):
    print('features:\n',list(features.keys()))
    for feature in features.items():
        print(feature)
    print('\ntarget:\n',target)

features:
 ['Amount', 'Time', 'V1', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V2', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9']
('Amount', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([0., 0., 0., 0., 0.])>)
('Time', <tf.Tensor: shape=(5,), dtype=int64, numpy=array([ 2812,  3150, 16676, 17701, 28131])>)
('V1', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-0.63340299,  1.31328087,  1.15847976, -1.27923083,  1.06950736])>)
('V10', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 0.3082976 , -1.10329377, -0.17276001,  0.84594969,  0.37324618])>)
('V11', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-1.20499231, -1.08782009,  2.05305928,  1.38923569, -1.32944263])>)
('V12', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-0.47470781,  0.64467588, -2.73649895, -2.44018135, -0.1867695 ])>)
('V13', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-0.65406356, -0.21536864, -0.
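
Note that the feature keys printed above are not in table order. The order shown is consistent with a plain lexicographic sort of the column names, which the following check reproduces. This is an observation about the output, not a documented guarantee of the reader:

```python
# column names in table order: Time, V1..V28, Amount
table_order = ['Time'] + [f'V{i}' for i in range(1, 29)] + ['Amount']

# string sort puts 'V10' before 'V2' because characters are compared one by one
lexicographic = sorted(table_order)
print(lexicographic[:5])
```

Because the Keras inputs are matched to the dictionary keys by name, this ordering does not affect which column feeds which input.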

### Training In TensorFlow

In [361]:
feature_inputs = [tf.keras.Input(shape = (1,), dtype = dtypes.float64, name = feature) for feature in schema.column_name if feature != VAR_TARGET]

In [362]:
feature_layer = tf.keras.layers.Concatenate(name = 'feature_layer')(feature_inputs)
norm_layer = tf.keras.layers.BatchNormalization(axis = -1, name = 'batch_normalization')(feature_layer)
logistic = tf.keras.layers.Dense(nclasses, activation = tf.nn.softmax, name = 'logistic')(norm_layer)

In [363]:
model = tf.keras.Model(
    inputs = feature_inputs,
    outputs = logistic,
    name = 'example_from_dataframe'
)

In [364]:
model.compile(
    optimizer = tf.keras.optimizers.SGD(), #SGD or Adam
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy', tf.keras.metrics.AUC(curve = 'PR', name = 'auprc')]
)

In [365]:
model.fit(training_reader_tfio.prefetch(2).map(prep).shuffle(1000).batch(100), epochs = 2)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7f24fb151e70>