# Feature Focused Data Architecture

This workflow examines data architecture optimizations that make data more useful as ML features. That sounds opinionated, and it probably is. While there is no single perfect approach, there are tips that make the MLOps process more manageable, scalable, and useful.

Machine Learning (ML) is far more than just training a model:

1. Find data sources
    - discovery
    - understanding
    - formats
    - frequency
    - preparation, ETL
2. Combine data sources
    - formats
    - frequency
    - preparation, ETL
3. Feature Engineering
    - Converting raw data columns into useful signal for ML methods
4. Training ML Models
    - Splits for train/validate/test
    - Iterate Features and Feature Engineering
5. Evaluate Models
    - Continuously
6. Serve Models
    - Format features for prediction
    - Serve features for prediction
7. Monitor Models
    - Skew: Change from training
    - Drift: Change over time
    - Continuously
    - Monitor Features for change
    
When the goal is training a model (4), it is tempting to work through 1-3 ad hoc. _Let's be honest - it's what we do most of the time._ But then, when a model version proves useful, many compromises are needed to get 5-7 to ~~work~~ - it rarely works correctly.

**What if**
- you could make careful decisions during 1-3 that essentially automate 5-7 seamlessly?
- it was not hard or time consuming?
- it made it easier to train and iterate?
- it made everything easier?


# How? BigQuery!
BigQuery is a data warehouse, right? That seems perfect for 1-3 ... until you have fast-changing data and low-latency serving needs. Actually, it's perfect for this as well. Let's proceed and discover together!


---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names, optionally followed by a minimum version. If the import name is not found, then the install name is used to install quietly for the current user.

In [6]:
# tuples of (import name, install name[, min_version])
packages = [
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('google.cloud.storage', 'google-cloud-storage')
]

import importlib.util
import importlib.metadata
from packaging.version import Version  # assumes `packaging` is available in the environment

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by distribution (install) name
        # and compare as parsed versions rather than as strings
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
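
One detail worth knowing about the check above: for a dotted name, `importlib.util.find_spec` imports the parent packages first, so it raises `ModuleNotFoundError` rather than returning `None` when a parent is missing. The following is a minimal sketch of a more defensive check; the helper name `is_installed` is hypothetical and not part of this notebook.

```python
import importlib.util


def is_installed(import_name: str) -> bool:
    """Return True if a module with this import name can be located."""
    try:
        # for dotted names, find_spec imports the parent packages first
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package of a dotted name is missing
        return False


print(is_installed("json"))                        # True
print(is_installed("a_missing_parent_pkg.child"))  # False
```

A helper like this could stand in for the bare `find_spec` call when none of the parent packages are installed yet.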

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [7]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [8]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [9]:
REGION = 'us-central1'
EXPERIMENT = 'architecture'
SERIES = 'feature-data-architecture'

# BigQuery Parameters
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_') # dataset names cannot contain hyphens
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

Packages

In [101]:
import json
import pandas as pd
from google.cloud import storage
from google.cloud import bigquery

Clients

In [14]:
# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


---
## The Idea!


### Create A Dataset

References:
- [Create datasets](https://cloud.google.com/bigquery/docs/datasets)
- [`CREATE SCHEMA` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_schema_statement)

In [20]:
%%bigquery
CREATE SCHEMA IF NOT EXISTS `statmike-mlops-349915.feature_data_architecture`
    OPTIONS(
        location = 'US'
    )


### Create Table: `features_example`

In [26]:
%%bigquery
CREATE OR REPLACE TABLE feature_data_architecture.features_example AS
    SELECT 'customer_abc' AS entity_id, 'a string 1' AS feature_1, 123 AS feature_2, CURRENT_DATE() AS feature_3,
        'some words' AS feature_4, 1 AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abd' AS entity_id, 'a string 2' AS feature_1, 124 AS feature_2, CURRENT_DATE() AS feature_3, 
        'some words' AS feature_4, 2 AS feature_5, NULL as feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abe' AS entity_id, 'a string 1' AS feature_1, 121 AS feature_2, CURRENT_DATE() AS feature_3, 
        'some words' AS feature_4, NULL AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abf' AS entity_id, 'a string 2' AS feature_1, 120 AS feature_2, CURRENT_DATE() AS feature_3,
        NULL AS feature_4, 4 AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
;

SELECT * FROM feature_data_architecture.features_example;


Unnamed: 0,entity_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_timestamp
0,customer_abf,a string 2,120,2024-03-11,,4.0,2023-12-31,2024-03-04 07:45:21.070234+00:00
1,customer_abe,a string 1,121,2024-03-11,some words,,2024-01-27,2024-03-08 02:45:21.070234+00:00
2,customer_abc,a string 1,123,2024-03-11,some words,1.0,2024-01-12,2024-03-06 20:10:21.070234+00:00
3,customer_abd,a string 2,124,2024-03-11,some words,2.0,NaT,2024-03-08 05:31:21.070234+00:00


### Create Table: `features_eav`

In [119]:
%%bigquery
CREATE OR REPLACE TABLE feature_data_architecture.features_eav (
    entity_id STRING,
    feature_name STRING,
    feature_value STRUCT<STRING_value STRING, INT64_value INT64, DATE_value DATE>,
    feature_datatype STRING,
    feature_timestamp TIMESTAMP
);


In [120]:
%%bigquery features_eav_schema
SELECT column_name, data_type
FROM feature_data_architecture.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'features_eav'


In [121]:
features_eav_schema = features_eav_schema.set_index('column_name').to_dict()['data_type']
features_eav_schema

{'entity_id': 'STRING',
 'feature_name': 'STRING',
 'feature_value': 'STRUCT<STRING_value STRING, INT64_value INT64, DATE_value DATE>',
 'feature_datatype': 'STRING',
 'feature_timestamp': 'TIMESTAMP'}

### Data Source For `features_eav`

For the `features_eav` table, features are loaded as individual values, one row per entity, feature, and timestamp. This has the advantage of managing features separately and handling new features over time without schema updates.

To illustrate this, the `features_example` table above is downloaded and converted to the expected `features_eav` schema.

In [71]:
%%bigquery features_example
SELECT * FROM feature_data_architecture.features_example;


In [72]:
features_example

Unnamed: 0,entity_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_timestamp
0,customer_abf,a string 2,120,2024-03-11,,4.0,2023-12-31,2024-03-04 07:45:21.070234+00:00
1,customer_abe,a string 1,121,2024-03-11,some words,,2024-01-27,2024-03-08 02:45:21.070234+00:00
2,customer_abc,a string 1,123,2024-03-11,some words,1.0,2024-01-12,2024-03-06 20:10:21.070234+00:00
3,customer_abd,a string 2,124,2024-03-11,some words,2.0,NaT,2024-03-08 05:31:21.070234+00:00


In [73]:
%%bigquery features_example_schema
SELECT column_name, data_type
FROM feature_data_architecture.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'features_example'


In [74]:
features_example_schema = features_example_schema.set_index('column_name').to_dict()['data_type']
features_example_schema

{'entity_id': 'STRING',
 'feature_1': 'STRING',
 'feature_2': 'INT64',
 'feature_3': 'DATE',
 'feature_4': 'STRING',
 'feature_5': 'INT64',
 'feature_6': 'DATE',
 'feature_timestamp': 'TIMESTAMP'}

In [146]:
import datetime
import random

In [138]:
eav_data = []
features = [f for f in features_example.columns if f not in ['entity_id', 'feature_timestamp']]
for feature in features:
    # get non-null values for current feature
    convert = features_example.loc[features_example[feature].notnull(), ['entity_id', feature, 'feature_timestamp']]
    # add noise to feature_timestamp and then convert to string:
    convert['feature_timestamp'] = convert['feature_timestamp'] + datetime.timedelta(minutes = random.randint(1,10))
    convert['feature_timestamp'] = convert['feature_timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
    if features_example_schema[feature] == 'DATE':
        convert[feature] = convert[feature].astype(str)
    convert = convert.to_dict(orient = 'records')
    for row in convert:
        eav_data.append(
            dict(
                entity_id = row['entity_id'],
                feature_name = feature,
                feature_value = {f'{features_example_schema[feature]}_value': row[feature]},
                feature_datatype = features_example_schema[feature],
                feature_timestamp = row['feature_timestamp']
            )
        )

In [140]:
eav_data[0]

{'entity_id': 'customer_abf',
 'feature_name': 'feature_1',
 'feature_value': {'STRING_value': 'a string 2'},
 'feature_datatype': 'STRING',
 'feature_timestamp': '2024-03-04 07:48:21'}
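
To make the "new features without schema updates" point concrete, here is a small sketch that builds one more record in the same shape as `eav_data[0]`. The helper `to_eav_row` and the feature name `feature_7` are hypothetical and used only for illustration. Note the limit implied by the `feature_value` STRUCT defined earlier: a new feature fits without a schema change only if its datatype is one of `STRING`, `INT64`, or `DATE`.

```python
from datetime import date, datetime, timezone

# value fields available in the feature_value STRUCT of features_eav
STRUCT_FIELDS = {"STRING", "INT64", "DATE"}


def to_eav_row(entity_id, feature_name, value, datatype, timestamp):
    """Build one record in the features_eav layout."""
    if datatype not in STRUCT_FIELDS:
        raise ValueError(f"{datatype} has no matching field in feature_value")
    if isinstance(value, date):
        value = value.isoformat()  # JSON has no native date type
    return {
        "entity_id": entity_id,
        "feature_name": feature_name,
        "feature_value": {f"{datatype}_value": value},
        "feature_datatype": datatype,
        "feature_timestamp": timestamp.strftime("%Y-%m-%d %H:%M:%S"),
    }


# feature_7 is a hypothetical feature that does not exist in features_example
row = to_eav_row(
    "customer_abc", "feature_7", 42, "INT64",
    datetime(2024, 3, 11, 12, 0, 0, tzinfo=timezone.utc),
)
print(row)
```

Appending `row` to `eav_data` adds the new feature as data only; no column has to be added to `features_eav`.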

### Import `features_eav` Data Source

Loading data in the schema of `features_eav` amounts to appending rows to the table. This could happen from multiple jobs running as batch loads or streaming inserts. Here, the process is illustrated using a batch load from a local JSON file.

References:
- [Introduction to loading data](https://cloud.google.com/bigquery/docs/loading-data)
    - Batch: Load jobs, SQL, BigQuery Data Transfer Service, BigQuery Storage Write API, Managed Services
    - Stream: Storage Write API, Dataflow, Datastream, BigQuery Connector for SAP, Pub/Sub
    - SQL: queries to append/overwrite
    - Third-party applications
- [Batch Loading Data](https://cloud.google.com/bigquery/docs/batch-loading-data)
- [Loading from local files (with bq cli)](https://cloud.google.com/bigquery/docs/batch-loading-data#loading_data_from_local_files)
    - [`bq load` CLI]()
- [Load JSON data](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#limitations)

Save as JSON lines:

In [141]:
with open('eav.json', 'w') as f:
    f.write('\n'.join(map(json.dumps, eav_data)))
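
Before loading, it can help to confirm that the file really is newline-delimited JSON, meaning one complete object per line. The sketch below writes two sample records (values taken from the example data in this notebook) to a temporary file and reads them back; the file name `eav_check.json` is an assumption chosen to avoid overwriting `eav.json`.

```python
import json
import os
import tempfile

records = [
    {"entity_id": "customer_abc", "feature_name": "feature_2",
     "feature_value": {"INT64_value": 123}, "feature_datatype": "INT64",
     "feature_timestamp": "2024-03-06 20:11:21"},
    {"entity_id": "customer_abc", "feature_name": "feature_1",
     "feature_value": {"STRING_value": "a string 1"}, "feature_datatype": "STRING",
     "feature_timestamp": "2024-03-06 20:13:21"},
]

# write one JSON object per line, the same way eav.json is written
path = os.path.join(tempfile.mkdtemp(), "eav_check.json")
with open(path, "w") as f:
    f.write("\n".join(json.dumps(r) for r in records))

# read back line by line: every non-empty line must parse on its own
with open(path) as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(parsed == records)  # True
```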

In [150]:
!bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect --replace=true statmike-mlops-349915:feature_data_architecture.features_eav ./eav.json

Upload complete.
Waiting on bqjob_r4d9b87a4b5238dd6_0000018e2d9c8d37_1 ... (2s) Current status: DONE   


In [151]:
%%bigquery
SELECT *
FROM feature_data_architecture.features_eav
LIMIT 10


Unnamed: 0,feature_datatype,feature_name,feature_timestamp,feature_value,entity_id
0,DATE,feature_3,2024-03-04 07:53:21+00:00,"{'DATE_value': 2024-03-11, 'INT64_value': None...",customer_abf
1,DATE,feature_3,2024-03-08 02:53:21+00:00,"{'DATE_value': 2024-03-11, 'INT64_value': None...",customer_abe
2,DATE,feature_3,2024-03-06 20:18:21+00:00,"{'DATE_value': 2024-03-11, 'INT64_value': None...",customer_abc
3,DATE,feature_3,2024-03-08 05:39:21+00:00,"{'DATE_value': 2024-03-11, 'INT64_value': None...",customer_abd
4,DATE,feature_6,2024-03-04 07:46:21+00:00,"{'DATE_value': 2023-12-31, 'INT64_value': None...",customer_abf
5,DATE,feature_6,2024-03-08 02:46:21+00:00,"{'DATE_value': 2024-01-27, 'INT64_value': None...",customer_abe
6,DATE,feature_6,2024-03-06 20:11:21+00:00,"{'DATE_value': 2024-01-12, 'INT64_value': None...",customer_abc
7,INT64,feature_2,2024-03-04 07:46:21+00:00,"{'DATE_value': None, 'INT64_value': 120, 'STRI...",customer_abf
8,INT64,feature_2,2024-03-08 02:46:21+00:00,"{'DATE_value': None, 'INT64_value': 121, 'STRI...",customer_abe
9,INT64,feature_2,2024-03-06 20:11:21+00:00,"{'DATE_value': None, 'INT64_value': 123, 'STRI...",customer_abc


### Create Table: `features_from_eav` From `features_eav` Table

Use a stored procedure to create a pivoted version of the `features_eav` table. While this example stored procedure recreates the full pivot on each call, it could be parameterized to append only new records since a previous time.

References:
- [Working with SQL stored procedures](https://cloud.google.com/bigquery/docs/procedures)
- [Procedural Language](https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language)
- [PIVOT operator](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator)
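
As a rough illustration of what the pivot produces, the following pure-Python sketch turns a few EAV rows into one wide row per `(entity_id, feature_timestamp)` pair. It is not the stored procedure itself: the function name `pivot_eav` is hypothetical, and where the SQL aggregates with `MAX`, this sketch simply keeps the last value seen.

```python
from collections import defaultdict

# a few rows shaped like features_eav (values mirror the example data)
eav_rows = [
    {"entity_id": "customer_abc", "feature_name": "feature_2", "feature_datatype": "INT64",
     "feature_value": {"INT64_value": 123}, "feature_timestamp": "2024-03-06 20:11:21"},
    {"entity_id": "customer_abc", "feature_name": "feature_6", "feature_datatype": "DATE",
     "feature_value": {"DATE_value": "2024-01-12"}, "feature_timestamp": "2024-03-06 20:11:21"},
    {"entity_id": "customer_abc", "feature_name": "feature_1", "feature_datatype": "STRING",
     "feature_value": {"STRING_value": "a string 1"}, "feature_timestamp": "2024-03-06 20:13:21"},
]


def pivot_eav(rows):
    """Return one wide dict per (entity_id, feature_timestamp)."""
    feature_names = sorted({r["feature_name"] for r in rows})
    wide = defaultdict(dict)
    for r in rows:
        key = (r["entity_id"], r["feature_timestamp"])
        # pick the struct field that matches the declared datatype
        value = r["feature_value"][f"{r['feature_datatype']}_value"]
        wide[key][r["feature_name"]] = value  # last value wins in this sketch
    return [
        {"entity_id": e, "feature_timestamp": ts,
         **{name: feats.get(name) for name in feature_names}}
        for (e, ts), feats in sorted(wide.items())
    ]


for wide_row in pivot_eav(eav_rows):
    print(wide_row)
```

Features that were not observed at a given timestamp come out as `None`, which corresponds to the sparse, mostly-null rows in the pivoted table.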

In [152]:
%%bigquery
CREATE OR REPLACE PROCEDURE feature_data_architecture.feature_history()
BEGIN
    DECLARE counter INT64 DEFAULT 0;
    DECLARE querystring STRING;
    DECLARE features STRING;

    SET querystring = """
CREATE OR REPLACE TABLE feature_data_architecture.features_from_eav AS
SELECT *
FROM
""";

    FOR datatype IN (SELECT DISTINCT feature_datatype FROM feature_data_architecture.features_eav) DO
        SET features = (SELECT STRING_AGG(DISTINCT CONCAT("'", feature_name, "'"), ',') FROM feature_data_architecture.features_eav WHERE feature_datatype = datatype.feature_datatype);
        SET counter = counter + 1;

        IF counter >= 2 THEN SET querystring = CONCAT(querystring, """ FULL JOIN """);
        END IF;

        SET querystring = CONCAT(querystring, """(
    SELECT *
    FROM (SELECT entity_id, feature_timestamp, feature_name, feature_value.""", datatype.feature_datatype, """_value as feature_value FROM feature_data_architecture.features_eav WHERE feature_datatype = '""", datatype.feature_datatype, """')
    PIVOT(MAX(feature_value) FOR feature_name IN (""", features,""")) 
)""");

        IF counter >= 2 THEN SET querystring = CONCAT(querystring, """ USING(entity_id, feature_timestamp) """);
        END IF;

    END FOR;
    EXECUTE IMMEDIATE querystring;
END


In [153]:
%%bigquery
CALL feature_data_architecture.feature_history();
SELECT *
FROM feature_data_architecture.features_from_eav;


Unnamed: 0,entity_id,feature_timestamp,feature_3,feature_6,feature_2,feature_5,feature_1,feature_4
0,customer_abc,2024-03-06 20:14:21+00:00,NaT,NaT,,,,some words
1,customer_abc,2024-03-06 20:13:21+00:00,NaT,NaT,,,a string 1,
2,customer_abc,2024-03-06 20:11:21+00:00,NaT,2024-01-12,123.0,,,
3,customer_abc,2024-03-06 20:17:21+00:00,NaT,NaT,,1.0,,
4,customer_abd,2024-03-08 05:38:21+00:00,NaT,NaT,,2.0,,
5,customer_abd,2024-03-08 05:34:21+00:00,NaT,NaT,,,a string 2,
6,customer_abd,2024-03-08 05:35:21+00:00,NaT,NaT,,,,some words
7,customer_abd,2024-03-08 05:32:21+00:00,NaT,NaT,124.0,,,
8,customer_abe,2024-03-08 02:46:21+00:00,NaT,2024-01-27,121.0,,,
9,customer_abe,2024-03-08 02:49:21+00:00,NaT,NaT,,,,some words


### Create View: `features_current` AS Current Point In Time

The table `features_from_eav` is a history of feature values. Use the `ML.FEATURES_AT_TIME` function to get all feature values for each value of `entity_id` at a specific point-in-time. This will get the most recent non-null value for each feature as of the requested point-in-time.

Reference:
- [ML.FEATURES_AT_TIME function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-feature-time)
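
The point-in-time idea can be sketched in plain Python: for each entity, walk the history in time order and keep the latest non-null value of every feature at or before the requested time. This is a simplified emulation for intuition only, not how BigQuery implements `ML.FEATURES_AT_TIME`; the function name `features_at_time` and the string timestamps are assumptions of the sketch.

```python
def features_at_time(history, as_of):
    """Latest non-null value per feature for each entity at or before as_of."""
    latest = {}
    # timestamps share one 'YYYY-MM-DD HH:MM:SS' format, so string order is time order
    for row in sorted(history, key=lambda r: r["feature_timestamp"]):
        if row["feature_timestamp"] > as_of:
            continue  # ignore values recorded after the requested time
        current = latest.setdefault(row["entity_id"], {})
        for name, value in row.items():
            if name in ("entity_id", "feature_timestamp"):
                continue
            if value is not None:  # nulls never overwrite an earlier value
                current[name] = value
    return latest


# sparse history rows shaped like features_from_eav
history = [
    {"entity_id": "customer_abc", "feature_timestamp": "2024-03-06 20:11:21",
     "feature_1": None, "feature_2": 123, "feature_5": None},
    {"entity_id": "customer_abc", "feature_timestamp": "2024-03-06 20:13:21",
     "feature_1": "a string 1", "feature_2": None, "feature_5": None},
    {"entity_id": "customer_abc", "feature_timestamp": "2024-03-06 20:17:21",
     "feature_1": None, "feature_2": None, "feature_5": 1},
]

snapshot = features_at_time(history, "2024-03-06 20:15:00")
print(snapshot)
```

With the earlier cutoff, `feature_5` is absent because its only value was recorded at 20:17:21; a later cutoff would include it.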

In [155]:
%%bigquery
CREATE OR REPLACE VIEW feature_data_architecture.features_current AS
    SELECT *
    FROM ML.FEATURES_AT_TIME(
        TABLE feature_data_architecture.features_from_eav,
        time => CURRENT_TIMESTAMP(),
        num_rows => 1,
        ignore_feature_nulls => TRUE
    )


In [156]:
%%bigquery
SELECT *
FROM feature_data_architecture.features_current


Unnamed: 0,entity_id,feature_timestamp,feature_3,feature_6,feature_2,feature_5,feature_1,feature_4
0,customer_abf,2024-03-11 13:13:12.398605+00:00,2024-03-11,2023-12-31,120,4.0,a string 2,
1,customer_abd,2024-03-11 13:13:12.398605+00:00,2024-03-11,NaT,124,2.0,a string 2,some words
2,customer_abc,2024-03-11 13:13:12.398605+00:00,2024-03-11,2024-01-12,123,1.0,a string 1,some words
3,customer_abe,2024-03-11 13:13:12.398605+00:00,2024-03-11,2024-01-27,121,,a string 1,some words


### Compare: `features_example` To `features_current`

In [163]:
%%bigquery temp_a
SELECT * EXCEPT(feature_timestamp)
FROM feature_data_architecture.features_example
ORDER BY entity_id


In [164]:
%%bigquery temp_b
SELECT * EXCEPT(feature_timestamp)
FROM feature_data_architecture.features_current
ORDER BY entity_id


In [165]:
temp_b = temp_b[temp_a.columns]
temp_a.compare(temp_b)

In [166]:
temp_a.equals(temp_b)

True