# Feature Focused Data Architecture

This workflow examines data architecture optimizations that make data more useful for ML features. That sounds opinionated, and it probably is. While there is no perfect way, there are tips that make the MLOps process more manageable, scalable, and useful.

Machine Learning (ML) is far more than just training a model:

1. Find data sources
    - discovery
    - understanding
    - formats
    - frequency
    - preparation, ETL
2. Combine data sources
    - formats
    - frequency
    - preparation, ETL
3. Feature Engineering
    - Converting raw data columns into useful signal for ML methods
4. Training ML Models
    - Splits for train/validate/test
    - Iterate Features and Feature Engineering
5. Evaluate Models
    - Continuously
6. Serve Models
    - Format features for prediction
    - Serve features for prediction
7. Monitor Models
    - Skew: Change from training
    - Drift: Change over time
    - Continuously
    - Monitor Features for change
    
When the goal is training a model (4), it might seem easy to work through 1-3 ad hoc. _Let's be honest: it's what we do most of the time._ But then, when a model version proves useful, many compromises are needed to get 5-7 to ~~work~~, and the result rarely works correctly.

**What if**
- you could make careful decisions during 1-3 that essentially automate 5-7 seamlessly?
- it was not hard or time-consuming?
- it made it easier to train and iterate?
- it made everything easier?


# How? BigQuery!
BigQuery is a data warehouse, right? That seems perfect for 1-3 ... until you have fast-changing data and low-latency serving needs. Actually, it's perfect for this as well. Let's proceed and discover together!


---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except ImportError:
    # google.colab is only importable inside Colab
    print('Not a Colab Environment')

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names, with an optional minimum version. If the import name is not found, the install name is used to install quietly for the current user.

In [6]:
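# tuples of (import name, install name, optional min_version)
packages = [
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('google.cloud.storage', 'google-cloud-storage')
]

import importlib.util
import importlib.metadata
from packaging.version import Version  # assumes `packaging` is present, as in most notebook environments

install = False
for package in packages:
    try:
        # find_spec raises if a parent package (for example `google`) is missing
        found = importlib.util.find_spec(package[0]) is not None
    except ModuleNotFoundError:
        found = False
    if not found:
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by install (distribution) name and compare as versions, not strings
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user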

### Restart Kernel (If Installs Occurred)

After a kernel restart, continue running from the cell that follows this one.

In [7]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [8]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [9]:
REGION = 'us-central1'
EXPERIMENT = 'architecture'
SERIES = 'feature-data-architecture'

# BigQuery Parameters
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_') # dataset IDs allow only letters, numbers, and underscores
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

Packages

In [39]:
import json

from google.cloud import storage
from google.cloud import bigquery

Clients

In [14]:
# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


---
## Experiment

### Create A Dataset

References:
- [Create datasets](https://cloud.google.com/bigquery/docs/datasets)
- [`CREATE SCHEMA` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_schema_statement)

In [20]:
%%bigquery
CREATE SCHEMA IF NOT EXISTS `statmike-mlops-349915.feature_data_architecture`
    OPTIONS(
        location = 'US'
    )


### Create Table: `features_example`

In [26]:
%%bigquery
CREATE OR REPLACE TABLE feature_data_architecture.features_example AS
    SELECT 'customer_abc' AS entity_id, 'a string 1' AS feature_1, 123 AS feature_2, CURRENT_DATE() AS feature_3,
        'some words' AS feature_4, 1 AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abd' AS entity_id, 'a string 2' AS feature_1, 124 AS feature_2, CURRENT_DATE() AS feature_3, 
        'some words' AS feature_4, 2 AS feature_5, NULL as feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abe' AS entity_id, 'a string 1' AS feature_1, 121 AS feature_2, CURRENT_DATE() AS feature_3, 
        'some words' AS feature_4, NULL AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
    UNION ALL
    SELECT 'customer_abf' AS entity_id, 'a string 2' AS feature_1, 120 AS feature_2, CURRENT_DATE() AS feature_3,
        NULL AS feature_4, 4 AS feature_5, DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FLOOR(8+100*RAND()) AS INT64) DAY) AS feature_6,
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(FLOOR(10080*RAND()) AS INT64) MINUTE) AS feature_timestamp
;

SELECT * FROM feature_data_architecture.features_example;


| | entity_id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | customer_abf | a string 2 | 120 | 2024-03-11 | | 4.0 | 2023-12-31 | 2024-03-04 07:45:21.070234+00:00 |
| 1 | customer_abe | a string 1 | 121 | 2024-03-11 | some words | | 2024-01-27 | 2024-03-08 02:45:21.070234+00:00 |
| 2 | customer_abc | a string 1 | 123 | 2024-03-11 | some words | 1.0 | 2024-01-12 | 2024-03-06 20:10:21.070234+00:00 |
| 3 | customer_abd | a string 2 | 124 | 2024-03-11 | some words | 2.0 | NaT | 2024-03-08 05:31:21.070234+00:00 |


### Create Table: `features_eav`

In [27]:
%%bigquery
CREATE OR REPLACE TABLE feature_data_architecture.features_eav (
  entity_id STRING,
  feature_name STRING,
  feature_value STRUCT<STRING_VALUE STRING, INT64_VALUE INT64, DATE_VALUE DATE>,
  feature_datatype STRING
);

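One reading of this layout is that `feature_datatype` names which field of the `feature_value` STRUCT carries the value, while the other fields stay NULL. A minimal Python sketch of reading a value back from such a row follows. The helper name `read_feature_value` and the sample row are hypothetical, and the `<TYPE>_VALUE` naming rule is inferred from the table definition above rather than stated by it.

```python
def read_feature_value(row):
    """Return the populated field of feature_value for one EAV row.

    Assumption: the STRUCT field is named '<feature_datatype>_VALUE',
    matching STRING_VALUE, INT64_VALUE, and DATE_VALUE in the table definition.
    """
    field = f"{row['feature_datatype']}_VALUE"
    return row['feature_value'][field]

# Hypothetical row, shaped like a features_eav record returned as a dict.
example_row = {
    'entity_id': 'customer_abc',
    'feature_name': 'feature_2',
    'feature_value': {'STRING_VALUE': None, 'INT64_VALUE': 123, 'DATE_VALUE': None},
    'feature_datatype': 'INT64',
}
value = read_feature_value(example_row)
print(value)
```

Keeping one typed field per datatype lets each value retain its native BigQuery type instead of being cast to a string.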

### Data Source For EAV

In [28]:
%%bigquery features_example
SELECT * FROM feature_data_architecture.features_example;


In [30]:
features_example

| | entity_id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | customer_abf | a string 2 | 120 | 2024-03-11 | | 4.0 | 2023-12-31 | 2024-03-04 07:45:21.070234+00:00 |
| 1 | customer_abe | a string 1 | 121 | 2024-03-11 | some words | | 2024-01-27 | 2024-03-08 02:45:21.070234+00:00 |
| 2 | customer_abc | a string 1 | 123 | 2024-03-11 | some words | 1.0 | 2024-01-12 | 2024-03-06 20:10:21.070234+00:00 |
| 3 | customer_abd | a string 2 | 124 | 2024-03-11 | some words | 2.0 | NaT | 2024-03-08 05:31:21.070234+00:00 |


In [31]:
features_example.to_dict(orient = 'records')

[{'entity_id': 'customer_abf',
  'feature_1': 'a string 2',
  'feature_2': 120,
  'feature_3': datetime.date(2024, 3, 11),
  'feature_4': None,
  'feature_5': 4,
  'feature_6': datetime.date(2023, 12, 31),
  'feature_timestamp': Timestamp('2024-03-04 07:45:21.070234+0000', tz='UTC')},
 {'entity_id': 'customer_abe',
  'feature_1': 'a string 1',
  'feature_2': 121,
  'feature_3': datetime.date(2024, 3, 11),
  'feature_4': 'some words',
  'feature_5': <NA>,
  'feature_6': datetime.date(2024, 1, 27),
  'feature_timestamp': Timestamp('2024-03-08 02:45:21.070234+0000', tz='UTC')},
 {'entity_id': 'customer_abc',
  'feature_1': 'a string 1',
  'feature_2': 123,
  'feature_3': datetime.date(2024, 3, 11),
  'feature_4': 'some words',
  'feature_5': 1,
  'feature_6': datetime.date(2024, 1, 12),
  'feature_timestamp': Timestamp('2024-03-06 20:10:21.070234+0000', tz='UTC')},
 {'entity_id': 'customer_abd',
  'feature_1': 'a string 2',
  'feature_2': 124,
  'feature_3': datetime.date(2024, 3, 11),
  

In [37]:
%%bigquery features_example_schema
SELECT 
 TO_JSON_STRING(
    ARRAY_AGG(STRUCT(
      column_name AS name,
      data_type AS type)
    ORDER BY ordinal_position), TRUE) AS schema
FROM
  feature_data_architecture.INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = 'features_example'


In [38]:
json.loads(features_example_schema['schema'].iloc[0])

[{'name': 'entity_id', 'type': 'STRING'},
 {'name': 'feature_1', 'type': 'STRING'},
 {'name': 'feature_2', 'type': 'INT64'},
 {'name': 'feature_3', 'type': 'DATE'},
 {'name': 'feature_4', 'type': 'STRING'},
 {'name': 'feature_5', 'type': 'INT64'},
 {'name': 'feature_6', 'type': 'DATE'},
 {'name': 'feature_timestamp', 'type': 'TIMESTAMP'}]
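With the row records and the column schema both available as Python objects, one way to reshape the wide `features_example` rows into the `features_eav` layout is sketched below. This is an illustration under stated assumptions, not the notebook's own method: the helper name `to_eav_rows`, the `TYPE_TO_FIELD` mapping, and the choices to skip missing values and to ignore columns such as `feature_timestamp` that have no matching STRUCT field are all hypothetical. The sample inputs are copied by hand from the outputs above, with some columns omitted for brevity.

```python
import datetime

# Hypothetical mapping from BigQuery column types to the fields of the
# feature_value STRUCT defined in features_eav.
TYPE_TO_FIELD = {
    'STRING': 'STRING_VALUE',
    'INT64': 'INT64_VALUE',
    'DATE': 'DATE_VALUE',
}

def to_eav_rows(records, schema, entity_column='entity_id'):
    """Melt wide records into one EAV row per (entity, feature) pair.

    Assumptions: missing values (None here) are skipped rather than stored,
    and columns whose type has no STRUCT field (e.g. TIMESTAMP) are ignored.
    """
    rows = []
    for record in records:
        for column in schema:
            name, datatype = column['name'], column['type']
            if name == entity_column or datatype not in TYPE_TO_FIELD:
                continue
            value = record.get(name)
            if value is None:
                continue
            # Populate only the STRUCT field that matches the datatype.
            feature_value = {field: None for field in TYPE_TO_FIELD.values()}
            feature_value[TYPE_TO_FIELD[datatype]] = value
            rows.append({
                'entity_id': record[entity_column],
                'feature_name': name,
                'feature_value': feature_value,
                'feature_datatype': datatype,
            })
    return rows

# Sample inputs copied from the outputs above (a subset of the columns).
schema = [
    {'name': 'entity_id', 'type': 'STRING'},
    {'name': 'feature_1', 'type': 'STRING'},
    {'name': 'feature_2', 'type': 'INT64'},
    {'name': 'feature_4', 'type': 'STRING'},
    {'name': 'feature_6', 'type': 'DATE'},
    {'name': 'feature_timestamp', 'type': 'TIMESTAMP'},
]
records = [{
    'entity_id': 'customer_abf',
    'feature_1': 'a string 2',
    'feature_2': 120,
    'feature_4': None,
    'feature_6': datetime.date(2023, 12, 31),
}]

eav_rows = to_eav_rows(records, schema)
for row in eav_rows:
    print(row['entity_id'], row['feature_name'], row['feature_datatype'])
```

Records taken directly from the pandas DataFrame can hold `<NA>` or `NaT` instead of `None`, so they would likely need a normalization pass first (for example, a check with `pandas.isna`) before this sketch treats them as missing.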