#### Explanation of materialising and loading the historic C360 data

Run the following command in a terminal to materialise the data as of any required timestamp:

`> pb run --end_time <required_time_in_unix_timestamp>`
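
If the required time is known as a calendar date rather than a unix timestamp, it can be converted first. The sketch below assumes the date is meant in UTC, which matches how this notebook later interprets `end_time`; `to_unix_timestamp` is a hypothetical helper name and is not part of `pb`.

```python
from datetime import datetime, timezone

def to_unix_timestamp(year, month, day, hour=0, minute=0, second=0):
    """Return the unix timestamp (in seconds) for a UTC calendar date and time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())

# 2024-03-04 00:00:00 UTC is the value used for end_time in this notebook.
end_time = to_unix_timestamp(2024, 3, 4)
print(end_time)  # 1709510400
```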

This notebook outlines the following steps to fetch that past materialised data:
1. Connect to the warehouse using ProfilesConnector. The config can be read from the siteconfig file as is, without changing its format.
2. Load the material registry table and keep only the successful runs.
3. From the material registry table, apply the following filters to get the model_hash and seq_no of the right materialisation:
    - `model_type` = 'entity_traits_360'
    - `end_ts` = '<unix time from above, after converting to timestamp - see below>'
4. There may be multiple materialisations with the same end_ts. In such cases, the notebook picks the most recently created one. Together, seq_no and model_hash identify a materialisation.
5. For each materialisation, there are multiple views providing the same C360 data with different identifiers as the primary key (e.g. email, user_main_id, anonymous_id). Set id_type to the identifier you want, and the notebook fetches the matching view from the warehouse using model_hash and seq_no.
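
Steps 3 and 4 can be illustrated on a toy registry. This is a minimal sketch, assuming a DataFrame with the same column names as the material registry; every value in it (hashes, sequence numbers, the second model name and its type) is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the material registry; all values are invented.
registry = pd.DataFrame({
    "model_name": ["user_id_stitched_features"] * 3 + ["some_other_model"],
    "model_type": ["entity_traits_360"] * 3 + ["some_other_type"],
    "model_hash": ["aaaa1111", "bbbb2222", "cccc3333", "dddd4444"],
    "seq_no": [10, 11, 12, 12],
    "end_ts": pd.to_datetime(["2024-03-04", "2024-03-04", "2024-03-05", "2024-03-04"]),
    "creation_ts": pd.to_datetime([
        "2024-03-06 10:00", "2024-03-07 09:00", "2024-03-07 10:00", "2024-03-07 09:00",
    ]),
})

end_time = 1709510400
target = pd.Timestamp(end_time, unit="s")  # naive timestamp, interpreted as UTC

# Step 3: keep only C360 materialisations for the requested end time.
candidates = registry[
    (registry["model_type"] == "entity_traits_360") & (registry["end_ts"] == target)
]

# Step 4: of those, take the most recently created one.
latest = candidates.sort_values("creation_ts", ascending=False).iloc[0]
print(latest["model_hash"], latest["seq_no"])  # bbbb2222 11
```

Two rows match the filters here, and the one created later (`seq_no` 11) is selected.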



In [1]:
# Params for fetching the past data:
# 1. end_time: past materialisation time as a unix timestamp (the one used in the pb run command)
# 2. id_type: id type of the C360 view. Ignored if serve_trait_name is given.
# 3. serve_trait_name: name of the serve trait (optionally defined in pb_project.yaml, e.g. user_id_stitched_features). If set to None, it is generated from the id type and entity name as <id_type>_<entity_name>_default_entity_serve_360
# 4. entity_name: entity name of the profiles project, defined in pb_project.yaml
# 5. connection_name: connection name of the profiles project, mentioned in pb_project.yaml
end_time = 1709510400
id_type = 'user_id'
serve_trait_name = "user_id_stitched_features"
entity_name = 'user'
connection_name = 'default'

In [2]:
from profiles_rudderstack.wh import ProfilesConnector
import yaml
import pandas as pd
import os
import numpy as np
import datetime

In [3]:
homedir = os.path.expanduser("~") 
with open(os.path.join(homedir, ".pb","siteconfig.yaml"), "r") as f:
    creds = yaml.safe_load(f)["connections"][connection_name]["outputs"]["dev"]
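
A missing connection or output name in the lookup above surfaces as a bare `KeyError`. The sketch below shows one way to make that failure easier to read. It assumes the same layout as the cell above (`connections` → connection name → `outputs` → output name); `get_output_creds` and the sample values are hypothetical.

```python
def get_output_creds(siteconfig, connection_name, output_name="dev"):
    """Return the credentials dict for one output of one connection.

    Assumes the layout connections -> <connection> -> outputs -> <output>.
    """
    connections = siteconfig.get("connections", {})
    if connection_name not in connections:
        raise KeyError(
            f"connection {connection_name!r} not found; available: {sorted(connections)}"
        )
    outputs = connections[connection_name].get("outputs", {})
    if output_name not in outputs:
        raise KeyError(
            f"output {output_name!r} not found; available: {sorted(outputs)}"
        )
    return outputs[output_name]

# Hypothetical parsed siteconfig with placeholder values.
sample = {
    "connections": {
        "default": {"outputs": {"dev": {"type": "redshift", "schema": "my_schema"}}}
    }
}
sample_creds = get_output_creds(sample, "default")
print(sample_creds["schema"])  # my_schema
```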

In [4]:
for k, v in creds.items():
    print(f"{k}: <{k if k != 'type' else 'redshift|bigquery|snowflake etc'}>")

dbname: <dbname>
host: <host>
password: <password>
port: <port>
schema: <schema>
type: <redshift|bigquery|snowflake etc>
user: <user>


In [5]:
connector = ProfilesConnector(config=creds)

In [6]:
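import ast
import json

material_registry = connector.run_query(f"select * from {creds['schema']}.material_registry_4")

def safe_parse_json(entry):
    """Return metadata["complete"]["status"], or None if it cannot be read."""
    try:
        if isinstance(entry, str):
            try:
                entry_dict = json.loads(entry)
            except json.JSONDecodeError:
                # Fall back to Python-literal syntax without executing any code
                entry_dict = ast.literal_eval(entry)
        elif isinstance(entry, dict):
            entry_dict = entry
        else:
            return None
        return entry_dict.get("complete", {}).get("status")
    except (ValueError, SyntaxError, AttributeError, TypeError):
        return None

material_registry["status"] = material_registry["metadata"].apply(safe_parse_json)
# Keep only successful runs (status 2); copy so later column assignments do not warn
material_registry = material_registry.query("status==2").copy()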
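
To see what the status extraction in the cell above does with different inputs, here is a small standalone sketch. It assumes the metadata is either a JSON string or an already parsed dict with a nested `complete.status` field, as the cell implies. `extract_status` is a hypothetical name, and the sample entries (including the status value 3) are made up.

```python
import json

def extract_status(entry):
    """Return entry["complete"]["status"], or None when the entry is unusable."""
    if isinstance(entry, str):
        try:
            entry = json.loads(entry)
        except json.JSONDecodeError:
            return None
    if not isinstance(entry, dict):
        return None
    complete = entry.get("complete")
    if not isinstance(complete, dict):
        return None
    return complete.get("status")

samples = [
    '{"complete": {"status": 2}}',  # JSON string
    {"complete": {"status": 3}},    # already a dict
    "not json",                     # unparseable string
    None,                           # missing metadata
    "{}",                           # valid JSON without a "complete" key
]
print([extract_status(s) for s in samples])  # [2, 3, None, None, None]
```

Only the first sample would pass the `status==2` filter used in this notebook.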

In [7]:

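# Convert the unix timestamp to a naive UTC datetime, then to numpy datetime64 for comparison
dt = datetime.datetime.fromtimestamp(end_time, tz=datetime.timezone.utc).replace(tzinfo=None)
dt = np.datetime64(dt)
# Expose end_ts as numpy datetime64 values so it can be matched against dt
material_registry["end_ts_np"] = material_registry["end_ts"].apply(lambda x: x.to_numpy())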
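
The conversion above can be checked in isolation. This sketch repeats it for the `end_time` used in this notebook and compares the result with the expected UTC date. Passing `tz=datetime.timezone.utc` matters because `fromtimestamp` without a timezone uses the machine's local time. The tzinfo is then dropped because `np.datetime64` does not carry a timezone.

```python
import datetime
import numpy as np

end_time = 1709510400

# Interpret the unix timestamp as UTC, then drop tzinfo to get a naive UTC datetime.
aware = datetime.datetime.fromtimestamp(end_time, tz=datetime.timezone.utc)
naive_utc = aware.replace(tzinfo=None)
dt = np.datetime64(naive_utc)

# The timestamp corresponds to midnight UTC on 4 March 2024.
print(dt == np.datetime64("2024-03-04T00:00:00"))  # True
```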

In [8]:
shortlisted_df = (material_registry
                  .query("end_ts_np==@dt and model_type=='entity_traits_360'")
                  .filter(["model_name", "model_hash", "seq_no", "begin_ts", "end_ts", "creation_ts"])
                  # Newest first, then keep one row per model: the most recently created materialisation
                  .sort_values(by='creation_ts', ascending=False)
                  .drop_duplicates(subset="model_name", keep="first")
                  .filter(["model_name", "seq_no", "model_hash"])
                  .reset_index(drop=True))
shortlisted_df["model_name"] = shortlisted_df["model_name"].str.lower()

In [11]:
if serve_trait_name is None:
    serve_trait_name = f"{id_type}_{entity_name}_default_entity_serve_360"

In [16]:
# Get the relevant id type view details.
# model_name was lowercased above, so lowercase the serve trait name before matching.

model_details = shortlisted_df.query(f"model_name=='{serve_trait_name.lower()}'").to_dict(orient='records')[0]
model_details

{'model_name': 'user_id_stitched_features',
 'seq_no': 839,
 'model_hash': '25e91404'}

In [17]:
material_name = f"material_{model_details['model_name']}_{model_details['model_hash']}_{model_details['seq_no']}"
material_name

'material_user_id_stitched_features_25e91404_839'
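
The two naming rules used in this notebook (the default serve trait name and the material table name) can be written as small helpers. This is a sketch that mirrors the f-strings in the cells above; the function names are hypothetical.

```python
def default_serve_trait_name(id_type, entity_name):
    """Default serve trait name when none is defined in pb_project.yaml."""
    return f"{id_type}_{entity_name}_default_entity_serve_360"

def material_table_name(model_name, model_hash, seq_no):
    """Warehouse table or view name for one materialisation of a model."""
    return f"material_{model_name}_{model_hash}_{seq_no}"

print(default_serve_trait_name("user_id", "user"))
# user_id_user_default_entity_serve_360
print(material_table_name("user_id_stitched_features", "25e91404", 839))
# material_user_id_stitched_features_25e91404_839
```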

In [18]:
material_data = connector.run_query(f"select * from {creds['schema']}.{material_name}")
material_data.head()

                   user_id  days_active  user_lifespan
0     identified user id 2            1              0
1     identified user id 3            1              0
2    identified user id 23            1              0
3  identified user id 2323            1              0
4       identified user id            1              0
