%md
## Notes on tables in the schema
- `all_model_inputs` (same table as `baseline_feature_inputs`)
  - All new-tag model input data. There are multiple observations per GUID/theme or GUID/dept combination because the table is compiled from the weekly tables; each week includes the GUIDs that were active that week.
- `model_input_agg`
  - All new-tag model input data, aggregated so that open/click rates are calculated per GUID over the whole period rather than per week.

- `dc-vb_old_tags_model_inputs` 
  - Built in the same format as `all_model_inputs`, but for old tags. Only includes one month of data (Jan-Feb 2025). Good for developing the EDA pipeline, but any insights should come from the following table, `old_tags_model_inputs_6mo`.

- `old_tags_model_inputs_6mo`
  - Built in the same format as `all_model_inputs`, but for old tags. Includes the 6 months of interest. **Extremely useful.** Must be joined with active GUIDs to get EHHNs.

- `dc_vb_new_tags_ehhn`
  - Ignore. Same as `all_model_inputs`, but joined with the active GUID-EHHN mapping, so it includes GUIDs/EHHNs active in the last 6 months. Still has multiple click-rate observations per GUID (one per week). Use `model_inputs_agg_6mo` instead.

- `model_inputs_raw`
  - Same as `all_model_inputs`, but only for the 6 months of interest. The multiple-click-rates-per-week issue still applies.

- `model_inputs_agg_6mo`
  - New-tag model input data for the 6 months of interest, with one set of rates per GUID (no weekly duplicates). **Extremely useful.**

- `funlo_rollup_agg`, `funlo_seg_agg`, `segs_and_rates`, `price_seg_agg`, `health_seg_agg`, `quality_seg_agg`, `convenience_seg_agg`, `variety_seg_agg` 
  - Segmentation-based aggregations of click and open rates, all built on new tags. Useful for checking whether interaction rates differ meaningfully between common segmentations.

- `old_tags_6mo_ehhns` 
  - Same as `old_tags_model_inputs_6mo`, joined with the GUIDs active during the 6 months.
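
The difference between the weekly tables and the aggregated ones can be illustrated with a small plain-Python sketch. The column names (`sends`, `clicks`) and the pooling rule (total clicks over total sends) are assumptions for illustration only; these notes do not state how `model_input_agg` actually computes its rates.

```python
from collections import defaultdict

# Hypothetical weekly rows: (guid, theme, sends, clicks), one row per active week
weekly_rows = [
    ("A", "grilling", 10, 1),
    ("A", "grilling", 30, 9),
    ("B", "grilling", 20, 5),
]

def aggregate_rates(rows):
    """Collapse weekly rows into one click rate per (guid, theme)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sends, clicks]
    for guid, theme, sends, clicks in rows:
        totals[(guid, theme)][0] += sends
        totals[(guid, theme)][1] += clicks
    # Pooled rate: total clicks / total sends, so busier weeks carry more weight
    return {key: clicks / sends for key, (sends, clicks) in totals.items() if sends > 0}

agg = aggregate_rates(weekly_rows)
print(agg)
```

In this made-up data, averaging the two weekly rates for GUID `A` (0.1 and 0.3) would give 0.2, while pooling gives 0.25, which is one reason the weekly tables should not be averaged naively.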

In [0]:
from datetime import datetime, timedelta
import pyspark.sql.functions as f
from pyspark.sql import Window
from pyspark.sql import Row
# Start and end dates
start_date = datetime.strptime('20240823', '%Y%m%d')
end_date = datetime.strptime('20250214', '%Y%m%d')

# List to store the dates
friday_dates = []

# Generate dates
current_date = start_date
while current_date <= end_date:
    friday_dates.append(current_date.strftime('%Y%m%d'))
    current_date += timedelta(weeks=1)

# Display the list of dates
display(friday_dates)
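
In [0]:
%md
The loop above relies on `start_date` being a Friday and on `end_date` falling a whole number of weeks later. A minimal sketch of a helper that makes both assumptions explicit; `weekly_dates` is a hypothetical name, not part of the original notebook.

```python
from datetime import datetime, timedelta

def weekly_dates(start, end, fmt="%Y%m%d"):
    """Return weekly date strings from start to end (inclusive), 7 days apart."""
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    # weekday() == 4 is Friday (Monday is 0)
    if start_dt.weekday() != 4:
        raise ValueError(f"{start} is not a Friday")
    if (end_dt - start_dt).days % 7 != 0:
        raise ValueError(f"{end} is not a whole number of weeks after {start}")
    n_weeks = (end_dt - start_dt).days // 7
    return [(start_dt + timedelta(weeks=i)).strftime(fmt) for i in range(n_weeks + 1)]

dates = weekly_dates("20240823", "20250214")
print(len(dates), dates[0], dates[-1])
```

For the range used here this yields 26 weekly snapshots, from `20240823` through `20250214`.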

In [0]:

# Read each weekly partition, tag it with its snapshot date, and stack the results
base_path = "/Volumes/personalization_dev/atc_decisioning/raupp_atc/content_tags_test_1/input_features"
model_inputs = None
for a in friday_dates:
  print(a)
  files = spark.read.parquet(f"{base_path}/created_at={a}/*.snappy.parquet")
  files = files.withColumn("created_at", f.lit(a))
  # unionByName matches columns by name, so a change in column order between weeks cannot misalign data
  model_inputs = files if model_inputs is None else model_inputs.unionByName(files)
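
In [0]:
%md
An alternative to the loop-and-union pattern is to hand Spark all partition directories in a single read. This is a sketch, assuming the directories follow the `created_at=YYYYMMDD` layout used above; `partition_paths` and `read_partitions` are hypothetical helper names, and the path in the usage line is a placeholder. With the `basePath` option, Spark's partition discovery adds `created_at` as a column on its own, but it may infer the column as an integer, hence the cast back to string.

```python
def partition_paths(base_path, dates):
    """Build one directory path per weekly snapshot."""
    return [f"{base_path.rstrip('/')}/created_at={d}" for d in dates]

def read_partitions(spark, base_path, dates):
    """Read all weekly partitions in one call (needs an active SparkSession)."""
    df = (
        spark.read
        .option("basePath", base_path)  # lets Spark recover created_at from the directory name
        .parquet(*partition_paths(base_path, dates))
    )
    # Partition discovery may type created_at as an integer; keep it a string like the loop does
    return df.withColumn("created_at", df["created_at"].cast("string"))

# Placeholder base path, for illustration only
paths = partition_paths("/Volumes/some_catalog/some_schema/input_features", ["20240823", "20240830"])
print(paths)
```

In the notebook this would be called as `read_partitions(spark, base_path, friday_dates)`. It reads every Parquet file in each directory, which matches the `*.snappy.parquet` glob only if the directories hold nothing else.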



In [0]:
model_inputs.limit(50).display()

In [0]:
model_inputs.write.mode("overwrite").saveAsTable("sandbox_dev.tm_learning.model_inputs_raw")

In [0]:
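# Read the table back to confirm the write succeeded
raw_check = spark.read.table("sandbox_dev.tm_learning.model_inputs_raw")

In [0]:
# Every date in friday_dates should appear here
raw_check.select("created_at").distinct().display()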

In [0]:
files = spark.read.parquet("/Volumes/personalization_dev/atc_decisioning/raupp_atc/content_tags_test_1/input_features/created_at=20240823/*.snappy.parquet")
files = files.withColumn("created_at", f.lit("20240823"))
files.limit(1).display()

In [0]:
# Re-read the first and last weekly partitions to sanity-check the union
files = spark.read.parquet("/Volumes/personalization_dev/atc_decisioning/raupp_atc/content_tags_test_1/input_features/created_at=20240823/*.snappy.parquet")
files = files.withColumn("created_at", f.lit("20240823"))

In [0]:
files2 = spark.read.parquet("/Volumes/personalization_dev/atc_decisioning/raupp_atc/content_tags_test_1/input_features/created_at=20250214/*.snappy.parquet")
# Tag the last-week rows with their own snapshot date so the filter below can find them
files2 = files2.withColumn("created_at", f.lit("20250214"))
files3 = files.union(files2)
files3.filter(f.col("created_at") == "20250214").limit(1).display()

In [0]:
files.limit(1).display()

In [0]:
path = "personalization_dev.atc.official_ehhn_model_tracker"
tracker = spark.read.format("delta").table(path)

In [0]:
tracker.limit(5).display()
tracker = tracker.filter((f.col("start_date") == "2024-08-02") & (f.col("decisioning_path") == "bau_min_score_0"))


In [0]:
tracker.select("end_date").distinct().display()

In [0]:
tracker2 = tracker.withColumn(
    "test_group",
    f.when(f.col("end_date") == "2024-09-05", 0)
     .when(f.col("end_date") == "2024-10-24", 1)
     .when(f.col("end_date") == "2025-01-30", 2)
)
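
In [0]:
%md
The `when` chain above has no `.otherwise(...)`, so any `end_date` outside the three listed values gets a null `test_group`. The same mapping in plain Python, as a sketch of that behavior; `END_DATE_TO_GROUP`, `group_for_end_date`, and `unmapped` are illustrative names, and `None` stands in for Spark's null.

```python
# Mapping taken from the when-chain above
END_DATE_TO_GROUP = {
    "2024-09-05": 0,
    "2024-10-24": 1,
    "2025-01-30": 2,
}

def group_for_end_date(end_date):
    """Return the test group for an end_date, or None if it is not mapped."""
    return END_DATE_TO_GROUP.get(end_date)

def unmapped(end_dates):
    """List the distinct end_dates that would end up with a null test_group."""
    return sorted({d for d in end_dates if d not in END_DATE_TO_GROUP})

print(group_for_end_date("2024-10-24"))
# The second date is made up to show an unmapped value
print(unmapped(["2024-09-05", "2024-12-01", "2024-12-01"]))
```

In the notebook, `tracker2.filter(f.col("test_group").isNull()).select("end_date").distinct()` would surface the same unmapped values.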

In [0]:
tracker2.select("test_group").distinct().display()

In [0]:
tracker.groupBy("end_date").agg(f.count("end_date").alias("count")).orderBy("end_date").display()

In [0]:
tracker2.groupBy("test_group").agg(f.count("test_group").alias("count")).orderBy("test_group").display()

In [0]:
# Distinct household groups present in the filtered tracker
hhgroups = tracker.select("hhgroup").distinct()

In [0]:
hhgroups.display()

In [0]:
model_inputs = spark.read.table("sandbox_dev.tm_learning.model_input_agg_6mo")

In [0]:
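# distinct() keeps one row per (ehhn, hhgroup) so the join does not multiply model-input rows
test = model_inputs.join(tracker.select("hhgroup", "ehhn").distinct(), on="ehhn", how="inner")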

In [0]:
test.write.mode("overwrite").saveAsTable("sandbox_dev.tm_learning.model_inputs_6mo_hhgroup")

In [0]:
test.display()

In [0]:
test.groupBy("guid", "hhgroup").agg(f.count("hhgroup").alias("both")).orderBy("guid").display()
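
In [0]:
%md
The grouped count above helps show whether a GUID lands in more than one `hhgroup`. A plain-Python sketch of that check on (guid, hhgroup) pairs; the function name and the sample pairs are made up for illustration.

```python
from collections import defaultdict

def guids_with_multiple_groups(pairs):
    """Return {guid: sorted hhgroups} for GUIDs seen with more than one hhgroup."""
    groups = defaultdict(set)
    for guid, hhgroup in pairs:
        groups[guid].add(hhgroup)
    return {g: sorted(hs) for g, hs in groups.items() if len(hs) > 1}

# Made-up sample: g1 appears in two groups, g2 in one
sample = [("g1", "control"), ("g1", "test"), ("g2", "test"), ("g2", "test")]
conflicts = guids_with_multiple_groups(sample)
print(conflicts)
```

A Spark version of the same idea would be `test.groupBy("guid").agg(f.countDistinct("hhgroup").alias("n_groups")).filter("n_groups > 1")`.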

In [0]:
test.filter(f.col("guid") == "00007C628B9A4988912E99EA5803D307").display()

In [0]:
print(test.select("guid").distinct().count())
print(model_inputs.select("guid").distinct().count())
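
In [0]:
%md
The two counts above show how many distinct GUIDs survive the inner join with the tracker. A small sketch for turning them into a coverage figure and for listing the GUIDs that dropped out; `join_coverage` and `dropped_guids` are hypothetical helpers, and the counts in the usage line are made up.

```python
def join_coverage(n_joined, n_total):
    """Share of distinct GUIDs that survive the inner join."""
    if n_total == 0:
        raise ValueError("n_total must be positive")
    return n_joined / n_total

def dropped_guids(before, after):
    """GUIDs in `before` with no match in `after` (both Spark DataFrames with a guid column)."""
    return (
        before.select("guid").distinct()
        # left_anti keeps only the left-side rows that have no match on the right
        .join(after.select("guid").distinct(), on="guid", how="left_anti")
    )

# Made-up counts for illustration
print(f"{join_coverage(900, 1000):.1%}")
```

In the notebook, `dropped_guids(model_inputs, test).count()` should equal the difference between the two printed counts.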

In [0]:

old_model_inputs = spark.read.table("sandbox_dev.tm_learning.old_tags_6mo_ehhns")

In [0]:
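# The right side must carry the join key (guid); distinct() keeps one row per (guid, hhgroup)
test2 = old_model_inputs.withColumnRenamed("GUID", "guid").join(test.select("guid", "hhgroup").distinct(), on="guid", how="inner")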

In [0]:
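# Written to its own table so the new-tag table saved from `test` is not overwritten; the table name here is a suggestion
test2.write.mode("overwrite").saveAsTable("sandbox_dev.tm_learning.old_tags_6mo_hhgroup")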

In [0]:
print(test2.select("guid").distinct().count())
print(old_model_inputs.select("guid").distinct().count())

In [0]:
test2.groupBy("guid", "hhgroup").agg(f.count("hhgroup").alias("both")).orderBy("guid").display()

In [0]:
overlap = model_inputs.join(old_model_inputs, on = "guid",how = "inner")