# KeySense Colab Smoke Test (v0.4.2)

This notebook installs KeySense from GitHub, downloads a sample dataset, and runs a quick scan, with an optional JSON and Great Expectations (GE) export.

**What this does:**
- Installs `keysense` from the `yogiadi/keysense-pyspark` GitHub repository
- Downloads a small NYC Taxi parquet (Jan 2019)
- Runs KeySense via the Python API on a 5% sample, and optionally via the CLI
- Shows `grain_score`, `near_key_gap`, and `uniqueness_ratio` for the top 10 column combinations


In [None]:
# Install keysense directly from GitHub. The URL pins no tag, so pip installs
# the default branch; this notebook was written against v0.4.2.
!pip install --quiet "git+https://github.com/yogiadi/keysense-pyspark.git"

In [None]:
# Download sample NYC Taxi parquet (Jan 2019)
import os, urllib.request
URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet"
FN = "yellow_tripdata_2019-01.parquet"
if not os.path.exists(FN):
    print(f"Downloading {URL}...")
    urllib.request.urlretrieve(URL, FN)
    print("✓ Downloaded", FN)
else:
    print("✓ Already present:", FN)

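A failed or interrupted download can leave a truncated file or an HTML error page saved under the parquet filename, and the `os.path.exists` check above would then skip the re-download. As a minimal sketch, the helper below (`looks_like_parquet` is a hypothetical name, not part of KeySense) checks that a file starts and ends with the 4-byte Parquet magic `PAR1`. It does not validate the footer or the data pages.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration: a cheap sanity check only,
    not a full validation of the file contents.
    """
    p = Path(path)
    if not p.is_file() or p.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with p.open("rb") as fh:
        head = fh.read(len(PARQUET_MAGIC))
        fh.seek(-len(PARQUET_MAGIC), 2)  # position 4 bytes before end of file
        tail = fh.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `looks_like_parquet(FN)` before handing the file to Spark gives an early, readable failure; if it returns `False`, deleting the file and re-running the download cell is usually enough.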

In [None]:
# Quick run via the Python API (fast sample)
from pyspark.sql import SparkSession
from keysense import KeySense

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "64")

df = spark.read.parquet("yellow_tripdata_2019-01.parquet")
wanted = [
    "tpep_pickup_datetime","tpep_dropoff_datetime",
    "PULocationID","DOLocationID","fare_amount","trip_distance"
]
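# Keep only the wanted columns that exist in this file, then take a
# 5% sample without replacement and cache it for the scan below.
cols = [c for c in wanted if c in df.columns]
df_small = df.select(*cols).sample(withReplacement=False, fraction=0.05, seed=42).cache()
df_small.count()  # materialize cache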

ks = KeySense(
    df_small,
    time_col="tpep_pickup_datetime",
    max_combo_len=3,
    sample_fraction=None,
    approx_rsd=0.05,
    min_col_cardinality=50,
    max_null_fraction=0.6,
    time_grain="day",
)
out = ks.evaluate(topk=10)
out.select("combo","combo_len","grain_score","near_key_gap","uniqueness_ratio").show(10, truncate=False)


In [None]:
# Optional: run CLI to emit JSON + GE suite (uncomment to run)
# This takes a couple of minutes on first run due to Spark startup.

# !python -m keysense.profiler \
#   --input yellow_tripdata_2019-01.parquet --format parquet \
#   --time-col tpep_pickup_datetime --time-grain day \
#   --max-combo-len 3 --sample 0.05 --approx-rsd 0.05 \
#   --min-col-card 50 --max-null-frac 0.6 \
#   --weights 0.6,0.25,0.15 --topk 10 \
#   --output ./nyc_cli_out \
#   --emit-json nyc_cli_out.json \
#   --emit-ge nyc_cli_ge


### Notes
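- A `near_key_gap` close to **0.0** means the column combination is unique, or nearly unique, on the scanned rows, which makes it a candidate key.
- For quicker runs, lower the sample fraction (0.05 in this notebook) or `max_combo_len` (3 in this notebook).
- Uncomment the CLI cell to produce a JSON export and a Great Expectations suite.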
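
To make the idea behind a uniqueness ratio and a gap to a true key concrete, here is a small pure-Python sketch that needs no Spark. It assumes the gap is `1 - distinct_rows / total_rows`; that formula, the `uniqueness_report` helper, and the `gap` field are illustrative assumptions, and KeySense's own `near_key_gap` and `grain_score` may be computed differently.

```python
from itertools import combinations


def uniqueness_report(rows, columns, max_combo_len=2):
    """Rank column combinations by how close they are to being a key.

    Assumption for illustration: gap = 1 - distinct / total.
    This is not confirmed to be KeySense's formula.
    """
    total = len(rows)
    report = []
    for k in range(1, max_combo_len + 1):
        for combo in combinations(columns, k):
            # Count distinct value tuples for this combination of columns.
            distinct = len({tuple(row[c] for c in combo) for row in rows})
            ratio = distinct / total if total else 0.0
            report.append(
                {"combo": combo, "uniqueness_ratio": ratio, "gap": 1.0 - ratio}
            )
    # Smallest gap first; prefer shorter combinations on ties.
    return sorted(report, key=lambda r: (r["gap"], len(r["combo"])))


# Toy rows: no single column is unique, but (pickup, zone) is.
rows = [
    {"pickup": "2019-01-01 00:05", "zone": 10, "fare": 7.5},
    {"pickup": "2019-01-01 00:05", "zone": 12, "fare": 7.5},
    {"pickup": "2019-01-01 00:09", "zone": 10, "fare": 9.0},
    {"pickup": "2019-01-01 00:09", "zone": 12, "fare": 7.5},
]
report = uniqueness_report(rows, ["pickup", "zone", "fare"])
best = report[0]
print(best["combo"], best["gap"])
```

In this toy data every single column has a uniqueness ratio of 0.5, while the pair `("pickup", "zone")` distinguishes all four rows and therefore ranks first with a gap of 0.0.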

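One caveat when lowering the sample fraction: sampling tends to separate duplicate rows, so a column can look more unique in a sample than it is in the full data. The sketch below illustrates this with plain Python on synthetic ids. It is not KeySense code, and `uniqueness_ratio` here is a hypothetical local helper.

```python
import random


def uniqueness_ratio(values):
    """Fraction of values that are distinct (1.0 means every value is unique)."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


# 1,000 rows in which every id appears exactly twice, so the column is not a key.
ids = [i // 2 for i in range(1000)]
full_ratio = uniqueness_ratio(ids)

# A 5% sample rarely contains both copies of the same id.
rng = random.Random(42)
sampled = rng.sample(ids, 50)
sampled_ratio = uniqueness_ratio(sampled)

print(f"full data: {full_ratio:.2f}, 5% sample: {sampled_ratio:.2f}")
```

A candidate key found on a small sample is therefore worth re-checking on a larger fraction, or on the full table, before relying on it.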