
# 00a — Environment & Targets Diagnostics

Use this helper **once** to:
1. Ensure a Parquet engine (pyarrow/fastparquet) is available.
2. Verify your `TARGETS_PATH` glob matches files.
3. Detect likely **target columns** in your targets Parquet files when they aren't named `target_*`.

> Run each cell in order. If you prefer to install manually, skip the install cell and run `pip install pyarrow` (or `pip install fastparquet`) in your environment instead.


In [1]:

# ---- Configure your targets path here (glob OK) ----
TARGETS_PATH = "/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=*/part-*.parquet"


In [2]:

# ---- 1) Ensure Parquet engine ----
import sys, importlib, subprocess

def ensure(pkg):
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg} already installed")
        return True
    except Exception:
        print(f"Missing: {pkg}")
        return False

need_pyarrow = not ensure("pyarrow")
need_fastpq = not ensure("fastparquet")

if need_pyarrow and need_fastpq:
    print("\nInstalling pyarrow (you can cancel and install manually in your venv if you prefer)...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip", "setuptools", "wheel"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pyarrow"])
    importlib.invalidate_caches()  # let the import system see the freshly installed package
    import pyarrow  # noqa: F401
    print("pyarrow installed. You can also install fastparquet optionally: pip install fastparquet")
else:
    print("Parquet engine already available.")


Missing: pyarrow
Missing: fastparquet

Installing pyarrow (you can cancel and install manually in your venv if you prefer)...
Collecting pip
  Using cached pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Collecting wheel
  Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
Using cached pip-25.2-py3-none-any.whl (1.8 MB)
Using cached wheel-0.45.1-py3-none-any.whl (72 kB)
Installing collected packages: wheel, pip
  Attempting uninstall: pip
    Found existing installation: pip 24.3.1
    Uninstalling pip-24.3.1:
      Successfully uninstalled pip-24.3.1
Successfully installed pip-25.2 wheel-0.45.1
Collecting pyarrow
  Using cached pyarrow-21.0.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (3.3 kB)
Using cached pyarrow-21.0.0-cp312-cp312-macosx_12_0_arm64.whl (31.2 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-21.0.0
pyarrow installed. You can also install fastparquet optionally: pip install fastparquet
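
To check which engines can be located without importing them or installing anything, a minimal sketch using the standard library's `importlib.util.find_spec` might look like the following. The helper name `available_parquet_engines` is hypothetical and not part of this notebook.

```python
import importlib.util

def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate packages that can be located, without importing them."""
    found = []
    for name in candidates:
        # find_spec returns None when a top-level package cannot be located
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found

print("Importable Parquet engines:", available_parquet_engines())
```

An empty list means neither engine is visible to the current interpreter, which is the case where the install cell above runs `pip install pyarrow`.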


In [3]:

# ---- 2) List matched files ----
import glob, os
paths = sorted(glob.glob(TARGETS_PATH))
print("Matched files:", len(paths))
print("\n".join(paths[:10]))
if len(paths) == 0:
    raise FileNotFoundError(f"No files matched TARGETS_PATH: {TARGETS_PATH}")


Matched files: 435
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00000-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00001-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00002-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00003-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00004-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00005-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00006-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
/Users/tree/Pro
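
Because the glob spans `fold=*` directories, it can help to confirm that every fold contributed files, not just that the total is non-zero. The sketch below counts matched paths per fold. `count_by_fold` is a hypothetical helper, and the example paths are made up for illustration.

```python
import re
from collections import Counter

def count_by_fold(paths):
    """Count files per `fold=<value>` path component; paths without one go under None."""
    counts = Counter()
    for p in paths:
        m = re.search(r"fold=([^/\\]+)", p)
        counts[m.group(1) if m else None] += 1
    return dict(counts)

# Hypothetical paths, for illustration only
example = [
    "targets/fold=0/part-00000.parquet",
    "targets/fold=0/part-00001.parquet",
    "targets/fold=1/part-00000.parquet",
]
print(count_by_fold(example))  # {'0': 2, '1': 1}
```

In this notebook the call would be `count_by_fold(paths)`, using the list built in the cell above.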

In [4]:

# ---- 3) Peek columns and sniff target columns ----
import pandas as pd

sample = pd.read_parquet(paths[0])
print("Sample file:", paths[0])
print("Columns:", list(sample.columns))
print(sample.head(3))

# Heuristic: columns named like target_* OR binary columns whose non-null values are a subset of {0, 1}
cand = []
for c in sample.columns:
    if str(c).startswith("target_"):
        cand.append(c)
    else:
        try:
            u = pd.Series(sample[c].dropna().unique())
            # Compare raw values: casting to int would truncate floats such as 0.5 to 0
            if 0 < len(u) <= 2 and set(u.tolist()).issubset({0, 1}):
                cand.append(c)
        except Exception:
            pass  # unhashable or otherwise incomparable values: not a candidate

print("\nCandidate target columns:", cand[:20])


Sample file: /Users/tree/Projects/recommemdation_bank/data/mbd_mini/targets/fold=0/part-00000-44ca8b70-9d42-48f7-9dec-0a7a012af308.c000.snappy.parquet
Columns: ['client_id', 'mon', 'target_1', 'target_2', 'target_3', 'target_4', 'trans_count', 'diff_trans_date']
                                           client_id         mon  target_1  \
0  00bd0ecf3d5a33aa8756097967d07797dca4c98de9b61c...  2022-02-28         0   
1  00bd0ecf3d5a33aa8756097967d07797dca4c98de9b61c...  2022-03-31         0   
2  00bd0ecf3d5a33aa8756097967d07797dca4c98de9b61c...  2022-04-30         0   

   target_2  target_3  target_4  trans_count  diff_trans_date  
0         0         0         0           10              0.0  
1         0         0         0           29              0.0  
2         0         0         0           51              0.0  

Candidate target columns: ['target_1', 'target_2', 'target_3', 'target_4']
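
The binary-column part of the heuristic can also be written as a small standalone function, which makes it easier to test on edge cases. This is a minimal pure-Python sketch, not the notebook's own code: `is_binary` is a hypothetical helper, and it assumes nulls appear as `None` or `NaN`. It compares raw values instead of casting them to `int`, so a value like 0.5 is not mistaken for 0.

```python
import math

def is_binary(values):
    """True if the non-null values are non-empty and every one equals 0 or 1."""
    seen = set()
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue  # skip nulls
        seen.add(v)
        if len(seen) > 2:
            return False  # more than two distinct values cannot be binary
    return bool(seen) and seen.issubset({0, 1})

print(is_binary([0, 1, 1, 0]))       # True
print(is_binary([0.0, 0.5, 1.0]))    # False
print(is_binary([None, None]))       # False
```

Applied to the sample frame, candidates would be `[c for c in sample.columns if is_binary(sample[c].tolist())]`.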
