# Assignment 1: Colab Workflow (GitHub + Pre-commit + Submission Validation)

This notebook teaches the standard workflow used throughout the course:

1. Clone your team repo
2. Install dependencies
3. Install **pre-commit** and enable a hook to strip notebook outputs
4. Run this notebook end-to-end
5. Validate `predictions.csv`
6. Commit + push + tag


In [None]:
# (Colab) show python and system info
import sys, platform
print(sys.version)
print(platform.platform())


## 1) Clone Repo

Fork the course repo: https://github.com/TLKline/AIHC-5010-Winter-2026

Follow the course setup directions for working with a personal access token (PAT) in GitHub. You will need the PAT to push in step 6.

Clone your fork over HTTPS, using its repo URL (e.g., `https://github.com/<your-username>/AIHC-5010-Winter-2026.git`).

In [None]:
# Change this to the HTTPS URL of your own fork on GitHub
repo_path = 'https://github.com/TLKline/AIHC-5010-Winter-2026'
!git clone {repo_path} student_repo

In [None]:
# Move into repo
%cd student_repo

# Repo git info
!git status

# Where are we?
print('----------')
print('We are at:')
!pwd


## 2) Install dependencies

This installs the packages listed in `Project-1/readmit30/requirements.txt`.


In [None]:
!pip -q install -r Project-1/readmit30/requirements.txt

## 3) Enable pre-commit hook to strip notebook outputs

This prevents giant notebooks and reduces merge/diff pain.

One-time per clone:
- `pre-commit install`

After that, every `git commit` will strip outputs from `*.ipynb`.
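
To make "stripping outputs" concrete, here is a minimal sketch of the idea, assuming the standard `.ipynb` JSON layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`). The function name `strip_outputs` and the in-memory example notebook are hypothetical; the hook actually configured in the repo may do more or less than this.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny in-memory notebook for illustration (not a real course file).
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1)"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["1\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][1], indent=2))
```

Only the stored results are dropped; the cell sources are unchanged, so re-running the notebook regenerates everything that was stripped.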


In [None]:
!pip -q install pre-commit
!pre-commit install


#MAINSTART

# 4) Submission Notebook (Template)

Replace the baseline model with your team’s approach.

In [None]:
import os
from pathlib import Path

TRAIN_PATH = os.environ.get("TRAIN_PATH", "Project-1/readmit30/scripts/data/public/train.csv")
DEV_PATH   = os.environ.get("DEV_PATH",   "Project-1/readmit30/scripts/data/public/dev.csv")
TEST_PATH  = os.environ.get("TEST_PATH",  "Project-1/readmit30/scripts/data/public/public_test.csv")
OUT_PATH   = os.environ.get("OUT_PATH",   "predictions.csv")

print("TRAIN_PATH:", TRAIN_PATH)
print("DEV_PATH:", DEV_PATH)
print("TEST_PATH:", TEST_PATH)
print("OUT_PATH:", OUT_PATH)

In [None]:
import numpy as np
import pandas as pd
np.random.seed(42)

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

assert "row_id" in train.columns and "readmit30" in train.columns
assert "row_id" in test.columns

X_train = train.drop(columns=["readmit30"])
y_train = train["readmit30"].astype(int)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

cat_cols = [c for c in X_train.columns if X_train[c].dtype == "object"]
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ],
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=200)),
])

clf.fit(X_train, y_train)
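
`DEV_PATH` is defined above but not used by the baseline. When comparing your approach against the baseline, a held-out score on the dev split can help. The sketch below computes ROC AUC in plain Python. Two assumptions, neither stated in this notebook: that `dev.csv` contains a `readmit30` label column, and that AUC is a reasonable metric for this task. The helper name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise version is O(n_pos * n_neg),
    which is fine for a sanity check but slow on very large files.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores for illustration only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

With the objects in this notebook, the inputs would be the dev labels and `clf.predict_proba(...)[:, 1]` on the dev features. `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.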

In [None]:
p_test = clf.predict_proba(test)[:, 1]
pred = pd.DataFrame({"row_id": test["row_id"].astype(int), "prob_readmit30": p_test.astype(float)})
pred.to_csv(OUT_PATH, index=False)
pred.head()

In [None]:
# Validate output format (required for students before tagging)
!python Project-1/readmit30/scripts/validate_submission.py --pred {OUT_PATH} --test {TEST_PATH}


#MAINEND

## 5) Validate the predictions file format

This checks:
- required columns
- probabilities in [0, 1]
- row_ids match the test file

It assumes the submission notebook (section 4) wrote `predictions.csv` to the repo root, which is the default `OUT_PATH`.
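
The three checks listed above can be sketched with the standard library alone. This is an illustration of the idea, not the contents of `validate_submission.py`, whose actual logic is not shown in this notebook. The column names come from the submission cell; the helper name `check_predictions` and the inline CSV strings are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("row_id", "prob_readmit30")


def check_predictions(pred_rows, test_row_ids):
    """Return a list of problems; an empty list means all three checks passed."""
    if not pred_rows:
        return ["predictions file has no data rows"]

    # Check 1: required columns are present.
    missing = [c for c in REQUIRED_COLUMNS if c not in pred_rows[0]]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []

    # Check 2: each probability parses as a float in [0, 1] (NaN fails the range test).
    for line_no, row in enumerate(pred_rows, start=2):  # line 1 is the header
        try:
            p = float(row["prob_readmit30"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: probability is not a number")
            continue
        if not 0.0 <= p <= 1.0:
            problems.append(f"line {line_no}: probability {p} is outside [0, 1]")

    # Check 3: same row_ids as the test file (compared as strings, order ignored).
    pred_ids = sorted(str(r["row_id"]) for r in pred_rows)
    if pred_ids != sorted(str(i) for i in test_row_ids):
        problems.append("row_ids do not match the test file")

    return problems


# Inline CSV text standing in for predictions.csv and the test file.
pred_csv = "row_id,prob_readmit30\n1,0.20\n2,1.30\n"
test_csv = "row_id,age\n1,70\n2,55\n3,61\n"

pred_rows = list(csv.DictReader(io.StringIO(pred_csv)))
test_ids = [r["row_id"] for r in csv.DictReader(io.StringIO(test_csv))]

problems = check_predictions(pred_rows, test_ids)
for problem in problems:
    print(problem)
```

Here the second prediction is out of range and row 3 is missing, so both problems are reported. Treat the course script as the source of truth for what counts as a valid submission.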


In [None]:
from pathlib import Path

pred_path = Path("predictions.csv")
test_path = Path("Project-1/readmit30/scripts/data/public/public_test.csv")

if not pred_path.exists():
    print(f"{pred_path} not found. Run the submission cells in section 4 (or notebooks/submission.ipynb) first.")
else:
    # Reuse the paths defined above instead of repeating them
    !python Project-1/readmit30/scripts/validate_submission.py --pred {pred_path} --test {test_path}


## 6) Commit + push + tag

You will:
- add changes
- commit (pre-commit hook runs here)
- push
- tag a milestone (example: `milestone_wk3`) and push tags
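
The steps above can be previewed as a dry run before anything is executed. The sketch below only builds and prints the commands. The helper name `release_commands` is hypothetical, the remote is assumed to be named `origin` (the default after `git clone`), and it pushes the single new tag with `git push origin <tag>` rather than pushing every tag.

```python
import shlex


def release_commands(message, tag, tag_message):
    """Build the add / commit / push / tag sequence as argument lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
        ["git", "push"],
        ["git", "tag", "-a", tag, "-m", tag_message],
        ["git", "push", "origin", tag],  # push only this tag
    ]


cmds = release_commands(
    "Assignment 1: workflow + initial submission notebook",
    "milestone_wk3",
    "Milestone week 3",
)
for cmd in cmds:
    print(shlex.join(cmd))  # shell-quoted preview of each command
```

To run the sequence for real, each list can be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing step.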



In [None]:
import getpass, subprocess

# Identity: replace these with your own GitHub username and email
GIT_USER = "TLKline"
GIT_EMAIL = "kline.timothy@mayo.edu"
subprocess.run(["git", "config", "--global", "user.name", GIT_USER], check=True)
subprocess.run(["git", "config", "--global", "user.email", GIT_EMAIL], check=True)

# Use the plain "store" helper: the token is saved unencrypted on the Colab VM,
# so it lasts only for this *runtime* and never touches your local machine
subprocess.run(["git", "config", "--global", "credential.helper", "store"], check=True)

token = getpass.getpass("GitHub PAT: ").strip()

# Approve credentials for github.com
cred_input = f"protocol=https\nhost=github.com\nusername={GIT_USER}\npassword={token}\n\n"
subprocess.run(["git", "credential", "approve"], input=cred_input.encode(), check=True)

# Quick remote check (doesn't modify anything): list the branch heads on origin.
# On a public repo this succeeds even without a valid token; the push below is the real auth test.
subprocess.run(["git", "ls-remote", "--heads", "origin"], check=True)

print("Remote is reachable. Now you can: git push")

# Commit and push
!pre-commit run --all-files
!git add -A
# Hooks that modify files report failure on the first pass, so run again and re-stage the cleaned files
!pre-commit run --all-files
!git add -A
!git commit -m "Assignment 1: workflow + initial submission notebook"
!git push

# Change the tag name for each milestone; reusing an existing tag name will fail
TAG = "checking_workflow_007"
!git tag -a {TAG} -m "Checking workflow 007"
!git push --tags
print("Tagged and pushed:", TAG)

## Done ✅

If you hit issues:
- Make sure you pulled the latest course template; if files are missing, your fork may be out of date.
- Make sure `Project-1/readmit30/scripts/data/public/*` exists in your repo (or that your instructor provided the data separately).
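
A quick way to check for missing files is to test each path this notebook relies on. The sketch below assumes it is run from the repo root (after `%cd student_repo`). The helper name `missing_files` is hypothetical, and the path list is collected from the cells above.

```python
from pathlib import Path


def missing_files(repo_root, relative_paths):
    """Return the subset of relative_paths that are not files under repo_root."""
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).is_file()]


# Paths used elsewhere in this notebook.
REQUIRED = [
    "Project-1/readmit30/requirements.txt",
    "Project-1/readmit30/scripts/validate_submission.py",
    "Project-1/readmit30/scripts/data/public/train.csv",
    "Project-1/readmit30/scripts/data/public/dev.csv",
    "Project-1/readmit30/scripts/data/public/public_test.csv",
]

missing = missing_files(".", REQUIRED)
if missing:
    print("Missing files:")
    for path in missing:
        print("  -", path)
else:
    print("All expected files are present.")
```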
