# Assignment 1 — Colab Workflow (GitLab + Pre-commit + Submission Validation)

This notebook teaches the standard workflow used throughout the course:

1. Clone your team repo
2. Install dependencies
3. Install **pre-commit** and enable a hook to strip notebook outputs
4. Create your submission notebook from the template (first time only)
5. Run the submission notebook end-to-end
6. Validate `predictions.csv`
7. Commit + push + tag


In [45]:
# (Colab) show python and system info
import sys, platform
print(sys.version)
print(platform.platform())


3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Linux-6.6.105+-x86_64-with-glibc2.35


## 1) Clone Repo

Clone over HTTPS. Set `repo_path` in the cell below to your repo's HTTPS URL (e.g., `https://gitlab.example.edu/course/team-a.git`).

In [46]:
repo_path = 'https://github.com/TLKline/AIHC-5010-Winter-2026'
!git clone {repo_path} student_repo

Cloning into 'student_repo'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 53 (delta 7), reused 51 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (53/53), 5.69 MiB | 10.17 MiB/s, done.
Resolving deltas: 100% (7/7), done.


In [47]:
# Move into repo
%cd student_repo

# Repo git info
!git status

# Where are we?
print('----------')
print('We are at:')
!pwd


/content/student_repo/student_repo/student_repo/student_repo
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
----------
We are at:
/content/student_repo/student_repo/student_repo/student_repo
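The working directory printed above is nested four levels deep (`/content/student_repo/student_repo/...`), which is what you would expect if the clone and `%cd` cells were re-run in the same Colab runtime. Below is a minimal sketch of a guard that makes the clone step safe to re-run. The helper name `ensure_clone`, its `run` parameter, and the absolute destination `/content/student_repo` are illustrative assumptions, not part of the course repo.

```python
import subprocess
from pathlib import Path


def ensure_clone(repo_url: str, dest: Path, run=subprocess.run) -> bool:
    """Clone repo_url into dest unless dest already looks like a git checkout.

    Returns True if a clone was performed, False if it was skipped.
    `run` is injectable (a hypothetical convenience) so the logic can be
    exercised without network access.
    """
    if (dest / ".git").exists():
        return False  # already cloned; do nothing
    run(["git", "clone", repo_url, str(dest)], check=True)
    return True


# Example usage in Colab (assumed absolute path, so re-runs never nest):
#   dest = Path("/content/student_repo")
#   ensure_clone(repo_path, dest)
#   %cd {dest}
```

Using an absolute destination for both the clone and the `%cd` means the cells land in the same directory no matter how many times they are executed.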


## 2) Install dependencies

This installs the packages listed in `Project-1/readmit30/requirements.txt`.


In [48]:
!pip -q install -r Project-1/readmit30/requirements.txt

## 3) Enable pre-commit hook to strip notebook outputs

Stripping outputs keeps committed notebooks small and reduces merge/diff pain.

One-time per clone:
- `pre-commit install`

After that, every `git commit` will strip outputs from `*.ipynb`.
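To make concrete what "stripping outputs" means: a notebook is a JSON file, and each code cell carries an `outputs` list and an `execution_count`. The sketch below clears both on an in-memory notebook dict. It is an illustration only, not the hook's actual implementation (the real tool handles more details), and `strip_outputs` and the `demo` notebook are hypothetical.

```python
import copy
import json


def strip_outputs(nb: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, tables, images
            cell["execution_count"] = None  # drop the In [N] counter
    return cleaned


# Tiny made-up notebook for demonstration
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1)"],
            "metadata": {},
            "execution_count": 7,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["1\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(demo)
print(json.dumps(stripped["cells"][1], indent=2))
```

Only the source of each cell survives, which is why diffs of stripped notebooks show code changes rather than changed outputs.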


In [49]:
!pip -q install pre-commit
!pre-commit install


pre-commit installed at .git/hooks/pre-commit


## 4) Create your submission notebook from the template (first time only)

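If your repo already has `Project-1/readmit30/notebooks/submission_<team_name>.ipynb`, skip this. The cell below also checks for the file and leaves an existing one untouched.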


In [50]:
from pathlib import Path

# Set your team name (used in the submission notebook's filename)
team_name = "team0"

template = Path("Project-1/readmit30/notebooks/submission_template.ipynb")
target = Path("Project-1/readmit30/notebooks", f"submission_{team_name}.ipynb")

if target.exists():
    print("submission already exists ✅")
else:
    if not template.exists():
        print(f"Template not found at {template}")
        print("Ask the instructor or pull latest course template.")
    else:
        target.write_bytes(template.read_bytes())
        print("Created submission from template. ✅")

submission already exists ✅


## 5) Run the submission notebook end-to-end (local)

In Colab, open your submission notebook (under `Project-1/readmit30/notebooks/`) from the file browser on the left.

Then copy its contents into the cells below:


In [51]:
# OPTIONAL: open notebook in Colab's notebook UI (click in file browser on left):
# notebooks/submission.ipynb


# Submission Notebook (Template)

Replace the baseline model with your team’s approach.

In [52]:
import os
from pathlib import Path

TRAIN_PATH = os.environ.get("TRAIN_PATH", "Project-1/readmit30/scripts/data/public/train.csv")
DEV_PATH   = os.environ.get("DEV_PATH",   "Project-1/readmit30/scripts/data/public/dev.csv")
TEST_PATH  = os.environ.get("TEST_PATH",  "Project-1/readmit30/scripts/data/public/public_test.csv")
OUT_PATH   = os.environ.get("OUT_PATH",   "predictions.csv")

print("TRAIN_PATH:", TRAIN_PATH)
print("DEV_PATH:", DEV_PATH)
print("TEST_PATH:", TEST_PATH)
print("OUT_PATH:", OUT_PATH)

TRAIN_PATH: Project-1/readmit30/scripts/data/public/train.csv
DEV_PATH: Project-1/readmit30/scripts/data/public/dev.csv
TEST_PATH: Project-1/readmit30/scripts/data/public/public_test.csv
OUT_PATH: predictions.csv


In [53]:
import numpy as np
import pandas as pd
np.random.seed(1337)

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

assert "row_id" in train.columns and "readmit30" in train.columns
assert "row_id" in test.columns

X_train = train.drop(columns=["readmit30"])
y_train = train["readmit30"].astype(int)

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

cat_cols = [c for c in X_train.columns if X_train[c].dtype == "object"]
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ],
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=200)),
])

clf.fit(X_train, y_train)

In [55]:
p_test = clf.predict_proba(test)[:, 1]
pred = pd.DataFrame({"row_id": test["row_id"].astype(int), "prob_readmit30": p_test.astype(float)})
pred.to_csv(OUT_PATH, index=False)
pred.head()

      row_id  prob_readmit30
0  103521306        0.277139
1  127919112        0.137758
2  233245326        0.090782
3  236785056        0.055595
4  131110896        0.118291


In [56]:
# Validate output format (optional for faculty runs; required for students before tagging)
!python Project-1/readmit30/scripts/validate_submission.py --pred {OUT_PATH} --test {TEST_PATH}


OK: predictions.csv format is valid.


## 6) Validate the predictions file format

This checks:
- required columns
- probabilities in [0, 1]
- row_ids match the test file

It assumes the submission notebook wrote `predictions.csv` in the repo root.
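The three checks listed above can be sketched in a few lines. This is not the contents of `validate_submission.py`, which may differ; it is a dependency-free illustration using the standard `csv` module, assuming the column names `row_id` and `prob_readmit30` that the submission cell writes. The function name `check_predictions` and the inline sample data (including the `age` column) are made up for the example.

```python
import csv
import io


def check_predictions(pred_file, test_file) -> list:
    """Return a list of problems; an empty list means the format looks valid."""
    pred_reader = csv.DictReader(pred_file)
    pred_rows = list(pred_reader)

    # 1) required columns
    missing = {"row_id", "prob_readmit30"} - set(pred_reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []

    # 2) probabilities must parse as numbers in [0, 1] (NaN fails the range test)
    for row in pred_rows:
        try:
            p = float(row["prob_readmit30"])
        except (TypeError, ValueError):
            problems.append(f"row_id {row['row_id']}: probability is not a number")
            continue
        if not 0.0 <= p <= 1.0:
            problems.append(f"row_id {row['row_id']}: probability {p} outside [0, 1]")

    # 3) row_ids must match the test file (same ids, same counts, any order)
    pred_ids = sorted((r.get("row_id") or "") for r in pred_rows)
    test_ids = sorted((r.get("row_id") or "") for r in csv.DictReader(test_file))
    if pred_ids != test_ids:
        problems.append("row_ids do not match the test file")

    return problems


# Made-up sample data
test_csv = "row_id,age\n1,50\n2,61\n"
good = "row_id,prob_readmit30\n1,0.25\n2,0.9\n"
bad = "row_id,prob_readmit30\n1,1.5\n3,0.2\n"

print(check_predictions(io.StringIO(good), io.StringIO(test_csv)))
print(check_predictions(io.StringIO(bad), io.StringIO(test_csv)))
```

With real files, the same function can be called on handles opened with `open(path, newline="")`. The course script remains the authority on what counts as a valid submission.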


In [57]:
from pathlib import Path
pred_path = Path("predictions.csv")
test_path = Path("Project-1/readmit30/scripts/data/public/public_test.csv")

if not pred_path.exists():
    print("predictions.csv not found. Run notebooks/submission.ipynb first.")
else:
    !python Project-1/readmit30/scripts/validate_submission.py --pred {pred_path} --test {test_path}


OK: predictions.csv format is valid.


## 7) Commit + push + tag

You will:
- add changes
- commit (pre-commit hook runs here)
- push
- tag a milestone (example: `milestone_wk3`) and push tags



In [58]:
import getpass, subprocess

# Identity
subprocess.run(["git", "config", "--global", "user.name", "TLKline"], check=True)
subprocess.run(["git", "config", "--global", "user.email", "kline.timothy@mayo.edu"], check=True)

# Use the plain "store" helper (persists for the *runtime*, not your local machine)
subprocess.run(["git", "config", "--global", "credential.helper", "store"], check=True)

token = getpass.getpass("GitHub PAT: ").strip()

# Approve credentials for github.com
cred_input = f"protocol=https\nhost=github.com\nusername=TLKline\npassword={token}\n\n"
subprocess.run(["git", "credential", "approve"], input=cred_input.encode(), check=True)

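# Quick read-only connectivity check (lists remote branches; modifies nothing).
# Note: if the repo is public, this succeeds even with a bad token, so `git push` is the real auth test.
subprocess.run(["git", "ls-remote", "--heads", "origin"], check=True)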

print("Auth looks good. Now you can: git push")

# Commit and push
!pre-commit run --all-files
!git add -A
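# The first pre-commit run may rewrite notebooks (stripping outputs) and report a failure.
# Run it again after re-staging so the commit contains the stripped files.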
!pre-commit run --all-files
!git add -A
!git commit -m "Assignment 0: workflow + initial submission notebook"
!git push

TAG = "checking_workflow_002"
!git tag -a {TAG} -m "Checking workflow 002"
!git push --tags
print("Tagged and pushed:", TAG)

GitHub PAT: ··········
Auth looks good. Now you can: git push
nbstripout...............................................................Passed
nbstripout...............................................................Passed
nbstripout...........................................(no files to check)Skipped
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
Everything up-to-date
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 180 bytes | 180.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/TLKline/AIHC-5010-Winter-2026
 * [new tag]         checking_workflow_002 -> checking_workflow_002
Tagged and pushed: checking_workflow_002


## Done ✅

If you hit issues:
- If files are missing, make sure you pulled the latest course template.
- Make sure `Project-1/readmit30/scripts/data/public/*` exists in your repo (or that your instructor provided it separately).
