# Appendix F -- Git and GitHub for ML Projects
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/APP_F_Git_GitHub_for_ML.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Prerequisites:** None  

### Learning Objectives

- Understand Git's core model: commits, branches, and the working tree
- Initialise a repo, stage files, commit, and push to GitHub
- Use branches for experiments and merge them back
- Write a `.gitignore` that keeps notebooks clean and repos lean
- Use `nbstripout` to remove cell outputs before committing notebooks
- Version large datasets with DVC (Data Version Control)
- Structure an ML project repository for collaboration


---

## Setup


In [None]:
import subprocess
import sys

# Confirm Git is installed and show its version
result = subprocess.run(['git', '--version'], capture_output=True, text=True)
print(result.stdout.strip())

# Install nbstripout if it is missing. Calling pip through sys.executable
# targets the same interpreter that is running this notebook.
result2 = subprocess.run([sys.executable, '-m', 'pip', 'show', 'nbstripout'],
                         capture_output=True, text=True)
if result2.returncode != 0:   # pip show exits non-zero when the package is absent
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'nbstripout', '-q'], check=False)
    print('nbstripout installed')
else:
    print('nbstripout already available')

print('\nRun the bash cells below in a terminal,')
print('or prefix each command with ! to run it in Colab.')


---

## F.1 -- Core Git Workflow

Git tracks changes to files as a series of **commits**.
Each commit is a snapshot of the entire project at a point in time.

### The three areas

```
  Working tree   -->  Staging area  -->  Repository
  (your files)       (git add)          (git commit)
```
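The three areas can be observed directly with `git status --porcelain`, which prints a short status code per file. The following is a minimal sketch, assuming Git is on the `PATH`; the `git` helper, the throwaway repository, and the file name `train.py` are illustrative only.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip()


with tempfile.TemporaryDirectory() as repo:
    git("init", "-q", cwd=repo)
    Path(repo, "train.py").write_text("model = 'ridge'\n")

    # 1. Working tree only: Git sees the file but does not track it yet.
    untracked = git("status", "--porcelain", cwd=repo)

    # 2. Staging area: `git add` records the file for the next commit.
    git("add", "train.py", cwd=repo)
    staged = git("status", "--porcelain", cwd=repo)

    # 3. Repository: `git commit` stores the staged snapshot.
    git("-c", "user.name=Demo User", "-c", "user.email=demo@example.com",
        "commit", "-q", "-m", "Add baseline", cwd=repo)
    committed = git("status", "--porcelain", cwd=repo)

print(repr(untracked))   # '?? train.py'  (untracked)
print(repr(staged))      # 'A  train.py'  (added to the staging area)
print(repr(committed))   # ''             (nothing left to commit)
```

An empty status after the commit means the working tree, the staging area, and the latest commit all agree.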

### Essential commands

```bash
# Initialise a new repository
git init my-ml-project
cd my-ml-project

# Check status at any time
git status

# Stage files for commit
git add notebook.ipynb
git add .                    # stage everything

# Commit with a message
git commit -m 'Add salary regression baseline'

# View history
git log --oneline

# Connect to GitHub and push
git branch -M main           # name the branch 'main' (older Git defaults to 'master')
git remote add origin https://github.com/username/repo.git
git push -u origin main
```

### Branching for experiments

```bash
# Create a branch for a new experiment
git checkout -b experiment/gradient-boosting

# ... make changes, commit ...

# Merge back to main when satisfied
git checkout main
git merge experiment/gradient-boosting

# Delete the branch when done
git branch -d experiment/gradient-boosting
```

Branches suit parallel ML experiments well. With one branch per experiment,
`main` always holds a working baseline that you can return to with `git checkout main`.
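The "return to a working baseline" claim can be checked in a few lines. This is a rough sketch, assuming Git is on the `PATH`; the `git` helper, the `IDENTITY` settings, and the file contents are made up for the demonstration.

```python
import subprocess
import tempfile
from pathlib import Path

# Throwaway identity so commits work without any global Git configuration
IDENTITY = ["-c", "user.name=Demo User", "-c", "user.email=demo@example.com"]


def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["git", *IDENTITY, *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip()


with tempfile.TemporaryDirectory() as repo:
    script = Path(repo, "train.py")
    git("init", "-q", cwd=repo)

    # Baseline commit on main
    script.write_text("model = 'ridge'\n")
    git("add", "train.py", cwd=repo)
    git("commit", "-q", "-m", "Add baseline", cwd=repo)
    git("branch", "-M", "main", cwd=repo)  # name the default branch explicitly

    # Experiment commit on its own branch
    git("checkout", "-q", "-b", "experiment/gradient-boosting", cwd=repo)
    script.write_text("model = 'gradient_boosting'\n")
    git("commit", "-q", "-am", "Try gradient boosting", cwd=repo)
    on_experiment = script.read_text()

    # Switching back restores the baseline version of the file
    git("checkout", "-q", "main", cwd=repo)
    on_main = script.read_text()
    branches = git("branch", "--format=%(refname:short)", cwd=repo).splitlines()

print(sorted(branches))
print("experiment:", on_experiment.strip())
print("main:      ", on_main.strip())
```

The experiment's change stays on its branch until it is merged, so an abandoned experiment costs nothing on `main`.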


In [None]:
# F.1.1 -- Demonstrate git operations programmatically

import subprocess
import os
import tempfile

# Create a temporary project to demonstrate git
tmpdir = tempfile.mkdtemp()
os.chdir(tmpdir)

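def git(cmd, check=True):
    """Run a git command; raise if it fails and check is True."""
    result = subprocess.run(
        f'git {cmd}', shell=True, capture_output=True, text=True
    )
    if check and result.returncode != 0:
        raise RuntimeError(f'git {cmd} failed: {result.stderr.strip()}')
    return result.stdout.strip() or result.stderr.strip()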

# Initialise repo
print(git('init'))
print(git('config user.email "demo@example.com"'))
print(git('config user.name "Demo User"'))

# Create a file and commit it
with open('train.py', 'w') as f:
    f.write('# Salary regression baseline\nmodel = \'ridge\'\n')

print(git('add train.py'))
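print(git('commit -m "Add baseline model"'))

# Name the branch 'main' so the later checkout works whatever Git's default is
print(git('branch -M main'))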

# Create an experiment branch
print(git('checkout -b experiment/random-forest'))

with open('train.py', 'w') as f:
    f.write('# Salary regression\nmodel = \'random_forest\'\n')

print(git('add train.py'))
print(git('commit -m "Try random forest"'))

# Show log
print()
print('Git log (all branches):')
print(git('log --oneline --all --graph'))

# Merge back
print(git('checkout main'))
print(git('merge experiment/random-forest --no-ff -m "Merge random forest experiment"'))
print()
print('After merge:')
print(git('log --oneline --all --graph'))

os.chdir('/')


---

## F.2 -- .gitignore for ML Projects

Certain files should never be committed: large data files, model weights,
secrets, and Python cache files. A `.gitignore` tells Git to ignore them.

```gitignore
# Python
__pycache__/
*.py[cod]
.env
venv/
.venv/

# Jupyter
*.ipynb_checkpoints
.ipynb_checkpoints/

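# Data files -- track with DVC instead
# Ignore the contents of data/raw/ but keep DVC's pointer files and the
# .gitignore that DVC writes there, so both can still be committed
data/raw/*
!data/raw/*.dvc
!data/raw/.gitignore
*.csv
*.parquet
# Broad rule: narrow it (e.g. data/**/*.json) if you commit JSON config files
*.json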

# Model artefacts -- often too large for GitHub
models/
*.pt
*.pkl
*.joblib

# MLflow
mlruns/
mlflow.db

# OS
.DS_Store
Thumbs.db
```
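To confirm that a rule behaves as intended, `git check-ignore` reports whether a path is ignored. Below is a small sketch, assuming Git is on the `PATH`; it uses a shortened rule set and invented file names rather than the full `.gitignore` above, and the `is_ignored` helper is hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path

# A shortened rule set for the demonstration
RULES = "__pycache__/\n*.csv\nmodels/\n.env\n"
CANDIDATES = ["src/train.py", "data/salaries.csv", "models/rf.joblib", ".env", "README.md"]


def is_ignored(path, cwd):
    """`git check-ignore -q` exits with 0 if the path is ignored, 1 if it is not."""
    result = subprocess.run(["git", "check-ignore", "-q", path], cwd=cwd)
    return result.returncode == 0


with tempfile.TemporaryDirectory() as repo:
    subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
    Path(repo, ".gitignore").write_text(RULES)

    # Create empty placeholder files so every candidate path really exists
    for name in CANDIDATES:
        target = Path(repo, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    ignored = {name: is_ignored(name, repo) for name in CANDIDATES}

for name, flag in ignored.items():
    print(f"{name:20} {'ignored' if flag else 'tracked'}")
```

On the command line, `git check-ignore -v <path>` additionally shows which rule matched, which helps when a file is ignored unexpectedly.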

## F.3 -- Notebook Version Control with nbstripout

Jupyter notebooks store cell outputs (images, tables, print output) inside
the `.ipynb` JSON. This causes two problems for version control:

1. Large diffs: a single matplotlib figure is embedded as base64 text and can add tens of kilobytes to a diff
2. Non-reproducible state: committed outputs may not reflect the actual code

`nbstripout` strips outputs automatically whenever a notebook is staged:

```bash
# Install the package, then enable it once per repo
pip install nbstripout
nbstripout --install       # registers a Git filter for *.ipynb in this repo

# Now every 'git add notebook.ipynb' stages a copy with the outputs stripped
# Your working copy keeps its outputs; they just aren't committed
```
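To make "stripping outputs" concrete, the sketch below removes outputs from a tiny in-memory notebook using only the `json` module. It is a simplified stand-in for what `nbstripout` does, not its actual implementation; the `strip_outputs` function and the sample notebook are invented for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    stripped = json.loads(json.dumps(notebook))  # cheap deep copy
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, tables, images
            cell["execution_count"] = None  # drop the In [n] counter
    return stripped


# A hand-written two-cell notebook in the nbformat 4 layout
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# EDA"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
}

clean = strip_outputs(notebook)
print(len(json.dumps(notebook)), "->", len(json.dumps(clean)), "characters")
```

The source of each cell is untouched; only the parts that change on every run are removed, which is why stripped notebooks produce small, readable diffs.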

## F.4 -- Recommended ML Repository Structure

```
my-ml-project/
├── data/
│   ├── raw/          # original, immutable data (tracked with DVC)
│   └── processed/    # cleaned data (tracked with DVC)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_modelling.ipynb
│   └── 03_evaluation.ipynb
├── src/
│   ├── data.py       # data loading and cleaning functions
│   ├── features.py   # feature engineering
│   ├── models.py     # model definitions
│   └── evaluate.py   # evaluation metrics
├── tests/
│   ├── test_data.py
│   └── test_models.py
├── models/           # saved model artefacts (gitignored, DVC-tracked)
├── .dvc/             # DVC config
├── .gitignore
├── requirements.txt
├── README.md
└── Makefile          # common commands: make train, make test, make clean
```
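The layout above can be scaffolded with a short script. This is one possible sketch using `pathlib`; the `scaffold` function is hypothetical, the files are created empty, the notebooks are left for you to add, and `.dvc/` is omitted because `dvc init` creates it.

```python
import tempfile
from pathlib import Path

DIRECTORIES = ["data/raw", "data/processed", "notebooks", "src", "tests", "models"]
FILES = [
    ".gitignore", "requirements.txt", "README.md", "Makefile",
    "src/data.py", "src/features.py", "src/models.py", "src/evaluate.py",
    "tests/test_data.py", "tests/test_models.py",
]


def scaffold(root):
    """Create the project skeleton under `root` and return the relative paths made."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file in FILES:
        (root / file).touch(exist_ok=True)  # empty placeholder, never overwrites content
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Demonstrate in a temporary directory; pass a real path to create a project
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(Path(tmp) / "my-ml-project")

for path in created:
    print(path)
```

Because both `mkdir` and `touch` are called with `exist_ok=True`, running the script again on an existing project leaves its files unchanged.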

## F.5 -- Dataset Versioning with DVC

GitHub blocks files larger than 100 MB and warns about files above 50 MB, so
large datasets and model weights need a different tool. **DVC (Data Version
Control)** tracks data files with small pointer files (`.dvc`) that are
committed to Git, while the actual data is stored in a remote (S3, GCS, Azure,
or even Google Drive).

```bash
pip install "dvc[gdrive]"   # the gdrive extra adds Google Drive remote support

# Initialise DVC in an existing Git repo
dvc init

# Add a data file to DVC tracking
dvc add data/raw/so_survey_2025.csv
# This creates data/raw/so_survey_2025.csv.dvc and lists the CSV in
# data/raw/.gitignore -- commit both to Git
git add data/raw/so_survey_2025.csv.dvc data/raw/.gitignore
git commit -m 'Track SO 2025 dataset with DVC'

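# Configure remote storage (example: Google Drive)
dvc remote add -d gdrive gdrive://YOUR_FOLDER_ID
git add .dvc/config        # the remote setting lives here; commit it so others can pull
git commit -m 'Configure DVC remote'
dvc push   # upload data to remote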

# On another machine: pull the data
dvc pull
```
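The idea behind a pointer file can be illustrated without DVC: summarise a large file as a small record of its content hash and size, and version that record instead of the data. The sketch below is conceptual only. It does not reproduce DVC's file format, and the `make_pointer` function and sample data are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def make_pointer(path):
    """Summarise a file as a small record: content hash, size, and name."""
    path = Path(path)
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files never need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {"md5": digest.hexdigest(), "size": path.stat().st_size, "path": path.name}


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "survey.csv")

    data.write_bytes(b"salary,years\n50000,2\n")
    v1 = make_pointer(data)

    # Appending a row changes the content, and therefore the hash
    data.write_bytes(b"salary,years\n50000,2\n72000,5\n")
    v2 = make_pointer(data)

print(v1)
print(v2)
print("data changed:", v1["md5"] != v2["md5"])
```

A record like this stays a few dozen bytes however large the dataset grows, which is what makes it practical to commit to Git alongside the code that uses the data.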

---

*End of Appendix F -- Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)
