# Version Control with Git

This is a hands-on tutorial. You should follow along by running the commands on your own machine. When example files are needed, the contents are provided so you can create them yourself.

**Prerequisites:**
- A terminal/command line application (Terminal on macOS/Linux, Git Bash on Windows)
- Git installed on your machine (installation instructions below)
- A text editor to create and modify files

**What you'll build:** Through the exercises, you'll create a small `hello.py` file and make incremental changes to it while learning Git concepts along the way.

## Why Version Control Matters

When working on data analysis or developing statistical software, you've likely encountered scenarios like these:

- You make changes to your analysis script, and suddenly nothing works. What did you change?
- You have files named `analysis_v1.py`, `analysis_v2.py`, `analysis_final.py`, `analysis_final_FINAL.py`.
- A collaborator emails you their version of the code, and you need to merge their changes with yours.
- You want to try a different statistical approach but don't want to lose your current working version.
- Months later, a reviewer asks why you used a particular method, and you can't remember.

Version control systems solve these problems by tracking every change made to your files, who made them, and when. They let you return to any previous state, work on experimental features without risk, and collaborate with others without chaos.

### The Case for Version Control in Scientific Computing

For statistical computing and scientific research, version control is particularly valuable:

**Reproducibility.** Science requires that others can reproduce your results. Version control creates a complete record of how your code evolved, allowing you to reconstruct exactly what code produced a particular result.

**Provenance.** When you return to an analysis months later, or when reviewers ask questions, you can trace back through your changes to understand why decisions were made.

**Experimentation.** Statistical analysis often involves trying different approaches. Version control lets you create separate branches to experiment with different methods without affecting your main analysis.

**Collaboration.** Research is increasingly collaborative. Version control provides structured ways to combine work from multiple people without overwriting each other's contributions.

### Version Control Approaches

Before modern version control systems, people used ad-hoc methods to track changes.

**Manual copies** involve creating copies of files with version numbers or dates in the filename: `analysis_v1.py`, `analysis_2024-01-15.py`. This approach quickly becomes unmanageable. Which version is actually "final"? What changed between versions? What if you need to combine changes from two different versions?

**Centralized systems** like Subversion (SVN) store all versions on a central server. Users check out files, make changes, and check them back in. Most operations, including committing and viewing history, need a network connection to that server, and the server is a single point of failure.

**Distributed systems** like Git give every user a complete copy of the entire project history. You can work offline, create branches freely, and merge changes in flexible ways. Git is now the dominant version control system, used by the large majority of open source projects and by most software companies.

### Question

A researcher has been working on an analysis project for six months. Her project folder looks like this:

```
analysis_project/
    analysis.py
    analysis_backup.py
    analysis_old.py
    analysis_new.py
    analysis_working.py
    analysis_FINAL.py
    analysis_FINAL_v2.py
    data_cleaning.py
    data_cleaning_fixed.py
```

Identify at least three problems with this approach to managing code changes, and explain how version control would address each one.

**Answer:**

1. **No clear history of changes.** It's impossible to know what changed between versions or why. With version control, each change is recorded with a message explaining what was modified and why.

2. **No way to identify the current working version.** Is `analysis_FINAL_v2.py` the latest? Is `analysis_working.py` still relevant? Version control maintains a clear "current" state while preserving all history.

3. **Difficult to merge improvements.** If `analysis_old.py` has a useful function that was accidentally removed in later versions, recovering it requires manually comparing files. Version control can show exactly what changed and makes it easy to recover any previous state.

4. **No record of who made changes or when.** In collaborative work, this information is essential. Version control automatically records the author and timestamp of every change.

5. **Wasted disk space.** Multiple near-identical copies waste storage. Git stores each distinct version of a file only once and compresses its history, so a full record of every change usually takes far less space than a folder of manual copies.

## Getting Started with Git

Git is the most widely used version control system. It's free, open source, and runs on all major operating systems.

### Installing Git

Git is available for Windows, macOS, and Linux. On most systems, you can check if Git is installed by opening a terminal and typing:

In [None]:
git --version

If Git is not installed:

- **macOS:** Install Xcode Command Line Tools with `xcode-select --install`, or use Homebrew: `brew install git`
- **Linux:** Use your package manager: `sudo apt install git` (Debian/Ubuntu) or `sudo dnf install git` (Fedora)
- **Windows:** Download from [git-scm.com](https://git-scm.com/) or use Windows Subsystem for Linux

### Configuring Git

Before using Git, you need to set your name and email. These are recorded with every change you make.

In [None]:
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

The `--global` flag sets these for all repositories on your computer. You can override them for individual projects by running the commands without `--global` inside a repository.

### Creating a Repository

A Git repository is a directory where Git tracks changes. Let's create a practice repository that we'll use throughout this tutorial.

In [None]:
# Create and navigate to a new directory
mkdir git-tutorial
cd git-tutorial

# Initialize a Git repository
git init

You should see a message like `Initialized empty Git repository in /Users/yourusername/git-tutorial/.git/` (the path will reflect your own system).

This creates a hidden `.git` directory that stores all version control information. Your project files remain unchanged.

### The .git Directory

The `.git` directory contains everything Git needs to track your project:

```
.git/
    HEAD        # Points to the current branch
    config      # Repository-specific configuration
    objects/    # Stores all versions of all files
    refs/       # Stores branch and tag references
    ...
```

You should rarely need to manually edit files in `.git`. If you delete the `.git` directory, you lose all version history (but your current files remain).

## Basic Git Workflow

Git tracks changes through a three-stage workflow. Understanding these stages is essential for using Git effectively.

### The Three States

At any moment, a change in a Git repository lives in one of three places:

1. **Working directory:** Your project files as they currently exist on disk. This is where you edit files normally.

2. **Staging area (index):** A holding area for changes you want to include in your next commit. You explicitly add changes here before committing.

3. **Repository:** The committed history. When you commit, Git takes everything in the staging area and permanently stores it as a new version.

This three-stage design is intentional. It lets you carefully control what goes into each commit, even if you've made many changes across multiple files.
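
To make the three areas concrete, here is a minimal toy model in Python. It is only an illustration: the dictionaries and the `add` and `commit` helpers are invented for this sketch and say nothing about how Git stores data internally. It shows one consequence of the design, namely that a commit records what was staged, not what is currently on disk.

```python
# Toy model of Git's three areas using plain Python containers.
# Illustration only; real Git stores its data very differently.

working_dir = {}   # files as they exist on disk: name -> contents
staging_area = {}  # contents queued for the next commit
repository = []    # committed snapshots, oldest first


def add(name):
    """Copy the current contents of one file into the staging area."""
    staging_area[name] = working_dir[name]


def commit(message):
    """Record a snapshot of whatever is staged right now."""
    repository.append({"message": message, "files": dict(staging_area)})


# Edit a file, stage it, then edit it again before committing.
working_dir["hello.py"] = 'print("hello")'
add("hello.py")
working_dir["hello.py"] = 'print("hello world")'  # this edit is not staged
commit("Add hello.py")

# The commit holds the staged version, not the later edit.
print(repository[-1]["files"]["hello.py"])  # print("hello")
print(working_dir["hello.py"])              # print("hello world")
```

The second edit stays in the working directory until it is staged and committed separately.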

### Checking Status

The `git status` command shows the current state of your repository.

First, create a file called `hello.py` in your `git-tutorial` directory with the following content:

In [None]:
print("hello")

Now check the status:

In [None]:
git status

You should see:

```
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        hello.py

nothing added to commit but untracked files present (use "git add" to track)
```

This tells you:
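- You're on the `master` branch (if your Git is configured with a different default branch name, such as `main`, you'll see that name instead; substitute it wherever this tutorial uses `master`)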
- `hello.py` is a new file Git isn't tracking yet
- Nothing is currently staged for commit

### Staging Changes

The `git add` command moves changes from the working directory to the staging area.

Stage your first file:

In [None]:
git add hello.py
git status

You should see:

```
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   hello.py
```

Now `hello.py` is staged and will be included in the next commit.

**Common staging commands:**

In [None]:
git add hello.py          # Stage a specific file
git add *.py              # Stage all Python files
git add .                 # Stage all changes in the current directory and below

### Viewing Changes

Before staging or committing, you often want to see what changed. The `git diff` command shows differences.

First, let's commit what we have staged:

In [None]:
git commit -m "Add hello.py"

Now edit `hello.py` to change the message:

In [None]:
print("hello world")

View the changes:

In [None]:
git diff

You should see output similar to:

```diff
diff --git a/hello.py b/hello.py
--- a/hello.py
+++ b/hello.py
@@ -1 +1 @@
-print("hello")
+print("hello world")
```

Lines starting with `-` were removed; lines starting with `+` were added.
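
The same unified diff format can be produced outside Git. As a rough sketch, the snippet below uses Python's standard `difflib` module (not what Git uses internally) on the two versions of `hello.py`, which may help when reading the `-`, `+`, and `@@` lines.

```python
import difflib

# The two versions of hello.py, one list entry per line of the file.
old = ['print("hello")\n']
new = ['print("hello world")\n']

diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/hello.py", tofile="b/hello.py")
)
print("".join(diff_lines), end="")
```

The `@@ -1 +1 @@` line is the hunk header: it says the change starts at line 1 of the old file and line 1 of the new file.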

Stage and commit this change:

In [None]:
git add hello.py
git commit -m "Change to hello world"

**Common diff commands:**

In [None]:
git diff           # Show unstaged changes
git diff --staged  # Show staged changes
git diff hello.py  # Show changes in a specific file

### Committing Changes

The `git commit` command creates a permanent snapshot of everything in the staging area. The `-m` flag provides the commit message inline. Without it, Git opens your configured text editor.

Each commit is identified by a unique hash (like `abc1234`) and records:
- A snapshot of your project as recorded in the staging area
- Your name and email
- The current date and time
- Your commit message
- A reference to the previous commit
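
Git derives these identifiers from content. As an illustration, the sketch below computes the ID Git gives to a file's contents (a "blob"), assuming a repository that uses Git's default SHA-1 object format. Commit hashes are computed in a similar way from the commit's own text. The function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to file contents (a "blob")."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) and then the data.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

You can compare the result for any file against `git hash-object <file>`. Because the ID depends only on the content, identical content always gets the same ID.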

You should now have two commits. Verify with:

In [None]:
git log

### A Typical Workflow

Edit `hello.py` to add punctuation:

In [None]:
print("hello, world!")

Now follow the typical workflow:

In [None]:
git status                       # Check current status
git diff                         # Review what changed
git add hello.py                 # Stage the changes
git commit -m "Add punctuation"  # Commit
git log                          # Verify

You should now have three commits in your history.

### Question

You've been working on an analysis script. Running `git status` shows:

```
On branch master
Changes to be committed:
        modified:   analysis.py
        new file:   config.py

Changes not staged for commit:
        modified:   analysis.py
        modified:   utils.py

Untracked files:
        data/raw_data.csv
```

Notice that `analysis.py` appears in both "Changes to be committed" and "Changes not staged for commit." If you run `git commit -m "Add configuration"` right now, which changes will be included in the commit?

**Answer:**

Only the changes to `analysis.py` and `config.py` that were staged will be committed. Specifically:

- **Committed:** The staged version of `analysis.py` and the new file `config.py`
- **Not committed:** The additional unstaged changes to `analysis.py`, all changes to `utils.py`, and the untracked file `data/raw_data.csv`

The fact that `analysis.py` appears in both sections means you staged some changes, then made additional edits that haven't been staged yet. After the commit, those unstaged changes will still be in your working directory, waiting to be staged for a future commit.

## Understanding History

One of Git's most powerful features is its complete history of all changes. You can view this history, examine individual commits, and understand how your project evolved.

### Viewing Commit History

The `git log` command shows the commit history:

In [None]:
git log

You should see output similar to (with different hashes and dates):

```
commit abc003
Author: Your Name <your.email@example.com>
Date:   Mon Jan 15 14:30:00 2024 -0500

    Add punctuation

commit abc002
Author: Your Name <your.email@example.com>
Date:   Mon Jan 15 10:15:00 2024 -0500

    Change to hello world

commit abc001
Author: Your Name <your.email@example.com>
Date:   Sun Jan 14 16:45:00 2024 -0500

    Add hello.py
```

Each entry shows:
- The commit hash (a 40-character identifier, abbreviated in this example)
- The author name and email
- The date and time
- The commit message

### Useful Log Options

The log command has many formatting options. Try each of these:

In [None]:
# Compact one-line format
git log --oneline

You should see something like:

```
abc003 Add punctuation
abc002 Change to hello world
abc001 Add hello.py
```

Try other useful options:

In [None]:
git log -2             # Show last 2 commits
git log --stat         # Show commits with file changes
git log -p             # Show commits with full diff
git log -- hello.py    # Show commits affecting a specific file

### Examining Specific Commits

The `git show` command displays details about a specific commit:

In [None]:
# Show the most recent commit
git show

# Show a specific commit by hash; replace abc003 with a hash from your own log
# (the first few characters are enough, as long as they are unique)
git show abc003

# Show just the files that changed
git show --stat abc003

# Show a specific file from a commit
git show abc003:hello.py
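
Abbreviated hashes work because Git only needs enough leading characters to identify one commit. The following is a simplified sketch of that idea, not Git's actual lookup code. The function `resolve_prefix` and the hashes are invented for illustration.

```python
def resolve_prefix(prefix, known_hashes):
    """Return the single hash that starts with `prefix`, or raise ValueError."""
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise ValueError(f"no commit matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous")
    return matches[0]


# Made-up 40-character hashes for illustration only.
hashes = [
    "a1b2c3d4" + "0" * 32,
    "a1b2ffff" + "0" * 32,
    "f4e5d6c7" + "0" * 32,
]

print(resolve_prefix("f4e5", hashes))   # unique, so it resolves
print(resolve_prefix("a1b2c", hashes))  # unique once a fifth character is added
```

Here the prefix `a1b2` alone would be rejected as ambiguous, because two hashes start with it.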

### Comparing Commits

You can compare any two points in history:

In [None]:
# Compare two commits
git diff abc002 abc003

# Compare a commit to the current state
git diff abc002

# See what files changed between commits
git diff --stat abc002 abc003

### Undoing Changes

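Git provides several ways to undo changes, depending on what state they're in. The commands below are a reference; if you run the `git reset` examples now, your history will no longer match the rest of this tutorial: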

In [None]:
# Discard unstaged changes to a file (restore from staging area)
git restore hello.py

# Unstage a file (keep changes in working directory)
git restore --staged hello.py

# Undo the last commit but keep changes staged
git reset --soft HEAD~1

# Undo the last commit and unstage changes
git reset HEAD~1

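Be careful with these commands. `git restore hello.py` permanently discards your uncommitted edits to that file, because Git never stored a copy of them. The two `git reset` commands only move the branch back by one commit: your files keep their current contents, and the undone changes are left staged (`--soft`) or unstaged (the default).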

### Question

Looking at this git log output:

```
a1b2c3d (HEAD -> master) Update analysis to use pandas
f4e5d6c Add data validation checks
9a8b7c6 Implement linear regression function
b3c4d5e Fix off-by-one error in bootstrap
e6f7a8b Initial project setup
```

1. Which commit is the most recent?
2. If you run `git diff 9a8b7c6 f4e5d6c`, what changes would you see?
3. How would you see what files were changed in the "Fix off-by-one error in bootstrap" commit?

**Answer:**

1. The most recent commit is `a1b2c3d` with message "Update analysis to use pandas." It is listed first and is marked `(HEAD -> master)`, meaning it's where the current branch points.

2. `git diff 9a8b7c6 f4e5d6c` would show the changes between those two commits. Because `f4e5d6c` directly follows `9a8b7c6`, that is exactly what the "Add data validation checks" commit introduced: lines it added appear with `+`, and lines it removed appear with `-`.

3. To see what files changed in the "Fix off-by-one error in bootstrap" commit, you would run:

In [None]:
git show --stat b3c4d5e

Or to see the actual changes:

In [None]:
git show b3c4d5e

## Branching and Merging

Branches let you work on different versions of your project simultaneously. This is invaluable for experimentation: you can try a new approach without affecting your working code.

### Why Branches Matter

Consider this scenario: your analysis pipeline produces correct results, but you want to try a different statistical method. Without branches, you might:

- Copy all files to a new directory (messy, duplicates everything)
- Comment out code and add new code (clutters the file)
- Just make the changes and hope you can undo them (risky)

With branches, you create a separate line of development. If the new approach works, you merge it back. If not, you simply delete the branch.

### Creating and Listing Branches

In [None]:
# List all branches (* marks current branch)
git branch

You should see only `master` (with an asterisk indicating it's the current branch):

```
* master
```

Now create a new branch. The `-c` flag means "create" - it creates the branch and switches to it in one command:

In [None]:
# Create and switch to a new branch (-c means "create")
git switch -c experiment-uppercase

You should see: `Switched to a new branch 'experiment-uppercase'`

In [None]:
# List branches again
git branch

Now you should see:

```
* experiment-uppercase
  master
```

### Switching Branches

In [None]:
# Switch back to master
git switch master

# Switch to the experiment branch
git switch experiment-uppercase

When you switch branches, Git updates the files in your working directory to match that branch. Uncommitted changes carry over to the branch you switch to; if the switch would overwrite them, Git refuses to switch until you commit or stash them.

### A Branching Workflow

Let's make a change on our experiment branch, then merge it back.

Make sure you're on the `experiment-uppercase` branch:

In [None]:
git switch experiment-uppercase

Edit `hello.py` to use uppercase:

In [None]:
print("HELLO, WORLD!")

Commit the changes:

In [None]:
git add hello.py
git commit -m "Uppercase greeting"

Now merge the experiment back into master:

In [None]:
git switch master
git merge experiment-uppercase

You should see a message about a "fast-forward" merge.

Clean up:

In [None]:
git branch -d experiment-uppercase
git log --oneline

### Merging Branches

The `git merge` command combines changes from one branch into another:

In [None]:
# Make sure you're on the branch you want to merge INTO
git switch master

# Merge another branch into the current branch
# (feature-branch is a placeholder; use the name of your own branch)
git merge feature-branch

If the branches have diverged (both have new commits), Git creates a "merge commit" that combines both histories.
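
Whether a merge is a fast-forward or needs a merge commit depends on how the two branch tips relate in the commit graph. The sketch below models that decision on a toy graph. The commit names and the helpers `ancestors` and `merge_kind` are invented for this example, and it describes Git's default behavior only (options such as `--no-ff` change it).

```python
# Toy commit graph: each commit maps to the list of its parent commits.
# Commit names are invented for this example.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],  # tip of master
    "D": ["C"],  # tip of a branch built directly on top of master
    "E": ["B"],  # tip of a branch that split off before C
}


def ancestors(commit):
    """Return the set containing `commit` and every commit reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_kind(current_tip, other_tip):
    """Classify what merging `other_tip` into `current_tip` would require."""
    if other_tip in ancestors(current_tip):
        return "already up to date"
    if current_tip in ancestors(other_tip):
        return "fast-forward"
    return "merge commit"


print(merge_kind("C", "D"))  # fast-forward
print(merge_kind("C", "E"))  # merge commit
```

In the fast-forward case Git only moves the branch pointer forward. In the diverged case it creates a new commit with two parents.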

### Handling Merge Conflicts

Sometimes Git can't automatically merge changes because the same lines were modified differently in each branch. This creates a merge conflict.

Let's deliberately create a conflict to learn how to resolve it.

Create a new branch and modify the greeting:

In [None]:
git switch -c modify-hello

Edit `hello.py` and replace the greeting with a title-case version:

In [None]:
print("Hello World")

Commit:

In [None]:
git add hello.py
git commit -m "Capitalize Hello World"

Now switch to master and make a *different* change to the same line:

In [None]:
git switch master

Edit `hello.py` and replace the same line with a lowercase version that keeps the comma:

In [None]:
print("hello, World")

Commit:

In [None]:
git add hello.py
git commit -m "Add comma to greeting"

Now try to merge:

In [None]:
git merge modify-hello

You should see:

```
CONFLICT (content): Merge conflict in hello.py
Automatic merge failed; fix conflicts and then commit the result.
```

Open `hello.py` and you'll see conflict markers:

In [None]:
<<<<<<< HEAD
print("hello, World")
=======
print("Hello World")
>>>>>>> modify-hello

Edit the file to combine both changes (capitalization and comma):

In [None]:
print("Hello, World")

Complete the merge:

In [None]:
git add hello.py
git commit -m "Merge: combine capitalization and comma"
git branch -d modify-hello
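
The conflict markers seen in `hello.py` follow a fixed layout: the current branch's version sits between `<<<<<<<` and `=======`, and the incoming branch's version sits between `=======` and `>>>>>>>`. As a small sketch, the hypothetical helper `split_conflict` below separates the two sides. It assumes a single conflict in the default marker style and is not a general-purpose parser.

```python
def split_conflict(text):
    """Split one conflicted region into (ours, theirs) lists of lines.

    Assumes the text contains exactly one conflict in the default
    two-sided marker style.
    """
    ours, theirs = [], []
    target = None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            target = ours      # lines from the current branch follow
        elif line.startswith("======="):
            target = theirs    # lines from the branch being merged follow
        elif line.startswith(">>>>>>> "):
            target = None      # end of the conflicted region
        elif target is not None:
            target.append(line)
    return ours, theirs


conflicted = """<<<<<<< HEAD
print("hello, World")
=======
print("Hello World")
>>>>>>> modify-hello
"""

ours, theirs = split_conflict(conflicted)
print(ours)    # ['print("hello, World")']
print(theirs)  # ['print("Hello World")']
```

Resolving the conflict means replacing the whole marked region, markers included, with the text you actually want.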

### Question

You have a working script on the `master` branch and want to try a different approach. Which sequence of commands correctly creates a branch, makes the experiment, and cleans up if the experiment fails?

**Option A:**

In [None]:
git branch try-new-feature
# edit files and test
git branch -d try-new-feature

**Option B:**

In [None]:
git switch -c try-new-feature
# edit files, test, commit
git switch master
git branch -D try-new-feature

**Option C:**

In [None]:
git switch -c try-new-feature
# edit files, test, commit
git switch master
git merge try-new-feature

**Answer:**

**Option B** is correct for abandoning an experiment.

- **Option A** is wrong because you never switch to the branch, so your edits happen on `master`. Also, you never commit the changes.
- **Option B** correctly creates and switches to the branch (the `-c` flag creates the branch), lets you make and commit changes, then switches back to `master` and deletes the experimental branch. Because the experimental commits were never merged, deleting the branch requires `-D`; the safer `-d` flag refuses to delete a branch with unmerged commits.
- **Option C** is what you'd do if the experiment *succeeds* and you want to keep the changes.

## Working with Remote Repositories

So far, everything has been local to your computer. Remote repositories let you back up your work, share it with others, and collaborate.

> **Note:** The exercises in this section require a GitHub account. If you don't have one, you can read through this section to understand the concepts and return later when you have an account. The local Git skills from previous sections are the foundation; remote collaboration builds on top of them.

### What Are Remotes?

A remote is a copy of your repository hosted somewhere else, typically on a service like GitHub, GitLab, or Bitbucket. These platforms provide:

- Backup of your code
- Web interface to browse history
- Collaboration tools (pull requests, issues)
- Access control and permissions

### Creating a GitHub Repository

To share your local repository on GitHub:

1. Go to [github.com](https://github.com) and log in
2. Click the "+" icon in the top right and select "New repository"
3. Name it `git-tutorial`
4. **Important:** Do NOT initialize with README, .gitignore, or license (we already have content)
5. Click "Create repository"

GitHub will show you instructions. Use the "push an existing repository" commands:

In [None]:
# Navigate to your repository
cd git-tutorial

# Add GitHub as a remote named 'origin'
# Replace YOUR-USERNAME with your actual GitHub username
git remote add origin https://github.com/YOUR-USERNAME/git-tutorial.git

# Push your master branch to GitHub
git push -u origin master

You may be prompted for your GitHub credentials. GitHub does not accept your account password for Git operations over HTTPS, so enter a personal access token when asked for a password (see the authentication section below).

The `-u` flag sets up tracking so future `git push` commands know where to send changes.

Verify by visiting `https://github.com/YOUR-USERNAME/git-tutorial` in your browser. You should see your `hello.py` file.

### Cloning a Repository

To get a copy of an existing remote repository:

In [None]:
git clone https://github.com/username/repository.git

This creates a new directory with the repository contents and history, with the remote already configured as `origin`.

### Managing Remotes

In [None]:
# List configured remotes
git remote -v

# Add a remote
git remote add origin https://github.com/username/repo.git

# Remove a remote
git remote remove origin

# Change a remote's URL
git remote set-url origin https://github.com/username/new-repo.git

### Pushing Changes

The `git push` command sends your commits to the remote:

In [None]:
# Push current branch to its upstream remote
git push

# Push a specific branch
git push origin master

# Push and set upstream tracking
git push -u origin feature-branch

If someone else has pushed changes since your last pull, Git will reject your push. You'll need to pull first, resolve any conflicts, and then push again.

### Pulling Changes

The `git pull` command fetches changes from the remote and merges them:

In [None]:
# Pull changes for current branch
git pull

# Pull from a specific remote and branch
git pull origin master

By default, `git pull` is two operations: `git fetch` (download changes) followed by `git merge` (integrate them). You can run the two steps yourself:

In [None]:
# Just fetch without merging (safer, lets you review first)
git fetch origin

# See what changed on the remote
git log origin/master --oneline

# Then merge when ready
git merge origin/master
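
The difference between fetching and pulling comes down to which branch pointers move. The toy model below sketches this. The commit IDs and the helper functions are invented, and the merge step assumes the simple fast-forward case where the local branch has no commits of its own.

```python
# Toy model of branch pointers. Commit IDs "c2" and "c4" are invented.
local_refs = {"master": "c2", "origin/master": "c2"}
server_refs = {"master": "c4"}  # a colleague has pushed new commits


def fetch():
    """Refresh the remote-tracking ref; the local branch does not move."""
    local_refs["origin/master"] = server_refs["master"]


def merge_remote_tracking():
    """Bring the local branch up to the remote-tracking ref.

    Assumes the fast-forward case: no local commits to combine.
    """
    local_refs["master"] = local_refs["origin/master"]


def pull():
    """Fetch, then integrate: both steps in one call."""
    fetch()
    merge_remote_tracking()


fetch()
print(local_refs)  # {'master': 'c2', 'origin/master': 'c4'}
merge_remote_tracking()
print(local_refs)  # {'master': 'c4', 'origin/master': 'c4'}
```

After `fetch()` alone, your own `master` is unchanged, which is why fetching first is a safe way to inspect incoming work.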

### Authentication: SSH vs HTTPS

Git can authenticate with remotes using HTTPS or SSH:

**HTTPS** is simpler to set up. You authenticate with your username and a personal access token (GitHub no longer accepts account passwords for Git operations).

**SSH** is more convenient for frequent use. You generate a key pair and add the public key to GitHub:

In [None]:
# Generate an SSH key
ssh-keygen -t ed25519 -C "your.email@example.com"

# Add the public key to your GitHub account settings

# Test the connection
ssh -T git@github.com

# Use SSH URLs for remotes
git remote set-url origin git@github.com:username/repository.git

### Question

You cloned a repository yesterday and have been working locally. Your colleague tells you she pushed important bug fixes this morning. You also have uncommitted changes. What's the safest sequence of actions?

1. Run `git push` to send your changes first
2. Run `git pull` to get her changes
3. Run `git stash` to save your changes, then `git pull`, then `git stash pop`
4. Run `git fetch` to see what changed, then decide

**Answer:**

**Option 4** is the safest approach. `git fetch` downloads the remote changes without modifying your working directory, letting you review what changed before integrating.

After fetching, you can:
- See the incoming commits: `git log origin/master --oneline`
- See what files changed: `git diff master origin/master --stat`
- Decide how to proceed

**Option 3** is also reasonable if you're confident you want the remote changes. `git stash` temporarily saves your uncommitted work, lets you pull, then `git stash pop` restores your work.

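**Option 1** would fail because the remote has new commits you don't have, so Git rejects the push. It also would not send your uncommitted changes, since `git push` transfers only commits.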

**Option 2** might work, but if the incoming commits touch files you have modified and not yet committed, Git aborts the merge step and asks you to commit or stash your changes first.

## Collaboration Workflows

GitHub and similar platforms provide tools for structured collaboration. Understanding these workflows is essential for contributing to open source projects or working on research teams.

### Forking

A fork is your personal copy of someone else's repository. Forking is how you contribute to projects you don't have write access to.

1. Click "Fork" on the GitHub repository page
2. Clone your fork: `git clone https://github.com/YOUR-USERNAME/repository.git`
3. Add the original repository as a remote:

In [None]:
git remote add upstream https://github.com/ORIGINAL-OWNER/repository.git

Now you have two remotes:
- `origin`: your fork (you can push to this)
- `upstream`: the original repository (you can pull from this)

### Keeping Your Fork Updated

The original repository continues to evolve. Keep your fork current:

In [None]:
# Fetch changes from the original repository
git fetch upstream

# Merge upstream changes into your master branch
git switch master
git merge upstream/master

# Push the updates to your fork
git push origin master

### Pull Requests

A pull request (PR) proposes changes from your fork or branch to another branch (usually the original repository's master branch).

**Creating a Pull Request:**

1. Create a branch for your changes:

In [None]:
git switch -c fix-bug

2. Make and commit your changes:

In [None]:
# ... edit files ...
git add hello.py
git commit -m "Fix greeting for edge case"

3. Push the branch to your fork:

In [None]:
git push -u origin fix-bug

4. On GitHub, click "Compare & pull request"

5. Write a description explaining:
   - What the change does
   - Why it's needed
   - How you tested it

### Code Review

Pull requests enable code review, where others examine your changes before they're merged:

- Reviewers can comment on specific lines
- They might request changes or ask questions
- Discussion happens in the PR, creating a record

**Responding to review feedback:**

In [None]:
# Make requested changes
# ... edit files ...
git add hello.py
git commit -m "Address review: add input validation"
git push

New commits are automatically added to the existing pull request.

### Handling Conflicts in Pull Requests

If the base branch has changed since you created your PR, there might be conflicts. GitHub will indicate if the PR cannot be automatically merged.

To resolve:

In [None]:
# Fetch the latest changes
git fetch upstream

# Merge them into your branch
git switch fix-bug
git merge upstream/master

# Resolve any conflicts, then
git add .
git commit -m "Merge upstream master and resolve conflicts"
git push

### Issues

GitHub Issues track bugs, feature requests, and tasks. Issues can be:

- Linked to pull requests ("Fixes #42")
- Assigned to team members
- Labeled for organization
- Used for project planning

When your commit message includes "Fixes #42" or "Closes #42", the issue is automatically closed when the PR is merged.
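
As a rough sketch of how such references can be recognized, the regular expression below matches GitHub's documented closing keywords (close, fix, resolve, and their "-s"/"-d" forms) followed by an issue number. It is a simplified approximation written for this tutorial, not GitHub's own matching logic. The names `CLOSING_PATTERN` and `issues_closed_by` are invented.

```python
import re

# Closing keywords followed by whitespace and "#<number>".
CLOSING_PATTERN = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def issues_closed_by(message):
    """Return the issue numbers a message would close, in order of appearance."""
    return [int(number) for number in CLOSING_PATTERN.findall(message)]


print(issues_closed_by("Fix greeting for edge case\n\nFixes #42"))  # [42]
print(issues_closed_by("Mention #7, closes #8 and resolves #9"))    # [8, 9]
```

Note that a bare mention such as `#7` links to the issue but does not close it. Only a keyword directly before the number does.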

### Question

You want to contribute a bug fix to an open source project. The project is at `github.com/example/project`. Put these steps in the correct order:

A. Create a branch for your fix
B. Fork the repository on GitHub
C. Submit a pull request
D. Clone your fork to your computer
E. Push your branch to your fork
F. Make and commit your changes

**Answer:**

The correct order is: **B, D, A, F, E, C**

1. **B. Fork the repository** - Create your own copy on GitHub
2. **D. Clone your fork** - Get the code on your computer
3. **A. Create a branch** - Isolate your changes from master
4. **F. Make and commit changes** - Do the actual work
5. **E. Push to your fork** - Upload your branch to GitHub
6. **C. Submit a pull request** - Propose your changes to the original project

## Best Practices

Following best practices makes version control more effective for you and your collaborators.

### Writing Good Commit Messages

A commit message should explain *why* a change was made, not just what changed. The code shows what changed; the message provides context.

**Good commit messages:**

```
Fix variance calculation to use Bessel's correction

The previous implementation divided by n, giving the population
variance. For sample data, we need to divide by n-1 to get an
unbiased estimate.
```

```
Add bootstrap confidence interval function

Implements the percentile method for bootstrap CIs, which doesn't
require normality assumptions. Uses 10000 resamples by default.
```

**Poor commit messages:**

```
fix bug
```

```
update analysis.py
```

```
WIP
```

**Conventions:**

- First line: aim for 50 characters or fewer, imperative mood ("Fix bug" not "Fixed bug")
- Blank line after the first line
- Body: explain why, not what (the diff shows what)
- Reference issues if applicable: "Fixes #42"
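
The two mechanical conventions (subject length and the blank second line) are easy to check automatically. Below is a minimal sketch of such a check. The function `check_commit_message` and its `max_subject_length` parameter are hypothetical and not part of Git. Mood and wording still need human judgment.

```python
def check_commit_message(message, max_subject_length=50):
    """Return a list of convention problems found in a commit message."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["subject line is empty"]

    problems = []
    subject = lines[0]
    if len(subject) > max_subject_length:
        problems.append(
            f"subject is {len(subject)} characters (limit {max_subject_length})"
        )
    # If there is a body, it must be separated from the subject by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    return problems


good = "Fix off-by-one error in bootstrap resampling loop"
long_subject = (
    "Add comprehensive data validation with input checking, "
    "type verification, and boundary testing for all numeric inputs"
)
missing_blank = "Add bootstrap CI function\nImplements the percentile method."

print(check_commit_message(good))           # []
print(check_commit_message(long_subject))   # one problem: subject too long
print(check_commit_message(missing_blank))  # ['second line should be blank']
```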

### Commit Often and Logically

Each commit should represent one logical change. This makes it easier to:

- Understand what changed and why
- Revert specific changes if needed
- Review code in pull requests

**Good practice:**

In [None]:
git commit -m "Add function to load gene expression data"
git commit -m "Add function to normalize expression values"
git commit -m "Add function to identify differentially expressed genes"

**Poor practice:**

In [None]:
git commit -m "Add data loading, normalization, and DE analysis"

### What Not to Commit

Some files should not be tracked in version control.

**Large data files:** Git is designed for text files, not large binary files. Store data elsewhere and document how to obtain it.

**Generated files:** Don't commit files that can be regenerated from source (compiled code, rendered documents).

**Secrets:** Never commit passwords, API keys, or other credentials.

**Environment-specific files:** Editor settings, IDE configurations, virtual environments.

### Using .gitignore

The `.gitignore` file tells Git which files to ignore.

Create a file called `.gitignore` in your `git-tutorial` directory with the following content:

```gitignore
# Data files
*.csv
*.xlsx
data/

# Python
__pycache__/
*.pyc
.venv/
*.egg-info/

# Jupyter notebook checkpoints
.ipynb_checkpoints/

# IDE settings
.vscode/
.idea/

# OS files
.DS_Store
Thumbs.db

# Secrets
.env
credentials.json
```

Stage and commit it:

In [None]:
git add .gitignore
git commit -m "Add .gitignore"

Now let's test it. Create a file that should be ignored:

In [None]:
echo "name,value" > test_data.csv
git status

Notice that `test_data.csv` doesn't appear in the status output because `.csv` files are ignored.
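
To build intuition for how these patterns match, here is a rough approximation using Python's standard `fnmatch` module. It handles only two simple cases (file-name globs and directory patterns ending in `/`) and ignores most of the real `.gitignore` rules, such as negation with `!` and anchored paths. The `is_ignored` helper is invented for this sketch.

```python
from fnmatch import fnmatch

# A few patterns from the .gitignore above. Patterns ending in "/" are
# directory patterns, which this sketch treats as "ignore everything inside".
patterns = ["*.csv", "*.pyc", "data/", "__pycache__/", ".env"]


def is_ignored(path):
    """Rough approximation of .gitignore matching for simple patterns."""
    parts = path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignored if any parent directory matches.
            if any(fnmatch(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif fnmatch(parts[-1], pattern):
            # File pattern without a slash: compare against the file name only.
            return True
    return False


print(is_ignored("test_data.csv"))        # True
print(is_ignored("data/raw/values.txt"))  # True
print(is_ignored("hello.py"))             # False
```

For real projects, `git status` (as used above) remains the reliable way to confirm what Git ignores.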

Clean up the test file:

In [None]:
rm test_data.csv

### Question

Evaluate these commit messages. For each one, explain whether it's good or poor and why.

1. `Updated code`
2. `Fix off-by-one error in bootstrap resampling loop`
3. `Add comprehensive data validation with input checking, type verification, and boundary testing for all numeric inputs`
4. `WIP: trying different approach`

**Answer:**

1. **Poor.** "Updated code" says nothing about what changed or why. Every commit updates code. This message provides no useful information.

2. **Good.** This clearly identifies what was fixed (off-by-one error) and where (bootstrap resampling loop). A reader immediately knows what this commit addresses.

3. **Poor.** While descriptive, this message is far too long for a first line (aim for 50 characters or fewer). It also suggests the commit might be doing too much. Better to split into separate commits or use a shorter first line with details in the body.

4. **Poor.** WIP (work in progress) commits are sometimes useful during development, but should be cleaned up before sharing. This message doesn't explain what approach is being tried or why.

## Recommended Resources

- [Pro Git Book](https://git-scm.com/book/en/v2) - The comprehensive, free Git reference
- [GitHub Skills](https://skills.github.com/) - Interactive Git and GitHub tutorials
- [Atlassian Git Tutorials](https://www.atlassian.com/git/tutorials) - Visual explanations of Git concepts
- [Oh Shit, Git!?!](https://ohshitgit.com/) - How to fix common Git mistakes
- [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf) - Quick reference for common commands
- [Learn Git Branching](https://learngitbranching.js.org/) - Visual, interactive branching tutorial
- [Oh My Git!](https://ohmygit.org/) - Open source game for learning Git