---
title: "Setting Up Your Toolkit"
jupyter: python3
execute:
    enabled: true
---

**What you'll learn in this module:**

This module guides you through setting up your Python development environment. You'll learn how to create a virtual environment using `uv`, install all necessary packages for this course, and verify your installation by importing key libraries we'll use throughout the semester. You'll then practice version control with Git: committing, branching, merging, resolving conflicts, and pushing your work to GitHub.

## Why a Consistent Environment Matters

Before we dive into data analysis, we need to get our tools in order. A well-configured development environment saves you from dependency conflicts, version mismatches, and the dreaded "works on my machine" problem.

We'll use `uv` to manage our Python environment. If you're familiar with `pip` and `virtualenv`, think of `uv` as a faster, more reliable alternative. It handles package installation and dependency resolution with impressive speed.

The best part? It works seamlessly with the `pyproject.toml` file that defines all course dependencies. One command installs everything you need.

## Setting Up Your Environment

> [!IMPORTANT]
> **Windows Users:** This course uses Unix-based command line tools. You need to install Windows Subsystem for Linux (WSL) to follow along. WSL provides a Linux environment directly on Windows.
>
> **To install WSL:**
> 1. Open PowerShell as Administrator
> 2. Run: `wsl --install`
> 3. Restart your computer
> 4. After restart, Ubuntu will open automatically to complete setup
> 5. Create a username and password when prompted
>
> Once installed, use the Ubuntu terminal (not PowerShell or Command Prompt) for all course commands. You can find it by searching "Ubuntu" in the Start menu.
>
> For more details, see [Microsoft's WSL installation guide](https://learn.microsoft.com/en-us/windows/wsl/install).

Let's walk through the setup process. Open your terminal (macOS/Linux: Terminal app, Windows: Ubuntu from WSL) and navigate to the project directory.

The first step is creating a virtual environment:

```bash
uv venv
```

This creates an isolated Python environment in a `.venv` directory. Virtual environments keep project dependencies separate from your system Python, preventing conflicts between different projects.

> [!NOTE]
> If you don't have `uv` installed yet, install it with:
> ```bash
> curl -LsSf https://astral.sh/uv/install.sh | sh
> ```
> This command works on macOS, Linux, and WSL on Windows.

The second step is activating the environment:

```bash
source .venv/bin/activate
```

Your terminal prompt now starts with a name in parentheses, such as `(.venv)` or the name of your project directory. This prefix indicates the environment is active.
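If the prompt is hard to read, you can also confirm from Python itself that a virtual environment is active. The sketch below relies on the standard library's `sys.prefix` and `sys.base_prefix`, which differ inside a virtual environment. The helper name `in_virtualenv` is only an illustration, not part of the course code.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

Run it with `python` after activating the environment. If it reports `False`, the activation step most likely did not take effect in the current terminal.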

The third step is installing all course dependencies. Since this project uses a `pyproject.toml` file, one command installs everything:

```bash
uv pip install -e .
```

The `-e` flag installs the project in "editable" mode, which is useful if you're developing or modifying code. The installation might take a few minutes since we're installing many powerful libraries for data science, machine learning, and network analysis.

## Verifying Your Installation

Now comes the moment of truth. Let's verify that all required libraries are properly installed. The code cell below imports every major library we'll use in this course.

If everything is set up correctly, this cell will run without errors. If you encounter import errors, double-check that you activated your virtual environment and ran the installation command.

```python
# Core data science libraries
import numpy as np
import pandas as pd
import scipy

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
import altair as alt

# Machine learning libraries
import sklearn
import torch
import torchvision

# Natural language processing libraries
import transformers
import sentence_transformers

# Network analysis libraries
import networkx as nx
import igraph

# LLM and agentic AI tools
import ollama
import langchain_core
import langchain_ollama
import langgraph

# Utility libraries
import requests
from PIL import Image
from tqdm import tqdm
import pydantic

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
```

If you see "All libraries imported successfully!" along with version numbers, congratulations! Your environment is ready. You now have access to some of the most powerful tools in data science, machine learning, and AI.
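If the cell fails instead, it stops at the first broken import, which can hide other missing packages. As a rough troubleshooting sketch, the standard library's `importlib.util.find_spec` can report every missing top-level package at once. The helper name `find_missing` and the short `required` list are illustrative assumptions; extend the list with the import names you care about.

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be located for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names only (for example "sklearn", not "scikit-learn").
required = ["numpy", "pandas", "torch"]
missing = find_missing(required)
print("Missing packages:", missing if missing else "none")
```

Note that this only checks whether a package can be found, not whether it imports cleanly, so the full import cell above remains the real test.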

## Version Control with Git

Now that your environment is ready, let's master version control. Think of Git as a time machine for your code. It lets you save snapshots of your work, experiment with new ideas without fear, and collaborate with others without stepping on each other's toes.

The pen-and-paper exercise introduced you to Git's conceptual model. Now we'll put those concepts into practice with hands-on exercises that mirror real-world workflows.

> [!NOTE]
> **Before you start:** You should already have this repository cloned on your computer. If you're reading this file locally, you're in the right place! If not, check the GitHub Classroom assignment link on Brightspace to get your repository.

### Creating Your First Commits

Let's create a simple Python script and track it with Git. Make sure you're in your repository directory (the folder containing this notebook).

Create a new file called `analysis.py` in your repository directory with the following code:

```python
def calculate_mean(data):
    total = sum(data)
    return total / len(data)
```

Save the file. Now check the status of your repository:

```bash
git status
```

You'll see `analysis.py` listed as an "untracked file." Git sees the file exists but isn't tracking its history yet. This is Area 1 from the pen-and-paper exercise, where all your files live.

<img src="figs/git-untracked.svg" width="600" alt="The working directory contains your file, but Git isn't tracking it yet">

The diagram shows your file living in the working directory, while the staging area and repository remain empty.

Add the file to the staging area:

```bash
git add analysis.py
```

This moves the file to Area 2, the staging area. You're telling Git "I want this file in my next snapshot." Check the status again:

```bash
git status
```

Now `analysis.py` is listed as "Changes to be committed." This is your staging area. You've chosen what goes into your next snapshot.

<img src="figs/git-staged.svg" width="600" alt="The file moves to the staging area, ready to be committed">

The file has moved to the staging area. Think of this as framing the shot before taking the picture.

Before making your first commit, you need to tell Git who you are. If you haven't configured your identity on this computer yet, run:

```bash
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
```

Replace "Your Name" with your actual name and use the email associated with your GitHub account. The `--global` flag sets this for all repositories on your computer.

Now create your first commit:

```bash
git commit -m "Add mean calculation function"
```

If you forgot to configure your identity, Git will display an error message with instructions. Just run the config commands above and try the commit again.

Congratulations! You've created your first snapshot. The `-m` flag lets you add a message describing what this snapshot contains. Always write clear, descriptive commit messages. Your future self will thank you.

<img src="figs/git-first-commit.svg" width="600" alt="Your first commit creates a permanent snapshot in the repository">

Your first commit (C1) is now safely stored in the repository. This is a permanent snapshot you can always return to.

### Fixing a Bug

The function above has a critical bug. It crashes when given an empty list. Let's fix it and commit the fix.
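To see the crash for yourself, here is a minimal reproduction using the original version of the function. With an empty list, `len(data)` is 0, so the division raises `ZeroDivisionError`.

```python
def calculate_mean(data):
    total = sum(data)
    return total / len(data)


try:
    calculate_mean([])
except ZeroDivisionError as error:
    # sum([]) is 0 and len([]) is 0, so Python attempts 0 / 0.
    print(f"Crashed with ZeroDivisionError: {error}")
```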

Open `analysis.py` in your text editor and replace its contents with this improved version:

```python
def calculate_mean(data):
    if len(data) == 0:
        return 0
    total = sum(data)
    return total / len(data)
```

Save the file. Now check what changed:

```bash
git diff
```

The `git diff` command shows exactly what changed. Lines starting with `-` were removed. Lines starting with `+` were added. This is incredibly useful for reviewing your work before committing.
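The `+`/`-` notation is the "unified diff" format, and it is not specific to Git. As an illustrative sketch, Python's standard `difflib` module produces similar output for the two versions of `analysis.py`. The `a/` and `b/` file labels here are chosen to resemble Git's output and are otherwise arbitrary.

```python
import difflib

old_version = """\
def calculate_mean(data):
    total = sum(data)
    return total / len(data)
"""

new_version = """\
def calculate_mean(data):
    if len(data) == 0:
        return 0
    total = sum(data)
    return total / len(data)
"""

# Compare the two versions line by line and collect the unified diff.
diff_lines = list(
    difflib.unified_diff(
        old_version.splitlines(),
        new_version.splitlines(),
        fromfile="a/analysis.py",
        tofile="b/analysis.py",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Because this fix only adds lines, the output contains `+` lines for the new empty-list check and no `-` lines. Unchanged lines appear with a leading space as context.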

Stage and commit the fix:

```bash
git add analysis.py
git commit -m "Fix crash when data is empty"
```

You now have two commits in your history. View them:

```bash
git log --oneline
```

Each commit has a unique identifier (a hash) and your descriptive message. This is your project's timeline.
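The hash is not random: Git derives it from the content it stores. As an illustration for file contents rather than whole commits, and assuming a repository that uses Git's default SHA-1 object format, the sketch below reproduces what `git hash-object` computes for a file. The helper name `git_blob_hash` is hypothetical.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a file ("blob") with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b"def calculate_mean(data):\n"))
```

Identical content always yields the same identifier, and any change to the content yields a different one. Commit hashes follow the same idea, but they also cover the author, the message, and the parent commit.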

<img src="figs/git-two-commits.svg" width="600" alt="Commit history grows as a timeline, each commit building on the previous one">

Each commit points to its parent, forming a chain. The `main` branch pointer moves forward with each new commit.
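To make the "chain of parents" idea concrete, here is a toy model in Python. It is a deliberate simplification, not how Git is implemented: the `Commit` class, the `history` helper, and the `branches` dictionary are all illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"] = None  # the first commit has no parent


def history(tip):
    """Walk parent links from a tip commit back to the first commit."""
    messages = []
    while tip is not None:
        messages.append(tip.message)
        tip = tip.parent
    return messages


c1 = Commit("Add mean calculation function")
c2 = Commit("Fix crash when data is empty", parent=c1)

# A branch is just a name that points at one commit.
branches = {"main": c2}
print(history(branches["main"]))
```

Walking from `main` back through the parents lists the newest commit first, which is the same order `git log` uses.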

### Working with Branches

Branches let you experiment without breaking your stable code. Let's say you want to try a completely new approach to calculating statistics, but you're not sure it will work. Branches give you a safe playground.

Create a new branch called `add-median`:

```bash
git branch add-median
git checkout add-median
```

The `git branch` command creates the branch. The `git checkout` command switches to it. You can combine these steps with `git checkout -b add-median`.

Check which branch you're on:

```bash
git branch
```

The asterisk shows your current branch.

<img src="figs/git-branch-creation.svg" width="600" alt="Creating a new branch creates a new pointer to the current commit">

Both branches point to the same commit. The `HEAD` pointer shows which branch you're currently on (the one that will move when you make a new commit).

Now open `analysis.py` in your text editor and add this new function at the end of the file:

```python
def calculate_median(data):
    if len(data) == 0:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    return sorted_data[n//2]
```

Your file should now have both `calculate_mean` and `calculate_median` functions. Save the file, then stage and commit the new function:

```bash
git add analysis.py
git commit -m "Add median calculation function"
```

<img src="figs/git-branch-diverge.svg" width="600" alt="Committing on a branch moves that branch pointer forward, creating a divergent history">

The `add-median` branch has moved forward to the new commit (C3), while `main` still points to C2. This is how branches let you experiment safely.

Now for the magic. Switch back to your main branch:

```bash
git checkout main
cat analysis.py
```

Notice the median function disappeared! It's not gone; it lives on the `add-median` branch. Your main branch remains exactly as you left it. This is the power of branches.

### Merging Branches

Suppose you've checked the median function and are happy with it. Time to bring it into the main branch. This process is called merging.
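Before merging, it can be worth a quick check that the function behaves as expected. A minimal sketch, comparing `calculate_median` against the standard library's `statistics.median` on a few hand-picked inputs (the sample lists are arbitrary):

```python
import statistics


def calculate_median(data):
    if len(data) == 0:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    return sorted_data[n//2]


# Odd length, even length, and a single element.
samples = [[3, 1, 2], [4, 1, 3, 2], [7]]
for sample in samples:
    assert calculate_median(sample) == statistics.median(sample)

print("calculate_median agrees with statistics.median on all samples")
```

The empty list is left out on purpose: `calculate_median` returns 0 there, while `statistics.median` raises an error for empty input.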

Make sure you're on the main branch:

```bash
git checkout main
```

Merge the `add-median` branch:

```bash
git merge add-median
```

Git automatically combines the work from both branches. Check your file:

```bash
cat analysis.py
```

The median function is now in main. Your experiment succeeded and you've integrated it into your stable code. This merge was smooth because no conflicts existed.

<img src="figs/git-merge-simple.svg" width="600" alt="A fast-forward merge moves the main branch pointer to catch up with the feature branch">

This was a "fast-forward" merge. Git simply moved the `main` pointer forward because no other work happened on main while you were working on the branch.
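In terms of pointers, a fast-forward is possible whenever the commit `main` points to is an ancestor of the branch being merged. The toy model below sketches that rule for the history in this section. It assumes single-parent commits only, and the names `is_ancestor`, `parents`, and `branches` are illustrative, not Git internals.

```python
def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    current = descendant
    while current is not None:
        if current == ancestor:
            return True
        current = parents[current]
    return False


# Toy history: C1 <- C2 <- C3, where each commit maps to its parent.
parents = {"C1": None, "C2": "C1", "C3": "C2"}
branches = {"main": "C2", "add-median": "C3"}

# main (C2) is an ancestor of add-median (C3), so merging only moves the pointer.
if is_ancestor(parents, branches["main"], branches["add-median"]):
    branches["main"] = branches["add-median"]

print(branches)
```

If `main` had gained its own commits in the meantime, the check would fail and Git would need to create a merge commit instead.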

### Understanding Merge Conflicts

Real collaboration isn't always this smooth. Sometimes two branches modify the same lines of code in different ways. When this happens, Git can't automatically decide which version to keep. You must resolve the conflict manually.

Let's create a merge conflict deliberately so you know how to handle it. Create two branches that modify the same function in different ways.

First, create a branch that changes the mean function to use NumPy:

```bash
git checkout -b use-numpy-mean
```

Now open `analysis.py` and replace its entire contents with this version that uses NumPy:

```python
import numpy as np

def calculate_mean(data):
    if len(data) == 0:
        return 0
    return np.mean(data)

def calculate_median(data):
    if len(data) == 0:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    return sorted_data[n//2]
```

Save the file, then commit:

```bash
git add analysis.py
git commit -m "Use NumPy for mean calculation"
```

Now switch back to main and create a different branch that also modifies the mean function:

```bash
git checkout main
git checkout -b add-mean-docstring
```

Open `analysis.py` and replace its entire contents with this version that adds a docstring:

```python
def calculate_mean(data):
    """Calculate the arithmetic mean of a dataset.

    Returns 0 for empty datasets."""
    if len(data) == 0:
        return 0
    total = sum(data)
    return total / len(data)

def calculate_median(data):
    if len(data) == 0:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    return sorted_data[n//2]
```

Save the file, then commit:

```bash
git add analysis.py
git commit -m "Add docstring to mean function"
```

Now merge the first branch into main:

```bash
git checkout main
git merge use-numpy-mean
```

This succeeds because main hasn't changed since we branched. Now try to merge the second branch:

```bash
git merge add-mean-docstring
```

Conflict! Git displays a message like `CONFLICT (content): Merge conflict in analysis.py`, followed by a note that the automatic merge failed. Both branches changed `calculate_mean`, and Git couldn't decide which version to keep.

### Resolving the Conflict

Open `analysis.py` in a text editor. You'll see conflict markers:

```
<<<<<<< HEAD
import numpy as np

def calculate_mean(data):
    if len(data) == 0:
        return 0
    return np.mean(data)
=======
def calculate_mean(data):
    """Calculate the arithmetic mean of a dataset.

    Returns 0 for empty datasets."""
    if len(data) == 0:
        return 0
    total = sum(data)
    return total / len(data)
>>>>>>> add-mean-docstring
```

Everything between `<<<<<<< HEAD` and `=======` is the current version (from main). Everything between `=======` and `>>>>>>> add-mean-docstring` is the incoming version (from the branch you're merging).
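To make the marker layout concrete, here is a small sketch that splits a conflicted region into its two sides. It assumes a single conflict in Git's default marker style, and `split_conflict` is a hypothetical helper written for this illustration, not a Git feature.

```python
def split_conflict(text):
    """Split one conflicted region into (ours, theirs) lists of lines."""
    ours, theirs = [], []
    target = None  # the side that the current line belongs to
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            target = ours      # lines after this marker come from HEAD
        elif line.startswith("=======") and target is ours:
            target = theirs    # lines after this marker come from the incoming branch
        elif line.startswith(">>>>>>>"):
            target = None      # end of the conflicted region
        elif target is not None:
            target.append(line)
    return ours, theirs


sample = """<<<<<<< HEAD
    return np.mean(data)
=======
    total = sum(data)
    return total / len(data)
>>>>>>> add-mean-docstring"""

ours, theirs = split_conflict(sample)
print("Current (HEAD):", ours)
print("Incoming:", theirs)
```

In practice you resolve conflicts by editing the file by hand, as described next. The sketch is only meant to show which lines belong to which side.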

You must manually edit the file to keep what you want. Let's combine the best of both: use NumPy and add the docstring.

Open `analysis.py` in your text editor. Delete all the conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) and replace the entire file contents with this resolved version:

```python
import numpy as np

def calculate_mean(data):
    """Calculate the arithmetic mean of a dataset using NumPy.

    Returns 0 for empty datasets."""
    if len(data) == 0:
        return 0
    return np.mean(data)

def calculate_median(data):
    if len(data) == 0:
        return 0
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n % 2 == 0:
        return (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
    return sorted_data[n//2]
```

Save the file. After resolving the conflict, stage the resolved file:

```bash
git add analysis.py
```

Complete the merge by creating a commit:

```bash
git commit -m "Merge add-mean-docstring, combining NumPy and documentation"
```

Check your history:

```bash
git log --oneline --graph --all
```

The `--graph` flag shows your branch structure visually. You can see where branches diverged and where they merged back together.

<img src="figs/git-merge-conflict.svg" width="700" alt="A complex merge with conflict resolution brings together divergent histories into a single merge commit">

This diagram shows the complete merge workflow. Both branches diverged from C2. The first branch (use-numpy-mean) merged cleanly into main at C4. The second branch (add-mean-docstring) created a merge commit (M) because it conflicted with changes from C4. The merge commit has two parents, combining work from both branches.

### Pushing to GitHub

You've been making commits locally, but they only exist on your computer so far. Let's push them to GitHub so they're backed up and your instructor can see your work.

> [!IMPORTANT]
> Your instructor will review your work directly on GitHub. Work that exists only on your local computer cannot be seen or graded. You must push your commits to make them visible.

Since you cloned this repository, the connection to GitHub (called the "remote") is already set up. The remote is named `origin`. You can verify this:

```bash
git remote -v
```

This shows you the URL of your GitHub repository. Now push your commits:

```bash
git push origin main
```

The `git push` command uploads your local commits to GitHub. The word `origin` refers to your GitHub repository, and `main` is the branch you're pushing.

Visit your repository page on GitHub and refresh it. You should now see your `analysis.py` file and all your commits! Click on "commits" to see your commit history.

**Develop a habit of pushing frequently.** After completing each exercise section, push your work. This keeps your local repository and GitHub in sync.

Now push your other branches:

```bash
git push origin add-median
git push origin use-numpy-mean
git push origin add-mean-docstring
```

Visit your repository on GitHub. Click the "branches" dropdown near the top left (next to the branch icon). You'll see all your branches listed. GitHub provides a visual interface for viewing diffs, commit history, and branch structure.

### Syncing Your Work

If you've been working on another computer or if someone else pushed changes to GitHub, you can pull those changes:

```bash
git pull origin main
```

This fetches and merges changes from GitHub into your local branch. Always pull before you start working and push when you're done to keep everything synchronized.

> [!IMPORTANT]
> **Verify Your Work is Pushed**
>
> Your instructor will review your work directly on GitHub. Before you finish, ensure everything is pushed.
>
> **Check synchronization:**
>
> ```bash
> git status
> ```
>
> You should see `Your branch is up to date with 'origin/main'` and `nothing to commit, working tree clean`.
>
> **Verify on GitHub:**
>
> Visit your repository page on GitHub and check that you see:
>
> - [ ] The `analysis.py` file with both `calculate_mean()` and `calculate_median()` functions
> - [ ] At least 5 commits in your history
> - [ ] Multiple branches: `main`, `add-median`, `use-numpy-mean`, `add-mean-docstring`
> - [ ] A merge commit showing conflict resolution
>
> **If anything is missing, push it:**
>
> ```bash
> git push origin main
> git push origin branch-name
> ```

> [!TIP]
> **Practice makes perfect**
>
> The more you use Git, the more natural it becomes:
>
> - Make commits as you work, not just at the end
> - Push frequently to back up your progress
> - Experiment with branches for different approaches
> - Use `git log --oneline --graph --all` to visualize your history
>
> Soon branching and merging will be second nature!

### Why This Matters

Version control transforms how you work. You can experiment fearlessly knowing you can always return to a working state. You can try multiple approaches in parallel. You can collaborate with teammates without coordination headaches. You can review your project's evolution to understand how you got here.

Every professional software project uses version control. Data science projects benefit just as much. When your analysis depends on dozens of interdependent scripts and notebooks, version control becomes essential for reproducibility and collaboration.

Throughout this course, we'll use Git to track our work. You'll create commits after completing exercises, branch when experimenting with different approaches, and merge when you've found solutions that work. These workflows will become second nature.

Let's get started.