# Chapter 0: Orientation & Setup
## *Python for AI/ML: A Complete Learning Journey*

[![Back to TOC](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)](Python_for_AIML_TOC.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH00_Orientation_and_Setup.ipynb)

---

**Part:** Pre-requisite (before Part 1)  
**Prerequisites:** None; this is where everyone starts  
**Estimated time:** 45–60 minutes  

---

### 🎯 Learning Objectives

By the end of this chapter you will be able to:

- Navigate the Google Colab interface confidently
- Connect Google Drive to a Colab session to save your work
- Enable GPU/TPU runtimes for deep learning chapters
- Install and import Python packages in Colab using `!pip install`
- Load and preview the Stack Overflow 2025 dataset that runs through the entire book
- Understand the structure and key columns of the project dataset
- (Optional) Set up a local Python environment with Anaconda and VS Code

---

### 🧵 Project Thread: Chapter 0

This chapter introduces the **Stack Overflow 2025 Developer Survey**, the dataset  
that powers every hands-on example across all 9 chapters of this book.  
By the end of this chapter, you will have it loaded, previewed, and understood.


---

## Section 0.1: What This Book Is and How to Use It

### The Three-Part Journey

This book is structured as a deliberate, cumulative learning arc across three parts:

- **Part 1 (Chapters 1–2), Core Python Fundamentals:** Variables, control flow, data structures, functions, OOP, file I/O, and error handling. Pure Python, no external libraries.
- **Part 2 (Chapters 3–4), Data Science Foundations:** NumPy, Pandas, Matplotlib, and Seaborn. The tools every data scientist uses daily.
- **Part 3 (Chapters 5–9), Machine Learning & AI:** SciPy, scikit-learn, PyTorch, Keras 3, NLP, Hugging Face Transformers, and professional deployment practices.

### The Project Thread

A single dataset, the **Stack Overflow 2025 Developer Survey**, runs through every chapter.  
You will load it in Chapter 0, clean it in Chapter 3, visualize it in Chapter 4, build ML models on it in Chapters 6 and 7, run NLP on its text fields in Chapter 8, and deploy a model trained on it in Chapter 9.

This continuity is intentional: real data science work involves a single dataset across many weeks and many techniques. This book mirrors that reality.

### How to Use Each Notebook

Every chapter notebook follows the same structure:
1. **Section header:** context and learning objectives
2. **Pre-code explanation:** plain-English description of what the code does and why
3. **Code cell:** Python code with inline comments explaining *why* each line works
4. **Post-code interpretation:** what the output means, common errors, key takeaways
5. **Chapter summary:** skills gained and preview of what's next

> 💡 **The Golden Rule:** Comments in every code cell explain **WHY** a line is there, not just what it does.  
> Bad: `# divide by the maximum value`  
> Good: `# divide by maximum value to compress all salaries into the 0–1 range that gradient descent needs`
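
As a small illustration of the rule, the sketch below scales a handful of made-up salary figures into the 0–1 range by dividing by the maximum. The sample values and the helper name `scale_by_max` are assumptions for illustration only; they are not taken from the book's dataset or code.

```python
def scale_by_max(values):
    """Return each value divided by the largest value in the list."""
    # Dividing by the maximum preserves each salary's proportion to the others
    # while compressing the range to 0-1, so features share a comparable scale.
    largest = max(values)
    return [v / largest for v in values]


# Hypothetical salaries in USD, chosen only to make the arithmetic easy to follow.
sample_salaries = [50_000, 100_000, 200_000]
scaled = scale_by_max(sample_salaries)
print(scaled)
```

The comment inside the function states the reason for the division, which is the part a reader cannot recover from the code alone.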


---

## Section 0.2: Google Colab Interface Orientation

Google Colab is a free, cloud-based Jupyter Notebook environment that runs entirely in your browser.  
It requires no local installation, gives you access to free GPUs, and saves notebooks directly to Google Drive.

### Key Interface Elements

| Element | Location | What It Does |
|---|---|---|
| **Runtime menu** | Top menu bar | Controls the Python kernel: restart, reconnect, change hardware |
| **+ Code / + Text** | Top left under menu | Adds a new code or markdown cell below the current one |
| **▶ Run button** | Left of each cell | Executes that cell; keyboard shortcut: `Shift + Enter` |
| **RAM / Disk indicator** | Top right | Shows current memory usage; click it for full resource details |
| **Table of Contents** | Left sidebar (≡) | Navigates by markdown headers; use this to jump between sections |
| **Files** | Left sidebar (📁) | Browse the Colab filesystem and upload local files |

### Essential Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| `Shift + Enter` | Run current cell, move to next |
| `Ctrl + Enter` | Run current cell, stay |
| `Ctrl + M`, then `B` | Insert cell below |
| `Ctrl + M`, then `D` | Delete current cell |
| `Ctrl + M`, then `M` | Convert cell to Markdown |
| `Ctrl + M`, then `Y` | Convert cell to Code |
| `Ctrl + /` | Comment / uncomment selected lines |

Let's run our very first cell. Printing a greeting is the traditional first program in most languages: it confirms that the environment is working and introduces the `print()` function.


In [None]:
# ─────────────────────────────────────────────────────────────────
# Our very first Python statement.
#
# print() is a built-in Python function that displays output to the
# screen. We pass it a "string" (text wrapped in quotes) as input,
# and it prints that text followed by a newline character.
#
# This is the traditional first program in any language: simple,
# instant feedback, and proof that everything is working.
# ─────────────────────────────────────────────────────────────────

print("Hello from Google Colab - Python for AI/ML is ready!")

# We can print multiple things on separate lines by calling print() again.
# Each call to print() starts on a new line automatically.
print("Python version check coming next...")

# We can also print the result of an expression directly.
# Python evaluates the expression first, then passes the result to print().
print("2025 minus 1991 =", 2025 - 1991)   # just a fun calculation


### Checking Your Python Version

Before we go further, let's confirm which version of Python is running.  
This book requires Python 3.10 or higher. Google Colab always provides a recent version, so this is mainly for awareness.


In [None]:
# The 'sys' module is part of Python's standard library: it ships with Python
# and never needs to be installed. It gives us access to system-level information
# like the Python version, the operating system, and the file system path.
import sys  # 'import' makes a module available in the current session

# sys.version is a string containing the full Python version details.
# f-strings (formatted strings) let us embed Python expressions directly
# inside a string using curly braces {}.
print(f"Python version: {sys.version}")

# sys.version_info is a named tuple, so we can access its fields by name.
# We check that the major version is 3 and minor is at least 10.
major = sys.version_info.major   # e.g., 3
minor = sys.version_info.minor   # e.g., 11

if major == 3 and minor >= 10:
    # This branch runs when the condition is True
    print(f"✅ Python {major}.{minor}: all good! This book requires Python 3.10+")
else:
    # This branch runs when the condition is False
    print(f"⚠️  Python {major}.{minor} detected. Please use Python 3.10 or higher.")
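
The same check can also be written as a single tuple comparison, which some readers find easier to extend. The sketch below is one way to do it; the helper name `meets_minimum` and the constant `MINIMUM` are illustrative assumptions, not names used elsewhere in the book.

```python
import sys

MINIMUM = (3, 10)  # the minimum version stated in this chapter


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True when (major, minor) is at least the required version."""
    # Tuples compare element by element, so (3, 9) < (3, 10) < (4, 0).
    return tuple(version_info[:2]) >= minimum


print(meets_minimum(sys.version_info))
```

Unlike a test of the form `major == 3 and minor >= 10`, the tuple comparison would also accept a hypothetical future major version.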


---

## Section 0.3: Mounting Google Drive

By default, any files you create in a Colab session are **temporary**: they disappear when the session ends.  
Mounting your Google Drive gives Colab persistent access to your Drive storage, so you can:

- Save notebooks and outputs permanently
- Load your own data files
- Keep checkpoints from long training runs in Chapters 7 and 8

> ⚠️ **Run this cell individually:** click the **▶** button on the left of the cell; do not use Run All for this step.  
> A Google authentication popup will appear. Click **Connect to Google Drive** and sign in.  
> **If you skip Drive mounting entirely, that is fine:** all notebooks load the dataset directly from GitHub.


In [None]:
# ─────────────────────────────────────────────────────────────────
# GOOGLE DRIVE MOUNT
#
# This cell connects your Google Drive to Colab so your work persists
# between sessions. When you run it, a permissions popup will appear:
# click 'Connect to Google Drive' and sign in with your Google account.
#
# IMPORTANT: If you are running this as part of 'Run All', the popup
# cannot appear automatically. Run THIS CELL INDIVIDUALLY first by
# clicking the ▶ button on the left, complete the sign-in, then
# use Runtime → Run after to continue with the remaining cells.
#
# If you skip Drive mounting entirely, that is fine: all notebooks
# load the dataset directly from GitHub, so Drive is never required.
# ─────────────────────────────────────────────────────────────────
import os

try:
    from google.colab import drive

    # drive.mount() triggers the Google authentication popup.
    # force_remount=False means: if Drive is already mounted, skip re-mounting.
    drive.mount('/content/drive', force_remount=False)

    drive_root = '/content/drive/MyDrive'

    if os.path.exists(drive_root):
        print(f"✅ Google Drive mounted successfully at {drive_root}")
        # Show first 5 items as a sanity check; these are YOUR Drive folders
        items = os.listdir(drive_root)[:5]
        print(f"   Top-level items: {items}")
    else:
        print("⚠️  Mount appeared to succeed but Drive path not found.")
        print("   Try: Runtime → Disconnect and delete runtime, then re-run this cell.")

except Exception as e:
    # This branch runs if:
    #   - You ran this as part of Run All (popup could not appear)
    #   - You declined the Drive permissions request
    #   - You are running outside of Google Colab
    print("ℹ️  Google Drive not mounted. This is fine!")
    print("   All notebooks load the SO 2025 dataset directly from GitHub.")
    print("   Drive mounting is only needed if you want to SAVE outputs permanently.")
    print()
    print("   To mount Drive manually: run THIS CELL INDIVIDUALLY (click the ▶ button)")
    print("   and follow the Google authentication popup.")
    print()
    print(f"   (Technical detail: {type(e).__name__}: {e})")
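
If you want later cells to behave differently inside and outside Colab, a lightweight environment check can help. The sketch below relies on a common idiom and on the assumption that the Colab kernel has already imported the `google.colab` package at startup; the function name `running_in_colab` is hypothetical.

```python
import sys


def running_in_colab():
    """Best-effort check for a Colab runtime."""
    # Assumption: the Colab kernel imports google.colab itself at startup,
    # so finding it in sys.modules signals that we are running in Colab.
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: drive.mount() is available.")
else:
    print("Not running in Colab: skipping the Drive mount.")
```

This avoids relying on a caught exception to discover that the notebook is running locally.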


---

## Section 0.3b: GPU and TPU Runtimes

Google Colab offers four compute options:

| Runtime | Best For | How to Enable |
|---|---|---|
| **CPU** (default) | Chapters 0–6: no GPU needed | Default, nothing to change |
| **T4 GPU** (free) | Chapters 7 & 8: deep learning & transformers | Runtime → Change runtime type → T4 GPU |
| **A100 GPU** (Colab Pro) | Large models, faster training | Requires Colab Pro subscription |
| **TPU** (free) | Specialized tensor workloads | Runtime → Change runtime type → TPU v2 |

> 📌 **For now (Chapter 0), CPU is perfectly fine.** We'll remind you to switch to GPU at the start of Chapters 7 and 8.

The cell below checks what hardware is currently available. Run it now to see your current runtime.


In [None]:
# torch is PyTorch's main library. We import it here just to check GPU availability.
# We'll learn PyTorch in depth in Chapter 7; for now we're just using it as a detector.
#
# NOTE: If PyTorch isn't installed in your environment, this cell will print a message
# and skip the GPU check gracefully. The 'try/except' pattern handles errors cleanly;
# we cover this fully in Chapter 2.
try:
    import torch  # PyTorch deep learning framework

    # torch.cuda.is_available() returns True if a CUDA-capable GPU is detected.
    # CUDA is NVIDIA's parallel computing platform; PyTorch uses it to run on GPUs.
    gpu_available = torch.cuda.is_available()

    if gpu_available:
        # torch.cuda.get_device_name(0) returns the name of the first (index 0) GPU.
        gpu_name = torch.cuda.get_device_name(0)
        print(f"✅ GPU detected: {gpu_name}")
        print(f"   GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("ℹ️  No GPU detected; running on CPU.")
        print("   This is fine for Chapters 0–6.")
        print("   For Chapters 7–8: Runtime → Change runtime type → T4 GPU")

except ImportError:
    # ImportError is raised when Python can't find the requested module.
    # PyTorch may not be pre-installed in all environments; that's okay here.
    print("ℹ️  PyTorch not yet installed; GPU check skipped.")
    print("   We'll install and use PyTorch fully in Chapter 7.")
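
A related pattern is to turn the result of this check into a single device name that later code can reuse. The sketch below is a minimal version of that idea; the function name `pick_device` is an assumption for illustration, and it falls back to the CPU whenever PyTorch is missing.

```python
def pick_device():
    """Return 'cuda' when PyTorch sees a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU, so use the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Keeping the decision in one place means the rest of a notebook does not need to repeat the availability check.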


---

## Section 0.4: Installing Packages in Google Colab

Python's power comes from its enormous ecosystem of third-party libraries.  
In Colab, you install libraries using `!pip install`. The `!` prefix tells Colab to run the line as a **shell command** rather than as Python code.

### How `!pip install` Works

```
!pip install library_name==version_number
```

- `pip` is Python's package manager: it downloads libraries from PyPI (the Python Package Index)
- The `==version_number` part is optional but **recommended** in a book, because it pins the exact version so your code keeps working as libraries evolve
- Most libraries you'll need in this book are **pre-installed in Colab**, so you only need to install the ones that aren't

### Libraries Pre-installed in Colab (no install needed)
NumPy · Pandas · Matplotlib · Seaborn · scikit-learn · SciPy · TensorFlow · PyTorch

### Libraries We'll Install When Needed
Plotly · spaCy · Hugging Face Transformers · XGBoost · LightGBM · FastAPI · Optuna


In [None]:
# The ! prefix runs this as a shell command, not Python code.
# pip install downloads the specified package from PyPI and installs it
# into the current Python environment.
#
# We install plotly here as a demonstration; we'll use it in Chapter 4 for
# interactive visualizations. If your runtime already includes it, pip simply
# reports that the requirement is already satisfied.
#
# The -q flag means "quiet" ‚Äî it suppresses most of the installation output
# so the cell doesn't flood with download progress text.
!pip install plotly -q

# After installation, we verify it worked by importing the library
# and printing its version number.
import plotly  # if this line runs without error, installation succeeded

print(f"✅ Plotly {plotly.__version__} installed and ready")

# ── Good practice: always print library versions at the top of notebooks ──
# This makes it easy to reproduce results if someone runs the notebook later
# on a different version. We'll do this systematically in each chapter.
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns

print(f"   NumPy:      {np.__version__}")
print(f"   Pandas:     {pd.__version__}")
print(f"   Matplotlib: {matplotlib.__version__}")
print(f"   Seaborn:    {sns.__version__}")
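
Importing a library just to read its version can be slow for large packages. As an alternative sketch, the standard library's `importlib.metadata` can look up an installed version without importing the package. The helper name `installed_version` is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


for name in ["numpy", "pandas", "plotly"]:
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

A check like this can run before `!pip install` to decide whether an install is needed at all.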


---

## Section 0.5: Introducing the Stack Overflow 2025 Developer Survey

### What Is This Dataset?

The **Stack Overflow Annual Developer Survey** is the world's largest survey of software developers.  
It has been conducted every year since 2011; the 2025 edition captured responses from **~49,000 developers**  
across **180+ countries**, covering career paths, salaries, tools, technologies, and attitudes toward AI.

Stack Overflow releases the full dataset publicly under the **Open Database License (ODbL)**,  
which permits reuse, including for education and research, subject to the license's attribution and share-alike terms.

### Why This Dataset for This Book?

| Reason | Detail |
|---|---|
| **Directly relevant** | You are a developer learning Python. This dataset is *about* developers. Every insight feels personal. |
| **Task-rich** | Supports regression, classification, clustering, AND NLP from a single source |
| **Realistic mess** | Real missing values, multi-value columns, salary outliers ‚Äî authentic cleaning practice |
| **Ethics built in** | Gender pay gap, geographic disparities, AI adoption gaps ‚Äî real bias to audit in Chapter 9 |
| **Current** | 2025 data reflects the industry as it exists right now, including AI tool adoption |

### The Curated Subset We Use

The full survey has ~49,000 rows and 80+ columns. For this book we use a **curated subset**:
- **15,000 rows:** English-language respondents with non-null salary data
- **~18 columns:** the most relevant for our ML tasks
- Hosted at: `https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv`

The cell below loads it directly, with no download and no Drive mounting required.


### Loading the Dataset

The cell below loads the SO 2025 curated subset directly from GitHub using Pandas.  
We'll see exactly what's in it, understand its shape, and look at the first few rows.

> 🔁 **This exact loading pattern repeats at the start of every chapter (Ch 3 onward).**  
> Memorize the `pd.read_csv(URL)` pattern: it is one of the most common first lines in real data science work.


In [None]:
# pandas is the primary data manipulation library in Python's data science stack.
# We import it with the alias 'pd', a near-universal convention: almost all pandas
# code you will read writes 'pd.something', not 'pandas.something'.
import pandas as pd

# The URL where our curated SO 2025 dataset lives.
# This is a raw GitHub URL, so it serves the file directly as plain text (CSV format).
# pd.read_csv() can read from a URL just as easily as from a local file path.
DATASET_URL = "https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv"

# pd.read_csv() reads a comma-separated values file and returns a DataFrame,
# the core data structure in Pandas. Think of it as a supercharged spreadsheet
# that lives in memory and has hundreds of built-in analysis methods.
print("Loading SO 2025 Developer Survey dataset...")
df = pd.read_csv(DATASET_URL)

# Confirm the load succeeded by printing the shape.
# df.shape returns a tuple (number_of_rows, number_of_columns).
rows, cols = df.shape  # unpack the tuple into two variables
print("✅ Dataset loaded successfully!")
print(f"   Shape: {rows:,} rows × {cols} columns")
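
To see the same loading pattern without a network connection, the sketch below reads a tiny made-up CSV from memory. The column names mirror two of the book's key columns, but the three rows are invented for illustration, and `toy_df` is a hypothetical name chosen so it does not overwrite `df`.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for the real file. Germany's salary is blank,
# which pandas reads as a missing value (NaN).
csv_text = """Country,ConvertedCompYearly
Canada,85000
Germany,
Brazil,42000
"""

# read_csv accepts a URL, a local path, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

toy_rows, toy_cols = toy_df.shape
print(f"Shape: {toy_rows} rows x {toy_cols} columns")
print(toy_df.dtypes)
```

Because one salary is missing, pandas stores the salary column as floating point rather than integers, a detail that matters when cleaning begins in Chapter 3.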


### Previewing the First Few Rows

`df.head()` returns the first 5 rows of the DataFrame by default.  
This is usually the first thing to do after loading a dataset: a quick sanity check  
that the data looks like what you expected.


In [None]:
# df.head(n) returns the first n rows of the DataFrame.
# With no argument, it defaults to 5 rows.
# The output shows column names across the top and row index numbers on the left.

df.head(10)  # show first 10 rows for a broader preview


### Understanding the Structure with `.info()`

`df.info()` prints a concise summary of the DataFrame: column names, data types,  
and the count of non-null values per column. This is how you spot missing data at a glance.


In [None]:
# df.info() prints:
#   - Total row count
#   - Each column's name, non-null count, and data type (dtype)
#   - Total memory usage
#
# Key dtypes you'll see:
#   object  → string/text data (Pandas uses 'object' for strings)
#   int64   → integer numbers (64-bit)
#   float64 → decimal numbers (64-bit floating point)
#
# A column where non-null count < total rows has MISSING values;
# we'll handle those in Chapter 3.

print("DataFrame Info:")
print("=" * 60)
df.info()


### Statistical Summary with `.describe()`

`df.describe()` generates descriptive statistics for all **numeric** columns:  
count, mean, standard deviation, min, quartiles, and max.  
This gives you an immediate feel for the scale and distribution of your data.


In [None]:
# df.describe() computes summary statistics for numeric columns only by default.
# We use .round(2) to limit decimal places for readability.
#
# What to look for:
#   - 'count' less than total rows → missing values in that column
#   - Very large 'max' vs 'mean' → potential outliers (salary data often has these)
#   - 'std' (standard deviation) → how spread out the values are

print("Descriptive Statistics for Numeric Columns:")
print("=" * 60)
df.describe().round(2)


### Key Columns Used Throughout the Book

The code cell below lists the key columns and is your reference guide for the entire book.  
When a chapter says "we use `ConvertedCompYearly` as the target variable," you'll know exactly what that means.


In [None]:
# Let's look at each key column with a sample of its values.
# This gives more intuition than a table: you see real data, not just descriptions.

# Define the key columns we'll use across all chapters
key_columns = [
    'ConvertedCompYearly',     # Annual salary in USD (our regression target)
    'DevType',                 # Developer role(s); often multi-value, semicolon-separated
    'LanguageHaveWorkedWith',  # Languages used; multi-value, semicolon-separated
    'YearsCodePro',            # Years of professional coding experience
    'Country',                 # Respondent's country
    'EdLevel',                 # Highest education level completed
    'Employment',              # Employment status
    'RemoteWork',              # Remote / hybrid / in-person
    'AIToolCurrently',         # AI tools currently being used; multi-value
    'AITrustTeammates',        # How much they trust AI tool recommendations
    'OrgSize',                 # Organization size
    'Age',                     # Age range (binned, not exact)
    'Gender',                  # Gender identity (used in Chapter 9 bias audit)
]

# Only show columns that actually exist in our curated subset
# (the curated version may not include all columns from the full survey)
available_cols = [c for c in key_columns if c in df.columns]
missing_cols   = [c for c in key_columns if c not in df.columns]

print(f"Key columns available in this dataset: {len(available_cols)}/{len(key_columns)}")
if missing_cols:
    print(f"Not in curated subset (available in full dataset): {missing_cols}")

print()
print("Sample values from each key column:")
print("=" * 60)

# Loop through each available column and show a few unique values
for col in available_cols:
    # .dropna() removes NaN (missing) values before sampling
    # .unique() returns all distinct values in the column
    # [:4] takes the first 4 unique values as examples
    sample_values = df[col].dropna().unique()[:4]
    print(f"  {col:<30} → {list(sample_values)}")


### Quick Missing Value Check

Real-world datasets almost always have missing values. The SO 2025 survey is no exception:  
not every respondent answered every question. Let's see the scale of what we'll be  
cleaning in Chapter 3.


In [None]:
# df.isnull() returns a DataFrame of True/False values, with True where data is missing.
# .sum() counts the True values per column (True = 1, False = 0 in Python).
# .sort_values(ascending=False) sorts from most missing to least.

missing_counts = df.isnull().sum().sort_values(ascending=False)

# Calculate the percentage of missing values per column.
# Percentages are easier to interpret than raw counts because they show the
# share of respondents affected, independent of how many rows the dataset has.
missing_pct = (missing_counts / len(df) * 100).round(1)

# Combine into a summary DataFrame for clean display
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct
})

# Only show columns that actually have at least one missing value
missing_summary = missing_summary[missing_summary['Missing Count'] > 0]

print(f"Columns with missing values: {len(missing_summary)} of {len(df.columns)}")
print()
print(missing_summary.to_string())   # .to_string() prints full table without truncation
print()
print("→ We'll handle all of these systematically in Chapter 3.")
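
The same counting logic is easier to verify by eye on a tiny example. The sketch below builds a four-row DataFrame with invented values and a few deliberate gaps; the names `toy` and `toy_missing` are hypothetical and chosen to avoid overwriting the variables above.

```python
import pandas as pd

# Invented rows: one missing Country, two missing YearsCodePro, no missing Age.
toy = pd.DataFrame({
    "Country": ["Canada", "Germany", None, "Brazil"],
    "YearsCodePro": [5, None, None, 12],
    "Age": ["25-34", "35-44", "18-24", "25-34"],
})

toy_missing = toy.isnull().sum()                     # count of True per column
toy_pct = (toy_missing / len(toy) * 100).round(1)    # share of rows affected

print(pd.DataFrame({"Missing Count": toy_missing, "Missing %": toy_pct}))
```

With only four rows, each missing entry accounts for 25% of a column, so the counts and percentages can be checked against the table by hand.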


### A First Look at Developer Salaries

Our primary regression target throughout the book is `ConvertedCompYearly`: annual salary in USD,  
converted from local currencies using Stack Overflow's exchange rates.  
Let's get an early feel for what this data looks like.


In [None]:
# We import matplotlib here just for this quick preview chart.
# Chapter 4 covers visualization in full depth; this is just a taste.
import matplotlib.pyplot as plt   # pyplot is matplotlib's main plotting interface
import numpy as np

# Filter to reasonable salary range to remove extreme outliers for this preview.
# We'll do this more rigorously in Chapter 3; here we just want a clean visual.
# $10,000 minimum removes implausibly low values (data entry errors or part-time)
# $500,000 maximum removes the extreme high-end outliers that skew the view
salary_col = 'ConvertedCompYearly'

if salary_col in df.columns:
    salary_data = df[salary_col].dropna()                          # remove missing values
    salary_clean = salary_data[(salary_data >= 10_000) &          # at least $10k/year
                               (salary_data <= 500_000)]           # at most $500k/year

    # Create a figure with one plot area (axes)
    fig, ax = plt.subplots(figsize=(10, 4))   # figsize sets width × height in inches

    # ax.hist() draws a histogram: it groups values into 'bins' and counts
    # how many values fall into each group, showing the distribution shape.
    ax.hist(salary_clean, bins=60, color='#2E75B6', edgecolor='white', linewidth=0.5)

    # Labels and title make the chart self-explanatory
    ax.set_xlabel('Annual Salary (USD)', fontsize=12)
    ax.set_ylabel('Number of Respondents', fontsize=12)
    ax.set_title('SO 2025: Developer Salary Distribution (Respondents with $10k-$500k reported salary)', fontsize=13, fontweight='bold')

    # Format x-axis labels as currency with thousands separator (e.g., $100,000)
    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'${x:,.0f}'))

    # Mark the median and mean with vertical lines; the legend shows their values
    median_sal = salary_clean.median()
    mean_sal = salary_clean.mean()
    ax.axvline(median_sal, color='#E8722A', linestyle='--', linewidth=1.5,
               label=f'Median: ${median_sal:,.0f}')
    ax.axvline(mean_sal, color='green', linestyle=':', linewidth=1.5,
               label=f'Mean: ${mean_sal:,.0f}')
    ax.legend(fontsize=10)

    plt.tight_layout()   # automatically adjust spacing so nothing overlaps
    plt.show()

    # Print summary stats alongside the chart
    print(f"Salary Summary (n={len(salary_clean):,} respondents)")
    print(f"  Median: ${median_sal:>10,.0f}")
    print(f"  Mean:   ${mean_sal:>10,.0f}")
    print(f"  Min:    ${salary_clean.min():>10,.0f}")
    print(f"  Max:    ${salary_clean.max():>10,.0f}")
    print(f"  Std:    ${salary_clean.std():>10,.0f}")
else:
    print(f"Column '{salary_col}' not found in dataset.")
    print(f"Available columns: {list(df.columns)}")
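
Why plot both the median and the mean? A single extreme salary pulls the mean far more than the median. The sketch below shows the effect with five invented salaries using only the standard library; the values are assumptions chosen to make the arithmetic obvious.

```python
from statistics import mean, median

# Hypothetical salaries: four typical values and one extreme outlier.
toy_salaries = [60_000, 70_000, 80_000, 90_000, 2_000_000]

print(f"Mean:   ${mean(toy_salaries):,.0f}")
print(f"Median: ${median(toy_salaries):,.0f}")

# Keep only values inside the same $10k-$500k band the preview chart uses.
in_range = [s for s in toy_salaries if 10_000 <= s <= 500_000]
print(f"Mean after filtering: ${mean(in_range):,.0f}")
```

Once the outlier is filtered out, the mean falls back close to the median, which is the motivation for the range filter in the cell above.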


### A First Look at Programming Language Adoption

The `LanguageHaveWorkedWith` column uses a **multi-value format**: each respondent can list  
multiple languages, separated by semicolons. For example: `"Python;JavaScript;SQL;Rust"`.  

This is one of the most common data formats you'll encounter in real surveys, and cleaning it  
(splitting and exploding into separate rows) is a key skill we'll practice in Chapter 3.  
Here's a quick preview of what Python's adoption looks like in the 2025 data.


In [None]:
lang_col = 'LanguageHaveWorkedWith'

if lang_col in df.columns:
    # Each row contains multiple languages as a single semicolon-separated string.
    # Example row value: "Python;JavaScript;SQL;Bash/Shell"
    #
    # Step 1: Drop rows where this column is missing (NaN)
    lang_series = df[lang_col].dropna()

    # Step 2: Split each string on the ';' separator.
    # str.split(';') vectorizes the Python split() method across every row.
    # With the default expand=False, the result is a Series of lists.
    lang_lists = lang_series.str.split(';')

    # Step 3: .explode() converts each list element into its own row,
    # essentially "unpacking" the lists so each language appears separately.
    # This transforms the data from wide format to long format.
    lang_exploded = lang_lists.explode()

    # Step 4: Count how many respondents used each language.
    # .value_counts() counts occurrences, sorted from most to least common.
    lang_counts = lang_exploded.value_counts()

    # Show the top 15 languages
    top_15 = lang_counts.head(15)

    fig, ax = plt.subplots(figsize=(10, 5))

    # Horizontal bar chart: long language names are easier to read than on vertical bars
    bars = ax.barh(top_15.index[::-1],     # reverse so most popular is at top
                   top_15.values[::-1],
                   color='#2E75B6',
                   edgecolor='white')

    ax.set_xlabel('Number of Respondents', fontsize=11)
    ax.set_title('SO 2025: Top 15 Programming Languages (by number of respondents who have worked with them)', fontsize=12, fontweight='bold')

    # Add value labels at the end of each bar for exact counts
    for bar, val in zip(bars, top_15.values[::-1]):
        ax.text(val + 50, bar.get_y() + bar.get_height()/2,
                f'{val:,}', va='center', ha='left', fontsize=9)

    plt.tight_layout()
    plt.show()

    # Find Python's exact ranking
    python_count = lang_counts.get('Python', 0)
    python_rank  = list(lang_counts.index).index('Python') + 1 if 'Python' in lang_counts.index else 'N/A'
    total_resp   = len(lang_series)

    print(f"Total respondents who answered this question: {total_resp:,}")
    print(f"Python: {python_count:,} respondents ({python_count/total_resp*100:.1f}%), rank #{python_rank}")
    print("")
    print("Top 5 languages:")
    for rank, (lang, count) in enumerate(top_15.head(5).items(), 1):
        print(f"  #{rank}: {lang:<25} {count:,} respondents ({count/total_resp*100:.1f}%)")
else:
    print(f"Column '{lang_col}' not found in dataset.")
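
The split, explode, and count steps can also be sketched in plain Python, which makes the logic visible before Pandas hides it behind method calls. The answers below are invented, and `collections.Counter` stands in for `.value_counts()`; this is an illustration of the idea, not the approach the notebook itself uses.

```python
from collections import Counter

# Made-up survey answers in the same semicolon-separated format.
answers = [
    "Python;JavaScript;SQL",
    "Python;Rust",
    None,                      # a respondent who skipped the question
    "JavaScript;SQL;Python",
]

language_counts = Counter()
for answer in answers:
    if answer is None:         # plays the role of .dropna()
        continue
    # split(';') yields one item per language; update() adds 1 for each,
    # which is the same effect as exploding the list and counting rows.
    language_counts.update(answer.split(";"))

print(language_counts.most_common(3))
```

Each respondent contributes one count per language they listed, so the totals can exceed the number of respondents.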


### A Glimpse at AI Tool Adoption in 2025

The 2025 survey included substantial new coverage of AI tools: what developers are using,  
how much they trust AI recommendations, and their overall feelings about AI in their workflow.  
These columns power Chapter 8's NLP and sentiment analysis tasks.

Let's see which AI tools developers are using right now.


In [None]:
ai_col = 'AIToolCurrently'

if ai_col in df.columns:
    # Same multi-value splitting pattern as the language column above.
    # This pattern is so common in survey data that by Chapter 3,
    # you'll write it from memory without looking it up.
    ai_series   = df[ai_col].dropna()
    ai_exploded = ai_series.str.split(';').explode().str.strip()  # .str.strip() removes whitespace
    ai_counts   = ai_exploded.value_counts().head(12)

    fig, ax = plt.subplots(figsize=(10, 5))
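    # Build the color list in the same reversed order as the bars below,
    # so each highlight color lines up with its own tool.
    colors = ['#E8722A' if 'ChatGPT' in tool or 'Copilot' in tool or 'Claude' in tool
              else '#2E75B6' for tool in ai_counts.index[::-1]]

    ax.barh(ai_counts.index[::-1], ai_counts.values[::-1], color=colors, edgecolor='white')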
    ax.set_xlabel('Number of Respondents', fontsize=11)
    ax.set_title('SO 2025: AI Tools Developers Are Currently Using', fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.show()

    total_ai = len(ai_series)
    total_respondents = len(df)
    print(f"Respondents using at least one AI tool: {total_ai:,} of {total_respondents:,} ({total_ai/total_respondents*100:.1f}%)")
    print("")
    print("Top 5 AI tools in 2025:")
    for rank, (tool, count) in enumerate(ai_counts.head(5).items(), 1):
        print(f"  #{rank}: {tool:<35} {count:,} respondents")
else:
    print(f"Column '{ai_col}' not found. Available columns: {[c for c in df.columns if 'ai' in c.lower()]}")


---

## Section 0.5b: Saving the Dataset to Google Drive (Optional)

If you'd prefer to load the dataset from Drive instead of the GitHub URL in future sessions,  
run the cell below. This is optional: all notebooks default to the GitHub URL, which works without Drive.


In [None]:
# ─────────────────────────────────────────────────────────────────
# OPTIONAL: Save a copy of the dataset to Google Drive.
#
# Skip this cell entirely if:
#   - You have not mounted Drive (Section 0.3)
#   - You prefer loading from GitHub each session (perfectly fine)
#
# Run this cell individually (▶ button) after mounting Drive if you
# want a persistent copy in your Drive that does not depend on GitHub.
# ─────────────────────────────────────────────────────────────────
import os

drive_root = '/content/drive/MyDrive'

if not os.path.exists(drive_root):
    # Drive is not mounted: skip with a short message rather than raising an error
    print("ℹ️  Skipping Drive save: Google Drive is not mounted.")
    print("   The dataset loads fine from GitHub. No action needed.")
else:
    # Drive IS mounted: proceed with saving
    drive_data_path = '/content/drive/MyDrive/python_for_aiml/data'

    # os.makedirs() creates the folder and any missing parent folders.
    # exist_ok=True means no error if the folder already exists.
    os.makedirs(drive_data_path, exist_ok=True)

    save_path = f'{drive_data_path}/so_survey_2025_curated.csv'

    # df.to_csv() writes the DataFrame to a CSV file on Drive.
    # index=False omits the auto-generated row numbers from the output.
    df.to_csv(save_path, index=False)

    file_size_mb = os.path.getsize(save_path) / 1_000_000
    print(f"✅ Dataset saved to Drive: {save_path}")
    print(f"   File size: {file_size_mb:.1f} MB")
    print()
    print("To load from Drive in future sessions instead of GitHub:")
    print(f'  df = pd.read_csv("{save_path}")')
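
In a later session you may want to read the Drive copy when it exists and fall back to the GitHub URL otherwise. The helper below is a hypothetical sketch of that check: the function name `pick_dataset_source` and the `GITHUB_URL` placeholder are assumptions for illustration, so substitute the URL used earlier in this chapter.

```python
import os

# Placeholder only: replace with the dataset URL used earlier in this chapter.
GITHUB_URL = "https://example.com/so_survey_2025_curated.csv"
DRIVE_COPY = "/content/drive/MyDrive/python_for_aiml/data/so_survey_2025_curated.csv"

def pick_dataset_source(drive_path, fallback_url):
    """Return drive_path if that file exists, otherwise fallback_url."""
    return drive_path if os.path.isfile(drive_path) else fallback_url

source = pick_dataset_source(DRIVE_COPY, GITHUB_URL)
print(f"Loading dataset from: {source}")
# df = pd.read_csv(source)   # uncomment once GITHUB_URL points at the real file
```

Because `pd.read_csv()` accepts both file paths and URLs, the rest of the notebook does not need to know which source was chosen.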


---

## Section 0.6 - Local Python Setup (Optional Sidebar)

> ⚠️ **This section is completely optional.** The entire book works in Google Colab with no local setup.  
> Only read this if you want a local Python environment on your own machine.

---

### Why Consider Local Setup?

| Situation | Recommendation |
|---|---|
| Following this book | **Colab**: no setup, free GPU, runs in any modern browser |
| Building production projects | **Local environment**: full control, no session timeouts |
| Contributing to open source | **Local environment**: version control integration |
| Working with sensitive data | **Local environment**: data never leaves your machine |

---

### Option A: Anaconda (Recommended for Beginners)

Anaconda is an all-in-one Python distribution that includes Python, conda (environment manager), and 250+ pre-installed data science packages.

**Installation steps:**
1. Go to [anaconda.com/download](https://www.anaconda.com/download)
2. Download the installer for your OS (Windows / macOS / Linux)
3. Run the installer and accept the defaults for most options
4. Open **Anaconda Navigator** or the **Anaconda Prompt**
5. Verify: `python --version` should show Python 3.11+

**Creating a dedicated environment for this book:**
```bash
# Create a new environment named 'pyaiml' with Python 3.11
conda create -n pyaiml python=3.11

# Activate the environment
conda activate pyaiml

# Install core packages
pip install numpy pandas matplotlib seaborn scikit-learn scipy
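# PyTorch: the cu121 index below provides CUDA 12.1 builds for NVIDIA GPUs.
# On macOS or a CPU-only machine, use plain `pip install torch torchvision` instead.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121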
pip install keras transformers datasets plotly xgboost lightgbm optuna
pip install jupyter notebook fastapi uvicorn

# Launch Jupyter Notebook
jupyter notebook
```
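
After the installs finish, a quick way to confirm the environment is usable is to check that the interpreter can locate each package. The following is a minimal sketch using only the standard library. The helper name `find_missing` is hypothetical, and the module list simply mirrors some of the packages installed above (note that scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

def find_missing(module_names):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in module_names if find_spec(name) is None]

# Import names for some of the packages installed above.
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy", "torch"]

missing = find_missing(required)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Run it from inside the activated `pyaiml` environment; running it elsewhere reports on whichever interpreter executes the script.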

---

### Option B: VS Code + Python Extension

VS Code with the Python extension is one of the most widely used editors for Python development.

1. Download VS Code from [code.visualstudio.com](https://code.visualstudio.com)
2. Install the **Python extension** (by Microsoft) from the Extensions marketplace
3. Install the **Jupyter extension** to run `.ipynb` notebooks locally
4. Select your conda environment (or any Python interpreter) via `Ctrl+Shift+P` → `Python: Select Interpreter` (`Cmd+Shift+P` on macOS)

---

### Virtual Environments (venv, built into Python)

If you prefer not to use Anaconda, Python's built-in `venv` module creates lightweight isolated environments:

```bash
# Create a virtual environment in a folder named '.venv'
python -m venv .venv

# Activate it (macOS/Linux)
source .venv/bin/activate

# Activate it (Windows)
.venv\Scripts\activate

# Install packages
pip install -r requirements.txt   # if the repo includes one
```

> 📌 **Key principle:** Always use a dedicated environment per project. Never install packages into your system Python.
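
One way to check which kind of interpreter is running is to compare `sys.prefix` with `sys.base_prefix`, which differ inside a `venv`. The sketch below also looks at the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The function name `describe_environment` is hypothetical, and the labels it returns are just illustrative.

```python
import os
import sys

def describe_environment():
    """Return a short label for the kind of Python environment in use."""
    if sys.prefix != sys.base_prefix:
        return "venv"                      # created by `python -m venv`
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda:{conda_env}"        # e.g. conda:pyaiml or conda:base
    return "system"                        # no isolated environment detected

print(describe_environment())
print("Interpreter:", sys.executable)
```

If this reports `system` right before a `pip install`, that is a hint to activate a project environment first.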


---

## ✅ Chapter 0 Summary

Excellent work! You've completed the orientation and you're ready to start writing Python.

### What You Did in This Chapter

| Task | Status |
|---|---|
| Navigated the Google Colab interface | ✅ |
| Ran your first Python code (`print()`) | ✅ |
| Mounted Google Drive for persistent storage | ✅ |
| Checked GPU availability | ✅ |
| Installed a package with `!pip install` | ✅ |
| Loaded the SO 2025 dataset with `pd.read_csv()` | ✅ |
| Previewed data with `.head()`, `.info()`, `.describe()` | ✅ |
| Identified key columns and missing values | ✅ |
| Visualized salary distribution and language adoption | ✅ |

### Key Takeaways

- **Google Colab** is a free, zero-setup Jupyter environment with free GPU access
- **Notebooks** mix markdown (explanatory text) and code cells, which mirrors real data science work
- The **Stack Overflow 2025 Developer Survey** is our project dataset for all 9 chapters: 15,000 developers, salary, languages, AI tool adoption, and more
- `pd.read_csv(URL)` reads a CSV file directly from a web URL, so no manual download is required
- `.head()`, `.info()`, `.describe()` are always your first three moves on a new dataset
- The `LanguageHaveWorkedWith` column uses semicolon-separated multi-values, a real-world data format we'll clean properly in Chapter 3

### Skills Gained

- Colab interface navigation and keyboard shortcuts
- Google Drive mounting
- GPU runtime configuration
- Package installation with `!pip install`
- Dataset loading with Pandas
- Initial data exploration and visualization

---

### ⏭️ What's Next: Chapter 1 - Python Fundamentals

Chapter 1 begins the real Python journey. We'll cover:
- Variables, data types (int, float, str, bool)
- Operators and expressions
- Control flow (if/elif/else, for loops, while loops)
- Core data structures (lists, tuples, sets, dictionaries)
- Functions: the building block of all Python programs

By the end of Chapter 1, you'll be writing functions that summarize SO 2025 salary and language data using only pure Python, with no external libraries.
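
As a rough preview of that style, here is a sketch that summarizes a few made-up records with built-in types only. The records, field names, and the `summarize` function are invented for illustration; they are not the Chapter 1 code or real survey rows.

```python
def summarize(records):
    """Return (median salary, language counts) using only built-ins."""
    # Keep known salaries and sort them so the middle value can be picked.
    salaries = sorted(r["salary"] for r in records if r["salary"] is not None)
    mid = len(salaries) // 2
    if not salaries:
        median = None
    elif len(salaries) % 2:
        median = salaries[mid]
    else:
        median = (salaries[mid - 1] + salaries[mid]) / 2

    # Count each language across the semicolon-separated answers.
    language_counts = {}
    for r in records:
        for lang in r["languages"].split(";"):
            lang = lang.strip()
            if lang:
                language_counts[lang] = language_counts.get(lang, 0) + 1
    return median, language_counts

sample = [
    {"salary": 90000, "languages": "Python;SQL"},
    {"salary": None,  "languages": "Python"},
    {"salary": 70000, "languages": "JavaScript; Python"},
]
median, counts = summarize(sample)
print(median, counts)
```

The dictionary-counting loop does by hand what `.str.split(';').explode().value_counts()` did with Pandas earlier in this chapter.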

---

*End of Chapter 0 · Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)]({TOC_URL})
