# Chapter 1 -- Python Fundamentals
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH01_Python_Fundamentals.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 1 -- Core Python Fundamentals  
**Prerequisites:** Chapter 0 (Orientation and Setup)  
**Estimated time:** 4-5 hours

---

### Learning Objectives

By the end of this chapter you will be able to:

- Declare variables and understand Python's core data types (`int`, `float`, `str`, `bool`)
- Use arithmetic, comparison, and logical operators confidently
- Write f-strings to format output cleanly
- Control program flow with `if`/`elif`/`else`, `for` loops, and `while` loops
- Use `break`, `continue`, and `pass` appropriately
- Build and manipulate Python's four core data structures: lists, tuples, sets, and dictionaries
- Define functions with positional, keyword, default, and variable-length arguments
- Write lambda functions for concise single-expression operations

---

### Project Thread -- Chapter 1

In this chapter we work with a small sample of the Stack Overflow (SO) 2025
Developer Survey dataset, held as a list of dictionaries representing survey
respondents. This lets us practise pure Python data structures and functions on
realistic data before we bring in NumPy and Pandas in Chapter 3.

By the end of Section 1.4 you will have written functions that compute real statistics
from this sample: average salary, most common language, and respondent counts by country.


---

## Section 1.1 -- Variables, Data Types, and Operators

### What Is a Variable?

A variable is a **named container** that holds a value in memory.
Think of it as a labelled box: you give the box a name, put something inside it,
and refer to it by name whenever you need what's inside.

In Python, you create a variable simply by writing `name = value`.
Python figures out the type of the value automatically -- you never have to declare it.

> **Why this matters for AI/ML:** Every piece of data you will ever work with --
> a salary figure, a model weight, a column name, a prediction -- lives in a variable.
> Understanding types prevents an enormous category of bugs in data pipelines.


In [None]:
# 1.1.1 -- Numeric Types: int, float, complex
#
# Python has three built-in numeric types:
#   int   -- whole numbers, positive or negative, no decimal point
#   float -- numbers with a decimal point (floating-point representation)
#   complex -- numbers with a real and imaginary part

# int: whole number -- used for counts, indices, years
years_experience = 7
survey_year      = 2025
respondent_count = 49_461    # underscores in numbers improve readability

# float: decimal number -- used for salaries, percentages, model metrics
annual_salary    = 135_000.0
python_adoption  = 0.548     # 54.8% of respondents use Python
median_age       = 32.5

# complex: real + imaginary (rarely used in data science -- included for completeness)
z = 3 + 4j    # j is Python's imaginary unit notation

# type() is a built-in function that returns the type of any value
print('--- Numeric Types ---')
print(f'years_experience = {years_experience}  -> type: {type(years_experience)}')
print(f'annual_salary    = {annual_salary}  -> type: {type(annual_salary)}')
print(f'python_adoption  = {python_adoption}  -> type: {type(python_adoption)}')
print(f'complex z        = {z}  -> type: {type(z)}')
print(f'respondent_count = {respondent_count:,}')   # :, adds thousands separator
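
Two type-related slips are worth seeing early: survey values often arrive as text, and floats carry small rounding errors. The sketch below uses made-up values (`salary_text` is a hypothetical raw field, not a column from the dataset) to show explicit conversion, the `TypeError` raised by mixing `str` and `int`, and a tolerance-based float comparison with `math.isclose`.

```python
import math

# Values read from files arrive as text; convert before doing arithmetic.
salary_text = '135000'             # hypothetical raw value
salary_value = int(salary_text)    # str -> int
monthly = salary_value / 12        # the / operator always returns a float

# Mixing str and int without converting raises a TypeError.
try:
    salary_text + 1
except TypeError as err:
    type_error_message = str(err)

# Floats are binary approximations, so compare them with a tolerance.
total = 0.1 + 0.2
exactly_equal = total == 0.3               # False
close_enough = math.isclose(total, 0.3)    # True

print(f'salary_value = {salary_value} -> {type(salary_value).__name__}')
print(f'monthly      = {monthly} -> {type(monthly).__name__}')
print(f'TypeError    : {type_error_message}')
print(f'0.1 + 0.2 == 0.3        -> {exactly_equal}')
print(f'isclose(0.1 + 0.2, 0.3) -> {close_enough}')
```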


### 1.1.2 -- String Manipulation and F-Strings

A **string** is a sequence of characters -- any text wrapped in single or double quotes.
Strings are one of the most important types in data science: column names, country names,
developer roles, and survey responses are all strings.

**F-strings** (formatted string literals) are Python's most readable way to embed
variable values directly inside a string. They were introduced in Python 3.6 and
are now the standard choice for string formatting in modern Python code.


In [None]:
# 1.1.2 -- String Manipulation and F-Strings

developer_role = 'Full-stack developer'
primary_lang   = 'Python'
country        = 'United States'

# String concatenation with + -- joins strings end-to-end
greeting = 'Hello, ' + developer_role + ' from ' + country
print(greeting)

# f-strings: prefix with f, embed expressions in { }
salary = 135_000
yrs    = 7
print(f'Role: {developer_role}, Language: {primary_lang}')

# Format specifiers after a colon inside { }
# :,   -> add thousands separator
# :.1% -> percentage with 1 decimal place
# :>10 -> right-align in field of 10 chars
print(f'Salary:     ${salary:,}')
print(f'Adoption:   {0.548:.1%}')
print(f'Experience: {yrs} year{"s" if yrs != 1 else ""}')

# Key string methods
text = '  Full-Stack Developer;Data Scientist  '
print()
print('--- String Methods ---')
print(f'Original:   "{text}"')
print(f'.strip():   "{text.strip()}"')
print(f'.lower():   "{text.strip().lower()}"')
print(f'.upper():   "{text.strip().upper()}"')
print(f'.replace(): "{text.strip().replace(";", " | ")}"')

# .split() is critical for SO survey data -- roles and languages are semicolon-separated
roles = 'Full-Stack Developer;Data Scientist;Machine Learning Engineer'
roles_list = roles.split(';')   # splits on ; and returns a list
print(f'.split(";"):  {roles_list}')
print(f'len():        {len(roles_list)} roles')
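
The alignment specifiers listed in the comments above (`:<` and `:>`) are easiest to understand in a small table. The sketch below uses made-up counts purely for illustration, and also shows `.join()`, which reverses what `.split()` does.

```python
# Made-up (language, count) pairs, for illustration only
rows = [('Python', 1200), ('SQL', 950), ('Rust', 75)]

table_lines = []
for name, count in rows:
    # :<12 left-aligns in 12 characters
    # :>8, right-aligns in 8 characters and adds a thousands separator
    table_lines.append(f'{name:<12}{count:>8,}')
print('\n'.join(table_lines))

# .join() is the inverse of .split(): it glues a list back into one string
roles_list = ['Full-Stack Developer', 'Data Scientist']
roles = ';'.join(roles_list)
print(roles)
print(roles.split(';') == roles_list)
```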


### 1.1.3 and 1.1.4 -- Booleans, Comparison Operators, and Logical Operators

A **boolean** is a value that is either `True` or `False` -- the foundation of every
decision a program makes. Comparison operators produce booleans; logical operators
combine them. You will use these constantly in data filtering and model evaluation.


In [None]:
# Comparison operators always return True or False:
#   ==  equal to           !=  not equal to
#   >   greater than       <   less than
#   >=  greater or equal   <=  less or equal
#   in  membership test    not in  non-membership

salary    = 135_000
threshold = 100_000
languages = ['Python', 'SQL', 'JavaScript']

print('--- Comparison Operators ---')
print(f'salary > threshold:       {salary > threshold}')
print(f'salary == 100_000:        {salary == 100_000}')
print(f'salary != 100_000:        {salary != 100_000}')
print(f"'Python' in languages:    {'Python' in languages}")
print(f"'Rust' in languages:      {'Rust' in languages}")

# Logical operators: and, or, not
# and -> True only if BOTH sides are True
# or  -> True if EITHER side is True
# not -> inverts True to False and vice versa

years_exp   = 7
remote_work = True
has_degree  = True

print()
print('--- Logical Operators ---')
senior_remote = salary > 100_000 and years_exp >= 5 and remote_work
print(f'Senior AND remote:        {senior_remote}')

python_or_r = 'Python' in languages or 'R' in languages
print(f'Uses Python OR R:         {python_or_r}')

print(f'NOT remote:               {not remote_work}')

# Combining -- parentheses make precedence explicit
qualified = salary > 100_000 and (years_exp >= 5 or has_degree)
print(f'Qualified candidate:      {qualified}')

# == vs is: always use == for value comparisons
a = [1, 2, 3]
b = [1, 2, 3]
print()
print(f'a == b (value equal):     {a == b}')   # True -- same values
print(f'a is b (same object):     {a is b}')   # False -- different objects in memory
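
One place where `is` is the right tool is checking for `None`, which is how missing values are often represented. The sketch below is a minimal illustration with a hypothetical missing salary. It also shows that `and` short-circuits, and how an empty list behaves in a boolean context.

```python
respondent_salary = None   # hypothetical missing survey answer

# Use 'is' for None checks: None is a single shared object
salary_missing = respondent_salary is None

# 'and' short-circuits: the right side is evaluated only when the left
# side is True, so None is never compared with a number here
is_high_earner = respondent_salary is not None and respondent_salary > 100_000

# Empty containers and 0 count as False; non-empty containers count as True
languages = []
has_languages = bool(languages)

print(f'salary_missing: {salary_missing}')
print(f'is_high_earner: {is_high_earner}')
print(f'has_languages:  {has_languages}')
```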


---

## Section 1.2 -- Control Flow: Conditionals and Loops

Control flow is how a program makes decisions and repeats actions.
Without it, every program would just run the same lines top-to-bottom every time.
In AI/ML, control flow underpins data filtering, training loops, and evaluation logic.


In [None]:
# 1.2.1 -- if / elif / else
#
# Python executes the block under the FIRST condition that is True.
# If no condition is True, the else block runs (if present).
# Indentation (4 spaces) defines which lines belong to each block.

salary = 135_000

if salary >= 200_000:
    band = 'Principal / Staff'
elif salary >= 150_000:       # elif = else if -- checked only if 'if' was False
    band = 'Senior+'
elif salary >= 100_000:
    band = 'Senior'
elif salary >= 60_000:
    band = 'Mid-level'
else:                         # runs if ALL conditions above were False
    band = 'Junior / Entry-level'

print(f'Salary ${salary:,} -> Band: {band}')

# Conditional expression (ternary) -- a compact one-line if/else
# Syntax: value_if_true  if  condition  else  value_if_false
label = 'High earner' if salary >= 100_000 else 'Below threshold'
print(f'Quick label: {label}')

# Nested conditions
remote   = True
has_visa = True

if salary > 100_000:
    if remote:
        status = 'High-value remote candidate'
    elif has_visa:
        status = 'High-value on-site candidate'
    else:
        status = 'High salary but relocation required'
else:
    status = 'Below senior threshold'

print(f'Candidate status: {status}')


In [None]:
# 1.2.2 -- for Loops and Iterables
#
# A for loop repeats a block of code for each item in an iterable.
# An iterable is anything Python can loop over: lists, strings, ranges, dicts...

languages = ['Python', 'SQL', 'JavaScript', 'TypeScript', 'Rust']

# Basic for loop
print('Languages:')
for lang in languages:
    print(f'  - {lang}')

# enumerate() gives both index AND value -- common when position matters
print()
print('Ranked:')
for rank, lang in enumerate(languages, start=1):   # start=1 -> ranks from 1
    print(f'  #{rank}: {lang}')

# range() generates integers -- no list needed in memory
# range(start, stop, step) -- stop is EXCLUSIVE (not included)
print()
print('Salary bands at $10k intervals:')
for threshold in range(60_000, 130_000, 10_000):
    print(f'  ${threshold:,}')
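
The comment at the top of this cell mentions strings and dicts as iterables, but the loops above only cover lists and ranges. The sketch below fills that gap with made-up values, and adds `zip()`, a built-in that walks two lists in step.

```python
# A string is an iterable of characters
vowel_count = 0
for ch in 'Python':
    if ch in 'aeiou':
        vowel_count += 1
print(f'Vowels in Python: {vowel_count}')

# Looping over a dict yields its keys; .items() yields (key, value) pairs
years_by_lang = {'Python': 7, 'SQL': 5}   # made-up values
keys_seen = []
for lang in years_by_lang:
    keys_seen.append(lang)
for lang, years in years_by_lang.items():
    print(f'  {lang}: {years} years')

# zip() pairs up items from two lists, position by position
languages = ['Python', 'SQL', 'Rust']
salaries = [135_000, 88_000, 210_000]
pairs = []
for lang, sal in zip(languages, salaries):
    pairs.append(f'{lang}=${sal:,}')
print(pairs)
```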


In [None]:
# 1.2.3 -- while Loops
#
# A while loop keeps running as long as its condition remains True.
# Use when you do not know in advance how many iterations you need.
# ALWAYS ensure the condition eventually becomes False -- otherwise
# you create an infinite loop that hangs the kernel.

import random
random.seed(42)   # seed for reproducibility

# Simulate a training loop (preview of Chapter 7)
loss      = 1.0    # high starting loss = bad model
epoch     = 0      # iteration counter
max_epoch = 20     # safety cap

print('Simulated training loop:')
print(f'{"Epoch":>6}  {"Loss":>8}')
print('-' * 17)

while loss > 0.1 and epoch < max_epoch:
    loss = loss * 0.75 + random.uniform(-0.02, 0.02)
    loss = max(loss, 0.0)   # loss cannot go below 0
    epoch += 1              # shorthand for epoch = epoch + 1
    print(f'{epoch:>6}  {loss:>8.4f}')

print()
if loss <= 0.1:
    print(f'Converged at epoch {epoch} -- loss {loss:.4f} below 0.1')
else:
    print(f'Hit max epochs ({max_epoch}) -- loss {loss:.4f} still above threshold')


In [None]:
# 1.2.4 -- break, continue, pass
#
#   break    -- exit the loop immediately
#   continue -- skip the rest of the current iteration, go to next
#   pass     -- do nothing; placeholder where Python requires a statement

salaries = [45_000, 120_000, 0, 95_000, -1, 210_000, 88_000]

# break -- stop at first invalid value
print('--- break ---')
for i, sal in enumerate(salaries):
    if sal <= 0:
        print(f'  Stopped at index {i} -- invalid salary: {sal}')
        break
    print(f'  Index {i}: ${sal:,} OK')

# continue -- skip bad values, keep processing the rest
print()
print('--- continue ---')
valid = []
for sal in salaries:
    if sal <= 0:
        print(f'  Skipping: {sal}')
        continue    # jump straight to the next iteration
    valid.append(sal)
    print(f'  Kept: ${sal:,}')
print(f'  Valid: {valid}')
print(f'  Average: ${sum(valid) / len(valid):,.0f}')

# pass -- no-op placeholder for code skeletons
print()
print('--- pass ---')
for sal in valid:
    if sal > 200_000:
        pass    # TODO: handle exceptional salaries later
    else:
        print(f'  Normal range: ${sal:,}')


---

## Section 1.3 -- Data Structures: Lists, Tuples, Sets, and Dictionaries

Python's four core data structures are the building blocks of every data pipeline.
Each has a distinct purpose -- choosing the right one is a foundational skill.

| Structure | Ordered | Mutable | Duplicates | Syntax | Best for |
|-----------|---------|---------|------------|--------|----------|
| **List**  | Yes | Yes | Yes | `[a, b]` | Sequences you will change |
| **Tuple** | Yes | No  | Yes | `(a, b)` | Fixed records, return values |
| **Set**   | No  | Yes | No  | `{a, b}` | Unique values, fast lookup |
| **Dict**  | Yes (insertion order, Python 3.7+) | Yes | Keys: No | `{k: v}` | Labelled data, fast access |


In [None]:
# 1.3.1 -- Lists: Creation, Indexing, Slicing, Methods

languages = ['Python', 'SQL', 'JavaScript', 'TypeScript', 'Rust']
salaries  = [95_000, 135_000, 88_000, 210_000, 72_000]

# Indexing -- zero-based: first item is index 0
print('--- Indexing ---')
print(f'First:   {languages[0]}')
print(f'Last:    {languages[-1]}')   # -1 = last item
print(f'Third:   {languages[2]}')

# Slicing -- list[start:stop:step], stop is EXCLUSIVE
print()
print('--- Slicing ---')
print(f'First three:  {languages[0:3]}')
print(f'Last two:     {languages[-2:]}')
print(f'Every other:  {languages[::2]}')
print(f'Reversed:     {languages[::-1]}')

# Mutating lists
print()
print('--- Mutation ---')
languages[4] = 'Go'
print(f'After replace:  {languages}')
languages.append('Rust')
print(f'After append:   {languages}')
languages.insert(1, 'R')
print(f'After insert:   {languages}')
removed = languages.pop(1)        # remove and return item at index 1
print(f'Popped: {removed}')
languages.remove('Go')            # remove first occurrence of a value
print(f'After remove:   {languages}')

# Operations
print()
print('--- Operations ---')
print(f'Length:          {len(languages)}')
print(f'Max salary:      ${max(salaries):,}')
print(f'Min salary:      ${min(salaries):,}')
print(f'Sum salaries:    ${sum(salaries):,}')
print(f'Sorted salaries: {sorted(salaries)}')
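
Because lists are mutable, it matters whether an operation changes the list in place or returns a new one. The sketch below contrasts `sorted()` with `.sort()`, and an alias with a copy. The variable names are illustrative only.

```python
salaries = [95_000, 135_000, 88_000]

# sorted() returns a NEW list and leaves the original untouched
ranked = sorted(salaries, reverse=True)

alias = salaries            # a second name for the SAME list object
snapshot = salaries.copy()  # an independent (shallow) copy

# .sort() reorders the list in place and returns None
result = salaries.sort()

print(f'ranked:   {ranked}')
print(f'salaries: {salaries}')   # now sorted ascending
print(f'alias:    {alias}')      # changed too -- same object
print(f'snapshot: {snapshot}')   # still in the original order
print(f'result:   {result}')
```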


In [None]:
# 1.3.2 -- Tuples: Immutability and Use Cases

# A tuple is like a list but IMMUTABLE -- once created, it cannot be changed.
# Use for data that should never change: coordinates, records, return values.

respondent_record = ('R_001', 'United States', 135_000, 7, 'Python')
# Fields: (id, country, salary, years_exp, primary_language)

print(f'ID:       {respondent_record[0]}')
print(f'Country:  {respondent_record[1]}')
print(f'Salary:   ${respondent_record[2]:,}')

# Tuple unpacking -- assign each element to its own variable in one line
resp_id, country, salary, years_exp, lang = respondent_record
print(f'Unpacked: id={resp_id}, salary=${salary:,}')

# Trying to modify a tuple raises a TypeError -- this is intentional
try:
    respondent_record[2] = 150_000   # attempt to change salary
except TypeError as e:
    print(f'TypeError: {e}')
    print('Tuples are immutable by design -- use a list if you need to change values')

# Practical use: returning multiple values from a function
def salary_stats(salary_list):
    '''Return (min, max, mean) as a tuple for clean unpacking.'''
    return min(salary_list), max(salary_list), sum(salary_list)/len(salary_list)

salaries = [95_000, 135_000, 88_000, 210_000, 72_000]
lo, hi, avg = salary_stats(salaries)
print(f'Salary range: ${lo:,} to ${hi:,}, average: ${avg:,.0f}')


In [None]:
# 1.3.3 -- Sets: Unique Elements and Set Operations

# A set stores UNIQUE items with no guaranteed order.
# Key advantage: membership testing ('x in set') is O(1) on average --
# roughly constant time regardless of set size. A list needs O(n): it may scan every item.

langs_a = ['Python', 'SQL', 'JavaScript', 'Python', 'SQL']   # has duplicates
langs_b = ['Python', 'R', 'SQL', 'Scala', 'Julia']

set_a = set(langs_a)   # duplicates removed automatically
set_b = set(langs_b)

print(f'Respondent A: {set_a}')
print(f'Respondent B: {set_b}')

print()
print('--- Set Operations ---')
print(f'Union (either):          {sorted(set_a | set_b)}')   # all unique languages
print(f'Intersection (both):     {sorted(set_a & set_b)}')   # used by both
print(f'Difference (A not B):    {sorted(set_a - set_b)}')   # in A, not in B
print(f'Symmetric diff:          {sorted(set_a ^ set_b)}')   # exclusive to one side

# Fast membership test
high_value = {'Python', 'Scala', 'Rust', 'Go', 'TypeScript'}
print()
print(f"'Python' high-value: {'Python' in high_value}")
print(f"'COBOL' high-value:  {'COBOL' in high_value}")

# Deduplication
all_langs   = langs_a + langs_b
unique_only = sorted(set(all_langs))
print(f'All (with dups): {all_langs}')
print(f'Unique only:     {unique_only}')
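
The speed difference between list and set membership can be measured with the standard-library `timeit` module. The sketch below is a rough illustration, not a benchmark: the sizes are arbitrary and the timings will vary by machine, so no specific numbers are claimed.

```python
import timeit

n = 50_000
id_list = list(range(n))
id_set = set(id_list)
target = n - 1   # last element: the slowest case for a list scan

# Time 100 membership tests against each structure
list_time = timeit.timeit(lambda: target in id_list, number=100)
set_time = timeit.timeit(lambda: target in id_set, number=100)

print(f'list lookup: {list_time:.6f} s')
print(f'set lookup:  {set_time:.6f} s')
print(f'Same answer: {(target in id_list) == (target in id_set)}')
```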


In [None]:
# 1.3.4 -- Dictionaries: Key-Value Pairs and Methods

# A dictionary maps unique KEYS to VALUES.
# Keys must be hashable -- in practice, immutable values such as strings, numbers, and tuples of those.
# Values can be anything -- including lists and other dicts.

respondent = {
    'id':        'R_001',
    'country':   'United States',
    'salary':    135_000,
    'years_exp': 7,
    'languages': ['Python', 'SQL', 'JavaScript'],
    'remote':    True,
    'dev_type':  'Full-stack developer',
    'ai_tools':  ['ChatGPT', 'GitHub Copilot'],
}

# Accessing values with []
print(f'Country:    {respondent["country"]}')
print(f'Salary:     ${respondent["salary"]:,}')
print(f'Languages:  {respondent["languages"]}')

# .get() is safer -- returns None (or a default) if key does not exist
# [] raises KeyError for missing keys; dangerous in data pipelines
ed = respondent.get('ed_level', 'Not specified')
print(f'Education:  {ed}')

# Modifying
respondent['salary']   = 140_000
respondent['ed_level'] = "Bachelor's"
del respondent['id']
print(f'Updated salary:  ${respondent["salary"]:,}')

# Iterating
print()
print('--- All fields ---')
for key, value in respondent.items():   # .items() yields (key, value) tuples
    print(f'  {key:<12}: {value}')
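
The rule about dictionary keys is easy to check directly. The sketch below uses made-up averages keyed by a `(country, language)` tuple, shows that a list is rejected as a key, and contrasts `[]` with `.get()` for a missing key.

```python
# Tuples work as keys; the figures here are made up for illustration
avg_salary = {('Canada', 'Python'): 118_000, ('UK', 'SQL'): 82_000}
lookup = avg_salary[('Canada', 'Python')]

# A list is mutable and therefore unhashable, so it cannot be a key
try:
    bad = {['Canada', 'Python']: 118_000}
except TypeError as err:
    list_key_message = str(err)

# [] raises KeyError for a missing key; .get() returns None or a default
missing = avg_salary.get(('India', 'Rust'))
fallback = avg_salary.get(('India', 'Rust'), 0)
try:
    avg_salary[('India', 'Rust')]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(f'lookup:           {lookup:,}')
print(f'list as key:      {list_key_message}')
print(f'.get() missing:   {missing}')
print(f'.get() fallback:  {fallback}')
print(f'[] raised KeyError: {raised_key_error}')
```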


---

## Section 1.4 -- Functions: Organizing Code

A **function** is a named, reusable block of code that performs a specific task.
Functions are the fundamental unit of code organisation in Python.
In AI/ML, every preprocessing step, training loop, and evaluation metric is a function.

Good functions are small, named clearly, and documented with a docstring.


In [None]:
# 1.4.1 and 1.4.2 -- Defining Functions and Arguments
#
# Argument types:
#   positional -- must be provided, in order
#   keyword    -- can be passed by name in any order
#   default    -- has a fallback value if the caller omits it

def categorise_salary(salary, currency='USD', threshold_senior=100_000):
    """
    Categorise a salary into a band label.

    Parameters
    ----------
    salary : float
        Annual compensation amount.
    currency : str, optional
        Currency code. Not used in the band calculation; included here to
        illustrate a default argument. Default is 'USD'.
    threshold_senior : int, optional
        Salary at or above which a respondent is 'Senior'. Default 100_000.

    Returns
    -------
    str
        One of: 'Principal', 'Senior+', 'Senior', 'Mid-level', 'Junior', 'Entry-level'.
    """
    if salary >= threshold_senior * 2:
        return 'Principal'
    elif salary >= threshold_senior * 1.5:
        return 'Senior+'
    elif salary >= threshold_senior:
        return 'Senior'
    elif salary >= threshold_senior * 0.6:
        return 'Mid-level'
    elif salary >= threshold_senior * 0.3:
        return 'Junior'
    else:
        return 'Entry-level'

# Calling with positional arguments
print(categorise_salary(135_000))

# Calling with keyword arguments -- order does not matter
print(categorise_salary(threshold_senior=120_000, salary=135_000, currency='GBP'))

# Test across a range
test_salaries = [25_000, 55_000, 95_000, 135_000, 175_000, 250_000]
print()
print(f'{"Salary":>12}  Band')
print('-' * 26)
for sal in test_salaries:
    band = categorise_salary(sal)
    print(f'  ${sal:>10,}  {band}')
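
Positional, keyword, and default arguments can be mixed in a single call. The sketch below uses `salary_label`, a hypothetical helper written only for this illustration, to show which parameter each argument lands in and what happens when a required argument is omitted.

```python
def salary_label(salary, currency='USD', threshold=100_000):
    """Hypothetical helper: format a salary and flag it against a threshold."""
    flag = 'at/above' if salary >= threshold else 'below'
    return f'{salary:,} {currency} ({flag} {threshold:,})'

a = salary_label(135_000)                      # both defaults used
b = salary_label(135_000, 'GBP')               # second positional -> currency
c = salary_label(135_000, threshold=150_000)   # skip currency by naming threshold

# Omitting a parameter that has no default raises a TypeError
try:
    salary_label()
except TypeError as err:
    missing_arg_message = str(err)

print(a)
print(b)
print(c)
print(f'TypeError: {missing_arg_message}')
```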


In [None]:
# 1.4.3 -- Variable-Length Arguments: *args and **kwargs
#
# *args   -- collects extra POSITIONAL arguments into a tuple
# **kwargs -- collects extra KEYWORD arguments into a dictionary

def summarise_group(*salaries, label='Group'):
    """Compute salary statistics for any number of salary values."""
    if not salaries:
        return f'{label}: no data'
    avg = sum(salaries) / len(salaries)
    return f'{label}: n={len(salaries)}, avg=${avg:,.0f}, min=${min(salaries):,}, max=${max(salaries):,}'

print(summarise_group(95_000, 135_000, 88_000, label='Python devs'))
print(summarise_group(72_000, 68_000, label='Junior devs'))
print(summarise_group(210_000, 195_000, 230_000, 178_000, label='Principal devs'))

print()

def build_profile(name, salary, **attributes):
    """Build a respondent profile dict with any extra attributes."""
    profile = {'name': name, 'salary': salary}
    profile.update(attributes)   # merge extra kwargs into the dict
    return profile

profile = build_profile(
    'Alex',
    135_000,
    country   = 'Canada',
    languages = ['Python', 'SQL'],
    remote    = True,
    years_exp = 7,
)
print('Respondent profile:')
for k, v in profile.items():
    print(f'  {k:<12}: {v}')
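
In practice salaries usually sit in a list already, so the same `*` and `**` symbols are also used at the call site to unpack a list or dict into arguments. The sketch below uses `group_stats`, a trimmed-down stand-in for `summarise_group` written for this illustration.

```python
def group_stats(*salaries, label='Group'):
    """Trimmed-down stand-in for summarise_group: returns (label, n, mean)."""
    if not salaries:
        return label, 0, None
    return label, len(salaries), sum(salaries) / len(salaries)

python_salaries = [95_000, 135_000, 88_000]
options = {'label': 'Python devs'}

# * spreads the list into separate positional arguments;
# ** spreads the dict into keyword arguments
stats = group_stats(*python_salaries, **options)
print(stats)

# Without *, the whole list arrives as ONE positional argument,
# and sum() then fails when it tries to add an int to a list
try:
    group_stats(python_salaries)
    forgot_star_failed = False
except TypeError:
    forgot_star_failed = True
print(f'Call without * failed: {forgot_star_failed}')
```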


In [None]:
# 1.4.4 -- Return Values and Scope

# A function can return any Python object, including multiple values
# (packed into a tuple automatically).
#
# Scope: variables defined INSIDE a function are LOCAL -- they exist
# only while the function runs. Variables defined outside are GLOBAL.

def analyse_salaries(salary_list):
    """Compute mean, median, min, max, and count."""
    sorted_sals = sorted(salary_list)
    n           = len(sorted_sals)
    mean        = sum(sorted_sals) / n
    mid         = n // 2    # floor division -- int // int returns an int
    if n % 2 == 1:
        median = sorted_sals[mid]
    else:
        median = (sorted_sals[mid - 1] + sorted_sals[mid]) / 2
    return mean, median, min(sorted_sals), max(sorted_sals), n

salaries = [95_000, 135_000, 88_000, 210_000, 72_000, 155_000, 98_000]
mean, median, lo, hi, count = analyse_salaries(salaries)   # unpack the tuple

print(f'Count:  {count}')
print(f'Mean:   ${mean:,.0f}')
print(f'Median: ${median:,.0f}')
print(f'Range:  ${lo:,} to ${hi:,}')

# Scope demonstration
multiplier = 1.3   # GLOBAL variable

def apply_raise(salary):
    new_salary = salary * multiplier   # reads the global multiplier
    return new_salary                  # new_salary is LOCAL -- gone after return

raised = apply_raise(100_000)
print(f'After {(multiplier-1)*100:.0f}% raise: ${raised:,.0f}')
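
Reading a global inside a function works as shown above, but assigning to the same name inside a function creates a separate local variable instead. The sketch below illustrates this with hypothetical function names, and shows the usually clearer alternative of passing the value in as a parameter.

```python
import math

multiplier = 1.3   # GLOBAL variable

def apply_raise_shadow(salary):
    multiplier = 2.0             # creates a NEW local name; the global is untouched
    boosted = salary * multiplier
    return boosted

def apply_raise_param(salary, rate):
    return salary * rate         # no hidden dependency on a global

raised = apply_raise_shadow(100_000)
print(f'Inside function used 2.0 -> {raised:,.0f}')
print(f'Global multiplier is still {multiplier}')
print(f'Explicit parameter       -> {apply_raise_param(100_000, multiplier):,.0f}')

# Local names do not exist outside the function
try:
    boosted
    local_visible = True
except NameError:
    local_visible = False
print(f'boosted visible outside: {local_visible}')
```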


In [None]:
# 1.4.5 -- Lambda Functions
#
# A lambda is a small anonymous function defined in one expression.
# Syntax: lambda param1, param2: expression
# Use when the function is simple enough for one line and needed as an argument.

# Equivalent lambda and def
def double_salary(s):
    return s * 2
double_lambda = lambda s: s * 2
print(f'def:    ${double_salary(70_000):,}')
print(f'lambda: ${double_lambda(70_000):,}')

# Key use case: as the key= argument to sorted()
respondents = [
    {'name': 'Alex',  'salary': 135_000, 'country': 'Canada'},
    {'name': 'Blake', 'salary':  88_000, 'country': 'UK'},
    {'name': 'Casey', 'salary': 210_000, 'country': 'USA'},
    {'name': 'Dana',  'salary':  72_000, 'country': 'India'},
    {'name': 'Ellis', 'salary': 155_000, 'country': 'Germany'},
]

by_salary = sorted(respondents, key=lambda r: r['salary'], reverse=True)
print()
print('Sorted by salary (descending):')
for r in by_salary:
    print(f'  {r["name"]:<8} ${r["salary"]:>10,}  {r["country"]}')

# filter() -- keep only items matching a condition
high_earners = list(filter(lambda r: r['salary'] > 100_000, respondents))
print()
print(f'High earners (>$100k): {[r["name"] for r in high_earners]}')

# map() -- apply a transformation to every item
net = list(map(lambda r: {**r, 'net': r['salary'] * 0.8}, respondents))
print()
print('Net salaries (after 20% tax):')
for r in net:
    print(f'  {r["name"]:<8} gross: ${r["salary"]:>10,}  net: ${r["net"]:>10,.0f}')
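
The `key=` argument is not specific to `sorted()`: the built-ins `max()` and `min()` accept it too. The sketch below reuses four of the rows from the cell above to pick out single records without sorting the whole list.

```python
respondents = [
    {'name': 'Alex',  'salary': 135_000, 'country': 'Canada'},
    {'name': 'Blake', 'salary':  88_000, 'country': 'UK'},
    {'name': 'Casey', 'salary': 210_000, 'country': 'USA'},
    {'name': 'Dana',  'salary':  72_000, 'country': 'India'},
]

# max()/min() compare items by whatever the key function returns
top = max(respondents, key=lambda r: r['salary'])
lowest = min(respondents, key=lambda r: r['salary'])

# The key can be any expression, here the country name (alphabetical order)
by_country = sorted(respondents, key=lambda r: r['country'])

print(f'Highest paid: {top["name"]} (${top["salary"]:,})')
print(f'Lowest paid:  {lowest["name"]} (${lowest["salary"]:,})')
print(f'By country:   {[r["country"] for r in by_country]}')
```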


---

## Section 1.4.6 -- Project: SO 2025 Data with Pure Python

Now we apply everything from Chapter 1 to a real sample of the SO 2025 dataset.

We will build a small pure-Python analytics module -- no NumPy, no Pandas --
that computes meaningful statistics from a list of respondent dictionaries.
Understanding this makes you a better Pandas user because you know what
Pandas is doing under the hood.

> We load a 20-row sample using Python's built-in `csv` and `urllib` modules.
> The full 15,000-row dataset is loaded with `pd.read_csv()` in Chapter 3.


In [None]:
# Load a small sample of the SO 2025 dataset using only Python built-ins.
# csv and urllib are part of the standard library -- no pip install needed.
import csv
import urllib.request

DATASET_URL = 'https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv'

print('Loading sample from SO 2025 dataset...')

with urllib.request.urlopen(DATASET_URL) as response:
    raw = response.read().decode('utf-8')

lines  = raw.splitlines()
reader = csv.DictReader(lines)   # first row becomes column names

sample = []
for row in reader:
    salary_str = row.get('ConvertedCompYearly', '').strip()
    if not salary_str:
        continue
    try:
        salary = float(salary_str)
    except ValueError:
        continue

    respondent = {
        'country':  row.get('Country', 'Unknown').strip(),
        'salary':   salary,
        'years_exp': row.get('YearsCodePro', '').strip(),
        'languages': row.get('LanguageHaveWorkedWith', '').strip(),
        'dev_type':  row.get('DevType', '').strip(),
        'remote':    row.get('RemoteWork', '').strip(),
        'ai_tools':  row.get('AIToolCurrently', '').strip(),
    }
    sample.append(respondent)
    if len(sample) >= 20:
        break

print(f'Sample loaded: {len(sample)} respondents')
print()
for i, r in enumerate(sample[:3]):
    print(f'Respondent {i+1}:')
    for k, v in r.items():
        display_v = v[:60] if isinstance(v, str) and len(v) > 60 else v
        print(f'  {k:<12}: {display_v}')
    print()
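
The parsing logic can also be tried without a network connection. The sketch below feeds `csv.DictReader` a few made-up rows through `io.StringIO`. The column names mirror those used by the loader above, but the values are invented and the blank and non-numeric salaries are there only to exercise the skip logic.

```python
import csv
import io

# Made-up rows, for illustration only
raw = (
    'Country,ConvertedCompYearly,LanguageHaveWorkedWith\n'
    'Canada,135000,Python;SQL\n'
    'Germany,,Rust\n'
    'India,not available,Python\n'
    'United Kingdom,88000,JavaScript;SQL\n'
)

reader = csv.DictReader(io.StringIO(raw))   # first row becomes column names

sample = []
skipped = 0
for row in reader:
    salary_str = (row.get('ConvertedCompYearly') or '').strip()
    try:
        salary = float(salary_str)          # '' and text both raise ValueError
    except ValueError:
        skipped += 1
        continue
    sample.append({
        'country':   row['Country'].strip(),
        'salary':    salary,
        'languages': row['LanguageHaveWorkedWith'].strip(),
    })

print(f'Kept {len(sample)} rows, skipped {skipped}')
for r in sample:
    print(f'  {r["country"]:<16} ${r["salary"]:>10,.0f}  {r["languages"]}')
```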


In [None]:
# Pure-Python analytics functions applied to the SO 2025 sample.
# These use only Chapter 1 features: lists, dicts, loops, functions.

def average_salary(respondents):
    """Compute mean salary. Returns 0.0 if list is empty."""
    if not respondents:
        return 0.0
    return sum(r['salary'] for r in respondents) / len(respondents)


def count_by_country(respondents):
    """Count respondents per country, sorted by count descending."""
    counts = {}
    for r in respondents:
        country = r['country'] or 'Unknown'
        counts[country] = counts.get(country, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def most_common_languages(respondents, top_n=5):
    """Find the most frequently used languages from semicolon-separated strings."""
    lang_counts = {}
    for r in respondents:
        if not r['languages']:
            continue
        for lang in r['languages'].split(';'):
            lang = lang.strip()
            if lang:
                lang_counts[lang] = lang_counts.get(lang, 0) + 1
    return sorted(lang_counts.items(), key=lambda x: x[1], reverse=True)[:top_n]


def salary_by_country(respondents):
    """Compute average salary per country, sorted by average descending."""
    data = {}
    for r in respondents:
        c = r['country'] or 'Unknown'
        data.setdefault(c, []).append(r['salary'])   # .setdefault avoids key-check boilerplate
    results = [(c, sum(sals)/len(sals), len(sals)) for c, sals in data.items()]
    return sorted(results, key=lambda x: x[1], reverse=True)


# Run the analytics
print('=' * 55)
print('  SO 2025 Developer Survey -- Sample Analytics')
print('=' * 55)

print(f'Sample size:      {len(sample)} respondents')
print(f'Average salary:   ${average_salary(sample):,.0f}')

print()
print('Respondents by country:')
for country, count in count_by_country(sample):
    bar = 'I' * count
    print(f'  {country:<25} {count:>2}  {bar}')

print()
print('Top languages in sample:')
for rank, (lang, count) in enumerate(most_common_languages(sample, top_n=8), 1):
    print(f'  #{rank}: {lang:<25} {count} respondent(s)')

print()
print('Average salary by country:')
for country, avg_sal, n in salary_by_country(sample):
    print(f'  {country:<25} ${avg_sal:>10,.0f}  (n={n})')
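
The counting pattern in `count_by_country()` and `most_common_languages()` is common enough that the standard library provides `collections.Counter` for it. The sketch below is an alternative on a made-up four-row sample, not a replacement for the functions above, and produces the same kind of sorted counts.

```python
from collections import Counter

# Made-up respondents, for illustration only
sample = [
    {'country': 'Canada', 'languages': 'Python;SQL'},
    {'country': 'Canada', 'languages': 'Python'},
    {'country': 'UK',     'languages': 'Python;SQL;Rust'},
    {'country': '',       'languages': ''},
]

# Counter does the counts.get(key, 0) + 1 bookkeeping for us
country_counts = Counter(r['country'] or 'Unknown' for r in sample)

lang_counts = Counter(
    lang.strip()
    for r in sample
    for lang in r['languages'].split(';')
    if lang.strip()                 # skip empty strings
)

# most_common(n) returns the n highest counts, sorted descending
top_langs = lang_counts.most_common(2)

print(f'By country: {dict(country_counts)}')
print(f'Top languages: {top_langs}')
```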


---

## Chapter 1 Summary

You have completed the foundational Python chapter.
Every concept here will be used directly and constantly in every chapter that follows.

### Key Takeaways

- **Variables** are named containers. Python infers type automatically.
- **Core types:** `int`, `float`, `str`, `bool` -- know when to use each.
- **F-strings** (`f'{value:.2f}'`) are the standard formatting approach in modern Python.
- **Lists** are ordered and mutable -- go-to for sequences you will change.
- **Tuples** are ordered and immutable -- use for fixed records and multiple return values.
- **Sets** store unique items -- use for fast membership tests and deduplication.
- **Dictionaries** map keys to values -- natural representation of one data record.
- **Functions** are the unit of code reuse -- name clearly, document with docstrings.
- **`*args` / `**kwargs`** accept flexible argument counts.
- **Lambdas** are anonymous single-expression functions -- ideal as `key=` arguments to `sorted()`.
- **`break`** exits a loop; **`continue`** skips to next iteration; **`pass`** is a placeholder.

### Project Thread Status

| Task | Status |
|------|--------|
| Loaded SO 2025 sample with `urllib` + `csv` | Done |
| Parsed salary, country, language columns | Done |
| Built `average_salary()` | Done |
| Built `count_by_country()` | Done |
| Built `most_common_languages()` | Done |
| Built `salary_by_country()` | Done |
| Produced a pure-Python analytics report | Done |

---

### What's Next: Chapter 2 -- Intermediate Python

Chapter 2 covers modules and packages, file I/O, error handling,
Object-Oriented Programming (including the bridge to scikit-learn's API),
and Pythonic patterns: comprehensions, generators, and decorators.

---

*End of Chapter 1 -- Python for AI/ML*  
