
# DS2002 — SQL + Python → Pandas DataFrames
## 2026-02-02 — Lecture (Deep Dive)

Today’s goal is simple but important:

**Understand what DataFrames let you do that SQL alone does not.**

We will:
- keep using the same store database
- ask increasingly interesting questions
- answer them *inside Python* using DataFrames

Think of this lecture as learning how to *reason* with data once it has left the database.



## Reminder: Where DataFrames Fit

A realistic data workflow looks like this:

1. SQL pulls the correct rows and columns
2. Results become a DataFrame
3. Analysis, transformation, and experimentation happen in Python
4. Results may be written back to the database

Most of today's work lives in steps 2–3; the last section touches step 4 by writing a summary table back to the database.


In [None]:

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

def exec_sql(sql):
    conn.executescript(sql)
    conn.commit()

def q(sql):
    return pd.read_sql_query(sql, conn)

print("Environment ready.")
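
Step 1 of the workflow asks SQL to pull only the rows you need. One way to do that safely is to bind values as parameters instead of pasting them into the SQL string. The sketch below is self-contained: `demo_conn` and `q_params` are hypothetical names (a variant of the `q` helper above), and the `?` placeholder style is the one used by `sqlite3`.

```python
import sqlite3
import pandas as pd

# Separate in-memory database so this sketch does not touch the notebook's conn
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
INSERT INTO customers VALUES (1,'Alice','East'),(2,'Bob','West'),(3,'Carol','East');
""")

def q_params(sql, params=()):
    # Hypothetical variant of q(): forwards bound parameters to the database driver
    return pd.read_sql_query(sql, demo_conn, params=params)

# The region value is bound to the ? placeholder, not formatted into the string
east = q_params("SELECT name FROM customers WHERE region = ? ORDER BY name", ("East",))
print(east)
```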



## Store Database (Same Example, Richer Questions)

We will keep using this small store.
The schema is intentionally simple so the *analysis* is the focus.


In [None]:

exec_sql('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    region TEXT
);

CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT,
    price REAL
);

CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date TEXT
);

CREATE TABLE order_items (
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER
);

INSERT INTO customers VALUES
(1,'Alice','East'),
(2,'Bob','West'),
(3,'Carol','East'),
(4,'David','South');

INSERT INTO products VALUES
(101,'Keyboard','Electronics',50),
(102,'Mouse','Electronics',20),
(103,'Monitor','Electronics',200),
(104,'Desk','Furniture',300),
(105,'Chair','Furniture',150);

INSERT INTO orders VALUES
(1,1,'2026-02-01'),
(2,2,'2026-02-02'),
(3,1,'2026-02-03'),
(4,3,'2026-02-03'),
(5,4,'2026-02-04');

INSERT INTO order_items VALUES
(1,101,1),(1,102,2),
(2,103,1),
(3,104,1),(3,105,1),
(4,101,2),(4,103,1),
(5,105,2);
''')



## Pull the Data Into a DataFrame


In [None]:

df = q('''
SELECT
    c.name AS customer,
    c.region,
    o.order_id,
    o.order_date,
    p.name AS product,
    p.category,
    oi.quantity,
    p.price,
    (oi.quantity * p.price) AS line_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_date, o.order_id;
''')

df



## Filtering: Asking Focused Questions

Filtering narrows attention to the rows that matter for a question.
Most analyses begin with a filter of some kind.

The important idea:
**Filters are expressions that evaluate to True or False for each row.**
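
To make that idea concrete, the sketch below looks at the True/False values on their own before using them to select rows. It uses a small hand-built `sample` frame (values copied from the store data) so it runs independently of the notebook's `df`; `mask` and `filtered` are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the joined store data
sample = pd.DataFrame({
    "customer": ["Alice", "Alice", "Bob", "Carol"],
    "line_total": [50.0, 40.0, 200.0, 100.0],
})

# The comparison alone returns a boolean Series, one value per row
mask = sample["line_total"] > 100
print(mask)

# Indexing with the mask keeps only the rows where it is True
filtered = sample[mask]
print(filtered)
```

Because the mask is an ordinary Series, it can be stored, inspected, or counted (`mask.sum()` gives the number of matching rows) before any rows are selected.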


In [None]:

# All rows where the customer is Alice
df[df['customer'] == 'Alice']


In [None]:

# Line items with a line total over $100
df[df['line_total'] > 100]


In [None]:

# Electronics items only
df[df['category'] == 'Electronics']



### Combining Filters

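Combine conditions with the element-wise operators `&` (and), `|` (or), and `~` (not).
Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `==` and `>`.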


In [None]:

# Electronics purchases over $100
df[(df['category'] == 'Electronics') & (df['line_total'] > 100)]
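
The same pattern extends to "or" and "not". The sketch below names each condition first, which tends to keep longer filters readable. It uses a hand-built `sample` frame rather than the notebook's `df`, and the names `is_electronics` and `is_large` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Furniture", "Furniture"],
    "line_total": [40.0, 200.0, 300.0, 150.0],
})

# Each condition is a boolean Series that can be reused
is_electronics = sample["category"] == "Electronics"
is_large = sample["line_total"] > 100

both = sample[is_electronics & is_large]     # AND: both conditions hold
either = sample[is_electronics | is_large]   # OR: at least one holds
not_electronics = sample[~is_electronics]    # NOT: condition is False

print(both)
print(either)
print(not_electronics)
```

Python's `and` and `or` keywords do not work here: applied to two Series they raise a `ValueError`, because a whole column has no single truth value.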



## Sorting: Revealing Structure

Sorting is not just cosmetic.
It reveals patterns, extremes, and anomalies.


In [None]:

# Highest-value line items first
df.sort_values(by='line_total', ascending=False)


In [None]:

# Sort by customer, then by order date
df.sort_values(by=['customer', 'order_date'])
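
Two related variations are sketched below on a hand-built `sample` frame: sorting with a different direction per column, and asking for only the top few rows with `nlargest`. The variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Bob", "Alice", "Carol"],
    "line_total": [50.0, 200.0, 300.0, 100.0],
})

# Customers A to Z, and within each customer the largest line item first
by_customer_desc = sample.sort_values(
    by=["customer", "line_total"],
    ascending=[True, False],
)
print(by_customer_desc)

# Only the two highest-value line items
top2 = sample.nlargest(2, "line_total")
print(top2)
```

`sort_values` returns a new DataFrame and leaves `sample` in its original order, so a sorted view can be explored without disturbing later cells.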



## Grouping: Turning Rows Into Insight

Grouping answers questions like:
- How much did each customer spend?
- Which category generates the most revenue?
- Which region performs best?


In [None]:

# Total revenue by customer
df.groupby('customer')['line_total'].sum()


In [None]:

# Revenue by product category
df.groupby('category')['line_total'].sum()


In [None]:

# Revenue by region and category
df.groupby(['region', 'category'])['line_total'].sum()
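
Grouping by two columns produces one row per (region, category) pair. One possible next step, sketched here on a hand-built `sample` frame with the same values as the store data, is to pivot the inner key into columns with `unstack` so the result reads as a region-by-category table. Pairs with no sales are filled with 0.

```python
import pandas as pd

sample = pd.DataFrame({
    "region": ["East", "East", "West", "East", "East", "East", "East", "South"],
    "category": ["Electronics", "Electronics", "Electronics", "Furniture",
                 "Furniture", "Electronics", "Electronics", "Furniture"],
    "line_total": [50.0, 40.0, 200.0, 300.0, 150.0, 100.0, 200.0, 300.0],
})

# One value per (region, category) pair, indexed by both keys
revenue = sample.groupby(["region", "category"])["line_total"].sum()
print(revenue)

# Move the inner key (category) into columns; missing pairs become 0
table = revenue.unstack(fill_value=0)
print(table)
```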



### Multiple Aggregations

You can compute several statistics at once.


In [None]:

df.groupby('customer')['line_total'].agg(['count', 'sum', 'mean', 'max'])



## Answering Real Questions With DataFrames

Once data is in a DataFrame, many questions can be answered with a single chained expression.


In [None]:

# Which customer spent the most?
df.groupby('customer')['line_total'].sum().sort_values(ascending=False)
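
If only the single top customer is needed rather than the full ranking, `idxmax` and `max` return the label and the value directly. A minimal sketch on a hand-built `sample` frame (same totals as the store data; variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Alice", "Bob", "Alice", "Alice", "Carol", "Carol", "David"],
    "line_total": [50.0, 40.0, 200.0, 300.0, 150.0, 100.0, 200.0, 300.0],
})

totals = sample.groupby("customer")["line_total"].sum()

top_customer = totals.idxmax()   # index label of the largest total
top_amount = totals.max()        # the largest total itself
print(top_customer, top_amount)
```

When two customers tie for the maximum, `idxmax` reports only the first one, so the sorted ranking is the safer view when ties matter.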


In [None]:

# Which product category has the highest average line-item value?
df.groupby('category')['line_total'].mean().sort_values(ascending=False)



## Writing Results Back to SQL

Analysis often produces new tables that should be saved.


In [None]:

customer_summary = (
    df.groupby('customer')['line_total']
      .sum()
      .reset_index(name='total_spent')
)

customer_summary


In [None]:

customer_summary.to_sql('customer_summary', conn, if_exists='replace', index=False)
q("SELECT * FROM customer_summary;")
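
The `if_exists` argument controls what happens when the target table is already there. The sketch below runs against its own in-memory database (`demo_conn` is a hypothetical name) and shows the three options: `'replace'` drops and recreates the table, `'append'` adds rows, and `'fail'` (the default) raises an error.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
summary = pd.DataFrame({
    "customer": ["Alice", "Bob"],
    "total_spent": [540.0, 200.0],
})

# 'replace' creates the table, or drops and recreates it if it exists
summary.to_sql("customer_summary", demo_conn, if_exists="replace", index=False)

# 'append' keeps the existing rows and adds the new ones after them
summary.to_sql("customer_summary", demo_conn, if_exists="append", index=False)

row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM customer_summary", demo_conn
)["n"].iloc[0]
print(row_count)

# 'fail' refuses to touch an existing table
try:
    summary.to_sql("customer_summary", demo_conn, if_exists="fail", index=False)
    failed = False
except ValueError:
    failed = True
print(failed)
```

Using `'replace'` in a notebook makes the cell safe to re-run, which is likely why the cell above uses it.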



## Final Perspective

DataFrames let you:
- explore freely
- ask better questions
- iterate quickly
- build features for models

SQL and Pandas are not competitors.
They are partners.



# Appendix — Working with Pandas DataFrames (Reference Section)

This section is intentionally reference-heavy.

You are **not** expected to memorize these methods.
You *are* expected to know that these capabilities exist and to recognize when they are useful.

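Everything below operates on the `df` DataFrame created earlier in this notebook.
Several cells list more than one expression; a notebook displays only the value of the last one, so run the lines one at a time or wrap them in `print()` to see each result.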



## 1. Inspecting a DataFrame (Always Start Here)

Before analyzing data, you should always understand:
- how many rows and columns exist
- what the column names are
- what data types Pandas inferred
- whether values look reasonable


In [None]:

df.head()        # first few rows
df.tail()        # last few rows
df.shape         # (rows, columns)
df.columns       # column names
df.dtypes        # data types
df.info()        # compact summary
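
For the "do the values look reasonable" check, summary statistics are a common complement to the calls above. A minimal sketch on a hand-built `sample` frame (same line totals as the store data):

```python
import pandas as pd

sample = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Electronics", "Furniture",
                 "Furniture", "Electronics", "Electronics", "Furniture"],
    "line_total": [50.0, 40.0, 200.0, 300.0, 150.0, 100.0, 200.0, 300.0],
})

# count, mean, std, min, quartiles, and max for a numeric column
stats = sample["line_total"].describe()
print(stats)

# How often each distinct value appears in a text column
category_counts = sample["category"].value_counts()
print(category_counts)
```

A negative minimum or an implausibly large maximum in `describe()` is often the first visible sign of a bad join or a data-entry error.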



## 2. Selecting Columns

Selecting columns changes the *shape* of the table.
It does not filter rows.

A single column returns a **Series**.
Multiple columns return a **DataFrame**.


In [None]:

df['customer']
df[['customer', 'region', 'line_total']]
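
The Series versus DataFrame distinction depends on the brackets, not on the number of columns. The sketch below checks the types on a hand-built `sample` frame; note that a one-element list still returns a DataFrame.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Bob"],
    "region": ["East", "West"],
    "line_total": [50.0, 200.0],
})

one_column = sample["customer"]                # single label -> Series
one_column_frame = sample[["customer"]]        # list with one label -> DataFrame
two_columns = sample[["customer", "region"]]   # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```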



## 3. Filtering Rows (Boolean Indexing)

Filtering is how we ask focused questions.

The expression inside the brackets evaluates to True or False **for each row**.
Rows marked True are kept.


In [None]:

df[df['customer'] == 'Alice']
df[df['line_total'] > 100]
df[df['category'].isin(['Electronics', 'Furniture'])]

df[
    (df['region'] == 'East') &
    (df['line_total'] > 50)
]



## 4. Sorting Data

Sorting helps reveal structure, extremes, and anomalies.
It is often used during exploration.


In [None]:

df.sort_values('line_total')
df.sort_values('line_total', ascending=False)
df.sort_values(['customer', 'order_date'])



## 5. Creating and Modifying Columns

Creating new columns is sometimes called *feature engineering*.

This is where raw data becomes useful data.


In [None]:

df['tax'] = df['line_total'] * 0.07                   # 7% tax on each line item
df['total_with_tax'] = df['line_total'] + df['tax']   # line total including tax
df['is_large_order'] = df['line_total'] > 150         # True/False flag per row

df
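
The cell above adds columns to `df` in place. When experimenting, it can be convenient to leave the original untouched and build the new columns on a copy instead. One way to sketch that is with `assign`, shown here on a hand-built `sample` frame; the 7% rate mirrors the cell above and `enriched` is an illustrative name.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol"],
    "line_total": [50.0, 200.0, 100.0],
})

# assign returns a new DataFrame; later columns can refer to earlier ones
enriched = sample.assign(
    tax=lambda d: d["line_total"] * 0.07,
    total_with_tax=lambda d: d["line_total"] + d["tax"],
    is_large_order=lambda d: d["line_total"] > 150,
)

print(enriched)
print(list(sample.columns))   # the original frame has no new columns
```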



## 6. Grouping and Aggregation

Grouping collapses many rows into fewer summary rows.

This mirrors SQL `GROUP BY`, but allows rapid experimentation.


In [None]:

df.groupby('customer')['line_total'].sum()
df.groupby('category')['line_total'].mean()
df.groupby(['region', 'category'])['line_total'].sum()
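
To see the correspondence with SQL, the sketch below computes the same per-customer totals twice: once with `GROUP BY` in SQLite and once with `groupby` in Pandas. It runs against its own in-memory database; `demo_conn` and the `lines` table are hypothetical names used only for this comparison.

```python
import sqlite3
import pandas as pd

lines = pd.DataFrame({
    "customer": ["Alice", "Alice", "Bob", "Alice", "Alice", "Carol", "Carol", "David"],
    "line_total": [50.0, 40.0, 200.0, 300.0, 150.0, 100.0, 200.0, 300.0],
})

demo_conn = sqlite3.connect(":memory:")
lines.to_sql("lines", demo_conn, index=False)

# SQL version: aggregate inside the database
sql_totals = pd.read_sql_query(
    "SELECT customer, SUM(line_total) AS total "
    "FROM lines GROUP BY customer ORDER BY customer",
    demo_conn,
)

# Pandas version: aggregate after the rows are in memory
pandas_totals = (
    lines.groupby("customer")["line_total"]
         .sum()
         .reset_index(name="total")
)

print(sql_totals)
print(pandas_totals)
```

Both routes give the same numbers. The practical difference is that changing the Pandas version (a different key, a different statistic) does not require another round trip to the database.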



### Multiple Aggregations at Once


In [None]:

df.groupby('customer')['line_total'].agg(
    count='count',
    total='sum',
    average='mean',
    max_order='max'
)



## 7. Resetting the Index

By default, `groupby` results use the grouping columns as the index.
Resetting the index turns them back into normal columns.


In [None]:

summary = df.groupby('customer')['line_total'].sum()
summary.reset_index()
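
An alternative, if the index is never needed, is to ask `groupby` to keep the keys as ordinary columns from the start with `as_index=False`. A minimal sketch on a hand-built `sample` frame, comparing both routes (variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Bob", "Alice", "Carol"],
    "line_total": [50.0, 200.0, 300.0, 100.0],
})

# Route 1: group, then move the index back into a column
via_reset = sample.groupby("customer")["line_total"].sum().reset_index()

# Route 2: keep the grouping key as a column from the start
via_as_index = sample.groupby("customer", as_index=False)["line_total"].sum()

print(via_reset)
print(via_as_index)
```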



## 8. Missing Data (NaN)

Real-world data often contains missing values.
Pandas provides explicit tools to detect and handle them.


In [None]:

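df.isna()          # True/False for every cell: is the value missing?
df.isna().sum()    # number of missing values in each column
df.dropna()        # new DataFrame without rows that contain a missing value
df.fillna(0)       # new DataFrame with missing values replaced by 0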
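
The store data has no missing values, so the calls above have little to show on `df`. The sketch below uses a hand-built `sample` frame with one deliberately missing `line_total` to make the behavior visible; all names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol"],
    "line_total": [50.0, float("nan"), 100.0],   # Bob's value is missing
})

missing_per_column = sample.isna().sum()      # count of NaN per column
dropped = sample.dropna()                     # rows with any NaN removed
filled = sample.fillna({"line_total": 0})     # NaN replaced, per column

print(missing_per_column)
print(dropped)
print(filled)

# Aggregations such as sum() skip NaN by default
print(sample["line_total"].sum())
```

Neither `dropna` nor `fillna` changes `sample` itself; each returns a new DataFrame. Whether to drop or fill depends on what a missing value means in the data, since filling with 0 changes averages while dropping changes counts.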



## 9. Saving DataFrames

Analysis often produces new tables that should be stored for later use.


In [None]:

summary = (
    df.groupby('customer')['line_total']
      .sum()
      .reset_index(name='total_spent')
)

summary.to_sql(
    'customer_summary',
    conn,
    if_exists='replace',
    index=False
)

q("SELECT * FROM customer_summary;")



## Final Perspective

A DataFrame is not just a container.
It is a *working surface* for thinking with data.

SQL retrieves structured data.
Pandas lets you explore, transform, and build insight from it.

You will use these patterns repeatedly for the rest of the course.
