<a href="https://colab.research.google.com/github/samjurassic/datascience-demo/blob/main/coda/HBS_CoDA_Python_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HBS CoDA Python Workshop - Part 1
Learning objectives include:

• Use basic data types in Python

• Load a CSV or Excel file into Python

• Conduct exploratory data analysis with Pandas and Seaborn

• Calculate summary statistics

• Use groupby to aggregate data

• Create scatter, bar, line, and box plots

• Train a decision tree with scikit-learn

• Analyze multiple files with DuckDB

• Vibe-code a data pipeline using genAI

## Python Basics

In [None]:
# --- Types and DATA STRUCTURES ---

# string (in quotes)
my_name = "Sam"

# note: print with an f-string to interpolate values in {}
print(f"My name is {my_name}")

# integer
favorite_number = 7

# float
sales_tax = 0.0875

# boolean
it = True

# list
prices = [10, 214, 150, 59]

# PYTHON INDICES START AT ZERO, access with integer in sq. brackets []
print(f"First price: {prices[0]}")

# dict: Used for named key: value pairs, often a row of data (in excel terms)
product_row = {"id": 101, "name": "Keyboard", "price": 25, "in_stock": True}

# Accessing dictionary data by name (Key)
# note: single quotes inside the f-string keep this working on Python versions before 3.12
print(f"{product_row['name']} Price: {product_row['price']}")
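
Since a dict often stands for one row, a list of dicts can stand for a small table. The sketch below assumes a made-up `products` list (the names and prices are hypothetical) and shows row-then-column access, a safe lookup with `.get()`, and summing one "column".

```python
# Hypothetical mini-table: each dict is one row, each key is a column name
products = [
    {"id": 101, "name": "Keyboard", "price": 25, "in_stock": True},
    {"id": 102, "name": "Monitor", "price": 180, "in_stock": False},
    {"id": 103, "name": "Mouse", "price": 15, "in_stock": True},
]

# Index into the list first (row), then into the dict (column)
second_name = products[1]["name"]
print(f"Second product: {second_name}")

# .get() returns a default instead of raising KeyError for a missing key
discount = products[0].get("discount", 0)
print(f"Discount on first product: {discount}")

# Add up one "column" across all rows
total_price = sum(row["price"] for row in products)
print(f"Total price: {total_price}")
```

This row-by-row layout is handy for small examples; for real tables a DataFrame is usually more convenient.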


In [None]:
# --- FUNCTIONS ---

# A function is a reusable tool
# It can include control flow (logic) to make decisions

def calculate_tax(price):
    if price >= 100.0:
        return price * 1.0875  # 8.75% tax on luxury items (price of 100 or more)
    else:
        return price * 1.00  # no tax on cheaper items


# --- LOOPS/ITERABLES ---

# The For loop
taxed_prices = [] # start with empty list

for p in prices:
    new_price = calculate_tax(p) # call function
    taxed_prices.append(new_price) # list.append() adds item to list

print(f"For Loop Result: \t{taxed_prices}")

# The "Pythonic Way" (List Comprehension)
# A loop condensed into one line: [ ACTION for ITEM in LIST ]

taxed_prices_v2 = [calculate_tax(p) for p in prices]

print(f"Comprehension Result: \t{taxed_prices_v2}")

# note: try to balance code readability and length
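
A comprehension can also filter items with a trailing `if`. Here is a small sketch using the same `prices` list; the "luxury"/"regular" labels are illustrative only.

```python
prices = [10, 214, 150, 59]

# [ ACTION for ITEM in LIST if CONDITION ] keeps only items that pass the test
expensive = [p for p in prices if p >= 100]
print(expensive)

# A dict comprehension builds key: value pairs the same way
price_labels = {p: ("luxury" if p >= 100 else "regular") for p in prices}
print(price_labels)
```

If a comprehension needs more than one condition or a nested loop, a regular for loop is often easier to read.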

## DataFrames 101

In [None]:
# imports bring in code that isn't included in base python

import pandas as pd # dataframes
import numpy as np # numerical calculations

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Remember Dictionaries? A DataFrame can be built from a "Dictionary of Lists"
# Each Key is a Column Name. Each Value is the List of data.
data = {
    "Product": ["Apple", "Banana", "Orange"],
    "Price": [1.20, 0.50, 2.50],
    "In_Stock": [True, True, False]
}

# Convert the dictionary to a DataFrame
tiny_df = pd.DataFrame(data)
print("--- Tiny DataFrame ---")
print(tiny_df)
print("\n")

In [None]:
# filtering data

# query (returns a copy)
tiny_df.query("Product == 'Apple'")

In [None]:
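# .loc with a boolean mask also returns a new DataFrame (not the original object)
# .loc is the indexer to use when assigning values back into the DataFrame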
tiny_df.loc[tiny_df["Product"] == "Apple"]
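
Filters can combine several conditions. The sketch below rebuilds the tiny DataFrame so it runs on its own, then writes the same two-condition filter with a boolean mask and with `query`.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Orange"],
    "Price": [1.20, 0.50, 2.50],
    "In_Stock": [True, True, False],
})

# Boolean masks combine with & (and) and | (or); wrap each comparison in parentheses
cheap_in_stock = tiny_df.loc[(tiny_df["Price"] < 2.00) & (tiny_df["In_Stock"])]

# The same filter written as a query string
cheap_in_stock_q = tiny_df.query("Price < 2.00 and In_Stock")

print(cheap_in_stock)
```

Both forms select the same rows; which one to use is mostly a readability choice.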

In [None]:
# Assignment

print("Before update", tiny_df, sep="\n")

# update existing values with .loc
tiny_df.loc[tiny_df["Product"] == "Apple", "Price"] = 1.45

# assign new column
tiny_df["Date"] = "2026-02-18"

print("", "After update", tiny_df, sep="\n")
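
New columns can also be computed from existing ones. This sketch assumes a hypothetical `Price_With_Tax` column and reuses the 8.75% rate from the basics section; it is an illustration, not part of the workshop data.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Orange"],
    "Price": [1.20, 0.50, 2.50],
})
sales_tax = 0.0875

# Arithmetic on a column is applied to every row at once (no loop needed)
tiny_df["Price_With_Tax"] = (tiny_df["Price"] * (1 + sales_tax)).round(2)

# A comparison produces a True/False column
tiny_df["Over_Two_Dollars"] = tiny_df["Price"] > 2.00

print(tiny_df)
```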

## Reading a CSV from a URL and Data Cleanup

We will be using food import data from the USDA. This file includes several common data issues we will address using Pandas:
- Mixed data types in the same column
- Mixed units in the same column
- Aggregate data included with disaggregated data

In [None]:
# https://www.ers.usda.gov/data-products/us-food-imports/documentation

link = "https://www.ers.usda.gov/media/6495/summary-data-on-annual-food-imports-values-and-volume-by-food-category-and-source-country.csv?v=37251"

# encoding is usually figured out automatically, but if you get an encoding error you can specify
food_imports = pd.read_csv(link, encoding="cp1252")

# food_imports.to_csv("annual_food_imports.csv", index=False)

In [None]:
food_imports.head()

In [None]:
# df.info() gives size, type, null information
food_imports.info()

In [None]:
food_imports.describe().round(2)

In [None]:
# year is not numeric... let's try to re-assign to integer
try:
  food_imports["Year"] = food_imports["YearNum"].astype(int)
except Exception as e:
  print(e)

In [None]:
food_imports["YearNum"].value_counts().tail(5)

In [None]:
# inspect the rows where YearNum holds a label ('means', 'means10years') instead of a year
food_imports.query("YearNum.isin(['means10years', 'means'])").head(5)

In [None]:
# let's drop these rows with df.drop
rows_to_drop = food_imports.query("YearNum.isin(['means10years', 'means'])").index
food_imports.drop(rows_to_drop, axis=0, inplace=True)

In [None]:
# now we can make a new integer year column
food_imports["Year"] = food_imports["YearNum"].astype(int)

food_imports.describe()

### Line plot example

In [None]:
# we can chain multiple methods together
vegetables_annual_usd = (food_imports
                         .query("Commodity == 'Total vegetables and preparations' and UOM == 'Million $'")
                         .groupby(["Country", "Year"])["FoodValue"]
                         .sum()
                         .reset_index())

plot_data = vegetables_annual_usd[~vegetables_annual_usd['Country'].isin(['WORLD', 'REST OF WORLD', 'WORLD (Quantity)'])].copy()
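
`groupby` can compute several summary statistics in one pass with named aggregation. The sketch uses a hypothetical `toy` DataFrame shaped like the filtered import data; the numbers are invented.

```python
import pandas as pd

# Hypothetical values, shaped like the filtered import data
toy = pd.DataFrame({
    "Country": ["MEXICO", "MEXICO", "CANADA", "CANADA"],
    "Year": [2022, 2023, 2022, 2023],
    "FoodValue": [100.0, 120.0, 40.0, 50.0],
})

# Named aggregation: new_column=(source_column, function)
summary = (toy
           .groupby("Country")
           .agg(total=("FoodValue", "sum"),
                average=("FoodValue", "mean"),
                n_years=("Year", "nunique"))
           .reset_index())

print(summary)
```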

In [None]:
# Set figure size to make it wider/readable
plt.figure(figsize=(12, 8))

# CREATE PLOT
sns.lineplot(
    data=plot_data,
    x="Year",
    y="FoodValue",
    hue="Country",
    legend="auto",
    linewidth=2
)

# log y axis (don't need to transform data in dataframe to do this)
plt.yscale('log')

# FORMATTING
plt.title("Vegetable Food Value by Country", fontsize=16)
plt.ylabel("Millions of USD")
plt.xlabel("Year")
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
food_imports.Commodity.value_counts().to_frame()[0:9]

### Boxplot example

In [None]:
coffee_imports = food_imports[~food_imports['Country'].isin(['WORLD', 'REST OF WORLD', 'WORLD (Quantity)'])].query("Commodity == 'Coffee beans, unroasted'")

In [None]:
plt.figure()

sns.boxplot(
    data=coffee_imports,
    x="Country",
    y="FoodValue",
    legend="auto",
    hue="Country"
)

# Rotate labels so they don't overlap
plt.xticks(rotation=45, ha='right')
plt.title("Range of Unroasted Coffee Imports by Country 1999-2024")
plt.xlabel("Country of Origin")
plt.ylabel("Millions of USD")
plt.tight_layout()
plt.show()

### Bar plot Example

In [None]:
swiss_imports = food_imports.query("Country == 'SWITZERLAND'")

swiss_imports_sum = swiss_imports.groupby("Commodity")["FoodValue"].sum().reset_index()

sns.barplot(swiss_imports_sum, x="Commodity", y="FoodValue")

In [None]:
swiss_imports_sum

In [None]:
# Derive "Tea and spices" as the category total minus roasted and instant coffee
# (rows are looked up by commodity name rather than by position)
total_coffee_tea_spices = swiss_imports_sum.loc[swiss_imports_sum["Commodity"] == "Total coffee, tea, and spices", "FoodValue"].values[0]
total_coffee = swiss_imports_sum.loc[swiss_imports_sum["Commodity"] == "Coffee, roasted and instant", "FoodValue"].values[0]

tea_and_spices = round(total_coffee_tea_spices - total_coffee, 2)
print(tea_and_spices)

# Append the new row at the next free index so no existing row is overwritten
swiss_imports_sum.loc[swiss_imports_sum.index.max() + 1] = ['Tea and spices', tea_and_spices]

In [None]:
swiss_imports_sum.drop(swiss_imports_sum[swiss_imports_sum['Commodity'] == 'Total coffee, tea, and spices'].index, inplace=True)

In [None]:
swiss_imports_sum

In [None]:
# add color to bars (labels still look bad)
sns.barplot(swiss_imports_sum, x="Commodity", y="FoodValue", hue="Commodity")

In [None]:
# Sort the dataframe by value (descending)
swiss_imports_sum = swiss_imports_sum.sort_values("FoodValue", ascending=False)

# Create the plot
plt.figure(figsize=(10, 6)) # Make it wider so labels fit
ax = sns.barplot(
    data=swiss_imports_sum,
    x="Commodity",
    y="FoodValue",
    hue="Commodity",
    palette="viridis",  # Options: "magma", "rocket", "flare", "crest"
    legend=False        # Removes redundant legend since x-axis has labels
)
# Add value labels on top of the bars
# fmt='%.1f' shows one decimal place
for container in ax.containers:
    ax.bar_label(container, padding=3, fmt='%.1f')


# increase room for bar labels on y axis
max_val = swiss_imports_sum['FoodValue'].max()
ax.set_ylim(0, max_val * 1.1)

# Rotate labels so they don't overlap
plt.xticks(rotation=45, ha='right')
plt.title("Swiss Food Imports by Value 1999-2024")
plt.ylabel("Millions of USD")
plt.tight_layout()

plt.show()

## Machine Learning: Decision Tree Example

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

# Load and clean data
cars = sns.load_dataset("mpg").dropna()
cars.head()

In [None]:
# SCATTER PLOT
plt.figure(figsize=(10, 6))
sns.scatterplot(data=cars, x="horsepower", y="mpg", hue="origin", palette="viridis")
plt.title("Cars: Horsepower vs. MPG")
plt.show()

In [None]:
# PREPARE DATA
# Features (X) and Target (y)
X = cars[['horsepower', 'mpg', 'weight']]
y = cars['origin']

# Split into Training (80%) and Testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TRAIN THE MODEL
# We set 'max_depth=3' so the tree isn't too big to look at!
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Model Training Complete.")
print(f"Model Accuracy on test data: {clf.score(X_test, y_test):.2%}")

In [None]:
# PLOT THE TREE
plt.figure(figsize=(16, 10))
plot_tree(
    clf,
    feature_names=list(X.columns),   # plain lists work across scikit-learn versions
    class_names=list(clf.classes_),
    filled=True,      # Colors the boxes based on the majority class
    rounded=True,     # Makes boxes look nicer
    fontsize=12
)
plt.title("Decision Tree: How the computer predicts Car Origin")
plt.show()

# note: Gini impurity is the default split criterion for DecisionTreeClassifier
# it is 1 minus the sum of squared class proportions in a node, i.e. how mixed the node is
# 0 means a pure node (one class only); the worst case is 1 - 1/N(classes) (here 1 - 1/3 = 0.667)
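
To see where the 0 and 0.667 bounds come from, Gini impurity can be computed by hand. This is a minimal sketch with a hypothetical helper `gini_impurity`; it is not how scikit-learn implements it internally, only the same formula applied to small label lists.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# One class only: a pure node
pure = gini_impurity(["usa"] * 6)

# Three classes in equal shares: the worst case for three classes
even = gini_impurity(["usa", "japan", "europe"] * 2)

# Something in between
mixed = gini_impurity(["usa"] * 4 + ["japan"] * 2)

print(f"{pure:.3f} {even:.3f} {mixed:.3f}")
```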

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=clf.classes_).plot(cmap='Blues')

# correct predictions are along the Top Left-Bottom Right diagonal

print(classification_report(y_test, y_pred, target_names=clf.classes_))

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

print("""
Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.
Precision is intuitively the ability of the classifier not to label a negative sample as positive.

Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.
Recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score is a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 (the F-1 score) means recall and precision are equally important.

The support is the number of occurrences of each class in y_true.

Macro average: the unweighted mean of the per-label scores.
Weighted average: the mean of the per-label scores, weighted by support.
""")

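
The definitions above can be checked with a little arithmetic. The counts, per-class scores, and supports below are hypothetical, chosen only to make the numbers easy to follow.

```python
# Hypothetical counts for one class
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # share of predicted positives that were right
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Hypothetical per-class F-1 scores and supports for two classes
f1_per_class = [0.9, 0.5]
support = [90, 10]

# Macro: every class counts equally
macro_avg = sum(f1_per_class) / len(f1_per_class)

# Weighted: classes with more examples count more
weighted_avg = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)

print(f"macro={macro_avg:.2f} weighted={weighted_avg:.2f}")
```

When classes are imbalanced, the macro and weighted averages can differ noticeably, so it helps to look at both.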

## Handling multiple files/data pipeline vibecoding

In [None]:
import os
import duckdb # read more at https://duckdb.org/

# 1. Generate synthetic data
countries = ['BRAZIL', 'COLOMBIA', 'VIETNAM', 'PERU', 'GUATEMALA']
data = {
    'Country': np.random.choice(countries, 500),
    'FoodValue': np.random.normal(1500, 800, 500),
    'Date': pd.date_range(start='2020-01-01', periods=500, freq='D')
}
df_raw = pd.DataFrame(data)

# 2. Split by Country and by 100-row chunks
os.makedirs('raw_imports', exist_ok=True)

for country, group in df_raw.groupby('Country'):
    # Split this country's data into chunks of 100
    chunks = [group[i:i+100] for i in range(0, group.shape[0], 100)]

    for i, chunk in enumerate(chunks):
        filename = f"raw_imports/import_{country}_{i}.csv"
        chunk.to_csv(filename, index=False)

print(f"Created {len(os.listdir('raw_imports'))} files in raw_imports/")

In [None]:
# We could loop over the files in pandas, or let DuckDB read them all at once with a glob pattern.
# Here DuckDB does the "Heavy Lifting" and stores the result in a table.

# Create connection
con = duckdb.connect()

# 1. Read all files
# 2. Clean/Standardize on the fly
# 3. Group by Country
pipeline_sql = """
    CREATE OR REPLACE TABLE clean_imports AS
    SELECT
        upper(Country) as Country,
        AVG(FoodValue) as Avg_Value,
        SUM(FoodValue) as Total_Value,
        COUNT(*) as Transaction_Count,
        max(Date) as Latest_Date
    FROM read_csv_auto('raw_imports/*.csv')
    GROUP BY Country
    HAVING Total_Value > 0
    ORDER BY Total_Value DESC
"""

con.execute(pipeline_sql)
print("Transformation Pipeline Complete.")

# Final check: Peek at the results
print(con.execute("SELECT * FROM clean_imports").df())
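
For comparison, the same read-many-files-then-aggregate idea can be sketched in pandas alone. The two input files and their values are hypothetical and written to a temporary folder so the example runs on its own; it is an alternative illustration, not a replacement for the DuckDB pipeline.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small hypothetical files to a temporary folder
folder = tempfile.mkdtemp()
pd.DataFrame({"Country": ["peru", "PERU"], "FoodValue": [100.0, 300.0]}).to_csv(
    os.path.join(folder, "import_a.csv"), index=False)
pd.DataFrame({"Country": ["BRAZIL"], "FoodValue": [250.0]}).to_csv(
    os.path.join(folder, "import_b.csv"), index=False)

# Read every matching file and stack the rows into one DataFrame
paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# Standardize the country names, then aggregate (same idea as the SQL GROUP BY)
combined["Country"] = combined["Country"].str.upper()
summary = (combined
           .groupby("Country")
           .agg(Avg_Value=("FoodValue", "mean"),
                Total_Value=("FoodValue", "sum"),
                Transaction_Count=("FoodValue", "size"))
           .reset_index()
           .sort_values("Total_Value", ascending=False))

print(summary)
```

The pandas version loads every file into memory, which is fine for small folders; letting DuckDB scan the files is one way to avoid that as data grows.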

In [None]:
# Write to Parquet (a compressed, columnar format widely used for analytics and ML data)
con.execute("COPY clean_imports TO 'final_summary.parquet' (FORMAT PARQUET)")

# Write to CSV (For your boss to open in Excel)
con.execute("COPY clean_imports TO 'final_summary.csv' (HEADER, DELIMITER ',')")

con.close()

### Prompt Template for data pipeline

Role: Act as a Data Engineer. Produce Python code for the following ETL pipeline.

1. SOURCE (Extract):
Look for files in [FOLDER_NAME] with the extension [CSV/PARQUET].
Handle multiple files using [glob/duckdb/pathlib].

2. CLEANING (Transform - Python):
Apply these rules to every file: [e.g., strip whitespace, convert dates, handle nulls].
Remove rows where [CONDITION].

3. AGGREGATION (Transform - SQL/DuckDB):
Create a virtual view of the cleaned data.
Run a SQL query to: [e.g., Group by X, Sum Y, Join with Z].

4. SINK (Load):
Write the final output to [DIRECTORY/FILENAME] in [PARQUET/CSV] format.
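
As a rough idea of what a filled-in template might produce, here is a sketch of steps 1, 2, and 4 using pathlib and pandas. The function names (`extract`, `clean`, `load`), the column names, and the cleaning rules are all assumptions for illustration; step 3 (SQL aggregation) is left out to keep the sketch short.

```python
import tempfile
from pathlib import Path

import pandas as pd

def extract(folder):
    """Step 1: read every CSV in the folder into one DataFrame."""
    frames = [pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))]
    return pd.concat(frames, ignore_index=True)

def clean(df):
    """Step 2: strip whitespace, convert dates, drop rows with missing values."""
    out = df.copy()
    out["Country"] = out["Country"].str.strip()
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")
    return out.dropna(subset=["Date", "FoodValue"])

def load(df, path):
    """Step 4: write the result to disk."""
    df.to_csv(path, index=False)

# Usage with one small hypothetical input file
workdir = Path(tempfile.mkdtemp())
(workdir / "in").mkdir()
pd.DataFrame({
    "Country": [" PERU", "PERU ", "PERU"],
    "FoodValue": [10.0, None, 30.0],
    "Date": ["2020-01-01", "2020-01-02", "not a date"],
}).to_csv(workdir / "in" / "part_0.csv", index=False)

cleaned = clean(extract(workdir / "in"))
load(cleaned, workdir / "final_summary.csv")

print(cleaned)
```

Keeping each step in its own small function makes generated code easier to review and test one piece at a time.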