# Lab 1B: Pandas Fundamentals

## Comprehensive Data Manipulation for Machine Learning

**Duration:** 90 minutes | **Difficulty:** Beginner to Intermediate | **Prerequisites:** Lab 1A (NumPy)

---

### Learning Objectives

By the end of this lab, you will be able to:

1. **Create and explore** DataFrames effectively
2. **Select and filter** data using various methods
3. **Clean data** by handling missing values and duplicates
4. **Transform data** with apply, map, and aggregation
5. **Merge and join** datasets from multiple sources
6. **Prepare data** for machine learning pipelines
7. **Visualize** data distributions and relationships

---

## Setup

Run this cell first to import libraries and configure matplotlib for inline display.

In [None]:
# Setup - Run this cell first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Configure matplotlib for inline display
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12

# Pandas display options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 15)
pd.set_option('display.precision', 2)

# Set random seed for reproducibility
np.random.seed(42)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print("✓ Setup complete! Matplotlib configured for inline display.")

---

## Section 1: Creating and Exploring DataFrames

A DataFrame is a 2D labeled data structure - think of it as a spreadsheet or SQL table in Python.
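
To make the "labeled" part concrete, here is a minimal sketch on made-up data (the `scores` frame and its `s1`..`s3` row labels are illustrative only, not part of the lab dataset). It shows that both rows and columns carry labels you can look values up by.

```python
import pandas as pd

# Toy data invented for illustration; not part of the lab dataset.
scores = pd.DataFrame(
    {"student": ["Ana", "Ben", "Cy"], "score": [88, 92, 79]},
    index=["s1", "s2", "s3"],  # row labels
)

print(scores.shape)               # (number of rows, number of columns)
print(scores.columns.tolist())    # column labels
print(scores.index.tolist())      # row labels
print(scores.loc["s2", "score"])  # one cell, looked up by row and column label
```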

### Exercise 1.1: Create DataFrames

Create DataFrames from different sources:

1. `df_dict` - From a dictionary of lists
2. `df_arrays` - From NumPy arrays with custom column names
3. `df_records` - From a list of dictionaries (records)

In [None]:
# Exercise 1.1: Create DataFrames

# Method 1: From dictionary of lists
data_dict = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [70000, 80000, 90000, 75000]
}
# YOUR CODE HERE
df_dict = None  # pd.DataFrame(data_dict)

# Method 2: From NumPy arrays
np_data = np.random.randn(5, 3)
# YOUR CODE HERE
df_arrays = None  # pd.DataFrame(np_data, columns=['A', 'B', 'C'])

# Method 3: From list of records (dictionaries)
records = [
    {'product': 'Widget', 'price': 25.99, 'quantity': 100},
    {'product': 'Gadget', 'price': 49.99, 'quantity': 50},
    {'product': 'Gizmo', 'price': 19.99, 'quantity': 200}
]
# YOUR CODE HERE
df_records = None  # pd.DataFrame(records)

print("From dictionary:")
print(df_dict)
print("\nFrom NumPy arrays:")
print(df_arrays)
print("\nFrom records:")
print(df_records)

### Exercise 1.2: Explore DataFrames

Let's create a larger dataset and explore it. Complete the exploration functions:

In [None]:
# Create a larger sample dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'customer_id': range(1000, 1000 + n),
    'age': np.random.randint(18, 70, n),
    'income': np.random.normal(60000, 20000, n).astype(int),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n, p=[0.3, 0.4, 0.2, 0.1]),
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'], n),
    'signup_year': np.random.choice([2020, 2021, 2022, 2023], n),
    'purchases': np.random.randint(0, 50, n),
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.1, 0.3, 0.35, 0.2])
})

# Add some missing values
df.loc[np.random.choice(n, 10, replace=False), 'income'] = np.nan
df.loc[np.random.choice(n, 5, replace=False), 'satisfaction'] = np.nan

print(f"Dataset created with {len(df)} rows and {len(df.columns)} columns")
df.head()

In [None]:
# Exercise 1.2: Explore the DataFrame

# YOUR CODE HERE - Get basic info about the DataFrame
shape = None          # df.shape
columns = None        # df.columns.tolist()
dtypes = None         # df.dtypes
first_5 = None        # df.head()
last_3 = None         # df.tail(3)
random_sample = None  # df.sample(5)
stats = None          # df.describe()
info = None           # df.info() - prints a summary directly and returns None

print(f"Shape: {shape}")
print(f"\nColumns: {columns}")
print(f"\nData types:\n{dtypes}")
print(f"\nFirst 5 rows:")
print(first_5)
print(f"\nStatistical summary:")
print(stats)

### Exercise 1.3: Value Counts and Unique Values
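
One detail worth knowing before you start: `value_counts()` skips missing values unless told otherwise. The sketch below uses a made-up `ratings` Series (an assumption for illustration) to show the default, the `dropna=False` variant, and `normalize=True` for shares.

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing entry (invented for this example).
ratings = pd.Series([5, 4, 4, np.nan, 5, 4])

counts = ratings.value_counts()                       # NaN is excluded by default
counts_with_nan = ratings.value_counts(dropna=False)  # NaN gets its own row
shares = ratings.value_counts(normalize=True)         # fractions of non-missing values

print(counts)
print(counts_with_nan)
print(shares)
```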

Understand categorical distributions:

In [None]:
# Exercise 1.3: Value Counts and Unique Values

# YOUR CODE HERE
unique_cities = None      # df['city'].unique()
n_unique_cities = None    # df['city'].nunique()
city_counts = None        # df['city'].value_counts()
city_percentages = None   # df['city'].value_counts(normalize=True) * 100
education_counts = None   # df['education'].value_counts()

print(f"Unique cities: {unique_cities}")
print(f"Number of unique cities: {n_unique_cities}")
print(f"\nCity counts:\n{city_counts}")
print(f"\nCity percentages:\n{city_percentages.round(1)}")
print(f"\nEducation distribution:\n{education_counts}")

### Visualization: Data Distributions

In [None]:
# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Age distribution
axes[0, 0].hist(df['age'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('Age Distribution')

# Income distribution
axes[0, 1].hist(df['income'].dropna(), bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[0, 1].set_xlabel('Income ($)')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Income Distribution')

# City counts (bar chart)
city_counts = df['city'].value_counts()
axes[1, 0].bar(city_counts.index, city_counts.values, edgecolor='black', color='seagreen')
axes[1, 0].set_xlabel('City')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Customers by City')
axes[1, 0].tick_params(axis='x', rotation=45)

# Satisfaction distribution
sat_counts = df['satisfaction'].value_counts().sort_index()
axes[1, 1].bar(sat_counts.index.astype(str), sat_counts.values, edgecolor='black', color='purple', alpha=0.7)
axes[1, 1].set_xlabel('Satisfaction Score')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Satisfaction Distribution')

plt.tight_layout()
plt.show()

---

## Section 2: Selecting and Filtering Data

Pandas provides multiple ways to select data: `[]`, `.loc[]`, `.iloc[]`, and boolean indexing.
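
As a quick orientation, the four approaches can be sketched side by side on a small made-up frame (`people` and its index labels 10, 20, 30, 40 are assumptions chosen so that labels and positions differ).

```python
import pandas as pd

# Toy frame whose index labels (10, 20, ...) differ from positions (0, 1, ...).
people = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy", "Dee"], "age": [23, 41, 35, 29]},
    index=[10, 20, 30, 40],
)

ages = people["age"]                  # [] with a column name returns a Series
by_label = people.loc[20, "name"]     # .loc looks up by index label
by_position = people.iloc[1, 0]       # .iloc looks up by integer position
over_30 = people[people["age"] > 30]  # boolean mask keeps rows where it is True

print(by_label, by_position)
print(over_30)
```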

### Exercise 2.1: Column Selection

Select columns in different ways:

In [None]:
# Exercise 2.1: Column Selection

# YOUR CODE HERE
single_col = None          # df['age']  - returns Series
multi_cols = None          # df[['age', 'income', 'city']]  - returns DataFrame
cols_loc = None            # df.loc[:, 'age':'city']  - range of columns
cols_by_dtype = None       # df.select_dtypes(include='number')  - all numeric dtypes (int and float)

print("Single column (Series):")
print(single_col.head())
print(f"\nMultiple columns:")
print(multi_cols.head())
print(f"\nColumn range with .loc:")
print(cols_loc.head())
print(f"\nNumeric columns only:")
print(cols_by_dtype.head())

### Exercise 2.2: Row Selection with .loc and .iloc

- `.loc[]` - Label-based selection
- `.iloc[]` - Integer position-based selection
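
A common source of off-by-one surprises is that the two accessors slice differently: `.loc` includes the end label, while `.iloc` excludes the end position, like ordinary Python slicing. A small sketch on a made-up frame (`numbered` is illustrative only):

```python
import pandas as pd

# Toy frame with the default integer index 0..4 (made up for this example).
numbered = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = numbered.loc[0:2]      # label slice: rows 0, 1 and 2 (end included)
by_position = numbered.iloc[0:2]  # position slice: rows 0 and 1 (end excluded)

print(len(by_label), len(by_position))
```

This is why `df.loc[0:4]` and `df.iloc[0:5]` select the same five rows when the index is the default 0, 1, 2, ...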

In [None]:
# Exercise 2.2: Row Selection

# YOUR CODE HERE
first_row = None           # df.iloc[0]
rows_0_to_4 = None         # df.iloc[0:5]
rows_and_cols = None       # df.iloc[0:5, [1, 2, 3]]  - first 5 rows, columns at index 1,2,3
specific_cell = None       # df.iloc[0, 1]  - row 0, column 1

# With .loc (label-based)
row_by_label = None        # df.loc[0]  - row with index 0
cols_by_name = None        # df.loc[0:4, ['age', 'income']]  - specific columns by name

print("First row (iloc[0]):")
print(first_row)
print(f"\nRows 0-4, specific columns (iloc):")
print(rows_and_cols)
print(f"\nRows 0-4, named columns (loc):")
print(cols_by_name)

### Exercise 2.3: Boolean Filtering

Filter rows based on conditions - this is the most common selection method!
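
When combining conditions, each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `==` or `>`. The sketch below uses a made-up `staff` frame (an assumption for illustration) and shows that `.query()` can express the same filter as a string.

```python
import pandas as pd

# Toy data invented for this example.
staff = pd.DataFrame({
    "dept": ["ops", "eng", "eng", "ops"],
    "years": [2, 7, 1, 9],
})

# AND: wrap each comparison in parentheses before combining with &
senior_eng = staff[(staff["dept"] == "eng") & (staff["years"] > 5)]

# NOT: ~ inverts a boolean mask
not_ops = staff[~(staff["dept"] == "ops")]

# The same AND filter written as a query string
same_query = staff.query("dept == 'eng' and years > 5")

print(senior_eng)
print(not_ops)
```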

In [None]:
# Exercise 2.3: Boolean Filtering

# YOUR CODE HERE
# Single condition
high_income = None         # df[df['income'] > 80000]
nyc_customers = None       # df[df['city'] == 'NYC']

# Multiple conditions (use & for AND, | for OR)
young_high_earners = None  # df[(df['age'] < 35) & (df['income'] > 70000)]
nyc_or_la = None           # df[(df['city'] == 'NYC') | (df['city'] == 'LA')]

# Using .isin() for multiple values
select_cities = None       # df[df['city'].isin(['NYC', 'LA', 'Chicago'])]

# Using .query() method (alternative syntax)
query_result = None        # df.query('age > 30 and income > 60000')

print(f"High income (>80k): {len(high_income)} rows")
print(f"NYC customers: {len(nyc_customers)} rows")
print(f"Young high earners (<35, >70k): {len(young_high_earners)} rows")
print(f"NYC or LA: {len(nyc_or_la)} rows")
print(f"\nYoung high earners sample:")
print(young_high_earners.head())

### Exercise 2.4: String Methods
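
The lab's `city` column has no missing values, but real text columns often do. As a hedged sketch on a made-up `cities` Series: `.str` methods can be chained, they pass missing values through, and `na=False` makes `str.contains` return `False` for missing entries so the result can be used as a filter mask.

```python
import pandas as pd

# Toy Series with stray whitespace and one missing value (invented for this example).
cities = pd.Series(["NYC", " chicago", None, "Houston "])

cleaned = cities.str.strip().str.lower()                 # chain .str methods
has_c = cleaned.str.contains("c", case=False, na=False)  # missing -> False

print(cleaned)
print(has_c)
```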

Pandas provides string methods via the `.str` accessor:

In [None]:
# Exercise 2.4: String Methods

# YOUR CODE HERE
upper_cities = None        # df['city'].str.upper()
lower_cities = None        # df['city'].str.lower()
contains_c = None          # df[df['city'].str.contains('C', case=False)]
starts_with_h = None       # df[df['city'].str.startswith('H')]
city_lengths = None        # df['city'].str.len()

print("Uppercase cities:")
print(upper_cities.head())
print(f"\nCities containing 'C': {len(contains_c)} rows")
print(contains_c['city'].unique())
print(f"\nCities starting with 'H': {len(starts_with_h)} rows")

---

## Section 3: Data Cleaning

Real-world data is messy! Cleaning it often takes up a large share of a data scientist's time.

### Exercise 3.1: Handling Missing Values
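
Before working on the lab dataset, here is a small sketch of the usual pattern on made-up data (`toy` and its values are assumptions for illustration): count what is missing, then fill by assigning the result of `fillna()` back to the column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (invented for this example).
toy = pd.DataFrame({
    "income": [50000.0, np.nan, 70000.0, 60000.0],
    "rating": [4.0, 4.0, np.nan, 5.0],
})

n_missing = toy.isna().sum()  # missing values per column

filled = toy.copy()
# Assign the result back to the column instead of using inplace=True
filled["income"] = filled["income"].fillna(filled["income"].median())
filled["rating"] = filled["rating"].fillna(filled["rating"].mode()[0])

# Forward fill: carry the previous valid value down
filled_prev = toy["income"].ffill()

print(n_missing)
print(filled)
```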

In [None]:
# Exercise 3.1: Handling Missing Values

# YOUR CODE HERE - Check for missing values
missing_count = None       # df.isnull().sum()
missing_percent = None     # (df.isnull().sum() / len(df) * 100).round(2)
rows_with_missing = None   # df[df.isnull().any(axis=1)]

print("Missing values per column:")
print(missing_count)
print(f"\nMissing percentages:")
print(missing_percent)
print(f"\nRows with any missing: {len(rows_with_missing)}")

In [None]:
# Exercise 3.1b: Fill or Drop Missing Values

# Create a copy to experiment
df_clean = df.copy()

# YOUR CODE HERE
# Option 1: Fill with specific value
# df_clean['income'] = df_clean['income'].fillna(0)

# Option 2: Fill with mean/median
income_median = None       # df_clean['income'].median()
# df_clean['income'] = df_clean['income'].fillna(income_median)

# Option 3: Fill with mode (most common value)
satisfaction_mode = None   # df_clean['satisfaction'].mode()[0]
# df_clean['satisfaction'] = df_clean['satisfaction'].fillna(satisfaction_mode)

# Option 4: Forward fill (use previous value)
# df_clean['column'] = df_clean['column'].ffill()

# Option 5: Drop rows with missing values
# df_clean.dropna(inplace=True)  # All missing
# df_clean.dropna(subset=['income'], inplace=True)  # Specific column

print(f"Income median: {income_median}")
print(f"Satisfaction mode: {satisfaction_mode}")

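# Fill missing values with median/mode
# (assign the result back: fillna(inplace=True) on a selected column triggers a
#  FutureWarning in recent pandas 2.x releases and may not update df_clean)
df_clean['income'] = df_clean['income'].fillna(income_median)
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(satisfaction_mode)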

print(f"\nAfter cleaning:")
print(df_clean.isnull().sum())

### Exercise 3.2: Handling Duplicates
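
Whether a row counts as a duplicate depends on which columns you compare and which copy you keep. A minimal sketch on a made-up `visits` frame (names and values are assumptions for illustration):

```python
import pandas as pd

# Toy data: row 2 repeats row 1 exactly; customer 3 appears twice with different visits.
visits = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "visit": ["a", "b", "b", "c", "d"],
})

exact_dups = visits.duplicated()                     # compares all columns
same_id = visits.duplicated(subset=["customer_id"])  # compares customer_id only

# Keep the last row seen for each customer instead of the first
latest = visits.drop_duplicates(subset=["customer_id"], keep="last")

print(exact_dups.sum(), same_id.sum())
print(latest)
```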

In [None]:
# Exercise 3.2: Handling Duplicates

# Add some duplicate rows for demonstration
df_with_dups = pd.concat([df_clean, df_clean.iloc[:5]], ignore_index=True)

# YOUR CODE HERE
n_duplicates = None        # df_with_dups.duplicated().sum()
duplicate_rows = None      # df_with_dups[df_with_dups.duplicated(keep=False)]
df_no_dups = None          # df_with_dups.drop_duplicates()

# Check duplicates based on specific columns
dups_by_cols = None        # df_with_dups.duplicated(subset=['customer_id'])

print(f"Total rows: {len(df_with_dups)}")
print(f"Duplicate rows: {n_duplicates}")
print(f"After removing duplicates: {len(df_no_dups)}")
print(f"\nDuplicates by customer_id: {dups_by_cols.sum()}")

### Exercise 3.3: Data Type Conversion
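
`astype()` raises an error on the first value it cannot convert. When a column may contain bad entries, one option is `pd.to_numeric` or `pd.to_datetime` with `errors="coerce"`, which turns unparseable values into missing ones. A sketch on made-up data (the `raw` frame, the `"n/a"` placeholder, and the impossible date are assumptions for illustration):

```python
import pandas as pd

# Toy data with one unparseable number and one invalid date (invented for this example).
raw = pd.DataFrame({
    "amount": ["100", "n/a", "250"],
    "when": ["2023-01-15", "2023-02-30", "2023-03-25"],
})

raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")  # "n/a" -> NaN
raw["when"] = pd.to_datetime(raw["when"], errors="coerce")     # Feb 30 -> NaT

print(raw.dtypes)
print(raw)
```

After coercion, the missing-value tools from Exercise 3.1 apply as usual.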

In [None]:
# Exercise 3.3: Data Type Conversion

# Create sample data with type issues
df_types = pd.DataFrame({
    'id': ['001', '002', '003'],
    'value': ['100', '200', '300'],
    'price': ['$25.99', '$49.99', '$19.99'],
    'date': ['2023-01-15', '2023-02-20', '2023-03-25'],
    'active': ['True', 'False', 'True']
})

print("Original dtypes:")
print(df_types.dtypes)

# YOUR CODE HERE - Convert types
# Convert 'value' to integer
# df_types['value'] = df_types['value'].astype(int)

# Convert 'price' to float (strip the literal '$'; regex=False so '$' is not read as a regex anchor)
# df_types['price'] = df_types['price'].str.replace('$', '', regex=False).astype(float)

# Convert 'date' to datetime
# df_types['date'] = pd.to_datetime(df_types['date'])

# Convert 'active' to boolean
# df_types['active'] = df_types['active'].map({'True': True, 'False': False})

df_types['value'] = df_types['value'].astype(int)
df_types['price'] = df_types['price'].str.replace('$', '', regex=False).astype(float)
df_types['date'] = pd.to_datetime(df_types['date'])
df_types['active'] = df_types['active'].map({'True': True, 'False': False})

print("\nConverted dtypes:")
print(df_types.dtypes)
print("\nData:")
print(df_types)

---

## Section 4: Data Transformation

Transform data using apply, map, groupby, and aggregation.

### Exercise 4.1: Creating New Columns
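
The age grouping in this exercise can also be expressed with `pd.cut`. The sketch below is an alternative, not the exercise solution: it uses a made-up `ages` Series and an assumed upper bound of 120, with `right=False` so each bin includes its left edge and the rule matches "under 30, under 50, 50 and over".

```python
import pandas as pd

# Toy ages chosen to sit on and around the bin edges (invented for this example).
ages = pd.Series([22, 30, 49, 50, 67])

age_group = pd.cut(
    ages,
    bins=[0, 30, 50, 120],  # 120 is an assumed upper bound
    right=False,            # intervals are [0, 30), [30, 50), [50, 120)
    labels=["Young", "Middle", "Senior"],
)

print(age_group)
```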

In [None]:
# Exercise 4.1: Creating New Columns

df_transform = df_clean.copy()

# YOUR CODE HERE
# Simple arithmetic
df_transform['income_thousands'] = None  # df_transform['income'] / 1000

# Conditional column with np.where
df_transform['income_level'] = None      # np.where(df_transform['income'] > 70000, 'High', 'Medium/Low')

# Multiple conditions with np.select
conditions = [
    df_transform['age'] < 30,
    df_transform['age'] < 50,
    df_transform['age'] >= 50
]
choices = ['Young', 'Middle', 'Senior']
df_transform['age_group'] = None         # np.select(conditions, choices, default='Unknown')

# Using cut for binning
df_transform['satisfaction_label'] = None  # pd.cut(df_transform['satisfaction'], bins=[0, 2, 3, 5], labels=['Low', 'Medium', 'High'])

# Fill in the code
df_transform['income_thousands'] = df_transform['income'] / 1000
df_transform['income_level'] = np.where(df_transform['income'] > 70000, 'High', 'Medium/Low')
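# A string default keeps the result dtype consistent (the built-in default of 0 can clash with string choices)
df_transform['age_group'] = np.select(conditions, choices, default='Unknown')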
df_transform['satisfaction_label'] = pd.cut(df_transform['satisfaction'], bins=[0, 2, 3, 5], labels=['Low', 'Medium', 'High'])

print(df_transform[['age', 'age_group', 'income', 'income_level', 'satisfaction', 'satisfaction_label']].head(10))

### Exercise 4.2: Apply and Map Functions
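
Two behaviors are worth seeing on a tiny example first: `map()` with a dictionary yields a missing value for any key the dictionary lacks, and simple arithmetic usually does not need `apply()` at all. The `items` frame and the size ranking below are made up for illustration.

```python
import pandas as pd

# Toy data invented for this example.
items = pd.DataFrame({"size": ["S", "M", "L", "XL"], "qty": [3, 12, 40, 7]})

# map() with a dict: "XL" has no entry, so it becomes NaN
size_rank = items["size"].map({"S": 1, "M": 2, "L": 3})

# apply() calls the function once per element ...
doubled_apply = items["qty"].apply(lambda q: q * 2)
# ... while plain arithmetic works on the whole column at once
doubled_vec = items["qty"] * 2

print(size_rank)
print(doubled_apply.equals(doubled_vec))
```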

In [None]:
# Exercise 4.2: Apply and Map Functions

# apply() - Apply a function to each element/row/column
# map() - Map values using a dictionary or function (Series only)

# YOUR CODE HERE

# Apply function to column
def categorize_purchases(x):
    if x < 10: return 'Low'
    elif x < 30: return 'Medium'
    else: return 'High'

df_transform['purchase_level'] = None  # df_transform['purchases'].apply(categorize_purchases)

# Apply lambda function
df_transform['age_squared'] = None     # df_transform['age'].apply(lambda x: x**2)

# Map using dictionary
education_years = {
    'High School': 12,
    'Bachelor': 16,
    'Master': 18,
    'PhD': 22
}
df_transform['education_years'] = None # df_transform['education'].map(education_years)

# Apply to entire DataFrame (row-wise)
# df_transform.apply(lambda row: row['income'] / row['age'], axis=1)

df_transform['purchase_level'] = df_transform['purchases'].apply(categorize_purchases)
df_transform['age_squared'] = df_transform['age'].apply(lambda x: x**2)
df_transform['education_years'] = df_transform['education'].map(education_years)

print(df_transform[['purchases', 'purchase_level', 'education', 'education_years']].head(10))

### Exercise 4.3: GroupBy and Aggregation

GroupBy is one of the most powerful Pandas features!
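
As a warm-up, here is a sketch of the split-apply-combine idea on made-up data (`sales` and its values are assumptions). It uses named aggregation, where each keyword becomes an output column and the tuple gives the source column and the function.

```python
import pandas as pd

# Toy data invented for this example.
sales = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "LA"],
    "amount": [100, 300, 50, 150, 100],
})

summary = sales.groupby("city").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    orders=("amount", "count"),
)

print(summary)
```

Named aggregation gives flat, readable column names, which can be easier to work with than the two-level columns produced by passing a dictionary of lists.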

In [None]:
# Exercise 4.3: GroupBy and Aggregation

# YOUR CODE HERE

# Simple groupby with single aggregation
income_by_city = None      # df_transform.groupby('city')['income'].mean()

# Multiple aggregations
city_stats = None          # df_transform.groupby('city')['income'].agg(['mean', 'median', 'std', 'count'])

# Group by multiple columns
city_edu_stats = None      # df_transform.groupby(['city', 'education'])['income'].mean().unstack()

# Multiple columns with different aggregations
multi_agg = None           # df_transform.groupby('city').agg({'income': 'mean', 'purchases': 'sum', 'age': ['min', 'max']})

income_by_city = df_transform.groupby('city')['income'].mean()
city_stats = df_transform.groupby('city')['income'].agg(['mean', 'median', 'std', 'count'])
city_edu_stats = df_transform.groupby(['city', 'education'])['income'].mean().unstack()
multi_agg = df_transform.groupby('city').agg({'income': 'mean', 'purchases': 'sum', 'age': ['min', 'max']})

print("Average income by city:")
print(income_by_city.round(0))
print("\nCity income statistics:")
print(city_stats.round(0))
print("\nIncome by city and education:")
print(city_edu_stats.round(0))

### Visualization: GroupBy Results

In [None]:
# Visualize groupby results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Income by city (bar chart)
income_by_city.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Average Income by City')
axes[0].set_xlabel('City')
axes[0].set_ylabel('Average Income ($)')
axes[0].tick_params(axis='x', rotation=45)

# Income by education (horizontal bar)
income_by_edu = df_transform.groupby('education')['income'].mean().sort_values()
income_by_edu.plot(kind='barh', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Average Income by Education')
axes[1].set_xlabel('Average Income ($)')
axes[1].set_ylabel('Education Level')

plt.tight_layout()
plt.show()

---

## Section 5: Merging and Joining DataFrames

Combining data from multiple sources is a critical skill.

### Setup: Create Sample DataFrames

In [None]:
# Create sample DataFrames for merging

# Customer info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Houston']
})

# Orders (some customers have multiple orders, some have none)
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 1, 2, 3, 3, 6],  # Note: customer 6 doesn't exist in customers
    'product': ['Widget', 'Gadget', 'Gizmo', 'Widget', 'Gadget', 'Gizmo'],
    'amount': [100, 150, 200, 100, 150, 250]
})

# Product info
products = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'category': ['Electronics', 'Electronics', 'Home'],
    'price': [25.99, 49.99, 19.99]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
print("\nProducts:")
print(products)

### Exercise 5.1: Merge Operations
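
When you are unsure which rows matched, `pd.merge` can report it. The sketch below uses made-up `left` and `right` frames (assumptions for illustration): `indicator=True` adds a `_merge` column saying where each row came from, and `validate="one_to_many"` raises an error if the left keys are not unique.

```python
import pandas as pd

# Toy tables invented for this example: key 1 is only on the left, key 4 only on the right.
left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 3, 4], "amount": [10, 20, 30, 40]})

merged = pd.merge(
    left, right,
    on="key",
    how="outer",
    indicator=True,          # adds a _merge column: left_only / right_only / both
    validate="one_to_many",  # checks that keys are unique in the left table
)

print(merged)
print(merged["_merge"].value_counts())
```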

In [None]:
# Exercise 5.1: Merge Operations

# YOUR CODE HERE

# Inner join - only matching records
inner_merge = None         # pd.merge(customers, orders, on='customer_id', how='inner')

# Left join - all customers, matching orders
left_merge = None          # pd.merge(customers, orders, on='customer_id', how='left')

# Right join - all orders, matching customers
right_merge = None         # pd.merge(customers, orders, on='customer_id', how='right')

# Outer join - all records from both
outer_merge = None         # pd.merge(customers, orders, on='customer_id', how='outer')

inner_merge = pd.merge(customers, orders, on='customer_id', how='inner')
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')

print(f"Inner join ({len(inner_merge)} rows):")
print(inner_merge)
print(f"\nLeft join ({len(left_merge)} rows):")
print(left_merge)
print(f"\nRight join ({len(right_merge)} rows):")
print(right_merge)
print(f"\nOuter join ({len(outer_merge)} rows):")
print(outer_merge)

### Exercise 5.2: Multi-Table Joins

In [None]:
# Exercise 5.2: Multi-Table Joins

# Join all three tables
# YOUR CODE HERE

# Step 1: Join customers with orders
customer_orders = None     # pd.merge(customers, orders, on='customer_id', how='inner')

# Step 2: Join result with products
full_data = None           # pd.merge(customer_orders, products, on='product', how='left')

customer_orders = pd.merge(customers, orders, on='customer_id', how='inner')
full_data = pd.merge(customer_orders, products, on='product', how='left')

print("Full merged data:")
print(full_data)

# Analysis: Total spend per customer
customer_spend = full_data.groupby('name')['amount'].sum().sort_values(ascending=False)
print("\nTotal spend per customer:")
print(customer_spend)

### Exercise 5.3: Concatenation

In [None]:
# Exercise 5.3: Concatenation

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'C': [9, 10], 'D': [11, 12]})

# YOUR CODE HERE

# Vertical concatenation (stack rows)
vertical = None            # pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal concatenation (add columns)
horizontal = None          # pd.concat([df1, df3], axis=1)

vertical = pd.concat([df1, df2], axis=0, ignore_index=True)
horizontal = pd.concat([df1, df3], axis=1)

print("df1:")
print(df1)
print("\ndf2:")
print(df2)
print("\nVertical concat (df1 + df2):")
print(vertical)
print("\nHorizontal concat (df1 + df3):")
print(horizontal)

---

## Section 6: Preparing Data for ML

Prepare data for machine learning pipelines.

### Exercise 6.1: Encoding Categorical Variables
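
Two optional arguments of `pd.get_dummies` are useful to know about. In this sketch on a made-up `toy` frame (an assumption for illustration), `dtype=int` produces 0/1 columns instead of booleans, and `drop_first=True` drops one category per column, which some linear models prefer because the remaining columns already determine the dropped one.

```python
import pandas as pd

# Toy data invented for this example.
toy = pd.DataFrame({"city": ["NYC", "LA", "NYC"], "age": [25, 30, 35]})

full = pd.get_dummies(toy, columns=["city"], dtype=int)
reduced = pd.get_dummies(toy, columns=["city"], drop_first=True, dtype=int)

print(full)
print(reduced)
```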

In [None]:
# Exercise 6.1: Encoding Categorical Variables

df_ml = df_clean[['age', 'income', 'education', 'city', 'purchases', 'satisfaction']].copy()

# YOUR CODE HERE

# One-hot encoding (creates dummy variables)
df_encoded = None          # pd.get_dummies(df_ml, columns=['city', 'education'])

# Label (ordinal) encoding: map ordered categories to integers
education_mapping = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_ml['education_encoded'] = None  # df_ml['education'].map(education_mapping)

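# Drop the placeholder column first so it does not leak into the encoded frame
df_encoded = pd.get_dummies(df_ml.drop(columns=['education_encoded']), columns=['city', 'education'])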
df_ml['education_encoded'] = df_ml['education'].map(education_mapping)

print("One-hot encoded columns:")
print(df_encoded.columns.tolist())
print("\nOne-hot encoded sample:")
print(df_encoded.head())

print("\nLabel encoded:")
print(df_ml[['education', 'education_encoded']].drop_duplicates())

### Exercise 6.2: Feature Scaling
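
This exercise scales the whole dataset at once to keep things simple. In a modeling workflow, a common practice is to compute the scaling statistics on the training data only and reuse them on the test data. A minimal sketch with made-up numbers (`train` and `test` are assumptions for illustration):

```python
import pandas as pd

# Toy splits invented for this example.
train = pd.DataFrame({"income": [40.0, 60.0, 80.0]})
test = pd.DataFrame({"income": [50.0, 100.0]})

# Statistics come from the training data only
mean = train["income"].mean()
std = train["income"].std()  # pandas uses the sample standard deviation (ddof=1)

train_z = (train["income"] - mean) / std
test_z = (test["income"] - mean) / std  # reuse the training mean and std

print(train_z.tolist())
print(test_z.tolist())
```

Note that test values can fall outside the range seen in training, which is expected.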

In [None]:
# Exercise 6.2: Feature Scaling

# Select numeric columns for scaling
numeric_cols = ['age', 'income', 'purchases']
df_scale = df_ml[numeric_cols].copy()

# YOUR CODE HERE

# Min-Max Normalization (scale to 0-1)
def min_max_normalize(series):
    return (series - series.min()) / (series.max() - series.min())

df_normalized = None       # df_scale.apply(min_max_normalize)

# Z-score Standardization (mean=0, std=1)
def standardize(series):
    return (series - series.mean()) / series.std()

df_standardized = None     # df_scale.apply(standardize)

df_normalized = df_scale.apply(min_max_normalize)
df_standardized = df_scale.apply(standardize)

print("Original data stats:")
print(df_scale.describe().round(2))
print("\nNormalized (0-1) stats:")
print(df_normalized.describe().round(2))
print("\nStandardized (mean=0, std=1) stats:")
print(df_standardized.describe().round(2))

### Visualization: Scaling Comparison

In [None]:
# Visualize scaling effects
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original
for col in numeric_cols:
    axes[0].hist(df_scale[col], bins=20, alpha=0.5, label=col)
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].legend()

# Normalized
for col in numeric_cols:
    axes[1].hist(df_normalized[col], bins=20, alpha=0.5, label=col)
axes[1].set_title('Min-Max Normalized (0-1)')
axes[1].set_xlabel('Value')
axes[1].legend()

# Standardized
for col in numeric_cols:
    axes[2].hist(df_standardized[col], bins=20, alpha=0.5, label=col)
axes[2].set_title('Standardized (mean=0, std=1)')
axes[2].set_xlabel('Value')
axes[2].axvline(x=0, color='black', linestyle='--')
axes[2].legend()

plt.tight_layout()
plt.show()

### Exercise 6.3: Train-Test Split Preparation
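
The cell below splits by shuffling row positions with NumPy. An equivalent pandas-only approach, sketched here on a made-up `data` frame (an assumption for illustration), samples the training rows and drops them by index to get the test rows. It relies on the index labels being unique.

```python
import pandas as pd

# Toy data invented for this example.
data = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

train = data.sample(frac=0.8, random_state=42)  # random 80% of the rows
test = data.drop(train.index)                   # everything not sampled

print(len(train), len(test))
```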

In [None]:
# Exercise 6.3: Prepare Features and Target

# Prepare final dataset for ML
df_final = pd.get_dummies(df_clean, columns=['city', 'education'])

# Define features (X) and target (y)
target_col = 'satisfaction'
feature_cols = [col for col in df_final.columns if col != target_col and col != 'customer_id']

X = df_final[feature_cols]
y = df_final[target_col]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({len(feature_cols)}):")
print(feature_cols)

# Manual train-test split
np.random.seed(42)
indices = np.random.permutation(len(X))
split_idx = int(0.8 * len(X))

train_idx = indices[:split_idx]
test_idx = indices[split_idx:]

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print(f"\nTrain set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nTrain target distribution:")
print(y_train.value_counts(normalize=True).round(3))

---

## Lab Summary

Congratulations! You've mastered essential Pandas operations for data manipulation:

| Topic | Key Functions |
|-------|---------------|
| **Creation** | `pd.DataFrame()`, `pd.read_csv()`, `pd.read_json()` |
| **Exploration** | `head()`, `tail()`, `info()`, `describe()`, `shape`, `dtypes` |
| **Selection** | `[]`, `.loc[]`, `.iloc[]`, boolean indexing, `.query()` |
| **Missing Data** | `isnull()`, `fillna()`, `dropna()` |
| **Duplicates** | `duplicated()`, `drop_duplicates()` |
| **Transformation** | `apply()`, `map()`, `np.where()`, `pd.cut()` |
| **Aggregation** | `groupby()`, `agg()`, `pivot_table()` |
| **Merging** | `pd.merge()`, `pd.concat()`, join types |
| **ML Prep** | `pd.get_dummies()`, normalization, standardization |

### Next Steps

- **Lab 2:** Machine Learning with PyTorch
- **Lab 3:** Neural Networks

---

*Remember to save your work! (Ctrl+S)*