# Data Analysis with Pandas

Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides powerful tools for working with structured data (tables, spreadsheets, databases).

## What You'll Learn
- Loading data from CSV files into DataFrames
- Exploring data structure and content
- Data type conversion and feature engineering
- Filtering and querying data
- Creating visualizations with seaborn and matplotlib
- Grouping and aggregating data
- Analyzing patterns and relationships

---

## Dataset: Hardware Testing Measurements

This notebook analyzes hardware testing data from electronic devices under test (DUT). The dataset contains:
- **Device information**: ID, board revision, and test run number
- **Test conditions**: Temperature and operating frequency
- **Measurements**: Supply voltage, current, signal-to-noise ratio
- **Test results**: Pass or fail status

**Use Case**: Quality assurance teams use this data to identify failure patterns, validate design specifications, and improve product reliability across different operating conditions.

---

## Step 1: Import Required Libraries

First, import the necessary libraries for data analysis and visualization.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

import seaborn as sns
import matplotlib.pyplot as plt

# Set visualization style
sns.set_theme(style="whitegrid")

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

**Libraries Used:**
- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computations
- **pathlib**: File path handling (cross-platform)
- **seaborn**: Statistical data visualization
- **matplotlib**: Plotting library

---

## Step 2: Load the Data

Load the CSV file into a pandas DataFrame.

In [None]:
# Define file path
csv_path = Path("hw_measurements.csv")

# Check if file exists
if csv_path.exists():
    print(f"✅ Found data file: {csv_path}")
else:
    print(f"❌ File not found: {csv_path}")
    print("Make sure hw_measurements.csv is in the same directory as this notebook.")

In [None]:
# Load CSV into DataFrame
df = pd.read_csv(csv_path)

print("✅ Data loaded successfully")
print(f"DataFrame shape: {df.shape[0]} rows × {df.shape[1]} columns")

**What happened:**
- `pd.read_csv()` reads the CSV file and converts it to a DataFrame
- DataFrame is pandas' primary data structure (like a spreadsheet)
- Each row is a measurement, each column is a variable
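
To see `pd.read_csv()` at work without the real file, here is a minimal sketch that reads a two-row CSV from memory. The column names follow this notebook's dataset, but the values are made up for illustration, and `parse_dates` is an optional shortcut for the datetime conversion covered in Step 4.

```python
import io
import pandas as pd

# Hypothetical two-row sample; values are invented, column names follow this notebook.
csv_text = """timestamp,device_id,board_rev,run,temp_c,freq_mhz,supply_v,current_a,snr_db,result
2024-01-01 09:00:00,DUT001,A,1,20,80,3.30,0.100,21.5,PASS
2024-01-01 09:05:00,DUT002,B,1,60,80,3.28,0.120,18.2,FAIL
"""

# read_csv accepts a path or any file-like object; StringIO stands in for a file here.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # one dtype per column, inferred from the text
```

Each CSV line becomes one row, and pandas infers a dtype for every column from the text it reads.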

---

## Step 3: Initial Data Exploration

### View the First Few Rows

In [None]:
# Display first 5 rows
df.head()

**What we see:**
- `timestamp`: When the measurement was taken
- `device_id`: Unique identifier for each device under test (DUT001, DUT002, etc.)
- `board_rev`: Board revision (A or B)
- `run`: Test run number (multiple runs per temperature)
- `temp_c`: Temperature in Celsius
- `freq_mhz`: Operating frequency in MHz
- `supply_v`: Supply voltage in volts (~3.3V)
- `current_a`: Current in amperes (~0.1A)
- `snr_db`: Signal-to-noise ratio in decibels
- `result`: Test result (PASS or FAIL)

### Check DataFrame Dimensions

In [None]:
# Get shape (rows, columns)
rows, cols = df.shape
print(f"DataFrame has {rows} rows and {cols} columns")

# Alternative: direct access
df.shape

### View DataFrame Information

In [None]:
# Display concise summary
df.info()

**Key Information:**
- **RangeIndex**: Row numbers (0 to n-1)
- **Data columns**: Total number of columns
- **Non-Null Count**: How many non-missing values per column
- **Dtype**: Data type of each column
  - `object`: Text/string data
  - `float64`: Decimal numbers (64-bit)
  - `int64`: Whole numbers (64-bit)

**⚠️ Notice**: `timestamp` is currently stored as `object` (text), not datetime. We'll fix this later.

### Statistical Summary

In [None]:
# Generate descriptive statistics for numeric columns
df.describe()

**Understanding the Output:**
- **count**: Number of non-null values
- **mean**: Average value
- **std**: Standard deviation (spread of data)
- **min**: Minimum value
- **25%, 50%, 75%**: Quartiles (percentiles)
- **max**: Maximum value

**Insights:**
- Temperature ranges from ~20°C to higher test temperatures
- Supply voltage around 3.3V (typical for modern digital circuits)
- Current consumption around 0.1A (100mA)
- Operating frequency at 80 MHz
- SNR values around 20dB (signal quality metric)
- Multiple test runs per temperature point for reliability

### Check for Missing Values

In [None]:
# Count missing values per column
missing_per_column = df.isna().sum()
print("Missing values per column:")
print(missing_per_column)
print()

# Total missing values in entire dataset
total_missing = df.isna().sum().sum()
print(f"Total missing values: {total_missing}")

if total_missing == 0:
    print("✅ No missing values - data is complete!")
else:
    print(f"⚠️ Found {total_missing} missing values")

**What happened:**
- `.isna()` creates Boolean DataFrame (True where data is missing)
- `.sum()` counts True values (missing data)
- First `.sum()` counts per column, second `.sum()` totals everything
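
The dataset here may have no gaps at all, so the following sketch uses a small made-up DataFrame with two missing values to show what each step returns. The `pct_missing` line is an extra, assuming you also want the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up readings with two gaps, for illustration only.
demo = pd.DataFrame({
    "supply_v": [3.30, np.nan, 3.29, 3.31],
    "snr_db": [21.0, 20.5, np.nan, 19.8],
    "result": ["PASS", "PASS", "FAIL", "PASS"],
})

mask = demo.isna()               # Boolean DataFrame, True where a value is missing
per_column = mask.sum()          # True counts as 1, so this counts gaps per column
total = int(per_column.sum())    # Sum of the per-column counts
pct_missing = mask.mean() * 100  # Mean of a Boolean column = share of True values

print(per_column)
print(f"Total missing: {total}")
print(pct_missing)
```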

### View Column Names

In [None]:
# Get list of column names
print("Column names:")
print(df.columns.tolist())
print()
print(f"Total columns: {len(df.columns)}")

### Analyze Categorical Data

In [None]:
# Count test results
result_counts = df["result"].value_counts()
print("Test Results:")
print(result_counts)
print()

# Calculate percentages
result_percentages = df["result"].value_counts(normalize=True) * 100
print("Test Results (percentage):")
print(result_percentages.round(2))
print()

# Failure rate
failure_rate = (df["result"] == "FAIL").sum() / len(df) * 100
print(f"Overall failure rate: {failure_rate:.2f}%")

**Insights:**
- `.value_counts()` counts occurrences of each unique value
- `normalize=True` converts counts to proportions
- Shows how many tests passed vs. failed

---

## Step 4: Data Type Conversion

### Convert Timestamp to Datetime

In [None]:
# Before conversion
print("Before conversion:")
print(f"Type: {df['timestamp'].dtype}")
print(f"Sample value: {df['timestamp'].iloc[0]}")
print()

# Convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# After conversion
print("After conversion:")
print(f"Type: {df['timestamp'].dtype}")
print(f"Sample value: {df['timestamp'].iloc[0]}")
print()
print("✅ Timestamp converted to datetime64")

**Why this matters:**
- Text timestamps can't be used for time-based calculations
- `datetime64` enables date arithmetic, filtering by date ranges, time series analysis
- Can now extract year, month, day, hour, etc.
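
A short sketch of what the conversion unlocks, using three made-up timestamps rather than the real dataset:

```python
import pandas as pd

# Made-up timestamps stored as text, as they would be right after read_csv.
ts = pd.DataFrame({
    "timestamp": ["2024-01-01 09:00:00", "2024-01-01 09:30:00", "2024-01-02 10:15:00"],
    "snr_db": [21.0, 20.4, 19.1],
})
ts["timestamp"] = pd.to_datetime(ts["timestamp"])

hours = ts["timestamp"].dt.hour                      # Extract a component via the .dt accessor
elapsed = ts["timestamp"] - ts["timestamp"].iloc[0]  # Date arithmetic yields Timedelta values
first_day = ts[ts["timestamp"] < "2024-01-02"]       # Filter by date; the string is parsed for you

print(hours.tolist())
print(elapsed)
print(first_day)
```

None of these three operations work on the original text column.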

In [None]:
# Verify the change
df.info()

**Notice**: `timestamp` is now `datetime64[ns]` instead of `object`

---

## Step 5: Creating New Columns

Create new columns (features) from existing data to gain additional insights.

### Calculate Power (P = V × I)

In [None]:
# Create new column: power (watts) = voltage × current
df["power_w"] = df["supply_v"] * df["current_a"]

print("✅ Created 'power_w' column")
print()
print("Sample power values:")
print(df[["supply_v", "current_a", "power_w"]].head())

**What happened:**
- Element-wise multiplication creates new column
- Power (watts) = Voltage × Current (the electrical power formula P = V × I; Ohm's law is the separate relation V = I × R)
- This derived metric helps analyze energy consumption
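
As a quick sanity check on the arithmetic, the sketch below applies the same multiplication to two made-up rows near the nominal 3.3 V and 0.1 A. The `power_mw` column is an illustrative extra, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows near the nominal operating point.
elec = pd.DataFrame({"supply_v": [3.30, 3.28], "current_a": [0.100, 0.120]})

# Row by row: 3.30 V x 0.100 A = 0.33 W, and 3.28 V x 0.120 A = 0.3936 W.
elec["power_w"] = elec["supply_v"] * elec["current_a"]
elec["power_mw"] = elec["power_w"] * 1000  # Same quantity in milliwatts

print(elec)
```

Because the result is floating point, compare values with a tolerance (for example `np.isclose`) rather than `==`.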

In [None]:
# Verify new column was added
df.info()

### Create Binary Failure Indicator

In [None]:
# Create binary column: 1 for FAIL, 0 for PASS
df["is_fail"] = (df["result"] == "FAIL").astype(int)

print("✅ Created 'is_fail' binary column")
print()
print("Comparison:")
print(df[["result", "is_fail"]].head(10))

**Why create this:**
- Binary (0/1) format is useful for mathematical operations
- Can calculate mean to get failure rate
- Easier for machine learning models
- `astype(int)` converts Boolean (True/False) to integers (1/0)
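
The sketch below, on five made-up outcomes, shows why the 0/1 encoding is convenient: the sum counts failures and the mean is the failure rate.

```python
import pandas as pd

# Made-up outcomes: 2 failures out of 5 tests.
outcome = pd.DataFrame({"result": ["PASS", "FAIL", "PASS", "FAIL", "PASS"]})

flags = outcome["result"] == "FAIL"      # Boolean Series
outcome["is_fail"] = flags.astype(int)   # True -> 1, False -> 0

n_fail = outcome["is_fail"].sum()        # Number of failures
fail_rate = outcome["is_fail"].mean()    # Proportion of failures

print(outcome)
print(f"{n_fail} failures, rate = {fail_rate:.0%}")
```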

In [None]:
# View updated DataFrame with new columns
df.head()

---

## Step 6: Filtering and Querying Data

### Method 1: Boolean Indexing with .loc[]

In [None]:
# Find failures where SNR dropped below 19dB
df_low_snr_failures = df.loc[
    (df["result"] == "FAIL") & (df["snr_db"] < 19)
]

print(f"Found {len(df_low_snr_failures)} failures with SNR < 19dB")
print()
print("Sample data:")
df_low_snr_failures.head()

**Understanding the syntax:**
- `.loc[]` accesses rows by labels/conditions
- `&` is logical AND (both conditions must be True)
- Parentheses are required around each condition
- Returns new DataFrame with matching rows

**Other operators:**
- `|` = OR (either condition)
- `~` = NOT (negate condition)
- `==` = equals
- `!=` = not equals
- `>`, `<`, `>=`, `<=` = comparisons
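
A minimal sketch of the other operators on a made-up four-row table. `Series.between()` is included as a convenient alternative to chaining `>=` and `<=`; it is inclusive on both ends by default.

```python
import pandas as pd

# Made-up measurements for illustration.
m = pd.DataFrame({
    "board_rev": ["A", "B", "A", "B"],
    "temp_c": [20, 40, 60, 70],
    "result": ["PASS", "PASS", "FAIL", "FAIL"],
})

hot_or_failed = m.loc[(m["temp_c"] >= 60) | (m["result"] == "FAIL")]  # OR
not_rev_a = m.loc[~(m["board_rev"] == "A")]                           # NOT
mid_range = m.loc[m["temp_c"].between(30, 60)]                        # 30 <= temp_c <= 60

print(hot_or_failed)
print(not_rev_a)
print(mid_range)
```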

### Method 2: Query Method (SQL-like)

In [None]:
# Find board revision B measurements with temp >= 40°C
df_rev_b_hot = df.query("board_rev == 'B' and temp_c >= 40").copy()

print(f"Found {len(df_rev_b_hot)} board B measurements with temp >= 40°C")
print()
print("Sample data:")
df_rev_b_hot.head()

**Why use .query():**
- More readable for complex conditions
- SQL-like syntax (familiar to database users)
- Use `and`, `or`, `not` instead of `&`, `|`, `~`
- `.copy()` creates independent DataFrame (not a view)

**Comparison:**
```python
# Boolean indexing (verbose)
df.loc[(df['board_rev'] == 'B') & (df['temp_c'] >= 40)]

# Query method (cleaner)
df.query("board_rev == 'B' and temp_c >= 40")
```
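
When the threshold lives in a Python variable, `.query()` can reference it with the `@` prefix. The sketch below uses made-up data and a hypothetical `min_temp` variable, and checks that both filtering styles select the same rows.

```python
import pandas as pd

# Made-up measurements for illustration.
m = pd.DataFrame({
    "board_rev": ["A", "B", "A", "B"],
    "temp_c": [20, 40, 60, 70],
})

min_temp = 40  # Hypothetical threshold held in a regular Python variable

by_mask = m.loc[(m["board_rev"] == "B") & (m["temp_c"] >= min_temp)]
by_query = m.query("board_rev == 'B' and temp_c >= @min_temp")  # @ refers to the variable

print(by_query)
print(f"Same rows selected: {by_mask.equals(by_query)}")
```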

---

## Step 7: Data Visualization

### Line Plot: Current vs Temperature by Board Revision

In [None]:
# Create figure with specific size
plt.figure(figsize=(10, 5))

# Create line plot
sns.lineplot(
    data=df,
    x="temp_c",
    y="current_a",
    hue="board_rev",  # Color by board revision
    marker="o",        # Add markers at data points
    linewidth=2
)

plt.title("Current vs Temperature (by Board Revision)", fontsize=14, fontweight='bold')
plt.xlabel("Temperature (°C)", fontsize=12)
plt.ylabel("Current (A)", fontsize=12)
plt.legend(title="Board Revision")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Insights:**
- Current increases with temperature (expected behavior)
- Both board revisions show similar trends
- Slight variations between revisions at different temperatures

**Visualization tips:**
- `figsize=(width, height)` controls plot size
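- `hue` creates separate lines by category
- `marker="o"` adds a dot at each x value; with several runs per temperature, seaborn plots their mean and shades a 95% confidence band by default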
- `tight_layout()` prevents label overlap
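
When several rows share the same x value, `sns.lineplot()` aggregates them with its default estimator (the mean) before drawing the line. The pandas sketch below, on made-up runs, computes the points such a line would pass through. It does not reproduce the confidence band.

```python
import pandas as pd

# Made-up data: two runs per temperature for one board revision.
runs = pd.DataFrame({
    "board_rev": ["A", "A", "A", "A"],
    "temp_c": [20, 20, 40, 40],
    "current_a": [0.100, 0.104, 0.110, 0.114],
})

# One mean per (board_rev, temp_c) pair: the y values of the plotted line.
line_points = runs.groupby(["board_rev", "temp_c"])["current_a"].mean()

print(line_points)
```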

### Line Plot: Supply Voltage vs Temperature

In [None]:
plt.figure(figsize=(10, 5))

sns.lineplot(
    data=df,
    x="temp_c",
    y="supply_v",
    hue="board_rev",
    marker="o",
    linewidth=2
)

plt.title("Supply Voltage vs Temperature (by Board Revision)", fontsize=14, fontweight='bold')
plt.xlabel("Temperature (°C)", fontsize=12)
plt.ylabel("Supply Voltage (V)", fontsize=12)
plt.axhline(y=3.3, color='red', linestyle='--', alpha=0.5, label='Target 3.3V')
plt.legend(title="Board Revision")  # Called after axhline so the target line appears in the legend
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Insights:**
- Voltage stability across temperature range
- Both board revisions maintain voltage near 3.3V target
- Red dashed line shows target voltage (3.3V)
- Voltage regulation remains consistent with temperature changes

**New element:**
- `plt.axhline()` adds horizontal reference line

### Scatter Plot: SNR vs Current (colored by result)

In [None]:
plt.figure(figsize=(10, 6))

sns.scatterplot(
    data=df,
    x="current_a",
    y="snr_db",
    hue="result",      # Color by test result
    palette={"PASS": "tab:blue", "FAIL": "tab:red"},  # Fix colors: passes blue, failures red
    style="board_rev",  # Different markers by board rev
    s=100,              # Marker size
    alpha=0.7           # Transparency
)

plt.title("SNR vs Current (by Result and Board Revision)", fontsize=14, fontweight='bold')
plt.xlabel("Current (A)", fontsize=12)
plt.ylabel("Signal-to-Noise Ratio (dB)", fontsize=12)
plt.legend(title="Status", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Insights:**
- SNR decreases as current increases (inverse relationship)
- Failures (red) cluster at high current, low SNR
- Passes (blue) at low current, high SNR
- Clear separation between pass and fail regions

**Scatter plot features:**
- `style` varies marker shape
- `alpha` controls transparency (0=transparent, 1=opaque)
- `bbox_to_anchor` moves legend outside plot

### Box Plot: Power Distribution by Temperature Band

In [None]:
# Create temperature bands (categories)
df["temp_band"] = pd.cut(
    df["temp_c"],
    bins=[15, 30, 45, 60, 70],
    labels=["15-30°C", "30-45°C", "45-60°C", "60-70°C"]
)

print("✅ Created temperature bands")
print()
print("Distribution:")
print(df["temp_band"].value_counts().sort_index())

**What pd.cut() does:**
- Divides continuous data into discrete bins
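- `bins` defines bin edges; by default each bin excludes its left edge and includes its right edge, so values outside the outer edges become NaN
- `labels` provides names for each bin (one fewer label than edges)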
- Useful for grouping and categorical analysis
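
Bin edges are easy to get wrong, so here is a small sketch on made-up temperatures using the same edges as above. It shows which values land in which interval, and how `include_lowest=True` changes the treatment of the lowest edge.

```python
import pandas as pd

temps = pd.Series([15, 20, 30, 31, 70, 75])  # Made-up temperatures
edges = [15, 30, 45, 60, 70]

# Default: intervals are open on the left and closed on the right, e.g. (15, 30].
bands = pd.cut(temps, bins=edges)

# include_lowest=True also captures a value equal to the first edge (15 here).
bands_with_lowest = pd.cut(temps, bins=edges, include_lowest=True)

print(pd.DataFrame({
    "temp_c": temps,
    "band": bands,
    "band_incl_lowest": bands_with_lowest,
}))
```

With the default settings, 15 and 75 fall outside every bin and come back as NaN, while 30 belongs to the first bin rather than the second.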

In [None]:
plt.figure(figsize=(10, 6))

sns.boxplot(
    data=df,
    x="temp_band",
    y="power_w",
    hue="board_rev",
    palette="Set2"
)

plt.title("Power Distribution by Temperature Band", fontsize=14, fontweight='bold')
plt.xlabel("Temperature Range", fontsize=12)
plt.ylabel("Power (W)", fontsize=12)
plt.legend(title="Board Revision")
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

**Reading box plots:**
- **Box**: Represents middle 50% of data (25th to 75th percentile)
- **Line in box**: Median (50th percentile)
- **Whiskers**: Extend to the most extreme data points within 1.5×IQR of the box edges (IQR = 75th minus 25th percentile)
- **Dots**: Outliers beyond whiskers
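
The elements of a box plot can be reproduced numerically. The sketch below uses seven made-up power values and the common 1.5×IQR rule. The variable names are illustrative, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Made-up power readings; the last value is deliberately extreme.
power = pd.Series([0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.60])

q1, median, q3 = power.quantile([0.25, 0.50, 0.75])  # Box edges and median line
iqr = q3 - q1                                        # Height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences.
inside = power[(power >= lower_fence) & (power <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual dot.
outliers = power[(power < lower_fence) | (power > upper_fence)]

print(f"Box: {q1:.3f} to {q3:.3f}, median {median:.3f}")
print(f"Whiskers: {whisker_low:.2f} to {whisker_high:.2f}")
print(f"Outliers: {outliers.tolist()}")
```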

**Insights:**
- Power consumption increases with temperature
- Higher temperature bands show more variability
- Board B has slightly higher power at high temps

### Facet Grid: SNR vs Temperature per Device

In [None]:
# Create grid of subplots (one per device)
g = sns.FacetGrid(
    df,
    col="device_id",    # Separate plot per device
    col_wrap=4,         # 4 plots per row
    height=3,           # Height of each subplot
    sharey=True         # Share y-axis scale
)

# Add line plot to each subplot
g.map_dataframe(
    sns.lineplot,
    x="temp_c",
    y="snr_db",
    marker="o",
    color="steelblue",
    linewidth=2
)

# Customize titles and layout
g.set_titles("{col_name}", fontweight='bold')
g.set_axis_labels("Temperature (°C)", "SNR (dB)")
g.figure.suptitle("SNR vs Temperature (per device)", y=1.02, fontsize=14, fontweight='bold')
g.tight_layout()
plt.show()

**FacetGrid advantages:**
- Compare patterns across categories (devices)
- See individual device behavior
- Identify outliers or anomalies
- Spot device-specific issues vs. systematic problems

**Insights to look for:**
- Consistent trends across all devices (systematic behavior)
- Outlier devices with different patterns (potential defects)
- Similar or different SNR degradation rates
- Temperature thresholds where SNR drops significantly

---

## Step 8: Grouping and Aggregation

### Group by Multiple Columns

In [None]:
# Group by board revision and temperature, calculate statistics
aggregated = (
    df.groupby(["board_rev", "temp_c"], as_index=False)
      .agg(
          mean_power_w=("power_w", "mean"),
          mean_snr_db=("snr_db", "mean"),
          mean_current_a=("current_a", "mean"),
          fail_rate=("is_fail", "mean"),
          count=("device_id", "count")
      )
)

print("✅ Created aggregated DataFrame")
print(f"Original: {len(df)} rows → Aggregated: {len(aggregated)} rows")
print()
aggregated.head(10)

**Understanding .agg():**
- Groups data by specified columns
- Applies aggregation functions (mean, sum, count, etc.)
- Syntax: `new_column=("source_column", "function")`
- `as_index=False` keeps grouping columns as regular columns

**Aggregations performed:**
- `mean_power_w`: Average power per group
- `mean_snr_db`: Average SNR per group
- `mean_current_a`: Average current per group
- `fail_rate`: Mean of binary is_fail (= proportion of failures)
- `count`: Number of measurements per group
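
To make the effect of `as_index=False` concrete, this sketch runs the same named aggregation both ways on a made-up four-row table. Without the flag, the grouping columns move into a MultiIndex; `.reset_index()` turns them back into regular columns.

```python
import pandas as pd

# Made-up measurements: two rows per board revision at one temperature.
toy = pd.DataFrame({
    "board_rev": ["A", "A", "B", "B"],
    "temp_c": [20, 20, 20, 20],
    "power_w": [0.30, 0.34, 0.36, 0.40],
    "is_fail": [0, 1, 0, 0],
})

# Grouping columns stay as regular columns.
summary = toy.groupby(["board_rev", "temp_c"], as_index=False).agg(
    mean_power_w=("power_w", "mean"),
    fail_rate=("is_fail", "mean"),
    count=("is_fail", "count"),
)

# Default behavior: grouping columns become the (multi-level) index.
with_index = toy.groupby(["board_rev", "temp_c"]).agg(
    mean_power_w=("power_w", "mean"),
    fail_rate=("is_fail", "mean"),
    count=("is_fail", "count"),
)

print(summary)
print(with_index.index.names)
print(with_index.reset_index().columns.tolist())
```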

### Visualize Aggregated Data: Mean Power vs Temperature

In [None]:
plt.figure(figsize=(10, 5))

sns.lineplot(
    data=aggregated,
    x="temp_c",
    y="mean_power_w",
    hue="board_rev",
    marker="o",
    linewidth=2.5,
    markersize=8
)

plt.title("Mean Power vs Temperature (Aggregated by Board Revision)", fontsize=14, fontweight='bold')
plt.xlabel("Temperature (°C)", fontsize=12)
plt.ylabel("Mean Power (W)", fontsize=12)
plt.legend(title="Board Revision")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Insights:**
- Cleaner trend lines after aggregation
- Board B has slightly higher power consumption
- Both revisions show exponential-like increase with temperature

### Critical Plot: Failure Rate vs Temperature

In [None]:
plt.figure(figsize=(10, 5))

sns.lineplot(
    data=aggregated,
    x="temp_c",
    y="fail_rate",
    hue="board_rev",
    marker="o",
    linewidth=2.5,
    markersize=8
)

plt.title("Failure Rate vs Temperature (Aggregated)", fontsize=14, fontweight='bold')
plt.xlabel("Temperature (°C)", fontsize=12)
plt.ylabel("Failure Rate (proportion)", fontsize=12)
plt.ylim(0, 1)
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='50% failure threshold')
plt.legend(title="Board Revision")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**What to analyze:**
- Identify temperature threshold where failures begin
- Compare failure rates between board revisions
- Determine if failures increase gradually or suddenly
- Find safe operating temperature range
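
The questions above can also be answered numerically. The sketch below assumes an aggregated table shaped like `aggregated` (one row per board revision and temperature) but uses made-up failure rates. The "clean" temperature is only a meaningful limit if failures do not appear and then disappear again at higher temperatures.

```python
import pandas as pd

# Made-up aggregated results, shaped like the `aggregated` DataFrame.
agg_demo = pd.DataFrame({
    "board_rev": ["A", "A", "A", "B", "B", "B"],
    "temp_c": [20, 40, 60, 20, 40, 60],
    "fail_rate": [0.0, 0.0, 0.5, 0.0, 0.25, 0.75],
})

# Lowest temperature per board at which any failure was recorded.
failing = agg_demo[agg_demo["fail_rate"] > 0]
first_failure_temp = failing.groupby("board_rev")["temp_c"].min()

# Highest temperature per board with no recorded failures.
passing = agg_demo[agg_demo["fail_rate"] == 0]
max_clean_temp = passing.groupby("board_rev")["temp_c"].max()

print(first_failure_temp)
print(max_clean_temp)
```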

**Business Impact:**
- Use failure rate data to set operating specifications
- Identify temperature limits for product datasheet
- Determine if thermal protection is needed
- Guide design improvements for next revision

---

## Step 9: Additional Analysis

### Correlation Analysis

In [None]:
# Select numeric columns for correlation
numeric_cols = ["run", "temp_c", "freq_mhz", "supply_v", "current_a", "snr_db", "power_w", "is_fail"]
correlation_matrix = df[numeric_cols].corr()

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation_matrix,
    annot=True,          # Show correlation values
    fmt='.2f',           # Format to 2 decimal places
    cmap="coolwarm",     # Color scheme
    center=0,            # Center colormap at 0
    square=True,         # Square cells
    linewidths=1,        # Cell borders
    cbar_kws={"shrink": 0.8}
)

plt.title("Correlation Matrix", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

**Reading correlations:**
- **+1.0**: Perfect positive correlation (both increase together)
- **0.0**: No correlation
- **-1.0**: Perfect negative correlation (one increases, other decreases)
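
A small sketch with made-up, perfectly linear data shows the two extremes, plus one case that often surprises: a column with no variation has an undefined correlation, which pandas reports as NaN.

```python
import numpy as np
import pandas as pd

# Made-up values chosen to be exactly linear in temperature.
c = pd.DataFrame({
    "temp_c": [20, 30, 40, 50, 60],
    "current_a": [0.10, 0.11, 0.12, 0.13, 0.14],  # Rises with temperature
    "snr_db": [22.0, 21.0, 20.0, 19.0, 18.0],     # Falls with temperature
    "freq_mhz": [80, 80, 80, 80, 80],             # Constant
})

corr = c.corr()  # Pearson correlation by default

print(corr.round(2))
```

Here `temp_c` vs `current_a` is close to +1, `temp_c` vs `snr_db` is close to -1, and every entry involving `freq_mhz` is NaN.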

**What to look for:**
- **Strong correlations** (|r| > 0.7): Variables with strong relationships
- **Temperature effects**: How temp_c correlates with other measurements
- **Failure indicators**: Which variables correlate with is_fail
- **Power relationships**: How power_w relates to voltage and current
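- **Signal quality**: SNR correlation with other parameters
- **Blank or NaN cells**: A column with no variation (for example `freq_mhz`, if every test ran at 80 MHz) has an undefined correlation with everything else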

**Analysis approach:**
- Identify which factors most influence test results
- Look for unexpected relationships that warrant investigation
- Use correlation insights to guide deeper analysis

### Summary Statistics by Board Revision

In [None]:
# Compare statistics between board revisions
comparison = df.groupby("board_rev").agg({
    "temp_c": ["mean", "min", "max"],
    "supply_v": ["mean", "std"],
    "current_a": ["mean", "max"],
    "snr_db": ["mean", "min"],
    "power_w": ["mean", "max"],
    "is_fail": ["sum", "mean"]
})

print("Comparison by Board Revision:")
comparison

**Insights:**
- Both revisions tested across similar temperature ranges
- Board B has slightly better voltage stability
- Similar failure rates between revisions
- Performance differences are minor
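
Passing a dictionary of lists to `.agg()` produces two-level column names, which can be awkward to select or export. The sketch below, on made-up data, shows how to pick one statistic with a tuple key and one possible way to flatten the names. The `column_statistic` naming is just a convention chosen here.

```python
import pandas as pd

# Made-up measurements for illustration.
toy = pd.DataFrame({
    "board_rev": ["A", "A", "B", "B"],
    "snr_db": [21.0, 19.0, 22.0, 20.0],
    "is_fail": [0, 1, 0, 0],
})

comparison_demo = toy.groupby("board_rev").agg({
    "snr_db": ["mean", "min"],
    "is_fail": ["sum", "mean"],
})

# Columns are (column, statistic) pairs; select one with a tuple.
mean_snr = comparison_demo[("snr_db", "mean")]

# Optional: flatten to single-level names such as "snr_db_mean".
flat = comparison_demo.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]

print(mean_snr)
print(flat)
```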

---

## Step 10: Creating DataFrames and Exporting Data

### Creating DataFrames from Dictionaries

DataFrames can be created from various data structures. Here are two common approaches:

**Method 1: Dictionary with column names as keys**

Each key becomes a column name, values are lists/arrays of data.

In [None]:
# Create DataFrame from dictionary (columns as keys)
data_dict = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'quantity': [10, 50, 30, 15],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
}

df_products = pd.DataFrame(data_dict)

print("DataFrame from dictionary (columns as keys):")
print(df_products)
print()
print(f"Shape: {df_products.shape}")
print(f"Columns: {df_products.columns.tolist()}")

**How it works:**
- Dictionary keys become column names
- Dictionary values (lists) become column data
- All lists must have the same length
- Pandas automatically creates a default integer index (0, 1, 2, ...)

**Method 2: List of dictionaries (rows as dictionaries)**

Each dictionary represents one row of data.

In [None]:
# Create DataFrame from list of dictionaries (rows as dictionaries)
data_list = [
    {'product': 'Laptop', 'price': 1200, 'quantity': 10, 'category': 'Electronics'},
    {'product': 'Mouse', 'price': 25, 'quantity': 50, 'category': 'Accessories'},
    {'product': 'Keyboard', 'price': 75, 'quantity': 30, 'category': 'Accessories'},
    {'product': 'Monitor', 'price': 300, 'quantity': 15, 'category': 'Electronics'}
]

df_products_from_list = pd.DataFrame(data_list)

print("DataFrame from list of dictionaries (rows as dictionaries):")
print(df_products_from_list)
print()

# Verify both methods create identical DataFrames
are_equal = df_products.equals(df_products_from_list)
print(f"Both methods create identical DataFrames: {are_equal}")

**How it works:**
- Each dictionary in the list represents one row
- Dictionary keys become column names
- Dictionaries can have different keys (missing values filled with NaN)
- This format is common when reading data from APIs or JSON files
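
To illustrate the point about differing keys, this sketch builds a DataFrame from three made-up records that do not share the same fields. The `Cable` record and the `color` field are hypothetical.

```python
import pandas as pd

# Made-up records with inconsistent keys, as an API might return them.
records = [
    {"product": "Laptop", "price": 1200, "quantity": 10},
    {"product": "Mouse", "price": 25},                        # No 'quantity'
    {"product": "Cable", "quantity": 100, "color": "black"},  # No 'price', extra 'color'
]

df_records = pd.DataFrame(records)

print(df_records)
print(df_records.dtypes)
```

The columns are the union of all keys, gaps are filled with NaN, and `price` and `quantity` become `float64` because an integer column cannot hold NaN.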

**When to use each method:**
- **Dictionary (Method 1)**: When you have data organized by columns (e.g., collecting all prices together)
- **List of dictionaries (Method 2)**: When you have data organized by rows (e.g., each record is a complete observation)

### Exporting DataFrames to CSV

After analyzing and transforming your data, you can export it to a CSV file for sharing or further use.

In [None]:
# Example 1: Basic CSV export (includes index by default)
output_path_basic = Path("products_export_basic.csv")
df_products.to_csv(output_path_basic)
print(f"✅ Exported to {output_path_basic}")
print()

# Example 2: Export without index (cleaner for most use cases)
output_path_no_index = Path("products_export.csv")
df_products.to_csv(output_path_no_index, index=False)
print(f"✅ Exported to {output_path_no_index} (without index)")
print()

# Example 3: Export with specific columns only
output_path_selected = Path("products_prices.csv")
df_products[['product', 'price']].to_csv(output_path_selected, index=False)
print(f"✅ Exported selected columns to {output_path_selected}")
print()

# Example 4: Export the aggregated analysis results
output_path_aggregated = Path("aggregated_results.csv")
aggregated.to_csv(output_path_aggregated, index=False)
print(f"✅ Exported aggregated data to {output_path_aggregated}")
print()

print("Common parameters:")
print("- index=False : Don't include row index in output")
print("- sep=',' : Column separator (default is comma)")
print("- encoding='utf-8' : Character encoding")
print("- columns=['col1', 'col2'] : Select specific columns to export")

**Key points about CSV export:**
- **`.to_csv(path)`** writes DataFrame to CSV file
- **`index=False`** is commonly used to exclude the row index (cleaner output)
- Relative paths are resolved against the current working directory, which is usually the notebook's folder (use `Path` for cross-platform compatibility)
- You can export filtered, transformed, or aggregated DataFrames
- Perfect for sharing results with non-Python users (Excel, Google Sheets can open CSV files)
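
The effect of `index=False` is easiest to see in a round trip. This sketch writes a made-up two-row DataFrame to CSV text (calling `to_csv()` without a path returns a string) and reads it back both ways.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"product": ["Laptop", "Mouse"], "price": [1200, 25]})

with_index = df_demo.to_csv()               # No path given: returns the CSV text
without_index = df_demo.to_csv(index=False)

print(with_index)
print(without_index)

# Reading back: the saved index returns as an extra, unnamed column.
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

If you do want to keep a saved index, `pd.read_csv(..., index_col=0)` restores it instead of creating the `Unnamed: 0` column.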

**Typical workflow:**
1. Load data: `df = pd.read_csv('input.csv')`
2. Analyze and transform data
3. Export results: `df.to_csv('output.csv', index=False)`

---

## Best Practices

### ✅ Do:
- **Explore first**: Use `.head()`, `.info()`, `.describe()` before analysis
- **Check data types**: Convert strings to datetime when needed
- **Handle missing values**: Check with `.isna().sum()`
- **Create derived features**: Add calculated columns (like power_w)
- **Use meaningful names**: Clear column and variable names
- **Visualize early**: Plots reveal patterns that statistics might miss
- **Group and aggregate**: Summarize data to find trends
- **Document insights**: Add markdown cells explaining findings

### ❌ Don't:
- Skip exploratory data analysis (EDA)
- Assume data types are correct
- Ignore missing values or outliers
- Create plots without labels/titles
- Analyze without understanding the domain
- Forget to use `.copy()` when you plan to modify a filtered DataFrame
- Over-complicate visualizations

---

## Summary

### Analysis Workflow Learned:
1. **Load data**: Import CSV files into DataFrames
2. **Explore**: Examine structure, data types, and basic statistics
3. **Clean**: Convert data types, handle missing values
4. **Transform**: Create derived features (power calculation, binary flags)
5. **Filter**: Extract subsets using boolean indexing and queries
6. **Visualize**: Create plots to identify patterns and relationships
7. **Aggregate**: Group data and calculate summary statistics
8. **Interpret**: Draw insights and make data-driven recommendations

### Key DataFrame Operations:

**Data Loading & Inspection:**
```python
pd.read_csv()              # Load CSV file
df.head()                  # View first rows
df.shape                   # Get dimensions (rows, columns)
df.info()                  # Column info and data types
df.describe()              # Statistical summary
df.columns                 # Column names
```

**Data Cleaning:**
```python
pd.to_datetime()           # Convert to datetime
df["new_col"] = ...        # Create new column
pd.cut()                   # Bin continuous data into categories
df.isna().sum()            # Count missing values
df["col"].value_counts()   # Count unique values
```

**Filtering:**
```python
df.loc[condition]          # Boolean indexing
df.query("expression")     # SQL-like filtering
df.copy()                  # Create independent copy
```

**Aggregation:**
```python
df.groupby().agg()         # Group and aggregate
df.corr()                  # Correlation matrix
```

### Visualization Techniques:

```python
sns.lineplot()             # Line charts for trends over continuous variables
sns.scatterplot()          # Scatter plots for relationships between two variables
sns.boxplot()              # Box plots for distribution across categories
sns.heatmap()              # Heatmaps for correlation matrices
sns.FacetGrid()            # Multiple subplots for comparing categories
```

**Customization:**
- `figsize=(width, height)` - Control plot dimensions
- `hue` - Color by category
- `style` - Vary marker/line style by category
- `marker`, `linewidth`, `markersize` - Visual styling
- `plt.axhline()`, `plt.axvline()` - Add reference lines
- `plt.title()`, `plt.xlabel()`, `plt.ylabel()` - Labels
- `plt.legend()` - Configure legend
- `plt.grid()` - Add gridlines
- `plt.tight_layout()` - Optimize spacing

### Best Practices Reinforced:
✅ Always explore data before analysis  
✅ Verify and convert data types early  
✅ Create derived features to enhance analysis  
✅ Use visualizations to validate statistical findings  
✅ Group and aggregate to identify patterns  
✅ Document insights with markdown cells  
✅ Use descriptive variable names  

### Next Steps:
Continue building your pandas skills with:
- **Time series analysis**: Date-based aggregations, rolling windows, resampling
- **Advanced merging**: Combining multiple DataFrames with merge, join, concat
- **Pivot tables**: Reshaping data with pivot_table()
- **Missing data handling**: fillna(), dropna(), interpolate()
- **String operations**: Working with text data using .str accessor
- **Apply functions**: Custom transformations with apply(), map()
- **Machine learning**: Using pandas with scikit-learn for predictive modeling