# Python List Built-in Functions for Data Engineering

This notebook covers essential built-in functions for working with lists in Python,
with examples focused on Data Engineering use cases.

## 1. `filter()` - Filtering Data Records

The `filter()` function is fundamental in Data Engineering for:
- Removing invalid/null records from datasets
- Filtering records based on business rules
- Data quality checks in ETL pipelines

**Syntax:** `filter(function, iterable)`

Returns an iterator that yields the elements for which the function returns a truthy value (typically `True`).
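One detail that is easy to miss: `filter()` keeps an element whenever the function returns any truthy value, and passing `None` as the function keeps the elements that are truthy themselves. A minimal sketch, using made-up values (`raw_values` is illustrative and not part of the pipeline data):

```python
# Hypothetical sample values, for illustration only
raw_values = ["C101", "", None, "C103", 0, "C105"]

# With None as the function, filter() keeps the truthy elements
non_empty = list(filter(None, raw_values))
print(non_empty)  # ['C101', 'C103', 'C105']

# The function may return any truthy/falsy value, not only True/False.
# len() returns 0 for an empty string, which is falsy.
with_length = list(filter(len, ["a", "", "bc"]))
print(with_length)  # ['a', 'bc']
```

Note that `filter(None, ...)` also drops legitimate falsy values such as `0`, so it is only a safe shortcut when zero is not meaningful data.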

In [None]:
# Sample data: Raw transaction records from a data pipeline
transactions = [
    {"txn_id": "T001", "amount": 1500.00, "status": "completed", "customer_id": "C101"},
    {"txn_id": "T002", "amount": None, "status": "failed", "customer_id": "C102"},
    {"txn_id": "T003", "amount": 2500.50, "status": "completed", "customer_id": "C103"},
    {"txn_id": "T004", "amount": 0, "status": "pending", "customer_id": None},
    {"txn_id": "T005", "amount": 5000.00, "status": "completed", "customer_id": "C105"},
    {"txn_id": "T006", "amount": -100.00, "status": "refund", "customer_id": "C101"},
]

print(f"Total raw records: {len(transactions)}")

### Use Case 1: Filter using a Regular Function

In ETL pipelines, we often need to filter out records with missing or invalid data.

**How `filter()` works:**
1. You define a function that takes ONE element and returns `True` or `False`
2. `filter()` applies this function to EACH element of the iterable
3. Elements for which the function returns `True` are kept; the rest are discarded

```
filter(function, iterable)
       ↓         ↓
       |         └── Your list of data
       └── A function that returns True/False for each item
```
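The steps above can be sketched as a small generator function. This is a simplified stand-in for illustration, not how the built-in is actually implemented; `my_filter`, `is_positive`, and `amounts` are hypothetical names.

```python
def my_filter(function, iterable):
    """Simplified stand-in for the built-in filter() (illustration only)."""
    for item in iterable:
        # Keep the item only when the function says so
        if function(item):
            yield item


def is_positive(amount):
    return amount > 0


# Hypothetical sample amounts
amounts = [1500.00, 0, 2500.50, -100.00]

ours = list(my_filter(is_positive, amounts))
builtin = list(filter(is_positive, amounts))

print(ours)             # [1500.0, 2500.5]
print(ours == builtin)  # True
```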

In [None]:
# STEP 1: Define a filter function
# This function takes ONE record and returns True/False
def has_valid_amount(record):
    """Check if transaction has a valid positive amount."""
    # Returns True if amount exists AND is greater than 0
    # Returns False otherwise (record will be filtered out)
    return record["amount"] is not None and record["amount"] > 0


# Let's see what our function returns for each record
print("Testing our filter function on each record:")
print("-" * 50)
for txn in transactions:
    result = has_valid_amount(txn)
    # !s converts the amount to text first: None does not accept the ">10" format spec
    print(f"{txn['txn_id']} | amount: {txn['amount']!s:>10} | keep? {result}")

In [None]:
# STEP 2: Apply filter() with our function
# filter(function_name, list_to_filter)

filtered_result = filter(has_valid_amount, transactions)

# IMPORTANT: filter() returns an ITERATOR, not a list
print(f"filter() returns: {filtered_result}")
print(f"Type: {type(filtered_result)}")

# STEP 3: Convert to list to see/use the results
valid_transactions = list(filtered_result)

print(f"\nFiltered down to {len(valid_transactions)} valid transactions:")
for txn in valid_transactions:
    print(f"  {txn['txn_id']}: ${txn['amount']:.2f}")
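Because `filter()` returns an iterator, its result can be consumed only once. This is a common source of "empty result" bugs in pipelines. A minimal sketch with hypothetical sample amounts:

```python
def is_positive(amount):
    return amount > 0


# Hypothetical sample amounts
amounts = [1500.00, 0, 2500.50, -100.00]

positive = filter(is_positive, amounts)

first_pass = list(positive)   # consumes the iterator
second_pass = list(positive)  # nothing left to yield

print(first_pass)   # [1500.0, 2500.5]
print(second_pass)  # []
```

If the filtered data is needed more than once, convert it to a list one time and reuse that list.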

### Use Case 2: Using Lambda for Quick Inline Filtering

Before we use lambda with filter(), let's understand what lambda is.

---

### What is Lambda?

**Lambda is a way to create a small, anonymous (unnamed) function in one line.**

Instead of writing:
```python
def my_function(x):
    return x > 100
```

You can write:
```python
lambda x: x > 100
```

**Syntax breakdown:**
```
lambda x: x > 100
   ↓   ↓     ↓
   |   |     └── What to return (the expression)
   |   └── Input parameter(s)
   └── Keyword that says "this is a lambda"
```

**Key points:**
- Lambda is just a shorthand for simple functions
- Can only contain ONE expression (no statements such as assignments, loops, or `if` blocks)
- Automatically returns the result of the expression
- Commonly used with `filter()`, `map()`, `sorted()`
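As a quick illustration of the last point, here is a sketch of a lambda used with `sorted()` and `map()`. The `records` list is a hypothetical sample, reduced to two fields.

```python
# Hypothetical sample records
records = [
    {"txn_id": "T001", "amount": 1500.00},
    {"txn_id": "T005", "amount": 5000.00},
    {"txn_id": "T003", "amount": 2500.50},
]

# sorted(): the lambda tells sorted() which value to sort by
by_amount_desc = sorted(records, key=lambda r: r["amount"], reverse=True)
sorted_ids = [r["txn_id"] for r in by_amount_desc]
print(sorted_ids)  # ['T005', 'T003', 'T001']

# map(): the lambda is applied to every record
amounts = list(map(lambda r: r["amount"], records))
print(amounts)  # [1500.0, 5000.0, 2500.5]
```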

In [None]:
# Lambda Examples - Understanding the basics

# Example 1: Regular function vs Lambda
# Regular function
def is_positive(x):
    return x > 0

# Same thing as lambda
# (named here only for comparison; PEP 8 recommends def when a function needs a name)
is_positive_lambda = lambda x: x > 0

# Both work the same way
print("Regular function: is_positive(5) =", is_positive(5))
print("Lambda function:  is_positive_lambda(5) =", is_positive_lambda(5))
print()

# Example 2: Lambda with dictionary access
# Regular function
def get_amount(record):
    return record["amount"]

# Same as lambda
get_amount_lambda = lambda record: record["amount"]

sample = {"txn_id": "T001", "amount": 1500}
print("Regular function: get_amount(sample) =", get_amount(sample))
print("Lambda function:  get_amount_lambda(sample) =", get_amount_lambda(sample))

### Now: Using Lambda with filter()

Instead of defining a separate function, we can write the filter logic inline.

In [None]:
# COMPARING: Regular function vs Lambda with filter()

# METHOD 1: Using a regular function (what we did in Use Case 1)
def is_completed(record):
    return record["status"] == "completed"

completed_method1 = list(filter(is_completed, transactions))


# METHOD 2: Using lambda (inline - no separate function needed)
# The lambda does exactly the same thing as the is_completed function above
completed_method2 = list(filter(lambda x: x["status"] == "completed", transactions))

#                              ↑
#                              └── This lambda is equivalent to:
#                                  def anonymous(x):
#                                      return x["status"] == "completed"

print("Method 1 (regular function):", [t["txn_id"] for t in completed_method1])
print("Method 2 (lambda):          ", [t["txn_id"] for t in completed_method2])
print(f"\nBoth methods give same result: {completed_method1 == completed_method2}")

In [None]:
# More Lambda + filter() examples for Data Engineering

# Example 1: Filter high-value transactions (amount > 2000)
high_value = list(filter(lambda x: x["amount"] is not None and x["amount"] > 2000, transactions))
print("High-value transactions (>$2000):")
for txn in high_value:
    print(f"  {txn['txn_id']}: ${txn['amount']:.2f}")

print()

# Example 2: Filter transactions with valid customer_id
has_customer = list(filter(lambda x: x["customer_id"] is not None, transactions))
print("Transactions with valid customer_id:")
for txn in has_customer:
    print(f"  {txn['txn_id']}: {txn['customer_id']}")

print()

# Example 3: Filter by multiple conditions (still one expression!)
# Transactions that are completed AND have amount > 1000
premium_completed = list(filter(
    lambda x: x["status"] == "completed" and x["amount"] is not None and x["amount"] > 1000,
    transactions
))
print("Premium completed transactions:")
for txn in premium_completed:
    print(f"  {txn['txn_id']}: ${txn['amount']:.2f} - {txn['status']}")

### When to Use Regular Function vs Lambda?

| Use Regular Function | Use Lambda |
|---------------------|------------|
| Complex logic (multiple lines) | Simple one-liner conditions |
| Need to reuse the function elsewhere | One-time use, inline |
| Logic needs documentation/docstring | Self-explanatory condition |
| Multiple conditions that are hard to read | Quick filters |

**DE Rule of Thumb:** If your lambda gets too long or hard to read, switch to a regular function.
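To make the rule of thumb concrete, here is a sketch of the same filter written both ways. The business rule and the names `is_billable` and `sample_txns` are hypothetical, chosen only to show the refactoring.

```python
# Hypothetical sample records
sample_txns = [
    {"txn_id": "T001", "amount": 1500.00, "status": "completed", "customer_id": "C101"},
    {"txn_id": "T002", "amount": None, "status": "failed", "customer_id": "C102"},
    {"txn_id": "T004", "amount": 0, "status": "pending", "customer_id": None},
    {"txn_id": "T005", "amount": 5000.00, "status": "completed", "customer_id": "C105"},
]

# Hard to read: several conditions crammed into one lambda
crowded = list(filter(
    lambda x: x["amount"] is not None and x["amount"] > 0
    and x["customer_id"] is not None and x["status"] in ("completed", "refund"),
    sample_txns,
))


# Easier to read, test, and document: a named function with early returns
def is_billable(record):
    """Hypothetical rule: positive amount, known customer, accepted status."""
    amount = record["amount"]
    if amount is None or amount <= 0:
        return False
    if record["customer_id"] is None:
        return False
    return record["status"] in ("completed", "refund")


readable = list(filter(is_billable, sample_txns))

print([t["txn_id"] for t in readable])  # ['T001', 'T005']
print(crowded == readable)              # True
```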

### filter() vs List Comprehension

Both approaches are valid. Here's when to use each:

| `filter()` | List Comprehension |
|------------|--------------------|
| Better for complex, reusable filter functions | Better for simple, inline conditions |
| Returns a lazy iterator (memory efficient for large inputs) | Builds the whole list in memory immediately |
| Functional programming style | Usually considered the more idiomatic ("Pythonic") style |

In [None]:
# Same filtering using list comprehension
valid_txns_lc = [txn for txn in transactions if txn["amount"] is not None and txn["amount"] > 0]

# Both produce the same result
print(f"filter() result: {len(valid_transactions)} records")
print(f"List comprehension result: {len(valid_txns_lc)} records")
print(f"Results match: {valid_transactions == valid_txns_lc}")
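If you like comprehension syntax but want the lazy behavior of `filter()`, a generator expression is one option. A minimal sketch with hypothetical sample amounts, where neither version builds an intermediate list before summing:

```python
# Hypothetical sample amounts, including a missing value
amounts = [1500.00, None, 2500.50, 0, 5000.00, -100.00]

# Both objects are lazy: nothing is evaluated until sum() pulls values
lazy_filter = filter(lambda a: a is not None and a > 0, amounts)
lazy_genexp = (a for a in amounts if a is not None and a > 0)

total_filter = sum(lazy_filter)
total_genexp = sum(lazy_genexp)

print(total_filter)  # 9000.5
print(total_genexp)  # 9000.5
```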

### Data Engineering Best Practice: Chaining Filters

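In real pipelines, a record usually has to pass several checks at once. One option is to combine the checks in a single, documented function and pass that to `filter()`.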

In [None]:
# Combine checks: valid amount AND completed status AND has customer_id
def is_clean_record(record):
    """Data quality check for clean records."""
    return (
        record["amount"] is not None
        and record["amount"] > 0
        and record["status"] == "completed"
        and record["customer_id"] is not None
    )

clean_records = list(filter(is_clean_record, transactions))

print("\nData Quality Report:")
print(f"  Raw records: {len(transactions)}")
print(f"  Clean records: {len(clean_records)}")
print(f"  Filtered out: {len(transactions) - len(clean_records)}")
print("\nClean records ready for downstream processing:")
for txn in clean_records:
    print(f"  {txn}")
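An alternative to one combined function is to chain `filter()` calls literally, feeding each result into the next. Because every stage is a lazy iterator, records flow through the chain one at a time. The sketch below assumes three small single-purpose checks; `sample_txns` and the stage names are hypothetical.

```python
# Hypothetical sample records
sample_txns = [
    {"txn_id": "T001", "amount": 1500.00, "status": "completed", "customer_id": "C101"},
    {"txn_id": "T002", "amount": None, "status": "failed", "customer_id": "C102"},
    {"txn_id": "T004", "amount": 0, "status": "pending", "customer_id": None},
    {"txn_id": "T005", "amount": 5000.00, "status": "completed", "customer_id": "C105"},
    {"txn_id": "T006", "amount": -100.00, "status": "refund", "customer_id": "C101"},
]


def has_valid_amount(record):
    return record["amount"] is not None and record["amount"] > 0


def is_completed(record):
    return record["status"] == "completed"


def has_customer(record):
    return record["customer_id"] is not None


# Each stage wraps the previous one; nothing runs until the final loop
with_amount = filter(has_valid_amount, sample_txns)
completed = filter(is_completed, with_amount)
clean = filter(has_customer, completed)

clean_ids = [record["txn_id"] for record in clean]
print(clean_ids)  # ['T001', 'T005']
```

Small single-purpose checks are easy to reuse and test on their own; the combined function is simpler when the checks always travel together.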