[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wasim/Data-Science/blob/main/data-analyst-roadmap/03_pandas/06_advanced_filtering.ipynb)

# Advanced Filtering and Querying

Select exactly the data you need.

## Key Concepts
- **Boolean Indexing:** Standard selection `df[condition]`
- **query():** SQL-like string queries
- **String Methods:** `str.contains()`, `str.extract()`
- **isin():** Multiple value matching

In [None]:
import pandas as pd
import numpy as np

# Sample dataset (Product Sales)
df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry'],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Berry'],
    'Price': [1.2, 0.5, 2.5, 3.0, 4.5],
    'Stock': [100, 50, 200, 150, 80],
    'Rating': [4.5, 3.8, 4.9, np.nan, 4.2]
})

print("Original DataFrame:")
print(df)

## 1. Boolean Indexing (The Standard Way)
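Combine conditions with `&` (and), `|` (or), `~` (not). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>`.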

In [None]:
# Expensive products (Price > 2.0)
expensive = df[df['Price'] > 2.0]
print("Price > 2.0:")
print(expensive)

In [None]:
# Complex condition: High Stock AND High Rating
high_quality = df[
    (df['Stock'] > 80) & 
    (df['Rating'] > 4.0)
]
print("High Stock & High Rating:")
print(high_quality)
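
The cells above only exercise `&`. As a minimal sketch of the other two operators, the snippet below rebuilds a small copy of the product table (the name `demo` is just for illustration) and filters it with `|` and `~`.

```python
import pandas as pd

# Stand-in copy of the product table, so this cell runs on its own
demo = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry'],
    'Price': [1.2, 0.5, 2.5, 3.0, 4.5],
    'Stock': [100, 50, 200, 150, 80],
})

# OR: cheap (< 1.0) or well stocked (>= 150); each condition is parenthesized
cheap_or_stocked = demo[(demo['Price'] < 1.0) | (demo['Stock'] >= 150)]

# NOT: invert a condition with ~
not_cheap = demo[~(demo['Price'] < 1.0)]

print(cheap_or_stocked)
print(not_cheap)
```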

## 2. Using `query()` Method
A cleaner syntax for complex filtering.

In [None]:
# Same filter as above
result = df.query("Stock > 80 and Rating > 4.0")
print("Using query():")
print(result)

In [None]:
# Using external variables with @
min_price = 1.0
result = df.query("Price > @min_price")
print(f"Price > {min_price}:")
print(result)
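
Two more `query()` features are often useful: membership tests with `in`, and backticks around column names that are not valid Python identifiers. The sketch below assumes a reasonably recent pandas (backtick quoting arrived in 0.25); the `Unit Price` column and the `wanted` list are made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Unit Price': [1.2, 0.5, 2.5],  # hypothetical column name containing a space
    'Stock': [100, 50, 200],
})

# Membership test against a local list, referenced with @
wanted = ['Apple', 'Cherry', 'Mango']
by_list = demo.query("Product in @wanted")

# Backticks let query() refer to a column whose name contains a space
by_backtick = demo.query("`Unit Price` > 1.0 and Stock > 100")

print(by_list)
print(by_backtick)
```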

## 3. String Methods (.str accessor)
Filter based on text patterns.

In [None]:
# Ends with 'berry'
berries = df[df['Product'].str.endswith('berry')]
print("Ends with 'berry':")
print(berries)

In [None]:
# Contains 'a' (case insensitive)
contains_a = df[
    df['Product'].str.contains('a', case=False)
]
print("Contains 'a':")
print(contains_a)
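
`str.extract()` is listed under Key Concepts but not shown above. The following sketch pairs it with a regex-based `str.contains()`; the `SKU` column, its codes, and the missing product name are invented for illustration and are not part of the sample dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    'Product': ['Apple', 'Banana', None, 'Elderberry'],
    'SKU': ['FR-001', 'FR-002', 'XX-000', 'BE-010'],  # hypothetical codes
})

# Regex match; na=False counts missing names as non-matches,
# so the mask contains only True/False and is safe to index with
starts_with_vowel = demo[
    demo['Product'].str.contains(r'^[AEIOU]', regex=True, na=False)
]

# str.extract returns one column per capture group (named groups become column names)
parts = demo['SKU'].str.extract(r'^(?P<prefix>[A-Z]+)-(?P<number>\d+)$')
berries_by_sku = demo[parts['prefix'] == 'BE']

print(starts_with_vowel)
print(parts)
print(berries_by_sku)
```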

## 4. `isin()` for Multiple Values
Filter by a list of allowed values.

In [None]:
target_fruits = ['Apple', 'Date', 'Mango']
selected = df[df['Product'].isin(target_fruits)]
print("In target list:")
print(selected)
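
Note that 'Mango' does not appear in the data, and `isin()` does not complain about it. A small sketch of two common follow-ups, inverting the mask and checking which targets had no match (the variable names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry'],
    'Price': [1.2, 0.5, 2.5, 3.0, 4.5],
})
target_fruits = ['Apple', 'Date', 'Mango']

mask = demo['Product'].isin(target_fruits)

# Invert the mask to keep everything NOT in the list
not_selected = demo[~mask]

# Targets with no matching row are silently ignored by isin(); list them explicitly
missing_targets = sorted(set(target_fruits) - set(demo['Product']))

print(not_selected)
print("No match for:", missing_targets)
```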

## 5. Handling Missing Data in Filters
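Be careful with NaN values: any comparison against NaN evaluates to False, so rows with missing values silently drop out of most filters.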

In [None]:
# Select rows with valid Rating (not NaN)
valid_rating = df[df['Rating'].notna()]
print("Valid Ratings:")
print(valid_rating)

# Query syntax for NaN
# result = df.query("Rating == Rating") # Only valid (NaN != NaN)
# result = df.query("Rating != Rating") # Only NaN
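
One consequence is easy to miss: a row with a missing value fails both a condition and its opposite comparison, yet it reappears when the condition is negated with `~`. The sketch below reuses the ratings from the sample data as a standalone Series.

```python
import numpy as np
import pandas as pd

ratings = pd.Series(
    [4.5, 3.8, 4.9, np.nan, 4.2],
    index=['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry'],
    name='Rating',
)

high = ratings > 4.0    # NaN compares as False
low = ratings <= 4.0    # NaN compares as False here too

# 'Date' lands in neither group...
in_neither = ratings[~(high | low)]

# ...but negating `high` pulls it back in next to the genuinely low rating
not_high = ratings[~high]

# Being explicit about missing values avoids the surprise
low_and_known = ratings[~high & ratings.notna()]

print(in_neither)
print(not_high)
print(low_and_known)
```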

## 6. Where vs Mask
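Replace values based on a condition while keeping the original shape: `where()` keeps values where the condition is True and replaces the rest, while `mask()` replaces values where the condition is True.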

In [None]:
# Keep values where Price > 1, else replace with NaN
result_where = df['Price'].where(df['Price'] > 1)
print("Where Price > 1:")
print(result_where)

# Replace values where Price > 2 with 999
result_mask = df['Price'].mask(df['Price'] > 2, 999)
print("\nMask Price > 2 -> 999:")
print(result_mask)
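
Because the two methods are mirror images, `s.where(cond, other)` and `s.mask(~cond, other)` should give the same result when there are no missing values. A quick sketch using the prices from the sample data, capping them at 2.0 (the cap is an arbitrary value chosen for the example):

```python
import pandas as pd

price = pd.Series([1.2, 0.5, 2.5, 3.0, 4.5], name='Price')

# Keep prices <= 2, replace everything else with 2.0
capped_where = price.where(price <= 2, 2.0)

# Same result, phrased the other way round: replace prices > 2 with 2.0
capped_mask = price.mask(price > 2, 2.0)

# Unlike boolean indexing, both keep all five rows
filtered = price[price <= 2]

print(capped_where)
print(capped_where.equals(capped_mask))
print(len(capped_where), len(filtered))
```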

## Practice Exercises

### Exercise 1
Filter products that have 'e' in the name AND a price < 3.0 using `query()`.

In [None]:
# Your code here


### Exercise 2
Find all rows where Category is NOT 'Fruit'.

In [None]:
# Your code here


## Key Takeaways

✅ **Boolean** - `df[(cond1) & (cond2)]`  
✅ **Query** - `df.query("col > val")` - cleaner!  
✅ **String** - `df['col'].str.contains()`  
✅ **Isin** - `df['col'].isin([list])`  
✅ **Loc/Iloc** - Select by label/position (from Intro)  

**Next:** [Project: Merging and Filtering](README.md) →