# Pandas - Part 2: Selection and Filtering

This notebook covers selecting data using loc, iloc, and boolean filtering.

**Topics covered:**
- Using loc (label-based)
- Using iloc (integer-based)
- Boolean filtering
- Multiple conditions
- Query method

**Problems:** 15 (Easy: 1-5, Medium: 6-10, Hard: 11-15)

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '..')
from utils.checker import check

# Create sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 22],
    'salary': [50000, 60000, 70000, 55000, 45000],
    'dept': ['IT', 'HR', 'IT', 'Sales', 'HR']
})
print(df)
print("\nSetup complete!")

---
## Problem 1: Select Single Column
**Difficulty:** Easy

### Concept
Selecting a single column from a DataFrame returns a Series. This is the most basic data selection operation in pandas.

### Syntax
```python
df['column_name']  # Returns Series
df.column_name     # Alternative: only if the name is a valid identifier and not an existing DataFrame attribute
```

### Example
```python
>>> names = df['name']
>>> type(names)
<class 'pandas.core.series.Series'>
```
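The two syntaxes are not always interchangeable. The sketch below uses a hypothetical frame named `demo` (not the notebook's `df`) with column names chosen to show where attribute access falls short.

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"count": [3, 7], "unit price": [1.5, 2.0]})

bracket = demo["count"]       # bracket access always returns the column
attr = demo.count             # resolves to the DataFrame.count method instead
spaced = demo["unit price"]   # a name containing a space needs brackets

print(type(bracket).__name__)  # Series
print(callable(attr))          # True
```

Bracket access is the safer default because it never collides with method names.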

### Task
Select the 'name' column from `df`. Store in `names`.

### Expected Properties
- Should be a pandas Series
- Should have length 5
- First element should be 'Alice'

In [None]:
# Your solution:
names = None

In [None]:
# Verification
check.is_type(names, pd.Series, "P1: Type check")
check.has_length(names, 5, "P1: Length")
check.first_element_is(names, 'Alice', "P1: First element")

---
## Problem 2: Select Multiple Columns
**Difficulty:** Easy

### Concept
To select multiple columns, pass a list of column names. This returns a DataFrame (not a Series) containing only the selected columns.

### Syntax
```python
df[['col1', 'col2']]  # Note the double brackets!
```

### Example
```python
>>> subset = df[['name', 'age']]
>>> list(subset.columns)
['name', 'age']
```
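The double brackets matter even for one column. As a small illustration on a hypothetical frame `demo`, a bare string gives a 1-D Series while a one-element list gives a 2-D DataFrame:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"city": ["Oslo", "Lima"], "pop_m": [0.7, 10.0]})

as_series = demo["city"]    # string key: Series
as_frame = demo[["city"]]   # list key: DataFrame with one column

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)
```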

### Task
Select 'name' and 'age' columns from `df`. Store in `name_age`.

### Expected Properties
- Should be a pandas DataFrame
- Should have exactly 2 columns
- Columns should be ['name', 'age']

In [None]:
# Your solution:
name_age = None

In [None]:
# Verification
check.is_type(name_age, pd.DataFrame, "P2: Type check")
check.has_columns(name_age, ['name', 'age'], "P2: Columns")
check.is_true(len(name_age.columns) == 2, "P2: Two columns", "Should have exactly 2 columns")

---
## Problem 3: Select Row by Index with iloc
**Difficulty:** Easy

### Concept
`iloc` is used for integer-based indexing (position-based). It selects rows/columns by their integer position, starting from 0.

### Syntax
```python
df.iloc[row_index]         # Single row as Series
df.iloc[row_index, col_index]  # Single value
df.iloc[start:end]         # Row slice
```

### Example
```python
>>> first_row = df.iloc[0]
>>> first_row['name']
'Alice'
```
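The return type of `iloc` depends on whether the row selector is a scalar or a list. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"item": ["pen", "ink"], "qty": [4, 9]})

row_series = demo.iloc[0]     # scalar position: Series indexed by column name
row_frame = demo.iloc[[0]]    # list of positions: one-row DataFrame
last_row = demo.iloc[-1]      # negative positions count from the end

print(row_series.index.tolist())  # ['item', 'qty']
print(row_frame.shape)            # (1, 2)
print(last_row["item"])           # ink
```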

### Task
Select the first row using iloc. Store in `first_row`.

### Expected Properties
- Should be a pandas Series
- 'name' field should equal 'Alice'

In [None]:
# Your solution:
first_row = None

In [None]:
# Verification
check.is_type(first_row, pd.Series, "P3: Type check")
check.is_true(first_row['name'] == 'Alice', "P3: First row name", "Name should be 'Alice'")

---
## Problem 4: Select Rows by Range with iloc
**Difficulty:** Easy

### Concept
Use slicing with iloc to select a range of rows. Remember that Python slicing is exclusive of the end index.

### Syntax
```python
df.iloc[start:end]  # Rows from start to end-1
```

### Example
```python
>>> df.iloc[1:3]  # Rows at positions 1 and 2
```
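`iloc` slices follow the same rules as Python list slices, so open-ended, negative, and stepped slices also work. A short sketch on a hypothetical single-column frame `demo`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

head_two = demo.iloc[:2]       # positions 0 and 1
tail_two = demo.iloc[-2:]      # last two positions
every_other = demo.iloc[::2]   # positions 0, 2, 4

print(head_two["x"].tolist())     # [10, 20]
print(tail_two["x"].tolist())     # [40, 50]
print(every_other["x"].tolist())  # [10, 30, 50]
```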

### Task
Select rows at positions 1 and 2 (Bob and Charlie) using iloc. Store in `rows_1_to_3`.

### Expected Properties
- Should be a pandas DataFrame
- Should have exactly 2 rows
- First row should have name 'Bob'

In [None]:
# Your solution:
rows_1_to_3 = None

In [None]:
# Verification
check.is_type(rows_1_to_3, pd.DataFrame, "P4: Type check")
check.has_length(rows_1_to_3, 2, "P4: Length")
check.is_true(rows_1_to_3.iloc[0]['name'] == 'Bob', "P4: First row", "First row should be 'Bob'")

---
## Problem 5: Select Single Value with iloc
**Difficulty:** Easy

### Concept
You can access a specific cell value by providing both row and column indices to iloc.

### Syntax
```python
df.iloc[row_index, col_index]  # Single value
```

### Example
```python
>>> value = df.iloc[0, 1]  # First row, second column
```
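A value read from an integer column is typically a NumPy integer rather than a built-in `int`, which is why a type check may need to accept both. The sketch below uses a hypothetical frame `demo` and converts the cell explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"item": ["pen", "ink"], "qty": [4, 9]})

cell = demo.iloc[1, 1]                       # second row, second column
is_numpy_int = isinstance(cell, np.integer)  # NumPy integer scalar
as_python_int = int(cell)                    # explicit conversion to built-in int

print(is_numpy_int)   # True
print(as_python_int)  # 9
```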

### Task
Get the value at row position 2, column position 1 (Charlie's age; positions start at 0) using iloc. Store in `value`.

### Expected Properties
- Should be an integer
- Should equal 35

In [None]:
# Your solution:
value = None

In [None]:
# Verification
check.is_type(value, (int, np.integer), "P5: Type check")
check.is_true(value == 35, "P5: Correct value", "Value should be 35")

---
## Problem 6: Boolean Filter - Single Condition
**Difficulty:** Medium

### Concept
Boolean indexing allows you to filter rows based on conditions. Create a boolean mask with a condition, then use it to filter the DataFrame.

### Syntax
```python
mask = df['column'] > value
filtered = df[mask]
# Or in one line:
filtered = df[df['column'] > value]
```

### Example
```python
>>> adults = df[df['age'] >= 18]
```
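It can help to look at the mask on its own before filtering with it. In this sketch on a hypothetical frame `demo`, the mask is a boolean Series, its sum counts the matching rows, and the filtered result keeps the original index labels:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"score": [40, 75, 90]})

mask = demo["score"] >= 60   # boolean Series, one entry per row
passed = demo[mask]          # keeps rows where the mask is True

print(mask.tolist())          # [False, True, True]
print(int(mask.sum()))        # 2
print(passed.index.tolist())  # [1, 2]
```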

### Task
Filter `df` to get rows where age > 25. Store in `older_than_25`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 3 rows
- All rows should have age > 25

In [None]:
# Your solution:
older_than_25 = None

In [None]:
# Verification
check.is_type(older_than_25, pd.DataFrame, "P6: Type check")
check.has_length(older_than_25, 3, "P6: Length")
check.is_true(all(older_than_25['age'] > 25), "P6: All ages > 25", "All ages should be greater than 25")

---
## Problem 7: Boolean Filter - String Equality
**Difficulty:** Medium

### Concept
Boolean filtering works with any comparison operator, including equality checks on strings.

### Syntax
```python
df[df['string_col'] == 'value']
```

### Example
```python
>>> engineers = df[df['dept'] == 'Engineering']
```
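String equality is exact, including letter case. The sketch below, on a hypothetical frame `demo` with inconsistently cased values, compares an exact match with one normalised through the `.str` accessor:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"dept": ["IT", "it", "HR"]})

exact = demo[demo["dept"] == "IT"]                  # matches only "IT"
any_case = demo[demo["dept"].str.upper() == "IT"]   # matches "IT" and "it"

print(len(exact))     # 1
print(len(any_case))  # 2
```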

### Task
Filter `df` to get rows where dept is 'IT'. Store in `it_dept`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 2 rows
- All rows should have dept == 'IT'

In [None]:
# Your solution:
it_dept = None

In [None]:
# Verification
check.is_type(it_dept, pd.DataFrame, "P7: Type check")
check.has_length(it_dept, 2, "P7: Length")
check.is_true(all(it_dept['dept'] == 'IT'), "P7: All IT dept", "All rows should have dept='IT'")

---
## Problem 8: Using loc with Label
**Difficulty:** Medium

### Concept
`loc` is used for label-based indexing. When the index has meaningful labels (not just 0, 1, 2), use loc to access rows by their label.

### Syntax
```python
df.loc[label]              # Single row by label
df.loc[label, 'column']    # Single value
df.loc[labels, columns]    # Multiple rows/columns
```

### Example
```python
>>> df_indexed = df.set_index('name')
>>> df_indexed.loc['Alice']  # Access by name
```
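The difference between `loc` and `iloc` is easiest to see on a labelled index. In this sketch on a hypothetical frame `demo`, note that a `loc` slice includes its end label, while an `iloc` slice excludes its end position:

```python
import pandas as pd

# Hypothetical frame with string labels, separate from the notebook's `df`.
demo = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = demo.loc["b"]          # row with label "b"
by_position = demo.iloc[1]        # row at position 1 (the same row here)
label_slice = demo.loc["b":"c"]   # end label included
position_slice = demo.iloc[1:2]   # end position excluded

print(by_label["x"], by_position["x"])  # 20 20
print(label_slice.index.tolist())       # ['b', 'c']
print(position_slice.index.tolist())    # ['b']
```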

### Task
The solution cell below creates `df_indexed` with names as the index. Use loc to get Bob's row. Store in `bob_row`.

### Expected Properties
- Should be a pandas Series
- 'age' field should equal 30

In [None]:
# Your solution:
df_indexed = df.set_index('name')
bob_row = None

In [None]:
# Verification
check.is_type(bob_row, pd.Series, "P8: Type check")
check.is_true(bob_row['age'] == 30, "P8: Bob's age", "Age should be 30")

---
## Problem 9: Select Specific Rows and Columns with loc
**Difficulty:** Medium

### Concept
`loc` can select specific rows and columns simultaneously using labels. This is powerful for extracting exact subsets of data.

### Syntax
```python
df.loc[row_labels, column_labels]
df.loc[['row1', 'row2'], ['col1', 'col2']]
```

### Example
```python
>>> subset = df_indexed.loc[['Alice', 'Bob'], ['age', 'salary']]
```
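When lists of labels are passed, the result follows the order of the lists rather than the order in the original frame. A small sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df_indexed`.
demo = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

picked = demo.loc[["c", "a"], ["y", "x"]]   # rows and columns in the requested order

print(picked.index.tolist())    # ['c', 'a']
print(picked.columns.tolist())  # ['y', 'x']
print(picked.loc["c", "y"])     # 6
```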

### Task
Using `df_indexed`, select rows 'Alice' and 'Charlie' and columns 'age' and 'salary'. Store in `subset`.

### Expected Properties
- Should be a pandas DataFrame
- Should have shape (2, 2)
- Columns should be ['age', 'salary']

In [None]:
# Your solution:
subset = None

In [None]:
# Verification
check.is_type(subset, pd.DataFrame, "P9: Type check")
check.has_shape(subset, (2, 2), "P9: Shape")
check.has_columns(subset, ['age', 'salary'], "P9: Columns")

---
## Problem 10: Using isin()
**Difficulty:** Medium

### Concept
The `isin()` method returns a boolean mask that is True wherever a value appears in the given list. It's cleaner than chaining multiple OR conditions.

### Syntax
```python
df[df['column'].isin([value1, value2, value3])]
```

### Example
```python
>>> df[df['dept'].isin(['IT', 'Engineering'])]
```
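To see what `isin()` replaces, this sketch builds the same filter twice on a hypothetical frame `demo`, once with `isin()` and once with explicit equality checks joined by `|`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

with_isin = demo[demo["color"].isin(["red", "green"])]
with_or = demo[(demo["color"] == "red") | (demo["color"] == "green")]

print(with_isin.equals(with_or))  # True
print(len(with_isin))             # 3
```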

### Task
Filter `df` to get rows where dept is either 'IT' or 'HR'. Store in `it_hr`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 4 rows
- All dept values should be either 'IT' or 'HR'

In [None]:
# Your solution:
it_hr = None

In [None]:
# Verification
check.is_type(it_hr, pd.DataFrame, "P10: Type check")
check.has_length(it_hr, 4, "P10: Length")
check.is_true(all(it_hr['dept'].isin(['IT', 'HR'])), "P10: IT or HR", "All depts should be 'IT' or 'HR'")

---
## Problem 11: Multiple Conditions with AND
**Difficulty:** Hard

### Concept
Combine multiple boolean conditions using `&` (AND) or `|` (OR). Each condition must be wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.

### Syntax
```python
df[(condition1) & (condition2)]  # Both must be True
df[(condition1) | (condition2)]  # Either can be True
```

### Example
```python
>>> df[(df['age'] > 25) & (df['salary'] > 50000)]
```
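Leaving out the parentheses changes how Python parses the expression. In this sketch on a hypothetical frame `demo`, the unparenthesised form is read as a chained comparison around `2 & demo["b"]`, which makes Python ask for the truth value of a whole Series and raises a `ValueError`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"a": [1, 5, 9], "b": [2, 4, 6]})

# Parenthesised: each comparison is evaluated first, then combined elementwise.
ok = demo[(demo["a"] > 2) & (demo["b"] < 6)]

# Unparenthesised: parsed as demo["a"] > (2 & demo["b"]) < 6.
try:
    demo[demo["a"] > 2 & demo["b"] < 6]
    raised = False
except ValueError:
    raised = True

print(ok["a"].tolist())  # [5]
print(raised)            # True
```

The keywords `and` and `or` fail for the same reason, so the elementwise operators `&` and `|` are needed here.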

### Task
Filter `df` to get rows where age > 25 AND salary > 50000. Store in `filtered`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 3 rows
- All rows should satisfy both conditions

In [None]:
# Your solution:
filtered = None

In [None]:
# Verification
check.is_type(filtered, pd.DataFrame, "P11: Type check")
check.has_length(filtered, 3, "P11: Length")
check.is_true(all((filtered['age'] > 25) & (filtered['salary'] > 50000)), "P11: Both conditions", "All rows should satisfy both conditions")

---
## Problem 12: Multiple Conditions with OR
**Difficulty:** Hard

### Concept
The OR operator `|` returns rows that satisfy at least one of the conditions.

### Syntax
```python
df[(condition1) | (condition2)]  # Either condition
```

### Example
```python
>>> df[(df['age'] < 25) | (df['age'] > 60)]
```

### Task
Filter `df` to get rows where age < 25 OR salary > 65000. Store in `or_filtered`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 2 rows
- Each row should satisfy at least one condition

In [None]:
# Your solution:
or_filtered = None

In [None]:
# Verification
check.is_type(or_filtered, pd.DataFrame, "P12: Type check")
check.has_length(or_filtered, 2, "P12: Length")
check.is_true(all((or_filtered['age'] < 25) | (or_filtered['salary'] > 65000)), "P12: At least one condition", "All rows should satisfy at least one condition")

---
## Problem 13: Using query() Method
**Difficulty:** Hard

### Concept
The `query()` method provides a more readable way to filter DataFrames using SQL-like string expressions.

### Syntax
```python
df.query('column > value')
df.query('age > 25 and salary > 50000')
df.query('dept == "IT"')  # Use quotes for strings
```

### Example
```python
>>> df.query('age > 30 and dept == "IT"')
```
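Query strings can also refer to Python variables by prefixing the name with `@`. The sketch below uses a hypothetical frame `demo` and a hypothetical threshold `limit`, and checks the result against the equivalent boolean mask:

```python
import pandas as pd

# Hypothetical frame and threshold, separate from the notebook's `df`.
demo = pd.DataFrame({"item": ["pen", "ink", "pad"], "price": [2.0, 8.5, 4.0]})
limit = 3.0

via_query = demo.query("price > @limit")   # @limit reads the local variable
via_mask = demo[demo["price"] > limit]     # same filter as a boolean mask

print(via_query["item"].tolist())  # ['ink', 'pad']
print(via_query.equals(via_mask))  # True
```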

### Task
Use the query method to filter `df` where salary > 50000 and dept == 'IT'. Store in `query_result`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 1 row
- Row should be Charlie

In [None]:
# Your solution:
query_result = None

In [None]:
# Verification
check.is_type(query_result, pd.DataFrame, "P13: Type check")
check.has_length(query_result, 1, "P13: Length")
check.is_true(query_result.iloc[0]['name'] == 'Charlie', "P13: Charlie", "Result should be Charlie")

---
## Problem 14: Filter and Select Columns
**Difficulty:** Hard

### Concept
You can chain filtering with column selection to get specific columns from filtered rows. This is a common pattern in data analysis.

### Syntax
```python
df[df['col'] > value][['col1', 'col2']]
# Or using loc:
df.loc[df['col'] > value, ['col1', 'col2']]
```

### Example
```python
>>> df[df['age'] > 30][['name', 'salary']]
```
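For reading data, the chained form and the `loc` form give the same result. This sketch confirms that on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({
    "item": ["pen", "ink", "pad"],
    "qty": [4, 9, 1],
    "price": [2.0, 8.5, 4.0],
})

chained = demo[demo["qty"] > 3][["item", "price"]]       # filter, then select
single = demo.loc[demo["qty"] > 3, ["item", "price"]]    # one indexing step

print(chained.equals(single))     # True
print(single.columns.tolist())    # ['item', 'price']
```

The `loc` form does the work in a single indexing step, which is generally the recommended choice if you later need to assign to the selection.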

### Task
Filter `df` where age > 25, then select only 'name' and 'salary' columns. Store in `names_salaries`.

### Expected Properties
- Should be a pandas DataFrame
- Should have columns ['name', 'salary']
- Should have 3 rows

In [None]:
# Your solution:
names_salaries = None

In [None]:
# Verification
check.is_type(names_salaries, pd.DataFrame, "P14: Type check")
check.has_columns(names_salaries, ['name', 'salary'], "P14: Columns")
check.has_length(names_salaries, 3, "P14: Length")

---
## Problem 15: Negate Boolean Condition
**Difficulty:** Hard

### Concept
Use the `~` operator to negate (invert) a boolean condition. Use `!=` for inequality or `~` with `isin()` for excluding values.

### Syntax
```python
df[~(condition)]           # NOT condition
df[df['col'] != value]     # Not equal
df[~df['col'].isin(list)]  # Not in list
```

### Example
```python
>>> df[df['dept'] != 'IT']
>>> df[~df['dept'].isin(['IT', 'HR'])]
```
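A mask and its negation split the rows into two disjoint groups. In this sketch on a hypothetical frame `demo`, `~mask` flips every entry and matches the `!=` comparison; Python's `not` keyword would not work here because it is not elementwise.

```python
import pandas as pd

# Hypothetical frame, separate from the notebook's `df`.
demo = pd.DataFrame({"kind": ["a", "b", "a", "c"]})

mask = demo["kind"] == "a"
kept = demo[mask]        # rows where kind is "a"
dropped = demo[~mask]    # every other row

print(len(kept), len(dropped))                 # 2 2
print((~mask).equals(demo["kind"] != "a"))     # True
```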

### Task
Filter `df` to get rows where dept is NOT 'IT'. Store in `not_it`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 3 rows
- No row should have dept == 'IT'

In [None]:
# Your solution:
not_it = None

In [None]:
# Verification
check.is_type(not_it, pd.DataFrame, "P15: Type check")
check.has_length(not_it, 3, "P15: Length")
check.is_true(all(not_it['dept'] != 'IT'), "P15: No IT dept", "No rows should have dept='IT'")

---
## Summary

Run this cell to see your overall progress on this notebook.

In [None]:
check.summary()