## 🧭 Overview: `filter()` Function in Python

The **`filter()`** function is a built-in Python tool that selects only those elements of an iterable that meet a specific condition.  
It takes two inputs:  
1. A **function** that returns `True` or `False` (more precisely, a truthy or falsy value) for each element.  
2. An **iterable** (such as a list, tuple, or NumPy array) whose elements are passed to the function one at a time.  

It returns an **iterator** that yields only the elements for which the function returned `True`.  
When analyzing data, `filter()` is especially helpful for extracting subsets that meet given criteria, for example students with average scores of at least 70%.  
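
As a minimal sketch of the behavior described above, the snippet below filters a small, made-up list of scores. The `scores` list and the `is_passing` helper are illustrative assumptions and are not part of the grades dataset. It also shows that the returned iterator can be consumed only once.

```python
# Hypothetical scores, used only for illustration
scores = [86, 65, 70, 98, 55]

def is_passing(score):
    """Return True when the score is at least 70."""
    return score >= 70

passing = filter(is_passing, scores)   # lazy iterator; nothing is computed yet
passing_list = list(passing)           # consuming the iterator runs the checks
print(passing_list)                    # [86, 70, 98]

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(passing)
print(leftover)                        # []
```

Because the iterator is single-use, it is usually simplest to wrap `filter()` in `list()` right away when the result is needed more than once.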

---

### 🧩 How it works in this example

We will:
- Define a function `mean_atleast_70(student_id)` to check whether a student's mean grade is ≥ 70.  
- Use `filter()` to apply this function across all unique `student_id` values in our dataset.  
- Convert the result into a list using `list()` to display the filtered IDs.  
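
The three steps above can be sketched without pandas, assuming a small hand-written dictionary in place of the real dataset. The names `grades_by_student` and `has_mean_atleast_70` and all the grade values are hypothetical and exist only for this illustration.

```python
from statistics import mean

# Made-up grades per student, standing in for the real dataset
grades_by_student = {
    1: [86, 79, 78, 0],
    2: [65, 60, 80, 80],
    3: [70, 78, 87, 81],
}

def has_mean_atleast_70(student_id):
    """Return True if the student's mean grade is at least 70."""
    return mean(grades_by_student[student_id]) >= 70

# Iterating over a dict yields its keys, i.e. the student ids
passing_ids = list(filter(has_mean_atleast_70, grades_by_student))
print(passing_ids)   # [2, 3]
```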


In [14]:
# import relevant modules
import pandas as pd
import numpy as np

In [15]:
# read grades dataset, save as a pandas dataframe
grades = pd.read_csv('grades.csv')

In [16]:
# display first few rows of grades
grades.head()

   exam  student_id  grade
0     1           1   86.0
1     1           2   65.0
2     1           3   70.0
3     1           4   98.0
4     1           5   89.0


In [17]:
# Inspect the column names (important to identify what 'student ID' is)
print("Column names in dataset:")
print(grades.columns)

Column names in dataset:
Index(['exam', 'student_id', 'grade'], dtype='object')


In [18]:
# Actual column data types
grades.dtypes

exam            int64
student_id      int64
grade         float64
dtype: object

In [19]:
# Make a clean copy so we do not mutate the original
grades_cleaned = grades.copy()


In [20]:
# If you have NaNs, decide how to handle them (we'll treat missing as 0 for parity with earlier lesson)
grades_cleaned['grade'] = grades_cleaned['grade'].fillna(0)
print(grades_cleaned)

    exam  student_id  grade
0      1           1   86.0
1      1           2   65.0
2      1           3   70.0
3      1           4   98.0
4      1           5   89.0
5      1           6    0.0
6      1           7   75.0
7      1           8   56.0
8      1           9   90.0
9      1          10   81.0
10     2           1   79.0
11     2           2   60.0
12     2           3   78.0
13     2           4   75.0
14     2           5    0.0
15     2           6   80.0
16     2           7   87.0
17     2           8   82.0
18     2           9   95.0
19     2          10   96.0
20     3           1   78.0
21     3           2   80.0
22     3           3   87.0
23     3           4    0.0
24     3           5   89.0
25     3           6   90.0
26     3           7  100.0
27     3           8   72.0
28     3           9   73.0
29     3          10   75.0
30     4           1    0.0
31     4           2   80.0
32     4           3   81.0
33     4           4   82.0
34     4           5

In [17]:
def mean_atleast_70(student_id):
    """Compute the mean grade across all exams for the student with the given student_id.

    Missing exam grades count as zeros (they were already filled in grades_cleaned).
    Return True if the mean grade is at least 70, otherwise False.
    """
    # Select this student's rows and the 'grade' column in a single .loc call
    mean_grade = grades_cleaned.loc[grades_cleaned['student_id'] == student_id, 'grade'].mean()
    return mean_grade >= 70

### 🧩 Understanding `grades_cleaned.loc[grades_cleaned['student_id'] == student_id, 'grade']`

This expression is fundamental when working with **pandas** DataFrames.  
It combines **filtering** (a Boolean mask over rows) and **selection** (picking one column), and its result is often stored through **assignment**.

---

### 💡 What each part means

| Piece | Explanation |
|-------|--------------|
| `grades_cleaned` | The name of your pandas **DataFrame**. Think of it as a spreadsheet stored in memory. |
| `grades_cleaned['student_id']` | Accesses the **column** named `'student_id'`. This returns a list-like object (called a *Series*) containing the student ID for each row. |
| `grades_cleaned['student_id'] == student_id` | Creates a **Boolean mask**: a Series of `True` or `False` values, one per row. For example, if `student_id` is `3`, the mask starts `[False, False, True, False, ...]`, marking which rows belong to student 3. |
| `grades_cleaned.loc[...]` | The `.loc[]` indexer **selects rows and columns by label or Boolean mask**. Inside the brackets, you specify which rows and which column(s) you want. |
| `grades_cleaned.loc[grades_cleaned['student_id'] == student_id, 'grade']` | This says: "From the DataFrame `grades_cleaned`, select all rows where the `'student_id'` column equals the given `student_id`, and from those rows, return only the `'grade'` column." |
| `student_grades = ...` | If you store the result in a variable such as `student_grades`, the equals sign `=` is **assignment**, not mathematical equality. It means "store whatever is on the right-hand side in the variable `student_grades`." |
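
To make the pieces in the table concrete, here is a small sketch on a made-up DataFrame. The `demo` frame and its values are assumptions for illustration and only mirror the column names of the grades data.

```python
import pandas as pd

# Tiny made-up DataFrame with the same column names as the grades data
demo = pd.DataFrame({
    'exam':       [1, 1, 1, 2, 2, 2],
    'student_id': [1, 2, 3, 1, 2, 3],
    'grade':      [86.0, 65.0, 70.0, 79.0, 60.0, 78.0],
})

student_id = 3
mask = demo['student_id'] == student_id       # Boolean Series, one value per row
student_grades = demo.loc[mask, 'grade']      # rows where mask is True, 'grade' column only

print(mask.tolist())            # [False, False, True, False, False, True]
print(student_grades.tolist())  # [70.0, 78.0]
print(student_grades.mean())    # 74.0
```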

In [22]:
# test mean_atleast_70 on student_id 1
assert mean_atleast_70(1) == False, 'test failed'
print('test passed')

test passed


## 🧠 Understanding `assert`: Why No "else" Is Needed

The `assert` statement is a simple way to test whether a condition in your code is **true**.  
If the condition is true, the code keeps going.  
If it is false, Python **stops** and raises an `AssertionError`, along with the message you supplied (if any).

In other words, `assert` combines both the **if** and the **else** logic into a single line.

---

### 🧩 Example: Regular `if` vs. `assert`

#### Without `assert` (long version)
```python
if mean_atleast_70(1) == False:
    print("test passed")
else:
    raise AssertionError("test failed")
```

#### With `assert` (short version)
```python
assert mean_atleast_70(1) == False, "test failed"
print("test passed")
```

✅ If the condition is `True` → the program continues → prints "test passed".  
❌ If the condition is `False` → Python stops and shows `AssertionError: test failed`.
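
The failing branch can be observed safely by catching the error. The sketch below uses a hypothetical helper, `check_positive`, that is not part of the lesson's code.

```python
def check_positive(value):
    """Hypothetical helper: fail loudly if value is not positive."""
    assert value > 0, f"expected a positive number, got {value}"
    return value

result = check_positive(5)        # condition is True, execution continues

try:
    check_positive(-2)            # condition is False, AssertionError is raised
except AssertionError as err:
    message = str(err)

print(result)    # 5
print(message)   # expected a positive number, got -2
```

Note that assertions are skipped entirely when Python runs with the `-O` flag, so they suit quick notebook checks better than production input validation.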

---

### 💡 Why It's Useful
- Quicker than writing full `if/else` checks.  
- Helps **validate assumptions** while coding.  
- Makes your notebook more readable during debugging or testing.

You can use multiple asserts in a row to test several cases:
```python
assert mean_atleast_70(1) == False, "student 1 test failed"
assert mean_atleast_70(3) == True, "student 3 test failed"
print("All tests passed!")
```

This acts like a mini self-test for your function, with no separate testing library required.


In [23]:
# sequence containing all distinct student ids
student_ids = grades_cleaned['student_id'].unique()
student_ids

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [25]:
list(filter(mean_atleast_70, student_ids))  # the elements being filtered are NumPy integers (np.int64), not standard Python integers

[np.int64(3), np.int64(5), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]

In [28]:
# Convert the NumPy array to a Python list before filtering
student_ids = grades_cleaned['student_id'].unique().tolist()

# then filter as before; the results are now plain Python ints
list(filter(mean_atleast_70, student_ids))


[3, 5, 7, 8, 9, 10]

In [30]:
# Alternative: filter the NumPy array directly, then cast each np.int64 to a Python int
filtered_students = [int(x) for x in filter(mean_atleast_70, grades_cleaned['student_id'].unique())]
print(filtered_students)

[3, 5, 7, 8, 9, 10]


## 🧠 Key Takeaways: `filter()`

- **Purpose:** `filter(function, iterable)` selects the elements for which the function returns `True`.  
- **Iterator:** `filter()` returns an **iterator**; use `list()` to materialize it into a list you can display or reuse.  
- **Function requirement:** The function passed to `filter()` should return a Boolean value (`True` or `False`); any other return value is judged by its truthiness.  
- **Example pattern:**
  ```python
  numbers = [4, 12, 25, 7]  # sample input
  def is_valid(x):
      return x > 10
  valid_values = list(filter(is_valid, numbers))  # [12, 25]
  ```
- **In data analysis:** Ideal for condition-based subsetting, e.g., selecting students with high averages, filtering valid entries, or removing outliers.  
- **Relation to map():**  
  - `map()` transforms every element (returns new values).  
  - `filter()` keeps only elements meeting a condition (returns subset).  
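
Strictly speaking, `filter()` tests the truthiness of whatever the function returns, and passing `None` in place of a function keeps the elements that are themselves truthy. The following sketch illustrates both points with made-up data; the `names` list and the `remainder_mod_2` helper are hypothetical.

```python
# Made-up records; empty strings and None stand in for missing entries
names = ['Ada', '', 'Grace', None, 'Linus']

# Passing None as the function keeps only truthy elements
non_empty = list(filter(None, names))
print(non_empty)        # ['Ada', 'Grace', 'Linus']

# A function may return any value; filter() tests its truthiness
def remainder_mod_2(n):
    return n % 2        # 1 (truthy) for odd numbers, 0 (falsy) for even

odds = list(filter(remainder_mod_2, range(6)))
print(odds)             # [1, 3, 5]
```

Returning an explicit Boolean, as `mean_atleast_70` does with its `>=` comparison, is still the clearer choice for readers.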


## ⚖️ Comparison: `map()` vs. `filter()`

| Feature | `map()` | `filter()` |
|:--|:--|:--|
| **Purpose** | Transforms every element in a sequence | Selects only elements meeting a condition |
| **Input Function Type** | Function that performs a **transformation** | Function that returns a **Boolean** (`True` / `False`) |
| **Output** | Iterator with **transformed values** | Iterator with **subset of original values** |
| **Typical Use Case** | Apply a formula or operation to each element | Extract elements satisfying specific criteria |
| **Returns Equal Length?** | ✅ Yes: same number of elements as the input | ❌ No: only the elements where the condition is `True` |
| **Example** | `list(map(lambda x: x**2, [1, 2, 3]))` → `[1, 4, 9]` | `list(filter(lambda x: x > 0, [-1, 1, 2, 3]))` → `[1, 2, 3]` |
| **Relation to Data Analysis** | Used for **data transformation** (e.g., scaling, normalization) | Used for **data selection** (e.g., filtering rows, cleaning data) |
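
The contrast in the table can be checked side by side. This is a small sketch on a made-up `numbers` list, which also shows the equivalent list comprehensions.

```python
numbers = [-2, -1, 0, 1, 2, 3]   # made-up input

squared = list(map(lambda x: x ** 2, numbers))      # transform every element
positive = list(filter(lambda x: x > 0, numbers))   # keep a subset

print(squared)    # [4, 1, 0, 1, 4, 9]
print(positive)   # [1, 2, 3]

same_length = len(squared) == len(numbers)    # map() preserves length
not_longer = len(positive) <= len(numbers)    # filter() can only shrink the input
print(same_length, not_longer)                # True True

# Equivalent list comprehensions
squared_lc = [x ** 2 for x in numbers]
positive_lc = [x for x in numbers if x > 0]
print(squared == squared_lc, positive == positive_lc)   # True True
```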

---

### 🧩 Quick Summary
- `map()` answers **"How do I modify each element?"**  
- `filter()` answers **"Which elements should I keep?"**
