# Day 4: Data Selection and Filtering with Pandas

Welcome to Day 4! Now that you know how to create and inspect a DataFrame, it's time to learn how to select and filter the data within it. This is a fundamental skill for any data analysis task, allowing you to zoom in on the specific pieces of information you're interested in.

Today's topics include:
1.  **Selecting Data** using labels (`.loc`) and integer positions (`.iloc`).
2.  **Filtering Data** based on conditions (also known as boolean indexing).
3.  **Handling Missing Values** using `dropna()` and `fillna()`.

First, let's set up our environment. We'll import pandas and NumPy, and then load the Iris dataset, just as we did yesterday.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset into a DataFrame
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df['target'] = iris_data.target

# Let's preview the data to make sure it's loaded correctly
iris_df.head()

---

## Part 1: Selecting Data with `.loc` and `.iloc`

Pandas provides two indexers for selecting data. Both are used with square brackets rather than called like methods:
- `.loc`: Selects data based on **labels** (e.g., column names, index labels).
- `.iloc`: Selects data based on **integer position**, counting from 0 (e.g., position 0 is the first row or column).
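The difference between `.loc` and `.iloc` is easiest to see when the index labels are not the default integers. The sketch below uses a small made-up DataFrame (`toy_df` and its columns are illustrative only, not part of the Iris data) to select the same cell both ways and to compare how each one slices.

```python
import pandas as pd

# Hypothetical toy DataFrame with string index labels (not the Iris data)
toy_df = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "width": [0.5, 0.7, 0.9]},
    index=["a", "b", "c"],
)

# Same cell, addressed two ways
by_label = toy_df.loc["b", "width"]   # row labelled "b", column labelled "width"
by_position = toy_df.iloc[1, 1]       # second row, second column (0-based positions)

# Slicing differs: .loc includes the end label, .iloc excludes the end position
loc_slice = toy_df.loc["a":"b"]       # rows "a" and "b"
iloc_slice = toy_df.iloc[0:1]         # row at position 0 only

print(by_label, by_position)
print(len(loc_slice), len(iloc_slice))
```

With the default integer index used by `iris_df`, labels and positions happen to match, so this difference in slicing is the one that most often shows up.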

**Exercise 1.1:** Select the 'petal length (cm)' column and display its first 5 values using the `head()` method.

In [None]:
# Your code here

**Solution 1.1:**

In [None]:
# Solution
iris_df['petal length (cm)'].head()

**Exercise 1.2:** Select both the 'sepal width (cm)' and 'petal width (cm)' columns. Display the first 5 rows of this selection.

In [None]:
# Your code here

**Solution 1.2:**

In [None]:
# Solution
iris_df[['sepal width (cm)', 'petal width (cm)']].head()

**Exercise 1.3:** Select the 10th row of the DataFrame using `.iloc`.

In [None]:
# Your code here

**Solution 1.3:**

In [None]:
# Solution
# .iloc positions are 0-based, so the 10th row is at position 9.
iris_df.iloc[9]

**Exercise 1.4:** Select the rows from index 5 to 10 (inclusive) and the columns from 'sepal length (cm)' to 'petal length (cm)' (inclusive) using `.loc`.

In [None]:
# Your code here

**Solution 1.4:**

In [None]:
# Solution
# Note that .loc is inclusive of both start and end labels.
iris_df.loc[5:10, 'sepal length (cm)':'petal length (cm)']

---

## Part 2: Filtering Data

Filtering, or boolean indexing, is the process of selecting rows from a DataFrame based on a condition.
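As a minimal sketch of how this works: a comparison on a column produces a boolean Series (often called a mask), and passing that mask inside the brackets keeps only the rows where it is `True`. The `scores_df` data below is invented purely for illustration.

```python
import pandas as pd

# Hypothetical example data (not the Iris dataset)
scores_df = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "score": [72, 88, 95]})

# Step 1: the condition yields one True/False value per row
mask = scores_df["score"] > 80

# Step 2: indexing with the mask keeps only the rows marked True
high_scores_df = scores_df[mask]

print(mask.tolist())
print(high_scores_df)
```

In practice the two steps are usually written on one line, as in `scores_df[scores_df["score"] > 80]`.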

**Exercise 2.1:** Create a new DataFrame called `long_petals_df` that contains only the rows from `iris_df` where the 'petal length (cm)' is greater than 6.0.

In [None]:
# Your code here

**Solution 2.1:**

In [None]:
# Solution
long_petals_df = iris_df[iris_df['petal length (cm)'] > 6.0]
long_petals_df

**Exercise 2.2:** Create a new DataFrame called `specific_iris_df` that contains rows where the target is 0 **and** the 'sepal width (cm)' is greater than 3.5.

*Hint: For multiple conditions, wrap each condition in parentheses `()` and use `&` for AND.*
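The parentheses matter because `&` binds more tightly than comparison operators such as `>` and `==`, so an unparenthesized expression is grouped differently than intended and typically raises an error. The sketch below uses an invented `toy_df` (not the exercise data) to show `&` for AND, `|` for OR, and `~` for NOT.

```python
import pandas as pd

# Hypothetical toy data, chosen so the results are easy to check by eye
toy_df = pd.DataFrame({"x": [1, 5, 9, 12], "y": [10, 20, 30, 40]})

# AND: both conditions must hold
both = toy_df[(toy_df["x"] > 2) & (toy_df["y"] < 40)]

# OR: at least one condition must hold
either = toy_df[(toy_df["x"] > 10) | (toy_df["y"] < 20)]

# NOT: invert a condition
negated = toy_df[~(toy_df["x"] > 2)]

print(both["x"].tolist(), either["x"].tolist(), negated["x"].tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.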

In [None]:
# Your code here

**Solution 2.2:**

In [None]:
# Solution
specific_iris_df = iris_df[(iris_df['target'] == 0) & (iris_df['sepal width (cm)'] > 3.5)]
specific_iris_df

---

## Part 3: Handling Missing Data

Real-world data is often messy and contains missing values, represented as `NaN` (Not a Number). Let's create a temporary DataFrame with some missing data to practice handling it.

In [None]:
# Create a temporary DataFrame for this exercise
data = {'A': [1, 2, np.nan, 4, 5], 
        'B': [10, np.nan, np.nan, 40, 50],
        'C': [100, 200, 300, 400, 500]}
missing_df = pd.DataFrame(data)
missing_df
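
Before removing or filling anything, it usually helps to see how much is missing. A minimal sketch, assuming the same values as above but rebuilt under the illustrative name `example_df` so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Same values as the temporary DataFrame above, under an illustrative name
example_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, 400, 500],
})

# isna() marks each missing cell as True; summing counts them per column
missing_per_column = example_df.isna().sum()

# any(axis=1) flags rows that contain at least one missing value
rows_with_missing = example_df.isna().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)
```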

**Exercise 3.1:** Remove all rows from `missing_df` that contain any missing values.

In [None]:
# Your code here

**Solution 3.1:**

In [None]:
# Solution
missing_df.dropna()
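
Note that `dropna()` returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable if you want to keep it. The sketch below shows this, along with the `subset` parameter, which limits the check to particular columns. The names `example_df`, `cleaned_df`, and `only_a_df` are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as missing_df, rebuilt here so the block runs on its own
example_df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, 400, 500],
})

# Drop rows with a missing value in any column; example_df itself is not modified
cleaned_df = example_df.dropna()

# Drop rows only when column "A" is missing
only_a_df = example_df.dropna(subset=["A"])

print(len(example_df), len(cleaned_df), len(only_a_df))
```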

**Exercise 3.2:** Fill the missing values in `missing_df`. Replace `NaN` in column 'A' with 0 and `NaN` in column 'B' with the mean of column 'B'.

In [None]:
# Your code here

**Solution 3.2:**

In [None]:
# Solution
b_mean = missing_df['B'].mean()
filled_df = missing_df.fillna(value={'A': 0, 'B': b_mean})
filled_df

---

### Excellent work on Day 4!

You now have the power to select, slice, and filter DataFrames, which are some of the most common operations in data analysis. You also took your first steps in data cleaning by handling missing values. Tomorrow, we'll shift gears and start visualizing our data with Matplotlib!