# Day 4: Data Selection and Filtering with Pandas

Welcome to Day 4! Now that you know how to create and inspect a DataFrame, it's time to learn how to select and filter the data within it. This is a fundamental skill for any data analysis task, allowing you to zoom in on the specific pieces of information you're interested in.

Today's topics include:
1.  **Selecting Data** using labels (`.loc`) and integer positions (`.iloc`).
2.  **Filtering Data** based on conditions (also known as boolean indexing).
3.  **Handling Missing Values** using `dropna()` and `fillna()`.

First, let's get our environment set up. We'll import pandas and numpy, and then load the Iris dataset, just as we did yesterday.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset into a DataFrame
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df['target'] = iris_data.target

# Let's preview the data to make sure it's loaded correctly
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


---

## Part 1: Selecting Data with `.loc` and `.iloc`

Pandas provides two indexers for selecting data. Both are used with square brackets rather than being called like methods:
- `.loc`: Selects data based on **labels** (e.g., column names, index labels).
- `.iloc`: Selects data based on **integer position** (e.g., the 0th row, the 1st column).
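
The difference between the two is easiest to see when the row labels are not plain integers. The following is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative, not part of the Iris data):

```python
import pandas as pd

# Hypothetical toy data; the row labels are strings, so labels and positions differ.
df = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "width": [0.5, 0.7, 0.9]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "width"]  # row labelled "b", column labelled "width"
by_position = df.iloc[1, 1]      # second row, second column (positions start at 0)

print(by_label, by_position)
```

Both expressions reach the same cell here; one names it, the other counts to it.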

**Exercise 1.1:** Select the 'petal length (cm)' column and display its first 5 values using the `head()` method.

In [8]:
# Your code here
iris_df['petal length (cm)'].head()

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal length (cm), dtype: float64

**Solution 1.1:**

In [7]:
# Solution
iris_df['petal length (cm)'].head()

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal length (cm), dtype: float64

**Exercise 1.2:** Select both the 'sepal width (cm)' and 'petal width (cm)' columns. Display the first 5 rows of this selection.

In [12]:
# Your code here
iris_df[['sepal width (cm)', 'petal width (cm)']].head()

Unnamed: 0,sepal width (cm),petal width (cm)
0,3.5,0.2
1,3.0,0.2
2,3.2,0.2
3,3.1,0.2
4,3.6,0.2


**Solution 1.2:**

In [13]:
# Solution
iris_df[['sepal width (cm)', 'petal width (cm)']].head()

Unnamed: 0,sepal width (cm),petal width (cm)
0,3.5,0.2
1,3.0,0.2
2,3.2,0.2
3,3.1,0.2
4,3.6,0.2
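
Exercises 1.1 and 1.2 use plain square brackets rather than `.loc` or `.iloc`. As a rough sketch of how these forms relate, using a made-up DataFrame with hypothetical column names:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

one_column = df["x"]          # a single label returns a Series
one_column_frame = df[["x"]]  # a list of labels returns a DataFrame
two_columns = df[["x", "z"]]  # several columns, still a DataFrame

# The same column selection written with .loc: all rows, listed columns.
same_as_loc = df.loc[:, ["x", "z"]]

print(type(one_column).__name__, type(one_column_frame).__name__)
print(two_columns.equals(same_as_loc))
```

The double brackets in `df[["x", "z"]]` are a list of column names inside the indexing brackets, which is why a single pair is not enough to select two columns.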


**Exercise 1.3:** Select the 10th row of the DataFrame using `.iloc`.

In [14]:
# Your code here
iris_df.iloc[9]

sepal length (cm)    4.9
sepal width (cm)     3.1
petal length (cm)    1.5
petal width (cm)     0.1
target               0.0
Name: 9, dtype: float64

**Solution 1.3:**

In [15]:
# Solution
# Remember that .iloc positions start at 0, so the 10th row is at position 9.
iris_df.iloc[9]

sepal length (cm)    4.9
sepal width (cm)     3.1
petal length (cm)    1.5
petal width (cm)     0.1
target               0.0
Name: 9, dtype: float64

**Exercise 1.4:** Select the rows with index labels 5 through 10 (inclusive) and the columns from 'sepal length (cm)' to 'petal length (cm)' (inclusive) using `.loc`.

In [21]:
# Your code here
iris_df.loc[5:10, 'sepal length (cm)': 'petal length (cm)']

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm)
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5
8,4.4,2.9,1.4
9,4.9,3.1,1.5
10,5.4,3.7,1.5


**Solution 1.4:**

In [19]:
# Solution
# Note that .loc is inclusive of both start and end labels.
iris_df.loc[5:10, 'sepal length (cm)':'petal length (cm)']

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm)
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5
8,4.4,2.9,1.4
9,4.9,3.1,1.5
10,5.4,3.7,1.5
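
The inclusive end label is a common source of off-by-one surprises, because `.iloc` slices follow the usual Python rule and exclude the end position. A small sketch on a made-up DataFrame with the default integer index:

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..11.
df = pd.DataFrame({"value": range(100, 112)})

label_slice = df.loc[5:10]      # labels 5 through 10, end label included
position_slice = df.iloc[5:10]  # positions 5 through 9, end position excluded

print(len(label_slice), len(position_slice))  # 6 5
```

With the default index, labels and positions happen to coincide, so the same `5:10` slice returns six rows with `.loc` and five with `.iloc`.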


---

## Part 2: Filtering Data

Filtering, or boolean indexing, selects the rows of a DataFrame for which a condition is true. The condition is evaluated for every row, and only the rows where it holds are kept.
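
It can help to look at the intermediate step. A comparison on a column produces a boolean Series (often called a mask), and passing that mask to the DataFrame keeps the rows marked `True`. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"length": [1.4, 4.7, 6.1, 5.9]})

mask = df["length"] > 5.0  # boolean Series, one True/False per row
selected = df[mask]        # keeps only the rows where the mask is True

print(mask.tolist())
print(selected)
```

Note that the selected rows keep their original index labels (2 and 3 here) rather than being renumbered.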

**Exercise 2.1:** Create a new DataFrame called `long_petals_df` that contains only the rows from `iris_df` where the 'petal length (cm)' is greater than 6.0.

In [36]:
# Your code here
long_petals_df = iris_df[iris_df['petal length (cm)'] > 6.0]
long_petals_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
105,7.6,3.0,6.6,2.1,2
107,7.3,2.9,6.3,1.8,2
109,7.2,3.6,6.1,2.5,2
117,7.7,3.8,6.7,2.2,2
118,7.7,2.6,6.9,2.3,2
122,7.7,2.8,6.7,2.0,2
130,7.4,2.8,6.1,1.9,2
131,7.9,3.8,6.4,2.0,2
135,7.7,3.0,6.1,2.3,2


**Solution 2.1:**

In [34]:
# Solution
long_petals_df = iris_df[iris_df['petal length (cm)'] > 6.0]
long_petals_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
105,7.6,3.0,6.6,2.1,2
107,7.3,2.9,6.3,1.8,2
109,7.2,3.6,6.1,2.5,2
117,7.7,3.8,6.7,2.2,2
118,7.7,2.6,6.9,2.3,2
122,7.7,2.8,6.7,2.0,2
130,7.4,2.8,6.1,1.9,2
131,7.9,3.8,6.4,2.0,2
135,7.7,3.0,6.1,2.3,2


**Exercise 2.2:** Create a new DataFrame called `specific_iris_df` that contains rows where the target is 0 **and** the 'sepal width (cm)' is greater than 3.5.

*Hint: For multiple conditions, wrap each condition in parentheses `()` and use `&` for AND.*

In [38]:
# Your code here
specific_iris_df = iris_df[(iris_df['sepal width (cm)'] > 3.5) & (iris_df['target'] == 0 )]
specific_iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
10,5.4,3.7,1.5,0.2,0
14,5.8,4.0,1.2,0.2,0
15,5.7,4.4,1.5,0.4,0
16,5.4,3.9,1.3,0.4,0
18,5.7,3.8,1.7,0.3,0
19,5.1,3.8,1.5,0.3,0
21,5.1,3.7,1.5,0.4,0
22,4.6,3.6,1.0,0.2,0


**Solution 2.2:**

In [39]:
# Solution
specific_iris_df = iris_df[(iris_df['target'] == 0) & (iris_df['sepal width (cm)'] > 3.5)]
specific_iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
10,5.4,3.7,1.5,0.2,0
14,5.8,4.0,1.2,0.2,0
15,5.7,4.4,1.5,0.4,0
16,5.4,3.9,1.3,0.4,0
18,5.7,3.8,1.7,0.3,0
19,5.1,3.8,1.5,0.3,0
21,5.1,3.7,1.5,0.4,0
22,4.6,3.6,1.0,0.2,0
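
The same pattern extends to OR (`|`) and NOT (`~`). The parentheses matter because `&` and `|` bind more tightly than comparisons such as `==` and `>`, and Python's `and`/`or` keywords do not work element-wise on a Series. A short sketch on a made-up DataFrame (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"target": [0, 0, 1, 2], "width": [3.6, 3.0, 3.8, 2.5]})

both = df[(df["target"] == 0) & (df["width"] > 3.5)]    # AND: both conditions hold
either = df[(df["target"] == 0) | (df["width"] > 3.5)]  # OR: at least one holds
negated = df[~(df["target"] == 0)]                      # NOT: condition is false

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```

The order of the conditions around `&` does not change the result, which is why the exercise answer and the solution above return the same rows.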


---

## Part 3: Handling Missing Data

Real-world data is often messy and contains missing values, represented as `NaN` (Not a Number). Let's create a temporary DataFrame with some missing data to practice handling it.

In [40]:
# Create a temporary DataFrame for this exercise
data = {'A': [1, 2, np.nan, 4, 5], 
        'B': [10, np.nan, np.nan, 40, 50],
        'C': [100, 200, 300, 400, 500]}
missing_df = pd.DataFrame(data)
missing_df

Unnamed: 0,A,B,C
0,1.0,10.0,100
1,2.0,,200
2,,,300
3,4.0,40.0,400
4,5.0,50.0,500
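
Before removing or filling anything, it is often useful to count how many values are missing and where. One way to do that is with `isna()`, sketched here on a separate made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, np.nan, 6]})

missing_per_column = df.isna().sum()       # number of NaN values in each column
rows_with_missing = df.isna().any(axis=1)  # True for rows with at least one NaN

print(missing_per_column)
print(rows_with_missing.tolist())
```

A comparison such as `df["A"] == np.nan` does not work for this, because `NaN` is not equal to anything, including itself.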


**Exercise 3.1:** Remove all rows from `missing_df` that contain any missing values.

In [41]:
# Your code here
missing_df.dropna()

Unnamed: 0,A,B,C
0,1.0,10.0,100
3,4.0,40.0,400
4,5.0,50.0,500


**Solution 3.1:**

In [42]:
# Solution
missing_df.dropna()

Unnamed: 0,A,B,C
0,1.0,10.0,100
3,4.0,40.0,400
4,5.0,50.0,500
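
Two things are worth knowing about `dropna()`: it returns a new DataFrame and leaves the original untouched, and it accepts parameters that narrow what gets dropped. A brief sketch on a made-up DataFrame, showing the `subset` and `axis` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame(
    {"A": [1, 2, np.nan], "B": [10, np.nan, np.nan], "C": [100, 200, 300]}
)

any_missing_dropped = df.dropna()         # drop rows with a NaN in any column
only_a_checked = df.dropna(subset=["A"])  # drop rows only if column A is NaN
columns_dropped = df.dropna(axis=1)       # drop columns that contain any NaN

print(len(df), len(any_missing_dropped), len(only_a_checked))
print(columns_dropped.columns.tolist())
```

To keep the cleaned result, assign it to a variable, for example `cleaned_df = missing_df.dropna()`.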


**Exercise 3.2:** Fill the missing values in `missing_df`. Replace `NaN` in column 'A' with 0 and `NaN` in column 'B' with the mean of column 'B'.

In [60]:
# Your code here
missing_df['B'] = missing_df['B'].fillna(missing_df['B'].mean())
missing_df['A'] = missing_df['A'].fillna(0)
missing_df

Unnamed: 0,A,B,C
0,1.0,10.0,100
1,2.0,33.333333,200
2,0.0,33.333333,300
3,4.0,40.0,400
4,5.0,50.0,500


**Solution 3.2:**

In [45]:
# Solution
b_mean = missing_df['B'].mean()
filled_df = missing_df.fillna(value={'A': 0, 'B': b_mean})
filled_df

Unnamed: 0,A,B,C
0,1.0,10.0,100
1,2.0,33.333333,200
2,0.0,33.333333,300
3,4.0,40.0,400
4,5.0,50.0,500
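
The solution relies on two behaviours: `mean()` skips `NaN` values by default, and `fillna()` with a dictionary fills each listed column with its own value while returning a new DataFrame. A minimal sketch with made-up numbers that are easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"A": [1, np.nan, 5], "B": [10, np.nan, 50]})

b_mean = df["B"].mean()  # NaN is skipped: (10 + 50) / 2 = 30.0
filled = df.fillna({"A": 0, "B": b_mean})

print(filled)
print(df["B"].isna().sum())  # the original DataFrame still has its NaN
```

Assigning back to a column, as in `df['B'] = df['B'].fillna(...)`, modifies the original DataFrame instead, so the choice between the two styles depends on whether the unfilled data is still needed.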


---

### Excellent work on Day 4!

You now have the power to select, slice, and filter DataFrames, which are some of the most common operations in data analysis. You also took your first steps in data cleaning by handling missing values. Tomorrow, we'll shift gears and start visualizing our data with Matplotlib!