# Week 5, Class 2: Data Loading, Indexing, and Selection

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Build a small DataFrame from a dictionary of column name -> values
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
})
print(df)

In [None]:
# Update Charlie's age: a boolean mask picks the row, "Age" picks the column
df.loc[df["Name"] == "Charlie", "Age"] = 40
print(df)

In [None]:
# Set a single cell by row label and column name with .at
df.at[0, "Age"] = 26
print(df)

## 1. Loading Data from Files
Pandas provides functions for reading data from many file formats. The two most common formats for scientific data are `.csv` (Comma-Separated Values) and `.xlsx` (Excel).

### 1.1. Reading CSV Files
The `pd.read_csv()` function is your go-to for reading text files with delimited data.

In [None]:
# Read the file into a DataFrame
df = pd.read_csv("experiment_data.csv")
print(f"DataFrame loaded from CSV:\n{df}")
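
The contents of `experiment_data.csv` are not shown in these notes. To follow along without the file, the sketch below builds a stand-in from an in-memory string. The column names match those used in later cells, but every value is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for experiment_data.csv: the column names follow the
# later examples, but the values are invented for illustration only.
csv_text = """SampleID,Weight(g),pH_Level,Catalyst
S1,10.2,7.1,A
S2,11.5,6.8,B
S3,9.8,7.2,A
S4,12.1,7.0,B
S5,10.9,7.4,C
"""

# pd.read_csv accepts a file path or any file-like object, such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.shape)  # (number of rows, number of columns)
```

Running this cell in place of the one above should leave a `df` with five rows and four columns, which is enough for the selection examples later in the class.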

### 1.2. Reading Excel Files

For `.xlsx` files, use `pd.read_excel()`. Reading `.xlsx` files relies on a separate engine library, usually `openpyxl`, which must be installed alongside pandas.

In [None]:
# Read the "Experiments" sheet of the workbook into a DataFrame
df_excel = pd.read_excel("data.xlsx", sheet_name="Experiments")
df_excel.head()
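
If `pd.read_excel()` fails with an import error, the Excel engine is probably missing. The following sketch checks for `openpyxl` and, only if it is installed, round-trips a small made-up table through an in-memory workbook. The sheet name mirrors the cell above; the data values and the variable names `small` and `round_trip` are hypothetical.

```python
import importlib.util
import io

import pandas as pd

# Hypothetical data, used only to have something to write and read back
small = pd.DataFrame({"SampleID": ["S1", "S2"], "Weight(g)": [10.2, 11.5]})

# find_spec returns None when the package cannot be found
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

round_trip = None
if has_openpyxl:
    buffer = io.BytesIO()
    # Write to an in-memory .xlsx workbook instead of a file on disk
    small.to_excel(buffer, sheet_name="Experiments", index=False, engine="openpyxl")
    buffer.seek(0)  # rewind so the reader starts at the beginning
    round_trip = pd.read_excel(buffer, sheet_name="Experiments", engine="openpyxl")
    print(round_trip)
else:
    print("openpyxl is not installed; try: pip install openpyxl")
```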

### 1.3. Inspecting the Data
Once your data is in a DataFrame, these methods are invaluable for a quick sanity check:

* `df.head(n)`: Returns the first `n` rows (default is 5).
* `df.tail(n)`: Returns the last `n` rows (default is 5).
* `df.info()`: Provides a concise summary of the DataFrame, including the column data types and number of non-null values.

In [None]:
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nDataFrame info:")
df.info()
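
One detail worth knowing: `head()` and `tail()` return new DataFrames, whereas `info()` prints its report and returns `None`, so wrapping it in `print()` adds a stray `None` line. A short sketch, using a small made-up table in place of the course file:

```python
import pandas as pd

# Hypothetical values standing in for the course dataset
df = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4", "S5"],
    "Weight(g)": [10.2, 11.5, 9.8, 12.1, 10.9],
})

first_three = df.head(3)   # a DataFrame holding the first 3 rows
last_two = df.tail(2)      # a DataFrame holding the last 2 rows
info_result = df.info()    # prints the summary and returns None

print(type(first_three).__name__, len(first_three), len(last_two))
print(info_result)
```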

## 2. Accessing Data with `.loc[]` (Label-Based Indexing)

Pandas provides an explicit way to access data using the `.loc[]` accessor. `.loc` stands for **"location"** and is used for **label-based indexing**. This means you access rows and columns by their names.

### 2.1. Selecting Rows by Index Label

Because the CSV was loaded without specifying an index column, our `df` has the default integer row labels `0, 1, 2, 3, 4`.

In [None]:
# Select a single row by its label
row_2 = df.loc[2]
print(f"Row 2 (as a Series):\n{row_2}")
print(f"Type of selected row: {type(row_2)}")

# Select multiple rows by a list of labels
rows_0_and_3 = df.loc[[0, 3]]
print(f"\nRows 0 and 3:\n{rows_0_and_3}")

# Select a range of rows by label
# Unlike Python list slices, a .loc slice includes the end label, so 1:3 returns rows 1, 2 and 3
rows_1_to_3 = df.loc[1:3]
print(f"\nRows 1 to 3 (inclusive):\n{rows_1_to_3}")
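
The label-based behaviour is easier to see when the index is not the default integers. The sketch below uses a hypothetical table indexed by `SampleID`; the values are invented, and `set_index` is used only to create string labels.

```python
import pandas as pd

# Hypothetical data; SampleID becomes the row label via set_index
df_labeled = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4", "S5"],
    "Weight(g)": [10.2, 11.5, 9.8, 12.1, 10.9],
    "pH_Level": [7.1, 6.8, 7.2, 7.0, 7.4],
}).set_index("SampleID")

# A single label returns a Series
print(df_labeled.loc["S3"])

# A label slice includes BOTH endpoints: S2, S3 and S4
middle = df_labeled.loc["S2":"S4"]
print(middle)

# Integer labels no longer exist, so .loc[2] raises a KeyError
try:
    df_labeled.loc[2]
except KeyError as err:
    print(f"KeyError: {err}")
```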

### 2.2. Selecting Both Rows and Columns by Label

The power of `.loc` is its ability to select both rows and columns in a single, readable command. The syntax is `df.loc[row_labels, column_labels]`.

In [None]:
# Select a single value at a specific row and column label
value_at_2_pH = df.loc[2, 'pH_Level']
print(f"pH Level of row 2: {value_at_2_pH}")

# Select a subset of rows and a subset of columns
subset_data = df.loc[[1, 4], ['SampleID', 'Weight(g)']]
print(f"\nSubset of data:\n{subset_data}")

# Select all rows, but only specific columns
all_rows_some_cols = df.loc[:, ['Weight(g)', 'Catalyst']]
print(f"\nAll rows, specific columns:\n{all_rows_some_cols}")
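
Label slices work on the column axis too, and they are inclusive there as well. A minimal sketch on made-up data (only the column names come from the cell above):

```python
import pandas as pd

# Hypothetical values; only the column names come from the examples above
df = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4", "S5"],
    "Weight(g)": [10.2, 11.5, 9.8, 12.1, 10.9],
    "pH_Level": [7.1, 6.8, 7.2, 7.0, 7.4],
    "Catalyst": ["A", "B", "A", "B", "C"],
})

# Rows labelled 1 through 3 and columns from Weight(g) through Catalyst,
# with both ends included on both axes
block = df.loc[1:3, "Weight(g)":"Catalyst"]
print(block)
print(block.shape)
```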

## 3. Accessing Data with `.iloc[]` (Position-Based Indexing)

The `.iloc[]` accessor is used for **integer-position-based indexing**. It works just like NumPy arrays or Python lists, where you access elements by their numerical position starting from 0.

### 3.1. Selecting Rows by Position

In [None]:
# Select a single row by its integer position
row_at_position_2 = df.iloc[2]
print(f"Row at position 2 (label 2): \n{row_at_position_2}")

# Select multiple rows by a list of integer positions
rows_at_0_and_3 = df.iloc[[0, 3]]
print(f"\nRows at positions 0 and 3:\n{rows_at_0_and_3}")

# Select a range of rows by position
# Like Python lists, an .iloc slice stops before the end position, so 1:4 returns positions 1, 2 and 3
rows_at_1_to_3 = df.iloc[1:4]
print(f"\nRows at positions 1 to 3 (slice 1:4, end excluded):\n{rows_at_1_to_3}")
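
Because `.iloc` follows Python's sequence rules, negative positions and step slices behave as they do for lists. A small sketch with invented values:

```python
import pandas as pd

# Hypothetical values for illustration
weights = [10.2, 11.5, 9.8, 12.1, 10.9]
df = pd.DataFrame({"Weight(g)": weights})

# The same slice on a list and on .iloc selects the same positions
print(weights[1:4])
print(df.iloc[1:4]["Weight(g)"].tolist())

# Negative positions count from the end
last_row = df.iloc[-1]
print(last_row["Weight(g)"])

# Every second row, starting at position 0
every_other = df.iloc[::2]
print(every_other.index.tolist())
```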

### 3.2. Selecting Both Rows and Columns by Position

The syntax is `df.iloc[row_positions, column_positions]`.

In [None]:
# Select a single value at a specific row and column position
value_at_2_3 = df.iloc[2, 3] # Row position 2, column position 3 (the fourth column)
print(f"Value at position (2, 3): {value_at_2_3}")

# Select a subset of rows and a subset of columns by position
subset_data_iloc = df.iloc[[1, 4], [0, 1]] # Rows 1 & 4, Columns 0 & 1
print(f"\nSubset of data by position:\n{subset_data_iloc}")

# Select all rows, but only the first two columns
first_two_cols = df.iloc[:, 0:2]
print(f"\nAll rows, first two columns:\n{first_two_cols}")

**Key Distinction:** Use `.loc` to select by row and column labels (names). Use `.iloc` to select by integer position, regardless of what the labels are.
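
With a default index the two accessors often return the same rows, which hides the difference. It shows up as soon as labels and positions stop lining up, for example after sorting. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4", "S5"],
    "Weight(g)": [10.2, 11.5, 9.8, 12.1, 10.9],
})

# Sorting reorders the rows, but each row keeps its original label
by_weight = df.sort_values("Weight(g)")
print(by_weight)

# .loc[0] finds the row LABELLED 0 (sample S1), wherever it now sits
print(by_weight.loc[0, "SampleID"])

# .iloc[0] takes the row in FIRST POSITION (the lightest sample, S3)
print(by_weight.iloc[0]["SampleID"])
```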

## 4. Boolean Indexing with a DataFrame

Just like with NumPy arrays, you can use a boolean Series or array to filter your DataFrame.

In [None]:
# Find all samples with a weight greater than 11.0 g
heavy_samples_mask = df['Weight(g)'] > 11.0
print(f"Boolean mask for heavy samples:\n{heavy_samples_mask}")

# Use the mask to filter the DataFrame
heavy_samples_df = df[heavy_samples_mask]
print(f"\nDataFrame of heavy samples:\n{heavy_samples_df}")

# Combining multiple conditions
# Find samples with pH between 7.0 and 7.3 AND using Catalyst 'B'
condition_1 = (df['pH_Level'] >= 7.0) & (df['pH_Level'] <= 7.3)
condition_2 = df['Catalyst'] == 'B'

# Combine the masks using the '&' (AND) operator
combined_mask = condition_1 & condition_2

# Filter the DataFrame using the combined mask
filtered_data = df[combined_mask]
print(f"\nFiltered data (pH >= 7.0 & pH <= 7.3 AND Catalyst B):\n{filtered_data}")

# .loc accepts a boolean mask for the rows together with column labels,
# so you can filter rows and pick columns in one step
filtered_data_loc = df.loc[combined_mask, ['SampleID', 'Weight(g)']]
print(f"\nFiltered data using .loc to get specific columns:\n{filtered_data_loc}")

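**Note:** When combining boolean Series for filtering, use `&` for AND and `|` for OR, and wrap each condition in parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `>=` and `==`. Python's `and` and `or` do not work element-wise; applied to a Series they raise a `ValueError` because the truth value of a Series is ambiguous.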
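
The sketch below, on invented data, shows the error that `and` produces and the element-wise operators that replace it, including `~` for NOT.

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    "pH_Level": [7.1, 6.8, 7.2, 7.0, 7.4],
    "Catalyst": ["A", "B", "A", "B", "C"],
})

# Python's `and` asks for a single True/False, which a Series cannot give
try:
    df[(df["pH_Level"] >= 7.0) and (df["Catalyst"] == "B")]
except ValueError as err:
    print(f"ValueError: {err}")

# Element-wise AND, OR and NOT
both = df[(df["pH_Level"] >= 7.0) & (df["Catalyst"] == "B")]
either = df[(df["pH_Level"] > 7.3) | (df["Catalyst"] == "B")]
not_b = df[~(df["Catalyst"] == "B")]

print(both.index.tolist(), either.index.tolist(), not_b.index.tolist())
```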

## Summary and Key Takeaways

* **`pd.read_csv()`** and **`pd.read_excel()`** are the primary functions for loading data into a `DataFrame`.
* **`.head()`**, **`.tail()`**, and **`.info()`** are essential for initial data inspection.
* **`.loc[]`** is for **label-based** indexing (using row/column names). Slicing with `.loc` is inclusive of the end label.
* **`.iloc[]`** is for **position-based** indexing (using integer positions). Slicing with `.iloc` is exclusive of the end position.
* **Boolean indexing** is a highly effective way to filter a `DataFrame` based on conditional logic.

## Exercises (Homework)

Complete the following exercises in a new Python script or a new Jupyter Notebook.

1.  **Read and Inspect a New Dataset:**
    * Create a new CSV file named `lab_temps.csv` with the following content:
        ```csv
        Lab,RunID,Temp(C),Pressure(kPa)
        A,R1,25.1,101.3
        B,R1,24.5,100.9
        A,R2,26.0,101.5
        C,R1,23.8,101.1
        B,R2,25.2,101.2
        ```
    * Read this file into a DataFrame named `temp_df`.
    * Print the first 2 rows of `temp_df`.
    * Print a summary of the DataFrame's info using `.info()`.

2.  **Access Data with `.loc`:**
    * Using `temp_df`, retrieve and print the row with index label `2`.
    * Retrieve and print the `Temp(C)` and `Pressure(kPa)` columns for rows with index labels `0` and `4` using a single `.loc` call.
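    * What is the value of the `Pressure(kPa)` column in the last row, using `.loc`? Note that `.loc` works with labels, so negative indexing such as `-1` does not apply here; look up the last row's label first (for example with `temp_df.index[-1]`).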

3.  **Access Data with `.iloc`:**
    * Using `temp_df`, retrieve and print the row at integer position `1`.
    * Retrieve and print the `Lab` and `Temp(C)` columns for the rows at integer positions `0, 2, 4` using a single `.iloc` call.
    * What is the value of the `Temp(C)` column in the last row, using `.iloc`?

4.  **Filter Data with Boolean Indexing:**
    * Using `temp_df`, create a new DataFrame `filtered_df` that contains only the rows where the `Lab` is 'A'.
    * Create a new DataFrame `critical_df` that contains only the rows where the `Temp(C)` is greater than or equal to `25.5` AND the `Pressure(kPa)` is less than or equal to `101.2`.
    * Print both `filtered_df` and `critical_df`.