# Introduction to Pandas DataFrames

## Overview of DataFrames

The **DataFrame** is the core workhorse of the Pandas library. You can think of it as a programmatic Excel spreadsheet or a SQL table.

Technically, a DataFrame is a 2-dimensional, size-mutable, and potentially heterogeneous tabular data structure. More intuitively, **a DataFrame is simply a collection of Pandas Series that share the same index.**
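
To make the "collection of Series" idea concrete, here is a minimal sketch that builds a DataFrame from two Series sharing one index. The names `prices`, `stock`, and `inventory`, and the data in them, are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
prices = pd.Series([1.50, 0.25, 3.00], index=['apple', 'banana', 'cherry'])
stock = pd.Series([10, 150, 40], index=['apple', 'banana', 'cherry'])

# Each Series becomes one column; the shared index becomes the row labels
inventory = pd.DataFrame({'price': prices, 'stock': stock})

print(inventory.index.tolist())                  # ['apple', 'banana', 'cherry']
print(isinstance(inventory['price'], pd.Series)) # True
```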

---

## 1. Creating a DataFrame

While in practice you will mostly create DataFrames by reading external files (like CSVs or SQL databases), it is important to know how to construct them manually for testing and manipulation.

```python
import numpy as np
import pandas as pd

# Set a random seed for reproducible results
np.random.seed(101)

# Generate random data: 5 rows, 4 columns
data = np.random.randn(5, 4)

# Define row labels (Index) and column labels
row_labels = ['A', 'B', 'C', 'D', 'E']
col_labels = ['W', 'X', 'Y', 'Z']

# Construct the DataFrame
df = pd.DataFrame(data=data, index=row_labels, columns=col_labels)

```
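
For the more common case of loading from a file, the sketch below reads CSV text with `pd.read_csv`. To stay self-contained it reads from an in-memory string via `io.StringIO` rather than a real path; the column names and values are assumptions chosen for the example.

```python
import io
import pandas as pd

# Illustrative CSV text; in real use you would pass a file path instead
csv_text = """label,W,X
A,1.0,2.0
B,3.0,4.0
"""

# index_col tells pandas which column supplies the row labels
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='label')

print(df_csv.shape)          # (2, 2)
print(df_csv.loc['B', 'X'])  # 4.0
```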

---

## 2. Accessing Columns

Because a DataFrame is a collection of Series, grabbing a single column returns a Series object. Grabbing multiple columns returns a smaller DataFrame.

### A. Grabbing a Single Column (Returns a Series)

Always use bracket notation `df['ColumnName']`.
*(Note: `df.ColumnName` also works when the name is a valid Python identifier, but it is discouraged because a column named like a built-in DataFrame attribute, such as `drop` or `count`, resolves to the method instead of the column.)*

```python
# Returns a Pandas Series
w_col = df['W'] 

```
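
The naming conflict is easy to reproduce. The following sketch uses a hypothetical `sales` frame with a column deliberately named `count`:

```python
import pandas as pd

# Illustrative frame with a column whose name matches a DataFrame method
sales = pd.DataFrame({'count': [3, 5], 'region': ['N', 'S']})

# Bracket notation always returns the column
print(isinstance(sales['count'], pd.Series))  # True

# Attribute access resolves to the built-in method, not the column
print(callable(sales.count))                  # True

# A non-conflicting name works either way
print(sales.region.equals(sales['region']))   # True
```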

### B. Grabbing Multiple Columns (Returns a DataFrame)

To grab multiple columns, pass a **list** of column names inside the brackets.

```python
# Returns a Pandas DataFrame
subset = df[['W', 'Z']] 

```
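
The brackets decide the return type, not the number of columns: a list with a single name still yields a DataFrame. A small sketch with made-up data (`df_demo` is a name introduced here):

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'Z': [3, 4]}, index=['A', 'B'])

as_series = df_demo['W']    # single label: 1-dimensional Series
as_frame = df_demo[['W']]   # list of labels: 2-dimensional DataFrame

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)
```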

---

## 3. Creating and Removing Columns

### A. Creating a New Column

You can create a new column on the fly by assigning to it as if it already existed, usually with the result of arithmetic on existing columns.

```python
# Create a new column 'new' that is the sum of 'W' and 'Y'
df['new'] = df['W'] + df['Y']

```
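
With small, fixed numbers the effect is easier to check by eye. This sketch (illustrative data, `df_demo` is an assumed name) shows that the sum is computed row by row and the new column is appended on the right:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Element-wise sum, matched up by row label
df_demo['new'] = df_demo['W'] + df_demo['Y']

print(df_demo.columns.tolist())  # ['W', 'Y', 'new']
print(df_demo['new'].tolist())   # [11.0, 22.0]
```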

### B. Removing a Column (`drop`)

Use the `.drop()` method to remove rows or columns.

**Crucial Concepts for `drop()`:**

1. **Axis:** You must specify `axis=1` to drop a column. The default is `axis=0`, which targets rows. This mirrors the NumPy shape convention: `df.shape` is `(rows, columns)`, so position 0 refers to rows and position 1 to columns.
2. **In-Place:** By default, `.drop()` returns a *copy* of the DataFrame with the column removed; it does not alter the original `df`. To permanently alter the original, you must set `inplace=True`.

```python
# Drops the column permanently from the original DataFrame
df.drop('new', axis=1, inplace=True)

```
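
Both points can be verified on a tiny frame. The sketch below uses made-up data and also shows reassignment as an alternative to `inplace=True`:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'new': [3, 4]}, index=['A', 'B'])

# Without inplace=True the original is untouched
trimmed = df_demo.drop('new', axis=1)
print('new' in df_demo.columns)  # True
print('new' in trimmed.columns)  # False

# Reassigning the result has the same effect as inplace=True
df_demo = df_demo.drop('new', axis=1)

# axis=0 (the default) drops a row by label instead
without_a = df_demo.drop('A')
print(without_a.index.tolist())  # ['B']
```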

---

## 4. Accessing Rows

A plain label inside brackets (e.g., `df['A']`) is looked up among the columns, so it cannot be used to grab the row labeled `'A'`. Instead, you use the dedicated indexers `.loc` and `.iloc`. **Extracting a single row returns a Pandas Series**, where the Series index is the DataFrame's column names.

### A. `loc` (Label-based Location)

Use `.loc` when you want to extract a row using its explicit label (name).

```python
# Returns Row 'A' as a Series
row_a = df.loc['A'] 

```

### B. `iloc` (Integer-based Location)

Use `.iloc` when you want to extract a row using its zero-based integer position, regardless of its label.

```python
# Returns Row 'C' (which is at integer index 2)
row_c = df.iloc[2]

```
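
The difference between the two matters most when the labels are themselves integers. In this sketch, `df_int` is a hypothetical frame whose labels (10, 20, 30) do not match the positions (0, 1, 2):

```python
import pandas as pd

# Illustrative frame whose integer labels differ from the row positions
df_int = pd.DataFrame({'value': ['ten', 'twenty', 'thirty']}, index=[10, 20, 30])

print(df_int.loc[10, 'value'])   # ten     (row labeled 10)
print(df_int.iloc[0]['value'])   # ten     (row at position 0)
print(df_int.iloc[2]['value'])   # thirty  (row at position 2)

# df_int.loc[0] would raise a KeyError because no row is labeled 0
```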

---

## 5. Selecting Subsets (Rows & Columns)

You can use `.loc` to grab specific cross-sections of the DataFrame by passing `[row_labels, column_labels]`.

```python
# 1. Grab a single specific value (Row B, Column Y)
value = df.loc['B', 'Y']

# 2. Grab a subset matrix (Rows A & B, Columns W & Y)
sub_matrix = df.loc[['A', 'B'], ['W', 'Y']]

```
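
The same cross-section can be taken by position with `.iloc`. The following sketch uses sequential integers instead of random data so the values are easy to verify; `df_demo` is a name introduced for this example.

```python
import numpy as np
import pandas as pd

# 5x4 frame filled with 0..19, row by row
df_demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                       index=['A', 'B', 'C', 'D', 'E'],
                       columns=['W', 'X', 'Y', 'Z'])

by_label = df_demo.loc[['A', 'B'], ['W', 'Y']]
by_position = df_demo.iloc[[0, 1], [0, 2]]

print(by_label.equals(by_position))  # True
print(df_demo.loc['B', 'Y'])         # 6
```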

---

## Summary Cheat Sheet

| Operation | Syntax | Returns |
| --- | --- | --- |
| **Get Column** | `df['col_name']` | Series |
| **Get Columns** | `df[['col1', 'col2']]` | DataFrame |
| **Create Column** | `df['new'] = df['col1'] + df['col2']` | Modifies `df` |
| **Drop Column** | `df.drop('col', axis=1, inplace=True)` | Modifies `df` |
| **Get Row (Label)** | `df.loc['row_label']` | Series |
| **Get Row (Position)** | `df.iloc[row_position]` | Series |
| **Get Subset** | `df.loc[['r1', 'r2'], ['c1', 'c2']]` | DataFrame |


# Pandas DataFrames: Conditional Selection & Indexing

In this section, we cover one of the most useful features of Pandas DataFrames: **Conditional Selection** (also known as Boolean Masking). It lets you filter rows with vectorized comparisons instead of explicit `for` loops. We also explore how to manipulate the DataFrame's Index.

---

## 1. Conditional Selection (Boolean Masking)

Conditional selection in Pandas works almost exactly like it does in NumPy.

### Step 1: Generating the Boolean Series

When you apply a comparison operator to a DataFrame column (which is a Series), Pandas returns a new Series of the same length, filled with `True` or `False`.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), 
                  index=['A', 'B', 'C', 'D', 'E'], 
                  columns=['W', 'X', 'Y', 'Z'])

# Condition: Where is column 'W' greater than 0?
bool_series = df['W'] > 0

print(bool_series)
# Output:
# A     True
# B     True
# C    False
# D     True
# E     True
# Name: W, dtype: bool

```

### Step 2: Filtering the DataFrame

You pass this Boolean Series back into the DataFrame's brackets `df[]`. Pandas will return **only the rows where the Series is `True`**.

```python
# The standard one-liner pattern
filtered_df = df[df['W'] > 0]

# Notice row 'C' is entirely missing because its 'W' value was < 0
print(filtered_df)

```

### Step 3: Stacking Commands (The One-Liner)

Because the result of `df[df['W'] > 0]` is just another DataFrame, you can immediately grab specific columns from it in the same line of code.

```python
# "Give me the 'X' and 'Y' columns ONLY for the rows where 'W' > 0"
result = df[df['W'] > 0][['X', 'Y']]

```

*Engineering Tip:* If this one-liner is difficult to read, break it down into multiple variables. In production code, readability often trumps brevity.
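
One way to break the one-liner down is to name the mask and then select rows and columns in a single `.loc` call. The data and the names `df_demo` and `w_positive` are illustrative assumptions; the comparison at the end only confirms that both forms select the same cells.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, -2, 3], 'X': [10, 20, 30], 'Y': [7, 8, 9]},
                       index=['A', 'B', 'C'])

# Step by step: name the mask, then pick rows and columns together
w_positive = df_demo['W'] > 0
result = df_demo.loc[w_positive, ['X', 'Y']]

# Same rows and columns as the chained one-liner
chained = df_demo[df_demo['W'] > 0][['X', 'Y']]

print(result.equals(chained))  # True
print(result.index.tolist())   # ['A', 'C']
```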

---

## 2. Multiple Conditions (`&` and `|`)

**CRITICAL:** You **cannot** use the standard Python `and` / `or` keywords when evaluating Pandas Series.

The Python `and` operator expects each operand to reduce to a single boolean value (e.g., `True and False`). A Pandas Series contains multiple values, so Pandas raises `ValueError: The truth value of a Series is ambiguous`.

Instead, you must use the element-wise bitwise operators and **wrap each condition in parentheses**, because `&` and `|` bind more tightly than comparison operators such as `>`.

* **AND:** Use the ampersand `&`
* **OR:** Use the pipe `|`

```python
# Condition: 'W' must be > 0 AND 'Y' must be > 1
result = df[(df['W'] > 0) & (df['Y'] > 1)]

# Condition: 'W' must be > 0 OR 'Z' must be < 0
result = df[(df['W'] > 0) | (df['Z'] < 0)]

```
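
Both failure modes can be reproduced on a small frame. This sketch uses made-up integer data and catches the errors so it runs to completion; `df_demo`, `and_error`, and `paren_error` are names introduced here.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, -1, 2], 'Y': [2, 3, 0]}, index=['A', 'B', 'C'])

# `and` asks each Series for a single truth value, which pandas refuses to give
try:
    df_demo[(df_demo['W'] > 0) and (df_demo['Y'] > 1)]
except ValueError as err:
    and_error = type(err).__name__
print(and_error)    # ValueError

# Without parentheses, `0 & df_demo['Y']` is evaluated first, and the
# resulting chained comparison fails in the same way
try:
    df_demo[df_demo['W'] > 0 & df_demo['Y'] > 1]
except ValueError as err:
    paren_error = type(err).__name__
print(paren_error)  # ValueError

# Parenthesized conditions combined with `&` work element by element
both = df_demo[(df_demo['W'] > 0) & (df_demo['Y'] > 1)]
print(both.index.tolist())  # ['A']
```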

---

## 3. Manipulating the Index

The Index in Pandas holds the row labels (the bolded "names" at the left of each row when a DataFrame is displayed in a notebook). Sometimes, you need to turn the Index into a standard column, or turn a standard column into the new Index.

### A. Resetting the Index (`reset_index`)

This resets the Index back to the default numerical sequence (0, 1, 2, ...). The old Index is preserved as a new column, named `'index'` when the old Index has no name of its own.

```python
# By default, this returns a copy. Use inplace=True to modify the original.
df.reset_index(inplace=True)

# 'A', 'B', 'C' are now in a standard column called 'index'
# The row labels are now 0, 1, 2, 3, 4

```
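
To see the copy-versus-original behavior, the sketch below calls `reset_index()` without `inplace=True` on a small illustrative frame (`df_demo` and `flat` are assumed names):

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

# Without inplace=True, the original keeps its labels
flat = df_demo.reset_index()

print(df_demo.index.tolist())  # ['A', 'B', 'C']
print(flat.index.tolist())     # [0, 1, 2]
print(flat.columns.tolist())   # ['index', 'W']
print(flat['index'].tolist())  # ['A', 'B', 'C']
```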

### B. Setting a New Index (`set_index`)

This takes an existing column and elevates it to become the new Index.

**Warning:** This is a destructive action. The existing Index will be completely overwritten and lost unless you saved it to a column first.

```python
# Let's add a new column for states
states = ['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = states

# Set the 'States' column to be the new Index
df.set_index('States', inplace=True)

# The rows are now labeled 'CA', 'NY', etc.
# The 'States' column no longer exists as a standard data column.

```
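
The warning about losing the old Index can be checked directly. In this sketch (illustrative data), calling `reset_index()` first is one way to keep the old labels as a regular column:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']}, index=['A', 'B'])

# The old labels 'A' and 'B' are discarded here
lost = df_demo.set_index('States')
print(lost.columns.tolist())    # ['W']

# Saving the old index to a column first keeps it available
kept = df_demo.reset_index().set_index('States')
print(kept.columns.tolist())    # ['index', 'W']
print(kept.loc['NY', 'index'])  # B
```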

---

## Summary Cheat Sheet

| Operation | Syntax | Note |
| --- | --- | --- |
| **Filter Rows** | `df[df['col'] > 0]` | Returns rows where condition is True. |
| **Select specific columns after filter** | `df[df['col'] > 0][['X', 'Y']]` | Stack operations sequentially. |
| **Multiple Conditions (AND)** | `df[(cond1) & (cond2)]` | Requires `&` and `()`. |
| **Multiple Conditions (OR)** | `df[(cond1) \| (cond2)]` | Requires `\|` and `()`. |
| **Reset Index to Numbers** | `df.reset_index(inplace=True)` | Old index becomes a column. |
| **Set Column as Index** | `df.set_index('col', inplace=True)` | Overwrites current index. |