# Lesson 3: Indexing and Selecting Data in Pandas


Hello! Today we're diving into **Indexing and Selecting Data in pandas**, a crucial part of data manipulation and analysis. 

- **Indexing** helps us locate data in specific rows.  
- **Selecting** focuses on picking specific columns or cells.

We'll explore how to select and index data using pandas through hands-on examples. Let's begin!

---

## Understanding Indexing: Setting Index

In pandas, an **index** serves as the address of your data. By default, pandas assigns integer labels to rows, but you can set any column as the index, turning it into an identifier for rows.

### Example: Using `set_index()`

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
    "City": ["New York", "Los Angeles", "Chicago"]
})

df.set_index("Name", inplace=True)
print(df)
# Output:
#        Age         City
# Name
# Alice   25     New York
# Bob     22  Los Angeles
# John    30      Chicago
```

### Key Notes:
1. Use the `loc[]` indexer for **label-based indexing**.
2. Use the `iloc[]` indexer for **integer-based (positional) indexing** (explored later).  
3. The `inplace` parameter applies changes directly to the target DataFrame if set to `True`; the method then returns `None`.  
   - The future of `inplace` has been under discussion among the pandas developers, so a version-independent alternative is to assign the modified DataFrame back:  
     ```python
     df = df.set_index("Name")
     ```

---

## Understanding Indexing: Resetting Index

To reset the index to default, use `reset_index()`:

```python
df.reset_index(inplace=True)
print(df)
# Output:
#     Name  Age         City
# 0  Alice   25     New York
# 1    Bob   22  Los Angeles
# 2   John   30      Chicago
```
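
As a small sketch of what `reset_index()` does with the old index (the `people` frame and the variable names are illustrative, not part of the lesson): by default the old index comes back as a regular column, while `drop=True` discards it.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
}).set_index("Name")

# Default: the old index ("Name") is restored as a column
restored = people.reset_index()

# drop=True: the old index is thrown away instead of being kept
dropped = people.reset_index(drop=True)

print(restored.columns.tolist())  # ['Name', 'Age']
print(dropped.columns.tolist())   # ['Age']
```

Both calls return a new DataFrame and leave `people` untouched, since `inplace` is not used here.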

---

## Understanding Indexing: Renaming Index

After `reset_index()`, the former index `Name` is an ordinary column again, so it can be renamed like any other column with the `rename()` method and its `columns` parameter:

```python
df.rename(columns={"Name": "Student Name", "Age": "Student Age"}, inplace=True)
print(df)
# Output:
#    Student Name  Student Age         City
# 0         Alice           25     New York
# 1           Bob           22  Los Angeles
# 2          John           30      Chicago
```

### Note:
Provide a dictionary where the key is the old name, and the value is the new name.
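
The call above renames columns. When a column is currently serving as the index, a minimal sketch of renaming the index itself could look like this, assuming a small illustrative `students` frame: `rename(index=...)` changes individual row labels and `rename_axis()` changes the name of the index.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
}).set_index("Name")

# Rename a single row label (old label -> new label)
relabeled = students.rename(index={"Bob": "Robert"})

# Rename the index itself (its name, not its labels)
renamed_axis = relabeled.rename_axis("Student Name")

print(renamed_axis.index.tolist())  # ['Alice', 'Robert', 'John']
print(renamed_axis.index.name)      # Student Name
```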

---

## Selecting Data Using Labels and Location

pandas provides `loc[]` and `iloc[]` for data access:  
- **`loc[]`**: Label-based indexing.  
- **`iloc[]`**: Integer-based indexing.  

### Example:

```python
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Robert", "Ann"],
    "Age": [25, 22, 30, 28, 32],
    "City": ["New York", "Los Angeles", "Chicago", "San Francisco", "Houston"]
})

df.set_index("Name", inplace=True)

print(df.loc[["Alice", "John"], ["Age", "City"]])
# Output:
#        Age      City
# Name
# Alice   25  New York
# John    30   Chicago

print(df.iloc[[1, 3], [0, 1]])
# Output:
#         Age           City
# Name                      
# Bob      22    Los Angeles
# Robert   28  San Francisco
```

### Key Notes:
1. **`loc[]`**: Uses labels (index and column names) for selection.  
2. **`iloc[]`**: Uses numerical indices for rows and columns (similar to 2D NumPy arrays).
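
One practical difference between the two shows up when slicing. The sketch below, which uses an illustrative `people` frame, shows that a `loc[]` slice includes its end label, while an `iloc[]` slice excludes its end position, as Python lists do.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Robert", "Ann"],
    "Age": [25, 22, 30, 28, 32],
}).set_index("Name")

# loc slices by label and INCLUDES the end label
by_label = people.loc["Bob":"Robert"]

# iloc slices by position and EXCLUDES the end position
by_position = people.iloc[1:3]

print(by_label.index.tolist())     # ['Bob', 'John', 'Robert']
print(by_position.index.tolist())  # ['Bob', 'John']
```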

---

## Lesson Summary and Practice

🎉 Congrats on completing this lesson!  
Here's what you've learned:
- Setting, resetting, and renaming indices using `set_index()`, `reset_index()`, and `rename()`.
- Accessing data using `loc[]` and `iloc[]`.

💡 **Practice Exercises:**  
To reinforce your knowledge, try some exercises based on this lesson.

Stay tuned for the next lesson, where we'll dive deeper into pandas and explore more useful features!


## Exploring School Cities in a DataFrame

This code is set in a world where we're curious about educational institutions and their student numbers. Imagine you have a list of schools and want to find out in which cities the "Greenwood Primary" and "Harbor Science" institutions are located. The answer is already programmed for you. Click Run to discover the cities of these institutions!

```python
import pandas as pd

# Creating a DataFrame of schools, student numbers, and cities
schools = pd.DataFrame({
    "Institution": ["Greenwood Primary", "Sunshine Secondary", "Harbor Science"],
    "Students": [320, 200, 256],
    "City": ["Cambridge", "Oxford", "Brighton"]
})

# Setting "Institution" as the index
schools.set_index("Institution", inplace=True)

# Selecting and printing the cities for specific institutions
print(schools.loc[["Greenwood Primary", "Harbor Science"], "City"])
```

### Expected Output:
```plaintext
Institution
Greenwood Primary    Cambridge
Harbor Science        Brighton
Name: City, dtype: object
```

### Explanation:
1. **Data Creation**: A DataFrame is created with three columns: `Institution`, `Students`, and `City`.  
2. **Set Index**: The `set_index()` method sets "Institution" as the index, making it easier to query specific schools.  
3. **Query**: The `loc[]` method is used to select the rows for "Greenwood Primary" and "Harbor Science", retrieving the corresponding cities.  

Run the code and see which cities these institutions belong to! 🌟
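
The expected output above is a Series because a single column label was passed. As a small sketch on the same data (the variable names are illustrative), passing the column as a one-element list returns a DataFrame instead:

```python
import pandas as pd

schools = pd.DataFrame({
    "Institution": ["Greenwood Primary", "Sunshine Secondary", "Harbor Science"],
    "Students": [320, 200, 256],
    "City": ["Cambridge", "Oxford", "Brighton"],
}).set_index("Institution")

rows = ["Greenwood Primary", "Harbor Science"]
as_series = schools.loc[rows, "City"]    # single column label -> Series
as_frame = schools.loc[rows, ["City"]]   # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```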

## Practicing Integer-based Indexing with iloc[]

All right! It's time to practice integer-based indexing with `iloc[]`. Modify the starter code to select and print the number of students for "Lincoln High" and "Oakwood College" using `iloc[]`, instead of their cities. You must use `iloc[]` for this task, even though it can be solved in other ways.

```python
import pandas as pd

schools = pd.DataFrame({
    "Institution": ["Greenwood Elementary", "Lincoln High", "Oakwood College"],
    "Students": [250, 400, 600],
    "City": ["Cambridge", "Arlington", "Madison"]
})

schools.set_index("Institution", inplace=True)
print(schools.loc[["Lincoln High", "Oakwood College"], "City"])  # Use iloc instead of loc
```

Here's how you can modify the code to use `iloc[]` for integer-based indexing to select and print the number of students for "Lincoln High" and "Oakwood College":

```python
import pandas as pd

# Creating the DataFrame
schools = pd.DataFrame({
    "Institution": ["Greenwood Elementary", "Lincoln High", "Oakwood College"],
    "Students": [250, 400, 600],
    "City": ["Cambridge", "Arlington", "Madison"]
})

# Setting "Institution" as the index
schools.set_index("Institution", inplace=True)

# Using iloc[] to select and print the number of students for "Lincoln High" and "Oakwood College"
print(schools.iloc[[1, 2], 0])  # Select rows 1 and 2, column 0 ("Students")
```

### Explanation:
1. **`iloc[]` for Integer-Based Indexing**:
   - Row `1`: Corresponds to "Lincoln High".
   - Row `2`: Corresponds to "Oakwood College".
   - Column `0`: Corresponds to the "Students" column.

### Expected Output:
```plaintext
Institution
Lincoln High       400
Oakwood College    600
Name: Students, dtype: int64
```

This approach demonstrates integer-based indexing to extract data based on row and column positions. 🎯
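
Hard-coded positions are easy to get wrong once a DataFrame grows. One possible way to avoid counting by hand, sketched here on the same data, is to look the positions up from the labels with `Index.get_indexer()` and `Index.get_loc()` and then pass them to `iloc[]`. The variable names are illustrative.

```python
import pandas as pd

schools = pd.DataFrame({
    "Institution": ["Greenwood Elementary", "Lincoln High", "Oakwood College"],
    "Students": [250, 400, 600],
    "City": ["Cambridge", "Arlington", "Madison"],
}).set_index("Institution")

# Translate row labels into integer positions
row_positions = schools.index.get_indexer(["Lincoln High", "Oakwood College"])

# Translate a column label into its integer position
col_position = schools.columns.get_loc("Students")

students = schools.iloc[row_positions, col_position]
print(students.tolist())  # [400, 600]
```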

## Data Retrieval Accuracy in School Records

Great job, Space Explorer!

We've noticed an issue with the school's data-fetching in our database. Could you spot the problem and ensure we're retrieving the correct information? Your task is to carefully inspect the code and correct it so that the correct school data is displayed.

```python
import pandas as pd

schools = pd.DataFrame({
    "School_ID": [101, 102, 103],
    "School_Name": ["Greenwood Elementary", "Pine Hill Academy", "Maplewood High"],
    "NumberOfStudents": [250, 300, 400]
})

schools.set_index("School_ID", inplace=True)
print(schools.iloc[[101], ["School_Name", "NumberOfStudents"]])
```

The issue lies in how `iloc[]` is used. **`iloc[]` accepts only integer positions**, but `[101]` is an index label (a `School_ID` value), not a position, and the DataFrame has only three rows. Likewise, `iloc[]` does not accept column labels such as `["School_Name", "NumberOfStudents"]`; it requires column positions.

Here's the corrected version of the code:

```python
import pandas as pd

# Creating the DataFrame
schools = pd.DataFrame({
    "School_ID": [101, 102, 103],
    "School_Name": ["Greenwood Elementary", "Pine Hill Academy", "Maplewood High"],
    "NumberOfStudents": [250, 300, 400]
})

# Setting "School_ID" as the index
schools.set_index("School_ID", inplace=True)

# Correctly retrieving data using iloc[] for position-based indexing
print(schools.iloc[[0], [0, 1]])  # Row position 0 and column positions 0, 1
```

### Explanation:
1. **`iloc[]` Indexing**:
   - `iloc[]` operates on zero-based positional indices, not the values in the index (`School_ID` in this case).
   - To retrieve the first row (`School_ID: 101`), use position `0`.
   - For columns `["School_Name", "NumberOfStudents"]`, their positional indices are `0` and `1`, respectively.

2. **Corrected Output**:
   ```plaintext
                       School_Name  NumberOfStudents
   School_ID
   101        Greenwood Elementary               250
   ```

### Key Takeaway:
- **`iloc[]`**: For position-based indexing (zero-based).  
- **`loc[]`**: For label-based indexing (uses actual index and column labels).  

This correction ensures the code works as intended and retrieves the correct school data! 🎉
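
Since the original call already used labels, another way to repair it is to keep the labels and switch to `loc[]`. The sketch below shows that alternative, and also confirms that `iloc[]` treats `101` as an out-of-range position. The `by_label` and `out_of_bounds` names are illustrative.

```python
import pandas as pd

schools = pd.DataFrame({
    "School_ID": [101, 102, 103],
    "School_Name": ["Greenwood Elementary", "Pine Hill Academy", "Maplewood High"],
    "NumberOfStudents": [250, 300, 400],
}).set_index("School_ID")

# loc treats 101 as an index label, so labels work for rows and columns
by_label = schools.loc[[101], ["School_Name", "NumberOfStudents"]]

# iloc treats 101 as a position; a 3-row frame has no position 101
try:
    schools.iloc[[101]]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_label)
print(out_of_bounds)  # True
```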

## Indexing Galactic Academy Data

Great discovery, Space Voyager! Now, I need you to set the correct index for our schools DataFrame and select a specific school by its label. It's a small step for you but a giant leap for your data journey!

```python
import pandas as pd

# Creating a DataFrame with educational institutions and student demographics
schools = pd.DataFrame({
    "School": ["Greenwood Elementary", "Sunset High School", "Riverside College"],
    "Students": [120, 240, 980],
    "City": ["Springfield", "Shelbyville", "Ogdenville"]
})

# TODO: Set the 'School' as the index to uniquely identify each row

# TODO: Select and print data for "Sunset High School" using label-based indexing
```

Here’s how to set the `School` column as the index and select data for **"Sunset High School"** using label-based indexing (`loc[]`):

```python
import pandas as pd

# Creating a DataFrame with educational institutions and student demographics
schools = pd.DataFrame({
    "School": ["Greenwood Elementary", "Sunset High School", "Riverside College"],
    "Students": [120, 240, 980],
    "City": ["Springfield", "Shelbyville", "Ogdenville"]
})

# Setting 'School' as the index to uniquely identify each row
schools.set_index("School", inplace=True)

# Selecting and printing data for "Sunset High School" using label-based indexing
print(schools.loc["Sunset High School"])
```

### Explanation:
1. **Set Index**:
   - `schools.set_index("School", inplace=True)` sets the `School` column as the index, making each school uniquely identifiable by its name.

2. **Label-Based Indexing with `loc[]`**:
   - `schools.loc["Sunset High School"]` retrieves the row corresponding to **"Sunset High School"** based on the index.

### Expected Output:
```plaintext
Students           240
City       Shelbyville
Name: Sunset High School, dtype: object
```

### Key Takeaway:
By setting an appropriate column as the index, you can efficiently retrieve rows using `loc[]`. This approach is particularly useful for datasets with unique identifiers like names, IDs, or codes. 🚀
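
As a short follow-up sketch on the same data (the variable names are illustrative), `loc[]` can also take a column label to return a single value, and a one-element list of row labels keeps the result as a DataFrame rather than a Series:

```python
import pandas as pd

schools = pd.DataFrame({
    "School": ["Greenwood Elementary", "Sunset High School", "Riverside College"],
    "Students": [120, 240, 980],
    "City": ["Springfield", "Shelbyville", "Ogdenville"],
}).set_index("School")

# Row label plus column label -> a single value
student_count = schools.loc["Sunset High School", "Students"]

# A one-element list of labels keeps the result as a DataFrame
one_row_frame = schools.loc[["Sunset High School"]]

print(student_count)        # 240
print(one_row_frame.shape)  # (1, 2)
```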

## Renaming a Column in a DataFrame

Stellar Navigator, you're doing great! Can you rename the column `School` to `Institution` for clarity? Update the DataFrame to reflect this change without altering any other parts.

```python
import pandas as pd

# Creating a DataFrame with information about students and schools
data = {
    "Student Name": ["James", "Anna", "Laura", "Peter"],
    "Grade": [9, 10, 8, 10],
    "School": ["Riverdale High", "Sunnydale High", "Riverdale High", "Westview High"]
}

df = pd.DataFrame(data)
df.set_index("Student Name", inplace=True)  # Setting 'Student Name' as the index
# TODO: rename the School column
print(df)
```

Here’s how to rename the `School` column to `Institution` without altering other parts of the DataFrame:

```python
import pandas as pd

# Creating a DataFrame with information about students and schools
data = {
    "Student Name": ["James", "Anna", "Laura", "Peter"],
    "Grade": [9, 10, 8, 10],
    "School": ["Riverdale High", "Sunnydale High", "Riverdale High", "Westview High"]
}

df = pd.DataFrame(data)

# Setting 'Student Name' as the index
df.set_index("Student Name", inplace=True)

# Renaming the 'School' column to 'Institution'
df.rename(columns={"School": "Institution"}, inplace=True)

# Printing the updated DataFrame
print(df)
```

### Explanation:
1. **`rename()` Method**:
   - Use the `rename()` method with the `columns` parameter to rename the `School` column to `Institution`.
   - The `inplace=True` parameter applies the change directly to the DataFrame.

2. **Resulting DataFrame**:
   ```plaintext
                 Grade     Institution
   Student Name
   James             9  Riverdale High
   Anna             10  Sunnydale High
   Laura             8  Riverdale High
   Peter            10   Westview High
   ```

### Key Takeaway:
The `rename()` method is a simple and efficient way to rename columns without affecting other parts of the DataFrame. 🌟
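
Two behaviors of `rename()` are worth seeing side by side. In this sketch, which uses a shortened version of the data and illustrative variable names, calling it without `inplace=True` returns a new DataFrame and leaves the original unchanged. A misspelled old name is silently ignored by default, and passing `errors="raise"` turns that into a `KeyError`.

```python
import pandas as pd

df = pd.DataFrame({
    "Student Name": ["James", "Anna"],
    "Grade": [9, 10],
    "School": ["Riverdale High", "Sunnydale High"],
}).set_index("Student Name")

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={"School": "Institution"})

# A misspelled old name is ignored by default; errors="raise" surfaces it
try:
    df.rename(columns={"Schol": "Institution"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(df.columns.tolist())       # ['Grade', 'School']
print(renamed.columns.tolist())  # ['Grade', 'Institution']
print(typo_detected)             # True
```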