# Explicit Indexes

In pandas, an **explicit index** means the rows of a DataFrame (or Series) are labeled with meaningful values instead of just the default integer range. These labels make it easier to navigate, organize, and analyze data.

### How indexes are created and managed

* By default, pandas assigns a numeric index (0, 1, 2, …).
* You can make an index explicit using methods like **`.set_index()`**.
* To restore the default numbering and move the index labels back into a regular column, use **`.reset_index()`**.
* To restore the default numbering and discard the old index labels entirely, use **`.reset_index(drop=True)`**.
* You can always check which labels are being used with **`.index`**.
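As a minimal sketch of the methods listed above, the following uses a small made-up DataFrame (the city names and values are illustrative and are not taken from the `temperatures` dataset):

```python
import pandas as pd

# Hypothetical toy data, not the temperatures dataset
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "avg_temp_c": [5.2, 19.1, 22.4],
})

# The default index is the integer range 0, 1, 2
default_labels = list(df.index)

# Promote "city" to the index
by_city = df.set_index("city")
city_labels = list(by_city.index)

# Move the index back into a column; numbering returns to 0, 1, 2
restored = by_city.reset_index()

# Discard the index labels entirely
dropped = by_city.reset_index(drop=True)

print(city_labels)
print(list(restored.columns))
print(list(dropped.columns))
```

Note that `.set_index()` and `.reset_index()` return new DataFrames here; `df` itself is left unchanged.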

### Why indexes are useful

Indexes simplify subsetting and filtering:

* With **`.loc[]`**, you can select rows directly by their label.
* With **`.isin()`** called on the index (`df.index.isin(...)`), you can build a Boolean mask of the rows whose labels appear in a list.
* Sorting data by labels is easy with **`.sort_index()`**.
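The three operations above might look like this on a small made-up DataFrame (the data is illustrative only):

```python
import pandas as pd

# Hypothetical toy data with "city" as the index; "Lima" appears twice
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo", "Lima"],
    "avg_temp_c": [5.2, 19.1, 22.4, 18.7],
}).set_index("city")

# Select rows by label; every row labeled "Lima" is returned
lima = df.loc[["Lima"]]

# Build a Boolean mask from index membership, then filter with it
mask = df.index.isin(["Oslo", "Cairo"])
oslo_cairo = df[mask]

# Sort rows alphabetically by their labels
sorted_df = df.sort_index()

print(lima)
print(oslo_cairo)
print(sorted_df)
```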

### Characteristics of indexes

* Index values **don’t have to be unique** (though unique indexes are often easier to work with).
* Indexes can be **multi-level** (also called hierarchical). This means rows can be labeled by more than one category.

  * You can subset the *outer level* with a simple list of labels.
  * You can subset *inner levels* using a list of tuples.
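Both characteristics can be sketched on a small made-up DataFrame. The countries, cities, and values below are illustrative assumptions, and sorting the multi-level index first is a common habit rather than a requirement:

```python
import pandas as pd

# Hypothetical toy data, not the temperatures dataset
df = pd.DataFrame({
    "country": ["Peru", "Peru", "Egypt", "Egypt", "Norway"],
    "city": ["Lima", "Cusco", "Cairo", "Giza", "Oslo"],
    "avg_temp_c": [19.1, 12.5, 22.4, 22.9, 5.2],
})

# Non-unique index: the labels "Peru" and "Egypt" each appear twice
by_country = df.set_index("country")
is_unique = by_country.index.is_unique
peru_rows = by_country.loc[["Peru"]]

# Two-level (hierarchical) index, sorted by both levels
nested = df.set_index(["country", "city"]).sort_index()

# Outer level: a plain list of labels
outer = nested.loc[["Egypt", "Norway"]]

# Inner level: a list of (outer, inner) tuples
inner = nested.loc[[("Egypt", "Giza"), ("Peru", "Lima")]]

print(is_unique)
print(outer)
print(inner)
```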

### Challenges with indexes

Indexes are powerful, but they come with some downsides:

1. **Index values are still data** – sometimes they hold information that might be better kept as a normal column.
2. **They can break the "tidy data" principle** – tidy data says each variable should have its own column, but putting a variable into the index hides it from being a regular column.
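To illustrate the second point, here is a small sketch (with made-up data) showing that a variable moved into the index is no longer reachable as a regular column until the index is reset:

```python
import pandas as pd

# Hypothetical toy data, not the temperatures dataset
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "avg_temp_c": [5.2, 19.1, 22.4],
})
by_city = df.set_index("city")

# "city" is no longer listed among the columns
city_is_column = "city" in by_city.columns

# Column-style access to "city" now raises a KeyError
try:
    by_city["city"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# reset_index() puts every variable back in its own column
tidy = by_city.reset_index()

print(city_is_column, lookup_failed)
print(list(tidy.columns))
```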

✨ In short: explicit indexes make your DataFrame more descriptive and flexible to query, but you need to be mindful that they can hide important data and sometimes complicate the structure.

## Preparing Data

In [2]:
import pandas as pd

# Load the temperatures data; the file's unnamed first column
# is read in as "Unnamed: 0"
temperatures = pd.read_csv("datasets/temperatures.csv")

## Exercise: Setting and removing indexes

In pandas, you can promote a column to become the index of a DataFrame. This often makes subsetting cleaner and can improve lookup speed.

In this activity, you’ll work with `temperatures`, a DataFrame containing average temperatures from cities around the world.

### Instructions

1. Display the original `temperatures` DataFrame.
2. Create a new DataFrame `temps_by_city` where the column `"city"` is set as the index.
3. Print `temps_by_city` and compare it with the original version.
4. Reset the index of `temps_by_city`, keeping the index values as a column.
5. Reset the index again, this time removing the index values completely.

In [3]:
# Step 1: Inspect the original data
print(temperatures)

# Step 2: Assign "city" as the index
temps_by_city = temperatures.set_index("city")

# Step 3: View the new DataFrame with "city" as index
print(temps_by_city)

# Step 4: Reset index but keep the "city" values as a column
print(temps_by_city.reset_index())

# Step 5: Reset index and drop the "city" values
print(temps_by_city.reset_index(drop=True))

       Unnamed: 0        date     city        country  avg_temp_c
0               0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1               1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2               2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3               3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4               4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
...           ...         ...      ...            ...         ...
16495       16495  2013-05-01     Xian          China      18.979
16496       16496  2013-06-01     Xian          China      23.522
16497       16497  2013-07-01     Xian          China      25.251
16498       16498  2013-08-01     Xian          China      24.528
16499       16499  2013-09-01     Xian          China         NaN

[16500 rows x 5 columns]
         Unnamed: 0        date        country  avg_temp_c
city                                                      
Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
...

## Exercise: Subsetting with .loc[]

One of the most useful features of indexes in pandas is the `.loc[]` accessor, which subsets rows directly by their index labels. Compared with building a Boolean mask from a column, `.loc[]` is often shorter and easier to follow.

In this task, you’ll practice filtering rows in two different ways: using regular column-based subsetting, and using `.loc[]` with an index.
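One difference between the two approaches is worth seeing on a small scale. In the sketch below (made-up data, deliberately out of order), column-based filtering keeps the original row order, while `.loc[]` returns rows in the order of the labels requested:

```python
import pandas as pd

# Hypothetical toy data with cities in no particular order
df = pd.DataFrame({
    "city": ["Paris", "Oslo", "London", "Paris", "London"],
    "avg_temp_c": [4.9, -2.1, 4.7, 5.6, 6.1],
})
cities = ["London", "Paris"]

# Column-based filtering: rows stay in their original order
by_column = df[df["city"].isin(cities)]

# Label-based selection: rows follow the order of `cities`
by_label = df.set_index("city").loc[cities]

order_column = list(by_column["city"])
order_label = list(by_label.index)

print(order_column)
print(order_label)
```

Both results contain the same rows; only their order differs.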

### Instructions

1. Create a list named `cities` containing `"London"` and `"Paris"`.
2. Use regular column filtering with `.isin()` to extract rows from `temperatures` where `"city"` is in `cities`.
3. Set `"city"` as the index of `temperatures` to create `temperatures_ind`, then use `.loc[]` on it to pull the rows for those same cities.

In [4]:
# Step 1: Define a list of cities to filter on
cities = ["London", "Paris"]

# Step 2: Filter using column-based subsetting
subset_col = temperatures[temperatures["city"].isin(cities)]
print(subset_col)

# Step 3: Filter using index labels with .loc[]
temperatures_ind = temperatures.set_index("city")
subset_loc = temperatures_ind.loc[cities]
print(subset_loc)

       Unnamed: 0        date    city         country  avg_temp_c
8910         8910  2000-01-01  London  United Kingdom       4.693
8911         8911  2000-02-01  London  United Kingdom       6.115
8912         8912  2000-03-01  London  United Kingdom       7.422
8913         8913  2000-04-01  London  United Kingdom       8.246
8914         8914  2000-05-01  London  United Kingdom      12.491
...           ...         ...     ...             ...         ...
12040       12040  2013-05-01   Paris          France      11.703
12041       12041  2013-06-01   Paris          France      16.340
12042       12042  2013-07-01   Paris          France      21.186
12043       12043  2013-08-01   Paris          France      19.235
12044       12044  2013-09-01   Paris          France         NaN

[330 rows x 5 columns]
        Unnamed: 0        date         country  avg_temp_c
city                                                      
London        8910  2000-01-01  United Kingdom       4.693
...

## Exercise: Setting multi-level indexes

In pandas, you can build indexes from more than one column, creating what’s called a **multi-level index** (or hierarchical index).

* **Advantage:** It allows you to organize data in a nested way that reflects real-world relationships. For example, cities naturally belong to countries, so it makes sense to nest `"city"` inside `"country"`.
* **Disadvantage:** The syntax for working with multi-level indexes is a bit different from working with regular columns, so you need to keep track of both approaches.
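To make the syntax difference concrete, the sketch below selects the same row twice, once with column conditions and once with a multi-level index. The data is a made-up toy example, and the values are not taken from `temperatures`:

```python
import pandas as pd

# Hypothetical toy data, not the temperatures dataset
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Lahore", "Karachi"],
    "avg_temp_c": [25.9, 26.3, 12.8, 19.0],
})

# Column-based: combine two Boolean conditions
col_subset = df[(df["country"] == "Brazil") & (df["city"] == "Salvador")]

# Index-based: pass a list of (country, city) tuples to .loc[]
nested = df.set_index(["country", "city"])
idx_subset = nested.loc[[("Brazil", "Salvador")]]

# The level names come from the original column names
level_names = list(nested.index.names)

print(col_subset)
print(idx_subset)
print(level_names)
```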

### Instructions

1. Convert the `"country"` and `"city"` columns into a multi-level index and store the result as `temperatures_ind`.
2. Create a list of two index pairs to keep:

   * `"Brazil"`, `"Rio De Janeiro"`
   * `"Pakistan"`, `"Lahore"`
3. Use `.loc[]` to select these rows from `temperatures_ind` and print the result.

In [5]:
# Step 1: Create a multi-level index with country and city
temperatures_ind = temperatures.set_index(["country", "city"])

# Step 2: Define the rows we want to keep
rows_to_keep = [("Brazil", "Rio De Janeiro"), 
                ("Pakistan", "Lahore")]

# Step 3: Subset using .loc[]
subset = temperatures_ind.loc[rows_to_keep]
print(subset)

                         Unnamed: 0        date  avg_temp_c
country  city                                              
Brazil   Rio De Janeiro       12540  2000-01-01      25.974
         Rio De Janeiro       12541  2000-02-01      26.699
         Rio De Janeiro       12542  2000-03-01      26.270
         Rio De Janeiro       12543  2000-04-01      25.750
         Rio De Janeiro       12544  2000-05-01      24.356
...                             ...         ...         ...
Pakistan Lahore                8575  2013-05-01      33.457
         Lahore                8576  2013-06-01      34.456
         Lahore                8577  2013-07-01      33.279
         Lahore                8578  2013-08-01      31.511
         Lahore                8579  2013-09-01         NaN

[330 rows x 3 columns]


## Exercise: Sorting by index values

So far, you’ve sorted rows in a DataFrame using **`.sort_values()`**. But when your DataFrame has an index that carries important meaning, you can also sort directly by index values using **`.sort_index()`**.

This is especially useful with **multi-level indexes**, since you can choose which levels to sort by with the `level` argument and set the direction for each one with `ascending`.

The DataFrame `temperatures_ind` (with a multi-level index of `country` and `city`) is available.
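Because the full dataset prints only its first and last rows, the effect of each sort can be hard to see. The following sketch uses a four-row made-up DataFrame (illustrative values only) where every country has two cities, so the ordering at each level is visible:

```python
import pandas as pd

# Hypothetical toy data with a (country, city) multi-level index
df = pd.DataFrame({
    "country": ["Peru", "Egypt", "Peru", "Egypt"],
    "city": ["Cusco", "Cairo", "Lima", "Giza"],
    "avg_temp_c": [12.5, 22.4, 19.1, 22.9],
}).set_index(["country", "city"])

# Sort by all levels, outer to inner, ascending
all_levels = df.sort_index()

# Sort by the "city" level
by_city = df.sort_index(level="city")

# Sort by country ascending, then city descending
mixed = df.sort_index(level=["country", "city"], ascending=[True, False])

print(all_levels)
print(by_city)
print(mixed)
```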


### Instructions

1. Sort `temperatures_ind` by all index levels.
2. Sort `temperatures_ind` by the `"city"` level of the index.
3. Sort `temperatures_ind` by `country` in ascending order, and then `city` in descending order.

In [7]:
# Step 1: Sort by all index values
print(temperatures_ind.sort_index())

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]


In [8]:
# Step 2: Sort by city-level index values
print(temperatures_ind.sort_index(level="city"))

                       Unnamed: 0        date  avg_temp_c
country       city                                       
Côte D'Ivoire Abidjan           0  2000-01-01      27.293
              Abidjan           1  2000-02-01      27.685
              Abidjan           2  2000-03-01      29.061
              Abidjan           3  2000-04-01      28.162
              Abidjan           4  2000-05-01      27.547
...                           ...         ...         ...
China         Xian          16495  2013-05-01      18.979
              Xian          16496  2013-06-01      23.522
              Xian          16497  2013-07-01      25.251
              Xian          16498  2013-08-01      24.528
              Xian          16499  2013-09-01         NaN

[16500 rows x 3 columns]


In [9]:
# Step 3: Sort by country (asc) and city (desc)
print(
    temperatures_ind.sort_index(
        level=["country", "city"], 
        ascending=[True, False]
    )
)

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]
