# Slicing and Subsetting with `.loc[]` and `.iloc[]`

### 1. What is Slicing?

Slicing is a way to select **consecutive elements** from a collection (lists, Series, or DataFrames). It lets you extract a range of values without writing them one by one.

### 2. Slicing Lists

* Lists can be sliced using square brackets `[]` with a start and stop position separated by a colon.
* Python uses **zero-based indexing**, so position `2` means the **third element**.
* The **stop value is exclusive**, meaning it won’t be included in the slice.
* You can leave out the start (`:3` → first three elements), the stop (`2:` → everything from the third element onward), or both (`:` → entire list).
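The rules above can be tried on a small list. The breed names below are made up for illustration only.

```python
# Illustrative list; the values are invented for this sketch.
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
          "Labrador", "Chihuahua", "St. Bernard"]

middle = breeds[2:5]       # positions 2, 3 and 4; position 5 is excluded
first_three = breeds[:3]   # start omitted: begins at position 0
from_third = breeds[2:]    # stop omitted: runs to the end of the list
everything = breeds[:]     # both omitted: a copy of the whole list

print(middle)
print(first_three)
print(len(from_third), len(everything))
```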

### 3. Sorting Indexes Before Slicing

Before slicing a DataFrame by index label, **sort the index first**.

* Example: A dataset of dogs with a multi-level index (`breed`, `color`) must be sorted using `.sort_index()` before slicing works properly.
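A minimal sketch of the sorting step, using a tiny made-up `dogs` table (the column names and values are assumptions, not the original dataset):

```python
import pandas as pd

# Invented data for illustration.
dogs = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Beagle", "Labrador", "Poodle"],
    "color": ["Black", "Brown", "Tan", "Black", "Grey"],
    "height_cm": [43, 56, 38, 59, 45],
})

# Build the multi-level index, then sort it by breed and then color.
dogs_ind = dogs.set_index(["breed", "color"])
dogs_srt = dogs_ind.sort_index()

# is_monotonic_increasing reports whether the index is in sorted order.
print(dogs_ind.index.is_monotonic_increasing)
print(dogs_srt.index.is_monotonic_increasing)
print(dogs_srt)
```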

### 4. Slicing the Outer Index Level

* Use `.loc[]` with the **first and last index values**, separated by a colon.
* Unlike lists, the **end value is included** in the result.
* Example: slicing from `Beagle` to `Poodle` includes both.
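As a rough illustration of the inclusive end point, the sketch below slices the outer level of a small invented `dogs` table:

```python
import pandas as pd

# Invented data for illustration.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Beagle", "Schnauzer",
              "Poodle", "Chihuahua", "Labrador"],
    "color": ["Brown", "Grey", "Tan", "Grey", "Black", "Tan", "Black"],
    "height_cm": [56, 45, 38, 49, 43, 18, 59],
})
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Slice the outer level: both "Chihuahua" and "Poodle" rows are returned.
subset = dogs_srt.loc["Chihuahua":"Poodle"]
breeds_found = subset.index.get_level_values("breed").unique().tolist()

print(subset)
print(breeds_found)
```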

### 5. Pitfall: Inner Index Slicing Gone Wrong

* Slicing directly on **inner index levels** doesn’t work as expected, because a plain slice is compared against the **outer** level.
* Example: slicing `Tan:Grey` returns an **empty DataFrame** instead of the dogs you expect.
* pandas doesn’t raise an error, so the mistake is easy to miss.

### 6. Correct Way: Inner Index Slicing

* For inner levels, you must use **tuples**.
* Example: to slice from a brown Labrador to a black Poodle, pass tuples like `("Labrador", "Brown") : ("Poodle", "Black")`.
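The pitfall and the tuple-based fix can be compared side by side. This is a sketch on invented data, not the original dogs dataset:

```python
import pandas as pd

# Invented data for illustration.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Beagle", "Schnauzer",
              "Poodle", "Chihuahua", "Labrador"],
    "color": ["Brown", "Grey", "Tan", "Grey", "Black", "Tan", "Black"],
    "height_cm": [56, 45, 38, 49, 43, 18, 59],
})
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Pitfall: the colors are compared against the breed level, so nothing matches.
wrong = dogs_srt.loc["Tan":"Grey"]

# Fix: give (outer, inner) tuples for both ends of the slice.
right = dogs_srt.loc[("Labrador", "Brown"):("Poodle", "Black")]

print(len(wrong))
print(right)
```

With this sample data the tuple slice keeps only the brown Labrador and the black Poodle, because no other rows sort between them.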

### 7. Slicing Columns

* DataFrames are **two-dimensional**, so you can slice rows and columns.
* With `.loc[]`, provide two arguments:

  * First → row slice (`:` for all rows)
  * Second → column slice (`"col1":"col3"`)

### 8. Slicing Rows and Columns Together

* You can slice both dimensions in one step.
* Example:

  * First argument → row range
  * Second argument → column range
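A short sketch of column slicing, alone and combined with a row slice. The table and its column names (`name`, `height_cm`, `weight_kg`) are made up for illustration:

```python
import pandas as pd

# Invented data for illustration.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Beagle", "Poodle", "Labrador"],
    "color": ["Brown", "Grey", "Tan", "Black", "Black"],
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 45, 38, 43, 59],
    "weight_kg": [24, 22, 10, 23, 29],
})
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# All rows, columns from "name" through "height_cm" (end column included).
cols = dogs_srt.loc[:, "name":"height_cm"]

# Row slice and column slice in a single .loc[] call.
both = dogs_srt.loc[("Labrador", "Brown"):("Poodle", "Black"),
                    "name":"height_cm"]

print(cols)
print(both)
```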

### 9. Slicing by Dates

* Date indexes are a common use case.
* Set a `date` column as the index and sort it.
* Then slice by start and end dates using strings.

### 10. Partial Date Slicing

* You don’t need full dates; **partial dates work too**.
* Example: on a datetime index, slicing from `"2014"` to `"2016"` includes all dates in 2014, 2015, and 2016.
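The sketch below shows full and partial date slicing on a small invented table. It assumes the date column is converted with `pd.to_datetime()` so that the index is a `DatetimeIndex`; the names and dates are illustrative.

```python
import pandas as pd

# Invented data for illustration; dates are parsed into datetime values.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "date_of_birth": pd.to_datetime([
        "2013-07-01", "2014-08-25", "2015-04-20", "2016-09-16", "2017-01-20",
    ]),
})
dogs_by_date = dogs.set_index("date_of_birth").sort_index()

# Full dates as strings: both end points are included.
full = dogs_by_date.loc["2014-08-25":"2016-09-16"]

# Partial dates: "2016" covers every date in 2016, not just January 1st.
partial = dogs_by_date.loc["2014":"2016"]

print(full)
print(partial)
```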

### 11. Subsetting with `.iloc[]`

* `.iloc[]` slices by **row and column numbers** instead of labels.
* Syntax: `.iloc[row_start:row_stop, col_start:col_stop]`
* Works like list slicing: the **final position is excluded**.
* Example: the row slice `0:5` returns rows 0 through 4.

✅ **Key Difference:**

* `.loc[]` → uses **labels** (inclusive of end).
* `.iloc[]` → uses **numbers** (exclusive of end).
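To see the two end-point conventions next to each other, here is a minimal sketch on an invented table with breed names as the index:

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {"height_cm": [38, 18, 56, 43, 49], "weight_kg": [10, 2, 24, 23, 17]},
    index=["Beagle", "Chihuahua", "Labrador", "Poodle", "Schnauzer"],
)

# Label slice: the end label "Poodle" is included.
by_label = df.loc["Chihuahua":"Poodle"]

# Position slice: position 3 ("Poodle") is excluded.
by_position = df.iloc[1:3]

# Two-dimensional position slice: first two rows, first column only.
corner = df.iloc[:2, :1]

print(by_label.index.tolist())
print(by_position.index.tolist())
print(corner)
```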

## Preparing Data

In [1]:
import pandas as pd
temperatures = pd.read_csv("datasets/temperatures.csv")

## Exercise: Slicing Index Values

Slicing allows you to select consecutive elements of an object using the `first:last` syntax. In DataFrames, this can be done with **index values** or by **row/column numbers**.
Here, we’ll focus on slicing by index values using the `.loc[]` method.

### Important Notes:

* Label slicing is only reliable on an index that has been **sorted** using `.sort_index()`.
* To slice at the **outer index level**, provide the first and last values as strings.
* To slice at the **inner index levels**, you need to pass the values as **tuples**.
* Passing a single slice to `.loc[]` means you’re slicing **rows**.

💡 In this exercise, the DataFrame `temperatures_ind` has a **multi-level index** of `country` and `city`.

### Instructions

1. Sort the index of `temperatures_ind`.
2. Use slicing with `.loc[]` to get these subsets:

   * Rows from **Pakistan** to **Philippines**.
   * Rows from **Lahore** to **Manila** (⚠️ gives a meaningless result because the city names are compared against the outer `country` level).
   * Rows from **Pakistan, Lahore** to **Philippines, Manila**.

In [5]:
# Create a multi-level index with country and city
temperatures_ind = temperatures.set_index(["country", "city"])

# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Philippines
print(temperatures_srt.loc["Pakistan" : "Philippines"])

                        Unnamed: 0        date  avg_temp_c
country     city                                          
Pakistan    Faisalabad        4785  2000-01-01      12.792
            Faisalabad        4786  2000-02-01      14.339
            Faisalabad        4787  2000-03-01      20.309
            Faisalabad        4788  2000-04-01      29.072
            Faisalabad        4789  2000-05-01      34.845
...                            ...         ...         ...
Philippines Manila            9895  2013-05-01      29.552
            Manila            9896  2013-06-01      28.572
            Manila            9897  2013-07-01      27.266
            Manila            9898  2013-08-01      26.754
            Manila            9899  2013-09-01         NaN

[825 rows x 3 columns]


In [6]:
# Try to subset rows from Lahore to Manila (nonsense result)
print(temperatures_srt.loc["Lahore" : "Manila"])

Empty DataFrame
Columns: [Unnamed: 0, date, avg_temp_c]
Index: []


In [7]:
# Subset rows from Pakistan, Lahore to Philippines, Manila
print(temperatures_srt.loc[("Pakistan", "Lahore") : ("Philippines", "Manila")])

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Pakistan    Lahore        8415  2000-01-01      12.792
            Lahore        8416  2000-02-01      14.339
            Lahore        8417  2000-03-01      20.309
            Lahore        8418  2000-04-01      29.072
            Lahore        8419  2000-05-01      34.845
...                        ...         ...         ...
Philippines Manila        9895  2013-05-01      29.552
            Manila        9896  2013-06-01      28.572
            Manila        9897  2013-07-01      27.266
            Manila        9898  2013-08-01      26.754
            Manila        9899  2013-09-01         NaN

[495 rows x 3 columns]


## Exercise: Slicing in both directions

DataFrames are two-dimensional, which means you can slice them **by rows and columns at the same time**. Using `.loc[]`, you simply provide two arguments: the first for rows and the second for columns.

The DataFrame `temperatures_srt` is already indexed by `country` and `city`, sorted, and available for use.


### Instructions

1. Select all rows from **India, Hyderabad** through **Iraq, Baghdad**.
2. Select all columns from **date** through **avg\_temp\_c**.
3. Perform slicing in **both directions at once**, combining the above row and column slices.


In [8]:
# Step 1: Slice rows from India-Hyderabad to Iraq-Baghdad
rows_slice = temperatures_srt.loc[("India", "Hyderabad") : ("Iraq", "Baghdad")]
print(rows_slice)

                   Unnamed: 0        date  avg_temp_c
country city                                         
India   Hyderabad        5940  2000-01-01      23.779
        Hyderabad        5941  2000-02-01      25.826
        Hyderabad        5942  2000-03-01      28.821
        Hyderabad        5943  2000-04-01      32.698
        Hyderabad        5944  2000-05-01      32.438
...                       ...         ...         ...
Iraq    Baghdad          1150  2013-05-01      28.673
        Baghdad          1151  2013-06-01      33.803
        Baghdad          1152  2013-07-01      36.392
        Baghdad          1153  2013-08-01      35.463
        Baghdad          1154  2013-09-01         NaN

[2145 rows x 3 columns]


In [9]:
# Step 2: Slice columns from 'date' to 'avg_temp_c'
cols_slice = temperatures_srt.loc[:, "date":"avg_temp_c"]
print(cols_slice)

                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020
            Harare  2013-07-01      16.299
            Harare  2013-08-01      19.232
            Harare  2013-09-01         NaN

[16500 rows x 2 columns]


In [10]:
# Step 3: Slice rows and columns together
both_slice = temperatures_srt.loc[
    ("India", "Hyderabad") : ("Iraq", "Baghdad"),
    "date":"avg_temp_c"
]
print(both_slice)

                         date  avg_temp_c
country city                             
India   Hyderabad  2000-01-01      23.779
        Hyderabad  2000-02-01      25.826
        Hyderabad  2000-03-01      28.821
        Hyderabad  2000-04-01      32.698
        Hyderabad  2000-05-01      32.438
...                       ...         ...
Iraq    Baghdad    2013-05-01      28.673
        Baghdad    2013-06-01      33.803
        Baghdad    2013-07-01      36.392
        Baghdad    2013-08-01      35.463
        Baghdad    2013-09-01         NaN

[2145 rows x 2 columns]


## Exercise: Slicing time series

Time series data often requires selecting rows within specific **date ranges**. To do this efficiently:

1. Set the date column as the **index** and sort it.
2. Use `.loc[]` to perform slicing.

Remember to keep dates in **ISO 8601 format**:

* `"yyyy-mm-dd"` → full date
* `"yyyy-mm"` → year and month
* `"yyyy"` → year only

You can also filter using **Boolean conditions** with logical operators like `&`. When combining multiple conditions, wrap each in parentheses `()`.

The DataFrame `temperatures` is available with no index, and pandas is loaded as `pd`.

### Instructions

1. Use **Boolean conditions** (not `.isin()` or `.loc[]`) to select rows from 2010 and 2011.
2. Set the `date` column as the index and sort it.
3. Use `.loc[]` to select rows from 2010 to 2011.
4. Use `.loc[]` to select rows from **August 2010** through **February 2011**.

In [11]:
# Step 1: Filter using Boolean conditions for 2010 and 2011
temps_2010_2011 = temperatures[
    (temperatures["date"] >= "2010-01-01") & 
    (temperatures["date"] <= "2011-12-31")
]
print(temps_2010_2011)

       Unnamed: 0        date     city        country  avg_temp_c
120           120  2010-01-01  Abidjan  Côte D'Ivoire      28.270
121           121  2010-02-01  Abidjan  Côte D'Ivoire      29.262
122           122  2010-03-01  Abidjan  Côte D'Ivoire      29.596
123           123  2010-04-01  Abidjan  Côte D'Ivoire      29.068
124           124  2010-05-01  Abidjan  Côte D'Ivoire      28.258
...           ...         ...      ...            ...         ...
16474       16474  2011-08-01     Xian          China      23.069
16475       16475  2011-09-01     Xian          China      16.775
16476       16476  2011-10-01     Xian          China      12.587
16477       16477  2011-11-01     Xian          China       7.543
16478       16478  2011-12-01     Xian          China      -0.490

[2400 rows x 5 columns]


In [13]:
# Step 2: Set date as index and sort
temps_indexed = temperatures.set_index("date").sort_index()

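# Step 3: Slice using .loc[] for full years
# Note: "date" holds plain strings here, so the slice compares text and stops
# at the last label <= "2011", which is "2010-12-01"; 2011 rows are excluded.
print(temps_indexed.loc["2010":"2011"])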

            Unnamed: 0        city    country  avg_temp_c
date                                                     
2010-01-01        4905  Faisalabad   Pakistan      11.810
2010-01-01       10185   Melbourne  Australia      20.016
2010-01-01        3750   Chongqing      China       7.921
2010-01-01       13155   São Paulo     Brazil      23.738
2010-01-01        5400   Guangzhou      China      14.136
...                ...         ...        ...         ...
2010-12-01        6896     Jakarta  Indonesia      26.602
2010-12-01        5246       Gizeh      Egypt      16.530
2010-12-01       11186      Nagpur      India      19.120
2010-12-01       14981      Sydney  Australia      19.559
2010-12-01       13496    Salvador     Brazil      26.265

[1200 rows x 4 columns]


In [14]:
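# Step 4: Slice using .loc[] for specific months across years
# Note: with string dates the end label "2011-02" sorts before "2011-02-01",
# so February 2011 is excluded and the result ends at "2011-01-01".
print(temps_indexed.loc["2010-08":"2011-02"])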

            Unnamed: 0           city        country  avg_temp_c
date                                                            
2010-08-01        2602       Calcutta          India      30.226
2010-08-01       12337           Pune          India      24.941
2010-08-01        6562          Izmir         Turkey      28.352
2010-08-01       15637        Tianjin          China      25.543
2010-08-01        9862         Manila    Philippines      27.101
...                ...            ...            ...         ...
2011-01-01        4257  Dar Es Salaam       Tanzania      28.541
2011-01-01       11352        Nairobi          Kenya      17.768
2011-01-01         297    Addis Abeba       Ethiopia      17.708
2011-01-01       11517        Nanjing          China       0.144
2011-01-01       11847       New York  United States      -4.463

[600 rows x 4 columns]
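The Boolean filter in step 1 returned 2400 rows, while `.loc["2010":"2011"]` returned 1200 rows ending at `2010-12-01`. This is consistent with `date` having been read from the CSV as plain strings, in which case `.loc[]` compares labels as text instead of treating `"2011"` as a whole year. The sketch below reproduces the difference on a tiny invented table and shows one possible remedy, converting the column with `pd.to_datetime()` before setting it as the index.

```python
import pandas as pd

# Invented data for illustration; dates start out as plain strings.
temps = pd.DataFrame({
    "date": ["2010-06-01", "2010-12-01", "2011-01-01",
             "2011-12-01", "2012-01-01"],
    "avg_temp_c": [20.1, 5.3, 4.0, 6.2, 3.9],
})

# String index: "2011-01-01" sorts after "2011", so 2011 rows are dropped.
as_text = temps.set_index("date").sort_index()
text_slice = as_text.loc["2010":"2011"]

# Datetime index: "2011" is understood as the whole year.
as_dates = (
    temps.assign(date=pd.to_datetime(temps["date"]))
    .set_index("date")
    .sort_index()
)
date_slice = as_dates.loc["2010":"2011"]

print(len(text_slice), len(date_slice))
```

For the real dataset, the same effect could be had by passing `parse_dates=["date"]` to `pd.read_csv()`. The row counts printed in the exercise would then differ from the ones shown here.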


## Exercise: Subsetting by row/column number

While we often subset rows using **Boolean conditions** or **index labels**, sometimes it's useful to select rows or columns by **their integer positions**.

The `.iloc[]` method allows you to do this. Like `.loc[]`, it can take two arguments:

* First for rows
* Second for columns

The DataFrame `temperatures` (without a special index) is available and pandas is loaded as `pd`.

### Instructions

1. Select the **23rd row, 2nd column** (remember, Python is 0-indexed).
2. Select the **first 5 rows**.
3. Select **all rows**, but only **columns 3 and 4**.
4. Select the **first 5 rows** and **columns 3 and 4** simultaneously.

In [15]:
# Step 1: 23rd row, 2nd column (index 22,1)
print(temperatures.iloc[22, 1])

2001-11-01


In [16]:
# Step 2: First 5 rows
print(temperatures.iloc[:5])

   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547


In [17]:
# Step 3: All rows, columns 3 and 4 (index 2 and 3)
print(temperatures.iloc[:, 2:4])

          city        country
0      Abidjan  Côte D'Ivoire
1      Abidjan  Côte D'Ivoire
2      Abidjan  Côte D'Ivoire
3      Abidjan  Côte D'Ivoire
4      Abidjan  Côte D'Ivoire
...        ...            ...
16495     Xian          China
16496     Xian          China
16497     Xian          China
16498     Xian          China
16499     Xian          China

[16500 rows x 2 columns]


In [18]:
# Step 4: First 5 rows, columns 3 and 4 at once
print(temperatures.iloc[:5, 2:4])

      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
4  Abidjan  Côte D'Ivoire
