# loc vs iloc... which one to use?

loc and iloc are really useful, and it can be confusing to know which one to use.

Let's experiment with a dataset and see if we can't create clear examples of why one works when the other doesn't and *vice versa*.

Steve Taylor, March 2021

In [1]:
import pandas as pd

## The Setup

Read in the file as we always do.

In [2]:
students = pd.read_csv("students.csv")

Let's check out what we've loaded...

In [3]:
students.head()

,studentID,firstName,lastName,birthdate,Points
0,1,Amy,Willis,10/23/1991,18.032651
1,2,Donald,Pierce,4/7/1990,79.671554
2,3,Adam,Holmes,5/16/1991,10.495381
3,4,Patrick,Payne,12/29/1990,33.449285
4,5,Chris,Lynch,10/3/1990,33.654615


Aside:
>When we look at the `dtypes` of the dataframe, the *studentID* column is an `int64` (a fixed-width NumPy type backed by C, not Python's own `int`).
>
>Pandas will go to great lengths to figure out if things are numbers (floats show up as `float64` here). Because so much of the inside of Pandas is written in the C language, and "strings" are really just addresses in memory pointing to objects, the Pandas dtype of what we know as `str` is `object`.
>
>It's not important to grok C types here, but it is useful to know that Pandas' bias is to try to figure out what the types are on its own. You can override these types when creating dataframes, and we'll look at that later in class.

In [4]:
students.dtypes

studentID      int64
firstName     object
lastName      object
birthdate     object
Points       float64
dtype: object
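
To make the aside about overriding types concrete, here is a minimal sketch. The two-row CSV text and the choice of `int32` are made up for illustration; they are not part of students.csv. It assumes the `dtype=` argument of `pd.read_csv`, which takes a column-to-type mapping.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "studentID,firstName,Points\n1,Amy,18.0\n2,Donald,79.5\n"

# Let Pandas infer the types on its own.
inferred = pd.read_csv(io.StringIO(csv_text))

# Override the inferred type for one column; the others are still inferred.
overridden = pd.read_csv(io.StringIO(csv_text), dtype={"studentID": "int32"})

print(inferred.dtypes)
print(overridden.dtypes)
```

Only the column named in the mapping changes; everything else keeps whatever Pandas worked out for itself.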

In [5]:
# I'm not "sticking" the index... just looking at it...
students.set_index("studentID")

studentID,firstName,lastName,birthdate,Points
1,Amy,Willis,10/23/1991,18.032651
2,Donald,Pierce,4/7/1990,79.671554
3,Adam,Holmes,5/16/1991,10.495381
4,Patrick,Payne,12/29/1990,33.449285
5,Chris,Lynch,10/3/1990,33.654615
...,...,...,...,...
96,Matthew,Cunningham,11/5/1991,84.951415
97,Kelly,Lynch,3/18/1990,4.377005
98,Ryan,Little,12/23/1988,52.021577
99,Willie,Taylor,12/6/1989,13.154354
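
The "just looking" comment in that cell is worth a tiny demonstration. This is a sketch on a made-up three-row frame (the names `df` and `preview` are only for illustration): `set_index` hands back a new dataframe and leaves the original alone unless we assign the result.

```python
import pandas as pd

# Made-up rows, not the real students.csv.
df = pd.DataFrame(
    {"studentID": [1, 2, 3], "firstName": ["Amy", "Donald", "Adam"]}
)

# set_index returns a new dataframe; df itself is not modified.
preview = df.set_index("studentID")
original_columns = list(df.columns)
print(original_columns)        # studentID is still an ordinary column
print(list(preview.columns))   # in the preview it has moved to the index

# To make it "stick", assign the result back.
df = df.set_index("studentID")
print(df.index.name)
```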


## Shaking Things Up

In [6]:
# because I wanted to shuffle the original
students = students.sample(frac=1).reset_index(drop=True)

When we look at the shuffled dataframe, notice how we've eliminated the relationship between the index and the studentID.

In [7]:
students.head()

,studentID,firstName,lastName,birthdate,Points
0,5,Chris,Lynch,10/3/1990,33.654615
1,41,Tina,Kelley,8/28/1988,74.862142
2,23,Angela,Ray,3/31/1991,36.966729
3,29,Andrea,Perkins,3/26/1988,1.41317
4,85,Deborah,Owens,5/7/1991,49.93798
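
Here is a small sketch of what the shuffle line is doing, on a made-up five-row frame. The `random_state=0` argument is an assumption added only so the shuffle is repeatable; the original cell does not use it.

```python
import pandas as pd

# Made-up frame; the default index is 0..4.
df = pd.DataFrame({"studentID": [1, 2, 3, 4, 5]})

# sample(frac=1) returns every row in random order.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled.index.tolist())     # the old index labels travel with their rows

# reset_index(drop=True) throws the old labels away and renumbers from 0.
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Without `drop=True`, the old labels would be kept as a new column instead of being discarded.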


In [8]:
# Now set the index
students = students.set_index("studentID")

Show the student first and last names, with IDs of 1, 50, and 100.

When the studentIDs matched the positions from the original file, you could get away with using a row's position (the ID minus one, of course) to find a student. But now, if you use iloc, you absolutely won't get what you want.

In [9]:
students.iloc[[0, 49, 99], [0, 1]]

studentID,firstName,lastName
5,Chris,Lynch
40,Phillip,Jenkins
82,Stephen,Knight


Using loc we get what we're looking for using the student IDs.

In [10]:
students.loc[[1, 50, 100], ["firstName", "lastName"]]

studentID,firstName,lastName
1,Amy,Willis
50,Gerald,Gutierrez
100,Lisa,Austin
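
The difference between the last two cells can be boiled down to a three-row sketch. The frame below is made up and deliberately out of order; it only assumes that studentID is the index, as above.

```python
import pandas as pd

# Made-up rows, stored in a shuffled order.
df = pd.DataFrame(
    {"studentID": [3, 1, 2], "firstName": ["Adam", "Amy", "Donald"]}
).set_index("studentID")

# iloc counts positions: row 0 is whichever row happens to come first.
by_position = df.iloc[0, 0]
print(by_position)   # Adam

# loc looks up labels: 1 means "the row whose studentID is 1".
by_label = df.loc[1, "firstName"]
print(by_label)      # Amy
```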


Aside:
>For most of our purposes, you can use a tuple where you'd use a list. Think of tuples as read-only (that's *immutable* in the lingo) lists; if you try to change the value of something in a tuple (e.g., `my_tuple[3] = "something new"`), you'll get an error. As a general rule, if you use a tuple you are relying on Python to catch you if you try to change something, but just as importantly, you are telling anyone who reads your code that the value is meant to be read-only, similar to a constant.

In [11]:
students.loc[(1, 50, 100), ("firstName", "lastName")]

studentID,firstName,lastName
1,Amy,Willis
50,Gerald,Gutierrez
100,Lisa,Austin
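
The read-only behavior from the aside takes only a few lines of plain Python to see. The variable names are just for illustration.

```python
ids_list = [1, 50, 100]
ids_tuple = (1, 50, 100)

# Lists are mutable, so item assignment works.
ids_list[0] = 2

# Tuples are immutable, so the same assignment raises a TypeError.
try:
    ids_tuple[0] = 2
except TypeError as err:
    error_message = str(err)
    print(error_message)

print(ids_list)    # [2, 50, 100]
print(ids_tuple)   # (1, 50, 100)
```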


>Back to loc and iloc!

When I want the second set of 7 **rows** of the students in the dataframe:

In [12]:
students.iloc[7:14]

studentID,firstName,lastName,birthdate,Points
53,Eugene,Simpson,10/6/1990,97.350789
96,Matthew,Cunningham,11/5/1991,84.951415
55,Louise,Alexander,2/19/1990,22.813329
65,Kathleen,Greene,4/15/1990,13.37381
7,James,Lawson,10/17/1989,8.996545
69,Diana,Armstrong,5/18/1989,20.902868
46,Roy,Olson,5/25/1990,16.246391


A fun aside: you can do the same thing with square brackets on the dataframe alone, but unlike iloc, it doesn't take a second argument for columns.

In [13]:
students[7:14]

studentID,firstName,lastName,birthdate,Points
53,Eugene,Simpson,10/6/1990,97.350789
96,Matthew,Cunningham,11/5/1991,84.951415
55,Louise,Alexander,2/19/1990,22.813329
65,Kathleen,Greene,4/15/1990,13.37381
7,James,Lawson,10/17/1989,8.996545
69,Diana,Armstrong,5/18/1989,20.902868
46,Roy,Olson,5/25/1990,16.246391
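
To see where the two spellings part ways, here is a sketch on a made-up four-row frame. Both take the same rows by position, but only iloc lets us narrow the columns in the same pair of brackets.

```python
import pandas as pd

# Made-up rows with a shuffled integer index.
df = pd.DataFrame(
    {
        "firstName": ["Amy", "Donald", "Adam", "Patrick"],
        "lastName": ["Willis", "Pierce", "Holmes", "Payne"],
    },
    index=[30, 10, 40, 20],
)

# Plain brackets with a slice: rows by position, all columns.
rows_only = df[1:3]

# iloc: same rows by position, plus a column selection.
rows_and_cols = df.iloc[1:3, [0]]

print(rows_only)
print(rows_and_cols)
```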


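If we use loc now, we get something that's pretty wild. With loc, `7:14` is a slice of *labels*, not positions: it starts at the row whose student ID is 7 and ends at the row whose student ID is 14 (look at the first item and the last to verify), and it includes both endpoints plus every row that happens to sit between them in the shuffled order. This might not be what you were looking for. Customarily, when we're looking to select things by business value, we'll use different techniques, like a boolean expression (or `where()`, which keeps every row but blanks out the values that don't match).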

In [14]:
students.loc[7:14]

studentID,firstName,lastName,birthdate,Points
7,James,Lawson,10/17/1989,8.996545
69,Diana,Armstrong,5/18/1989,20.902868
46,Roy,Olson,5/25/1990,16.246391
78,Daniel,Wallace,3/25/1991,82.183959
60,Donna,Lopez,3/7/1988,39.466053
100,Lisa,Austin,11/17/1990,94.808464
99,Willie,Taylor,12/6/1989,13.154354
15,Donald,Brooks,5/14/1989,30.137299
58,Wanda,Davis,7/27/1990,46.715852
24,Raymond,Garcia,12/8/1989,5.904415
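
A five-row sketch makes the label-slice behavior easier to see than the full shuffled dataframe. The rows and their order are made up for illustration.

```python
import pandas as pd

# Made-up rows; note that IDs 7 and 14 are not next to each other numerically.
df = pd.DataFrame(
    {"firstName": ["Chris", "James", "Diana", "Roy", "Amy"]},
    index=pd.Index([5, 7, 69, 14, 1], name="studentID"),
)

# loc slices by label: from the row labeled 7 through the row labeled 14,
# both included, in whatever order the rows are stored.
label_slice = df.loc[7:14]
print(label_slice.index.tolist())      # [7, 69, 14]

# iloc slices by position and excludes the end position.
position_slice = df.iloc[1:3]
print(position_slice.index.tolist())   # [7, 69]
```

Student 69 comes along for the ride with loc simply because that row sits between the two labels.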


We can always select with loc by using an expression. In this case we compare against `.index` rather than a named column, because studentID is now the index.

It's worth knowing that `((students.index >= 7) & (students.index < 14))` evaluates to something called a boolean array, which is enough like a list -- we might say "list-like" -- to satisfy the accessor. More on that below.

In [15]:
students.loc[((students.index >= 7) & (students.index < 14))]

studentID,firstName,lastName,birthdate,Points
7,James,Lawson,10/17/1989,8.996545
8,Barbara,Robertson,12/5/1991,60.141281
9,Louis,Simpson,12/13/1990,58.775231
13,Kenneth,Davis,2/13/1990,44.17353
11,Andrew,Thompson,6/25/1988,8.550434
10,Dennis,Gilbert,12/7/1990,29.948968
12,Lillian,Richards,4/29/1990,17.350226
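
It can help to pull the expression out and look at it by itself. This is a sketch on a made-up four-row frame; `mask` is just a name chosen for illustration.

```python
import pandas as pd

# Made-up rows with a shuffled studentID index.
df = pd.DataFrame(
    {"Points": [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index([14, 7, 3, 9], name="studentID"),
)

# Each comparison yields one True/False per row; & combines them row by row.
mask = (df.index >= 7) & (df.index < 14)
print(mask.tolist())              # [False, True, False, True]

# loc keeps the rows where the mask is True, wherever they sit.
selected = df.loc[mask]
print(selected.index.tolist())    # [7, 9]
```

Because the mask tests values rather than positions, the stored order of the rows no longer matters.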


## Notes on Accessors and Functions

The use of iloc and loc (as well as many other things, like indexing a dataframe itself) can be confusing because we often get sloppy and teach them as functions (e.g., `.loc()` is **wrong**). iloc and loc are specifically *not* functions. They are called "accessors", and they take a list-like selection inside square brackets. How do we know that? First, of course, is the docs, e.g., [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) from the official Pandas documentation.

But second is how we use it. For instance, in our example above of `students.loc[[1, 50, 100], ["firstName", "lastName"]]`, we are using .loc**[]**. The square brackets are an *indexing operator*, which tells us that we need to use it with a list-like object (I switched it back to lists from tuples, but both are list-like).

So with the example, let's look at the list that loc is using. It's a list of two lists (strictly speaking, Python packs the two comma-separated items into a tuple):
```python
.loc[
    [1, 50, 100], 
    ["firstName", "lastName"]
]
```

The first list is `[1, 50, 100]`, and the second list is `["firstName", "lastName"]`. If we check the docs, we'll find the second list is optional, which is why we often see something like `students.loc[[1, 50, 100]]`. That's a single list inside the outer indexing operator... think list-of-lists.

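This applies to .loc, .iloc, or use of the dataframe itself, e.g., `students[0:15]` is a single item in the indexing operator of the dataframe variable, right? Right! Strictly speaking, `0:15` doesn't produce a list; Python turns it into a `slice` object, which the indexing operator accepts in place of one.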
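
If you want to see exactly what the square brackets hand over, a throwaway class works as a probe. `ShowKey` below is a hypothetical helper written for this note, not anything from Pandas; it just returns whatever arrives in `__getitem__`, which is the method Python calls for `obj[...]`.

```python
class ShowKey:
    """Hypothetical probe: returns whatever the indexing operator receives."""

    def __getitem__(self, key):
        return key


show = ShowKey()

# One list inside the brackets arrives as that list.
single_list = show[[1, 50, 100]]
print(single_list)   # [1, 50, 100]

# Two comma-separated items arrive packed together as a tuple.
two_items = show[[1, 50, 100], ["firstName", "lastName"]]
print(two_items)     # ([1, 50, 100], ['firstName', 'lastName'])

# Colon syntax arrives as a slice object.
a_slice = show[0:15]
print(a_slice)       # slice(0, 15, None)

# A bare colon in the first position is a "get them all" slice.
everything = show[:, ["firstName"]]
print(everything)    # (slice(None, None, None), ['firstName'])
```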

If we definitely want the second list to include specific columns, we don't have a way to omit the first list, so we often use a "get them all" slice in the first list position. 

```python
.loc[
    :,
    ["firstName", "lastName"]
]
```

Where we often go wrong is to replace the bare slice with `[:]`:

```python
.loc[
    [:],   # WRONG!
    ["firstName", "lastName"]
]
```

Why does that break? Now the first item is a list literal with a colon in it. The colon is slice syntax, and Python only accepts it directly inside an indexing operator (i.e., `[]` applied to an object), not inside a list literal. So `[:]` on its own isn't valid Python, and it raises a `SyntaxError` before Pandas ever sees it.
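
Here is a sketch that checks this without crashing the notebook. It uses Python's built-in `compile()` to try parsing the broken expression as text, and then shows `slice(None)`, which is the object a bare `:` stands for inside the brackets. The two-row frame is made up for illustration.

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame(
    {"firstName": ["Amy", "Donald"], "lastName": ["Willis", "Pierce"]},
    index=pd.Index([1, 2], name="studentID"),
)

# Ask Python to parse the broken expression; it never gets as far as Pandas.
try:
    compile('df.loc[[:], ["firstName"]]', "<example>", "eval")
    parsed = True
except SyntaxError:
    parsed = False
print(parsed)   # False

# The bare colon and an explicit slice(None) select the same thing.
with_colon = df.loc[:, ["firstName"]]
with_slice = df.loc[slice(None), ["firstName"]]
print(with_colon.equals(with_slice))   # True
```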

*If you still haven't let go of functions and optional, positional, and named arguments yet, the whole "nested" lists in the indexing operator is a little mind-bendy.* Take your time. :-)

In [16]:
students.loc[:, ["firstName", "lastName"]]

studentID,firstName,lastName
5,Chris,Lynch
41,Tina,Kelley
23,Angela,Ray
29,Andrea,Perkins
85,Deborah,Owens
...,...,...
49,Karen,Hayes
89,Robin,Collins
45,Philip,Gibson
80,Brandon,Clark
