# Selection by label

With `iloc`, we were able to slice rows of our data, selecting individuals according to their numeric position in the index. Thanks to `loc`, we can also select rows *and* columns by label, picking out subsets of features (and much more!).

Let's begin by providing our `DataFrame` a more natural row index for the planets. Indexes may be modified after the fact or provided during construction.

In [1]:
# This cell is hidden from the textbook thanks to its tag.
#
# We're just re-constructing data established in other sections' notebooks.
#
# (Otherwise these would be undefined!)

import pandas as pd

planets_features = [
    'name',                # familiar name
    'solar_distance_km_6', # distance from sun: 10**6 km
    'mass_kg_24',          # absolute mass: 10**24 kg
    'density_kg_m3',       # density: kg/m**3
    'gravity_m_s2',        # gravity: m/s**2
]

planets_data = [
    ['Mercury', 57.9, 0.33, 5427.0, 3.7],
    ['Venus', 108.2, 4.87, 5243.0, 8.9],
    ['Earth', 149.6, 5.97, 5514.0, 9.8],
    ['Mars', 227.9, 0.642, 3933.0, 3.7],
    ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
    ['Saturn', 1433.5, 568.0, 687.0, 9.0],
    ['Uranus', 2872.5, 86.8, 1271.0, 8.7],
    ['Neptune', 4495.1, 102.0, 1638.0, 11.0]
]

planets = pd.DataFrame(planets_data, columns=planets_features)

distance_rel_change = planets.solar_distance_km_6.pct_change()

distance_pct_change = distance_rel_change * 100

In [2]:
planets = pd.DataFrame(planets_data,
                       columns=planets_features,
                       index=pd.RangeIndex(1, 9, name='number'))

planets

number,name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
1,Mercury,57.9,0.33,5427.0,3.7
2,Venus,108.2,4.87,5243.0,8.9
3,Earth,149.6,5.97,5514.0,9.8
4,Mars,227.9,0.642,3933.0,3.7
5,Jupiter,778.6,1898.0,1326.0,23.1
6,Saturn,1433.5,568.0,687.0,9.0
7,Uranus,2872.5,86.8,1271.0,8.7
8,Neptune,4495.1,102.0,1638.0,11.0
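
The cell above supplies the index during construction. As a minimal sketch of the other route, modifying the index after the fact, the following assigns a new index to an existing `DataFrame`. The name `inner_planets` and its three-row data are stand-ins introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: just the first three planets and two features.
inner_planets = pd.DataFrame(
    [['Mercury', 57.9], ['Venus', 108.2], ['Earth', 149.6]],
    columns=['name', 'solar_distance_km_6'],
)

# Built without an explicit index, the rows are labeled 0, 1, 2.
default_labels = list(inner_planets.index)

# Replace the index after the fact, so the rows are labeled 1, 2, 3.
inner_planets.index = pd.RangeIndex(1, 4, name='number')
new_labels = list(inner_planets.index)

print(default_labels, new_labels)
```

The examples that follow continue with `planets` as constructed above.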


With `iloc`, we retrieve the third individual from this `DataFrame` by specifying the offset `2`.

In [3]:
planets.iloc[2]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

But with `loc`, we can retrieve this individual – Earth – by its index value or row label – now `3`.

In [4]:
planets.loc[3]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

While numeric indexes are often more practical, we can further distinguish row labels from offsets by using familiar strings as the row index instead.

In [5]:
ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']

planet_ordinals = pd.DataFrame(planets_data,
                               columns=planets_features,
                               index=pd.Index(ordinals, name='ordinal'))

planet_ordinals

ordinal,name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
first,Mercury,57.9,0.33,5427.0,3.7
second,Venus,108.2,4.87,5243.0,8.9
third,Earth,149.6,5.97,5514.0,9.8
fourth,Mars,227.9,0.642,3933.0,3.7
fifth,Jupiter,778.6,1898.0,1326.0,23.1
sixth,Saturn,1433.5,568.0,687.0,9.0
seventh,Uranus,2872.5,86.8,1271.0,8.7
eighth,Neptune,4495.1,102.0,1638.0,11.0


`loc` supports the same selection features as `iloc`, such as slicing, even as it considers labels rather than offsets.

In [6]:
middle_planets = planet_ordinals.loc['third':'sixth']

middle_planets

ordinal,name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
third,Earth,149.6,5.97,5514.0,9.8
fourth,Mars,227.9,0.642,3933.0,3.7
fifth,Jupiter,778.6,1898.0,1326.0,23.1
sixth,Saturn,1433.5,568.0,687.0,9.0


**But watch out!** In the above example, `loc` interpreted our slice differently than `iloc` – *both* the elements at the slice `start` and `stop` were included in the results. This is convenient, because labels aren't necessarily incremental; we may not have in mind the label *following* the last one we want. Nonetheless, this is an inconsistency.
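
To make the difference concrete, here is a small sketch comparing the two slicing behaviors on a `Series` of planet names. The name `planet_names` is introduced only for this illustration.

```python
import pandas as pd

ordinals = ['first', 'second', 'third', 'fourth',
            'fifth', 'sixth', 'seventh', 'eighth']
names = ['Mercury', 'Venus', 'Earth', 'Mars',
         'Jupiter', 'Saturn', 'Uranus', 'Neptune']

# Hypothetical stand-in: planet names labeled by ordinal.
planet_names = pd.Series(names,
                         index=pd.Index(ordinals, name='ordinal'),
                         name='name')

# loc: the element at the stop label ('sixth') is included.
by_label = planet_names.loc['third':'sixth']

# iloc: the element at the stop offset (5, which is 'sixth') is excluded.
by_offset = planet_names.iloc[2:5]

print(len(by_label), len(by_offset))
```

To select the same four planets by offset, the `iloc` slice would need to stop one position later, at `6`.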

That said, `loc` resolves a different inconsistency: with `iloc`, the reference we use to select an individual changes once we're working with a slice of the original `DataFrame`.

In [7]:
middle_planets.iloc[0]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: third, dtype: object

Using `iloc`, we must refer to Earth as the "zeroth" (zero-offset, or "first") element of the data subset named `middle_planets`…

In [8]:
middle_planets.loc['third']

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: third, dtype: object

…but using `loc`, we may continue to refer to Earth as the "third" planet from the sun.

In addition to selecting out individuals by row label, `loc` also allows us to select features by column label.

Say we wanted to contextualize `distance_pct_change`, reproducing that `DataFrame`, but this time with only the most relevant columns, `name` and `solar_distance_km_6`. Bear in mind that there are many ways to go about this! The `assign` method, at least, enables us to create a new `DataFrame` from an existing one; but our existing `DataFrame` of the planets also contains the "offending" columns we'd like to leave out. So, to start, we want a `DataFrame` containing only those two columns.

When we've focused on a single feature before, such as the `name` data, the result hasn't been a `DataFrame`; it's been a `Series`.

In [9]:
planets.name

number
1    Mercury
2      Venus
3      Earth
4       Mars
5    Jupiter
6     Saturn
7     Uranus
8    Neptune
Name: name, dtype: object

In [10]:
type(planets.name)

pandas.core.series.Series

In [11]:
planets.name.assign(distance_pct_change=distance_pct_change)

AttributeError: 'Series' object has no attribute 'assign'

Well, *that* didn't work….

Instead, let's use `loc` to slice our `DataFrame` along its columns – and create a `DataFrame` with only the features we want.

Unlike what we've seen so far, element retrieval from the `loc` property can accept multiple arguments.

1. The first argument indicates row(s) to select – as above, by label, rather than offset.
1. The second argument does the same, but for column(s).

Because these arguments are positional, we can't provide the second argument without also providing the first. Luckily, a slice needs no arguments of its own: a bare `:` selects every element, just as it does when copying a sequence. With `loc` that looks like:

In [12]:
planets.loc[:, :]

number,name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
1,Mercury,57.9,0.33,5427.0,3.7
2,Venus,108.2,4.87,5243.0,8.9
3,Earth,149.6,5.97,5514.0,9.8
4,Mars,227.9,0.642,3933.0,3.7
5,Jupiter,778.6,1898.0,1326.0,23.1
6,Saturn,1433.5,568.0,687.0,9.0
7,Uranus,2872.5,86.8,1271.0,8.7
8,Neptune,4495.1,102.0,1638.0,11.0
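
Either slice can, of course, be narrowed. As a brief sketch of both arguments at work together, the following selects a range of rows and a range of columns in a single `loc` expression. The name `planets_demo` and its reduced data are stand-ins introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: four planets and three features.
planets_demo = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars'],
        'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9],
        'gravity_m_s2': [3.7, 8.9, 9.8, 3.7],
    },
    index=pd.RangeIndex(1, 5, name='number'),
)

# First argument: rows labeled 2 through 3 (both included).
# Second argument: columns 'name' through 'solar_distance_km_6' (both included).
subset = planets_demo.loc[2:3, 'name':'solar_distance_km_6']

print(subset)
```

As with rows, a label slice over columns includes the column at its `stop`.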


Further, we can supply a single column label. With a bare row slice (`:`), this is yet another way to retrieve the complete `Series` of data for this feature:

In [13]:
planets.loc[:, 'name']

number
1    Mercury
2      Venus
3      Earth
4       Mars
5    Jupiter
6     Saturn
7     Uranus
8    Neptune
Name: name, dtype: object

More relevantly, we can supply a `list` of the features to select, which produces a `DataFrame` even when the list names only one column.

In [14]:
planets.loc[:, ['name']]

number,name
1,Mercury
2,Venus
3,Earth
4,Mars
5,Jupiter
6,Saturn
7,Uranus
8,Neptune
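
The distinction between the last two cells is easy to miss, so here is a short sketch that checks the types directly. The name `planets_demo` and its reduced data are stand-ins introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: three planets and two features.
planets_demo = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth'],
        'solar_distance_km_6': [57.9, 108.2, 149.6],
    },
    index=pd.RangeIndex(1, 4, name='number'),
)

# A bare column label selects a one-dimensional Series.
as_series = planets_demo.loc[:, 'name']

# A list of column labels, even a list of one, selects a DataFrame.
as_frame = planets_demo.loc[:, ['name']]

print(type(as_series).__name__, type(as_frame).__name__)
```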


And now we know how to produce the two-feature `DataFrame` we were looking for.

In [15]:
planet_distances = planets.loc[:, ['name', 'solar_distance_km_6']]

planet_distances.assign(distance_pct_change=distance_pct_change)

number,name,solar_distance_km_6,distance_pct_change
1,Mercury,57.9,86.873921
2,Venus,108.2,38.262477
3,Earth,149.6,52.339572
4,Mars,227.9,241.641071
5,Jupiter,778.6,84.11251
6,Saturn,1433.5,100.383676
7,Uranus,2872.5,56.48738
8,Neptune,4495.1,NaN


**But that's not right!** Let's take another look at `distance_pct_change`.

In [16]:
distance_pct_change

0           NaN
1     86.873921
2     38.262477
3     52.339572
4    241.641071
5     84.112510
6    100.383676
7     56.487380
Name: solar_distance_km_6, dtype: float64

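Indeed, we've changed the index of our `DataFrame`: now it's 1 through 8, rather than the 0 through 7 of `distance_pct_change`. Because `assign` matches a `Series` to the rows by index label, not by position, each value landed one row too early, and label `8` found no match at all.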

In [17]:
distance_pct_change.index

RangeIndex(start=0, stop=8, step=1)
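
The mismatch can be reproduced in miniature. The sketch below assigns a zero-indexed `Series` to a `DataFrame` whose rows are labeled from `1`; the names `frame` and `values`, and the numbers they hold, are made up for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: rows labeled 1, 2, 3.
frame = pd.DataFrame({'name': ['Mercury', 'Venus', 'Earth']},
                     index=pd.RangeIndex(1, 4, name='number'))

# A Series with the default index: labels 0, 1, 2.
values = pd.Series([10.0, 20.0, 30.0])

# assign pairs values with rows by label:
#   label 0 has no row in frame, so 10.0 is discarded;
#   labels 1 and 2 receive 20.0 and 30.0;
#   label 3 has no value in the Series, so it is left missing (NaN).
combined = frame.assign(value=values)

print(combined)
```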

Luckily, there's more than one resolution. Let's tell the index of `distance_pct_change` to increase all of its values by `1`, using the `+=` operator.

In [18]:
distance_pct_change.index += 1

In [19]:
distance_pct_change.index

RangeIndex(start=1, stop=9, step=1)

Now we're set.

In [20]:
planet_distances.assign(distance_pct_change=distance_pct_change)

number,name,solar_distance_km_6,distance_pct_change
1,Mercury,57.9,NaN
2,Venus,108.2,86.873921
3,Earth,149.6,38.262477
4,Mars,227.9,52.339572
5,Jupiter,778.6,241.641071
6,Saturn,1433.5,84.11251
7,Uranus,2872.5,100.383676
8,Neptune,4495.1,56.48738
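
Shifting the index is only one resolution. As a hedged sketch of two others, we might instead compute the percent change from a column that already carries the new index, or hand `assign` the raw values so that no label matching takes place. The name `distances` and its reduced data are stand-ins introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: four planets, rows labeled 1 through 4.
distances = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars'],
        'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9],
    },
    index=pd.RangeIndex(1, 5, name='number'),
)

# Option A: derive the Series from the re-indexed column itself,
# so the result shares the DataFrame's labels from the start.
pct_same_index = distances.solar_distance_km_6.pct_change() * 100
result_a = distances.assign(distance_pct_change=pct_same_index)

# Option B: start from a zero-indexed Series (as in the text), but pass
# its underlying array, which assign places by position rather than label.
pct_zero_indexed = pd.Series([57.9, 108.2, 149.6, 227.9]).pct_change() * 100
result_b = distances.assign(distance_pct_change=pct_zero_indexed.to_numpy())

print(result_a)
```

Option B relies on the two objects having the same length and row order, which is an assumption worth checking before discarding the labels.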


In fact, `pandas` often offers more than one way to achieve the same, or a similar, result. We can also select rows and columns – in this case less succinctly – by specifying what labeled elements we *don't* want, using `drop`.

In [21]:
planets.drop(columns=['mass_kg_24', 'density_kg_m3', 'gravity_m_s2'])

number,name,solar_distance_km_6
1,Mercury,57.9
2,Venus,108.2
3,Earth,149.6
4,Mars,227.9
5,Jupiter,778.6
6,Saturn,1433.5
7,Uranus,2872.5
8,Neptune,4495.1
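
As a quick check that these routes agree, the following sketch builds the same two-column result by selecting with `loc`, by dropping the unwanted column, and by indexing the `DataFrame` directly with a list of column labels. The name `planets_demo` and its reduced data are stand-ins introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in: three planets and three features.
planets_demo = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth'],
        'solar_distance_km_6': [57.9, 108.2, 149.6],
        'mass_kg_24': [0.33, 4.87, 5.97],
    },
    index=pd.RangeIndex(1, 4, name='number'),
)

wanted = ['name', 'solar_distance_km_6']

# Say what to keep, by label, with loc.
via_loc = planets_demo.loc[:, wanted]

# Say what to discard, with drop.
via_drop = planets_demo.drop(columns=['mass_kg_24'])

# Index the DataFrame directly with a list of column labels.
via_brackets = planets_demo[wanted]

print(via_loc.equals(via_drop), via_loc.equals(via_brackets))
```

Which form reads best tends to depend on whether the columns to keep or the columns to discard make the shorter list.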
