# Selection by condition

So far we've selected individuals by their position in the row index and by their row index label. But particularly in larger data sets, this may not be practical.

To begin, we might sort our data by a column or a set of columns. Say we were having trouble finding Earth – we could produce a new `DataFrame`, sorted by the `name` feature.

In [52]:
planets.sort_values('name')

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
| 5 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
| 4 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 1 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |


By default, this sorts individuals in "ascending" order – from alphabetical "first" to "last" and numerical smallest to greatest.
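`sort_values` also accepts a list of columns, with a matching list of directions for `ascending`. The sketch below rebuilds only three of the columns shown above (this small `planets` frame is a stand-in for the full one, not the original loading code) and breaks the gravity tie between Mercury and Mars by mass.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'mass_kg_24': [0.330, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
        'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

# Sort by gravity (ascending); where gravity ties, list the more massive first.
by_gravity_then_mass = planets.sort_values(
    ['gravity_m_s2', 'mass_kg_24'],
    ascending=[True, False],
)

print(by_gravity_then_mass)
```

Mercury and Mars share a gravity of 3.7, so the second sort key decides their order: Mars, the more massive of the two, is listed first.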

If instead we wanted to see the most massive planets, we would sort by the `mass_kg_24` feature, but in "descending" order.

In [53]:
planets_massive = planets.sort_values('mass_kg_24', ascending=False)

planets_massive

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 5 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
| 2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
| 4 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 1 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |


In both cases, we are interested in the first-listed rows. For example, we can extract the most massive planet, Jupiter, from the sorted `DataFrame`.

In [54]:
planets_massive.iloc[0]

name                   Jupiter
solar_distance_km_6      778.6
mass_kg_24                1898
density_kg_m3             1326
gravity_m_s2              23.1
Name: 5, dtype: object
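
When only the extreme rows are of interest, a full sort isn't strictly necessary. As a minimal sketch, assuming a cut-down stand-in for `planets` with just two columns, `Series.idxmax` returns the label of the largest value and `DataFrame.nlargest` returns the top rows directly.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'mass_kg_24': [0.330, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

# Row label of the most massive planet (Jupiter's label, 5).
most_massive_label = planets.mass_kg_24.idxmax()
print(planets.loc[most_massive_label])

# The two most massive planets, most massive first.
top_two = planets.nlargest(2, 'mass_kg_24')
print(top_two)
```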

Sorting in this way can get you pretty far. But more powerfully, you can specify a *condition* or a set of conditions which individuals must pass in order to be selected.

Again, the `DataFrame` property `loc` does the job for us. Given a condition, it returns a `DataFrame` with the same features as the original, but containing only the individuals whose features satisfy that condition.

Rather than an offset, label or slice, we pass `loc` a *boolean* sequence with one value per row, in which `True` marks the rows satisfying our condition, such as:

    [True, False, True]
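
To make this concrete, here is a small sketch in which such a sequence is written out by hand. The three-row frame `inner` is a hypothetical stand-in, used only so that the list has one entry per row.

```python
import pandas as pd

# Hypothetical three-row frame, so the hand-written mask has one entry per row.
inner = pd.DataFrame(
    {'name': ['Mercury', 'Venus', 'Earth']},
    index=pd.Index([1, 2, 3], name='number'),
)

# True keeps a row, False drops it: the first and third rows are selected.
selected = inner.loc[[True, False, True]]
print(selected)
```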

But don't worry! We needn't produce this `list` ourselves. We can generate it from a simple conditional expression in Python, applied to the `Series` of data underlying the feature itself.

In [55]:
planets.mass_kg_24

number
1       0.330
2       4.870
3       5.970
4       0.642
5    1898.000
6     568.000
7      86.800
8     102.000
Name: mass_kg_24, dtype: float64

None of our solar system's inner planets is more massive than Earth, whose mass is less than 6 × 10<sup>24</sup> kilograms. We can exclude such lightweights with the *mask* produced by the following comparison expression.

In [56]:
planets.mass_kg_24 > 6

number
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
Name: mass_kg_24, dtype: bool

As you can see, we've produced a new `Series`, populated by Boolean values which reflect that the "statement" of our comparison – that the planets' masses are "greater than 6" thousand yottagrams – is `False` for the first four planets, but `True` for the remainder.

We can pass this mask to `loc` (or, if we like, the comparison expression itself) to produce a new `DataFrame` containing only the individuals that satisfy our condition.

In [57]:
planets.loc[planets.mass_kg_24 > 6]

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 5 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
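
A mask can also be inverted. As a sketch, again assuming a cut-down stand-in for `planets`, the `~` operator flips every `True` to `False` and vice versa, which selects the lightweights instead.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'mass_kg_24': [0.330, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

heavy = planets.mass_kg_24 > 6

# ~heavy is True exactly where heavy is False.
light_planets = planets.loc[~heavy]
print(light_planets)
```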


Or, even simpler, we can construct the same sort of look-up to find Earth.

In [58]:
planets.loc[planets.name == 'Earth']

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
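
The same kind of look-up extends to several values at once. A minimal sketch, assuming a one-column stand-in for `planets`: `Series.isin` produces a mask that is `True` wherever the value appears in the given list.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the `name` column.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    },
    index=pd.Index(range(1, 9), name='number'),
)

# True for each row whose name is one of the listed values.
wanted = planets.name.isin(['Earth', 'Mars'])

earth_and_mars = planets.loc[wanted]
print(earth_and_mars)
```

Rows come back in the frame's own order, not in the order of the list passed to `isin`.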


Above we built our conditions from a known value, such as `6`. We can build a value look-up into our comparison as well.

Another look-up property, `at`, provides a shortcut to scalar values.

In [59]:
planets.at[3, 'mass_kg_24']

5.97

In [60]:
planets.loc[
    planets.mass_kg_24 > planets.at[3, 'mass_kg_24']
]

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 5 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
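
The look-up with `at` relies on knowing that Earth's row label is `3`. If we'd rather not hard-code the label, one possible variation (a sketch, not the only way to do it) is to find Earth's mass with a condition on `name` and then reuse it in the comparison.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'mass_kg_24': [0.330, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

# loc with a row mask and a column label gives a one-element Series;
# iloc[0] extracts the scalar from it.
earth_mass = planets.loc[planets.name == 'Earth', 'mass_kg_24'].iloc[0]

heavier_than_earth = planets.loc[planets.mass_kg_24 > earth_mass]
print(heavier_than_earth)
```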


We can also select individuals that satisfy multiple conditions.

Let's compare the planets of our solar system to the Earth. We can begin by selecting only those planets whose gravity is within approximately 50% of Earth's – less than $14.8\,\mathrm{m/s^2}$ and more than $4.8\,\mathrm{m/s^2}$.

In [61]:
not_too_much_gravity = planets.gravity_m_s2 < planets.at[3, 'gravity_m_s2'] + 5

not_too_much_gravity

number
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8     True
Name: gravity_m_s2, dtype: bool

In [62]:
not_too_little_gravity = planets.gravity_m_s2 > planets.at[3, 'gravity_m_s2'] - 5

not_too_little_gravity

number
1    False
2     True
3     True
4    False
5     True
6     True
7     True
8     True
Name: gravity_m_s2, dtype: bool

We might simply invoke `loc` twice, once for each condition.

In [63]:
planets.loc[not_too_much_gravity].loc[not_too_little_gravity]

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |


Or, more powerfully, we can combine our conditions into a single conditional mask. In this case, we want *both* conditions to be satisfied, and so we'll combine them using the *bitwise AND* operator: `&`.

In [64]:
not_too_much_gravity & not_too_little_gravity

number
1    False
2     True
3     True
4    False
5    False
6     True
7     True
8     True
Name: gravity_m_s2, dtype: bool

In [65]:
planets.loc[not_too_much_gravity & not_too_little_gravity]

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
| 6 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
| 8 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
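
One detail is worth noting when the comparisons are written inline and not stored in variables first: in Python, `&` binds more tightly than `<` and `>`, so each comparison needs its own parentheses. The sketch below assumes a cut-down stand-in for `planets` and uses the literal bounds 14.8 and 4.8 from the text. It also shows the *bitwise OR* operator, `|`, which selects rows satisfying *either* condition.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

gravity = planets.gravity_m_s2

# Both conditions must hold; note the parentheses around each comparison.
earth_like_gravity = planets.loc[(gravity < 14.8) & (gravity > 4.8)]
print(earth_like_gravity)

# Either condition may hold: the planets excluded above.
extreme_gravity = planets.loc[(gravity >= 14.8) | (gravity <= 4.8)]
print(extreme_gravity)
```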


We're left with the majority of the planets. Only Mercury, Mars and Jupiter have been excluded – Mercury and Mars for having too little gravity, and Jupiter for having too much.

In fact, we know that Saturn is a gas giant, consisting mostly of hydrogen and helium – nothing at all like Earth! This fact is indirectly evident from its density: less than 13% of the density of the Earth.

In [66]:
planets.density_kg_m3 / planets.at[3, 'density_kg_m3']

number
1    0.984222
2    0.950852
3    1.000000
4    0.713275
5    0.240479
6    0.124592
7    0.230504
8    0.297062
Name: density_kg_m3, dtype: float64

We can exclude Saturn and the other giant planets by combining our gravity-based conditions with a density-based condition: that the planets' densities are at least 50% of Earth's.

In [67]:
dense_enough = planets.density_kg_m3 >= planets.at[3, 'density_kg_m3'] * 0.5

dense_enough

number
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8    False
Name: density_kg_m3, dtype: bool

In [68]:
planets.loc[not_too_much_gravity & not_too_little_gravity & dense_enough]

| number | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 |
|---|---|---|---|---|---|
| 2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
| 3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
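
As an aside, `pandas` also offers `DataFrame.query`, which takes the whole condition as a string. The sketch below is an alternative phrasing of the same three conditions, not a replacement for the masks above; it assumes a cut-down stand-in for `planets`, and uses the `@` prefix to refer to ordinary Python variables from inside the query string.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0,
                          1326.0, 687.0, 1271.0, 1638.0],
        'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
    },
    index=pd.Index(range(1, 9), name='number'),
)

earth_gravity = planets.at[3, 'gravity_m_s2']
earth_density = planets.at[3, 'density_kg_m3']

# Column names are used bare; @name refers to a Python variable.
similar_to_earth = planets.query(
    'gravity_m_s2 < @earth_gravity + 5'
    ' and gravity_m_s2 > @earth_gravity - 5'
    ' and density_kg_m3 >= @earth_density * 0.5'
)
print(similar_to_earth)
```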


Now we're left with only Venus and Earth.

And we're given an excellent example of the danger of drawing conclusions from too small a data set. Judging by the above, the two planets appear similar. Venus *is* considerably closer to the sun – 41.4 million kilometers closer than the Earth; but, as we saw, this is the *smallest* distance between any two neighboring planets, and it's difficult to gauge the significance of this feature with regard to Venus's similarity to Earth … at least, on the basis of this feature alone.
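
The gaps between neighboring planets can be checked directly. As a minimal sketch, assuming a cut-down stand-in for `planets`, `Series.diff` subtracts each row's predecessor from it, giving each planet's distance from the next planet inward.

```python
import pandas as pd

# Stand-in for the full `planets` DataFrame: only the columns needed here.
planets = pd.DataFrame(
    {
        'name': ['Mercury', 'Venus', 'Earth', 'Mars',
                 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9,
                                778.6, 1433.5, 2872.5, 4495.1],
    },
    index=pd.Index(range(1, 9), name='number'),
)

# Gap to the previous (inner) planet; NaN for Mercury, which has none.
gaps = planets.solar_distance_km_6.diff()
print(gaps)

# Label of the outer planet in the closest pair (NaN is skipped).
outer_of_closest_pair = gaps.idxmin()
print(planets.at[outer_of_closest_pair, 'name'])
```

The smallest gap belongs to row 3: Earth, at roughly 41.4 million kilometers from Venus.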

That said, if our data included a feature like `average_temperature_celsius` – for which Venus's value would be 462&deg; – then we'd know why Mars is considered the most likely habitable planet in our solar system, outside of the Earth … never mind *its* average temperature of -63&deg; Celsius, and its much lower gravity!

In conclusion, `pandas` and its `DataFrame` offer a great many additions to the data structures, functions and methods of Python, in support of processing and analyzing data. This introduction only covers some of the basic ways in which `pandas` builds upon and differs from what we've seen so far. But don't be overwhelmed! The best way to learn is to dive in. Apply what you've learned here, consult the <a href="https://pandas.pydata.org/docs/" target="_blank" rel="noopener">pandas documentation</a> (and the <a href="https://en.wikipedia.org/wiki/Internet" target="_blank" rel="noopener">Internet</a>) &hellip; and read on!