# Agenda: Sorting!

1. Series
    - Sort by index
    - Sort by values
2. Data frame
    - Sort by index
    - Sort by one column
    - Sort by multiple columns
3. Along the way, we'll play with a number of useful methods   


In [1]:
import pandas as pd

filename = '../data/taxi.csv'

df = pd.read_csv(filename)



In [5]:
import numpy as np
from pandas import Series, DataFrame

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('acegihfjdb'))

In [6]:
s

a    -6
c    -3
e    14
g    17
i    17
h   -41
f    33
j   -29
d   -14
b    37
dtype: int64

In [7]:
# the first kind of sort we'll do is on the index

# this returns a new series, not modifying the original one
s.sort_index()

a    -6
b    37
c    -3
d   -14
e    14
f    33
g    17
h   -41
i    17
j   -29
dtype: int64

In [9]:
# method chaining syntax

(
    s
    .sort_index()
    .head()
    .loc[['b', 'd']]
)

b    37
d   -14
dtype: int64

In [10]:
# let's create the series again, and grab a slice

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('acegihfjdb'))

s.loc['b':'f']

Series([], dtype: int64)

In [11]:
s.loc['g':'f']  # a .loc slice runs up to AND INCLUDING the end label

g    17
i    17
h   -41
f    33
dtype: int64
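
Why is `s.loc['b':'f']` empty while `s.loc['g':'f']` returns four rows? A minimal sketch of what seems to be going on, assuming an unsorted index with unique labels: pandas looks up the position of each label and returns everything between those positions. The names `start`, `stop`, `by_label`, `by_position`, and `empty` are just for illustration.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('acegihfjdb'))

# positions of the two labels in the (unsorted) index
start = s.index.get_loc('g')   # 'g' is the 4th label, so position 3
stop = s.index.get_loc('f')    # 'f' is the 7th label, so position 6

# the label slice covers those positions, including the end label
by_label = s.loc['g':'f']
by_position = s.iloc[start:stop + 1]
print(by_label.equals(by_position))

# 'b' sits at position 9, which comes after 'f' at position 6,
# so there is nothing "between" them and the slice is empty
empty = s.loc['b':'f']
print(len(empty))
```

In other words, on an unsorted index a label slice follows the order in which the labels appear, not alphabetical order.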

In [12]:
# what if the index has repeated values?

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('aceaihbjdb'))

s.loc['a':'b']

KeyError: "Cannot get left slice bound for non-unique label: 'a'"

In [13]:
s.loc['c':'b']

KeyError: "Cannot get right slice bound for non-unique label: 'b'"

In [14]:
# in order to solve this problem, we need to sort our series by its index!

s.sort_index().loc['a':'b']

a    -6
a    17
b    33
b    37
dtype: int64

In [15]:
s.sort_index().loc['a':'c']

a    -6
a    17
b    33
b    37
c    -3
dtype: int64
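
Before slicing by label, we can ask the index whether it has repeated labels and whether it is already sorted. This is a small sketch using the `is_unique` and `is_monotonic_increasing` attributes of the index; the variable names `sorted_s` and `result` are made up for this example.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('aceaihbjdb'))

# the index has repeated labels, and it is not in sorted order
print(s.index.is_unique)
print(s.index.is_monotonic_increasing)

# after sorting by index, the labels are in order,
# and label slices work even though 'a' and 'b' repeat
sorted_s = s.sort_index()
print(sorted_s.index.is_monotonic_increasing)

result = sorted_s.loc['a':'b']
print(result)
```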

In [16]:
# sometimes, we want to sort by the values
# we can do this with the sort_values() method

s.sort_values()

h   -41
j   -29
d   -14
a    -6
c    -3
e    14
a    17
i    17
b    33
b    37
dtype: int64

In [17]:
s = Series([10, 5, 15, 'b', 'a', 'c'])
s

0    10
1     5
2    15
3     b
4     a
5     c
dtype: object

In [18]:
s.sort_values()

TypeError: '<' not supported between instances of 'str' and 'int'
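
One possible workaround, assuming that comparing everything as text is acceptable for your data: pass a `key` function (covered in more detail later in this notebook) that converts the values to strings for the purpose of sorting. The helper name `as_strings` is hypothetical. Notice that text comparison puts `'15'` before `'5'`, so this is not a numeric sort.

```python
from pandas import Series

s = Series([10, 5, 15, 'b', 'a', 'c'])

def as_strings(a_series):
    # used only to decide the order; the original values are kept
    return a_series.astype(str)

result = s.sort_values(key=as_strings)
print(result)
```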

# Exercise: Series sorting

1. Create a series based on members of your family (or friends). The index will be their names, and the values will be their ages.
2. Sort by the names. What is the mean age of the first 3 people, alphabetically?
3. Sort by the ages. What are the names of the youngest and the eldest people in your series?

In [19]:
s = Series([53, 23, 21, 18],
           index='Reuven Atara Shikma Amotz'.split())
s

Reuven    53
Atara     23
Shikma    21
Amotz     18
dtype: int64

In [22]:
(
    s
    .sort_index()
    .head(3)
    .mean()
)

31.333333333333332

In [24]:
(
    s
    .sort_values()
    .iloc[[0, -1]]   # first and final elements
)

Amotz     18
Reuven    53
dtype: int64
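
If we only want the names, and not the ages, one alternative to sorting is the `idxmin` and `idxmax` methods, which return the index label of the smallest and largest value. This is a sketch of a different route to the same answer, not a replacement for the sorting solution above.

```python
from pandas import Series

s = Series([53, 23, 21, 18],
           index='Reuven Atara Shikma Amotz'.split())

youngest = s.idxmin()   # label of the smallest value
eldest = s.idxmax()     # label of the largest value
print(youngest, eldest)
```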

In [25]:
# how can we sort in *descending* order?
# so far, we've seen that both sort_index and sort_values go in ascending order

# pass ascending=False as a keyword argument; its default value is True
s.sort_values(ascending=False)

Reuven    53
Atara     23
Shikma    21
Amotz     18
dtype: int64

In [26]:
s.sort_index(ascending=False)

Shikma    21
Reuven    53
Atara     23
Amotz     18
dtype: int64

In [27]:
# all of these methods assume that the index/values are "comparable" in Python,
# meaning that they implement not only == but also < (and maybe a few other operators, as well)

In [28]:
# what if we want to change the way in which values are sorted?
# meaning: if we have both negative and positive numbers

np.random.seed(0)   # reset random numbers
s = Series(np.random.randint(-50, 50, 10),
           index=list('abcdefghij'))
s

a    -6
b    -3
c    14
d    17
e    17
f   -41
g    33
h   -29
i   -14
j    37
dtype: int64

In [29]:
# can I sort these numbers? Sure, using sort_values

s.sort_values()

f   -41
h   -29
i   -14
a    -6
b    -3
c    14
d    17
e    17
g    33
j    37
dtype: int64

In [30]:
# can I turn these values into absolute values (i.e., non-negative values) and then sort them?

s.abs().sort_values()

b     3
a     6
c    14
i    14
d    17
e    17
h    29
g    33
j    37
f    41
dtype: int64

In [34]:
# what if I want to sort them by absolute value
# but I don't want to change the values themselves

# for that, we can pass the "key" keyword argument
# the value that we pass to "key" is a FUNCTION, one that takes
# a series as an argument.  The input to the function will be our
# series, and the output will be a new series that is used only
# for determining the sort order

s.sort_values(key=abs)

b    -3
a    -6
c    14
i   -14
d    17
e    17
h   -29
g    33
j    37
f   -41
dtype: int64

In [41]:
# what if I want to sort by the square of the numbers?

def square_series(a_series):   # here, I define a function that takes a series
    return a_series ** 2       #   ... and returns a series

s.sort_values(key=square_series)

b    -3
a    -6
c    14
i   -14
d    17
e    17
h   -29
g    33
j    37
f   -41
dtype: int64
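
Sorting by the square gives the same order as sorting by absolute value, because squaring preserves the ordering of absolute values. Here is a quick check, assuming we want ties (such as 14 and -14) to keep their original relative order; for that, the sketch asks for a stable sort with `kind='mergesort'`. The names `by_abs` and `by_square` are just for this example.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('abcdefghij'))

# two different key functions, both applied to the whole series
by_abs = s.sort_values(key=abs, kind='mergesort')
by_square = s.sort_values(key=lambda a_series: a_series ** 2,
                          kind='mergesort')

# same order, and the values themselves are untouched
print(by_abs.equals(by_square))
print(by_abs)
```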

# Data frames

In many ways, sorting a data frame is just like sorting a series:

- We can sort by the index
- We can sort by the values

The difference, though, is that we have multiple columns! That means we have to indicate which column should be used as the basis for sorting. Also: We can sort by *more than* one column, if we want.
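
Here is a minimal sketch of those three kinds of sorting, using a tiny made-up data frame (the `city` and `age` columns and their values are invented for illustration, and are not from the taxi data):

```python
from pandas import DataFrame

df = DataFrame({'city': ['Paris', 'Oslo', 'Paris', 'Oslo'],
                'age': [30, 25, 22, 41]},
               index=['d', 'b', 'a', 'c'])

# sort the rows by the index
by_index = df.sort_index()

# sort the rows by the values in one column
by_age = df.sort_values('age')

# sort by city (ascending), and within each city by age (descending)
by_city_then_age = df.sort_values(['city', 'age'],
                                  ascending=[True, False])

print(by_index)
print(by_age)
print(by_city_then_age)
```

As with a series, each of these returns a new data frame, leaving `df` unchanged.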