Source: https://www.practicaldatascience.org/html/exercises/Exercise_indices.html

# Exercise 1

In [13]:
import pandas as pd
attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

# Exercise 2
Now let’s sort our attendees list by `arrival_order` so that the first row is the person who arrived first, the second is the person who arrived second, etc. to match how we’ve organized `arrival_prizes`.

In [5]:
attendees.sort_values('arrival_order')

Unnamed: 0,names,prizes,arrival_order
1,Kumar,0,1
0,Jill,0,2
2,Zaira,0,3


In [7]:
pd.concat([attendees.sort_values('arrival_order').reset_index(drop=True),arrival_prizes], axis=1)

Unnamed: 0,names,prizes,arrival_order,0
0,Kumar,0,1,20
1,Jill,0,2,10
2,Zaira,0,3,0


# Exercise 3
Now let’s “give” everyone their arrival prizes by adding arrival prizes to people’s prize column:

In [14]:
attendees['prizes'] = attendees['prizes'] + arrival_prizes

In [15]:
attendees

Unnamed: 0,names,prizes,arrival_order
0,Jill,20,2
1,Kumar,10,1
2,Zaira,0,3


# Exercise 4
Now let’s look at the result. Does it look like what you expected? Do you know what went wrong?

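Probably not: Jill received the first-arrival prize of 20 even though she arrived second. Two things went wrong. First, `sort_values` returns a new DataFrame, and we never assigned the result back to `attendees`, so `attendees` is still in its original order. Second, and more importantly, pandas aligns the two Series on their index labels, not on row position. Even a sorted `attendees` would give the same result, because the row with index 0 (Jill) is always matched to the entry of `arrival_prizes` with index 0 (the prize of 20).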
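A minimal sketch of that alignment behaviour, reusing the Exercise 1 data. The names `sorted_attendees` and `total` are only for illustration and are not part of the exercise.

```python
import pandas as pd

# Same data as in Exercise 1.
attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

# Sorting moves the rows, but each row keeps its original index label.
sorted_attendees = attendees.sort_values('arrival_order')
print(list(sorted_attendees.index))  # [1, 0, 2]

# Addition matches rows by index label, not by position,
# so label 0 (Jill) is still paired with arrival_prizes[0].
total = sorted_attendees['prizes'] + arrival_prizes
print(total.loc[0])  # Jill's total: 20
print(total.loc[1])  # Kumar's total: 10
```

Sorting alone therefore changes nothing about who gets which prize; the index labels have to change too.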

# Exercise 5
If you ever want alignment on row numbers, the easiest way to achieve it is to reset the indices on both objects you want to combine. When you reset an index without specifying a column to become the new index, the new index is just the row numbers (0, 1, 2, ...).

So reset prizes to 0, do what you need to do to get the order right, reset the index, and try again.

Note: When you reset the index on a Series, the Series is converted to a DataFrame, and the old index is added as a column. To avoid this behavior and just drop the old index when resetting indices (in either a Series or a DataFrame), pass the `drop=True` argument to `reset_index`.
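To make the note concrete, here is a small sketch of `reset_index` on a Series with and without `drop=True`. The Series `s` and its index labels are made up for illustration.

```python
import pandas as pd

# A hypothetical Series whose index is not 0, 1, 2.
s = pd.Series([20, 10, 0], index=[2, 0, 1])

# Without drop=True the result is a DataFrame: the old index becomes
# a column called 'index', and the values land in a column named 0.
as_frame = s.reset_index()
print(type(as_frame).__name__)   # DataFrame
print(list(as_frame.columns))    # ['index', 0]

# With drop=True the old index is discarded and the result stays a Series.
as_series = s.reset_index(drop=True)
print(type(as_series).__name__)  # Series
print(list(as_series.index))     # [0, 1, 2]
```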

In [16]:
attendees['prizes'] = 0

In [17]:
attendees = attendees.sort_values('arrival_order').reset_index(drop=True)

In [18]:
attendees['prizes'] = attendees['prizes'] + arrival_prizes

In [19]:
attendees

Unnamed: 0,names,prizes,arrival_order
0,Kumar,20,1
1,Jill,10,2
2,Zaira,0,3
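Resetting the index is the approach this exercise asks for. As an aside, another way to get purely positional behaviour is to strip the index from the values before combining them, for example with `.to_numpy()`. The sketch below assumes the Exercise 1 data; `sorted_attendees` is an illustrative name.

```python
import pandas as pd

attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"],
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])

# Sort, but leave the original index labels (1, 0, 2) in place.
sorted_attendees = attendees.sort_values('arrival_order')

# NumPy arrays carry no index, so the addition and the assignment
# are both done by row position rather than by label.
sorted_attendees['prizes'] = (sorted_attendees['prizes'].to_numpy()
                              + arrival_prizes.to_numpy())

print(sorted_attendees[['names', 'prizes']])
```

This gives Kumar 20, Jill 10, and Zaira 0 while keeping the original index labels, at the cost of silently trusting that both objects are in the same order.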


# Exercise 6
OK, so besides doing automatic alignment, is there a reason to use indices?

Let’s find out. Create the following fake dataset of social security numbers and some “names” (random strings). Warning: this will take a little time to run.

In [20]:
import numpy.random as npr
import string
import random
npr.seed(42)
random.seed(42)

size=1000000 # 1,000,000
people = pd.DataFrame({'social_security_numbers': npr.randint(low=10000000, high=99999999, size=size),
                       'names': [''.join(random.choices(string.ascii_uppercase, k=10))
                                 for i in range(size)]})

# Exercise 7
Now subset your data to get the social security number associated with the name of “TPKSMSLREI”. (Yes, there are ways to get real random names, but they take a while to run because they query websites that generate fake names, so we’re just doing this!).

In [29]:
%%timeit

cond1 = people['names'] == 'TPKSMSLREI'
people[cond1]

47.6 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Exercise 9
Now make `names` your index for this data. Then try subsetting using `.loc[]` to get all the observations with the name “TPKSMSLREI”, and time the operation.

In [32]:
people = people.set_index('names')

In [34]:
%%timeit
people.loc['TPKSMSLREI']

57 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


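Much faster :) The index lookup takes about 57 µs, compared with about 47.6 ms for the boolean-mask query, which is roughly 800 times faster.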

# Take-away
So in short: indices can be nice in that they do automatic alignment, provided you’re expecting it. Moreover, if you want to pull specific rows out of your dataset by label, looking them up through the index with `.loc[]` is often much faster than filtering the whole column with a boolean condition!
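The two lookup styles can be compared side by side. This sketch uses a smaller dataset than the exercise so it runs quickly, and the column name `id_number` and the variable `target` are made up for illustration. It only checks that both approaches return the same rows; timings will vary by machine, so none are shown.

```python
import random
import string
import pandas as pd

random.seed(0)
size = 100_000  # smaller than the 1,000,000 rows used in the exercise

names = [''.join(random.choices(string.ascii_uppercase, k=10))
         for _ in range(size)]
people = pd.DataFrame({'id_number': range(size), 'names': names})

# Pick a name we know is in the data.
target = names[1234]

# Boolean mask: compares every value in the column to the target.
by_mask = people[people['names'] == target]

# Index lookup: make names the index, then select by label.
# Passing a list to .loc keeps the result a DataFrame.
indexed = people.set_index('names')
by_label = indexed.loc[[target]]

# Both approaches select the same rows.
print(sorted(by_mask['id_number']) == sorted(by_label['id_number']))  # True
```

Wrapping the two selections in `%%timeit` cells, as in the exercises, is the simplest way to see the difference on your own machine.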