#  Selecting Subsets of Data
## Recipes
* [Selecting Series data](#Selecting-Series-data)
* [Selecting DataFrame rows](#Selecting-DataFrame-rows)
* [Selecting DataFrame rows and columns simultaneously](#Selecting-DataFrame-rows-and-columns-simultaneously)
* [Selecting data with both integers and labels](#Selecting-data-with-both-integers-and-labels)
* [Speeding up scalar selection](#Speeding-up-scalar-selection)
* [Slicing rows lazily](#Slicing-rows-lazily)
* [Slicing lexicographically](#Slicing-Lexicographically)

# Introduction

Every dimension of data in a Series or DataFrame is labeled through an Index object. It is this Index that separates pandas data structures from NumPy's n-dimensional array. Indexes provide meaningful labels for each row and column of data, and pandas users have the ability to select data through the use of these labels. Additionally, pandas allows its users to select data by the integer location of the rows and columns. This dual selection capability, one using labels and the other using integer location, makes for powerful yet confusing syntax to select subsets of data.

Selecting data through the use of labels or integer location is not unique to pandas. Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways. Both dictionaries and lists have precise instructions and limited use-cases for what may be passed to the indexing operator. A dictionary's key (its label) must be a hashable object, typically an immutable one such as a string, integer, or tuple. Lists must use either integers or slice objects for selection. Dictionaries can only select one object at a time by passing the key to the indexing operator. In some sense, pandas combines the ability to select data using integers, as with lists, and labels, as with dictionaries.
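
As a small illustration of the two built-in selection styles described above, the following sketch uses a made-up list and dictionary (the names `cities` and `city_by_school` are hypothetical and not part of the college dataset code):

```python
# A list selects by integer position or by slice.
cities = ['Normal', 'Birmingham', 'Montgomery', 'Huntsville']
third_city = cities[2]    # a single integer returns one element
first_two = cities[0:2]   # a slice returns a new list

# A dictionary selects by key (its label), one object at a time.
city_by_school = {
    'Alabama A & M University': 'Normal',
    'Amridge University': 'Montgomery',
}
amridge_city = city_by_school['Amridge University']

# A list cannot be used as a dictionary key because it is not hashable.
try:
    {['a', 'b']: 1}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```

A list has no notion of labels and a dictionary has no notion of position-based slicing, which is the gap that the pandas indexers fill.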


# Selecting Series data

Series and DataFrames are complex data containers that have multiple attributes that use the indexing operator to select data in different ways. In addition to the indexing operator itself, the .iloc and .loc attributes are available and use the indexing operator in their own unique ways. Collectively, these attributes are called the indexers. 

Series and DataFrame indexers allow selection by integer location (like Python lists) and by label (like Python dictionaries). The .iloc indexer selects only by integer location and works similarly to Python lists. The .loc indexer selects only by index label, which is similar to how Python dictionaries work.


In [3]:
import numpy as np 
import pandas as pd

In [8]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [9]:
city = college['CITY']
city.head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

(7535, 26)

In [11]:
# The .iloc indexer makes selections only by integer location. Passing an integer to it returns a scalar value:

city.iloc[3]


'Huntsville'

In [14]:
# To select several different integer locations, pass a list to .iloc. This returns a Series:

city.iloc[[0,1,3]]

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

In [16]:
# To select an equally spaced partition of data, use slice notation:

city.iloc[0:5:1]

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

In [18]:
# Now we turn to the .loc indexer, which selects only with index labels. Passing a single string returns a scalar value:

city.loc['Heritage Christian University']



'Florence'

In [19]:
# To select several disjoint labels, use a list:

np.random.seed(1)
labels = list(np.random.choice(city.index, 4))
labels

['Northwest HVAC/R Training Center',
 'California State University-Dominguez Hills',
 'Lower Columbia College',
 'Southwest Acupuncture College-Boulder']

In [20]:
city.loc[labels]

INSTNM
Northwest HVAC/R Training Center                Spokane
California State University-Dominguez Hills      Carson
Lower Columbia College                         Longview
Southwest Acupuncture College-Boulder           Boulder
Name: CITY, dtype: object

To select an equally spaced partition of data, use slice notation. Make sure that the start and stop values are strings. You can use an integer to specify the step size of the slice:


In [21]:
city.loc['Alabama State University':'Reid State Technical College':10]

INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object

When passing a scalar value to an indexer, as with city.iloc[3] and city.loc['Heritage Christian University'], a scalar value is returned. When passing a list or slice, as in the other examples, a Series is returned. This returned value might seem inconsistent, but if we think of a Series as a dictionary-like object that maps labels to values, then returning the value makes sense. To select a single item and retain the item in its Series, pass it as a single-item list rather than a scalar value:


In [23]:
city.iloc[[3]]

INSTNM
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

Care needs to be taken when using slice notation with .loc. If the start label appears after the stop label and the step is positive, an empty Series is returned and no exception is raised:

In [24]:
city.loc['Reid State Technical College':'Alabama State University':10] 

Series([], Name: CITY, dtype: object)
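
The behavior can be reproduced on a small, made-up Series (the labels `'a'` through `'d'` are hypothetical and stand in for institution names). The sketch below also shows that a negative step walks the labels backward:

```python
import pandas as pd

# Toy Series with string labels, used only for illustration.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Start label comes after the stop label: the result is empty, with no error.
forward = s.loc['d':'b']

# A negative step traverses the labels in reverse order instead.
backward = s.loc['d':'b':-1]
```

Checking the length of a label slice before using it is a cheap way to catch a start and stop that were accidentally swapped.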

# Selecting DataFrame rows

The most explicit and preferred way to select DataFrame rows is with the .iloc and .loc indexers. They are capable of selecting rows or columns independently and simultaneously.

In [25]:
# Read in the college dataset, and set the index as the institution name:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [29]:
# Pass an integer to the .iloc indexer to select an entire row at that position:

college.iloc[1]

CITY                  Birmingham
STABBR                        AL
HBCU                           0
MENONLY                        0
WOMENONLY                      0
RELAFFIL                       0
SATVRMID                     570
SATMTMID                     565
DISTANCEONLY                   0
UGDS                       11383
UGDS_WHITE                0.5922
UGDS_BLACK                  0.26
UGDS_HISP                 0.0283
UGDS_ASIAN                0.0518
UGDS_AIAN                 0.0022
UGDS_NHPI                 0.0007
UGDS_2MOR                 0.0368
UGDS_NRA                  0.0179
UGDS_UNKN                   0.01
PPTUG_EF                  0.2607
CURROPER                       1
PCTPELL                    0.346
PCTFLOAN                  0.5214
UG25ABV                   0.2422
MD_EARN_WNE_P10            39700
GRAD_DEBT_MDN_SUPP       21941.5
Name: University of Alabama at Birmingham, dtype: object

In [28]:
college.loc['University of Alaska Anchorage']

CITY                  Anchorage
STABBR                       AK
HBCU                          0
MENONLY                       0
WOMENONLY                     0
RELAFFIL                      0
SATVRMID                    NaN
SATMTMID                    NaN
DISTANCEONLY                  0
UGDS                      12865
UGDS_WHITE               0.5747
UGDS_BLACK               0.0358
UGDS_HISP                0.0761
UGDS_ASIAN               0.0778
UGDS_AIAN                0.0653
UGDS_NHPI                0.0086
UGDS_2MOR                 0.098
UGDS_NRA                 0.0181
UGDS_UNKN                0.0457
PPTUG_EF                 0.4539
CURROPER                      1
PCTPELL                  0.2385
PCTFLOAN                 0.2647
UG25ABV                  0.4386
MD_EARN_WNE_P10           42500
GRAD_DEBT_MDN_SUPP      19449.5
Name: University of Alaska Anchorage, dtype: object

In [30]:
# To select a disjoint set of rows as a DataFrame, pass a list of integers to the .iloc indexer:
college.iloc[[0,4,1]]


INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5


In [31]:
# A disjoint set of rows can also be selected by label by passing .loc a list of the exact institution names:

labels = ['University of Alaska Anchorage','International Academy of Hair Design', 'University of Alabama in Huntsville'] 
college.loc[labels]    


INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,12865.0,...,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500,19449.5
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,...,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0


In [32]:
# Use slice notation with .iloc to select an entire segment of the data:

college.iloc[99:102]
    

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,...,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556
GateWay Community College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,5211.0,...,0.0127,0.0161,0.0702,0.7465,1,0.327,0.2189,0.5832,29800,7283
Mesa Community College,Mesa,AZ,0.0,0.0,0.0,0,,,0.0,19055.0,...,0.0205,0.0257,0.0682,0.6457,1,0.3423,0.2207,0.401,35200,8000


In [33]:
# Slice notation also works with the .loc indexer and is inclusive of the last label:

start = 'International Academy of Hair Design'
stop = 'Mesa Community College'
college.loc[start:stop]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,...,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556
GateWay Community College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,5211.0,...,0.0127,0.0161,0.0702,0.7465,1,0.327,0.2189,0.5832,29800,7283
Mesa Community College,Mesa,AZ,0.0,0.0,0.0,0,,,0.0,19055.0,...,0.0205,0.0257,0.0682,0.6457,1,0.3423,0.2207,0.401,35200,8000
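
The difference between the two slices is easy to miss: .iloc excludes the stop position, as Python lists do, while .loc includes the stop label. A minimal sketch on a made-up Series (the labels are hypothetical) makes this visible:

```python
import pandas as pd

# Toy Series with string labels, used only for illustration.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Integer slice: positions 1 and 2 are returned; position 3 is excluded.
by_position = s.iloc[1:3]

# Label slice: 'b' through 'd' are returned; the stop label 'd' is included.
by_label = s.loc['b':'d']
```

This is why the .iloc slice above stops at 102 while the equivalent .loc slice names the last row it wants.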


# Selecting DataFrame rows and columns simultaneously

Directly using the indexing operator is the correct method to select one or more columns from a DataFrame. However, it does not allow you to select both rows and columns simultaneously. To select rows and columns simultaneously, you will need to pass both valid row and column selections separated by a comma to either the .iloc or .loc indexers.


In [34]:
# Read in the college dataset, and set the index as the institution name. Select the first three rows and the first four columns with slice notation:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.iloc[:3, :4]


INSTNM,CITY,STABBR,HBCU,MENONLY
Alabama A & M University,Normal,AL,1.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0


In [35]:
college.loc[:'Amridge University', :'MENONLY']


INSTNM,CITY,STABBR,HBCU,MENONLY
Alabama A & M University,Normal,AL,1.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0


In [36]:
# Select all the rows of two different columns:
college.iloc[:, [4,6]].head() 

INSTNM,WOMENONLY,SATVRMID
Alabama A & M University,0.0,424.0
University of Alabama at Birmingham,0.0,570.0
Amridge University,0.0,
University of Alabama in Huntsville,0.0,595.0
Alabama State University,0.0,425.0


In [39]:
college.loc[:, ['WOMENONLY', 'SATVRMID']]


INSTNM,WOMENONLY,SATVRMID
Alabama A & M University,0.0,424.0
University of Alabama at Birmingham,0.0,570.0
Amridge University,0.0,
University of Alabama in Huntsville,0.0,595.0
Alabama State University,0.0,425.0
...,...,...
SAE Institute of Technology San Francisco,,
Rasmussen College - Overland Park,,
National Personal Training Institute of Cleveland,,
Bay Area Medical Academy - San Jose Satellite Location,,


In [40]:
# Select disjoint rows and columns:
college.iloc[[100, 200], [7, 15]]

INSTNM,SATMTMID,UGDS_NHPI
GateWay Community College,,0.0029
American Baptist Seminary of the West,,


In [41]:
rows = ['GateWay Community College', 'American Baptist Seminary of the West']
columns = ['SATMTMID', 'UGDS_NHPI']
college.loc[rows, columns]

INSTNM,SATMTMID,UGDS_NHPI
GateWay Community College,,0.0029
American Baptist Seminary of the West,,


In [43]:
# Select a single scalar value:
college.iloc[5, -4] 

0.401

In [46]:
college.loc['The University of Alabama', 'PCTFLOAN']

0.401

In [47]:
# Slice the rows and select a single column:

college.iloc[90:80:-2, 5]

INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64

In [48]:
start = 'Empire Beauty School-Flagstaff'
stop = 'Arizona State University-Tempe'
college.loc[start:stop:-2, 'RELAFFIL']

INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64

# Selecting data with both integers and labels

The .iloc and .loc indexers each select data by either integer or label location, but neither can handle a combination of both input types at the same time. In earlier versions of pandas, another indexer, .ix, was available to select data by both integer and label location. While convenient in those specific situations, it was ambiguous by nature and a source of confusion for many pandas users. The .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so it should not be used.


In [51]:
# Read in the college dataset and assign the institution name (INSTNM) as the index:

college = pd.read_csv('data/college.csv', index_col='INSTNM')

In [52]:
# Use the Index method get_loc to find the integer position of the desired columns:

col_start = college.columns.get_loc('UGDS_WHITE')
col_end = college.columns.get_loc('UGDS_UNKN') + 1
col_start, col_end

(10, 19)

In [53]:
# Use col_start and col_end to select columns by integer location using .iloc:

college.iloc[:5, col_start:col_end]


INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137
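
The translation can also go the other way: integer positions can be converted to labels through the index and then passed to .loc. The sketch below compares both directions on a small, made-up DataFrame (the school names and values are hypothetical, not taken from the college dataset):

```python
import pandas as pd

# Toy DataFrame, used only for illustration.
df = pd.DataFrame(
    {'CITY': ['Normal', 'Birmingham', 'Montgomery'],
     'UGDS_WHITE': [0.03, 0.59, 0.30],
     'UGDS_BLACK': [0.94, 0.26, 0.42],
     'PCTPELL': [0.74, 0.35, 0.68]},
    index=['School A', 'School B', 'School C'])

# Labels to positions: look up the column positions, then use .iloc.
col_start = df.columns.get_loc('UGDS_WHITE')
col_end = df.columns.get_loc('UGDS_BLACK') + 1  # +1 because .iloc excludes the stop
via_iloc = df.iloc[:2, col_start:col_end]

# Positions to labels: look up the row labels, then use .loc.
row_labels = df.index[:2]
via_loc = df.loc[row_labels, 'UGDS_WHITE':'UGDS_BLACK']
```

Both selections should return the same two rows and two columns; which direction reads better depends on whether the rows or the columns are known by label.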


# Speeding up scalar selection

Both the .iloc and .loc indexers can select a single element, a scalar value, from a Series or DataFrame. However, two dedicated indexers, .iat and .at, achieve the same thing faster. Like .iloc, the .iat indexer uses integer location to make its selection; for a DataFrame, it must be passed two integers separated by a comma. Like .loc, the .at indexer uses labels to make its selection; for a DataFrame, it must be passed an index label and a column label separated by a comma.

In [54]:
# Read in the College Scorecard dataset with the institution name as the index. Pass a college name and column name to .loc in order to select a scalar value:

college = pd.read_csv('data/college.csv', index_col='INSTNM')
cn = 'Texas A & M University-College Station'
college.loc[cn, 'UGDS_WHITE']

0.6609999999999999

In [55]:
# Achieve the same result with .at :

college.at[cn, 'UGDS_WHITE'] 

0.6609999999999999

In [56]:
# Use the %timeit magic command to find the difference in speed:

%timeit college.loc[cn, 'UGDS_WHITE']


18.3 µs ± 458 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [57]:
%timeit college.at[cn, 'UGDS_WHITE']

10.6 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [58]:
# Find the integer locations of the preceding selections and then time the difference between .iloc and .iat:

row_num = college.index.get_loc(cn)
col_num = college.columns.get_loc('UGDS_WHITE')

In [59]:
row_num, col_num

(3765, 10)

In [60]:
%timeit college.iloc[row_num, col_num]

18.3 µs ± 836 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [61]:
%timeit college.iat[row_num, col_num]

11.6 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [62]:
%timeit college.iloc[5, col_num]

18.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [64]:
%timeit college.iat[5, col_num]

11.7 µs ± 402 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
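
Outside of IPython, where the %timeit magic is not available, a similar comparison can be sketched with the standard library's timeit module. The DataFrame below is made up for illustration, and the measured times will vary by machine and pandas version, so no particular speedup is claimed:

```python
import timeit

import pandas as pd

# Toy DataFrame with 1,000 labeled rows, used only for illustration.
df = pd.DataFrame({'x': range(1000), 'y': range(1000, 2000)},
                  index=[f'row{i}' for i in range(1000)])

# All three selections return the same scalar.
value_loc = df.loc['row500', 'y']
value_at = df.at['row500', 'y']
value_iat = df.iat[500, 1]

# Total seconds for 1,000 repetitions of each label-based lookup.
loc_seconds = timeit.timeit(lambda: df.loc['row500', 'y'], number=1000)
at_seconds = timeit.timeit(lambda: df.at['row500', 'y'], number=1000)
```

Comparing `loc_seconds` with `at_seconds` gives a rough sense of the difference on a given setup.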


# Slicing rows lazily

The previous recipes showed how the .iloc and .loc indexers select subsets of both Series and DataFrames in either dimension. A shortcut for selecting rows exists with just the indexing operator itself. It is shown here only to illustrate an additional feature of pandas; the primary function of the indexing operator on a DataFrame is to select columns. If you want to select rows, it is best to use .iloc or .loc, as they are unambiguous.


In [65]:
# Read in the college dataset with the institution name as the index and then select every other row from index 10 to 20:

college = pd.read_csv('data/college.csv', index_col='INSTNM')
college[10:20:2]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Birmingham Southern College,Birmingham,AL,0.0,0.0,0.0,1,560.0,560.0,0.0,1180.0,...,0.0051,0.0,0.0051,0.0017,1,0.192,0.4809,0.0152,44200.0,27000
Concordia College Alabama,Selma,AL,1.0,0.0,0.0,1,420.0,400.0,0.0,322.0,...,0.0031,0.0466,0.0,0.1056,1,0.8667,0.9333,0.2367,19900.0,PrivacySuppressed
Enterprise State Community College,Enterprise,AL,0.0,0.0,0.0,0,,,0.0,1729.0,...,0.0254,0.0012,0.0069,0.3823,1,0.4895,0.2263,0.3399,24600.0,8273
Faulkner University,Montgomery,AL,0.0,0.0,0.0,1,,,0.0,2367.0,...,0.0173,0.0182,0.0258,0.2302,1,0.5812,0.7253,0.4589,37200.0,22000
New Beginning College of Cosmetology,Albertville,AL,0.0,0.0,0.0,0,,,0.0,115.0,...,0.0,0.0,0.0,0.0783,1,0.8224,0.8553,0.3933,,5500


In [67]:
# This same slicing exists with Series:
city = college['CITY']
city[10:20:2]

INSTNM
Birmingham Southern College              Birmingham
Concordia College Alabama                     Selma
Enterprise State Community College       Enterprise
Faulkner University                      Montgomery
New Beginning College of Cosmetology    Albertville
Name: CITY, dtype: object

In [68]:
# Both Series and DataFrames can slice by label as well with just the indexing operator:

start = 'Mesa Community College'
stop = 'Spokane Community College'
college[start:stop:1500]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Mesa Community College,Mesa,AZ,0.0,0.0,0.0,0,,,0.0,19055.0,...,0.0205,0.0257,0.0682,0.6457,1,0.3423,0.2207,0.401,35200.0,8000
Hair Academy Inc-New Carrollton,New Carrollton,MD,0.0,0.0,0.0,0,,,0.0,504.0,...,0.0,0.0,0.0,0.4683,1,0.9756,1.0,0.5882,15200.0,9666
National College of Natural Medicine,Portland,OR,0.0,0.0,0.0,0,,,0.0,,...,,,,,1,,,,,PrivacySuppressed


In [70]:
# Here is the same slice by label with a Series:

city[start:stop:1500]

INSTNM
Mesa Community College                            Mesa
Hair Academy Inc-New Carrollton         New Carrollton
National College of Natural Medicine          Portland
Name: CITY, dtype: object
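
The ambiguity mentioned at the start of this recipe can be seen on a small, made-up DataFrame (the labels `s0` through `s5` are hypothetical). In this sketch, a slice passed to the indexing operator selects rows, while a list passed to the same operator is looked up among the column labels:

```python
import pandas as pd

# Toy DataFrame, used only for illustration.
df = pd.DataFrame({'CITY': list('abcdef'), 'UGDS': range(6)},
                  index=[f's{i}' for i in range(6)])

# A slice with the indexing operator selects rows, like .iloc with the same slice.
lazy = df[1:5:2]
explicit = df.iloc[1:5:2]
same_rows = lazy.equals(explicit)

# A list is treated as column labels, so a list of integers fails here.
try:
    df[[1, 3]]
    list_selects_rows = True
except KeyError:
    list_selects_rows = False
```

Because the same operator switches between rows and columns depending on the input type, spelling out .iloc or .loc keeps the intent clear to readers of the code.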

# Slicing Lexicographically

The .loc indexer typically selects data based on the exact string label of the index. However, it also allows you to select data based on the lexicographic order of the values in the index. Specifically, with slice notation, .loc can select all rows whose index labels fall lexicographically between the start and stop strings, even when neither string is itself a label. This works only if the index is sorted.

In [72]:
# Read in the college dataset, and set the institution name as the index:

college = pd.read_csv('data/college.csv', index_col='INSTNM')

In [None]:
# Attempt to select all colleges with names lexicographically between 'Sp' and 'Su':

college.loc['Sp':'Su']

# This raises a KeyError, so we first have to sort the index.

In [75]:
# As the index is not sorted, the preceding command fails. Let's go ahead and sort the index:

college = college.sort_index()

In [76]:
college

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
A & W Healthcare Educators,New Orleans,LA,0.0,0.0,0.0,0,,,0.0,40.0,...,0.0000,0.0000,0.0000,0.1250,1,0.7018,0.8596,0.6667,,19022.5
A T Still University of Health Sciences,Kirksville,MO,0.0,0.0,0.0,0,,,0.0,,...,,,,,1,,,,219800,PrivacySuppressed
ABC Beauty Academy,Garland,TX,0.0,0.0,0.0,0,,,0.0,30.0,...,0.0000,0.0000,0.0000,0.0000,0,0.7857,0.0000,0.8286,,PrivacySuppressed
ABC Beauty College Inc,Arkadelphia,AR,0.0,0.0,0.0,0,,,0.0,38.0,...,0.0000,0.0000,0.0000,0.2105,1,0.9815,1.0000,0.4688,PrivacySuppressed,16500
AI Miami International University of Art and Design,Miami,FL,0.0,0.0,0.0,0,,,0.0,2778.0,...,0.0018,0.0025,0.4644,0.2185,1,0.5507,0.6966,0.3262,29900,31000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yukon Beauty College Inc,Yukon,OK,0.0,0.0,0.0,0,,,0.0,25.0,...,0.0000,0.0000,0.0000,0.0000,1,0.9259,0.8148,0.4706,PrivacySuppressed,PrivacySuppressed
Z Hair Academy,Lawrence,KS,0.0,0.0,0.0,0,,,0.0,95.0,...,0.0211,0.0000,0.0105,0.0000,1,0.7286,0.6571,0.1525,,10500
Zane State College,Zanesville,OH,0.0,0.0,0.0,0,,,0.0,2063.0,...,0.0218,0.0000,0.2399,0.5730,1,0.3645,0.3434,0.3185,23800,13960.5
duCret School of Arts,Plainfield,NJ,0.0,0.0,0.0,0,,,0.0,41.0,...,0.0976,0.0000,0.0244,0.4146,1,0.4375,0.5000,0.1250,PrivacySuppressed,PrivacySuppressed


In [78]:
college.loc['Sp':'Su']

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Spa Tech Institute-Ipswich,Ipswich,MA,0.0,0.0,0.0,0,,,0.0,37.0,...,0.0000,0.0000,0.0541,0.4054,1,0.2656,0.3906,0.7907,21500,6333
Spa Tech Institute-Plymouth,Plymouth,MA,0.0,0.0,0.0,0,,,0.0,153.0,...,0.0000,0.0000,0.2484,0.3399,1,0.3716,0.4266,0.6250,21500,6333
Spa Tech Institute-Westboro,Westboro,MA,0.0,0.0,0.0,0,,,0.0,90.0,...,0.0000,0.0000,0.0222,0.5778,1,0.3409,0.4545,0.6882,21500,6333
Spa Tech Institute-Westbrook,Westbrook,ME,0.0,0.0,0.0,0,,,0.0,240.0,...,0.0000,0.0000,0.0042,0.2542,1,0.4350,0.5093,0.5224,21500,6333
Spalding University,Louisville,KY,0.0,0.0,0.0,1,490.0,440.0,0.0,1227.0,...,0.0302,0.0016,0.0326,0.2502,1,0.4442,0.6725,0.3764,41700,25000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Studio Academy of Beauty,Chandler,AZ,0.0,0.0,0.0,0,,,0.0,332.0,...,0.0392,0.0000,0.0090,0.0000,1,0.5855,0.6218,0.5675,,6333
Studio Jewelers,New York,NY,0.0,0.0,0.0,0,,,0.0,55.0,...,0.0000,0.0364,0.0000,0.6000,1,0.0451,0.0902,0.8525,PrivacySuppressed,PrivacySuppressed
Stylemaster College of Hair Design,Longview,WA,0.0,0.0,0.0,0,,,0.0,77.0,...,0.0130,0.0000,0.0000,0.0000,1,0.8036,0.7024,0.4510,17000,13320
Styles and Profiles Beauty College,Selmer,TN,0.0,0.0,0.0,0,,,0.0,31.0,...,0.0000,0.0000,0.0000,0.0000,1,0.8182,0.7955,0.2400,PrivacySuppressed,PrivacySuppressed
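
Note that the result ends with names beginning with 'St': every name that starts with 'Su' and has more characters sorts after the stop string 'Su' and is therefore excluded. The following sketch reproduces this on a tiny, made-up Series (the labels are chosen for illustration only) and checks whether the index is sorted before slicing:

```python
import pandas as pd

# Toy Series with an unsorted string index, used only for illustration.
s = pd.Series(range(5),
              index=['Stanford', 'Spelman', 'Auburn', 'Suffolk', 'Tulane'])

# On the unsorted index, slicing with strings that are not labels fails.
try:
    s.loc['Sp':'Su']
    unsorted_slice_worked = True
except KeyError:
    unsorted_slice_worked = False

# After sorting, the same slice selects labels between 'Sp' and 'Su'.
sorted_s = s.sort_index()
is_sorted = sorted_s.index.is_monotonic_increasing
between = sorted_s.loc['Sp':'Su']
```

Here `between` holds 'Spelman' and 'Stanford' but not 'Suffolk'. Checking `is_monotonic_increasing` first is one way to confirm that a lexicographic slice is safe to attempt.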
