## Indexing DataFrames

Once a DataFrame has been created, we need a way to refer to its individual elements and to parts of it, such as rows, columns, and blocks. pandas provides several indexing methods for this.

First let's create a DataFrame.

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 3), 
                  index = ['a' + str(i) for i in range(10)],    # List comprehension
                  columns = ['X', 'Y', 'Z'])
df

Unnamed: 0,X,Y,Z
a0,0.612574,-0.256674,1.174492
a1,1.02904,0.927267,2.111432
a2,-1.362142,0.375844,-0.221504
a3,-0.081476,-0.130276,0.794388
a4,0.481864,-1.359462,0.128806
a5,-0.282926,0.524219,0.85505
a6,1.001354,1.386409,-3.497621
a7,2.269834,-2.056785,1.056835
a8,-0.608418,0.778839,-0.948648
a9,-0.225175,1.348959,-2.452029


### Using loc indexer

A specific element can be selected using the `loc` indexer. This method is called <font color = 'blue'>_labelled indexing_</font>.

In [2]:
df.loc['a3', 'Y']

-0.1302756729062064

Note that we have used labels for indexing the row and the column.

### Using iloc indexer

This alternative method is called <font color = 'blue'>_positional indexing_</font>. Here the row and the column are specified by their integer positions, counted from 0.

In [3]:
df.iloc[3, 1]

-0.1302756729062064
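
The relationship between the two indexers can be sketched on a small frame with known values. The frame `demo` below is a hypothetical stand-in for `df` (it is not part of the original notebook), and `Index.get_loc` is used to translate a label into its position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with values 0..11, so each result is easy to check by eye.
demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=['a0', 'a1', 'a2', 'a3'],
                    columns=['X', 'Y', 'Z'])

by_label = demo.loc['a3', 'Y']     # row label 'a3', column label 'Y'
by_position = demo.iloc[3, 1]      # fourth row, second column (positions start at 0)

# Translate labels into positions.
row_pos = demo.index.get_loc('a3')
col_pos = demo.columns.get_loc('Y')

print(by_label, by_position, row_pos, col_pos)
```

Both indexers reach the same element; they differ only in how the row and the column are named.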

### Selecting a column 

There are several ways of selecting an entire column of a DataFrame.

#### Using the label of the required column

In [4]:
df['X']       # regular indexing

a0    0.612574
a1    1.029040
a2   -1.362142
a3   -0.081476
a4    0.481864
a5   -0.282926
a6    1.001354
a7    2.269834
a8   -0.608418
a9   -0.225175
Name: X, dtype: float64

In [5]:
df[0]  # Works only if 0 is a column label; here the labels are 'X', 'Y' and 'Z'

KeyError: 0
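
The error arises because plain brackets look up column labels, not positions. The sketch below illustrates this with two hypothetical frames, `demo` and `int_cols`, which are not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame whose column labels are strings.
demo = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]})

try:
    demo[0]                 # no column is labelled 0
    found = True
except KeyError:
    found = False

# With integer column labels, the same expression is a valid label lookup.
int_cols = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 1])
first = int_cols[0]

# Positional access through iloc works whatever the labels are.
first_by_pos = demo.iloc[:, 0]

print(found, first.tolist(), first_by_pos.tolist())
```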

#### Using attribute notation

Alternatively, a column whose label is a valid Python identifier, and does not clash with an existing DataFrame attribute or method, is also available as an attribute of the DataFrame.

In [6]:
df.X

a0    0.612574
a1    1.029040
a2   -1.362142
a3   -0.081476
a4    0.481864
a5   -0.282926
a6    1.001354
a7    2.269834
a8   -0.608418
a9   -0.225175
Name: X, dtype: float64
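
Attribute notation has limits worth seeing in action. The following is a minimal sketch on a hypothetical frame `demo` whose column names were chosen only to trigger the two common pitfalls: a name that clashes with a DataFrame method, and a name that is not a valid identifier.

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with DataFrame.count, 'my col' contains a space.
demo = pd.DataFrame({'X': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

x_attr = demo.X              # fine: 'X' is a valid identifier with no clash
count_attr = demo.count      # this is the DataFrame.count method, not the column
count_col = demo['count']    # bracket notation returns the column
spaced = demo['my col']      # labels containing spaces need brackets

print(type(count_attr).__name__, count_col.tolist(), spaced.tolist())
```

Bracket notation works for every column label, so it is the safer choice in code that has to handle arbitrary column names.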

#### Using iloc indexer

In [7]:
df.iloc[:,0]      

a0    0.612574
a1    1.029040
a2   -1.362142
a3   -0.081476
a4    0.481864
a5   -0.282926
a6    1.001354
a7    2.269834
a8   -0.608418
a9   -0.225175
Name: X, dtype: float64

#### Using loc indexer

In [8]:
df.loc[:, 'X']     # Using loc indexer

a0    0.612574
a1    1.029040
a2   -1.362142
a3   -0.081476
a4    0.481864
a5   -0.282926
a6    1.001354
a7    2.269834
a8   -0.608418
a9   -0.225175
Name: X, dtype: float64

When a column is selected using any of the above approaches, the result is a Series object containing the data of the selected column. The name of the resulting Series is the label of that column, and its index is the row index of the DataFrame.
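
These properties can be checked directly. The sketch below uses a hypothetical frame `demo` (not from the original notebook) and compares the four column-selection styles described in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame with the same labels as the tutorial's columns.
demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['a0', 'a1'],
                    columns=['X', 'Y', 'Z'])

col = demo['X']
is_series = isinstance(col, pd.Series)

# All four styles return equal Series objects.
same = (col.equals(demo.X)
        and col.equals(demo.iloc[:, 0])
        and col.equals(demo.loc[:, 'X']))

print(is_series, col.name, list(col.index), same)
```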

### Selecting a row

There are multiple ways of selecting an entire row.

#### Using iloc indexer

In [9]:
df.iloc[0, :]

X    0.612574
Y   -0.256674
Z    1.174492
Name: a0, dtype: float64

#### Using iloc indexer with the column index omitted

In [10]:
df.iloc[0]

X    0.612574
Y   -0.256674
Z    1.174492
Name: a0, dtype: float64

#### Using loc indexer

In [11]:
df.loc['a0', :]

X    0.612574
Y   -0.256674
Z    1.174492
Name: a0, dtype: float64

#### Using loc indexer with the column index omitted

In [12]:
df.loc['a0']

X    0.612574
Y   -0.256674
Z    1.174492
Name: a0, dtype: float64
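
A selected row is also a Series: its name is the row label and its index holds the column labels. The sketch below shows this on a hypothetical frame `mixed` (an assumption for illustration) whose columns have different types, in which case the row has to fall back to the general `object` dtype.

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
mixed = pd.DataFrame({'n': [1, 2], 's': ['u', 'v']}, index=['a0', 'a1'])

row = mixed.loc['a0']

print(row.name)           # the row label
print(list(row.index))    # the column labels
print(row.dtype)          # a single dtype that can hold both an int and a str
```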

### Slicing the rows

Multiple consecutive rows can be selected using the familiar slicing operation.

#### Regular slicing notation with positional indexing

In [13]:
df[1:4]

Unnamed: 0,X,Y,Z
a1,1.02904,0.927267,2.111432
a2,-1.362142,0.375844,-0.221504
a3,-0.081476,-0.130276,0.794388


#### Regular slicing notation with labeled indexing

In [14]:
df['a1':'a3']

Unnamed: 0,X,Y,Z
a1,1.02904,0.927267,2.111432
a2,-1.362142,0.375844,-0.221504
a3,-0.081476,-0.130276,0.794388


<u> Remark </u>

It is very important to remember that regular indexing notation (i.e. without the loc/iloc indexer) selects a column when given a single label, but selects rows when given a slice. Note also that a positional slice excludes its end point, whereas a label slice includes it.
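
The dual behaviour of plain brackets can be sketched as follows, using a hypothetical frame `demo` that is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame with string row and column labels.
demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=['a0', 'a1', 'a2', 'a3'],
                    columns=['X', 'Y', 'Z'])

col = demo['X']               # a single label selects a column (a Series)
rows_pos = demo[1:3]          # an integer slice selects rows by position
rows_lbl = demo['a1':'a2']    # a label slice selects rows by label

print(type(col).__name__, list(rows_pos.index), list(rows_lbl.index))
```

Because the meaning of the brackets depends on the kind of key, using `loc` or `iloc` explicitly tends to make the intent clearer.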

#### Slicing with iloc indexer

In [15]:
df.iloc[1:4]         

Unnamed: 0,X,Y,Z
a1,1.02904,0.927267,2.111432
a2,-1.362142,0.375844,-0.221504
a3,-0.081476,-0.130276,0.794388


#### Slicing with loc indexer

In [16]:
df.loc['a1':'a4']

Unnamed: 0,X,Y,Z
a1,1.02904,0.927267,2.111432
a2,-1.362142,0.375844,-0.221504
a3,-0.081476,-0.130276,0.794388
a4,0.481864,-1.359462,0.128806
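
The two slices above differ in how they treat the end point: `iloc[1:4]` stops before position 4, while `loc['a1':'a4']` includes the label `'a4'`. A minimal sketch on a hypothetical frame `demo` (an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical 6x3 frame with row labels a0..a5.
demo = pd.DataFrame(np.arange(18).reshape(6, 3),
                    index=['a' + str(i) for i in range(6)],
                    columns=['X', 'Y', 'Z'])

pos = demo.iloc[1:4]          # positions 1, 2, 3 (end excluded)
lbl = demo.loc['a1':'a4']     # labels a1 through a4 (end included)

print(list(pos.index))
print(list(lbl.index))
```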


### Slicing the columns

#### Using loc indexer

In [17]:
df.loc[:, 'Y':'Z']

Unnamed: 0,Y,Z
a0,-0.256674,1.174492
a1,0.927267,2.111432
a2,0.375844,-0.221504
a3,-0.130276,0.794388
a4,-1.359462,0.128806
a5,0.524219,0.85505
a6,1.386409,-3.497621
a7,-2.056785,1.056835
a8,0.778839,-0.948648
a9,1.348959,-2.452029


#### Using iloc indexer

In [18]:
df.iloc[:,1:2]

Unnamed: 0,Y
a0,-0.256674
a1,0.927267
a2,0.375844
a3,-0.130276
a4,-1.359462
a5,0.524219
a6,1.386409
a7,-2.056785
a8,0.778839
a9,1.348959


<u> Remark </u>: Column slicing cannot be done without using the loc/iloc indexer.
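
Note that the two column slices above do not select the same columns: `'Y':'Z'` includes `'Z'`, whereas `1:2` stops before position 2. The sketch below, on a hypothetical frame `demo`, shows the positional slice that matches the label slice, along with a stepped slice.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame with the tutorial's column labels.
demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['a0', 'a1'],
                    columns=['X', 'Y', 'Z'])

by_label = demo.loc[:, 'Y':'Z']     # label slice: end point included
by_pos = demo.iloc[:, 1:3]          # positional slice covering the same columns
single = demo.iloc[:, 1:2]          # only 'Y', but still a DataFrame
every_other = demo.loc[:, ::2]      # a step works too: 'X' and 'Z'

print(list(by_label.columns), list(by_pos.columns),
      list(single.columns), list(every_other.columns))
```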

## Fancy indexing

### Selecting one or more columns

Multiple columns can be selected by passing a list of column labels as the index.

#### Without using indexer

In [19]:
df[['X']]

Unnamed: 0,X
a0,0.612574
a1,1.02904
a2,-1.362142
a3,-0.081476
a4,0.481864
a5,-0.282926
a6,1.001354
a7,2.269834
a8,-0.608418
a9,-0.225175


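Note how the output differs from that produced by `df['X']`: a list index returns a DataFrame, even when the list contains a single label, whereas a single label returns a Series.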
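
The difference is easiest to see in the type and shape of the two results. A minimal sketch, using a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with three rows.
demo = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]})

one = demo['X']        # single label: one-dimensional Series
frame = demo[['X']]    # list with one label: two-dimensional DataFrame

print(type(one).__name__, one.shape)
print(type(frame).__name__, frame.shape)
```

The one-column DataFrame is handy when later code expects two-dimensional input.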

In [20]:
df[['X', 'Z']]

Unnamed: 0,X,Z
a0,0.612574,1.174492
a1,1.02904,2.111432
a2,-1.362142,-0.221504
a3,-0.081476,0.794388
a4,0.481864,0.128806
a5,-0.282926,0.85505
a6,1.001354,-3.497621
a7,2.269834,1.056835
a8,-0.608418,-0.948648
a9,-0.225175,-2.452029


#### Using loc indexer

In [21]:
df.loc[:,['X', 'Z']]

Unnamed: 0,X,Z
a0,0.612574,1.174492
a1,1.02904,2.111432
a2,-1.362142,-0.221504
a3,-0.081476,0.794388
a4,0.481864,0.128806
a5,-0.282926,0.85505
a6,1.001354,-3.497621
a7,2.269834,1.056835
a8,-0.608418,-0.948648
a9,-0.225175,-2.452029


#### Using iloc indexer

In [22]:
df.iloc[:,[0, 2]]

Unnamed: 0,X,Z
a0,0.612574,1.174492
a1,1.02904,2.111432
a2,-1.362142,-0.221504
a3,-0.081476,0.794388
a4,0.481864,0.128806
a5,-0.282926,0.85505
a6,1.001354,-3.497621
a7,2.269834,1.056835
a8,-0.608418,-0.948648
a9,-0.225175,-2.452029


### Selecting multiple rows and columns

In [23]:
df.iloc[[1, 3, 7], [0, 2]]

Unnamed: 0,X,Z
a1,1.02904,2.111432
a3,-0.081476,0.794388
a7,2.269834,1.056835


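Note how this type of indexing produces a more intuitive result than the same expression in NumPy: `iloc` returns every combination of the listed rows and columns, whereas NumPy pairs the two index lists element by element.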
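
The comparison with NumPy can be sketched as follows. The array `arr` and the frame `demo` are hypothetical stand-ins; `np.ix_` is the NumPy helper that builds the row-by-column grid that `iloc` produces directly.

```python
import numpy as np
import pandas as pd

arr = np.arange(30).reshape(10, 3)            # hypothetical 10x3 array
demo = pd.DataFrame(arr, columns=['X', 'Y', 'Z'])

sub = demo.iloc[[1, 3, 7], [0, 2]]            # 3 rows x 2 columns

# NumPy pairs the two lists element by element, so lists of
# different lengths cannot be combined this way.
try:
    arr[[1, 3, 7], [0, 2]]
    numpy_ok = True
except IndexError:
    numpy_ok = False

paired = arr[[1, 3], [0, 2]]                  # elements (1, 0) and (3, 2)
grid = arr[np.ix_([1, 3, 7], [0, 2])]         # NumPy equivalent of the iloc call

print(sub.shape, numpy_ok, paired.tolist(), grid.shape)
```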

## Using Boolean index

To select only the specific rows that satisfy a criterion, we can use a boolean index, as follows.

In [24]:
df.X

a0    0.612574
a1    1.029040
a2   -1.362142
a3   -0.081476
a4    0.481864
a5   -0.282926
a6    1.001354
a7    2.269834
a8   -0.608418
a9   -0.225175
Name: X, dtype: float64

In [25]:
df.X > 0

a0     True
a1     True
a2    False
a3    False
a4     True
a5    False
a6     True
a7     True
a8    False
a9    False
Name: X, dtype: bool

In [26]:
df[df.X > 0]  # When a boolean Series is used as an index, the rows corresponding to True values are selected

Unnamed: 0,X,Y,Z
a0,0.612574,-0.256674,1.174492
a1,1.02904,0.927267,2.111432
a4,0.481864,-1.359462,0.128806
a6,1.001354,1.386409,-3.497621
a7,2.269834,-2.056785,1.056835
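
Several conditions can be combined into one boolean index. The following is a minimal sketch on a hypothetical frame `demo`; the element-wise operators `&`, `|` and `~` are used, and each comparison is wrapped in parentheses because these operators bind more tightly than `>`.

```python
import pandas as pd

# Hypothetical frame with a mix of positive and negative values.
demo = pd.DataFrame({'X': [1, -2, 3, -4],
                     'Y': [-1, 2, 3, -4]},
                    index=['a0', 'a1', 'a2', 'a3'])

both = demo[(demo.X > 0) & (demo.Y > 0)]          # X and Y positive
either = demo[(demo.X > 0) | (demo.Y > 0)]        # X or Y positive
neither = demo[~((demo.X > 0) | (demo.Y > 0))]    # negation of the above

# A boolean row index can be combined with a column label through loc.
y_where_x_pos = demo.loc[demo.X > 0, 'Y']

print(list(both.index), list(either.index), list(neither.index))
```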


In [27]:
J = [True, False, True]
df.loc[:, J]   # a boolean list also works with the iloc indexer

Unnamed: 0,X,Z
a0,0.612574,1.174492
a1,1.02904,2.111432
a2,-1.362142,-0.221504
a3,-0.081476,0.794388
a4,0.481864,0.128806
a5,-0.282926,0.85505
a6,1.001354,-3.497621
a7,2.269834,1.056835
a8,-0.608418,-0.948648
a9,-0.225175,-2.452029


In [28]:
I = [True, True, False, True, False, False, True, True, True, False]
df[I]

Unnamed: 0,X,Y,Z
a0,0.612574,-0.256674,1.174492
a1,1.02904,0.927267,2.111432
a3,-0.081476,-0.130276,0.794388
a6,1.001354,1.386409,-3.497621
a7,2.269834,-2.056785,1.056835
a8,-0.608418,0.778839,-0.948648


In [29]:
J2 = np.array([True, True, False])
J2

array([ True,  True, False])

In [30]:
df.loc[:, J2]

Unnamed: 0,X,Y
a0,0.612574,-0.256674
a1,1.02904,0.927267
a2,-1.362142,0.375844
a3,-0.081476,-0.130276
a4,0.481864,-1.359462
a5,-0.282926,0.524219
a6,1.001354,1.386409
a7,2.269834,-2.056785
a8,-0.608418,0.778839
a9,-0.225175,1.348959
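
A boolean column mask does not have to be written out by hand; it can be computed from the data. The sketch below uses a hypothetical frame `demo` and keeps only the columns whose values are all positive. For `iloc`, the mask is converted to a plain NumPy array first, which is a cautious choice made here so that no label alignment is involved.

```python
import pandas as pd

# Hypothetical frame in which column 'Y' holds negative values.
demo = pd.DataFrame({'X': [1, 2], 'Y': [-3, -4], 'Z': [5, 6]})

mask = (demo > 0).all()                       # one boolean per column
picked = demo.loc[:, mask]                    # mask aligned on column labels
picked_pos = demo.iloc[:, mask.to_numpy()]    # same mask as a plain array

print(mask.tolist(), list(picked.columns))
```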
