## Indexing DataFrames

Once a DataFrame has been created, we need a way to refer to its individual elements and to parts of it. pandas provides several methods of indexing for this purpose.

First we create a DataFrame.

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 3), 
                  index = ['a' + str(i) for i in range(10)],    # List comprehension
                  columns = ['X', 'Y', 'Z'])
df

Unnamed: 0,X,Y,Z
a0,0.991532,1.731547,-0.580791
a1,-0.447652,0.079192,1.001877
a2,-0.108235,-1.931978,2.273536
a3,0.718982,-0.890286,-0.4618
a4,-0.459673,-0.007272,-1.528105
a5,-1.025016,-0.533098,-0.67077
a6,0.633949,0.708923,-1.859408
a7,0.164404,-1.415267,-0.901724
a8,-0.211213,-0.276117,-1.601427
a9,-0.055067,0.075195,-0.004709


### Using loc indexer

A specific element can be selected using the loc indexer. This method is called <font color = 'blue'>_labelled indexing_</font>.

In [31]:
df.loc['a3', 'Y']

-0.8902856100926875

Note that we have used labels for indexing the row and the column.

### Using iloc indexer

This alternative method is called <font color = 'blue'>_positional indexing_</font>. Here the row and the column are indexed by their zero-based integer positions rather than by their labels.

In [32]:
df.iloc[3, 1]

-0.8902856100926875
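The relationship between the two indexers can be sketched as follows. The frame `demo` is an illustrative stand-in with deterministic values (it is not the random `df` created above), and `Index.get_loc` is used to translate a label into its integer position.

```python
import numpy as np
import pandas as pd

# Illustrative frame with predictable values (assumption: not the random df above)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

# Translate labels into integer positions
row_pos = demo.index.get_loc('a3')     # position of row label 'a3'
col_pos = demo.columns.get_loc('Y')    # position of column label 'Y'

by_label = demo.loc['a3', 'Y']                # labelled indexing
by_position = demo.iloc[row_pos, col_pos]     # positional indexing, same element
```

Both expressions address the same cell; only the way the cell is named differs.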

### Selecting a column 

There are several ways of selecting an entire column of a DataFrame.

#### Using the label of the required column

In [33]:
df['X']       # regular indexing

a0    0.991532
a1   -0.447652
a2   -0.108235
a3    0.718982
a4   -0.459673
a5   -1.025016
a6    0.633949
a7    0.164404
a8   -0.211213
a9   -0.055067
Name: X, dtype: float64

In [34]:
df[0]  # Works only if 0 is a column label; here the labels are 'X', 'Y', 'Z', so this raises KeyError

KeyError: 0
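The following sketch illustrates when `df[0]` does work. The frames `int_cols` and `str_cols` are hypothetical examples: when no column names are supplied, pandas assigns the integer labels 0, 1, 2, so `0` is then a valid column label.

```python
import numpy as np
import pandas as pd

# No column names given, so the columns are labelled 0, 1, 2
int_cols = pd.DataFrame(np.arange(6).reshape(2, 3))
first = int_cols[0]            # works: 0 is a column label here

# With string labels the same expression fails
str_cols = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['X', 'Y', 'Z'])
try:
    str_cols[0]
    raised = False
except KeyError:
    raised = True              # 0 is not among the labels 'X', 'Y', 'Z'
```

To select a column by position regardless of its label, `iloc` is the safer choice.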

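#### Using attribute notation

Alternatively, a column whose label is a valid Python identifier (a single word such as `X`) is also available as an attribute of the DataFrame, provided the label does not clash with an existing DataFrame attribute or method.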

In [35]:
df.X

a0    0.991532
a1   -0.447652
a2   -0.108235
a3    0.718982
a4   -0.459673
a5   -1.025016
a6    0.633949
a7    0.164404
a8   -0.211213
a9   -0.055067
Name: X, dtype: float64
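The limits of attribute notation can be seen in a small sketch. The column names `count` and `unit price` are made up for illustration: the first clashes with the `DataFrame.count` method and the second is not a valid identifier.

```python
import pandas as pd

# Hypothetical frame with two awkward column labels
demo = pd.DataFrame({'X': [1, 2], 'count': [3, 4], 'unit price': [5.0, 6.0]})

x_attr = demo.X                        # fine: valid identifier, no name clash
count_is_method = callable(demo.count) # the attribute is the method, not the column
count_col = demo['count']              # bracket notation reaches the column
price_col = demo['unit price']         # labels with spaces need bracket notation
```

Bracket notation works for any label, so it is the more robust option in scripts.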

#### Using iloc indexer

In [36]:
df.iloc[:,0]      

a0    0.991532
a1   -0.447652
a2   -0.108235
a3    0.718982
a4   -0.459673
a5   -1.025016
a6    0.633949
a7    0.164404
a8   -0.211213
a9   -0.055067
Name: X, dtype: float64

#### Using loc indexer

In [37]:
df.loc[:, 'X']     # Using loc indexer

a0    0.991532
a1   -0.447652
a2   -0.108235
a3    0.718982
a4   -0.459673
a5   -1.025016
a6    0.633949
a7    0.164404
a8   -0.211213
a9   -0.055067
Name: X, dtype: float64

When a column is selected using any of the above approaches, the result is a Series object containing the data of the indexed column. The name of the resulting Series is the label of the indexed column.
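As a quick check, the four approaches can be compared directly. This is a minimal sketch on an illustrative frame `demo` with deterministic values (an assumption, not the random `df` above).

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

candidates = [demo['X'], demo.X, demo.iloc[:, 0], demo.loc[:, 'X']]

# Every approach yields an identical Series named after the column
all_equal = all(c.equals(candidates[0]) for c in candidates)
kinds = {type(c).__name__ for c in candidates}
names = {c.name for c in candidates}
```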

### Selecting a row

There are multiple ways of selecting an entire row.

#### Using iloc indexer

In [38]:
df.iloc[0, :]

X    0.991532
Y    1.731547
Z   -0.580791
Name: a0, dtype: float64

#### Using iloc indexer with the column index omitted

In [39]:
df.iloc[0]

X    0.991532
Y    1.731547
Z   -0.580791
Name: a0, dtype: float64

#### Using loc indexer

In [40]:
df.loc['a0', :]

X    0.991532
Y    1.731547
Z   -0.580791
Name: a0, dtype: float64

#### Using loc indexer with the column index omitted

In [41]:
df.loc['a0']

X    0.991532
Y    1.731547
Z   -0.580791
Name: a0, dtype: float64
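A selected row is also a Series, but its index now holds the column labels and its name is the row label. The sketch below shows this on an illustrative frame `demo` (an assumption, not the random `df` above).

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

row = demo.loc['a0']

row_name = row.name                    # the row label
row_index = list(row.index)            # the column labels
y_value = row['Y']                     # an element of the row, addressed by column label
same_as_iloc = demo.iloc[0].equals(row)
```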

### Slicing the rows

Multiple consecutive rows can be selected using the familiar slicing operation.

#### Regular slicing notation with positional indexing

In [13]:
df[1:4]

Unnamed: 0,X,Y,Z
a1,-0.447652,0.079192,1.001877
a2,-0.108235,-1.931978,2.273536
a3,0.718982,-0.890286,-0.4618


#### Regular slicing notation with labelled indexing

In [14]:
df['a1':'a3']

Unnamed: 0,X,Y,Z
a1,-0.447652,0.079192,1.001877
a2,-0.108235,-1.931978,2.273536
a3,0.718982,-0.890286,-0.4618


<u> Remark </u>

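It is very important to remember that the regular indexing notation (i.e. without the loc/iloc indexers) selects a column when it is given a single label, but selects rows when it is given a slice. Note also that a label slice includes its end label (`'a3'` above), whereas a positional slice excludes its stop position.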
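The two behaviours of the regular notation can be contrasted in a short sketch, again on an illustrative frame `demo` (an assumption, not the random `df` above).

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

col = demo['X']      # single label: selects a column, returns a Series
rows = demo[1:4]     # slice: selects rows, returns a DataFrame

col_kind = type(col).__name__
rows_kind = type(rows).__name__
row_labels = list(rows.index)
```

Because the same brackets do two different jobs, many authors prefer the explicit `loc` and `iloc` indexers for row selection.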

#### Slicing with iloc indexer

In [15]:
df.iloc[1:4]         

Unnamed: 0,X,Y,Z
a1,-0.447652,0.079192,1.001877
a2,-0.108235,-1.931978,2.273536
a3,0.718982,-0.890286,-0.4618


#### Slicing with loc indexer

In [16]:
df.loc['a1':'a4']

Unnamed: 0,X,Y,Z
a1,-0.447652,0.079192,1.001877
a2,-0.108235,-1.931978,2.273536
a3,0.718982,-0.890286,-0.4618
a4,-0.459673,-0.007272,-1.528105
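The difference in how the end of a slice is treated is easy to overlook. The sketch below, on an illustrative frame `demo` (an assumption, not the random `df` above), compares the two indexers.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

by_position = demo.iloc[1:4]       # positions 1, 2, 3: the stop position is excluded
by_label = demo.loc['a1':'a4']     # labels a1 to a4: the end label is included

n_position = len(by_position)
n_label = len(by_label)

# To get the same rows, the label slice must end one label earlier
matches = demo.loc['a1':'a3'].equals(demo.iloc[1:4])
```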


### Slicing the columns

#### Using loc indexer

In [17]:
df.loc[:, 'Y':'Z']

Unnamed: 0,Y,Z
a0,1.731547,-0.580791
a1,0.079192,1.001877
a2,-1.931978,2.273536
a3,-0.890286,-0.4618
a4,-0.007272,-1.528105
a5,-0.533098,-0.67077
a6,0.708923,-1.859408
a7,-1.415267,-0.901724
a8,-0.276117,-1.601427
a9,0.075195,-0.004709


#### Using iloc indexer

In [18]:
df.iloc[:,1:2]

Unnamed: 0,Y
a0,1.731547
a1,0.079192
a2,-1.931978
a3,-0.890286
a4,-0.007272
a5,-0.533098
a6,0.708923
a7,-1.415267
a8,-0.276117
a9,0.075195


<u> Remark </u>: Column slicing cannot be done without using the loc/iloc indexers, because a slice in the regular notation (e.g. `df['Y':'Z']`) is interpreted as a row slice.

## Fancy indexing

### Selecting one or more columns

One or more columns can be selected by passing a list of column labels as the index. The columns need not be adjacent.

#### Without using indexer

In [19]:
df[['X']]

Unnamed: 0,X
a0,0.991532
a1,-0.447652
a2,-0.108235
a3,0.718982
a4,-0.459673
a5,-1.025016
a6,0.633949
a7,0.164404
a8,-0.211213
a9,-0.055067


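Note how the output differs from that produced by `df['X']`: a list of labels always returns a DataFrame, even when the list holds a single label, whereas a bare label returns a Series.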
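The distinction shows up most clearly in the shapes of the two results. This is a minimal sketch on an illustrative frame `demo` (an assumption, not the random `df` above); `DataFrame.squeeze` is used to turn the one-column frame back into a Series.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

as_series = demo['X']       # bare label: one-dimensional Series
as_frame = demo[['X']]      # list with one label: two-dimensional DataFrame

series_shape = as_series.shape
frame_shape = as_frame.shape

# Collapsing the single column recovers the Series
round_trip = as_frame.squeeze('columns').equals(as_series)
```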

In [20]:
df[['X', 'Z']]

Unnamed: 0,X,Z
a0,0.991532,-0.580791
a1,-0.447652,1.001877
a2,-0.108235,2.273536
a3,0.718982,-0.4618
a4,-0.459673,-1.528105
a5,-1.025016,-0.67077
a6,0.633949,-1.859408
a7,0.164404,-0.901724
a8,-0.211213,-1.601427
a9,-0.055067,-0.004709


#### Using loc indexer

In [21]:
df.loc[:,['X', 'Z']]

Unnamed: 0,X,Z
a0,0.991532,-0.580791
a1,-0.447652,1.001877
a2,-0.108235,2.273536
a3,0.718982,-0.4618
a4,-0.459673,-1.528105
a5,-1.025016,-0.67077
a6,0.633949,-1.859408
a7,0.164404,-0.901724
a8,-0.211213,-1.601427
a9,-0.055067,-0.004709


#### Using iloc indexer

In [22]:
df.iloc[:,[0, 2]]

Unnamed: 0,X,Z
a0,0.991532,-0.580791
a1,-0.447652,1.001877
a2,-0.108235,2.273536
a3,0.718982,-0.4618
a4,-0.459673,-1.528105
a5,-1.025016,-0.67077
a6,0.633949,-1.859408
a7,0.164404,-0.901724
a8,-0.211213,-1.601427
a9,-0.055067,-0.004709


### Selecting multiple rows and columns

In [23]:
df.iloc[[1, 3, 7], [0, 2]]

Unnamed: 0,X,Z
a1,-0.447652,1.001877
a3,0.718982,-0.4618
a7,0.164404,-0.901724


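Note that this selects every listed row combined with every listed column. This is more intuitive than the behaviour of NumPy, where two index lists are paired element by element (and lists of different lengths cannot be broadcast together).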
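The contrast with NumPy can be sketched as follows. The array `arr` and frame `demo` are illustrative (assumptions, not the random `df` above); `np.ix_` is the NumPy helper that builds the row-by-column mesh which pandas produces directly.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption: stands in for df)
arr = np.arange(30).reshape(10, 3)
demo = pd.DataFrame(arr,
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

# pandas: rows 1, 3, 7 crossed with columns 0, 2 gives a 3 x 2 block
block = demo.iloc[[1, 3, 7], [0, 2]]

# NumPy: the two lists are broadcast against each other, and lengths 3 and 2 do not match
try:
    arr[[1, 3, 7], [0, 2]]
    numpy_failed = False
except IndexError:
    numpy_failed = True

# np.ix_ reproduces the pandas result in NumPy
mesh = arr[np.ix_([1, 3, 7], [0, 2])]
```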

## Using Boolean index

To select only the rows that satisfy a criterion, we can use a Boolean index, built up step by step as follows.

In [24]:
df.X

a0    0.991532
a1   -0.447652
a2   -0.108235
a3    0.718982
a4   -0.459673
a5   -1.025016
a6    0.633949
a7    0.164404
a8   -0.211213
a9   -0.055067
Name: X, dtype: float64

In [25]:
df.X >0

a0     True
a1    False
a2    False
a3     True
a4    False
a5    False
a6     True
a7     True
a8    False
a9    False
Name: X, dtype: bool

In [26]:
df[df.X > 0]  # When a bool Series is used as an index, the rows corresponding to True values are selected

Unnamed: 0,X,Y,Z
a0,0.991532,1.731547,-0.580791
a3,0.718982,-0.890286,-0.4618
a6,0.633949,0.708923,-1.859408
a7,0.164404,-1.415267,-0.901724
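Several criteria can be combined into one Boolean index. The sketch below uses a small made-up frame `demo` (an assumption, not the random `df` above); each comparison is wrapped in parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

# Small hypothetical frame with known signs
demo = pd.DataFrame({'X': [1.0, -2.0, 3.0, -4.0],
                     'Y': [-1.0, 5.0, 2.0, -3.0]},
                    index=['a0', 'a1', 'a2', 'a3'])

# Rows where both X and Y are positive
mask = (demo.X > 0) & (demo.Y > 0)
both_positive = demo[mask]

# A Boolean row index combined with a column label in loc
x_where_y_positive = demo.loc[demo.Y > 0, 'X']
```

The operators `|` (or) and `~` (not) combine Boolean Series in the same way.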


In [27]:
J = [True, False, True]
df.loc[:,J]   # also works with iloc indexer

Unnamed: 0,X,Z
a0,0.991532,-0.580791
a1,-0.447652,1.001877
a2,-0.108235,2.273536
a3,0.718982,-0.4618
a4,-0.459673,-1.528105
a5,-1.025016,-0.67077
a6,0.633949,-1.859408
a7,0.164404,-0.901724
a8,-0.211213,-1.601427
a9,-0.055067,-0.004709


In [28]:
I = [True, True, False, True, False, False, True, True, True, False]
df[I]

Unnamed: 0,X,Y,Z
a0,0.991532,1.731547,-0.580791
a1,-0.447652,0.079192,1.001877
a3,0.718982,-0.890286,-0.4618
a6,0.633949,0.708923,-1.859408
a7,0.164404,-1.415267,-0.901724
a8,-0.211213,-0.276117,-1.601427


In [29]:
J2 = np.array([True, True, False])
J2

array([ True,  True, False])

In [30]:
df.loc[:, J2]

Unnamed: 0,X,Y
a0,0.991532,1.731547
a1,-0.447652,0.079192
a2,-0.108235,-1.931978
a3,0.718982,-0.890286
a4,-0.459673,-0.007272
a5,-1.025016,-0.533098
a6,0.633949,0.708923
a7,0.164404,-1.415267
a8,-0.211213,-0.276117
a9,-0.055067,0.075195
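Rather than typing a column mask by hand, it can be derived from the column labels themselves. This is a minimal sketch on an illustrative frame `demo` (an assumption, not the random `df` above); the mask must contain one entry per column.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption: stands in for df)
demo = pd.DataFrame(np.arange(30).reshape(10, 3),
                    index=['a' + str(i) for i in range(10)],
                    columns=['X', 'Y', 'Z'])

# Comparing the column Index with a label yields a Boolean NumPy array
col_mask = demo.columns != 'Y'

# Keep every column except 'Y'
without_y = demo.loc[:, col_mask]
kept = list(without_y.columns)
```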
