# Indexing and Selecting Data

### Selecting Rows

Selecting rows in dataframes is similar to the indexing you have seen in numpy arrays. The syntax ```df[start_index:end_index]``` returns the rows from position ```start_index``` up to, but not including, position ```end_index```.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_excel("iris.xls")
df.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,iris
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Notice that, by default, pandas assigns integer labels to the rows, starting at 0.

In [2]:
# Selecting the rows at positions 2 to 6 (the end index 7 is excluded)
df[2:7]

Unnamed: 0,sepal length,sepal width,petal length,petal width,iris
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa


In [5]:
# Selecting alternate rows starting from index = 5
df[5::2].head(10)

Unnamed: 0,sepal length,sepal width,petal length,petal width,iris
5,5.4,3.9,1.7,0.4,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
11,4.8,3.4,1.6,0.2,Iris-setosa
13,4.3,3.0,1.1,0.1,Iris-setosa
15,5.7,4.4,1.5,0.4,Iris-setosa
17,5.1,3.5,1.4,0.3,Iris-setosa
19,5.1,3.8,1.5,0.3,Iris-setosa
21,5.1,3.7,1.5,0.4,Iris-setosa
23,5.1,3.3,1.7,0.5,Iris-setosa
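
The slicing rules used above can be checked on a small frame without loading the Excel file. The following is a minimal sketch: the `demo` frame and its `value` column are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: ten rows with default labels 0..9
demo = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]})

first_slice = demo[2:7]   # positions 2, 3, 4, 5, 6 (position 7 is excluded)
alternate = demo[5::2]    # every second row, starting at position 5
last_three = demo[-3:]    # negative positions count from the end

print(first_slice.index.tolist())   # [2, 3, 4, 5, 6]
print(alternate.index.tolist())     # [5, 7, 9]
print(last_three.index.tolist())    # [7, 8, 9]
```

As with numpy arrays and Python lists, the slice takes the form ```start:stop:step```, and any of the three parts can be omitted.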


### Selecting Columns

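There are two simple ways to select a single column from a dataframe: ```df['column_name']``` and ```df.column_name```. The attribute form works only when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method, so a column such as ```sepal length``` (which contains a space) must be selected with the bracket form.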

In [6]:
# Using df['column']
sepal_length = df['sepal length']
sepal_length.head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal length, dtype: float64

In [7]:
# Using df.column
iris = df.iris
iris.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: iris, dtype: object

In [8]:
# Notice that in both these cases, the result is a Series object
print(type(df['iris']))
print(type(df.iris))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
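
The two access styles are not fully interchangeable. The sketch below uses a made-up two-row frame (not the iris file), with a hypothetical column called `count` chosen because it clashes with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame; "count" is included only to demonstrate a name clash
demo = pd.DataFrame({
    "sepal length": [5.1, 4.9],
    "iris": ["Iris-setosa", "Iris-setosa"],
    "count": [1, 2],
})

by_bracket = demo["sepal length"]   # bracket form works for any column name
by_attribute = demo.iris            # attribute form: valid identifier, no clash

# demo.count resolves to the DataFrame.count method, not to the "count" column
count_is_method = callable(demo.count)
count_column = demo["count"]        # the bracket form still reaches the column

print(type(by_bracket).__name__, type(by_attribute).__name__)   # Series Series
print(count_is_method, count_column.tolist())                   # True [1, 2]
```

For this reason the bracket form is the safer default, and the attribute form is best kept for quick interactive use.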


#### Selecting Multiple Columns 

You can select multiple columns by passing the list of column names inside the ```[]```: ```df[['column_1', 'column_2', 'column_n']]```.

For instance, to select only the columns ```sepal length```, ```petal length``` and ```iris```:

In [8]:
df[['sepal length', 'petal length', 'iris']].head()

Unnamed: 0,sepal length,petal length,iris
0,5.1,1.4,Iris-setosa
1,4.9,1.4,Iris-setosa
2,4.7,1.3,Iris-setosa
3,4.6,1.5,Iris-setosa
4,5.0,1.4,Iris-setosa


Notice that in this case, the output is itself a dataframe.

In [10]:
type(df[['sepal length', 'petal length', 'iris']])

pandas.core.frame.DataFrame

In [10]:
# Similarly, if you select one column using double square brackets, 
# you'll get a df, not Series

print(type(df['iris'])) #series
type(df[['iris']]) #dataframe

<class 'pandas.core.series.Series'>


pandas.core.frame.DataFrame
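
The difference between the two results also shows up in their dimensions. Here is a small sketch on a made-up three-row frame (the data is illustrative, not read from the iris file):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the iris data
demo = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7],
    "iris": ["Iris-setosa"] * 3,
})

as_series = demo["iris"]     # single brackets: one-dimensional Series
as_frame = demo[["iris"]]    # list with one name: two-dimensional DataFrame

print(as_series.shape, as_series.ndim)   # (3,) 1
print(as_frame.shape, as_frame.ndim)     # (3, 1) 2
print(as_frame.columns.tolist())         # ['iris']
```

The double-bracket form is handy when later code expects a dataframe, even if only one column is needed.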

### Selecting Subsets of Dataframes

So far, you have seen the following ways of selecting rows and columns:
* Selecting rows: ```df[start:stop]```
* Selecting columns: ```df['column']``` or ```df.column``` or ```df[['col_x', 'col_y']]```
    * ```df['column']``` or ```df.column``` return a series
    * ```df[['col_x', 'col_y']]``` returns a dataframe

However, this is not the preferred way of indexing dataframes, since it is ambiguous. For instance, let's try to select the third row of the dataframe.



In [11]:
# Trying to select the third row: raises a KeyError
df[2]

KeyError: 2

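Pandas raises a ```KeyError``` because a single value inside ```[]``` is looked up among the column labels, and there is no column named ```2```. Only a slice such as ```[2:3]``` is treated as a row selection, so the same brackets mean different things depending on what you put inside them. The ambiguity gets worse once the row labels are no longer the default integers. Recall from the previous section that you can change the row indices.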
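
The behaviour of ```df[2]``` can be reproduced on a small frame. The sketch below uses made-up data; the frames `plain` and `with_int_column` are hypothetical and only illustrate how a single value and a slice are handled differently.

```python
import pandas as pd

# Hypothetical frame with no column labelled 2
plain = pd.DataFrame({"a": [10, 20, 30]})

try:
    plain[2]                 # a single value is looked up among the column labels
    raised_key_error = False
except KeyError:
    raised_key_error = True

rows_from_two = plain[2:]    # a slice, by contrast, selects rows by position

# Hypothetical frame that does have a column labelled 2
with_int_column = pd.DataFrame({"a": [10, 20, 30], 2: [0.1, 0.2, 0.3]})
column_two = with_int_column[2]   # returns the column, not the third row

print(raised_key_error)             # True
print(rows_from_two["a"].tolist())  # [30]
print(column_two.tolist())          # [0.1, 0.2, 0.3]
```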

In [12]:
# Changing the row indices to sepal length
df.set_index('sepal length').head()

sepal length,sepal width,petal length,petal width,iris
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa


Now imagine you had a column with entries ```[2, 4, 7, 8 ...]```, and you set that as the index. What should ```df[2]``` return: the third row (position 2), or the row with the index label 2?

Taking an example from this dataset, say you decide to assign the ```petal length``` column as the index.

In [13]:
df.set_index('petal length').head()

petal length,sepal length,sepal width,petal width,iris
1.4,5.1,3.5,0.2,Iris-setosa
1.4,4.9,3.0,0.2,Iris-setosa
1.3,4.7,3.2,0.2,Iris-setosa
1.5,4.6,3.1,0.2,Iris-setosa
1.4,5.0,3.6,0.2,Iris-setosa


Now, what should a number inside ```[]``` refer to: a row position, or an index label? The label 1.3, for example, belongs to the third row, whereas position 1 is the second row.

Because of this and similar ambiguities, pandas provides **explicit ways** to subset dataframes: position-based indexing and label-based indexing, which we'll study next.
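
As a brief preview, the two explicit styles can be sketched on the hypothetical ```[2, 4, 7, 8]``` index from the thought experiment above. The frame and its `key` and `value` columns are made up for illustration; ```.iloc``` is the pandas accessor for position-based indexing and ```.loc``` the one for label-based indexing.

```python
import pandas as pd

# Hypothetical frame whose index labels are 2, 4, 7, 8
demo = pd.DataFrame(
    {"key": [2, 4, 7, 8], "value": ["a", "b", "c", "d"]}
).set_index("key")

by_position = demo.iloc[2]   # third row by position: the row labelled 7
by_label = demo.loc[2]       # row whose index label is 2: the first row

print(by_position.name, by_position["value"])   # 7 c
print(by_label.name, by_label["value"])         # 2 a
```

With the accessor spelled out, the same number ```2``` has exactly one meaning in each expression.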