# Indexing and Selecting Data

In this section, you will:

* Select rows from a dataframe
* Select columns from a dataframe
* Select subsets of dataframes

### Selecting Rows

Selecting rows in dataframes is similar to the indexing you have seen in numpy arrays. The syntax ```df[start_index:end_index]``` selects rows by position, from ```start_index``` up to but not including ```end_index```.

In [None]:
import numpy as np
import pandas as pd

market_df = pd.read_csv("../global_sales_data/market_fact.csv")
market_df.head()

In [None]:
# Selecting the rows from indices 2 to 6
market_df[2:7]

In [None]:
# Selecting alternate rows starting from index = 5
market_df[5::2].head()
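
The slicing rules above can be checked on a small dataframe that does not depend on the CSV file. This is a minimal sketch: `toy_df` and its values are made up for illustration and are not part of the market data.

```python
import pandas as pd

# Hypothetical toy data standing in for market_fact.csv
toy_df = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2", "Ord_3", "Ord_4", "Ord_5", "Ord_6"],
    "Sales": [100.0, 250.5, 80.0, 410.0, 55.25, 300.0],
})

# Rows at positions 1, 2 and 3; the stop position 4 is excluded
middle = toy_df[1:4]
print(middle)

# Every second row, starting from position 1
alternate = toy_df[1::2]
print(alternate)
```

As with numpy arrays and Python lists, the stop position is excluded, so `toy_df[1:4]` contains three rows.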

### Selecting Columns

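There are two simple ways to select a single column from a dataframe - ```df['column_name']``` and ```df.column_name```. The second form works only when the column name is a valid Python identifier (no spaces, for example) and does not clash with an existing dataframe attribute or method.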

In [None]:
# Using df['column']
sales = market_df['Sales']
sales.head()


In [None]:
# Using df.column
sales = market_df.Sales
sales.head()

In [None]:
# Notice that in both these cases, the result is a Series object
print(type(market_df['Sales']))
print(type(market_df.Sales))
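
The two notations are not always interchangeable. The sketch below uses a made-up dataframe (`toy_df`, with hypothetical column names `Order Quantity` and `count`) to show cases where only the bracket notation reaches the column.

```python
import pandas as pd

# Hypothetical columns chosen to show where df.column falls short
toy_df = pd.DataFrame({
    "Sales": [100.0, 250.5, 80.0],
    "Order Quantity": [3, 10, 1],   # name contains a space
    "count": [1, 2, 3],             # name clashes with DataFrame.count()
})

# Bracket notation works for every column name
quantity = toy_df["Order Quantity"]
count_col = toy_df["count"]

# Attribute access works for simple names such as Sales
sales = toy_df.Sales

# toy_df.count is the DataFrame method, not the column
print(type(count_col))
print(callable(toy_df.count))
```

When in doubt, ```df['column_name']``` is the safer choice, since it never collides with method names.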


#### Selecting Multiple Columns 

You can select multiple columns by passing a list of column names inside the ```[]```: ```df[['column_1', 'column_2', 'column_n']]```.

For instance, to select only the columns Cust_id, Sales and Profit:

In [None]:
# Select Cust_id, Sales and Profit:
market_df[['Cust_id', 'Sales', 'Profit']].head()

Notice that in this case, the output is a dataframe, not a series (as expected, since series are one-dimensional).

In [None]:
type(market_df[['Cust_id', 'Sales', 'Profit']])

In [None]:
# Similarly, if you select one column using double square brackets, 
# you'll get a df, not Series

type(market_df[['Sales']])

### Selecting Subsets of Dataframes

So far, you have selected rows and columns in the following ways:
* Selecting rows: ```df[start:stop]```
* Selecting columns: ```df['column']``` or ```df.column``` or ```df[['col_x', 'col_y']]```
    * ```df['column']``` or ```df.column``` return a series
    * ```df[['col_x', 'col_y']]``` returns a dataframe

However, this is not the recommended way to index dataframes, because it is ambiguous. For instance, let's try to select the third row of the dataframe.



In [None]:
# Trying to select the third row: raises a KeyError
market_df[2]

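Pandas raises a ```KeyError``` here. With a single value inside ```[]```, pandas looks for a *column* labelled ```2``` rather than the row at *position* 2, and no such column exists. The same brackets thus select rows when given a slice and columns when given a single value. The ambiguity gets worse once the row labels are themselves changed. Recall from the previous section that you can change the row indices.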

In [None]:
# Changing the row indices to Ord_id
market_df.set_index('Ord_id').head()

Now imagine you had a column with entries [2, 4, 7, 8 ...], and you set that as the index. What should ```df[2]``` return?
The third row (position 2), or the row with the index value = 2?

Because of this and similar other ambiguities, pandas provides **explicit ways** to subset dataframes - position based indexing and label based indexing, which we'll study next.
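
As a brief preview, the ambiguity can be reproduced on a made-up dataframe whose row labels are the integers 2, 4, 7 and 8. The sketch below assumes the explicit accessors are ```iloc``` (position based) and ```loc``` (label based), which the next section covers in detail; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose row labels are the integers 2, 4, 7, 8
toy_df = pd.DataFrame(
    {"Sales": [10.0, 20.0, 30.0, 40.0]},
    index=[2, 4, 7, 8],
)

# Plain brackets with a single value look for a column named 2
try:
    toy_df[2]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Position based: the row at position 2 (the third row)
by_position = toy_df.iloc[2]

# Label based: the row whose index label is 2 (the first row)
by_label = toy_df.loc[2]

print(lookup_failed, by_position["Sales"], by_label["Sales"])
```

The two accessors return different rows for the same value `2`, which is exactly the distinction that plain ```df[2]``` cannot express.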