# Indexing and selecting data

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## More on NumPy indexing

In [2]:
a = np.array([-2, 3, 4, -5, 5])
print(a)

[-2  3  4 -5  5]


### Fancy indexing

Apart from indexing with integers and slices, NumPy also supports indexing with arrays of integers (so-called *fancy indexing*). For example, to get the 2nd and 4th elements of ``a``:

In [3]:
a[[1, 3]]

array([ 3, -5])
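
As a small sketch of how the index list drives the result (the name `picked` is only illustrative): positions may be repeated or reordered, and the returned array is a copy rather than a view of ``a``.

```python
import numpy as np

a = np.array([-2, 3, 4, -5, 5])

# The result follows the index list: position 3 first, then position 1 twice.
picked = a[[3, 1, 1]]
print(picked)

# Fancy indexing returns a copy, so changing `picked` leaves `a` untouched.
picked[0] = 100
print(a)
```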

### Boolean indexing

To select data fulfilling specific criteria, one can use *boolean indexing*. This is best illustrated on 1D arrays; for example, let's select only the positive elements of ``a``:


In [4]:
a[a > 0]

array([3, 4, 5])

Note that the index array has the same size as ``a`` and a boolean dtype:

In [5]:
print(a)
print(a > 0)

[-2  3  4 -5  5]
[False  True  True False  True]


Multiple criteria can also be combined in one query:

In [6]:
a[(a > 0) & (a < 5)]

array([3, 4])
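
The sketch below illustrates the other element-wise operators that can combine masks: `|` for "or" and `~` for negation (the variable names are illustrative). The parentheses around each comparison are needed because `&`, `|` and `~` bind more tightly than `>` and `<`. Python's `and` keyword does not work here, since it expects a single truth value rather than an array.

```python
import numpy as np

a = np.array([-2, 3, 4, -5, 5])

either = a[(a < 0) | (a > 4)]   # elements that are negative OR greater than 4
not_positive = a[~(a > 0)]      # ~ inverts a boolean mask

# `and` asks for the truth value of a whole array, which is ambiguous.
try:
    a[(a > 0) and (a < 5)]
except ValueError as err:
    print("ValueError:", err)
```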

<div class="alert alert-success">
    <b>EXERCISE</b>: Select all odd numbers from the array <code>a</code>
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: Select <b>negative</b> odd numbers from the array <code>a</code>
</div>

## Indexing pandas `Series`

A ``Series`` can be indexed similarly to a 1D NumPy array.

In [9]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
print(population)

Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
United Kingdom    64.9
dtype: float64


Fancy indexing also works with the labels of the index:

In [10]:
population[['Netherlands', 'Germany']]

Netherlands    16.9
Germany        81.3
dtype: float64

Similarly, boolean indexing can be used to filter the ``Series``. Let's select countries with a population of more than 20 million:

In [11]:
population[population > 20]

France            64.3
Germany           81.3
United Kingdom    64.9
dtype: float64

You can also do position-based indexing by using integers instead of labels:

In [12]:
population[:2]

Belgium    11.3
France     64.3
dtype: float64
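
A plain slice such as `population[:2]` is interpreted by position. The sketch below makes the intent explicit with the `iloc` (position) and `loc` (label) indexers of a ``Series``. It rebuilds the series with an explicitly ordered index, which is an assumption made here so that the result does not depend on how a given pandas version orders dictionary keys.

```python
import pandas as pd

population = pd.Series(
    [11.3, 64.3, 81.3, 16.9, 64.9],
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
)

first_two = population.iloc[:2]                # positions 0 and 1; the stop is excluded
by_label = population.loc['Belgium':'France']  # label slice; the stop is included

print(first_two.equals(by_label))  # True
print(population.iloc[-1])         # last element by position: 64.9
```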

## Indexing `DataFrame`

In [13]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

,area,capital,country,population
0,30510,Brussels,Belgium,11.3
1,671308,Paris,France,64.3
2,357050,Berlin,Germany,81.3
3,41526,Amsterdam,Netherlands,16.9
4,244820,London,United Kingdom,64.9


In [14]:
countries = countries.set_index('country')
countries

country,area,capital,population
Belgium,30510,Brussels,11.3
France,671308,Paris,64.3
Germany,357050,Berlin,81.3
Netherlands,41526,Amsterdam,16.9
United Kingdom,244820,London,64.9


## Some notes on selecting data

Data frames allow for labeling rows and columns, but this also makes indexing a bit more complex compared to 1D NumPy arrays and pandas ``Series``. We now have to distinguish between:

- selection of rows or columns,
- selection by label or position.

### `[]` provides some convenience shortcuts 

For a ``DataFrame``, basic indexing selects the columns.

Selecting a single column:

In [15]:
countries['area']

country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

or multiple columns using fancy indexing:

In [16]:
countries[['area', 'population']]

country,area,population
Belgium,30510,11.3
France,671308,64.3
Germany,357050,81.3
Netherlands,41526,16.9
United Kingdom,244820,64.9


But slicing accesses the rows:

In [17]:
countries['France':'Netherlands']

country,area,capital,population
France,671308,Paris,64.3
Germany,357050,Berlin,81.3
Netherlands,41526,Amsterdam,16.9


We can also select rows similarly to boolean indexing in NumPy. The boolean mask should be 1-dimensional and have the same length as the thing being indexed. Boolean indexing of a `DataFrame` can be used like the `WHERE` clause of SQL to select **rows** matching some criteria:

In [18]:
countries[countries['area'] > 100000]

country,area,capital,population
France,671308,Paris,64.3
Germany,357050,Berlin,81.3
United Kingdom,244820,London,64.9
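
As with NumPy arrays, several conditions can be combined into one mask, much like `AND` in a SQL `WHERE` clause. The following is a minimal sketch that rebuilds the `countries` frame so it runs on its own; the thresholds are arbitrary illustration values.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                   name='country'),
)

# The mask is a boolean Series aligned with the rows of the frame.
mask = (countries['area'] > 100000) & (countries['population'] < 70)
selected = countries[mask]

print(selected.index.tolist())  # ['France', 'United Kingdom']
```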


In summary, `[]` provides the following convenience shortcuts:

<table>
<tr>
<td></td>
<td>NumPy/<code>Series</code></td>
<td><code>DataFrame</code></td>
</tr>
<tr>
<td>Single label<br><code>data[label]</code></td>
<td>single element</td>
<td>single <b>column</b></td>
</tr>
<tr>
<td>Slice<br><code>data[label1:label2]</code></td>
<td>sequence</td>
<td>one or more <b>rows</b></td>
</tr>
<tr>
<td>Fancy indexing<br><code>data[[label1,label2]]</code></td>
<td>sequence</td>
<td>one or more <b>columns</b></td>
</tr>
<tr>
<td>Boolean indexing<br><code>data[mask]</code></td>
<td>sequence</td>
<td>one or more <b>rows</b></td>
</tr>
</table>
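
The shortcuts for a `DataFrame` can be checked with a short sketch that looks at the type and shape of each result. The frame is rebuilt here so the block is self-contained, and the variable names are illustrative. Note that a single label returns a `Series`, while a list with one label returns a one-column `DataFrame`.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                   name='country'),
)

single = countries['area']                         # one label -> one column (Series)
as_frame = countries[['area']]                     # list of labels -> columns (DataFrame)
row_slice = countries['France':'Germany']          # slice -> rows, both ends included
filtered = countries[countries['population'] > 60] # boolean mask -> rows

print(type(single).__name__, type(as_frame).__name__)  # Series DataFrame
print(row_slice.shape, filtered.shape)                 # (2, 3) (3, 3)
```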

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the area of Germany relative to the total area of all other countries in the data frame. <i>Hint</i>: you can compare the index of the data frame to any string.
</div>

### Systematic indexing with `loc` and `iloc`

When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by label
* `iloc`: selection by position

These indexers take one indexer per dimension of the frame:

* `df.loc[row_indexer, column_indexer]`
* `df.iloc[row_indexer, column_indexer]`

Selecting a single element:

In [20]:
countries.loc['Germany', 'area']

357050

But the row or column indexer can also be a list, a slice, a boolean array, etc.:

In [21]:
countries.loc['France':'Germany', ['area', 'population']]

country,area,population
France,671308,64.3
Germany,357050,81.3
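
A boolean mask as the row indexer combines naturally with a list of column labels. This is a minimal sketch with the frame rebuilt for self-containment; the name `large` and the threshold are illustrative.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                   name='country'),
)

# Rows are chosen by the mask, columns by the label list (in the order given).
large = countries.loc[countries['area'] > 100000, ['capital', 'area']]
print(large)
```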


---
Selecting by position with `iloc` works similarly to indexing NumPy arrays:

In [22]:
countries.iloc[:2,1:3]

country,capital,population
Belgium,Brussels,11.3
France,Paris,64.3
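
One difference between the two indexers is easy to miss: label slices with `loc` include the stop label, whereas positional slices with `iloc` exclude the stop position, as in NumPy. The sketch below shows this on a rebuilt copy of the frame, with the columns in the order area, capital, population as in the outputs shown here.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                   name='country'),
)

# Label slices include both endpoints: 3 rows, 2 columns.
by_label = countries.loc['Belgium':'Germany', 'area':'capital']

# Positional slices exclude the stop: 2 rows, 1 column.
by_pos = countries.iloc[0:2, 0:1]

print(by_label.shape, by_pos.shape)  # (3, 2) (2, 1)
```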


The different indexing methods can also be used to assign data:

In [23]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10

In [24]:
countries2

country,area,capital,population
Belgium,30510,Brussels,10.0
France,671308,Paris,10.0
Germany,357050,Berlin,10.0
Netherlands,41526,Amsterdam,16.9
United Kingdom,244820,London,64.9
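
Assignment also works with a boolean mask as the row indexer. This sketch sets the population of all countries below an arbitrary area threshold in a single `loc` call, working on a copy so the original frame stays intact.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
                   name='country'),
)

countries2 = countries.copy()

# Select rows and column in ONE indexing step, then assign.
countries2.loc[countries2['area'] < 50000, 'population'] = 0.0

print(countries2['population'].tolist())      # [0.0, 64.3, 81.3, 0.0, 64.9]
print(countries.loc['Belgium', 'population']) # the original is unchanged: 11.3
```

Doing the selection in one `loc` call matters: a chained form such as `countries2[mask]['population'] = 0.0` generally assigns to a temporary copy and leaves `countries2` unchanged.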


---

<div class="alert alert-success">
    <b>EXERCISE</b>: Add a column <code>density</code> with the population density (note: the population column is expressed in millions)
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: Select the capital and the population column of those countries where the density is larger than 300
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: List the names, capitals and population densities of the two countries with the highest population density.
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: Change the capital of the UK to Cambridge
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: Select all countries whose population density is between 100 and 300 people/km²
</div>

## More exercises!

For the quick ones among you, here are some more exercises using a larger dataframe with film data. These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for it. You can download the data here: [`titles.csv`](https://drive.google.com/file/d/0B3G70MlBnCgKa0U4WFdWdGdVOFU/view?usp=sharing) and [`cast.csv`](https://drive.google.com/file/d/0B3G70MlBnCgKRzRmTWdQTUdjNnM/view?usp=sharing). Put both files in the `data/` folder next to this notebook.

In [30]:
cast = pd.read_csv('data/cast.csv')
cast.head()

,title,year,name,type,character,n
0,Suuri illusioni,1985,Homo $,actor,Guests,22.0
1,Gangsta Rap: The Glockumentary,2007,Too $hort,actor,Himself,
2,Menace II Society,1993,Too $hort,actor,Lew-Loc,27.0
3,Porndogs: The Adventures of Sadie,2009,Too $hort,actor,Bosco,3.0
4,Stop Pepper Palmer,2014,Too $hort,actor,Himself,


In [31]:
titles = pd.read_csv('data/titles.csv')
titles.head()

,title,year
0,The Rising Son,1990
1,Ashes of Kukulcan,2016
2,The Thousand Plane Raid,1969
3,Crucea de piatra,1993
4,The 86,2015


<div class="alert alert-success">
    <b>EXERCISE</b>: How many movies are listed in the titles dataframe?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: What are the earliest two films listed in the titles dataframe?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: How many movies have the title "Hamlet"?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: List all of the "Treasure Island" movies from earliest to most recent.
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: How many movies were made from 1950 through 1959?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: How many roles in the movie "Inception" are NOT ranked by an "n" value?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: But how many roles in the movie "Inception" did receive an "n" value?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: Display the cast of "North by Northwest" in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: How many roles were credited in the silent 1921 version of Hamlet?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.
</div>