# Data Indexing and Selection with Pandas

We looked at indexing in NumPy last week. 

We can do something similar in Pandas to access and modify values in a `Series` or a `DataFrame`.

## Data Selection in a Series

Recall: a `Series` is similar to a one-dimensional NumPy array, but it also behaves like a Python dictionary.

Keep these two analogies in mind, as they will help you to understand the patterns of data indexing and selection.

## Series as a dictionary

Like a dictionary, a `Series` maps a collection of keys to a collection of values:

In [1]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Is a given key (index label) present in the `Series`?

We can use standard Pythonic syntax to check:

In [2]:
'a' in data     # True if 'a' is one of the index labels of the Series

True

In [3]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [4]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [5]:
list(data.values)

[0.25, 0.5, 0.75, 1.0]
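As with a dictionary, the `in` operator tests the keys (index labels), not the values. The sketch below rebuilds the `Series` from above to show this; the variable names are only for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

label_found = 'a' in data             # 'a' is an index label
value_as_label = 0.25 in data         # 0.25 is a value, not a label
value_found = 0.25 in data.values     # search the values explicitly

print(label_found, value_as_label, value_found)
```

To test for a value, search `data.values` rather than the `Series` itself.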

Values can be modified using the same dictionary-like syntax.

To extend a `Series`, assign to a new index label:

In [6]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
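The same assignment syntax covers both cases, much like a dictionary. A minimal sketch, starting again from the original four-element `Series`:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data['a'] = 0.3     # existing label: the value is overwritten
data['e'] = 1.25    # new label: the Series grows by one entry

print(data)
```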

## Treating a Series as a one-dimensional array

A `Series` builds on this dictionary-like interface and provides array-style item selection using the same basic mechanisms as NumPy arrays:

- slices
- masking
- fancy indexing

In [7]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [8]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [9]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [10]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64
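To see what masking does, it can help to split the masking cell into two steps. This is a sketch using the five-element `Series` from above; `mask` and `selected` are names chosen here for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Element-wise comparisons produce a boolean Series with the same index
mask = (data > 0.3) & (data < 0.8)

# Indexing with the mask keeps only the entries where it is True
selected = data[mask]

print(mask)
print(selected)
```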

## Indexers: loc and iloc

The previous indexing and slicing conventions can be confusing, e.g. if your `Series` has an explicit __integer__ index:

- Indexing (`data[1]`) uses the explicit index.
- Slicing (`data[1:3]`) uses the implicit, position-based index.

In [11]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [12]:
# explicit index when indexing
data[1]

'a'

In [13]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

How could you overcome this issue?

__indexer attributes__

- `loc` allows indexing and slicing that always reference the explicit index

In [14]:
data.loc[1]

'a'

In [15]:
data.loc[1:3]

1    a
3    b
dtype: object

- `iloc` allows indexing and slicing that always reference the implicit Python-style index

In [16]:
data.iloc[1]

'b'

In [17]:
data.iloc[1:3]

3    b
5    c
dtype: object
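The two outputs above also show a difference in how slices end: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as in ordinary Python slicing. A short sketch of the two side by side, using the same integer-indexed `Series`:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1:3]       # labels 1 through 3, end label included
by_position = data.iloc[1:3]   # positions 1 and 2, end position excluded

print(by_label)
print(by_position)
```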

A guiding principle of Python is:

"explicit is better than implicit"

Use `loc` and `iloc` to make it clear whether your code selects by label or by position.

In [18]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Data Selection from a DataFrame

Sometimes a `DataFrame` acts like a 2D array.

Sometimes it acts like a dictionary of `Series` structures sharing the same index.

Keep these two analogies in mind.

## DataFrame as a dictionary

Recall our population example:

In [19]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual `Series` that make up the columns can be accessed via dictionary-style indexing on the column name:

In [20]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equally, we can use attribute-style access for columns whose *names are strings*:

In [21]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Do the two forms return the same object?

In [22]:
data.area is data['area']

True

Attribute access is convenient, but it doesn't always work:

- when the column names are not strings
- when a column name conflicts with a `DataFrame` method

In [23]:
data.pop

<bound method DataFrame.pop of               area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135>

In [24]:
data.pop?

Signature: data.pop(item: 'Hashable') -> 'Series'
Docstring:
Return item and drop from frame. Raise KeyError if not found.

Parameters
----------
item : label
    Label of column to be popped.

Returns
-------
Series

Examples
--------
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

In [25]:
data.pop is data['pop']

False
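A rough way to tell in advance whether attribute access will reach a column is sketched below. The helper `attribute_access_is_safe` is hypothetical (it is not part of pandas) and only covers the two failure cases listed above, plus Python keywords.

```python
import keyword

import pandas as pd


def attribute_access_is_safe(df, name):
    """Hypothetical helper: rough check that df.<name> reaches the column `name`."""
    return (
        isinstance(name, str)
        and name.isidentifier()                # must be a valid Python name
        and not keyword.iskeyword(name)        # e.g. 'class' cannot follow a dot
        and name in df.columns                 # must actually be a column
        and not hasattr(pd.DataFrame, name)    # must not clash with a DataFrame attribute
    )


data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

print(attribute_access_is_safe(data, 'area'))
print(attribute_access_is_safe(data, 'pop'))
```

When in doubt, `data['name']` works for every column.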

Column assignment shouldn't be done by attribute either, as this can also lead to confusion.

Use `data['pop'] = z`, not `data.pop = z`.

New columns can be added with the same dictionary-style syntax:

In [26]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [27]:
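# The new Series has a single element labelled 0. Assignment aligns on index
# labels, and none of them match a state name, so every row of 'ts' is NaN.
data['ts'] = pd.Series([range(5)])

data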

              area       pop     density   ts
California  423967  38332521   90.413926  NaN
Texas       695662  26448193   38.018740  NaN
New York    141297  19651127  139.076746  NaN
Florida     170312  19552860  114.806121  NaN
Illinois    149995  12882135   85.883763  NaN
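The `ts` column is empty because pandas aligns an assigned `Series` on its index labels, not on its position. The sketch below reproduces the effect on a smaller frame and then shows one way to fill the column. It assumes the intent was one integer per row, which the notes do not state.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
data = pd.DataFrame({'area': area})

# A one-element Series labelled 0: no label matches a state name
data['ts'] = pd.Series([range(5)])
all_missing = data['ts'].isna().all()

# Assumed intent: one integer per row, labelled with the DataFrame's own index
data['ts'] = pd.Series(range(5), index=data.index)

print(all_missing)
print(data)
```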


## DataFrame as a two-dimensional array

We can examine the raw underlying values of a `DataFrame` using the `.values` attribute:

In [28]:
data.values

array([[423967, 38332521, 90.41392608386974, nan],
       [695662, 26448193, 38.01874042279153, nan],
       [141297, 19651127, 139.07674614464568, nan],
       [170312, 19552860, 114.80612053173, nan],
       [149995, 12882135, 85.88376279209307, nan]], dtype=object)

In [29]:
data

              area       pop     density   ts
California  423967  38332521   90.413926  NaN
Texas       695662  26448193   38.018740  NaN
New York    141297  19651127  139.076746  NaN
Florida     170312  19552860  114.806121  NaN
Illinois    149995  12882135   85.883763  NaN


Many array-like operations can be carried out on a `DataFrame` too.

For example, we can transpose it to swap the rows and columns:

In [30]:
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332521.0  26448193.0  19651127.0  19552860.0  12882135.0
density   90.413926    38.01874  139.076746  114.806121   85.883763
ts              NaN         NaN         NaN         NaN         NaN


There are some subtle differences between a `DataFrame` and a NumPy array:

- e.g. passing a single index to a NumPy array accesses a row

In [31]:
data.values[0]

array([423967, 38332521, 90.41392608386974, nan], dtype=object)

In [32]:
list(data.items())[0]

('area',
 California    423967
 Texas         695662
 New York      141297
 Florida       170312
 Illinois      149995
 Name: area, dtype: int64)

Passing a single "index" to a `DataFrame` accesses a column:

In [33]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
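Since a single key on a `DataFrame` selects a column, rows have to be requested through an indexer. A minimal sketch on a three-row version of the population data (the variable names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

column = data['area']                    # a single key selects a column
row_by_position = data.iloc[0]           # first row, by integer position
row_by_label = data.loc['California']    # the same row, by index label

print(column)
print(row_by_position)
```

Each row comes back as a `Series` whose index is the column names.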

With the `iloc` indexer, we can index the `DataFrame` as if it were a simple NumPy array (using the implicit Python-style index), while the index and column labels are kept in the result:

In [34]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


We can also use the explicit indexer, `loc`, with index and column names:

In [35]:
data.loc[:"Illinois", :"pop"]       # data.loc[:row, :column]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


Masking and fancy indexing can be combined within the pandas indexers:

In [36]:
data.loc[data.density > 100, ['pop', 'area']]

               pop    area
New York  19651127  141297
Florida   19552860  170312
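The combined selection above can be read as two separate pieces: a boolean mask that picks rows, and a list of names that picks columns. A sketch that spells this out, rebuilding the population data (`dense` and `columns` are names chosen here for illustration):

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

dense = data['density'] > 100     # masking: a boolean Series over the rows
columns = ['pop', 'area']         # fancy indexing: a list of column labels

result = data.loc[dense, columns]
print(result)
```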


Any of these conventions can also be used to set or modify values:

In [37]:
data.iloc[0, 2] = 90
data

              area       pop     density   ts
California  423967  38332521   90.000000  NaN
Texas       695662  26448193   38.018740  NaN
New York    141297  19651127  139.076746  NaN
Florida     170312  19552860  114.806121  NaN
Illinois    149995  12882135   85.883763  NaN
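Values can be set through `loc` as well, either for a single labelled cell or for every row that matches a mask. A minimal sketch on a reduced copy of the data; the replacement values are arbitrary and only serve to show the syntax.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'density': [90.413926, 38.018740, 139.076746]},
                    index=['California', 'Texas', 'New York'])

# One cell, addressed by row label and column label
data.loc['Texas', 'density'] = 38.0

# Every row where the mask is True, in the 'density' column only
data.loc[data['density'] > 100, 'density'] = 100.0

print(data)
```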


Practice makes perfect: give all of the above a go to get used to the different approaches. Do so in the lab and in the assignment.

# Summary

Sometimes we can treat a `DataFrame` like a 2D array.

Sometimes we treat it like a dictionary of `Series`.

We've looked at a combination of implicit and explicit indexing tactics.

Indexing, slices, masks and fancy indexing can all be used to access specific parts of the data that you're interested in.