# Data Indexing and Selection

We looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included indexing (e.g., ``arr[2, 1]``), slicing (e.g., ``arr[:, 1:5]``), masking (e.g., ``arr[arr > 0]``), fancy indexing (e.g., ``arr[0, [1, 5]]``), and combinations thereof (e.g., ``arr[:, [1, 5]]``).
**Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.**
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.

## Data Selection in Series

As we saw in the previous section, a **``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.**
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

In [17]:
import numpy as np

In [18]:
lista = [1,2,3,4,5]
lista[1:-1]

x = np.arange(9).reshape(3,3)
print(x)
print()
print(x[:,:1])

type(x)

[[0 1 2]
 [3 4 5]
 [6 7 8]]

[[0]
 [3]
 [6]]


numpy.ndarray

In [19]:
x[0,[0,2]]



array([0, 2])

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [20]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [21]:
data['c']    # by index label (explicit index)

0.75

In [22]:
data[2]   # by position (implicit index)

0.75

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [23]:
print(0.5 in data)           # `in` checks the index labels, not the values
print(0.75 in data.values)   # membership among the values
print('c' in data)           # 'c' is an index label

False
True
True
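The difference between keys and values is easy to trip over. The following is a minimal sketch (the variable names are only illustrative) of how ``in`` follows dictionary semantics on a ``Series``:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# `in` follows dictionary semantics: it tests the index labels (the keys)
label_found = 'b' in s
value_as_label = 0.5 in s        # 0.5 is a value, not a label

# to test membership among the values, look at the underlying array
value_found = 0.5 in s.values

print(label_found, value_as_label, value_found)
```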


In [24]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [25]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [26]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [27]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [28]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [29]:
data['e'] = 1.25   # new label: extends the Series
data['b'] = 1.2    # existing label: overwrites the value
data

a    0.25
b    1.20
c    0.75
d    1.00
e    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.
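As a small self-contained sketch of this behavior (using a hypothetical two-element ``Series`` rather than the ``data`` object above), assignment either overwrites or appends depending on whether the label already exists:

```python
import pandas as pd

s = pd.Series([0.25, 0.5], index=['a', 'b'])

s['b'] = 1.2     # existing label: the value is overwritten in place
s['c'] = 0.75    # new label: the Series grows by one entry

print(s)
```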

### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [30]:
# a plain Python dictionary, by contrast, cannot be sliced by key
d = {'a': 1, 'b': 2, 'c': 3}
d['a':'c']

TypeError: unhashable type: 'slice'

In [31]:
# slicing by explicit index
data['a':'c']

a    0.25
b    1.20
c    0.75
dtype: float64

In [32]:
type(data)

pandas.core.series.Series

In [34]:
# slicing by implicit integer index
data[0:3] 

a    0.25
b    1.20
c    0.75
dtype: float64

In [35]:
data

a    0.25
b    1.20
c    0.75
d    1.00
e    1.25
dtype: float64

In [36]:
# masking
data[(data > 0.3) & (data < 0.8)]

c    0.75
dtype: float64

In [37]:
# fancy indexing
data[['a', 'e','c']]

a    0.25
e    1.25
c    0.75
dtype: float64

Among these, slicing may be the source of the most confusion.
**Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:3]``), the final index is *excluded* from the slice.**
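The endpoint rule can be checked directly. This sketch rebuilds a small ``Series`` and, as a deliberate choice, uses ``iloc`` for the positional slice so that the intent is unambiguous:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s['a':'c']         # label slice: the endpoint 'c' is included
by_position = s.iloc[0:2]     # positional slice: position 2 is excluded

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```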

### Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as **``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.**

In [38]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [39]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [40]:
# explicit index when indexing
data[1]

'a'

In [41]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the **``loc`` attribute allows indexing and slicing that always references the explicit index:**

In [42]:
data

1    a
3    b
5    c
dtype: object

In [43]:
data.loc[1]

'a'

In [44]:
# data[2] = 'j'
data.pop('23')   # the index has no label '23', so pop raises KeyError
data

KeyError: '23'

In [45]:
data[24] = 'd'
data

1     a
3     b
5     c
24    d
dtype: object

In [46]:
data.loc[5:]

5     c
24    d
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [47]:
data

1     a
3     b
5     c
24    d
dtype: object

In [48]:
data.iloc[1]

'b'

In [49]:
data.iloc[0:2]

1    a
3    b
dtype: object
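To see the two indexers side by side, here is a minimal sketch on a fresh integer-indexed ``Series`` (the name ``s`` is illustrative). The same numbers mean labels to ``loc`` and positions to ``iloc``:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(s.loc[1])             # label 1 -> 'a'
print(s.iloc[1])            # position 1 -> 'b'

label_slice = list(s.loc[1:3])      # labels 1 through 3, endpoint included
position_slice = list(s.iloc[1:3])  # positions 1 and 2, end excluded

print(label_slice)          # ['a', 'b']
print(position_slice)       # ['b', 'c']
```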

A third indexing attribute, ``ix``, was a hybrid of the two, and for ``Series`` objects was equivalent to standard ``[]``-based indexing.
It was deprecated in pandas 0.20 and removed in pandas 1.0, so it appears here only for historical context; its original purpose is easier to see with ``DataFrame`` objects, which we will discuss in a moment.

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, **I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.**

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [50]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
# cities = pd.Series({'California': "Sacramento", 'Texas': "Other city 1",
#                  'New York': "Albany", 'Florida': "Miami",
#                  'Illinois': "Other city 2"})
data = pd.DataFrame({'area':area, 'pop':population})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [51]:
data.reset_index()

        index    area       pop
0  California  423967  38332521
1       Texas  695662  26448193
2    New York  141297  19651127
3     Florida  170312  19552860
4    Illinois  149995  12882135


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [52]:
data['area']                # Returns a Series. Any name it had before the DataFrame was built
                            # is replaced by the dictionary key used in the constructor:
                            # key = column name = Series name

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [53]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

In [54]:
data.pop

<bound method DataFrame.pop of               area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135>

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [55]:
data.area is data['area']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, **if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.**
For instance, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this rather than the ``"pop"`` column:

In [56]:
data.pop is data['pop']

False

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
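The name clash can be demonstrated without relying on object identity. This is a minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both spellings agree
same_values = df.area.equals(df['area'])

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method instead of the column
attr_is_method = callable(df.pop)

# bracket access always returns the column
column = df['pop']

print(same_values, attr_is_method, type(column).__name__)
```

Bracket access works for every column name, which is why it is the safer default for both reading and assignment.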

Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [None]:
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [57]:
data['density'] = data['pop'] / data['area']
data['numero de la suerte'] = 8
data

              area       pop     density  numero de la suerte
California  423967  38332521   90.413926                    8
Texas       695662  26448193   38.018740                    8
New York    141297  19651127  139.076746                    8
Florida     170312  19552860  114.806121                    8
Illinois    149995  12882135   85.883763                    8


This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb).
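As a compact sketch of the two kinds of column assignment used above (the column name ``constant`` is made up for illustration), a ``Series`` on the right-hand side is matched row by row, while a scalar is broadcast to every row:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# arithmetic between two columns is carried out row by row
df['density'] = df['pop'] / df['area']

# assigning a scalar broadcasts it to every row
df['constant'] = 8

print(df)
```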

### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:

In [58]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01, 8.00000000e+00],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01, 8.00000000e+00],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02, 8.00000000e+00],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02, 8.00000000e+00],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01, 8.00000000e+00]])

With this picture in mind, many familiar array-like operations can be performed on the ``DataFrame`` itself.
**For example, we can transpose the full ``DataFrame`` to swap rows and columns:**

In [59]:
data.T

                     California       Texas    New York     Florida    Illinois
area                   423967.0    695662.0    141297.0    170312.0    149995.0
pop                  38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density                90.41393    38.01874    139.0767    114.8061    85.88376
numero de la suerte         8.0         8.0         8.0         8.0         8.0
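The ``values`` output above shows every entry as a float even though ``area`` and ``pop`` hold integers. A minimal sketch on a hypothetical two-row frame suggests why: the raw array needs a single dtype, so mixed integer and float columns are upcast.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
df['density'] = df['pop'] / df['area']

# integer and float columns are upcast to one common dtype in the raw array
arr = df.values
print(arr.dtype)         # float64
print(arr.shape)         # (2, 3)

# transposing swaps the row labels and the column labels
t = df.T
print(list(t.index))     # ['area', 'pop', 'density']
print(list(t.columns))   # ['California', 'Texas']
```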


When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our **ability to simply treat it as a NumPy array.**
In particular, passing a single index to an array accesses a row:

In [60]:
data

              area       pop     density  numero de la suerte
California  423967  38332521   90.413926                    8
Texas       695662  26448193   38.018740                    8
New York    141297  19651127  139.076746                    8
Florida     170312  19552860  114.806121                    8
Illinois    149995  12882135   85.883763                    8


In [61]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01, 8.00000000e+00])

In [62]:
data.iloc[:4,2]  # iloc takes rows and columns separated by a comma:
                 # rows from the start up to (but excluding) position 4,
                 # and the column at position 2 (the third column, 'density')

California     90.413926
Texas          38.018740
New York      139.076746
Florida       114.806121
Name: density, dtype: float64

In [63]:
data['area'][2]   # chained access: column 'area', then the row at position 2

141297

In [64]:
data.iloc[2]           # Row at position 2, indexed by column name.
                       # With single brackets the result is a Series

area                   1.412970e+05
pop                    1.965113e+07
density                1.390767e+02
numero de la suerte    8.000000e+00
Name: New York, dtype: float64

In [None]:
data.iloc[[2]]         # Row at position 2, indexed by column name.
                       # With double brackets the result is a one-row DataFrame

            area       pop     density  numero de la suerte
New York  141297  19651127  139.076746                    8


In [None]:
data.values[2]      # Row at position 2, as a NumPy array
#type(data.values)


array([1.41297000e+05, 1.96511270e+07, 1.39076746e+02, 8.00000000e+00])

In [None]:
area['New York']    # Selecting a row by its explicit label with plain brackets only works on a Series.
                    # On a DataFrame, plain brackets with a label select a column instead:
                    # the column name is the key of the dictionary of Series.

141297

In [65]:
data.loc['New York', 'pop']    # Explicit labels: row first, then column.

19651127

In [None]:
data.iloc[2,1]               # Implicit positions: row first, then column (array style)
data.values[2,1]             # The same element, read from the raw NumPy array
# type(data.iloc[2,1])

19651127.0

In [69]:
a = data[2:3]
a

            area       pop     density  numero de la suerte
New York  141297  19651127  139.076746                    8


In [70]:
b = data.iloc[[2]]
b

            area       pop     density  numero de la suerte
New York  141297  19651127  139.076746                    8


In [71]:
c = data.loc[['New York']]
c


            area       pop     density  numero de la suerte
New York  141297  19651127  139.076746                    8


In [72]:
print(a == b)

          area   pop  density  numero de la suerte
New York  True  True     True                 True
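The cells above select the same row in several ways. The sketch below, on a hypothetical three-row frame, summarizes the pattern: a scalar selector returns a ``Series``, while a list or slice selector returns a ``DataFrame``. It uses ``DataFrame.equals`` for a single True/False comparison instead of the element-wise ``==``.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

row_as_series = df.iloc[2]         # scalar position: a Series
row_as_frame = df.iloc[[2]]        # list of positions: a one-row DataFrame
by_label = df.loc[['New York']]    # same row, selected by label

print(type(row_as_series).__name__)     # Series
print(type(row_as_frame).__name__)      # DataFrame
print(row_as_frame.equals(by_label))    # True
```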


By contrast, passing a single "index" to a ``DataFrame`` accesses a column:

In [73]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc`` and ``iloc`` indexers mentioned earlier (the ``ix`` indexer has since been removed from the library).
Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [None]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [None]:
data.iloc[:2, 0:3]

              area       pop    density
California  423967  38332521  90.413926
Texas       695662  26448193  38.018740


Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [None]:
data.loc['Texas':'New York', 'pop':'density']

               pop     density
Texas     26448193   38.018740
New York  19651127  139.076746
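The two indexers can address the same block of a ``DataFrame``. In this minimal sketch on a hypothetical three-row frame, the only difference is whether the slice endpoints are excluded (positions) or included (labels):

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])
df['density'] = df['pop'] / df['area']

# positions: rows 0-1 and columns 0-1 (end positions are excluded)
by_position = df.iloc[:2, :2]

# labels: both endpoints of each slice are included
by_label = df.loc['California':'Texas', 'area':'pop']

print(by_position.equals(by_label))   # True
```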


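**Note: the ``ix`` indexer was deprecated in pandas 0.20 and removed in pandas 1.0; the cell below is kept for reference and fails on current versions.**

The ``ix`` indexer allowed a hybrid of these two approaches: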

In [None]:
data.ix[:3, :'pop']

AttributeError: 'DataFrame' object has no attribute 'ix'

Keep in mind that for integer indices, the ``ix`` indexer was subject to the same potential sources of confusion as discussed for integer-indexed ``Series`` objects; ``loc`` and ``iloc`` avoid that ambiguity.
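A mixed selection such as "the first three rows, columns up to ``'pop'``" can still be written with the supported indexers. The sketch below shows two possible translations, not an official replacement recipe; the helper name ``stop`` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# option 1: turn the row positions into labels, then use loc
via_loc = df.loc[df.index[:3], :'pop']

# option 2: turn the column label into a position, then use iloc
stop = df.columns.get_loc('pop') + 1   # +1 because iloc excludes the end
via_iloc = df.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```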

Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:

In [None]:
data.loc[(data.density > 100) & (data.area > 150000) , ['pop', 'density', 'area']]

              pop     density    area
Florida  19552860  114.806121  170312
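Building the mask as a separate variable can make this kind of selection easier to read. This is a minimal self-contained sketch that rebuilds the states table and applies a Boolean row mask together with a list of column labels:

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Boolean mask over the rows; each condition needs its own parentheses
mask = (df['density'] > 100) & (df['area'] > 150000)

# rows picked by the mask, columns picked by a list of labels
result = df.loc[mask, ['pop', 'density']]

print(result)
```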


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
data.iloc[0, 2] = 90.59
data

              area       pop     density  numero de la suerte
California  423967  38332521   90.590000                    8
Texas       695662  26448193   38.018740                    8
New York    141297  19651127  139.076746                    8
Florida     170312  19552860  114.806121                    8
Illinois    149995  12882135   85.883763                    8
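Assignment works through ``loc`` and Boolean masks as well as through ``iloc``. The following sketch uses a hypothetical two-row frame, and the cap of 50.0 is an arbitrary value chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'density': [90.413926, 38.018740]},
                  index=['California', 'Texas'])

# by position: row 0, column 1
df.iloc[0, 1] = 90.59
after_iloc = float(df.iloc[0, 1])

# by label
df.loc['Texas', 'density'] = 38.0

# through a Boolean mask: cap every density above 50 at 50.0
df.loc[df['density'] > 50, 'density'] = 50.0

print(df)
```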


To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.

### Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
**First, while *indexing* refers to columns, *slicing* refers to rows:**

In [74]:
data['area'][['California','Illinois']]    # Fancy indexing of rows: the labels are the index of the column Series

California    423967
Illinois      149995
Name: area, dtype: int64

In [None]:
data.loc[:, 'area':'numero de la suerte']

              area       pop     density  numero de la suerte
California  423967  38332521   90.590000                    8
Texas       695662  26448193   38.018740                    8
New York    141297  19651127  139.076746                    8
Florida     170312  19552860  114.806121                    8
Illinois    149995  12882135   85.883763                    8


Such slices can also refer to rows by number rather than by index:

In [None]:
data[0:2]                  # The oddest DataFrame behavior so far: a plain slice selects rows, not columns

              area       pop    density  numero de la suerte
California  423967  38332521  90.590000                    8
Texas       695662  26448193  38.018740                    8
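Both kinds of row slice can be compared against their explicit ``loc`` and ``iloc`` spellings. This is a minimal sketch on a rebuilt copy of the states table:

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# a label slice inside plain brackets selects rows, endpoint included
by_label = df['Florida':'Illinois']

# an integer slice inside plain brackets selects rows by position
by_position = df[1:3]

print(by_label.equals(df.loc['Florida':'Illinois']))
print(by_position.equals(df.iloc[1:3]))
```

Writing the ``loc`` or ``iloc`` form in real code states the intent directly and avoids relying on this shorthand.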


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
data[data.density < 100]  # Keeps the rows for which the condition is True

              area       pop    density  numero de la suerte
California  423967  38332521  90.590000                    8
Texas       695662  26448193  38.018740                    8
Illinois    149995  12882135  85.883763                    8
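Direct masking is shorthand for a row selection through ``loc``. A minimal sketch on a rebuilt states table, with the density recomputed from the raw columns:

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

mask = df['density'] < 100

direct = df[mask]          # a Boolean Series in plain brackets filters rows
explicit = df.loc[mask]    # the same selection, spelled out with loc

print(list(direct.index))
print(direct.equals(explicit))
```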


These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.