# Data Indexing and Selection

Pandas provides two main types of objects, Series and DataFrame. This section looks at the means of accessing and modifying values in both.

## Data selection in Series

A Series in Pandas acts in many ways like a one-dimensional NumPy array, and in many ways like a Python dictionary. Keeping these two analogies in mind helps in understanding the patterns of indexing and selection in a Series.

### Series as dictionary

Like a dictionary, the Series provides a mapping from a collection of keys to a collection of values:

In [40]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
               index = ['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [42]:
data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the keys (the index) and the values:

In [43]:
'a' in data

True

In [44]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [45]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [47]:
data['e'] = 0.125
data

a    0.250
b    0.500
c    0.750
d    1.000
e    0.125
dtype: float64
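
As a minimal sketch of the dictionary analogy, the snippet below builds a small Series named s (a name chosen here for illustration, separate from the data object above), adds a new label, overwrites an existing one, and converts the result to a plain dictionary:

```python
import pandas as pd

# Hypothetical Series used only for this illustration
s = pd.Series([0.25, 0.5], index=['a', 'b'])

s['c'] = 0.75   # assigning to a new label appends an entry
s['a'] = 0.0    # assigning to an existing label overwrites its value

as_dict = s.to_dict()
print(as_dict)
```

The new label is appended at the end of the index, much as a new key is appended to a Python dictionary.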

### Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, that is, slices, masking, and fancy indexing.

In [48]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [49]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [51]:
# masking: keep only the entries for which the condition is True.
# Each comparison needs its own parentheses because & binds more tightly than > and <.
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

> data > 0.3 and data < 0.8 each produce a boolean Series. Here we combine them into a mask that selects a particular subset of the data. When we index with a mask, as in data[data > 0.3], the result is a one-dimensional Series containing all the values that meet the condition; in other words, all the values at positions where the mask is True.
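
To make the mask visible, the sketch below stores it in a variable before using it. The names s and mask are introduced here for illustration only:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# The mask is itself a boolean Series that shares the index of s
mask = (s > 0.3) & (s < 0.8)
selected = s[mask]

print(mask.tolist())
print(selected.to_dict())
```

Only the labels where the mask is True ('b' and 'c') survive the selection.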

In [52]:
# fancy indexing
data[['a', 'e']]

a    0.250
e    0.125
dtype: float64

> Fancy indexing is like the simple indexing we've already seen, except that we pass a list or array of indices in place of a single scalar. When using fancy indexing, the shape of the result reflects the shape of the index array rather than the shape of the array being indexed.
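
A small sketch of this property, assuming a Series s defined only for this example: the result follows the order and length of the list of labels, including repeats.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# The list of labels decides the order and the length of the result
picked = s[['d', 'a', 'a']]

print(list(picked.index))
print(picked.tolist())
```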

Note: when slicing with an explicit index (e.g. data['a':'c']), the final index is **included** in the slice, while when slicing with an implicit index (e.g. data[0:2]), the final index is **excluded** from the slice.
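
The difference between the two slicing conventions can be sketched as follows, using loc and iloc to make the intent explicit. The Series s is a stand-in defined for this example:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']    # label slice: the end label 'c' is included
by_position = s.iloc[0:2]    # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```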

### Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation such as data[1:3] will use the implicit Python-style index. The examples below illustrate the difference:

In [5]:
import pandas as pd

In [4]:
data = pd.Series(['a','b','c'], index = [1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [8]:
# explicit when indexing
data[1]

'a'

In [9]:
# implicit when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:

In [10]:
data.loc[1]

'a'

In [11]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [12]:
data.iloc[1]

'b'

In [15]:
data.iloc[1:3]

3    b
5    c
dtype: object

In short, loc is used for explicit indexing, i.e. by the index **values**, while iloc is used for implicit indexing, i.e. by the index **locations**.
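
The following sketch restates the contrast on a Series with an integer index, and adds one case not shown above: asking loc for a label that does not exist. The name s is chosen here for illustration.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[3]              # the entry whose label is 3
by_position = s.iloc[1]          # the entry at position 1

label_slice = s.loc[1:3]         # labels 1 through 3, end included
position_slice = s.iloc[1:3]     # positions 1 and 2, end excluded

# loc never falls back to positions: label 0 does not exist here
try:
    s.loc[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(by_label, by_position, missing_label_raises)
```

Because loc only ever looks at labels, a missing label fails loudly instead of silently returning the element at that position.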

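A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame objects. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in current releases.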

One guiding principle of Python code is that 'explicit is better than implicit.' The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code, especially in the case of integer indexes. Using them is recommended, both to make code easier to read and understand and to prevent subtle bugs caused by the mixed indexing/slicing convention.

## Data selection in DataFrame

A DataFrame acts in many ways like a two-dimensional array, and in other ways like a dictionary of Series sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

### DataFrame as a dictionary

In [16]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [17]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [18]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

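This attribute-style column access refers to the same column as the dictionary-style access; in the pandas version used here, the two expressions even return the very same object: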

In [19]:
data.area is data['area']

True

Though this is a useful shorthand, it does not work in all cases. If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this method rather than to the 'pop' column:

In [20]:
data.pop is data['pop']

False

In particular, you should avoid column assignment via attribute: use data['pop'] = z rather than data.pop = z.
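
A minimal sketch of the name clash, using a two-row DataFrame named df that is defined only for this example:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both forms work
area_values = df.area.tolist()

# 'pop' clashes with DataFrame.pop, so attribute access yields the method
pop_is_method = callable(df.pop)

# Bracket access always returns the column
pop_values = df['pop'].tolist()

print(area_values, pop_is_method, pop_values)
```

Bracket access behaves the same way for every column name, which is why it is the safer default.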

As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case by adding a new column:

In [22]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### DataFrame as two-dimensional array

We can view DataFrame as an enhanced two-dimensional array. We can examine the underlying data array using the values attribute:

In [24]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

With this picture in mind, many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [25]:
data.T

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


When it comes to indexing DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat a DataFrame as a NumPy array. In particular, passing a single index to an array accesses a row:

In [27]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [26]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

and passing a single 'index' to a DataFrame accesses a column:

In [28]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Thus, for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array, while the DataFrame index and column labels are maintained in the result:

In [31]:
# rows 0-2 and columns 0-1
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


Similarly, using the loc indexer, we can index the underlying data in an array-like style, but using the explicit index and column names:

In [33]:
data.loc[:'Illinois', :'area']

              area
California  423967
Texas       695662
New York    141297
Florida     170312
Illinois    149995


The ix indexer allowed a hybrid of these two approaches. However, ix has been removed in pandas 1.0.
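
Without ix, a selection that mixes positions and labels has to be phrased entirely in terms of one indexer. The sketch below shows two possible ways to select the first two rows (by position) of the 'pop' column (by label); the DataFrame df and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# Option 1: translate the positions into labels, then use loc
via_loc = df.loc[df.index[:2], 'pop']

# Option 2: translate the column label into a position, then use iloc
via_iloc = df.iloc[:2, df.columns.get_loc('pop')]

print(via_loc.equals(via_iloc))
```

Either form keeps the code explicit about whether labels or positions are being used.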

Any of the familiar NumPy data access patterns can be used within these indexers. For example, in the loc indexer, we can combine masking and fancy indexing as in the following:

In [37]:
# the order is always [rows, columns]
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
Florida,19552860,114.806121


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way familiar from NumPy:

In [38]:
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
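
Assignment works through loc as well. As a sketch, the example below sets one cell by label and then sets every row that matches a mask; the DataFrame df and the replacement values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
df['density'] = df['pop'] / df['area']

# Set a single cell, addressed by row label and column label
df.loc['Texas', 'density'] = 38.0

# Set every row that matches a boolean mask
df.loc[df['density'] > 50, 'density'] = 50.0

print(df['density'].tolist())
```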


To build up fluency in Pandas data manipulation, we can spend some time with a simple DataFrame and explore the types of indexing, slicing, masking and fancy indexing that are allowed by these various indexing approaches.

In [49]:
data.loc['Texas':, :'pop']

            area       pop
Texas     695662  26448193
New York  141297  19651127
Florida   170312  19552860
Illinois  149995  12882135


In [55]:
data.loc[data['pop']>25000000, ['pop', 'density']]

                 pop    density
California  38332521  90.000000
Texas       26448193  38.018740


### Additional indexing conventions

For a DataFrame, plain [] indexing refers to columns, while slicing refers to rows:

In [56]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Such slices can also refer to rows by number rather than by index:

In [62]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [63]:
data[data.density>100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


In other words, when loc and iloc are not used, slicing and masking with [] on a DataFrame operate on rows, while a single label selects a column.
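
These conventions for plain [] on a DataFrame can be summarized in one sketch. The three-row DataFrame df and the variable names are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

column = df['area']                       # single label: a column (Series)
rows_by_label = df['Texas':'New York']    # label slice: rows, end included
rows_by_position = df[0:2]                # integer slice: rows, end excluded
rows_by_mask = df[df['pop'] > 20_000_000] # boolean mask: rows

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(rows_by_mask.index))
```

When the intent matters to the reader, writing the same selections with loc or iloc avoids having to remember which rule applies.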