## Pandas (Part 2): Data Indexing and Selection

In this notebook, you will learn about the following topics:
 - Data selection in Series
 - Data selection in DataFrame
 - Pandas indexer attributes
 
Read more:
 - [textbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html), and
 - [Pandas website](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html).

In [56]:
import pandas as pd
import numpy as np

## 1. Data Selection in Series
A ``Series`` object acts:
- in many ways like a one-dimensional NumPy array, and 
- in many ways like a standard Python dictionary.

### 1.1 Series as dictionary

In [57]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [58]:
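# a Series maps index labels to values, much like a dictionary
print(data['b'])            # look up a value by its label
print('a' in data)          # membership test checks the index labels
print(data.keys())          # the index plays the role of the keys
print(list(data.items()))   # (label, value) pairs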

0.5
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
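The dictionary analogy also covers assignment: just as a new key extends a dictionary, assigning to a new index label extends a ``Series``. The sketch below is only an illustration; the name `extended` and the label `'e'` are hypothetical, and a copy is used so that the `data` object in the following cells stays unchanged.

```python
import pandas as pd

# Same Series as in the cell above
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Work on a copy so the original Series is left untouched
extended = data.copy()

# Assigning to a label that does not exist yet appends a new entry
extended['e'] = 1.25

print(extended)
print('e' in extended, 'e' in data)
```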


### 1.2 Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [59]:
# slicing by explicit index (the end label 'c' is included)
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [60]:
# slicing by implicit integer index (the stop position 2 is excluded)
data[0:2]

a    0.25
b    0.50
dtype: float64
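The two slices above behave differently at the end point: a label slice includes its final label, while a positional slice excludes its stop position. A minimal sketch of that difference, written with the explicit `loc` and `iloc` indexers so the intent is unambiguous (the names `by_label` and `by_position` are illustrative only):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Label-based slice: both end points 'a' and 'c' are included
by_label = data.loc['a':'c']

# Position-based slice: position 2 is excluded, as with Python lists
by_position = data.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```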

In [61]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64
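Fancy indexing, the third mechanism named in this section, selects arbitrary entries by passing a list of labels. A small sketch, assuming the same `data` Series as above (the name `picked` is illustrative only):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Fancy indexing: a list of labels returns those entries, in the order given
picked = data[['a', 'd']]
print(picked)
```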

## 2. Data Selection in DataFrame

A ``DataFrame`` acts:
 - in many ways like a two-dimensional or structured array, and 
 - in other ways like a dictionary of ``Series`` structures sharing the same index.

### 2.1 DataFrame as dictionary

In [70]:
area1 = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area1, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [71]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [72]:
# Attribute-style access works when the column name is a valid Python
# identifier and does not clash with an existing DataFrame method or attribute
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
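Attribute-style access is a convenience, not a full replacement for the dictionary-style form. In particular, the `pop` column of this DataFrame shares its name with the `DataFrame.pop` method, so `data.pop` refers to the method rather than the column. The sketch below uses a two-row stand-in for `data`; the helper names `same` and `pop_is_method` are illustrative only.

```python
import pandas as pd

# Two-row stand-in for the `data` DataFrame above (values copied from it)
data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# For a non-conflicting name, both access styles give the same column
same = data.area.equals(data['area'])
print(same)

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method instead of the column
pop_is_method = callable(data.pop)
print(pop_is_method)

# Dictionary-style access always returns the column
pop_column = data['pop']
print(pop_column)
```

For this reason, dictionary-style access is the safer choice when assigning columns or when a column name might collide with a method name.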

In [73]:
# add a new column computed from existing ones
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### 2.2 DataFrame as two-dimensional array

In [74]:
# slicing by row label: a slice inside [] selects rows, and the end label is included
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [75]:
# slicing by row number (the stop position 3 is excluded)
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


In [76]:
# masking
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


In [77]:
# the underlying data as a two-dimensional NumPy array (one row per state)
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [78]:
# a single index on the array selects a row
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
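Indexing the raw array returns the row without its labels, and the integer columns appear as floats because a NumPy array holds a single dtype. The sketch below contrasts this with `iloc[0]`, which keeps the labels. It uses a two-row stand-in for `data` and the `to_numpy()` method, which the pandas documentation recommends over `.values`; the names `row_array` and `row_series` are illustrative only.

```python
import pandas as pd

# Two-row stand-in for the `data` DataFrame above (values copied from it)
data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# to_numpy() returns the same kind of array as .values; the integer
# columns are upcast to float64 to share one dtype with density
row_array = data.to_numpy()[0]
print(row_array.dtype)

# iloc[0] returns the first row as a Series that keeps its labels
row_series = data.iloc[0]
print(row_series.name, list(row_series.index))
```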

## 3. Special indexer attributes 

Pandas provides some special indexer attributes:
 - `loc`:  Access a group of rows and columns by label(s).
 - `iloc`: Access a group of rows and columns by integer position(s).
 - `at`:   Access a single value for a row/column label pair.
 - `iat`:  Access a single value for a row/column pair by integer position.
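The distinction between `loc` and `iloc` matters most when the index itself consists of integers, because a bare integer could then mean either a label or a position. A minimal sketch, using a hypothetical Series `s` that is not part of the notebook's data:

```python
import pandas as pd

# Hypothetical Series whose labels are integers that differ from positions
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc interprets 1 as a label
print(s.loc[1])

# iloc interprets 1 as a position (the second entry)
print(s.iloc[1])

# Label slice: labels 1 through 3, end label included
label_slice = s.loc[1:3]

# Position slice: positions 1 and 2, stop position excluded
position_slice = s.iloc[1:3]

print(list(label_slice.index), list(position_slice.index))
```

Using `loc` or `iloc` explicitly keeps such code readable and avoids guessing which interpretation applies.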

In [79]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [90]:
# at (labels) and iat (integer positions) return the same single value
print(data.at['Florida', 'pop'])
print(data.iat[3, 1])

19552860
19552860


In [85]:
# iloc indexer uses implicit integer positions: first three rows, first two columns
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [87]:
# loc indexer uses the explicit index and column names:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [82]:
# loc also accepts a Boolean mask for the rows and a list of column labels
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
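The same indexers can also be used to set values, not only to read them. A small sketch under stated assumptions: it uses a two-row stand-in for `data`, works on a copy named `updated` (a hypothetical name), and the replacement values 90.0 and 38.0 are arbitrary.

```python
import pandas as pd

# Two-row stand-in for the `data` DataFrame above (values copied from it)
data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Work on a copy so the original DataFrame is left untouched
updated = data.copy()

# Set one cell by integer position: row 0, column 2 (density)
updated.iloc[0, 2] = 90.0

# Set one cell by label
updated.loc['Texas', 'density'] = 38.0

print(updated)
```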
