# Pandas Introduction Part 1

## Overview

This notebook uses simple examples to show the details of selecting from a Pandas DataFrame with:
* **df\[\]**
* **df.loc\[\]**
* **df.iloc\[\]**

This notebook includes more detail than a typical Pandas introduction.

Later notebooks will use real-world data and perform data analysis.

In [1]:
import pandas as pd
import numpy as np

### Components of a DataFrame 
A DataFrame has:
1. column labels (identifiers)
2. row labels (identifiers)
3. values (cells)

In [2]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,21,22
Row3,20,31,32


In [3]:
# column IDs as list
df.columns.tolist()

['Col1', 'Col2', 'Col3']

In [4]:
# row IDs as list
df.index.tolist()

['Row1', 'Row2', 'Row3']

In [5]:
# data as nested list
df.values.tolist()

[[0, 1, 2], [10, 21, 22], [20, 31, 32]]

### DataFrame Column and Row Labels

df.columns contains the column labels  
df.index contains the row labels  

Both are instances of pd.Index (or a subclass thereof)

In [6]:
# get column index
df.columns

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [7]:
isinstance(df.columns, pd.Index)

True

In [8]:
# the index is iterable
[col for col in df.columns]

['Col1', 'Col2', 'Col3']

In [9]:
# get row index
df.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [10]:
isinstance(df.index, pd.Index)

True

In [11]:
# the index is iterable
[row for row in df.index]

['Row1', 'Row2', 'Row3']

### Be Careful to Distinguish Between pd.Index and df.index

**pd.Index:** This is a class.  
**df.columns:** This is an instance of pd.Index (or one of its subclasses)  
**df.index:**   This is an instance of pd.Index (or one of its subclasses)

## Selection using **df\[filter\]**

Selects columns when filter is:
* a single column label (which is in df.columns)
* a list of column labels (all of which are in df.columns)

Selects rows when filter is:
* a slice representing row positions
* a boolean Series (with index matching df.index)
* a list of boolean values

### **df\[filter\]** Examples

In [12]:
# single column label
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [13]:
isinstance(df['Col1'], pd.Series)

True

In [15]:
# list of column labels
df[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,22
Row3,20,32


In [16]:
isinstance(df[['Col1', 'Col3']], pd.DataFrame)

True

In [17]:
# datatypes for dataframe
df[['Col1', 'Col3']].dtypes

Col1    int64
Col3    int64
dtype: object

In [18]:
# slice referring to row positions
df[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,21,22
Row3,20,31,32


In [19]:
# boolean series with matching index
row_boolean_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
df[row_boolean_series]

Unnamed: 0,Col1,Col2,Col3
Row2,10,21,22


In [20]:
# list of boolean values
df[[False, True, False]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,21,22
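As a complement to the examples above, here is a minimal sketch of the asymmetry in **df\[filter\]**: a single label is looked up only in df.columns, so a row label raises a KeyError, while a slice is always applied to the rows. The DataFrame is rebuilt so the snippet stands on its own; the names `row_label_found` and `first_row` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 21, 22], [20, 31, 32]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# a single label is looked up in df.columns only, so a row label fails
try:
    df['Row1']
    row_label_found = True
except KeyError:
    row_label_found = False

# a slice is applied to the rows, never to the columns
first_row = df[0:1]

print(row_label_found)
print(first_row.index.tolist())
```

To select a row by its label, use df.loc\['Row1'\] instead.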


## Selection using **df.loc\[rows, cols\]**

Where rows is:
* a single row label
* a list of row labels
* a slice matching row labels in order, inclusive of the slice end
* a boolean Series (with index matching df.index)
* a list of boolean values

Where cols is:
* a single col label
* a list of col labels
* a slice matching col labels in order, inclusive of the slice end
* a boolean Series (with index matching df.columns)
* a list of boolean values

**df.loc\[rows\]** is shorthand for: **df.loc\[rows, : \]**  

### **df.loc\[rows, cols\]** Examples

#### Column Examples

In [21]:
s1 = df['Col1']
s1

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [22]:
s2 = df.loc[:, 'Col1']
s2

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [23]:
# are the Series equal
s1.equals(s2)

True

In [24]:
# are the Series values equal?
s1 == s2

Row1    True
Row2    True
Row3    True
Name: Col1, dtype: bool

In [105]:
# are all the Series values equal?
(s1 == s2).all()

True

#### Row Examples

In [26]:
df.loc['Row1', :]

Col1    0
Col2    1
Col3    2
Name: Row1, dtype: int64

In [27]:
# the 2nd argument can be skipped if you want all columns
df.loc['Row1']

Col1    0
Col2    1
Col3    2
Name: Row1, dtype: int64

In [28]:
isinstance(df.loc['Row1'], pd.Series)

True

#### Column and Row Examples

In [30]:
# a single label for both references exactly 1 cell, return: single value
df.loc['Row2', 'Col2']

21

In [31]:
# when using a boolean_series, the index should match df.columns or df.index as appropriate
row_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
col_series = pd.Series([False, True, False], index=['Col1', 'Col2', 'Col3'])

df.loc[row_series, col_series]

Unnamed: 0,Col2
Row2,21


In [32]:
# when using a list of boolean values, no index is needed
df.loc[[False, True, False], [False, True, False]]

Unnamed: 0,Col2
Row2,21


In [33]:
# a list of labels may reference more than 1 cell, return: DataFrame
df.loc[['Row2'], ['Col2']]

Unnamed: 0,Col2
Row2,21


In [34]:
df.loc[['Row2', 'Row3'], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
Row2,10,21
Row3,20,31


In [35]:
# same labels specified in different order
df.loc[['Row3', 'Row2'], ['Col2', 'Col1']]

Unnamed: 0,Col2,Col1
Row3,31,20
Row2,21,10


In [36]:
# .loc[] is inclusive with slices!
df.loc['Row2':'Row3','Col1':'Col2']

Unnamed: 0,Col1,Col2
Row2,10,21
Row3,20,31


In [37]:
# same but use explicit slice
# note that slice is inclusive of its end point
row_slice = slice('Row2', 'Row3')
col_slice = slice('Col1', 'Col2')
df.loc[row_slice, col_slice]

Unnamed: 0,Col1,Col2
Row2,10,21
Row3,20,31


In [38]:
# .loc[] slices must match labels in order!
# the bounds below are reversed, so nothing matches and the result is an empty DataFrame
df.loc['Row3':'Row2','Col2':'Col1']
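The reversed slice produces no visible output, so the following sketch checks the result explicitly. It assumes the same three-row DataFrame as above (rebuilt here) and contrasts the slice with a list of labels, which may be given in any order. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 21, 22], [20, 31, 32]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# slice bounds given in the reverse of the label order select nothing
reversed_slice = df.loc['Row3':'Row2', 'Col2':'Col1']
print(reversed_slice.empty)
print(reversed_slice.shape)

# a list of labels does not depend on label order
reversed_list = df.loc[['Row3', 'Row2'], ['Col2', 'Col1']]
print(reversed_list.shape)
```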

In [39]:
# it is a KeyError if the label is not in the column index
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

KeyError: 'Col99'


In [40]:
# it is a KeyError if the label is not in the row index
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

KeyError: 'Row99'


#### Examples with Default Index
When no index is given, Pandas creates a default index: a range of integers that initially matches the row positions.

Even though the default index matches the row position, it should still be thought of as "row ID", not "row position".

In [41]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


In [42]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [43]:
# df.index is an instance of RangeIndex, which is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [44]:
df.loc[1:2, 'Col1':'Col2']

Unnamed: 0,Col1,Col2
1,10,21
2,20,31


In [45]:
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,21
2,20,31


In [46]:
# sort the dataframe
df = df.sort_index(ascending=False)
df

Unnamed: 0,Col1,Col2,Col3
2,20,31,32
1,10,21,22
0,0,1,2


In [47]:
# slice depends on the sort order of the DataFrame
# there are no row labels that match in order from 1 to 2 inclusive
df.loc[1:2, ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2


In [48]:
# list based selection is independent of the sort order of the DataFrame
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,21
2,20,31


## Pandas Operations Modify In-Place or Return Modified Values
By default, most Pandas operations leave the DataFrame or Series they operate on unchanged and return a new object containing the result.

Some Pandas operations accept the keyword argument 'inplace'.  If it is set to True, the underlying DataFrame or Series is modified directly and the method returns None.
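A minimal sketch of the two behaviours, using a small made-up DataFrame (the data and variable names are assumptions for illustration). A common mistake is to assign the result of an inplace call back to a variable, which stores None.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [20, 0, 10], 'Col2': [31, 1, 21]})

# without inplace, a new sorted DataFrame is returned and df keeps its order
sorted_copy = df.sort_values('Col1')
original_order = df['Col1'].tolist()
print(original_order)
print(sorted_copy['Col1'].tolist())

# with inplace=True, df itself is reordered and the call returns None
result = df.sort_values('Col1', inplace=True)
print(result)
print(df['Col1'].tolist())
```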

In [49]:
# Common Python example of inplace operation
x = [3, 2, 1]
x.sort()
x

[1, 2, 3]

In [50]:
# Common Python example of non-inplace operation
x = [3, 2, 1]
y = sorted(x)
print(x)
print(y)

[3, 2, 1]
[1, 2, 3]


In [51]:
# DataFrame inplace operation
df.sort_values('Col2', inplace=True, ascending=False)
df

Unnamed: 0,Col1,Col2,Col3
2,20,31,32
1,10,21,22
0,0,1,2


In [52]:
# DataFrame non inplace operation
df2 = df.sort_values('Col2', ascending=True)

# df is unchanged
df

Unnamed: 0,Col1,Col2,Col3
2,20,31,32
1,10,21,22
0,0,1,2


In [53]:
# df2 is the newly sorted dataframe
df2

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


## Selection using **df.iloc\[rows, cols\]**

### As Compared with numpy 2D Array Selection

When:
1. rows is either an integer or a slice,   
2. cols is either an integer or a slice,  

then selecting values with **df.iloc\[rows, cols\]** is very similar to selecting values from a 2D numpy array.

In [54]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,21,22
Row3,20,31,32


In [55]:
values = df.values
values

array([[ 0,  1,  2],
       [10, 21, 22],
       [20, 31, 32]])

In [56]:
print(f'values is a {type(values)} with {values.ndim} dimensions')

values is a <class 'numpy.ndarray'> with 2 dimensions


In [57]:
values[1,:]

array([10, 21, 22])

In [58]:
df.iloc[1,:].values

array([10, 21, 22])

In [59]:
# all values are the same
(values[1, :] == df.iloc[1,:].values).all()

True

In [60]:
# same as above, but use an explicit slice object
col_slice = slice(None, None)
values[1, col_slice]

array([10, 21, 22])

In [61]:
df.iloc[1, col_slice].values

array([10, 21, 22])

In [62]:
# all values are the same
(values[1, col_slice] == df.iloc[1, col_slice].values).all()

True

In [63]:
values[1:, 0:2]

array([[10, 21],
       [20, 31]])

In [64]:
df.iloc[1:3, 0:2].values

array([[10, 21],
       [20, 31]])

In [65]:
# all values are the same
(values[1:3, 0:2] == df.iloc[1:3, 0:2].values).all()

True

In [66]:
# same as above but use explicit slice objects
row_slice = slice(1, None)
col_slice = slice(0, 2)
values[row_slice, col_slice]

array([[10, 21],
       [20, 31]])

In [67]:
df.iloc[row_slice, col_slice].values

array([[10, 21],
       [20, 31]])

In [68]:
# all values are the same
(values[row_slice, col_slice] == df.iloc[row_slice, col_slice].values).all()

True

In [69]:
# rows are selected if second component of index is missing
values[1:]

array([[10, 21, 22],
       [20, 31, 32]])

In [70]:
# same for .iloc
df.iloc[1:].values

array([[10, 21, 22],
       [20, 31, 32]])

In [71]:
# are all row and column values equal
(values[1:] == df.iloc[1:].values).all()

True

In [72]:
# same as above but with an explicit slice object
row_slice = slice(1, None)
values[row_slice]

array([[10, 21, 22],
       [20, 31, 32]])

In [73]:
df.iloc[row_slice].values

array([[10, 21, 22],
       [20, 31, 32]])

In [74]:
(values[row_slice] == df.iloc[row_slice]).all().all()

True

### Selection using **df.iloc\[rows, cols\]** in General

Where rows is:
* a single row position
* a list of row positions
* a slice of row positions
* a boolean list

Where cols is:
* a single col position
* a list of col positions
* a slice of col positions
* a boolean list

Slice works as it normally does for Python.

**.iloc\[rows\]** is shorthand for **.iloc\[rows, :\]**  
That is, all columns are selected.
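Because .iloc slices follow normal Python rules, the end position is excluded, whereas a .loc slice includes its end label. The sketch below contrasts the two on the same DataFrame (rebuilt here so the snippet is self-contained) and also shows a negative position. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 21, 22], [20, 31, 32]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# positions 0 and 1: the end position 2 is excluded
by_position = df.iloc[0:2]

# labels 'Row1' through 'Row3': the end label is included
by_label = df.loc['Row1':'Row3']

# a negative position counts from the end, as in a Python list
last_row = df.iloc[-1]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(last_row.tolist())
```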

In [75]:
rows = [True, False, True]
df.iloc[rows]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,31,32


In [76]:
df.iloc[[0,2]]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,31,32


In [77]:
df.iloc[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,21,22
Row3,20,31,32


In [78]:
df.iloc[[1,2]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,21,22
Row3,20,31,32


In [79]:
df.iloc[[1,2], [0,2]]

Unnamed: 0,Col1,Col3
Row2,10,22
Row3,20,32


In [80]:
df.iloc[[False, True, True], [True, False, True]]

Unnamed: 0,Col1,Col3
Row2,10,22
Row3,20,32


## .index of Series Created from DataFrame

A Series created from a DataFrame df:
* will have its index be a (subset of) df.index, if the operation was a column operation
* will have its index be a (subset of) df.columns, if the operation was a row operation

In [81]:
# select an entire column from df
s = df['Col1']
s.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [82]:
# the indexes are equal
s.index.equals(df.index)

True

In [83]:
# the index values are an (improper) subset of df.index
set(s.index).issubset(set(df.index))

True

In [84]:
# select one entire row from df
s_subset = df.iloc[2]
s_subset

Col1    20
Col2    31
Col3    32
Name: Row3, dtype: int64

In [85]:
s_subset.index

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [86]:
# the index values are an (improper) subset of df.columns
set(s_subset.index).issubset(set(df.columns))

True

In [87]:
# create a boolean Series by applying a relational operator to an entire column
bool_series = s < 20
bool_series

Row1     True
Row2     True
Row3    False
Name: Col1, dtype: bool

In [88]:
# the indexes are equal
bool_series.index.equals(df.index)

True

In [89]:
# the index values are an (improper) subset of df.index
set(bool_series.index).issubset(set(df.index))

True

## .columns and .index of DataFrame from another DataFrame

In [90]:
df_subset = df.iloc[:2, 1:]
df_subset

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,21,22


In [91]:
set(df_subset.index).issubset(set(df.index))

True

In [92]:
set(df_subset.columns).issubset(set(df.columns))

True

## Boolean Series from Value Comparison

In [93]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 21, 22],
       [ 20, 31, 32]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


In [94]:
# relational operator on series produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [95]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


In [96]:
df.loc[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


In [97]:
criteria.index.equals(df.index)

True

In [98]:
criteria1 = df['Col2'] - df['Col1'] == 11
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,21,22


In [99]:
# in one line, requires () around expressions
filter_rows = (df['Col2'] - df['Col1'] == 11) & (df['Col1'] < 20)
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,21,22


In [100]:
# boolean series constructed from columns, has the same index as the dataframe
filter_rows.index.equals(df.index)

True
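The examples above combine conditions with `&`. The sketch below, on the same default-index DataFrame, shows the companion operators `|` (element-wise or) and `~` (element-wise not). It also shows that the Python keyword `and` does not work on a Series: it raises a ValueError because the truth value of a whole Series is ambiguous. The parentheses in a one-line filter are needed because `&` and `|` bind more tightly than comparison operators such as `<` and `==`. The names `is_small` and `is_large` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 21, 22], [20, 31, 32]],
    columns=['Col1', 'Col2', 'Col3'],
)

is_small = df['Col1'] < 10
is_large = df['Col3'] > 30

# | is element-wise "or"
either = df[is_small | is_large]

# ~ is element-wise "not"
neither = df[~(is_small | is_large)]

# the keyword "and" asks for the truth value of a whole Series, which is an error
try:
    df[is_small and is_large]
    and_worked = True
except ValueError:
    and_worked = False

print(either.index.tolist())
print(neither.index.tolist())
print(and_worked)
```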

## Axis Specification

In [101]:
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,21,22
2,20,31,32


In [102]:
df.sum(axis='rows')

Col1    30
Col2    53
Col3    56
dtype: int64

In [103]:
df.sum(axis='columns')

0     3
1    53
2    83
dtype: int64

In [104]:
df.drop('Col2', axis='columns')

Unnamed: 0,Col1,Col3
0,0,2
1,10,22
2,20,32
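One way to read the axis argument in the examples above: for a reduction such as sum, the named axis is the one that is collapsed. With axis='rows' the rows are collapsed and there is one result per column; with axis='columns' the columns are collapsed and there is one result per row. The sketch below, on the same DataFrame, also checks that the integer forms 0 and 1 give the same results as the string forms. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 21, 22], [20, 31, 32]],
    columns=['Col1', 'Col2', 'Col3'],
)

# axis=0 is the same as axis='rows': one result per column
per_column = df.sum(axis=0)

# axis=1 is the same as axis='columns': one result per row
per_row = df.sum(axis=1)

print(per_column.equals(df.sum(axis='rows')))
print(per_row.equals(df.sum(axis='columns')))
print(per_column.index.tolist())
print(per_row.index.tolist())
```

For drop, the axis instead names where the label is looked up, so df.drop('Col2', axis='columns') removes a column.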
