# Pandas Introduction Part 1

## Overview

This notebook uses simple examples to show the details of selecting from a Pandas DataFrame with:
* **df\[\]**
* **df.loc\[\]**
* **df.iloc\[\]**

This notebook includes more detail than a typical Pandas introduction.

Later notebooks will use real-world data and perform data analysis.

In [1]:
import pandas as pd
import numpy as np

### Components of a DataFrame 
A DataFrame has:
1. column labels (identifiers)
2. row labels (identifiers)
3. values (cells)

In [2]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [3]:
# column IDs as list
df.columns.tolist()

['Col1', 'Col2', 'Col3']

In [4]:
# row IDs as list
df.index.tolist()

['Row1', 'Row2', 'Row3']

In [5]:
# data as nested list
df.values.tolist()

[[0, 1, 2], [10, 11, 12], [20, 21, 22]]

### DataFrame Column and Row Labels

df.columns contains the column labels.  
df.index contains the row labels.  

Both are instances of pd.Index (or a subclass thereof).

In [6]:
# get column index
df.columns

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [7]:
isinstance(df.columns, pd.Index)

True

In [8]:
# the index is iterable
[col for col in df.columns]

['Col1', 'Col2', 'Col3']

In [9]:
# get row index
df.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [10]:
isinstance(df.index, pd.Index)

True

In [11]:
# the index is iterable
[row for row in df.index]

['Row1', 'Row2', 'Row3']

### Be Careful to Distinguish Between pd.Index and df.index

**pd.Index:** This is a class.  
**df.columns:** This is an instance of pd.Index (or one of its subclasses)  
**df.index:**   This is an instance of pd.Index (or one of its subclasses)
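
The distinction above can be checked directly. This is a minimal sketch; `demo` is an illustrative name that does not appear elsewhere in this notebook, and it uses a default index so that the subclass case is visible.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame({'Col1': [0, 10], 'Col2': [1, 11]})

# pd.Index is itself a class
index_is_class = isinstance(pd.Index, type)

# demo.index is an instance; with the default index, it is an instance
# of the RangeIndex subclass of pd.Index
index_type_name = type(demo.index).__name__
is_subclass = issubclass(type(demo.index), pd.Index)

# demo.columns is also an instance of pd.Index
columns_is_index = isinstance(demo.columns, pd.Index)

print(index_is_class, index_type_name, is_subclass, columns_is_index)
```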

## Selection using **df\[filter\]**

Selects columns when filter is:
1. a single column label (which is in df.columns)
2. a list of column labels (all of which are in df.columns)

Selects rows when filter is:
3. a slice representing row positions
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Examples of each of the above 5 use cases follow.

### **df\[filter\]** Examples

In [12]:
# 1. single column label
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [13]:
isinstance(df['Col1'], pd.Series)

True

In [14]:
# 2. list of column labels
df[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [15]:
isinstance(df[['Col1', 'Col3']], pd.DataFrame)

True

In [16]:
# 3. slice referring to row positions
df[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [17]:
# 4. boolean series with matching index
row_boolean_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
df[row_boolean_series]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [18]:
# 5. list of boolean values
df[[False, True, False]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
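
A consequence of the rules listed above is that a single label passed to **df\[\]** is looked up in df.columns only, so a row label is not found. The sketch below illustrates this; `demo` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1], [10, 11]],
                    columns=['Col1', 'Col2'],
                    index=['Row1', 'Row2'])

# a single label is treated as a column label, so a row label raises KeyError
try:
    demo['Row1']
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(f'KeyError raised: {raised_key_error}')

# selecting a single row by label is done with .loc[] instead
row = demo.loc['Row1']
print(row.tolist())
```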


## Selection using **df.loc\[rows, cols\]**

Where rows is:
1. a single row label
2. a list of row labels
3. a slice matching row labels in order, inclusive of the slice end
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Where cols is:
6. a single col label
7. a list of col labels
8. a slice matching col labels in order, inclusive of the slice end
9. a boolean Series (with index matching df.columns)
10. a list of boolean values

**df.loc\[rows\]** is shorthand for: **df.loc\[rows, : \]**  
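
The shorthand can be verified with a quick comparison. This is a small sketch; `demo`, `short_form` and `long_form` are illustrative names.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]],
                    columns=['Col1', 'Col2', 'Col3'],
                    index=['Row1', 'Row2', 'Row3'])

rows = ['Row1', 'Row3']
short_form = demo.loc[rows]      # columns omitted
long_form = demo.loc[rows, :]    # all columns selected explicitly

same_result = short_form.equals(long_form)
print(same_result)
```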

Examples of each of the above 10 use cases follow.

### **df.loc\[rows, cols\]** Examples

In [19]:
# 1a. single row label
df.loc['Row2']

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

In [20]:
# 1b. single row label, as dataframe
df.loc['Row2'].to_frame()

Unnamed: 0,Row2
Col1,10
Col2,11
Col3,12


In [21]:
# 2. list of row labels
df.loc[['Row1', 'Row3']]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [22]:
# 3a. slice of row labels
df.loc['Row1':'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [23]:
# 3b. slice of row labels
df.loc[:'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [24]:
# 4. boolean series with matching index
row_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
print(row_series.index.equals(df.index))
df.loc[row_series]

True


Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [25]:
# 5. list of boolean values
selection = [False, True, False]
df.loc[selection]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [26]:
# 6. single column label (as dataframe)
df.loc[:'Row2', 'Col2'].to_frame()

Unnamed: 0,Col2
Row1,1
Row2,11


In [27]:
# 7. list of column labels
df.loc[:'Row2', ['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


In [28]:
# 8a. slice of column labels
df.loc['Row1':'Row2', 'Col2':]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [29]:
# 8b. same as 8a, using explicit slice object
# Reminder: with .loc[] and only with .loc[], slice is inclusive of its end point
row_slice = slice('Row1', 'Row2')
col_slice = slice('Col2', None)
df.loc[row_slice, col_slice]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [30]:
# 8c. slice order matters, there are no rows that match Row2 followed by Row1
df.loc['Row2':'Row1', 'Col2':]

Unnamed: 0,Col2,Col3


In [31]:
# 9. boolean series whose index matches df.columns
col_series = pd.Series([False, True, False], index=['Col1', 'Col2', 'Col3'])
print(df.columns.equals(col_series.index))
df.loc[:'Row2',col_series]

True


Unnamed: 0,Col2
Row1,1
Row2,11


In [32]:
# 10. list of boolean values
df.loc[:'Row2', [True, False, True]]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


In [33]:
# it is a KeyError if the label is not in the column index
print('Col99' in df.columns)
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Col99'


In [34]:
# it is a KeyError if the label is not in the row index
print('Row99' in df.index)
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Row99'


#### **df.loc\[\]** Examples with Default Index
The default index creates a range of integers that match the row position.

Even though the default index matches the row position, **df.loc\[\]** is selecting "row ID", not "row position".

In [35]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [36]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [37]:
# df.index is an instance of RangeIndex, which is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [38]:
# matches row IDs, 1 through 2 (inclusive)
df.loc[1:2, 'Col1':'Col2']

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [39]:
# matches row IDs, 1 and 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [40]:
# sort the dataframe in reverse order
df = df.sort_index(ascending=False)
df

Unnamed: 0,Col1,Col2,Col3
2,20,21,22
1,10,11,12
0,0,1,2


In [41]:
# there are no row IDs 1 followed by 2
df.loc[1:2, ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2


In [42]:
# there is a row ID 1, and a row ID 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


## Pandas Operations Modify In-Place or Return Modified Values
By default, most Pandas operations leave the DataFrame or Series they operate on unmodified and return a new object holding the result.

Some Pandas operations accept the keyword argument 'inplace'.  If it is set to True, the DataFrame or Series is modified directly and the call returns None.

In [43]:
# Pure Python example of non-inplace operation
x = [3, 2, 1]
x_orig = x.copy()

y = sorted(x)
print(f'x unmodified: {x_orig == x}')

x unmodified: True


In [44]:
# Pure Python example of inplace operation
x = [3, 2, 1]
x_orig = x.copy()

x.sort()
print(f'x unmodified: {x_orig == x}')

x unmodified: False


In [45]:
# to avoid throwing errors in Jupyter Cell
def df_equals(df1, df2):
    try:
        return df1.equals(df2)
    except KeyError:
        return False

In [46]:
# DataFrame example of non-inplace operation
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

df.drop('Col1', axis = 'columns')
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: True


In [47]:
# DataFrame example of inplace operation
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

df.drop('Col1', axis = 'columns', inplace=True)
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: False
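
A related detail is what each form returns. The sketch below assumes the behavior of `drop` in the pandas versions I am aware of, where an in-place call returns None; `demo` and the `returned_*` names are illustrative.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1, 2], [10, 11, 12]],
                    columns=['Col1', 'Col2', 'Col3'])

# non-inplace: a new DataFrame is returned, demo keeps all of its columns
returned_new = demo.drop('Col1', axis='columns')
columns_after_copy = demo.columns.tolist()

# inplace: demo itself loses the column, and nothing useful is returned
returned_inplace = demo.drop('Col1', axis='columns', inplace=True)
columns_after_inplace = demo.columns.tolist()

print(returned_new.columns.tolist(), columns_after_copy)
print(returned_inplace, columns_after_inplace)
```

Because the in-place call returns None, writing `demo = demo.drop(..., inplace=True)` would replace the DataFrame with None.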


## Selection using **df.iloc\[rows, cols\]**

### Compare with numpy 2D Array Selection

When:
1. rows is either an integer or a slice,   
2. cols is either an integer or a slice,  

**df.iloc\[rows, cols\]** is very similar to selecting values from a 2D numpy array.

In [48]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [49]:
values = df.values
values

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [50]:
print(f'values is: {type(values)} with {values.ndim} dimensions')

values is: <class 'numpy.ndarray'> with 2 dimensions


In [51]:
values[1,:]

array([10, 11, 12])

In [52]:
df.iloc[1,:].values

array([10, 11, 12])

In [53]:
values[2:,:2]

array([[20, 21]])

In [54]:
df.iloc[2:,:2].values

array([[20, 21]])

In [55]:
row_slice = slice(0,2)
col_slice = slice(None, None)

In [56]:
values[row_slice, col_slice]

array([[ 0,  1,  2],
       [10, 11, 12]])

In [57]:
df.iloc[row_slice, col_slice].values

array([[ 0,  1,  2],
       [10, 11, 12]])

In [58]:
(values[row_slice, col_slice] == df.iloc[row_slice, col_slice].values).all()

True

### Selection using **df.iloc\[rows, cols\]** in General

Where rows is:
* a single row position
* a list of row positions
* a slice of row positions
* a boolean list

Where cols is:
* a single col position
* a list of col positions
* a slice of col positions
* a boolean list

Slice works as it normally does for Python.

**.iloc\[rows\]** is shorthand for **.iloc\[rows, :\]**  
That is, all columns are selected.
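
The slice behavior noted above differs from **df.loc\[\]**, where the slice end is included. A minimal sketch of the contrast follows; `demo` is an illustrative name.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]],
                    columns=['Col1', 'Col2', 'Col3'],
                    index=['Row1', 'Row2', 'Row3'])

# .iloc[] slices by position and excludes the end, like a Python list
list_slice = ['Row1', 'Row2', 'Row3'][0:1]
iloc_rows = demo.iloc[0:1].index.tolist()

# .loc[] slices by label and includes the end label
loc_rows = demo.loc['Row1':'Row2'].index.tolist()

print(list_slice)   # ['Row1']
print(iloc_rows)    # ['Row1']
print(loc_rows)     # ['Row1', 'Row2']
```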

In [59]:
rows = [True, False, True]
df.iloc[rows]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [60]:
df.iloc[[0,2]]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [61]:
df.iloc[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [62]:
df.iloc[[1,2]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [63]:
df.iloc[[1,2], [0,2]]

Unnamed: 0,Col1,Col3
Row2,10,12
Row3,20,22


In [64]:
df.iloc[[False, True, True], [True, False, True]]

Unnamed: 0,Col1,Col3
Row2,10,12
Row3,20,22


## .index of DataFrame/Series Created from DataFrame

A Series created from a DataFrame df:
* will have its index be a (subset of) df.index, if the operation was a column operation
* will have its index be a (subset of) df.columns, if the operation was a row operation

In [65]:
# select an entire column
s = df['Col1']
s.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [66]:
# the indexes are equal
s.index.equals(df.index)

True

In [67]:
# select an entire row
s_subset = df.iloc[2]
s_subset.index

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [68]:
# the series index equals df.columns
s_subset.index.equals(df.columns)

True

In [69]:
# select partial rows and columns
df_subset = df.iloc[:2,1:]
df_subset

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [70]:
# df_subset's index is a subset of df.index
df_subset.index.isin(df.index)

array([ True,  True])

In [71]:
# df_subset's columns is a subset of df.columns
df_subset.columns.isin(df.columns)

array([ True,  True])

In [72]:
# create a Series by applying a relational operator to an entire column
bool_series = df['Col2'] < 20
bool_series.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [73]:
# the indexes are equal
bool_series.index.equals(df.index)

True

## Boolean Series from Value Comparison


The term **"comparison operator"** or "relational operator" or "arithmetic relational operator" means:  
<, <=, ==, !=, >, >=  
and produces True/False results.

The term **"relational algebra operator"** means operations which perform:  
inner-join, outer-join, left-join, cross-product, union, etc.  
and produces DataFrames.
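
To make the two terms concrete, the sketch below applies one operator of each kind. The frames `left` and `right` and their column names are invented for this illustration and are not used elsewhere in the notebook; the join is written with `pd.merge`, which is one way to express an inner join in Pandas.

```python
import pandas as pd

# hypothetical frames, used only for this illustration
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# comparison operator: produces a True/False value for each element
mask = left['left_val'] > 1
print(mask.tolist())

# relational algebra operator: an inner join produces a DataFrame
# containing only the keys present in both inputs
joined = pd.merge(left, right, on='key', how='inner')
print(joined)
```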

In [74]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [75]:
# comparison operator on series produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [76]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [77]:
df.loc[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [78]:
criteria.index.equals(df.index)

True

In [79]:
criteria1 = df['Col2'] > 10
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [80]:
filter_rows = criteria1 | criteria2
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [81]:
# due to & and | operator precedence, when written on one line,
# it is necessary to use parentheses around comparisons
filter_rows = (df['Col2'] > 10) & (df['Col1'] < 20)
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12
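
The need for parentheses can be seen by leaving them out. In the sketch below, `demo` is an illustrative name; the expression fails because Python applies `&` before the comparisons, which turns the line into a chained comparison of Series.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]],
                    columns=['Col1', 'Col2', 'Col3'])

# without parentheses this is parsed as:
#   demo['Col2'] > (10 & demo['Col1']) < 20
# the chained comparison needs a single True/False from a Series,
# which raises ValueError
try:
    demo['Col2'] > 10 & demo['Col1'] < 20
    error_message = None
except ValueError as err:
    error_message = str(err)
print(f'ValueError: {error_message}')

# with parentheses, each comparison is evaluated first, then combined
filter_rows = (demo['Col2'] > 10) & (demo['Col1'] < 20)
print(filter_rows.tolist())
```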


In [82]:
# boolean series constructed from columns, has the same index as the dataframe
filter_rows.index.equals(df.index)

True

## Axis Specification

For someone coming from another programming language, the axis specification may be confusing.

Fortunately, an operation that looks up labels, such as drop, raises an error if the incorrect axis is specified, unless the same label is used for both a column and a row (not recommended).
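
As a small sketch of that error, the code below asks `drop` to remove a column label from the row axis. `demo` is an illustrative name.

```python
import pandas as pd

# 'demo' is a hypothetical DataFrame used only for this illustration
demo = pd.DataFrame([[0, 1, 2], [10, 11, 12]],
                    columns=['Col1', 'Col2', 'Col3'],
                    index=['Row1', 'Row2'])

# 'Col2' is a column label, so looking for it among the row labels fails
try:
    demo.drop('Col2', axis='index')
    raised_key_error = False
except KeyError as err:
    raised_key_error = True
    print(f'KeyError: {err}')

# with the correct axis, the column is dropped
dropped = demo.drop('Col2', axis='columns')
print(dropped.columns.tolist())
```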

### Operations which Modify the Structure of a DataFrame
The axis specification is intuitive.

Consider: row,col = 0, 1

Use axis=0 or axis='index' to drop a row.  
Use axis=1 or axis='columns' to drop a column.

In [83]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [84]:
df.drop('Col2', axis='columns')

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [85]:
# same as above
df.drop('Col2', axis=1)

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [86]:
df.drop('Row2', axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [87]:
# axis=0 is equivalent to axis='index' (this example drops Row1 instead of Row2)
df.drop('Row1', axis=0)

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [88]:
modify_col_names = {'Col2':'Col-Two'}
df.rename(modify_col_names, axis='columns')

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [89]:
modify_row_names = {'Row2':'Row-Two'}
df.rename(modify_row_names, axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [90]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ np.nan, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row2,,11,12
Row3,20.0,21,22


In [91]:
df.dropna(axis='columns')

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12
Row3,21,22


In [92]:
df.dropna(axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row3,20.0,21,22


### Other DataFrame Operations

To someone coming from another programming language, or who is familiar with SQL, the axis specification may seem confusing for other types of operations.

**Terminology Common to Other Programming Languages and SQL**  
A **"column operation"** operates on a column and produces a result using that entire column as input.  The column label is maintained in the result to identify which column was operated on.

A **"row operation"** operates on a row and produces a result using that entire row as input.  The row label is maintained in the result to identify which row was operated on.

**Terminology Common to Numpy and Pandas**  
A "column operation" works by examining the values **"across each row"** in the column.  In Pandas and numpy, this operation is specified as axis='index' or axis=0.

A "row operation" works by examining the values **"across each column"** in the row.  In Pandas and numpy, this operation is specified as axis='columns' or axis=1.
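
The same axis convention can be seen side by side in numpy and Pandas. This is a minimal sketch; `values` and `demo` are illustrative names defined only for this comparison.

```python
import numpy as np
import pandas as pd

# hypothetical data, used only for this illustration
values = np.array([[0, 1, 2],
                   [10, 11, 12],
                   [20, 21, 22]])
demo = pd.DataFrame(values,
                    columns=['Col1', 'Col2', 'Col3'],
                    index=['Row1', 'Row2', 'Row3'])

# axis=0: one result per column, computed across the rows
col_sums = values.sum(axis=0)
print(col_sums)    # [30 33 36]

# axis=1: one result per row, computed across the columns
row_sums = values.sum(axis=1)
print(row_sums)    # [ 3 33 63]

# Pandas gives the same numbers, with labels attached to each result
cols_match = (demo.sum(axis='index').to_numpy() == col_sums).all()
rows_match = (demo.sum(axis='columns').to_numpy() == row_sums).all()
print(cols_match, rows_match)
```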

In [93]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [94]:
# perform the "column operation" which "sums across the rows" of each column
df.sum(axis='index')

Col1    30
Col2    33
Col3    36
dtype: int64

In [95]:
# same as above
df.sum(axis=0)

Col1    30
Col2    33
Col3    36
dtype: int64

The above is the result of a "column operation" as the column labels are in the index.

In [96]:
# perform the "row operation" which "sums across the columns" of each row
df.sum(axis='columns')

Row1     3
Row2    33
Row3    63
dtype: int64

In [97]:
# same as above
df.sum(axis=1)

Row1     3
Row2    33
Row3    63
dtype: int64

The above is the result of a "row operation" as the row labels are in the index.

In [98]:
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [99]:
# perform the "column operation" of finding the max "across the rows" of each column
df.max(axis=0)

Col1    20
Col2    21
Col3    22
dtype: int64

The above is a column operation as the column labels are in the index.

In [100]:
# perform the "row operation" of finding the minimum "across the columns" of each row
df.min(axis=1)

Row1     0
Row2    10
Row3    20
dtype: int64

The above is a row operation as the row labels are in the index.