# Pandas Introduction

## Overview

The first section describes the basic elements of a Pandas DataFrame.

The second section provides examples of selecting from a Pandas DataFrame with:
* **df\[\]**
* **df.loc\[\]**
* **df.iloc\[\]**

This notebook includes more detail than a typical Pandas introduction.

This notebook does not discuss time-series indexes.

Later notebooks will use real-world data and perform data analysis.

# Elements of a Pandas DataFrame

In [1]:
import pandas as pd
import numpy as np

### Components of a DataFrame 
A DataFrame has:
1. column labels
2. row labels
3. values

In [2]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [3]:
# column labels as list
df.columns.tolist()

['Col1', 'Col2', 'Col3']

In [4]:
# row labels as list
df.index.tolist()

['Row1', 'Row2', 'Row3']

In [5]:
# data as nested list
df.values.tolist()

[[0, 1, 2], [10, 11, 12], [20, 21, 22]]

### DataFrame Column and Row Labels

Column labels and row labels are each instances of pd.Index (or one of its subclasses).

As they are instances of the same class, the same methods can be applied to each.

Sometimes the term "column names" is used instead of "column labels", but I prefer the term "column labels" as it emphasizes the fact that the column and row labels are of the same data type.

df.columns contains the column labels.  
df.index contains the row labels.  
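
Because both attributes hold pd.Index objects, any Index method can be called on either one. The following is a minimal sketch, assuming the same small example frame as above; `get_loc()` and `isin()` are used only as two representative Index methods.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# get_loc() translates a label into its integer position, on either axis
col_pos = df.columns.get_loc('Col2')
row_pos = df.index.get_loc('Row3')

# isin() returns a boolean numpy array, again on either axis
col_mask = df.columns.isin(['Col1', 'Col3'])
row_mask = df.index.isin(['Row2'])

print(col_pos, row_pos)
print(col_mask, row_mask)
```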

In [6]:
# column index
df.columns

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [7]:
isinstance(df.columns, pd.Index)

True

In [8]:
# the index is iterable
[col for col in df.columns]

['Col1', 'Col2', 'Col3']

In [9]:
# row index
df.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [10]:
isinstance(df.index, pd.Index)

True

In [11]:
# the index is iterable
[row for row in df.index]

['Row1', 'Row2', 'Row3']

### DataFrame Values

Each column of a DataFrame is a pd.Series containing a single data type.

Pandas DataFrames and Series are built on top of numpy.  A pd.Series allows only a single data type, just as a numpy array allows only a single data type.
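
One consequence of the single data type rule shows up when columns of different types are combined into one numpy array. The sketch below uses a hypothetical frame named `mixed` (not part of this notebook) to illustrate how the array's dtype is widened to hold every column.

```python
import numpy as np
import pandas as pd

# hypothetical frame with one integer, one float and one string column
mixed = pd.DataFrame({
    'ints': [1, 2, 3],
    'floats': [1.5, 2.5, 3.5],
    'strings': ['a', 'b', 'c'],
})

# each column keeps its own dtype
print(mixed.dtypes)

# a 2D numpy array has one dtype, so int and float columns are upcast to float64
numeric_values = mixed[['ints', 'floats']].values
print(numeric_values.dtype)

# including the string column forces the generic object dtype
all_values = mixed.values
print(all_values.dtype)
```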

In [12]:
# single dataframe column
s_col1 = df['Col1']
type(s_col1)

pandas.core.series.Series

In [13]:
array_1d = s_col1.values
array_1d

array([ 0, 10, 20])

In [14]:
print(type(array_1d))
print(f'ndim: {array_1d.ndim}')
print(f'shape: {array_1d.shape}')
print(f'element dtype: {array_1d.dtype}')
df['Col1'].dtype == df['Col1'].values.dtype

<class 'numpy.ndarray'>
ndim: 1
shape: (3,)
element dtype: int64


True

In [15]:
# all values in this dataframe
array_2d = df.values
array_2d

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [16]:
print(type(array_2d))
print(f'ndim: {array_2d.ndim}')
print(f'shape: {array_2d.shape}')
print(f'element dtype: {array_2d.dtype}')

<class 'numpy.ndarray'>
ndim: 2
shape: (3, 3)
element dtype: int64


### Be Careful to Distinguish Between pd.Index and df.index

**pd.Index:** This is a class.  
**df.columns:** This is an instance of pd.Index (or one of its subclasses)  
**df.index:**   This is an instance of pd.Index (or one of its subclasses)

# Selecting Data from a Pandas DataFrame

## Indexing Overview

'Indexing', in the context of pd.DataFrame, means the selection of rows or columns or both.

Pandas allows for many types of indexing.

Most of Pandas Indexing can be summarized as:
* **df\[columns\]** provides Python dictionary-like access to the columns.
* **df.loc\[row_labels, col_labels\]** provides access by row and column labels.
* **df.iloc\[row_positions, col_positions\]** provides access by row and column positions.

Note: **df.iloc\[row_positions, col_positions\]** is almost identical to indexing numpy 2D arrays.

Additionally, masking (using a boolean array to pick out the values where the mask is True) is supported: **df.loc\[\]** and **df.iloc\[\]** accept a mask for rows and for columns, while **df\[\]** accepts a mask for rows only.
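
As a quick orientation before the detailed sections, the sketch below reads the same value through each of the three operators and then applies a row mask. It assumes the small example frame used earlier in this notebook.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# dictionary-like column access, then a label lookup on the resulting Series
by_brackets = df['Col2']['Row3']

# label-based access to row and column in one step
by_loc = df.loc['Row3', 'Col2']

# position-based access: third row, second column
by_iloc = df.iloc[2, 1]

# a boolean list used as a row mask together with a column label
mask = [False, False, True]
masked = df.loc[mask, 'Col2']

print(by_brackets, by_loc, by_iloc)
print(masked)
```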

## Numpy

Basic knowledge of numpy is often considered a prerequisite to learning Pandas. Pandas is built on top of numpy and its indexing is closely related to numpy's indexing.

For an excellent free introductory presentation on numpy see: [Enthought Scipy 2018: Numpy](https://www.youtube.com/watch?v=V0D2mhVt7NE) 

The resources for this presentation are at: [Enthought Scipy 2018: Numpy Github Repo](https://github.com/enthought/Numpy-Tutorial-SciPyConf-2018).  

The resources include "slides.pdf", which, from page 17 through page 64, offers an excellent illustrated introduction to numpy.

## Recommendations

I highly recommend the excellent blog post: [Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)

Here I will repeat two of Ted Petrou's recommendations from that blog post.
* do not use dot notation to select columns
* do not use inplace=True

The above recommendations allow for more easily writing correct code.

Both dot notation and inplace=True are convenient for exploratory data analysis, as they require less typing.  However, when the code is intended for more complex usage, such as preparing data for machine learning, or is intended for production use and will need to be maintained, they can cause problems.
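
To make the dot notation problem concrete, here is a small sketch. The frame `sales` and its column labels are hypothetical, chosen so that one label collides with a DataFrame method and another is not a valid Python identifier.

```python
import warnings

import pandas as pd

# hypothetical frame: 'count' collides with DataFrame.count,
# 'unit price' contains a space
sales = pd.DataFrame({
    'count': [3, 5],
    'unit price': [2.0, 4.0],
})

# dot notation finds the DataFrame.count method, not the 'count' column
print(callable(sales.count))

# bracket notation always refers to the column
print(sales['count'].tolist())

# a label containing a space can only be reached with brackets
print(sales['unit price'].tolist())

# assigning through dot notation sets a plain attribute; no column is created
# (pandas issues a UserWarning for this, silenced here to keep the output short)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    sales.total = [6.0, 20.0]
print('total' in sales.columns)
```

Bracket notation behaves the same way for every label, which is why it is the safer habit.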

If you are on a team of data scientists, there is likely a written set of guidelines for how your team should use Pandas.  If so, follow those guidelines.

## Selection using **df\[filter\]**

Selects columns when filter is:
1. a single column label (which is in df.columns)
2. a list of column labels (all of which are in df.columns)

Selects rows when filter is:
3. a slice representing row positions
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Examples of each of the above 5 use cases follow.
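
Because **df\[filter\]** switches between columns and rows depending on the type of the filter, similar-looking expressions can behave very differently. A minimal sketch, assuming the example frame from earlier in this notebook:

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# a slice selects rows by position
rows = df[1:]

# a bare integer is treated as a column label, and no column is labelled 1
try:
    df[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(rows.index.tolist())
print(lookup_failed)
```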

### **df\[filter\]** Examples

In [17]:
# 1. single column label
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [18]:
isinstance(df['Col1'], pd.Series)

True

In [19]:
# 2. list of column labels
df[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [20]:
isinstance(df[['Col1', 'Col3']], pd.DataFrame)

True

In [21]:
# 3. slice referring to row positions
df[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [22]:
# 4. boolean series with matching index
row_boolean_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
df[row_boolean_series]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [23]:
# 5. list of boolean values
df[[False, True, False]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


## Selection using **df.loc\[row, col\]**

Where row is:
1. a single row label
2. a list of row labels
3. a slice matching row labels in order, inclusive of the slice end
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Where col is:
6. a single col label
7. a list of col labels
8. a slice matching col labels in order, inclusive of the slice end
9. a boolean Series (with index matching df.columns)
10. a list of boolean values

**df.loc\[rows\]** is the same as: **df.loc\[rows, : \]**  

Examples of each of the above 10 use cases follow.
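
Two of the points above are easy to check directly: a label slice includes its end point, and leaving out the column selector selects every column. The sketch below assumes the example frame from earlier in this notebook and uses **df.iloc\[\]** only for contrast.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# a label slice includes the end label
by_label = df.loc['Row1':'Row2']

# a position slice excludes the end position, as in plain Python
by_position = df.iloc[0:1]

# omitting the column selector is the same as selecting every column
same = df.loc[['Row1', 'Row3']].equals(df.loc[['Row1', 'Row3'], :])

print(len(by_label), len(by_position), same)
```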

### **df.loc\[rows, cols\]** Examples

In [24]:
# 1a. single row label, result is series
df.loc['Row2']

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

In [25]:
# 1b. single row label, result is dataframe
df.loc[['Row2']]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [26]:
# 2. list of row labels
df.loc[['Row1', 'Row3']]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [27]:
# 3a. slice of row labels
df.loc['Row1':'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [28]:
# 3b. slice of row labels
df.loc[:'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [29]:
# 4. boolean series with matching index
# Pandas will align the index
idx = ['Row2', 'Row1', 'Row3']

# show that the indexes contain the same values
print(f'indexes have same values: {set(idx) == set(df.index)}')

data = [False, True, False]
row_series = pd.Series(data=data, index=idx)
print(row_series)

df.loc[row_series]

indexes have same values: True
Row2    False
Row1     True
Row3    False
dtype: bool


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2


In [30]:
# 5. list of boolean values
# there is no index to align
selection = [False, True, False]
df.loc[selection]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [31]:
# 6a. single column label, result is series
df.loc[:'Row2', 'Col2']

Row1     1
Row2    11
Name: Col2, dtype: int64

In [32]:
# 6b. single column label, result is dataframe
df.loc[:'Row2', ['Col2']]

Unnamed: 0,Col2
Row1,1
Row2,11


In [33]:
# 7. list of column labels
df.loc[:'Row2', ['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


In [34]:
# 8a. slice of column labels
df.loc['Row1':'Row2', 'Col2':]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [35]:
# 8b. same as 8a, using explicit Python slice object
# Reminder: with .loc[] and only with .loc[], slice is inclusive of its end point
row_slice = slice('Row1', 'Row2')
col_slice = slice('Col2', None)
df.loc[row_slice, col_slice]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [36]:
# 8c. slice order matters, there are no rows that match Row2 followed by Row1
df.loc['Row2':'Row1', 'Col2':]

Unnamed: 0,Col2,Col3


In [37]:
# 9. boolean series with matching index values that match df.columns
# Pandas will align the column index
idx = ['Col2', 'Col1', 'Col3']

# show that the indexes contain the same values
print(f'indexes have same values: {set(idx) == set(df.columns)}')

data = [False, True, False]
col_series = pd.Series(data=data, index=idx)
print(col_series)
df.loc[:,col_series]

indexes have same values: True
Col2    False
Col1     True
Col3    False
dtype: bool


Unnamed: 0,Col1
Row1,0
Row2,10
Row3,20


In [38]:
# 10. list of boolean values
# there are no indexes to align
df.loc[:, [True, False, True]]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [39]:
# it is a KeyError if the label is not in the column index
print('Col99' in df.columns)
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Col99'


In [40]:
# it is a KeyError if the label is not in the row index
print('Row99' in df.index)
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Row99'


#### **df.loc\[\]** Examples with Default Index
The default index creates a range of integers that match the row position.

Even though the default index initially matches the row position, **df.loc\[\]** is selecting "row label", not "row position".

In [41]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [42]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [43]:
# df.index is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [44]:
# matches row labels, 1 through 2 inclusive
df.loc[1:2, 'Col1':'Col2']

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [45]:
# matches row labels, 1 and 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [46]:
# sort the dataframe in reverse order
df = df.sort_index(ascending=False)
df

Unnamed: 0,Col1,Col2,Col3
2,20,21,22
1,10,11,12
0,0,1,2


In [47]:
# there are no row labels 1 followed by 2
df.loc[1:2, ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2


In [48]:
# there is a row label 1, and a row label 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


#### **df.loc\[\]** Examples using Sorted Index

When a slice is used for row labels or column labels, the DataFrame is usually sorted first so that the slice behaves as expected.

Sorting the columns is less common, since column labels rarely have a meaningful lexicographical order, but it is sometimes useful.
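
One way to check whether a label slice will behave as expected is to inspect the `is_monotonic_increasing` property of the index before slicing. This is a sketch of that check on a small frame with hypothetical labels, not a required step.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1], [10, 11], [20, 21]],
    columns=['x_01', 'x_02'],
    index=['z_03', 'z_02', 'z_01'],
)

# the row labels are in descending order, so an ascending label slice finds nothing
print(df.index.is_monotonic_increasing)
print(len(df.loc['z_01':'z_02']))

# sort_index() returns a new frame whose row labels are in ascending order
df_sorted = df.sort_index(axis='index')
print(df_sorted.index.is_monotonic_increasing)
print(df_sorted.loc['z_01':'z_02'].index.tolist())
```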

In [49]:
columns = ['x_03', 'y_02', 'y_01', 'x_01', 'x_02']
index = ['z_03', 'z_02', 'z_01']
data = [[ 0, 1, 2, 3, 4],
       [ 10, 11, 12, 13, 14],
       [ 20, 21, 22, 23, 24]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_03,0,1,2,3,4
z_02,10,11,12,13,14
z_01,20,21,22,23,24


In [50]:
# attempt to get all rows from z_01 to z_02; nothing found
df.loc['z_01':'z_02']

Unnamed: 0,x_03,y_02,y_01,x_01,x_02


In [51]:
df = df.sort_index(axis='index')
df

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_01,20,21,22,23,24
z_02,10,11,12,13,14
z_03,0,1,2,3,4


In [52]:
# now the slice does get all rows from z_01 to z_02 inclusive
df.loc['z_01':'z_02']

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_01,20,21,22,23,24
z_02,10,11,12,13,14


In [53]:
# attempt to get all columns from y_01 to y_02; nothing is returned
df.loc[:, 'y_01':'y_02']

z_01
z_02
z_03


In [54]:
df = df.sort_index(axis='columns')
df

Unnamed: 0,x_01,x_02,x_03,y_01,y_02
z_01,23,24,20,22,21
z_02,13,14,10,12,11
z_03,3,4,0,2,1


In [55]:
# now the slice does get all columns from y_01 to y_02
df.loc[:, 'y_01':'y_02']

Unnamed: 0,y_01,y_02
z_01,22,21
z_02,12,11
z_03,2,1


## Pandas Operations Modify in-place or return Modified values
By default, most Pandas operations perform the requested operation without modifying the DataFrame or Series they operate on; they return a modified copy instead.

Some Pandas operations have the keyword argument 'inplace'.  If this is set to True, then the underlying DataFrame or Series is modified directly.

There are several reasons for not using inplace=True, including:
* code that treats objects as immutable is easier to get right
* code that treats objects as immutable is easier to parallelize
* operations that return a new object can be method chained
* inplace=True does not generally use less memory

In [56]:
# Pure Python example of non-inplace operation: sorted()
x = [3, 2, 1]
x_orig = x.copy()

# sorted will not modify x; it returns a new list with the items of x in sorted order
y = sorted(x)
print(f'x unmodified: {x_orig == x}')

x unmodified: True


In [57]:
# Pure Python example of inplace operation: list.sort()
x = [3, 2, 1]
x_orig = x.copy()

# sort will modify x, None is returned
x.sort()
print(f'x unmodified: {x_orig == x}')

x unmodified: False


In [58]:
# convenience method to avoid throwing errors in the following Jupyter Cells
def df_equals(df1, df2):
    try:
        return df1.equals(df2)
    except KeyError:
        return False

In [59]:
# DataFrame example of non-inplace operation: df.drop()
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

# by default, drop returns a copy of the df with the specified columns dropped
df.drop('Col1', axis = 'columns')
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: True


In [60]:
# DataFrame example of inplace operation, df.drop(inplace=True)
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

# drop with inplace=True, will modify df and return None
df.drop('Col1', axis = 'columns', inplace=True)
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: False


### Method Chaining Requires Copy of Modified Object to be Returned

Some programming languages allow for data to be piped from one data operator to the next to form a data pipeline.  Python does not allow this syntax using a pipe, but it does allow for nearly the same thing using method chaining.

Long method chains are easier to read if each operation is placed on a separate line.  To tell Python that the statement extends across multiple lines, wrap the entire statement in parentheses.

The convention in Python and Pandas is that an in-place operation returns None and a non-in-place operation returns a modified copy of the data.
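
A method chain is shorthand for a sequence of steps in which each step returns a new object. The sketch below writes the same hypothetical pipeline both ways to show that the results match and that the starting DataFrame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
)
df_orig = df.copy()

# step by step: each call returns a new DataFrame bound to a temporary name
step1 = df.drop('Col1', axis='columns')
step2 = step1.rename({'Col2': 'Col-Two'}, axis='columns')
stepwise = step2.sort_index(ascending=False)

# the same pipeline as one chain, wrapped in parentheses to span several lines
chained = (df.drop('Col1', axis='columns')
             .rename({'Col2': 'Col-Two'}, axis='columns')
             .sort_index(ascending=False))

print(chained.equals(stepwise))
print(df.equals(df_orig))
```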

In [61]:
# Example of Method Chaining
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
df = pd.DataFrame(data=data, columns=columns, index=index)

print('Original Data')
display(df)

print('Method Chained Result')
(df.assign(Sum=df['Col2']+df['Col3'])
   .drop(['Col1'], axis='columns')
   .reset_index(drop=True))

Original Data


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


Method Chained Result


Unnamed: 0,Col2,Col3,Sum
0,1,2,3
1,11,12,23
2,21,22,43


## Selection using **df.iloc\[rows, cols\]**

Where row is:
1. a single row position
2. a list of row positions
3. a slice of row positions
4. a boolean list

Where col is:
1. a single col position
2. a list of col positions
3. a slice of col positions
4. a boolean list

Slice works as it normally does for Python.  That is, the end point is exclusive.

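**.iloc\[rows\]** is the same as: **.iloc\[rows, :\]**

All of the above works the same for a numpy 2D array, with one exception: when lists are given for both rows and columns, numpy pairs the positions element-wise, while **.iloc\[\]** selects every combination of the listed rows and columns.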
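
The comparison with numpy can be checked for the column selectors as well. The sketch below assumes the example frame from earlier in this notebook; it also shows the one case where the two differ, which is passing lists for both rows and columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)
values = df.values

# single position, list of positions, slice and boolean list all match numpy
single_same = np.array_equal(df.iloc[:, 1].values, values[:, 1])
list_same = np.array_equal(df.iloc[:, [0, 2]].values, values[:, [0, 2]])
slice_same = np.array_equal(df.iloc[:, 0:2].values, values[:, 0:2])
mask_same = np.array_equal(df.iloc[:, [True, False, True]].values,
                           values[:, [True, False, True]])
print(single_same, list_same, slice_same, mask_same)

# lists on both axes: iloc returns the 2x2 block of rows 0, 2 and columns 0, 2,
# while numpy pairs the positions and returns elements (0, 0) and (2, 2)
block = df.iloc[[0, 2], [0, 2]].values
pairs = values[[0, 2], [0, 2]]
print(block.shape, pairs.shape)
```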

In [62]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [63]:
values = df.values
values

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [64]:
print(f'values is: {type(values)} with {values.ndim} dimensions')

values is: <class 'numpy.ndarray'> with 2 dimensions


In [65]:
# 1. numpy specify single row position
values[1,:]

array([10, 11, 12])

In [66]:
# 1. iloc[] specify single row position
df.iloc[1,:].values

array([10, 11, 12])

In [67]:
# returns a series
df.iloc[1,:]

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

In [68]:
# 2. numpy specify list of row positions (aka fancy indexing)
values[[0,2], :]

array([[ 0,  1,  2],
       [20, 21, 22]])

In [69]:
# 2. iloc[] specify list of row positions
df.iloc[[0,2], :].values

array([[ 0,  1,  2],
       [20, 21, 22]])

In [70]:
# returns a dataframe
df.iloc[[0,2], :]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [71]:
# 3. numpy specify slice of row positions
values[0:2, :]

array([[ 0,  1,  2],
       [10, 11, 12]])

In [72]:
# 3. iloc[] specify slice of row positions
df.iloc[0:2, :].values

array([[ 0,  1,  2],
       [10, 11, 12]])

In [73]:
# same but with explicit slice objects
row_slice = slice(0,2)
col_slice = slice(None, None)

In [74]:
values[row_slice, col_slice]

array([[ 0,  1,  2],
       [10, 11, 12]])

In [75]:
df.iloc[row_slice, col_slice].values

array([[ 0,  1,  2],
       [10, 11, 12]])

In [76]:
# returns a dataframe
df.iloc[row_slice, col_slice]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [77]:
# one liner to verify all values are the same
(values[row_slice, col_slice] == df.iloc[row_slice, col_slice].values).all()

True

In [78]:
# 4. numpy specify list of boolean values (a mask)
values[[True, False, True], :]

array([[ 0,  1,  2],
       [20, 21, 22]])

In [79]:
# 4. iloc[] specify list of boolean values
df.iloc[[True, False, True], :].values

array([[ 0,  1,  2],
       [20, 21, 22]])

In [80]:
# return is a dataframe
df.iloc[[True, False, True], :]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


## .index of DataFrame/Series Created from DataFrame

A Series created from a DataFrame:
* will have its index be a (subset of) df.index, if the operation was a column operation
* will have its index be a (subset of) df.columns, if the operation was a row operation

For a definition of "column operation" and "row operation" see below under: Axis Specification.

In [81]:
# select an entire column
s = df['Col1']
s.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [82]:
# show that the series has the same index as the df it was selected from
s.index.equals(df.index)

True

In [83]:
# often the index of the column is shared with dataframe.index
s.index is df.index

True

In [84]:
# select an entire row
s_subset = df.iloc[2]
s_subset.index

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [85]:
# show that the series has the same index as the df.columns it was selected from
s_subset.index.equals(df.columns)

True

In [86]:
# often the index of the row is shared with dataframe.columns
s_subset.index is df.columns

True

In [87]:
# select partial rows and columns
df_subset = df.iloc[:2,1:]
df_subset

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [88]:
# every value in df_subset index is a value in df.index 
df_subset.index.isin(df.index).all()

True

In [89]:
# every value in df_subset columns is a value in df.columns
df_subset.columns.isin(df.columns).all()

True

In [90]:
# create a Series by applying a comparison operator to an entire column
bool_series = df['Col2'] < 20
bool_series.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [91]:
# show that the series has the same index as the df it was compared against
bool_series.index.equals(df.index)

True

## Boolean Series from Value Comparison

A **comparison operator** is one of:  
* <  
* <=  
* ==  
* \>  
* \>=  
* !=   
and produces True/False results.

Methods such as .isin() and .isnull() also produce True/False results.

Selecting rows through the use of a comparison operator is similar to selecting rows using the WHERE clause of a SQL query. 
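
The methods .isin() and .isnull() can be used for row selection in the same way as a comparison operator. The sketch below uses a small hypothetical frame with one missing value; the SQL analogies in the comments are loose comparisons, not exact equivalents.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [0, np.nan, 20],
    'Col2': [1, 11, 21],
})

# isin() is True where the value is one of the listed values (compare SQL IN)
in_list = df['Col2'].isin([1, 21])

# isnull() is True where the value is missing (compare SQL IS NULL)
is_missing = df['Col1'].isnull()

print(df.loc[in_list].index.tolist())
print(df.loc[is_missing].index.tolist())

# ~ inverts a boolean Series, selecting the rows that are not missing
print(df.loc[~is_missing].index.tolist())
```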

In [92]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [93]:
# comparison produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [94]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [95]:
# either df[criteria] or df.loc[criteria] will work
# df.loc[criteria] is clearer as it shows that rows are being selected
df.loc[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [96]:
criteria.index.equals(df.index)

True

In [97]:
criteria1 = df['Col2'] > 10
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [98]:
filter_rows = criteria1 | criteria2
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [99]:
# due to & and | operator precedence, when written on one line,
# it is necessary to use parentheses around comparisons
filter_rows = (df['Col2'] > 10) & (df['Col1'] < 20)
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [100]:
# boolean series constructed from columns, has the same index as the dataframe
filter_rows.index.equals(df.index)

True
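
The need for parentheses around each comparison can be demonstrated directly. In the sketch below, which assumes the same default-index frame as above, the unparenthesized expression is parsed by Python as a chained comparison and fails.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
)

# & binds tighter than > and <, so Python reads the next expression as
# df['Col2'] > (10 & df['Col1']) < 20, a chained comparison that needs a
# single True/False value from a Series and therefore raises ValueError
try:
    df['Col2'] > 10 & df['Col1'] < 20
    raised = False
except ValueError:
    raised = True

# with parentheses, each comparison is evaluated first, then combined
with_parens = (df['Col2'] > 10) & (df['Col1'] < 20)

print(raised)
print(with_parens.tolist())
```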

## Axis Specification

For someone coming from another programming language, the axis specification may be confusing.

Under most circumstances, Pandas will be able to detect a problem if you incorrectly specify the axis.

### Operations which Modify the Structure of a DataFrame
The axis specification is intuitive.

row,col = 0, 1

Use axis=0 or axis='index' to affect rows.  
Use axis=1 or axis='columns' to affect columns.

In [101]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [102]:
df.drop('Col2', axis='columns')

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [103]:
# same as above
df.drop('Col2', axis=1)

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [104]:
df.drop('Row2', axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [105]:
# axis=0 is equivalent to axis='index'; this call drops Row1 rather than Row2
df.drop('Row1', axis=0)

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [106]:
modify_col_names = {'Col2':'Col-Two'}
df.rename(modify_col_names, axis='columns')

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [107]:
modify_row_names = {'Row2':'Row-Two'}
df.rename(modify_row_names, axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [108]:
# modify both in a single step by telling rename which are the rows and which are the columns
df.rename(columns=modify_col_names, index=modify_row_names)

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [109]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ np.nan, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row2,,11,12
Row3,20.0,21,22


In [110]:
# drop all columns containing any nans
df.dropna(axis='columns')

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12
Row3,21,22


In [111]:
# drop all rows containing any nans
df.dropna(axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row3,20.0,21,22


### Operations which act on the Data in the DataFrame

The axis specification may seem confusing at first.

#### Terminology used by Other Programming Languages and SQL   
A **column operation** operates on a column of data.  The column label is kept in the result to identify which columns of data were operated on.

A **row operation** operates on a row of data. The row label is kept in the result to identify which rows of data were operated on.

#### Terminology used by Numpy and Pandas
In both Numpy and Pandas, the row axis is axis=0, and the column axis is axis=1.

However:
* to specify a column data operation, use axis=0 (or axis='index').
* to specify a row data operation, use axis=1 (or axis='columns').

One way to remember this is that the axis that disappears (or is collapsed), is the specified axis.

For a column operation, the row axis disappears, so specify axis=0.

For a row operation, the column axis disappears, so specify axis=1.
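
The "axis that disappears" rule is easier to see on a frame that is not square. The sketch below uses a hypothetical 2 x 3 frame so that the two axes have different lengths.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2'],
)
print(df.shape)

# axis=0 collapses the row axis: one result per column
col_totals = df.sum(axis=0)

# axis=1 collapses the column axis: one result per row
row_totals = df.sum(axis=1)

print(col_totals.index.tolist(), col_totals.shape)
print(row_totals.index.tolist(), row_totals.shape)

# numpy follows the same rule on the underlying 2D array
print(df.values.sum(axis=0).shape, df.values.sum(axis=1).shape)
```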

In [112]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [113]:
# column operation (result has column labels)
df.sum(axis='index')

Col1    30
Col2    33
Col3    36
dtype: int64

In [114]:
# same as above
df.sum(axis=0)

Col1    30
Col2    33
Col3    36
dtype: int64

In [115]:
# row operation (result has row labels)
df.sum(axis='columns')

Row1     3
Row2    33
Row3    63
dtype: int64

In [116]:
# same as above
df.sum(axis=1)

Row1     3
Row2    33
Row3    63
dtype: int64

In [117]:
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [118]:
# find max per column (result has column labels)
df.max(axis=0)

Col1    20
Col2    21
Col3    22
dtype: int64

In [119]:
# find min per row (result has row labels)
df.min(axis=1)

Row1     0
Row2    10
Row3    20
dtype: int64