# Pandas Introduction

## Overview

This notebook is example-based.

The first section describes the basic elements of a Pandas DataFrame.

The second section shows the details of selecting from a Pandas DataFrame with:
* **df\[\]**
* **df.loc\[\]**
* **df.iloc\[\]**

This notebook includes more detail than does a typical Pandas introduction.

Later notebooks will use real-world data and perform data analysis.

# Elements of a Pandas DataFrame

In [1]:
import pandas as pd
import numpy as np

### Components of a DataFrame 
A DataFrame has:
1. column labels
2. row labels
3. values

In [126]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [127]:
# column labels as list
df.columns.tolist()

['Col1', 'Col2', 'Col3']

In [128]:
# row labels as list
df.index.tolist()

['Row1', 'Row2', 'Row3']

In [129]:
# data as nested list
df.values.tolist()

[[0, 1, 2], [10, 11, 12], [20, 21, 22]]

### DataFrame Column and Row Labels

Column labels are often called column names. I prefer to use the term "column labels" as it emphasizes the fact that columns and rows can be used in a very similar manner. 

df.columns contains the column labels.  
df.index contains the row labels.  

Both are instances of pd.Index (or a subclass thereof).

In [136]:
# get column index
df.columns

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [137]:
isinstance(df.columns, pd.Index)

True

In [138]:
# the index is iterable
[col for col in df.columns]

['Col1', 'Col2', 'Col3']

In [139]:
# get row index
df.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [140]:
isinstance(df.index, pd.Index)

True

In [141]:
# the index is iterable
[row for row in df.index]

['Row1', 'Row2', 'Row3']

### DataFrame Values

Each column is a pd.Series whose values all share a single dtype.  Pandas is built on top of Numpy, so the values of a column or of a whole DataFrame can be retrieved as a Numpy array.
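
As a small sketch of the "single datatype per column" point, the frame below mixes an integer column with a string column. The names `mixed`, `Count` and `Name` are made up for this illustration and do not appear elsewhere in the notebook.

```python
import pandas as pd

# hypothetical frame with one integer column and one string column
mixed = pd.DataFrame({'Count': [1, 2, 3], 'Name': ['a', 'b', 'c']})

# each column keeps its own dtype; kind 'i' means a signed integer dtype
count_kind = mixed['Count'].dtype.kind

# the combined 2D array needs one dtype that can hold every column,
# so integers and strings together fall back to the generic object dtype
combined_dtype = mixed.to_numpy().dtype
```

This is why the all-integer frame used in the cells of this section produces an integer 2D array, while a frame with mixed column types would not.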

In [165]:
# one column of dataframe
s_col1 = df['Col1']
type(s_col1)

pandas.core.series.Series

In [166]:
array_1d = s_col1.values
array_1d

array([ 0, 10, 20])

In [167]:
print(type(array_1d))
print(f'ndim: {array_1d.ndim}')
print(f'shape: {array_1d.shape}')
print(f'element dtype: {array_1d.dtype}')
df['Col1'].dtype == df['Col1'].values.dtype

<class 'numpy.ndarray'>
ndim: 1
shape: (3,)
element dtype: int64


True

In [168]:
# all values in dataframe
array_2d = df.values
array_2d

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [169]:
print(type(array_2d))
print(f'ndim: {array_2d.ndim}')
print(f'shape: {array_2d.shape}')
print(f'element dtype: {array_2d.dtype}')

<class 'numpy.ndarray'>
ndim: 2
shape: (3, 3)
element dtype: int64


### Be Careful to Distinguish Between pd.Index and df.index

**pd.Index:** This is a class.  
**df.columns:** This is an instance of pd.Index (or one of its subclasses)  
**df.index:**   This is an instance of pd.Index (or one of its subclasses)

# Selecting Data from a Pandas DataFrame

## Indexing Overview

Pandas has so many ways of indexing that it is helpful to have an overview that is correct, even if it is not complete.  This notebook cell provides a useful starting point for how to think about Pandas indexing.  The remainder of this notebook presents a more complete view of Pandas indexing.

Most of Pandas Indexing can be described as:
* **df\[columns\]** provides Python dictionary like access to the columns.
* **df.loc\[row_labels, col_labels\]** provides access by row and column labels.
* **df.iloc\[row_positions, col_positions\]** provides access by row and column positions

**df.iloc\[row_positions, col_positions\]** is almost identical to indexing numpy 2D arrays.

Additionally, masking (the use of boolean arrays to pick out values where the mask value is True) is supported for both rows and columns.
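
The overview above can be sketched in a few lines. The frame mirrors the one built earlier in this notebook; the name `demo` and the result variable names are only illustrative.

```python
import pandas as pd

demo = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# dictionary-like access to a column
by_column = demo['Col2']

# access by row label and column label
by_label = demo.loc['Row2', 'Col3']

# access by row position and column position
by_position = demo.iloc[1, 2]

# masking: keep the rows where the boolean mask is True
mask = demo['Col1'] > 5
masked = demo[mask]
```

Here `by_label` and `by_position` refer to the same cell, once by its labels and once by its positions.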

## Numpy

Basic knowledge of Numpy is often considered a prerequisite to learning Pandas. Pandas is built on top of Numpy and its indexing is closely related to Numpy's indexing.

For a free 2018 introductory YouTube presentation on Numpy, see: [Enthought Numpy](https://www.youtube.com/watch?v=V0D2mhVt7NE).  

The GitHub resources for this talk are at: [Enthought Numpy Github Repo](https://github.com/enthought/Numpy-Tutorial-SciPyConf-2018).  

Included in the Enthought GitHub resources is "slides.pdf", which, starting on page 18, provides an excellent and concise introduction to Numpy.

## Recommendations

* do not use dot notation to select columns
* do not use inplace=True
* do not use "index chaining", that is, prefer df.iloc\[\] to df\[\]\[\]
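
A brief sketch of what can go wrong with dot notation, and of a single indexer replacing a chained one. The column name `count` is chosen deliberately for this illustration because it collides with a DataFrame method; `demo` is a hypothetical frame.

```python
import pandas as pd

demo = pd.DataFrame({'count': [1, 2, 3], 'Col2': [10, 20, 30]},
                    index=['Row1', 'Row2', 'Row3'])

# dot notation finds the DataFrame.count method, not the 'count' column
dot_is_method = callable(demo.count)

# bracket notation always returns the column
bracket_is_series = isinstance(demo['count'], pd.Series)

# one .loc[] call selects the same cell as the chained demo['Col2']['Row2']
value = demo.loc['Row2', 'Col2']
```

Chained indexing performs two separate selections. When it is used on the left side of an assignment, whether the original frame is modified can depend on the Pandas version, which is one more reason to prefer a single indexer.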

## Selection using **df\[filter\]**

Selects columns when filter is:
1. a single column label (which is in df.columns)
2. a list of column labels (all of which are in df.columns)

Selects rows when filter is:
3. a slice representing row positions
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Examples of each of the above 5 use cases follow.

### **df\[filter\]** Examples

In [170]:
# 1. single column label
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [171]:
isinstance(df['Col1'], pd.Series)

True

In [172]:
# 2. list of column labels
df[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [173]:
isinstance(df[['Col1', 'Col3']], pd.DataFrame)

True

In [174]:
# 3. slice referring to row positions
df[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [175]:
# 4. boolean series with matching index
row_boolean_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
df[row_boolean_series]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [176]:
# 5. list of boolean values
df[[False, True, False]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


## Selection using **df.loc\[rows, cols\]**

Where row is:
1. a single row label
2. a list of row labels
3. a slice matching row labels in order, inclusive of the slice end
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Where col is:
6. a single col label
7. a list of col labels
8. a slice matching col labels in order, inclusive of the slice end
9. a boolean Series (with index matching df.columns)
10. a list of boolean values

**df.loc\[rows\]** is the same as: **df.loc\[rows, : \]**  

Examples of each of the above 10 use cases follow.

### **df.loc\[rows, cols\]** Examples

In [177]:
# 1a. single row label
df.loc['Row2']

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

In [178]:
# 1b. single row label, as dataframe
df.loc['Row2'].to_frame()

Unnamed: 0,Row2
Col1,10
Col2,11
Col3,12


In [179]:
# 2. list of row labels
df.loc[['Row1', 'Row3']]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [180]:
# 3a. slice of row labels
df.loc['Row1':'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [181]:
# 3b. slice of row labels
df.loc[:'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


In [182]:
# 4. boolean series with matching index
row_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
print(row_series.index.equals(df.index))
df.loc[row_series]

True


Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [183]:
# 5. list of boolean values
selection = [False, True, False]
df.loc[selection]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


In [184]:
# 6. single column label (as dataframe)
df.loc[:'Row2', 'Col2'].to_frame()

Unnamed: 0,Col2
Row1,1
Row2,11


In [185]:
# 7. list of column labels
df.loc[:'Row2', ['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


In [186]:
# 8a. slice of column labels
df.loc['Row1':'Row2', 'Col2':]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [187]:
# 8b. same as 8a, using explicit slice object
# Reminder: unlike .iloc[] and ordinary Python slicing, a .loc[] label slice includes its end point
row_slice = slice('Row1', 'Row2')
col_slice = slice('Col2', None)
df.loc[row_slice, col_slice]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [188]:
# 8c. slice order matters, there are no rows that match Row2 followed by Row1
df.loc['Row2':'Row1', 'Col2':]

Unnamed: 0,Col2,Col3


In [189]:
# 9. boolean series whose index matches df.columns
col_series = pd.Series([False, True, False], index=['Col1', 'Col2', 'Col3'])
print(df.columns.equals(col_series.index))
df.loc[:'Row2',col_series]

True


Unnamed: 0,Col2
Row1,1
Row2,11


In [190]:
# 10. list of boolean values
df.loc[:'Row2', [True, False, True]]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


In [191]:
# it is a KeyError if the label is not in the column index
print('Col99' in df.columns)
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Col99'


In [192]:
# it is a KeyError if the label is not in the row index
print('Row99' in df.index)
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Row99'


#### **df.loc\[\]** Examples with Default Index
The default index creates a range of integers that match the row position.

Even though the default index initially matches the row position, **df.loc\[\]** selects by row label, not by row position.

In [193]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [194]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [195]:
# df.index is an instance of RangeIndex, which is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [196]:
# matches row labels, 1 thru 2
df.loc[1:2, 'Col1':'Col2']

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [197]:
# matches row labels, 1 and 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [198]:
# sort the dataframe in reverse order
df = df.sort_index(ascending=False)
df

Unnamed: 0,Col1,Col2,Col3
2,20,21,22
1,10,11,12
0,0,1,2


In [199]:
# there are no row labels 1 followed by 2
df.loc[1:2, ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2


In [200]:
# there is a row label 1, and a row label 2
df.loc[[1, 2], ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


#### **df.loc\[\]** Examples using Sorted Index

When a slice is used for row labels or column labels, the dataframe is often sorted first, so that the slice performs as expected.

Except for dataframes whose column names have a meaningful lexicographical sorting, sorting the columns is not usually done, but sometimes it is useful.
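
Before relying on a label slice, one option is to check whether the index is already in ascending order. The sketch below uses the `is_monotonic_increasing` property of an index; the frame `demo` and its single column `x` are made up for this illustration.

```python
import pandas as pd

demo = pd.DataFrame({'x': [0, 10, 20]}, index=['z_03', 'z_02', 'z_01'])

# the labels are in descending order, so the index is not monotonic increasing
sorted_before = demo.index.is_monotonic_increasing

# sort_index returns a new frame with the row labels in ascending order
demo_sorted = demo.sort_index()
sorted_after = demo_sorted.index.is_monotonic_increasing

# the label slice now returns the rows from z_01 to z_02 inclusive
selected = demo_sorted.loc['z_01':'z_02']
```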

In [201]:
columns = ['x_03', 'y_02', 'y_01', 'x_01', 'x_02']
index = ['z_03', 'z_02', 'z_01']
data = [[ 0, 1, 2, 3, 4],
       [ 10, 11, 12, 13, 14],
       [ 20, 21, 22, 23, 24]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_03,0,1,2,3,4
z_02,10,11,12,13,14
z_01,20,21,22,23,24


In [202]:
# attempt to get all rows from z_01 to z_02; nothing found
df.loc['z_01':'z_02']

Unnamed: 0,x_03,y_02,y_01,x_01,x_02


In [203]:
df = df.sort_index(axis='index')
df

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_01,20,21,22,23,24
z_02,10,11,12,13,14
z_03,0,1,2,3,4


In [205]:
# now the slice does get all rows from z_01 to z_02 inclusive
df.loc['z_01':'z_02']

Unnamed: 0,x_03,y_02,y_01,x_01,x_02
z_01,20,21,22,23,24
z_02,10,11,12,13,14


In [206]:
# attempt to get all columns from y_01 to y_02; nothing is returned
df.loc[:, 'y_01':'y_02']

z_01
z_02
z_03


In [207]:
df = df.sort_index(axis='columns')
df

Unnamed: 0,x_01,x_02,x_03,y_01,y_02
z_01,23,24,20,22,21
z_02,13,14,10,12,11
z_03,3,4,0,2,1


In [208]:
# now the slice does get all columns from y_01 to y_02
df.loc[:, 'y_01':'y_02']

Unnamed: 0,y_01,y_02
z_01,22,21
z_02,12,11
z_03,2,1


## Pandas Operations Modify in-place or return Modified values
By default, most Pandas operations perform the requested operation without modifying the DataFrame or Series they operate on.  The return value is a modified copy of the object.

Some Pandas operations have the keyword argument 'inplace'.  If this is set to True, then the underlying DataFrame or Series is modified directly.

There are several reasons for not using inplace=True, such as:
* immutable objects being easier to write correct code for
* immutable objects being easier to parallelize 
* allowing for method chaining
* internal implementation details of Pandas

In [209]:
# Pure Python example of non-inplace operation
x = [3, 2, 1]
x_orig = x.copy()

# sorted will not modify x, it will return a new list with the elements of x in sorted order
y = sorted(x)
print(f'x unmodified: {x_orig == x}')

x unmodified: True


In [211]:
# Pure Python example of inplace operation
x = [3, 2, 1]
x_orig = x.copy()

# sort will modify x, None is returned
x.sort()
print(f'x unmodified: {x_orig == x}')

x unmodified: False


In [214]:
# convenience method to avoid throwing errors in following Jupyter Cells
def df_equals(df1, df2):
    try:
        return df1.equals(df2)
    except KeyError:
        return False

In [215]:
# DataFrame example of non-inplace operation
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

# by default, drop returns a copy of the df with the specified columns dropped
df.drop('Col1', axis = 'columns')
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: True


In [216]:
# DataFrame example of inplace operation
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)
df_orig = df.copy()

# drop with inplace=True, will modify df and return None
df.drop('Col1', axis = 'columns', inplace=True)
print(f'df unmodified: {df_equals(df_orig, df)}')

df unmodified: False


### Method Chaining Requires Copy of Modified Object to be Returned

Some programming languages allow for data to be piped from one data operator to the next to form a data pipeline.  Python does not allow this syntax using a pipe, but it does allow for the equivalent using method chaining.

Long method chains are easier to read if each operation is placed on a separate line.  To tell Python that the statement extends across multiple lines, wrap the entire statement in parentheses.

The convention in Python and Pandas is that an in-place operation returns None and a non in-place operation returns a copy of the modified data.
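
The following sketch shows why this convention matters for chaining: a call made with inplace=True returns None, so there is nothing to attach the next method call to. The frame `demo` and the variable names are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({'Col1': [0, 10], 'Col2': [1, 11], 'Col3': [2, 12]})

# non in-place: each call returns a new DataFrame, so calls can be chained
chained = (demo.drop('Col1', axis='columns')
               .rename({'Col2': 'Col-Two'}, axis='columns'))

# in-place: the call modifies demo and returns None
in_place_result = demo.drop('Col1', axis='columns', inplace=True)

# chaining onto None fails with an AttributeError
try:
    in_place_result.rename({'Col2': 'Col-Two'}, axis='columns')
    chain_failed = False
except AttributeError:
    chain_failed = True
```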

In [219]:
df

Unnamed: 0,Col2,Col3
0,1,2
1,11,12
2,21,22


In [220]:
df['Col2']

0     1
1    11
2    21
Name: Col2, dtype: int64

In [223]:
# Example of Method Chaining
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
print('Original Data')
display(df)

print('Method Chained Result')
(df.assign(Sum=df['Col2']+df['Col3'])
   .drop(['Col1'], axis='columns')
   .reset_index(drop=True))

Original Data


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


Method Chained Result


Unnamed: 0,Col2,Col3,Sum
0,1,2,3
1,11,12,23
2,21,22,43


##  Selection using **df.iloc\[rows, cols\]**

### Compare with numpy 2D Array Selection

When:
1. rows is either an integer or a slice,   
2. cols is either an integer or a slice,  

**df.iloc\[rows, cols\]** is very similar to selecting values from a 2D numpy array.

In [227]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [228]:
values = df.values
values

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [229]:
print(f'values is: {type(values)} with {values.ndim} dimensions')

values is: <class 'numpy.ndarray'> with 2 dimensions


In [230]:
values[1,:]

array([10, 11, 12])

In [231]:
df.iloc[1,:].values

array([10, 11, 12])

In [232]:
values[2:,:2]

array([[20, 21]])

In [233]:
df.iloc[2:,:2].values

array([[20, 21]])

In [234]:
row_slice = slice(0,2)
col_slice = slice(None, None)

In [235]:
values[row_slice, col_slice]

array([[ 0,  1,  2],
       [10, 11, 12]])

In [236]:
df.iloc[row_slice, col_slice].values

array([[ 0,  1,  2],
       [10, 11, 12]])

In [237]:
(values[row_slice, col_slice] == df.iloc[row_slice, col_slice].values).all()

True

### Selection using **df.iloc\[rows, cols\]** in General

Where row is:
* a single row position
* a list of row positions
* a slice of row positions
* a boolean list

Where col is:
* a single col position
* a list of col positions
* a slice of col positions
* a boolean list

Slice works as it normally does for Python.

**.iloc\[rows\]** is the same as: **.iloc\[rows, :\]**  
That is, all columns are selected.
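
A short sketch of the slice behaviour described above, contrasted with a label slice. The frame mirrors the one used in this section; `demo` and the result names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# position slice: the end position 2 is excluded, as in ordinary Python slicing
by_position = demo.iloc[0:2]

# label slice: the end label 'Row2' is included
by_label = demo.loc['Row1':'Row2']

# .iloc[rows] selects all columns, the same as .iloc[rows, :]
same = demo.iloc[0:2].equals(demo.iloc[0:2, :])
```

Both selections return Row1 and Row2, but the position slice stops before its end while the label slice stops at its end.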

In [238]:
rows = [True, False, True]
df.iloc[rows]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [239]:
df.iloc[[0,2]]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [240]:
df.iloc[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [241]:
df.iloc[[1,2]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [242]:
df.iloc[[1,2], [0,2]]

Unnamed: 0,Col1,Col3
Row2,10,12
Row3,20,22


In [243]:
df.iloc[[False, True, True], [True, False, True]]

Unnamed: 0,Col1,Col3
Row2,10,12
Row3,20,22


## .index of DataFrame/Series Created from DataFrame

A Series created from a DataFrame df:
* will have its index be a (subset of) df.index, if the operation was a column operation
* will have its index be a (subset of) df.columns, if the operation was a row operation

In [244]:
# select an entire column
s = df['Col1']
s.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [245]:
# the indexes are equal
s.index.equals(df.index)

True

In [249]:
# often the index of the column is shared with dataframe.index
s.index is df.index

True

In [250]:
# select an entire row
s_subset = df.iloc[2]
s_subset.index

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [251]:
# the series index equals df.columns
s_subset.index.equals(df.columns)

True

In [252]:
# often the index of the row is shared with dataframe.columns
s_subset.index is df.columns

True

In [253]:
# select partial rows and columns
df_subset = df.iloc[:2,1:]
df_subset

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [262]:
# every value in df_subset index is a value in df.index 
df_subset.index.isin(df.index).all()

True

In [263]:
# every value in df_subset columns is a value in df.columns
df_subset.columns.isin(df.columns).all()

True

In [256]:
# create a Series by applying a relational operator to an entire column
bool_series = df['Col2'] < 20
bool_series.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [257]:
# the indexes are equal
bool_series.index.equals(df.index)

True

## Boolean Series from Value Comparison

A **comparison operator** is one of:  
* <  
* <=  
* ==  
* \>  
* \>=  
* !=   
and produces True/False results.

Other operators, such as .isin(), and .isnull(), also produce True/False results.

Selecting rows through the use of a comparison operator is similar to selecting rows using the WHERE clause of a SQL query. 
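
The cells in this section use comparison operators. As a complementary sketch, .isin() and .isnull() also return boolean Series that can be used to select rows. The frame `demo`, with one missing value, is made up for this illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Col1': [0, np.nan, 20], 'Col2': [1, 11, 21]})

# .isin() is True where the value is one of the listed values
in_list = demo['Col2'].isin([1, 21])

# .isnull() is True where the value is missing
missing = demo['Col1'].isnull()

# both are boolean Series, so either one can be used to select rows
rows_in_list = demo[in_list]
rows_missing = demo[missing]
```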

In [265]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [266]:
# comparison produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [267]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [268]:
# either df[criteria] or df.loc[criteria] will work
df.loc[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [269]:
criteria.index.equals(df.index)

True

In [270]:
criteria1 = df['Col2'] > 10
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [271]:
filter_rows = criteria1 | criteria2
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [272]:
# due to & and | operator precedence, when written on one line, 
# it is necessary to use parentheses around comparisons
filter_rows = (df['Col2'] > 10) & (df['Col1'] < 20)
df[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [273]:
# boolean series constructed from columns, has the same index as the dataframe
filter_rows.index.equals(df.index)

True

## Axis Specification

For someone coming from another programming language, the axis specification may be confusing.

Fortunately, under most circumstances, Pandas will detect a problem if you incorrectly specify the axis.
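
As one example of such detection, asking to drop a column label from the row axis raises a KeyError, because the label is not found among the row labels. This is a sketch with an illustrative frame named `demo`; operations that act on the data, such as sums, are valid on either axis and so cannot be checked this way.

```python
import pandas as pd

demo = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2'],
)

# 'Col2' is a column label, so dropping it from the row axis fails
try:
    demo.drop('Col2', axis='index')
    detected = False
except KeyError:
    detected = True
```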

### Operations which Modify the Structure of a DataFrame
The axis specification is intuitive.

Consider: row,col = 0, 1

Use axis=0 or axis='index' to affect rows.  
Use axis=1 or axis='columns' to affect columns.

In [275]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [276]:
df.drop('Col2', axis='columns')

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [277]:
# same as above
df.drop('Col2', axis=1)

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [278]:
df.drop('Row2', axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [279]:
# same as above
df.drop('Row1', axis=0)

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [280]:
modify_col_names = {'Col2':'Col-Two'}
df.rename(modify_col_names, axis='columns')

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [281]:
modify_row_names = {'Row2':'Row-Two'}
df.rename(modify_row_names, axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [282]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ np.nan, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row2,,11,12
Row3,20.0,21,22


In [283]:
# drop all columns containing any nans
df.dropna(axis='columns')

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12
Row3,21,22


In [284]:
# drop all rows containing any nans
df.dropna(axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row3,20.0,21,22


### Operations which act on the Data in the DataFrame

The axis specification may seem confusing at first.

#### Terminology used by Other Programming Languages and SQL   
A **column operation** operates on a column of data.  The column label is kept in the result to identify which columns of data were operated on.

A **row operation** operates on a row of data. The row label is kept in the result to identify which rows of data were operated on.

#### Terminology used by Numpy and Pandas
In both Numpy and Pandas, the row axis is axis=0, and the column axis is axis=1.

However:
* to specify a column operation, use axis=0 (or axis='index').
* to specify a row operation, use axis=1 (or axis='columns').

One way to remember this is that the axis that disappears is the specified axis.

For a column operation, the row axis (axis=0) disappears, so specify axis=0.

For a row operation, the column axis (axis=1) disappears, so specify axis=1.
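
The "axis that disappears" rule can be checked on a plain Numpy array by looking at the shape of the result. This is a minimal sketch using a 2 by 3 array so that the two axes are easy to tell apart.

```python
import numpy as np

values = np.array([[0, 1, 2],
                   [10, 11, 12]])   # shape (2, 3): 2 rows, 3 columns

# axis=0 (the row axis) disappears: one result per column
per_column = values.sum(axis=0)

# axis=1 (the column axis) disappears: one result per row
per_row = values.sum(axis=1)
```

The result of `values.sum(axis=0)` has shape (3,), one entry per column, and the result of `values.sum(axis=1)` has shape (2,), one entry per row.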

In [285]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [287]:
# column operation
# column labels are kept, so this is a column operation
df.sum(axis='index')

Col1    30
Col2    33
Col3    36
dtype: int64

In [288]:
# same as above
# column labels are kept, so this is a column operation
df.sum(axis=0)

Col1    30
Col2    33
Col3    36
dtype: int64

In [289]:
# row operation
# row labels are kept, so this is a row operation
df.sum(axis='columns')

Row1     3
Row2    33
Row3    63
dtype: int64

In [290]:
# same as above
# row labels are kept, so this is a row operation
df.sum(axis=1)

Row1     3
Row2    33
Row3    63
dtype: int64

In [None]:
df

In [None]:
# find max for each column
# column labels are kept, so this is a column operation
df.max(axis=0)

The above is a column operation as the column labels are in the index.

In [None]:
# perform the "row operation" of finding the minimum "across the columns" of each row
df.min(axis=1)

The above is a row operation as the row labels are in the index.