# Axis Specification in Pandas
For math operations, specifying the axis works the same as in numpy.

In [1]:
import pandas as pd
print("Pandas Version {}".format(pd.__version__))
import numpy as np
print("Numpy Version {}".format(np.__version__))

Pandas Version 0.22.0
Numpy Version 1.14.1


### Definition of Row and Column
As in an Excel spreadsheet or a database table:
* columns are the attributes and run vertically
* rows are the records (or observations) and run horizontally.

### My "Row Operation" Definition
Applying sum as a "row operation" means producing a sum for each row of data.  If the rows have identifiers, the identifier for each sum is the corresponding row name.

### My "Column Operation" Definition
Applying sum as a "column operation" means producing a sum for each column of data.  If the columns have identifiers, the identifier for each sum is the corresponding column name.

### Caution: Pandas and Numpy Users define Row and Column Operation Differently
My definitions are consistent with how these terms are used in Excel or a database table.  However, they are inconsistent with how these terms are used in Numpy and Pandas!

In Numpy and Pandas, if you have 4 columns and you sum over each column producing 4 results, each result having a column name, you have summed *across* or *along* the rows and you have therefore performed a "row operation".

To communicate with people from different backgrounds, I find it best to simply say "axis=0" or "axis=1".  The Pandas DataFrame API is well defined and everyone agrees what happens when you specify the axis value for a mathematical operation such as sum().

This should become clearer with the examples below.

In [2]:
X = np.arange(12).reshape((3,4))
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [3]:
# apply the sum operation to each column of data
# this is a numpy operation *not* a Pandas operation!
print(type(X))
X.sum(axis=0)

<class 'numpy.ndarray'>


array([12, 15, 18, 21])

The result has 4 values, as there are 4 columns of data on which the sum operated.

<pre>
12 = 0 + 4 +  8 = Sum of Column 0  
15 = 1 + 5 +  9 = Sum of Column 1  
18 = 2 + 6 + 10 = Sum of Column 2   
21 = 3 + 7 + 11 = Sum of Column 3  
</pre>

Again, I call this a column operation.  It is specified with axis=0.

In [4]:
# apply the sum operation to each row of data
# this is a numpy operation *not* a Pandas operation!
X.sum(axis=1)

array([ 6, 22, 38])

The result has 3 values, as there are 3 rows of data on which the sum operated.

<pre>
 6 = 0 + 1 +  2 +  3 = Sum of Row 0  
22 = 4 + 5 +  6 +  7 = Sum of Row 1  
38 = 8 + 9 + 10 + 11 = Sum of Row 2   
</pre>

Again, I call this a row operation.  It is specified with axis=1.
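
One way to reconcile the two vocabularies, offered as a sketch rather than as official numpy terminology: the axis you name is the dimension of `shape` that the reduction removes. The array `X` is the same one built above; the names `col_sums` and `row_sums` are only for illustration.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
print(X.shape)               # (3, 4): 3 rows, 4 columns

# axis=0 collapses dimension 0 (the 3 rows), leaving one value per column
col_sums = X.sum(axis=0)
print(col_sums.shape)        # (4,)

# axis=1 collapses dimension 1 (the 4 columns), leaving one value per row
row_sums = X.sum(axis=1)
print(row_sums.shape)        # (3,)

# keepdims=True keeps the collapsed dimension with length 1
print(X.sum(axis=0, keepdims=True).shape)   # (1, 4)
```

Read this way, "summing along the rows" (axis=0) and "a result for each column" describe the same operation from two directions.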

In [5]:
df = pd.DataFrame(data=X,
                  columns="c1 c2 c3 c4".split(),
                  index="r1 r2 r3".split())
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


    c1  c2  c3  c4
r1   0   1   2   3
r2   4   5   6   7
r3   8   9  10  11


In [6]:
# sum each column of data
# this is a Pandas DataFrame operation
# the axis specification is the same as in numpy
df.sum(axis=0)

c1    12
c2    15
c3    18
c4    21
dtype: int64

The result has 4 values, as there are 4 columns on which the sum operated.

The identifiers for each value are the column names.

In [7]:
# sum each row of data
# this is a Pandas DataFrame operation
# the axis specification is the same as in numpy
df.sum(axis=1)

r1     6
r2    22
r3    38
dtype: int64

The result has 3 values, as there are 3 rows that the sum operation acted on.

The identifiers for each value are the row names.

In [8]:
# verify again that axis works the same way, for sum()
# sum each column of data and compare numpy arrays
X.sum(axis=0) == df.sum(axis=0).values

array([ True,  True,  True,  True])

In [9]:
# verify again that axis works the same way, for sum()
# sum each row of data and compare numpy arrays
X.sum(axis=1) == df.sum(axis=1).values

array([ True,  True,  True])
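
The same check can be repeated for a few other reductions. This is a minimal sketch that assumes the `X` and `df` defined earlier; the choice of operations is mine, and `std`/`var` are left out deliberately because numpy and Pandas use different default `ddof` values, which is unrelated to the axis question.

```python
import numpy as np
import pandas as pd

X = np.arange(12).reshape((3, 4))
df = pd.DataFrame(data=X,
                  columns="c1 c2 c3 c4".split(),
                  index="r1 r2 r3".split())

# compare numpy and Pandas for each operation and each axis
results = {}
for name in ["sum", "mean", "min", "max"]:
    for axis in (0, 1):
        numpy_result = getattr(X, name)(axis=axis)
        pandas_result = getattr(df, name)(axis=axis).values
        results[(name, axis)] = np.allclose(numpy_result, pandas_result)

# True only if every operation agreed on both axes
print(all(results.values()))
```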

### Pandas is not consistent about the use of axis specification with 0 and 1

Three operations which use an axis specification inconsistent with the above are:
* DataFrame.dropna()
* DataFrame.drop()
* pd.concat()

In [10]:
# to drop a column, use axis=1 !!
df.drop('c2', axis=1)

    c1  c3  c4
r1   0   2   3
r2   4   6   7
r3   8  10  11


In [11]:
# to drop a row, use axis=0 !!
df.drop('r2', axis=0)

    c1  c2  c3  c4
r1   0   1   2   3
r3   8   9  10  11
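
The other two operations in the list, `dropna()` and `pd.concat()`, follow the same pattern as `drop()`. The sketch below is illustrative only: the frame `df_nan` and the position of the missing value are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  columns="c1 c2 c3 c4".split(),
                  index="r1 r2 r3".split())

# hypothetical frame with one missing value at row r2, column c2
df_nan = df.astype(float)
df_nan.loc["r2", "c2"] = np.nan

# axis=0 drops rows that contain a missing value (row r2 goes away)
rows_dropped = df_nan.dropna(axis=0)
print(rows_dropped.shape)      # (2, 4)

# axis=1 drops columns that contain a missing value (column c2 goes away)
cols_dropped = df_nan.dropna(axis=1)
print(cols_dropped.shape)      # (3, 3)

# axis=0 stacks the frames vertically, adding rows
stacked = pd.concat([df, df], axis=0)
print(stacked.shape)           # (6, 4)

# axis=1 places the frames side by side, adding columns
side_by_side = pd.concat([df, df], axis=1)
print(side_by_side.shape)      # (3, 8)
```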


### How to remember which is which
For numpy-like operations such as sum(), axis works the same as in numpy.  Axis=0 means to apply the operation to each column of data.  The number of results is the number of columns.  If there is an identifier, the identifier is the name of each column.  Likewise, axis=1 means to apply the operation to each row of data, and each result is identified by its row name.

For DataFrame structure modification operations, that is, operations which modify the number of rows or columns, axis works the opposite way.  Axis=0 means to modify the number of rows and axis=1 means to modify the number of columns.
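
A possible memory aid, which is my framing and not Pandas terminology, is to watch `df.shape`: in both kinds of operation, the axis number is the position in the shape tuple that gets changed. The sketch assumes the same `df` as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  columns="c1 c2 c3 c4".split(),
                  index="r1 r2 r3".split())
print(df.shape)                         # (3, 4)

# axis=0: position 0 of the shape (the row count) is what changes
print(df.sum(axis=0).shape)             # (4,)   rows collapsed away
print(df.drop('r2', axis=0).shape)      # (2, 4) one row removed

# axis=1: position 1 of the shape (the column count) is what changes
print(df.sum(axis=1).shape)             # (3,)   columns collapsed away
print(df.drop('c2', axis=1).shape)      # (3, 3) one column removed
```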

### Pandas DataFrame Alternative Axis Specification

In [12]:
# sum over each row of data
df.sum(axis='columns')

r1     6
r2    22
r3    38
dtype: int64

There are 3 rows of data, so there are 3 results.  The identifier for each sum is the row name of each row. 

However, in the above, we specified axis='columns' in order to get a result for each row. Or, as a numpy or Pandas person would say, we sum "across" or "along" the columns to produce results for each row.

I find this terminology counter-intuitive and prefer to use axis=1.

In [13]:
df.sum(axis='index')

c1    12
c2    15
c3    18
c4    21
dtype: int64

There are 4 columns of data, so there are 4 results.  The identifier for each sum is the column name of each column.

However, in the above, we specified axis='index' in order to get a result for each column. Or, as a numpy or Pandas person would say, we sum "across" or "along" the index to produce results for each column.

I find this terminology counter-intuitive and prefer to use axis=0.
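
To confirm that the string names are only aliases for the integers, the results can be compared directly. This is a small sketch assuming the same `df` as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  columns="c1 c2 c3 c4".split(),
                  index="r1 r2 r3".split())

# 'index' is an alias for 0, 'columns' is an alias for 1
same_for_index = df.sum(axis='index').equals(df.sum(axis=0))
same_for_columns = df.sum(axis='columns').equals(df.sum(axis=1))
print(same_for_index, same_for_columns)

# drop() accepts the same string names; here the name reads naturally
dropped = df.drop('c2', axis='columns')
print(list(dropped.columns))   # ['c1', 'c3', 'c4']
```

Note that the string form reads more naturally for `drop()` than for `sum()`, which mirrors the inconsistency described earlier.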