# Axis Specification in Pandas and Numpy
For math operations, specifying the axis works the same in Pandas as in Numpy.

In [1]:
import pandas as pd
print("Pandas Version {}".format(pd.__version__))
import numpy as np
print("Numpy Version {}".format(np.__version__))

Pandas Version 0.22.0
Numpy Version 1.14.2


### Definition of "Row" and "Column"
As in an Excel spreadsheet or a database table:
* columns are the attributes and run vertically
* rows are the records (or observations) and run horizontally.

### Definition of "Column Operation"
Applying sum as a "column operation" means applying the sum operator to each column of data. If the columns have names, each column sum is identified by its column name.

### Definition of "Row Operation"
Applying sum as a "row operation" means applying the sum operator to each row of data. If the rows have names, each row sum is identified by its row name.

### Caution: Numpy and Pandas Terminology Differs from Other Languages

In Numpy and Pandas, if you have 4 columns and you sum over each column, producing 4 results that each carry a column name, you have summed *across* or *along* the rows. You therefore provide the axis specification for *rows*, which is axis=0.

To communicate with people from different backgrounds, I find it best to simply say "axis=0" or "axis=1". The Pandas and Numpy APIs are well defined, and everyone agrees on what happens when you specify the axis value for an operation such as sum().
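One way to keep the two axis values straight is to look at array shapes. The sketch below is only an illustration, using the same 3x4 array as the examples that follow: summing over axis=k removes dimension k from the shape, so axis=0 collapses the rows and leaves one value per column.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))   # shape (3, 4): 3 rows, 4 columns

col_sums = X.sum(axis=0)   # collapses dimension 0 (the rows)
row_sums = X.sum(axis=1)   # collapses dimension 1 (the columns)

print(X.shape)          # (3, 4)
print(col_sums.shape)   # (4,)  one result per column
print(row_sums.shape)   # (3,)  one result per row
```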

This should become clearer with the examples below.

In [2]:
X = np.arange(12).reshape((3,4))
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [3]:
df = pd.DataFrame(data=X,
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


      col1  col2  col3  col4
row1     0     1     2     3
row2     4     5     6     7
row3     8     9    10    11


In [4]:
# column names
df.columns

Index(['col1', 'col2', 'col3', 'col4'], dtype='object')

In [5]:
# row names
df.index

Index(['row1', 'row2', 'row3'], dtype='object')

In [6]:
df

      col1  col2  col3  col4
row1     0     1     2     3
row2     4     5     6     7
row3     8     9    10    11


In [7]:
# compute sum for each column
df.sum(axis=0)

col1    12
col2    15
col3    18
col4    21
dtype: int64

There are 4 columns, so there are 4 column sums. Each sum is labeled with its column name.

<pre>
col1: 12 = 0 + 4 +  8 
col2: 15 = 1 + 5 +  9
col3: 18 = 2 + 6 + 10
col4: 21 = 3 + 7 + 11
</pre>

In other programming languages, this is called a "column operation".

However, in Numpy and Pandas, this is called "summing across the rows". As such, you give it the axis specification for "rows".

Whether you call it a "column operation" or "summing across the rows", the axis specification is axis=0.

In [8]:
df

      col1  col2  col3  col4
row1     0     1     2     3
row2     4     5     6     7
row3     8     9    10    11


In [9]:
# compute sum for each row
df.sum(axis=1)

row1     6
row2    22
row3    38
dtype: int64

There are 3 rows, so there are 3 row sums. Each sum is labeled with its row name.

<pre>
row1:  6 = 0 + 1 +  2 +  3 
row2: 22 = 4 + 5 +  6 +  7  
row3: 38 = 8 + 9 + 10 + 11   
</pre>

In other programming languages, this is called a "row operation".

However, in Numpy and Pandas, this is called "summing across the columns". As such, you give it the axis specification for "columns".

Whether you call it a "row operation" or "summing across the columns", the axis specification is axis=1.
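The opening claim, that the axis works the same way in Numpy, can be checked directly. The following is a minimal sketch that rebuilds the same array and DataFrame and compares the Pandas sums against the Numpy sums for both axis values. It uses `.values` to get the underlying array so that it also runs on older Pandas releases.

```python
import numpy as np
import pandas as pd

X = np.arange(12).reshape((3, 4))
df = pd.DataFrame(data=X,
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())

# axis=0: one sum per column, in both libraries
same_axis0 = np.array_equal(df.sum(axis=0).values, X.sum(axis=0))

# axis=1: one sum per row, in both libraries
same_axis1 = np.array_equal(df.sum(axis=1).values, X.sum(axis=1))

print(same_axis0, same_axis1)   # True True
```

The only difference is the labeling: Pandas returns a Series indexed by the column or row names, while Numpy returns a plain array.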

In [10]:
# you can also use a string specification
df.sum(axis='columns')

row1     6
row2    22
row3    38
dtype: int64

Here we see that using the string 'columns' applies the sum function to each row, producing one result per row, identified by the row name.

In [11]:
# alternative string specification
df.sum(axis='index')

col1    12
col2    15
col3    18
col4    21
dtype: int64

Here we see that using the string 'index' applies the sum function to each column, producing one result per column, identified by the column name.
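As a quick check, the string and integer forms can be compared with `Series.equals`. This sketch rebuilds the same DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(12).reshape((3, 4)),
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())

# 'index' is the string form of axis=0; 'columns' is the string form of axis=1
index_matches_0 = df.sum(axis='index').equals(df.sum(axis=0))
columns_matches_1 = df.sum(axis='columns').equals(df.sum(axis=1))

print(index_matches_0, columns_matches_1)   # True True
```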

## Summary: Aggregate Operations on Columns and Rows
I remember this as:
* axis=0 means to operate on a column of data
* axis=1 means to operate on a row of data
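The same rule applies to other aggregate operations, not only sum(). As an illustration, assuming the same DataFrame as above, here are mean() over each column and max() over each row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(12).reshape((3, 4)),
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())

col_means = df.mean(axis=0)   # one mean per column: 4.0, 5.0, 6.0, 7.0
row_max = df.max(axis=1)      # one max per row: 3, 7, 11

print(col_means)
print(row_max)
```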

### Adding/Removing Columns and Rows
If you are modifying the *shape* of the DataFrame, the axis specification is reversed relative to the aggregate case: here the axis names the kind of label being added or removed. For example, to remove a column, you specify axis=1, because column labels live on axis 1.

In [12]:
# to drop a column, use axis=1
df.drop('col2', axis=1)

      col1  col3  col4
row1     0     2     3
row2     4     6     7
row3     8    10    11


In [13]:
# to drop a row, use axis=0
df.drop('row2', axis=0)

      col1  col2  col3  col4
row1     0     1     2     3
row3     8     9    10    11
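If the axis number is hard to remember when dropping, `DataFrame.drop` also accepts the `columns=` and `index=` keywords (available since Pandas 0.21), which avoid the axis argument entirely. A minimal sketch, rebuilding the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(12).reshape((3, 4)),
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())

no_col2 = df.drop(columns='col2')   # same result as df.drop('col2', axis=1)
no_row2 = df.drop(index='row2')     # same result as df.drop('row2', axis=0)

print(no_col2.equals(df.drop('col2', axis=1)))   # True
print(no_row2.equals(df.drop('row2', axis=0)))   # True
```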


In [14]:
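# build a one-row DataFrame, labeled 'row4', to try adding a row
# note: s1 is defined here but is not used in the cells that follow
s1 = pd.Series(data=[12, 13, 14, 15], index=['col1', 'col2', 'col3', 'col4'])
df2 = pd.DataFrame(data={'col1': 12, 'col2': 12, 'col3': 13, 'col4': 14}, index=['row4'])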
df2

      col1  col2  col3  col4
row4    12    12    13    14


In [15]:
# to add a row, just like to drop a row, use axis=0
pd.concat([df, df2], axis=0)

      col1  col2  col3  col4
row1     0     1     2     3
row2     4     5     6     7
row3     8     9    10    11
row4    12    12    13    14
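Adding a column follows the same pattern with axis=1. The sketch below is an illustration only: the column name `col5` and its values are made up for the example and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(12).reshape((3, 4)),
                  columns="col1 col2 col3 col4".split(),
                  index="row1 row2 row3".split())

# hypothetical new column, sharing the row names of df
new_col = pd.DataFrame({'col5': [100, 200, 300]}, index=df.index)

# to add a column, just like to drop a column, use axis=1
wider = pd.concat([df, new_col], axis=1)

print(wider.shape)           # (3, 5)
print(list(wider.columns))   # ['col1', 'col2', 'col3', 'col4', 'col5']
```

Because both frames share the same row names, concat aligns them row by row and the result keeps 3 rows.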


## Summary: Adding/Removing Columns and Rows
For adding or removing columns and rows, I remember this as:
* axis=1 means to add/drop a column
* axis=0 means to add/drop a row