# Additional Pandas DataFrame Topics

## In-Place Modification vs Return Copy of Modified Value

By default, most Pandas operations do not modify the DataFrame or Series they operate on.  Instead, they return a modified copy of the object.

Some Pandas operations accept the keyword argument 'inplace'.  If it is set to True, the underlying DataFrame or Series is modified directly and the operation returns None.

There are several reasons to avoid inplace=True, including:
* immutable objects are easier to write correct code for
* immutable objects are easier to parallelize
* immutable objects are better for method chaining
* inplace=True often does not reduce memory use, because many operations still build a copy internally

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Pure Python Example of Return Modified Value
x = [3, 2, 1]
y = sorted(x)
print(y == x)
print(y)

False
[1, 2, 3]


In [3]:
# Pure Python example of In-Place Operation
x = [3, 2, 1]
y = x.sort()
print(y)
print(x)

None
[1, 2, 3]


In [4]:
# convenience method to avoid throwing errors in the following Jupyter Cells
def df_equals(df1, df2):
    try:
        return df1.equals(df2)
    except KeyError:
        return False

In [5]:
# DataFrame example of Return Modified Value: df.drop()
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)

df2 = df.drop('Col1', axis = 'columns')
display(df2)
display(df)

Unnamed: 0,Col2,Col3
0,1,2
1,11,12
2,21,22


Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [6]:
# DataFrame example of inplace operation, df.drop(inplace=True)
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=data, columns=columns)

df2 = df.drop('Col1', axis = 'columns', inplace=True)
print(df2)
display(df)

None


Unnamed: 0,Col2,Col3
0,1,2
1,11,12
2,21,22


### Method Chaining Requires Copy of Modified Object to be Returned

Some programming languages allow data to be piped from one operation to the next to form a data pipeline.  Python has no pipe syntax for this, but method chaining achieves nearly the same thing.

Long method chains are easier to read if each operation is placed on a separate line.  To tell Python that the statement extends across multiple lines, wrap the entire statement in parentheses.

The convention in Python and Pandas is that an in-place operation returns None and a non-in-place operation returns a modified copy of the data.
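The following sketch illustrates why this convention matters for chaining. It uses a small made-up DataFrame (the variable names are only for illustration): the non-in-place chain works, while the in-place call returns None, so the next method in the chain has nothing to operate on.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [0, 10, 20], 'Col2': [1, 11, 21]})

# Non-in-place: each step returns a new DataFrame, so the chain works
chained = df.drop('Col1', axis='columns').reset_index(drop=True)

# In-place: drop() returns None, so the next call in the chain fails
df_inplace = df.copy()
try:
    df_inplace.drop('Col1', axis='columns', inplace=True).reset_index(drop=True)
    chain_error = None
except AttributeError as err:
    chain_error = err

print(list(chained.columns))          # columns left after the chained drop
print(type(chain_error).__name__)     # the failure raised by calling a method on None
print(list(df_inplace.columns))       # the in-place drop still happened before the error
```

Note that the in-place drop has already modified df_inplace by the time the error is raised, which is one reason mixing in-place calls into a chain is easy to get wrong.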

In [7]:
# Example of Method Chaining
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
df = pd.DataFrame(data=data, columns=columns, index=index)

print('Original Data')
display(df)

print('Method Chained Result')
(df.assign(Sum=df['Col2']+df['Col3'])
   .drop(['Col1'], axis='columns')
   .reset_index(drop=True))

Original Data


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


Method Chained Result


Unnamed: 0,Col2,Col3,Sum
0,1,2,3
1,11,12,23
2,21,22,43
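User-defined functions can also take part in a chain through DataFrame.pipe(), which passes the DataFrame as the first argument to the given function. The sketch below assumes a hypothetical helper, add_row_sum, that is not part of Pandas; it is only meant to show the shape of such a chain.

```python
import pandas as pd

def add_row_sum(frame, name='Sum'):
    """Hypothetical helper: return a copy of frame with a row-sum column."""
    return frame.assign(**{name: frame.sum(axis='columns')})

df = pd.DataFrame({'Col1': [0, 10, 20],
                   'Col2': [1, 11, 21],
                   'Col3': [2, 12, 22]},
                  index=['Row1', 'Row2', 'Row3'])

# pipe() calls add_row_sum(<current DataFrame>, name='Sum') and returns its result
result = (df.drop(['Col1'], axis='columns')
            .pipe(add_row_sum, name='Sum')
            .reset_index(drop=True))

print(result)
```

Because add_row_sum returns a new DataFrame instead of modifying its argument, the original df is left unchanged.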


## .index of DataFrame/Series Created from DataFrame

A Series created from a DataFrame:
* has an index that is (a subset of) df.index, if it was created by selecting a column or by an element-wise operation on a column
* has an index that is (a subset of) df.columns, if it was created by selecting a row

Aggregations such as df.sum() collapse one axis and label the result with the other.  For a definition of "column operation" and "row operation" in that context, see below: [Axis Specification](#Axis)

In [8]:
# select an entire column
s = df['Col1']
s.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [9]:
# show that the series has the same index as the df it was selected from
s.index.equals(df.index)

True

In [10]:
# often the index of the column is shared with dataframe.index
s.index is df.index

True

In [11]:
# select an entire row
s = df.iloc[2]
s.index

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [12]:
# show that the series has the same index as the df.columns it was selected from
s.index.equals(df.columns)

True

In [13]:
# often the index of the row is shared with dataframe.columns
s.index is df.columns

True

In [14]:
# select partial rows and columns
df_subset = df.iloc[:2,1:]
df_subset

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


In [15]:
# every value in df_subset index is a value in df.index 
df_subset.index.isin(df.index).all()

True

In [16]:
# same using set notation
set(df_subset.index).issubset(df.index)

True

In [17]:
# every value in df_subset columns is a value in df.columns
df_subset.columns.isin(df.columns).all()

True

In [18]:
# same using set notation
set(df_subset.columns).issubset(df.columns)

True

In [19]:
# create a Series by applying a comparison operator to an entire column
bool_series = df['Col2'] < 20
bool_series.index

Index(['Row1', 'Row2', 'Row3'], dtype='object')

In [20]:
# the boolean series has the same index as the df it was compared against
bool_series.index.equals(df.index)

True

## Boolean Series from Value Comparison

A **comparison operator** is one of:  
* <  
* <=  
* ==  
* \>  
* \>=  
* !=   
and produces True/False results.

Some methods, such as .isin() and .isnull(), also produce True/False results.
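As a minimal sketch of these two methods, the example below uses a small made-up DataFrame containing one missing value. Each call returns a boolean Series, and the two can be combined like any other criteria (~ negates a boolean Series).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [0.0, np.nan, 20.0], 'Col2': [1, 11, 21]})

# True where the value is one of the listed values
in_list = df['Col2'].isin([1, 21])

# True where the value is missing (NaN)
is_missing = df['Col1'].isnull()

print(in_list.tolist())
print(is_missing.tolist())

# keep rows whose Col2 is in the list and whose Col1 is not missing
selected = df.loc[in_list & ~is_missing]
print(selected)
```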

Selecting rows through the use of a comparison operator is similar to selecting rows using the WHERE clause of a SQL query. 

In [21]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [22]:
# comparison produces True/False for each value
boolean_series = df['Col1'] < df['Col2']
boolean_series

0    True
1    True
2    True
dtype: bool

In [23]:
# select rows based on boolean_series
criteria = df['Col1'] < df['Col2']
df[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [24]:
# either df[criteria] or df.loc[criteria] will work
# df.loc[criteria] is clearer as it shows that rows are being selected
df.loc[criteria]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [25]:
criteria.index.equals(df.index)

True

In [26]:
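# combine two criteria with & (element-wise AND)
criteria1 = df['Col2'] > 10
criteria2 = df['Col1'] < 20
filter_rows = criteria1 & criteria2
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [27]:
# combine the same two criteria with | (element-wise OR)
filter_rows = criteria1 | criteria2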
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [28]:
# because & and | bind more tightly than the comparison operators,
# each comparison must be wrapped in parentheses when written on one line
filter_rows = (df['Col2'] > 10) & (df['Col1'] < 20)
df.loc[filter_rows]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12
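To see why the parentheses are needed, the sketch below (using a small made-up DataFrame) tries the same filter without them. Python evaluates 10 & df['Col1'] first and then treats the rest as a chained comparison, which requires converting a Series to a single True/False value; Pandas refuses to do that and raises a ValueError.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [0, 10, 20], 'Col2': [1, 11, 21]})

# Without parentheses this parses as: df['Col2'] > (10 & df['Col1']) < 20
try:
    bad = df['Col2'] > 10 & df['Col1'] < 20
    error_name = None
except ValueError as err:
    error_name = type(err).__name__

# With parentheses each comparison is evaluated before & combines them
good = (df['Col2'] > 10) & (df['Col1'] < 20)

print(error_name)
print(good.tolist())
```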


In [29]:
# a boolean series constructed from columns has the same index as the dataframe
filter_rows.index.equals(df.index)

True

<a name="Axis"></a>

## Axis Specification

For someone coming from another programming language, the axis specification may be confusing.

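In many cases, such as dropping a label that does not exist on the given axis, Pandas raises an error if you specify the axis incorrectly.  Aggregations such as sum() accept either axis, so an incorrect axis there produces a different result instead of an error.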
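The sketch below, using a small made-up DataFrame, shows one case where a wrong axis is reported and one where it is not. Dropping a column label along the row axis raises a KeyError, while sum() works on either axis and simply returns a differently labeled result.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 11, 12]],
                  columns=['Col1', 'Col2', 'Col3'],
                  index=['Row1', 'Row2'])

# 'Col2' is a column label, so looking for it among the row labels fails
try:
    df.drop('Col2', axis='index')
    drop_error = None
except KeyError as err:
    drop_error = err

# sum() is valid on both axes, so a wrong axis is not detected
per_column = df.sum(axis='index')
per_row = df.sum(axis='columns')

print(type(drop_error).__name__)
print(per_column.index.tolist())
print(per_row.index.tolist())
```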

### Operations which Modify the Shape of a DataFrame
The axis specification is intuitive.

row, col = 0, 1

Use axis=0 or axis='index' to affect rows.  
Use axis=1 or axis='columns' to affect columns.

In [30]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [31]:
df.drop('Col2', axis='columns')

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [32]:
# same as above
df.drop('Col2', axis=1)

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [33]:
df.drop('Row2', axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


In [34]:
# same kind of operation using axis=0, this time dropping 'Row1'
df.drop('Row1', axis=0)

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


In [35]:
modify_col_names = {'Col2':'Col-Two'}
df.rename(modify_col_names, axis='columns')

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [36]:
modify_row_names = {'Row2':'Row-Two'}
df.rename(modify_row_names, axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [37]:
# modify both in a single step by telling rename which are the rows and which are the columns
df.rename(columns=modify_col_names, index=modify_row_names)

Unnamed: 0,Col1,Col-Two,Col3
Row1,0,1,2
Row-Two,10,11,12
Row3,20,21,22


In [38]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ np.nan, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row2,,11,12
Row3,20.0,21,22


In [39]:
# drop all columns containing any NaNs
df.dropna(axis='columns')

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12
Row3,21,22


In [40]:
# drop all rows containing any NaNs
df.dropna(axis='index')

Unnamed: 0,Col1,Col2,Col3
Row1,0.0,1,2
Row3,20.0,21,22


### Operations which act on the Data in the DataFrame

The axis specification may seem confusing at first.

#### Terminology used by Other Programming Languages and SQL   
A **column operation** operates on columns of data.  The column labels are kept in the result to identify which columns were operated on.

A **row operation** operates on rows of data. The row labels are kept in the result to identify which rows were operated on.

#### Terminology used by Numpy and Pandas
In both Numpy and Pandas, the row axis is axis=0, and the column axis is axis=1.

However:
* to specify a column data operation, use axis=0 (or axis='index').
* to specify a row data operation, use axis=1 (or axis='columns').

One way to remember this: the specified axis is the one that disappears (is collapsed).

For a column operation, the row axis disappears, so specify axis=0.

For a row operation, the column axis disappears, so specify axis=1.
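The collapsing rule is easier to see with data that is not square. The sketch below assumes a made-up array with 2 rows and 3 columns and checks the shape and labels that remain after summing along each axis.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns
df = pd.DataFrame(arr, index=['Row1', 'Row2'],
                  columns=['Col1', 'Col2', 'Col3'])

# axis=0 collapses the rows: one value per column remains
col_sums = arr.sum(axis=0)
# axis=1 collapses the columns: one value per row remains
row_sums = arr.sum(axis=1)

print(arr.shape, col_sums.shape, row_sums.shape)

# Pandas follows the same rule and labels the result with the surviving axis
print(df.sum(axis=0).index.tolist())
print(df.sum(axis=1).index.tolist())
```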

In [41]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [42]:
# column operation (result has column labels)
df.sum(axis='index')

Col1    30
Col2    33
Col3    36
dtype: int64

In [43]:
# same as above
df.sum(axis=0)

Col1    30
Col2    33
Col3    36
dtype: int64

In [44]:
# row operation (result has row labels)
df.sum(axis='columns')

Row1     3
Row2    33
Row3    63
dtype: int64

In [45]:
# same as above
df.sum(axis=1)

Row1     3
Row2    33
Row3    63
dtype: int64

In [46]:
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


In [47]:
# find max per column (result has column labels)
df.max(axis=0)

Col1    20
Col2    21
Col3    22
dtype: int64

In [48]:
# find min per row (result has row labels)
df.min(axis=1)

Row1     0
Row2    10
Row3    20
dtype: int64