# The pandas Series and DataFrame

pandas provides a comprehensive set of data structures for working with and manipulating data and performing various statistical and financial analyses. The two primary data structures in pandas are Series and DataFrame. In this chapter, we will examine the Series object and how it extends a NumPy ndarray to provide operations such as indexed data retrieval, axis labeling, and automatic alignment. Then, we will move on to examine how DataFrame extends the capabilities of Series to use columnar/tabular data, which can be of more than one data type.

This chapter is intended both as a refresher for readers with basic familiarity with pandas and as an introduction for readers who are new to it, giving them a good foundation in the two data structures before we move into more finance-related subjects in later chapters. We will not cover every detail of Series and DataFrame but will focus on the core functionality used later in this book for financial analysis. For extensive coverage of Series and DataFrame, please refer to the GSF Python for finance course.

Specifically, this chapter will cover the following topics:

- An overview of the Series and DataFrame objects
- Creating and accessing elements of a Series
- Determining the shapes and counts of items in a Series
- Alignment of items in a Series via index labels
- Creating a DataFrame
- Loading example financial data to demonstrate the DataFrame
- Selecting rows of a DataFrame through several concepts using its index
- Boolean selection of rows of a DataFrame using logical expressions
- Performing arithmetic on a DataFrame
- Reindexing the Series and DataFrame objects

## The main pandas data structures – Series and DataFrame

pandas provides several classes for manipulating data. Of those, we are interested in Series and, even more so, in DataFrame.

### The Series

The Series is the primary building block of pandas and represents a one-dimensional labeled array based on the NumPy ndarray. The Series extends the functionality of the NumPy ndarray by adding an associated set of labels that are used to index the elements of the array. A Series can hold zero or more instances of any single data type.

This labeled index makes accessing the elements of a Series significantly more powerful than accessing those of a NumPy array. Instead of simply accessing elements by position, a Series allows access to items through the associated index labels. The index also supports a feature of pandas referred to as alignment, where operations between two Series are applied to values with identical labels.

### The DataFrame

The Series is the basis for data representation and manipulation in pandas, but since it can only associate a single value with any given index label, it ends up having limited ability to model multiple variables of data at each index label. The pandas DataFrame solves this by providing the ability to seamlessly manage multiple Series, where each of the Series represents a column of the DataFrame and also by automatically aligning values in each column along the index labels of the DataFrame.

In a sense, a DataFrame can be thought of as a dictionary-like container of one or more Series objects, or as a spreadsheet. For those new to pandas, probably the best comparison is to a relational database table. But even that comparison is limiting, as a DataFrame has very distinct qualities (such as automatic alignment of Series data by index labels) that make it much more capable of exploratory data analysis than either a spreadsheet or a relational database table.

A good way to think about a DataFrame is that it unifies two or more Series into a single data structure. Each Series then represents a named column of the DataFrame, and instead of each column having its own index, the DataFrame provides a single index and the data in all columns is aligned to the master index of the DataFrame. Each index label then references a slice of data across all of the Series at the label, forming what is essentially a record of information associated with that particular index label.

A DataFrame also introduces the concept of an axis, which you will often see in the pandas documentation and in many of its methods. A DataFrame has two axes, horizontal and vertical. Functions from pandas can then be applied along either axis, meaning that they operate either on all the values in each row or on all the values in each column.
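As a minimal sketch of the axis idea, the following applies the same function along each axis of a small DataFrame. The frame and its labels are made up for illustration only. Passing axis=0 works down the rows (one result per column), while axis=1 works across the columns (one result per row).

```python
import pandas as pd

# hypothetical data, used only to illustrate the axis parameter
df_axis = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                       index=['r1', 'r2', 'r3'])

# axis=0: collapse the rows, giving one sum per column
col_sums = df_axis.sum(axis=0)

# axis=1: collapse the columns, giving one sum per row
row_sums = df_axis.sum(axis=1)
```

Here col_sums is indexed by the column names and row_sums by the row labels.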

To run the examples in this chapter, we first import the pandas and NumPy packages:

In [1]:
import pandas as pd   # import pandas
import numpy as np    # and NumPy

# Set some Pandas options
# pd.set_option('display.notebook_repr_html', False)
# pd.set_option('display.max_columns', 8)
# pd.set_option('display.max_rows', 8)

## The basics of the Series and DataFrame objects

### Creating a Series and accessing elements

A Series can be created by passing a scalar value, a NumPy array, or a Python dictionary/list to the constructor of the Series object. The following command creates a Series from 100 normally distributed random numbers:

In [2]:
# create a Series from a NumPy array of random values
np.random.seed(1)
s = pd.Series(np.random.randn(100))
s

0     1.624345
1    -0.611756
2    -0.528172
3    -1.072969
4     0.865408
        ...   
95    0.077340
96   -0.343854
97    0.043597
98   -0.620001
99    0.698032
Length: 100, dtype: float64
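The scalar and list forms of construction mentioned above are not shown in the example, so here is a small sketch of both. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# a scalar is repeated for every label in the supplied index
s_scalar = pd.Series(5, index=['a', 'b', 'c'])

# a Python list gets the default zero-based integer index
s_list = pd.Series([10, 20, 30])
```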

Individual elements of a Series can be retrieved using the [] operator of the Series object. The item with the index label 2 can be retrieved using the following code:

In [3]:
# select item with matching label of 2
s[2]

-0.5281717522634557

Multiple values can be retrieved using an array of label values, as shown here:

In [4]:
# select elements with labels 2, 5, and 20
s[[2, 5, 20]]

2    -0.528172
5    -2.301539
20   -1.100619
dtype: float64

A Series supports slicing using the : slice notation. With integer bounds, the slice is position-based, so the following command retrieves the elements at positions 3 through 7 (the end value is not included, just as with NumPy arrays and Python lists):

In [5]:
# slice the Series
s[3:8]

3   -1.072969
4    0.865408
5   -2.301539
6    1.744812
7   -0.761207
dtype: float64

Note that the slice did not return only the values but each element (index label and value) of the Series with the specified labels.
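Slice endpoints behave differently for positions and labels, which is easier to see on a Series with non-integer labels. The following sketch uses a made-up Series and the .iloc and .loc indexers to make the intent explicit.

```python
import pandas as pd

s_lbl = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# position-based slice: the end position (3) is excluded
by_position = s_lbl.iloc[1:3]     # elements labeled 'b' and 'c'

# label-based slice: the end label ('d') is included
by_label = s_lbl.loc['b':'d']     # elements labeled 'b', 'c', and 'd'
```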

The .head() and .tail() methods are provided by pandas to examine just the first or last few records in a Series. By default, these return the first or last five rows, respectively, but you can use the n parameter or just pass in an integer to specify the number of rows:

In [6]:
s.head()

0    1.624345
1   -0.611756
2   -0.528172
3   -1.072969
4    0.865408
dtype: float64

In [7]:
s.tail()

95    0.077340
96   -0.343854
97    0.043597
98   -0.620001
99    0.698032
dtype: float64
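As a short sketch of the n parameter described above, using an illustrative Series of ten integers:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(np.arange(10))

first_three = s_demo.head(3)    # same as s_demo.head(n=3)
last_two = s_demo.tail(n=2)
```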

A Series consists of an index and a sequence of values. The index can be retrieved using the .index property:

In [8]:
s.index

RangeIndex(start=0, stop=100, step=1)

The values in the Series can be retrieved using the .values property:

In [9]:
s.values

array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763,
       -2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038,
        1.46210794, -2.06014071, -0.3224172 , -0.38405435,  1.13376944,
       -1.09989127, -0.17242821, -0.87785842,  0.04221375,  0.58281521,
       -1.10061918,  1.14472371,  0.90159072,  0.50249434,  0.90085595,
       -0.68372786, -0.12289023, -0.93576943, -0.26788808,  0.53035547,
       -0.69166075, -0.39675353, -0.6871727 , -0.84520564, -0.67124613,
       -0.0126646 , -1.11731035,  0.2344157 ,  1.65980218,  0.74204416,
       -0.19183555, -0.88762896, -0.74715829,  1.6924546 ,  0.05080775,
       -0.63699565,  0.19091548,  2.10025514,  0.12015895,  0.61720311,
        0.30017032, -0.35224985, -1.1425182 , -0.34934272, -0.20889423,
        0.58662319,  0.83898341,  0.93110208,  0.28558733,  0.88514116,
       -0.75439794,  1.25286816,  0.51292982, -0.29809284,  0.48851815,
       -0.07557171,  1.13162939,  1.51981682,  2.18557541, -1.39

When creating a Series and not explicitly setting the index label values via the Series constructor, pandas will assign sequential integer values starting at 0. To specify non-default index labels, use the index parameter of the Series object constructor or assign them using the .index property after creation.

The following command creates a Series and sets the index labels at the time of construction:

In [10]:
# specify an index at creation time
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2

a    1
b    2
c    3
d    4
dtype: int64

A Series can be directly initialized from a Python dictionary. The keys of the dictionary are used as index labels for the Series:

In [11]:
# create a Series from a Python dict
s2 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
s2

a    1
b    2
c    3
d    4
e    5
dtype: int64

## Size, shape, uniqueness and counts of values
There are several useful ways to determine the size of a Series, as well as to find the distinct values it contains and how often each occurs.

The number of elements in a Series can be determined using the len() function:

In [12]:
# create a series
s = pd.Series([5, 0, 1, 1, 2, 3, 4, 5, 6, np.nan])
len(s)

10

This can also be determined using the .shape property, which returns a tuple containing the dimensionality of the Series. Since a Series is one-dimensional, only the length value is provided in the tuple:

In [13]:
# reports the shape, which is a tuple with len in the first value
s.shape

(10,)

The number of rows in a Series that do not have a value of NaN can be determined with the .count() method:

In [14]:
# .count() is the number of non NaN values
s.count()

9

To determine all of the unique values in a Series, pandas provides the .unique() method:

In [15]:
# all unique values
s.unique()

array([ 5.,  0.,  1.,  2.,  3.,  4.,  6., nan])

The count of each of the unique items in a Series can be obtained using .value_counts():

In [16]:
# all unique values and their counts
s.value_counts()

1.0    2
5.0    2
6.0    1
4.0    1
3.0    1
2.0    1
0.0    1
dtype: int64

This result is sorted by pandas such that the counts are descending so that the most common values are at the top, which can help with quick analysis of data.
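Note that the NaN entry does not appear in the counts above. A minimal sketch of how to include it, assuming the same ten values as the Series used in this section:

```python
import numpy as np
import pandas as pd

s_counts = pd.Series([5, 0, 1, 1, 2, 3, 4, 5, 6, np.nan])

# NaN is excluded by default; dropna=False adds it as its own entry
with_nan = s_counts.value_counts(dropna=False)

# nunique() counts the distinct non-NaN values
n_distinct = s_counts.nunique()
```

With dropna=False, the counts add up to the full length of the Series rather than to the result of .count().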

## Alignment via index labels

A fundamental difference between a NumPy ndarray and a pandas Series is the ability of a Series to automatically align data from another Series based upon label values before performing an operation. We will examine alignment using the following two Series objects:

In [17]:
s3 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s3

a    1
b    2
c    3
d    4
dtype: int64

In [18]:
s4 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
s4

d    4
c    3
b    2
a    1
dtype: int64

The following command adds the values in the two Series:

In [19]:
# add s3 and s4
s3 + s4

a    2
b    4
c    6
d    8
dtype: int64

Adding two Series objects differs from adding two arrays in that pandas first aligns the data based upon the index label values instead of simply applying the operation to elements in the same position. This is very powerful when combining data based upon labels, as there is no need to first order the data manually.

The result is very different from adding two pure NumPy ndarray objects. NumPy adds the items in identical positions of each array, resulting in different values, as shown here:

In [20]:
# see how different from adding numpy arrays
a1 = np.array([1, 2, 3, 4])
a2 = np.array([4, 3, 2, 1])
a1 + a2

array([5, 5, 5, 5])
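The two Series above share all of their labels. As a sketch of what happens when they do not, the following uses two made-up Series whose labels only partly overlap, and then the .add() method with its fill_value parameter as one way to avoid the resulting NaN values.

```python
import pandas as pd

s_left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# labels 'a' and 'd' exist on only one side, so their results are NaN
total = s_left + s_right

# .add() with fill_value treats a value missing on one side as 0
total_filled = s_left.add(s_right, fill_value=0)
```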

## Creating a DataFrame

There are several ways to create a DataFrame. Probably the most straightforward is to create it from a NumPy array. The following command creates a DataFrame from a two-dimensional NumPy array:

In [21]:
# create a DataFrame from a 2-d ndarray
pd.DataFrame(np.array([[10, 11], [20, 21]]))

    0   1
0  10  11
1  20  21


Each row of the array forms a row in the DataFrame. Since we did not specify an index, pandas creates a default zero-based integer index in the same manner as for a Series. Since we also did not specify column names, pandas names the columns with zero-based integers as well.

A DataFrame can also be initialized by passing a list of Series objects:

In [22]:
# create a DataFrame for a list of Series objects
df1 = pd.DataFrame([pd.Series(np.arange(10, 15)), 
                    pd.Series(np.arange(15, 20))])
df1

    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19


The dimensions of a DataFrame can be determined using its .shape property. A DataFrame is always two-dimensional, so the shape is a tuple whose first value is the number of rows and whose second value is the number of columns:

In [23]:
# what's the shape of this DataFrame
df1.shape  # it is two rows by 5 columns

(2, 5)

Column names can be specified at the time of creating the DataFrame using the columns parameter of the DataFrame constructor:

In [24]:
# specify column names
df = pd.DataFrame(np.array([[10, 11], [20, 21]]), 
                  columns=['a', 'b'])
df

    a   b
0  10  11
1  20  21


The names of the columns of a DataFrame can be accessed with its .columns property:

In [25]:
# what are names of the columns?
df.columns

Index(['a', 'b'], dtype='object')

The names of the columns can be changed by assigning the .columns property with a list of new names:

In [26]:
df.columns = ['c1', 'c2']
df

   c1  c2
0  10  11
1  20  21


Index labels can likewise be assigned using the index parameter of the constructor or by assigning a list directly to the .index property:

In [27]:
# create a DataFrame with named columns and rows
df = pd.DataFrame(np.array([[0, 1], [2, 3]]), 
                  columns=['c1', 'c2'], 
                  index=['r1', 'r2'])
df

    c1  c2
r1   0   1
r2   2   3


Like the Series, the index of a DataFrame can be accessed with its .index property:

In [28]:
# retrieve the index of the DataFrame
df.index

Index(['r1', 'r2'], dtype='object')

Likewise, the values can be accessed using the .values property. Note that the result is a two-dimensional array:

In [29]:
df.values

array([[0, 1],
       [2, 3]])

A DataFrame can also be created by passing a dictionary containing one or more Series objects, where the dictionary keys contain the column names and each Series is one column of data:

In [30]:
# create a DataFrame with two Series objects
# and a dictionary
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
pd.DataFrame({'c1': s1, 'c2': s2})

   c1  c2
0   1   6
1   2   7
2   3   8
3   4   9
4   5  10


A DataFrame also automatically aligns the data of each Series passed in by a dictionary. As a demonstration, the following command adds a third column in the DataFrame initialization. This third Series contains two values and specifies its own index. When the DataFrame is created, all of the Series in the dictionary are aligned with each other by index label:

In [31]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})

   c1  c2    c3
0   1   6   NaN
1   2   7  12.0
2   3   8  13.0
3   4   9   NaN
4   5  10   NaN


The first two Series did not have an index specified so they both were indexed with 0 to 4. The third Series has index values; therefore, the values for those indices are placed in the DataFrame in the row with the matching index from the previous columns. Then, pandas automatically fills in NaN for the values that were not supplied.

## Example data

Wherever possible, the code samples in this chapter will utilize a dataset provided with the code bundle of the book. This dataset makes the examples a little less academic in nature. These will be read from files using the pd.read_csv() function, which will load the sample data from the file into a DataFrame.

The dataset we will use is a snapshot of the S&P 500 from Yahoo! Finance. For now, we will load this data into a DataFrame that can be used to demonstrate various operations. This code only uses four specific columns of data in the file by specifying those columns via the usecols parameter to pd.read_csv(). The following command reads in the 500 rows of data:

In [32]:
# read in the data
# use the Symbol column as the index, and only read in the columns at positions 0, 2, 3, and 7
# the reference file was uploaded to GitHub
sp500 = pd.read_csv("https://raw.githubusercontent.com/safarini/Python/master/sp500.csv", 
                    index_col='Symbol', 
                    usecols=[0, 2, 3, 7])
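If the URL above is not reachable, the same index_col and usecols parameters can be tried on an in-memory CSV. The sketch below is only an illustration: the CSV text and its column layout are hypothetical and do not match the real sp500.csv file, although the Sector, Price, and Book Value figures for the two rows are the ones shown in this chapter.

```python
import io
import pandas as pd

# hypothetical CSV text; the real sp500.csv has a different column layout
csv_text = """Symbol,Name,Sector,Price,Book Value
MMM,3M Co.,Industrials,141.14,26.668
ABT,Abbott Laboratories,Health Care,39.60,15.573
"""

demo = pd.read_csv(io.StringIO(csv_text),
                   index_col='Symbol',      # use this column as the index
                   usecols=[0, 2, 3, 4])    # skip the Name column
```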

We can examine the first five rows of the DataFrame using the .head() method:

In [33]:
# peek at the first 5 rows of the data using .head()
sp500.head()

                        Sector   Price  Book Value
Symbol
MMM                Industrials  141.14      26.668
ABT                Health Care   39.60      15.573
ABBV               Health Care   53.95       2.954
ACN     Information Technology   79.79       8.326
ACE                 Financials  102.91      86.897


The index of the DataFrame consists of the symbols for the 500 stocks representing the S&P 500:

In [34]:
# examine the index
sp500.index

Index(['MMM', 'ABT', 'ABBV', 'ACN', 'ACE', 'ACT', 'ADBE', 'AES', 'AET', 'AFL',
       ...
       'XEL', 'XRX', 'XLNX', 'XL', 'XYL', 'YHOO', 'YUM', 'ZMH', 'ZION', 'ZTS'],
      dtype='object', name='Symbol', length=500)

### Selecting columns of a DataFrame

Selecting the data in specific columns of a DataFrame is performed using the [] operator, which can be passed either a single object or a list of objects. These objects are matched against the labels in the columns index.

To select columns by zero-based position rather than by label, use the .iloc[] indexer, which takes a row selector followed by a column selector. The following command retrieves the data in the second and third columns:

In [35]:
# get the second and third columns (positions 1 and 2) by location using iloc
# [:, 1:3] means all rows, and columns from position 1 up to but not including 3
sp500.iloc[ : , 1:3 ].head()

         Price  Book Value
Symbol
MMM     141.14      26.668
ABT      39.60      15.573
ABBV     53.95       2.954
ACN      79.79       8.326
ACE     102.91      86.897


Note that the : in the row position selects all rows, and that the column slice 1:3 excludes its end position, so it returns the columns at positions 1 and 2. Because a slice is used for the columns, the result is a DataFrame.

The values passed to [] are matched against the labels in the columns index. The following command retrieves the Price column by name:

In [36]:
# get price column by name
# result is a Series
sp500['Price']

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

Multiple columns can be selected by name by passing a list of the column names. The result is a DataFrame (even if a single item is passed in the list):

In [37]:
# select columns by passing their names as strings in a list; note the double brackets [[]]
sp500[['Price', 'Sector']].head(3)

         Price       Sector
Symbol
MMM     141.14  Industrials
ABT      39.60  Health Care
ABBV     53.95  Health Care


Columns can also be retrieved using what is referred to as attribute access. A DataFrame exposes each column as a property whose name is the name of the column, provided that the name is a valid Python identifier and does not clash with an existing DataFrame attribute. Since this selects a single column, the resulting value is a Series:

In [38]:
# attribute access of column by name
sp500.Price

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64
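Attribute access cannot be written for a column such as Book Value, because the name contains a space. The following sketch uses a small stand-in frame (two rows taken from the output above, not the full dataset) to show both forms side by side.

```python
import pandas as pd

# small stand-in frame using two rows shown in this chapter
df_attr = pd.DataFrame({'Price': [141.14, 39.60],
                        'Book Value': [26.668, 15.573]},
                       index=['MMM', 'ABT'])

price = df_attr.Price                 # works: 'Price' is a valid identifier
book_value = df_attr['Book Value']    # the space rules out dot notation
```

Using [] with the column name works in every case, so it is the safer choice in code that has to handle arbitrary column names.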

## Selecting rows of a DataFrame using the index

The elements of an array or Series are selected using the [ ] operator. The DataFrame overloads [ ] to select columns instead of rows except for a specific case of slicing. Therefore, most operations of selecting one or more rows in a DataFrame require alternate methods to using [].

Understanding this is important in pandas as it is a common mistake to try to select rows using [ ] due to familiarity with other languages or data structures. When doing so, errors are often received and can often be difficult to diagnose without realizing that [ ] is working along a completely different axis than with a Series object.

Row selection using the index of a DataFrame breaks down into the following general categories of operations:

- Slicing using the [ ] operator
- Label- or location-based lookup using .loc and .iloc
- Scalar lookup by label or location using .at and .iat

We will briefly examine each of these techniques. Remember, all of them work against the content of the index of the DataFrame; the data in the columns is not involved when selecting rows this way. Selecting rows based on column data is covered in the later section on Boolean selection.

### Slicing using the [ ] operator

Slicing a DataFrame across its index is syntactically identical to slicing a Series. Because of this, we will not go into the details of the various permutations of slices in this section and only give representative examples applied to a DataFrame.

Slicing works along both positions and labels. The following command demonstrates slicing by position:

In [39]:
# first three rows
sp500[:3]

             Sector   Price  Book Value
Symbol
MMM     Industrials  141.14      26.668
ABT     Health Care   39.60      15.573
ABBV    Health Care   53.95       2.954


The following command returns rows starting with the XYL label through the YUM label:



In [40]:
# XYL through YUM labels
sp500['XYL':'YUM']

                        Sector  Price  Book Value
Symbol
XYL                Industrials  38.42      12.127
YHOO    Information Technology  35.02      12.768
YUM     Consumer Discretionary  74.77       5.147


In general, although slicing a DataFrame has its uses, high-performance systems tend to shy away from it and use other methods. Additionally, the slice notation for rows on a DataFrame using integers can be confusing as it looks like accessing columns by position and hence can lead to subtle bugs.
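The potential for confusion can be seen in a small sketch. The frame below is made up for illustration; the point is that a slice inside [] selects rows, while any other key selects columns, and that .iloc states the intent explicitly.

```python
import pandas as pd

df_sl = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6]},
                     index=['r1', 'r2', 'r3'])

rows_by_slice = df_sl[0:2]         # slice inside []: selects rows
rows_explicit = df_sl.iloc[0:2]    # same rows, but the intent is explicit
column = df_sl['c1']               # non-slice key inside []: selects a column
```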

### Selecting rows by index label and location: .loc[ ] and .iloc[ ]

Rows can be retrieved via the index label value using .loc[ ]:

In [41]:
# get row with label MMM
# returned as a Series
sp500.loc['MMM']

Sector        Industrials
Price              141.14
Book Value         26.668
Name: MMM, dtype: object

In [42]:
# rows with label MMM and MSFT
# this is a DataFrame result
sp500.loc[['MMM', 'MSFT']]

                        Sector   Price  Book Value
Symbol
MMM                Industrials  141.14      26.668
MSFT    Information Technology   40.12      10.584


Rows can be retrieved by location using .iloc[ ]:

In [43]:
# get rows in location 0 and 2
sp500.iloc[[0, 2]]

             Sector   Price  Book Value
Symbol
MMM     Industrials  141.14      26.668
ABBV    Health Care   53.95       2.954


It is possible to look up the location in the index of a specific label value, which can then be used to retrieve the row(s):


In [44]:
# get the location of MMM and A in the index
i1 = sp500.index.get_loc('MMM')
i2 = sp500.index.get_loc('A')
i1, i2

(0, 10)

In [45]:
# and get the rows
sp500.iloc[[i1, i2]]

             Sector   Price  Book Value
Symbol
MMM     Industrials  141.14      26.668
A       Health Care   56.18      16.928


### Scalar lookup by label or location using .at[ ] and .iat[ ] 

Scalar values can be looked up by label using .at[] by passing the row label and then the column name/value:

In [48]:
# by label in both the index and column
sp500.at['MMM', 'Price']

141.14

Scalar values can also be looked up by location using .iat[] by passing the row location and then the column location. This is the preferred way to access a single value by location and generally gives the best performance:

In [49]:
# by location.  Row 0, column 1
sp500.iat[0, 1]

141.14

## Selecting rows using Boolean selection

Rows can also be selected using Boolean selection, with an array calculated by applying a logical condition to the values in any of the columns. This allows us to build more complicated selections than those based simply upon index labels or positions.

Consider the following command, which produces a Boolean Series indicating which companies have a price below 100.0:

In [52]:
# what rows have a price < 100?
sp500.Price < 100

Symbol
MMM     False
ABT      True
ABBV     True
ACN      True
ACE     False
        ...  
YHOO     True
YUM      True
ZMH     False
ZION     True
ZTS      True
Name: Price, Length: 500, dtype: bool

This results in a Series that can be used to select rows where the value is True:

In [53]:
# now get the rows with Price < 100
sp500[sp500.Price < 100]

                        Sector  Price  Book Value
Symbol
ABT                Health Care  39.60      15.573
ABBV               Health Care  53.95       2.954
ACN     Information Technology  79.79       8.326
ADBE    Information Technology  64.30      13.262
AES                  Utilities  13.61       5.781
...                        ...    ...         ...
XYL                Industrials  38.42      12.127
YHOO    Information Technology  35.02      12.768
YUM     Consumer Discretionary  74.77       5.147
ZION                Financials  28.43      30.191


Multiple conditions can be combined with the & (and) and | (or) operators, with each condition wrapped in parentheses, and at the same time, it is possible to select only a subset of the columns. The following command retrieves the symbols and prices of all stocks with a price less than 10 and greater than 0:

In [54]:
# get only the Price where Price is < 10 and > 0
sp500[(sp500.Price < 10) & (sp500.Price > 0)] [['Price']]

        Price
Symbol
FTR      5.81
HCBK     9.80
HBAN     9.10
SLM      8.82
WIN      9.38
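The same pattern extends to the | (or) and ~ (not) operators. The sketch below uses a three-row stand-in frame built from values shown earlier in this chapter rather than the full dataset, and passes the mask and the column list to .loc in a single step.

```python
import pandas as pd

# three rows taken from the sp500 output shown in this chapter
df_bool = pd.DataFrame({'Sector': ['Industrials', 'Health Care', 'Utilities'],
                        'Price': [141.14, 39.60, 13.61]},
                       index=['MMM', 'ABT', 'AES'])

# | is element-wise OR; each condition needs its own parentheses
cheap_or_industrial = df_bool[(df_bool.Price < 20) |
                              (df_bool.Sector == 'Industrials')]

# ~ negates a mask; .loc takes the mask and the columns together
not_cheap = df_bool.loc[~(df_bool.Price < 20), ['Price']]
```

The parentheses are required because & and | bind more tightly than comparison operators in Python.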


## Arithmetic on a DataFrame

Arithmetic operations using scalar values will be applied to every element of a DataFrame. To demonstrate this, we will use a DataFrame initialized with random values:

In [55]:
# set the seed to allow reproducible results
np.random.seed(123456)
# create the DataFrame
df = pd.DataFrame(np.random.randn(5, 4), 
                  columns=['A', 'B', 'C', 'D'])
df

          A         B         C         D
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401


By default, any arithmetic operation will be applied across all rows and columns of a DataFrame and will return a new DataFrame with the results (leaving the original unchanged):

In [56]:
# multiply everything by 2
df * 2

          A         B         C         D
0  0.938225 -0.565727 -3.018117 -2.271265
1  2.424224 -0.346429  0.238417 -2.088472
2 -1.723698 -4.209138 -0.989859  2.143608
3  1.443110 -1.413542 -2.079150  0.543720
4 -0.849945  1.134041  0.552464 -2.174801


When performing an operation between a DataFrame and a Series, pandas will align the Series index along the DataFrame columns, performing what is referred to as a row-wise broadcast. To demonstrate this, the following example retrieves the first row of the DataFrame and then subtracts this from each row of the DataFrame. The Series is being broadcast by pandas to each row of the DataFrame, which aligns each series item with the DataFrame item of the same index label and then applies the minus operator on the matched values:

In [57]:
df - df.iloc[0]

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  0.743000  0.109649  1.628267  0.091396
2 -1.330961 -1.821706  1.014129  2.207436
3  0.252443 -0.423908  0.469484  1.407492
4 -0.894085  0.849884  1.785291  0.048232


An arithmetic operation between two DataFrame objects will align with both the column and index labels. The following command extracts a small portion of df and subtracts it from df. The result demonstrates that the aligned values subtract to 0, while the others are set to NaN:

In [58]:
# get rows 1 through 3, and only the B and C columns
subframe = df[1:4][['B', 'C']]
# we have extracted a little square in the middle of df
subframe

          B         C
1 -0.173215  0.119209
2 -2.104569 -0.494929
3 -0.706771 -1.039575


In [59]:
# demonstrate the alignment of the subtraction
df - subframe

    A    B    C   D
0 NaN  NaN  NaN NaN
1 NaN  0.0  0.0 NaN
2 NaN  0.0  0.0 NaN
3 NaN  0.0  0.0 NaN
4 NaN  NaN  NaN NaN


Additional control of an arithmetic operation can be gained using the arithmetic methods provided by the DataFrame object, such as .sub(), which accept an axis parameter. The following command passes axis=0 to .sub() so that the Series is aligned with the row index, which subtracts the A column from every column:

In [60]:
# get the A column
a_col = df['A']
df.sub(a_col, axis=0)

     A         B         C         D
0  0.0 -0.751976 -1.978171 -1.604745
1  0.0 -1.385327 -1.092903 -2.256348
2  0.0 -1.242720  0.366920  1.933653
3  0.0 -1.428326 -1.761130 -0.449695
4  0.0  0.991993  0.701204 -0.662428
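To make the effect of the axis parameter easier to verify by hand, here is a sketch with a tiny integer frame (made up for illustration) that contrasts the two settings of .sub().

```python
import pandas as pd

df_ar = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# axis=0 (or 'index'): match the Series labels to the row index,
# so column A is subtracted from every column
minus_col = df_ar.sub(df_ar['A'], axis=0)

# axis=1 (or 'columns'): match the Series labels to the column labels,
# which is also what the - operator does
minus_row = df_ar.sub(df_ar.iloc[0], axis=1)
```

Here minus_col has zeros in column A, while minus_row has zeros in row 0.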


## Reindexing Series and DataFrame objects

Reindexing in pandas is a process that makes the data present in a Series or DataFrame match with a given set of labels along a particular axis. This is core to the functionalities of pandas as it enables label alignment across multiple objects.

The process of performing a reindex does the following:

- Reorders existing data to match a set of labels
- Inserts NaN markers where no data exists for a label
- Fills missing data for a label using a type of logic (defaulting to adding NaNs)

The following is a simple example of changing the index of a Series. The Series starts with an integer index, which is then replaced with alphabetic labels by assigning a list of characters to the .index property, so that the values can be accessed via the character labels in the new index:

In [65]:
# create a series of five random numbers
np.random.seed(1)
s = pd.Series(np.random.randn(5))
s

0    1.624345
1   -0.611756
2   -0.528172
3   -1.072969
4    0.865408
dtype: float64

In [66]:
# now set the index to alpha values
s.index = ['a', 'b', 'c', 'd', 'e']
s

a    1.624345
b   -0.611756
c   -0.528172
d   -1.072969
e    0.865408
dtype: float64

Greater flexibility in creating a new index is provided using the .reindex() method. One example of flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:

In [67]:
# reindex returns a new Series
s2 = s.reindex(['a', 'c', 'e', 'g'])
# change the value at 'a'
s2['a'] = 0
s2

a    0.000000
c   -0.528172
e    0.865408
g         NaN
dtype: float64

In [68]:
s['a']

1.6243453636632417

There are several things here that are important to point out about .reindex():

- The result is a new Series (the value of s['a'] remains unchanged) with the labels provided as a parameter, and if the existing Series had a matching label, that value is copied to the new Series
- If a label is requested for which the Series has no existing label, its value is set to NaN

Reindexing is also useful when you want to align two Series to perform an operation on matching elements from each Series, but for some reason, the two Series have index labels that do not initially align.

The following example demonstrates this, where the first Series has sequential integers as index labels, but the second has the string representations of those integers.

Adding the two Series gives the following result, which is all NaN and has an index in which each label appears to be repeated:

In [69]:
# Series objects with string and integer index types
# with the "same" values will not align
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2

0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

This is an easy trap to fall into when labels intended to be numeric are numeric in one Series and strings in the other. In this case, pandas first tries to align the two indexes and finds no matches, because the integer 0 and the string '0' are different labels. The result therefore carries the labels from both indexes, the three integers and the three strings, which look like duplicates when printed. All of the resulting values are NaN because, for each of the six labels, a value exists in only one of the two Series.

Once this situation is identified, it is fairly simple to fix by casting the index labels of the second Series to int:

In [70]:
# cast the string index labels to integers so the labels align
s2.index = s2.index.values.astype(int)
s1 + s2

0    3
1    5
2    7
dtype: int64
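Because the integer and string labels print identically, one way to diagnose this problem is to inspect the Python type of the labels in each index. The following is a minimal sketch with made-up variable names. It uses Index.astype as an alternative to converting the underlying .values array.

```python
import pandas as pd

s_int = pd.Series([0, 1, 2], index=[0, 1, 2])
s_str = pd.Series([3, 4, 5], index=['0', '1', '2'])

# the label types differ, so no labels match during alignment
int_labels = [type(label).__name__ for label in s_int.index]
str_labels = [type(label).__name__ for label in s_str.index]

# Index.astype converts the labels without touching the values
s_str.index = s_str.index.astype(int)
fixed = s_int + s_str
```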

The default action of inserting NaN as a missing value during .reindex() can be changed using the fill_value parameter of the method. The following command demonstrates using 0 instead of NaN:

In [71]:
# show reindexing with filling of NaN with a specified value
s2 = s.copy()
s2.reindex(['a', 'f'], fill_value=0)

a    1.624345
f    0.000000
dtype: float64

When performing a reindex on ordered data, such as a time-series, it is possible to perform interpolation or filling of values. There will be a more elaborate discussion on interpolation and filling of values in Chapter 4, Time-series, but the following examples introduce the concept. To demonstrate the concept, let's use the following Series:

In [72]:
# a Series to demonstrate reindexing
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s3

0      red
3    green
5     blue
dtype: object

The following command demonstrates forward filling, often referred to as the last known value. The Series is reindexed to create a contiguous integer index, and using the method='ffill' parameter, any new index labels are assigned a value from the previously seen value along the Series. Here's the command:

In [73]:
# forward fill (last known value technique)
s3.reindex(np.arange(0,7), method='ffill')

0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

In [74]:
# demonstrate how backwards fill differs
s3.reindex(np.arange(0,7), method='bfill')

0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object
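The examples in this section all reindex a Series. A DataFrame can be reindexed along both axes in one call, as in the following sketch. The frame and labels are made up for illustration.

```python
import pandas as pd

df_re = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]}, index=['r1', 'r2'])

# reorder the rows, add a new row label and a new column label;
# cells with no existing data become NaN
df_new = df_re.reindex(index=['r2', 'r1', 'r3'], columns=['c1', 'c3'])
```

As with a Series, .reindex() returns a new object and leaves df_re unchanged.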