# McKinney Chapter 5 - Getting Started with pandas

## Introduction

Chapter 5 of Wes McKinney's [*Python for Data Analysis*](https://wesmckinney.com/book/) discusses the fundamentals of pandas, which will be our main tool for the rest of the semester.
pandas is an abbreviation of *pan*el *da*ta, which provide time-stamped data for multiple individuals or firms.

***Note:*** 
Indented block quotes are from McKinney unless otherwise indicated. 
The section numbers here differ from McKinney because we will only discuss some topics.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
%config InlineBackend.figure_format = 'retina'
%precision 4
pd.options.display.float_format = '{:.4f}'.format

> pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. 
>
> While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

We will use pandas---a wrapper for NumPy that helps us manipulate and combine data---every day for the rest of the course.

## Introduction to pandas Data Structures

> To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

### Series

> A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

The early examples use integer and string labels, but date-time labels are most useful.

In [3]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Contrast `obj` with a NumPy array equivalent:

In [4]:
np.array([4, 7, -5, 3])

array([ 4,  7, -5,  3])

In [5]:
obj.values

array([ 4,  7, -5,  3])

In [6]:
obj.index  # similar to range(4)

RangeIndex(start=0, stop=4, step=1)

We did not explicitly assign an index to `obj`, so `obj` has an integer index that starts at 0.
We can explicitly assign an index with the `index=` argument.

In [7]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [9]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]:
obj2['a']

-5

In [11]:
obj2[2]

-5

In [12]:
obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [13]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

A pandas series behaves like a NumPy array.
We can use Boolean filters and perform vectorized mathematical operations.

In [14]:
obj2 > 0

d     True
b     True
a    False
c     True
dtype: bool

In [15]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [16]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [17]:
'b' in obj2

True

In [18]:
'e' in obj2

False

We can create a pandas series from a dictionary.
The dictionary keys become the series index.

In [19]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

We can also pass a list of labels as the index when we create a pandas series from a dictionary.
Note that pandas respects the order of the assigned index.
Also, pandas keeps California with `NaN` (not a number or missing value) and drops Utah because it was not in the index.

In [20]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California          NaN
Ohio         35000.0000
Oregon       16000.0000
Texas        71000.0000
dtype: float64

Indices are one of pandas' super powers.
When we perform mathematical operations, pandas aligns series by their indices.
Here `NaN` is "not a number", which indicates missing values.
`NaN` is considered a float, so the data type switches from int64 to float64.

In [21]:
obj3 + obj4

California           NaN
Ohio          70000.0000
Oregon        32000.0000
Texas        142000.0000
Utah                 NaN
dtype: float64
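
The alignment described above can be sketched with two small series whose labels match but appear in different orders. The series names and values here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical series with the same labels in different orders
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# pandas matches on labels, so 'a' pairs with 'a' regardless of position
by_label = left + right

# Dropping to NumPy arrays discards the labels and adds by position instead
by_position = left.values + right.values

print(by_label)
print(by_position)
```

The two results differ because only the pandas version uses the index to decide which values belong together.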

### DataFrame

A pandas data frame is like a worksheet in an Excel workbook with rows and columns that provide fast indexing.

> A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.
>
> There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:


In [22]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)

frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


We did not specify an index, so `frame` has the default index of integers starting at 0.

In [23]:
frame2 = pd.DataFrame(
    data, 
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)

frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


If we extract one column, via either `df.colname` or `df['colname']`, the result is a series.
***However, we must use the `df['colname']` syntax to *add* a column to a data frame.***
Also, we must use the `df['colname']` syntax to extract or add a column whose name contains whitespace.

In [24]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [25]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
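
A minimal sketch of why the bracket syntax is the safer habit, using a small hypothetical data frame (the column names `debt`, `debt_ratio`, and `land area` and all values are made up for illustration). Attribute access also fails quietly when a column name collides with an existing data frame method, such as `pop`.

```python
import warnings

import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Bracket assignment creates a new column
df['debt'] = [0.0, 1.0]

# Attribute assignment sets a plain Python attribute, not a column
# (pandas emits a UserWarning here, which we silence for the demonstration)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.debt_ratio = [0.1, 0.2]

# Column names that contain whitespace only work with brackets
df['land area'] = [1.0, 2.0]

# 'pop' is also a DataFrame method, so df.pop is the method, not the column
pop_is_method = callable(df.pop)

print(df.columns.tolist())
print(pop_is_method)
```

Only `debt` and `land area` appear as new columns, and `df['pop']` is the only reliable way to reach the `pop` column.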

Similarly, if we extract one row, via either `df.loc['rowlabel']` or `df.iloc[rownumber]`, the result is a series.

In [26]:
frame2.loc['one']

year      2000
state     Ohio
pop     1.5000
debt       NaN
Name: one, dtype: object

Data frames have two dimensions, so we have to slice data frames more precisely than series.

1. The `.loc[]` method slices by row labels and column names
1. The `.iloc[]` method slices by *integer* row and column positions

In [27]:
frame2.loc['three']

year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object

In [28]:
frame2.iloc[2]

year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object

We can use NumPy's `[row, column]` syntax with `.loc[]` and `.iloc[]`.

In [29]:
frame2.loc['three', 'state'] # row, column

'Ohio'

In [30]:
frame2.loc['three', ['state', 'pop']] # row, column

state     Ohio
pop     3.6000
Name: three, dtype: object
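
As a sketch of how the two indexers relate, the following uses a small hypothetical data frame and selects the same cells once by label with `.loc[]` and once by integer position with `.iloc[]`.

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame(
    {
        'year': [2000, 2001, 2002],
        'state': ['Ohio', 'Ohio', 'Nevada'],
        'pop': [1.5, 1.7, 2.9],
    },
    index=['one', 'two', 'three'],
)

# One cell: row label and column name versus row and column positions
by_label = df.loc['three', 'state']
by_position = df.iloc[2, 1]

# A block: lists of labels versus lists of positions
block_by_label = df.loc[['one', 'three'], ['state', 'pop']]
block_by_position = df.iloc[[0, 2], [1, 2]]

print(by_label, by_position)
print(block_by_label.equals(block_by_position))
```

Both approaches return the same data, but the label version stays correct if rows or columns are later reordered.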

We can assign either scalars or arrays to data frame columns.

1. Scalars will broadcast to every row in the data frame
1. Arrays must have the same length as the column

In [31]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [32]:
frame2['debt'] = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


If we assign a series to a data frame column, pandas will use the index to align it with the data frame.
Data frame rows whose labels are not in the series will be missing values (`NaN`).

In [33]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

two    -1.2000
four   -1.5000
five   -1.7000
dtype: float64

In [34]:
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


We can add columns to our data frame, then delete them with `del`.

In [35]:
frame2['eastern'] = (frame2.state == 'Ohio')
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [36]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


### Index Objects

In [37]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

In [38]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable!

In [40]:
# index[1] = 'd'  # TypeError: Index does not support mutable operations

Indices can contain duplicates, so an index does not guarantee our data are duplicate-free.

In [41]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
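
One way to check for duplicate labels, sketched here with a hypothetical series built on the `dup_labels` index, is the `.is_unique` attribute and the `.duplicated()` method.

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

# Hypothetical series that uses the duplicated labels as its index
ser = pd.Series([1, 2, 3, 4], index=dup_labels)

# False because 'foo' and 'bar' each appear twice
unique_flag = dup_labels.is_unique

# True marks the second and later occurrences of each label
dup_mask = dup_labels.duplicated()

# Selecting a duplicated label returns every matching row, not a scalar
foo_rows = ser.loc['foo']

print(unique_flag)
print(dup_mask)
print(foo_rows)
```

Because a duplicated label returns a series instead of a scalar, it is worth checking `.is_unique` before relying on single-label lookups.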

## Essential Functionality

This section provides the most important pandas operations.
It is difficult to provide an exhaustive reference, but this section provides a head start on the core pandas functionality.

### Dropping Entries from an Axis

> Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the  drop method will return a new object with the indicated value or values deleted from an axis.

In [42]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a   0.0000
b   1.0000
c   2.0000
d   3.0000
e   4.0000
dtype: float64

In [43]:
obj_without_d_and_c = obj.drop(['d', 'c'])
obj_without_d_and_c

a   0.0000
b   1.0000
e   4.0000
dtype: float64
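
McKinney notes that `.drop()` returns a new object. A minimal sketch of what that means in practice (the name `smaller` is hypothetical): the original series is unchanged unless we assign the result back to it.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# .drop() returns a new series without labels 'd' and 'c'
smaller = obj.drop(['d', 'c'])

# The original series still has all five labels
print(obj.index.tolist())
print(smaller.index.tolist())
```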

The `.drop()` method works on data frames, too.

In [44]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [45]:
data.drop(['Colorado', 'Ohio']) # implied ", axis=0"

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [46]:
data.drop(['Colorado', 'Ohio'], axis=0)

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [47]:
data.drop(index=['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


The `.drop()` method accepts an `axis` argument and the default is `axis=0` to drop rows based on labels.
To drop columns, we use `axis=1` or `axis='columns'`.

In [48]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [49]:
data.drop(columns='two')

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


### Indexing, Selection, and Filtering

Indexing, selecting, and filtering will be among our most-used pandas features.

> Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers.  

In [50]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a   0.0000
b   1.0000
c   2.0000
d   3.0000
dtype: float64

In [51]:
obj['b']

1.0

In [52]:
obj[1]

1.0

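The code directly above works because `obj` has string labels, so pandas treats the integer as a position.
However, recent versions of pandas (2.1 and later) deprecate this positional fallback, so I prefer to be explicit and use `.iloc[]` when I index or slice by integers.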

In [53]:
obj.iloc[1]

1.0

In [54]:
obj.iloc[1:3]

b   1.0000
c   2.0000
dtype: float64

***When we slice with labels, the left and right endpoints are inclusive.***

In [55]:
obj['b':'c']

b   1.0000
c   2.0000
dtype: float64

In [56]:
obj['b':'c'] = 5
obj

a   0.0000
b   5.0000
c   5.0000
d   3.0000
dtype: float64
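
The endpoint rules are easy to mix up, so here is a small sketch that contrasts a label slice with an integer-position slice on the same series (the variable names are hypothetical).

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included, so we get 'b' and 'c'
label_slice = obj.loc['b':'c']

# Position slice: the right endpoint is excluded, so we get only 'b'
position_slice = obj.iloc[1:2]

print(label_slice)
print(position_slice)
```

Label slices include the right endpoint because, unlike integer positions, labels have no natural "next" value to use as an exclusive bound.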

In [57]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Indexing one column returns a series.

In [58]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

Indexing two or more columns returns a data frame.

In [59]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


If we want a one-column data frame, we can use `[[]]`:

In [60]:
data['three']

Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int64

In [61]:
data[['three']]

Unnamed: 0,three
Ohio,2
Colorado,6
Utah,10
New York,14


When we slice with integer indices with `[]`, we slice rows.

In [62]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


When I slice rows, I prefer to use `.loc[]` or `.iloc[]` to avoid confusion.

In [63]:
data.iloc[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


We can index a data frame with Booleans, as we did with NumPy arrays.

In [64]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [65]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Finally, we can chain slices.

In [66]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


***Table 5-4*** summarizes data frame indexing and slicing options:

- `df[val]`: Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
- `df.loc[val]`: Selects single row or subset of rows from the DataFrame by label
- `df.loc[:, val]`: Selects single column or subset of columns by label
- `df.loc[val1, val2]`: Select both rows and columns by label
- `df.iloc[where]`: Selects single row or subset of rows from the DataFrame by integer position
- `df.iloc[:, where]`: Selects single column or subset of columns by integer position
- `df.iloc[where_i, where_j]`: Select both rows and columns by integer position
- `df.at[label_i, label_j]`: Select a single scalar value by row and column label
- `df.iat[i, j]`: Select a single scalar value by row and column position (integers)
- `reindex` method: Select either rows or columns by labels
- `get_value`, `set_value` methods: Select single value by row and column label (these methods were removed in pandas 1.0; use `.at[]` or `.iat[]` instead)

pandas is powerful and these options can be overwhelming!
We will typically use `df[val]` to select columns (here `val` is either a string or list of strings), `df.loc[val]` to select rows (here `val` is a row label), and `df.loc[val1, val2]` to select both rows and columns.
The other options add flexibility, and we may occasionally use them.
However, our data will be large enough that counting row and column numbers will be tedious, making `.iloc[]` impractical.
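
The selections we will use most often can be sketched as follows, on the same 4-by-4 example data frame used earlier in this section (the result names are hypothetical).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'],
)

one_col = data['two']                                   # series
two_cols = data[['two', 'four']]                        # data frame
one_row = data.loc['Utah']                              # series
block = data.loc[['Ohio', 'Utah'], ['one', 'three']]    # data frame

# Boolean row filter combined with a column selection
filtered = data.loc[data['three'] > 5, ['one', 'two']]

print(block)
print(filtered)
```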

### Integer Indexes

In [67]:
ser = pd.Series(np.arange(3.))
ser

0   0.0000
1   1.0000
2   2.0000
dtype: float64

The following indexing yields an error because the series cannot fall back to NumPy array indexing.
Falling back to NumPy array indexing here would generate many subtle bugs elsewhere.

In [69]:
# ser[-1]

In [70]:
ser.iloc[-1]

2.0

However, the following indexing works fine because with string labels there is no ambiguity.

In [71]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a   0.0000
b   1.0000
c   2.0000
dtype: float64

In [72]:
ser2[-1]

2.0

In [73]:
ser2.iloc[-1]

2.0

In practice, these errors will not be an issue because we will index or slice with stock identifiers and dates instead of integers.
To avoid confusion, we should use `.iloc[]` to index or slice with integers.
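
A minimal sketch of the ambiguity, using a hypothetical series named `shuffled` whose integer labels run in the opposite order from its positions. With `.loc[]` and `.iloc[]` the intent is explicit, so the results do not depend on fallback rules.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# Position -1 is the last element
last_by_position = ser.iloc[-1]

# Label -1 does not exist in the index 0, 1, 2, so .loc[] raises a KeyError
try:
    ser.loc[-1]
    label_found = True
except KeyError:
    label_found = False

# Hypothetical series whose integer labels are reversed relative to positions
shuffled = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])
by_label = shuffled.loc[0]      # label 0 is the last row
by_position = shuffled.iloc[0]  # position 0 is the first row

print(last_by_position, label_found, by_label, by_position)
```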

### Arithmetic and Data Alignment

> An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. 

In [74]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [75]:
s1

a    7.3000
c   -2.5000
d    3.4000
e    1.5000
dtype: float64

In [76]:
s2

a   -2.1000
c    3.6000
e   -1.5000
f    4.0000
g    3.1000
dtype: float64

In [77]:
s1 + s2

a   5.2000
c   1.1000
d      NaN
e   0.0000
f      NaN
g      NaN
dtype: float64

In [78]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [79]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [80]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [81]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [82]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

In [84]:
df1

Unnamed: 0,A
0,1
1,2


In [85]:
df2

Unnamed: 0,B
0,3
1,4


In [83]:
df1 - df2

Unnamed: 0,A,B
0,,
1,,


#### Arithmetic methods with fill values

In [86]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

In [87]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [88]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [89]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


We can specify a fill value for `NaN` values.
Note that pandas fills would-be `NaN` values in each data frame *before* the arithmetic operation, so a result is `NaN` only where both data frames are missing a value.

In [90]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
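
To see how `fill_value=` treats missing values, here is a small sketch with two hypothetical data frames, `a` and `b`, that share one location where both values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data frames; location (1, 'x') is missing in both
a = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
b = pd.DataFrame({'x': [10.0, np.nan], 'y': [20.0, 40.0]})

# The + operator propagates NaN if either side is missing
plain = a + b

# .add() with fill_value=0 replaces a one-sided missing value with 0
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

The value at `(0, 'y')` becomes 20.0 because only `a` is missing there, while `(1, 'x')` stays `NaN` because both inputs are missing.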


#### Operations between DataFrame and Series

In [91]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [92]:
arr[0]

array([0., 1., 2., 3.])

In [93]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

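Arithmetic operations between data frames and series broadcast the same way as the NumPy example above: by default, pandas matches the series index to the data frame columns and repeats the operation down the rows.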

In [94]:
frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

series = frame.iloc[0]

In [95]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [96]:
series

b   0.0000
d   1.0000
e   2.0000
Name: Utah, dtype: float64

In [97]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [98]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [99]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [100]:
series2

b    0
e    1
f    2
dtype: int64

In [101]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [102]:
series3 = frame['d']

In [103]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
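
A common use of these two broadcasting directions is demeaning, sketched below with the same example data frame (the result names are hypothetical). The default matches the series index to column names, while `axis='index'` matches it to row labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'],
)

# frame.mean() is indexed by column name, so the default alignment works
col_demeaned = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so we align on the index
row_demeaned = frame.sub(frame.mean(axis=1), axis='index')

print(col_demeaned)
print(row_demeaned)
```

Without `axis='index'`, the second subtraction would try to match row labels to column names and return all `NaN` values.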


### Function Application and Mapping

In [104]:
np.random.seed(42)
frame = pd.DataFrame(
    np.random.randn(4, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

frame

Unnamed: 0,b,d,e
Utah,0.4967,-0.1383,0.6477
Ohio,1.523,-0.2342,-0.2341
Texas,1.5792,0.7674,-0.4695
Oregon,0.5426,-0.4634,-0.4657


In [105]:
frame.abs()

Unnamed: 0,b,d,e
Utah,0.4967,0.1383,0.6477
Ohio,1.523,0.2342,0.2341
Texas,1.5792,0.7674,0.4695
Oregon,0.5426,0.4634,0.4657


> Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

Note that we can use anonymous (lambda) functions "on the fly":

In [106]:
frame.apply(lambda x: x.max() - x.min())

b   1.0825
d   1.2309
e   1.1172
dtype: float64

In [107]:
frame.apply(lambda x: x.max() - x.min(), axis=1)

Utah     0.7860
Ohio     1.7572
Texas    2.0487
Oregon   1.0083
dtype: float64

However, under the hood, `.apply()` is basically a `for` loop and is much slower than optimized, built-in methods.
Here is an example of the speed costs of `.apply()`:

In [108]:
%timeit frame['e'].abs()

9.13 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [109]:
%timeit frame['e'].apply(np.abs)

26.3 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
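
Many `.apply()` calls have a vectorized equivalent built from built-in methods. As a sketch, the range calculation from above can be written without `.apply()`. The data here are simulated with a hypothetical seed, so the numbers will differ from the `frame` above.

```python
import numpy as np
import pandas as pd

# Simulated data for illustration only
rng = np.random.default_rng(42)
frame = pd.DataFrame(
    rng.standard_normal((4, 3)),
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'],
)

# Range of each column with .apply() and a lambda function
via_apply = frame.apply(lambda x: x.max() - x.min())

# Same result with built-in, vectorized methods
vectorized = frame.max() - frame.min()

# Same idea across columns for each row
vectorized_rows = frame.max(axis=1) - frame.min(axis=1)

print(via_apply)
print(vectorized)
```

The two approaches return the same values, so we can usually reserve `.apply()` for calculations that have no built-in counterpart.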


## Summarizing and Computing Descriptive Statistics

In [110]:
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [111]:
df.sum()

one    9.2500
two   -5.8000
dtype: float64

In [112]:
df.sum(axis=1)

a    1.4000
b    2.6000
c    0.0000
d   -0.5500
dtype: float64

In [113]:
df.mean(axis=1, skipna=False)

a       NaN
b    1.3000
c       NaN
d   -0.2750
dtype: float64
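
The outputs above show that `.sum()` skips missing values by default, so the all-missing row `c` sums to zero. The sketch below contrasts three options: the default, `skipna=False`, and the `min_count=` argument, which requires a minimum number of non-missing values. The result names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# Default: skip NaN, so the all-NaN row 'c' sums to 0
default_sum = df.sum(axis=1)

# skipna=False: any NaN in a row makes the row sum NaN
strict_sum = df.sum(axis=1, skipna=False)

# min_count=1: require at least one non-missing value per row
min_count_sum = df.sum(axis=1, min_count=1)

print(pd.DataFrame({
    'default': default_sum,
    'skipna=False': strict_sum,
    'min_count=1': min_count_sum,
}))
```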

The `.idxmax()` method returns the label for the maximum observation.

In [114]:
df.idxmax()

one    b
two    d
dtype: object

The `.describe()` method returns summary statistics for each numerical column in a data frame.

In [115]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.0833,-2.9
std,3.4937,2.2627
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


For non-numerical data, `.describe()` returns alternative summary statistics.

In [117]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### Correlation and Covariance

To explore correlation and covariance methods, we can use Yahoo! Finance stock data.
We can use the yfinance package to import these data.
We can use the requests-cache package to cache our data requests, which avoids unnecessarily re-downloading data.

We can install these two packages with the `%pip` magic:

In [118]:
# %pip install yfinance requests-cache

If we are running Python locally, we only need to run the `%pip` magic once.
If we are running Python in the cloud, we may need to run the `%pip` magic once *per login*.

In [119]:
import yfinance as yf
import requests_cache
session = requests_cache.CachedSession(expire_after='1D')

In [120]:
tickers = yf.Tickers('AAPL IBM MSFT GOOG', session=session)

In [121]:
prices = tickers.history(period='max', auto_adjust=False, progress=False)


In [123]:
prices.index = prices.index.tz_localize(None)

In [125]:
prices['Adj Close']

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1962-01-02,,,1.6330,
1962-01-03,,,1.6473,
1962-01-04,,,1.6309,
1962-01-05,,,1.5988,
1962-01-08,,,1.5688,
...,...,...,...,...
2023-01-20,137.8700,99.2800,141.2000,240.2200
2023-01-23,141.1100,101.2100,141.8600,242.5800
2023-01-24,142.5300,99.2100,141.4900,242.0400
2023-01-25,141.8600,96.7300,140.7600,240.6100


The `prices` data frame contains daily data for AAPL, IBM, MSFT, and GOOG.
The `Adj Close` column provides a reverse-engineered daily closing price that accounts for dividends paid and stock splits (and reverse splits).
As a result, the `.pct_change()` method applied to `Adj Close` considers both price changes (i.e., capital gains) and dividends, so $R_t = \frac{(P_t + D_t) - P_{t-1}}{P_{t-1}} = \frac{\text{Adj Close}_t - \text{Adj Close}_{t-1}}{\text{Adj Close}_{t-1}}.$

In [126]:
returns = prices['Adj Close'].pct_change().dropna()
returns

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2004-08-20,0.0029,0.0794,0.0042,0.0029
2004-08-23,0.0091,0.0101,-0.0070,0.0044
2004-08-24,0.0280,-0.0414,0.0007,0.0000
2004-08-25,0.0344,0.0108,0.0042,0.0114
2004-08-26,0.0487,0.0180,-0.0045,-0.0040
...,...,...,...,...
2023-01-20,0.0192,0.0572,0.0041,0.0357
2023-01-23,0.0235,0.0194,0.0047,0.0098
2023-01-24,0.0101,-0.0198,-0.0026,-0.0022
2023-01-25,-0.0047,-0.0250,-0.0052,-0.0059
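
To connect `.pct_change()` to the return formula, here is a minimal sketch that uses hypothetical adjusted closing prices (not real market data) and compares the method with the formula written out using `.shift()`.

```python
import pandas as pd

# Hypothetical adjusted closing prices, not real market data
adj_close = pd.Series(
    [100.0, 102.0, 99.96],
    index=pd.to_datetime(['2023-01-03', '2023-01-04', '2023-01-05']),
)

# Built-in method
via_method = adj_close.pct_change()

# Same calculation from the formula: (P_t - P_{t-1}) / P_{t-1}
lagged = adj_close.shift(1)
via_formula = (adj_close - lagged) / lagged

print(via_method)
print(via_formula)
```

The first return is `NaN` in both versions because there is no prior price on the first day.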


We multiply by 252 to annualize mean daily returns because means grow linearly with time and there are (about) 252 trading days per year.

In [128]:
returns.mean().mul(252)

AAPL   0.3667
GOOG   0.2463
IBM    0.0822
MSFT   0.1820
dtype: float64

We multiply by $\sqrt{252}$ to annualize the standard deviation of daily returns because variances grow linearly with time, there are (about) 252 trading days per year, and the standard deviation is the square root of the variance.

In [129]:
returns.std().mul(np.sqrt(252))

AAPL   0.3332
GOOG   0.3077
IBM    0.2294
MSFT   0.2738
dtype: float64

***The best explanation I have found on why stock return volatility (the standard deviation of stock returns) grows with the square root of time is at the bottom of page 7 of [chapter 8 of Ivo Welch's free corporate finance textbook](https://book.ivo-welch.info/read/source5.mba/08-invchoice.pdf).***
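
The annualization rules can also be checked with a small simulation. This is only a sketch under simplifying assumptions: daily returns are independent and normally distributed with hypothetical parameters, and annual returns are approximated as the sum of daily returns, which ignores compounding.

```python
import numpy as np

# Hypothetical daily mean and standard deviation, not estimates from real data
daily_mean, daily_std = 0.0005, 0.01
n_years, n_days = 10_000, 252

# Simulate independent daily returns for many hypothetical years
rng = np.random.default_rng(0)
daily = rng.normal(daily_mean, daily_std, size=(n_years, n_days))

# Approximate each annual return as the sum of its daily returns
annual = daily.sum(axis=1)

print(annual.mean(), daily_mean * n_days)          # mean scales with 252
print(annual.std(), daily_std * np.sqrt(n_days))   # std scales with sqrt(252)
```

The simulated annual mean is close to 252 times the daily mean, and the simulated annual standard deviation is close to $\sqrt{252}$ times the daily standard deviation.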

We can calculate pairwise correlations.

In [132]:
returns['MSFT'].corr(returns['IBM'])

0.5076061163270416

We can also calculate correlation matrices.

In [133]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.5204,0.4345,0.5237
GOOG,0.5204,1.0,0.4068,0.565
IBM,0.4345,0.4068,1.0,0.5076
MSFT,0.5237,0.565,0.5076,1.0


In [134]:
returns.corr().loc['MSFT', 'IBM']

0.5076061163270453
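
Correlation is covariance scaled by the two standard deviations. The sketch below shows this relation with the `.cov()` method on simulated returns for two hypothetical tickers, `X` and `Y`, so the values are not comparable to the real data above.

```python
import numpy as np
import pandas as pd

# Simulated daily returns for two hypothetical tickers
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.normal(0.0, 0.01, size=(500, 2)),
    columns=['X', 'Y'],
)

# Covariance matrix and standard deviations
cov = returns.cov()
std = returns.std()

# Correlation from its definition: cov(X, Y) / (std(X) * std(Y))
corr_from_cov = cov.loc['X', 'Y'] / (std['X'] * std['Y'])

# Same value from the built-in methods
corr_builtin = returns.corr().loc['X', 'Y']
corr_pairwise = returns['X'].corr(returns['Y'])

print(corr_from_cov, corr_builtin, corr_pairwise)
```

The three values agree up to floating-point error, which also explains the tiny difference in the last digits of the two MSFT-IBM correlations above.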