# Beyond the Basics of pandas

Now that we have covered the basics of pandas and how to manipulate data, let's move on to some different representations of data in pandas. In this section we will analyze financial data that can be readily accessed from within the pandas library.

In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt

3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)]
1.9.2
0.16.2




We will leverage the `pandas.io` remote data access as described on this page:

http://pandas.pydata.org/pandas-docs/stable/remote_data.html

Functions from `pandas.io.data` and `pandas.io.ga` extract data from various Internet sources and make them available in our notebooks. At the time of this writing, the following sources are supported:

- Yahoo! Finance
- Google Finance
- St. Louis FED (FRED)
- Kenneth French’s data library
- World Bank
- Google Analytics

This list changes frequently, so it is a good idea to check what is available to you; more sources are likely to be added over time.

Let's explore the module to see if it gives us any information.

In [2]:
import pandas.io.data
?pandas.io.data

Our plan is to look at some stocks from Yahoo! data with `pandas.io.data`.  There was a fair amount of volatility in the oil markets in 2014 to 2015. It was rough for oil producers, to say the least. Let's explore some of the stocks that are involved in that specific market.

First we will set start and end dates. These are just datetimes. As we review in the next cell, we can create datetimes with the `datetime` package. However, we can also have pandas parse a string and return a datetime for us, which ends up being extremely useful.

In [4]:
import datetime
print(datetime.datetime(2010,1,1))
print(pd.to_datetime("2010-1-1"))

2010-01-01 00:00:00
2010-01-01 00:00:00
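
To see why the two approaches are interchangeable, here is a minimal sketch comparing them. The variable names `explicit` and `parsed` are ours, chosen only for illustration.

```python
import datetime
import pandas as pd

# Build the same moment two ways: explicitly, and by parsing a string.
explicit = datetime.datetime(2010, 1, 1)
parsed = pd.to_datetime("2010-1-1")

# pandas returns a Timestamp, which is a subclass of datetime.datetime,
# so the two values compare equal.
print(type(parsed).__name__)
print(isinstance(parsed, datetime.datetime))
print(parsed == explicit)

# A zero-padded version of the string parses to the same date.
print(pd.to_datetime("2010-01-01") == parsed)
```

Because the parsed value behaves like a regular datetime, we can mix the two styles freely when specifying date ranges.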


Now let's take a look at some specific stocks: WTI, CHK, TSLA, and CBAK. The company behind each ticker symbol is given below, although you can certainly look them up online as well.


- WTI: W&T Offshore Inc. (This company drills in the Gulf of Mexico.)
- CHK: Chesapeake Energy Corporation
- TSLA: Tesla Motors
- CBAK: China Bak Battery Incorporated


See below how I use both datetime creation methods. We will get data from 2010 to 2015 for all of these stocks.

In [5]:
start = pd.to_datetime('2010-1-1')
end = datetime.datetime(2015,1,1)
ticker_symbols = ['WTI','CHK','TSLA','CBAK']

Here you can see how to get data for a single stock.

In [6]:
wti = pd.io.data.get_data_yahoo(ticker_symbols[0],start=start,end=end)

In [7]:
wti.head()

Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,11.9,12.46,11.86,12.26,838800,9.824887
2010-01-05,12.3,12.63,12.17,12.34,625400,9.888997
2010-01-06,12.41,12.65,12.39,12.58,604700,10.081328
2010-01-07,12.6,12.7,12.24,12.45,565300,9.977149
2010-01-08,12.37,12.54,12.12,12.5,521100,10.017218


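One way to download the data for all of the stocks is with a `for` loop like the one below. Note that each pass through the loop overwrites `df`, so only the last stock's data remains once the loop finishes.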

In [7]:
for symbol in ticker_symbols:
    print(symbol)
    df = pd.io.data.get_data_yahoo(symbol,start=start,end=end)

WTI
CHK
TSLA
CBAK
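
If we wanted to keep every result from a loop like this, one option is to store each DataFrame in a dictionary keyed by symbol. The sketch below is only an illustration: `fetch_prices` is a hypothetical stand-in that returns made-up numbers so the example runs without a network connection, and the shortened date range is our own choice.

```python
import numpy as np
import pandas as pd

def fetch_prices(symbol, start, end):
    """Hypothetical stand-in for a remote download; returns synthetic prices."""
    dates = pd.bdate_range(start, end)       # business days only
    rng = np.random.RandomState(0)           # fixed seed, made-up data
    return pd.DataFrame({"Close": 10 + rng.rand(len(dates))}, index=dates)

ticker_symbols = ['WTI', 'CHK', 'TSLA', 'CBAK']
start = pd.to_datetime('2010-1-1')
end = pd.to_datetime('2010-1-15')

# Store each result under its symbol instead of overwriting one variable.
frames = {}
for symbol in ticker_symbols:
    frames[symbol] = fetch_prices(symbol, start, end)

print(sorted(frames))
print(frames['WTI'].head())
```

Each stock's data can then be retrieved by name, for example `frames['CHK']`.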


A simpler solution is to just pass in a list of symbols, which pandas will automatically resolve for us.

In [9]:
panl = pd.io.data.get_data_yahoo(ticker_symbols,start=start,end=end)

Notice that I called this `panl` instead of something conventional like `df`. This is because this query returns a pandas type that we have not encountered yet: a pandas `Panel`. Panels are an advanced topic, and explaining their use cases is outside the scope of this course; however, we will cover the basics.

The word *panel* is derived from "panel data."  In econometrics and statistics, panel data refers to a data set in which multiple units of analysis are observed over multiple time periods.  Such a data set requires specialized statistical modeling techniques for analysis.  

A panel in pandas is a three-dimensional container for data. It is basically a three-dimensional DataFrame. We can query along each of those individual dimensions.

**Warning:** Most of the time, the two definitions of *panel* line up nicely: a pandas panel is a great way to store most panel data. There could be exceptions, however. Not all three-dimensional data sets have a time dimension, for example. As you continue in your training, it is important to keep the separate definitions in mind. In particular, do not assume that everything you learn about the pandas `Panel` carries over to the statistical definition.

In [9]:
panl

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1258 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2014-12-31 00:00:00
Minor_axis axis: CBAK to WTI


We can see that our panel has three axes: an items axis, a major axis, and a minor axis. The major axis is the time axis.  The minor axis has the four companies (these are our principal units of analysis).  Finally, the items axis refers to the different variables: opening price, closing price, and so forth.
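
To make the three axes concrete, here is a small sketch that uses a plain three-dimensional NumPy array rather than a panel. The numbers are made up, and names such as `values`, `open_prices`, `one_day`, and `one_company` are ours. Fixing a position along any one axis leaves an ordinary two-dimensional table.

```python
import numpy as np
import pandas as pd

items = ['Open', 'Close']                                           # variables
major = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])  # dates
minor = ['CHK', 'WTI']                                              # companies

# Made-up numbers with shape (items, major_axis, minor_axis) = (2, 3, 2).
values = np.arange(12, dtype=float).reshape(2, 3, 2)

# Fixing one item leaves a dates-by-companies table.
open_prices = pd.DataFrame(values[0], index=major, columns=minor)

# Fixing one date leaves a companies-by-items table.
one_day = pd.DataFrame(values[:, 0, :].T, index=minor, columns=items)

# Fixing one company leaves a dates-by-items table.
one_company = pd.DataFrame(values[:, :, 0].T, index=major, columns=items)

print(open_prices)
print(one_day)
print(one_company)
```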

Panels are a core part of pandas, but they are used much less often than Series and DataFrames and have therefore been somewhat neglected. Understand that we are not trying to avoid the topic; the documentation itself says as much.

http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#panel

> Note: Unfortunately Panel, being less commonly used than Series and DataFrame, has been slightly neglected feature-wise. A number of methods and options available in DataFrame are not available in Panel. This will get worked on, of course, in future releases (and even faster if you join me in working on the codebase).

In [10]:
type(panl)

pandas.core.panel.Panel


You are likely to run into panels at some point in your work, so let's touch on their behavior. Panels include a lot of the basic methods and attributes that we are comfortable with, like `shape`.


In [11]:
panl.shape

(6, 1258, 4)

Because a panel has three axes, querying it works a bit differently. Items are selected like standard DataFrame columns, using dot syntax.


In [12]:
panl.Open.head()

Date,CBAK,CHK,TSLA,WTI
2010-01-04,2.9,27.429997,,11.9
2010-01-05,2.72,28.300002,,12.3
2010-01-06,2.98,29.209995,,12.41
2010-01-07,2.9,28.629998,,12.6
2010-01-08,2.9,28.389996,,12.37


The major and minor axes are queried differently, with the `major_xs` and `minor_xs` methods. Notice how `major_xs` conveniently accepts a date string and parses it for us.

In [13]:
panl.major_xs('2013-5-1')

,Open,High,Low,Close,Volume,Adj Close
CBAK,0.62,0.79,0.6,0.77,56800,0.77
CHK,19.900002,19.979996,18.86,19.190006,17267300,17.569929
TSLA,55.990002,55.990002,53.0,53.279999,2742800,53.279999
WTI,11.54,11.54,11.03,11.22,720500,10.394898


In [14]:
panl.minor_xs('CHK').head()

Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,27.429997,28.109996,26.920004,28.089999,31146800,24.50381
2010-01-05,28.300002,29.120002,28.199999,28.970004,28692700,25.271467
2010-01-06,29.209995,29.220005,28.530005,28.649996,16055000,24.992314
2010-01-07,28.629998,28.799995,28.180002,28.720002,13906600,25.053382
2010-01-08,28.389996,28.919998,28.050002,28.909998,11656400,25.219122


Some summary statistics are available to us such as the mean.  Notice that this computes the mean across the major axis, which is the time axis.

In [15]:
panl.mean()

,Open,High,Low,Close,Volume,Adj Close
CBAK,1.729793,1.801296,1.670469,1.726248,117253.020668,4.791367
CHK,24.194174,24.530963,23.802076,24.165184,14230891.096979,21.921955
TSLA,88.21882,89.988099,86.329102,88.177315,4136559.59507,88.177315
WTI,16.080199,16.418887,15.720723,16.069173,806965.580286,14.3293
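
The idea of averaging over the time axis can be sketched with a plain NumPy array. This is an illustration with made-up numbers, not the panel's own implementation; the names `values` and `means` are ours.

```python
import numpy as np
import pandas as pd

items = ['Open', 'Close']   # variables
minor = ['CHK', 'WTI']      # companies

# Made-up numbers with shape (items, dates, companies) = (2, 3, 2).
values = np.arange(12, dtype=float).reshape(2, 3, 2)

# Average over axis 1 (the dates), leaving an items-by-companies array,
# then transpose so that companies are the rows, as in the output above.
means = pd.DataFrame(values.mean(axis=1).T, index=minor, columns=items)
print(means)
```

The result has one row per company and one column per variable, which matches the layout of the summary shown above.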


We can perform different kinds of selections and transposition using the major and minor axes; however,  we will not cover this material. What we will do is convert this panel to a DataFrame. This will be a convenient way to introduce a new topic as well.

When we convert the `Panel` to a `DataFrame` with the `to_frame` method, we will see that it looks a bit different, especially when we print out the data.

In [16]:
df = panl.to_frame()
df.head()

Date,minor,Open,High,Low,Close,Volume,Adj Close
2010-01-04,CBAK,2.9,2.9,2.67,2.7,456600,13.5
2010-01-04,CHK,27.429997,28.109996,26.920004,28.089999,31146800,24.50381
2010-01-04,WTI,11.9,12.46,11.86,12.26,838800,9.824887
2010-01-05,CBAK,2.72,3.1,2.69,2.85,1179500,14.25
2010-01-05,CHK,28.300002,29.120002,28.199999,28.970004,28692700,25.271467


You can see that we have two indices on our data. More formally, this is called a hierarchical index, or `MultiIndex`. Hierarchical indices are extremely powerful because they allow for a lot of creative querying. We will not go into them too deeply right now; however, we will touch on them a bit later in this section.

What you need to know now is that there are levels that are stacked on one another and those can be queried. Formally, as you might have guessed, the different indices are known as *levels*.

In [17]:
df.index.levels

FrozenList([[2010-01-04 00:00:00, 2010-01-05 00:00:00, 2010-01-06 00:00:00, 2010-01-07 00:00:00, 2010-01-08 00:00:00, 2010-01-11 00:00:00, 2010-01-12 00:00:00, 2010-01-13 00:00:00, 2010-01-14 00:00:00, 2010-01-15 00:00:00, 2010-01-19 00:00:00, 2010-01-20 00:00:00, 2010-01-21 00:00:00, 2010-01-22 00:00:00, 2010-01-25 00:00:00, 2010-01-26 00:00:00, 2010-01-27 00:00:00, 2010-01-28 00:00:00, 2010-01-29 00:00:00, 2010-02-01 00:00:00, 2010-02-02 00:00:00, 2010-02-03 00:00:00, 2010-02-04 00:00:00, 2010-02-05 00:00:00, 2010-02-08 00:00:00, 2010-02-09 00:00:00, 2010-02-10 00:00:00, 2010-02-11 00:00:00, 2010-02-12 00:00:00, 2010-02-16 00:00:00, 2010-02-17 00:00:00, 2010-02-18 00:00:00, 2010-02-19 00:00:00, 2010-02-22 00:00:00, 2010-02-23 00:00:00, 2010-02-24 00:00:00, 2010-02-25 00:00:00, 2010-02-26 00:00:00, 2010-03-01 00:00:00, 2010-03-02 00:00:00, 2010-03-03 00:00:00, 2010-03-04 00:00:00, 2010-03-05 00:00:00, 2010-03-08 00:00:00, 2010-03-09 00:00:00, 2010-03-10 00:00:00, 2010-03-11 00:00:00, 

In [18]:
print(len(df.index.levels))

2
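
As a small illustration of querying by level, the sketch below builds a tiny two-level index by hand and selects rows with the `xs` method. The prices are made up for the example, and the names `prices`, `chk`, and `first_day` are ours.

```python
import pandas as pd

# Made-up long-format data: one row per (date, ticker) pair.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2010-01-04', '2010-01-05']), ['CBAK', 'CHK', 'WTI']],
    names=['Date', 'minor'],
)
prices = pd.DataFrame({'Open': [2.9, 27.4, 11.9, 2.7, 28.3, 12.3]}, index=index)

print(prices.index.nlevels)       # number of index levels
print(list(prices.index.names))   # the name of each level

# Select every row for one ticker by naming the level to match on.
chk = prices.xs('CHK', level='minor')
print(chk)

# Select every row for one date.
first_day = prices.xs(pd.Timestamp('2010-01-04'), level='Date')
print(first_day)
```

In each case the level used for the selection is dropped from the result, so what comes back is indexed by the remaining level only.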


Since we do not want to work with a hierarchical index right now, we will reset the index to get the data into a format that is a little less structured. Luckily, this is easy to do.

You may often find yourself using the `reset_index` method just to get back to square one and start over when performing analysis. I find myself using it a lot simply because it helps me make sure that I understand what I am doing to my data and what format it is in.

In [19]:
df.reset_index().head()

,Date,minor,Open,High,Low,Close,Volume,Adj Close
0,2010-01-04,CBAK,2.9,2.9,2.67,2.7,456600,13.5
1,2010-01-04,CHK,27.429997,28.109996,26.920004,28.089999,31146800,24.50381
2,2010-01-04,WTI,11.9,12.46,11.86,12.26,838800,9.824887
3,2010-01-05,CBAK,2.72,3.1,2.69,2.85,1179500,14.25
4,2010-01-05,CHK,28.300002,29.120002,28.199999,28.970004,28692700,25.271467


Remember that `reset_index` does not modify the DataFrame in place by default; it returns a new DataFrame and leaves `df` untouched. To keep the change, we will set the `inplace` parameter in our method call.

In [20]:
df.reset_index(inplace=True)

Now we have reset our index. 
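
Here is a minimal sketch of the difference, using a tiny made-up DataFrame (the names `prices` and `flat` are ours). It also shows reassignment as an alternative to `inplace=True`.

```python
import pandas as pd

# Made-up data with a two-level index, similar in shape to the one above.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2010-01-04', '2010-01-05']), ['CHK', 'WTI']],
    names=['Date', 'minor'],
)
prices = pd.DataFrame({'Open': [27.4, 11.9, 28.3, 12.3]}, index=index)

# reset_index() returns a new DataFrame; `prices` itself is unchanged.
flat = prices.reset_index()
print(prices.index.nlevels)   # still two levels on the original
print(list(flat.columns))     # the former index levels are now columns

# Reassigning the result is an alternative to passing inplace=True.
prices = prices.reset_index()
print(list(prices.columns))
```

Either style works; reassignment simply makes it explicit that a new object is being created.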

The purpose of this lesson was to introduce you to some of these more advanced data representations and data source APIs. In the next video we will work with an airplane data set that will let us try out a lot of what we have learned.