# Data Manipulation with Pandas

Previously, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python. Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``. ``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

Due to time constraints, we will only briefly introduce the Pandas ``DataFrame`` container here. Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:

In [None]:
import pandas as pd

While we are at it, let's also import NumPy and Seaborn, and turn on inline plotting for Matplotlib within the Jupyter notebook environment.

In [None]:
import numpy as np
import seaborn as sns
%matplotlib inline

## The Pandas DataFrame object

The fundamental structure in Pandas is the ``DataFrame``. The ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. A ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names. To demonstrate this, let's first construct three NumPy arrays of normally distributed random data:

In [None]:
a = np.random.randn(1000)
b = np.random.randn(1000) + 1
c = np.random.randn(1000) - 1

Next, let's construct a Pandas ``DataFrame`` from our normally distributed data, with the columns labeled "a", "b" and "c":

In [None]:
df = pd.DataFrame({'a': a, 'b': b, 'c': c})
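
To make the "generalized array" and "specialized dictionary" views concrete, here is a minimal sketch on a tiny frame. The names `small`, `x` and `y` and their values are hypothetical and chosen only for illustration; they are not part of the random data above.

```python
import numpy as np
import pandas as pd

# Small illustrative arrays (hypothetical values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

small = pd.DataFrame({'x': x, 'y': y})

# Array-like view: a two-dimensional shape and an underlying NumPy array
shape = small.shape          # (rows, columns)
values = small.to_numpy()    # 2-D ndarray with one column per dictionary key

# Dictionary-like view: the keys are the column labels
keys = list(small.keys())
has_x = 'x' in small         # membership tests the column labels, not the values
```

Each dictionary key becomes a column label, while the rows receive a default integer index because none was specified.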

We can inspect the first x rows with the `DataFrame.head(x)` method:

In [None]:
df.head(5)
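
As a small sketch of how `head` behaves, the hypothetical frame `demo` below holds the integers 0 to 9, so it is easy to see which rows are returned. When no argument is given, `head` returns the first five rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with ten rows, used only to illustrate head()
demo = pd.DataFrame({'n': np.arange(10)})

first_three = demo.head(3)   # rows with index 0, 1, 2
default_head = demo.head()   # no argument: the first five rows
```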

We can use the `DataFrame.hist(...)` method to view a histogram of each column of data:

In [None]:
df.hist();
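
The shifts of +1 and -1 applied to `b` and `c` should be visible as offsets between the histograms. One way to check this numerically is with the column-wise `mean` and `std` methods. The sketch below rebuilds comparable data with a seeded generator (`np.random.default_rng`, an assumption made here for reproducibility in place of the `np.random.randn` calls above) in a hypothetical frame `demo`.

```python
import numpy as np
import pandas as pd

# Seeded generator so the numbers are reproducible (assumption: NumPy >= 1.17)
rng = np.random.default_rng(0)

demo = pd.DataFrame({'a': rng.standard_normal(1000),
                     'b': rng.standard_normal(1000) + 1,
                     'c': rng.standard_normal(1000) - 1})

means = demo.mean()   # Series indexed by column label; roughly 0, +1 and -1
stds = demo.std()     # roughly 1 for every column
```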

## Data Indexing and Selection

The ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
df.columns
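
The following sketch shows a few things that can be done with the ``Index`` objects of a frame. The frame `demo` is hypothetical and kept tiny so the results are easy to follow.

```python
import pandas as pd

# Hypothetical two-column frame
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

cols = demo.columns                  # Index holding the column labels
is_index = isinstance(cols, pd.Index)
labels = cols.tolist()               # plain Python list of labels
second = cols[1]                     # Index objects support positional indexing

rows = demo.index                    # the row labels are held in an Index as well
row_labels = list(rows)
```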

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data. The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [None]:
df['a'][:5]

Equivalently, we can use attribute-style access with column names that are strings:

In [None]:
df.a[-5:]
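
A short sketch comparing the two access styles on hypothetical frames `demo` and `clash`. It also illustrates one caveat: attribute-style access does not reach a column whose name collides with an existing ``DataFrame`` method, whereas dictionary-style access always works.

```python
import pandas as pd

# Hypothetical frame for illustration
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

by_key = demo['a']             # dictionary-style access
by_attr = demo.a               # attribute-style access
same = by_key.equals(by_attr)  # both return the same Series

# A column named like a DataFrame method is shadowed by that method
clash = pd.DataFrame({'count': [7, 8]})
clash_col = clash['count']                      # the column, as a Series
attr_is_method = callable(clash.count)          # the DataFrame.count method
```

For this reason, dictionary-style access is the safer choice when column names are not known in advance.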

This dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [None]:
df['d'] = df['a'] / df['b']
df.head(5)
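
To see that the division is applied element by element, row by row, here is the same operation on a hypothetical frame `demo` with values small enough to check by hand.

```python
import pandas as pd

# Hypothetical values chosen so the quotients are easy to verify
demo = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [2.0, 2.0, 3.0]})

# Assigning to a new key appends a column; the division is element-wise
demo['d'] = demo['a'] / demo['b']

new_column = list(demo['d'])      # 1/2, 4/2, 9/3
column_order = list(demo.columns)
```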

We can use masked indexing operations to easily extract the portions of the `DataFrame` which meet some condition:

In [None]:
df[df['a'] > 3]
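
The comparison inside the brackets produces a Boolean ``Series``, and indexing with it keeps only the rows where the value is `True`. The sketch below spells out the intermediate mask on a hypothetical frame `demo`, and shows one way to combine two conditions with `&` (each condition needs its own parentheses).

```python
import pandas as pd

# Hypothetical frame for illustration
demo = pd.DataFrame({'a': [0.5, 3.2, -1.0, 4.1],
                     'b': [1, 2, 3, 4]})

mask = demo['a'] > 3        # Boolean Series, one entry per row
selected = demo[mask]       # keeps the rows where the mask is True

# Combining conditions: use & (and) or | (or), with parentheses around each
both = demo[(demo['a'] > 3) & (demo['b'] > 2)]
```

Note that the selected rows keep their original index labels rather than being renumbered.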

Pandas supports a wide variety of indexing operations which we do not have time to cover here. 

## A real world example: Pandas for financial data

Now let's try something more complicated. We will download the daily opening stock prices for Google, Apple, Microsoft and Facebook and load them into a Pandas ``DataFrame``:

In [None]:
from pandas_datareader import data
from datetime import datetime

# Specify a time interval
start = datetime(2009, 1, 21)
end = datetime(2016, 1, 20)

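# Download daily price data for a few different companies; the opening prices are extracted below
# (note: the 'google' data source may be unavailable in recent pandas-datareader releases)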
df_goog = data.DataReader("GOOG", 'google', start, end)
df_aapl = data.DataReader("AAPL", 'google', start, end)
df_msft = data.DataReader("MSFT", 'google', start, end)
df_fb = data.DataReader("FB", 'google', start, end)
df_all = pd.DataFrame({'Google': df_goog.Open,
                       'Apple': df_aapl.Open,
                       'Microsoft': df_msft.Open,
                       'Facebook': df_fb.Open})
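
When a ``DataFrame`` is built from several ``Series``, Pandas aligns them on their index labels. If one of the downloaded series covers a shorter date range than the others (as would happen for a company that began trading after the start date), the missing dates are filled with `NaN`. The sketch below illustrates this with synthetic data; the dates, prices and the names `old`, `new` and `combined` are hypothetical and are not real market data.

```python
import pandas as pd

# Synthetic date ranges: the second series starts two days later
dates_long = pd.date_range('2012-05-14', periods=5, freq='D')
dates_short = dates_long[2:]

old = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=dates_long)
new = pd.Series([38.0, 39.0, 40.0], index=dates_short)

# The constructor aligns both Series on the union of their dates
combined = pd.DataFrame({'Older': old, 'Newer': new})

n_rows = len(combined)
n_missing = int(combined['Newer'].isna().sum())   # dates before the series starts
```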

We can use the `DataFrame.tail(x)` method to report the final x rows in our ``DataFrame``:

In [None]:
# Report the last 5 rows
df_all.tail(5)

Lastly, it is easy to plot the time series for each column (company) with the `DataFrame.plot()` method:

In [None]:
df_all.plot();