In [86]:
from IPython.core.display import HTML
css = open('style_table.css').read()
HTML('<style>{}</style>'.format(css))

In [2]:
from IPython.display import IFrame

# Brief Introduction to the PyData Stack

The term **PyData** refers to a collection of open source packages generally organized around the Python programming language.
These tools are centered around dealing with data, including performing different types of analysis on it.
The tools offer the power of commercial statistical packages, with a lot more flexibility.
One issue, however, is that they are built as modules - meaning there is no one package or name to know.
This talk is designed to give you a good starting point - things to google.

## What is **PyData**?

* **PyData** = collection of tools for working with data

* Organized around the Python programming language, but not exclusively Python

* Power and Flexibility
* But the ecosystem can be overwhelming.

## Key Components

I am going to talk about a few key parts of the PyData ecosystem.
Then I'll mention a few more, but not go into detail.

* Jupyter Notebook
* IPython
* NumPy
* Pandas
* R
* SciPy
* SQLAlchemy
* Scikit-Learn
* Statsmodels
* Julia
* Big Data Stuff

* Jupyter

* Pandas

* SymPy

### Jupyter Notebook

In [3]:
IFrame('http://jupyter.org', width=1000, height=500)

I want to spend a bit of time discussing the Jupyter notebook, which is what this presentation was created in.
It also helps to understand the difference between Jupyter and IPython (Interactive Python).

IPython started out as a better REPL (Read-Evaluate-Print-Loop) for Python.
A replacement for `IDLE` on the command line.
It gives you better tab-completion and some "magics" (starting with `%`) to do useful things.

Then a web server was added, giving you the IPython notebook,
which lets you combine text, code and images into one document.

To make it more accessible for languages other than Python (R and Julia being the first),
the notebook project became Jupyter (which stands for JUlia, PYthon and R).

The Jupyter notebook sends messages to the kernel that actually runs the code.
It is possible to quickly write new kernels.
The IPython kernel includes the magic commands - including ones that let you run R code.

The notebook files are plain text using the JSON serialization format.
The `nbconvert` utility can convert them into a variety of formats:

* a Python script
* LaTeX
* PDF
* HTML
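To make the "plain text JSON" point concrete, here is a minimal sketch that pulls the code cells out of a notebook using only the standard library. It is not how `nbconvert` works internally, just an illustration of the file layout; the tiny notebook and the helper name `code_cells_to_script` are made up for this example.

```python
import json

# A tiny hand-written notebook in the nbformat 4 layout (illustrative only).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["x = 1\n", "print(x + 1)"]},
    ],
})


def code_cells_to_script(text):
    """Join the source of every code cell into one Python script string."""
    nb = json.loads(text)
    chunks = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # "source" may be stored as a single string or a list of lines.
            chunks.append(src if isinstance(src, str) else "".join(src))
    return "\n\n".join(chunks)


script = code_cells_to_script(notebook_text)
print(script)
```

For real conversions, `nbconvert` is the better choice, since it also handles markdown cells, outputs and templates.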

The nbviewer site will render notebooks you post somewhere - e.g. GitHub.
You can use Git to version control your notebooks, and it works well because they are just text.
There is an `nbdiff` package that can give you a graphical display of the changes between two versions of a notebook.

Because the notebook is in a standardized format, it is possible to write Python scripts to clean up notebook files.
Say you wanted to keep track of the code and text in a notebook, but not the output:
you could use a script that removes that part of the file.
Git has a way to tell it to run a "cleaning" script on certain files when you add them to your repository.
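A cleaning script along those lines could look like the sketch below. It assumes the nbformat 4 layout, where each code cell carries `outputs` and `execution_count` keys; the function name `strip_outputs` and the sample notebook are hypothetical.

```python
import json


def strip_outputs(notebook_text):
    """Return the notebook JSON with outputs and execution counts removed."""
    nb = json.loads(notebook_text)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep the diffs small.
    return json.dumps(nb, indent=1, sort_keys=True)


# Made-up notebook with one executed code cell.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "execution_count": 7,
         "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result", "execution_count": 7,
                      "metadata": {}, "data": {"text/plain": ["2"]}}]},
    ],
})

cleaned = json.loads(strip_outputs(example))
print(cleaned["cells"][0])
```

To use something like this as a Git clean filter, the script would read the notebook from standard input and write the result to standard output.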

These notebooks are a great way to explore data.
They are also a great teaching tool and means to write documentation.

Because notebooks are rendered in the browser, you can use what you know about HTML and CSS to style them.

### Pandas

In [4]:
IFrame('http://pandas.pydata.org', width=1000, height=500)

Pandas was the package that really got me into using Python for work.
It's an implementation of the Data Frame concept from R, but in Python.

It builds on the NumPy package, which gives you arrays with highly optimized operations.
Instead of using a loop (which is slow in Python), you can sum the array, add a scalar value to it, compare it to something... very quickly.
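As a small illustration of those whole-array operations (the values are made up):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

total = values.sum()      # sum of all elements, no Python loop
shifted = values + 10     # the scalar is added to every element
mask = values > 2.5       # element-wise comparison gives a boolean array

print(total, shifted, mask)
```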

Pandas adds **indexing** to that, and gives you two basic objects: the Series and the DataFrame.
The Series is a single vector (or column), and the Data Frame is a number of columns.

The Series and the Data Frame both have a _row index_.
The Data Frame adds a _column index_.
This is essentially your variable names, fields, etc.
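A minimal sketch of the two objects and their indices, using made-up labels and values:

```python
import pandas as pd

# A Series: one column of values with a row index.
heights = pd.Series([1.5, 1.8, 1.6], index=["a", "b", "c"], name="height")

# A DataFrame: several columns sharing the same row index.
table = pd.DataFrame(
    {"height": [1.5, 1.8, 1.6], "weight": [50, 80, 60]},
    index=["a", "b", "c"],
)

row_labels = list(table.index)        # the row index
column_labels = list(table.columns)   # the column index (variable names)
one_value = table.loc["b", "weight"]  # look up by row label and column label
one_column = table["height"]          # selecting a column gives back a Series

print(row_labels, column_labels, one_value)
```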

You can set any existing column as an index.
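For example, with a made-up table that has an `id` column, `set_index` turns that column into the row index:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "id": ["x1", "x2", "x3"],
    "score": [0.5, 0.7, 0.9],
})

# The "id" column moves out of the columns and becomes the row index.
indexed = df.set_index("id")

print(indexed)
print(indexed.loc["x2", "score"])
```

`reset_index()` goes the other way and turns the index back into an ordinary column.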

Both indices can be **hierarchical**.
You can then `stack` and `unstack` Data Frames - move index levels between the row axis and the column axis
(i.e. build pivot tables).
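Here is a small sketch of a hierarchical row index being unstacked into columns and stacked back again. The region, year and sales values are invented for the example.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2014, 2015, 2014, 2015],
    "sales": [10, 12, 7, 9],
})

# Two columns become a hierarchical (two-level) row index.
by_region_year = df.set_index(["region", "year"])["sales"]

# Move the "year" level from the row axis to the column axis.
wide = by_region_year.unstack("year")

# Move it back from the column axis to the row axis.
long_again = wide.stack()

print(wide)
print(long_again)
```

The `wide` table has one row per region and one column per year, which is the pivot-table shape.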

I'm sure it will be old hat to the R and SQL people here,
but the revelation for me (coming from Stata) was the realization that your initial data and your final results could stay in the same type of object. That lets you build exactly the tables you want.