# Introduction to Pandas

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** of data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [None]:
from IPython.display import HTML
HTML('<iframe src="http://pandas.pydata.org" width="800" height="350"></iframe>')

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np

# Set some Pandas options
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

## Pandas Data Structures

### Series

A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.

In [None]:
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [None]:
counts.values

In [None]:
counts.index

In [None]:
counts.index.values

#### Exercise
Assign meaningful labels to the index. 

The labels are: `['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']`

In [None]:
%load 2_Scientific_Libraries/labels.py


These labels can be used to refer to the values in the `Series`.

In [None]:
bacteria['Actinobacteria']

#### Exercise
Show only the name and value when the name ends with 'bacteria'.

In [None]:
%load 2_Scientific_Libraries/ends_with.py


Notice that the indexing operation preserved the association between the values and the corresponding indices.

#### Exercise
Create a mask (= list of True/False values) using the same condition (name ends with 'bacteria').

In [None]:
%load 2_Scientific_Libraries/mask.py


We can still use positional indexing if we wish, via the `iloc` attribute.

In [None]:
bacteria

In [None]:
bacteria.iloc[0]

We can give both the array of values and the index meaningful labels themselves:

In [None]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria

NumPy's math functions and other operations can be applied to Series without losing the data structure.

#### Exercise
Apply the NumPy `log` function to the bacteria data.

In [None]:
%load 2_Scientific_Libraries/log.py


We can also filter according to the values in the `Series`.

#### Exercise
Show only the data for values bigger than 1000.

In [None]:
bacteria

In [None]:
%load 2_Scientific_Libraries/filter.py


A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [None]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

Notice that the `Series` keeps the keys in the order in which they appear in the `dict`. (Older pandas versions created the `Series` in key-sorted order.)

If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the `NaN` (not a number) value to mark missing values.

In [None]:
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria',
                                            'Actinobacteria'])
bacteria2

In [None]:
bacteria2.isnull()

Critically, the labels are used to **align data** when used in operations with other Series objects:

In [None]:
bacteria

In [None]:
bacteria2

In [None]:
bacteria + bacteria2

In [None]:
# fill_value=0 substitutes 0 for a label missing from one Series;
# a value that is missing in both stays NaN
bacteria.add(bacteria2, fill_value=0)

Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position; adding Series combines values that share the same label. Notice also that the missing values were propagated by addition.
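
The contrast between positional and label-based addition can be sketched with a small, made-up pair of Series (the labels `w` to `z` and the values are purely illustrative, not part of the bacteria data):

```python
import pandas as pd

# Hypothetical toy data: the two Series share only the labels 'y' and 'z'
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# NumPy arrays are combined by position; labels play no role
positional = a.values + b.values

# Series are combined by label; a label present on one side only gives NaN
aligned = a + b

# fill_value=0 treats a label missing from one side as 0 before adding
filled = a.add(b, fill_value=0)

print(positional)
print(aligned)
print(filled)
```

Here `aligned` holds `NaN` for `x` and `w`, while `filled` keeps the single available value for those labels.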

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. 

In [None]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                     'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                     'Bacteroidetes']})
data

Notice that the columns of the `DataFrame` follow the order of the keys in the `dict` (older pandas versions sorted them by column name). We can change the order by indexing them in the order we desire:

In [None]:
data[['phylum','value','patient']]

A `DataFrame` has a second index, representing the columns:

In [None]:
data.columns

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

#### Exercise
Find two ways to display only the column 'value' of the data frame.

In [None]:
%load 2_Scientific_Libraries/column_value.py


Notice that pandas returns this column as a Series.

In [None]:
type(data.value)

If you want to return a column as a `DataFrame`, use double brackets.

In [None]:
data[['value']]

In [None]:
type(data[['value']])

Notice this is different than with `Series`, where dict-like indexing retrieved a particular element (row). If we want access to a row in a `DataFrame`, we index its `iloc` attribute by position, or its `loc` attribute if we want to access rows by index label.


In [None]:
data.iloc[3]
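
As a minimal sketch of the difference between the two accessors, the following uses a hypothetical three-row frame with string labels, so that position and label cannot be confused:

```python
import pandas as pd

# Hypothetical frame whose index labels are strings rather than integers
df = pd.DataFrame({'value': [632, 1638, 569]}, index=['a', 'b', 'c'])

column = df['value']          # dict-like indexing selects a column
row_by_position = df.iloc[1]  # second row, counted from zero
row_by_label = df.loc['b']    # row whose index label is 'b'

print(type(column).__name__)
print(row_by_position)
print(row_by_label)
```

Both row lookups return the same row as a `Series` whose index is the column names.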

Alternatively, we can create a `DataFrame` with a dict of dicts:

In [None]:
data = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                    1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                    2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                    3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                    4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                    5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                    6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                    7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})

In [None]:
data

We probably want this transposed:

In [None]:
data = data.T
data

It's important to note that, in pandas versions without Copy-on-Write, the Series returned when a DataFrame is indexed can be merely a **view** on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:

In [None]:
vals = data.value
vals

In [None]:
vals[5] = 0
vals

Notice that your original data was changed as well!

In [None]:
data

Use the `copy()` method to create a real copy.

In [None]:
vals = data.value.copy()
vals[5] = 1000
data
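
A self-contained sketch of the same idea, on a hypothetical frame: after an explicit `copy()`, changes to the extracted column never reach the original, whichever pandas version is in use.

```python
import pandas as pd

# Hypothetical frame, unrelated to the tutorial's `data` object
df = pd.DataFrame({'value': [632, 1638, 569]})

# An explicit copy owns its own data
independent = df['value'].copy()
independent.loc[1] = 0

print(independent)
print(df)  # the original frame still holds 1638 in row 1
```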

We can create or modify columns by assignment:

In [None]:
# Modifying a value (a single .loc call avoids chained assignment)
data.loc[3, 'value'] = 14
data

In [None]:
# Adding a year column
data['year'] = 2013
data

But note, we cannot use the attribute indexing method to add a new column:

In [None]:
data.treatment = 1
data

In [None]:
data.treatment

In [None]:
data.__dict__

Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index:

In [None]:
treatment = pd.Series([0]*4 + [1]*2)
treatment

In [None]:
data['treatment'] = treatment
data
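
To make the alignment explicit, here is a small sketch with made-up data in which the new `Series` is shorter than the frame and its labels are out of order:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3
df = pd.DataFrame({'value': [10, 20, 30, 40]})

# The Series covers only labels 3 and 1, listed in reverse order
flags = pd.Series([1, 1], index=[3, 1])

# Assignment matches on index labels, not on position
df['flag'] = flags
print(df)
```

Rows 1 and 3 receive the value, and rows 0 and 2 are filled with `NaN`.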

Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [None]:
# This code will raise an error!
#month = ['Jan', 'Feb', 'Mar', 'Apr']
#data['month'] = month

# but this one won't:
#data['month'] = 'Jan'

In [None]:
data['month'] = ['Jan']*len(data)
data

We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [None]:
del data['month']
data

We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [None]:
data.values

Notice that because of the mix of string and integer (and `NaN`) values, the dtype of the array is `object`. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

#### Exercise
Create a pandas dataframe from a dictionary `{'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]}`.

Display the values of this dataframe.

In [None]:
%load 2_Scientific_Libraries/foo_bar.py


Pandas uses a custom data structure to represent the indices of Series and DataFrames.

In [None]:
data.index

Index objects are immutable:

In [None]:
data

In [None]:
# This code will throw an error
#data.index[0] = 15

This is so that Index objects can be shared between data structures without fear that they will be changed.

In [None]:
bacteria2.index = bacteria.index

In [None]:
bacteria2
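
The immutability can be checked safely by catching the error. This sketch uses a hypothetical `Index` and also shows that replacing the whole index of an object, as done above, is still allowed:

```python
import pandas as pd

idx = pd.Index([10, 20, 30])

try:
    idx[0] = 15      # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Swapping in an entirely new index is permitted
s = pd.Series([1, 2, 3], index=idx)
s.index = ['a', 'b', 'c']

print(mutated)
print(s)
```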

## Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:

    genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])

Pandas provides a **convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object.** These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Let's start with some more bacteria data, stored in csv format.

In [None]:
!head data/microbiome.csv  # For Windows users: !type data\microbiome.csv

This table can be read into a DataFrame using `read_csv`:

In [None]:
mb = pd.read_csv("data/microbiome.csv")
mb

Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

#### Exercise
Load the data in `data/microbiome.csv` without header and show the first five values.
Try to do this in one python statement. What do you see as column names?

In [None]:
# %load 2_Scientific_Libraries/no_header.py


`read_csv` is just a convenience function for `read_table`, since csv is such a common format:

In [None]:
mb = pd.read_table("data/microbiome.csv", sep=',')
mb.head()

The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to match a variable amount of whitespace, which is unfortunately very common in some data formats:

    sep=r'\s+'
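
A minimal sketch of a whitespace separator in action. The text below is invented for illustration and is read from an in-memory buffer rather than a file:

```python
import io
import pandas as pd

# Hypothetical input with an irregular number of spaces between fields
raw = "gene   value\nabc    1.5\ndef      2.0\n"

# A raw string keeps the backslash in the regular expression intact
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(df)
```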

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.

In [None]:
mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb.head(4)

This is called a *hierarchical* index, which we will revisit later.

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument:

#### Exercise
Load the data in `data/microbiome.csv` but skip rows 3, 4 and 6. Display the first five rows.

In [None]:
%load 2_Scientific_Libraries/skiprows.py


Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

#### Exercise
Load 4 rows of data in `data/microbiome.csv`.

In [None]:
%load 2_Scientific_Libraries/nrows.py


Alternatively, if we want to process our data in reasonable chunks, the `chunksize` argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each, and we want to know the average Tissue value by Taxon.
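
The general pattern of a chunked read might look like the following sketch. It runs on a tiny invented table (two taxa, three rows each) rather than on the microbiome file, and assumes that every chunk contains exactly one taxon:

```python
import io
import pandas as pd

# Hypothetical data: three rows per taxon, so chunksize=3 yields one taxon per chunk
raw = (
    "Taxon,Tissue\n"
    "A,1\nA,2\nA,3\n"
    "B,10\nB,20\nB,30\n"
)

means = {}
for chunk in pd.read_csv(io.StringIO(raw), chunksize=3):
    # Each chunk is an ordinary DataFrame holding at most 3 rows
    taxon = chunk['Taxon'].iloc[0]
    means[taxon] = chunk['Tissue'].mean()

print(means)
```

Only one chunk is held in memory at a time, which is the point of `chunksize` for files too large to load at once.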

#### Exercise
Load the data in `data/microbiome.csv` in chunks of 15 rows. Calculate the mean Tissue value for each chunk and store it in a dictionary with the name of the Taxon as the key. Display the dictionary.

In [None]:
%load 2_Scientific_Libraries/chunk.py


Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

In [None]:
!head data/microbiome_missing.csv  # For Windows users: !type data\microbiome_missing.csv

In [None]:
pd.read_csv("data/microbiome_missing.csv").head(20)

Above, Pandas recognized `NA` and an empty field as missing data.

Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [None]:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(20)

These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`.
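
A sketch of the column-wise form, using an invented three-row table. The column names and markers are assumptions for illustration, and the markers are given as strings:

```python
import io
import pandas as pd

# Hypothetical data: '-99999' marks a missing count, '?' a missing value in either column
raw = "patient,count,site\n1,12,gut\n2,-99999,?\n3,?,skin\n"

# Keys are column names, values are the extra missing-data markers for that column
df = pd.read_csv(
    io.StringIO(raw),
    na_values={'count': ['?', '-99999'], 'site': ['?']},
)
print(df)
```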

## Pandas Fundamentals

This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.

For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.

#### Exercise
Read the data in `data/baseball.csv` using the pandas `read_csv()` function. Assign the column 'id' as the index column.
Display the first 10 records.

In [None]:
!head data/baseball.csv  # For Windows users: !type data\baseball.csv

Baseball statistics: https://en.wikipedia.org/wiki/Baseball_statistics

In [None]:
# %load 2_Scientific_Libraries/read_baseball_data.py
baseball = pd.read_csv("data/baseball.csv", index_col='id')
baseball.head(15)

Notice that we specified the `id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining `player` and `year`.

#### Exercise
Create a new index called '`player_id`' which is the concatenation of '`player`' and '`year`'. Create a copy of the baseball dataframe and call it '`baseball_newind`'. Assign your '`player_id`' index to this new dataframe. Check your result by looking at the first 15 records.

Hint: you might want to use the `astype()` function.

In [None]:
# %load 2_Scientific_Libraries/player_id.py
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
baseball_newind.head(15)

This looks okay, but let's check:

In [None]:
baseball_newind.index.is_unique

So, indices need not be unique. Our choice is not unique because some players change teams within years.

In [None]:
pd.Series(baseball_newind.index).value_counts()

The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [None]:
baseball_newind.loc['wickmbo012007']
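
The same effect can be reproduced on a tiny hypothetical Series in which the label `'a'` occurs twice:

```python
import pandas as pd

# Hypothetical Series with a repeated label
s = pd.Series([1, 2, 3], index=['a', 'b', 'a'])

unique_hit = s.loc['b']    # a unique label returns a single value
repeated_hit = s.loc['a']  # a repeated label returns a Series of all matches

print(s.index.is_unique)
print(unique_hit)
print(repeated_hit)
```

Code that expects a scalar from a label lookup can therefore break silently once duplicates enter the index.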

We will learn more about indexing below.

We can create a truly unique index by combining `player`, `team` and `year`:

In [None]:
player_unique = baseball.player + baseball.team + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_unique
baseball_newind.head()

In [None]:
baseball_newind.index.is_unique

We can create meaningful indices more easily using a hierarchical index; for now, we will stick with the numeric `id` field as our index.

### Manipulating indices

**Reindexing** allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and **optionally, fill in missing data if requested**.

A simple use of `reindex` is to alter the order of the rows:

In [None]:
l = list(range(10))
print(l)
print()
print(l[::-1])

In [None]:
baseball.index

In [None]:
baseball.index[::-1]

In [None]:
baseball.reindex(baseball.index[::-1]).head()

In [None]:
baseball.reindex(baseball.index[::-1]).tail()

Notice that the **`id` index is not sequential**. Say we wanted to populate the table with every `id` value. We could specify an index that is a sequence from the first to the last `id` numbers in the database, and Pandas would fill in the missing data with `NaN` values:

In [None]:
id_range = range(baseball.index.min(), baseball.index.max() + 1)  # + 1 so the last id is included
baseball.reindex(id_range).head()
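
The optional filling mentioned earlier can be sketched with the `fill_value` argument of `reindex`. The frame below is invented and has gaps in its integer index:

```python
import pandas as pd

# Hypothetical frame whose index skips 101, 103 and 104
df = pd.DataFrame({'hr': [5, 7, 9]}, index=[100, 102, 105])

# Every id from the first to the last, inclusive
full_range = range(df.index.min(), df.index.max() + 1)

with_nan = df.reindex(full_range)                 # new rows are filled with NaN
with_zero = df.reindex(full_range, fill_value=0)  # new rows are filled with 0

print(with_nan)
print(with_zero)
```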

We can remove rows or columns via the `drop` method:

In [None]:
baseball.shape

In [None]:
# Dropping rows
baseball.drop([89525, 89526])

In [None]:
# Dropping columns
baseball.drop(['ibb','hbp'], axis=1)

## Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can **use the labels in the `Index` object** to extract values in addition to arrays of integers.

In [None]:
# Sample Series object
hits = baseball_newind.h
hits

In [None]:
type(hits)

#### Exercise
You can still use **NumPy-style indexing**. Select the first three rows of the 'hits' Series.

In [None]:
# %load 2_Scientific_Libraries/hits.py
hits[:3]

#### Exercise
You can also **index by label**. Find the 'hits' data for player_ids 'womacto01CHN2006' and 'schilcu01BOS2006'.

In [None]:
# %load 2_Scientific_Libraries/hits_index.py
hits[['womacto01CHN2006','schilcu01BOS2006']]

We can also **slice with data labels**, since they have an intrinsic order within the Index:

#### Exercise
Display a slice from the 'hits' data from player_id 'womacto01CHN2006' to 'gonzalu01ARI2006'.

In [None]:
%load 2_Scientific_Libraries/hits_slice.py


#### Exercise
Set the values of this slice to 5 and display the 'hits' data again.
What is the warning about?

In [None]:
%load 2_Scientific_Libraries/hits_modify.py


In a `DataFrame` we can slice along either or both axes:

In [None]:
baseball_newind[['h','ab']]

In [None]:
# Filtering data
baseball_newind[baseball_newind.ab > 500]

In [None]:
# Finding certain statistics for a particular player
baseball_newind.loc['gonzalu01ARI2006', ['h','X2b', 'X3b', 'hr']]

The indexing field `loc` allows us to **select subsets of rows and columns** in an intuitive way:

In [None]:
baseball_newind.loc[:'myersmi01NYA2006', 'hr']

Similarly, the **cross-section method `xs`** (not a field) **extracts a single column or row *by label* and returns it as a `Series`**:

In [None]:
baseball_newind.xs('myersmi01NYA2006')

## Sorting and Ranking

Pandas objects include **methods for re-ordering data**.

In [None]:
# Sorting ascending
baseball_newind.sort_index().head()

In [None]:
# Sorting descending
baseball_newind.sort_index(ascending=False).head()

In [None]:
# Sorting columns
baseball_newind.sort_index(axis=1).head()

We can also use `sort_values` to **sort a `Series` by value, rather than by label**.

In [None]:
baseball.hr.sort_values(ascending=False)[:10]

For a `DataFrame`, we can **sort according to the values of one or more columns** using the `by` argument of `sort_values`:

In [None]:
baseball[['player','sb','cs']].sort_values(ascending=[False,True], by=['sb', 'cs']).head(10)
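
How the two sort keys interact is easier to see on a small invented table, where two players tie on `sb` and the tie is broken by ascending `cs`:

```python
import pandas as pd

# Hypothetical data: players 'b' and 'c' tie on stolen bases (sb)
df = pd.DataFrame({
    'player': ['a', 'b', 'c', 'd'],
    'sb': [10, 30, 30, 5],
    'cs': [2, 8, 3, 1],
})

# Primary key sb descending, secondary key cs ascending
ranked = df.sort_values(by=['sb', 'cs'], ascending=[False, True])
print(ranked)
```

The entries of `ascending` pair up with the entries of `by`, so the two lists need the same length.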

## Data summarization

We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The NumPy package contains **several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures**.

In [None]:
baseball.head()

In [None]:
baseball.sum()

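Clearly, `sum` is more meaningful for some columns than others. For methods like `mean`, for which application to string variables is not just meaningless but impossible, these columns have to be excluded. Older pandas versions did so automatically; since pandas 2.0 this must be requested with `numeric_only=True`:

In [None]:
baseball.mean(numeric_only=True)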

The important difference between NumPy's functions and Pandas' methods is that the latter have **built-in support for handling missing data**.

In [None]:
bacteria2

In [None]:
bacteria2.mean()

Sometimes we may not want to ignore missing values, and instead allow the `nan` to propagate.

In [None]:
bacteria2.mean(skipna=False)

Passing `axis=1` will summarize over rows instead of columns, which only makes sense in certain situations.

In [None]:
extra_bases = baseball[['X2b','X3b','hr']].sum(axis=1)
extra_bases.sort_values(ascending=False)

A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is **`describe`**:

In [None]:
baseball.describe()

`describe` can detect non-numeric data and sometimes yield useful information about it.

In [None]:
baseball.player.describe()

We can also calculate summary statistics *across* multiple columns, for example, correlation and covariance.

$$cov(x,y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

In [None]:
baseball.hr.cov(baseball.X2b)

$$corr(x,y) = \frac{cov(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

In [None]:
baseball.hr.corr(baseball.X2b)

In [None]:
baseball.ab.corr(baseball.h)

In [None]:
baseball.corr(numeric_only=True)
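
As a check on what these methods compute, the following sketch recomputes both statistics by hand on two short, made-up Series. It relies on the fact that `Series.cov` returns the sample covariance, that is, the sum of cross-deviations divided by n - 1:

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

n = len(x)
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: cross-deviations summed and divided by n - 1
cov_manual = (dx * dy).sum() / (n - 1)

# Pearson correlation: the n - 1 factors cancel between numerator and denominator
corr_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(cov_manual, x.cov(y))
print(corr_manual, x.corr(y))
```

Each printed pair should agree up to floating-point rounding.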

If we have a `DataFrame` with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:

In [None]:
mb.head(15)

In [None]:
mb.groupby(level='Taxon').sum()
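
A self-contained sketch of a per-level summary, built on a small invented frame with a two-level index. It uses `groupby(level=...)`, which is the form that works in current pandas versions:

```python
import pandas as pd

# Hypothetical two-level index mimicking the Taxon/Patient layout
idx = pd.MultiIndex.from_tuples(
    [('Firmicutes', 1), ('Firmicutes', 2),
     ('Proteobacteria', 1), ('Proteobacteria', 2)],
    names=['Taxon', 'Patient'],
)
df = pd.DataFrame({'Tissue': [1, 2, 3, 4], 'Stool': [10, 20, 30, 40]}, index=idx)

# Sum over patients within each taxon
by_taxon = df.groupby(level='Taxon').sum()
print(by_taxon)
```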

## Writing Data to Files

As well as being able to read several data input formats, Pandas can also **export data to a variety of storage formats**. We will bring your attention to just a couple of these.

In [None]:
mb.to_csv("data/mb.csv")

The `to_csv` method **writes a `DataFrame` to a comma-separated values (csv) file**. You can specify custom delimiters (via **`sep`** argument), how missing values are written (via **`na_rep`** argument), whether the index is written (via **`index`** argument), whether the header is included (via **`header`** argument), among other options.
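
A short sketch of these options on a made-up frame. When no path is given, `to_csv` returns the text instead of writing a file, which makes the effect easy to inspect:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})

# Tab-separated, missing values written as 'NA', index column omitted
text = df.to_csv(sep='\t', na_rep='NA', index=False)
print(text)
```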

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.

In [None]:
baseball.to_pickle("data/baseball_pickle")

The complement to `to_pickle` is the `read_pickle` function, which restores the pickle to a `DataFrame` or `Series`:

In [None]:
pd.read_pickle("data/baseball_pickle")
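
A round trip can be sketched as follows, using a temporary directory and an invented frame so that nothing is written into `data/`. Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with mixed column types
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')
    df.to_pickle(path)               # serialize to disk
    restored = pd.read_pickle(path)  # load it back

# Values, dtypes and index survive the round trip
print(restored.equals(df))
```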

Source: https://github.com/fonnesbeck/statistical-analysis-python-tutorial
        