# Getting started with pandas

In [None]:
from pandas import Series, DataFrame
import pandas as pd

In [None]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)

In [None]:
%pwd

## Introduction to pandas data structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: `Series` and `DataFrame`. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

### Series

A `Series` is a one-dimensional array-like object containing an array of data (of any
`NumPy` data type) and an associated array of data labels, called its index. The simplest
`Series` is formed from only an array of data:

In [None]:
obj = Series([4, 7, -5, 3])
obj

The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:

In [None]:
print(obj.values)
print(obj.index)

In [None]:
type(obj.values)

Often it will be desirable to create a Series with an index identifying each data point:

In [None]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

In [None]:
obj2.index

Compared with a regular `NumPy` array, you can use values in the index when selecting
single values or a set of values:

In [None]:
obj2['a']

In [None]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

`NumPy` array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
np.exp(obj2)
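
As a minimal sketch of the index-value link described above, the following self-contained example repeats the three operations on a small Series and checks which labels survive. The variable names (`s`, `positive`, `doubled`, `exped`) are illustrative and not part of the original text.

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]   # boolean filtering keeps the labels of the surviving values
doubled = s * 2       # scalar multiplication keeps every label
exped = np.exp(s)     # a NumPy ufunc returns a Series with the same index

print(positive.index.tolist())        # ['d', 'b', 'c']
print(doubled['a'])                   # -10
print(exped.index.equals(s.index))    # True
```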

Another way to think about a `Series` is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
`dict`:

In [None]:
'b' in obj2

In [None]:
'e' in obj2

Should you have data contained in a Python `dict`, you can create a `Series` from it by
passing the `dict`:

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

When only passing a `dict`, the index in the resulting Series takes its labels from the dict’s
keys. Older pandas releases sorted the keys; recent releases keep the dict’s insertion order.

In [None]:
states = ['California', 'Ohio', 'Texas', 'Oregon']
obj4 = Series(sdata, index=states)
obj4

In this case, three values found in **sdata** were placed in the appropriate locations, but since
no value for **'California'** was found, it appears as **NaN** (not a number), which pandas
uses to mark missing or *NA* values. I will use the terms “missing” or “NA”
to refer to missing data. The **isnull** and **notnull** functions in pandas should be used to
detect missing data:

In [None]:
pd.isnull(obj4)

In [None]:
pd.notnull(obj4)

In [None]:
obj4.isnull()

I discuss working with missing data in more detail later in this chapter.

A critical Series feature for many applications is that it automatically aligns differently
indexed data in arithmetic operations:

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4
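
The alignment behavior can be checked with a small, self-contained sketch. The two Series below (`a` and `b`) are hypothetical stand-ins with partially overlapping labels; labels present on only one side come out as NaN in the sum.

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'California': 1000, 'Ohio': 35000, 'Texas': 71000})

total = a + b  # values are matched by label, not by position

print(total['Ohio'])                               # 70000.0
print(sorted(total[total.isna()].index.tolist()))  # ['California', 'Utah']
```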

Data alignment features are addressed as a separate topic.

Both the Series object itself and its index have a `name` attribute, which integrates with
other key areas of pandas functionality:

In [None]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

A Series’s index can be altered in place by assignment:

In [None]:
obj

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series, all sharing the same index. Compared with other
DataFrame-like structures you may have used before (like R’s data.frame), row-oriented
and column-oriented operations in DataFrame are treated roughly symmetrically. Under the
hood, the data is stored as one or more two-dimensional blocks rather
than a list, dict, or some other collection of one-dimensional arrays. The exact details
of DataFrame’s internals are far outside the scope of this book.

There are numerous ways to construct a DataFrame, though one of the most common
is from a `dict` of equal-length lists or NumPy arrays:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series. Older
pandas releases place the columns in sorted order; recent releases keep the dict’s insertion
order:

In [None]:
frame

If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:

In [None]:
DataFrame(data, columns=['year', 'state', 'pop'])

As with Series, if you pass a column that isn’t contained in **data**, it will appear with NA
values in the result:

In [None]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

In [None]:
frame2.columns

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [None]:
frame2['state']

In [None]:
frame2.year

Note that the returned Series have the same index as the DataFrame, and their name
attribute has been appropriately set.

Rows can also be retrieved by position or name with a couple of methods, such as the
`loc` indexing field (much more on this later):

In [None]:
frame2.loc['three']

In [None]:
frame2['debt'] = 16.5
frame2

In [None]:
frame2['debt'] = np.arange(5.)
frame2

When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will be instead conformed exactly to the
DataFrame’s index, inserting missing values in any holes:

In [None]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
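
The difference between assigning a list and assigning a Series can be sketched as follows. This is an illustrative example on a hypothetical three-row frame (`df`): the list must match the row count exactly, while the Series is aligned on its labels.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A list must supply exactly one value per row.
try:
    df['debt'] = [1.0, 2.0]
except ValueError as exc:
    print('length mismatch:', exc)
    mismatch_raised = True
else:
    mismatch_raised = False

# A Series is aligned on its labels; 'one' and 'three' have no match and become NaN.
df['debt'] = pd.Series([-1.2], index=['two'])
print(df)
```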

Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

In [None]:
del frame2['eastern']
frame2.columns

The column returned when indexing a DataFrame is a **view** on the underlying data, not a copy. Thus, any in-place modifications to the Series
will be reflected in the DataFrame. The column can be explicitly copied
using the Series’s **copy** method.

Another common form of data is a nested dict of dicts format:

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:

In [None]:
frame3 = DataFrame(pop)
frame3

Of course you can always transpose the result:

In [None]:
frame3.T

The keys in the inner dicts are unioned and sorted to form the index in the result. This
isn’t true if an explicit index is specified:

In [None]:
DataFrame(pop, index=[2001, 2002, 2003])

Dicts of Series are treated much in the same way:

In [None]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

For a complete list of things you can pass the DataFrame constructor, see **Table 5-1**.

**Table 5-1. Possible data inputs to DataFrame constructor**

Type | Notes
--- | ---
2D ndarray | A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame. All sequences must be the same length.
NumPy structured/record array | Treated as the “dict of arrays” case
dict of Series | Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.
dict of dicts | Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.
list of dicts or Series | Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels
List of lists or tuples | Treated as the “2D ndarray” case
Another DataFrame | The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result

If a DataFrame’s **index** and **columns** have their **name** attributes set, these will also be
displayed:

In [None]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

In [None]:
frame3.values

In [None]:
frame2.values

### Index objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

In [None]:
index[1:]

Index objects are immutable and thus can’t be modified by the user:

In [None]:
index[1] = 'd'
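
The assignment above fails with a `TypeError`. A minimal sketch of catching that error, and of building a modified copy instead, might look like this (the names `labels` and `replaced` are illustrative):

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

try:
    labels[1] = 'd'          # item assignment is not supported on an Index
except TypeError as exc:
    print('TypeError:', exc)
    raised = True
else:
    raised = False

# "Changing" an Index means building a new one; the original is untouched.
replaced = labels.delete(1).insert(1, 'd')
print(labels.tolist())    # ['a', 'b', 'c']
print(replaced.tolist())  # ['a', 'd', 'c']
```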

Immutability is important so that Index objects can be safely shared among data
structures:

In [None]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

**Table 5-2** has a list of built-in Index classes in the library. With some development
effort, Index can even be subclassed to implement specialized axis indexing functionality.

**Table 5-2. Main Index objects in pandas**

Class | Description
--- | ---
Index | The most general Index object, representing axis labels in a NumPy array of Python objects.
Int64Index | Specialized Index for integer values.
MultiIndex | “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.
DatetimeIndex | Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).
PeriodIndex | Specialized Index for Period data (timespans).

In addition to being array-like, an Index also functions as a fixed-size set:

In [None]:
frame3

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.index

Each Index has a number of methods and properties for set logic and answering other
common questions about the data it contains. These are summarized in **Table 5-3**.

**Table 5-3. Index methods and properties**

Method | Description
--- | ---
append | Concatenate with additional Index objects, producing a new Index
difference | Compute set difference as an Index
intersection | Compute set intersection
union | Compute set union
isin | Compute boolean array indicating whether each value is contained in the passed collection
delete | Compute new Index with element at index i deleted
drop | Compute new index by deleting passed values
insert | Compute new Index by inserting element at index i
is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element
is_unique | Returns True if the Index has no duplicate values
unique | Compute the array of unique values in the Index
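
A few of the set-style methods from the table can be tried on two small, hypothetical Index objects (`left` and `right`). This is only a sketch; note that `union` and `difference` return sorted results here because the labels are sortable.

```python
import pandas as pd

left = pd.Index(['Ohio', 'Texas', 'Utah'])
right = pd.Index(['Texas', 'Oregon'])

print(left.union(right).tolist())         # ['Ohio', 'Oregon', 'Texas', 'Utah']
print(left.intersection(right).tolist())  # ['Texas']
print(left.difference(right).tolist())    # ['Ohio', 'Utah']
print(left.isin(['Utah']).tolist())       # [False, False, True]
print('Oregon' in left)                   # False
```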

## Essential functionality

In this section, I’ll walk you through the fundamental mechanics of interacting with
the data contained in a Series or DataFrame. Upcoming chapters will delve more deeply
into data analysis and manipulation topics using pandas. This book is not intended to
serve as exhaustive documentation for the pandas library; I instead focus on the most
important features, leaving the less common (that is, more esoteric) things for you to
explore on your own.

### Reindexing

A critical method on pandas objects is **reindex** , which means to create a new object
with the data *conformed* to a new index. Consider a simple example from above:

In [None]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

Calling **reindex** on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

In [None]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The **method** option allows us to do this, using a method such
as **ffill** which forward fills the values:

In [None]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
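
To compare the reindexing options side by side, here is a small sketch on the same three-color data. The placeholder string `'n/a'` and the variable names are assumptions made for illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = colors.reindex(range(6))                        # new labels 1, 3, 5 become missing
filled = colors.reindex(range(6), method='ffill')       # carry the last seen value forward
constant = colors.reindex(range(6), fill_value='n/a')   # use a fixed placeholder instead

print(filled.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
print(constant.tolist())
# ['blue', 'n/a', 'purple', 'n/a', 'yellow', 'n/a']
```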

**Table 5-4** lists available method options. At this time, interpolation more sophisticated
than forward- and backfilling would need to be applied after the fact.

**Table 5-4. reindex method (interpolation) options**

Argument | Description
--- | ---
ffill or pad | Fill (or carry) values forward
bfill or backfill | Fill (or carry) values backward

With DataFrame, **reindex** can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:

In [None]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

The columns can be reindexed using the columns keyword:

In [None]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Both can be reindexed in one shot, though interpolation will only apply row-wise (axis
0):

In [None]:
#frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states).ffill()

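As you’ll see soon, label-indexing with `loc` is a more succinct way to select and reorder
existing labels. Unlike **reindex**, recent pandas releases raise a `KeyError` if any requested
label is missing, so only labels present in **frame** are used here:

In [None]:
frame.loc[['a', 'c', 'd'], ['Texas', 'California']]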

**Table 5-5. reindex function arguments**

Argument | Description
--- | ---
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying
method | Interpolation (fill) method, see Table 5-4 for options.
fill_value | Substitute value to use when introducing missing data by reindexing
limit | When forward- or backfilling, maximum size gap to fill
level | Match simple Index on level of MultiIndex, otherwise select subset of
copy | If True (the default), always copy the underlying data, even when the new index is equivalent to the old one. If False, skip the copy when the indexes are equivalent.

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

In [None]:
obj.drop(['d', 'c'])

With DataFrame, index values can be deleted from either axis:

In [None]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

In [None]:
data.drop(['Colorado', 'Ohio'])

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two', 'four'], axis=1)

### Indexing, selection, and filtering

Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']

In [None]:
obj[1]

In [None]:
obj[2:4]

In [None]:
obj[['b', 'a', 'd']]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 2]

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [None]:
obj['b':'c']

Setting using these methods works just as you would expect:

In [None]:
obj['b':'c'] = 5
obj
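
The endpoint rule is easy to verify with a short sketch that slices the same Series once by position and once by label (using the explicit `iloc` and `loc` accessors so the intent is unambiguous; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = s.iloc[1:2]   # half-open, like a Python list: only 'b'
by_label = s.loc['b':'c']   # closed on both ends: 'b' and 'c'

print(by_position.index.tolist())  # ['b']
print(by_label.index.tolist())     # ['b', 'c']
```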

As you’ve seen above, indexing into a DataFrame is for retrieving one or more columns
either with a single value or sequence:

In [None]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

Indexing like this has a few special cases. First, rows can be selected by slicing or with a
boolean array:

In [None]:
data[:2]

In [None]:
data[data['three'] > 5]

This might seem inconsistent to some readers, but this syntax arose out of practicality
and nothing more. Another use case is in indexing with a boolean DataFrame, such as
one produced by a scalar comparison:

In [None]:
data < 5

In [None]:
data[data < 5] = 0

In [None]:
data

This is intended to make DataFrame syntactically more like an ndarray in this case.

For DataFrame label-indexing on the rows, I introduce the special indexing field `loc`. It
enables you to select a subset of the rows and columns from a DataFrame with NumPy-like
notation plus axis labels. As I mentioned earlier, for labels that already exist this is also
a less verbose alternative to reindexing. The pandas documentation describes the different
choices for indexing:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing

In [None]:
data.loc['Colorado', ['two', 'three']]

In [None]:
# .ix was removed from pandas; select the same columns (positions 3, 0, 1) by label
data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]

In [None]:
data.iloc[2]

In [None]:
data.loc[:'Utah', 'two']

In [None]:
data.loc[data.three > 5, :'two']
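
As a compact illustration of label-based versus position-based selection, the sketch below rebuilds the same 4x4 frame and picks identical cells both ways. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]   # row label, column labels
by_position = data.iloc[1, [1, 2]]                  # the same cells by integer position
masked = data.loc[data['three'] > 5, :'two']        # boolean row mask, label slice of columns

print(by_label.tolist())               # [5, 6]
print(by_label.equals(by_position))    # True
print(masked.shape)                    # (3, 2)
```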

**Table 5-6. Indexing options with DataFrame**

Type | Notes
--- | ---
obj[val] | Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
obj.loc[val] | Select single row or subset of rows from the DataFrame by label.
obj.loc[:, val] | Select single column or subset of columns by label.
obj.loc[val1, val2] | Select both rows and columns by label.
reindex method | Conform one or more axes to new indexes.
xs method | Select single row or column as a Series by label.
obj.iloc[i], obj.iloc[:, j] | Select single row or column, respectively, as a Series by integer location.
obj.at[row, col], obj.iat[i, j] | Select single value by row and column label (at) or by integer location (iat).

### Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs. Let’s
look at a simple example:

In [None]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [None]:
s1

In [None]:
s2

Adding these together yields:

In [None]:
s1 + s2

The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

In [None]:
df2

Adding these together returns a DataFrame whose index and columns are the unions
of the ones in each DataFrame:

In [None]:
df1 + df2

#### Arithmetic methods with fill values

In arithmetic operations between differently-indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [None]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

In [None]:
df2

Adding these together results in NA values in the locations that don’t overlap:

In [None]:
df1 + df2

Using the add method on df1 , I pass df2 and an argument to fill_value :

In [None]:
df1.add(df2, fill_value=0)

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)
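
The effect of `fill_value` can be quantified with a small sketch that counts missing cells with and without it. The frames here are deliberately smaller than the ones above and are only illustrative; a cell absent from both inputs would still be NaN, which does not occur in this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list('ab'))
df2 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('abc'))

plain = df1 + df2                    # cells missing from either frame become NaN
filled = df1.add(df2, fill_value=0)  # a cell missing from one frame is treated as 0

print(int(plain.isna().sum().sum()))   # 5
print(int(filled.isna().sum().sum()))  # 0
print(filled.loc[2, 'c'])              # 8.0
```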

**Table 5-7. Flexible arithmetic methods**

Method | Description
--- | ---
add | Method for addition (+)
sub | Method for subtraction (-)
div | Method for division (/)
mul | Method for multiplication (*)

#### Operations between DataFrame and Series

As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First,
as a motivating example, consider the difference between a 2D array and one of its rows:

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr

In [None]:
arr[0]

In [None]:
arr - arr[0]

This is referred to as broadcasting and is explained in more detail in Chapter 12.
Operations between a DataFrame and a Series are similar:

In [None]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

In [None]:
series

By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame's columns, broadcasting down the rows:

In [None]:
frame - series

If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [None]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [None]:
series3 = frame['d']
frame

In [None]:
series3

In [None]:
frame.sub(series3, axis=0)
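
To make the two broadcasting directions concrete, here is a sketch that subtracts a row and then a column from the same frame. The names `row`, `col`, `minus_row`, and `minus_col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = frame.iloc[0]    # indexed by the column labels b, d, e
col = frame['d']       # indexed by the row labels

minus_row = frame - row              # matches on columns, repeats down the rows
minus_col = frame.sub(col, axis=0)   # matches on rows, repeats across the columns

print(minus_row.loc['Ohio'].tolist())   # [3.0, 3.0, 3.0]
print(minus_col.loc['Ohio'].tolist())   # [-1.0, 0.0, 1.0]
```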

The axis number that you pass is the axis to match on. In this case we mean to match
on the DataFrame’s row index and broadcast across.

### Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [None]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
frame

In [None]:
np.abs(frame)

Another frequent operation is applying a function on 1D arrays to each column or row.
DataFrame’s **apply** method does exactly this:

In [None]:
f = lambda x: x.max() - x.min()

In [None]:
frame.apply(f)

In [None]:
frame.apply(f, axis=1)

Many of the most common array statistics (like **sum** and **mean**) are DataFrame methods,
so using **apply** is not necessary.

The function passed to **apply** need not return a scalar value; it can also return a Series
with multiple values:

In [None]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
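
Because the frame above holds random numbers, the shapes of the results are easier to see on fixed data. The following sketch uses a hypothetical frame and two helper functions (`spread` and `min_max`) to show what `apply` returns in each case.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0, -2.0], 'd': [0.5, 0.0, 3.5]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    # x is one column (or one row when axis=1) passed in as a Series
    return x.max() - x.min()

def min_max(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(spread)          # Series indexed by column name
per_row = frame.apply(spread, axis=1)     # Series indexed by row label
summary = frame.apply(min_max)            # DataFrame with rows 'min' and 'max'

print(per_column.tolist())   # [6.0, 3.5]
print(per_row.tolist())      # [0.5, 4.0, 5.5]
print(summary.shape)         # (2, 2)
```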

Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating point value in **frame** . You can do this with **applymap** :

In [None]:
format = lambda x: '%.2f' % x
frame.applymap(format)

In [None]:
frame['e'].map(format)

### Sorting and ranking

Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:

In [None]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

In [None]:
frame.sort_index(axis=1)

The data is sorted in ascending order by default, but can be sorted in descending order,
too:

In [None]:
frame.sort_index(axis=1, ascending=False)

To sort a Series by its values, use its sort_values method:

In [None]:
obj = Series([4, 7, -3, 2])
obj.sort_values()

Any missing values are sorted to the end of the Series by default:

In [None]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

On DataFrame, you may want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option of sort_values:

In [None]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values(by='b')

To sort by multiple columns, pass a list of names:

In [None]:
frame.sort_values(by=['a', 'b'])
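
The resulting row order for single-column and multi-column sorts can be checked with a short sketch. It uses `sort_values`, the method current pandas releases provide for sorting by column values, and reads the row order off the original integer index.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b = frame.sort_values(by='b')                 # ascending by column 'b'
by_a_then_b = frame.sort_values(by=['a', 'b'])   # 'a' first, ties broken by 'b'

print(by_b.index.tolist())          # [2, 3, 0, 1]
print(by_a_then_b.index.tolist())   # [2, 0, 3, 1]
```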

*Ranking* is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. It is similar to the indirect sort indices produced by
**numpy.argsort** , except that ties are broken according to a rule. The **rank** methods for
Series and DataFrame are the place to look; by default **rank** breaks ties by assigning
each group the mean rank:

In [None]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

Ranks can also be assigned according to the order they’re observed in the data:

In [None]:
obj.rank(method='first')

Naturally, you can rank in descending order, too:

In [None]:
obj.rank(ascending=False, method='max')
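
For reference, the sketch below runs the tie-breaking methods on the same seven values so the differences can be compared line by line. The `'min'` call is an extra illustration beyond the cells above.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

print(obj.rank().tolist())
# [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(obj.rank(method='first').tolist())
# [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(obj.rank(method='min').tolist())
# [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]
print(obj.rank(ascending=False, method='max').tolist())
# [2.0, 7.0, 2.0, 4.0, 5.0, 6.0, 4.0]
```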

See **Table 5-8** for a list of tie-breaking methods available. DataFrame can compute ranks
over the rows or the columns:

In [None]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame

In [None]:
frame.rank(axis=1)

**Table 5-8. Tie-breaking methods with rank**

Method | Description
--- | ---
'average' | Default: assign the average rank to each entry in the equal group.
'min' | Use the minimum rank for the whole group.
'max' | Use the maximum rank for the whole group.
'first' | Assign ranks in the order the values appear in the data.

### Axis indexes with duplicate values

Up until now all of the examples I’ve shown you have had unique axis labels (index
values). While many pandas functions (like reindex) require that the labels be unique,
it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

The index’s is_unique property can tell you whether its values are unique or not:

In [None]:
obj.index.is_unique

Data selection is one of the main things that behaves differently with duplicates. Indexing a value with multiple entries returns a Series while single entries return a scalar
value:

In [None]:
obj['a']

In [None]:
obj['c']

The same logic extends to indexing rows in a DataFrame:

In [None]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

In [None]:
df.loc['b']

## Summarizing and computing descriptive statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of *reductions* or *summary statistics*, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider
a small DataFrame:

In [None]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

Passing axis=1 sums over the rows instead:

In [None]:
df.sum(axis=1)

NA values are excluded unless the entire slice (row or column in this case) is NA. This
can be disabled using the skipna option:

In [None]:
df.mean(axis=1, skipna=False)
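
A brief sketch of how `skipna` changes a row-wise mean, using the same four-row frame (the variable names `skipping` and `strict` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

skipping = df.mean(axis=1)                 # NaN cells are ignored
strict = df.mean(axis=1, skipna=False)     # any NaN in a row makes that row's mean NaN

print(skipping['a'])             # 1.4
print(np.isnan(strict['a']))     # True
print(np.isnan(skipping['c']))   # True: every value in row 'c' is missing
```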

See **Table 5-9** for a list of common options for each reduction method.

**Table 5-9. Options for reduction methods**

Method | Description
--- | ---
axis | Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna | Exclude missing values, True by default.
level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex). Removed in pandas 2.0; use groupby(level=...) instead.

Some methods, like idxmin and idxmax , return indirect statistics like the index value
where the minimum or maximum values are attained:

In [None]:
df.idxmax()

In [None]:
df.apply(np.argmax)

In [None]:
df.cumsum()

Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

On non-numeric data, describe produces alternate summary statistics:

In [None]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

See **Table 5-10** for a full list of summary statistics and related methods.

**Table 5-10. Descriptive and summary statistics**

Method | Description
--- | ---
count | Number of non-NA values
describe | Compute set of summary statistics for Series or each DataFrame column
min, max | Compute minimum and maximum values
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively
quantile | Compute sample quantile ranging from 0 to 1
sum | Sum of values
mean | Mean of values
median | Arithmetic median (50% quantile) of values
mad | Mean absolute deviation from mean value
var | Sample variance of values
std | Sample standard deviation of values
skew | Sample skewness (3rd moment) of values
kurt | Sample kurtosis (4th moment) of values
cumsum | Cumulative sum of values
cummin, cummax | Cumulative minimum or maximum of values, respectively
cumprod | Cumulative product of values
diff | Compute 1st arithmetic difference (useful for time series)
pct_change | Compute percent changes

### Correlation and covariance

Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained from
Yahoo! Finance:

In [None]:
!pip install pandas-datareader

In [None]:
import pandas_datareader as pdr

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = pdr.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

In [None]:
returns = price.pct_change()
returns.tail()

In [None]:
returns.MSFT.corr(returns.IBM)

In [None]:
returns.MSFT.cov(returns.IBM)

In [None]:
returns.corr()

In [None]:
returns.cov()

In [None]:
returns.corrwith(returns.IBM)

In [None]:
returns.corrwith(volume)

### Unique values, value counts, and membership

In [None]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [None]:
uniques = obj.unique()
uniques

In [None]:
obj.value_counts()

In [None]:
pd.value_counts(obj.values, sort=False)

In [None]:
mask = obj.isin(['b', 'c'])
mask

In [None]:
obj[mask]

In [None]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data

In [None]:
result = data.apply(pd.value_counts).fillna(0)
result

## Handling missing data

In [None]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
string_data.isnull()

In [None]:
string_data[0] = None
string_data.isnull()

### Filtering out missing data

In [None]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

In [None]:
data[data.notnull()]

In [None]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

In [None]:
cleaned

In [None]:
data.dropna(how='all')

In [None]:
data[4] = NA
data

In [None]:
data.dropna(axis=1, how='all')

In [None]:
df = DataFrame(np.random.randn(7, 3))
df.loc[:4, 1] = NA; df.loc[:2, 2] = NA
df

In [None]:
df.dropna(thresh=3)
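
The three `dropna` variants used in this section can be compared on one small frame. This is a sketch with fixed values rather than random data, so the surviving row labels are predictable.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])

any_missing = data.dropna()             # keep only rows with no NaN at all
all_missing = data.dropna(how='all')    # drop only rows that are entirely NaN
at_least_two = data.dropna(thresh=2)    # keep rows with at least 2 non-NaN values

print(any_missing.index.tolist())    # [0]
print(all_missing.index.tolist())    # [0, 1, 3]
print(at_least_two.index.tolist())   # [0, 3]
```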

### Filling in missing data

In [None]:
df.fillna(0)

In [None]:
df.fillna({1: 0.5, 3: -1})

In [None]:
# with inplace=True, df itself is modified and the call returns None
df.fillna(0, inplace=True)
df

In [None]:
df = DataFrame(np.random.randn(6, 3))
df.loc[2:, 1] = NA; df.loc[4:, 2] = NA
df

In [None]:
df.ffill()

In [None]:
df.ffill(limit=2)

In [None]:
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

## Hierarchical indexing

In [None]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

In [None]:
data.index

In [None]:
data['b']

In [None]:
data['b':'c']

In [None]:
data.loc[['b', 'd']]

In [None]:
data[:, 2]

In [None]:
data.unstack()

In [None]:
data.unstack().stack()
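
Since the Series above is random, a fixed-value sketch makes the selections and the unstack/stack round trip easier to follow. The data and names here are illustrative; the explicit `dropna()` is added because whether `stack` keeps the placeholder cell is assumed to vary between pandas versions.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                 index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]])

outer = data['b']          # select on the outer level
inner = data.loc[:, 2]     # select on the inner level

wide = data.unstack()      # the inner level becomes the columns; ('c', 1) is missing
round_trip = wide.stack().dropna()   # back to a Series with a two-level index

print(outer.tolist())                 # [3.0, 4.0]
print(inner.tolist())                 # [2.0, 4.0, 5.0]
print(wide.shape)                     # (3, 2)
print(int(wide.isna().sum().sum()))   # 1
```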

In [None]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame

In [None]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

In [None]:
frame['Ohio']

### Reordering and sorting levels

In [None]:
frame.swaplevel('key1', 'key2')

In [None]:
frame.sort_index(level=1)

In [None]:
frame.swaplevel(0, 1).sort_index(level=0)

### Summary statistics by level

In [None]:
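# the level argument of sum was removed in pandas 2.0; group by the level instead
frame.groupby(level='key2').sum()

In [None]:
# group the columns by transposing, grouping on the level, and transposing back
frame.T.groupby(level='color').sum().T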

### Using a DataFrame's columns

In [None]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

In [None]:
frame2 = frame.set_index(['c', 'd'])
frame2

In [None]:
frame.set_index(['c', 'd'], drop=False)

In [None]:
frame2.reset_index()
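
The round trip between columns and index levels can be sketched on a smaller, hypothetical frame. `set_index` moves the named columns into the index, and `reset_index` moves them back to the front of the column list.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

indexed = frame.set_index(['c', 'd'])   # 'c' and 'd' become index levels
restored = indexed.reset_index()        # the levels return as ordinary columns

print(indexed.columns.tolist())    # ['a']
print(list(indexed.index.names))   # ['c', 'd']
print(restored.columns.tolist())   # ['c', 'd', 'a']
```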

## Other pandas topics

### Integer indexing

In [None]:
ser = Series(np.arange(3.))
ser.iloc[-1]

In [None]:
ser

In [None]:
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

In [None]:
ser.loc[:1]

In [None]:
ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]

In [None]:
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
frame.iloc[0]

### Panel data

In [None]:
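# pd.Panel was removed in pandas 0.25, so the cells in this section only run
# on older releases. pandas.io.data now lives in the pandas-datareader package.
import pandas_datareader as web

pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))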

In [None]:
pdata

In [None]:
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']

In [None]:
pdata.ix[:, '6/1/2012', :]

In [None]:
pdata.ix['Adj Close', '5/22/2012':, :]

In [None]:
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked

In [None]:
stacked.to_panel()