here we discuss a lot of the essential functionality common to the pandas data structure. Here's the some of the object used in the examples from previous section

In [1]:
import pandas as pd
import numpy as np

In [2]:
index = pd.date_range('1/1/2000', periods=8)
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
df = pd.DataFrame(np.random.randn(8,3), index=index, columns=['A','B','C',])
wp = pd.Panel(np.random.randn(2, 5, 4), items=['item1', 'item2'],
                  major_axis=pd.date_range('1/1/2000', periods=5),
                  minor_axis=['A','B','C','D']
              )

### Head and Tail

To view a small sample of a Sereis or DataFrame, use the head() and tail() methods. The defalut number of elements to display is five, but you may pass a custom number.

In [3]:
long_series = pd.Series(np.random.randn(1000))
print long_series.head()
print long_series.tail(3)

0    1.600384
1   -0.407873
2    0.359483
3    0.532786
4    1.210273
dtype: float64
997   -0.544460
998    1.228556
999    1.610510
dtype: float64


### Attributes and the raw ndarray(s)

pandas object have a number of attributes enabling you to aceess the metadata

- shape: give the axis dimetion of the object. consistent with ndarray
- Axis labels
> - Sereis: index (only axis)
  - DataFrame : index(row) and columns
  - Panel : itmes. major_axis, and minor_axis
  
Note, there attributes can be safely assigned to

In [4]:
print df[:2]
print [x for x in df.columns]
df.columns = [x.lower() for x in df.columns]
print df

                   A         B         C
2000-01-01  0.145118  0.630566  0.337466
2000-01-02 -0.006667 -1.635134 -0.154327
['A', 'B', 'C']
                   a         b         c
2000-01-01  0.145118  0.630566  0.337466
2000-01-02 -0.006667 -1.635134 -0.154327
2000-01-03 -1.993847  0.591524 -1.615003
2000-01-04 -0.701164  0.183968  1.112280
2000-01-05 -0.625369  2.013312 -1.447237
2000-01-06  0.624275 -2.083235 -0.161841
2000-01-07 -0.928790  2.062428  3.404676
2000-01-08 -0.142214 -0.921791  0.230453


To get the actual data inside a data structure, one need only acesss the values property

In [5]:
print s.values
print df.values
print wp.values

[-1.63140465  0.13828657  0.58947383  1.12658133  0.88258443]
[[ 0.14511772  0.63056563  0.33746631]
 [-0.00666655 -1.63513443 -0.15432732]
 [-1.99384747  0.59152437 -1.61500302]
 [-0.70116449  0.18396827  1.1122797 ]
 [-0.62536948  2.01331164 -1.44723727]
 [ 0.624275   -2.08323481 -0.16184054]
 [-0.92878975  2.0624284   3.40467586]
 [-0.14221446 -0.92179092  0.23045339]]
[[[  4.53614677e-01   1.19413750e+00   8.83063425e-01   2.01563526e-02]
  [  1.02468534e+00   6.25340607e-01   6.99774812e-01   3.31339568e-01]
  [  7.14750819e-04  -5.09452983e-01   5.29484013e-01   8.46319166e-01]
  [  3.05680585e-01  -4.29092971e-01  -8.93992702e-01   2.27485822e-01]
  [  3.11783753e-01  -4.39327881e-01   5.74645376e-01   2.99857797e-01]]

 [[ -1.27069649e+00  -8.47117176e-01   1.61655553e+00  -2.80847866e-01]
  [ -5.18218421e-01   1.70342023e+00   7.13978108e-01   5.98623046e-01]
  [ -6.04164928e-02  -3.32276393e-01   7.02853576e-01   1.36219867e-01]
  [  1.10780688e+00   9.18750383e-01   1.779852

if a DataFrame or Panel homogeneously-typed data, the ndarray can actually be modified in place, and the changes will be reflected in the data structure. For heterogeneous data (e.g some of the DataFrame's columns are not all the same dtype), this will not be case. the values attribute itself, unlike the axis labels, cannot be assigned to.

> *Note* When working with hetergeneous data, the dtype of the resulting ndarray will be choosen to accommodate all of the data involved.
For example. if string are involved, the result will be of object dtype. if there are only float and integers, the resulting array will be of float dtype

## Flexible binary operations
with binary operation between pandas data structures, there are two key point of interest:

- broadcasting behavior between higher ( e.g DataFrame) and lower-dimensional(e.g Series) objects
- missing data in computation



### Matching / broadcasting behavior

DataFrame has the methods add, sub, mul, div and related function radd, rsub... for carrying out binary operations. For broadcasting behavior. Series input is of primary interest, Using these functions, you can use to either match on the index or columns via the axis keyword

In [6]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])
    })

print df
row = df.ix[1]
print row
column = df['two']
print column

print df.sub(row, axis='columns')
print df.sub(row, axis=1)
print df.sub(column, axis='index')
print df.sub(column, axis=0)


        one     three       two
a -1.076256       NaN -0.843741
b  0.594169 -0.390963  0.218790
c -0.347490 -0.612919 -0.426137
d       NaN -1.188434 -0.955574
one      0.594169
three   -0.390963
two      0.218790
Name: b, dtype: float64
a   -0.843741
b    0.218790
c   -0.426137
d   -0.955574
Name: two, dtype: float64
        one     three       two
a -1.670425       NaN -1.062532
b  0.000000  0.000000  0.000000
c -0.941659 -0.221956 -0.644928
d       NaN -0.797470 -1.174364
        one     three       two
a -1.670425       NaN -1.062532
b  0.000000  0.000000  0.000000
c -0.941659 -0.221956 -0.644928
d       NaN -0.797470 -1.174364
        one     three  two
a -0.232515       NaN  0.0
b  0.375378 -0.609754  0.0
c  0.078647 -0.186782  0.0
d       NaN -0.232860  0.0
        one     three  two
a -0.232515       NaN  0.0
b  0.375378 -0.609754  0.0
c  0.078647 -0.186782  0.0
d       NaN -0.232860  0.0


Futuremore you can align a level of a multi-indexed DataFrame with a Series

In [7]:
dfmi = df.copy()
dfmi.indes = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (1, 'c'),(2, 'a')], names=['first','second'])
print dfmi.sub(column, axis=0, level='second')
print dfmi

        one     three  two
a -0.232515       NaN  0.0
b  0.375378 -0.609754  0.0
c  0.078647 -0.186782  0.0
d       NaN -0.232860  0.0
        one     three       two
a -1.076256       NaN -0.843741
b  0.594169 -0.390963  0.218790
c -0.347490 -0.612919 -0.426137
d       NaN -1.188434 -0.955574


### Missing data / operation with fill values

in Series and DataFrame (though not yet in Panel), the arithmetic funcitons have the option of inputting a fill_values, namely a value to substitute when at most one of the values at a location are missing.
For Example when adding two dataFrame objects,  you may wish to treat NaN as 0 Unless both DataFrame are missing that values, in which case the result will be NaN( you can later replace NaN with some other values using fillna if you wish)

In [8]:
print df
df2 = df.copy()
df2.iloc[0][1] = 12
print df2
print df + df2
print df.add(df2, fill_value=0)

        one     three       two
a -1.076256       NaN -0.843741
b  0.594169 -0.390963  0.218790
c -0.347490 -0.612919 -0.426137
d       NaN -1.188434 -0.955574
        one      three       two
a -1.076256  12.000000 -0.843741
b  0.594169  -0.390963  0.218790
c -0.347490  -0.612919 -0.426137
d       NaN  -1.188434 -0.955574
        one     three       two
a -2.152513       NaN -1.687483
b  1.188337 -0.781927  0.437581
c -0.694980 -1.225839 -0.852275
d       NaN -2.376868 -1.911147
        one      three       two
a -2.152513  12.000000 -1.687483
b  1.188337  -0.781927  0.437581
c -0.694980  -1.225839 -0.852275
d       NaN  -2.376868 -1.911147


### Flexible Comparsions

pandas introduced binary comparsion methods eq, ne, lt, gt, le and ge to Series and DataFrame whose behavior is analogous to the binary artimetic operations described above:

In [9]:
print df.gt(df2)
print df.ne(df2)

     one  three    two
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False
     one  three    two
a  False   True  False
b  False  False  False
c  False  False  False
d   True  False  False


### Boolean Redcutions

you can apply the reductions: empty, any, all and boolean to provide a way to summarize a boolean result

In [10]:
print df > 0
print (df > 0).all()
print (df > 0).all(axis='columns')
print (df > 0).any()

     one  three    two
a  False  False  False
b   True  False   True
c  False  False  False
d  False  False  False
one      False
three    False
two      False
dtype: bool
a    False
b    False
c    False
d    False
dtype: bool
one       True
three    False
two       True
dtype: bool


### Descriptive statistic

A large number of methods for computing descriptive statistics and other related operation on Series, DataFrame, Panel. Most of these are aggregation(hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but more of them, like cumsum(), cumprod(), produce an object of the same size. Generally speaking there methods take an axis argument, just like ndarray {sum, std,...} but the axis can be specified by name or integer

- Sereis: no axis argument needed
- DataFrame: "index"(axis=0, default), "column"(axis=1)
- Panel: "item"(axis=0),"major"(axis=1,default), "minor"(axis=2)

For example:

In [13]:
print df
print df.mean()
print df.mean(1)

        one     three       two
a -1.076256       NaN -0.843741
b  0.594169 -0.390963  0.218790
c -0.347490 -0.612919 -0.426137
d       NaN -1.188434 -0.955574
one     -0.276526
three   -0.730772
two     -0.501665
dtype: float64
a   -0.959999
b    0.140665
c   -0.462182
d   -1.072004
dtype: float64


All such methods have a skipna option signaling whether to exclude missing data(True by default)

In [15]:
print df.sum(0, skipna=False)
print df.sum(1, skipna=True)

one           NaN
three         NaN
two     -2.006662
dtype: float64
a   -1.919998
b    0.421996
c   -1.386547
d   -2.144007
dtype: float64


combined with the broadcasting / aritmetic behavior, one can descibe various statistical procedure, like standardization (rendering data zero mean and standard deviation 1), very concisely

In [23]:
ts_stand = (df - df.mean())/ df.std()
print ts_stand
print ts_stand.std()

xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
print xs_stand.std(1)

        one     three       two
a -0.954936       NaN -0.643478
b  1.039672  0.825599  1.355248
c -0.084736  0.286335  0.142076
d       NaN -1.111934 -0.853846
one      1.0
three    1.0
two      1.0
dtype: float64
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64


Note that method like cumsum() and cumprod() preserve the location of NA values:

In [25]:
print df
print df.cumsum()

        one     three       two
a -1.076256       NaN -0.843741
b  0.594169 -0.390963  0.218790
c -0.347490 -0.612919 -0.426137
d       NaN -1.188434 -0.955574
        one     three       two
a -1.076256       NaN -0.843741
b -0.482088 -0.390963 -0.624951
c -0.829578 -1.003883 -1.051088
d       NaN -2.192317 -2.006662


here is a quick reference sumarry table of common functions. Each also takes an optional level parameter which applies only if the object has hierarchical index
- Function : Description
- count: Number of non-null observations
- sum: sum of values
- mean: mean of values
- mad: mean absolute deviation
- median: artimetic median of values
- min : minimun
- max: maximun
- mode: xmode
- abs: absoulte value
- prod: product of values
- std: Bessel-corrected sample standart defiveaton
- var: unvised variance
- sem: stadard error of the mean
- skew: sample skwe
- kurt: Sample kurtoris
- quantile: smaple quantile
- comsum: cumulative sum
- cumprod: 