pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy. It's built on top of NumPy. 

**Series** is a 1D array-like object containing an array of data. 

**Index** is the associated array of data labels.

In [170]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [171]:
obj = Series([4,7,-5,3])
print obj

# Index at left column by default starts at 0

0    4
1    7
2   -5
3    3
dtype: int64


We can retrieve the values and the index of a Series through its `values` and `index` attributes, respectively

In [172]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [173]:
obj.index

RangeIndex(start=0, stop=4, step=1)

We often want to create a Series with a specific index that clearly identifies each data point. We can do so by passing the `index` parameter to the `Series` constructor

In [174]:
obj2=Series([4,7,-5,3], index =['d','b','a','c'])
print obj2

d    4
b    7
a   -5
c    3
dtype: int64


In [175]:
obj2.index  # u for unicode

Index([u'd', u'b', u'a', u'c'], dtype='object')

Retrieve a value by passing its index label in brackets

In [176]:
obj2['a']

-5

To retrieve more than one value, pass a list of index labels

In [177]:
obj2[['c','a','d']]

c    3
a   -5
d    4
dtype: int64

Math operations and filtering on a Series preserve the index-value relationship

In [178]:
obj2[obj2 >0]

d    4
b    7
c    3
dtype: int64

In [179]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [180]:
np.exp(obj2) # Calculate the exponential of all elements 

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way of thinking about a Series is as a fixed-length, ordered dict, as it maps index values to data values

In [181]:
'b' in obj2

True

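We can create a Series from a Python dictionary by passing the dictionary to the `Series()` constructor. The keys of the dictionary become the index labels and the values stay as the values. In the pandas version used here, the resulting Series is sorted by its index; more recent versions keep the dictionary's insertion order instead.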

In [182]:
# dictionary
sdata= {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah':5000}

obj3=Series(sdata)

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [183]:
# Explicitly assign index using a variable containing an array of labels

states = ['California','Ohio','Oregon','Texas']

obj4=Series(sdata, index=states)

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Use the `isnull` and `notnull` functions in pandas to detect missing data

In [184]:
print pd.isnull(obj4)

print pd.notnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
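
The same checks are also available as methods on the Series itself. The sketch below rebuilds `obj4` so that it runs on its own (it assumes Python 3, hence the `print()` calls), and uses the fact that `True` counts as 1 when summed to count the missing entries. The variable names `missing_mask`, `present_mask` and `n_missing` are only illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Method forms of pd.isnull / pd.notnull
missing_mask = obj4.isnull()
present_mask = obj4.notnull()

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(missing_mask.sum())

print(missing_mask)
print(n_missing)
```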


A critical Series feature for many applications is that it automatically aligns indexed data in arithmetic operations. For example:

In [185]:
print obj3,obj4
print''
print obj3 + obj4

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64 California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


Both a Series and its index have a `name` attribute, so you can name either one.

In [186]:
obj4.name = 'population'

obj4.index.name = 'state'

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be updated in place by explicit assignment

In [187]:
obj.index=['Bob','Steve','Jeff','Ryan']

obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

A DataFrame represents a tabular, spreadsheet-like structure in which each column can have a different value type (numeric, boolean, string, etc.). A DataFrame has both a row index and a column index. It can be thought of as a dictionary of Series that share the same index.

There are many ways to build a DataFrame, though one of the most popular is from a dictionary of equal-length lists or NumPy arrays.

In [188]:
#Dictionary
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000,2001,2002,2001,2002],
        'pop': [1.5,1.7,3.6,2.4,2.9]}

frame=DataFrame(data)
frame

# Index automatically assigned from 0 to N-1
# Columns are automatically sorted in alphabetical order

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


You can explicitly specify the order of the columns by passing the column headings through the `columns` parameter of DataFrame

In [189]:
DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


As with Series, if you pass a column name that isn't contained in data, the column is created and filled with NaN

In [190]:
frame2=DataFrame(data, columns = ['year','state','pop','debt'],
                 index =['one','two','three','four','five'])

frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [191]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Return a column of DataFrame by using dictionary-like notation or by attribute notation

In [192]:
# bracket/dict-like notation
print frame2['state']

# dot/attribute notation
print frame2.year

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64


Rows can also be retrieved by label using the `loc` indexer. To retrieve by position, use the `iloc` indexer.

In [193]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [194]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values.

In [195]:
frame2['debt'] = 16.5

frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [196]:
frame2['debt'] = np.arange(5)

frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


When assigning <u>lists or arrays</u> to a column, the length of the value must match the length of the DataFrame.

If we assign a Series to a column, it is instead aligned to the DataFrame's index, with missing values inserted wherever the Series has no matching label. This is similar to a VLOOKUP in Excel.

In [197]:
val = Series([-1.2,-1.5,-1.7], index=['two','four','five'])

print val

frame2['debt'] = val

print''
print frame2

two    -1.2
four   -1.5
five   -1.7
dtype: float64

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7


Assigning a column that doesn't exist will create a brand new column. The `del` keyword will delete columns as with a dict.

In [198]:
# Create a column called eastern whose values take on True if state = Ohio

frame2['eastern']=frame2['state']=='Ohio'

frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


In [199]:
del frame2['eastern']
frame2.columns
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7


Another common form of data is a nested dict of dicts format (Outer keys, Inner keys, Values). If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner keys as the row indices:

In [200]:
pop={'Nevada':{2001: 2.4, 2002: 2.9},
     'Ohio': {2000: 1.5,2001: 1.7, 2002:3.6}}

frame3 = DataFrame(pop)

print frame3, "\n \n",frame3.T # Pivot using transpose

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6 
 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


The keys of the inner dicts are unioned and sorted to form the index, unless an explicit index is specified

In [201]:
DataFrame(pop, index=[2001,2002,2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [202]:
pdata = {'Ohio':frame3['Ohio'][:-1],
         'Nevada':frame3['Nevada'][:2]}

DataFrame(pdata)

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7


If a DataFrame's `index` and `columns` have their `name` attributes set, these will also be displayed in the DataFrame

In [203]:
frame3.index.name = 'year'
frame3.columns.name='state'
frame3

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


Like Series, the `values` attribute returns the data contained in the DataFrame as a 2D ndarray

In [204]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [205]:
print frame2, "\n"
print frame2.values

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7 

[[2000L 'Ohio' 1.5 nan]
 [2001L 'Ohio' 1.7 -1.2]
 [2002L 'Ohio' 3.6 nan]
 [2001L 'Nevada' 2.4 -1.5]
 [2002L 'Nevada' 2.9 -1.7]]


#### Index Objects 

In [206]:
obj = Series(range(3), index = ['a','b','c'])

index = obj.index

index[0:]

index = pd.Index(np.arange(3))

obj2 = Series([1.5,-2.5,0], index = index)

obj2.index is index

True
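
One property of Index objects worth knowing when sharing them between data structures, as above, is that they are immutable. The following sketch (assuming Python 3; the flag name `mutated` is only illustrative) shows that item assignment on an Index raises a `TypeError`, while replacing the whole index of a Series is allowed.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

try:
    index[1] = 'd'  # Index does not support item assignment
    mutated = True
except TypeError as err:
    mutated = False
    print('TypeError:', err)

# Replacing the whole index is allowed; it attaches a new Index object
obj.index = ['x', 'y', 'z']
print(obj.index)
```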

In [207]:
print frame3
print''

print 'Ohio' in frame3.columns

print 2001 in frame3.index

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

True
True


### Essential Functionality 

#### Reindexing

Reindexing creates a new object with the data of another object fitted to a new index. Any index labels that are not present in the original get `NaN` unless we pass the `fill_value` parameter

In [208]:
obj = Series([4.5,7.2,-5.3,3.6], index=['d','b','a','c'])
print obj

obj2=obj.reindex(['a','b','c','d','e'])
print obj2

# Fill empty values with 0
obj.reindex(['a','b','c','d','e'], fill_value=0)

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64


a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

Sometimes, it's desirable to do some **interpolation** or filling of values between other values when reindexing (similar to Excel). Interpolation is a method of constructing new data points within the range of a discrete set of known data points. This is common for ordered data like time series. The `method='ffill'` option of `reindex` fills the empty cells by carrying the last valid value forward.

In [209]:
obj3 = Series(['blue','purple','yellow'], index = [0,2,4])

obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
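
For comparison, `reindex` also accepts `method='bfill'`, which fills from the next valid value instead of the previous one. A minimal sketch, assuming Python 3 and the same `obj3` as above (the names `forward` and `backward` are illustrative):

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = obj3.reindex(range(6), method='ffill')   # carry previous value forward
backward = obj3.reindex(range(6), method='bfill')  # pull next value backward

# Side by side: label 5 has no later value, so bfill leaves it missing
print(pd.DataFrame({'ffill': forward, 'bfill': backward}))
```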

With DataFrame, reindex can change either row index, columns, or both. When passed just a sequence, the rows are reindexed in the result: 

In [210]:
frame = DataFrame(np.arange(9).reshape(3,3), index=['a','c','d'],
                  columns=['Ohio','Texas','California'])

frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [211]:
frame2=frame.reindex(['a','b','c','d'])
print frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


Columns can be reindexed using the `columns` keyword

In [212]:
print frame

states=['Texas','Utah','California']

frame.reindex(columns=states)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


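Both can be reindexed in one call, though interpolation only applies row-wise. In the pandas version used here, reindexing can also be done more succinctly by label-indexing with `loc` (pandas 1.0 and later instead raise a `KeyError` when the list contains labels that do not exist, so prefer `reindex` there):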

In [213]:
frame.loc[['a','b','c','d'], states]

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
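
A spelling that does not depend on how `loc` treats missing labels is to pass both `index` and `columns` to `reindex`. This is a small sketch assuming Python 3 and the same `frame` and `states` as above; the name `both` is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Labels missing from the original ('b' and 'Utah') become NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(both)
```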


#### Dropping entries from an axis (index, column)

Dropping one or more entries from an axis is easy if you have an index array or list without those entries. The `drop` method returns a new object with the indicated value or values deleted from an axis

In [214]:
obj = Series(np.arange(5.), index=['a','b','c','d','e'])

new_obj = obj.drop('c')

print new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64


In [215]:
#Delete from axis 0 (indices)

obj.drop(['d','c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [216]:
data = DataFrame(np.arange(16).reshape(4,4),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns = ['one','two','three','four'])

print data, "\n"
print data.drop(['Colorado','Ohio']), "\n"
print data.drop('two',axis=1), "\n"
print data.drop(['two','four'], axis=1)


          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15 

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15 

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15 

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


####  Indexing, selection, and filtering

Series indexing (`obj[...]`) works like NumPy array indexing, except that you can use the Series's index labels instead of only integers.

In [217]:
obj = Series(np.arange(4.), index = ['a','b','c','d'])

print obj['b']
print obj[2:4]
print obj[['a','b','c']]
print obj[[1,3]]
print obj[obj <2]

1.0
c    2.0
d    3.0
dtype: float64
a    0.0
b    1.0
c    2.0
dtype: float64
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64


Slicing with labels behaves differently from normal Python slicing in that the endpoint is included in the result:

In [218]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
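
The contrast is easiest to see by writing the label slice and the positional slice explicitly with `loc` and `iloc`. A minimal sketch, assuming Python 3 and a fresh copy of `obj` (the names `by_label` and `by_position` are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]     # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```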

Assign values using slicing

In [219]:
obj['b':'c'] = 5

obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [220]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [221]:
# Selecting more than 1 column
data[['two','three']]

          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14


In [222]:
# First 2 rows (slicing with [] selects rows, not columns)

data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [223]:
# Select rows where the value in column 'three' is > 5
data[data['three']>5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [224]:
# Return a boolean Series
data['three']>5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [225]:
data > 5

            one    two  three   four
Ohio      False  False  False  False
Colorado  False  False   True   True
Utah       True   True   True   True
New York   True   True   True   True


In [226]:
data[data <5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [227]:
# Select certain columns and rows at once

print data.loc['Colorado',['two','three']]

print data.loc[['Colorado','Utah'],['two','three']]

two      5
three    6
Name: Colorado, dtype: int32
          two  three
Colorado    5      6
Utah        9     10
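
Two related selections may help here: the positional equivalent with `iloc`, and a boolean row filter combined with a column list in a single `loc` call. The sketch assumes Python 3 and rebuilds `data` from `np.arange(16)` without the earlier in-place zeroing; `by_position` and `filtered` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Same cells as data.loc[['Colorado', 'Utah'], ['two', 'three']], by position
by_position = data.iloc[[1, 2], [1, 2]]

# A boolean row mask can be combined with a column list in one loc call
filtered = data.loc[data['three'] > 5, ['one', 'four']]

print(by_position)
print(filtered)
```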


### Arithmetic and data alignment

When objects are added together and their indexes do not match, the index of the result is the union of the two indexes, and labels found in only one of the objects get NaN

**list**

Return a list whose items are the same and in the same order as iterable's items. iterable may be either a sequence, a container that supports iteration, or an iterator object. If iterable is already a list, a copy is made and returned, similar to `iterable[:]`. For instance, `list('abc')` returns `['a', 'b', 'c']` and `list((1, 2, 3))` returns `[1, 2, 3]`. If no argument is given, returns a new empty list, `[]`.

In [228]:
s1=Series([7.3,-2.5,3.4,1.5], index=['a','c','d','e'])
s2=Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])

print s1, s2

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64 a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [229]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

For arithmetic on DataFrames, alignment happens on both the rows and the columns, and NaN is introduced wherever a label is missing from either object. Missing values propagate in arithmetic: a label that exists in only one of the two DataFrames still produces NaN in the result.

In [230]:
df1 = DataFrame(np.arange(9).reshape((3,3)),columns=list('bcd'),
                 index=['Ohio','Texas','Colorado'])
                
df2= DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])

df1

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8


In [231]:
df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [232]:
df1+df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


To override this, call the `add` method on one DataFrame, passing the other DataFrame together with `fill_value=0`. A value that is missing from one DataFrame is then treated as 0, so the value from the other DataFrame shows up in the result

**fill_value**
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

In [233]:


df1 = DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

df1

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


In [234]:
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [235]:
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [236]:
# subtraction
print df1.sub(df2, fill_value=0)
#division
print df1.div(df2, fill_value=0)
#multiplication
print df1.mul(df2, fill_value=0)

      a     b     c     d     e
0   0.0   0.0   0.0   0.0  -4.0
1  -1.0  -1.0  -1.0  -1.0  -9.0
2  -2.0  -2.0  -2.0  -2.0 -14.0
3 -15.0 -16.0 -17.0 -18.0 -19.0
     a         b         c         d    e
0  NaN  1.000000  1.000000  1.000000  0.0
1  0.8  0.833333  0.857143  0.875000  0.0
2  0.8  0.818182  0.833333  0.846154  0.0
3  0.0  0.000000  0.000000  0.000000  0.0
      a     b      c      d    e
0   0.0   1.0    4.0    9.0  0.0
1  20.0  30.0   42.0   56.0  0.0
2  80.0  99.0  120.0  143.0  0.0
3   0.0   0.0    0.0    0.0  0.0
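
The same flexible arithmetic methods exist on Series. As a sketch (assuming Python 3, and reusing the `s1` and `s2` values from the Series alignment example earlier), `add` with `fill_value=0` keeps the labels that appear in only one of the two Series instead of turning them into NaN. The name `total` is illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# A label present in only one Series is treated as 0 in the other
total = s1.add(s2, fill_value=0)
print(total)
```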


#### Operations between DataFrame and Series 

In [237]:
arr = np.arange(12.).reshape((3,4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [238]:
print arr[0]
print''
print "Array - arr[0] for each row: ","\n\n",arr - arr[0]

[ 0.  1.  2.  3.]

Array - arr[0] for each row:  

[[ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]
 [ 8.  8.  8.  8.]]


In [239]:
frame = DataFrame(np.arange(12.).reshape((4,3)), columns = list('bde'), index=['Utah','Ohio','Texas','Oregon'])

series =frame.iloc[0]

print frame
print series

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


By default, arithmetic between DataFrames and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows.

**broadcasting**

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

In [240]:
frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


If you want to broadcast over the columns instead, matching on the rows, use one of the arithmetic methods and pass the <u>axis to match on (axis=0)</u>. The Series index is then matched against the DataFrame's row index

In [241]:
series3=frame['d']

frame.sub(series3,axis=0)

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
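
A common use of matching on the rows is centering each row around its own mean. The sketch below assumes Python 3 and the same `frame` as above; `row_means` and `centered` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis=1)            # one value per row label
centered = frame.sub(row_means, axis=0)   # match the Series on the row index

print(centered)
```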


### Function Application and Mapping

NumPy ufuncs (element-wise array functions) work on pandas objects:

In [242]:
frame = DataFrame(np.random.randn(4,3), columns = list('bde'),
                  index=['Utah','Ohio','Texas','Orgeon'])

print frame
print np.abs(frame)

               b         d         e
Utah   -0.207254  2.268150  0.134474
Ohio   -0.147122  0.561105  0.288688
Texas   0.898882  0.506460  1.574982
Orgeon -0.478546  1.175631 -0.950428
               b         d         e
Utah    0.207254  2.268150  0.134474
Ohio    0.147122  0.561105  0.288688
Texas   0.898882  0.506460  1.574982
Orgeon  0.478546  1.175631  0.950428


Another frequent operation is applying a function that works on 1D arrays to each column or row. The DataFrame's `apply` method does this.

**lambda**
Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda". This is not exactly the same as lambda in functional programming languages, but it is a very powerful concept that's well integrated into Python and is often used in conjunction with typical functional concepts like filter(), map() and reduce().

Note that the lambda definition does not include a "return" statement -- it always contains an expression which is returned. Also note that you can put a lambda definition anywhere a function is expected, and you don't have to assign it to a variable at all.

In [243]:
f= lambda x: x.max() - x.min()

# By column --> Returns results by column
frame.apply(f, axis = 0)

b    1.377428
d    1.761691
e    2.525410
dtype: float64

In [244]:
# by Row
frame.apply(f, axis = 1)

Utah      2.475404
Ohio      0.708227
Texas     1.068523
Orgeon    2.126058
dtype: float64
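
Many common reductions are already DataFrame methods, so `apply` is not always needed. As a sketch (assuming Python 3; the frame is filled with arbitrary seeded random numbers, not the values printed above), the range per column can be computed either way with the same result:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed; the values themselves are arbitrary
frame = pd.DataFrame(rng.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

via_apply = frame.apply(lambda x: x.max() - x.min())
via_methods = frame.max() - frame.min()   # built-in reductions, no apply

print(via_apply)
print(np.allclose(via_apply, via_methods))
```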

The function passed to `apply` need not return a scalar value. It can also return a Series with multiple values.

In [245]:
def f(x):
    return Series([x.min(), x.max()], index=['min','max'])

print frame

# Calculates the min, max for each column
frame.apply(f)

               b         d         e
Utah   -0.207254  2.268150  0.134474
Ohio   -0.147122  0.561105  0.288688
Texas   0.898882  0.506460  1.574982
Orgeon -0.478546  1.175631 -0.950428


            b         d         e
min -0.478546  0.506460 -0.950428
max  0.898882  2.268150  1.574982


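Element-wise Python functions can be applied with the `applymap` method. It's called `applymap` because Series has a `map` method for applying an element-wise function. (Recent pandas releases provide the same operation as `DataFrame.map` and deprecate `applymap`.)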

In [246]:
format = lambda x: '%.1f' % x

# DataFrame elementwise application
frame.applymap(format)

           b    d     e
Utah    -0.2  2.3   0.1
Ohio    -0.1  0.6   0.3
Texas    0.9  0.5   1.6
Orgeon  -0.5  1.2  -1.0


In [247]:
# Series elementwise application
frame['e'].map(format)

Utah       0.1
Ohio       0.3
Texas      1.6
Orgeon    -1.0
Name: e, dtype: object
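
An element-wise result for a whole DataFrame can also be built from `Series.map` alone, by applying it column by column. This sketch assumes Python 3 and uses arbitrary seeded random values; the helper name `fmt` is illustrative (and avoids shadowing the built-in `format`).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed; the values themselves are arbitrary
frame = pd.DataFrame(rng.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def fmt(x):
    """Format one number with a single decimal place."""
    return '%.1f' % x

# Series.map on every column gives an element-wise result for the whole frame
formatted = frame.apply(lambda col: col.map(fmt))
print(formatted)
```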

### Sorting and ranking

A data set can be sorted by several criteria, such as by its index or by its values

In [248]:
obj=Series(range(4), index=['d','a','b','c'])
print obj

# Sort series by index
obj.sort_index()

#Sort dataframe columns alphabetically in ascending order by default
frame=DataFrame(np.arange(8).reshape((2,4)), index =['three','one'],
                columns = ['d','a','b','c'])

print frame.sort_index(axis=1)



d    0
a    1
b    2
c    3
dtype: int64
       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [249]:
# Sort columns in desc order
print frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, use the `sort_values` method. Missing values are sorted to the end of the Series by default

In [250]:
obj=Series([4,7,-3,2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [251]:
obj = Series([4,np.nan,7,np.nan,-3,0])

obj.sort_values()

4   -3.0
5    0.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

To sort by the values in one or more columns, pass a column name or a list of names to the `by` option of the `sort_values` method

In [252]:
frame=DataFrame({'spend':[4,7,-3,2],'category':[0,1,0,1]})

frame

   category  spend
0         0      4
1         1      7
2         0     -3
3         1      2


In [253]:
# Sort DataFrame by values in column spend
frame.sort_values(by='spend', ascending=False)

   category  spend
1         1      7
0         0      4
3         1      2
2         0     -3


In [254]:
# Sort by multiple columns

frame.sort_values(by=['category','spend'])

   category  spend
2         0     -3
0         0      4
3         1      2
1         1      7
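
When sorting by several columns, `ascending` can also be a list with one flag per column. A minimal sketch, assuming Python 3 and the same `frame` as above (the name `ordered` is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'spend': [4, 7, -3, 2], 'category': [0, 1, 0, 1]})

# category ascending, then spend descending within each category
ordered = frame.sort_values(by=['category', 'spend'], ascending=[True, False])
print(ordered)
```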


**Ranking** is closely related to sorting: it assigns ranks from 1 up to the number of valid data points, and ties are broken according to a rule (by default `rank` breaks ties by assigning each tied group the mean rank).

Tie-breaking methods: `average`, `min`, `max`, `first`

In [255]:
obj = Series([7,-5,7,4,2,0,4])

obj.rank() # Rank returned

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [256]:
# Assigning rank based on the order they're observed in data

obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [257]:
# Rank in descending order

obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
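
To compare tie-breaking rules side by side, the ranks can be collected into one DataFrame. This sketch assumes Python 3 and the same `obj` as above; besides the methods listed earlier it also uses `method='dense'`, which is like `min` but increases the rank by exactly 1 between groups.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking rule, for the same underlying values
ranks = pd.DataFrame({'average': obj.rank(),
                      'min': obj.rank(method='min'),
                      'dense': obj.rank(method='dense')})
print(ranks)
```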

In [258]:
frame = DataFrame({'b': [4.3,7,-3,2], 'a': [0,1,0,1],'c':[-2,5,8,-2.5]})

print frame
print frame.rank(axis=1) # axis=1: rank the values within each row (across the columns)
print frame.rank(axis=0) # axis=0: rank the values within each column (down the rows)


   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
     a    b    c
0  1.5  3.0  2.0
1  3.5  4.0  3.0
2  1.5  1.0  4.0
3  3.5  2.0  1.0


In [259]:
obj = Series(range(5), index=['a','a','b','b','c'])

obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index's `is_unique` property can tell you whether its values are unique or not

In [260]:
print obj.index.is_unique
print frame.index.is_unique

False
True


Selecting a label that has multiple entries returns a Series, while a label with a single entry returns a scalar value

In [261]:
obj['a']

a    0
a    1
dtype: int64

In [262]:
df = DataFrame(np.random.randn(4,3), index = ['a','a','b','b'])

print df.loc['b']

          0         1         2
b -1.506353 -0.195720 -0.973282
b  0.225338  1.193085  1.175269
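
When duplicate labels are unwanted, two common ways of dealing with them are keeping the first entry per label or combining the entries. The sketch assumes Python 3 and the same duplicate-labelled `obj` as above; `dup_mask`, `first_only` and `summed` are illustrative names.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup_mask = obj.index.duplicated()     # True for repeats after the first occurrence
first_only = obj[~dup_mask]           # keep the first entry per label
summed = obj.groupby(level=0).sum()   # or combine all entries per label

print(first_only)
print(summed)
```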


###  Summarizing and Computing Descriptive Statistics

Math and stat methods are built from the ground up to exclude missing data.

In [263]:
df = DataFrame([[1.4,np.nan],[7.1, -4.5],[np.nan,np.nan],[0.75,-1.3]],
                index=['a','b','c','d'],
                columns = ['one','two'])

df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [264]:
# sum returns a Series of column sums, indexed by column name

df.sum()

one    9.25
two   -5.80
dtype: float64

In [265]:
# Passing axis=1 sums across the columns, giving one total per row.
# axis is the axis to reduce over: 0 collapses the rows, 1 collapses the columns
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [266]:
# With skipna=False, any row containing a missing value yields NaN

df.mean(axis=1,skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

`describe`  is a method that produces multiple summary statistics in one snapshot

In [267]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


In [268]:
# On non-numeric data, describe produces different statistics

obj=Series(['a','a','b','c']*4)

print obj

obj.describe()

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object


count     16
unique     3
top        a
freq       8
dtype: object

In [269]:

# Each reduction runs down the rows (axis=0), so one value per column is returned
print df.var()
print''
print df.std()
print''
print df.sum()
print ''
print df.mean()

one    12.205833
two     5.120000
dtype: float64

one    3.493685
two    2.262742
dtype: float64

one    9.25
two   -5.80
dtype: float64

one    3.083333
two   -2.900000
dtype: float64
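
Not every summary method is a reduction to a number. As a sketch (assuming Python 3 and the same `df` as above), `idxmax` and `idxmin` return the row label at which each column reaches its extreme, and `cumsum` returns running totals of the same shape as the input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

print(df.idxmax())   # row label of each column's maximum
print(df.idxmin())   # row label of each column's minimum
print(df.cumsum())   # running total down each column; NaN entries stay NaN
```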


### Correlation and Covariance

Some summary statistics, such as correlation and covariance, are computed from pairs of arguments. Consider DataFrames of stock prices and volumes obtained from Yahoo! Finance

In [270]:
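from pandas_datareader import data as web

all_data={}

# get_data_yahoo returns historical daily prices, including the 'Adj Close'
# and 'Volume' columns used below (get_quote_yahoo returns only a current
# quote snapshot, which has no 'Adj Close' column).
# The call needs network access and depends on Yahoo's service being reachable.
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price=DataFrame({tic: data['Adj Close']
                 for tic, data in all_data.items()})
volume=DataFrame({tic: data['Volume']
                  for tic, data in all_data.items()})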
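
Because the download depends on an external service, the correlation and covariance methods themselves can be tried on synthetic data. Everything in this sketch is an assumption made for illustration: the column names `AAA`, `BBB`, `CCC` are hypothetical tickers and the values are seeded random numbers, not market data. It assumes Python 3.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)   # synthetic data, not real market returns
base = rng.randn(250)
returns = pd.DataFrame({'AAA': base,
                        'BBB': 0.8 * base + 0.2 * rng.randn(250),  # tracks AAA
                        'CCC': rng.randn(250)})                    # independent

pair_corr = returns['AAA'].corr(returns['BBB'])   # scalar for one pair
pair_cov = returns['AAA'].cov(returns['BBB'])     # scalar for one pair
corr_matrix = returns.corr()                      # full correlation matrix
to_aaa = returns.corrwith(returns['AAA'])         # every column against one Series

print(corr_matrix)
print(to_aaa)
```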

#### Unique Values, Value Counts, and Memberships

Another class of related methods extracts info about the values contained in a 1D Series.

In [None]:
obj = Series(['c','a','d','a','a','b','b','c','c'])

# Return unique values of the series
print obj.unique()

# Return a Series with the unique values as its index and their frequencies as its values, sorted by count in descending order
print obj.value_counts()

# Compute a boolean Series indicating whether each value is contained in the passed sequence. Pass a list-like (list, tuple or set), not a single string
print obj.isin(('a','b'))


In some cases, we may want to compute a histogram on multiple columns. Passing `pandas.value_counts` to the `apply` method gives a frequency table

In [None]:
data = DataFrame({'Qu1': [1,3,4,3,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

result = data.apply(pd.value_counts)
print result

# To replace NaN's with 0's
result.fillna(0)
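
The same frequency table can be written with the `Series.value_counts` method, which avoids relying on the top-level `pandas.value_counts` function. A sketch assuming Python 3 and the same `data` as above (the name `counts` is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# value_counts per column; values absent from a column become NaN, then 0
counts = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(counts)
```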

### Handling Missing Data

Missing data is common in most applications. All of the descriptive statistics on pandas objects exclude missing data. pandas uses NaN to represent missing values in both floating-point and non-floating-point arrays

In [None]:
string_data = Series(['aardvark','artichoke',np.nan,'avocado'])

print string_data.isnull()

# Fill NA's with another value
string_data.fillna("Not Avail")

### Filtering Out Missing Data

`dropna` is very useful. On a Series, it returns the series with only the non-null data and index values

In [None]:
from numpy import nan as NA

data=Series([1,NA,3.5,NA,7])

data.dropna()

In [None]:
# Alternatively, use boolean indexing
data[data.notnull()]

With DataFrame objects, you may want to drop only the rows or columns that are entirely NA, or those containing any NA. By default, `dropna` drops any row containing a missing value.

In [None]:
data= DataFrame([[1.,6.5,3.], [1.,NA,NA],
                 [NA,NA,NA], [NA,6.5,3.]])

# drops rows containing NA
cleaned =data.dropna()

print data
print cleaned
print''

# Only drop rows where values are all NA
cleaned = data.dropna(how='all')
print cleaned

Dropping columns works the same way: pass `axis=1`

In [None]:
data[4] = NA

data

In [None]:
data.dropna(axis=1, how='all')

A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of observations. Use the `thresh` argument

In [None]:
df = DataFrame(np.random.randn(7,3))

df.iloc[:4,1] = NA

df

In [None]:
df.dropna(thresh=3) # Only rows with min 3 observations are returned
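
Since the frame above is random, a small fixed example may make the effect of `thresh` easier to follow. This is a sketch assuming Python 3; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0],      # 2 observations
                   [np.nan, np.nan, 6.0],   # 1 observation
                   [7.0, 8.0, 9.0]])        # 3 observations

at_least_two = df.dropna(thresh=2)    # keeps rows 0 and 2
at_least_three = df.dropna(thresh=3)  # keeps only row 2

print(at_least_two)
print(at_least_three)
```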

Fill in missing data with 0 to avoid throwing away data

In [None]:
df.fillna(0)

Calling `fillna` with a dict allows different fill values for each column

In [None]:
df.fillna({1:0.5,2:-1})

In [None]:
# fillna returns a new object but you can modify the existing object using inplace=True

df.fillna(0, inplace = True)
df

In [None]:
df = DataFrame(np.random.randn(6,3))

df.iloc[2:,1] = NA; df.iloc[4:,2] = NA

df

In [None]:
# Forward fill: propagate the last valid value down each column

print df.fillna(method='ffill')

print df.fillna(method='ffill', limit=2)
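
To make the effect of `limit` concrete, here is a small sketch on a hand-written Series (illustrative data, not from the original notebook). It uses `ffill()`, the method form of `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a run of three consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, the last valid value (1.0) fills the whole gap
filled_all = s.ffill()

# With limit=2, only the first two missing values in the run are filled
filled_limited = s.ffill(limit=2)
remaining_na = int(filled_limited.isnull().sum())
```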

With `fillna` you can do lots of other things with a little creativity. For example, you can pass the mean or median value of a Series:

In [None]:
data = Series([1.,NA,3.5,NA,7])

data.fillna(data.mean())

**Hierarchical indexing** is an important feature of pandas that lets you have multiple index levels on an axis. It provides a way to work with higher-dimensional data in a lower-dimensional form, loosely comparable to pivot tables in Excel.


**Lists** are enclosed in square brackets ( [ and ] ) and **tuples** in parentheses ( ( and ) ).

In [None]:
# Series with a list of lists or arrays as the index

data = Series(np.random.randn(10),
              index=[['a','a','a','b','b','b','c','c','d','d'],
                     [1,2,3,1,2,3,1,2,2,3]])

data # prettified view of series with a Multiindex

A **MultiIndex** is the index object used when a Series or DataFrame has more than one level of indices.

In [None]:
data.index
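
A MultiIndex can also be built explicitly. The sketch below uses `pd.MultiIndex.from_arrays`; the level names `outer` and `inner` and the data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Build a two-level index explicitly (the level names are illustrative)
idx = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['outer', 'inner'])

s = pd.Series([10, 20, 30, 40], index=idx)

n_levels = s.index.nlevels   # number of index levels
first_label = s.index[0]     # each label is a tuple, one entry per level
outer_values = s.index.get_level_values('outer').tolist()
```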

With a hierarchically indexed object, **partial indexing** is possible, enabling us to concisely select subsets of the data.

In [None]:
print data['b':'c']
print''
print data[:,2] # Selecting from a nested level
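
Since the Series above holds random values, here is a deterministic sketch of the same selections written with `.loc`. The data and names are illustrative, not from the original notebook.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50],
              index=[['a', 'a', 'b', 'b', 'c'],
                     [1, 2, 1, 2, 1]])

# Selecting on the outer level returns a Series indexed by the inner level
b_part = s.loc['b']

# Selecting on the inner level keeps the matching outer labels
inner_two = s.loc[:, 2]

# A full (outer, inner) key returns a single value
single = s.loc[('c', 1)]
```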

The data can be rearranged into a DataFrame using its `unstack` method.

In [None]:
print data

In [None]:
data.unstack() # DataFrame where inner level becomes column and outer level becomes rows

In [None]:
# Reverse operation is stack

data.unstack().stack()
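
One detail worth seeing on a small example: when some (outer, inner) combinations are absent, `unstack` fills the corresponding cells with NaN. The sketch below uses illustrative data. It calls `dropna()` after `stack()` as a defensive assumption, because whether `stack` itself discards those NaN cells can depend on the pandas version.

```python
import pandas as pd

# Illustrative Series in which the combination ('b', 2) is absent
s = pd.Series([1.0, 2.0, 3.0],
              index=[['a', 'a', 'b'], [1, 2, 1]])

# unstack pivots the inner level into columns; the absent cell becomes NaN
wide = s.unstack()
n_missing = int(wide.isnull().sum().sum())

# stack pivots the columns back into the inner index level; dropna() is
# applied explicitly so the result does not rely on stack dropping NaN
restored = wide.stack().dropna()
```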

With a DataFrame, either axis can have a hierarchical index

In [None]:
# Hierarch index on columns axis=1

frame = DataFrame(np.arange(12).reshape((4,3)),
                  index=[['a','a','b','b'],[1,2,1,2]],
                  columns = [['Ohio','Ohio','Colorado'],
                             ['Green','Red','Green']])

frame

Hierarchical levels can have names. (Don't confuse the index names with the axis labels.)

In [None]:
frame.index.names = ['key1','key2']

frame.columns.names = ['states','colors']

frame

With partial column indexing, we can select groups of columns just as we did with rows earlier.

In [None]:
frame['Ohio']
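
As a sketch of what partial column indexing returns, the following rebuilds a frame shaped like the one above under the illustrative name `df` and selects first by the outer label, then by a full (outer, inner) tuple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])

# An outer column label selects a sub-frame keyed by the inner labels
ohio = df['Ohio']
ohio_columns = ohio.columns.tolist()

# A full (outer, inner) tuple selects a single column as a Series
ohio_red = df[('Ohio', 'Red')].tolist()
```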

### Reordering and Sorting Levels

It's common to rearrange the order of the levels on an axis or to sort the data by the values in one level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels swapped.

`DataFrame.swaplevel(i=-2, j=-1, axis=0)`

**i, j :** int, string (can be mixed)
Level of index to be swapped. Can pass level name as string.

In [None]:
frame

In [None]:
frame.swaplevel('key1','key2')

# Swap by rows

In [None]:
frame.swaplevel('states','colors', axis=1)

# Swap by columns

`sort_index`, on the other hand, sorts the data; with its `level` argument it can sort by the values in a single level. When swapping levels, it's common to also call `sort_index` so that the result is sorted too.

`DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)`

In [None]:
print frame
frame.sort_index(axis =1) # Sorted Columns


In [None]:
frame.swaplevel(0,1, axis = 0)

In [None]:
frame.swaplevel(0,1).sort_index(axis=0) # Sort row indices

Note: data selection performance is much better on hierarchically indexed objects if the index is sorted starting with the outermost level, that is, the result of calling `sort_index(level=0)` (`sortlevel(0)` in older pandas versions).
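
One way to check whether an index is sorted in this sense is its `is_monotonic_increasing` property. The sketch below uses an illustrative frame `df` (not from the original notebook) to show that swapping levels leaves the index unsorted until `sort_index` is called.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['x', 'y'])

# After swapping, the new outer level reads 1, 2, 1, 2, which is not sorted
swapped = df.swaplevel(0, 1)
sorted_before = swapped.index.is_monotonic_increasing

# sort_index orders the rows lexicographically, outer level first
swapped_sorted = swapped.sort_index()
sorted_after = swapped_sorted.index.is_monotonic_increasing
```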

#### Summary Stats by Level

Many descriptive and summary statistics on DataFrame and Series have a `level` option that specifies the index level to aggregate by on a particular axis.

In [None]:
frame.sum(level='key2')

In [None]:
frame.sum(level='states',axis=1)

Under the hood, this uses pandas's groupby machinery, which is discussed in much more detail later in the book.
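
The same aggregations can be written with `groupby(level=...)` directly, which is also the spelling to use on recent pandas releases where the `level` argument to `sum` is no longer accepted. The sketch below rebuilds a frame like the one above under the illustrative name `df`. Transposing for the column-wise case is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['states', 'colors']

# Row-wise: add up the rows that share the same 'key2' label
by_key2 = df.groupby(level='key2').sum()

# Column-wise: transpose, group the former columns by 'states', transpose back
by_state = df.T.groupby(level='states').sum().T
```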

####  Using a DataFrame's Columns

It's common to want to use one or more columns from a DataFrame as the row index, or conversely to move the row index into the DataFrame's columns, similar to pivot tables. `set_index` creates a new DataFrame using one or more of its columns as the index.

In [271]:
frame = DataFrame({'a':range(7),'b': range(7,0,-1),
                   'c': ['one','one','one','two','two','two','two'],
                   'd': [0,1,2,0,1,2,3]})

frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


In [272]:
frame2=frame.set_index(['c'])
frame2

     a  b  d
c
one  0  7  0
one  1  6  1
one  2  5  2
two  3  4  0
two  4  3  1
two  5  2  2
two  6  1  3


In [273]:
frame2=frame.set_index(['c','d'])
frame2

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


By default, the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`.

In [274]:
frame.set_index(['c','d'], drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


In [275]:
# reset_index does the opposite of set_index: the hierarchical index levels are moved into the columns

frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
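
The round trip through `set_index` and `reset_index` can be sketched on a smaller frame. The name `df` and its contents are illustrative, not from the original notebook. Note that the restored columns come back in a different order, with the former index levels first.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2],
                   'b': [7, 6, 5],
                   'c': ['one', 'one', 'two'],
                   'd': [0, 1, 0]})

# Move columns 'c' and 'd' into a two-level row index
indexed = df.set_index(['c', 'd'])
index_names = list(indexed.index.names)

# reset_index moves the levels back; they come first in the column order
restored = indexed.reset_index()
restored_columns = restored.columns.tolist()

# Same data as the original once the columns are put back in order
same_data = restored[['a', 'b', 'c', 'd']].equals(df)
```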
