The name pandas comes from "panel data" and "Python data analysis". Its two basic objects are the Series and the DataFrame.

Both behave like key-value tables: every value is addressed by an index label.

Pandas is built on NumPy: the values of a Series or DataFrame are stored as NumPy arrays, so once we are familiar with NumPy, many pandas operations become very easy.

Pandas is widely used in financial data analysis.

Today we are going to cover the following topics:



1. Pandas basic data structures:

    *   Series
    *   DataFrame

2. Basic pandas functions

3. Scientific computing



In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
import numpy as np
from numpy.random import randn
import os
import matplotlib.pyplot as plt
np.random.seed(42)
plt.rc('figure',figsize = (10,6))
np.set_printoptions(precision = 4)

A pandas Series is like a one-dimensional list in which each value is paired with a corresponding index label:

In [3]:
se = Series([1,2,3,4])
print(se)
print(type(se))

0    1
1    2
2    3
3    4
dtype: int64
<class 'pandas.core.series.Series'>


In [4]:
se_np = Series(np.array([1,2,3,4]))
se_np

0    1
1    2
2    3
3    4
dtype: int32

In [6]:
print("value: ", se.values)
print("index: ", se.index)

value:  [1 2 3 4]
index:  RangeIndex(start=0, stop=4, step=1)


In [5]:
print(type(se.values))

<class 'numpy.ndarray'>


In [6]:
## Customize index:
se2 = Series({'a':1,'b':2})
se2

a    1
b    2
dtype: int64

In [7]:
print("Value: ", se2.values)
print("Index: ", se2.index)

Value:  [1 2]
Index:  Index(['a', 'b'], dtype='object')


In [8]:
## Pass in a NumPy array to construct the Series
se3 = Series(np.array([1,2,3,4]), index = ['a','b','c','d'])
se3

a    1
b    2
c    3
d    4
dtype: int32

In [9]:
## Pass in a python list to construct the series
se4 = Series([1,2,3,4], index = ['a','b','c','d'])
se4

a    1
b    2
c    3
d    4
dtype: int64

Accessing values:

In [10]:
## We can access a value by its index label
se4['a']

1

In [11]:
se4[['a','b']]

a    1
b    2
dtype: int64

In [12]:
ind = ['a', 'b']
se4[ind]

a    1
b    2
dtype: int64

Applying NumPy functions to a pandas Series:

In [17]:
## the function is applied to the values; the index is preserved
np.exp(se4)

a     2.718282
b     7.389056
c    20.085537
d    54.598150
dtype: float64

In [13]:
se3.values

array([1, 2, 3, 4])

In [14]:
se4.values

array([1, 2, 3, 4], dtype=int64)

In [15]:
se3

a    1
b    2
c    3
d    4
dtype: int32

In [16]:
## Addition is aligned on the index labels when both operands have an index
se4 + se3

a    2
b    4
c    6
d    8
dtype: int64

In [17]:
se4 + np.array([1,2,3,4])

a    2
b    4
c    6
d    8
dtype: int64

In [18]:
se4 + [1,2,3,4]

a    2
b    4
c    6
d    8
dtype: int64

We can also add a scalar:

In [20]:
## Broadcast & Elementwise
se4 + 2

a    3
b    4
c    5
d    6
dtype: int64

In [23]:
se4%2 == 0

a    False
b     True
c    False
d     True
dtype: bool

In [24]:
se4

a    1
b    2
c    3
d    4
dtype: int64

Select values with a boolean mask:

In [25]:
se4[se4%2 == 1]

a    1
c    3
dtype: int64

In [27]:
2 in se4.values

True

In [28]:
10 in se4.values

False

Applied to the Series itself, the 'in' operator checks the index labels, not the values:

In [29]:
2 in se4

False

In [30]:
'a' in se4

True
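
To make the distinction concrete, here is a minimal sketch of both membership checks. The Series `s` is a made-up example, and `isin` is shown as one pandas-native way to search the values.

```python
import pandas as pd

# Hypothetical Series, used only for this illustration
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

label_found = 'a' in s                 # `in` on a Series looks at the index labels
value_as_label = 2 in s                # 2 is a value, not a label, so this is False
value_found = 2 in s.values            # search the underlying NumPy array instead
value_found_isin = s.isin([2]).any()   # pandas-native way to test the values

print(label_found, value_as_label, value_found, value_found_isin)
```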

The most intuitive way to construct a Series is from a Python dictionary:

In [31]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [32]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
         

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [33]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Select the entries with NaN values:

In [34]:
obj4[pd.isnull(obj4)]

California   NaN
dtype: float64

Arithmetic between Series is aligned automatically on the index:

In [35]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [36]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [37]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

We can change the index by assigning new labels to it:

In [38]:
se

0    1
1    2
2    3
3    4
dtype: int64

In [39]:
se.index = ['Jason','Schwartz','Katie','Flood']
se

Jason       1
Schwartz    2
Katie       3
Flood       4
dtype: int64

In [40]:
se.index

Index(['Jason', 'Schwartz', 'Katie', 'Flood'], dtype='object')

DataFrame

A DataFrame is a table-like data structure. It contains a collection of columns, and each column can have a different data type (numeric, string, boolean, etc.).

A DataFrame has both a column index and a row index. It can be seen as a dictionary of Series that all share the same row index.

The most common way to build a DataFrame is to pass in a dictionary:

In [41]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = DataFrame(data)

print(frame.iloc[1],'\n')
print(frame.index,'\n')
frame

state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object 

RangeIndex(start=0, stop=5, step=1) 



Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


Re-order the columns:

In [42]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


If a requested column does not exist in the data, it is filled with NA:

In [44]:

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2


Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


By referencing a column, we can get a Series:

In [47]:
print(frame2['state'])

print(frame2.state)

# print(frame2.state == frame2['state'])

print('\n',type(frame2.debt))
print('\n',type(frame2.iloc[0]))
print('\n',type(frame2.loc['four']))
print('\n',frame2.loc['four'])
print('\n',type(frame2['year']))
print('\n',frame2['year'])

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

 <class 'pandas.core.series.Series'>

 <class 'pandas.core.series.Series'>

 <class 'pandas.core.series.Series'>

 year       2001
state    Nevada
pop         2.4
debt        NaN
Name: four, dtype: object

 <class 'pandas.core.series.Series'>

 one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64


How do we get a row? Use .iloc[position] or .loc[label].
To iterate over rows, iterrows() or itertuples() can be used.

In [48]:
frame2.iloc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object

In [49]:
frame2.loc['two']

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
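
Regarding the question of iterating over rows: one possible approach is `iterrows()`, which yields `(label, row)` pairs, and another is `itertuples()`, which yields namedtuples and is usually the faster of the two. The small frame below is hypothetical and only serves the illustration.

```python
import pandas as pd

# Hypothetical frame, smaller than frame2 but with the same kind of labels
df = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']},
                  index=['one', 'two'])

# iterrows() yields (index label, row as a Series)
labels = []
for label, row in df.iterrows():
    labels.append((label, row['state']))

# itertuples() yields namedtuples; the index label is available as .Index
tuples = [(t.Index, t.year) for t in df.itertuples()]

print(labels)
print(tuples)
```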

In [50]:
frame2.iloc[1] == frame2.loc['two']

year      True
state     True
pop       True
debt     False
Name: two, dtype: bool
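
The `False` for `debt` does not mean that `iloc` and `loc` returned different rows. Both hold NaN in that position, and NaN never compares equal to itself. A short sketch with a hypothetical row, using `Series.equals` to treat NaNs in the same position as equal:

```python
import numpy as np
import pandas as pd

# Hypothetical row containing one missing value
row_a = pd.Series({'year': 2001, 'pop': 1.7, 'debt': np.nan})
row_b = row_a.copy()

elementwise = row_a == row_b   # NaN == NaN evaluates to False
same = row_a.equals(row_b)     # equals() treats NaNs in the same position as equal

print(elementwise)
print(same)
```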

In [51]:
frame2['debt'] = [1,2,3,4,4]
print(frame2['debt']['one'])
frame2

1


Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,1
two,2001,Ohio,1.7,2
three,2002,Ohio,3.6,3
four,2001,Nevada,2.4,4
five,2002,Nevada,2.9,4


A list or NumPy array assigned to a column must match the length of the DataFrame:

In [52]:
frame2['debt'] = np.arange(5.)
print(type(frame2.values))
frame2.values

<class 'numpy.ndarray'>


array([[2000, 'Ohio', 1.5, 0.0],
       [2001, 'Ohio', 1.7, 1.0],
       [2002, 'Ohio', 3.6, 2.0],
       [2001, 'Nevada', 2.4, 3.0],
       [2002, 'Nevada', 2.9, 4.0]], dtype=object)

If you assign a Series, its values are matched to the DataFrame's index, and labels missing from the Series become NaN:

In [54]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

print(frame2.values)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
[[2000 'Ohio' 1.5 nan]
 [2001 'Ohio' 1.7 -1.2]
 [2002 'Ohio' 3.6 nan]
 [2001 'Nevada' 2.4 -1.5]
 [2002 'Nevada' 2.9 -1.7]]


In [55]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


In [53]:
frame2.state == 'Ohio'

one       True
two       True
three     True
four     False
five     False
Name: state, dtype: bool

In [63]:
frame2['eastern'] = (frame2.state == 'Ohio')
frame2

Unnamed: 0,year,state,eastern
one,2000,Ohio,True
two,2001,Ohio,True
three,2002,Ohio,True
four,2001,Nevada,False
five,2002,Nevada,False


In [61]:
frame2 = frame2[['year','state']]

In [62]:
frame2

Unnamed: 0,year,state
one,2000,Ohio
two,2001,Ohio
three,2002,Ohio
four,2001,Nevada
five,2002,Nevada


In [64]:
del frame2['eastern']
frame2

Unnamed: 0,year,state
one,2000,Ohio
two,2001,Ohio
three,2002,Ohio
four,2001,Nevada
five,2002,Nevada


When a nested dictionary is passed to DataFrame, the outer keys become the columns and the inner keys become the row index:

In [65]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


We can also transpose a dataframe:

In [66]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


The values of a DataFrame are returned as an ndarray:

In [68]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [69]:
print(frame2.values)
print(type(frame2.values))

[[2000 'Ohio']
 [2001 'Ohio']
 [2002 'Ohio']
 [2001 'Nevada']
 [2002 'Nevada']]
<class 'numpy.ndarray'>


We can also build a DataFrame directly from a 2-D array, specifying the index and the columns:

In [70]:
frame4=DataFrame(np.array([[0, 1.5],
                           [ 2.4,  1.7],
                           [ 2.9,  3.6]]),
                 index=['2001','2002','2003'],
                 columns=['Nevada','Ohio'])
frame4

Unnamed: 0,Nevada,Ohio
2001,0.0,1.5
2002,2.4,1.7
2003,2.9,3.6


Drop entries (rows) by label:

In [72]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)

obj.drop(['d', 'c'])


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64


a    0.0
b    1.0
e    4.0
dtype: float64

By default drop removes rows; if axis=1 is specified, it removes columns instead:

In [76]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [77]:
data = data.drop(['Ohio', 'Colorado'])

In [78]:
data

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [79]:
data.drop(['two','four'],axis=0)

KeyError: "['two' 'four'] not found in axis"

In [80]:
data.drop(['two','four'],axis = 1)

Unnamed: 0,one,three
Utah,8,10
New York,12,14
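
As a possible alternative to `axis`, `drop` also accepts the `index=` and `columns=` keywords, which some readers find easier to read. A sketch on a frame with the same contents as the one above (the name `df` is ours):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

by_axis = df.drop(['two', 'four'], axis=1)          # positional labels plus axis
by_keyword = df.drop(columns=['two', 'four'])       # same result, explicit keyword
rows_dropped = df.drop(index=['Ohio', 'Colorado'])  # drop rows by label

# drop returns a new object; df itself is unchanged
print(by_keyword)
print(rows_dropped)
```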


Indexing, selection and filtering:


*   A Series can be indexed by label, by integer position (via iloc), or by a boolean mask
*   Indexing a DataFrame with [] selects one or several columns
*   To select rows, use loc (by label) or iloc (by position)



In [84]:
obj = Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
print(obj.dtype)

print(obj['b']) # index
print(obj[['b', 'a', 'd']])


float64
1.0
b    1.0
a    0.0
d    3.0
dtype: float64


In [85]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [86]:
## by integer position (iloc) or by a list of labels
print(obj.iloc[1])
print(obj.iloc[1:3])
print(obj.iloc[[1, 3]])
print(obj[['b', 'd']])

1.0
b    1.0
c    2.0
dtype: float64
b    1.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64


In [87]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [88]:
obj[[True,False,False,False]]

a    0.0
dtype: float64

In [68]:
print(obj<2)
print(obj[obj<2])

a     True
b     True
c    False
d    False
dtype: bool
a    0.0
b    1.0
dtype: float64


In [98]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [90]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [91]:
data.iloc[0]

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [92]:
data.loc['Ohio']

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [93]:
data.loc['Ohio':]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [74]:
data.loc['Utah':]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [94]:
data.loc[:'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11


In [95]:
data[1:]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [96]:
data.iloc[1:]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [99]:
data>5

Unnamed: 0,one,two,three,four
Ohio,False,False,False,False
Colorado,False,False,True,True
Utah,True,True,True,True
New York,True,True,True,True


In [101]:
data[data>5]

Unnamed: 0,one,two,three,four
Ohio,,,,
Colorado,,,6.0,7.0
Utah,8.0,9.0,10.0,11.0
New York,12.0,13.0,14.0,15.0


In [102]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [105]:
data.loc[['Colorado','Utah']]['one']

Colorado    4
Utah        8
Name: one, dtype: int32
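
The expression above indexes twice: first the rows, then the column. A single `.loc[rows, column]` call gives the same result, and it is also the form to prefer when assigning, since an assignment through a chained expression may end up modifying a temporary copy. A sketch, assuming a frame `df` with the same contents as `data`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

chained = df.loc[['Colorado', 'Utah']]['one']   # two indexing steps
single = df.loc[['Colorado', 'Utah'], 'one']    # one step: rows and column together
same = chained.equals(single)

# Assignment should go through a single .loc call
df.loc[['Colorado', 'Utah'], 'one'] = -1
print(same)
print(df)
```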

In [106]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [111]:
## rows by boolean mask, first two columns picked by position via data.columns
data.loc[data.three > 5, data.columns[:2]]


Unnamed: 0,one,two
Colorado,4,5
Utah,8,9
New York,12,13


In [113]:
data[data.three > 5].T.iloc[:2].T

Unnamed: 0,one,two
Colorado,4,5
Utah,8,9
New York,12,13


In [115]:
## rows by label, columns picked by position via data.columns
data2 = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

data3 = data.loc[data.three > 5, data.columns[:2]]

print(data)
print(data2)
print(data3)


          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          four  one  two
Colorado     7    4    5
Utah        11    8    9
          one  two
Colorado    4    5
Utah        8    9
New York   12   13


Arithmetic:
Operations between two objects are aligned on the index; labels that appear in only one operand produce NaN.

In [116]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 * s2

a   -15.33
c    -9.00
d      NaN
e    -2.25
f      NaN
g      NaN
dtype: float64

In [117]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=['as','xi','ba'],
                index=['Ohio', 'Texas', 'Colorado'])

print(df1)
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=['as','b','e'],
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df2)
print(df1 + df2)

           as   xi   ba
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
         as     b     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
           as   b  ba   e  xi
Colorado  NaN NaN NaN NaN NaN
Ohio      3.0 NaN NaN NaN NaN
Oregon    NaN NaN NaN NaN NaN
Texas     9.0 NaN NaN NaN NaN
Utah      NaN NaN NaN NaN NaN


Filling missing values during arithmetic:

In [121]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
#dfadd=df1+df2
dfaddfill0=df1.add(df2, fill_value=0)
print(df1)
print(df2)
dfaddfill0

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [120]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,
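
Note that `fill_value` is applied to the operands before the addition, which is not the same as filling the NaNs of the result afterwards. A small sketch with two hypothetical frames:

```python
import pandas as pd

# Hypothetical frames: `right` has one extra row and one extra column
left = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
right = pd.DataFrame({'a': [10.0, 20.0, 30.0], 'b': [1.0, 1.0, 1.0]},
                     index=[0, 1, 2])

# A missing operand is treated as 0, so the other operand's value survives
filled_before = left.add(right, fill_value=0)

# Plain + produces NaN wherever one side is missing; fillna then turns those into 0
filled_after = (left + right).fillna(0)

print(filled_before)
print(filled_after)
```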


Operations between a DataFrame and a Series:

In [122]:
arr = np.arange(12.).reshape((3, 4))
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [123]:
arr - arr[1]

array([[-4., -4., -4., -4.],
       [ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.]])

In [125]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
fsubr=frame - series
print('frame \n',frame)
print('fsubr \n', fsubr)

frame 
           b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
fsubr 
           b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


In [126]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [127]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


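Broadcasting: by default the Series index is matched against the DataFrame's columns and the operation is repeated for every row. To match the Series against the row index instead and repeat the operation for every column, use an arithmetic method such as sub and specify axis=0: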

In [128]:
frame['d']

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [129]:
series3 = frame['d']
fsubc=frame.sub(series3, axis=0)
print(frame)
print(fsubc)


          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
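
A common use of `axis=0` is centering each row on its own mean. The sketch below rebuilds the frame from the cells above under the name `df`, and also shows what happens when the axis is left at its default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = df.mean(axis=1)            # one value per row, indexed by row label
centered = df.sub(row_means, axis=0)   # match row_means against the row index

# With the default alignment the row labels in row_means are matched against
# the column labels 'b', 'd', 'e'; nothing matches, so every entry is NaN
misaligned = df - row_means

print(centered)
print(misaligned.shape)
```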


NumPy functions can also be applied to a DataFrame:

In [130]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
fabs=np.abs(frame)

In [131]:
frame

Unnamed: 0,b,d,e
Utah,0.496714,-0.138264,0.647689
Ohio,1.52303,-0.234153,-0.234137
Texas,1.579213,0.767435,-0.469474
Oregon,0.54256,-0.463418,-0.46573


In [132]:
fabs

Unnamed: 0,b,d,e
Utah,0.496714,0.138264,0.647689
Ohio,1.52303,0.234153,0.234137
Texas,1.579213,0.767435,0.469474
Oregon,0.54256,0.463418,0.46573


In [133]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.082499
d    1.230852
e    1.117163
dtype: float64

In [134]:
f(frame)

b    1.082499
d    1.230852
e    1.117163
dtype: float64

In [135]:
frame

Unnamed: 0,b,d,e
Utah,0.496714,-0.138264,0.647689
Ohio,1.52303,-0.234153,-0.234137
Texas,1.579213,0.767435,-0.469474
Oregon,0.54256,-0.463418,-0.46573


In [136]:
frame.apply(f,axis=1)

Utah      0.785953
Ohio      1.757183
Texas     2.048687
Oregon    1.008290
dtype: float64

In [137]:
## When the function returns a Series, the result is a DataFrame
def nf(x):
    return Series([x.min(), x.max()], index=['min','max'])
frame.apply(nf,axis = 0)

Unnamed: 0,b,d,e
min,0.496714,-0.463418,-0.469474
max,1.579213,0.767435,0.647689
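
For simple cases like this one, `DataFrame.agg` with a list of function names should give the same table without a helper function. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical numeric frame
df = pd.DataFrame({'b': [0.5, 1.5, -0.2], 'd': [-0.1, 0.8, 0.3]})

def min_max(col):
    # Same idea as nf above: return several statistics per column
    return pd.Series([col.min(), col.max()], index=['min', 'max'])

via_apply = df.apply(min_max)
via_agg = df.agg(['min', 'max'])   # one row per function name

print(via_apply)
print(via_agg)
```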


In [138]:
## elementwise function
frame.apply(np.exp)

Unnamed: 0,b,d,e
Utah,1.643313,0.870868,1.911118
Ohio,4.586099,0.79124,0.791253
Texas,4.851136,2.154233,0.625331
Oregon,1.720406,0.62913,0.627677


In [139]:
np.exp(frame)

Unnamed: 0,b,d,e
Utah,1.643313,0.870868,1.911118
Ohio,4.586099,0.79124,0.791253
Texas,4.851136,2.154233,0.625331
Oregon,1.720406,0.62913,0.627677


Sorting:


1.   To sort by index labels, use sort_index
2.   When sorting a DataFrame, sort_index sorts by the row labels by default; to sort by the column labels, specify axis = 1
3.   ascending = True or False controls the order
4.   To sort a Series by its values, use sort_values. Missing data is placed at the end
5.   To sort a DataFrame by its values, pass the column name to sort_values with by=
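
The points above can be sketched in a few lines. The frame mirrors the one used in the following cells; `na_position` is an additional `sort_values` option that controls where missing values go.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])

by_row_label = df.sort_index()                         # rows ordered by label
by_col_label = df.sort_index(axis=1)                   # columns ordered by label
by_col_desc = df.sort_index(axis=1, ascending=False)   # reverse column order
by_value = df.sort_values(by='a', ascending=False)     # rows ordered by column 'a'

s = pd.Series([3.0, np.nan, 1.0])
nan_last = s.sort_values()                       # default: NaN at the end
nan_first = s.sort_values(na_position='first')   # NaN at the start

print(by_col_label)
print(nan_first)
```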



In [141]:
obj = Series(np.arange(4), index = ['a','c','d','b'])
print(obj)
obj.sort_index()

a    0
c    1
d    2
b    3
dtype: int32


a    0
b    3
c    1
d    2
dtype: int32

In [142]:
obj.sort_values()

a    0
c    1
d    2
b    3
dtype: int32

In [143]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
print(frame)
frame.sort_index()


       d  a  b  c
three  0  1  2  3
one    4  5  6  7


Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [144]:
frame.sort_values(by='a')

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [145]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [146]:
frame.sort_values(by='a',ascending=False)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [147]:
obj = Series([5,1,7,2,np.nan, 23,57,np.nan])
obj.sort_values()

1     1.0
3     2.0
0     5.0
2     7.0
5    23.0
6    57.0
4     NaN
7     NaN
dtype: float64

In [149]:
obj = Series([7, -5,  7, 4, 2, 0, 4])
obj.rank(method = 'min')

0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

In [150]:
obj = Series([7, -5,  7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [151]:
obj.rank(ascending = False, method = 'max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
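
The `method` argument decides how ties are broken. A compact sketch comparing the available methods side by side on a short, made-up Series:

```python
import pandas as pd

# Hypothetical Series with one tie (the two 7s)
s = pd.Series([7, -5, 7, 4])

methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})

# average: tied values share the mean of their ranks
# min / max: tied values all get the lowest / highest rank of the group
# first: ties are ranked in order of appearance
# dense: like min, but the next rank increases by exactly 1
print(ranks)
```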

In [152]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
print(frame)
print(frame.rank())

     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
     b    a    c
0  3.0  1.5  2.0
1  4.0  3.5  3.0
2  1.0  1.5  4.0
3  2.0  3.5  1.0


In [121]:
frame.rank(axis = 1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


Duplicated indices:

In [153]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [154]:
obj['a']

a    0
a    1
dtype: int64

In [155]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df


Unnamed: 0,0,1,2
a,0.241962,-1.91328,-1.724918
a,-0.562288,-1.012831,0.314247
b,-0.908024,-1.412304,1.465649
b,-0.225776,0.067528,-1.424748


In [156]:
df.loc['a']

Unnamed: 0,0,1,2
a,0.241962,-1.91328,-1.724918
a,-0.562288,-1.012831,0.314247
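
When labels repeat, selecting a label returns a Series (or DataFrame) instead of a scalar, so it can be worth checking for uniqueness first. A sketch using `Index.is_unique` and `Index.duplicated`:

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

unique = s.index.is_unique   # False: 'a' and 'b' both repeat
repeated = s['a']            # a Series, because 'a' occurs twice
single = s['c']              # a scalar, because 'c' occurs once

# duplicated() marks the second and later occurrences of each label
dup_mask = s.index.duplicated()
first_only = s[~dup_mask]    # keep only the first occurrence of each label

print(unique)
print(first_only)
```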


Some simple summary statistics:

In [157]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [158]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [159]:
df.sum(axis = 1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
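
Row `c` sums to 0.00 because NaN values are skipped and the sum of nothing is zero. If an all-NaN row should yield NaN instead, the `min_count` argument can be used. A sketch on a reduced, hypothetical version of the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan]],
                  index=['a', 'c'], columns=['one', 'two'])

default_sum = df.sum(axis=1)                # NaN skipped; an all-NaN row gives 0.0
strict_sum = df.sum(axis=1, min_count=1)    # needs at least one valid value per row
no_skip = df.sum(axis=1, skipna=False)      # any NaN in a row makes the result NaN

print(default_sum)
print(strict_sum)
print(no_skip)
```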

In [129]:
df.mean(axis = 1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [160]:
df.idxmax()

one    b
two    d
dtype: object

In [161]:
df.cumsum(skipna = False)

Unnamed: 0,one,two
a,1.4,
b,8.5,
c,,
d,,


In [163]:
df['one'].unique()

array([1.4 , 7.1 ,  nan, 0.75])

Data Merge:

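pandas.merge combines DataFrames by matching rows on one or more key columns (a database-style join), so the columns of both inputs end up side by side.

pandas.concat stacks objects along an axis: vertically (axis=0, the default) or horizontally (axis=1).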

In [164]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b','b', 'd'],
                 'data2': range(4)})
df3 = DataFrame({'key': ['a', 'b','b', 'd','e'],
                 'data2': range(5)})

print(df1)
print(df1.index)

print(df2)

print(df3)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
RangeIndex(start=0, stop=7, step=1)
  key  data2
0   a      0
1   b      1
2   b      2
3   d      3
  key  data2
0   a      0
1   b      1
2   b      2
3   d      3
4   e      4


If no key is specified, pandas.merge uses the columns whose names appear in both DataFrames as the join keys.

pd.merge supports several join types through the how argument: inner (the default), left, right and outer.

In [165]:
dfmerge1=pd.merge(df1, df2)
print(dfmerge1)

  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0
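
To see how the join types differ, the sketch below counts the rows produced by each `how` option and uses `indicator=True` to record where every row of an outer join came from. The frames `left` and `right` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: keys 'a' and 'b' are shared, 'c' and 'd' are not
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# Number of result rows for every join type
row_counts = {how: len(pd.merge(left, right, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
origin = outer.set_index('key')['_merge'].astype(str).to_dict()

print(row_counts)
print(outer)
```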


In [166]:
dfmerge1=pd.merge(df1, df2,on='key')
dfmerge2=pd.merge(df1, df2, on='key',how="left")
dfmerge3=pd.merge(df1, df2, on='key',how="right")
dfmerge4=pd.merge(df1, df2, on='key',how="outer")
print(df1)
print(df2)
print(dfmerge1)
print(dfmerge2)
print(dfmerge3)
print(dfmerge4)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
  key  data2
0   a      0
1   b      1
2   b      2
3   d      3
  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0
  key  data1  data2
0   b      0    1.0
1   b      0    2.0
2   b      1    1.0
3   b      1    2.0
4   a      2    0.0
5   c      3    NaN
6   a      4    0.0
7   a      5    0.0
8   b      6    1.0
9   b      6    2.0
  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    6.0      1
3   b    0.0      2
4   b    1.0      2
5   b    6.0      2
6   a    2.0      0
7   a    4.0      0
8   a    5.0      0
9   d    NaN      3
   key  data1  data2
0    b    0.0    1.0
1    b    0.0    2.0
2    b    1.0    1.0
3    b    1.0    2.0
4    b    6.0    1.0
5    b    6.0    2.0
6    a    2.0    0.0
7    a    4.0    0.0
8 

When merging on a key that has duplicate values, every matching combination of rows is produced:

In [167]:
df1 = DataFrame({'key': ['a', 'b'],
                 'data1': range(2)})
df2 = DataFrame({'key': ['b',  'b' ],
                 'data2': range(2)})
merge4=pd.merge(df1, df2, how='inner')
print(df1)
print(df2)
print(merge4)

  key  data1
0   a      0
1   b      1
  key  data2
0   b      0
1   b      1
  key  data1  data2
0   b      1      0
1   b      1      1
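
Because duplicate keys multiply rows, it can help to state the expected relationship explicitly. `pd.merge` accepts a `validate` argument for this; the sketch below rebuilds the two small frames from above under the names `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'data1': range(2)})
right = pd.DataFrame({'key': ['b', 'b'], 'data2': range(2)})

# Keys are unique on the left and repeated on the right: one-to-many is fine
ok = pd.merge(left, right, on='key', validate='one_to_many')

# Claiming one-to-one is wrong here, so pandas raises a MergeError,
# which is a subclass of ValueError
try:
    pd.merge(left, right, on='key', validate='one_to_one')
    raised = False
except ValueError:
    raised = True

print(ok)
print(raised)
```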


In [168]:
## the column name '代码' means 'code'
left = DataFrame({'代码': ['foo', 'foo', 'bar'],
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]}) 
right = DataFrame({'代码': ['foo', 'foo', 'bar', 'bar'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})
merge5=pd.merge(left, right, on='代码', how='outer')
merge6=pd.merge(left, right, on=['代码', 'key2'], how='inner')
print(left)
print(right)
print(merge5)
print(merge6)


    代码 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
    代码 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
    代码 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
    代码 key2  lval  rval
0  foo  one     1     4
1  foo  one     1     5
2  bar  one     3     6


In [169]:
## '代码' means 'code' and '平均值' means 'mean value'
left1 = DataFrame({'代码': ['a', 'b', 'a', 'a', 'b', 'c'],
                  'value': range(6)})
right1 = DataFrame({'平均值': [3.5, 7]}, index=['a', 'b'])


pm2=pd.merge(left1, right1, left_on='代码', right_index=True)
pm2

Unnamed: 0,代码,value,平均值
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [170]:
arr = np.arange(12).reshape((3, 4))
arr


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [171]:
np.concatenate([arr, arr], axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [172]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [173]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['c', 'g'])
s_all = pd.concat([s1, s2, s3])
print(s1)
print(s2)
print(s3)
print(s_all)

a    0
b    1
dtype: int64
c    2
d    3
e    4
dtype: int64
c    5
g    6
dtype: int64
a    0
b    1
c    2
d    3
e    4
c    5
g    6
dtype: int64


In [174]:
pd.concat([s1, s2, s3], axis=1)


Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,5.0
d,,3.0,
e,,4.0,
g,,,6.0
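
A few other `pd.concat` options that may be useful, sketched on the same three Series: `ignore_index` to renumber the result, `keys` to record which input each row came from, and `join='inner'` to keep only the shared labels.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['c', 'g'])

# Default: stack vertically and keep the original labels ('c' appears twice)
stacked = pd.concat([s1, s2, s3])

# Discard the labels and renumber from 0
renumbered = pd.concat([s1, s2, s3], ignore_index=True)

# Add an outer index level naming the source of each row
labelled = pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])

# Side by side, keeping only the labels present in both inputs
inner_cols = pd.concat([s2, s3], axis=1, join='inner')

print(labelled)
print(inner_cols)
```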


In [176]:
df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                columns=['one', 'two'])
df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                columns=['three', 'four'])
cdf = pd.concat([df1, df2], axis=1)
print(df1)
print(df2)
print(cdf)

   one  two
a    0    1
b    2    3
c    4    5
   three  four
a      5     6
c      7     8
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0



In [178]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
newdata1=data.drop_duplicates()
print(data)
print(newdata1)

    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4


In [180]:
data['v1'] = range(7)  ## data is a DataFrame

newdata2=data.drop_duplicates(['k1'])
newdata3=data.drop_duplicates(['k1', 'k2'], keep='last')

print(data)
print(newdata2)
print(newdata3)

    k1  k2  v1
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6
    k1  k2  v1
0  one   1   0
3  two   3   3
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6
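
`drop_duplicates` is closely related to `duplicated`, which returns the boolean mask of rows that repeat an earlier row. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                   'k2': [1, 1, 2, 3, 3, 4, 4]})

mask = df.duplicated()    # True for rows that repeat an earlier row
deduped = df[~mask]       # same rows that drop_duplicates() keeps

print(mask.tolist())
print(deduped)
```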


Group by:

In [181]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : range(5),
                'data2' : range(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0,0
1,a,two,1,1
2,b,one,2,2
3,b,two,3,3
4,a,one,4,4


In [182]:
grouped = df['data1'].groupby(df['key1'])  ## group data1 by the values of key1
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000022B63963F60>

In [149]:
dict(list(grouped))

{'a': 0    0
 1    1
 4    4
 Name: data1, dtype: int64, 'b': 2    2
 3    3
 Name: data1, dtype: int64}

In [183]:
pd.DataFrame(dict(list(grouped)))

Unnamed: 0,a,b
0,0.0,
1,1.0,
2,,2.0
3,,3.0
4,4.0,


In [151]:
pd.DataFrame(dict(list(grouped))).mean()

a    1.666667
b    2.500000
dtype: float64

In [152]:
grouped.mean()

key1
a    1.666667
b    2.500000
Name: data1, dtype: float64

In [184]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : range(5),
                'data2' : range(5)})
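grouped = df['data1'].groupby(df['key1'])  ## group data1 by the values of key1
a = df.groupby('key1')

print(df)
## numeric_only=True skips the string column key2, which cannot be averaged
print(a.mean(numeric_only=True))
print(grouped.mean())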

  key1 key2  data1  data2
0    a  one      0      0
1    a  two      1      1
2    b  one      2      2
3    b  two      3      3
4    a  one      4      4
         data1     data2
key1                    
a     1.666667  1.666667
b     2.500000  2.500000
key1
a    1.666667
b    2.500000
Name: data1, dtype: float64


In [186]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()  ## pass a list of keys to group by several columns
print(means)

key1  key2
a     one     2
      two     1
b     one     2
      two     3
Name: data1, dtype: int64


In [187]:
means

key1  key2
a     one     2
      two     1
b     one     2
      two     3
Name: data1, dtype: int64

In [190]:
means.unstack()

key2  one  two
key1          
a       2    1
b       2    3


In [191]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0,0
1,a,two,1,1
2,b,one,2,2
3,b,two,3,3
4,a,one,4,4


In [189]:
dfmeans = df.groupby(['key1', 'key2']).mean()
dfmeans.unstack()

     data1     data2    
key2   one two   one two
key1                    
a        2   1     2   1
b        2   3     2   3


In [192]:
df.groupby([df['key1'], df['key2']])[['data2']].mean()

           data2
key1 key2       
a    one       2
     two       1
b    one       2
     two       3


In [194]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    3.4
b    2.9
Name: data1, dtype: float64

In [195]:
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022B6397B3C8>

In [196]:
def peak_to_peak(arr):
    return arr.max() - arr.min()
## restrict to the numeric columns; the strings in key2 cannot be subtracted
grouped[['data1', 'data2']].agg(peak_to_peak)

      data1  data2
key1              
a         4      4
b         1      1
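
When several statistics are needed at once, named aggregation is one option: each keyword becomes an output column, defined by a (column, function) pair. The output names `lowest`, `highest` and `spread` below are our own choice.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': range(5)})

summary = df.groupby('key1').agg(
    lowest=('data1', 'min'),
    highest=('data1', 'max'),
    spread=('data1', lambda x: x.max() - x.min()),  # same idea as peak_to_peak
)

print(summary)
```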


Iterating over a grouped object yields (key, group) pairs:

In [197]:
for key, group in grouped:
  print(key)
  print(group)

a
  key1 key2  data1  data2
0    a  one      0      0
1    a  two      1      1
4    a  one      4      4
b
  key1 key2  data1  data2
2    b  one      2      2
3    b  two      3      3
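
Besides looping, a single group can be fetched directly with `get_group`. A short sketch on a reduced version of the frame:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': range(5)})
grouped = df.groupby('key1')

# Looping gives (key, sub-frame) pairs
sizes = {key: len(group) for key, group in grouped}

# get_group returns the sub-frame for one key without looping
group_b = grouped.get_group('b')

print(sizes)
print(group_b)
```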
