# pandas From Scratch


Data Structures: Series and DataFrame
______________________

**Source**
[Python for Data Analysis Ch05, by Wes McKinney](https://github.com/wesm/pydata-book/blob/2nd-edition/ch05.ipynb)

______________________

__**pandas**__ is a Python toolkit for building data structures and cleaning data. It is designed for tabular or heterogeneous data sets, unlike NumPy, which prefers homogeneous numerical data. Like NumPy, it offers array-based functions that process data without `for` loops. Most applications start from one of its two main structures, the Series or the DataFrame.
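
To make "processing without `for` loops" concrete, here is a minimal sketch that compares an explicit loop with the equivalent vectorized expression. The `prices` and `quantity` data are made up for illustration and are not from the book.

```python
import pandas as pd

# Hypothetical prices and quantities; the labels are invented for this sketch
prices = pd.Series([2.5, 4.0, 1.25], index=['apple', 'pear', 'plum'])
quantity = pd.Series([4, 2, 8], index=['apple', 'pear', 'plum'])

# Explicit loop: multiply label by label
loop_total = {}
for label in prices.index:
    loop_total[label] = prices[label] * quantity[label]

# Vectorized: one expression, matched on the index, no loop
vector_total = prices * quantity
print(vector_total)
```

Both approaches give the same numbers; the vectorized form is shorter and is usually faster on large data.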

In [2]:
import pandas as pd

In [3]:
from pandas import Series, DataFrame

In [4]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

## pandas Series

**Series**

A Series is a one-dimensional, array-like object: a sequence of values, each matched to a label in its *index*. An index is an array of data labels. If you do not supply one, a default integer index starting at 0 is created. You can also use strings as index labels; quote them when selecting, and pass a list of labels to select more than one value.

Series concept: a fixed-length, ordered dictionary. 
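
The dictionary analogy can be sketched as follows. The `ages` data is hypothetical; the point is that membership tests, lookups and key listing behave like a dict, while positional slicing is something a plain dict cannot do.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the dict-like behaviour
ages = pd.Series({'Arthur': 41, 'Galahad': 23, 'Robin': 35})

has_robin = 'Robin' in ages     # membership tests the index labels, like dict keys
robin_age = ages['Robin']       # lookup by label, like dict[key]
labels = list(ages.keys())      # keys() returns the index
as_dict = ages.to_dict()        # round-trip back to a plain dict
first_two = ages.iloc[:2]       # unlike a dict, a Series can be sliced by position
```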


In [22]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [None]:
obj.values
obj.index  # like range(4)

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [None]:
obj2

In [None]:
obj2.index

In [None]:
#quote the index value if not numeric
obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]

In [None]:
#operations on the object do not change the index-value relationship
obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)

In [None]:
#Similar to dict usages
'b' in obj2
'e' in obj2

In [5]:
#sdata starts as a dictionary, pass into pandas to create a series
sdata = {'Camelot': 33350100, 'Castle Anthrax': 7105400, 'Oregon': 1643000, 'Spamtowne': 555000}
obj3 = pd.Series(sdata)
obj3

Camelot           33350100
Castle Anthrax     7105400
Oregon             1643000
Spamtowne           555000
dtype: int64

In [6]:
#you can specify the key order with index; without it, recent pandas versions keep the dict's insertion order (older versions sorted the keys)
#Utopia is not a key in sdata, so its value is NaN
states = ['Utopia', 'Spamtowne', 'Oregon', 'Camelot']
obj4 = pd.Series(sdata, index=states)
obj4

Utopia              NaN
Spamtowne      555000.0
Oregon        1643000.0
Camelot      33350100.0
dtype: float64

In [12]:
#detect if data object has nulls
print('$----test null----$\n', pd.isnull(obj4))
print('$----test not null----$\n', pd.notnull(obj4))

$----test null----$
 Utopia        True
Spamtowne    False
Oregon       False
Camelot      False
dtype: bool
$----test not null----$
 Utopia       False
Spamtowne     True
Oregon        True
Camelot       True
dtype: bool


In [19]:
#the same checks are available as instance methods
print(obj3.isnull())
print('$$-----------more---------$$')
print(obj4.isnull())

Camelot           False
Castle Anthrax    False
Oregon            False
Spamtowne         False
dtype: bool
$$-----------more---------$$
Utopia        True
Spamtowne    False
Oregon       False
Camelot      False
dtype: bool


In [20]:
#Series automatically align on index labels in arithmetic, similar to a database join
print(obj3 + obj4)

Camelot           66700200.0
Castle Anthrax           NaN
Oregon             3286000.0
Spamtowne          1110000.0
Utopia                   NaN
dtype: float64
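
One way to see the join-like alignment step on its own is `Series.align`, which returns both operands reindexed to the union of their labels. This is a small sketch with made-up Series `a` and `b`, not part of the original notebook.

```python
import pandas as pd

a = pd.Series({'x': 1.0, 'y': 2.0})
b = pd.Series({'y': 10.0, 'z': 20.0})

# Outer alignment: both results carry the labels x, y, z; gaps become NaN
left, right = a.align(b, join='outer')

# Adding the aligned pieces matches what a + b does in one step
total = left + right
print(total)
```

Only `y` exists on both sides, so it is the only label with a non-NaN sum.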


In [None]:
#both the Series object and its index have a name attribute
obj4.name = 'population'
obj4.index.name = 'state'
obj4

In [27]:
obj.index = ['Lancelot', 'Thripshaw', 'Boubacar', 'Camara']
print(obj)

Lancelot     4
Thripshaw    7
Boubacar    -5
Camara       3
dtype: int64


### DataFrame


_________________
*DataFrames* are ordered collections of columns, and each column can hold a different type. A DataFrame is two-dimensional, indexed by both rows and columns; think of it as a dictionary of Series that all share the same index. The data is stored as one or more 2D blocks, and higher-dimensional data can be represented with hierarchical indexing. When building a DataFrame from a dict of lists, the lists must have the same length, and a default index is assigned if you do not supply one.
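
A small sketch of the "dictionary of Series sharing one index" idea, using hypothetical `height` and `age` Series whose labels only partly overlap:

```python
import pandas as pd

# Hypothetical columns; each Series carries its own labels
height = pd.Series([1.2, 0.8], index=['oak', 'elm'])
age = pd.Series([30, 12, 5], index=['oak', 'elm', 'ash'])

# The DataFrame index is the union of the labels; missing cells become NaN
trees = pd.DataFrame({'height': height, 'age': age})

# Every column pulled back out carries the same, shared index
shared_index = trees['height'].index.equals(trees['age'].index)
print(trees)
```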

In [7]:
data = {'country': ['Senegal', 'Mali', 'Gambia', 'Guinee', 'Burkina Faso', 'Ghana'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.3, 1.6, 3.6, 2.4, 22.9, 35.2]}
frame = pd.DataFrame(data)

In [8]:
#columns follow the dict's insertion order in recent pandas versions (older versions sorted them alphabetically)
frame

Unnamed: 0,country,year,pop
0,Senegal,2000,1.3
1,Mali,2001,1.6
2,Gambia,2002,3.6
3,Guinee,2001,2.4
4,Burkina Faso,2002,22.9
5,Ghana,2003,35.2


In [9]:
frame.head() #defaults to first five rows

Unnamed: 0,country,year,pop
0,Senegal,2000,1.3
1,Mali,2001,1.6
2,Gambia,2002,3.6
3,Guinee,2001,2.4
4,Burkina Faso,2002,22.9


In [10]:
#specify the column order
pd.DataFrame(data, columns=['year', 'country', 'pop'])

Unnamed: 0,year,country,pop
0,2000,Senegal,1.3
1,2001,Mali,1.6
2,2002,Gambia,3.6
3,2001,Guinee,2.4
4,2002,Burkina Faso,22.9
5,2003,Ghana,35.2


In [11]:
#if you request a column that is not in the data (here 'crocodiles'), it is filled with NaN
frame2 = pd.DataFrame(data, columns=['year', 'country', 'pop', 'crocodiles'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
print(frame2)
print(frame2.columns)

       year       country   pop crocodiles
one    2000       Senegal   1.3        NaN
two    2001          Mali   1.6        NaN
three  2002        Gambia   3.6        NaN
four   2001        Guinee   2.4        NaN
five   2002  Burkina Faso  22.9        NaN
six    2003         Ghana  35.2        NaN
Index(['year', 'country', 'pop', 'crocodiles'], dtype='object')


In [14]:
#return a column by dictionary-like notation or attribute
print(frame2['country'])

one           Senegal
two              Mali
three          Gambia
four           Guinee
five     Burkina Faso
six             Ghana
Name: country, dtype: object


In [15]:
print(frame2.year)

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


In [16]:
#retrieve a row by label with loc (use iloc for position)
frame2.loc['four']

year            2001
country       Guinee
pop              2.4
crocodiles       NaN
Name: four, dtype: object

In [20]:
#you can assign a value for each row
#frame2['crocodiles'] = 17.56
frame2['crocodiles'] = np.arange(6.)
print(frame2)

       year       country   pop  crocodiles
one    2000       Senegal   1.3         0.0
two    2001          Mali   1.6         1.0
three  2002        Gambia   3.6         2.0
four   2001        Guinee   2.4         3.0
five   2002  Burkina Faso  22.9         4.0
six    2003         Ghana  35.2         5.0


In [39]:
#assigning a Series aligns its labels to the DataFrame's index; rows without a matching label get NaN
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['crocodiles'] = val
frame2

Unnamed: 0,year,country,pop,crocodiles
one,2000,Senegal,1.3,
two,2001,Mali,1.6,-1.2
three,2002,Gambia,3.6,
four,2001,Guinee,2.4,-1.5
five,2002,Burkina Faso,22.9,-1.7
six,2003,Ghana,35.2,
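
The difference between assigning a Series and assigning a plain list can be sketched like this. The frame `df` and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2, 3]}, index=['a', 'b', 'c'])

# A Series is matched by label: 'a' and 'c' get NaN, the unknown label 'z' is dropped
df['from_series'] = pd.Series([10.0, 99.0], index=['b', 'z'])

# A plain list has no labels, so its length must equal the number of rows
try:
    df['from_list'] = [1, 2]
    list_error = None
except ValueError as exc:
    list_error = exc

print(df)
print(list_error)
```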


In [41]:
#create a new column holding a boolean: True where country equals Senegal
#new columns must be created with square brackets; afterwards attribute access (frame2.western) also works
frame2['western'] = frame2.country == 'Senegal'
frame2

Unnamed: 0,year,country,pop,crocodiles,western
one,2000,Senegal,1.3,,True
two,2001,Mali,1.6,-1.2,False
three,2002,Gambia,3.6,,False
four,2001,Guinee,2.4,-1.5,False
five,2002,Burkina Faso,22.9,-1.7,False
six,2003,Ghana,35.2,,False


In [42]:
#delete the column created above, then list the columns that remain
del frame2['western']
frame2.columns

Index(['year', 'country', 'pop', 'crocodiles'], dtype='object')

In [13]:
#create a nested dictionary of dictionaries
#pandas will default to inner keys as row indices and outer dict keys as columns
pop = {'Coleoptera': {3001: 2.48, 4002: 2.79},
       'Lepidoptera': {3000: 1.15, 5001: 12.7, 7003: 33.6}}

In [14]:
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Coleoptera,Lepidoptera
3000,,1.15
3001,2.48,
4002,2.79,
5001,,12.7
7003,,33.6


In [24]:
#Transpose, swap the columns and row, similar syntax to Numpy arrays
#the inner dict keys mix together and sort to create the index
frame3.T

Unnamed: 0,3000,3001,4002,5001,7003
Coleoptera,,2.48,2.79,,
Lepidoptera,1.15,,,12.7,33.6


In [15]:
#sort the keys in the order you specify 
#inner dict keys are mixed and sorted if you are not explicit
pd.DataFrame(pop, index=[3001, 5001, 4002])

Unnamed: 0,Coleoptera,Lepidoptera
3001,2.48,
5001,,12.7
4002,2.79,


In [26]:
pdata = {'Coleoptera': frame3['Coleoptera'][:-1],
         'Lepidoptera': frame3['Lepidoptera'][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Coleoptera,Lepidoptera
3000,,1.15
3001,2.48,
4002,2.79,
5001,,


In [16]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state,Coleoptera,Lepidoptera
year,Unnamed: 1_level_1,Unnamed: 2_level_1
3000,,1.15
3001,2.48,
4002,2.79,
5001,,12.7
7003,,33.6


In [17]:
frame3.values

array([[  nan,  1.15],
       [ 2.48,   nan],
       [ 2.79,   nan],
       [  nan, 12.7 ],
       [  nan, 33.6 ]])

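In [None]:
#frame2 has mixed column types, so .values returns an object-dtype array
#re-run the cells that define frame2 first; on a fresh kernel this raises NameError
frame2.values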

### Index Objects

Index objects hold the axis labels and other metadata, such as the axis name. Any array or other sequence of labels you use when building a Series or DataFrame is converted to an Index internally.

In [5]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]

Index(['b', 'c'], dtype='object')

In [None]:
index[1] = 'd'  # TypeError because index objects are immutable

In [7]:
#immutability makes Index objects safe to share between data structures
labels = pd.Index(np.arange(3))
labels


Int64Index([0, 1, 2], dtype='int64')

In [10]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)
obj2.index is labels

0    1.5
1   -2.5
2    0.0
dtype: float64


True

In [23]:
frame3

state,Coleoptera,Lepidoptera
year,Unnamed: 1_level_1,Unnamed: 2_level_1
3000,,1.15
3001,2.48,
4002,2.79,
5001,,12.7
7003,,33.6


In [20]:
frame3.columns

Index(['Coleoptera', 'Lepidoptera'], dtype='object', name='state')

In [21]:
'Ohio' in frame3.columns

False

In [22]:
2003 in frame3.index

False

In [None]:
#unlike a Python set, a pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
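
Although an Index is not a set, it supports set-style operations. The following sketch uses two made-up indexes, `left` and `right`, to show a few of them alongside the duplicate-label check.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right

# Duplicates are allowed; is_unique reports whether any exist
dup = pd.Index(['foo', 'foo', 'bar'])
print(dup.is_unique, list(dup.unique()))
```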

## Essential Functionality

### Reindexing

In [24]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [25]:
#reindex rearranges the data to match the new index; labels with no match get NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [26]:
#sometimes you need to fill in values when reindexing, for example with ordered data
#method='ffill' forward-fills from the last valid value
obj3 = pd.Series(['Hymenoptera', 'Coleoptera', 'Hemiptera'], index=[0, 2, 4])
obj3

0    Hymenoptera
2     Coleoptera
4      Hemiptera
dtype: object

In [27]:
obj3.reindex(range(6), method='ffill')

0    Hymenoptera
1    Hymenoptera
2     Coleoptera
3     Coleoptera
4      Hemiptera
5      Hemiptera
dtype: object
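
Forward fill is one of several fill options for `reindex`. As a hedged sketch on a made-up Series `s`, here is how `ffill`, `bfill` and the `limit` argument differ:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 3])
new_index = range(5)

forward = s.reindex(new_index, method='ffill')            # carry the last value forward
backward = s.reindex(new_index, method='bfill')           # pull the next value backward
limited = s.reindex(new_index, method='ffill', limit=1)   # fill at most one consecutive gap

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'ffill_limit1': limited}))
```

Fill methods require the original index to be sorted (monotonic), which is the case here.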

In [108]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Psocoptera', 'Neuroptera', 'Trichoptera'])
frame

Unnamed: 0,Psocoptera,Neuroptera,Trichoptera
a,0,1,2
c,3,4,5
d,6,7,8


In [29]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Psocoptera,Neuroptera,Trichoptera
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [31]:
#use the columns keyword to reindex the columns; names with no match get NaN
orders = ['Neuroptera', 'Mecoptera', 'Trichoptera']
frame.reindex(columns=orders)

Unnamed: 0,Neuroptera,Mecoptera,Trichoptera
a,1,,2
c,4,,5
d,7,,8


In [None]:
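#reindex rows and columns in one call (.loc with a list containing a missing label such as 'b' raises a KeyError in recent pandas)
frame.reindex(index=['a', 'b', 'c', 'd'], columns=orders)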

### Dropping Entries from an Axis

In [35]:
obj = pd.Series(np.arange(5.), index=['Cnidaria', 'Annelida', 'Echinodermata', 'Urochordata', 'Nematoda'])
obj

Cnidaria         0.0
Annelida         1.0
Echinodermata    2.0
Urochordata      3.0
Nematoda         4.0
dtype: float64

In [37]:
new_obj = obj.drop('Urochordata')
new_obj

Cnidaria         0.0
Annelida         1.0
Echinodermata    2.0
Nematoda         4.0
dtype: float64

In [39]:
obj.drop(['Cnidaria', 'Annelida'])

Echinodermata    2.0
Urochordata      3.0
Nematoda         4.0
dtype: float64

In [76]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Asteroidea', 'Ophiuroidea', 'Holothuroidea', 'Echinoidea'],
                    columns=["go'o", 'didi', 'tatti', 'nye'])
data

Unnamed: 0,go'o,didi,tatti,nye
Asteroidea,0,1,2,3
Ophiuroidea,4,5,6,7
Holothuroidea,8,9,10,11
Echinoidea,12,13,14,15


In [41]:
data.drop(['Holothuroidea', 'Ophiuroidea'])

Unnamed: 0,go'o,didi,tatti,nye
Asteroidea,0,1,2,3
Echinoidea,12,13,14,15


In [42]:
data.drop('didi', axis=1)

Unnamed: 0,go'o,tatti,nye
Asteroidea,0,2,3
Ophiuroidea,4,6,7
Holothuroidea,8,10,11
Echinoidea,12,14,15


In [43]:
data.drop(["go'o", 'nye'], axis='columns')

Unnamed: 0,didi,tatti
Asteroidea,1,2
Ophiuroidea,5,6
Holothuroidea,9,10
Echinoidea,13,14


In [46]:
obj.drop('Cnidaria', inplace=True)
obj

Annelida         1.0
Echinodermata    2.0
Urochordata      3.0
Nematoda         4.0
dtype: float64
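
Without `inplace=True`, `drop` leaves the original untouched and returns a new object. A minimal sketch on a made-up Series `s`, which also shows the `errors='ignore'` option for labels that may not exist:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

smaller = s.drop('a')                      # new object; s still has three rows
ignored = s.drop('zzz', errors='ignore')   # missing label is skipped instead of raising KeyError

print(len(s), len(smaller), len(ignored))
```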

### Indexing, Selection, and Filtering

In [90]:
obj = pd.Series(np.arange(6,11,1.), index=['Khaya', 'Mangifera', 'Balanites', 'Azadirachta', 'Delonix'])
obj

Khaya           6.0
Mangifera       7.0
Balanites       8.0
Azadirachta     9.0
Delonix        10.0
dtype: float64

In [79]:
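obj['Mangifera']

7.0

In [80]:
#an integer key on a string index falls back to position; recent pandas versions deprecate this, so prefer obj.iloc[3]
obj[3]

9.0

In [81]:
obj[2:4]

Balanites      8.0
Azadirachta    9.0
dtype: float64

In [55]:
obj[[1, 3]]

Mangifera      7.0
Azadirachta    9.0
dtype: float64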

In [85]:
obj[obj < 7]

Khaya    6.0
dtype: float64

In [86]:
obj[['Mangifera', 'Khaya', 'Azadirachta']]

Mangifera      7.0
Khaya          6.0
Azadirachta    9.0
dtype: float64

In [83]:
obj

Khaya           6.0
Mangifera       7.0
Balanites       8.0
Azadirachta     9.0
Delonix        10.0
dtype: float64

In [71]:
obj[3:5]

Azadirachta     9.0
Delonix        10.0
dtype: float64

In [91]:
obj['Mangifera':'Azadirachta']

Mangifera      7.0
Balanites      8.0
Azadirachta    9.0
dtype: float64

In [89]:
obj['Khaya':'Azadirachta'] = 8
obj

Khaya           8.0
Mangifera       8.0
Balanites       8.0
Azadirachta     8.0
Delonix        10.0
dtype: float64
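
The key difference between label slices and position slices is whether the endpoint is included. A short sketch with a made-up Series `s`:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b':'c']    # label slice: both endpoints are included
by_position = s.iloc[1:2]    # position slice: the end position is excluded

print(by_label.tolist(), by_position.tolist())
```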

In [110]:
data = pd.DataFrame(np.arange(21, 41,1).reshape((4, 5)),
                    index=['Gibberellin', 'Oligosaccharins', 'Cytokinins', 'Auxin'],
                    columns=['one', 'two', 'three', 'four', 'five'])
data

Unnamed: 0,one,two,three,four,five
Gibberellin,21,22,23,24,25
Oligosaccharins,26,27,28,29,30
Cytokinins,31,32,33,34,35
Auxin,36,37,38,39,40


In [95]:
data['two']

Gibberellin        22
Oligosaccharins    27
Cytokinins         32
Auxin              37
Name: two, dtype: int64

In [96]:
data[['three', 'one']]

Unnamed: 0,three,one
Gibberellin,23,21
Oligosaccharins,28,26
Cytokinins,33,31
Auxin,38,36


In [97]:
data[:2]

Unnamed: 0,one,two,three,four,five
Gibberellin,21,22,23,24,25
Oligosaccharins,26,27,28,29,30


In [99]:
data[data['three'] > 28]

Unnamed: 0,one,two,three,four,five
Cytokinins,31,32,33,34,35
Auxin,36,37,38,39,40


In [None]:
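#every value here is between 21 and 40, so the mask data < 5 is all False and the assignment below changes nothing
data < 5
data[data < 5] = 0
data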

#### Selection with loc and iloc

In [107]:
data.loc['Cytokinins', ['two', 'three']]

two      32
three    33
Name: Cytokinins, dtype: int64

In [111]:
data.iloc[2, [3, 0, 1]]

four    34
one     31
two     32
Name: Cytokinins, dtype: int64

In [None]:
data.iloc[2]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

In [113]:
data.loc[:'Auxin', 'two']

Gibberellin        22
Oligosaccharins    27
Cytokinins         32
Auxin              37
Name: two, dtype: int64

In [114]:
data.iloc[:, :3][data.three > 25]

Unnamed: 0,one,two,three
Oligosaccharins,26,27,28
Cytokinins,31,32,33
Auxin,36,37,38
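
The chained selection above (slice the columns, then filter the rows) can also be written as a single `loc` call that takes a boolean row mask and a column slice. This is a sketch on a small made-up frame `df`, not the `data` frame from the cells above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=['r1', 'r2', 'r3'],
                  columns=['one', 'two', 'three', 'four'])

# Two steps: slice the first three columns by position, then filter rows
chained = df.iloc[:, :3][df.three > 5]

# One step: boolean row mask plus an inclusive label slice of columns
single = df.loc[df['three'] > 5, 'one':'three']

print(single)
```

The single `loc` call is generally the safer form when you later want to assign to the selection.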


### Integer Indexes

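Integer labels make positional indexing ambiguous. With the default integer index, `ser[-1]` is looked up as a label, not a position, and raises a KeyError:

ser = pd.Series(np.arange(3.))
ser
ser[-1]  #KeyError: -1 is not a label in the index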

In [None]:
ser = pd.Series(np.arange(3.))

In [None]:
ser

In [None]:
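ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]  #with a non-integer index there is no ambiguity, so -1 is treated as a position (deprecated in recent pandas; ser2.iloc[-1] is explicit)

In [None]:
ser[:1]       #integer slices with [] are position-based: returns the first row
ser.loc[:1]   #loc is label-based and includes the endpoint: labels 0 and 1
ser.iloc[:1]  #iloc is position-based and excludes the endpoint: the first row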
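
The ambiguity is easier to see when the integer labels do not start at 0. In this sketch the labels 10, 20, 30 are made up; being explicit with `loc` and `iloc` removes any guesswork.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.), index=[10, 20, 30])

by_label = ser.loc[20]          # the value stored under label 20
by_pos = ser.iloc[-1]           # the last value, whatever its label
label_slice = ser.loc[10:20]    # labels 10 and 20, endpoint included
pos_slice = ser.iloc[0:1]       # first position only, endpoint excluded

print(by_label, by_pos, label_slice.tolist(), pos_slice.tolist())
```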

### Arithmetic and Data Alignment

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
s2

In [None]:
s1 + s2

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2

In [None]:
df1 + df2

In [None]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
df1 - df2

#### Arithmetic methods with fill values

In [65]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))

print(df1)
print(df2)


     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [66]:
df2.loc[1, 'b'] = np.nan

In [67]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [68]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [69]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [70]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
1 / df1
df1.rdiv(1)

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

#### Operations between DataFrame and Series

In [101]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [102]:
arr[0]

array([0., 1., 2., 3.])

In [103]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [115]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('txw'),
                     index=['Bozicactus', 'Cotyledon', 'Hatioria', 'Lobivia'])
series = frame.iloc[0]

In [116]:
frame

Unnamed: 0,t,x,w
Bozicactus,0.0,1.0,2.0
Cotyledon,3.0,4.0,5.0
Hatioria,6.0,7.0,8.0
Lobivia,9.0,10.0,11.0


In [117]:
series

t    0.0
x    1.0
w    2.0
Name: Bozicactus, dtype: float64

In [118]:
frame - series

Unnamed: 0,t,x,w
Bozicactus,0.0,0.0,0.0
Cotyledon,3.0,3.0,3.0
Hatioria,6.0,6.0,6.0
Lobivia,9.0,9.0,9.0


In [None]:
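#frame's columns are t, x, w: labels found in only one object ('x' in frame, 'f' in series2) come back as all-NaN columns
series2 = pd.Series(range(3), index=['t', 'w', 'f'])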

In [None]:
frame + series2

In [None]:
#frame has no column 'd' (its columns are t, x, w), so select 'x' instead
series3 = frame['x']
frame

In [None]:
series3

In [None]:
frame.sub(series3, axis='index')
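
A common use of these two broadcasting directions is centering data. The sketch below, on a made-up frame `df`, subtracts column means (matching on columns) and row means (matching on the index with `axis='index'`).

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 60.0]},
                  index=['r1', 'r2', 'r3'])

# Default: the Series index is matched to the columns and broadcast down the rows
col_centered = df - df.mean()

# axis='index': the Series index is matched to the row labels and broadcast across columns
row_centered = df.sub(df.mean(axis='columns'), axis='index')

print(col_centered)
print(row_centered)
```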

### Function Application and Mapping

In [119]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['denudatum', 'netrelianum', 'platense', 'saglionis'])
frame

Unnamed: 0,b,d,e
denudatum,-0.204708,0.478943,-0.519439
netrelianum,-0.55573,1.965781,1.393406
platense,0.092908,0.281746,0.769023
saglionis,1.246435,1.007189,-1.296221


In [120]:
np.abs(frame)

Unnamed: 0,b,d,e
denudatum,0.204708,0.478943,0.519439
netrelianum,0.55573,1.965781,1.393406
platense,0.092908,0.281746,0.769023
saglionis,1.246435,1.007189,1.296221


In [121]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.802165
d    1.684034
e    2.689627
dtype: float64

In [122]:
frame.apply(f, axis='columns')

denudatum      0.998382
netrelianum    2.521511
platense       0.676115
saglionis      2.542656
dtype: float64

In [123]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.55573,0.281746,-1.296221
max,1.246435,1.965781,1.393406


In [124]:
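#applymap applies a function to every element; pandas 2.1+ renames it to DataFrame.map
#note that the name format shadows Python's built-in format function
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
denudatum,-0.20,0.48,-0.52
netrelianum,-0.56,1.97,1.39
platense,0.09,0.28,0.77
saglionis,1.25,1.01,-1.30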


In [125]:
frame['e'].map(format)

denudatum      -0.52
netrelianum     1.39
platense        0.77
saglionis      -1.30
Name: e, dtype: object
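
To summarize the three levels (whole column, whole frame element-wise, single Series element-wise), here is a sketch that avoids version-specific method names by combining `apply` with `Series.map`. The frame `df` and the helper `two_decimals` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'b': [1.234, -5.678], 'd': [0.5, 2.25]})

def two_decimals(x):
    """Format one number with two decimal places."""
    return f'{x:.2f}'

# apply: the function receives a whole column and returns one value per column
col_range = df.apply(lambda col: col.max() - col.min())

# Element-wise formatting of every cell: map each column in turn
as_text = df.apply(lambda col: col.map(two_decimals))

print(col_range)
print(as_text)
```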

### Sorting and Ranking

In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
frame.sort_index(axis=1)

In [None]:
frame.sort_index(axis=1, ascending=False)

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
frame.sort_values(by='b')

In [None]:
frame.sort_values(by=['a', 'b'])

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

In [None]:
obj.rank(method='first')

In [None]:
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')

In [None]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis='columns')
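
The tie-breaking methods are easiest to compare side by side. A small sketch on a made-up Series `scores` with one tie:

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4])

average = scores.rank()                 # default: ties share the mean of their ranks
first = scores.rank(method='first')     # ties broken by order of appearance
dense = scores.rank(method='dense')     # ties share a rank; the next group is one higher

print(pd.DataFrame({'value': scores, 'average': average, 'first': first, 'dense': dense}))
```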

### Axis Indexes with Duplicate Labels

In [None]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

In [None]:
obj.index.is_unique

In [None]:
obj['a']
obj['c']

In [None]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']

## Summarizing and Computing Descriptive Statistics

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

In [None]:
df.sum()

In [None]:
df.sum(axis='columns')

In [None]:
df.mean(axis='columns', skipna=False)

In [None]:
df.idxmax()

In [None]:
df.cumsum()

In [None]:
df.describe()

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
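
How missing values are treated changes the result of a reduction. This sketch, on a made-up frame `df`, contrasts the default `skipna=True` with `skipna=False` and with `min_count`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [np.nan, np.nan, 6.0]})

skip = df.sum(axis='columns')                        # NaN skipped; an all-NaN row sums to 0
strict = df.sum(axis='columns', skipna=False)        # any NaN makes the row result NaN
at_least_one = df.sum(axis='columns', min_count=1)   # an all-NaN row stays NaN

print(pd.DataFrame({'skip': skip, 'strict': strict, 'min_count=1': at_least_one}))
```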

### Correlation and Covariance

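The cell below loads cached price and volume data from pickle files. To download the data instead, install pandas-datareader first: `conda install pandas-datareader`.

In [None]:
price = pd.read_pickle('examples/yahoo_price.pkl')
volume = pd.read_pickle('examples/yahoo_volume.pkl')

#alternative to the pickle files: download the data with pandas-datareader (needs network access)
import pandas_datareader.data as web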
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [None]:
returns = price.pct_change()
returns.tail()

In [None]:
returns['MSFT'].corr(returns['IBM'])
returns['MSFT'].cov(returns['IBM'])

In [None]:
returns.MSFT.corr(returns.IBM)

In [None]:
returns.corr()
returns.cov()

In [None]:
returns.corrwith(returns.IBM)

In [None]:
returns.corrwith(volume)
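
If the pickle files or a network connection are not available, the same methods can be tried on synthetic data. This is only a sketch: the columns `A`, `B`, `C` are randomly generated stand-ins, not real stock returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.standard_normal(200)

# Synthetic stand-in for a returns table
returns = pd.DataFrame({'A': base,
                        'B': 2 * base + 1,                  # linear in A, so correlation is 1
                        'C': rng.standard_normal(200)})     # independent noise

pairwise = returns['A'].corr(returns['B'])    # one pair of columns
matrix = returns.corr()                       # full correlation matrix
against_a = returns.corrwith(returns['A'])    # every column against one Series

print(matrix.round(2))
```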

### Unique Values, Value Counts, and Membership

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [None]:
#unique() returns a NumPy array, which has no value_counts method; call value_counts on the Series instead
uniques = obj.unique()
uniques

In [None]:
obj.value_counts()

In [None]:
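#the top-level pd.value_counts is deprecated in recent pandas; the Series method gives the same counts
pd.Series(obj.values).value_counts(sort=False)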

In [None]:
obj
mask = obj.isin(['b', 'c'])
mask

In [None]:
obj[mask]
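
Two small variations on the cells above, sketched with the same values as `obj` but under the hypothetical name `letters`: `normalize=True` turns counts into shares, and `~` inverts an `isin` mask.

```python
import pandas as pd

letters = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

shares = letters.value_counts(normalize=True)    # fractions that sum to 1
kept = letters[letters.isin(['b', 'c'])]         # rows whose value is b or c
dropped = letters[~letters.isin(['b', 'c'])]     # everything else

print(shares)
```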

In [None]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

In [None]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

In [None]:
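#count the values in each column; pd.Series.value_counts replaces the deprecated top-level pd.value_counts
result = data.apply(pd.Series.value_counts).fillna(0)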
result

## Conclusion

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS