In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np



A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call pd.Series(data, index=index).



Here, data can be many different things:

a Python dict
an ndarray
a scalar value (like 5)

The passed index is a list of axis labels. The behavior therefore depends on what data is; the cells below cover an ndarray, a dict, and a scalar in turn.



In [2]:
s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s1

a    0.323601
b   -0.688684
c    2.752123
d    0.125563
e    0.990408
dtype: float64

In [3]:
s1.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
d = {'a': 0., 'b': 1., 'c': 2.}
s2 = pd.Series(d)
s2

a    0.0
b    1.0
c    2.0
dtype: float64

In [5]:
s3 = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
s3

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
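
The dict case and the index argument can also be combined. The following is a minimal sketch (the label order is chosen only for illustration): when both are passed, values are looked up by label, and a label with no entry in the dict becomes NaN.

```python
import numpy as np
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# Values are pulled out of the dict by label, in the order given by index.
# 'd' has no entry in the dict, so its value is NaN.
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)
```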

A Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [6]:
s1.iloc[0]  # positional access; s1[0] on a labeled index is deprecated in recent pandas

0.3236010504009571

In [7]:
s1[:3]

a    0.323601
b   -0.688684
c    2.752123
dtype: float64

In [59]:
s1.describe()  # executed after s1['f'] = 11 (In [13]), so six values are summarized

count     6.000000
mean      2.417169
std       4.361046
min      -0.688684
25%       0.175073
50%       0.657005
75%       2.311694
max      11.000000
dtype: float64

In [9]:
s1[s1 > s1.median()]

c    2.752123
e    0.990408
dtype: float64

In [10]:
np.exp(s1)

a     1.382096
b     0.502237
c    15.675874
d     1.133787
e     2.692333
dtype: float64
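
One place where a Series differs from an ndarray is arithmetic between two Series: operands are aligned on their labels rather than on position. The sketch below uses a small made-up Series to show this; labels present in only one operand yield NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# s[1:] holds labels b, c, d and s[:-1] holds a, b, c.
# The sum is computed label by label over the union of both indexes.
total = s[1:] + s[:-1]
print(total)

# The underlying values are available as a plain ndarray when needed.
values = s.to_numpy()
```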

A Series is like a fixed-size dict in that you can get and set values by index label:

In [11]:
s1['a']

0.3236010504009571

In [12]:
'f' in s1

False

In [13]:
s1['f']=11

In [14]:
s1

a     0.323601
b    -0.688684
c     2.752123
d     0.125563
e     0.990408
f    11.000000
dtype: float64
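
Continuing the dict analogy, here is a short sketch of what happens with a missing label (the Series and the label 'z' are made up for illustration): indexing raises KeyError, while get() returns None or a supplied default.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, -1.0], index=['a', 'b'])

# Indexing with a label that is not in the index raises KeyError.
try:
    s['z']
    raised = False
except KeyError:
    raised = True

# get() behaves like dict.get(): None by default, or the given fallback.
missing = s.get('z')
fallback = s.get('z', np.nan)
```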

Series can also have a name attribute:

In [15]:
s = pd.Series(np.random.randn(5), name='something')
s

0   -1.184495
1   -0.103993
2    0.481341
3   -0.551322
4   -0.217556
Name: something, dtype: float64

In [16]:
s2 = s.rename("different")
s2

0   -1.184495
1   -0.103993
2    0.481341
3   -0.551322
4   -0.217556
Name: different, dtype: float64
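
Note that rename() returns a new Series and leaves the original untouched. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='something')

# rename() with a scalar sets the name on a new object.
renamed = s.rename('different')

print(s.name)        # the original keeps its name
print(renamed.name)
```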

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.



In [40]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [41]:
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['one', 'three'])
df

   one three
d  NaN   NaN
b  2.0   NaN
a  1.0   NaN


In [42]:
df.index

Index(['d', 'b', 'a'], dtype='object')

In [43]:
df.columns

Index(['one', 'three'], dtype='object')
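
The list of accepted inputs also mentions a 2-D ndarray, which the cells above do not show. A minimal sketch, with array contents and labels chosen only for illustration: without labels, pandas falls back to integer ranges for both axes.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)

# Explicit row and column labels.
labeled = pd.DataFrame(arr, index=['x', 'y', 'z'], columns=['one', 'two'])

# No labels passed: both axes default to 0, 1, 2, ...
default = pd.DataFrame(arr)

print(labeled)
print(default)
```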

If we create a DataFrame from a dict of ndarrays or lists, they must all have the same length. If an index is passed, it must have that length too.

In [44]:
d = {'one': [1., 2., 3.],
     'two': [1., 2., 3., 4.]}
df = pd.DataFrame(d, index=['d', 'b', 'a', 'c'])
df

ValueError: Shape of passed values is (2, 3), indices imply (2, 4)

In [45]:
d = {'one': [1., 2., 3., 4.],
     'two': [1., 2., 3., 4.]}
df = pd.DataFrame(d, index=['d', 'b', 'a', 'c'])
df

   one  two
d  1.0  1.0
b  2.0  2.0
a  3.0  3.0
c  4.0  4.0
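
The length requirement can be checked programmatically. The sketch below catches the ValueError for unequal lists and then shows one possible workaround (an assumption about intent, not part of the original notebook): wrapping each list in a Series lets pandas align on the index and pad the shorter column with NaN.

```python
import numpy as np
import pandas as pd

d = {'one': [1., 2., 3.], 'two': [1., 2., 3., 4.]}

# Lists of unequal length are rejected.
error = None
try:
    pd.DataFrame(d)
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Series are aligned on their index instead, so the short column is padded.
padded = pd.DataFrame({key: pd.Series(values) for key, values in d.items()})
print(padded)
```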


In [46]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [47]:
df['three'] = df['one'] + df['two']
df['flag'] = df['three'] > 4.
df

   one  two  three   flag
d  1.0  1.0    2.0  False
b  2.0  2.0    4.0  False
a  3.0  3.0    6.0   True
c  4.0  4.0    8.0   True


In [48]:
del df['two']
df

   one  three   flag
d  1.0    2.0  False
b  2.0    4.0  False
a  3.0    6.0   True
c  4.0    8.0   True


In [49]:
three = df.pop('three')
df

   one   flag
d  1.0  False
b  2.0  False
a  3.0   True
c  4.0   True


In [50]:
df.insert(0, 'bar3', [3., 5., 8., 10.])  # insert a new column at position 0
df

   bar3  one   flag
d   3.0  1.0  False
b   5.0  2.0  False
a   8.0  3.0   True
c  10.0  4.0   True


In [52]:
df.loc['c']

bar3      10
one        4
flag    True
Name: c, dtype: object

In [53]:
df.iloc[0]

bar3        3
one         1
flag    False
Name: d, dtype: object
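
The two cells above select a row by label (loc) and by integer position (iloc). As a small sketch on a made-up frame, both return the row as a Series whose index is the column labels and whose name is the row label:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'flag': [False, True]}, index=['d', 'c'])

row_by_label = df.loc['c']    # row whose index label is 'c'
row_by_pos = df.iloc[1]       # second row, regardless of its label

# loc also accepts a (row label, column label) pair for a single cell.
cell = df.loc['c', 'one']
```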

In [None]:
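# Assumes iris is a DataFrame loaded elsewhere with the columns
# sepal_length, sepal_width, petal_length and petal_width.
iris['sepal_ratio'] = iris['sepal_width'] / iris['sepal_length']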
iris.head()

In [None]:
(iris.query('sepal_length > 5')
     .assign(sepal_ratio=lambda x: x.sepal_width / x.sepal_length,
             petal_ratio=lambda x: x.petal_width / x.petal_length)
     .plot(kind='scatter', x='sepal_ratio', y='petal_ratio'))
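
Because iris is not defined in this section, the same query/assign chain can be tried on a small stand-in frame. The column names and values below are hypothetical. The sketch also illustrates that assign() returns a new DataFrame and leaves the original unchanged, unlike the in-place column assignment shown earlier.

```python
import pandas as pd

# Hypothetical stand-in data; not the iris data set.
df = pd.DataFrame({'width': [2.0, 3.0, 4.0], 'length': [4.0, 5.0, 8.0]})

# Filter rows first, then derive a column from the filtered frame.
# The lambda receives the frame produced by query().
result = (df.query('length > 4')
            .assign(ratio=lambda x: x.width / x.length))

print(result)
print(df.columns.tolist())  # the original frame has no 'ratio' column
```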

Boolean operators work as well:

In [54]:
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)

In [57]:
df1 & df2

       a      b
0  False  False
1  False   True
2   True  False
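
The other element-wise Boolean operators follow the same pattern. A short sketch reusing the two frames defined above:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

either = df1 | df2    # element-wise OR
exactly_one = df1 ^ df2   # element-wise XOR
negated = ~df1        # element-wise NOT

print(either)
print(exactly_one)
print(negated)
```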


To transpose, access the T attribute (also the transpose function), similar to an ndarray:

In [58]:
df1.T

       0      1     2
a   True  False  True
b  False   True  True
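
As a quick check of the statement above, the T attribute and the transpose() method give the same result, with row and column labels swapped. The frame here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

t = df.T                      # attribute form
same = t.equals(df.transpose())   # method form gives the same frame

print(t)
```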


You can conveniently do element-wise comparisons when comparing a pandas data structure with a scalar value:

In [60]:
pd.Series(['foo', 'bar', 'baz']) == 'foo'


0     True
1    False
2    False
dtype: bool

In [61]:
pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

Sometimes we would like to combine two DataFrame objects so that missing values in one DataFrame are filled with like-labeled values from the other. The method implementing this operation is combine_first(), which we illustrate:

In [62]:
df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],'B' : [np.nan, 2., 3., np.nan, 6.]})

In [63]:
df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],'B' : [np.nan, np.nan, 3., 4., 6., 8.]})

In [64]:
df1.combine_first(df2)


     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0
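
Row 5 in the result exists only in df2, which shows that combine_first() works on the union of both indexes. For contrast, the sketch below (with smaller made-up frames) compares it to fillna() with a DataFrame argument, which fills the same holes but keeps only the caller's rows.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3.]})
df2 = pd.DataFrame({'A': [5., 2., 4., 7.]})

# Values from df1 win; gaps and missing rows are taken from df2.
combined = df1.combine_first(df2)

# fillna() also fills by label, but the result keeps df1's index.
filled = df1.fillna(df2)

print(combined)
print(filled)
```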


In [65]:
df1.idxmin(axis=0)  # for each column, the row label of its minimum (NaN is skipped)
    

A    0
B    1
dtype: int64

In [66]:
df1.idxmin(axis=1)  # for each row, the column label of its minimum (NaN is skipped)

0    A
1    B
2    A
3    A
4    B
dtype: object
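
With the default integer index, the axis=0 result can look like a position. The sketch below uses a made-up frame with string row labels to make it clearer that idxmin() returns labels, not positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3.0, 1.0, np.nan],
                   'B': [np.nan, 5.0, 2.0]},
                  index=['x', 'y', 'z'])

col_min = df.idxmin(axis=0)   # row label of the minimum in each column
row_min = df.idxmin(axis=1)   # column label of the minimum in each row

print(col_min)
print(row_min)
```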