pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and is easy to use in NumPy-centric applications.

• Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and from working with differently indexed data coming from different sources.
• Integrated time series functionality.
• The same data structures handle both time series data and non-time series data.
• Arithmetic operations and reductions (like summing across an axis) pass on the metadata (axis labels).
• Flexible handling of missing data.
• Merge and other relational operations found in popular databases.

In [1]:
import pandas as pd

In [2]:
print(dir(pd))

['Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Float64Index', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int64Index', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NaT', 'Panel', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseArray', 'SparseDataFrame', 'SparseDtype', 'SparseSeries', 'TimeGrouper', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt64Index', 'UInt8Dtype', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_hashtable', '_lib', '_libs', '_np_version_under1p13', '_np_version_under1p14', '_np_version_under1p15', '_np_version_under1p16', '_np_version_under1p17', '_tslib', '_version', 'api', 'array', 'arrays', 'bdate_rang

###### Series and DataFrames

###### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy dtype) and an associated array of data labels, called its index.

In [4]:
obj = pd.Series([4,5,-7,3])
obj

0    4
1    5
2   -7
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [7]:
print(obj.values)
print(obj.index)

[ 4  5 -7  3]
RangeIndex(start=0, stop=4, step=1)


In [8]:
obj2 = pd.Series([4,5,-7,3], index=['d','b','a','c'])
obj2

d    4
b    5
a   -7
c    3
dtype: int64

In [9]:
obj2['a']

-7

In [10]:
obj2['a']=6

In [11]:
obj2

d    4
b    5
a    6
c    3
dtype: int64

In [13]:
obj2[obj2 >0]

d    4
b    5
a    6
c    3
dtype: int64

In [16]:
obj2*2

d     8
b    10
a    12
c     6
dtype: int64

In [18]:
import numpy as np
np.exp(obj2)

d     54.598150
b    148.413159
a    403.428793
c     20.085537
dtype: float64

In [19]:
'b' in obj2

True

In [20]:
'e' in obj2

False

In [25]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [30]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [29]:
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values.

The isnull and notnull functions in pandas should be used to detect missing values; both are also available as Series methods:

In [31]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [33]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
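
As a brief follow-up to the checks above, the sketch below counts the missing entries and shows two common ways of handling them, dropna and fillna. The Series is rebuilt so that the block runs on its own; the name `pop_by_state` is only illustrative and does not appear elsewhere in this notebook.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
pop_by_state = pd.Series(sdata, index=states)  # 'California' has no value -> NaN

n_missing = int(pop_by_state.isnull().sum())   # True counts as 1 when summed
without_missing = pop_by_state.dropna()        # new Series without the NaN entries
filled = pop_by_state.fillna(0)                # new Series with NaN replaced by 0

print(n_missing)
print(without_missing)
print(filled)
```

Both dropna and fillna return new objects here, so `pop_by_state` itself still contains the NaN afterwards.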

In [36]:
print(obj3)
print()
print(obj4)
print()
print(obj3+obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


In [37]:
obj4.name = 'population'

In [38]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

###### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and a column index; it can be thought of as a dict of Series, all sharing the same index. Row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. 

Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. 

In [42]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [50]:
#If you specify a sequence of columns, the DataFrame’s columns will be exactly what
#you pass:
frame2 = pd.DataFrame(data, columns=['year', 'state','pop','debt'], 
             index = ['one','two','three','four','five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [51]:
#A column in a DataFrame can be retrieved as a series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [52]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [64]:
#Rows can also be retrieved by position or name
print(frame2)
print()
print(frame2.loc['two'])
print()
print(frame2.iloc[0])
print()
print(frame2.iloc[0:2])

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object

     year state  pop debt
one  2000  Ohio  1.5  NaN
two  2001  Ohio  1.7  NaN


Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [69]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [71]:
frame2['debt']=np.arange(5)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values wherever the Series has no matching label:

In [73]:
val = pd.Series([-1.2, -1.5,-1.7], index = ['two', 'three','four'])
frame2['debt']=val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,-1.5
four,2001,Nevada,2.4,-1.7
five,2002,Nevada,2.9,
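
The length requirement for lists and arrays can be checked directly. The sketch below uses a small made-up frame (not the frame2 from above) and assumes that pandas raises a ValueError for the mismatched list, which is the behaviour in the pandas versions I am aware of.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A plain list must supply exactly one value per row.
try:
    frame['debt'] = [1.0, 2.0]          # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on its index instead; unmatched rows become NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])

print(repr(length_error))
print(frame)
```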


Assigning to a column that doesn't exist creates a new column. The del keyword deletes columns, as with a dict:

In [75]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,-1.5,True
four,2001,Nevada,2.4,-1.7,False
five,2002,Nevada,2.9,,False


In [76]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,-1.5
four,2001,Nevada,2.4,-1.7
five,2002,Nevada,2.9,


In [78]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

If a nested dict is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row indices:

In [80]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [81]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


The keys in the inner dicts are unioned and sorted to form the index in the result. This isn't true if an explicit index is specified:

In [85]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


If a DataFrame's index and columns have their name attributes set, these will also be displayed.

In [86]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [87]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [88]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, -1.5],
       [2001, 'Nevada', 2.4, -1.7],
       [2002, 'Nevada', 2.9, nan]], dtype=object)

##### Possible data inputs to DataFrame constructor

Type Notes

    2D ndarray 
    : A matrix of data, passing optional row and column labels

    dict of arrays, lists, or tuples 
    : Each sequence becomes a column in the DataFrame. All sequences must be the same length.
    
    NumPy structured/record array 
    : Treated as the “dict of arrays” case

    dict of Series 
    : Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.

    dict of dicts 
    : Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.

    list of dicts or Series 
    : Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels

    List of lists or tuples 
    : Treated as the “2D ndarray” case

    Another DataFrame 
    : The DataFrame’s indexes are used unless different ones are passed

    NumPy MaskedArray 
    : Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result
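
A few of the input types listed above, sketched with made-up data. The variable names are illustrative only, and the block is meant as a quick check of the rules in the table rather than a complete tour of the constructor.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# dict of Series: the row index is the union of the Series indexes
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['a', 'b']),
                            'y': pd.Series([3, 4], index=['b', 'c'])})

# list of dicts: each dict is a row, the union of the keys gives the columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

print(from_array, from_series, from_records, sep='\n\n')
```

Wherever a Series or dict has no entry for a given label, the corresponding cell is filled with NaN.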

##### Index objects

pandas Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [92]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index
print(obj)
print()
print(index)

a    0
b    1
c    2
dtype: int64

Index(['a', 'b', 'c'], dtype='object')


In [93]:
print(index[1:])

Index(['b', 'c'], dtype='object')


In [94]:
#index objects are immutable and hence cannot be modified by users

index[1] = 'd'

TypeError: Index does not support mutable operations
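
Since an Index cannot be changed element by element, relabeling means producing new labels. The sketch below shows two ways to do that, under the assumption that either a renamed copy or a wholesale replacement of the index is acceptable; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])

# In-place modification of a single Index element is rejected.
try:
    obj.index[1] = 'd'
    raised = False
except TypeError:
    raised = True

# Option 1: rename returns a new Series with the mapped labels.
renamed = obj.rename({'b': 'd'})

# Option 2: assign a complete new set of labels to the index attribute.
relabeled = obj.copy()
relabeled.index = ['x', 'y', 'z']

print(raised, list(renamed.index), list(relabeled.index), list(obj.index))
```

Immutability is what makes it safe to share one Index object among several data structures.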

In [96]:
index = pd.Index(np.arange(3))
index

Int64Index([0, 1, 2], dtype='int64')

Main Index objects in pandas

Class Description

Index -->The most general Index object, representing axis labels in a NumPy array of Python objects.

Int64Index -->Specialized Index for integer values.

MultiIndex -->“Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.

DatetimeIndex -->Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).

PeriodIndex -->Specialized Index for Period data (timespans).

In [97]:
#In addition to being array like, an Index also functions as a fixed-size set:

frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [98]:
'Ohio' in frame3.columns

True

In [99]:
2003 in frame3.index

False

##### Index methods and properties

Method Description

append -->Concatenate with additional Index objects, producing a new Index

difference -->Compute set difference as an Index

intersection -->Compute set intersection

union -->Compute set union

isin -->Compute boolean array indicating whether each value is contained in the passed collection

delete -->Compute new Index with element at index i deleted

drop -->Compute new index by deleting passed values

insert -->Compute new Index by inserting element at index i

is_monotonic -->Returns True if each element is greater than or equal to the previous element

is_unique -->Returns True if the Index has no duplicate values

unique -->Compute the array of unique values in the Index
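
Most of the methods in the table can be tried on two small Index objects. This is a minimal sketch with made-up labels; `left` and `right` are illustrative names.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))          # labels found in either Index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels in left that are not in right
print(left.isin(['a', 'c']))      # boolean array, one entry per label
print(left.append(right))         # concatenation, duplicates are kept
print(left.delete(0))             # new Index without the element at position 0
print(left.drop(['b']))           # new Index without the label 'b'
print(left.insert(1, 'z'))        # new Index with 'z' inserted at position 1
print(left.is_unique, left.append(right).is_unique)
print(left.append(right).unique())
```

Every call returns a new object; `left` and `right` are never modified.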

###### Essential functionality

Reindexing:

A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index:

In [100]:
#Consider the following example

obj = pd.Series([4.5,7.2,-5.3,3.6], index = ['a','b','c','d'])
obj

a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64

In [101]:
obj2 = obj.reindex(['e','c','d','b','a'])
obj2

e    NaN
c   -5.3
d    3.6
b    7.2
a    4.5
dtype: float64

In [105]:
obj3=obj2.reindex(['a','b','c','d','e','f'],fill_value=0)
obj3

a    4.5
b    7.2
c   -5.3
d    3.6
e    NaN
f    0.0
dtype: float64

In [109]:
#ffill method is used to forward fill the values:

obj4 = pd.Series(['blue','green','yellow'], index=[0,2,3])
obj4

0      blue
2     green
3    yellow
dtype: object

In [110]:
obj4.reindex(range(6), method = 'ffill')

0      blue
1      blue
2     green
3    yellow
4    yellow
5    yellow
dtype: object

The table below lists the available method options. At this time, interpolation more sophisticated than forward or backward filling would need to be applied separately, after reindexing:

reindex method (interpolation) options

Argument Description

ffill or pad -->Fill (or carry) values forward

bfill or backfill -->Fill (or carry) values backward
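
The two fill directions can be compared side by side. The sketch reuses the colour data from the example above under the illustrative name `colors`.

```python
import pandas as pd

colors = pd.Series(['blue', 'green', 'yellow'], index=[0, 2, 3])

forward = colors.reindex(range(6), method='ffill')   # carry the last seen value forward
backward = colors.reindex(range(6), method='bfill')  # pull the next value backward

print(forward)
print(backward)
```

With bfill, the labels 4 and 5 stay missing because there is no later value to carry back.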

With a DataFrame, reindex can alter the (row) index, the columns, or both. When passed just a sequence, the rows are reindexed:

In [118]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), index = ['a','c','d'], 
                     columns = ['Ohio','Texas','California'])
frame.columns.name = 'states'
frame.index.name = 'Sno'
frame

states  Ohio  Texas  California
Sno
a          0      1           2
c          3      4           5
d          6      7           8


In [119]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

states  Ohio  Texas  California
Sno
a        0.0    1.0         2.0
b        NaN    NaN         NaN
c        3.0    4.0         5.0
d        6.0    7.0         8.0


In [126]:
#columns can be reindexed using the columns keyword:
states = ['Texas','Utah','California']
frame2.reindex(columns = states)

states  Texas  Utah  California
Sno
a         1.0   NaN         2.0
b         NaN   NaN         NaN
c         4.0   NaN         5.0
d         7.0   NaN         8.0


###### reindex function arguments

Argument Description

index -->New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying

method -->Interpolation (fill) method; see the reindex method (interpolation) options table above.

fill_value -->Substitute value to use when introducing missing data by reindexing

limit -->When forward- or backfilling, maximum size gap to fill

level -->Match simple Index on level of MultiIndex, otherwise select subset of

copy -->If True (the default), always copy the underlying data; if False, do not copy the data when the new index is equivalent to the old index.
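
Two of these arguments are easy to misread, so here is a small sketch with made-up data. It illustrates that fill_value applies only to labels that the reindex call newly introduces (which is why 'e' stayed NaN in the obj3 example earlier), and that limit caps how many consecutive labels a fill method may fill.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2], index=['a', 'c'])

# fill_value is used only for labels that this reindex call introduces.
step1 = obj.reindex(['a', 'b', 'c'])                 # 'b' is new -> NaN
step2 = step1.reindex(['a', 'b', 'c', 'd'], fill_value=0)
# 'b' already existed in step1 (holding NaN), so it stays NaN; only 'd' gets 0.

# limit caps the number of consecutive labels filled by ffill/bfill.
sparse = pd.Series([1.0, 2.0], index=[0, 5])
limited = sparse.reindex(range(6), method='ffill', limit=2)

print(step2)
print(limited)
```

To replace values that are already NaN, fillna is the appropriate tool rather than fill_value.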

###### Dropping entries from axis

Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [133]:
obj = pd.Series(np.arange(5), index = ['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [134]:
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int32

In [136]:
new_obj = obj.drop(['d','c'])
new_obj

a    0
b    1
e    4
dtype: int32

In [140]:
data = pd.DataFrame(np.arange(16).reshape(4,4),index = ['ohio','colorado','Utah','New York'],
                    columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [148]:
data.drop(['three', 'four'], axis =1)

Unnamed: 0,one,two
ohio,0,1
colorado,4,5
Utah,8,9
New York,12,13


In [149]:
data.drop(['Utah']) #Default axis is rows

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
New York,12,13,14,15
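
As noted above, drop returns a new object. The sketch below rebuilds the same frame and uses the index= and columns= keywords as an alternative to axis=0 or axis=1; the name `smaller` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['ohio', 'colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# index= and columns= name the axis explicitly instead of using axis=0/1.
smaller = data.drop(index=['Utah'], columns=['four'])

print(smaller.shape)   # the returned copy has one row and one column fewer
print(data.shape)      # the original DataFrame is unchanged
```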


###### indexing, selection and filtering

In [150]:
obj = pd.Series(np.arange(4), index = ['a','b','c','d'])

In [152]:
obj

a    0
b    1
c    2
d    3
dtype: int32

In [153]:
obj['b']

1

In [155]:
obj[3]  # integer position, since the index labels are strings; obj.iloc[3] states this explicitly

3

In [157]:
obj[:]

a    0
b    1
c    2
d    3
dtype: int32

In [159]:
obj[['a','b']]

a    0
b    1
dtype: int32

In [160]:
obj[0:2]

a    0
b    1
dtype: int32

In [161]:
obj[[1,3]]

b    1
d    3
dtype: int32

In [162]:
obj[obj<2]

a    0
b    1
dtype: int32

In [165]:
obj['b':'c'] = 5
obj

a    0
b    5
c    5
d    3
dtype: int32

In [167]:
data = pd.DataFrame(np.arange(16).reshape(4,4),index = ['ohio','colorado','Utah','New York'],
                    columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [168]:
data['two']

ohio         1
colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [169]:
data[['three','one']]

Unnamed: 0,three,one
ohio,2,0
colorado,6,4
Utah,10,8
New York,14,12


In [170]:
data[:2]

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7


In [171]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [172]:
data

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [173]:
data < 5

Unnamed: 0,one,two,three,four
ohio,True,True,True,True
colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [174]:
data[data<5]=0
data

Unnamed: 0,one,two,three,four
ohio,0,0,0,0
colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


###### data.loc

Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

  Warning: contrary to usual Python slices, **both** the
  start and the stop are included.

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

Raises KeyError when any items are not found.

See also:
DataFrame.at : Access a single value for a row/column label pair.
DataFrame.iloc : Access group of rows and columns by integer position(s).
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the
    Series/DataFrame.
Series.loc : Access group of values using labels.
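
The difference between label slices and positional slices, along with the KeyError for a missing label, can be seen in a few lines. This is a minimal sketch with a made-up Series.

```python
import pandas as pd

obj = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]    # positional slice: the stop position is excluded

# A label that is absent from the index raises KeyError.
try:
    obj.loc['z']
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(by_label)
print(by_position)
print(missing_label_error)
```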

In [180]:
data

Unnamed: 0,one,two,three,four
ohio,0,0,0,0
colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [178]:
data.loc['colorado']

one      0
two      5
three    6
four     7
Name: colorado, dtype: int32

In [179]:
data.loc['colorado',['two','three']]

two      5
three    6
Name: colorado, dtype: int32

In [186]:
data.loc[['colorado','Utah'],['two','three']]

Unnamed: 0,two,three
colorado,5,6
Utah,9,10


In [193]:
data.loc[['colorado','Utah'][0:2]]

Unnamed: 0,one,two,three,four
colorado,0,5,6,7
Utah,8,9,10,11


In [198]:
data.loc[data.three>5]

Unnamed: 0,one,two,three,four
colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [201]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [214]:
#Single label. Note this returns the row as a Series.

df.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

In [216]:
#List of labels. Note using ``[[]]`` returns a DataFrame.

df.loc[['viper','sidewinder']]

Unnamed: 0,max_speed,shield
viper,4,5
sidewinder,7,8


In [217]:
df

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [205]:
print(df.loc['cobra','shield'])
print(df.loc['viper','max_speed'])

2
4


In [220]:
#Slice with labels for row and single label for column. both the start and stop of the slice are included.

df.loc['cobra':'sidewinder','max_speed']

cobra         1
viper         4
sidewinder    7
Name: max_speed, dtype: int64

In [221]:
#Boolean list with the same length as the row axis

df.loc[[False, False, True]]#boolean values refer to rows. false rows are not shown

Unnamed: 0,max_speed,shield
sidewinder,7,8


In [224]:
#Conditional that returns a boolean Series

df.loc[df['shield']>6]

Unnamed: 0,max_speed,shield
sidewinder,7,8


In [209]:
#Conditional that returns a boolean Series with column labels specified

df.loc[df['shield']>6, ['max_speed']]

Unnamed: 0,max_speed
sidewinder,7


In [226]:
#Callable that returns a boolean Series

df.loc[lambda df: df['shield']==8]

Unnamed: 0,max_speed,shield
sidewinder,7,8


In [227]:
df.loc[['viper','sidewinder'],['shield']]=50
df

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,50
sidewinder,7,50


In [228]:
#set value for an entire row
df.loc['cobra'] = 10
df

Unnamed: 0,max_speed,shield
cobra,10,10
viper,4,50
sidewinder,7,50


In [229]:
#set value for an entire column
df.loc[:,'max_speed']=30
df

Unnamed: 0,max_speed,shield
cobra,30,10
viper,30,50
sidewinder,30,50


In [230]:
#Set value for all rows matching a boolean condition

df.loc[df['shield']>35]=0
df

Unnamed: 0,max_speed,shield
cobra,30,10
viper,0,0
sidewinder,0,0


In [231]:
#Getting values on a DataFrame with an index that has integer labels

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[7, 8, 9], columns=['max_speed', 'shield'])
df

Unnamed: 0,max_speed,shield
7,1,2
8,4,5
9,7,8


In [232]:
#Slice with integer labels for rows. start and stop of the slice are included.

df.loc[7:9]

Unnamed: 0,max_speed,shield
7,1,2
8,4,5
9,7,8


In [234]:
#Getting values with multiindex

tuples = [('cobra', 'mark i'), ('cobra', 'mark ii'),
          ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
          ('viper', 'mark ii'), ('viper', 'mark iii')]

In [235]:
index = pd.MultiIndex.from_tuples(tuples)

In [236]:
values = [[12, 2], [0, 4], [10, 20],[1, 4], [7, 1], [16, 36]]

In [237]:
df = pd.DataFrame(values, columns=['max_speed','shield'], index = index)

In [238]:
df

Unnamed: 0,Unnamed: 1,max_speed,shield
cobra,mark i,12,2
cobra,mark ii,0,4
sidewinder,mark i,10,20
sidewinder,mark ii,1,4
viper,mark ii,7,1
viper,mark iii,16,36


In [239]:
#Single label. Note this returns a DataFrame with a single index.

df.loc['viper']

Unnamed: 0,max_speed,shield
mark ii,7,1
mark iii,16,36


In [241]:
#Single label for row and column. Similar to passing in a tuple, this returns a Series.

df.loc['viper','mark ii']

max_speed    7
shield       1
Name: (viper, mark ii), dtype: int64

In [243]:
#Single tuple. Note using ``[[]]`` returns a DataFrame.

df.loc[[('cobra','mark ii')]]


Unnamed: 0,Unnamed: 1,max_speed,shield
cobra,mark ii,0,4


In [244]:
#Single tuple for the index with a single label for the column

df.loc[('cobra','mark i'),'shield']

2

In [245]:
#Slice from index tuple to single label

df.loc[('cobra','mark i'):'viper']

Unnamed: 0,Unnamed: 1,max_speed,shield
cobra,mark i,12,2
cobra,mark ii,0,4
sidewinder,mark i,10,20
sidewinder,mark ii,1,4
viper,mark ii,7,1
viper,mark iii,16,36


In [246]:
#slice from index tuple to index tuple

df.loc[('cobra','mark i'):('viper','mark ii')]

Unnamed: 0,Unnamed: 1,max_speed,shield
cobra,mark i,12,2
cobra,mark ii,0,4
sidewinder,mark i,10,20
sidewinder,mark ii,1,4
viper,mark ii,7,1


###### data.iloc

Purely integer-location based indexing for selection by position.

``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.

Allowed inputs are:

- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above).
  This is useful in method chains, when you don't have a reference to the
  calling object, but would like to base your selection on some value.

``.iloc`` will raise ``IndexError`` if a requested indexer is
out-of-bounds, except *slice* indexers which allow out-of-bounds
indexing (this conforms with python/numpy *slice* semantics).

See more in the pandas user guide, section "Selection by Position".

See also:
DataFrame.iat : Fast integer location scalar accessor.
DataFrame.loc : Purely label-location based indexer for selection by label.
Series.iloc : Purely integer-location based indexing for
               selection by position.
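
The out-of-bounds rule described above can be sketched as follows, using a small made-up frame: a scalar position past the end raises IndexError, while a slice is truncated.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 100, 1000], 'b': [2, 200, 2000]})

# A scalar position beyond the last row raises IndexError.
try:
    df.iloc[5]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# A slice that runs past the end is truncated, as with Python lists.
tail = df.iloc[1:10]

print(out_of_bounds)
print(tail)
```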

In [247]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]

In [248]:
df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [251]:
#Indexing just the rows with a scalar integer

print(df.iloc[0])
print(df.iloc[1])

a    1
b    2
c    3
d    4
Name: 0, dtype: int64
a    100
b    200
c    300
d    400
Name: 1, dtype: int64


In [252]:
#With a list of integers
df.iloc[[0]]

Unnamed: 0,a,b,c,d
0,1,2,3,4


In [253]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [255]:
df.iloc[[0,1]]

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400


In [256]:
df.iloc[:3]

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [257]:
df.iloc[[True, False, True]]

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000


In [260]:
"""
With a callable, useful in method chains. The `x` passed
to the ``lambda`` is the DataFrame being sliced. This selects
the rows whose index label is even.
"""

df.iloc[lambda x: x.index%2==0]

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000


In [261]:
"""
**Indexing both axes**

You can mix the indexer types for the index and columns. Use ``:`` to
select the entire axis.

With scalar integers.
"""

df.iloc[0,1]

2

In [262]:
#with list of integers

df.iloc[[0, 2], [1, 3]]

Unnamed: 0,b,d
0,2,4
2,2000,4000


In [263]:
#With `slice` objects

df.iloc[1:3,0:3]

Unnamed: 0,a,b,c
1,100,200,300
2,1000,2000,3000


In [264]:
#With a boolean array whose length matches the columns

df.iloc[:,[True, False, True, False]]

Unnamed: 0,a,c
0,1,3
1,100,300
2,1000,3000


In [265]:
#With a callable function that expects the Series or DataFrame

df.iloc[:, lambda df:[0,2]]

Unnamed: 0,a,c
0,1,3
1,100,300
2,1000,3000


##### Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs:

In [268]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [270]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don’t overlap. Missing values propagate in arithmetic computations.

In [273]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [276]:
print(df1, df2, df1+df2, sep='\n')

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


###### Arithmetic methods with fill values

In [278]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


In [279]:
#When reindexing, fill_value likewise replaces the default NaN for newly introduced labels
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,b,d,e
Ohio,0.0,2.0,0
Texas,3.0,5.0,0
Colorado,6.0,8.0,0


###### Flexible arithmetic methods

Method Description

add -->Method for addition (+)

sub -->Method for subtraction (-)

div -->Method for division (/)

mul -->Method for multiplication (*)
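
A short sketch of how these methods relate to the operators, using two made-up Series. Without fill_value each method gives the same result as its operator; with fill_value, a label present in only one operand is treated as if the other operand held that value.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4], index=['a', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, 4.0], index=['a', 'c', 'f'])

plain = s1 + s2                      # 'd' and 'f' do not overlap -> NaN
filled = s1.add(s2, fill_value=0)    # a missing operand is treated as 0

# Without fill_value, each method matches its operator.
same_sub = s1.sub(s2).equals(s1 - s2)
same_mul = s1.mul(s2).equals(s1 * s2)
same_div = s1.div(s2).equals(s1 / s2)

print(plain)
print(filled)
print(same_sub, same_mul, same_div)
```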

In [280]:
#ops between DataFrame and Series

arr= np.arange(12).reshape(3,4)
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [281]:
arr[0]

array([0, 1, 2, 3])

In [282]:
arr-arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [283]:
frame = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('bde'),
                    index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [286]:
series=frame.iloc[0]
series

b    0
d    1
e    2
Name: Utah, dtype: int32

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows:

In [287]:
frame - series

Unnamed: 0,b,d,e
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:

In [288]:
series2 = pd.Series(range(3), index=['b','e','f'])
series2

b    0
e    1
f    2
dtype: int64

In [289]:
frame+series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods

In [291]:
series3 = frame['d']
series3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int32

In [292]:
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1


###### Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects

In [295]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

Unnamed: 0,b,d,e
Utah,-0.114264,-1.816421,1.723731
Ohio,-0.382077,-1.134099,-0.11242
Texas,0.614498,2.246961,0.580847
Oregon,0.363672,1.835468,-0.573854


In [296]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.114264,1.816421,1.723731
Ohio,0.382077,1.134099,0.11242
Texas,0.614498,2.246961,0.580847
Oregon,0.363672,1.835468,0.573854


In [300]:
#Another frequent operation is applying a function on 1D arrays to each column or row.

f = lambda x:x.max() - x.min()
frame.apply(f)

b    0.996575
d    4.063382
e    2.297584
dtype: float64

In [301]:
frame.apply(f, axis =1)

Utah      3.540151
Ohio      1.021678
Texas     1.666114
Oregon    2.409321
dtype: float64

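In [310]:
#the function passed to apply need not return a scalar; it can also return a Series with multiple values
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])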

In [311]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.382077,-1.816421,-0.573854
max,0.614498,2.246961,1.723731


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [334]:
fo = lambda x: '%2f'%x
fo

<function __main__.<lambda>(x)>

In [332]:
frame.applymap(fo)

Unnamed: 0,b,d,e
Utah,-0.114264,-1.816421,1.723731
Ohio,-0.382077,-1.134099,-0.11242
Texas,0.614498,2.246961,0.580847
Oregon,0.363672,1.835468,-0.573854


In [333]:
frame['b'].map(fo)

Utah      -0.114264
Ohio      -0.382077
Texas      0.614498
Oregon     0.363672
Name: b, dtype: object
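
Note that the format string '%2f' sets a minimum field width of 2 and keeps the default precision of six digits, which is why the strings above look identical to the original floats. A minimal sketch of the two-decimal variant follows; it uses a tiny hand-made frame, and it combines apply with Series.map so that it does not depend on which pandas releases provide applymap.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.114264, 0.614498],
                      'd': [-1.816421, 2.246961]},
                     index=['Utah', 'Texas'])

def two_decimals(x):
    # '.2' is the precision: two digits after the decimal point
    return '%.2f' % x

def min_width(x):
    # '2' is only a minimum field width; precision stays at the default 6
    return '%2f' % x

# Element-wise formatting of every column
formatted = frame.apply(lambda col: col.map(two_decimals))
print(formatted)
```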

###### Sorting and ranking

To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [335]:
obj = pd.Series(range(4), index = ['d','a','b','c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis

In [338]:
frame = pd.DataFrame(np.arange(8).reshape((2,4)), index=['three','one'],
                    columns=['d','a','b','c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [341]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [342]:
frame.sort_index(axis=0)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


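In [366]:
#sorts each row of the underlying NumPy array in place; the rows here are already in ascending order, so nothing changes
frame.values.sort()
frame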

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7
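
Besides sorting by labels, it is common to sort in descending label order or by the data itself. The following is a small sketch on made-up values, using the ascending argument of sort_index and the sort_values method.

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2], index=['d', 'a', 'b', 'c'])

# Labels in descending order
by_label_desc = obj.sort_index(ascending=False)

# Sort by the values rather than the labels
by_value = obj.sort_values()

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort rows by column 'a' first, then by column 'b' within each group
by_two_cols = frame.sort_values(by=['a', 'b'])

print(by_label_desc)
print(by_value)
print(by_two_cols)
```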


Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. By default, rank breaks ties by assigning each group the mean rank:

In [367]:
obj.rank()

0    2.0
1    4.0
2    1.0
3    3.0
dtype: float64

In [368]:
obj.rank(method='first')

0    2.0
1    4.0
2    1.0
3    3.0
dtype: float64

In [369]:
obj.rank(ascending=False, method='max')

0    3.0
1    1.0
2    4.0
3    2.0
dtype: float64

In [371]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})

In [372]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [373]:
frame.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


###### Tie-breaking methods with rank

Method Description

'average' -->Default: assign the average rank to each entry in the equal group.

'min' -->Use the minimum rank for the whole group.

'max' -->Use the maximum rank for the whole group.

'first' -->Assign ranks in the order the values appear in the data.
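
The tie-breaking rules are easiest to compare on data that actually contains ties. The sketch below uses a small made-up Series in which 7 and 4 each occur twice.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they span
average = obj.rank()

# 'first': ties are ranked in the order they appear in the data
first = obj.rank(method='first')

# 'min' and 'max': every member of a tied group gets the lowest
# or highest rank of that group
lowest = obj.rank(method='min')
highest = obj.rank(method='max')

print(pd.DataFrame({'value': obj, 'average': average, 'first': first,
                    'min': lowest, 'max': highest}))
```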

###### Axis indices with duplicate values

While many pandas functions (like reindex) require that the labels be unique,
uniqueness is not mandatory. Let's consider a small Series with duplicate index labels:

In [374]:
obj=pd.Series(range(5), index = ['a','a','b','c','c'])
obj

a    0
a    1
b    2
c    3
c    4
dtype: int64

In [375]:
#The index’s is_unique property can tell you whether its values are unique or not

obj.index.is_unique

False

In [376]:
#Data selection is one of the main things that behaves differently with duplicates. Indexing a value with multiple entries returns a Series while single entries return a scalar

obj['a']

a    0
a    1
dtype: int64

In [377]:
obj['c']

c    3
c    4
dtype: int64

In [378]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,0.216571,-0.581898,0.827126
a,-0.903509,0.066548,-0.205609
b,-0.073837,-0.505675,0.581462
b,0.077139,0.041985,-0.714642


In [382]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.073837,-0.505675,0.581462
b,0.077139,0.041985,-0.714642
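
When duplicate labels are unwanted, two common ways to deal with them are to keep only the first entry per label or to aggregate the entries. The sketch below shows both on the small Series from above; which of the two is appropriate depends on the data, so treat it as an illustration rather than a recommendation.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'c', 'c'])

# Boolean array marking every repeat of a label after its first occurrence
dup_mask = obj.index.duplicated(keep='first')

# Keep only the first entry for each label
first_only = obj[~dup_mask]

# Or collapse the duplicates by summing the entries that share a label
totals = obj.groupby(level=0).sum()

print(first_only)
print(totals)
```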


###### Summarizing and computing descriptive statistics

In [384]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])

In [385]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [386]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [387]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [388]:
#NA values are excluded by default (an all-NA row such as c sums to 0.0 above). Passing skipna=False disables this, so any NA in a row makes the result NaN

df.mean(axis=1, skipna=False)


a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

###### Options for reduction methods

Argument Description

axis -->Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna -->Exclude missing values, True by default.
level -->Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex).
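
A short sketch of how these options interact with missing data, on the same small frame as above. The min_count argument of sum is not listed in the table; it is included here as one way to get NaN instead of 0.0 for an all-NA row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NA values are skipped, so the all-NA row 'c' sums to 0.0
row_sums = df.sum(axis=1)

# Require at least one valid value, otherwise return NaN
row_sums_strict = df.sum(axis=1, min_count=1)

# skipna=False: a single NA in a row makes that row's result NaN
row_means_no_skip = df.mean(axis=1, skipna=False)

print(row_sums)
print(row_sums_strict)
print(row_means_no_skip)
```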

Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained

In [389]:
df.idxmax()

one    b
two    d
dtype: object

In [390]:
df.idxmin()

one    d
two    b
dtype: object

In [391]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [392]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [400]:
df.cumprod()

Unnamed: 0,one,two
a,1.4,
b,9.94,-4.5
c,,
d,7.455,5.85


In [407]:
df.diff()

Unnamed: 0,one,two
a,,
b,5.7,
c,,
d,,


In [406]:
df.pct_change()

Unnamed: 0,one,two
a,,
b,4.071429,
c,0.0,0.0
d,-0.894366,-0.711111


In [405]:
df.kurt()

one   NaN
two   NaN
dtype: float64

In [404]:
df.skew()

one    1.664846
two         NaN
dtype: float64

In [393]:
df.head()

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [394]:
df.tail()

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [395]:
df.count()

one    3
two    2
dtype: int64

In [396]:
df.min()

one    0.75
two   -4.50
dtype: float64

In [397]:
df.max()

one    7.1
two   -1.3
dtype: float64

In [408]:
df.var()

one    12.205833
two     5.120000
dtype: float64

In [409]:
df.mad()

one    2.677778
two    1.600000
dtype: float64

###### Descriptive and summary statistics

Method Description

count -->Number of non-NA values

describe -->Compute set of summary statistics for Series or each DataFrame column

min, max -->Compute minimum and maximum values

argmin, argmax -->Compute index locations (integers) at which minimum or maximum value obtained, respectively

idxmin, idxmax -->Compute index values at which minimum or maximum value obtained, respectively

quantile -->Compute sample quantile ranging from 0 to 1

sum -->Sum of values

mean -->Mean of values

median -->Arithmetic median (50% quantile) of values

mad -->Mean absolute deviation from mean value

var -->Sample variance of values

std -->Sample standard deviation of values

skew -->Sample skewness (3rd moment) of values

kurt -->Sample kurtosis (4th moment) of values

cumsum -->Cumulative sum of values

cummin, cummax -->Cumulative minimum or maximum of values, respectively

cumprod -->Cumulative product of values

diff -->Compute 1st arithmetic difference (useful for time series)

pct_change -->Compute percent changes

###### Unique values, value counts and membership

Another class of related methods extracts information about the values contained in a one-dimensional Series

In [418]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [420]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [422]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [423]:
pd.value_counts(obj.values, sort =False)

c    3
a    3
d    1
b    2
dtype: int64

isin is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a series or column in a DataFrame

In [424]:
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [425]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
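
The same mask can be negated to exclude values, and isin works identically on a DataFrame column. A minimal sketch follows; the small DataFrame and its column names key and value are made up for illustration.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# ~ inverts the boolean mask: keep everything that is NOT 'b' or 'c'
excluded = obj[~obj.isin(['b', 'c'])]

df = pd.DataFrame({'key': ['a', 'b', 'c', 'd'], 'value': [1, 2, 3, 4]})

# Filter DataFrame rows by membership of one column
subset = df[df['key'].isin(['b', 'd'])]

print(excluded)
print(subset)
```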

###### Unique, value counts, and binning methods

Method Description

isin -->Compute boolean array indicating whether each Series value is contained in the passed sequence of values.

unique -->Compute array of unique values in a Series, returned in the order observed.

value_counts -->Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order.

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame

In [427]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [428]:
#passing pandas.value_counts to this DataFrame's apply method gives:

res = data.apply(pd.value_counts).fillna(0)
res

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
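
The same table can be built by calling the value_counts method of each column, which avoids the top-level pd.value_counts function (it may not be available in every pandas release). This is a sketch of an equivalent formulation, not the notebook's original code; the astype(int) step is an optional addition that restores integer counts after fillna.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values column by column; values missing from a column become NaN,
# which fillna(0) turns into a zero count.
counts = (data.apply(lambda col: col.value_counts())
              .fillna(0)
              .astype(int)
              .sort_index())

print(counts)
```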


###### Handling Missing Data

pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. The built-in Python None value is also treated as NA in object arrays.

In [429]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [430]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [431]:
string_data[0] = None

In [432]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

###### NA handling methods

Method Description

dropna -->Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.

fillna -->Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.

isnull -->Return like-type object containing boolean values indicating which values are missing / NA.

notnull -->Negation of isnull
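
A compact sketch of the detection methods on a Series that mixes None and NaN. It uses isna and notna, which recent pandas versions offer as aliases of isnull and notnull.

```python
import numpy as np
import pandas as pd

s = pd.Series(['aardvark', None, np.nan, 'avocado'])

# Both None and NaN are reported as missing
missing = s.isna()

# notna is the negation; summing a boolean Series counts the True values
n_valid = s.notna().sum()

print(missing)
print(n_valid)
```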

###### Filtering Out Missing Data

On a Series, dropna returns the Series with only the non-null data and index values:

In [433]:
from numpy import nan as NA

In [442]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [435]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [436]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [438]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])

In [439]:
cleaned = data.dropna()

In [440]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [441]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [443]:
#passing how='all' will only drop rows that are all NA

data.dropna(how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [444]:
#assigning NA to a new label adds an all-NA column

data[4]=NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,

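
Columns can be dropped in the same way by passing axis=1. The sketch below builds a small DataFrame with one all-NA column and compares the three variants; the result names are illustrative.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data[4] = NA  # add a column that is entirely NA

# Default how='any': every row has at least one NA now, so nothing survives
rows_any = data.dropna()

# how='all': only rows that are NA in every column are dropped
rows_all = data.dropna(how='all')

# axis=1 with how='all': drop columns that are entirely NA
cols_all = data.dropna(axis=1, how='all')

print(rows_all)
print(cols_all)
```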

A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the thresh argument:

In [447]:
df = pd.DataFrame(np.random.randn(7,3))
df

Unnamed: 0,0,1,2
0,0.484085,-0.373092,-0.876402
1,0.390276,0.633878,0.301413
2,0.453415,0.884941,0.677489
3,0.98497,0.130498,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [449]:
df.iloc[:4,1]=NA; df.iloc[:2,2]=NA
df

Unnamed: 0,0,1,2
0,0.484085,,
1,0.390276,,
2,0.453415,,0.677489
3,0.98497,,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [450]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792
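
Because the frame above is random, here is the same idea as a deterministic sketch with hand-picked values (the column names x, y and z are made up). thresh is the minimum number of non-NA values a row needs in order to be kept.

```python
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'x': [1.0, NA, 3.0, NA],
                   'y': [NA, NA, 6.0, 8.0],
                   'z': [1.0, NA, 9.0, 2.0]})

# Non-NA values per row: 2, 0, 3, 2
valid_per_row = df.notna().sum(axis=1)

at_least_two = df.dropna(thresh=2)    # keeps rows 0, 2 and 3
at_least_three = df.dropna(thresh=3)  # keeps only row 2

print(at_least_two)
print(at_least_three)
```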


###### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [451]:
df

Unnamed: 0,0,1,2
0,0.484085,,
1,0.390276,,
2,0.453415,,0.677489
3,0.98497,,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [452]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.484085,0.0,0.0
1,0.390276,0.0,0.0
2,0.453415,0.0,0.677489
3,0.98497,0.0,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [453]:
df

Unnamed: 0,0,1,2
0,0.484085,,
1,0.390276,,
2,0.453415,,0.677489
3,0.98497,,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [454]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,0.484085,0.240604,-0.470518
1,0.390276,0.240604,-0.470518
2,0.453415,0.240604,0.677489
3,0.98497,0.240604,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [455]:
#Calling fillna with a dict lets you use a different fill value for each column. Keys that match no column (3 here) are ignored, so column 2 keeps its NaN values

df.fillna({1:0.5, 3:-1})

Unnamed: 0,0,1,2
0,0.484085,0.5,
1,0.390276,0.5,
2,0.453415,0.5,0.677489
3,0.98497,0.5,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


In [458]:
#fillna returns a new object, but you can modify the existing object in place by passing inplace=True

df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.484085,0.0,0.0
1,0.390276,0.0,0.0
2,0.453415,0.0,0.677489
3,0.98497,0.0,-1.353189
4,0.836857,0.65261,-0.15389
5,-1.089885,0.073114,0.027792
6,-0.91589,-0.003913,-1.550792


The same interpolation methods available for reindexing can be used with fillna

In [459]:
df = pd.DataFrame(np.random.randn(6,3))

In [462]:
df.iloc[2:, 1] = NA; df.iloc[4:, 2] = NA

In [463]:
df

Unnamed: 0,0,1,2
0,-0.01647,-0.654946,0.047349
1,-0.401781,0.59135,0.072404
2,-0.836336,,1.819697
3,-0.991267,,-1.356521
4,-0.107408,,
5,-0.919154,,


In [464]:
df.fillna(method = 'ffill')

Unnamed: 0,0,1,2
0,-0.01647,-0.654946,0.047349
1,-0.401781,0.59135,0.072404
2,-0.836336,0.59135,1.819697
3,-0.991267,0.59135,-1.356521
4,-0.107408,0.59135,-1.356521
5,-0.919154,0.59135,-1.356521


In [465]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.01647,-0.654946,0.047349
1,-0.401781,0.59135,0.072404
2,-0.836336,0.59135,1.819697
3,-0.991267,0.59135,-1.356521
4,-0.107408,,-1.356521
5,-0.919154,,-1.356521


With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series

In [466]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [467]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

###### fillna function arguments

Argument Description

value -->Scalar value or dict-like object to use to fill missing values

method -->Interpolation method, 'ffill' (forward fill) or 'bfill' (backward fill), used in place of value

axis -->Axis to fill on, default axis=0

inplace -->Modify the calling object without producing a copy

limit -->For forward and backward filling, maximum number of consecutive periods to fill
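
A deterministic sketch of these arguments on a tiny hand-made frame. It uses the dedicated ffill and bfill methods, which perform the same filling as method='ffill' and method='bfill' and are the spelling that newer pandas versions prefer.

```python
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'a': [1.0, NA, NA, NA],
                   'b': [NA, 2.0, NA, 4.0]})

# Forward fill: propagate the last valid value downwards
forward = df.ffill()

# Fill at most one consecutive missing value per gap
forward_limited = df.ffill(limit=1)

# Backward fill: propagate the next valid value upwards
backward = df.bfill()

# Different fill value per column: a constant for 'a', the mean for 'b'
per_column = df.fillna({'a': 0.0, 'b': df['b'].mean()})

print(forward_limited)
print(per_column)
```

A leading NaN (column b, row 0) has nothing before it to propagate, so forward filling leaves it in place.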

###### Hierarchical Indexing

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form

In [469]:
data = pd.Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [470]:
data

a  1    1.024670
   2    1.332792
   3   -0.526461
b  1   -0.658433
   2    0.136787
   3   -0.847386
c  1   -0.240853
   2   -0.648605
d  2    0.306106
   3   -0.177548
dtype: float64

In [471]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [472]:
data['b']

1   -0.658433
2    0.136787
3   -0.847386
dtype: float64

In [473]:
data['b':'c']

b  1   -0.658433
   2    0.136787
   3   -0.847386
c  1   -0.240853
   2   -0.648605
dtype: float64

In [475]:
data.loc[['b','d']]

b  1   -0.658433
   2    0.136787
   3   -0.847386
d  2    0.306106
   3   -0.177548
dtype: float64

In [476]:
data[:,2]

a    1.332792
b    0.136787
c   -0.648605
d    0.306106
dtype: float64

Hierarchical indexing plays a critical role in reshaping data and group-based operations like forming a pivot table. For example, this data could be rearranged into a DataFrame using its unstack method

In [477]:
data.unstack()

Unnamed: 0,1,2,3
a,1.02467,1.332792,-0.526461
b,-0.658433,0.136787,-0.847386
c,-0.240853,-0.648605,
d,,0.306106,-0.177548


In [478]:
data.unstack().stack()

a  1    1.024670
   2    1.332792
   3   -0.526461
b  1   -0.658433
   2    0.136787
   3   -0.847386
c  1   -0.240853
   2   -0.648605
d  2    0.306106
   3   -0.177548
dtype: float64
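
The selections above can also be written with loc and xs, which makes the targeted level explicit. The sketch below uses the same index layout with the values 0.0 through 9.0 so the results are easy to check; the dropna call after stack is there only because pandas versions differ in whether stack keeps the NaN entries.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Partial indexing on the outer level
outer_b = data.loc['b']

# Select on the inner level: every outer label, inner label 2
inner_2 = data.loc[:, 2]
inner_2_xs = data.xs(2, level=1)  # same selection, level named explicitly

# Pivot the inner level into columns; missing combinations become NaN
wide = data.unstack()

# Going back: stack, then drop the NaN placeholders
round_trip = wide.stack().dropna()

print(wide)
```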

In [480]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])

In [481]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [482]:
frame.index.names=['key1', 'key2']

In [483]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [484]:
frame.columns.names = ['state','color']

In [485]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [486]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A MultiIndex can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this

In [490]:
a = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                              names=['state', 'color'])
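
To show the reuse in practice, the sketch below builds the row and column indexes once and passes them to two different DataFrames. The names cols, rows and other are illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])
rows = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                 names=['key1', 'key2'])

# Level names arrive with the index, no separate assignment needed
frame = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

# The same column index reused for an unrelated frame
other = pd.DataFrame(np.zeros((2, 3)), columns=cols)

print(frame)
print(frame.loc[('b', 1), ('Colorado', 'Green')])
```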

###### Re-ordering and sorting levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (the data itself is otherwise unaltered):

In [493]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [494]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11
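
Swapping levels does not reorder the rows, so a sort usually follows. The sketch below uses sort_index with its level argument for that step; the frame has plain columns x, y and z (made up here) to keep the output short.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                 names=['key1', 'key2'])
frame = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows,
                     columns=['x', 'y', 'z'])

# Levels are interchanged, but the rows stay in their original order
swapped = frame.swaplevel('key1', 'key2')

# Sort by the new outer level (remaining levels are used to break ties)
swapped_sorted = swapped.sort_index(level='key2')

print(swapped_sorted)
```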


###### Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to aggregate by on a particular axis. Consider the above DataFrame; we can sum by level on either the rows or columns like so:

In [498]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [499]:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
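
The same aggregations can be expressed with groupby, which is the form that keeps working in newer pandas releases where reductions no longer accept level. The sketch below rebuilds the frame and reproduces both results; transposing before and after is one way (an assumption of this sketch, not the only option) to group along the columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']))

# Sum rows that share the same key2 label
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group the rows, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```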


In [501]:
#Using a DataFrame's columns as its index

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index

In [502]:
frame2 = frame.set_index(['c','d'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though you can leave them in by passing drop=False:

In [503]:
frame.set_index(['c','d'],drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

In [504]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
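
A small round-trip sketch: move two columns into the index, look a row up by its label tuple, and move the levels back. The frame is a shortened version of the one above.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

# Columns 'c' and 'd' become a two-level row index
indexed = frame.set_index(['c', 'd'])

# Rows can now be addressed by a (c, d) tuple
value = indexed.loc[('two', 1), 'a']

# reset_index moves the levels back; they come first in the column order
restored = indexed.reset_index()

print(indexed)
print(restored)
```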


###### Integer indexing

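Working with pandas objects indexed by integers often trips up new users because the indexing semantics differ from those of built-in Python data structures like lists and tuples. For example, you might not expect ser[-1] on the integer-indexed Series below to raise an error, but it does: with an integer index, pandas treats -1 as a label rather than a position. With a non-integer index such as that of ser2 there is no such ambiguity, and iloc always selects by position.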

In [506]:
ser = pd.Series(np.arange(3))
ser

0    0
1    1
2    2
dtype: int32

In [509]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [510]:
ser2.iloc[:]

a    0.0
b    1.0
c    2.0
dtype: float64

In [511]:
ser2.iloc[:1]

a    0.0
dtype: float64

In [513]:
ser2[::-1]

c    2.0
b    1.0
a    0.0
dtype: float64
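
The ambiguity can be sketched as follows. The example catches the KeyError raised by ser[-1] and then shows loc (label-based) and iloc (position-based) as the unambiguous alternatives.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # integer labels 0, 1, 2

# With an integer index, -1 is looked up as a label and is not found
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

# iloc is always positional, so -1 means "last element"
last = ser.iloc[-1]

# Slices differ too: loc includes the end label, iloc excludes the end position
by_label = ser.loc[:1]      # labels 0 and 1
by_position = ser.iloc[:1]  # position 0 only

print(raised, last, len(by_label), len(by_position))
```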

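###### Panel Data

pandas has a Panel data structure, which is a three-dimensional analogue of DataFrame. To create a Panel, you can use a dict of DataFrame objects or a three-dimensional ndarray. As the warning printed below states, Panel is deprecated; the recommended replacements are a DataFrame with a MultiIndex or the xarray package.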

In [522]:
conda install pandas-datareader

In [523]:
import pandas as pd
from pandas_datareader import data as web

In [528]:
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2016', '6/1/2019'))
for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))

Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

  exec(code_obj, self.user_global_ns, self.user_ns)
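
Following the warning's recommendation, a dict of DataFrames can be combined into one DataFrame with a MultiIndex by using pd.concat. The sketch below needs no network access: the tickers AAA and BBB and all prices are made-up placeholder data, not real quotes.

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-02', '2019-01-03'])

# Hypothetical per-ticker frames standing in for downloaded price data
frames = {
    'AAA': pd.DataFrame({'Open': [1.0, 2.0], 'Close': [1.5, 2.5]}, index=dates),
    'BBB': pd.DataFrame({'Open': [10.0, 11.0], 'Close': [10.5, 11.5]}, index=dates),
}

# The dict keys become the outer index level
stacked = pd.concat(frames, names=['ticker', 'date'])

# One ticker across all dates
one_ticker = stacked.loc['AAA']

# One date across all tickers
one_date = stacked.xs(pd.Timestamp('2019-01-03'), level='date')

print(stacked)
print(one_date)
```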


pd.MultiIndex(
    levels=None,
    codes=None,
    sortorder=None,
    names=None,
    dtype=None,
    copy=False,
    name=None,
    verify_integrity=True,
    _set_identity=True,
)
Docstring:     
A multi-level, or hierarchical, index object for pandas objects.

Parameters:

levels : sequence of arrays
    The unique labels for each level.
codes : sequence of arrays
    Integers for each level designating which label at each location.

    .. versionadded:: 0.24.0
labels : sequence of arrays
    Integers for each level designating which label at each location.

    .. deprecated:: 0.24.0
        Use ``codes`` instead
sortorder : optional int
    Level of sortedness (must be lexicographically sorted by that
    level).
names : optional sequence of objects
    Names for each of the index levels. (name is accepted for compat).
copy : bool, default False
    Copy the meta-data.
verify_integrity : bool, default True
    Check that the levels/codes are consistent and valid.

Attributes:

names
levels
codes
nlevels
levshape


Methods:

from_arrays
from_tuples
from_product
from_frame
set_levels
set_codes
to_frame
to_flat_index
is_lexsorted
sortlevel
droplevel
swaplevel
reorder_levels
remove_unused_levels

Also:

MultiIndex.from_arrays  : Convert list of arrays to MultiIndex.
MultiIndex.from_product : Create a MultiIndex from the cartesian product
                          of iterables.
MultiIndex.from_tuples  : Convert list of tuples to a MultiIndex.
MultiIndex.from_frame   : Make a MultiIndex from a DataFrame.
Index : The base pandas Index type.

Examples:

A new ``MultiIndex`` is typically constructed using one of the helper
methods :meth:`MultiIndex.from_arrays`, :meth:`MultiIndex.from_product`
and :meth:`MultiIndex.from_tuples`. For example (using ``.from_arrays``):

>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex(levels=[[1, 2], ['blue', 'red']],
           codes=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['number', 'color'])