# Python Pandas Tutorial

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.

# Pandas
Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

In 2008, developer Wes McKinney started building pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was used mainly for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

# Introduction to Data Structures

Pandas deals with the following three data structures −

Series

DataFrame

Panel

These data structures are built on top of NumPy arrays, which means they are fast.

Dimension & Description
The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container of DataFrame.

| Data Structure | Dimensions | Description |
|---|---|---|
| Series | 1 | 1D labeled homogeneous array, size-immutable. |
| DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. |
| Panel | 3 | General 3D labeled, size-mutable array. |

# Mutability
All Pandas data structures are value-mutable: the values they contain can be changed. All except Series are also size-mutable; a Series is size-immutable.
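As a rough illustration of the two kinds of mutability, the sketch below uses made-up values: assigning to an existing label changes a value in place, and assigning a new column grows a DataFrame. The variable names are only for this example.

```python
import pandas as pd

# Value mutability: overwrite an existing element of a Series.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s["b"] = 20

# Size mutability: adding a column changes the shape of a DataFrame.
df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] * 2

print(s)
print(df.shape)
```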

# Python Pandas - Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

# pandas.Series( data, index, dtype, copy)

data

data takes various forms, such as an ndarray, a list, a dict, or a constant (scalar).


index

Index values must be hashable and the same length as data; they do not have to be unique. Defaults to np.arange(n) if no index is passed.

dtype

dtype is for data type. If None, data type will be inferred


copy

Copy data. Default False
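The four parameters described above can be passed together. The following is a small sketch with invented values, not a recommended pattern; in practice most calls pass only data and, optionally, index.

```python
import pandas as pd

# Pass every constructor parameter explicitly (values are illustrative).
s = pd.Series(
    data=[1, 2, 3],          # the values
    index=["x", "y", "z"],   # one label per value
    dtype="float64",         # force floats instead of inferring integers
    copy=False,              # do not force a copy of the input
)
print(s)
```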

# A series can be created using various inputs like −

Array

Dict

Scalar value or constant

In [1]:
# an empty series (an explicit dtype avoids relying on the version-dependent default)
import pandas as pd
s = pd.Series(dtype="float64")
print(s)

Series([], dtype: float64)


# Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is passed, then by default the index will be range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array) - 1].

In [12]:
import pandas as pd
import numpy as np
data = np.array([1,2,3,4])
s = pd.Series(data)
print (s)

0    1
1    2
2    3
3    4
dtype: int32


In [4]:
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print (s)

100    a
101    b
102    c
103    d
dtype: object


# Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index: recent pandas versions keep the keys in insertion order, while older versions sorted them. If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [1]:
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [10]:
# Dictionary keys are used to construct index.

In [8]:
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2., 'd' : 3.}
s = pd.Series(data,index=['b','c','d','a',])
print (s)

b    1.0
c    2.0
d    3.0
a    0.0
dtype: float64


In [12]:
# Index order is preserved. Any index label that has no matching dictionary key would be filled with NaN (Not a Number).
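To see the NaN fill-in, the sketch below passes an index label ('d') that is not a key of the dictionary. The data values are illustrative.

```python
import pandas as pd

# The dict has no key 'd', but the index asks for it.
data = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(data, index=["b", "c", "d", "a"])

print(s)                 # the row labelled 'd' holds NaN
print(s.isna().sum())    # number of missing values
```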

# Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [7]:
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0    5
1    5
2    5
3    5
dtype: int64


# Accessing Data from Series with Position
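Data in a Series can be accessed much like data in an ndarray. The examples below use index labels to pick out the first element and ranges at the start and end of the Series.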

In [12]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element
print (s['a'])

1


In [5]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['f','b','c','d','e'])
# slicing with labels goes through the index and includes the end label
# retrieve every element up to and including label 'd'
print (s[:'d'])

f    1
b    2
c    3
d    4
dtype: int64


In [17]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three elements
print (s['c':'e'])

c    3
d    4
e    5
dtype: int64
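The examples above select by label. For purely positional access, one option is the iloc indexer, sketched below on the same data; unlike label slices, positional slices exclude the end position.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]          # element at position 0
first_three = s.iloc[:3]   # positions 0, 1, 2 (end position excluded)
last_three = s.iloc[-3:]   # the final three positions

print(first)
print(first_three)
print(last_three)
```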


# Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.

In [24]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print (s['d'])

4


In [3]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print (s[['a','c','d']])

a    1
c    3
d    4
dtype: int64


In [1]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# look up a label that is not in the index
print (s['f'])
# label 'f' is not contained, so a KeyError is raised

KeyError: 'f'
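If a missing label should not raise, one possible approach is to test membership with `in` or to use `Series.get`, which returns a default instead of raising. This is a sketch of that idea, not the only way to handle missing labels.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Membership tests look at the index labels, not the values.
has_f = "f" in s

# get() returns the default (None unless given) when the label is absent.
missing = s.get("f")
fallback = s.get("f", 0)
present = s.get("d")

print(has_f, missing, fallback, present)
```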

# Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame

Columns can be of different types

Size-mutable

Labeled axes (rows and columns)

Can perform arithmetic operations on rows and columns

# pandas.DataFrame( data, index, columns, dtype, copy)

data

data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

index

Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

columns

Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.


dtype

Data type of each column.


copy

Whether to copy the input data. Default False.
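As a quick sketch of how these parameters fit together, the call below supplies data, row labels, column labels, and a dtype at once. The labels and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4]],   # two rows of values
    index=["r1", "r2"],      # row labels
    columns=["a", "b"],      # column labels
    dtype="float64",         # applied to every column
)
print(df)
print(df.dtypes)
```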

# Create DataFrame

A pandas DataFrame can be created using various inputs like −

Lists

dict

Series

Numpy ndarrays

Another DataFrame

In [13]:
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print (df)

# an empty dataframe

Empty DataFrame
Columns: []
Index: []


# Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists

In [2]:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
df

   0
0  1
1  2
2  3
3  4
4  5


In [24]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [4]:
import pandas as pd
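data = [["Alex",10],["Bob",12],["Clark",13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# cast only the numeric column to float; a frame-wide dtype=float cannot convert the names
df['Age'] = df['Age'].astype(float)
print (df)

    Name   Age
0   Alex  10.0
1    Bob  12.0
2  Clark  13.0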


# Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

In [27]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)

   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky


In [8]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42


# Create a DataFrame from List of Dicts

List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

In [29]:
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [30]:
# The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [3]:
#  The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b','c'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
df2['c1']=pd.Series([23,56.0],index=['first','second'])
print (df1)
print (df2)

        a   b     c
first   1   2   NaN
second  5  10  20.0
        a  b1    c1
first   1 NaN  23.0
second  5 NaN  56.0


# Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

In [4]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c','d'])}

df = pd.DataFrame(d)
df

# Observe that series one has no label 'd', so the result holds NaN for column one at label d.

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


# Column Selection
We will understand this by selecting a column from the DataFrame.

In [17]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df ['one'])
df

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


# Column Addition

We will understand this by adding a new column to an existing data frame.

In [1]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print (df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


# Column Deletion
Columns can be deleted or popped; let us take an example to understand how.

In [5]:
# Using the previous DataFrame, we will delete columns
# with the del statement and the pop method
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)

# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print (df)

# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print (df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
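Both del and pop modify the DataFrame in place. When the original frame should stay untouched, one alternative is `DataFrame.drop(columns=...)`, which returns a new frame. The sketch below uses a small made-up frame to contrast the two.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

# drop() returns a new DataFrame; df itself still has all three columns.
trimmed = df.drop(columns=["one"])

# pop() removes the column from df and returns it as a Series.
popped = df.pop("two")

print(trimmed.columns.tolist())
print(df.columns.tolist())
print(popped.tolist())
```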


# Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. Let us begin with the concept of selection.

Selection by Label
Rows can be selected by passing a row label to the loc indexer.

In [5]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3,4], index=['a', 'b', 'c','d']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df.loc['b']) # by label location
df

one    2
two    2
Name: b, dtype: int64


   one  two
a    1    1
b    2    2
c    3    3
d    4    4


# Selection by integer location
Rows can be selected by passing an integer location to the iloc indexer.

In [19]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df.iloc[2])

one    3.0
two    3.0
Name: c, dtype: float64


# Slice Rows
Multiple rows can be selected using the slice operator ':'.

In [4]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)
print (df[2:4])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
   one  two
c  3.0    3
d  NaN    4
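Slices also work with loc and iloc. The sketch below, on an illustrative frame, shows one difference worth remembering: a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b":"c"]    # rows 'b' and 'c' (end label included)
by_position = df.iloc[1:3]    # positions 1 and 2 (end position excluded)

print(by_label)
print(by_position)
```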


# Addition of Rows
Add new rows to a DataFrame with pd.concat, which places the rows of the second frame after those of the first. (Older tutorials use DataFrame.append, which was removed in pandas 2.0.)

In [16]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
print (df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
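Notice that the combined frame repeats the labels 0 and 1. If fresh, unique row labels are wanted, one option is `pd.concat` with `ignore_index=True`, sketched here on the same two frames.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```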


# Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, then multiple rows will be dropped.

Notice that the labels in the example above are duplicated. Let us drop a label and see how many rows get dropped.

In [8]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print (df)

   a  b
1  3  4
1  7  8
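To drop exactly one row when labels are duplicated, a possible approach is to renumber the rows first with `reset_index(drop=True)`, so that each label is unique. This is a sketch of that idea on the same data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])
stacked = pd.concat([df, df2])          # labels are 0, 1, 0, 1

# Make the labels unique, then drop only the first row.
unique = stacked.reset_index(drop=True)  # labels are 0, 1, 2, 3
remaining = unique.drop(0)
print(remaining)
```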


# Python Pandas - Panel

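A panel is a 3D container of data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the examples in this section only run on older versions; for 3D data, the pandas documentation recommends a DataFrame with a MultiIndex or the xarray package.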

The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −


items − axis 0, each item corresponds to a DataFrame contained inside.


major_axis − axis 1, it is the index (rows) of each of the DataFrames.


minor_axis − axis 2, it is the columns of each of the DataFrames.

# pandas.Panel()
A Panel can be created using the following constructor −

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

# The parameters of the constructor are as follows −

| Parameter | Description |
|---|---|
| data | Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame |
| items | axis=0 |
| major_axis | axis=1 |
| minor_axis | axis=2 |
| dtype | Data type of each column |
| copy | Copy data. Default False |

# Create Panel
A Panel can be created using multiple ways like −

From ndarrays

From dict of DataFrames

In [8]:
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np

data = np.random.rand(2,4,5) # creating 2 items means 2 matrices of 4 rows and 5 columns
p = pd.Panel(data)
print(p)
p
data

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4


array([[[0.87916917, 0.67259274, 0.77687054, 0.67607751, 0.67336213],
        [0.34280814, 0.45778453, 0.3840602 , 0.0517134 , 0.38219897],
        [0.80212136, 0.90642272, 0.78055728, 0.57379854, 0.65051119],
        [0.66626421, 0.95656878, 0.30386207, 0.73625472, 0.49334953]],

       [[0.96390569, 0.20103519, 0.06486072, 0.57537938, 0.49321102],
        [0.31795001, 0.33402992, 0.26971478, 0.73607142, 0.09451902],
        [0.83256531, 0.40042039, 0.39955815, 0.46526305, 0.20176711],
        [0.60469268, 0.49114998, 0.29014889, 0.05198661, 0.93262815]]])

# From dict of DataFrame Objects

In [12]:
import pandas as pd
import numpy as np

data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)


<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2


# Create an Empty Panel
An empty panel can be created using the Panel constructor as follows −

In [61]:
#creating an empty panel
import pandas as pd
p = pd.Panel()
print (p)


<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None


# Selecting the Data from Panel
Select the data from the panel using −

Items

Major_axis

Minor_axis

In [8]:
# selecting one item from a panel built from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print (p['Item1'])
p

# We have two items, and we retrieved item1. The result is a DataFrame with 4 rows and 3 columns, 
# which are the Major_axis and Minor_axis dimensions

          0         1         2
0 -1.232119 -0.175259  0.391455
1  0.381731  0.361889  1.281173
2  1.511729  1.094079  0.510116
3  0.137302  0.818142 -0.788186


<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

# Using major_axis
Data can be accessed using the method panel.major_xs(index).

In [10]:
# selecting along the major axis of a panel built from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p["Item1"])
print(p["Item2"])
print(p.major_xs(1))
# the major axis is the rows, so major_xs(1) returns row 1 of Item1 and of Item2, one column per item

          0         1         2
0 -0.169296 -0.075904 -0.390694
1  0.211774 -1.090508 -0.440715
2  0.725766  0.628673 -0.252886
3  1.137261 -1.566678  0.260044
          0         1   2
0  0.543827  0.639022 NaN
1  0.915550 -1.756295 NaN
2  1.102923 -0.682860 NaN
3 -0.613008 -0.375793 NaN
      Item1     Item2
0  0.211774  0.915550
1 -1.090508 -1.756295
2 -0.440715       NaN


# Using minor_axis
Data can be accessed using the method panel.minor_xs(index).

In [9]:
# selecting along the minor axis of a panel built from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}

p = pd.Panel(data)
print(data["Item1"])
print(data["Item2"])
print (p.minor_xs(1))

          0         1         2
0 -0.640666 -0.945247 -2.001230
1 -1.012417 -1.260275 -1.715521
2 -0.240978 -1.029336 -1.360998
3  0.436626  2.171764  0.402523
          0         1
0 -0.816497  0.557604
1 -0.557513 -0.136781
2  1.484140  0.873192
3 -0.733081  0.743885
      Item1     Item2
0 -0.945247  0.557604
1 -1.260275 -0.136781
2 -1.029336  0.873192
3  2.171764  0.743885
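On pandas versions without Panel, similar selections can be approximated with a MultiIndex DataFrame. The sketch below is one possible arrangement, not a drop-in replacement: the level names "item" and "major" are chosen here for illustration, and deterministic values replace the random data so the results are easy to follow.

```python
import numpy as np
import pandas as pd

frames = {
    "Item1": pd.DataFrame(np.arange(12).reshape(4, 3)),
    "Item2": pd.DataFrame(np.arange(8).reshape(4, 2)),
}

# Stack the frames; the dict keys become the outer index level.
stacked = pd.concat(frames, names=["item", "major"])

item1 = stacked.loc["Item1"]              # roughly p['Item1']
row1 = stacked.xs(1, level="major")       # roughly p.major_xs(1), with items as rows
col1 = stacked[1].unstack("item")         # roughly p.minor_xs(1)

print(item1)
print(row1)
print(col1)
```

Item2 has no column 2, so that column holds NaN for its rows, much like the Panel output shown earlier.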


# Python Pandas - Basic Functionality

In [2]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print (s)

0   -0.204461
1   -1.304044
2    0.550875
3   -1.369659
dtype: float64


# axes
Returns a list containing the row axis labels (the index) of the Series.


In [3]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The axes are:")
print (s.axes)

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


# empty
Returns the Boolean value saying whether the Object is empty or not. True indicates that the object is empty.

In [4]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("Is the Object empty?")
print (s.empty)

Is the Object empty?
False


# ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.

In [5]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print (s)

print ("The dimensions of the object:")
print (s.ndim)

0    2.056135
1    0.743862
2    1.294055
3   -1.560908
dtype: float64
The dimensions of the object:
1


# size
Returns the size(length) of the series.

In [6]:
import pandas as pd
import numpy as np

#Create a series with 2 random numbers
s = pd.Series(np.random.randn(2))
print (s)
print ("The size of the object:")
print (s.size)

0   -1.022441
1   -0.258307
dtype: float64
The size of the object:
2


# values
Returns the actual data in the series as an array.

In [7]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print (s)

print ("The actual data series is:")
print (s.values)

0    0.425712
1   -0.262333
2    0.101518
3    0.109211
dtype: float64
The actual data series is:
[ 0.4257124  -0.26233265  0.1015176   0.10921069]


# Head & Tail
To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.

head() returns the first n rows and tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

In [35]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)

print ("The first two rows of the data series:")
print (s.head(2))

The original series is:
0    0.883522
1   -0.210199
2   -0.924236
3    1.056852
dtype: float64
The first two rows of the data series:
0    0.883522
1   -0.210199
dtype: float64


In [9]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)

print ("The last two rows of the data series:")
print (s.tail(2))

The original series is:
0    0.346495
1   -1.333358
2    0.315693
3   -0.478332
dtype: float64
The last two rows of the data series:
2    0.315693
3   -0.478332
dtype: float64


# DataFrame Basic Functionality
Let us now look at the basic functionality of a DataFrame. The sections below cover the important attributes and methods: T, axes, dtypes, empty, ndim, shape, size, values, head(), and tail().

In [10]:
# Let us first create a DataFrame and then see how each of these attributes operates.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data series is:")
print (df)

Our data series is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80


# T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will interchange.

In [1]:
import pandas as pd
import numpy as np
 
# Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print(df)
print ("The transpose of the data series is:")
print (df.T)

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The transpose of the data series is:
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8


# axes
Returns the list of row axis labels and column axis labels.

In [12]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Row axis labels and column axis labels are:")
print (df.axes)

Row axis labels and column axis labels are:
[RangeIndex(start=0, stop=7, step=1), Index(['Age', 'Name', 'Rating'], dtype='object')]


# dtypes
Returns the data type of each column.

In [13]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("The data types of each column are:")
print (df.dtypes)

The data types of each column are:
Age         int64
Name       object
Rating    float64
dtype: object


# empty
Returns the Boolean value saying whether the Object is empty or not; True indicates that the object is empty.

In [81]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Is the object empty?")
print (df.empty)

Is the object empty?
False


# ndim
Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.

In [83]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The dimension of the object is:")
print (df.ndim)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The dimension of the object is:
2


# shape
Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.

In [84]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The shape of the object is:")
print (df.shape)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The shape of the object is:
(7, 3)


# size
Returns the number of elements in the DataFrame.

In [85]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The total number of elements in our object is:")
print (df.size)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The total number of elements in our object is:
21
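The three attributes ndim, shape, and size are related: size is the product of the entries of shape, and ndim is the length of shape. A short check on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
})

rows, cols = df.shape
print(df.ndim, df.shape, df.size)

# size counts every cell, so it equals rows * columns.
consistent = (df.size == rows * cols) and (df.ndim == len(df.shape))
print(consistent)
```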


# values
Returns the actual data in the DataFrame as a NumPy ndarray.

In [86]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The actual data in our data frame is:")
print (df.values)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The actual data in our data frame is:
[[25 'Tom' 4.23]
 [26 'James' 3.24]
 [25 'Ricky' 3.98]
 [23 'Vin' 2.56]
 [30 'Steve' 3.2]
 [29 'Smith' 4.6]
 [23 'Jack' 3.8]]
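Because the frame mixes strings and numbers, the array above has to use the generic object dtype. The sketch below, using `DataFrame.to_numpy()` (available since pandas 0.24), shows that selecting only the numeric columns yields a float array instead. The data is a shortened, illustrative version of the frame above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
    "Rating": [4.23, 3.24, 3.98],
})

mixed = df.to_numpy()                         # strings + numbers -> object dtype
numeric = df[["Age", "Rating"]].to_numpy()    # int + float -> float64

print(mixed.dtype, numeric.dtype)
```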


# Head & Tail
To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

In [88]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The first two rows of the data frame is:")
print (df.head(2))

Our data frame is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The first two rows of the data frame is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24


# tail()
tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

In [90]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]), 
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The last two rows of the data frame is:")
print (df.tail(2))

Our data frame is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The last two rows of the data frame is:
   Age   Name  Rating
5   29  Smith     4.6
6   23   Jack     3.8


# Python Pandas - Descriptive Statistics

In [17]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df)

      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65


# sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

In [94]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.sum())

# Each column is summed separately (strings are concatenated).

Age                                                     382
Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating                                                44.92
dtype: object
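
If the concatenated Name column is not wanted, one option is to restrict the reduction to numeric columns. The sketch below assumes a shortened three-row version of the data and uses the numeric_only parameter of sum().

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Restrict the reduction to numeric columns so 'Name' is skipped
totals = df.sum(numeric_only=True)
print(totals)
```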


# axis=1
Passing axis=1 sums the values across each row instead of down each column, as shown below.

In [96]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
 
#Create a DataFrame
df = pd.DataFrame(d)
# numeric_only=True restricts the row sums to the numeric columns
print (df.sum(axis=1, numeric_only=True))

# This gives the row-wise sum.

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64


# mean()
Returns the average (arithmetic mean) of the values in each numeric column.

In [97]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
# numeric_only=True skips the non-numeric 'Name' column
print (df.mean(numeric_only=True))

Age       31.833333
Rating     3.743333
dtype: float64


# std()
Returns the Bessel-corrected (sample) standard deviation of the numerical columns.

In [98]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
# numeric_only=True skips the non-numeric 'Name' column
print (df.std(numeric_only=True))

Age       9.232682
Rating    0.661628
dtype: float64
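
To make the Bessel correction concrete, here is a minimal sketch using the Age values from the example. It compares the pandas default (ddof=1, dividing by n - 1) with the population form (ddof=0) and with a manual computation; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

sample_std = ages.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = ages.std(ddof=0)  # divide by n instead

# Manual computation of the sample standard deviation for comparison
n = len(ages)
manual = np.sqrt(((ages - ages.mean()) ** 2).sum() / (n - 1))

print(sample_std, population_std, manual)
```

The sample value matches the 9.232682 shown for Age above; the population value is slightly smaller because it divides by n.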


In [16]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.cumsum())

    Age                                               Name Rating
0    25                                                Tom   4.23
1    51                                           TomJames   7.47
2    76                                      TomJamesRicky  11.45
3    99                                   TomJamesRickyVin  14.01
4   129                              TomJamesRickyVinSteve  17.21
5   158                         TomJamesRickyVinSteveSmith  21.81
6   181                     TomJamesRickyVinSteveSmithJack  25.61
7   215                  TomJamesRickyVinSteveSmithJackLee  29.39
8   255             TomJamesRickyVinSteveSmithJackLeeDavid  32.37
9   285       TomJamesRickyVinSteveSmithJackLeeDavidGasper  37.17
10  336  TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...  41.27
11  382  TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...  44.92


In [18]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.abs())

TypeError: bad operand type for abs(): 'str'

In [1]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.cumprod())

TypeError: can't multiply sequence by non-int of type 'str'

In [19]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.max())

Age        51
Name      Vin
Rating    4.8
dtype: object


In [20]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.min())

Age           23
Name      Andres
Rating      2.56
dtype: object


# Functions & Description
Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions −

| S.No | Function | Description |
|---|---|---|
| 1 | count() | Number of non-null observations |
| 2 | sum() | Sum of values |
| 3 | mean() | Mean of values |
| 4 | median() | Median of values |
| 5 | mode() | Mode of values |
| 6 | std() | Standard deviation of the values |
| 7 | min() | Minimum value |
| 8 | max() | Maximum value |
| 9 | abs() | Absolute value |
| 10 | prod() | Product of values |
| 11 | cumsum() | Cumulative sum |
| 12 | cumprod() | Cumulative product |

# Note − Since a DataFrame is a heterogeneous data structure, generic operations don't work with all functions.

Functions like sum() and cumsum() work with both numeric and character (or string) data elements without any error. Though in practice character aggregations are rarely used, these functions do not throw any exception.

Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed.
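
One way to avoid those exceptions, sketched here under the assumption that only the numeric columns matter, is to select them with select_dtypes() before calling abs() or cumprod(). The three-row frame and the negative rating are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, -3.24, 3.98]})

# Keep only the numeric columns, then apply the numeric-only operations
numeric = df.select_dtypes(include='number')
print(numeric.abs())
print(numeric.cumprod())
```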

# Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.

In [1]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


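The include argument of describe() controls which columns are summarized. It accepts the following values −

object − Summarizes String columns

number − Summarizes Numeric columns

all − Summarizes all columns together (Should not pass it as a list value)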

Now, use the following statement in the program and check the output −

In [4]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 'Wesley',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe(include=['object']))

          Name
count       13
unique      13
top     Betina
freq         1


In [102]:
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe(include='all'))

              Age Name     Rating
count   12.000000   12  12.000000
unique        NaN   12        NaN
top           NaN  Vin        NaN
freq          NaN    1        NaN
mean    31.833333  NaN   3.743333
std      9.232682  NaN   0.661628
min     23.000000  NaN   2.560000
25%     25.000000  NaN   3.230000
50%     29.500000  NaN   3.790000
75%     35.500000  NaN   4.132500
max     51.000000  NaN   4.800000
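
As a further sketch, describe() also accepts a list for include and a percentiles argument. The example below assumes a four-row frame and picks the 10th and 90th percentiles purely for illustration; the median (50%) is still reported alongside them.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky', 'Vin'],
                   'Age': [25, 26, 25, 23]})

# Summarize only the numeric columns
numeric_summary = df.describe(include=['number'])

# Ask for custom percentiles instead of the default 25% / 50% / 75%
custom = df.describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(custom)
```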


# Row or Column Wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame or Panel using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. By default, the function is applied column-wise, receiving each column as an array-like object.

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
df.apply(np.mean)
print (df.apply(np.mean))

# By passing axis parameter, operations can be performed row wise.

       col1      col2      col3
0  1.073569 -0.075972 -0.462918
1 -0.108936  1.064755  0.420638
2  0.810764 -2.003544 -0.282524
3 -0.458284  0.876984 -0.958171
4 -0.188189  1.816141 -0.497366
col1    0.225785
col2    0.335673
col3   -0.356068
dtype: float64


In [6]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
#df.apply(np.mean,axis=1)
print (df.apply(np.mean, axis=1))

0    0.208348
1   -0.005709
2    0.408772
3   -0.078822
4    0.347232
dtype: float64


In [7]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print(df.apply(lambda x: x.max() - x.min()))
print(df.apply(lambda x: x.max() - x.min(), axis = 1))
print (df.apply(np.mean))

       col1      col2      col3
0 -0.582396  0.376691  2.258458
1 -0.232238  1.129860  0.758654
2 -0.182384  0.123328  0.002388
3 -0.365773 -1.905859 -1.292486
4  0.550166  0.618180 -0.177189
col1    1.132562
col2    3.035719
col3    3.550944
dtype: float64
0    2.840854
1    1.362098
2    0.305712
3    1.540085
4    0.795369
dtype: float64
col1   -0.162525
col2    0.068440
col3    0.309965
dtype: float64
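
Because the examples above use random data, the results differ on every run. The following sketch uses fixed values so the effect of the axis argument can be checked by hand; value_range is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 4.0, 2.0],
                   'col2': [10.0, 30.0, 20.0]})

def value_range(values):
    """Return the spread between the largest and smallest value."""
    return values.max() - values.min()

per_column = df.apply(value_range)        # axis=0 (default): one result per column
per_row = df.apply(value_range, axis=1)   # axis=1: one result per row

print(per_column)
print(per_row)
```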


# Element Wise Function Application
Not all functions can be vectorized (that is, written to take a NumPy array and return another array or a single value). For such functions, the methods applymap() on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value.

In [8]:
# To transform the values of one specific column, use map() on that Series
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)

# My custom function
print(df['col1'].map(lambda x:x*100))
print (df.apply(np.mean))

       col1      col2      col3
0  0.594785 -0.210043  1.127132
1  0.093904  1.018843  0.750067
2 -0.420165  1.948793 -0.467100
3 -0.434372 -0.938892  0.549049
4  0.646525 -1.787216  1.517316
0    59.478506
1     9.390354
2   -42.016505
3   -43.437219
4    64.652467
Name: col1, dtype: float64
col1    0.096135
col2    0.006297
col3    0.695293
dtype: float64


In [9]:
import pandas as pd
import numpy as np

# My custom function
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print(df.applymap(lambda x:x*100))
#print (df.apply(np.mean))
df

       col1      col2      col3
0  1.312197  0.454589 -1.361580
1 -1.968445 -1.129168 -1.043674
2 -0.204904 -0.354980  0.569082
3 -1.274591 -0.273277  0.505237
4 -0.637699  0.924024  1.829394
         col1        col2        col3
0  131.219746   45.458916 -136.157977
1 -196.844481 -112.916810 -104.367392
2  -20.490431  -35.498006   56.908198
3 -127.459076  -27.327659   50.523715
4  -63.769915   92.402381  182.939388


Unnamed: 0,col1,col2,col3
0,1.312197,0.454589,-1.36158
1,-1.968445,-1.129168,-1.043674
2,-0.204904,-0.35498,0.569082
3,-1.274591,-0.273277,0.505237
4,-0.637699,0.924024,1.829394
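
The sketch below applies a plain Python function element by element, first to one column with Series.map() and then to the whole frame. It applies map() column by column rather than calling applymap(), on the assumption that this works across pandas versions (to my knowledge, applymap() was deprecated in pandas 2.1 in favor of DataFrame.map()). The as_percent helper is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col1': [0.5, -1.25], 'col2': [2.0, 0.125]})

def as_percent(x):
    """Format a single number as a percentage string."""
    return f"{x * 100:.1f}%"

# Element-wise on one column
one_column = df['col1'].map(as_percent)

# Element-wise on every column: apply() hands over each column, map() handles each value
whole_frame = df.apply(lambda column: column.map(as_percent))

print(one_column)
print(whole_frame)
```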


# Python Pandas - Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as −

Reorder the existing data to match a new set of labels.

Insert missing value (NA) markers in label locations where no data for the label existed.

In [19]:
import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='M'),
   'X': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N),
   'D': np.random.normal(100, 10, size=(N))

})

print(df)
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5,3,9,6,8], columns=['A', 'C', 'B'])

print (df_reindexed)
# Here column B contains only NaN because df has no such column, so there are no values to carry over.

            A     X         y       C           D
0  2016-01-31   0.0  0.623943     Low   89.497995
1  2016-02-29   1.0  0.798120     Low  105.190326
2  2016-03-31   2.0  0.159891     Low  105.140724
3  2016-04-30   3.0  0.133979  Medium  129.697476
4  2016-05-31   4.0  0.562461  Medium  103.316144
5  2016-06-30   5.0  0.169704     Low  109.834739
6  2016-07-31   6.0  0.097757     Low   89.770993
7  2016-08-31   7.0  0.741922     Low  101.807418
8  2016-09-30   8.0  0.453738     Low  112.523211
9  2016-10-31   9.0  0.836906     Low   89.103138
10 2016-11-30  10.0  0.944923     Low   96.725382
11 2016-12-31  11.0  0.030102     Low  124.624125
12 2017-01-31  12.0  0.898274  Medium  110.991132
13 2017-02-28  13.0  0.621962    High  103.694224
14 2017-03-31  14.0  0.147006     Low  108.509711
15 2017-04-30  15.0  0.743615    High  117.362451
16 2017-05-31  16.0  0.075756  Medium   92.709415
17 2017-06-30  17.0  0.128659  Medium  108.470258
18 2017-07-31  18.0  0.629305  Medium  102.316485
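
Since the reindexed frame is not printed above, here is a small deterministic sketch of the same idea. The three-row frame is made up for illustration; it shows reordering, NaN for a missing row label and a missing column, and the fill_value argument as an alternative to NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'C': ['Low', 'High', 'Medium']})

# Reorder rows 2 and 0, request a row label (5) and a column ('B') that do not exist
out = df.reindex(index=[2, 0, 5], columns=['C', 'A', 'B'])
print(out)

# fill_value replaces the NaN markers for labels that had no data
filled = df.reindex(index=[0, 1, 3], fill_value=0)
print(filled)
```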


In [20]:
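# np.random.rand() expects integer dimensions, so passing a list of labels fails;
# np.random.choice(['Low','Medium','High'],N) is the call that samples labels.
np.random.rand(['Low','Medium','High'],N).tolist()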

TypeError: 'list' object cannot be interpreted as an integer

In [3]:
df

Unnamed: 0,A,C,D,x,y
0,2016-01-01,Medium,105.728842,0.0,0.347941
1,2016-01-02,Low,102.809169,1.0,0.099767
2,2016-01-03,Medium,113.239965,2.0,0.008862
3,2016-01-04,Medium,98.625555,3.0,0.743986
4,2016-01-05,High,95.217976,4.0,0.42278
5,2016-01-06,Low,85.534458,5.0,0.297163
6,2016-01-07,High,100.86716,6.0,0.337275
7,2016-01-08,High,94.256373,7.0,0.868135
8,2016-01-09,Medium,95.472075,8.0,0.457669
9,2016-01-10,Low,117.184772,9.0,0.196297


# Reindex to Align with Other Objects
You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.

In [15]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
print(df1)
print(df2)
df1 = df1.reindex_like(df2)
df1
# Note − Here, the df1 DataFrame is altered and reindexed like df2. The column names should match,
# or else NaN will be filled in for the entire column.

       col1      col2      col3
0 -0.344550 -0.020199  0.505984
1 -0.624073 -0.813324  0.762062
2 -0.946042  0.467531 -0.671950
3  0.335383  2.406244  0.463100
4 -0.177149 -0.289652  0.091241
5 -1.358156 -1.435299 -0.326110
6 -0.538587 -0.222214 -0.776904
7  0.484647  0.875548 -1.950063
8  1.093404  1.081765 -1.562597
9 -0.252735 -0.500428 -1.922592
       col1      col2      col3
0  0.847384  0.089104 -0.142111
1  0.720392  1.501412 -0.120405
2 -1.020815  0.368106 -1.228359
3  0.000722  1.923304 -0.023081
4  0.224680 -0.000446  0.237381
5 -1.186182 -0.618273  0.304671
6  0.985634  0.846608  0.362550


Unnamed: 0,col1,col2,col3
0,-0.34455,-0.020199,0.505984
1,-0.624073,-0.813324,0.762062
2,-0.946042,0.467531,-0.67195
3,0.335383,2.406244,0.4631
4,-0.177149,-0.289652,0.091241
5,-1.358156,-1.435299,-0.32611
6,-0.538587,-0.222214,-0.776904


# Filling while ReIndexing
reindex() takes an optional parameter method which is a filling method with values as follows −

pad/ffill − Fill values forward

bfill/backfill − Fill values backward

nearest − Fill from the nearest index values

In [9]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
print(df1)
print(df2)
# Padding NAN's
print (df2.reindex_like(df1))

# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))

       col1      col2      col3
0  0.477660  0.827504  3.543306
1  1.811839  0.477675 -0.879893
2 -1.059832 -1.367065 -0.679805
3  1.268864  0.766854 -2.301703
4 -0.938063  2.063149  1.034636
5  1.387980 -0.021059 -1.779675
       col1      col2      col3
0  0.552813  0.218812  0.341067
1 -0.294645  0.516725  0.147845
       col1      col2      col3
0  0.552813  0.218812  0.341067
1 -0.294645  0.516725  0.147845
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
       col1      col2      col3
0  0.552813  0.218812  0.341067
1 -0.294645  0.516725  0.147845
2 -0.294645  0.516725  0.147845
3 -0.294645  0.516725  0.147845
4 -0.294645  0.516725  0.147845
5 -0.294645  0.516725  0.147845
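
The three fill methods listed above can be compared side by side on a tiny Series. This is a sketch with made-up values; it uses reindex() directly and assumes a sorted (monotonic) index, which the fill methods require.

```python
import pandas as pd

s = pd.Series([10.0, 40.0], index=[0, 3])
new_index = range(5)

forward = s.reindex(new_index, method='ffill')    # carry the last known value forward
backward = s.reindex(new_index, method='bfill')   # pull the next known value backward
nearest = s.reindex(new_index, method='nearest')  # take the value at the closest label

print(forward.tolist())
print(backward.tolist())
print(nearest.tolist())
```

Label 4 has no later value to pull from, so the backward fill leaves it as NaN.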


In [22]:
np.random.randn(6,3)

array([[ 0.54310799,  0.61590481, -0.66217099],
       [-0.79490287, -0.08013453,  0.04040611],
       [-0.99743173,  0.46498463,  1.37589511],
       [-0.10581426,  0.74543834, -1.48301582],
       [ 0.0843389 ,  0.73316326, -0.27956472],
       [-2.68313169,  0.07275769, -1.61028377]])

# Limits on Filling while Reindexing
The limit argument provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing labels to fill. Let us consider the following example to understand the same −

In [10]:
import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Padding NAN's
print (df2.reindex_like(df1))

# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limiting to 3:")
print (df2.reindex_like(df1,method='ffill',limit=3))

       col1      col2      col3
0  0.686541  1.779022  0.244567
1  2.270229 -0.427923  1.266679
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 3:
       col1      col2      col3
0  0.686541  1.779022  0.244567
1  2.270229 -0.427923  1.266679
2  2.270229 -0.427923  1.266679
3  2.270229 -0.427923  1.266679
4  2.270229 -0.427923  1.266679
5       NaN       NaN       NaN


In [10]:
# Note − Observe that only rows 2, 3 and 4 are filled from the preceding row 1. Because limit=3, row 5 is left as NaN.
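
To see how the limit changes the result, the sketch below forward-fills the same two-value Series with limit=3 and limit=1. The values are made up; the index mirrors the six-row example above.

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 1])

# At most three consecutive missing labels are filled
limit_three = s.reindex(range(6), method='ffill', limit=3)

# At most one consecutive missing label is filled
limit_one = s.reindex(range(6), method='ffill', limit=1)

print(limit_three.tolist())
print(limit_one.tolist())
```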

# Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

Let us consider the following example to understand this −

In [24]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)

print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

       col1      col2      col3
0  0.045675 -1.713324 -1.316147
1  0.856183  0.021364 -0.854462
2 -0.868080 -0.874430  0.842856
3  0.273401  0.389303  0.743954
4  2.695444 -0.338995  0.051446
5  1.511721  2.057655  0.361038
After renaming the rows and columns:
              c1        c2      col3
apple   0.045675 -1.713324 -1.316147
banana  0.856183  0.021364 -0.854462
durian -0.868080 -0.874430  0.842856
3       0.273401  0.389303  0.743954
4       2.695444 -0.338995  0.051446
5       1.511721  2.057655  0.361038


In [14]:
# The rename() method provides an inplace named parameter, which by default is False and copies the 
# underlying data. Pass inplace=True to rename the data in place.
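
The example above renames with a dict. As a sketch of the other two points in this section, the snippet below renames with functions and then uses inplace=True; the frame and labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['a', 'b'])

# A function is called once per label; the original frame is left untouched
renamed = df.rename(columns=str.upper, index=lambda label: 'row_' + label)
columns_before = df.columns.tolist()

# inplace=True modifies df itself and returns None
df.rename(columns={'col1': 'c1'}, inplace=True)

print(renamed)
print(df)
```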

# Python Pandas - Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects.

In short, basic iteration (for i in object) produces −

Series − values

DataFrame − column labels

Panel − item labels

# Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

In [15]:
import pandas as pd
import numpy as np
 
N=20

df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
    'x': np.linspace(0,stop=N-1,num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low','Medium','High'],N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
    })

for col in df:
   print (col)

A
C
D
x
y


# Iterating over the Columns and Rows of a DataFrame

We can use the following functions −

items() − to iterate over the (key,value) pairs, one per column (named iteritems() in older pandas versions)

iterrows() − iterate over the rows as (index,series) pairs

itertuples() − iterate over the rows as namedtuples

# items()
Iterates over each column as a key, value pair, with the label as key and the column values as a Series object. Older pandas versions call this method iteritems(); that name was removed in pandas 2.0.

In [25]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print(df)
for key,value in df.items():
   print (key,value)

       col1      col2      col3
0  0.228927  0.199153  0.304282
1 -0.461003 -0.466808 -2.006662
2  0.052615 -0.326478 -1.034230
3  0.290439  2.350334  0.553777
col1 0    0.228927
1   -0.461003
2    0.052615
3    0.290439
Name: col1, dtype: float64
col2 0    0.199153
1   -0.466808
2   -0.326478
3    2.350334
Name: col2, dtype: float64
col3 0    0.304282
1   -2.006662
2   -1.034230
3    0.553777
Name: col3, dtype: float64


# iterrows()
iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

In [17]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print (row_index,row)

0 col1    1.672976
col2   -0.530435
col3    0.912347
Name: 0, dtype: float64
1 col1    0.857897
col2   -0.758790
col3    0.244746
Name: 1, dtype: float64
2 col1   -0.225628
col2    0.784365
col3    0.273038
Name: 2, dtype: float64
3 col1   -0.061459
col2    1.404453
col3    0.129884
Name: 3, dtype: float64


In [18]:
# Note − Because iterrows() returns each row as a Series, it does not preserve the data types across the row. 0,1,2 are the 
# row indices and col1,col2,col3 are column indices.

# itertuples()
itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [19]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=1.2089275902145231, col2=1.2611777065930685, col3=0.2747878155250239)
Pandas(Index=1, col1=0.50635884029824274, col2=0.74475245065411799, col3=0.4739326300149258)
Pandas(Index=2, col1=0.79940637733867337, col2=-1.2815799854624281, col3=-0.61904184735999657)
Pandas(Index=3, col1=-0.81283602024809454, col2=-0.71697078538913128, col3=0.75393238414454511)
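
The fields of each named tuple can be read by attribute. The sketch below, on a made-up two-row frame, also shows the index=False and name=None arguments, which yield plain tuples without the index.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.5, 2.5], 'col2': [10, 20]})

# Access the index and the column values by attribute name
totals = []
for row in df.itertuples():
    totals.append((row.Index, row.col1 + row.col2))

# Plain tuples, without the index as first element
plain = list(df.itertuples(index=False, name=None))

print(totals)
print(plain)
```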


In [17]:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
print(df)
for index,row in df.iterrows():
    # This adds a 'col4' entry to the temporary row Series only; df itself is not modified
    row['col4'] = 10
    print(index,row)
#print (df)

       col1      col2      col3
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
0 col1     1.764052
col2     0.400157
col3     0.978738
col4    10.000000
Name: 0, dtype: float64
1 col1     2.240893
col2     1.867558
col3    -0.977278
col4    10.000000
Name: 1, dtype: float64
2 col1     0.950088
col2    -0.151357
col3    -0.103219
col4    10.000000
Name: 2, dtype: float64
3 col1     0.410599
col2     0.144044
col3     1.454274
col4    10.000000
Name: 3, dtype: float64


# Python Pandas - Sorting

There are two kinds of sorting available in Pandas. They are −

By label
By Actual Value

In [23]:
import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)

       col2      col1
1 -1.137512 -0.186933
4  0.281705 -1.904358
6  0.463356 -0.431011
2  1.052714 -0.042893
3 -0.672293  0.458197
5 -0.364777 -1.247856
9  0.141666 -0.546187
8 -0.928082  1.532030
0 -0.946220 -0.656970
7 -1.083718 -0.799550


# By Label
Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

In [11]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
print(unsorted_df)
sorted_df=unsorted_df.sort_index()
print (sorted_df)

       col2      col1
1  0.051807 -1.077868
4  0.279609  0.153305
6  0.414487  0.185684
2  1.415644  0.563947
3 -1.908349 -1.155859
5  0.234662  1.362720
9  1.365233  0.929174
8 -0.604648 -0.015926
0 -0.385683  0.562178
7 -0.270428  0.124974
       col2      col1
0 -0.385683  0.562178
1  0.051807 -1.077868
2  1.415644  0.563947
3 -1.908349 -1.155859
4  0.279609  0.153305
5  0.234662  1.362720
6  0.414487  0.185684
7 -0.270428  0.124974
8 -0.604648 -0.015926
9  1.365233  0.929174


# Order of Sorting
By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

In [26]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

       col2      col1
9  1.796016  0.707235
8  0.497028 -1.335820
7  0.944244  1.319775
6  0.633564 -0.164203
5 -1.095027 -1.264594
4  0.307278  0.561850
3 -1.246677 -1.473788
2  1.386324  1.148111
1  0.126588  0.317058
0 -1.240786 -0.443039


# Sort the Columns
By passing the axis argument with a value 0 or 1, the sorting can be done on the row labels or the column labels. By default, axis=0 sorts by row labels; axis=1 sorts by column labels. Let us consider the following example to understand the same.

In [12]:
import pandas as pd
import numpy as np
 
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
print(unsorted_df) 
sorted_df=unsorted_df.sort_index(axis=1)

print (sorted_df)

       col2      col1
1 -0.914772  2.156265
4 -0.707053 -1.433711
6 -1.505461  0.864913
2  0.157086  0.243783
3 -1.165836 -1.073775
5 -0.675038  0.388137
9 -0.374875  0.607739
8 -0.586302  0.495104
0  0.327703 -0.730910
7  0.937986  0.261397
       col1      col2
1  2.156265 -0.914772
4 -1.433711 -0.707053
6  0.864913 -1.505461
2  0.243783  0.157086
3 -1.073775 -1.165836
5  0.388137 -0.675038
9  0.607739 -0.374875
8  0.495104 -0.586302
0 -0.730910  0.327703
7  0.261397  0.937986


# By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument naming the column (or columns) of the DataFrame whose values determine the order.

In [13]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
print(unsorted_df)
sorted_df = unsorted_df.sort_values(by='col1')

print (sorted_df)

   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1


In [30]:
# Observe that the col1 values are sorted and the respective col2 values and row index move along with col1. 
# Thus, they look unsorted. The 'by' argument also takes a list of column names.

In [28]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,3,1],'col2':[5,3,2,4]})
print(unsorted_df)
sorted_df = unsorted_df.sort_values(by=['col1','col2'])

print (sorted_df)

   col1  col2
0     2     5
1     1     3
2     3     2
3     1     4
   col1  col2
1     1     3
3     1     4
0     2     5
2     3     2
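
When sorting by several columns, the ascending argument can also be a list with one entry per column. The sketch below reuses the data from the example above and sorts col1 ascending while breaking ties on col2 in descending order.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 3, 1], 'col2': [5, 3, 2, 4]})

# col1 ascending; within equal col1 values, col2 descending
mixed = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])

print(mixed)
```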


# Sorting Algorithm
sort_values() lets you choose the algorithm from mergesort, heapsort and quicksort. Of these, mergesort is the only stable algorithm, meaning that rows with equal keys keep their original relative order.

In [14]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
print(unsorted_df)
sorted_df = unsorted_df.sort_values(by='col1' ,kind='mergesort')

print (sorted_df)

   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
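
One practical consequence of stability, sketched below with the same data: sorting by a secondary key first and then stably by the primary key gives the same order as a single sort on both keys. The variable names are illustrative.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# Pass 1: order by the secondary key; pass 2: stable sort by the primary key
two_pass = (unsorted_df
            .sort_values(by='col2', kind='mergesort')
            .sort_values(by='col1', kind='mergesort'))

# Single sort on both keys for comparison
one_pass = unsorted_df.sort_values(by=['col1', 'col2'])

print(two_pass)
print(one_pass)
```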


# Python Pandas - Working with Text Data

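Pandas provides a set of string functions for operating on the text data in a Series or Index through the .str accessor. Almost all of these methods work like Python's built-in string functions (refer: https://docs.python.org/3/library/stdtypes.html#string-methods).

In [34]:
# So, access the Series through its .str accessor and then perform the operation.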

| S.No | Function | Description |
|---|---|---|
| 1 | lower() | Converts strings in the Series/Index to lower case. |
| 2 | upper() | Converts strings in the Series/Index to upper case. |
| 3 | len() | Computes the length of each string. |
| 4 | strip() | Strips whitespace (including newlines) from both sides of each string in the Series/Index. |
| 5 | split(' ') | Splits each string with the given pattern. |
| 6 | cat(sep=' ') | Concatenates the Series/Index elements with the given separator. |
| 7 | get_dummies() | Returns a DataFrame with one-hot encoded values. |
| 8 | contains(pattern) | Returns True for each element that contains the pattern, else False. |
| 9 | replace(a,b) | Replaces the value a with the value b. |
| 10 | repeat(value) | Repeats each element the specified number of times. |
| 11 | count(pattern) | Returns the number of occurrences of the pattern in each element. |
| 12 | startswith(pattern) | Returns True if the element in the Series/Index starts with the pattern. |
| 13 | endswith(pattern) | Returns True if the element in the Series/Index ends with the pattern. |
| 14 | find(pattern) | Returns the position of the first occurrence of the pattern, or -1 if it is not found. |
| 15 | findall(pattern) | Returns a list of all occurrences of the pattern. |
| 16 | swapcase() | Swaps the case lower/upper. |
| 17 | islower() | Checks whether all characters in each string in the Series/Index are in lower case. Returns Boolean. |
| 18 | isupper() | Checks whether all characters in each string in the Series/Index are in upper case. Returns Boolean. |
| 19 | isnumeric() | Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean. |

Let us now create a Series and see how all the above functions work.

In [37]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print (s)

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object


In [38]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print (s.str.lower())

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object


In [39]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print (s.str.len())

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
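
Note in the outputs above that the NaN entry passes through untouched, which is why len() returns floats. The sketch below, on a shortened made-up Series, shows the same behavior for a few other .str methods.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, '1234'])

upper = s.str.upper()        # the missing value stays missing
lengths = s.str.len()        # float result because of the NaN
digits = s.str.isnumeric()   # True only for '1234'

print(upper)
print(lengths)
print(digits)
```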


In [41]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s)
print ("After Stripping:")
print (s.str.strip())

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After Stripping:
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object


In [42]:
import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s)
print ("Split Pattern:")
print (s.str.split(' '))

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object


In [43]:
import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print (s.str.cat(sep='_'))

Tom _ William Rick_John_Alber@t


# get_dummies()

In [21]:
import pandas as pd
import numpy as np

s = pd.Series(['Very Good ', 'Good', 'Average', 'Bad'])

print (s.str.get_dummies())

   Average  Bad  Good  Very Good 
0        0    0     0           1
1        0    0     1           0
2        1    0     0           0
3        0    1     0           0
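str.get_dummies() can also split each string on a separator before building the indicator columns, which helps when a cell holds several labels. The following is a minimal sketch on a hypothetical Series of pipe-separated tags (the data and variable names are assumptions, not part of the example above):

```python
import pandas as pd

# Hypothetical data: each cell holds one or more tags separated by '|'
tags = pd.Series(['red|blue', 'red', 'blue|green'])

# One indicator column per distinct tag; columns come out in sorted order
dummies = tags.str.get_dummies(sep='|')
print (dummies)
```

Note that in the example above the value 'Very Good ' keeps its trailing space in the column name; calling s.str.strip() first is one way to avoid that.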


# contains(pattern)


In [46]:
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s.str.contains(' '))

0     True
1     True
2    False
3    False
dtype: bool
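The result of contains() is often used as a mask to select rows. When the Series holds missing values, the na argument controls what the mask contains for them. A small sketch, assuming a Series with one missing entry (the data is made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', np.nan, 'John'])

# na=False treats missing entries as "no match", so the mask is purely boolean
mask = s.str.contains(' ', na=False)

# Keep only the elements that contain a space
print (s[mask])
```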


# replace(a,b)

In [47]:
import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s)
print ("After replacing @ with $:")
print (s.str.replace('@','$'))

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After replacing @ with $:
0             Tom 
1     William Rick
2             John
3          Alber$t
dtype: object
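Whether the first argument of str.replace() is treated as a regular expression by default depends on the pandas version (the default changed in pandas 2.0), so it can be safer to pass regex explicitly. A short sketch of the difference, using made-up strings:

```python
import pandas as pd

s = pd.Series(['a.b', 'c.d'])

# Literal replacement: only the '.' characters change
literal = s.str.replace('.', '-', regex=False)

# Regular-expression replacement: '.' matches every character
as_regex = s.str.replace('.', '-', regex=True)

print (literal)
print (as_regex)
```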


# find(pattern)

In [23]:
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@te'])

print (s.str.find('e'))

0   -1
1   -1
2   -1
3    3
dtype: int64


# findall(pattern)

In [22]:
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@te'])

print (s.str.findall('e'))

0        []
1        []
2        []
3    [e, e]
dtype: object
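The remaining functions from the table (repeat, count, startswith, endswith, swapcase, islower, isupper and isnumeric) follow the same pattern. The sketch below applies them to a Series without missing values; the second Series, mixed, is a hypothetical helper added only to make the Boolean checks easier to read:

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

repeated = s.str.repeat(2)         # each string written twice
m_count = s.str.count('m')         # occurrences of 'm' per element
starts_t = s.str.startswith('T')   # True where the string starts with 'T'
ends_t = s.str.endswith('t')       # True where the string ends with 't'
swapped = s.str.swapcase()         # lower <-> upper

# Hypothetical Series for the Boolean checks
mixed = pd.Series(['tom', 'TOM', '1234'])
lower_flags = mixed.str.islower()
upper_flags = mixed.str.isupper()
numeric_flags = mixed.str.isnumeric()

print (repeated)
print (swapped)
```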


# Python Pandas - Indexing and Selecting Data

Pandas supports three types of multi-axes indexing, listed in the following table −

Indexing	Description

.loc()	Label based

.iloc()	Integer based

.ix()	Both Label and Integer based (deprecated; removed in pandas 1.0)

# loc()

Pandas provides various methods for purely label based indexing. When slicing with labels, both the start and the stop bound are included. Integers are valid labels, but they refer to the label and not the position.

.loc() has multiple access methods like −

A single scalar label

A list of labels

A slice object

A Boolean array

loc takes two indexers (a single label, a list or a slice) separated by ','. The first one selects rows and the second one selects columns.
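Two properties of .loc are easy to miss: label slices include both end points, and integer labels are matched as labels rather than positions. A minimal sketch with small, made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Label slices include both end points: rows 'b' and 'c'
label_slice = df.loc['b':'c', 'A']

# Integer labels are matched as labels, not as positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])
by_label = s.loc[0]       # value stored under the label 0
by_position = s.iloc[0]   # first value in the Series

print (label_slice)
print (by_label, by_position)
```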

In [1]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print (df.loc[:,'A'])

a   -0.316517
b    1.091240
c   -1.270601
d    0.455751
e   -0.166292
f   -0.427842
g   -0.694252
h    1.282733
Name: A, dtype: float64


In [5]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print (df.loc[ :,['A','C']])

          A         C
a  0.951917 -0.348413
b -0.045973  0.567391
c -1.396252  1.777808
d -1.471476  0.222327
e  1.193126  1.673692
f  0.329171  0.051306
g  0.921021  0.699672
h  0.554055  0.834023


In [52]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns, say list[]
print (df.loc[['a','b','f','h'],['A','C']])

          A         C
a  0.421074 -0.138104
b -1.556228  0.895045
f -1.166419  1.257084
h  1.396856  0.116111


In [2]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print (df.loc['a':'h'])

          A         B         C         D
a -1.096156 -0.331675 -0.429080  0.642509
b -2.090671  0.526775  0.081098 -0.572716
c -0.681225  0.818400 -0.797116  0.447487
d  0.891205 -0.392008 -0.220365  1.968877
e  0.334364 -0.320803 -0.669831 -0.782738
f  2.097875 -0.429507  2.297799 -0.121673
g  0.042948  0.347598 -0.078301 -1.474048
h -0.946089  1.581730 -1.451918  0.716189


In [9]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
print(df)
# comparing a row with a scalar yields a boolean Series (usable as a mask)
print (df.loc['a']>0)

          A         B         C         D
a  1.331587  0.715279 -1.545400 -0.008384
b  0.621336 -0.720086  0.265512  0.108549
c  0.004291 -0.174600  0.433026  1.203037
d -0.965066  1.028274  0.228630  0.445138
e -1.136602  0.135137  1.484537 -1.079805
f -1.977728 -1.743372  0.266070  2.384967
g  1.123691  1.672622  0.099149  1.397996
h -0.271248  0.613204 -0.267317 -0.549309
A     True
B     True
C    False
D    False
Name: a, dtype: bool
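A boolean Series like the one printed above can be passed back to .loc to select rows or columns. The sketch below uses a small DataFrame with fixed values (an assumption made so that the result is predictable):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [-1, 5, 6]}, index=['a', 'b', 'c'])

# Row mask: keep rows whose value in column 'A' is positive
positive_rows = df.loc[df['A'] > 0]

# Column mask: keep columns whose value in row 'a' is positive
positive_cols = df.loc[:, df.loc['a'] > 0]

print (positive_rows)
print (positive_cols)
```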


# .iloc()

Pandas provides various methods for purely integer based indexing. As in Python and NumPy, positions are 0-based.

The various access methods are as follows −

An Integer

A list of integers

A range of values

In [55]:
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select the first four rows for all columns
print (df.iloc[:4])

          A         B         C         D
0 -1.182955  0.569802 -1.677109  0.608616
1 -1.782474 -0.430946 -1.372322 -1.674401
2 -0.798998 -1.139961 -0.138075 -0.421836
3 -0.210284 -1.469152  0.932988  0.776239


In [10]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print (df.iloc[:4])
print (df.iloc[1:5, 2:4])

          A         B         C         D
0  0.132708 -0.476142  1.308473  0.195013
1  0.400210 -0.337632  1.256472 -0.731970
2  0.660232 -0.350872 -0.939433 -0.489337
3 -0.804591 -0.212698 -0.339140  0.312170
          C         D
1  1.256472 -0.731970
2 -0.939433 -0.489337
3 -0.339140  0.312170
4 -0.025905  0.289094


In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Selecting with lists of integer positions, then with slices
print (df.iloc[[1, 3, 5], [1, 3]])
print (df.iloc[1:3,:])
print (df.iloc[:,1:3])

          B         D
1 -1.907457  0.469751
3 -0.787269 -0.470807
5  0.141104 -1.618571
          A         B         C         D
1  0.117476 -1.907457 -0.922909  0.469751
2 -0.144367 -0.400138 -0.295984  0.848209
          B         C
0  0.089588  0.826999
1 -1.907457 -0.922909
2 -0.400138 -0.295984
3 -0.787269  0.292941
4 -0.739357 -0.312829
5  0.141104  0.273049
6 -1.320448  1.236205
7  0.346233  1.022516
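The access methods of .iloc can be summarised on a small DataFrame with known values. This is a sketch with made-up data; note that, unlike label slices with .loc, positional slices exclude the stop position:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

first_row = df.iloc[0]       # a single integer returns that row as a Series
last_row = df.iloc[-1]       # negative positions count from the end
cell = df.iloc[1, 2]         # row position 1, column position 2
block = df.iloc[1:3, 0:2]    # stop positions 3 and 2 are excluded

print (first_row)
print (block)
```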


# .ix()
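Besides pure label based and integer based indexing, Pandas used to provide a hybrid method for selecting and subsetting an object: the .ix() operator. It is deprecated since pandas 0.20 and was removed in pandas 1.0, so the examples below run only on older versions; use .loc or .iloc in new code.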

In [50]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print (df.ix[:4])

          A         B         C         D
0  0.856831 -0.651026 -1.034243  0.681595
1 -0.803410 -0.689550 -0.455533  0.017479
2 -0.353994 -1.374951 -0.643618 -2.223403
3  0.625231 -1.602058 -1.104383  0.052165
4 -0.739563  1.543015 -1.292857  0.267051


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


In [13]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Index slicing
print (df.ix[1:5,'A'])

1   -0.639963
2    1.339926
3   -0.287629
4    0.377753
5    0.332350
Name: A, dtype: float64
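On current pandas versions the two .ix calls above can be rewritten with .loc or .iloc. The sketch below assumes a default integer index, as in the examples above, and uses fixed values instead of random ones so the results are easy to check:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=['A', 'B', 'C', 'D'])

# df.ix[:4] on a default integer index was label based: rows 0..4 inclusive
rows_by_label = df.loc[:4]

# df.ix[1:5, 'A'] -> label based rows and a label based column
col_by_label = df.loc[1:5, 'A']

# Positional rows with a named column: look up the column position first
col_by_position = df.iloc[1:5, df.columns.get_loc('A')]

print (rows_by_label)
print (col_by_label)
print (col_by_position)
```

The label based slice returns five rows (1 to 5), while the positional slice returns four (positions 1 to 4).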




# Use of Notations
Getting values from the Pandas object with Multi-axes indexing uses the following notation −

Object	Indexers	Return Type

Series	s.loc[indexer]	Scalar value

DataFrame	df.loc[row_index,col_index]	Series object

Panel	p.loc[item_index,major_index, minor_index]	p.loc[item_index,major_index, minor_index]

Note − .iloc() and .ix() accept the same indexing options and return the same types.
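In practice the return type depends on whether each indexer is a single label or a list/slice. The following sketch illustrates this on a small Series and DataFrame (all names and values are made up for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])

scalar_from_series = s.loc['y']          # single label -> scalar
scalar_from_frame = df.loc['r1', 'B']    # two single labels -> scalar
row_as_series = df.loc['r1']             # single row label -> Series
sub_frame = df.loc[['r1'], ['A', 'B']]   # lists of labels -> DataFrame

print (type(row_as_series))
print (type(sub_frame))
```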


Let us now see how each operation can be performed on the DataFrame object. We will use the basic indexing operator '[ ]' −

In [60]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print (df['A'])

0   -0.205356
1    0.321133
2    0.439631
3   -0.099061
4    1.323027
5   -0.138691
6    0.513821
7   -1.022593
Name: A, dtype: float64


In [61]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print (df[['A','B']])

          A         B
0  0.055788  1.290568
1  1.418078 -0.389710
2 -0.671920 -2.009678
3  0.601136  0.443111
4 -0.062693 -0.392977
5 -1.504051 -0.364832
6 -0.720522  0.950647
7  0.107653 -1.719542


In [14]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print (df[2:4])

          A         B         C         D
2  0.274173 -0.514910 -1.711071  0.612297
3  1.100129  0.564353 -0.712799 -0.260859


# Attribute Access
Columns can be selected using the attribute operator '.'.

In [64]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print (df.A)

0   -0.411180
1   -0.978560
2   -0.614411
3    0.554230
4   -0.371954
5   -2.003878
6   -1.847602
7   -2.558528
Name: A, dtype: float64
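Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. A brief sketch, using hypothetical column names chosen to show the limitation:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

a_values = df.A                # works: 'A' is a valid identifier
count_values = df['count']     # df.count is the DataFrame.count method
spaced_values = df['my col']   # a name with a space is not a valid identifier

print (callable(df.count))
print (count_values)
```

For this reason the '[ ]' operator is the more robust choice in scripts.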


# Python Pandas - Statistical Functions

# Percent_change
Series, DataFrames and Panel all have the function pct_change(). This function compares every element with its prior element and computes the fractional change (multiply by 100 for a percentage).

In [15]:
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print (s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change())
# By default, pct_change() operates down each column; to apply it row-wise, pass the axis=1 argument.

0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
          0          1
0       NaN        NaN
1 -0.637878 -11.198874
2 -3.422428  -1.814477
3  0.232326  -0.440506
4 -0.885931  -3.102784
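The calculation behind pct_change() can be reproduced by dividing each element by its predecessor and subtracting 1. The sketch below checks this and also shows the row-wise variant with axis=1; the DataFrame and its column names q1 and q2 are assumptions made for illustration:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 5, 4])

# Divide by the previous element and subtract 1
manual = s / s.shift(1) - 1
builtin = s.pct_change()
print (np.allclose(manual, builtin, equal_nan=True))

# Row-wise change: each column is compared with the column to its left
df = pd.DataFrame({'q1': [100.0, 200.0], 'q2': [110.0, 150.0]})
row_wise = df.pct_change(axis=1)
print (row_wise)
```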


# Covariance
Covariance is applied on series data. The Series object has a method cov to compute covariance between series objects. NA will be excluded automatically.

Cov Series

In [69]:
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))

-0.3765000504


When applied on a DataFrame, the cov method computes the covariance between all pairs of columns.

In [71]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].cov(frame['b']))
print (frame.cov())

-0.249256449234
          a         b         c         d         e
a  0.468448 -0.249256 -0.099059  0.447822 -0.136773
b -0.249256  1.573827  1.259837 -0.430574  0.618939
c -0.099059  1.259837  1.384858 -0.250537  0.338880
d  0.447822 -0.430574 -0.250537  1.557379  0.751681
e -0.136773  0.618939  0.338880  0.751681  1.973515
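To make the definition concrete, the sketch below recomputes the sample covariance by hand (sum of the products of deviations, divided by n - 1) and compares it with cov(). It also illustrates the automatic exclusion of NA: a pair with a missing value is dropped before the calculation. The data is made up:

```python
import pandas as pd
import numpy as np

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 9.0])

# Sample covariance: sum of products of deviations divided by (n - 1)
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
builtin_cov = x.cov(y)
print (manual_cov, builtin_cov)

# The pair at position 1 contains NaN and is ignored
y_missing = pd.Series([2.0, np.nan, 6.0, 9.0])
cov_with_na = x.cov(y_missing)
cov_dropped = x.drop(1).cov(y_missing.drop(1))
print (cov_with_na, cov_dropped)
```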


# Correlation
Correlation measures the relationship between any two arrays of values (Series). There are multiple methods to compute the correlation: pearson (the default, which measures the linear relationship), spearman and kendall.

In [72]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print (frame['a'].corr(frame['b']))

print (frame.corr())

-0.323911539229
          a         b         c         d         e
a  1.000000 -0.323912  0.120081  0.120534 -0.212602
b -0.323912  1.000000 -0.533404 -0.198510 -0.137630
c  0.120081 -0.533404  1.000000 -0.202588  0.703600
d  0.120534 -0.198510 -0.202588  1.000000 -0.555976
e -0.212602 -0.137630  0.703600 -0.555976  1.000000
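The difference between the methods shows up on data that is monotonic but not linear. In the sketch below the Spearman coefficient is computed as the Pearson correlation of the ranks, which is its definition; this route is chosen here only to keep the example free of optional dependencies, and passing method='spearman' to corr() is the usual way to request it:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 2   # increases with x, but not linearly

pearson = x.corr(y)                  # default method
spearman = x.rank().corr(y.rank())   # Pearson correlation of the ranks

print (pearson)
print (spearman)
```

The Pearson value stays below 1 because the relationship is curved, while the rank based value is 1 because the ordering is preserved exactly.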


# Data Ranking
Data ranking produces a rank for each element in the array of elements. In case of ties, the mean rank is assigned by default.

Rank optionally takes a parameter ascending, which is True by default; when False, data is reverse-ranked, with larger values assigned a smaller rank.

Rank supports different tie-breaking methods, specified with the method parameter −

average − average rank of tied group

min − lowest rank in the group

max − highest rank in the group

first − ranks assigned in the order they appear in the array

In [24]:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
s1 = pd.Series([10,4,11,9,15], index=['a','b','c','d','e'])
s['d'] = s['b'] # so there's a tie
s1['d'] = s1['b'] 
print (s.rank())
print (s1.rank())

a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64
a    3.0
b    1.5
c    4.0
d    1.5
e    5.0
dtype: float64
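The effect of the method and ascending parameters can be seen on the same integer data with a tie between 'b' and 'd'. A minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 4, 11, 4, 15], index=['a', 'b', 'c', 'd', 'e'])

rank_min = s.rank(method='min')       # tied values share the lowest rank
rank_max = s.rank(method='max')       # tied values share the highest rank
rank_first = s.rank(method='first')   # ties broken by order of appearance
rank_desc = s.rank(ascending=False)   # largest value gets rank 1

print (rank_min)
print (rank_first)
print (rank_desc)
```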


# Python Pandas - Missing Data

In [27]:
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df)

        one       two     three
a  0.602022  0.438365 -0.782343
b       NaN       NaN       NaN
c  0.192936  0.004025 -0.164075
d       NaN       NaN       NaN
e -1.148812 -0.835509  0.210451
f  1.013985 -0.970198  1.217182
g       NaN       NaN       NaN
h  0.182647 -1.269820  0.323390


In [28]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [76]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].notnull())

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool


# Calculations with Missing Data

When summing data, NA will be treated as zero.
If the data are all NA, the result is 0 in pandas 0.22 and later (earlier versions returned NaN), as the second example below shows.

In [77]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].sum())

1.2557754832893244


In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
print (df['one'].sum())

0
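If a sum over all-NA data should stay missing rather than become 0, the min_count argument of sum() can be used. A short sketch with made-up Series:

```python
import pandas as pd
import numpy as np

partial = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

partial_sum = partial.sum()                 # NaN is skipped
partial_mean = partial.mean()               # mean of the two valid values
empty_sum = all_missing.sum()               # 0.0 in pandas 0.22 and later
strict_sum = all_missing.sum(min_count=1)   # NaN: fewer than 1 valid value

print (partial_sum, partial_mean)
print (empty_sum, strict_sum)
```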


# Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

# Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".

In [80]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
print ("NaN replaced with '0':")
print (df.fillna(0))

        one       two     three
a  0.407755  0.659469 -0.128220
b       NaN       NaN       NaN
c  0.078914  1.368266  0.684596
NaN replaced with '0':
        one       two     three
a  0.407755  0.659469 -0.128220
b  0.000000  0.000000  0.000000
c  0.078914  1.368266  0.684596
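fillna() also accepts a dictionary, so that each column can receive its own fill value. The sketch below fills one column with 0 and another with its own mean; the data is made up for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [np.nan, 4.0, 6.0]})

# Fill column 'one' with 0 and column 'two' with the mean of its valid values
filled = df.fillna({'one': 0, 'two': df['two'].mean()})

print (filled)
```

fillna() returns a new object here; the original df still contains its NaN values.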


# Fill NA Forward and Backward
Using the concepts of filling discussed in the ReIndexing chapter, we will fill the missing values.

Method	Action

pad/ffill	Fill values forward

bfill/backfill	Fill values backward

In [81]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df.fillna(method='pad'))

        one       two     three
a  0.611720  1.595157 -0.036167
b  0.611720  1.595157 -0.036167
c  1.369501 -0.973567  0.376473
d  1.369501 -0.973567  0.376473
e -0.929963 -0.267522  0.177213
f -0.356186 -1.668491  0.588191
g -0.356186 -1.668491  0.588191
h -1.492425 -0.791582 -0.042005


In [82]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.fillna(method='backfill'))

        one       two     three
a -0.719641 -0.169620  0.137299
b  0.461368 -0.337278  0.012493
c  0.461368 -0.337278  0.012493
d -1.208236  0.337140 -0.358174
e -1.208236  0.337140 -0.358174
f  0.467975 -0.685872 -0.655403
g  0.028664 -0.570928  2.595380
h  0.028664 -0.570928  2.595380
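Recent pandas releases deprecate the method argument of fillna() in favour of the dedicated ffill() and bfill() methods. The sketch below shows the equivalent calls on a small Series, plus the limit argument that caps how many consecutive NaN values are filled:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # same result as fillna(method='pad')
backward = s.bfill()         # same result as fillna(method='backfill')
limited = s.ffill(limit=1)   # fill at most one consecutive NaN

print (forward)
print (backward)
print (limited)
```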


# Drop Missing Values
If you want to simply exclude the missing values, then use the dropna function along with the axis argument. By default, axis=0, i.e., along row, which means that if any value within a row is NA then the whole row is excluded.

In [53]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print (df.dropna())

        one       two     three
a -0.310886  0.097400  0.399046
b       NaN       NaN       NaN
c -2.772593  1.955912  0.390093
d       NaN       NaN       NaN
e -0.652409 -0.390953  0.493742
f -0.116104 -2.030684  2.064493
g       NaN       NaN       NaN
h -0.110541  1.020173 -0.692050
        one       two     three
a -0.310886  0.097400  0.399046
c -2.772593  1.955912  0.390093
e -0.652409 -0.390953  0.493742
f -0.116104 -2.030684  2.064493
h -0.110541  1.020173 -0.692050


In [54]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.dropna(axis=1))
print (df.dropna(axis=0))

Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
        one       two     three
a  1.536377  0.286344  0.608844
c -1.045253  1.211145  0.689818
e  1.301846 -0.628088 -0.481027
f  2.303917 -1.060016 -0.135950
h  1.136891  0.097725  0.582954
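dropna() has further arguments to control which rows are excluded: how, thresh and subset. A minimal sketch on a DataFrame with a known pattern of missing values (the data is an assumption made for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, np.nan, 4.0],
                   'two': [1.0, np.nan, 3.0, np.nan],
                   'three': [1.0, np.nan, np.nan, np.nan]},
                  index=['a', 'b', 'c', 'd'])

all_missing_dropped = df.dropna(how='all')   # drop rows where every value is NA
at_least_two = df.dropna(thresh=2)           # keep rows with at least 2 valid values
one_present = df.dropna(subset=['one'])      # only look at column 'one'

print (all_missing_dropped)
print (at_least_two)
print (one_present)
```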


In [85]:
df

        one       two     three
a  0.827090  0.643524  0.619305
b       NaN       NaN       NaN
c -0.871785  0.882963 -0.036891
d       NaN       NaN       NaN
e -0.061261  0.751354  0.638887
f -1.461456  0.041011  1.051264
g       NaN       NaN       NaN
h  0.254678 -0.808888  0.254184


# Replace Missing (or) Generic Values
Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value is equivalent to using the fillna() function.

In [86]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
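The dictionary above is applied to every column. To restrict a replacement to one column, a nested dictionary can be used; replace() can also turn a sentinel value into NaN so that the missing-data functions apply. A sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 2000], 'two': [1000, 2000, 30]})

# Nested dict: replace 2000 only in column 'one'
column_specific = df.replace({'one': {2000: 60}})

# Replace a sentinel value with NaN so that missing-data tools can handle it
with_nan = df.replace(1000, np.nan)

print (column_specific)
print (with_nan)
```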


# Python Pandas - GroupBy

In [88]:
#import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print (df)

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017


# Split Data into Groups
A pandas object can be split into groups along any of its axes. There are multiple ways to split an object, such as −

obj.groupby('key')

obj.groupby(['key1','key2'])

obj.groupby(key,axis=1)

Let us now see how the grouping objects can be applied to the DataFrame object

In [90]:
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print (df.groupby('Team'))

<pandas.core.groupby.DataFrameGroupBy object at 0x000001BBC60C9160>


# View Groups

In [3]:
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],           
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df1 = df.groupby(['Year'])
print(df1.sum())

print (df.groupby('Team').groups)

      Rank  Points
Year              
2014    10    3181
2015    10    3078
2016     3    1450
2017     3    1478
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
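Besides .groups, the groupby object exposes the number of groups and the size of each group without computing any aggregate. A short sketch on a reduced, made-up version of the data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Devils', 'Riders', 'Kings'],
                   'Points': [876, 863, 789, 741]})

grouped = df.groupby('Team')

n_groups = grouped.ngroups     # number of distinct keys
group_sizes = grouped.size()   # number of rows per group, indexed by key

print (n_groups)
print (group_sizes)
```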


# Group by with multiple columns

In [3]:
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df1 = df.groupby(['Team','Year'])
print(df1.sum())
print (df.groupby(['Team','Year']).groups)

             Rank  Points
Team   Year              
Devils 2014     2     863
       2015     3     673
Kings  2014     3     741
       2016     1     756
       2017     1     788
Riders 2014     1     876
       2015     2     789
       2016     2     694
       2017     2     690
Royals 2014     4     701
       2015     1     804
kings  2015     4     812
{('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64')}


# Iterating through Groups

In [96]:
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

for name, group in grouped:
    print (name)   # the key of the group
    print (group)  # the rows that belong to the group

2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017


# Select a Group
Using the get_group() method, we can select a single group.

In [97]:
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print (grouped.get_group(2014))

   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014


# Aggregations
An aggregation function returns a single aggregated value for each group. Once the groupby object is created, several aggregation operations can be performed on the grouped data.

An obvious one is aggregation via the aggregate method or its equivalent alias agg.

In [34]:
# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print (grouped['Points'].agg(np.mean))

Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64


In [99]:
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print (grouped.agg(np.size))

        Points  Rank  Year
Team                      
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1


In [100]:
# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print (grouped['Points'].agg([np.sum, np.mean, np.std]))

         sum        mean         std
Team                                
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
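Aggregations can also be given by name as strings, which avoids the warnings that newer pandas versions may emit for NumPy callables such as np.mean. Named aggregation (available from pandas 0.25) additionally lets each output column pick its own source column and function. The sketch below uses a reduced version of the data; the output column names total_points and best_rank are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
                   'Rank': [1, 2, 2, 3],
                   'Points': [876, 789, 863, 673]})

grouped = df.groupby('Team')

# Several aggregations on one column, referenced by name
by_name = grouped['Points'].agg(['sum', 'mean'])

# Named aggregation: output column = (source column, function)
summary = grouped.agg(total_points=('Points', 'sum'),
                      best_rank=('Rank', 'min'))

print (by_name)
print (summary)
```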


# Filtration
Filtration filters the data on a defined criterion and returns a subset of the data. The filter() function takes a function that is called once per group and returns True for the groups to keep.

In [41]:
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print (grouped.filter(lambda x: len(x) >= 4))

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
5      812     4   Kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017
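The condition passed to filter() can be any test on the group, not only its length. As a sketch on a reduced, made-up version of the data, the following keeps every row of the teams whose average Points exceed 800 (the threshold is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
                   'Points': [876, 789, 863, 673, 741]})

# The lambda receives each group as a DataFrame and returns True or False
strong = df.groupby('Team').filter(lambda g: g['Points'].mean() > 800)

print (strong)
```

Only the Riders rows remain, because their mean of 832.5 is the only one above the threshold.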

