# 1) Pandas

Pandas is an open-source Python library that provides high-performance data manipulation
and analysis tools built on its powerful data structures.

The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

## Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

## Pandas deals with the following three data structures
- Series
- DataFrame
- Panel (removed from current pandas releases; see section 4)

These data structures are built on top of NumPy arrays, which makes them fast.

## Dimension & Description
The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. 

For example, 

DataFrame is a container of Series, 

Panel is a container of DataFrame.

|Data Structure|Dimensions|Description|
|-|-|-|
|Series|1|1D labeled homogeneous array, size-immutable.|
|DataFrame|2|General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.|
|Panel|3|General 3D labeled, size-mutable array.|

Building and handling arrays of two or more dimensions is tedious: the user must keep track of the orientation of the data set when writing functions. Pandas data structures reduce this mental effort.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
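The container relationship and the named axes can be seen in a small sketch. The two-row table below is made-up illustration data, not part of the original examples.

```python
import pandas as pd

# Hypothetical two-row table used only for illustration.
df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

# Each column of a DataFrame is itself a Series that shares the row index.
age = df["Age"]
print(type(age).__name__)

# Row labels live in df.index (axis 0), column labels in df.columns (axis 1).
print(df.index.tolist())
print(df.columns.tolist())

# Axes can be named instead of numbered: "index" is axis 0, "columns" is axis 1.
col_totals = df[["Age"]].sum(axis="index")
print(col_totals)
```

Naming the axis tends to read more clearly than `axis=0` or `axis=1` once a script grows beyond a few lines.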

# 2) Python Pandas - Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

A pandas Series can be created using the following constructor −

In [None]:
pandas.Series( data, index, dtype, copy)

In [5]:
# A basic series, which can be created is an Empty Series.
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(dtype='float64')  # an explicit dtype avoids the empty-Series dtype warning
print(s)

Series([], dtype: float64)




In [7]:
# Create a Series from ndarray
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0    a
1    b
2    c
3    d
dtype: object


In [9]:
# We passed the index values here. Now we can see the customized indexed values in the output.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [11]:
# Create a Series from dict
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [13]:
# Passing index
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


In [15]:
# Create a Series from Scalar
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

0    5
1    5
2    5
3    5
dtype: int64


In [16]:
# Accessing Data from Series with Position
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element
print(s.iloc[0])  # .iloc makes the positional lookup explicit

1


In [18]:
# Retrieve the first three elements in the Series.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three element
print(s[:3])

a    1
b    2
c    3
dtype: int64


In [20]:
# Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three element
print(s[-3:])

c    3
d    4
e    5
dtype: int64


In [21]:
# Retrieve Data Using Label (Index)
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])

1


In [23]:
# Retrieve multiple elements using a list of index label values.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print(s[['a','c','d']])

a    1
c    3
d    4
dtype: int64


In [None]:
# If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# 'f' is not in the index, so this lookup raises KeyError
print(s['f'])
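When a label may be absent, the lookup can be guarded instead of letting the exception propagate. The following is a minimal sketch of three common options, using the same Series as above.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# The `in` operator tests the index labels, not the values.
has_f = "f" in s
print(has_f)

# Series.get returns a default instead of raising KeyError.
value = s.get("f", default=None)
print(value)

# Catching the exception explicitly is the third option.
try:
    s["f"]
    raised = False
except KeyError:
    raised = True
print(raised)
```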

# 3) Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure in which data is aligned in a tabular fashion in rows and columns.

Features of DataFrame:

- Columns can be of different types
- Size-mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns

A pandas DataFrame can be created using the following constructor −

In [None]:
pandas.DataFrame( data, index, columns, dtype, copy)

In [27]:
# Create an Empty DataFrame
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [29]:
# Create a DataFrame from Lists
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

   0
0  1
1  2
2  3
3  4
4  5


In [31]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [32]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})  # cast only 'Age'; the 'Name' strings cannot be converted to float
print(df)

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


## 3.1) Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

In [33]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [34]:
# Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
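The same-length requirement can be checked with a short sketch. The mismatched dictionary below is hypothetical data chosen to trigger the error.

```python
import pandas as pd

# Hypothetical data: 'Name' has two entries but 'Age' has only one.
bad_data = {"Name": ["Tom", "Jack"], "Age": [28]}

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    error_message = str(exc)
print(error_message)

# With equal lengths and no index argument, the index defaults to range(n).
ok = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})
print(list(ok.index))
```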


## 3.2) Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.

In [36]:
# The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

# Note − Observe, NaN (Not a Number) is appended in missing areas.

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [37]:
# The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [38]:
# The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


## 3.3) Create a DataFrame from Dict of Series

A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of all the Series indexes passed.

Note: In the example below, series `one` has no label 'd', so the result holds NaN for that label in column `one`.

The same DataFrame is then used to illustrate column selection, addition, and deletion.

In [39]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


## 3.4) Column Selection

We will understand this by selecting a column from the DataFrame.

In [40]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
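Selecting with a single label and selecting with a list of labels return different types, which is a common source of confusion. A minimal sketch, using a simplified frame rather than the one above:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [1, 2, 3]},
                  index=["a", "b", "c"])

single = df["one"]            # one label -> Series
subset = df[["one", "two"]]   # list of labels -> DataFrame

print(type(single).__name__)
print(type(subset).__name__)
```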


## 3.5) Column Addition
We will understand this by adding a new column to an existing data frame.

In [41]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


## 3.6) Column Deletion
Columns can be deleted or popped; let us take an example to understand how.

In [42]:
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)

# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
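Both `del` and `pop` modify the DataFrame in place. When the original should stay untouched, `drop(columns=...)` returns a new frame instead. The sketch below uses a small made-up frame to contrast the two behaviours.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

# drop() returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=["one"])
print(list(trimmed.columns))
print(list(df.columns))

# pop() removes the column from df and returns it as a Series.
popped = df.pop("two")
print(popped.tolist())
print(list(df.columns))
```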


## 3.7) Row Selection, Addition, and Deletion
We will now understand row selection, addition and deletion through examples. 

Let us begin with the concept of selection.

### 3.7.1) Selection by Label

Rows can be selected by passing a row label to the loc indexer.

In [43]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

one    2.0
two    2.0
Name: b, dtype: float64


### 3.7.2) Selection by integer location
Rows can be selected by passing an integer location to the iloc indexer.

In [44]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

one    3.0
two    3.0
Name: c, dtype: float64


### 3.7.3) Slice Rows
Multiple rows can be selected using ‘ : ’ operator.

In [45]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

   one  two
c  3.0    3
d  NaN    4
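Slices behave differently depending on the indexer: positional slices exclude the end point, while label slices with `loc` include it. A small sketch with an illustrative single-column frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# Positional slice: end position is excluded, so rows 'b' and 'c'.
by_position = df.iloc[1:3]
print(by_position.index.tolist())

# Label slice: end label is included, so rows 'b', 'c' and 'd'.
by_label = df.loc["b":"d"]
print(by_label.index.tolist())
```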


### 3.7.4) Addition of Rows
Add new rows to a DataFrame with pd.concat, which stacks the rows of the second frame after those of the first. (The older DataFrame.append method was removed in pandas 2.0.)

In [46]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8


### 3.7.5) Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows are dropped.

In the example above, the row labels are duplicated. Let us drop one label and see how many rows are removed.

In [47]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])  # pd.concat replaces the removed DataFrame.append

# Drop rows with label 0
df = df.drop(0)

print(df)

   a  b
1  3  4
1  7  8
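The duplicate labels arise because each input frame carries its own labels 0 and 1. If unique labels are wanted, one option is to renumber the rows while combining, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# ignore_index=True discards the old labels and assigns 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined.index.tolist())

# With unique labels, drop(0) removes exactly one row.
remaining = combined.drop(0)
print(remaining)
```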


# 4) Python Pandas - Xarray
Panel has been removed from pandas. The xarray library is the usual alternative for labeled N-dimensional data and is left for later exploration.


# 5) Python Pandas - Basic Functionality

## Series Basic Functionality
|Sr.No.|	Attribute or Method & Description|
|-|-|
|1|axes<br> Returns a list of the row axis labels|
|2|dtype<br> Returns the dtype of the object.|
|3|empty<br> Returns True if series is empty.|
|4|ndim<br> Returns the number of dimensions of the underlying data, by definition 1.|
|5|size<br> Returns the number of elements in the underlying data.|
|6|values<br> Returns the Series as ndarray.|
|7|head()<br> Returns the first n rows.|
|8|tail()<br> Returns the last n rows.|

In [9]:
import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

0   -1.458714
1    0.475064
2    0.456089
3   -0.015047
dtype: float64


In [10]:
# axes
# Returns the list of the labels of the series.
s = pd.Series(np.random.randn(4))
print("The axes are:")
print(s.axes)

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


In [11]:
# empty
# Returns the Boolean value saying whether the Object is empty or not. True indicates that the object is empty.
s = pd.Series(np.random.randn(4))
print("Is the Object empty?")
print(s.empty)

Is the Object empty?
False


In [12]:
# ndim
# Returns the number of dimensions of the object. 
# By definition, a Series is a 1D data structure, so it returns 1.
s = pd.Series(np.random.randn(4))
print ("The dimensions of the object:")
print(s.ndim)

The dimensions of the object:
1


In [13]:
# size
# Returns the size(length) of the series.
s = pd.Series(np.random.randn(2))
print("The size of the object:")
print(s.size)

The size of the object:
2


In [17]:
# values
# Returns the actual data in the series as an array.
s = pd.Series(np.random.randn(4))
print ("The actual data series is:")
print(s.values)

The actual data series is:
[ 2.12879776  1.5501582   0.34871063 -1.58515656]


In [20]:
# Head & Tail
# To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)
print ("The first two rows of the data series:")
print (s.head(2))
print ("The last two rows of the data series:")
print (s.tail(2))

The original series is:
0   -0.526445
1   -0.602228
2    0.022171
3   -1.246760
dtype: float64
The first two rows of the data series:
0   -0.526445
1   -0.602228
dtype: float64
The last two rows of the data series:
2    0.022171
3   -1.246760
dtype: float64


## DataFrame Basic Functionality

|Sr.No.	| Attribute or Method & Description|
|-|-|
|1	|T <br> Transposes rows and columns.|
|2	|axes <br> Returns a list with the row axis labels and column axis labels as the only members.|
|3	|dtypes <br> Returns the dtypes in this object.|
|4	|empty <br> True if the DataFrame is entirely empty (no items), that is, if any of the axes has length 0.|
|5 |ndim <br> Number of axes / array dimensions.|
|6 |shape <br> Returns a tuple representing the dimensionality of the DataFrame.|
|7 |size <br> Number of elements in the NDFrame.|
|8 |values <br> Numpy representation of NDFrame.|
|9 |head() <br> Returns the first n rows.|
|10 |tail() <br> Returns last n rows.|

In [22]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data series is:")
print (df)

Our data series is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80


In [23]:
# T (Transpose)
# Returns the transpose of the DataFrame. The rows and columns will interchange.
# Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print ("The transpose of the data series is:")
print (df.T)

The transpose of the data series is:
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8


In [24]:
# axes
# Returns the list of row axis labels and column axis labels.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Row axis labels and column axis labels are:")
print (df.axes)

Row axis labels and column axis labels are:
[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]


In [25]:
# dtypes
# Returns the data type of each column.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("The data types of each column are:")
print (df.dtypes)

The data types of each column are:
Name       object
Age         int64
Rating    float64
dtype: object


In [26]:
# empty
# Returns the Boolean value saying whether the Object is empty or not; True indicates that the object is empty.

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Is the object empty?")
print (df.empty)


Is the object empty?
False


In [27]:
# ndim
# Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The dimension of the object is:")
print (df.ndim)

Our object is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The dimension of the object is:
2


In [28]:
# shape
# Returns a tuple representing the dimensionality of the DataFrame. 
# Tuple (a,b), where a represents the number of rows and b represents the number of columns.

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The shape of the object is:")
print (df.shape)


Our object is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The shape of the object is:
(7, 3)


In [29]:
# size
# Returns the number of elements in the DataFrame.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The total number of elements in our object is:")
print (df.size)

Our object is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The total number of elements in our object is:
21


In [30]:
# values
# Returns the actual data in the DataFrame as an NDarray.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The actual data in our data frame is:")
print (df.values)

Our object is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The actual data in our data frame is:
[['Tom' 25 4.23]
 ['James' 26 3.24]
 ['Ricky' 25 3.98]
 ['Vin' 23 2.56]
 ['Steve' 30 3.2]
 ['Smith' 29 4.6]
 ['Jack' 23 3.8]]


In [33]:
# Head & Tail
# To view a small sample of a DataFrame object, use the head() and tail() methods.
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The first two rows of the data frame is:")
print (df.head(2))
print ("The last two rows of the data frame is:")
print (df.tail(2))

Our data frame is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80
The first two rows of the data frame is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
The last two rows of the data frame is:
    Name  Age  Rating
5  Smith   29     4.6
6   Jack   23     3.8


# 6) Python Pandas - Descriptive Statistics

A large number of methods collectively compute descriptive statistics and other related operations on DataFrame. 

Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size.

Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer:

DataFrame − “index” (axis=0, default), “columns” (axis=1)

In [34]:
# Let us create a DataFrame and use this object throughout this chapter for all the operations.
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)

      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65


In [35]:
# sum()
# Returns the sum of the values for the requested axis. 
# By default, axis is index (axis=0).
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object


In [38]:
# axis=1
# This syntax will give the output as shown below.
df = pd.DataFrame(d)
print(df.sum(axis=1, numeric_only=True))  # numeric_only skips the string column 'Name'

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
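The axis can also be given by name rather than by integer. The sketch below uses a reduced, numeric-only version of the data to keep the result easy to verify.

```python
import pandas as pd

# Reduced numeric-only data for illustration.
df = pd.DataFrame({"Age": [25, 26], "Rating": [4.23, 3.24]})

# axis="index" (same as axis=0): collapse the rows, one result per column.
by_col = df.sum(axis="index")
print(by_col)

# axis="columns" (same as axis=1): collapse the columns, one result per row.
by_row = df.sum(axis="columns")
print(by_row)
```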


In [42]:
# mean()
# Returns the average value of each numeric column
df = pd.DataFrame(d)
print(df.mean(numeric_only=True))  # numeric_only skips the string column 'Name'

Age       31.833333
Rating     3.743333
dtype: float64


In [44]:
# std()
# Returns the sample standard deviation (Bessel's correction, ddof=1) of the numerical columns.
df = pd.DataFrame(d)
print(df.std(numeric_only=True))  # numeric_only skips the string column 'Name'

Age       9.232682
Rating    0.661628
dtype: float64


## 6.1) Functions & Description
Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions:

|Sr.No.	|Function	|Description|
|-|-|-|
|1	|count()	|Number of non-null observations|
|2	|sum()	|Sum of values|
|3	|mean()	|Mean of Values|
|4	|median()	|Median of Values|
|5	|mode()	|Mode of values|
|6	|std()	|Standard Deviation of the Values|
|7	|min()	|Minimum Value|
|8	|max()	|Maximum Value|
|9	|abs()	|Absolute Value|
|10	|prod()	|Product of Values|
|11	|cumsum()	|Cumulative Sum|
|12	|cumprod()	|Cumulative Product|

Functions like sum() and cumsum() work with both numeric and string data elements without raising an exception, although string aggregations are rarely useful in practice.

Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed.
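One way to avoid such exceptions is to restrict the frame to its numeric columns before calling the function. The sketch below assumes a small made-up frame with one string column; `select_dtypes` does the filtering.

```python
import pandas as pd

# Hypothetical frame mixing a string column with numeric columns.
df = pd.DataFrame({"Name": ["Tom", "James"],
                   "Age": [25, 26],
                   "Delta": [-1.5, 2.0]})

# Keep only numeric columns, then apply the numeric-only operation.
numeric = df.select_dtypes(include="number")
abs_vals = numeric.abs()
print(abs_vals)
```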

In [46]:
# Summarizing Data
# The describe() function computes a summary of statistics pertaining to the DataFrame columns.
df = pd.DataFrame(d)
print(df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


This function reports the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) of each column.

By default it excludes the character columns and summarizes only the numeric columns. The 'include' argument controls which columns are considered for summarizing. It takes a list of values; by default, 'number'.

- object: summarizes string columns
- number: summarizes numeric columns
- all: summarizes all columns together (pass it as a plain string, not as a list value)

Now, use the following statement in the program and check the output −

In [48]:
print(df.describe(include=['object']))

        Name
count     12
unique    12
top     Jack
freq       1


In [55]:
print(df.describe(include='all'))

        Name        Age     Rating
count     12  12.000000  12.000000
unique    12        NaN        NaN
top     Jack        NaN        NaN
freq       1        NaN        NaN
mean     NaN  31.833333   3.743333
std      NaN   9.232682   0.661628
min      NaN  23.000000   2.560000
25%      NaN  25.000000   3.230000
50%      NaN  29.500000   3.790000
75%      NaN  35.500000   4.132500
max      NaN  51.000000   4.800000


In [56]:
print(df.describe(include='number'))

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


In [57]:
print(df.describe(include='number').loc['mean'])

Age       31.833333
Rating     3.743333
Name: mean, dtype: float64


# 7) Python Pandas - Function Application

To apply your own or another library’s functions to Pandas objects, you should be aware of the three important methods. 

The methods have been discussed below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element wise.

- Table wise Function Application: pipe()
- Row or Column Wise Function Application: apply()
- Element wise Function Application: applymap()

Table-wise operations are performed by passing a function, followed by any additional arguments it needs, to pipe(). The function receives the whole DataFrame as its first argument.

For example, add the value 2 to all the elements in the DataFrame.

In [78]:
def adder(ele1,ele2):
   return ele1+ele2

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('---------------')
print(df.pipe(adder,2))

       col1      col2      col3
0  1.913940 -1.253825  0.083235
1 -0.662172 -0.182668 -0.109724
2  1.984517 -0.464194  0.110963
3  1.508906  0.585394 -1.886177
4  1.480049 -1.346783  0.760450
---------------
       col1      col2      col3
0  3.913940  0.746175  2.083235
1  1.337828  1.817332  1.890276
2  3.984517  1.535806  2.110963
3  3.508906  2.585394  0.113823
4  3.480049  0.653217  2.760450
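Because pipe() returns whatever the function returns, calls can be chained. The sketch below is illustrative only: `adder` mirrors the function above, while `scale` and its `factor` parameter are hypothetical helpers introduced for the example.

```python
import pandas as pd

def adder(frame, amount):
    # Adds a constant to every element of the frame.
    return frame + amount

def scale(frame, factor=1.0):
    # Hypothetical helper: multiplies every element by `factor`.
    return frame * factor

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})

# Positional and keyword arguments after the function are forwarded to it.
result = df.pipe(adder, 2).pipe(scale, factor=10)
print(result)
```

Chaining with pipe() keeps the steps in reading order, compared with nesting the calls as `scale(adder(df, 2), factor=10)`.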


## 7.1) Row or Column Wise Function Application
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument.

By default, the operation is performed column-wise, taking each column as an array-like.

In [77]:
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('----------')
print(df.apply(np.mean))

       col1      col2      col3
0  0.659650 -0.027919  0.097700
1 -0.538469 -1.605197 -0.376595
2 -1.581943  0.530840 -0.943131
3  0.201015  1.977503  1.212939
4  0.391447  0.235031  0.715727
----------
col1   -0.173660
col2    0.222052
col3    0.141328
dtype: float64


In [76]:
# By passing axis parameter, operations can be performed row wise.
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('----------')
print(df.apply(np.mean,axis=1))

       col1      col2      col3
0 -0.419601 -0.584702  2.213787
1 -0.016674  0.463008  2.181432
2 -0.272618  1.238778  0.905509
3  0.977969  1.151887  1.159171
4 -1.331923  1.387525 -0.183433
----------
0    0.403162
1    0.875922
2    0.623890
3    1.096342
4   -0.042610
dtype: float64


In [75]:
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('----------')
print(df.apply(lambda x: x.max() - x.min()))

       col1      col2      col3
0 -1.234123  0.741644  1.054229
1 -3.086060  1.101233  0.351777
2  0.053133  0.635425 -0.353082
3  0.410869 -1.892197 -0.625813
4  0.294298  0.163654  1.864004
----------
col1    3.496928
col2    2.993430
col3    2.489817
dtype: float64


## 7.2) Element Wise Function Application
Not all functions can be vectorized, that is, written to take a whole NumPy array and return another array or a single value. For such cases, the method applymap() on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value. (Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map().)

In [74]:
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('----------')
# My custom function
print(df['col1'].map(lambda x:x*100))

       col1      col2      col3
0 -1.018312  0.908817  0.055768
1  0.620019 -0.863014 -0.232023
2  0.577737  0.354875  0.216955
3 -0.771666 -0.373549  0.315801
4  0.314559 -0.548544  0.943734
----------
0   -101.831205
1     62.001871
2     57.773666
3    -77.166569
4     31.455869
Name: col1, dtype: float64


In [79]:
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('----------')
print(df.applymap(lambda x:x*100))

       col1      col2      col3
0 -0.312702  0.203102  0.786745
1  0.102080 -1.600120 -0.014638
2  0.781406 -0.509548  0.779168
3 -0.987532  0.944156  0.602379
4  1.556662  1.100223 -0.676889
----------
         col1        col2       col3
0  -31.270225   20.310152  78.674451
1   10.208040 -160.012004  -1.463849
2   78.140562  -50.954839  77.916843
3  -98.753237   94.415650  60.237946
4  155.666230  110.022277 -67.688942
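
Multiplying by 100 could also be done with plain arithmetic on the whole column. A case where element-wise application is more natural is a function with branching or string formatting. The sketch below assumes a hypothetical helper `label` and fixed sample values; it only uses Series.map(), which works the same way across pandas versions.

```python
import pandas as pd

s = pd.Series([0.5, -1.5, 3.0], name='col1')

def label(x):
    # Hypothetical helper: works on one scalar at a time.
    return 'neg' if x < 0 else 'pos'

# map() calls the function once per element and builds a new Series.
labels = s.map(label)
formatted = s.map(lambda x: f"{x:.1f}")

print(labels.tolist())
print(formatted.tolist())
```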


# 8) Python Pandas - Reindexing

Reindexing changes the row labels and column labels of a DataFrame. 

To reindex means to conform the data to match a given set of labels along a particular axis.

Several operations can be accomplished through reindexing:

- Reorder the existing data to match a new set of labels.
- Insert missing value (NA) markers in label locations where no data for the label existed.

In [80]:
import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

# Reindex the DataFrame: keep rows 0, 2, 5 and columns A, C.
# Column 'B' does not exist in df, so it is filled with NaN.
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print(df_reindexed)

           A       C   B
0 2016-01-01    High NaN
2 2016-01-03    High NaN
5 2016-01-06  Medium NaN
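
Both effects of reindexing, reordering and NA insertion, can be seen on a small Series. This is a minimal sketch with made-up labels; the `fill_value` argument of reindex() is shown as one way to substitute a value other than NaN.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# 'c' and 'a' are reordered, 'b' is dropped, and 'z' has no data, so it becomes NaN.
reordered = s.reindex(['c', 'a', 'z'])

# fill_value replaces the NaN marker for labels that had no data.
filled = s.reindex(['c', 'a', 'z'], fill_value=0)

print(reordered)
print(filled)
```

Note that the original Series `s` is not modified; reindex() returns a new object.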


## 8.1) Reindex to Align with Other Objects
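You may wish to take an object and reindex its axes so that they carry the same labels as another object. The reindex_like() method does this.

In the example below, df1 is reindexed like df2, then like df3, then like df4. The result is reassigned to df1 at each step, so rows and columns dropped in an earlier step come back as NaN rather than with their original values.

The row and column labels should match those of the other object; any label that exists only in the other object is filled with NaN.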

In [86]:
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
df3 = pd.DataFrame(np.random.randn(2,2),columns=['col1','col2'])
df4 = pd.DataFrame(np.random.randn(4,4),columns=['col1','col2','col3','col4'])
print(df1)
print('------')
df1 = df1.reindex_like(df2)
print(df1)
print('------')
df1 = df1.reindex_like(df3)
print(df1)
print('------')
df1 = df1.reindex_like(df4)
print(df1)

       col1      col2      col3
0  1.206518 -0.951724  0.115935
1 -0.513212  1.469051  0.163935
2 -0.976731  0.434842 -0.608216
3 -1.079990  1.213452 -0.397083
4  0.496757  2.049719  1.034744
5 -0.638728 -0.307256  1.221415
6 -0.508381  0.053515 -1.646328
7 -0.414141 -0.008552  1.307207
8  0.966929  0.677408 -0.694104
9 -2.015466 -2.130360  0.736448
------
       col1      col2      col3
0  1.206518 -0.951724  0.115935
1 -0.513212  1.469051  0.163935
2 -0.976731  0.434842 -0.608216
3 -1.079990  1.213452 -0.397083
4  0.496757  2.049719  1.034744
5 -0.638728 -0.307256  1.221415
6 -0.508381  0.053515 -1.646328
------
       col1      col2
0  1.206518 -0.951724
1 -0.513212  1.469051
------
       col1      col2  col3  col4
0  1.206518 -0.951724   NaN   NaN
1 -0.513212  1.469051   NaN   NaN
2       NaN       NaN   NaN   NaN
3       NaN       NaN   NaN   NaN
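
The same behavior can be checked on fixed data. The following sketch uses two hypothetical frames, `left` and `template`, to show that reindex_like() keeps only the labels of the other object and fills labels missing from the caller with NaN.

```python
import pandas as pd

left = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Only the labels of the template matter here, not its values.
template = pd.DataFrame(0, index=[0, 1], columns=['col1', 'col3'])

aligned = left.reindex_like(template)
print(aligned)
```

Row 2 and column col2 are dropped because the template does not have them, while col3 appears as an all-NaN column because `left` has no data for it.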


## 8.2) Filling while ReIndexing

reindex() takes an optional parameter method, which selects how missing values are filled:

- pad/ffill: fill values forward
- bfill/backfill: fill values backward
- nearest: fill from the nearest index values

In [87]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Without a fill method, the rows missing from df2 are filled with NaN
print (df2.reindex_like(df1))

# Now fill the NaNs with the preceding values (forward fill)
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))

       col1      col2      col3
0  1.180596 -1.979792  0.979817
1  0.652017  0.427577 -1.281322
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
       col1      col2      col3
0  1.180596 -1.979792  0.979817
1  0.652017  0.427577 -1.281322
2  0.652017  0.427577 -1.281322
3  0.652017  0.427577 -1.281322
4  0.652017  0.427577 -1.281322
5  0.652017  0.427577 -1.281322
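
To compare the three fill methods side by side, here is a minimal sketch that calls reindex() directly on a small Series. The values and the gap in the index are chosen for illustration so that no label is equally close to two neighbours.

```python
import pandas as pd

# Data exists only at labels 0 and 3.
s = pd.Series([1.0, 2.0], index=[0, 3])
new_index = [0, 1, 2, 3, 4]

ffilled = s.reindex(new_index, method='ffill')    # carry the last known value forward
bfilled = s.reindex(new_index, method='bfill')    # pull the next known value backward
nearest = s.reindex(new_index, method='nearest')  # take the closest existing label

print(ffilled.tolist())
print(bfilled.tolist())
print(nearest.tolist())
```

With bfill, label 4 stays NaN because there is no later value to pull back.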


## 8.3) Limits on Filling while Reindexing

The limit argument provides additional control over filling while reindexing. 

limit specifies the maximum number of consecutive missing labels to fill. 

Let us consider the following example to understand the same:

Note: in the output, only row 2 is filled from the preceding row 1. The remaining rows are left as NaN.

In [88]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Without a fill method, the rows missing from df2 are filled with NaN
print (df2.reindex_like(df1))

# Forward fill, but fill at most one consecutive missing row
print ("Data Frame with Forward Fill limiting to 1:")
print (df2.reindex_like(df1,method='ffill',limit=1))

       col1      col2      col3
0 -1.882952 -0.722276 -0.566703
1 -1.319018 -1.163026 -0.694021
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0 -1.882952 -0.722276 -0.566703
1 -1.319018 -1.163026 -0.694021
2 -1.319018 -1.163026 -0.694021
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
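
The effect of limit can also be reproduced with reindex() on a Series with fixed values. This is a sketch under the assumption that forward fill is wanted; the data is made up.

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 1])

# Forward fill, but stop after one consecutive filled label.
filled = s.reindex([0, 1, 2, 3, 4], method='ffill', limit=1)
print(filled)
```

Label 2 receives the value from label 1; labels 3 and 4 are beyond the limit and stay NaN.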


## 8.4) Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

Let us consider the following example to understand this −

In [89]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)

print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
                  index={0 : 'apple', 1 : 'banana', 2 : 'durian'}))

       col1      col2      col3
0 -0.196776 -0.834282 -1.129571
1 -0.511678 -0.782796  0.089167
2 -0.391352 -0.569292  0.718482
3  0.284758 -0.843523 -0.172946
4  0.401410  0.999131 -0.594649
5 -0.069883 -1.766336 -1.764387
After renaming the rows and columns:
              c1        c2      col3
apple  -0.196776 -0.834282 -1.129571
banana -0.511678 -0.782796  0.089167
durian -0.391352 -0.569292  0.718482
3       0.284758 -0.843523 -0.172946
4       0.401410  0.999131 -0.594649
5      -0.069883 -1.766336 -1.764387
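
The example above uses dict mappings. Since rename() also accepts an arbitrary function, here is a short sketch of that form; the frame and the `row` prefix are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# A function is called once per label and returns the new label.
renamed = df.rename(columns=str.upper, index=lambda i: f"row{i}")

print(renamed)
print(list(df.columns))  # the original frame keeps its labels
```

rename() returns a new object by default, so `df` is left unchanged.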


# 9) Python Pandas - Iteration

The behavior of basic iteration over Pandas objects depends on the type. 

When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects. (Panel was removed in pandas 0.25, so it applies only to older versions.)

In short, basic iteration (for i in object) produces:

- Series: values
- DataFrame: column labels
- Panel: item labels
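
The difference between the two conventions can be checked directly. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20], index=['a', 'b'])
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

series_items = [v for v in s]   # array-like: yields the values
frame_items = [k for k in df]   # dict-like: yields the column labels

print(series_items)
print(frame_items)
```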

## Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

In [91]:
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })

for col in df:
   print (col)

A
x
y
C
D


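## items() (formerly iteritems())
Iterates over the columns as (key, value) pairs, with the column label as the key and the column values as a Series object. iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0; items() is its replacement.

In [92]:
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.items():
   print (key,value)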

col1 0    0.516344
1    0.373297
2   -1.359600
3   -0.457608
Name: col1, dtype: float64
col2 0    0.275769
1    2.138695
2   -1.307691
3    1.727049
Name: col2, dtype: float64
col3 0   -0.588594
1    1.619734
2    0.195442
3    0.522783
Name: col3, dtype: float64
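
Column-wise iteration is useful when each column needs separate handling. The sketch below uses DataFrame.items(), the method that replaces iteritems() in current pandas, on fixed data; the dict name `totals` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Each iteration yields (column label, column as a Series).
totals = {label: int(column.sum()) for label, column in df.items()}
print(totals)
```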


## iterrows()
iterrows() returns an iterator that yields each index value along with a Series containing the data in that row.

In [93]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print (row_index,row)

0 col1    1.742466
col2   -0.190905
col3   -0.688590
Name: 0, dtype: float64
1 col1    0.176350
col2    0.361827
col3   -0.204579
Name: 1, dtype: float64
2 col1   -2.227340
col2   -0.158842
col3   -2.192448
Name: 2, dtype: float64
3 col1    0.142847
col2   -0.483820
col3   -0.158134
Name: 3, dtype: float64
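
One consequence of returning each row as a Series is that a row has a single dtype, so column dtypes are not preserved across a row. The following sketch, using a made-up frame with one integer and one float column, illustrates this.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# Take the first (index, row) pair produced by iterrows().
first_index, first_row = next(df.iterrows())

print(df['n'].dtype)    # integer column in the frame
print(first_row.dtype)  # the row Series is upcast to float
print(first_row['n'])
```

If the original dtypes matter, itertuples() is usually the safer choice.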


## itertuples()
itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. 

The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [94]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=-0.46972328579244405, col2=3.143296503210067, col3=-0.039911479524954746)
Pandas(Index=1, col1=1.2272742934665708, col2=1.5879282796288032, col3=-0.24122426168704922)
Pandas(Index=2, col1=0.1916165555742965, col2=1.1366746114348907, col3=0.16655503325078297)
Pandas(Index=3, col1=2.771346027772093, col2=0.8603069676375422, col3=0.9517625234075735)
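
Because each row is a named tuple, its fields can be read by attribute. A minimal sketch with fixed data follows; the tuple name `Row` and the variable names are illustrative, while `index` and `name` are parameters of itertuples().

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]})

# index=False drops the leading Index field; name sets the tuple's class name.
rows = list(df.itertuples(index=False, name='Row'))
print(rows[0].col1, rows[0].col2)

# With the defaults, the index is available as row.Index.
total = sum(row.col1 * row.col2 for row in df.itertuples())
print(total)
```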


# 10) Python Pandas - Sorting

There are two kinds of sorting available in Pandas. They are −

- By label
- By Actual Value

In [96]:
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)

       col2      col1
1  0.761392 -0.019561
4 -1.502611  1.650270
6  1.051382  0.124045
2 -1.995197  0.273147
3  0.432691  1.035238
5  0.045451 -0.033926
9  0.088644 -0.551866
8 -1.107953  0.116830
0 -0.689022 -0.898038
7  1.112481 -1.721484


## By Label
Using the sort_index() method, a DataFrame can be sorted by passing the axis argument and the order of sorting. By default, sorting is done on row labels in ascending order.

In [98]:
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index()
print (sorted_df)

       col2      col1
0 -0.629999  0.015565
1  0.962501  0.008565
2 -0.858136 -0.587373
3  0.680760  0.197507
4  0.255097  1.449897
5 -0.867368 -0.544947
6  1.654593  1.891764
7 -0.076118  0.072947
8 -2.581441 -2.016118
9 -0.158631 -0.204769


## Order of Sorting
The order of sorting can be controlled by passing a Boolean value to the ascending parameter. Let us consider the following example to understand the same.

In [102]:
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

       col2      col1
9  0.320175  0.717297
8  1.313146 -1.854825
7  0.741328 -0.707559
6 -0.873185  0.609666
5 -1.409327 -0.939388
4  0.493388 -1.195138
3  0.806810  0.681326
2 -0.259036  0.435611
1  0.352179  0.887263
0  0.095931 -0.800361


## Sort the Columns
By passing the axis argument with a value of 1, the sorting is done on the column labels. By default, axis=0, which sorts by row labels. Let us consider the following example to understand the same.

In [103]:
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
 
sorted_df=unsorted_df.sort_index(axis=1)

print (sorted_df)

       col1      col2
1  0.041546 -0.517166
4 -0.806516  0.640516
6 -1.273199  2.095434
2  1.251439  0.137462
3  0.694747  0.576557
5  0.756727  0.087464
9  1.181048 -0.900617
8  1.474624  0.392552
0  0.668391  0.318459
7 -0.598738 -0.694271
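
The three variants of label sorting shown above can be compared on one small frame with fixed values. This is a sketch; the data and variable names are made up.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=[2, 0, 1], columns=['col2', 'col1'])

by_rows = df.sort_index()                      # row labels, ascending
by_rows_desc = df.sort_index(ascending=False)  # row labels, descending
by_cols = df.sort_index(axis=1)                # column labels

print(by_rows)
print(by_rows_desc)
print(by_cols)
```

Sorting by labels only rearranges rows or columns; each value stays attached to its original row and column label.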


## By Value
Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument naming the column of the DataFrame whose values are used for sorting.

In [106]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

print (sorted_df)

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1


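Observe that the col1 values are sorted, and each row's col2 value and row index move along with its col1 value. That is why col2 and the index look unsorted.

The 'by' argument can also take a list of column names. Rows are then sorted by the first column, with ties broken by the next column.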

In [107]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])

print (sorted_df)

   col1  col2
2     1     2
1     1     3
3     1     4
0     2     1
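
When sorting by several columns, the ascending parameter can also be a list with one Boolean per column. The sketch below reuses the same data and sorts col1 ascending while breaking ties on col2 in descending order.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# One ascending flag per column named in 'by'.
mixed = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(mixed)
```

The row index travels with each row, so it shows the order in which the original rows were rearranged.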
