# Pandas (Part 3): Essential Functionality

In this notebook, you will learn about the following topics:
 - Arithmetic and Data Alignment
 - Function Application and Mapping
 - Sorting
 - Descriptive statistics
 - Other useful methods
 
Read more:
 - Python Data Analysis textbook (chapter 5)
 - [Pandas website](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

In [53]:
import pandas as pd
import numpy as np

### 1. Arithmetic and Data Alignment

**Between two Series objects**

In [54]:
ser_1 = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
ser_2 = pd.Series([10,20,30,40], index=['a', 'b', 'c', 'd'])

ser_1+ser_2

a    11
b    22
c    33
d    44
dtype: int64

In [55]:
ser_1 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
ser_2 = pd.Series([10,20,30,40], index=['a', 'b', 'c', 'd'])

ser_1+ser_2

a    11.0
b    22.0
c    33.0
d    44.0
e     NaN
dtype: float64
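
When the two indexes differ, any label present in only one operand yields `NaN`, and the result is upcast to `float64`. If a missing label should count as a default value instead, the method form of the operator accepts a `fill_value` argument. A minimal sketch, reusing the values above (the names `left`, `right`, and `filled` are illustrative only):

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
right = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# 'e' is missing from `right`; fill_value=0 substitutes 0 for it before adding
filled = left.add(right, fill_value=0)
print(filled)
```

With `fill_value=0`, the entry for `'e'` keeps the value from `left` rather than becoming `NaN`.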

In [56]:
df = pd.DataFrame({'col1': ser_1, 'col2': ser_2})
print(df,'\n')

# elementwise operations
df['addition'] = df['col1']+df['col2']
df['subtraction'] = df['col1']-df['col2']
df['multiplication'] = df['col1']*df['col2']
df['division'] = df['col1']/df['col2']
df

   col1  col2
a     1  10.0
b     2  20.0
c     3  30.0
d     4  40.0
e     5   NaN 



   col1  col2  addition  subtraction  multiplication  division
a     1  10.0      11.0         -9.0            10.0       0.1
b     2  20.0      22.0        -18.0            40.0       0.1
c     3  30.0      33.0        -27.0            90.0       0.1
d     4  40.0      44.0        -36.0           160.0       0.1
e     5   NaN       NaN          NaN             NaN       NaN


**Between a DataFrame and a Series object**

By default, the index of the Series is aligned with the column labels of the DataFrame, and
the operation is then applied to each row in turn.

In [57]:
df = pd.DataFrame({'col1': ser_1, 'col2': ser_2})

# subtract the first row from every row of the dataframe
print(df, '\n')
print(df.iloc[0],'\n')
print(df-df.iloc[0],'\n')

   col1  col2
a     1  10.0
b     2  20.0
c     3  30.0
d     4  40.0
e     5   NaN 

col1     1.0
col2    10.0
Name: a, dtype: float64 

   col1  col2
a   0.0   0.0
b   1.0  10.0
c   2.0  20.0
d   3.0  30.0
e   4.0   NaN 



To match the Series against the row index instead and apply the operation to each column, use the arithmetic methods (`add`, `subtract`, `multiply`, `divide`) and pass the `axis` keyword:

In [58]:
print(df, '\n')
print(df['col1'],'\n')
df.subtract(df['col1'],axis=0) # {0 or ‘index’, 1 or ‘columns’}

   col1  col2
a     1  10.0
b     2  20.0
c     3  30.0
d     4  40.0
e     5   NaN 

a    1
b    2
c    3
d    4
e    5
Name: col1, dtype: int64 



   col1  col2
a     0   9.0
b     0  18.0
c     0  27.0
d     0  36.0
e     0   NaN


In [59]:
df=df.apply(lambda x: x-df['col1'], axis=0)
df

   col1  col2
a     0   9.0
b     0  18.0
c     0  27.0
d     0  36.0
e     0   NaN
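
The `axis=0` argument matters because the plain operator always matches the Series labels against the column labels. The sketch below, on a small hypothetical frame, contrasts the two behaviors; it is meant as an illustration, not as the only way to do column-wise arithmetic.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10.0, 20.0, 30.0]},
                  index=['a', 'b', 'c'])

# Default alignment: the Series labels (a, b, c) are matched against the
# column labels (col1, col2). Nothing matches, so every value is NaN.
misaligned = df - df['col1']

# axis=0 matches the Series labels against the row index instead.
aligned = df.sub(df['col1'], axis=0)

print(misaligned)
print(aligned)
```

The first result has one column per label in the union of both label sets and contains only `NaN`; the second is the intended column-wise subtraction.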


**Between two DataFrame objects**

In [60]:
df_1 = pd.DataFrame(np.arange(1,17).reshape(4,4),
                    index= ['Fi', 'Se', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'd'])

df_2 = pd.DataFrame(np.arange(1,17).reshape(4,4) * 10,
                    index= ['Fi', 'Se', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'd'])
print(df_1,'\n')
print(df_2, '\n')

     a   b   c   d
Fi   1   2   3   4
Se   5   6   7   8
Th   9  10  11  12
Fo  13  14  15  16 

      a    b    c    d
Fi   10   20   30   40
Se   50   60   70   80
Th   90  100  110  120
Fo  130  140  150  160 



In [61]:
df_1+df_2

      a    b    c    d
Fi   11   22   33   44
Se   55   66   77   88
Th   99  110  121  132
Fo  143  154  165  176


In [62]:
df_1-df_2

       a     b     c     d
Fi    -9   -18   -27   -36
Se   -45   -54   -63   -72
Th   -81   -90   -99  -108
Fo  -117  -126  -135  -144
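
The two frames above share the same row and column labels, so every cell has a partner. When the labels only partly overlap, alignment happens on both axes at once. The following sketch uses two small hypothetical frames (`left` and `right` are illustrative names) to show the effect, with and without `fill_value`:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(1, 5).reshape(2, 2),
                    index=['Fi', 'Se'], columns=['a', 'b'])
right = pd.DataFrame(np.arange(1, 5).reshape(2, 2) * 10,
                     index=['Se', 'Th'], columns=['b', 'c'])

# Only the cell ('Se', 'b') exists in both frames; every other cell is NaN.
plain = left + right

# fill_value=0 fills a cell that is missing from one frame only;
# cells missing from both frames stay NaN.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```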


### 2. Function Application and Mapping


**DataFrame.apply(func, axis=0)**

 - Apply a function along an axis of the DataFrame.
 - Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1).

In [63]:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

print(df, '\n')
print(df.apply(np.sqrt),'\n')
print(df.apply(np.sqrt, axis=0),'\n')
print(df.apply(np.sqrt, axis=1))

   A  B
0  4  9
1  4  9
2  4  9 

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0 

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0 

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0


In [64]:
print(df, '\n')
print(df.apply(np.sum, axis=0),'\n')
print(df.apply(np.sum, axis=1))

   A  B
0  4  9
1  4  9
2  4  9 

A    12
B    27
dtype: int64 

0    13
1    13
2    13
dtype: int64


In [65]:
print(df, '\n')
print(df.apply(lambda x: [1, 2], axis=0),'\n')
print(df.apply(lambda x: [1, 2], axis=1))

   A  B
0  4  9
1  4  9
2  4  9 

A    [1, 2]
B    [1, 2]
dtype: object 

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
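
As the output above shows, a function that returns a list produces a Series of list objects. If each list should instead be spread across columns, `apply` accepts `result_type='expand'` when used with `axis=1`. A small sketch on the same frame (the lambda is an arbitrary example):

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# Without result_type, each returned list is stored as one object per row.
as_objects = df.apply(lambda row: [row['A'], row['B'] * 2], axis=1)

# result_type='expand' spreads each list across columns instead.
expanded = df.apply(lambda row: [row['A'], row['B'] * 2], axis=1,
                    result_type='expand')

print(as_objects)
print(expanded)
```

The expanded result is a DataFrame whose columns are numbered 0 and 1.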


In [95]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame = np.abs(frame)

In [98]:
print(frame,'\n')

# range (max - min) of each column (default, axis=0) or of each row (axis=1)
print(frame.apply(lambda x: x.max() - x.min(), axis=0),'\n') # along row axis (each column)
print(frame.apply(lambda x: x.max() - x.min(), axis=1),'\n') # along column axis (each row)

               b         d         e
Utah    0.242654  0.618637  1.092771
Ohio    0.100004  0.750184  1.985751
Texas   0.719892  0.210842  0.376385
Oregon  1.760535  0.448075  0.786428 

b    1.660531
d    0.539343
e    1.609366
dtype: float64 

Utah      0.850116
Ohio      1.885747
Texas     0.509051
Oregon    1.312460
dtype: float64 



In [103]:
print(frame,'\n')

# given an input row or column,
# find (min, max)
# and return the pair as a Series object
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

print(frame.apply(f, axis=0), '\n')         # along row axis (each column)
print(frame.apply(f, axis=1),'\n')          # along column axis (each row)

               b         d         e
Utah    0.242654  0.618637  1.092771
Ohio    0.100004  0.750184  1.985751
Texas   0.719892  0.210842  0.376385
Oregon  1.760535  0.448075  0.786428 

            b         d         e
min  0.100004  0.210842  0.376385
max  1.760535  0.750184  1.985751 

             min       max
Utah    0.242654  1.092771
Ohio    0.100004  1.985751
Texas   0.210842  0.719892
Oregon  0.448075  1.760535 



**DataFrame.applymap()**:
Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

In [114]:
print(frame,'\n')
print(frame.applymap(lambda x: '%.2f' % x),'\n')

               b         d         e
Utah    0.242654  0.618637  1.092771
Ohio    0.100004  0.750184  1.985751
Texas   0.719892  0.210842  0.376385
Oregon  1.760535  0.448075  0.786428 

           b     d     e
Utah    0.24  0.62  1.09
Ohio    0.10  0.75  1.99
Texas   0.72  0.21  0.38
Oregon  1.76  0.45  0.79 
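
Note that pandas 2.1 deprecated `DataFrame.applymap` in favor of `DataFrame.map`, which does the same elementwise job. The sketch below picks whichever method the installed version provides; the `hasattr` check is just one possible way to stay compatible with both.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.abs(np.random.randn(4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Use DataFrame.map when this pandas version provides it (2.1+),
# otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(lambda x: '%.2f' % x)
print(formatted)
```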



**Series.map()**: Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

In [110]:
frame['e'].map(lambda x: '%.2f' % x)

Utah      1.09
Ohio      1.99
Texas     0.38
Oregon    0.79
Name: e, dtype: object
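
Besides a function, `map` also accepts a dict (or a Series) as the lookup table. Values without an entry in the table become `NaN`. A minimal sketch with made-up data:

```python
import pandas as pd

grades = pd.Series(['a', 'b', 'c', 'a'])   # hypothetical data
points = {'a': 4.0, 'b': 3.0}              # note: no entry for 'c'

# Each value is looked up in the dict; 'c' has no match and becomes NaN
mapped = grades.map(points)
print(mapped)
```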

### 3. Sorting

In [71]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [72]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

print(frame, "\n") 
print(frame.sort_index(), "\n")                    # default sort on rows (row index based)
print(frame.sort_index(axis='columns'))            # specify axis =1 or 'columns'   

       d  a  b  c
three  0  1  2  3
one    4  5  6  7 

       d  a  b  c
one    4  5  6  7
three  0  1  2  3 

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [73]:
frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


In [74]:
#pd.DataFrame.sort_values?

In [75]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [76]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [77]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
frame.sort_values(by='b')             # sort by column name   

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1


In [78]:
#multiple level sort
frame.sort_values(by=['a', 'b'])

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
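
Two further options of `sort_values` are often useful: `ascending` accepts one flag per sort key, and `na_position` controls where missing values end up. The sketch below reuses the data from this section; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

# By default NaN values are placed last; na_position='first' moves them up
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
nan_first = obj.sort_values(na_position='first')

print(mixed)
print(nan_first)
```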


### 4. Descriptive Statistics

In [79]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [80]:
print (df,'\n') 
print (df.sum(),'\n')                          # column sums (axis=0); unlike numpy.sum on an array, which adds up every element by default
print (df.sum(axis='columns'),'\n')            # row sums (axis=1); NaN is skipped, so an all-NaN row sums to 0

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    9.25
two   -5.80
dtype: float64 

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64 
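
Row `c` contains only `NaN`, yet its sum is reported as 0.00 because missing values are skipped. If an all-missing row should instead yield `NaN`, `sum` accepts a `min_count` argument giving the minimum number of valid values required. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Require at least one non-NaN value per row; otherwise the sum is NaN
row_sums = df.sum(axis='columns', min_count=1)
print(row_sums)
```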



In [81]:
print (df,'\n') 
print(df.mean(),'\n')                           # along the rows
print(df.mean(axis='columns'),'\n')             # along the columns 
print(df.mean(axis='columns', skipna=False))    # along the columns and not skipping NaN

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    3.083333
two   -2.900000
dtype: float64 

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64 

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64


In [82]:
print (df,'\n') 
print(df.max(),'\n')                           # along the rows
print(df.max(axis='columns'),'\n')             # along the columns 
print(df.max(axis='columns', skipna=False))    # along the columns and not skipping NaN

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    7.1
two   -1.3
dtype: float64 

a    1.40
b    7.10
c     NaN
d    0.75
dtype: float64 

a     NaN
b    7.10
c     NaN
d    0.75
dtype: float64


In [83]:
# Compute the index label at which the maximum value occurs (idxmin does the same for the minimum)
print (df,'\n') 
print(df.idxmax(),'\n') 
print(df.idxmax(axis='columns'),'\n') 

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    b
two    d
dtype: object 

a    one
b    one
c    NaN
d    one
dtype: object 



In [84]:
#Cumulative sum of values
print (df,'\n') 
print(df.cumsum(),'\n') 
print(df.cumsum(axis='columns'),'\n') 

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8 

    one   two
a  1.40   NaN
b  7.10  2.60
c   NaN   NaN
d  0.75 -0.55 



In [85]:
# Compute set of summary statistics for Series or each DataFrame column
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


In [86]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
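
For numeric data, the percentiles reported by `describe` can be changed with the `percentiles` argument. The sketch below asks for the 10th and 90th percentiles on the DataFrame used earlier in this section; the chosen values are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Request the 10th and 90th percentiles in place of the default 25% and 75%
summary = df.describe(percentiles=[0.1, 0.9])
print(summary)
```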


### 5. Other useful methods

**DataFrame.head(n=5)**: Return the first n rows.

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

In [87]:
print (df,'\n') 
print (df.head(2),'\n')

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

   one  two
a  1.4  NaN
b  7.1 -4.5 



**DataFrame.tail(n=5)**: Return the last n rows.

This function returns the last n rows of the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

In [88]:
print (df,'\n') 
print (df.tail(2),'\n')

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

    one  two
c   NaN  NaN
d  0.75 -1.3 



**Series.unique()**: Return unique values of Series object.

Unique values are returned in order of appearance. The method is hash-table based, so it does not sort the result.

In [89]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)
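
Because the result follows order of appearance, sort it explicitly when a sorted list is needed. The sketch below also uses `nunique`, which returns only the number of distinct values; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

in_order = obj.unique()            # order of first appearance
sorted_uniques = sorted(in_order)  # explicit sort into a Python list
n_distinct = obj.nunique()         # count of distinct values

print(list(in_order), sorted_uniques, n_distinct)
```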

**Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)**: Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

In [90]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64
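
The `normalize` and `dropna` parameters in the signature above change what is counted. The sketch below adds one missing value to the example data (an assumption made here for illustration) to show relative frequencies and the counting of missing values:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c', None])

counts = obj.value_counts()                # the missing value is excluded
shares = obj.value_counts(normalize=True)  # fractions of the non-missing values
with_na = obj.value_counts(dropna=False)   # the missing value gets its own row

print(counts)
print(shares)
print(with_na)
```

With `normalize=True` the values sum to 1, which is convenient for comparing distributions of different sizes.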