# <span style="color:blue">Programming for Data Science - DS-GA 1007</span>
## <span style="color:blue">Lecture 12: Pandas - Part III</span>
---

### Contents
- Grouping, Aggregation and Transformation
- Pivot and Pivot Tables

__References__<br>
- [Pandas: powerful Python data analysis toolkit: Wes McKinney & PyData Devel. Team](https://pandas.pydata.org/pandas-docs/stable/pandas.pdf)
- [http://pandas.pydata.org/pandas-docs/stable/index.html](http://pandas.pydata.org/pandas-docs/stable/index.html)

In [1]:
import pandas as pd
import numpy as np

## GroupBy
Group operations follow a split-apply-combine pipeline:
- Split a pandas object into groups based on one or more keys from rows (axis=0) or columns (axis=1)
- A function is applied to each group
- The result of the function is combined into a new object

__PS__: Groups are represented by a GroupBy object

$$
\begin{array}{ccccc}
DataFrame & & Split & Apply & Combine\\
\begin{array}{c|c}
C1 & C2 \\ \hline
A & 0 \\ \hline
B & 5 \\ \hline
C & 10 \\ \hline
A & 5 \\ \hline
B & 5 \\ \hline
C & 10 \\ \hline
A & 10 \\ \hline
B & 5 \\ \hline
C & 10 
\end{array} &
\begin{array}{c}
\nearrow \\ \\
\rightarrow \\ \\
\searrow
\end{array} &
\begin{array}{c|c}
A & 0 \\ \hline
A & 5 \\ \hline
A & 10 \\ \\
B & 5 \\ \hline
B & 5 \\ \hline
B & 5 \\ \\
C & 10 \\ \hline
C & 10 \\ \hline
C & 10 
\end{array} &
\begin{array}{c}
\searrow\\ \\
\rightarrow \\ \\
\nearrow
\end{array} & 
\begin{array}{c|c}
A & 15 \\ \hline
B & 15 \\ \hline
C & 30 
\end{array}
\end{array}
$$
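
The diagram above can be reproduced in a few lines. This is a minimal sketch: the column names `C1` and `C2` come from the diagram, while the frame name `toy` is made up for illustration.

```python
import pandas as pd

# Same data as the diagram: groups A, B and C with three rows each
toy = pd.DataFrame({'C1': ['A', 'B', 'C'] * 3,
                    'C2': [0, 5, 10, 5, 5, 10, 10, 5, 10]})

# split on C1, apply sum to each group's C2 values, combine into a Series
totals = toy.groupby('C1')['C2'].sum()
print(totals)
```

The resulting Series is indexed by the group names, which matches the right-hand table of the diagram.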

The splitting ''keys'' can take several forms, and they do not all have to be of the same type:
  - A list or array of values that is the same length as the axis being grouped
  - A value indicating a column name in a DataFrame
  - A dict or Series providing a mapping between values on the axis being grouped and the group names
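
As a small illustration of two of these key forms, the sketch below groups the same data once by column name and once by a Series. The frame `demo` and its values are hypothetical and chosen only to make the result easy to check by hand.

```python
import pandas as pd

# Hypothetical frame with fixed values so the result is reproducible
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Key given as a column name of the DataFrame
by_name = demo.groupby('key1')['data1'].mean()

# Key given as a Series with the same length as the axis being grouped
by_series = demo['data1'].groupby(demo['key1']).mean()

print(by_name)
print(by_name.equals(by_series))  # both forms give the same group means
```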

In [2]:

df = pd.DataFrame({'key1': ['a','a','b','b','a'],
                  'key2': ['one','two','one','two','one'], 
                  'data1': np.random.uniform(low=0,high=1,size=5),
                  'data2': np.random.uniform(low=0,high=1,size=5)})
print(df)

  key1 key2     data1     data2
0    a  one  0.579327  0.840200
1    a  two  0.798034  0.291710
2    b  one  0.931207  0.125801
3    b  two  0.264197  0.175929
4    a  one  0.234048  0.965384


In [3]:
gb1 = df['data1'].groupby(df['key1'])  
print(gb1.mean())

key1
a    0.537136
b    0.597702
Name: data1, dtype: float64


In [4]:
# groupby results in a groupby object
# but the final combination/aggregation is a dataframe or series

print(type(gb1))
print(type(gb1.mean()))

<class 'pandas.core.groupby.generic.SeriesGroupBy'>
<class 'pandas.core.series.Series'>


In [5]:
# using a list as keys (same length as the dataframe)
external_list = [0,1,0,1,1]
bg2 = df[['data1','data2']].groupby(external_list)
print(bg2.min())

      data1     data2
0  0.579327  0.125801
1  0.234048  0.175929


__PS__: Multiple ''keys'' result in a hierarchical index

In [7]:
bg3 = df[['data1','data2']].groupby([df['key1'],df['key2']])
print(bg3.max())

              data1     data2
key1 key2                    
a    one   0.579327  0.965384
     two   0.798034  0.291710
b    one   0.931207  0.125801
     two   0.264197  0.175929


### Iterating Over Groups
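Iterating over the GroupBy object returned by `groupby` yields a sequence of 2-tuples containing the group _name_ and the group _data_. With multiple keys, the name is itself a tuple of key values.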

In [15]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a','a','b','b','a'],
                  'key2': ['one','two','one','two','one'], 
                  'data1': np.random.uniform(low=0,high=1,size=5),
                  'data2': np.random.uniform(low=0,high=1,size=5)})
print(df)

      data1     data2 key1 key2
0  0.050465  0.472563    a  one
1  0.835403  0.408942    a  two
2  0.712161  0.255194    b  one
3  0.834593  0.361111    b  two
4  0.370471  0.483622    a  one


In [17]:
bg2 = df[['data1','data2']].groupby(df['key1'])

for name, group in bg2:
    print(name)
    print(group)
    
print(5*'-')
bg3 = df[['data1','data2']].groupby([df['key1'],df['key2']])
for name, group in bg3:
    print(name)
    print(group,type(group))

a
      data1     data2
0  0.050465  0.472563
1  0.835403  0.408942
4  0.370471  0.483622
b
      data1     data2
2  0.712161  0.255194
3  0.834593  0.361111
-----
('a', 'one')
      data1     data2
0  0.050465  0.472563
4  0.370471  0.483622 <class 'pandas.core.frame.DataFrame'>
('a', 'two')
      data1     data2
1  0.835403  0.408942 <class 'pandas.core.frame.DataFrame'>
('b', 'one')
      data1     data2
2  0.712161  0.255194 <class 'pandas.core.frame.DataFrame'>
('b', 'two')
      data1     data2
3  0.834593  0.361111 <class 'pandas.core.frame.DataFrame'>
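
Because iteration yields (name, data) pairs, the pieces can be collected into a dictionary, or a single piece can be fetched with `get_group`. The sketch below uses a hypothetical frame `demo` with fixed values.

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1, 2, 3, 4, 5]})

# Collect the pieces in a dict keyed by group name
pieces = {name: group for name, group in demo.groupby('key1')}
print(pieces['b'])

# get_group returns one piece without looping over all groups
piece_b = demo.groupby('key1').get_group('b')
print(piece_b)
```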


### Grouping with dictionaries
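A dictionary can map labels on the grouped axis (here the column names) to group names, so that columns with the same group name are combined. Dictionary keys that match no label, such as `'f'` below, are ignored.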

In [8]:
dfp = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5,5)),
               columns=['a','b','c','d','e'], 
               index=['Joe','Steve','Wes','Jim','Travis'])
print(dfp)

        a  b  c  d  e
Joe     1  6  8  4  0
Steve   4  4  3  6  2
Wes     1  8  9  9  5
Jim     0  0  7  0  2
Travis  0  7  7  5  1


In [9]:
mapping = {'a':'red', 'b':'red', 'c':'blue',
           'd':'blue', 'e':'red', 'f':'orange'}

gbd = dfp.groupby(mapping, axis=1)

for name, group in gbd:
    print(name)
    print(group)

print(gbd.sum())

blue
        c  d
Joe     8  4
Steve   3  6
Wes     9  9
Jim     7  0
Travis  7  5
red
        a  b  e
Joe     1  6  0
Steve   4  4  2
Wes     1  8  5
Jim     0  0  2
Travis  0  7  1
        blue  red
Joe       12    7
Steve      9   10
Wes       18   14
Jim        7    2
Travis    12    8
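
Recent pandas releases deprecate the `axis=1` argument of `groupby`, as far as I am aware. One possible alternative, sketched here under that assumption, is to transpose the frame, group the rows, and transpose back. The frame `small` reuses the first two rows of the output above.

```python
import pandas as pd

# First two rows of the printed frame above, with fixed values
small = pd.DataFrame([[1, 6, 8, 4, 0],
                      [4, 4, 3, 6, 2]],
                     columns=['a', 'b', 'c', 'd', 'e'],
                     index=['Joe', 'Steve'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the row index, group the rows
# with the mapping, aggregate, then transpose back
color_sums = small.T.groupby(mapping).sum().T
print(color_sums)
```

The sums agree with the `blue` and `red` columns printed above for Joe and Steve.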


### Grouping with functions
- A function passed as a group key will be called once per __index__ value
- The return value is used as the group name

In [10]:
dfp = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5,5)),
               columns=['a','b','c','d','e'], 
               index=['Joe','Steve','Wes','Jim','Travis'])
print(dfp)

gbf = dfp.groupby(lambda x: len(x))

for name,group in gbf:
    print(name)
    print(group)

        a  b  c  d  e
Joe     4  3  6  7  9
Steve   9  4  6  4  0
Wes     4  8  2  0  5
Jim     7  8  2  9  4
Travis  9  0  4  7  1
3
     a  b  c  d  e
Joe  4  3  6  7  9
Wes  4  8  2  0  5
Jim  7  8  2  9  4
5
       a  b  c  d  e
Steve  9  4  6  4  0
6
        a  b  c  d  e
Travis  9  0  4  7  1


### Transform
- The `transform` method applies a function to each group
- The result is placed in the locations of the original rows, so the output has the same index as the input
- The function must return either a scalar or an array of the same size as the group
- If the function returns a scalar for a group, that value is broadcast to every row of the group

In [11]:
import pandas as pd
import numpy as np

dfp = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5,5)),
               columns=['a','b','c','d','e'], 
               index=['Joe','Steve','Wes','Jim','Travis'])
print(dfp)

        a  b  c  d  e
Joe     2  4  2  3  5
Steve   9  2  1  2  5
Wes     9  0  8  2  1
Jim     5  7  1  0  7
Travis  5  4  5  2  4


In [12]:
key=['one','two','one','two','one']

gbm = dfp.groupby(key)

for n,g in gbm:
    print(n)
    print(g)
    print(5*'**')

print(5*'-')
# max aggregation 
print(gbm.max())

print(5*'-')
# max via transform 
print(gbm.transform('max'))  # the string name avoids warnings that newer pandas raises for np.max

one
        a  b  c  d  e
Joe     2  4  2  3  5
Wes     9  0  8  2  1
Travis  5  4  5  2  4
**********
two
       a  b  c  d  e
Steve  9  2  1  2  5
Jim    5  7  1  0  7
**********
-----
     a  b  c  d  e
one  9  4  8  3  5
two  9  7  1  2  7
-----
        a  b  c  d  e
Joe     9  4  8  3  5
Steve   9  7  1  2  7
Wes     9  4  8  3  5
Jim     9  7  1  2  7
Travis  9  4  8  3  5
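
The two kinds of return value can be seen side by side in the sketch below. The frame `scores` and its values are hypothetical; the key list is the one used above.

```python
import pandas as pd

scores = pd.DataFrame({'x': [1.0, 2.0, 3.0, 10.0, 20.0]},
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

grouped = scores.groupby(key)

# Scalar per group: the group mean is broadcast to every row of the group
group_mean = grouped.transform('mean')

# Same-size result per group: subtract the group mean from each row
demeaned = grouped.transform(lambda s: s - s.mean())

print(group_mean)
print(demeaned)
```

Both results keep the original row order and index, which is what distinguishes `transform` from an aggregation.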


### Apply
Most general GroupBy method
- Splits the object into pieces
- Invokes the supplied function on each piece
- Concatenates the pieces together again

In [13]:
import pandas as pd
import numpy as np

dfp = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(5,5)),
               columns=['a','b','c','d','e'], 
               index=['Joe','Steve','Wes','Jim','Travis'])
print(dfp)

        a  b  c  d  e
Joe     9  6  7  1  4
Steve   7  3  8  9  1
Wes     9  2  9  9  1
Jim     4  8  5  0  6
Travis  9  6  3  4  8


In [14]:
def top2(df):
    return(df.sort_values(by='a')[-2:])

#print(dfp.sort_values(by='a')[-2:])

print(5*'-')
key=['one','two','one','two','one']
dfpa = dfp.groupby(key)

for n,g in dfpa:
    print(n)
    print(g)
    print(5*'-')
    
print(dfpa.apply(top2))

print(5*'-')
print(dfp.groupby(key,group_keys=False).apply(top2)) # avoid hierarchical indexing

-----
one
        a  b  c  d  e
Joe     9  6  7  1  4
Wes     9  2  9  9  1
Travis  9  6  3  4  8
-----
two
       a  b  c  d  e
Steve  7  3  8  9  1
Jim    4  8  5  0  6
-----
            a  b  c  d  e
one Wes     9  2  9  9  1
    Travis  9  6  3  4  8
two Jim     4  8  5  0  6
    Steve   7  3  8  9  1
-----
        a  b  c  d  e
Wes     9  2  9  9  1
Travis  9  6  3  4  8
Jim     4  8  5  0  6
Steve   7  3  8  9  1
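
How the pieces are concatenated depends on what the function returns. The following sketch, using a hypothetical Series `vals`, contrasts a function that returns a scalar with one that returns a Series.

```python
import pandas as pd

vals = pd.Series([9, 7, 9, 4, 3],
                 index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

# Function returning a scalar: one value per group
spread = vals.groupby(key).apply(lambda s: s.max() - s.min())

# Function returning a Series: the pieces are concatenated and the
# group name is added as an outer index level
top2 = vals.groupby(key).apply(lambda s: s.nlargest(2))

print(spread)
print(top2)
```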


### Missing Values
`groupby` can be used to fill missing values with group-specific values, such as the mean of each group.

In [15]:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(20,5)),
               columns=['c1','c2','c3','c4','c5'])

r = np.random.randint(low=0, high=20, size=5)
c = np.random.randint(low=0, high=5, size=5)
df.iloc[r,c] = np.nan
print(df)

     c1   c2  c3  c4   c5
0   5.0  9.0   2   9  9.0
1   3.0  9.0   8   6  1.0
2   5.0  1.0   8   9  9.0
3   0.0  2.0   0   3  1.0
4   4.0  2.0   9   3  9.0
5   8.0  4.0   8   3  1.0
6   NaN  NaN   2   0  NaN
7   NaN  NaN   8   5  NaN
8   NaN  NaN   3   1  NaN
9   NaN  NaN   9   8  NaN
10  0.0  8.0   1   3  1.0
11  1.0  2.0   4   7  6.0
12  9.0  3.0   4   2  1.0
13  4.0  4.0   9   5  9.0
14  8.0  0.0   9   5  4.0
15  8.0  9.0   9   1  9.0
16  1.0  7.0   6   8  0.0
17  NaN  NaN   8   8  NaN
18  5.0  9.0   6   3  5.0
19  2.0  0.0   7   4  2.0


In [17]:
# generating keys to group by
keys = np.random.randint(low=0,high=3,size=20).tolist()
dkeys = {0:'g1',1:'g2',2:'g3'}
keys = [dkeys[i] for i in keys]
print(keys)

['g1', 'g2', 'g1', 'g2', 'g3', 'g2', 'g3', 'g2', 'g2', 'g3', 'g1', 'g1', 'g1', 'g3', 'g2', 'g2', 'g2', 'g3', 'g1', 'g1']


In [18]:
# filling nans in each group with the mean of the group
dfg = df.groupby(keys,group_keys=False)
print(dfg.apply(lambda x: x.fillna(x.mean())).sort_index())

          c1        c2  c3  c4        c5
0   5.000000  9.000000   2   9  9.000000
1   3.000000  9.000000   8   6  1.000000
2   5.000000  1.000000   8   9  9.000000
3   0.000000  2.000000   0   3  1.000000
4   4.000000  2.000000   9   3  9.000000
5   8.000000  4.000000   8   3  1.000000
6   4.000000  3.000000   2   0  9.000000
7   4.666667  5.166667   8   5  2.666667
8   4.666667  5.166667   3   1  2.666667
9   4.000000  3.000000   9   8  9.000000
10  0.000000  8.000000   1   3  1.000000
11  1.000000  2.000000   4   7  6.000000
12  9.000000  3.000000   4   2  1.000000
13  4.000000  4.000000   9   5  9.000000
14  8.000000  0.000000   9   5  4.000000
15  8.000000  9.000000   9   1  9.000000
16  1.000000  7.000000   6   8  0.000000
17  4.000000  3.000000   8   8  9.000000
18  5.000000  9.000000   6   3  5.000000
19  2.000000  0.000000   7   4  2.000000
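
The same fill can also be written with `transform`, which keeps the original row order so no `sort_index` is needed. This is a sketch on a small hypothetical frame `gaps`, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'c1': [1.0, np.nan, 3.0, 10.0, np.nan, 20.0]})
keys = ['g1', 'g1', 'g1', 'g2', 'g2', 'g2']

# transform('mean') returns a frame aligned with the original rows that
# holds each row's group mean (NaNs are skipped when averaging)
group_means = gaps.groupby(keys).transform('mean')

# fillna with an aligned frame replaces only the missing cells
filled = gaps.fillna(group_means)
print(filled)
```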


## Pivot
- Used to create a new derived table
- Three arguments: `index`, `columns`, and `values`
- Each argument names a column of the original table

`pivot` creates a new table whose row and column indices are the unique values of the columns given as the `index` and `columns` parameters.
- Cell values are taken from the column given as the `values` parameter
- If two or more rows share the same (`index`, `columns`) pair, `pivot` raises a ValueError

In [26]:
df = pd.DataFrame({'Item': ['Item0','Item0','Item1','Item1', 'Item0'],
                      'CType': ['gold','bronze','gold','silver','silver'], 
                      'USD': [i for i in range(2,11,2)],
                      'EU': [i for i in range(1,11,2)]})
print(df)

    Item   CType  USD  EU
0  Item0    gold    2   1
1  Item0  bronze    4   3
2  Item1    gold    6   5
3  Item1  silver    8   7
4  Item0  silver   10   9


In [27]:
df_pivot = df.pivot(index='Item', columns='CType',values='USD')
print(df_pivot)
# works because each (Item, CType) pair occurs only once in df
# a repeated pair, e.g. a second (Item0, gold) row, would raise a ValueError

CType  bronze  gold  silver
Item                       
Item0     4.0   2.0    10.0
Item1     NaN   6.0     8.0


__PS__: If the `values` parameter is not provided, all remaining columns are used as values and a hierarchical index is used for the columns

In [28]:
df_pivot = df.pivot(index='Item', columns='CType')
print(df_pivot)
print(5*'-')
print(df_pivot['EU'])
print(5*'-')
print(df_pivot['USD'])

         USD                 EU            
CType bronze gold silver bronze gold silver
Item                                       
Item0    4.0  2.0   10.0    3.0  1.0    9.0
Item1    NaN  6.0    8.0    NaN  5.0    7.0
-----
CType  bronze  gold  silver
Item                       
Item0     3.0   1.0     9.0
Item1     NaN   5.0     7.0
-----
CType  bronze  gold  silver
Item                       
Item0     4.0   2.0    10.0
Item1     NaN   6.0     8.0


## Pivot Tables
If more than one row has the same (`index`, `columns`) pair, `pivot` raises a ValueError because it cannot decide which value belongs in the cell.

In [29]:
df = pd.DataFrame({'Item': ['Item0','Item0','Item1','Item1', 'Item0','Item0'],
                      'CType': ['gold','bronze','gold','silver','silver','silver'], 
                      'USD': [i for i in range(2,13,2)],
                      'EU': [i for i in range(1,13,2)]})
print(df)

    Item   CType  USD  EU
0  Item0    gold    2   1
1  Item0  bronze    4   3
2  Item1    gold    6   5
3  Item1  silver    8   7
4  Item0  silver   10   9
5  Item0  silver   12  11


In [30]:
df_pivot = df.pivot(index='Item', columns='CType',values='USD')
print(df_pivot)

ValueError: Index contains duplicate entries, cannot reshape
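
Before pivoting, the offending rows can be located with `duplicated`. The sketch below rebuilds the frame from the cell above under the hypothetical name `sales` and lists the rows that share an (Item, CType) pair.

```python
import pandas as pd

sales = pd.DataFrame({'Item': ['Item0', 'Item0', 'Item1', 'Item1', 'Item0', 'Item0'],
                      'CType': ['gold', 'bronze', 'gold', 'silver', 'silver', 'silver'],
                      'USD': list(range(2, 13, 2))})

# keep=False marks every row whose (Item, CType) pair occurs more than once
clashes = sales[sales.duplicated(subset=['Item', 'CType'], keep=False)]
print(clashes)
```

Here the two (Item0, silver) rows are reported, which are the entries that `pivot` cannot place in a single cell.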

Duplicate entries can be handled with the `pivot_table` method, which combines the values that fall in the same cell using a given aggregation function (`aggfunc`).

In [31]:
df = pd.DataFrame({'Item': ['Item0','Item0','Item1','Item1', 'Item0'],
                      'CType': ['gold','bronze','gold','silver','gold'], 
                      'USD': [i for i in range(2,11,2)],
                      'EU': [i for i in range(1,11,2)]})
print(df)

    Item   CType  USD  EU
0  Item0    gold    2   1
1  Item0  bronze    4   3
2  Item1    gold    6   5
3  Item1  silver    8   7
4  Item0    gold   10   9


In [32]:
df_pivot = df.pivot_table(index='Item', columns='CType', values='USD', aggfunc='sum')  # 'sum' as a string avoids warnings in newer pandas
print(df_pivot)

CType  bronze  gold  silver
Item                       
Item0     4.0  12.0     NaN
Item1     NaN   6.0     8.0
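
A pivot table with an aggregation function is closely related to the GroupBy operations from the first half of the lecture. As a sketch, grouping by both columns, summing, and unstacking the inner index level gives the same numbers for this example.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['Item0', 'Item0', 'Item1', 'Item1', 'Item0'],
                   'CType': ['gold', 'bronze', 'gold', 'silver', 'gold'],
                   'USD': [2, 4, 6, 8, 10]})

via_pivot = df.pivot_table(index='Item', columns='CType',
                           values='USD', aggfunc='sum')

# group on both keys, aggregate, then move CType from the row index
# to the columns
via_groupby = df.groupby(['Item', 'CType'])['USD'].sum().unstack()
print(via_groupby)
```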

