# Grouping

You will often need an aggregate view of a `DataFrame`, broken down by the values in one of its columns.  This is where `groupby` comes in.

Grouping on a value returns a new type of object called a [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby).


In [1]:
# Setup
import os

import pandas as pd

pd.options.display.max_rows = 10
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Sanity check
(len(users), len(transactions))

(475, 998)

Let's remind ourselves about the types of data we have in the **`transactions`** `DataFrame`.

In [2]:
transactions.dtypes

sender        object
receiver      object
amount       float64
sent_date     object
dtype: object

Grouping by a specific column is pretty straightforward. We want to group by the receiver, so we pass the column name to the [`DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method.

In [3]:
grouped_by_receiver = transactions.groupby('receiver')

# Let's see what type of object we got back
type(grouped_by_receiver)

pandas.core.groupby.groupby.DataFrameGroupBy

We received a [`DataFrameGroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby) object. There are quite a few methods here.

One of them is [`GroupBy.size`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html), which returns a `Series` of how many members are in each of the groups. In our case this is the number of transactions that each user received. Before calling it, let's look at the groups themselves through the `groups` attribute, which maps each group key to the index labels of its rows.

In [4]:
# Take a look at the groups
grouped_by_receiver.groups

{'aaron': Int64Index([571, 626, 763, 803, 867, 959], dtype='int64'),
 'acook': Int64Index([875], dtype='int64'),
 'adam.saunders': Int64Index([198, 518], dtype='int64'),
 'adrian': Int64Index([123, 228, 403], dtype='int64'),
 'adrian.blair': Int64Index([150, 239, 326, 346, 372, 751, 862], dtype='int64'),
 'alan9443': Int64Index([26], dtype='int64'),
 'alexander7808': Int64Index([56, 241, 292, 384, 873], dtype='int64'),
 'alvarado': Int64Index([685, 802], dtype='int64'),
 'alvarez': Int64Index([832, 846, 853, 857, 878], dtype='int64'),
 'amanda': Int64Index([868], dtype='int64'),
 'amiller': Int64Index([503], dtype='int64'),
 'andersen': Int64Index([474, 523], dtype='int64'),
 'andrade': Int64Index([851], dtype='int64'),
 'andrew.alvarez': Int64Index([414], dtype='int64'),
 'andrew6216': Int64Index([22, 121, 200], dtype='int64'),
 'andrew6347': Int64Index([487, 700, 718], dtype='int64'),
 'angela7209': Int64Index([402, 527], dtype='int64'),
 'anthony1788': Int64Index([253, 337, 413, 889
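
Because the full dictionary is long, a small sketch may make its shape easier to see. The `tiny` DataFrame and its values below are hypothetical stand-ins for the real transactions data.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)

tiny_groups = tiny.groupby("receiver").groups

# Each key is a receiver; each value holds the index labels of that receiver's rows
group_labels = {name: list(labels) for name, labels in tiny_groups.items()}
print(group_labels)
```

Here `bob` received the transactions in rows 0 and 2, so those two labels end up under the `bob` key.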

### Iterating through Groups
With the `GroupBy` object in hand, we can iterate through the groups. Each iteration yields the group key and a `DataFrame` holding that group's rows.

In [5]:
for index, group in grouped_by_receiver:
    print(index)
    print(group)
    break  # Remove the break to see all the groups

aaron
              sender receiver  amount   sent_date
571          bradley    aaron   98.52  2018-08-31
626      anthony1788    aaron    8.10  2018-09-06
763  daniel.williams    aaron   35.20  2018-09-16
803    samuel.conner    aaron   96.88  2018-09-17
867         gail2190    aaron   47.91  2018-09-21
959    alexander7808    aaron   79.54  2018-09-25
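
The same loop can also collect something from each group instead of printing it. The sketch below assumes a made-up `tiny` DataFrame and gathers the senders for each receiver.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)

senders_by_receiver = {}
for receiver, group in tiny.groupby("receiver"):
    # `receiver` is the group key; `group` is a DataFrame of that receiver's rows
    senders_by_receiver[receiver] = sorted(group["sender"])

print(senders_by_receiver)
```

For anything that boils down to a sum, count, or mean, the built-in aggregation methods are usually a better fit than an explicit loop.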


### Select a Group

Using the `get_group()` method, we can select a single group by its key.

In [6]:
grouped_by_receiver.get_group('alvarado')

Unnamed: 0,sender,receiver,amount,sent_date
685,adrian.blair,alvarado,14.73,2018-09-11
802,dean2365,alvarado,45.58,2018-09-17
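
As a rough sanity check, `get_group` should pick out the same rows as a boolean filter on the grouping column. The sketch below assumes a made-up `tiny` DataFrame.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)

# Select the rows for a single receiver through the GroupBy object
bob_rows = tiny.groupby("receiver").get_group("bob")

# The same rows selected with an ordinary boolean mask
bob_mask_rows = tiny[tiny["receiver"] == "bob"]

print(list(bob_rows.index), list(bob_mask_rows.index))
```

Asking for a key that is not one of the groups raises a `KeyError`, so check `groups` first if you are unsure whether a key exists.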


In [7]:
# Returns a Series with the number of rows in each group
grouped_by_receiver.size()

receiver
aaron            6
acook            1
adam.saunders    2
adrian           3
adrian.blair     7
                ..
wilson           2
wking            2
wright3590       4
young            2
zachary.neal     4
Length: 410, dtype: int64

Similarly, the `DataFrameGroupBy.count` method shows, for each group, how many non-missing values each column of our `DataFrame` contains.

In [8]:
grouped_by_receiver.count()

receiver,sender,amount,sent_date
aaron,6,6,6
acook,1,1,1
adam.saunders,2,2,2
adrian,3,3,3
adrian.blair,7,7,7
...,...,...,...
wilson,2,2,2
wking,2,2,2
wright3590,4,4,4
young,2,2,2
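
In the output above, `size` and `count` agree because no values are missing. The sketch below uses a made-up DataFrame with one deliberately missing amount to show where the two differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one of ann's amounts is deliberately missing
tiny = pd.DataFrame(
    {
        "receiver": ["ann", "ann", "bob"],
        "amount": [5.0, np.nan, 2.5],
    }
)
grouped = tiny.groupby("receiver")

sizes = grouped.size()              # rows per group, missing values included
counts = grouped["amount"].count()  # non-missing amounts per group

print(sizes)
print(counts)
```

So `size` answers "how many rows are in this group?", while `count` answers "how many usable values does this column have in this group?".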


The `GroupBy` object provides aggregate functions that make these calculations quick and seamless. For instance, the `GroupBy.sum` method sums each numeric column within each group. In our case there is only one numeric column, **`amount`**.

In [9]:
grouped_by_receiver.sum()

receiver,amount
aaron,366.15
acook,94.65
adam.saunders,101.15
adrian,124.36
adrian.blair,462.88
...,...
wilson,44.39
wking,74.07
wright3590,195.45
young,83.57
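
Depending on the installed pandas version, calling `sum()` on the whole group may also try to add up text columns. One way to sidestep that, sketched here on a made-up `tiny` DataFrame, is to select the column before aggregating.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)

# Selecting "amount" first yields a Series of per-receiver totals
received_totals = tiny.groupby("receiver")["amount"].sum()

print(received_totals)
```

The result is a `Series` indexed by receiver rather than a one-column `DataFrame`, which is often more convenient to assign or sort.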


Now let's compute several aggregations at the same time by passing a list of function names to `agg`.

In [10]:
grouped_by_receiver.agg(['count', 'min', 'max', 'mean'])

,amount,amount,amount,amount
receiver,count,min,max,mean
aaron,6,8.10,98.52,61.025000
acook,1,94.65,94.65,94.650000
adam.saunders,2,9.31,91.84,50.575000
adrian,3,0.73,96.35,41.453333
adrian.blair,7,30.76,89.55,66.125714
...,...,...,...,...
wilson,2,5.01,39.38,22.195000
wking,2,14.50,59.57,37.035000
wright3590,4,0.99,98.86,48.862500
young,2,33.91,49.66,41.785000
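
Passing a list produces two-level column headers, as in the output above. If flat column names are preferable, named aggregation (available in pandas 0.25 and later) is one option. The sketch below assumes a made-up `tiny` DataFrame, and the output column names are arbitrary choices.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)

# Each keyword becomes an output column: name=(source column, aggregation)
summary = tiny.groupby("receiver").agg(
    n_received=("amount", "count"),
    smallest=("amount", "min"),
    largest=("amount", "max"),
    average=("amount", "mean"),
)

print(summary)
```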


## Custom aggregations

In [11]:
def my_mean(values):
    # agg normally passes each group's column to this function as a Series
    return values.mean()

grouped_by_receiver.agg(my_mean)

receiver,amount
aaron,61.025000
acook,94.650000
adam.saunders,50.575000
adrian,41.453333
adrian.blair,66.125714
...,...
wilson,22.195000
wking,37.035000
wright3590,48.862500
young,41.785000
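
A custom function is most useful when no built-in aggregation does the job. As a minimal sketch on a made-up `tiny` DataFrame, the hypothetical `amount_range` function below computes the spread between the largest and smallest amount each user received.

```python
import pandas as pd

# Hypothetical miniature of the transactions data (values are made up)
tiny = pd.DataFrame(
    {
        "sender": ["ann", "bob", "cat", "ann"],
        "receiver": ["bob", "ann", "bob", "cat"],
        "amount": [10.0, 5.5, 7.25, 3.0],
    }
)


def amount_range(values):
    """Spread between the largest and smallest value in one group."""
    return values.max() - values.min()


# Select the numeric column first so the function only ever sees amounts
spread = tiny.groupby("receiver")["amount"].agg(amount_range)

print(spread)
```

Receivers with a single transaction get a spread of 0, since their largest and smallest amounts are the same value.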


### Creating a new column on our **`users`** `DataFrame`.

In [12]:
# Create a new column in users called transaction count, and set the values to the size of the matching group
users['transaction_count'] = grouped_by_receiver.size()

In [13]:
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance,transaction_count
aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14,6.0
acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45,1.0
adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12,2.0
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01,3.0
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85,7.0


In [14]:
# Not every user has received a transaction, let's see how much missing data we are dealing with
len(users[users.transaction_count.isna()])

65

Since we don't have a received transaction for everyone, not every user appears in our grouping.  The assignment matched group sizes to users by index label, so users without a matching group ended up with `np.nan`.  Let's fix that.

In [15]:
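# Set all missing data to 0, since in reality, there have been 0 received transactions for this user
# Assign the result back to the column instead of filling in place on a selected Series
users['transaction_count'] = users['transaction_count'].fillna(0)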
users

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance,transaction_count
aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14,6.0
acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45,1.0
adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12,2.0
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01,3.0
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85,7.0
...,...,...,...,...,...,...,...,...
wilson,Robert,Wilson,robert@yahoo.com,False,2018-05-16,5,59.75,2.0
wking,Wanda,King,wanda.king@holt.com,True,2018-06-01,2,67.08,2.0
wright3590,Jacqueline,Wright,jacqueline.wright@gonzalez.com,True,2018-02-08,6,18.48,4.0
young,Jessica,Young,jessica4028@yahoo.com,True,2018-07-17,4,75.39,2.0
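
To see where the missing values come from, here is a minimal sketch on made-up data. The `tiny_users` and `tiny_transactions` DataFrames are hypothetical, and the user `dee` has deliberately never received a transaction.

```python
import pandas as pd

# Hypothetical users indexed by username; "dee" has never received anything
tiny_users = pd.DataFrame(
    {"first_name": ["Ann", "Bob", "Dee"]},
    index=["ann", "bob", "dee"],
)
tiny_transactions = pd.DataFrame(
    {"receiver": ["bob", "ann", "bob"], "amount": [10.0, 5.5, 7.25]}
)

received = tiny_transactions.groupby("receiver").size()

# Assignment aligns on index labels, so "dee" has no match and gets NaN
tiny_users["transaction_count"] = received
missing_before = int(tiny_users["transaction_count"].isna().sum())

# Fill the gaps and assign the result back to the column
tiny_users["transaction_count"] = tiny_users["transaction_count"].fillna(0)

print(missing_before)
print(tiny_users)
```

The column is still `float64` after filling, because the missing values forced the conversion when the column was created.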


Check it out! There's our column, but it holds floating point numbers (the missing values forced it to `float64`), and we don't need that for a count.  Let's convert it!

In [16]:
# Convert from the default type of float64 to int64 (no precision needed)
users.transaction_count = users.transaction_count.astype('int64')
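
An alternative that skips the float detour is to line the counts up with the users before assigning them. This is only a sketch on made-up data (`tiny_users` and `tiny_transactions` are hypothetical), using `Series.reindex` with `fill_value`.

```python
import pandas as pd

# Hypothetical users indexed by username; "dee" has never received anything
tiny_users = pd.DataFrame(
    {"first_name": ["Ann", "Bob", "Dee"]},
    index=["ann", "bob", "dee"],
)
tiny_transactions = pd.DataFrame(
    {"receiver": ["bob", "ann", "bob"], "amount": [10.0, 5.5, 7.25]}
)

received = tiny_transactions.groupby("receiver").size()

# Reindex to the users' index and use 0 for anyone without a group,
# so no NaN appears and the integer dtype is preserved
tiny_users["transaction_count"] = received.reindex(tiny_users.index, fill_value=0)

print(tiny_users)
```

Whether this reads better than the `fillna` and `astype` steps is a matter of taste; both end with an integer count for every user.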

Let's sort by the number of transactions.

In [17]:
# Sort our values by the new field descending (so the largest comes first), and then by first name ascending
users.sort_values(
    ['transaction_count', 'first_name'],
    ascending=[False, True],
    inplace=True
)
users.loc[:, ['first_name', 'last_name', 'email', 'transaction_count']].head(10)

Unnamed: 0,first_name,last_name,email,transaction_count
scott3928,Scott,,scott@yahoo.com,9
sfinley,Samuel,Finley,samuel@gmail.com,8
adrian.blair,Adrian,Blair,adrian9335@gmail.com,7
hdeleon,Hannah,Deleon,hannah@yahoo.com,7
miranda6426,Miranda,Rogers,miranda.rogers@gmail.com,7
aaron,Aaron,Davis,aaron6348@gmail.com,6
corey,Corey,Fuller,fuller8100@yahoo.com,6
heather,Heather,Ray,hray@yahoo.com,6
jennifer.hebert,Jennifer,Hebert,jennifer.hebert@yahoo.com,6
edwards,Michael,Edwards,edwards5456@gmail.com,6
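
The tie-breaking behavior is easier to verify on a handful of rows. The sketch below uses a made-up `tiny_users` DataFrame and returns a new sorted DataFrame rather than sorting in place.

```python
import pandas as pd

# Hypothetical users with two ties on transaction_count
tiny_users = pd.DataFrame(
    {
        "first_name": ["Cara", "Abe", "Bea", "Dan"],
        "transaction_count": [2, 5, 2, 5],
    },
    index=["cara", "abe", "bea", "dan"],
)

# Largest counts first; ties are broken by first name in ascending order
ranked = tiny_users.sort_values(
    ["transaction_count", "first_name"],
    ascending=[False, True],
)

print(ranked)
```

Without `inplace=True`, the original `tiny_users` keeps its order, which can be handy when the unsorted version is still needed later.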


In [18]:
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance,transaction_count
scott3928,Scott,,scott@yahoo.com,True,2018-02-26,5,72.02,9
sfinley,Samuel,Finley,samuel@gmail.com,False,2018-03-09,2,83.62,8
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85,7
hdeleon,Hannah,Deleon,hannah@yahoo.com,True,2018-09-12,6,48.93,7
miranda6426,Miranda,Rogers,miranda.rogers@gmail.com,True,2018-08-06,0,20.51,7
