# Groupby and Merge in Pandas
This functionality is essential to your MTA turnstile analysis!

In [None]:
import pandas as pd

Create a DataFrame, with the values as a list of lists and the column names as a list:

In [None]:
df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],[77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]], columns=['userid','product','price'])
df

If we want the maximum price anyone paid, we just do this:

In [None]:
df['price'].max()

If we want the max price per user, we do a groupby. The aggregation is applied to each column separately, so the value we get in the price column might not come from the same row as the value we get in the product column.

In [None]:
df.groupby('userid').max()

In [None]:
df.groupby('userid')[['price']].max()
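
To see that the columns of a groupby max really can come from different rows, here is a small check on user 123. It is only a sketch that rebuilds the same table; the names per_user and top_row are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])

# max is taken column by column: strings compare alphabetically, prices numerically
per_user = df.groupby('userid').max()
print(per_user.loc[123])

# the row that actually holds user 123's highest price
top_row = df[df['userid'] == 123].sort_values('price').iloc[-1]
print(top_row['product'], top_row['price'])
```

For user 123 the grouped product is the alphabetically largest one, while the highest price was paid for a different product.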

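Just like max, we can do sum, etc. Older versions of Pandas silently left out columns for which the aggregation doesn't have meaning. Since Pandas 2.0 you have to ask for that with numeric_only=True, otherwise the strings in the product column get concatenated.

In [None]:
df.groupby('userid').sum(numeric_only=True)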
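
If you want several aggregations of the same column at once, one option is named aggregation with agg. This is a sketch on the same table; the output column names total_spent, max_price and n_purchases are our own choice, not something Pandas prescribes.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])

# each keyword is an output column: (input column, aggregation)
summary = df.groupby('userid').agg(
    total_spent=('price', 'sum'),
    max_price=('price', 'max'),
    n_purchases=('price', 'count'),
)
print(summary)
```

Because every output column names its input column explicitly, the non-numeric product column never gets involved.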

We can sort rows by one or more columns this way:

In [None]:
df.sort_values(by=['userid','product'])

We can sort the rows and then select columns this way:

In [None]:
df.sort_values(by=['userid','product'])[['userid','price']]

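Diff is another routine. Within each group, it subtracts the value in the previous row from the value in the current row, so the first row of each group comes out as NaN.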

In [None]:
df.sort_values(by=['userid','product'])[['userid','price']].groupby(['userid']).diff()
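
One way to convince yourself of what diff does is to rebuild it from shift, which moves each group's values down by one row. The sketch below assumes the same table; ordered, by_diff and by_shift are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])

ordered = df.sort_values(by=['userid','product'])

# built-in: current price minus the previous price of the same user
by_diff = ordered.groupby('userid')['price'].diff()

# by hand: shift the prices down one row within each user, then subtract
by_shift = ordered['price'] - ordered.groupby('userid')['price'].shift()

print(pd.DataFrame({'price': ordered['price'], 'diff': by_diff, 'by_hand': by_shift}))
```

Both columns agree, and the first purchase of every user is NaN because there is no previous row to subtract.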

If we want the maximum price each user paid and the product associated with that price, we will sort, group and filter. Groupby will maintain the sort order within each group.
*(For SQL users: in SQL, you group by and then sort, but in Pandas, it's easier to do it the other way around)*

In [None]:
(df
 .sort_values(by=['userid','price'], ascending=[True, False])
 .groupby('userid')
 .head(1))

In [None]:
df.sort_values(by=['userid','price'],ascending=False).groupby('userid').head(1)
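
Another way to get the same rows, without sorting, is to ask each group for the index label of its maximum price with idxmax and then look those labels up with loc. This is a sketch of that alternative, not a replacement for the approach above; on ties, idxmax keeps the first occurrence.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])

# index label of the most expensive purchase of each user
best_labels = df.groupby('userid')['price'].idxmax()
top_by_idxmax = df.loc[best_labels]

# the sort, group and filter version, for comparison
top_by_sort = (df
               .sort_values(by=['userid','price'], ascending=[True, False])
               .groupby('userid')
               .head(1))

print(top_by_idxmax)
```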

Adding a new column is easy:

In [None]:
df['website']=['Amazon','Amazon','NewEgg','NewEgg','NewEgg','Amazon','Amazon','Amazon']
df

In [None]:
df.groupby(['userid','website']).sum(numeric_only=True)

Below, we do the same groupby as above, but with the as_index flag set to False. That gives us a flat table instead of the nested (multi-level) index.

In [None]:
df3=df.groupby(['userid','website'],as_index=False).sum(numeric_only=True)
df3
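
If you already have the nested version, reset_index gets you to the same flat table. The sketch below compares the two routes on the price column only; flat, nested and also_flat are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])
df['website'] = ['Amazon','Amazon','NewEgg','NewEgg','NewEgg','Amazon','Amazon','Amazon']

# flat table straight from the groupby
flat = df.groupby(['userid','website'], as_index=False)['price'].sum()

# nested index first, then flattened afterwards
nested = df.groupby(['userid','website'])['price'].sum()
also_flat = nested.reset_index()

print(flat)
print(nested.loc[(123, 'NewEgg')])  # a nested index is looked up with a tuple
```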

Let's create a second table:

In [None]:
df2 = pd.DataFrame([[123,'USA'],[77,'Canada'],[92,'USA']], columns=['userid','country'])
df2

We can combine the two tables using the merge function. Conceptually, it compares every row in table1 with every row in table2, and wherever the values in the "on" column match, it creates a single row with the columns from both matched rows. By default this is an inner join, so rows without a match are dropped.

A merge of two tables with 5 rows each can give as few as 0 rows and as many as 25 rows.

    [1,2,3,4,5] merged with [6,7,8,9,10] will give 0 rows
    [1,2,3,4,5] merged with [1,2,3,4,5] will give 5 rows
    [1,1,1,1,1] merged with [1,1,1,1,1] will give 25 rows
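
The three cases above can be checked directly. This is a minimal sketch; merged_rows is a hypothetical helper written for this example, and the column names key, left_pos and right_pos are made up.

```python
import pandas as pd

def merged_rows(left_keys, right_keys):
    """Build two one-key tables and return how many rows their merge has."""
    left = pd.DataFrame({'key': left_keys, 'left_pos': range(len(left_keys))})
    right = pd.DataFrame({'key': right_keys, 'right_pos': range(len(right_keys))})
    return len(pd.merge(left, right, on='key'))

print(merged_rows([1,2,3,4,5], [6,7,8,9,10]))  # no keys in common
print(merged_rows([1,2,3,4,5], [1,2,3,4,5]))   # every key matches exactly once
print(merged_rows([1,1,1,1,1], [1,1,1,1,1]))   # every row matches every row
```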

In [None]:
pd.merge(df,df2,on='userid')

We can merge and then groupby to get what we want (money spent on each website per country). We pick out the price column before summing, since adding up user ids is meaningless.

In [None]:
pd.merge(df,df2,on='userid').groupby(['country','website'])[['price']].sum()

We can also work with previously aggregated tables. Below we use df3 instead of df (scroll up to see what df3 is). The price totals are the same as in the previous cell.

In [None]:
pd.merge(df3,df2,on='userid').groupby(['country','website'])[['price']].sum()
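
Here is a self-contained check that both routes give the same totals. It is a sketch that rebuilds df, df2 and df3 as above; direct and via_df3 are names invented for the comparison.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])
df['website'] = ['Amazon','Amazon','NewEgg','NewEgg','NewEgg','Amazon','Amazon','Amazon']
df2 = pd.DataFrame([[123,'USA'],[77,'Canada'],[92,'USA']], columns=['userid','country'])
df3 = df.groupby(['userid','website'], as_index=False)['price'].sum()

# route 1: merge the raw purchases, then sum
direct = pd.merge(df, df2, on='userid').groupby(['country','website'])['price'].sum()

# route 2: merge the per-user totals, then sum again
via_df3 = pd.merge(df3, df2, on='userid').groupby(['country','website'])['price'].sum()

print(direct)
print(via_df3)
```

This works for sum because a sum of sums is still the overall sum. It would not work for something like mean, where averaging per-user averages generally gives a different number.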

Let's add another column: purchase date

In [None]:
df['date']=['2015-01-12','2015-01-08','2015-01-06','2015-01-03','2015-01-05','2015-01-04','2015-01-07','2015-01-02']
df

Here is a tricky task. For each row, I want the average purchase price for that user prior to that purchase. One option is to do some loops. But another solution is to just do a merge on itself and filter.

But first, let's review what a merge (or 'join' if you come from SQL) does. Say you merge two dataframes with 3 rows each, how many rows would you end up with? The answer is anywhere between 0 and 9.

Consider the following examples, where table x has users and the movies they like, and table y has users and the wines they like. Let's do a merge to come up with possible movie and wine pairings for any user. In case A we get 0 rows, in case B we get 3 rows, and in case C we get 9 rows.

In [None]:
dfx = pd.DataFrame([[1,'Godfather'],[2,'Amelie'],[3,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[4,'red'],[5,'white'],[6,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1

In [None]:
dfx = pd.DataFrame([[1,'Godfather'],[2,'Amelie'],[3,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[1,'red'],[2,'white'],[3,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1

In [None]:
dfx = pd.DataFrame([[1,'Godfather'],[1,'Amelie'],[1,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[1,'red'],[1,'white'],[1,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1
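
If you expect each key to appear only once on each side (as in case B), merge can check that for you through its validate argument, which raises a MergeError when the expectation is broken. The sketch below reuses the tables from case C; the variable raised is just for the demonstration.

```python
import pandas as pd

dfx = pd.DataFrame([[1,'Godfather'],[1,'Amelie'],[1,'Chicago']], columns=['userid','movies'])
dfy = pd.DataFrame([[1,'red'],[1,'white'],[1,'pink']], columns=['userid','wines'])

# without validate, the repeated keys quietly multiply into 9 rows
print(len(pd.merge(dfx, dfy, on='userid')))

# with validate, the same merge refuses to run
try:
    pd.merge(dfx, dfy, on='userid', validate='one_to_one')
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print('merge rejected:', err)
```

This is handy when an unexpectedly large merge result would otherwise go unnoticed.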

Now let's return to the original question: For each row, I want the average purchase price for that user prior to that purchase. Let's do a merge on itself and filter.

If we join a table on itself, then for each row we get every purchase the same user made (including that row itself).

In [None]:
df4=pd.merge(df[['userid','date']],df[['userid','price','date']],on='userid')
df4
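
A user with n purchases contributes n × n rows to the self-merge, so we can predict the size of the result from the group sizes. This is a quick sanity check on the same data; sizes and expected_rows are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])
df['date'] = ['2015-01-12','2015-01-08','2015-01-06','2015-01-03',
              '2015-01-05','2015-01-04','2015-01-07','2015-01-02']

df4 = pd.merge(df[['userid','date']], df[['userid','price','date']], on='userid')

# number of purchases per user, squared and added up
sizes = df.groupby('userid').size()
expected_rows = int((sizes ** 2).sum())

print(len(df4), expected_rows)
print(list(df4.columns))  # the clashing 'date' columns get _x and _y suffixes
```

Keep this quadratic growth in mind: a self-merge is fine for a handful of purchases per user, but it can get large quickly when some users have thousands of rows.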

Then we can filter out the purchases that are not prior to the current purchase. The dates are strings, but in YYYY-MM-DD format they compare correctly as text. (Notice that the rows for users 123 and 92 are not in date order; that doesn't impact the work.)

In [None]:
df4=df4[df4['date_x']>df4['date_y']]
df4
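
Comparing dates as strings only works because every date here is zero-padded YYYY-MM-DD. A safer habit, sketched below on a fresh copy of the table, is to convert the column with pd.to_datetime first; pairs and prior are names made up for this example.

```python
from datetime import date
import pandas as pd

# a string that is not zero-padded sorts in the wrong place
print('2015-1-5' > '2015-01-08')            # text comparison says "later"
print(date(2015, 1, 5) > date(2015, 1, 8))  # the real dates say otherwise

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])
df['date'] = pd.to_datetime(['2015-01-12','2015-01-08','2015-01-06','2015-01-03',
                             '2015-01-05','2015-01-04','2015-01-07','2015-01-02'])

pairs = pd.merge(df[['userid','date']], df[['userid','price','date']], on='userid')
prior = pairs[pairs['date_x'] > pairs['date_y']]  # strictly earlier purchases only
print(len(prior))
```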

Then we can group by to get the average price that we wanted

In [None]:
# pick out price first: date_y is a string column and cannot be averaged
df5 = df4.groupby(['userid','date_x'])[['price']].mean()
df5.rename(columns={'price': 'avg_price_prior'}, inplace=True)
df5

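Finally, we merge with the original dataframe. Each user's earliest purchase has no prior purchases, so with how='left' it stays in the table with NaN as its average.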

In [None]:
df6 = df.merge(df5, left_on=['userid', 'date'], right_index=True, how='left')
df6
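
For comparison, here is a sketch of a different route to the same numbers that avoids the self-merge: sort by date, shift each user's prices down one row, and take an expanding mean. This is not the method used above, just an alternative under the assumption that no user has two purchases on the same date (the self-merge drops same-date purchases, this version would count the earlier row).

```python
import pandas as pd

df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],
                   [77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]],
                  columns=['userid','product','price'])
df['date'] = ['2015-01-12','2015-01-08','2015-01-06','2015-01-03',
              '2015-01-05','2015-01-04','2015-01-07','2015-01-02']

# per user, in date order: drop the current price (shift), then average
# everything seen so far (expanding mean); assignment aligns on the index
df['avg_price_prior'] = (df.sort_values('date')
                           .groupby('userid')['price']
                           .transform(lambda s: s.shift().expanding().mean()))
print(df)
```

The result lines up row by row with the original table, and each user's first purchase is NaN, as in the merged version.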