# Grouping and Summarising Dataframes

Grouping and aggregation are among the most frequently used operations in data analysis, especially in exploratory data analysis (EDA), where comparing summary statistics across groups is common.

For example, in the retail sales data we are working with, you may want to compare the average sales of various regions, or the total profit of two customer segments.

Grouping analysis can be thought of as having three parts:
1. **Splitting** the data into groups (e.g. groups of customer segments, product categories, etc.)
2. **Applying** a function to each group (e.g. mean or total sales of each customer segment)
3. **Combining** the results into a data structure showing the summary statistics
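The three steps listed above can be sketched by hand on a tiny dataframe. The `toy` data and its column names are made up for illustration and are not part of the retail dataset; the point is only that a single `groupby` expression gives the same result as doing the split, apply and combine steps manually.

```python
import pandas as pd

# Hypothetical toy data, not the retail dataset used in this notebook
toy = pd.DataFrame({
    "segment": ["A", "B", "A", "B", "A"],
    "sales": [10.0, 20.0, 30.0, 60.0, 50.0],
})

# 1. Split: one sub-dataframe per distinct segment
pieces = {name: group for name, group in toy.groupby("segment")}

# 2. Apply: compute the mean of sales within each piece
means = {name: group["sales"].mean() for name, group in pieces.items()}

# 3. Combine: collect the per-group results into a single Series
manual = pd.Series(means, name="sales")

# groupby performs all three steps in one expression
auto = toy.groupby("segment")["sales"].mean()
print(manual)
print(auto)
```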

Let's work through some examples.

In [1]:
# Loading libraries and files
import numpy as np
import pandas as pd

market_df = pd.read_csv("./global_sales_data/market_fact.csv")
customer_df = pd.read_csv("./global_sales_data/cust_dimen.csv")
product_df = pd.read_csv("./global_sales_data/prod_dimen.csv")
shipping_df = pd.read_csv("./global_sales_data/shipping_dimen.csv")
orders_df = pd.read_csv("./global_sales_data/orders_dimen.csv")

Say you want to understand how well or poorly the business is doing across customer segments, regions, product categories, etc. Specifically, you want to identify the areas of the business where you are incurring heavy losses, so that you can take action accordingly.

To do that, we will answer questions such as:
* Which customer segments are the least profitable?
* Which product categories and sub-categories are the least profitable?
* Which geographic regions have the least profitable customers?
* Etc.

First, we will merge all the dataframes, so we have all the data in one ```master_df```.

In [2]:
# Merging the dataframes one by one
df_1 = pd.merge(market_df, customer_df, how='inner', on='Cust_id')
df_2 = pd.merge(df_1, product_df, how='inner', on='Prod_id')
df_3 = pd.merge(df_2, shipping_df, how='inner', on='Ship_id')
master_df = pd.merge(df_3, orders_df, how='inner', on='Ord_id')

master_df.head()

,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Region,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,WEST,CORPORATE,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,WEST,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,ONTARIO,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,WEST,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW
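The `Order_ID_x` and `Order_ID_y` columns in the output above suggest that two of the merged tables both carried an `Order_ID` column that was not used as the merge key. The sketch below reproduces that behaviour on two made-up miniature tables (the values are hypothetical; only the column names mirror the notebook) and shows one possible way to collapse the two copies when they are identical.

```python
import pandas as pd

# Hypothetical miniature tables; only the column names mirror the notebook
left = pd.DataFrame({"Ord_id": ["Ord_1", "Ord_2"], "Order_ID": [3, 293]})
right = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2"],
    "Order_ID": [3, 293],
    "Order_Priority": ["LOW", "HIGH"],
})

# Merging on Ord_id alone leaves two Order_ID columns, which pandas
# disambiguates with its default suffixes _x and _y
merged = pd.merge(left, right, how="inner", on="Ord_id")
print(merged.columns.tolist())

# If the two copies are identical, drop one and restore the original name
if merged["Order_ID_x"].equals(merged["Order_ID_y"]):
    merged = merged.drop(columns="Order_ID_y").rename(
        columns={"Order_ID_x": "Order_ID"}
    )
print(merged.columns.tolist())
```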


In [3]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8399 entries, 0 to 8398
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Ord_id                8399 non-null   object 
 1   Prod_id               8399 non-null   object 
 2   Ship_id               8399 non-null   object 
 3   Cust_id               8399 non-null   object 
 4   Sales                 8399 non-null   float64
 5   Discount              8399 non-null   float64
 6   Order_Quantity        8399 non-null   int64  
 7   Profit                8399 non-null   float64
 8   Shipping_Cost         8399 non-null   float64
 9   Product_Base_Margin   8336 non-null   float64
 10  Customer_Name         8399 non-null   object 
 11  Province              8399 non-null   object 
 12  Region                8399 non-null   object 
 13  Customer_Segment      8399 non-null   object 
 14  Product_Category      8399 non-null   object 
 15  Product_Sub_Category 

In [6]:
master_df['Customer_Segment'].value_counts()

CORPORATE         3076
HOME OFFICE       2032
CONSUMER          1649
SMALL BUSINESS    1642
Name: Customer_Segment, dtype: int64

#### Step 1. Grouping using ```df.groupby()```

Typically, you group the data using a categorical variable, such as customer segments, product categories, etc. This creates as many subsets of the data as there are levels in the categorical variable. 

For example, in this case, we will group the data along ```Customer_Segment```.

In [7]:
# Which customer segments are the least profitable? 

# Step 1. Grouping: First, we will group the dataframe by customer segments
df_by_segment = master_df.groupby('Customer_Segment')
df_by_segment

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002CF846A45E0>

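Note that ```df.groupby``` returns a DataFrameGroupBy object. No summary is computed at this point; the computation happens only when you apply a function to the grouped object.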
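Although the grouped object prints only as a memory address, it can be inspected. The following is a minimal sketch on made-up rows (the `toy` dataframe is hypothetical) showing a few ways to look inside it.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up
toy = pd.DataFrame({
    "Customer_Segment": ["CORPORATE", "CONSUMER", "CORPORATE", "HOME OFFICE"],
    "Profit": [100.0, -20.0, 50.0, 10.0],
})
grouped = toy.groupby("Customer_Segment")

print(grouped.ngroups)   # number of distinct groups
print(grouped.size())    # number of rows in each group

# Rows belonging to a single group, returned as a dataframe
corporate = grouped.get_group("CORPORATE")
print(corporate)

# Iterating yields (group name, sub-dataframe) pairs
for name, rows in grouped:
    print(name, len(rows))
```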

In [8]:
market_df.sort_values(by='Ship_id', ascending=True).head(3)

,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
5858,Ord_1,Prod_1,SHP_1,Cust_1,261.54,0.04,6,-213.25,35.0,0.8
6025,Ord_8,Prod_7,SHP_10,Cust_8,124.56,0.04,32,-14.33,2.0,0.53
6026,Ord_8,Prod_6,SHP_10,Cust_8,196.85,0.01,45,-166.85,6.18,0.4


In [9]:
market_df.describe()

,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
count,8399.0,8399.0,8399.0,8399.0,8399.0,8336.0
mean,1775.878179,0.049671,25.571735,181.184424,12.838557,0.512513
std,3585.050525,0.031823,14.481071,1196.653371,17.264052,0.135589
min,2.24,0.0,1.0,-14140.7,0.49,0.35
25%,143.195,0.02,13.0,-83.315,3.3,0.38
50%,449.42,0.05,26.0,-1.5,6.07,0.52
75%,1709.32,0.08,38.0,162.75,13.99,0.59
max,89061.05,0.25,50.0,27220.69,164.73,0.85


In [20]:
ship_ids = market_df['Ship_id'].value_counts()
ship_ids

SHP_564     4
SHP_5602    4
SHP_1378    4
SHP_3855    3
SHP_181     3
           ..
SHP_6893    1
SHP_6796    1
SHP_6821    1
SHP_6847    1
SHP_7628    1
Name: Ship_id, Length: 7701, dtype: int64

In [16]:
ship_ids[ship_ids > 3]

SHP_564     4
SHP_5602    4
SHP_1378    4
Name: Ship_id, dtype: int64
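The cell above filters the counts themselves. If you instead want the original rows that belong to the larger groups, one option is to broadcast each group's count back onto its rows with `transform`. This is a sketch on made-up data; the `Ship_id` values and the threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the Ship_id values are made up
toy = pd.DataFrame({
    "Ship_id": ["SHP_1", "SHP_2", "SHP_2", "SHP_3", "SHP_3", "SHP_3"],
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# transform('count') repeats each group's non-null count on every row of the group
rows_per_shipment = toy.groupby("Ship_id")["Sales"].transform("count")

# Keep only the rows whose shipment contains more than one line item
multi_item = toy[rows_per_shipment > 1]
print(multi_item)
```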

In [21]:
market_df.groupby('Ship_id')['Sales'].describe()

Ship_id,count,mean,std,min,25%,50%,75%,max
SHP_1,1.0,261.540,,261.54,261.5400,261.540,261.5400,261.54
SHP_10,2.0,160.705,51.116749,124.56,142.6325,160.705,178.7775,196.85
SHP_100,1.0,4679.100,,4679.10,4679.1000,4679.100,4679.1000,4679.10
SHP_1000,1.0,823.630,,823.63,823.6300,823.630,823.6300,823.63
SHP_1001,1.0,286.730,,286.73,286.7300,286.730,286.7300,286.73
...,...,...,...,...,...,...,...,...
SHP_995,1.0,12979.100,,12979.10,12979.1000,12979.100,12979.1000,12979.10
SHP_996,1.0,2052.820,,2052.82,2052.8200,2052.820,2052.8200,2052.82
SHP_997,1.0,1905.790,,1905.79,1905.7900,1905.790,1905.7900,1905.79
SHP_998,1.0,68.880,,68.88,68.8800,68.880,68.8800,68.88


#### Step 2. Applying a Function

After grouping, you apply a function to a **numeric variable**, such as ```mean(Sales)```, ```sum(Profit)```, etc. 

In [26]:
# Step 2. Applying a function
# We can choose aggregate functions such as sum, mean, median, etc.
df_by_segment['Profit'].sum().sort_values(ascending=False)

Customer_Segment
CORPORATE         599746.00
HOME OFFICE       318354.03
SMALL BUSINESS    315708.01
CONSUMER          287959.94
Name: Profit, dtype: float64

Notice that we have indexed the ```Profit``` column in the DataFrameGroupBy object exactly as we index a normal column in a dataframe. Alternatively, you could also use ```df_by_segment.Profit```. 

In [27]:
# Alternatively
df_by_segment.Profit.sum()

Customer_Segment
CONSUMER          287959.94
CORPORATE         599746.00
HOME OFFICE       318354.03
SMALL BUSINESS    315708.01
Name: Profit, dtype: float64

So this tells us that total profit is lowest in the CONSUMER segment and highest in the CORPORATE segment.
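As a side note on the two column-selection styles used above: they give the same result for a column like ```Profit```, but attribute access stops working when a column name clashes with a method of the grouped object. The sketch below uses a made-up dataframe with a hypothetical column named `count` to show the difference.

```python
import pandas as pd

# Hypothetical toy data; the "count" column is chosen to clash with a method name
toy = pd.DataFrame({
    "segment": ["A", "A", "B"],
    "Profit": [1.0, 2.0, 3.0],
    "count": [5, 6, 7],
})
g = toy.groupby("segment")

# Bracket and attribute access agree for an ordinary column name
same = g["Profit"].sum().equals(g.Profit.sum())
print(same)

# g.count is the GroupBy method, not the column, so brackets are required
print(callable(g.count))
print(g["count"].sum())

# A list of names selects several columns at once
print(g[["Profit", "count"]].sum())
```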

In [28]:
# For better readability, you may want to sort the summarised series:
df_by_segment.Profit.sum().sort_values(ascending = False)

Customer_Segment
CORPORATE         599746.00
HOME OFFICE       318354.03
SMALL BUSINESS    315708.01
CONSUMER          287959.94
Name: Profit, dtype: float64

In [29]:
customer_df.head(3)

,Customer_Name,Province,Region,Customer_Segment,Cust_id
0,MUHAMMED MACINTYRE,NUNAVUT,NUNAVUT,SMALL BUSINESS,Cust_1
1,BARRY FRENCH,NUNAVUT,NUNAVUT,CONSUMER,Cust_2
2,CLAY ROZENDAL,NUNAVUT,NUNAVUT,CORPORATE,Cust_3


#### Step 3. Combining the results into a Data Structure

You can optionally show the results as a dataframe.

In [32]:
# Converting to a df
pd.DataFrame(df_by_segment['Profit'].sum().sort_values(ascending=False))

Customer_Segment,Profit
CORPORATE,599746.0
HOME OFFICE,318354.03
SMALL BUSINESS,315708.01
CONSUMER,287959.94
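Besides wrapping the result in `pd.DataFrame`, there are a few other common ways to get a dataframe out of a grouped summary. The sketch below uses made-up rows (the `toy` dataframe is hypothetical) and differs mainly in whether the group labels end up in the index or in a regular column.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up
toy = pd.DataFrame({
    "Customer_Segment": ["CORPORATE", "CONSUMER", "CORPORATE"],
    "Profit": [100.0, -20.0, 50.0],
})
totals = toy.groupby("Customer_Segment")["Profit"].sum()

as_frame = totals.to_frame()   # group labels stay in the index
flat = totals.reset_index()    # group labels become a regular column

# as_index=False asks groupby to return the labels as a column directly
direct = toy.groupby("Customer_Segment", as_index=False)["Profit"].sum()

print(as_frame)
print(flat)
print(direct)
```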


In [35]:
master_df['Product_Category'].value_counts()

OFFICE SUPPLIES    4610
TECHNOLOGY         2065
FURNITURE          1724
Name: Product_Category, dtype: int64

In [34]:
# Let's go through some more examples
# E.g.: Which product categories are the least profitable?
df_grp_by_pc = master_df.groupby('Product_Category')
df_grp_by_pc.Profit.sum().sort_values(ascending=False)

Product_Category
TECHNOLOGY         886313.52
OFFICE SUPPLIES    518021.43
FURNITURE          117433.03
Name: Profit, dtype: float64

In [46]:
df_grp_by_pc.count().sort_values(by='Ord_id', ascending=False)

Product_Category,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Province,Region,Customer_Segment,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority
OFFICE SUPPLIES,4610,4610,4610,4610,4610,4610,4610,4610,4610,4589,...,4610,4610,4610,4610,4610,4610,4610,4610,4610,4610
TECHNOLOGY,2065,2065,2065,2065,2065,2065,2065,2065,2065,2065,...,2065,2065,2065,2065,2065,2065,2065,2065,2065,2065
FURNITURE,1724,1724,1724,1724,1724,1724,1724,1724,1724,1682,...,1724,1724,1724,1724,1724,1724,1724,1724,1724,1724


In [50]:
886313.52/2065

429.2075157384988
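The division above checks by hand that a group's mean is its sum divided by its row count. The same check can be done for every group at once, because the two summary Series share the same index and divide element by element. This is a sketch on made-up numbers, not the retail data.

```python
import pandas as pd

# Hypothetical toy rows; the profit values are made up
toy = pd.DataFrame({
    "Product_Category": ["TECHNOLOGY", "TECHNOLOGY", "FURNITURE", "FURNITURE"],
    "Profit": [400.0, 100.0, -50.0, 150.0],
})
by_cat = toy.groupby("Product_Category")["Profit"]

# Dividing two aligned Series reproduces the group means
manual_mean = by_cat.sum() / by_cat.count()
print(manual_mean)
print(by_cat.mean())
```

Both `count()` and `mean()` skip missing values, so the two results agree even when a column contains NaNs.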

In [47]:
pd.DataFrame(df_grp_by_pc.Profit.sum().sort_values(ascending=False))

Product_Category,Profit
TECHNOLOGY,886313.52
OFFICE SUPPLIES,518021.43
FURNITURE,117433.03


In [48]:
# Let's go through some more examples
# E.g.: Which product categories are the least profitable?

# 1. Group by product category
by_product_cat = master_df.groupby('Product_Category')

In [53]:
# 2. This time, let's compare average profits
# Apply mean() on Profit
by_product_cat['Profit'].mean()

Product_Category
FURNITURE           68.116607
OFFICE SUPPLIES    112.369074
TECHNOLOGY         429.207516
Name: Profit, dtype: float64

In [49]:
pd.DataFrame(by_product_cat['Profit'].mean())

Product_Category,Profit
FURNITURE,68.116607
OFFICE SUPPLIES,112.369074
TECHNOLOGY,429.207516


In [64]:
by_product_cat['Profit'].describe().sort_values(by='count')

Product_Category,count,mean,std,min,25%,50%,75%,max
FURNITURE,1724.0,68.116607,1112.923257,-11053.6,-281.355,-14.25,187.16,8614.79
TECHNOLOGY,2065.0,429.207516,1863.208375,-14140.7,-88.94,66.22,561.13,27220.69
OFFICE SUPPLIES,4610.0,112.369074,744.617939,-2175.09,-57.0225,-3.845,56.9475,11535.28


On average, FURNITURE is the least profitable category and TECHNOLOGY the most. Let's see which product sub-categories within FURNITURE are the least profitable.

In [65]:
df_grp_by_pc_and_subgrp_by_psc = master_df.groupby(['Product_Category', 'Product_Sub_Category']).Profit.mean()
df_grp_by_pc_and_subgrp_by_psc

Product_Category  Product_Sub_Category          
FURNITURE         BOOKCASES                         -177.683228
                  CHAIRS & CHAIRMATS                 387.693601
                  OFFICE FURNISHINGS                 127.446612
                  TABLES                            -274.411357
OFFICE SUPPLIES   APPLIANCES                         223.866498
                  BINDERS AND BINDER ACCESSORIES     335.970918
                  ENVELOPES                          195.864228
                  LABELS                              47.490174
                  PAPER                               36.949551
                  PENS & ART SUPPLIES                 11.950679
                  RUBBER BANDS                        -0.573575
                  SCISSORS, RULERS AND TRIMMERS      -54.161458
                  STORAGE & ORGANIZATION              12.205403
TECHNOLOGY        COMPUTER PERIPHERALS               124.389815
                  COPIERS AND FAX                   192

In [79]:
master_df.groupby(['Product_Category', 'Product_Sub_Category'])['Profit'].describe()

Product_Category,Product_Sub_Category,count,mean,std,min,25%,50%,75%,max
FURNITURE,BOOKCASES,189.0,-177.683228,1707.455501,-11053.6,-662.8,-305.98,-78.36,7513.88
FURNITURE,CHAIRS & CHAIRMATS,386.0,387.693601,1482.276988,-3404.24,-300.5225,-64.58,761.3175,8614.79
FURNITURE,OFFICE FURNISHINGS,788.0,127.446612,463.997735,-1570.32,-34.075,24.955,171.2975,3408.46
FURNITURE,TABLES,361.0,-274.411357,1148.310769,-6474.65,-694.33,-352.96,31.21,5626.42
OFFICE SUPPLIES,APPLIANCES,434.0,223.866498,817.377547,-2172.14,-82.1,6.375,448.635,5183.04
OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,915.0,335.970918,1349.974,-961.5,-64.64,-9.72,72.665,11535.28
OFFICE SUPPLIES,ENVELOPES,246.0,195.864228,479.703533,-201.6,-6.6925,39.315,204.12,3187.37
OFFICE SUPPLIES,LABELS,288.0,47.490174,136.013924,-223.5,10.5625,35.77,58.74,1704.0
OFFICE SUPPLIES,PAPER,1225.0,36.949551,217.200169,-331.63,-69.29,-14.35,55.82,1480.15
OFFICE SUPPLIES,PENS & ART SUPPLIES,633.0,11.950679,77.341605,-216.66,-15.03,0.82,24.6,502.42


In [86]:
# E.g.: Which product categories and sub-categories are the least profitable?
# 1. Group by category and sub-category

by_product_cat_subcat = master_df.groupby(['Product_Category', 'Product_Sub_Category'])
# To apply multiple functions simultaneously, you can use the describe() function on the grouped df object
by_product_cat_subcat['Profit'].describe()

Product_Category,Product_Sub_Category,count,mean,std,min,25%,50%,75%,max
FURNITURE,BOOKCASES,189.0,-177.683228,1707.455501,-11053.6,-662.8,-305.98,-78.36,7513.88
FURNITURE,CHAIRS & CHAIRMATS,386.0,387.693601,1482.276988,-3404.24,-300.5225,-64.58,761.3175,8614.79
FURNITURE,OFFICE FURNISHINGS,788.0,127.446612,463.997735,-1570.32,-34.075,24.955,171.2975,3408.46
FURNITURE,TABLES,361.0,-274.411357,1148.310769,-6474.65,-694.33,-352.96,31.21,5626.42
OFFICE SUPPLIES,APPLIANCES,434.0,223.866498,817.377547,-2172.14,-82.1,6.375,448.635,5183.04
OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,915.0,335.970918,1349.974,-961.5,-64.64,-9.72,72.665,11535.28
OFFICE SUPPLIES,ENVELOPES,246.0,195.864228,479.703533,-201.6,-6.6925,39.315,204.12,3187.37
OFFICE SUPPLIES,LABELS,288.0,47.490174,136.013924,-223.5,10.5625,35.77,58.74,1704.0
OFFICE SUPPLIES,PAPER,1225.0,36.949551,217.200169,-331.63,-69.29,-14.35,55.82,1480.15
OFFICE SUPPLIES,PENS & ART SUPPLIES,633.0,11.950679,77.341605,-216.66,-15.03,0.82,24.6,502.42
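`describe()` returns a fixed set of statistics. When you want to choose the statistics yourself, one option is `agg` with a list of function names. The following is a sketch on made-up rows; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the profit values are made up
toy = pd.DataFrame({
    "Product_Category": ["FURNITURE", "FURNITURE", "TECHNOLOGY", "TECHNOLOGY"],
    "Product_Sub_Category": ["TABLES", "TABLES",
                             "COPIERS AND FAX", "COPIERS AND FAX"],
    "Profit": [-300.0, 100.0, 50.0, 250.0],
})
grouped = toy.groupby(["Product_Category", "Product_Sub_Category"])["Profit"]

# A list of function names produces one output column per statistic
summary = grouped.agg(["count", "mean", "median", "sum"])
print(summary)
```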


In [85]:
# Some other summary functions to apply on groups
pd.DataFrame(by_product_cat_subcat['Profit'].count())

Product_Category,Product_Sub_Category,Profit
FURNITURE,BOOKCASES,189
FURNITURE,CHAIRS & CHAIRMATS,386
FURNITURE,OFFICE FURNISHINGS,788
FURNITURE,TABLES,361
OFFICE SUPPLIES,APPLIANCES,434
OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,915
OFFICE SUPPLIES,ENVELOPES,246
OFFICE SUPPLIES,LABELS,288
OFFICE SUPPLIES,PAPER,1225
OFFICE SUPPLIES,PENS & ART SUPPLIES,633


Thus, within FURNITURE, TABLES are the least profitable, followed by BOOKCASES.
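Rather than reading the ranking off the full table, you can select one category from the two-level result and sort it. The sketch below uses made-up profit values (they are not the figures from the retail data) to show the selection on the first index level.

```python
import pandas as pd

# Hypothetical toy rows; the profit values are made up
toy = pd.DataFrame({
    "Product_Category": ["FURNITURE", "FURNITURE", "FURNITURE", "TECHNOLOGY"],
    "Product_Sub_Category": ["TABLES", "BOOKCASES",
                             "CHAIRS & CHAIRMATS", "COPIERS AND FAX"],
    "Profit": [-270.0, -180.0, 390.0, 190.0],
})
mean_profit = toy.groupby(
    ["Product_Category", "Product_Sub_Category"]
)["Profit"].mean()

# Selecting on the first index level returns the sub-categories of one category
furniture = mean_profit.loc["FURNITURE"].sort_values()
print(furniture)
print(furniture.index[0])   # least profitable sub-category in this toy data
```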

In [80]:
# Recall the df.describe() method?
# To apply multiple functions simultaneously, you can use the describe() function on the grouped df object
by_product_cat['Profit'].describe()

Product_Category,count,mean,std,min,25%,50%,75%,max
FURNITURE,1724.0,68.116607,1112.923257,-11053.6,-281.355,-14.25,187.16,8614.79
OFFICE SUPPLIES,4610.0,112.369074,744.617939,-2175.09,-57.0225,-3.845,56.9475,11535.28
TECHNOLOGY,2065.0,429.207516,1863.208375,-14140.7,-88.94,66.22,561.13,27220.69


In [69]:
# Some other summary functions to apply on groups
by_product_cat['Profit'].count()

Product_Category
FURNITURE          1724
OFFICE SUPPLIES    4610
TECHNOLOGY         2065
Name: Profit, dtype: int64

In [70]:
by_product_cat['Profit'].min()

Product_Category
FURNITURE         -11053.60
OFFICE SUPPLIES    -2175.09
TECHNOLOGY        -14140.70
Name: Profit, dtype: float64

In [87]:
# E.g. Customers in which geographic region are the least profitable?
master_df.groupby('Region').Profit.mean().sort_values()

Region
NUNAVUT                   35.963418
YUKON                    136.253155
WEST                     149.175595
QUEBEC                   179.803649
PRARIE                   188.253294
ONTARIO                  189.960865
ATLANTIC                 221.259870
NORTHWEST TERRITORIES    255.464670
Name: Profit, dtype: float64

In [90]:
master_df.groupby('Region')['Sales'].count().sort_values()

Region
NUNAVUT                    79
NORTHWEST TERRITORIES     394
YUKON                     542
QUEBEC                    781
ATLANTIC                 1080
PRARIE                   1706
ONTARIO                  1826
WEST                     1991
Name: Sales, dtype: int64
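The two cells above look at mean profit and row counts separately, yet the two are best read together: NUNAVUT has the lowest mean profit but also only 79 rows. One way to get both in a single table is named aggregation. This is a sketch on made-up rows; the output column names `mean_profit` and `n_orders` are arbitrary choices, not names from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the profit values are made up
toy = pd.DataFrame({
    "Region": ["NUNAVUT", "WEST", "WEST", "WEST", "ONTARIO", "ONTARIO"],
    "Profit": [30.0, 100.0, 200.0, 150.0, 180.0, 200.0],
})

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, function) pair
region_summary = (
    toy.groupby("Region")
       .agg(mean_profit=("Profit", "mean"), n_orders=("Profit", "count"))
       .sort_values("mean_profit")
)
print(region_summary)
```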

In [83]:
# Note that the resulting object is a Series, so you can perform vectorised computations on it

# E.g. Calculate the Sales in each region as a percentage of total Sales
# You can divide the entire series by a number (total sales) easily
(master_df.groupby('Region')['Sales'].sum() / master_df['Sales'].sum() * 100).sort_values(ascending=False)

Region
WEST                     24.119372
ONTARIO                  20.536970
PRARIE                   19.022396
ATLANTIC                 13.504305
QUEBEC                   10.124936
YUKON                     6.542595
NORTHWEST TERRITORIES     5.369193
NUNAVUT                   0.780233
Name: Sales, dtype: float64
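The percentage calculation above relies on dividing a whole Series by a single number. The sketch below breaks that into steps on made-up rows (the `toy` dataframe is hypothetical) and confirms that the resulting shares add up to 100.

```python
import pandas as pd

# Hypothetical toy rows; the sales values are made up
toy = pd.DataFrame({
    "Region": ["WEST", "WEST", "ONTARIO", "NUNAVUT"],
    "Sales": [300.0, 100.0, 500.0, 100.0],
})

region_sales = toy.groupby("Region")["Sales"].sum()   # a Series indexed by Region
total_sales = toy["Sales"].sum()                       # a single number

# Series / scalar divides every element, so no loop is needed
share_pct = (region_sales / total_sales * 100).sort_values(ascending=False)
print(share_pct)
print(share_pct.sum())
```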

In [74]:
master_df.loc[:, 'Product_Base_Margin': 'Region' ].head(3)

,Product_Base_Margin,Customer_Name,Province,Region
0,0.56,AARON BERGMAN,ALBERTA,WEST
1,0.59,AARON BERGMAN,ALBERTA,WEST
2,0.37,AARON BERGMAN,ALBERTA,WEST


In [79]:
master_df.groupby(['Region', 'Province']).Sales.describe()

Region,Province,count,mean,std,min,25%,50%,75%,max
ATLANTIC,NEW BRUNSWICK,323.0,2118.30193,6095.496542,4.97,145.495,536.38,1940.02,89061.05
ATLANTIC,NEWFOUNDLAND,82.0,1255.171555,2115.000529,8.87,126.9425,509.725,1364.125625,12098.87
ATLANTIC,NOVA SCOTIA,464.0,1762.347764,3509.566627,8.49,146.1825,471.715,1859.5975,28180.08
ATLANTIC,PRINCE EDWARD ISLAND,211.0,1940.204976,3409.353673,14.23,177.275,572.4325,1979.475,23255.61
NORTHWEST TERRITORIES,NORTHWEST TERRITORIES,394.0,2032.607435,3851.533817,4.99,165.47,494.535,1870.95,26133.39
NUNAVUT,NUNAVUT,79.0,1473.120044,2735.648992,14.76,141.49,370.48,1238.38575,14223.82
ONTARIO,ONTARIO,1826.0,1677.553384,3079.057136,3.63,138.2925,451.265,1671.105,24051.49
PRARIE,MANITOBA,793.0,1731.209057,3409.114311,7.15,145.42,438.47,1642.642,29345.27
PRARIE,SASKACHEWAN,913.0,1604.004183,3315.760705,5.63,153.28,458.8,1479.14,41343.21
QUEBEC,QUEBEC,781.0,1933.668476,4082.113932,3.42,137.97,436.78,1750.0,45923.76


The regions WEST, ONTARIO and PRARIE together account for about 64% of total sales.
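The 64% figure is the sum of the three largest regional shares. A quick way to check it is a running total over the sorted percentages; the numbers below are copied from the percentage output printed earlier, and the variable names are arbitrary.

```python
import pandas as pd

# Percentages copied from the sorted regional sales-share output
share_pct = pd.Series({
    "WEST": 24.119372, "ONTARIO": 20.536970, "PRARIE": 19.022396,
    "ATLANTIC": 13.504305, "QUEBEC": 10.124936, "YUKON": 6.542595,
    "NORTHWEST TERRITORIES": 5.369193, "NUNAVUT": 0.780233,
})

# Running total of the shares, largest region first
cumulative = share_pct.cumsum()
print(cumulative.round(2))
```

The third entry of the running total (WEST + ONTARIO + PRARIE) is the share quoted in the text.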

Until now, we've been working with the data without making changes or additions to it. In the next section, we will create new columns, alter existing columns and apply some more grouping and summarising.



In [84]:
df = pd.read_csv('https://query.data.world/s/vBDCsoHCytUSLKkLvq851k2b8JOCkF')
df

,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.00
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.00
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.00
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.00
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,4,3,aug,sun,81.6,56.7,665.6,1.9,27.8,32,2.7,0.0,6.44
513,2,4,aug,sun,81.6,56.7,665.6,1.9,21.9,71,5.8,0.0,54.29
514,7,4,aug,sun,81.6,56.7,665.6,1.9,21.2,70,6.7,0.0,11.16
515,1,4,aug,sat,94.4,146.0,614.7,11.3,25.6,42,4.0,0.0,0.00


In [87]:
df.groupby(['month','day'])[['rain', 'wind']].mean()

month,day,rain,wind
apr,fri,0.0,3.100000
apr,mon,0.0,3.100000
apr,sat,0.0,4.500000
apr,sun,0.0,5.666667
apr,thu,0.0,5.800000
...,...,...,...
sep,sat,0.0,3.460000
sep,sun,0.0,3.955556
sep,thu,0.0,3.357143
sep,tue,0.0,3.431579
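The output above lists the months alphabetically (apr first), because group keys stored as plain strings are sorted as strings. If calendar order is wanted, one option is to convert the column to an ordered categorical before grouping. This is a sketch on made-up rows that reuse the `month` and `wind` column names; the `month_order` list is an assumption about how the months are abbreviated.

```python
import pandas as pd

# Hypothetical toy rows using the same month and wind column names
toy = pd.DataFrame({
    "month": ["sep", "apr", "mar", "apr", "sep"],
    "wind": [3.0, 4.0, 5.0, 6.0, 5.0],
})

# Assumed three-letter, lower-case month abbreviations in calendar order
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data
wind_by_month = toy.groupby("month", observed=True)["wind"].mean()
print(wind_by_month)
```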
