## Module 6: GroupBy

- Creates categories/chunks/segments from a DataFrame.
- Splits the data into groups, one for each unique value in the column of choice.


In [1]:
import pandas as pd
fortune = pd.read_csv("data/fortune1000.csv", index_col="Rank")
fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


- Calling the **groupby()** method returns a new *DataFrameGroupBy* object.
- A **groupby object** does not hold a result by itself; it produces output only once we apply a method (such as an aggregation) to it.

In [2]:
sectors = fortune.groupby("Sector")
sectors

<pandas.core.groupby.DataFrameGroupBy object at 0x000001DD97C9BDD8>

In [3]:
sectors.groups.keys()


dict_keys(['Aerospace & Defense', 'Apparel', 'Business Services', 'Chemicals', 'Energy', 'Engineering & Construction', 'Financials', 'Food and Drug Stores', 'Food, Beverages & Tobacco', 'Health Care', 'Hotels, Resturants & Leisure', 'Household Products', 'Industrials', 'Materials', 'Media', 'Motor Vehicles & Parts', 'Retailing', 'Technology', 'Telecommunications', 'Transportation', 'Wholesalers'])
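To see the split step in isolation, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and only mimic the layout of `fortune`.

```python
import pandas as pd

# Toy stand-in for the fortune DataFrame (values are invented)
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Revenue": [100, 80, 60, 40],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

# groupby() only describes how to split the rows; it returns no table yet
grouped = toy.groupby("Sector")

# The group labels and the number of groups can be inspected directly
group_names = list(grouped.groups.keys())
n_groups = grouped.ngroups

# Applying a method to the groupby object produces an actual result
revenue_per_sector = grouped["Revenue"].sum()
print(group_names)
print(revenue_per_sector)
```

Each unique value in `Sector` becomes one group, and the aggregation returns one row per group.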

### Operations and Attributes with groupby Object

- **size()** : Number of rows in each group.
- **first()** : First non-null value of each column within each group.
- **last()** : Last non-null value of each column within each group.

In [4]:
sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

The first company in the Aerospace & Defense sector is **Boeing**. If we filter the original DataFrame on the Sector column, the first row is **Boeing** as well.

In [5]:
sectors.first().head(2)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,"Chicago, IL",96114,5176,161400
Apparel,Nike,Apparel,"Beaverton, OR",30601,3273,62600


In [6]:
fortune[fortune["Sector"] == "Aerospace & Defense"].head(2)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
45,United Technologies,Aerospace & Defense,Aerospace and Defense,"Farmington, CT",61047,7608,197200


The last company in the Aerospace & Defense sector is **Delta Tucker Holdings**. Filtering the original DataFrame confirms that the last row for this sector is also **Delta Tucker Holdings**.

In [7]:
sectors.last().head(2)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,"McLean, VA",1923,-133,12000
Apparel,Guess,Apparel,"Los Angeles, CA",2204,82,13500


In [8]:
fortune[fortune["Sector"] == "Aerospace & Defense"].tail(1)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
987,Delta Tucker Holdings,Aerospace & Defense,Aerospace and Defense,"McLean, VA",1923,-133,12000
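One detail worth seeing on a small example: `first()` and `last()` work column by column and skip missing values, while `size()` counts every row. The sketch below uses an invented frame `toy` with some missing profits.

```python
import numpy as np
import pandas as pd

# Invented data; two Profits values are deliberately missing
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing", "Retailing"],
        "Company": ["A", "B", "C", "D"],
        "Profits": [np.nan, 5.0, 7.0, np.nan],
    }
)
grouped = toy.groupby("Sector")

sizes = grouped.size()    # rows per group, missing values included
firsts = grouped.first()  # first non-null value in each column
lasts = grouped.last()    # last non-null value in each column

print(sizes)
print(firsts)
print(lasts)
```

Because nulls are skipped per column, a row returned by `first()` can combine values from different original rows when the data has gaps.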


**Attributes: groups**

Returns a dictionary that maps each group name to the index labels (here, the Rank values) of the rows belonging to that group.

In [9]:
sectors.groups

{'Aerospace & Defense': Int64Index([ 24,  45,  60,  88, 118, 120, 209, 245, 282, 378, 389, 490, 560,
             605, 785, 788, 836, 903, 958, 987],
            dtype='int64', name='Rank'),
 'Apparel': Int64Index([91, 231, 340, 354, 448, 547, 575, 597, 683, 695, 726, 794, 877,
             882, 917],
            dtype='int64', name='Rank'),
 'Business Services': Int64Index([144, 186, 199, 204, 221, 248, 249, 294, 307, 312, 355, 392, 404,
             440, 467, 468, 481, 485, 492, 503, 545, 626, 635, 652, 677, 694,
             714, 729, 734, 735, 737, 744, 767, 776, 777, 783, 791, 792, 796,
             801, 803, 816, 819, 820, 869, 870, 886, 939, 951, 952, 993],
            dtype='int64', name='Rank'),
 'Chemicals': Int64Index([ 56, 101, 182, 189, 206, 253, 262, 277, 288, 296, 316, 538, 549,
             555, 566, 580, 613, 624, 654, 668, 717, 720, 724, 758, 761, 829,
             865, 898, 934, 949],
            dtype='int64', name='Rank'),
 'Energy': Int64Index([  2,  14,  30,  32,
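As a rough illustration of how the `groups` dictionary can be used, the index labels stored for one group can be passed to `.loc` to pull that group's rows. The frame `toy` below is invented for the example.

```python
import pandas as pd

# Invented data indexed by Rank, like the fortune DataFrame
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)
grouped = toy.groupby("Sector")

# Index labels (Rank values) of the rows in the Retailing group
retail_labels = grouped.groups["Retailing"]

# Use those labels to select the matching rows from the original frame
retail_rows = toy.loc[retail_labels]
print(retail_rows)
```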

### get_group() method

Constructs a DataFrame for a specific group, selected by its name.

For example, we want a list of the companies in the Retailing sector, sorted by revenue.

In [10]:
sectors.get_group("Retailing").sort_values(by="Revenue", ascending=False).head(10)

Rank,Company,Employees,Industry,Location,Profits,Revenue
1,Walmart,2300000,General Merchandisers,"Bentonville, AR",14694,482130
15,Costco,161000,Specialty Retailers: Other,"Issaquah, WA",2377,116199
28,Home Depot,385000,Specialty Retailers: Other,"Atlanta, GA",7009,88519
38,Target,341000,General Merchandisers,"Minneapolis, MN",3363,73785
47,Lowe’s,225000,Specialty Retailers: Other,"Mooresville, NC",2546,59074
71,Best Buy,125000,Specialty Retailers: Other,"Richfield, MN",897,39745
89,TJX,216000,Specialty Retailers: Apparel,"Framingham, MA",2278,30945
103,Macy’s,157500,General Merchandisers,"Cincinnati, OH",1072,27079
111,Sears Holdings,178000,General Merchandisers,"Hoffman Estates, IL",-1129,25146
132,Staples,58963,Specialty Retailers: Other,"Framingham, MA",379,21059


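We can also use the **masking technique** (boolean indexing). The approach is different, but it selects the same rows. Here the Sector column is kept in the output, and the rows stay in their original Rank order.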

In [11]:
fortune[fortune["Sector"] == "Retailing"]

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
15,Costco,Retailing,Specialty Retailers: Other,"Issaquah, WA",116199,2377,161000
28,Home Depot,Retailing,Specialty Retailers: Other,"Atlanta, GA",88519,7009,385000
38,Target,Retailing,General Merchandisers,"Minneapolis, MN",73785,3363,341000
47,Lowe’s,Retailing,Specialty Retailers: Other,"Mooresville, NC",59074,2546,225000
71,Best Buy,Retailing,Specialty Retailers: Other,"Richfield, MN",39745,897,125000
89,TJX,Retailing,Specialty Retailers: Apparel,"Framingham, MA",30945,2278,216000
103,Macy’s,Retailing,General Merchandisers,"Cincinnati, OH",27079,1072,157500
111,Sears Holdings,Retailing,General Merchandisers,"Hoffman Estates, IL",25146,-1129,178000
132,Staples,Retailing,Specialty Retailers: Other,"Framingham, MA",21059,379,58963
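The equivalence of the two approaches can be checked on a small, invented frame. The comparison below looks only at the index and the non-key columns, since whether `get_group()` keeps the grouping column may depend on the pandas version.

```python
import pandas as pd

# Invented data indexed by Rank
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Revenue": [100, 80, 60, 40],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

via_group = toy.groupby("Sector").get_group("Retailing")
via_mask = toy[toy["Sector"] == "Retailing"]

# Both selections should contain the same rows with the same values
same_rows = via_group.index.equals(via_mask.index)
same_values = via_group[["Company", "Revenue"]].equals(
    via_mask[["Company", "Revenue"]]
)
print(same_rows, same_values)
```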


### sum() method

Returns the sum of each numeric column for every group.

In [12]:
# numeric_only=True keeps text columns such as Company out of the sum
sectors.sum(numeric_only=True)

Sector,Revenue,Profits,Employees
Aerospace & Defense,357940,28742,968057
Apparel,95968,8236,346397
Business Services,272195,28227,1361050
Chemicals,243897,22628,463651
Energy,1517809,-73447,1188927
Engineering & Construction,153983,5304,406708
Financials,2217159,260209,3359948
Food and Drug Stores,483769,16759,1395398
"Food, Beverages & Tobacco",555967,51417,1211632
Health Care,1614707,106114,2678289


### count() method

Counts the non-null values in every column for each group. For example, the output below shows 20 in every column for the Aerospace & Defense sector, which means that sector has 20 companies (rows) with no missing values; it is not a count of employees.

In [13]:
sectors.count().head(10)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,20,20,20,20,20,20
Apparel,15,15,15,15,15,15
Business Services,51,51,51,51,51,51
Chemicals,30,30,30,30,30,30
Energy,122,122,122,122,122,122
Engineering & Construction,26,26,26,26,26,26
Financials,139,139,139,139,139,139
Food and Drug Stores,15,15,15,15,15,15
"Food, Beverages & Tobacco",43,43,43,43,43,43
Health Care,75,75,75,75,75,75
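The difference between `count()` and `size()` only shows up when values are missing. A small sketch with invented data, where one Employees value is null:

```python
import numpy as np
import pandas as pd

# Invented data; one Employees value is missing
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Energy", "Retailing"],
        "Employees": [10.0, np.nan, 30.0, 5.0],
    }
)
grouped = toy.groupby("Sector")

counts = grouped["Employees"].count()  # non-null values per group
sizes = grouped.size()                 # all rows per group

print(counts)
print(sizes)
```

In the fortune data the two agree, because no column has missing values in the rows shown.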


### mean() method

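Returns the average value of each numeric column for every group.

In [14]:
# numeric_only=True skips text columns, which cannot be averaged
sectors.mean(numeric_only=True)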

Sector,Revenue,Profits,Employees
Aerospace & Defense,17897.0,1437.1,48402.85
Apparel,6397.866667,549.066667,23093.133333
Business Services,5337.156863,553.470588,26687.254902
Chemicals,8129.9,754.266667,15455.033333
Energy,12441.057377,-602.02459,9745.303279
Engineering & Construction,5922.423077,204.0,15642.615385
Financials,15950.784173,1872.007194,24172.28777
Food and Drug Stores,32251.266667,1117.266667,93026.533333
"Food, Beverages & Tobacco",12929.465116,1195.744186,28177.488372
Health Care,21529.426667,1414.853333,35710.52


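Extract the average of a specific column by specifying its name inside square brackets at the end. Selecting the column first, as in `sectors["Revenue"].mean()`, gives the same result without averaging the other columns.


In [15]:
sectors.mean(numeric_only=True)["Revenue"]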

Sector
Aerospace & Defense             17897.000000
Apparel                          6397.866667
Business Services                5337.156863
Chemicals                        8129.900000
Energy                          12441.057377
Engineering & Construction       5922.423077
Financials                      15950.784173
Food and Drug Stores            32251.266667
Food, Beverages & Tobacco       12929.465116
Health Care                     21529.426667
Hotels, Resturants & Leisure     6781.840000
Household Products               8383.464286
Industrials                     10816.978261
Materials                        6026.627907
Media                            8830.560000
Motor Vehicles & Parts          20105.833333
Retailing                       18313.450000
Technology                      13505.882353
Telecommunications              30788.933333
Transportation                  11347.444444
Wholesalers                     11120.000000
Name: Revenue, dtype: float64
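A short sketch, on invented data, comparing the two ways of getting a per-group average for one column, and showing that a list of column names selects several columns at once:

```python
import pandas as pd

# Invented data with one text column and two numeric columns
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Revenue": [100, 80, 60, 40],
        "Profits": [10, 8, 6, 4],
    }
)
by_sector = toy.groupby("Sector")

# Select the column first, then aggregate
col_first = by_sector["Revenue"].mean()

# Aggregate every numeric column, then pick one
agg_first = by_sector.mean(numeric_only=True)["Revenue"]

# A list of names keeps several columns and returns a DataFrame
two_cols = by_sector[["Revenue", "Profits"]].mean()

print(col_first)
print(two_cols)
```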

### max() method

This method returns the highest value in the specified column for each group.

For example, we can use it to find the largest profit reported by any single company in each Sector.

In [16]:
sectors["Profits"].max()

Sector
Aerospace & Defense              7608
Apparel                          3273
Business Services                6328
Chemicals                        7685
Energy                          16150
Engineering & Construction        803
Financials                      24442
Food and Drug Stores             5237
Food, Beverages & Tobacco        7351
Health Care                     18108
Hotels, Resturants & Leisure     5920
Household Products               7036
Industrials                      4833
Materials                         991
Media                            8382
Motor Vehicles & Parts           9687
Retailing                       14694
Technology                      53394
Telecommunications              17879
Transportation                   7610
Wholesalers                      1472
Name: Profits, dtype: int64

### min() method

This method is the opposite of the max() method. It returns the lowest value in the specified column for each group.

Here we use the min() method to get the lowest revenue of any company in each sector.

In [17]:
sectors["Revenue"].min()

Sector
Aerospace & Defense             1923
Apparel                         2204
Business Services               1910
Chemicals                       2084
Energy                          1898
Engineering & Construction      1909
Financials                      1902
Food and Drug Stores            2151
Food, Beverages & Tobacco       2066
Health Care                     1987
Hotels, Resturants & Leisure    1896
Household Products              1914
Industrials                     1895
Materials                       1924
Media                           1921
Motor Vehicles & Parts          1986
Retailing                       1999
Technology                      1920
Telecommunications              2726
Transportation                  1995
Wholesalers                     1917
Name: Revenue, dtype: int64
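`max()` and `min()` return only the values, not the companies they belong to. One possible way to recover the matching rows is `idxmax()`, which returns the index label of the maximum in each group. The sketch below uses invented data.

```python
import pandas as pd

# Invented data indexed by Rank
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Profits": [14, 16, 20, -3],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)
grouped = toy.groupby("Sector")

top_profit = grouped["Profits"].max()    # highest profit per sector
top_rank = grouped["Profits"].idxmax()   # Rank label of that company

# Look the labels up in the original frame to see who earned it
top_companies = toy.loc[top_rank, ["Sector", "Company", "Profits"]]
print(top_companies)
```

`idxmin()` works the same way for the lowest value.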

### Grouping by multiple columns

In the previous examples, we grouped by only one column/category. Now, let's group by multiple columns.

In [18]:
multiple = fortune.groupby(["Sector", "Industry"])
multiple

<pandas.core.groupby.DataFrameGroupBy object at 0x000001DD992666A0>

Grouping by multiple columns gives more detailed information: the result is indexed by every (Sector, Industry) combination, so each sector is broken down by industry.

In the rows shown below, there are 7 industries in the Business Services sector, while the Aerospace & Defense sector has only one.

In [19]:
multiple.sum(numeric_only=True)

Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,Aerospace and Defense,357940,28742,968057
Apparel,Apparel,95968,8236,346397
Business Services,"Advertising, marketing",22748,1549,124100
Business Services,Diversified Outsourcing Services,64829,4305,708330
Business Services,Education,7485,69,46755
Business Services,Financial Data Services,100778,17456,264926
Business Services,Miscellaneous,11185,2130,37720
Business Services,Temporary Help,34716,1000,60020
Business Services,Waste Management,30454,1718,119199
Chemicals,Chemicals,243897,22628,463651
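The result of a multi-column groupby has a two-level index. A minimal sketch, on invented data, of how that index can be queried or flattened:

```python
import pandas as pd

# Invented data with two grouping columns
toy = pd.DataFrame(
    {
        "Sector": ["Business Services", "Business Services",
                   "Business Services", "Chemicals"],
        "Industry": ["Education", "Temporary Help", "Education", "Chemicals"],
        "Revenue": [10, 20, 5, 40],
    }
)

totals = toy.groupby(["Sector", "Industry"])["Revenue"].sum()

# Select one sector: the result is indexed by Industry only
business = totals.loc["Business Services"]

# Select one (Sector, Industry) pair with a tuple
education_total = totals.loc[("Business Services", "Education")]

# Count the distinct industries in each sector
industries_per_sector = toy.groupby("Sector")["Industry"].nunique()

# Turn the index levels back into ordinary columns
flat = totals.reset_index()
print(flat)
```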


### agg() method

Aggregates each group using one or more operations.

- **dictionary** : applies the specified operation to each named column.
- **list** : applies every operation in the list to each column.
- **dictionary with list values** : applies several operations to specific columns.

In [20]:
sectors.agg({
    "Revenue" : "sum",
    "Profits" : "sum",
    "Employees" : "mean"
})

Sector,Revenue,Profits,Employees
Aerospace & Defense,357940,28742,48402.85
Apparel,95968,8236,23093.133333
Business Services,272195,28227,26687.254902
Chemicals,243897,22628,15455.033333
Energy,1517809,-73447,9745.303279
Engineering & Construction,153983,5304,15642.615385
Financials,2217159,260209,24172.28777
Food and Drug Stores,483769,16759,93026.533333
"Food, Beverages & Tobacco",555967,51417,28177.488372
Health Care,1614707,106114,35710.52


In [21]:
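# Select the numeric columns first; "mean" is not defined for the text columns
sectors[["Revenue", "Profits", "Employees"]].agg(["sum", "mean", "size"])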

Sector,Revenue sum,Revenue mean,Revenue size,Profits sum,Profits mean,Profits size,Employees sum,Employees mean,Employees size
Aerospace & Defense,357940,17897.0,20,28742,1437.1,20,968057,48402.85,20
Apparel,95968,6397.866667,15,8236,549.066667,15,346397,23093.133333,15
Business Services,272195,5337.156863,51,28227,553.470588,51,1361050,26687.254902,51
Chemicals,243897,8129.9,30,22628,754.266667,30,463651,15455.033333,30
Energy,1517809,12441.057377,122,-73447,-602.02459,122,1188927,9745.303279,122
Engineering & Construction,153983,5922.423077,26,5304,204.0,26,406708,15642.615385,26
Financials,2217159,15950.784173,139,260209,1872.007194,139,3359948,24172.28777,139
Food and Drug Stores,483769,32251.266667,15,16759,1117.266667,15,1395398,93026.533333,15
"Food, Beverages & Tobacco",555967,12929.465116,43,51417,1195.744186,43,1211632,28177.488372,43
Health Care,1614707,21529.426667,75,106114,1414.853333,75,2678289,35710.52,75


In [22]:
sectors.agg({
    "Revenue" : ["sum","size"],
    "Profits" : "sum",
    "Employees" : "mean"
})

Sector,Revenue sum,Revenue size,Profits sum,Employees mean
Aerospace & Defense,357940,20,28742,48402.85
Apparel,95968,15,8236,23093.133333
Business Services,272195,51,28227,26687.254902
Chemicals,243897,30,22628,15455.033333
Energy,1517809,122,-73447,9745.303279
Engineering & Construction,153983,26,5304,15642.615385
Financials,2217159,139,260209,24172.28777
Food and Drug Stores,483769,15,16759,93026.533333
"Food, Beverages & Tobacco",555967,43,51417,28177.488372
Health Care,1614707,75,106114,35710.52
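The dictionary forms above produce two-level column headers once a column gets more than one operation. As an alternative not used in this module, pandas also supports named aggregation, where each output column is given as `name=(column, operation)`. The sketch below uses invented data, and the output names are arbitrary choices.

```python
import pandas as pd

# Invented data
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing", "Retailing"],
        "Revenue": [80, 40, 100, 60],
        "Employees": [10, 30, 20, 40],
    }
)

# Each keyword becomes one flat output column
summary = toy.groupby("Sector").agg(
    total_revenue=("Revenue", "sum"),
    n_companies=("Revenue", "count"),   # counts non-null Revenue values
    mean_employees=("Employees", "mean"),
)
print(summary)
```

The result has single-level column names, which can be easier to work with than the stacked headers.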


### Iterating through Groups

Create an empty DataFrame with columns based on the original dataset.

In [23]:
df = pd.DataFrame(columns=fortune.columns)
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees


Using a for loop, we can collect the row of the highest-revenue company in each sector, with all of its information (its location, company name and industry).

In [24]:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
for sector, data in sectors:
    highest = data.nlargest(1, "Revenue")
    df = pd.concat([df, highest])
df.head(10)

,Company,Sector,Industry,Location,Revenue,Profits,Employees
24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
144,ManpowerGroup,Business Services,Temporary Help,"Milwaukee, WI",19330,419,27000
56,Dow Chemical,Chemicals,Chemicals,"Midland, MI",48778,7685,49495
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
155,Fluor,Engineering & Construction,"Engineering, Construction","Irving, TX",18114,413,38758
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
41,Archer Daniels Midland,"Food, Beverages & Tobacco",Food Production,"Chicago, IL",67702,1849,32300
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [25]:
fortune.sort_values("Revenue", ascending=False).head(10)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
6,UnitedHealth Group,Health Care,Health Care: Insurance and Managed Care,"Minnetonka, MN",157107,5813,200000
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
8,General Motors,Motor Vehicles & Parts,Motor Vehicles and Parts,"Detroit, MI",152356,9687,215000
9,Ford Motor,Motor Vehicles & Parts,Motor Vehicles and Parts,"Dearborn, MI",149558,7373,199000
10,AT&T,Telecommunications,Telecommunications,"Dallas, TX",146801,13345,281450
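The loop in this section can also be written without growing a DataFrame step by step. The sketch below, on invented data, shows two possible variants: collecting the pieces in a list and concatenating once, and a loop-free lookup with `idxmax()`.

```python
import pandas as pd

# Invented data indexed by Rank
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Revenue": [100, 80, 60, 40],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

# Variant 1: iterate over (name, sub-DataFrame) pairs, concatenate once
frames = [data.nlargest(1, "Revenue") for _, data in toy.groupby("Sector")]
looped = pd.concat(frames)

# Variant 2: find the Rank label of the top row per sector, then look it up
top_idx = toy.groupby("Sector")["Revenue"].idxmax()
top_by_sector = toy.loc[top_idx]

print(looped)
print(top_by_sector)
```

Concatenating once at the end avoids copying the accumulated rows on every iteration.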
