# Assignment 29: Pandas Transform, Grouping, and Aggregation #

### Goals for this Assignment ###

By the time you have completed this assignment, you should be able to:

- Use the `transform` method on `Series` objects to apply a given function to every element
- Use the `groupby` method on `DataFrame` objects to split data up into different `DataFrame` objects, based on values in a column
- Use the `groupby` method on `DataFrame` objects to split data up into different `DataFrame` objects, based on some function's output
- Use the `groupby` and `aggregate` methods together to apply a function to multiple `DataFrame` objects simultaneously

## Step 1: Use the `transform` Method to Apply a Function to Each Element of a `Series` Object ##

### Background: `Series`' `transform` Method ###

We've previously seen that vector operations allow us to concisely and efficiently apply a given operation to each element of a `Series` object.
For example, the cell below increments each element of `Series` `input_series`, putting the results in `Series` `output_series`:

In [4]:
import pandas as pd

input_series = pd.Series([2, 8, 3, 9], index=["a", "b", "c", "d"])
output_series = input_series + 1
print(output_series)

a     3
b     9
c     4
d    10
dtype: int64


The one problem with vector operations is that we are limited to those operations defined by Pandas.
For example, say we have a `Series` of string objects, as shown in the next cell:

In [5]:
strings = pd.Series(["alpha", "beta", "gamma", "delta", "epsilon"],
                    index=["first", "second", "third", "fourth", "fifth"])
print(strings)

first       alpha
second       beta
third       gamma
fourth      delta
fifth     epsilon
dtype: object


If you run the above cell, you'll see that the `dtype` of the `Series` is `object`, corresponding to general Python objects.
This severely limits the vector operations that can be performed with this `Series` object.
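For example, if you wanted a `Series` object holding the length of each of these strings, no arithmetic-style vector operation (such as `+` or `*`) will compute it.
(Pandas does provide string-specific helpers through the `.str` accessor, but the goal here is a technique that works for arbitrary Python objects.)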

One way around this is to iterate through the `Series` object using either a loop or a list comprehension, as shown in the next cell:

In [6]:
print([len(s) for s in strings])

[5, 4, 5, 5, 7]


That is, you can iterate through a `Series` object as if it were a Python list.
However, the end result here is a regular Python list, _not_ a `Series` object.
This means that if you wanted to, say, compute the average string length of these strings with Pandas, you'd need to bundle this list back up into a `Series` object, as with the following:

In [7]:
string_lengths = pd.Series([len(s) for s in strings], index=strings.index)
print(string_lengths)
print(string_lengths.mean())

first     5
second    4
third     5
fourth    5
fifth     7
dtype: int64
5.2


There is, however, a much cleaner way to do this: use the `transform` method on `Series` objects.
The `transform` method takes a function, where that function takes a single argument and returns some new value.
`transform` will apply this function to each element of the `Series`, and return a new `Series` with the same indices as the original `Series` object.

We can use `transform` to simplify this problem of computing string lengths, as shown in the cell below:

In [8]:
string_lengths_with_transform = strings.transform(lambda s: len(s))
print(string_lengths_with_transform)
print(string_lengths_with_transform.mean())

first     5
second    4
third     5
fourth    5
fifth     7
dtype: int64
5.2


As shown above, `transform` takes a `lambda`, which itself takes some parameter `s`.
This `lambda` is then called with every string inside `strings`, one at a time, returning the length of each string (thanks to the call to `len`).
`transform` bundles up the results of these calls into a new `Series` object, which is bound to `string_lengths_with_transform`.
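
The function handed to `transform` does not have to be a `lambda`; any one-argument function works.
The cell below is a minimal sketch of this, using a hypothetical helper named `shout` and a small `Series` named `greek` (both made up for illustration), and it also checks that the indices of the result match those of the input:

```python
import pandas as pd

# hypothetical helper, defined only for this illustration
def shout(word):
    return word.upper() + "!"

greek = pd.Series(["alpha", "beta", "gamma"],
                  index=["first", "second", "third"])

# transform calls shout once per element and keeps the original indices
shouted = greek.transform(shout)
print(shouted)
print(list(shouted.index) == list(greek.index))
```

A named function is often easier to read than a `lambda` once the per-element logic grows past a single expression.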

Because `transform` takes a function, and functions can do practically anything, `transform` is very general.
For example, we can use `transform` to increment each element in a `Series`, in the same way that we did in the first Python cell.
This is shown below:

In [9]:
# repeated for convenience
input_series = pd.Series([2, 8, 3, 9], index=["a", "b", "c", "d"])
new_output_series = input_series.transform(lambda e: e + 1)
print(new_output_series)

a     3
b     9
c     4
d    10
dtype: int64


As shown above, `new_output_series` ends up holding the same results as the original `output_series` from before.
With this in mind, strictly speaking, from a software expressibility standpoint, we don't need vector operations at all, just `transform`; that is, `transform` is a general operation, and could be used to implement all the vector-based operations we have seen before.
That said, `transform` should be viewed more as a "last resort" operation, to be used only if there isn't a vector operation available (or some collection of vector operations) that does what you need.
In the prior cell, `input_series + 1` would not only do the same thing, but it would do it with far less code, and much more quickly.
While Pandas' vector operations are limited to whatever Pandas has implemented for us, those operations _will_ be faster than anything you can do with `transform`, ultimately because those operations will make much better use of specialized hardware-level operations.
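
To get a rough feel for this speed difference, the cell below is a small sketch that times both approaches with Python's standard `timeit` module.
The `Series` size (100,000 elements) and the repetition count are arbitrary choices made for illustration, and the exact timings will vary from machine to machine, so no specific numbers are claimed here:

```python
import timeit
import pandas as pd

numbers = pd.Series(range(100_000))

# both approaches should produce the same values
vectorized = numbers + 1
transformed = numbers.transform(lambda e: e + 1)
print(vectorized.equals(transformed))

# time each approach; number=5 means each one is run five times in total
vector_seconds = timeit.timeit(lambda: numbers + 1, number=5)
transform_seconds = timeit.timeit(
    lambda: numbers.transform(lambda e: e + 1), number=5)

print(f"vector operation: {vector_seconds:.4f} seconds")
print(f"transform:        {transform_seconds:.4f} seconds")
```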

### Try this Yourself ###

The cell below defines class `Foo`, where each instance of `Foo` holds an integer `some_integer`.
Several `Foo` objects are then bundled together into a Pandas `Series` object.
Using `transform`, create a new Pandas `Series` object holding the corresponding values of `some_integer` for each `Foo` object, and bind this to variable `foo_results`.
From there, print out `foo_results`.

In [10]:
class Foo:
    def __init__(self, some_integer):
        self.some_integer = some_integer

foos = pd.Series([Foo(2), Foo(7), Foo(1), Foo(9), Foo(0)],
                 index=["a", "b", "c", "d", "e"])
# call transform on foos here, and bind the return value
# to foo_results
foo_results = foos.transform(lambda e: e.some_integer)

print(foo_results)
# above statement should print:
# a    2
# b    7
# c    1
# d    9
# e    0
# dtype: int64

a    2
b    7
c    1
d    9
e    0
dtype: int64


## Step 2: Use the `groupby` Method to Split Data into Different `DataFrame` Objects Based on Values ##

### Background: `groupby` Method ###

Sometimes you want to lump data into different groups, where each member of a given group shares some property.
The groups are distinct and non-overlapping; each row belongs to exactly one group.
For example, say you have a `DataFrame` representing furniture, as shown below:

In [11]:
furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})
print(furniture)

    kind  color  age
0   sofa   blue    8
1  chair  green   46
2  table  beige    2
3  chair   blue    8
4  table  black    4
5  table  brown    1
6  chair  brown    3


If we want to determine the average age of all furniture in `furniture`, this wouldn't be so difficult, as shown in the next cell:

In [12]:
print(furniture["age"].mean())

10.285714285714286


Determining the average age for furniture of a particular kind is also not too difficult.
For example, if we want the average age of all chairs, we can use a mask to extract out the chairs, as with:

In [13]:
print(furniture[furniture["kind"] == "chair"]) # to show what the next line gets the mean of
print(furniture[furniture["kind"] == "chair"]["age"].mean())

    kind  color  age
1  chair  green   46
3  chair   blue    8
6  chair  brown    3
19.0


However, what if we want to determine the average age of all pieces of furniture, by type of furniture?
That is, for each type of furniture, there should be an average age for that specific kind of furniture in this `DataFrame`.

One approach to this would be to separately extract out the different kinds of furniture, and compute the means from there.
For example:

In [14]:
print(furniture[furniture["kind"] == "sofa"]["age"].mean())
print(furniture[furniture["kind"] == "chair"]["age"].mean())
print(furniture[furniture["kind"] == "table"]["age"].mean())

8.0
19.0
2.3333333333333335


However, we can start to see that this code is getting repetitive.
We could remove some of this repetition by extracting this out into a function, as with:

In [15]:
def mean_for_kind(wanted_kind):
    return furniture[furniture["kind"] == wanted_kind]["age"].mean()

print(mean_for_kind("sofa"))
print(mean_for_kind("chair"))
print(mean_for_kind("table"))

8.0
19.0
2.3333333333333335


However, arguably the only reason why this extraction is even feasible here is because we don't have a lot of different kinds of furniture in `furniture`.
If we had, say, 1,000 different kinds of furniture, then it would be impractical to separately call `mean_for_kind` for each one.
This also separately requires us to determine all the kinds of furniture we have.
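
As a sketch of what this manual approach could look like, the cell below first collects the distinct kinds with `unique`, and then loops over them, computing one mean per kind.
The dictionary name `means_by_kind` is made up for this illustration:

```python
import pandas as pd

furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})

# determine the kinds ourselves, then compute one mean per kind
means_by_kind = {}
for kind in furniture["kind"].unique():
    rows_of_kind = furniture[furniture["kind"] == kind]
    means_by_kind[kind] = rows_of_kind["age"].mean()

print(means_by_kind)
```

This works, but we had to write the bookkeeping (finding the kinds, masking, and collecting results) by hand.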

Fortunately, Pandas has a way of resolving this issue.
The resolution is broken into two parts:

1. We can divide a `DataFrame` into separate `DataFrame` objects, based on the specific values contained in the objects.  This is done with `DataFrame`'s `groupby` method.  For our purposes, this means grouping by the kind of furniture we have.
2. Once we have divided data into separate `DataFrames`, we can then _aggregate_ over the parts to get our values per part.  For our purposes, this means finding the average of the ages for each separate kind of furniture we have.

This step and the next will focus specifically on the `groupby` method, and then we will end with `aggregate` in Step 4.

In the cell below, we use `groupby` to separate the furniture according to the kind it is.

In [16]:
for kind, table_for_kind in furniture.groupby("kind"):
    print(f"---DataFrame for {kind}---")
    print(table_for_kind)
    print()

---DataFrame for chair---
    kind  color  age
1  chair  green   46
3  chair   blue    8
6  chair  brown    3

---DataFrame for sofa---
   kind color  age
0  sofa  blue    8

---DataFrame for table---
    kind  color  age
2  table  beige    2
4  table  black    4
5  table  brown    1



If you iterate through the result of `groupby` with `for...in` (as done in the cell above), you will get a series of 2-tuples back, where the first element of the tuple is the specific value (or _key_) all rows in the table share, and the second element of the tuple is a `DataFrame` holding every row that has that key.
As shown above, we ended up with three `DataFrame`s in total resulting from `groupby`, one for each kind of furniture; the kinds are determined by reading the values in the `"kind"` column.
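
Because each item is a (key, `DataFrame`) tuple, the groups can also be collected into a regular Python dictionary.
The cell below is a short sketch of this; the name `tables_by_kind` is made up for illustration.
It also shows the `get_group` method, which retrieves the `DataFrame` for a single key without a loop:

```python
import pandas as pd

furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})

# each iteration yields a (key, DataFrame) tuple, so a dict comprehension works
tables_by_kind = {kind: table for kind, table in furniture.groupby("kind")}
print(sorted(tables_by_kind))
print(tables_by_kind["chair"])

# get_group fetches the DataFrame for one specific key
sofas = furniture.groupby("kind").get_group("sofa")
print(sofas)
```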

We can similarly group by the colors of the furniture, as with:

In [17]:
for color, table_for_color in furniture.groupby("color"):
    print(f"---DataFrame for {color}---")
    print(table_for_color)
    print()

---DataFrame for beige---
    kind  color  age
2  table  beige    2

---DataFrame for black---
    kind  color  age
4  table  black    4

---DataFrame for blue---
    kind color  age
0   sofa  blue    8
3  chair  blue    8

---DataFrame for brown---
    kind  color  age
5  table  brown    1
6  chair  brown    3

---DataFrame for green---
    kind  color  age
1  chair  green   46



...or by age:

In [18]:
for age, table_for_age in furniture.groupby("age"):
    print(f"---DataFrame for {age}---")
    print(table_for_age)
    print()

---DataFrame for 1---
    kind  color  age
5  table  brown    1

---DataFrame for 2---
    kind  color  age
2  table  beige    2

---DataFrame for 3---
    kind  color  age
6  chair  brown    3

---DataFrame for 4---
    kind  color  age
4  table  black    4

---DataFrame for 8---
    kind color  age
0   sofa  blue    8
3  chair  blue    8

---DataFrame for 46---
    kind  color  age
1  chair  green   46



### Try this Yourself ###

In the next cell, `DataFrame` `hardware_inventory` (from the prior assignment) has been defined, which contains inventory information for a hardware store.
Using `groupby`, separate this `DataFrame` into separate `DataFrames`, one `DataFrame` per location in the store.
Similar to the examples above, you then must iterate over the results of `groupby` to print out the results.

In [20]:
hardware_inventory = pd.DataFrame({"product": ["hammer", "wrench", "screws (20)", "jigsaw"],
                                   "price": [20, 30, 2.5, 40],
                                   "quantity": [19, 15, 150, 12],
                                   "location": ["tools", "tools", "hardware", "tools"]})

# Put your code below which calls groupby on hardware_inventory, and prints out
# all the results with the help of a for...in loop

for location_name, table_for_location in hardware_inventory.groupby("location"):
    print(f"---DataFrame for {location_name}---")
    print(table_for_location)
    print()
    
# The expected output of your code is as follows:
# ---DataFrame for hardware---
#        product  price  quantity  location
# 2  screws (20)    2.5       150  hardware

# ---DataFrame for tools---
#   product  price  quantity location
# 0  hammer   20.0        19    tools
# 1  wrench   30.0        15    tools
# 3  jigsaw   40.0        12    tools

---DataFrame for hardware---
       product  price  quantity  location
2  screws (20)    2.5       150  hardware

---DataFrame for tools---
  product  price  quantity location
0  hammer   20.0        19    tools
1  wrench   30.0        15    tools
3  jigsaw   40.0        12    tools



## Step 3: Use `groupby` with a Custom Function to Split Data ##

### Background: `groupby` with a Function ###

Sometimes you want to group data via the result of an operation, instead of by specific value.
To motivate this, let's consider the patient information again from the prior assignment, where each patient has a particular name, age, and most recent blood pressure reading:

In [21]:
patients = pd.DataFrame({"name": ["abigail", "adam", "alice", "andrew", "barbara", "bob", "bill"],
                         "age": [45, 57, 25, 62, 30, 22, 43],
                         "blood_pressure": [135, 122, 118, 121, 132, 135, 126]})
print(patients)

      name  age  blood_pressure
0  abigail   45             135
1     adam   57             122
2    alice   25             118
3   andrew   62             121
4  barbara   30             132
5      bob   22             135
6     bill   43             126


Using `groupby`, we can divide these patients up by age, like so:

In [22]:
for age, table_for_age in patients.groupby("age"):
    print(f"---DataFrame for {age}---")
    print(table_for_age)
    print()

---DataFrame for 22---
  name  age  blood_pressure
5  bob   22             135

---DataFrame for 25---
    name  age  blood_pressure
2  alice   25             118

---DataFrame for 30---
      name  age  blood_pressure
4  barbara   30             132

---DataFrame for 43---
   name  age  blood_pressure
6  bill   43             126

---DataFrame for 45---
      name  age  blood_pressure
0  abigail   45             135

---DataFrame for 57---
   name  age  blood_pressure
1  adam   57             122

---DataFrame for 62---
     name  age  blood_pressure
3  andrew   62             121



As shown, because all the ages in this data set are unique, each patient will end up in their own `DataFrame`.
Depending on the specific analysis one wishes to perform, this is likely far too fine-grained.
While it can make sense to divide patients by age, it's more likely that we'd be interested in patients within a given age range, instead of patients sharing an exact age.
To do this, we can instead pass a function to `groupby`, either a `lambda` or a named function.
The function will take the index of the current row, and is expected to return whatever the corresponding key will be for the row.
The key itself will be used as the value to divide rows into separate `DataFrame`s.
An example of this is shown in the cell below, where patients are divided into one of four `DataFrame`s, based on their age:

In [23]:
def bucket_age(patient_index):
    age = patients.loc[patient_index, "age"]
    if age < 30:
        return "< 30"
    elif age < 50:
        return "30 - 49"
    elif age < 70:
        return "50 - 69"
    else:
        return ">= 70"

for age, table_for_age in patients.groupby(bucket_age):
    print(f"---DataFrame for {age}---")
    print(table_for_age)
    print()

---DataFrame for 30 - 49---
      name  age  blood_pressure
0  abigail   45             135
4  barbara   30             132
6     bill   43             126

---DataFrame for 50 - 69---
     name  age  blood_pressure
1    adam   57             122
3  andrew   62             121

---DataFrame for < 30---
    name  age  blood_pressure
2  alice   25             118
5    bob   22             135



As shown, `groupby` now takes the `bucket_age` function itself as a parameter.
`groupby` will call `bucket_age` for each row index of `patients`.
Inside the `bucket_age` function, we first determine the specific age of the current patient by accessing `patients` at the `"age"` column and at the specific `patient_index`.
From there, based on that `age`, we will return one of four strings, corresponding to the specific value of `age`.
These returned strings are used to divide data up into separate `DataFrame` objects, leading us to have one `DataFrame` corresponding to the string `"30 - 49"`, another for `"50 - 69"`, and another for `"< 30"`.
We do not see a `DataFrame` object corresponding to the `">= 70"` string, as none of the patients in this dataset were 70 or older; as a result, `bucket_age` never returned `">= 70"`, and so this wasn't one of our output `DataFrame`s.
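
Since the function passed to `groupby` receives index values, it can sometimes work with the index directly instead of looking anything up in the table.
The cell below is a sketch of this idea under one assumption not made in the text above: we first use `set_index` so that the patient names become the index.
The function then receives each name, and we group by its first letter.
The names `patients_by_name` and `names_by_letter` are made up for illustration:

```python
import pandas as pd

patients = pd.DataFrame({"name": ["abigail", "adam", "alice", "andrew", "barbara", "bob", "bill"],
                         "age": [45, 57, 25, 62, 30, 22, 43],
                         "blood_pressure": [135, 122, 118, 121, 132, 135, 126]})

# make the names the row index, so groupby's function is called with each name
patients_by_name = patients.set_index("name")

names_by_letter = {}
for letter, table in patients_by_name.groupby(lambda name: name[0]):
    names_by_letter[letter] = list(table.index)
    print(f"---DataFrame for {letter}---")
    print(table)
    print()
```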

### Try this Yourself ###

In the next cell, use `groupby` to divide patients into two `DataFrames`: those with a blood pressure reading under 130, and those with a blood pressure reading greater than or equal to 130.
Print out the results with `for...in`, but this time, only print out the tables themselves.
You can define any additional functions you want.

In [25]:
# Write your code here.




for is_high_bp, table in patients.groupby(patients['blood_pressure'] >= 130): 
    print(table)


# The output of your code should be:
#      name  age  blood_pressure
# 1    adam   57             122
# 2   alice   25             118
# 3  andrew   62             121
# 6    bill   43             126
#       name  age  blood_pressure
# 0  abigail   45             135
# 4  barbara   30             132
# 5      bob   22             135

     name  age  blood_pressure
1    adam   57             122
2   alice   25             118
3  andrew   62             121
6    bill   43             126
      name  age  blood_pressure
0  abigail   45             135
4  barbara   30             132
5      bob   22             135


## Step 4: Use `aggregate` to Reduce a Column To a Single Value Over Multiple `DataFrame` Objects Simultaneously ##

### Background: Aggregation in Pandas ###

_Aggregation_ is Pandas' term for reducing the values in a `Series` down to a single scalar value.
You've actually been using aggregation for a while now; for example, `min`, `max`, and `mean` are all aggregations, because these take the values in a `Series` and condense them down into a single value.
These functions can also be called by passing strings of the same name to the `aggregate` method, shown below:

In [26]:
series = pd.Series([2, 8, 3, 9], index=["a", "b", "c", "d"])
print(series.min()) # prints 2
print(series.aggregate('min')) # prints 2

2
2


The `aggregate` method also can be abbreviated as `agg`, shown below:

In [27]:
print(series.agg('min'))

2


You can provide your own function to `aggregate` which takes the whole `Series` object itself.
This is shown below:

In [28]:
print(series.aggregate(lambda s: s.max() - s.min()))

7


The above cell effectively does:

In [29]:
print(series.max() - series.min())

7


On the surface, `aggregate` seems fairly useless; this is basically just doing the same operations you've been doing so far but with extra steps.
However, `aggregate` _also_ works over other kinds of Pandas objects, including both `DataFrame`s and the result of `groupby`.
We'll see this momentarily.
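
As a quick sketch of `aggregate` on a `DataFrame`, the cell below uses a small table of numbers named `store_numbers`, made up for illustration and based on the hardware inventory from Step 2.
On a `DataFrame`, the aggregation is performed separately for each column, and the results come back as a `Series` indexed by column name:

```python
import pandas as pd

# hypothetical numeric table, used only for this illustration
store_numbers = pd.DataFrame({"price": [20, 30, 2.5, 40],
                              "quantity": [19, 15, 150, 12]})

# a string names a built-in aggregation, applied to every column
column_max = store_numbers.aggregate("max")
print(column_max)

# a function is called once per column, receiving that column as a Series
column_range = store_numbers.aggregate(lambda col: col.max() - col.min())
print(column_range)
```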

You may have noticed that, contrary to almost all prior examples from prior assignments, up until this point we haven't tried printing out the result from `groupby` directly.
Let's do that now, reusing the furniture data from before:

In [30]:
# copied so you don't need to scroll back to the original definition
furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})
print(furniture.groupby("kind"))

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D80B343750>


The `print` above shows that the return value from `groupby` is actually a `pandas.core.groupby.generic.DataFrameGroupBy` object, or just `DataFrameGroupBy` for short.
It just so happens that you can use a `for...in` loop to iterate over a `DataFrameGroupBy` object and get back tuples; exactly how this works internally is beyond the scope of this course.
The important part is that `DataFrameGroupBy` is its own type of object with its own methods, _including_ the `aggregate` method.
In fact, `DataFrameGroupBy` supports many of the same operations as `DataFrame` itself does, and performing these operations effectively performs the operations over _all_ `DataFrame`s contained within the `DataFrameGroupBy`.

For example, let's say you want to get the average age of the furniture for each kind.
This is exactly what step 2's background wanted to ultimately accomplish, and we finally have all the pieces necessary to do this.
This is shown in the next cell:

In [31]:
print(furniture.groupby("kind")["age"].mean())

kind
chair    19.000000
sofa      8.000000
table     2.333333
Name: age, dtype: float64


A step-by-step breakdown of the prior cell follows:

1. Using `furniture.groupby("kind")`, we group the original data into different `DataFrames`, where each resulting `DataFrame` holds furniture of the same kind.  This division by kind is based on the values in the `"kind"` column.  The specific type of the object returned is `DataFrameGroupBy`.
2. From there, `furniture.groupby("kind")["age"]` is used to access the `"age"` column of **all** the `DataFrame` objects contained within the `DataFrameGroupBy` object, at once.
3. From there, `furniture.groupby("kind")["age"].mean()` calculates the average of each of those aforementioned `"age"` columns, resulting in a `Series` of results (note the `Name: age, dtype: float64` footer in the output).  Indices in the resulting `Series` are based on the kinds of furniture, and the values hold the average age for each specific kind of furniture.
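
The three steps above can be checked by printing the type of each intermediate result, as sketched in the cell below.
The variable names (`grouped`, `grouped_ages`, `mean_ages`) are made up for illustration:

```python
import pandas as pd

furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})

grouped = furniture.groupby("kind")   # step 1: split into groups
grouped_ages = grouped["age"]         # step 2: select "age" in every group
mean_ages = grouped_ages.mean()       # step 3: one mean per group

print(type(grouped).__name__)
print(type(grouped_ages).__name__)
print(type(mean_ages).__name__)
```

Selecting a single column from a `DataFrameGroupBy` gives back a closely related `SeriesGroupBy` object, which plays the same role for the per-group `"age"` columns.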

We could also use `aggregate` to accomplish this, as shown below:

In [32]:
print(furniture.groupby("kind")["age"].aggregate('mean'))

kind
chair    19.000000
sofa      8.000000
table     2.333333
Name: age, dtype: float64


The real value of `aggregate` is that it gives us a handle on each individual `Series` instance which is involved.
For example, if we wanted to determine the difference between the max and min ages for each kind of furniture, we could use `groupby` with `aggregate` and `lambda` in the following way:

In [33]:
print(furniture.groupby("kind")["age"].aggregate(lambda s: s.max() - s.min()))

kind
chair    43
sofa      0
table     3
Name: age, dtype: int64


Importantly, the `lambda` in the prior cell is called once per group, each time receiving the `Series` object corresponding to the `"age"` column of that group's `DataFrame`.
This means that, with `aggregate` over `DataFrameGroupBy` objects, there is an implicit loop happening.
This makes `aggregate` over `DataFrameGroupBy` objects much more useful than over `Series` objects.
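
To make this implicit loop concrete, the cell below is a sketch that writes the loop out by hand and compares it against the `aggregate` version.
The names `manual_ranges` and `aggregated` are made up for illustration:

```python
import pandas as pd

furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})

# explicit loop: visit each group, and reduce its "age" column to one value
manual_ranges = {}
for kind, table_for_kind in furniture.groupby("kind"):
    ages = table_for_kind["age"]
    manual_ranges[kind] = ages.max() - ages.min()
print(manual_ranges)

# implicit loop: aggregate calls the lambda once per group for us
aggregated = furniture.groupby("kind")["age"].aggregate(lambda s: s.max() - s.min())
print(aggregated)
```

Both versions compute the same values per kind; `aggregate` simply handles the iteration and the bundling of results into a `Series`.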

### Try this Yourself ###

The next cell redefines the `furniture` information from before.
For each color of furniture, determine the maximum age it has, and print out the resulting `Series`.
You can call any of `max`, `agg`, and/or `aggregate` to help accomplish this; you don't have to call `agg` or `aggregate` specifically, since `max` performs an aggregation.

In [35]:
furniture = pd.DataFrame({"kind": [ "sofa", "chair", "table", "chair", "table", "table", "chair"],
                          "color": ["blue", "green", "beige",  "blue", "black", "brown", "brown"],
                          "age":   [     8,      46,       2,       8,       4,       1,       3]})

# Define your code here, and be sure to print the result.

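# bind the result to a new name so the original furniture DataFrame is not overwritten
max_age_by_color = furniture.groupby("color")["age"].max()
print(max_age_by_color)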


# Your code should print out:
# color
# beige     2
# black     4
# blue      8
# brown     3
# green    46
# Name: age, dtype: int64

color
beige     2
black     4
blue      8
brown     3
green    46
Name: age, dtype: int64


## Step 5: Submit via Canvas ##

Be sure to **save your work**, then log into [Canvas](https://canvas.csun.edu/).  Go to the COMP 502 course, and click "Assignments" on the left pane.  From there, click "Assignment 29".  From there, you can upload your `29_pandas_transform_grouping_aggregation.ipynb` file.

You can turn in the assignment multiple times, but only the last version you submitted will be graded.

### Special Thanks to Dr. Glenn Bruns ###

Special thanks to [Dr. Glenn Bruns](https://csumb.edu/scd/glenn-bruns/) at California State University, Monterey Bay, for providing me with closely-related materials which were used in the creation of this assignment.