# Intro to Pandas
by Ryan Orsinger

## Module 4: Aggregating (continued)
- Using `.groupby` and aggregate methods
- Understanding the `.groupby` object
- Introducing the `.agg` method
- Specifying column output
- Grouping by multiple columns

In [None]:
# Import pandas
import pandas as pd

# Read in our data
df = pd.read_csv("tips.csv")
df.head()

In [None]:
# We've already worked with some aggregate functions
df.total_bill.median()

In [None]:
# Aggregate functions run on entire columns or dataframes
df.mean(numeric_only=True)

In [None]:
df.tip.min(), df.tip.max()

In [None]:
# .describe is also an aggregate method: it runs several aggregate functions at once
df.tip.describe()

But what do we do when we need aggregate results for each value in a categorical column?

In [None]:
# It's possible to manually create a dataframe for each category,
# but this becomes tedious with many categories or multiple columns,
# especially if we want to run the same methods on each dataframe
thurs = df[df.day == "Thur"]
fri = df[df.day == "Fri"]
sat = df[df.day == "Sat"]
sun = df[df.day == "Sun"]

# We don't have labels with this method, unfortunately
thurs.total_bill.mean(), fri.total_bill.mean(), sat.total_bill.mean(), sun.total_bill.mean()

In [None]:
# We calculate from the groupby object with aggregate methods (.mean, .median, etc...)
# Calculate the average total bill for each day
# The "for each" means that we're grouping by the day column
df.groupby("day").total_bill.mean()
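
To make the comparison with the manual filtering approach concrete, here is a minimal sketch. It assumes a tiny made-up dataframe named `toy` (not the real tips data) so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature stand-in for tips.csv (values invented for illustration)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sun"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 18.0, 40.0],
})

# Manual approach: one filter per category, and no label on the result
manual_thur = toy[toy.day == "Thur"].total_bill.mean()

# groupby approach: one labeled result per category in a single expression
by_day = toy.groupby("day").total_bill.mean()

print(by_day)
print(by_day["Thur"] == manual_thur)
```

The groupby result is a Series indexed by day, so each average carries its own label.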

In [None]:
# The groupby object is a compound entity, built to be used with aggregate functions
df.groupby("day")

In [None]:
# The groupby object does not print out results.
# Under the hood, iterating over it yields one (group name, dataframe) tuple per category value
# Unpacking into four names works here only because `day` has four distinct values
# We recommend against decomposing groupby objects (this cell is only to share context)
# That's what aggregate functions are for!
a, b, c, d = df.groupby("day")
a
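
For context only, the tuples can also be inspected with a loop or comprehension instead of unpacking. This is a small sketch on a made-up dataframe named `toy`; the names `pieces`, `group_names`, and `sat_rows` are illustrative and not part of the lesson's data.

```python
import pandas as pd

# Hypothetical miniature stand-in for tips.csv (values invented for illustration)
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat"],
    "tip": [1.0, 2.0, 3.0, 5.0],
})

# Iterating over a groupby object yields one (group name, sub-dataframe) tuple per group
pieces = {name: sub for name, sub in toy.groupby("day")}

group_names = list(pieces)       # group names come back sorted by default
sat_rows = len(pieces["Sat"])    # number of rows that fell into the "Sat" group

print(group_names, sat_rows)
```

Aggregate methods do this splitting for us, which is why we rarely need to loop over the groups by hand.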

In [None]:
# Consider the following
# We get the average for each day, on all numeric columns
# Notice that each groupby result redefines what each row means
df.groupby("day").mean(numeric_only=True)
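
A small sketch of how text columns interact with a group-wide `.mean`, assuming a recent pandas version (2.0 or later) where non-numeric columns are not dropped automatically. The dataframe `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for tips.csv (values invented for illustration)
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun"],
    "sex": ["Female", "Male", "Male"],
    "tip": [2.0, 4.0, 5.0],
})

# numeric_only=True skips the text column `sex` instead of trying to average it
daily = toy.groupby("day").mean(numeric_only=True)

# Each row of the result now describes one day, not one restaurant bill
print(daily)
```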

In [None]:
# We can also group by more than 1 column. This creates a multi-index
# Without specifying the columns, we'll see all the numeric columns in the output
df.groupby(["day", "time"]).mean(numeric_only=True)

In [None]:
# Grouping by more than 1 column creates a multi-index
# We can pass a list of numeric columns inside the selection brackets (hence the double brackets)
df.groupby(["day", "time"])[["total_bill", "tip"]].mean()

In [None]:
# If we need the grouping keys back as regular columns instead of the index, we can use .reset_index
df.groupby(["day", "time"])[["total_bill", "tip"]].mean().reset_index()
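
The sketch below shows what the multi-index looks like before and after `.reset_index`. It assumes a made-up dataframe `toy`; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical miniature stand-in for tips.csv (values invented for illustration)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner"],
    "tip": [2.0, 4.0, 3.0, 5.0],
})

# Grouping by two columns puts both keys into a two-level index
grouped = toy.groupby(["day", "time"])[["tip"]].mean()

# A row is looked up with a (day, time) tuple
thur_lunch = grouped.loc[("Thur", "Lunch"), "tip"]

# reset_index moves the keys back into ordinary columns
flat = grouped.reset_index()

print(grouped)
print(flat)
```

The flat version is often easier to filter, merge, or export, while the indexed version reads more like a summary table.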

In [None]:
df.groupby("day")[["total_bill", "tip"]].mean()

In [None]:
# .describe is an aggregate function, too
df.groupby("time").total_bill.describe()

In [None]:
# Using the .agg method to specify multiple aggregate functions at once
df.groupby("day").total_bill.agg(["mean", "std"])

In [None]:
# Using the .agg method to specify multiple aggregate functions
# We can call .agg on multiple numeric columns, too
df.groupby("day")[["total_bill", "tip"]].agg(["mean", "std"])

In [None]:
# Since the output is a dataframe, we can transpose it, if doing so makes for easier reading
df.groupby("day")[["total_bill", "tip"]].agg(["mean", "std"]).T
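
When `.agg` receives a list of functions for several columns, the output columns become two-level: (column name, function name). This is a minimal sketch on a made-up dataframe `toy`, showing how to read one cell and what transposing does to the shape.

```python
import pandas as pd

# Hypothetical miniature stand-in for tips.csv (values invented for illustration)
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 4.0, 6.0],
})

# Two columns x two functions gives four output columns per day
summary = toy.groupby("day")[["total_bill", "tip"]].agg(["mean", "max"])

# One cell is addressed by row label plus a (column, function) tuple
sat_mean_bill = summary.loc["Sat", ("total_bill", "mean")]

# Transposing swaps days into columns and (column, function) pairs into rows
transposed = summary.T

print(summary)
print(transposed)
```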

## The forms of .groupby

| specific example    |  general form    |
| ---- | ---- |
| `df.groupby("day").mean(numeric_only=True)` | `df.groupby("categorical_column").aggregate_function()`     |
| `df.groupby("day").total_bill.mean()`     | `df.groupby("categorical_column").numeric_column.aggregate_function()`     |
| `df.groupby("day")["tip"].median()`     | `df.groupby("categoryA")["numeric_columnA"].aggregate_function()`     |
| `df.groupby("day")[["total_bill", "tip"]].min()`     | `df.groupby("categoryA")[["numeric_columnA", "numeric_columnB"]].aggregate_function()`     |
| `df.groupby(["day", "time"]).mean(numeric_only=True)`     | `df.groupby(["categoryA", "categoryB"]).aggregate_function()` |
| `df.groupby("day").total_bill.agg(["min", "median", "max"])`    | `df.groupby("category").numeric_column.agg(["min", "median", "max"])`     |
| `df.groupby("day")[["total_bill", "tip"]].agg(["min", "median", "max"])`    | `df.groupby("category")[["numericA", "numericB"]].agg(["min", "median", "max"])`     |

## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
- Further reading on the multi-index that results from grouping by multiple columns: https://pandas.pydata.org/docs/user_guide/advanced.html

## Exercises
- Use the "mpg.csv" dataset to create a dataframe named `mpg`
- Group by manufacturer and obtain the highest `hwy` mileage for each manufacturer
- Group by the manufacturer and obtain the average `hwy` and `cty` mileage
- Group by the number of cylinders and get the average displacement for each cylinder count
- Group by the vehicle class, then calculate the average and standard deviation of `hwy` mileage
- Which vehicle class has the largest standard deviation of hwy mileage?

In [None]:
# Use the "mpg.csv" dataset to create a dataframe named `mpg`


In [None]:
# Group by manufacturer and obtain the highest hwy mileage for each manufacturer


In [None]:
# Group by the manufacturer and obtain the average hwy and cty mileage


In [None]:
# Group by the number of cylinders and get the average displacement for each cylinder count


In [None]:
# Group by the vehicle class, then calculate the average and standard deviation of hwy mileage


In [None]:
# Which vehicle class has the largest standard deviation of hwy mileage?
