# Intro to Pandas
by Ryan Orsinger

## Module 4: Aggregating
- Using `.crosstab` to compute a frequency count for each category pairing
- Using `.pivot_table` to calculate aggregates of numeric values for each category pairing (same as a spreadsheet pivot table)

In [None]:
# Import pandas
import pandas as pd

# Read in some data
df = pd.read_csv("tips.csv")
df.head()

### What is `.crosstab`?
- Crosstab computes a simple cross tabulation of two (or more) factors
- Computes a frequency table of factors
- Example: counting how many tables ate lunch or dinner on each day
- Example: counting the number of smoking tables broken out by gender
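
As a minimal sketch of what such a frequency table looks like, the following builds a tiny made-up DataFrame (the name `toy` and its rows are assumptions for illustration, not the contents of `tips.csv`) and counts each day/time pairing.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

# One row per day, one column per time, each cell is a count of rows
counts = pd.crosstab(index=toy["day"], columns=toy["time"])
print(counts)
```

Pairings that never occur in the data (here, Friday lunch) show up as 0 rather than being dropped.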

In [None]:
# Say we needed to get all the different days
df.day.unique()

In [None]:
# And all the different times
df.time.unique()

In [None]:
# To count Thursday Lunch, we need this compound indexing operation
df[(df.day == "Thur") & (df.time == "Lunch")].shape[0]

In [None]:
# To count Thursday Dinner, we need this compound indexing operation
# Repeat this for each day/time combination...
df[(df.day == "Thur") & (df.time == "Dinner")].shape[0]

In [None]:
# As another approach,
# we could run .time.value_counts() on each individual day
# But this would get tedious, too
# Especially when there are more than 4 x 2 combinations
df[df.day == "Thur"].time.value_counts()

In [None]:
# Crosstab to the rescue!
# Frequency count of all days by all times
pd.crosstab(index=df.day, columns=df.time)
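
One way to think about this call is as a shortcut for grouping, counting, and reshaping. The sketch below compares the two approaches on a small made-up DataFrame (`toy` and its rows are assumptions for illustration); it is meant as a mental model, not a description of how pandas implements `crosstab` internally.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

# Frequency table in a single call
via_crosstab = pd.crosstab(index=toy["day"], columns=toy["time"])

# The same counts built by hand: group, count rows, pivot "time" into columns
via_groupby = toy.groupby(["day", "time"]).size().unstack(fill_value=0)

print(via_crosstab)
print(via_groupby)
```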

In [None]:
# margins=True adds the row and column totals
pd.crosstab(index=df.day, columns=df.time, margins=True)
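
The totals are added as an extra row and an extra column, both labeled `All`. A short sketch on made-up data (`toy` and its rows are assumptions for illustration) shows how to read them back out.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

totals = pd.crosstab(index=toy["day"], columns=toy["time"], margins=True)

# Row total for Saturday, and the grand total in the bottom-right cell
sat_total = totals.loc["Sat", "All"]
grand_total = totals.loc["All", "All"]
print(totals)
```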

In [None]:
# We can also pass lists of series into either index or columns
pd.crosstab(index=df.day, columns=[df.time, df.smoker])
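
Passing a list produces hierarchical (MultiIndex) columns, which changes how you select from the result. The sketch below uses a small made-up DataFrame (`toy` and its rows are assumptions for illustration) to show two common selections.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
    "smoker": ["No", "Yes", "Yes", "No", "Yes", "No"],
})

table = pd.crosstab(index=toy["day"], columns=[toy["time"], toy["smoker"]])

# The columns now have two levels: time on top, smoker underneath
print(table.columns.names)

# Select every smoker column under "Dinner"
dinner = table["Dinner"]

# Select one specific (time, smoker) column with a tuple
dinner_smokers = table[("Dinner", "Yes")]
print(dinner)
```

Only column combinations that actually occur in the data appear, so this toy table has no ("Lunch", "Yes") column.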

## Using `.pivot_table` to aggregate more than counts
- Use `.pivot_table` to set up the category intersections, then specify the column to aggregate and the aggregate function
- The `.pivot_table` method defaults to the average (mean)
- We can specify multiple categories in the index and columns, but the results can become visually busy
- Example: for each day/time pairing, calculate the average `total_bill`
- Example: for each day/time pairing, get the average `total_bill` and `tip`
- Example: for each day/time pairing, calculate the min, median, max `tip`
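
To make the idea concrete, here is a minimal sketch on a small made-up DataFrame (`toy` and its rows are assumptions for illustration, not the real tips data). It also builds the same table with `groupby`, which can be a helpful mental model for what the pivot table computes.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sat", "Sat"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 14.0, 20.0, 30.0, 40.0],
})

# Average total_bill for each day/time pairing (mean is the default)
avg = pd.pivot_table(toy, index="day", columns="time", values="total_bill")

# The same numbers by hand: group, average, pivot "time" into columns
by_hand = toy.groupby(["day", "time"])["total_bill"].mean().unstack()

print(avg)
```

Unlike `crosstab`, pairings with no rows (Saturday lunch here) come back as `NaN`, because there is nothing to average.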

In [None]:
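# Without a "values" argument, pivot_table aggregates every remaining column
# for each category pair, using the mean by default.
# Newer pandas versions may raise an error if any of those columns are non-numeric;
# if that happens, pass "values" explicitly.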
pd.pivot_table(df, index="day", columns="time")

In [None]:
pd.pivot_table(df, index="day", columns="time", values="total_bill")

In [None]:
# Use the "values" argument to specify which columns to calculate
pd.pivot_table(df, index="day", columns="time", values=["total_bill", "tip"])

In [None]:
# Use the aggfunc argument to override the default mean function
pd.pivot_table(df, values="tip", aggfunc="median", index="day", columns="time")

In [None]:
# The aggfunc argument can take a list of aggregate functions
pd.pivot_table(df, values="tip", aggfunc=["min", "median", "max"], index="day", columns="time")
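
With a list of functions, the result has one block of columns per function. The sketch below, on a small made-up DataFrame (`toy` and its rows are assumptions for illustration), shows how to pull one block or one cell out of that result.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data)
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0, 9.0],
})

summary = pd.pivot_table(
    toy,
    values="tip",
    index="day",
    columns="time",
    aggfunc=["min", "median", "max"],
)

# The top column level holds the function names, so each block is one lookup away
medians = summary["median"]

# A single cell: the largest Saturday dinner tip
largest_sat_dinner = summary.loc["Sat", ("max", "Dinner")]
print(summary)
```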

## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
- https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

## Exercises
- Use `.crosstab` on the `tips` dataframe to count the number of differently sized tables for each time of day. *Hint:* remember that `.size` is a built-in attribute on pandas objects.
- Use `pd.read_csv` and the `mpg.csv` file to create a dataframe named `mpg`.
- Use `.crosstab` to count the number of vehicles for each combination of class and drivetrain. *Hint:* remember that `class` is a reserved word in Python.
- Use `.crosstab` to count the number of vehicles for each combination of manufacturer and drivetrain.
- Use `.pivot_table` and `mpg` to calculate the average highway mileage for each combination of vehicle class and drivetrain. 
- Use `.pivot_table` and `mpg` to calculate the median city mileage for each combination of manufacturer and drivetrain.
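
Both hints above come down to the same point: dot notation only works when the column name does not collide with an existing attribute or a Python keyword. A minimal sketch, using a made-up DataFrame (`toy` and its values are assumptions for illustration, not the real `tips` or `mpg` data):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "size": [2, 4, 2],
    "class": ["compact", "suv", "compact"],
})

# toy.size is the built-in attribute (total number of cells), not the column
n_cells = toy.size

# Bracket notation always returns the column itself
size_column = toy["size"]

# toy.class would be a SyntaxError because "class" is a keyword; use brackets
class_column = toy["class"]

print(n_cells)
print(size_column.tolist())
print(class_column.unique())
```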

In [None]:
# Use crosstab on the tips dataframe to count the number of differently sized tables for each time of day. 
# Hint: remember that .size is a built-in attribute on pandas objects.


In [None]:
# Use pd.read_csv and the mpg.csv file to create a dataframe named mpg.


In [None]:
# Use .crosstab to count the number of vehicles for each combination of class and drivetrain. 
# Hint: remember that "class" is a reserved word in Python.


In [None]:
# Use .crosstab to count the number of vehicles for each combination of manufacturer and drivetrain.


In [None]:
# Use .pivot_table and mpg to calculate the average highway mileage for each combination of vehicle class and drivetrain.


In [None]:
# Use .pivot_table and mpg to calculate the median city mileage for each combination of manufacturer and drivetrain.
