# Data: Deriving Quantities

*Purpose*: Often our data will not tell us *directly* what we want to know; in
these cases we need to *derive* new quantities from our data. In this exercise,
we'll work with `tf_mutate()` to create new columns by operating on existing
variables, and use `tf_group_by()` with `tf_summarize()` to compute aggregate
statistics (summaries!) of our data.

Aside: The data-summary verbs in grama are heavily inspired by the [dplyr](https://dplyr.tidyverse.org/) package in the R programming language.

## Setup

In [22]:
import grama as gr
import pandas as pd
DF = gr.Intention()
%matplotlib inline

We'll be using the `diamonds` dataset, as seen earlier in `e-vis00-basics`.

In [23]:
from grama.data import df_diamonds
df_diamonds.head(10)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


## Summarizing Data Frames

The `tf_summarize()` function in [grama](https://py-grama.readthedocs.io/en/latest/source/grama.dfply.html?highlight=tf_summarize#module-grama.dfply.summarize) is used to compute summary statistics. You pass `tf_summarize()` a data frame along with keyword arguments: each argument's name becomes a column name in the output, and its value is the summary statistic to compute. Each argument should generate a single summary statistic for the output data frame. The table below lists some useful summary functions.

| Type | Functions |
| ---- | --------- |
| Location | `gr.mean(x), gr.median(x), gr.mean_lo(x), gr.mean_up(x), gr.min(x), gr.max(x)` |
| Spread | `gr.sd(x), gr.var(x), gr.IQR(x)` |
| Position | `gr.first(x), gr.nth(x, n), gr.last(x)` |
| Counts | `gr.n_distinct(x), gr.n()` |
| Logical | `gr.sum(x != 0), gr.mean(y == 0)` |
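
The "Logical" row relies on booleans behaving like the numbers 0 and 1: summing a comparison counts the matching rows, and averaging it gives the fraction of matching rows. The following is a minimal plain-Python sketch of that idea, independent of grama; the lists `x` and `y` are hypothetical toy data, not columns from `df_diamonds`.

```python
# Hypothetical toy columns (assumed values, for illustration only)
x = [0, 3, 0, 5, 2, 0, 7, 1]
y = [0, 3, 0, 5, 2, 0, 7, 1]

# A comparison produces one boolean per entry
x_nonzero = [value != 0 for value in x]
y_zero = [value == 0 for value in y]

# True counts as 1 and False as 0, so a sum counts the matches;
# this mirrors the idea behind gr.sum(x != 0)
n_nonzero = sum(x_nonzero)

# A mean of booleans is the fraction of matches;
# this mirrors the idea behind gr.mean(y == 0)
frac_zero = sum(y_zero) / len(y_zero)

print(n_nonzero)  # 5
print(frac_zero)  # 0.375
```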

### __q1__ Use `tf_summarize()` with a logical function

Using `tf_summarize()` and a logical summary function, determine the number of rows with an `Ideal` `cut`. Save this value to a column called `n_ideal`.

In [94]:
# TASK: Determine the number of observations with an 'Ideal' diamond 'cut'. Assign this value as 'n_ideal'

df_q1 = (
    df_diamonds
    >> gr.tf_summarize(
# solution-begin
        n_ideal=gr.sum(DF.cut == "Ideal")
# solution-end
    )
)

assert \
    df_q1["n_ideal"][0]/23==937,\
    "Sum does not match expected value"

df_q1

Unnamed: 0,n_ideal
0,21551


## Grouping multiple variables

The `tf_group_by()` function takes a data frame and the names of one or more of its columns. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns. When you pass a grouped data frame into `tf_summarize()`, it calculates your summary statistics in a group-wise manner.
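
To make "group-wise" concrete, here is a minimal plain-Python sketch of the same split-then-summarize idea. It does not use grama, and the `rows` below are hypothetical toy values rather than the real dataset; it only illustrates the concept, not how grama implements grouping.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows of (color, clarity, price)
rows = [
    ("E", "SI2", 326),
    ("E", "SI1", 326),
    ("E", "SI2", 330),
    ("I", "VS2", 334),
    ("J", "SI2", 335),
    ("I", "VS2", 340),
]

# Ungrouped: a single summary computed over every row
overall_mean = mean(price for _, _, price in rows)

# Grouped: collect the rows that share the same (color, clarity) pair...
groups = defaultdict(list)
for color, clarity, price in rows:
    groups[(color, clarity)].append(price)

# ...then compute the same summary once per group
group_means = {key: mean(prices) for key, prices in groups.items()}

print(overall_mean)
for key, value in group_means.items():
    print(key, value)
```

The ungrouped summary is one number, while the grouped summary has one entry per distinct combination of grouping values.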

**DRAFT NOTE:** Copied directly from the RStudio article

### __q2__ How does `tf_group_by()` change the output to `tf_summarize()`?

The `tf_group_by()` verb modifies how other data-management verbs function. Uncomment the `tf_group_by()` line below, and describe how the result changes.

In [126]:
#TASK: Uncomment the tf_group_by() below, and describe how the result changes

df_q2 = (
    df_diamonds
#    >> gr.tf_group_by(DF["color"],DF["clarity"])
    >> gr.tf_summarize(diamonds_mean=gr.mean(DF["price"]))
)

df_q2

Unnamed: 0,diamonds_mean
0,3932.799722


*Observations*
<!-- task-begin -->
- Write your observations here!
<!-- task-end -->
<!-- solution-begin -->
- The commented version computes a summary over the entire dataframe
- The uncommented version computes summaries over groups of `color` and `clarity`
<!-- solution-end -->


## Vectorized Functions

| Type | Functions |
| ---- | --------- |
| Arithmetic ops. | `+, -, *, /, **` |
| Modular arith. | `//, %` |
| Logical comp. | `<, <=, >, >=, !=, ==` |
| Logarithms | `gr.log(x)` |
| Offsets | `gr.lead(x), gr.lag(x)` |
| Cumulants | `gr.cumsum(x), gr.cumprod(x), gr.cummin(x), gr.cummax(x), gr.cummean(x)` |
| Ranking | `gr.min_rank(x), gr.row_number(x), gr.dense_rank(x), gr.percent_rank(x)` |


### __q3__ Comment on why the difference is so large.

The `depth` variable is supposedly computed via
`depth_computed = 100 * 2 * DF["z"] / (DF["x"] + DF["y"])`. Compute
`diff = DF["depth"] - DF["depth_computed"]`: this is a measure of
discrepancy between the given and computed depth. Additionally, compute the
*coefficient of variation* `cov = sd / mean` for both `depth` and `diff`:
this is a dimensionless measure of variation in a variable. Assign the resulting
data frame to `df_q3`, and assign the appropriate values to `cov_depth` and
`cov_diff`. Comment on the relative values of `cov_depth` and `cov_diff`; why is
`cov_diff` so large?

*Note*: Confusingly, the documentation for `diamonds` leaves out the factor of `100` in the computation of `depth`. Additionally, by default the `pandas` library will exclude values equal to `NaN` in functions like `gr.mean()` or `gr.sd()`.

In [128]:
df_q3 = (
    df_diamonds 
    >> gr.tf_mutate(depth_computed=100 * 2 * DF["z"] / (DF["x"] + DF["y"]))
    >> gr.tf_mutate(diff=DF["depth"] - DF["depth_computed"])
    >> gr.tf_mutate(
        depth_mean=gr.mean(DF["depth"]),
        depth_sd=gr.sd(DF["depth"]),
        diff_mean=gr.mean(DF["diff"]),
        diff_sd=gr.sd(DF["diff"]),
        diff_median=gr.median(DF["diff"])
    )
    >> gr.tf_mutate(
        cov_depth=DF["depth_sd"] / DF["depth_mean"],
        cov_diff=DF["diff_sd"] / DF["diff_mean"],
        c_diff=gr.IQR(DF["diff"]/DF["diff_median"])
    )
    >> gr.tf_select(DF["depth_mean"],DF["depth_sd"],DF["cov_depth"],DF["diff_mean"],DF["diff_sd"],DF["cov_diff"],DF["c_diff"])
)


# Use abs() so the check fails for values that are too small as well as too large
assert \
    abs(df_q3["cov_depth"][0] - 0.02320057) < 1e-3 and\
    abs(df_q3["cov_diff"][0] - 497.5585) < 1e-3,\
    "Double check your calculations for depth and diff"

df_q3.head(1) 

# DRAFT NOTE: This is an inelegant way to get the correct answer. It is unclear how to pass variables forward within a summarize call, or how to reorder columns with a select call.

Unnamed: 0,depth_mean,depth_sd,cov_depth,diff_mean,diff_sd,cov_diff,c_diff
0,61.749405,1.432621,0.023201,0.005284,2.629223,497.558461,7495391000000.0


*Observations*

<!-- task-begin -->
- Comment on the relative values of `cov_depth` and `cov_diff`.
- Why is `cov_diff` so large?
<!-- task-end -->
<!-- solution-begin -->
- `cov_depth` is tiny; there's not much variation in the depth, compared to its scale.
- `cov_diff` is enormous! This is because the mean difference between `depth` and `depth_computed` is small, which causes the `cov` to blow up.
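
The blow-up can be reproduced with a small plain-Python sketch. The two samples below are hypothetical (not drawn from `df_diamonds`): they have identical spread, but one is centered far from zero and the other very close to zero. The sketch uses the sample standard deviation from the standard library's `statistics` module, which is assumed here to be a reasonable stand-in for `gr.sd()`.

```python
from statistics import mean, stdev

# Hypothetical samples with the same spread but different centers
far_from_zero = [60.0, 61.0, 62.0, 63.0, 64.0]
near_zero = [-1.99, -0.99, 0.01, 1.01, 2.01]


def cov(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return stdev(values) / mean(values)


cov_far = cov(far_from_zero)
cov_near = cov(near_zero)

# Same standard deviation, but dividing by a mean near zero
# inflates the coefficient of variation by orders of magnitude
print(stdev(far_from_zero), stdev(near_zero))
print(cov_far, cov_near)
```

Because the denominator is the mean, the coefficient of variation is only a meaningful measure of relative variation when the mean sits well away from zero.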
<!-- solution-end -->

### __q4__ Compute and observe

Compute the `price_mean = mean(price)`, `price_sd = sd(price)`, and `price_cov =
price_sd / price_mean` for each `cut` of diamond. What observations can you make
about the various cuts? Do those observations match your expectations?

In [124]:
# TASK: Compute the `price_mean = gr.mean(DF["price"])`, `price_sd = gr.sd(DF["price"])`, and `price_cov = price_sd / price_mean` for each `cut` of diamond. 

df_q4=(
    df_diamonds
# solution-begin    
    >> gr.tf_group_by(DF["cut"])
    >> gr.tf_summarize(
        price_mean=gr.mean(DF["price"]),
        price_sd=gr.sd(DF["price"]),
    )
    >> gr.tf_mutate(price_cov=DF["price_sd"]/DF["price_mean"])
# solution-end
)

# DRAFT NOTE: See below code chunk for assertion issue
df_q4

Unnamed: 0,cut,price_mean,price_sd,price_cov
0,Fair,4358.757764,3560.386612,0.816835
1,Good,3928.864452,3681.589584,0.937062
2,Ideal,3457.54197,3808.401172,1.101476
3,Premium,4584.257704,4349.204961,0.948726
4,Very Good,3981.759891,3935.862161,0.988473


In [120]:
# Reading mutate_helpers.py, I expected to be able to pass either an integer 3 after DF.price_cov or the argument 'decimals = 3', per the np.round() documentation.
# Neither worked, so I wasn't sure how to write the assert that validates the provided answer.

df_q4_check = (
    df_q4
        >> gr.tf_select(DF["cut"], DF["price_cov"]) 
        >> gr.tf_mutate(price_cov = gr.round(DF["price_cov"]))
        #>> gr.tf_mutate(price_cov = gr.round(DF["price_cov"], 3)) <- Expected  round call
)
df_q4_check

Unnamed: 0,cut,price_cov
0,Fair,1.0
1,Good,1.0
2,Ideal,1.0
3,Premium,1.0
4,Very Good,1.0


*Observations*
<!-- task-begin -->
- Write your observations here!
<!-- task-end -->
<!-- solution-begin -->
- I would expect the `Ideal` diamonds to have the highest price, but that's not the case!
- The `COV` for each cut is very large, on the order of 80 to 110 percent! To me, this implies that the other variables `clarity, carat, color` account for a large portion of the diamond price.
<!-- solution-end -->