# c01-titanic

*Purpose*: Most datasets have at least a few variables. Part of our task in analyzing a dataset is to understand trends as they vary across these different variables. Unless we're careful and thorough, we can easily miss these patterns. In this challenge you'll analyze a dataset with a small number of categorical variables and try to find differences among the groups.

*Reading*: (Optional) [Wikipedia article](https://en.wikipedia.org/wiki/RMS_Titanic) on the RMS Titanic.


## Setup

In [None]:
import grama as gr
import pandas as pd
DF = gr.Intention()
%matplotlib inline


## Background

*Background*: The RMS Titanic sank on its maiden voyage in 1912; about 67% of those aboard died. We will study a dataset summarizing how many persons in different groups either survived or perished.[2]


In [None]:
## NOTE: No need to edit; this loads the data
df_titanic = pd.read_csv("./data/titanic.csv")
df_titanic


# Initial look


### __q1__ Get the basic facts

Inspect `df_titanic`. Answer the questions under *observations* below.


In [None]:
## TASK: Inspect df_titanic
(
    df_titanic
    >> gr.tf_head()
)

*Observations*

- What variables are in the dataset?
  - (Your response here)


### __q2__ Do some background reading

Skim the [Wikipedia article](https://en.wikipedia.org/wiki/RMS_Titanic) on the RMS Titanic, and look for a total count of souls aboard. Compare against the total computed below. Are there any differences? Are those differences large or small? What might account for those differences? 

Answer the questions under *observations* below.


In [None]:
## NOTE: No need to edit; we'll learn how to do this calculation later
(
    df_titanic
    >> gr.tf_summarize(total=gr.sum(DF.n))
)
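
For readers more familiar with plain pandas, the same kind of total can be sketched by summing a count column directly. The toy frame below is hypothetical (its rows and counts are made up for illustration); only the column name `n` mirrors the dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is a group with a head count `n`
df_toy = pd.DataFrame({
    "group": ["a", "b", "c"],
    "n": [10, 25, 5],
})

# Sum the counts across all rows to get the total number of persons
total = df_toy["n"].sum()
print(total)
```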

**Observations**:

- Are there any differences between what you read and the `total` persons computed from our dataset?
  - (Your response here)
- If yes, what might account for those differences?
  - (Your response here)


### __q3__ Visualize survivor count

Create a plot showing the count of persons who *did* survive, along with aesthetics for `Class` and `Sex`. Document your observations below.


In [None]:
## TASK: Visualize the count of survivors, along with `Class` and `Sex`
(
    df_titanic

)

**Observations**:

- Write your observations here
  - (Your response here)
  - (Your response here)
  - (Your response here)


# Deeper look

Raw counts give us a sense of totals, but they are not as useful for understanding differences between groups. This is because the differences we see in counts could be due to either the relative size of the group OR differences in outcomes for those groups. To make comparisons between groups, we should also consider *proportions*.[1]

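The following code computes proportions within each `Class`, `Sex`, `Age` group: `Total` is the number of persons in the group, and `Prop` is the fraction of that group represented by each row.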


In [None]:
## NOTE: No need to edit
df_prop = (
    df_titanic
    >> gr.tf_group_by(DF.Class, DF.Sex, DF.Age)
    >> gr.tf_mutate(
        Total=gr.sum(DF.n),
        Prop=DF.n/gr.sum(DF.n),
    )
    >> gr.tf_ungroup()
)
df_prop
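
To see why counts and proportions can tell different stories, here is a minimal sketch on a made-up table. The groups `A` and `B`, the `outcome` column, and all counts are hypothetical and are not taken from the Titanic data. The sketch uses plain pandas rather than grama to compute the same kind of within-group proportion as the cell above.

```python
import pandas as pd

# Hypothetical toy data: two groups of very different sizes
df_toy = pd.DataFrame({
    "group":   ["A", "A", "B", "B"],
    "outcome": ["yes", "no", "yes", "no"],
    "n":       [90, 810, 60, 40],
})

# Total count within each group, broadcast back to every row
df_toy["Total"] = df_toy.groupby("group")["n"].transform("sum")
# Proportion of each outcome within its own group
df_toy["Prop"] = df_toy["n"] / df_toy["Total"]

print(df_toy)
```

In this made-up table, group `A` has more "yes" outcomes by count (90 versus 60), yet a far smaller proportion (0.1 versus 0.6), because `A` is nine times larger than `B`.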

### __q4__ Visualize proportions

Replicate your visual from q3, but display `Prop` in place of `n`. Document your observations, and note any new/different observations you make in comparison with q3. Is there anything *fishy* in your plot?

Answer the questions under *observations* below.


In [None]:
## TASK: Visualize the proportion of survivors, along with `Class` and `Sex`
(
    df_prop

)

*Observations*

- Write your observations here.
  - (Your response here)
  - (Your response here)
  - (Your response here)
- Is there anything *fishy* going on in your plot?
  - (Your response here)


### __q5__ Visualize with more variables

Visualize `Prop` for the persons who *did* survive, that is, the proportion of each group that survived, along with aesthetics for `Class`, `Sex`, *and* `Age`. Make sure to show the proportions for *all* survivor groups. Document your observations below.

*Hint*: You may need to make multiple plots using different filters. Or you can look ahead to `e-vis05-multiples` to learn how to use `gr.facet_grid()`.


In [None]:
## TASK: Visualize `Prop` for survivors, along with `Class`, `Sex`, and `Age`
(
    df_prop

)

*Observations*

Document your observations below.

- (Your response here)
- (Your response here)
- (Your response here)
- If you saw something *fishy* in q4 above, use your new plot to explain the fishy-ness.
  - (Your response here)



# Notes

[1] This is basically the same idea as [Dimensional Analysis](https://en.wikipedia.org/wiki/Dimensional_analysis); computing proportions is akin to non-dimensionalizing a quantity.

[2] Further details on this dataset---and its known errors!---are written up in Dawson, "The 'Unusual Episode' Data Revisited" (1995) *Journal of Statistics Education*, [link](https://www.tandfonline.com/doi/full/10.1080/10691898.1995.11910499).
