# c01-titanic

*Purpose*: Most datasets have at least a few variables. Part of our task in analyzing a dataset is to understand trends as they vary across these different variables. Unless we're careful and thorough, we can easily miss these patterns. In this challenge you'll analyze a dataset with a small number of categorical variables and try to find differences among the groups.

*Reading*: (Optional) [Wikipedia article](https://en.wikipedia.org/wiki/RMS_Titanic) on the RMS Titanic.


## Setup

In [None]:
import grama as gr
import pandas as pd
DF = gr.Intention()
%matplotlib inline


## Background

*Background*: The RMS Titanic sank on its maiden voyage in 1912; about 67% of its passengers died. We will study a dataset summarizing how many passengers of different groups either survived or perished.


In [None]:
## NOTE: No need to edit; this loads the data
df_titanic = pd.read_csv("./data/titanic.csv")
df_titanic


# Initial look


### __q1__ Get the basic facts

Inspect `df_titanic`. Answer the questions under *observations* below.


In [None]:
## TASK: Inspect df_titanic
# task-begin
(
    df_titanic
    >> gr.tf_head()
)
# task-end

*Observations*

<!-- task-begin -->
- What variables are in the dataset?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- What variables are in the dataset?
  - `Class`, `Sex`, `Age`, `Survived`, and `n`
<!-- solution-end -->


### __q2__ Do some background reading

Skim the [Wikipedia article](https://en.wikipedia.org/wiki/RMS_Titanic) on the RMS Titanic, and look for a total count of souls aboard. Compare against the total computed below. Are there any differences? Are those differences large or small? What might account for those differences? 

Answer the questions under *observations* below.


In [None]:
## NOTE: No need to edit; we'll learn how to do this calculation later
(
    df_titanic
    >> gr.tf_summarize(total=gr.sum(DF.n))
)
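
For readers who already know pandas, the same kind of total can be sketched as a plain column sum. This is only an illustration: the `df_toy` frame and its counts are made up and are not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data (NOT the real Titanic counts)
df_toy = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Survived": ["No", "Yes", "No", "Yes"],
    "n": [10, 5, 20, 15],
})

# Each row holds a count of persons, so summing the count column
# gives the total number of persons represented in the table
total = df_toy["n"].sum()
print(total)
```

Because each row is a *count* of persons rather than a single person, the number of rows says nothing about the number of people; the sum of `n` does.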

**Observations**:

<!-- task-begin -->
- Are there any differences between what you read and the `total` persons computed from our dataset?
  - (Your response here)
- If yes, what might account for those differences?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- Are there any differences between what you read and the `total` persons computed from our dataset?
  - Our computed total is `2201`, whereas Wikipedia gives an estimate of `2224` persons (as of 2020-06-25). Our total is short by `23` people, a difference of about 1%.
- If yes, what might account for those differences?
  - It is possible the original data source was missing some individual entries; in this case their count would appear in neither *survived* nor *perished*.
  - Our dataset is from the British Board of Trade Inquiry Report (1990 reprint); this report is known to contain errors. For instance, there is at least one known 1st class Child passenger who died: ALLISON, Miss Helen Loraine (see [*Encyclopedia Titanica*](https://www.encyclopedia-titanica.org/children-that-died-on-titanic/) for more details) [2].
<!-- solution-end -->


### __q3__ Visualize survivor count

Create a plot showing the count of persons who *did* survive, along with aesthetics for `Class` and `Sex`. Document your observations below.


In [None]:
## TASK: Visualize the count of survivors, along with `Class` and `Sex`
(
    df_titanic
# solution-begin
    >> gr.tf_filter(DF.Survived == "Yes")
    
    >> gr.ggplot(gr.aes("Class", "n", fill="Sex"))
    + gr.geom_col(position="dodge")
# solution-end
)

**Observations**:

<!-- task-begin -->
- Write your observations here
  - (Your response here)
  - (Your response here)
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- Write your observations here
  - Many more women than men survived among 1st and 2nd class passengers
  - About the same number of 3rd class men and women survived
  - Many more male crew members survived than female crew. However, this might be misleading; perhaps there were very few female crewmembers to start.
<!-- solution-end -->


# Deeper look

Raw counts give us a sense of totals, but they are not as useful for understanding differences between groups. This is because the differences we see in counts could be due to either the relative size of the group OR differences in outcomes for those groups. To make comparisons between groups, we should also consider *proportions*.[1]
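
To make the idea concrete, here is one way to write the proportion down. The notation is an assumption introduced for illustration: let $n_{g,s}$ be the count of persons in group $g$ (one combination of `Class`, `Sex`, and `Age`) with survival status $s$.

$$
\text{Prop}_{g,s} = \frac{n_{g,s}}{\sum_{s'} n_{g,s'}}
$$

With made-up numbers: if a group has 20 survivors and 80 non-survivors, its survival proportion is $20 / (20 + 80) = 0.2$, no matter how large other groups are.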

The following code computes proportions within each `Class, Sex, Age` group.


In [None]:
## NOTE: No need to edit
df_prop = (
    df_titanic
    >> gr.tf_group_by(DF.Class, DF.Sex, DF.Age)
    >> gr.tf_mutate(
        Total=gr.sum(DF.n),
        Prop=DF.n/gr.sum(DF.n),
    )
    >> gr.tf_ungroup()
)
df_prop
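
As a rough pandas sketch of the same grouped calculation, the example below uses `groupby(...).transform("sum")` to attach each group's total to its rows. The `df_toy` frame and its counts are hypothetical, chosen so that counts and proportions tell different stories.

```python
import pandas as pd

# Hypothetical toy counts (NOT the real Titanic data):
# two groups of very different sizes
df_toy = pd.DataFrame({
    "Group": ["Small", "Small", "Large", "Large"],
    "Survived": ["No", "Yes", "No", "Yes"],
    "n": [2, 8, 80, 20],
})

# Total persons per group, broadcast back onto every row of that group
df_toy["Total"] = df_toy.groupby("Group")["n"].transform("sum")
# Fraction of each group that falls in each outcome
df_toy["Prop"] = df_toy["n"] / df_toy["Total"]

# Keep only the survivor rows for comparison
survivors = df_toy[df_toy["Survived"] == "Yes"].set_index("Group")
print(survivors[["n", "Total", "Prop"]])
```

In this toy case the large group has more survivors by count, yet a much smaller proportion of it survived. This is the kind of difference that raw counts can hide.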

### __q4__ Visualize proportions

Replicate your visual from q3, but display `Prop` in place of `n`. Document your observations, and note any new/different observations you make in comparison with q3. Is there anything *fishy* in your plot?

Answer the questions under *observations* below.


In [None]:
## TASK: Visualize the proportion of survivors, along with `Class` and `Sex`
(
    df_prop
# solution-begin
    >> gr.tf_filter(DF.Survived == "Yes", DF.Age == "Adult")
    
    >> gr.ggplot(gr.aes("Class", "Prop", fill="Sex"))
    + gr.geom_col(position="dodge")
    + gr.labs(title="Adult Titanic Survivors")
# solution-end
)

*Observations*

<!-- task-begin -->
- Write your observations here.
  - (Your response here)
  - (Your response here)
  - (Your response here)
- Is there anything *fishy* going on in your plot?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- Write your observations here.
  - Based on this new figure, we can see a very different trend among the Crew, where the vast majority of female adults survived.
  - Adult women tended to survive at greater rates than adult men.
    - In particular, higher class seemed to benefit adult women.
  - Oddly, 2nd Class Men had the lowest survival rate among Adults.
- Is there anything *fishy* going on in your plot?
  - I avoided the "trap" by filtering to `Age == "Adult"`.
  
This question is deliberately tricky. A common mistake is to produce a graph like the following.

```python
(
    df_prop
    >> gr.tf_filter(DF.Survived == "Yes")

    >> gr.ggplot(gr.aes("Class", "Prop", fill="Sex"))
    + gr.geom_col(position="dodge")
)
```

Note that this *falsely* implies that everyone in the 1st and 2nd classes survived! Adding an aesthetic for `Age` helps reveal what went wrong:

```python
(
    df_prop
    >> gr.tf_filter(DF.Survived == "Yes")

    >> gr.ggplot(gr.aes("Class", "Prop", fill="Sex", color="Age"))
    + gr.geom_col(position="dodge")
)
```

In the first plot, ggplot drew the Adult and Child bars at the same position, so some bars hid others. This contributed to a false impression of the data.
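
One way to catch this kind of problem before plotting is to count how many rows share each combination of the variables mapped to the plot. The sketch below does this in pandas on a hypothetical `df_toy` frame shaped like `df_prop`; the values are made up.

```python
import pandas as pd

# Hypothetical toy proportions shaped like df_prop
# (values are made up, NOT the Titanic data)
df_toy = pd.DataFrame({
    "Class": ["1st", "1st", "1st", "1st"],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age": ["Adult", "Child", "Adult", "Child"],
    "Survived": ["Yes"] * 4,
    "Prop": [0.3, 1.0, 0.9, 1.0],
})

# Count the rows that share each (x, fill) combination used by the plot
rows_per_bar = df_toy.groupby(["Class", "Sex"]).size()
print(rows_per_bar)

# A count above 1 means several bars land on the same position
overplotted = rows_per_bar[rows_per_bar > 1]
print(overplotted)
```

Here every `Class`/`Sex` combination has two rows (one per `Age`), so a plot that maps only `Class` and `Sex` cannot show them separately.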
<!-- solution-end -->

### __q5__ Visualize with more variables

Visualize `Prop` to show the proportion of occupants in each group who *did* survive, along with aesthetics for `Class`, `Sex`, *and* `Age`. Make sure to show the proportions for *all* survivor groups. Document your observations below.

*Hint*: You may need to make multiple plots using different filters. Or you can look ahead to `e-vis05-multiples` to learn how to use `gr.facet_grid()`.


In [None]:
## TASK:
(
    df_prop
# solution-begin
    >> gr.tf_filter(DF.Survived == "Yes")
    
    >> gr.ggplot(gr.aes("Class", "Prop", fill="Sex"))
    + gr.geom_col(position="dodge")
    + gr.facet_grid("Age~.")
# solution-end
)

*Observations*

Document your observations below.

<!-- task-begin -->
- (Your response here)
- (Your response here)
- (Your response here)
- If you saw something *fishy* in q4 above, use your new plot to explain the fishy-ness.
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- 100% of the 1st and 2nd class children survived.
- A much smaller fraction of 3rd class children survived, with a slight bias towards girls.
- Among adults, women survived in much greater proportions than men.
  - Collectively, these observations seem to support the old maritime practice of ["women and children first"](https://en.wikipedia.org/wiki/Women_and_children_first).
- Surprisingly, the smallest proportion of surviving men were those in 2nd class.
- There were no child crewmembers.

- If you saw something *fishy* in q4 above, use your new plot to explain the fishy-ness.
  - With this new plot, we can understand the *fishy-ness* above: The 1st and 2nd class children survived at 100%. But in the example plot above, the adult and child bars overlapped. This created the false impression that everyone in 1st and 2nd class survived, regardless of whether they were adults or children.
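
As a cross-check on a facetted plot, the same proportions can be laid out as a table with one cell per group. The sketch below is a pandas illustration on a hypothetical `df_toy` frame of survivor rows; the values are made up.

```python
import pandas as pd

# Hypothetical survivor proportions (values are made up, NOT the Titanic data)
df_toy = pd.DataFrame({
    "Class": ["1st"] * 4 + ["3rd"] * 4,
    "Sex": ["Male", "Male", "Female", "Female"] * 2,
    "Age": ["Adult", "Child"] * 4,
    "Prop": [0.3, 1.0, 0.9, 1.0, 0.2, 0.3, 0.5, 0.4],
})

# One row per Class, one column per (Age, Sex) pair;
# every group gets its own cell, so nothing can overlap
table = (
    df_toy
    .set_index(["Class", "Age", "Sex"])["Prop"]
    .unstack(["Age", "Sex"])
)
print(table)
```

Reading across a row compares groups within a class, much as the panels of a facetted plot do.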
<!-- solution-end -->



# Notes

[1] This is basically the same idea as [Dimensional Analysis](https://en.wikipedia.org/wiki/Dimensional_analysis); computing proportions is akin to non-dimensionalizing a quantity.

[2] Further details on this dataset---and its known errors!---are written up in Dawson, "The 'Unusual Episode' Data Revisited" (1995) *Journal of Statistics Education*, [link](https://www.tandfonline.com/doi/full/10.1080/10691898.1995.11910499).
