# Introduction to Exploratory Data Analysis

*Purpose*: *Exploratory Data Analysis* (EDA) is a __crucial__ skill for a practicing data scientist. Unfortunately, much like human-centered design, EDA is hard to teach. This is because EDA is **not** a strict procedure so much as it is a **mindset**. Also like human-centered design, EDA is an *iterative, nonlinear process*. There are two key principles to keep in mind when doing EDA:

1. Curiosity: Generate lots of ideas and hypotheses about your data.
2. Skepticism: Remain unconvinced of those ideas, unless you can find credible
  patterns to support them.

Since EDA is both *crucial* and *difficult*, we will practice doing EDA *a lot* in this course!


## Reading

*Reading*: [Exploratory Data Analysis](https://rstudio.cloud/learn/primers/3.1)

*Topics*: (All topics)

*Reading Time*: ~45 minutes


## Setup


In [None]:
import grama as gr
DF = gr.Intention()

We'll study the diamonds dataset for this exercise.


In [None]:
from grama.data import df_diamonds
df_diamonds = (
    df_diamonds
    >> gr.tf_mutate(
        # Order the cut to aid in plotting
        cut=gr.as_factor(
            DF.cut,
            categories=[
                "Fair",
                "Good",
                "Very Good",
                "Premium",
                "Ideal"
            ]
        )
    )
)


# Basic EDA Tools

There are a few simple tools we can use to investigate a dataset. We should use these tools even before making visuals of the data.


### __q1__ Take the head

Use the appropriate function to get the first 5 observations in `df_diamonds`. Answer the questions under *observations* below.


In [None]:
# TASK: Get the first 5 observations
(
    df_diamonds

)

*Observations*

- What variables does this dataset have?
  - (Your response here)

### __q2__ Use descriptive statistics

The `gr.tf_describe()` function gives useful descriptive statistics on a dataset. Use these values to answer the questions under *observations* below.


In [None]:
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.tf_describe()
)

*Observations*

- How many observations are in the dataset?
  - (Your response here)
- What is a typical value for the `price` of a diamond, according to this dataset?
  - (Your response here)
- What is the largest diamond in the dataset? (According to `carat`.) What is the smallest?
  - (Your response here)
- You identified all the variables in the dataset in __q1__ above. Do the results from `gr.tf_describe()` provide information on **all** of these variables?
  - (Your response here)


## Distinct Values (levels)

Variables that do not take numerical values are sometimes called *categorical variables*; there are other tools that are useful for investigating categorical variables.

The verb `gr.tf_distinct()` is like `gr.tf_filter()`, but it filters for rows that are *distinct* according to the given variables. For instance, if we wanted to know what distinct values of `x` exist in `df_data`, we would call:

```python
(
    df_data
    >> gr.tf_distinct(DF.x)
)
```

> *Aside*: A categorical variable is sometimes called a *factor*. The unique values of a categorical variable are called *levels*.

We can use `gr.tf_distinct()` to figure out what values show up for a categorical variable.
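To make the idea of *levels* concrete, here is a minimal plain-Python sketch of finding distinct values. It does not use grama, and the list `cuts` holds made-up toy values rather than rows from `df_diamonds`; it only illustrates the concept of reducing a column to its unique values.

```python
# Toy data (hypothetical values, not taken from df_diamonds)
cuts = ["Ideal", "Premium", "Good", "Ideal", "Good", "Ideal"]

# dict.fromkeys keeps one entry per value, in order of first appearance
levels = list(dict.fromkeys(cuts))
print(levels)
```

Six rows collapse to three levels, since each repeated value is kept only once.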


### __q3__ Find the distinct `cut` values

Use `gr.tf_distinct()` to find the unique values of `cut` in `df_diamonds`.


In [None]:
# TASK: Find the distinct `cut` values in the dataset
(
    df_diamonds

)

## Counts

Another approach to assessing a categorical variable is to simply count the number of rows that correspond to each distinct value. We can do this with the `gr.tf_count()` verb. For instance, if we wanted to know how many rows there are for each value of `x` in `df_data`, we would call:

```python
(
    df_data
    >> gr.tf_count(DF.x)
)
```
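As a rough illustration of what counting by level means, the sketch below tallies a toy list with the standard library's `collections.Counter`. This is not how grama works internally; the `cuts` values are hypothetical and serve only to show the idea of one count per distinct value.

```python
from collections import Counter

# Toy data (hypothetical values, not taken from df_diamonds)
cuts = ["Ideal", "Premium", "Good", "Ideal", "Good", "Ideal"]

# Tally how many times each distinct value appears
counts = Counter(cuts)

# List the levels from most to least frequent
for level, n in counts.most_common():
    print(level, n)
```

The counts always sum to the total number of rows, which is a quick sanity check when counting a real dataset.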


### __q4__ Find the count of cut values

Use `gr.tf_count()` to find the number of rows for each distinct `cut` value in `df_diamonds`.


In [None]:
# TASK: Count the rows for each `cut` value in the dataset
(
    df_diamonds

)


# Guided EDA

I'm going to walk you through a train of thought I had when studying the diamonds dataset.

There are four standard "C's" of [judging](https://en.wikipedia.org/wiki/Diamond_(gemstone)) a diamond: `carat`, `cut`, `color`, and `clarity`. All four are in the diamonds dataset.

*Note*: The remainder of this exercise will consist of interpreting pre-made graphs. You can run the whole notebook to generate all the figures at once. Just make sure to do all the exercises and write your observations!


## Hypothesis 1

Here's a hypothesis:

>  `Ideal` is the "best" value of `cut` for a diamond. Since an `Ideal` cut seems more labor-intensive, I hypothesize that `Ideal` cut diamonds are less numerous than other cuts.

### __q5__ Assess hypothesis 1

Run the cell below, and study the plot. Was hypothesis 1 correct? Why or why not?


In [None]:
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut"))
    + gr.geom_bar()
)


*Observations*

- Is hypothesis 1 true or not?
  - (Your response here)


## Hypothesis 2

Another hypothesis: 

> The `Ideal` cut diamonds should be the most pricey.


### __q6__ Assess hypothesis 2

Study the following graph; does it support, contradict, or not relate to hypothesis 2?


In [None]:
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "price"))
    + gr.geom_point()
)

*Observations*

- Does this plot support, contradict, or not relate to hypothesis 2?
  - (Your response here)


The next exercise uses a set of *boxplots*. In each boxplot, the middle bar denotes the median, the edges of the box denote the lower and upper *quartiles* (so the box holds the middle half of the data), the lines extend over the remaining non-outlying values, and the dots denote outliers.
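To connect the boxplot vocabulary to numbers, here is a small standard-library sketch that computes the median and quartiles of a toy list. The `prices` values are made up, and the 1.5 × IQR cutoff for flagging outliers is a common convention that I am assuming here; the plotting library may use a different rule or a different quartile method.

```python
import statistics

# Toy data (hypothetical values, not taken from df_diamonds)
prices = [300, 400, 500, 600, 700, 800, 900, 5000]

# Quartiles split the sorted data into four equal-sized parts
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")

# Interquartile range: the height of the box in a boxplot
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print("lower quartile:", q1)
print("median:", median)
print("upper quartile:", q3)
print("outliers:", outliers)
```

Note that the single very large value barely moves the median, which is one reason boxplots are useful for comparing groups with skewed data.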


### __q7__ Assess hypothesis 2, take 2

Study the following graph; does it support or contradict hypothesis 2?


In [None]:
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "price"))
    + gr.geom_boxplot()
)


*Observations*

- Does this plot support or contradict hypothesis 2?
  - (Your response here)


Upon making the graph in __q7__, I was *very* surprised, so I did some reading on diamond cuts. It turns out that some gemcutters [sacrifice cut for carat](https://en.wikipedia.org/wiki/Diamond_(gemstone)#Cut). Could this effect explain the surprising pattern above?


### __q8__ Unravel hypothesis 2

Study the following graph; does it support a "sacrifice cut for carat" hypothesis? How might this relate to price?

*Hint*: The article linked above will help you answer these questions!


In [None]:
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "carat"))
    + gr.geom_boxplot()
)

*Observations*

- (Your response here)
