01_introduction_to_eda.qmd

---
title: "01_exploratory_data_analysis"
author: "Sergio Uribe"
format: html
editor: visual
---

# Start here

Install the pacman *pacman* package, as a user-friendly package manager for R. Packages in R are collections of functions, data, and code that extend the capabilities of base R. They are developed by the R community and can be installed and loaded into an R session to provide additional functionality for data analysis, visualization, modeling, and more. Packages in R are essential for performing complex data analyses and are a major reason why R is a popular language for statistical computing and data science.

```{r}
install.packages("pacman")
```

The *pacman* package checks whether the package is already installed and installs it if it is not. It also checks whether the package is already loaded and loads it if it is not. This command saves time and reduces errors that may occur when manually installing and loading packages,

# Install the necessary packages

```{r}
pacman::p_load(tidyverse, #  collection of R packages that provides tools for data manipulation, exploration, and visualization.
               gtsummary, # generates publication-ready summary tables of data frames and regression models.
               janitor, # for data cleaning
               here, # provides a flexible way to specify file paths by dynamically generating them based on the current working directory, regardless of the operating system.
               ggrepel, # for the labels
               palmerpenguins, # the penguins dataset
               gapminder, # the world bank dataset
               SmartEDA) # used to automatically perform exploratory data analysis on a dataset, including generating summary statistics, correlation matrices, and visualizations
```

# Load the dataset

```{r}
# Set the working directory to the root of the project
here::here()
```

## Folder structure for data analysis

A good folder structure for data analysis should be organized, easy to navigate, and scalable for future projects. One popular structure is the project-oriented structure, which includes separate directories for data, code, and output.

The **data directory** should contain all the raw and processed data files, as well as any metadata or documentation about the data. It is important to keep the data directory separate from the code directory to avoid accidentally modifying the original data.

The **code directory** should contain all the scripts, functions, and notebooks used for data analysis. It is helpful to organize the code into subdirectories based on their purpose or type, such as data cleaning, exploratory data analysis, and modeling.

The **output directory** should contain all the final products of the analysis, such as reports, visualizations, and tables. It is useful to organize the output directory by project or date, with subdirectories for different versions or drafts.

In addition to these directories, it may be helpful to include a documentation directory for project documentation, a reference directory for external resources, and a results directory for storing intermediate results of analysis.

# Tufte principles

## Information:ink ratio

```{r}
theme_set(theme_minimal())
```

## Show the data in high-resolution

## Use clear and detailed labelling

```{r}
ggplot(mtcars) +
  geom_point(aes(wt, mpg), color = 'red') +
  geom_text(aes(wt, mpg, label = rownames(mtcars))) 
```

```{r}
ggplot(mtcars) +
  geom_point(aes(wt, mpg), color = 'red') +
  geom_text_repel(aes(x = wt, 
                      y = mpg,
                      # color = as.factor(cyl), 
                      # shape = hp, 
                      label = rownames(mtcars))) +
  theme_classic() + 
  labs(title = "Car Weight vs. Fuel Efficiency",
       subtitle = "Data from the [insert source here]", 
       x = "Weight (wt)", 
       y = "Miles per Gallon (mpg)", 
       caption = "Data points represent car models in the mtcars dataset. Labels show the model names.")
```

# WORKSHOP

### What is EDA?

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1.  Generate questions about your data

2.  Search for answers by visualising, transforming, and/or modeling your data

3.  Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.

### The EDA mindset

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA, you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on lines of inquiry that reveal insights worth writing up and communicating to others.

### Questions

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." --- John Tukey

### Two useful questions

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

1.  What type of **variation** occurs **within** my variables?

2.  What type of **covariation** occurs **between** my variables?

The rest of this tutorial will look at these two questions. To make the discussion easier, let's define some terms...

### Definitions

-   A **variable** is a quantity, quality, or property that you can measure.

-   A **value** is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

-   An **observation** or **case** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a case or data point.

-   **Tabular data** is a table of values, each associated with a variable and an observation. Tabular data is **tidy** if each value is placed in its own cell, each variable in its own column, and each observation in its own row.

    So far, all of the data that you've seen has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [Data Wrangling](https://tutorials.shinyapps.io/01-Exploratory-Data-Analysis/_w_0a9d7d06/).

### What is variation?

**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough---you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different objects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).

Every variable has its own pattern of variation, which can reveal useful information. The best way to understand that pattern is to visualise the distribution of the variable's values. How you visualise the distribution of a variable will depend on whether the variable is **categorical** or **continuous**.

## Covariation

### What is covariation?

If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on whether your variables are categorical or continuous.

![](https://tutorials.shinyapps.io/01-Exploratory-Data-Analysis/_w_0a9d7d06/www/images/plots-table.png)

# Exercises EDA

## head (the chunk, run the code)

```{r}
head(penguins)
```

## glimpse

```{r}
glimpse(penguins)
```

## Filter (the pipe)

```{r}
penguins %>% 
  select(species) %>% 
  summary()
```

## summary

```{r}
summary(penguins)
```

## summary one column (select)

```{r}
summary(penguins$species)
```

```{r}
penguins %>% 
  select(species) %>% 
  summary()
```

## tables

```{r}
penguins %>% 
  tabyl(species, island)
```

```{r}
penguins %>% 
  tabyl(species, island) %>% 
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
```

## SmartEDA

```{r}
penguins %>% 
  SmartEDA::ExpData()
```

```{r}
penguins %>% 
  SmartEDA::ExpReport(op_file="report.html")
```

## Reporting summaries aka Table 1

```{r}
penguins %>% 
  gtsummary::tbl_summary(by = species)
```

## Reporting regression results

```{r}
model <- lm(body_mass_g ~  species + island, data = penguins)
```

```{r}
model %>% 
  gtsummary::tbl_regression()
```

```{r}

model_log <- glm(sex ~  bill_length_mm + body_mass_g + island, data = penguins, family = "binomial")
```

```{r}
model_log %>% 
  gtsummary::tbl_regression(exponentiate = TRUE)
```

# Exercises Plotting

## Exercise 1: Plotting One Variable

**Research Question: What is the distribution of body mass among penguins in the dataset?**

-   Create a histogram of the **`body_mass_g`** variable.

```{r}
# Create a histogram of body_mass_g
penguins %>% 
  ggplot(aes(x = body_mass_g)) +
  geom_histogram()
```

Your turn for flipper_length

```{r}
penguins %>% 
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()
```

```{r}
penguins %>% 
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram() + 
  facet_grid(species ~ . )
```

## Exercise 2: Plotting Two Variables (Categorical vs. Continuous)

**Research Question: How does the bill length vary among different species of penguins?**

-   Create a box plot comparing the **`species`** variable (categorical) with the **`bill_length_mm`** variable (continuous).

```{r}
penguins %>% 
  ggplot(aes(x = species, 
             y = bill_length_mm)) +
  geom_boxplot()
```

```{r}
penguins %>% 
  ggplot(aes(x = species, 
             y = bill_length_mm)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2, width = .2)
```

Your turn for the weight of the penguins between the islands

```{r}
penguins %>% 
  ggplot(aes(x = island, 
             y = body_mass_g)) +
  geom_violin() + 
  geom_boxplot(width = 0.1) +
  geom_jitter(alpha = 0.1, width = 0.1)
```

## Exercise 3: Plotting Two Variables (Two Continuous)

**Research Question: Is there a relationship between bill length and bill depth in penguins?**

-   Create a scatter plot with **`bill_length_mm`** on the x-axis and **`bill_depth_mm`** on the y-axis.

```{r}
penguins %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm)) +
  geom_point()
```

Explore the association

```{r}
penguins %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm)) +
  geom_point() + 
  geom_smooth(method = "lm")
```

## Exercise 4: Plotting Three Variables

**Research Question: How does the relationship between bill length and bill depth differ among different penguin species?**

-   Create a scatter plot with **`bill_length_mm`** on the x-axis, **`bill_depth_mm`** on the y-axis, and color the points based on the **`species`** variable.

```{r}
penguins %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm, 
             color = species)) +
  geom_point()
```

```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
```

Simps

## Exercise 5: Creating a New Variable with **`mutate`** and Plotting

-   Use **`mutate`** to create a new variable called **`body_mass_kg`** by dividing **`body_mass_g`** by 1000. Then, create a scatter plot with **`bill_length_mm`** on the x-axis and **`body_mass_kg`** on the y-axis.

```{r}
penguins %>% 
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_kg)) +
  geom_point()
```

## Exercise 6: Reshaping Data and Plotting

**How does the relationship between body mass and bill depth vary across different penguin species, considering the influence of island location?**

-   In this exercise, we reshape the data to a long format using **`pivot_longer()`** to create a dataset with variables for body mass, bill depth, species, and island. Then, we create a faceted scatter plot using **`ggplot2`** to explore the relationship between body mass and bill depth across different penguin species, considering the influence of island location.

```{r}
gapminder %>%
  # Filter the data for a specific year range
  filter(year >= 1950 & year <= 2000) %>%  
  # Group the data by continent and year
  group_by(continent, year) %>%  
  # Calculate the log10-transformed life expectancy
  mutate(log_life_expectancy = log10(lifeExp)) %>%  
  summarize(avg_log_life_expectancy = mean(log_life_expectancy, na.rm = TRUE)) %>%  # Summarize the average log-transformed life expectancy per continent and year
  ggplot(aes(x = year, y = avg_log_life_expectancy, color = continent)) +  # Create a line plot to visualize the average log-transformed life expectancy over time
  geom_line() +
  labs(x = "Year", y = "Average Log Life Expectancy", color = "Continent") +
  theme_minimal()
```

```{r}
gapminder %>%
  filter(country %in% c("China", "Sweden", "Germany", "Honduras")) %>%  # Filter the data for China and Latvia
  group_by(country, year) %>%  # Group the data by country and year
  ggplot(aes(x = year, 
             y = gdpPercap, 
             color = country)) +  # Create a line plot to compare log-transformed population over time
  geom_line() +
  labs(x = "Year", y = "gdpPercap", color = "Country") +
  theme_minimal()
```

```{r}
gapminder %>%
  filter(country %in% c("China", "Sweden", "Germany", "Honduras")) %>%  # Filter the data for China and Latvia
  group_by(country, year) %>%  # Group the data by country and year
  ggplot(aes(x = year, 
             y = gdpPercap, 
             color = country)) +  # Create a line plot to compare log-transformed population over time
  geom_line() +
  labs(x = "Year", y = "Log gdpPercap", color = "Country") +
  theme_minimal() +
  scale_y_log10()
  

```