---
title: "01_exploratory_data_analysis"
author: "Sergio Uribe"
format: html
editor: visual
---
# Start here
Install the *pacman* package, a user-friendly package manager for R. Packages in R are collections of functions, data, and code that extend the capabilities of base R. They are developed by the R community and can be installed and loaded into an R session to provide additional functionality for data analysis, visualization, modeling, and more. Packages are essential for complex data analyses and are a major reason why R is popular for statistical computing and data science.
```{r}
install.packages("pacman")
```
The `p_load()` function from *pacman* checks whether each package is already installed and installs it if it is not. It then checks whether the package is already loaded and loads it if it is not. This saves time and reduces the errors that can occur when installing and loading packages manually.
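For comparison, the manual pattern that `p_load()` saves you from writing looks roughly like the sketch below. `load_or_install()` is a hypothetical helper written for this illustration, not part of *pacman*, and it is demonstrated on the built-in *stats* package so that nothing actually gets installed.

```{r}
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  # Report (invisibly) whether the package is now attached
  invisible(pkg %in% .packages())
}

# stats ships with R, so this call only loads it
load_or_install("stats")
```

With several packages this check has to be repeated for each one, which is the repetition `p_load()` removes.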
# Install the necessary packages
```{r}
pacman::p_load(
  tidyverse,      # collection of packages for data manipulation, exploration, and visualization
  gtsummary,      # publication-ready summary tables of data frames and regression models
  janitor,        # data cleaning and quick tabulations
  here,           # builds file paths relative to the project root, regardless of the operating system
  ggrepel,        # non-overlapping text labels for ggplot2
  palmerpenguins, # the penguins dataset
  gapminder,      # excerpt of the Gapminder data (life expectancy, GDP per capita, population)
  SmartEDA        # automated exploratory data analysis: summary statistics, correlation matrices, and visualizations
)
```
# Load the dataset
```{r}
# Print the project root that here() uses to build file paths
# (this does not change the working directory)
here::here()
```
## Folder structure for data analysis
A good folder structure for data analysis should be organized, easy to navigate, and scalable for future projects. One popular structure is the project-oriented structure, which includes separate directories for data, code, and output.
The **data directory** should contain all the raw and processed data files, as well as any metadata or documentation about the data. It is important to keep the data directory separate from the code directory to avoid accidentally modifying the original data.
The **code directory** should contain all the scripts, functions, and notebooks used for data analysis. It is helpful to organize the code into subdirectories based on their purpose or type, such as data cleaning, exploratory data analysis, and modeling.
The **output directory** should contain all the final products of the analysis, such as reports, visualizations, and tables. It is useful to organize the output directory by project or date, with subdirectories for different versions or drafts.
In addition to these directories, it may be helpful to include a documentation directory for project documentation, a reference directory for external resources, and a results directory for storing intermediate results of analysis.
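As a minimal sketch, the structure described above could be created from R as follows. The folder names and the `my_analysis` project name are assumptions chosen for illustration, and a temporary directory is used so that running the chunk does not touch your real project.

```{r}
# Hypothetical project root inside a temporary folder
project_root <- file.path(tempdir(), "my_analysis")

# Folder names are illustrative; adapt them to your own conventions
folders <- c("data/raw", "data/processed", "code", "output", "docs")

for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

In a real project you would run this once from the project root and then build paths with calls such as `here::here("data", "raw", "my_file.csv")`, where `my_file.csv` stands for whatever file you store there.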
# Tufte principles
## Data-ink ratio
```{r}
theme_set(theme_minimal())
```
## Show the data in high-resolution
## Use clear and detailed labelling
```{r}
ggplot(mtcars) +
  geom_point(aes(wt, mpg), color = 'red') +
  geom_text(aes(wt, mpg, label = rownames(mtcars)))
```
```{r}
ggplot(mtcars) +
  geom_point(aes(wt, mpg), color = 'red') +
  geom_text_repel(aes(x = wt,
                      y = mpg,
                      # color = as.factor(cyl),
                      # shape = hp,
                      label = rownames(mtcars))) +
  theme_classic() +
  labs(title = "Car Weight vs. Fuel Efficiency",
       subtitle = "Data from the [insert source here]",
       x = "Weight (wt)",
       y = "Miles per Gallon (mpg)",
       caption = "Data points represent car models in the mtcars dataset. Labels show the model names.")
```
# WORKSHOP
### What is EDA?
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
1. Generate questions about your data
2. Search for answers by visualising, transforming, and/or modeling your data
3. Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
### The EDA mindset
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA, you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on lines of inquiry that reveal insights worth writing up and communicating to others.
### Questions
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." --- John Tukey
### Two useful questions
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
1. What type of **variation** occurs **within** my variables?
2. What type of **covariation** occurs **between** my variables?
The rest of this tutorial will look at these two questions. To make the discussion easier, let's define some terms...
### Definitions
- A **variable** is a quantity, quality, or property that you can measure.
- A **value** is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An **observation** or **case** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a case or data point.
- **Tabular data** is a table of values, each associated with a variable and an observation. Tabular data is **tidy** if each value is placed in its own cell, each variable in its own column, and each observation in its own row.
So far, all of the data that you've seen has been tidy. In real life, most data isn't tidy, so we'll come back to these ideas again in [Data Wrangling](https://tutorials.shinyapps.io/01-Exploratory-Data-Analysis/_w_0a9d7d06/).
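To make the definition of tidy data concrete, here is a small sketch with made-up numbers. The `untidy` table and its `site`, `year`, and `count` names are invented for this example. It stores one variable (the year) in the column names, and `pivot_longer()` from *tidyr* moves it into its own column.

```{r}
library(tidyr)
library(tibble)

# Made-up counts, for illustration only: one column per year is not tidy,
# because the values of the variable "year" are stored in column names
untidy <- tribble(
  ~site, ~`2021`, ~`2022`,
  "A",   10,      12,
  "B",    7,       9
)

# Tidy version: one row per observation (site and year), one column per variable
tidy <- pivot_longer(untidy,
                     cols = c(`2021`, `2022`),
                     names_to = "year",
                     values_to = "count")
tidy
```

The tidy table has four rows, one for each combination of site and year.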
### What is variation?
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough---you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different objects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal useful information. The best way to understand that pattern is to visualise the distribution of the variable's values. How you visualise the distribution of a variable will depend on whether the variable is **categorical** or **continuous**.
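A minimal sketch of that distinction, assuming the `penguins` data from *palmerpenguins*: a bar chart for a categorical variable and a histogram for a continuous one. The object names `p_cat` and `p_cont` are only for this example.

```{r}
library(ggplot2)
library(palmerpenguins)

# Categorical variable: a bar chart counts the observations in each category
p_cat <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Continuous variable: a histogram bins the values and counts each bin
p_cont <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

p_cat
p_cont
```

The histogram will warn about the rows where `body_mass_g` is missing, which is itself a useful thing to learn about the data.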
## Covariation
### What is covariation?
If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on whether your variables are categorical or continuous.
![](https://tutorials.shinyapps.io/01-Exploratory-Data-Analysis/_w_0a9d7d06/www/images/plots-table.png)
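Covariation can also be checked with numbers before plotting anything. The sketch below, which assumes the `penguins` data from *palmerpenguins*, counts the combinations of two categorical variables and compares a continuous variable across the groups of a categorical one. `mass_by_species` is a name chosen for this example.

```{r}
library(dplyr)
library(palmerpenguins)

# Two categorical variables: how often does each combination occur?
penguins %>%
  count(species, island)

# One categorical and one continuous variable: compare a summary across groups
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarise(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

mass_by_species
```

If the group means differ clearly, as they do here for Gentoo penguins, species and body mass covary.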
# Exercises EDA
## head (the chunk, run the code)
```{r}
head(penguins)
```
## glimpse
```{r}
glimpse(penguins)
```
## Select (the pipe)
```{r}
penguins %>%
  select(species) %>%
  summary()
```
## summary
```{r}
summary(penguins)
```
## summary one column (select)
```{r}
summary(penguins$species)
```
```{r}
penguins %>%
  select(species) %>%
  summary()
```
## tables
```{r}
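# Cross-tabulate species by island (counts)
penguins %>%
  tabyl(species, island)
```
```{r}
# Same table as row percentages, with the counts in parentheses
penguins %>%
  tabyl(species, island) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()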
```
## SmartEDA
```{r}
# Print an overview of the dataset
penguins %>%
  SmartEDA::ExpData()
```
```{r}
# Write a full HTML report to report.html
penguins %>%
  SmartEDA::ExpReport(op_file = "report.html")
```
## Reporting summaries aka Table 1
```{r}
penguins %>%
  gtsummary::tbl_summary(by = species)
```
## Reporting regression results
```{r}
model <- lm(body_mass_g ~ species + island, data = penguins)
```
```{r}
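model %>%
  gtsummary::tbl_regression()
```
```{r}
# Logistic regression: sex as a function of bill length, body mass, and island
model_log <- glm(sex ~ bill_length_mm + body_mass_g + island,
                 data = penguins,
                 family = "binomial")
```
```{r}
# exponentiate = TRUE reports odds ratios instead of log-odds
model_log %>%
  gtsummary::tbl_regression(exponentiate = TRUE)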
```
# Exercises Plotting
## Exercise 1: Plotting One Variable
**Research Question: What is the distribution of body mass among penguins in the dataset?**
- Create a histogram of the **`body_mass_g`** variable.
```{r}
# Create a histogram of body_mass_g
penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram()
```
Your turn: repeat the histogram for **`flipper_length_mm`**.
```{r}
penguins %>%
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()
```
```{r}
# One panel per species, stacked vertically
penguins %>%
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram() +
  facet_grid(species ~ .)
```
## Exercise 2: Plotting Two Variables (Categorical vs. Continuous)
**Research Question: How does the bill length vary among different species of penguins?**
- Create a box plot comparing the **`species`** variable (categorical) with the **`bill_length_mm`** variable (continuous).
```{r}
penguins %>%
  ggplot(aes(x = species,
             y = bill_length_mm)) +
  geom_boxplot()
```
```{r}
penguins %>%
  ggplot(aes(x = species,
             y = bill_length_mm)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2, width = .2)
```
Your turn: compare the body mass of the penguins across the islands.
```{r}
penguins %>%
  ggplot(aes(x = island,
             y = body_mass_g)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  geom_jitter(alpha = 0.1, width = 0.1)
```
## Exercise 3: Plotting Two Variables (Two Continuous)
**Research Question: Is there a relationship between bill length and bill depth in penguins?**
- Create a scatter plot with **`bill_length_mm`** on the x-axis and **`bill_depth_mm`** on the y-axis.
```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  geom_point()
```
Explore the association by adding a linear trend line:
```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm")
```
## Exercise 4: Plotting Three Variables
**Research Question: How does the relationship between bill length and bill depth differ among different penguin species?**
- Create a scatter plot with **`bill_length_mm`** on the x-axis, **`bill_depth_mm`** on the y-axis, and color the points based on the **`species`** variable.
```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             color = species)) +
  geom_point()
```
```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
```
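Comparing this plot with the trend line in Exercise 3 illustrates Simpson's paradox: across all penguins the fitted line slopes downward, but within each species it slopes upward.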
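To put numbers on the two trend-line plots, one option is to compare the correlation across all penguins with the correlation within each species. This is a sketch using Pearson correlation on the complete cases. The names `overall_r` and `by_species_r` are chosen for this example.

```{r}
library(dplyr)
library(palmerpenguins)

# Correlation between bill length and bill depth, ignoring species
overall_r <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                 use = "complete.obs")

# The same correlation computed separately for each species
by_species_r <- penguins %>%
  group_by(species) %>%
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

overall_r
by_species_r
```

The overall correlation is negative while each within-species correlation is positive, which is why coloring by species changes the conclusion.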
## Exercise 5: Creating a New Variable with **`mutate`** and Plotting
- Use **`mutate`** to create a new variable called **`body_mass_kg`** by dividing **`body_mass_g`** by 1000. Then, create a scatter plot with **`bill_length_mm`** on the x-axis and **`body_mass_kg`** on the y-axis.
```{r}
penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  ggplot(aes(x = bill_length_mm,
             y = body_mass_kg)) +
  geom_point()
```
## Exercise 6: Reshaping Data and Plotting
**Research Question: How does the relationship between body mass and bill depth vary across different penguin species, considering the influence of island location?**
- In this exercise, we reshape the data to a long format using **`pivot_longer()`** to create a dataset with variables for body mass, bill depth, species, and island. Then, we create a faceted scatter plot using **`ggplot2`** to explore the relationship between body mass and bill depth across different penguin species, considering the influence of island location.
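One possible reading of the reshaping step described above is sketched here. It is an illustration under assumptions, not the exercise's official solution: the column names `measurement` and `value` and the choice of box plots are made up for this example.

```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(palmerpenguins)

# Long format: one row per penguin and measurement
penguins_long <- penguins %>%
  select(species, island, body_mass_g, bill_depth_mm) %>%
  drop_na() %>%
  pivot_longer(cols = c(body_mass_g, bill_depth_mm),
               names_to = "measurement",
               values_to = "value")

# One row of panels per measurement, one column of panels per island
p_long <- ggplot(penguins_long,
                 aes(x = species, y = value, color = species)) +
  geom_boxplot() +
  facet_grid(measurement ~ island, scales = "free_y")

p_long
```

The long format lets a single `facet_grid()` call put both measurements in the same figure, with `scales = "free_y"` giving each measurement its own y-axis.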
```{r}
# Note: this chunk uses the gapminder dataset, not penguins
gapminder %>%
  # Filter the data for a specific year range
  filter(year >= 1950 & year <= 2000) %>%
  # Group the data by continent and year
  group_by(continent, year) %>%
  # Calculate the log10-transformed life expectancy
  mutate(log_life_expectancy = log10(lifeExp)) %>%
  # Average the log-transformed life expectancy per continent and year
  summarize(avg_log_life_expectancy = mean(log_life_expectancy, na.rm = TRUE),
            .groups = "drop") %>%
  # Line plot of the average log-transformed life expectancy over time
  ggplot(aes(x = year, y = avg_log_life_expectancy, color = continent)) +
  geom_line() +
  labs(x = "Year", y = "Average Log Life Expectancy", color = "Continent") +
  theme_minimal()
```
```{r}
gapminder %>%
  # Keep four countries to compare
  filter(country %in% c("China", "Sweden", "Germany", "Honduras")) %>%
  # Line plot of GDP per capita over time, one line per country
  ggplot(aes(x = year,
             y = gdpPercap,
             color = country)) +
  geom_line() +
  labs(x = "Year", y = "GDP per capita", color = "Country") +
  theme_minimal()
```
```{r}
gapminder %>%
  # Keep four countries to compare
  filter(country %in% c("China", "Sweden", "Germany", "Honduras")) %>%
  # Line plot of GDP per capita over time, drawn on a log10 y-axis
  ggplot(aes(x = year,
             y = gdpPercap,
             color = country)) +
  geom_line() +
  labs(x = "Year", y = "GDP per capita (log scale)", color = "Country") +
  theme_minimal() +
  scale_y_log10()
```