---
execute:
echo: true
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.6}
\thispagestyle{chapterfirstpage}
```
:::
# Curate {#sec-curate-chapter}
<!-- Data Curation in Text Analysis: Strategies for Structuring and Documenting Datasets -->
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
::: {.callout}
**{{< fa regular list-alt >}} Outcomes**
- Describe the importance of data curation in text analysis
- Recognize the different types of data formats
- Associate the types of data formats with the appropriate R programming techniques to curate the data
:::
```{r}
#| label: curate-data-packages
#| echo: false
```
In this chapter, we look at the next step in a text analysis project: data curation, that is, the process of converting the original data we acquire into a tidy dataset. Acquired data can come in a wide variety of formats. These formats tend to signal the richness of the metadata that is included in the file content. We will consider three general types of content formats: (1) unstructured data, (2) structured data, and (3) semi-structured data. Regardless of the file type and the structure of the data, it will be necessary to consider how to curate a dataset such that its structure reflects the basic unit of analysis that we wish to investigate. The resulting dataset will form the base from which we will further transform the data to align with the unit(s) of observation required for the analysis method that we will implement. Once the dataset is curated, we will create a data dictionary that describes the dataset and its variables for transparency and reproducibility.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Pattern Matching, Tidy Datasets\
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To familiarize yourself with the basics of using the pattern matching syntax Regular Expressions and the {dplyr} package to manipulate data into Tidy datasets.
:::
## Unstructured {#sec-unstructured}
The bulk of textual data is of the unstructured variety. Unstructured data is data that has not been organized to make the information contained within machine-readable.\index{unstructured data} Remember that text in itself is not information.\index{information} Only when given explicit context in the form of metadata does text become informative. Metadata can be linguistic or non-linguistic in nature.\index{metadata} So for unstructured data there is little to no metadata directly associated with the data.
### Reading data
Some of the common file formats which contain unstructured data include TXT\index{plain text}, PDF\index{portable document format (PDF)}, and DOCX\index{Word document (DOCX)}. Although these formats are unstructured, they are not the same. Reading these files into R requires different techniques and tools.
There are many ways to read TXT files into R and many packages that can be used to do so. For example, using {readr}\index{R packages!readr}, we can read the entire file into a single character string with `read_file()` or read the file by lines with `read_lines()`, in which case each line is an element of a character vector.\cindex{read_file()}\cindex{read_lines()}
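As a minimal sketch of the difference between the two functions, the following writes a small, made-up three-line text file to a temporary location and reads it back both ways. The file and its contents are assumptions for illustration only.

```r
library(readr)

# Create a small temporary TXT file (hypothetical content for illustration)
tmp_file <- tempfile(fileext = ".txt")
write_lines(c("First line.", "Second line.", "Third line."), tmp_file)

# Read the whole file as one character string
txt_whole <- read_file(tmp_file)
length(txt_whole)

# Read the file line by line: one element per line
txt_lines <- read_lines(tmp_file)
length(txt_lines)
```

Comparing the two lengths shows why the choice matters: the first object needs to be split before line-level work, while the second is already organized by line.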
Less commonly used in prepared data resources, PDF and DOCX files are more complex than TXT files as they contain formatting and embedded document metadata. However, these attributes are primarily for visual presentation and not for machine-readability. We therefore need an alternate strategy to extract the text content, and potentially some of the metadata, from these files. For example, using {readtext} [@R-readtext]\index{R packages!readtext}, we can read the text content from PDF and DOCX files with `readtext()`, which returns a data frame with one row per document and the text as a character string.\cindex{readtext()}
Whether in TXT, PDF, or DOCX format, the resulting data structure will require further processing to convert the data into a tidy dataset.
### Orientation
As an example of curating an unstructured source of corpus data, let's take a look at the [Europarl Parallel Corpus](https://www.statmt.org/europarl/) [@Koehn2005].\index{Europarl Parallel Corpus} This corpus contains parallel texts (source and translated documents) from the European Parliamentary proceedings between 1996 and 2011 for some 21 European languages.\index{parallel corpus}
Let's assume we selected this corpus because we are interested in researching Spanish to English translations. After consulting the corpus website, downloading the archive file, and inspecting the unarchived structure, we have the file structure seen in @def-curate-europarl-file-structure.\index{translation studies}
::: {#def-curate-europarl-file-structure}
Project directory structure for the Europarl Parallel Corpus
```{.bash code-line-numbers="false"}
project/
├── process/
│   ├── 1-acquire-data.qmd
│   ├── 2-curate-data.qmd
│   └── ...
├── data/
│   ├── analysis/
│   ├── derived/
│   └── original/
│       ├── europarl_do.csv
│       └── europarl/
│           ├── europarl-v7.es-en.en
│           └── europarl-v7.es-en.es
├── reports/
├── DESCRIPTION
├── Makefile
└── README
```
\index{research scaffold}
:::
The *europarl_do.csv* file contains the data origin information documented as part of the acquisition process.\index{data origin} The contents are seen in @tbl-curate-europarl-data-origin.
```{r}
#| label: tbl-curate-europarl-data-origin
#| tbl-cap: "Data origin: Europarl Corpus"
#| tbl-colwidths: [25, 75]
#| echo: false
# Read in the data origin file
read_csv("data/curate-europarl_do.csv") |>
tt(width = 1)
```
```{r}
#| label: curate-acquire-europarl
#| eval: false
#| echo: false
get_compressed_data(
url = "https://www.statmt.org/europarl/v7/es-en.tgz",
target_dir = "../data/original/europarl",
confirmed = TRUE
)
```
Now let's get familiar with the corpus directory structure and the files. In @def-curate-europarl-file-structure, we see that there are two corpus files, *europarl-v7.es-en.es* and *europarl-v7.es-en.en*, that contain the source and target language texts, respectively. The file names indicate that the files contain Spanish-English parallel texts. The *.es* and *.en* extensions indicate the language of the text.
Looking at the beginning of the *.es* and *.en* files, in @def-curate-europarl-es and @def-curate-europarl-en, we see that the files contain a series of lines in either the source or target language.
::: {#def-curate-europarl-es}
*europarl-v7.es-en.es* file
```{.xml code-line-numbers="false"}
Reanudación del período de sesiones
Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.
Como todos han podido comprobar, el gran "efecto del año 2000" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de catástrofes naturales verdaderamente terribles.
Sus Señorías han solicitado un debate sobre el tema para los próximos días, en el curso de este período de sesiones.
A la espera de que se produzca, de acuerdo con muchos colegas que me lo han pedido, pido que hagamos un minuto de silencio en memoria de todas las víctimas de las tormentas, en los distintos países de la Unión Europea afectados.
```
:::
We can clearly appreciate that the data is unstructured. That is, there is no explicit metadata associated with the data. The data is just a series of character strings separated by lines. The only information that we can surmise from the structure of the data is that the texts are line-aligned and that the data in each file corresponds to source and target languages.
::: {#def-curate-europarl-en}
*europarl-v7.es-en.en* file
```{.xml code-line-numbers="false"}
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
```
:::
Now, before embarking on a data curation process\index{curate data}, it is advisable to define the structure of the data that we want to create. I call this the "**idealized structure**" of the data. A curated dataset should reflect the contents of the original data, yet in a tidy format, to maintain the integrity of and connection with the original data.
Given what we know about the data, we can define the idealized structure of the data as seen in @tbl-curate-europarl-structure-example.
::: {#tbl-curate-europarl-structure-example tbl-colwidths="[10, 17, 19, 54]"}
| variable | name | type | description |
|----------|------|---------------|-------------|
| type | Document type | character | Contains the type of document, either 'Source' or 'Target' |
| lines | Lines | character | Contains the text of each line in the document |
Idealized structure for the curated Europarl Corpus datasets
:::
Our task now is to develop code that will read the original data and render the idealized structure as a curated dataset for each corpus file. We will then write the datasets to the *data/derived/* directory. The code we develop will be added to the *2-curate-data.qmd* file. And finally, the datasets will be documented with a data dictionary file.
### Tidy the data
To create the idealized dataset structure in @tbl-curate-europarl-structure-example, let's start by reading the files by lines into R. As the files are aligned by lines, we will use the `read_lines()` function to read the files into character vectors.
::: {#exm-curate-europarl-readr}
```r
# Load package
library(readr)
# Read Europarl files .es and .en
europarl_es_chr <-
  read_lines("../data/original/europarl/europarl-v7.es-en.es")
europarl_en_chr <-
  read_lines("../data/original/europarl/europarl-v7.es-en.en")
```
\cindex{library()}\cindex{read_lines()}
```{r}
#| label: curate-europarl-readr
#| echo: false
# Load package
library(readr)
# Read Europarl files .es and .en
europarl_es_chr <-
read_lines("data/europarl/original/europarl-v7.es-en.es")
europarl_en_chr <-
read_lines("data/europarl/original/europarl-v7.es-en.en")
```
:::
Using the `read_lines()` function, we read each line of the files into a character vector. Since the Europarl corpus is a parallel corpus, the lines in the source and target files are aligned. This means that the first line in the source file corresponds to the first line in the target file, the second line in the source file corresponds to the second line in the target file, and so on. This alignment is important for the analysis of parallel corpora, as it allows us to compare the source and target texts line by line.
Let's inspect our character vectors to ensure that they are of the same length and appear to be structured as we expect. We can use the `length()` function to get the number of lines in each file and the `head()` function to preview the first few lines of each file.
::: {#exm-curate-europarl-inspect-chr}
```{r}
#| label: curate-europarl-inspect-chr
# Inspect Spanish character vector
length(europarl_es_chr)
head(europarl_es_chr, 5)
# Inspect English character vector
length(europarl_en_chr)
head(europarl_en_chr, 5)
```
\cindex{length()}\cindex{head()}
:::
The output of @exm-curate-europarl-inspect-chr shows that the number of lines in each file is the same. This is good. If the number of lines in each file were different, we would need to figure out why and fix it. We also see that the content of the files is aligned as expected.
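Rather than comparing the printed lengths by eye, the check can also be made programmatically. The sketch below uses short, made-up vectors in place of the Europarl vectors; the object names and the empty-line check are assumptions added for illustration, not part of the corpus documentation.

```r
# Toy stand-ins for the Spanish and English character vectors (made-up data)
source_chr <- c("Hola.", "Gracias.", "Adiós.")
target_chr <- c("Hello.", "Thank you.", "Goodbye.")

# Stop with an informative error if the vectors differ in length
stopifnot(
  "Source and target have different numbers of lines" =
    length(source_chr) == length(target_chr)
)

# Positions where either side is empty, one possible sign of alignment problems
empty_idx <- which(source_chr == "" | target_chr == "")
empty_idx
```

Placing a check like this in the curation script means that a changed or corrupted download would halt the process instead of silently producing misaligned datasets.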
Let's now create a dataset for each of the character vectors. We will use the `tibble()` function from {tibble} to create a data frame object with the character vectors as the `lines` column and add a `type` column with the value 'Source' for the Spanish file and 'Target' for the English file. We will assign the output to two new objects, `europarl_source_df` and `europarl_target_df`, respectively, as seen in @exm-curate-europarl-df.
::: {#exm-curate-europarl-df}
```{r}
#| label: curate-europarl-df
# Create source data frame
europarl_source_df <-
  tibble(
    type = "Source",
    lines = europarl_es_chr
  )
# Create target data frame
europarl_target_df <-
  tibble(
    type = "Target",
    lines = europarl_en_chr
  )
```
\index{R packages!tibble}\cindex{tibble()}
:::
Inspecting these data frames with `glimpse()` in @exm-curate-europarl-glimpse, we can see if the data frames have the structure we expect.
::: {#exm-curate-europarl-glimpse}
```{r}
#| label: curate-europarl-glimpse
#| results: hold
# Preview source
glimpse(europarl_source_df)
# Preview target
glimpse(europarl_target_df)
```
\cindex{glimpse()}
:::
We now have our `type` and `lines` columns and the associated observations for our idealized dataset, in @tbl-curate-europarl-structure-example. We can now write these datasets to the *data/derived/* directory using `write_csv()` and create corresponding data dictionary files.
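A minimal sketch of this last step follows, assuming a toy dataset and a temporary output directory in place of *data/derived/*. The file names and the layout of the data dictionary (one row per variable, mirroring the columns of the idealized structure table) are illustrative assumptions.

```r
library(readr)
library(tibble)

# Toy stand-in for the curated source dataset (made-up lines)
europarl_source_toy <- tibble(
  type = "Source",
  lines = c("Hola.", "Gracias.")
)

# Hypothetical output directory; in a project this would be data/derived/
out_dir <- file.path(tempdir(), "derived")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write the curated dataset
write_csv(europarl_source_toy, file.path(out_dir, "europarl_source.csv"))

# Draft a data dictionary with one row per variable
europarl_dd <- tibble(
  variable = names(europarl_source_toy),
  name = c("Document type", "Line"),
  type = unname(
    vapply(europarl_source_toy, function(x) class(x)[1], character(1))
  ),
  description = c(
    "Type of document, either 'Source' or 'Target'",
    "Text of each line in the document"
  )
)

# Write the data dictionary alongside the dataset
write_csv(europarl_dd, file.path(out_dir, "europarl_source_dd.csv"))
```

Deriving the `variable` and `type` columns from the dataset itself keeps the dictionary in step with the data; only the human-readable names and descriptions need to be written by hand.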
## Structured
Structured data already reflects the physical and semantic structure of a tidy dataset.\index{structured data}\index{tidy format} This means that the data is already in a tabular format and the relationships between columns and rows are already well-defined. Therefore, the heavy lifting of curating the data is already done. Two questions remain, however. The first, a logistical question, is what file format the dataset is in and how to read it into R. The second, more research-based, is whether the data may benefit from some additional curation and documentation to make it more amenable to analysis and more understandable to others.
### Reading datasets
Let's consider some common formats for structured data, *i.e.* datasets\index{dataset}, and how to read them into R. First, we will consider R-native formats\index{R}, such as package datasets\index{package datasets} and RDS files.\index{R data serialization (RDS)} Then we will consider non-native formats, such as relational databases\index{relational databases} and datasets produced by other software. Finally, we will consider software agnostic formats, such as CSV.\index{comma-separated values (CSV)}
R and some R packages provide structured datasets that are available for use directly within R.\index{package datasets} For example, {languageR} [@R-languageR]\index{R packages!languageR} provides the `dative` dataset, which is a dataset containing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. {janeaustenr} [@R-janeaustenr]\index{R packages!janeaustenr} provides the `austen_books()` function, which returns a tidy data frame of Jane Austen's novels. **Package datasets**\index{package datasets} are loaded into an R session using either the `data()` function, if the package is loaded, or the `::` operator\index{::}, if the package is not loaded: `data(dative)` or `languageR::dative`, respectively.
::: {.callout}
**{{< fa medal >}} Dive deeper**
To explore the available datasets in a package, you can call `data(package = "package_name")`. For example, `data(package = "languageR")` will list the datasets available in {languageR}. Calling `data()` with no arguments lists the datasets available in all loaded packages.\cindex{data()}
:::
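The following sketch illustrates these access patterns. So that it runs without installing anything, it uses the `women` dataset from {datasets}, which ships with R, as a stand-in for a linguistic package dataset such as `dative`.

```r
# List the datasets bundled with a package (here the built-in {datasets})
pkg_data <- data(package = "datasets")
head(pkg_data$results[, c("Item", "Title")], 3)

# Access a dataset directly with the :: operator
women_df <- datasets::women

# Or load it by name into the current environment
data("women", package = "datasets")
str(women)
```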
R also provides a native file format for storing R objects, the RDS file.\index{R data serialization (RDS)} Any R object, including data frames, can be written from an R session to disk by using the `write_rds()` function from {readr}. The *.rds* files are written to disk in a binary format that is not human-readable, which is not ideal for transparent data sharing. However, the files and the R objects can be read back into an R session using the `read_rds()` function with all the attributes intact, such as vector types, factor levels, *etc.*\cindex{write_rds()}\cindex{read_rds()}
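To make the point about preserved attributes concrete, here is a small sketch comparing an RDS round trip with a CSV round trip. The data frame, its values, and the temporary file paths are made up for illustration.

```r
library(readr)
library(tibble)

# Toy data frame with a factor column (made-up values)
speakers <- tibble(
  id = 1:3,
  sex = factor(c("f", "m", "f"), levels = c("f", "m"))
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

write_rds(speakers, rds_file)
write_csv(speakers, csv_file)

# The RDS round trip keeps the factor; the CSV round trip returns characters
class(read_rds(rds_file)$sex)
class(read_csv(csv_file, show_col_types = FALSE)$sex)
```

This is the trade-off in practice: RDS preserves R-specific structure, while CSV stays readable outside of R at the cost of re-declaring such attributes after reading.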
R provides a suite of tools for importing data from non-native structured sources such as databases and datasets from software such as SPSS, SAS, and Stata. For instance, if you are working with data stored in a **relational database**\index{relational databases} such as MySQL, PostgreSQL, or SQLite, you can use {DBI} [@R-DBI] to connect to the database and {dbplyr} [@R-dbplyr] to query the database with {dplyr} verbs, which are translated to SQL behind the scenes. Files from SPSS (*.sav*), SAS (*.sas7bdat*), and Stata (*.dta*) can be read into R using {haven} [@R-haven].\index{R packages!DBI}\index{R packages!dbplyr}\index{R packages!haven}
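As a rough sketch of the database workflow, the example below creates an in-memory SQLite database, adds a toy table, and queries it. It assumes that {RSQLite} and {dbplyr} are installed in addition to {DBI} and {dplyr}; the table name and values are invented for illustration.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database (requires {RSQLite} and {dbplyr} to be installed)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Populate it with a toy table (made-up values)
dbWriteTable(con, "utterances", data.frame(
  who = c("A", "B", "A"),
  n_words = c(5L, 12L, 8L)
))

# Build the query with dplyr verbs, then collect() the result into R
long_utts <- tbl(con, "utterances") |>
  filter(n_words > 6) |>
  collect()

# Close the connection when finished
dbDisconnect(con)
```

The filtering is carried out by the database; only the rows that match are brought into the R session by `collect()`.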
Software agnostic file formats include delimited files, such as CSV, TSV, *etc.*\index{comma-separated values (CSV)} These file formats lack the robust structural attributes of the other formats, but balance this shortcoming by storing structured data in a more accessible, human-readable format. Delimited files are plain text files which use a delimiter, such as a comma (`,`), tab (`\t`), or pipe (`|`), to separate the columns, with each row on its own line. For example, a CSV file is a delimited file where the columns are separated by commas, as seen in @exm-curate-csv-example.
::: {#exm-curate-csv-example}
```{.xml}
column_1,column_2,column_3
row 1 value 1,row 1 value 2,row 1 value 3
row 2 value 1,row 2 value 2,row 2 value 3
```
:::
Given the accessibility of delimited files, they are a common format for sharing structured data in reproducible research. It is not surprising, then, that this is the format which we have chosen for the derived datasets in this book.
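The sketch below shows how the same small table can be read from comma- and pipe-delimited text with {readr}. To keep the example self-contained, the delimited text is supplied inline with `I()` instead of a file path; the column names and values are made up.

```r
library(readr)

# Literal delimited text, wrapped in I() so readr treats it as data, not a path
csv_txt <- I("column_1,column_2\nrow 1 value 1,row 1 value 2\n")
psv_txt <- I("column_1|column_2\nrow 1 value 1|row 1 value 2\n")

# read_csv() assumes commas; read_delim() takes the delimiter as an argument
csv_df <- read_csv(csv_txt, show_col_types = FALSE)
psv_df <- read_delim(psv_txt, delim = "|", show_col_types = FALSE)

# Compare the column names of the two results
identical(names(csv_df), names(psv_df))
```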
### Orientation
With an understanding of the various structured formats, we can now turn to considerations about how the original dataset is structured and how that structure is to be used for a given research project. As an example, we will work with the CABNC datasets acquired in @sec-acquire-chapter. The structure of the original dataset is shown in @def-curate-cabnc-structure.
\pagebreak
::: {#def-curate-cabnc-structure}
Directory structure for the CABNC datasets
```{.bash code-line-numbers="false"}
data/
├── analysis/
├── derived/
└── original/
    ├── cabnc_do.csv
    └── cabnc/
        ├── participants.csv
        ├── token_types.csv
        ├── tokens.csv
        ├── transcripts.csv
        └── utterances.csv
```
:::
In addition to other important information, the data origin file *cabnc_do.csv* shown in @tbl-curate-cabnc-do informs us that the datasets are related by a common variable.
```{r}
#| label: tbl-curate-cabnc-do
#| tbl-cap: "Data origin: CABNC datasets"
#| tbl-colwidths: [25, 75]
#| echo: false
read_csv("data/curate-cabnc_do.csv") |>
tt(width = 1)
```
The CABNC datasets are structured in a relational format, which means that the data is stored in multiple tables that are related to each other. The tables are related by a common column or set of columns, which are called **keys**\index{dataset keys}. A key is used to join the tables together to create a single dataset. There are two keys in the CABNC datasets, `filename` and `who`. Each variable corresponds to recording- and/or participant-oriented datasets.
Now, let's envision a scenario in which we are preparing our data for a study that aims to investigate the relationship between speaker demographics and utterances. In their original format, the CABNC datasets store information about utterances and speakers in separate datasets, `cabnc_utterances` and `cabnc_participants`, respectively. Ideally, we would like to curate these datasets such that the information about the utterances and the speakers is ready to be joined as part of the dataset transformation process, while still retaining the relevant original structure. This usually involves removing redundant and/or uninformative variables, adjusting variable names, and writing these datasets and their documentation files to disk.
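To illustrate what the keys make possible later on, here is a sketch of a join on `filename` and `who` using two toy tables. The identifiers and values are invented and much simpler than the actual CABNC data; the join itself belongs to the transformation step, not to curation.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the utterance and participant tables (made-up values)
utterances_toy <- tibble(
  filename = c("rec_01", "rec_01", "rec_02"),
  who = c("spk_A", "spk_B", "spk_A"),
  utterance = c("You enjoyed yourself?", "Yeah.", "Right.")
)

participants_toy <- tibble(
  filename = c("rec_01", "rec_01", "rec_02"),
  who = c("spk_A", "spk_B", "spk_A"),
  sex = c("female", "male", "male")
)

# Join on both keys so speakers are matched within each recording
utterances_joined <- left_join(
  utterances_toy,
  participants_toy,
  by = c("filename", "who")
)

utterances_joined
```

In this toy data the code `spk_A` refers to different people in different recordings, which is why both keys are used together in the join.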
### Tidy the dataset
With these goals in mind, let's start the process of curation by reading the relevant datasets into an R session. Since we are working with CSV files we will use the `read_csv()` function, as seen in @exm-curate-cabnc-read.
::: {#exm-curate-cabnc-read}
```{r}
#| label: curate-cabnc-read
# Read the relevant datasets
cabnc_utterances <-
  read_csv("data/cabnc/original/utterances.csv")
cabnc_participants <-
  read_csv("data/cabnc/original/participants.csv")
```
\cindex{read_csv()}
:::
The next step is to inspect the structure of the datasets. We can use the `glimpse()` function for this task.
::: {#exm-curate-cabnc-glimpse}
```r
# Preview the structure of the datasets
glimpse(cabnc_utterances)
glimpse(cabnc_participants)
```
```{r}
#| label: curate-cabnc-glimpse
#| results: hold
#| echo: false
# Preview the structure of the datasets
glimpse(cabnc_utterances)
cat("\n")
glimpse(cabnc_participants)
```
\cindex{glimpse()}
:::
From visual inspection of the output of @exm-curate-cabnc-glimpse we can see that there are common variables in both datasets. In particular, we see the `filename` and `who` variables mentioned in the data origin file *cabnc_do.csv*.
The next step is to consider the variables that will be useful for future analysis. Since we are creating a curated dataset, the goal will be to retain as much information as possible from the original datasets. There are cases, however, in which there may be variables that are not informative and, thus, will not prove useful for any analysis. These removable variables tend to be of one of two types: variables which show no variation across observations and variables where the information is redundant.
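Variables with no variation can also be identified programmatically. The following sketch counts the distinct values in each column of a toy participants table; the table and its values are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Toy participants table (made-up values)
participants_toy <- tibble(
  who = c("spk_A", "spk_B", "spk_C"),
  role = c("Unidentified", "Unidentified", "Unidentified"),
  language = c("eng", "eng", "eng"),
  sex = c("female", "male", "female")
)

# Count the distinct values in every column
n_unique <- participants_toy |>
  summarize(across(everything(), n_distinct))

# Names of columns with a single value, i.e. no variation
constant_vars <- names(n_unique)[unlist(n_unique) == 1]
constant_vars
```

A count of one flags a candidate for removal; redundancy between variables still calls for the kind of closer inspection described above.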
As an example case, let's look at the `cabnc_participants` data frame. We can use the `skim()` function from {skimr} to get a summary of the variables in the dataset. We can add the `yank()` function to look at variable types one at a time. We will start with the character variables, as seen in @exm-curate-cabnc-skim-character.\index{R packages!skimr}
::: {#exm-curate-cabnc-skim-character}
```r
# Summarize character variables
cabnc_participants |>
  skim() |>
  yank("character")
```
\cindex{library()}\cindex{skim()}\cindex{yank()}
```{r, comment=""}
#| label: skim-test
#| echo: false
library(skimr)
cabnc_participants |>
skim() |>
yank("character") |>
capture.output() |>
cat(sep = "\n")
```
:::
We see from the output in @exm-curate-cabnc-skim-character that the variables `role` and `language` have a single unique value. This means that these variables do not show any variation across observations.\index{entropy} We will remove these variables from the dataset.
Continuing on, let's look for redundant variables. We see that the variables `filename` and `path` have the same number of unique values. And if we combine this with the visual summary in @exm-curate-cabnc-glimpse, we can see that the `path` variable is redundant. We will remove this variable from the dataset.
Another potentially redundant pair of variables is `who` and `name`, both of which are speaker identifiers. The `who` variable is a unique identifier, but there may be some redundancy with the `name` variable, that is, there may be two speakers with the same name. We can check this by looking at the number of unique values in the `who` and `name` variables from the `skim()` output in @exm-curate-cabnc-skim-character. `who` has 568 unique values and `name` has 269 unique values. This suggests that there are multiple speakers with the same name.
Another way to explore this is to look at the number of unique values in the `who` variable for each unique value in the `name` variable. We can do this using the `group_by()` and `summarize()` functions from {dplyr}. For each value of `name`, we will count the number of unique values in `who` with `n_distinct()` and then sort the results in descending order.\index{R packages!dplyr}
::: {#exm-curate-cabnc-who-name}
```{r}
#| label: curate-cabnc-who-name
cabnc_participants |>
  group_by(name) |>
  summarize(n = n_distinct(who)) |>
  arrange(desc(n)) |>
  slice_head(n = 5)
```
\cindex{group_by()}\cindex{summarize()}\cindex{n_distinct()}\cindex{arrange()}
:::
It is good that we performed the check in @exm-curate-cabnc-who-name beforehand. In addition to speakers with the same name, such as 'Chris' and 'David', we also have multiple speakers with generic codes, such as 'None' and 'Unknown_speaker'. It is clear that `name` is redundant.
With this in mind, we can then safely remove the following variables from the dataset: `role`, `language`, `name`, and `path`. To drop variables from a data frame we can use the `select()` function in combination with the `-` operator.\index{-} The `-` operator tells the `select()` function to drop the variable that follows it.
::: {#exm-curate-cabnc-drop-vars}
```{r}
#| label: curate-cabnc-drop-vars
# Drop variables
cabnc_participants <-
  cabnc_participants |>
  select(-role, -language, -name, -path)
# Preview the dataset
glimpse(cabnc_participants)
```
\cindex{select()}\cindex{glimpse()}
:::
Now we have a data frame with 9 more informative variables which describe the participants. We would then repeat this process for the `cabnc_utterances` dataset to remove redundant and uninformative variables.
Another, optional, step is to rename and/or reorder the variables to make the dataset more understandable. Let's organize the columns to read left to right from most general to most specific. Again, we turn to the `select()` function, this time including the variables in the order we want them to appear in the dataset. We will take this opportunity to rename some of the variables so that they are more informative.
::: {#exm-curate-cabnc-rename-vars}
```{r}
#| label: curate-cabnc-rename-vars
# Rename variables
cabnc_participants <-
  cabnc_participants |>
  select(
    doc_id = filename,
    part_id = who,
    part_age = monthage,
    part_sex = sex,
    num_words = numwords,
    num_utts = numutts,
    avg_utt_len = avgutt,
    median_utt_len = medianutt
  )
# Preview the dataset
glimpse(cabnc_participants)
```
\cindex{select()}\cindex{glimpse()}
:::
The variable order is organized after running @exm-curate-cabnc-rename-vars. Now let's sort the rows by `doc_id` and `part_id` so that the dataset is sensibly organized. The `arrange()` function takes a data frame and a list of variables to sort by, in the order they are listed.
::: {#exm-curate-cabnc-sort-rows}
```{r}
#| label: curate-cabnc-sort-rows
# Sort rows
cabnc_participants <-
  cabnc_participants |>
  arrange(doc_id, part_id)
# Preview the dataset
cabnc_participants |>
  slice_head(n = 5)
```
\cindex{arrange()}\cindex{slice_head()}
:::
Applying the sorting in @exm-curate-cabnc-sort-rows, we can see that the dataset is now in our desired order: it reads left to right from document- to participant-oriented attributes and top to bottom by document and participant.
## Semi-structured
Between unstructured and structured data falls semi-structured data.\index{semi-structured data} And as the name suggests, it is a hybrid data format. This means that there will be important structured metadata included with unstructured elements.\index{metadata} The file formats and approaches to encoding the structured aspects of the data vary widely from resource to resource and therefore often require more detailed attention to the structure of the data and often include more sophisticated programming strategies to curate the data to produce a tidy dataset.
### Reading data
The file formats associated with semi-structured data include a wide range. These include file formats conducive to more structured-leaning data, such as XML\index{Extensible Markup Language (XML)}, HTML\index{Hypertext Markup Language (HTML)}, and JSON\index{JavaScript Object Notation (JSON)}, and file formats with more unstructured-leaning data, such as annotated TXT files.\index{plain text} Annotated TXT files may in fact appear with the *.txt* extension, but may also appear with other, sometimes resource-specific, extensions, such as *.utt* for the Switchboard Dialog Act Corpus or *.cha* for the Child Language Data Exchange System (CHILDES)\index{Child Language Data Exchange System (CHILDES)} annotation files, for example.
The more structured file formats use standard conventions and therefore can be read into an R session with format-specific functions. Say, for example, we are working with data in a JSON file format. We can read the data into an R session with the `read_json()` function from {jsonlite} [@R-jsonlite]. For HTML files, {rvest} [@R-rvest] provides the `read_html()` function; for XML files, `read_xml()` is available from {xml2}, the package that {rvest} builds on.\index{R packages!jsonlite}\index{R packages!rvest}
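As a minimal sketch of reading JSON, the block below writes a small, made-up JSON document to a temporary file and reads it back with `read_json()`. The field names (`doc_id`, `speaker`, `tokens`) and the file itself are illustrative assumptions, not part of any corpus used in this chapter.

```r
# Load package
library(jsonlite)

# Write a toy JSON document to a temporary file (hypothetical content)
json_file <- tempfile(fileext = ".json")
writeLines(
  '{"doc_id": 1, "speaker": "A", "tokens": ["Hello", "world"]}',
  json_file
)

# Read the file; simplifyVector = TRUE turns JSON arrays into R vectors
doc <- read_json(json_file, simplifyVector = TRUE)

# Inspect
class(doc)
doc$speaker
doc$tokens
```

The result is a named list, so the structured parts of the file can be accessed by name and then organized into a data frame.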
Semi-structured data in TXT files can be read either as a file or by lines. The choice of which approach to take depends on the structure of the data. If the data structure is line-based, then `read_lines()` often makes more sense than `read_file()`. However, in some cases, the data may be structured in a way that requires the entire file to be read into an R session and then subsequently parsed.
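To make the difference concrete, here is a small sketch that reads the same temporary file both ways. The two annotated lines are invented for illustration and do not follow any particular corpus convention.

```r
# Load package
library(readr)

# Write a toy annotated text file (hypothetical content)
txt_file <- tempfile(fileext = ".txt")
writeLines(c("*A: hello there .", "*B: hi ."), txt_file)

# Read by lines: a character vector with one element per line
by_lines <- read_lines(txt_file)

# Read as a file: a single string containing the whole file
whole_file <- read_file(txt_file)

# Compare
length(by_lines)
length(whole_file)
```

With line-based annotation, the vector from `read_lines()` can be processed element by element. When an annotation unit spans several lines, the single string from `read_file()` can be split on whatever delimiter the resource uses.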
### Orientation {#sec-curate-semi-structured-orientation}
To provide an example of the curation process using semi-structured data, we will work with the Europarl corpus of native, non-native and translated texts (ENNTT) [@Nisioi2016].\index{Europarl corpus of native, non-native and translated texts (ENNTT)} The ENNTT corpus contains native, non-native, and translated English drawn from European Parliament proceedings. Let's look at the directory structure for the ENNTT corpus in @def-curate-enntt-structure.
::: {#def-curate-enntt-structure}
Data directory structure for the ENNTT corpus
```{.bash code-line-numbers="false"}
data/
├── analysis/
├── derived/
└── original/
    ├── enntt_do.csv
    └── enntt/
        ├── natives.dat
        ├── natives.tok
        ├── nonnatives.dat
        ├── nonnatives.tok
        ├── translations.dat
        └── translations.tok
```
:::
We now inspect the data origin file for the ENNTT corpus, *enntt_do.csv*, in @tbl-curate-enntt-do.\index{data origin}
```{r}
#| label: tbl-curate-enntt-do
#| tbl-cap: "Data origin: ENNTT Corpus"
#| tbl-colwidths: [25, 75]
#| echo: false
read_csv("data/curate-enntt_do.csv") |>
tt(width = 1)
```
According to the data origin file, there are two important file types, *.dat* and *.tok*. The *.dat* files contain annotations and the *.tok* files contain the actual text. Let's inspect the first couple of lines in the *.dat* file for the non-native speakers, *nonnatives.dat*, in @def-curate-enntt-nonnatives-dat.
::: {#def-curate-enntt-nonnatives-dat}
Example *.dat* file for the non-native speakers
```{.xml code-line-numbers="false"}
<LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>
<LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>
```
:::
We see that the *.dat* file contains annotations for various session and speaker attributes. The format of the annotations is XML-like. XML is a markup language, as are HTML and Markdown. **Markup languages** are used to annotate text with additional information about the structure, meaning, and/or presentation of text. In XML, structure is built up by nesting nodes. Nodes are named with tags, which are enclosed in angle brackets, `<` and `>`. Nodes are opened with `<TAG>` and closed with `</TAG>`; a node with no content can also be written in self-closing form, `<TAG/>`, as in the *.dat* file. In @def-curate-xml we see an example of a simple XML file structure.
::: {#def-curate-xml}
Example *.xml* file structure
```{.xml code-line-numbers="false"}
<?xml version="1.0" encoding="UTF-8"?>
<book category="fiction">
  <title lang="en">The Catcher in the Rye</title>
  <author>J.D. Salinger</author>
  <year>1951</year>
</book>
```
:::
In @def-curate-xml there are four nodes, three of which are nested inside of the `<book>` node. The `<book>` node in this example is the root node. XML files require exactly one root node.\index{Extensible Markup Language (XML)} Nodes can also have attributes, such as the `category` attribute in the `<book>` node, but they are not required. Furthermore, XML files typically begin with a declaration, which is the first line in @def-curate-xml. The declaration specifies the version of XML used and the encoding.
So the *.dat* file is not strict XML, since it lacks a root node and a declaration, but it is similar in that it contains nodes and attributes. A closely related markup language you are likely familiar with, HTML\index{Hypertext Markup Language (HTML)}, is parsed under more relaxed rules than XML. HTML is a markup language used to annotate text with information about the organization and presentation of text on the web, and HTML parsers do not insist on a single root node or a declaration, much like our *.dat* file. So suffice it to say that the *.dat* file can safely be treated as HTML.
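The contrast between strict XML and the more forgiving HTML parser can be sketched with two small strings. The first is the book example; the second is an abbreviated, two-line imitation of the *.dat* format with only some of its attributes. Object names such as `book_xml` and `dat_text` are illustrative, and the sketch assumes {xml2} and {rvest} are installed.

```r
# Load packages
library(xml2)
library(rvest)

# Strict XML: a declaration and a single root node
book_xml <- paste(
  c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<book category="fiction">',
    '  <title lang="en">The Catcher in the Rye</title>',
    '  <author>J.D. Salinger</author>',
    '  <year>1951</year>',
    '</book>'
  ),
  collapse = "\n"
)
book <- read_xml(book_xml)
xml_name(book)              # name of the root node
length(xml_children(book))  # number of nodes nested in the root
xml_attr(book, "category")  # value of the root's attribute

# *.dat*-like text: sibling nodes, no root node, no declaration
dat_text <- paste(
  c(
    '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>',
    '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>'
  ),
  collapse = "\n"
)

# The strict XML parser rejects it; capture the error instead of stopping
strict_result <- tryCatch(read_xml(dat_text), error = function(e) e)
inherits(strict_result, "error")

# The HTML parser accepts it and finds both nodes
line_nodes <- read_html(dat_text) |> html_elements("line")
length(line_nodes)
names(html_attrs(line_nodes[[1]]))
```

Note that the HTML parser normalizes tag and attribute names to lowercase, which is why the nodes are selected with `"line"` rather than `"LINE"`.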
And the *.tok* file for non-native speakers, *nonnatives.tok*, in @def-curate-enntt-nonnatives-tok, shows the actual text for each line in the corpus.
\pagebreak
::: {#def-curate-enntt-nonnatives-tok}
Example *.tok* file for the non-native speakers
```{.xml code-line-numbers="false"}
The Commission is following with interest the planned construction of a nuclear power plant in Akkuyu , Turkey and recognises the importance of ensuring that the construction of the new plant follows the highest internationally accepted nuclear safety standards .
According to our information , the decision on the selection of a bidder has not been taken yet .
```
:::
In a study in which we are interested in contrasting the language of natives and non-natives, we will want to combine the *.dat* and *.tok* files for these groups of speakers.
The question is what attributes\index{variables} we want to include in the curated dataset. Given the research focus, we will not need the `LANGUAGE` or `NAME` attributes. We may want to modify the attribute names so they are a bit more descriptive.
An idealized version of the curated dataset based on these criteria is shown in @tbl-curate-enntt-ideal.
::: {#tbl-curate-enntt-ideal tbl-colwidths="[15, 18, 17, 50]"}
| variable | name | type | description |
|----------|------|---------------|-------------|
| session_id | Session ID | character | Unique identifier for each session. |
| speaker_id | Speaker ID | integer | Unique identifier for each speaker. |
| state | State | character | The political state of the speaker. |
| type | Type | character | Indicates whether the text is native or non-native. |
| session_seq | Session Sequence | integer | The sequence of the text in the session. |
| text | Text | character | Contains the text of the line, and maintains the structure of the original data. |
Idealized structure for the curated ENNTT Corpus datasets
:::
### Tidy the data
Now that we have a better understanding of the corpus data and our target curated dataset structure, let's work to extract and organize the data from the native and non-native files.
The general approach we will take, first for natives and then for non-natives, is to read in the *.dat* file as an HTML file and then extract the line nodes and their attributes, combining them into a data frame. Then we'll read in the *.tok* file as a text file and combine the two into a single data frame.\index{data frame}
Starting with the natives, we use {rvest} to read in the *.dat* file as an HTML file with the `read_html()` function and then extract the line nodes with the `html_elements()` function as in @exm-curate-enntt-read-xml.\index{R packages!rvest}
::: {#exm-curate-enntt-read-xml}
```r
# Load packages
library(rvest)
# Read in *.dat* file as HTML
ns_dat_lines <-
  read_html("../data/original/enntt/natives.dat") |>
  html_elements("line")
# Inspect
class(ns_dat_lines)
typeof(ns_dat_lines)
length(ns_dat_lines)
```
\cindex{library()}\cindex{read_html()}\cindex{html_elements()}
\cindex{class()}\cindex{typeof()}\cindex{length()}
```{r}
#| label: curate-enntt-read-xml
#| echo: false
#| results: hold
# Load packages
library(rvest)
# Read in *.dat* file as HTML
ns_dat_lines <-
read_html("data/enntt/original/natives.dat") |>
html_elements("line")
# Inspect the object
class(ns_dat_lines)
typeof(ns_dat_lines)
length(ns_dat_lines)
```
:::
We can see that the `ns_dat_lines` object is a special type of list, `xml_nodeset`, which contains `r format(length(ns_dat_lines), big.mark = ",")` line nodes. Let's now jump out of sequence and read in the *.tok* file as a text file, in @exm-curate-enntt-read-lines, by lines using `read_lines()`, and compare the two to make sure that our approach will work.
::: {#exm-curate-enntt-read-lines}
```r
# Read in *.tok* file by lines
ns_tok_lines <-
  read_lines("../data/original/enntt/natives.tok")
# Inspect
class(ns_tok_lines)
typeof(ns_tok_lines)
length(ns_tok_lines)
```
\cindex{read_lines()}
```{r}
#| label: curate-enntt-read-lines
#| echo: false
#| results: hold
# Read in *.tok* file by lines
ns_tok_lines <-
read_lines("data/enntt/original/natives.tok")
# Inspect object
class(ns_tok_lines)
typeof(ns_tok_lines)
length(ns_tok_lines)
```
:::
We do, in fact, have the same number of lines in the *.dat* and *.tok* files. So we can proceed with extracting the attributes\index{variables} from the line nodes and combining them with the text from the *.tok* file.
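Because the attributes and the text are paired purely by position, it can be worth turning this visual comparison into an explicit check. The helper below, `check_alignment()`, is a hypothetical name introduced here for illustration; it is shown with toy stand-ins rather than the corpus objects.

```r
# Hypothetical helper: stop early if annotations and text are misaligned
check_alignment <- function(dat_nodes, tok_lines) {
  if (length(dat_nodes) != length(tok_lines)) {
    stop(
      "Mismatch: ", length(dat_nodes), " annotation nodes vs. ",
      length(tok_lines), " text lines.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy stand-ins for the node set and the text lines
toy_nodes <- list("node 1", "node 2", "node 3")
toy_lines <- c("line 1", "line 2", "line 3")

# Passes silently when the lengths agree
check_alignment(toy_nodes, toy_lines)

# Fails with an informative message when they do not
mismatch <- tryCatch(
  check_alignment(toy_nodes, toy_lines[1:2]),
  error = function(e) conditionMessage(e)
)
mismatch
```

A check like this only confirms that the counts match; it cannot confirm that line *n* of the text belongs to node *n*, which rests on the corpus documentation.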
Let's start by listing the attributes of the first line node in the `ns_dat_lines` object. To do this we will draw on the `pluck()` function from {purrr} [@R-purrr] to extract the first line node. Then, we use the `html_attrs()` function to get the attribute names and the values, as in @exm-curate-enntt-list-attributes.\index{R packages!purrr}
\pagebreak
::: {#exm-curate-enntt-list-attributes}
```{r}
#| label: curate-enntt-list-attributes
# Load package
library(purrr)
# List attributes line node 1
ns_dat_lines |>
  pluck(1) |>
  html_attrs()
```
\cindex{library()}\cindex{pluck()}\cindex{html_attrs()}
:::
No surprise here, these are the same attributes we saw in the *.dat* file preview in @def-curate-enntt-nonnatives-dat, although the attribute names now appear in lowercase because the HTML parser normalizes them. At this point, it's good to make a plan for how to map the attribute names to the column names in our curated dataset.
- `session_id` = `session_id`
- `speaker_id` = `mepid`
- `state` = `state`
- `session_seq` = `seq_speaker_id`
We can do this one attribute at a time using the `html_attr()` function and then combine them into a data frame with the `tibble()` function as in @exm-curate-enntt-extract-attributes.
::: {#exm-curate-enntt-extract-attributes}
```{r}
#| label: curate-enntt-extract-attributes
# Extract attributes from first line node
session_id <- ns_dat_lines |> pluck(1) |> html_attr("session_id")
speaker_id <- ns_dat_lines |> pluck(1) |> html_attr("mepid")
state <- ns_dat_lines |> pluck(1) |> html_attr("state")
session_seq <- ns_dat_lines |> pluck(1) |> html_attr("seq_speaker_id")
# Combine into data frame
tibble(session_id, speaker_id, state, session_seq)
```
\cindex{pluck()}\cindex{html_attr()}\cindex{tibble()}
:::
The results from @exm-curate-enntt-extract-attributes show that the attributes have been extracted and mapped to our idealized column names, but this would be tedious to do for each line node. A function to extract attributes and values from a line and add them to a data frame would help simplify this process.\index{custom functions} The function in @exm-curate-enntt-extract-attributes-function does just that.
::: {#exm-curate-enntt-extract-attributes-function}
```{r}
#| label: curate-enntt-extract-attributes-function
# Function to extract attributes from line node
extract_dat_attrs <- function(line_node) {
  session_id <- line_node |> html_attr("session_id")
  speaker_id <- line_node |> html_attr("mepid")
  state <- line_node |> html_attr("state")
  session_seq <- line_node |> html_attr("seq_speaker_id")
  tibble(session_id, speaker_id, state, session_seq)
}
```
\cindex{html_attr()}\cindex{tibble()}\cindex{function()}
:::
It's a good idea to test out the function to verify that it works as expected. We can do this by plucking line nodes at various indices from the `ns_dat_lines` object and passing them to the function as in @exm-curate-enntt-test-extract-attributes-function.
::: {#exm-curate-enntt-test-extract-attributes-function}
```{r}
#| label: curate-enntt-test-extract-attributes-function
#| results: hold
# Test function
ns_dat_lines |> pluck(1) |> extract_dat_attrs()
ns_dat_lines |> pluck(20) |> extract_dat_attrs()
ns_dat_lines |> pluck(100) |> extract_dat_attrs()
```
:::
It looks like the `extract_dat_attrs()` function is ready for prime-time. Let's now apply it to all of the line nodes in the `ns_dat_lines` object using the `map_dfr()` function from {purrr} as in @exm-curate-enntt-extract-attributes-all.\index{R packages!purrr}
::: {#exm-curate-enntt-extract-attributes-all}
```{r}
#| label: curate-enntt-extract-attributes-all
#| cache: true
# Extract attributes from all line nodes
ns_dat_attrs <-
  ns_dat_lines |>
  map_dfr(extract_dat_attrs)
# Inspect
glimpse(ns_dat_attrs)
```
\cindex{map_dfr()}\cindex{glimpse()}
:::
::: {.callout}
**{{< fa medal >}} Dive deeper**
The `map*()` functions from {purrr} are a family of functions that apply a function to each element of a vector, list, or data frame. The `map_dfr()` function is a variant of the `map()` function that returns a data frame that is the result of row-binding the results, hence `*_dfr`.\cindex{map_dfr()}
:::
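As a side note, `map_dfr()` has been superseded as of {purrr} 1.0.0 in favor of `map()` followed by `list_rbind()`, and `html_attr()` also accepts a whole node set at once. The sketch below compares both alternatives on a toy node set. The object names and the abbreviated toy lines are illustrative assumptions, and `list_rbind()` requires {purrr} 1.0.0 or later.

```r
# Load packages
library(rvest)
library(purrr)
library(tibble)

# Toy node set imitating the *.dat* format (abbreviated attributes)
toy_lines <-
  paste(
    c(
      '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>',
      '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="185" SESSION_ID="ep-05-11-17"/>'
    ),
    collapse = "\n"
  ) |>
  read_html() |>
  html_elements("line")

# Same extraction function as in the chapter
extract_dat_attrs <- function(line_node) {
  session_id <- line_node |> html_attr("session_id")
  speaker_id <- line_node |> html_attr("mepid")
  state <- line_node |> html_attr("state")
  session_seq <- line_node |> html_attr("seq_speaker_id")
  tibble(session_id, speaker_id, state, session_seq)
}

# Alternative 1: map() returns a list of tibbles, list_rbind() stacks them
attrs_rbind <-
  toy_lines |>
  map(extract_dat_attrs) |>
  list_rbind()

# Alternative 2: html_attr() is vectorized over the node set,
# so each column can be built in a single call
attrs_vectorized <- tibble(
  session_id = html_attr(toy_lines, "session_id"),
  speaker_id = html_attr(toy_lines, "mepid"),
  state = html_attr(toy_lines, "state"),
  session_seq = html_attr(toy_lines, "seq_speaker_id")
)

# Compare the two results
all.equal(attrs_rbind, attrs_vectorized)
```

The vectorized version avoids creating one small data frame per line, which can make a noticeable difference on a corpus with many line nodes.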
We can see that the `ns_dat_attrs` object is a data frame with `r format(nrow(ns_dat_attrs), big.mark = ",")` rows and `r ncol(ns_dat_attrs)` columns, just as we expected. We can now combine the `ns_dat_attrs` data frame with the `ns_tok_lines` vector to create a single data frame with the attributes and the text. This is done with the `mutate()` function, assigning the `ns_tok_lines` vector to a new column named `text` as in @exm-curate-enntt-combine-attributes-text.\index{R packages!dplyr}
::: {#exm-curate-enntt-combine-attributes-text}
```{r}
#| label: curate-enntt-combine-attributes-text
# Combine attributes and text
ns_dat <-
  ns_dat_attrs |>
  mutate(text = ns_tok_lines)
# Inspect
glimpse(ns_dat)
```
\cindex{mutate()}\cindex{glimpse()}
:::
This is the data for the native speakers. We can now repeat this process for the non-native speakers, or we can create a function to do it for us.
\pagebreak
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
Using the previous code as a guide, consider what steps you would need to take to create a function to combine the *.dat* and *.tok* files for the non-native speakers (and/or the translations). What arguments would the function take? What would the function return? What would the processing steps be? In what order would the steps be executed?
:::
After applying the curation steps to both the native and non-native datasets, we will have two data frames, `enntt_ns_df` and `enntt_nns_df`, respectively, that meet the idealized structure for the curated ENNTT Corpus datasets, as shown in @tbl-curate-enntt-ideal. The `enntt_ns_df` and `enntt_nns_df` data frames are ready to be written to disk and documented.
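One possible shape for such a function is sketched below. It is only one of several reasonable designs: the name `curate_enntt()`, its arguments, the added `type` column, and the conversion of the ID columns to integers (to match @tbl-curate-enntt-ideal) are all assumptions made here for illustration. The sketch runs on two small temporary files with invented text rather than the corpus itself.

```r
# Load packages
library(rvest)
library(readr)
library(tibble)

# Hypothetical function: combine a *.dat* and a *.tok* file into one tibble
curate_enntt <- function(dat_file, tok_file, type) {
  # Read annotations as HTML and extract the line nodes
  line_nodes <- read_html(dat_file) |> html_elements("line")
  # Read the text by lines
  tok_lines <- read_lines(tok_file)
  # Annotations and text are paired by position, so lengths must match
  stopifnot(length(line_nodes) == length(tok_lines))
  # Build the dataset; html_attr() is vectorized over the node set
  tibble(
    session_id = html_attr(line_nodes, "session_id"),
    speaker_id = as.integer(html_attr(line_nodes, "mepid")),
    state = html_attr(line_nodes, "state"),
    type = type,
    session_seq = as.integer(html_attr(line_nodes, "seq_speaker_id")),
    text = tok_lines
  )
}

# Toy files imitating the corpus format (abbreviated attributes, made-up text)
dat_file <- tempfile(fileext = ".dat")
tok_file <- tempfile(fileext = ".tok")
writeLines(
  c(
    '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>',
    '<LINE STATE="Poland" MEPID="96779" SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>'
  ),
  dat_file
)
writeLines(c("This is a toy line .", "This is another toy line ."), tok_file)

# Apply the function
nns_toy <- curate_enntt(dat_file, tok_file, type = "non-native")
nns_toy
```

Passing `type` as an argument means the same function can be reused for the native, non-native, and translation files, and the resulting data frames can later be stacked without losing track of their source.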
## Documentation
After applying the curation steps to our data, we will now want to write the dataset to disk and to do our best to document the process and the resulting dataset.\index{data documentation}
Since data frames are tabular, we have various options for the file type to write. Many of these formats are software-specific, such as `*.xlsx` for Microsoft Excel, `*.sav` for SPSS, `*.dta` for Stata, and `*.rds` for R. We will use the `*.csv` format since it is a common format that can be read by many software packages. We will use the `write_csv()` function from {readr} to write the dataset to disk.
Now the question is where to save our CSV file. Since our dataset is derived by our work, we will add it to the *derived/* directory. If you are working with multiple data sources within the same project, it is a good idea to create a sub-directory for each dataset. This will help keep the project organized and make it easier to find and access the datasets.
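The writing step might look like the following sketch. To keep it self-contained, it uses a one-row toy data frame and writes under a temporary directory; in a project, the path would instead start at the project's *data/* directory. The sub-directory name *enntt/* and the file name *enntt_ns_curated.csv* are hypothetical choices, not names prescribed by the chapter.

```r
# Load packages
library(readr)
library(tibble)

# Toy stand-in for the curated dataset
enntt_toy <- tibble(
  session_id = "ep-05-11-17",
  speaker_id = 96779L,
  state = "Poland",
  type = "non-native",
  session_seq = 184L,
  text = "This is a toy line ."
)

# Create a dataset-specific sub-directory under derived/ (temporary location)
out_dir <- file.path(tempdir(), "data", "derived", "enntt")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write the dataset to disk
out_file <- file.path(out_dir, "enntt_ns_curated.csv")
write_csv(enntt_toy, out_file)

# Verify by reading it back
read_csv(out_file, show_col_types = FALSE)
```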
The final step, as always, is to provide documentation. For datasets the documentation is a data dictionary\index{data dictionary}, as discussed in @sec-data-data-dictionaries. As with data origin files, you can use spreadsheet software to create and edit the data dictionary.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The `create_data_dictionary()` function from {qtkit} provides a rudimentary data dictionary template by default. However, the `model` argument lets you take advantage of OpenAI's text generation models to generate a more detailed data dictionary for you to edit. See the function documentation for more information.
\index{R packages!qtkit}\cindex{create_data_dictionary()}
:::
In {qtkit} we have a function, `create_data_dictionary()`, that will generate the scaffolding for a data dictionary. The function takes two arguments, `data` and `file_path`. It reads the dataset columns and provides a template for the data dictionary.
An example data dictionary for the `enntt_ns_df` dataset is shown in @tbl-curate-unstructured-data-dictionary-example.
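To give a sense of what such scaffolding involves, here is a hand-rolled sketch that builds a template from a data frame's column names and types. This is not the {qtkit} implementation; it is a simplified illustration whose column headings follow @tbl-curate-enntt-ideal, and the toy data frame is invented.

```r
# Load package
library(tibble)

# Toy stand-in for a curated dataset
enntt_toy <- tibble(
  session_id = "ep-05-11-17",
  speaker_id = 96779L,
  state = "Poland",
  type = "non-native",
  session_seq = 184L,
  text = "This is a toy line ."
)

# Build a template: one row per variable, with blanks to fill in by hand
dd_template <- tibble(
  variable = names(enntt_toy),
  name = NA_character_,
  type = unname(vapply(enntt_toy, function(col) class(col)[1], character(1))),
  description = NA_character_
)

dd_template
```

The `name` and `description` columns are left empty on purpose: they hold the information that only the researcher can supply, which is then typically added in spreadsheet software.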
```{r}
#| label: tbl-curate-unstructured-data-dictionary-example
#| tbl-cap: "Data dictionary: `enntt_ns_df` dataset"
#| tbl-colwidths: [15, 15, 15, 55]
#| echo: false
read_csv("data/curate-enntt_curated_dd.csv") |>
tt(width = 1)
```
## Activities {.unnumbered}
The following activities build on your skills and knowledge of using R to read, inspect, and write data and datasets. In these activities you will have an opportunity to apply these skills to the task of curating datasets. This is a vital component of text analysis research that uses unstructured and semi-structured data.
::: {.callout}
**{{< fa regular file-code >}} Recipe**
**What**: Organizing and documenting datasets\
**How**: Read Recipe 6, complete comprehension check, and prepare for Lab 6.\
**Why**: To rehearse methods for deriving tidy datasets to use as the base for further project-specific purposes. We will explore how regular expressions are helpful in developing strategies for matching, extracting, and/or replacing patterns in character sequences and how to organize datasets in rows and columns. We will also explore how to document datasets in a data dictionary.
:::
\pagebreak
::: {.callout}
**{{< fa flask >}} Lab**
**What**: Taming data\
**How**: Fork, clone, and complete the steps in Lab 6.\
**Why**: To gain experience working with coding strategies to manipulate data using Tidyverse functions and regular expressions, to practice reading data from and writing data to disk, and to implement strategies for organizing and documenting a dataset in reproducible fashion.
:::
## Summary {.unnumbered}
In this chapter we looked at the process of structuring data into a dataset. This included a discussion of three main types of data: unstructured, structured, and semi-structured. The level of structure of the original data(set) will vary from resource to resource and, by the same token, so will the file format used to support the level of metadata included. Data curation results in a dataset that is saved separately from the original data in order to maintain modularity between what the data(set) looks like before we intervene and afterwards. Since there can be multiple analysis approaches applied to the original data in a research project, this curated dataset serves as the point of departure for each of the subsequent datasets derived from the transformational steps. In addition to the code we use to derive the curated dataset's structure, we also include a data dictionary which documents the variables and measures in the curated dataset.