Bradley Dilger edited this page Jul 3, 2024 · 123 revisions

Welcome to CIABATTA!

CIABATTA stands for "Corpus In A Box: Automated Tools, Tutorials, & Advising." It distills the accumulated knowledge of the Crow team: the corpus researchers and software developers who maintain the Corpus & Repository of Writing.

CIABATTA provides templates and code for corpus building (examples, design patterns, best practices, and step-by-step processes) that serve as a starting point for developing new corpora. The guides and guidelines included here can be used as-is or extended to fit the particular needs of a given corpus.

We recommend against using Safari with this wiki, the CIABATTA tools, or GitHub more generally.


Contents

1. Best practices for corpus building
2. CIABATTA overview
3. Ethical issues in corpus building
4. Checking consents and collecting data
5. Organizing your data
6. Converting, encoding, and standardizing your data
   6a. Automatic processing with our Corpus Text Processor
   6b. Manually converting your data
7. Organizing, preparing and processing metadata
   7a. Gathering and preparing metadata
   7b. Running the metadata processing script
8. Adding headers and changing filenames
   8a. Why add headers and filenames?
   8b. Adding headers and changing filenames script
9. Deidentifying your data
   9a. Automatic deidentification
   9b. Manual deidentification

How to report an issue