# Wrangling Humanities Data with AI and R: A Very Brief Introduction 
In this workshop, we will explore how generative AI can help you create R code for working with humanistic data.

## Working With Unstructured Data 
This portion of the session focuses on text generated by optical character recognition (OCR) from issues of The Gettysburgian published in the 1999-2000 academic year (August through May).

The data we are working with are a series of plain text files, one for each issue, with file names that correspond to the date of the issue (such as `1999-08-26.txt`). This is a type of **unstructured data**. The repository also includes a file, `corpus_index.csv`, that lists each text file and its issue date; because it is **structured data**, it can help us find and organize the issues.
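
Because each file name encodes the issue date, the date can be recovered from the path alone. Here is a minimal sketch in base R, using example paths that follow the same naming pattern (the second path is hypothetical and may not match a real issue):

```r
# Example paths following the corpus naming pattern (the second is hypothetical)
paths <- c("gettysburgian/1999-08-26.txt", "gettysburgian/1999-09-02.txt")

# Drop the folder name and the .txt extension, then parse what is left as a date
issue_dates <- as.Date(tools::file_path_sans_ext(basename(paths)), format = "%Y-%m-%d")

# Pair each path with its parsed date, similar to what an index file stores
index_sketch <- data.frame(issue_date = issue_dates, file_path = paths,
                           stringsAsFactors = FALSE)
print(index_sketch)
```

This is the kind of structure an index file gives you for free: one row per issue, with a real date you can sort and filter on.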

## What We Will Do:
1. Load the issues of the Gettysburgian. 
2. Verify the number of issues and see a portion of one issue. 
3. Use a generative AI tool to write R code that analyzes some aspect of word usage in the issues.

You will generate most of the R code yourself by working with an AI tool of your choice (ChatGPT, Claude, Gemini, etc.).

This workshop uses a [GitHub repository](https://github.com/rmiessle/jfi2026) to store the data and [Binder](https://mybinder.org), a tool that turns the repository into a notebook where you can run code.


## How to Run Code in This Binder Notebook

- Click into any `code` cell.
- Press **Shift + Enter** to run the code.

If a cell returns an error:
- Check the error message.
- Ask an AI tool to help fix it, usually by copying the error message and pasting it into the AI tool.

Sometimes you will see warnings in red. You can generally ignore these, although they sometimes contain information you can use to refine your R script.
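
The difference between a warning and an error can be seen in a small base R sketch. The values and the file name `no_such_file.txt` are invented for illustration. A warning lets the code finish, possibly with a surprising result, while an error stops it.

```r
# A warning: R finishes the conversion but flags that "n/a" could not become a number.
# The third value becomes NA.
page_counts <- as.numeric(c("12", "8", "n/a"))
print(page_counts)

# An error: reading a file that does not exist stops the code.
# tryCatch() captures the error message so it can be inspected
# (or copied into an AI tool) instead of halting the whole cell.
# "no_such_file.txt" is a made-up name assumed not to exist.
result <- tryCatch(
  readLines("no_such_file.txt"),
  error = function(e) {
    cat("Error message:", conditionMessage(e), "\n")
    NULL  # return NULL so the rest of the cell can continue
  }
)
```

You do not need `tryCatch()` in this workshop; it is shown only to make the error message visible without stopping the example.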



## Verify the Issue Dataset

The Gettysburgian issues are `.txt` files in the `gettysburgian` folder in this notebook. This next step is just a quick check to make sure all the files we expect to see are there.

This code loads the `tidyverse` R library, which provides helper functions that let us read the index file and work with it.

Run the next cell to load it by clicking in the cell and pressing **Shift + Enter**.



In [None]:
# loads the tidyverse R library
library(tidyverse)
# create a variable 'issues' for the CSV corpus index
issues <- read_csv("corpus_index.csv", show_col_types = FALSE)
# Quick sanity checks: confirm files exist and preview one issue

# 1) Do the files listed in the index exist?
issues <- issues |> dplyr::mutate(exists = file.exists(file_path))
issues |> dplyr::count(exists) |> print()

# 2) Preview the first existing issue (first ~30 lines)
if (!any(issues$exists)) {
  cat("\nNo text files were found at the paths listed in the index.\n")
  cat("Here are a few example paths from the index:\n")
  print(head(issues$file_path, 5))
} else {
  first_file <- issues$file_path[which(issues$exists)[1]]
  cat("\nPreviewing:", first_file, "\n\n")
  lines <- readLines(first_file, warn = FALSE, encoding = "UTF-8")
  # head() returns at most 30 lines and also handles an empty file safely
  cat(paste(head(lines, 30), collapse = "\n"))
  cat("\n")
}


## Using AI to Write R Code

You will now generate your own R code with the help of a generative AI tool.

### Instructions:
1. Ask an AI tool to write the R code for the visualization or analysis.  
2. Copy and paste **only the code** into the cell below, underneath our libraries.  
3. Run it by clicking in the cell and pressing **Shift + Enter**.  
4. If the results are unexpected, ask the AI to troubleshoot and refine the code, often by giving it examples of what went wrong. If errors appear, copy and paste the error messages and ask the AI to help debug.

### Tips for Successful Prompt Writing
Successful prompt writing requires giving the AI tool some information about your coding environment and data. In this case, we can say:

**I am using a notebook created at mybinder.org to run R scripts from a GitHub repository provided by a facilitator. This repository includes a directory, `gettysburgian`, which includes the OCR-ed text of student newspaper issues. Each issue is a plain text file with a filename in `YYYY-MM-DD.txt` format, which corresponds to the date of the issue. There is also a CSV file in the root directory of the repository, `corpus_index.csv`, that has two columns, `issue_date` (YYYY-MM-DD) and `file_path` (e.g., gettysburgian/1999-08-26.txt). Each row in the `corpus_index.csv` spreadsheet is a different issue (example: `1999-08-26,gettysburgian/1999-08-26.txt`), for a total of 27 issues. The mybinder.org notebook has the following R libraries preloaded:** 

**`tidyverse, readr, dplyr, stringr, lubridate, tidytext, ggplot2, syuzhet, wordcloud, RColorBrewer`**

**I want to create some R scripts that analyze and visualize the dataset. When creating the code, do not re-load libraries. Assume I know nothing about R, so please include comments that help me understand what the different parts of the script are doing.** 

You can send this description before you ask the AI tool to write any R scripts, so that it is primed for what comes next.



### Task: Analyze the Corpus Using AI-Generated Code
First, prime your AI tool by describing your environment and data as above. Then ask it to write an R script.

#### Example prompts you might use:
- Write R code that tells me which issues the word 'Isherwood' appears in, and how many times it appears in each issue
- Write R code that finds the 25 most frequent non-stopword words across all the issues, ignoring the words gettysburg, college, and gettysburgian (regardless of case), and creates a word cloud
- Write R code that gives me a table of the most common non-stopword word in each issue, also ignoring the words gettysburg, college, and gettysburgian (regardless of case)
- Write R code that calculates a Syuzhet sentiment score for each issue and reports the results in a table
- Write R code that counts the total number of characters (excluding spaces and blank lines) in each issue and creates a line chart of the totals
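
As a rough idea of what a tool might return for the first prompt, here is a minimal sketch in base R. It is not the output of any particular AI tool, and `count_word_by_issue()` is a hypothetical helper name. So that it runs anywhere, the demo writes two tiny made-up issues to a temporary folder; in the notebook you would instead pass the `issues` table read from `corpus_index.csv`.

```r
# Hypothetical helper: count case-insensitive whole-word matches of `word` in each issue.
# `index` is a data frame with columns issue_date and file_path, like corpus_index.csv.
count_word_by_issue <- function(index, word) {
  pattern <- paste0("\\b", word, "\\b")
  counts <- vapply(index$file_path, function(path) {
    # Read the issue and join its lines into one string
    text <- paste(readLines(path, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
    # gregexpr() finds every match; regmatches() extracts them so we can count
    hits <- regmatches(text, gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE))[[1]]
    length(hits)
  }, integer(1), USE.NAMES = FALSE)
  result <- data.frame(issue_date = index$issue_date, n = counts)
  # Keep only the issues where the word appears at least once
  result[result$n > 0, , drop = FALSE]
}

# Demo on two made-up issues written to a temporary folder (not real corpus text)
demo_dir <- tempfile("issues_")
dir.create(demo_dir)
demo_paths <- file.path(demo_dir, c("1999-08-26.txt", "1999-09-02.txt"))
writeLines(c("Isherwood spoke on campus.", "Students met isherwood afterward."), demo_paths[1])
writeLines("No mention of that name in this issue.", demo_paths[2])
demo_index <- data.frame(issue_date = as.Date(c("1999-08-26", "1999-09-02")),
                         file_path = demo_paths,
                         stringsAsFactors = FALSE)

isherwood_counts <- count_word_by_issue(demo_index, "Isherwood")
print(isherwood_counts)
```

Because the text comes from OCR, a name that was misread during scanning will not match, so counts like these may understate how often a word really appears.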

After the AI tool generates the code, copy **only the code** (most tools provide a copy button in the code block) and paste it **below the libraries** in the code cell below.



In [None]:
# DO NOT DELETE OR PASTE OVER THESE LINES
# load our R libraries
library(tidyverse)
library(readr)
library(dplyr)
library(stringr)
library(lubridate)
library(tidytext)
library(ggplot2)
library(syuzhet)
library(wordcloud)
library(RColorBrewer)
# Paste your AI-generated code below this line and run it.
# --------------------------------------------------------




If you want to try a different prompt, generate new code, paste it into the cell below, and press **Shift + Enter**. You will not need to reload the libraries.


In [None]:
# Paste your AI-generated code below this line and run it.
# --------------------------------------------------------


