# Wrangling Humanities Data with AI and R – A Very Brief Introduction

In this workshop we will explore how generative AI can help you create R code for working with humanistic data.
This portion of the session focuses on text generated by optical character recognition (OCR) of *The Gettysburgian* from the 1997-1998 academic year (August through April).
The data is a series of plain text files, one per issue, with file names that correspond to the issue dates. This is a type of **unstructured data**. There is also a file, `corpus_index.csv`, that lists each file alongside its issue date; as **structured data**, it gives us a convenient way to find and order the issues.

### What we will do:
1. Get a list of the files we are using.
2. Use generative AI to write R code that:
   - visualizes some aspect of the data
   - analyzes some aspect of the word usage

You will generate most of the R code yourself by working with an AI tool of your choice (ChatGPT, Claude, Gemini, etc.).


## How to Run Code in This mybinder.org Notebook

- Click any `code` cell.
- Press **Shift + Enter** to run the code.

If a cell returns an error:
- Check the error message.
- Ask an AI tool to help fix it, usually by copying and pasting the error message into the tool.

Sometimes you will see warnings in red. You can generally ignore these, although they sometimes contain useful information for refining your R script.
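
To see the difference between a warning and an error, here is a minimal sketch in base R. The file name `no_such_file.csv` is a made-up example, not part of the workshop data, and `try()` is used only so that the sketch keeps running after the error.

```r
# A warning: R still finishes the job, but flags something odd.
# Text that is not a number becomes NA (a missing value).
year <- as.numeric("nineteen ninety-seven")
print(year)

# An error: R stops and returns no result.
# `no_such_file.csv` is a hypothetical file that does not exist.
# try() catches the error so the rest of this cell can still run.
outcome <- try(read.csv("no_such_file.csv"), silent = TRUE)
print(class(outcome))
```

A warning leaves you with a result to inspect, while an error leaves you with nothing, which is why error messages are usually the first thing worth pasting into your AI tool.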



# Loading the Gettysburgian Dataset

The Gettysburgian files are in the `gettysburgian` folder in this notebook. When we load the corpus index, we will see the first few rows. We should also see the column names, **issue_date** (in YYYY-MM-DD format) and **file_name**. The column names are important when working with AI to generate R scripts.

This code loads the tidyverse R library, which provides the `read_csv()` function that lets us read the CSV file.

Run the next cell to load it by clicking in the cell and pressing **Shift + Enter**.

Note: This step is not always necessary if you are already familiar with the dataset, but pulling it up verifies that it loads correctly.

In [None]:
# loads the tidyverse R library
library(tidyverse)
# create a variable `issues` for the CSV corpus index
issues <- read_csv("gettysburgian/corpus_index.csv", show_col_types = FALSE)
# Peek at the dataset
head(issues)
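
Once the index loads, a quick sanity check can confirm that the dates parse and that every listed file is actually present. The sketch below is one way to do that and is not part of the original notebook. `check_index()` is a hypothetical helper name, and the sketch assumes the paths in `file_name` are relative to the notebook's working directory.

```r
# Hypothetical helper (not part of the workshop repository):
# converts the dates and flags whether each listed file exists.
check_index <- function(index) {
  index$issue_date <- as.Date(index$issue_date)      # text -> Date
  index$file_found <- file.exists(index$file_name)   # TRUE or FALSE per row
  index
}

index_path <- "gettysburgian/corpus_index.csv"

# Only run the check if the index is where we assume it is
if (file.exists(index_path)) {
  issues <- readr::read_csv(index_path, show_col_types = FALSE)
  checked <- check_index(issues)
  print(range(checked$issue_date))   # earliest and latest issue dates
  print(table(checked$file_found))   # how many files were found or missing
}
```

If any row shows `file_found` as `FALSE`, that is worth mentioning to your AI tool, since a script that reads every file will stop at the missing one.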


# Using AI to Write R Code

You will now generate your own R code with the help of a generative AI tool.

### Successful prompt writing ### 
Successful prompt writing requires that you provide the AI tool some information about your coding environment and data. In this case, we can say:

**I am using a mybinder.org notebook to run R scripts from a GitHub repository provided by a facilitator. This repository includes a directory, gettysburgian, which includes the OCR-ed text of student newspaper issues. Each issue is a plain text file named in YYYY-MM-DD.txt format, which corresponds to the date of the issue. There is also a CSV file, corpus_index.csv, that has two columns, issue_date and file_name. Each row in the corpus_index spreadsheet is a different issue (example: 1997-08-21,gettysburgian/1997-08-21.txt). The mybinder.org notebook has the following R libraries preloaded:** 
**`tidyverse, readr, dplyr, stringr, lubridate, tidytext, ggplot2, syuzhet, wordcloud, RColorBrewer`. Assume I know nothing about R, so please include comments that help me understand what the different parts of the script are doing.** 

You can send this before you ask the AI tool to write any R scripts, so that it is primed for what comes next.



# Task 1: Analyze the Corpus Using AI-Generated Code

First, prime your AI tool by giving it your environment and data description as above. Then you can ask it to write an R script.

### Example prompts you might use:
- **Write R code that finds the 25 most frequent words across all the issues, excluding stopwords and the words gettysburg, college, and gettysburgian, and creates a word cloud**
- **Write R code that gives me a table of the most common word in each issue, excluding stopwords and the words gettysburg, college, and gettysburgian**
- **Write R code that applies a Syuzhet sentiment analysis score for each issue and reports the results in a table**
- **Write R code that counts the total number of characters (excluding spaces and blank lines) in each issue and creates a line chart of the totals**
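
As an illustration of what an AI tool might return for the last prompt, here is a minimal sketch. It is not the only valid answer, and your tool's output will likely differ. `count_chars()` is a hypothetical helper name, and the sketch assumes the files are UTF-8 text and that the paths in `file_name` are relative to the notebook's working directory.

```r
# Hypothetical helper: counts the characters in one text file,
# ignoring spaces, tabs, and blank lines.
count_chars <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  no_space <- gsub("[[:space:]]+", "", lines)  # blank lines become ""
  sum(nchar(no_space))
}

index_path <- "gettysburgian/corpus_index.csv"

# Only run if the index is where we assume it is
if (file.exists(index_path)) {
  issues <- readr::read_csv(index_path, show_col_types = FALSE)

  # One character total per issue, in the same order as the index
  issues$n_chars <- vapply(issues$file_name, count_chars, integer(1),
                           USE.NAMES = FALSE)

  # Line chart of the totals over time
  chart <- ggplot2::ggplot(issues, ggplot2::aes(x = issue_date, y = n_chars)) +
    ggplot2::geom_line() +
    ggplot2::labs(x = "Issue date", y = "Characters (no whitespace)")
  print(chart)
}
```

The `readr::` and `ggplot2::` prefixes just name the package each function comes from. With the libraries already loaded at the top of the cell, they can be dropped.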

### Instructions:
1. Ask an AI tool to write the R code for your chosen analysis or visualization.  
2. Copy and paste **only the code** into the cell below, underneath our libraries.  
3. Run it by pressing **Shift + Enter**.  
4. If errors appear, copy and paste the error messages and ask the AI to help debug.


In [None]:
# DO NOT DELETE OR PASTE OVER THESE LINES
# load our R libraries
library(tidyverse)
library(readr)
library(dplyr)
library(stringr)
library(lubridate)
library(tidytext)
library(ggplot2)
library(syuzhet)
library(wordcloud)
library(RColorBrewer)
# Paste your AI-generated code below this line and run it.
# --------------------------------------------------------


