This repository contains the code and data supporting the research paper "Tracing Linguistic Footprints of ChatGPT Across Tasks, Domains and Personas in English and German". The project explores how the output of large language models like ChatGPT differs from human-generated text and analyzes the impact of task-specific prompting on linguistic features in both English and German texts.
The code generalizes to any number of corpora, domains, and tasks; to add them, update config/config.yaml.
- cd text_generation
- Set up your OpenAI API key:

      export OPENAI_API_KEY="your_openai_api_key_here"

- Update prompts, tasks, and personas in prompts.json (see also prompts_ashuman_asmachine.json).
- Create an input JSON file for each corpus. See make_json.py for an example. The JSON file format should be as follows:

      {
        "human_file1": {
          "title": "very interesting and engaging topic",
          "prompt": "part of the text to use for the prompt",
          "text": "the rest of the text"
        },
        "human_file2": { }
      }

- bash call_generate_personas.sh
  - model: gpt-4 (specify your OpenAI model)
  - infolder: ../data_collection/100_files_json/ (human texts for prompting and analysis)
  - outfolder: ../generated_data (see the resulting directory tree representation below)
  - config: ../config/config.yaml
  - calls generate_personas.py
    generated_data/
    ├── corpus1/
    │   ├── task1/
    │   │   ├── human/
    │   │   │   ├── 0.txt
    │   │   │   ├── 1.txt
    │   │   │   └── ...
    │   │   └── system1/
    │   │       ├── 0.txt
    │   │       ├── 1.txt
    │   │       └── ...
    │   └── task2/
    │       ├── human/
    │       │   ├── 0.txt
    │       │   ├── 1.txt
    │       │   └── ...
    │       └── system1/
    │           ├── 0.txt
    │           ├── 1.txt
    │           └── ...
    └── corpus2/
        ├── task1/
        │   ├── human/
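
A minimal sketch of how an input JSON file in the format described above could be assembled from a folder of plain-text files. This is not the repository's make_json.py: the function name `build_corpus_json`, the use of the first line as the title, and the default prompt length of 50 words are assumptions made for illustration.

```python
import json
from pathlib import Path


def build_corpus_json(text_dir, out_file, prompt_words=50):
    """Collect *.txt files into the {"name": {"title", "prompt", "text"}} layout.

    Assumptions (hypothetical, not taken from make_json.py):
    - the first line of each file is its title,
    - the first `prompt_words` words of the body form the prompt,
    - the remaining words form the text.
    """
    corpus = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        if not lines:
            continue  # skip empty files
        title = lines[0].strip()
        words = " ".join(lines[1:]).split()
        corpus[path.stem] = {
            "title": title,
            "prompt": " ".join(words[:prompt_words]),
            "text": " ".join(words[prompt_words:]),
        }
    # Write the corpus as UTF-8 JSON so German umlauts stay readable.
    Path(out_file).write_text(
        json.dumps(corpus, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return corpus
```

Each key in the resulting file is the source file name without its extension, which mirrors the "human_file1" style keys in the format description.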
cd feature_extraction/scripts/
Specify your input data and output directories in run_experiments.sh (run it with bash run_experiments.sh), which executes two bash scripts:
- bash run_extract_BiasMT_features.sh extracts metrics for Sophistication, Lexical and Morphological richness using the BiasMT tool.
- bash run_extract_other_features.sh:
  - extracts features using the TextDescriptives library
  - extracts connectives
  - uses a custom formula for German Flesch Reading Ease
  - reorganizes results and transforms dataframes for further analysis
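
The repository's custom formula is not reproduced here. As an illustration only, the sketch below uses Amstad's commonly cited German adaptation of Flesch Reading Ease, FRE = 180 - ASL - 58.5 * ASW, where ASL is the average sentence length in words and ASW is the average number of syllables per word. Whether the project uses exactly this variant is an assumption, and the vowel-group syllable counter (`count_syllables_de`, a hypothetical name) is a rough heuristic rather than a proper syllabifier.

```python
import re

# Vowel groups approximate German syllable nuclei (rough heuristic, an assumption).
_VOWEL_GROUP = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)


def count_syllables_de(word):
    """Approximate syllable count: number of vowel groups, at least 1."""
    return max(1, len(_VOWEL_GROUP.findall(word)))


def flesch_reading_ease_de(text):
    """Amstad's German Flesch Reading Ease: 180 - ASL - 58.5 * ASW."""
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters (including umlauts and ß) count as words.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)  # average sentence length in words
    asw = sum(count_syllables_de(w) for w in words) / len(words)  # syllables per word
    return 180.0 - asl - 58.5 * asw
```

Higher scores indicate easier text. Because the sentence splitter and syllable counter are simplistic, absolute values from this sketch should not be compared with scores produced by the project's scripts.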
features_list.py contains several lists and dictionaries with feature names:
- features_list is a list of TextDescriptives features
- features_custom is a list of custom-added feature names
- features_to_visualize_dict is a dictionary with feature names used by TextDescriptives and throughout the project as keys and modified feature names as values
- features_raw_counts is a list of features that are measured in raw counts
cd analysis/scripts/
bash run_analysis.sh
- Parameters:
  - alpha: 0.01, 0.05
  - method: bon (Bonferroni), bh (Benjamini-Hochberg)
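
To illustrate what the two method options refer to, the sketch below implements Bonferroni and Benjamini-Hochberg adjustment of a list of p-values in plain Python. It is not the code behind run_analysis.sh; the function names `adjust_bonferroni`, `adjust_bh`, and `significant` are hypothetical.

```python
def adjust_bonferroni(pvalues):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def adjust_bh(pvalues):
    """Benjamini-Hochberg: step-up adjustment controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(pvalues, alpha=0.05, method="bh"):
    """Return one boolean per test: adjusted p-value <= alpha."""
    adjust = {"bon": adjust_bonferroni, "bh": adjust_bh}[method]
    return [p <= alpha for p in adjust(pvalues)]
```

For the p-values 0.01, 0.04, 0.03, and 0.005 at alpha 0.05, the Bonferroni option keeps only the first and last tests significant, while Benjamini-Hochberg keeps all four. This shows why the choice of method matters when many features are tested at once.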
If you use this repository or build on this work, please cite:
@inproceedings{Shaitarova2024,
title={Tracing Linguistic Footprints of ChatGPT Across Tasks, Domains and Personas in English and German},
author={Shaitarova, Anastassia and Bauer, Nikolaj and Vamvas, Jannis and Volk, Martin},
booktitle={Proceedings of the 9th edition of the Swiss Text Analytics Conference},
pages={102--112},
year={2024},
address={Chur, Switzerland},
publisher={Association for Computational Linguistics}
}