9a. Automatic deidentification

Shelley Staples edited this page Nov 30, 2021 · 5 revisions

Before you run this script

Before running the deidentification script, make sure that your texts are in .txt format, UTF-8 encoded, and free of non-ASCII characters. You can use our Corpus Text Processor to perform these steps.
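If you want to double-check your files before running the script, a short Python check along the following lines may help. This is only a sketch and is not part of ciabatta or the Corpus Text Processor; the function name find_problem_files is hypothetical. It walks a folder, tries to decode each .txt file as UTF-8, and reports files that fail to decode or that contain non-ASCII characters.

```python
from pathlib import Path


def find_problem_files(root):
    """Return (path, reason) pairs for .txt files that are not clean ASCII/UTF-8.

    Hypothetical helper for illustration; not part of the ciabatta scripts.
    """
    problems = []
    for path in sorted(Path(root).rglob("*.txt")):
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # The bytes are not valid UTF-8 at all.
            problems.append((path, "not valid UTF-8"))
            continue
        if not text.isascii():
            # Valid UTF-8, but at least one character is outside ASCII.
            problems.append((path, "contains non-ASCII characters"))
    return problems
```

For example, find_problem_files("files_with_headers") returns an empty list when every .txt file in that folder passes both checks.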

The script takes a folder, including any subfolders, and creates a new folder called “deidentified”, which replicates the original folder structure.
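To picture what replicating the folder structure means, here is a minimal sketch of how a source path could be mapped to its counterpart in the new folder. The function name mirrored_path is hypothetical, and the exact layout is an assumption: this sketch keeps the source folder's name as the first level inside “deidentified”, which may differ from what ciabatta_deid.py actually does.

```python
from pathlib import Path


def mirrored_path(source_file, source_root, output_root="deidentified"):
    """Map a file under source_root to the same relative spot under output_root.

    Assumption for illustration: the source folder name is kept as the first
    level inside the output folder. The real script may lay things out differently.
    """
    source_root = Path(source_root)
    relative = Path(source_file).relative_to(source_root)
    return Path(output_root) / source_root.name / relative
```

Under that assumption, a file at files_with_headers/course/essay.txt would map to deidentified/files_with_headers/course/essay.txt (the subfolder and file names here are made up).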

Getting started

The purpose of automatic deidentification is to remove proper names that occur before the body of the text, because that is where most names appear. To find and remove these names and some other identifying information (e.g. emails, student IDs), we use regular expressions that match strings that look like proper names, emails, student IDs, etc. when they are the only content on a line.

Here's an example of a file before the deidentification script was run. And here's the same file after the script was run. You can see that the name Malik Parrish A, which appeared before the main body of the text, was removed.
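The idea of matching whole lines can be sketched in a few lines of Python. The patterns below are hypothetical and much simpler than whatever ciabatta_deid.py uses; the pattern names, the strip_header_lines function, and the max_header_lines cutoff are all assumptions made for illustration. Each pattern is anchored with ^ and $ so that it matches only when the name, email, or ID is the only content on the line.

```python
import re

# Hypothetical patterns; each one must match the entire line.
NAME_LINE = re.compile(r"^\s*[A-Z][a-z]+\.?(?:\s+[A-Z][a-z]*\.?){1,3}\s*$")
EMAIL_LINE = re.compile(r"^\s*[\w.+-]+@[\w-]+(?:\.[\w-]+)+\s*$")
ID_LINE = re.compile(r"^\s*\d{7,10}\s*$")  # assumes IDs are 7 to 10 digits
PATTERNS = (NAME_LINE, EMAIL_LINE, ID_LINE)


def strip_header_lines(text, max_header_lines=10):
    """Drop identifying lines found within the first max_header_lines lines.

    Lines after that cutoff are treated as body text and left untouched.
    """
    kept = []
    for index, line in enumerate(text.splitlines()):
        in_header = index < max_header_lines
        if in_header and any(pattern.match(line) for pattern in PATTERNS):
            continue  # skip a line that is only a name, email, or ID
        kept.append(line)
    return "\n".join(kept)
```

A pattern this loose will also remove short capitalized lines such as a two-word title, which is one reason the output still needs a manual check.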

Downloading the script and the files

We have included a folder with some sample files for you to run the script on. Both the folder and the script are inside ciabatta > automatic_deidentification. The folder is called files_with_headers and the Python script is called ciabatta_deid.py.

There are two ways (a and b below) you can download them.

a) From the GitHub website: Navigate to the ciabatta directory, then in the upper right corner click on the "Code" button and select “Download zip”. This will download the zip file to your computer. Then unzip the file (Windows users: make sure you fully extract the zip rather than just opening it), and you will have the script and the folder on your computer.

b) From the terminal: Navigate to the ciabatta directory, then in the upper right corner click on the "Code" button and copy the link. Then open your terminal on a Mac (on Windows, use Command Prompt or PowerShell) and run this line:

git clone https://github.com/writecrow/ciabatta.git

This will download the git repository, including the script and the files, to your computer.

Running the script on Mac

Now, in your terminal, navigate to where your downloaded ciabatta folder is. For example, if you unzipped the files to your Desktop, navigate to your Desktop.

cd Desktop

Navigate inside the automatic_deidentification folder where the script and files are inside ciabatta:

cd ciabatta/automatic_deidentification

Before you run the script, count the files in the files_with_headers folder so that you can confirm the same number of files exists after the script runs. To count the files, run the following command:

ls files_with_headers/**/**/**/**/*.txt | wc -l

NOTE: There should be 40 files in the files_with_headers folder.

Now you can run the script with the following command:

python ciabatta_deid.py --directory=files_with_headers

Here, ciabatta_deid.py is the name of the script, and files_with_headers is the name of the directory with the text files you want to deidentify. After you run the script, you should have a new folder inside ciabatta called “deidentified”. To make sure all of your files have been processed, count the files in the new folder with this command; the number should match the count from before:

ls deidentified/**/**/**/**/**/*.txt | wc -l
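If you prefer a check that works the same way on Mac and PC, the two counts can also be compared in Python. This is a sketch only, with hypothetical function names; it counts .txt files at any depth, so it does not depend on how many subfolder levels your corpus has.

```python
from pathlib import Path


def count_txt_files(folder):
    """Count .txt files at any depth below folder."""
    return sum(1 for path in Path(folder).rglob("*.txt") if path.is_file())


def same_file_count(source_folder, output_folder):
    """Return (counts_match, source_count, output_count) for the two folders."""
    source_count = count_txt_files(source_folder)
    output_count = count_txt_files(output_folder)
    return source_count == output_count, source_count, output_count
```

Run from inside automatic_deidentification, same_file_count("files_with_headers", "deidentified") should report matching counts (40 each for the sample files) if every file was processed.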

Video presentation for Mac

A video version of this content is available on the Crow YouTube channel.

Video: Running the deidentification script on a Mac

Running the script on PC (PowerShell ISE)

Now, in PowerShell, navigate to where your downloaded ciabatta folder is. For example, if you unzipped the files to your Desktop, navigate to your Desktop.

cd Desktop

Navigate inside the automatic_deidentification folder where the script and files are inside ciabatta:

cd ciabatta\automatic_deidentification

Before you run the script, count the files in the files_with_headers folder so that you can confirm the same number of files exists after the script runs. To count the files, run the following command:

ls files_with_headers\**\**\**\**\*.txt | Measure-Object -Line

NOTE: There should be 40 files in the files_with_headers folder.

Now you can run the script with the following command:

python ciabatta_deid.py --directory=files_with_headers

Here, ciabatta_deid.py is the name of the script, and files_with_headers is the name of the directory with the text files you want to deidentify. After you run the script, you should have a new folder inside ciabatta called “deidentified”. To make sure all of your files have been processed, count the files in the new folder with this command; the number should match the count from before:

ls deidentified\**\**\**\**\**\*.txt | Measure-Object -Line

Video presentation for PC

A video version of this content is available on the Crow YouTube channel.

Video: Running the deidentification script on a PC
