# In-Class Practicum:

March 2, 2022



## Processing text files

This Jupyter notebook includes a series of examples of how you might use regular expressions (the `re` library) along with the `collections` and `pathlib` libraries to batch process text files.

### Reading in a file

In [2]:
import pandas as pd
import re
from collections import Counter

In [3]:
filepath_of_text = "../_datasets/texts/literature/Jane-Austen-Pride-and-Prejudice.txt"

In [4]:
full_text = open(filepath_of_text, mode='r', encoding="utf-8").read()

In [5]:
# Print the first 500 characters of our string
print(full_text[:500])

﻿The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited
by R. W. (Robert William) Chapman


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org





Title: Pride and Prejudice


Author: Jane Austen

Editor: R. W. (Robert William) Chapman

Release Date: May 9, 2013  [eBook #42671]

Lan

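The `open(...).read()` call above never explicitly closes the file. A minimal sketch of a more defensive pattern follows, assuming a small stand-in file (the name `sample-austen.txt` and its contents are made up for illustration). A `with` block closes the file automatically, and the `utf-8-sig` codec drops the byte-order mark that Python displays as `\ufeff` at the very start of the Gutenberg text.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the Gutenberg file: UTF-8 text that starts with a byte-order mark
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "sample-austen.txt"
sample_path.write_bytes("\ufeffThe Project Gutenberg eBook, Pride and Prejudice".encode("utf-8"))

# Plain utf-8 keeps the byte-order mark as the first character
with open(sample_path, mode="r", encoding="utf-8") as f:
    text_with_bom = f.read()

# utf-8-sig strips the mark if it is present; the with block closes the file for us
with open(sample_path, mode="r", encoding="utf-8-sig") as f:
    text_without_bom = f.read()

print(repr(text_with_bom[:11]))
print(repr(text_without_bom[:11]))
```

Whether to strip the mark is a judgment call. It rarely matters for reading, but it can show up as a stray character attached to the first word when counting words.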

### Write a text file

Instead of reading a file with `open('sample-file.txt', mode='r', encoding='utf-8').read()`, which uses the read mode `mode='r'`, we can write to a file by using the write mode `mode='w'`.

Let's create a new blank file called "a-new-file.txt":

In [6]:
open('a-new-file.txt', mode='w', encoding='utf-8')

<_io.TextIOWrapper name='a-new-file.txt' mode='w' encoding='utf-8'>

To add something to this file, we use the `.write()` method:

In [7]:
open('a-new-file.txt', mode='w', encoding='utf-8').write('I just wrote this text to a text file!')

38

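Note that `mode='w'` replaces whatever the file already contained, so the second call above overwrote the empty file rather than adding to it. Here is a short sketch of the difference between write mode and append mode. It uses a temporary directory so that it does not touch the files above, and the file name `notes.txt` is a placeholder.

```python
import tempfile
from pathlib import Path

# Placeholder file inside a throwaway directory
notes_path = Path(tempfile.mkdtemp()) / "notes.txt"

# mode='w' creates the file, or empties it if it already exists
with open(notes_path, mode="w", encoding="utf-8") as f:
    characters_written = f.write("First line\n")

# mode='a' keeps the existing contents and adds to the end
with open(notes_path, mode="a", encoding="utf-8") as f:
    f.write("Second line\n")

# .write() returns the number of characters written
print(characters_written)
print(notes_path.read_text(encoding="utf-8"))
```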
### Open and read all files in a directory
What if we want to work with more than one text at once, like analyzing word frequencies or patterns in a collection of texts?

Using `for` loops (which we have already learned) and a library called `pathlib`, we can iterate over all the files in a directory:

#### Import pathlib

In [8]:
from pathlib import Path

In [9]:
# Define the path to the directory that we want to loop through
directory_path = 'sample-directory'

#### Loop through every file in the directory with the star * character (a wildcard that matches anything)

In [10]:
for filepath in Path(directory_path).glob('*'):
    print(filepath)

sample-directory/sample-text1.txt
sample-directory/sample-text3.txt
sample-directory/sample-text2.txt


#### Loop through just text files in the directory with *.txt, which matches only files that end with “.txt”

In [11]:
for filepath in Path(directory_path).glob('*.txt'):
    print(filepath)

sample-directory/sample-text1.txt
sample-directory/sample-text3.txt
sample-directory/sample-text2.txt

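In the output above, `sample-text3.txt` is listed before `sample-text2.txt`: `glob()` makes no promise about order. The sketch below sorts the results and pulls the parts out of each path. It rebuilds a similar directory in a temporary location, and the extra `notes.md` file is a made-up addition to show that `*.txt` skips it.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory that mirrors the sample-directory used above
directory_path = Path(tempfile.mkdtemp()) / "sample-directory"
directory_path.mkdir()
for name in ["sample-text1.txt", "sample-text3.txt", "sample-text2.txt", "notes.md"]:
    (directory_path / name).write_text("placeholder", encoding="utf-8")

# sorted() gives a stable, alphabetical order for the matching paths
text_files = sorted(directory_path.glob("*.txt"))

for filepath in text_files:
    # .name is the file name, .stem drops the extension, .suffix is the extension
    print(filepath.name, filepath.stem, filepath.suffix)
```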

#### To read these text files, simply add in the open() function and .read() method

In [12]:
for filepath in Path(directory_path).glob('*.txt'):
    print(open(filepath, encoding='utf-8').read())

This is a sample text!
This is yet another sample text!
This is another sample text!

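Once we can read every file in a directory, we can combine that loop with `re` and `Counter` to count words across the whole collection. This is a minimal sketch, not a full tokenizer. It recreates the three sample files in a temporary directory, with contents assumed from the output above, and treats any run of letters or apostrophes as a word.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Recreate the three sample files in a throwaway directory
directory_path = Path(tempfile.mkdtemp())
sample_texts = {
    "sample-text1.txt": "This is a sample text!",
    "sample-text2.txt": "This is another sample text!",
    "sample-text3.txt": "This is yet another sample text!",
}
for name, text in sample_texts.items():
    (directory_path / name).write_text(text, encoding="utf-8")

word_counts = Counter()
for filepath in sorted(directory_path.glob("*.txt")):
    with open(filepath, encoding="utf-8") as f:
        text = f.read()
    # Lowercase the text and pull out runs of letters and apostrophes as "words"
    words = re.findall(r"[a-z']+", text.lower())
    word_counts.update(words)

print(word_counts.most_common(3))
```

Keeping one `Counter` outside the loop gives totals for the whole collection. Creating a new `Counter` inside the loop would give per-file counts instead.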

## Splitting up a text file


### Split up a text file by paragraphs
To split a file by paragraphs, we first need to figure out how paragraphs are represented in our text. Let's peek inside the file:

In [13]:
full_text[:500]

'\ufeffThe Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited\nby R. W. (Robert William) Chapman\n\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\n\n\n\nTitle: Pride and Prejudice\n\n\nAuthor: Jane Austen\n\nEditor: R. W. (Robert William) Chapman\n\nRelease Date: May 9, 2013  [eBook #42671]\n\nLan'

Notice that paragraph breaks are marked by two newline characters in a row, "\n\n". The newline character is a way of encoding a line break as a character, and we can use it to split our text every time we encounter a paragraph break, like so:

In [14]:
full_text_split_by_paragraph = full_text.split("\n\n")

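Splitting on exactly `"\n\n"` leaves empty strings and stray leading newlines wherever the text has more than one blank line in a row, as in the Gutenberg header. One possible alternative is to split on any run of blank lines with `re.split` and drop the empty pieces. The sketch below uses a shortened excerpt of the header rather than the full novel.

```python
import re

# Shortened excerpt of the header, with the same uneven runs of blank lines
excerpt = (
    "with this eBook or online at www.gutenberg.org\n\n\n\n\n\n"
    "Title: Pride and Prejudice\n\n\n"
    "Author: Jane Austen\n\n"
    "Editor: R. W. (Robert William) Chapman\n"
)

# str.split on exactly two newlines leaves empty strings and leading newlines behind
naive_split = excerpt.split("\n\n")

# Split on a newline, any amount of whitespace (including more newlines), and a newline,
# then strip each piece and discard anything that ends up empty
pieces = re.split(r"\n\s*\n", excerpt)
paragraphs = [piece.strip() for piece in pieces if piece.strip()]

print(naive_split)
print(paragraphs)
```

Note that the cleaned list has different index numbers from the plain `split("\n\n")` result, so an index found with one approach does not carry over to the other.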
### Print out the contents of paragraph 36
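(Remember, Python indexing starts at 0, not 1, so index 36 is the 37th item in the list. The list also includes every piece of front matter that comes before the novel's opening, which is why the first sentence sits at such a high index.)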

In [15]:
# Print out the contents of paragraph 36
print(full_text_split_by_paragraph[36])


It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

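Finding the right index by trial and error gets tedious. Here is a small sketch of looking a paragraph up by its content instead. The short list is a made-up stand-in for `full_text_split_by_paragraph`, and the variable names are only for illustration.

```python
# Made-up stand-in for full_text_split_by_paragraph
paragraphs = [
    "Title: Pride and Prejudice",
    "",
    "\nAuthor: Jane Austen",
    "\n\nIt is a truth universally acknowledged, that a single man in possession\n"
    "of a good fortune, must be in want of a wife.",
]

phrase = "truth universally acknowledged"

# Collect the index of every paragraph that contains the phrase
matching_indexes = [i for i, paragraph in enumerate(paragraphs) if phrase in paragraph]

print(matching_indexes)
print(paragraphs[matching_indexes[0]].strip())
```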

### Split up a text file by a string, e.g. "CHAPTER"

In [16]:
# Split up our document by CHAPTER
full_text_split_by_chapter = full_text.split("CHAPTER")

In [17]:
# Print out the contents of "CHAPTER" 3
# Index 0 holds everything before the first "CHAPTER", the start of the document
print(full_text_split_by_chapter[3])

 III.


Not all that Mrs. Bennet, however, with the assistance of her five
daughters, could ask on the subject was sufficient to draw from her
husband any satisfactory description of Mr. Bingley. They attacked him
in various ways; with barefaced questions, ingenious suppositions, and
distant surmises; but he eluded the skill of them all; and they were at
last obliged to accept the second-hand intelligence of their neighbour
Lady Lucas. Her report was highly favourable. Sir William had been
delighted with him. He was quite young, wonderfully handsome, extremely
agreeable, and to crown the whole, he meant to be at the next assembly
with a large party. Nothing could be more delightful! To be fond of
dancing was a certain step towards falling in love; and very lively
hopes of Mr. Bingley's heart were entertained.

"If I can but see one of my daughters happily settled at Netherfield,"
said Mrs. Bennet to her husband, "and all the others equally well
married, I shall have nothing to wish for

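A plain `split("CHAPTER")` cuts wherever that word appears and throws the word itself away. As a hedged alternative, `re.split` with a capturing group can keep each chapter numeral alongside its text. The sketch assumes headings shaped like "CHAPTER III.", which is what the output above suggests, and it runs on a made-up miniature text rather than the full file.

```python
import re

# Made-up miniature text that follows the "CHAPTER <roman numeral>." heading pattern
sample_text = (
    "Front matter before the first chapter.\n\n"
    "CHAPTER I.\n\nIt is a truth universally acknowledged...\n\n"
    "CHAPTER II.\n\nMr. Bennet was among the earliest...\n\n"
    "CHAPTER III.\n\nNot all that Mrs. Bennet, however...\n"
)

# The capturing group keeps each roman numeral in the result list
parts = re.split(r"CHAPTER ([IVXLC]+)\.", sample_text)

# parts alternates: [front matter, numeral, body, numeral, body, ...]
front_matter = parts[0]
chapters = {numeral: body.strip() for numeral, body in zip(parts[1::2], parts[2::2])}

print(list(chapters))
print(chapters["III"])
```

If the same heading appears more than once in a real file, for example in a table of contents, later entries in the dictionary overwrite earlier ones, so it is worth checking the number of pieces before relying on this.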
### Output a list to a single text file

Write our split text (now a list of paragraphs) out as a single text file.

Example: write every paragraph of the Austen text to one file, each followed by a newline

In [None]:
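# Open the output file; the with block closes it when the loop finishes
with open('booknlp-contexts.txt', mode='w', encoding='utf-8') as output_file:
    for paragraph in full_text_split_by_paragraph:
        # Write each paragraph followed by a newline
        output_file.write(paragraph)
        output_file.write('\n')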

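The loop above puts a single newline after each paragraph, so the blank lines that originally separated paragraphs are not restored. If the aim is a file that can be split again the same way, one option is to join the list with the same separator used to split it. The sketch below uses a made-up three-paragraph text and a placeholder file name.

```python
import tempfile
from pathlib import Path

# Made-up text standing in for the novel
original_text = "First paragraph.\n\nSecond paragraph,\nwith two lines.\n\nThird paragraph."
paragraphs = original_text.split("\n\n")

# Placeholder output file in a throwaway directory
output_path = Path(tempfile.mkdtemp()) / "rejoined.txt"

# Joining with the separator we split on restores the paragraph breaks
with open(output_path, mode="w", encoding="utf-8") as output_file:
    output_file.write("\n\n".join(paragraphs))

round_trip = output_path.read_text(encoding="utf-8")
print(round_trip == original_text)
```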
### Output a list (here, our split text file) as a series of new files
To output our split files as a series of new files with the same beginning, followed by the number of the section:

In [None]:
# Write each section to its own file: a shared filename prefix followed by the section number
beginning_of_output_filenames = 'Austen-Pride-and-Prejudice-'
for i, paragraph in enumerate(full_text_split_by_paragraph, start=1):
    with open(beginning_of_output_filenames + str(i) + '.txt', mode='w', encoding='utf-8') as output_file:
        output_file.write(paragraph)

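With plain numbers in the filenames, an alphabetical listing puts file 10 before file 2. One way around this, sketched here with a made-up list of twelve paragraphs and a hypothetical `filename_prefix` variable, is to pad each number with zeros. The sketch only builds the names and does not write any files.

```python
# Made-up list standing in for full_text_split_by_paragraph
paragraphs = [f"Paragraph number {n}" for n in range(1, 13)]

# Hypothetical prefix, matching the one used above
filename_prefix = "Austen-Pride-and-Prejudice-"

# Pad every number to the width of the largest one, e.g. 01 through 12
width = len(str(len(paragraphs)))
filenames = [
    f"{filename_prefix}{i:0{width}d}.txt"
    for i, _ in enumerate(paragraphs, start=1)
]

print(filenames[0], filenames[-1])
# Alphabetical order now matches numeric order
print(sorted(filenames) == filenames)
```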
### Output a list (here, our split text file) as a series of new files in a new directory

In [None]:
# Write our split paragraphs to a new directory as a series of new files that
# share a filename prefix, followed by the number of the section

# Import pathlib
from pathlib import Path

# Define and create the new output directory using pathlib
path = Path('Austen-Pride-and-Prejudice-paragraphs/')
path.mkdir(exist_ok=True)

# Set the prefix for our output files, followed by the number of the section
beginning_of_output_filenames = 'Austen-Pride-and-Prejudice-paragraph'

# Iterate over each paragraph and write it to its own numbered file
for i, paragraph in enumerate(full_text_split_by_paragraph, start=1):
    output_filepath = path / (beginning_of_output_filenames + str(i) + '.txt')
    with open(output_filepath, mode='w', encoding='utf-8') as output_file:
        output_file.write(paragraph)
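
To check that the files came out as expected, we can read them back in and compare them with the original list. This is a sketch under stated assumptions: the paragraphs, the `paragraphs` directory name, and the `section_number` helper are all made up for illustration. It sorts by the number in each filename so that file 2 comes before file 10.

```python
import re
import tempfile
from pathlib import Path

# Made-up list standing in for full_text_split_by_paragraph
paragraphs = [f"Paragraph {n}" for n in range(1, 12)]

# Throwaway output directory
output_dir = Path(tempfile.mkdtemp()) / "paragraphs"
output_dir.mkdir(exist_ok=True)

for i, paragraph in enumerate(paragraphs, start=1):
    # The / operator joins path pieces; write_text opens, writes, and closes the file
    (output_dir / f"paragraph{i}.txt").write_text(paragraph, encoding="utf-8")

def section_number(filepath):
    """Return the number at the end of a name such as paragraph11.txt."""
    return int(re.search(r"(\d+)$", filepath.stem).group(1))

# Sort numerically so paragraph2 comes before paragraph10
ordered_files = sorted(output_dir.glob("*.txt"), key=section_number)
restored = [filepath.read_text(encoding="utf-8") for filepath in ordered_files]

print(restored == paragraphs)
```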