### chunking by paragraph

A bit on working with paragraphs / segmenting. Or should that come later?
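
One possible starting point for paragraph-level chunking is sketched here. It assumes that paragraphs in the source file are separated by one or more blank lines, which may not hold for every plain-text edition. The function name `get_paragraphs` and the sample string are made up for illustration.

```python
import re

def get_paragraphs(text):
    """Split text into paragraphs wherever one or more blank lines occur."""
    # '\n\s*\n' matches a newline, any run of whitespace, and another newline,
    # so blank lines that contain stray spaces are treated as separators too.
    pieces = re.split(r'\n\s*\n', text)
    # Collapse line wraps inside each paragraph and drop empty pieces.
    return [' '.join(piece.split()) for piece in pieces if piece.strip()]

# Invented sample text: three paragraphs, the first wrapped over two lines.
sample = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph.\n\n\nThird paragraph."
paragraphs = get_paragraphs(sample)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```

Unlike the equal-sized chunks discussed next, paragraphs vary in length, so counts taken per paragraph are not directly comparable without some normalization.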

### chunking by percentage

It often makes sense to divide a text into parts for more legible analysis. After all, we frequently want a more nuanced sense of how the results of a particular mode of analysis change over the course of a text. To do that, we first need to divide the text into smaller portions that can be analyzed individually. One of the most common ways of doing this is to partition the text into equal-sized units. Below we divide the text of Jacob's Room into 100 roughly equal pieces, measured in characters.

In [1]:
import math

filename = 'corpus/1922_jacobs_room.txt'
with open(filename, 'r') as fin:
    text = fin.read()

text_length = len(text)  # length in characters, not words
number_of_chunks = 100
# Fractional chunk size; it does not change inside the loop, so compute it once.
chunk_size = text_length / number_of_chunks
text_chunks = []
for i in range(number_of_chunks):
    # Round each boundary down to a whole character index.
    chunk_start = math.floor(chunk_size * i)
    chunk_end = math.floor(chunk_size * (i + 1))
    text_chunks.append(text[chunk_start:chunk_end])

print('number of chunks: ' + str(len(text_chunks)))
print('length of chunk 1: ' + str(len(text_chunks[0])))
print('length of chunk 2: ' + str(len(text_chunks[1])))
print('length of chunk 3: ' + str(len(text_chunks[2])))

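
Because each boundary is rounded down to a whole character index, chunk lengths can differ by one character whenever the text length is not an exact multiple of the number of chunks. The check below is a minimal sketch on a made-up string (`sample_text` is hypothetical, not the novel). It uses integer floor division (`//`) rather than `math.floor` on a float, which should give the same boundaries while avoiding floating-point rounding.

```python
# Hypothetical stand-in for the novel: 1,003 characters, not divisible by 10.
sample_text = 'abcdefghij' * 100 + 'xyz'
number_of_chunks = 10

sample_chunks = []
for i in range(number_of_chunks):
    # Integer floor division keeps every boundary an exact whole number.
    chunk_start = i * len(sample_text) // number_of_chunks
    chunk_end = (i + 1) * len(sample_text) // number_of_chunks
    sample_chunks.append(sample_text[chunk_start:chunk_end])

chunk_lengths = [len(chunk) for chunk in sample_chunks]
print(chunk_lengths)                          # every length is 100 or 101
print(''.join(sample_chunks) == sample_text)  # True: nothing lost or repeated
```

Note that boundaries chosen this way fall on arbitrary characters, so a word can be split between two neighboring chunks.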

Dividing up the text in this way provides us with a series of small texts, each of which can be analyzed on its own. We can then string together the analyses of these smaller pieces to make arguments about trends in the overall piece. Below we take the same bit of code, wrap it in a function, and then use it to track changes in the use of the word Jacob over the course of the novel.

In [None]:
import math
import nltk
import matplotlib.pyplot as plt

def get_chunks(text, num_chunks):
    """Split text into num_chunks pieces of roughly equal character length."""
    # Fractional chunk size; boundaries are rounded down to whole indices.
    chunk_size = len(text) / num_chunks
    text_chunks = []
    for i in range(num_chunks):
        chunk_start = math.floor(chunk_size * i)
        chunk_end = math.floor(chunk_size * (i + 1))
        text_chunks.append(text[chunk_start:chunk_end])
    return text_chunks

filename = 'corpus/1922_jacobs_room.txt'
with open(filename, 'r') as fin:
    raw_text = fin.read()

chunked_text = get_chunks(raw_text, 100)
# word_tokenize needs NLTK's tokenizer data; if it raises a LookupError,
# the error message names the resource to fetch with nltk.download().
tokenized_text = [nltk.word_tokenize(chunk) for chunk in chunked_text]
# FreqDist counts every token in a chunk; indexing by 'Jacob' is case-sensitive.
jacob_counts = [nltk.FreqDist(tokenized_chunk)['Jacob'] for tokenized_chunk in tokenized_text]
print(jacob_counts)

Remember: we have divided the text into 100 (roughly) equal units. Using the FreqDist class from the NLTK package, we get a quick count of the token 'Jacob' in each chunk. We can then take those counts and plot them.
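
To see what the per-chunk counting does without needing the corpus file or NLTK's tokenizer data, here is a minimal sketch on invented sentences. It swaps in `collections.Counter` (the standard-library class that `FreqDist` is built on) and a crude regular-expression tokenizer. `toy_chunks` and `simple_tokenize` are made up for illustration and will not match `nltk.word_tokenize` on every input.

```python
import re
from collections import Counter

# Invented example chunks; in the notebook these would come from get_chunks().
toy_chunks = [
    "Jacob ran along the beach.",
    "Where was Jacob? Jacob's room was empty.",
    "The room was quiet.",
]

def simple_tokenize(chunk):
    # Crude stand-in for nltk.word_tokenize: keep runs of letters only,
    # so "Jacob's" becomes the two tokens "Jacob" and "s".
    return re.findall(r"[A-Za-z]+", chunk)

# Like FreqDist, a Counter returns 0 for a token it has never seen.
toy_counts = [Counter(simple_tokenize(chunk))['Jacob'] for chunk in toy_chunks]
print(toy_counts)  # [1, 2, 0]
```

The lookup is case-sensitive, so a form such as 'JACOB' would be counted separately.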

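In [None]:
# Matplotlib 3.6 renamed its seaborn styles; fall back to the old name on older versions.
new_style = 'seaborn-v0_8-whitegrid'
plt.style.use(new_style if new_style in plt.style.available else 'seaborn-whitegrid')
plt.plot(jacob_counts)
plt.show()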

It is a little difficult to see any discernible trend, right? Fortunately, we already have code that can help us parse things a little differently. Rather than dividing the novel into 100 equal parts, let's shift to ten equal chunks.

In [None]:
# get_chunks() and raw_text are already defined in the cell above,
# so only the number of chunks changes here.
chunked_text = get_chunks(raw_text, 10)
tokenized_text = [nltk.word_tokenize(chunk) for chunk in chunked_text]
jacob_counts = [nltk.FreqDist(tokenized_chunk)['Jacob'] for tokenized_chunk in tokenized_text]
print(jacob_counts)

# Matplotlib 3.6 renamed its seaborn styles; fall back to the old name on older versions.
new_style = 'seaborn-v0_8-whitegrid'
plt.style.use(new_style if new_style in plt.style.available else 'seaborn-whitegrid')
plt.plot(jacob_counts)
plt.show()
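
Re-chunking and re-tokenizing the whole novel is not the only way to get a coarser view. Since every tenth boundary of the 100-chunk split should coincide with a boundary of the 10-chunk split, summing each run of ten adjacent counts gives approximately the same result. The totals can differ slightly when a word is split across a chunk boundary. The helper `merge_counts` and the numbers below are invented for illustration.

```python
def merge_counts(counts, group_size):
    """Sum each run of group_size adjacent counts into one coarser count."""
    return [sum(counts[i:i + group_size])
            for i in range(0, len(counts), group_size)]

# Invented counts standing in for a 20-chunk analysis.
fine_counts = [0, 2, 1, 0, 3, 0, 0, 1, 4, 2, 1, 0, 5, 3, 2, 4, 6, 1, 3, 5]
coarse_counts = merge_counts(fine_counts, 5)
print(coarse_counts)  # [6, 7, 11, 19]
```

With the notebook's data, `merge_counts(jacob_counts, 10)` applied to the 100-chunk counts would produce ten values to plot, which makes it cheap to try several levels of granularity.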

The lesson here is that visualizations are constructed and subject to interpretation. The first graph, using 100 chunks, showed a noisy distribution with no clear pattern. The second graph smooths out some of this noise by using fewer, larger chunks, with the result that we can clearly see an increase in the use of Jacob's name over the course of the novel.