# Segment and Disaggregate Texts

## Introduction

Use this code to clean, section, and disaggregate texts and corpora. 

**Why Perform Text Sectioning?** 

Dividing texts into sections (for example, chapters or chunks of N words) is a useful precursor to topic modeling and other forms of computational analysis, which tend to perform more accurately on groups of shorter segments than on whole long texts. 

**Why Disaggregate Texts?** 

Disaggregating the words in a text (in this case, by alphabetizing them) also creates data sets that can be shared freely when the original texts cannot be, due to copyright restrictions. 
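
As a minimal sketch of what alphabetizing does to a passage (the sample sentence and the helper name `disaggregate` are illustrative and not part of this notebook's pipeline):

```python
def disaggregate(text):
    """Return the words of `text` in sorted order, joined by single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return ' '.join(sorted(text.split()))

sample = "the ship left the harbor at dawn"
print(disaggregate(sample))  # at dawn harbor left ship the the
```

Word frequencies survive, but word order does not. Note that Python's default sort is case-sensitive, so capitalized words sort before lowercase ones.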

*Input/Output Specifications:* 

This code requires plain-text (.txt) files as input, either those from this repository's sample_data folder or those from a local machine. It returns csv files with disaggregated text grouped by chapter or by chunk of N words.

## Upload and Add Text Files To Pandas DataFrame
In this section, text files are loaded into a Pandas DataFrame. Pandas is a fast and relatively easy way to work with large datasets. Though data frames are typically associated with numbers, Pandas also offers many functionalities for [working with textual data](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm). 

In [None]:
#Import os and glob
import glob
import os

#Import pandas
import pandas as pd

#Import nltk for tokenization 
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
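nltk.download('punkt')
#Newer NLTK releases may look for 'punkt_tab' instead; downloading both is a precaution
nltk.download('punkt_tab')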


In [None]:
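#Print the current working directory 
print(os.getcwd())

#Change working directory to the folder containing the txt files
#(os.chdir returns None, so its result should not be assigned to path)
os.chdir("/PATHNAME")

#Store the new working directory
path = os.getcwd()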

In [None]:
#Append all txt files to a pandas dataframe
filenames = []
data = []
files = [f for f in os.listdir(path) if os.path.isfile(f)]
for f in files:
    #Skip anything that is not a txt file
    if f.endswith('.txt'):
        #Read as bytes; the text is decoded in the cleaning section
        with open(f, 'rb') as myfile:
            filenames.append(myfile.name)
            data.append(myfile.read())
d = {'Title':filenames, 'Text': data}
books = pd.DataFrame(d)
books

## Perform Minimal Cleaning and Set Parameters for Sectioning 
Several basic cleaning processes are implemented: removing the file extension from titles, and removing encoding and newline characters from texts. Parameters are then set for the part(s) of each text to be included in sectioning. In the SciFi Corpus project, "START OF BOOK" and "END OF BOOK" tags were added to delineate the body of each text. Text outside these tags (for example, the title page, copyright page, and other paratext) is meant to be excluded from sectioning. 

In [None]:
books_cleaned = books.copy()

In [None]:
#Remove the .txt extension from titles (escape the dot so it matches a literal period)
books_cleaned['Title'] = books_cleaned['Title'].str.replace(r'\.txt$', '', regex=True) 
books_cleaned.head()

In [None]:
#Decode bytes to text; 'utf-8-sig' also strips a leading byte order mark (b'\xef\xbb\xbf')
books_cleaned['Text'] = books_cleaned['Text'].apply(lambda x: x.decode('utf-8-sig', errors="ignore"))

#Collapse newlines and other runs of whitespace into single spaces
books_cleaned['Text'] = books_cleaned['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
books_cleaned

In [None]:
#Check that text is cleaned and sectioned
books_cleaned.iloc[0]['Text']
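
The cells above do not show the trimming step itself. A minimal sketch of one way to do it follows, assuming each text contains the literal tags "START OF BOOK" and "END OF BOOK". The helper name `trim_to_body` and the choice to leave untagged texts unchanged are assumptions made for illustration.

```python
START_TAG = "START OF BOOK"  # assumed to match the tag text exactly
END_TAG = "END OF BOOK"

def trim_to_body(text, start_tag=START_TAG, end_tag=END_TAG):
    """Keep only the text between the first start tag and the last end tag."""
    start = text.find(start_tag)
    end = text.rfind(end_tag)
    # If either tag is missing, or they are out of order, return the text unchanged
    if start == -1 or end == -1 or end < start:
        return text
    return text[start + len(start_tag):end].strip()

demo = "Title page START OF BOOK It was a dark night. END OF BOOK Copyright notice"
print(trim_to_body(demo))  # It was a dark night.
```

To apply this to the notebook's data, something like `books_cleaned['Text'] = books_cleaned['Text'].apply(trim_to_body)` should work after the decoding step, provided the tags appear in the files exactly as written here.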

## Section Texts By Chunks of N Length
When texts have no discernible chapter headings, or when chapter headings are too infrequent to split texts into meaningful segments, texts can instead be sectioned into chunks of N tokens, where N is a variable that can be custom-set below. After checking the word count of each text to determine an appropriate chunk size, this code tokenizes each text and splits the token list every N tokens. The text from each chunk is then appended to a new dataframe and labeled by book and chunk number.

In [None]:
#Get number of words in each book (helps to determine chunk length)
words = books_cleaned["Text"].apply(lambda x: len(str(x).split(' ')))

#Append word counts to dataframe
books_cleaned["Word Count"] = words
books_cleaned

In [None]:
#Tokenize Text
books_cleaned['Text'] = books_cleaned['Text'].astype(str)
books_cleaned['Tokens'] = books_cleaned.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
books_cleaned

In [None]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 500

#Create new list for chunked sentences
chunked_sentences = []

#Perform chunking function on each row of tokens
s = books_cleaned['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Check that text is being chunked correctly
  print(chunks[0])
  #Add to new list
  chunked_sentences.append(chunks)
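
One property of this approach is worth checking before analysis: the last chunk of a text is usually shorter than N. A small sketch with a toy token list (the seven-word list and the chunk size of 3 are illustrative only) shows the behavior:

```python
def split(list_a, chunk_size):
    # Same generator as above: yield consecutive slices of chunk_size items
    for i in range(0, len(list_a), chunk_size):
        yield list_a[i:i + chunk_size]

tokens = "one two three four five six seven".split()
chunks = list(split(tokens, 3))

# The final chunk holds the remainder, so it can be shorter than chunk_size
print([len(c) for c in chunks])  # [3, 3, 1]
```

Depending on the analysis, very short final chunks may need to be dropped or merged with the preceding chunk.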


In [None]:
#Create dictionary to associate chunks with titles
keys = books_cleaned['Title']
values = chunked_sentences

res = dict(zip(keys, values))

In [None]:
#Add chunks to new dataframe
chunked_df = pd.DataFrame.from_dict(res, orient='index')
chunked_df.head()

In [None]:
#Stack chunk columns into rows, drop empty cells left by shorter books, and reset the index
chunked_df = chunked_df.stack().dropna().reset_index()
chunked_df.columns = ["Title","Chunk","Text"]
chunked_df
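
The reshaping in the last two cells can be hard to picture. The sketch below runs the same steps on a toy dictionary (the titles and tokens are made up for illustration). `from_dict(..., orient='index')` produces one row per book and one column per chunk position, padding shorter books with empty cells. `stack()` then turns each chunk into its own row, and `dropna()` is included as a precaution because some pandas versions keep the empty cells when stacking.

```python
import pandas as pd

# Toy stand-in for `res`: title -> list of token chunks
toy = {
    'Book A': [['a', 'b'], ['c', 'd'], ['e']],
    'Book B': [['f', 'g']],
}

# Wide layout: one row per book, one column per chunk position
wide = pd.DataFrame.from_dict(toy, orient='index')

# Long layout: one row per (book, chunk) pair
long_df = wide.stack().dropna().reset_index()
long_df.columns = ["Title", "Chunk", "Text"]
print(long_df)
```

Here the long dataframe has four rows: three chunks for Book A and one for Book B.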

In [None]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_df['Book + Chunk'] = chunked_df['Title'].astype(str) + ' Chunk ' + chunked_df['Chunk'].astype(str)

#Remove individual book and chunk columns (the result must be assigned to take effect)
chunked_df = chunked_df.drop(columns=['Title', 'Chunk'])

#Detokenize text
detokenizer = TreebankWordDetokenizer()
chunked_df['Text'] = chunked_df['Text'].apply(detokenizer.detokenize)

#Reindex so book + chunk is first column 
column_names = ["Book + Chunk", "Text"]
chunked_df = chunked_df.reindex(columns=column_names)

#Print cleaned df
chunked_df

## Download CSV and Txt Output of Aggregated and Disaggregated Texts 

At this point, you have a dataframe of segmented texts that is ready for further analysis. It can be downloaded as a csv file, along with the dataframe containing the full texts. Depending on the nature of your texts and future analysis, it may be necessary to disaggregate the data before download. Some analyses, like topic modeling, work well with "bag of words" data, and copyrighted texts cannot be shared in their original forms. Disaggregation, or the breakdown of data into smaller (disordered) parts, is accomplished by alphabetizing the words in each chapter or chunk. Below, the aggregated texts are saved first; the texts are then disaggregated, and the resulting dataframes can be downloaded as csv and txt files. 


In [None]:
#Change working directory to where output will be stored
#(os.chdir returns None, so its result is not assigned)
os.chdir("PATHNAME")

In [None]:
#Download CSVs of aggregated texts

#Download CSV with aggregated full texts 
books_agg = books_cleaned[['Title', 'Text']]
books_agg.to_csv('full_texts_agg_output.csv', encoding = 'utf-8-sig')

#Download CSV with aggregated chunks
chunked_df.to_csv('chunks_agg_output.csv', encoding = 'utf-8-sig') 

In [None]:
#Disaggregate data in each dataframe

#Alphabetize words in each full text (working on a copy keeps books_agg intact)
books_bow = books_agg.copy()
books_bow['Text'] = books_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))

#Alphabetize words in each chunk (this overwrites the aggregated text in chunked_df)
chunked_df['Text'] = chunked_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))

In [None]:
#Download CSVs of disaggregated texts

#Download CSV with disaggregated full texts 
books_bow.to_csv('full_texts_bow_output.csv', encoding = 'utf-8-sig')

#Download disaggregated chunks to csv
chunked_df.to_csv('chunks_bow_output.csv', encoding = 'utf-8-sig') 

In [None]:
# Specify the directory where you want to save the disaggregated TXT files
output_directory = 'output_folder'

# Create the directory if it does not already exist
os.makedirs(output_directory, exist_ok=True)

# Iterate through DataFrame rows and save each text as a TXT file
for index, row in books_bow.iterrows():
    text = row['Text']
    # Use a 'book_' prefix so these files are not overwritten by the chunk files
    file_name = f'book_{index + 1}.txt'
    file_path = os.path.join(output_directory, file_name)
    
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(text)

    print(f"File '{file_name}' saved.")

print("All files saved.")


In [None]:
# Specify the directory where you want to save the disaggregated chunked TXT files
output_directory = 'output_folder'

# Create the directory if it does not already exist
os.makedirs(output_directory, exist_ok=True)

# Iterate through DataFrame rows and save each chunk as a TXT file
for index, row in chunked_df.iterrows():
    text = row['Text']
    # Use a 'chunk_' prefix so these files do not overwrite the full-text files
    file_name = f'chunk_{index + 1}.txt'
    file_path = os.path.join(output_directory, file_name)
    
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(text)

    print(f"File '{file_name}' saved.")

print("All files saved.")