# File splitter notebook
This notebook takes a large text file made up of predictable sub-units (in this case, short stories separated by five blank lines) and splits it into one file per sub-unit. Each file is named after a cleaned-up version of the first line following the separator, which in this case is the story's title.

## Install and import libraries
First, we install the `anyascii` library, which converts non-ASCII characters to their closest ASCII equivalents. Then, we import it along with `re` (regular expressions, a fancy find-and-replace syntax) and `os` for navigating file paths.

In [None]:
import sys
!{sys.executable} -m pip install anyascii

In [1]:
import re
import os
from anyascii import anyascii

## Set up files & directories
First, we define the directory where we'll be putting the split files. Then, we make that directory the working directory, so that files written without a full path land there. Finally, we define the full path to the file that we're splitting.
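As a minimal sketch of the same setup, the source path can be built from the directory with `os.path.join` and checked before any splitting starts. The helper name `build_source_path`, the temporary directory, and the file name `example.txt` are all hypothetical stand-ins so that the sketch runs anywhere; they are not part of this notebook.

```python
import os
import tempfile


def build_source_path(directory, filename):
    """Join a directory and a file name into one full path."""
    return os.path.join(directory, filename)


# Hypothetical stand-in for the real output directory
demo_directory = tempfile.mkdtemp()
demo_source = build_source_path(demo_directory, 'example.txt')

# Create a tiny source file so the existence check has something to find
with open(demo_source, 'w', encoding='utf-8') as fh:
    fh.write('Title\nBody\n')

# Confirm the source file is where we expect before trying to split it
source_exists = os.path.isfile(demo_source)
print(demo_source, source_exists)
```

Building the path this way means the directory only has to be typed once, which may help if the notebook is reused for a different file.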

In [2]:
directory = '/Users/qad/Documents/ethan'
os.chdir(directory)
sourcefile = '/Users/qad/Documents/ethan/Cuentos completos - Emilia Pardo Bazan.txt'

## Split the file
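The following cell does the actual splitting. It breaks the text apart wherever five newline characters (`\n`) appear in a row, then takes the first line of each chunk as its title. It cleans up the title (truncating it to 30 characters, lowercasing it, converting it to ASCII, removing `<` and `>`, and replacing spaces with hyphens), then writes the rest of the chunk (in this case, the short story) to a file named after that title. Because each file is opened in write mode, chunks whose cleaned titles are identical overwrite one another.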
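To see the split on a small scale, here is a minimal sketch that runs on an in-memory string instead of a file. The sample text and the helper names `split_chunks`, `title_and_content`, and `slugify` are hypothetical, and the `anyascii` step is left out so that the sketch needs only the standard library.

```python
import re

# Hypothetical sample: two "stories" separated by five newline characters
sample_text = (
    "First Story\nOnce upon a time.\nThe end."
    + "\n" * 5
    + "Second Story\nAnother tale."
)


def split_chunks(text):
    """Split text wherever five newline characters appear in a row."""
    return re.split(r'\n{5}', text)


def title_and_content(chunk):
    """Return (first line, remaining lines) of a chunk, ignoring outer whitespace."""
    lines = chunk.strip().split('\n')
    return lines[0], '\n'.join(lines[1:])


def slugify(title):
    """Truncate, lowercase, and replace characters (no ASCII conversion here)."""
    title = title[:30].lower()
    return title.replace('<', '').replace('>', '').replace(' ', '-')


results = [
    (slugify(title), content)
    for title, content in map(title_and_content, split_chunks(sample_text))
]
print(results)
```

Each tuple pairs the file name stem with the text that would be written to that file.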

In [3]:
# Opens the source file defined above (assumed to be UTF-8 encoded)
with open(sourcefile, 'r', encoding='utf-8') as source:
    # Reads the whole source file into memory
    text = source.read()

# Splits the text on 5 consecutive newlines, then does the following for each of those chunks:
for chapter in re.split(r'\n{5}', text):
    # Breaks the chunk into lines, ignoring leading and trailing whitespace
    lines = chapter.strip().split('\n')
    # Takes the title (the first line following the separator)
    title = lines[0]
    # Shortens the title to 30 characters
    title = title[:30]
    # Makes the title all lowercase
    title = title.lower()
    # Turns the title into only ASCII characters
    title = anyascii(title)
    # Removes less than ideal characters that are still part of the title
    title = title.replace('<', '').replace('>', '')
    # Replaces spaces with hyphens
    title = title.replace(' ', '-')
    # Prints the transformed title
    print(title)
    # Defines 'content' as the non-title portion of the chunk
    content = '\n'.join(lines[1:])
    # Creates a new file named for the title (an existing file with that name is overwritten)
    with open(f"{title}.txt", 'w', encoding='utf-8') as _fh:
        # Writes the content to the file
        _fh.write(content)

en-una-ocasion,-leopoldo-alas-
emilia-pardo-bazan
cuentos-completos
epub-r1.0
titulo-original:-cuentos-compl
prologo-de-cuentos
introduccion-de-las-obras-com
las-recopilaciones
la-dama-joven-y-otros-cuentos
nieto-del-cid
el-indulto
fuego-a-bordo
el-rizo-del-nazareno
la-borgonona
i
ii
primer-amor
un-diplomatico
sic-transit...
el-premio-gordo
una-pasion
el-principe-amado
ii
iii
la-gallega
cuentos-escogidos
travesura-pontificia
planta-montes
crimen-libre
temprano-y-con-sol...
cuentos-de-marineda
por-el-arte
morrion-y-boina
las-tapias-del-campo-santo
el-senor-doctoral
en-el-nombre-del-padre...
el-mechon-blanco
?cobardia?
cuentos-nuevos
cuentos-de-navidad
ii
iii
iv
v
vi
vii
viii
las-dos-vengadoras
la-mariposa-de-pedreria
el-ruido
remordimiento
agravante
la-hierba-milagrosa
publicada-esta-carta-con-el-cu
sobremesa
evocacion
confidencia
pina
la-calavera
cuatro-socialistas
el-tesoro
la-paloma-negra
sedano
el-milagro-del-hermanuco
madre
cuento-primitivo
la-cena-de-cristo
apostasia
santiago-el-m