# Loading Textual Data in Python

This workbook records methods for efficiently loading, storing, and restoring textual data in Python.

Contents of this workbook:

* Writing Text to Files
* Reading Text from a single file
* Reading Text from multiple files

### Writing multiple files to a folder

To write a larger body of text to multiple files automatically, we can use a loop and the file.write(text) method.

In [9]:
# Writing multiple files to a folder
text = 'This is a test text'
for i in range(10):
    # The with block closes the file even if the write fails; ./data must already exist
    with open('./data/file_0{}.txt'.format(i), 'w') as f:
        f.write(text)

### Reading text from a single file

In [2]:
# Loading the contents of a single file
with open('./data/test.txt', 'r') as f:
    text = f.read()

### Loading multiple files from a given folder

In [10]:
import os
for filename in os.listdir('./data'):
    print(filename)

file_00.txt
file_01.txt
file_02.txt
file_03.txt
file_04.txt
file_05.txt
file_06.txt
file_07.txt
file_08.txt
file_09.txt
test_1.txt


In [15]:
# Loading multiple files and storing their results into a list
text = []
for filename in os.listdir('./data'):
    with open('./data/{}'.format(filename), 'r') as f:
        text.append(f.read())

In [31]:
text

['This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a testfile\n']
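
os.listdir returns entries in arbitrary order and includes everything in the folder, so the loop above would also try to read subfolders or binary files stored next to the text files. The following is a minimal sketch of a more defensive variant; the helper name read_text_files, the '.txt' suffix filter, and the UTF-8 encoding are assumptions made for illustration, and the demo runs in a temporary folder rather than ./data.

```python
import os
import tempfile

def read_text_files(folder, suffix='.txt'):
    """Return the contents of all files in `folder` ending in `suffix`, sorted by name."""
    texts = []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        # Skip subfolders and anything that does not look like a text file
        if not filename.endswith(suffix) or not os.path.isfile(path):
            continue
        with open(path, 'r', encoding='utf-8') as f:
            texts.append(f.read())
    return texts

# Demo in a throwaway folder so that ./data is left untouched
with tempfile.TemporaryDirectory() as demo_folder:
    for name, content in [('b.txt', 'second'), ('a.txt', 'first'), ('notes.bin', 'ignored')]:
        with open(os.path.join(demo_folder, name), 'w', encoding='utf-8') as f:
            f.write(content)
    demo_texts = read_text_files(demo_folder)

print(demo_texts)
```

Sorting the file names makes the order of the resulting list reproducible across runs and operating systems.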

### Using a Generator

With large datasets, it can be better to read the files one at a time using a generator that iterates over all files in the folder, so that only one document is held in memory at once.

In [27]:
# Lazily loading multiple files, yielding one document at a time
def text_reader(folder):
    for f in os.listdir(folder):
        with open('{}/{}'.format(folder, f), 'r') as file:
            # read() returns the whole file; readline() would return only its first line
            yield file.read()

In [28]:
# Using the generator, we can now read the data document by document, whenever we need it
a = text_reader('./data')

In [30]:
next(a)

'This is a test text'
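
A generator can also be consumed with a for loop or a comprehension, and next accepts a default value that is returned once the generator is exhausted. The sketch below illustrates this under stated assumptions: iter_documents is a hypothetical stand-in for the reader defined above, and the files live in a temporary folder created only for the demo.

```python
import os
import tempfile

def iter_documents(folder):
    """Yield the full text of each file in `folder`, one file at a time."""
    for filename in sorted(os.listdir(folder)):
        with open(os.path.join(folder, filename), 'r', encoding='utf-8') as f:
            yield f.read()

with tempfile.TemporaryDirectory() as demo_folder:
    # Create three small demo documents
    for i in range(3):
        with open(os.path.join(demo_folder, 'doc_{}.txt'.format(i)), 'w', encoding='utf-8') as f:
            f.write('document {}'.format(i))

    docs = iter_documents(demo_folder)
    first = next(docs)                              # pull a single document
    remaining_lengths = [len(doc) for doc in docs]  # consume the rest lazily
    exhausted = next(docs, None)                    # default instead of StopIteration

print(first, remaining_lengths, exhausted)
```

Because each file is opened only when its document is requested, the generator has to be consumed while the folder still exists.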

### Storing Python objects to disk

Once we have created a preprocessed text object that we want to return to later, we can pickle it, that is, serialize it, and save it to disk.

Pickle serializes the object as a binary stream, so the target file has to be opened in binary write mode ('wb'). In text mode ('w'), pickle.dump fails with a TypeError saying that write() expects str, not bytes.
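
The difference between the two modes can be reproduced with a short sketch. The variable names and the temporary folder are assumptions for illustration only; the exact wording of the error message may vary between Python versions, so it is captured rather than hard-coded.

```python
import os
import pickle
import tempfile

corpus = ['This is a test text']

with tempfile.TemporaryDirectory() as demo_folder:
    path = os.path.join(demo_folder, 'corpus.pkl')

    # Text mode: pickle produces bytes, which a text file handle rejects
    try:
        with open(path, 'w') as f:
            pickle.dump(corpus, f)
    except TypeError as err:
        text_mode_error = str(err)

    # Binary mode: the dump succeeds
    with open(path, 'wb') as f:
        pickle.dump(corpus, f)
    size_in_bytes = os.path.getsize(path)

print(text_mode_error)
print(size_in_bytes)
```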

In [44]:
import pickle

In [45]:
with open('./data/text_corpus.pkl', 'wb') as f:
    pickle.dump(text, f)

In [46]:
# Let's check that our pickle file exists
os.listdir('./data/')

['file_00.txt',
 'file_01.txt',
 'file_02.txt',
 'file_03.txt',
 'file_04.txt',
 'file_05.txt',
 'file_06.txt',
 'file_07.txt',
 'file_08.txt',
 'file_09.txt',
 'test_1.txt',
 'text_corpus.pkl']

In [47]:
# To reverse this step, we load the object back from the pickle file
with open('./data/text_corpus.pkl', 'rb') as f:
    b = pickle.load(f)

In [48]:
b

['This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a test text',
 'This is a testfile\n']
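
As a quick sanity check, the round trip can be verified by comparing the restored object with the original. This is a minimal sketch with an assumed two-item corpus and a temporary folder; note that pickle.load can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

corpus = ['This is a test text', 'This is a testfile\n']

with tempfile.TemporaryDirectory() as demo_folder:
    path = os.path.join(demo_folder, 'text_corpus.pkl')

    # Serialize the list to disk in binary mode
    with open(path, 'wb') as f:
        pickle.dump(corpus, f)

    # Restore it; the with block closes the file handle afterwards
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored list is an equal but independent copy of the original
print(restored == corpus, restored is corpus)
```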