Sample code for preprocessing the text files.

In [1]:
import os
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize

Preliminary setup. The nltk tokenizer is included for other downstream tasks; BERTopic itself makes use of the scikit-learn tokenizer.

In [2]:
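def load_text(file_path):
    # Read every file in file_path into one row: title (file name without extension) and text.
    rows = []
    for f in sorted(os.listdir(file_path)):
        path = os.path.join(file_path, f)
        # Close each file after reading instead of leaving the handle open.
        with open(path, 'r') as fh:
            text = fh.read().strip()
        # f[:-4] drops a four-character extension such as '.txt'.
        rows.append({'title': f[:-4], 'text': text})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
    return pd.DataFrame(rows, columns=['title', 'text'])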

Simple chunking function

In [4]:
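def chunking_text(chunk_size, text_list):
    # Returns one list of chunks per input text; each chunk holds at most chunk_size tokens.
    # Note: word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand.
    chunk_list = []
    for text in text_list:
        # Strip punctuation before tokenizing so that chunks contain words only.
        tokenized = word_tokenize(re.sub(r'[^\w\s]', '', text))
        chunked = []
        for i in range(0, len(tokenized), chunk_size):
            # The final chunk of a text may be shorter than chunk_size.
            chunked.append(' '.join(tokenized[i:i+chunk_size]))
        chunk_list.append(chunked)
    return chunk_list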
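
To see what the chunking does without needing the NLTK data, here is a minimal sketch that swaps `word_tokenize` for a plain whitespace split. The function name `chunk_words` and the sample sentence are made up for illustration and are not part of the pipeline.

```python
import re

def chunk_words(text, chunk_size):
    """Split text into chunks of at most chunk_size words."""
    # Drop punctuation, then split on whitespace (a stand-in for word_tokenize).
    words = re.sub(r'[^\w\s]', '', text).split()
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Hypothetical sample sentence: 11 words once the punctuation is removed.
sample = "Der var engang en prins, han ville have sig en prinsesse."
chunks = chunk_words(sample, 4)
print(chunks)
```

With a chunk size of 4, the 11 words end up in three chunks, the last one holding only the three leftover words.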

Load example dataset, here the tales.

In [None]:
tale_danish_df = load_text('fairytale_datasets/tales')

Clean the text

In [6]:
tale_danish_df['text'] = tale_danish_df['text'].replace(regex=r'\s+', value=' ').replace(regex='[»«]', value='')
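
The two cleaning steps can be checked on a single string. This is a sketch using plain `re.sub` rather than the pandas `replace` call above; `clean_text` and the sample string are hypothetical.

```python
import re

def clean_text(text):
    # Step 1: collapse every run of whitespace (spaces, tabs, newlines) to one space.
    text = re.sub(r'\s+', ' ', text)
    # Step 2: remove the guillemets used as quotation marks.
    return re.sub(r'[»«]', '', text)

# Hypothetical sample with a double space, a line break and guillemets.
raw = "»Hvor  er\nhun?«  sagde han."
cleaned = clean_text(raw)
print(cleaned)
```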

Chunking

In [11]:
tale_danish_df['text'] = chunking_text(50, tale_danish_df['text'])
tale_danish_df = tale_danish_df.explode('text', ignore_index=True)
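
A small sketch of what `explode` does to a column of lists, using toy data (the titles and chunk strings are invented): each chunk gets its own row, the title is repeated, and `ignore_index=True` renumbers the rows from 0.

```python
import pandas as pd

# Toy frame: one row per tale, with a list of chunks in the 'text' column.
toy = pd.DataFrame({
    'title': ['tale_a', 'tale_b'],
    'text': [['chunk 1', 'chunk 2', 'chunk 3'], ['chunk 1']],
})

# One row per chunk; the title is copied onto every row it belongs to.
exploded = toy.explode('text', ignore_index=True)
print(exploded)
```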

Save as a CSV file for future work. Repeat this for each subcorpus: the travel writing as a whole, each of the four travel collections, and (as done here) the tales.

In [13]:
tale_danish_df.to_csv("fairytale_datasets/tale_danish.csv")

Create the various time-based bins

- Group 1: 1835 - 1841 (after publication of Rambles in the Harz Mountains in 1831)
- Group 2: 1842 - 1850 (after publication of A Poet's Bazaar in 1842)
- Group 3: 1851 - 1862 (after publication of In Sweden in 1851)
- Group 4: 1863 - 1873 (after publication of In Spain in 1863 and A Visit to Portugal in 1868)

In [15]:
group1_titles = tale_dates_dan['title'][tale_dates_dan['date'] <= 1841]
group2_titles = tale_dates_dan['title'][(tale_dates_dan['date'] >= 1842) & (tale_dates_dan['date'] <= 1850)]
group3_titles = tale_dates_dan['title'][(tale_dates_dan['date'] >= 1851) & (tale_dates_dan['date'] <= 1862)]
group4_titles = tale_dates_dan['title'][tale_dates_dan['date'] >= 1863]
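
As an alternative sketch, the same four bins could be assigned in one pass with `pd.cut`. The frame `toy_dates` and its titles are invented for illustration. Unlike the open-ended comparisons above, this version leaves any date outside 1835 - 1873 unassigned (NaN).

```python
import pandas as pd

# Toy stand-in for the tale dates: one title per period.
toy_dates = pd.DataFrame({
    'title': ['tale_a', 'tale_b', 'tale_c', 'tale_d'],
    'date': [1835, 1842, 1851, 1872],
})

# Right-inclusive edges: (1834, 1841], (1841, 1850], (1850, 1862], (1862, 1873].
edges = [1834, 1841, 1850, 1862, 1873]
labels = ['group1', 'group2', 'group3', 'group4']
toy_dates['group'] = pd.cut(toy_dates['date'], bins=edges, labels=labels)

# Map each title to its group label for a quick check.
group_of = dict(zip(toy_dates['title'], toy_dates['group'].astype(str)))
print(group_of)
```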

Remember when creating the travel_tale corpus to manually delete the duplicate tales, e.g. Metalsvinet and En Rose fra Homers Grav.

In [42]:
group1_skygge = travel_tale_danish_df[(travel_tale_danish_df['book'].isin(group1_titles)) | 
                                      (travel_tale_danish_df['book'] == 'Skygge')]
group2_bazar = travel_tale_danish_df[(travel_tale_danish_df['book'].isin(group2_titles)) | 
                                     (travel_tale_danish_df['book'] == 'Bazar')]
group3_sverrig = travel_tale_danish_df[(travel_tale_danish_df['book'].isin(group3_titles)) | 
                                       (travel_tale_danish_df['book'] == 'Sverrig')]
group4_spanien = travel_tale_danish_df[(travel_tale_danish_df['book'].isin(group4_titles)) | 
                                      (travel_tale_danish_df['book'] == 'Spanien')]
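
The four selections above share one pattern, which could be wrapped in a small helper. This is only a sketch on toy data; `select_group`, `toy_corpus` and `toy_group1_titles` are hypothetical names.

```python
import pandas as pd

def select_group(df, tale_titles, travel_book):
    """Keep rows whose 'book' is one of tale_titles or equals travel_book."""
    mask = df['book'].isin(tale_titles) | (df['book'] == travel_book)
    return df[mask]

# Toy combined corpus: two tales and two travel books.
toy_corpus = pd.DataFrame({
    'book': ['tale_a', 'tale_b', 'Skygge', 'Bazar'],
    'text': ['chunk a', 'chunk b', 'chunk s', 'chunk z'],
})
toy_group1_titles = pd.Series(['tale_a'])

toy_group1 = select_group(toy_corpus, toy_group1_titles, 'Skygge')
print(toy_group1)
```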

Save the binned groups to csv. 

In [47]:
group1_skygge.to_csv("both_datasets/group1_skygge.csv")
group2_bazar.to_csv("both_datasets/group2_bazar.csv")
group3_sverrig.to_csv("both_datasets/group3_sverrig.csv")
group4_spanien.to_csv("both_datasets/group4_spanien.csv")