### First Steps

Natural language processing (NLP) methods require specific input formats in order to run their algorithms and produce meaningful output. It is therefore important to clean your input data, especially text, before applying any NLP method. In this case, we want to transform the text of the folktales, which we have stored in several S3 buckets. Amazon S3 stores every object in a bucket as raw bytes. As a result, the text still contains control characters such as newlines (`\n`) and carriage returns (`\r`), which need to be removed before we continue with Comprehend.
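
As a small illustration of the issue that needs no AWS access, the sketch below decodes a byte string into text and strips the line-break characters. The sample bytes and the helper name `clean_text` are made up for this example; they only mirror the replacements used in the cleaning loop of this notebook.

```python
def clean_text(raw: bytes) -> str:
    """Decode UTF-8 bytes and flatten line breaks."""
    text = raw.decode("utf-8")
    # Newlines become spaces so words on adjacent lines stay separated.
    text = text.replace("\n", " ")
    # Carriage returns (from Windows-style line endings) are dropped.
    text = text.replace("\r", "")
    return text


# Hypothetical sample standing in for the body of an S3 object.
sample = b"Once upon a time,\r\nthere lived a king.\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Printing with `repr` makes any remaining control characters visible, which is a convenient way to check that the cleaning step did what was intended.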

In [47]:
import boto3

s3 = boto3.client('s3')

buckets = ['arabian-folktales', 'chinese-folktales', 'english-folktales', 'german-folktales', 'indian-folktales', 'russian-folktales']

The following loop cleans every folktale text file and, for each source bucket, creates a new S3 bucket to hold the cleaned files. Because S3 stores every object as bytes, we must decode each file as UTF-8 to obtain a string that we can manipulate.

In [45]:
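for currBucket in buckets:
    # Each source bucket gets a companion "<name>-transformed" bucket.
    newBucket = currBucket + "-transformed"
    # Note: outside us-east-1, create_bucket also needs a CreateBucketConfiguration.
    s3.create_bucket(Bucket=newBucket)
    # 'Contents' is missing when the bucket is empty, so default to an empty list.
    for item in s3.list_objects_v2(Bucket=currBucket).get('Contents', []):
        name = item['Key']
        obj = s3.get_object(Bucket=currBucket, Key=name)
        # S3 returns the object body as bytes; decode it into a Python string.
        text = obj['Body'].read().decode('utf-8')
        # Flatten line breaks: newlines become spaces, carriage returns are dropped.
        text = text.replace('\n', " ")
        text = text.replace('\r', "")
        # Encode back to bytes explicitly before uploading the cleaned text.
        s3.put_object(Bucket=newBucket, Body=text.encode('utf-8'), Key=name)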
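
One caveat with the loop above: a single `list_objects_v2` call returns at most 1,000 keys, so larger buckets need pagination. The sketch below shows one possible way to handle this with a boto3 paginator. The helper name `iter_keys` is hypothetical, and the client is passed in as an argument so that the function can be tried out without AWS credentials.

```python
def iter_keys(s3_client, bucket):
    """Yield every object key in `bucket`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent from a page when it holds no objects.
        for item in page.get("Contents", []):
            yield item["Key"]
```

With this helper, the inner loop could iterate with `for name in iter_keys(s3, currBucket):` instead of reading `Contents` from a single response.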