# Download the text files from Amazon using the API

In [1]:
import requests
from pathlib import Path
import ast

In [94]:
r = requests.get("https://s3.us-east-1.amazonaws.com/data.labs.loc.gov/digitized-books/README.md", headers={"Content-Type": "application/json"})

In [88]:
r.headers

{'x-amz-id-2': 'PtiOvZi+FMUwv5gf8Vmctemte6eqp38X5l5SgD/hNQzG6RoV8CkhTsCtIvpkGUD+GIbf8K5TC2E=', 'x-amz-request-id': 'Z87X47Z4M60QGS7J', 'Date': 'Wed, 28 Sep 2022 23:39:01 GMT', 'Last-Modified': 'Wed, 28 Sep 2022 15:00:13 GMT', 'ETag': '"7fa6a0457fbd5555c9373a9ba9ac2c63"', 'x-amz-server-side-encryption': 'AES256', 'x-amz-version-id': 'L84OOMy7zJoOq7xFRZNj7oTckKno32Ic', 'Accept-Ranges': 'bytes', 'Content-Type': 'binary/octet-stream', 'Server': 'AmazonS3', 'Content-Length': '10690'}

In [97]:
Path('README.md').write_text(r.text)

10685

In [86]:
r.status_code

200

In [99]:
r = requests.get("https://s3.us-east-1.amazonaws.com/data.labs.loc.gov/digitized-books/manifest.txt")
Path('manifest.txt').write_text(r.text)

16679095

The `manifest.txt` file is tab-separated, so it could be opened in pandas as a CSV, or just read line by line. The columns are id, hash, and bucket/path. It's about 15 MB.
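
Reading the manifest line by line could look something like the sketch below. It uses the standard-library `csv` module and assumes the three tab-separated columns described above, with no header row. The function names `read_manifest` and `text_entries` are hypothetical and only for illustration.

```python
import csv
from pathlib import Path


def read_manifest(manifest_path):
    """Yield (id, hash, path) tuples from a tab-separated manifest file."""
    with Path(manifest_path).open("r", newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank or malformed rows rather than failing part-way through.
            if len(row) != 3:
                continue
            yield tuple(row)


def text_entries(manifest_path):
    """Return only the rows whose path ends in .txt."""
    return [row for row in read_manifest(manifest_path) if row[2].endswith(".txt")]
```

Calling `text_entries('manifest.txt')` would give a list of rows to loop over, and `len()` of that list gives the number of text files to expect.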

In [100]:
r = requests.get("https://s3.us-east-1.amazonaws.com/data.labs.loc.gov/digitized-books/metadata.csv")
Path('metadata.csv').write_text(r.text)

291388362

The `metadata.csv` file is about 270 MB. Presumably the JSON version is even bigger. I could also work with the individual JSON files.
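
For a file this size, it may be worth streaming the download to disk in chunks instead of holding the whole response in memory. The sketch below is one way to do that. Note that it swaps in the standard library (`urllib.request` and `shutil`) rather than `requests`, and it writes the raw bytes without decoding them as text. The function name `download_to_file` and the 1 MB chunk size are assumptions for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_file(url, destination, chunk_size=1024 * 1024):
    """Stream a URL to disk in chunks and return the number of bytes saved."""
    destination = Path(destination)
    with urllib.request.urlopen(url, timeout=60) as response:
        with destination.open("wb") as out_file:
            # copyfileobj reads and writes chunk_size bytes at a time,
            # so memory use stays small however large the file is.
            shutil.copyfileobj(response, out_file, chunk_size)
    return destination.stat().st_size
```

Because the bytes are saved exactly as received, the size on disk should match the `Content-Length` header, which is not the case when the decoded `r.text` is written back out.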

Let's try downloading all the texts from the `manifest.txt` file.

OK, first problem. Some of the text files seem to have been saved as the string form of a Python bytes object (`b'...'`), so saving them as plain text results in files with no line breaks, broken characters, and so on.

~~I'm trying this solution: `print('c cac \n\\n1 \\n\\n\\n\\n^SM \\n\\n\\n\\nCCCCCc \\n'.encode('utf-8').decode('unicode_escape'))` from: https://stackoverflow.com/questions/1885181/how-to-un-escape-a-backslash-escaped-string
But there's a warning about non-ASCII characters, so will this work with non-English language books?~~

Trying `ast.literal_eval` instead (https://stackoverflow.com/a/1885211): this evaluates the string as a bytes literal, which can then be decoded back to a string.
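
A small sketch of the difference between the two approaches, using a made-up sample string containing a non-ASCII character. The helper name `unwrap_bytes_repr` is hypothetical, and the sketch assumes the underlying bytes are UTF-8.

```python
import ast


def unwrap_bytes_repr(text, encoding="utf-8"):
    """Turn the text "b'...'" back into the string it represents."""
    if not text.startswith(("b'", 'b"')):
        # Already plain text, nothing to unwrap.
        return text
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal")
    return value.decode(encoding)


# Made-up sample: the text of a bytes object holding 'café' and a newline.
sample = "b'caf\\xc3\\xa9\\nline two'"
print(unwrap_bytes_repr(sample))

# unicode_escape reads each \xNN as a separate character, so the two
# bytes of the UTF-8 'é' come out as 'Ã©' instead.
print("caf\\xc3\\xa9".encode("utf-8").decode("unicode_escape"))
```

This is why `ast.literal_eval` looks like the safer choice for non-English books: the escaped bytes are reassembled first and only then decoded as UTF-8.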

In [4]:
output_path = Path('/media/tim/workingData/loc/')
count = 0
with Path('manifest.txt').open('r') as manifest:
    for line in manifest:
        # Only the plain-text version of each book is wanted
        if '.txt' in line:
            count += 1
            # Columns are id, hash, and bucket/path
            book_id, file_hash, path = line.split()
            output_file = Path(output_path, f'{book_id}.txt')
            # Skip files saved on a previous run, so the loop can be restarted
            if not output_file.exists():
                path = path.replace('s3://', '')
                r = requests.get(f"https://s3.us-east-1.amazonaws.com/{path}", timeout=60)
                r.raise_for_status()
                text = r.text
                # Some files hold the text of a bytes object (b'...') rather than plain text
                if text.startswith("b'"):
                    try:
                        text = ast.literal_eval(text).decode()
                    except (SyntaxError, ValueError):
                        # literal_eval or decoding failed, so fall back to un-escaping the string directly
                        text = text.encode('utf-8').decode('unicode_escape')
                output_file.write_text(text, encoding='utf-8')
print(count)

83135


So the number of manifest entries with a `.txt` extension matches the number of files I've downloaded. It took about 5 days, with multiple interruptions.
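
One way to double-check that match is to compare the ids in the manifest against the files on disk. This is a minimal sketch, assuming the same whitespace-separated manifest layout and `{id}.txt` naming used in the download loop. The function name `missing_text_ids` is hypothetical.

```python
from pathlib import Path


def missing_text_ids(manifest_path, output_dir):
    """Return manifest ids that have a .txt path but no downloaded file."""
    # File stems (names without the .txt suffix) already on disk.
    downloaded = {p.stem for p in Path(output_dir).glob("*.txt")}
    missing = []
    with Path(manifest_path).open("r", encoding="utf-8") as manifest:
        for line in manifest:
            parts = line.split()
            if len(parts) == 3 and parts[2].endswith(".txt"):
                if parts[0] not in downloaded:
                    missing.append(parts[0])
    return missing
```

An empty list from `missing_text_ids('manifest.txt', output_path)` would confirm that every text entry has a file. It does not check whether a file was cut short by an interruption.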