In [None]:
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
# https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.download_file
# https://stackoverflow.com/questions/42090830/use-boto3-to-download-from-public-bucket

### DOWNLOADING DATA IN BULK

#### USING WGET

I was unable to get the Boto3 library to work.  It repeatedly returned an access-denied error.  Using the Python library awscli, I was able to install the AWS CLI on my server.  However, this did not alleviate the problem.

Then I tried an experiment: I sent a plain HTTPS request to the bucket with wget to see whether it would simply download whatever file was in the S3 bucket and save it to a file called export.zip.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html

Instead, a file called index.html was successfully downloaded while the export.zip portion failed.  wget treats every extra argument as another URL to fetch, so export.zip was not interpreted as an output filename; the -O option is what sets the output file.

wget https://dpla-provider-export.s3.amazonaws.com/ export.zip  

--2018-06-25 16:31:47--  https://dpla-provider-export.s3.amazonaws.com/  
Resolving dpla-provider-export.s3.amazonaws.com (dpla-provider-export.s3.amazonaws.com)... 52.216.160.3  
Connecting to dpla-provider-export.s3.amazonaws.com (dpla-provider-export.s3.amazonaws.com)|52.216.160.3|:443... connected.  
HTTP request sent, awaiting response... 200 OK  
Length: unspecified [application/xml]  
Saving to: `index.html'  

    [ <=>                                                       ] 226,527 --.-K/s in 0.01s

2018-06-25 16:31:47 (19.7 MB/s) - `index.html' saved [226527]

Having an index.html file was completely unexpected.  Upon exploring the file, it became clear that it contained valid XML, so I resaved the file as dpla_bucket.xml.  Opening this file in a browser did indeed show an XML entry not only for the complete JSON export (all.json.gz), but also for each individual provider JSON file.

In [66]:
<ListBucketResult>
    <Name>dpla-provider-export</Name>
    <Prefix/>
    <Marker/>
    <MaxKeys>1000</MaxKeys>
    <IsTruncated>true</IsTruncated>
    <Contents>
        <Key>2015/12/all.json.gz</Key>
        <LastModified>2015-12-16T16:39:42.000Z</LastModified>
        <ETag>"7f5835017527b10b69e8156b632f2968-685"</ETag>
        <Size>5026500451</Size>
        <StorageClass>STANDARD</StorageClass>
    </Contents>
    <Contents>
        <Key>2015/12/artstor.json.gz</Key>
        <AndSoOn></AndSoOn>
    </Contents>
    <AndSoOn></AndSoOn>
</ListBucketResult>
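
The listing above can also be read programmatically. The following is a minimal sketch, not part of the original workflow, that pulls every `Key` out of a saved listing such as dpla_bucket.xml using the standard library. The function name `parse_bucket_listing` is hypothetical. The sketch strips any XML namespace from tag names, on the assumption that the raw response may carry one even though the browser view above shows none.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_bucket_listing(xml_text):
    """Return (keys, is_truncated) from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    keys = []
    is_truncated = False
    for child in root:
        name = _local_name(child.tag)
        if name == "IsTruncated":
            is_truncated = (child.text or "").strip().lower() == "true"
        elif name == "Contents":
            # Each <Contents> element describes one object in the bucket.
            for field in child:
                if _local_name(field.tag) == "Key":
                    keys.append(field.text)
    return keys, is_truncated
```

Since the listing reports `IsTruncated` as true with `MaxKeys` of 1000, a single response holds only the first page of keys; the second return value makes that visible to the caller.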

I was then able to use the following command to download any file listed in a "Key" element.

wget https://dpla-provider-export.s3.amazonaws.com/2015/12/all.json.gz

--2018-06-25 16:36:55--  https://dpla-provider-export.s3.amazonaws.com/2015/12/all.json.gz  
Resolving dpla-provider-export.s3.amazonaws.com (dpla-provider-export.s3.amazonaws.com)... 52.216.160.43  
Connecting to dpla-provider-export.s3.amazonaws.com (dpla-provider-export.s3.amazonaws.com)|52.216.160.43|:443... connected.  
HTTP request sent, awaiting response... 200 OK  
Length: 5026500451 (4.7G) [application/gzip]  
Saving to: `all.json.gz'

     100%[===============================================>] 5,026,500,451 48.8M/s in 1m 45s

2018-06-25 16:38:40 (45.8 MB/s) - `all.json.gz' saved [5026500451/5026500451]
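
The same download can be done from Python without wget. This is a rough sketch using only the standard library; the helper names `object_url` and `download` are made up for illustration, and the base URL is the one used in the wget commands above. It streams the response to disk in chunks, so a multi-gigabyte file such as all.json.gz is never held in memory at once.

```python
import shutil
import urllib.parse
import urllib.request

# Bucket endpoint taken from the wget commands above.
BUCKET_URL = "https://dpla-provider-export.s3.amazonaws.com/"


def object_url(key, bucket_url=BUCKET_URL):
    """Build the HTTPS URL for one object key, percent-encoding unsafe characters."""
    return bucket_url.rstrip("/") + "/" + urllib.parse.quote(key)


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream the body at url into dest_path, chunk_size bytes at a time."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path
```

For example, `download(object_url("2015/12/all.json.gz"), "all.json.gz")` would request the same URL as the wget command above.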

#### USING AN AWS LIBRARY (DOES NOT WORK DUE TO DPLA PERMISSIONS)

In [1]:
import boto3
import botocore

print('all set.')

all set.


In [46]:
# import boto3
# import botocore

# ACCESS_KEY = 'AWS_ACCESS_KEY'
# SECRET_ACCESS_KEY = 'AWS_SECRET_ACCESS_KEY'
# DPLA_BUCKET = 'dpla-provider-export'
# FILE_KEY = '2015/12/georgia.json.gz'

# # s3 = boto3.resource('s3')
# s3 = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key= SECRET_ACCESS_KEY)
# # bucket = s3.Bucket(DPLA_BUCKET)
# # bucket.download_file(FILE_KEY, 'georgia.tar.gz')

# try:
#     s3.Bucket(DPLA_BUCKET).download_file(FILE_KEY, 'georgia.tar.gz')
# except botocore.exceptions.ClientError as e:
#     if e.response['Error']['Code'] == "404":
#         print("The object does not exist.")
#     else:
#         raise

In [47]:
# ACCESS_KEY, SECRET_ACCESS_KEY, DPLA_BUCKET and FILE_KEY are the constants from the commented-out cell above.
s3_client = boto3.client('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_ACCESS_KEY)
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_ACCESS_KEY)

In [64]:
bucket = s3_resource.Bucket(DPLA_BUCKET)  # s3_resource is created in the previous cell
bucket.name, bucket.creation_date
# bucket.get_available_subresources()

('dpla-provider-export', None)

In [58]:
obj = s3_resource.Object(DPLA_BUCKET, FILE_KEY)
obj.bucket_name, obj.key

('dpla-provider-export', '2015/12/georgia.json.gz')
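
One thing I did not try is anonymous access, which is what the Stack Overflow link at the top of this notebook is about. The sketch below is untested against the DPLA bucket, so treat it as an assumption rather than a fix: it configures boto3 to send unsigned requests instead of signing them with my own keys. `make_unsigned_s3_client` and `download_public_object` are hypothetical helper names; `UNSIGNED`, `Config(signature_version=...)` and `download_file` are the botocore/boto3 calls.

```python
def make_unsigned_s3_client():
    """Return a boto3 S3 client that sends anonymous (unsigned) requests."""
    # Imported lazily so this cell can be loaded even where boto3 is not installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def download_public_object(bucket, key, dest_path, client=None):
    """Download one object from a publicly readable bucket to dest_path."""
    if client is None:
        client = make_unsigned_s3_client()
    # download_file(Bucket, Key, Filename) streams the object to a local file.
    client.download_file(bucket, key, dest_path)
    return dest_path
```

A call would look like `download_public_object('dpla-provider-export', '2015/12/georgia.json.gz', 'georgia.json.gz')`. Whether the bucket policy allows this is something I have not verified.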

### PREPARING A FILE SAMPLE IN BASH

1. Get count of lines and size of file:   
  ```du -sh file.json # 1.6GB
  wc -l file.json # 373085```  <br><br>
2. Decide how much memory you have to spare.  In this case, my preference is for a sample of ~1/3 of the file.    
```373085 - 2 (first and last lines are square brackets which we'll need to remove and then add back)  
373083/3 = 124361```  <br><br>

3. In order to get a random sample, we'll need to remove the brackets which are the [first and last lines](https://unix.stackexchange.com/questions/209068/how-do-i-delete-the-first-n-lines-and-last-line-of-a-file-using-shell-commands), else these lines would be a part of the random sample.  
  ```sed '1d;$d' file.json.bk > file_no_brackets.json  
  wc -l file_no_brackets.json # 373083```  <br><br> 
  
4. Check to make sure it worked by [looking at the first N characters](https://stackoverflow.com/questions/14364397/read-first-8-characters-of-text-file-with-bash):  
```head -c 100 file_no_brackets.json
tail -c 100 file_no_brackets.json```<br><br>

5. [Shuffle](https://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash) 124361 lines and put them in a new file:  
```shuf -n 124361 file_no_brackets.json > file_shuffled.json```<br><br>

6. Check the head and tail again:    
```head -c 100 file_shuffled.json
tail -c 100 file_shuffled.json```<br><br>

7. Unfortunately, the shuffle will probably leave a comma at the start of the file, and maybe at the end, so we must remove it.  Since it's right at the beginning, I'll just do it by hand in vim.  And while I'm at it, I'll replace it with an opening bracket [ and add a closing ] at the end as well.  [To get to the end of the file](https://stackoverflow.com/questions/17012308/move-cursor-to-end-of-file-in-vim), press G$ in normal mode and then enter insert mode, or press GA to do both at the same time, then save and quit with :wq. <br><br>

8. To make sure it's valid JSON, copy and paste at least the first and last lines (I did two) into a [JSON linter](https://jsonlint.com/).  You shouldn't have to add any commas since each line already begins with one.  If you are on a local computer, you can use pbcopy; otherwise it's good ol' highlighting for you!  
```head -n 1 file_shuffled.json
tail -n 1 file_shuffled.json```<br><br>

9. Now that it's validated, we can do a final check of the line count and file size.  Note that, because of step 7, the brackets now share lines with the first and last records, so the line count still equals the number of sampled records, whereas in the original file the square brackets were each on their own line.  
  ```du -sh file_shuffled.json # 520MB
  wc -l file_shuffled.json # 124361```  <br><br>
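
For comparison, steps 3 through 7 could be collapsed into one pass in Python. This is only a sketch under the same assumptions about the file layout as above (one record per line, brackets on their own lines, records separated by a leading comma). It swaps `shuf` for reservoir sampling, so it never holds more than `k` records in memory; unlike `shuf`, it does not randomize the order of the records it keeps. The function name `sample_json_array_lines` is made up for this example.

```python
import random


def sample_json_array_lines(src_path, dst_path, k, seed=None):
    """Write a JSON array of up to k randomly chosen records from src_path.

    Returns the number of records written.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    with open(src_path) as src:
        for line in src:
            # Drop whitespace and the separating comma at either end of the line.
            record = line.strip().lstrip(",").rstrip(",").strip()
            if record in ("", "[", "]"):
                continue  # skip the bracket lines and blanks
            seen += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                # Keep the new record with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
    with open(dst_path, "w") as dst:
        dst.write("[\n")
        dst.write(",\n".join(reservoir))
        dst.write("\n]\n")
    return len(reservoir)
```

With the numbers from step 2, the call would be `sample_json_array_lines("file.json", "file_shuffled.json", 124361)`. Here the output puts the brackets on their own lines, so its line count is the sample size plus two.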

#### DECIDING HOW LARGE TO MAKE THE FILE

In [None]:
    124361 georgia_shuffled.json 520M
    
    100000 georgia_1e5.json      418M  80.40%  FAIL
     50000 georgia_5e4.json      209M  40.20%  562M
     25000 georgia_25e3.json     105M  20.10%  260M
     12436 georgia_12e3.json      58M  10.00%   64M(?)   
     10000 georgia_1e4.json       42M   8.04%   75M
      5000 georgia_5e3.json       21M   4.02%   41M
      2500 georgia_25e2.json      11M   2.01%   16M  
      1000 georgia_1e3.json      6.1M   0.80%    3M
       500 georgia_5e2.json      3.0M   0.40%    0M
       250 georgia_25e1.json     1.5M   0.20%    0M 

    318611 total

#### TESTING MEMORY CAPACITY

In [None]:
# HELPFUL LINKS:
# 1. http://zetcode.com/python/simplejson/  

import simplejson as json

databases = 'data/databases/'
state = 'georgia/'
file_name = 'georgia_25e1.json'

with open(databases + state + file_name) as json_file:
    georgia = json.load(json_file)

# print(json.dumps(georgia[0], indent=4 * ' '))
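
To put a number on how much memory a given sample takes once loaded, something like the following could be used. It is a sketch, not how the figures in the table above were obtained: it assumes Python 3 (for `tracemalloc`) and uses the standard-library `json` module rather than simplejson. The function name `load_json_with_peak_memory` is hypothetical.

```python
import json
import tracemalloc


def load_json_with_peak_memory(path):
    """Load a JSON file and return (data, peak_bytes) traced while loading."""
    tracemalloc.start()
    try:
        with open(path) as json_file:
            data = json.load(json_file)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak_bytes
```

For example, `georgia, peak = load_json_with_peak_memory(databases + state + file_name)` followed by `peak / 1e6` gives the peak in megabytes. `tracemalloc` only counts allocations made by Python itself, so the process-level figure reported by the operating system may differ.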