# Introduction

Author: Yingding Wang\
Created: 15.11.2023

This notebook shows how to load a PDF file from an S3 bucket with boto3 and read it through a `BytesIO` stream with `pypdf.PdfReader`.

* Boto3 client doc: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects
* S3 buckets have no folders, only key prefixes: https://stackoverflow.com/questions/51303609/python-recursive-glob-in-s3/51303951#51303951
* Latest Boto3 version on PyPI: https://pypi.org/project/boto3/

### Show the ENV variables of the pod
```python
import os
print(os.environ)
```
or
```
!env
```
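
Printing the whole environment also prints the S3 credentials that later cells read from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. A minimal sketch that masks such values before printing; the helper `masked_env` and the marker list are illustrative assumptions, not part of `applyllm` or boto3:

```python
import os

# Substrings that mark a variable as sensitive (an assumption; extend as needed)
SENSITIVE_MARKERS = ("KEY", "SECRET", "TOKEN", "PASSWORD")

def masked_env(env=None):
    """Return a copy of the environment with sensitive values masked."""
    env = os.environ if env is None else env
    return {
        name: "***" if any(m in name.upper() for m in SENSITIVE_MARKERS) else value
        for name, value in env.items()
    }

# Print the masked environment, one variable per line
for name, value in sorted(masked_env().items()):
    print(f"{name}={value}")
```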

### PdfReader with BytesIO
* https://stackoverflow.com/questions/48373967/issue-with-pypdf2-and-decoding-pdf-file-from-s3
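
`PdfReader` takes a file path or a binary file-like object, so the `bytes` downloaded from S3 are wrapped in `BytesIO` first. A small stdlib-only sketch of what that wrapper provides: the hypothetical helper `looks_like_pdf` reads the `%PDF-` header and then seeks near the end of the data to look for the `%%EOF` marker, which is the kind of random access a PDF parser relies on.

```python
from io import BytesIO
import os

def looks_like_pdf(data: bytes, tail_size: int = 1024) -> bool:
    """Cheap sanity check: '%PDF-' header at the start, '%%EOF' near the end."""
    stream = BytesIO(data)              # in-memory, seekable file-like object
    header = stream.read(5)
    size = stream.seek(0, os.SEEK_END)  # jump to the end; returns the total size
    stream.seek(max(0, size - tail_size))
    tail = stream.read()
    return header == b"%PDF-" and b"%%EOF" in tail

# Tiny hand-made byte strings, only for illustration (not valid PDFs)
print(looks_like_pdf(b"%PDF-1.7\n% body omitted\n%%EOF\n"))
print(looks_like_pdf(b"<html>Access Denied</html>"))
```

Such a check may help to spot an error page stored under a `.pdf` key before the bytes are handed to `PdfReader`.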



In [1]:
import sys, os, applyllm

print(f"applyllm version: {applyllm.__version__}")

applyllm version: 0.0.4


In [2]:
# Installing collected packages: jmespath, botocore, s3transfer, boto3
# !{sys.executable} -m pip install --user --upgrade boto3==1.29.0

## S3 ListObjects has no regex filter

S3 can list objects by prefix only; suffix or regex matching has to be done client-side on the returned keys:
https://stackoverflow.com/questions/62379936/searching-for-keys-in-a-s3-bucket-with-prefix-suffix-or-regex
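
Since the listing API filters by prefix only, a pattern such as `KK-SCIVIAS-*.txt` has to be applied to the keys after they come back. A minimal sketch with the standard `re` module; the helper `filter_keys` and the example keys are made up for illustration:

```python
import re

def filter_keys(keys, pattern):
    """Keep only the keys that fully match the regular expression."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.fullmatch(key)]

# Example keys as a prefix listing might return them (made up for illustration)
keys = [
    "trans2en/KK-SCIVIAS-00001.txt",
    "trans2en/KK-SCIVIAS-00001.json",
    "trans2en/KK-SCIVIAS-00002.txt",
]
txt_keys = filter_keys(keys, r"trans2en/KK-SCIVIAS-.*\.txt")
print(txt_keys)
```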


### Read translated text files

In [3]:
from applyllm.io import (
    S3AccessConf,
    S3BucketHelper,
)

BUCKET_NAME="scivias-medreports"
VERIFY_HOST=True

In [4]:
# pattern="trans3en/KK-SCIVIAS-*.txt"
subfolder="trans2en"
file_prefix="KK-SCIVIAS"
text_report_prefix = f"{subfolder}/{file_prefix}"
text_report_prefix

'trans2en/KK-SCIVIAS'

In [5]:
s3_conf = S3AccessConf(
    access_key_id = os.environ.get('AWS_ACCESS_KEY_ID'),
    secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY'),
    endpoint = os.environ.get('S3_ENDPOINT'),
    bucket_name = BUCKET_NAME,
    verify_host = VERIFY_HOST,
)
s3_text_reports_helper = S3BucketHelper(conf=s3_conf, file_prefix=text_report_prefix)

In [6]:
limit_count = 2
text_report_list = list(s3_text_reports_helper.get_object_keys(limit_count=limit_count))

In [7]:
text_report_list

['trans2en/KK-SCIVIAS-00003^0053360847^2018-09-28^KIIGAS.txt',
 'trans2en/KK-SCIVIAS-00004^0051726752^2015-12-17^KIIS1.txt']

### Read PDF files

In [8]:
import boto3

print(f"boto3 version: {boto3.__version__}")

# bucket_name="scivias-medreports"
session = boto3.session.Session(
    aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
)
#s3 = session.resource('s3', endpoint_url = os.environ.get('S3_ENDPOINT'))
s3 = session.resource('s3', endpoint_url = os.environ.get('S3_ENDPOINT'), verify=VERIFY_HOST)
bucket = s3.Bucket(BUCKET_NAME)

boto3 version: 1.34.14


In [9]:
# pattern="KK-SCIVIAS-*.pdf"
file_prefix = "KK-SCIVIAS"
prefix = file_prefix

batch_max = 2  # not used in this cell; all matching keys are collected

# Collect the key of every object whose key starts with the prefix
bucket_items = []
for obj in bucket.objects.filter(Prefix=prefix):
    bucket_items.append(obj.key)
    # print(obj)
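
The cell above defines `batch_max` but still walks the whole listing. A sketch of how the loop could stop after a fixed number of keys with `itertools.islice`; the helper `first_keys` and the `FakeObject` stand-in are hypothetical and only there so the example runs without an S3 connection:

```python
from itertools import islice

def first_keys(objects, limit):
    """Collect at most `limit` keys from an iterable of objects with a `.key` attribute."""
    return [obj.key for obj in islice(objects, limit)]

# Stand-in for bucket.objects.filter(Prefix=prefix), so the sketch runs without S3
class FakeObject:
    def __init__(self, key):
        self.key = key

fake_listing = (FakeObject(f"KK-SCIVIAS-{i:05d}.pdf") for i in range(800))
print(first_keys(fake_listing, limit=2))
```

With the real bucket this would be `first_keys(bucket.objects.filter(Prefix=prefix), batch_max)`. boto3 resource collections also offer a `limit()` method for the same purpose.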

In [10]:
len(bucket_items)

800

In [11]:
# bucket_items[0]

In [12]:
from io import BytesIO
from pypdf import PdfReader

In [13]:
idx = 0
item = bucket_items[idx]

In [14]:
obj = s3.Object(BUCKET_NAME, item)

In [15]:
fs = obj.get()['Body'].read()

In [16]:
type(fs)

bytes

In [17]:
reader = PdfReader(BytesIO(fs))

In [18]:
content_raw_str = "".join([page.extract_text() for page in reader.pages])

In [19]:
len(content_raw_str)

7611
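
Joining the pages with an empty string runs the last word of one page into the first word of the next, and pages without a text layer (for example scans) contribute nothing. A sketch that keeps a separator between pages and reports which pages came back empty; `join_pages` and `FakePage` are hypothetical names, and the stand-in pages only mimic the `extract_text()` method used above:

```python
def join_pages(pages, sep="\n"):
    """Join per-page text and report the indices of pages that yielded no text."""
    texts = [(page.extract_text() or "") for page in pages]
    empty = [i for i, text in enumerate(texts) if not text.strip()]
    return sep.join(texts), empty

# Stand-in pages so the sketch runs without a PDF; real use: join_pages(reader.pages)
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

content, empty_pages = join_pages([FakePage("Report"), FakePage(""), FakePage("Findings")])
print(repr(content), empty_pages)
```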