Converting docx (downloaded from s3) to pdf (uploaded to s3) #255

Ayh4m · 2023-01-16T09:29:23Z

Hello guys,

I'm trying to convert an MS Word .docx file (downloaded from an S3 bucket) to a PDF and upload the result to the same location in that bucket.

NOTE: I've tried the layer solution and it did work for me.

I've followed all the mentioned steps:

1- Created my own docker image:

Where my Dockerfile looks like this:

FROM public.ecr.aws/shelf/lambda-libreoffice-base:7.4-node16-x86_64
COPY app.js package.json ${LAMBDA_TASK_ROOT}/
RUN npm install
CMD [ "app.handler" ]

and app.js looks like this:

const {convertTo, canBeConvertedToPDF} = require('@shelf/aws-lambda-libreoffice');
const fs = require("fs");
const AWS = require("aws-sdk");

const S3_BUCKET = process.env.S3_BUCKET
const WORD_FILE_KEY = process.env.WORD_FILE_KEY
const TMP_DIR = "/tmp"

const s3 = new AWS.S3();


module.exports.handler = async () => {

    console.log(`# Bucket Name: ${S3_BUCKET}`)
    console.log(`# File Name: ${WORD_FILE_KEY}`)
    console.log(`[Initial]: ${getTmpDirContent()}`)

    // Download .docx file from S3 to /tmp
    console.log("# Downloading the .docx file from S3 ...")
    await downloadFileFromS3(WORD_FILE_KEY)
    console.log(`[After Download]: ${getTmpDirContent()}`)

    if (canBeConvertedToPDF(WORD_FILE_KEY)) {
        // Convert .docx to .pdf
        console.log("# Converting .docx file to .pdf file ...")
        const convertedFilePath = convertTo(WORD_FILE_KEY, "pdf");
        console.log(convertedFilePath)
        console.log(`[After Convert]: ${getTmpDirContent()}`)

        // Upload .pdf file to S3
        console.log("# Uploading the .pdf file to S3 ...")
        await uploadFileToS3(convertedFilePath)

        console.log("# Done")
    } else {
        console.log("# Can't be converted to pdf!")
    }
}

const downloadFileFromS3 = async (fileKey) => {
    try {
        const filePath = `${TMP_DIR}/${fileKey}`
        const params = {
            Bucket: S3_BUCKET,
            Key: fileKey
        }
        const objData = await s3.getObject(params).promise();
        fs.writeFileSync(filePath, objData.Body.toString());
        console.log(`- File ${fileKey} has been downloaded to ${filePath} successfully`);
    } catch (e) {
        throw new Error(`- Download Error: ${e.message}`)
    }
}

const uploadFileToS3 = async (filePath) => {
    try {
        const fileKey = filePath.split(`${TMP_DIR}/`)[1]
        const fileData = fs.readFileSync(filePath)
        const params = {
            Bucket: S3_BUCKET,
            Key: fileKey,
            Body: fileData
        }
        const data = await s3.upload(params).promise();
        console.log(`- Upload successfully at ${data.Location}`);
    } catch (e) {
        throw new Error(`- Upload Error: ${e.message}`)
    }
};

const getTmpDirContent = () => {
    return fs.readdirSync(TMP_DIR)
}
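
One thing that may be worth checking in the handler above (an assumption, not something confirmed in this thread): downloadFileFromS3 writes objData.Body.toString(), and a .docx file is a ZIP archive whose bytes are generally not valid UTF-8. The sketch below uses made-up sample bytes in place of a real S3 response, and hypothetical file names, to show that the string round trip changes the file while writing the Buffer directly does not.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Sample bytes standing in for the Body of an S3 getObject response.
// They start with the ZIP signature "PK\x03\x04" and include bytes
// (0xff, 0xfe, 0x80) that are not valid UTF-8.
const body = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe, 0x80, 0x00]);

// Scratch directory so the sketch does not touch real files.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "docx-"));
const asString = path.join(dir, "as-string.docx");
const asBuffer = path.join(dir, "as-buffer.docx");

// What the handler above does: decode to a UTF-8 string, then write.
fs.writeFileSync(asString, body.toString());

// Alternative: pass the Buffer to writeFileSync unchanged.
fs.writeFileSync(asBuffer, body);

// Compare what landed on disk with the original bytes.
const stringRoundTripOk = fs.readFileSync(asString).equals(body);
const bufferRoundTripOk = fs.readFileSync(asBuffer).equals(body);
console.log({ stringRoundTripOk, bufferRoundTripOk });
```

If this applies here, passing objData.Body to fs.writeFileSync without calling toString() would keep the downloaded file byte-for-byte intact.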

& package.json is the following:

{
  "name": "libreoffice-lambda-container-image",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "dependencies": {
    "@shelf/aws-lambda-libreoffice": "^5.0.1",
    "aws-sdk": "^2.1293.0"
  }
}

2- Pushed it to ECR.

3- Created a Lambda Function:

With the following configurations:

AND I'm getting the following error:

Any suggestions?!


atrudeau-vitall · 2023-02-06T21:56:01Z

Hey @Ayh4m, I had this same use case and was able to get it working. See the repo here

fedeotaran · 2023-06-28T20:03:07Z

I have the same problem. Was anyone able to solve it?

npahucki · 2024-05-30T14:30:57Z

For anyone dealing with this specific error: it happens when the log message returned from libreoffice doesn't match the regex in this statement from the library/logs module: return logs.match(/\/tmp\/.+->\s(\/tmp\/.+) using/)[1]; The regex assumes that the log message 1) is in English (the 'using') and 2) has input and output paths that both start with /tmp/.

I was bumping into this while testing the code on macOS, where /tmp is a symbolic link to /private/tmp/. The output path in the messages returned from libreoffice therefore had /private/tmp/ in it and did not match the regex, so the match returned null and accessing index 1 of null threw an error. I worked around it by monkey patching getConvertedFilePath when testing on macOS.
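
To make the failure mode concrete, here is a small sketch that applies the quoted regex to two log lines. The log lines are assumed examples of what libreoffice might print (the exact wording can differ by version), and extractOutputPath is a hypothetical helper, not part of the library. It relaxes the /tmp/ requirement and returns null instead of throwing, but it still depends on the English word 'using'.

```javascript
// The pattern quoted above from the library/logs module.
const strict = /\/tmp\/.+->\s(\/tmp\/.+) using/;

// Assumed log lines, for illustration only.
const lambdaLog =
  "convert /tmp/report.docx -> /tmp/report.pdf using filter : writer_pdf_Export";
const macLog =
  "convert /private/tmp/report.docx -> /private/tmp/report.pdf using filter : writer_pdf_Export";

// On the Lambda-style line the strict pattern captures the output path.
const strictLambda = lambdaLog.match(strict);

// On the macOS-style line the text after "-> " starts with /private,
// so the strict pattern finds nothing and match() returns null.
const strictMac = macLog.match(strict);

// Hypothetical, more tolerant helper: accepts any absolute output path
// and reports a miss as null rather than indexing into null.
function extractOutputPath(logs) {
  const found = logs.match(/->\s(\/.+?) using/);
  return found ? found[1] : null;
}

console.log(strictLambda && strictLambda[1]);
console.log(strictMac);
console.log(extractOutputPath(macLog));
```

Checking the result for null before reading index 1 also gives a place to log the raw libreoffice output, which is usually the quickest way to see why a conversion produced no file.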

It did work just fine for me with no modification while running in the deployed Lambda. However, if someone has locale variables set in their Dockerfile and libreoffice outputs messages in a language other than English, this would be broken in deployed Lambdas too.
