## Conversion of PDF files to JPEG images

Our pipeline generated PDFs with Italian authors and titles to simulate real-world cover images. In a real scenario users would likely have images ready, but for us this step was essential to obtain the required images.

This notebook uses the Wand library to extract JPEGs from the PDF files. Wand is a ctypes-based ImageMagick binding for Python.

In [1]:
from wand.image import Image
from wand.color import Color
import os

The function below loops over the 100 sample PDFs. For each single-page file, it renders the page onto a white background as a JPEG image, which can then be saved to the `Data/sample_jpegs` folder.


In [2]:
def pdfToImg():
    """ Convert a PDF into images.

        All the images will be saved in format:
        {pdf_filename}.jpg
    """
    
    path = './Data/sample_pdfs'

    counter = 0

    for filename in os.listdir(path):
        
        try:
            currloc = os.path.join(path, filename)
            
            # Render the PDF at a resolution of 128 DPI
            with Image(filename=currloc, resolution=128) as img:
                
                #Set the image background to white along with width and height
                with Image(width=img.width, height=img.height, background=Color("white")) as bg:
                    images=img.sequence
                    pages=len(images)
                  
                    # Only convert PDFs that consist of a single page; multi-page files are skipped
                    if pages == 1:
                        bg.composite(img,0,0)
                        
                        # Extract the filename by stripping the extension
                        filename = os.path.splitext(filename)[0]
                        
                        #Save image to the images folder
                        output_file = './Data/sample_jpegs/' + filename +'.jpg'
                        
                        # the bg.save() command saves the jpeg file to the Data/sample_jpegs
                        # directory. Uncomment it to save the files.
                        # bg.save(filename=output_file)

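        except Exception as exc:
            # Report the failure and continue with the next file
            print(f"Skipping {filename}: {exc}")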
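
The function treats every entry in the input directory as a PDF. As a minimal sketch of a more defensive mapping from input name to output name, the helper below skips anything that is not a PDF. The name `jpeg_name_for` and the sample filenames are hypothetical and not part of the original notebook.

```python
import os

def jpeg_name_for(pdf_filename):
    """Return the .jpg name for a PDF filename, or None for non-PDF entries."""
    stem, ext = os.path.splitext(pdf_filename)
    if ext.lower() != ".pdf":
        # Skip stray entries such as .DS_Store or text files
        return None
    return stem + ".jpg"

# Hypothetical directory listing used only for illustration
names = ["cover_001.pdf", "COVER_002.PDF", "notes.txt", ".DS_Store"]
jpeg_names = [jpeg_name_for(n) for n in names]
```

Entries that map to `None` could simply be skipped inside the loop before Wand is asked to open them.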


We have already populated the Data/sample_jpegs directory with the images. This notebook documents how we completed this step and can be used to recreate the process offline.

If using the notebook or code outside the project file structure, please change the input/output paths accordingly.
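
One way to make the paths easy to change is to compute the input/output pairs from two directory arguments instead of hard-coding them. The sketch below assumes a flat input directory with lowercase `.pdf` extensions; `plan_conversions` is a hypothetical helper, not part of the original notebook, and it only builds paths without performing any conversion.

```python
from pathlib import Path

def plan_conversions(input_dir, output_dir):
    """Pair each PDF in input_dir with the JPEG path it would be written to."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    # sorted() keeps the order stable across platforms
    for pdf in sorted(input_dir.glob("*.pdf")):
        pairs.append((pdf, output_dir / (pdf.stem + ".jpg")))
    return pairs
```

With the project layout used here, `plan_conversions('./Data/sample_pdfs', './Data/sample_jpegs')` would return one `(source, destination)` pair per sample PDF, and each pair could then be passed to the Wand conversion step.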

In [3]:
pdfToImg()