# Creating a PDF Report in Python
- Use a library that can generate reports, such as ReportLab.
- ReportLab is a software library written in Python that allows you to create PDFs programmatically.

### What is ReportLab?
ReportLab is a robust open-source library for generating Portable Document Format (PDF) files with Python. It can build PDFs dynamically from user input, database queries, or other data sources, and the documents can include charts and graphics.

- Create professional-looking PDF documents with a variety of layouts:
    - design elegant titles,
    - embed images and charts, and
    - organize your data into structured tables.
- With ReportLab, you can create documents such as
    - invoices,
    - reports, and
    - other documents that require a high degree of formatting and layout control.
- Create PDFs from scratch. Modifying existing PDF documents generally needs a companion library such as `pypdf`, covered later in this notebook.

### Why Opt for PDF Files?
- Cross-Platform Compatibility: 
    - PDFs work across devices, operating systems, and applications, which makes them a preferred choice for document sharing.
- Integrity of Design: 
    - PDFs retain the original formatting and design of your content (text, tables, or visuals), so it appears as intended wherever it is viewed.
- Efficiency in Size: 
    - Compared with formats like Word documents, PDFs are often more compact, allowing for easier storage and quicker transmission.
    
### Key Features of ReportLab Include
- Dynamic Content Generation: 
    - Ideal for crafting personalized items such as reports, invoices, and certificates.
- Visual Integration: 
    - ReportLab seamlessly embeds images, charts, and other graphics into your PDFs.
- Table Creation: 
    - Create tables with adjustable row heights, column widths, and custom styling.
- Multi-page Documents: 
    - Effortlessly generate multi-page PDFs with automatic page breaks, headers, and footers.

In [1]:
!pip install reportlab



### Importing it

In [2]:
from reportlab.pdfgen import canvas

## Creating a Basic PDF Page with ReportLab Using the Pdfgen Canvas
ReportLab offers two primary approaches to generating a PDF page: the low-level `pdfgen` canvas, described here, and the higher-level Platypus framework, described later.
 
- `pdfgen` is the lower-level approach: it provides direct drawing methods to create content on a PDF page.
- The canvas interface of `pdfgen` is like a drawing board on which you can place a sequence of paint operations to construct your PDF.
    - The `pagesize` argument is a tuple of two numbers (width, height) in points (1/72 of an inch); it defaults to A4.
- `setTitle()` sets the metadata title of the PDF
- `setFont()` sets the font for any text subsequently drawn on the canvas
- `drawString()` draws a string on the canvas specified at the x and y coordinates
- `showPage()` method causes the canvas to stop painting on the current page, and any further operations will paint on a subsequent page.
- `save()` method is called to generate the PDF document after the construction of the document is complete.
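
Because every canvas coordinate is expressed in points measured from the bottom-left corner of the page, a little unit arithmetic helps when placing text. The sketch below uses only the standard library; the names `POINTS_PER_INCH`, `mm_to_points`, `inches_to_points`, and `from_top` are illustrative and not part of ReportLab, which ships its own unit constants in `reportlab.lib.units`.

```python
# Unit arithmetic for PDF coordinates (standard library only).
# PDF user space is measured in points: 1 point = 1/72 inch.
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def inches_to_points(inches):
    """Convert inches to points."""
    return inches * POINTS_PER_INCH


# A4 is 210 mm x 297 mm; this is the (width, height) shape a pagesize tuple has.
a4_width, a4_height = mm_to_points(210), mm_to_points(297)


def from_top(page_height, distance_from_top):
    """Return the y coordinate of a point measured down from the top edge.

    The canvas origin is the bottom-left corner, so y grows upwards.
    """
    return page_height - distance_from_top


# Half an inch from the left edge, one inch below the top edge of an A4 page.
x = inches_to_points(0.5)
y = from_top(a4_height, inches_to_points(1))
print(round(a4_width, 2), round(a4_height, 2))
print(round(x, 2), round(y, 2))
```

The resulting `x` and `y` are the kind of values that would be passed to `drawString()`.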

In [29]:
# import the canvas object from ReportLab
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter, A4
from reportlab.lib.units import inch


# Create a Canvas object, which represents the PDF document
c = canvas.Canvas("mypdf.pdf", pagesize=A4)

# Set the metadata title and the font for subsequent text
c.setTitle("AlexForbes Reports")
c.setFont("Helvetica", 36)

# Add some text to the PDF.
# This string starts half an inch from the left edge and ten inches from the bottom edge of the page.
c.drawString(.5*inch, 10*inch, "AlexForbes Internal Audit Finding Report")
c.drawString(.5*inch, 9*inch, "A simple PDF file using Canvas interface of the pdfgen")

# Use the drawImage method to add images to the PDF.
# x and y position the bottom-left corner of the image; width and height are in points.
c.drawImage("Alexforbes_bottom_banner.jpg", x=20, y=10, width=500, height=100)
c.drawImage("Alexforbes_full.jpg", x=20, y=750, width=300, height=75)

# This page is complete 
c.showPage()

# Save the PDF
c.save()

### Adding an Image to a PDF Document
Images can be added with the pdfgen canvas:
- Pdfgen Canvas
    - The `drawImage()` method embeds an image into the PDF. 
        - Its first parameter is the image's file path, while the subsequent two specify the x and y positions, determining where the image's bottom-left corner will be anchored on the canvas.
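
Since the image is anchored by its bottom-left corner, placing it relative to the top of the page means subtracting both the top gap and the image height from the page height. A minimal sketch of that arithmetic, where `bottom_left_anchor` is a hypothetical helper rather than a ReportLab function:

```python
def bottom_left_anchor(page_height, left, top, image_height):
    """Return the (x, y) to pass to a bottom-left-anchored drawing call.

    `left` and `top` are the desired gaps, in points, between the page's
    left/top edges and the image's left/top edges.
    """
    x = left
    y = page_height - top - image_height
    return x, y


# Example: a 75-point-tall logo, 20 points in from the left edge and
# 20 points below the top edge of a page that is 842 points tall.
x, y = bottom_left_anchor(page_height=842, left=20, top=20, image_height=75)
print(x, y)
```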

In [None]:
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

c = canvas.Canvas('./downloads/image1.pdf', pagesize=A4)
# Draw the image with its bottom-left corner at the page origin (0, 0)
c.drawImage('./downloads/beautifulsky.png', 0, 0)
c.showPage()  # finish the current page
c.save()  # write everything drawn on the canvas to the PDF file

## Creating a Basic PDF Page with ReportLab with Platypus

- Platypus (Page Layout and Typography Using Scripts): 
    - A higher-level framework within ReportLab to create PDFs using a more abstract approach.
    - The `getSampleStyleSheet()` function returns a stylesheet of predefined paragraph styles, accessed by name like a dictionary, for styling text elements such as titles, standard text, and bulleted lists.
- Platypus has several layers:
    - **DocTemplates** 
        - the outermost container for the document
    - **PageTemplates** 
        - specifications for layouts of pages of various kinds
    - **Frames** 
        - specifications of regions in pages that can contain flowing text or graphics.
    - **Flowables** 
        - are text or graphic elements that should flow into the document, like images, paragraphs, and tables.
    - **pdfgen.Canvas** 
        - is the lowest level, ultimately receiving the painting of the document from other layers
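
To make the idea of flowables flowing into frames more concrete, here is a toy model in plain Python. It is not ReportLab's layout code: the function `paginate` and the heights are hypothetical, and real flowables can also be split across pages, which this sketch does not attempt.

```python
def paginate(flowable_heights, frame_height):
    """Assign items to pages by filling a frame of fixed height top to bottom.

    Each item goes on the current page if it still fits; otherwise a new
    page is started. An item taller than the frame is rejected, because
    this toy model does not split items across pages.
    """
    pages = [[]]
    remaining = frame_height
    for index, height in enumerate(flowable_heights):
        if height > frame_height:
            raise ValueError(f"item {index} is taller than the frame")
        if height > remaining:
            pages.append([])          # start a new page
            remaining = frame_height
        pages[-1].append(index)
        remaining -= height
    return pages


# Five items (heights in points) flowing into a 700-point-tall frame.
layout = paginate([300, 300, 200, 650, 40], frame_height=700)
print(layout)
```

Each inner list holds the indices of the items that ended up on one page, which mirrors how a story of flowables is consumed in order.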

In [34]:
from reportlab.lib.styles import getSampleStyleSheet

sample_style_sheet = getSampleStyleSheet()

# if you want to see all the sample styles, this prints them
sample_style_sheet.list()

BodyText None
    name = BodyText
    parent = <ParagraphStyle 'Normal'>
    alignment = 0
    allowOrphans = 0
    allowWidows = 1
    backColor = None
    borderColor = None
    borderPadding = 0
    borderRadius = None
    borderWidth = 0
    bulletAnchor = start
    bulletFontName = Helvetica
    bulletFontSize = 10
    bulletIndent = 0
    embeddedHyphenation = 0
    endDots = None
    firstLineIndent = 0
    fontName = Helvetica
    fontSize = 10
    hyphenationLang = 
    justifyBreaks = 0
    justifyLastLine = 0
    leading = 12
    leftIndent = 0
    linkUnderline = 0
    rightIndent = 0
    spaceAfter = 0
    spaceBefore = 6
    spaceShrinkage = 0.05
    splitLongWords = 1
    strikeColor = None
    strikeGap = 1
    strikeOffset = 0.25*F
    strikeWidth = 
    textColor = Color(0,0,0,1)
    textTransform = None
    underlineColor = None
    underlineGap = 1
    underlineOffset = -0.125*F
    underlineWidth = 
    uriWasteReduce = 0
    wordWrap = None

Bullet bu
    name = B

In [37]:
# SimpleDocTemplate takes a filename or a file-like buffer as its first argument.
# The buffer is the place the PDF bytes are written to as the document is built.
from io import BytesIO

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
from reportlab.lib.units import mm, inch
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import SimpleDocTemplate, Paragraph, ListFlowable, Spacer, PageBreak

# ____________________________________________________________________________________

# Choosing where the PDF is written
# Option 1: point the document at a filename. This writes to Platypus_sample.pdf.
pdf_filename = 'Platypus_sample.pdf'
doc = SimpleDocTemplate(pdf_filename, pagesize=letter)

# Option 2: to hold onto the PDF in Python, use a BytesIO buffer instead.
pdf_buffer = BytesIO()

# ____________________________________________________________________________________

# Sizing the document and margins
# pagesize is a (width, height) tuple in points.
# Caveat: multiply by one of the provided unit constants (e.g. mm or inch).
pagesize = (140 * mm, 216 * mm)  # width, height

# For non-default margins, pass topMargin, leftMargin, rightMargin, and bottomMargin.
my_doc = SimpleDocTemplate(
    pdf_buffer,
    pagesize=pagesize,
    topMargin=1*inch,
    leftMargin=1*inch,
    rightMargin=1*inch,
    bottomMargin=1*inch
)

# ____________________________________________________________________________________

# Fonts and text styles
# getAvailableFonts() is a canvas method; it lists the built-in fonts.
# To include your own fonts, you have to register and embed them.
print(Canvas(BytesIO()).getAvailableFonts())

# getSampleStyleSheet builds sample styles that can be used as-is or as a base.
styles = getSampleStyleSheet()
styles.list()  # prints every sample style
title_style = styles['Title']
paragraph_style = styles['Normal']
list_style = styles['Bullet']

# Derive a custom style instead of editing the shared BodyText style in place.
custom_body_style = ParagraphStyle(
    'CustomBody', parent=styles['BodyText'],
    fontName='ZapfDingbats', fontSize=25, leading=30
)
custom_body_style.listAttrs()  # shows every attribute that can be modified

# ____________________________________________________________________________________

# Adding content to the PDF
# Paragraph takes the text and a style. The text accepts HTML-like markup
# tags for bold, italics, underlines, and line breaks.
# PageBreak starts a new page, which is useful for a new chapter.
flowables = [
    Paragraph("A title", styles['Heading1']),
    Paragraph("Some normal body text", styles['BodyText']),
    Paragraph("A <b>bold</b> word.<br /> An <i>italic</i> word.", styles['BodyText']),
    PageBreak(),
    Paragraph("Dingbat paragraph", custom_body_style),
]

# A second story: a title, a paragraph, a spacer, and a bulleted list
text_series = [
    "reportlab.platypus.Paragraph class is one of the most useful of the Platypus Flowables",
    "The text argument contains the text of the paragraph;",
    "The bulletText argument provides the text of a default bullet for the paragraph",
    "styles are arranged in a dictionary style object called a stylesheet ",
    "getSampleStyleSheet() allows for the styles to be accessed as stylesheet['BodyText']."
]
elements = [
    Paragraph("PDF generated using Platypus", title_style),
    Paragraph("All about Paragraph Class <br/>", paragraph_style),
    Spacer(1, 0.2 * inch),  # 0.2 inch of vertical space
    # Wrap each item in a Paragraph and render the list with bullet markers
    ListFlowable([Paragraph(item, list_style) for item in text_series], bulletType="bullet"),
]

# ____________________________________________________________________________________

# Adding page numbers
# build() accepts the optional arguments onFirstPage and onLaterPages.
# Each takes a callback that receives the PDF's canvas and the document.
def add_page_number(canvas, doc):
    canvas.saveState()
    canvas.setFont('Times-Roman', 10)
    page_number_text = "%d" % (doc.page)
    canvas.drawCentredString(0.75 * inch, 0.75 * inch, page_number_text)
    canvas.restoreState()

# ____________________________________________________________________________________

# Build the documents. build() requires the list of flowables to add to the PDF.
doc.build(elements)  # writes Platypus_sample.pdf
# Page numbers on all pages; the result goes into pdf_buffer
my_doc.build(flowables, onFirstPage=add_page_number, onLaterPages=add_page_number)

# Use the buffer that we created, for example in a Django view.
def view_that_returns_pdf(request):
    from django.http import HttpResponse  # assumes a Django project
    response = HttpResponse(content_type='application/pdf')
    response['Content-Disposition'] = 'attachment; filename="some_file.pdf"'
    response.write(pdf_buffer.getvalue())
    return response

In [38]:
doc.canv.getAvailableFonts()

['Courier',
 'Courier-Bold',
 'Courier-BoldOblique',
 'Courier-Oblique',
 'Helvetica',
 'Helvetica-Bold',
 'Helvetica-BoldOblique',
 'Helvetica-Oblique',
 'Symbol',
 'Times-Bold',
 'Times-BoldItalic',
 'Times-Italic',
 'Times-Roman',
 'ZapfDingbats']

- The `elements` list holds the content components for the eventual PDF document.
    - A title is constructed using the `title_style`.
    - A standard paragraph is formed using the `paragraph_style`, with line breaks added via the `<br/>` tag.
    - A `Spacer` introduces a specified vertical gap within the PDF.
    - A bulleted list is assembled using the `ListFlowable` class. Each item is wrapped in a `Paragraph` with the `list_style` and added to the `ListFlowable` with the chosen bullet type.


### Adding a Table into a PDF Document
The `Table` and `LongTable` classes provide the foundation for structured grids of text. Both derive from the `Flowable` class. 

These tables can accommodate Python strings as well as a list of Flowables within their cells.

`TableStyle` defines style properties for the visual customization of tables. It governs:
- Header Rows: 
    - The leading rows typically showcasing column titles.
- Data Rows: 
    - Rows containing the core data or information.
- Visual Attributes: 
    - This includes specifications like colors, text alignment, font choices, padding, and the option for word wrapping.
    
The variable `table_data` is a list of lists; each inner list represents one row of the table. 

The first row holds the column headers, followed by the data rows.

`table = Table(table_data, colWidths=[150, 120, 120])`

The code above instantiates a table object from `table_data`. 
- The `colWidths` parameter specifies the width of each column, measured in points.

table_style outlines a suite of styling directives for the table:
- The header row background is a muted grey.
- Textual content within the header adopts a white hue.
- All table content is center-aligned.
- The header typography is emboldened and rendered in Helvetica.

`PageBreak()` serves as a Flowable, facilitating the transition to a new page within the document.
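
Each `TableStyle` command names a start cell and an end cell as `(column, row)` pairs, and negative indices count back from the last column or row, so `(0, 0), (-1, 0)` covers the whole header row. The sketch below expands such a pair into concrete cells to show which part of the table a command touches; `resolve_cells` is a hypothetical helper written for illustration, not a ReportLab function.

```python
def resolve_cells(start, stop, n_cols, n_rows):
    """Expand a (col, row) start/stop pair into the list of cells it covers.

    Negative indices count back from the last column or row, so (-1, -1)
    is the bottom-right cell.
    """
    def normalise(index, size):
        return index + size if index < 0 else index

    first_col, first_row = normalise(start[0], n_cols), normalise(start[1], n_rows)
    last_col, last_row = normalise(stop[0], n_cols), normalise(stop[1], n_rows)
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A header range, (0, 0) to (-1, 0), on a table with 3 columns and 13 rows.
header_cells = resolve_cells((0, 0), (-1, 0), n_cols=3, n_rows=13)
# A body range, (0, 1) to (-1, -1), on the same table.
body_cells = resolve_cells((0, 1), (-1, -1), n_cols=3, n_rows=13)
print(header_cells)
print(len(body_cells))
```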

In [39]:
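from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, TableStyle, PageBreak


# Create a PDF file
pdf_filename = 'Platypus_Table and image.pdf'
doc = SimpleDocTemplate(pdf_filename, pagesize=A4)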

# Initialize the list of elements to be added to the PDF
elements = []

styles = getSampleStyleSheet()
# Create a paragraph style
title_style= styles['Title']
para_style= styles['Normal']


# write a title
text="Table generated using Platypus"
title = Paragraph(text, title_style)
elements.append(title)

# Create text content with line breaks
page_para="Page 1 <br/> <br/>"
page_para+=" Table with Countries, their Capital and the Continent where the country is located in a table"
content =Paragraph(page_para, para_style)
elements.append(content)

# Add a page break
elements.append(PageBreak())

# Creating a Table
# Define data for the table with long text
table_data = [
    ['Country', 'Capital', 'Continent'],
    ['United States of America', 'Washington D.C.','North America'],
    ['India', 'New Delhi', 'Asia'],
    ['China', 'Beijing', 'Asia'],
    ['Canada', 'Ottawa', 'North America'],
    ['United Kingdom', 'London', 'Europe'],
    ['France', 'Paris', 'Europe'],
    ['Germany', 'Berlin', 'Europe'],
    ['Australia', 'Canberra', 'Australia'],
    ['Japan', 'Tokyo', 'Asia'],
    ['South Africa', 'Pretoria', 'Africa'],
    ['Brazil', 'Brasília', 'South America'],
    ['Argentina', 'Buenos Aires', 'South America'],
]

# Create the table object
table = Table(table_data, colWidths=[150, 120, 120])

# Define table styles
table_style = [
    ('BACKGROUND', (0, 0), (-1, 0), (0.8, 0.8, 0.8)),
    ('TEXTCOLOR', (0, 0), (-1, 0), (1, 1, 1)),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
    ('BACKGROUND', (0, 1), (-1, -1), (0.9, 0.9, 0.9)),
    ('WORDWRAP', (0, 1), (-1, -1), True),  # Enable word wrap for all cells except the header row
]

# Apply styles to the table
table.setStyle(TableStyle(table_style))



# Add the table to the document
elements.append( table)
doc.build(elements)

In [46]:
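from reportlab.platypus import TableStyle

# getCommands() is an instance method: call it on a TableStyle object
# to list the style commands it holds.
TableStyle(table_style).getCommands()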

### Adding an Image to a PDF Document
With Platypus, images are added using the `Image` class from the `reportlab.platypus` module:

- Instantiate an `Image` object, providing the desired image's file path.
- Set the image's dimensions with the `drawHeight` and `drawWidth` properties. 
    - Multiplying by the `inch` constant converts a size in inches into the points used by the PDF.
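
Setting `drawHeight` and `drawWidth` independently can stretch an image if the two values do not match its aspect ratio. The following sketch computes a width and height that fit inside a target box while keeping the proportions; `fit_within` and `INCH` are illustrative names assumed for this example, and the 800 x 600 pixel size is made up.

```python
def fit_within(image_width, image_height, max_width, max_height):
    """Scale (image_width, image_height) to fit inside a box, keeping the aspect ratio."""
    scale = min(max_width / image_width, max_height / image_height)
    return image_width * scale, image_height * scale


INCH = 72  # points per inch

# An 800 x 600 pixel image fitted into a 4.5 x 6 inch area.
draw_width, draw_height = fit_within(800, 600, 4.5 * INCH, 6 * INCH)
print(draw_width, draw_height)
```

The two results are the values that would be assigned to `drawWidth` and `drawHeight`.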

In [30]:
# Adding an Image to a PDF Document
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Image
from reportlab.lib.units import inch
output_file = 'Platypus_Table and image.pdf'
doc = SimpleDocTemplate(output_file, pagesize=letter)

# Path to your image
image_path = "af_short.png"

# Create an Image object
img = Image(image_path)
# Optionally, you can resize the image
img.drawHeight = 6 * inch
img.drawWidth = 4.5 * inch

# Collect the items to be added to the PDF in a list and build the document
elements = [img]
doc.build(elements)

### The End Result Looks Like This:

In [None]:
from io import BytesIO
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import mm, inch
PAGESIZE = (140 * mm, 216 * mm)
BASE_MARGIN = 5 * mm
class PdfCreator:
    def add_page_number(self, canvas, doc):
        canvas.saveState()
        canvas.setFont('Times-Roman', 10)
        page_number_text = "%d" % (doc.page)
        canvas.drawCentredString(
            0.75 * inch,
            0.75 * inch,
            page_number_text
        )
        canvas.restoreState()
    def get_body_style(self):
        sample_style_sheet = getSampleStyleSheet()
        body_style = sample_style_sheet['BodyText']
        body_style.fontSize = 18
        return body_style
    def build_pdf(self):
        pdf_buffer = BytesIO()
        my_doc = SimpleDocTemplate(
            pdf_buffer,
            pagesize=PAGESIZE,
            topMargin=BASE_MARGIN,
            leftMargin=BASE_MARGIN,
            rightMargin=BASE_MARGIN,
            bottomMargin=BASE_MARGIN
        )
        body_style = self.get_body_style()
        flowables = [
            Paragraph("First paragraph", body_style),
            Paragraph("Second paragraph", body_style)
        ]
        my_doc.build(
            flowables,
            onFirstPage=self.add_page_number,
            onLaterPages=self.add_page_number,
        )
        pdf_value = pdf_buffer.getvalue()
        pdf_buffer.close()
        return pdf_value

# ReportLab Functions

## Basic PDF Generation

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf(output_path, text):
    c = canvas.Canvas(output_path, pagesize=letter)
    width, height = letter
    c.drawString(100, height - 100, text)
    c.save()

## Adding Multiple Pages

Easily generate multi-page PDFs with varying content.
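
When the text for a page is longer than one line, the script has to decide where each line goes and when to start a new page. Here is a minimal sketch of that bookkeeping using only the standard library; `layout_lines` and its parameters are hypothetical names, and in a real script each `(y, text)` pair would be drawn with `drawString()` and each page closed with `showPage()`.

```python
def layout_lines(lines, page_height, top_margin, bottom_margin, leading):
    """Split lines into pages and compute a y coordinate for each line.

    Returns a list of pages; each page is a list of (y, text) pairs.
    y is measured from the bottom of the page, as on a PDF canvas.
    """
    pages = [[]]
    y = page_height - top_margin
    for text in lines:
        if y < bottom_margin:          # no room left: start a new page
            pages.append([])
            y = page_height - top_margin
        pages[-1].append((y, text))
        y -= leading                   # move down by one line
    return pages


lines = [f"Line {n}" for n in range(1, 8)]
pages = layout_lines(lines, page_height=200, top_margin=50,
                     bottom_margin=50, leading=30)
for number, page in enumerate(pages, start=1):
    print(number, page)
```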

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_multipage_pdf(output_path, texts):
    c = canvas.Canvas(output_path, pagesize=letter)
    width, height = letter
    for text in texts:
        c.drawString(100, height - 100, text)
        c.showPage()  # Move to the next page
    c.save()

## Incorporating Graphics

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf_with_graphics(output_path):
    c = canvas.Canvas(output_path, pagesize=letter)
    c.line(100, 200, 400, 500)  # Draw a line
    c.circle(300, 300, 50)     # Draw a circle
    c.save()

## Styling Text

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.lib.colors import blue

def styled_text_pdf(output_path, text):
    c = canvas.Canvas(output_path, pagesize=letter)
    # Start a text object with its origin at (10, 730)
    text_object = c.beginText(10, 730)
    text_object.setFont("Times-Roman", 12)
    text_object.setFillColor(blue)
    text_object.textLines(text)
    c.drawText(text_object)
    c.save()

## Adding Images to PDFs

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def add_image_to_pdf(output_path, image_path):
    c = canvas.Canvas(output_path, pagesize=letter)
    c.drawInlineImage(image_path, 100, 600, width=200, height=200)  # Adjust width and height as needed
    c.save()

## Tables in PDFs

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table

def create_pdf_with_table(output_path, data):
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    table = Table(data)
    story = [table]
    doc.build(story)

## Adding Barcodes

Integrate barcodes for a variety of applications in your PDF.

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.graphics.barcode import code128

def barcode_pdf(output_path, barcode_value):
    c = canvas.Canvas(output_path, pagesize=letter)
    barcode = code128.Code128(barcode_value)
    barcode.drawOn(c, 100, 650)
    c.save()

# Working with PDFs 
- Part of Python's flexibility is that it can work with almost any form of data, from
    - JSON, 
    - Excel sheets, 
    - text files, 
    - APIs, or even 
    - PDFs.
    
- PDF stands for Portable Document Format.
- A PDF file can contain different elements, such as text, images, tables, or forms.

## Text Extraction using the `PyPDF2` library
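- Use the `PyPDF2` library to read the PDF. PyPDF2 3.x exposes the `PdfReader` and `PdfWriter` classes; the older `PdfFileReader` and `PdfFileWriter` names were removed in that release. The project has since continued under the name `pypdf`, which is used in later sections.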

In [None]:
!pip install PyPDF2

In [None]:
# importing required modules 
import PyPDF2 
from PyPDF2 import PdfReader

pdf_path='jeff-bezos-amazon-shareholder-letters-1997_2020.pdf'
pdf = PdfReader(str(pdf_path))

# Evaluating the pdf variable displays a PyPDF2 PdfReader object.

In [None]:
pdf

In [None]:
# check the number of pages in the document.
len(pdf.pages)

In [None]:
# look at its metadata
pdf.metadata

# Returns a dictionary-like object holding the metadata of the PDF file. 
# It gives information such as the creator, creation date, or title of the document.

# Types of data that can currently be extracted: Author, Creator, Producer, Subject, Title, Number of pages

from PyPDF2 import PdfReader

sample_path = "sample.pdf"
sample_pdf = PdfReader(sample_path)
# metadata can be None when the file carries no document information
information = sample_pdf.metadata
number_of_pages = len(sample_pdf.pages)
info = f"""
    Information about {sample_path}: 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
"""

print(info)

#### Another Method

In [None]:
# importing required modules 
import PyPDF2 
  
# creating a pdf file object 
# Read the PDF with Python's open() function.
# PDFs are binary files, so open in binary mode with 'rb' rather than text mode.
pdfFileObj = open('file.pdf', 'rb') 
  
# creating a pdf reader object 
# Pass the binary file object to PdfReader(), which parses the PDF content.
pdfReader = PyPDF2.PdfReader(pdfFileObj) 
  
# printing number of pages in pdf file
# To verify that the PDF was read successfully, count its pages:
# len(pdfReader.pages) returns an integer.
print(len(pdfReader.pages)) 

# closing the pdf file object 
pdfFileObj.close() 

### Let’s print the text from the first page of the document.

In [None]:
first_page = pdf.pages[0]

In [None]:
first_page.extract_text()

### Extract text from every page, one by one, by running it in a loop

In [None]:
for page in pdf.pages:
    print(page.extract_text(),end='\n' + '\n _______________Next Page')

### Extracting Content from a PDF: Method 2

In [None]:
# importing modules 
import PyPDF2 
  
pdfFileObj = open('file.pdf', 'rb') 
  
# creating a pdf reader object 
pdfReader = PyPDF2.PdfReader(pdfFileObj) 

# creating a page object 
# Select a specific page by index and store it in a variable named pageObj
pageObj = pdfReader.pages[0] 

# Then call the extract_text() method on it to extract the page's text
print(pageObj.extract_text()) 
  
pdfFileObj.close()

### Extract Content across all pages

We need the page count so the loop knows where to stop.

Use `len(pdfReader.pages)` to get the page count, then run a `while` loop with the condition `i < pagecount`. The loop body fetches a page by index and prints its text.

The fixed page index 0 from the previous example becomes `i`. On every iteration `i` increases by 1, so the loop walks through the whole PDF.

In [None]:
# importing modules 
import PyPDF2 
  
pdfFileObj = open('file.pdf', 'rb') 
  
# creating a pdf reader object 
pdfReader = PyPDF2.PdfReader(pdfFileObj) 

pagecount = len(pdfReader.pages)

i = 0

while i < pagecount:
    pageObj = pdfReader.pages[i] 
    print(pageObj.extract_text())
    i += 1

# close the file only after every page has been read
pdfFileObj.close() 

## Extracting data using `pypdf` module

In [None]:
!pip install pypdf

### Read text

In [None]:
# Import libraries
from pypdf import PdfReader

# Import PDF files
reader = PdfReader("msa.pdf")

# Get number of pages
number_of_pages = len(reader.pages)

# Get pages. In this case, get the first page.
page = reader.pages[0]

# Extract text
text = page.extract_text()
print(text)

## Splitting documents page by page using `PyPDF2`

Split the PDF and store each page as a separate PDF file.

In [None]:
# importing libraries
# import the reader and writer classes of PyPDF2
from PyPDF2 import PdfReader, PdfWriter


# function taking the PDF file name 
def split(path):
    pdf = PdfReader(path)
    # iterate over the PDF page by page, using the page count as the range
    for page in range(len(pdf.pages)):
        # create a new writer, add the current page to it, and write it out as its own PDF file
        pdf_writer = PdfWriter()
        pdf_writer.add_page(pdf.pages[page])

        output = f'page{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)


filename = 'sample.pdf'
split(filename)

## Split PDF using `pypdf`

In [None]:
# Import libraries
from pypdf import PdfReader, PdfWriter

# Import PDF
reader = PdfReader("msa.pdf")

# Create object for writing
writer = PdfWriter()

# Extract the 0th element (1st page of the PDF)
first_page = reader.pages[0]

# Add to object for writing
writer.add_page(first_page)

# Write to file
with open("msa_p1.pdf", "wb") as fp:
    writer.write(fp)

## Merging documents page by page using `PyPDF2`

Merging needs only a small change: this time there are two loops. The outer loop goes through the PDF documents one by one, reading each file named in the list passed to the function; the inner loop goes through the pages of each document.

The procedure splits into 3 steps:

- Reading pdf one by one
- Writing each page of pdf
- Output the pdf file
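
The merged document follows the order of the filename list, so that order matters. Plain string sorting puts `page10.pdf` before `page2.pdf`, which is easy to hit with per-page files like the ones the split example writes. A small sketch of a natural sort key, using only the standard library (`natural_key` is an illustrative name, not part of PyPDF2):

```python
import re


def natural_key(filename):
    """Sort key that compares embedded numbers by value, not character by character."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", filename)]


filenames = ["page10.pdf", "page2.pdf", "page0.pdf", "page1.pdf"]
ordered = sorted(filenames, key=natural_key)
print(ordered)
```

The `ordered` list can then be passed to the merge function in place of a hand-written list.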

In [None]:
from PyPDF2 import PdfReader, PdfWriter

# Merge the PDF files named in `filenames` (two files in this example)
def merge_pdfs(filenames):
    pdf_writer = PdfWriter()
    # Read each PDF in turn and add every one of its pages to the writer.
    # The merged file is written once, after all documents have been processed.
    for path in filenames:
        pdf_reader = PdfReader(path)
        for page in range(len(pdf_reader.pages)):
            # Add each page to the writer object
            pdf_writer.add_page(pdf_reader.pages[page])

    # Write out the merged PDF
    with open("merge.pdf", 'wb') as out:
        pdf_writer.write(out)


filenames = ['doc1.pdf', 'doc2.pdf']
merge_pdfs(filenames)

## Combining PDFs using `pypdf`

In [None]:
# Merge PDFs
from pypdf import PdfWriter

merger = PdfWriter()

for pdf in ["msa.pdf", "foreword.pdf"]:
    merger.append(pdf)

merger.write("merged-pdf.pdf")
merger.close()

## Rotating PDF

The function below rotates every page of a PDF document. 

The rotation is passed as an integer number of degrees, applied clockwise, and it must be a multiple of 90. 
- Passing 90 rotates each page in the PDF document by 90 degrees.
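
Page rotation in a PDF is stored in quarter turns, so an angle such as 45 cannot be applied this way. A small guard, sketched here with the hypothetical helper `normalise_rotation`, can validate the angle before it is handed to the library:

```python
def normalise_rotation(angle):
    """Return the equivalent clockwise rotation in {0, 90, 180, 270}.

    Raises ValueError when the angle is not a multiple of 90 degrees.
    """
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360


print(normalise_rotation(270))
print(normalise_rotation(450))
print(normalise_rotation(-90))
```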

In [None]:
# importing the required modules 
from PyPDF2 import PdfReader, PdfWriter
  
def PDFrotate(Filename, output_name, rotation): 
    # open the source PDF in binary mode
    with open(Filename, 'rb') as pdfFile:
        pdfReader = PdfReader(pdfFile) 
        pdfWriter = PdfWriter() 
      
        # rotating each page 
        for pageObj in pdfReader.pages: 
            # rotate() turns the page clockwise; rotation must be a multiple of 90
            pageObj.rotate(rotation) 
  
            # adding rotated page object to pdf writer 
            pdfWriter.add_page(pageObj) 
  
        # writing rotated pages to the new file while the source is still open
        with open(output_name, 'wb') as output:
            pdfWriter.write(output) 
      
PDFrotate("sample.pdf","rotate.pdf",270) 

## Image loading

To extract the images embedded in a PDF page:

In [None]:
# import modules 
import os
from pypdf import PdfReader
from PIL import Image
# Import PDF 
# Please change the directory path 
reader = PdfReader(r"C:\xxxxxxx\PyPDF\WPIEA2019065.pdf")
output_dir = r"C:\xxxxxxx\PyPDF"
# Get pages. In this case, get the first page.
page = reader.pages[0]
# Loop once for each image on the page 
for count, image_file_object in enumerate(page.images):
    # image_file_object.name already includes a file extension (e.g. Image113.png),
    # so the saved file is named like 0Image113.png
    out_path = os.path.join(output_dir, str(count) + image_file_object.name)
    with open(out_path, "wb") as fp:
        # Write out the image
        fp.write(image_file_object.data)
    # Open the saved image file with Pillow to check that it is readable
    img = Image.open(out_path)
    print(out_path, img.size)

# Working with text Data
- Leverage NLTK, a leading Python platform for working with human language data.

In [None]:
!pip install nltk

In [None]:
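import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
for pkg in ('punkt', 'punkt_tab', 'stopwords'): nltk.download(pkg)  # one-time data download; newer NLTK releases use 'punkt_tab'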

##  Extracting Text from PDFs

In [None]:
# First, extract text from the PDF
import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += page.extract_text() or ''
    return text

In [None]:
# Then, process and analyze the extracted text
def analyze_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    
    # Removing punctuation and making everything lowercase
    tokens = [word.lower() for word in tokens if word.isalpha()]
    
    # Removing stopwords (common words that don't add significant meaning)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    return filtered_tokens

In [None]:
resume_text = extract_text_from_pdf('Umzubongile Siphamandla Mandindi CV.pdf')
keywords = analyze_text(resume_text)
print(keywords)
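
Printing the full token list can be hard to read for a long document. A minimal sketch of one way to summarize it is to count the tokens and show only the most frequent ones. `top_keywords` is a hypothetical helper, and the sample tokens below are made up to stand in for the output of `analyze_text`.

```python
from collections import Counter


def top_keywords(tokens, n=10):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


# Stand-in for the list returned by analyze_text(); these tokens are made up
sample_tokens = ["python", "pdf", "python", "report", "pdf", "python"]
print(top_keywords(sample_tokens, n=2))
```

With real data, the call would be `top_keywords(keywords)`.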

## Extracting Text from PDFs 2

In [None]:
import fitz

def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

# OCR

## Extracting Text from Scanned PDFs

### Library: Pytesseract

For scanned PDFs, whose pages are images rather than text, Pytesseract (a Python wrapper for the Tesseract OCR engine) is a common choice for OCR.

In [None]:
from PIL import Image
import fitz
import pytesseract

def ocr_extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    page = doc[0]  # OCR the first page only
    # Render the page to a pixmap and wrap its raw RGB samples in a PIL image
    pix = page.get_pixmap()
    image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    text = pytesseract.image_to_string(image)
    doc.close()
    return text

## Determining PDF Searchability in a Scan

Not every PDF contains selectable text:
- scanned PDFs are non-searchable, 
- while PDFs that were not scanned are searchable. 

Knowing which kind you have matters, because it determines whether plain text extraction works or OCR is needed.

A scanned PDF is essentially an image file. It is created by scanning a paper document and saving it as a PDF. As a result, the document is non-searchable: you cannot select and copy text from it or use search functions to find specific information. 

Scanned PDFs can be challenging to work with, especially when it comes to editing or extracting data.

##### To determine if a PDF is searchable or not:

In [None]:
from pdfminer.pdfpage import PDFPage

def get_pdf_searchable_pages(fname):
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")
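
The `Font` check above is a heuristic. A complementary approach, sketched below under stated assumptions, is to try normal text extraction first and fall back to OCR only for pages that yield almost no text. The function name, the `ocr_page` callable and the `min_chars` threshold are all hypothetical; in practice `ocr_page` could wrap a Pytesseract call for the given page.

```python
def extract_with_ocr_fallback(page_texts, ocr_page, min_chars=20):
    """Return one text string per page, using OCR where extraction found too little.

    page_texts: list of strings from a normal text extractor (one per page)
    ocr_page:   callable taking a 0-based page index and returning OCR text
    min_chars:  assumed threshold below which a page is treated as scanned
    """
    results = []
    for index, text in enumerate(page_texts):
        if len((text or "").strip()) >= min_chars:
            results.append(text)
        else:
            results.append(ocr_page(index))
    return results


# Demo with stand-in data: the second page has no extractable text
texts = ["A digitally generated page with plenty of text.", ""]
print(extract_with_ocr_fallback(texts, lambda i: f"<OCR text of page {i}>"))
```

Passing the OCR step in as a callable keeps the decision logic independent of any particular OCR library.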

## Creating a new text file from extracted text

In [None]:
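from pypdf import PdfReader

pdf = PdfReader("sample.pdf")  # assumed input file; change as needed
with open('new.txt', mode="w", encoding="utf-8") as output_file:
    for page in pdf.pages:
        text = page.extract_text()
        output_file.write(text)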

## Creating a new PDF file from an existing file

In [None]:
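from pypdf import PdfReader, PdfWriter

pdf_writer = PdfWriter()
pdf_reader = PdfReader("sample.pdf")
# Copy every page of the existing file into the writer
for page in pdf_reader.pages:
    pdf_writer.add_page(page)

# Write the copied pages out as a new PDF (the output name is an assumption)
with open("sample-copy.pdf", "wb") as new_pdf:
    pdf_writer.write(new_pdf)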

## Merging Multiple PDFs

In [None]:
# Python script to merge multiple PDFs into a single PDF
import PyPDF2

def merge_pdfs(input_paths, output_path):
    pdf_merger = PyPDF2.PdfMerger()
    for path in input_paths:
        pdf_merger.append(path)
    # Write the merged document once, after all inputs have been appended
    with open(output_path, 'wb') as f:
        pdf_merger.write(f)
    pdf_merger.close()

## Merging Multiple PDFs 2

### Library: Fitz

Join multiple PDFs into a single document with ease.

In [None]:
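import fitz

def merge_pdfs(pdf_list, output_path):
    merged = fitz.open()  # new, empty PDF
    for pdf in pdf_list:
        # Close each source document as soon as its pages are copied
        with fitz.open(pdf) as doc:
            merged.insert_pdf(doc)
    merged.save(output_path)
    merged.close()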

## Merge PDFs with Python 3

In [None]:
import os
from pypdf import PdfWriter


input_dir = "C:/Users/" #your input directory path


merge_list = []

# Sort the names so the merge order is predictable
for x in sorted(os.listdir(input_dir)):
    if not x.lower().endswith('.pdf'):
        continue
    merge_list.append(os.path.join(input_dir, x))

merger = PdfWriter()

for pdf in merge_list:
    merger.append(pdf)

merger.write("C:/Users/merged_pdf.pdf") #your output directory
merger.close()
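
Because the script above writes its output into the same directory it reads from, a second run would pick up the merged file as an input. A minimal sketch of one way to avoid that is shown below, using only the standard library. `collect_pdfs` is a hypothetical helper, and the demo runs against a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def collect_pdfs(input_dir, output_name="merged_pdf.pdf"):
    """Return the PDF paths in input_dir, sorted by name, skipping the output file."""
    return sorted(
        str(path)
        for path in Path(input_dir).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and path.name != output_name
    )


# Demo: create empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["b.pdf", "a.pdf", "notes.txt", "merged_pdf.pdf"]:
        (Path(demo_dir) / name).touch()
    found = [Path(p).name for p in collect_pdfs(demo_dir)]

print(found)
```

The returned list can be used in place of `merge_list` in the loop that appends files to the merger.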

## Adding Password Protection

In [None]:
# Python script to add password protection to a PDF
import PyPDF2

def add_password_protection(input_path, output_path, password):
    with open(input_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        pdf_writer = PyPDF2.PdfWriter()
        for page in pdf_reader.pages:
            pdf_writer.add_page(page)
        # Encrypt once, after all pages are added, then write a single output file
        pdf_writer.encrypt(password)
        with open(output_path, 'wb') as output_file:
            pdf_writer.write(output_file)

## Encrypt PDF File

Sensitive documents should not be readable by just anyone. You can add encryption to a PDF file with the writer's `encrypt` method in PyPDF2.

In [None]:
from PyPDF2 import PdfWriter, PdfReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfWriter()
    pdf_reader = PdfReader(input_pdf)

    # Copy all pages into the writer before encrypting
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

    # User password, no separate owner password, 128-bit encryption
    pdf_writer.encrypt(password, None, use_128bit=True)
    with open(output_pdf, 'wb') as f:
        pdf_writer.write(f)

add_encryption('sample.pdf','encrypted.pdf','Coder101')
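
The example call above hard-codes its password. As a minimal sketch, a random password can be generated with the standard library's `secrets` module instead. `generate_password` and its default length are assumptions made for this example.

```python
import secrets
import string


def generate_password(length=16):
    """Return a random password drawn from ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = generate_password()
print(len(password))
```

The generated value can be passed as the `password` argument of the encryption function and should be stored somewhere safe, since it is needed to open the file.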

## Splitting a PDF Using Bookmarks

### Library: Fitz

Splitting a PDF based on its bookmarks becomes straightforward with fitz.

In [None]:
import fitz

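def split_pdf_by_bookmarks(pdf_path):
    doc = fitz.open(pdf_path)
    # Each TOC entry is [level, title, page]; page numbers are 1-based
    bookmarks = doc.get_toc()
    for i, (level, title, start_page) in enumerate(bookmarks):
        # A section runs up to the page before the next bookmark (or the last page)
        if i + 1 < len(bookmarks):
            end_page = max(bookmarks[i + 1][2] - 1, start_page)
        else:
            end_page = doc.page_count
        new_pdf = fitz.open()
        # insert_pdf expects 0-based page indices
        new_pdf.insert_pdf(doc, from_page=start_page - 1, to_page=end_page - 1)
        new_pdf.save(f"{title}.pdf")
        new_pdf.close()
    doc.close()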
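
Bookmark titles are used as file names above, but a title may contain characters such as `/` or `:` that are not valid in file names on every platform. The following is a minimal sketch of a sanitizing step. `safe_filename` is a hypothetical helper, and the set of replaced characters is an assumption based on Windows file-name rules.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Turn a bookmark title into a string that is safe to use as a file name."""
    # Replace path separators, control characters and characters Windows rejects
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Drop surrounding whitespace and trailing dots or spaces
    cleaned = cleaned.strip().rstrip(". ")
    return cleaned or fallback


print(safe_filename("Chapter 1: Intro/Overview"))
print(safe_filename("   "))
```

The save call would then use `f"{safe_filename(title)}.pdf"`. Two bookmarks with the same title would still overwrite each other, so a running index could be added to the name if that matters.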

## Rendering PDFs to Image

### Library: Fitz

Transforming a PDF to an image is also possible with Fitz.

In [None]:
import fitz

def pdf_to_image(pdf_path, image_path):
    doc = fitz.open(pdf_path)
    page = doc[0]
    pixmap = page.get_pixmap()
    pixmap.save(image_path)

## Searching for Text in PDFs

### Library: Fitz

Efficiently search for specific text within your PDF.

In [None]:
import fitz
def search_text(pdf_path, query):
    doc = fitz.open(pdf_path)
    occurrences = []
    for page in doc:
        # search_for returns a list of Rect objects, one per hit
        found = page.search_for(query)
        for rect in found:
            occurrences.append((page.number, rect))
    return occurrences

# Working With PDF Annotations in Python

## Annotating PDFs
### Library: Fitz

What are Annotations?

One remarkable feature of PDF documents is annotations: an umbrella term that covers a variety of interactive objects that can be placed on top of the PDF content.

Examples:
- Textual notes, 
- highlights and file attachments

They offer an engaging way to enrich a document’s content, facilitate reader interaction, and provide feedback or clarification where necessary.

Adding annotations or comments to a PDF is straightforward with Fitz.

##### Marking Text
The following annotations increase the visibility of selected portions of text and do what their names suggest:
- Highlight, 
- Underline, 
- Squiggly (zigzag), and 
- StrikeOut.

##### Drawing Graphical Elements
To create simple graphics, use any of the annotations Line, Square, Circle, Polygon, PolyLine, or Ink (freehand drawing simulation).

##### Providing Meta-Information
A Stamp annotation looks like a rubber stamp and lets you visibly mark a page as being “Draft”, “Confidential”, “Approved” or similar.

##### Caret
A Caret annotation indicates that text on this page has been modified. This can help focus attention on the affected pages, for example while negotiating contract wording.

##### Commenting, Adding Information and Redacting

- A `Text` annotation is represented by an icon, which exhibits text when clicked. It is very similar to a sticky note and it is used to comment on some part of the page. A visual example is shown further down.

- A `FreeText` annotation has a similar purpose, but its text is directly visible and not hidden “inside” an icon.

- `FileAttachment` annotations are similar to Text but allow storing complete files “inside” an icon.

- A `Redact` annotation is used to mark some part of the page as a candidate for removal.
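
As a quick reference, the annotation types described above can be mapped to the PyMuPDF `Page` methods that create them. The lookup table below is a sketch written for this section; `method_for` is a hypothetical helper, and each method takes its own arguments (a point, a rectangle, a list of points and so on), so the PyMuPDF documentation should be checked before calling one.

```python
# Annotation type -> name of the PyMuPDF Page method that creates it
ANNOTATION_METHODS = {
    "Highlight": "add_highlight_annot",
    "Underline": "add_underline_annot",
    "Squiggly": "add_squiggly_annot",
    "StrikeOut": "add_strikeout_annot",
    "Line": "add_line_annot",
    "Square": "add_rect_annot",
    "Circle": "add_circle_annot",
    "Polygon": "add_polygon_annot",
    "PolyLine": "add_polyline_annot",
    "Ink": "add_ink_annot",
    "Stamp": "add_stamp_annot",
    "Caret": "add_caret_annot",
    "Text": "add_text_annot",
    "FreeText": "add_freetext_annot",
    "FileAttachment": "add_file_annot",
    "Redact": "add_redact_annot",
}


def method_for(annotation_type):
    """Return the Page method name for an annotation type, or raise ValueError."""
    try:
        return ANNOTATION_METHODS[annotation_type]
    except KeyError:
        raise ValueError(f"unknown annotation type: {annotation_type!r}") from None


print(method_for("Circle"))
```

Given an open page, the method itself can be looked up with `getattr(page, method_for("Circle"))`.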

In [None]:
import fitz

def annotate_pdf(pdf_path, output_path, page_num, point, text):
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    # add_text_annot places the note icon at a point (its top-left corner)
    page.add_text_annot(point, text)
    doc.save(output_path)
    doc.close()

In [None]:
import fitz


doc = fitz.open("input.pdf")  # open the input file
page = doc[0]  # load the desired page


point = fitz.Point(50, 50)  # top-left coordinates of the icon
annot = page.add_text_annot(point, "This is a sticky note.")


doc.save("input-annotated.pdf")  # save changes in a new file
# doc stays open here because the following cells keep working with this page

In [None]:
# We need a rectangle into which the circle (ellipse) is drawn
rect = fitz.Rect(100, 100, 300, 250)


# Define colors for border (stroke) and interior (fill)
blue = fitz.pdfcolor["blue"]
yellow = fitz.pdfcolor["yellow"]


annot = page.add_circle_annot(rect)


# Colorize the new annotation
annot.set_colors(stroke=blue, fill=yellow)


# Give it a dashed and cloudy border
annot.set_border(dashes=(2, 2), clouds=2)


# Set the opacity to 30% (i.e. 70% transparent)
annot.set_opacity(0.3)


# Overwrite annotation defaults
annot.update()

In [None]:
# ways to access existing annotations of a page, display their properties, and delete or modify them.
# Iterate over the page’s annotations and print basic properties
for annot in page.annots():
    print(f"{annot!r}, '{annot.info['content']}'")


# Output:
# 'Text' annotation on page 0 of input.pdf, 'This is a sticky note.'
# 'Circle' annotation on page 0 of input.pdf, ''