# Use Python to Manage PDF Files

## Reading PDF Files with PdfReader

<p>To create a new instance of the PdfReader class, you’ll need to provide the path to the PDF file that you want to open. You can do that using the pathlib module:</p>

In [28]:
from pathlib import Path

pdf_path = (
    Path.home()  # your home directory, for example C:\Users\name
    / "Applied Python Files"  # folder that holds the document
    / "Intro.pdf"  # document name
)
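<p>Before handing a path to a PDF library, it can help to inspect what pathlib actually built. The following is a minimal sketch; the folder and file names are placeholders rather than paths that are guaranteed to exist on your machine:</p>

```python
from pathlib import Path

# Placeholder folder and file names; substitute your own.
pdf_path = Path.home() / "Applied Python Files" / "Intro.pdf"

print(pdf_path.name)      # file name including the extension
print(pdf_path.suffix)    # extension only
print(pdf_path.exists())  # whether a file really lives at that location
```

<p>Checking .exists() first gives a clearer signal than the error raised when a reader is pointed at a missing file.</p>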

<p>Now create the PdfReader instance by calling the class’s constructor with the path to your PDF file as an argument:
</p>

In [57]:
from pypdf import PdfReader
pdf_reader = PdfReader(pdf_path)

<p>Now that you’ve created a PdfReader instance, you can use it to gather information about the PDF file. For example, to get the number of pages in the PDF file, pass the reader’s .pages attribute to the built-in len() function:</p>

In [58]:
len(pdf_reader.pages)


2

<p>You can also access some document information using the .metadata attribute:</p>

In [32]:
pdf_reader.metadata

{'/Title': 'Intro script', '/Producer': 'Skia/PDF m93 Google Docs Renderer'}

<p>Each metadata entry is also available as an attribute of the .metadata object. For example, to get the title, use the .title attribute:</p>

In [33]:
pdf_reader.metadata.title

'Intro script'

<p>The .metadata object contains the PDF’s metadata, which is set when the file is first created. The PdfReader class provides all the necessary methods and attributes that you need to access data in a PDF file.</p>

## Extracting Text From a Page

<p>To extract text from a PDF page, follow these steps:</p>
<ol>
    <li>Get a PageObject with PdfReader.pages[page_index].</li>
    <li>Extract the text as a string with the PageObject instance’s .extract_text() method.</li>
</ol>

In [34]:
first_page = pdf_reader.pages[0]
type(first_page)

pypdf._page.PageObject

In [35]:
print(first_page.extract_text())

Python Programming language is one of the most popular
programming languages for everyone. Why?
Python can be used for Data Science, Machine Learning,
Server-side , Desktop apps  and many other areas.
Also, It
has a large library so you don’t have to write your
own
code for every single thing.
Because of this, companies want to hire Python
developers, and Python Data Analysts and they are
some
of the most well-paid people in the industry .
The goal of this diploma,
is to provide you with the fundamentals that you
need to
know ,
to code with Python and to use Python for data handling,
analysis and visualization.
And by the end of this course, you’ll be able to create
real
python programs and analyze data using Python libraries
and tools.
I will show you how to program with Python easily
and
efficiently ,


In [36]:
for page in pdf_reader.pages:
     print(page.extract_text())

Python Programming language is one of the most popular
programming languages for everyone. Why?
Python can be used for Data Science, Machine Learning,
Server-side , Desktop apps  and many other areas.
Also, It
has a large library so you don’t have to write your
own
code for every single thing.
Because of this, companies want to hire Python
developers, and Python Data Analysts and they are
some
of the most well-paid people in the industry .
The goal of this diploma,
is to provide you with the fundamentals that you
need to
know ,
to code with Python and to use Python for data handling,
analysis and visualization.
And by the end of this course, you’ll be able to create
real
python programs and analyze data using Python libraries
and tools.
I will show you how to program with Python easily
and
efficiently ,
And work with data libraries such as NumPy , Pandas,
Matplotlib, seaborn and more, practically and in a
short
time.
My name is noro and If you encounter any problems
in this
course, I’m
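<p>The extracted text above keeps the line breaks of the PDF layout, so sentences are split in odd places and some punctuation is preceded by a stray space. One possible cleanup is sketched below with a hypothetical helper, tidy_extracted_text(). It assumes that you want a single flowing paragraph, so it also discards genuine paragraph breaks:</p>

```python
import re


def tidy_extracted_text(raw: str) -> str:
    """Collapse layout-driven line breaks and stray spaces into flowing text."""
    # Treat every run of whitespace (including newlines) as a single space.
    text = re.sub(r"\s+", " ", raw).strip()
    # Remove the space that extraction sometimes leaves before punctuation.
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


sample = (
    "Server-side , Desktop apps  and many other areas.\n"
    "Also, It\n"
    "has a large library so you don’t have to write your\n"
    "own\n"
    "code for every single thing."
)
cleaned = tidy_extracted_text(sample)
print(cleaned)
```

<p>In practice, you would pass the string returned by page.extract_text() instead of the sample text.</p>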

## Putting It All Together

<p>Open a new editor window in IDLE, create a new .py file called save_to_txt.py, and type in the following code:</p>

<p>Our code:</p>

In [41]:
from pathlib import Path

from pypdf import PdfReader

pdf_path = (
    Path.home()  # forward slashes below avoid invalid "\p" and "\A" escapes
    / "MACHINE LEARNING/python for machine learnong a step by step guide/Applied Python Files"
    / "Intro.pdf"
)

pdf_reader = PdfReader(pdf_path)
txt_file = Path.home() / "Pride_and_Prejudice.txt"
content = [
    f"{pdf_reader.metadata.title}",
    f"Number of pages: {len(pdf_reader.pages)}"
]

for page in pdf_reader.pages:
    content.append(page.extract_text())

txt_file.write_text("\n".join(content))

1111

<p>Here’s a line-by-line breakdown of how this code works:</p>
<ul>
    <li>Line 1 imports Path from pathlib, while line 3 imports PdfReader.</li>
    <li>Lines 5 to 9 define a Path object containing the path to your target PDF file.</li>
    <li>Line 11 assigns a new PdfReader instance to the pdf_reader variable.</li>
    <li>Line 12 creates a Path object that points to the output .txt file.</li>
    <li>Lines 13 to 16 create a list where you’ll store the content that you’ll save to the .txt file. Initially, this list only contains the PDF title and the number of pages.</li>
    <li>Lines 18 and 19 define a for loop that iterates over the PDF pages, extracts their content as strings, and appends these strings to content.</li>
    <li>Line 21 concatenates all the strings in content using the .join() method and a newline (\n) character as a separator. Finally, it writes the concatenated text into txt_file by taking advantage of .write_text() from Path.</li>
</ul>

<p>When you save and run the program, it’ll create a new file in your home directory called Pride_and_Prejudice.txt containing the title, the page count, and the full text of the Intro.pdf document. Open it up and check it out!</p>
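<p>The number displayed after the code cell (1111 in the run above) is the return value of .write_text(), which reports how many characters were written. The sketch below illustrates that behavior with made-up content and a temporary directory, so it doesn’t touch your home directory:</p>

```python
import tempfile
from pathlib import Path

# Made-up stand-ins for the title, the page count, and the page texts.
content = ["Intro script", "Number of pages: 2", "Text of page 1", "Text of page 2"]

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_file = Path(tmp_dir) / "demo.txt"
    # write_text() returns the number of characters written.
    written = txt_file.write_text("\n".join(content), encoding="utf-8")
    lines = txt_file.read_text(encoding="utf-8").splitlines()

print(written)
print(lines)
```

<p>Passing encoding="utf-8" is optional, but it keeps characters such as curly apostrophes from depending on the platform’s default encoding.</p>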

## Writing to PDF Files With PdfWriter


In [76]:
from pypdf import PdfWriter
pdf_writer = PdfWriter()

In [11]:
page = pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)

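<p>The width and height arguments determine the dimensions of the page in user space units. They’re required here because pdf_writer has no pages yet; when they’re omitted, pypdf falls back to the size of the last page in the writer. One unit equals 1/72 of an inch, so the above code adds a blank page of roughly A4 size (8.27 × 11.7 inches) to pdf_writer.</p>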
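<p>Because user space units are 1/72 of an inch, any paper size given in millimetres can be converted with a little arithmetic. The helper mm_to_points() below is a hypothetical name used only for this sketch:</p>

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def mm_to_points(millimetres: float) -> float:
    """Convert millimetres to PDF user space units (1 unit = 1/72 inch)."""
    return millimetres / MM_PER_INCH * POINTS_PER_INCH


# ISO A4 measures 210 mm x 297 mm.
a4_width = mm_to_points(210)
a4_height = mm_to_points(297)
print(round(a4_width, 2), round(a4_height, 2))  # 595.28 841.89
```

<p>These values are slightly smaller than 8.27 * 72 and 11.7 * 72 because the inch figures used above are rounded.</p>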

<p>The .add_blank_page() method returns a new PageObject instance representing the page that you added to PdfWriter:</p>

In [12]:
type(page)

pypdf._page.PageObject

<p>In this example, you’ve assigned the PageObject instance returned by .add_blank_page() to the page variable, but in practice, you don’t usually need to do this. That is, you usually call .add_blank_page() without assigning the return value to anything:</p>

In [13]:
pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)

{'/Type': '/Page',
 '/Resources': {},
 '/MediaBox': [0.0, 0.0, 595.44, 842.4],
 '/Parent': {'/Type': '/Pages',
  '/Count': 2,
  '/Kids': [IndirectObject(4, 0, 2334728711176),
   IndirectObject(5, 0, 2334728711176)]}}

<p>To write the contents of pdf_writer to a PDF file, pass either a file path or a file object opened in binary write mode to pdf_writer.write():</p>

In [14]:
pdf_writer.write("blank.pdf")

(True, <_io.FileIO [closed]>)

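<p>This creates a new file in your current working directory called blank.pdf. If you open the file with a PDF reader, such as Adobe Acrobat, then you’ll see a document made up of the blank A4-sized pages that you added to pdf_writer. Running both .add_blank_page() calls above gives two pages, which matches the '/Count': 2 entry in the earlier output.</p>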

<p>In the example above, you followed three steps to create a new PDF file using pypdf:</p>
    <ol>
        <li>Create a PdfWriter instance.</li>
        <li>Add one or more pages to the PdfWriter instance, using either .add_blank_page() or .add_page().</li>
        <li>Write to a file using PdfWriter.write().</li>
    </ol>

<p>You’ll see this pattern over and over as you learn various ways to add pages to a PdfWriter instance.</p>

In [77]:
#step 1
pdf_writer = PdfWriter()

In [85]:
# step 2: store the page in its own variable so the PdfWriter class name is not overwritten
page = pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)

In [86]:
pdf_reader = PdfReader('blank.pdf')

In [87]:
pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)

{'/Type': '/Page',
 '/Resources': {},
 '/MediaBox': [0.0, 0.0, 595.44, 842.4],
 '/Parent': {'/Type': '/Pages',
  '/Count': 4,
  '/Kids': [IndirectObject(4, 0, 2334730879304),
   IndirectObject(5, 0, 2334730879304),
   IndirectObject(6, 0, 2334730879304),
   IndirectObject(7, 0, 2334730879304)]}}

In [89]:
len(pdf_writer.pages)

4

## Two Ways to Copy a Page Into a New PDF

In [90]:
# one: open the files with "with" so they are closed automatically
from pypdf import PdfReader, PdfWriter  # pypdf is the maintained successor of PyPDF2

dr = r"C:\GC"  # folder that holds the document
ldr = dr + r"\12L.pdf"
writer = PdfWriter()

with open(ldr, "rb") as f:
    reader = PdfReader(f)
    page = reader.pages[0]  # first page of the source document
    writer.add_page(page)

    # Write while the source file is still open, because page data is read from it lazily.
    with open(dr + r"\new.pdf", "wb") as output_stream:
        writer.write(output_stream)

# two: open and close the files explicitly
# The legacy names PdfFileReader, PdfFileWriter, addPage() and getPage() were
# removed in PyPDF2 3.0, so this version uses their current equivalents.
input_pdf = open("test.pdf", "rb")  # put the directory and document name here
writer = PdfWriter()

reader = PdfReader(input_pdf)
writer.add_page(reader.pages[0])

output_pdf = open("new_test.pdf", "wb")
writer.write(output_pdf)
output_pdf.close()
input_pdf.close()  # close the source only after writing