# Converting PDF to Text
Some websites offer their content as PDF files instead of HTML pages.
[PDFMiner](https://github.com/euske/pdfminer) is one of several libraries that can extract the text content of PDF files.
The example script below downloads the report at https://www.bis.org/publ/arpdf/ar2023e.pdf and extracts its text.

The following classes are important:
1. [PDFPage](https://github.com/euske/pdfminer/blob/master/pdfminer/pdfpage.py) describes the properties of a page and points to its contents.
2. [PDFResourceManager](https://github.com/euske/pdfminer/blob/master/pdfminer/pdfinterp.py) manages shared resources such as fonts and images so that they are reused rather than allocated multiple times.
3. [PDFPageInterpreter](https://github.com/euske/pdfminer/blob/master/pdfminer/pdfinterp.py) is a crucial component that processes the content of individual pages and converts it to a form that can be easily extracted and manipulated.
4. [TextConverter](https://github.com/euske/pdfminer/blob/master/pdfminer/converter.py) writes the extracted text to an output stream, using the location of the text and other layout information to arrange it.
5. [LAParams](https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py) defines the parameters for the layout analysis.

In [1]:
import io
from urllib import request

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from tqdm import tqdm


In [2]:
pdf_url = "https://www.bis.org/publ/arpdf/ar2023e.pdf"
input_file = "report.pdf"
# Download the PDF and save it to disk; the with-block closes both handles.
with request.urlopen(pdf_url) as response, open(input_file, 'wb') as file:
    file.write(response.read())
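
The cell above reads the whole response into memory before writing it to disk. For large reports, a chunked copy may be preferable. The following is a minimal sketch, not part of the original script: the helper names `save_stream` and `download_pdf` and the 64 KiB chunk size are assumptions chosen for illustration.

```python
import io
import shutil
from urllib import request


def save_stream(source, target_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to disk in fixed-size chunks."""
    with open(target_path, "wb") as target:
        # copyfileobj reads at most chunk_size bytes at a time.
        shutil.copyfileobj(source, target, chunk_size)
    return target_path


def download_pdf(url, target_path):
    """Download url to target_path without holding the whole body in memory."""
    with request.urlopen(url) as response:
        return save_stream(response, target_path)
```

Calling `download_pdf(pdf_url, "report.pdf")` would replace the download cell. Because `save_stream` accepts any binary file-like object, it can also be tried offline with `io.BytesIO(b"...")` as the source.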

In [3]:
resMgr = PDFResourceManager()
# io.StringIO is an in-memory, file-like object: text can be written to and read from it as if it were a file.
resData = io.StringIO()
TxtConverter = TextConverter(resMgr, resData, laparams=LAParams())
interpreter = PDFPageInterpreter(resMgr, TxtConverter)
# Interpret each page; the converter appends the extracted text to resData.
with open(input_file, 'rb') as i_f:
    for page in tqdm(PDFPage.get_pages(i_f)):
        interpreter.process_page(page)
txt = resData.getvalue()
TxtConverter.close()
# Save the extracted text as UTF-8.
output_path = "report.txt"
with open(output_path, 'w', encoding='utf-8') as of:
    of.write(txt)

142it [00:24,  5.81it/s]
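
The extraction cell collects the converted text in an `io.StringIO` buffer. The sketch below shows how such a buffer behaves on its own, independent of PDFMiner; the sample strings are made up for illustration.

```python
import io

# Create an empty in-memory text buffer.
buffer = io.StringIO()

# Write to it exactly as one would write to an open text file.
buffer.write("first page\n")
buffer.write("second page\n")

# getvalue() returns everything written so far as a single string.
text = buffer.getvalue()

# The buffer can also be rewound and read like a file.
buffer.seek(0)
first_line = buffer.readline()

buffer.close()
```

Here `text` holds both lines and `first_line` holds only the first one, which mirrors how `resData.getvalue()` returns the text of all processed pages at once.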
