### PDF

The following example uses the iPad tech specs. Since we don't have the original tech spec files as PDFs, we first need to convert them from HTML to PDF using wkhtmltopdf. Download it from the [wkhtmltopdf download page](https://wkhtmltopdf.org/downloads.html) or install it with a package manager. It can also be installed with a single command, as described in [packaging](https://github.com/wkhtmltopdf/packaging). Because this step is not necessarily part of the workflow we had in mind, the installation process is not covered in the main `README.md`.
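Since `pdfkit` is only a wrapper around the wkhtmltopdf binary, it can help to confirm that the binary is reachable before running the conversion cells. The following is a minimal sketch using only the standard library; the helper name `find_wkhtmltopdf` is hypothetical and not part of this notebook or of `pdfkit`.

```python
import shutil


def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the absolute path of the binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf not found on PATH; install it before calling pdfkit")
else:
    print(f"wkhtmltopdf found at {path}")
```

If the binary lives outside `PATH`, its location can be handed to `pdfkit` explicitly with `pdfkit.configuration(wkhtmltopdf=path)` and the `configuration=` argument of `from_url`.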

In [None]:
from pdfkit import from_url
import os
from PyPDF2 import PdfReader
from glob import glob
from dotenv import load_dotenv

In [None]:
URLS = [
    'https://support.apple.com/kb/SP883?viewlocale=en_US&locale=en_US', # Pro
    'https://support.apple.com/kb/SP866?viewlocale=en_US&locale=en_US', # Air
    'https://support.apple.com/kb/SP788?viewlocale=en_US&locale=en_US', # Mini
    'https://support.apple.com/kb/SP884?viewlocale=en_US&locale=en_US', # 10th Gen
    'https://support.apple.com/kb/SP849?viewlocale=en_US&locale=en_US', # 9th Gen
]

FILE_NAMES = [
    'ipad_pro', 
    'ipad_air', 
    'ipad_mini', 
    'ipad_10th_gen', 
    'ipad_9th_gen'
]

OUTPUT_DIR = 'output'

# configure settings for pdf output
options = {
    "page-size": "A4",
    "margin-top": "0mm",
    "margin-right": "0mm",
    "margin-bottom": "0mm",
    "margin-left": "0mm",
    "encoding": "UTF-8",
}
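Each key in `options` is passed through to wkhtmltopdf as a `--key value` command-line flag. The sketch below only illustrates that mapping so the settings can be compared against the wkhtmltopdf manual; `options_to_cli_args` is a hypothetical helper, not how `pdfkit` is implemented internally.

```python
def options_to_cli_args(options):
    """Flatten an options dict into a list of wkhtmltopdf-style CLI arguments."""
    args = []
    for key, value in options.items():
        args.append(f"--{key}")
        # Flags without a value (e.g. toggles) are passed as the bare flag.
        if value not in (None, ""):
            args.append(str(value))
    return args


example_options = {"page-size": "A4", "margin-top": "0mm", "encoding": "UTF-8"}
print(" ".join(["wkhtmltopdf", *options_to_cli_args(example_options), "<url>", "<output.pdf>"]))
```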

### generate pdf from html

In [4]:
# create the output directory if it does not exist yet
os.makedirs(OUTPUT_DIR, exist_ok=True)

for url, pdf_file in zip(URLS, FILE_NAMES):
    pdf_file_name = os.path.join(OUTPUT_DIR, pdf_file + '.pdf') # e.g., output/ipad_pro.pdf
    from_url(url, pdf_file_name, options=options)
    print(f'{pdf_file_name} saved')

output/ipad_pro.pdf saved
output/ipad_air.pdf saved
output/ipad_mini.pdf saved
output/ipad_10th_gen.pdf saved
output/ipad_9th_gen.pdf saved
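The loop above stops at the first URL that fails to convert. A possible variant, sketched here under the assumption that conversion failures surface as `OSError` (which is how `pdfkit` typically reports wkhtmltopdf errors), records the failure and moves on. The function `convert_all` and its `convert` parameter are hypothetical names introduced for illustration.

```python
import os


def convert_all(urls, names, output_dir, convert):
    """Convert each URL to a PDF and return (saved_paths, failures).

    `convert` is any callable taking (url, target_path); it is injected so
    the loop can be exercised without wkhtmltopdf installed.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved, failed = [], []
    for url, name in zip(urls, names):
        target = os.path.join(output_dir, name + ".pdf")
        try:
            convert(url, target)
        except OSError as err:
            # Keep going so one bad page does not block the rest.
            failed.append((url, str(err)))
            continue
        saved.append(target)
    return saved, failed
```

With the names from this notebook, it could be called as `convert_all(URLS, FILE_NAMES, OUTPUT_DIR, lambda u, p: from_url(u, p, options=options))`.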


### convert pdf to text and summarize

In [4]:
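def get_pdf_text(pdf_doc):
    """Return the text of every page in the PDF as one string."""
    pdf_reader = PdfReader(pdf_doc)
    # extract_text() can come back empty for pages without a text layer
    return "".join(page.extract_text() or "" for page in pdf_reader.pages)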

In [35]:
# test_doc = r'output/ipad_pro.pdf'
# pdf_text = get_pdf_text(test_doc)
# pdf_text[:200]

'Languages\n \nEnglish\niPad Pro 12.9-inch (6th generation) - Technical\nSpecifications\nYear introduced: 2022\nIdentify your iPad model\nFinish\nSilver\nSpace Gray\nCapacity\n128GB\n256GB\n512GB\n1TB\n2TB\nSize and W'
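The extracted text comes back fragmented: whitespace-only lines appear between entries, and in longer output list numbers and hyphenated words can be split across several lines (for example `1`, `. `, `Front camera`). Below is a minimal cleanup sketch using regular expressions; `tidy_extracted_text` is a hypothetical helper, and its rules are assumptions tuned to this kind of output, so a line holding a lone `-` that was meant as a bullet would also be joined.

```python
import re


def tidy_extracted_text(text):
    """Rejoin fragments that PDF text extraction split across lines."""
    # "Multi\n‑\nTouch" -> "Multi‑Touch" (ASCII or non-breaking hyphen).
    text = re.sub(r"\s*\n([-\u2011])\s*\n\s*", r"\1", text)
    # "1\n. \nFront camera" -> "1. Front camera".
    text = re.sub(r"(?m)^(\d+)\s*\n\.\s*\n", r"\1. ", text)
    # Strip each line and drop the ones left empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "Languages\n \nEnglish\n1\n. \nFront camera\nMulti\n\u2011\nTouch display"
print(tidy_extracted_text(sample))
```

It can be applied directly to the result of `get_pdf_text` before any summarization step.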

In [10]:
pdf_docs = glob(f'{OUTPUT_DIR}/*.pdf') # all PDF files in the output directory
pdf_docs = ['output/ipad_pro.pdf'] # override: process a single file while testing
for doc in pdf_docs:
    file_name = os.path.basename(doc)
    txt = get_pdf_text(doc)
    print(txt)

    # TODO: implement
    # the rest of the process is already taken care of in Wencheng's code

Languages
 
English
iPad Pro 12.9-inch (6th generation) - Technical
Specifications
Year introduced: 2022
Identify your iPad model
Finish
Silver
Space Gray
Capacity
128GB
256GB
512GB
1TB
2TB
Size and Weight
Wi-Fi models
Width: 8.46 inches (214.9 mm)
Height: 11.04 inches (280.6 mm)
Depth: 0.25 inch (6.4 mm)
Weight: 1.5 pounds (682 grams)
Wi-Fi + Cellular models
Width: 8.46 inches (214.9 mm)
Height: 11.04 inches (280.6 mm)
Depth: 0.25 inch (6.4 mm)
Weight: 1.51 pounds (685 grams)
Buttons and Connectors
1
. 
Front camera
2
. 
Top button
3
. 
Volume buttons
4
. 
Rear cameras
5
. 
Flash
6
. 
LiDAR Scanner
7
. 
Smart Connector
8
. 
Thunderbolt / USB 4 connector
9
. 
SIM tray (Wi-Fi + Cellular)
10
. 
Magnetic connector for Apple Pencil
1
2
In the Box
iPad Pro 12.9-inch (6th generation)
USB-C Charge Cable (1 meter)
20W USB-C Power Adapter
Display
Liquid Retina XDR display
12.9-inch (diagonal) mini-LED backlit Multi
‑
Touch display with IPS technology
2D backlighting system with 2596 full
‑
arra