In [1]:
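# Install the libraries used in this notebook (ollama is needed for the local-model cells)
%pip install openai pdfplumber python-docx pillow ollama
from openai import OpenAI
import pdfplumber, docx, base64
client = OpenAI()  # reads OPENAI_API_KEY from the environment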



#### Extract Information from a PDF File
###### Reads text with pdfplumber and sends it to a local Llama 3 model (via Ollama) and to GPT-5-nano for summarization.

In [2]:
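pdf_path = "data/California_Employment_Offer_Letter.pdf"

def extract_text_from_pdf(path):
    """Return the text of all pages in the PDF, separated by newlines."""
    text = ""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            t = page.extract_text()
            # Skip pages with no extractable text (e.g. scanned images)
            if t:
                text += t + "\n"
    return text.strip()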

pdf_text = extract_text_from_pdf(pdf_path)
pdf_text

'EMPLOYMENT OFFER LETTER (CALIFORNIA)\nAcme Corporation\n1234 Market Street\nSan Francisco, CA 94105\nApril 1, 2023\nPrivate & Confidential\nJane Doe\n567 Oak Avenue\nSan Jose, CA 95112\nDear Jane,\nWe are delighted to extend to you an offer of employment with Acme Corporation (“Company”).\nThis letter sets forth the principal terms and conditions of your employment.\n1. Position and Reporting\nYou will join Acme Corporation as a Senior Software Engineer. Your duties will include, but not be\nlimited to: designing, coding, testing, and maintaining enterprise applications; collaborating with\ncross-functional teams; mentoring junior engineers; and ensuring compliance with internal security\nand data privacy standards. You will report directly to John Smith, Chief Technology Officer. This\nis a full-time, exempt position under California wage and hour law. Your primary work location will\nbe San Francisco, CA, but you may be required to travel up to 10% of your time.\n2. Start Date and D

In [3]:
# Ollama extraction for PDF text
import ollama

prompt = f"Extract and summarize key details from this PDF:\n\n{pdf_text}"

response = ollama.chat(
    model="llama3",
    messages=[
        {"role": "system", "content": "You summarize and extract data from PDF text."},
        {"role": "user", "content": prompt}
    ]
)

print("Extracted Info from PDF:\n")
print(response['message']['content'])

Extracted Info from PDF:

Here are the key details extracted and summarized from the PDF:

**Position and Reporting**

* Job title: Senior Software Engineer
* Duties: designing, coding, testing, maintaining enterprise applications; mentoring junior engineers; ensuring compliance with internal security and data privacy standards
* Supervisor: John Smith, Chief Technology Officer
* Work location: San Francisco, CA (with occasional travel up to 10% of the time)

**Start Date and Duration**

* Expected start date: April 24, 2023 (contingent upon satisfaction of conditions listed in Section 6)
* Offer expiration date: April 15, 2023
* Job duration: At-will employment

**Compensation**

* Base salary: $150,000 per year, paid bi-weekly
* Bonus: Discretionary annual bonus targeted at 10% of base salary
* Equity grant: 10,000 Restricted Stock Units (RSUs), vesting over 4 years with a 1-year cliff (subject to Board approval)

**Benefits and Paid Time Off**

* Eligible for Company benefits as des

In [8]:
# OpenAI API call to summarize and extract key details from the PDF text

prompt = f"Extract and summarize key details from this PDF:\n\n{pdf_text}"

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[
        {"role": "system", "content": "You summarize and extract data from PDF text."},
        {"role": "user", "content": prompt}
    ]
)

print("Extracted Info from PDF:\n")
print(response.choices[0].message.content)

Extracted Info from PDF:

Here are the key details from the Employment Offer Letter (California) to Jane Doe from Acme Corporation.

Document and parties
- Date of letter: April 1, 2023
- Employer: Acme Corporation, 1234 Market Street, San Francisco, CA 94105
- Recipient: Jane Doe, 567 Oak Avenue, San Jose, CA 95112
- Private & Confidential

Position and reporting
- Title: Senior Software Engineer
- Duties (illustrative): designing, coding, testing, and maintaining enterprise applications; collaborating with cross-functional teams; mentoring junior engineers; ensuring compliance with security and data privacy standards
- Reports to: John Smith, Chief Technology Officer
- Employment type: full-time, exempt
- Primary work location: San Francisco, CA
- Travel: up to 10% of time

Start date and offer expiration
- Start date: April 24, 2023 (contingent on conditions in Section 6)
- Offer expiration: April 15, 2023 (signed acceptance required by then)

Compensation
- Base salary: $150,000 pe
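
The Ollama and OpenAI cells above send the same system and user messages and differ only in the client call and in how the reply text is read. A minimal sketch of how the two could share one helper follows. The names `build_messages` and `summarize` are hypothetical and not part of either library; the client calls mirror the ones already used in this notebook.

```python
def build_messages(system_prompt, instruction, text):
    """Build the chat messages shared by both backends."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{instruction}\n\n{text}"},
    ]


def summarize(messages, backend="ollama", model=None):
    """Send messages to the chosen backend and return the reply text."""
    if backend == "ollama":
        import ollama  # imported lazily so only the backend in use is required
        response = ollama.chat(model=model or "llama3", messages=messages)
        return response["message"]["content"]
    if backend == "openai":
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model or "gpt-5-nano", messages=messages
        )
        return response.choices[0].message.content
    raise ValueError(f"Unknown backend: {backend!r}")
```

With this in place, a call such as `summarize(build_messages("You summarize and extract data from PDF text.", "Extract and summarize key details from this PDF:", pdf_text), backend="openai")` would stand in for the cell above.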

#### Extract Information from a DOCX (Word) File
###### Reads paragraphs with python-docx and asks GPT-5-nano and a local Llama 3 model (via Ollama) for key points.

In [4]:
docx_path = "data/Project_List.docx" 

def extract_text_from_docx(path):
    document = docx.Document(path)
    return "\n".join([p.text for p in document.paragraphs])

docx_text = extract_text_from_docx(docx_path)

print(f"Extracted DOCX Text (first 500 chars):\n{docx_text[:500]}\n")
prompt = f"Extract key points from this DOCX content:\n\n{docx_text[:8000]}"

Extracted DOCX Text (first 500 chars):

Project 1

Project: Real Estate Platform - eREP
Company: Sansa Technology LLC, Milpitas, CA, USA
Duration: From March 2017 – Current
Role: Java EE Developer
Description:
eREP provides leading real estate and rental marketplace platform. eREP serves the full lifecycle of owning and living in a home: buying, selling, renting, financing, remodeling and more. 
Responsibilities:
• Analyzed requirements and designed class diagrams, sequence diagrams using UML and
prepared high level technical documen
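
The prompt above keeps only the first 8,000 characters of the document (`docx_text[:8000]`), so anything past that point never reaches the model. One possible alternative, sketched here under the assumption that per-chunk summaries are acceptable, is to split the text on line boundaries into pieces that each fit the budget. The helper name `chunk_text` and its `max_chars` parameter are hypothetical.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, breaking on newlines where possible."""
    chunks, current = [], ""
    for line in text.split("\n"):
        # A single line longer than the budget is hard-split.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > max_chars:
            # Adding this line would overflow, so close the current chunk.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent as its own prompt and the partial results combined afterwards; whether that is worth the extra calls depends on how long the documents are.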



In [5]:
# OpenAI API call to summarize and extract key details from the DOCX text
response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[
        {"role": "system", "content": "You summarize and extract structured data from Word documents."},
        {"role": "user", "content": prompt}
    ]
)

print("Extracted Info from DOCX:\n")
print(response.choices[0].message.content)

Extracted Info from DOCX:

Here are the key points organized by project.

Project 1 – Real Estate Platform (eREP)
- Company: Sansa Technology LLC, Milpitas, CA, USA
- Duration: March 2017 – Current
- Role: Java EE Developer
- Description: eREP is a real estate and rental marketplace platform covering the full lifecycle of home ownership (buying, selling, renting, financing, remodeling, etc.).
- Responsibilities:
  - Analyzed requirements and designed class/sequence diagrams using UML; prepared high-level technical documents.
  - Implemented Java and J2EE design patterns.
  - Used Spring MVC with annotations and XML for Dependency Injection.
  - Implemented persistence with Hibernate to interact with MySQL.
  - Wrote SQL queries in the DAO layer.
  - Developed REST web services using Spring, Hibernate, JAX-RS, and JAXB.
  - Built UI with Spring view components, JSP, HTML, CSS, JavaScript, and AngularJS.
  - Used Jenkins for CI/CD (build, test, deploy).
- Environment / Tech Stack:
  - Ja

In [6]:
# Ollama extraction for DOCX text
response = ollama.chat(
    model="llama3",
    messages=[
        {"role": "system", "content": "You summarize and extract structured data from Word documents."},
        {"role": "user", "content": prompt}
    ]
)

print("Extracted Info from DOCX:\n")
print(response["message"]["content"])

Extracted Info from DOCX:

Here are the key points extracted from the DOCX content:

**Project 1: Real Estate Platform - eREP**

* Duration: From March 2017 – Current
* Role: Java EE Developer
* Description: eREP provides a leading real estate and rental marketplace platform.
* Responsibilities:
	+ Analyzed requirements, designed class diagrams and sequence diagrams using UML, and prepared high-level technical documents.
	+ Implemented Java and J2EE Design patterns.
	+ Utilized Spring MVC annotations and XML configuration for Dependency Injection.
	+ Implemented Persistence layer using Hibernate to interact with the MySQL database.
	+ Developed REST web services using spring, Hibernate, JAX-RS, and JAXB.
	+ Developed UI using spring view component, JSP, HTML, CSS, JavaScript, and AngularJS.
* Environment: Java 8, Java EE 7, Spring Framework 4.0, Spring MVC, Hibernate 4.3, REST Web Services, JAXRS, JAXB, JSP, HTML, CSS, JavaScript, Angular JS, SQL, HQL, XML, UML, Log4J, Apache Tomcat 7.
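
The replies above come back as free-form Markdown, and the layout differs between the two models, which makes the fields hard to consume programmatically. If structured data is needed, one option is to ask for a JSON object in the prompt and parse the reply defensively. The sketch below assumes the prompt has been changed to request JSON; `parse_json_reply` is a hypothetical helper, not part of the OpenAI or Ollama libraries.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_json_reply(reply):
    """Return the JSON object found in a model reply, or None if parsing fails."""
    # Models often wrap JSON in a Markdown code fence; use its contents if present.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Keep only the span between the outermost braces to drop surrounding prose.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising leaves it to the caller to retry or fall back to the raw text when a model ignores the formatting request.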