# Inspect the structure of each report (see Findings below)

In [10]:
from pathlib import Path
from pypdf import PdfReader

In [11]:
NEW_INSPECTION = Path("./data/New Inspection Report.pdf")
SAMPLE_INSPECTION = Path("./data/Sample Inspection Report.pdf")

In [12]:
def parse_pdf_to_string(pdf_file: Path | str) -> str:
    """Read a PDF and return the text of all pages as one string."""
    reader = PdfReader(pdf_file)
    # Pages are joined without a separator.
    return "".join(page.extract_text() for page in reader.pages)

In [107]:
report_str = parse_pdf_to_string(NEW_INSPECTION)
report_str

"Deficienc y 1 \nDeficiency : Location of emergency installations. Not as required.  \nRoot Cause: Design or engineering defect The in -use emergency stop switches did not comply with \nrequirements for electrical switch boxes installed on weather decks exposed to marine environment.  \nCorrective action: New weathertight IP67 rating electrical emergency stop switch boxes, along with \nnewly fabricated outer protection steel boxes, have been installed at the port and starboard bunker \nstations, meeting the approval of the attending Class surveyor.  \nPreventive action: Ship Staff advised to fortify their inspection regime and diligently carry out the checks \nas per CLE 08 - Inactive Function tests, and rectify faults if any immediately.  \nDeficienc y 2 \nDeficiency : The loading computer used for Stability Calculation was not approved by the RO.  \nRoot Cause: Work negligence for chief mate installed the loading computer device software to another \ncomputer in addition to the speci

In [108]:
print(report_str)

Deficienc y 1 
Deficiency : Location of emergency installations. Not as required.  
Root Cause: Design or engineering defect The in -use emergency stop switches did not comply with 
requirements for electrical switch boxes installed on weather decks exposed to marine environment.  
Corrective action: New weathertight IP67 rating electrical emergency stop switch boxes, along with 
newly fabricated outer protection steel boxes, have been installed at the port and starboard bunker 
stations, meeting the approval of the attending Class surveyor.  
Preventive action: Ship Staff advised to fortify their inspection regime and diligently carry out the checks 
as per CLE 08 - Inactive Function tests, and rectify faults if any immediately.  
Deficienc y 2 
Deficiency : The loading computer used for Stability Calculation was not approved by the RO.  
Root Cause: Work negligence for chief mate installed the loading computer device software to another 
computer in addition to the specified loading 

In [109]:
report_str = parse_pdf_to_string(SAMPLE_INSPECTION)
report_str

'Deficienc y 1 \nDeficiency : Fire extinguisher for rescue boat rusted seriously.  \nRoot Cause: Human Factors  \nNOT APPLICABLE  \nVessel Factors  \nInappropriate storage Fire extinguisher was not protected from weather.  \nManagement Factors  \nNOT APPLICABLE  \nOther Factors  \nOthers Inclement weather conditions.  \nCorrective action: Fire extinguisher replaced with a new extinguisher. The extinguisher is kept covered \nfor protection against weather.  \nPreventive action: Brieifing of entire ship staff carried out by Superintendent as to checks of rescue boat \nequipment. Lessons learned shared with all the vessels in Fleet.  \nDeficiency 2  \nDeficiency : The company name on the DOC is not the same as on the CSR. The interim SMC and interim \nSecurity certificate have different company names to the DOC.  \nRoot Cause:  Company stated in CSR doc not same as the DOC. Master without fail to cross check trading \ncertificate.  \nCorrective action:  Master w/o fail to cross check all 

In [110]:
print(report_str)

Deficienc y 1 
Deficiency : Fire extinguisher for rescue boat rusted seriously.  
Root Cause: Human Factors  
NOT APPLICABLE  
Vessel Factors  
Inappropriate storage Fire extinguisher was not protected from weather.  
Management Factors  
NOT APPLICABLE  
Other Factors  
Others Inclement weather conditions.  
Corrective action: Fire extinguisher replaced with a new extinguisher. The extinguisher is kept covered 
for protection against weather.  
Preventive action: Brieifing of entire ship staff carried out by Superintendent as to checks of rescue boat 
equipment. Lessons learned shared with all the vessels in Fleet.  
Deficiency 2  
Deficiency : The company name on the DOC is not the same as on the CSR. The interim SMC and interim 
Security certificate have different company names to the DOC.  
Root Cause:  Company stated in CSR doc not same as the DOC. Master without fail to cross check trading 
certificate.  
Corrective action:  Master w/o fail to cross check all trading certificates

### Findings

Every deficiency in the inspection reports has the following structure:
- Deficiency Number
- Deficiency Description
- Root Cause
- Corrective Action
- Preventive Action

However, the content structure within each category varies.
- For example, Root Cause in the `SAMPLE REPORT` is represented in several ways. Some entries break it down into Human, Vessel, Management, and Other factors, while others describe the root cause in general terms.
- Some deficiencies also list multiple root causes or corrective/preventive actions as ordered lists, while others give a single root cause.
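The field labels stay consistent even though the text under them varies, so one way to handle the variation is to split on the labels and keep whatever follows as free text. The block below is a minimal sketch of that idea, not the notebook's method: `FIELD_PATTERN` and `split_fields` are hypothetical names, and the example string is a shortened version of the first sample deficiency.

```python
import re

# Field labels observed in the extracted text. The pattern tolerates the stray
# space before the colon ("Deficiency :") and only matches at a line start.
FIELD_PATTERN = re.compile(
    r"^(Deficiency|Root Cause|Corrective action|Preventive action)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def split_fields(chunk: str) -> dict[str, str]:
    """Split one deficiency chunk into {lowercased label: free text}."""
    matches = list(FIELD_PATTERN.finditer(chunk))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(chunk)
        # Collapse the line breaks left over from PDF extraction.
        fields[current.group(1).lower()] = " ".join(chunk[current.end():end].split())
    return fields


# Shortened from the first deficiency of the sample report.
example = (
    "Deficiency : Fire extinguisher for rescue boat rusted seriously.  \n"
    "Root Cause: Human Factors  \nNOT APPLICABLE  \nVessel Factors  \n"
    "Inappropriate storage Fire extinguisher was not protected from weather.  \n"
    "Corrective action: Fire extinguisher replaced with a new extinguisher.  \n"
    "Preventive action: Lessons learned shared with all the vessels in Fleet."
)
fields = split_fields(example)
```

The Root Cause value keeps its factor breakdown as plain text, so a later step can decide whether to parse the factors further or pass them to the model unchanged.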

Next step: explore suitable ways to parse the reports into structured inputs that can be posed to an LLM as classification questions.

1. Use pypdf to extract the text and send the report to the model one deficiency at a time.
    - Use traditional NLP methods to clean the report, e.g. removing stopwords.
2. Use an LLM to parse the PDF into individual deficiencies for model input.
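For the first option, the cleaning step could look like the sketch below. It assumes a deliberately tiny, hypothetical stopword set (`STOPWORDS`) and a hypothetical helper `clean_text`. A real run would more likely take a fuller list from an NLP library. Negations such as "not" are kept on purpose, since "not approved" carries the meaning of the deficiency.

```python
import re

# Hypothetical, intentionally small stopword set for illustration only.
# "not" is left out so that negations survive the cleaning.
STOPWORDS = {"a", "an", "the", "of", "to", "for", "on", "is", "was",
             "as", "and", "with", "by", "in"}


def clean_text(text: str) -> str:
    """Lowercase, keep alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)


cleaned = clean_text(
    "The loading computer used for Stability Calculation was not approved by the RO."
)
# cleaned == "loading computer used stability calculation not approved ro"
```

Whether this helps is an open question: stopword removal suits bag-of-words classifiers, while an LLM generally reads the original sentence better than a stripped one.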

### Potential Path
- Use the sample inspection as context for few-shot prompting on an unseen deficiency. Let the LLM "reason logically" and assess the risk. This also allows the user to review the LLM's chain of thought.
- Use a suitable chunking strategy and labels for the LLM, so that when it reads a new deficiency, a vector search can find similar deficiencies based on their context vectors.
    - The chunks must be cut so that the entire deficiency is inside the context.
- Potentially use smaller models (BERT etc.) to classify the deficiency without any explanation.
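The first two paths can be combined: retrieve the most similar known deficiency, then use it as a few-shot example. The sketch below illustrates that flow under loud assumptions. Bag-of-words cosine similarity stands in for a real embedding model, all function names are hypothetical, the `<label>` risk values are placeholders because the reports shown here carry no labels, and the query is made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Token counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query: str, examples: list[dict], k: int = 1) -> list[dict]:
    """Return the k examples whose text is closest to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        examples, key=lambda ex: cosine(q, bag_of_words(ex["text"])), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, shots: list[dict]) -> str:
    """Assemble a few-shot prompt that ends with the unseen deficiency."""
    parts = ["Assess the risk of the final deficiency. Explain your reasoning first."]
    for ex in shots:
        parts.append(f"Deficiency:\n{ex['text']}\nRisk: {ex['risk']}")
    parts.append(f"Deficiency:\n{query}\nRisk:")
    return "\n\n".join(parts)


# Texts come from the sample report; the risk labels are placeholders.
examples = [
    {"text": "Fire extinguisher for rescue boat rusted seriously.", "risk": "<label>"},
    {"text": "The company name on the DOC is not the same as on the CSR.", "risk": "<label>"},
]
query = "Fire extinguisher in engine room found rusted."  # made-up query
shots = most_similar(query, examples, k=1)
prompt = build_prompt(query, shots)
```

Each example here is one whole deficiency, which matches the requirement that a chunk must contain the entire deficiency before it is compared or shown to the model.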

## Testing chunking strategy

### Regex to split the PDF text into deficiencies

In [9]:
import re

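def split_report_to_chunk(report_text: str) -> dict[int, str]:
    """Split the report into a dict mapping a running index to each deficiency's text."""
    # The header is extracted as "Deficienc y 1" or "Deficiency 2", so allow one
    # optional character on either side of the "y"; \d+ also consumes
    # multi-digit numbers.
    deficiencies = re.split(r"[Dd]eficienc.?y.?\d+", report_text)
    # Drop empty pieces, e.g. the empty string before the first header.
    deficiencies = [d.strip() for d in deficiencies if d.strip()]
    # Keys are a running count starting at 1, not the number parsed from the header.
    return dict(enumerate(deficiencies, 1))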
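The function above keys each chunk by its position in the report. If the number printed in the header matters, for example when a report skips or repeats a number, a variant can capture it instead. This is a sketch with a hypothetical name, `split_report_by_number`, run on a shortened inline string instead of the PDFs. The pattern is left unanchored because pages are joined without a separator, so a header is not guaranteed to start a line.

```python
import re

# Same header shapes as above ("Deficienc y 1", "Deficiency 2"), with the
# number captured. "Deficiency : ..." does not match because no digit follows.
HEADER = re.compile(r"[Dd]eficienc\s?y\s?(\d+)")


def split_report_by_number(report_text: str) -> dict[int, str]:
    """Map the number found in each header to the text that follows it."""
    matches = list(HEADER.finditer(report_text))
    chunks = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(report_text)
        chunks[int(current.group(1))] = report_text[current.end():end].strip()
    return chunks


# Shortened from the sample report text.
text = (
    "Deficienc y 1 \nDeficiency : Fire extinguisher for rescue boat rusted seriously.  \n"
    "Deficiency 2  \nDeficiency : The company name on the DOC is not the same as on the CSR."
)
chunks = split_report_by_number(text)
```

One limitation to keep in mind: a phrase such as "deficiency 3" inside the body text would also be treated as a header, and a repeated number would overwrite the earlier chunk.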

In [14]:
report_str = parse_pdf_to_string(NEW_INSPECTION)
report_chunk = split_report_to_chunk(report_str)
report_chunk

{1: 'Deficiency : Location of emergency installations. Not as required.  \nRoot Cause: Design or engineering defect The in -use emergency stop switches did not comply with \nrequirements for electrical switch boxes installed on weather decks exposed to marine environment.  \nCorrective action: New weathertight IP67 rating electrical emergency stop switch boxes, along with \nnewly fabricated outer protection steel boxes, have been installed at the port and starboard bunker \nstations, meeting the approval of the attending Class surveyor.  \nPreventive action: Ship Staff advised to fortify their inspection regime and diligently carry out the checks \nas per CLE 08 - Inactive Function tests, and rectify faults if any immediately.',
 2: "Deficiency : The loading computer used for Stability Calculation was not approved by the RO.  \nRoot Cause: Work negligence for chief mate installed the loading computer device software to another \ncomputer in addition to the specified loading computer.  

In [13]:
report_str = parse_pdf_to_string(SAMPLE_INSPECTION)
report_chunk = split_report_to_chunk(report_str)
report_chunk

{1: 'Deficiency : Fire extinguisher for rescue boat rusted seriously.  \nRoot Cause: Human Factors  \nNOT APPLICABLE  \nVessel Factors  \nInappropriate storage Fire extinguisher was not protected from weather.  \nManagement Factors  \nNOT APPLICABLE  \nOther Factors  \nOthers Inclement weather conditions.  \nCorrective action: Fire extinguisher replaced with a new extinguisher. The extinguisher is kept covered \nfor protection against weather.  \nPreventive action: Brieifing of entire ship staff carried out by Superintendent as to checks of rescue boat \nequipment. Lessons learned shared with all the vessels in Fleet.',
 2: 'Deficiency : The company name on the DOC is not the same as on the CSR. The interim SMC and interim \nSecurity certificate have different company names to the DOC.  \nRoot Cause:  Company stated in CSR doc not same as the DOC. Master without fail to cross check trading \ncertificate.  \nCorrective action:  Master w/o fail to cross check all trading certificates and