### Project Overview
You are building a Proof of Concept (PoC) for a Knowledge Graph-based NLP chatbot for Material Safety Data Sheets (MSDS). The PoC involves:

- Extracting text from an MSDS PDF using the PyPDF2 library.
- Parsing the extracted text into RDF triples using NLP techniques with the spaCy library.
- Using the RDF schema provided in the msds_rdf.ttl file to structure the RDF triples.

1. Set Up the Python Environment
Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate

In [30]:
# install required packages
! pip install PyPDF2 spacy rdflib


In [31]:
# Create a requirements.txt file with the installed packages
!pip freeze > requirements.txt


2. Extract Data from the PDF
Use the PyPDF2 library to extract text from the MSDS PDF file.

In [32]:
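import re

from PyPDF2 import PdfReader


class Read_PDF:
    """Read a PDF file and return its text with leading zeros stripped from numbers."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.text = ""

    def extract_text(self):
        reader = PdfReader(self.file_path)
        # Rebuild the text on every call so repeated calls do not duplicate pages;
        # "or ''" guards against a page that yields no text
        self.text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
        return self.sanitize_text(self.text)

    def sanitize_text(self, text):
        # Remove leading zeros from integers ("007" -> "7"). The lookbehind also
        # rejects a preceding ".", so decimal fractions such as "0.05" are kept.
        return re.sub(r'(?<![\d.])0+(\d+)', r'\1', text)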
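
To make the effect of leading-zero stripping concrete, the sketch below applies a pattern of this kind to a few sample strings. The helper name `strip_leading_zeros` and the samples are illustrative and not part of the notebook; the pattern is written so that it also refuses to match directly after a decimal point, which leaves a fraction such as 0.05 untouched while 007 becomes 7.

```python
import re

# Hypothetical helper for illustration; the lookbehind rejects a match that
# directly follows a digit or a decimal point.
LEADING_ZEROS = re.compile(r"(?<![\d.])0+(\d+)")

def strip_leading_zeros(text: str) -> str:
    """Drop leading zeros from integer-like tokens, leaving fractions alone."""
    return LEADING_ZEROS.sub(r"\1", text)

# Illustrative inputs, not taken from the data sheet
samples = ["Page 01 of 05", "0.05", "1200", "007"]
for sample in samples:
    print(repr(sample), "->", repr(strip_leading_zeros(sample)))
```

Zeros inside a number (as in 1200) are preceded by a digit, so the lookbehind leaves them in place.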

In [33]:
# Create an instance of Read_PDF
path = "/Users/I310202/Library/CloudStorage/OneDrive-SAPSE/SR@Work/81.Innovations/98.AI_Developments/33.AI_MSDS/Build_MSDS_SAPKGE/Documents/WD-40.pdf"
pdf_reader = Read_PDF(path)
# Extract text from the PDF
pdf_text = pdf_reader.extract_text()

In [34]:
print(pdf_text)

 
Page 1 of 5  
Safety Data Sheet  
California CARB Compliant  
1 - Identification  
 
Product Name: WD -40 Multi -Use Product Aerosol   
 
Product Use: Lubricant, Penetrant, Drives Out 
Moisture and Protects Surfaces from  Corrosion  
 
Restrictions on Use: None identified  
 
SDS Date of Preparation: November 13 , 2024  Manufacturer:  WD-40 Company  
Address:  9715 Businesspark Avenue  
   San Diego, California, USA  
  92131  
Telephone:   
Emergency:      1 -888-324-7596  
Information:   1-888-324-7596  
Chemical Spills: 1 -800-424-9300 (Chemtrec)  
 1-703-527-3887 (International Calls)  
 
2 – Hazards Identification  
HCS 2024 /GHS Classification:  
Aerosol Category 1  
Aspiration Toxicity Category 1  
Specific Target Organ Toxicity Single Exposure Category 3 (nervous system effects)  
 
Note: This product is a consumer product and is labeled in accordance with the US Consumer Product 
Safety Commission regulations which take precedence over OSHA Hazard Communication labeling. The

3. Parse Text and Extract RDF Triples
Use the spaCy library to run named-entity recognition on the extracted text. Each recognized entity becomes one RDF triple whose msds:hasType predicate records the entity label.

In [35]:
import urllib.parse

import spacy
import spacy.cli
from rdflib import Graph, Literal, Namespace

# Load the spaCy model, downloading it first only if it is not installed
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

# RDF Namespace
MSDS = Namespace("http://example.org/msds#")

class Extract_RDF_Tuples:
    def __init__(self, text):
        self.text = text
        self.graph = Graph()

    def extract(self):
        doc = nlp(self.text)
        self.graph.bind("msds", MSDS)
        predicate = MSDS["hasType"]
        for ent in doc.ents:
            # Percent-encode the entity text so it forms a valid URI
            encoded_text = urllib.parse.quote(ent.text)
            subject = MSDS[encoded_text]  # Namespace lookup already returns a URIRef
            obj = Literal(ent.label_)  # spaCy entity label, e.g. "CARDINAL" or "ORG"
            self.graph.add((subject, predicate, obj))
        return self.graph


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You

In [36]:
# create instance for class Extract_RDF_Tuples
obj_graph = Extract_RDF_Tuples(pdf_text)
rdf_graph = obj_graph.extract()
print(rdf_graph.serialize(format="turtle"))


@prefix msds: <http://example.org/msds#> .

msds:0 msds:hasType "CARDINAL" .

msds:0.8 msds:hasType "CARDINAL" .

msds:0.82 msds:hasType "CARDINAL" .

msds:1 msds:hasType "CARDINAL" .

msds:1%20-800-424-9300 msds:hasType "CARDINAL" .

msds:1%20-888-324-7596 msds:hasType "QUANTITY" .

msds:1-888-324-7596 msds:hasType "QUANTITY" .

msds:100 msds:hasType "CARDINAL" .

msds:11 msds:hasType "CARDINAL" .

msds:12 msds:hasType "CARDINAL" .

msds:120 msds:hasType "CARDINAL" .

msds:1200 msds:hasType "CARDINAL" .

msds:122 msds:hasType "CARDINAL" .

msds:13 msds:hasType "CARDINAL" .

msds:138%EF%82%B0F msds:hasType "CARDINAL" .

msds:14 msds:hasType "CARDINAL" .

msds:183 msds:hasType "CARDINAL" .

msds:187%C2%B0 msds:hasType "CARDINAL" .

msds:1910.134 msds:hasType "CARDINAL" .

msds:2 msds:hasType "CARDINAL" .

msds:2-3%25 msds:hasType "PERCENT" .

msds:2.1 msds:hasType "CARDINAL" .

msds:2.1%20Ltd.%20Qty msds:hasType "ORG" .

msds:2.79-2.96 msds:hasType "CARDINAL" .

msds:2024 msds:hasType "
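
The percent-encoded subjects in the output above, such as `msds:1%20-800-424-9300`, come from passing each entity string through `urllib.parse.quote` before it is appended to the namespace. The sketch below reproduces that step with the standard library only. The helper name `entity_uri` is hypothetical, and plain string concatenation stands in for the rdflib `Namespace` lookup.

```python
from urllib.parse import quote

# Same namespace IRI as MSDS in the notebook
BASE = "http://example.org/msds#"

def entity_uri(entity_text: str) -> str:
    """Percent-encode an entity string and append it to the namespace IRI."""
    return BASE + quote(entity_text)

# Entity strings that would produce subjects seen in the Turtle output
for sample in ["1 -800-424-9300", "2-3%", "187°"]:
    print(entity_uri(sample))
```

With its default arguments `quote` leaves `/` unescaped; passing `safe=""` would encode it as well, which may matter for entity strings that contain slashes.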

4. Use the RDF Schema
Load the RDF schema file (the cell below reads msds_rdf_v2.ttl) and merge it into the graph of generated RDF triples.


In [37]:
def load_rdf_schema(schema_path):
    schema_graph = Graph()
    schema_graph.parse(schema_path, format="turtle")
    return schema_graph

schema_graph = load_rdf_schema("msds_rdf_v2.ttl")
rdf_graph += schema_graph

5. Validate and Test
Ensure the RDF graph is valid and conforms to the schema.
Serialize the final RDF graph to a file:


In [38]:
rdf_graph.serialize("output_graph.ttl", format="turtle")

<Graph identifier=N6d76c51977974b2588e22274cf0cffaa (<class 'rdflib.graph.Graph'>)>