
# Mini Materials Knowledge Graph — Common Semiconductors

*Personal notes for me (Md. Saidul Islam). Goal: demonstrate semantic modeling skills relevant to BAM.*

**What I'm building:** a tiny pipeline that turns a small semiconductors table into **RDF triples**, which I then query with **SPARQL**. I keep the ontology minimal and readable.



## 0) Quick glossary (my own words)
- **Ontology** → my small schema + vocabulary for this domain (classes + relations).
- **RDF triple** → one fact written as `subject — predicate — object`.
- **Namespace** → IRI prefix (usually an HTTP URL) that keeps my identifiers globally unique.
- **SPARQL** → my query tool for RDF graphs (like SQL but for triples).
- **Turtle (.ttl)** → compact text format to store RDF.



## 1) Setup
I’ll read a tiny CSV and map it into RDF using `rdflib`. Paths are relative to the notebook's folder, so everything runs inside the repo as long as the kernel's working directory is that folder.


In [None]:

# If first run, I can install libs here:
# %pip install rdflib pandas

import pandas as pd
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD
from pathlib import Path

DATA = Path("../data/semiconductors_small.csv")   # CSV lives in the repo
TTL_OUT = Path("../data/semiconductors_small.ttl")  # RDF Turtle output
print("Using data:", DATA.resolve())



## 2) Ontology skeleton (small and pragmatic)
I declare a few classes and properties that I actually need for the CSV.  
Classes: `Material`, `SynthesisMethod`, `CrystalStructure`, `Property` (declared for later; nothing in the CSV maps to it yet)  
Properties:  
- data: `hasBandGap` (float eV), `hasLatticeConstant` (float Å)  
- object: `hasCrystalStructure`, `synthesizedBy`


In [None]:

g = Graph()

# Namespace for my identifiers (can switch to a real domain later)
EX = Namespace("http://example.org/mse#")
g.bind("ex", EX)
g.bind("rdfs", RDFS)
g.bind("xsd", XSD)

# Classes
Material         = EX.Material
SynthesisMethod  = EX.SynthesisMethod
CrystalStructure = EX.CrystalStructure
Property         = EX.Property

for cls in [Material, SynthesisMethod, CrystalStructure, Property]:
    g.add((cls, RDF.type, RDFS.Class))

# Properties
hasBandGap          = EX.hasBandGap
hasLatticeConstant  = EX.hasLatticeConstant
hasCrystalStructure = EX.hasCrystalStructure
synthesizedBy       = EX.synthesizedBy

for prop in [hasBandGap, hasLatticeConstant, hasCrystalStructure, synthesizedBy]:
    g.add((prop, RDF.type, RDF.Property))

# Light domain/range annotations. RDFS treats these as inference hints, not
# constraints, so any validation based on them has to be done explicitly later.
for prop, rng in [
    (hasBandGap, XSD.float),
    (hasLatticeConstant, XSD.float),
    (hasCrystalStructure, CrystalStructure),
    (synthesizedBy, SynthesisMethod),
]:
    g.add((prop, RDFS.domain, Material))
    g.add((prop, RDFS.range, rng))

print("Ontology initialized. Triples so far:", len(g))



## 3) Load CSV and mint entities
I create IRIs from labels (simple normalization) and assert triples for each row.


In [None]:

df = pd.read_csv(DATA)

def mint_entity(label: str, cls: URIRef):
    # minimal normalizer (good enough for a demo)
    safe = (
        label.strip()
             .replace(" ", "_")
             .replace("(", "")
             .replace(")", "")
             .replace("/", "_")
    )
    iri = EX[safe]
    g.add((iri, RDF.type, cls))
    g.add((iri, RDFS.label, Literal(label)))
    return iri

for _, row in df.iterrows():
    mat = mint_entity(row["material"], Material)

    # data properties (skipped when the cell is empty)
    if pd.notna(row.get("band_gap_eV")):
        g.add((mat, hasBandGap, Literal(float(row["band_gap_eV"]), datatype=XSD.float)))
    if pd.notna(row.get("lattice_const_A")):
        g.add((mat, hasLatticeConstant, Literal(float(row["lattice_const_A"]), datatype=XSD.float)))

    # object properties (also skipped when empty, so NaN never reaches mint_entity)
    if pd.notna(row.get("crystal_structure")):
        g.add((mat, hasCrystalStructure, mint_entity(row["crystal_structure"], CrystalStructure)))
    if pd.notna(row.get("typical_synthesis")):
        g.add((mat, synthesizedBy, mint_entity(row["typical_synthesis"], SynthesisMethod)))

print("After ingest: triples =", len(g))
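
The normalizer in `mint_entity` only handles spaces, parentheses and slashes, so other characters (for example `#`, `%` or accented letters) would pass straight into the IRI. If the labels ever get messier, a stricter variant could look like the sketch below. `slugify_label` is a hypothetical name of mine, and the example labels are made up rather than taken from the CSV.

```python
import re

def slugify_label(label: str) -> str:
    """Reduce a label to ASCII letters, digits and underscores."""
    # Replace every run of other characters with a single underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", label.strip())
    slug = slug.strip("_")
    if not slug:
        raise ValueError(f"label {label!r} has no usable characters")
    return slug

examples = ["Diamond cubic", "Zinc blende (sphalerite)", "MBE / MOCVD"]
slugs = [slugify_label(x) for x in examples]
for label, slug in zip(examples, slugs):
    print(f"{label!r} -> {slug!r}")
```

The trade-off is that two different labels can collapse to the same slug (and therefore the same IRI), so for anything beyond a demo I would keep a label-to-IRI lookup instead of relying on the string alone.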



## 4) Serialize to Turtle
I write the RDF graph to a `.ttl` file so it’s versionable in Git and easy to inspect.


In [None]:

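# serialize() returns str in rdflib 6+ (bytes only in older versions), so
# write_bytes() would raise TypeError there; let rdflib write the file itself.
g.serialize(destination=str(TTL_OUT), format="turtle")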
print("Wrote:", TTL_OUT.resolve())



## 5) SPARQL queries (quick checks)
I query the in‑memory graph via `rdflib` to verify the ontology + data mapping.


In [None]:

# Q1) Materials with band gap > 1 eV (descending)
q1 = """PREFIX ex: <http://example.org/mse#>
SELECT ?material ?Eg
WHERE {
  ?m a ex:Material ;
     rdfs:label ?material ;
     ex:hasBandGap ?Eg .
  FILTER(?Eg > 1.0)
}
ORDER BY DESC(?Eg)
"""
for row in g.query(q1, initNs={"rdfs": RDFS}):
    print(row)


In [None]:

# Q2) Materials synthesized by MOCVD
q2 = """PREFIX ex: <http://example.org/mse#>
SELECT ?material
WHERE {
  ?m a ex:Material ;
     rdfs:label ?material ;
     ex:synthesizedBy ?meth .
  ?meth rdfs:label "MOCVD" .
}
"""
for row in g.query(q2, initNs={"rdfs": RDFS}):
    print(row)


In [None]:

# Q3) Materials with diamond cubic structure
q3 = """PREFIX ex: <http://example.org/mse#>
SELECT ?material
WHERE {
  ?m a ex:Material ;
     rdfs:label ?material ;
     ex:hasCrystalStructure ?cs .
  ?cs rdfs:label "Diamond cubic" .
}
"""
for row in g.query(q3, initNs={"rdfs": RDFS}):
    print(row)



## 6) Lightweight consistency checks
I keep a few small rules here to catch obvious issues (labels missing, negative band gaps, etc.).


In [None]:

problems = []

# A) All Materials should have labels
for s in g.subjects(RDF.type, EX.Material):
    if (s, RDFS.label, None) not in g:
        problems.append(f"Material without label: {s}")

# B) Band gap must be numeric and non-negative
for s, o in g.subject_objects(EX.hasBandGap):
    try:
        if float(o) < 0:
            problems.append(f"Negative band gap for {s}")
    except (TypeError, ValueError):
        problems.append(f"Non-numeric band gap for {s}: {o}")

print("No obvious problems ✅" if not problems else "Consistency problems:")
for x in problems:
    print("-", x)



## 7) Placeholder for LLM‑assisted extraction
When I replace this stub with a real LLM/NLP call, I’ll feed abstracts/tables and get back candidate triples to add to the graph.


In [None]:

def propose_triples_from_text(text: str):
    # demo placeholder: pretend I parsed that GaN has Eg ~3.4 eV
    return [(EX.GaN, EX.hasBandGap, Literal(3.4, datatype=XSD.float))]

for s,p,o in propose_triples_from_text("GaN has band gap ~3.4 eV"):
    g.add((s,p,o))

print("Triples after stub insert:", len(g))



## 8) Save again after updates
I keep the TTL in sync with the in‑memory graph.


In [None]:

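# Same fix as the first save: serialize() returns str in rdflib 6+, so let
# rdflib write the file instead of passing its result to write_bytes().
g.serialize(destination=str(TTL_OUT), format="turtle")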
print("Updated:", TTL_OUT.resolve())
