
# Building a virtual assistant (VA) for producing a supercapacitor 

The VA (hackathon version*) is supposed to solve the following **tasks**:

1. intelligently process scientific documents
   * pick relevant data (text, images, tables, metadata)
   * filter low-quality information
2. create a domain-specific knowledge graph 
3. provide useful answers to user's queries.

*(later: add a recommender system that makes suggestions on material and process optimization)

## Step-by-Step Process to Build the VA

### 1. Data Collection 

Task: search arXiv for documents matching the keyword "supercapacitor". For now the VA only needs to work with PDFs (later: structured text such as HTML).

Tools: `BeautifulSoup`, `Selenium`, APIs like ArXiv (later: CrossRef)

(later: expand the database to include a large corpus of relevant scientific documents, research papers, patents, and technical manuals from IEEE, ScienceDirect, PubMed, Google Scholar, patent databases, and tech blogs; possibly add multilanguage support)
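
A minimal sketch of the search step, assuming the public arXiv API endpoint `http://export.arxiv.org/api/query` and its Atom response format. The helper names `build_query_url` and `parse_feed` and the sample feed are illustrative, not part of the original notebook. No network call is made here, so the parsing can be tried offline.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the API response


def build_query_url(keyword, start=0, max_results=10):
    """Build an arXiv API search URL for a single keyword (hypothetical helper)."""
    params = {
        "search_query": f"all:{keyword}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Return id, title and PDF link for every entry of an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        pdf_link = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                pdf_link = link.get("href")
        title = entry.findtext(f"{ATOM}title", default="")
        papers.append({
            "id": entry.findtext(f"{ATOM}id"),
            "title": " ".join(title.split()),  # collapse line breaks inside titles
            "pdf": pdf_link,
        })
    return papers


# Hand-written sample feed with placeholder values, NOT a real API response
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A placeholder title about
      supercapacitor electrodes</title>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""

url = build_query_url("supercapacitor", max_results=5)
papers = parse_feed(SAMPLE_FEED)
print(url)
print(papers)
```

In a real run the string passed to `parse_feed` would be the body returned by fetching `url`, for example with `requests.get(url).text`.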


### 2. Document Processing

Task: extract structured data (text, tables, images, meta) from the documents

Tools: `PyMuPDF` (text, diagrams, figures, metadata), `pdfplumber` (text), `PDFMiner` (text), `Camelot` (tables), `Tabula` (tables).

Preferred method: `PDFMiner` combined with an RPN (Region Proposal Network) for detecting layout objects: headers, sections, images, figures, tables, references.
Tutorial: https://medium.com/@baptisteloquette.entr/langchain-arxiv-tutor-data-loading-c62f55af492d


### 3. Information Filtering

Task: filter relevant vs. low-quality information (later: remove redundant/ duplicate information)

Tools: pre-trained language models such as `SciBERT` and `PubMedBERT` (both pre-trained on scientific text) to process domain-specific text; `TextRank`, `TF-IDF`, and `BERT` embeddings to identify the sections, keywords, or sentences most relevant to the query (later: deduplication algorithms to remove similar or redundant information)
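
As a rough illustration of the TF-IDF option, the sketch below ranks passages against a query using only the standard library. The function names, the smoothed idf formula, and the three sample passages are assumptions made for this example; a production version would more likely use an existing library implementation or embedding-based similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: terms that occur in every document keep a small weight
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_passages(query, passages):
    """Return (score, passage) pairs sorted from most to least relevant."""
    vectors, idf = tfidf_vectors(passages)
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_total = sum(q_counts.values()) or 1
    q_vec = {t: (c / q_total) * idf[t] for t, c in q_counts.items()}
    scored = [(cosine(q_vec, vec), p) for vec, p in zip(vectors, passages)]
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Made-up passages for illustration only
passages = [
    "Graphene electrodes increase the energy density of the supercapacitor.",
    "The authors thank the funding agency for financial support.",
    "Ionic liquid electrolytes widen the operating voltage window.",
]
ranked = rank_passages("electrode materials for high energy density", passages)
for score, passage in ranked:
    print(f"{score:.3f}  {passage}")
```

Note that the acknowledgement sentence still receives a non-zero score because it shares the word "for" with the query, which is one reason a stop-word list or embedding-based scoring is usually added on top.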

### 4. Knowledge Representation (Domain-Specific Ontology)

Task: build an ontology/ knowledge graph specific to supercapacitor development to map key concepts (e.g. electrode materials, electrolyte properties, fabrication processes) to the extracted information

Tools: `Neo4j` or `RDF` (Resource Description Framework) to structure the extracted information; Named Entity Recognition (NER) for concept linking, i.e. to extract key concepts such as materials (graphene, carbon nanotubes), methods (electrochemical synthesis), and metrics (energy density, power density)
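
The idea can be sketched without any graph database: entities are found with a small dictionary lookup (a stand-in for a trained NER model) and stored as (subject, predicate, object) triples, the same shape that `RDF` uses. The gazetteer, the predicate names, and the source id `paper-001` are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical gazetteer: surface form -> concept type.
# A trained NER model would replace this lookup.
GAZETTEER = {
    "graphene": "Material",
    "carbon nanotubes": "Material",
    "electrochemical synthesis": "Method",
    "energy density": "Metric",
    "power density": "Metric",
}


def find_entities(sentence):
    """Return (surface form, concept type) pairs found by case-insensitive substring match."""
    lowered = sentence.lower()
    return [(term, kind) for term, kind in GAZETTEER.items() if term in lowered]


def build_graph(sentences, source_id):
    """Build a set of (subject, predicate, object) triples from the sentences."""
    triples = set()
    for sentence in sentences:
        entities = find_entities(sentence)
        for term, kind in entities:
            triples.add((term, "is_a", kind))
            triples.add((term, "mentioned_in", source_id))
        materials = [t for t, k in entities if k == "Material"]
        metrics = [t for t, k in entities if k == "Metric"]
        for material in materials:
            for metric in metrics:
                # Co-occurrence in one sentence only suggests a relation, it does not prove one
                triples.add((material, "co_occurs_with", metric))
    return triples


def neighbours(triples, node):
    """Return all (predicate, object) pairs leaving the given node."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[subject].add((predicate, obj))
    return index[node]


sentences = [
    "Graphene electrodes improved the energy density of the device.",
    "Carbon nanotubes were prepared by electrochemical synthesis.",
]
graph = build_graph(sentences, source_id="paper-001")
for triple in sorted(graph):
    print(triple)
```

The same triples could later be loaded into `Neo4j` or serialized with `rdflib`; the in-memory set is only meant to show the data shape.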

### 5. Contextual Understanding and Query System

Task: create a query-based system where the user can ask questions/ make queries, e.g. "What are the best electrode materials for high energy density?", or "Summarize recent developments in electrolyte materials for supercapacitors."

Tools: a retrieval-based system (`Haystack`, `ElasticSearch`) combined with BERT-based Question Answering (QA) models (available through Hugging Face's `transformers` via `pipeline("question-answering")`); semantic (contextual) search to understand the intent of the query and retrieve the relevant sections from the documents

### 6. Low-Quality Information Filtering

Task: assess and filter low-quality information based on 
* citation count, journal impact, author/ institution credibility
* technical jargon matching: papers or documents lacking domain-specific language might be of lower relevance (use domain-specific vocabularies for filtering)
* publication age (exclude older documents that may contain outdated methods or materials)
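
One possible way to combine the three criteria is a weighted score between 0 and 1. Everything numeric in the sketch below (weights, citation cap, maximum age, threshold) and the vocabulary set are placeholder assumptions that would need tuning on real data. Journal impact and author credibility are left out because they need external metadata.

```python
import re
from dataclasses import dataclass


@dataclass
class PaperMeta:
    title: str
    abstract: str
    citations: int
    year: int


# Hypothetical domain vocabulary and weights (assumptions, not tuned values)
DOMAIN_TERMS = {"supercapacitor", "electrode", "electrolyte", "capacitance", "graphene"}
WEIGHTS = {"citations": 0.4, "vocabulary": 0.4, "recency": 0.2}


def quality_score(paper, current_year, max_age=15, citation_cap=100):
    """Combine citation count, jargon overlap and publication age into a 0..1 score."""
    citation_part = min(paper.citations, citation_cap) / citation_cap
    words = set(re.findall(r"[a-z]+", paper.abstract.lower()))
    vocabulary_part = len(words & DOMAIN_TERMS) / len(DOMAIN_TERMS)
    age = max(current_year - paper.year, 0)
    recency_part = max(1.0 - age / max_age, 0.0)
    return (WEIGHTS["citations"] * citation_part
            + WEIGHTS["vocabulary"] * vocabulary_part
            + WEIGHTS["recency"] * recency_part)


def filter_papers(papers, current_year, threshold=0.3):
    """Keep only the papers whose score reaches the threshold."""
    return [p for p in papers if quality_score(p, current_year) >= threshold]


# Made-up records for illustration only
papers = [
    PaperMeta("A", "Graphene electrode with high capacitance for a supercapacitor.", 50, 2022),
    PaperMeta("B", "A general note on batteries.", 0, 2000),
]
kept = filter_papers(papers, current_year=2024)
print([p.title for p in kept])
```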

### 7. Machine Learning for Data Extraction and Summarization

Tasks: fine-tune models to perform information extraction and summarization

Tools:
`BERT`, `T5`, `GPT` to extract specific information (material properties, performance metrics, and manufacturing techniques), `BART`, `T5`, `PEGASUS` for automatic text summarization of relevant sections
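
Before any model is fine-tuned, a rule-based baseline can show what "extracting performance metrics" means in practice and can produce rough labels for later training. The sketch below is such a baseline using a regular expression, not one of the models listed above. The metric names, unit list, and sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical metric and unit lists; extend them for the metrics of interest
METRIC_PATTERN = re.compile(
    r"(?P<metric>energy density|power density|specific capacitance)"
    r"\s+(?:of|is|was|reached)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>Wh/kg|W/kg|kW/kg|F/g)",
    re.IGNORECASE,
)


def extract_metrics(text):
    """Return one record per 'metric of value unit' phrase found in the text."""
    return [
        {
            "metric": m.group("metric").lower(),
            "value": float(m.group("value")),
            "unit": m.group("unit"),
        }
        for m in METRIC_PATTERN.finditer(text)
    ]


# Made-up sentence for illustration only
sample = ("The device showed an energy density of 45.2 Wh/kg and "
          "a specific capacitance of 310 F/g.")
records = extract_metrics(sample)
print(records)
```

A pattern like this misses paraphrases ("delivers 45.2 Wh/kg") and values in tables, which is the gap the fine-tuned extraction models are meant to close.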

### 8. Image Recognition and Analysis

Task: use computer vision techniques to analyze images (like SEM, TEM, or XRD patterns) relevant to supercapacitors

* Object detection: detect key components (e.g. anode, cathode, electrolyte) in schematics or diagrams.
* OCR (Optical Character Recognition): use tools like `Tesseract` to read labels, legends, and other annotations in the images.

### 9. Human-in-the-Loop

Since automating the entire process might still be difficult, human-in-the-loop methods could be used. You could present the user with a summary of the extracted information and ask for feedback or adjustments.

### 10. Implementing the Virtual Assistant

* Interface: build a simple UI, possibly a chatbot, that can interact with the user. Frameworks like `Flask` or `FastAPI` can be used to host the model and process the requests.
* Chatbot API: use `Dialogflow` or `Rasa` to build conversational agents that guide the user through the development process.
* Backend: use Python and integrate models and tools (like `SciBERT`, `Neo4j`, etc.) for smooth query answering and document analysis.

## Tools and Libraries

* Text processing: `transformers`, `spaCy`, `nltk`, `PyMuPDF`
* Tables: `Camelot`, `Tabula`
* Semantic search/QA: `Haystack`, `ElasticSearch`, `sentence-transformers`
* Image processing: `PIL`, `OpenCV`, `Tesseract`
* Ontologies/graphs: `Neo4j`, `rdflib`
* Summarization: `BART`, `T5`, `PEGASUS`
* Deployment: `Flask`, `FastAPI`, `Rasa`

## Example: Building a Summarization Pipeline for Supercapacitor Data

In [None]:
from transformers import pipeline

# Load summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Sample extracted text from research papers
extracted_text = """
Supercapacitors have been the focus of intense research in recent years due to their high power density and long cycle life. One of the key challenges in supercapacitor development is enhancing the energy density while maintaining high power density. The use of novel electrode materials such as graphene, carbon nanotubes, and metal oxides has shown promising results in this regard. Electrolytes play a crucial role in the performance of supercapacitors, with ionic liquids and solid-state electrolytes being explored for high-temperature applications...
"""

# Summarize the extracted text
summary = summarizer(extracted_text, max_length=100, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

## Final Thoughts

Your virtual assistant can be designed to intelligently process both structured and unstructured data from scientific literature and guide you through the supercapacitor development process. By combining NLP, machine learning, and domain-specific ontologies, your VA can give highly targeted insights into materials, manufacturing processes, and more.

In [1]:
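# dependencies:
# Install first so that the imports below succeed; the original run of this cell
# stopped with "ModuleNotFoundError: No module named 'torch'".
%pip install -qU langchain langchain-community arxiv pymupdf pdfminer.six pdf2image opencv-python torch torchvision layoutparser pytesseract
# Note: lp.Detectron2LayoutModel additionally needs detectron2, and pdf2image and
# pytesseract need the poppler and tesseract system binaries.

import os
import re

import cv2
import numpy as np
import pandas as pd
import requests
import torch
import torchvision.ops.boxes as bops    # box_iou, used when refining overlapping boxes
import layoutparser as lp
# ArXiv and PDF loaders
from langchain_community.document_loaders import ArxivLoader, PDFMinerLoader
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pdf2image import convert_from_path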

In [None]:
# download papers

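def load_pdf(url):
    # Drop only the ".pdf" suffix; str.strip(".pdf") would strip any leading or
    # trailing '.', 'p', 'd', 'f' characters instead.
    file_name       =   os.path.basename(url)
    paper_number    =   file_name[:-len(".pdf")] if file_name.endswith(".pdf") else file_name
    res             =   requests.get(url, timeout=30)
    res.raise_for_status()                  # fail early on HTTP errors
    os.makedirs("papers", exist_ok=True)    # make sure the target folder exists
    pdf_path        =   f"papers/{paper_number}.pdf"
    with open(pdf_path, 'wb') as f:
        f.write(res.content)
    return paper_number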

link            =  "https://arxiv.org/pdf/2306.08302.pdf"
paper_number    =   load_pdf(link)

In [None]:
docs    =   PDFMinerLoader(f"papers/{paper_number}.pdf").load()

# divide into blocks:
text_splitter   =   RecursiveCharacterTextSplitter(
    chunk_size=700, # Specify the character chunk size
    chunk_overlap=0, # "Allowed" Overlap across chunks
    length_function=len # Function used to evaluate the chunk size (in terms of characters)
)

docs    =   text_splitter.split_documents(docs)
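
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits recursively on separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical stand-in that only shows how the two parameters interact.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_text(text, chunk_size=10)
with_overlap = chunk_text(text, chunk_size=10, chunk_overlap=3)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, a sentence cut at a chunk boundary is split between two chunks; a small overlap keeps some shared context on both sides at the cost of more chunks.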

#### Layout Parsing with LayoutParser

The layout model (a Mask R-CNN) includes 3 components:
1. a CNN that extracts feature maps.
2. an RPN (Region Proposal Network) that uses the feature maps to propose and refine a number of regions of interest.
3. a component that gathers the best proposals and refines them further to produce a segmentation mask.

In [None]:
# Layout Parsing with LayoutParser

def pdf_to_img(pdf_pth):
    pdf_name    =   os.path.basename(pdf_pth)
    if pdf_name.endswith(".pdf"):           # drop only the suffix; str.strip(".pdf") strips characters
        pdf_name    =   pdf_name[:-len(".pdf")]
    img_pth     =   os.path.join("papers_imgs", pdf_name + "_imgs")
    os.makedirs(img_pth, exist_ok=True)
    images      =   convert_from_path(pdf_path=pdf_pth)
    for i, image in enumerate(images):      # one JPEG per page: page0.jpg, page1.jpg, ...
        image.save(os.path.join(img_pth, f"page{i}.jpg"), "JPEG")
    print("Images Saved !")
    return img_pth

imgs_pth   =   pdf_to_img(f"papers/{paper_number}.pdf")

In [None]:
model_publay    =   lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
                    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.6],
                    label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})

In [None]:
# get layout of a page
page_idx    =   6
img_path    =   os.path.join(imgs_pth, f"page{page_idx}.jpg")  # imgs_pth is returned by pdf_to_img
img         =   cv2.imread(img_path)
img         =   img[..., ::-1]      # OpenCV loads BGR; reverse the channels to get RGB

layout  =   model_publay.detect(img)
lp.draw_box(img, layout)

In [None]:
import torchvision.ops.boxes as bops

def get_coordinate(data):
  """Return the block's bounding box as a 1x4 tensor [x1, y1, x2, y2]."""
  x1 = data.block.x_1
  y1 = data.block.y_1
  x2 = data.block.x_2
  y2 = data.block.y_2

  return torch.tensor([[x1, y1, x2, y2]], dtype=torch.float)

def get_iou(box_1, box_2):
  """Intersection over Union of two 1x4 box tensors."""
  return bops.box_iou(box_1, box_2)

def get_area(bbox):
  w = bbox[0, 2] - bbox[0, 0] # Width
  h = bbox[0, 3] - bbox[0, 1] # Height
  area  = w * h

  return area

def refine_blocks(block_1, block_2):
  """If the two blocks overlap, mark the smaller one by setting its type to 'None'."""
  bb1 = get_coordinate(block_1)
  bb2 = get_coordinate(block_2)

  iou = get_iou(bb1, bb2)

  if iou.item() != 0.0:   # any overlap at all

    a1 = get_area(bb1)
    a2 = get_area(bb2)

    if a1 > a2:
      block_2.set(type='None', inplace=True)
    else:
      block_1.set(type='None', inplace=True)

Overlapping bounding boxes are handled by computing the Intersection over Union (IoU) of each pair of detected boxes. If IoU > 0, the two boxes overlap; we then compute the area of both boxes and keep only the larger one, marking the smaller one by setting its type to "None".
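
A worked example of the IoU rule in plain Python, with boxes given as `(x1, y1, x2, y2)` tuples. The helper names and the three sample boxes are illustrative; the notebook itself uses `torchvision` tensors for the same computation.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(x2 - x1, 0.0) * max(y2 - y1, 0.0)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = area((ix1, iy1, ix2, iy2))
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0


def keep_larger(box_a, box_b):
    """Return the box to keep when the two overlap, or None when they do not."""
    if iou(box_a, box_b) == 0.0:
        return None
    return box_a if area(box_a) > area(box_b) else box_b


big = (0.0, 0.0, 10.0, 10.0)
small = (5.0, 5.0, 10.0, 10.0)   # lies inside `big`
far = (20.0, 20.0, 30.0, 30.0)   # does not touch `big`

print(iou(big, small), keep_larger(big, small))
print(iou(big, far), keep_larger(big, far))
```

Here the intersection of `big` and `small` is 25 and the union is 100, so the IoU is 0.25 and the larger box is kept; `big` and `far` do not intersect, so nothing is discarded.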

The OCR step proceeds as follows:

1. Initialize the `ocr_agent`, which extracts the text from the detected boxes using Tesseract. Multiple languages can be specified with the format `languages=["eng", "fra"]`.
2. Pass the image through the model and keep only the boxes labelled "Text", "Title", or "List", which excludes figures and tables.
3. Sort the boxes by their position on the page. A paper's pages can have 2 columns, so the left column is sorted from top to bottom first, followed by the right column.
4. Apply the `refine_blocks` function to handle overlapping boxes.
5. Infer the text for each box, padding each cropped image by 5 pixels. The padding improves the accuracy of the `ocr_agent`.
6. Append the texts and the labels to a list.
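
The two-column sorting step can be illustrated on plain `(x1, y1, x2, y2)` tuples. This is a sketch of the same idea the notebook implements with `lp.Interval` and `filter_by`: a box belongs to the left column when its horizontal center lies left of `page_width / 2 * 1.05`. The function name `reading_order` and the sample boxes are made up for the example.

```python
def reading_order(boxes, page_width, tolerance=1.05):
    """Order boxes as left column top to bottom, then right column top to bottom."""
    split_x = page_width / 2 * tolerance
    left, right = [], []
    for box in boxes:
        center_x = (box[0] + box[2]) / 2
        if center_x <= split_x:
            left.append(box)
        else:
            right.append(box)
    # Within each column, sort by the top edge (y1)
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])


# Four made-up boxes on a page that is 100 units wide
right_top = (60, 10, 95, 30)
left_bottom = (5, 50, 45, 80)
left_top = (5, 10, 45, 40)
right_bottom = (60, 40, 95, 90)

ordered = reading_order([right_top, left_bottom, left_top, right_bottom], page_width=100)
print(ordered)
```

A box that spans both columns, such as a full-width title, has its center near the middle of the page and therefore lands in the left column because of the 5 % tolerance.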

In [None]:
ocr_agent                   =   lp.TesseractAgent(languages="eng")

def extract_text_pdf_image_PubLay_OCR(img_path):
    texts       =   []
    image       =   cv2.imread(img_path)
    image       =   image[..., ::-1]
    layout      =   model_publay.detect(image)
    text_blocks =   lp.Layout([b for b in layout if b.type in ['Text', 'List', 'Title']])

    # Organize text blocks based on their positions on the page
    h, w            =   image.shape[:2]
    left_interval   =   lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)
    left_blocks     =   text_blocks.filter_by(left_interval, center=True)
    left_blocks.sort(key = lambda b:b.coordinates[1], inplace=True)

    right_blocks            =   lp.Layout([b for b in text_blocks if b not in left_blocks])
    right_blocks.sort(key   =   lambda b:b.coordinates[1], inplace=True)

    text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])

    for layout_i in text_blocks:    # If some of the blocks overlap -> Take the one with the most associated area
        for layout_j in text_blocks:
            if layout_i != layout_j:
                refine_blocks(layout_i, layout_j)

    for block in text_blocks:
        if block.type == 'None':    # smaller of two overlapping boxes, dropped by refine_blocks
            continue
        # Padding each image segment by 5 px can help improve OCR robustness
        segment_image = (block
                        .pad(left=5, right=5, top=5, bottom=5)
                        .crop_image(image))

        text = ocr_agent.detect(segment_image)
        block.set(text=text, inplace=True)
        texts.append([block.text, block.type])
    return texts

In [None]:
from langchain_core.documents import Document

def images_2_OCR(imgs_paths):
    docs        =   []
    for img_path_idx in range(len(os.listdir(imgs_paths))):
        img_path        =   os.path.join(imgs_paths, "page{}.jpg".format(img_path_idx))
        page_content    =   extract_text_pdf_image_PubLay_OCR(img_path)
        for content in page_content:
            text    =   content[0]
            cat     =   content[1]
            if "REFERENCES" in text and cat == "Title":  # exclude references
                return docs
            metadata        =   {"page_number" : img_path_idx, "category" : cat, "source" : paper_number}
            docs.append(Document(page_content=text, metadata=metadata))

    return docs

In [None]:
# Collect the arXiv papers referenced in the text ("adjacent" papers)
adjacents_papers_urls       =   []
adjacents_papers_numbers    =   []
for doc in docs:
    for url in re.findall(r'https?://arxiv\.org/abs/\S+', doc.page_content):
        number  =   re.search(r'\d{4}\.\d{4,5}', url)
        if number is None:      # e.g. old-style identifiers; skip instead of raising IndexError
            continue
        adjacents_papers_urls.append(f"https://arxiv.org/pdf/{number.group(0)}.pdf")
        adjacents_papers_numbers.append(number.group(0))

In [None]:
# loop over the papers:
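# Note: vdb_chunks is the vector store; it is not created in this notebook and
# must be initialised before this cell runs.
for pdf_number in adjacents_papers_numbers:
    load_pdf(f"https://arxiv.org/pdf/{pdf_number}.pdf")     # download before loading from disk
    adj_docs    =   PDFMinerLoader(f"papers/{pdf_number}.pdf").load()
    adj_docs    =   text_splitter.split_documents(adj_docs)
    vdb_chunks.add_documents(adj_docs)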