## **PDF extraction with Unstructured & Pandas**

### **1. Import Libraries**
- Import required libraries for file handling, data processing, and PDF extraction.

### **2. Define Directories**
- Set paths for PDFs, output files, and extracted images.  
- Create directories if they don’t exist.

### **3. Helper Functions**
- `serialize_image()`: Convert image to base64 for storage.  
- `serialize_object()`: Pickle an object and encode it as base64 (used for tables).  
- `categorize_elements()`: Separate text and tables from extracted elements.  

### **4. Process PDFs**
- Use `partition_pdf()` to extract text, tables, and images.  
- Store extracted text and tables in a structured format.  
- Save images as base64-encoded strings.

### **5. Process All PDFs in Directory**
- Iterate through all PDFs and process each file.

### **6. Save Extracted Data**
- Convert extracted data into a Pandas DataFrame.  
- Save the DataFrame as a Parquet file for efficient storage.

<span style="color:red;">### Issue: In a multi-column PDF, extraction reads the first paragraph of the first column and then jumps to the first paragraph of the second column, instead of continuing with the second paragraph of the first column.</span>
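
One possible way to restore reading order, sketched under assumptions: if each extracted block carries a bounding box, blocks can be grouped by column using the page midline and then sorted top to bottom within each column. The `Block` tuple and `two_column_order` function are hypothetical names for illustration and are not part of `unstructured`. The sketch assumes a strict two-column page with y increasing downward.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical stand-in for an extracted element with a bounding box."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as in image coordinates)
    x1: float  # right edge


def two_column_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Return blocks in reading order: left column top to bottom, then right column."""
    midline = page_width / 2

    def sort_key(block: Block):
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < midline else 1
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


# Blocks listed in the problematic order: row by row across both columns
blocks = [
    Block("left para 1", 50, 100, 280),
    Block("right para 1", 320, 100, 550),
    Block("left para 2", 50, 300, 280),
    Block("right para 2", 320, 300, 550),
]
ordered = [b.text for b in two_column_order(blocks, page_width=600)]
print(ordered)
```

Full-width blocks such as page titles would need separate handling. With chunking enabled, any reordering would likely have to happen before elements are merged into chunks.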


In [12]:
import os
import json
import pandas as pd
import base64
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Table, Image, Title, NarrativeText
import unstructured.documents.elements
import pickle

# Define directories
pdf_dir = "./data/"
output_dir = "processed_output/"
image_dir = "./figures/"
os.makedirs(output_dir, exist_ok=True)
os.makedirs(image_dir, exist_ok=True)

# Clear the files in the figures directory
if os.path.isdir(image_dir):
    for filename in os.listdir(image_dir):
        file_path = os.path.join(image_dir, filename)
        if os.path.isfile(file_path):  # Only remove files
            os.remove(file_path)
    print(f"All files in {image_dir} have been deleted.")
else:
    print(f"{image_dir} is not a valid directory.")

# Pandas DataFrame storage
data_entries = []

# Function to serialize image to base64
def serialize_image(image_path):
    """Convert an image to a base64-encoded string."""
    with open(image_path, "rb") as img_file:
        img_bytes = img_file.read()
        return base64.b64encode(img_bytes).decode("utf-8")  # Convert bytes to UTF-8 string

def serialize_object(obj):
    """Serialize an object using pickle and encode it as a base64 string."""
    obj_bytes = pickle.dumps(obj)  # Convert object to bytes
    return base64.b64encode(obj_bytes).decode('utf-8')  # Encode as base64 string


def categorize_elements(raw_pdf_elements):
    """Categorize extracted elements into texts and tables."""
    texts = []
    tables = []
    for element in raw_pdf_elements:
        # Check if element is a Table
        if isinstance(element, unstructured.documents.elements.Table):
            tables.append(element)
        # Check if element is a text chunk (with a chunking strategy set, text is returned as CompositeElement chunks)
        elif isinstance(element, unstructured.documents.elements.CompositeElement):
            texts.append(element)
    return texts, tables

def process_pdf(pdf_path):
    pdf_name = os.path.basename(pdf_path)
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        hi_res_model_name="yolox",
        infer_table_structure=True,
        extract_images_in_pdf=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=image_dir  # Storing images in the ./figures directory
    )

    # Categorize the elements into texts and tables
    texts, tables = categorize_elements(elements)

    # Process text elements
    for text in texts:
        entry = {"pdf_name": pdf_name, "element_type": "text", "content": text.text}
        data_entries.append(entry)

    # Process table elements
    for table in tables:
        entry = {"pdf_name": pdf_name, "element_type": "table", "content": serialize_object(table.to_dict())}
        data_entries.append(entry)

    # Process images in the ./figures directory
    for image_filename in os.listdir(image_dir):
        image_path = os.path.join(image_dir, image_filename)
        if image_filename.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp')):  # Filter for image files
            image_base64 = serialize_image(image_path)  # Convert the image to base64 string
            entry = {"pdf_name": pdf_name, "element_type": "image", "content": image_base64}
            data_entries.append(entry)

# ###--Uncomment for processing PDF again & save df----#####
# # Process all PDFs in the directory
# for pdf_file in os.listdir(pdf_dir):
#     if pdf_file.endswith(".pdf"):
#         pdf_path = os.path.join(pdf_dir, pdf_file)
#         print(f"Processing: {pdf_file}")
#         process_pdf(pdf_path)
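#         # Clear the files in the figures directory so the next PDF does not pick up these images
#         for filename in os.listdir(image_dir):
#             file_path = os.path.join(image_dir, filename)
#             if os.path.isfile(file_path):
#                 os.remove(file_path)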
            
# # Convert extracted data to a Pandas DataFrame
# df = pd.DataFrame(data_entries)
# # Save DataFrame to a Parquet file (efficient binary format)
# df.to_parquet(os.path.join(output_dir, "extracted_data_raw.parquet"), index=False)

print("Processing complete. Data saved in Pandas DataFrame!")

All files in ./figures/ have been deleted.
Processing: Sample_Table.pdf
Processing: chap_1_content.pdf
Processing complete. Data saved in Pandas DataFrame!


In [13]:
import pandas as pd
# Load the Parquet file into a DataFrame
df = pd.read_parquet(os.path.join(output_dir, "extracted_data_raw.parquet"))
# Check the loaded DataFrame
df

Unnamed: 0,pdf_name,element_type,content
0,Sample_Table.pdf,table,gASV6wIAAAAAAAB9lCiMBHR5cGWUjAVUYWJsZZSMCmVsZW...
1,chap_1_content.pdf,text,1 www.tntextbooks.in LAWS OF MOTION\n\nLearnin...
2,chap_1_content.pdf,text,in this unit.\n\n1\n\n| | 10th_Science_Unit-1....
3,chap_1_content.pdf,text,1 .2 .1 T ypes of I nertia\n\na) Inertia of re...
4,chap_1_content.pdf,text,forces.\n\n(a) Like parallel forces: Two or mo...
5,chap_1_content.pdf,text,1 .4 .5 Rotating Effect of Force\n\nThe door c...
6,chap_1_content.pdf,text,1 .4 .7 Application of T orque\n\n1. Gears:\n\...
7,chap_1_content.pdf,text,1 .5 NE W T ON’ S S E C OND L A W OF MOT I ON\...
8,chap_1_content.pdf,text,1 .7 NE W T ON’ S T H I R D L A W OF MOT I ON\...
9,chap_1_content.pdf,text,Figure 1.7 Conservation of\n\nlinear momentum\...


# OpenAI Summarization and Image Analysis

This notebook demonstrates how to use the OpenAI API to summarize text and analyze images within a DataFrame.

## Steps:

1. **Setup**: 
   - Load the OpenAI API key using `dotenv`.
   - Initialize the `openai` client.

2. **Text Summarization**: 
   - The `get_summary` function sends text to OpenAI's `gpt-4` model for summarization.

3. **Image Analysis**: 
   - The `analyze_image` function sends base64-encoded images to the vision-capable `gpt-4o` model for analysis.

4. **Apply Functions**: 
   - Summarization is applied to rows with `element_type == "text"`.
   - Image analysis is applied to rows with `element_type == "image"`.

5. **Save and Display**:
   - Save the processed DataFrame as a Parquet file.
   - Print the DataFrame with summaries.

<span style="color:red;">### Issue: How to summarize table input?</span>
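
One way to approach this, as a sketch: the `content` of a table row is the base64-encoded pickle produced by `serialize_object(table.to_dict())`, so it can be decoded back into a dict and its text passed to a text summarizer such as `get_summary`. The function name `table_payload_to_text` is hypothetical, and the dict layout (a `text` key plus an optional `metadata["text_as_html"]`) is an assumption about what `to_dict()` returns.

```python
import base64
import pickle


def table_payload_to_text(encoded: str) -> str:
    """Decode a base64-encoded pickle of a table dict and return text to summarize."""
    table_dict = pickle.loads(base64.b64decode(encoded))
    metadata = table_dict.get("metadata") or {}
    # Assumption: an HTML rendering, when present, lives under metadata["text_as_html"].
    return metadata.get("text_as_html") or table_dict.get("text", "")


# Hypothetical table dict standing in for the output of table.to_dict()
sample = {
    "type": "Table",
    "text": "Quantity Unit Force newton",
    "metadata": {"text_as_html": "<table><tr><td>Force</td><td>newton</td></tr></table>"},
}
encoded = base64.b64encode(pickle.dumps(sample)).decode("utf-8")
table_text = table_payload_to_text(encoded)
print(table_text)
```

Unpickling executes arbitrary code, so this should only be applied to payloads written by this notebook itself.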


In [23]:
import openai
import pandas as pd
import time
import os
from dotenv import load_dotenv
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up OpenAI client with your API key
client = openai.OpenAI(api_key=openai_api_key)

# Function to get summary from OpenAI API
def get_summary(text):
    try:
        response = client.chat.completions.create(
            model="gpt-4",  # Use "gpt-3.5-turbo" if needed
            messages=[
                {"role": "system", "content": "Summarize the following content concisely."},
                {"role": "user", "content": text}
            ],
            temperature=0.5,
            max_tokens=100
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error: {e}")
        return None
        
# Function to send base64 image to OpenAI API for analysis
def analyze_image(base64_image):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # Vision-enabled model
            messages=[
                {"role": "system", "content": "Analyze the image and describe its contents."},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "What does this image contain?"},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ],
                },
            ],
            temperature=0.5,
            max_tokens=300
        )
        return response.choices[0].message.content.strip()  # Extract text response
    except Exception as e:
        print(f"Error: {e}")
        return None

# ###--Uncomment for openai processing again----#####
# def summarize_row(row):
#     if row["element_type"] == "text":
#         return get_summary(row["content"])
#     elif row["element_type"] == "image":
#         return analyze_image(row["content"])
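#     elif row["element_type"] == "table":
#         # The table content is a base64-encoded pickle of table.to_dict(), not an image,
#         # so decode it and summarize its text instead of calling analyze_image().
#         table_dict = pickle.loads(base64.b64decode(row["content"]))
#         return get_summary(table_dict.get("text", ""))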
#     else:
#         return None

# df["summary"] = df.apply(summarize_row, axis=1)

# # Save DataFrame to a Parquet file (efficient binary format)
# df.to_parquet(os.path.join(output_dir, "extracted_data_summary.parquet"), index=False)


In [22]:
import pandas as pd
# Load the Parquet file into a DataFrame
df = pd.read_parquet(os.path.join(output_dir, "extracted_data_summary.parquet"))
# Check the loaded DataFrame
df

Unnamed: 0,pdf_name,element_type,content,summary
0,Sample_Table.pdf,table,gASV6wIAAAAAAAB9lCiMBHR5cGWUjAVUYWJsZZSMCmVsZW...,
1,chap_1_content.pdf,text,1 www.tntextbooks.in LAWS OF MOTION\n\nLearnin...,This lesson on the laws of motion aims to educ...
2,chap_1_content.pdf,text,in this unit.\n\n1\n\n| | 10th_Science_Unit-1....,"The content discusses the basics of mechanics,..."
3,chap_1_content.pdf,text,1 .2 .1 T ypes of I nertia\n\na) Inertia of re...,The text discusses different types of inertia:...
4,chap_1_content.pdf,text,forces.\n\n(a) Like parallel forces: Two or mo...,The text discusses different types of forces a...
5,chap_1_content.pdf,text,1 .4 .5 Rotating Effect of Force\n\nThe door c...,The rotating effect of force is demonstrated w...
6,chap_1_content.pdf,text,1 .4 .7 Application of T orque\n\n1. Gears:\n\...,The application of torque can be seen in gears...
7,chap_1_content.pdf,text,1 .5 NE W T ON’ S S E C OND L A W OF MOT I ON\...,Newton's Second Law of Motion states that the ...
8,chap_1_content.pdf,text,1 .7 NE W T ON’ S T H I R D L A W OF MOT I ON\...,The content explains the concept of impulse an...
9,chap_1_content.pdf,text,Figure 1.7 Conservation of\n\nlinear momentum\...,This passage describes the concept of conserva...


## Prepare QA from textbook QA page
<span style="color:red;">### Issue: Questions in a multi-column PDF are not extracted correctly. The question and answer strings are interleaved because the extraction assumes a single-column layout.</span>


In [24]:
# Define directories
pdf_dir = "./data/QA"
output_dir = "processed_output/"
image_dir = "./figures/"
os.makedirs(output_dir, exist_ok=True)
os.makedirs(image_dir, exist_ok=True)

# Clear the files in the figures directory
if os.path.isdir(image_dir):
    for filename in os.listdir(image_dir):
        file_path = os.path.join(image_dir, filename)
        if os.path.isfile(file_path):  # Only remove files
            os.remove(file_path)
    print(f"All files in {image_dir} have been deleted.")
else:
    print(f"{image_dir} is not a valid directory.")

# Pandas DataFrame storage
data_entries = []

###--Uncomment for processing PDF again & save df----#####
# Process all PDFs in the directory
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_dir, pdf_file)
        print(f"Processing: {pdf_file}")
        process_pdf(pdf_path)
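        # Clear the files in the figures directory so the next PDF does not pick up these images
        for filename in os.listdir(image_dir):
            file_path = os.path.join(image_dir, filename)
            if os.path.isfile(file_path):
                os.remove(file_path)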
            
# Convert extracted data to a Pandas DataFrame
df_qa = pd.DataFrame(data_entries)
# Save DataFrame to a Parquet file (efficient binary format)
df_qa.to_parquet(os.path.join(output_dir, "extracted_data_qa.parquet"), index=False)

print("Processing complete. Data saved in Pandas DataFrame!")

All files in ./figures/ have been deleted.
Processing: chap_1_QA.pdf
Processing complete. Data saved in Pandas DataFrame!


In [25]:
df_qa

Unnamed: 0,pdf_name,element_type,content
0,chap_1_QA.pdf,text,|\n\nTT\n\n&\n\nwww.tntextbooks.in\n\n@ iG\n\n...
1,chap_1_QA.pdf,text,handle.\n\n5. There is no gravity in the orbit...
2,chap_1_QA.pdf,text,VIII. Answer in detail.\n\nReason: ‘g’ depends...
3,chap_1_QA.pdf,text,ICT CORNER\n\nNewton’s second law\n\nSteps\n\n...
4,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
5,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
6,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
7,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
8,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
9,chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...


In [26]:
df_qa['content'].iloc[0]

'|\n\nTT\n\n&\n\nwww.tntextbooks.in\n\n@ iG\n\nT E X T B O O K E V A L U A T I O N\n\nI. Choose the correct answer\n\n1) Inertia of a body depends on\n\n9) If the Earth shrinks to 50% of its real radius its mass remaining the same, the weight of a body on the Earth will\n\na) weight of the object\n\nb) acceleration due to gravity of the planet\n\na) decrease by 50% b) increase by 50% c) decrease by 25% d) increase by 300%\n\nc) mass of the object\n\nd) Both a & b\n\n10) To project the rockets which of the follow- ing principle(s) is /(are) required?\n\n2) Impulse is equals to a) rate of change of momentum b) rate of force and time c) change of momentum d) rate of change of mass\n\na) Newton’s third law of motion b) Newton’s law of gravitation c) law of conservation of linear momentum d) both a and c\n\nII. Fill in the blanks\n\n3) Newton’s III law is applicable a) for a body is at rest b) for a body in motion c) both a & b\n\nc) both a & b\n\n1. To produce a displacement ___________\n\

## Prepare QA from textbook QA page
### Split 2-column pdf into two pages 

In [8]:
# from pdf2image import convert_from_path
# from PIL import Image
# from reportlab.pdfgen import canvas
# from reportlab.lib.pagesizes import letter

# # Path to your multi-column PDF
# pdf_path = './data/QA/chap_1_QA.pdf'

# # Convert PDF to images (one per page)
# pages = convert_from_path(pdf_path, 300)  # 300 DPI for better accuracy

# # List to hold new pages (images)
# new_pages = []

# # Process each page
# for page_number, page in enumerate(pages):
#     width, height = page.size
    
#     # Define the left and right halves of the page (vertically split)
#     left_half = page.crop((0, 0, width // 2, height))  # Left half
#     right_half = page.crop((width // 2, 0, width, height))  # Right half
    
#     # Save the left and right halves as separate new images
#     left_image_path = f"left_page_{page_number + 1}.png"
#     right_image_path = f"right_page_{page_number + 1}.png"
    
#     left_half.save(left_image_path)
#     right_half.save(right_image_path)
    
#     # Append image paths to new_pages (for later inclusion in PDF)
#     new_pages.append(left_image_path)
#     new_pages.append(right_image_path)

# # Create a new PDF with the new images (left and right halves)
# output_pdf_path = './data/QA/new_chap_1_QA.pdf'
# c = canvas.Canvas(output_pdf_path, pagesize=letter)

# for image_path in new_pages:
#     c.drawImage(image_path, 0, 0, width=400, height=600)  # Adjust the width/height as needed
#     c.showPage()  # Move to the next page

# c.save()

# print(f"New PDF saved at {output_pdf_path}")
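
The fixed `width=400, height=600` in the `drawImage` call above stretches each half page. As a sketch, the draw size can instead be computed so the image keeps its aspect ratio and is centered on the page. `fit_to_page` is a hypothetical helper, and the example numbers assume a letter page (612 x 792 points) and a half-page image rendered at 300 DPI (1275 x 3300 pixels).

```python
def fit_to_page(img_w: float, img_h: float, page_w: float, page_h: float):
    """Return (x, y, width, height) that fits the image on the page without distortion."""
    scale = min(page_w / img_w, page_h / img_h)  # the tighter dimension limits the scale
    draw_w = img_w * scale
    draw_h = img_h * scale
    x = (page_w - draw_w) / 2  # center horizontally
    y = (page_h - draw_h) / 2  # center vertically
    return x, y, draw_w, draw_h


# Half of a letter page at 300 DPI placed on a letter page measured in points
x, y, w, h = fit_to_page(1275, 3300, 612, 792)
print(x, y, w, h)  # approximately 153, 0, 306, 792
```

The result could then be passed to the existing call as `c.drawImage(image_path, x, y, width=w, height=h)`.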

## Prepare QA from textbook QA page
### Extract using split pdf (questions and answers are more streamlined)

In [27]:
# Define directories
pdf_dir = "./data/QA_split"
output_dir = "processed_output/"
image_dir = "./figures/"
os.makedirs(output_dir, exist_ok=True)
os.makedirs(image_dir, exist_ok=True)

# Clear the files in the figures directory
if os.path.isdir(image_dir):
    for filename in os.listdir(image_dir):
        file_path = os.path.join(image_dir, filename)
        if os.path.isfile(file_path):  # Only remove files
            os.remove(file_path)
    print(f"All files in {image_dir} have been deleted.")
else:
    print(f"{image_dir} is not a valid directory.")

# Pandas DataFrame storage
data_entries = []

###--Uncomment for processing PDF again & save df----#####
# Process all PDFs in the directory
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_dir, pdf_file)
        print(f"Processing: {pdf_file}")
        process_pdf(pdf_path)
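        # Clear the files in the figures directory so the next PDF does not pick up these images
        for filename in os.listdir(image_dir):
            file_path = os.path.join(image_dir, filename)
            if os.path.isfile(file_path):
                os.remove(file_path)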
            
# Convert extracted data to a Pandas DataFrame
df_qa_split = pd.DataFrame(data_entries)
# Save DataFrame to a Parquet file (efficient binary format)
df_qa_split.to_parquet(os.path.join(output_dir, "extracted_data_qa.parquet"), index=False)

print("Processing complete. Data saved in Pandas DataFrame!")

All files in ./figures/ have been deleted.
Processing: new_chap_1_QA.pdf
Processing complete. Data saved in Pandas DataFrame!


In [28]:
df_qa_split

Unnamed: 0,pdf_name,element_type,content
0,new_chap_1_QA.pdf,text,J o_O\n\n4\n\nwww .tnte>\n\n\& TEXTBOOK EVALUA...
1,new_chap_1_QA.pdf,text,Laws of motion\n\ns2/s/2022 2:48:49 PM\n\nJ =\...
2,new_chap_1_QA.pdf,text,VIII. Answer in detail.\n\n1. What are the typ...
3,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
4,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
5,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...
6,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...


In [30]:
df_qa_split['content'].iloc[0]

'J o_O\n\n4\n\nwww .tnte>\n\n\\& TEXTBOOK EVALUATIO\n\nS\n\nI. Choose the correct answer 1) Inertia of a body depends on a) weight of the object b) acceleration due to gravity of the planet c) mass of the object d) Botha & b 2) Impulse is equals to a) rate of change of momentum b) rate of force and time c) change of momentum d) rate of change of mass 3) Newtons III law is applicable a) for a body is at rest b) for a body in motion c) botha &b d) only for bodies with equal masses 4) Plotting a graph for momentum on the Y-axis and time on X-axis. slope of momen- tum-time graph gives a) Impulsive force b) Acceleration c) Force d) Rate of force 5) In which of the following sport the turning of effect of force used a) swimming b) tennis c) cycling d) hockey 6) The unit of ‘g’ ism s*. It can be also expressed as a) cms! b) Nkg"! c) Nm’kg"! d) cm/’s? 7) One kilogram force equals to a) 9.8 dyne b) 9.8 x 10° N c) 98 x 10* dyne d) 980 dyne 8) The mass of a body is measured on planet Earth as M k

## Pass QA string to GPT and get QA in json format
<span style="color:red;"> ### Issue: Further refinement required </span>

In [77]:
# Function to extract questions and answers from text via the OpenAI API
def get_qa(text):
    """
    Uses the OpenAI API to extract questions and answers from the given text.
    The response is expected to be in JSON format, with keys like 'question', 
    'answer', and 'question_type'.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # You can use "gpt-3.5-turbo" if needed
            messages=[
                {
                    "role": "system", 
                    "content": ("Prepare question and answer in JSON format. "
                                "Include keys for 'question', 'answer', and 'question_type'. "
                                "Include a) b) c) d) options for multiple choice questions. "
                                "If some key values are incomplete, fill them with null so that I can read them using Python.")
                },
                {"role": "user", "content": text}
            ],
            temperature=0.5,
            max_tokens=12000
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error: {e}")
        return None

def summarize_row(row):
    if row["element_type"] == "text":
        return get_qa(row["content"])
    # If you want to handle other element types (e.g., image or table), add additional logic here.
    return None

####--Uncomment for processing & save df----#####
# # Apply the function to extract Q&A in JSON format for text elements
# df_qa_split["summary"] = df_qa_split.apply(summarize_row, axis=1)
# # Save DataFrame to a Parquet file (efficient binary format)
# df_qa_split.to_parquet(os.path.join(output_dir, "extracted_data_qa.parquet"), index=False)

In [85]:
# Load the Parquet file into a DataFrame
df_qa_split = pd.read_parquet(os.path.join(output_dir, "extracted_data_qa.parquet"))
# Check the loaded DataFrame
df_qa_split

Unnamed: 0,pdf_name,element_type,content,summary
0,new_chap_1_QA.pdf,text,J o_O\n\n4\n\nwww .tnte>\n\n\& TEXTBOOK EVALUA...,"```json\n{\n ""question_type"": ""multiple_choic..."
1,new_chap_1_QA.pdf,text,Laws of motion\n\ns2/s/2022 2:48:49 PM\n\nJ =\...,"```json\n{\n ""question"": ""Match the following..."
2,new_chap_1_QA.pdf,text,VIII. Answer in detail.\n\n1. What are the typ...,"```json\n{\n ""question"": ""What are the types ..."
3,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...,
4,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...,
5,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...,
6,new_chap_1_QA.pdf,image,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...,


In [86]:
import json

def extract_questions(json_str):
    """
    Given a JSON string containing a list of questions, 
    parse it into a DataFrame with selected columns.
    """
    try:
        # Clean up the JSON string by removing markdown code fences
        cleaned_str = json_str.strip()
        if cleaned_str.startswith("```json"):
            cleaned_str = cleaned_str.replace("```json", "", 1)
        if cleaned_str.endswith("```"):
            cleaned_str = cleaned_str[:-3]
        
        # Wrap the objects in square brackets if they're not already
        cleaned_str = cleaned_str.strip()
        if not (cleaned_str.startswith("[") and cleaned_str.endswith("]")):
            cleaned_str = f"[{cleaned_str}]"
        
        # Parse the JSON string
        data = json.loads(cleaned_str)
        df_q = pd.DataFrame(data)
        # Select the desired columns
        df_q = df_q[['question_type', 'question', 'answer', 'options']]
        return df_q
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        # Return an empty DataFrame with the correct columns for consistency
        return pd.DataFrame(columns=['question_type', 'question', 'answer', 'options'])

# Apply the function to every row in the 'summary' column of df_qa_split.
# This returns a Series of DataFrames.
question_dfs = df_qa_split['summary'].apply(extract_questions)

# Combine all individual DataFrames into one
combined_questions_df = pd.concat(question_dfs.tolist(), ignore_index=True)
combined_questions_df

Error parsing JSON: 'NoneType' object has no attribute 'strip'
Error parsing JSON: 'NoneType' object has no attribute 'strip'
Error parsing JSON: 'NoneType' object has no attribute 'strip'
Error parsing JSON: 'NoneType' object has no attribute 'strip'
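
The four errors above come from the image rows, whose `summary` is `None`. A minimal sketch of a parser that skips such rows instead of raising, using only the standard library; `parse_qa_json` is a hypothetical name, and returning an empty list for unusable input is a design assumption rather than the notebook's behavior.

```python
import json
from typing import Any, Dict, List, Optional

FENCE = "`" * 3  # markdown code fence the model wraps around its JSON


def parse_qa_json(raw: Optional[str]) -> List[Dict[str, Any]]:
    """Parse a model response into a list of question dicts; return [] if unusable."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # e.g. image rows whose summary is None or NaN
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, including an optional "json" language tag
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
    cleaned = cleaned.rstrip()
    if cleaned.endswith(FENCE):
        cleaned = cleaned[:-3]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        try:
            # Several comma-separated objects without surrounding brackets
            data = json.loads(f"[{cleaned}]")
        except json.JSONDecodeError:
            return []
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]


raw = FENCE + 'json\n{"question": "Q1", "answer": "c"}\n' + FENCE
parsed = parse_qa_json(raw)
print(parsed)
print(parse_qa_json(None))
```

The resulting lists could be concatenated and passed to `pd.DataFrame`, with `reindex(columns=[...])` guarding against missing keys such as `options`.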


Unnamed: 0,question_type,question,answer,options
0,multiple_choice,1) Inertia of a body depends on,c,"{'a': 'weight of the object', 'b': 'accelerati..."
1,multiple_choice,2) Impulse is equals to,c,"{'a': 'rate of change of momentum', 'b': 'rate..."
2,multiple_choice,3) Newton's III law is applicable,c,"{'a': 'for a body is at rest', 'b': 'for a bod..."
3,multiple_choice,4) Plotting a graph for momentum on the Y-axis...,c,"{'a': 'Impulsive force', 'b': 'Acceleration', ..."
4,multiple_choice,5) In which of the following sport the turning...,c,"{'a': 'swimming', 'b': 'tennis', 'c': 'cycling..."
5,multiple_choice,6) The unit of ‘g’ is m/s². It can be also exp...,b,"{'a': 'cm/s²', 'b': 'N/kg', 'c': 'Nm/kg', 'd':..."
6,multiple_choice,7) One kilogram force equals to,b,"{'a': '9.8 dyne', 'b': '9.8 x 10³ N', 'c': '98..."
7,multiple_choice,8) The mass of a body is measured on planet Ea...,d,"{'a': '4M', 'b': '2M', 'c': 'M/4', 'd': 'M'}"
8,multiple_choice,9) If the Earth shrinks to 50% of its real rad...,d,"{'a': 'decrease by 50%', 'b': 'increase by 50%..."
9,multiple_choice,10) To project the rockets which of the follow...,d,"{'a': 'Newton's third law of motion', 'b': 'Ne..."


### At this point, the chapter content and the QA pages have been extracted from the PDFs separately and stored in DataFrames. Next, we load the context DataFrame (chapter content) and the QA DataFrame (prompts) and perform vector search, RAG, and inference.