# Benchmark Data Loader

This notebook reads benchmark question files (`.txt`) from the subdirectories of a base folder and loads them into a pandas DataFrame for analysis.

## Import Required Libraries

Import pandas for the DataFrame and os for file-system operations. numpy, pathlib, and glob are imported as well, although the cells below do not use them.

In [1]:
import pandas as pd
import os
import numpy as np
from pathlib import Path
import glob

print("Libraries imported successfully!")

Libraries imported successfully!


## Define Benchmark Data Structure

Define the column names expected in each benchmark record and the base folder that holds the data.

In [2]:
# Define the benchmark data structure
BENCHMARK_COLUMNS = [
    'subdirectory',
    'file_number', 
    'filename',
    'file_path',
    'question_content',
    'file_size',
    'question_length'
]

# Base folder containing benchmark data
BASE_FOLDER = "Benchmark"

print(f"Benchmark structure defined with columns: {BENCHMARK_COLUMNS}")
print(f"Base folder: {BASE_FOLDER}")

Benchmark structure defined with columns: ['subdirectory', 'file_number', 'filename', 'file_path', 'question_content', 'file_size', 'question_length']
Base folder: Benchmark
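
`BENCHMARK_COLUMNS` is only a list of names, so nothing enforces it when records are built later. As a minimal sketch, a helper such as the hypothetical `check_record` below (not part of the original notebook) could compare one record's keys against that list. The example record is illustrative and deliberately incomplete.

```python
# Column list copied from the cell above.
BENCHMARK_COLUMNS = [
    'subdirectory', 'file_number', 'filename', 'file_path',
    'question_content', 'file_size', 'question_length',
]


def check_record(record, expected=BENCHMARK_COLUMNS):
    """Hypothetical helper: return (missing, unexpected) key lists for one record."""
    missing = [col for col in expected if col not in record]
    unexpected = [key for key in record if key not in expected]
    return missing, unexpected


# Illustrative record with only three of the seven expected keys.
example = {'subdirectory': 'foundation_P3', 'file_number': 1, 'filename': '1.txt'}
missing, unexpected = check_record(example)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

A check like this would most naturally run once per record inside the loading loop, or once on the finished list before building the DataFrame.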


## Create Benchmark Access Functions

Create helper functions that list the subdirectories of the base folder, list the `.txt` files in each one, and read a file's content.

In [3]:
def get_benchmark_subdirectories(base_folder):
    """
    Get all subdirectories in the benchmark folder.
    
    Args:
        base_folder (str): Path to the base benchmark folder
        
    Returns:
        list: List of subdirectory names
    """
    if not os.path.exists(base_folder):
        print(f"Warning: Base folder '{base_folder}' does not exist!")
        return []
    
    subdirs = [subdir for subdir in os.listdir(base_folder) 
               if os.path.isdir(os.path.join(base_folder, subdir))]
    
    print(f"Found {len(subdirs)} subdirectories: {subdirs}")
    return subdirs


def get_text_files_in_directory(directory_path):
    """
    Get all .txt files in a directory, sorted by file number.
    
    Args:
        directory_path (str): Path to the directory
        
    Returns:
        list: Sorted list of .txt filenames
    """
    if not os.path.exists(directory_path):
        return []
    
    files = [f for f in os.listdir(directory_path) if f.endswith('.txt')]
    # Sort numerically by the filename stem (assumes names like "1.txt", "2.txt")
    try:
        files = sorted(files, key=lambda x: int(x.split('.')[0]))
    except ValueError:
        # If any filename is non-numeric, the whole list falls back to alphabetical order
        files = sorted(files)
    
    return files


def read_text_file_content(file_path):
    """
    Read content from a text file safely.
    
    Args:
        file_path (str): Path to the text file
        
    Returns:
        str: File content or empty string if error
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read().strip()
        return content
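    except (OSError, UnicodeDecodeError) as e:
        # Missing or unreadable file, or invalid UTF-8: report it and return an empty string
        print(f"Error reading file {file_path}: {e}")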
        return ""


print("Benchmark access functions defined successfully!")

Benchmark access functions defined successfully!
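
One consequence of the `try`/`except` in `get_text_files_in_directory` is that a single non-numeric name makes the whole list fall back to alphabetical order, so `10.txt` sorts before `2.txt`. The sketch below reproduces that behavior and contrasts it with a per-file sort key. `numeric_first_key` and the file name `notes.txt` are hypothetical and do not appear in the original notebook.

```python
def original_sort(files):
    """Same logic as the sorting step in get_text_files_in_directory."""
    try:
        return sorted(files, key=lambda x: int(x.split('.')[0]))
    except ValueError:
        return sorted(files)


def numeric_first_key(filename):
    """Hypothetical sort key: numeric stems first (by value), other names after (alphabetically)."""
    stem = filename.split('.')[0]
    if stem.isdecimal():
        return (0, int(stem), '')
    return (1, 0, filename)


names = ['10.txt', '2.txt', 'notes.txt', '1.txt']
print(original_sort(names))                    # alphabetical fallback for the whole list
print(sorted(names, key=numeric_first_key))    # numeric order kept, 'notes.txt' last
```

Whether non-numeric files should be kept at all is a separate design question. The loader below stores `None` as their `file_number`.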


## Load Benchmark Data

Load all benchmark data from subdirectories and prepare it for DataFrame conversion.

In [4]:
def load_all_benchmark_data(base_folder):
    """
    Load all benchmark data from subdirectories.
    
    Args:
        base_folder (str): Path to the base benchmark folder
        
    Returns:
        list: List of dictionaries containing benchmark data
    """
    benchmark_data = []
    
    # Get all subdirectories
    subdirectories = get_benchmark_subdirectories(base_folder)
    
    for subdir in subdirectories:
        subdir_path = os.path.join(base_folder, subdir)
        print(f"\nProcessing subdirectory: {subdir}")
        
        # Get all text files in this subdirectory
        text_files = get_text_files_in_directory(subdir_path)
        print(f"Found {len(text_files)} text files in {subdir}")
        
        for filename in text_files:
            file_path = os.path.join(subdir_path, filename)
            
            # Extract file number from filename
            try:
                file_number = int(filename.split('.')[0])
            except ValueError:
                file_number = None
            
            # Read file content
            content = read_text_file_content(file_path)
            
            # Get file size
            file_size = os.path.getsize(file_path) if os.path.exists(file_path) else 0
            
            # Create data record
            record = {
                'subdirectory': subdir,
                'file_number': file_number,
                'filename': filename,
                'file_path': file_path,
                'question_content': content,
                'file_size': file_size,
                'question_length': len(content)
            }
            
            benchmark_data.append(record)
    
    print(f"\nTotal records loaded: {len(benchmark_data)}")
    return benchmark_data


# Load the benchmark data
print("Loading benchmark data...")
benchmark_records = load_all_benchmark_data(BASE_FOLDER)

Loading benchmark data...
Found 2 subdirectories: ['foundation_P3', 'foundation_P4']

Processing subdirectory: foundation_P3
Found 100 text files in foundation_P3

Processing subdirectory: foundation_P4
Found 100 text files in foundation_P4

Total records loaded: 200
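
The loader records `file_size` in bytes and `question_length` in characters, and the two need not match. Plausible reasons, stated here as assumptions because the raw files are not shown, are multi-byte UTF-8 characters such as `₹`, CRLF line endings that text mode translates to `\n`, and whitespace removed by `.strip()`. The sketch below imitates those three effects on an illustrative byte string rather than on a real benchmark file.

```python
# Illustrative raw file content: two lines with CRLF endings and a multi-byte symbol.
raw = "Income: ₹ 1,600\r\n(A) ₹ 3,400\r\n".encode("utf-8")
size_in_bytes = len(raw)  # what os.path.getsize would report for such a file

# Imitate reading in text mode (CRLF becomes LF) followed by .strip().
text = raw.decode("utf-8").replace("\r\n", "\n").strip()
length_in_chars = len(text)

print(f"Bytes on disk: {size_in_bytes}")
print(f"Characters after reading: {length_in_chars}")
print(f"Difference: {size_in_bytes - length_in_chars}")
```

If byte-accurate sizes matter for the analysis, `len(content.encode('utf-8'))` would measure the stripped text in bytes instead.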


## Convert to Pandas DataFrame

Convert the loaded records into a pandas DataFrame, coerce the numeric columns to numeric data types, and sort the rows by subdirectory and file number.

In [5]:
# Convert to DataFrame
if benchmark_records:
    df = pd.DataFrame(benchmark_records)
    
    # Ensure proper data types
    df['file_number'] = pd.to_numeric(df['file_number'], errors='coerce')
    df['file_size'] = pd.to_numeric(df['file_size'], errors='coerce')
    df['question_length'] = pd.to_numeric(df['question_length'], errors='coerce')
    
    # Sort by subdirectory and file number
    df = df.sort_values(['subdirectory', 'file_number']).reset_index(drop=True)
    
    print("DataFrame created successfully!")
    print(f"Shape: {df.shape}")
    
else:
    # Create empty DataFrame with proper structure if no data found
    df = pd.DataFrame(columns=BENCHMARK_COLUMNS)
    print("No benchmark data found. Created empty DataFrame.")

print(f"\nDataFrame columns: {list(df.columns)}")

DataFrame created successfully!
Shape: (200, 7)

DataFrame columns: ['subdirectory', 'file_number', 'filename', 'file_path', 'question_content', 'file_size', 'question_length']
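
In this run every filename was numeric, so `file_number` ends up as an integer column. If a filename could not be parsed, the loader stores `None`, and `pd.to_numeric(..., errors='coerce')` turns such a column into floats with `NaN`. The sketch below shows that effect on an illustrative Series, together with pandas' nullable `Int64` dtype as one possible way to keep integers. Using `Int64` is a suggestion, not something the original notebook does.

```python
import pandas as pd

# Illustrative column: two parsed file numbers and one that failed to parse.
numbers = pd.Series([1, 2, None], dtype="object")

coerced = pd.to_numeric(numbers, errors="coerce")
print(coerced.dtype)       # a float dtype, because NaN cannot live in a plain integer column

nullable = coerced.astype("Int64")
print(nullable.dtype)      # nullable integer dtype; the missing value becomes <NA>
print(nullable.tolist())
```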


## Validate DataFrame Structure

Verify the DataFrame structure and display summary information. The final result is already stored in the variable `df` by the previous cell.

In [6]:
# Display DataFrame information
print("=== BENCHMARK DATAFRAME SUMMARY ===")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

print("\n=== COLUMN DATA TYPES ===")
print(df.dtypes)

if not df.empty:
    print("\n=== BASIC STATISTICS ===")
    print(f"Unique subdirectories: {df['subdirectory'].nunique()}")
    print(f"Subdirectories: {df['subdirectory'].unique().tolist()}")
    print(f"File number range: {df['file_number'].min()} - {df['file_number'].max()}")
    print(f"Average question length: {df['question_length'].mean():.2f} characters")
    print(f"Average file size: {df['file_size'].mean():.2f} bytes")
    
    print("\n=== FIRST FEW ROWS ===")
    display(df.head())
    
    print("\n=== SAMPLE QUESTION CONTENT ===")
    if df['question_content'].iloc[0]:
        sample_content = df['question_content'].iloc[0]
        print("First question preview (first 200 chars):")
        # Truncate long questions and mark the cut with an ellipsis
        if len(sample_content) > 200:
            print(f"'{sample_content[:200]}...'")
        else:
            print(f"'{sample_content}'")

else:
    print("\nDataFrame is empty - no benchmark data was found.")

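print("\n=== FINAL RESULT ===")
# Report success only when rows were actually loaded
if not df.empty:
    print("✅ Benchmark data successfully loaded into variable 'df'")
    print(f"✅ DataFrame shape: {df.shape}")
    print("✅ Ready for analysis!")
else:
    print("No benchmark data loaded; 'df' is an empty DataFrame.")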

=== BENCHMARK DATAFRAME SUMMARY ===
Total rows: 200
Total columns: 7
Memory usage: 134.82 KB

=== COLUMN DATA TYPES ===
subdirectory        object
file_number          int64
filename            object
file_path           object
question_content    object
file_size            int64
question_length      int64
dtype: object

=== BASIC STATISTICS ===
Unique subdirectories: 2
Subdirectories: ['foundation_P3', 'foundation_P4']
File number range: 1 - 100
Average question length: 246.22 characters
Average file size: 250.30 bytes

=== FIRST FEW ROWS ===


|   | subdirectory | file_number | filename | file_path | question_content | file_size | question_length |
|---|---|---|---|---|---|---|---|
| 0 | foundation_P3 | 1 | 1.txt | Benchmark\foundation_P3\1.txt | The ratio of income of A and B is 5 : 4 and th... | 206 | 191 |
| 1 | foundation_P3 | 2 | 2.txt | Benchmark\foundation_P3\2.txt | 2. The mean proportional between \(12x^2\) and... | 125 | 117 |
| 2 | foundation_P3 | 3 | 3.txt | Benchmark\foundation_P3\3.txt | 3. \(\log_2 \log_2 256 + 2 \log_2 \sqrt{2}\) i... | 105 | 101 |
| 3 | foundation_P3 | 4 | 4.txt | Benchmark\foundation_P3\4.txt | 4. What is the value of \(\left(\frac{x^b}{x^c... | 216 | 210 |
| 4 | foundation_P3 | 5 | 5.txt | Benchmark\foundation_P3\5.txt | 5. A number consists of two digits. The digits... | 245 | 237 |



=== SAMPLE QUESTION CONTENT ===
First question preview (first 200 chars):
'The ratio of income of A and B is 5 : 4 and their expenditure is 3 : 2. If at the end of the year each saves ₹ 1,600, then the income of A is:

(A) ₹ 3,400
(B) ₹ 3,600
(C) ₹ 4,000
(D) ₹ 4,400'

=== FINAL RESULT ===
✅ Benchmark data successfully loaded into variable 'df'
✅ DataFrame shape: (200, 7)
✅ Ready for analysis!
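
As a first analysis step, the summary statistics above could be broken down per subdirectory with `groupby`. The sketch below uses a small stand-in DataFrame with illustrative values, not the real 200-row `df`. Only the column names are taken from the notebook.

```python
import pandas as pd

# Stand-in data with the same column names as df; the values are illustrative only.
sample = pd.DataFrame({
    "subdirectory": ["foundation_P3", "foundation_P3", "foundation_P4"],
    "file_number": [1, 2, 1],
    "question_length": [191, 117, 200],
})

# Count, mean, and maximum question length for each subdirectory.
summary = (
    sample.groupby("subdirectory")["question_length"]
    .agg(["count", "mean", "max"])
)
print(summary)
```

Replacing `sample` with `df` would give the same breakdown for the loaded benchmark data.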
