# Dataset Loading

This notebook loads dataset records into the database. It demonstrates how to create a dataset entry together with its metadata and its links to existing measurements.

## Imports

In [7]:
import psycopg2
import pandas as pd
import numpy as np
from tabulate import tabulate
from pathlib import Path
# Add the parent directory to the import path so the local dbtools package can be found
import sys
sys.path.append('../')
import dbtools.dbtools as qrs
import dbtools.load as load

## Connection

In [8]:
try:
    # Connect to the PostgreSQL database
    conn = qrs.connect()
    print("Connected to the database")

except psycopg2.Error as error:
    # Report the failure and re-raise: later cells need a valid `conn`
    print(error)
    raise

Connected to the database


## Dataset Data

The `load_dataset` function creates a new dataset entry in the database with associated metadata and measurement relationships.

Parameters:
- `file_path`: The path to the dataset file
- `rows`: Number of rows in the dataset
- `patch_size`: Patch size of the dataset (for example `"32x32"`)
- `targets`: List of target names for the dataset
- `reconstruction_shape`: Shape used to view the dataset as an image (for example `(512, 512)`)
- `measurement_file_paths`: List of measurement file paths associated with this dataset
- `description`: Optional dataset description
- `additional_metadata`: Optional list of dictionaries with metadata

Each metadata dictionary should contain:
- `key`: The name of the metadata field
- `value`: The value of the metadata
- `type`: The type or unit of the metadata
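
Because each metadata entry must carry the same three fields, it can help to check the list before passing it to `load_dataset`. The following is a minimal sketch of such a check. `validate_metadata` is a hypothetical helper, not part of `dbtools`, and it only tests that the fields are present, since the source does not say what else `load_dataset` requires.

```python
from typing import Any

# Field names taken from the list above
REQUIRED_FIELDS = ("key", "value", "type")


def validate_metadata(entries: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found; an empty list means every entry has all fields."""
    problems = []
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return problems


# Illustrative entries; the second one omits 'type' on purpose
metadata = [
    {"key": "preprocessing", "value": "normalized", "type": "string"},
    {"key": "train_test_split", "value": "0.8"},
]
problems = validate_metadata(metadata)
print(problems)
```

Running the check before the load call surfaces an incomplete entry early, rather than partway through a database insert.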


## Dataset with Additional Metadata

In [9]:
# Path to the dataset file (replace with a path that exists on your system)
dataset_file_path = Path(r'\\192.168.10.106\imdea\DataDriven_UT_AlbertoVicente\04_ML_data\Juan Ignacio\JI_4\MonoElement\patch_vs_volfrac_3.csv')

df = pd.read_csv(dataset_file_path)

# Count the number of rows
rows = len(df)

# Measurement files to associate with this dataset
measurement_file_paths = [r'\\192.168.10.106\imdea\DataDriven_UT_AlbertoVicente\03_UT_data\Probetas JI\probetas\4\4.tif']

# Dataset parameters
patch_size = "32x32"
targets = ["density", "thickness"]
reconstruction_shape = (512, 512)
description = "Advanced dataset with additional metadata"

# Define additional metadata
additional_metadata = [
    {'key': 'preprocessing', 'value': 'normalized', 'type': 'string'},
    {'key': 'feature_extraction', 'value': 'wavelet transform', 'type': 'string'},
    {'key': 'augmentation', 'value': 'True', 'type': 'boolean'},
    {'key': 'train_test_split', 'value': '0.8', 'type': 'ratio'},
    {'key': 'author', 'value': 'Data Science Team', 'type': 'string'}
]

# Load the dataset with additional metadata
dataset_id = load.load_dataset(
    conn, 
    str(dataset_file_path),
    rows, 
    patch_size, 
    targets, 
    reconstruction_shape, 
    measurement_file_paths,
    description,
    additional_metadata
)

print(f"Dataset with additional metadata loaded with ID: {dataset_id}")

Dataset from '\\192.168.10.106\imdea\DataDriven_UT_AlbertoVicente\04_ML_data\Juan Ignacio\JI_4\MonoElement\patch_vs_volfrac_3.csv' loaded with ID: 67
Dataset with additional metadata loaded with ID: 67
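
The `reconstruction_shape` parameter is described as the shape used to view the dataset as an image. The sketch below shows one way that could work, under the assumption that each row contributes one value and that the number of rows equals `height * width`. The source does not state how `load_dataset` uses the shape, so treat this as an illustration only. `as_image` is a hypothetical helper and the values are toy data.

```python
import numpy as np


def as_image(values, reconstruction_shape):
    """Reshape a flat sequence of per-row values into a 2-D array (row-major order)."""
    values = np.asarray(values)
    expected = int(np.prod(reconstruction_shape))
    if values.size != expected:
        raise ValueError(
            f"{values.size} rows cannot fill shape {reconstruction_shape} "
            f"({expected} cells)"
        )
    return values.reshape(reconstruction_shape)


# Toy data: 6 rows viewed as a 2x3 image
image = as_image([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], (2, 3))
print(image.shape)
```

If that assumption holds for a given dataset, a mismatch between `rows` and the product of `reconstruction_shape` would show up as a `ValueError` here, before the entry is written.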


## Verify Insertion

In [10]:
# Get dataset data from the database
dataset_data = qrs.get_data_metadata('datasets')

# Display the dataset data
print("Dataset basic information:")
print(tabulate(dataset_data[['id_dataset', 'file_path_dataset', 'description_dataset']].tail(), 
               headers='keys', tablefmt='pretty'))

# Get relationship data
relationship_data = qrs.relation_metadata('datasets', 'measurements', 'dataset_measurements')

# Display relationships
print("\nDataset-Measurement relationships:")
print(tabulate(relationship_data.tail(), headers='keys', tablefmt='pretty'))

Dataset basic information:
+----+------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|    | id_dataset |                                                  file_path_dataset                                                   |                                 description_dataset                                  |
+----+------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
| 51 |     61     | \\192.168.10.106\imdea\DataDriven_UT_AlbertoVicente\04_ML_data\Juan Ignacio\JI_8\MonoElement\patch_vs_volfrac_9.csv  | Monoelement RF vs volfrac and areafrac. First functional group of datasets from 2024 |
| 52 |     62     | \\192.168.10.106\imdea\DataDriven_UT_AlbertoVicen
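
Reading the tail of the table by eye works for a single load. A programmatic check is sketched below, assuming `qrs.get_data_metadata` returns a pandas DataFrame with an `id_dataset` column (the cell above indexes it like one). The frame here is a hand-built stand-in with made-up values, not database output, and `dataset_exists` is a hypothetical helper.

```python
import pandas as pd


def dataset_exists(dataset_data: pd.DataFrame, dataset_id: int) -> bool:
    """Return True if a row with the given id_dataset is present."""
    return bool((dataset_data["id_dataset"] == dataset_id).any())


# Stand-in frame with placeholder values (not real database content)
stand_in = pd.DataFrame(
    {
        "id_dataset": [1, 2, 3],
        "description_dataset": ["example a", "example b", "example c"],
    }
)
print(dataset_exists(stand_in, 3))
print(dataset_exists(stand_in, 99))
```

In the notebook, the same call could take `dataset_data` and the `dataset_id` returned by `load.load_dataset`.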

In [11]:
# Close the database connection
conn.close()
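
Closing the connection in a final cell only happens if every earlier cell runs to completion. The sketch below shows a pattern that closes the connection even when the work raises, using `contextlib.closing`. Here `sqlite3` stands in for `qrs.connect()` only so the example runs without a PostgreSQL server. `run_with_connection` and `count_rows` are hypothetical helpers.

```python
import sqlite3
from contextlib import closing


def run_with_connection(connect, work):
    """Open a connection, run work(conn), and close the connection in all cases."""
    with closing(connect()) as conn:
        return work(conn)


def count_rows(conn):
    """Toy workload: create a table, insert one row, and count the rows."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE datasets (id_dataset INTEGER)")
    cur.execute("INSERT INTO datasets VALUES (1)")
    cur.execute("SELECT COUNT(*) FROM datasets")
    return cur.fetchone()[0]


# sqlite3 in-memory database used as a stand-in for qrs.connect()
result = run_with_connection(lambda: sqlite3.connect(":memory:"), count_rows)
print(result)
```

In the notebook, `connect` would be `qrs.connect` and `work` a function that calls `load.load_dataset`. `closing` is used because a psycopg2 connection used directly in a `with` block manages the transaction but does not close the connection.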