# Moremi BioKit: SMILES Subpackage Usage Example

This notebook demonstrates how to use the core functionalities of the `moremi_biokit.smiles` subpackage for processing, validating, and ranking small molecules represented by SMILES strings.

## 1. Setup and Imports

First, ensure that the `moremi-biokit` package is installed in your environment. If you are running this notebook from the `moremi_toolkits` monorepo root after setting it up for development, it should be available.

We'll import the necessary components from `moremi_biokit` and `moremi_biokit.smiles`.

In [1]:
import sys
from pathlib import Path

# Add parent directory to Python path
module_path = str(Path().absolute().parent)
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:

# Core components from the smiles subpackage (can be imported from moremi_biokit directly thanks to the main __init__.py)
from moremi_biokit.smiles import (
    SmallMoleculeValidator,
    MoleculeMetrics,
    ProcessingResult,
    SmallMoleculeRankerV4,
    BatchMoleculeProcessor,
    ScoringConfig # For potential custom ranking
)

# Specific submodules if needed (e.g., for direct access to calculators, though usually not required for basic use)
from moremi_biokit import smiles 

print("Successfully imported Moremi BioKit components for SMILES processing.")

  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
  state = torch.load(path, map_location=lambda storage, loc: storage)


Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "readout.1.weight".
Loading pretrained parameter "readout.1.bias".
Loading pretrained parameter "readout.4.weight".
Loading pretrained parameter "readout.4.bias".

## 2. Validating a Single Molecule

Let's validate a single SMILES string and inspect the collected metrics.

In [3]:
validator = SmallMoleculeValidator()

aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
ethanol_smiles = "CCO"
invalid_smiles = "ThisIsNotASMILES"

print(f"Validating Aspirin: {aspirin_smiles}")
result_aspirin = validator.process_molecule(aspirin_smiles)

if result_aspirin.success and isinstance(result_aspirin.metrics, MoleculeMetrics):
    print(f"  SMILES: {result_aspirin.metrics.smiles}")
    print(f"  Formula: {result_aspirin.metrics.molecular_formula}")
    print(f"  MW: {result_aspirin.metrics.molecular_weight:.2f}")
    print(f"  Physicochemical (TPSA): {result_aspirin.metrics.physicochemical.get('tpsa')}")
    print(f"  Lipophilicity (Consensus LogP): {result_aspirin.metrics.lipophilicity.get('consensus_logp')}")
    # You can explore other metrics in result_aspirin.metrics as needed
else:
    print(f"  Failed to process Aspirin: {result_aspirin.error}")

print(f"\nValidating Ethanol: {ethanol_smiles}")
result_ethanol = validator.process_molecule(ethanol_smiles)
if result_ethanol.success and isinstance(result_ethanol.metrics, MoleculeMetrics):
    print(f"  SMILES: {result_ethanol.metrics.smiles}")
    print(f"  MW: {result_ethanol.metrics.molecular_weight:.2f}")
else:
    print(f"  Failed to process Ethanol: {result_ethanol.error}")

print(f"\nValidating Invalid SMILES: {invalid_smiles}")
result_invalid = validator.process_molecule(invalid_smiles)
if not result_invalid.success:
    print(f"  Failed as expected: {result_invalid.error}")

Validating Aspirin: CC(=O)OC1=CC=CC=C1C(=O)O


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 2673.23it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 397.83it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00, 20.86it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 78.91it/s]

individual models: 100%|██████████| 5/5 [00:01<00:00,  2.75it/s]
model ensembles: 100%|██████████| 2/2 [00:01<00:00,  1.05it/s]


  SMILES: CC(=O)Oc1ccccc1C(=O)O
  Formula: C9H8O4
  MW: 180.04
  Physicochemical (TPSA): 63.6
  Lipophilicity (Consensus LogP): 1.45060344

Validating Ethanol: CCO


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 4190.11it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 573.62it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00, 15.68it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 81.41it/s]

individual models: 100%|██████████| 5/5 [00:00<00:00, 100.30it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00, 16.11it/s]


  SMILES: CCO
  MW: 46.04

Validating Invalid SMILES: ThisIsNotASMILES
  Failed as expected: Invalid SMILES


[16:10:11] SMILES Parse Error: syntax error while parsing: ThisIsNotASMILES
[16:10:11] SMILES Parse Error: check for mistakes around position 1:
[16:10:11] ThisIsNotASMILES
[16:10:11] ^
[16:10:11] SMILES Parse Error: Failed parsing SMILES 'ThisIsNotASMILES' for input: 'ThisIsNotASMILES'
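
The success/failure branching used in the cell above can be folded into a small helper. The following is a minimal sketch, not part of `moremi_biokit`: `summarize_result` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that only mimic the `success`, `metrics`, and `error` attributes seen above, so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def summarize_result(label, result):
    """Return a one-line summary for an object exposing success/metrics/error."""
    if result.success and result.metrics is not None:
        m = result.metrics
        return f"{label}: OK, {m.molecular_formula}, MW {m.molecular_weight:.2f}"
    return f"{label}: FAILED ({result.error})"


# Stand-in objects that mimic the attributes used above (an assumption for
# illustration); with the real package you would pass the return value of
# validator.process_molecule() instead.
ok = SimpleNamespace(
    success=True,
    error=None,
    metrics=SimpleNamespace(
        molecular_formula="C9H8O4", molecular_weight=180.042258736
    ),
)
bad = SimpleNamespace(success=False, error="Invalid SMILES", metrics=None)

print(summarize_result("Aspirin", ok))
print(summarize_result("Invalid", bad))
```

Because the helper relies only on attribute names, it should work unchanged on real results as long as those attributes keep the shape shown in this notebook.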


To inspect every computed metric at once, convert the `MoleculeMetrics` object to a dictionary with `to_dict()`.

In [4]:
result_aspirin.metrics.to_dict()

{'smiles': 'CC(=O)Oc1ccccc1C(=O)O',
 'molecular_formula': 'C9H8O4',
 'molecular_weight': 180.042258736,
 'metrics': {'physicochemical': {'tpsa': 63.6,
   'logp': 1.45060344,
   'qed': 0.5501217966938848},
  'medicinal_chemistry': {'synthetic_accessibility': 4.51,
   'qed': 0.5501217966938848},
  'lipophilicity': {'ilogp': 1.3729848,
   'xlogp3': 1.4253888000000001,
   'wlogp': 1.3101,
   'mlogp': 1.5434232479999999,
   'silicos_it': 1.6241932920000002,
   'consensus_logp': 1.45060344},
  'druglikeness': {'lipinski_violations': [], 'bioavailability_score': 0.85},
  'absorption': {'caco2': -4.406400482370849,
   'pampa': 0.2673666700720787,
   'mdck': -4.24320169034944,
   'hia': 0.9971104025840759,
   'pgp_substrate': 0.00825292665977031,
   'pgp_inhibitor': 0.55},
  'distribution': {'vdss': 6.688882144785967,
   'ppb': 73.18028435667074,
   'bbb': 0.8109905123710632,
   'fu': 26.81971564332926},
  'metabolism': {'cyp2c9_inhibition': 0.031664181500673294,
   'cyp2d6_inhibition': 0.00694
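
The nested dictionary above can be awkward to scan or to write to a table. One possible approach, sketched here with a hypothetical `flatten` helper, is to collapse the nesting into dotted keys. The `sample` dictionary is a hand-copied excerpt of the output above, not a live call into the library.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hand-copied excerpt of the dictionary printed above (illustrative only).
sample = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "molecular_formula": "C9H8O4",
    "metrics": {
        "physicochemical": {"tpsa": 63.6, "logp": 1.45060344},
        "druglikeness": {"lipinski_violations": [], "bioavailability_score": 0.85},
    },
}

flat = flatten(sample)
for key in sorted(flat):
    print(f"{key} = {flat[key]}")
```

With the real package, passing `result_aspirin.metrics.to_dict()` in place of `sample` should give one flat row per molecule, assuming the dictionary keeps the nested layout shown above.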

## 3. Batch Processing Molecules

The `BatchMoleculeProcessor` can be used to process multiple SMILES from a file. For this example, we'll create a dummy SMILES file.

In [5]:
# Create a dummy SMILES file for demonstration
notebook_dir = Path.cwd()  # Directory the notebook is run from
example_data_dir = notebook_dir / "example_data"
example_data_dir.mkdir(exist_ok=True)
dummy_smiles_file = example_data_dir / "example_molecules.smi"

smiles_data = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "CCO",  # Ethanol
    "C1CCCCC1",  # Cyclohexane
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # Caffeine
    "InvalidSMILESString"
]

with open(dummy_smiles_file, 'w') as f:
    for s in smiles_data:
        f.write(s + '\n')

print(f"Created dummy SMILES file: {dummy_smiles_file}")

# Define an output directory for batch processing results
batch_output_dir = notebook_dir / "batch_analysis_results"
# batch_output_dir.mkdir(exist_ok=True) # BatchMoleculeProcessor creates it

print(f"Batch output will be saved to: {batch_output_dir}")

# Initialize and run the batch processor.
# Note: by default, BatchMoleculeProcessor tries to generate reports,
# including a PDF report for each molecule unless configured otherwise.
# For this notebook, we let it run with the defaults.
# To control report generation, pass the boolean arguments `generate_pdf` and/or `generate_csv`.
batch_processor = BatchMoleculeProcessor(
    input_file=str(dummy_smiles_file), 
    output_dir=str(batch_output_dir)
)

try:
    print("\nStarting batch processing... (This may take a moment)")
    batch_processor.process_batch()
    print("\nBatch processing finished.")
    print(f"Check the directory '{batch_output_dir}' for results, including logs, reports, and rankings.")
except Exception as e:
    print(f"An error occurred during batch processing: {e}")
    import traceback
    traceback.print_exc()

Created dummy SMILES file: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/example_data/example_molecules.smi
Batch output will be saved to: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results

Starting batch processing... (This may take a moment)
2025-05-07 16:10:13,127 - INFO - 🚀 Starting validation of 5 molecules...

🔍 Processing 5 molecules from example_molecules.smi


Processing molecules:   0%|          | 0/5 [00:00<?, ?mol/s]

2025-05-07 16:10:13,142 - INFO - 
⚡ Processing molecule 1/5
2025-05-07 16:10:13,145 - INFO - ├── 🧬 SMILES: CC(=O)OC1=CC=CC=C1C(=O)O

📊 Validating molecule 1/5
├── 🧬 SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
├── 🔬 Calculating properties...


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 6754.11it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 201.30it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00, 12.48it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 33.78it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 52.44it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  7.20it/s]


2025-05-07 16:10:13,907 - INFO - └── ✅ Validation successful
2025-05-07 16:10:13,911 - INFO -     └── 📝 Formula: C9H8O4


Processing molecules:  20%|██        | 1/5 [00:00<00:03,  1.30mol/s]

└── ✅ Validation successful
    └── 📝 Formula: C9H8O4
2025-05-07 16:10:13,918 - INFO - 
⚡ Processing molecule 2/5
2025-05-07 16:10:13,924 - INFO - ├── 🧬 SMILES: CCO

📊 Validating molecule 2/5
├── 🧬 SMILES: CCO
├── 🔬 Calculating properties...


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 7639.90it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 60.90it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  7.20it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 26.15it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 28.03it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  5.00it/s]


2025-05-07 16:10:14,842 - INFO - └── ✅ Validation successful
2025-05-07 16:10:14,848 - INFO -     └── 📝 Formula: C2H6O


Processing molecules:  40%|████      | 2/5 [00:01<00:02,  1.15mol/s]

└── ✅ Validation successful
    └── 📝 Formula: C2H6O
2025-05-07 16:10:14,858 - INFO - 
⚡ Processing molecule 3/5
2025-05-07 16:10:14,861 - INFO - ├── 🧬 SMILES: C1CCCCC1

📊 Validating molecule 3/5
├── 🧬 SMILES: C1CCCCC1
├── 🔬 Calculating properties...


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 2781.37it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 62.32it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  6.80it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 10.91it/s]

individual models: 100%|██████████| 5/5 [00:00<00:00, 17.14it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  2.51it/s]


2025-05-07 16:10:16,223 - INFO - └── ✅ Validation successful
2025-05-07 16:10:16,227 - INFO -     └── 📝 Formula: C6H12


Processing molecules:  60%|██████    | 3/5 [00:03<00:02,  1.10s/mol]

└── ✅ Validation successful
    └── 📝 Formula: C6H12
2025-05-07 16:10:16,231 - INFO - 
⚡ Processing molecule 4/5
2025-05-07 16:10:16,233 - INFO - ├── 🧬 SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

📊 Validating molecule 4/5
├── 🧬 SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C
├── 🔬 Calculating properties...


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 1569.14it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 257.48it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  6.64it/s]


individual models: 100%|██████████| 5/5 [00:00<00:00, 17.66it/s]

individual models: 100%|██████████| 5/5 [00:00<00:00, 11.06it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  2.53it/s]


2025-05-07 16:10:17,639 - INFO - └── ✅ Validation successful
2025-05-07 16:10:17,641 - INFO -     └── 📝 Formula: C8H10N4O2


Processing molecules:  80%|████████  | 4/5 [00:04<00:01,  1.23s/mol]

└── ✅ Validation successful
    └── 📝 Formula: C8H10N4O2
2025-05-07 16:10:17,649 - INFO - 
⚡ Processing molecule 5/5
2025-05-07 16:10:17,650 - INFO - ├── 🧬 SMILES: InvalidSMILESString

📊 Validating molecule 5/5
├── 🧬 SMILES: InvalidSMILESString
├── 🔬 Calculating properties...
2025-05-07 16:10:17,654 - ERROR - └── ❌ Validation failed: Invalid SMILES


[16:10:17] SMILES Parse Error: syntax error while parsing: InvalidSMILESString
[16:10:17] SMILES Parse Error: check for mistakes around position 3:
[16:10:17] InvalidSMILESString
[16:10:17] ~~^
[16:10:17] SMILES Parse Error: Failed parsing SMILES 'InvalidSMILESString' for input: 'InvalidSMILESString'
Processing molecules: 100%|██████████| 5/5 [00:04<00:00,  1.11mol/s]

└── ❌ Validation failed: Invalid SMILES

📊 Ranking molecules and generating reports...
2025-05-07 16:10:17,663 - INFO - 🎯 Ranking 4 molecules...



📊 Calculating scores:   0%|          | 0/4 [00:00<?, ?mol/s]


🧮 Calculating scores for C9H8O4
2025-05-07 16:10:17,678 - INFO - 🧮 Calculating scores for C9H8O4

🧮 Calculating scores for C2H6O
2025-05-07 16:10:17,681 - INFO - 🧮 Calculating scores for C2H6O

🧮 Calculating scores for C6H12
2025-05-07 16:10:17,695 - INFO - 🧮 Calculating scores for C6H12

🧮 Calculating scores for C8H10N4O2
2025-05-07 16:10:17,699 - INFO - 🧮 Calculating scores for C8H10N4O2


📊 Calculating scores: 100%|██████████| 4/4 [00:00<00:00, 116.22mol/s]

2025-05-07 16:10:17,729 - INFO - 💾 Saving ranking results...
2025-05-07 16:10:17,746 - INFO - 📝 Generating report for C9H8O4 (Rank 1)



SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 2385.84it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 320.10it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  4.88it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 12.56it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  2.43it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 13.94it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  2.53it/s]


Generated reports for C9H8O4 in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/molecule_reports
2025-05-07 16:10:19,232 - INFO - 📝 Generating report for C6H12 (Rank 2)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 6626.07it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 139.45it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  3.67it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00,  9.93it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  1.88it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 12.34it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  2.08it/s]


Generated reports for C6H12 in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/molecule_reports
2025-05-07 16:10:20,924 - INFO - 📝 Generating report for C2H6O (Rank 3)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 2642.91it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 432.94it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  6.77it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 27.56it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  5.23it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 36.90it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  5.84it/s]


Generated reports for C2H6O in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/molecule_reports
2025-05-07 16:10:21,813 - INFO - 📝 Generating report for C8H10N4O2 (Rank 4)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 4429.04it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 272.13it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  9.44it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 15.47it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  2.98it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 51.96it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  4.49it/s]


Generated reports for C8H10N4O2 in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/molecule_reports
2025-05-07 16:10:24,129 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
2025-05-07 16:10:24,144 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.


  sns.violinplot(data=data, palette=colors, alpha=0.7)



Ranking results saved to:
CSV: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/rankings/rankings_20250507_161017.csv
PDF: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/rankings/ranking_report_20250507_161017.pdf
2025-05-07 16:11:22,974 - INFO - ✅ Ranking complete! Results saved to /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results

⚠️ 1 molecules failed processing
   See /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/failed_molecules.txt for details

✨ Processing complete!
📁 Results saved to: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results
📋 Log file: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results/processing.log

Batch processing finished.
Check the directory '/home/mino_solo/moremi_toolkits/components/moremi-bio
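
Before handing a larger input file to the batch processor, it can help to check the file for blank lines and repeated entries. The sketch below uses only the standard library and assumes the same layout as the dummy file created above (one SMILES string per line). `read_smiles_lines` is a hypothetical helper, and it compares entries as plain strings, so it does not detect two different spellings of the same molecule or check chemical validity.

```python
import tempfile
from pathlib import Path


def read_smiles_lines(path):
    """Return unique, non-empty lines from a text file, preserving order."""
    seen = set()
    entries = []
    for line in Path(path).read_text().splitlines():
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


# Demonstration on a throwaway file containing a blank line and a duplicate.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.smi"
    demo_file.write_text("CCO\n\nC1CCCCC1\nCCO\n")
    cleaned = read_smiles_lines(demo_file)

print(cleaned)
```

The cleaned list could then be written back to a new `.smi` file and passed to `BatchMoleculeProcessor` exactly as in the cell above.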

### Inspecting Batch Results

After running `process_batch()`, the `batch_output_dir` will contain several items:
- `processing.log`: A log file of the batch run.
- `failed_molecules.txt` (if any): Lists SMILES that failed validation or processing.
- `rankings/`: This directory will contain a CSV file (e.g., `rankings_YYYYMMDD_HHMMSS.csv`) with all processed molecules, their scores, and ranks, plus a PDF ranking report (e.g., `ranking_report_YYYYMMDD_HHMMSS.pdf`).
- `molecule_reports/`: This directory will contain individual reports (CSV and potentially PDF if report generation is enabled and successful) for each successfully processed molecule.

In [6]:
# Example: Listing some of the output files/directories (actual content depends on the run)
if batch_output_dir.exists():
    print(f"\nContents of {batch_output_dir}:")
    for item in sorted(list(batch_output_dir.glob('*'))):
        print(f"  {item.name}")
        if item.is_dir() and item.name == 'rankings':
            print(f"    Contents of {item.name}:")
            for sub_item in sorted(list(item.glob('*'))):
                print(f"      {sub_item.name}")
        if item.is_dir() and item.name == 'molecule_reports':
            print(f"    Contents of {item.name} (first few):")
            report_files = sorted(item.glob('*'))
            for sub_item in report_files[:5]:
                print(f"      {sub_item.name}")
            if len(report_files) > 5:
                print("      ...and more")
else:
    print(f"\nOutput directory {batch_output_dir} not found. Ensure batch processing ran successfully.")


Contents of /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/batch_analysis_results:
  failed_molecules.txt
  molecule_reports
    Contents of molecule_reports (first few):
      r1_C9H8O4_20250507_161019.csv
      r2_C6H12_20250507_161020.csv
      r3_C2H6O_20250507_161021.csv
      r4_C8H10N4O2_20250507_161022.csv
  processing.log
  rankings
    Contents of rankings:
      ranking_report_20250507_161017.pdf
      rankings_20250507_161017.csv
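
Since each run writes a new timestamped file, a common follow-up is to pick the most recent rankings CSV and load it. The sketch below is one way to do that with the standard library. The helper names are hypothetical, the demonstration runs against a temporary directory that imitates the layout listed above, and the column names in the fake CSV are assumptions borrowed from the ranked DataFrame rather than a confirmed file schema.

```python
import csv
import tempfile
from pathlib import Path


def latest_rankings_csv(output_dir):
    """Return the newest rankings_*.csv under output_dir/rankings, or None."""
    # Zero-padded YYYYMMDD_HHMMSS stamps sort chronologically as plain strings.
    candidates = sorted((Path(output_dir) / "rankings").glob("rankings_*.csv"))
    return candidates[-1] if candidates else None


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


# Demonstration on a temporary directory that imitates the layout above.
with tempfile.TemporaryDirectory() as tmp:
    rankings_dir = Path(tmp) / "rankings"
    rankings_dir.mkdir()
    for stamp in ("20250507_161017", "20250507_161907"):
        (rankings_dir / f"rankings_{stamp}.csv").write_text(
            "SMILES,Overall Score\nCCO,0.7901\n"
        )
    newest = latest_rankings_csv(tmp)
    newest_name = newest.name
    rows = load_rows(newest)

print(newest_name)
print(rows[0]["SMILES"], rows[0]["Overall Score"])
```

In the notebook itself, `latest_rankings_csv(batch_output_dir)` would be the equivalent call; inspect the header row of the real file before relying on specific column names.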


## 4. Direct Ranking (Advanced/Alternative)

While `BatchMoleculeProcessor` handles validation and ranking together, you can also use `SmallMoleculeRankerV4` directly if you already have a list of `MoleculeMetrics` objects.

In [7]:
# Assuming you have a list of MoleculeMetrics objects from a previous step
# For this example, let's re-process a few to get metrics objects
metrics_for_ranking = []
smiles_to_rank = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
validator_for_ranker = SmallMoleculeValidator()

for s in smiles_to_rank:
    res = validator_for_ranker.process_molecule(s)
    if res.success and isinstance(res.metrics, MoleculeMetrics):
        metrics_for_ranking.append(res.metrics)

if metrics_for_ranking:
    print(f"Collected {len(metrics_for_ranking)} MoleculeMetrics objects for direct ranking.")
    ranker = SmallMoleculeRankerV4()
    
    # Set an output directory for the ranker if you want reports from it
    direct_ranking_output_dir = notebook_dir / "direct_ranking_results"
    ranker.set_output_directory(str(direct_ranking_output_dir))
    
    # The ranker can be configured not to generate individual PDF reports to speed things up
    # ranker.generate_pdf = False 
    
    ranked_df = ranker.rank_molecules(metrics_for_ranking)
    
    print("\nRanked DataFrame (top 5):")
    print(ranked_df.head()[["SMILES", "Molecular Formula", "Overall Score"]])
    
    print(f"\nDirect ranking outputs (if any) would be in {direct_ranking_output_dir}")
else:
    print("Could not collect metrics for direct ranking example.")

SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 2608.40it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 268.68it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  9.76it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 21.64it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  4.01it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 41.97it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  5.26it/s]
SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 6204.59it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 538.35it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00, 15.02it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 53.36it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00

Collected 3 MoleculeMetrics objects for direct ranking.
2025-05-07 16:19:07,134 - INFO - 🎯 Ranking 3 molecules...


📊 Calculating scores:   0%|          | 0/3 [00:00<?, ?mol/s]


🧮 Calculating scores for C9H8O4
2025-05-07 16:19:07,138 - INFO - 🧮 Calculating scores for C9H8O4

🧮 Calculating scores for C2H6O
2025-05-07 16:19:07,139 - INFO - 🧮 Calculating scores for C2H6O

🧮 Calculating scores for C8H10N4O2
2025-05-07 16:19:07,143 - INFO - 🧮 Calculating scores for C8H10N4O2


📊 Calculating scores: 100%|██████████| 3/3 [00:00<00:00, 359.31mol/s]

2025-05-07 16:19:07,151 - INFO - 💾 Saving ranking results...





2025-05-07 16:19:07,162 - INFO - 📝 Generating report for C9H8O4 (Rank 1)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 5924.16it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 590.83it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  8.98it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 62.60it/s]

individual models: 100%|██████████| 5/5 [00:00<00:00, 65.69it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00, 11.41it/s]


Generated reports for C9H8O4 in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results/molecule_reports
2025-05-07 16:19:07,742 - INFO - 📝 Generating report for C2H6O (Rank 2)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 5090.17it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 220.58it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  6.31it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 33.62it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  6.35it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 21.94it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  4.93it/s]


Generated reports for C2H6O in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results/molecule_reports
2025-05-07 16:19:08,684 - INFO - 📝 Generating report for C8H10N4O2 (Rank 3)


SMILES to Mol: 100%|██████████| 1/1 [00:00<00:00, 3566.59it/s]
Computing physchem properties: 100%|██████████| 1/1 [00:00<00:00, 216.03it/s]
RDKit fingerprints: 100%|██████████| 1/1 [00:00<00:00,  7.80it/s]
model ensembles:   0%|          | 0/2 [00:00<?, ?it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 27.94it/s]
model ensembles:  50%|█████     | 1/2 [00:00<00:00,  5.30it/s]
individual models: 100%|██████████| 5/5 [00:00<00:00, 17.70it/s]
model ensembles: 100%|██████████| 2/2 [00:00<00:00,  4.16it/s]


Generated reports for C8H10N4O2 in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results/molecule_reports
2025-05-07 16:19:11,620 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
2025-05-07 16:19:11,643 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.


  sns.violinplot(data=data, palette=colors, alpha=0.7)



Ranking results saved to:
CSV: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results/rankings/rankings_20250507_161907.csv
PDF: /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results/rankings/ranking_report_20250507_161907.pdf
2025-05-07 16:20:08,566 - INFO - ✅ Ranking complete! Results saved to /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results

Ranked DataFrame (top 5):
                       SMILES Molecular Formula  Overall Score
0       CC(=O)Oc1ccccc1C(=O)O            C9H8O4         0.8466
1                         CCO             C2H6O         0.7901
2  Cn1c(=O)c2c(ncn2C)n(C)c1=O         C8H10N4O2         0.6802

Direct ranking outputs (if any) would be in /home/mino_solo/moremi_toolkits/components/moremi-biokit/moremi_biokit/direct_ranking_results


In [10]:
# View the list of MoleculeMetrics objects collected above
metrics_for_ranking



## 5. Custom Scoring Configuration (Brief Mention)

The `SmallMoleculeRankerV4` uses a `ScoringConfig` class that defines weights for metric categories and specific scoring functions for individual properties. While the default uses equal weighting, you could potentially subclass or modify `ScoringConfig` for custom ranking behavior. This is an advanced topic not fully covered here.

In [None]:
default_config = ScoringConfig()
print("Default category weights:")
for category, weight in default_config.category_weights.items():
    print(f"  {category.value}: {weight}")

print("\nExample property config for TPSA:")
print(default_config.property_configs.get('tpsa'))
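
To make the idea of category weights concrete, here is a minimal sketch of a weighted average over per-category scores, with equal weights as the default. It illustrates the general technique only and is not the implementation used by `SmallMoleculeRankerV4` or `ScoringConfig`; the function name, the category scores, and the custom weights are all hypothetical.

```python
def overall_score(category_scores, category_weights=None):
    """Weighted average of per-category scores; equal weights by default."""
    if not category_scores:
        raise ValueError("category_scores must not be empty")
    if category_weights is None:
        category_weights = {name: 1.0 for name in category_scores}
    total_weight = sum(category_weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        score * category_weights[name]
        for name, score in category_scores.items()
    )
    return weighted / total_weight


# Hypothetical category scores in [0, 1]; these are not library output.
scores = {"physicochemical": 0.9, "absorption": 0.6, "metabolism": 0.3}

equal = overall_score(scores)
custom = overall_score(
    scores, {"physicochemical": 1, "absorption": 1, "metabolism": 2}
)

print(round(equal, 4))
print(round(custom, 4))
```

Doubling the weight of one category pulls the overall score toward that category's value, which is the kind of effect a customized configuration would be expected to have on the ranking.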

## Conclusion

This notebook provided a basic overview of using the `moremi_biokit.smiles` subpackage. You can now:
- Validate individual SMILES strings and retrieve detailed metrics.
- Process batches of SMILES strings from files.
- Understand where to find the output reports and rankings.
- (Optionally) Perform direct ranking if you have `MoleculeMetrics` objects.

Refer to the source code and docstrings of the respective classes for more detailed information on configurations and capabilities.