<H2>Download and save the dataset</H2>

### Open Direct Air Capture 2023 (ODAC23) - Fair-chem
#### Initial Structure to Relaxed Structure (IS2RS) / Relaxed Energy (IS2RE) tasks [Link](https://fair-chem.github.io/core/datasets/odac.html#initial-structure-to-relaxed-structure-is2rs-relaxed-energy-is2re-tasks)

In [None]:
%%bash
wget https://dl.fbaipublicfiles.com/dac/datasets/odac23_is2r.tar.gz
tar -xvzf odac23_is2r.tar.gz
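
Before converting, it can help to confirm that the LMDB shards landed where the next cell expects them. The following is a minimal sketch, assuming the archive unpacks to `./is2r/train` (the `BASE_DIR` used in the conversion cell); `find_lmdb_files` is a hypothetical helper, not part of fair-chem.

```python
from pathlib import Path

def find_lmdb_files(base_dir="./is2r/train"):
    """Return the sorted .lmdb paths under base_dir, or [] if the directory is missing."""
    base = Path(base_dir)
    if not base.is_dir():
        return []
    return sorted(str(p) for p in base.iterdir() if p.suffix == ".lmdb")

shards = find_lmdb_files()
print(f"Found {len(shards)} LMDB file(s)")
```

An empty result usually means the archive extracted to a different directory than the one assumed here.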

<h2>Converting to CSV</h2>

In [None]:
import lmdb
import os
import pickle
import csv
import numpy as np
from tqdm import tqdm
import gc

# Configuration
BASE_DIR = './is2r/train'
CSV_PATH = 'is2r_train_optimized.csv'
NUMERIC_COLS = ['oms', 'nco2', 'nads', 'nh2o', 'raw_y', 'y_init', 'natoms', 'defective', 'fixed']
TENSOR_COLS = ['pos_relaxed', 'cell', 'pos', 'atomic_numbers', 'supercell']
CHUNK_SIZE = 1000

def process_entry(key, value, lmdb_file):
    """Process a single LMDB entry with enhanced error handling"""
    entry = {'file': lmdb_file, 'key': key.decode(errors='replace')}
    
    try:
        data = pickle.loads(value, encoding='bytes')
        
        # Process numeric features
        for col in NUMERIC_COLS:
            entry[col] = data.get(col, 0)
        
        # Process tensor features
        for col in TENSOR_COLS:
            tensor = data.get(col)
            if tensor is not None:
                try:
                    arr = np.array(tensor, dtype=np.float32)
                    entry[f'{col}_mean'] = arr.mean()
                    entry[f'{col}_std'] = arr.std()
                    entry[f'{col}_min'] = arr.min()
                    entry[f'{col}_max'] = arr.max()
                except Exception as e:
                    entry.update({f'{col}_{stat}': 0.0 for stat in ['mean', 'std', 'min', 'max']})
            else:
                entry.update({f'{col}_{stat}': 0.0 for stat in ['mean', 'std', 'min', 'max']})
        
        return entry
    
    except (pickle.UnpicklingError, TypeError) as e:
        print(f"Unpickling error in {lmdb_file}, key {key}: {str(e)}")
        return None
    except Exception as e:
        print(f"Error processing entry {key}: {str(e)}")
        return None

def process_lmdb_file(lmdb_path):
    """Process a single LMDB file with transaction safety"""
    entries = []
    env = lmdb.open(lmdb_path,
                    readonly=True,
                    subdir=False,
                    lock=False,
                    max_readers=2048,
                    max_dbs=0,
                    map_size=1099511627776*2)
    
    try:
        with env.begin() as txn:
            cursor = txn.cursor()
            for key, value in cursor:
                result = process_entry(key, value, os.path.basename(lmdb_path))
                if result:
                    entries.append(result)
                    # Flush a full chunk to disk to keep memory use bounded
                    if len(entries) >= CHUNK_SIZE:
                        write_to_csv(entries)
                        entries.clear()
                        gc.collect()

            # Flush the final partial chunk
            if entries:
                write_to_csv(entries)
                entries.clear()
                gc.collect()
    finally:
        # Always release the environment, even if reading or writing fails
        env.close()

    return lmdb_path

def write_to_csv(entries):
    """Append entries to CSV"""
    with open(CSV_PATH, 'a', newline='', buffering=16384) as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=CSV_COLUMNS)
        writer.writerows(entries)

# Initialize CSV file
CSV_COLUMNS = ['file', 'key'] + NUMERIC_COLS
for col in TENSOR_COLS:
    CSV_COLUMNS += [f'{col}_mean', f'{col}_std', f'{col}_min', f'{col}_max']

with open(CSV_PATH, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=CSV_COLUMNS)
    writer.writeheader()

# Process files sequentially
lmdb_files = [os.path.join(BASE_DIR, f) 
             for f in os.listdir(BASE_DIR) if f.endswith('.lmdb')]

for file_path in tqdm(lmdb_files, desc="Processing LMDB files"):
    try:
        process_lmdb_file(file_path)
    except Exception as e:
        print(f"Critical error processing {file_path}: {str(e)}")

print(f"Conversion complete. Output: {CSV_PATH}")
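
The output CSV has two identifier columns, the nine scalar columns, and four summary statistics for each of the five tensor columns. The sketch below rebuilds that layout from the same lists used in the conversion cell and compares it with a file header. `expected_columns` and `read_header` are hypothetical helper names introduced here for illustration.

```python
import csv
import os

NUMERIC_COLS = ['oms', 'nco2', 'nads', 'nh2o', 'raw_y', 'y_init', 'natoms', 'defective', 'fixed']
TENSOR_COLS = ['pos_relaxed', 'cell', 'pos', 'atomic_numbers', 'supercell']
STATS = ['mean', 'std', 'min', 'max']

def expected_columns():
    """Rebuild the column order used when the CSV header was written."""
    cols = ['file', 'key'] + NUMERIC_COLS
    for col in TENSOR_COLS:
        cols += [f'{col}_{stat}' for stat in STATS]
    return cols

def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline='') as f:
        return next(csv.reader(f))

csv_path = 'is2r_train_optimized.csv'
if os.path.exists(csv_path):
    print("Header matches:", read_header(csv_path) == expected_columns())
else:
    print(f"{csv_path} not found; expecting {len(expected_columns())} columns")
```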

# Model Training Documentation

## Target Variable: `nads` (Number of Adsorbed Molecules)
**Definition:**  
Number of CO₂ and H₂O molecules adsorbed by the nano-material  
**Example Prediction:**  
`predicted_nads = 3.031828` → Predicts ~3 molecules adsorbed  

---

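## Input Features & Example Values
The table lists scalar columns from the CSV. `oms` and `raw_y` are shown for reference only: the feature list in the training cell does not include them.

| Feature Name             | Example Value          | Description                          | Data Type   |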
|--------------------------|------------------------|--------------------------------------|-------------|
| **oms**                  | False                  | Open metal site presence             | Boolean     |
| **nco2**                 | 1                      | Number of CO₂ molecules available    | Integer     |
| **nh2o**                 | 2                      | Number of H₂O molecules available    | Integer     |
| **raw_y**                | -631.916568            | Final system energy (eV)             | Float       |
| **y_init**               | 0.612635               | Initial system energy (eV)           | Float       |
| **natoms**               | 82                     | Total atoms in material              | Integer     |
| **defective**            | True                   | Defect presence                      | Boolean     |

### Structural Features
| Feature                  | Example Value          | Description                          |
|--------------------------|------------------------|--------------------------------------|
| cell_mean                | 5.643653 Å             | Mean of the unit-cell matrix entries |
| cell_std                 | 4.755599 Å             | Standard deviation of the unit-cell matrix entries |
| pos_relaxed_mean         | 8.48211 Å              | Mean atomic coordinate after relaxation |
| pos_relaxed_std          | 4.277549 Å             | Standard deviation of atomic coordinates after relaxation |

### Composition Features
| Feature                  | Example Value          | Description                          |
|--------------------------|------------------------|--------------------------------------|
| atomic_numbers_mean      | 13.573171              | Average atomic number (Al≈13)        |
| atomic_numbers_std       | 12.533885              | Element diversity                    |
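
Each `_mean` and `_std` feature is computed over every entry of the stored tensor, as in the conversion cell's `arr.mean()` and `arr.std()`. The sketch below reproduces that reduction with the standard library. It assumes the cell is stored as a 3×3 lattice matrix, which the notebook itself does not confirm, and the cubic cell is a made-up example rather than a dataset entry.

```python
import statistics

def flatten(values):
    """Yield every scalar from an arbitrarily nested list or tuple."""
    for item in values:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield float(item)

def summarize(values):
    """Return mean/std/min/max over all entries of a nested list."""
    flat = list(flatten(values))
    return {
        "mean": statistics.fmean(flat),
        "std": statistics.pstdev(flat),  # population std, matching NumPy's default ddof=0
        "min": min(flat),
        "max": max(flat),
    }

# Hypothetical cubic cell with 10 Å edges, written as a 3x3 lattice matrix
cubic_cell = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
cell_stats = summarize(cubic_cell)
print(cell_stats)
```

For this cell the zero off-diagonal entries pull the mean down to about 3.33, well below the 10 Å edge length, so `cell_mean` is better read as a summary of the matrix entries than as a lattice length.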

---

## Training Data Characteristics
**Example Entry from 00081.lmdb:**
```python
{
  "key": "23",
  "oms": False,
  "nco2": 1,
  "nads": 3,          # Actual adsorbed molecules
  "nh2o": 2,
  "raw_y": -631.916568,
  "y_init": 0.612635,
  "natoms": 82,
  "defective": True
}
```

In [None]:
# Train the model, then save it

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv("/content/is2r_train_optimized.csv")

# Define feature set
features = [
    "nco2", "nh2o", "y_init", "natoms", "defective",
    "cell_mean", "cell_std", "pos_relaxed_mean", "pos_relaxed_std",
    "atomic_numbers_mean", "atomic_numbers_std"
]
X = data[features].copy()  # copy so the 'defective' conversion below does not trigger SettingWithCopyWarning
y = data["nads"]  # Target: number of adsorbed molecules as proxy for efficiency

# Check for missing values
print("Missing values in features:\n", X.isnull().sum())
print("Missing values in target:\n", y.isnull().sum())

# Convert boolean 'defective' to numeric (True -> 1, False -> 0)
X["defective"] = X["defective"].astype(int)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define XGBoost parameters
params = {
    "objective": "reg:squarederror",
    "max_depth": 6,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "rmse"
}

# Train the model
num_rounds = 100
model = xgb.train(params, dtrain, num_rounds, evals=[(dtest, "test")], early_stopping_rounds=10, verbose_eval=10)

# Save the trained model to a file
model.save_model("xgboost_nads_model.json")
print("\nModel saved to 'xgboost_nads_model.json'")

# Predict on test set
y_pred = model.predict(dtest)

# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nRMSE: {rmse}")

# Feature importance
xgb.plot_importance(model)
plt.show()

# Predict adsorption capacity for the full dataset
data["predicted_nads"] = model.predict(xgb.DMatrix(X))
top_combinations = data.sort_values("predicted_nads", ascending=False).head(5)

# Human-readable output
print("\n=== Top 5 Nano-Material Configurations for CCU Efficiency ===")
print("(Based on Predicted Adsorption Capacity)\n")
print("These are the top 5 material setups that excel at capturing CO₂, ranked by how many molecules they can trap.")
print("Higher adsorption capacity means better performance for carbon capture and utilization (CCU)!\n")

for i, (index, row) in enumerate(top_combinations.iterrows(), 1):
    defective_status = "no defects" if row["defective"] == 0 else "defective structure"
    print(f"{i}. Configuration {i} (Row {int(index)})")
    print(f"   - CO₂ Molecules: {int(row['nco2'])}")
    print(f"   - H₂O Molecules: {int(row['nh2o'])}")
    print(f"   - Adsorption Capacity: {int(row['nads'])} molecules (Predicted: {row['predicted_nads']:.2f})")
    print(f"   - Energy: Initial: {row['y_init']:.2f} eV | Final: {row['raw_y']:.2f} eV")
    print(f"   - Material Details: {int(row['natoms'])} atoms, {defective_status}, average cell size ~{row['cell_mean']:.2f} Å,")
    print(f"     mixed atom types (mean atomic number ~{row['atomic_numbers_mean']:.1f})")
    print(f"   - Why It’s Great: {'Stable structure with strong binding' if row['raw_y'] < -650 else 'Good capture with unique features'}")
    print()

print("=== Key Takeaways ===")
print("- All top configs trap 3 molecules consistently, with predictions spot-on.")
print("- Lower final energies (e.g., below -650 eV) mean super-stable CO₂ capture.")
print("- Defects or no defects—both work well, giving us design flexibility!")
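
The training cell reports root-mean-square error (RMSE) as its evaluation metric. As a reminder of what that number means, here is a minimal standard-library sketch of the same formula. The `rmse` helper and the sample values are hypothetical and are not results from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values for illustration only
actual = [3, 2, 1, 3]
predicted = [3.0, 2.5, 1.0, 2.5]
print(f"RMSE: {rmse(actual, predicted):.4f}")
```

Because the target is a molecule count, an RMSE well below 0.5 would mean that rounding the prediction usually recovers the true count.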

<h2>Load the saved model</h2>

# Feature Requirements for Nano-Material CCU Model

## Required Features
The model requires **exactly these 11 features** in the input data. Missing or misspelled columns will cause errors.

| Feature Name           | Description                                                                 | Data Type          | Example Values       | Purpose                                                                 |
|------------------------|-----------------------------------------------------------------------------|--------------------|----------------------|-------------------------------------------------------------------------|
| **nco2**               | Number of CO₂ molecules exposed to the material                             | Integer, Float     | 1, 2                 | Indicates CO₂ availability for adsorption                              |
| **nh2o**               | Number of H₂O molecules exposed to the material                             | Integer, Float     | 2, 0                 | Shows water presence affecting adsorption                              |
| **y_init**             | Initial system energy before adsorption (eV)                                | Float              | 0.612635, -0.420629  | Provides baseline energy state                                         |
| **natoms**             | Total atoms in nano-material structure                                      | Integer            | 82, 95               | Reflects material size/capacity                                        |
| **defective**          | Presence of structural defects                                              | Boolean (1/0)      | True/False, 1/0      | Indicates defect-enhanced adsorption sites                            |
| **cell_mean**          | Average unit cell size (Å)                                                  | Float              | 5.643653, 5.373723   | Shows structural scale for molecule capture                           |
| **cell_std**           | Cell size variation (Å)                                                     | Float              | 4.755599, 5.147071   | Indicates structural irregularity                                      |
| **pos_relaxed_mean**   | Average atom positions after relaxation (Å)                                 | Float              | 8.48211, 7.654403    | Reveals post-adsorption stability                                      |
| **pos_relaxed_std**    | Variation in relaxed atom positions (Å)                                     | Float              | 4.277549, 3.760834   | Shows positional spread affecting binding                              |
| **atomic_numbers_mean**| Average atomic number of material atoms                                     | Float              | 13.573171, 6.905263  | Indicates elemental composition                                       |
| **atomic_numbers_std** | Diversity of atomic numbers                                                 | Float              | 12.533885, 12.555620 | Measures element variety influencing adsorption                        |
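
Since missing or misspelled columns cause errors, checking the header before building the `DMatrix` gives a clearer failure message. This is a minimal sketch; `missing_features` is a hypothetical helper, and the feature list is copied from the training cell.

```python
REQUIRED_FEATURES = [
    "nco2", "nh2o", "y_init", "natoms", "defective",
    "cell_mean", "cell_std", "pos_relaxed_mean", "pos_relaxed_std",
    "atomic_numbers_mean", "atomic_numbers_std",
]

def missing_features(columns):
    """Return the required feature names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_FEATURES if name not in present]

# A header with a typo ("nh20" instead of "nh2o") is reported as missing "nh2o"
header = ["nco2", "nh20"] + REQUIRED_FEATURES[2:]
print(missing_features(header))
```

With pandas, the same check can be run as `missing_features(input_data.columns)` before selecting the features.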

## Key Notes
1. **Data Validation**  
   - All features are mandatory.
   - All values must be numeric (no text/strings).
   - `defective` is converted internally: True→1, False→0.

2. **Unit Requirements**  
   - `y_init`: electron volts (eV)
   - `cell_mean`, `cell_std`: angstroms (Å)
   - `pos_relaxed_mean`, `pos_relaxed_std`: angstroms (Å)


## Example Input CSV
```
nco2,nh2o,y_init,natoms,defective,cell_mean,cell_std,pos_relaxed_mean,pos_relaxed_std,atomic_numbers_mean,atomic_numbers_std
1,2,0.612635,82,1,5.643653,4.755599,8.48211,4.277549,13.573171,12.533885
1,0,-0.420629,95,0,5.373723,5.147071,7.654403,3.760834,6.905263,12.555620
```
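
To make the numeric and `defective` requirements concrete, the sketch below parses the example rows with the standard library and coerces `defective` to 0/1. `to_flag` and `parse_rows` are hypothetical helpers. The prediction cell relies on pandas `astype(int)` instead, which may not accept every spelling handled here.

```python
import csv
import io

EXAMPLE_CSV = """nco2,nh2o,y_init,natoms,defective,cell_mean,cell_std,pos_relaxed_mean,pos_relaxed_std,atomic_numbers_mean,atomic_numbers_std
1,2,0.612635,82,1,5.643653,4.755599,8.48211,4.277549,13.573171,12.533885
1,0,-0.420629,95,0,5.373723,5.147071,7.654403,3.760834,6.905263,12.555620
"""

def to_flag(value):
    """Map True/False or 1/0 (as text or values) to the integers 1 or 0."""
    text = str(value).strip().lower()
    if text in ("true", "1", "1.0"):
        return 1
    if text in ("false", "0", "0.0"):
        return 0
    raise ValueError(f"cannot interpret {value!r} as a defect flag")

def parse_rows(text):
    """Parse CSV text into dicts of floats, with 'defective' coerced to 0/1."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {k: float(v) for k, v in row.items() if k != "defective"}
        parsed["defective"] = to_flag(row["defective"])
        rows.append(parsed)
    return rows

rows = parse_rows(EXAMPLE_CSV)
print(rows[0]["defective"], rows[1]["defective"])
```

A non-numeric cell raises `ValueError` during parsing, which surfaces the "all values must be numeric" rule before any prediction is attempted.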

In [None]:
# Load the saved model and predict on new data

# Example of building the input DataFrame by hand instead of reading a CSV:
# input_data = pd.DataFrame({
#     "nco2": [1, 1],
#     "nh2o": [2, 0],
#     "y_init": [0.612635, -0.420629],
#     "natoms": [82, 95],
#     "defective": [True, False],  # Or [1, 0]
#     "cell_mean": [5.643653, 5.373723],
#     "cell_std": [4.755599, 5.147071],
#     "pos_relaxed_mean": [8.48211, 7.654403],
#     "pos_relaxed_std": [4.277549, 3.760834],
#     "atomic_numbers_mean": [13.573171, 6.905263],
#     "atomic_numbers_std": [12.533885, 12.555620]
# })


import pandas as pd
import xgboost as xgb

# Load the saved model
loaded_model = xgb.Booster()
loaded_model.load_model("xgboost_nads_model.json")
print("Model loaded successfully from 'xgboost_nads_model.json'")

# Load or prepare your input data (example shown below)
# Replace this with your actual new data file or DataFrame
input_data = pd.read_csv("your_new_data.csv")  # Or create DataFrame manually

# Define the feature set (must match the training features)
features = [
    "nco2", "nh2o", "y_init", "natoms", "defective",
    "cell_mean", "cell_std", "pos_relaxed_mean", "pos_relaxed_std",
    "atomic_numbers_mean", "atomic_numbers_std"
]

# Select and preprocess the features
X = input_data[features].copy()  # copy so the conversion below does not trigger SettingWithCopyWarning
X["defective"] = X["defective"].astype(int)  # Convert 'defective' to 0/1 (True -> 1, False -> 0)

# Convert to DMatrix for XGBoost prediction
dmat = xgb.DMatrix(X)

# Make predictions
predictions = loaded_model.predict(dmat)

# Add predictions to the DataFrame (optional)
input_data["predicted_nads"] = predictions

# Display some results
print("\nPredictions for the first 5 rows:")
print(input_data[features + ["predicted_nads"]].head())

# Optionally, save the results to a new CSV
input_data.to_csv("predictions_output.csv", index=False)
print("\nResults saved to 'predictions_output.csv'")