# Melting Point Prediction: The "Nuclear Option" (AutoGluon)

## Strategy
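Our models have plateaued at an MAE of about 33. To get below 25, we need **stacking**: training a second-level model on the predictions of many base models.
**AutoGluon** is an AutoML library that automatically trains and stacks dozens of models (CatBoost, XGBoost, LightGBM, neural networks, etc.). Stacked ensembles of this kind are frequently competitive in tabular competitions.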
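
To make the idea concrete, here is a minimal sketch of the simplest possible second-level model: a weighted blend of two base models, with the weight chosen by grid search on held-out predictions. The numbers and the `best_blend_weight` helper are made up for illustration, and AutoGluon's own stacking is far more elaborate than this.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, preds_a, preds_b, steps=20):
    """Grid-search w in [0, 1] minimizing the MAE of w * a + (1 - w) * b."""
    best_w, best_score = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
        score = mae(y_true, blended)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Illustrative (made-up) melting points and held-out predictions of two models
y_true = [100.0, 150.0, 200.0, 250.0]
preds_a = [110.0, 140.0, 215.0, 240.0]
preds_b = [90.0, 165.0, 190.0, 262.0]

w, score = best_blend_weight(y_true, preds_a, preds_b)
print(f"model A MAE: {mae(y_true, preds_a):.2f}")
print(f"model B MAE: {mae(y_true, preds_b):.2f}")
print(f"blend MAE:   {score:.2f} (weight on A = {w:.2f})")
```

Because the grid includes `w = 0` and `w = 1`, the blend can never score worse than the better base model on the data it was fit on. The weight should be fit on predictions for rows the base models did not train on, otherwise the blend simply rewards whichever model overfits most.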

## Prerequisite
**Run this notebook on Kaggle or Colab (Linux). AutoGluon is very hard to install on macOS.**

In [None]:
# Install AutoGluon (tabular only) and RDKit.
# NOTE: pip may print red dependency-conflict errors mentioning 'pyarrow' or 'bigframes'.
# These are expected on Kaggle/Colab and can be safely ignored for this notebook.

!pip install -U pip
!pip install -U setuptools wheel
# Force pyarrow < 20 to keep GPU libraries (cudf) happy, even if it breaks 'datasets'
!pip install "pyarrow<20.0.0" autogluon.tabular rdkit
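
Since the pip output is noisy, it can help to confirm afterwards which versions actually ended up installed. The following is a small sketch using only the standard library; `installed_version` and `report_versions` are hypothetical helper names, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(packages=("pyarrow", "autogluon.tabular", "rdkit")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


for name, ver in report_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```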

In [None]:
import os

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, GraphDescriptors, rdFingerprintGenerator
from autogluon.tabular import TabularPredictor

# Check environment and load data
if os.path.exists('/kaggle/input/melting-point/train.csv'):
    data_path = '/kaggle/input/melting-point/'
elif os.path.exists('train.csv'):
    data_path = './'
else:
    # Auto-download (needs Kaggle API credentials, e.g. ~/.kaggle/kaggle.json)
    !pip install -q kaggle
    !kaggle competitions download -c melting-point
    !unzip -o melting-point.zip
    data_path = './'

df_train = pd.read_csv(f"{data_path}train.csv")
test_df = pd.read_csv(f"{data_path}test.csv")
submission_df = pd.read_csv(f"{data_path}sample_submission.csv")

## 1. Feature Engineering
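AutoGluon has no chemical understanding of SMILES strings, so we still need to compute good features for it to learn from: a handful of RDKit descriptors plus Morgan fingerprint bits.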

In [None]:
# Build the Morgan generator once (radius 2, 1024 bits) instead of once per molecule
MORGAN_GEN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

def get_mol_features(smiles):
    """Return a dict of descriptors and fingerprint bits, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None

    # Global physicochemical and topological descriptors
    features = {
        'MolWt': Descriptors.MolWt(mol),
        'LogP': Descriptors.MolLogP(mol),
        'NumHDonors': Descriptors.NumHDonors(mol),
        'NumHAcceptors': Descriptors.NumHAcceptors(mol),
        'TPSA': Descriptors.TPSA(mol),
        'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
        'RingCount': Descriptors.RingCount(mol),
        'HeavyAtomCount': Descriptors.HeavyAtomCount(mol),
        'NumValenceElectrons': Descriptors.NumValenceElectrons(mol),
        'BertzCT': GraphDescriptors.BertzCT(mol),
        'HallKierAlpha': GraphDescriptors.HallKierAlpha(mol),
    }

    # Morgan fingerprint: one 0/1 feature per bit
    fp = MORGAN_GEN.GetFingerprint(mol)
    for i, bit in enumerate(list(fp)):
        features[f'fp_{i}'] = bit

    return features

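def featurize(smiles_col):
    # Build the frame in one pass; .apply(pd.Series) is very slow with 1000+ columns.
    # Unparsable SMILES (None) become rows of missing values.
    return pd.DataFrame([f or {} for f in smiles_col.map(get_mol_features)], index=smiles_col.index)

print("Generating features...")
train_features = featurize(df_train['SMILES'])
test_features = featurize(test_df['SMILES'])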

# Append the engineered features to the original columns
# (AutoGluon can also handle raw text and categorical columns)
train_data = pd.concat([df_train, train_features], axis=1)
test_data = pd.concat([test_df, test_features], axis=1)

# Drop id (an arbitrary row identifier) and SMILES (already encoded as features)
train_data = train_data.drop(columns=['id', 'SMILES'])
test_data = test_data.drop(columns=['id', 'SMILES'])
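
`get_mol_features` returns `None` when RDKit cannot parse a SMILES string, and those rows end up with missing values in every engineered column. A small helper along these lines can report which inputs failed before training starts. `find_failed_smiles` is a hypothetical name, and the stand-in featurizer below only mimics the dict-or-`None` contract so that the sketch runs without RDKit.

```python
def find_failed_smiles(smiles_list, featurize):
    """Return (position, smiles) pairs for which featurize() returned None."""
    return [(i, s) for i, s in enumerate(smiles_list) if featurize(s) is None]


def fake_featurize(smiles):
    # Stand-in with the same contract as get_mol_features: a dict, or None on failure.
    return None if not smiles or "?" in smiles else {"length": len(smiles)}


failed = find_failed_smiles(["CCO", "c1ccccc1", "not?valid", ""], fake_featurize)
print(f"{len(failed)} inputs failed:", failed)
```

In the notebook, the equivalent call would be `find_failed_smiles(df_train['SMILES'], get_mol_features)`. The reported positions are row positions, not index labels.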

## 2. AutoGluon Training
We use `presets='best_quality'`, which enables bagging and multi-layer stacking at the cost of much longer training time.

In [None]:
predictor = TabularPredictor(
    label='Tm',
    eval_metric='mean_absolute_error',
    problem_type='regression'
).fit(
    train_data,
    presets='best_quality',  # THIS IS KEY: enables bagging and multi-layer stacking
    time_limit=3600*2,       # Total budget in seconds (2 hours); adjust as needed
    ag_args_fit={'num_gpus': 1}  # Request one GPU for model fitting; remove on a CPU-only machine
)
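
The bagging behind `best_quality` rests on k-fold splitting: each model is trained k times, each time holding out a different fold, so every training row gets a prediction from a model that never saw it. Those out-of-fold predictions are what the next stacking layer trains on. Below is a minimal sketch of the splitting step only, using a hypothetical `kfold_indices` helper; it is not AutoGluon's implementation.

```python
import random


def kfold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs; each row is held out exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    # Deal the shuffled rows into n_folds roughly equal folds
    folds = [order[i::n_folds] for i in range(n_folds)]
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(held_out)


splits = list(kfold_indices(n_rows=10, n_folds=5))
for train_idx, held_out_idx in splits:
    print("train:", train_idx, "held out:", held_out_idx)
```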

## 3. Submission

In [None]:
preds = predictor.predict(test_data)
submission_df['Tm'] = preds
submission_df.to_csv('submission_autogluon.csv', index=False)
print("Saved submission_autogluon.csv")
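
Before uploading, it can be worth checking that the file has the expected shape and no missing predictions. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the `id`/`Tm` column layout is an assumption based on the columns used in this notebook, so adjust it to match `sample_submission.csv`.

```python
import csv
import math
import os
import tempfile


def check_submission(path, expected_rows, columns=("id", "Tm"), target="Tm"):
    """Raise ValueError on wrong columns, wrong row count, or a non-finite target."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if tuple(reader.fieldnames or ()) != tuple(columns):
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for line_no, row in enumerate(rows, start=2):
        # pandas writes NaN as an empty field, so treat empty as missing
        value = float(row[target]) if row[target] else math.nan
        if not math.isfinite(value):
            raise ValueError(f"non-finite {target} on line {line_no}")
    return True


# Demo on a throwaway file with made-up values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("id,Tm\n1,300.5\n2,280.0\n")
    print(check_submission(demo_path, expected_rows=2))
```

In the notebook, the call would be `check_submission('submission_autogluon.csv', expected_rows=len(test_df))`.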

In [None]:
# Check the leaderboard of models
predictor.leaderboard()