# Introduction to AutoGluon

AutoGluon is an open-source AutoML library that automates model selection, training, and ensembling. It is particularly useful for tabular data, where it can train high-quality models with only a few lines of code.

**Key features of AutoGluon:**
- **AutoML for Tabular Data**: AutoGluon automatically selects and trains a variety of models (like Random Forests, XGBoost, Neural Networks, etc.) to find the best-performing model for your dataset.
- **Ensemble Methods**: AutoGluon combines different models through ensembling techniques to boost prediction accuracy.
- **Easy-to-Use API**: With only a few lines of code, you can build powerful machine learning models.
- **Hyperparameter Optimization**: AutoGluon automates the process of hyperparameter tuning, helping you find the best parameters for your models.
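- **Supports Multiple Task Types**: You can use AutoGluon for classification and regression on tabular data and, through its separate `multimodal` and `timeseries` modules, for problems involving text or images and for time series forecasting.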

AutoGluon is an excellent choice for users who want to quickly build predictive models without needing to fine-tune machine learning algorithms manually.
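
To make the ensembling idea from the feature list concrete, here is a minimal sketch of a two-model weighted ensemble. It is an illustration only, not AutoGluon's internal algorithm: the helper names (`mse`, `blend`, `best_weight`) and all numbers are made up for this example, and the weight is chosen by a simple grid search on a tiny validation set.

```python
# Illustrative only: a two-model weighted ensemble on made-up validation data.
# This is NOT AutoGluon's internal algorithm; names and numbers are hypothetical.

def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def blend(preds_a, preds_b, weight_a):
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]

def best_weight(preds_a, preds_b, targets, steps=20):
    """Grid-search the weight for model A that minimizes validation MSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mse(blend(preds_a, preds_b, w), targets))

targets = [10.0, 12.0, 14.0, 16.0]
model_a = [11.0, 13.0, 15.0, 17.0]   # consistently 1.0 too high
model_b = [9.0, 11.0, 13.0, 15.0]    # consistently 1.0 too low

w = best_weight(model_a, model_b, targets)
ensemble = blend(model_a, model_b, w)
print(f"weight for model A: {w:.2f}")
print(f"MSE A={mse(model_a, targets):.3f}  B={mse(model_b, targets):.3f}  "
      f"ensemble={mse(ensemble, targets):.3f}")
```

Because the two hypothetical models err in opposite directions, the blended prediction has a lower error than either model alone, which is the intuition behind combining models.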


In [None]:
# Import the libraries used in this notebook
# (run the `!pip install autogluon` cell first if AutoGluon is not installed yet)
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor  # AutoGluon for tabular data prediction
from google.colab import drive  # Google Colab module to access Google Drive

# Mounting Google Drive to access the dataset stored in your Drive
drive.mount('/content/drive')

# Defining the directory path where the dataset is stored in Google Drive
directory = '/content/drive/My Drive/california-house-prices/'



Mounted at /content/drive


In [None]:
!pip install autogluon


Collecting autogluon
  Downloading autogluon-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.core==1.1.1 (from autogluon.core[all]==1.1.1->autogluon)
  Downloading autogluon.core-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.features==1.1.1 (from autogluon)
  Downloading autogluon.features-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.tabular==1.1.1 (from autogluon.tabular[all]==1.1.1->autogluon)
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl.metadata (13 kB)
Collecting autogluon.multimodal==1.1.1 (from autogluon)
  Downloading autogluon.multimodal-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting autogluon.timeseries==1.1.1 (from autogluon.timeseries[all]==1.1.1->autogluon)
  Downloading autogluon.timeseries-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.core==1.1.1->autogluon.core[all]==1.1.1->autogluon)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metad

In [None]:
%%writefile kaggle_california_house.py
import pandas as pd
import numpy as np
import argparse
import os
import random
from autogluon.tabular import TabularPredictor
from autogluon.multimodal import MultiModalPredictor
import torch as th

def get_parser():
    parser = argparse.ArgumentParser(
        description='The Basic Example of AutoGluon for House Price Prediction.')
    parser.add_argument('--mode',
                        choices=['stack5', 'weighted', 'single', 'single_bag5'],
                        default='weighted',
                        help='"stack5" means 5-fold stacking. "weighted" means weighted ensemble.'
                             ' "single" means use a single model.'
                             ' "single_bag5" means 5-fold bagging via the AutoMM model.')
    parser.add_argument('--automm-mode', choices=['ft-transformer', 'mlp'],
                        default='ft-transformer', help='Fusion model in AutoMM.')
    parser.add_argument('--text-backbone', default='google/electra-small-discriminator')
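    parser.add_argument('--cat-as-text', default=False, type=lambda s: str(s).lower() in ('1', 'true', 'yes'))  # parse the text to a bool; a raw string such as 'False' would otherwise be truthy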
    parser.add_argument('--data_path', type=str, default='california-house-prices')
    parser.add_argument('--seed', type=int, default=123)
    parser.add_argument('--exp_path', default=None)
    parser.add_argument('--with_tax_values', default=1, type=int)
    return parser

def get_automm_hyperparameters(mode, text_backbone, cat_as_text):
    if mode == "ft-transformer":
        hparams = {"model.names": ["ft_transformer", "hf_text", "fusion_transformer"],
                   "model.hf_text.checkpoint_name": text_backbone,
                   "data.categorical.convert_to_text": cat_as_text}
    elif mode == "mlp":
        hparams = {"model.names": ["categorical_mlp", "numerical_mlp", "hf_text", "fusion_mlp"],
                   "model.hf_text.checkpoint_name": text_backbone,
                   "data.categorical.convert_to_text": cat_as_text}
    else:
        raise NotImplementedError(f"mode={mode} is not supported!")
    return hparams

def preprocess(df, with_tax_values=True, log_scale_lot=True,
               log_scale_listed_price=True, has_label=True):
    new_df = df.copy()
    new_df.drop('Id', axis=1, inplace=True)
    new_df['Elementary School'] = new_df['Elementary School'].apply(lambda ele: str(ele)[:-len(' Elementary School')] if str(ele).endswith('Elementary School') else ele)
    if log_scale_lot:
        new_df['Lot'] = np.log(new_df['Lot'] + 1)
    if log_scale_listed_price:
        log_listed_price = np.log(new_df['Listed Price']).clip(0, None)
        new_df['Listed Price'] = log_listed_price
    if with_tax_values:
        new_df['Tax assessed value'] = np.log(new_df['Tax assessed value'] + 1)
        new_df['Annual tax amount'] = np.log(new_df['Annual tax amount'] + 1)
    else:
        new_df.drop('Tax assessed value', axis=1, inplace=True)
        new_df.drop('Annual tax amount', axis=1, inplace=True)
    if has_label:
        new_df['Sold Price'] = np.log(new_df['Sold Price'])
    return new_df

def set_seed(seed):
    th.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

def train(args):
    set_seed(args.seed)
    train_df = pd.read_csv(os.path.join(args.data_path, 'train.csv'))
    test_df = pd.read_csv(os.path.join(args.data_path, 'test.csv'))
    submission_df = pd.read_csv(os.path.join(args.data_path, 'sample_submission.csv'))

    train_df = preprocess(train_df, with_tax_values=args.with_tax_values, has_label=True)
    test_df = preprocess(test_df, with_tax_values=args.with_tax_values, has_label=False)

    label_column = 'Sold Price'
    eval_metric = 'r2'

    automm_hyperparameters = get_automm_hyperparameters(args.automm_mode, args.text_backbone, args.cat_as_text)

    tabular_hyperparameters = {
        'GBM': [
            {},
            {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}},
        ],
        'CAT': {},
        'AG_AUTOMM': automm_hyperparameters,
    }

    if args.mode == 'single':
        predictor = MultiModalPredictor(eval_metric=eval_metric, label=label_column, path=args.exp_path)
        predictor.fit(train_df, hyperparameters=automm_hyperparameters, seed=args.seed)
    else:
        predictor = TabularPredictor(eval_metric=eval_metric, label=label_column, path=args.exp_path)
        predictor.fit(train_df, hyperparameters=tabular_hyperparameters)

    predictions = np.exp(predictor.predict(test_df))
    submission_df['Sold Price'] = predictions
    submission_df.to_csv(os.path.join(args.exp_path, 'submission.csv'), index=False)

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()
    if args.exp_path is None:
        args.exp_path = f'automm_kaggle_house_{args.mode}_{args.automm_mode}_cat_to_text{args.cat_as_text}_{args.text_backbone}'
    th.manual_seed(args.seed)
    train(args)


Writing kaggle_california_house.py
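
The script trains on the logarithm of `Sold Price` and converts predictions back with `np.exp`. The sketch below illustrates that round trip with the standard-library `math` module instead of NumPy. The prices, lot sizes, and the 0.01 prediction error are made-up values for illustration, not taken from the dataset.

```python
import math

# Hypothetical sale prices in dollars (made-up values, not from the dataset).
sold_prices = [350_000.0, 820_000.0, 1_250_000.0]

# Training target: the natural log of the price, as preprocess() does for 'Sold Price'.
log_targets = [math.log(p) for p in sold_prices]

# Pretend a model predicts the log targets with a small additive error.
log_predictions = [t + 0.01 for t in log_targets]

# Invert the transform, mirroring np.exp(predictor.predict(test_df)) in train().
price_predictions = [math.exp(lp) for lp in log_predictions]

for actual, predicted in zip(sold_prices, price_predictions):
    print(f"actual={actual:>12,.0f}  predicted={predicted:>12,.0f}  "
          f"ratio={predicted / actual:.4f}")

# Features that can be zero use log(x + 1), which maps 0 to 0 instead of -inf.
lot_sizes = [0.0, 5000.0]
log_lots = [math.log1p(x) for x in lot_sizes]
print(log_lots)
```

An additive error in log space becomes a multiplicative error in price space, so every prediction is off by the same ratio regardless of how expensive the house is. This is one reason a log-transformed target is common for prices that span several orders of magnitude.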


In [None]:
!pip uninstall torchvision
!pip install torchvision

Found existing installation: torchvision 0.18.1
Uninstalling torchvision-0.18.1:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/torchvision-0.18.1.dist-info/*
    /usr/local/lib/python3.10/dist-packages/torchvision.libs/libcudart.7ec1eba6.so.12
    /usr/local/lib/python3.10/dist-packages/torchvision.libs/libjpeg.ceea7512.so.62
    /usr/local/lib/python3.10/dist-packages/torchvision.libs/libnvjpeg.f00ca762.so.12
    /usr/local/lib/python3.10/dist-packages/torchvision.libs/libpng16.7f72a3c5.so.16
    /usr/local/lib/python3.10/dist-packages/torchvision.libs/libz.4e87b236.so.1
    /usr/local/lib/python3.10/dist-packages/torchvision/*
Proceed (Y/n)? y
  Successfully uninstalled torchvision-0.18.1
Collecting torchvision
  Downloading torchvision-0.19.1-cp310-cp310-manylinux1_x86_64.whl.metadata (6.0 kB)
Collecting torch==2.4.1 (from torchvision)
  Downloading torch-2.4.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.1-

### Running the `kaggle_california_house.py` Script

This cell runs `kaggle_california_house.py`, the script written by the `%%writefile` cell above. It trains an AutoGluon model on the California house prices data and writes a submission file. Each part of the command is explained below:

- `!python3 kaggle_california_house.py`: The leading `!` tells Colab to run the line as a shell command, which executes the script with Python 3.

- `--mode single`: This argument selects the training mode. With `single`, the script trains one `MultiModalPredictor` (AutoMM) model. Any other accepted value (`stack5`, `weighted`, `single_bag5`) takes the `TabularPredictor` branch of the script.

- `--data_path "/content/drive/My Drive/california-house-prices"`: This defines the location of the dataset in Google Drive. The script reads `train.csv`, `test.csv`, and `sample_submission.csv` from this directory. The quotes are required because the path contains a space.

- `--exp_path "/content/drive/My Drive/california-house-prices/exp"`: This argument sets the path where the predictor saves the trained model and where the script writes `submission.csv`.

Because both paths point into Google Drive, the dataset and the results persist after the Colab session ends.
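
As a rough sketch of what happens to these flags inside the script, the snippet below rebuilds a reduced copy of the script's parser (only the three flags used in the command) and parses the same arguments. Using `shlex.split` to imitate the shell's handling of quoted paths is an assumption made for this illustration; the script itself simply receives the arguments from the shell.

```python
import argparse
import shlex

# Reduced copy of the script's parser: only the three flags used in the command.
parser = argparse.ArgumentParser()
parser.add_argument('--mode',
                    choices=['stack5', 'weighted', 'single', 'single_bag5'],
                    default='weighted')
parser.add_argument('--data_path', type=str, default='california-house-prices')
parser.add_argument('--exp_path', default=None)

command_args = (
    '--mode single '
    '--data_path "/content/drive/My Drive/california-house-prices" '
    '--exp_path "/content/drive/My Drive/california-house-prices/exp"'
)

# shlex.split mimics the shell: a quoted path with spaces stays one argument.
args = parser.parse_args(shlex.split(command_args))
print(args.mode)
print(args.data_path)
print(args.exp_path)
```

Without the quotes, `My` and `Drive/california-house-prices` would arrive as two separate arguments and parsing would fail.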


In [None]:
!python3 kaggle_california_house.py --mode single --data_path "/content/drive/My Drive/california-house-prices" --exp_path "/content/drive/My Drive/california-house-prices/exp"


2024-09-27 01:18:21.464264: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-27 01:18:21.485979: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-27 01:18:21.492597: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  result = getattr(ufunc, method)(*inputs, **kwargs)
  result = getattr(ufunc, method)(*inputs, **kwargs)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          12
Pytorch Version:    2.4.1+cu121
CUDA Version:       12.1


### Trying to Run the `kaggle_california_house.py` Script with `--mode predict`

This cell reruns the script with `--mode predict`, with the intention of reusing the trained model instead of training a new one. The script does not implement such a mode: its `--mode` argument accepts only `stack5`, `weighted`, `single`, and `single_bag5`, so `argparse` rejects the value and prints the usage message shown in the output. The components of the command are:

- `!python3 kaggle_california_house.py`: This executes the Python script using Python 3 within the Colab environment.

- `--mode predict`: This argument is meant to request prediction only. Because `predict` is not among the declared choices, the script exits before it loads any data or model.

- `--data_path "/content/drive/My Drive/california-house-prices"`: This argument specifies the location of the dataset in Google Drive.

- `--exp_path "/content/drive/My Drive/california-house-prices/exp"`: This argument points to the directory where the earlier training run saved its model and outputs.

To make predictions without retraining, the script would need an additional branch that loads the saved predictor from `--exp_path` (for example with `MultiModalPredictor.load`) and calls `predict` on the test data.
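
The `--mode` flag in the script is declared with a fixed `choices` list, and `predict` is not in it. The following minimal sketch uses a reduced copy of that parser to show how `argparse` treats a value outside the list. The helper `try_mode` is a hypothetical name introduced only for this example.

```python
import argparse

# Reduced copy of the script's parser: only the --mode flag.
parser = argparse.ArgumentParser(prog='kaggle_california_house.py')
parser.add_argument('--mode',
                    choices=['stack5', 'weighted', 'single', 'single_bag5'],
                    default='weighted')

def try_mode(value):
    """Return the parsed mode, or None if argparse rejects the value."""
    try:
        return parser.parse_args(['--mode', value]).mode
    except SystemExit:
        # argparse prints the usage message to stderr and exits with status 2.
        return None

print(try_mode('single'))
print(try_mode('predict'))
```

This matches the cell output: the run stops at the usage message and never reaches the training or prediction code.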


In [None]:
!python3 kaggle_california_house.py --mode predict --data_path "/content/drive/My Drive/california-house-prices" --exp_path "/content/drive/My Drive/california-house-prices/exp"


2024-09-27 03:38:19.323091: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-27 03:38:19.345433: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-27 03:38:19.352128: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
usage: kaggle_california_house.py [-h] [--mode {stack5,weighted,single,single_bag5}]
                                  [--automm-mode {ft-transformer,mlp}]
                                  [--text-backbone TEXT_BACKBONE] [--cat-as-text CAT_AS_TEXT]
                                  [--data_path DATA_PATH] [--seed SEED] [--exp_path EXP_PATH]
                     

In [None]:
!python3 kaggle_california_house.py --mode single --data_path "/content/drive/My Drive/california-house-prices" --exp_path "/content/drive/My Drive/california-house-prices/exp"


2024-09-27 04:03:06.090637: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-27 04:03:06.112322: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-27 04:03:06.118930: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  result = getattr(ufunc, method)(*inputs, **kwargs)
  result = getattr(ufunc, method)(*inputs, **kwargs)
Traceback (most recent call last):
  File "/content/kaggle_california_house.py", line 107, in <module>
    train(args)
  File "/content/kaggle_california_house.py", line 92, in train
    predictor.fit(train_df, hyperparameters=automm_hyperparameters, seed=arg

### Deleting the Experiment Directory

This cell uses the `shutil` module to remove the directory that stores the experiment results, so that the training script can be rerun with a clean output path.

- `import shutil`: We first import the `shutil` module, which provides high-level file and directory handling functions.

- `shutil.rmtree('/content/drive/My Drive/california-house-prices/exp')`: This command removes the entire directory located at the specified path, including all of its contents. In this case, it deletes the experiment directory `exp` under the `california-house-prices` folder in Google Drive.

- `print("Directory deleted. Now you can rerun the script.")`: After the directory is successfully deleted, a message is printed to confirm the action.

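This is helpful when you want to clean up old files before running new experiments, ensuring that no conflicting data or results remain in the directory. Note that it also removes the trained model and any `submission.csv` saved under `exp`, so copy anything you want to keep first. `shutil.rmtree` raises `FileNotFoundError` if the directory does not exist.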
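
If the cell may be run more than once, a guarded variant avoids the error raised when the directory is already gone. The sketch below is one possible approach, demonstrated on a temporary directory rather than the real Drive path. The helper name `remove_dir_if_exists` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def remove_dir_if_exists(path):
    """Delete `path` and its contents; return True if something was removed."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True

# Demonstrate on a throwaway directory instead of the real Drive path.
scratch = Path(tempfile.mkdtemp())
exp_dir = scratch / 'exp'
exp_dir.mkdir()
(exp_dir / 'submission.csv').write_text('Id,Sold Price\n')

print(remove_dir_if_exists(exp_dir))   # first call removes the directory
print(remove_dir_if_exists(exp_dir))   # second call finds nothing to remove
shutil.rmtree(scratch)
```

In the notebook, the same helper could be called with the `exp` path used above.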


In [None]:
import shutil

# Remove the existing directory
shutil.rmtree('/content/drive/My Drive/california-house-prices/exp')

print("Directory deleted. Now you can rerun the script.")
