**Stretching AutoGluon's Legs (SAL)**

For reference, make some tea, sit back and [RTFM](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html)

Let's first get some housekeeping out of the way by installing the necessary packages, including, you guessed it, AutoGluon.

In [None]:
!pip install autogluon
!pip uninstall lightgbm -y
!pip install lightgbm --install-option=--gpu

In [None]:
# imports that we will need
import pandas as pd                              # Bread and butter of data science
from autogluon.tabular import TabularPredictor   # We want a tabular predictor from Autogluon
import os                                        # Operating system, for you know, the operating system
import numpy as np                               # Another bread and butter

In [None]:
# just grab the files that we will need and save them to a dictionary for easy reference

data_files_paths = {}

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file_name = filename.split('.')[0]
        data_files_paths[file_name] = os.path.join(dirname, filename)

In [None]:
# This function is reused from a fellow Kaggler; it is very useful for reducing the memory footprint of DataFrames
def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
    
            # test if column can be converted to an integer;
            # absolute differences are summed so that fractional parts of
            # opposite sign cannot cancel each other out
            asint = props[col].astype(np.int64)
            result = (props[col] - asint).abs().sum()
            if result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)    
            
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props
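
For comparison, pandas ships a built-in helper, `pd.to_numeric(..., downcast=...)`, that picks the smallest dtype able to hold a column's values. The sketch below is not the function above, just a small illustration of the same downcasting idea on a made-up frame; the column names `small_ints`, `signed_ints` and `floats` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are made up for illustration
df = pd.DataFrame({
    "small_ints": np.array([0, 1, 200], dtype=np.int64),
    "signed_ints": np.array([-5, 0, 100], dtype=np.int64),
    "floats": np.array([0.5, 1.5, 2.25], dtype=np.float64),
})

before = df.memory_usage(index=False).sum()

# Ask pandas for the smallest dtype that can hold each column
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="unsigned")
df["signed_ints"] = pd.to_numeric(df["signed_ints"], downcast="integer")
df["floats"] = pd.to_numeric(df["floats"], downcast="float")

after = df.memory_usage(index=False).sum()
print(df.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats to 32 bits trades precision for memory, so it is worth checking that the model's score does not move after the conversion.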


In [None]:
# Load the training data, reduce its memory footprint, and check if it looks semi decent

train_data = pd.read_csv(data_files_paths.get('train'), index_col=0)
train_data = reduce_mem_usage(train_data)
train_data.head()

In [None]:
# adding some extra goodies to the DataFrame
def add_aux_columns(df: pd.DataFrame):
    # Row-wise statistics over the feature columns f0..f99 of the frame passed in
    feature_cols = [f'f{i}' for i in range(100)]
    df['sum'] = df[feature_cols].sum(axis=1)
    df['mean'] = df[feature_cols].mean(axis=1)
    df['median'] = df[feature_cols].median(axis=1)
    return df
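
To see what these row-wise aggregates look like, here is a minimal sketch on a toy frame with three feature columns instead of the competition's 100. The frame `toy` and its values are made up; the only point is that `axis=1` produces one value per row, and that mean and median differ when a row is skewed.

```python
import pandas as pd

# Toy frame with three feature columns (hypothetical values)
toy = pd.DataFrame({"f0": [1.0, 4.0], "f1": [2.0, 4.0], "f2": [9.0, 10.0]})
feature_cols = [f"f{i}" for i in range(3)]

# axis=1 aggregates across columns, giving one value per row
row_stats = pd.DataFrame({
    "sum": toy[feature_cols].sum(axis=1),
    "mean": toy[feature_cols].mean(axis=1),
    "median": toy[feature_cols].median(axis=1),
})
print(row_stats)
```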

In [None]:
train_data = add_aux_columns(df=train_data)

In [None]:
# Variables for AutoGluon
label_column = 'target'
eval_metric = 'roc_auc'
save_path = '/kaggle/working/AutoGluonModelAkaTheBeast'
time_limit = 3600 # time limit for autogluon in seconds

In [None]:
predictor = TabularPredictor(
    label = label_column,
    eval_metric = eval_metric,
    path = save_path
)

From my experiments, AutoGluon can take some time to train. It trains quite a few classification models in the process, so that is expected; patience is key here.

A note from my experiments: the longer you train, the longer your inference step is going to be. Luckily there is no time limit on inference, so train for as long as you dare.


For reference, on a 12-core machine with 32 GB of RAM, a model that was trained for 12 hours took about 5 hours to run inference on the test dataset for the previous Tabular Comp.
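
One way to avoid being surprised by a multi-hour inference step is to time predictions on a small sample and scale up. The helper below, `estimate_inference_seconds`, is a hypothetical utility and not part of AutoGluon, and `fake_predict` is only a stand-in for something like `predictor.predict_proba`. It assumes inference time grows linearly with the number of rows and ignores fixed costs such as loading models from disk, so treat the result as a rough guide.

```python
import time

def estimate_inference_seconds(predict_fn, sample_rows, total_rows):
    """Time predict_fn on a small sample and scale linearly to total_rows."""
    start = time.perf_counter()
    predict_fn(sample_rows)
    elapsed = time.perf_counter() - start
    return elapsed * total_rows / len(sample_rows)

# Stand-in for a real predictor, purely for demonstration
def fake_predict(rows):
    return [sum(row) for row in rows]

sample = [[0.1, 0.2, 0.3]] * 1_000
estimate = estimate_inference_seconds(fake_predict, sample, total_rows=500_000)
print(f"Estimated inference time: {estimate:.2f} s")
```

With a fitted predictor, the same helper could be called with `predictor.predict_proba` and a slice such as `test_data.iloc[:1000]`, since `len()` works on DataFrames as well.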




In [None]:
predictor.fit(
    train_data,
    presets='best_quality',
    time_limit=time_limit,
    verbosity=3,
    ag_args_fit={'num_gpus': 1}
)

Let's have a look at what AutoGluon has done 

In [None]:
# Note: these rows were also used for training, so the test scores shown here are optimistic
predictor.leaderboard(train_data.iloc[:1000], silent=True)

Nice! It has trained some models and bagged them, then fitted another model on top of those bags.
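
The idea of fitting a model on top of other models' outputs can be sketched in a few lines of NumPy. This is a deliberately simplified toy, not AutoGluon's implementation: it uses regression on synthetic data, two weak linear base models that each see only half of the features, and a linear stacker trained on their out-of-fold predictions. All names and data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

def fit_linear(features, target):
    # Ordinary least-squares weights
    w, *_ = np.linalg.lstsq(features, target, rcond=None)
    return w

# Two deliberately weak base models, each restricted to half of the features
base_feature_sets = [[0, 1], [2, 3]]
folds = np.array_split(rng.permutation(len(X)), 5)

# Out-of-fold predictions: every row is predicted by a model that never saw it
oof = np.zeros((len(X), len(base_feature_sets)))
for holdout in folds:
    train = np.setdiff1d(np.arange(len(X)), holdout)
    for j, cols in enumerate(base_feature_sets):
        w = fit_linear(X[np.ix_(train, cols)], y[train])
        oof[holdout, j] = X[np.ix_(holdout, cols)] @ w

# The stacker is trained on base-model predictions rather than raw features
stack_w = fit_linear(oof, y)
stacked = oof @ stack_w

def mse(pred):
    return float(np.mean((pred - y) ** 2))

base_mses = [mse(oof[:, j]) for j in range(oof.shape[1])]
print("base MSEs:", base_mses)
print("stacked MSE:", mse(stacked))
```

Using out-of-fold predictions matters here: if the stacker were trained on predictions for rows the base models had already seen, it would learn to trust them more than it should.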


Enough scenery watching; time to load the test data and perform some inference.

In [None]:
test_data = pd.read_csv(data_files_paths.get('test'), index_col=0)

In [None]:
# to be consistent, let's reduce the memory of the test dataset even though it *shouldn't* make a difference
test_data = reduce_mem_usage(test_data)
# Do not forget to add extra columns
test_data = add_aux_columns(df=test_data)

In [None]:
predictions = predictor.predict_proba(test_data)

In [None]:
predictions = predictions[1].reset_index()
predictions.columns = ['id', 'target']
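
The reshaping above is easier to follow on a mock result. The sketch below assumes a binary problem where the probability frame has one column per class label (`0` and `1`) and is indexed by the row ids of the input; the frame `proba` and its values are made up.

```python
import pandas as pd

# Mock of a binary probability output: one column per class label,
# indexed by the ids of the rows that were scored
proba = pd.DataFrame(
    {0: [0.9, 0.2, 0.4], 1: [0.1, 0.8, 0.6]},
    index=pd.Index([100, 101, 102], name="id"),
)

# Keep the positive-class column and turn the id index into a regular column
submission = proba[1].reset_index()
submission.columns = ["id", "target"]
print(submission)
```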

In [None]:
# SAVE
predictions[['id', 'target']].to_csv('/kaggle/working/submission.csv', index=False)