# Introduction

## Context

I was looking through the latest Kaggle competitions and two of them caught my eye. The first is the Playground Series competition on depression detection, and the other is the Gemini Long Context use-cases competition. Binary classification is a common problem in machine learning, so I thought: why not try to tackle both in one go? Here is how I went about it.

## More context

The context window is the number of tokens that the model can take into account at once, and tokens are the words, word pieces, or characters that make up the input text. Gemini stands out for how many tokens it can hold in context: Gemini 1.5 Flash comes with a 1 million token context window and Gemini 1.5 Pro comes with a 2 million token context window.
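To get a feel for what fits in a window of that size, here is a rough sketch that estimates a token count from the character length of a text. The ratio of about 4 characters per token is only a rule of thumb that I am assuming here, and `estimate_tokens` and `fits_in_window` are hypothetical helper names. For real numbers, the API's own token counter is the thing to trust.

```python
# Rough sketch: estimate whether a piece of text fits in a context window.
# ASSUMPTION: ~4 characters per token is a rule of thumb, not a real tokenizer.
CHARS_PER_TOKEN = 4

WINDOW_1M = 1_000_000  # tokens
WINDOW_2M = 2_000_000  # tokens


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_window(text: str, window: int) -> bool:
    """True if the estimated token count does not exceed the window."""
    return estimate_tokens(text) <= window


# A made-up CSV-like string: 1000 rows of 18 characters each.
sample = "id,age,depression\n" * 1000
print(estimate_tokens(sample))            # 4500
print(fits_in_window(sample, WINDOW_1M))  # True
```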

However, filling the context window by sending large documents (or datasets, in our case) with every request can be expensive and slow. Fortunately, Gemini also provides context caching. Context caching stores the context on the service side so that it can be reused for future requests: you send the context once and then send only the new tokens in the subsequent requests. Cheap and fast!
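The saving is easy to see with a bit of arithmetic. The sketch below only counts input tokens uploaded with and without caching. The numbers are made up for illustration, and any charges for cached tokens or cache storage are deliberately not modeled.

```python
# Sketch: input tokens uploaded with and without context caching.
# All numbers are illustrative values, not measurements or prices.


def tokens_sent_without_cache(context_tokens: int, prompt_tokens: int, n_requests: int) -> int:
    """Every request resends the full context plus its own prompt."""
    return n_requests * (context_tokens + prompt_tokens)


def tokens_sent_with_cache(context_tokens: int, prompt_tokens: int, n_requests: int) -> int:
    """The context is uploaded once; each request then sends only its new prompt."""
    return context_tokens + n_requests * prompt_tokens


context, prompt, n = 200_000, 500, 20
print(tokens_sent_without_cache(context, prompt, n))  # 4010000
print(tokens_sent_with_cache(context, prompt, n))     # 210000
```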

## How did I go about solving the problem?

Usually, when we want to quickly dip our toes in and get a sense of a common ML problem, like binary classification, we use AutoML libraries. One popular AutoML library is PyCaret. PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment. However, to get somewhat better results, we need to tweak the arguments of the PyCaret functions. For example, there are multiple choices to make during the data preparation and feature engineering steps. Similarly, after training a bunch of models, we have to decide whether to ensemble them and, if so, which type of ensemble to use.

We can either do all of this manually, or we can give Gemini all the information as context and let it do the heavy lifting for us. This is where context caching comes in handy. So, I decided to extract the relevant documentation from the PyCaret docs, slice the dataset into a smaller CSV file, and upload both to Gemini. Because this context can be cached, I can reuse it for future requests and enable Gemini to train the binary classification model for me.
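As a minimal sketch of that idea, the snippet below bundles a documentation excerpt and a CSV slice into context parts and creates a cache from them. It reflects my understanding of the caching interface in the `google-generativeai` package (`caching.CachedContent.create` and `GenerativeModel.from_cached_content`). The helper names, the system instruction text, and the one-hour TTL are assumptions for illustration, not the exact setup used later in this notebook.

```python
import datetime


def build_context(pycaret_docs: str, csv_sample: str) -> list[str]:
    """Bundle the documentation excerpt and the dataset slice as separate text parts."""
    return [
        "PyCaret documentation excerpt:\n" + pycaret_docs,
        "Dataset sample (CSV):\n" + csv_sample,
    ]


def create_cached_model(parts, model_name="models/gemini-1.5-flash-002", ttl_minutes=60):
    """Create a cache from `parts` and return a model bound to that cache.

    Needs the google-generativeai package and a configured API key, so the
    import is deferred until the function is actually called.
    """
    import google.generativeai as genai
    from google.generativeai import caching

    cache = caching.CachedContent.create(
        model=model_name,
        # Hypothetical instruction, only for illustration.
        system_instruction="You are an ML engineer who writes PyCaret code.",
        contents=parts,
        ttl=datetime.timedelta(minutes=ttl_minutes),  # assumed lifetime
    )
    return genai.GenerativeModel.from_cached_content(cached_content=cache)


# Building the parts needs no network access; the strings here are placeholders.
parts = build_context("setup(data, target, ...)", "age,depression\n23,0\n41,1")
print(len(parts))  # 2
```

Calling `create_cached_model(parts)` would then return a model whose `generate_content` requests reuse the cached context.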


## What is the expected outcome?

I am just curious what the leaderboard score will be in both competitions :D

My best guess is that the model will perform better than a base PyCaret AutoML model with default settings. However, I'm sure it won't beat the top models on the leaderboard (it will probably land in the top 25 percent). But hey, it's worth a shot!

# Setup

In [1]:
# Constants

SEED = 42
MODEL = "models/gemini-1.5-flash-002"

In [2]:
import os

import google.generativeai as genai
from fastkaggle.core import iskaggle

In [3]:
if iskaggle:
    from kaggle_secrets import UserSecretsClient

    GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
else:
    from dotenv import load_dotenv

    load_dotenv()

    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

genai.configure(api_key=GOOGLE_API_KEY)

In [4]:
from pathlib import Path

dataset_path = Path("/kaggle/input/playground-series-s4e11")
output_path = Path("/kaggle/working")

if not iskaggle:
    import kagglehub

    dataset_path = kagglehub.competition_download("playground-series-s4e11")
    dataset_path = Path(dataset_path)
    output_path = Path(dataset_path)

train_csv_path = dataset_path / "train.csv"
test_csv_path = dataset_path / "test.csv"
submission_csv_path = dataset_path / "sample_submission.csv"

# Loading dataset

In [5]:
import pandas as pd

train_df = pd.read_csv(train_csv_path, index_col=0)
test_df = pd.read_csv(test_csv_path, index_col=0)
submission_df = pd.read_csv(submission_csv_path, index_col=0)

In [6]:
import re


def convert_to_snake_case(s):
    """
    Convert a string to snake_case.
    """

    s = re.sub(r"[^\w\s]", " ", s)
    return s.lower().strip().replace(" ", "_")


train_df.columns = [convert_to_snake_case(col) for col in train_df.columns]
test_df.columns = [convert_to_snake_case(col) for col in test_df.columns]
submission_df.columns = [convert_to_snake_case(col) for col in submission_df.columns]
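
To make the renaming concrete, here is the same helper applied to a few column names. The names are illustrative examples rather than a listing of the dataset's columns. Note the last case: consecutive separators produce repeated underscores, because every space is replaced individually.

```python
import re


def convert_to_snake_case(s):
    """Same logic as the helper above: punctuation -> space, lowercase, spaces -> underscores."""
    s = re.sub(r"[^\w\s]", " ", s)
    return s.lower().strip().replace(" ", "_")


# Illustrative column names (assumed, not read from the dataset).
for name in ["Work/Study Hours", "Have you ever had suicidal thoughts ?", "Sleep - Duration"]:
    print(repr(name), "->", repr(convert_to_snake_case(name)))
# 'Work/Study Hours' -> 'work_study_hours'
# 'Have you ever had suicidal thoughts ?' -> 'have_you_ever_had_suicidal_thoughts'
# 'Sleep - Duration' -> 'sleep___duration'
```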

# PyCaret AutoML with default settings

In [7]:
from pycaret.classification import ClassificationExperiment

experiment = ClassificationExperiment()

# Just filling the required fields here. These are the default settings. I'm not cheating!
experiment.setup(data=train_df, target="depression")

| Description | Value |
| --- | --- |
| Session id | 3895 |
| Target | depression |
| Target type | Binary |
| Original data shape | (140700, 19) |
| Transformed data shape | (140700, 38) |
| Transformed train set shape | (98490, 38) |
| Transformed test set shape | (42210, 38) |
| Ordinal features | 4 |
| Numeric features | 8 |
| Categorical features | 10 |


<pycaret.classification.oop.ClassificationExperiment at 0x12f597750>

In [8]:
top5 = experiment.compare_models(n_select=5)

| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lr | Logistic Regression | 0.9381 | 0.9739 | 0.8059 | 0.8461 | 0.8254 | 0.7878 | 0.7882 | 1.297 |
| lda | Linear Discriminant Analysis | 0.9272 | 0.9677 | 0.7648 | 0.8225 | 0.7925 | 0.7485 | 0.7492 | 0.242 |
| knn | K Neighbors Classifier | 0.9236 | 0.9396 | 0.7832 | 0.7938 | 0.7884 | 0.7418 | 0.7419 | 1.437 |
| ridge | Ridge Classifier | 0.9233 | 0.0 | 0.7029 | 0.8489 | 0.769 | 0.7234 | 0.7281 | 0.178 |
| svm | SVM - Linear Kernel | 0.9223 | 0.0 | 0.8124 | 0.7957 | 0.789 | 0.7424 | 0.7527 | 0.444 |
| et | Extra Trees Classifier | 0.8788 | 0.9622 | 0.3642 | 0.9213 | 0.5214 | 0.4667 | 0.5332 | 1.036 |
| rf | Random Forest Classifier | 0.8253 | 0.8616 | 0.0469 | 0.8476 | 0.0889 | 0.0712 | 0.1739 | 0.982 |
| dummy | Dummy Classifier | 0.8183 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.244 |
| ada | Ada Boost Classifier | 0.8182 | 0.3735 | 0.0001 | 0.0333 | 0.0001 | -0.0 | -0.0005 | 0.662 |
| gbc | Gradient Boosting Classifier | 0.8182 | 0.4307 | 0.0 | 0.0 | 0.0 | -0.0001 | -0.0013 | 2.098 |
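
The Dummy Classifier row is a good reminder of why accuracy alone can mislead on this dataset: a model that never predicts the positive class still scores about 0.82. The sketch below recomputes the headline metrics from a confusion matrix. The counts are made up (chosen so the first case lands on 0.8183 accuracy), and `classification_metrics` is a hypothetical helper, not a PyCaret function.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Precision, recall and F1 are reported as 0.0 when their denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: always predicting "not depressed" on 10,000 rows with 1,817 positives.
always_negative = classification_metrics(tp=0, fp=0, fn=1817, tn=8183)
print(always_negative)  # accuracy 0.8183, everything else 0.0

# Made-up counts for a model that actually finds most positives.
useful_model = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(useful_model)  # accuracy 0.96, precision/recall/F1 all about 0.8
```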



In [13]:
stacked_model = experiment.stack_models(top5, choose_better=True)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.9377 | 0.9731 | 0.7937 | 0.8529 | 0.8222 | 0.7845 | 0.7852 |
| 1 | 0.9431 | 0.9779 | 0.8195 | 0.8608 | 0.8396 | 0.8051 | 0.8055 |
| 2 | 0.935 | 0.9721 | 0.7731 | 0.8553 | 0.8121 | 0.7729 | 0.7744 |
| 3 | 0.9406 | 0.9753 | 0.8067 | 0.858 | 0.8316 | 0.7955 | 0.7961 |
| 4 | 0.9323 | 0.9746 | 0.8911 | 0.7716 | 0.8271 | 0.7852 | 0.7884 |
| 5 | 0.9373 | 0.9721 | 0.8067 | 0.8415 | 0.8237 | 0.7856 | 0.7858 |
| 6 | 0.9321 | 0.9694 | 0.7732 | 0.8403 | 0.8054 | 0.7643 | 0.7653 |
| 7 | 0.9383 | 0.9736 | 0.8542 | 0.815 | 0.8342 | 0.7963 | 0.7966 |
| 8 | 0.9388 | 0.9733 | 0.8039 | 0.851 | 0.8268 | 0.7896 | 0.7901 |
| 9 | 0.9317 | 0.9727 | 0.7346 | 0.8691 | 0.7962 | 0.7555 | 0.7594 |



Original model was better than the stacked model, hence it will be returned. NOTE: The display metrics are for the stacked model (not the original one).


In [14]:
blended_model = experiment.blend_models(top5, choose_better=True)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.9358 | 0.0 | 0.7909 | 0.8458 | 0.8174 | 0.7786 | 0.7792 |
| 1 | 0.9408 | 0.0 | 0.82 | 0.849 | 0.8342 | 0.7982 | 0.7984 |
| 2 | 0.9281 | 0.0 | 0.7328 | 0.8507 | 0.7874 | 0.7444 | 0.7474 |
| 3 | 0.9386 | 0.0 | 0.805 | 0.8491 | 0.8265 | 0.7892 | 0.7896 |
| 4 | 0.9347 | 0.0 | 0.8285 | 0.8153 | 0.8218 | 0.7819 | 0.7819 |
| 5 | 0.9345 | 0.0 | 0.7944 | 0.837 | 0.8151 | 0.7754 | 0.7758 |
| 6 | 0.9268 | 0.0 | 0.7318 | 0.8446 | 0.7842 | 0.7404 | 0.7431 |
| 7 | 0.9336 | 0.0 | 0.8101 | 0.822 | 0.816 | 0.7755 | 0.7755 |
| 8 | 0.9362 | 0.0 | 0.7994 | 0.8418 | 0.8201 | 0.7813 | 0.7817 |
| 9 | 0.9332 | 0.0 | 0.7765 | 0.8434 | 0.8086 | 0.7682 | 0.7692 |



Original model was better than the blended model, hence it will be returned. NOTE: The display metrics are for the blended model (not the original one).
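
For intuition about what blending does, here is a minimal sketch of hard (majority) voting over the label predictions of several models. It illustrates the general idea only and is not PyCaret's implementation. The function name and the predictions are made up.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote per row.

    predictions_per_model: list of equal-length label lists, one list per model.
    """
    voted = []
    # zip(*...) regroups the predictions row by row instead of model by model.
    for row_votes in zip(*predictions_per_model):
        label, _count = Counter(row_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Made-up predictions from five models on four rows.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
print(hard_vote(preds))  # [1, 0, 1, 0]
```

With an odd number of models and two classes, a tie cannot occur, which is one reason to pick five models rather than four.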


In [18]:
def get_submission_df(model):
    """
    Generate the dataframe for submission to Kaggle Depression Prediction Challenge.
    """

    predictions = experiment.predict_model(model, data=test_df)
    submission_df["depression"] = predictions["prediction_label"]
    return submission_df

In [19]:
best_model = top5[0]  # Logistic regression provided the best Accuracy
submission = get_submission_df(best_model)
submission.to_csv(output_path / "submission.csv")

## Conclusion of PyCaret AutoML with default settings experiment

PyCaret AutoML with default settings gave me an accuracy of 0.94067 on the test set. This was better than the H2O AutoML model that I tried out. It put me at rank 1320 on the public leaderboard (out of 2313 submissions), which is roughly the top 57 percent. Let's see whether Gemini beats this score.
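
For reference, the percentile arithmetic is spelled out below, along with the rank that would be needed to reach the top 25 percent I guessed at earlier. The helper names are hypothetical.

```python
import math


def top_percent(rank: int, total: int) -> float:
    """Share of the leaderboard at or above this rank, in percent."""
    return 100 * rank / total


def rank_needed(percent: float, total: int) -> int:
    """Worst rank that still falls within the given top percentage."""
    return math.floor(total * percent / 100)


print(round(top_percent(1320, 2313), 2))  # 57.07
print(rank_needed(25, 2313))              # 578
```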

# PyCaret AutoML tuned by Gemini