# Introduction

## Context

I was looking into the latest Kaggle competitions and two of them caught my eye. The first is the Playground Series competition on depression detection, and the other is the Gemini Long Context use-case competition. Binary classification is a common problem in machine learning, so I thought: why not try to tackle both in one go? Here is how I went about it.

## More context

The context window is the number of tokens a model can take into account at once, and tokens are the pieces of text (words, sub-words, or characters) that make up the input. Gemini stands out for its very large context windows: Gemini 1.5 Flash comes with a 1 million token context window and Gemini 1.5 Pro comes with a 2 million token context window.
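
To get a feel for whether a file fits in the window before uploading it, a rough character-based estimate is often enough. The sketch below assumes the commonly quoted rule of thumb of about four characters per token; `estimate_tokens` and `fits_in_window` are hypothetical helpers, and the real count depends on the model's tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed rule of thumb."""
    return round(len(text) / chars_per_token)


def fits_in_window(text: str, window_tokens: int = 1_000_000) -> bool:
    """Compare the estimate against a context window size (1M tokens assumed)."""
    return estimate_tokens(text) <= window_tokens


# Made-up CSV content standing in for a dataset sample.
csv_text = "id,age,depression\n" + "1,25,0\n" * 10_000
print(estimate_tokens(csv_text), fits_in_window(csv_text))
```

For an exact number, the API's own token counting is the better source; this is only a quick sanity check.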

However, filling the context window with large documents (or datasets, in our case) on every request can be expensive and slow. Fortunately, Gemini also provides context caching: the context is sent once, stored on Gemini's side, and reused by later requests, so each subsequent request only needs to send its new tokens. Cheaper and faster!
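
A back-of-the-envelope comparison shows why this matters for a multi-turn chat. The sketch below uses a deliberately simplified billing model: `cached_rate` is a placeholder assumption rather than Gemini's actual price, and cache storage fees are ignored.

```python
def billed_input_tokens(context_tokens, prompt_tokens, n_requests,
                        cached=False, cached_rate=0.25):
    """Return input tokens billed, expressed in full-price equivalents.

    cached_rate is an assumed fraction of the normal price charged for
    cached tokens; check the current pricing page for the real value.
    """
    if not cached:
        # The full context is resent and billed on every request.
        return n_requests * (context_tokens + prompt_tokens)
    # Assumption: the context is billed once at full price when the cache
    # is created, then at the discounted rate on each request that reuses it.
    return (context_tokens
            + n_requests * (context_tokens * cached_rate + prompt_tokens))


without_cache = billed_input_tokens(200_000, 500, n_requests=5)
with_cache = billed_input_tokens(200_000, 500, n_requests=5, cached=True)
print(without_cache, with_cache)
```

The gap widens as the number of follow-up requests grows, which is the situation in a chat that keeps asking questions about the same files.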

## How did I go about solving the problem?

Usually, when we want to quickly dip our toes in and get a sense of a common ML problem, like binary classification, we use AutoML libraries. One popular AutoML library is PyCaret. PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment. However, to get better results, we need to tweak the arguments of the PyCaret functions. For example, there are multiple choices to make during the data preparation and feature engineering steps. Similarly, after training a bunch of models, we have to decide whether to ensemble them and which type of ensemble to use.

We can either do all of this manually, or give Gemini all the information as context and let it do the heavy lifting for us. This is where context caching comes in handy. So, I extracted the relevant documentation from the PyCaret docs, sliced the dataset into a smaller CSV file, and uploaded everything to Gemini. As this context can be cached, I can reuse it for future requests and let Gemini configure the training of the binary classification model for me.


## What is the expected outcome?

I am just curious what the leaderboard score will be in both competitions :D

My best guess is that the model will perform better than a base PyCaret AutoML model with default settings. However, I'm sure it won't beat the top models on the leaderboard (it will probably land somewhere in the top 25%). But hey, it's worth a shot!

# Setup

In [1]:
# Constants

SEED = 42
MODEL = "models/gemini-1.5-flash-002"

In [2]:
import os

import google.generativeai as genai
from fastkaggle.core import iskaggle

from rich import print

In [3]:
if iskaggle:
    from kaggle_secrets import UserSecretsClient

    GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
else:
    from dotenv import load_dotenv

    load_dotenv()

    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

genai.configure(api_key=GOOGLE_API_KEY)

In [4]:
from pathlib import Path

dataset_path = Path("/kaggle/input/playground-series-s4e11")
output_path = Path("/kaggle/working")

if not iskaggle:
    import kagglehub

    dataset_path = kagglehub.competition_download("playground-series-s4e11")
    dataset_path = Path(dataset_path)
    output_path = Path(dataset_path)

train_csv_path = dataset_path / "train.csv"
test_csv_path = dataset_path / "test.csv"
submission_csv_path = dataset_path / "sample_submission.csv"

# Loading dataset

In [5]:
import pandas as pd

train_df = pd.read_csv(train_csv_path, index_col=0)
test_df = pd.read_csv(test_csv_path, index_col=0)
submission_df = pd.read_csv(submission_csv_path, index_col=0)

In [6]:
import re


def convert_to_snake_case(s):
    """
    Convert a string to snake_case.
    """

    s = re.sub(r"[^\w\s]", " ", s)
    return s.lower().strip().replace(" ", "_")


train_df.columns = [convert_to_snake_case(col) for col in train_df.columns]
test_df.columns = [convert_to_snake_case(col) for col in test_df.columns]
submission_df.columns = [convert_to_snake_case(col) for col in submission_df.columns]
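
To make the renaming concrete, here is the helper applied to a few headers in the style of this dataset. `"Sleep - Duration"` is a made-up header used only to show an edge case, and `convert_collapsing_whitespace` is a hypothetical variant, not part of the notebook.

```python
import re


def convert_to_snake_case(s):
    # Same logic as the notebook helper: punctuation -> space, then spaces -> "_".
    s = re.sub(r"[^\w\s]", " ", s)
    return s.lower().strip().replace(" ", "_")


def convert_collapsing_whitespace(s):
    # Hypothetical variant: collapse each run of whitespace into a single "_".
    s = re.sub(r"[^\w\s]", " ", s)
    return re.sub(r"\s+", "_", s.strip().lower())


print(convert_to_snake_case("Work/Study Hours"))
print(convert_to_snake_case("Have you ever had suicidal thoughts ?"))
print(convert_to_snake_case("Sleep - Duration"))
print(convert_collapsing_whitespace("Sleep - Duration"))
```

The original helper is enough for the column names that appear in this notebook; the variant only matters when punctuation sits between spaces, which would otherwise leave repeated underscores.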

# PyCaret AutoML with default settings

In [33]:
from pycaret.classification import ClassificationExperiment

experiment = ClassificationExperiment()

# Just filling the required fields here. These are the default settings. I'm not cheating!
experiment.setup(data=train_df, target="depression")

Unnamed: 0,Description,Value
0,Session id,4441
1,Target,depression
2,Target type,Binary
3,Original data shape,"(140700, 19)"
4,Transformed data shape,"(140700, 35)"
5,Transformed train set shape,"(98490, 35)"
6,Transformed test set shape,"(42210, 35)"
7,Ordinal features,4
8,Numeric features,8
9,Categorical features,10


<pycaret.classification.oop.ClassificationExperiment at 0x3d2997390>

In [34]:
top5 = experiment.compare_models(n_select=5)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.9372,0.9734,0.8034,0.8435,0.8229,0.7847,0.7851,1.3
svm,SVM - Linear Kernel,0.9319,0.0,0.7735,0.8463,0.8042,0.7633,0.7672,0.404
lda,Linear Discriminant Analysis,0.9259,0.967,0.76,0.8194,0.7886,0.7438,0.7445,0.582
knn,K Neighbors Classifier,0.9229,0.9383,0.7812,0.7917,0.7863,0.7393,0.7393,1.266
ridge,Ridge Classifier,0.9221,0.0,0.6982,0.8465,0.7651,0.719,0.7238,0.157
et,Extra Trees Classifier,0.8753,0.9613,0.3436,0.9208,0.4995,0.445,0.5164,0.927
nb,Naive Bayes,0.8309,0.9238,0.9206,0.5277,0.6667,0.5659,0.6086,0.177
rf,Random Forest Classifier,0.8252,0.8636,0.0461,0.8477,0.0874,0.07,0.1724,0.946
ada,Ada Boost Classifier,0.8183,0.6146,0.0,0.0,0.0,-0.0001,-0.0014,0.645
gbc,Gradient Boosting Classifier,0.8183,0.3948,0.0,0.0,0.0,0.0,0.0,2.263



In [35]:
stacked_model = experiment.stack_models(top5, choose_better=True)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9343,0.9715,0.8603,0.7949,0.8263,0.7859,0.7868
1,0.9332,0.9687,0.7848,0.8372,0.8102,0.7697,0.7703
2,0.9382,0.9733,0.796,0.8537,0.8238,0.7864,0.7871
3,0.9374,0.9713,0.7966,0.8493,0.8221,0.7842,0.7848
4,0.9391,0.9746,0.7944,0.8597,0.8258,0.7889,0.7898
5,0.9364,0.9737,0.7916,0.8485,0.8191,0.7806,0.7813
6,0.9388,0.975,0.7983,0.8552,0.8258,0.7887,0.7894
7,0.9377,0.9712,0.8223,0.8326,0.8274,0.7894,0.7894
8,0.9376,0.9745,0.8179,0.8351,0.8264,0.7884,0.7884
9,0.9376,0.9751,0.7994,0.8483,0.8231,0.7852,0.7858



Original model was better than the stacked model, hence it will be returned. NOTE: The display metrics are for the stacked model (not the original one).


In [36]:
blended_model = experiment.blend_models(top5, choose_better=True)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9312,0.0,0.7887,0.8247,0.8063,0.7645,0.7647
1,0.9314,0.0,0.7798,0.8318,0.805,0.7634,0.764
2,0.9367,0.0,0.7747,0.863,0.8165,0.7784,0.7801
3,0.9294,0.0,0.7581,0.8382,0.7961,0.7536,0.755
4,0.9308,0.0,0.7486,0.8524,0.7971,0.7556,0.7579
5,0.9328,0.0,0.7654,0.8499,0.8054,0.7649,0.7665
6,0.9339,0.0,0.7754,0.8479,0.81,0.7701,0.7713
7,0.9363,0.0,0.7955,0.8451,0.8196,0.781,0.7815
8,0.9358,0.0,0.7994,0.8398,0.8191,0.7801,0.7805
9,0.9353,0.0,0.7961,0.8397,0.8173,0.7781,0.7785



Original model was better than the blended model, hence it will be returned. NOTE: The display metrics are for the blended model (not the original one).


In [37]:
def get_submission_df(model):
    """
    Generate the dataframe for submission to Kaggle Depression Prediction Challenge.
    """

    predictions = experiment.predict_model(model, data=test_df)
    submission_df["depression"] = predictions["prediction_label"]
    return submission_df

In [38]:
best_model = top5[0]  # Logistic regression provided the best Accuracy
submission = get_submission_df(best_model)
submission.to_csv(output_path / "submission.csv")

## Conclusion of PyCaret AutoML with default settings experiment

PyCaret AutoML with default settings gave me an accuracy of 0.94067 on the test set. This was better than the H2O AutoML model that I tried out. It put me at rank 1320 on the public leaderboard (out of 2313 submissions), which is roughly the 57th percentile from the top (1320 / 2313 ≈ 0.57). Let's see whether Gemini beats this score.

# PyCaret AutoML tuned by Gemini

## Utility functions

In [7]:
import enum

# Converting column names as Enum for structured output.
ColumnEnums = enum.Enum("ColumnEnums", {col: col for col in train_df.columns})


# Temporary Fix for TypedDict structured response issue in genai library: https://github.com/google-gemini/generative-ai-python/issues/560
def get_dict_schema(response_schema: type) -> dict:
    config = genai.GenerationConfig(response_schema=response_schema)
    config = genai.types.generation_types.to_generation_config_dict(config)
    schema = config["response_schema"]
    schema.required = list(response_schema.__required_keys__)
    return schema
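
The reason for turning column names into an Enum, as I understand it, is that a structured-output schema can then restrict the model to names that really exist in the dataset. The sketch below shows only the plain-Python side of that idea, with a short stand-in list instead of `train_df.columns`.

```python
import enum

columns = ["age", "gender", "depression"]  # stand-in for train_df.columns

# Functional API: the mapping supplies each member's name and value.
ColumnEnums = enum.Enum("ColumnEnums", {col: col for col in columns})

print([member.value for member in ColumnEnums])
print(ColumnEnums("age") is ColumnEnums.age)

try:
    ColumnEnums("income")  # not a column in the list, so the lookup fails
except ValueError as err:
    print("rejected:", err)
```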


## Uploading context files to Gemini

Let's take a sample of the dataset and upload it to Gemini. We'll also upload the relevant documentation from the PyCaret docs.

Gemini can use these context files to learn about the dataset and the PyCaret functions.

The PyCaret documentation files are just snapshots of the relevant sections of the PyCaret docs:
* Data Preparation - https://pycaret.gitbook.io/docs/get-started/preprocessing/data-preparation
* Scale and Transform - https://pycaret.gitbook.io/docs/get-started/preprocessing/scale-and-transform
* Feature Engineering - https://pycaret.gitbook.io/docs/get-started/preprocessing/feature-engineering
* Feature selection - https://pycaret.gitbook.io/docs/get-started/preprocessing/feature-selection

In [8]:
sample_df = train_df.sample(10_000, random_state=SEED)
sample_df.to_csv(output_path / "sample.csv", index=False)
sample_df.head(n=2)

id,name,gender,age,city,working_professional_or_student,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,have_you_ever_had_suicidal_thoughts,work_study_hours,financial_stress,family_history_of_mental_illness,depression
18347,Sanya,Female,51.0,Patna,Working Professional,Teacher,,3.0,,,5.0,More than 8 hours,Moderate,B.Ed,No,11.0,2.0,Yes,0
96193,Sneha,Female,20.0,Agra,Working Professional,,,1.0,,,4.0,Less than 5 hours,Moderate,Class 12,No,0.0,5.0,Yes,0


In [None]:
csv_file = genai.upload_file(output_path / "sample.csv", mime_type="text/csv")
data_preparation_file = genai.upload_file(
    "experiment/Data_Preparation.md", mime_type="text/markdown"
)
feature_engineering_file = genai.upload_file(
    "experiment/Feature Engineering.md", mime_type="text/markdown"
)
scale_and_transform_file = genai.upload_file(
    "experiment/Scale and Transform.md", mime_type="text/markdown"
)
feature_selection_file = genai.upload_file(
    "experiment/Feature Selection.md", mime_type="text/markdown"
)

# Context caching & saving 💰

Now, let's use the uploaded files to create a cached context in Gemini. This lets us reuse the context in future requests and save some money on repeated input tokens. We'll also add a system instruction to the cache to tell Gemini which role to play and which files it has access to.

In [None]:
import textwrap
from datetime import timedelta

from google.generativeai import caching

cache = caching.CachedContent.create(
    model=MODEL,
    display_name="Data scientist for Depression Prediction",
    system_instruction=textwrap.dedent(
        f"""You are a highly skilled and experienced data scientist specializing in Python-based machine learning solutions. You are adept at leveraging automated tools and libraries to streamline the data science workflow. You are proficient in:

            * **Domain knowledge:** You are familiar with the task of predicting depression based on various features.
            * **Data analysis:** You can effectively analyze data based on the CSV file you have access to.
            * **Automated feature engineering:** You have expertise in utilizing the `pycaret` library to automatically generate relevant features from raw data.
            * **Automated machine learning:** You are skilled in using the `pycaret` library to automate the process of model selection, training, and evaluation. You can effectively use this library to identify the best-performing machine learning algorithm for a given dataset and task.
            * **Programming languages and tools:** You are fluent in Python and familiar with relevant libraries like `pycaret`. 

            **When responding to user requests, adhere to the following principles:**

            * **Data-driven approach:** Base your analysis and recommendations on the CSV file you have access to and avoid making assumptions or drawing conclusions without sufficient evidence.
            * **Ethical considerations:** Be mindful of potential biases in the data and ensure your analysis and models are fair and unbiased.
            * **Provide actionable insights:** Focus on delivering insights that the user can act upon to solve their problem or make informed decisions.

            **Workflow:**
            
            1. **Understand the Problem:** Use the provided CSV and run analysis to understand the problem of predicting depression based on various features.

            2. **Setup Experiment with Pycaret:** Define the required parameters and setup the experiment using the `pycaret` library.

            3. **Model Training and Evaluation with Pycaret:** Leverage the `pycaret` library to automate the machine learning pipeline.  Initialize the `pycaret` setup, specifying the target variable and any preprocessing steps. Compare various models, tune hyperparameters, and evaluate performance metrics. Select the best-performing model based on the specific problem and desired outcome.

            4. **Interpretation and Communication:**  Interpret the results of the model and communicate the findings in a clear and concise manner. Explain the model's predictions, feature importance, and potential limitations.
            
            You are provided with the following files:
            
            * `{csv_file.name}`: A sample dataset for analysis.
            * `{data_preparation_file.name}`: A markdown file containing information about PyCaret data preparation.
            * `{feature_engineering_file.name}`: A markdown file containing information about PyCaret feature engineering.
            * `{scale_and_transform_file.name}`: A markdown file containing information about PyCaret scaling and transformation.
            * `{feature_selection_file.name}`: A markdown file containing information about PyCaret feature selection.
                        
            """
    ),
    contents=[
        csv_file,
        data_preparation_file,
        feature_engineering_file,
        scale_and_transform_file,
        feature_selection_file,
    ],
    ttl=timedelta(minutes=30),
    tools="code_execution",
)
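
One detail worth checking in the cell above: `textwrap.dedent` only strips whitespace that is common to every non-blank line. Because the system instruction starts right after the opening quotes, its first line has no indentation, so nothing is removed and the bullets keep their leading spaces. The short strings below are stand-ins for the real prompt.

```python
import textwrap

# Text starts right after the opening quotes, so the first line has no indent.
inline = textwrap.dedent(
    """You are a data scientist.
        * Domain knowledge
        * Data analysis"""
)

# A backslash after the quotes lets every line share the same indent.
aligned = textwrap.dedent(
    """\
        You are a data scientist.
        * Domain knowledge
        * Data analysis"""
)

print(repr(inline.splitlines()[1]))
print(repr(aligned.splitlines()[1]))
```

The model will most likely cope with the extra indentation either way; the second form just sends fewer whitespace characters.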


# Let's create our own Gemini Data science assistant

We'll use the cached context to create a model in Gemini. This model will know about the dataset and will be able to execute code to understand it better. We'll then chat with this "Data scientist" Gemini to get good parameters for the PyCaret AutoML model.

In [11]:
from google.api_core import retry

retry_policy = {"retry": retry.Retry(predicate=retry.if_transient_error)}

model = genai.GenerativeModel.from_cached_content(cached_content=cache)

In [12]:
chat = model.start_chat()

Let's first ask the model to understand the dataset and create some history so it builds context around the dataset.

In [None]:
response = chat.send_message(
    textwrap.dedent(
        """
        You are a highly skilled and experienced data scientist specializing in Python-based machine learning solutions. You are adept at leveraging automated tools and libraries to streamline the data science workflow.
        
        Based on the files you have access to, analyze the data and provide insights on all the columns you have.
        """
    )
)

In [14]:
from IPython.display import Markdown

Markdown(response.text)

My analysis of the provided dataset focuses on understanding the characteristics of each column and their potential relevance to predicting depression.  Because the data is not fully cleaned and contains many missing values, I will focus my analysis on descriptive statistics and visualizations for the available data, rather than detailed inferential statistics.  A robust imputation strategy would be crucial before advanced statistical analyses and modeling.

**Column-wise Insights:**

1.  **name:** This column is an identifier for each individual. It's not directly useful for prediction but could be used for tracking individual records if needed.

2.  **gender:** A categorical variable (Male/Female).  It's important to examine if there are significant differences in depression rates between genders in this dataset.  This is a potential predictor variable.

3.  **age:** A numerical variable representing the age of individuals. Age is often correlated with mental health, and its distribution and relationship with depression needs further examination.

4.  **city:** A categorical variable representing the city of residence.  Given the large number of cities likely represented, it might be less informative unless there are distinct geographic patterns related to mental health that could be investigated.

5.  **working\_professional\_or\_student:** A categorical variable (Working Professional/Student). This is an important factor that likely influences stress levels and potentially depression. It's a key predictor.

6.  **profession:** A categorical variable.  It has many missing values and a potentially large number of unique values. Analysis needs to determine if profession is a meaningful predictor after imputation of missing values.

7.  **academic\_pressure:** A numerical variable representing academic pressure. This is a significant predictor for students and can be analyzed for correlation with depression. Many missing values need to be addressed.

8.  **work\_pressure:** A numerical variable representing work pressure. This is a significant predictor for working professionals and can be analyzed for correlation with depression. It contains many missing values.

9.  **cgpa:** A numerical variable representing CGPA (Cumulative Grade Point Average).  This is relevant only for students and needs to be considered along with academic pressure and study satisfaction.  There are many missing values to handle.

10. **study\_satisfaction:** A numerical variable. Relevant only for students, and should be analyzed for correlation with depression and CGPA.  It contains missing values.

11. **job\_satisfaction:** A numerical variable. Relevant only for working professionals and should be analyzed for correlation with depression and work pressure.  It contains many missing values.

12. **sleep\_duration:** A categorical variable (with ranges of hours). Sleep duration is strongly associated with mental health.  Its distribution and relationship with depression should be investigated.

13. **dietary\_habits:** A categorical variable (Healthy/Moderate/Unhealthy). Diet can impact mental wellbeing.  Analyzing its distribution and relationship with depression is important.

14. **degree:** A categorical variable (with different degree types). The type of degree pursued could correlate with academic pressure or professional field, potentially influencing depression.  It is important to examine the distribution of degree types.

15. **have\_you\_ever\_had\_suicidal\_thoughts:** A binary categorical variable (Yes/No). This is a critical variable indicative of severe mental distress and a strong predictor for depression.

16. **work\_study\_hours:** A numerical variable representing the number of work or study hours.  This is a crucial predictor, as excessive hours are known to increase stress.

17. **financial\_stress:** A numerical variable.  Financial stress significantly impacts mental health; its distribution and relationship with depression should be thoroughly examined.

18. **family\_history\_of\_mental\_illness:** A binary categorical variable (Yes/No). Family history of mental illness is a well-established risk factor for depression and a key predictor.

19. **depression:** This is the target variable (0/1), indicating the presence (1) or absence (0) of depression.  Its distribution (class imbalance) needs to be determined to ensure balanced training data.


**Next Steps and Recommendations:**

Before building predictive models, it's crucial to:

1.  **Handle Missing Values:** Implement an appropriate imputation strategy using PyCaret's `setup` function (as documented in the provided markdown files), selecting between `simple` or `iterative` imputation based on the characteristics of the missing data.  Consider using techniques like k-NN imputation for numerical features and mode imputation for categorical features.

2.  **Address Data Type Issues:**  Carefully review the inferred data types.  Correct them in the PyCaret `setup` if necessary using `numeric_features`, `categorical_features`, `date_features` parameters.  Pay particular attention to ensuring consistent encoding of sleep duration.

3.  **One-Hot Encoding and Ordinal Encoding:** Apply One-Hot Encoding to nominal categorical variables and Ordinal Encoding to ordinal variables (like sleep duration, dietary habits) within PyCaret's automated pipeline using the `max_encoding_ohe`, `encoding_method`, and `ordinal_features` parameters.

4.  **Handle Class Imbalance:** Check for class imbalance in the `depression` variable and apply techniques like SMOTE (or other suitable methods) via PyCaret's `fix_imbalance` and `fix_imbalance_method` parameters.

5.  **Feature Engineering:**  Explore automatic feature engineering options in PyCaret (polynomial features, group features, binning) to see if they improve model performance. Carefully consider the application and interpretation of the generated features.  This may involve using the `polynomial_features`, `polynomial_degree`, `group_features`, `group_names`, and `bin_numeric_features` parameters.

6.  **Feature Selection:**  Use PyCaret's automatic feature selection (via `feature_selection`, `feature_selection_method`, `feature_selection_estimator`, `n_features_to_select` parameters) to identify the most relevant predictors.

7.  **Model Training and Evaluation:** Employ PyCaret's automated machine learning capabilities to compare different models, tune hyperparameters, and evaluate performance using appropriate metrics (e.g., AUC, precision, recall, F1-score).  You can use PyCaret's `compare_models` and `tune_model` functions to automate this process.



By following this structured approach, you will obtain more reliable insights into the relationship between the features and depression and build a more robust and accurate predictive model. Remember to document and justify each step in your data preparation and model building process for transparency and reproducibility.


This is great! It looks like the model has a good understanding of the dataset and is making some good suggestions. Now let's follow up and ask the model for parameters to set up the PyCaret AutoML experiment.

Why did I choose the parameters below and not all of them? Honestly, I just wanted to see how the model performs with these. I could have asked the model for every parameter, but I wanted to keep it simple and fairly straightforward. Since many parameters depend on others (for example, `n_features_to_select` only applies when `feature_selection` is True), I chose a subset that made sense to start with. In the future, I would like to experiment with more parameters and see how they affect model performance.
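
Dependencies like these can also be enforced in code before anything reaches `setup()`. The sketch below is one possible guard, not part of the original workflow: `DEPENDS_ON` and `drop_inactive_parameters` are hypothetical names, and the mapping lists only the two pairs mentioned in this notebook.

```python
# Dependent parameter -> the switch that must be enabled for it to apply.
# Only the pairs mentioned in this notebook are listed; extend as needed.
DEPENDS_ON = {
    "n_features_to_select": "feature_selection",
    "polynomial_degree": "polynomial_features",
}


def drop_inactive_parameters(params: dict) -> dict:
    """Return a copy of params without options whose switch is off or missing."""
    cleaned = dict(params)
    for dependent, switch in DEPENDS_ON.items():
        if dependent in cleaned and not cleaned.get(switch, False):
            del cleaned[dependent]
    return cleaned


# Made-up suggestion for illustration.
suggested = {"polynomial_features": False, "polynomial_degree": 3, "normalize": True}
print(drop_inactive_parameters(suggested))
```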

In [None]:
import json
from typing import List
from typing_extensions import TypedDict, NotRequired


class DataPreparationSchema(TypedDict):
    numeric_features: List[ColumnEnums]  # type: ignore
    categorical_features: List[ColumnEnums]  # type: ignore
    ignore_features: List[ColumnEnums]  # type: ignore
    fix_imbalance: bool
    remove_outliers: bool
    imputation_type: str


response = chat.send_message(
    textwrap.dedent(
        f"""
        Now let's prepare the data for the binary classification task.
        
        You are provided with a CSV file {csv_file.name}. This file contains a header row and uses commas as delimiters. The data will be used for a binary classification task in Pycaret, an AutoML library in Python. To prepare the data using the `setup()` function, analyse the data using code execution tool and then based on the analysis, generate the following parameters in JSON format:
        
        Remember that we are performing a binary classification to predict the depression target variable based on various features. You can use the information from the provided markdown file {data_preparation_file.name} to guide you in this task.
        
        Generate the following parameters for data preparation step in PyCaret:

        * **`numeric_features`:**  A list of column names with numeric features.
        * **`categorical_features`:** A list of column names with categorical features.
        * **`ignore_features`:** A list of column names to be ignored during model training. These features might be irrelevant to the target variable in this case 'depression' column, redundant with other features, or could introduce data leakage.
        * **`fix_imbalance`:**  A boolean value indicating whether to handle class imbalance. If true, use oversampling to address the imbalance.
        * **`remove_outliers`:** A boolean value indicating whether to remove outliers.
        * **`imputation_type`:** The type of imputation to use for missing values. Choose between 'simple' (mean/median imputation) or 'iterative' (model-based imputation, where each feature's missing values are predicted from the other features).

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "numeric_features": ["age", "income", "credit_score"],
            "categorical_features": ["gender", "education", "city"],
            "ignore_features": ["customer_id", "date"],
            "fix_imbalance": true,
            "remove_outliers": true,
            "imputation_type": "iterative" 
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(DataPreparationSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

data_preparation_parameters = json.loads(response.text)
print(data_preparation_parameters)
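
The schema types `imputation_type` as a plain `str`, so it may be worth checking the parsed reply before trusting it. The following is a minimal sketch of such a check; `validate_data_preparation` is a hypothetical helper and the reply is made up for illustration, not actual Gemini output.

```python
import json

ALLOWED_IMPUTATION = {"simple", "iterative"}


def validate_data_preparation(raw_json: str, columns: list[str]) -> dict:
    """Parse the model's JSON reply and check it against the real columns."""
    params = json.loads(raw_json)

    for key in ("numeric_features", "categorical_features", "ignore_features"):
        unknown = set(params.get(key, [])) - set(columns)
        if unknown:
            raise ValueError(f"{key} contains unknown columns: {sorted(unknown)}")

    overlap = set(params["numeric_features"]) & set(params["categorical_features"])
    if overlap:
        raise ValueError(f"listed as both numeric and categorical: {sorted(overlap)}")

    if params["imputation_type"] not in ALLOWED_IMPUTATION:
        raise ValueError(f"unexpected imputation_type: {params['imputation_type']!r}")

    return params


# Hypothetical reply, made up for illustration.
reply = """{"numeric_features": ["age"], "categorical_features": ["gender"],
"ignore_features": ["name"], "fix_imbalance": false,
"remove_outliers": false, "imputation_type": "simple"}"""

checked = validate_data_preparation(reply, ["name", "gender", "age", "depression"])
print(checked["ignore_features"])
```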

We can have follow-up conversations with the model to get parameters for the other steps of the PyCaret AutoML setup without resending the uploaded files each time. Gemini uses the cached context to understand the dataset, and tokens served from the cache are billed at a reduced rate instead of the full input price on every request.

In [None]:
class ScaleAndTransformSchema(TypedDict):
    normalize: bool
    transformation: bool


result = chat.send_message(
    textwrap.dedent(
        f"""
        Let's decide on the scaling and transformation parameters for the data.
        
        You are provided with a CSV file {csv_file.name}. This file contains a header row and uses commas as delimiters. The data will be used for a binary classification task in Pycaret, an AutoML library in Python. To prepare the data using the `setup()` function, analyse the data using code execution tool and then based on the analysis, generate the following parameters in JSON format:
        
        Remember that we are performing a binary classification to predict the depression target variable based on various features. You can use the information from the provided markdown file {scale_and_transform_file.name} to guide you in this task.
        
        Generate the following parameters for scaling and transformation step in PyCaret:

        * normalize: A boolean value indicating whether to normalize the data. If true, the data will be scaled to have a mean of 0 and a standard deviation of 1.
        * transformation: A boolean value indicating whether to apply a transformation to the data. If true, the data will be transformed using a power transformation.

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "normalize": true,
            "transformation": true
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(ScaleAndTransformSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

scale_and_transform_parameters = json.loads(result.text)
print(scale_and_transform_parameters)


In [None]:
class FeatureEngineeringSchema(TypedDict):
    polynomial_features: bool
    polynomial_degree: NotRequired[int]
    group_features: NotRequired[List[ColumnEnums]]  # type: ignore
    bin_numeric_features: NotRequired[List[ColumnEnums]]  # type: ignore
    rare_to_value: NotRequired[float]


result = chat.send_message(
    textwrap.dedent(
        f"""
        Let's decide on the feature engineering parameters for the data.
        
        You are provided with a CSV file {csv_file.name}. This file contains a header row and uses commas as delimiters. The data will be used for a binary classification task in Pycaret, an AutoML library in Python. To prepare the data using the `setup()` function, analyse the data using code execution tool and then based on the analysis, generate the following parameters in JSON format:
                
        Remember that we are performing a binary classification to predict the depression target variable based on various features. You can use the information from the provided markdown file {feature_engineering_file.name} to guide you in this task.
        
        Generate the following parameters for feature engineering step in PyCaret:

        * polynomial_features: A boolean value indicating whether to generate polynomial features. If true, polynomial features will be created based on the specified degree.
        * polynomial_degree: An integer specifying the degree of polynomial features to generate. This parameter is required if polynomial_features is set to true.
        * group_features: A list of column names to group together for feature engineering. This parameter is optional. If provided, the features in the list will be grouped together for feature engineering.
        * bin_numeric_features: A list of column names with numeric features to bin into discrete intervals. This parameter is optional. If provided, the numeric features will be binned into discrete intervals.
        * rare_to_value: A float value specifying the threshold for rare categories. Categories with a frequency less than this threshold will be replaced with a specified value. This parameter is optional and only applicable to categorical features.
        

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "polynomial_features": true,
            "polynomial_degree": 2,
            "group_features": ["age", "income"],
            "bin_numeric_features": ["credit_score"],
            "rare_to_value": 0.01
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(FeatureEngineeringSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

feature_engineering_parameters = json.loads(result.text)
print(feature_engineering_parameters)
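
To make `rare_to_value` more concrete, here is a minimal standalone sketch of the idea. It is not PyCaret's implementation: categories whose share of the rows falls below the threshold are collapsed into a single placeholder label. The helper name `collapse_rare_categories`, the placeholder `"rare"` and the sample column are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def collapse_rare_categories(
    values: List[str], rare_to_value: float, placeholder: str = "rare"
) -> List[str]:
    """Replace categories whose relative frequency is below the threshold."""
    counts = Counter(values)
    total = len(values)
    return [
        value if counts[value] / total >= rare_to_value else placeholder
        for value in values
    ]


# Hypothetical column: "pilot" appears once in 20 rows (5% of the data).
professions = ["teacher"] * 12 + ["chef"] * 7 + ["pilot"]
collapsed = collapse_rare_categories(professions, rare_to_value=0.1)
print(Counter(collapsed))
```

With a threshold of 0.1, only the single "pilot" row is relabelled, which keeps one-hot encoding from creating a column for a category that is almost never seen.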

In [None]:
class FeatureSelectionSchema(TypedDict):
    remove_multicollinearity: NotRequired[bool]
    low_variance_threshold: NotRequired[float]


result = chat.send_message(
    textwrap.dedent(
        f"""
        Finally, on experiment setup, let's decide on the feature selection parameters for the data.
        
        You are provided with a CSV file {csv_file.name}. This file contains a header row and uses commas as delimiters. The data will be used for a binary classification task in PyCaret, an AutoML library in Python. To prepare the data using the `setup()` function, analyse the data using the code execution tool and then, based on the analysis, generate the following parameters in JSON format:
        
        Remember that we are performing binary classification to predict the depression target variable based on various features. You can use the information from the provided markdown file {feature_selection_file.name} to guide you in this task.
        
        Generate the following parameters for feature selection step in PyCaret:

        * remove_multicollinearity: A boolean value indicating whether to remove multicollinear features. If true, multicollinear features will be removed. This parameter is optional.
        * low_variance_threshold: A float value specifying the threshold for low variance features. Features with a variance less than this threshold will be removed. This parameter is optional.
        

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "remove_multicollinearity": true,
            "low_variance_threshold": 0.01
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(FeatureSelectionSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

feature_selection_parameters = json.loads(result.text)
print(feature_selection_parameters)
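
As a rough illustration of what `low_variance_threshold` asks for, the sketch below drops numeric columns whose variance does not exceed a threshold. It uses only the standard library and is not PyCaret's own code; the helper name `drop_low_variance` and the sample features are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def drop_low_variance(
    columns: Dict[str, List[float]], threshold: float
) -> Dict[str, List[float]]:
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


# Hypothetical numeric features; "constant_flag" never changes.
features = {
    "age": [18.0, 25.0, 33.0, 41.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
}
kept = drop_low_variance(features, threshold=0.01)
print(list(kept))
```

A column that barely varies carries almost no signal for the classifier, so removing it mostly reduces noise and training time.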


Now that we have the suggested parameters from Gemini, let's set up the PyCaret AutoML experiment with the suggested parameters and see how it performs.
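
One thing to watch when combining several parameter dictionaries in a single call: Python raises a `TypeError` if two of the unpacked dictionaries contain the same key. A small helper can surface such conflicts before the experiment is set up. This is only a sketch; `merge_setup_kwargs` and the sample dictionaries are hypothetical names, not part of PyCaret or of the notebook so far.

```python
from typing import Any, Dict


def merge_setup_kwargs(*param_groups: Dict[str, Any]) -> Dict[str, Any]:
    """Merge parameter dicts, failing early if two groups set the same key."""
    merged: Dict[str, Any] = {}
    for group in param_groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"Parameters set more than once: {sorted(overlap)}")
        merged.update(group)
    return merged


# Hypothetical responses, standing in for the dictionaries parsed from Gemini.
feature_engineering = {"polynomial_features": True, "polynomial_degree": 2}
feature_selection = {"remove_multicollinearity": True}
setup_kwargs = merge_setup_kwargs(feature_engineering, feature_selection)
print(setup_kwargs)
```

The merged dictionary could then be passed to `setup()` with a single `**setup_kwargs`.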

In [None]:
# Setup the experiment with the parameters generated from the chat
from pycaret.classification import ClassificationExperiment

gemini_experiment = ClassificationExperiment()

gemini_experiment.setup(
    data=train_df,
    target="depression",
    **data_preparation_parameters,
    **scale_and_transform_parameters,
    **feature_engineering_parameters,
    **feature_selection_parameters,
)

| | Description | Value |
|---|---|---|
| 0 | Session id | 6319 |
| 1 | Target | depression |
| 2 | Target type | Binary |
| 3 | Original data shape | (140700, 19) |
| 4 | Transformed data shape | (203396, 438) |
| 5 | Transformed train set shape | (161186, 438) |
| 6 | Transformed test set shape | (42210, 438) |
| 7 | Ordinal features | 4 |
| 8 | Numeric features | 8 |
| 9 | Categorical features | 10 |


<pycaret.classification.oop.ClassificationExperiment at 0x313d287d0>

In [32]:
top10 = gemini_experiment.compare_models(n_select=10)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.9389 | 0.9747 | 0.8299 | 0.8334 | 0.8316 | 0.7943 | 0.7943 | 15.158 |
| catboost | CatBoost Classifier | 0.9376 | 0.9741 | 0.8144 | 0.8375 | 0.8258 | 0.7877 | 0.7879 | 66.627 |
| lr | Logistic Regression | 0.9371 | 0.9719 | 0.8248 | 0.8283 | 0.8265 | 0.7881 | 0.7881 | 20.886 |
| xgboost | Extreme Gradient Boosting | 0.9367 | 0.9725 | 0.8187 | 0.8305 | 0.8245 | 0.7859 | 0.786 | 14.522 |
| gbc | Gradient Boosting Classifier | 0.9366 | 0.974 | 0.8466 | 0.8125 | 0.8292 | 0.7903 | 0.7906 | 75.905 |
| rf | Random Forest Classifier | 0.9343 | 0.9691 | 0.8138 | 0.8227 | 0.8182 | 0.7781 | 0.7782 | 19.336 |
| et | Extra Trees Classifier | 0.9318 | 0.9659 | 0.7999 | 0.8204 | 0.81 | 0.7684 | 0.7686 | 18.361 |
| lda | Linear Discriminant Analysis | 0.931 | 0.9692 | 0.8246 | 0.8015 | 0.8128 | 0.7706 | 0.7707 | 14.531 |
| ada | Ada Boost Classifier | 0.9303 | 0.9714 | 0.8541 | 0.7824 | 0.8166 | 0.7737 | 0.7749 | 23.991 |
| ridge | Ridge Classifier | 0.9301 | 0.0 | 0.8236 | 0.7982 | 0.8106 | 0.7678 | 0.768 | 10.519 |



All the models are performing a little better than the default PyCaret AutoML model! Now let's ask Gemini to ensemble the models and see if we can get a better score.

In [39]:
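# pull() returns the full comparison grid, indexed by model ID; keep the top 10 rows
top10_models = gemini_experiment.pull().head(10)

# Build the enum from the model IDs in the index (iterating a DataFrame yields column names)
Top10Models = enum.Enum("Top10Models", {model_id: model_id for model_id in top10_models.index})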
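
The point of the enum is to restrict the model names in the response schema to IDs that actually appear in the comparison grid, so a made-up name is rejected instead of reaching PyCaret. The standalone sketch below shows the mechanism with a hypothetical subset of IDs; `ModelChoice` and `is_valid_choice` are illustrative names, and how the notebook's schema helper translates the enum for Gemini is not shown here.

```python
import enum

# Hypothetical subset of model IDs taken from the comparison grid's index.
model_ids = ["lightgbm", "catboost", "lr"]
ModelChoice = enum.Enum("ModelChoice", {model_id: model_id for model_id in model_ids})


def is_valid_choice(name: str) -> bool:
    """Return True when the name is one of the allowed model IDs."""
    try:
        ModelChoice(name)  # lookup by value raises ValueError for unknown IDs
        return True
    except ValueError:
        return False


print(is_valid_choice("lr"), is_valid_choice("lt"))
```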

In [None]:
class BlendModelsSchema(TypedDict):
    models_to_ensemble: List[Top10Models]  # type: ignore
    should_tune: bool
    blend_weights: NotRequired[List[float]]


result = chat.send_message(
    textwrap.dedent(
        f"""
        Here are the top 10 models based on the comparison. Let's decide on the blending models parameters for the data.
        
        {top10_models.to_markdown()}
                
        Remember that we are performing binary classification to predict the depression target variable based on various features.
        
        Generate the following parameters for blending models in PyCaret for final prediction:

        * models_to_ensemble: A list of the model names to ensemble for blending. You can choose any number of models from the top 10 models. It should be a list of model names.
        * should_tune: A boolean value indicating whether to tune the hyperparameters of the blending model. If true, the hyperparameters will be tuned.
        * blend_weights: A list of float values specifying the weights of each model in the blending ensemble. This parameter is optional. If it is not provided, the models will be blended with equal weights. Otherwise, the length of the list should be equal to the number of models in the ensemble denoted by `models_to_ensemble`. The total of the weights should ALWAYS sum up to 1.
        

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "models_to_ensemble": ["lightgbm", "lr"],
            "should_tune": true,
            "blend_weights": [0.25, 0.75]
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(BlendModelsSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

blender_parameters = json.loads(result.text)
print(blender_parameters)
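
The prompt asks for one weight per model and for the weights to sum to 1, but nothing guarantees that the response respects this. A small check before blending can catch a mismatch and rescale the weights. This is a sketch under those assumptions; `check_blend_weights` and the sample response are hypothetical.

```python
import math
from typing import List, Optional


def check_blend_weights(
    models: List[str], weights: Optional[List[float]]
) -> Optional[List[float]]:
    """Return weights rescaled to sum to 1, or None to fall back to equal weighting."""
    if weights is None:
        return None
    if len(weights) != len(models):
        raise ValueError("Expected one weight per model")
    if any(weight < 0 for weight in weights):
        raise ValueError("Weights must be non-negative")
    total = sum(weights)
    if math.isclose(total, 0.0):
        raise ValueError("Weights must not all be zero")
    return [weight / total for weight in weights]


# Hypothetical response in the shape requested by the prompt.
response = {"models_to_ensemble": ["lightgbm", "lr"], "blend_weights": [0.5, 1.5]}
weights = check_blend_weights(
    response["models_to_ensemble"], response.get("blend_weights")
)
print(weights)
```

Rescaling rather than rejecting keeps the relative preference Gemini expressed while still meeting the sum-to-1 requirement.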

In [None]:
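# blend_models() expects trained estimators, so map the chosen IDs back to the
# models returned by compare_models() (assumes both share the same ranking order).
models_by_id = dict(zip(top10_models.index, top10))
blended_model = gemini_experiment.blend_models(
    estimator_list=[models_by_id[m] for m in blender_parameters["models_to_ensemble"]],
    weights=blender_parameters.get("blend_weights"),
)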

In [None]:
if blender_parameters["should_tune"]:
    blending_tuned_model = gemini_experiment.tune_model(blended_model)

In [None]:
class StackModelsSchema(TypedDict):
    models_to_ensemble: List[Top10Models]  # type: ignore
    should_tune: bool
    stack_weights: NotRequired[List[float]]


result = chat.send_message(
    textwrap.dedent(
        f"""
        Here are the top 10 models based on the comparison. Let's decide on the stacking model parameters for the data.
        
        {top10_models.to_markdown()}
                
        Remember that we are performing binary classification to predict the depression target variable based on various features.
        
        Generate the following parameters for stacking models in PyCaret for final prediction:

        * models_to_ensemble: A list of the model names to combine in the stacking ensemble. You can choose any number of models from the top 10 models. It should be a list of model names.
        * should_tune: A boolean value indicating whether to tune the hyperparameters of the stacking model. If true, the hyperparameters will be tuned.
        * stack_weights: A list of float values specifying the weights of each model in the stacking ensemble. This parameter is optional. If it is not provided, the models will be stacked with equal weights. Otherwise, the length of the list should be equal to the number of models in the ensemble denoted by `models_to_ensemble`. The total of the weights should ALWAYS sum up to 1.
        

        All parameters are required.

        **Example JSON Response:**

        ```json
        {{
            "models_to_ensemble": ["lr", "catboost"],
            "should_tune": true,
            "stack_weights": [0.25, 0.75]
        }}
        ```

        """
    ),
    generation_config=genai.GenerationConfig(
        response_schema=get_dict_schema(StackModelsSchema),
        response_mime_type="application/json",
    ),
    request_options=retry_policy,
)

stacker_parameters = json.loads(result.text)
print(stacker_parameters)


In [None]:
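# stack_models() expects trained estimators, so map the chosen IDs back to the
# models returned by compare_models() (assumes both share the same ranking order).
# It has no `weights` argument: the meta-model learns how to combine the base models.
models_by_id = dict(zip(top10_models.index, top10))
stacking_model = gemini_experiment.stack_models(
    estimator_list=[models_by_id[m] for m in stacker_parameters["models_to_ensemble"]],
)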

In [None]:
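# Fall back to the untuned stack so the submission cell always has a model
stacking_tuned_model = stacking_model
if stacker_parameters["should_tune"]:
    stacking_tuned_model = gemini_experiment.tune_model(stacking_model)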

It seems like the ensemble model is performing better than the individual models. Let's submit the predictions to the Kaggle competition and see how it performs.

In [None]:
submission = gemini_experiment.predict_model(stacking_tuned_model, data=test_df)
submission.to_csv("./gemini_submission.csv")

# Conclusion

The PyCaret AutoML model tuned by Gemini reached an accuracy of 0.94167 on the test set, a slight improvement over the PyCaret AutoML model with default settings. That put me at rank 1314 on the public leaderboard (out of 2313 submissions), which is roughly the top 57%.

I'm seriously impressed by the capabilities of Gemini. It was able to analyse the dataset and suggest parameters for setting up the PyCaret AutoML experiment. It also suggested which models to ensemble, which helped me get a better score on the Kaggle competition.

## What's next?

I did not use some of the more advanced features that Gemini and LangChain provide:
* Function calling - https://ai.google.dev/gemini-api/docs/function-calling
* Agents - https://langchain-ai.github.io/langgraph/

I also did not experiment with more parameters for setting up the PyCaret AutoML experiment. With improved prompting and more experiments, Gemini will probably get a better score on the Kaggle competition. I'm excited to see how it performs on the other Kaggle competition that I'm interested in.