# KaggleQA ReAct with Kaggle API


## Introduction
Inspired from the [ReAct Paper](https://arxiv.org/abs/2210.03629), this notebook aims to show how to leverage existing APIs to assist LLMs in answering common questions regarding Kaggle and its platform.

**ReAct** prompting is when the model employs reasoning and actions to complete a task. In this notebook, our Gemma model will utilize reasoning and leverage the Kaggle API along with Kaggle documents to answer Kaggle questions. This approach enables the model not only to search for information in the documents to respond to questions, thus grounding its responses, but also to access the current Kaggle datasets/kernels with the API.

The user will be able to see the model's thought process and the sources it used to lead to the response.


<h4>Steps:</h4>
<ol>
    <li style="font-size: larger;">Preparing the Kaggle Environment</li>
    <li style="font-size: larger;">Designing the LLMAgent</li>
    <li style="font-size: larger;">Testing on Gemma and Gemini</li>
    <li style="font-size: larger;">Finetuning Gemma</li>
    <li style="font-size: larger;">Results</li>
</ol>

First, let's show the final product on some questions. The model is a finetuned Gemma 2b-it.

## Gemma 2b Example Showcase

## Example 1

**Instruction**: What is a Kaggle notebook?

#### Gemma Output

**Thought 1**:  I need to search for Kaggle documents that involve what a Kaggle notebook is. I need to search for broad keyterms in the phrase. I will search instead for the keyword notebook.

**Action 1**: search_docs[notebook]

```
n our community.

The RMarkdown editor is the same one used for basic R or Python scripts,
except that it uses the special RMarkdown syntax. To start editing an
RMarkdown script, click on “Create Notebook”, navigate to the “Scripts”
pane, and click on that. Then, in the language dropdown, click on
“RMarkdown”.

“Head Start for Data Science <https://www.kaggle.com/hiteshp/head-start-
for-data-scientist>” is a great example of a RMarkdown Script-type.


        Notebooks

The last type is Jupyter notebooks (usually just “notebooks”). Jupyter
notebooks consist of a sequence of cells, where each cell is formatted
in either Markdown (for writing text) or in a programming language of
your choice (for writing code). To start a notebook, click on “Create
Notebook”, and select “Notebook”. This will open the Notebooks editing
interface.

Notebooks may be written in either R or Python.

“Comprehensive data exploration with Python <https://www.kaggle.com/
pmarcelino/comprehensive-data-exploration-
 ---- 
ome of the contents of the home page, most of it is dedicated to hot
new Notebooks activity. By browsing down the page you can check out all
the latest updates from your fellow Kagglers.

You can tweak your newsfeed to your liking by following other Kagglers.
To follow someone, go to their profile page and click on “Follow User”.
Content posted and upvotes made by users you have followed will show up
more prominently.

The same is true of other users who choose to follow you. Post high-
quality notebooks and datasets and you will soon find other users
following along with what you are doing!


        Notebook Listing

A more structured way of accessing Notebooks is the Notebook listing
<https://www.kaggle.com/notebooks>, accessible from the “Notebooks” tab
in the main menu bar.

The Notebook listing is sorted by “Hotness <https://www.kaggle.com/
notebooks?sortBy=hotness&group=everyone&pageSize=20>” by default.
“Hotness” is what it sounds like: a way of measuring the interestingness
of
 ---- 
otness” is what it sounds like: a way of measuring the interestingness
of Notebooks on the platform. Notebooks which score highly in Hotness,
and thus appear highly in this list, are usually either recently written
Notebooks that are scoring highly in things like upvotes and views, or
“all-time” greats that have been consistently popular on the platform
for a long time.

Other methods of sorting are by

  * Most Votes <https://www.kaggle.com/code?sortBy=voteCount>: Surfaces
    the most popular notebooks of all time
  * Most Comments <https://www.kaggle.com/code?sortBy=commentCount>:
    Returns the most discussed notebooks of all time
  * Recently Created <https://www.kaggle.com/code?sortBy=dateCreated>: A
    real-time stream of new Notebooks
  * Recently Run <https://www.kaggle.com/code?sortBy=dateRun>: A real-
    time stream of activity
  * Relevance <https://www.kaggle.com/code?sortBy=relevance>: Sorts the
    results based on their relevance to the query

Other filtering options
 ---- 
 on Kaggle is available through either Datasets or our Competitions.
Both prominently feature the best community-created Notebooks on the
“Notebooks” tab. Browsing Notebooks on Datasets and Competitions
provides a way to quickly get acquainted with a specific dataset. You
can fork any existing public Notebook to make a copy of the code and
start experimenting with changes.

The Iris Species dataset <https://www.kaggle.com/uciml/iris> and the
Titanic competition <https://www.kaggle.com/c/titanic/notebooks> are two
classic examples of Datasets and Competitions, respectively, hosting
great Notebooks on their content.


        Tags and Tag Pages

Tags are the most advanced of the searching options available in the
Notebook listing page. Tags are added by Notebook owners to indicate the
topic of the Notebook, techniques you can use (e.g., “classification”),
or the type of the data itself (e.g., “text data”). You can navigate to
tag pages to browse more content sharing a tag either by click
 ---- 
markdown editor ideal
for narrative analyses, visualizations, and sharing work).

The main difference between Scripts and Notebooks is the editing pane
and how you experience editing and executing code.


        Editing

Whether you use Scripts or Notebooks might depend on your choice of
language and what your use case is. R users tend to prefer the Scripts,
while Python users prefer the Notebooks. For more on why that is, refer
to the “Types of Notebooks” section <https://www.kaggle.com/docs/
notebooks#types-of-notebooks>. Scripts are also favored for making
competition submissions where the code is the focus, whereas Notebooks
are popular for sharing EDAs (exploratory data analysis), tutorials, and
other share-worthy insights.

Both editing interfaces are organized around the concept of “Versions”.
This is a collection consisting of a Notebook version, the output it
generates, and the associated metadata about the environment.

In the Script editor, the code you write is executed al
```

**Thought 2**:  The user is looking for information on what a Kaggle notebook is.

**Action 2**: finish[The definition of a Kaggle notebook is a Notebook
that contains code and a console. The code can be written in either R or
Python, and it will execute the code in the console pane and return a
result. The console pane provides an interactive environment where you can
run code, see the results, and make changes to the code to try out.

Notebooks are a powerful tool for data science. They provide a way to
share code and results with other Kagglers, making it easier to
collaborate on data science projects.

Here are some of the benefits of using Notebooks as your data science
tool:

  * Code versioning: You can version your code, meaning you can save
    different versions of your code and compare the results. This can be
    helpful for tracking changes you made or for understanding how your
    code evolved over time.

  * Sharing code and results: You can easily share your code and results
    with others. This makes it easier to collaborate on data science projects.

  * Reproducing results: Notebooks make it easy to reproduce the steps you
    take to get a result. This is because Notebooks are versioned, so you
    can always get the same result by running the code in the notebook again.

  * Exploring data: You can use Notebooks to explore data in the same way
    you would use a data exploration tool like Tableau or Power BI. You
    can filter, slice, and dice data directly in the notebook editor,
    and the results will be updated automatically.

  * Creating visualizations: You can create interactive visualizations in
    notebooks using the same API that you can use in the R console. This
    allows you to create scatter plots, bar charts, line charts, and more.

  * Executing competitions: You can use Notebooks to submit solutions to
    Kaggle competitions.

  * Sharing your work: You can share your Notebooks with others so they can
    view your code and results.

  * Collaborating with others: You can co-create Notebooks with others. This
    can be a great way to learn new techniques or to get help with a
    specific problem.

  * Learning from others: You can browse and explore public Notebooks to
    learn new techniques and data
  
**Finished**

## Example 2

**Instruction**: I am interested in financial fraud. What dataset would you recommend?

#### Gemma Output

**Thought 1**:  I should look up datasets that have to do with financial fraud.

**Action 1**: search_datasets[financial fraud]

 ---- 
    Dataset Title: Synthetic Financial Datasets For Fraud Detection
    Subtitle: Synthetic datasets generated by the PaySim mobile money simulator
    License: CC BY-SA 4.0
    Kernel Count: 247
    URL: https://www.kaggle.com/datasets/ealaxi/paysim1
    
 ---- 
 
    Dataset Title: FraudDetection
    Subtitle: frauddetectiontransactions
    License: Data files © Original Authors
    Kernel Count: 7
    URL: https://www.kaggle.com/datasets/bannourchaker/frauddetection
    
 ---- 
 
    Dataset Title: IBM Transactions for Anti Money Laundering (AML)
    Subtitle: Use for Foundation Models, GNNs, and More
    License: Community Data License Agreement - Sharing - Version 1.0
    Kernel Count: 11
    URL: https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml
    
 ---- 
 
    Dataset Title: Financial Fraud Detection Dataset
    Subtitle: Synthetic Mobile Money Transactions for Fraud Detection Research
    License: CC BY-SA 4.0
    Kernel Count: 4
    URL: https://www.kaggle.com/datasets/sriharshaeedala/financial-fraud-detection-dataset
    
 ---- 
 
    Dataset Title: Credit Card Spending Habits in India
    Subtitle: Gender, Location, and Transaction Trends
    License: Other (specified in description)
    Kernel Count: 32
    URL: https://www.kaggle.com/datasets/thedevastator/analyzing-credit-card-spending-habits-in-india
    
 ---- 
 
    Dataset Title: Comprehensive Credit Card Transactions Dataset
    Subtitle: Unraveling Consumer Transactions Flow: A Comprehensive Credit Card Dataset
    License: Apache 2.0
    Kernel Count: 8
    URL: https://www.kaggle.com/datasets/rajatsurana979/comprehensive-credit-card-transactions-dataset
    
 ---- 
 
    Dataset Title: Fabricated Fraud Detection
    Subtitle: Synthetic Financial Datasets For Fraud Detection
    License: Unknown
    Kernel Count: 4
    URL: https://www.kaggle.com/datasets/giladmanor/fraud-detection
    
 ---- 
 
    Dataset Title: Froud Detection Dataset
    Subtitle: Froud Detection Dataset for Notebook
    License: Unknown
    Kernel Count: 1
    URL: https://www.kaggle.com/datasets/ashishkumarjayswal/froud-detection-dataset
    
 ---- 
 
    Dataset Title: Enron Fraud Email Dataset
    Subtitle: Enron Email dataset labeled for fraud emails - containing around 450k emails.
    License: Apache 2.0
    Kernel Count: 1
    URL: https://www.kaggle.com/datasets/advaithsrao/enron-fraud-email-dataset
    
 ---- 
 
    Dataset Title: Enron Person of Interest Dataset
    Subtitle: Find who are the potential POIs or Person of Interest in the Enron Fraud Scandal
    License: Other (specified in description)
    Kernel Count: 2
    URL: https://www.kaggle.com/datasets/sagarnildass/enron-person-of-interest-dataset
    
**Thought 2**:  I recommend the Synthetic Financial Datasets For Fraud Detection dataset as a starting point for your research on financial fraud.

**Action 2**: finish[I recommend the Synthetic Financial Datasets For Fraud Detection dataset as a starting point for your research on financial fraud. It contains a variety of synthetic financial datasets that can be used to train machine learning models for fraud detection. The dataset includes a variety of features, including transaction date, amount, and merchant, which can be used to create realistic training data for fraud detection models.

**Finished**

The model is directly using the Kaggle API and Kaggle docs to get information. Let's see how we created this!

# Preparing the Kaggle Environment
- Preparing Kaggle documents
- Setting up Kaggle API

Install libraries.

In [1]:
!pip install -q transformers==4.38.0 accelerate==0.27.2 bitsandbytes==0.42.0 peft datasets==2.17.1 trl

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.2 which is incompatible.
beatrix-jupyterlab 2023.128.151533 requires jupyterlab~=3.6.0, but you have jupyterlab 4.1.2 which is incompatible.
cudf 23.8.0 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-pyth

We then read all the kaggle documents and put them into a big string. This will be used in a search_doc[term] action that will search documents based on a term.

In [2]:
import os
import pandas as pd 
from IPython.display import Markdown

directory_path = '/kaggle/input/kaggle-docs/raw'

data = []
for filename in os.listdir(directory_path):
        with open(os.path.join(directory_path, filename), 'r') as file:
            data.append({'File_Name': filename, 'Content': file.read()})
            
directory_path = '/kaggle/input/kaggle-documents'
for filename in os.listdir(directory_path):
        with open(os.path.join(directory_path, filename), 'r') as file:
            data.append({'File_Name': filename, 'Content': file.read()})
            
df_general_info = pd.DataFrame(data)
# Contains all the kaggle documents
kaggle_docs = "----\n".join(df_general_info['Content'].tolist())

Secrets are attached to this notebook to access the Kaggle API and Hugging Face models

In [3]:
from kaggle_secrets import UserSecretsClient
import os
import gc

user = UserSecretsClient().get_secret('user')
key = UserSecretsClient().get_secret('key')
hugging_face_token = UserSecretsClient().get_secret('huggingface')

os.environ['KAGGLE_USERNAME'] = user
os.environ['KAGGLE_KEY'] = key
os.environ['HF_TOKEN'] = hugging_face_token

# Authenticate
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

When we call the Kaggle API for lists of datasets/kernels/competitions, it returns objects. We need to access these objects and return relevant info.

In [4]:
def get_dataset_info(dataset):
    """Retrieve relevant information from a dataset object.

    Args:
        dataset: A dataset object obtained from the Kaggle API.

    Returns:
        A formatted string containing the following dataset information:
            - Dataset Title
            - Subtitle
            - License
            - Kernel Count
            - URL
    """
    dataset_info = """ 
    Dataset Title: {title}
    Subtitle: {subtitle}
    License: {license}
    Kernel Count: {kernel_count}
    URL: {url}
    """
    return dataset_info.format(
        title=dataset.titleNullable,
        subtitle=dataset.subtitleNullable,
        url=dataset.urlNullable,
        license = dataset.licenseNameNullable,
        kernel_count = dataset.kernelCount
    )

def get_competition_info(competition):
    """Retrieve relevant information from a competition object.

    Args:
        competition: A competition object obtained from the Kaggle API.

    Returns:
        A formatted string containing the following competition information:
            - Competition Title
            - Description
            - Category
            - Evaluation Metric
            - Team Count
            - Reward
            - URL
    """
    
    competition_info = """ 
    Competition Title: {title}
    Description: {description}
    Category: {category}
    Evaluation Metric: {evaluation_metric}
    Team Count: {team_count} 
    Reward: {reward}
    URL: {url}
    """
    return competition_info.format(
        title=competition.titleNullable,
        description=competition.descriptionNullable,
        category = competition.categoryNullable,
        evaluation_metric=competition.evaluationMetricNullable,
        team_count = competition.teamCount,
        url=competition.url,
        reward=competition.rewardNullable
    )

def get_kernel_info(kernel):
    """Retrieve relevant information from a kernel object.

    Args:
        kernel: A kernel object obtained from the Kaggle API.

    Returns:
        A formatted string containing the following kernel information:
            - Kernel Title
            - Author
            - Votes
            - Ref
    """
    
    kernel_info = """ 
    Kernel Title: {title}
    Author: {author}
    Votes: {totalVotes}
    Ref: {ref}
    """
    return kernel_info.format(
        title=kernel.title,
        author=kernel.author,
        totalVotes=kernel.totalVotes,
        ref=kernel.ref
    )

Let's test these info functions.

In [5]:
# Use Kaggle API to search datasets
datasets = kaggle.api.dataset_list(search="face",sort_by="hottest")[:3]
# Get dataset info
datasets_meta = [get_dataset_info(dataset) for dataset in datasets]
print("\n ---- \n".join(datasets_meta))

 
    Dataset Title: Face Mask Detection
    Subtitle: 853 images belonging to 3 classes.
    License: CC0: Public Domain
    Kernel Count: 278
    URL: https://www.kaggle.com/datasets/andrewmvd/face-mask-detection
    
 ---- 
 
    Dataset Title: Anime Face Dataset
    Subtitle: Mckinsey666 dataset scraped from www.getchu.com
    License: Database: Open Database, Contents: Database Contents
    Kernel Count: 123
    URL: https://www.kaggle.com/datasets/splcher/animefacedataset
    
 ---- 
 
    Dataset Title: LFW - People (Face Recognition)
    Subtitle: The Labeled Faces in the Wild face recognition dataset.
    License: GNU Lesser General Public License 3.0
    Kernel Count: 52
    URL: https://www.kaggle.com/datasets/atulanandjha/lfwpeople
    


In [6]:
competitions = kaggle.api.competitions_list(search="")[:3]
competitions_meta = [get_competition_info(competiton) for competiton in competitions]
print("\n ---- \n".join(competitions_meta))

 
    Competition Title: AI Mathematical Olympiad - Progress Prize 1
    Description: Solve national-level math challenges using artificial intelligence models
    Category: Featured
    Evaluation Metric: Accuracy Score
    Team Count: 741 
    Reward: $1,048,576
    URL: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize
    
 ---- 
 
    Competition Title: LLM Prompt Recovery
    Description: Recover the prompt used to transform a given text
    Category: Featured
    Evaluation Metric: LLM Nerd-Off Sharpened Cosine Similarity
    Team Count: 2145 
    Reward: $200,000
    URL: https://www.kaggle.com/competitions/llm-prompt-recovery
    
 ---- 
 
    Competition Title: Home Credit - Credit Risk Model Stability
    Description: Create a model measured against feature stability over time
    Category: Featured
    Evaluation Metric: Home Credit 2023 - Gini Stability
    Team Count: 1929 
    Reward: $105,000
    URL: https://www.kaggle.com/competitions/home-credit-cred

In [7]:
kernels = kaggle.api.kernels_list(search="face")[:3]
kernels_meta = [get_kernel_info(kernel) for kernel in kernels]
print("\n ---- \n".join(kernels_meta))

 
    Kernel Title: Face Detection with OpenCV
    Author: Serkan Peldek
    Votes: 289
    Ref: serkanpeldek/face-detection-with-opencv
    
 ---- 
 
    Kernel Title: Face Recognition on Olivetti Dataset
    Author: Serkan Peldek
    Votes: 373
    Ref: serkanpeldek/face-recognition-on-olivetti-dataset
    
 ---- 
 
    Kernel Title: Face mask detection
    Author: Ayushi Mishra
    Votes: 136
    Ref: ayushimishra2809/face-mask-detection
    


All set to create our environment now!

## Kaggle Env


For the Kaggle Env, we have the following actions to assist in answering common Kaggle questions:

- **search_datasets[term]** which searches based on the term the public Kaggle datasets and returns a list of datasets.
- **search_kernels[term]** which searches based on the term the public Kaggle kernels and returns a list of kernels.
- **search_docs[term]** which returns sections that have the exact term in the Kaggle Documents.
- **list_comps[]** which lists all ongoing Kaggle competitions and returns a list of competition info.
- **finish[answer]** which returns the answer and finishes the task.

In [8]:
valid_actions = ["list_comps[], search_datasets[]","list_comps[]","search_kernels[]","search_docs[]"]
class KaggleEnv():
    """Kaggle Environment class for interacting with Kaggle API.

    Interface to interact with the Kaggle API to perform various actions
    such as searching datasets, listing competitions, searching kernels,
    and searching documents.

    Attributes:
        obs (str): Current observation or state of the environment.
        answer (str): Answer obtained from the environment (if any).

    Methods:
        reset(): Reset the environment to its initial state.
        search_datasets(term): Search for datasets based on a given term using Kaggle API.
        list_competitions(): List ongoing competitions using Kaggle API.
        search_kernels(term): Search for kernels based on a given term using Kaggle API.
        search_docs(term): Locate and return document sections with a given term in the middle.
        step(action): Take a step in the environment based on the provided action.

    """
    def __init__(self):
        super().__init__()
        self.obs = None
        self.answer = None
        
    def reset(self):
        self.obs = "Interact with Kaggle API using search_datasets[], list_comps[], search_kernels[], and finish[].\n"
        self.answer = None
    
    @staticmethod
    def search_datasets(term):
        """Searchs for datasets based on term  using Kaggle API."""
        datasets = kaggle.api.dataset_list(search=term)[:10]
        if datasets:
            datasets_meta = [get_dataset_info(dataset) for dataset in datasets]
            return "\n ---- \n".join(datasets_meta)
        return "No datasets found. Maybe use a more broad search term. Search terms should be no longer than 2 words."
    
    @staticmethod
    def list_competitons():
        """Lists ongoing competitions"""
        competitions = kaggle.api.competitions_list()
        if competitions:
            competitions_meta = [get_competition_info(competiton) for competiton in competitions]
            return "\n ---- \n".join(competitions_meta)
        return "No ongoing competitions found."
    
    @staticmethod
    def search_kernels(term):
        """Searchs for kernels based on term using Kaggle API."""
        kernels = kaggle.api.kernels_list(search=term)[:10]
        if kernels:
            kernels_meta = [get_kernel_info(kernel) for kernel in kernels]
            return "\n ---- \n".join(kernels_meta)
        return "No kernels found. Maybe use a more broad search term. Search terms should be no longer than 2 words."
    
    @staticmethod
    def search_docs(term):
        """Locates term and returns sections with term in the middle."""
        sections = []
        i = 0
        while i < len(kaggle_docs):
            if term in kaggle_docs[max(0, i - len(term)):min(len(kaggle_docs), i + len(term))]:
                # Extract section
                sections.append(
                    kaggle_docs[max(0, i - 500):min(len(kaggle_docs), i + 500)]
                )
                # Skip over section in docs
                i += 500
            else:
                i += 1

        if sections:
                return "\n ---- \n".join(sections[:5])

        return "No documents found. Maybe use a more broad search term. Search terms should be no longer than 2 words."
       
        
    def step(self, action):
        """Take a step in the environment based on the provided action.

        Args:
            action (str): Action to be performed in the environment.

        Returns:
            tuple: A tuple containing:
                - obs (str): Current observation or state of the environment.
                - done (bool): True if the question is answered, otherwise False.
                - invalid (bool): True if the action is invalid, otherwise False.

        """
        done = False # Is question answered
        invalid = False # Is invalid action
        action = action.strip()
        
        if self.answer is not None:
            done = True
            return self.obs, done
        
        # Parse out actions
        if action.startswith("search_datasets[") and action.endswith("]"):
            term = action[len("search["):-1]
            self.obs = self.search_datasets(term)
    
        elif action.startswith("list_comps[") and action.endswith("]"):
            term = action[len("search["):-1]
            self.obs = self.list_competitons()
            
        elif action.startswith("search_kernels[") and action.endswith("]"):
            term = action[len("search["):-1]
            self.obs = self.search_kernels(term)
            
        elif action.startswith("search_docs[") and action.endswith("]"):
            term = action[len("search_docs["):-1]
            self.obs = self.search_docs(term)
            
        elif action.startswith("finish["):
            answer = action[len("finish["):-1]
            self.answer = answer
            done = True
            self.obs = f"Finished."
            
        else:
            self.obs = "Invalid action: {action}. Close your brackets []. Your valid actions are {valid_actions}".format(action=action, valid_actions=valid_actions)
            invalid = True
        return self.obs, done, invalid

# Designing the LLM Agent

Like in the ReAct paper, our LLM Agent should take actions to help respond to the common Kaggle question. This will involve prompting and parsing its output.

In [9]:
# For Gemma
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, StoppingCriteria
import torch
from peft import AutoPeftModelForCausalLM

# For Gemini
import google.generativeai as genai

device = "cuda" if torch.cuda.is_available() else "cpu"

In [10]:
class MyStoppingCriteria(StoppingCriteria):
    """Stops generation when encountering target_sequence."""
    def __init__(self, target_sequence, prompt):
        self.target_sequence = target_sequence
        self.prompt=prompt

    def __call__(self, input_ids, scores, **kwargs):
        generated_text = tokenizer.decode(input_ids[0])
        generated_text = generated_text.replace(self.prompt,'')
        if self.target_sequence in generated_text:
            return True 

        return False  

    def __len__(self):
        return 1

    def __iter__(self):
        yield self

class LLMAgent:
    """Agent for interacting with Kaggle Environment using LLM (Large Language Model).

    Attributes:
        env (KaggleEnv): The Kaggle Environment instance.
        model: The LLM model for generation.
        instruction (str): Instruction for resolving a Kaggle Assistant task with interleaving Thought, Action, Observation steps.
        is_gemma (bool): Flag indicating whether to use Gemma for generation.
        examples (str): Examples to be added to the instruction.
    
    Methods:
        use_llm_gemma(prompt, stop=["\n"]): Generate output using Gemma.
        use_gemini(prompt, stop=["\n"]): Generate output using Gemini.
        generate_answer(question, logging=True): Generate answer to the instruction.

    """
    def __init__(self,env, examples, is_gemma=True, add_examples = True):
        """Initialize LLMAgent."""
        super().__init__()
        self.env= env # Kaggle Env
        self.model = model
        self.instruction =  """Resolve a kaggle assistant task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be three types: 
        (1) search_datasets[term] which searches based on the term the public Kaggle datasets and returns a list of datasets.
        (2) search_kernels[term] which searches based on the term the public Kaggle kernels and returns a list of kernels.
        (3) list_comps[] which lists all ongoing Kaggle competitions and returns a list of competition info.
        (4) search_docs[term] which returns all sentences that have the exact term in the Kaggle Documents (Notebooks.txt, Public API.txt, Organizations.txt, Competitions Setup.txt, Datasets.txt, Models.txt, TPU Usage.txt, GPU Usage.txt, Competitions.txt, kaggle-privacy-policy.txt, kaggle-terms-use.txt). Be broad with terms.
        (5) finish[answer] which returns the answer and finishes the task.
        """
        
        self.is_gemma=is_gemma
        if add_examples:
            self.instruction+=examples
    
    def use_llm_gemma(self, prompt,stop=["\n"]):
        """Uses Gemma to generate output."""
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        stopping_criteria = MyStoppingCriteria(stop[0], prompt)
        outputs = model.generate(**inputs, stopping_criteria=stopping_criteria, max_new_tokens=500)
        return tokenizer.decode(outputs[0], skip_special_tokens=True).replace(prompt,"")
        
    def use_gemini(self, prompt, stop=["\n"]):
        """Uses Gemini to generate output."""
        generation_config = {
            'stop_sequences':stop,
            'top_p':1,
            "max_output_tokens":500,
            'temperature':0,

        }

        response = model.generate_content(
            prompt,
            generation_config=generation_config,
        )
        try:
            return response.text
        except Exception as e:
            print(e)
            return ""
    
    def generate_answer(self, question,logging=True):
        """Generate an answer to the given instruction/question.

        Args:
            question (str): The instruction for which the answer is to be generated.
            logging (bool, optional): Flag indicating whether to log the steps in Markdown. Default is True.

        Returns:
            tuple: A tuple containing:
                - answer (str): The generated answer.
                - prompt_for_training (str): The trajectory used for finetuning.
                - markdown_text (str): The Markdown text of the generated trajectory.
        """

        # See what llm to use
        if self.is_gemma:
            use_llm = self.use_llm_gemma
        else:
            use_llm = self.use_gemini
            
        self.env.reset()
        markdown_text = "" # What to print out
        question = f'Instruction: {question}'
        prompt = self.instruction + question
        
        # Trajectory to use for finetuning
        prompt_for_training = self.instruction+question
        
        # Generate Thought Action Observation sequences
        for i in range(1,7):
            thought_action = use_llm(prompt + f"Thought {i}:", stop=["Observation"])
            try:
                thought_action = thought_action.replace("Observation","")
                thought, action = thought_action.strip().split(f"\nAction {i}: ")
            except:
                thought = thought_action.strip().split('\n')[0]
                action = use_llm(prompt + f"Thought {i}: {thought}\nAction {i}:", stop=[f"\n"]).strip()
            
            if logging:
                markdown_text += f"\n**Thought {i}**: {thought}\n"
                markdown_text +=f"\n**Action {i}**: {action}\n"
                
            obs, done, invalid = self.env.step(action)
            
            if logging:
                markdown_text += "```\n" + obs + "\n```"
                
            obs = obs.replace('\\n', '')
            step_str = f"Thought {i}: {thought}\nAction {i}: {action}\nObservation {i}: {obs}\n"
            prompt += step_str
            
            # We do not want to use invalid trajectories for training
            if not invalid:
                prompt_for_training+=step_str

            if done:
                break
            gc.collect()
            torch.cuda.empty_cache()
            
        if not done:
            obs, done, invalid = self.env.step("finish[]")
            
        return self.env.answer, prompt_for_training, markdown_text

I created some examples using a combination of me + Gemini for few-shot prompting. This will help the LLM learn how to use the Kaggle Environment.

In [11]:
examples = """Here are some examples.
Example 1:
Instruction: I want a kernel that shows me how to setup unet how do i do that?
Thought 1: To assist the user, I need to search for Kaggle kernels related to setting up a UNet model.
Action 1: search_kernels[UNet]
Observation 1: \n    Votes: 2599\n    Ref: shivamb/data-science-glossary-on-kaggle\n    \n ---- \n \n    Kernel Title: UNet starter kernel (Pytorch) LB>0.88\n    Author: Rishabh Agrahari\n    Votes: 571\n    Ref: rishabhiitbhu/unet-starter-kernel-pytorch-lb-0-88\n    \n ---- \n \n    Kernel Title: UNet (Pytorch) inference kernel\n    Author: Rishabh Agrahari\n    Votes: 203\n    Ref: rishabhiitbhu/unet-pytorch-inference-kernel\n    \n ---- \n \n    Kernel Title: Image Segmentation From Scratch in Pytorch\n    Author: Dhananjay Raut\n    Votes: 251\n    Ref: dhananjay3/image-segmentation-from-scratch-in-pytorch\n    \n ---- \n \n    Kernel Title: PyTorch Starter (U-Net ResNet)\n    Author: Alexander Teplyuk\n    Votes: 177\n    Ref: ateplyuk/pytorch-starter-u-net-resnet\n    \n ---- \n \n    Kernel Title: UNet with ResNet34 encoder (Pytorch)\n    Author: Rishabh Agrahari\n    Votes: 174\n    Ref: rishabhiitbhu/unet-with-resnet34-encoder-pytorch\n    \n ---- \n \n    Kernel Title: Finding Pneumo [Part 1]: EDA and UNet\n    Author: Ekhtiar Syed\n    Votes: 190\n    Ref: ekhtiar/finding-pneumo-part-1-eda-and-unet\n    \n ---- \n \n    Kernel Title: sample submission leak\n    Author: raddar\n    Votes: 125\n    Ref: raddar/sample-submission-leak\n    \n ---- \n \n    Kernel Title: U-net with simple ResNet Blocks (Forked)\n    Author: Ahmet Erdem\n    Votes: 65\n    Ref: aerdem4/u-net-with-simple-resnet-blocks-forked\n    \n ---- \n \n    Kernel Title: Data Science with DL & NLP: Advanced Techniques\n    Author: Vitalii Mokin\n    Votes: 255\n    Ref: vbmokin/data-science-with-dl-nlp-advanced-techniques\n   
Thought 2: The kernel that appears most relevant to the user is UNet starter kernel by Rishabh Agrahari.
Action 2: I found several kernels that can help you set up a U-Net model in PyTorch. Here's the most relevant one:

Rishabh Agrahari's UNet starter kernel (Pytorch) LB>0.88 (https://www.kaggle.com/code/rishabhiitbhu/unet-starter-kernel-pytorch-lb-0-88) - This kernel provides a starting point for building a U-Net model with PyTorch and achieves a good validation score (LB>0.88) which indicates strong performance.
Additional Resources:

While Rishabh's kernel is a great starting point, here are some other resources that you might find helpful:

Image Segmentation From Scratch in Pytorch (https://www.kaggle.com/dhananjay3/image-segmentation-from-scratch-in-pytorch) - This kernel dives deeper into building a U-Net model from scratch, which can be beneficial for a more thorough understanding.
UNet with ResNet34 encoder (Pytorch) (https://www.kaggle.com/rishabhiitbhu/unet-with-resnet34-encoder-pytorch) - This kernel by the same author explores using a pre-trained ResNet34 encoder within the U-Net architecture, which can be helpful for tasks with limited data.]

Example 2:
Instruction: I am looking for a competition that has to do with stocks.
Thought 1: To assist the user, I need to search for Kaggle competitions related to stocks. However, I only have access to ongoing competitions.
Action 1: list_comps[]
Observation 1: \n  Competition Title: AI Mathematical Olympiad - Progress Prize 1\n    Description: Solve national-level math challenges using artificial intelligence models\n    Category: Featured\n    Evaluation Metric: Accuracy Score\n    Team Count: 591 \n    Reward: $1,048,576\n    URL: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize\n    \n ---- \n \n    Competition Title: LLM Prompt Recovery\n    Description: Recover the prompt used to transform a given text\n    Category: Featured\n    Evaluation Metric: LLM Nerd-Off Sharpened Cosine Similarity\n    Team Count: 2092 \n    Reward: $200,000\n    URL: https://www.kaggle.com/competitions/llm-prompt-recovery\n    \n ---- \n \n    Competition Title: Home Credit - Credit Risk Model Stability\n    Description: Create a model measured against feature stability over time\n    Category: Featured\n    Evaluation Metric: Home Credit 2023 - Gini Stability\n    Team Count: 1809 \n    Reward: $105,000\n    URL: https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability\n    \n ---- \n \n    Competition Title: The Learning Agency Lab - PII Data Detection\n    Description: Develop automated techniques to detect and remove PII from educational data.\n    Category: Featured\n    Evaluation Metric: TLAL_F_beta\n    Team Count: 1908 \n    Reward: $60,000\n    URL: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data\n    \n ---- \n \n    Competition Title: Image Matching Challenge 2024 - Hexathlon\n    Description: Reconstruct 3D scenes from 2D images over six different domains\n    Category: Research\n    Evaluation Metric: mAA-on-camera-centers-with-registration\n    Team Count: 350 \n    Reward: $50,000\n    URL: https://www.kaggle.com/competitions/image-matching-challenge-2024\n    \n ---- \n \n    Competition Title: Learning Agency Lab - Automated Essay Scoring 2.0\n    Description: Improve upon essay scoring algorithms to improve student learning outcomes\n    Category: Featured\n    Evaluation Metric: Cohen Kappa Score\n    Team Count: 328 \n    Reward: $50,000\n    URL: https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2\n    \n ---- \n \n    Competition Title: Leash Bio - Predict New Medicines with BELKA\n    Description: Predict small molecule-protein interactions using the Big Encoded Library for Chemical Assessment (BELKA)\n    Category: Featured\n    Evaluation Metric: Average Precision Score\n    Team Count: 184 \n    Reward: $50,000\n    URL: https://www.kaggle.com/competitions/leash-BELKA\n    \n ---- \n \n    Competition Title: BirdCLEF 2024\n    Description: Bird species identification from audio, focused on under-studied species in the Western Ghats, a major biodiversity hotspot in India.\n    Category: Research\n    Evaluation Metric: Birdclef ROC AUC\n    Team Count: 144 \n    Reward: $50,000\n    URL: https://www.kaggle.com/competitions/birdclef-2024\n    \n ---- \n \n    Competition Title: Google – AI Assistants for Data Tasks with Gemma\n    Description: Build tools to assist Kaggle developers\n    Category: Analytics\n    Evaluation Metric: None\n    Team Count: 0 \n    Reward: $50,000\n    URL: https://www.kaggle.com/competitions/data-assistants-with-gemma\n    \n ---- \n \n    Competition Title: Regression with an Abalone Dataset\n    Description: Playground Series - Season 4, Episode 4\n    Category: Playground\n    Evaluation Metric: Mean Squared Log Error\n    Team Count: 1207 \n    Reward: Swag\n    URL: https://www.kaggle.com/competitions/playground-series-s4e4\n    \n ---- \n \n    Competition Title: PlantTraits2024 - FGVC11\n    Description: Uncovering the biosphere: Predicting 6 Vital Plant Traits from Plant Images for Ecosystem Health\n    Category: Research\n    Evaluation Metric: R2 Score\n    Team Count: 113 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/planttraits2024\n    \n ---- \n \n    Competition Title: GeoLifeCLEF 2024 @ LifeCLEF & CVPR-FGVC\n    Description: Location-based species presence prediction\n    Category: Research\n    Evaluation Metric: F-Score Beta (Micro)\n    Team Count: 44 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/geolifeclef-2024\n    \n ---- \n \n    Competition Title: Titanic - Machine Learning from Disaster\n    Description: Start here! Predict survival on the Titanic and get familiar with ML basics\n    Category: Getting Started\n    Evaluation Metric: Categorization Accuracy\n    Team Count: 15708 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/titanic\n    \n ---- \n \n    Competition Title: House Prices - Advanced Regression Techniques\n    Description: Predict sales prices and practice feature engineering, RFs, and gradient boosting\n    Category: Getting Started\n    Evaluation Metric: Root Mean Squared Logarithmic Error\n    Team Count: 4633 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques\n    \n ---- \n \n    Competition Title: Spaceship Titanic\n    Description: Predict which passengers are transported to an alternate dimension\n    Category: Getting Started\n    Evaluation Metric: Categorization Accuracy\n    Team Count: 2714 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/spaceship-titanic\n    \n ---- \n \n    Competition Title: Digit Recognizer\n    Description: Learn computer vision fundamentals with the famous MNIST data\n    Category: Getting Started\n    Evaluation Metric: Categorization Accuracy\n    Team Count: 1859 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/digit-recognizer\n    \n ---- \n \n    Competition Title: Natural Language Processing with Disaster Tweets\n    Description: Predict which Tweets are about real disasters and which ones are not\n    Category: Getting Started\n    Evaluation Metric: F-Score (Micro)\n    Team Count: 1158 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/nlp-getting-started\n    \n ---- \n \n    Competition Title: Store Sales - Time Series Forecasting\n    Description: Use machine learning to predict grocery sales\n    Category: Getting Started\n    Evaluation Metric: Root Mean Squared Logarithmic Error\n    Team Count: 734 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/store-sales-time-series-forecasting\n    \n ---- \n \n    Competition Title: Connect X\n    Description: Connect your checkers in a row before your opponent!\n    Category: Getting Started\n    Evaluation Metric: Connect X\n    Team Count: 185 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/connectx\n    \n ---- \n \n    Competition Title: I’m Something of a Painter Myself\n    Description: Use GANs to create art - will you be the next Monet?\n    Category: Getting Started\n    Evaluation Metric: PostProcessorKernel\n    Team Count: 111 \n    Reward: Knowledge\n    URL: https://www.kaggle.com/competitions/gan-getting-started\n
Thought 2: It seems no ongoing compitions related to stocks.
Action 2: finish[I cannot find any ongoing competition to stocks. There may be a past compeition related to stocks but I only have access to current competitions.]

Example 3:
Instruction: What's up? How's your day?
Thought 1: The user asked a general question. I do not need to search.
Action 1: finish[I don't experience days in the same way that humans do, since I am always learning and processing information. That said, I'm always functioning well and ready to assist you with whatever you need today!  What would you like to talk about?]

Example 4:
Instruction: I am looking for a dataset that has to do with lifestyle.
Thought 1: I should look up datasets that have to do with lifestyle,
Action 1: search_datasets[lifestyle]
Observation 1: \n    Dataset Title: Risk Factors for Cardiovascular Heart Disease\n    Subtitle: Examining Age, Gender, Height, Weight and Health Metrics\n    License: Other (specified in description)\n    Kernel Count: 20\n    URL: https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas\n    \n ---- \n \n    Dataset Title: The Heart Failure Prediction Dataset\n    Subtitle:  The Dataset about heart patients of an Institute of Cardiology \n    License: CC0: Public Domain\n    Kernel Count: 13\n    URL: https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital\n    \n ---- \n \n    Dataset Title: Prediction of Insurance Charges\n    Subtitle: A Study of Customers Insurance Charges\n    License: Other (specified in description)\n    Kernel Count: 16\n    URL: https://www.kaggle.com/datasets/thedevastator/prediction-of-insurance-charges-using-age-gender\n    \n ---- \n \n    Dataset Title: Predicting Heart Disease Using Clinical Variables\n    Subtitle: A 270-Patient Dataset with 13 Features\n    License: Data files © Original Authors\n    Kernel Count: 15\n    URL: https://www.kaggle.com/datasets/thedevastator/predicting-heart-disease-risk-using-clinical-var\n    \n ---- \n \n    Dataset Title: Chinese Diabetes Question Classification\n    Subtitle: NCAA2024 - Chinese Diabetes Question Classification\n    License: Apache 2.0\n    Kernel Count: 2\n    URL: https://www.kaggle.com/datasets/marquis03/ncaa2024-chinese-diabetes-question-classification\n    \n ---- \n \n    Dataset Title: Dogs Intelligence and Size\n    Subtitle: An Exploration of Breed-Based Characteristics\n    License: Data files © Original Authors\n    Kernel Count: 1\n    URL: https://www.kaggle.com/datasets/thedevastator/canine-intelligence-and-size\n    \n ---- \n \n    Dataset Title: Egyptian Hieroglyphics Datasets\n    Subtitle: Dataset of Egyptian Hieroglyphs extracted from dictionary of Gardiner's list.\n    License: Other (specified in description)\n    Kernel Count: 0\n    URL: https://www.kaggle.com/datasets/waleedumer/egyptian-hieroglyphics-datasets\n    \n ---- \n \n    Dataset Title: CardioKill\n    Subtitle: Unifying Heart Disease Datasets: A Comprehensive Approach to Early Detection\n    License: CC0: Public Domain\n    Kernel Count: 1\n    URL: https://www.kaggle.com/datasets/willianoliveiragibin/cardiokill\n    \n ---- \n \n    Dataset Title: Global suicide data\n    Subtitle: \n    License: CC0: Public Domain\n    Kernel Count: 2\n    URL: https://www.kaggle.com/datasets/sathutr/global-suicide-data\n    \n ---- \n \n    Dataset Title: Medical Conversation Corpus (100k+)\n    Subtitle: Generative Language Modeling for Medical Applications\n    License: CC0: Public Domain\n    Kernel Count: 1\n    URL: https://www.kaggle.com/datasets/thedevastator/medical-conversation-corpus-100k\n 
Thought 2: Among the datasets related to lifestyle, "Predicting Heart Disease Using Clinical Variables" seems particularly relevant. It involves examining clinical variables to predict heart disease risk, which is a significant aspect of lifestyle and health.
Action 2: finish[I found several datasets on Kaggle related to lifestyle, and 'Predicting Heart Disease Using Clinical Variables' seems like a strong fit for your needs. It contains data on patients with 13 features that can be used to predict heart disease risk, a major factor in overall health and lifestyle.

Here's a breakdown of why this dataset is a good choice, along with some additional resources you might find helpful:

* Explanation: Heart disease is a leading cause of death globally, and understanding risk factors is crucial for promoting healthy lifestyles. This dataset allows you to analyze clinical variables and potentially build models to predict heart disease, which can be a valuable tool in preventative healthcare.

* Additional Resources:
    * While "Predicting Heart Disease Using Clinical Variables" is a good starting point, here are some other relevant datasets on Kaggle:
        * Risk Factors for Cardiovascular Heart Disease: https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas examines factors like age, weight, and health metrics that can influence heart health.
        * The Heart Failure Prediction Dataset: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction provides data on heart patients, allowing for further exploration of heart-related conditions.
    * Beyond Kaggle, consider searching for government health resources or public health organization data repositories that might offer even more comprehensive datasets on lifestyle factors and health outcomes.

I hope this comprehensive explanation and additional resources aid in your Kaggle project!]

Example 5
Instruction: Can you add multiple data sources to your Notebook\'s environment?
Thought 1: I need to search for Kaggle documents that involve multiple data sources to your Notebook\'s environment. I need to search for broad keyterms in the phrase. I will search instead for the keyword multiple.
Action 1: search_docs[multiple]
Observation 1: d for the Notebook against a certain Docker container version. More\ninformation can be found in "The Notebook Environment" section.\n\n------------------------------------------------------------------------\n\n\n      Adding Data Sources\n\nOne of the advantages to using Notebooks as your data science workbench\nis that you can easily add data sources from thousands of publicly\navailable Datasets or even upload your own. You can also use output\nfiles from another Notebook as a data source. You can add multiple data\nsources to your Notebook’s environment, allowing you to join together\ninteresting datasets.\n\n\n        Datasets\n\nKaggle Datasets provides a rich mix of interesting datasets for any kind\nof data science project.\n\nThere are two ways of loading a Dataset in a Notebook. The first is to\nnavigate to a chosen dataset’s landing page, then click on the “New\nNotebook” button <https://www.kaggle.com/notebooks?modal=true>. This\nwill launch a new Notebook session with the dataset in question spu\n ---- \ne more and better content than you could in a\nsingle notebook alone.\n\n“Minimal LSTM + NB-SVM baseline ensemble <https://www.kaggle.com/\njhoward/minimal-lstm-nb-svm-baseline-ensemble/notebook>”, written by\nJeremy Howard, is one example of a great Notebook using this feature.\nClick on the “Data” tab to view the data sources he uses.\n\n------------------------------------------------------------------------\n\n\n      Collaborating on Notebooks\n\nNotebooks collaboration is a powerful feature. It allows multiple users\nto co-own and edit a Notebook. For example, you can work with\nCompetition teammates to iterate on a model or collaborate with\nclassmates on a data science project.\n\n\n        Inviting Collaborators\n\nFrom your Notebook editor or viewer, public or private, you may navigate\nto the \'Share\' or \'Sharing\' button in the Notebook’s menu to expose,\namong other settings, the Collaborators options. There, use the search\nbox to find and add other users as Notebook collaborators.\n\nIf your Notebo\n ---- \nll a moving\ntarget as our team continually updates them to ensure the latest and\ngreatest packages are available. We update the images about every two\nweeks, mainly to upgrade to the latest versions of the packages we\nprovide but also occasionally to add or remove certain packages. You can\nsubscribe to notifications when we release a new Docker image on GitHub\n<https://www.kaggle.com/product-feedback/161327>.\n\nIt is also possible to pin a specific Docker image for use in a Notebook\nif there are multiple custom images available. This can be done by\naccessing the “Settings” tab in the Notebook editor. Next to\n"Environment", there is an option to select "Preferences." This opens a\nmodal where you can select what your environment preference is between\npinning to a specific image (the image version when your notebook was\ncreated) or always using the latest image. You can read more about these\noptions here <https://admin.kaggle.com/product-feedback/150261>.\n\nIn order to ensure that your Note\n ---- \nn a Python metric, check that your solution file\'s format\nmatches that expected by the metric\'s documentation, or just continue to\ntesting a submission to see if it matches.\n\nIf you\'ve chosen a Legacy metric, then after uploading the |\nsolution.csv| file the column headers will auto populate the Solution\nMapping table below. Mapping allows our metric code to understand which\ncolumns to use for calculations. Choose the correct “Expected Column”\nvalues. Note, some evaluation metrics let you score multiple columns\nsimultaneously.\n\n\n          Upload the |sample_submission.csv| file and map the verification\n\nClick on the upload icon to upload your |sample_submission.csv| file.\n\nIf you\'ve chosen a Legacy metric, then after uploading you\'ll again need\nto complete the same process of column mapping for the submission format.\n\n\n          Upload data for participants\n\nClick on the Data tab and “Upload first version” button on the bottom of\nyour screen to upload all data that participants can acc\n ---- \no assign descriptions to\nindividual columns of the dataset, making it easier for users to\nunderstand what each column means. Column metrics, meanwhile, present\nhigh-level metrics about individual columns in a graphic format.\n\n“The Complete Pokemon Dataset <https://www.kaggle.com/rounakbanik/\npokemon>” is an example of a great CSV-type Dataset.\n\n\n          JSON\n\nWhile CSV is the most common file format for “flat” data, JSON is the\nmost common file format for “tree-like” data that potentially has\nmultiple layers, like the branches on a tree:\n\n{[{‘id’: 0, ‘type’: ‘bananas’, ‘quantity’: 12}, {‘id’: 1, ‘type’:\n‘apples’, ‘quantity’: 7}]}\n\nFor JSON files, the Data tab preview will present an interactive tree\nwith the nodes in the JSON file attached. You can click on individual\nkeys to open and collapse sections of the tree, exploring the structure\nof the dataset as you go along. JSON files do not support column\ndescriptions or metrics.\n\nYou can filter the Datasets listing by File Type to show'
Thought 2: Yes, you can add multiple data sources.
Action 2: finish[Yes, you can add multiple data sources to your Notebook's environment. Here's the relevant passage from the provided text:

"One of the advantages to using Notebooks as your data science workbench is that you can easily add data sources from thousands of publicly available Datasets or even upload your own. You can also use output files from another Notebook as a data source. You can add multiple data sources to your Notebook’s environment, allowing you to join together interesting datasets."

So, according to the passage, you can indeed add multiple data sources to your Notebook's environment.]
"""

## Gemini Test

Let's test our agent first on the bigger model: Gemini.

In [12]:
gemini_key = UserSecretsClient().get_secret("gemni_key")
genai.configure(api_key = gemini_key)
model = genai.GenerativeModel('models/gemini-1.0-pro-latest')

In [13]:
agent = LLMAgent(KaggleEnv(), examples, is_gemma=False)
_, _, markdown_text = agent.generate_answer("I am interested in ecommerce. What dataset would you recommend?",logging=True)
Markdown(markdown_text)


**Thought 1**: I should look up datasets that have to do with ecommerce.

**Action 1**: search_datasets[ecommerce]
```
 
    Dataset Title: eCommerce behavior data from multi category store
    Subtitle: This dataset contains 285 million users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 48
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
    
 ---- 
 
    Dataset Title: Brazilian E-Commerce Public Dataset by Olist
    Subtitle: 100,000 Orders with product, customer and reviews info
    License: CC BY-NC-SA 4.0
    Kernel Count: 502
    URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
    
 ---- 
 
    Dataset Title: eCommerce Events History in Cosmetics Shop
    Subtitle: This dataset contains 20M users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 36
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop
    
 ---- 
 
    Dataset Title: Ecommerce Fraud Data
    Subtitle: Ecom. Fraud Detection data that is good for beginners to practice EDA and ML. 
    License: Unknown
    Kernel Count: 6
    URL: https://www.kaggle.com/datasets/aryanrastogi7767/ecommerce-fraud-data
    
 ---- 
 
    Dataset Title: Walmart Product Data 2019
    Subtitle: Updated Walmart USA Product Dataset 2019
    License: CC0: Public Domain
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/promptcloud/walmart-product-data-2019
    
 ---- 
 
    Dataset Title: eCommerce events history in electronics store
    Subtitle: This dataset contains 900K detailed users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 6
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store
    
 ---- 
 
    Dataset Title: Looker Ecommerce BigQuery Dataset
    Subtitle: CSV version of BigQuery Looker Ecommerce Dataset
    License: Unknown
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/mustafakeser4/looker-ecommerce-bigquery-dataset
    
 ---- 
 
    Dataset Title: Top Trending Topics in Online : 2023
    Subtitle: Search, Marketing, Real Estate, Ecommerce, Investment, Business, Health, Travel.
    License: Other (specified in description)
    Kernel Count: 0
    URL: https://www.kaggle.com/datasets/belayethossainds/top-trending-topics-in-online-2023
    
 ---- 
 
    Dataset Title: Global e-Commerce Websites List
    Subtitle: 4+ million e-Commerce sites built with various technologies & market share
    License: Database: Open Database, Contents: © Original Authors
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/wiredwith/websites-list
    
 ---- 
 
    Dataset Title: Ecommerce Rating Dataset
    Subtitle: Ecommerce Rating Dataset is collected from http://ecom.uelstore.com/ 
    License: Unknown
    Kernel Count: 1
    URL: https://www.kaggle.com/datasets/trnduythanhkhttt/uelstoredataset
    
```
**Thought 2**: The "Brazilian E-Commerce Public Dataset by Olist" seems like a good fit for your needs. It contains 100,000 orders with product, customer, and reviews info, which can be valuable for analyzing e-commerce trends and customer behavior.

**Action 2**: finish[I recommend the "Brazilian E-Commerce Public Dataset by Olist" for your e-commerce research. It's a comprehensive dataset with 100,000 orders, including product, customer, and review information. This data can provide valuable insights into e-commerce trends and customer behavior.

Here's a breakdown of why this dataset is a good choice, along with some additional resources you might find helpful:

* Explanation: The Brazilian E-Commerce Public Dataset by Olist is a rich source of data for anyone interested in e-commerce. It covers a wide range of aspects, from product details to customer demographics and reviews. This data can be used to analyze customer behavior, identify trends, and develop strategies for improving e-commerce operations.

* Additional Resources:
    * While the Brazilian E-Commerce Public Dataset by Olist is a great starting point, here are some other relevant datasets on Kaggle:
        * eCommerce behavior data from multi category store: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store provides data on user events from a multi-category e-commerce website.
        * eCommerce Events History in Cosmetics Shop: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop offers data on user events from a cosmetics e-commerce website.
    * Beyond Kaggle, consider searching for industry reports or academic papers on e-commerce trends and customer behavior. These resources can provide additional context and insights for your research.

I hope this comprehensive explanation and additional resources aid in your Kaggle project!]
```
Finished.
```

Results look pretty good! Let's try a general question about Kaggle.

In [14]:
agent = LLMAgent(KaggleEnv(), examples,is_gemma=False)
_, _, markdown_text = agent.generate_answer("Does Kaggle provide GPUs?",logging=True)
Markdown(markdown_text)


**Thought 1**: I need to search for Kaggle documents that involve GPUs.

**Action 1**: search_docs[GPU]
```
ht side of the
screen. In the compact editor (where you hide the settings pane), it is
folded into tabs above the Editor tab. In either case the settings pane
contains the following tabs:

There's a tab called “Data” that provides a way of adding or removing
data from the Notebook.

There's a tab called the Settings. The Settings tab has settings for
toggling Language, toggling Docker image selection, toggling Internet
(which is on by default), and toggling an Accelerator between CPU
(default), GPU, and TPU.

Language is the programming language the Notebook is authored in. You
can use it to switch between R and Python in the notebook flavor, and
between R, RMarkdown, and Python in the script flavor. For more details
on the differences, see the “Types of Notebooks” section.

The Docker image section can be used to pin the R or Python environment
used for the Notebook against a certain Docker container version. More
information can be found in "The Notebook Environment" section.

------
 ---- 
com/docs/datasets#collaborating-on-datasets>

------------------------------------------------------------------------


      The Notebook Environment

Notebooks is more than just a code editor. It’s a versioned
computational environment designed to make it easy to reproduce data
science work. In the Notebooks IDE, you have access to an interactive
session running in a Docker container with pre-installed packages, the
ability to mount versioned data sources, customizable compute resources
like GPUs, and more.


        Notebook Versions and Containers

When you create a Notebook version using 'Save & Run All', you execute
the Notebook from top to bottom in a separate session from your
interactive session. Once it finishes, you will have generated a new
Notebook version. A Notebook version is a snapshot of your work
including your compiled code, log files, output files, data sources, and
more. The latest Notebook version of your Notebook is what is shown to
users in the Notebook viewer
 ---- 
u can run arbitrary shell commands by prepending ! to a
code cell. For instance, to install a new package using pip, run |!pip
install my-new-package|. You can also upgrade or downgrade an existing
package by running |!pip install my-existing-package==X.Y.Z|.

To install packages from GitHub in R, load the devtools package by
running |library(devtools)|. Then, you can run commands such as |
install_github("some_user/some_package")| to install a new package from
GitHub.


          Adding a free GPU

You can add a single NVIDIA Tesla P100 to your Notebook for free. GPU
environments have lower CPU and main memory, but are a great way to
achieve significant speed-ups for certain types of work like training
neural networks on image data. One of the major benefits to using
Notebooks as opposed to a local machine or your own VM is that the
Notebook environment is already pre-configured with GPU-ready software
and packages which can be time consuming and frustrating to set-up. Free
GPU availa
 ---- 
nvironments have lower CPU and main memory, but are a great way to
achieve significant speed-ups for certain types of work like training
neural networks on image data. One of the major benefits to using
Notebooks as opposed to a local machine or your own VM is that the
Notebook environment is already pre-configured with GPU-ready software
and packages which can be time consuming and frustrating to set-up. Free
GPU availability is limited: in busy times, you might be placed in a queue.

To add a GPU, navigate to the “Settings” pane from the Notebook editor
and click the “Accelerator" > GPU option. Your session will restart
which may take a few moments to several minutes if you don’t need to
wait in a queue to access a GPU-enabled machine.

To learn more about getting the most out of using a GPU in Notebooks,
check out this tutorial Notebook by Dan Becker <https://www.kaggle.com/
dansbecker/running-kaggle-kernels-with-a-gpu>.


          Adding a free TPU

You can add a TPU v3-8 to your 
 ---- 
e:

  * [Feature Launch] Upgrade to Notebooks on Google Cloud for more
    compute! <https://www.kaggle.com/product-feedback/159602>

------------------------------------------------------------------------


      Technical Specifications

Kaggle Notebooks run in a remote computational environment. We provide
the hardware—you need only worry about the code.

At time of writing, each Notebook editing session is provided with the
following resources:

  *

    12 hours execution time for CPU and GPU notebook sessions and 9
    hours for TPU notebook sessions

  *

    20 Gigabytes of auto-saved disk space (/kaggle/working)

  *

    Additional scratchpad disk space (outside /kaggle/working) that will
    not be saved outside of the current session

CPU Specifications

  *

    4 CPU cores

  *

    30 Gigabytes of RAM

P100 GPU Specifications

  *

    1 Nvidia Tesla P100 GPU

  *

    4 CPU cores

  *

    29 Gigabytes of RAM

T4 x2 GPU Specifications

  *

    2 Nvidia Tesla T4 GPUs


```
**Thought 2**: Yes, Kaggle provides free GPUs.

**Action 2**: finish[Yes, Kaggle provides free GPUs. Here's the relevant passage from the provided text:

"You can add a single NVIDIA Tesla P100 to your Notebook for free. GPU environments have lower CPU and main memory, but are a great way to achieve significant speed-ups for certain types of work like training neural networks on image data. One of the major benefits to using Notebooks as opposed to a local machine or your own VM is that the Notebook environment is already pre-configured with GPU-ready software and packages which can be time consuming and frustrating to set-up. Free GPU availability is limited: in busy times, you might be placed in a queue."

So, according to the passage, you can indeed add a free GPU to your Notebook.]
```
Finished.
```

## Gemma Test

Now, let's see what Gemma can do.

In [15]:
model_id = "google/gemma-2b-it"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

tokenizer_config.json:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/888 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [16]:
# Sadly, we cannot add examples to help Gemma because context window too small
agent = LLMAgent(KaggleEnv(), examples,is_gemma=True, add_examples = False)
_, _, markdown_text = agent.generate_answer("I am interested in ecommerce. What dataset would you recommend?",logging=True)
Markdown(markdown_text)

2024-04-14 23:44:15.638115: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-14 23:44:15.638210: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-14 23:44:15.775708: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered



**Thought 1**: I want to find datasets related to online shopping.

**Action 1**: search_datasets[term]

**
```
Invalid action: search_datasets[term]

**. Close your brackets []. Your valid actions are ['list_comps[], search_datasets[]', 'list_comps[]', 'search_kernels[]', 'search_docs[]']
```
**Thought 2**: I want to find datasets related to specific categories within online shopping.

**Action 2**: search_datasets(category) which searches for datasets related to the specified category.

**
```
Invalid action: search_datasets(category) which searches for datasets related to the specified category.

**. Close your brackets []. Your valid actions are ['list_comps[], search_datasets[]', 'list_comps[]', 'search_kernels[]', 'search_docs[]']
```
**Thought 3**: I want to find datasets related to specific competitors in online shopping.

**Action 3**: list_comps[competitor] which lists all ongoing competitions that the competitor is participating in.

**
```
Invalid action: list_comps[competitor] which lists all ongoing competitions that the competitor is participating in.

**. Close your brackets []. Your valid actions are ['list_comps[], search_datasets[]', 'list_comps[]', 'search_kernels[]', 'search_docs[]']
```
**Thought 4**: I want to find datasets related to specific products within online shopping.

**Action 4**: search_docs[product] which returns all sentences that have the exact term in the Kaggle Documents (Notebooks.txt, Public API.txt, Organizations.txt, Competitions Setup.txt, Datasets.txt, Models.txt, TPU Usage.txt, GPU Usage.txt, Competitions.txt, kaggle-privacy-policy.txt, kaggle-terms-use.txt)

**
```
Invalid action: search_docs[product] which returns all sentences that have the exact term in the Kaggle Documents (Notebooks.txt, Public API.txt, Organizations.txt, Competitions Setup.txt, Datasets.txt, Models.txt, TPU Usage.txt, GPU Usage.txt, Competitions.txt, kaggle-privacy-policy.txt, kaggle-terms-use.txt)

**. Close your brackets []. Your valid actions are ['list_comps[], search_datasets[]', 'list_comps[]', 'search_kernels[]', 'search_docs[]']
```
**Thought 5**: I want to finish the task.

**Action 5**: finish[answer] which returns the answer and finishes the task.
```
Finished.
```

Hmm... Gemma has no idea what it's doing! Gemini seems to perform much better with its 1 million context window and much more robust model. How about **finetuning** Gemma on Gemini's outputs!

# Finetune Gemma

Gemma is clearly limited by its size and context. That's why we finetune it on Gemini's outputs. This technique is called knowledge distillation and can help smaller models greatly. Gemma will learn from the the bigger, better model's outputs.

In [17]:
# Load Gemini trajectories
data = pd.read_csv('/kaggle/input/react-finetune-gemma/kaggleqa_react_gemini.csv')
data.head()

Unnamed: 0,question,answer
0,Can you help me find a dataset on the average ...,Resolve a kaggle assistant task with interleav...
1,What is the role of service providers in Kaggl...,Resolve a kaggle assistant task with interleav...
2,What are the most popular organization profile...,Resolve a kaggle assistant task with interleav...
3,What happens to the intellectual property righ...,Resolve a kaggle assistant task with interleav...
4,How can I make my dataset publicly available?,Resolve a kaggle assistant task with interleav...


Above is a dataset of questions + trajectories curated from using Gemini as the LLM Agent in the Kaggle Environment. Questions were generated by asking Gemini to create questions related to Kaggle's datasets/competitions/general questions, etc. Here is a sample trajectory:

In [18]:
data['answer'][0]

'Resolve a kaggle assistant task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be three types: \n        (1) search_datasets[term] which searches based on the term the public Kaggle datasets and returns a list of datasets.\n        (2) search_kernels[term] which searches based on the term the public Kaggle kernels and returns a list of kernels.\n        (3) list_comps[] which lists all ongoing Kaggle competitions and returns a list of competition info.\n        (4) search_docs[term] which returns all sentences that have the exact term in the Kaggle Documents (Notebooks.txt, Public API.txt, Organizations.txt, Competitions Setup.txt, Datasets.txt, Models.txt, TPU Usage.txt, GPU Usage.txt, Competitions.txt, kaggle-privacy-policy.txt, kaggle-terms-use.txt). Be broad with terms.\n        (5) finish[answer] which returns the answer and finishes the task.\n        Instruction: Can you help me find a dataset on the average 

We perform standard finetuning on Gemma in a separate notebook with **LoRa** and 4-bit quantization to reduce memory.

In [19]:
# Due to dataset package issues, you need to restart kernel to run this
# data = Dataset.from_pandas(data)

In [20]:
"""
data = Dataset.from_pandas(data)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-2b-it"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
"""

'\ndata = Dataset.from_pandas(data)\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nmodel_id = "google/gemma-2b-it"\nbnb_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type="nf4",\n    bnb_4bit_compute_dtype=torch.bfloat16\n)\n\ntokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ[\'HF_TOKEN\'])\nmodel = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ[\'HF_TOKEN\'])\n'

We now define the LoraConfig for finetuning.

In [21]:
"""
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
"""

'\nfrom peft import LoraConfig\n\nlora_config = LoraConfig(\n    r=8,\n    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],\n    task_type="CAUSAL_LM",\n)\n'

For training, we use standard baseline parameters for finetuning (low learning rate, few epochs, etc). We can use 1 batch size due to memory constraints.

In [22]:
"""
from trl import SFTTrainer
import transformers 

def formatting_func(example):
    output_texts = []
    for i in range(len(example)):
        text = example['answer']
        output_texts.append(text)
    return output_texts[0]

trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=600,
        learning_rate=8e-5,
        fp16=True,
        logging_steps=20,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()
"""

'\nfrom trl import SFTTrainer\nimport transformers \n\ndef formatting_func(example):\n    output_texts = []\n    for i in range(len(example)):\n        text = example[\'answer\']\n        output_texts.append(text)\n    return output_texts[0]\n\ntrainer = SFTTrainer(\n    model=model,\n    train_dataset=data,\n    args=transformers.TrainingArguments(\n        per_device_train_batch_size=1,\n        gradient_accumulation_steps=4,\n        warmup_steps=2,\n        max_steps=600,\n        learning_rate=8e-5,\n        fp16=True,\n        logging_steps=20,\n        output_dir="outputs",\n        optim="paged_adamw_8bit"\n    ),\n    peft_config=lora_config,\n    formatting_func=formatting_func,\n)\ntrainer.train()\n'

# Test Finetuned Gemma

After training, let's load the finetuned model!

In [23]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
peft_model_id = "/kaggle/input/gemma-kaggleqa-react/transformers/gemma-2b-it/1"

# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model = AutoPeftModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", torch_dtype=torch.float16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [24]:
agent = LLMAgent(KaggleEnv(), examples,is_gemma=True, add_examples = False)
_, _, markdown_text = agent.generate_answer("I am interested in ecommerce. What dataset would you recommend?",logging=True)
Markdown(markdown_text)


**Thought 1**: I should look up datasets that have to do with ecommerce.

**Action 1**: search_datasets[ecommerce]
```
 
    Dataset Title: eCommerce behavior data from multi category store
    Subtitle: This dataset contains 285 million users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 48
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
    
 ---- 
 
    Dataset Title: Brazilian E-Commerce Public Dataset by Olist
    Subtitle: 100,000 Orders with product, customer and reviews info
    License: CC BY-NC-SA 4.0
    Kernel Count: 502
    URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
    
 ---- 
 
    Dataset Title: eCommerce Events History in Cosmetics Shop
    Subtitle: This dataset contains 20M users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 36
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop
    
 ---- 
 
    Dataset Title: Ecommerce Fraud Data
    Subtitle: Ecom. Fraud Detection data that is good for beginners to practice EDA and ML. 
    License: Unknown
    Kernel Count: 6
    URL: https://www.kaggle.com/datasets/aryanrastogi7767/ecommerce-fraud-data
    
 ---- 
 
    Dataset Title: Walmart Product Data 2019
    Subtitle: Updated Walmart USA Product Dataset 2019
    License: CC0: Public Domain
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/promptcloud/walmart-product-data-2019
    
 ---- 
 
    Dataset Title: eCommerce events history in electronics store
    Subtitle: This dataset contains 900K detailed users' events from eCommerce website
    License: Data files © Original Authors
    Kernel Count: 6
    URL: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store
    
 ---- 
 
    Dataset Title: Looker Ecommerce BigQuery Dataset
    Subtitle: CSV version of BigQuery Looker Ecommerce Dataset
    License: Unknown
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/mustafakeser4/looker-ecommerce-bigquery-dataset
    
 ---- 
 
    Dataset Title: Top Trending Topics in Online : 2023
    Subtitle: Search, Marketing, Real Estate, Ecommerce, Investment, Business, Health, Travel.
    License: Other (specified in description)
    Kernel Count: 0
    URL: https://www.kaggle.com/datasets/belayethossainds/top-trending-topics-in-online-2023
    
 ---- 
 
    Dataset Title: Global e-Commerce Websites List
    Subtitle: 4+ million e-Commerce sites built with various technologies & market share
    License: Database: Open Database, Contents: © Original Authors
    Kernel Count: 3
    URL: https://www.kaggle.com/datasets/wiredwith/websites-list
    
 ---- 
 
    Dataset Title: Ecommerce Rating Dataset
    Subtitle: Ecommerce Rating Dataset is collected from http://ecom.uelstore.com/ 
    License: Unknown
    Kernel Count: 1
    URL: https://www.kaggle.com/datasets/trnduythanhkhttt/uelstoredataset
    
```
**Thought 2**: I recommend the "Ecommerce Behavior Data from Multi Category Store" dataset as a good starting point for your exploration of ecommerce datasets.

**Action 2**: finish[I recommend the "Ecommerce Behavior Data from Multi Category Store" dataset as a good starting point for your exploration of ecommerce datasets. It contains a large dataset of user events from an eCommerce website, which can be used to train machine learning models for tasks such as customer behavior analysis, fraud detection, and product recommendation.
```
Finished.
```

In [25]:
agent = LLMAgent(KaggleEnv(), examples,is_gemma=True, add_examples = False)
_, _, markdown_text = agent.generate_answer("What is a Kaggle notebook?",logging=True)
Markdown(markdown_text)


**Thought 1**: I need to search for Kaggle documents that involve what a Kaggle notebook is. I need to search for broad keyterms in the phrase. I will search instead for the keyword notebook.

**Action 1**: search_docs[notebook]
```
n our community.

The RMarkdown editor is the same one used for basic R or Python scripts,
except that it uses the special RMarkdown syntax. To start editing an
RMarkdown script, click on “Create Notebook”, navigate to the “Scripts”
pane, and click on that. Then, in the language dropdown, click on
“RMarkdown”.

“Head Start for Data Science <https://www.kaggle.com/hiteshp/head-start-
for-data-scientist>” is a great example of a RMarkdown Script-type.


        Notebooks

The last type is Jupyter notebooks (usually just “notebooks”). Jupyter
notebooks consist of a sequence of cells, where each cell is formatted
in either Markdown (for writing text) or in a programming language of
your choice (for writing code). To start a notebook, click on “Create
Notebook”, and select “Notebook”. This will open the Notebooks editing
interface.

Notebooks may be written in either R or Python.

“Comprehensive data exploration with Python <https://www.kaggle.com/
pmarcelino/comprehensive-data-exploration-
 ---- 
ome of the contents of the home page, most of it is dedicated to hot
new Notebooks activity. By browsing down the page you can check out all
the latest updates from your fellow Kagglers.

You can tweak your newsfeed to your liking by following other Kagglers.
To follow someone, go to their profile page and click on “Follow User”.
Content posted and upvotes made by users you have followed will show up
more prominently.

The same is true of other users who choose to follow you. Post high-
quality notebooks and datasets and you will soon find other users
following along with what you are doing!


        Notebook Listing

A more structured way of accessing Notebooks is the Notebook listing
<https://www.kaggle.com/notebooks>, accessible from the “Notebooks” tab
in the main menu bar.

The Notebook listing is sorted by “Hotness <https://www.kaggle.com/
notebooks?sortBy=hotness&group=everyone&pageSize=20>” by default.
“Hotness” is what it sounds like: a way of measuring the interestingness
of
 ---- 
otness” is what it sounds like: a way of measuring the interestingness
of Notebooks on the platform. Notebooks which score highly in Hotness,
and thus appear highly in this list, are usually either recently written
Notebooks that are scoring highly in things like upvotes and views, or
“all-time” greats that have been consistently popular on the platform
for a long time.

Other methods of sorting are by

  * Most Votes <https://www.kaggle.com/code?sortBy=voteCount>: Surfaces
    the most popular notebooks of all time
  * Most Comments <https://www.kaggle.com/code?sortBy=commentCount>:
    Returns the most discussed notebooks of all time
  * Recently Created <https://www.kaggle.com/code?sortBy=dateCreated>: A
    real-time stream of new Notebooks
  * Recently Run <https://www.kaggle.com/code?sortBy=dateRun>: A real-
    time stream of activity
  * Relevance <https://www.kaggle.com/code?sortBy=relevance>: Sorts the
    results based on their relevance to the query

Other filtering options
 ---- 
 on Kaggle is available through either Datasets or our Competitions.
Both prominently feature the best community-created Notebooks on the
“Notebooks” tab. Browsing Notebooks on Datasets and Competitions
provides a way to quickly get acquainted with a specific dataset. You
can fork any existing public Notebook to make a copy of the code and
start experimenting with changes.

The Iris Species dataset <https://www.kaggle.com/uciml/iris> and the
Titanic competition <https://www.kaggle.com/c/titanic/notebooks> are two
classic examples of Datasets and Competitions, respectively, hosting
great Notebooks on their content.


        Tags and Tag Pages

Tags are the most advanced of the searching options available in the
Notebook listing page. Tags are added by Notebook owners to indicate the
topic of the Notebook, techniques you can use (e.g., “classification”),
or the type of the data itself (e.g., “text data”). You can navigate to
tag pages to browse more content sharing a tag either by click
 ---- 
markdown editor ideal
for narrative analyses, visualizations, and sharing work).

The main difference between Scripts and Notebooks is the editing pane
and how you experience editing and executing code.


        Editing

Whether you use Scripts or Notebooks might depend on your choice of
language and what your use case is. R users tend to prefer the Scripts,
while Python users prefer the Notebooks. For more on why that is, refer
to the “Types of Notebooks” section <https://www.kaggle.com/docs/
notebooks#types-of-notebooks>. Scripts are also favored for making
competition submissions where the code is the focus, whereas Notebooks
are popular for sharing EDAs (exploratory data analysis), tutorials, and
other share-worthy insights.

Both editing interfaces are organized around the concept of “Versions”.
This is a collection consisting of a Notebook version, the output it
generates, and the associated metadata about the environment.

In the Script editor, the code you write is executed al
```
**Thought 2**: The user is looking for information on what a Kaggle notebook is.

**Action 2**: finish[The definition of a Kaggle notebook is a Notebook
that contains code and a console. The code can be written in either R or
Python, and it will execute the code in the console pane and return a
result. The console pane provides an interactive environment where you can
run code, see the results, and make changes to the code to try out.

Notebooks are a powerful tool for data science. They provide a way to
share code and results with other Kagglers, making it easier to
collaborate on data science projects.

Here are some of the benefits of using Notebooks as your data science
tool:

  * Code versioning: You can version your code, meaning you can save
    different versions of your code and compare the results. This can be
    helpful for tracking changes you made or for understanding how your
    code evolved over time.

  * Sharing code and results: You can easily share your code and results
    with others. This makes it easier to collaborate on data science projects.

  * Reproducing results: Notebooks make it easy to reproduce the steps you
    take to get a result. This is because Notebooks are versioned, so you
    can always get the same result by running the code in the notebook again.

  * Exploring data: You can use Notebooks to explore data in the same way
    you would use a data exploration tool like Tableau or Power BI. You
    can filter, slice, and dice data directly in the notebook editor,
    and the results will be updated automatically.

  * Creating visualizations: You can create interactive visualizations in
    notebooks using the same API that you can use in the R console. This
    allows you to create scatter plots, bar charts, line charts, and more.

  * Executing competitions: You can use Notebooks to submit solutions to
    Kaggle competitions.

  * Sharing your work: You can share your Notebooks with others so they can
    view your code and results.

  * Collaborating with others: You can co-create Notebooks with others. This
    can be a great way to learn new techniques or to get help with a
    specific problem.

  * Learning from others: You can browse and explore public Notebooks to
    learn new techniques and data
```
Finished.
```

In [26]:
agent = LLMAgent(KaggleEnv(), examples,is_gemma=True, add_examples = False)
_, _, markdown_text = agent.generate_answer("How do I add datasets to Kaggle?",logging=True)
Markdown(markdown_text)


**Thought 1**: I need to search for Kaggle documents that involve adding datasets to Kaggle. I need to search for broad keyterms in the phrase. I will search for the keyword add.

**Action 1**: search_docs[add]
```

interface.

Notebooks may be written in either R or Python.

“Comprehensive data exploration with Python <https://www.kaggle.com/
pmarcelino/comprehensive-data-exploration-with-python>” is a great
example of a Python Jupyter Notebook-type. “How to Become a Data
Scientist <https://www.kaggle.com/jackcook/how-to-become-a-data-
scientist>” is a great example of an R Jupyter Notebook-type.

------------------------------------------------------------------------


      Searching for Notebooks

In addition to being an interactive editing platform, you can find and
use code that others in the community have shared public. Kagglers
working with data across both the Datasets and Competitions platforms
are constantly building cool things. Exploring and reading other
Kagglers’ code is a great way to both learn new techniques and stay
involved in the community.

There’s no better place than Kaggle Notebooks to discover such a huge
repository of public, open-sourced, and reproducible code for da
 ---- 
nd machine learning.

The latest and greatest from Notebooks is surfaced on Kaggle in several
different places.


        Site Search

You can use the site search in the top bar of the website while on any
page to look for not only Notebooks but Datasets, Competitions, Users,
and more across Kaggle. Start typing a search query to get quick results
and hit "Enter" to see a full page of results that you can drill down
into. From the full page search results, you can filter just to
"Notebooks" and add even more filter criteria using the filter options
on the left hand side of the page.


        Homepage

When you’re logged into your Kaggle account, the Kaggle homepage
<https://kaggle.com/> provides a live newsfeed of what people are doing
on the platform. While Discussion forum posts and new Datasets make up
some of the contents of the home page, most of it is dedicated to hot
new Notebooks activity. By browsing down the page you can check out all
the latest updates from your fellow Kagg
 ---- 
ecific dataset. You
can fork any existing public Notebook to make a copy of the code and
start experimenting with changes.

The Iris Species dataset <https://www.kaggle.com/uciml/iris> and the
Titanic competition <https://www.kaggle.com/c/titanic/notebooks> are two
classic examples of Datasets and Competitions, respectively, hosting
great Notebooks on their content.


        Tags and Tag Pages

Tags are the most advanced of the searching options available in the
Notebook listing page. Tags are added by Notebook owners to indicate the
topic of the Notebook, techniques you can use (e.g., “classification”),
or the type of the data itself (e.g., “text data”). You can navigate to
tag pages to browse more content sharing a tag either by clicking on a
tag on a Notebook, or by searching by tag using the tag-specific search
syntax: |tag:[TAG NAME]|.

Searching by tags allow you to search for Notebooks by topical area or
technique. For example, if you are interested in learning new techniques
f
 ---- 
bout the environment.

In the Script editor, the code you write is executed all at once,
whenever you generate a new version. For finer-grained control, it’s
also possible to specifically execute only a single line or selection of
lines of code.

Notebooks are built on Jupyter notebooks. Notebook Notebooks consist of
individual cells, each of which may be a Markdown (text) cell or a code
cell. Code can be run (and the resulting variables saved) by running
individual code cells, and cells can be added or deleted from the
notebook at any time.


        Console

The console tab provides an alternative interface to the same Python or
R container running in the Notebook. Commands you input into the console
will not change the content of your version. However, any variables you
create in the console will persist throughout the session (unless you
delete them). Additionally, any code that you execute in the editor will
also execute in the console pane.


        Settings

In the expanded edi
 ---- 
 you
create in the console will persist throughout the session (unless you
delete them). Additionally, any code that you execute in the editor will
also execute in the console pane.


        Settings

In the expanded editor, the settings pane takes up the right side of the
screen. In the compact editor (where you hide the settings pane), it is
folded into tabs above the Editor tab. In either case the settings pane
contains the following tabs:

There's a tab called “Data” that provides a way of adding or removing
data from the Notebook.

There's a tab called the Settings. The Settings tab has settings for
toggling Language, toggling Docker image selection, toggling Internet
(which is on by default), and toggling an Accelerator between CPU
(default), GPU, and TPU.

Language is the programming language the Notebook is authored in. You
can use it to switch between R and Python in the notebook flavor, and
between R, RMarkdown, and Python in the script flavor. For more details
on the differ
```
**Thought 2**: The user wants to add datasets to Kaggle.

**Action 2**: finish[The user wants to add datasets to Kaggle. To add datasets to Kaggle, you can either upload them from your local machine or use a remote storage service like Google Drive or Dropbox.

To upload a dataset from your local machine, follow these steps:

1. Open the Notebook editor.
2. Click on the "Data" tab in the notebook editor.
3. Click on the "Add Data" button.
4. Select "CSV File" from the "File" type.
5. Select the CSV file you want to upload from your local machine.
6. Click on the "Add" button.

Alternatively, you can use a remote storage service to upload your dataset. To do this, follow these steps:

1. Open the Notebook editor.
2. Click on the "Data" tab in the notebook editor.
3. Click on the "Add Data" button.
4. Select "Google Drive" from the "File" type.
5. Select the CSV file you want to upload from your Google Drive.
6. Click on the "Add" button.

Your dataset will now be added to your Notebook.

You can also add datasets from other users by clicking on the "Add Data"
button and selecting "Data Sources". This will open a modal that lets you
search for datasets by tags, owners, or keywords. You can also filter
your results by language (R or Python), file type (CSV, JSON, HTML,
Notebook output, or other), and more.

When you find a dataset you'd like to add to your Notebook, click on
it to open it. This will open the dataset in a new tab. You can then
add the dataset to your Notebook by clicking on the "Add Data" button
in the editor.

You can also add datasets to your Notebook by uploading them from your
local machine or by using a remote storage service like Google Drive or
Dropbox.

To upload a dataset from your local machine, follow these steps:

1. Open the Notebook editor.
2. Click on the "Data" tab in the notebook editor.
3. Click on the "Add Data" button.
4. Select "CSV File" from the "File" type.
5. Select the CSV file you want to upload from
```
Finished.
```

Nice! Although not as detailed as Gemini, Gemma performs much better than before.

# Conclusion
This notebook demonstrates how to use the Kaggle API alongside ReAct Prompting to build an LLM capable of retrieving information directly from Kaggle to answer questions. Gemma underwent fine-tuning using Gemini's outputs for knowledge distillation, resulting in interpretable results.

## Usefulness
ReAct prompting enhances model interpretability by grounding it in current, factual information. Retrieving information through existing Kaggle APIs simplifies the process, and the environment demonstrated here can be adapted for use with any LLM.

## Limitations
Despite the fine-tuning, Gemini remains significantly more robust than Gemma. Gemma may occasionally hallucinate or fail to provide a response. Due to hardware/time limitations, Gemma could only be fine-tuned on the 2b version with LoRA on under 1000 trajectories. To improve upon this work and create a more robust model, one could consider training a larger Gemma model on more trajectories. Another limitation is the context window; sometimes, search_docs may return documents larger than the context, causing Gemma to malfunction. A potential solution could involve using summarization tools to trim down the retrieved documents and shorten the context. Moreover, the model heavily relies on the observations obtained from the Kaggle API. Sometimes, the Kaggle API search returns dubious results or descriptions are misleading, interfering with the model's performance.

## Closing Remarks
This notebook only scratches the surface of what's possible with the ReAct framework and Kaggle API integration into Gemma. While it has potential, the current Gemma model still is not exactly reliable due to hardware/time constraints. There are numerous avenues for further improvement and exploration. Adding summarization, RAG for loading Kaggle documents on top of the Kaggle Environment are all possibilities for improvement.