# Notes on The Kaggle Book

- toc: true 
- badges: true
- comments: true
- categories: [book, kaggle, data science]

## Chapter 1 - Introducing Kaggle and other data science competitions

* [Kaggle public API docs](https://www.kaggle.com/docs/api).
* [Kaggle API GitHub repo](https://github.com/Kaggle/kaggle-api).
* Tip: When enrolled in a competition, interact with others on the discussion forum to share and learn.
* Common Task Framework (CTF): great for advancing state-of-the-art solutions.
    * Well-defined metrics and quality data
    * Competition
    * Sharing between competitors
    * Compute-resource availability
* What can go wrong in a competition:
    * Leakage from the data: the data contain information about the target that would not be available in real time, when predictions are made.
    * Probing from the leaderboard: using the leaderboard metric as feedback to tune your solution.
        * Example: https://www.kaggle.com/c/dont-overfit-ii/discussion/91766
    * Overfitting and the consequent leaderboard shake-up: happens when there is a large gap between the training set and the public test set.
        * Technique to measure discrepancies between training set and test set: https://www.kaggle.com/code/tunguz/adversarial-ieee/notebook
    * Private sharing
* Jeremy Howard on how to set yourself up for success on Kaggle: https://www.kaggle.com/code/jhoward/first-steps-road-to-the-top-part-1
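
Adversarial validation is usually done by training a classifier to tell training rows from test rows: an AUC close to 0.5 means the two sets look alike, while an AUC close to 1.0 signals a discrepancy. The sketch below is a simplified, per-feature stand-in for that idea (not the code from the linked notebook): it uses each raw feature value as the "is this a test row?" score and computes the AUC by pairwise comparison. The function names and the toy data are made up for illustration.

```python
def rank_auc(train_values, test_values):
    """AUC obtained when the raw feature value is used to separate test rows from train rows.

    Counts, over all (test, train) pairs, how often the test value is larger
    (ties count as half). 0.5 means no separation; values near 0.0 or 1.0 mean strong drift.
    """
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))


def feature_drift(train_rows, test_rows):
    """Return {feature_name: AUC} for rows given as dicts of numeric features."""
    features = train_rows[0].keys()
    return {
        f: rank_auc([row[f] for row in train_rows], [row[f] for row in test_rows])
        for f in features
    }


# Toy data (hypothetical): "stable" has the same distribution in both sets,
# "shifted" is strictly larger in the test set.
train_rows = [{"stable": i % 5, "shifted": i} for i in range(20)]
test_rows = [{"stable": i % 5, "shifted": i + 20} for i in range(20)]

drift = feature_drift(train_rows, test_rows)
print(drift)  # {'stable': 0.5, 'shifted': 1.0}
```

Features whose AUC is far from 0.5 are the ones most likely to behave differently between the training set and the test set.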

## Chapter 2 - Organizing Data with Datasets

### Setting up a dataset

* It is possible to upload a dataset either privately or publicly.
* The "Import a GitHub repository" option can be used to import an experimental library not yet available on Kaggle Notebooks.
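
Once a repository has been imported as a dataset and attached to a notebook, its files are mounted under `/kaggle/input/`. A minimal sketch of making such a library importable, assuming the repository contains a plain Python package at its root (the dataset name `my-experimental-lib` and the helper name are hypothetical):

```python
import sys
from pathlib import Path


def add_dataset_to_path(dataset_dir):
    """Put a dataset folder at the front of sys.path so its packages can be imported."""
    path = str(Path(dataset_dir))
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Hypothetical usage inside a Kaggle notebook:
# add_dataset_to_path("/kaggle/input/my-experimental-lib")
# import my_experimental_lib
```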

### Gathering data

* Interesting interview with [Larxel](https://www.kaggle.com/andrewmvd):
    * On creating datasets:
        * All in all, the process that I recommend starts with setting your purpose, breaking it down into objectives and topics, formulating questions to fulfil these topics, surveying possible sources of data, selecting and gathering, pre-processing, documenting, publishing, maintaining and supporting, and finally, improvement actions.
    * On learning on Kaggle:
        * Absorbing all the knowledge shared at the end of a competition
        * Replicating winning solutions from finished competitions

### Working with datasets

* The easiest way to work with Kaggle datasets is by creating a notebook from the dataset webpage.
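
A notebook created this way has the dataset attached under `/kaggle/input/`. A small sketch for checking which files are available, with the root passed as a parameter so it can also be tried outside Kaggle (the helper name is made up):

```python
import os


def list_files(root="/kaggle/input"):
    """Return the sorted paths of all files found under `root`."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Inside a Kaggle notebook:
# for path in list_files():
#     print(path)
```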

### Using Kaggle datasets in Google Colab

* This section contains a step-by-step guide to downloading Kaggle datasets into Colab.
    1. Download your Kaggle API token (`kaggle.json`) from your Kaggle account. Locally, place it at `~/.kaggle/kaggle.json`.
    2. Create a folder named `Kaggle` on your Google Drive and upload the .json file there.
    3. Mount Google Drive in Colab:
    ```python
    from google.colab import drive
    drive.mount('/content/gdrive')
    ```
    4. Provide the path to the .json config:
    ```python
    import os
    # /content/gdrive/My Drive/Kaggle is the folder in Google Drive
    # that contains kaggle.json
    os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
    # change the working directory so downloads land in the same folder
    %cd /content/gdrive/My Drive/Kaggle
    ```
    5. Go to the dataset page and use the `Copy API command` option.
    6. Run the command in Colab. Example:
    ```python
    !kaggle datasets download -d bricevergnou/spotify-recommendation
    ```
    7. The data is downloaded to the current working directory, which after the `%cd` above is the same folder as `os.environ['KAGGLE_CONFIG_DIR']`. Unzip it and you are ready to go.
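
The unzip step can be sketched with the standard library alone. This assumes the download arrives as one or more `.zip` archives in the working folder; the helper name `unzip_all` is made up for illustration.

```python
import zipfile
from pathlib import Path


def unzip_all(folder):
    """Extract every .zip archive found directly in `folder` into that same folder.

    Returns the names of all extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted


# In Colab, after the steps above:
# import os
# print(unzip_all(os.environ["KAGGLE_CONFIG_DIR"]))
```

Extracting into the same Drive folder keeps the data available across Colab sessions, at the cost of slower reads than the local Colab disk.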