# 0.0.1 - Retrieve Dataset

## Overview

This notebook demonstrates how to download and extract the Kaggle dataset using the Kaggle API Python package and the `doit` task automation tool.

### Actions

This notebook executes tasks with the `doit` task automation tool in order to:

- Download the Kaggle dataset using your Kaggle API credentials (`download_dataset` task, `src/data/download_dataset.py`).
- Extract the appropriate `.csv` file from the `.zip` source (`unpack_dataset` task, `src/data/make_dataset.py`).
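
The extraction step can be pictured with the standard library alone. The sketch below is an assumption about what such a step might look like, not the project's actual script; the function name `extract_csv_members` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_csv_members(zip_path, out_dir):
    """Extract every .csv member of zip_path into out_dir and return the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # ZipFile.extract returns the path of the file it created.
                extracted.append(Path(archive.extract(name, path=out_dir)))
    return extracted
```

Filtering on the `.csv` suffix keeps any other archive members out of the raw data folder.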

### Dependencies

Users should have installed the dependencies listed in `environment.yaml` or `requirements.txt` using the appropriate tool.

Users should also have their Kaggle credentials appropriately set using one of two options:

- A valid `kaggle.json` file in their home configuration folder (`~/.kaggle/kaggle.json`), or
- `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables set in a `.env` file that is excluded from version control.
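
A quick way to see which of the two options is in effect is sketched below. It is a minimal check under stated assumptions: the helper name `find_kaggle_credentials` is hypothetical, the order of the checks is arbitrary and says nothing about the Kaggle client's own precedence, and values kept in a `.env` file must already have been loaded into the environment.

```python
import os
from pathlib import Path


def find_kaggle_credentials(home=None, environ=None):
    """Return 'file', 'env', or None depending on which credential source is present."""
    home = Path(home) if home is not None else Path.home()
    environ = os.environ if environ is None else environ
    # Option 1: a kaggle.json file in the home configuration folder.
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "file"
    # Option 2: both environment variables are set to non-empty values.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return "env"
    return None
```

Calling `find_kaggle_credentials()` with no arguments inspects the real home folder and the current process environment.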

### Targets

This notebook outputs two files:

- `data/raw/walmart-product-data-2019.zip`
- `data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv`

## Setup

The following cells change into the root project directory so that the command line tasks run from there.

In [1]:
# Import libraries.
import os
import sys
from pathlib import Path

In [2]:
# Get the project directory (the parent of the notebook's working directory).
PROJECT_DIR = Path(os.getcwd()).resolve().parents[0]

In [3]:
%cd {PROJECT_DIR}

C:\Users\effen\OneDrive\Documents\RIT\ISTE 780\Projects\price-clf
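
Taking the parent of the working directory works only when the notebook sits exactly one level below the project root. A more forgiving alternative, sketched here as an assumption and not as part of the project, walks upward until it finds a marker file; the helper name `find_project_root` and the choice of `dodo.py` as the marker are both hypothetical.

```python
from pathlib import Path


def find_project_root(start=None, marker="dodo.py"):
    """Walk upward from start until a directory containing marker is found."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```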


In [4]:
# Install dependencies.
!{sys.executable} -m pip install -r requirements.txt

Obtaining file:///C:/Users/effen/OneDrive/Documents/RIT/ISTE%20780/Projects/price-clf (from -r requirements.txt (line 2))
Installing collected packages: src
  Attempting uninstall: src
    Found existing installation: src 0.1.0
    Uninstalling src-0.1.0:
      Successfully uninstalled src-0.1.0
  Running setup.py develop for src
Successfully installed src-0.1.0


In [5]:
# List the possible tasks.
!doit -f "dodo.py" list

clean_dataset      Clean the raw dataset (.csv) and place in the interim folder.
download_dataset   Download the dataset from Kaggle.
unpack_dataset     Unpack the raw dataset (.zip) as the (.csv) file.
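
The listing above follows from the naming convention `doit` uses: functions in `dodo.py` whose names start with `task_` define tasks, and the first line of each docstring becomes the description. The sketch below imitates that convention with the standard library; it is not `doit`'s actual loader, and `list_tasks` and the stand-in module are hypothetical.

```python
import inspect
import types


def list_tasks(module):
    """Return {task_name: first docstring line} for every task_* function in module."""
    tasks = {}
    for name, func in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("task_"):
            doc = inspect.getdoc(func) or ""
            tasks[name[len("task_"):]] = doc.splitlines()[0] if doc else ""
    return tasks


# A stand-in module with one task creator, written in the style of dodo.py.
demo = types.ModuleType("demo_dodo")
exec(
    'def task_download_dataset():\n'
    '    """\n'
    '    Download the dataset from Kaggle.\n'
    '    """\n'
    '    return {"actions": []}\n',
    demo.__dict__,
)
print(list_tasks(demo))
```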


In [6]:
# Remove the targets produced by earlier runs.
!doit -f "dodo.py" clean

clean_dataset - removing file 'data/interim/ecommerce_data-cleaned-0.1.csv'
unpack_dataset - removing file 'data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv'
download_dataset - removing file 'data/raw/walmart-product-data-2019.zip'
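
The output above shows each task's target files being removed. A rough equivalent in plain Python is sketched below under the assumption that cleaning only means deleting target files that exist; the helper name `clean_targets` is hypothetical and this is not how `doit` implements it.

```python
from pathlib import Path


def clean_targets(targets):
    """Delete each target file that exists and return the paths that were removed."""
    removed = []
    for target in targets:
        path = Path(target)
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```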


In [7]:
# Run every task defined in dodo.py.
!doit -f "dodo.py" run

.  download_dataset

2021-10-01 20:27:31,251 - __main__ - INFO - Loading envvars...

2021-10-01 20:27:31,259 - __main__ - INFO - Done loading envvars.

2021-10-01 20:27:31,259 - __main__ - INFO - Authenticating to the Kaggle API...

2021-10-01 20:27:31,262 - __main__ - INFO - Done authenticating to the Kaggle API.

2021-10-01 20:27:31,263 - __main__ - INFO - Downloading dataset...


  0%|          | 0.00/13.6M [00:00<?, ?B/s]
 22%|##1       | 3.00M/13.6M [00:00<00:00, 25.0MB/s]
 51%|#####1    | 7.00M/13.6M [00:00<00:00, 33.7MB/s]
 81%|########  | 11.0M/13.6M [00:00<00:00, 36.6MB/s]
100%|##########| 13.6M/13.6M [00:00<00:00, 36.8MB/s]

2021-10-01 20:27:32,276 - __main__ - INFO - Looking for dataset at "walmart-product-data-2019.zip" ...

2021-10-01 20:27:32,276 - __main__ - INFO - Found dataset.

2021-10-01 20:27:32,276 - __main__ - INFO - Moving dataset into output directory "data\raw\walmart-product-data-2019.zip" ...

2021-10-01 20:27:32,276 - __main__ - INFO - Moved dataset.

2021-10-01 20:27:32,463 - _


.  unpack_dataset
.  clean_dataset



2021-10-01 20:27:35,638 - __main__ - INFO - File saved.
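
The authenticate-then-download sequence in the log can be sketched with the Kaggle API package as follows. This is a minimal illustration under assumptions, not the project's `download_dataset.py`: the function name and the injectable `api` parameter are hypothetical, the dataset slug must be supplied by the caller in `owner/dataset-name` form, and the sketch downloads straight into the output folder where the log shows the real script moving the file afterwards.

```python
from pathlib import Path


def download_dataset(dataset_slug, out_dir, api=None):
    """Download a Kaggle dataset archive into out_dir and return that directory."""
    if api is None:
        # Imported lazily so this function can be defined without the kaggle
        # package or credentials being available.
        from kaggle.api.kaggle_api_extended import KaggleApi

        api = KaggleApi()
        api.authenticate()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # unzip=False keeps the .zip so that a separate task can unpack it.
    api.dataset_download_files(dataset_slug, path=str(out_dir), unzip=False)
    return out_dir
```

Accepting an `api` argument makes it possible to exercise the function with a stand-in object, without network access or credentials.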



In [8]:
# Show our current tasks.
with open("dodo.py") as f:  # The with statement closes the file automatically when the block exits.
    print(f.read())

# dodo.py
#
# Tasks in this file are executed by the doit cli.
import sys

DOIT_CONFIG = { 'action_string_formatting': 'both' }
EXECUTABLE = sys.executable

def task_download_dataset():
    """
    Download the dataset from Kaggle.
    """
    return {
        'targets': [ 'data/raw/walmart-product-data-2019.zip' ],
        'actions': [ EXECUTABLE + ' src/data/download_dataset.py {targets}' ],
        'clean': True
    }

def task_unpack_dataset():
    """
    Unpack the raw dataset (.zip) as the (.csv) file.
    """
    return {
        'file_dep': [ 'data/raw/walmart-product-data-2019.zip' ],
        'targets': [ 'data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv' ],
        'actions': [ EXECUTABLE + ' src/data/make_dataset.py {dependencies} data/raw' ],
        'clean': True
    }
    
def task_clean_dataset():
    """
    Clean the raw dataset (.csv) and place in the interim folder.
    """
    return {
        'file_dep': [ 'data/raw/marketing_sa

In [9]:
# Display the files in the data/raw folder.
%ls "data/raw/"

 Volume in drive C has no label.
 Volume Serial Number is F69E-1392

 Directory of C:\Users\effen\OneDrive\Documents\RIT\ISTE 780\Projects\price-clf\data\raw

10/01/2021  08:27 PM    <DIR>          .
10/01/2021  08:27 PM    <DIR>          ..
10/01/2021  12:52 AM                 0 .gitkeep
10/01/2021  08:27 PM        45,653,183 marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv
10/01/2021  08:27 PM        14,308,028 walmart-product-data-2019.zip
               3 File(s)     59,961,211 bytes
               2 Dir(s)  66,125,627,392 bytes free
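
The listing above comes from the Windows shell, so its format differs on other platforms. A portable way to confirm that both target files exist is sketched below with `pathlib`; the helper name `list_files` is hypothetical.

```python
from pathlib import Path


def list_files(directory):
    """Return (name, size_in_bytes) for each regular file in directory, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size) for p in Path(directory).iterdir() if p.is_file()
    )
```

From the project root, `for name, size in list_files("data/raw"): print(f"{size:>12,}  {name}")` prints one line per file with its size in bytes.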
