# 0.0.1 - Retrieve & Clean Dataset

**Overview**: This notebook demonstrates how to download, extract, and clean the Kaggle dataset using the Kaggle API Python package and the `doit` task automation tool.

**Actions**: This notebook uses the `doit` task automation tool to run tasks that:

- Download the Kaggle dataset using your Kaggle API credentials (`download_dataset.py`).
- Extract the appropriate `.csv` file from the `.zip` source (`unpack_dataset.py`).
- Clean the raw `.csv` dataset and place the result in the interim folder.
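
The extraction step can be sketched with the standard-library `zipfile` module. This is only an illustration of the idea, not the contents of `unpack_dataset.py` (which is not shown here); the helper name `extract_csv` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_csv(zip_path, out_dir):
    """Extract every .csv member of zip_path into out_dir and return the paths.

    Hypothetical helper for illustration; the project's real script may differ.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything in the archive that is not a .csv file.
            if member.lower().endswith(".csv"):
                archive.extract(member, path=out_dir)
                extracted.append(out_dir / member)
    return extracted
```

Assuming the archive has already been downloaded, a call such as `extract_csv("data/raw/walmart-product-data-2019.zip", "data/raw")` would place the `.csv` file next to the `.zip` source.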

**Dependencies**: This notebook has the following dependencies:

- Users should have installed the dependencies listed in `environment.yaml` or `requirements.txt` using the appropriate tool.
- Users should also have their Kaggle credentials appropriately set using one of two options:
    - A valid `kaggle.json` file in their home configuration folder (`~/.kaggle/kaggle.json`), or
    - `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables set in a `.env` file that is excluded from version control.
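
Before running the tasks, it can help to confirm that at least one of the two credential options is in place. The check below is a minimal sketch: the function name `kaggle_credentials_available` is hypothetical, and it assumes that any values kept in a `.env` file have already been loaded into the process environment.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, environ=None):
    """Return True if either credential option described above is present.

    Hypothetical helper; `home` and `environ` are injectable so the check
    can be exercised without touching the real home folder or environment.
    """
    home = Path(home) if home is not None else Path.home()
    environ = os.environ if environ is None else environ
    # Option 1: a kaggle.json file in the home configuration folder.
    has_json = (home / ".kaggle" / "kaggle.json").is_file()
    # Option 2: both environment variables are set and non-empty.
    has_env = bool(environ.get("KAGGLE_USERNAME")) and bool(environ.get("KAGGLE_KEY"))
    return has_json or has_env
```

Calling `kaggle_credentials_available()` with no arguments checks the current user's home folder and environment. It only tests for presence and does not verify that the credentials are valid.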

**Targets**: This notebook outputs three artifacts:

- `data/raw/walmart-product-data-2019.zip`
- `data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv`
- `data/interim/ecommerce_data-cleaned-0.1.csv`

## Setup

The following cell changes into the root project directory in order to execute command line tasks.

In [1]:
# Change current working directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

D:\Repositories\rit\ISTE780\Project


In [2]:
# List the possible tasks.
!doit list

clean_data        Clean the raw dataset (.csv) and place in the interim folder.
create_data_dir   Create the data directory.
create_env        Create new conda environment using 'environment.yaml'.
download_data     Download the dataset from Kaggle.
unpack_data       Unpack the raw dataset (.zip) as the (.csv) file.
update_env        Update existing conda environment using 'environment.yaml'.


In [3]:
# Perform the initial cleaning of the data.
!doit clean_data

-- create_data_dir
-- create_env
-- download_data
-- unpack_data
-- clean_data
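
Requesting only `clean_data` still lists the earlier tasks because `doit` resolves each task's dependencies first; in its default console output, a `--` prefix marks a task that was already up to date and therefore skipped. The project's `dodo.py` is not shown in this notebook, so the following is only a sketch of how such a chain might be declared. It omits `create_env`, the script locations are assumptions, and `clean_dataset.py` is a hypothetical name.

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
ZIP_FILE = str(RAW_DIR / "walmart-product-data-2019.zip")
CSV_FILE = str(
    RAW_DIR / "marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv"
)
CLEAN_FILE = "data/interim/ecommerce_data-cleaned-0.1.csv"


def task_create_data_dir():
    """Create the data directory."""
    # A Python action given as (callable, args, kwargs).
    return {"actions": [(RAW_DIR.mkdir, [], {"parents": True, "exist_ok": True})]}


def task_download_data():
    """Download the dataset from Kaggle."""
    return {
        "actions": ["python download_dataset.py"],  # script location is an assumption
        "task_dep": ["create_data_dir"],
        "targets": [ZIP_FILE],
    }


def task_unpack_data():
    """Unpack the raw dataset (.zip) as the (.csv) file."""
    return {
        "actions": ["python unpack_dataset.py"],  # script location is an assumption
        "task_dep": ["download_data"],
        "file_dep": [ZIP_FILE],
        "targets": [CSV_FILE],
    }


def task_clean_data():
    """Clean the raw dataset (.csv) and place in the interim folder."""
    return {
        "actions": ["python clean_dataset.py"],  # hypothetical script name
        "task_dep": ["unpack_data"],
        "file_dep": [CSV_FILE],
        "targets": [CLEAN_FILE],
    }
```

Here `task_dep` fixes the execution order, while `file_dep` and `targets` let `doit` decide whether a step needs to run again.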


In [4]:
# Display files in the data folders.
import os
# Walk the data folder and keep only files with a .csv extension.
for root, _dirs, files in os.walk("data/", topdown=False):
    for name in files:
        if name.endswith(".csv"):
            print(Path(root) / name)

data\interim\ecommerce_data-cleaned-0.1.csv
data\interim.bak\.ipynb_checkpoints\ecommerce_data-cleaned-0.1-checkpoint.csv
data\interim.bak\.ipynb_checkpoints\ecommerce_data-cleaned-0.2.2-checkpoint.csv
data\interim.bak\.ipynb_checkpoints\ecommerce_data-cleaned-0.2.3-checkpoint.csv
data\interim.bak\ecommerce_data-cleaned-0.1.0.csv
data\interim.bak\ecommerce_data-cleaned-0.1.csv
data\interim.bak\ecommerce_data-cleaned-0.2.1.csv
data\interim.bak\ecommerce_data-cleaned-0.2.2.csv
data\interim.bak\ecommerce_data-cleaned-0.2.3.csv
data\raw\marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv
