# 0.0.1 - Retrieve & Clean Dataset

## Overview

This notebook demonstrates how to download and extract the Kaggle dataset using the Kaggle API Python package and the `doit` task automation tool.

### Actions

This notebook runs `doit` tasks that:

- Download the Kaggle dataset using your Kaggle API credentials (`download_dataset.py`).
- Extract the appropriate `.csv` file from the `.zip` source (`unpack_dataset.py`).
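
The extraction step can be sketched with the standard library alone. This is a minimal sketch, not the contents of `unpack_dataset.py`: the function name `unpack_csv` and its arguments are hypothetical, and it assumes the archive holds the `.csv` as an ordinary member.

```python
from pathlib import Path
from zipfile import ZipFile


def unpack_csv(zip_path, dest_dir):
    """Extract every .csv member of zip_path into dest_dir and return the paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a .csv file (hypothetical filter rule).
            if member.lower().endswith(".csv"):
                extracted.append(Path(archive.extract(member, path=dest_dir)))
    return extracted
```

Under these assumptions, `unpack_csv("data/raw/walmart-product-data-2019.zip", "data/raw")` would place the `.csv` next to the archive.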

### Dependencies

Install the dependencies listed in `environment.yaml` (conda) or `requirements.txt` (pip) before running this notebook.

Users should also have their Kaggle credentials set using one of two options:

- A valid `kaggle.json` file in their home configuration folder (`~/.kaggle/kaggle.json`), or
- `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables set in a `.env` file that is excluded from version control.
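
A quick pre-flight check for the two options might look like the sketch below. The helper name `kaggle_credentials_source` is hypothetical, and the sketch assumes any `.env` values have already been loaded into the process environment; how the project loads them is not shown in this notebook.

```python
import os
from pathlib import Path


def kaggle_credentials_source(home=None, environ=None):
    """Return "kaggle.json", "environment", or None if neither option is set up."""
    home = Path.home() if home is None else Path(home)
    environ = os.environ if environ is None else environ
    # Option 1: credentials file in the home configuration folder.
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "kaggle.json"
    # Option 2: both environment variables present and non-empty.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return "environment"
    return None
```

Calling `kaggle_credentials_source()` with no arguments inspects the real home folder and environment; a return value of `None` suggests the download task is likely to fail authentication.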

### Targets

This notebook outputs two files:

- `data/raw/walmart-product-data-2019.zip`
- `data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv`

## Setup

The following cell changes into the root project directory in order to execute command line tasks.

In [1]:
# Import libraries.
from pathlib import Path

# Get the project directory.
PROJECT_DIR = Path.cwd().resolve().parents[0]

%cd {PROJECT_DIR}

D:\Repositories\rit\ISTE780\Project
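
`Path.cwd().resolve().parents[0]` works only when the notebook is started from a folder directly under the project root. A more defensive variant, sketched below, walks upward until it finds a marker file. It assumes the project root contains `dodo.py` (the default task file name that `doit` looks for); the helper name `find_project_root` is hypothetical.

```python
from pathlib import Path


def find_project_root(start=None, marker="dodo.py"):
    """Walk upward from start until a directory containing marker is found."""
    start = Path.cwd() if start is None else Path(start)
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found in {start} or its parents")
```

With this helper, the setup cell could use `PROJECT_DIR = find_project_root()` regardless of how deeply the notebook is nested.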


In [2]:
# List the possible tasks.
!doit list

clean_data        Clean the raw dataset (.csv) and place in the interim folder.
create_data_dir   Create the data directory.
create_env        Create new conda environment using 'environment.yaml'.
download_data     Download the dataset from Kaggle.
unpack_data       Unpack the raw dataset (.zip) as the (.csv) file.
update_env        Update existing conda environment using 'environment.yaml'.
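
For readers unfamiliar with `doit`, each task listed above is a `task_*` function in `dodo.py` whose docstring becomes the description shown by `doit list`. The sketch below is an assumption about how three of these tasks could be declared, not the project's actual `dodo.py`: the script invocations are hypothetical, and only the task names, docstrings, and target paths come from this notebook.

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
ZIP_FILE = RAW_DIR / "walmart-product-data-2019.zip"
CSV_FILE = RAW_DIR / (
    "marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv"
)


def task_create_data_dir():
    """Create the data directory."""
    return {
        # Python action given as (callable, args, kwargs).
        "actions": [(RAW_DIR.mkdir, [], {"parents": True, "exist_ok": True})],
    }


def task_download_data():
    """Download the dataset from Kaggle."""
    return {
        "actions": ["python download_dataset.py"],  # hypothetical invocation
        "targets": [str(ZIP_FILE)],
        "task_dep": ["create_data_dir"],
        # No file dependencies: treat as up to date once the target exists.
        "uptodate": [True],
        "clean": True,  # `doit clean` removes the targets
    }


def task_unpack_data():
    """Unpack the raw dataset (.zip) as the (.csv) file."""
    return {
        "actions": ["python unpack_dataset.py"],  # hypothetical invocation
        "file_dep": [str(ZIP_FILE)],
        "targets": [str(CSV_FILE)],
        "clean": True,
    }
```

Declaring the `.zip` as both a target of `download_data` and a file dependency of `unpack_data` is what lets `doit` infer the execution order and skip work that is already done.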


In [3]:
# Uncomment the line below to remove the existing targets and force fresh copies of the data.
# !doit clean

clean_data - removing file 'data/interim/ecommerce_data-cleaned-0.1.csv'
unpack_data - removing file 'data/raw/marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv'
download_data - removing file 'data/raw/walmart-product-data-2019.zip'


In [6]:
!doit clean_data

-- create_data_dir
-- create_env
-- download_data
-- unpack_data
-- clean_data
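
In the output above, every task is prefixed with `--`. To my understanding of `doit`'s default console reporter, `--` marks a task that was skipped because it is already up to date, `.` marks a task that was executed, and `!!` marks an ignored task. The helper below, whose name is hypothetical, is a small sketch that turns such output into a dictionary under that assumption.

```python
# Prefix meanings assumed from doit's default console reporter.
STATUS_BY_PREFIX = {".": "executed", "--": "up-to-date", "!!": "ignored"}


def summarize_doit_output(text):
    """Map each task name in doit's console output to its status."""
    summary = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[0] in STATUS_BY_PREFIX:
            summary[parts[1].strip()] = STATUS_BY_PREFIX[parts[0]]
    return summary
```

Applied to the output of this cell, every task would be reported as up to date, meaning nothing was downloaded, unpacked, or cleaned again on this run.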


In [5]:
# Display the files in the data/raw folder.
%ls "data/raw/"

 Volume in drive D is Data
 Volume Serial Number is 6C42-7B68

 Directory of D:\Repositories\rit\ISTE780\Project\data\raw

11/05/2021  04:56 PM    <DIR>          .
11/05/2021  04:56 PM    <DIR>          ..
11/05/2021  04:56 PM        45,653,183 marketing_sample_for_walmart_com-ecommerce__20191201_20191231__30k_data.csv
11/05/2021  04:56 PM        14,308,028 walmart-product-data-2019.zip
               2 File(s)     59,961,211 bytes
               2 Dir(s)  462,156,988,416 bytes free
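
The listing above is in the Windows `dir` format, so `%ls` output will look different on other platforms. A platform-independent alternative can be sketched with `pathlib`; the helper name `list_files` is hypothetical.

```python
from pathlib import Path


def list_files(folder):
    """Return (name, size_in_bytes) pairs for files directly inside folder."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(folder).iterdir()
        if path.is_file()
    )
```

For example, `for name, size in list_files("data/raw"): print(f"{size:>12,} {name}")` prints one line per file with its size in bytes.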
