# Data Collection

## Objectives

*   Fetch the data from the Kaggle competition and save it as raw data
*   Get a data summary
*   Inspect the shape of the data
*   Output the data and save it in outputs/datasets/collection


## Inputs

* Kaggle JSON file (`kaggle.json`), which contains the authentication token

## Outputs

* Generated Dataset saved as outputs/datasets/collection/HousePriceRegression.csv

## Additional Comments


* This data set does not contain any private or sensitive information, so it is committed to the repository. This would usually not be the case when data needs to be protected.



---

# Change Working Directory

In order to install Kaggle and fetch the data from Kaggle, the current working directory needs to be changed to the root level of the workspace.

In [1]:
import os

# Get the current working directory (cwd)
cwd = os.getcwd()
print(f"* Previous current working directory: {cwd}")

# Make the parent of the cwd the new cwd
os.chdir(os.path.dirname(cwd))
cwd = os.getcwd()
print(f"* Updated current working directory: {cwd}")

* Previous current working directory: /workspace/house-price-regression/jupyter_notebooks
* Updated current working directory: /workspace/house-price-regression
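Note that the cell above moves up exactly one level each time it runs, so executing it twice would leave the workspace. A minimal sketch of an idempotent alternative is shown below. It assumes that the project root is the directory containing the `jupyter_notebooks` folder seen in the output above; the function name `find_project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="jupyter_notebooks"):
    """Return `start` or its nearest ancestor that contains `marker`.

    Assumption: the project root is identified by a `jupyter_notebooks`
    subfolder, as suggested by the paths printed above.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(
        f"No directory containing '{marker}' found at or above {start}"
    )


# Usage in a notebook cell (left commented so the sketch has no side effects):
# import os
# os.chdir(find_project_root(os.getcwd()))
# print(f"* Updated current working directory: {os.getcwd()}")
```

Because the search starts at the current directory itself, re-running the cell after the directory has already been changed leaves it at the project root.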


---

# Fetch Data From Kaggle

In order to fetch data from Kaggle using the Kaggle API, the `kaggle` package needs to be installed in the workspace. An access token can be obtained by following the instructions at the [Kaggle API repository](https://github.com/Kaggle/kaggle-api).

> If a cloud-based development environment is being used (e.g. Gitpod), the `kaggle.json` credentials file which gets downloaded to your local machine when creating a key should be uploaded to the remote workspace.

**NB** Your Kaggle API key is private and should not be shared.

In [2]:
# Install Kaggle
! pip install kaggle==1.5.12

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.12/bin/python3 -m pip install --upgrade pip' command.


In [3]:
# Tell the Kaggle client to look for kaggle.json in the current working directory
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Set read/write permission for file owner only
! chmod 600 kaggle.json
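The `chmod` call above fails with a shell error if `kaggle.json` has not been uploaded yet. As a rough sketch, the same setup can be done in Python with an explicit check for the file. The helper name `prepare_kaggle_credentials` is hypothetical, and the permission handling assumes a POSIX system such as the Gitpod workspace mentioned above.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_credentials(config_dir):
    """Point the Kaggle client at `config_dir` and lock down kaggle.json."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"Expected credentials file at {token_path}")
    # Equivalent of `chmod 600`: read/write permission for the owner only
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    # KAGGLE_CONFIG_DIR tells the Kaggle client where to find kaggle.json
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(config_dir))
    return token_path


# Usage in a notebook cell, assuming kaggle.json sits in the working directory:
# prepare_kaggle_credentials(os.getcwd())
```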

Download the data to the destination folder and unzip it.

In [7]:
# Define data source path
kaggle_dataset_path = "house-prices-advanced-regression-techniques"

# Define destination folder
destination_folder = "inputs/datasets/raw"

# Download the data
! kaggle competitions download -c {kaggle_dataset_path} -p {destination_folder}

# Unzip downloaded file and remove the zip file
! unzip {destination_folder}/*.zip -d {destination_folder} \
  && rm {destination_folder}/*.zip

Downloading house-prices-advanced-regression-techniques.zip to inputs/datasets/raw
  0%|                                                | 0.00/199k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 199k/199k [00:00<00:00, 70.8MB/s]
Archive:  inputs/datasets/raw/house-prices-advanced-regression-techniques.zip
  inflating: inputs/datasets/raw/data_description.txt  
  inflating: inputs/datasets/raw/sample_submission.csv  
  inflating: inputs/datasets/raw/test.csv  
  inflating: inputs/datasets/raw/train.csv  
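The `unzip {destination_folder}/*.zip` call above relies on the folder containing exactly one archive, since `unzip` treats any further arguments as member names to extract. A minimal sketch of the same step using Python's standard `zipfile` module is given below; the function name `extract_and_remove_archives` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archives.

    Returns the names of all extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Only reached if extraction did not raise
        archive.unlink()
    return extracted


# Usage in a notebook cell, with the destination folder defined above:
# extract_and_remove_archives("inputs/datasets/raw")
```

Deleting each archive only after it has been extracted mirrors the `&&` in the shell version, so a failed extraction leaves the zip file in place.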


# **Push** generated/new files from this session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
# Stage all changes, then commit; the quotes keep a multi-word message as one argument
!git add .
!git commit -m "{CommitMsg}"

* Git push

In [None]:
!git push origin main
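The add, commit and push steps can also be sketched as a single Python helper. Passing each command to `subprocess.run` as a list avoids shell quoting issues when the commit message contains spaces. The function names `git_commands` and `run_all` are hypothetical; the remote `origin` and branch `main` are taken from the cell above.

```python
import subprocess


def git_commands(commit_message, remote="origin", branch="main"):
    """Build the add/commit/push commands as argument lists (no shell quoting needed)."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", commit_message],
        ["git", "push", remote, branch],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Usage in a notebook cell (commented out because it modifies the repository):
# run_all(git_commands("Add raw Kaggle data"))
```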


---