# Data Collection for House Price Predictors

This notebook documents the steps to collect the house price dataset from Kaggle using the Kaggle API. We will set up the Kaggle API, download the dataset, and extract it to the appropriate directory.


## Kaggle API Setup

### 1. Install the Kaggle Package

If you have not installed the Kaggle package, you can install it using the following command:

```bash
pip install kaggle
```

### 2. Kaggle API Token

- Go to your Kaggle account and navigate to the `Account` tab.
- Scroll down to the `API` section and click on `Create New API Token`.
- This will download a `kaggle.json` file to your computer.
- Move the `kaggle.json` file to the `~/.kaggle/` directory on Unix-based systems or `C:\Users\<Windows-username>\.kaggle\` on Windows.
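
The last step above can also be scripted. The following is a minimal sketch, not part of the project: the helper name `install_kaggle_token` and its parameters are hypothetical. It copies a downloaded token into the config directory and restricts the file to owner read/write, since the Kaggle CLI prints a warning when the key is readable by other users.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(downloaded_token, config_dir=None):
    """Copy kaggle.json into the Kaggle config directory and restrict its permissions.

    Hypothetical helper for illustration; it is not part of the Kaggle package.
    """
    # Default to ~/.kaggle, the location described in the steps above.
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)

    target = config_dir / "kaggle.json"
    shutil.copyfile(downloaded_token, target)

    # Owner read/write only. On Windows, chmod only toggles the read-only flag.
    os.chmod(target, 0o600)
    return target
```

For example, `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")` would place the token in `~/.kaggle/`, assuming the browser saved it to a `Downloads` folder.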

### 3. Set Environment Variables (optional but recommended for security)

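Instead of keeping `kaggle.json` on disk, you can supply your Kaggle username and key as environment variables, so the credentials are not stored in a file. Use the `username` and `key` values from the downloaded `kaggle.json`.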

```bash
export KAGGLE_USERNAME=your_kaggle_username
export KAGGLE_KEY=your_kaggle_key
```

On Windows (Command Prompt), use `set` instead:

```bat
set KAGGLE_USERNAME=your_kaggle_username
set KAGGLE_KEY=your_kaggle_key
```
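
Before downloading, it can help to confirm that one of the two credential sources above is actually in place. The sketch below is illustrative only: `find_kaggle_credentials` is a hypothetical helper, and it only inspects the environment variables and the `kaggle.json` file described in this notebook without contacting Kaggle.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(env=None, config_dir=None):
    """Return 'environment', 'kaggle.json', or None, depending on where credentials are found.

    Hypothetical helper for illustration; it does not validate the credentials with Kaggle.
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"

    # Fall back to the token file in ~/.kaggle (or the directory passed in).
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    token_file = config_dir / "kaggle.json"
    if token_file.is_file():
        try:
            token = json.loads(token_file.read_text())
        except json.JSONDecodeError:
            return None
        if isinstance(token, dict) and token.get("username") and token.get("key"):
            return "kaggle.json"
    return None
```

Calling `find_kaggle_credentials()` with no arguments checks the current environment and `~/.kaggle/`; a result of `None` suggests that neither setup step has been completed.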

## Data Collection Script

We use a Python script to download the dataset from Kaggle. The script is saved as `src/data_collection.py`.

```python
import os
import subprocess
import zipfile


def download_and_extract_data(competition, file_path):
    # Download the competition files with the Kaggle CLI. check=True raises
    # CalledProcessError if the download fails (for example, bad credentials),
    # so the script does not report success after a failed download.
    subprocess.run(
        ['kaggle', 'competitions', 'download', '-c', competition, '-p', file_path],
        check=True,
    )

    # Extract each downloaded zip archive, then delete the archive
    for file in os.listdir(file_path):
        if file.endswith('.zip'):
            zip_path = os.path.join(file_path, file)
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(file_path)
            os.remove(zip_path)


if __name__ == '__main__':
    competition = "house-prices-advanced-regression-techniques"
    file_path = "data/raw/"

    # Create the directory if it does not exist
    os.makedirs(file_path, exist_ok=True)

    # Download and extract the data
    download_and_extract_data(competition, file_path)
    print("Data downloaded and extracted successfully!")
```

## Running the Script

Navigate to the project directory and run the script to download and extract the data:

```bash
python src/data_collection.py
```

## Verify the Data

After running the script, verify that the data has been downloaded and extracted correctly by checking the `data/raw/` directory. You should see the following files:

- `train.csv`
- `test.csv`
- `sample_submission.csv`
- `data_description.txt`
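
The check above can also be done programmatically. This is a minimal sketch under the assumption that the files land directly in `data/raw/`; the helper name `find_missing_files` and the `EXPECTED_FILES` constant are hypothetical and simply mirror the list above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "train.csv",
    "test.csv",
    "sample_submission.csv",
    "data_description.txt",
)


def find_missing_files(directory="data/raw", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the directory."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).is_file()]
```

An empty list from `find_missing_files()` means all four files are present; any names it returns point to a download or extraction step that did not complete.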