# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect data and save it under outputs/datasets/datacollection.
* Assess the need for data cleaning based on initial inspection.

## Inputs

* Kaggle JSON file - this is the authentication token required to access the Kaggle dataset.

## Outputs

* Generate dataset: outputs/datasets/datacollection/house_prices_records.csv




## Notebook in Relation to CRISP-DM

This notebook covers the data collection step, which falls within the Data Understanding phase of CRISP-DM.

## Kernel Selection

Please use a Python 3.8.12 kernel to run this notebook. This application has been developed in a local workspace using Python 3.8.12. If you use a different version of Python to run this notebook, some of the required dependencies may not function correctly.

---

## Install Python Packages in the Notebooks

In [None]:
%pip install -r /workspace/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt

# Change Working Directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
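
As a quick illustration of what `os.path.dirname()` returns, the sketch below applies it to a hypothetical path. The folder name `/workspace/my-project/jupyter_notebooks` is an assumption made for illustration and is not necessarily this project's layout. The call is a pure string operation, so the path does not need to exist.

```python
import os

# Hypothetical notebook folder, used only for illustration.
example_dir = "/workspace/my-project/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Passing that parent folder to `os.chdir()` is what the cells above do with the real working directory.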

# Fetch Data from Kaggle

Install Kaggle package to collect data from Kaggle.

In [None]:
%pip install kaggle==1.5.12

You will need a Kaggle token to fetch the data required to run these notebooks. If you do not already have a Kaggle token, follow these steps:

1. Visit www.kaggle.com and register an account.

2. When you have finished registering your account, log in.

3. Click on your profile avatar in the top right-hand corner of the browser window.

4. Scroll down to the "API" section.

5. Click on "Create New Token". This automatically creates your Kaggle token as a JSON file.

6. Depending on your browser, this kaggle.json file should download automatically.

7. Download the kaggle.json file manually if this is not the case.


# Make Kaggle Token Available in Your Coding Environment

Once your Kaggle token has been downloaded, you can manually click, hold and drag it from your Downloads folder on your device (or wherever else you may have chosen to store it) into the file directory of your coding environment.

After you have put your Kaggle token in your coding environment's file directory, run the cell below so that it is recognised within the session.

In [None]:
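import os
# Tell the Kaggle CLI to look for kaggle.json in the current working directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file so that only its owner can read and write it
! chmod 600 kaggle.json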
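
Before downloading, it can help to confirm that the token file is where the session expects it. The following is a minimal sketch, assuming the token is a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical and is not part of the Kaggle package.

```python
import json
import os


def check_kaggle_token(path="kaggle.json"):
    """Return True if the file exists and holds both expected fields."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path) as token_file:
            token = json.load(token_file)
    except (ValueError, OSError):
        # Not valid JSON, or the file could not be read.
        return False
    return isinstance(token, dict) and {"username", "key"}.issubset(token)


# Looks for kaggle.json in the current working directory.
print(check_kaggle_token("kaggle.json"))
```

If this prints `False`, the token is probably missing from the current directory or was not saved correctly.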

Get the dataset path from its Kaggle URL. For this application, use the dataset from this URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

Run the following cell to define the Kaggle dataset within the context of your environment and session, set its destination folder and download it.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Run the following cell to unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
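
If the `unzip` command is not available in your environment, the same extraction step can be approximated with Python's standard library. This is a sketch only: the helper name `extract_and_remove_zips` is hypothetical, and it does not delete kaggle.json, which you would still need to remove yourself.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully.
        zip_path.unlink()
    return extracted


# Example call, assuming the destination folder used above:
# extract_and_remove_zips("inputs/datasets/raw")
```

The function returns the names of the extracted files, which makes it easy to see where the CSV ended up.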

---

# Load and Inspect Kaggle Data

In [None]:
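import pandas as pd
import numpy as np
# Load the raw CSV extracted from the Kaggle download
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
# Show every column when displaying the dataframe
pd.set_option('display.max_columns', None)
df.head(40)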
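
The path above depends on the folder names inside the downloaded archive. If those names differ in your download, one option is to search for the file by name instead. The sketch below assumes the CSV is still called house_prices_records.csv; the helper name `find_csv` is hypothetical.

```python
from pathlib import Path


def find_csv(root, filename="house_prices_records.csv"):
    """Return the first file named `filename` found under `root`, or None."""
    matches = sorted(Path(root).rglob(filename))
    return matches[0] if matches else None


# Example use, assuming the raw data folder from the download step:
# csv_path = find_csv("inputs/datasets/raw")
# if csv_path is not None:
#     df = pd.read_csv(csv_path)
```

Returning `None` rather than raising lets you decide how to handle a missing file, for example by re-running the download cell.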

Generate dataframe summary.

In [None]:
df.info()

Check the dataset for missing values.

In [None]:
df.isnull().sum()

Some columns contain missing values. This may be because these features do not apply to every property in the dataset.
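
To judge how much cleaning may be needed, it can be useful to see missing values as a percentage as well as a count. The sketch below uses a small toy dataframe invented for illustration, not the housing data, and the helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values for affected columns."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only; not taken from the housing dataset.
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2, 3, 4],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset would give the same kind of table for the real columns.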

Check for data entries that have been duplicated.

In [None]:
df[df.duplicated(subset=None)]

Check the unique values in columns with non-numerical data, as these columns are likely to be categorical.

In [None]:
for col in df:
    if df[col].dtype == 'object':
        print(col, '-', df[col].unique())

---

# Push Files to Repo

Save the data file in a local folder named 'datacollection' under 'outputs/datasets/' and push to repository.

In [None]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/datacollection', exist_ok=True)

# Save the collected data without the dataframe index
df.to_csv("outputs/datasets/datacollection/house_prices_records.csv", index=False)
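
After saving, you may want to confirm that the file reads back as expected. The following sketch shows the idea on a toy dataframe written to a temporary folder; the data and file name are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(out_path, index=False)
    # Read the file back while the temporary folder still exists.
    reloaded = pd.read_csv(out_path)

# Compare the basic structure of the saved and reloaded data.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

The same comparison can be applied to `df` and the CSV saved under outputs/datasets/datacollection.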