# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/raw

## Inputs

* Kaggle JSON file (kaggle.json): the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/raw/house_prices_records.csv
* Generate Dataset: outputs/datasets/raw/inherited_houses.csv

### CRISP-DM
* Data Collection


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
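
Re-running the `os.chdir` cell above moves one level further up each time it is executed. A minimal sketch of a guard against that, assuming the notebooks live in a subfolder called `jupyter_notebooks` (a hypothetical name, not taken from this project; substitute your own folder name):

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while still inside the notebook subfolder.

    `notebook_folder` is an assumed folder name used for illustration.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        # Still inside the notebook subfolder: step up to its parent
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()

# Safe to call more than once; later calls leave the directory unchanged
print(move_to_project_root())
```

Because the function checks the folder name first, running the cell twice does not leave the project folder.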

# Fetch Data

Data fetched from https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

In [None]:
! pip install kaggle==1.5.12

* Drag kaggle.json into the session

In [None]:
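import os
# Tell the Kaggle CLI to look for kaggle.json in the current working directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json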

* Get the dataset path from the Kaggle URL.
* Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
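
The shell command above relies on `unzip` and `rm` being available. A portable alternative can be sketched with the standard library's `zipfile` module. The helper name `extract_and_remove_zips` is hypothetical, and this sketch only handles the archives; it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path

def extract_and_remove_zips(destination_folder):
    """Extract every .zip in destination_folder in place, then delete the archive.

    Returns the list of member names that were extracted.
    """
    extracted = []
    for zip_path in sorted(Path(destination_folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(destination_folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted without error
        zip_path.unlink()
    return extracted

# Example call, using the destination folder defined earlier in this notebook:
# extract_and_remove_zips("inputs/datasets/raw")
```

Deleting each archive only after its extraction finishes means a failed extraction leaves the zip file in place for another attempt.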

---

# Load and Inspect Kaggle data

**House Prices Data**

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

### Dataset explanation
* See the README for the explanation and meaning of the data.

* DataFrame Summary

In [None]:
df.info()

* Checking for duplicates:

In [None]:
df[df.duplicated(subset=None, keep='first')]

* No duplicates in House Prices Data

**Inherited Houses Data**

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

* Inherited dataframe summary

In [None]:
df_inherited.info()

* Checking for duplicates

In [None]:
df_inherited[df_inherited.duplicated(subset=None, keep='first')]

* No duplicates in Inherited Houses Data

### Comments
* We noticed 23 columns in Inherited Houses Data and 24 columns in House Prices Data.
* We noticed both floats and integers in both dataframes. When cleaning the data, consider converting floats to int: none of the features needs decimal precision, and working only with int would make things easier.
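
The two comments above can be explored with a short sketch. The toy dataframes, column names, and the helper `floats_to_nullable_int` are hypothetical stand-ins, not the real datasets. One caveat worth noting: a plain `astype(int)` raises an error on a float column that contains missing values, so this sketch uses pandas' nullable `Int64` dtype instead, and only converts columns whose non-missing values are all whole numbers.

```python
import pandas as pd

# Toy frames standing in for the two datasets; names and values are illustrative only
houses = pd.DataFrame({"Area": [856.0, None, 920.0], "Price": [208500, 181500, 223500]})
inherited = pd.DataFrame({"Area": [896.0, 1329.0]})

# Columns present in the first frame but missing from the second
only_in_houses = sorted(set(houses.columns) - set(inherited.columns))

def floats_to_nullable_int(frame):
    """Convert float columns that hold only whole numbers (or NaN) to nullable Int64."""
    converted = frame.copy()
    for column in converted.select_dtypes(include="float").columns:
        values = converted[column].dropna()
        # Skip columns that carry real decimal information
        if (values == values.round()).all():
            converted[column] = converted[column].astype("Int64")
    return converted

houses_int = floats_to_nullable_int(houses)
print(only_in_houses)
print(houses_int.dtypes)
```

Missing values survive the conversion as `<NA>`, so handling them can still be decided later, during data cleaning.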

---

# Push files to Repo

In [None]:
import os
# Create outputs/datasets/raw if it does not exist yet
os.makedirs('outputs/datasets/raw', exist_ok=True)

df.to_csv("outputs/datasets/raw/house_prices_records.csv", index=False)
df_inherited.to_csv("outputs/datasets/raw/inherited_houses.csv", index=False)