# **Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate datasets: outputs/datasets/collection/HousePriceRecord.csv and outputs/datasets/collection/Inherited_houses.csv

## Additional Comments

* The Housing Prices Data was published on Kaggle by Code Institute.
* The data is public, so it can be pushed to a repository.


---

# Change working directory


* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
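
Re-running the change-directory cells moves one level further up each time. A minimal sketch of a repeat-safe alternative, assuming the project root can be recognised by a marker file; the helper name `move_to_project_root` and the `README.md` marker are hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(marker="README.md", start=None):
    """Walk upward from `start` (default: the current directory) until a
    directory containing `marker` is found, then make it the working directory.

    `marker` is an assumption: any file that only exists at the project root works.
    """
    current = Path(start or os.getcwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            os.chdir(candidate)
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {current}")
```

Calling the helper a second time from the root returns the same directory, so the cell can be re-run safely.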

# Data Collection

**Fetch Data from Kaggle**


Install Kaggle package

In [None]:
%pip install kaggle

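Set up Kaggle authentication: point KAGGLE_CONFIG_DIR at the current directory, where kaggle.json is expected, and restrict the file's permissions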

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Define the Kaggle dataset, destination folder and download it.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded Kaggle file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
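
The shell cell above relies on `unzip` and `rm` being available. A minimal sketch of the same unzip-and-delete step using only the Python standard library; the helper name `extract_and_remove_zips` is hypothetical, and deleting kaggle.json is deliberately left out:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive found in `folder` into that folder,
    delete each archive afterwards, and return the extracted member names."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are on disk
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`.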

---

**Load and Inspect Kaggle data**

We read and inspect house_prices_records.csv

In [None]:
import pandas as pd
df1 = pd.read_csv("inputs/datasets/raw/house-price/house-price/house_prices_records.csv")
df1.head()


In [None]:
df1.info()

We check the unique values of each object-type column

In [None]:
df1['BsmtExposure'].unique()

In [None]:
df1['BsmtFinType1'].unique()

In [None]:
df1['GarageFinish'].unique()

In [None]:
df1['KitchenQual'].unique()


* We can already see that the dataset has a lot of missing data.
* The target is SalePrice, which is an integer. This is what we want, as the ML model requires numeric variables.
* Some of the features are of type object; we will see in a later step whether we need to transform them into integers.
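
The per-column checks above can also be run in one pass. A minimal sketch, assuming pandas is installed; the helper name `summarise_non_numeric` is hypothetical, and it selects every non-numeric column rather than only those reported as object:

```python
import pandas as pd


def summarise_non_numeric(df):
    """Return (unique values per non-numeric column, missing-value counts).

    Only columns with at least one missing value appear in the counts,
    sorted from most to fewest missing entries.
    """
    unique_values = {
        col: df[col].unique().tolist()
        for col in df.select_dtypes(exclude="number").columns
    }
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    return unique_values, missing
```

Calling `summarise_non_numeric(df1)` would list the categories of columns such as BsmtExposure or KitchenQual together with the number of missing entries per column.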



We read and inspect inherited_houses.csv

In [None]:
df2 = pd.read_csv("inputs/datasets/raw/house-price/house-price/inherited_houses.csv")
df2.head()

In [None]:
df2.info()

The dataset for the inherited houses (df2) has the same features as df1. Note that one of the business requirements is to predict the SalePrice of the inherited houses in this dataset (df2).
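
Whether two tables share the same columns can be checked rather than read off by eye. A minimal sketch; the helper name `compare_columns` and the keys of its result are hypothetical:

```python
def compare_columns(reference_cols, other_cols):
    """Report column names present in one collection but not in the other."""
    reference, other = set(reference_cols), set(other_cols)
    return {
        "missing_in_other": sorted(reference - other),
        "extra_in_other": sorted(other - reference),
    }
```

Here it would be called as `compare_columns(df1.columns, df2.columns)`; two empty lists mean the column sets are identical.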

---

# Push files to Repo

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save both datasets without the pandas index column
df1.to_csv("outputs/datasets/collection/HousePriceRecord.csv", index=False)
df2.to_csv("outputs/datasets/collection/Inherited_houses.csv", index=False)

**HousePriceRecord.csv and Inherited_houses.csv have been saved in the folder outputs/datasets/collection.**