# **DATA COLLECTION**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token to access and download Kaggle data

## Outputs

* Generate Datasets:
    * inputs/datasets/raw/house_prices_records.csv
    * inputs/datasets/raw/inherited_houses.csv
* Metadata:
    * inputs/datasets/raw/house_price_records_metadata.txt

## Additional Comments

It is assumed that both CSV files follow the same data structure (because only one metadata file is provided); however, this is worth verifying later on.


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
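
Re-running the cells above moves the working directory one level further up each time. A minimal sketch of a guard against that is shown below. It assumes the notebooks live in a subfolder named `jupyter_notebooks`; that folder name and the helper `resolve_project_root` are hypothetical and should be adapted to the actual repository layout.

```python
import os


def resolve_project_root(cwd, notebooks_folder="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks subfolder."""
    # notebooks_folder is an assumed name, not taken from this notebook
    normalised = os.path.normpath(cwd)
    if os.path.basename(normalised) == notebooks_folder:
        return os.path.dirname(normalised)
    # Already outside the notebooks subfolder: leave the path unchanged
    return cwd
```

It could be used as `os.chdir(resolve_project_root(os.getcwd()))`, which leaves the working directory unchanged on a second run.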

# Fetch raw data from Kaggle

Install kaggle

In [None]:
! pip install kaggle==1.5.12

Make the kaggle authentication token available for the session. 

Make sure the token file is not committed (add its filename to .gitignore).

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
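
As a quick safeguard for the reminder about .gitignore, the sketch below checks whether a filename is listed there. The helper name `is_listed_in_gitignore` is hypothetical, and it only matches exact lines; it does not evaluate glob patterns the way Git does.

```python
from pathlib import Path


def is_listed_in_gitignore(filename, gitignore_path=".gitignore"):
    """Return True if filename appears as its own line in the .gitignore file."""
    path = Path(gitignore_path)
    if not path.is_file():
        # No .gitignore found, so nothing is being ignored
        return False
    entries = [line.strip() for line in path.read_text().splitlines()]
    return filename in entries
```

For example, `is_listed_in_gitignore("kaggle.json")` returns `False` when the token file would still be picked up by `git add`.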

- Define path to the Kaggle dataset we want to download
- Indicate the destination folder for the downloaded data
- Download the data

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

- Unzip the downloaded archive
- Remove the zip file
- Remove the kaggle JSON file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
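
The shell command above depends on `unzip` and `rm` being available. A rough, portable alternative using only the Python standard library is sketched below. The function name `extract_and_remove_zips` is hypothetical, and the sketch does not delete kaggle.json.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip archive in folder into folder, then delete the archive."""
    processed = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        # Delete the archive only after a successful extraction
        os.remove(zip_path)
        processed.append(zip_path)
    return processed
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`.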

# Load and Inspect Kaggle Data

In [None]:
import pandas as pd
df_prices_records = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_prices_records.head()

Load and Inspect inherited_houses file

In [None]:
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses.head()

# Conclusion

- A first look reveals that the two CSV files share the same columns except **SalePrice**, which is missing from inherited_houses.csv. This is expected because the inherited houses have not been sold.

- Some columns have missing values.

- Column names are shorthand. The metadata file explains the abbreviations.
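
The first two observations can be checked in code rather than by eye. The sketch below is one possible way to do it; the helper names `compare_columns` and `missing_value_counts` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def compare_columns(df_a, df_b):
    """Return the columns found only in df_a and the columns found only in df_b."""
    only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
    only_in_b = sorted(set(df_b.columns) - set(df_a.columns))
    return only_in_a, only_in_b


def missing_value_counts(df):
    """Return the number of missing values for each column that has at least one."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `compare_columns(df_prices_records, df_inherited_houses)` lists the columns unique to each file, and `missing_value_counts(df_prices_records)` lists the columns with missing values.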

---

# Clean up file and folder organisation

Remove the unnecessary nested folder structure for readability.

**Current folder structure** 

/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

**Target folder structure**

/inputs/datasets/raw/house_prices_records.csv

Do this for all three files (two CSV files and one metadata file).

#### Note: the following code raises an error if you re-run the notebook back to back, because the folder structure has already been cleaned up and the files no longer exist at their original location.

#### If you run the notebook in sequence from the beginning, it should work fine.

In [None]:
import shutil

# Define the source directory and the target directory
source_dir = 'inputs/datasets/raw/house-price-20211124T154130Z-001/house-price'
target_dir = 'inputs/datasets/raw/'

# Define full path to directories
full_source_dir = os.path.join(current_dir, source_dir)
full_target_dir = os.path.join(current_dir, target_dir)

# Create the target directory if it doesn't exist
os.makedirs(full_target_dir, exist_ok=True)

# List all files in the source directory
files = os.listdir(full_source_dir)

# Iterate over the files and move them to the target directory
for file in files:
    # Construct full file path to the file
    current_file_path = os.path.join(full_source_dir, file)
    target_file_path = os.path.join(full_target_dir, file)
    
    # Move the file to the target directory
    try:
        shutil.move(current_file_path, target_file_path)
        print(f"Moved {current_file_path} to {target_file_path}")
    except Exception as e:
        print(f"Error moving {current_file_path}: {e}")


Remove empty directory

In [None]:
# Remove the now-empty source folder and its parent folder
if not os.listdir(full_source_dir):
    parent_dir = os.path.dirname(full_source_dir)
    os.rmdir(full_source_dir)
    os.rmdir(parent_dir)
    print(f"Removed empty directories: {full_source_dir} and {parent_dir}")
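
To make the clean-up safe to re-run, the move can be wrapped in a check that the source folder still exists. This is a minimal sketch under that assumption; the function name `flatten_folder` is hypothetical, and it moves files without removing the emptied folders.

```python
import os
import shutil


def flatten_folder(source_dir, target_dir):
    """Move every entry from source_dir into target_dir; do nothing if source_dir is gone."""
    if not os.path.isdir(source_dir):
        # The folder was already cleaned up on an earlier run
        print(f"Nothing to move: {source_dir} does not exist")
        return []
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for name in os.listdir(source_dir):
        shutil.move(os.path.join(source_dir, name), os.path.join(target_dir, name))
        moved.append(name)
    return sorted(moved)
```

With the variables defined earlier, `flatten_folder(full_source_dir, full_target_dir)` returns the moved file names on the first run and an empty list on later runs.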

Note to self: remember to clear the outputs and push the code to the repo.