# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect data and save it under outputs/datasets/collection.
* Assess the need for data cleaning based on initial inspection.

## Inputs

* Kaggle JSON file - this is the authentication token required to access the Kaggle dataset.

## Outputs

* Generate dataset: outputs/datasets/collection/house_prices_records.csv




## Kernel Selection

Please use a Python 3.8.12 kernel to run this notebook. This application has been developed in a local workspace using Python 3.8.12. If you use a different version of Python to run this notebook, some of the required dependencies may not function correctly.
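
A quick way to confirm the kernel version before running the remaining cells is sketched below. The helper name `kernel_matches` is an assumption made for illustration and is not part of the project; the expected version is taken from the note above.

```python
import sys

# Version this notebook was developed with (from the Kernel Selection note).
EXPECTED_VERSION = (3, 8, 12)


def kernel_matches(expected=EXPECTED_VERSION, actual=None):
    """Return True when the interpreter version equals the expected one.

    Hypothetical helper, not part of the project code.
    """
    if actual is None:
        actual = tuple(sys.version_info[:3])
    return tuple(actual) == tuple(expected)


if not kernel_matches():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Warning: expected Python 3.8.12 but found {found}.")
```

The check only prints a warning, so the notebook still runs on other versions; whether the pinned dependencies install cleanly there is not guaranteed.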

---

## Install Python Packages in the Notebooks

In [2]:
%pip install -r /workspace/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt

Collecting altair==4.2.2
  Using cached altair-4.2.2-py3-none-any.whl (813 kB)
Collecting astor==0.8.1
  Using cached astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting attrs==23.1.0
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting backports.zoneinfo==0.2.1
  Using cached backports.zoneinfo-0.2.1-cp38-cp38-manylinux1_x86_64.whl (74 kB)
Collecting base58==2.1.1
  Using cached base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting blinker==1.6.2
  Using cached blinker-1.6.2-py3-none-any.whl (13 kB)
Collecting cachetools==5.3.1
  Using cached cachetools-5.3.1-py3-none-any.whl (9.3 kB)
Collecting certifi==2023.7.22
  Using cached certifi-2023.7.22-py3-none-any.whl (158 kB)
Collecting charset-normalizer==3.2.0
  Using cached charset_normalizer-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (199 kB)
Collecting click==7.1.2
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting cycler==0.11.0
  Using cached cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collect

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/CI-PP5-Peter-Regan-Heritage-Housing-Project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/CI-PP5-Peter-Regan-Heritage-Housing-Project'
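
Running the change-directory cell a second time moves up one more level than intended. A guarded variant is sketched below, assuming the notebooks live in a folder named `jupyter_notebooks` (as the first path printed above shows). `move_to_project_root` is a hypothetical helper name.

```python
import os


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only while still inside the notebooks folder.

    Hypothetical helper; the default folder name is an assumption taken
    from the path printed by os.getcwd() earlier in this notebook.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_dir_name:
        os.chdir(os.path.dirname(current_dir))
    # Return the resulting working directory so the caller can confirm it.
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory unchanged after the first call, so re-running the cell is safe.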

# Fetch Data from Kaggle

Install the Kaggle package to collect data from Kaggle.

In [6]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 2.3 MB/s
Collecting python-slugify
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 5.3 MB/s
Using legacy 'setup.py install' for kaggle, since package 'wheel' is not installed.
Installing collected packages: text-unidecode, python-slugify, kaggle
    Running setup.py install for kaggle ... done
Successfully installed kaggle-1.5.12 python-slugify-8.0.1 text-unidecode-1.3
You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.12/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


You will need a Kaggle token to fetch the data required to run these notebooks. If you do not already have a Kaggle token, follow these steps:

1. Visit www.kaggle.com and register an account.

2. When you have finished registering your account, log in.

3. Click your profile avatar in the top right-hand corner of the browser window and open your account settings.

4. Scroll down to the "API" section.

5. Click on "Create New Token". This automatically generates your Kaggle token as a JSON file named kaggle.json.

6. Depending on your browser, this kaggle.json file should download automatically.

7. Download the kaggle.json file manually if this is not the case.


# Make Kaggle Token Available in Your Coding Environment

Once your Kaggle token has been downloaded, you can manually click, hold and drag it from your Downloads folder on your device (or wherever else you may have chosen to store it) into the file directory of your coding environment.

After you have put your Kaggle token in your coding environment's file directory, run the cell below so that it is recognised within the session.

In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
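
If the download later fails with an authentication error, it can help to confirm that the token file is present and well-formed. The sketch below assumes the token is a JSON object with `username` and `key` fields, which is the layout of tokens downloaded from Kaggle. `check_kaggle_token` is a hypothetical helper, and it uses `os.chmod` as a pure-Python counterpart to the `chmod 600` line in the cell above.

```python
import json
import os
import stat


def check_kaggle_token(folder="."):
    """Validate kaggle.json in `folder` and restrict it to owner read/write.

    Hypothetical helper. Assumes the token holds "username" and "key" fields.
    """
    token_path = os.path.join(folder, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"No kaggle.json found in {os.path.abspath(folder)}"
        )

    with open(token_path) as token_file:
        token = json.load(token_file)

    missing = [field for field in ("username", "key") if field not in token]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")

    # Counterpart of `chmod 600`: only the owner may read and write the file.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

A typical call would be `check_kaggle_token(os.getcwd())`, run after the token has been placed in the project folder.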

Get the dataset path from its Kaggle URL. For this application, use the dataset from this URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

Run the following cell to define the Kaggle dataset within the context of your environment and session, set its destination folder and download it.

In [8]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.13MB/s]


Run the following cell to unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  


---

# Load and Inspect Kaggle Data

In [11]:
import pandas as pd
import numpy as np
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
pd.set_option('display.max_columns', None)
df.head(40)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,,2000,2000,250000
5,796,566.0,1.0,No,732,GLQ,64,,480,Unf,1993.0,1362,TA,14115,85.0,0.0,30,5,5,796,,1993,1995,143000
6,1694,0.0,3.0,Av,1369,GLQ,317,,636,RFn,2004.0,1694,Gd,10084,75.0,186.0,57,5,8,1686,,2004,2005,307000
7,1107,983.0,3.0,Mn,859,ALQ,216,,484,,1973.0,2090,TA,10382,,240.0,204,6,7,1107,,1973,1973,200000
8,1022,752.0,2.0,No,0,Unf,952,,468,Unf,1931.0,1774,TA,6120,51.0,0.0,0,5,7,952,,1931,1950,129900
9,1077,0.0,2.0,No,851,GLQ,140,,205,RFn,1939.0,1077,TA,7420,50.0,0.0,4,6,5,991,,1939,1950,118000
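
The rows above already show gaps in several columns (for example `EnclosedPorch` and `WoodDeckSF`). One way to quantify this when assessing the need for data cleaning is sketched below. `missing_value_report` is a hypothetical helper, and the small `sample` frame copies five columns from the first four rows displayed above so that the block runs on its own.

```python
import pandas as pd


def missing_value_report(df):
    """Return the columns that contain missing values, with counts and percentages.

    Hypothetical helper for a first look at data-cleaning needs.
    """
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps, largest count first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Five columns from the first four rows displayed above.
sample = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, 866.0, None],
    "BedroomAbvGr": [3.0, 3.0, 3.0, None],
    "EnclosedPorch": [0.0, None, 0.0, None],
    "WoodDeckSF": [0.0, None, None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})
print(missing_value_report(sample))
```

Applied to the full dataset, the call would be `missing_value_report(df)`; the figures for the four-row sample say nothing about the proportions in the complete file.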


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* Run all notebook cells top-down. Do not build a workflow that requires returning to an earlier cell to perform a task, such as refreshing the contents of a variable.

---

# Push files to Repo

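* The cell below creates the output folder and saves the inspected dataset to the location listed under Outputs.

In [None]:
import os
try:
    # Create the output folder listed under Outputs; do nothing if it already exists
    os.makedirs(name='outputs/datasets/collection', exist_ok=True)
    # Save the dataset loaded in the inspection cell above
    df.to_csv('outputs/datasets/collection/house_prices_records.csv', index=False)
except Exception as e:
    print(e)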
