# Data Collection

## Objectives

*   Fetch data from Kaggle competition and save as raw data
*   Get a data summary
*   Inspect the shape of the data
*   Output the data and save in outputs/datasets/collection


## Inputs

* Kaggle API authentication token (the `kaggle.json` file downloaded from Kaggle)

## Outputs

* Generated Dataset saved as outputs/datasets/collection/HousePriceRegression.csv

## Additional Comments


* This data set does not contain any private or sensitive information, so it is committed to the repository. This would not usually be the case when data needs to be protected.



---

# Change Working Directory

In order to install Kaggle and fetch the data from Kaggle, the current working directory needs to be changed to the root level of the workspace.

In [4]:
import os

# Get the current working directory (cwd)
cwd = os.getcwd()
print(f"[*] Previous working directory: {cwd}")

# Make the parent of the cwd the new cwd
os.chdir(os.path.dirname(cwd))
cwd = os.getcwd()
print(f"[*] Updated current working directory: {cwd}")

[*] Previous working directory: /workspace/house-price-regression/jupyter_notebooks
[*] Updated current working directory: /workspace/house-price-regression
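Running the cell above a second time would move the working directory up another level. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above (the helper name `project_root` is hypothetical):

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder.

    Any other directory is returned unchanged, so calling this twice
    does not climb above the project root.
    """
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Example with the path printed above (no directory change is performed here)
print(project_root(Path("/workspace/house-price-regression/jupyter_notebooks")))
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`.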


---

# Fetch Data From Kaggle

To fetch data using the Kaggle API, the `kaggle` package needs to be installed in the workspace. An access token can be obtained by following the instructions at the [Kaggle API repository](https://github.com/Kaggle/kaggle-api).

> If a cloud-based development environment is being used (e.g. Gitpod), the `kaggle.json` credentials file which gets downloaded to your local machine when creating a key should be uploaded to the remote workspace.

**NB** Your Kaggle API key is private and should not be shared.

In [2]:
# Install Kaggle
! pip install kaggle==1.5.12



In [3]:
# Set up token variable
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Set read/write permission for file owner only
! chmod 600 kaggle.json
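Before downloading, it can help to confirm that the credentials file is actually in place and restricted to its owner. The following is a sketch only, assuming a POSIX file system; the helper name `credentials_ready` is hypothetical and not part of the Kaggle package:

```python
import stat
from pathlib import Path


def credentials_ready(config_dir: str, filename: str = "kaggle.json") -> bool:
    """Return True if the credentials file exists and grants no
    permissions to group or others (as is the case after chmod 600)."""
    path = Path(config_dir) / filename
    if not path.is_file():
        return False
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

In the notebook, `credentials_ready(os.environ['KAGGLE_CONFIG_DIR'])` would be expected to return `True` once the cell above has run successfully.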

Download the data to destination folder and unzip it.

In [7]:
# Define data source path
kaggle_dataset_path = "house-prices-advanced-regression-techniques"

# Define destination folder
destination_folder = "inputs/datasets/raw"

# Download the data
! kaggle competitions download -c {kaggle_dataset_path} -p {destination_folder}

# Unzip downloaded file and remove the zip file
! unzip {destination_folder}/*.zip -d {destination_folder} \
  && rm {destination_folder}/*.zip

Downloading house-prices-advanced-regression-techniques.zip to inputs/datasets/raw
  0%|                                                | 0.00/199k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 199k/199k [00:00<00:00, 70.8MB/s]
Archive:  inputs/datasets/raw/house-prices-advanced-regression-techniques.zip
  inflating: inputs/datasets/raw/data_description.txt  
  inflating: inputs/datasets/raw/sample_submission.csv  
  inflating: inputs/datasets/raw/test.csv  
  inflating: inputs/datasets/raw/train.csv  
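The `unzip` and `rm` shell commands above may not be available in every environment. As a rough alternative, the same step can be sketched with the standard-library `zipfile` module; the function name `extract_and_remove` is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_and_remove(folder: str) -> list:
    """Extract every .zip archive found in folder into that folder,
    delete each archive afterwards, and return the extracted names."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been closed
        archive.unlink()
    return extracted
```

Calling `extract_and_remove(destination_folder)` after the download would be expected to leave the same four files listed in the output above.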


# Inspect Data

The data set downloaded in the previous section contains data that has already been split into training and test sets.

In this section, we'll:

- import the data into a DataFrame
- get information about the DataFrame
- get the data types for the different label columns
- look for missing values

In [5]:
# Import train.csv into a DataFrame and inspect the first 5 entries
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/train.csv")
df.head()

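|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |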


Get a summary of the `df` DataFrame.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
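The non-null counts above already point to missing values, for example `LotFrontage` (1201 of 1460) and `Alley` (91 of 1460). A minimal sketch of how these could be tabulated per column follows; the helper `missing_summary` is hypothetical, and the small frame uses made-up values rather than the Kaggle data:

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first.
    Columns without missing values are left out."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up values, not the Kaggle data)
demo = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0, None],
    "Alley": [None, None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
print(missing_summary(demo))
```

In the notebook this would be called as `missing_summary(df)`. Note that `pd.read_csv` treats the string `NA` as missing by default, and in this data set `NA` in some categorical columns (such as `Alley`) appears to mean that the feature is absent rather than unrecorded, so `data_description.txt` is worth checking before imputing.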

DataFrame shape

- Rows: 1460
- Columns: 81

In [8]:
df.shape

(1460, 81)
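The Objectives and Outputs sections mention saving the data set to `outputs/datasets/collection/HousePriceRegression.csv`, but that step is not shown here. One way it might look is sketched below, assuming the DataFrame is written unchanged and without its index; the helper name `save_collection` is hypothetical:

```python
import os

import pandas as pd


def save_collection(df: pd.DataFrame,
                    folder: str = "outputs/datasets/collection",
                    filename: str = "HousePriceRegression.csv") -> str:
    """Write df to folder/filename as CSV and return the file path.
    The folder is created if it does not exist yet."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path
```

In the notebook, `save_collection(df)` would write the file relative to the current working directory, which is the project root after the first section.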

---