# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect data and save it under outputs/datasets/datacollection.
* Assess the need for data cleaning based on initial inspection.

## Inputs

* Kaggle JSON file - this is the authentication token required to access the Kaggle dataset.

## Outputs

* Generate dataset: outputs/datasets/datacollection/house_prices_records.csv




## Notebook in Relation to CRISP-DM

This constitutes the data collection stage.

## Kernel Selection

Please use a Python 3.8.12 kernel to run this notebook. This application has been developed in a local workspace using Python 3.8.12. If you use a different version of Python to run this notebook, some of the required dependencies may not function correctly.

---

## Install Python Packages in the Notebooks

In [None]:
%pip install -r /workspaces/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt

Collecting altair==4.2.2 (from -r /workspaces/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt (line 1))
  Downloading altair-4.2.2-py3-none-any.whl (813 kB)
Collecting astor==0.8.1 (from -r /workspaces/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt (line 2))
  Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting asttokens==2.2.1 (from -r /workspaces/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt (line 3))
  Downloading asttokens-2.2.1-py2.py3-none-any.whl (26 kB)
Collecting backports.zoneinfo==0.2.1 (from -r /workspaces/CI-PP5-Peter-Regan-Heritage-Housing-Project/requirements.txt (line 6))
  Downloading backports.zoneinfo-0.2.1-cp38-cp38-manylinux1_x86_64.whl (74 kB)

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [6]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [7]:
current_dir = os.getcwd()
current_dir

'/workspace/CI-PP5-Peter-Regan-Heritage-Housing-Project'
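
Re-running the cell that calls os.chdir() moves the session up one more level each time. The sketch below shows one way to guard against that. It assumes the notebooks live in a folder called `jupyter_notebooks`; that name and the helper `project_root` are hypothetical, so adjust them to match your own layout.

```python
from pathlib import Path


def project_root(start, notebook_dir_name="jupyter_notebooks"):
    """Return the parent of `start` only when `start` is the notebooks folder.

    `notebook_dir_name` is an assumed folder name, not taken from this project.
    """
    start = Path(start)
    # Move up one level only while we are still inside the notebooks folder,
    # so calling this again from the project root leaves the path unchanged.
    if start.name == notebook_dir_name:
        return start.parent
    return start
```

To use it in place of the cell above, you could run `os.chdir(project_root(os.getcwd()))`. A second run then leaves the working directory where it is.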

# Fetch Data from Kaggle

Install Kaggle package to collect data from Kaggle.

In [8]:
%pip install kaggle==1.5.12

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.12/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


You will need a Kaggle token to fetch the data required to run these notebooks. If you do not already have a Kaggle token, follow these steps:

1. Visit www.kaggle.com and register an account.

2. When you have finished registering your account, log in.

3. Click your profile avatar in the top right-hand corner of the browser window.

4. Scroll down to the "API" section.

5. Click on "Create New Token". This creates your Kaggle token as a JSON file named kaggle.json.

6. Depending on your browser, the kaggle.json file should download automatically.

7. If it does not, download the kaggle.json file manually.


# Make Kaggle Token Available in Your Coding Environment

Once your Kaggle token has been downloaded, drag it from your Downloads folder (or wherever you chose to store it) into the file directory of your coding environment. Place it in the current working directory, because the cell below looks for kaggle.json there.

After you have added the token, run the cell below so that it is recognised within the session.

In [9]:
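import os
# Tell the Kaggle CLI to look for kaggle.json in the current working directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json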
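
If the download in the next step fails, it can help to confirm that the token file is actually in place first. The sketch below is one way to do that, assuming the token is a JSON object with `username` and `key` fields. The helper `check_kaggle_token` is hypothetical and is not part of the Kaggle package.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return (ok, message) describing whether the token file looks usable."""
    token_path = Path(path)
    if not token_path.is_file():
        return False, f"{token_path} not found"
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False, f"{token_path} is not valid JSON"
    if not isinstance(token, dict):
        return False, f"{token_path} does not contain a JSON object"
    # Assumed layout: the token holds a "username" and a "key" entry.
    missing = sorted({"username", "key"} - set(token))
    if missing:
        return False, f"missing fields: {missing}"
    return True, "token file looks usable"
```

Calling `check_kaggle_token()` from the project root checks the file in the current working directory. It only inspects the file's structure and does not contact Kaggle.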

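Get the dataset path from its Kaggle URL. For this application, use the dataset from this URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data. The dataset path is the part of the URL that follows "datasets/", in this case codeinstitute/housing-prices-data.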

Run the following cell to define the Kaggle dataset within the context of your environment and session, set its destination folder and download it.

In [10]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.35MB/s]


Run the following cell to unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [12]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
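
The shell commands above rely on `unzip` and `rm` being available. Where they are not, the same extract-then-delete step can be sketched with Python's standard library. The helper name `extract_and_remove` is hypothetical, and this version removes only the zip archive, so kaggle.json would still need to be deleted separately.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.

    Returns the list of member names that were extracted.
    """
    zip_path = Path(zip_path)
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        names = archive.namelist()
    # Delete the archive only after extraction has finished without error.
    zip_path.unlink()
    return names
```

For this notebook the call would look like `extract_and_remove("inputs/datasets/raw/housing-prices-data.zip", "inputs/datasets/raw")`.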


---

# Load and Inspect Kaggle Data

In [13]:
import pandas as pd
import numpy as np
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
pd.set_option('display.max_columns', None)
df.head(40)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,,2000,2000,250000
5,796,566.0,1.0,No,732,GLQ,64,,480,Unf,1993.0,1362,TA,14115,85.0,0.0,30,5,5,796,,1993,1995,143000
6,1694,0.0,3.0,Av,1369,GLQ,317,,636,RFn,2004.0,1694,Gd,10084,75.0,186.0,57,5,8,1686,,2004,2005,307000
7,1107,983.0,3.0,Mn,859,ALQ,216,,484,,1973.0,2090,TA,10382,,240.0,204,6,7,1107,,1973,1973,200000
8,1022,752.0,2.0,No,0,Unf,952,,468,Unf,1931.0,1774,TA,6120,51.0,0.0,0,5,7,952,,1931,1950,129900
9,1077,0.0,2.0,No,851,GLQ,140,,205,RFn,1939.0,1077,TA,7420,50.0,0.0,4,6,5,991,,1939,1950,118000


Generate dataframe summary.

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

Check the dataset for missing values.

In [14]:
df.isnull().sum()

1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure        0
BsmtFinSF1          0
BsmtFinType1      114
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      162
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64

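Several columns contain missing values. EnclosedPorch (1324 of 1460 rows) and WoodDeckSF (1305 of 1460 rows) are mostly empty. One possible explanation is that these features do not apply to every property in the dataset, so the gaps will need to be assessed during data cleaning.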
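
Raw counts are easier to judge alongside the share of rows they represent. The sketch below builds a small summary table of missing values per column. The helper `missing_summary` is hypothetical and is shown only as one possible way to present the counts above.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns with gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        # Share of rows that are missing, as a percentage to one decimal place.
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have missing values, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Running `missing_summary(df)` on the loaded dataframe lists only the columns that have gaps, which makes it easier to see which ones may need attention during cleaning.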

Check for data entries that have been duplicated.

In [15]:
df[df.duplicated(subset=None)]

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice

The result is empty, so the dataset contains no duplicated rows.


Check for unique values in columns that hold non-numerical data, as these are likely to contain categorical data.

In [28]:
for col in df:
    # Only object (text) columns are likely to be categorical
    if df[col].dtype == 'object':
        print(col, '-', df[col].unique())

BsmtExposure - ['No' 'Gd' 'Mn' 'Av' 'None']
BsmtFinType1 - ['GLQ' 'ALQ' 'Unf' 'Rec' nan 'BLQ' 'None' 'LwQ']
GarageFinish - ['RFn' 'Unf' nan 'Fin' 'None']
KitchenQual - ['Gd' 'TA' 'Ex' 'Fa']
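
The output shows that BsmtFinType1 and GarageFinish contain both the text value 'None' and true missing values (nan). To see how often each occurs, the unique values can be counted with missing entries included. The sketch below is one way to do that; the helper `categorical_counts` is hypothetical.

```python
import pandas as pd


def categorical_counts(df):
    """Return value counts, including NaN, for each object-typed column."""
    return {
        col: df[col].value_counts(dropna=False)  # dropna=False keeps NaN as its own row
        for col in df.columns
        if df[col].dtype == object
    }
```

For example, `categorical_counts(df)["GarageFinish"]` would list the count for each finish type, with 'None' and NaN reported as separate rows.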


---

# Push files to Repo

Save the data file in a local folder named 'datacollection' under 'outputs/datasets/' and push to repository.

In [32]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/datacollection', exist_ok=True)

df.to_csv("outputs/datasets/datacollection/house_prices_records.csv", index=False)