# Data Cleaning

## 1. Objectives

- Evaluate missing data
- Clean data
- Export cleaned data for use in modeling

## 2. Inputs

- /workspace/house-price-regression/inputs/datasets/raw/train.csv
- /workspace/house-price-regression/inputs/datasets/raw/test.csv

## 3. Outputs

- Cleaned datasets saved in /workspace/house-price-regression/outputs/datasets/cleaned
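
The export step named above could look roughly like the sketch below. The helper name `export_cleaned` and any file name passed to it (for example `train_cleaned.csv`) are assumptions made for illustration; the notebook only specifies the output folder.

```python
from pathlib import Path

import pandas as pd


def export_cleaned(df, out_dir, filename):
    """Write df as CSV into out_dir (hypothetical helper, not from the notebook)."""
    out_path = Path(out_dir)
    # Create the folder, including missing parents, if it does not exist yet
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / filename
    # index=False keeps the pandas row index out of the file
    df.to_csv(target, index=False)
    return target
```

Called as `export_cleaned(df, "outputs/datasets/cleaned", "train_cleaned.csv")`, this writes one file and returns its path. Dropping the index avoids an extra `Unnamed: 0` column when the file is read back with `pd.read_csv`.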

## 4. Imports

In [1]:
import pandas as pd
import numpy as np

## 5. Load data

In [2]:
import os

# Get the current working directory (cwd)
cwd = os.getcwd()
print(f"[*] Previous working directory: {cwd}")

# Make the parent of the cwd the new cwd
os.chdir(os.path.dirname(cwd))
cwd = os.getcwd()
print(f"[*] Updated current working directory: {cwd}")

# Load the data
df_train = pd.read_csv("inputs/datasets/raw/train.csv")
df_test = pd.read_csv("inputs/datasets/raw/test.csv")

[*] Previous working directory: /workspace/house-price-regression/jupyter_notebooks
[*] Updated current working directory: /workspace/house-price-regression
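
Changing the working directory works, but re-running the cell moves one level further up each time. As an alternative sketch, the paths can be built from an explicit project root instead. The helper name `load_raw_csv` is an assumption for illustration; the folder layout `inputs/datasets/raw` is taken from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_raw_csv(project_root, filename):
    """Read inputs/datasets/raw/<filename> below project_root (hypothetical helper)."""
    path = Path(project_root) / "inputs" / "datasets" / "raw" / filename
    return pd.read_csv(path)
```

From inside `jupyter_notebooks`, a call such as `load_raw_csv(Path.cwd().parent, "train.csv")` would read the same file without altering the working directory.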


## 6. Create DataFrames with selected feature variables only

- OverallQual: Overall Quality
- YearBuilt: Year Built
- YearRemodAdd: Remodel Date
- TotalBsmtSF: Total Basement Square Footage
- 1stFlrSF: Total First Floor Square Footage
- GrLivArea: Above Grade (Ground) Living Area Square Footage
- FullBath: Number of Full Bathrooms
- TotRmsAbvGrd: Number of Rooms above Grade
- GarageCars: Garage (Number of Cars)
- GarageArea: Garage Area Square Footage

In [5]:
# Selected features (shared by the training and test sets)
selected_features_test = ['OverallQual', 'YearBuilt',
                          'YearRemodAdd', 'TotalBsmtSF',
                          '1stFlrSF', 'GrLivArea',
                          'FullBath', 'TotRmsAbvGrd',
                          'GarageCars', 'GarageArea']

# The training set additionally keeps the target variable
selected_features_train = selected_features_test + ['SalePrice']

# Copy explicitly so later cleaning steps can assign to these frames
# without triggering SettingWithCopyWarning
df_train_selected = df_train[selected_features_train].copy()
df_test_selected = df_test[selected_features_test].copy()

df_test_selected.head()

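| | OverallQual | YearBuilt | YearRemodAdd | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars | GarageArea |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1961 | 1961 | 882.0 | 896 | 896 | 1 | 5 | 1.0 | 730.0 |
| 1 | 6 | 1958 | 1958 | 1329.0 | 1329 | 1329 | 1 | 6 | 1.0 | 312.0 |
| 2 | 5 | 1997 | 1998 | 928.0 | 928 | 1629 | 2 | 6 | 2.0 | 482.0 |
| 3 | 6 | 1998 | 1998 | 926.0 | 926 | 1604 | 2 | 7 | 2.0 | 470.0 |
| 4 | 8 | 1992 | 1992 | 1280.0 | 1280 | 1280 | 2 | 5 | 2.0 | 506.0 |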
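
By default pandas stores an integer column that contains NaN as float64, so the decimal values shown for TotalBsmtSF, GarageCars and GarageArea may point to missing entries in the test set. The sketch below shows one way to check this. The helper name `missing_summary` and the toy frame are assumptions for illustration; the values are made up and are not results from the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns with at least one gap, largest count first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy frame with made-up values (not the real test set)
demo = pd.DataFrame({
    "GarageCars": [1.0, np.nan, 2.0, 2.0],
    "GarageArea": [730.0, 312.0, np.nan, np.nan],
    "FullBath": [1, 1, 2, 2],
})
print(missing_summary(demo))
```

On the notebook's data the same call would be `missing_summary(df_test_selected)` and `missing_summary(df_train_selected)`.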


In [8]:
df_train_selected.head()

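| | OverallQual | YearBuilt | YearRemodAdd | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars | GarageArea | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2003 | 2003 | 856 | 856 | 1710 | 2 | 8 | 2 | 548 | 208500 |
| 1 | 6 | 1976 | 1976 | 1262 | 1262 | 1262 | 2 | 6 | 2 | 460 | 181500 |
| 2 | 7 | 2001 | 2002 | 920 | 920 | 1786 | 2 | 6 | 2 | 608 | 223500 |
| 3 | 7 | 1915 | 1970 | 756 | 961 | 1717 | 1 | 7 | 3 | 642 | 140000 |
| 4 | 8 | 2000 | 2000 | 1145 | 1145 | 2198 | 2 | 9 | 3 | 836 | 250000 |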
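
In this preview the training columns are all integers, whereas the test preview shows some of the same columns as floats. A quick way to list such differences is to compare dtypes across the shared columns, as sketched below. The helper name `dtype_mismatches` and the toy frames are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def dtype_mismatches(df_train, df_test):
    """List shared columns whose dtype differs between the two frames."""
    shared = [col for col in df_train.columns if col in df_test.columns]
    report = pd.DataFrame({
        "train_dtype": df_train[shared].dtypes.astype(str),
        "test_dtype": df_test[shared].dtypes.astype(str),
    })
    return report[report["train_dtype"] != report["test_dtype"]]


# Toy frames with made-up values (not the real data)
train_demo = pd.DataFrame({"TotalBsmtSF": [856, 1262],
                           "FullBath": [2, 2],
                           "SalePrice": [208500, 181500]})
test_demo = pd.DataFrame({"TotalBsmtSF": [882.0, None],
                          "FullBath": [1, 1]})
print(dtype_mismatches(train_demo, test_demo))
```

Columns reported here are candidates for the missing-data evaluation, since a float dtype on a count-like column often comes from NaN values.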
