## Import software libraries

Import the primary modules that will be used in this project:

In [32]:
import os  # Interact with the operating system
import sys  # Read system parameters

import matplotlib  # Create 2D charts
import numpy as np  # Work with multi-dimensional arrays and matrices
import pandas as pd  # Manipulate and analyze data
import scipy as sp  # Perform scientific computing and advanced mathematics
import seaborn as sb  # Perform data visualization
import sklearn  # Perform data mining and analysis

# Summarize software libraries used
print("Libraries used in this project:")
print("- NumPy {}".format(np.__version__))
print("- Pandas {}".format(pd.__version__))
print("- Matplotlib {}".format(matplotlib.__version__))
print("- SciPy {}".format(sp.__version__))
print("- Scikit-learn {}".format(sklearn.__version__))
print("- Python {}\n".format(sys.version))

Libraries used in this project:
- NumPy 1.24.3
- Pandas 2.0.3
- Matplotlib 3.7.2
- SciPy 1.10.1
- Scikit-learn 1.3.2
- Python 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:23:49) 
[Clang 15.0.7 ]



## Load the Dataset

To analyze how multiple inputs relate to home prices in King County, load the dataset into a pandas `DataFrame`. Once loaded, the data can be explored and visualized with pandas.

In [33]:
PROJECT_ROOT_DIR = "."
DATA_PATH = os.path.join(PROJECT_ROOT_DIR, "housing_data")
print("Data files in this project:", os.listdir(DATA_PATH))

# Read the raw dataset
raw_housing_data_file = os.path.join(DATA_PATH, "kc_house_data.csv")
raw_housing_data = pd.read_csv(raw_housing_data_file)
print(
    "Loaded {} records from {}.\n".format(len(raw_housing_data), raw_housing_data_file)
)

Data files in this project: ['.ipynb_checkpoints', 'kc_house_data.csv']
Loaded 21613 records from ./housing_data/kc_house_data.csv.
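
The loading step above assumes that `kc_house_data.csv` already exists under `housing_data`. As a minimal sketch (not part of the original notebook), the helper below wraps the same `pd.read_csv` call with an explicit existence check. The function name `load_housing_data` and the tiny demo file are hypothetical; they are there only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_housing_data(data_path, file_name="kc_house_data.csv"):
    """Read a CSV file from data_path, failing early with a clear message."""
    csv_path = os.path.join(data_path, file_name)
    if not os.path.isfile(csv_path):
        raise FileNotFoundError("Expected dataset at {}".format(csv_path))
    return pd.read_csv(csv_path)


# Demo with a throwaway two-row file (made-up values, not real records).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "kc_house_data.csv")
    with open(demo_file, "w") as f:
        f.write("id,price,bedrooms\n1,100000.0,2\n2,250000.0,3\n")
    demo_data = load_housing_data(demo_dir)
    print("Loaded {} records.".format(len(demo_data)))
```

In the notebook itself, the equivalent call would presumably be `load_housing_data(DATA_PATH)`, which surfaces a missing file as a readable error instead of a traceback from inside pandas.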



## Dataset Information


#### View features and data types

Column Labels:
- **id**—Unique identifier for each house sold.
- **date**—Date of the house's most recent sale.
- **price**—Price the house most recently sold for.
- **bedrooms**—Number of bedrooms in the house.
- **bathrooms**—Number of bathrooms. A room with a toilet but no shower is counted as 0.5.
- **sqft_living**—Square footage of the house's interior living space.
- **sqft_lot**—Square footage of the lot on which the house is located.
- **floors**—Number of floor levels in the house.
- **waterfront**—Whether the property borders on or contains a body of water. (0 = not waterfront, 1 =
waterfront)
- **view**—An index from 0 to 4 representing the subjective quality of the view from the property. The
higher the number, the better the view.
- **condition**—An index from 1 to 5 representing the subjective condition of the property. The higher
the number, the better the condition.
- **grade**—An index from 0 to 14 representing the quality of the building's construction and design.
The higher the number, the better the grade.
- **sqft_above**—The square footage of the interior housing space that is above ground level.
- **sqft_basement**—The square footage of the interior housing space that is below ground level.
- **yr_built**—The year the house was initially built.
- **yr_renovated**—The year of the house's last renovation.
- **zipcode**—The ZIP code area in which the house is located.
- **lat**—Latitude of the house's location.
- **long**—Longitude of the house's location.
- **sqft_living15**—The square footage of interior housing living space for the nearest 15 neighbors.
- **sqft_lot15**—The square footage of the land lots of the nearest 15 neighbors.

In [34]:
raw_housing_data.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  

- The dataset contains 21,613 records ("entries"), each describing one house sale.
- There are 21 columns in the dataset.
- Each column in the dataset is listed, along with its data type and the number of records that
include a data value.
- Five columns contain floating point number values: price, bathrooms, floors, lat, and
long.
- Fifteen columns contain integer number values: id, bedrooms, sqft_living, sqft_lot,
waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built,
yr_renovated, zipcode, sqft_living15, and sqft_lot15.
- One column (date) contains date values stored as strings, which pandas reports as the "object" data type.
- There are no missing values: every column has 21,613 non-null entries.
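
The observations above can also be checked programmatically rather than read off the `info()` listing. The sketch below uses a small hand-made frame (`sample`, with invented values) in place of `raw_housing_data`. It counts columns per data type, counts missing values, and converts the `date` strings to datetimes. The timestamp format `%Y%m%dT%H%M%S` is an assumption about how the CSV encodes dates, so it should be verified against the file before use.

```python
import pandas as pd

# Small stand-in for raw_housing_data; all values are invented for illustration.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "date": ["20140502T000000", "20141115T000000", "20150320T000000"],
        "price": [300000.0, 450000.0, 275000.0],
        "bedrooms": [3, 4, 2],
    }
)

# Number of columns per data type
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Total number of missing values across all columns
missing_total = int(sample.isna().sum().sum())
print("Missing values:", missing_total)

# Parse the date strings (assumed format) into datetime64 values
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")
print(sample["date"].dtype)
```

Applied to the full dataset, the same three steps should agree with the counts listed above if the dataset matches the `info()` summary.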