# Import Libraries

In [1]:
import os  # Interact with the operating system.
import sys  # Read system parameters.
import warnings  # Suppress warnings
from time import time  # Calculate training time.

import folium  # Plot values on a map.
import matplotlib  # Create 2D charts.
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np  # Work with multi-dimensional arrays and matrices.
import pandas as pd  # Manipulate and analyze data.
import sklearn  # Perform data mining and analysis.
import yellowbrick  # Visualize elbow and silhouette plots.

warnings.filterwarnings("ignore")

# Summarize software libraries used.
print("Libraries used in this project:")
print("- Python {}".format(sys.version))
print("- NumPy {}".format(np.__version__))
print("- pandas {}".format(pd.__version__))
print("- Matplotlib {}".format(matplotlib.__version__))
print("- Folium {}".format(folium.__version__))
print("- Yellowbrick {}".format(yellowbrick.__version__))
print("- scikit-learn {}\n".format(sklearn.__version__))

Libraries used in this project:
- Python 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:23:49) 
[Clang 15.0.7 ]
- NumPy 1.24.3
- pandas 2.0.3
- Matplotlib 3.7.2
- Folium 0.18.0
- Yellowbrick 1.5
- scikit-learn 1.3.2



## Load the Dataset

To analyze how multiple inputs relate to home prices in King County, load the dataset into a pandas `DataFrame`. Once loaded, the data can be explored and visualized with pandas.

In [2]:
# Load the dataset.
PATH = os.path.join(".", "housing_data")
print("Data files in this project:", os.listdir(PATH))

data = os.path.join(PATH, "kc_house_data_prep.pickle")
datanorm = os.path.join(PATH, "kc_house_data_prep_norm.pickle")
df = pd.read_pickle(data)
dfnorm = pd.read_pickle(datanorm)
print("Loaded {} records from {}.".format(len(df), data))
print("Loaded {} records from {}.".format(len(dfnorm), datanorm))

Data files in this project: ['kc_house_data_prep_norm.pickle', 'kc_house_data_prep.pickle']
Loaded 21609 records from ./housing_data/kc_house_data_prep.pickle.
Loaded 21609 records from ./housing_data/kc_house_data_prep_norm.pickle.


The pickle files contain the King County housing data. The non-normalized version is used for initial visualization, and the normalized version will be used to train a candidate model.
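Because one frame is used for plotting and the other for training, it can help to confirm that the two frames describe the same rows in the same order before any results from one are joined to the other. The following is a minimal sketch on toy data: the names `raw` and `norm` are hypothetical stand-ins for `df` and `dfnorm`, and min-max scaling is only an assumption, since the notebook does not say how the normalized file was produced.

```python
import pandas as pd

# Hypothetical stand-in for the non-normalized frame (df in the notebook).
raw = pd.DataFrame(
    {"price": [221900.0, 538000.0, 180000.0], "sqft_living": [1180, 2570, 770]},
    index=[0, 1, 3],  # Gaps in the index mimic rows dropped during preparation.
)

# Assumption: min-max scaling. The real normalization method is not documented here.
norm = (raw - raw.min()) / (raw.max() - raw.min())

# Check that both frames share the same row labels and column names.
same_rows = raw.index.equals(norm.index)
same_columns = list(raw.columns) == list(norm.columns)

print("Rows align:", same_rows)
print("Columns align:", same_columns)
```

The same two checks could be run against `df` and `dfnorm` directly, although the normalized file may legitimately contain a different set of columns.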

## Relearn the Dataset

#### View features and data types

Column Labels:
- **id**—Unique identifier for each house sold.
- **date**—Date of the house's most recent sale.
- **price**—Price the house most recently sold for.
- **bedrooms**—Number of bedrooms in the house.
- **bathrooms**—Number of bathrooms. A room with a toilet but no shower is counted as 0.5.
- **sqft_living**—Square footage of the house's interior living space.
- **sqft_lot**—Square footage of the lot on which the house is located.
- **floors**—Number of floor levels in the house.
- **waterfront**—Whether the property borders on or contains a body of water. (0 = not waterfront, 1 = waterfront)
- **view**—An index from 0 to 4 representing the subjective quality of the view from the property. The higher the number, the better the view.
- **condition**—An index from 1 to 5 representing the subjective condition of the property. The higher the number, the better the condition.
- **grade**—An index from 0 to 14 representing the quality of the building's construction and design. The higher the number, the better the grade.
- **sqft_above**—The square footage of the interior housing space that is above ground level.
- **sqft_basement**—The square footage of the interior housing space that is below ground level.
- **yr_built**—The year the house was initially built.
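- **yr_built_group**—The range of years in which the house was built (for example, 1941–1960).
- **yr_built_encoded**—Integer code for the `yr_built_group` range (for example, 2 for 1941–1960).
- **yr_renovated**—The year of the house's last renovation (0 if no renovation is recorded).
- **yr_ren_group**—The range of years in which the house was last renovated (for example, 1975–1994); empty if no renovation is recorded.
- **yr_ren_encoded**—Integer code for the `yr_ren_group` range (for example, 3 for 1975–1994; 0 if no renovation is recorded).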
- **zipcode**—What zipcode area the house is located within.
- **lat**—Latitude of the house's location.
- **long**—Longitude of the house's location.
- **sqft_living15**—The square footage of interior housing living space for the nearest 15 neighbors.
- **sqft_lot15**—The square footage of the land lots for the nearest 15 neighbors.
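The `yr_built_group` and `yr_built_encoded` columns appear to be a binned form of `yr_built`. The sketch below shows one way such columns could be derived with `pd.cut`. The bin edges and labels are assumptions inferred from the group labels visible in the sample rows, the build years are illustrative, and the actual preparation code may differ.

```python
import pandas as pd

# Illustrative build years; not read from the pickle files.
years = pd.Series([1955, 1951, 1933, 1965, 1987], name="yr_built")

# Assumed bin edges and labels, inferred from the labels seen in the sample rows.
edges = [1920, 1940, 1960, 1980, 2001]
labels = ["1921–1940", "1941–1960", "1961–1980", "1981–2001"]

# Right-inclusive bins: (1920, 1940], (1940, 1960], and so on.
groups = pd.cut(years, bins=edges, labels=labels)

# Category codes start at 0; add 1 so that 1921–1940 maps to 1.
encoded = groups.cat.codes + 1

print(pd.DataFrame({"yr_built": years, "group": groups, "encoded": encoded}))
```

Years outside the assumed edges would come back as missing values here, so the real encoding presumably handles older and newer houses with additional bins.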

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21609 entries, 0 to 21617
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                21609 non-null  int64         
 1   date              21609 non-null  datetime64[ns]
 2   price             21609 non-null  float64       
 3   bedrooms          21609 non-null  int64         
 4   bathrooms         21609 non-null  float64       
 5   sqft_living       21609 non-null  int64         
 6   sqft_lot          21609 non-null  int64         
 7   floors            21609 non-null  float64       
 8   waterfront        21609 non-null  int64         
 9   view              21609 non-null  int64         
 10  condition         21609 non-null  int64         
 11  grade             21609 non-null  int64         
 12  sqft_above        21609 non-null  int64         
 13  sqft_basement     21609 non-null  int64         
 14  yr_built          21609 non

In [4]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,yr_built_group,yr_built_encoded,yr_renovated,yr_ren_group,yr_ren_encoded,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,2014-10-13,221900.0,3,1.0,1180,5650,1.0,0,0,...,1941–1960,2,0,,0,98178,47.5112,-122.257,1340,5650
1,6414100192,2014-12-09,538000.0,3,2.25,2570,7242,2.0,0,0,...,1941–1960,2,1991,1975–1994,3,98125,47.721,-122.319,1690,7639
2,5631500400,2015-02-25,180000.0,2,1.0,770,10000,1.0,0,0,...,1921–1940,1,0,,0,98028,47.7379,-122.233,2720,8062
3,2487200875,2014-12-09,604000.0,4,3.0,1960,5000,1.0,0,0,...,1961–1980,3,0,,0,98136,47.5208,-122.393,1360,5000
4,1954400510,2015-02-18,510000.0,3,2.0,1680,8080,1.0,0,0,...,1981–2001,4,0,,0,98074,47.6168,-122.045,1800,7503


**Spotlights** 

Each row in the dataset represents a house, characterized by various attributes such as the number of bedrooms, bathrooms, and total square footage. While the price was the primary target in your supervised learning tasks, this unsupervised project will shift the focus to analyzing other features of the houses.

## Data Characteristics - Statistical Measures

**Descriptive summary statistics**

In [5]:
with pd.option_context("display.float_format", "{:.2f}".format):
    print(df.describe())

                 id                           date      price  bedrooms  \
count      21609.00                          21609   21609.00  21609.00   
mean  4579845174.17  2014-10-29 04:59:32.511453696  539156.72      3.37   
min      1000102.00            2014-05-02 00:00:00   75000.00      0.00   
25%   2123049175.00            2014-07-22 00:00:00  321500.00      3.00   
50%   3904930240.00            2014-10-16 00:00:00  450000.00      3.00   
75%   7308900100.00            2015-02-17 00:00:00  645000.00      4.00   
max   9900000190.00            2015-05-27 00:00:00 5570000.00     11.00   
std   2876363064.42                            NaN  358610.69      0.91   

       bathrooms  sqft_living   sqft_lot   floors  waterfront     view  ...  \
count   21609.00     21609.00   21609.00 21609.00    21609.00 21609.00  ...   
mean        2.11      2078.73   15105.03     1.49        0.01     0.23  ...   
min         0.00       290.00     520.00     1.00        0.00     0.00  ...   
25%     

**Summarize the Most Common Values**

In [6]:
features_to_summarize = [
    "view",
    "waterfront",
    "grade",
    "zipcode",
    "bedrooms",
    "bathrooms",
    "floors",
    "sqft_living",
    "sqft_lot",
    "condition",
]
df[features_to_summarize].mode()

Unnamed: 0,view,waterfront,grade,zipcode,bedrooms,bathrooms,floors,sqft_living,sqft_lot,condition
0,0,0,7,98103,3,2.5,1.0,1300,5000,3


**The typical house**:
- Does not have a "view" and is not on the waterfront.
- Has a grade of 7.
- Has a zipcode of 98103.
- Has 3 bedrooms, 2.5 bathrooms, and a single story.
- Has 1300 square feet of living space on a 5000-square-foot lot.
- Has an average condition (3 on the 1 to 5 scale).
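Note that `DataFrame.mode()` works column by column, so the "typical house" combines the most common value of each feature rather than describing one real record. It also returns extra rows when a column has tied modes, padding the other columns with `NaN`. The toy frame below is hypothetical and only illustrates that behavior, along with a quick check of how dominant a mode is.

```python
import pandas as pd

# Hypothetical toy data: bedrooms has two tied modes (3 and 4), floors has one (1.0).
toy = pd.DataFrame(
    {"bedrooms": [3, 3, 4, 4, 2], "floors": [1.0, 1.0, 2.0, 1.0, 1.5]}
)

# One row per tied mode; columns with fewer modes are padded with NaN.
modes = toy.mode()
print(modes)

# Share of rows that carry the most common floors value.
floors_share = toy["floors"].value_counts(normalize=True).iloc[0]
print("Share of rows with the most common floors value:", floors_share)
```

A single-row result, as in the notebook output, indicates that none of the summarized columns had a tie.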

## Develop a New Price Per Sq-Foot Metric

The next step develops a pricing feature, the cost per square foot, by dividing the sale price by the living square footage.

In [7]:
df["price_per_sqft"] = (df["price"] / df["sqft_living"]).round(2)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,yr_built_encoded,yr_renovated,yr_ren_group,yr_ren_encoded,zipcode,lat,long,sqft_living15,sqft_lot15,price_per_sqft
0,7129300520,2014-10-13,221900.0,3,1.0,1180,5650,1.0,0,0,...,2,0,,0,98178,47.5112,-122.257,1340,5650,188.05
1,6414100192,2014-12-09,538000.0,3,2.25,2570,7242,2.0,0,0,...,2,1991,1975–1994,3,98125,47.721,-122.319,1690,7639,209.34
2,5631500400,2015-02-25,180000.0,2,1.0,770,10000,1.0,0,0,...,1,0,,0,98028,47.7379,-122.233,2720,8062,233.77
3,2487200875,2014-12-09,604000.0,4,3.0,1960,5000,1.0,0,0,...,3,0,,0,98136,47.5208,-122.393,1360,5000,308.16
4,1954400510,2015-02-18,510000.0,3,2.0,1680,8080,1.0,0,0,...,4,0,,0,98074,47.6168,-122.045,1800,7503,303.57


**Spotlights** 

The final column in the `DataFrame` is the `price_per_sqft` attribute, the ratio of `price` to `sqft_living`. This new feature will be used when visualizing the clusters.
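As a quick check of the ratio, the sketch below recomputes the feature for the five sample rows shown in the output. The frame name `sample` is hypothetical, and the zero guard is an optional addition rather than part of the notebook: the summary statistics report a minimum `sqft_living` of 290, so the full dataset should not need it.

```python
import numpy as np
import pandas as pd

# Price and living area copied from the five sample rows in the output.
sample = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
        "sqft_living": [1180, 2570, 770, 1960, 1680],
    }
)

# Optional guard: treat a zero area as missing so the ratio is NaN instead of inf.
area = sample["sqft_living"].replace(0, np.nan)

# Price per square foot, rounded to cents.
sample["price_per_sqft"] = (sample["price"] / area).round(2)

print(sample)
```

The resulting values match the `price_per_sqft` column in the output for the same rows.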