# **STEP 1: Data Importing and Pre-processing**
## - Import dataset and describe characteristics such as dimensions, data types, file types, and import methods used
## - Clean, wrangle, and handle missing data
## - Transform data appropriately using techniques such as aggregation, normalization, and feature construction
## - Reduce redundant data and perform need-based discretization

In [200]:
# import all packages used for the project in the first cell, use code cells for code and comments, 
#and use markdown cells for headings and descriptions

In [300]:
import pandas as pd
import numpy as np
import os

In [314]:
os.getcwd()
os.chdir('/Users/vrb/Downloads/')

df = pd.read_csv("house_sales.csv", header = 0)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180.0,5650.0,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242.0,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2.0,1.0,770.0,10000.0,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960.0,5000.0,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680.0,8080.0,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Basic Characteristics

1. Shape of the data frame

In [None]:
print("Shape (rows, columns):", df.shape)

2. Defining file type:

The dataset was provided as a CSV file, which is a plain-text tabular file commonly used for structured data.

3. Data types by column

In [None]:
print("Data types:")
print(df.dtypes)

4. Missing values

In [None]:
df.isna().sum()

Cleaning the data

1. Separating missing value columns

In [None]:
cols_na = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]

df_na = df[cols_na]


2. Missing percentage in each column

In [None]:
missing_percent = (df_na.isna().sum() / len(df_na)) * 100
print("Missing percent: \n", missing_percent)
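The same percentages can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a hypothetical toy frame; the `toy` values are invented for illustration and are not taken from house_sales.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from house_sales.csv
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 4.0, 2.0],
    "bathrooms": [1.0, 2.0, np.nan, np.nan],
})

# isna() returns a boolean frame; its column-wise mean is the share of missing values
missing_percent_toy = toy.isna().mean() * 100
print(missing_percent_toy)
```

Applied to the project data, the equivalent expression would be `df[cols_na].isna().mean() * 100`.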

4. Distribution in missing value columns

In [None]:
df["bedrooms"].describe()

In [None]:
df["bathrooms"].describe()

5. Filling in missing values

    a. bedrooms
        This column's missing percentage is under 10%, and the variable is discrete with a clear central tendency: most homes have 3 bedrooms. Because of outliers, the mean would not be a reliable choice; the median is more robust to them and better represents a typical value. For these reasons, the median was used to fill the missing bedroom values.

In [None]:
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
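To illustrate why the median was preferred over the mean, the sketch below compares the two on a hypothetical list of bedroom counts containing one extreme value. The numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical bedroom counts with a single extreme value
bedrooms_toy = pd.Series([2, 3, 3, 3, 4, 33])

mean_value = bedrooms_toy.mean()      # pulled upward by the single extreme value
median_value = bedrooms_toy.median()  # stays at the typical bedroom count
print("mean =", mean_value, "median =", median_value)
```

Here the mean is 8.0 while the median remains 3.0, so the median gives a more plausible fill value for a typical home.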

    b. bathrooms

In [None]:
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())

    c. sqft_living 
    Missing values in this column were filled using the median sqft_living for each bedroom count to keep the estimates accurate. sqft_living15 was avoided because it represents nearby homes, not the specific property.
    
        

In [None]:
for b in sorted(df['bedrooms'].unique()):
    
    mis_sq = (df['bedrooms'] == b) & (df['sqft_living'].isna())

    median_sqft = df.loc[df['bedrooms'] == b, 'sqft_living'].median()

    df.loc[mis_sq, 'sqft_living'] = median_sqft

    d. sqft_lot
        Missing values in sqft_lot were filled using the median lot size within each zip code. Lot size varies heavily by location, so grouping by zip code provides more realistic estimates than using one overall median.

In [None]:
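for z in sorted(df['zipcode'].unique()):
    
    # rows in this zip code with a missing lot size (named to avoid shadowing the built-in zip)
    mis_lot = (df['zipcode'] == z) & (df['sqft_lot'].isna())
    
    median_lot = df.loc[df['zipcode'] == z, 'sqft_lot'].median()
    
    df.loc[mis_lot, 'sqft_lot'] = median_lot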
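The group-wise median fills used for sqft_living and sqft_lot can also be written without an explicit loop. The sketch below is an alternative formulation, not the method used in this notebook: it uses `groupby(...).transform("median")` on a hypothetical toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two zip codes, each with one missing lot size
toy = pd.DataFrame({
    "zipcode": [98001, 98001, 98001, 98002, 98002],
    "sqft_lot": [5000.0, np.nan, 7000.0, 9000.0, np.nan],
})

# transform("median") returns each row's group median, aligned to the original index
group_median = toy.groupby("zipcode")["sqft_lot"].transform("median")

# fillna only replaces the missing entries; observed values are left untouched
toy["sqft_lot"] = toy["sqft_lot"].fillna(group_median)
print(toy)
```

One caveat applies to both the loop and this version: if every value in a group is missing, the group median is itself NaN and the gap remains.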

                                  

 6. Converting data types

    a. date
        The date column was converted to datetime to allow accurate time-based calculations and to avoid treating the values as plain text.

In [None]:
df['date'] = pd.to_datetime(df['date'])

df['date'].head()
df.dtypes
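The raw date strings follow the pattern `YYYYMMDDTHHMMSS` (for example `20141013T000000`). As a sketch, the format can be passed explicitly so that parsing does not depend on inference. The format string is my assumption based on the sample rows shown earlier.

```python
import pandas as pd

# Two date strings copied from the first rows of the dataset preview
raw_dates = pd.Series(["20141013T000000", "20150225T000000"])

# Explicit format: year, month, day, literal 'T', then hour, minute, second
parsed = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")
print(parsed)
```

An explicit format raises an error on strings that do not match, which can surface malformed dates early.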

7. Duplicate checks and redundant data

In [None]:
df.duplicated().sum()

8. Impossible data

In [None]:
df[df['bedrooms'] < 0]

In [None]:
df[df['bathrooms'] < 0]

In [None]:
df[df['floors'] < 1]

In [None]:
df[df['sqft_living'] <= 0]
df[df['sqft_lot'] <= 0]

In [None]:
df[df['price'] <= 0]
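In a notebook cell, only the last expression is displayed, so a cell with two filters shows only the second result. One way to see all checks at once is to collect the number of violating rows per rule. The sketch below uses a hypothetical toy frame and rule names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data containing one violation per rule
toy = pd.DataFrame({
    "bedrooms": [3, -1, 2],
    "floors": [1.0, 2.0, 0.5],
    "price": [221900.0, 0.0, 180000.0],
})

# Each rule is a boolean mask marking rows that violate a physical constraint
rules = {
    "bedrooms < 0": toy["bedrooms"] < 0,
    "floors < 1": toy["floors"] < 1,
    "price <= 0": toy["price"] <= 0,
}

# Count the violating rows for every rule in a single summary
violations = pd.Series({name: int(mask.sum()) for name, mask in rules.items()})
print(violations)
```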

9. Outliers
    
    a. bedrooms
       A single extreme outlier was found in the bedrooms column, where a property was listed with 33 bedrooms. Based on the home's square footage, bathroom count, and price, this was ultimately determined to be a data entry error, and the value was corrected to 3.

In [None]:
col = 'bedrooms'

Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df[(df[col] < lower) | (df[col] > upper)][['bedrooms']]
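Because the same 1.5 × IQR rule could be applied to other columns, it may help to wrap it in a small function. The helper name `iqr_bounds` and the toy values below are hypothetical and shown only as a sketch.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical bedroom counts with one extreme entry
bedrooms_toy = pd.Series([2, 3, 3, 3, 4, 4, 4, 5, 33])

lower, upper = iqr_bounds(bedrooms_toy)
outliers = bedrooms_toy[(bedrooms_toy < lower) | (bedrooms_toy > upper)]
print(lower, upper, outliers.tolist())
```

For this toy series the fences are 1.5 and 5.5, so only the value 33 is flagged.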


In [None]:
df['bedrooms'].describe()

In [None]:
df[df['bedrooms'] > 10][['bedrooms']]

In [None]:
df.loc[15870]

In [None]:
df.loc[15870, 'bedrooms'] = 3

    b. bathrooms

In [None]:
df['bathrooms'].describe()

In [None]:
df['bathrooms'].sort_values().head(10)
df['bathrooms'].sort_values(ascending=False).head(10)

    c. sqft_living

In [None]:
df['sqft_living'].describe()
df['sqft_living'].sort_values().head(20)
df['sqft_living'].sort_values(ascending=False).head(20)

    d. sqft_lot

In [None]:
df['sqft_lot'].describe()
df['sqft_lot'].sort_values().head(10)
df['sqft_lot'].sort_values(ascending=False).head(10)

    e. floors

In [None]:
df['floors'].describe()
df['floors'].sort_values().head(10)
df['floors'].sort_values(ascending=False).head(10)

    f. condition

In [None]:
df['condition'].describe()
df['condition'].value_counts()

    g. grade

In [None]:
df['grade'].describe()
df['grade'].value_counts()

    h. yr_built

In [None]:
df['yr_built'].sort_values().head(10)
df['yr_built'].sort_values(ascending=False).head(10)

    i. sqft_living15

In [None]:
df['sqft_living15'].describe()
df['sqft_living15'].sort_values().head(10)
df['sqft_living15'].sort_values(ascending=False).head(10)

    j. sqft_lot15

In [None]:
df['sqft_lot15'].describe()
df['sqft_lot15'].sort_values().head(10)
df['sqft_lot15'].sort_values(ascending=False).head(10)

    k. price

In [None]:
df.sort_values('price', ascending=True).head(50)

In [None]:
df.sort_values('price', ascending=False).head(10)

10. Validation

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna().sum()

# **STEP 2: Data Analysis and Visualization**
## - Identify categorical, ordinal, and numerical variables within the data
## - Provide measures of centrality and distribution with visualizations
## - Diagnose for correlations between variables and determine independent and dependent variables
## - Perform exploratory analysis in combination with visualization techniques to discover patterns and features of interest

**2.1 Identify categorical, ordinal, and numerical values within the data.**

In [None]:
total_num_columns = df.shape[1]
print(total_num_columns)

In [None]:
# Possible data types in pandas include numbers (integer and float), objects, strings, datetimes, timedeltas, categories, and datetimetz.

numerical_col = df.select_dtypes(include = 'number').columns
numerical_col_count = len(numerical_col)
print("Numerical data =", list(numerical_col))
print("Number of numerical columns =", numerical_col_count)

pandas.DataFrame.select_dtypes. (n.d.). pandas 2.3.3 documentation. Retrieved November 24, 2025, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

In [None]:
# Of the 21 columns, 20 are numerical. Therefore, there is one remaining non-numerical column.
# The process for numerical data will be repeated for object, datetime, and categorical data.

# object data
object_col = df.select_dtypes(include = 'object').columns
object_col_count = len(object_col)
print("Object data =", list(object_col))
print("Number of object columns =", object_col_count)

# datetimes
datetime_col = df.select_dtypes(include = 'datetime64').columns
datetime_col_count = len(datetime_col)
print("Datetime data =", list(datetime_col))
print("Number of datetime columns =", datetime_col_count)

# categories
categorical_col = df.select_dtypes(include = 'category').columns
categorical_col_count = len(categorical_col)
print("Categorical data =", list(categorical_col))
print("Number of categorical columns =", categorical_col_count)
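The repeated `select_dtypes` calls can be condensed into a single dictionary keyed by dtype selector. This is a sketch on a hypothetical three-column frame; the column names mirror the dataset, but the frame itself is invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one float, one integer, and one datetime column
toy = pd.DataFrame({
    "price": [221900.0, 538000.0],
    "grade": [7, 7],
    "date": pd.to_datetime(["2014-10-13", "2014-12-09"]),
})

# Map each dtype selector to the list of matching column names
summary = {
    selector: list(toy.select_dtypes(include=selector).columns)
    for selector in ["number", "datetime64", "category"]
}
print(summary)
```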

In [None]:
data_types = df.dtypes
data_types

Of the 21 total columns in the house sales data frame, **20 contain numerical data and 1 contains datetime data, which is ordinal**.
The output from the earlier script was verified with df.dtypes. The listed data types align with the df.info() output from Step 1.

**2.2 Provide measures of centrality and distributions with visualizations.**

In [None]:
# date
# Dates are not technically a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median date =", df['date'].median())
print("mode date = ", df['date'].mode())

In [None]:
# Using the square root of the number of entries to determine the number of histogram bins.

import math

print(round(math.sqrt(21613), 2))
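The square-root rule above is where the 147 bins used in the later histograms come from. As a sketch, the bin count can be derived from the row count directly, so it updates if the data change. The variable names are hypothetical.

```python
import math

n_rows = 21613  # row count used in the cell above; len(df) would give this for the data frame

# Square-root rule: number of bins is roughly the square root of the number of observations
n_bins = round(math.sqrt(n_rows))
print(n_bins)
```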

In [None]:
# Plotting a histogram to look at the spread of the data.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(12, 6))  # Increase figure width
ax.hist(df['date'], bins=147, color = 'steelblue', edgecolor='none')
ax.set_title ('Histogram of Date')
ax.set_xlabel('Date')
ax.set_ylabel('Count')

ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
plt.xticks(rotation=45)

plt.show()

I used ChatGPT to help me reformat the histogram. Originally, the figure was too small to draw any conclusions. Therefore, I entered my original code (plt.hist()), and ChatGPT helped me to set the values as dates, replot the data using Axes, rather than pyplot, expand the x-axis, and increase the number of intervals to visualize the fluctuations over time.

Chatgpt. (n.d.). ChatGPT. Retrieved November 24, 2025, from https://chatgpt.com/

In [None]:
# price

print("mean price = $", df['price'].mean().round(2))
print("median price = $", df['price'].median())
print("mode price = $", df['price'].mode())

In [None]:
# The previous result suggests that there are two modes: $350,000.00 and $450,000.00, so I want to count the number of rows with those prices.

df.loc[df['price'] == 350000,'price']

In [None]:
df.loc[df['price'] == 450000,'price']
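To get the counts directly rather than reading them off the filtered rows, `value_counts` can be indexed by the modes. The sketch below uses a hypothetical price series with two tied modes; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical prices with two equally frequent values
prices_toy = pd.Series([350000.0, 350000.0, 450000.0, 450000.0, 500000.0])

# mode() returns every value tied for the highest frequency, in ascending order
modes = prices_toy.mode()

# Look up how often each mode occurs
mode_counts = prices_toy.value_counts().loc[modes.tolist()]
print(mode_counts)
```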

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['price'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Price')
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['price'])
plt.title ('Boxplot of Price')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# bedrooms
# The number of bedrooms is not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median bedrooms =", df['bedrooms'].median())
print("mode bedrooms =", df['bedrooms'].mode())
print("max number of bedrooms =", df['bedrooms'].max())
print("min number of bedrooms =", df['bedrooms'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The number of bedrooms ranges from 0 to 10 (11 integer values), therefore 11 bins are used in the histogram.

fig = plt.figure()
plt.hist(df['bedrooms'], bins=11, range = (0, 11), color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Number of Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Count')
plt.show()

In [None]:
# bathrooms
# The number of bathrooms is not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median bathrooms =", df['bathrooms'].median())
print("mode bathrooms =", df['bathrooms'].mode())
print("max number of bathrooms =", df['bathrooms'].max())
print("min number of bathrooms =", df['bathrooms'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The number of bathrooms ranges from 0 to 8, therefore 9 unit-width bins are used in the histogram.

fig = plt.figure()
plt.hist(df['bathrooms'], bins=9, range = (0, 9), color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Number of Bathrooms')
plt.xlabel('Bathrooms')
plt.ylabel('Count')
plt.show()

In [None]:
# sqft_living

print("mean living sqft =", df['sqft_living'].mean().round(2), "ft\u00b2")
print("median living sqft =", df['sqft_living'].median(), "ft\u00b2")
print("mode living sqft =", df['sqft_living'].mode(), "ft\u00b2")

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_living'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Living Space Size')
plt.xlabel('Area (sqft)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_living'])
plt.title ('Boxplot of Living Space Size')
plt.ylabel('Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_lot

print("mean lot sqft =", df['sqft_lot'].mean().round(2), "ft\u00b2")
print("median lot sqft =", df['sqft_lot'].median(), "ft\u00b2")
print("mode lot sqft =", df['sqft_lot'].mode(), "ft\u00b2")

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_lot'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Lot Size')
plt.xlabel('Area (ft\u00b2)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_lot'])
plt.title ('Boxplot of Lot Size')
plt.ylabel('Area (ft\u00b2)')
plt.show()

In [None]:
# floors
# The number of floors is not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median floors =", df['floors'].median())
print("mode floors =", df['floors'].mode())
print("max number of floors =", df['floors'].max())
print("min number of floors =", df['floors'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The number of floors ranges from 1 to 3.5, therefore 6 bins of width 0.5 are used in the histogram.

fig = plt.figure()
plt.hist(df['floors'], bins = 6, range = (1, 4), color = 'steelblue', edgecolor = 'none')
plt.title ('Histogram of Number of Floors')
plt.xlabel('Floors')
plt.ylabel('Count')
plt.show()

In [None]:
# waterfront
# The presence (=1) or absence (=0) of a waterfront view is binary, so it does not make sense to calculate the median.
# Although the dataset is binary, the mean gives the proportion of homes with a waterfront view, an intermediate measure of centrality between 0 and 1.

print("mean waterfront =", df['waterfront'].mean())
print("mode waterfront =", df['waterfront'].mode())

In [None]:
# Plotting a histogram to look at the frequency of a waterfront view.
# The presence or absence of a waterfront view is a binary dataset. Therefore only 2 bins are needed.

fig = plt.figure()
plt.hist(df['waterfront'], bins = 2, color = 'steelblue', edgecolor = 'none')
plt.title ('Frequency of a Waterfront View')
plt.xlabel('Waterfront View')
plt.ylabel('Count')
plt.show()

In [None]:
# view
# The view column is an ordinal score rather than a binary flag (the histogram below bins scores from 0 to 4).
# The median and mode are therefore the most meaningful measures of centrality; the mean is reported only as a rough summary.

print("mean view =", df['view'].mean())
print("median view =", df['view'].median())
print("mode view =", df['view'].mode())
print("max score of view =", df['view'].max())
print("min score of view =", df['view'].min())

In [None]:
# Plotting a histogram to look at the frequency of each view score.
# Scores are binned from 0 to 4, therefore 5 bins are used.

fig = plt.figure()
plt.hist(df['view'], bins = 5, range = (0, 5), color = 'steelblue', edgecolor = 'none')
plt.title ('Frequency of a View')
plt.xlabel('View')
plt.ylabel('Count')
plt.show()

In [None]:
# condition
# The score of the condition is not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median condition =", df['condition'].median())
print("mode condition =", df['condition'].mode())
print("max score of condition =", df['condition'].max())
print("min score of condition =", df['condition'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The range of condition scores is 1 to 5, therefore 5 bins are used in the histogram.

fig = plt.figure()
plt.hist(df['condition'], bins = 5, range = (1, 6), color = 'steelblue', edgecolor = 'none')
plt.title ('Histogram of Condition')
plt.xlabel('Condition')
plt.ylabel('Count')
plt.show()

In [None]:
# grade
# The grade is not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median grade =", df['grade'].median())
print("mode grade =", df['grade'].mode())
print("max score of grade =", df['grade'].max())
print("min score of grade =", df['grade'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The range of grades is 1 to 13, therefore 13 bins are used in the histogram.

fig = plt.figure()
plt.hist(df['grade'], bins = 13, range = (1, 14), color = 'steelblue', edgecolor = 'none')
plt.title ('Histogram of Grade of House')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()

In [None]:
# sqft_above

print("mean sqft above =", df['sqft_above'].mean().round(1), "ft\u00b2")
print("median sqft above =", df['sqft_above'].median(), "ft\u00b2")
print("mode sqft above =", df['sqft_above'].mode(), "ft\u00b2")

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_above'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Area Above')
plt.xlabel('Area (ft\u00b2)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_above'])
plt.title ('Boxplot of Area Above')
plt.ylabel('Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_basement

print("mean basement sqft =", df['sqft_basement'].mean().round(1))
print("median basement sqft =", df['sqft_basement'].median())
print("mode basement sqft =", df['sqft_basement'].mode())

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_basement'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Basement Size')
plt.xlabel('Area (ft\u00b2)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_basement'])
plt.title ('Boxplot of Basement Size')
plt.ylabel('Area (ft\u00b2)')
plt.show()

In [None]:
# year built
# Dates are not technically a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median year built =", df['yr_built'].median())
print("mode year built = ", df['yr_built'].mode())

In [None]:
# Plotting a histogram to look at the spread of the data.

fig = plt.figure()
plt.hist(df['yr_built'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Year Built')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

In [None]:
# year renovated
# Dates are not technically a continuous dataset, therefore, it does not make sense to calculate the mean.

print("median year renovated =", df['yr_renovated'].median())
print("mode year renovated = ", df['yr_renovated'].mode())

In [None]:
# Creating a list of years that the houses were renovated.

reno_yr = np.sort(df['yr_renovated'].unique())
print(reno_yr)

reno_yr_count = df['yr_renovated'].value_counts().sort_index()
print(reno_yr_count)

In [None]:
# Plotting a bar chart to look at the frequency of houses renovated.

reno_yr_string = reno_yr.astype(str)

plt.figure(figsize=(14,4))
plt.bar(reno_yr_string, reno_yr_count.values)
plt.xlabel("Year Renovated")
plt.ylabel("Count")
plt.title("Bar Chart of Renovation Year")
plt.xticks(rotation = 45)
plt.show()

In [None]:
# zipcode
# Zipcodes are not a continuous dataset, therefore, it does not make sense to calculate the mean.

print("zipcode median =", df['zipcode'].median())
print("zipcode mode = ", df['zipcode'].mode())
print("zipcode range =", df['zipcode'].max() - df['zipcode'].min())

In [None]:
# Plotting a histogram to look at the spread of the data.
# The zipcode range is 198, therefore 198 bins are used in the histogram.

fig = plt.figure()
plt.hist(df['zipcode'], bins=198, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Zipcode')
plt.xlabel('Zipcode')
plt.ylabel('Count')
plt.show()

In [None]:
# lat

print("mean latitude =", df['lat'].mean().round(1))
print("median latitude =", df['lat'].median())
print("mode latitude =", df['lat'].mode())

In [None]:
# The previous result suggests that there are four modes: 47.5322, 47.5491, 47.6624, and 47.6846, so I want to count the number of rows with those latitudes.

lat_mode_1 = df['lat'] == 47.5322
print("lat_mode_1 =", lat_mode_1.sum())

lat_mode_2 = df['lat'] == 47.5491
print("lat_mode_2 =", lat_mode_2.sum())

lat_mode_3 = df['lat'] == 47.6624
print("lat_mode_3 =", lat_mode_3.sum())

lat_mode_4 = df['lat'] == 47.6846
print("lat_mode_4 =", lat_mode_4.sum())

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['lat'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Latitude')
plt.xlabel('Latitude (coordinate)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['lat'])
plt.title ('Boxplot of Latitude')
plt.ylabel('Latitude (coordinate)')
plt.show()

In [None]:
# long

print("mean longitude =", df['long'].mean().round(1))
print("median longitude =", df['long'].median())
print("mode longitude =", df['long'].mode())

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['long'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Longitude')
plt.xlabel('Longitude (coordinate)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['long'])
plt.title ('Boxplot of Longitude')
plt.ylabel('Longitude (coordinate)')
plt.show()

In [None]:
# sqft_living15

print("mean sqft_living15 =", df['sqft_living15'].mean().round(1), "ft\u00b2")
print("median sqft_living15 =", df['sqft_living15'].median(), "ft\u00b2")
print("mode sqft_living15 =", df['sqft_living15'].mode(), "ft\u00b2")

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_living15'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Average Living Space Size in Nearest 15 Houses')
plt.xlabel('Area (ft\u00b2)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_living15'])
plt.title ('Boxplot of Average Living Space Size in Nearest 15 Houses')
plt.ylabel('Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_lot15

print("mean sqft_lot15 =", df['sqft_lot15'].mean().round(1), "ft\u00b2")
print("median sqft_lot15 =", df['sqft_lot15'].median(), "ft\u00b2")
print("mode sqft_lot15 =", df['sqft_lot15'].mode(), "ft\u00b2")

In [None]:
# Plotting a histogram and boxplot to look at the spread of the data.

fig = plt.figure()
plt.hist(df['sqft_lot15'], bins=147, color = 'steelblue', edgecolor='none')
plt.title ('Histogram of Average Lot Size of the Nearest 15 Houses')
plt.xlabel('Area (ft\u00b2)')
plt.ylabel('Count')
plt.show()

fig = plt.figure()
plt.boxplot(df['sqft_lot15'])
plt.title ('Boxplot of Average Lot Size of the Nearest 15 Houses')
plt.ylabel('Area (ft\u00b2)')
plt.show()

**2.3 Diagnose for correlations between variables and determine independent and dependent variables.**

In [None]:
pd.set_option('display.max_columns', None)

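# numeric_only=True keeps the datetime 'date' column out of the correlation calculation
corr_matrix = df.corr(numeric_only=True)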
corr_matrix

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create upper triangle mask
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

# Plot heatmap of the lower triangle (upper triangle masked)
plt.figure(figsize=(12, 8))
sns.heatmap(
    corr_matrix, 
    mask=mask,          # hide the upper triangle
    annot=True,          # show values
    fmt=".2f",           # number format
    cmap="coolwarm", 
    vmin=-1, vmax=1, 
    linewidths=0.5
)

plt.title("Lower Triangle Correlation Heatmap")
plt.show()


I used ChatGPT to help me print the correlation table and color the cells as a heatmap. The full correlation table (see above) was not conducive to reporting the values because it included the 1.0 self-correlations and duplicate values.
Therefore, I asked ChatGPT to help me generate a refined correlation table with a color scale to identify strong correlations.

Chatgpt. (n.d.). ChatGPT. Retrieved November 24, 2025, from https://chatgpt.com/
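To make the masking step concrete, the sketch below builds the same kind of mask for a hypothetical 3 × 3 matrix. In `sns.heatmap`, cells where the mask is `True` are not drawn, so a mask built with `np.triu(..., k=1)` hides the entries above the diagonal and leaves the diagonal and lower triangle visible.

```python
import numpy as np

# k=1 starts the triangle one step above the main diagonal,
# so the diagonal itself stays False (visible)
mask_demo = np.triu(np.ones((3, 3), dtype=bool), k=1)
print(mask_demo)
```

Using `k=0` instead would also hide the diagonal of 1.0 self-correlations.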

In [None]:
# The correlation table above contains variables that exhibit weak correlations.
# I will filter out variables with weak correlation values (|r| < 0.5) and the self-correlation values (r = 1).
# I will return the variables with moderate to strong correlation values (|r| >= 0.5); the 0.49 cutoff in the code also keeps values just under the threshold.

high_corr = ((corr_matrix.abs() > 0.49) & (corr_matrix.abs() < 1))
high_corr.columns[high_corr.any()]
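The filter above returns the columns involved in at least one moderate or strong correlation. If the specific variable pairs are of interest, one possible approach is to keep only the upper triangle of the matrix and stack it into a list of pairs. This sketch uses a hypothetical toy frame with columns `x`, `z`, and `w` invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x and z are strongly related, w is nearly unrelated to both
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "z": [2, 1, 4, 3, 5],
    "w": [1, 0, -2, 0, 1],
})
corr = toy.corr()

# Keep entries strictly above the diagonal so each pair appears once and self-correlations are dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Stack into a Series indexed by (row variable, column variable), then filter by |r|
pairs = upper.stack()
strong = pairs[pairs.abs() >= 0.5]
print(strong)
```

Comparisons against NaN evaluate to False, so any masked cells that survive stacking are removed by the final filter.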

In [None]:
# A reduced correlation matrix will be used to visualize the moderate to strong correlations only.

corr_col = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'yr_built', 'zipcode', 'long', 'sqft_living15', 'sqft_lot15']

corr_matrix_red = df[corr_col].corr()
corr_matrix_red

In [None]:
# Create upper triangle mask
mask1 = np.triu(np.ones_like(corr_matrix_red, dtype=bool), k=1)

# Plot heatmap of the lower triangle (upper triangle masked)
plt.figure(figsize=(12, 8))
sns.heatmap(
    corr_matrix_red, 
    mask=mask1,          # hide the upper triangle
    annot=True,          # show values
    fmt=".2f",           # number format
    cmap="coolwarm", 
    vmin=-1, vmax=1, 
    linewidths=0.5
)

plt.title("Reduced Lower Triangle Correlation Heatmap")
plt.show()

Of the 13 variables with moderate to strong correlation values, **price** appears to be a dependent variable with moderately positive correlations with independent variables such as living area size (0.69), grade (0.67), area above (0.61), and the average living area of 15 nearby houses (0.59).

The **living area size**, a potential dependent variable, has strong correlations with likely independent variables such as the area above (0.86), grade (0.75), average living area of 15 nearby houses (0.74), and number of bathrooms (0.72).

The **area of the lot**, a dependent variable, has a strong positive correlation with the average lot areas of 15 nearby houses (0.72), a potential predictor.

The **average living area of 15 nearby houses**, an unlikely dependent variable, has strong positive correlations with the grade of the house (0.71), area above (0.73), and living area (0.74).

These relationships will be explored.

In [None]:
# price vs. sqft_living (and grade)

fig = plt.figure()
plt.scatter(df['sqft_living'], df['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_living'], df['price'], 1)
line = slope * df['sqft_living'] + intercept
plt.plot(df['sqft_living'], line, color = 'red')
plt.text(500, 7500000, ("price =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_living"))

print("slope =", str(round(slope, 2)))
print("intercept =", str(round(intercept, 2)))

plt.title ('Scatter Plot of Price and Living Area Colored by Grade')
plt.xlabel('Living Area (ft\u00b2)')
plt.ylabel('Price ($)')
plt.show()

The Bobbitt (2020) webpage helped me find the code to create a scatter plot with a color scheme from a third column. In this case, I plotted living area on the x-axis and price on the y-axis while color-coding the data points by the grade of the house. (Grade was selected for color coding because it is a categorical variable.) The color coding itself was very useful; however, I still needed a reference for the magnitudes the colors represent. As such, I added a color bar to provide a scale for the grade of each house, using a single line of code from the GeeksforGeeks (2020) example to build it on the existing scatter plot. Lastly, I wanted to fit a trendline to the scatter plot data to visualize the relationship described in the correlation table, so I referenced the code in GeeksforGeeks (2024) to create a best-fit line.

Bobbitt, Z. (2020, September 3). Matplotlib: How to color a scatterplot by value. Statology. https://www.statology.org/matplotlib-scatterplot-color-by-value/

How to draw a line inside a scatter plot. (2024, July 22). GeeksforGeeks. https://www.geeksforgeeks.org/data-visualization/how-to-draw-a-line-inside-a-scatter-plot/

Matplotlib.pyplot.colorbar() function in Python. (2020, December 5). GeeksforGeeks. https://www.geeksforgeeks.org/python/matplotlib-pyplot-colorbar-function-in-python/

In [None]:
import pandas as pd
from scipy.stats import zscore

z_scaled = df.copy()

norm_col = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'yr_built', 'zipcode', 'long', 'sqft_living15', 'sqft_lot15']

z_scaled[norm_col] = zscore(z_scaled[norm_col])
print(z_scaled)

Standardizing the data puts the variables on a common scale for modeling. I referenced the GeeksforGeeks (2021) webpage to convert the graphed columns to their respective z-scores. Each z-score measures how many standard deviations a value lies from its column mean, so every standardized column has a mean of zero and the variables can be compared directly.

How to standardize data in a pandas dataframe? (2021, December 16). GeeksforGeeks. https://www.geeksforgeeks.org/python/how-to-standardize-data-in-a-pandas-dataframe/
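As a sketch of what the z-score transformation does, the calculation can be reproduced by hand with NumPy. `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which is what the example below assumes. The input values are hypothetical.

```python
import numpy as np

# Hypothetical values standing in for one numeric column
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# z = (x - mean) / standard deviation, with the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)
print(z)
```

Note that the pandas `Series.std` method defaults to `ddof=1`, so a manual calculation with pandas would differ slightly unless `ddof=0` is passed.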

In [None]:
# price vs. sqft_living (and grade), normalized

from sklearn.linear_model import LinearRegression

fig = plt.figure()
plt.scatter(z_scaled['sqft_living'], z_scaled['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_living'], z_scaled['price'], 1)
line = slope * z_scaled['sqft_living'] + intercept
plt.plot(z_scaled['sqft_living'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_living']], z_scaled['price'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_living']], z_scaled['price'])

plt.text(-2, 19, ("price =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_living" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Price and Living Area Colored by Grade')
plt.xlabel('Living Area (z-score)')
plt.ylabel('Price (z-score)')
plt.show()

After graphing the normalized data, I wanted to return the R-squared value of the best-fit line and verify that the best-fit line matches the actual linear regression. I referenced the Bobbitt (2022) webpage to obtain those values.

Bobbitt, Z. (2022, March 24). How to calculate r-squared in python(With example). Statology. https://www.statology.org/r-squared-in-python/
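For a simple regression on standardized variables, the fitted slope equals the Pearson correlation coefficient, the intercept is zero, and R² equals the correlation squared. The sketch below checks this on hypothetical toy data using NumPy only. The helper `standardize` is an illustrative name, not part of the notebook.

```python
import numpy as np

# Hypothetical toy data with a Pearson correlation of 0.8
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def standardize(a):
    """Convert an array to z-scores using the population standard deviation."""
    return (a - a.mean()) / a.std()

zx, zy = standardize(x), standardize(y)

# Least-squares line through the standardized data
slope, intercept = np.polyfit(zx, zy, 1)

# Pearson correlation of the original data
r = np.corrcoef(x, y)[0, 1]

# R-squared as one minus the ratio of residual variance to total variance
residuals = zy - (slope * zx + intercept)
r_squared = 1 - residuals.var() / zy.var()

print(round(slope, 2), round(r, 2), round(r_squared, 2))
```

By the same reasoning, if the reported correlation of 0.69 between price and living area holds, the standardized slope should be about 0.69 and R² roughly 0.48.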

In [None]:
# price vs. grade

fig = plt.figure()
plt.scatter(df['grade'], df['price'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(df['grade'], df['price'], 1)
line = slope * df['grade'] + intercept
plt.plot(df['grade'], line, color = 'red')
plt.text(1, 7300000, ("price =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*grade"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Price and Grade')
plt.xlabel('Grade')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# price vs. grade, normalized

fig = plt.figure()
plt.scatter(z_scaled['grade'], z_scaled['price'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['grade'], z_scaled['price'], 1)
line = slope * z_scaled['grade'] + intercept
plt.plot(z_scaled['grade'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['grade']], z_scaled['price'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['grade']], z_scaled['price'])

plt.text(-5.8, 19, ("price =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*grade" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Price and Grade')
plt.xlabel('Grade')
plt.ylabel('Price')
plt.show()

In [None]:
# price vs. sqft_above (and grade)

fig = plt.figure()
plt.scatter(df['sqft_above'], df['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_above'], df['price'], 1)
line = slope * df['sqft_above'] + intercept
plt.plot(df['sqft_above'], line, color = 'red')
plt.text(500, 7500000, ("price =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_above"))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

plt.title ('Scatter Plot of Price and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# price vs. sqft_above (and grade), normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_above'], z_scaled['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_above'], z_scaled['price'], 1)
line = slope * z_scaled['sqft_above'] + intercept
plt.plot(z_scaled['sqft_above'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_above']], z_scaled['price'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_above']], z_scaled['price'])

plt.text(-2, 19, ("price =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_above" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Price and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# price vs. sqft_living15

fig = plt.figure()
plt.scatter(df['sqft_living15'], df['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_living15'], df['price'], 1)
line = slope * df['sqft_living15'] + intercept
plt.plot(df['sqft_living15'], line, color = 'red')
plt.text(500, 7500000, ("price =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_living15"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Price and Average Living Area of Nearby Houses Colored by Grade')
plt.xlabel('Living Area (ft\u00b2)')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# price vs. sqft_living15, normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_living15'], z_scaled['price'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_living15'], z_scaled['price'], 1)
line = slope * z_scaled['sqft_living15'] + intercept
plt.plot(z_scaled['sqft_living15'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_living15']], z_scaled['price'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_living15']], z_scaled['price'])

plt.text(-2.3, 18.5, ("price =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_living15" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Price and Living Area of Nearby Houses Colored by Grade')
plt.xlabel('Living Area (ft\u00b2)')
plt.ylabel('Price ($)')
plt.show()

To varying but moderate degrees, house prices are associated with the area above, grade, average living area of the 15 nearest houses, and number of bathrooms. These factors, among others, would be expected to play a role in the overall price of a house. The variables not plotted had correlations with price below 0.59. All of the variables graphed on the x-axis showed a positive relationship with price. Ultimately, none of the R-squared values exceeded 0.7, which means that no single independent variable reliably explains the variation in house prices.
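To compare single predictors side by side, the squared correlation with the target can be tabulated in one step, because a one-predictor linear regression has R-squared equal to r squared. The sketch below uses a small synthetic DataFrame named `demo`; the column names mirror this dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the house data (values are invented)
rng = np.random.default_rng(1)
n = 1000
size = rng.normal(2000, 500, n)
demo = pd.DataFrame({
    "sqft_living": size,
    "grade": np.round(size / 400 + rng.normal(0, 1, n)),
    "price": 200 * size + rng.normal(0, 150000, n),
})

# Correlation of every predictor with price, excluding price itself
corr_with_price = demo.corr()["price"].drop("price")

# Single-predictor R-squared is the squared correlation
r2_single = (corr_with_price ** 2).sort_values(ascending=False)
print(r2_single.round(2))
```

Sorting by this value gives a quick ranking of which variables are worth plotting, without fitting a separate model for each one.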

In [None]:
# living area vs. area above

fig = plt.figure()
plt.scatter(df['sqft_above'], df['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_above'], df['sqft_living'], 1)
line = slope * df['sqft_above'] + intercept
plt.plot(df['sqft_above'], line, color = 'red')
plt.text(200, 11800, ("sqft_living =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_above"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. sqft_above, normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_above'], z_scaled['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_above'], z_scaled['sqft_living'], 1)
line = slope * z_scaled['sqft_above'] + intercept
plt.plot(z_scaled['sqft_above'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_above']], z_scaled['sqft_living'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_above']], z_scaled['sqft_living'])

plt.text(-2, 10.5, ("sqft_living =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_above" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. grade

fig = plt.figure()
plt.scatter(df['grade'], df['sqft_living'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(df['grade'], df['sqft_living'], 1)
line = slope * df['grade'] + intercept
plt.plot(df['grade'], line, color = 'red')
plt.text(1, 11500, ("sqft_living =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*grade"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area and Grade')
plt.xlabel('Grade')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. grade, normalized

fig = plt.figure()
plt.scatter(z_scaled['grade'], z_scaled['sqft_living'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['grade'], z_scaled['sqft_living'], 1)
line = slope * z_scaled['grade'] + intercept
plt.plot(z_scaled['grade'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['grade']], z_scaled['sqft_living'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['grade']], z_scaled['sqft_living'])

plt.text(-5.7, 10.5, ("sqft_living =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*grade" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area and Grade')
plt.xlabel('Grade')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. sqft_living15

fig = plt.figure()
plt.scatter(df['sqft_living15'], df['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_living15'], df['sqft_living'], 1)
line = slope * df['sqft_living15'] + intercept
plt.plot(df['sqft_living15'], line, color = 'red')
plt.text(200, 11200, ("sqft_living =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_living15"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area and Living Area of Nearby Houses Colored by Grade')
plt.xlabel('Living Area of Nearby Houses (ft\u00b2)')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. sqft_living15, normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_living15'], z_scaled['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_living15'], z_scaled['sqft_living'], 1)
line = slope * z_scaled['sqft_living15'] + intercept
plt.plot(z_scaled['sqft_living15'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_living15']], z_scaled['sqft_living'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_living15']], z_scaled['sqft_living'])

plt.text(-2, 10.1, ("sqft_living =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_living15" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area and Living Area of Nearby Houses Colored by Grade')
plt.xlabel('Living Area of Nearby Houses (ft\u00b2)')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. bathrooms

fig = plt.figure()
plt.scatter(df['bathrooms'], df['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['bathrooms'], df['sqft_living'], 1)
line = slope * df['bathrooms'] + intercept
plt.plot(df['bathrooms'], line, color = 'red')
plt.text(0, 11800, ("sqft_living =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*bathrooms"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area and Number of Bathrooms Colored by Grade')
plt.xlabel('Number of Bathrooms')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_living vs. bathrooms, normalized

fig = plt.figure()
plt.scatter(z_scaled['bathrooms'], z_scaled['sqft_living'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['bathrooms'], z_scaled['sqft_living'], 1)
line = slope * z_scaled['bathrooms'] + intercept
plt.plot(z_scaled['bathrooms'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['bathrooms']], z_scaled['sqft_living'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['bathrooms']], z_scaled['sqft_living'])

plt.text(-3, 10.7, ("sqft_living =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*bathrooms" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area and Number of Bathrooms Colored by Grade')
plt.xlabel('Number of Bathrooms')
plt.ylabel('Living Area (ft\u00b2)')
plt.show()

To varying but strong degrees, living area is associated with the area above, grade, average living area of the 15 nearest houses, and number of bathrooms. These factors, among others, would be expected to relate to the overall living area of a house. The variables not plotted had correlations with living area below 0.7. As expected, all of the variables graphed on the x-axis showed a positive relationship with living area. Ultimately, the R-squared values did not exceed 0.7, which means that no single independent variable reliably explains the variation in living area.
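The slope, intercept, and R-squared values discussed above come from the same few steps repeated in each cell. As a minimal sketch, those steps could be gathered into one function. `fit_summary` is a hypothetical helper, not part of the original notebook, and the demo arrays are made up.

```python
import numpy as np

def fit_summary(x, y, x_name="x", y_name="y"):
    """Return slope, intercept, R-squared, and a plot-ready label for a degree-1 fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Least-squares line, as in the plotting cells
    slope, intercept = np.polyfit(x, y, 1)

    # R-squared from its definition: 1 - SS_res / SS_tot
    residuals = y - (slope * x + intercept)
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

    label = f"{y_name} = {intercept:.2f} + {slope:.2f}*{x_name}, R\u00b2 = {r_squared:.2f}"
    return slope, intercept, r_squared, label

# Made-up, perfectly linear data to show the call pattern
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 3.0 * x_demo + 2.0
slope, intercept, r_squared, label = fit_summary(x_demo, y_demo, "grade", "sqft_living")
print(label)
```

The returned label can be passed straight to `plt.text`, which keeps the annotation and the printed values in sync.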

In [None]:
# sqft_lot vs. sqft_lot15

fig = plt.figure()
plt.scatter(df['sqft_lot15'], df['sqft_lot'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_lot15'], df['sqft_lot'], 1)
line = slope * df['sqft_lot15'] + intercept
plt.plot(df['sqft_lot15'], line, color = 'red')
plt.text(0, 1500000, ("sqft_lot =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_lot15"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Lot Area and Lot Area of Nearby Houses')
plt.xlabel('Lot Area of Nearby Houses (ft\u00b2)')
plt.ylabel('Lot Area (ft\u00b2)')
plt.show()

In [None]:
# sqft_lot vs. sqft_lot15, normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_lot15'], z_scaled['sqft_lot'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_lot15'], z_scaled['sqft_lot'], 1)
line = slope * z_scaled['sqft_lot15'] + intercept
plt.plot(z_scaled['sqft_lot15'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_lot15']], z_scaled['sqft_lot'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_lot15']], z_scaled['sqft_lot'])

plt.text(0, 35, ("sqft_lot =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_lot15" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Lot Area and Lot Area of Nearby Houses')
plt.xlabel('Lot Area of Nearby Houses (ft\u00b2)')
plt.ylabel('Lot Area (ft\u00b2)')
plt.show()

The slopes of the scatter plots help confirm the strong positive correlation between lot area and the lot area of the 15 nearest houses. However, the R-squared value did not exceed 0.7, which means that the lot area of the 15 nearest houses alone cannot reliably explain the variation in lot area.
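A strong correlation and an R-squared below 0.7 are consistent with each other. For a linear regression with a single predictor, R-squared is the square of the Pearson correlation r, so the 0.7 cutoff used here translates into a stricter requirement on r:

$$
R^2 = r^2, \qquad R^2 > 0.7 \iff |r| > \sqrt{0.7} \approx 0.837
$$

For example, a correlation of r = 0.7 gives R-squared = 0.49, meaning that just under half of the variance is explained.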

In [None]:
# sqft_living15 vs. grade

fig = plt.figure()
plt.scatter(df['grade'], df['sqft_living15'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(df['grade'], df['sqft_living15'], 1)
line = slope * df['grade'] + intercept
plt.plot(df['grade'], line, color = 'red')
plt.text(1, 6000, ("sqft_living15 =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*grade"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area of 15 Nearby Houses and Grade')
plt.xlabel('Grade')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
# sqft_living15 vs. grade, normalized

fig = plt.figure()
plt.scatter(z_scaled['grade'], z_scaled['sqft_living15'], s = 10)

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['grade'], z_scaled['sqft_living15'], 1)
line = slope * z_scaled['grade'] + intercept
plt.plot(z_scaled['grade'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['grade']], z_scaled['sqft_living15'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['grade']], z_scaled['sqft_living15'])

plt.text(-5.7, 5.8, ("sqft_living15 =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*grade" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area of Nearby Houses and Grade')
plt.xlabel('Grade')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
# sqft_living15 vs. sqft_above

fig = plt.figure()
plt.scatter(df['sqft_above'], df['sqft_living15'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_above'], df['sqft_living15'], 1)
line = slope * df['sqft_above'] + intercept
plt.plot(df['sqft_above'], line, color = 'red')
plt.text(100, 6500, ("sqft_living15 =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_above"))

print("best fit slope =", round(slope, 2))
print("best fit intercept =", round(intercept, 2))

plt.title ('Scatter Plot of Living Area of Nearby Houses and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
# sqft_living15 vs. sqft_above, normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_above'], z_scaled['sqft_living15'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_above'], z_scaled['sqft_living15'], 1)
line = slope * z_scaled['sqft_above'] + intercept
plt.plot(z_scaled['sqft_above'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_above']], z_scaled['sqft_living15'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_above']], z_scaled['sqft_living15'])

plt.text(-2, 6.5, ("sqft_living15 =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_above" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area of Nearby Houses and Area Above Colored by Grade')
plt.xlabel('Area Above (ft\u00b2)')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
# sqft_living15 vs. sqft_living (and grade)

fig = plt.figure()
plt.scatter(df['sqft_living'], df['sqft_living15'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(df['sqft_living'], df['sqft_living15'], 1)
line = slope * df['sqft_living'] + intercept
plt.plot(df['sqft_living'], line, color = 'red')
plt.text(200, 7400, ("sqft_living15 =" + str(round(intercept, 2)) + "+" + str(round(slope, 2)) + "*sqft_living"))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

plt.title ('Scatter Plot of Living Area of Nearby Houses and Living Area Colored by Grade')
plt.xlabel('Living Area (ft\u00b2)')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
# sqft_living15 vs. sqft_living (and grade), normalized

fig = plt.figure()
plt.scatter(z_scaled['sqft_living'], z_scaled['sqft_living15'], s = 10, c = df['grade'], cmap = 'viridis')
plt.colorbar(label = "Grade", orientation = "vertical")

# Calculate the line of best fit
slope, intercept = np.polyfit(z_scaled['sqft_living'], z_scaled['sqft_living15'], 1)
line = slope * z_scaled['sqft_living'] + intercept
plt.plot(z_scaled['sqft_living'], line, color = 'red')

# creating a linear regression model from the normalized data
model = LinearRegression()
model.fit(z_scaled[['sqft_living']], z_scaled['sqft_living15'])
slope_m = model.coef_[0]
intercept_m = model.intercept_
r_squared = model.score(z_scaled[['sqft_living']], z_scaled['sqft_living15'])

plt.text(-2, 7.8, ("sqft_living15 =" + str(round(intercept_m, 2)) + "+" + str(round(slope_m, 2)) + "*sqft_living" + ', R\u00b2 =' + str(round(r_squared, 2))))

print("best fit slope =", str(round(slope, 2)))
print("best fit intercept =", str(round(intercept, 2)))

print("model slope =", round(slope_m, 2))
print("model intercept =", round(intercept_m, 2))
print ('model R\u00b2 =', str(round(r_squared, 2)))

plt.title ('Scatter Plot of Normalized Living Area of Nearby Houses and Living Area Colored by Grade')
plt.xlabel('Living Area (ft\u00b2)')
plt.ylabel('Living Area of Nearby Houses (ft\u00b2)')
plt.show()

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))

# Scatter plot
sc = plt.scatter(
    df['long'],       # x-axis = longitude
    df['lat'],        # y-axis = latitude
    c=df['price'],    # color by 'price'
    cmap='viridis',         # colormap
    s=20,                   # marker size
    alpha=0.7
)

plt.colorbar(sc, label='Price ($)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Houses Colored by Price')
plt.show()


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))

# Scatter plot
sc = plt.scatter(
    df['long'],       # x-axis = longitude
    df['lat'],        # y-axis = latitude
    c=df['zipcode'],  # color by 'zipcode'
    cmap='viridis',         # colormap
    s=20,                   # marker size
    alpha=0.7
)

plt.colorbar(sc, label='Zipcode')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Houses Colored by Zipcode')
plt.show()

# **STEP 3: Data Analytics**
## - Determine the need for a supervised or unsupervised learning method and identify dependent and independent variables
## - Train, test, and provide accuracy and evaluation metrics for model results


For this portion, we will use supervised machine learning because the dataset already contains a labeled target, price, for every house. Holding out a test set lets us estimate how well the model generalizes to houses outside the training sample.
The dependent variable will be price, and the independent variables will be the features whose correlation with price exceeded 0.5 in absolute value: the living area of nearby houses (sqft_living15), living area (sqft_living), area above (sqft_above), number of bathrooms, and grade. Because price is continuous, we will use regression rather than classification. We will first fit a linear regression as a baseline, then move on to random forest regression to capture non-linear relationships.
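The plan above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final model: the DataFrame `demo` is synthetic, only two of the listed features are used, and the split ratio, `random_state`, and `n_estimators` values are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned house data (values are invented)
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "grade": rng.integers(4, 12, n),
})
demo["price"] = (150 * demo["sqft_living"]
                 + 40000 * demo["grade"]
                 + rng.normal(0, 50000, n))

X = demo[["sqft_living", "grade"]]   # independent variables
y = demo["price"]                    # dependent variable

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = [
    ("linear", LinearRegression()),
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Fit each model on the training set and evaluate on the held-out test set
results = {}
for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": model.score(X_test, y_test),
    }

for name, metrics in results.items():
    print(name, "| MAE:", round(metrics["mae"], 2), "| R\u00b2:", round(metrics["r2"], 3))
```

Evaluating both models on the same held-out rows keeps the comparison fair. On real data, the random forest is not guaranteed to beat the linear baseline.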

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, cross_val_score, KFold #used to split the data into training and testing groups
from sklearn.linear_model import LinearRegression #we will perform linear regression using this function
from sklearn.metrics import mean_absolute_error #used for evaluating the accuracy of our ML model
from sklearn.ensemble import RandomForestRegressor #we will use this for a more reliable model build for non-linear relationships
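The `KFold` and `cross_val_score` imports above support k-fold cross-validation, which averages the error over several train/test splits instead of relying on a single one. The sketch below shows one way they might be combined. The data are synthetic, and the choice of five folds and mean absolute error as the score is an assumption, not something this notebook specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Five shuffled folds; random_state makes the split reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MAE is reported as a negative number
neg_mae = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)
mae_per_fold = -neg_mae

print("MAE per fold:", mae_per_fold.round(3))
print("Mean MAE:", round(mae_per_fold.mean(), 3))
```

A large spread in the per-fold errors would suggest that a single train/test split could give a misleading picture of model accuracy.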


In [288]:
#for interactions between variables, random forest is more appropriate because we can 


# References

Bobbitt, Z. (2020, September 3). Matplotlib: How to color a scatterplot by value. Statology. https://www.statology.org/matplotlib-scatterplot-color-by-value/

Bobbitt, Z. (2022, March 24). How to calculate R-squared in Python (with example). Statology. https://www.statology.org/r-squared-in-python/

How to draw a line inside a scatter plot. (2024, July 22). GeeksforGeeks. https://www.geeksforgeeks.org/data-visualization/how-to-draw-a-line-inside-a-scatter-plot/

How to standardize data in a pandas dataframe? (2021, December 16). GeeksforGeeks. https://www.geeksforgeeks.org/python/how-to-standardize-data-in-a-pandas-dataframe/

Matplotlib.pyplot.colorbar() function in Python. (2020, December 5). GeeksforGeeks. https://www.geeksforgeeks.org/python/matplotlib-pyplot-colorbar-function-in-python/

https://medium.com/@sarah.ahmed.aboelseoud/beyond-the-numbers-understanding-linear-regression-modeling-2c9ae5697199 

pandas.DataFrame.select_dtypes: pandas 2.3.3 documentation. (n.d.). Retrieved November 24, 2025, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

Shah, C. (2020). A Hands-On Introduction to Data Science. Cambridge: Cambridge University Press. Accessed via web: https://www.cambridge.org/highereducation/books/a-hands-on-introduction-to-data-science/9D55C29C653872F13289EA7909953842#overview 


