# **SmartPrice Navigator: Predictive Housing Market Analysis**
This notebook demonstrates the step-by-step creation of a machine learning model to predict housing prices using historical housing market data from the ATTOM API. The goal is to provide actionable insights for buyers and sellers based on real estate attributes.

---


# Import Libraries
In this section, we import the necessary libraries for data handling, visualization, and model building.

In [1]:
# Import libraries for data handling, visualization, and model building.
%pip install python-dotenv

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
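# Price prediction is a regression task, so import regressors and regression metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score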

Note: you may need to restart the kernel to use updated packages.


In [5]:
# Load the ATTOM API key from a local .env file (ATTOM: https://www.attomdata.com/).
# Keep the key out of the notebook: do not hard-code it or print it.
import os
from dotenv import load_dotenv

# load_dotenv expects the path to the .env file itself, not to a directory.
load_dotenv(r"C:\Users\ReisH\OneDrive\Desktop\Project-2\local_keys.env")

attom_api_key = os.getenv("ATTOM_API_KEY")
print("ATTOM_API_KEY loaded:", attom_api_key is not None)

In [15]:
load_dotenv(r"C:\Users\ReisH\OneDrive\Desktop\Project-2\local_keys.env")

True

In [16]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Check that the variable is set without printing its value
attom_api_key = os.getenv("ATTOM_API_KEY")
if attom_api_key is None:
    print("ATTOM_API_KEY is not set. Please check your .env file.")
else:
    print("ATTOM_API_KEY loaded successfully.")

ATTOM_API_KEY loaded successfully.

In [3]:
import pandas as pd

# Load the data from the CSV file.
data1 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_invt_fs_uc_sfrcondo_sm_month.csv')
data2 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_market_temp_index_uc_sfrcondo_month (1).csv')
data3 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_mean_doz_pending_uc_sfrcondo_sm_month.csv')
data4 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_new_con_sales_count_raw_uc_sfrcondo_month.csv')
data5 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_sales_count_now_uc_sfrcondo_month.csv')
data6 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_zhvf_growth_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv')
data7 = pd.read_csv(r'C:\Users\ReisH\OneDrive\Desktop\Project-2\Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month (1).csv')

# Display the first few rows of each dataset.
print(data1.head())
print(data2.head())
print(data3.head())
print(data4.head())
print(data5.head())
print(data6.head())
print(data7.head())

   RegionID  SizeRank       RegionName RegionType StateName  2018-03-31  \
0    102001         0    United States    country       NaN   1421530.0   
1    394913         1     New York, NY        msa        NY     73707.0   
2    753899         2  Los Angeles, CA        msa        CA     21998.0   
3    394463         3      Chicago, IL        msa        IL     38581.0   
4    394514         4       Dallas, TX        msa        TX     24043.0   

   2018-04-30  2018-05-31  2018-06-30  2018-07-31  ...  2024-01-31  \
0   1500196.0   1592417.0   1660615.0   1709144.0  ...    890491.0   
1     80345.0     85864.0     90067.0     91881.0  ...     36461.0   
2     23784.0     25605.0     27109.0     28811.0  ...     14058.0   
3     42253.0     45757.0     47492.0     48984.0  ...     19092.0   
4     25876.0     28225.0     30490.0     32408.0  ...     21664.0   

   2024-02-29  2024-03-31  2024-04-30  2024-05-31  2024-06-30  2024-07-31  \
0    876361.0    913841.0    967480.0   1036855.0  
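
The tables printed above are in wide format, with one column per month-end date. A minimal sketch of reshaping such a table into long format, which is usually easier to plot and join, follows. The tiny `wide` frame is made-up illustration data that only mimics the column layout, and the names `long_df`, `date`, and `value` are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Made-up sample that mimics the wide layout (one column per month).
wide = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": ["Region A", "Region B"],
    "2018-03-31": [100.0, 50.0],
    "2018-04-30": [110.0, 55.0],
})

id_cols = ["RegionID", "RegionName"]

# Turn every date column into one (date, value) row per region.
long_df = wide.melt(id_vars=id_cols, var_name="date", value_name="value")
long_df["date"] = pd.to_datetime(long_df["date"])
long_df = long_df.sort_values(["RegionID", "date"]).reset_index(drop=True)

print(long_df)
```

For the real files, `id_cols` would probably also need the other identifier columns shown in the output (`SizeRank`, `RegionType`, `StateName`).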

In [4]:
# Import the libraries used to call the ATTOM API.
import requests
import json
import time
import random

# Load environment variables
load_dotenv()
attom_api_key = os.getenv("ATTOM_API_KEY")


In [5]:

# Set the base URL for the ATTOM API.
base_url = "https://api.gateway.attomdata.com/propertyapi/v1.0.0/assessment/detail"


In [6]:

# Set the query parameters for the ATTOM API.
# The API key is sent in the request headers, so it is not repeated here.
params = {
    "address": "123 Main St",
    "city": "Seattle",
    "state": "WA",
}

In [7]:

# Set the headers for the ATTOM API.
headers = {
    "accept": "application/json",
    "apikey": attom_api_key
}

In [8]:

# Make a request to the ATTOM API.
response = requests.get(base_url, headers=headers, params=params, timeout=30)
if response.status_code == 200:
    data = response.json()
    print(json.dumps(data, indent=2))
else:
    # A 401 status means the API key was missing or rejected.
    print(f"Error: {response.status_code}")

Error: 401
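
A 401 means the key was rejected, so repeating the same request will not help. Transient failures such as rate limiting or server errors are a different case, and the `time` and `random` imports above suggest that retries were planned. The following is a minimal sketch of retrying with exponential backoff and jitter. The helper `get_with_retry`, the `RETRYABLE` set, and the `fetch` callable are hypothetical names introduced for illustration, and the choice of retryable status codes is an assumption, not ATTOM guidance.

```python
import random
import time

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = {429, 500, 502, 503, 504}

def get_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is any zero-argument callable that returns (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status not in RETRYABLE or attempt == max_attempts:
            return status, body
        # Exponential backoff (1x, 2x, 4x, ... of base_delay) plus up to 50% jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay * (1 + random.random() / 2))

# Fake fetch that fails twice and then succeeds; sleeping is disabled for the demo.
responses = iter([(503, ""), (429, ""), (200, "ok")])
result = get_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

With `requests`, `fetch` could wrap the `requests.get` call and return `(response.status_code, response.text)`.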


In [10]:
# Print each CSV DataFrame, then the raw API response, to inspect their structure.
print("Full API Response:", data1)
print("Full API Response:", data2)
print("Full API Response:", data3)
print("Full API Response:", data4)
print("Full API Response:", data5)
print("Full API Response:", data6)
print("Full API Response:", data7)
print("Full API Response:", response.json())


Full API Response:      RegionID  SizeRank       RegionName RegionType StateName  2018-03-31  \
0      102001         0    United States    country       NaN   1421530.0   
1      394913         1     New York, NY        msa        NY     73707.0   
2      753899         2  Los Angeles, CA        msa        CA     21998.0   
3      394463         3      Chicago, IL        msa        IL     38581.0   
4      394514         4       Dallas, TX        msa        TX     24043.0   
..        ...       ...              ...        ...       ...         ...   
923    753929       935       Zapata, TX        msa        TX        55.0   
924    394743       936    Ketchikan, AK        msa        AK        77.0   
925    753874       937        Craig, CO        msa        CO       115.0   
926    395188       938       Vernon, TX        msa        TX        21.0   
927    394767       939       Lamesa, TX        msa        TX        35.0   

     2018-04-30  2018-05-31  2018-06-30  2018-07-31  ...

In [11]:

# Get the data from the response.
data = response.json()

In [12]:
# Inspect the keys of the JSON response
print(data.keys())

# Convert the relevant data to a DataFrame
# Assuming the relevant data is in the 'property' key within the JSON response
if 'property' in data:
    data_df = pd.DataFrame(data['property'])
    print(data_df.head())
else:
    print("Key 'property' not found in the JSON response")






dict_keys(['Response'])
Key 'property' not found in the JSON response
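
The output above shows that a failed call returns a payload whose only top-level key is `Response` rather than `property`. A small helper can surface that case as an explicit error instead of letting the error envelope flow into later cells. This is a sketch only: `extract_properties` is a hypothetical name, the `code` and `msg` field names are taken from the error payload printed later in this notebook, and both sample payloads are made up.

```python
def extract_properties(payload):
    """Return the list under 'property', or raise with the API's error details."""
    if "property" in payload:
        return payload["property"]
    # Error envelope: a single 'Response' key carrying 'code' and 'msg'.
    error = payload.get("Response", {})
    code = error.get("code", "unknown")
    msg = error.get("msg", "no message")
    raise ValueError(f"API returned no property data (code={code}, msg={msg})")

# Made-up payloads for illustration only.
ok_payload = {"property": [{"identifier": {"id": 1}}]}
error_payload = {"Response": {"version": "1.0.0", "code": "401", "msg": "example error message"}}

print(len(extract_properties(ok_payload)))
try:
    extract_properties(error_payload)
except ValueError as exc:
    print(exc)
```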


In [13]:
# Convert the data to a DataFrame and display the data types of each column.
df = pd.DataFrame(data)
df.dtypes

Response    object
dtype: object

In [14]:
# Convert the data to a DataFrame and display the summary statistics.
df = pd.DataFrame(data)
df.describe()

       Response
count         1
unique        1
top     {'version': '1.0.0', 'code': '401', 'msg': 'Un...
freq          1


In [15]:
# Ensure the DataFrame contains only numerical values.
df_numeric = df.select_dtypes(include=[np.number])

# Display the correlation matrix of the numerical data.
correlation_matrix = df_numeric.corr()
print("Correlation Matrix:")
print(correlation_matrix)


Correlation Matrix:
Empty DataFrame
Columns: []
Index: []


In [16]:
# Ensure the DataFrame contains only numerical values.
numerical_df = df.select_dtypes(include=['number'])
print("Numerical DataFrame:")
print(numerical_df.head())

# Check if the numerical DataFrame is not empty before plotting.
if not numerical_df.empty:
    # Plot the correlation matrix as a heatmap.
    plt.figure(figsize=(10, 8))
    sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm')
    plt.show()
else:
    print("The numerical DataFrame is empty. Cannot plot heatmap.")

Numerical DataFrame:
Empty DataFrame
Columns: []
Index: [status]
The numerical DataFrame is empty. Cannot plot heatmap.


In [23]:
# Convert the data dictionary to a DataFrame.
# Note: `data` still holds the 401 error payload here, so the DataFrame has only
# a 'Response' column and the column selection below raises a KeyError.
data_df = pd.DataFrame(data)

# Split the data into features and target variable.
feature_columns = ['zipcode', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
                   'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement',
                   'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
X = data_df.loc[:, feature_columns]
y = data_df.loc[:, 'Sales']

KeyError: "None of [Index(['zipcode', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',\n       'waterfront', 'view', 'condition', 'grade', 'sqft_above',\n       'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long',\n       'sqft_living15', 'sqft_lot15'],\n      dtype='object')] are in the [columns]"

In [None]:
# Split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Load the historical housing data obtained from the ATTOM API or other sources.
# The data should include property attributes, location data, and market performance metrics.
import requests
import json

In [None]:

# Load the historical housing data from the CSV file.
data = pd.read_csv('data.csv')

# Exploratory Data Analysis (EDA)
                                 

### Objectives:
- Understand the dataset's structure and quality.
- Identify the most important features.
- Visualize the relationships between features and the target variable.
- Identify any potential issues with the data.


In [None]:
# Display the distribution of the target variable.
plt.figure(figsize=(10, 6))
sns.histplot(data['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()

In [79]:
# Explore distributions, relationships, and trends in the data.
# `data` must be a DataFrame here; indexing a dict (such as the raw JSON response)
# with a list of column names raises the TypeError shown below.
plt.figure(figsize=(12, 8))
sns.pairplot(data[['zipcode', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'Sales']])
plt.show()

TypeError: unhashable type: 'list'

<Figure size 1200x800 with 0 Axes>

In [None]:

# Identify potential outliers and anomalies.
plt.figure(figsize=(12, 6))
sns.boxplot(x=data['Sales'])
plt.title('Boxplot of Sales')
plt.show()

In [None]:
# Check for missing or inconsistent data.
data.isnull().sum()

In [None]:
# Display summary statistics and data structure.
data.info()

In [None]:
# Visualize data distributions (e.g., histograms, boxplots).
plt.figure(figsize=(12, 8))
sns.histplot(data['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()


In [None]:

# Explore the relationship between living area and sale price with a scatter plot.
plt.figure(figsize=(12, 8))
sns.scatterplot(x='sqft_living', y='Sales', data=data)
plt.title('Sales vs. Sqft Living')
plt.show()



# Data Preprocessing


In [None]:
# Check for missing or inconsistent data.
missing_data = data.isnull().sum()
print("Missing data points per column:\n", missing_data)

# Display summary statistics and data structure.
data.info()  # info() prints its report directly and returns None
print(data.describe())


In [None]:
# Visualize data distributions (e.g., histograms, boxplots).
plt.figure(figsize=(10, 6))
sns.boxplot(data=data)
plt.title('Boxplot of Data Distributions')
plt.show()


In [None]:
# Explore relationships between numeric features with a correlation heatmap.
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Calculate and display the number of missing values in each column.
missing_values = data.isnull().sum()
print(missing_values)

In [None]:
# Impute missing numeric values with the column mean, then drop rows that still have missing data.
data.fillna(data.mean(numeric_only=True), inplace=True)
data.dropna(inplace=True)

In [None]:
# Handle outliers.
data = data[data['Sales'] < 1000000]
data = data[data['sqft_living'] < 5000]
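
The cutoffs above are fixed values. One common alternative is the interquartile range (IQR) rule, which derives the bounds from the data itself. The sketch below is not the notebook's method: `iqr_filter` is a hypothetical helper, the multiplier `k=1.5` is the conventional default rather than a tuned value, and the sample prices are made up.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Made-up prices; the last value is an obvious outlier.
sample = pd.DataFrame({"Sales": [300_000, 320_000, 350_000, 380_000, 400_000, 5_000_000]})
filtered = iqr_filter(sample, "Sales")
print(len(sample), "->", len(filtered))
```

Because the bounds follow the distribution, the same helper can be applied to `sqft_living` or other numeric columns without choosing a new threshold for each.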

In [None]:
# Identify remaining outliers with a boxplot of the target variable.
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Sales'])
plt.title('Boxplot of Sales')
plt.show()

In [None]:
# Encode categorical variables.
data = pd.get_dummies(data, columns=['zipcode'])

In [None]:
# Split the data into features and target variable.
X = data.drop(columns=['Sales'])
y = data['Sales']

In [None]:
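# Split data into training and testing sets.
from sklearn.model_selection import train_test_split
X = data.drop('Sales', axis=1)
y = data['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Scale numerical features for improved model performance.
# Fit the scaler on the training features only so that no test-set statistics
# (and no target values) leak into the transformation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)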

# Feature Engineering
### Objectives: 
- Create new features based on domain knowledge (e.g., price per square foot, proximity to amenities).
- Select the most relevant features for the model using feature importance or correlation analysis.

In [None]:
# Create new features from existing columns.
# Note: X and y were created before this cell, so redo the split to include these features.
data['age'] = 2022 - data['yr_built']  # age relative to a fixed reference year (2022)
data['renovated'] = (data['yr_renovated'] > 0).astype(int)
data['sqft_ratio'] = data['sqft_above'] / data['sqft_living']
data['sqft_total'] = data['sqft_living'] + data['sqft_lot']
data['sqft_total15'] = data['sqft_living15'] + data['sqft_lot15']
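
The objectives above mention selecting features through correlation analysis. As a rough sketch of that idea, the helper below ranks numeric columns by their absolute Pearson correlation with the target. `rank_by_correlation` is a hypothetical name, the data is synthetic, and the column names are borrowed from this notebook only for readability. Correlation captures linear relationships only, so this is a first screen rather than a full feature-selection method.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target):
    """Return numeric features sorted by absolute Pearson correlation with `target`."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Synthetic data: the price depends strongly on living area and not on `noise`.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(800, 4000, size=200)
demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "noise": rng.normal(size=200),
    "Sales": 150 * sqft_living + rng.normal(scale=20_000, size=200),
})

ranking = rank_by_correlation(demo, "Sales")
print(ranking)
```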


# Model Training
### Objectives:
- Train the selected machine learning model on the training dataset.
- Optimize the model using techniques like grid search or random search for hyperparameter tuning.

In [None]:
# Train the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)


In [None]:

# Make predictions
y_pred = model.predict(X_test)


In [None]:
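# Hyperparameter tuning
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
# Search over options that LinearRegression supports ('normalize' was removed in scikit-learn 1.2).
param_grid = {'fit_intercept': [True, False], 'positive': [True, False]}
grid = GridSearchCV(LinearRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
model = grid.best_estimator_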


In [None]:
# Train a random forest as an alternative model and measure its test error.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
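
The objectives for this section mention random search as an alternative to grid search. A minimal sketch using scikit-learn's `RandomizedSearchCV` with a random forest follows. It runs on synthetic data from `make_regression` in place of the housing features, and the candidate parameter values are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Candidate values are illustrative only.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # number of parameter combinations sampled
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test R^2:", search.best_estimator_.score(X_te, y_te))
```

Random search samples a fixed number of combinations (`n_iter`), so its cost stays bounded as the parameter space grows, whereas grid search evaluates every combination.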


# Model Evaluation
### Objectives:
- Evaluate the trained model's performance on the test dataset using metrics such as:
    - Mean Absolute Error (MAE)
    - Root Mean Squared Error (RMSE)
    - R² Score
- Visualize predictions vs. actual values to assess model accuracy.

In [None]:
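# Evaluate the model on the test set with MAE, RMSE, and R².
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)
print("R^2 Score:", r2)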
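
To make the three metrics named in the objectives concrete, the sketch below computes MAE, RMSE, and R² by hand using only the standard library. The function name `regression_metrics` is hypothetical and the prices are made up. In practice the scikit-learn functions `mean_absolute_error`, `mean_squared_error`, and `r2_score` do the same job.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up prices (in thousands) for illustration.
actual = [200, 300, 400, 500]
predicted = [210, 290, 420, 480]
demo_mae, demo_rmse, demo_r2 = regression_metrics(actual, predicted)
print(demo_mae, demo_rmse, demo_r2)
```

MAE and RMSE are in the same units as the price, and RMSE penalizes large errors more heavily. R² is unitless and compares the model against always predicting the mean.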

# Model Deployment
### Objectives:
- Save the trained model using a library like joblib or pickle.
- Create a function or API to make predictions based on user input.

In [None]:
# Save the model
import joblib
joblib.dump(model, 'model.pkl')


In [None]:
# Load the model
model = joblib.load('model.pkl')


In [None]:
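# Create a prediction function
def predict(features):
    """Predict a sale price from a dict of raw property attributes."""
    X = pd.DataFrame(features, index=[0])
    # Apply the same one-hot encoding used in training, then align the columns
    # with those the model was fitted on (missing dummy columns become 0).
    X = pd.get_dummies(X, columns=['zipcode'])
    X = X.reindex(columns=model.feature_names_in_, fill_value=0)
    return model.predict(X)[0]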


In [None]:
# Predict the price of a single example property.
# A separate name is used so the `data` DataFrame is not overwritten.
sample_property = {'zipcode': 98103, 'bedrooms': 3, 'bathrooms': 2, 'sqft_living': 2000, 'sqft_lot': 5000, 'floors': 2, 'waterfront': 0, 'view': 0, 'condition': 3, 'grade': 7, 'sqft_above': 1500, 'sqft_basement': 500, 'yr_built': 1990, 'yr_renovated': 0, 'lat': 47.65, 'long': -122.33, 'sqft_living15': 1500, 'sqft_lot15': 5000}
prediction = predict(sample_property)
print("Predicted Sales Price:", prediction)

# User Interface and Integration
### Objectives:
- Build a user-friendly interface (e.g., web app or CLI) for buyers and sellers to input property attributes and view predicted prices.
- Integrate the model with real-time or updated data sources to keep predictions current.
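
The first objective names a CLI as one possible interface. A minimal sketch built on the standard library's `argparse` is shown below. Everything in it is illustrative: `fake_predict` is a placeholder with a made-up formula standing in for the trained model, and the two flags cover only a small subset of the attributes the real model would need.

```python
import argparse

def fake_predict(features):
    """Placeholder for the trained model (made-up formula, not a real estimate)."""
    return 100.0 * features["sqft_living"] + 10_000.0 * features["bedrooms"]

def build_parser():
    parser = argparse.ArgumentParser(description="Predict a sale price from property attributes.")
    parser.add_argument("--sqft-living", type=float, required=True, help="Living area in square feet")
    parser.add_argument("--bedrooms", type=int, required=True, help="Number of bedrooms")
    return parser

def main(argv=None):
    """Parse the arguments, run the prediction, and print the result."""
    args = build_parser().parse_args(argv)
    features = {"sqft_living": args.sqft_living, "bedrooms": args.bedrooms}
    price = fake_predict(features)
    print(f"Predicted price: {price:,.0f}")
    return price

# Pass the arguments explicitly so the example also runs inside a notebook.
main(["--sqft-living", "2000", "--bedrooms", "3"])
```

In a standalone script, calling `main()` with no arguments would read the flags from the command line instead.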

In [None]:
# Import Flask for serving predictions through a web interface.
from flask import Flask, request, jsonify


# Conclusion and Future Improvements
### Summary:
Recap the results and effectiveness of the trained model.
Discuss potential improvements, such as:
- Adding more data from diverse sources.
- Enhancing the feature set with additional external factors (e.g., economic indicators, weather data).
- Refining the user interface for better usability.


# Next Steps
- Implement the outlined code for each section to build the SmartPrice Navigator model.
- Continuously refine the pipeline based on user feedback and additional data insights.

# References
- [ATTOM API Documentation](https://www.attomdata.com/)
- [Flask Documentation](https://flask.palletsprojects.com/)
- [Joblib Documentation](https://joblib.readthedocs.io/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Numpy Documentation](https://numpy.org/doc/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Python Standard Library](https://docs.python.org/3/library/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Seaborn Documentation](https://seaborn.pydata.org/)