# **SmartPrice Navigator: Predictive Housing Market Analysis**
This notebook demonstrates the step-by-step creation of a machine learning model to predict housing prices using historical housing market data from the ATTOM API. The goal is to provide actionable insights for buyers and sellers based on real estate attributes.

---


# Import Libraries
In this section, we import the necessary libraries for data handling, visualization, and model building.

In [62]:
# Import libraries for data handling, visualization, and model building.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import (
    train_test_split, GridSearchCV, cross_val_score, cross_val_predict,
    cross_validate, KFold, LeaveOneOut, LeavePOut, ShuffleSplit,
    StratifiedKFold, GroupKFold, TimeSeriesSplit,
)
from sklearn.preprocessing import (
    PolynomialFeatures, StandardScaler, MinMaxScaler, RobustScaler,
    Normalizer, QuantileTransformer, PowerTransformer, FunctionTransformer,
)
from sklearn.metrics import (
    mean_squared_error, r2_score, mean_absolute_error, median_absolute_error,
    explained_variance_score, max_error, mean_squared_log_error,
    mean_poisson_deviance, mean_gamma_deviance, mean_tweedie_deviance,
    mean_absolute_percentage_error,
)


In [63]:
# The housing data comes from the ATTOM API (https://www.attomdata.com/).

# Load the API key for the ATTOM API from the localkeys.env file.
import os
from dotenv import load_dotenv
# load_dotenv() reads ".env" by default, so the file name is passed explicitly.
load_dotenv("localkeys.env")
attom_api_key = os.getenv("ATTOM_API_KEY")
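
`os.getenv` returns `None` when a variable is missing, and that only surfaces later as a failed request. The sketch below shows one way to fail early instead. `require_env` is a hypothetical helper introduced for illustration; it is not part of python-dotenv or any ATTOM client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Empty strings are treated like missing values.
        raise RuntimeError(f"{name} is not set; check that the env file was loaded.")
    return value

# Usage in this notebook would look like:
# attom_api_key = require_env("ATTOM_API_KEY")
```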


In [64]:
# # Load the data from the CSV file.
# data = pd.read_csv('data.csv')

In [65]:
# Load the data from the ATTOM API.
import requests
import json
import time
import random


In [66]:

# Set the base URL for the ATTOM API.
base_url = "https://api.gateway.attomdata.com/propertyapi/v1.0.0/assessment/detail"


In [67]:

# Set the headers for the ATTOM API.
headers = {
    'accept': 'application/json',
    'apikey': attom_api_key,
}


In [68]:

# Set the parameters for the ATTOM API.
params = {
    'address1': '1 Microsoft Way',
    'address2': '',
    'postal1': '98052',
    'postal2': '',
}

In [69]:

# Make a request to the ATTOM API, with a timeout so a stalled connection cannot hang the notebook.
response = requests.get(base_url, headers=headers, params=params, timeout=30)

In [70]:

# Get the data from the response.
data = response.json()

In [71]:
# Convert the data to a DataFrame and display the first few rows.
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Response
status,"{'version': '1.0.0', 'code': '401', 'msg': 'Un..."


In [72]:
# Convert the data to a DataFrame and display the data types of each column.
df = pd.DataFrame(data)
df.dtypes

Response    object
dtype: object

In [73]:
# Convert the data to a DataFrame and display the summary statistics.
df = pd.DataFrame(data)
df.describe()

Unnamed: 0,Response
count,1
unique,1
top,"{'version': '1.0.0', 'code': '401', 'msg': 'Un..."
freq,1


In [74]:
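# Display the correlation matrix of the data.
# This raises the TypeError below: the only column, Response, holds the API's status dict rather than numeric values.
df.corr()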

TypeError: float() argument must be a string or a real number, not 'dict'
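
The error arises because the payload is nested JSON holding an error status, so `pd.DataFrame(data)` produces a single column of dicts. One possible approach is to check the status first and flatten the records with `pd.json_normalize`. The sketch below is hedged: `payload_to_frame` is a hypothetical helper, the `Response` wrapper is inferred from the output above, and the `property` records key and the success codes are assumptions about the API's response shape, not confirmed by this notebook.

```python
import pandas as pd


def payload_to_frame(payload: dict, records_key: str = "property") -> pd.DataFrame:
    """Flatten a JSON payload into a DataFrame, raising on an error status."""
    # Assumption: error payloads are wrapped as {"Response": {"status": {...}}},
    # as in the 401 output shown above.
    body = payload.get("Response", payload)
    status = body.get("status", {})
    code = str(status.get("code", "0"))
    # Assumption: "0" or "200" indicates success.
    if code not in ("0", "200"):
        raise ValueError(f"API returned status {code}: {status.get('msg')}")
    # Assumption: successful payloads carry a list of records under `records_key`.
    return pd.json_normalize(body.get(records_key, []))


# Hypothetical payload for illustration only.
ok = {
    "status": {"code": 0},
    "property": [{"address": {"postal1": "98052"}, "building": {"rooms": {"beds": 3}}}],
}
frame = payload_to_frame(ok)
print(frame.columns.tolist())
```

Nested keys become dotted column names, which gives `select_dtypes` and `corr` real numeric columns to work with.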

In [None]:
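# Keep only the numerical columns. With the error payload above there are none,
# so df_numeric is empty and the heatmap call raises the ValueError shown below.
df_numeric = df.select_dtypes(include=[np.number])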

# Plot the correlation matrix as a heatmap.
plt.figure(figsize=(10, 8))
sns.heatmap(df_numeric.corr(), annot=True, cmap='coolwarm')
plt.show()

ValueError: zero-size array to reduction operation fmin which has no identity

<Figure size 1000x800 with 0 Axes>

In [78]:
# Split the data into features and target variable.
# Note: `data` is still the raw JSON dict from the API here, not a DataFrame,
# so selecting a list of columns raises the TypeError below.
X = data[['zipcode', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']].copy()
y = data['Sales'].copy()

TypeError: unhashable type: 'list'

In [None]:
# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Load the historical housing data obtained from the ATTOM API or other sources.
# The data should include property attributes, location data, and market performance metrics.
import requests
import json

In [None]:

# Load the historical housing data from the CSV file.
data = pd.read_csv('data.csv')

# Exploratory Data Analysis (EDA)
                                 

### Objectives:
- Understand the dataset's structure and quality.
- Identify the most important features.
- Visualize the relationships between features and the target variable.
- Identify any potential issues with the data.


In [None]:
# Display the distribution of the target variable.
plt.figure(figsize=(10, 6))
sns.histplot(data['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()

In [79]:
# Explore distributions, relationships, and trends in the data.
plt.figure(figsize=(12, 8))
sns.pairplot(data[['zipcode', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'Sales']])
plt.show()

TypeError: unhashable type: 'list'

<Figure size 1200x800 with 0 Axes>

In [None]:

# Identify potential outliers and anomalies.
plt.figure(figsize=(12, 6))
sns.boxplot(x=data['Sales'])
plt.title('Boxplot of Sales')
plt.show()

In [None]:
# Check for missing or inconsistent data.
data.isnull().sum()

In [None]:
# Display summary statistics and data structure.
data.info()

In [None]:
# Visualize data distributions (e.g., histograms, boxplots).
plt.figure(figsize=(12, 8))
sns.histplot(data['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()


In [None]:

# Explore relationships using scatter plots and heatmaps.
plt.figure(figsize=(12, 8))
sns.scatterplot(x='sqft_living', y='Sales', data=data)
plt.title('Sales vs. Sqft Living')
plt.show()



# Data Preprocessing


In [None]:
# Check for missing or inconsistent data.
missing_data = data.isnull().sum()
print("Missing data points per column:\n", missing_data)

# Display summary statistics and data structure.
data.info()  # info() prints directly and returns None, so it is not wrapped in print().
print(data.describe())


In [None]:
# Visualize data distributions (e.g., histograms, boxplots).
plt.figure(figsize=(10, 6))
sns.boxplot(data=data)
plt.title('Boxplot of Data Distributions')
plt.show()


In [None]:
# Explore relationships using scatter plots and heatmaps.
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')  # Skip non-numeric columns.
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Calculate the number of missing values in each column.
missing_values = data.isnull().sum()

In [None]:
# Impute missing numeric values with the column mean, then drop rows that still have gaps.
data = data.fillna(data.mean(numeric_only=True))
data = data.dropna()

In [None]:
# Handle outliers.
data = data[data['Sales'] < 1000000]
data = data[data['sqft_living'] < 5000]
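
The fixed cut-offs above are tied to one dataset. A common alternative is Tukey's rule, which keeps values within 1.5 times the interquartile range of the quartiles. The following is a minimal sketch under that assumption; `filter_iqr` is a hypothetical helper and the sample prices are invented.

```python
import pandas as pd


def filter_iqr(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Invented sample prices; the last value is an obvious outlier.
sample = pd.DataFrame({"Sales": [300_000, 320_000, 350_000, 380_000, 400_000, 5_000_000]})
filtered = filter_iqr(sample, "Sales")
print(len(sample), "->", len(filtered))
```

Because the bounds are derived from the data, the same function can be applied to `sqft_living` or any other numeric column without choosing a threshold by hand.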

In [None]:
# Display the number of missing values in each column.
print(data.isnull().sum())
# Identify remaining outliers using a boxplot of the target.
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Sales'])
plt.show()

In [None]:
# Encode categorical variables.
data = pd.get_dummies(data, columns=['zipcode'])

In [None]:
# Split the data into features and target variable.
X = data.drop(columns=['Sales'])
y = data['Sales']

In [None]:
# Scale the numerical features for improved model performance.
# The target is left unscaled so predictions stay in price units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

In [None]:
# Split data into training and testing sets.
X = data.drop('Sales', axis=1)
y = data['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
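
Fitting a scaler on the full dataset before splitting lets test-set statistics influence training. One way to avoid this is to place the scaler inside a pipeline, so it is fitted on the training split only. The sketch below uses synthetic data as a stand-in for the housing features; the names `X_demo`, `y_demo`, and `pipe` are illustrative and the values are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and prices (invented values).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
coefs = np.array([50_000.0, 20_000.0, -10_000.0])
y_demo = X_demo @ coefs + 300_000.0 + rng.normal(scale=5_000.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split only, then reuses
# those statistics when transforming the test split.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
print("Test R^2:", pipe.score(X_te, y_te))
```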

# Feature Engineering
### Objectives: 
- Create new features based on domain knowledge (e.g., price per square foot, proximity to amenities).
- Select the most relevant features for the model using feature importance or correlation analysis.

In [None]:
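# Derive new features. Run this before splitting into X and y so the new columns reach the model.
data['age'] = 2022 - data['yr_built']  # Property age relative to 2022.
data['renovated'] = (data['yr_renovated'] > 0).astype(int)  # 1 if the property was ever renovated.
data['sqft_ratio'] = data['sqft_above'] / data['sqft_living']  # Share of living area above ground.
data['sqft_total'] = data['sqft_living'] + data['sqft_lot']  # Living area plus lot area.
data['sqft_total15'] = data['sqft_living15'] + data['sqft_lot15']  # Same total from the *15 columns.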
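
For the feature-selection objective, a simple first pass is to rank numeric columns by their absolute correlation with the target. The sketch below is illustrative only: `rank_by_correlation` is a hypothetical helper and the sample values are invented. Correlation captures linear relationships only, so it is a screening step rather than a full selection method.

```python
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return absolute Pearson correlations with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Invented values for illustration only.
demo = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "age": [40, 5, 30, 10, 20],
    "Sales": [200_000, 310_000, 390_000, 520_000, 600_000],
})
ranking = rank_by_correlation(demo, "Sales")
print(ranking)
```

A feature such as price per square foot is computed from the sale price itself, so it should only be used when derived from other sales (for example, a neighborhood median); otherwise it leaks the target into the features.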


# Model Training
### Objectives:
- Train the selected machine learning model on the training dataset.
- Optimize the model using techniques like grid search or random search for hyperparameter tuning.

In [None]:
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)


In [None]:

# Make predictions
y_pred = model.predict(X_test)


In [None]:
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# The `normalize` option was removed from LinearRegression in scikit-learn 1.2,
# so the grid searches over `positive` instead.
param_grid = {'fit_intercept': [True, False], 'positive': [True, False]}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
model = grid.best_estimator_
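
`LinearRegression` exposes few tunable settings, so a grid search has more to offer on a regularized model such as `Ridge`, which is already imported at the top of the notebook. The sketch below tunes the regularization strength `alpha` on synthetic data; the data, the grid values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (invented values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

pipe = make_pipeline(StandardScaler(), Ridge())
# make_pipeline names each step after its lowercased class, hence "ridge__alpha".
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

Scoring with negative RMSE keeps the search in the same units as the target, which is easier to interpret for prices than the default R² score.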


In [None]:
# Train a random forest as an alternative model and report its test error.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


# Model Evaluation
### Objectives:
- Evaluate the trained model's performance on the test dataset using metrics such as:
    - Mean Absolute Error (MAE)
    - Root Mean Squared Error (RMSE)
    - R² Score
- Visualize predictions vs. actual values to assess model accuracy.

In [None]:
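# Evaluate the model on the test set with the metrics listed above.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R^2 Score:", r2)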
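
The objectives also call for visualizing predictions against actual values. A minimal sketch of such a plot follows; `plot_predicted_vs_actual` is a hypothetical helper and the sample prices are invented. Points on the dashed line are perfect predictions, so the vertical distance from the line shows each error.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_predicted_vs_actual(y_true, y_hat, ax=None):
    """Scatter predictions against actual values with a y = x reference line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 8))
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ax.scatter(y_true, y_hat, alpha=0.5)
    low = min(y_true.min(), y_hat.min())
    high = max(y_true.max(), y_hat.max())
    # Dashed diagonal marks where predicted equals actual.
    ax.plot([low, high], [low, high], linestyle="--", color="red")
    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
    ax.set_title("Predicted vs. Actual")
    return ax


# Invented values for illustration only.
ax = plot_predicted_vs_actual([300_000, 450_000, 600_000], [320_000, 430_000, 590_000])
```

In the notebook, the call would be `plot_predicted_vs_actual(y_test, y_pred)` followed by `plt.show()`.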

# Model Deployment
### Objectives:
- Save the trained model using a library like joblib or pickle.
- Create a function or API to make predictions based on user input.

In [None]:
# Save the model
import joblib
joblib.dump(model, 'model.pkl')


In [None]:
# Load the model
model = joblib.load('model.pkl')


In [None]:
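# Create a prediction function.
def predict(features):
    """Predict a sale price from a dict of raw property attributes."""
    X = pd.DataFrame(features, index=[0])
    # Mirror the training-time encoding, then align columns with those seen during fit.
    X = pd.get_dummies(X, columns=['zipcode'])
    X = X.reindex(columns=model.feature_names_in_, fill_value=0)
    return model.predict(X)[0]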


In [None]:
# Predict the price of a sample property. A separate name avoids overwriting the `data` DataFrame.
sample = {'zipcode': 98103, 'bedrooms': 3, 'bathrooms': 2, 'sqft_living': 2000, 'sqft_lot': 5000, 'floors': 2, 'waterfront': 0, 'view': 0, 'condition': 3, 'grade': 7, 'sqft_above': 1500, 'sqft_basement': 500, 'yr_built': 1990, 'yr_renovated': 0, 'lat': 47.65, 'long': -122.33, 'sqft_living15': 1500, 'sqft_lot15': 5000}
prediction = predict(sample)
print("Predicted Sales Price:", prediction)

# User Interface and Integration
### Objectives:
- Build a user-friendly interface (e.g., web app or CLI) for buyers and sellers to input property attributes and view predicted prices.
- Integrate the model with real-time or updated data sources to keep predictions current.

In [None]:
# Import Flask and related helpers for a web interface.
from flask import Flask, request, jsonify
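
The objectives mention a CLI as an alternative to a web app. The sketch below shows what a small command-line front end could look like using only the standard library. It covers just four of the property attributes, and `build_parser`, `run_cli`, and `fake_predict` are hypothetical names; `fake_predict` uses an invented formula and stands in for the notebook's `predict` function.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for a few of the property attributes used above."""
    parser = argparse.ArgumentParser(description="Predict a sale price.")
    parser.add_argument("--zipcode", type=int, required=True)
    parser.add_argument("--bedrooms", type=int, required=True)
    parser.add_argument("--bathrooms", type=float, required=True)
    parser.add_argument("--sqft-living", type=int, required=True)
    return parser


def run_cli(argv, predictor):
    """Parse `argv` and return the value produced by `predictor(features_dict)`."""
    args = build_parser().parse_args(argv)
    # argparse stores "--sqft-living" under the key "sqft_living".
    return predictor(vars(args))


def fake_predict(features):
    """Placeholder predictor with an invented formula, for illustration only."""
    return 150.0 * features["sqft_living"]


argv = ["--zipcode", "98103", "--bedrooms", "3", "--bathrooms", "2", "--sqft-living", "2000"]
price = run_cli(argv, fake_predict)
print("Predicted Sales Price:", price)
```

Passing the predictor as an argument keeps the parsing logic testable without a trained model; a real entry point would pass `sys.argv[1:]` and the trained model's prediction function, and would need arguments for every feature the model expects.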


# Conclusion and Future Improvements
### Summary:
Recap the results and effectiveness of the trained model.
Discuss potential improvements, such as:
- Adding more data from diverse sources.
- Enhancing the feature set with additional external factors (e.g., economic indicators, weather data).
- Refining the user interface for better usability.


# Next Steps
- Implement the outlined code for each section to build the SmartPrice Navigator model.
- Continuously refine the pipeline based on user feedback and additional data insights.

# References
- [ATTOM API Documentation](https://www.attomdata.com/)
- [Flask Documentation](https://flask.palletsprojects.com/)
- [Joblib Documentation](https://joblib.readthedocs.io/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Numpy Documentation](https://numpy.org/doc/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Python Standard Library](https://docs.python.org/3/library/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Seaborn Documentation](https://seaborn.pydata.org/)