<a id='intro'></a>
## _Real Estate Price Prediction Using Machine Learning_

### Dataset Description 

> The dataset used for this coursework contains 2,226,382 real estate listings from the United States of America (USA). It was obtained from [Kaggle](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data), where it is hosted. A static copy, hosted on Google Drive and used throughout this project, is available [here](https://drive.usercontent.google.com/download?id=1wBJmx7yGbrjRFdSZOp11PkfPfC3e4NNG&export=download&confirm=%7b%7bVALUE%7d%7d).

> This study aims to develop a Machine Learning model that predicts house prices. 
> A quick overview of the dataset shows there are **12** columns: 
`['brokered_by', 'status', 'price', 'bed', 'bath', 'acre_lot', 'street', 'city', 'state', 'zip_code', 'house_size', 'prev_sold_date']`
 
 > The `'price'` column is the target variable for our supervised learning.



### Questions we aim to answer with this study
1. What regression models perform best in the prediction of real estate price?
2. What impact will ensemble and hyperparameter tuning have?
3. What features have the most impact on prediction accuracy?

In [None]:
# Import all necessary libraries

## Data Analysis, Statistics and Visualization Libraries
import pandas as pd # Data manipulation
import numpy as np # Numerical operations and array manipulation
import matplotlib.pyplot as plt # Creating visualizations
import seaborn as sns # Creative graphical illustrations built on top of matplotlib
import plotly.express as px # High-level interactive plots
import plotly.graph_objects as go # Lower-level interactive figure objects
from IPython.display import HTML
%matplotlib inline
from scipy.stats import f_oneway
from statsmodels.stats.anova import anova_lm # Anova
from statsmodels.formula.api import ols
import statsmodels.api as sm
import math


## Machine Learning Libraries
from sklearn.model_selection import train_test_split # To split data into training and testing sets
from sklearn.preprocessing import StandardScaler # Helps standardize features by removing the mean and scaling to unit variance
from sklearn.tree import DecisionTreeRegressor # Constructs decision tree-based regression models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor # Ensemble method based on decision trees for regression tasks
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, confusion_matrix, classification_report, accuracy_score # Calculates performance and accuracy scores
from sklearn.preprocessing import OneHotEncoder # Used for One-Hot Encoding
import category_encoders as ce # Used for Target Encoding
from category_encoders import MEstimateEncoder # Used for Target Encoding
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer # Used for normalization and Z-normalization (standardization)
from sklearn.linear_model import LinearRegression, RidgeCV, ElasticNetCV, Lasso, Ridge,  LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Download and load the dataset from the Google Drive direct-download link
data_url = 'https://drive.usercontent.google.com/download?id=1wBJmx7yGbrjRFdSZOp11PkfPfC3e4NNG&export=download&confirm={{VALUE}}'
data_original = pd.read_csv(data_url)

In [None]:
# Print the column names
print(list(data_original.columns))

In [None]:
# Generate a copy of the original data, and place the 'price' column at the end as it is the target variable for the supervised learning
# .copy() makes 'data' an independent DataFrame, so later in-place edits do not act on a view of data_original
data = data_original[['status', 'bed', 'bath', 'acre_lot', 'city', 'state', 'zip_code', 'house_size', 'prev_sold_date', 'brokered_by', 'street', 'price']].copy()

In [None]:
# Shape of the data
print(f'The data has {data.shape[0]} rows and {data.shape[1]} features\n')
print('#'*50)
data.info()

In [None]:
# Preview of the data
data.head()

### Data Cleaning

> Data is checked for consistency and accuracy. Appropriate actions are taken to clean the data.

In [None]:
# Check 1: Check for duplicates
data.duplicated().sum()

In [None]:
# Check 2: Check for null/missing values
print('Missing Values Count')
print('*'*50)
data.isna().sum()

In [None]:
# Samples of data with missing values in the 'price' column
data[data['price'].isna()].head()

In [None]:
# Treating missing values 1:
# Drop rows with missing values in the 'price', 'zip_code', 'city' or 'state' fields

data.dropna(subset = ['price', 'zip_code', 'city', 'state'], inplace=True)

In [None]:
# Treating missing values 2:
# Rename 'prev_sold_date' to 'prev_sold' and replace dates with 'Yes' and missing values with 'No'
# Then create a new column that encodes 'Yes' as 1 and 'No' as 0

data.rename(columns={'prev_sold_date': 'prev_sold'}, inplace=True)
data['prev_sold'] = data['prev_sold'].apply(lambda x: 'Yes' if pd.notna(x) else 'No')

data['prev_sold_enc'] = data['prev_sold'].map({'Yes': 1, 'No': 0})

In [None]:
# Treating missing values 3:
# Replace missing values in the numerical fields with the most frequent value (mode)

data['bed'] = data['bed'].fillna(data['bed'].mode()[0])
data['bath'] = data['bath'].fillna(data['bath'].mode()[0])
data['acre_lot'] = data['acre_lot'].fillna(data['acre_lot'].mode()[0])
data['house_size'] = data['house_size'].fillna(data['house_size'].mode()[0])



In [None]:
# Confirm treating of missing values is successful
data.isna().sum()
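
> The three treatments above (drop rows without a target, turn a sparse date column into a 0/1 indicator, fill numeric gaps with the mode) can be sketched on a small table. The `toy` DataFrame and its values are hypothetical and only illustrate the pattern; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; values are made up for illustration
toy = pd.DataFrame({
    'price': [250000.0, np.nan, 410000.0, 180000.0],
    'bed': [3.0, 2.0, np.nan, 3.0],
    'prev_sold_date': ['2020-01-15', None, None, '2018-06-30'],
})

# 1. Rows without a target value cannot be used for supervised learning
toy = toy.dropna(subset=['price']).copy()

# 2. Turn the sparse date column into a 0/1 "previously sold" indicator
toy['prev_sold_enc'] = toy['prev_sold_date'].notna().astype(int)

# 3. Fill the remaining numeric gaps with the most frequent value (mode)
toy['bed'] = toy['bed'].fillna(toy['bed'].mode()[0])

print(toy)
```

> Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions may not propagate to the parent DataFrame.
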

In [None]:
# Outlier Validation

fig, axis = plt.subplots(2, 3, figsize=(15, 7))

axis[0, 0].boxplot(data['price'])
axis[0, 0].set_title('Price')

axis[0, 1].boxplot(data['bed'])
axis[0, 1].set_title('Bedrooms')

axis[0, 2].boxplot(data['bath'])
axis[0, 2].set_title('Bathrooms')

axis[1, 0].boxplot(data['acre_lot'])
axis[1, 0].set_title('Acres')

axis[1, 1].boxplot(data['house_size'])
axis[1, 1].set_title('House size');

In [None]:
# Remove outliers using an IQR-style rule: rows more than 1.5 times the spread below the lower bound or above the upper bound are dropped
# Note: the upper bound here is the 85th percentile, which is more permissive than the standard upper quartile (75th percentile)

outlier_columns = ['price', 'bed', 'bath', 'acre_lot', 'house_size' ] # numerical columns where outliers are to be removed
quart_1 = data[outlier_columns].quantile(0.25)
quart_3 = data[outlier_columns].quantile(0.85)
IQR = quart_3 - quart_1 # Spread between the 25th and 85th percentiles

data = data[~((data[outlier_columns] < (quart_1 - 1.5 * IQR)) | (data[outlier_columns] > (quart_3 + 1.5 * IQR))).any(axis=1)]
data = data[data['price']>=100000]
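
> The fence rule used above can be written as a small reusable function. This is a sketch, not the notebook's exact filter: `iqr_mask` is a hypothetical helper, its defaults use the standard quartiles (0.25 and 0.75), and passing `upper_q=0.85` would mimic the wider bound used in the cell above. The toy values are made up.

```python
import pandas as pd

def iqr_mask(frame, columns, lower_q=0.25, upper_q=0.75, k=1.5):
    """Return a boolean mask that is True for rows inside the fences of every column."""
    q_low = frame[columns].quantile(lower_q)
    q_high = frame[columns].quantile(upper_q)
    spread = q_high - q_low
    inside = (frame[columns] >= q_low - k * spread) & (frame[columns] <= q_high + k * spread)
    return inside.all(axis=1)

# Hypothetical toy data with one extreme price
toy = pd.DataFrame({'price': [100, 110, 120, 130, 140, 5000],
                    'bed': [2, 3, 3, 4, 4, 3]})

kept = toy[iqr_mask(toy, ['price', 'bed'])]
print(kept)
```

> A row is kept only if it lies inside the fences for every listed column, which matches the `.any(axis=1)` negation used in the cell above.
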

In [None]:
# Outlier Validation After outliers have been removed
fig, axis = plt.subplots(2, 3, figsize=(15, 7))

axis[0, 0].boxplot(data['price'])
axis[0, 0].set_title('Price')

axis[0, 1].boxplot(data['bed'])
axis[0, 1].set_title('Bedrooms')

axis[0, 2].boxplot(data['bath'])
axis[0, 2].set_title('Bathrooms')

axis[1, 0].boxplot(data['acre_lot'])
axis[1, 0].set_title('Acres')

axis[1, 1].boxplot(data['house_size'])
axis[1, 1].set_title('House size');

In [None]:
# States that appear in the dataset but are not in the US (including territories)
states_in_us = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming', 'District of Columbia', 'American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'United States Minor Outlying Islands', 'Virgin Islands']
states_in_dataset = list(data['state'].unique())

diff = [x for x in states_in_dataset if x not in states_in_us]

print(diff)

In [None]:
# # Remove non-US state from data

# data = data[data['state']!='New Brunswick']

In [None]:
# Remove the 'brokered_by' and 'street' columns

data = data[['status', 'bed', 'bath', 'acre_lot', 'city', 'state', 'zip_code', 'house_size', 'prev_sold', 'prev_sold_enc', 'price']]

In [None]:
# Convert zip code to a string so it is treated as a nominal (categorical) feature rather than a number

data['zip_code'] = data['zip_code'].astype(str)

### Exploratory Data Analysis 

> After the data has been cleaned, we now gain some insights from the data

In [None]:
# Shape of the data
print(f'The data has {data.shape[0]} rows and {data.shape[1]} features\n')
print('#'*50)
data.info()

In [None]:
data.head()

> Code to display dataframes side by side
> Reference and acknowledgement: Liu Zuo Lin on [Medium](https://python.plainenglish.io/displaying-multiple-dataframes-side-by-side-in-jupyter-lab-notebook-9a4649a4940)

In [None]:
# Function to display DataFrames side by side

def side_by_side(*dfs):
    html = '<div style="display:flex">'
    for df in dfs:
        html += '<div style="margin-right: 2em">'
        html += df.to_html()
        html += '</div>'
    html += '</div>'
    display(HTML(html))

In [None]:
# Describe the data to show mean, minimum, maximum etc.
# Format output to 2 decimal places for readability
describe_num_a = data.describe().T.applymap(lambda x: f"{x:,.2f}")
describe_obj_b = data.describe(include = 'object').T

side_by_side(describe_num_a, describe_obj_b)

In [None]:
# Visualize the distribution of each numerical column

fig, axis = plt.subplots(2, 3, figsize=(15, 7))

axis[0, 0].hist(data['price'])
axis[0, 0].set_title('Price')
axis[0, 0].set_ylabel('Count')

axis[0, 1].hist(data['bed'], range=(0,10))
axis[0, 1].set_title('Bedroom')

axis[0, 2].hist(data['bath'], range=(0,10))
axis[0, 2].set_title('Bathroom')

axis[1, 0].hist(data['acre_lot'])
axis[1, 0].set_title('Acres')
axis[1, 0].set_ylabel('Count')

axis[1, 1].hist(data['house_size'])
axis[1, 1].set_title('House Size')

axis[1, 2].hist(data['prev_sold_enc'], range=(0,1), bins=2)
axis[1, 2].set_title('Previously Sold');

In [None]:
# Computation of a standard correlation coefficient between every attribute pair

corr_matrix = data.corr(numeric_only=True)

clustermap_corr = sns.clustermap(corr_matrix, annot=True, cmap='crest', figsize=(10, 4.5))

# How each attribute correlates with price
print('How each attribute correlates with price')
print('-'*50)
print(corr_matrix['price'].sort_values(ascending=False))

In [None]:
# # Price Distribution

# fig = px.histogram(data, x="price", nbins=30, template="plotly", width=900, height=500)
# fig.update_layout(title="Price Distribution", xaxis_title="House Price", yaxis_title="House Count")

In [None]:
# Distribution of Price by Bed and by Bath

fig1 = px.histogram(data, x="price", color="bed", nbins=30, width=900, height=340, 
                   color_discrete_sequence=px.colors.qualitative.Light24)

fig2 = px.histogram(data, x="price", color="bath", nbins=30, width=900, height=340, 
                   color_discrete_sequence=px.colors.qualitative.Light24)

fig1.update_layout(title="Distribution of Price by Bed", xaxis_title="Price", yaxis_title="Count")
fig2.update_layout(title="Distribution of Price by Bath", xaxis_title="Price", yaxis_title="Count")

fig1.show()
fig2.show()

In [None]:
mean_price_bed = data.groupby('bed', as_index=False).agg({'price' : 'mean'}).sort_values('price', ascending=False)
mean_price_bath = data.groupby('bath', as_index=False).agg({'price' : 'mean'}).sort_values('price', ascending=False)
print('Distribution of Bed and Bath by Mean Price')
side_by_side(mean_price_bed, mean_price_bath)

In [None]:
data['bed'].unique()

In [None]:
# One-Way ANOVA to see relationship between price and number of bedrooms

print('One-Way ANOVA to see relationship between price and number of bedrooms')
print('*'*70)
two_bed = data.loc[data["bed"] == 2.0, "price"]
three_bed = data.loc[data["bed"] == 3.0, "price"]
four_bed = data.loc[data["bed"] == 4.0, "price"]
five_bed = data.loc[data["bed"] == 5.0, "price"]
f_oneway(two_bed,three_bed,four_bed,five_bed)
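
> As a reading aid for the test above, here is a minimal sketch of `scipy.stats.f_oneway` on three small groups. The group values are hypothetical prices (in thousands of dollars) and are not drawn from the dataset.

```python
from scipy.stats import f_oneway

# Hypothetical prices (in $1,000s) for three bedroom counts
two_bed = [210, 225, 240, 230]
three_bed = [300, 310, 295, 320]
four_bed = [405, 390, 415, 400]

result = f_oneway(two_bed, three_bed, four_bed)
print(f'F = {result.statistic:.1f}, p = {result.pvalue:.2e}')

# A small p-value suggests that at least one group mean differs from the others;
# it does not say which one, so a post-hoc test would be the usual next step.
```

> With several million rows, even small differences in mean price tend to produce very small p-values, so the effect size is worth inspecting alongside the test result.
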

In [None]:
# ANOVA (Type II) on a linear model to see the relationship between price and the other numeric features

formula = "price ~ bed + bath + acre_lot + prev_sold_enc + house_size"
model = ols(formula,data = data).fit()
anova_result = sm.stats.anova_lm(model, typ=2)  # typ=2 requests Type II sums of squares
print('ANOVA (Type II) to see relationship between price and other numeric features')
print('*'*70)
anova_result

### Model Selection and Machine Learning

> The practical work and results are carried out in this section

In [None]:
# Make a copy of the original dataset

data_ = data.copy()

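#### Classification (Supervised)
> Model 1: k-nearest neighbors (KNN) classifier on binned price groups. KNN learns from labelled examples, so it is a supervised method.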

In [None]:
# Define a function to categorize the price feature into:
# High price ($1,500,000 and above); Medium price (between $500,000 and $1,499,999); Low price (less than $500,000)

def price_category(price):
    if price >= 1500000:
        return 2
    elif 500000 <= price < 1500000:
        return 1
    elif price < 500000:
        return 0
    else:
        return np.nan

data_['price_group'] = data_['price'].apply(price_category)
plt.ticklabel_format(style='plain')
data_.price_group.value_counts().sort_index().plot(kind='bar')

# Add labels to the bars
for i, value in enumerate(data_['price_group'].value_counts().sort_index()):
    plt.annotate(str(value), xy=(i, value), ha='center', va='bottom')

plt.xlabel('Price Group')  # Label for x-axis
plt.ylabel('Count')  # Label for y-axis
plt.title('Count of Instances in Each Price Group')  # Plot title
plt.show()
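
> The same three price bands can also be produced in a vectorized way with `pandas.cut`. This is an alternative sketch rather than the notebook's method; the `prices` series is hypothetical and chosen to sit on the band boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical prices placed on and around the band boundaries
prices = pd.Series([120000, 499999, 500000, 1499999, 1500000, 3200000])

# right=False makes each bin closed on the left: [0, 500k), [500k, 1.5M), [1.5M, inf)
price_group = pd.cut(prices, bins=[0, 500000, 1500000, np.inf],
                     labels=[0, 1, 2], right=False).astype(int)

print(price_group.tolist())  # [0, 0, 1, 1, 2, 2]
```

> On millions of rows this avoids calling a Python function once per row, which is what `apply` does.
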

In [None]:
knn_data = data_[['bath', 'bed' , 'acre_lot' , 'house_size' , 'prev_sold_enc','price_group']]
knn_factor = math.sqrt(knn_data.shape[0])

In [None]:
y = knn_data.price_group

feature_columns = ['bath', 'bed' , 'acre_lot' , 'house_size' , 'prev_sold_enc']
X = knn_data[feature_columns]


# Evaluate KNN over several test-set fractions
# After the loop, y_test and y_pred hold the values from the last split (test_size=0.5)
scores = []
for frac in [.1,.2,.3,.4,.5]:
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=frac, random_state = 42, stratify = y, shuffle=True)

    knn = KNeighborsClassifier(n_neighbors=370)
    knn.fit(X_train,y_train)

    y_pred=knn.predict(X_test)
    accuracy = metrics.accuracy_score(y_test,y_pred)
    scores.append(accuracy)
    print("Accuracy of test data: ", accuracy,'frac:', frac)

> Confusion Matrix

In [None]:
# Confusion matrix for the last split evaluated above (test_size=0.5), shown as a share of all test rows
cf_matrix =confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, fmt='.2%', cmap='Blues')
plt.xlabel('predicted')
plt.ylabel('actual')

In [None]:
# KNN Classification Report

target_names = ['low_price', 'med_price', 'high_price']
print(classification_report(y_test,y_pred,target_names = target_names))

> Feature Engineering

In [None]:
# Feature Normalization

# Start from a fresh copy of the cleaned data so that the 'price_group' column
# created for the KNN classifier is not carried into the regression features
data_ = data.copy()

# Min-Max normalization using MinMaxScaler
data_['bed'] = MinMaxScaler().fit_transform(data_['bed'].values.reshape(len(data_), 1))
data_['bath'] = MinMaxScaler().fit_transform(data_['bath'].values.reshape(len(data_), 1))
data_['acre_lot'] = MinMaxScaler().fit_transform(data_['acre_lot'].values.reshape(len(data_), 1))


# Z-score normalization (Standardization) using StandardScaler
# Note: 'price' is standardized too, so the error metrics below are in standardized units, not dollars
data_['house_size'] = StandardScaler().fit_transform(data_['house_size'].values.reshape(len(data_), 1))
data_['price'] = StandardScaler().fit_transform(data_['price'].values.reshape(len(data_), 1))
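
> The cell above fits the scalers on the full dataset before it is split. A common alternative, sketched here under the assumption that a stricter evaluation is wanted, is to learn the scaling statistics from the training rows only and reuse them on the test rows. `X_toy` is hypothetical random data standing in for a feature such as house size.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_toy = rng.uniform(500, 5000, size=(100, 1))  # hypothetical house sizes

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min and max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
```

> Test values may fall slightly outside the [0, 1] range with this approach, which is expected: the test rows played no part in choosing the minimum and maximum.
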

In [None]:
# One-Hot Encoding for the 'state' and 'status' features

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
encoder_transform = encoder.fit_transform(data_[['status', 'state']])
data_ = pd.concat([data_,encoder_transform], axis=1).drop(columns=['status', 'state'])

In [None]:
# Split dataset into training and test sets
# Split into training-test sets with ratio 80:20
# shuffle data to ensure no embedded pattern is retained
# Set random_state to ensure same sets of training and test data are generated each time data is loaded

train_data, test_data = train_test_split(data_, test_size=0.2, random_state=42, shuffle=True)

X = data_.drop('price', axis = 1)
y = data_['price']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state=42, shuffle=True)

In [None]:
# Shape of training and test datasets.
# The sum of the (standardized) price column is also printed as a reproducibility check:
# the result should be the same each time the data is loaded and the code below is run

print(f'The training data has {train_data.shape[0]} rows and {train_data.shape[1]} features\n')
print('*'*50)
print(f'The test data has {test_data.shape[0]} rows and {test_data.shape[1]} features\n')
print('*'*50)
print(f'Sum of standardized price of training data is {train_data.price.sum():,}\n')
print('*'*50)
print(f'Sum of standardized price of test data is {test_data.price.sum():,}\n')

In [None]:
# Target Encoding for the 'city' feature
# Fit the encoder on the training data only, then apply the learned mapping to the
# test data so that test prices do not leak into the features
encoder_target = ce.TargetEncoder(return_df=True)
X_train_loo = encoder_target.fit_transform(train_x[['city']], train_y)
train_x = pd.concat([train_x.drop('city', axis = 1),X_train_loo], axis=1)

X_test_loo = encoder_target.transform(test_x[['city']])
test_x = pd.concat([test_x.drop('city', axis = 1),X_test_loo], axis=1)
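
> Target encoding replaces each city with a statistic of the target learned from the training rows. The following is a minimal pandas sketch of that idea using a plain per-city mean; it deliberately omits the smoothing that `category_encoders.TargetEncoder` applies, so its numbers would not match the library's output. The `train` and `test` frames are hypothetical.

```python
import pandas as pd

# Hypothetical training and test rows; 'Casper' does not occur in training
train = pd.DataFrame({'city': ['Austin', 'Austin', 'Boise', 'Boise', 'Boise'],
                      'price': [400.0, 500.0, 300.0, 320.0, 340.0]})
test = pd.DataFrame({'city': ['Austin', 'Boise', 'Casper']})

# Fit: learn one mean price per city from the training rows only
city_means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()

# Transform: look up the learned means; unseen cities fall back to the global mean
train_enc = train['city'].map(city_means)
test_enc = test['city'].map(city_means).fillna(global_mean)

print(test_enc.tolist())  # [450.0, 320.0, 372.0]
```

> The test set never contributes to `city_means`, which is the property that keeps test prices out of the features.
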

In [None]:
# Keep only non-object features; this drops any remaining object-typed (text) columns
train_x = train_x.select_dtypes(exclude=['object'])
test_x = test_x.select_dtypes(exclude=['object'])

#### Supervised Learning (SL)
> SL Model 1: DecisionTreeRegressor 

In [None]:
#create the decision tree model
model_DT = DecisionTreeRegressor(max_depth=5)

# fit the model to the training set
model_DT.fit(train_x, train_y)

# make predictions on the test set
y_pred = model_DT.predict(test_x)

# scores
mse_DT = mean_squared_error(test_y, y_pred)
rmse_DT = np.sqrt(mse_DT)  # RMSE is the square root of the MSE
mae_DT = mean_absolute_error(test_y, y_pred)
r2_DT = r2_score(test_y, y_pred)
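
> For reference, the four scores used for every model in this section can be computed by hand with NumPy. The arrays below are hypothetical and only serve to show the formulas.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error, same units as the target
mae = np.mean(np.abs(errors))     # mean absolute error
# R^2: 1 minus the ratio of residual sum of squares to total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f'MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f}')
```

> Because `price` was standardized before modelling, MSE, RMSE and MAE in this notebook are expressed in standard deviations of price rather than in dollars; R^2 is unaffected by that rescaling.
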

> SL Model 2: GradientBoostingRegressor 

In [None]:
#create the GradientBoostingRegressor model
model_GD = GradientBoostingRegressor(learning_rate=0.05,
    n_estimators=150,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=1)

# fit the model to the training set
model_GD.fit(train_x, train_y)

# make predictions on the test set
y_pred = model_GD.predict(test_x)

# scores
mse_GD = mean_squared_error(test_y, y_pred)
rmse_GD = np.sqrt(mse_GD)  # RMSE is the square root of the MSE
mae_GD = mean_absolute_error(test_y, y_pred)
r2_GD = r2_score(test_y, y_pred)

> SL Model 3: LinearRegression

In [None]:
#create the LinearRegression model
model_LR = LinearRegression()

# fit the model to the training set
model_LR.fit(train_x, train_y)

# make predictions on the test set
y_pred = model_LR.predict(test_x)

# Scores
mse_LR = mean_squared_error(test_y, y_pred)
rmse_LR = np.sqrt(mse_LR)  # RMSE is the square root of the MSE
mae_LR = mean_absolute_error(test_y, y_pred)
r2_LR = r2_score(test_y, y_pred)

> SL Model 4: Lasso Regression

In [None]:
#create the Lasso model
model_Lasso = Lasso()

# fit the model to the training set
model_Lasso.fit(train_x, train_y)

# make predictions on the test set
y_pred = model_Lasso.predict(test_x)

# scores
mse_Lasso = mean_squared_error(test_y, y_pred)
rmse_Lasso = np.sqrt(mse_Lasso)  # RMSE is the square root of the MSE
mae_Lasso = mean_absolute_error(test_y, y_pred)
r2_Lasso = r2_score(test_y, y_pred)

> SL Model 5: Ridge Regression

In [None]:
#create the ridge model
ridge_model = Ridge()

# fit the model to the training set
ridge_model.fit(train_x, train_y)

# make predictions on the test set
y_pred = ridge_model.predict(test_x)

# scores
mse_Rg = mean_squared_error(test_y, y_pred)
rmse_Rg = np.sqrt(mse_Rg)  # RMSE is the square root of the MSE
mae_Rg = mean_absolute_error(test_y, y_pred)
r2_Rg = r2_score(test_y, y_pred)

> SL Model 6: Ridge with Cross Validation

In [None]:
# create the ridge cross-validation model
# alphas is a list of candidate values; with a single candidate no selection takes place
ridge_cv_model = RidgeCV(alphas=[1.38], scoring='neg_mean_absolute_error')

# fit the model to the training set
ridge_cv_model.fit(train_x, train_y)

# make predictions on the test set
y_pred = ridge_cv_model.predict(test_x)

# scores
mse_R = mean_squared_error(test_y, y_pred)
rmse_R = np.sqrt(mse_R)  # RMSE is the square root of the MSE
mae_R = mean_absolute_error(test_y, y_pred)
r2_R = r2_score(test_y, y_pred)

> SL Model 7: ElasticNet Regression

In [None]:
# create the ElasticNetCV model
elastic_model = ElasticNetCV(l1_ratio=[0.01], tol=0.01)

# fit the model to the training set
elastic_model.fit(train_x, train_y)

# make predictions on the test set
y_pred = elastic_model.predict(test_x)

# scores
mse_E = mean_squared_error(test_y, y_pred)
rmse_E = np.sqrt(mse_E)  # RMSE is the square root of the MSE
mae_E = mean_absolute_error(test_y, y_pred)
r2_E = r2_score(test_y, y_pred)

> SL Model 8: K-Nearest Neighbors Regression

In [None]:
# create the KNeighborsRegressor model
model_KNN = KNeighborsRegressor(n_neighbors=10)

# fit the model to the training set
model_KNN.fit(train_x, train_y)

# make predictions on the test set
y_pred = model_KNN.predict(test_x)

# Scores
mse_KNN = mean_squared_error(test_y, y_pred)
rmse_KNN = np.sqrt(mse_KNN)  # RMSE is the square root of the MSE
mae_KNN = mean_absolute_error(test_y, y_pred)
r2_KNN = r2_score(test_y, y_pred)

> Supervised Learning Models Results

In [None]:
results = {
    'Linear Regression': {'MSE': mse_LR, 'RMSE': rmse_LR, 'MAE': mae_LR, 'R^2': r2_LR},
    'Decision Tree': {'MSE': mse_DT, 'RMSE': rmse_DT, 'MAE': mae_DT, 'R^2': r2_DT},
    'K-Nearest Neighbors': {'MSE': mse_KNN, 'RMSE': rmse_KNN, 'MAE': mae_KNN, 'R^2': r2_KNN},
    'Gradient Boosting': {'MSE': mse_GD, 'RMSE': rmse_GD, 'MAE': mae_GD, 'R^2': r2_GD},
    'Lasso': {'MSE': mse_Lasso, 'RMSE': rmse_Lasso, 'MAE': mae_Lasso, 'R^2': r2_Lasso},
    'Ridge': {'MSE': mse_Rg, 'RMSE': rmse_Rg, 'MAE': mae_Rg, 'R^2': r2_Rg},
    'Ridge CV': {'MSE': mse_R, 'RMSE': rmse_R, 'MAE': mae_R, 'R^2': r2_R},
    'ElasticNet CV': {'MSE': mse_E, 'RMSE': rmse_E, 'MAE': mae_E, 'R^2': r2_E}
}


result_data = pd.DataFrame.from_dict(results, orient='index')
result_data = result_data.applymap(lambda x: f'{x:.2f}')

print('Performance Measures of all Models')
print('*'*50)
result_data
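
> The fit, predict and score steps are repeated for each model above. One way to cut the repetition is a small helper that returns the same four metrics as a dictionary. This is a sketch: `evaluate` is a hypothetical helper, and the synthetic `X_demo` and `y_demo` arrays stand in for the notebook's training and test sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a regressor and return its test-set metrics as a dict."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {'MSE': mse, 'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, pred),
            'R^2': r2_score(y_test, pred)}

# Hypothetical synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

models = {'Linear Regression': LinearRegression(), 'Ridge': Ridge(alpha=1.0)}
demo_results = {name: evaluate(m, X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:])
                for name, m in models.items()}

print(demo_results)
```

> The resulting dictionary has the same shape as `results` above, so it could be passed to `pd.DataFrame.from_dict(..., orient='index')` in the same way.
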

In [None]:
# Regression Models Comparison

model_names = ['Linear Regression', 'Decision Tree', 'K-Nearest Neighbors', 'Gradient Boosting', 'Lasso', 'Ridge', 'Ridge CV', 'ElasticNet CV' ]
scores = [r2_LR, r2_DT, r2_KNN, r2_GD, r2_Lasso, r2_Rg, r2_R, r2_E]
plt.figure(figsize=(15, 5))
#plt.xlim(2, 8)
#plt.yticks([0.4, 0.45, 0.5, 0.55, 0.6, 0.65])
plt.ylim(0, 1)
colors = ['blue', 'green', 'red', 'purple', 'cyan', 'grey', 'yellow', 'orange']
plt.bar(model_names, scores, width=0.8, color=colors)
#plt.hist(result_data)
plt.title('Comparison of the Model R^2 Scores')
plt.xlabel('Model')
plt.ylabel('R^2 Score')

plt.show()



> Regression Coefficient 

In [None]:
# Regression Coefficients

train_x_cor = train_x[['bed', 'bath', 'acre_lot', 'house_size', 'prev_sold_enc', 'city']]
test_x_cor = test_x[['bed', 'bath', 'acre_lot', 'house_size', 'prev_sold_enc', 'city']]

# Fit linear regression model
lr_model = LinearRegression()
lr_model.fit(train_x_cor, train_y)

# Fit ridge regression model
ridge_model = Ridge(alpha=1.0)  # You can adjust the regularization strength (alpha) if needed
ridge_model.fit(train_x_cor, train_y)

# Fit Lasso regression model
lasso_model = Lasso(alpha=1.0)  # You can adjust the regularization strength (alpha) if needed
lasso_model.fit(train_x_cor, train_y)



coefficients = pd.DataFrame({
    'Feature': train_x_cor.columns,
    'Linear Regression Coefficient': lr_model.coef_,
    'Ridge Regression Coefficient': ridge_model.coef_,
    'Lasso Regression Coefficient': lasso_model.coef_})


## Use columns that have not been normalized

In [None]:
# Regression Coefficients
print('Regression Coefficients')
print('*'*50)
coefficients

In [None]:
X1 = train_x[['bed', 'bath', 'acre_lot', 'house_size', 'prev_sold_enc']]

In [None]:
# Classifier Accuracy Score
X2 = data_[['bed', 'bath', 'acre_lot', 'house_size', 'prev_sold_enc']]
y2 = data_['price']
# Discretize the target variable into bins (classification problem)
# You can adjust the number of bins and strategy as needed
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
y2_discretized = discretizer.fit_transform(y2.values.reshape(-1, 1))

# Split the dataset into training and testing sets
train_x2, test_x2, train_y2, test_y2 = train_test_split(X2, y2_discretized, test_size = 0.2, random_state=42, shuffle=True)

# Initialize and train a logistic regression classifier
classifier = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
classifier.fit(train_x2, train_y2.ravel())  # Note: ravel() converts the discretized target to a 1D array

# Make predictions on the test set
y_pred = classifier.predict(test_x2)

# Calculate accuracy
accuracy = accuracy_score(test_y2, y_pred)
print("Classifier Accuracy:", accuracy)
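
> The `strategy` argument of `KBinsDiscretizer` decides how the bin edges are placed, which in turn affects class balance and therefore how an accuracy figure should be read. The sketch below contrasts `'uniform'` (equal-width bins) with `'quantile'` (roughly equal-count bins) on hypothetical, right-skewed values.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical right-skewed values: nine small numbers and one large one
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0]).reshape(-1, 1)

uniform = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
quantile = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')

uniform_bins = uniform.fit_transform(values).ravel()
quantile_bins = quantile.fit_transform(values).ravel()

# Number of samples that land in each of the three bins
uniform_counts = np.bincount(uniform_bins.astype(int), minlength=3)
quantile_counts = np.bincount(quantile_bins.astype(int), minlength=3)

print('uniform :', uniform_counts)
print('quantile:', quantile_counts)
```

> With equal-width bins almost every sample falls into the first bin, so a classifier that always predicts that bin would already score a high accuracy. Comparing against such a majority-class baseline helps put the reported accuracy in context.
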


In [None]:
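# Extract coefficients and feature names
# coef_[0] holds the coefficients for the first class only
coefficients = classifier.coef_[0]
feature_names = X2.columns

# Print each feature's coefficient (direction and strength of association with that class, not a contribution to accuracy)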
for feature, coefficient in zip(feature_names, coefficients):
    print(f"Feature: {feature}, Coefficient: {coefficient}")