Dataset link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

From this dataset, I build a machine learning model with Spark (through its Python API, PySpark) to predict house prices.

In this section, I follow the steps below:

1. Foundation
- Set up environment
2. Explore and preprocess data
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Feature Engineering
- Data Splitting
3. Select and build a model
4. Train the model
5. Evaluate the model
6. Tune the model
7. Save model

1. Foundation

In [2]:
import pyspark
import findspark

In [3]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("House Pricing Prediction") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

2. Explore and preprocess data

In [27]:
# Read data
hp = spark.read.csv('housprice.csv', header=True, inferSchema=True)
hp.show(5)

| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition... [output truncated]

In [28]:
# Check the schema
hp.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: string (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |-- ... [output truncated]

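- The inferred schema does not match the values: numeric columns such as LotFrontage were read as strings, most likely because their missing entries are stored as the string 'NA'. Let's cast these columns to integers.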

In [29]:
from pyspark.sql.types import *
from pyspark.sql.functions import col

# Cast the numeric columns that were inferred as strings.
# With ANSI mode off, non-numeric entries such as 'NA' become null.
for c in ["LotFrontage", "MasVnrArea", "GarageYrBlt"]:
    hp = hp.withColumn(c, col(c).cast(IntegerType()))

In [30]:
# Summary statistics
hp.describe().show()


In [31]:
# Define categorical/numerical features
num_cols = ['Id','MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','YearRemodAdd','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','MoSold','YrSold']
categorical_cols = ['MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','SaleType','SaleCondition']

In [33]:
# Check the missing values
from pyspark.sql.functions import col, when, isnan, count

missing_values = hp.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in hp.columns])
missing_values.show()

| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition... [output truncated]
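
Note that the check above counts only nulls and NaNs, while the cleaning step that follows also treats the literal string 'NA' as missing, so such entries would not show up in these counts. Below is a minimal plain-Python sketch of a count that covers all three cases. The helper names `is_missing` and `count_missing` and the sample rows are hypothetical and not part of the notebook.

```python
import math

def is_missing(value):
    """Treat None, float NaN, and the literal string 'NA' as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == 'NA'

def count_missing(rows, columns):
    """Return {column: number of missing entries} for a list of dict rows."""
    return {c: sum(is_missing(row.get(c)) for row in rows) for c in columns}

# Hypothetical sample rows, not taken from the dataset
sample_rows = [
    {'LotFrontage': '65', 'Alley': 'NA'},
    {'LotFrontage': 'NA', 'Alley': 'Grvl'},
    {'LotFrontage': None, 'Alley': 'NA'},
]
missing_counts = count_missing(sample_rows, ['LotFrontage', 'Alley'])
print(missing_counts)
```

In the Spark check, the same effect could be obtained by adding a `col(c) == 'NA'` condition for the string columns.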

In [34]:
# Handle missing values
from pyspark.sql.functions import col, when

# Convert 'NA' strings to null values in every column
hp = hp.select([when(col(c) == 'NA', None).otherwise(col(c)).alias(c) for c in hp.columns])

# Numerical attributes: fill null values with 0 (fillna(0) only affects numeric columns)
hp = hp.fillna(0)

# Categorical attributes: fill null values with the most frequent non-null value.
# Note: an aggregate max() would return the last value in sort order, not the most frequent one.
for column_name in categorical_cols:
    mode_row = (hp.filter(col(column_name).isNotNull())
                  .groupBy(column_name).count()
                  .orderBy(col("count").desc())
                  .first())
    if mode_row is not None:
        hp = hp.na.fill(mode_row[0], subset=[column_name])
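
The comment above asks for the most frequent value of each categorical attribute. A `max` aggregate over a string column returns the last value in sort order, which is generally a different thing. The plain-Python sketch below illustrates the difference; the function names and the sample values are hypothetical and only serve as an illustration.

```python
from collections import Counter

def most_frequent(values):
    """Return the most common non-missing value, or None if there is none."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic
    return min(counts, key=lambda v: (-counts[v], v))

def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    mode = most_frequent(values)
    return [mode if v is None else v for v in values]

# Hypothetical column values, not taken from the dataset
garage_type = ['Attchd', 'Detchd', None, 'Attchd', None]
filled = fill_with_mode(garage_type)
largest = max(v for v in garage_type if v is not None)
print(filled)   # nulls replaced by the mode
print(largest)  # the lexicographic maximum, shown for comparison
```

Here the mode is 'Attchd' while the lexicographic maximum is 'Detchd', so filling with the maximum would inflate a less common category.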


In [35]:
# Checking if there are any remaining missing values
missing_values = hp.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in hp.columns])
missing_values.show()

| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition... [output truncated]

In [None]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

In [36]:
features = hp.columns
features.remove('SalePrice')


In [37]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assemble the features into a single vector column
vector_assembler = VectorAssembler(inputCols=features, outputCol="features")
hp_vector = vector_assembler.transform(hp).select("features")

IllegalArgumentException: Data type string of column MSZoning is not supported.
Data type string of column Street is not supported.
Data type string of column Alley is not supported.
Data type string of column LotShape is not supported.
Data type string of column LandContour is not supported.
Data type string of column Utilities is not supported.
Data type string of column LotConfig is not supported.
Data type string of column LandSlope is not supported.
Data type string of column Neighborhood is not supported.
Data type string of column Condition1 is not supported.
Data type string of column Condition2 is not supported.
Data type string of column BldgType is not supported.
Data type string of column HouseStyle is not supported.
Data type string of column RoofStyle is not supported.
Data type string of column RoofMatl is not supported.
Data type string of column Exterior1st is not supported.
Data type string of column Exterior2nd is not supported.
Data type string of column MasVnrType is not supported.
Data type string of column ExterQual is not supported.
Data type string of column ExterCond is not supported.
Data type string of column Foundation is not supported.
Data type string of column BsmtQual is not supported.
Data type string of column BsmtCond is not supported.
Data type string of column BsmtExposure is not supported.
Data type string of column BsmtFinType1 is not supported.
Data type string of column BsmtFinType2 is not supported.
Data type string of column Heating is not supported.
Data type string of column HeatingQC is not supported.
Data type string of column CentralAir is not supported.
Data type string of column Electrical is not supported.
Data type string of column KitchenQual is not supported.
Data type string of column Functional is not supported.
Data type string of column FireplaceQu is not supported.
Data type string of column GarageType is not supported.
Data type string of column GarageFinish is not supported.
Data type string of column GarageQual is not supported.
Data type string of column GarageCond is not supported.
Data type string of column PavedDrive is not supported.
Data type string of column PoolQC is not supported.
Data type string of column Fence is not supported.
Data type string of column MiscFeature is not supported.
Data type string of column SaleType is not supported.
Data type string of column SaleCondition is not supported.
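
The error arises because VectorAssembler accepts only numeric, boolean, and vector columns, so the string columns have to be converted to numbers (or left out) before assembling. In Spark this is the job of the StringIndexer imported earlier. The plain-Python sketch below only illustrates the idea of label indexing: the most frequent label gets index 0, which, as far as I know, matches StringIndexer's default ordering. The function names `fit_label_index` and `transform_labels` and the sample values are hypothetical.

```python
from collections import Counter

def fit_label_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_labels(values, index):
    """Replace each label with its numeric index."""
    return [index[v] for v in values]

# Hypothetical values for one categorical column
ms_zoning = ['RL', 'RM', 'RL', 'FV', 'RL', 'RM']
label_index = fit_label_index(ms_zoning)
encoded = transform_labels(ms_zoning, label_index)
print(label_index)
print(encoded)
```

Once every categorical column has a numeric counterpart, those indexed columns can be passed to VectorAssembler in place of the original strings.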

In [None]:
# Correlation matrix of the numerical features
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: use the numerical columns defined earlier (num_cols),
# since string columns cannot be assembled directly
feature_columns = num_cols

# Assemble the features into a single vector column
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_vector = vector_assembler.transform(hp).select("features")

# Compute the correlation matrix (Pearson by default)
correlation_matrix = Correlation.corr(df_vector, "features").head()[0]

# Convert DenseMatrix to a NumPy array
correlation_array = correlation_matrix.toArray()

# Convert to a pandas DataFrame for plotting
correlation_df = pd.DataFrame(correlation_array, index=feature_columns, columns=feature_columns)

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_df, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
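
For reference, each cell of such a heatmap is a Pearson correlation coefficient between two columns. The plain-Python sketch below shows what that computation amounts to on a tiny example. The function names `pearson` and `pairwise_correlations` and the sample values are hypothetical, and the sketch assumes no column is constant (a constant column would cause a division by zero).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pairwise_correlations(columns):
    """Return a nested dict of pairwise correlations for {name: values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values, not taken from the dataset
columns = {
    'GrLivArea': [1000, 1500, 2000, 2500],
    'SalePrice': [100000, 150000, 200000, 250000],
    'Age': [40, 30, 20, 10],
}
matrix = pairwise_correlations(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship, which is what the warm and cool ends of the 'coolwarm' colormap highlight.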