### 7. Create a DataFrame containing real estate data with the following columns: HouseID, Location, Size, Bedrooms, Bathrooms, Price, etc. Use the given dataset to build a linear regression model with PySpark's MLlib that predicts the Price of a house from the other features.
### • Preprocess the data by handling missing values, encoding categorical variables (Location), and normalizing numerical features (Size, Bedrooms, Bathrooms, etc.).
### • Split the data into training and testing sets.
### • Train a linear regression model on the training data.
### • Evaluate the model's performance on the test data using the root mean squared error (RMSE).
### • Display the feature importances and interpret the results.

In [45]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

In [46]:
spark = SparkSession.builder.appName("Program 7").getOrCreate()

In [47]:
# Load the CSV, treating the first row as the header and inferring column types
df = spark.read.csv("Datasets/House Price India.csv", header=True, inferSchema=True)

In [48]:
df.show(5)

+----------+-----+------------------+-------------------+-----------+--------+----------------+------------------+---------------+----------------------+------------------+-------------------------------------+--------------------+----------+---------------+-----------+---------+---------+-----------------+--------------+------------------------+-------------------------+-------+
|        id| Date|number of bedrooms|number of bathrooms|living area|lot area|number of floors|waterfront present|number of views|condition of the house|grade of the house|Area of the house(excluding basement)|Area of the basement|Built Year|Renovation Year|Postal Code|Lattitude|Longitude|living_area_renov|lot_area_renov|Number of schools nearby|Distance from the airport|  Price|
+----------+-----+------------------+-------------------+-----------+--------+----------------+------------------+---------------+----------------------+------------------+-------------------------------------+--------------------+---

In [49]:
# Keep only the columns that correspond to the fields named in the task
df = df.select("id", "Postal Code", "Area of the house(excluding basement)", "number of bedrooms", "number of bathrooms", "Price")

In [50]:
df.show(5)

+----------+-----------+-------------------------------------+------------------+-------------------+-------+
|        id|Postal Code|Area of the house(excluding basement)|number of bedrooms|number of bathrooms|  Price|
+----------+-----------+-------------------------------------+------------------+-------------------+-------+
|6762810635|     122004|                                 1910|                 4|                2.5|1400000|
|6762810998|     122004|                                 2910|                 5|               2.75|1200000|
|6762812605|     122005|                                 3310|                 4|                2.5| 838000|
|6762812919|     122006|                                 1880|                 3|                2.0| 805000|
|6762813105|     122007|                                 1700|                 3|                2.5| 790000|
+----------+-----------+-------------------------------------+------------------+-------------------+-------+
only showing top 5 rows

In [51]:
# Rename the columns to match the names used in the task statement
df = (
    df.withColumnRenamed("id", "HouseID")
    .withColumnRenamed("Postal Code", "Location")
    .withColumnRenamed("Area of the house(excluding basement)", "Size")
    .withColumnRenamed("number of bedrooms", "Bedrooms")
    .withColumnRenamed("number of bathrooms", "Bathrooms")
)

In [52]:
df.show(5)

+----------+--------+----+--------+---------+-------+
|   HouseID|Location|Size|Bedrooms|Bathrooms|  Price|
+----------+--------+----+--------+---------+-------+
|6762810635|  122004|1910|       4|      2.5|1400000|
|6762810998|  122004|2910|       5|     2.75|1200000|
|6762812605|  122005|3310|       4|      2.5| 838000|
|6762812919|  122006|1880|       3|      2.0| 805000|
|6762813105|  122007|1700|       3|      2.5| 790000|
+----------+--------+----+--------+---------+-------+
only showing top 5 rows



#### • Preprocess the data by handling missing values, encoding categorical variables (Location), and normalizing numerical features (Size, Bedrooms, Bathrooms, etc.).

In [53]:
# Handle missing values by dropping every row that contains a null
df = df.dropna()

In [54]:
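# Encode the categorical Location column as a numeric index
# (by default StringIndexer assigns indices by descending frequency)
indexer = StringIndexer(inputCol="Location", outputCol="LocationIndex")
df = indexer.fit(df).transform(df)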
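
Because `StringIndexer` orders labels by descending frequency by default, `LocationIndex` is a frequency rank rather than a quantity, and a linear model given that single column treats the rank as a number. One common alternative is one-hot encoding, for which PySpark provides `OneHotEncoder` in `pyspark.ml.feature`. The plain-Python sketch below only mimics the two encodings to show the difference; the postal codes are hypothetical, and the alphabetical tie-break is an assumption made for this illustration, not a reproduction of Spark's implementation.

```python
from collections import Counter

# Hypothetical postal codes, for illustration only (not taken from the dataset).
locations = ["122005", "122005", "122005", "122006", "122006", "122004"]

# Rank categories by descending frequency, breaking ties alphabetically.
counts = Counter(locations)
ordered = sorted(counts, key=lambda code: (-counts[code], code))
index_of = {code: float(i) for i, code in enumerate(ordered)}


def one_hot(code):
    """Return a 0/1 indicator list with a single 1 at the category's index."""
    vec = [0.0] * len(ordered)
    vec[int(index_of[code])] = 1.0
    return vec


# Index encoding: one number per row, which a linear model reads as a magnitude.
indexed = [index_of[code] for code in locations]

# One-hot encoding: one indicator per category, so each gets its own coefficient.
encoded = [one_hot(code) for code in locations]
```

With the index encoding the model fits one slope across all postal codes; with indicators it can fit a separate offset for each one.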

In [55]:
# Using assembler to convert features to a single vector
feature_columns = ["Size", "Bedrooms", "Bathrooms", "LocationIndex"]
assembler=VectorAssembler(inputCols=feature_columns, outputCol="Features")
df=assembler.transform(df)

In [56]:
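# Standardize the features: StandardScaler's defaults (withStd=True, withMean=False)
# divide each feature by its standard deviation without centering it.
# Note: the scaler is fitted on the full dataset, before the train/test split.
scaler = StandardScaler(inputCol="Features", outputCol="ScaledFeatures")
df = scaler.fit(df).transform(df)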
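
With its default settings, `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, which is why every scaled value in the output below stays positive. The plain-Python sketch that follows illustrates the same operation on the five `Size` values displayed earlier. It is only an illustration: the standard deviation of five rows is not the one the notebook's scaler computed on the full dataset (the displayed output implies a divisor of roughly 834 for `Size`, since 1910 / 2.2909 ≈ 834).

```python
from statistics import stdev

# Size values copied from the five rows displayed earlier in the notebook.
sizes = [1910, 2910, 3310, 1880, 1700]

# Sample (n - 1) standard deviation of these five values only; the notebook's
# scaler was fitted on the full dataset, so its divisor is different.
sigma = stdev(sizes)

# Scale to unit standard deviation without centering.
scaled = [s / sigma for s in sizes]

print(scaled)
```

Scaling this way changes the units of each feature but not the ratios between rows, so the ordering of houses by size is preserved.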

In [57]:
df.show(5)

+----------+--------+----+--------+---------+-------+-------------+--------------------+--------------------+
|   HouseID|Location|Size|Bedrooms|Bathrooms|  Price|LocationIndex|            Features|      ScaledFeatures|
+----------+--------+----+--------+---------+-------+-------------+--------------------+--------------------+
|6762810635|  122004|1910|       4|      2.5|1400000|         49.0|[1910.0,4.0,2.5,4...|[2.29088867712370...|
|6762810998|  122004|2910|       5|     2.75|1200000|         49.0|[2910.0,5.0,2.75,...|[3.49030683268585...|
|6762812605|  122005|3310|       4|      2.5| 838000|          1.0|[3310.0,4.0,2.5,1.0]|[3.97007409491071...|
|6762812919|  122006|1880|       3|      2.0| 805000|          2.0|[1880.0,3.0,2.0,2.0]|[2.25490613245684...|
|6762813105|  122007|1700|       3|      2.5| 790000|          3.0|[1700.0,3.0,2.5,3.0]|[2.03901086445565...|
+----------+--------+----+--------+---------+-------+-------------+--------------------+--------------------+
only showing top 5 rows

#### • Split the data into training and testing sets.

In [58]:
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

#### • Train a linear regression model on the training data.

In [59]:
lr = LinearRegression(featuresCol="ScaledFeatures", labelCol="Price")

In [60]:
lr_model = lr.fit(train_data)

#### • Evaluate the model's performance on the test data using the root mean squared error (RMSE).

In [62]:
predictions= lr_model.transform(test_data)

In [63]:
predictions

DataFrame[HouseID: bigint, Location: int, Size: int, Bedrooms: int, Bathrooms: double, Price: int, LocationIndex: double, Features: vector, ScaledFeatures: vector, prediction: double]

In [64]:
evaluator = RegressionEvaluator(labelCol="Price", predictionCol="prediction", metricName="rmse")

In [65]:
rmse = evaluator.evaluate(predictions)

In [66]:
print("Root Mean Squared Error: ", rmse)

Root Mean Squared Error:  273022.41852111975
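
RMSE is the square root of the mean of the squared differences between the actual and predicted values, so it is expressed in the same units as `Price`. As a minimal sketch of what the `rmse` metric computes, the plain-Python function below works on two lists; the function name and the sample numbers are hypothetical and are not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices and predictions, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.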


#### • Display the feature importances and interpret the results.

In [69]:
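# A linear model has no built-in feature importances; the coefficients fitted on the
# scaled features are printed here as a proxy.
print("Feature Importances")
for name, weight in zip(feature_columns, lr_model.coefficients):
    print(name, " : ", weight)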

Feature Importances
Size  :  171468.83058606222
Bedrooms  :  -11562.86499890396
Bathrooms  :  84836.65711617062
LocationIndex  :  6297.088662391119
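
Since the model was trained on `ScaledFeatures`, each coefficient is the change in predicted price for a one standard deviation increase in that feature, with the other features held fixed. That makes the magnitudes roughly comparable. The sketch below ranks the printed coefficients by absolute value; the numbers are copied from the output above and the variable names are illustrative.

```python
# Coefficients copied from the output above (fitted on ScaledFeatures).
coefficients = {
    "Size": 171468.83058606222,
    "Bedrooms": -11562.86499890396,
    "Bathrooms": 84836.65711617062,
    "LocationIndex": 6297.088662391119,
}

# Rank features by absolute coefficient size, largest first.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, weight in ranked:
    direction = "raises" if weight > 0 else "lowers"
    print(f"{name}: a one standard deviation increase {direction} "
          f"the predicted price by about {abs(weight):,.0f}")
```

On this reading, `Size` has the largest effect, followed by `Bathrooms`. The negative `Bedrooms` coefficient says that, for a fixed size and bathroom count, more bedrooms go with a slightly lower predicted price, which may reflect correlation among the features rather than a causal effect. The `LocationIndex` coefficient is hard to interpret because the index is a frequency rank of postal codes, not a measured quantity.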


In [70]:
spark.stop()