## Introduction

Online property companies offer valuations of houses using machine learning techniques. This dataset describes house sales in King County, Washington State, USA. It consists of historic data for houses sold between May 2014 and May 2015. The data was released under CC0: Public Domain and was downloaded from https://www.kaggle.com/shivachandel/kc-house-data.

## About the Project

The aim of this project is to predict house prices from factors such as the number of bedrooms and bathrooms, the condition, the living area and the age of the house. These features account for much of the variation in price. The analysis covers the maximum price by number of bedrooms, the ten highest prices among the houses with the oldest build year, and the average price for each condition rating (one to five) across build-year ranges of about 25 years. Visualizations are used to reveal patterns, trends and outliers in the data. The first part of the project is a series of exploratory analyses that introduce the variables and the factors that influence the results. Several models are then trained to estimate how house prices depend on the different characteristics. We use a Jupyter Notebook so that the code can be commented and evaluated step by step, and PySpark for analysis and modelling.

In [12]:
from pyspark.sql import SQLContext,Row
from pyspark import SparkConf, SparkContext

In [83]:
sc=SparkContext()

In [13]:
sc

Preprocessing: loading the data in RDD form.

## Data Cleaning

In [14]:
rdd=sc.textFile(r'C:\Everything\Data\2020Spring\BigData\kc_house_data.csv')

In [15]:
rdd=rdd.map(lambda l:l.split(","))
rdd.take(2)

[['id',
  'date',
  'price',
  'bedrooms',
  'bathrooms',
  'sqft_living',
  'sqft_lot',
  'floors',
  'waterfront',
  'view',
  'condition',
  'grade',
  'sqft_above',
  'sqft_basement',
  'yr_built',
  'yr_renovated',
  'zipcode',
  'lat',
  'long',
  'sqft_living15',
  'sqft_lot15'],
 ['7129300520',
  '20141013T000000',
  '221900',
  '3',
  '1',
  '1180',
  '5650',
  '1',
  '0',
  '0',
  '3',
  '7',
  '1180',
  '0',
  '1955',
  '0',
  '98178',
  '47.5112',
  '-122.257',
  '1340',
  '5650']]
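Splitting each line on a bare comma works only as long as no field contains a quoted comma. As a hedged sketch of a more robust alternative, the function below parses one line with Python's standard `csv` module. `parse_csv_line` is a hypothetical helper name that is not part of the original notebook, and the quoted sample line is made up for illustration.

```python
import csv

def parse_csv_line(line):
    """Parse a single CSV line, honouring quoted fields (hypothetical helper)."""
    return next(csv.reader([line]))

# A plain line is split exactly as str.split(",") would split it
plain = parse_csv_line("7129300520,20141013T000000,221900,3,1")

# Illustrative line: a quoted field that contains a comma stays in one piece
quoted = parse_csv_line('1,"Seattle, WA",221900')

print(plain)
print(quoted)
```

Under these assumptions, the helper could be passed to `rdd.map` in place of the lambda that calls `split(",")`.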

Removing the header so that further analysis can be performed on the RDD

In [16]:
header=rdd.first()
rdd=rdd.filter(lambda line:line != header)
rdd.take(2)

[['7129300520',
  '20141013T000000',
  '221900',
  '3',
  '1',
  '1180',
  '5650',
  '1',
  '0',
  '0',
  '3',
  '7',
  '1180',
  '0',
  '1955',
  '0',
  '98178',
  '47.5112',
  '-122.257',
  '1340',
  '5650'],
 ['6414100192',
  '20141209T000000',
  '538000',
  '3',
  '2.25',
  '2570',
  '7242',
  '2',
  '0',
  '0',
  '3',
  '7',
  '2170',
  '400',
  '1951',
  '1991',
  '98125',
  '47.721',
  '-122.319',
  '1690',
  '7639']]

In [17]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [18]:
spark = SparkSession(sc)

Converting the RDD to a DataFrame

In [19]:
from pyspark.sql import Row

df=rdd.map(lambda r1:Row(id=r1[0],
                         date=r1[1],
                         price=r1[2],
                         bedrooms=r1[3],
                         bathrooms=r1[4],
                         livingsqft=r1[5],
                         lotsqft=r1[6],
                         floor=r1[7],
                         waterfront=r1[8],
                         view=r1[9],
                         condition=r1[10],
                         grade=r1[11],
                         sqftabove=r1[12],
                         sqftbase=r1[13],
                         yr_built=r1[14],
                         yr_renovated=r1[15],
                         zipcode=r1[16],
                         lat=r1[17],
                         long=r1[18],
                         sqftliving1=r1[19],
                         sqftlot15=r1[20])).toDF()
  

In [20]:
df.show(3)

+---------+--------+---------+---------------+-----+-----+----------+-------+----------+--------+-------+------+---------+--------+-----------+---------+----+----------+--------+------------+-------+
|bathrooms|bedrooms|condition|           date|floor|grade|        id|    lat|livingsqft|    long|lotsqft| price|sqftabove|sqftbase|sqftliving1|sqftlot15|view|waterfront|yr_built|yr_renovated|zipcode|
+---------+--------+---------+---------------+-----+-----+----------+-------+----------+--------+-------+------+---------+--------+-----------+---------+----+----------+--------+------------+-------+
|        1|       3|        3|20141013T000000|    1|    7|7129300520|47.5112|      1180|-122.257|   5650|221900|     1180|       0|       1340|     5650|   0|         0|    1955|           0|  98178|
|     2.25|       3|        3|20141209T000000|    2|    7|6414100192| 47.721|      2570|-122.319|   7242|538000|     2170|     400|       1690|     7639|   0|         0|    1951|        1991|  98125|


In this dataset, we convert the columns to appropriate data types and compute the age of each house so that it can be included in the analysis.

In [21]:
df.toPandas().head(3)

Unnamed: 0,bathrooms,bedrooms,condition,date,floor,grade,id,lat,livingsqft,long,...,price,sqftabove,sqftbase,sqftliving1,sqftlot15,view,waterfront,yr_built,yr_renovated,zipcode
0,1.0,3,3,20141013T000000,1,7,7129300520,47.5112,1180,-122.257,...,221900,1180,0,1340,5650,0,0,1955,0,98178
1,2.25,3,3,20141209T000000,2,7,6414100192,47.721,2570,-122.319,...,538000,2170,400,1690,7639,0,0,1951,1991,98125
2,1.0,2,3,20150225T000000,1,6,5631500400,47.7379,770,-122.233,...,180000,770,0,2720,8062,0,0,1933,0,98028


In [22]:
df.count()

21613

In [23]:
## extracting the sale year (first four characters of the date string)
from pyspark.sql.functions import substring
df1=df.withColumn("year",substring(df["date"],1,4))

In [24]:
df1.show(2)

+---------+--------+---------+---------------+-----+-----+----------+-------+----------+--------+-------+------+---------+--------+-----------+---------+----+----------+--------+------------+-------+----+
|bathrooms|bedrooms|condition|           date|floor|grade|        id|    lat|livingsqft|    long|lotsqft| price|sqftabove|sqftbase|sqftliving1|sqftlot15|view|waterfront|yr_built|yr_renovated|zipcode|year|
+---------+--------+---------+---------------+-----+-----+----------+-------+----------+--------+-------+------+---------+--------+-----------+---------+----+----------+--------+------------+-------+----+
|        1|       3|        3|20141013T000000|    1|    7|7129300520|47.5112|      1180|-122.257|   5650|221900|     1180|       0|       1340|     5650|   0|         0|    1955|           0|  98178|2014|
|     2.25|       3|        3|20141209T000000|    2|    7|6414100192| 47.721|      2570|-122.319|   7242|538000|     2170|     400|       1690|     7639|   0|         0|    1951|  

We will be predicting the price of the house in this analysis. To accomplish this, we select the house price, bedrooms, bathrooms, condition and living area, together with the build year and sale year from which the age of the house is derived.

In [25]:
### choosing relevant variables for price analysis
df3=df1.select('price','bedrooms','bathrooms','condition','yr_built', 'livingsqft','year')

In [26]:
df3.printSchema()

root
 |-- price: string (nullable = true)
 |-- bedrooms: string (nullable = true)
 |-- bathrooms: string (nullable = true)
 |-- condition: string (nullable = true)
 |-- yr_built: string (nullable = true)
 |-- livingsqft: string (nullable = true)
 |-- year: string (nullable = true)



In [27]:
### converting data types for analysis
from pyspark.sql.types import IntegerType, FloatType
df3=df3.withColumn("price",df3["price"].cast(IntegerType()))
df3=df3.withColumn("bedrooms",df3["bedrooms"].cast(IntegerType()))
df3=df3.withColumn("bathrooms",df3["bathrooms"].cast(FloatType()))
df3=df3.withColumn("condition",df3["condition"].cast(IntegerType()))
df3=df3.withColumn("yr_built",df3["yr_built"].cast(IntegerType()))
df3=df3.withColumn("livingsqft",df3["livingsqft"].cast(IntegerType()))
df3=df3.withColumn("year",df3["year"].cast(IntegerType()))






In [28]:
df3.printSchema()

root
 |-- price: integer (nullable = true)
 |-- bedrooms: integer (nullable = true)
 |-- bathrooms: float (nullable = true)
 |-- condition: integer (nullable = true)
 |-- yr_built: integer (nullable = true)
 |-- livingsqft: integer (nullable = true)
 |-- year: integer (nullable = true)



In [29]:
df3.show(3)

+------+--------+---------+---------+--------+----------+----+
| price|bedrooms|bathrooms|condition|yr_built|livingsqft|year|
+------+--------+---------+---------+--------+----------+----+
|221900|       3|      1.0|        3|    1955|      1180|2014|
|538000|       3|     2.25|        3|    1951|      2570|2014|
|180000|       2|      1.0|        3|    1933|       770|2015|
+------+--------+---------+---------+--------+----------+----+
only showing top 3 rows



In [30]:
df3=df3.withColumn("age",df3["year"]-df3["yr_built"])

In [31]:
df_final=df3.select(["price","bedrooms","bathrooms","condition","livingsqft","age"])


In [32]:
df_final.toPandas().head()

Unnamed: 0,price,bedrooms,bathrooms,condition,livingsqft,age
0,221900,3,1.0,3,1180,59
1,538000,3,2.25,3,2570,63
2,180000,2,1.0,3,770,82
3,604000,4,3.0,5,1960,49
4,510000,3,2.0,3,1680,28
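The age column is the sale year minus the build year. As a small plain-Python sketch of the same arithmetic (independent of Spark), the snippet below recomputes the age for the first three rows shown above; `house_age` is a hypothetical helper name introduced only for this illustration.

```python
# (sale date string, build year) for the first three rows shown above
rows = [
    ("20141013T000000", 1955),
    ("20141209T000000", 1951),
    ("20150225T000000", 1933),
]

def house_age(date_str, yr_built):
    """Age at sale: sale year (first four characters of the date) minus build year."""
    return int(date_str[:4]) - yr_built

ages = [house_age(d, y) for d, y in rows]
print(ages)  # [59, 63, 82]
```

The results match the `age` values in the table above.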


## Data Analysis

In [33]:
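# Register the DataFrame as a temporary view for SQL queries.
# Note: `df` still holds every column as a string.
temp_table_name = "house"

df.createOrReplaceTempView(temp_table_name)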

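Analysis 1: Find the maximum price at each bedroom count. The `house` view is built from `df`, whose columns are all strings, so `max(price)` and the `order by` compare prices as text. This is why 95000 sorts between 989000 and 934000 in the output; casting price to a number (for example `cast(price as int)`) would give the numeric maximum.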



In [34]:
price=spark.sql("select bedrooms,max(price) from house group by bedrooms order by max(price) desc")
price.show()

+--------+----------+
|bedrooms|max(price)|
+--------+----------+
|       3|    999999|
|       4|    999950|
|       5|    999000|
|       7|    999000|
|       2|    998500|
|       6|    989000|
|       1|     95000|
|       9|    934000|
|       8|    900000|
|      10|    660000|
|      33|    640000|
|      11|    520000|
|       0|    380000|
+--------+----------+
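The ordering in this output follows from comparing prices as strings. As a minimal plain-Python sketch of the effect, the snippet below compares three prices as text and as numbers. The values 999999 and 95000 appear in the output above, and 7700000 is the maximum price reported later by `describe()`.

```python
# Prices as strings, as they are stored in the `house` view
prices_as_text = ["999999", "95000", "7700000"]

text_max = max(prices_as_text)                     # compares character by character
numeric_max = max(int(p) for p in prices_as_text)  # compares numeric values

print(text_max)     # 999999
print(numeric_max)  # 7700000
```

A text comparison ranks "999999" above "7700000" because "9" sorts after "7", so the string maximum is not the highest price.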



Analysis 2: Find the ten highest prices among the houses with the oldest build year. The view holds price as a string, so `price desc` is a text ordering here as well.

In [35]:
old = spark.sql("select  price, min(yr_built) from house group by price order by min(yr_built) asc,price desc limit 10")
old.show()

+------+-------------+
| price|min(yr_built)|
+------+-------------+
|975000|         1900|
|905000|         1900|
|887200|         1900|
|870000|         1900|
|850000|         1900|
|835000|         1900|
|829000|         1900|
|819000|         1900|
|783200|         1900|
|759000|         1900|
+------+-------------+



In [36]:
# Analysis 3: average price for each condition rating within build-year ranges.
# Sales occur only in 2014 and 2015, so the build year can stand in for age.
avgprice=spark.sql("(select '1900-1925' as YearRange,condition, avg(price) as AveragePrice from house where yr_built >=1900 and yr_built <= 1925 group by condition,'1900-1925'order by condition)\
union all\
(select '1926-1950'as YearRange,condition, avg(price) as AveragePrice from house where yr_built >=1926 and yr_built <= 1950 group by condition,'1926-1950'order by condition)\
union all\
(select '1951-1975'as YearRange,condition, avg(price)  as AveragePrice from house where yr_built >=1951 and yr_built <= 1975 group by condition,'1951-1975'order by condition)\
union all\
(select '1976-2000'as YearRange,condition, avg(price) as AveragePrice from house where yr_built >=1976 and yr_built <= 2000 group by condition,'1976-2000'order by condition)\
union all\
(select '2001-2015'as YearRange,condition, avg(price) as AveragePrice from house where yr_built >=2001 and yr_built <= 2015 group by condition,'2001-2015'order by condition)")
avgprice.show()

+---------+---------+------------------+
|YearRange|condition|      AveragePrice|
+---------+---------+------------------+
|1900-1925|        1| 315222.7272727273|
|1900-1925|        2|399629.94736842107|
|1900-1925|        3| 568711.5706638115|
|1900-1925|        4|  589354.199146515|
|1900-1925|        5| 702907.9360189573|
|1926-1950|        1| 264791.6666666667|
|1926-1950|        2| 267064.0350877193|
|1926-1950|        3| 470085.3057366362|
|1926-1950|        4| 530870.2468899521|
|1926-1950|        5| 617390.9605568446|
|1951-1975|        1|          484000.0|
|1951-1975|        2|          338777.8|
|1951-1975|        3|450728.95230885694|
|1951-1975|        4|475343.01169102296|
|1951-1975|        5|  552290.103343465|
|1976-2000|        2|         332493.75|
|1976-2000|        3| 541170.8573049002|
|1976-2000|        4| 553578.7993377483|
|1976-2000|        5| 607998.2978723404|
|2001-2015|        3| 615196.9836029249|
+---------+---------+------------------+
only showing top
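The query assigns each house to a build-year range by repeating one `select` per range. As a hedged sketch of the same bucketing logic in plain Python, the function below maps a build year to its range label. `YEAR_RANGES` and `year_range` are hypothetical names introduced for this illustration; the range boundaries are copied from the query.

```python
# (label, first year, last year) for each build-year range used in the query
YEAR_RANGES = [
    ("1900-1925", 1900, 1925),
    ("1926-1950", 1926, 1950),
    ("1951-1975", 1951, 1975),
    ("1976-2000", 1976, 2000),
    ("2001-2015", 2001, 2015),
]

def year_range(yr_built):
    """Return the range label for a build year, or None if it is outside all ranges."""
    for label, first, last in YEAR_RANGES:
        if first <= yr_built <= last:
            return label
    return None

labels = [year_range(y) for y in (1955, 1951, 1933, 1987)]
print(labels)  # ['1951-1975', '1951-1975', '1926-1950', '1976-2000']
```

Keeping the boundaries in one table makes it easier to change the ranges than editing five separate `select` statements.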

## Visualization

In [37]:
import folium
import pandas as pd



In [38]:
# define the world map
world_map = folium.Map()

In [39]:
# Seattle latitude and longitude values
latitude = 47.608013
longitude = -122.335167

In [40]:
# Create map and display it
san_map = folium.Map(location=[latitude, longitude], zoom_start=12)

In [41]:
# Read Dataset 
cdata = pd.read_csv('https://raw.githubusercontent.com/superfishmaster/BigDataHouse/master/kc_house_data.csv')
cdata.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [42]:
# take the first 500 houses in cdata
limit = 500
data = cdata.iloc[0:limit, :]

In [43]:
# Instantiate a feature group for the houses in the dataframe
incidents = folium.map.FeatureGroup()

In [44]:
# Loop through the first 500 houses and add each to the incidents feature group
for lat, lng in zip(data.lat, data.long):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=7, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='red',
            fill_opacity=0.4
        )
    )

In [45]:
# Add the markers to the map.
# This map of house locations in King County, Washington shows only the points.
# The second map adds a label with the price to each point.
san_map = folium.Map(location=[latitude, longitude], zoom_start=12)
san_map.add_child(incidents)

In [46]:

# add pop-up text to each marker on the map
latitudes = list(data.lat)
longitudes = list(data.long)
labels = list(data.price)

for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup=label).add_to(san_map)    
    
# add incidents to map
san_map.add_child(incidents)

In [47]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf


In [48]:
cdata.iplot(kind='bar',y='price',x='yr_built',title='Price over Time')

In [49]:

cdata=cdata.groupby(by='yr_built').sum()

In [50]:
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf

In [51]:
cdata.iplot(kind='line',y='price',title='Total Price by Build Year',
         xTitle='Build Year', yTitle='Total Price',colors='red')

In [52]:
cdata.iplot(kind='line',y='sqft_living',title='Total Living Area by Build Year',
         xTitle='Build Year', yTitle='Total Living Area (sqft)',colors='blue')

## Data Modeling--Machine Learning

In [53]:
# dropna() returns a new DataFrame; the result is not assigned, so df_final is unchanged
df_final.dropna()

DataFrame[price: int, bedrooms: int, bathrooms: float, condition: int, livingsqft: int, age: int]

In [54]:
# dropDuplicates() also returns a new DataFrame; df_final itself is unchanged
df_final.dropDuplicates()

DataFrame[price: int, bedrooms: int, bathrooms: float, condition: int, livingsqft: int, age: int]

In [55]:
from pyspark.ml.feature import VectorAssembler

In [56]:
df_final.columns

['price', 'bedrooms', 'bathrooms', 'condition', 'livingsqft', 'age']

In [57]:
# Vector assembler is used to create a vector of input features
assembler = VectorAssembler(inputCols=['bedrooms', 'bathrooms', 'condition', 'livingsqft', 'age'],
                            outputCol="features")

In [58]:
final_data=assembler.transform(df_final)

In [59]:
final_data.show()

+-------+--------+---------+---------+----------+---+--------------------+
|  price|bedrooms|bathrooms|condition|livingsqft|age|            features|
+-------+--------+---------+---------+----------+---+--------------------+
| 221900|       3|      1.0|        3|      1180| 59|[3.0,1.0,3.0,1180...|
| 538000|       3|     2.25|        3|      2570| 63|[3.0,2.25,3.0,257...|
| 180000|       2|      1.0|        3|       770| 82|[2.0,1.0,3.0,770....|
| 604000|       4|      3.0|        5|      1960| 49|[4.0,3.0,5.0,1960...|
| 510000|       3|      2.0|        3|      1680| 28|[3.0,2.0,3.0,1680...|
|1225000|       4|      4.5|        3|      5420| 13|[4.0,4.5,3.0,5420...|
| 257500|       3|     2.25|        3|      1715| 19|[3.0,2.25,3.0,171...|
| 291850|       3|      1.5|        3|      1060| 52|[3.0,1.5,3.0,1060...|
| 229500|       3|      1.0|        3|      1780| 55|[3.0,1.0,3.0,1780...|
| 323000|       3|      2.5|        3|      1890| 12|[3.0,2.5,3.0,1890...|
| 662500|       3|      2

In [60]:
Reg_data=final_data.select(["features","price"])

### Chi-square Test

A chi-square test is run to check whether each feature is statistically associated with price.

In [61]:
#### Chi-square test
from pyspark.ml.stat import ChiSquareTest

In [62]:
result_hypothesis = ChiSquareTest.test(Reg_data,"features", "price").head()

In [63]:
print("pValues: " + str(result_hypothesis.pValues))



pValues: [0.0,0.0,0.0,0.0,1.0]


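Interpretation: The p-values are listed in the order bedrooms, bathrooms, condition, livingsqft, age. The first four are 0.0, so the test rejects independence between each of these features and price. The p-value for age is 1.0, so the test gives no evidence that age and price are dependent. The chi-square test treats every distinct value as a category and measures association, not a linear relationship, so results on a near-continuous label such as price should be read with caution.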
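To make the statistic itself concrete, here is a minimal plain-Python sketch of Pearson's chi-square statistic for a contingency table, which is the quantity reported in `statistics` for each feature against the label. `chi_square_statistic` is a hypothetical helper name, and the 2x2 counts are made up for illustration; they are not taken from the housing data.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up 2x2 counts, for illustration only
observed = [[10, 20],
            [30, 40]]
chi2 = chi_square_statistic(observed)
print(round(chi2, 4))  # 0.7937
```

A larger statistic means the observed counts are further from what independence would predict. The p-value is then derived from the statistic and the table's degrees of freedom.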

In [64]:
print("statistics: " + str(result_hypothesis.statistics))

statistics: [53975.57556657787,247434.92963883132,17992.03798498814,6194336.902883839,455135.459615309]


In [65]:
# Preparing train and test data set
splits = Reg_data.randomSplit([0.8, 0.2], seed=2020)
train_df = splits[0]
test_df = splits[1]

In [66]:
train_df.describe().show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             17304|
|   mean| 541489.1171983356|
| stddev|367245.02584372973|
|    min|             78000|
|    max|           7700000|
+-------+------------------+



These outputs show the number of data points in the training and test sets.

In [67]:
test_df.describe().show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|              4309|
|   mean| 534462.1313529821|
| stddev|366642.29612750106|
|    min|             75000|
|    max|           7062500|
+-------+------------------+



### Linear Regression

In [68]:

# Create a Linear Regression Model object
from pyspark.ml.regression import LinearRegression
linearReg = LinearRegression(featuresCol = 'features', labelCol='price', maxIter=10, regParam=0.3, elasticNetParam=0.8)

In [69]:
# Fit the model to the training data and call the fitted model lr_model
lr_model = linearReg.fit(train_df)

In [70]:

print("Coefficients: " + str(lr_model.coefficients))

Coefficients: [-76780.6740269383,90570.58999713506,10692.110654578224,298.97330001314975,3055.7382638566655]


In [71]:
'bedrooms', 'bathrooms', 'condition', 'livingsqft', 'age'

('bedrooms', 'bathrooms', 'condition', 'livingsqft', 'age')

From the model we have beta coefficients, listed in the order bedrooms, bathrooms, condition, livingsqft, age. Holding the other features constant, they are interpreted as follows:<br />
For bedrooms of the house: -76780.67, the price decreases by 76780.67 for each additional bedroom.<br />
For bathrooms of the house: 90570.59, the price increases by 90570.59 for each additional bathroom.<br />
For condition of the house: 10692.11, the price increases by 10692.11 for each one-point increase in the condition rating.<br />
For livingsqft of the house: 298.97, the price increases by 298.97 for each additional square foot of living area.<br />
For age of the house: 3055.74, the price increases by 3055.74 for each additional year of age.
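The coefficients come back in the same order as the `inputCols` passed to the `VectorAssembler`, so pairing the two lists avoids reading a value against the wrong feature. The sketch below does this in plain Python with the values copied from the printed output; `coef_by_feature` is a name introduced only for this illustration.

```python
feature_names = ['bedrooms', 'bathrooms', 'condition', 'livingsqft', 'age']
# Values copied from the printed lr_model.coefficients
coefficients = [-76780.6740269383, 90570.58999713506, 10692.110654578224,
                298.97330001314975, 3055.7382638566655]

# Pair each feature name with its coefficient, position by position
coef_by_feature = dict(zip(feature_names, coefficients))

for name, value in coef_by_feature.items():
    print(f"{name:>10}: {value:12.2f}")
```

In the notebook itself, `lr_model.coefficients` could be zipped with the assembler's input column list in the same way.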


In [72]:
print("Intercept: " + str(lr_model.intercept))

Intercept: -182618.14706737935


In [73]:
test_results = lr_model.evaluate(test_df)

In [74]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 247017.4997918995
MSE: 61017645203.44106


In [75]:
print("R-squared value: {}".format(test_results.r2))

R-squared value: 0.5459840454189775


An R-squared value of about 0.55 tells us that, with the chosen input variables, this model explains approximately 55 percent of the variance in price on the test data.
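For reference, RMSE and R-squared can be computed directly from true and predicted values. The following is a minimal plain-Python sketch of the standard definitions, not the Spark implementation; the function names and the toy values are assumptions made for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(rmse(y_true, y_pred))       # about 0.577
print(r_squared(y_true, y_pred))  # 0.5
```

An R-squared of 0.5 in this toy case means the predictions remove half of the squared deviation from the mean, which is the same reading applied to the model's value above.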

In [76]:
lr_predictions=lr_model.transform(test_df)
lr_predictions.select(['prediction','Features','price']).toPandas()

Unnamed: 0,prediction,Features,price
0,3.439522e+05,"[0.0, 0.0, 3.0, 1470.0, 18.0]",235000
1,8.388501e+05,"[0.0, 0.0, 3.0, 3064.0, 24.0]",1095000
2,6.506395e+05,"[0.0, 2.5, 3.0, 1810.0, 11.0]",240000
3,8.491499e+05,"[0.0, 2.5, 3.0, 2290.0, 29.0]",339950
4,1.227208e+05,"[1.0, 0.0, 3.0, 670.0, 49.0]",75000
...,...,...,...
4304,1.067317e+06,"[7.0, 4.5, 3.0, 4140.0, 36.0]",565000
4305,1.140654e+06,"[7.0, 5.75, 3.0, 3700.0, 66.0]",540000
4306,8.169362e+05,"[8.0, 3.0, 3.0, 3840.0, 53.0]",575000
4307,5.238997e+05,"[10.0, 2.0, 4.0, 3610.0, 56.0]",650000


Interpretation: The regression model explains about 55 percent of the variation in price using the
variables included above. That leaves about 45 percent of the variation unexplained;
other factors may affect the price, such as inflation or locality (whether the area is urban, suburban or rural).


### Random Forest Regressor Model

In [77]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

In [78]:
# Create a Random Forest Regressor object
RandomReg = RandomForestRegressor(featuresCol = 'features', labelCol='price')

In [79]:
# Fit the model to the training data and call the fitted model rf_model
rf_model = RandomReg.fit(train_df)
import warnings
warnings.filterwarnings("ignore")


In [80]:
rf_predictions=rf_model.transform(test_df)
rf_predictions.select(['prediction','Features','price']).toPandas()

Unnamed: 0,prediction,Features,price
0,3.590739e+05,"[0.0, 0.0, 3.0, 1470.0, 18.0]",235000
1,5.555982e+05,"[0.0, 0.0, 3.0, 3064.0, 24.0]",1095000
2,4.545889e+05,"[0.0, 2.5, 3.0, 1810.0, 11.0]",240000
3,5.233859e+05,"[0.0, 2.5, 3.0, 2290.0, 29.0]",339950
4,3.076893e+05,"[1.0, 0.0, 3.0, 670.0, 49.0]",75000
...,...,...,...
4304,1.332930e+06,"[7.0, 4.5, 3.0, 4140.0, 36.0]",565000
4305,1.430560e+06,"[7.0, 5.75, 3.0, 3700.0, 66.0]",540000
4306,8.560041e+05,"[8.0, 3.0, 3.0, 3840.0, 53.0]",575000
4307,8.163169e+05,"[10.0, 2.0, 4.0, 3610.0, 56.0]",650000


In [81]:
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(rf_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 249079


In [82]:
import sklearn.metrics
y_true = rf_predictions.select("price").toPandas()
y_pred = rf_predictions.select("prediction").toPandas()
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))

r2_score: 0.5383734787624195


Interpretation: The R-squared value is about 0.55 for the linear regression model and about 0.54 for the random forest regressor, so the linear regression model explains slightly more of the variance in the data.
Even so, it explains only about 55 percent of the variation.
Other factors that affect price are not captured by the selected features, such as inflation, locality and development in the region. To improve the predictions, such factors would need to be added to the model.


### Model comparison

r2 value: The linear regression model has an R-squared of 0.546, against 0.538 for the random forest regressor trained with its default settings, so linear regression explains slightly more of the variance in price.

Root Mean Squared Error: The linear regression model has an RMSE of 247017.50, and the random forest regressor has an RMSE of 249079. The linear regression model therefore has the lower RMSE.

Therefore, the linear regression model appears to be the better model for this dataset, although the margin is small.
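The comparison can also be written down as a small plain-Python sketch that picks the better model by each metric. The metric values are copied from the test-set outputs above; the random forest RMSE was printed with `%g` and is therefore rounded to six significant digits. The dictionary layout and variable names are assumptions made for illustration.

```python
# Test-set metrics copied from the outputs above
metrics = {
    "linear_regression": {"rmse": 247017.4997918995, "r2": 0.5459840454189775},
    "random_forest": {"rmse": 249079.0, "r2": 0.5383734787624195},
}

# Lower RMSE is better, higher R-squared is better
best_by_rmse = min(metrics, key=lambda m: metrics[m]["rmse"])
best_by_r2 = max(metrics, key=lambda m: metrics[m]["r2"])

# Relative RMSE gap between the two models, in percent
rmse_gap_pct = 100 * (metrics["random_forest"]["rmse"]
                      / metrics["linear_regression"]["rmse"] - 1)

print(best_by_rmse, best_by_r2)
print(round(rmse_gap_pct, 2))
```

Both metrics favour linear regression, but the RMSE gap is under one percent, so the ranking could change with a different split or with tuned random forest settings.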