# ISM6562.002S23

## Team: APEX

## Team mebers: Aswith Reddy Kovvuri, Muralidhar Reddy Reddem, Sindhura Alla


##### About the Dataset:

We have chosen one of the most happening cities, New York, for our analysis. The dataset includes 32,500 properties  which are listed in the Airbnb website.

Listings, including full descriptions,Host details,Availabilty,Beds ,Bathroom count,Property type, review scores, Reviews, including unique id for each listing the price and availability for that day,
##### About the Source:
Airbnb, Inc is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.
Inside Airbnb is a mission driven project that provides data and advocacy about Airbnb's impact on residential communities, where data and information empower communities to understand, decide and control the role of renting residential homes to tourists.
##### Collaborators:
Inside Airbnb relies on the kind contribution of a variety of collaborators and partners.
* Murray Cox.,
* Taylor Higgins. 
* Alice Corona.
* Luca Lamonaca. 
* Michael "Ziggy" Mintz. 

##### Technology:
The site uses the following Open Source technologies:- 
D3, Bootstrap, Python, PostgreSQL and Google Fonts.
Maps are designed in Mapbox with OpenStreetMap data and hosted via Mapbox.
The site is served by an Amazon S3 "bucket".

##### Importing the Required poackages

In [1]:
import pandas as pd
import findspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, trim, when, ceil, avg, coalesce
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
import json
from pyspark.sql.window import Window
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier

##### Building Spark Session and Creating context

In [2]:
findspark.init()

spark = SparkSession.builder \
        .master("local[4]") \
        .appName("ISM6562 Spark App01") \
        .enableHiveSupport() \
        .getOrCreate()

# Let's get the SparkContext object. It's the entry point to the Spark API. It's created when you create a sparksession
sc = spark.sparkContext  

# note: If you have multiple spark sessions running (like from a previous notebook you've run), 
# this spark session webUI will be on a different port than the default (4040). One way to 
# identify this part is with the following line. If there was only one spark session running, 
# this will be 4040. If it's higher, it means there are still other spark sesssions still running.
spark_session_port = spark.sparkContext.uiWebUrl.split(":")[-1]
print("Spark Session WebUI Port: " + spark_session_port)

# It's best if you find that the port number displayed below is not 4040, then you should shut down all other spark sessions and 
# run this code again. If you don't, you may have trouble accessing the data in the spark-warehouse directory.

Spark Session WebUI Port: 4040


In [3]:
# this will set the log level to ERROR. This will hide the INFO or WARNING messages that are printed out by default. If you want to see them, set this to INFO or WARN.
sc.setLogLevel("ERROR")

In [4]:
spark

#### Reading the data from source

In [5]:
# Reading csv into pandas dataframe instead of spark dataframe directly. Csv file has column amenities with list of values. 
# As of today spark csv read doesn't support ArrayType(StringType). Pandas csv read is much more complex than spark, hence the choice
# Of using pandas read csv instead of spark
df = pd.read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2023-03-06/data/listings.csv.gz', low_memory=False)
df["amenities_count"] = df["amenities"].apply(lambda x: len(json.loads(x))) # Count number of amenities

# Dropping columns that are list of strings like amenities as this is the main source of culprit for unable to spark upload to csv. 
# This has list of strings leading to issues while spark upload.
cols_to_delete = ['host_neighbourhood', 'neighborhood_overview', 'host_location', 
                  'host_about', 'picture_url','host_url', 'host_thumbnail_url', 'host_picture_url', 
                  'host_verifications', 'host_has_profile_pic', 
                  'neighbourhood', 'latitude', 'longitude', 'bathrooms', 'amenities', 'minimum_minimum_nights',
                  'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 
                  'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'calendar_last_scraped', 
                  'license', 
                  'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 
                  'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms']
df = df.drop(cols_to_delete, axis=1)

# Replace commas in string columns as commas are causing issues in spark upload of csv
str_cols = ['name', 'description']
for str_col in str_cols:
    df[str_col] = df[str_col].str.replace(',', '')
df.to_csv('listings.csv', index=False)
df = spark.read.csv('listings.csv', header=True, inferSchema=True)
df.show(5)

+------------------+--------------------+--------------+------------+-----------+--------------------+--------------------+---------+---------+----------+------------------+------------------+--------------------+-----------------+-------------------+-------------------------+----------------------+----------------------+----------------------------+--------------------+---------------+------------+--------------+--------+----+-------+--------------+--------------+----------------+---------------+---------------+---------------+----------------+-----------------+---------------------+----------------------+------------+-----------+--------------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+----------------+-----------------+---------------+
|                id|         listing_url|     scrape_id|last_scraped|     source|                name|         description|  host_id|host_name|hos

##### Printing the Schema

In [6]:
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- scrape_id: long (nullable = true)
 |-- last_scraped: string (nullable = true)
 |-- source: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: string (nullable = true)
 |-- host_response_time: string (nullable = true)
 |-- host_response_rate: string (nullable = true)
 |-- host_acceptance_rate: string (nullable = true)
 |-- host_is_superhost: string (nullable = true)
 |-- host_listings_count: double (nullable = true)
 |-- host_total_listings_count: double (nullable = true)
 |-- host_identity_verified: string (nullable = true)
 |-- neighbourhood_cleansed: string (nullable = true)
 |-- neighbourhood_group_cleansed: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- accommodates: integer (nullab

#### Applying transformations for relevant columns

In [7]:
# Converting relevant columns to date
dt_cols = ['host_since', 'first_review', 'last_review']
for dt_col in dt_cols:
    df = df.withColumn(dt_col, col(dt_col).cast("date"))

# Converting relevant columns to float
float_cols = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 
              'review_scores_communication', 'review_scores_location', 'review_scores_value']
for float_col in float_cols:
    df = df.withColumn(float_col, col(float_col).cast("float"))

# Cleaning up percentage columns
per_cols = ['host_response_rate', 'host_acceptance_rate']
for per_col in per_cols:
    df = df.withColumn(per_col, regexp_replace(trim(col(per_col)), "%", "").cast("float"))
    df = df.fillna({per_col: 0})

# Cleaning up price column
# Removing rows that have 0 price as it's erroneous to have zero price for a listing
df = df.withColumn("price", regexp_replace(trim(col("price")), "\$", "").cast("float"))
df = df.filter(col("price") > 1)


# Converting string represented boolean columns
bool_cols = ['host_is_superhost', 'host_identity_verified', 'has_availability', 'instant_bookable']
for bool_col in bool_cols:
    df = df.fillna({bool_col: 'f'})
    df = df.withColumn(bool_col, when(col(bool_col) == "t", 1).otherwise(0))


# Handling beds and bedrooms nan values
# Handling cases where there are number of beds but not bedrooms
df = df.withColumn("bedrooms", coalesce(col("bedrooms"), when(col("beds").isNotNull(), ceil(col("beds") / 2)).otherwise(None)))
# Using same assumption, fill na values in beds if there are bedrooms
df = df.withColumn("beds", coalesce(col("beds"), when(col("bedrooms").isNotNull(), col("bedrooms") * 2).otherwise(None)))
df = df.dropna(subset=['beds', 'bedrooms'], how="all")


# Removing rows with null ratings as ratings are important factor for price
df = df.na.drop(subset=["review_scores_rating"])
# Other review columns that are null can be reasonably assumed to have same rating as review_scores_rating
for rating_col in ['review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']:
    df = df.withColumn(rating_col, coalesce(rating_col, 'review_scores_rating'))

# Filling nas for host_total_listings_count
df = df.fillna({'host_total_listings_count': 1})

# Renaming columns
df = df.withColumnRenamed("neighbourhood_group_cleansed", "neighbourhood").drop("neighbourhood_group_cleansed")
df = df.withColumnRenamed("bathrooms_text", "bathrooms").drop("bathrooms_text")

data_cols = ["host_since", "host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
             "host_identity_verified", "neighbourhood", "property_type", "room_type", "accommodates",
             "bathrooms", "bedrooms", "beds", "price", "number_of_reviews", "last_review",	"review_scores_rating", 
             "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", 
             "review_scores_location", "review_scores_value", "instant_bookable", "amenities_count"]
data = df.select(data_cols).dropna()
data.show()

+----------+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+----------------+--------+----+-----+-----------------+-----------+--------------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+----------------+---------------+
|host_since|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|       bathrooms|bedrooms|beds|price|number_of_reviews|last_review|review_scores_rating|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|instant_bookable|amenities_count|
+----------+------------------+--------------------+-----------------+-------------------------+----------

##### Save data to hive store

In [None]:
spark.sql('drop table if exists airbnb_newyork')
data.write.saveAsTable('airbnb_newyork')

### Determine if listing is over priced or under priced considering relevant factors 

### Regression analysis

In [9]:
cols_filtered = ["host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
                 "host_identity_verified", "neighbourhood", "property_type", "room_type", "accommodates",
                 "bathrooms", "bedrooms", "beds", "price", "number_of_reviews", "review_scores_rating", 
                 "instant_bookable", "amenities_count"]
train_data, test_data=data.randomSplit([0.7,0.3])
train_data = train_data.select(cols_filtered)
test_data = test_data.select(cols_filtered)
train_data.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+
|             100.0|                89.0|                0|                     13.0|                     1|     Brooklyn|Private room in t...|   Private room|    

In [10]:
# Using StringIndexer to convert the categorical columns to hold numerical data
neighbourhood_group = StringIndexer(inputCol='neighbourhood',outputCol='neighbourhood_group',handleInvalid='keep')
property_type_group = StringIndexer(inputCol='property_type',outputCol='property_type_group',handleInvalid='keep')
room_type_group = StringIndexer(inputCol='room_type',outputCol='room_type_group',handleInvalid='keep')
bathrooms_group = StringIndexer(inputCol='bathrooms',outputCol='bathrooms_group',handleInvalid='keep')

In [11]:
assembler = VectorAssembler(
    inputCols=[
        "host_is_superhost", 
        "host_total_listings_count", 
        "host_identity_verified", 
        "neighbourhood_group",
        'property_type_group',
        'room_type_group',
        "accommodates", 
        'bathrooms_group',
        "bedrooms", 
        "beds", 
        "number_of_reviews", 
        "review_scores_rating", 
        "instant_bookable",
        "amenities_count"
    ],
    outputCol="features"
)

In [12]:
pipe = Pipeline(stages=[
    neighbourhood_group,
    property_type_group,
    room_type_group,
    bathrooms_group,
    assembler
    ]
)

In [13]:
fitted_pipe=pipe.fit(train_data)

In [14]:
train_data=fitted_pipe.transform(train_data)
train_data.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+-------------------+-------------------+---------------+---------------+--------------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|neighbourhood_group|property_type_group|room_type_group|bathrooms_group|            features|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+-----------

In [15]:
test_data=fitted_pipe.transform(test_data)
test_data.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+-------------------+-------------------+---------------+---------------+--------------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|neighbourhood_group|property_type_group|room_type_group|bathrooms_group|            features|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+-----------

In [16]:
lr_model = LinearRegression(labelCol='price')
fit_model = lr_model.fit(train_data.select(['features', 'price']))

In [17]:
results = fit_model.transform(test_data)
results.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+-------------------+-------------------+---------------+---------------+--------------------+------------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|neighbourhood_group|property_type_group|room_type_group|bathrooms_group|            features|        prediction|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+------------

In [18]:
results.select(['price', 'prediction']).show(5)

+-----+------------------+
|price|        prediction|
+-----+------------------+
|140.0| 172.3727718051572|
| 55.0|  79.3056708651953|
|125.0|115.02980304833997|
| 42.0| 79.38672386768718|
|200.0|163.82250353233016|
+-----+------------------+
only showing top 5 rows



In [19]:
test_results = fit_model.evaluate(test_data)

In [20]:
print(f"{'RMSE:':7s} {test_results.rootMeanSquaredError:>7.3f}")
print(f"{'Ex Var:':7s} {test_results.explainedVariance:>7.3f}")
print(f"{'MAE:':7s} {test_results.meanAbsoluteError:>7.3f}")
print(f"{'MSE:':7s} {test_results.meanSquaredError:>7.3f}")
print(f"{'RMSE:':7s} {test_results.rootMeanSquaredError:>7.3f}")
print(f"{'R2:':7s} {test_results.r2:>7.3f}")

RMSE:    97.724
Ex Var: 4977.520
MAE:     62.989
MSE:    9549.981
RMSE:    97.724
R2:       0.356


## logistic reg


In [21]:
from pyspark.ml.classification import LogisticRegression

In [22]:
log_model = LogisticRegression(labelCol='price')

In [23]:
fit_model = log_model.fit(train_data.select(['features', 'price']))

In [24]:
results = fit_model.transform(test_data)
results.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+-------------------+-------------------+---------------+---------------+--------------------+--------------------+--------------------+----------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|neighbourhood_group|property_type_group|room_type_group|bathrooms_group|            features|       rawPrediction|         probability|prediction|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+-------

In [25]:
results.select(['price', 'prediction']).show(5)

+-----+----------+
|price|prediction|
+-----+----------+
|140.0|     120.0|
| 55.0|      50.0|
|125.0|      50.0|
| 42.0|      45.0|
|200.0|     150.0|
+-----+----------+
only showing top 5 rows



In [26]:
test_results = fit_model.evaluate(test_data)

In [27]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price',metricName='areaUnderROC')

AUC = AUC_evaluator.evaluate(results)

In [28]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 1.0


In [29]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price',metricName='areaUnderPR')
PR = PR_evaluator.evaluate(results)

In [30]:
print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 1.0


In [31]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

ACC_evaluator = MulticlassClassificationEvaluator(  #  Multiclass or Binary, the accuracy is calculated in the same way.
    labelCol="price", predictionCol="prediction", metricName="accuracy")

accuracy = ACC_evaluator.evaluate(results)

In [32]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.04541213063763608


### Classification

In [33]:
# Find Average price per person by group. Group is defined as (host_is_superhost, neighbourhood, property_type, room_type)
# Create categories of price based on average price
data_clf = data.select("*")
data_clf = data_clf.withColumn('price_per_person', col('price')/col('accommodates'))
group_cols = ['neighbourhood', 'host_is_superhost', 'property_type', 'room_type']

w = Window.partitionBy(group_cols)
data_clf = data_clf.withColumn("avg_price_per_person", avg('price_per_person').over(w))
percent_band = 0.2
data_clf = data_clf.withColumn(
    'price_category',
    when(col("price_per_person") > (1 + percent_band) * col('avg_price_per_person'), "COSTLY").
    when(col("price_per_person") < (1 + percent_band) * col('avg_price_per_person'), "DISCOUNTED").
    otherwise("ECONOMICAL")
)
data_clf.show(5)


+----------+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+------------------+---------------+------------+---------+--------+----+-----+-----------------+-----------+--------------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+----------------+---------------+----------------+--------------------+--------------+
|host_since|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|     property_type|      room_type|accommodates|bathrooms|bedrooms|beds|price|number_of_reviews|last_review|review_scores_rating|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|instant_bookable|amenities_count|price_per_person|avg_price_per_person|price_category|
+----------+------

In [34]:
# Use StringIndexer to convert the categorical columns to hold numerical data
price_category_group = StringIndexer(inputCol='price_category',outputCol='price_category_group',handleInvalid='keep')
data_clf = price_category_group.fit(data_clf).transform(data_clf)
data_clf.show(5)

+----------+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+------------------+---------------+------------+---------+--------+----+-----+-----------------+-----------+--------------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+----------------+---------------+----------------+--------------------+--------------+--------------------+
|host_since|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|     property_type|      room_type|accommodates|bathrooms|bedrooms|beds|price|number_of_reviews|last_review|review_scores_rating|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|instant_bookable|amenities_count|price_per_person|avg_price_per_person|price_categor

#### Decision Tree

In [35]:
cols_filtered = ["host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
                 "host_identity_verified", "neighbourhood", "property_type", "room_type", "accommodates",
                 "bathrooms", "bedrooms", "beds", "price", "number_of_reviews", "review_scores_rating", 
                 "instant_bookable", "amenities_count", "price_category_group"]
train_data, test_data=data_clf.randomSplit([0.7,0.3])
train_data = train_data.select(cols_filtered)
test_data = test_data.select(cols_filtered)
train_data.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+--------------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|price_category_group|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+--------------------+
|             100.0|                24.0|                0|                      5.0|               

In [36]:
dt_model = DecisionTreeClassifier(labelCol='price_category_group', maxBins=5000)

In [37]:
pipe = Pipeline(stages=[
    neighbourhood_group,
    property_type_group,
    room_type_group,
    bathrooms_group,
    assembler,
    dt_model
    ]
)

In [38]:
fit_model=pipe.fit(train_data)

In [39]:
results = fit_model.transform(test_data)

In [40]:
results.select(['price_category_group','prediction']).show(5)

+--------------------+----------+
|price_category_group|prediction|
+--------------------+----------+
|                 0.0|       0.0|
|                 1.0|       1.0|
|                 1.0|       0.0|
|                 0.0|       0.0|
|                 0.0|       0.0|
+--------------------+----------+
only showing top 5 rows



In [41]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="price_category_group", predictionCol="prediction", metricName="accuracy")

accuracy = ACC_evaluator.evaluate(results)

print(f"The accuracy of the decision tree classifier is {accuracy}")

The accuracy of the decision tree classifier is 0.7811759859952631


In [43]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price_category_group',metricName='areaUnderROC')
AUC = AUC_evaluator.evaluate(results)

print(f"The area under the curve is {AUC:.2f}")

The area under the curve is 0.66


In [44]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price_category_group',metricName='areaUnderPR')
PR = PR_evaluator.evaluate(results)

print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.5137904801545486


In [46]:
from sklearn.metrics import confusion_matrix

y_true = results.select("price_category_group")
y_true = y_true.toPandas()
 
y_pred = results.select("prediction")
y_pred = y_pred.toPandas()
 
cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix: \n {}".format(cnf_matrix))

Below is the confusion matrix: 
 [[6589  601]
 [1524  997]]


In [48]:
tn = cnf_matrix[0][0]
fp = cnf_matrix[0][1]
fn = cnf_matrix[1][0]
tp = cnf_matrix[1][1]

accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1_score = 2*(precision*recall)/(precision+recall)

In [49]:
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")

Accuracy: 0.78
Precision: 0.62
Recall: 0.40
F1 Score: 0.48


#### MLP Classifier

In [None]:
cols_filtered = ["host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
                 "host_identity_verified", "neighbourhood", "property_type", "room_type", "accommodates",
                 "bathrooms", "bedrooms", "beds", "price", "number_of_reviews", "review_scores_rating", 
                 "instant_bookable", "amenities_count", "price_category_group"]
train_data, test_data=data_clf.randomSplit([0.7,0.3])
train_data = train_data.select(cols_filtered)
test_data = test_data.select(cols_filtered)
train_data.show(5)

+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+--------------------+
|host_response_rate|host_acceptance_rate|host_is_superhost|host_total_listings_count|host_identity_verified|neighbourhood|       property_type|      room_type|accommodates|    bathrooms|bedrooms|beds|price|number_of_reviews|review_scores_rating|instant_bookable|amenities_count|price_category_group|
+------------------+--------------------+-----------------+-------------------------+----------------------+-------------+--------------------+---------------+------------+-------------+--------+----+-----+-----------------+--------------------+----------------+---------------+--------------------+
|             100.0|                25.0|                0|                      3.0|               

In [None]:
train_data.select('price_category_group').distinct().show()


+--------------------+
|price_category_group|
+--------------------+
|                 0.0|
|                 1.0|
|                 2.0|
+--------------------+



In [None]:
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

In [None]:
mlp_model = MultilayerPerceptronClassifier().\
        setLabelCol("price_category_group").\
        setFeaturesCol("scaled_features").\
        setSeed(20).\
        setLayers([14, 24, 18, 4])

In [None]:
pipe = Pipeline(stages=[
    neighbourhood_group,
    property_type_group,
    room_type_group,
    bathrooms_group,
    assembler,
    scaler,
    mlp_model
    ]
)

In [None]:
fit_model=pipe.fit(train_data)

                                                                                

In [None]:
results = fit_model.transform(test_data)

In [None]:
results.select(['price_category_group','prediction']).show(5)

+--------------------+----------+
|price_category_group|prediction|
+--------------------+----------+
|                 0.0|       0.0|
|                 1.0|       0.0|
|                 1.0|       1.0|
|                 1.0|       0.0|
|                 0.0|       0.0|
+--------------------+----------+
only showing top 5 rows



In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="price_category_group", predictionCol="prediction", metricName="accuracy")

accuracy = ACC_evaluator.evaluate(results)

print(f"The accuracy of the MLP classifier is {accuracy}")

The accuracy of the MLP classifier is 0.7792048929663609


In [50]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price_category_group',metricName='areaUnderROC')
AUC = AUC_evaluator.evaluate(results)

print(f"The area under the curve is {AUC:.2f}")

The area under the curve is 0.66


In [51]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='price_category_group',metricName='areaUnderPR')
PR = PR_evaluator.evaluate(results)

print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.5137904801545486


In [52]:
from sklearn.metrics import confusion_matrix

y_true = results.select("price_category_group")
y_true = y_true.toPandas()
 
y_pred = results.select("prediction")
y_pred = y_pred.toPandas()
 
cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix: \n {}".format(cnf_matrix))

Below is the confusion matrix: 
 [[6589  601]
 [1524  997]]


In [53]:
tn = cnf_matrix[0][0]
fp = cnf_matrix[0][1]
fn = cnf_matrix[1][0]
tp = cnf_matrix[1][1]

accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1_score = 2*(precision*recall)/(precision+recall)

In [54]:
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")


Accuracy: 0.78
Precision: 0.62
Recall: 0.40
F1 Score: 0.48
