# Predictive Analysis of Airbnb data


### Description:

In this part of the project, we try to predict the price of an Airbnb property from its features. The attributes that appeared most relevant to pricing were selected for training machine learning models. PySpark DataFrames are used for the computations, and packages from pyspark.ml are used for the predictions. The data is trained and tested on several machine learning models to see which one fits this particular dataset best: Linear Regression, Decision Tree Regressor, and Gradient Boosted Tree Regressor.

### Packages to install for this code

`pip install geopy` or `conda install -c conda-forge geopy`

geopy.distance.distance() is used to compute the distance from each Airbnb property to points of interest in the city, based on latitude and longitude values.
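
As a rough illustration of what this distance computation yields, here is a minimal standard-library sketch using the haversine (great-circle) formula. It is an approximation and not geopy's method: geopy.distance.distance() computes a geodesic distance on an ellipsoid, so its results will differ slightly. The function name `haversine_miles` and the Earth radius constant are assumptions made for this sketch; the two coordinates are the airport and downtown points used later in this notebook.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def haversine_miles(point_a, point_b):
    """Approximate great-circle distance in miles between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


airport = (33.9416, -118.4085)
downtown = (34.0407, -118.2468)
print(round(haversine_miles(airport, downtown), 1))
```

The haversine formula treats the Earth as a sphere, which is usually close enough for city-scale distances like these.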


In [1]:

# data file
airbnb_listings_data = "./listings.csv"

# import statements
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import matplotlib.pyplot as plt
from pyspark.sql.types import *
from pyspark.sql import functions as F
import pandas as pd
import geopy.distance


### Read airbnb listings into dataframe

In [2]:

def load_airbnb_data(input_data):
      
    
    # read input data
    
    # Note: Spark had trouble loading the listings CSV directly; column values got mixed up,
    # probably because of the separator (","), and the issue remained even after defining a schema.
    # So we read the file with pandas and convert it to Spark for the rest of the computations.
    # The dtype of monthly_price and weekly_price is set explicitly to avoid a mixed-types warning.
    listings_df = pd.read_csv(input_data,dtype={'monthly_price':object,'weekly_price':object})
    
    
    # select relevant columns for predicting from the entire dataset
    
    listings_df = listings_df[['latitude','longitude','accommodates',
                     'bathrooms','bedrooms','beds','price','cleaning_fee',
                     'guests_included','number_of_reviews',
                     'review_scores_rating']] 
    
    # consider a fraction of the dataset for faster computation
    sample_listings=listings_df.sample(frac=0.2,random_state=1)
    sample_listings=sample_listings.dropna()
    
    return sample_listings



#### Note:
We have considered only numeric columns for prediction. The categorical attributes in this dataset have too many categories, and most of those categories occur infrequently, as shown in the visualization part of this project. Using binarized categorical attributes negatively influenced the predictions, so the categorical attributes were removed.


In [3]:

def pandas_to_spark_convert(listings_df):
    
    spark = SparkSession.builder\
            .appName("AirBnB listings") \
            .getOrCreate() 
    spark_df_schema=StructType([StructField("latitude", FloatType(), True)\
                       ,StructField("longitude", FloatType(), True)\
                       ,StructField("accommodates", IntegerType(), True)\
                       ,StructField("bathrooms", FloatType(), True)\
                       ,StructField("bedrooms", FloatType(), True)\
                       ,StructField("beds", FloatType(), True)\
                       ,StructField("price", StringType(), True)\
                       ,StructField("cleaning_fee", StringType(), True)\
                       ,StructField("guests_included", IntegerType(), True)\
                       ,StructField("number_of_reviews", IntegerType(), True)\
                       ,StructField("review_scores_rating", FloatType(), True)\
                       ])
    airbnb_df=spark.createDataFrame(listings_df,schema = spark_df_schema)

    return airbnb_df


In [4]:

def get_data_in_spark_dataframe(input_data):
    
    listings_df=load_airbnb_data(input_data)
    
    airbnb_df = pandas_to_spark_convert(listings_df)
    
    return airbnb_df
    

### Data preprocessing

In [5]:

def data_tranformation(initial_data_df):
    
    ##### function to clean and transform column values so that it can be used for prediction ####
    
    # drop NaN and Null values, if any
    
    listing_df=initial_data_df.na.drop()
    
    print("Number of records in dataset:",listing_df.count())
    
    # use latitude and longitude values to determine the distances of each property
    # from important(tourist) spots in LA
    
    airport = (33.9416,-118.4085)
    downtown = (34.0407,-118.2468)
    anaheim = (33.8366,-117.9143)
    santa_monica_pier =(34.0103,-118.4962)
    hollywood_center = (34.0928,-118.3287)
    
    # convert the list dataframe into pandas for easier computation of distances
    
    list_df_pd=listing_df.toPandas()
    
    for i in range(len(list_df_pd.latitude)):
        list_df_pd.at[i,'dist_to_airport']=geopy.distance.distance((\
                                            list_df_pd.latitude[i],list_df_pd.longitude[i]),airport).miles
        list_df_pd.at[i,'dist_to_downtown']=geopy.distance.distance((\
                                            list_df_pd.latitude[i],list_df_pd.longitude[i]),downtown).miles
        list_df_pd.at[i,'dist_to_anaheim']=geopy.distance.distance((\
                                            list_df_pd.latitude[i],list_df_pd.longitude[i]),anaheim).miles
        list_df_pd.at[i,'dist_to_santa_monica_pier']=geopy.distance.distance((\
                                                list_df_pd.latitude[i],list_df_pd.longitude[i]),santa_monica_pier).miles
        list_df_pd.at[i,'dist_to_hollywood']=geopy.distance.distance((\
                                        list_df_pd.latitude[i],list_df_pd.longitude[i]),hollywood_center).miles
    
    
    # to get a better picture of the dataset in use
    print("\nVIEW DATASET IN PANDAS:\n")
    display(list_df_pd)
    
    # drop latitude,longitude columns as we don't need them anymore
    
    list_df_pd = list_df_pd.drop(columns=['latitude','longitude'],axis=1)
    
    
    # convert back into spark dataframe for further data cleaning 
    
    spark = SparkSession.builder\
            .appName("AirBnB listings") \
            .getOrCreate() 
    
    airbnb_data = spark.createDataFrame(list_df_pd)
    
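    # strip "$" and "," from the currency strings, then cast to float
    # (raw strings avoid Python's invalid-escape warning for "\$")
    airbnb_data = airbnb_data.withColumn("price",F.regexp_replace(F.col("price"), r"[\$,]", "").cast("float"))
    
    airbnb_data = airbnb_data.withColumn("cleaning_fee",F.regexp_replace(F.col("cleaning_fee"), r"[\$,]", "").cast("float"))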
    
    #airbnb_data
    
    return airbnb_data
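
The price cleanup at the end of the function above can be checked on plain strings. This is a minimal sketch that applies the same character class with Python's `re` module instead of Spark; `clean_price` is a hypothetical helper name, and the sample strings are illustrative.

```python
import re


def clean_price(raw):
    """Strip '$' and ',' from a currency string and convert it to float."""
    return float(re.sub(r"[\$,]", "", raw))


print(clean_price("$123.00"))
print(clean_price("$1,250.00"))
```

Removing the comma matters for prices of 1,000 and above, which would otherwise fail to parse as a number.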
    
  

In [6]:

def normalize_data(norm_df):
    
    #### function to normalize column values #####
    
    # filter independent variables for normalization
    columns_to_normalize=[ col for col in norm_df.columns if col != "price"]
    
    # normalize column = (each_value_in_column - mean_of_column)/ standard_deviation_of_column
    for col_name in columns_to_normalize:
        
        col_mean=norm_df.agg({col_name: "mean"}).collect()[0][0]
        col_std=norm_df.agg({col_name: "std"}).collect()[0][0]
        
        norm_df=norm_df.withColumn(col_name, ((F.col(col_name)-col_mean)/col_std))
        
    return norm_df
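
The z-score formula in the comment above can be tried on plain Python lists. The sketch below is an illustration under stated assumptions, not the notebook's method: it computes the mean and sample standard deviation once on a training list and reuses them for the test list, whereas `normalize_data` as written derives the statistics from whichever DataFrame it receives. The helper names `fit_stats` and `standardize` are hypothetical.

```python
import statistics


def fit_stats(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


def standardize(values, mean, std):
    """Apply the z-score: (value - mean) / std."""
    return [(v - mean) / std for v in values]


train_beds = [1.0, 2.0, 3.0]   # illustrative values, not from the dataset
test_beds = [2.0, 4.0]

mean, std = fit_stats(train_beds)          # statistics come from the training data only
print(standardize(train_beds, mean, std))
print(standardize(test_beds, mean, std))   # test data reuses the training statistics
```

Reusing the training statistics keeps train and test features on the same scale, so the fitted coefficients mean the same thing for both sets.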


In [7]:



def vectorize_variables(attributes_df):
    
    ##### function to vectorize independent variables #####
    
    
    # create feature set by vectorizing independent variables
    input_columns= [ col for col in attributes_df.columns if col != 'price' ]
    
    vectorAssembler = VectorAssembler(inputCols = input_columns, outputCol = 'airbnb_attributes')

    features_set = vectorAssembler.transform(attributes_df)
    
    features_set = features_set.select(['airbnb_attributes', 'price'])
    
    return features_set


### Linear Regression

In [8]:

def linear_regression(airbnb_dataset):
    
    #### linear regression function ####
    
    # split cleaned data into training and testing set
    train_df,test_df = airbnb_dataset.randomSplit([0.7, 0.3],seed=1)
    
    # normalize independent variables for linear regression
    norm_train=normalize_data(train_df)
    
    # assemble the feature columns into a single vector column
    vector_train_df=vectorize_variables(norm_train)
    
    # create linear regression model
    lr = LinearRegression(featuresCol = 'airbnb_attributes', labelCol='price', 
                      maxIter=10, regParam=0.05, elasticNetParam=0.01)
    
    lr_model = lr.fit(vector_train_df)
    print("\n\n----------LINEAR REGRESSION MODEL-----------")
    print("\nHYPERPARAMETERS :\n")
    
    print("\nCo-efficients: \n" + str(lr_model.coefficients))
    
    print("\nIntercept: " + str(lr_model.intercept))
    
    
    
    
    # ___ predict using the created linear model ___
    
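    # normalize test data for prediction
    # (note: this uses the test set's own mean and standard deviation, not the training set's)
    norm_test=normalize_data(test_df)
    
    # assemble the feature columns into a single vector column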
    vector_test_df=vectorize_variables(norm_test)
    
    # predict price
    lr_predictions = lr_model.transform(vector_test_df)
    
      
    print("\n\nPRICE PREDICTION USING LINEAR REGRESSION MODEL\n")
    lr_predictions.select("prediction","price","airbnb_attributes").show()
    
    
    
    # print evaluation measures on training data
    
    trainingSummary = lr_model.summary
    print("\nRMSE(Root Mean squared Error) on training data: %f" % trainingSummary.rootMeanSquaredError)
    
    print("\nr2(R^2 or R-square) on training data: %f" % trainingSummary.r2)
    
    
    
    # print evaluation measures on testing data
    test_result = lr_model.evaluate(vector_test_df)
    
    lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                     labelCol="price",metricName="r2")
    
    print("\nRMSE on testing data = %g" % test_result.rootMeanSquaredError)
     
    print("\nr2 on test data = %g" % lr_evaluator.evaluate(lr_predictions))
  
    
    
    print("\nRESIDUALS:\n")
    trainingSummary.residuals.show()
    
    
 

In [9]:

def decision_tree_regression(airbnb_dataset):
    
    # assemble the feature columns into a single vector column

    data = vectorize_variables(airbnb_dataset)
    
    # Automatically identify categorical features, and index them.
    # We specify maxCategories so features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="airbnb_attributes", outputCol="indexedFeatures", maxCategories=4).fit(data)

    # Split the data into training and test sets (30% held out for testing)
    (trainData, testData) = data.randomSplit([0.7, 0.3],seed=1)

    # Train a DecisionTree model.
    dt = DecisionTreeRegressor(featuresCol="indexedFeatures",labelCol = 'price')

    # Chain indexer and tree in a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, dt])

    # Train model
    model = pipeline.fit(trainData)
    
    
    # Make predictions on train data
    predictions_train = model.transform(trainData)
    
    # Make predictions on test data
    predictions_test = model.transform(testData)
    print("\n\n--------DECISION TREE REGRESSION MODEL---------\n")
    
    print("\n\nPRICE PREDICTION:\n")
    # Select rows to display.
    predictions_test.select("prediction", "price", "airbnb_attributes").show()

    
    # Select (prediction, true label) and compute test error
    
    evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
    
    # evaluate performance on train data
    rmse_train = evaluator.evaluate(predictions_train)
    print("RMSE on train data = %g" % rmse_train)
    print("R^2 on train data = %f" % evaluator.evaluate(predictions_train,{evaluator.metricName: "r2"}))
    
    # evaluate performance on test data (the evaluator defined above is reused)
    rmse_test = evaluator.evaluate(predictions_test)
    print("RMSE on test data = %g" % rmse_test)
    print("R^2 on test data = %f" % evaluator.evaluate(predictions_test,{evaluator.metricName: "r2"}))
    
    
    
    treeModel = model.stages[1] # summary only
    print("\n")
    print(treeModel)
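
To make the `maxCategories=4` setting concrete, here is a minimal plain-Python sketch of the rule described in the comments of the function above: a feature with at most 4 distinct values is treated as categorical, and a feature with more is treated as continuous. The sketch only mimics that decision on lists and does not call Spark; `split_by_cardinality` and the sample columns are hypothetical.

```python
def split_by_cardinality(columns, max_categories=4):
    """Partition column names by their number of distinct values."""
    categorical, continuous = [], []
    for name, values in columns.items():
        if len(set(values)) <= max_categories:
            categorical.append(name)
        else:
            continuous.append(name)
    return categorical, continuous


sample_columns = {                      # illustrative values, not from the dataset
    "bathrooms": [1.0, 2.0, 1.0, 1.0, 2.0, 1.0],
    "number_of_reviews": [3, 174, 2, 71, 7, 214],
}
print(split_by_cardinality(sample_columns))
```

On a small sample, a numeric column such as a bathroom count can fall under the threshold and be indexed as categorical, which is worth keeping in mind when reading the tree's splits.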
    

In [10]:

def gradient_boosted_tree_regression(airbnb_dataset):
    
    # assemble the feature columns into a single vector column
    data = vectorize_variables(airbnb_dataset)
    
    # Automatically identify categorical features, and index them.
    # Set maxCategories so features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="airbnb_attributes", outputCol="indexedFeatures", maxCategories=4).fit(data)

    
    # Split the data into training and test sets (30% held out for testing)
    (trainData, testData) = data.randomSplit([0.7, 0.3],seed=1)
    print("\n\n--------GRADIENT BOOSTED TREE REGRESSION MODEL-------\n")
    # Train a GBT model.
    gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10,labelCol = 'price')

    # Chain indexer and GBT in a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, gbt])

    # Train model
    model = pipeline.fit(trainData)

    # Make predictions on train and test data
    predictions_train = model.transform(trainData)
    predictions_test = model.transform(testData)
    
    
    print("\n\nPRICE PREDICTION:\n")
    # Select rows to display.
    predictions_test.select("prediction", "price", "airbnb_attributes").show()
    
    
    # Select (prediction, true label) and compute test error
    evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")

    # evaluate performance on training set
    rmse_train = evaluator.evaluate(predictions_train)
    print("\nRMSE on train data = %g" % rmse_train)
    print("\nR^2 on train data = %f" % evaluator.evaluate(predictions_train,{evaluator.metricName: "r2"}))
    
    
    # evaluate performance on testing set
    rmse_test = evaluator.evaluate(predictions_test)
    print("\nRMSE on test data = %g" % rmse_test)
    print("\nR^2 on test data = %f" % evaluator.evaluate(predictions_test,{evaluator.metricName: "r2"}))
    
    
    gbtModel = model.stages[1]
    print("\n")
    print(gbtModel)  # summary only
    

In [11]:

def main(input_data):
    
           
    airbnb_df = get_data_in_spark_dataframe(input_data)
    
    cleaned_airbnb_df = data_tranformation(airbnb_df)
    
    linear_regression(cleaned_airbnb_df)
    
    decision_tree_regression(cleaned_airbnb_df)
    
    gradient_boosted_tree_regression(cleaned_airbnb_df)
    
      
    

In [12]:
if __name__ == "__main__":
    
    main(airbnb_listings_data)

Number of records in dataset: 6366

VIEW DATASET IN PANDAS:



| | latitude | longitude | accommodates | bathrooms | bedrooms | beds | price | cleaning_fee | guests_included | number_of_reviews | review_scores_rating | dist_to_airport | dist_to_downtown | dist_to_anaheim | dist_to_santa_monica_pier | dist_to_hollywood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.087292 | -118.366699 | 5 | 2.0 | 2.0 | 2.0 | $123.00 | $100.00 | 4 | 3 | 80.0 | 10.324306 | 7.590456 | 31.202434 | 9.130260 | 2.211877 |
| 1 | 34.205978 | -118.382797 | 6 | 2.0 | 2.0 | 3.0 | $179.00 | $70.00 | 4 | 3 | 100.0 | 18.281763 | 13.803872 | 37.028319 | 14.972480 | 8.394288 |
| 2 | 34.048241 | -118.454460 | 2 | 1.0 | 1.0 | 1.0 | $69.00 | $35.00 | 1 | 174 | 98.0 | 7.809327 | 11.925894 | 34.286582 | 3.546222 | 7.839930 |
| 3 | 34.043640 | -118.252350 | 2 | 1.0 | 0.0 | 2.0 | $91.00 | $75.00 | 1 | 3 | 100.0 | 11.394180 | 0.377446 | 24.098613 | 14.181267 | 5.537134 |
| 4 | 34.136978 | -118.222481 | 2 | 1.0 | 1.0 | 1.0 | $95.00 | $50.00 | 1 | 2 | 100.0 | 17.183321 | 6.780928 | 27.234101 | 17.963968 | 6.808213 |
| 5 | 34.147060 | -118.201210 | 3 | 1.0 | 1.0 | 2.0 | $120.00 | $95.00 | 1 | 71 | 98.0 | 18.493033 | 7.783068 | 27.003628 | 19.367057 | 8.209634 |
| 6 | 33.782490 | -118.142540 | 6 | 2.0 | 2.0 | 4.0 | $55.00 | $150.00 | 1 | 7 | 97.0 | 18.817755 | 18.778039 | 13.650650 | 25.684599 | 23.912299 |
| 7 | 34.161861 | -118.534622 | 6 | 1.0 | 2.0 | 6.0 | $127.00 | $60.00 | 1 | 3 | 100.0 | 16.817565 | 18.495527 | 42.079039 | 10.676211 | 12.727031 |
| 8 | 34.015072 | -118.496513 | 3 | 1.0 | 1.0 | 1.0 | $130.00 | $40.00 | 2 | 214 | 90.0 | 7.154274 | 14.438624 | 35.641029 | 0.329388 | 11.017526 |
| 9 | 34.102730 | -118.327141 | 4 | 1.0 | 1.0 | 4.0 | $175.00 | $125.00 | 1 | 3 | 100.0 | 12.047377 | 6.285983 | 29.975017 | 11.603688 | 0.690227 |




----------LINEAR REGRESSION MODEL-----------

HYPERPARAMETERS :


Co-efficients: 
[4.7623038577048264,79.86590613047609,13.840746221645448,-14.888429711170952,105.37544621078474,-2.802899599386168,3.383733998376521,6.606955535503115,-2.2871350931459804,6.342935246350399,7.0016388177379305,-5.257326277438803,0.7047523320520663]

Intercept: 163.74523435747918


PRICE PREDICTION USING LINEAR REGRESSION MODEL

+------------------+------+--------------------+
|        prediction| price|   airbnb_attributes|
+------------------+------+--------------------+
| 23.64882674131536|  30.0|[-1.0968298137201...|
| 35.85321431182038|  42.0|[-1.0968298137201...|
| 84.98157672366162|  49.0|[-1.0968298137201...|
| 49.00101333541457|  50.0|[-1.0968298137201...|
|115.85550318956848|  50.0|[-1.0968298137201...|
|12.844483803466773|  60.0|[-1.0968298137201...|
|130.64049658261888|  64.0|[-1.0968298137201...|
| 65.47326145104194|  70.0|[-1.0968298137201...|
| 145.7980111813024|  75.0|[-1.0968298137201...|


### Observation 1:
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

For linear regression on the Airbnb dataset, the RMSE on test data is greater than the RMSE on train data, implying that our model has overfitted the data.

The R² value depends on the train and test data, which are split randomly, so the result might vary with a different split.
An R² value close to 1 implies a good fit: (R² × 100)% of the variability in “price” is explained by the model. Here that is only 38.4%.

Clearly, linear regression is not a good choice for predicting prices for our dataset.
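
The two measures used in these observations, RMSE and R², can be computed by hand on a small list of values. The following is a minimal standard-library sketch of the usual definitions; the function names and the sample prices are illustrative assumptions and are not taken from the dataset or from the models' output.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


prices = [100.0, 150.0, 200.0, 250.0]        # illustrative values, not from the dataset
predictions = [110.0, 140.0, 210.0, 240.0]
print(rmse(prices, predictions))
print(r_squared(prices, predictions))
```

RMSE is expressed in the same unit as the price, while R² is unitless, which is why the two are reported together.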


      
### Observation 2:

In the Decision Tree Regressor model, the error on the training set is greater than the error on the testing set, which means there is no overfitting in this model.

The closer the RMSE values on training data and test data, the better the model generalizes: it works equally well on both sets.

Here, the RMSE values on train data and test data are close, implying that the Decision Tree Regressor is a good model for our dataset.


### Observation 3:

From the predictions table of the Gradient Boosted Tree Regressor, we can see that some predictions are very close to the actual price, which is encouraging.

But there are also some predictions that deviate strongly from the actual price, which has inflated the RMSE value.

The RMSE on train data is less than the RMSE on test data, which means the model is overfitting.

Gradient Boosted Tree Regressor is usually considered the most suitable model for price prediction. However, the predictions depend heavily on the attribute set we have chosen and on the attributes the model selects for its nodes and branches.

So, for the dataset we have taken, the attributes we have chosen, and the normalization we have performed, GBT is not a good choice.

### CONCLUSION:

The Decision Tree Regressor seems to be the best choice among the three models for our dataset. It shows no overfitting, and the difference between its RMSE on the training set and its RMSE on the testing set is smaller than for the other two models.
