In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline
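# Use the fully qualified option name; the short form is ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)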

## In this notebook we will try to answer the following questions:
1. Is it possible to accurately predict the most attractive price of your Airbnb?
2. What services will actually help you raise the price?
3. What are the most favored neighborhoods?


In [None]:
df_original = pd.read_csv('../input/airbnb-boston/boston_listings.csv')
df = pd.read_csv('../input/boston-preprocessed/boston_listings_updated.csv')
df = df.drop(columns=['Unnamed: 0'])
df.head(2)

Let's have a look at our target variable 'price'.

In [None]:
print(df['price'].describe()) 
df['price'].plot(kind ='box')

We have some outliers in the data, including some extreme values. <br>
Let's use _Tukey's rule_ to detect the outliers.

In [None]:
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)

IQR = Q3 -Q1

Max = Q3 + 1.5*IQR
Min = Q1 - 1.5*IQR
print('Min value {} , Max value {}'.format(Min,Max))
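The fence computation above can be wrapped in a small reusable helper. This is a minimal sketch on made-up prices, not the Boston data; the function name `tukey_fences` and the `toy_prices` series are hypothetical and only serve the illustration.

```python
import pandas as pd


def tukey_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices, made up for illustration (not taken from the listings file).
toy_prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 1000])

lower, upper = tukey_fences(toy_prices)

# Anything outside the fences is flagged as an outlier.
outliers = toy_prices[(toy_prices < lower) | (toy_prices > upper)]
print(lower, upper, outliers.tolist())
```

Applied to the real data, the same helper would be called as `tukey_fences(df['price'])`.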

So, we have outliers at higher prices but none at the lower end (the lower bound is negative). <br>
Let's count the outliers, i.e. the values higher than 422.

In [None]:
print('Total count for price higher than 422 = ' + str(df[df['price'] > 422]['price'].count()))
print('A look at the outlier prices :' + str(np.sort(df[df['price'] > 422]['price'].unique())))

Most of the values fall below 422; let's set the upper limit to 500.

In [None]:
df = df.query('price < 500')

## 1. Is it possible to accurately predict the most attractive price of your Airbnb?

## Prediction

We will use the **PyCaret** library for our prediction.

In [None]:
!pip install scikit-learn

In [None]:
!pip install pycaret
import pycaret
from pycaret.regression import *

Initialize the PyCaret environment. This creates the transformation pipeline that prepares the data for modeling and deployment.

In [None]:
bnb_setup = setup(df, target='price',silent = True)

Compare all models in the model library and score them using K-fold cross-validation.

In [None]:
compare_models(fold=5)
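To illustrate what 5-fold cross-validation scoring means for a single model, here is a minimal sketch using scikit-learn directly. It is not what PyCaret runs internally; the synthetic data, the coefficients, and the choice of `LinearRegression` are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up): 200 rows, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split the data into 5 folds; each fold is held out once for scoring
# while the model is trained on the remaining 4.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# One R^2 score per fold; the mean is the cross-validated estimate.
print(scores, scores.mean())
```

Comparing models then amounts to repeating this for each candidate and ranking them by the averaged score.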

We select the highest-scoring model for prediction.

In [None]:
catboost_regressor = create_model('catboost')

Let's tune the model to find the best hyperparameters, selecting the best model based on Mean Squared Error.

In [None]:
tuned_catboost_regressor = tune_model('catboost',optimize ='mse')

In [None]:
predictions = predict_model(tuned_catboost_regressor)

final_model = finalize_model(tuned_catboost_regressor)
final_model.get_params()

The RMSE of our final model is 41.25 dollars, meaning that a typical prediction is off by roughly that amount (RMSE weights large errors more heavily than a plain average would).
<br><br>
But our goal here is to understand which variables contribute most to the price of an Airbnb.
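To make the RMSE figure easier to read, here is a small sketch comparing RMSE with the mean absolute error (MAE) on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the model above.

```python
import numpy as np

# Hypothetical true and predicted prices, made up for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 200.0, 290.0])

errors = y_pred - y_true  # [10, -10, 0, 40]

# MAE: plain average of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error, so the single
# 40-dollar miss counts for more than it does in the MAE.
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)
```

RMSE is never smaller than MAE, so an RMSE of 41.25 dollars is best read as an upper-leaning estimate of the typical miss.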

Save the model for further use.

In [None]:
save_model(final_model, 'airbnb_catboost_1')

Let's plot an interpretation of the model based on SHAP (SHapley Additive exPlanations), a unified approach to explaining the output of a machine learning model.

In [None]:
interpret_model(final_model)

## Results:

The RMSE of the final model is 41.25 dollars, so a typical prediction is off by roughly that amount. But the SHAP analysis tells us some interesting facts:

* Clearly, whether the listing is an entire home or apartment has the biggest impact on the price, and so do the number of bedrooms and bathrooms. These parameters alone can shift the price of your listing.
* Latitude and longitude, meaning the location, are important as well; we will discuss this later in the post.
* Availability throughout the year and the host's acceptance rate are important factors. Higher values for both will benefit you.
* As expected, the higher the minimum number of nights, the less popular the Airbnb will be.
* It is interesting to note that rentals with higher prices tend to charge little or nothing for the cleaning fee.

***

Feature importance for the prediction

In [None]:
features = final_model.get_feature_importance(prettified = True).set_index('Feature Id').to_dict()['Importances']

# Sort by decreasing importance, breaking ties alphabetically by feature name
feature_importance_dict = dict(sorted(features.items(), key=lambda kv: (-kv[1], kv[0])))

## 2. What services will actually help you raise the price?

Let's check the significance of amenities for the price of the Airbnb.

In [None]:
amenity_list = ['24-Hour Check-in', 'Air Conditioning', 'Breakfast',
       'Buzzer/Wireless Intercom', 'Cable TV', 'Carbon Monoxide Detector',
       'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer', 'Elevator in Building',
       'Essentials', 'Family/Kid Friendly', 'Fire Extinguisher',
       'First Aid Kit', 'Free Parking on Premises',
       'Free Parking on Street', 'Gym', 'Hair Dryer', 'Hangers',
       'Heating', 'Hot Tub', 'Indoor Fireplace', 'Internet', 'Iron',
       'Kitchen', 'Laptop Friendly Workspace', 'Lock on Bedroom Door',
       'Other pet(s)', 'Paid Parking Off Premises', 'Pets Allowed',
       'Pets live on this property', 'Pool', 'Safety Card', 'Shampoo',
       'Smoke Detector', 'Smoking Allowed', 'Suitable for Events', 'TV',
       'Washer', 'Washer / Dryer', 'Wheelchair Accessible',
       'Wireless Internet', 'translation missing: en.hosting_amenity_49',
       'translation missing: en.hosting_amenity_50']

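# Collect (amenity, signed importance) pairs from the encoded feature columns
amenity_importance = []
for amenity in amenity_list:
    for col in feature_importance_dict.keys():
        if amenity in col:
            if 'False' in col:
                # Strip the trailing '_False' and negate: the amenity is absent
                amenity_importance.append((col[:-6], -feature_importance_dict[col]))
            else:
                # Strip the trailing '_True': the amenity is present
                amenity_importance.append((col[:-5], feature_importance_dict[col]))

# Sort by decreasing signed importance
amenity_importance.sort(key=lambda tup: tup[1], reverse=True)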

pd.DataFrame(amenity_importance, columns = ['amenity','Importance']).drop_duplicates().set_index('amenity').plot(kind='bar', figsize = (30,15))
plt.xticks(fontsize=20)
plt.xlabel('amenity', fontsize=30)
plt.ylabel('Importance', fontsize=30)
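The suffix handling in the loop above can be sketched on a toy dictionary. This assumes the encoded columns end in `_True` or `_False`, which is what the slicing in the loop implies; the helper name `signed_amenity_importance` and the scores in `toy` are hypothetical.

```python
def signed_amenity_importance(importances: dict) -> list:
    """Turn {'TV_True': 2.5, 'Hangers_False': 1.0} into signed (amenity, score) pairs."""
    signed = []
    for col, score in importances.items():
        if col.endswith("_False"):
            # Importance attached to the amenity being absent: flip the sign.
            signed.append((col[: -len("_False")], -score))
        elif col.endswith("_True"):
            # Importance attached to the amenity being present: keep the sign.
            signed.append((col[: -len("_True")], score))
        # Columns without either suffix (e.g. 'bedrooms') are not amenities.
    return sorted(signed, key=lambda pair: pair[1], reverse=True)


# Toy importances, made up for illustration.
toy = {"TV_True": 2.5, "Hangers_False": 1.0, "bedrooms": 9.0, "Gym_True": 0.4}
result = signed_amenity_importance(toy)
print(result)
```

Matching on the suffix rather than on a substring avoids accidental hits such as `'TV'` matching `'Cable TV'`.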


## Result:

* We can see that mentioning basic amenities such as hangers, a lock on the bedroom door, or a hair dryer has a negative impact on the listing. This could be because these are very basic necessities and should not be advertised as special features.
* The easiest ways to increase the price are to include a TV, cable, and an indoor fireplace. Free on-premises parking adds a bonus as well.
* Investing in safety devices like a fire extinguisher, a safety card, an intercom, and smoke detectors will not only help you safeguard the place but also earn you a few extra bucks.

***

## 3. What are the most favored neighborhoods?

In [None]:
df_original['price'] = df_original['price'].map(lambda p: int(p[1:-3].replace(",", "")))
df_original = df_original.query('price < 500')

In [None]:
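# Iterate over each neighborhood once instead of once per listing
neighborhoods = df_original['neighbourhood_cleansed'].unique()

# n_price collects (neighborhood, average price) pairs
n_price = []
for n in neighborhoods:
    n_price.append((n, df_original[df_original['neighbourhood_cleansed'] == n]['price'].mean()))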

In [None]:
pd.DataFrame(n_price, columns = ['Neighborhood','Average Price']).drop_duplicates().set_index('Neighborhood').sort_values(by = 'Average Price').plot(kind='bar', figsize = (30,15))
plt.xticks(fontsize=20)
plt.xlabel('Neighborhood', fontsize=30)
plt.ylabel('Average Price', fontsize=30)
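The per-neighborhood averages plotted above can also be computed with a single `groupby`. This is a minimal sketch on a toy frame; the prices are made up, and only the column names `neighbourhood_cleansed` and `price` come from the notebook.

```python
import pandas as pd

# Toy listings, made up for illustration (not the real Boston prices).
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Back Bay", "Back Bay", "Roxbury", "Roxbury", "Fenway"],
    "price": [200, 300, 80, 120, 150],
})

# Average price per neighborhood, sorted from cheapest to most expensive.
avg_price = toy.groupby("neighbourhood_cleansed")["price"].mean().sort_values()
print(avg_price)
```

On the real data the same expression with `df_original` in place of `toy` would give a Series that can be plotted directly with `.plot(kind='bar')`.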

## Result:

As expected, listings near the city center cost on average $50 or more above those farther from the city. It’s the premium you pay to be ‘right there’ location-wise, which honestly sounds good to me.

***