# Airbnb Room Price Prediction 

- Airbnb is an online platform that allows people to rent short-term accommodation. Hosts range from individuals with a spare bedroom to property management firms that lease multiple rentals. On one side, Airbnb enables owners to list their space and earn rental income. On the other side, it gives travelers easy access to private homes for rent.

- Airbnb receives commissions from two sources upon every booking, namely from the hosts and guests. For every booking, Airbnb charges the guest 6-12% of the booking fee. Moreover, Airbnb charges the host 3% for every successful transaction.

- As a senior data scientist at Airbnb, you have to come up with a pricing model that can effectively predict the rent for an accommodation and help hosts, travelers, and the business devise profitable strategies.
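
The commission structure described above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the helper `commission_range` is hypothetical, and it assumes that both the guest fee and the host fee are computed on the same booking fee.

```python
def commission_range(booking_fee, guest_rates=(0.06, 0.12), host_rate=0.03):
    """Return the (low, high) total commission for a single booking.

    Assumption: guest and host percentages apply to the same booking fee.
    """
    host_fee = booking_fee * host_rate
    low = booking_fee * guest_rates[0] + host_fee
    high = booking_fee * guest_rates[1] + host_fee
    return low, high


low, high = commission_range(200.0)
print(f"Commission on a 200 booking: {low:.2f} to {high:.2f}")
```

Under these assumptions, a more accurate price prediction feeds directly into the commission estimate, since both fees scale with the booking fee.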

## Data Dictionary
- 1. `id`: Property ID
- 2. `room_type`: Type of room in the property
- 3. `accommodates`: How many adults the property can accommodate
- 4. `bathrooms`: Number of bathrooms in the property
- 5. `cancellation_policy`: Cancellation policy of the property
- 6. `cleaning_fee`: Whether the property's cleaning fee is included in the rent or not
- 7. `instant_bookable`: Whether an instant booking facility is available or not
- 8. `review_scores_rating`: The review rating score of the property
- 9. `bedrooms`: Number of bedrooms in the property
- 10. `beds`: Total number of beds in the property
- 11. `log_price`: Log of the rental price of the property for a fixed period
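
Because the target is stored as `log_price`, predictions have to be transformed back before they can be read as prices. The sketch below assumes the natural logarithm was used (the data dictionary does not say which base) and uses made-up values.

```python
import numpy as np

# Illustrative log prices; assumption: natural log of the rental price
log_prices = np.array([4.0, 4.78, 7.6])

# Invert the log transform to get back to the price scale
prices = np.exp(log_prices)
print(np.round(prices, 2))
```

If a different base had been used, `np.exp` would need to be replaced by the matching inverse (for example `10 ** log_prices` for base 10).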

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option('display.float_format', lambda x: '%.3f' % x) # display floats with three decimal places

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

### Load Data

In [None]:
airbnb = pd.read_csv('../input/airbnb/AirBNB.csv')
airbnb.head()

In [None]:
airbnb.tail()

In [None]:
print('This dataset has', airbnb.shape[0], 'rows/observations, and ', airbnb.shape[1], 'columns')

In [None]:
airbnb.info()

The dataset has 7 numerical columns and 4 categorical columns, which are stored as *object*.

Dropping the `id` column from the dataset, since it only identifies the property

In [None]:
airbnb = airbnb.drop(['id'], axis=1)

In [None]:
for col in airbnb.select_dtypes(include=['object']):
    airbnb[col] = airbnb[col].astype('category')

In [None]:
cat_col = airbnb.select_dtypes(include=['category'])

for col in cat_col:
    print('Unique Values of {} are \n'.format(col),airbnb[col].unique())
    print('*'*90)

In [None]:
airbnb.describe(include='all').T

**Quick Insights**
- The most frequent `room_type` is *Entire home/apt*
- `accommodates` averages about 3 guests, and 75% of the places accommodate 4 guests or fewer
- One is the most common number of `bathrooms`
- Most places use a *strict* `cancellation_policy`
- Most places charge a `cleaning_fee`
- `instant_bookable` is not offered by most place-owners
- The average `review_scores_rating` is above 94%
- One is also the most common number of `beds`
- `log_price` averages 4.78 and goes up to 7.60
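
Insights like the ones above can be double-checked with `value_counts` and `quantile`. The sketch below uses a small made-up frame named `toy` (not the Airbnb data) purely to show the calls.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "accommodates": [2, 1, 4, 6],
})

# Share of each category, most frequent first
room_share = toy["room_type"].value_counts(normalize=True)
print(room_share)

# 75th percentile: 75% of rows are at or below this value
q75 = toy["accommodates"].quantile(0.75)
print("75th percentile of accommodates:", q75)
```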

In [None]:
airbnb.hist(figsize=(20,15));

### Creating Training/Test Sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_set, test_set = train_test_split(airbnb, test_size=0.2, random_state=42)
print(len(train_set), 'rows in training set')
print(len(test_set), 'rows in test set')

### EDA

In [None]:
data = train_set.copy()
data.head()

In [None]:
# For univariate analysis of numerical variables we want to study their central tendency
# and dispersion.
# This helper draws a boxplot and a histogram of one numerical column on a shared x-axis,
# so the same plotting code does not have to be repeated for every variable.
def histogram_boxplot(feature, figsize=(10,8), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array (pandas Series)
    figsize: size of fig (default (10,8))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, # two stacked subplots
                                           sharex=True, # x-axis shared by both subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='violet') # a marker indicates the mean of the column
    # sns.distplot is deprecated; sns.histplot is its current replacement for histograms
    if bins:
        sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins, color='orange')
    else:
        sns.histplot(x=feature, kde=False, ax=ax_hist2, color='tab:cyan')
    ax_hist2.axvline(feature.mean(), color='purple', linestyle='--') # add mean to the histogram (NaN skipped)
    ax_hist2.axvline(feature.median(), color='black', linestyle='-') # add median to the histogram (NaN skipped)

In [None]:
# Function to create bar plots annotated with the percentage of each category.

def perc_on_bar(z):
    '''
    plot
    z: name of a categorical column in `data`
    the function won't work if a column is passed in hue parameter
    '''

    total = len(data[z]) # length of the column
    plt.figure(figsize=(15,5))
    ax = sns.countplot(x=data[z], palette='Paired')
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # x position: near the centre of the bar
        y = p.get_y() + p.get_height()           # y position: top of the bar

        ax.annotate(percentage, (x, y), size = 12) # annotate the percentage
    plt.show() # show the plot

#### `accommodates`

In [None]:
histogram_boxplot(data['accommodates'])

#### `bathrooms`

In [None]:
histogram_boxplot(data['bathrooms'])

#### `bedrooms`

In [None]:
histogram_boxplot(data['bedrooms'])

#### `beds`

In [None]:
histogram_boxplot(data['beds'])

#### `review_scores_rating`

In [None]:
histogram_boxplot(data['review_scores_rating'])

#### `log_price`

In [None]:
histogram_boxplot(data['log_price'])

#### `room_type`

In [None]:
perc_on_bar('room_type')

#### `cancellation_policy`

In [None]:
perc_on_bar('cancellation_policy')

#### `cleaning_fee`

In [None]:
perc_on_bar('cleaning_fee')

#### `instant_bookable`

In [None]:
perc_on_bar('instant_bookable')

### Correlation Check

In [None]:
plt.figure(figsize=(10,5))

sns.heatmap(data.corr(numeric_only=True), # numeric columns only (pandas >= 1.5)
            annot=True,
            linewidths=0.5,vmin=-1,vmax=1,
            center=0,
            cbar=True,
            )

plt.show()

In [None]:
data.corr(numeric_only=True)['log_price'].sort_values(ascending=False) # numeric columns only (pandas >= 1.5)

In [None]:
sns.pairplot(data, diag_kind='kde'); # pairplot creates its own figure

#### Attributes Combinations

In [None]:
data.columns

- Number of `bedrooms` per `accommodates`
- Number of `beds` per `accommodates`
- Number of `bathrooms` per `accommodates`

In [None]:
data['bedrooms_per_accommodates'] = data['bedrooms'] / data['accommodates']
data['beds_per_accommodates'] = data['beds'] / data['accommodates']
data['bathrooms_per_accommodates'] = data['bathrooms'] / data['accommodates']

In [None]:
data[['bedrooms_per_accommodates', 'beds_per_accommodates', 'bathrooms_per_accommodates']].describe().T

In [None]:
plt.figure(figsize=(10,5))

sns.heatmap(data.corr(numeric_only=True), # numeric columns only (pandas >= 1.5)
            annot=True,
            linewidths=0.5,vmin=-1,vmax=1,
            center=0,
            cbar=True,
            )

plt.show()

In [None]:
data.corr(numeric_only=True)['log_price'].sort_values(ascending=False) # numeric columns only (pandas >= 1.5)

`bathrooms_per_accommodates` has a correlation of -0.373 with `log_price`, which is stronger in magnitude than that of the raw number of `bathrooms` in the original dataset

### Data Cleaning 

#### Numeric Data

In [None]:
data = train_set.drop('log_price', axis=1)
data_labels = train_set['log_price'].copy()

In [None]:
data_num = data.drop(['room_type', 'cancellation_policy', 'cleaning_fee', 'instant_bookable'], axis=1)

In [None]:
data_num.info()

In [None]:
data_num.isnull().sum()

In [None]:
# Using SimpleImputer with the median strategy to fill in missing numeric values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

In [None]:
imputer.fit(data_num)

In [None]:
imputer.statistics_

In [None]:
data_num.median().values

In [None]:
X = imputer.transform(data_num)

In [None]:
# Converting the transformed array back into a DataFrame

data_tr = pd.DataFrame(X, columns=data_num.columns, index=data_num.index)

In [None]:
data_tr.head()

In [None]:
# Double checking with info()

data_tr.info()
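
The imputation steps above can be reproduced on a tiny example to see what `SimpleImputer` does. The frame `toy` below is made up for illustration; the point is only that the learned `statistics_` are the column medians and that every missing value is replaced by them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with one missing value per column (invented values)
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "beds": [1.0, 2.0, np.nan, 4.0],
})

imputer = SimpleImputer(strategy="median")
# fit_transform learns the medians and fills the gaps; it returns a NumPy array
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print("Learned medians:", imputer.statistics_)
print(filled)
```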

#### Categorical Data

In [None]:
data_cat = data[['room_type', 'cancellation_policy', 'cleaning_fee', 'instant_bookable']]
data_cat.head(10)

In [None]:
data_cat.isnull().sum()

In [None]:
cat_imputer = SimpleImputer(strategy='most_frequent')

In [None]:
cat_imputer.fit(data_cat)

In [None]:
cat_imputer.statistics_

In [None]:
data_cat.mode().values

In [None]:
X_cat = cat_imputer.transform(data_cat)

In [None]:
# Converting the transformed array back into a DataFrame

data_cat_fil = pd.DataFrame(X_cat, columns=data_cat.columns, index=data_cat.index)
data_cat_fil.info()

In [None]:
# Encoding the Categorical values with OneHotEncoder function from Scikit-Learn

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()

In [None]:
data_cat_1hot = cat_encoder.fit_transform(data_cat_fil)
data_cat_1hot

In [None]:
# Showing the encoded values in an array

data_cat_1hot.toarray()

In [None]:
cat_encoder.categories_

### Feature Scaling and Transformations

In [None]:
data.info()

In [None]:
data_num.info()

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
# Getting the required indices

col_names = "accommodates", "bathrooms", "bedrooms", "beds"
accommodates_ix, bathrooms_ix, bedrooms_ix, beds_ix = [
    data.columns.get_loc(c) for c in col_names] # get the column indices

In [None]:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bathrooms_per_accommodates=True): # no *args or **kwargs
        self.add_bathrooms_per_accommodates = add_bathrooms_per_accommodates
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        bedrooms_per_accommodates = X[:, bedrooms_ix] / X[:, accommodates_ix]
        beds_per_accommodates = X[:, beds_ix] / X[:, accommodates_ix]
        if self.add_bathrooms_per_accommodates:
            bathrooms_per_accommodates = X[:, bathrooms_ix] / X[:, accommodates_ix]
            return np.c_[X, bedrooms_per_accommodates, beds_per_accommodates,
                         bathrooms_per_accommodates]
        else:
            return np.c_[X, bedrooms_per_accommodates, beds_per_accommodates]

attr_adder = CombinedAttributesAdder(add_bathrooms_per_accommodates=False)
airbnb_extra_attribs = attr_adder.transform(data.values)

In [None]:
airbnb_extra_attribs

In [None]:
airbnb_extra_attribs = pd.DataFrame(
    airbnb_extra_attribs,
    columns=list(data.columns)+["bedrooms_per_accommodates", "beds_per_accommodates"],
    index=data.index)
airbnb_extra_attribs.head()

In [None]:
# Cross-check with the original dataset to see the added combined attributes
data.head()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Scaling & transformation will be applied on the numerical dataset (data_num)

col_names = "accommodates", "bathrooms", "bedrooms", "beds"
accommodates_ix, bathrooms_ix, bedrooms_ix, beds_ix = [
    data_num.columns.get_loc(c) for c in col_names] # get the column indices


num_pipeline = Pipeline([
                        ('imputer', SimpleImputer(strategy="median")),
                        ('attribs_adder', CombinedAttributesAdder()),
                        ('std_scaler', StandardScaler())
                        ])
data_num_tr = num_pipeline.fit_transform(data_num)

#### Note:
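I encountered an error while executing `num_pipeline` because the column indices changed: they were first computed on `data`, which still contains the categorical columns. I therefore set `col_names` again and recompute the indices from `data_num` in the cell above, so that each attribute maps to the right column.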
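
The index shift described in the note can be seen without any data. The sketch below assumes the column order follows the data dictionary (after `id` and `log_price` are removed); the frames `data_full` and `data_numeric` are empty stand-ins created only for this illustration.

```python
import pandas as pd

# Assumed column order, taken from the data dictionary
full_cols = ["room_type", "accommodates", "bathrooms", "cancellation_policy",
             "cleaning_fee", "instant_bookable", "review_scores_rating",
             "bedrooms", "beds"]
data_full = pd.DataFrame(columns=full_cols)
data_numeric = data_full.drop(
    columns=["room_type", "cancellation_policy", "cleaning_fee", "instant_bookable"])

names = ["accommodates", "bathrooms", "bedrooms", "beds"]
ix_full = [data_full.columns.get_loc(c) for c in names]
ix_num = [data_numeric.columns.get_loc(c) for c in names]

print("Indices in the full frame:   ", ix_full)   # [1, 2, 7, 8]
print("Indices in the numeric frame:", ix_num)    # [0, 1, 3, 4]
```

With this ordering, positions 7 and 8 do not exist in a five-column numeric array, which is consistent with the error mentioned in the note.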

In [None]:
data_num_tr

In [None]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder())
])
data_cat_tr = cat_pipeline.fit_transform(data_cat)

In [None]:
data_cat_tr.toarray()

In [None]:
# Using `ColumnTransformer` to combine the previously defined pipelines into the final pipeline

from sklearn.compose import ColumnTransformer

In [None]:
num_attribs = list(data_num)
cat_attribs = list(data_cat)

In [None]:
full_pipeline = ColumnTransformer([
                ('num', num_pipeline, num_attribs),
                ('cat', cat_pipeline, cat_attribs),
])
data_prepared = full_pipeline.fit_transform(data)

In [None]:
data_prepared[0]
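
To see how the numeric and categorical branches are combined, here is a reduced sketch with one numeric and one categorical column. The frame `toy` is invented, the custom attribute adder is left out for brevity, and `sparse_threshold=0` is set only to force a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (invented values) with a missing numeric entry
toy = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 3.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("encoder", OneHotEncoder())])

# Numeric output columns come first, followed by the one-hot columns
prep = ColumnTransformer([("num", num_pipe, ["beds"]),
                          ("cat", cat_pipe, ["room_type"])],
                         sparse_threshold=0)
out = prep.fit_transform(toy)
print(out.shape)
print(out)
```

The output has one scaled numeric column plus one column per room type, in the order the transformers were listed.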

### Training Models

#### 1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(data_prepared, data_labels)

In [None]:
# Checking the values predicted on some data

some_data = data.iloc[:5]
some_labels = data_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('Predictions:', lin_reg.predict(some_data_prepared))
print('Labels:', list(some_labels))

In [None]:
# Finding the training-set RMSE for Linear Regression
from sklearn.metrics import mean_squared_error

lin_predictions = lin_reg.predict(data_prepared)
lin_mse = mean_squared_error(data_labels, lin_predictions)
lin_rmse = np.sqrt(lin_mse)
print("RMSE for Linear Regression:", lin_rmse)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
lin_scores = cross_val_score(lin_reg, data_prepared, data_labels, scoring='neg_mean_squared_error', cv=10)
lin_rmse = np.sqrt(-lin_scores)

In [None]:
print('Cross-Validation RMSE LR Scores: \n', lin_rmse, '\n')
print('Cross-Validation RMSE LR Scores Mean: \n', lin_rmse.mean(), '\n')
print('Cross-Validation RMSE LR Scores Std. Dev.: \n', lin_rmse.std())
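
The sign flip in `np.sqrt(-lin_scores)` is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns negative MSE values. A minimal sketch on synthetic data (not the Airbnb set) shows the conversion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# One negative MSE per fold
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=10)

# Negate to get the MSE, then take the square root to get the RMSE per fold
rmse = np.sqrt(-neg_mse)
print("Per-fold RMSE:", rmse)
print("Mean:", rmse.mean(), "Std. Dev.:", rmse.std())
```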

#### 2. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(data_prepared, data_labels)

In [None]:
# Checking the values predicted on some data

some_data = data.iloc[:5]
some_labels = data_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('Predictions:', tree_reg.predict(some_data_prepared))
print('Labels:', list(some_labels))

In [None]:
# Finding the training-set RMSE for Decision Tree

tree_predictions = tree_reg.predict(data_prepared)
tree_mse = mean_squared_error(data_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
print("RMSE for Decision Tree Regressor:", tree_rmse)

In [None]:
tree_scores = cross_val_score(tree_reg, data_prepared, data_labels, scoring='neg_mean_squared_error', cv=10)
tree_rmse = np.sqrt(-tree_scores)

In [None]:
print('Cross-Validation RMSE DT Scores: \n', tree_rmse, '\n')
print('Cross-Validation RMSE DT Scores Mean: \n', tree_rmse.mean(), '\n')
print('Cross-Validation RMSE DT Scores Std. Dev.: \n', tree_rmse.std())
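
An unconstrained decision tree can memorise its training rows, so its training-set RMSE tends to be optimistic; this is one reason to look at cross-validated scores like those above. The sketch below illustrates the gap on synthetic data and makes no claim about the numbers obtained on the Airbnb set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(random_state=42).fit(X, y)

# RMSE on the rows the tree was trained on
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Mean RMSE over 10 held-out folds
cv_scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-cv_scores).mean()

print("Training RMSE:", train_rmse)
print("Cross-validated RMSE:", cv_rmse)
```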

#### 3. Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(data_prepared, data_labels)

In [None]:
# Checking the values predicted on some data

some_data = data.iloc[:5]
some_labels = data_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('Predictions:', rf_reg.predict(some_data_prepared))
print('Labels:', list(some_labels))

In [None]:
# Finding the training-set RMSE for Random Forest

rf_predictions = rf_reg.predict(data_prepared)
rf_mse = mean_squared_error(data_labels, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
print("RMSE for Random Forest Regressor:", rf_rmse)

In [None]:
rf_scores = cross_val_score(rf_reg, data_prepared, data_labels, scoring='neg_mean_squared_error', cv=10)
rf_rmse = np.sqrt(-rf_scores)

In [None]:
print('Cross-Validation RMSE RF Scores: \n', rf_rmse, '\n')
print('Cross-Validation RMSE RF Scores Mean: \n', rf_rmse.mean(), '\n')
print('Cross-Validation RMSE RF Scores Std. Dev.: \n', rf_rmse.std())

### Evaluation with Test Dataset

In [None]:
final_model = lin_reg

In [None]:
X_test = test_set.drop('log_price', axis=1)
y_test = test_set['log_price'].copy()

In [None]:
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

In [None]:
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

In [None]:
confidence = 0.95
squared_error = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_error) - 1,
                        loc=squared_error.mean(),
                        scale=stats.sem(squared_error)))
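
The cell above builds a 95% confidence interval for the test RMSE: it computes a t-interval around the mean squared error and then takes the square root of both bounds. The same steps can be sketched on synthetic predictions (all values below are invented):

```python
import numpy as np
from scipy import stats

# Synthetic "true" log prices and noisy predictions, for illustration only
rng = np.random.default_rng(0)
y_true = rng.normal(loc=4.8, scale=0.7, size=500)
y_pred = y_true + rng.normal(scale=0.5, size=500)

squared_error = (y_pred - y_true) ** 2
rmse = np.sqrt(squared_error.mean())

# t-interval around the mean squared error, mapped to the RMSE scale
low, high = np.sqrt(stats.t.interval(0.95, len(squared_error) - 1,
                                     loc=squared_error.mean(),
                                     scale=stats.sem(squared_error)))
print(f"RMSE: {rmse:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

The point estimate always falls inside the interval, because the interval is centred on the mean squared error before the square root is applied.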

### Full Pipeline with Preparation & Prediction

In [None]:
start = 2384
end = 2395

some_data = data.iloc[start:end]
some_labels = data_labels.iloc[start:end]

In [None]:
full_pipeline_with_predictor = Pipeline([
    ('preparation', full_pipeline),
    ('predictor', LinearRegression())
])

full_pipeline_with_predictor.fit(data, data_labels)
full_pipeline_with_predictor.predict(some_data)

In [None]:
pred_df = pd.DataFrame(full_pipeline_with_predictor.predict(some_data),
                             columns=['Price_Prediction'], 
                             index=some_labels.index)
prediction_df = pd.concat([some_labels, pred_df], axis=1)
prediction_df

### Thanks !!

Enjoy the code, and feel free to contact :)