We have a dataset of $226$K rows describing the features of individual apartments (Airbnb listings). It contains both numerical and categorical features, and the target variable is the price. In the first part we visualize the dataset and gain insights about it. In the second part we do feature engineering. We then use XGBRegressor to build and train a model that predicts the price.

### Table of Contents

1.  [Quick Look at Data Structure](#1)

2.  [Analysis of numerical features](#2)

3.  [Analysis of categorical features](#3)

4. [Feature Engineering](#4)

5. [Build, Train and Evaluate Model](#5)

## 1. Quick Look at Data Structure

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data=pd.read_csv("../input/us-airbnb-open-data/AB_US_2020.csv",low_memory=False)

In [None]:
data.head()

In [None]:
data.info()

There are 226030 instances in the dataset. We observe missing values for the attributes neighbourhood_group, last_review and reviews_per_month.

In [None]:
A=['neighbourhood_group','last_review','reviews_per_month']

missing = data[A].isna().sum()/data.shape[0]*100

missing = missing.to_frame().rename(columns={0:'Percentage of missing values'})
missing

We will remove these features, together with the identifier columns, and try to build a regression model on the dataset without them.

In [None]:
data=data.drop(['id','name','host_id','host_name','neighbourhood_group',\
                'last_review', "reviews_per_month"], axis=1)

In [None]:
data.describe()

We can observe that several attributes have outliers. For instance, for the target variable price, $75\%$ of prices are below $201$ while the maximum price is $25$K. Notice that squared-error loss places much more emphasis on observations with large absolute residuals $|y_i - f(x_i)|$ during the fitting process. It is thus far less robust, and its performance severely degrades for long-tailed error distributions and especially for grossly mis-measured $y$-values (“outliers”). To avoid this problem we will remove the top $5\%$ of prices. Let us visualize the boxplot of price and look at the $0.95$-quantile.
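
To illustrate the sensitivity of squared-error loss, here is a minimal sketch on made-up prices (the numbers are hypothetical, not taken from the dataset). It compares the constant that minimises squared error (the mean) with the one that minimises absolute error (the median), and measures how much of the total squared error comes from a single extreme value.

```python
import numpy as np

# Hypothetical prices: nine ordinary listings and one extreme one.
prices = np.array([80, 95, 100, 110, 120, 130, 150, 170, 200, 25000], dtype=float)

# The constant minimising squared error is the mean;
# the constant minimising absolute error is the median.
mean_fit = prices.mean()
median_fit = np.median(prices)

# Share of the total squared error (around the mean) owed to the single outlier.
sq_err = (prices - mean_fit) ** 2
outlier_share = sq_err[-1] / sq_err.sum()

print(f"mean   = {mean_fit:.1f}")
print(f"median = {median_fit:.1f}")
print(f"outlier share of squared error = {outlier_share:.3f}")
```

The mean is pulled far away from the bulk of the prices while the median is not, which is the behaviour that motivates trimming the price tail before fitting.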

In [None]:
plt.figure(figsize=(20,2))
plt.title("Horizontal boxplot of price", size=18)
sns.boxplot(x="price", data=data, showfliers = False, showmeans=True, palette="Set2")
plt.show()

In [None]:
data['price'].quantile(.95)

In [None]:
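# Removing outliers: keep prices between the 0- and 0.95-quantiles
lower_bound = .0
upper_bound = .95
# inclusive="both" replaces the older inclusive=True, which recent pandas versions reject
data = data[data['price'].between(data['price'].quantile(lower_bound), \
            data['price'].quantile(upper_bound), inclusive="both")].reset_index(drop=True)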

## 2. Analysis of numerical features

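To better visualise the distribution of the numerical features, we first remove extreme values using hand-picked thresholds. This is done on a copy of the data that is used only for plotting.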

In [None]:
# Removing Outliers

iqr = data.copy()
iqr = iqr[iqr['calculated_host_listings_count'] < 10]
iqr = iqr[iqr['number_of_reviews'] < 200]
iqr = iqr[iqr['minimum_nights'] < 10]
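
The thresholds above are fixed by hand. As an alternative, a data-driven cut could use the common $1.5 \times$ IQR rule. The sketch below is only an illustration: `iqr_mask` is a hypothetical helper, and the toy column stands in for a feature such as number_of_reviews.

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return series.between(q1 - k * spread, q3 + k * spread)

# Hypothetical toy column with one extreme value.
toy = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 8, 400])
kept = toy[iqr_mask(toy)]
print(kept.tolist())
```

For heavily skewed count features this rule can be stricter or looser than the fixed thresholds, so the choice between the two remains a judgment call.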

In [None]:
numeric_ix =data.select_dtypes(include=['int64', 'float64']).columns

fig, axes = plt.subplots(nrows=2, ncols=4)
aux = 0
fig.set_figheight(17)
fig.set_figwidth(25)
for row in axes:
    for col in row:
        iqr[numeric_ix[aux]].plot(kind='kde',ax=col)
        col.set_title(numeric_ix[aux] +' Distribution',fontsize=16,fontweight='bold')
        aux+=1
        if aux==len(numeric_ix):
            break

## 3. Analysis of categorical features

In [None]:
plt.figure(figsize=(8,2))
sns.countplot(y="room_type", data=data)
plt.title("Counts for room types", size=15)
#plt.xlabel('count')
plt.show()

In [None]:
plt.figure(figsize=(20,6))
# showfliers is a boxplot option and is not a violinplot parameter, so it is dropped here
sns.violinplot(x="price", y="room_type", data=data)
plt.title("Distribution of price by room type", size=18)
plt.show()

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(y="neighbourhood", data=data, order=data.neighbourhood.value_counts().iloc[:20].index)
plt.title("Listing counts for the 20 most common neighbourhoods", size=18)
plt.show()

In [None]:
A=list(data.neighbourhood.value_counts().iloc[:20].index) # Top 20 neighbourhoods

plt.figure(figsize=(20,8))
sns.boxplot(x="price", y="neighbourhood", data=data.loc[data['neighbourhood'].isin(A)], \
            showfliers = False, palette="Set2")
plt.title("Boxplots of price for 20 most popular neighbourhoods", size=18)
plt.ylabel('')
plt.show()

We see that among the most popular neighbourhoods on Airbnb, Lahaina and Kihei-Makena are the most expensive ones.

In [None]:
#plt.figure(figsize=(20,2))
#plt.title("Horizontal boxplot of price", size=18)
#sns.boxplot(x="minimum_nights", data=data, showfliers = False, showmeans=True, palette="Set2")
#plt.show()

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(y="city", data=data, order=data.city.value_counts().index)
plt.title("Counts for cities ", size=18)
plt.show()

In [None]:
B=list(data.city.value_counts().iloc[:20].index) # Top 20 cities

plt.figure(figsize=(20,8))
sns.boxplot(x="price", y="city", data=data.loc[data['city'].isin(B)], \
            showfliers = False, palette="Set2")
plt.title('Boxplots of price for 20 most popular cities', size=18)
plt.ylabel('Cities')
plt.show()

Among the values of the city column (which mixes cities with wider regions), Hawaii and Rhode Island are the most expensive.

## 4. Feature Engineering

In [None]:
#Transforming categories of categorical features into numbers

numeric_ix=data.select_dtypes(include=['int64', 'float64']).columns.drop('price')

to_categorical_list = ['neighbourhood','room_type','city']
for i in to_categorical_list:
    data[i]=data[i].astype('category')
    
labelencoder = LabelEncoder()
for i in to_categorical_list:
    data[i] = labelencoder.fit_transform(data[i])
data.head()

In [None]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
x_train=train_set.drop(['price'], axis=1)
y_train=train_set['price']

x_test=test_set.drop(['price'], axis=1)
y_test=test_set['price']

In [None]:
# Choose a sample of the train set to search for the best XGBRegressor parameters in less time.
# Sampling without replacement avoids duplicated rows landing in different CV folds.

train_sample=train_set.sample(frac=0.4, replace=False, random_state=42)

In [None]:
x_train_sample=train_sample.drop(['price'], axis=1)
y_train_sample=train_sample['price']

In [None]:
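#One Hot Encoding

# handle_unknown='ignore' encodes categories unseen during fit as all zeros;
# sparse_threshold=0 makes the transformer always return a dense array
t = [('cat', OneHotEncoder(handle_unknown='ignore'), ['room_type','city'])]
col_transform = ColumnTransformer(transformers=t, remainder='passthrough', sparse_threshold=0)

# Fit the encoder on the training data only, then reuse it so all sets share the same columns
x_train = pd.DataFrame(col_transform.fit_transform(x_train))
x_test = pd.DataFrame(col_transform.transform(x_test))

x_train_sample = pd.DataFrame(col_transform.transform(x_train_sample))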
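
An encoder should learn its categories from the training rows only and then be reused for every other set, so that all sets end up with the same columns in the same order. The sketch below shows this on hypothetical toy frames (`toy_train`, `toy_test` and their values are made up for illustration), including what happens to a category that appears only in the test rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; "Boston" appears only in the test rows.
toy_train = pd.DataFrame({"city": ["NYC", "Austin", "NYC"], "minimum_nights": [1, 2, 3]})
toy_test = pd.DataFrame({"city": ["Austin", "Boston"], "minimum_nights": [4, 5]})

encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

train_encoded = encoder.fit_transform(toy_train)  # learn categories from train only
test_encoded = encoder.transform(toy_test)        # reuse them, no refit

print(train_encoded.shape, test_encoded.shape)
print(test_encoded)
```

Both arrays have the same number of columns, and the unseen category is encoded as a row of zeros in the one-hot block instead of creating a new column.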

In [None]:
# Standardizing all columns (one-hot columns included) with the training-set mean and std

mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std

x_test -= mean
x_test /= std

x_train_sample -= mean
x_train_sample /= std

## 5. Build, Train and Evaluate Model

In [None]:
import xgboost as xgb

In [None]:
booster = xgb.XGBRegressor()

In [None]:
from sklearn.model_selection import GridSearchCV

# create Grid
param_grid = {'n_estimators': [100, 200, 300],
              'learning_rate': [0.01, 0.05, 0.1], 
              'max_depth': [5, 7, 10],
              'colsample_bytree': [0.6, 0.7, 1],
              'gamma': [0.0, 0.1, 0.2]}

# instantiate the grid search over the XGBRegressor
booster_grid_search = GridSearchCV(booster, param_grid, cv=3, n_jobs=-1)

# run the grid search on the training sample
booster_grid_search.fit(x_train_sample, y_train_sample)

# print best estimator parameters found during the grid search
print(booster_grid_search.best_params_)
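
As a rough check of why the search runs on a sample, we can count how many models this grid implies. The sketch below repeats the grid from the cell above and uses scikit-learn's `ParameterGrid` to enumerate the combinations. The final refit on the best parameters is not counted.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {'n_estimators': [100, 200, 300],
              'learning_rate': [0.01, 0.05, 0.1],
              'max_depth': [5, 7, 10],
              'colsample_bytree': [0.6, 0.7, 1],
              'gamma': [0.0, 0.1, 0.2]}

n_candidates = len(ParameterGrid(param_grid))  # one candidate per combination
n_fits = n_candidates * 3                      # cv=3 folds per candidate

print(n_candidates, n_fits)
```

Five parameters with three values each give $3^5 = 243$ candidates, hence $729$ model fits with 3-fold cross-validation.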

In [None]:
# instantiate xgboost with the chosen parameters (note: n_estimators=500 lies outside the grid searched above)
booster = xgb.XGBRegressor(colsample_bytree=1, gamma=0.0, learning_rate=0.1, 
                           max_depth=10, n_estimators=500, random_state=4)

# train
booster.fit(x_train, y_train)

# predict
y_pred_train = booster.predict(x_train)
y_pred_test = booster.predict(x_test)

In [None]:
RMSE = np.sqrt(mean_squared_error(y_test, y_pred_test))
print(f"RMSE: {round(RMSE, 4)}")

In [None]:
MAE = mean_absolute_error(y_test, y_pred_test)
print(f"MAE: {round(MAE, 4)}")
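
For reference, the two metrics are $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - f(x_i))^2}$ and $\mathrm{MAE} = \frac{1}{n}\sum_i |y_i - f(x_i)|$. The sketch below computes both by hand on hypothetical values (not results from this model) and adds a constant median prediction as a naive baseline, which is one simple way to judge whether an error level is good.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # hypothetical prices
y_hat = np.array([110.0, 140.0, 230.0, 250.0])   # hypothetical predictions

errors = y_hat - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # penalises the single large error more
mae = np.mean(np.abs(errors))

# Naive baseline: always predict the median price.
baseline = np.full_like(y_true, np.median(y_true))
baseline_mae = np.mean(np.abs(baseline - y_true))

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, baseline MAE = {baseline_mae:.3f}")
```

RMSE is never smaller than MAE, and the gap between the two grows when a few errors are much larger than the rest.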

In [None]:
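plt.figure(figsize=(20,8))
# sns.residplot would fit its own linear regression, so plot the model's errors directly
# (predicted minus actual, first 1000 test rows) against the predictions
plt.scatter(y_pred_test[:1000], y_pred_test[:1000] - y_test.values[:1000], alpha=0.5)
plt.axhline(0, color='k', linewidth=1)
plt.title('Prediction errors (predicted minus actual) of the XGBoost model', size=18)
plt.show()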

In [None]:
d=y_pred_test-y_test

plt.figure(figsize=(20,8))
plt.hist(d, bins=100)
plt.title('Histogram of prediction errors (predicted minus actual) of the XGBoost model', size=18)
plt.show()