# Predicting New York City rental prices from Airbnb

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
print(tf.__version__)

We're pulling the data straight from the Kaggle website. We could also download the zip archive, but we're going to use the csv file directly. Note: the -O flag (capital O) sets the output filename; a lowercase -o would write wget's log to that file instead. Mine is airbnb.csv. You can of course use any name you'd like, just remember to pass the same name to pandas' read_csv function. If Kaggle asks for a login and the downloaded file is not a valid csv, download AB_NYC_2019.csv manually and save it under that name.

In [None]:
!wget -O airbnb.csv "https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv"

In [None]:
!ls

In [None]:
path='airbnb.csv'
df=pd.read_csv(path)
df.head()

The data has _ columns and _ rows. Pandas' DataFrame object has a lot of built-in options for managing and analysing data, as you'll see.

In [None]:
df.shape

Seaborn has a great plotting API. Let's use it to plot the correlation matrix. Adjust the figsize argument to change the size of the canvas.

In [None]:
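# numeric_only=True skips the text columns, which newer pandas versions no longer drop silently
corr = df.corr(method='kendall', numeric_only=True)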
plt.figure(figsize=(8,8))
sns.heatmap(corr, annot=True)


Next, we'll remove the duplicates and deal with missing values in the column 'reviews_per_month'.

In [None]:
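# Print the number of duplicated rows; a bare expression is only shown when it is the last line of a cell
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
# Replace missing reviews_per_month values with 0
df['reviews_per_month']=df['reviews_per_month'].fillna(0)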
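
To see what this cleaning step does, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one exact duplicate row and one missing value
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 5],
    "reviews_per_month": [np.nan, 0.4, 0.4],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
toy = toy.drop_duplicates()
# Fill the missing rate with 0, as the notebook does
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)
print(n_duplicates, toy.shape)
```

Filling with 0 is a modelling choice: it assumes a missing rate means "no reviews", which seems reasonable here but is worth checking against number_of_reviews.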

Let's investigate the column types we're dealing with here.

Dropping unnecessary columns will make it easier to train our model.
When I say unnecessary, I mean columns that have low correlation with the value we want to predict (the price).
Let's take a look at the types again after dropping the columns.

In [None]:
df.dtypes

In [None]:
columns=['id','host_id','name','host_name','last_review','calculated_host_listings_count']
df=df.drop(columns,axis=1)
df.isnull().sum()

In [None]:
df.dtypes

Using a countplot, we get a visual overview of which neighbourhood group is the most popular for renting via Airbnb.

In [None]:
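# Count listings per neighbourhood group; newer seaborn versions expect the data as a keyword argument
sns.countplot(x=df['neighbourhood_group'], palette="plasma")
fig = plt.gcf()
fig.set_size_inches(10,6)
plt.title('Neighbourhood group')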

In [None]:
plot_dims=(12,8)
plt.figure(figsize=plot_dims)
# Newer seaborn versions require x and y as keyword arguments
sns.scatterplot(x=df.longitude,y=df.latitude,hue=df.neighbourhood_group)
plt.show()

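Next, we'll encode the categorical columns as integers for training. Computers deal with numbers, right? Note that pd.factorize assigns one integer code per category (label encoding); it does not create one-hot columns.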

In [None]:
df['neighbourhood_group']=pd.factorize(df.neighbourhood_group)[0]
df['neighbourhood']=pd.factorize(df.neighbourhood)[0]
df['room_type']=pd.factorize(df.room_type)[0]

df.head()
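
The difference between integer codes and one-hot columns is easy to see on a tiny example. This is a sketch on a made-up Series, not on the notebook's dataframe.

```python
import pandas as pd

# Hypothetical sample of room types
room_type = pd.Series(["Private room", "Entire home/apt", "Private room", "Shared room"])

# pd.factorize: one integer code per category, in order of first appearance
codes, categories = pd.factorize(room_type)

# pd.get_dummies: one indicator column per category (one-hot encoding)
one_hot = pd.get_dummies(room_type)

print(list(codes))
print(list(categories))
print(one_hot.shape)
```

Integer codes are compact, but they suggest an ordering between categories that does not exist; one-hot columns avoid that at the cost of more columns.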

In [None]:
# Standardizing the availability column so that its large values don't blow up the loss while training.
# After this step the column's mean is 0 and its standard deviation is 1.

availability=df['availability_365']
# Write the result back to the dataframe; otherwise the standardized values are never used
df['availability_365']=(availability-availability.mean())/availability.std()
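
As a quick check of the standardization formula, here is a small sketch on invented availability values.

```python
import pandas as pd

# Made-up availability values in days per year
availability = pd.Series([0, 90, 180, 365], dtype="float64")

# z-score: subtract the mean, divide by the standard deviation
standardized = (availability - availability.mean()) / availability.std()
print(standardized.mean(), standardized.std())
```

Note that pandas' std() uses the sample standard deviation (ddof=1), while sklearn's StandardScaler uses the population standard deviation, so the two give slightly different scaled values.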


In [None]:
df.shape

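Here I create separate datasets for features and labels. Note that X must not contain the price column. Remember we are trying to predict the price based on the training data.

Dropping the column by name keeps X and Y at the same number of rows, so no truncation is needed.

In [None]:
Y=df['price']
# Drop the 'price' column by name; X.drop(X['price']) would instead drop rows whose index labels match price values
X=df.drop(columns=['price'])

X.shape,Y.shape

In [None]:
# Sanity check: features and labels must have the same number of rows
assert X.shape[0]==Y.shape[0]
X.shape,Y.shape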
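
DataFrame.drop interprets its first argument as row labels by default, which matters when separating the label column from the features. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with default index 0, 1, 2
toy = pd.DataFrame({"price": [2, 0, 150], "minimum_nights": [1, 3, 2]})

# Passing a Series treats its values as row labels: rows 2 and 0 are removed,
# 150 is not a label and is skipped because of errors="ignore"
rows_dropped = toy.drop(toy["price"], errors="ignore")

# Dropping by column name removes the label column and keeps every row
features = toy.drop(columns=["price"])

print(rows_dropped.shape, features.shape)
```

In the first call the price column is still present and rows have silently disappeared, which is exactly the kind of shape mismatch that shows up later when features and labels are compared.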

Next up are the feature crosses.
The point is to merge two columns into one feature, so that its values represent the combination of both. Our goal here is to feature cross longitude and latitude, which is one of the oldest tricks in the book. If we fed the two raw columns to the model, it would treat each of them as if the price rose or fell steadily with the coordinate value.

Instead, we'll be using a feature cross, meaning we will split the longitude x latitude map into a grid.
Quite a delicate little problem. Lucky for us, TensorFlow makes it easy.


I'm making a grid of equally spaced cells by stepping from the minimum to the maximum value with a step of (max-min)/100.

I'm using a 100x100 grid.


In [None]:
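max_long=df['longitude'].max()
min_long=df['longitude'].min()

# Width of one grid cell along the longitude axis
diff=max_long-min_long
diff/=100

# 99 interior boundaries split the range into 100 equal-width buckets
long_boundaries=[]
for i in range(1,100):
    long_boundaries.append(float(min_long+i*diff))


max_lat=df['latitude'].max()
min_lat=df['latitude'].min()

# Width of one grid cell along the latitude axis
d=max_lat-min_lat
d/=100

lat_boundaries=[]
for i in range(1,100):
    # Offset from min_lat (not min_long) so the boundaries stay inside the latitude range
    lat_boundaries.append(float(min_lat+i*d))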
    

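Essentially, what we're doing here is defining a bucketized column for each coordinate with the boundaries defined earlier, crossing the two, and creating a DenseFeatures layer, which will be passed to the Sequential API later.


If you're not familiar with the TensorFlow syntax, do check the docs. Note that the tf.feature_column API is deprecated in recent TensorFlow releases, so this code assumes a version that still ships it.
https://www.tensorflow.org/api_docs/python/tf/feature_column/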

In [None]:
long_marked=tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'), boundaries=long_boundaries
)

lat_marked=tf.feature_column.bucketized_column(
   tf.feature_column.numeric_column('latitude'),boundaries=lat_boundaries
)


crossed_feature=tf.feature_column.crossed_column([long_marked,lat_marked],hash_bucket_size=100)
feature_layer=tf.keras.layers.DenseFeatures(tf.feature_column.indicator_column(crossed_feature))
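
To make bucketizing and crossing more concrete, here is a plain NumPy sketch of the idea. It is an illustration only: the bounding box and the grid_cell helper are hypothetical, and TensorFlow's crossed_column hashes bucket pairs into hash_bucket_size slots rather than enumerating cells like this.

```python
import numpy as np

def grid_cell(longitude, latitude, long_edges, lat_edges):
    """Map each point to one cell id on a (len(long_edges)+1) x (len(lat_edges)+1) grid."""
    long_bucket = np.digitize(longitude, long_edges)  # bucket index along longitude
    lat_bucket = np.digitize(latitude, lat_edges)     # bucket index along latitude
    # Combine the two bucket indices into a single categorical value
    return long_bucket * (len(lat_edges) + 1) + lat_bucket

# Hypothetical bounding box split into a 4x4 grid (3 interior edges per axis)
long_edges = np.linspace(-74.2, -73.8, 5)[1:-1]
lat_edges = np.linspace(40.5, 40.9, 5)[1:-1]

cells = grid_cell(np.array([-74.15, -73.85]), np.array([40.55, 40.85]), long_edges, lat_edges)
print(cells.tolist())
```

Each cell id behaves like a category, so the model can learn a separate effect per map cell instead of one slope per coordinate. With a 100x100 grid there are 10,000 possible cells, so a small hash_bucket_size means many cells share a slot.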


In the following sections, we'll finally prepare the data for training using sklearn's train_test_split function and the StandardScaler class.


In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,shuffle=True,random_state=0)


In [None]:
x_train.shape,x_test.shape

In [None]:
y_train.shape,y_test.shape

In [None]:
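from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# Leave the coordinates unscaled: the bucket boundaries were computed on raw longitude/latitude
scale_cols=[c for c in x_train.columns if c not in ('longitude','latitude')]
x_train,x_test=x_train.copy(),x_test.copy()
# Fit on the training split only, then reuse the same statistics for the test split
x_train[scale_cols]=scaler.fit_transform(x_train[scale_cols])
x_test[scale_cols]=scaler.transform(x_test[scale_cols])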
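
The scaler is fitted on the training split and only applied to the test split, so no test-set statistics leak into training. A minimal sketch with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, test_scaled)
```

The test value ends up well above zero because it is measured against the training mean, which is the behaviour we want when simulating unseen data.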

Again, sklearn makes it super easy to define these otherwise hard architectures.
First, linear regression, which may not work well because the price is unlikely to depend linearly on these features.

Then, the well-known support vector machine.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linreg=LinearRegression()
linreg.fit(x_train,y_train)
y_pred=linreg.predict(x_test)
# Store the score under its own name so the r2_score function is not overwritten
linreg_r2=r2_score(y_test,y_pred)
linreg_r2

In [None]:
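from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Price is a continuous target, so use the regression variant (SVR) rather than the classifier (SVC)
svr=SVR(kernel='linear')
svr.fit(x_train,y_train)
# Score the SVM's own predictions instead of reusing the linear regression's
svr_pred=svr.predict(x_test)
svr_r2=r2_score(y_test,svr_pred)
svr_r2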
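
Both models are scored with R². As a reminder of what that number measures, here is a hand-written version of the usual definition, 1 - SS_res / SS_tot, on invented prices. The r_squared helper is hypothetical and only meant for illustration; in the notebook, sklearn's r2_score does this work.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

prices = [100, 150, 200]                        # made-up true prices
perfect = r_squared(prices, prices)             # predictions match exactly
baseline = r_squared(prices, [150, 150, 150])   # always predicting the mean
print(perfect, baseline)
```

A score near 0 therefore means the model is no better than always predicting the mean price, and a negative score means it is worse.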

Finally, the creation of the Keras Sequential model.

We're compiling the model using the Adam optimizer, MSE loss and two metrics (MAE and RMSE). Keep track of these while the model trains.

In [None]:
from tensorflow.keras.layers import Dense, Dropout
tf.keras.backend.set_floatx('float32')
model = tf.keras.Sequential([
    # feature_layer is already a DenseFeatures layer, so it is used directly as the first layer
    feature_layer,
    Dense(128,activation='relu'),
    Dense(64,activation='relu'),
    Dropout(0.3),
    Dense(32,activation='relu'),
    Dense(1)
])

opt=tf.keras.optimizers.Adam(learning_rate=0.05)
rmse=tf.keras.metrics.RootMeanSquaredError()
model.compile(optimizer=opt,loss='mean_squared_error',metrics=['mae',rmse])


In [None]:
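# Without a fixed input shape the model is only built on its first call, so summary() is available after training
if model.built:
    model.summary()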

Additionally, we are using two callbacks:

EarlyStopping, which stops training once the monitored metric has not improved for a number of epochs (the patience). Check the docs:
   - https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping
    
ReduceLROnPlateau, which reduces the learning rate when the monitored metric stops improving.
  - https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau

In [None]:
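lr_reducer=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',patience=2,factor=0.2)
early_stopper=tf.keras.callbacks.EarlyStopping(monitor='val_loss',patience=5)
callbacks=[lr_reducer,early_stopper]

# DenseFeatures looks features up by name, so pass each split as a dict of column name -> array
def to_feature_dict(x):
    frame=pd.DataFrame(x,columns=X.columns)
    return {name:frame[name].to_numpy() for name in frame.columns}

history=model.fit(to_feature_dict(x_train),y_train,validation_data=(to_feature_dict(x_test),y_test), callbacks=callbacks,epochs=50,batch_size=64,verbose=2)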
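
The interplay of the two patience values can be sketched in plain Python. This is a simplified model of the behaviour, not the Keras implementation (it ignores min_delta, cooldown and min_lr), and simulate_callbacks is a hypothetical helper.

```python
def simulate_callbacks(val_losses, lr=0.05, lr_patience=2, factor=0.2, stop_patience=5):
    """Return (epochs_run, final_lr) for a sequence of validation losses."""
    best = float("inf")
    since_best = 0      # epochs since the best loss, used for early stopping
    since_lr_drop = 0   # epochs without improvement since the last learning rate change
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_drop = loss, 0, 0
        else:
            since_best += 1
            since_lr_drop += 1
            if since_lr_drop >= lr_patience:
                lr *= factor          # shrink the learning rate on a plateau
                since_lr_drop = 0
            if since_best >= stop_patience:
                return epoch, lr      # stop training early
    return len(val_losses), lr

# Made-up validation losses: the best value is reached in epoch 2
epochs_run, final_lr = simulate_callbacks([10, 9, 9.5, 9.4, 9.3, 9.2, 9.1, 9.05])
print(epochs_run, final_lr)
```

Because the learning rate patience (2) is shorter than the stopping patience (5), the optimizer gets a couple of chances at a smaller step size before training is cut off.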

In [None]:
history_df=pd.DataFrame(history.history)
history_df.head()

In [None]:
history_df.plot(x='loss',y='mae')
plt.xlabel('Loss')
plt.ylabel('Mae')
plt.title("Model performance")
plt.show()