# Automated Valuation Model (AVM)
## Project - Building an AVM model using New York City Airbnb Data
In this notebook, I explore the [New York City Airbnb Open Dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) from Kaggle and build a model that predicts listing prices.

This project requires the following libraries:
- [Pandas](https://pandas.pydata.org)
- [Numpy](https://numpy.org)
- [Keras](https://keras.io/)
- [Sklearn](https://scikit-learn.org/)

**Please make sure the libraries listed above are installed before continuing.**

### Importing the Necessary libraries

In [1]:
from math import sqrt

import numpy as np
import pandas as pd
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Activation, Flatten, LeakyReLU
from keras.models import Sequential
from sklearn import preprocessing
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


### Loading Data
Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. We will use it throughout the project to inspect and modify the dataset as needed.

In [2]:
# Loading Dataset
Data = pd.read_csv(r'C:\Users\sharm\OneDrive\Desktop\ML project\new-york-city-airbnb-open-data\AB_NYC_2019.csv')

# Checking the number of training examples in Dataset
print('Size of Dataset:', len(Data))

#looking over the dataset
Data[:5]

Size of Dataset: 48895


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


## Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

We will check and resolve the following before feeding the data into the model:
- Check for null values and, if there are any, fill them with zeroes or with the average value of that feature.
- Drop features that are not useful for learning (for example `id`, which is unique for every data point).
- Check the datatype of every feature and **encode** any that is not *int* or *float*, because the model can only learn from numeric values.
- Split the dataset into features and labels.
- Normalize the dataset. Normalization rescales the numeric columns to a common scale without distorting differences in the ranges of values or losing information. For more information on normalization see [this](https://www.quora.com/Why-do-we-normalize-the-data).
- Split the dataset into training and testing subsets.

In [3]:
# Checking whether any null values are present
Data.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [4]:
# Filling the empty values with the mean of that feature
Data.fillna({'reviews_per_month':Data['reviews_per_month'].mean()}, inplace=True)
# Examining the change
Data.reviews_per_month.isnull().sum()

0
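
The preprocessing list mentions two ways to fill missing values: zeroes or the feature mean. The sketch below compares them on a made-up three-row frame (the name `toy` and its values are assumptions for illustration, not the Airbnb data).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry
toy = pd.DataFrame({'reviews_per_month': [1.0, np.nan, 3.0]})

# Option 1: fill with the column mean; pandas skips NaN when averaging
mean_filled = toy['reviews_per_month'].fillna(toy['reviews_per_month'].mean())

# Option 2: fill with zero, which may suit listings that have no reviews at all
zero_filled = toy['reviews_per_month'].fillna(0)

print(mean_filled.tolist())  # [1.0, 2.0, 3.0]
print(zero_filled.tolist())  # [1.0, 0.0, 3.0]
```

Which option is better depends on what a missing value means. Here `last_review` and `reviews_per_month` are missing for the same number of rows (10052), which suggests these listings simply have no reviews, so zero could also be a reasonable choice.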

In [5]:
# Dropping the unnecessary features
Data.drop(['id','host_name','last_review','name', 'host_id'], axis=1, inplace=True)

Next, the categorical features are encoded with label encoding. For details see [this example](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn-preprocessing-labelencoder).

In [6]:
# Encoding 'neighbourhood_group' feature
E = preprocessing.LabelEncoder()
E.fit(Data['neighbourhood_group'])
encoded = E.transform(Data['neighbourhood_group'])
Data.drop(['neighbourhood_group'], axis=1, inplace=True)
Data['neighbourhood_group'] = encoded

In [7]:
# Encoding 'room_type' feature
E.fit(Data['room_type'])
encoded = E.transform(Data['room_type'])
Data.drop(['room_type'], axis=1, inplace=True)
Data['room_type'] = encoded

In [8]:
# Encoding 'neighbourhood' feature
E.fit(Data['neighbourhood'])
encoded = E.transform(Data['neighbourhood'])
Data.drop(['neighbourhood'], axis=1, inplace=True)
Data['neighbourhood'] = encoded
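
The three encoding cells above repeat the same pattern, so they could also be written as one loop. This is only a sketch on a small made-up frame (the `toy` frame and the `encoders` dictionary are hypothetical names). Unlike the cells above, it overwrites each column in place instead of moving it to the end of the frame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical rows) standing in for the Airbnb data
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Queens', 'Brooklyn'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room', 'Shared room'],
})

encoders = {}
for column in ['neighbourhood_group', 'room_type']:
    encoder = LabelEncoder()
    # Classes are sorted, then mapped to the integers 0..n_classes-1
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # kept so the codes can be decoded later

print(toy['neighbourhood_group'].tolist())  # [0, 1, 2, 0]
print(list(encoders['room_type'].classes_))
```

Keeping one fitted encoder per column makes it possible to call `inverse_transform` later and map the integer codes back to the original names.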

In [9]:
# Checking the mean to know if there is need to normalize the Dataset 
Data.mean(axis = 0)

latitude                           40.728949
longitude                         -73.952170
price                             152.720687
minimum_nights                      7.029962
number_of_reviews                  23.274466
reviews_per_month                   1.373221
calculated_host_listings_count      7.143982
availability_365                  112.781327
neighbourhood_group                 1.675345
room_type                           0.504060
neighbourhood                     107.122732
dtype: float64

### Normalizing Dataset
We will normalize the features using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn-preprocessing-standardscaler) from `sklearn`. The standard score of a sample `x` is calculated as `z = (x - u) / s`, where `x` is the feature value, `u` is the mean of the feature and `s` is its standard deviation. For a StandardScaler example see [this](http://benalexkeen.com/feature-scaling-with-scikit-learn/).
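
To make the formula concrete, here is a small sketch that applies `z = (x - u) / s` by hand to a made-up single-feature column and compares the result with `StandardScaler`. The values in `x` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature column (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

u = x.mean(axis=0)  # mean of the feature
s = x.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x - u) / s

z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
```

After scaling, the column has mean 0 and standard deviation 1, which is why the scaled features all end up on a comparable range.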

In [10]:
# Dividing the dataset into features and labels
prices = Data['price']
features = Data.drop('price', axis = 1)

# Normalizing the Features
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(features)
# Reuse the feature column names instead of listing them by hand
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)

### Splitting the Dataset into Training and Testing Subsets
The data is usually split into training data and test data. The training set contains known outputs, and the model learns from it so that it can generalize to other data later on. The test set (or subset) is held out so we can check the model’s predictions on data it has not seen during training.

In [11]:
# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(scaled_df , prices , test_size=0.05 , random_state=0)
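
In this notebook the scaler is fitted on the full dataset before the split. A common alternative, shown here only as a sketch on random made-up data, is to split first and fit the scaler on the training rows alone, so that no statistics from the test rows reach the model. All names and values below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
y = rng.normal(loc=150.0, scale=40.0, size=200)      # hypothetical prices

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.05, random_state=0)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training mean and std
```

With a dataset this large the difference in the fitted mean and standard deviation is likely to be small, but splitting first keeps the test set fully unseen.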

In [12]:
# Visualizing Dataset Before feeding into Model
scaled_df[:5]

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | neighbourhood_group | room_type | neighbourhood |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.493849 | -0.437652 | -0.293996 | -0.320414 | -0.776641 | -0.034716 | 1.91625 | -0.917828 | 0.909359 | 0.012762 |
| 1 | 0.452436 | -0.684639 | -0.293996 | 0.487665 | -0.663138 | -0.156104 | 1.840275 | 0.441222 | -0.924247 | 0.289156 |
| 2 | 1.468399 | 0.222497 | -0.196484 | -0.522433 | 0.0 | -0.186451 | 1.91625 | 0.441222 | 0.909359 | -0.190897 |
| 3 | -0.803398 | -0.16445 | -0.293996 | 5.538156 | 2.18111 | -0.186451 | 0.617065 | -0.917828 | -0.924247 | -0.961892 |
| 4 | 1.27566 | 0.177216 | 0.144807 | -0.320414 | -0.850084 | -0.186451 | -0.856865 | 0.441222 | -0.924247 | -0.67095 |


## Defining our own Network Architecture
Here comes the main part of this project. All the cool and magical stuff happens here. We will use a deep neural network. A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN learns the mathematical transformation that turns the input into the output, whether the relationship is linear or non-linear. Since this is a regression problem, the network passes each input through its layers and outputs a single predicted price. For more details see [this](https://www.techopedia.com/definition/32902/deep-neural-network).


We will also need activation functions. Activation functions introduce non-linearity into the model; for more details see [this](https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f). The first layer uses plain ReLU, and the hidden layers use [Leaky ReLU](https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7) as their activation function. The network has six dense layers.
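
As a rough illustration of what Leaky ReLU computes, the NumPy sketch below passes positive inputs through unchanged and multiplies negative inputs by a small slope `alpha`. The helper `leaky_relu` is a hypothetical stand-in written for this example, not the Keras layer itself; `alpha=0.05` matches the value used in the model below.

```python
import numpy as np

def leaky_relu(x, alpha=0.05):
    """Return x where x > 0 and alpha * x elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

# Negatives shrink by alpha, positives pass through unchanged
print(leaky_relu([-2.0, 0.0, 3.0]))
```

Because the negative side keeps a small non-zero slope, units that receive negative inputs still get a gradient, which is the usual motivation for preferring Leaky ReLU over plain ReLU.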

In [14]:
model = Sequential()
# The Input Layer :
model.add(Dense(32, kernel_initializer='normal',input_dim = X_train.shape[1], activation='relu'))

# The Hidden Layers :
model.add(Dense(64, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.05))
model.add(Dense(128, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.05))
model.add(Dense(128, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.05))
model.add(Dense(64, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.05))
# The Output Layer :
model.add(Dense(1, kernel_initializer='normal',activation='linear'))

# Compile the network :
model.compile(loss='mean_absolute_error', optimizer='adam')
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                352       
_________________________________________________________________
dense_2 (Dense)              (None, 64)                2112      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               16512     
________________________________________________
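
The `Param #` column can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 10 input features (the number of columns in `scaled_df`) and uses a hypothetical helper `dense_params`. The last two values are what the formula gives for the remaining layers; they are not read from the summary, which is cut off above.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the width of each Dense layer in the model
layer_sizes = [10, 32, 64, 128, 128, 64, 1]
counts = [dense_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [352, 2112, 8320, 16512, 8256, 65]
print(sum(counts))  # 35617
```

The first four values match the rows shown in the summary (352, 2112, 8320, 16512). The `LeakyReLU` layers contribute no parameters.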

## Defining [Loss Function](https://keras.io/losses/) and [Optimizer](https://keras.io/optimizers/) and Training the model
We will be using `mean_absolute_error` as our **loss** function and `adam` as our **optimizer**.

In [15]:
# Creating a checkpoint every time a new best validation loss is encountered
checkpoint_name = 'Weights.hdf5' 
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose = 1, save_best_only = True, mode ='auto')
callbacks_list = [checkpoint]
model.fit(X_train, y_train, epochs=250, batch_size=5, validation_split = 0.2, callbacks=callbacks_list)

Train on 37160 samples, validate on 9290 samples
Epoch 1/250

Epoch 00001: val_loss improved from inf to 59.95170, saving model to Weights.hdf5
Epoch 2/250

Epoch 00002: val_loss did not improve from 59.95170
Epoch 3/250

Epoch 00003: val_loss improved from 59.95170 to 57.76789, saving model to Weights.hdf5
Epoch 4/250

Epoch 00004: val_loss improved from 57.76789 to 57.67282, saving model to Weights.hdf5
Epoch 5/250

Epoch 00005: val_loss did not improve from 57.67282
Epoch 6/250

Epoch 00006: val_loss improved from 57.67282 to 57.52375, saving model to Weights.hdf5
Epoch 7/250

Epoch 00007: val_loss did not improve from 57.52375
Epoch 8/250

Epoch 00008: val_loss did not improve from 57.52375
Epoch 9/250

Epoch 00009: val_loss improved from 57.52375 to 57.48777, saving model to Weights.hdf5
Epoch 10/250

Epoch 00010: val_loss improved from 57.48777 to 57.33685, saving model to Weights.hdf5
Epoch 11/250

Epoch 00011: val_loss improved fr


Epoch 00047: val_loss did not improve from 56.45679
Epoch 48/250

Epoch 00048: val_loss did not improve from 56.45679
Epoch 49/250

Epoch 00049: val_loss did not improve from 56.45679
Epoch 50/250

Epoch 00050: val_loss did not improve from 56.45679
Epoch 51/250

Epoch 00051: val_loss did not improve from 56.45679
Epoch 52/250

Epoch 00052: val_loss did not improve from 56.45679
Epoch 53/250

Epoch 00053: val_loss improved from 56.45679 to 56.17592, saving model to Weights.hdf5
Epoch 54/250

Epoch 00054: val_loss did not improve from 56.17592
Epoch 55/250

Epoch 00055: val_loss did not improve from 56.17592
Epoch 56/250

Epoch 00056: val_loss did not improve from 56.17592
Epoch 57/250

Epoch 00057: val_loss did not improve from 56.17592
Epoch 58/250

Epoch 00058: val_loss did not improve from 56.17592
Epoch 59/250

Epoch 00059: val_loss did not improve from 56.17592
Epoch 60/250

Epoch 00060: val_loss did not improve from 56.17592
Epoch 61/250

Epoch 00061: val_loss did not improve fr


Epoch 00096: val_loss did not improve from 55.58209
Epoch 97/250

Epoch 00097: val_loss did not improve from 55.58209
Epoch 98/250

Epoch 00098: val_loss did not improve from 55.58209
Epoch 99/250

Epoch 00099: val_loss did not improve from 55.58209
Epoch 100/250

Epoch 00100: val_loss did not improve from 55.58209
Epoch 101/250

Epoch 00101: val_loss did not improve from 55.58209
Epoch 102/250

Epoch 00102: val_loss did not improve from 55.58209
Epoch 103/250

Epoch 00103: val_loss did not improve from 55.58209
Epoch 104/250

Epoch 00104: val_loss did not improve from 55.58209
Epoch 105/250

Epoch 00105: val_loss did not improve from 55.58209
Epoch 106/250

Epoch 00106: val_loss improved from 55.58209 to 55.53545, saving model to Weights.hdf5
Epoch 107/250

Epoch 00107: val_loss did not improve from 55.53545
Epoch 108/250

Epoch 00108: val_loss did not improve from 55.53545
Epoch 109/250

Epoch 00109: val_loss did not improve from 55.53545
Epoch 110/250

Epoch 00110: val_loss did not


Epoch 00146: val_loss did not improve from 55.53545
Epoch 147/250

Epoch 00147: val_loss did not improve from 55.53545
Epoch 148/250

Epoch 00148: val_loss did not improve from 55.53545
Epoch 149/250

Epoch 00149: val_loss did not improve from 55.53545
Epoch 150/250

Epoch 00150: val_loss did not improve from 55.53545
Epoch 151/250

Epoch 00151: val_loss did not improve from 55.53545
Epoch 152/250

Epoch 00152: val_loss did not improve from 55.53545
Epoch 153/250

Epoch 00153: val_loss did not improve from 55.53545
Epoch 154/250

Epoch 00154: val_loss did not improve from 55.53545
Epoch 155/250

Epoch 00155: val_loss did not improve from 55.53545
Epoch 156/250

Epoch 00156: val_loss did not improve from 55.53545
Epoch 157/250

Epoch 00157: val_loss did not improve from 55.53545
Epoch 158/250

Epoch 00158: val_loss did not improve from 55.53545
Epoch 159/250

Epoch 00159: val_loss did not improve from 55.53545
Epoch 160/250

Epoch 00160: val_loss did not improve from 55.53545
Epoch 161


Epoch 00196: val_loss did not improve from 55.53545
Epoch 197/250

Epoch 00197: val_loss did not improve from 55.53545
Epoch 198/250

Epoch 00198: val_loss did not improve from 55.53545
Epoch 199/250

Epoch 00199: val_loss did not improve from 55.53545
Epoch 200/250

Epoch 00200: val_loss did not improve from 55.53545
Epoch 201/250

Epoch 00201: val_loss did not improve from 55.53545
Epoch 202/250

Epoch 00202: val_loss did not improve from 55.53545
Epoch 203/250

Epoch 00203: val_loss did not improve from 55.53545
Epoch 204/250

Epoch 00204: val_loss did not improve from 55.53545
Epoch 205/250

Epoch 00205: val_loss did not improve from 55.53545
Epoch 206/250

Epoch 00206: val_loss did not improve from 55.53545
Epoch 207/250

Epoch 00207: val_loss did not improve from 55.53545
Epoch 208/250

Epoch 00208: val_loss did not improve from 55.53545
Epoch 209/250

Epoch 00209: val_loss did not improve from 55.53545
Epoch 210/250

Epoch 00210: val_loss did not improve from 55.53545
Epoch 211


Epoch 00246: val_loss did not improve from 55.53545
Epoch 247/250

Epoch 00247: val_loss did not improve from 55.53545
Epoch 248/250

Epoch 00248: val_loss did not improve from 55.53545
Epoch 249/250

Epoch 00249: val_loss did not improve from 55.53545
Epoch 250/250

Epoch 00250: val_loss did not improve from 55.53545


<keras.callbacks.callbacks.History at 0x24d9ca793c8>
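
The log above shows that weights are written only when `val_loss` improves. The rule behind `save_best_only = True` can be sketched in plain Python as below. The function `checkpoint_epochs` and the loss values are hypothetical and are not taken from the run above.

```python
def checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would be written."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than every earlier epoch
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses, not the values from the run above
losses = [60.0, 61.2, 57.8, 57.7, 58.1, 57.5]
print(checkpoint_epochs(losses))  # [1, 3, 4, 6]
```

In the run above the last saved improvement is at epoch 106, so the remaining epochs did not change the saved weights. Stopping training once the validation loss has stalled for a while might save time here.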

In [16]:
weights_file = 'Weights.hdf5'
# Loading the weights with the lowest validation loss
model.load_weights(weights_file)
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])

## Testing
We now predict on the test set and display one sample from the original data next to the model's prediction for it. We also report the Root Mean Squared Error (**RMSE**), which is our evaluation metric. To read about the `mean_squared_error` function that RMSE is computed from, see [this](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn-metrics-mean-squared-error).

In [19]:
# Predicting on the test Dataset
predictions = model.predict(X_test)

# Display prediction from model
b = predictions[1212]
print('Model Prediction of one Datapoint from Test Data:', b[0])

# Display Original Label
a = y_test.values.tolist()
print('original label from test data:', a[1212])

# Display Root Mean Squared Error
rmse = sqrt(mean_squared_error(y_test, predictions))
print('Root Mean Squared error:', rmse)

Model Prediction of one Datapoint from Test Data: 165.73103
original label from test data: 179
Root Mean Squared error: 135.44911065395334
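
The test RMSE (about 135) is much larger than the best validation MAE seen during training (about 55.5). The two numbers come from different subsets, so they are not directly comparable. One plausible contributor is that RMSE squares each error, so a few badly mispredicted expensive listings can dominate it. The sketch below illustrates this with made-up prices; none of the values come from the dataset.

```python
from math import sqrt

y_true = [100, 150, 200, 1000]  # hypothetical prices; the last one is an outlier
y_pred = [110, 140, 190, 300]   # hypothetical predictions

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = sqrt(sum(e * e for e in errors) / len(errors))

print('MAE :', mae)  # 182.5
print('RMSE:', rmse)
```

Three of the four predictions are off by only 10, yet the single outlier pulls the RMSE well above the MAE. Reporting both metrics side by side gives a fuller picture of the model's errors.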
