## 1 Introduction
Using our Sydney Airbnb listings data, we would like to test a couple of machine learning (ML) algorithms to see how well we can predict the `price` (a continuous numerical variable) of an accommodation from its other features. The two algorithms we will use are **k-nearest neighbours (KNN)** and **linear regression**. We will also introduce **feature engineering** and **k-fold cross validation** and build both into our workflow.

Let's import our Airbnb data and the Python packages that we will be using: 

In [8]:
from sklearn.model_selection import KFold
from sklearn import linear_model, neighbors, metrics
import pandas as pd
import numpy as np

data = pd.read_csv('Airbnb_Listings.csv')
data.head()

| | listing_url | latitude | longitude | room_type | accommodates | bathrooms | bedrooms | price |
|---|---|---|---|---|---|---|---|---|
| 0 | https://www.airbnb.com/rooms/5098784 | -33.5941 | 151.3218 | Entire home/apt | 10 | 3.0 | 4 | 1749.0 |
| 1 | https://www.airbnb.com/rooms/30905588 | -33.64781 | 151.32421 | Entire home/apt | 9 | 4.0 | 4 | 1749.0 |
| 2 | https://www.airbnb.com/rooms/30905729 | -33.631 | 151.33632 | Entire home/apt | 7 | 3.0 | 4 | 1749.0 |
| 3 | https://www.airbnb.com/rooms/34907444 | -33.60566 | 151.32985 | Entire home/apt | 10 | 4.0 | 5 | 1749.0 |
| 4 | https://www.airbnb.com/rooms/14191947 | -33.96839 | 151.25268 | Entire home/apt | 10 | 2.0 | 5 | 1725.0 |


## 2 Feature Engineering and Cross Validation
### 2.1 What is Feature Engineering?
Feature engineering is the process of transforming the features (variables/attributes) of a dataset, for example by splitting, merging, or scaling them, to improve a model's predictions. We will transform our data in several different ways before feeding it to our ML algorithms, and compare the RMSE and R^2 scores of each feature engineering trial.

### 2.2 What is k-fold Cross Validation?
When going through the ML process, we need to decide how to split up our data into training and testing batches. Generally, we allocate around 80-90% of our data to training and 10-20% to testing. K-fold cross validation allows us to, in a way, use 100% of our data for *both* training and testing. It does this by splitting the data randomly into k folds and running an ML algorithm k times, where a different fold is used for testing each time and the other k-1 folds for training. We will specifically be using 5-fold cross validation (so 80/20 splits on each run). A final evaluation can be done by averaging the results of each run of the ML algorithm.
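To make the splitting concrete, here is a minimal hand-rolled sketch of the idea on 10 dummy row indices. The helper name `k_fold_indices` is hypothetical and only for illustration; in the rest of this notebook the splitting is done by scikit-learn's `KFold`.

```python
import random

def k_fold_indices(n_samples, k, seed=17):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # randomly assign rows to folds
    # fold i takes every k-th shuffled index, so fold sizes differ by at most 1
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # the other k-1 folds together form the training set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test_fold)

splits = list(k_fold_indices(n_samples=10, k=5))
for run, (train, test) in enumerate(splits, start=1):
    print(f"run {run}: train on {len(train)} rows, test on rows {test}")

# collect the test rows of all runs: each row should appear exactly once
all_test = sorted(idx for _, test in splits for idx in test)
```

With 10 rows and 5 folds, each run trains on 8 rows and tests on 2 (an 80/20 split), and `all_test` covers every row exactly once.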

## 3 Overview of the ML Process
Since we will be using two different ML algorithms and multiple feature-engineered versions of our data, and performing 5-fold cross validation each time, we need to streamline our ML process as much as possible.

To achieve this, we will define a function `produce_model` that takes two arguments: the feature engineering function to be used (we will get to this soon) and the ML algorithm to be used. The function engineers the data as specified and then carries out the ML process with 5-fold cross validation using the specified algorithm. It prints the root-mean-square error (RMSE, lower is better) and R^2 score (higher is better) for each of the 5 runs, followed by the mean RMSE and mean R^2 across those runs.
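For reference, these are the standard definitions of the two scores, which is what `metrics.mean_squared_error` (after taking the square root) and `metrics.r2_score` compute. The symbols are our own notation: $y_i$ is the actual price of test listing $i$, $\hat{y}_i$ is the predicted price, $\bar{y}$ is the mean actual price in the test fold, and $n$ is the number of test listings.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is in the same units as `price`, while an R^2 of 1 means perfect predictions and an R^2 of 0 means the model does no better than always predicting the mean price.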

In [32]:
def produce_model(feature_eng_funct, ML_algorithm):
    # split data into 5 folds
    kfolds = KFold(n_splits=5, shuffle=True, random_state=17).split(data['price'])
    total_rmse = 0; total_r2 = 0
    for train_indices, test_indices in kfolds:
        # data preparation
        X_train = data.loc[train_indices, 'latitude':'bedrooms'] # all feature columns (excludes listing_url and price)
        y_train = data.loc[train_indices, 'price'] # attribute to be predicted
        X_test = data.loc[test_indices, 'latitude':'bedrooms']
        y_test = data.loc[test_indices, 'price']

        # train model after engineering data
        model = ML_algorithm.fit(feature_eng_funct(X_train), y_train)

        # predictions using engineered test data
        y_pred = model.predict(feature_eng_funct(X_test))

        # evaluate model
        rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
        r2 = metrics.r2_score(y_test, y_pred)
        print('RMSE:', round(rmse, 3), '\tR^2:', round(r2, 3))
        
        total_rmse += rmse; total_r2 += r2
    
    # aggregated results
    print('\nMean RMSE across 5 folds:', round(total_rmse / 5, 3))
    print('Mean R^2 across 5 folds:', round(total_r2 / 5, 3))

All we need to do now is figure out the ways that we want to feature engineer our data and encode these as functions to feed into our `produce_model` function. In other words, whenever we want to use the `produce_model` function, we'll need to define how we want to engineer our data via a function. This will become clear in the next section, where we define these engineering functions and use them in producing models.

## 4 Testing out our ML Models
What we learnt when trying out different feature engineering functions is that there are many, many possible ways we can engineer the data before feeding it to an ML algorithm. Due to this, we will only go over some distinct engineering methods we used that we think may be useful to understand.

### 4.1 The Baseline Linear Regression
Here, we define our first feature engineering function `trial_1` as follows:

In [6]:
def trial_1(raw_df):
    # drop() returns a new DataFrame, so the raw data is left untouched
    return raw_df.drop(columns=['room_type'])

All it does is retain the columns that hold numerical values and discard the only nominal categorical variable, `room_type`. Here's how the function fares with a linear regression algorithm:

In [33]:
produce_model(trial_1, linear_model.LinearRegression())

RMSE: 134.589 	R^2: 0.5
RMSE: 137.978 	R^2: 0.511
RMSE: 128.528 	R^2: 0.537
RMSE: 132.177 	R^2: 0.499
RMSE: 130.824 	R^2: 0.517

Mean RMSE across 5 folds: 132.819
Mean R^2 across 5 folds: 0.513


### 4.2 Using All Columns
The `room_type` column has 4 different categories. One way to make use of it is to split it into 4 separate columns, one per category (this is commonly called one-hot encoding). For each row, the column matching that row's `room_type` contains a 1 and the other three columns contain a 0.

In [22]:
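def trial_2(raw_df):
    # keep the categorical column plus all numerical features; copy so that
    # adding columns below cannot trigger pandas' SettingWithCopyWarning
    engineered_df = raw_df[['room_type', 'accommodates', 'bathrooms', 'bedrooms', 'latitude', 'longitude']].copy()

    # one 0/1 indicator column per room type; the categories come from the full
    # dataset so that every fold produces the same set of columns
    for option in data['room_type'].drop_duplicates():
        engineered_df[option] = (engineered_df['room_type'] == option).astype(int)

    engineered_df = engineered_df.drop(columns='room_type')
    return engineered_df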

Let's see how this will transform our data before conducting the ML process:

In [27]:
trial_2(data).tail()

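| | accommodates | bathrooms | bedrooms | latitude | longitude | Entire home/apt | Hotel room | Private room | Shared room |
|---|---|---|---|---|---|---|---|---|---|
| 37505 | 3 | 1.0 | 1 | -33.89685 | 151.26079 | 0 | 0 | 0 | 1 |
| 37506 | 2 | 1.0 | 1 | -33.9553 | 151.13893 | 0 | 0 | 0 | 1 |
| 37507 | 3 | 1.0 | 1 | -33.96138 | 151.1386 | 0 | 0 | 1 | 0 |
| 37508 | 4 | 1.0 | 2 | -33.88227 | 151.1967 | 1 | 0 | 0 | 0 |
| 37509 | 4 | 1.0 | 2 | -33.88744 | 151.27668 | 1 | 0 | 0 | 0 |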
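As an aside, the same kind of encoding can be sketched with pandas' built-in `pd.get_dummies`. This is an alternative to the loop in `trial_2`, not what we used for the results in this notebook. The toy DataFrame below is made up for illustration, and the category list is hard-coded from the four room types seen in the output above.

```python
import pandas as pd

# toy stand-in for the listings table (values are made up for illustration)
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
    'bedrooms': [4, 1, 1, 2],
})

# fix the category list up front so every subset of the data gets the same
# columns, even when a category (here 'Hotel room') is absent from it
categories = ['Entire home/apt', 'Hotel room', 'Private room', 'Shared room']
toy['room_type'] = pd.Categorical(toy['room_type'], categories=categories)

# empty prefix and separator so the new columns are named after the categories
encoded = pd.get_dummies(toy, columns=['room_type'], prefix='', prefix_sep='', dtype=int)
print(encoded)
```

Declaring the categories explicitly plays the same role as looping over `data['room_type']` rather than the fold's own values in `trial_2`: the training and testing folds always end up with identical columns.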


In [25]:
produce_model(trial_2, linear_model.LinearRegression())

RMSE: 133.016 	R^2: 0.511
RMSE: 136.607 	R^2: 0.52
RMSE: 127.387 	R^2: 0.545
RMSE: 130.815 	R^2: 0.51
RMSE: 129.412 	R^2: 0.527

Mean RMSE across 5 folds: 131.447
Mean R^2 across 5 folds: 0.523


The results are ever so slightly better than our baseline model: the mean RMSE drops from 132.819 to 131.447 and the mean R^2 rises from 0.513 to 0.523.

### 4.3 The Best (and most complicated) Attempt
This last feature engineering function was created after trying out several previous functions, hence its complexity:

In [28]:
def trial_3(raw_df):
    # copy the selected columns so that adding new columns below is safe
    engineered_df = raw_df[['room_type', 'accommodates', 'bathrooms', 'bedrooms', 'longitude', 'latitude']].copy()

    # one 0/1 indicator column per room type (categories taken from the full dataset)
    for option in data['room_type'].drop_duplicates():
        engineered_df[option] = (engineered_df['room_type'] == option).astype(int)

    # same treatment for each distinct bedroom count...
    for option in data['bedrooms'].drop_duplicates():
        engineered_df['bed' + str(option)] = (engineered_df['bedrooms'] == option).astype(int)

    # ...and for each distinct bathroom count
    for option in data['bathrooms'].drop_duplicates():
        engineered_df['bath' + str(option)] = (engineered_df['bathrooms'] == option).astype(int)

    # log-transform the guest capacity
    engineered_df['accommodates'] = np.log(engineered_df['accommodates'])

    # straight-line distance (in degrees) from Sydney city
    city_lat = -33.868322; city_lon = 151.209122
    engineered_df['dist_from_city'] = ((engineered_df['latitude'] - city_lat) ** 2 + (engineered_df['longitude'] - city_lon) ** 2) ** 0.5

    engineered_df = engineered_df.drop(columns=['room_type', 'bedrooms', 'bathrooms'])
    return engineered_df

To summarise what it does:
- it does what we did for `room_type` earlier, but for the `bedrooms` and `bathrooms` variables too
- it takes the logarithm of the `accommodates` variable
- it generates a new column from the `latitude` and `longitude` variables that contains each listing's straight-line distance (in degrees) from Sydney city (using Google Maps' lat/lon coords of the city)
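The distance feature can be illustrated on a single listing. The sketch below computes the same degree-based distance as `trial_3` and, for comparison only, a great-circle (haversine) distance in kilometres. The haversine variant is not used anywhere in this notebook, and the function names and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

CITY_LAT, CITY_LON = -33.868322, 151.209122  # same city coordinates as trial_3

def euclidean_degrees(lat, lon):
    """Straight-line distance in degrees, as computed in trial_3."""
    return math.hypot(lat - CITY_LAT, lon - CITY_LON)

def haversine_km(lat, lon, radius_km=6371.0):
    """Great-circle distance in kilometres (assumed mean Earth radius)."""
    phi1, phi2 = math.radians(CITY_LAT), math.radians(lat)
    dphi = phi2 - phi1
    dlmb = math.radians(lon - CITY_LON)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# coordinates of the first listing shown in the data.head() output
lat, lon = -33.5941, 151.3218
print(round(euclidean_degrees(lat, lon), 4), 'degrees')
print(round(haversine_km(lat, lon), 1), 'km')
```

At Sydney's latitude a degree of longitude covers less ground than a degree of latitude, so the degree-based distance slightly overweights east-west offsets. Whether switching to kilometres would change the model scores is something we did not test.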

Below are the results. They are better than our other two attempts, but not by much (the mean RMSE drops from 131.447 to 126.928):

In [29]:
produce_model(trial_3, linear_model.LinearRegression())

RMSE: 129.932 	R^2: 0.534
RMSE: 131.99 	R^2: 0.552
RMSE: 122.335 	R^2: 0.581
RMSE: 126.234 	R^2: 0.543
RMSE: 124.151 	R^2: 0.565

Mean RMSE across 5 folds: 126.928
Mean R^2 across 5 folds: 0.555


### 4.4 The Baseline KNN Model
For the KNN algorithm, we need to specify a value for k. We ran the algorithm using our `trial_1` feature engineering function over all values of k between 2 and 100, using the code below. 

In [None]:
for k in range(2,101):
    print('Number of neighbours:', k, '\n')
    produce_model(trial_1, neighbors.KNeighborsRegressor(n_neighbors=k))
    print('\n-------------------------------------\n')

The results kept getting better up to k = 23, and then started to degrade. For k = 23, the results are shown below, and they happen to be better than our results from our linear regression attempt in section 4.3!
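Reading through 99 printed blocks to find the best k is tedious, so here is a hedged sketch of how such a sweep could collect the mean RMSE per k and pick the minimum automatically. It uses scikit-learn's `cross_val_score` on synthetic stand-in data (not the Airbnb listings), so the best k it reports says nothing about our dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in data (NOT the Airbnb listings): a noisy 1-D curve
rng = np.random.default_rng(17)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=17)
mean_rmse = {}
for k in range(2, 51):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=cv, scoring='neg_root_mean_squared_error')
    mean_rmse[k] = -scores.mean()  # the scorer returns negated RMSE

# the k with the lowest mean RMSE across the 5 folds
best_k = min(mean_rmse, key=mean_rmse.get)
print('best k:', best_k, 'mean RMSE:', round(mean_rmse[best_k], 3))
```

The same pattern would apply to our data if `produce_model` returned its mean RMSE instead of only printing it.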

In [34]:
produce_model(trial_1, neighbors.KNeighborsRegressor(n_neighbors=23))

RMSE: 126.588 	R^2: 0.557
RMSE: 126.305 	R^2: 0.59
RMSE: 120.178 	R^2: 0.595
RMSE: 121.922 	R^2: 0.574
RMSE: 119.598 	R^2: 0.596

Mean RMSE across 5 folds: 122.918
Mean R^2 across 5 folds: 0.583


### 4.5 KNN with trial_3
Since our baseline KNN model beat our best linear regression model, we just had to try the `trial_3` feature engineering function with the KNN algorithm to see how it fared.

In [35]:
produce_model(trial_3, neighbors.KNeighborsRegressor(n_neighbors=23))

RMSE: 123.606 	R^2: 0.578
RMSE: 124.162 	R^2: 0.604
RMSE: 117.205 	R^2: 0.615
RMSE: 119.556 	R^2: 0.59
RMSE: 116.658 	R^2: 0.616

Mean RMSE across 5 folds: 120.237
Mean R^2 across 5 folds: 0.601


This attempt turned out to be our best, with a mean RMSE of ~120 and the only mean R^2 score above 0.6! However, the predictions are still quite poor: accommodation prices only range between \$4 and \$1749, so an error of around \$120 is large relative to that range.

## 5 Takeaways
We learnt many new things about machine learning by trying it out ourselves for the first time.

We discovered how complicated the ML process can be. Not only do you have to choose the best model, but plenty of feature engineering and rigorous validation are required before and after the process. There are so many knobs that need to be turned to fine-tune a single model.

In contrast, we also found out how streamlined the ML process can be. We were able to combine ML algorithms and cross validation testing into one easy-to-use function, and to modularise the ML process into feature engineering functions and the `produce_model` function. This allowed us to quickly and effortlessly perform ML runs by fitting the functions together like a jigsaw puzzle.

The beautiful thing about ML experiments compared to, say, an industrial chemical experiment is that an ML experiment can be repeated with a couple of clicks, costs practically nothing, and can be modified with ease. A chemical experiment may be costly and difficult to run again, hard to modify, and can even be dangerous.