# AirBNB Price Prediction with Logistic Regression

We will predict whether a listing's price is at least 150 (the binary `price_gte_150` column) using the AirBNB dataset from last week.

## 1. Setup

In [1]:
# Common imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

np.random.seed(1)

## 2. Load the data

We will use the AirBNB data that we cleaned in the last class (the original, not the version that you altered for last week's exercise).

In [2]:
# Uncomment the following snippet to debug problems with finding the .csv file path.
# It prints the current working directory, which the relative paths below are resolved against.
#import os
#print(os.getcwd())

In [3]:
X_train = pd.read_csv("./data/airbnb_train_X_price_gte_150.csv")
X_test = pd.read_csv("./data/airbnb_test_X_price_gte_150.csv")
y_train = pd.read_csv("./data/airbnb_train_y_price_gte_150.csv")
y_test = pd.read_csv("./data/airbnb_test_y_price_gte_150.csv")

## 3. Model the data

First, we will create a dataframe to hold all the results of our models.

In [4]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### 3.1 Fit and test a Logistic Regression model

In [5]:
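# penalty=None fits an unregularized model (scikit-learn >= 1.2; older versions expect penalty='none')
log_reg_model = LogisticRegression(max_iter=900, penalty=None)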
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [6]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
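
The metric block above is repeated for every model in this notebook. As a possible refactor (a sketch, not the code used to produce the results shown here), the same formulas could live in one helper. The name `metrics_row` and the toy labels are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix


def metrics_row(model_name, y_true, y_pred):
    """Return a one-row DataFrame with the same columns as `performance`."""
    # For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return pd.DataFrame({
        "model": [model_name],
        "Accuracy": [(tp + tn) / (tp + tn + fp + fn)],
        "Precision": [tp / (tp + fp)],
        "Recall": [tp / (tp + fn)],
        "F1": [2 * tp / (2 * tp + fp + fn)],
    })


# Toy labels, made up for illustration (not taken from the AirBNB data)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
row = metrics_row("toy example", y_true, y_pred)
print(row)
```

With such a helper, each model would need only one line, for example `pd.concat([performance, metrics_row("default logistic", y_test, model_preds)], ignore_index=True)`. Passing `ignore_index=True` would also give each row a distinct index instead of the repeated `0`.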


### 3.2 Change to liblinear solver

In [7]:
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(X_train, np.ravel(y_train))

In [8]:
model_preds = log_reg_liblin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454


### 3.3 L2 Regularization

In [9]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [10]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709


### 3.4 L1 Regularization

In [11]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [12]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
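
One practical difference between the L1 and L2 penalties is that L1 can set coefficients exactly to zero, while L2 only shrinks them. The sketch below illustrates this on synthetic data. The dataset, the value `C=0.05`, and the `*_demo` names are assumptions chosen for the demonstration and are not part of the AirBNB analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 20 features, only 4 informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=1
)

# Same solver and same (strong) regularization strength, different penalty
l1_demo = LogisticRegression(solver="liblinear", penalty="l1", C=0.05).fit(X_demo, y_demo)
l2_demo = LogisticRegression(solver="liblinear", penalty="l2", C=0.05).fit(X_demo, y_demo)

# Count coefficients that were driven exactly to zero
l1_zeros = int(np.sum(l1_demo.coef_ == 0))
l2_zeros = int(np.sum(l2_demo.coef_ == 0))
print("coefficients exactly zero: L1 =", l1_zeros, " L2 =", l2_zeros)
```

Running the same count on `log_reg_L1_model.coef_` and `log_reg_L2_model.coef_` would show whether L1 is dropping any of the AirBNB predictors.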


### 3.5 Elastic Net Regularization

In [13]:
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1000)
_ = log_reg_elastic_model.fit(X_train, np.ravel(y_train))

In [14]:
model_preds = log_reg_elastic_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elestic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
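
The `l1_ratio` argument controls the mix of the two penalties: `l1_ratio=0` corresponds to a pure L2 penalty and `l1_ratio=1` to a pure L1 penalty. The sketch below sweeps three values on synthetic data and counts the coefficients that end up exactly zero. The data, `C=0.05`, and `max_iter=5000` are assumptions for the demonstration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the AirBNB features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=1
)

zero_counts = {}
for ratio in (0.0, 0.5, 1.0):
    model = LogisticRegression(
        solver="saga", penalty="elasticnet", l1_ratio=ratio,
        C=0.05, max_iter=5000, random_state=1,
    ).fit(X_demo, y_demo)
    # Number of coefficients set exactly to zero at this mixing ratio
    zero_counts[ratio] = int(np.sum(model.coef_ == 0))

print(zero_counts)
```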

## 5.0 Summary

Sorted by accuracy in ascending order (the best model is in the last row):

In [15]:
performance.sort_values(by=['Accuracy'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,default logistic,0.866917,0.852995,0.885122,0.868762


Sorted by precision in ascending order (the best model is in the last row):

In [16]:
performance.sort_values(by=['Precision'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,default logistic,0.866917,0.852995,0.885122,0.868762


Sorted by recall in ascending order (the best model is in the last row):

In [17]:
performance.sort_values(by=['Recall'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,default logistic,0.866917,0.852995,0.885122,0.868762


Sorted by F1 in ascending order (the best model is in the last row):

In [18]:
performance.sort_values(by=['F1'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,default logistic,0.866917,0.852995,0.885122,0.868762


### So which model is the 'best' and the one you wish to choose?

This depends heavily on the profit or loss associated with each FP, FN, TP, and TN. We will discuss this in the next class.
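
As a rough illustration of that idea, a payoff can be attached to each cell of the confusion matrix and summed. Both the counts and the payoffs below are hypothetical placeholders, not values from this dataset or from the course.

```python
import numpy as np

# Confusion matrix in scikit-learn layout: rows = actual, columns = predicted
# [[TN, FP],
#  [FN, TP]]
# Hypothetical counts, for illustration only
c_matrix_demo = np.array([[50, 10],
                          [5, 35]])

# Hypothetical profit (+) or loss (-) per case, in the same layout
payoff = np.array([[0, -20],
                   [-50, 100]])

# Element-wise product gives the payoff contributed by each cell
expected_total = int(np.sum(c_matrix_demo * payoff))
expected_per_case = expected_total / c_matrix_demo.sum()
print(expected_total, expected_per_case)  # 3050 30.5
```

Under a payoff matrix like this one, two models with similar accuracy can differ noticeably in value if one of them trades false negatives for false positives.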

Based on the above results, the default logistic model performs best among the logistic regression models: it has the highest score on every metric.

### 3.1 Fit an SVM classification model using a linear kernel

In [19]:
from sklearn.svm import SVC
#performance1 = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})
svm_lin_model = SVC(kernel="linear")
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [20]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"linear svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### 3.2 Fit an SVM classification model using an RBF kernel

In [21]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [22]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"rbf svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### 3.3 Fit an SVM classification model using a polynomial kernel

In [23]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [24]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [35]:
## 4.0 Summary

performance.sort_values(by=['Accuracy'],ascending= False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,poly svm,0.867854,0.855839,0.883239,0.869323
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,rbf svm,0.864105,0.85348,0.877589,0.865367
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,linear svm,0.853796,0.828371,0.890772,0.858439


In [36]:

performance.sort_values(by=['Precision'],ascending= False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,poly svm,0.867854,0.855839,0.883239,0.869323
0,rbf svm,0.864105,0.85348,0.877589,0.865367
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,linear svm,0.853796,0.828371,0.890772,0.858439


In [34]:

performance.sort_values(by=['Recall'],ascending= False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,linear svm,0.853796,0.828371,0.890772,0.858439
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,poly svm,0.867854,0.855839,0.883239,0.869323
0,rbf svm,0.864105,0.85348,0.877589,0.865367
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454


In [33]:

performance.sort_values(by=['F1'],ascending= False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,poly svm,0.867854,0.855839,0.883239,0.869323
0,default logistic,0.866917,0.852995,0.885122,0.868762
0,rbf svm,0.864105,0.85348,0.877589,0.865367
0,L2 logistic,0.861293,0.850091,0.875706,0.862709
0,liblinear logistic,0.861293,0.851376,0.873823,0.862454
0,Elestic logistic,0.859419,0.846995,0.875706,0.861111
0,L1 logistic,0.858482,0.845455,0.875706,0.860315
0,linear svm,0.853796,0.828371,0.890772,0.858439


# Analysis

A support vector machine (SVM) is a classification algorithm that often performs well in practice. It separates the classes by finding the hyperplane that maximizes the margin, that is, the distance between the hyperplane and the nearest training points of each class (the support vectors). A kernel function lets the SVM learn non-linear decision boundaries by implicitly mapping the data into a higher-dimensional space. We used linear, RBF, and polynomial kernels.

The polynomial kernel is frequently used in SVM classification problems where the data is not linearly separable.
Its tunable parameters include the degree of the polynomial (`degree`) and the constant term added inside the polynomial (`coef0`).
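
To make the role of these parameters concrete, scikit-learn's polynomial kernel is `(gamma * <x, x'> + coef0) ** degree`. The sketch below computes it by hand on made-up data and compares it with `sklearn.metrics.pairwise.polynomial_kernel`. The array `X_demo` is an assumption for illustration; `degree=3` and `coef0=1` match the SVC call above, and `gamma` follows the formula that scikit-learn documents for `gamma='scale'`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 3))  # 5 made-up samples with 3 features

degree, coef0 = 3, 1
# gamma='scale' in SVC corresponds to 1 / (n_features * X.var())
gamma = 1.0 / (X_demo.shape[1] * X_demo.var())

# Kernel matrix computed directly from the formula
K_manual = (gamma * (X_demo @ X_demo.T) + coef0) ** degree
# Same kernel matrix computed by scikit-learn
K_sklearn = polynomial_kernel(X_demo, X_demo, degree=degree, gamma=gamma, coef0=coef0)

print(np.allclose(K_manual, K_sklearn))  # True
```

A larger `degree` allows more complex decision boundaries, while `coef0` shifts the balance between low-order and high-order terms of the polynomial.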

Our target, `price_gte_150`, is binary.
Among the SVM models, the polynomial-kernel SVM has the best overall performance.
Across all eight models it ranks first on accuracy, precision, and F1; only recall is higher for the linear SVM and the default logistic model.

To choose among the models, I would use the F1 score. It is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives and gives a useful summary for both balanced and imbalanced datasets.
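
For reference, the expression used in the `F1` column throughout this notebook follows from the harmonic-mean definition. Writing $P$ for precision and $R$ for recall (symbols introduced here for brevity):

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
$$

$$
F_1 = \frac{2PR}{P + R}
    = \frac{2\,TP^2}{TP\,(TP + FN) + TP\,(TP + FP)}
    = \frac{2\,TP}{2\,TP + FP + FN}
$$

Note that $TN$ does not appear in the final expression, so F1 does not reward a model for correctly identifying negatives.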

**Therefore, the SVM with the polynomial kernel is the best model for our data.**