In this repo, I built `Boosted Binary Logistic Regression` and `Stochastic Gradient Descent` models that **predict the availability of Airbnb Amsterdam listings.** This is a binary classification task. I used the following columns from the dataset:
- `price`
- `adjusted_price`
- `listing_id`
- `minimum_nights`
- `maximum_nights`

Then, I compared these two models' performance by their accuracy score, time, and memory consumption.

The source of this dataset is: http://insideairbnb.com/get-the-data.html


In [None]:
import pandas as pd
import numpy as np

In [2]:
calendar = pd.read_csv("C:\\Users\\talfi\\python\\dersler\\capstones\\11. binary_lr_and_sgd_for_airbnb_dataset\\calendar.csv")

In [7]:
calendar.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2818,2021-02-11,f,$59.00,$59.00,3.0,1125.0
1,415619,2021-02-10,t,$145.00,$145.00,2.0,30.0
2,415619,2021-02-11,t,$145.00,$145.00,2.0,30.0
3,415619,2021-02-12,t,$145.00,$145.00,2.0,30.0
4,415619,2021-02-13,t,$145.00,$145.00,2.0,30.0


In [8]:
calendar.shape

(6676319, 7)

In [3]:
calendar.dtypes

listing_id          int64
date               object
available          object
price              object
adjusted_price     object
minimum_nights    float64
maximum_nights    float64
dtype: object

In [4]:
calendar.isnull().any()

listing_id        False
date              False
available         False
price              True
adjusted_price     True
minimum_nights     True
maximum_nights     True
dtype: bool

- Let's strip the currency symbols and separators from the price columns so they can be converted to numeric types

In [3]:
calendar['price'] = calendar['price'].str.replace('$', '', regex=False)


In [4]:
# Note: removing '.' also drops the decimal point, so "$59.00" becomes "5900" (the price expressed in cents)
calendar['price'] = calendar['price'].str.replace('.', '', regex=False)


In [5]:
calendar['price'] = calendar['price'].str.replace(',', '')

In [6]:
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace('$', '', regex=False)


In [7]:
# Note: removing '.' also drops the decimal point, so the adjusted price ends up expressed in cents
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace('.', '', regex=False)


In [8]:
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(',', '')

In [9]:
calendar[['price','adjusted_price']] = calendar[['price','adjusted_price']].astype(float,errors ="ignore")
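The cleaning steps above remove the decimal point along with the currency symbol, so the prices end up in cents. As a minimal sketch of an alternative, assuming the raw strings look like `$1,145.00`, the snippet below strips only `$` and `,` and keeps the values in dollars. The sample series `raw` is hypothetical, and using this variant would change the feature magnitudes relative to the outputs shown in this notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the raw `price` strings in calendar.csv
raw = pd.Series(["$59.00", "$1,145.00", None])

# Strip only the currency symbol and the thousands separator; keep the decimal point
cleaned = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))

print(cleaned.tolist())
```

Missing values stay missing after the conversion, so the later imputation step would still be needed.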

In [12]:
calendar.dtypes

listing_id          int64
date               object
available          object
price             float64
adjusted_price    float64
minimum_nights    float64
maximum_nights    float64
dtype: object

In [10]:
calendar_core = calendar[["price", "adjusted_price", "minimum_nights", "maximum_nights"]]

- As this is a binary classification task and the algorithms can't work with the strings `f` and `t`, let's encode them:
    - `f` (false, not available) as `0`
    - `t` (true, available) as `1`

In [11]:
calendar = calendar.replace({"available":{"f":0, "t":1}})

In [21]:
calendar.available

0          0
1          1
2          1
3          1
4          1
          ..
6676314    0
6676315    0
6676316    0
6676317    0
6676318    0
Name: available, Length: 6676319, dtype: int64
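The same encoding can also be sketched with `Series.map`, which turns any unexpected flag into `NaN` so it can be caught early. The sample series below is hypothetical; the class shares it prints are only meant as the kind of baseline that helps when reading accuracy scores later.

```python
import pandas as pd

# Hypothetical sample of the raw `available` column
available = pd.Series(["f", "t", "t", "f", "t"])

# Map the flags to integers; any value other than "f" or "t" would become NaN
encoded = available.map({"f": 0, "t": 1})
assert encoded.notna().all(), "found a value other than 'f' or 't'"

# Share of each class, a useful baseline when interpreting accuracy
class_share = encoded.value_counts(normalize=True)
print(class_share)
```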

In [13]:
calendar.isnull().any()

listing_id        False
date              False
available         False
price              True
adjusted_price     True
minimum_nights     True
maximum_nights     True
dtype: bool

- Let's replace the null values with the mean of each column

In [12]:
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

imr = imr.fit(calendar_core)

imputed_data = imr.transform(calendar_core)
calendar_core = pd.DataFrame(imputed_data)
calendar_core = calendar_core.rename(columns={0:"price", 1:"adjusted_price", 2:"minimum_nights",3:"maximum_nights"})
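As a small illustration of what mean imputation does, here is a sketch on a hypothetical toy frame (the values are made up, not taken from the dataset). It also shows one way to keep the column names and index without a separate rename step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps in two numeric columns
df = pd.DataFrame({"price": [100.0, np.nan, 300.0],
                   "minimum_nights": [2.0, 4.0, np.nan]})

imputer = SimpleImputer(strategy="mean")
# fit_transform returns the imputed values; rebuild the frame with the original labels
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

print(filled)
```

Each missing cell is replaced by the mean of the observed values in its own column.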

In [13]:
calendar_core2 = calendar[["listing_id","available"]]

In [14]:
calendar_core = calendar_core.join(calendar_core2)
print(calendar_core.isnull().values.any())
print(calendar_core.head(3))

False
     price  adjusted_price  minimum_nights  maximum_nights  listing_id  \
0   5900.0          5900.0             3.0          1125.0        2818   
1  14500.0         14500.0             2.0            30.0      415619   
2  14500.0         14500.0             2.0            30.0      415619   

   available  
0          0  
1          1  
2          1  


In [48]:
calendar_core.dtypes

price             float64
adjusted_price    float64
minimum_nights    float64
maximum_nights    float64
listing_id          int64
available           int64
dtype: object

#### SPLITTING DATA INTO TRAINING & TEST SETS

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(calendar_core.drop(columns = "available"), 
                                                    calendar_core["available"],
                                                    test_size = 0.3,
                                                    random_state = 42) 

## LOGISTIC REGRESSION

#### WHICH VALUE SHOULD WE SET THE C PARAMETER TO?

- Let's boost LR's performance by tuning its hyperparameters with GridSearchCV
- 17.05.24: I tried to use GridSearchCV for SGDClassifier as well, but since the data is no longer usable, I couldn't do it.

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty = "l2")
searcher = GridSearchCV(lr, {"C":[0.001, 0.01, 0.1, 1, 10]})
searcher.fit(X_train, y_train)
print("Best CV Params", searcher.best_params_)

Best CV Params {'C': 0.001}
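For readers without access to the data, the search can be reproduced in miniature. This is a sketch on synthetic data from `make_classification`, not the Airbnb features, so the best `C` it finds says nothing about the result above. `max_iter=1000`, `cv=5` and `scoring="accuracy"` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Airbnb features (not the real dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}
# LogisticRegression uses the l2 penalty by default
searcher = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid, cv=5, scoring="accuracy")
searcher.fit(X, y)

print("Best CV params:", searcher.best_params_)
print("Best CV accuracy:", round(searcher.best_score_, 3))

# With the default refit=True, best_estimator_ is already refit on all of X, y
best_lr = searcher.best_estimator_
```

Reusing `best_estimator_` avoids typing the chosen `C` into a new model by hand.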


- Boosting is now complete for LR. Let's compare the algorithms by their
    - Accuracy
    - Memory usage
    - Run time
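One way to take the time and memory measurements in a single call is sketched below. It is an illustration using only the standard library (`time.perf_counter` and `tracemalloc`), not the `memory_profiler` approach this notebook uses, and the helper name `measure` is hypothetical. `tracemalloc` tracks allocations made through Python's memory allocators rather than whole-process memory, so its numbers are not comparable to those reported further down.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    t_start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        # Always record the measurements and stop tracing, even if fn raises
        elapsed = time.perf_counter() - t_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak

# Usage with a stand-in workload; in the notebook this could be e.g. lr.fit
result, elapsed, peak = measure(sorted, range(100_000, 0, -1))
print(f"time: {elapsed:.4f} s, peak traced memory: {peak / 2**20:.2f} MiB")
```

Because the function is run exactly once, the timing is not inflated by a second run made only for profiling.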

#### Logistic Regression ALGORITHM

In [20]:
import time
from memory_profiler import memory_usage
from sklearn.linear_model import LogisticRegression
t_start = time.time()
lr = LogisticRegression(penalty = "l2",C=0.001)
# penalty = "l2" is Ridge regularization: it shrinks the coefficients toward zero but does not set them exactly to zero.
# It is l1 (Lasso) that can zero out less important coefficients and so act as feature selection.
lr.fit(X_train, y_train)
# memory_usage runs lr.fit again to sample its peak memory, so the timing below includes this second fit
memfit = memory_usage((lr.fit, (X_train, y_train)), max_usage=True)
print("Logistic Regression Accuracy Score",lr.score(X_test, y_test))
# Peak memory while scoring (measured on the training set)
memscore = memory_usage((lr.score,(X_train, y_train)), max_usage=True)
memtotal = memfit + memscore
print("Logistic Regression Memory Used:", memtotal)
t_end = time.time()
print('Logistic Regression Time: {} s'.format(t_end - t_start))

Logistic Regression Accuracy Score 0.8220297009929622
Logistic Regression Memory Used: 6482.578125
Logistic Regression Time: 66.7182605266571 s


#### Stochastic Gradient Descent
* Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient with an estimate of it. With `SGDClassifier`, setting loss = "hinge" gives a linear SVM and loss = "log" gives Logistic Regression.
* In this repo, I set the loss to "log", i.e. Logistic Regression.
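To make the update rule concrete, here is a minimal NumPy sketch of stochastic gradient descent on the logistic loss. It is an illustration only, not scikit-learn's `SGDClassifier` implementation (which adds regularization and a learning-rate schedule). The function name `sgd_logistic`, its parameters and the toy data are all hypothetical.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=20, seed=0):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples one at a time, in a fresh random order each epoch
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # predicted probability
            grad = p - y[i]  # derivative of the log loss w.r.t. the logit, for this sample
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

# Toy, well-separated data (hypothetical, not the Airbnb features)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

w, b = sgd_logistic(X, y)
accuracy = np.mean(((X @ w + b) > 0).astype(int) == y)
print("training accuracy:", accuracy)
```

Each step uses the gradient from a single sample as a noisy estimate of the full gradient, which is the "stochastic approximation" described above.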

In [21]:
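from sklearn.linear_model import SGDClassifier
t_start = time.time()
# loss = "log" selects logistic regression; scikit-learn 1.1 renamed it to "log_loss"
sgd_log = SGDClassifier(loss = "log")
sgd_log.fit(X_train, y_train)
# memory_usage runs fit again to sample its peak memory, so the timing below includes this second fit
memfit = memory_usage((sgd_log.fit, (X_train, y_train)), max_usage=True)
print("SGD Accuracy Score: ",sgd_log.score(X_test, y_test))
# Note: memscore is carried over from the Logistic Regression cell; SGD's own score() memory is not measured here
memtotal = memfit + memscore
print("SGD Memory Used:", memtotal)
t_end = time.time()
print('SGD time: {} s'.format(t_end - t_start))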

SGD Accuracy Score:  0.8204449956463041
SGD Memory Used: 6656.73046875
SGD time: 568.1156244277954 s


While the accuracy scores are nearly the same (0.822 vs. 0.820), Binary Logistic Regression is much faster than SGD (about 67 s vs. 568 s) and uses slightly less memory (about 6,483 vs. 6,657 as reported by `memory_usage`). Boosting helps with this.<br>
Thus, the winner is the Boosted Binary Logistic Regression model.