In this notebook I want to justify the goal of my research, which is
* building a classifier that separates 'good' listings from 'bad' ones, as judged by their review scores
* understanding what affects the decision of the classifier

Two easily monetizable features are a listing's price and its availability in the following days. Below I show that average price and availability differ between 'good' and 'bad' listings, and that it may be reasonable to help the 'bad' listings and thereby increase their value.

In [122]:
import pandas as pd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import locale
locale.setlocale(locale.LC_ALL, '')

'LC_CTYPE=en_US.UTF-8;LC_NUMERIC=ru_RU.UTF-8;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=ru_RU.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=ru_RU.UTF-8;LC_NAME=ru_RU.UTF-8;LC_ADDRESS=ru_RU.UTF-8;LC_TELEPHONE=ru_RU.UTF-8;LC_MEASUREMENT=ru_RU.UTF-8;LC_IDENTIFICATION=ru_RU.UTF-8'

First, I'll load the dataset and convert some of the columns to their correct type.

In [123]:
df = pd.read_csv('data/listings.csv')
df.last_scraped = pd.to_datetime(df.last_scraped, infer_datetime_format=True)
df.host_since = pd.to_datetime(df.host_since, infer_datetime_format=True)
df.price = df.price.apply(lambda x: locale.atof(x[1:].replace(',', '')))
df.host_response_rate = df.host_response_rate.apply(lambda x: locale.atof(x[:-1].replace(',', '')) if not pd.isna(x) else x)
df.host_acceptance_rate = df.host_acceptance_rate.apply(lambda x: locale.atof(x[:-1].replace(',', '')) if not pd.isna(x) else x)
df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,5396,https://www.airbnb.com/rooms/5396,20221210143007,2022-12-11,city scrape,Explore the heart of old Paris,"Cozy, well-appointed and graciously designed s...","You are within walking distance to the Louvre,...",https://a0.muscache.com/pictures/52413/f9bf76f...,7903,...,4.82,4.95,4.55,7510402838018,f,1,1,0,0,1.98
1,7397,https://www.airbnb.com/rooms/7397,20221210143007,2022-12-11,city scrape,MARAIS - 2ROOMS APT - 2/4 PEOPLE,"VERY CONVENIENT, WITH THE BEST LOCATION !<br /...",,https://a0.muscache.com/pictures/67928287/330b...,2626,...,4.88,4.93,4.72,7510400829623,f,2,2,0,0,2.26
2,7964,https://www.airbnb.com/rooms/7964,20221210143007,2022-12-11,previous scrape,Large & sunny flat with balcony !,Very large & nice apartment all for you! <br /...,,https://a0.muscache.com/pictures/4471349/6fb3d...,22155,...,5.00,5.00,5.00,7510903576564,f,1,1,0,0,0.04
3,9359,https://www.airbnb.com/rooms/9359,20221210143007,2022-12-11,city scrape,"Cozy, Central Paris: WALK or VELIB EVERYWHERE !",Location! Location! Location! Just bring your ...,,https://a0.muscache.com/pictures/c2965945-061f...,28422,...,,,,"Available with a mobility lease only (""bail mo...",f,1,1,0,0,
4,81870,https://www.airbnb.com/rooms/81870,20221210143007,2022-12-11,previous scrape,Saint Germain Musee d'orsay,<b>The space</b><br />This beautiful apartment...,,https://a0.muscache.com/pictures/558458/3c1263...,152242,...,5.00,5.00,4.00,,f,78,78,0,0,0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55099,777293887700658247,https://www.airbnb.com/rooms/777293887700658247,20221210143007,2022-12-11,city scrape,Aborigine,The flat is decorated in a pure and design way...,Very quiet area.<br />Four times a week there ...,https://a0.muscache.com/pictures/miso/Hosting-...,490045822,...,,,,7511107644867,f,1,1,0,0,
55100,778194850167838333,https://www.airbnb.com/rooms/778194850167838333,20221210143007,2022-12-11,city scrape,Gratte-ciel - bien située - proche Tour Eiffel,"In the heart of the 15th arrondissement, this ...",The 15th arrondissement is a pleasant and very...,https://a0.muscache.com/pictures/prohost-api/H...,490713158,...,,,,7511507662375,f,1,1,0,0,
55101,778209219542952721,https://www.airbnb.com/rooms/778209219542952721,20221210143007,2022-12-11,city scrape,Charming apartment 2 P - Malesherbe,This pretty studio specially designed for love...,This studio is located in the 8th arrondisseme...,https://a0.muscache.com/pictures/prohost-api/H...,374553401,...,,,,7510807648613,f,91,91,0,0,
55102,777311545564408798,https://www.airbnb.com/rooms/777311545564408798,20221210143007,2022-12-10,city scrape,Appartement Guisarde,Profitez d'un logement élégant et central. Le ...,,https://a0.muscache.com/pictures/miso/Hosting-...,325898345,...,,,,7510807670783,f,4,4,0,0,
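
The price and rate columns above are parsed through `locale.atof`, whose result depends on the locale settings of the machine running the notebook. As a minimal sketch of a locale-independent alternative, assuming the raw strings look like `$1,250.00` and `95%`, one could strip everything except digits and the decimal point. The helper name `parse_amount` and the sample values are hypothetical, and the helper would also drop a minus sign, which is acceptable only because prices and rates are non-negative.

```python
import pandas as pd

def parse_amount(col: pd.Series) -> pd.Series:
    """Strip everything except digits and the decimal point, then convert to float."""
    cleaned = col.str.replace(r'[^0-9.]', '', regex=True)
    # errors='coerce' turns anything unparsable into NaN; missing values stay NaN
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical sample values in the same shape as the raw columns
prices = parse_amount(pd.Series(['$1,250.00', '$72.00', None]))
rates = parse_amount(pd.Series(['95%', None, '100%']))
print(prices.tolist(), rates.tolist())
```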


Then, I'll keep only the rows in which every review-score column is non-null.

In [124]:
df = df[~df.filter(regex='review_score').isna().any(axis=1)]

Now, let's binarize the review scores. Each listing gets one of two labels, 'good' or 'bad'. A listing is 'good' if the mean of all its review-score columns is strictly greater than a user-specified threshold; otherwise it is 'bad'.

In [125]:
threshold = 4.0
# assign() returns a new DataFrame, so the filtered frame is not modified in place
# and pandas does not raise a SettingWithCopyWarning
df = df.assign(target=(df.filter(regex='review_score').mean(axis=1) > threshold).astype(int))
np.mean(df['target'])

0.9648162449461681
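
Since about 96% of the listings end up 'good' at a threshold of 4.0, the class balance depends heavily on that choice. The following is a small sketch on made-up scores (the two columns and their values are hypothetical, not taken from the dataset) showing how the share of 'good' listings could be inspected for several thresholds.

```python
import pandas as pd

# Hypothetical review scores for five listings
scores = pd.DataFrame({
    'review_scores_rating': [4.9, 4.5, 3.8, 4.2, 2.5],
    'review_scores_value':  [4.7, 4.3, 4.0, 4.0, 3.0],
})

# Same rule as in the notebook: mean over the review-score columns, strict comparison
mean_score = scores.filter(regex='review_score').mean(axis=1)
share_good = {t: float((mean_score > t).mean()) for t in (3.5, 4.0, 4.5)}
print(share_good)
```

Because the comparison is strict, a listing whose mean score equals the threshold exactly is labeled 'bad'.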

Now I limit the analysis to the subset of the data that will later be used for training.

In [126]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
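
With classes this imbalanced, a plain random split can leave the train and test parts with slightly different shares of 'bad' listings. One option, shown here only as a sketch on a hypothetical toy frame and not as what this notebook does, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 'good' (1) and 10 'bad' (0) listings
toy = pd.DataFrame({'price': range(100), 'target': [1] * 90 + [0] * 10})

# stratify=toy['target'] preserves the 90/10 class ratio in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['target']
)
print(toy_train['target'].mean(), toy_test['target'].mean())
```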

Now I want to remove the outliers for a meaningful assessment of price differences. First, I show that outliers are indeed present: the maximum value in the 'price' column (99,140) is far above both the mean (about 160) and the 75th percentile (176).

In [128]:
train_df.price.describe()

count    35220.000000
mean       159.910393
std        586.693301
min          0.000000
25%         72.000000
50%        110.000000
75%        176.000000
max      99140.000000
Name: price, dtype: float64

So let's remove the outliers by keeping only the listings priced below the 95th percentile.

In [129]:
train_df = train_df[train_df.price < train_df["price"].quantile(0.95)]
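
To illustrate why the cut matters, here is a small sketch on made-up prices (hypothetical values, not the dataset) in which a single extreme listing dominates the mean until the same 95th-percentile filter is applied.

```python
import pandas as pd

# Hypothetical prices: 19 ordinary listings (60, 70, ..., 240) and one extreme value
toy_prices = pd.Series([float(p) for p in range(60, 250, 10)] + [99140.0])

cutoff = toy_prices.quantile(0.95)        # linear interpolation between the two largest values
trimmed = toy_prices[toy_prices < cutoff]  # same strict filter as in the notebook

print(f"mean before: {toy_prices.mean():.1f}, mean after: {trimmed.mean():.1f}")
```

The filter is one-sided: it removes only the most expensive listings and keeps listings with a price of 0.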

In [130]:
px.box(y=train_df.price, x=train_df.target)

Now I want to calculate the average price and availability for each group of listings

In [142]:
cols = ['price', 'availability_30', 'availability_60', 'availability_90', 'availability_365']
# Select the numeric columns before aggregating, so mean() is never applied to non-numeric columns
agg = train_df.groupby('target')[cols].mean()
agg

target,price,availability_30,availability_60,availability_90,availability_365
0,120.805379,9.448423,21.683717,35.26769,148.757886
1,125.294246,5.498947,13.577619,22.997491,97.532336


While prices do not differ too much, there is a clear difference in availability: over the next 30 days 'bad' listings are available for about 9.4 days on average, against 5.5 days for 'good' ones. A plausible reading is that people are less willing to stay in 'bad' listings, so these stay unoccupied for longer periods of time. Thus, if we can move a listing from 'bad' to 'good', we will increase the number of occupied days and thereby produce more income. How much income exactly? See below.

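Here is what will happen if we increase the number of occupied days of a 'bad' listing to the level of a 'good' listing. The estimate multiplies the average price of a 'good' listing by the difference in available days between the two groups, which assumes that every day a listing is not available is a booked day paid at that price.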

In [161]:
periods = ['1 month', '2 months', '3 months', '1 year']
tmp = agg.to_numpy()
# Average 'good' price times the gap in available days between 'bad' (row 0) and 'good' (row 1) listings
values = agg.iloc[1, 0] * np.abs(tmp[0, 1:] - tmp[1, 1:])
for period, value in zip(periods, values):
    print(f"In {period} we will have +{round(value)} money")


In 1 month we will have +495 money
In 2 months we will have +1016 money
In 3 months we will have +1537 money
In 1 year we will have +6418 money
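
The same arithmetic can be written with column labels instead of positional indices, which makes the formula easier to check: extra income = (average 'good' price) × (available days of 'bad' − available days of 'good'). The sketch below rebuilds the aggregated table from the values printed above; the names `agg_values`, `gap_days` and `extra_income` are introduced here for illustration only.

```python
import pandas as pd

# Group averages copied from the aggregated table above (index: target, 0 = 'bad', 1 = 'good')
agg_values = pd.DataFrame(
    {
        'price': [120.805379, 125.294246],
        'availability_30': [9.448423, 5.498947],
        'availability_60': [21.683717, 13.577619],
        'availability_90': [35.26769, 22.997491],
        'availability_365': [148.757886, 97.532336],
    },
    index=pd.Index([0, 1], name='target'),
)

availability_cols = ['availability_30', 'availability_60', 'availability_90', 'availability_365']

# Extra occupied days a 'bad' listing would gain at the 'good' level of availability
gap_days = agg_values.loc[0, availability_cols] - agg_values.loc[1, availability_cols]

# Value those days at the average price of a 'good' listing
extra_income = (agg_values.loc[1, 'price'] * gap_days).round().astype(int)
print(extra_income)
```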
