## Abstract

Airbnb has a growing share of the accommodation industry all over the globe. It offers different types of accommodation, ranging from shared rooms to villas. In many cities, the short-term rental of flats drives up rental prices and reduces the number of flats available to the city's inhabitants. In some cities the situation is particularly critical. With data from the Inside Airbnb website (http://insideairbnb.com/), I analysed the Airbnb market in Rome, Italy.

In [None]:
# Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [None]:
# Install the missingno package

!pip install missingno

In [None]:
# Defining color schema

c = ['#8e9aaf', '#cbc0d3', '#efd3d7', '#feeafa', '#dee2ff']

In [None]:
# Load dataset into a pandas dataframe

df = pd.read_csv('data/listings_Rome.csv')

In [None]:
# check the first lines of the dataframe

df.head()

In [None]:
# check the variables of the dataframe

df.columns

In [None]:
# check the regions/neighbourhoods of Rome

df['neighbourhood_cleansed'].unique()

### Cleaning Data

In [None]:
# search for patterns in the missing data

import missingno as msno
msno.matrix(df)

Some variables are mostly empty, so it is worth taking a closer look at which ones.

In [None]:
df.info()

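The following variables are empty: neighbourhood_group_cleansed, bathrooms, calendar_updated. (neighbourhood_cleansed is populated and is used in the analysis below.)

In [None]:
# drop the empty columns; drop() returns a new dataframe, so assign the result

df = df.drop(['neighbourhood_group_cleansed', 'bathrooms', 'calendar_updated'], axis=1)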
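
The empty columns can also be found programmatically instead of being read off `df.info()`. The following is a minimal sketch on a hypothetical toy dataframe (`toy`, `empty_cols` and `toy_clean` are illustrative names, not part of the analysis above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'bathrooms': [np.nan, np.nan, np.nan],
    'price': ['$50.00', None, '$80.00'],
})

# Columns in which every single value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# drop() returns a new dataframe, so the result has to be assigned
toy_clean = toy.drop(columns=empty_cols)

print(empty_cols)
print(toy_clean.columns.tolist())
```

Applied to `df`, the same two lines would remove every fully empty column without naming them by hand.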

### Listings and hosts in Rome

In [None]:
# Number of listings in Rome, separate offers

len(df.index)

In [None]:
# Number of hosts

df['host_id'].nunique()

The discrepancy between the number of accommodations offered and the number of hosts shows that some hosts offer more than one accommodation.

In [None]:
# group the listings by the hosts to see how many listings they own

df_host = df.groupby('host_id').id.count().sort_values(ascending = False)

In [None]:
# check the first 25 hosts

df_host.head(25)

In [None]:
# The largest host in Rome offers 239 accommodations

df.loc[df['host_id'] == 23532561].id.count()

In [None]:
# check the largest host in Rome

df[['id', 'host_id', 'name', 'host_name']].loc[df['host_id'] == 23532561]

In [None]:
# visualize the proportion of hosts with one or more offers

y = np.array([9007, 3928])
mylabels = ["More properties", "One property"]

mycolors = c
myexplode = [0.2, 0]

plt.pie(y, labels = mylabels, colors= mycolors, autopct = '%1.1f%%', explode = myexplode, startangle = 90)
plt.title("Number of offers per host")
plt.show() 
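
The two values passed to the pie chart are hard-coded. As a hedged sketch, they could instead be derived from the listings per host. The toy dataframe and the names `listings_per_host`, `n_single` and `n_multi` are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical toy data: host 1 has two listings, host 2 one, host 3 three
toy = pd.DataFrame({
    'id': [10, 11, 12, 13, 14, 15],
    'host_id': [1, 1, 2, 3, 3, 3],
})

# Number of listings per host, as in df_host above
listings_per_host = toy.groupby('host_id')['id'].count()

# Hosts with exactly one listing versus hosts with more than one
n_single = int((listings_per_host == 1).sum())
n_multi = int((listings_per_host > 1).sum())

# Same order as the labels ["More properties", "One property"]
y = [n_multi, n_single]
print(y)
```

Deriving the counts this way keeps the chart in sync with the data if the dataset is updated.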

### Location of the accommodations in Rome

In [None]:
# Name of the regions within Rome

df['neighbourhood_cleansed'].unique()

In [None]:
# Number of accommodation per neighbourhood

df.groupby('neighbourhood_cleansed').id.count()

### License

According to the current regulations in Rome (https://airbtics.com/airbnb-regulation-in-rome/), a **CIR code** is required for short-term hosting.

In [None]:
df[['id', 'host_id', 'license']].head(10)

In [None]:
# Across all listings in Rome only 3649 distinct registration numbers/codes appear

df['license'].nunique()

In [None]:
# 20469 offers do not indicate a licence number...

df['license'].isnull().sum()

In [None]:
df['license'].unique()

In [None]:
# Checking for CIR codes in the dataframe

df['license'] = df['license'].fillna('')
df_license = df[df['license'].str.contains('CIR')]


In [None]:
df_license[['id', 'host_id', 'license']]

Out of 24924 listings in Rome, only 108 indicate a CIR number in the formally correct way.
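
The three cases seen above (no licence, a licence string containing `CIR`, any other string) can be summarised in one step. This is a minimal sketch on hypothetical toy values. It uses the same case-sensitive substring check as the cell above and does not validate the actual format of a CIR code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy licence values, not taken from the dataset
toy = pd.DataFrame({
    'license': [np.nan, 'CIR 12345', 'Exempt', '', 'CIR-678'],
})

# Treat missing values and blank strings alike
lic = toy['license'].fillna('').str.strip()

# Same substring test as above: does the text contain 'CIR'?
has_cir = lic.str.contains('CIR')

# Label every listing with one of three categories
status = np.where(lic == '', 'missing',
                  np.where(has_cir, 'contains CIR', 'other'))

summary = pd.Series(status).value_counts()
print(summary)
```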

### The biggest host in Rome (host_id = 23532561)

In [None]:
# create a dataframe containing only the listings from the largest host

df_tophost = df.query('host_id == 23532561')

In [None]:
# check the first few lines of the new dataframe

df_tophost.head()

In [None]:
# The biggest host in Rome (iFlat) has 3 separate licences for 238 accommodations.

df_tophost['license'].nunique()

In [None]:
# the following licences: 

df_tophost['license'].unique()

In [None]:
df_tophost['calculated_host_listings_count_private_rooms'].unique()

In [None]:
df_tophost['calculated_host_listings_count_entire_homes'].unique()

Out of 239 listings, iFlat offers 238 entire flats and 1 private room.

### Information about the hosts

In [None]:
# check the hosts in more detail

df.groupby(['host_id','host_name','host_since','host_response_time','host_is_superhost','host_listings_count', 'host_total_listings_count']).agg(listings=('id', 'count'))

In [None]:
df.groupby(['host_id','host_name','host_since','host_is_superhost','calculated_host_listings_count']).agg(listings=('id', 'count')).sort_values(by = 'listings', ascending = False)

#### Recode string values with numeric in host_response_time

In [None]:
df['host_response_time'].nunique()

In [None]:
df['host_response_time'].unique()

In [None]:
df.loc[df['host_response_time'] == 'within an hour', 'host_response_time'] = 1


In [None]:
df.loc[df['host_response_time'] == 'within a few hours', 'host_response_time'] = 2

In [None]:
df.loc[df['host_response_time'] == 'within a day', 'host_response_time'] = 3

In [None]:
df.loc[df['host_response_time'] == 'a few days or more', 'host_response_time'] = 4

In [None]:
df['host_response_time'].unique()
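
The four `.loc` assignments above can also be written as a single dictionary lookup. The following is a sketch on a hypothetical toy series. `response_codes` is an illustrative name, and the result is a numeric column (missing values stay `NaN`), which is what a later `mean` needs:

```python
import numpy as np
import pandas as pd

# Same coding as above: 1 = fastest, 4 = slowest
response_codes = {
    'within an hour': 1,
    'within a few hours': 2,
    'within a day': 3,
    'a few days or more': 4,
}

# Hypothetical toy values, including one missing entry
toy = pd.Series(['within an hour', np.nan, 'a few days or more', 'within a day'])

# map() looks every value up in the dictionary; unmatched values become NaN
coded = toy.map(response_codes)

print(coded.dtype)
print(coded.mean())
```

On the real data this would correspond to `df['host_response_time'].map(response_codes)`, applied to the original string values.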

In [None]:
df.groupby(['host_id','host_name','host_since','host_is_superhost','calculated_host_listings_count', 'host_response_time']).agg(listings=('id', 'count'),response_time=('host_response_time', 'mean')).sort_values(by = 'listings', ascending = False)

### Categorizing the hosts based on the number of offers

#### Adding variable host_type
- **1: 1 listing**
- **2: 2 listings**
- **3: 3 listings**
- **4: 4 to 10 listings**
- **5: more than 10 listings**


In [None]:
df['calculated_host_listings_count'].unique()

In [None]:
df.loc[df['calculated_host_listings_count'] == 1, 'host_type'] = 1


In [None]:
df.loc[df['calculated_host_listings_count'] == 2, 'host_type'] = 2

In [None]:
df.loc[df['calculated_host_listings_count'] == 3, 'host_type'] = 3

In [None]:
df.loc[(df['calculated_host_listings_count'] > 3) & (df['calculated_host_listings_count'] < 11), 'host_type'] = 4

In [None]:
df.loc[(df['calculated_host_listings_count'] > 10) & (df['calculated_host_listings_count'] < 250), 'host_type'] = 5

In [None]:
# check the recoded values

df['host_type'].unique()
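
The same five categories can be produced in one step with `pd.cut`. This is a hedged sketch on hypothetical toy counts. Unlike the cells above, it leaves the top category open-ended instead of capping it at 250:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for calculated_host_listings_count
counts = pd.Series([1, 2, 3, 4, 10, 11, 239])

# Right-closed bins: (0,1], (1,2], (2,3], (3,10], (10,inf)
bins = [0, 1, 2, 3, 10, np.inf]

# Labels follow the host_type coding defined above
host_type = pd.cut(counts, bins=bins, labels=[1, 2, 3, 4, 5]).astype(int)

print(host_type.tolist())
```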

#### Hosts response time

In [None]:
# share of each response time per host type, in percent

new_df = df.groupby('host_type')['host_response_time'].value_counts(normalize=True)
new_df = new_df.mul(100).rename('Percent').reset_index()

In [None]:
# visualize the response time of the large hosts 

# set the background color to #242424 
sns.set(rc={'axes.facecolor':'#242424', 'figure.facecolor':'#242424'})

# plot the data with a categorical plot

g = sns.catplot(data=new_df, kind='bar', x='host_type', y= 'Percent', hue='host_response_time', legend = False, palette = c)

titel = plt.title('Host response time by host type')
legend = plt.legend(['within an hour', 'within a few hours', 'within a day', 'a few days or more'], loc=0, frameon=False)
for text in legend.get_texts():
    text.set_color("white")

# change axes labels and ticks to white    
xlabel = plt.xlabel('Host type')
ylabel = plt.ylabel('Percentage')

xlabel.set_color("white")
ylabel.set_color("white")

xtick = plt.xticks(rotation=45, color='white')
g.set_xticklabels(['1 listing','2 listings','3 listings', '4-10 listings', 'above 10'])
ytick = plt.yticks(color="white")

titel.set_color("white")
        
# Turns off grid on the left Axis.
g.ax.grid(False)
sns.despine()

86% of the corporate hosts replied within an hour. However, all the other host types have a similar response time, so no pattern among host types can be identified.

#### Host acceptance rate

In [None]:
df_bighosts = df.query('host_type == 5')

In [None]:
df_bighosts['host_acceptance_rate'] = df_bighosts['host_acceptance_rate'].str.rstrip("%").astype(float)/100

In [None]:
# acceptance rate for large hosts

df_bighosts['host_acceptance_rate'].mean()

In [None]:
df_singlehost = df.query('host_type == 1')

In [None]:
df_singlehost['host_acceptance_rate'] = df_singlehost['host_acceptance_rate'].str.rstrip("%").astype(float)/100

In [None]:
# acceptance rate for hosts with one listing

df_singlehost['host_acceptance_rate'].mean()

In [None]:
df_doublehosts = df.query('host_type == 2')

In [None]:
df_doublehosts['host_acceptance_rate'] = df_doublehosts['host_acceptance_rate'].str.rstrip("%").astype(float)/100

In [None]:
# acceptance rate for hosts with 2 listings

df_doublehosts['host_acceptance_rate'].mean()

In [None]:
df_triplehosts = df.query('host_type == 3')

In [None]:
df_triplehosts['host_acceptance_rate'] = df_triplehosts['host_acceptance_rate'].str.rstrip("%").astype(float)/100

In [None]:
# acceptance rate for hosts with 3 listings

df_triplehosts['host_acceptance_rate'].mean()

In [None]:
df_middlehosts = df.query('host_type == 4')

In [None]:
df_middlehosts['host_acceptance_rate'] = df_middlehosts['host_acceptance_rate'].str.rstrip("%").astype(float)/100

In [None]:
# acceptance rate for hosts with 4-10 listings

df_middlehosts['host_acceptance_rate'].mean()

Possibly commercial hosts (more than 3 flats) have a higher acceptance rate.
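
The five per-type means above repeat the same conversion on separate dataframe slices. As a minimal sketch, the conversion can be done once and the mean taken per `host_type` with a group-by. The toy dataframe and the names `rate` and `mean_rate` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with percentage strings, as in host_acceptance_rate
toy = pd.DataFrame({
    'host_type': [1, 1, 2, 5, 5],
    'host_acceptance_rate': ['80%', '100%', np.nan, '90%', '100%'],
})

# Convert '80%' to 0.80 once for the whole column; NaN stays NaN
rate = toy['host_acceptance_rate'].str.rstrip('%').astype(float) / 100

# Mean acceptance rate per host type (missing values are skipped)
mean_rate = rate.groupby(toy['host_type']).mean()

print(mean_rate)
```

Working on a separate series rather than on a filtered slice also avoids assigning into a copy of `df`.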

In [None]:
fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

label = ['single host', 'double host', 'triple host', '4-10 apartments', 'big hosts']

data = [0.8579, 0.8932, 0.8823, 0.9004, 0.9382]

wedges, texts = ax.pie(data, wedgeprops=dict(width=0.5), startangle=-40)

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")

for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1)/2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = f"angle,angleA=0,angleB={ang}"
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(label[i], xy=(x, y), xytext=(1.35*np.sign(x), 1.4*y),
                horizontalalignment=horizontalalignment, **kw)


plt.show()