<img align="center" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/PanoramaRio.jpg/1920px-PanoramaRio.jpg" />

# Is it a good deal to buy an apartment in Rio and rent it out on Airbnb?

This Jupyter notebook uses data from <a href="http://insideairbnb.com/get-the-data.html">Inside Airbnb</a> collected from 2019 to 2020.

I created a hypothetical situation in which I want to buy an apartment in Rio de Janeiro (my home city) to rent out on Airbnb, and I want to answer the following questions:

* Which neighborhoods are the best for investing in a rental property on Airbnb?
* What are the most and least desired features by tenants?
* Which comments are most associated with the best and the worst ratings, so I can act on them and provide a good experience to guests?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_profiling
import time
import csv
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

#Display more data on screen
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)

import folium
from folium.plugins import HeatMap


%matplotlib inline

# 1. Basic dataset exploration

Here I import the data and run a basic analysis of features, nulls, counts, etc. to get a sense of it.
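
As a minimal sketch of the null check used in this section, the toy frame below (made-up values, not the real listings) shows how the column-wise mean of `isna()` gives the fraction of missing values per feature:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings table (all values are made up).
toy = pd.DataFrame({
    "price": ["$100.00", None, "$250.00", "$80.00"],
    "bedrooms": [1, 2, np.nan, np.nan],
    "name": ["a", "b", "c", "d"],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values.
null_fraction = toy.isna().mean().sort_values(ascending=True)
print(null_fraction)
```

Sorting the result puts the most complete columns first, which makes it easy to spot features that are too sparse to use.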

In [None]:
## Procedure to reduce memory consumption of dataframes. Borrowed from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # float16 keeps only ~3 significant digits, which would distort
                # coordinates such as latitude/longitude, so float32 is the floor here
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


In [None]:
df_listings = pd.read_csv('../input/rio-de-janeiro-brazil-airbnb-data/listings.csv')
df_calendar = pd.read_csv('../input/rio-de-janeiro-brazil-airbnb-data/calendar.csv')
df_reviews = pd.read_csv('../input/rio-de-janeiro-brazil-airbnb-data/reviews.csv')
#df_neighbourhoods = pd.read_csv('../input/rio-de-janeiro-brazil-airbnb-data/neighbourhoods.csv')

df_listings = reduce_mem_usage(df_listings)
df_calendar = reduce_mem_usage(df_calendar)
df_reviews = reduce_mem_usage(df_reviews)

* Listings contains the properties and summary information about each rental
* Calendar contains the availability of each property over time
* Reviews contains the detailed reviews for each property

In [None]:
print(len(df_listings), len(df_calendar), len(df_reviews))

In [None]:
df_listings.head(2)

In [None]:
## Check null values percentages
df_listings.isna().mean().sort_values(ascending=True)

In [None]:
## Remove columns that are not useful here, mainly the URL ones, but keep a
## flag recording whether the property has each URL filled in or not.
df_listings = df_listings.drop(['scrape_id', 'listing_url', 'last_scraped', 
                                'host_acceptance_rate', 'license', 
                                'neighbourhood_group_cleansed', 
                                'jurisdiction_names'], axis=1)
url_columns = [c for c in df_listings.columns if '_url' in c]
for c in url_columns:
    # notnull() is True when the URL is present, matching the 'has_' prefix
    df_listings['has_' + c] = df_listings[c].notnull()

df_listings = df_listings.drop(url_columns, axis=1)

In [None]:
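def convert_price_to_float(col):
    # Strip the currency symbol and thousands separators literally (regex=False);
    # as a regex, '$' would be read as an end-of-string anchor instead.
    return col.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)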

df_listings.price = convert_price_to_float(df_listings.price)
df_listings.monthly_price = convert_price_to_float(df_listings.monthly_price)
df_listings.weekly_price = convert_price_to_float(df_listings.weekly_price)
df_listings.cleaning_fee   = convert_price_to_float(df_listings.cleaning_fee)
df_listings.security_deposit   = convert_price_to_float(df_listings.security_deposit)
df_listings.extra_people   = convert_price_to_float(df_listings.extra_people)

### Plot histograms of some categorical features

In [None]:
df_listings.property_type.value_counts()

In [None]:
df_listings.room_type.hist()

In [None]:
## There's a property with 200 bathrooms and another with 69 beds. Excluding 
## data for visualization

df_listings[(df_listings.bathrooms<10)&(df_listings.beds<20)][['bathrooms','bedrooms'
                ,'beds', 'number_of_reviews', 'reviews_per_month', 'availability_30',
                'availability_60','availability_90','availability_365']].hist(
    figsize=(10, 10), bins=20);

In [None]:
df_listings.minimum_nights.describe()

In [None]:
import seaborn as sns
selected_cols = ['price', 'security_deposit', 'cleaning_fee', 'weekly_price', 'monthly_price']

fig, axes = plt.subplots(1, len(selected_cols))
for i, col in enumerate(selected_cols):
    ax = sns.boxplot(y=df_listings[col], ax=axes.flatten()[i], showfliers = False)
    #ax.set_ylim(df_listings[col].min(), df_listings[col].max())
    ax.set_ylabel(col + ' / Unit')
fig.tight_layout()
plt.show()


In [None]:
df_listings.groupby('neighbourhood_cleansed')['price'].describe().sort_values('50%', ascending=False).head(100)

In [None]:
## Show top expensive properties
df_listings.sort_values('price', ascending=False).head(10)

### Map with properties

In [None]:
#Rio de Janeiro coordinates https://www.latlong.net/place/rio-de-janeiro-brazil-27580.html
rio_map = folium.Map([-22.9032, -43.1929], zoom_start=8) 
HeatMap(df_listings[['latitude','longitude']].dropna(), 
        radius=10, gradient={0.2:'green',0.4:'purple',0.6:'orange',1.0:'red'}).add_to(rio_map)


## Commented because browser wasn't showing markers
#df_listings.sample(frac=0.1).apply(lambda row:folium.Marker(location=[row["latitude"], row["longitude"]], 
#                                         popup=row['name']).add_to(rio_map), axis=1)

display(rio_map)

# 2. Analysis of the features' relation to price and booking

At this point I'm interested in analyzing how features influence the rental price of a property. 
I'll filter out neighborhoods with fewer than 20 properties, because most of my analysis is focused on choosing a neighborhood to buy an apartment in, and having little information adds more noise than signal.
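
A minimal sketch of this kind of group-size filter, on made-up data and with a hypothetical threshold of 2 instead of 20 so the effect is visible:

```python
import pandas as pd

# Toy listings: neighborhood "A" has three listings, "B" has only one (made-up data).
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B"],
    "price": [100.0, 120.0, 90.0, 500.0],
})

min_listings = 2  # hypothetical threshold for the toy data; the notebook uses 20

# groupby(...).filter keeps every row of the groups for which the function returns True.
kept = toy.groupby("neighbourhood_cleansed").filter(lambda g: len(g) >= min_listings)
print(kept)
```

The single listing in "B" is dropped, so one extreme price cannot stand in for a whole neighborhood.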

In [None]:
df_listings_clean = df_listings.groupby("neighbourhood_cleansed").filter(lambda x: len(x) >= 20)


### Correlation Matrix with selected features

In [None]:
selected_numerical_features = ['host_response_time', 'host_response_rate', 
                               'host_verifications', 'bathrooms',
                               'bedrooms', 'beds', 'number_of_reviews', 
                               'review_scores_rating', 'reviews_per_month' ]
selected_categorical_features = ['host_is_superhost', 'host_identity_verified', 
                                 'neighbourhood_cleansed', 
                                 'property_type', 'room_type', 'bed_type', 
                                 'instant_bookable', 'cancellation_policy', 
                                 'has_picture_url',  'has_thumbnail_url', 
                                 'has_host_thumbnail_url']


In [None]:
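# Code from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
# Some of the selected columns (e.g. host_response_time) are stored as text,
# so keep only the numeric ones before computing correlations.
corr = df_listings_clean[selected_numerical_features + ['price']].select_dtypes(include='number').corr()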

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(8, 6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0, vmin=-1, vmax=1, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

In [None]:
#This library allows analysis of categorical features correlation
!pip install dython

In [None]:
from dython.nominal import associations
associations(df_listings_clean[selected_categorical_features + ['price']], figsize=(12,12));

### Price Boxplot by neighborhood

In [None]:
fig = plt.gcf()
fig.set_size_inches( 16, 30)


sns.boxplot(y = 'neighbourhood_cleansed', x='price', data=df_listings_clean, orient='h', 
            showfliers = False, width=1);

In [None]:
fig = plt.gcf()
fig.set_size_inches( 16, 30)

sns.boxplot(y = 'neighbourhood_cleansed', x='reviews_per_month', 
            data=df_listings_clean, orient='h', 
            showfliers = False, width=1);

In [None]:
df_listings_clean.reviews_per_month.describe()

### Analyze the effect of price category on other aspects

Let's first bin the price into categories stored in a new column
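
To illustrate how the bin edges behave, here is a small sketch with made-up prices and the same bins and labels as the cell below. With the default settings of `pd.cut`, each interval is closed on the right, so a price of exactly 100 is still "Low":

```python
import pandas as pd

# Made-up nightly prices chosen to hit the bin edges.
prices = pd.Series([50, 100, 101, 300, 750, 5000])

levels = pd.cut(prices,
                bins=[0, 100, 300, 500, 1000, 999999],
                labels=['Low', 'Mid', 'High', 'Very High', 'Overpriced'])
print(list(levels))
```

Note that a price of 0 falls outside the first interval `(0, 100]` and ends up as a missing value rather than "Low".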

In [None]:
df_listings_clean['price_level'] = pd.cut(df_listings_clean.price, 
                                          bins=[0,100,300,500,1000,999999],
                                          labels=['Low','Mid','High','Very High','Overpriced'])

In [None]:
fig = plt.gcf()
fig.set_size_inches( 8, 6)

sns.boxplot(y = 'price_level', x='reviews_per_month', data=df_listings_clean, orient='h',
            showfliers = False, width=1);

In [None]:
fig = plt.gcf()
fig.set_size_inches( 8, 6)

sns.boxplot(y = 'price_level', x='review_scores_rating', data=df_listings_clean, orient='h', 
            showfliers = False, width=1);

### Analyze Calendar data

In [None]:
df_calendar.price = convert_price_to_float(df_calendar.price)
df_calendar.adjusted_price = convert_price_to_float(df_calendar.adjusted_price)
#df_calendar.date = df_calendar.date.apply(lambda x : datetime.strptime(x , '%Y-%m-%d'))
df_calendar.date = pd.to_datetime(df_calendar.date, format='%Y-%m-%d')

In [None]:
df_calendar.head()

### Plot samples of prices over time

In [None]:
import matplotlib.dates as mdates
myFmt = mdates.DateFormatter('%m/%y')


fig, axes = plt.subplots(5,5, sharex=True, sharey=False, figsize=(20,15))

for i, ax in enumerate(axes.flatten()):
    sample_id = np.random.choice(df_listings.id.unique(),1)[0]
    df_plot = df_calendar[df_calendar.listing_id == sample_id]
    ax.plot(df_plot['date'], df_plot['adjusted_price'])
    ax.xaxis.set_major_formatter(myFmt)
plt.show()

### Average Price over time for all properties

In [None]:
df_plot = df_calendar[['date', 'adjusted_price']].groupby(by='date', as_index=False).mean()
plt.figure(figsize=(8, 6))
plt.title('Average price over time for all properties in Rio de Janeiro')
plt.plot(df_plot['date'], df_plot['adjusted_price'])


# 3. Summary of findings on the property data

* 'name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules' are all descriptive features of the property. Name and description are the only ones with low missing rates (~ 2%), followed by summary (~ 6.1%). The others have more than 40% of missing values. The text is written in both Portuguese and English.

* The most expensive properties are listed for the Olympics and are clearly overpriced, even though they are large houses or apartments. They are candidates to be excluded from the analysis.

* Joá, Itanhangá, Vila Militar and São Conrado are the neighborhoods with wider price ranges.

* Availability is concentrated at the extremes: properties are either never available or available all the time. Properties in between are probably used by their owners and rented out only when they travel. 

* Several neighborhoods have very few listings. They are also candidates to be excluded from further analysis.

* No single feature correlates strongly with price on its own.

* The majority of properties are apartments, which are the focus of this study. Also, at least 50% of the listings have no more than 3 bedrooms, 3 bathrooms and 4 beds.

* The number of reviews per month suggests that hosts usually rent out their properties around once every 2 months. Since the average minimum stay is 4.7 nights, it is reasonable to presume that they rent out their property for at least one week every 2 months. Properties with lower prices are booked more often.

* Looking at a sample of the data, hosts don't share a single pricing strategy. Some keep a fixed price over time, others vary it all the time, and others only raise prices on important dates such as Carnival or New Year's Eve. Averaging over all properties, however, it's clear that prices rise over the weekends.

* The price of the property seems to have little effect on reviews. 
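
The weekend effect mentioned in the list above can be checked by averaging the calendar prices per day of the week. This is only a sketch on a made-up two-week calendar for one hypothetical listing, where Friday and Saturday nights are priced higher; the column names mirror `df_calendar`:

```python
import pandas as pd

# Two weeks of made-up nightly prices; 2020-03-02 is a Monday.
dates = pd.date_range("2020-03-02", periods=14, freq="D")
toy_calendar = pd.DataFrame({"date": dates, "adjusted_price": 200.0})

# dayofweek: Monday=0 ... Sunday=6, so Friday=4 and Saturday=5.
is_weekend_night = toy_calendar["date"].dt.dayofweek.isin([4, 5])
toy_calendar.loc[is_weekend_night, "adjusted_price"] = 260.0

# Average price per weekday name.
avg_by_weekday = (toy_calendar
                  .groupby(toy_calendar["date"].dt.day_name())["adjusted_price"]
                  .mean())
print(avg_by_weekday)
```

Applied to the real calendar, the same grouping would show whether the weekly peaks seen in the average price plot fall on weekend nights.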

# 4. Obtaining (external) data on real estate offers

In [None]:
## Load previously stored data instead of scraping again
external_data_loaded=False
try:
    df_rio_neighbourhood_apartment_avg_prices = pd.read_csv(
        '../input/d/vabatista/airbnb-rio-de-janeiro/rio_neighbourhood_appartment_avg_prices.csv')
    external_data_loaded=True
except FileNotFoundError:
    print("File doesn't exist yet")

In [None]:
# This library scrapes one of the most important real estate websites in Brazil, Zap Imoveis (https://www.zapimoveis.com.br/)
!pip install zapimoveis_scraper

In [None]:
import zapimoveis_scraper as zap
import unidecode

if not external_data_loaded:
    neighborhood_apartment_avg_prices = {}
    neighborhood_list = df_listings.neighbourhood_cleansed.unique()

    for neighborhood in neighborhood_list:
        print("Querying Zap Imoveis for", neighborhood)

        neighborhood_apartment_avg_prices[neighborhood] = 0
        ## calculate the average only with the first page of search engine (20 records at most)
        counter = 0
        ## this is the way zap imóveis passes neighbourhood names as parameter
        unaccented_string = unidecode.unidecode(neighborhood).replace(' ','-').lower()
        try:
            for offer in zap.search(localization="rj+rio-de-janeiro++" + unaccented_string, acao='venda', 
                                    tipo='apartamentos', num_pages=1):
                try:
                    if len(offer.total_area_m2)>0 and int(offer.total_area_m2) > 0 and \
                            len(offer.price)>0:
                        neighborhood_apartment_avg_prices[neighborhood] = \
                                neighborhood_apartment_avg_prices[neighborhood] + \
                                float(offer.price) / int(offer.total_area_m2)
                        counter = counter + 1
                except:
                    #print(offer.price, offer.total_area_m2)
                    continue
            ## only average when at least one valid offer was found, otherwise keep 0
            if counter > 0:
                neighborhood_apartment_avg_prices[neighborhood] = neighborhood_apartment_avg_prices[neighborhood] / counter
            ## rest a while to avoid being blocked by Zap Imoveis
            time.sleep(5)
        except Exception as err:
            ## some neighbourhood names have abbreviations on Zap Imoveis. For instance: Santa Teresa is Sta Teresa.
            print("Couldn't find neighborhood with this name.", neighborhood, err)
            

    with open('./rio_neighbourhood_appartment_avg_prices.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
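        ## despite the column name, the value is a price per square meter (it is computed from total_area_m2)
        writer.writerow(["neighborhood", "avg_price_per_sqfeet"])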
        for key, value in neighborhood_apartment_avg_prices.items():
            writer.writerow([key, value])
    df_rio_neighbourhood_apartment_avg_prices = pd.read_csv('./rio_neighbourhood_appartment_avg_prices.csv')

In [None]:
df_rio_neighbourhood_apartment_avg_prices.sort_values('avg_price_per_sqfeet', ascending=False).head(5)

# 5. Create a model to estimate which property offers the best price/performance

I'll drop some data from my analysis:
* neighborhoods for which I couldn't get prices from the Zap Imoveis site
* listings that aren't apartments
* neighborhoods with few properties 

In [None]:
df_rio_neighbourhood_apartment_avg_prices = df_rio_neighbourhood_apartment_avg_prices[
    df_rio_neighbourhood_apartment_avg_prices.avg_price_per_sqfeet>0]

In [None]:
# work on a copy so the new columns don't leak into df_listings
df_listings_model = df_listings.copy()

df_listings_model['price_level'] = pd.cut(df_listings_model.price, 
                                          bins=[0,100,300,500,1000,999999], 
                                          labels=['Low','Mid','High','Very High','Overpriced'])
df_listings_model = df_listings_model[((df_listings_model.property_type == 'Apartment')&
                                       (df_listings_model.neighbourhood_cleansed.isin(
                                           df_rio_neighbourhood_apartment_avg_prices.neighborhood.unique()))&
                                       (df_listings_model.price_level!='Overpriced'))]
df_listings_model = df_listings_model.groupby("neighbourhood_cleansed").filter(lambda x: len(x) >= 20)

### Assumptions

Here are my assumptions:

* Reviews per month are a proxy for the number of bookings. I'll multiply them by one week (7 days) and by the nightly price to get the performance of each property.
* Availability reflects the hosts' own use of their properties more than a market issue, so I won't deal with it.
* For simplicity, I'll assume the mean price per square meter I collected from Zap Imoveis is a reasonable measure for apartments in each location, although I know it's imprecise. (The column is named `avg_price_per_sqfeet`, but it is computed from `total_area_m2`.)
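
A minimal sketch of the first assumption, using made-up numbers: the monthly performance of a listing is its nightly price times 7 nights per booking times its reviews per month. The variable `nights_per_booking` is introduced here only for illustration.

```python
import pandas as pd

# Two made-up listings.
toy = pd.DataFrame({
    "price": [200.0, 350.0],           # nightly price
    "reviews_per_month": [0.5, 2.0],   # proxy for bookings per month
})

nights_per_booking = 7  # assumption from the list above: one week per booking

toy["property_performance"] = toy["price"] * nights_per_booking * toy["reviews_per_month"]
print(toy)
```

Under this assumption the first listing brings in 700 per month and the second 4900, in the currency of the listing prices.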


In [None]:
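# assign the filled columns back instead of relying on inplace updates of a filtered frame
df_listings_model['price'] = df_listings_model.price.fillna(0)
df_listings_model['reviews_per_month'] = df_listings_model.reviews_per_month.fillna(0)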
df_listings_model['property_performance'] = df_listings_model.price * 7 * df_listings_model.reviews_per_month

### Measure of size

Now I need some way to measure the size of each property. The `square_feet` feature from the data is useless since it's rarely filled in, but I can use the three below
<pre>
feature            fraction of nulls
bathrooms           0.001523
bedrooms            0.001285
beds                0.001285
</pre>

This site (https://www.homify.com.br/livros_de_ideias/5994421/tamanho-ideal-de-quarto-como-definir-na-hora-de-construir) suggests that a bedroom has 10 square meters on average. A bathroom has 3-4 (https://www.uol.com.br/universa/noticias/redacao/2012/11/26/banheiros-de-29-m-a-1245-m-veja-como-deixa-los-funcionais-na-moda-e-ampliar-a-sensacao-de-espaco.htm). I'll also assume that every apartment has a kitchen (4 m²) and a living room (15 m²).
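
The size estimate described above can be sketched as a small function. The function name and parameter names are hypothetical; the default areas are the ones quoted in the paragraph (10 per bedroom, 3.5 per bathroom as the midpoint of 3-4, plus 4 for the kitchen and 15 for the living room), taken as square meters since both sources are Brazilian.

```python
def estimate_size_m2(bedrooms, bathrooms,
                     bedroom_m2=10.0, bathroom_m2=3.5,
                     kitchen_m2=4.0, living_room_m2=15.0):
    """Rough floor area: per-room averages plus one kitchen and one living room."""
    return bedrooms * bedroom_m2 + bathrooms * bathroom_m2 + kitchen_m2 + living_room_m2


# A 2-bedroom, 1-bathroom apartment: 20 + 3.5 + 4 + 15
size = estimate_size_m2(bedrooms=2, bathrooms=1)
print(size)
```

This ignores hallways, balconies and service areas, so it will tend to underestimate real apartments.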

In [None]:
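df_listings_model['bathrooms'] = df_listings_model.bathrooms.fillna(1)
df_listings_model['bedrooms'] = df_listings_model.bedrooms.fillna(1)
# 3.5 per bathroom, 10 per bedroom, plus kitchen (4) and living room (15)
df_listings_model['estimated_size'] = df_listings_model.bathrooms * 3.5 + df_listings_model.bedrooms * 10 + 4 + 15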

Now I'll merge the data with the Zap Imoveis prices and calculate an estimated cost for each property, as well as the total number of booked nights necessary to pay for it.
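
As a worked example of this calculation with made-up numbers (the asking price per unit of area below is hypothetical, not a value from Zap Imoveis):

```python
estimated_size = 42.5          # estimated area of a 2-bedroom, 1-bathroom apartment
avg_price_per_area = 10000.0   # hypothetical average asking price per unit of area
nightly_price = 250.0          # hypothetical Airbnb price per night

# Estimated purchase cost of the apartment.
cost = estimated_size * avg_price_per_area

# Number of booked nights needed for rental income to match the purchase cost.
nights_to_pay = cost / nightly_price
print(cost, nights_to_pay)
```

With these numbers the apartment costs 425000 and needs 1700 booked nights to pay for itself, before any fees, taxes or maintenance.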

In [None]:
df_listings_model = df_listings_model[['id','price', 'reviews_per_month', 'estimated_size', 'property_performance', 'price_level', 'neighbourhood_cleansed']]
df_listings_model = df_listings_model.merge(df_rio_neighbourhood_apartment_avg_prices, how='left', left_on='neighbourhood_cleansed', right_on='neighborhood')

df_listings_model['cost'] = df_listings_model.estimated_size * df_listings_model.avg_price_per_sqfeet
df_listings_model.loc[df_listings_model.price==0, 'price'] = 1 #avoid division by 0
df_listings_model['number_of_bookings_to_pay'] = df_listings_model.cost / df_listings_model.price

In [None]:
df_listings_model.head(10)

### Performance / Price Index

Now I'll create a performance-over-price index from the previous concepts
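
It may help to write the index out. Using shorthand symbols introduced here only for illustration, let $p$ be the nightly price, $r$ the reviews per month, $P$ the `property_performance` and $I$ the `price_performance` index as it is computed in the next cell:

$$
I = \frac{P}{p} = \frac{7\,p\,r}{p} = 7\,r
$$

So, as defined, the nightly price cancels out and the index is proportional to reviews per month. Relating performance to the purchase price instead would mean dividing by `cost`, which is a different index from the one used here.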

In [None]:
df_listings_model['price_performance'] = df_listings_model['property_performance'] / df_listings_model['price']

In [None]:
fig = plt.gcf()
fig.set_size_inches( 16, 30)

sns.boxplot(y = 'neighbourhood_cleansed', x='price_performance', 
            data=df_listings_model, orient='h', showfliers = False, width=1);

In [None]:
fig = plt.gcf()
fig.set_size_inches( 16, 30)

sns.boxplot(y = 'neighbourhood_cleansed', x='number_of_bookings_to_pay', 
            data=df_listings_model, orient='h', showfliers = False, width=1);

In [None]:
fig = plt.gcf()
fig.set_size_inches( 16, 30)

sns.boxplot(y = 'neighbourhood_cleansed', x='property_performance', 
            data=df_listings_model, orient='h', showfliers = False, width=1);

In [None]:
df_listings_model[df_listings_model.neighborhood=='Ipanema'].head(20)

### My conclusion so far:

Looking at the graphs above and the imperfect data I collected, Ipanema seems to be the best option to invest in. It has good booking performance and the best performance-over-price index. The problem is that property in this neighborhood is too expensive, and to get a bank loan you already need enough money to pay for part of it. 
So, better to try another, more affordable neighborhood :-)

## TODO: Analyze relationship between features and review ratings and comments. 