# Become a data-driven Airbnb host 1

Airbnb has changed the way people travel. For a reasonable price, you can find a clean, cute private room in someone's home or property and live like a local. Even better, Airbnb data for many major cities is now available online, so we can be our own data scientists when planning our next trip or shaping a strategy for our listings (if you are a host!).

This notebook is a comprehensive exploration of the Boston Airbnb data (without any modeling yet). Although the analysis can be used in various ways, I'd recommend you imagine that **you are a new host on the market who's planning the debut of your lovely apartment ;)**.

Some of the business questions addressed in this notebook:

* How do hosts describe their listings? Which aspects do they include, and how do they describe them?
* Descriptive analysis of the hosts and their listings
* What do customers say in their reviews? What do they care about?
* Display negative comments on an aspect (e.g. bed, bathrooms)

![](https://images.unsplash.com/photo-1501979376754-2ff867a4f659?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=3300&q=80)

# 1. Import packages and load data

Apart from the classic Python data analysis packages, I will also use a utility script I wrote for my work, `reviewminer.py`. It's on my GitHub (https://github.com/tianyiwangnova/2019_project_ReviewMiner) with the source code and a few example notebooks. We will use this utility to identify what hosts say about their listings and what customers say about their stays.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import datetime  # needed for the datetime.datetime.today() calls further down
from datetime import date
pd.set_option("display.max_columns", 100)

In [None]:
from reviewminer import *

In [None]:
def quick_aspect_opinion_view(df, id_column, content_column):
    '''
    A wrapper around the utility class AspectOpinionExtractor to quickly explore the comments.
    The input table should have an id_column and a content_column (each row is a comment).
    It prints 9 bar charts for the top 9 aspects in the comments; each bar chart shows the popular words people use to describe that aspect.
    '''
    aoe = AspectOpinionExtractor(df, id_column, content_column)
    aoe.aspect_opinon_for_all_comments()
    aoe.popular_aspects_view()
    
def data_first_look(df):
    '''
    print out the shape of a dataframe, the percentage of features without missing values and the columns with high missing rates
    '''
    print(f"Data has in total {df.shape[0]} rows and {df.shape[1]} features.")
    missing_rates = pd.DataFrame(df.isna().mean())\
                    .rename({0:'perc_missing'}, axis=1)\
                    .sort_values('perc_missing', ascending=False)
    print("{}% of the features don't have missing values"\
          .format(round(sum(missing_rates['perc_missing']==0)*100/len(missing_rates), 2)))
    missing_rates[missing_rates['perc_missing']>0].plot.bar(title="Missing rates")

In [None]:
calendar = pd.read_csv("../input/boston/calendar.csv")
listings = pd.read_csv("../input/boston/listings.csv")
reviews = pd.read_csv("../input/boston/reviews.csv")

# 2. Data exploration

## 2.1 Listings data

Let's first take a look at the listings data to see the offerings in Boston. The listings data holds the information for each listing, including the description of the room, the host, house policies, price, review scores, etc. Let's dive into the data to see if there are interesting features we can use for modeling later.

In [None]:
listings.head(3)

In [None]:
plt.rcParams['figure.figsize'] = (20,3)
data_first_look(listings)

Most of the useful columns have low missing rates.
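
To make the numbers printed by `data_first_look` concrete, here is a dependency-free sketch of the same calculation on a few made-up rows. The rows and the `missing_rates` helper are assumptions for illustration only; on a DataFrame, `df.isna().mean()` yields the same per-column fractions.

```python
def missing_rates(rows, columns):
    """Return {column: fraction of rows whose value is None}."""
    n = len(rows)
    return {c: sum(r.get(c) is None for r in rows) / n for c in columns}

# Made-up rows, not taken from the Boston data
toy_rows = [
    {"id": 1, "summary": "Sunny room", "space": None},
    {"id": 2, "summary": "Cozy flat", "space": None},
    {"id": 3, "summary": None, "space": "Two floors"},
    {"id": 4, "summary": "Near the T", "space": None},
]

rates = missing_rates(toy_rows, ["id", "summary", "space"])

# Share of columns with no missing values at all, as a percentage
complete_share = sum(v == 0 for v in rates.values()) * 100 / len(rates)
```

Here `rates` maps `id` to 0.0, `summary` to 0.25 and `space` to 0.75, so one of the three columns is complete.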

## 2.1.1 Description Data
Quite a few features are descriptions (natural language data). Here are some of these columns:

In [None]:
listings[['description','space','neighborhood_overview','house_rules']].head()

To explore these columns, we'd love to see which aspects, and which opinions about them, appear in the text. For example, take this `space` entry:

In [None]:
listings.loc[0,'space']

It talks about `open and cozy feel`, `flat screen TV`, `kitchen`, `yard`, `sitting room`, etc. However, it's hard to extract all the aspects from the text by hand. I'll use a tool I built for work to do the aspect/opinion extraction. The tool extracts the popular opinions for the 9 most popular aspects in the text. Let's take a look! (Only the columns with meaningful insights are shown.)
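
The idea of aspect/opinion extraction can be sketched in a few lines. This is a toy illustration and not how `reviewminer` works internally: it relies on a tiny hand-written opinion word list (an assumption made for this example) and attaches each run of opinion words to the word that follows it. A real extractor would need part-of-speech tagging or parsing to avoid pairing opinions with words like "and".

```python
import re
from collections import defaultdict

# Hand-picked opinion words, assumed for this example only
OPINION_WORDS = {"cozy", "spacious", "sunny", "quiet", "clean", "small"}

def toy_aspect_opinions(text):
    """Map each aspect (the word after a run of opinion words) to those opinions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = defaultdict(list)
    pending = []  # opinion words waiting for the aspect they describe
    for tok in tokens:
        if tok in OPINION_WORDS:
            pending.append(tok)
        elif pending:
            pairs[tok].extend(pending)
            pending = []
    return dict(pairs)

sample = "A cozy kitchen and a spacious, sunny bedroom on a quiet street."
result = toy_aspect_opinions(sample)
```

For the made-up sentence above, `result` pairs `kitchen` with `cozy`, `bedroom` with `spacious` and `sunny`, and `street` with `quiet`.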

> ## Name and summary

In the name and summary fields of the listings, hosts love to mention `private room`, `downtown & south Boston`, `cozy and spacious apartments`, `sunny` and `locations`.

In [None]:
quick_aspect_opinion_view(listings, 'id', 'name')

In [None]:
quick_aspect_opinion_view(listings, 'id', 'summary')

> ## Space

In the space field of the listings, hosts love to mention `kitchen`, `size of the bed`, `floor the apartment is on` and `bathroom`.

In [None]:
quick_aspect_opinion_view(listings, 'id', 'space')

> ## Neighborhood

In the neighborhood field of the listings, hosts love to mention `quiet`, `residential`, `safe`, `historic`, `fashionable and antique shops` and `short walk`.

In [None]:
quick_aspect_opinion_view(listings, 'id', 'neighborhood_overview')

> ## Transit

It seems that the green and red lines are the popular lines in Boston. Other popular aspects include `airport` and `parking`.

In [None]:
quick_aspect_opinion_view(listings, 'id', 'transit')

> ## House rules

Popular aspects in house rules include `guests`, `smoking`, `pets` and `parties`.

In [None]:
quick_aspect_opinion_view(listings, 'id', 'house_rules')

## 2.1.2 Categorical/Numerical Data

In [None]:
today = date.today()

In [None]:
datetime.datetime.today().strftime('%Y-%m-%d')

> ## Information about the hosts

The plots below summarize the information about the hosts. We can see that more and more hosts have joined in recent years. Response rates and acceptance rates are generally very high. Only a small portion of the hosts are superhosts. Most hosts have fewer than 10 listings, while some hosts have several hundred. About 30% of the hosts haven't had their identity verified yet.

In [None]:
host_age = listings['host_since'].apply(lambda x: (datetime.datetime.today() - pd.to_datetime(x)).days/365)
reponse_rate = listings['host_response_rate'].dropna().apply(lambda x: float(x.split("%")[0])/100)
acceptance_rate = listings['host_acceptance_rate'].dropna().apply(lambda x: float(x.split("%")[0])/100)
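
The two conversions in this cell can be sketched without pandas. The helper names `parse_rate` and `years_since` are hypothetical, and the ISO `YYYY-MM-DD` format for `host_since` is an assumption. Using a fixed reference date instead of today's date keeps the result the same on every run.

```python
from datetime import date, datetime

def parse_rate(value):
    """Turn '95%' into 0.95; return None for missing or non-string values."""
    if not isinstance(value, str) or not value.strip().endswith("%"):
        return None
    return float(value.strip().rstrip("%")) / 100

def years_since(date_string, reference):
    """Years between a date string (assumed YYYY-MM-DD) and a reference date."""
    start = datetime.strptime(date_string, "%Y-%m-%d").date()
    return (reference - start).days / 365

# Made-up inputs, including the kinds of missing values a CSV column can hold
rates = [parse_rate(v) for v in ["95%", "100%", None, float("nan")]]
tenure = years_since("2015-01-01", date(2016, 1, 1))
```

Returning `None` for missing values keeps the output aligned with the input rows, whereas `dropna()` shortens the series.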

In [None]:
plt.rcParams['figure.figsize'] = (20,20)

plt.subplot(3,3,1)
plt.hist(host_age)
plt.title('Years of hosting experience')

plt.subplot(3,3,2)
plt.hist(reponse_rate)
plt.title('Response rate')

plt.subplot(3,3,3)
plt.hist(acceptance_rate)
plt.title('Acceptance rate')

plt.subplot(3,3,4)
listings.groupby('host_is_superhost').count()['id'].plot.bar()
plt.xticks(rotation = 0)
plt.title('Superhost')

plt.subplot(3,3,5)
plt.hist(listings['host_total_listings_count'], bins=80)
plt.title('Host total listings')

plt.subplot(3,3,6)
listings.groupby('host_has_profile_pic').count()['id'].plot.bar()
plt.xticks(rotation = 0)
plt.title('Has profile picture')

plt.subplot(3,3,7)
listings.groupby('host_identity_verified').count()['id'].plot.bar()
plt.xticks(rotation = 0)
plt.title('Identity verified')

plt.subplot(3,3,(8,9))
listings.groupby('host_response_time').count()['id'].plot.bar()
plt.xticks(rotation = 0)
plt.title('Response time')

plt.subplots_adjust(hspace = 0.3)

> ## Information about the rooms

We can see the popular neighborhoods below. Most of the properties are apartments. Most listings offer one bedroom and target 1-2 guests.

In [None]:
plt.rcParams['figure.figsize'] = (20,20)

plt.subplot(3,3,(1,2))
listings.groupby('neighbourhood_cleansed').count()['id'].sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation = 45)
plt.title('Popular Neighborhoods')

plt.subplot(3,3,3)
listings.groupby('property_type').count()['id'].sort_values(ascending=False)[:5].plot.bar()
plt.xticks(rotation = 45)
plt.title('Property type')

plt.subplot(3,3,4)
listings.groupby('room_type').count()['id'].sort_values(ascending=False)[:5].plot.bar()
plt.xticks(rotation = 45)
plt.title('Room type')

plt.subplot(3,3,5)
plt.hist(listings['accommodates'])
plt.title('Accommodates')

plt.subplot(3,3,6)
plt.hist(listings['bathrooms'])
plt.title('Bathrooms')

plt.subplot(3,3,7)
plt.hist(listings['bedrooms'])
plt.title('Bedrooms')

plt.subplot(3,3,8)
plt.hist(listings['beds'])
plt.title('Beds')

plt.subplot(3,3,9)
listings.groupby('bed_type').count()['id'].sort_values(ascending=False)[:5].plot.bar()
plt.xticks(rotation = 45)
plt.title('Bed type')

plt.subplots_adjust(wspace = 0.3)
plt.subplots_adjust(hspace = 1)

As for amenities, we collect every amenity mentioned across the listings:

In [None]:
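# Strip the surrounding braces and the quotes, then split each listing's amenities on commas
amenity_lists = listings['amenities'].apply(lambda x: [i for i in x[1:-1].replace("\"", "").split(",") if i != ""])
amenities = sorted(set(amenity_lists.sum()))
# One boolean column per amenity. Exact membership avoids the false positives of
# str.contains (substrings such as "TV" in "Cable TV", or names containing regex characters)
amenities_pd = pd.DataFrame({a: amenity_lists.apply(lambda items: a in items) for a in amenities})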
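
A sketch of the amenity parsing on a single made-up value, showing how a substring or regex test (such as pandas' `str.contains`, which treats the pattern as a regular expression by default) can disagree with an exact match on the amenity name. The example strings and the `parse_amenities` helper are assumptions for illustration.

```python
import re

def parse_amenities(raw):
    """Split a string like '{TV,"Cable TV",Kitchen}' into amenity names."""
    items = raw[1:-1].replace('"', "").split(",")
    return [item for item in items if item != ""]

raw = '{TV,"Cable TV",Kitchen,"Dog(s)"}'  # made-up example value
names = parse_amenities(raw)

raw_no_tv = '{"Cable TV",Kitchen}'
substring_hit = "TV" in raw_no_tv               # True: matches inside "Cable TV"
exact_hit = "TV" in parse_amenities(raw_no_tv)  # False: no standalone "TV"

# As a regex, "Dog(s)" means the text "Dogs", so it misses the literal "Dog(s)"
regex_hit = re.search("Dog(s)", raw) is not None  # False
```

Matching on the parsed names sidesteps both problems; passing `regex=False` to `str.contains` would fix only the second.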

In [None]:
plt.rcParams['figure.figsize'] = (20,20)
amenities_pd.mean().sort_values().plot.barh()
plt.title("Share of listings offering each amenity")

This is actually an interesting field. We can see that popular amenities are internet, heating, kitchen, essentials (probably things like shampoo), dryer, etc. There are some amenities that only a few listings offer: gym, doorman, hot tub, pool... Listings with rarer amenities may well command higher prices, although this notebook doesn't test that.

> ## Analysis on the review scores

About 75% of the listings have review scores. The distributions of the scores for the various aspects are similar. Most of the listings get high scores. Very few listings have scores lower than 60.

In [None]:
plt.rcParams['figure.figsize'] = (20,20)

plt.subplot(3,3,1)
plt.hist(listings['review_scores_rating'])
plt.title('review_scores_rating')

plt.subplot(3,3,2)
plt.hist(listings['review_scores_accuracy'])
plt.title('review_scores_accuracy')

plt.subplot(3,3,3)
plt.hist(listings['review_scores_cleanliness'])
plt.title('review_scores_cleanliness')

plt.subplot(3,3,4)
plt.hist(listings['review_scores_communication'])
plt.title('review_scores_communication')

plt.subplot(3,3,5)
plt.hist(listings['review_scores_location'])
plt.title('review_scores_location')

plt.subplot(3,3,6)
plt.hist(listings['review_scores_value'])
plt.title('review_scores_value')

## 2.2 Review data

We then take a look at the review data to understand what customers care about in their stays. We use the tool `ReviewMiner` to run the analysis on a sample of the reviews from 2016.

In [None]:
plt.rcParams['figure.figsize'] = (20,3)
data_first_look(reviews)

In [None]:
review2016 = reviews[reviews['date']>='2016-01-01']
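
The filter above compares date strings directly. That works as long as the `date` column holds ISO `YYYY-MM-DD` strings (an assumption here), because zero-padded ISO dates sort lexicographically in chronological order. The sketch below, on made-up dates, also shows that `>= '2016-01-01'` has no upper bound, so any reviews after 2016 would pass the filter too.

```python
from datetime import date

dates = ["2015-12-31", "2016-01-01", "2016-09-06", "2017-02-01"]  # made-up values

# Lower bound only, as in the cell above
from_2016_on = [d for d in dates if d >= "2016-01-01"]

# Restricting to calendar year 2016 needs an upper bound as well
only_2016 = [d for d in dates if "2016-01-01" <= d < "2017-01-01"]

# String order and parsed-date order agree for every pair of these dates
same_order = all(
    (a < b) == (date.fromisoformat(a) < date.fromisoformat(b))
    for a in dates
    for b in dates
)
```

If the data ends in 2016 the two filters return the same rows; parsing with `pd.to_datetime` is the safer choice when the format is not guaranteed.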

In [None]:
pd.set_option('max_colwidth',200)
review2016.head()

In [None]:
review_explore = ReviewMiner(review2016.sample(10000).reset_index(), 'id', 'comments', 'date')

It seems that most of the comments are very positive. Our `AspectOpinionExtractor` only caught very general aspects. It's good to know that most customers had a great experience with their stay. However, we'd love to know what they were not satisfied with. As an example, we took a deeper look at the aspect `bathroom`.

In [None]:
bathroom_reviews = review_explore.investigate("bathroom", topic_modeling=True)

Here are a few negative reviews of bathrooms! ⬇️

In [None]:
bathroom_reviews[:10]
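
For intuition about what "negative comments on an aspect" means, here is a toy filter. It is not `ReviewMiner`'s method, which this notebook only uses through the `investigate` call above. The negative word list, the made-up comments and the `negative_mentions` helper are all assumptions for illustration.

```python
import re

# Hand-picked negative words, assumed for this example only
NEGATIVE_WORDS = {"dirty", "small", "broken", "moldy", "cold"}

def negative_mentions(comments, aspect):
    """Return sentences that mention `aspect` together with a listed negative word."""
    hits = []
    for comment in comments:
        # Split on whitespace that follows sentence-ending punctuation
        for sentence in re.split(r"(?<=[.!?])\s+", comment):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if aspect in words and words & NEGATIVE_WORDS:
                hits.append(sentence)
    return hits

toy_comments = [
    "Great location. The bathroom was dirty and the shower was broken.",
    "Lovely host! The bathroom was spotless.",
    "The room was small but cozy.",
]
found = negative_mentions(toy_comments, "bathroom")
```

Only the second sentence of the first comment is returned. A keyword list like this misses negation ("not clean") and sarcasm, which is where sentiment scoring or topic modeling comes in.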