# Exploring Prices

What factors affect the price of an AirBnB rental? Let's find out!

## Preliminaries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", None)

In [None]:
listings = pd.read_csv("../input/listings.csv")

In [None]:
listings.head()

First we need to fix up the price variable, which is given to us as a string containing dollar signs, dots, and commas. Not ideal.

In [None]:
listings['price'].head()

In [None]:
prices = listings['price'].map(lambda p: int(p[1:-3].replace(",", "")))
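
The slice in the cell above assumes every price string ends in a two-digit cents part such as `.00`. A slightly more forgiving sketch strips the symbols with a regular expression instead. The sample strings below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical price strings in the same "$1,250.00" style as the listings file.
raw = pd.Series(["$250.00", "$1,250.00", "$65.00"])

# Drop dollar signs and thousands separators, then convert to a number.
parsed = raw.str.replace(r"[$,]", "", regex=True).astype(float)
print(parsed.tolist())
```

Keeping the result as a float avoids silently truncating any price that does carry cents.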

With that done let's look at the distribution of prices in general.

In [None]:
prices.describe()

In [None]:
sns.swarmplot(y=prices.sample(200))

Here's that $4,000/night AirBnB, the most expensive one in Boston:

In [None]:
listings.iloc[np.argmax(prices)]

If you [check the listing URL](https://www.airbnb.com/rooms/12972378), this looks like a fake listing, as far as I can tell.

(Tangent: I stayed at the nearby hotel once. The hotel was decent, but the food from the UNO pizza shop on the ground floor was *terrible*.)

[You can rent out mansions on AirBnB](http://www.mirror.co.uk/sport/football/news/inside-neymars-7000-night-airbnb-8130098), just so you know.

In [None]:
listings['price'] = prices

## Exploring Variables

Now let's examine the effects of the various variables we have access to on price.

In [None]:
import matplotlib.pyplot as plt

First, an obvious one: neighborhood.

In [None]:
sort_order = listings.query('price <= 600')\
                    .groupby('neighbourhood_cleansed')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index
sns.boxplot(y='price', x='neighbourhood_cleansed', data=listings.query('price <= 600'), 
            order=sort_order)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()

Superhosts are AirBnB hosts with an established, well-reviewed track record on the platform. I'm not sure what the benefit to the host is, but the benefit to a renter is knowing that the host has built up a good reputation.

I had expected superhosts to charge a premium because of their special status, but I was surprised to find that this is not the case. The boxplots below show no visible difference between superhost prices and normal host prices (this is an eyeball comparison, not a formal test).

Now, keep in mind that we're naively looking at one variable in isolation here. It's possible that there is an effect and it just doesn't show up because superhosts tend to rent out entities (e.g. single bedrooms) that are cheaper than those rented out by the general population (e.g. entire houses or apartments, especially heavily legally debated vacation rentals). We're not going to dive that far in, but this is an interesting (lack of an) effect.

In [None]:
sns.boxplot(y='price', x='host_is_superhost', data=listings.query('price <= 600'))
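
One rough way to probe the caveat about looking at one variable in isolation is to compare medians within each room type rather than across all listings at once. This is only a sketch on a tiny made-up frame: the column names match the dataset, but the rows and the `'t'`/`'f'` coding of `host_is_superhost` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not real listings.
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t", "f", "f", "t"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "price": [70, 60, 210, 200, 180, 80],
})

# Median price per (room type, superhost flag), with the flag spread into columns
# so the two host groups sit side by side within each room type.
by_group = (toy.groupby(["room_type", "host_is_superhost"])["price"]
               .median()
               .unstack("host_is_superhost"))
print(by_group)
```

On the real data, the same call with `listings.query('price <= 600')` in place of `toy` would show whether a superhost gap appears once room type is held fixed.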

Naturally the type of residence has a strong effect on price. Note that the vast majority of observations fall into just four categories: House, Apartment, Condominium, and Townhouse, in that order.

The Boat effect is interesting: people are paying a premium to be on the water.

In [None]:
sort_order = listings.query('price <= 600')\
                    .groupby('property_type')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index
sns.boxplot(y='price', x='property_type', data=listings.query('price <= 600'), order=sort_order)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()

Two more understandable effects: how many walls you get, and whether your bed is real or not.

In [None]:
sort_order = listings.query('price <= 600')\
                    .groupby('room_type')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index
sns.boxplot(y='price', x='room_type', data=listings.query('price <= 600'), order=sort_order)

In [None]:
sort_order = listings.query('price <= 600')\
                    .groupby('bed_type')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index
sns.boxplot(y='price', x='bed_type', data=listings.query('price <= 600'), order=sort_order)

Let's take a look at the market in terms of bedrooms and bathrooms provided. Below is a heatmap showing the number of BnBs with each bathroom/bedroom configuration. Not surprisingly, availability is clustered around 1 bed, 1 bath, with dorms and the like having fewer and large rentals having more. The vast majority of entities fall into a 0x0-to-2x2 box.

In [None]:
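# Select 'price' before aggregating, and pass pivot() its arguments by keyword
# (newer pandas versions no longer accept them positionally).
sns.heatmap(listings.query('price <= 600')\
                .groupby(['bathrooms', 'bedrooms'])['price']\
                .count()\
                .reset_index()\
                .pivot(index='bathrooms', columns='bedrooms', values='price')\
                .sort_index(ascending=False),
            cmap="Greens", fmt='.0f', annot=True, linewidths=0.5)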

And here are the mean prices for each of these configurations. Treat these numbers with skepticism for any cell with a small number of observations in the heatmap above; anything outside the 0x0-to-2x2 box is especially phony-feeling.

In [None]:
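# Select 'price' before taking the mean so non-numeric columns are never averaged,
# and pass pivot() its arguments by keyword.
sns.heatmap(listings.query('price <= 600')\
                .groupby(['bathrooms', 'bedrooms'])['price']\
                .mean()\
                .reset_index()\
                .pivot(index='bathrooms', columns='bedrooms', values='price')\
                .sort_index(ascending=False),
            cmap="Greens", fmt='.0f', annot=True, linewidths=0.5)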
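
One way to act on the skepticism about thinly populated cells is to blank out any configuration with fewer than some minimum number of listings before plotting. The sketch below uses a made-up frame and an arbitrary threshold (`min_count = 2`), both assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not real listings.
toy = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0],
    "bedrooms":  [1.0, 1.0, 2.0, 2.0, 2.0, 4.0],
    "price":     [100, 140, 180, 250, 270, 550],
})

grouped = toy.groupby(["bathrooms", "bedrooms"])["price"]
counts = grouped.count().unstack("bedrooms")  # listings per configuration
means = grouped.mean().unstack("bedrooms")    # mean price per configuration

# Keep a mean only where enough listings back it up; other cells become NaN.
min_count = 2  # arbitrary threshold, chosen for this toy example
reliable_means = means.where(counts >= min_count)
print(reliable_means)
```

Passing `reliable_means` to `sns.heatmap` leaves the NaN cells blank, so only the better-supported configurations are colored.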

Surprisingly enough, zero beds costs you a premium! BnBs with no real bed seem to be special in some regard, justifying their cost elsewhere.

In [None]:
sns.boxplot(y='price', x='beds', data=listings.query('price <= 600'))

In [None]:
listings['amenities'] = listings['amenities'].map(
    lambda amns: "|".join([amn.replace("}", "").replace("{", "").replace('"', "")\
                           for amn in amns.split(",")])
)

Here are the amenities that each BnB provides, sorted by the number of BnBs providing them:

In [None]:
pd.Series(np.concatenate(listings['amenities'].map(lambda amns: amns.split("|"))))\
    .value_counts()\
    .plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=8)
plt.show()

Let's see which amenities go with the highest prices, by taking the mean price of the listings that offer each one.

In [None]:
amenities = np.unique(np.concatenate(listings['amenities'].map(lambda amns: amns.split("|"))))
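# Split before testing membership, so a short amenity name is not also matched
# as a substring of a longer one.
amenity_prices = [(amn, listings[listings['amenities'].map(lambda amns: amn in amns.split("|"))]['price'].mean()) for amn in amenities if amn != ""]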
amenity_srs = pd.Series(data=[a[1] for a in amenity_prices], index=[a[0] for a in amenity_prices])

In [None]:
amenity_srs.sort_values(ascending=False).plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=8)
plt.show()

Washer/Dryer is clearly an outlier as it appears in only one case (for some reason). Here's a slightly clearer picture without it:

In [None]:
amenity_srs.sort_values(ascending=False)[1:].plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=8)
plt.show()
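
Dropping the first bar by position works here, but it relies on the one-off amenity happening to sort first. A sketch of an alternative, assuming the pipe-delimited `amenities` strings built earlier: `Series.str.get_dummies` gives one exact-match indicator column per amenity, from which both the listing count and the mean price can be derived, so rare amenities can be filtered by count. The rows and the `min_listings` threshold below are made up.

```python
import pandas as pd

# Made-up rows for illustration only; not real listings.
toy = pd.DataFrame({
    "amenities": ["TV|Kitchen", "Cable TV|Kitchen", "TV|Washer / Dryer", "Kitchen"],
    "price": [100, 200, 900, 60],
})

# One 0/1 column per distinct amenity (exact names, not substrings).
flags = toy["amenities"].str.get_dummies(sep="|")

n_listings = flags.sum()  # how many listings offer each amenity
mean_price = flags.mul(toy["price"], axis=0).sum() / n_listings

# Keep only amenities backed by at least `min_listings` listings.
min_listings = 2  # arbitrary threshold, chosen for this toy example
common = mean_price[n_listings >= min_listings].sort_values(ascending=False)
print(common)
```

The filtered series can be plotted with `.plot(kind='bar')` just like `amenity_srs`.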

That's it for our tour! Next we should try to model what we found.