# Exploring Hotels on Goibibo

Goibibo is an India-based online hotel-booking company. Like many other hotel-booking sites -- in the United States, `hotels.com` comes to mind -- it gives users an easy way to compare hotel prices and complete a booking in one interface.

In this notebook we will explore a detailed sample of the hotels listed on Goibibo. We will probe the basic dataset attributes and hopefully uncover some interesting effects in the data! This exploratory data analysis notebook is recommended for beginners and those interested in probing this dataset further. Feel free to fork this notebook and/or copy the code here and explore further on your own!

![](https://i.imgur.com/3js0YPV.png)

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)  # full option name; show every column
hotels = pd.read_csv("../input/goibibo_com-travel_sample.csv")
hotels.head(3)

## Data munging

First, cleaning up the data a bit...

In [None]:
import numpy as np

# Helper functions for encoding.
def split_piped_list(srs, col):
    try:
        ret = [r.split("::")[-1] for r in srs[col].split("|")]
        ret = [float(r) if len(r) > 0 else np.nan for r in ret]
        return ret
    except AttributeError:
        return np.nan

# Encode the aggregated 5-star review categories into columns.
ratings = pd.DataFrame(data=hotels.apply(
                                lambda h: split_piped_list(h, 'site_stay_review_rating'), 
                                                           axis='columns').tolist(),
                       columns=['service_quality_rating', 'amenities_rating', 
                                'food_and_drinks_rating', 'value_for_money_rating', 
                                'location_rating', 'cleanliness_rating'])
hotels = hotels.join(ratings)

# Encode the reviews column into separate columns.
review_counts = pd.DataFrame(
    data=(
        hotels
            .apply(lambda h: split_piped_list(h, 'review_count_by_category'), axis='columns')
            .map(lambda v: [0, 0, 0] if isinstance(v, float) else v).tolist()
    ),
    columns=['positive_reviews_total', 'critical_reviews_total', 'reviews_with_images_total']
)
hotels = hotels.join(review_counts)

hotels = hotels.drop(['country', 'sitename', 'review_count_by_category', 
                      'site_stay_review_rating'], axis='columns')
hotels = hotels.drop(hotels['room_count'].idxmax())  # bad entry; idxmax returns the row label that drop expects
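
To make the piped-field parsing above easier to follow, here is a minimal standalone sketch of the same idea applied to a single string. The example string, its labels, and the name `parse_piped_scores` are made up for illustration; the real field's labels and ordering are assumed rather than checked.

```python
def parse_piped_scores(raw):
    """Parse 'Label::score|Label::score' into a list of floats (NaN where a score is empty)."""
    if not isinstance(raw, str):
        return float("nan")  # missing cell, mirroring the helper above
    scores = []
    for item in raw.split("|"):
        value = item.split("::")[-1]  # keep only the part after the label
        scores.append(float(value) if value else float("nan"))
    return scores

# Made-up example string; the real labels and their order are assumptions.
example = "Service Quality::4.1|Amenities::|Location::3.5"
parsed = parse_piped_scores(example)
print(parsed)
```

Empty scores become NaN rather than 0, so they do not drag averages down later on.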

## Hotel star ratings and reviews

Let's take a look at some top-level variables, then drill down a little bit into particulars.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

f, axarr = plt.subplots(2, 2, figsize=(14, 8))
plt.suptitle('Goibibo Hotel Breakdown', fontsize=18)

sns.kdeplot(hotels['site_review_rating'], ax=axarr[0][0])
sns.kdeplot(hotels['site_review_count'], ax=axarr[0][1])
sns.countplot(x=hotels['hotel_star_rating'], ax=axarr[1][0])
sns.kdeplot(hotels['room_count'], ax=axarr[1][1])

sns.despine()

Site reviews are heavily skewed towards 4 stars. This is actually pretty much standard across every platform on the web that does ratings ([relevant XKCD](https://xkcd.com/1098/)), so no surprise here.

Interestingly enough, half of the hotels in the system are 0-rated, i.e. they have no star rating assigned at all. While I am not specifically familiar with the Goibibo product, this seems like a bad thing to me -- a hotel's overall rating is very important for making decisions about where to stay.
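
As a rough sketch of how the share of unrated hotels could be checked, the snippet below uses a small invented series in place of `hotels['hotel_star_rating']`. It assumes, as the text does, that a value of 0 means "no rating assigned".

```python
import pandas as pd

# Toy stand-in for hotels['hotel_star_rating']; the values are invented.
star_ratings = pd.Series([0, 0, 3, 4, 0, 5, 2, 0])

# The mean of a boolean mask is the fraction of True values.
unrated_share = (star_ratings == 0).mean()
print(f"{unrated_share:.0%} of hotels have no star rating")

# Treating 0 as missing keeps it out of any later averages.
rated_only = star_ratings.where(star_ratings != 0)
print(rated_only.mean())
```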

Review counts and room counts are very similar in that they both peak at a couple dozen, with a very long tail of extremely large hotels with much larger numbers of rooms and reviews.
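
One way to put numbers on a long tail like this is to compare the mean with the median and look at the upper quantiles. The sketch below uses invented room counts, not the real column. The log transform at the end is just one common option for plotting such data and is not something this notebook does.

```python
import numpy as np
import pandas as pd

# Invented room counts with one very large outlier.
room_count = pd.Series([12, 18, 20, 24, 25, 30, 40, 60, 150, 900])

# A mean far above the median is a quick sign of a long right tail.
print("mean:", room_count.mean(), "median:", room_count.median())
print(room_count.quantile([0.5, 0.9, 0.99]))

# log1p compresses the tail, which can make density plots easier to read.
log_rooms = np.log1p(room_count)
```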

It is also quite possible that some of the review totals are inflated; paid-for reviews are a notorious problem on sites like this, and the problem may be especially pronounced in India. Check out the website for the following hotel, for example, which is the most heavily reviewed one of them all; nothing on its listing seems to particularly justify this fact...

In [None]:
# Look up the most-reviewed hotel by row label (the index has a gap after the drop above).
hotels.loc[hotels['site_review_count'].idxmax(), 'pageurl']

Does the distribution of ratings change with the number of stars (1-star, 2-star, etc.) the hotel has?

In [None]:
sns.jointplot(x=hotels.hotel_star_rating, y=hotels.site_review_rating)

There is a very strong lift that's visible here. 4-star and especially 5-star hotels have significantly better ratings than lower-starred options. As a rule of thumb, staying at a hotel with 3 or fewer stars risks having a terrible experience; staying at a 4-star hotel means you will have *at a minimum* an unsatisfactory experience (2/5 rating); and staying at a 5-star hotel guarantees you will have at least an *ok* experience (3/5 rating).
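
The "minimum experience" rule of thumb amounts to taking the lowest review rating within each star class. Here is a minimal sketch of that calculation on invented data; the column names match the dataset, but the values do not come from it.

```python
import pandas as pd

# Invented ratings, chosen only to show the shape of the calculation.
toy = pd.DataFrame({
    "hotel_star_rating":  [3, 3, 3, 4, 4, 5, 5],
    "site_review_rating": [1.0, 3.5, 4.2, 2.0, 4.4, 3.0, 4.8],
})

# Worst, typical, and number of review ratings per star class.
by_stars = (toy.groupby("hotel_star_rating")["site_review_rating"]
               .agg(["min", "median", "count"]))
print(by_stars)
```

Keeping `count` next to `min` is useful because a minimum taken over only a handful of hotels is not much of a guarantee.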

Next, let's look at the sentiment and quality of the reviews.

In [None]:
f, axarr = plt.subplots(1, 2, figsize=(14, 4))

review_total = hotels.positive_reviews_total + hotels.critical_reviews_total
sns.violinplot(x=hotels.positive_reviews_total / review_total, ax=axarr[0], color='lightgreen')
sns.violinplot(x=hotels.reviews_with_images_total / review_total, ax=axarr[1], color='lightgreen')

axarr[0].set_title("Share of Reviews that are Positive")
axarr[1].set_title("Ratio of Reviews with Images")
plt.suptitle('Goibibo Hotel Review Ratios', fontsize=18, y=1.08)

sns.despine()

Unsurprisingly, positive reviews are the norm. 75% of hotels have a 75%-or-better positive review ratio.

Reviews with images are the really high-quality reviews, because they involve more work and get more attention from readers (and also take more effort to falsify, BTW). From this plot we see that the percentage of reviews which have images tends to be around 10 to 20 percent or so.
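
Statements like "75% of hotels are at or above some ratio" correspond to the first quartile of the ratio: if `quantile(0.25)` comes out at 0.75, three quarters of hotels sit at or above 75% positive. Below is a small sketch on invented counts. The column names follow the ones created during munging, and the numbers are illustrative only.

```python
import pandas as pd

# Invented review counts; the last hotel has no reviews at all.
toy = pd.DataFrame({
    "positive_reviews_total":    [9, 30, 8, 45, 0],
    "critical_reviews_total":    [1, 10, 8, 5, 0],
    "reviews_with_images_total": [1, 8, 2, 5, 0],
})
total = toy["positive_reviews_total"] + toy["critical_reviews_total"]

# Hotels with no reviews give 0/0 = NaN; quantile() and median() skip NaN.
positive_share = toy["positive_reviews_total"] / total
image_share = toy["reviews_with_images_total"] / total

print(positive_share.quantile(0.25))  # 75% of hotels are at or above this share
print(image_share.median())
```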

Finally, let's see whether or not review ratings differ significantly between the six different review categories: cleanliness, location, value for money, food and drink quality, amenities, and service quality (as well as overall quality and the number of stars the hotel has).

In [None]:
rating_cols = [col for col in hotels.columns if "_rating" in col]
rating_sample = hotels[rating_cols].dropna().sample(100)
rating_sample.columns = [col.replace("_rating", "") for col in rating_cols]

# `height` replaced the older `size` argument in seaborn 0.9.
sns.pairplot(rating_sample, height=1.4)

We see that there is a strong linear relationship amongst all of these variables. The overall hotel rating captures the rating the hotel receives in each subcategory quite well.

However, some combinations of categories *do* have more variance between them than others. For example, cleanliness and location or cleanliness and value for money are relatively weakly correlated, while value for money and amenities or value for money and service quality are relatively strongly correlated.
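
Reading correlation strength off a pair plot is somewhat subjective, so a correlation matrix can serve as a numeric cross-check. The sketch below uses three invented columns, not the real ratings, and ranks each pair of categories once using the upper triangle of the matrix.

```python
import numpy as np
import pandas as pd

# Invented sub-ratings; only the technique carries over to the real data.
toy = pd.DataFrame({
    "cleanliness":     [4.0, 3.5, 4.5, 2.0, 3.0, 4.8],
    "location":        [3.0, 4.5, 3.8, 3.5, 2.5, 4.0],
    "value_for_money": [4.1, 3.4, 4.4, 2.2, 3.1, 4.6],
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))

# Keep each pair once: mask everything except the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values()
print(pairs)
```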

Can you spot any more interesting interactions in this plot? Can you explain why they occur? Anything surprising?

## Hotel locations

The number of hotels in a particular city is strongly, *but not totally*, dependent on the size of the city in question. For example, Goa, with ~1.5 million residents and a strong tourism industry, has about the same number of hotels in this dataset as Bangalore, a city of ~18 million (metro area population). I believe that some of this is due to variance in definitions, regional differences in what platforms hotels in an area use (India is *a really big country*), and possibly data sampling issues on PromptCloud's end; but we should nevertheless have a *reasonable* estimate of the distribution of hotels in India.

The plot below maps them out. Click on a circle to get the name of the represented city; the bigger the circle, the more hotels the city has. Zoom in to see more!

In [None]:
hotels_lat_long = hotels.groupby('city').first().loc[:, ['longitude', 'latitude']].assign(
    n_hotels = hotels.groupby('city').area.count(),
    n_reviews = (hotels.assign(site_review_count=hotels.site_review_count.fillna(0))\
                 .groupby('city').site_review_count.sum())
)

In [None]:
import folium

m = folium.Map(
    location=[21.15, 79.09],
    zoom_start=4
)

max_n_hotels = hotels_lat_long.n_hotels.max()

hotels_lat_long.apply(lambda ll: folium.Circle(radius=200000 * (ll.n_hotels / max_n_hotels),
                                               location=[ll.latitude, ll.longitude],
                                               fill=True,
                                               color='black',
                                               popup=ll.name).add_to(m), axis='columns')
m
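
The map above scales each circle's radius linearly with the hotel count, which means a circle's *area* grows with the square of the count. One possible alternative, not used in this notebook, is square-root scaling, so that area rather than radius is proportional to the count. The function names here are made up for this sketch; the 200000 cap is the radius (in metres) used above.

```python
import math

MAX_RADIUS_M = 200_000  # same cap as the map above, in metres

def linear_radius(n_hotels, max_n_hotels):
    """Radius proportional to the hotel count (what the map above does)."""
    return MAX_RADIUS_M * n_hotels / max_n_hotels

def sqrt_radius(n_hotels, max_n_hotels):
    """Radius proportional to sqrt(count), so circle area tracks the count."""
    return MAX_RADIUS_M * math.sqrt(n_hotels / max_n_hotels)

for n in (25, 100):
    print(n, linear_radius(n, 100), sqrt_radius(n, 100))
```

With square-root scaling, a city with a quarter of the hotels gets a circle with half the radius and therefore a quarter of the area.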

## Effect of amenities on ratings

Lastly, let's look at whether a hotel explicitly listing a specific amenity (remember that this field may not be completely accurate!) has an effect on its star rating and review rating.

In [None]:
import itertools

top_amenities = pd.Series(
    list(itertools.chain(*hotels['room_facilities']\
                             .fillna("")\
                             .map(lambda f: [am.strip() for am in f.split("|")])\
                             .values\
                             .tolist()))).value_counts().head(12).index.values
temp = hotels.assign(amenities=hotels['room_facilities'].fillna("").map(
        lambda f: [am.strip() for am in f.split("|")]))

for amenity in top_amenities:
    temp[amenity] = temp.amenities.map(lambda l: amenity in l)

The top amenities are, unsurprisingly, pretty basic. We will stick with looking at these here.

(If you want to take a look at the effects of rarer amenities, try forking this notebook and writing that check yourself!)
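
For anyone writing that check, here is a compact sketch of the same amenity encoding using pandas' `Series.str.get_dummies`. The facility strings are invented. Note that, unlike the loop above, this version does not strip whitespace around each amenity, so it assumes the pipe-separated values are already clean.

```python
import pandas as pd

# Invented stand-in for hotels['room_facilities'], including a missing value.
facilities = pd.Series([
    "Room Service|TV|Telephone",
    "TV|Ceiling Fan",
    None,
    "Room Service|Air Conditioning",
])

# One boolean column per amenity; rows with no data get False everywhere.
flags = facilities.fillna("").str.get_dummies(sep="|").astype(bool)

# Amenities ranked by how many hotels list them.
top = flags.sum().sort_values(ascending=False).head(2)
print(top)
```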

In [None]:
top_amenities

The plots below demonstrate that, indeed, amenities can have strong effects on hotel ratings. Most surprisingly, having room service is associated with a downward trend, while housekeeping is associated with an upward one! This is not due to the amenities themselves -- nobody would rather not have room service than have it -- but rather tells us something about the kinds of hotels these fields get attached to. Perhaps housekeeping is an upgraded version of room service specific to higher-quality hotels!

Some of the other results are more surprising still.
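
The same with/without comparison can also be read off as numbers by grouping on an amenity flag. The data below is invented purely to show the mechanics; it is arranged to mirror the direction described in the text and is not a result from the dataset.

```python
import pandas as pd

# Invented flags and star ratings, for illustration only.
toy = pd.DataFrame({
    "Room Service":      [True, True, False, False, True, False],
    "hotel_star_rating": [2, 3, 4, 5, 3, 4],
})

# Mean star rating and sample size for hotels without / with the amenity.
summary = toy.groupby("Room Service")["hotel_star_rating"].agg(["mean", "count"])
print(summary)
```

Checking `count` alongside `mean` helps spot comparisons that rest on very few hotels.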

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

f, axarr = plt.subplots(3, 4, figsize=(14, 8))
f.subplots_adjust(hspace=1)

# pointplot is the axes-level plot that factorplot drew by default,
# so each call can target one subplot through `ax`.
complete = temp.dropna()

# (amenity column, y variable, panel title), in row-major panel order.
panels = [
    ('Room Service', 'hotel_star_rating', "Room Service?"),
    ('Basic Bathroom Amenities', 'site_review_rating', "Basic Bathroom Amenities?"),
    ('Hot / Cold Running Water', 'site_review_rating', "Hot / Cold Running Water?"),
    ('Housekeeping', 'site_review_rating', "Housekeeping?"),
    ('Ceiling Fan', 'site_review_rating', "Ceiling Fan?"),
    ('Air Conditioning', 'site_review_rating', "Air Conditioning?"),
    ('Cable / Satellite / Pay TV available', 'site_review_rating', "Cable / Satellite / Pay TV?"),
    ('Attached Bathroom', 'site_review_rating', "Attached Bathroom?"),
    ('Telephone', 'site_review_rating', "Telephone?"),
    ('Mirror', 'site_review_rating', "Mirror?"),
    ('TV', 'site_review_rating', "TV?"),
    ('Desk in Room', 'site_review_rating', "Desk in Room?"),
]

for ax, (amenity, y, title) in zip(axarr.flat, panels):
    sns.pointplot(x=amenity, y=y, data=complete, ax=ax)
    ax.set_title(title)

## Further ideas

That's all for now, folks! Here are some ideas for further exploration:

* Try exploring some other amenity categories. What do you see?
* Try applying some natural language processing algorithms to the hotel descriptions, a field that we have not even touched in this brief tour. What are some common words and phrases? How do they relate to the amenities the hotel offers?
* What can you discover by drilling down further into hotels in different regions?
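
As a possible starting point for the natural language idea above, here is a very small word-frequency sketch using only the standard library. The descriptions and the stopword list are invented, and the name of the description column in the dataset is not assumed here, so you would need to swap in the real field.

```python
import re
from collections import Counter

# Invented descriptions standing in for the dataset's description field.
descriptions = [
    "Budget hotel near the railway station with free wifi",
    "Beach resort with pool, free wifi and free breakfast",
]
STOPWORDS = {"the", "with", "and", "near"}  # tiny ad-hoc list, extend as needed

# Lowercase, split into alphabetic tokens, and count everything but stopwords.
words = Counter(
    token
    for text in descriptions
    for token in re.findall(r"[a-z]+", text.lower())
    if token not in STOPWORDS
)
print(words.most_common(3))
```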

If you found this notebook helpful, try applying the techniques here to [some of the other hotel datasets on Kaggle](https://www.kaggle.com/PromptCloudHQ/datasets)! What will you find?