In [12]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

# 1. Data collection

## Overview
In this notebook I collect data by web scraping two websites that list hotels in Sozopol, a Bulgarian Black Sea resort town. The first website (https://hoteli-sozopol.pochivka.bg) provides hotel details: star category, customer rating, summer-season prices for double and triple rooms, and the amenities each hotel offers. The second website (https://booking.com) provides customer ratings and reviews, both positive and negative.

The collected data is intended for educational and research purposes only. It is not used for advertising, commercial purposes, or any other form of misuse; it serves solely to demonstrate data analysis methods.

## Web Scraping

### Collect structured data
The data was fetched on 12.08.2024 from https://hoteli-sozopol.pochivka.bg

The function extract_links_from_thumbs fetches a listing page, extracts the hotel links from its thumbnail blocks, skips offer ("oferti") links, and returns the remaining URLs:

In [13]:
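def extract_links_from_thumbs(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Each hotel on a listing page is wrapped in a <div class="thumb">.
    thumb_divs = soup.find_all("div", class_="thumb")

    links = []
    for thumb in thumb_divs:
        link_tag = thumb.find("a", href=True)
        if not link_tag:
            continue

        href = link_tag["href"]
        # Skip special-offer ("oferti") links.
        if "oferti" in href.lower():
            continue

        # Protocol-relative links ("//host/path") need an explicit scheme.
        if not href.startswith("https://"):
            href = f"https:{href}"
        # Keep every hotel link, including ones that were already absolute.
        links.append(href)
    return links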
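
The scheme handling in extract_links_from_thumbs can also be delegated to the standard library. A minimal sketch, assuming the listing pages use protocol-relative links such as //host/path; the helper name normalize_href and the example paths are hypothetical:

```python
from urllib.parse import urljoin


def normalize_href(page_url, href):
    """Resolve href against the page it was found on.

    urljoin handles absolute, protocol-relative ("//host/path"),
    and root-relative ("/path") links in one call.
    """
    return urljoin(page_url, href)


page = "https://hoteli-sozopol.pochivka.bg/1"
protocol_relative = normalize_href(page, "//pochivka.bg/example-hotel")
already_absolute = normalize_href(page, "https://pochivka.bg/example-hotel")
root_relative = normalize_href(page, "/example-hotel")
print(protocol_relative, already_absolute, root_relative, sep="\n")
```

Unlike the string prefix check, this also covers root-relative links, should the site ever use them.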

In [14]:
all_urls = [
    "https://hoteli-sozopol.pochivka.bg/1",
    "https://hoteli-sozopol.pochivka.bg/2",
    "https://hoteli-sozopol.pochivka.bg/3",
    "https://hoteli-sozopol.pochivka.bg/4",
    "https://hoteli-sozopol.pochivka.bg/5",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/2",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/3",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/4",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/5",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/6",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/7",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/8",
    "https://pochivka.bg/semeyni-hoteli-sozopol-t65/9",
    ]

all_links = []
for url in all_urls:
    all_links.extend(extract_links_from_thumbs(url))
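
Because the listing pages follow a regular pattern, the same list can be generated instead of typed out. A sketch, assuming the page counts (5 and 9) stay as they were on the fetch date; the helper name build_listing_urls is hypothetical:

```python
def build_listing_urls():
    """Rebuild the 14 listing-page URLs used in this notebook."""
    hotels_base = "https://hoteli-sozopol.pochivka.bg"
    family_base = "https://pochivka.bg/semeyni-hoteli-sozopol-t65"

    # Hotel listing: pages 1..5, each with an explicit page number.
    urls = [f"{hotels_base}/{page}" for page in range(1, 6)]
    # Family-hotel listing: the first page has no suffix, then pages 2..9.
    urls.append(family_base)
    urls.extend(f"{family_base}/{page}" for page in range(2, 10))
    return urls


listing_urls = build_listing_urls()
print(len(listing_urls))
```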

In [15]:
len(all_links)

259

In [16]:
unique_links = set(all_links)
len(unique_links)

215
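
A set removes the 44 duplicate links (259 - 215) but does not keep the order in which the links were found. If a reproducible order matters, one order-preserving alternative is sketched here; the helper name unique_in_order and the sample links are illustrative only:

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


sample_links = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/1",
]
deduplicated = unique_in_order(sample_links)
print(deduplicated)
```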

The function extract_hotel_details scrapes a single hotel page. It returns a dictionary with the extracted details and a second dictionary that maps the hotel name to its reviews link:

In [62]:
def extract_hotel_details(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    hotel_name = soup.find("h1", itemprop="name").text.strip()
    location = soup.find("div", class_="sub-title").text.strip()
    
    rating = soup.find("span", class_="rating-score")
    rating = rating.text.strip() if rating else None

    hotels_reviews_links = {}
    review_link_tag = soup.find("a", itemprop="aggregateRating", class_="pull-right property-rating")
    review_link = review_link_tag["href"] if review_link_tag else None
    hotels_reviews_links[hotel_name] = review_link
    
    hotel_stars = soup.find("div", class_="rating")
    if hotel_stars: 
        star_spans = hotel_stars.find_all("span")
        # Gray spans mark missing stars; .get() avoids a KeyError on spans without a class.
        stars_count = 5 - len([span for span in star_spans if "gray" in span.get("class", [])])
    else:
        stars_count = None
    
    prices_table = soup.find("table", class_="prices") 
    
    room_type1, occupancy1, summer_price1 = None, None, None
    room_type2, occupancy2, summer_price2 = None, None, None
    room_type3, occupancy3, summer_price3 = None, None, None

    if prices_table:
        price_rows = prices_table.find_all("tr")
        room_type1 = price_rows[1].find_all("td")[0].text.strip()
        occupancy1 = price_rows[1].find_all("td")[1].text.strip()
        summer_price1 = price_rows[1].find_all("td")[3].text.strip()
       
        if len(price_rows) - 1 > 1:
            room_type2 = price_rows[2].find_all("td")[0].text.strip()
            occupancy2 = price_rows[2].find_all("td")[1].text.strip()
            summer_price2 = price_rows[2].find_all("td")[3].text.strip()

        if len(price_rows) - 1 > 2:
            room_type3 = price_rows[3].find_all("td")[0].text.strip()
            occupancy3 = price_rows[3].find_all("td")[1].text.strip()
            summer_price3 = price_rows[3].find_all("td")[3].text.strip()
        
    amenities_list = []
    amenities_div = soup.find("div", class_="extras")
    if amenities_div:
        amenities_ul = amenities_div.find("ul")
        if amenities_ul:
            amenities = amenities_ul.find_all("li")
            for amenity in amenities:
                amenities_list.append(amenity.text.strip())
        
    extracted_hotel_details = {
        "hotel_name": hotel_name,
        "hotel_stars_count": stars_count,
        "location": location,
        "rating": rating,
        "room_type1": room_type1,
        "occupancy1": occupancy1,
        "summer_price1": summer_price1,
        "room_type2": room_type2,
        "occupancy2": occupancy2,
        "summer_price2": summer_price2,
        "room_type3": room_type3,
        "occupancy3": occupancy3,
        "summer_price3": summer_price3,
        "amenities_list": amenities_list,
    }

    return extracted_hotel_details, hotels_reviews_links
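
The three near-identical price blocks in extract_hotel_details follow one pattern: take up to three table rows and read columns 0, 1 and 3. A sketch of that pattern as a loop over already-extracted cell texts; the helper name flatten_price_rows and the sample row are hypothetical, and the value in column 2 is a placeholder because the notebook does not show what that column holds:

```python
def flatten_price_rows(rows, max_rooms=3):
    """Map up to max_rooms price rows to room_typeN/occupancyN/summer_priceN keys.

    rows: list of rows, each a list of cell texts (header row excluded).
    """
    details = {}
    for n in range(1, max_rooms + 1):
        # Use the row only if it exists and has the expected columns.
        cells = rows[n - 1] if n <= len(rows) else None
        has_cells = cells is not None and len(cells) > 3
        details[f"room_type{n}"] = cells[0] if has_cells else None
        details[f"occupancy{n}"] = cells[1] if has_cells else None
        details[f"summer_price{n}"] = cells[3] if has_cells else None
    return details


sample_rows = [["стандартна стая", "2", "(placeholder)", "70 лв"]]
flattened = flatten_price_rows(sample_rows)
print(flattened)
```

Missing or short rows yield None instead of raising an IndexError, which keeps one malformed price table from stopping the whole run.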

In [64]:
hotel_data = []
reviews_data = []

# Visit each hotel page and collect its details and its reviews link
for link in unique_links:
    hotel_details, hotels_reviews_links = extract_hotel_details(link)

    hotel_data.append(hotel_details)
    reviews_data.append(hotels_reviews_links)
    
    print(f"{hotel_details}")
    print(f" LINK: {hotels_reviews_links}\n")

{'hotel_name': 'Семеен хотел Къщата с кивито', 'hotel_stars_count': None, 'location': 'Созопол', 'rating': None, 'room_type1': 'стандартна стая', 'occupancy1': '2', 'summer_price1': '70 лв', 'room_type2': 'стандартна стая', 'occupancy2': '4', 'summer_price2': '90 лв', 'room_type3': 'лукс стая', 'occupancy3': '2', 'summer_price3': '75 лв', 'amenities_list': ['барбекю', 'камина', 'телевизор', 'кабелна тв', 'климатик', 'хладилник', 'мини-бар', 'интернет', 'баня/тоалетна', 'тераса', 'изглед', 'ютия', 'гладене', 'електрическа кана']}
 LINK: {'Семеен хотел Къщата с кивито': None}

{'hotel_name': 'Хотел Антеа Сердика', 'hotel_stars_count': None, 'location': 'Созопол', 'rating': '9.6', 'room_type1': 'апартамент', 'occupancy1': '2', 'summer_price1': '170 лв', 'room_type2': 'апартамент', 'occupancy2': '2', 'summer_price2': '200 лв', 'room_type3': 'лукс апартамент', 'occupancy3': '2', 'summer_price3': '200 лв', 'amenities_list': ['ресторант', 'телевизор', 'кабелна тв', 'сателитна тв', 'климатик',
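
A loop over more than 200 pages can fail partway through on a single bad response or an unexpected page layout. A sketch of a more defensive loop, assuming a pause between requests is acceptable; the function name scrape_all, the delay_seconds parameter, and the stand-in extractor are hypothetical:

```python
import time


def scrape_all(links, extract, delay_seconds=1.0):
    """Call extract(link) for every link, collecting results and failures."""
    results, failures = [], []
    for link in links:
        try:
            results.append(extract(link))
        except Exception as error:  # e.g. network error or a missing page element
            failures.append((link, repr(error)))
        time.sleep(delay_seconds)  # pause between requests to limit load on the site
    return results, failures


def fake_extract(link):
    # Stand-in for extract_hotel_details so the sketch runs without network access.
    if "bad" in link:
        raise ValueError("page layout not recognised")
    return {"url": link}


results, failures = scrape_all(
    ["https://a.example/ok", "https://a.example/bad"],
    fake_extract,
    delay_seconds=0,
)
print(results)
print(failures)
```

With the notebook's own function this would be called as scrape_all(unique_links, extract_hotel_details), and the failures list shows which pages to retry.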

In [65]:
hotels_data = pd.DataFrame(hotel_data)

In [70]:
hotels_data.to_csv("data/hotels_data.csv", index = False)

In [71]:
hotels_data_df = pd.read_csv("data/hotels_data.csv")

### Collect unstructured data
The reviews for the hotels are collected from booking.com.

The function extract_review_details extracts the latest 25 reviews (or fewer, if the hotel has fewer than 25) and their ratings for a given hotel:

In [359]:
def extract_review_details(current_hotel_info):
    for name, url in current_hotel_info.items():
        hotel_name, review_link = name, url

        if not review_link:
            continue 

        response = requests.get(review_link)
        soup = BeautifulSoup(response.content, "html.parser")

        review_containers = soup.find_all("div", class_ = "review_item_review_content")

        reviews = []
        for current_review in review_containers:
            if current_review.find("p", class_ = "review_none"):
                continue

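            # Note: soup.find() searches the whole page, so this picks up the first
            # reviewRating element on the page for every review in the loop.
            rating_value = None
            rating_element = soup.find("span", itemprop="reviewRating")
            if rating_element:
                rating_meta = rating_element.find("meta", itemprop="ratingValue")
                if rating_meta:
                    rating_value = rating_meta["content"]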

            pos_review_content = None
            neg_review_content = None

            for review_type in ["review_neg", "review_pos"]:
                review_section = current_review.find('p', class_=review_type)
                if review_section:
                    content_span = review_section.find("span", itemprop="reviewBody")
                    content = content_span.text.strip() if content_span else None
                    if review_type == "review_neg":
                        neg_review_content = content
                    else:
                        pos_review_content = content

            reviews.append({
                'hotel_name': hotel_name,
                'review_rating_score': rating_value,
                'pos_review_content': pos_review_content,
                'neg_review_content': neg_review_content
            })

        return reviews
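
When a value is looked up inside a loop over page elements, it matters whether find is called on the whole soup or on the current element. A small sketch of the difference; the markup below is invented for the illustration and is not the actual structure of the reviews page:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="review"><meta itemprop="ratingValue" content="10"><p>first</p></li>
  <li class="review"><meta itemprop="ratingValue" content="6"><p>second</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

page_level = []
item_level = []
for item in soup.find_all("li", class_="review"):
    # Searching from the soup always returns the first match on the page.
    page_level.append(soup.find("meta", itemprop="ratingValue")["content"])
    # Searching from the current element returns that element's own rating.
    item_level.append(item.find("meta", itemprop="ratingValue")["content"])

print(page_level)
print(item_level)
```

Whether an element-scoped lookup works on the real page depends on the rating being nested inside the element that is iterated over.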

In [351]:
all_reviews = []

for current_hotel_info in reviews_data:
    hotel_reviews = extract_review_details(current_hotel_info)
    if hotel_reviews:
        all_reviews.extend(hotel_reviews)
        # print(hotel_reviews)

[{'hotel_name': 'Комплекс Вили над залива', 'review_rating_score': '10', 'pos_review_content': 'Супер локация, много е удобна ако искате да посетите Exe beach bar.', 'neg_review_content': 'Широко и удобно, има паркомясто и удобна локация на тихо място.'}, {'hotel_name': 'Комплекс Вили над залива', 'review_rating_score': '10', 'pos_review_content': 'Уникално и любезно отношение от собствениците. Посрещнаха ни и ни изпратиха с усмивка. Нищо общо с отзивите оставени в booking. Истината е ,че рядко доволните клиенти пишат коментари, а напротив. Държи се на тишина и спокойствие, а не на купони и еуфории. Вилите са изключително приятно обзаведени, комфортни, уютни и чистички. Има всичко необходимо като оборудване в кухненския бокс и приятни местенца за хапване на открито. Има място за паркиране. Попитах за домашния си любимец и той не беше допуснат, но в последствие разбрахме, че си има и това адекватна и разбираема причина, а имено здравословна причина в семейството на собсвениците. Със сиг

In [353]:
hotels_reviews_df = pd.DataFrame(all_reviews)

In [366]:
hotels_reviews_df.to_csv("data/reviews_hotels_data.csv")

In [10]:
hotels_reviews_data = pd.read_csv("data/reviews_hotels_data.csv")
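
By default, to_csv writes the DataFrame index as an extra column without a header, and read_csv loads it back as Unnamed: 0. This is the likely origin of the Unnamed: 0 column in the reviews sample in the Results section. A small sketch of the difference, using an in-memory buffer and made-up rows:

```python
import io

import pandas as pd

df = pd.DataFrame({"hotel_name": ["A", "B"], "review_rating_score": [10, 8]})

# Default: the index is written as a first column without a header.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```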

### Results

Structured data:

In [74]:
hotels_data_df.sample(5)

Unnamed: 0,hotel_name,hotel_stars_count,location,rating,room_type1,occupancy1,summer_price1,room_type2,occupancy2,summer_price2,room_type3,occupancy3,summer_price3,amenities_list
62,Семеен хотел Калипсо,3.0,Созопол,,стандартна стая,2.0,70 лв,,,,,,,"['телевизор', 'кабелна тв', 'климатик', 'хлади..."
29,Хотел Боруна,3.0,Созопол,,стандартна стая,2.0,100 лв,стандартна стая,3.0,150 лв,студио,4.0,140 лв,"['паркинг', 'ресторант', 'телевизор', 'кабелна..."
193,Хотел Блу Ориндж,,Созопол,8.2,стандартна стая,2.0,210 лв,,,,,,,"['открит басейн', 'паркинг', 'ресторант', 'ани..."
20,Къща Арес,3.0,Созопол,8.1,стандартна стая,2.0,80 лв,стандартна стая,3.0,120 лв,студио,4.0,160 лв,"['телевизор', 'кабелна тв', 'сателитна тв', 'х..."
97,Комплекс Вили над залива,3.0,Созопол,9.5,вила,6.0,380 лв,,,,,,,"['открит басейн', 'паркинг', 'детски кът', 'де..."


Unstructured data:

In [75]:
hotels_reviews_data.sample(5)

Unnamed: 0.1,Unnamed: 0,hotel_name,review_rating_score,pos_review_content,neg_review_content
170,170,Хотел Сиена Хаус,9,"Стаите са големи, има паркинг, имахме възможно...","Няма асансьор и това следва да се упомене, защ..."
1617,1617,Хотел Лагуна Бийч,8,1.Най-напред изключително любезно обслужване о...,"Паркингът е максимално оползотворен, но не е с..."
991,991,Хотел Св. Зосим,10,Хората в хотела бяха много усмихнати и любезни !,"Единствената ми забележка е, че трябва да има ..."
769,769,Ваканционно селище Санта Марина,9,"Просторен апартамент, с 2 телевизора, приятна ...",Споделих вече някои впечатления на ръководство...
2137,2137,Хотел Голдън Плейс,7,"Чистота, комфорт, бутиков стил на стаите и бли...",
