<h1><center>Scraping Hotel Details of a City on TripAdvisor</center></h1>

![](https://i.imgur.com/lwsc7HJ.png)

<p><b>Data</b> has become a major part of our day-to-day lives. Tons of unstructured data are freely available on the web. Automated methods such as<span><b> Web-Scraping</b></span> can collect this unstructured data and convert it to structured data.</p> 
<h4>What is web-scraping?</h4>

- Web scraping is an automated method for obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

Here we scrape the <a href="https://www.tripadvisor.com/">TripAdvisor</a> website to collect the prices that different booking websites offer for the same hotel in a given city. 

<b>TripAdvisor</b> is a travel guide website that supports its users from planning and booking through to taking a trip. We use the following tools to complete this project.
- <b>Python</b> is one of the most popular languages for web scraping, as it has a variety of libraries created specifically for the task.
- <b>Beautiful Soup</b> is a Python library well suited to web scraping. It creates a parse tree that can be used to extract data from the HTML of a web page.
- <b>Selenium WebDriver</b> is a tool for testing the front end of an application. In web scraping it is used to automate the browser.
- <b>Pandas</b> is a Python library used to read and manipulate the data.

![](https://i.imgur.com/hdZCxLh.png)

<b> Project Outline : </b>

<ul>  
 <li>Install and import the required packages.</li>  
 <li>Define the global variables.</li>  
 <li>Create the Selenium webdriver object.</li>  
 <li>Provide the required inputs to the driver to navigate to the hotels page.</li>
 <li>Create a BeautifulSoup object from the loaded page source and parse the hotel details from it.</li>
 <li>Write the parsed data to a CSV file using pandas.</li> 
 <li>Define a main function to run all the above steps.</li> 
 <li>Open the CSV file and view the data using pandas.</li> 
</ul> 

<h4>Install and Import the required packages.</h4>

pip is the standard package management system for Python. The following packages need to be installed for this project. 

<ul>  
 <li>pip install beautifulsoup4 selenium pandas</li>   
</ul> 
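
Before running the notebook it can help to confirm that the three packages are importable. The following is a minimal sketch using only the standard library; `missing_packages` and the `REQUIRED` mapping are hypothetical helpers introduced here for illustration, not part of the project code.

```python
from importlib.util import find_spec

# Map each pip package name to the module name that is actually imported.
REQUIRED = {"beautifulsoup4": "bs4", "selenium": "selenium", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Run: pip install " + " ".join(missing))
else:
    print("All required packages are installed")
```

Note that the pip name and the import name differ for Beautiful Soup (`beautifulsoup4` is installed, `bs4` is imported), which is why the sketch keeps both.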


The following libraries are imported:

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup
import pandas as pd

<h4>Defining the global variables.</h4>


In [None]:
#PATHS
SCRAPING_URL = "https://www.tripadvisor.com/"
CHROME_DRIVER_PATH = r"D:\MLworkSpace\selenium\chromedriver.exe"  # raw string keeps the backslashes literal

#INPUTS
CITY = "Hyderabad"
CHECK_IN = "Tue May 10 2022"
CHECK_OUT = "Wed May 11 2022"
NO_OF_PAGES = 5

#Global Variables
HOTELS_LIST = []
HOTELS_DF = None
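
The `CHECK_IN` and `CHECK_OUT` strings must match the calendar's `aria-label` text exactly, so typing them by hand is error-prone. As a sketch, they could be generated from `datetime.date` objects instead. The helper name `to_calendar_label` is hypothetical, and the format is inferred only from the two example values above.

```python
from datetime import date


def to_calendar_label(day: date) -> str:
    """Format a date like 'Tue May 10 2022' (weekday, month, day, year)."""
    # %a and %b depend on the locale; this assumes the default (English) names.
    return day.strftime("%a %b %d %Y")


check_in_label = to_calendar_label(date(2022, 5, 10))
check_out_label = to_calendar_label(date(2022, 5, 11))
print(check_in_label, "|", check_out_label)  # Tue May 10 2022 | Wed May 11 2022
```

One thing to verify against the live page: `%d` zero-pads single-digit days (for example `05`), and whether the calendar labels do the same is an assumption that the examples above cannot confirm.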

<h4>Create the selenium webdriver object.</h4> 

- We create a webdriver instance for the required browser by providing the path to the chromedriver executable and any additional options. 



In [None]:
def get_driver_object():
    """
    Creates and returns the selenium webdriver object 

    Returns:
        Chromedriver object: This driver object can be used to simulate the webbrowser
    """
    
    # Creating the service object to pass the executable chromedriver path to webdriver
    service_object = Service(executable_path=CHROME_DRIVER_PATH)
    
    # Creating the ChromeOptions object to pass the additional arguments to webdriver
    options = webdriver.ChromeOptions()
    
    # Adding the arguments to ChromeOptions object
    options.add_argument("--headless")           #To run Chrome without a GUI
    options.add_argument("start-maximized")      #To start the window maximised 
    options.add_argument("--disable-extensions") #To disable all the browser extensions
    options.add_argument("--log-level=3")        #To log only fatal errors (level 3)
    options.add_experimental_option(
        "prefs", {"profile.managed_default_content_settings.images": 2}
    )                                           #To block images from loading when the website is opened
    
    
    # Creating the Webdriver object of type Chrome by passing service and options arguments
    driver_object = webdriver.Chrome(service=service_object,options=options)
    
    
    return driver_object

- We open the website by passing the URL to the webdriver instance's get() method.

In [None]:
def get_website_driver(driver=None,url=SCRAPING_URL):
    """Gets the chromedriver object and opens the given URL

    Args:
        driver (Chromedriver, optional): The driver to use. A new one is created with get_driver_object() when omitted.
        url (str, optional): URL of the website. Defaults to SCRAPING_URL.

    Returns:
        Chromedriver: The driver where the given url is opened.
    """
    # Creating the driver here rather than in the signature, so that a browser
    # is started only when the function is called and not when it is defined
    if driver is None:
        driver = get_driver_object()
    print("The webdriver is created") 

    # Opening the URL with the created driver object
    driver.get(url)
    print(f"The URL '{url}' is opened")
    return driver

<h4>Provide the required inputs to the driver to navigate to the hotels page</h4>

- The name of the CITY is typed into the search field using the send_keys("input_text") method.

![](https://i.imgur.com/b7S7Ct2.png)

- The Hotels tab is selected on the page that loads after the city is entered.

![](https://i.imgur.com/El3MYdj.png)


In [None]:
def open_hotels_tab(driver):
    """ Opens the Hotels link with city provided

    Args:
        driver (Chromedriver): The driver where the url is opened.
    """
    #Finding the input tag to enter the CITY name
    city_input_tag = driver.find_element(by=By.XPATH,value="//input[@placeholder='Where to?']")
    
    #Providing the characters of the CITY one by one as the search results are loaded dynamically
    for letter in CITY:
        city_input_tag.send_keys(letter)
    time.sleep(5)
    
    
    # selecting the top search result based on the input provided
    city_input_tag.send_keys(Keys.ARROW_DOWN)
    city_input_tag.send_keys(Keys.ENTER)
    time.sleep(5)
    
    
    # selecting the type as Hotels in the webpage that is loaded
    wait = WebDriverWait(driver,10)
    for _ in range(3):
        try:
            select_hotels_tag = wait.until(EC.presence_of_element_located((By.XPATH,'//span[contains(text(),"Hotels")]')))      
            driver.execute_script("arguments[0].click();", select_hotels_tag)
            break
        except Exception:
            time.sleep(2)
            continue
    print("The Hotels window with the provided city is opened")
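
The try, sleep and retry loop used above is a pattern that appears more than once in this notebook. A minimal sketch of how it could be factored into a reusable helper follows. `retry` and `flaky_click` are hypothetical names, and the stand-in action replaces a real Selenium click so the example runs without a browser.

```python
import time


def retry(action, attempts=3, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Returns the action's result, or re-raises the last exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # narrow this to the exceptions you expect
            last_error = error
            if attempt < attempts:
                sleep(delay)
    raise last_error


# Stand-in action that fails twice before succeeding.
calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"


print(retry(flaky_click, attempts=3, delay=0))  # clicked
```

Unlike the loop above, this helper re-raises the last error when every attempt fails, so a failed click is not silently followed by a success message.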

- The check-in and check-out dates are selected on the loaded page.

![](https://i.imgur.com/ZXJrAbG.png)

In [None]:
def select_check_in(driver):
    """The check in date is selected from the list of available dates 

    Args:
        driver (Chromedriver): The driver instance where the Hotels page is loaded
    """
    # Finding all the date elements in the calendar
    check_in_dates = driver.find_elements(By.CLASS_NAME,"fgeHy")
    
    # Selecting the check in date in the available dates 
    for date in check_in_dates:
        date_val = date.get_attribute("aria-label")
        if date_val == CHECK_IN and date.is_enabled():
            driver.execute_script("arguments[0].click();", date)
            print("Check in date selected")
            # Stop here: the remaining elements may be stale after the click
            break

            
def select_check_out(driver):
    """The check out date is selected from the list of available dates 

    Args:
        driver (Chromedriver): The driver instance where the check in date is selected
    """
    #  After the check in date is selected the web page reloads in the background, so stale element
    #  exceptions are more likely. To avoid this we can use an implicit or explicit wait
    
    
    # Selecting the Check out dates available 
    wait = WebDriverWait(driver,10)
    check_out_dates = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,"fgeHy")))
    
    
    # searching the available dates and selecting the check out date
    for date in check_out_dates:
        date_val = date.get_attribute("aria-label")
        if date_val == CHECK_OUT and date.is_enabled():
            driver.implicitly_wait(10)
            # Locating the check out date element again to avoid a StaleElementReferenceException
            date_element = wait.until(EC.presence_of_element_located((By.XPATH,f"//div[@aria-label='{date_val}']")))
            driver.execute_script("arguments[0].click();", date_element)
            print("Check Out date selected")
            break

def select_check_in_out_dates(driver):
    """The check in and check out dates are selected in the webpage loaded

    Args:
        driver (webdriver): The driver instance where the hotels page is loaded with provided city
    """
    # Moving to the first month that is available
    time.sleep(5)
    left_button = driver.find_element(by=By.CSS_SELECTOR, value=".RGOTE button")
    if left_button.is_enabled():
        driver.execute_script("arguments[0].click();", left_button) 
    
    # Select check in Date
    select_check_in(driver)
    time.sleep(10)
    
    # Select Check out Date
    select_check_out(driver)
    time.sleep(10)
    



- The details are updated to get the list of hotels.
![](https://i.imgur.com/7nQc7CS.png)

In [None]:
def update_button(driver):
    """The check in , check out details are updated to populate the hotel results

    Args:
        driver (Chromedriver): The driver instance where check in and check out dates are selected
    """
    # The web page loads dynamically in the background once the dates are selected,
    # so the click is retried until the button is available
    for _ in range(10):
        try:
            driver.find_element(by=By.XPATH,value='//button[@class="ui_button primary fullwidth"]').click()
            break
        except Exception:
            time.sleep(2)
    print("The web page is loaded with the provided check in and check out dates")

- The function below calls open_hotels_tab(driver), select_check_in_out_dates(driver) and update_button(driver) to open the hotels listing.

In [None]:
def search_hotels(driver):
    """Opens the Hotels page from which the data can be parsed.

    Args:
        driver (Chromedriver): The driver where the url is opened.
    """
    #Opening the Hotels tab with the given city and waiting for it to load
    open_hotels_tab(driver)
    time.sleep(10)
    
    #Selecting the Check In and Check Out Dates
    select_check_in_out_dates(driver)
    
    #Updating the details
    update_button(driver)
    time.sleep(10)

<h4>Create a BeautifulSoup object from the loaded page source and Parse the Hotel's details from BeautifulSoup object</h4>

- driver.page_source is passed to the BeautifulSoup class to create a BeautifulSoup object.
- The various hotel details are parsed from this soup object.
- parse_hotel_details(hotel) is a function that takes a hotel DIV element and parses the hotel information marked below.

![](https://i.imgur.com/uHnXehC.png)


In [None]:
def parse_best_offer(hotel):
    """Parse the best offer hotel details given on tripadvisor

    Args:
        hotel (Beautifulsoup object): The hotel div element which contains hotel details
    
    Returns:
        Dict: returns dictionary containing best offer hotel details. 
    """
    hotel_name = hotel.find("a",class_="property_title").text.strip()[3:]
    hotel_price = hotel.find("div",class_="price").text
    best_price_offered_element = hotel.find("img",class_="provider_logo")
    best_price_offered_by = best_price_offered_element["alt"] if best_price_offered_element is not None else None
    review_count = hotel.find("a",class_="review_count").text 
    return  {
        "Hotel_Name" : hotel_name,
        "Hotel_Price" : hotel_price,
        "Best_Deal_By" : best_price_offered_by,
        "Review_Count" : review_count,
    }
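
The lookups used in `parse_best_offer` can be tried without a browser on a small snippet. The sketch below uses hand-written HTML as a stand-in for one listing. Only the class and attribute names are taken from the functions in this notebook; the hotel name, provider and overall layout are invented, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one listing; the real page markup may differ.
SAMPLE_HTML = """
<div data-prwidget-name="meta_hsx_responsive_listing">
  <a class="property_title">Example Hotel</a>
  <div class="price">₹4,099</div>
  <img class="provider_logo" alt="Example.com">
  <a class="review_count">169 reviews</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
hotel = soup.find("div", {"data-prwidget-name": "meta_hsx_responsive_listing"})

name_tag = hotel.find("a", class_="property_title")
logo_tag = hotel.find("img", class_="provider_logo")
sponsored_tag = hotel.find("span", class_="ui_merchandising_pill")

print(name_tag.text.strip())   # text inside the <a> tag
print(logo_tag["alt"])         # attributes are read like dictionary keys
print(sponsored_tag is None)   # find() returns None when nothing matches
```

Because `find()` returns `None` for a missing element, any lookup that may not match should be checked before `.text` is read, as is already done above for the provider logo.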

- The details of the other offers listed for a hotel are parsed in the function below. 
![](https://i.imgur.com/hEMqCTg.png)

In [None]:
def parse_other_offers(hotel,hotel_details):
    """Parse the hotel details of the other deals given on tripadvisor and add them to the hotel_details dictionary

    Args:
        hotel (Beautifulsoup object): The hotel div element which contains hotel details
        hotel_details : Dictionary containing the best hotel details
    Returns:
        Dict: returns dictionary containing all offer's hotel details. 
    """
    other_deals = hotel.find("div",class_="text-links",).find_all("div",recursive=False)
    for i in range(3):
        try : 
            deal_name_tag = other_deals[i].find("span",class_="vendorInner")
            deal_name = deal_name_tag.text if deal_name_tag is not None else None
            hotel_details[f"next_deal_{i+1}"] = deal_name

            deal_price_tag = other_deals[i].find("div",class_="price")
            deal_price = deal_price_tag.text if deal_price_tag is not None else None
            hotel_details[f"next_deal_{i+1}_price"] = deal_price
        except Exception:
            hotel_details[f"next_deal_{i+1}"] = None
            hotel_details[f"next_deal_{i+1}_price"] = None
    return hotel_details

In [None]:
def parse_hotel_details(hotel):
    """Parse the hotel details from the given hotel div element

    Args:
        hotel (Beautifulsoup object): The hotel div element which contains hotel details
    """
    #declaring the global variables
    global HOTELS_LIST

    #Parsing the best offer Hotel Details
    best_offer_deals = parse_best_offer(hotel)
    
    #Parsing the other offers Hotel Details 
    hotel_details = parse_other_offers(hotel,best_offer_deals)
    
    # Appending the data to the hotels list
    HOTELS_LIST.append(hotel_details)

- Function to create the BeautifulSoup object and parse the hotel details.

In [None]:
def parse_hotels(driver):
    """ To parse the web page using BeautifulSoup

    Args:
        driver (Chromedriver): The driver instance where the hotel details are loaded
    """
    # Getting the HTML page source
    html_source = driver.page_source

    # Creating the BeautifulSoup object with the html source
    soup = BeautifulSoup(html_source,"html.parser")
    
    # Finding all the Hotel Div's in the BeautifulSoup object 
    hotel_tags = soup.find_all("div",{"data-prwidget-name":"meta_hsx_responsive_listing"})
    
    # Parsing the hotel details 
    for hotel in hotel_tags:
        # condition to check if the hotel is sponsored, ignore this hotel if it is sponsored
        sponsored = hotel.find("span",class_="ui_merchandising_pill") is not None
        if not sponsored:
            parse_hotel_details(hotel)
    print("The Hotels details in the current page are parsed")

- After the details on the current page are parsed, the next page is loaded by clicking the next page button.

![](https://i.imgur.com/wR5nODS.png)

In [None]:
def next_page(driver) -> bool:
    """To load the next webpage if it is available

    Args:
        driver (Chromedriver): The driver instance where the hotel details are loaded

    Returns:
        bool: returns True if the next page is loaded, False otherwise
    """
    # Finding the element to load the next page; find_elements returns an empty
    # list instead of raising an exception when the button is not on the page
    next_page_elements = driver.find_elements(By.XPATH,'.//a[@class="nav next ui_button primary"]')
    
    # click on the next page element if it is available
    if next_page_elements and next_page_elements[0].is_enabled():
        driver.execute_script("arguments[0].click();", next_page_elements[0])
        time.sleep(30)
        return True
    return False
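
The parse-then-advance cycle can be checked without a browser by replacing the driver with stand-in functions. The sketch below is one possible variant in which the limit counts the total number of pages parsed and the loop stops as soon as no next page is available. `crawl_pages`, `parse_page` and `go_to_next_page` are hypothetical names used only for this illustration.

```python
def crawl_pages(parse_page, go_to_next_page, max_pages):
    """Parse up to max_pages pages and return how many were parsed."""
    parsed = 0
    while parsed < max_pages:
        parse_page()
        parsed += 1
        # Stop when the limit is reached or there is no next page to load.
        if parsed == max_pages or not go_to_next_page():
            break
    return parsed


# Stand-in "site" with three pages, used only to exercise the loop.
state = {"page": 1, "last_page": 3, "seen": []}


def parse_page():
    state["seen"].append(state["page"])


def go_to_next_page():
    if state["page"] >= state["last_page"]:
        return False
    state["page"] += 1
    return True


print(crawl_pages(parse_page, go_to_next_page, max_pages=5))  # 3
print(state["seen"])                                          # [1, 2, 3]
```

In the real scraper, `parse_page` and `go_to_next_page` would be thin wrappers around `parse_hotels(driver)` and `next_page(driver)`.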

<h4>Write the Parsed data to a CSV file using Pandas</h4>

- Create a Pandas DataFrame object with the list of Hotel details.

- Write the data to a CSV file using pandas.DataFrame.to_csv() method.

In [None]:
def write_to_csv():
    """To write the hotels data into a CSV file using pandas
    """
    #declaring the global variables
    global HOTELS_LIST,HOTELS_DF

    # Creating the pandas DataFrame object
    HOTELS_DF = pd.DataFrame(HOTELS_LIST,index=None)

    # Viewing the DataFrame
    print(f"The number of columns parsed is {HOTELS_DF.shape[1]}")
    print(f"The number of rows parsed is {HOTELS_DF.shape[0]}")

    # Converting the DataFrame to CSV file
    HOTELS_DF.to_csv("hotels_list.csv",index=False)
    print("The CSV file is created at hotels_list.csv")
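
To see what the list-of-dictionaries to CSV step does with missing values, the round trip can be tried on a couple of made-up records. This is a small sketch: the hotel names and provider are placeholders, and an in-memory buffer is used instead of `hotels_list.csv` so that no file is written.

```python
import io

import pandas as pd

# Two stand-in records; the second one has no third-party deal recorded.
records = [
    {"Hotel_Name": "Hotel A", "Hotel_Price": "₹5,500", "next_deal_1": "Example.com"},
    {"Hotel_Name": "Hotel B", "Hotel_Price": "₹4,099", "next_deal_1": None},
]

frame = pd.DataFrame(records)

# Write to an in-memory buffer instead of a file so nothing is left on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)                           # (2, 3)
print(restored["next_deal_1"].isna().tolist())  # [False, True]
```

The `None` value is written as an empty field and read back as a missing value, and the prices survive as strings because `to_csv` quotes fields that contain commas.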

<h4>Defining a main function to run all the above steps.</h4>

In [None]:
def main():
    # Create the driver and load the website
    driver = get_website_driver()
    
    # open the website with details provided   
    search_hotels(driver)
    time.sleep(30)
    
    # Parse the first page, then up to NO_OF_PAGES further pages
    parse_hotels(driver)
    for page in range(NO_OF_PAGES):
        if next_page(driver):
            print(f"The next page is loaded : Page No - {page+2}")
            parse_hotels(driver)
    
    # write the parsed data into a CSV file
    write_to_csv()
    
    # quit the driver once the parsing is completed, which also ends the chromedriver process
    driver.quit()
    print("The driver is closed")

In [None]:
main()

The webdriver is created
The URL 'https://www.tripadvisor.com/' is opened
The Hotels window with the provided city is opened
Check in date selected
Check Out date selected
The web page is loaded with the provided check in and check out dates
The Hotels details in the current page are parsed
The next page is loaded : Page No - 2
The Hotels details in the current page are parsed
The next page is loaded : Page No - 3
The Hotels details in the current page are parsed
The next page is loaded : Page No - 4
The Hotels details in the current page are parsed
The next page is loaded : Page No - 5
The Hotels details in the current page are parsed
The next page is loaded : Page No - 6
The Hotels details in the current page are parsed
The number of columns parsed is 10
The number of rows parsed is 180
The CSV file is created at hotels_list.csv
The driver is closed


<h4>Open the CSV file and View the data using pandas</h4>

In [None]:
hotels_csv_file = pd.read_csv("hotels_list.csv")
hotels_csv_file.head()

| | Hotel_Name | Hotel_Price | Best_Deal_By | Review_Count | next_deal_1 | next_deal_1_price | next_deal_2 | next_deal_2_price | next_deal_3 | next_deal_3_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mercure Hyderabad KCP | ₹5,500 | Agoda.com | 787 reviews | Mercure | ₹5,500 | Trip.com | ₹4,866 | ZenHotels.com | ₹5,554 |
| 1 | Holiday Inn Express Hyderabad Banjara Hills, a... | ₹4,099 | Agoda.com | 169 reviews | Booking.com | ₹4,099 | HIExpress.com | ₹4,099 | Hotels.com | ₹4,099 |
| 2 | The Golkonda Hotel | ₹3,580 | Agoda.com | 2,248 reviews | Hotels.com | ₹4,191 | eDreams | ₹5,400 | Expedia | ₹4,191 |
| 3 | Fairfield by Marriott Hyderabad Gachibowli | ₹4,409 | MakeMyTrip | 338 reviews | Agoda.com | ₹6,400 | Fairfield Inn | ₹6,400 | Booking.com | ₹9,500 |
| 4 | The Park Hyderabad | ₹5,560 | Agoda.com | 3,641 reviews | Booking.com | ₹5,560 | goibibo.com | ₹5,060 | eDreams | ₹5,560 |


<h4>Summary</h4>

- To summarise, we opened the TripAdvisor website and navigated to the hotel listings by providing the required inputs to the Selenium webdriver, which mimicked human actions in the browser.

- To parse the details on the loaded page we used Beautiful Soup, which allowed us to extract the required hotel details from the HTML page source. 

- We used pandas to convert the data to a DataFrame object and save it to a CSV file. 

- The same technique, with some modifications to the functions above, can be used to collect other details available on the website. The collected data can then be used for further analysis.
    

<h4>References:</h4> 
<ol type="i">
<li><a href="https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/workshop-web-scraping-with-selenium-aws">Workshop - Web Scraping with Selenium & AWS</a> - Basics of Selenium and webscraping.</li>
<li><a href="https://www.w3schools.com/html/default.asp">HTML Topics</a> - Basics of HTML Tags.</li>
<li><a href="https://www.w3schools.com/css/default.asp">CSS</a> - Basics of CSS Selectors.</li>
<li><a href="https://www.geeksforgeeks.org/navigation-with-beautifulsoup/">BeautifulSoup Topics</a> - Basics of BeautifulSoup.</li>
<li><a href="https://www.youtube.com/watch?v=IYILCEV5j6s&list=PLUDwpEzHYYLvx6SuogA7Zhb_hZl3sln66">Selenium with Python Playlist</a> - By SDET- QA Automation Techie</li>
<li><a href="https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis">Web Scraping and REST APIs</a> - By Jovian</li>
</ol>

<h4>Future work</h4>

- The same parsing technique can be applied to get other details for a city, such as restaurants, flight deals and car rentals.
- Parsing the details of each individual hotel by visiting the websites that offer these deals.
- A comparison analysis of how prices vary from one website to another.
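
As a first step towards such a comparison, the scraped price strings need to be turned into numbers. A minimal sketch follows, assuming prices are whole amounts as in the sample output above. `parse_amount` and `cheapest_offer` are hypothetical helper names, and the offers are copied from the first row of that sample.

```python
import re


def parse_amount(text):
    """Turn strings such as '₹5,500' or '2,248 reviews' into an int."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def cheapest_offer(offers):
    """Return the (provider, price) pair with the lowest price, or None."""
    priced = [(name, parse_amount(price)) for name, price in offers if price]
    priced = [(name, amount) for name, amount in priced if amount is not None]
    return min(priced, key=lambda pair: pair[1]) if priced else None


# Offers for one hotel, copied from the first row of the sample output.
mercure_offers = [
    ("Agoda.com", "₹5,500"),
    ("Mercure", "₹5,500"),
    ("Trip.com", "₹4,866"),
    ("ZenHotels.com", "₹5,554"),
]

print(parse_amount("2,248 reviews"))   # 2248
print(cheapest_offer(mercure_offers))  # ('Trip.com', 4866)
```

Stripping every non-digit character would misread prices with decimal parts, so the helper would need adjusting if the site ever shows fractional amounts.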

In [None]:
import jovian

In [None]:
jovian.commit(files=["hotels_list.csv"])
