# Web Scraping & Data Analysis
## Table of contents

1. [Introduction](#Introduction)
    - [Required libraries](#Required-libraries)

2. [Step 1: Web scraping](#Step-1:-Web-scraping)
    - [Scraping Booking.com](#Scraping-Booking.com)
    - [Scraping TripAdvisor.com](#Scraping-TripAdvisor.com)

3. [Step 2: Saving the data](#Step-2:-Saving-the-data)

4. [Step 3: Data analysis](#Step-3:-Data-analysis)



## Introduction
[[ go back to the top ]](#Table-of-contents)
 
In this final project.... #TO DO

## Required libraries
[[ go back to the top ]](#Table-of-contents)

This notebook uses several Python packages. Selenium, Beautiful Soup, and pandas are third-party packages that must be installed separately; sqlite3 is part of the Python standard library. The primary libraries that we'll be using are:

* **Selenium**: This package is used to automate web browser interaction from Python.
* **BeautifulSoup**: Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **sqlite3**: Python's built-in interface to SQLite databases.
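
Selenium, Beautiful Soup, and pandas are not bundled with Python, so it can help to check that they are importable before running the rest of the notebook. The sketch below is one possible way to do that with the standard library. `missing_packages` is a hypothetical helper name, and the import names in `required` are assumed from the cells in this notebook (`lxml` is the parser name passed to `BeautifulSoup` later on).

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Import names assumed from this notebook; lxml is the parser used with BeautifulSoup.
required = ["selenium", "bs4", "pandas", "lxml", "sqlite3"]
missing = missing_packages(required)
if missing:
    print("Install these packages before running the notebook:", ", ".join(missing))
else:
    print("All required packages are available.")
```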




In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3


## Step 1: Web scraping

[[ go back to the top ]](#Table-of-contents)

Selenium and Beautiful Soup were used for web scraping.
First,....
To navigate around the web was used ....

## Scraping Booking.com

[[ go back to the top ]](#Table-of-contents)

In [2]:
browser = webdriver.Chrome()

#Load the Booking.com page
browser.get('https://www.booking.com')
assert 'Booking.com' in browser.title

In [3]:
#Click accept button
accept = browser.find_element(By.ID, "onetrust-accept-btn-handler")
accept.click()
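
Banners and pop-ups such as the cookie dialog may appear a moment after the page loads, or not at all, and `find_element` raises an exception when nothing matches. The following is a minimal sketch of a more forgiving click, not the method used in this notebook. `click_if_present` is a hypothetical helper that polls `find_elements` (which returns an empty list instead of raising) until a timeout expires. Selenium also ships `WebDriverWait` for explicit waits, which may be the better choice in practice; the hand-rolled loop here is only an illustration.

```python
import time

def click_if_present(browser, by, value, timeout=5.0, poll=0.25):
    """Click the first element matching (by, value) if it appears within `timeout` seconds.

    Returns True if an element was clicked, False if none appeared in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list when nothing matches
        matches = browser.find_elements(by, value)
        if matches:
            matches[0].click()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With the `browser` and `By` objects from the cells above, a call would look like `click_if_present(browser, By.ID, "onetrust-accept-btn-handler")`.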

In [4]:
#Enter destination
search_box = browser.find_element(By.NAME, 'ss')
search_box.send_keys('London' + Keys.RETURN)

In [None]:
#If needed remove Genius advertisements (the pop-up is not always shown)
genius_elements = browser.find_elements(By.CSS_SELECTOR, '[aria-label="Dismiss sign in information."]')
if genius_elements:
    genius_elements[0].click()

In [6]:
#Open the calendar to enter the check-in date (with no check-out date, the site automatically sets a one-night stay)
calendar_button = browser.find_element(By.CLASS_NAME, 'd47738b911')
calendar_button.click()


In [9]:
#Move forward/backwards to choose the month of the stay
forward_calendar_page = browser.find_element(By.CLASS_NAME,"be298b15fa")
forward_calendar_page.click()

In [125]:
backward_calendar_page = browser.find_element(By.CLASS_NAME,"ab15620a33")
backward_calendar_page.click()


In [10]:
#Enter check in date
check_in_date = browser.find_element(By.CSS_SELECTOR, '[data-date="2023-02-09"]')
check_in_date.click()


In [11]:
submit = browser.find_element(By.CSS_SELECTOR, '[type=submit]')
submit.click()

In [12]:
#To avoid new accommodations, let's set the required review scores
#Clicks the review-score filters for 90, 80, 70 and 60 in turn
for i in range(90, 50, -10):
    review_score = browser.find_element(By.CSS_SELECTOR, f"[data-filters-item='review_score:review_score={i}']")
    review_score.click()


In [13]:
soup = BeautifulSoup(browser.page_source, 'lxml')

In [14]:
hotels = []
for name in soup.find_all('div',{'data-testid':'title'}):
    hotels.append(name.text.strip())
hotels[:5]
len(hotels)

25

In [15]:
ratings = []
for rating in soup.find_all('div',{'class':'d10a6220b4'}):
    ratings.append(rating.text.strip())
len(ratings)



25

In [16]:
reviews = []
for review in soup.find_all('div',{'class':'db63693c62'}):
    reviews.append(review.text.replace('reviews','').replace(',','').strip())
reviews[:5]
len(reviews)

25

In [17]:
price = []
for p in soup.find_all('span',{'class':['fbd1d3018c', 'bd73d13072']}):
    price.append(p.text.replace('€','').replace(',','').strip())
price[:5]
len(price)

25
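
The four lists above are collected independently, so they stay aligned only if every hotel card on the page contains every field. A hedged sketch of a per-card alternative follows. The HTML is made up for illustration: the container attribute `data-testid="property-card"` and the `review-score` and `price` test IDs are assumptions, not taken from the live page. Only `data-testid="title"` comes from the cells above. `text_or_none` is a hypothetical helper, and the built-in `html.parser` is used here so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for browser.page_source; attribute names other than
# data-testid="title" are hypothetical.
sample_html = """
<div data-testid="property-card">
  <div data-testid="title">Hotel A</div>
  <div data-testid="review-score">8.1</div>
  <span data-testid="price">€ 120</span>
</div>
<div data-testid="property-card">
  <div data-testid="title">Hotel B</div>
  <span data-testid="price">€ 95</span>
</div>
"""

def text_or_none(card, tag, attrs):
    """Return the stripped text of the first matching tag inside `card`, or None."""
    node = card.find(tag, attrs)
    return node.get_text(strip=True) if node else None

soup_sample = BeautifulSoup(sample_html, "html.parser")
rows = []
for card in soup_sample.find_all("div", {"data-testid": "property-card"}):
    # One dictionary per hotel keeps the fields of the same card together
    rows.append({
        "Hotel": text_or_none(card, "div", {"data-testid": "title"}),
        "Ratings": text_or_none(card, "div", {"data-testid": "review-score"}),
        "Price": text_or_none(card, "span", {"data-testid": "price"}),
    })

print(rows)
```

A card with a missing field yields `None` for that field instead of shifting the values of later hotels, and a list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.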

## Step 2: Saving the data

[[ go back to the top ]](#Table-of-contents)

After retrieving the required data from the Booking.com and TripAdvisor web pages, the data was saved to a pandas DataFrame.

In [18]:
df1 = {'Hotel':hotels,'Ratings':ratings,'No_of_Reviews':reviews,'Price, €':price}
df_booking = pd.DataFrame.from_dict(df1)
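
The scraped values are strings, so the numeric columns need converting before any arithmetic or sorting. The sketch below shows one way to do this with `pd.to_numeric`. The rows are made up and `df_sample` is a hypothetical name; the column names follow the cell above.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped lists (all values are strings).
sample = {
    "Hotel": ["Hotel A", "Hotel B", "Hotel C"],
    "Ratings": ["8.1", "7.4", "n/a"],
    "No_of_Reviews": ["1204", "87", "15"],
    "Price, €": ["120", "95", "210"],
}

# from_dict raises ValueError if the lists differ in length, so check first.
lengths = {column: len(values) for column, values in sample.items()}
assert len(set(lengths.values())) == 1, f"Column lengths differ: {lengths}"

df_sample = pd.DataFrame.from_dict(sample)

# Convert the numeric columns; values that cannot be parsed become NaN.
for column in ["Ratings", "No_of_Reviews", "Price, €"]:
    df_sample[column] = pd.to_numeric(df_sample[column], errors="coerce")

print(df_sample.dtypes)
```

Using `errors="coerce"` is a design choice: it keeps the row and marks the unparseable value as missing, at the cost of hiding parsing problems unless the NaN values are inspected afterwards.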

In [19]:
df_booking

| | Hotel | Ratings | No_of_Reviews | Price, € |
|---|---|---|---|---|
| 0 | Langham Court Hotel | 7.1 | 2283 | 180 |
| 1 | Yotel London Shoreditch | 7.8 | 1814 | 126 |
| 2 | The Other House - South Kensington | 9.1 | 637 | 336 |
| 3 | The Dilly | 7.4 | 2166 | 257 |
| 4 | The Westminster London, Curio Collection by Hi... | 8.2 | 3002 | 168 |
| 5 | Park Grand London Kensington | 8.0 | 6025 | 259 |
| 6 | White House Hotel | 7.1 | 2041 | 75 |
| 7 | Holiday Inn London Camden Lock, an IHG Hotel | 8.2 | 3448 | 197 |
| 8 | Studios2Let | 7.7 | 5481 | 126 |
| 9 | The Chesterfield Mayfair | 8.9 | 2181 | 279 |


In [None]:
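#index=False keeps the row index out of the file, so reading it back does not add an "Unnamed: 0" column
df_booking.to_csv("bookingcom_london_list.csv", index=False)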

## Scraping TripAdvisor.com

[[ go back to the top ]](#Table-of-contents)

In [None]:
browser.get('https://www.tripadvisor.in/Hotels-g186338-a_ufe.true-London_England-Hotels.html')


In [None]:
accept = browser.find_element(By.ID, "onetrust-accept-btn-handler")
accept.click()

In [None]:
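import re

#Parse the TripAdvisor page; without this, soup still holds the Booking.com results
soup = BeautifulSoup(browser.page_source, 'lxml')
hotels = []
for name in soup.find_all('a',{'data-clicksource':'HotelName'}):
    #Remove the leading rank such as "1." without touching digits inside the name
    hotels.append(re.sub(r'^\s*\d+\.\s*', '', name.text.strip()))
hotels[:5]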


In [None]:
#Convert TripAdvisor's 5-bubble scores to a 10-point scale; any other value becomes 5
bubble_scores = {
    "5 of 5 bubbles": 10,
    "4.5 of 5 bubbles": 9,
    "4 of 5 bubbles": 8,
    "3.5 of 5 bubbles": 7,
    "3 of 5 bubbles": 6,
}
ratings = []
for rating in soup.find_all('a',{'class':'ui_bubble_rating'}):
    ratings.append(bubble_scores.get(rating['alt'], 5))
len(ratings)
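
The cell above maps TripAdvisor's 5-bubble scores onto a 10-point scale, which amounts to doubling them. As a sketch, the same conversion can be computed from the `alt` text instead of listing each case. `bubbles_to_ten` is a hypothetical helper, and unlike the cell above it returns `None` (rather than 5) for text it cannot parse, so low or missing ratings are not silently merged.

```python
import re

def bubbles_to_ten(alt_text):
    """Convert text such as '4.5 of 5 bubbles' to a score out of 10, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+of\s+5\s+bubbles", alt_text)
    if match is None:
        return None
    return float(match.group(1)) * 2

for text in ["5 of 5 bubbles", "4.5 of 5 bubbles", "3 of 5 bubbles", "no rating"]:
    print(text, "->", bubbles_to_ten(text))
```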

In [None]:
reviews = []
for review in soup.find_all('a',{'class':'review_count'}):
    reviews.append(review.text.replace('reviews','').replace(',','').strip())
len(reviews)

In [None]:
price = []
for p in soup.find_all('div',{'data-sizegroup':'mini-meta-price'}):
    price.append(p.text.replace('€','').replace(',','').strip())
price[:5]
len(price)

In [None]:
df2 = {'Hotel':hotels,'Ratings':ratings,'No_of_Reviews':reviews,'Price, €':price}
df_tripadvisor = pd.DataFrame.from_dict(df2)

In [None]:
df_tripadvisor

In [None]:
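#index=False keeps the row index out of the file, so reading it back does not add an "Unnamed: 0" column
df_tripadvisor.to_csv("tripadvisor_london_list.csv", index=False)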

## Step 3: Data analysis

[[ go back to the top ]](#Table-of-contents)

With the data stored in DataFrames as rows and columns, let's do some analysis.
For analysis ...

In [None]:
conn = sqlite3.connect('my_data.db')
c = conn.cursor()

In [None]:
tripadvisor = pd.read_csv('tripadvisor_london_list.csv')
tripadvisor.head()

In [None]:
booking = pd.read_csv('bookingcom_london_list.csv')
booking.head()

In [None]:
tripadvisor.to_sql('tripadvisor', conn, if_exists='append', index = False)

In [None]:
booking.to_sql('booking_com', conn, if_exists='append', index = False)
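
Once both tables are in SQLite, they can be compared with an ordinary SQL join. The following is a sketch under stated assumptions rather than the analysis of this project: it uses an in-memory database with made-up rows, `conn_demo` and `matches` are hypothetical names, and it assumes a hotel has exactly the same name on both sites, which may not hold for real data. The table and column names follow the cells above.

```python
import sqlite3

# In-memory database with made-up rows; the real tables come from the to_sql cells above.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute('CREATE TABLE booking_com (Hotel TEXT, Ratings REAL, "Price, €" REAL)')
conn_demo.execute('CREATE TABLE tripadvisor (Hotel TEXT, Ratings REAL, "Price, €" REAL)')
conn_demo.executemany(
    "INSERT INTO booking_com VALUES (?, ?, ?)",
    [("Hotel A", 8.1, 120), ("Hotel B", 7.4, 95)],
)
conn_demo.executemany(
    "INSERT INTO tripadvisor VALUES (?, ?, ?)",
    [("Hotel A", 9.0, 130), ("Hotel C", 8.0, 110)],
)

# Hotels listed on both sites, with the price difference between them.
query = """
SELECT b.Hotel,
       b.Ratings AS booking_rating,
       t.Ratings AS tripadvisor_rating,
       t."Price, €" - b."Price, €" AS price_difference
FROM booking_com AS b
JOIN tripadvisor AS t ON t.Hotel = b.Hotel
ORDER BY b.Hotel
"""
matches = conn_demo.execute(query).fetchall()
print(matches)
conn_demo.close()
```

The column name `Price, €` contains a comma and a space, so it has to be wrapped in double quotes inside the SQL text.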