<a href="https://colab.research.google.com/github/sanxlop/tfm_etsit/blob/master/web_scraping_system_for_data_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping System for Data Modeling
 
This notebook shows how to build a database by extracting data from web pages with web scraping techniques and classifying the extracted images with a deep learning model. Given the nature of the project, the TripAdvisor website was selected. The main idea is to use TripAdvisor restaurant information and the food images from reviews to build a dataset for our case study.




In [0]:
# Libraries
import sys
import requests
import time
import pickle
import os
import numpy as np
import pandas as pd
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
from IPython.display import clear_output
from google.colab import drive

### Get Google Drive access
To start, we mount our Google Drive account, authorizing access with a token, so that we can read files from it and save new ones.

In [0]:
# Get Google Drive access
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


## Methodology and configuration of the data modeling process

### Selenium web driver configuration

To start, Selenium and Chromium must be installed because they are not included in Colaboratory. Selenium's WebDriver is needed to drive Chromium. Two functions are defined to configure and use the web driver.

In [0]:
# Install Selenium and Chromium
!pip install selenium
!apt-get update
!apt-get install chromium-chromedriver

from selenium import webdriver
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

The web driver configuration logic starts and configures the web driver. It uses the following options:

- headless: runs the Chrome browser without a visible user interface. It brings all the modern web platform features provided by Chromium and the Blink rendering engine to the command line.
- no-sandbox: when running headless in a container without a defined user, Chrome needs this argument or it will not be able to start up.
- disable-dev-shm-usage: makes Chrome write its shared memory files to /tmp instead of /dev/shm, which prevents crashes in containers where /dev/shm is small.

The function also runs a script ('return document.readyState') that returns the loading state of the document. Note that this call only reads the state once; it does not block until the document and its scripts are ready.

In [0]:
def loadWebDriverConfiguration():
  """This function loads the webdriver configuration and returns the driver"""
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument('--headless')
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument('--disable-dev-shm-usage')
  driver = webdriver.Chrome('chromedriver', options=chrome_options)
  driver.execute_script('return document.readyState')
  return driver
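
The `document.readyState` call in `loadWebDriverConfiguration` reads the state a single time. If an explicit wait is wanted, a minimal polling sketch could look like the following. The helper name `waitForDocumentReady` and its `timeout` and `poll` parameters are hypothetical and not part of the original notebook; the sketch only relies on the driver's `execute_script` method.

```python
import time

def waitForDocumentReady(driver, timeout=10.0, poll=0.2):
  """Poll document.readyState until it is 'complete' or the timeout expires.

  Hypothetical helper, not part of the original notebook.
  Returns True if the page reported 'complete', False on timeout.
  """
  deadline = time.monotonic() + timeout
  while True:
    if driver.execute_script('return document.readyState') == 'complete':
      return True
    if time.monotonic() >= deadline:
      return False
    time.sleep(poll)
```

It would be called right after `driver.get(url)`, before reading `driver.page_source`.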

The web scrolling logic scrolls down the page to trigger all the scripts that load content on scroll. It executes scripts to read the scroll height and keeps scrolling until the height stops changing.

In [0]:
def scrollWeb(driver):
  """This function scrolls to the bottom so that JavaScript loads the entire page"""
  scroll_height = 'document.documentElement.scrollHeight'
  last_height = driver.execute_script('return '+scroll_height)
  while True:
    # Scroll down to bottom
    driver.execute_script('window.scrollTo(0,'+scroll_height+');')
    # Wait to load page
    time.sleep(0.2)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script('return '+scroll_height)
    if new_height == last_height:
      break
    last_height = new_height
  return driver

### Path results configuration

Path configuration to save results.

In [0]:
# Path configurations 
base_url = 'https://www.tripadvisor.es'
drive_path = 'drive/My Drive/'
folder_path_scraper = drive_path+'scraper_test'+'/'

# Create folder for results
if not os.path.exists(folder_path_scraper):
  os.makedirs(folder_path_scraper)

## Data extraction process

### Extraction of restaurant urls (for a city)
The very first step of the scraping process is to collect all the restaurant links belonging to a city. The case study uses restaurants from Majadahonda and its surroundings, so the URL fragment must be the corresponding one.

The process consists of visiting the URL corresponding to the city, collecting the restaurant URLs and detecting whether there is a next page to iterate over. Special attention must be paid to the 'a' HTML tags belonging to a given class and to their 'href' attribute, from which the links are obtained. It is essential to add 'time.sleep' calls between requests to avoid being blocked.

In [0]:
def getRestaurantsUrls(driver, url_city, restaurant_urls_all):
  """This function collects all restaurant urls in the city, one results page per call"""
  driver.get(url_city)
  html = driver.page_source.encode('utf-8')
  soup = BeautifulSoup(html, "lxml")
  # Search restaurants in page
  restaurant_urls = soup.find_all('a', class_='property_title')
  try:
    # Take only href value
    restaurant_urls_href = [row['href'] for row in restaurant_urls[3:]] #first 3 rows are not useful
    # Save urls in list
    restaurant_urls_all.extend(restaurant_urls_href)
  except:
    print('No href')
  print('X', end="")
  # Return next page url
  try:
    next_page = soup.find('a', class_='nav next rndBtn ui_button primary taLnk')['href']
    return base_url+next_page, restaurant_urls_all
  except:
    return None, restaurant_urls_all

#### Scraping process
Once all the pages have been iterated, we obtain a list of restaurant links belonging to the city and its surroundings, which is used in the next step.

In [0]:
%%time
# City url wanted
url_city = base_url+'/Restaurants-g1063665-Majadahonda.html' #Majadahonda

# Collect all city restaurants
restaurant_urls_all = []
driver = loadWebDriverConfiguration()
next_page_url, restaurant_urls_all = getRestaurantsUrls(driver, url_city, restaurant_urls_all)
while next_page_url:
  time.sleep(2)
  next_page_url, restaurant_urls_all = getRestaurantsUrls(driver, next_page_url, restaurant_urls_all)
print('')

XXXXXXXXXXXXXXXXX
CPU times: user 2.94 s, sys: 61 ms, total: 3 s
Wall time: 1min 6s
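
The page-following loop can be sketched independently of Selenium. In the sketch below, `fetch_page` is a hypothetical stand-in for `getRestaurantsUrls` that returns the next page URL (or `None`) together with the links found on the page. The `max_pages` limit and the visited-set guard are assumptions added so that a repeated "next" link cannot cause an endless loop; the notebook itself does not include them.

```python
import time

def collectAllPages(fetch_page, first_url, delay=2.0, max_pages=1000):
  """Follow 'next page' links until none is left.

  fetch_page(url) must return (next_url_or_None, list_of_links).
  Hypothetical helper; delay mirrors the time.sleep(2) used in the notebook.
  """
  links, visited = [], set()
  url = first_url
  while url and url not in visited and len(visited) < max_pages:
    visited.add(url)
    next_url, page_links = fetch_page(url)
    links.extend(page_links)
    if next_url:
      time.sleep(delay)  # pause between requests to avoid being blocked
    url = next_url
  return links
```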


#### Clean and save data

In [0]:
# Find duplicates
restaurant_urls_all_set = set(restaurant_urls_all)

print('Number of restaurant urls:', len(restaurant_urls_all), '-', len(restaurant_urls_all_set))

Number of restaurant urls: 444 - 444
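
Converting the list to a `set` removes duplicates but discards the order in which the links were collected. If the order matters, one alternative (not what the notebook uses) is `dict.fromkeys`, which keeps the first occurrence of each item in insertion order. The sample URLs below are made up for illustration.

```python
def dedupeKeepOrder(urls):
  """Remove duplicate URLs while keeping the first occurrence of each."""
  return list(dict.fromkeys(urls))

# Illustrative values only, not real scraped links
sample_urls = ['/r-1.html', '/r-2.html', '/r-1.html', '/r-3.html']
unique_urls = dedupeKeepOrder(sample_urls)
print('Number of restaurant urls:', len(sample_urls), '-', len(unique_urls))
```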


In [0]:
# Save set with restaurant urls
with open(folder_path_scraper+'restaurant_urls.dump', 'wb') as f:
  pickle.dump(restaurant_urls_all_set, f)

In [0]:
# Read set with restaurant urls
with open(folder_path_scraper+'restaurant_urls.dump', 'rb') as f:
  restaurant_urls_all_set = pickle.load(f)

print('Number of restaurants:', len(restaurant_urls_all_set))

Number of restaurants: 444


### Extraction of restaurant information (from restaurant urls)

Extracting restaurant information from the URLs is more complex than obtaining the restaurant links. Every desired field requires special attention: its HTML structure must be located and missing tags or information must be handled. According to the prerequisite analysis, the name, score, number of reviews, categories, price, address and image URLs must be obtained for each restaurant URL collected before.

The web driver requests the HTML document, which can be scrolled once it is ready in order to load more images. This procedure is repeated for every restaurant page using the restaurant URLs. To obtain the information, attention is focused on the `div`, `h1`, `span`, and `a` HTML tags and their corresponding classes and attributes. Missing and empty values must be handled so that the process does not crash, which makes this step trickier.

In [0]:
def getDataFromRestaurant(driver, complete_url_restaurant_album):
  """This function gets all the restaurant information by digging into all required fields. Take care with class ids"""
  driver.get(complete_url_restaurant_album)
  time.sleep(0.2)
  #driver = scrollWeb(driver)
  html = driver.page_source.encode('utf-8')
  soup = BeautifulSoup(html, "lxml")
  #Get urls imgs
  imgs_urls = []
  imgs = soup.find_all('div', class_='fillSquare')
  for img in imgs:
    try:
      imgs_urls.append(img.img['src'])
    except:
      imgs_urls.append(img.img['data-lazyurl'])
  #Get name 
  try:
    name = soup.find('h1', class_='ui_header h1').text.strip()
  except:
    name = np.nan
  #Get score
  try:
    score = float(soup.find('span', class_='restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl').text.replace(',','.'))
  except:
    score = np.nan
  #Get number reviews
  try:
    n_opinions = float(soup.find('a', class_='restaurants-detail-overview-cards-RatingsOverviewCard__ratingCount--DFxkG').text.split(' ')[0])
  except:
    n_opinions = np.nan
  #Get food categories and price
  try:
    details = soup.find('div', class_='restaurants-detail-overview-cards-DetailsSectionOverviewCard__detailsSummary--evhlS')
    prize_flag = False
    food_flag = False
    food_categories = []
    prize = ''
    for div in details:
      for info in div:
        if prize_flag:
          prize = info.text.replace('\xa0US$', '')
          prize_flag = False
        elif food_flag:
          food_categories = info.text.lower().split(', ')
          food_flag = False
        elif info.text.lower() == 'rango de precios':
          prize_flag = True
        elif info.text.lower() == 'tipos de cocina':
          food_flag = True
  except:
    food_categories = []
    prize = ''
  #Get address
  try:
    address = soup.find('span', class_='restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText--co3ei').text
  except:
    address = ''
  
  print(len(imgs_urls), score, n_opinions, food_categories, prize, address, name)
  return imgs_urls, score, n_opinions, food_categories, prize, address, name
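
Each field above follows the same pattern: try to extract the value and fall back to a default when the tag is missing. A minimal sketch of that pattern as a reusable helper follows. The name `safeExtract` is hypothetical, and the list of caught exception types is an assumption about what a missing tag or malformed text typically raises (`soup.find` returns `None` when nothing matches, so `.text` raises `AttributeError` and indexing raises `TypeError`).

```python
import math

def safeExtract(extract, default):
  """Run extract() and return default if the field is missing or malformed."""
  try:
    return extract()
  except (AttributeError, TypeError, KeyError, IndexError, ValueError):
    return default

# Illustrative use with plain Python objects standing in for parsed HTML
missing_tag = None  # what soup.find() returns when nothing matches
score = safeExtract(lambda: float(missing_tag.text.replace(',', '.')), math.nan)
n_opinions = safeExtract(lambda: float('31 opiniones'.split(' ')[0]), math.nan)
```

Catching specific exception types instead of using a bare `except` keeps unrelated errors, such as a keyboard interrupt, from being silently swallowed.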

#### Scraping process
Once all the restaurant pages have been iterated, we obtain a dictionary that contains the information of each establishment, including its image URLs (restaurants without images are dispensable).

In [0]:
%%time

# Images corresponding web
photos_part = '#photos;aggregationId=101&albumid=101&filter=7'

# Collect all information of each restaurant
dict_restaurants = {}
n_restaurants = len(restaurant_urls_all_set)
restaurant_urls_all_set_list = list(restaurant_urls_all_set)
driver = loadWebDriverConfiguration()
for i, url_restaurant in enumerate(restaurant_urls_all_set_list[:]):
  clear_output()
  print(i+1,'/',n_restaurants)
  complete_url_restaurant_album = base_url+url_restaurant+photos_part
  dict_restaurants[url_restaurant] = getDataFromRestaurant(driver, complete_url_restaurant_album)
  time.sleep(1)

#### Clean and save data

In [0]:
# Print number of restaurants
print('Number of restaurants:',len(dict_restaurants))
# Print number of images
counter = 0
for a in list(dict_restaurants): 
  if dict_restaurants.get(a)[0] != []:
    counter += len(dict_restaurants.get(a)[0])
print('Number of imgs:', counter)

Number of restaurants: 6
Number of imgs: 180


In [0]:
# Delete restaurants without imgs
for a in list(dict_restaurants): 
  if dict_restaurants.get(a)[0] == []:
    dict_restaurants.pop(a)

In [0]:
# Print number of restaurants
print('Number of restaurants:',len(dict_restaurants))
# Print number of images
counter = 0
for a in list(dict_restaurants): 
  if dict_restaurants.get(a)[0] != []:
    counter += len(dict_restaurants.get(a)[0])
print('Number of imgs:', counter)

Number of restaurants: 5
Number of imgs: 180


In [0]:
# Save dict restaurants info
pickle.dump(dict_restaurants, open(folder_path_scraper+'restaurants_info.dump', 'wb'))

In [0]:
# Load dict restaurants info
dict_restaurants = pickle.load(open(folder_path_scraper+'restaurants_info.dump', 'rb'))

### Classify images (from restaurant information)

#### Load Keras model for classifying
At this point, the food classification model built in the previous chapter and its configuration must be loaded for the next step. The resulting Xception model is loaded with the Keras framework together with its configuration. It is used to classify the images in the next step.

In [0]:
# Keras libraries
from keras.models import load_model
from keras.preprocessing import image
from keras.applications.xception import preprocess_input

# Path configuration
#InceptionResNetV2 | 299x299 | b32 (7000s)
#ResNet50 | 224x224 | b32 (2550s)
#MobileNetV2 | 224x224 | b64 (1870s)
#Xception | 299x299 | b16 (6700s)
NET = 'Xception' 
# Configuration
IMG_SHAPE = (299, 299, 3)
BATCH_SIZE = 16
EPOCHS = 25
# Directories
test_folder = 'blacklist' #write test if testing
drive_path = 'drive/My Drive/'
base_name = NET+'-'+str(IMG_SHAPE[0])+'x'+str(IMG_SHAPE[0])+'-b'+str(BATCH_SIZE)+'-e'+str(EPOCHS)
folder_path = drive_path+test_folder+base_name+'/'
model_file_name_best = 'model-best-'+base_name+'.h5'
model_file_name_last = 'model-last-'+base_name+'.h5'

# Load model
model = load_model(folder_path+model_file_name_best)

# Read list of food classes
list_classes = pickle.load(open(drive_path+'list_classes.dump', 'rb'))
print('Number of classes:',len(list_classes))

Number of classes: 101


#### Image classification process and database formatting

All the image URLs obtained are opened and preprocessed for prediction with the Xception model. The process consists of opening all the images belonging to a restaurant and predicting their class, saving the predicted category and the confidence of the prediction (stored in the `acc` column).


In [0]:
def getImgDataFromListOfUrls(imgs_urls):
  """This function downloads each image from its url and resizes it to 299x299"""
  img_data = []
  for url in imgs_urls:
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img = img.resize(size=(299,299), resample=1)
    img_data.append(img)
  return img_data

Once all the restaurant images have been iterated, we obtain a table (a pandas DataFrame) containing the image information.

In [0]:
%%time

# Classify all images and generate dataset
columns = ['food', 'img_url', 'acc', 'rest_url', 'score', 'reviews', 'categories', 'price', 'address', 'name']
rows = []
debug = False
for k,v in dict_restaurants.items():
  print(k)
  img_urls = v[0]
  #print(img_urls)
  img_data = getImgDataFromListOfUrls(img_urls)

  for index, img in enumerate(img_data):
    try:
      x = image.img_to_array(img)
      x = np.expand_dims(x, axis=0)
      x = preprocess_input(x)
      result = model.predict(x)
      # Index and confidence (in percent) of the most likely class
      best = np.argmax(result[0])
      food_class = list_classes[best]
      acc = round(result[0][best]*100,2)
      if debug:
        display(img)
        print(food_class)
        print(img_urls[index], acc)
      rows.append({'food': food_class, 
                   'img_url': img_urls[index], 
                   'acc': acc, 
                   'rest_url': k, 
                   'score':v[1], 
                   'reviews': v[2], 
                   'categories':v[3], 
                   'price':v[4], 
                   'address':v[5], 
                   'name':v[6]})
    except Exception:
      print('error')
  time.sleep(0.3)

# Build the table once; DataFrame.append was removed in pandas 2.0
database_rest = pd.DataFrame(rows, columns=columns)

/Restaurant_Review-g644339-d14798593-Reviews-La_Huella_Vegana_de_las_Rozas-Las_Rozas.html
/Restaurant_Review-g187514-d4149668-Reviews-Ito_Ita-Madrid.html
/Restaurant_Review-g644339-d1154598-Reviews-Kyoto_Restaurant-Las_Rozas.html
/Restaurant_Review-g644339-d2414040-Reviews-El_tomate-Las_Rozas.html
/Restaurant_Review-g580334-d6602067-Reviews-Cul_de_Sac-Pozuelo_de_Alarcon.html
CPU times: user 5.79 s, sys: 829 ms, total: 6.62 s
Wall time: 11.8 s
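
The value stored in the `acc` column is the model's highest output probability for the image, expressed as a percentage, rather than an accuracy measured against ground truth. A small sketch of that selection step in plain Python follows. The function name `topPrediction` and the `min_confidence` threshold are assumptions for illustration; the notebook itself uses `np.argmax` and keeps every prediction.

```python
def topPrediction(probabilities, class_names, min_confidence=0.0):
  """Return (class_name, confidence_percent) for the most likely class.

  Returns None when the best confidence is below min_confidence (in percent).
  """
  best = max(range(len(probabilities)), key=lambda i: probabilities[i])
  confidence = round(probabilities[best] * 100, 2)
  if confidence < min_confidence:
    return None
  return class_names[best], confidence

classes = ['guacamole', 'hamburger', 'macarons']  # illustrative subset
kept = topPrediction([0.10, 0.25, 0.65], classes)
dropped = topPrediction([0.40, 0.35, 0.25], classes, min_confidence=50)
```

A threshold like this could be used to discard low-confidence rows before they reach the final table.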


#### Clean and save data
Finally, the restaurant information table and the restaurant images table must be merged into one for the case study and exported as a JSON file with the records orientation. Merging the tables into one produces some duplicate values, but the amount of data is small and a single table is essential for filtering with the selected search engine.

In [0]:
def priceAvg(x):
  x_ = x.split()
  if( len(x_) == 3 ): 
    return (int(x_[0]) + int(x_[2])) / 2 
  else:
    if x_ == []:
      return 0
    else:
      return int(x_[0]) 
  
# Compute average price
database_rest['price'] = database_rest['price'].fillna('0 - 0').apply(lambda x: priceAvg(x))
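
`priceAvg` assumes the range is written as three whitespace-separated tokens such as `'15 - 18'`. If currency symbols remain in the string, a regular expression is one way to make the parsing less sensitive to formatting. This is a sketch under the assumption that prices are whole numbers without thousands separators; `priceAvgRegex` is a hypothetical name and not the notebook's method.

```python
import re

def priceAvgRegex(text):
  """Average all integer amounts found in a price string; 0 if none."""
  amounts = [int(n) for n in re.findall(r'\d+', text)]
  return sum(amounts) / len(amounts) if amounts else 0

plain_range = priceAvgRegex('15 - 18')
with_symbols = priceAvgRegex('15\xa0€ - 18\xa0€')
empty = priceAvgRegex('')
```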

In [0]:
# Check na values
database_rest.isna().sum()

food          0
img_url       0
acc           0
rest_url      0
score         0
reviews       0
categories    0
price         0
address       0
name          0
dtype: int64

In [0]:
database_rest

Unnamed: 0,food,img_url,acc,rest_url,score,reviews,categories,price,address,name
0,chocolate_cake,https://media-cdn.tripadvisor.com/media/photo-...,100.00,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
1,guacamole,https://media-cdn.tripadvisor.com/media/photo-...,96.42,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
2,poutine,https://media-cdn.tripadvisor.com/media/photo-...,99.89,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
3,hamburger,https://media-cdn.tripadvisor.com/media/photo-...,58.25,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
4,pulled_pork_sandwich,https://media-cdn.tripadvisor.com/media/photo-...,77.00,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
5,fried_rice,https://media-cdn.tripadvisor.com/media/photo-...,51.99,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
6,macarons,https://media-cdn.tripadvisor.com/media/photo-...,96.74,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
7,chocolate_mousse,https://media-cdn.tripadvisor.com/media/photo-...,91.10,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
8,guacamole,https://media-cdn.tripadvisor.com/media/photo-...,65.29,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas
9,cup_cakes,https://media-cdn.tripadvisor.com/media/photo-...,78.15,/Restaurant_Review-g644339-d14798593-Reviews-L...,4.5,31.0,[saludable],16.5,"Calle Verónica 6, 28232 Las Rozas España",La Huella Vegana de las Rozas


In [0]:
# Save json database
database_rest.to_json(folder_path_scraper+'database_restaurants.json', orient='records')

In [0]:
# Load json database
database_rest = pd.read_json(folder_path_scraper+'database_restaurants.json', orient='records')
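
With `orient='records'`, the exported file is a JSON array with one object per table row. The standard-library sketch below shows that shape and a round trip. The two rows reuse a few values from the table above with most columns left out, so they are illustrative and not the actual file contents.

```python
import json

# One dict per row, as produced by the records orientation
records = [
  {'food': 'guacamole', 'acc': 96.42, 'price': 16.5},
  {'food': 'hamburger', 'acc': 58.25, 'price': 16.5},
]

serialized = json.dumps(records)   # a JSON array of objects
restored = json.loads(serialized)  # back to a list of dicts, one per row
```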