# Scraping with Selenium

Selenium was originally created to automate the testing of websites. Because it drives a real browser, it is very useful when information only becomes accessible after clicking on an element of the page. A button, for example, is an element from which it is very difficult to obtain a link, so BeautifulSoup alone quickly reaches its limits.
In this case, use Selenium.
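
To make the limitation concrete, here is a minimal sketch. It assumes an invented page fragment and uses the standard-library `html.parser` rather than BeautifulSoup, so that it runs without extra dependencies. The link exposes its target in the HTML, while the button exposes nothing to follow:

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one ordinary link and one JavaScript-driven button.
PAGE = """
<a href="/ads/42">See the ad</a>
<button data-action="show-phone">Show the phone number</button>
"""


class LinkAndButtonCollector(HTMLParser):
    """Collect the href of every <a> tag and the attributes of every <button>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        elif tag == "button":
            self.buttons.append(attributes)


collector = LinkAndButtonCollector()
collector.feed(PAGE)

print(collector.links)    # the link target is present in the HTML
print(collector.buttons)  # the button has no URL, only a hook for JavaScript
```

A static parser can only report what is written in the markup. What the button reveals exists only after a browser has run the click handler, which is the part Selenium takes care of.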

### Load libraries

If you are missing any libraries in the next cell, you'll need to install them before continuing.

In [7]:
import bs4
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import json
import re
import lxml.html
import time
import random
from random import randint
import logging
import collections
from time import gmtime, strftime

from tabulate import tabulate
import os

date = strftime("%Y-%m-%d")

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium

*NB: On Linux, put the downloaded `geckodriver` executable in a directory that is on your `PATH`, for example `/home/<YOUR_NAME>/.local/bin/`.*
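
Before launching the browser, you may want to confirm that the driver is actually visible from Python. This is a small sketch using only the standard library; the helper name `find_driver` is ours, not part of Selenium:

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print("geckodriver found at", path)
```

If the driver is not found, check that the directory containing it is listed in your `PATH` environment variable.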

We will simulate a search on the official Python website.

In [8]:
# The selenium.webdriver module provides all the implementations of WebDriver.
# Currently supported are Firefox, Chrome, IE and Remote. The `Keys` class provides keys on
# the keyboard such as RETURN, F1, ALT etc. The `By` class names the locator strategies.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Here, we create an instance of the Firefox WebDriver.
driver = webdriver.Firefox()

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes many AJAX calls while loading, WebDriver may not know
# when it has fully loaded.
driver.get("http://www.python.org")

# This assertion confirms that the title contains the word "Python".
assert "Python" in driver.title

# WebDriver locates elements with `find_element` and a strategy from the `By` class
# (the older `find_element_by_...` helpers were removed in Selenium 4.3).
# For example, the text input can be located by its name attribute
# with `By.NAME`.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above.
# As a precaution, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the search, you should get results if there are any. To check that some results
# were found, make an assertion:
assert "No results found." not in driver.page_source
driver.close()

SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line


In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# Same search as in the previous cell, written more compactly.
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title

# Locate the search field by its name attribute, type the query and submit it
# with RETURN in a single chained call. Without RETURN the search is never
# submitted, and the assertion below would pass without testing anything.
driver.find_element(By.NAME, "q").send_keys("pycon", Keys.RETURN)

# After submitting the search, check that some results were found:
assert "No results found." not in driver.page_source

# Uncomment to close the browser window once you are done.
#driver.close()
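
The comments above note that WebDriver may not know when an AJAX-heavy page has finished loading. The usual remedy is to poll for a condition until it holds or a timeout expires. The sketch below shows that polling idea in plain Python, with no browser involved; `wait_until` and `results_loaded` are illustrative names, not Selenium functions:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose content only shows up on the third check.
attempts = []


def results_loaded():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(results_loaded, timeout=2.0, poll_interval=0.01)
print(len(attempts))
```

With Selenium itself you would not write this loop by hand: `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "q")))` polls in the same spirit and raises `TimeoutException` if the element never appears, where `EC` is `selenium.webdriver.support.expected_conditions`.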

#### Open the source code of the webpage and check that the search area (field) is called "q".

The search field appears in the page source as the following element (this is HTML, not Python, so it must not be run in a code cell):

    <input id="id-search-field" name="q" type="search" role="textbox" class="search-field" placeholder="Search" value="" tabindex="1">

### Getting a phone number from *leboncoin*

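In [15]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

url = "https://www.leboncoin.fr/sports_hobbies/1536839557.htm/"

driver = webdriver.Firefox()
# Wait up to 30 seconds for an element to appear before a lookup fails.
driver.implicitly_wait(30)
driver.get(url)

# Click the button that reveals the phone number. `find_elements_by_xpath` was removed
# in Selenium 4.3, so we use `find_elements(By.XPATH, ...)`. The data-reactid value is
# tied to one version of the page and will probably need updating.
python_button = driver.find_elements(By.XPATH, '//div[@data-reactid="269"]')[0]
python_button.click()

# Then we hand the rendered page over to Beautiful Soup, naming the parser explicitly.
soup = BeautifulSoup(driver.page_source, "html.parser")

driver.close()

for elem in soup.find_all("a", attrs={"data-qa-id": "adview_number_phone_contact"}):
    print(elem.text)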



WebDriverException: Message: TypeError: browsingContext.currentWindowGlobal is null
Stacktrace:
getMarionetteCommandsActorProxy/get/<@chrome://remote/content/marionette/actors/MarionetteCommandsParent.jsm:332:29


### Starting from *leboncoin*, collect all the information available to define the product being sold. Use `selenium` for the telephone number.

In [17]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By


url = "https://www.immoweb.be/en/classified/apartment/for-rent/schaerbeek/1030/9901286?searchId=627a574b63a1a"

driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

# Accept the cookie banner first.
keep_browsing = driver.find_element(By.ID, "uc-btn-accept-banner")
keep_browsing.click()
# Click the button that reveals the contact phone number.
contact_phone = driver.find_element(By.XPATH, '//*[@id="customer-card"]/div[2]/div[2]/button[2]')
contact_phone.click()

# Then we hand the rendered page over to Beautiful Soup, naming the parser explicitly.
soup = BeautifulSoup(driver.page_source, "html.parser")

#driver.close()

for elem in soup.find_all("a", {'class' : 'button button--secondary'}):
    href = elem.get('href')
    print(href)



tel:+32 2 733 00 00
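
The value printed is the raw `href` of the link, including the `tel:` scheme. If you want to store only the number, a small helper can strip the scheme and the spaces. This is a sketch; `phone_from_href` is a name we introduce here:

```python
def phone_from_href(href):
    """Return the phone number from a `tel:` link, or None for any other link."""
    if href is None or not href.startswith("tel:"):
        return None
    number = href[len("tel:"):]
    # Drop the spaces used for readability, keeping the leading "+" and the digits.
    return number.replace(" ", "")


print(phone_from_href("tel:+32 2 733 00 00"))  # prints +3227330000
```

Returning `None` for other links lets you apply the helper to every `href` found by `soup.find_all` and keep only the results that are not `None`.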


### API (Application Programming Interface)

An API is a set of tools and methods that allow different applications to interact with each other. In the case of a web service, it lets us retrieve data dynamically. By using an API correctly, we can therefore follow, in real time, the changes made on the "parent" site.

As an example, we will retrieve online news from the "L'équipe" website.

Follow the instructions at https://newsapi.org/s/lequipe-api to obtain an API key.

Your API key is: `73bbb95f8ecb49b499113a46481b4af1`


A key often stops working after a while (e.g. `5 min`, `30 min`, a day, ...),
so don't be alarmed if you get an error message back.
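
Since a key can stop working, it helps to check the `status` field of the answer before using the data. The sketch below assumes that a failed call still returns JSON, with `code` and `message` fields next to `status`; the error payload is made up for illustration, and `NewsApiError` and `articles_from_payload` are our own names:

```python
class NewsApiError(Exception):
    """Raised when the service answers with a status other than 'ok'."""


def articles_from_payload(payload):
    """Return the list of articles, or raise NewsApiError if the call failed."""
    if payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message provided")
        raise NewsApiError(f"{code}: {message}")
    return payload.get("articles", [])


# A successful answer, reduced to its structure.
ok_payload = {"status": "ok", "totalResults": 0, "articles": []}
print(articles_from_payload(ok_payload))

# A hypothetical error answer (field values are assumptions).
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Key rejected."}
try:
    articles_from_payload(error_payload)
except NewsApiError as error:
    print("The API call failed:", error)
```

In the notebook you would pass `response.json()` to `articles_from_payload` instead of the hand-written dictionaries.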

In [18]:
import requests

key = "73bbb95f8ecb49b499113a46481b4af1"
url = "https://newsapi.org/v2/top-headlines?sources=lequipe&apiKey=" + key
response = requests.get(url)

# The response body is JSON; response.json() parses it into a Python dictionary.
print(response.json())

{'status': 'ok', 'totalResults': 10, 'articles': [{'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Christophe Galtier (entraîneur de Nice)\xa0: «\xa0Pas à l'heure du bilan\xa0»", 'description': "À la veille de la rencontre entre Nice et Saint-Étienne, en match en retard de la 36e journée de Ligue\xa01 mercredi (19h00), l'entraîneur du Gym Christophe Galtier a repoussé les interrogations concernant son avenir. Pour mieux se focaliser sur une rencontre cr…", 'url': 'https://www.lequipe.fr/Football/Actualites/Christophe-galtier-entraineur-de-nice-pas-a-l-heure-du-bilan/1332298', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/christophe-galtier-entraineur-de-nice-en-conference-de-presse-p-lahalle-l-equipe/1500000001640586/0:0,1998:1332-640-427-75/83a0e.jpg', 'publishedAt': '2022-05-10T11:44:00+00:00', 'content': "«\xa0Serez-vous toujours l'entraîneur de Nice la saison prochaine et avez-vous demandé des garanties à Ineos sur les contours du futur eff
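
The request above builds its URL by concatenating strings and hard-codes the key. A slightly safer variant encodes the parameters properly and reads the key from the environment. This is a sketch: `NEWSAPI_KEY` is a variable name we chose, and `build_url` is our own helper:

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"


def build_url(source, api_key):
    """Build the top-headlines URL with properly encoded query parameters."""
    query = urlencode({"sources": source, "apiKey": api_key})
    return f"{BASE_URL}?{query}"


# NEWSAPI_KEY is a hypothetical environment variable; the fallback is a placeholder.
api_key = os.environ.get("NEWSAPI_KEY", "YOUR_API_KEY")
url = build_url("lequipe", api_key)
```

With `requests` you can get the same encoding by passing the dictionary directly: `requests.get(BASE_URL, params={"sources": "lequipe", "apiKey": api_key})`.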

In [19]:
dictionnary = response.json()
print(dictionnary.keys())

dict_keys(['status', 'totalResults', 'articles'])


In [20]:
for element in list(dictionnary.keys()):
    print("##############################################")
    print("Key: ", element, "// Values: ", dictionnary[element])

##############################################
Key:  status // Values:  ok
##############################################
Key:  totalResults // Values:  10
##############################################
Key:  articles // Values:  [{'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Christophe Galtier (entraîneur de Nice)\xa0: «\xa0Pas à l'heure du bilan\xa0»", 'description': "À la veille de la rencontre entre Nice et Saint-Étienne, en match en retard de la 36e journée de Ligue\xa01 mercredi (19h00), l'entraîneur du Gym Christophe Galtier a repoussé les interrogations concernant son avenir. Pour mieux se focaliser sur une rencontre cr…", 'url': 'https://www.lequipe.fr/Football/Actualites/Christophe-galtier-entraineur-de-nice-pas-a-l-heure-du-bilan/1332298', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/christophe-galtier-entraineur-de-nice-en-conference-de-presse-p-lahalle-l-equipe/1500000001640586/0:0,1998:1332-640-427-75/83a0e.jpg', 'publishedAt'

In [21]:
# The "articles" key holds a list of dictionaries (JSON maps directly onto Python lists and dictionaries).
# Let's look at what the "articles" key contains.

for element in enumerate(dictionnary["articles"]):
    print("###############################################")
    print(element)

###############################################
(0, {'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Christophe Galtier (entraîneur de Nice)\xa0: «\xa0Pas à l'heure du bilan\xa0»", 'description': "À la veille de la rencontre entre Nice et Saint-Étienne, en match en retard de la 36e journée de Ligue\xa01 mercredi (19h00), l'entraîneur du Gym Christophe Galtier a repoussé les interrogations concernant son avenir. Pour mieux se focaliser sur une rencontre cr…", 'url': 'https://www.lequipe.fr/Football/Actualites/Christophe-galtier-entraineur-de-nice-pas-a-l-heure-du-bilan/1332298', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/christophe-galtier-entraineur-de-nice-en-conference-de-presse-p-lahalle-l-equipe/1500000001640586/0:0,1998:1332-640-427-75/83a0e.jpg', 'publishedAt': '2022-05-10T11:44:00+00:00', 'content': "«\xa0Serez-vous toujours l'entraîneur de Nice la saison prochaine et avez-vous demandé des garanties à Ineos sur les contours du futur e
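
Raw printouts like the ones above are hard to read. One option is to let the `json` module indent the structure for you. The sketch below uses three fields taken from the first article in the output:

```python
import json

# Three fields of the first article shown in the output above.
article = {
    "source": {"id": "lequipe", "name": "L'equipe"},
    "author": "L'EQUIPE",
    "publishedAt": "2022-05-10T11:44:00+00:00",
}

# indent=2 puts each key on its own line; ensure_ascii=False keeps accented characters readable.
pretty = json.dumps(article, indent=2, ensure_ascii=False)
print(pretty)
```

In the notebook, `print(json.dumps(response.json(), indent=2, ensure_ascii=False))` would display the whole answer the same way.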

In [22]:
# So if we keep going, it gives us another dictionary!
for element in dictionnary["articles"][0].keys():
    print(" Key : ", element, "Values : ", dictionnary["articles"][0][element])

 Key :  source Values :  {'id': 'lequipe', 'name': "L'equipe"}
 Key :  author Values :  L'EQUIPE
 Key :  title Values :  Christophe Galtier (entraîneur de Nice) : « Pas à l'heure du bilan »
 Key :  description Values :  À la veille de la rencontre entre Nice et Saint-Étienne, en match en retard de la 36e journée de Ligue 1 mercredi (19h00), l'entraîneur du Gym Christophe Galtier a repoussé les interrogations concernant son avenir. Pour mieux se focaliser sur une rencontre cr…
 Key :  url Values :  https://www.lequipe.fr/Football/Actualites/Christophe-galtier-entraineur-de-nice-pas-a-l-heure-du-bilan/1332298
 Key :  urlToImage Values :  https://medias.lequipe.fr/img-photo-jpg/christophe-galtier-entraineur-de-nice-en-conference-de-presse-p-lahalle-l-equipe/1500000001640586/0:0,1998:1332-640-427-75/83a0e.jpg
 Key :  publishedAt Values :  2022-05-10T11:44:00+00:00
 Key :  content Values :  « Serez-vous toujours l'entraîneur de Nice la saison prochaine et avez-vous demandé des garanties à I

### Write a script that retrieves the details of the ten latest news items from L'Équipe or another site. Store them in a clean CSV or Excel file.
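
As a starting point, here is one possible sketch of the storage step, using only the standard library. It flattens each article (the nested `source` dictionary is reduced to its `name`) and writes at most ten rows. The choice of columns and the helper names are ours; the sample article comes from the output above:

```python
import csv
import io

FIELDS = ["source", "author", "title", "publishedAt", "url"]


def article_to_row(article):
    """Flatten one article: the nested `source` dictionary becomes its `name`."""
    row = {field: article.get(field, "") for field in FIELDS}
    row["source"] = (article.get("source") or {}).get("name", "")
    return row


def write_articles(articles, handle, limit=10):
    """Write a header line and at most `limit` articles to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for article in articles[:limit]:
        writer.writerow(article_to_row(article))


# One article taken from the output above.
sample = [{
    "source": {"id": "lequipe", "name": "L'equipe"},
    "author": "L'EQUIPE",
    "title": "Christophe Galtier (entraîneur de Nice) : « Pas à l'heure du bilan »",
    "publishedAt": "2022-05-10T11:44:00+00:00",
    "url": "https://www.lequipe.fr/Football/Actualites/Christophe-galtier-entraineur-de-nice-pas-a-l-heure-du-bilan/1332298",
}]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_articles(sample, buffer)
print(buffer.getvalue())
```

To produce a real file, replace the buffer with `open("news.csv", "w", newline="", encoding="utf-8")` (the file name is only an example) and pass `dictionnary["articles"]` instead of `sample`. For an Excel file, the rows can be loaded into a pandas `DataFrame` and saved with `to_excel`.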