# Scraping with Selenium

Selenium was originally created to automate tests of websites. Because it drives a real browser, it is very useful when information only becomes accessible after clicking on an element of the page. A button, for example, is an element from which it is very difficult to obtain the target link, so BeautifulSoup alone becomes limited.
In this case, use Selenium.

### Load libraries

If you are missing any libraries in the next cell, you'll need to install them before continuing.

In [None]:
import bs4
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import json
import re
import lxml.html
import time
import random
from random import randint
import logging
import collections
from time import gmtime, strftime

from tabulate import tabulate
import os

date = strftime("%Y-%m-%d")
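
To see which of the third-party imports above are missing before running the cell, a small check like the following can help. It is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced here for illustration.

```python
import importlib.util

# Import names used in the cell above. The import name can differ from the
# pip package name (for example, `bs4` is installed as `beautifulsoup4`).
REQUIRED_MODULES = ["bs4", "requests", "numpy", "pandas", "lxml", "tabulate"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any name reported as missing needs to be installed (for example with `pip`) before continuing.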

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium

*NB: On Linux, put the downloaded `geckodriver` executable in a directory that is on your `PATH`, for example the equivalent on your machine of `/home/<YOUR_NAME>/.local/bin/`.*
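
One way to confirm that the driver ended up somewhere Python can find it is to look it up on the `PATH`. This is a minimal sketch using the standard library; `find_driver` is a hypothetical helper name, not part of Selenium.

```python
import shutil


def find_driver(executable="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH; check where you saved it.")
else:
    print("geckodriver found at:", path)
```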

We will simulate a search on the official Python website.

In [None]:
import selenium

# The selenium.webdriver module provides all the WebDriver implementations.
# Currently supported are Firefox, Chrome, IE and Remote. The `Keys` class provides
# keyboard keys such as RETURN, F1, ALT, etc. The `By` class lists the locator strategies.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Create an instance of the Firefox WebDriver.
driver = webdriver.Firefox()

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes many AJAX calls while loading, WebDriver may not know
# when it has finished loading.
driver.get("http://www.python.org")

# The following assertion confirms that the title contains the word "Python".
assert "Python" in driver.title

# WebDriver locates elements with `find_element`, which takes a locator strategy from `By`
# and a value. The `find_element_by_...` helpers shown in older tutorials were deprecated
# in Selenium 4 and later removed.
# Here, the text input element is located by its name attribute.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above.
# As a precaution, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After the search is submitted, the results page should load. To check that some results
# were found, make an assertion:
assert "No results found." not in driver.page_source
driver.close()
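
The comments in the cell above note that WebDriver may not know when a page that relies on AJAX has finished loading. Selenium addresses this with explicit waits (`WebDriverWait(driver, timeout).until(condition)`), which poll a condition until it holds or the timeout expires. The block below is a standard-library sketch of that polling idea only, not Selenium's implementation; `poll_until` and `fake_page_ready` are hypothetical names, and the fake page stands in for a real browser.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose content appears only after a few checks.
attempts = {"count": 0}


def fake_page_ready():
    attempts["count"] += 1
    return "search results" if attempts["count"] >= 3 else None


print(poll_until(fake_page_ready, timeout=2.0, interval=0.01))
```

With a real driver, the equivalent call would look like `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, "q")))`, using the `WebDriverWait`, `expected_conditions` and `By` imports shown in the *leboncoin* cell.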

#### Open the source code of the webpage and check that the search field is named "q".

### Getting a phone number from *leboncoin*

In [None]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

url = "https://www.leboncoin.fr/sports_hobbies/1536839557.htm/"

driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

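# Click the element that reveals the phone number. The `find_elements_by_xpath` helper was
# deprecated in Selenium 4 and later removed, so use `find_elements` with `By.XPATH`.
# Note: `data-reactid` values are generated by the page and may change over time,
# so this locator may need to be updated.
python_button = driver.find_elements(By.XPATH, '//div[@data-reactid="269"]')[0]
python_button.click()

# Then parse the rendered page with BeautifulSoup (imported in the first cell),
# naming the parser explicitly.
soup = BeautifulSoup(driver.page_source, "lxml")

driver.close()

# Print the text of every link marked as the phone number contact.
for elem in soup.find_all("a", attrs={"data-qa-id": "adview_number_phone_contact"}):
    print(elem.text)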

### Starting from *leboncoin*, collect all the information available to define the product being sold. Use `selenium` for the telephone number.

### API (Application Programming Interface)

An API is a set of tools and methods that allow different applications to interact with each other. In the case of a web service, it lets us retrieve data dynamically. By using an API correctly, we can obtain, in real time, the changes made on a "parent" site.

As an example, we will retrieve online news from the "L'équipe" website.

Follow the instructions at https://newsapi.org/s/lequipe-api to obtain an API key.

Your API key is: `73bbb95f8ecb49b499113a46481b4af1`


A key often stops working after a while (e.g. `5 min`, `30 min`, a day, ...),
so don't be alarmed if you get an error message back.

In [None]:
import requests

key = "73bbb95f8ecb49b499113a46481b4af1"
url = "https://newsapi.org/v2/top-headlines?sources=lequipe&apiKey=" + key
response = requests.get(url)

# The response body is JSON; response.json() parses it into a Python dictionary
print(response.json())
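
Since a key can stop working, it can help to check the parsed response before using it. The sketch below builds the same URL with `urllib.parse.urlencode` and inspects a parsed payload. It assumes the payload carries a `status` field and, on failure, `code` and `message` fields; the helper names and both sample payloads are hypothetical and made up for illustration, so check them against what the service actually returns.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"


def build_url(api_key, source="lequipe"):
    """Build the request URL, letting urlencode escape the parameter values."""
    return BASE_URL + "?" + urlencode({"sources": source, "apiKey": api_key})


def articles_or_error(payload):
    """Return the list of articles, or raise if the payload reports an error."""
    if payload.get("status") != "ok":
        # Assumed error shape: {"status": "error", "code": ..., "message": ...}
        raise RuntimeError(f"{payload.get('code')}: {payload.get('message')}")
    return payload.get("articles", [])


# Hypothetical payloads, made up for illustration only.
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Invalid key."}

print(build_url("YOUR_API_KEY"))
print(articles_or_error(ok_payload))
```

In the notebook, `articles_or_error(response.json())` would play the same role on the real response.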

In [None]:
dictionnary = response.json()
print(dictionnary.keys())

In [None]:
for element in list(dictionnary.keys()):
    print("##############################################")
    print("Key: ", element, "// Values: ", dictionnary[element])

In [None]:
# Now we have lists inside dictionaries (the data is JSON, but the structure is very similar).
# Let's explore the contents of the "articles" key.

for element in enumerate(dictionnary["articles"]):
    print("###############################################")
    print(element)

In [None]:
# If we keep going, each article is itself another dictionary!
for element in dictionnary["articles"][0].keys():
    print(" Key : ", element, "Values : ", dictionnary["articles"][0][element])
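
The nesting explored above (a dictionary holding a list of article dictionaries) can be reduced to flat rows that are easier to read. This is a minimal sketch under stated assumptions: the field names `title`, `publishedAt`, `url` and the nested `source` dictionary are assumptions about the payload, `summarize_articles` is a hypothetical helper, and the sample data is made up.

```python
def summarize_articles(payload, fields=("title", "publishedAt", "url")):
    """Return one flat dict per article, keeping only `fields` plus the source name."""
    rows = []
    for article in payload.get("articles", []):
        row = {field: article.get(field) for field in fields}
        # "source" is assumed to be a nested dictionary with a "name" key.
        row["source"] = (article.get("source") or {}).get("name")
        rows.append(row)
    return rows


# Hypothetical payload, made up for illustration only.
sample = {
    "status": "ok",
    "articles": [
        {
            "source": {"id": "lequipe", "name": "L'equipe"},
            "title": "Example headline",
            "publishedAt": "2020-01-01T00:00:00Z",
            "url": "https://example.com/article",
        }
    ],
}

for row in summarize_articles(sample):
    print(row)
```

Using `.get` rather than indexing means a missing field yields `None` instead of raising `KeyError`.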

### Write a script that collects the details of the ten most recent news items from L'Équipe or another site. Store them in a clean CSV or Excel file.