<h2>Scraping JavaScript</h2>
Navigate to the URL at <a>http://www.webscrapingfordatascience.com/simplejavascript/</a>.
This simple web page shows three random quotes, but it uses
JavaScript to do so. Inspect the source code of the page.
The JavaScript fragment it contains (printed by the code further below) does the following:
<ul>
<li>The code is wrapped in a “$()” function. This is not part of standard
JavaScript, but instead a mechanism provided by jQuery, a popular
JavaScript library that is loaded using another “script” tag. The
code defined in the function will be executed once the browser is
finished with loading the page.</li>
<li> The code inside the function starts by setting a “jsenabled” cookie.
Indeed, JavaScript is able to set and retrieve cookies as well.</li>
<li>Next, a “getJSON” function is used to perform another HTTP request to
fetch the quotes, which are added by inserting a “ul” tag into the page body.</li>
</ul>

In [7]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.webscrapingfordatascience.com/simplejavascript/'
r = requests.get(url)
# print(r.text) 
html_soup = BeautifulSoup(r.text, 'html.parser')
# No tag will be found here
ul_tag = html_soup.find('ul')
print(ul_tag)
# Show the JavaScript code
script_tag = html_soup.find('script', attrs={'src': None})
print(script_tag)

comments=''' #!COMMENTS
The contents of the page are just returned as is, but neither
requests nor Beautiful Soup come with a JavaScript engine included, meaning that no
JavaScript will be executed, and no “<ul>” tag will be found on the page. We can take a
look at the “<script>” tag, but to Beautiful Soup, this will look like any other HTML tag
with a bunch of text inside. We have no way to parse and query the actual JavaScript code.'''

None
<script>
	$(function() {
	document.cookie = "jsenabled=1";
	$.getJSON("quotes.php", function(data) {
		var items = [];
		$.each(data, function(key, val) {
			items.push("<li id='" + key + "'>" + val + "</li>");
		});
		$("<ul/>", {
			html: items.join("")
			}).appendTo("body");
		});
	});
	</script>


In simple situations such as this one, this is not necessarily a problem. We know
that the browser is making requests to a page at “quotes.php”, and that we need to set a
cookie. We can still scrape the data directly:

In [13]:
url = 'http://www.webscrapingfordatascience.com/simplejavascript/quotes.php'
r = requests.get(url,cookies={'jsenabled':'1'},headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36'})
print(r.text)
# If the cookie is not set, you won't be able to get the data

["Whatever the mind of man can conceive and believe, it can achieve. \u2013Napoleon Hill","I am not a product of my circumstances. I am a product of my decisions. \u2013Stephen Covey","Your time is limited, so don\u2019t waste it living someone else\u2019s life. \u2013Steve Jobs"]
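Since the response body is a JSON array, it can be turned into a Python list directly. The following is a minimal sketch that works on a shortened copy of the output above instead of a live request; `sample_text` is an illustrative variable name, not something the page provides. With a live response, calling `r.json()` should give the same kind of list.

```python
import json

# Shortened copy of the response body shown above (two of the three quotes).
# The doubled backslash keeps the JSON escape sequence \u2013 intact.
sample_text = (
    '["Whatever the mind of man can conceive and believe, it can achieve. '
    '\\u2013Napoleon Hill",'
    '"I am not a product of my circumstances. I am a product of my decisions. '
    '\\u2013Stephen Covey"]'
)

# json.loads turns the JSON array into a list of Python strings
quotes = json.loads(sample_text)
for number, quote in enumerate(quotes, start=1):
    print(number, quote)
```

Note that the JSON parser also resolves the `\u2013` escapes into real en dash characters.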


Head over to <a>http://www.webscrapingfordatascience.com/complexjavascript/</a>. You’ll note that this page loads
additional quotes when you scroll to the bottom of the list. Inspecting the script tags
now shows an obfuscated mess. For your web browser,
interpreting and running this code might be simple, but for us humans, it is not. We can still try to inspect the network requests to figure out what is happening here, to
some extent:
<ul>
<li>Requests are made once again to a “quotes.php” page with a “p” URL
parameter, used for pagination</li>
<li>Two cookies are used here: “nonce” and “PHPSESSID.” The latter
we’ve encountered before, and is simply included in the “Set-Cookie”
response header for the main page. The “nonce” cookie, however, is
not, which indicates that it might be set through JavaScript</li>
</ul>


In [16]:
import requests
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
my_session = requests.Session()
# Get the main page first to obtain the PHPSESSID cookie
r = my_session.get(url)
# Manually set the nonce cookie
my_session.cookies.update({
'nonce': '6205' #from browser developer tools 
})
r = my_session.get(url + 'quotes.php', params={'p': '0'})
print(r.text)
print(r.url)
print(r.request.headers)
# Shows: No quotes for you!

No quotes for you!
http://www.webscrapingfordatascience.com/complexjavascript/quotes.php?p=0
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'nonce=6205; PHPSESSID=3e85jr8rukci1om99au4kg8mqr'}


We’re getting a fresh session
identifier by visiting the main page, as if we were coming from a new browsing session,
to provide the “PHPSESSID” cookie. However, we’re reusing the “nonce” cookie value
that our browser was using. The web page might see that this “nonce” value does not
match the “PHPSESSID” information. As such, we have no choice but to also reuse
the “PHPSESSID” value from the browser:

In [17]:
import requests
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
my_cookies = {
'nonce': '6205',
'PHPSESSID': 'm1ajmim0qtqj40foh7t6h61emr'
}
r = requests.get(url + 'quotes.php', params={'p': '0'}, cookies=my_cookies)

# This looks like HTML containing our quotes, but note that every quote seems to be
# encoded in some way. 
print(r.text)

<div class="quote decode">TGlmZSBpcyBhYm91dCBtYWtpbmcgYW4gaW1wYWN0LCBub3QgbWFraW5nIGFuIGluY29tZS4gLUtldmluIEtydXNlDQo=</div><div class="quote decode">CVdoYXRldmVyIHRoZSBtaW5kIG9mIG1hbiBjYW4gY29uY2VpdmUgYW5kIGJlbGlldmUsIGl0IGNhbiBhY2hpZXZlLiDigJNOYXBvbGVvbiBIaWxsDQo=</div><div class="quote decode">CVN0cml2ZSBub3QgdG8gYmUgYSBzdWNjZXNzLCBidXQgcmF0aGVyIHRvIGJlIG9mIHZhbHVlLiDigJNBbGJlcnQgRWluc3RlaW4NCg==</div><br><br><br><br><a class="jscroll-next" href="quotes.php?p=3">Load more quotes</a>
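The encoded strings look like Base64 (an assumption based on their alphabet and the trailing “=” padding; the page’s own script is obfuscated, so the source does not confirm it). As a rough sketch, the snippet below decodes the first block copied from the response above. The names `sample_html` and `decode_quotes` are made up for this illustration, and a regular expression is used only to keep the sketch free of third-party imports.

```python
import base64
import re

# First quote block copied verbatim from the response above
sample_html = (
    '<div class="quote decode">'
    'TGlmZSBpcyBhYm91dCBtYWtpbmcgYW4gaW1wYWN0LCBub3QgbWFraW5nIGFuIGluY29tZS4gLUtldmluIEtydXNlDQo='
    '</div>'
)

def decode_quotes(html):
    """Base64-decode the text of every <div class="quote decode"> block."""
    encoded_blocks = re.findall(r'<div class="quote decode">(.*?)</div>', html)
    # strip() removes the tab and line-break characters around each quote
    return [base64.b64decode(block).decode('utf-8').strip()
            for block in encoded_blocks]

for quote in decode_quotes(sample_html):
    print(quote)
```

Even if the decoding works, this does not remove the need to copy the “nonce” and “PHPSESSID” values from a real browser session by hand.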


The approach above, figuring out by hand what the JavaScript does, comes with a number of issues that, sadly, we’re
unable to solve using what we’ve seen so far. The solution to this problem is easy to
describe: we’re seeing the quotes appear in our browser window, which is executing
JavaScript, so can’t we get them out from there? Indeed, for sites making heavy use of
JavaScript, we’ll have no choice but to emulate a full browser stack, and to move away
from requests and Beautiful Soup.

<h2>Scraping with Selenium</h2>
Selenium is a powerful tool that was originally developed for the purpose
of automated website testing. Selenium works by automating browsers to load a website,
retrieve its contents, and perform actions like a user would when using the browser. As
such, it’s also a powerful tool for web scraping. Selenium can be controlled from various
programming languages, such as Java, C#, PHP, and of course, Python.
Selenium itself does not come with its own web browser.
Instead, it requires a piece of integration software, called
a <b>WebDriver</b>, to interact with a third-party browser. For this tutorial we use the Chrome WebDriver, which can be downloaded from <a>https://sites.google.com/a/chromium.org/chromedriver/downloads</a>. To install Selenium, run <i>pip install -U selenium</i>.

In [2]:
#!BOILER PLATE
from selenium import webdriver
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
driver = webdriver.Chrome()
driver.get(url)
input('Press ENTER to close the automated browser')
driver.quit()

If you prefer to keep the WebDriver executable somewhere else, it is also possible
to pass its location as you construct the Selenium webdriver object in Python like so
(however, we’ll assume that you keep the executable in the same directory for the
examples that follow to keep the code a bit shorter):<br>
<pre>
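from selenium.webdriver.chrome.service import Service

driver_exe = 'C:/Users/Seppe/Desktop/chromedriver.exe'
# If you copy-paste the path with back-slashes, make sure to escape them
# E.g.: driver_exe = 'C:\\Users\\Seppe\\Desktop\\chromedriver.exe'
# Selenium 4 expects the driver path to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(driver_exe))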
</pre>

Let’s modify the last program to showcase Selenium methods, for instance, to get out the
quotes’ contents from <a>http://www.webscrapingfordatascience.com/complexjavascript/</a>, which we <b>failed to fetch using Beautiful Soup</b>.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By #import By class 

url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
# chromedriver should be in the same path as your Python script
driver = webdriver.Chrome()
driver.get(url)
for quote in driver.find_elements(By.CLASS_NAME,'quote'):
    print(quote.text)

Running the code above doesn’t seem to work, as no quotes are displayed at all. The
reason is that our browser takes some time, even if only half a second,
to execute the JavaScript, fetch the quotes, and display them. Meanwhile, our Python
script is already hard at work trying to find quote elements, which at that moment are
not yet there. We might simply slap a sleep line into our code to wait a few seconds, but
Selenium comes with a more robust approach: <b>wait conditions</b>.

<h4>SELENIUM WAITS</h4>
Selenium provides two types of waits: implicit and explicit.<br>
<b>Implicit Waits</b>
An implicit wait makes
WebDriver poll the page for a certain amount of time every time it tries to locate an
element. Think of the implicit wait as a “catch all” where we wait, every time we try
to locate an element, up to a specified amount of time. By default, the implicit waiting
time is set to zero, meaning that Selenium will not wait at all. Implicit waits are helpful when you’re just getting started with Selenium.

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By #import By class 

url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
driver = webdriver.Chrome()
# Set an implicit wait
driver.implicitly_wait(10)
driver.get(url)
for quote in driver.find_elements(By.CLASS_NAME,'quote'):
    print(quote.text)

Life is about making an impact, not making an income. -Kevin Kruse
Whatever the mind of man can conceive and believe, it can achieve. –Napoleon Hill
Strive not to be a success, but rather to be of value. –Albert Einstein


<b>Explicit Waits</b>
<br>
An explicit wait pauses until a specific condition holds or a timeout expires. For explicit waits, we rely on the following imports:
<pre>
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC</pre>

The example below works as follows. First, we create a WebDriverWait object using our
WebDriver and the number of seconds we’d like to wait. We then call the until
method on this object, to which we need to provide a condition object, the predefined
<b>presence_of_all_elements_located</b> in our case. Here, our locator states that we want to find elements by a given CSS selector rule,
specifying all elements with a “quote” CSS class but not with a “decode” CSS class, as we
want to wait until the JavaScript code on the page is done decoding the quotes. This condition will be checked over and over again until 10 seconds have passed,
or until the condition returns something that is not False, that is, the list of matching
elements in the case of presence_of_all_elements_located. We can then directly loop
over this list and retrieve the quotes’ contents.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
driver = webdriver.Chrome()
driver.get(url)
quote_elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".quote:not(.decode)")))
for quote in quote_elements:
    print(quote.text)

Life is about making an impact, not making an income. -Kevin Kruse
Whatever the mind of man can conceive and believe, it can achieve. –Napoleon Hill
Strive not to be a success, but rather to be of value. –Albert Einstein
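To make the “checked over and over again” behavior of an explicit wait more concrete, here is a simplified pure Python sketch of such a polling loop. It is not Selenium’s implementation: `poll_until`, `PollTimeout`, and `FakePage` are hypothetical names invented for this illustration, and `FakePage` merely imitates a page whose quotes only show up after a few checks.

```python
import time

class PollTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""

def poll_until(condition, subject, timeout=10.0, poll_frequency=0.5):
    """Call condition(subject) repeatedly until it returns something truthy.

    Returns that truthy result, or raises PollTimeout once `timeout`
    seconds have passed without success.
    """
    end_time = time.monotonic() + timeout
    while True:
        result = condition(subject)
        if result:
            return result
        if time.monotonic() > end_time:
            raise PollTimeout('condition not met within %.2f seconds' % timeout)
        time.sleep(poll_frequency)

class FakePage:
    """Imitates a page on which the quotes appear only at the third lookup."""
    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_quotes(self):
        self.calls += 1
        return ['quote'] * 3 if self.calls >= self.ready_after else []

page = FakePage(ready_after=3)
quotes = poll_until(lambda p: p.find_quotes(), page, timeout=2.0, poll_frequency=0.01)
print(len(quotes), 'quotes found after', page.calls, 'checks')
```

An empty list is falsy in Python, which is why a condition that returns the list of found elements can double as the “not ready yet” signal.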


So far, our example only returns the
first three quotes. We still need to figure out a way to scroll down in our list of quotes
using Selenium in order to load all of them. To do so, Selenium comes with a selection
of <b>“actions”</b> that can be performed by the browser, such as clicking elements, clicking
and dragging, double-clicking, right-clicking, and so on, which we could use to
move down the scroll bar. Alternatively, we can use the execute_script method to send a JavaScript
command to the browser for scrolling, which is what the next example does. It also defines a custom
wait condition, at_least_n_elements_found, which returns the matching elements once at least n of
them are present and False otherwise, so that we can wait for new quotes after each scroll.

In [17]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class at_least_n_elements_found(object):
    def __init__(self, locator, n):
        self.locator = locator
        self.n = n
    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if len(elements) >= self.n:
            return elements
        else:
            return False
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
driver = webdriver.Chrome()
driver.get(url)
# Use an implicit wait for cases where we don't use an explicit one
driver.implicitly_wait(10)
div_element = driver.find_element(By.CLASS_NAME,'infinite-scroll')
quotes_locator = (By.CSS_SELECTOR, ".quote:not(.decode)") # locator tuple: (strategy, selector)

nr_quotes = 0
while True:
    # Scroll down to the bottom
    driver.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight',div_element)
    # Try to fetch at least nr_quotes+1 quotes
    try:
        all_quotes = WebDriverWait(driver, 3).until(at_least_n_elements_found(quotes_locator, nr_quotes + 1))
    except TimeoutException as ex:
        # No new quotes found within 3 seconds, assume this is all there is
        print("... done!")
        break
    # Otherwise, update the quote counter
    nr_quotes = len(all_quotes)
    print("... now seeing", nr_quotes, "quotes")
   
# all_quotes will contain all the quote elements
print(len(all_quotes), 'quotes found\n')
for quote in all_quotes:
    print(quote.text)

... now seeing 3 quotes
... now seeing 6 quotes
... now seeing 9 quotes
... now seeing 12 quotes
... now seeing 15 quotes
... now seeing 18 quotes
... now seeing 21 quotes
... now seeing 24 quotes
... now seeing 27 quotes
... now seeing 30 quotes
... now seeing 33 quotes
... done!
33 quotes found

Life is about making an impact, not making an income. -Kevin Kruse
Whatever the mind of man can conceive and believe, it can achieve. –Napoleon Hill
Strive not to be a success, but rather to be of value. –Albert Einstein
Two roads diverged in a wood, and I—I took the one less traveled by, And that has made all the difference. –Robert Frost
I attribute my success to this: I never gave or took any excuse. –Florence Nightingale
You miss 100% of the shots you don’t take. –Wayne Gretzky
I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I

If you’d like to see how this would work using actions instead of JavaScript
commands, you can take a look at the following fragment (note the two new imports).
In the next section, we’ll talk more about interacting with a web page through actions.

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
class at_least_n_elements_found(object):
    def __init__(self, locator, n):
        self.locator = locator
        self.n = n
    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if len(elements) >= self.n:
            return elements
        else:
            return False
url = 'http://www.webscrapingfordatascience.com/complexjavascript/'
driver = webdriver.Chrome()
driver.get(url)
# Use an implicit wait for cases where we don't use an explicit one
driver.implicitly_wait(10)
div_element = driver.find_element(By.CLASS_NAME,'infinite-scroll')
quotes_locator = (By.CSS_SELECTOR, ".quote:not(.decode)")
nr_quotes = 0

while True:
    # Scroll down to the bottom, now using action (chains)
    action_chain = ActionChains(driver)
    # Move to our quotes block
    action_chain.move_to_element(div_element)
    # Click it to give it focus
    action_chain.click()
    # Press the page down key ten times
    action_chain.send_keys([Keys.PAGE_DOWN for i in range(10)])
    # Do these actions
    action_chain.perform()
    # Try to fetch at least nr_quotes+1 quotes
    try:
        all_quotes = WebDriverWait(driver, 3).until(
            at_least_n_elements_found(quotes_locator, nr_quotes + 1) # instantiates our custom wait condition
        )
    except TimeoutException as ex:
        # No new quotes found within 3 seconds, assume this is all there is
        print("... done!")
        break
    # Otherwise, update the quote counter
    nr_quotes = len(all_quotes)
    print("... now seeing", nr_quotes, "quotes")
# all_quotes will contain all the quote elements
print(len(all_quotes), 'quotes found\n')
for quote in all_quotes:
    print(quote.text)

... now seeing 6 quotes
... now seeing 9 quotes
... now seeing 12 quotes
... now seeing 15 quotes
... now seeing 18 quotes
... now seeing 21 quotes
... now seeing 24 quotes
... now seeing 27 quotes
... now seeing 30 quotes
... now seeing 33 quotes
... done!
33 quotes found

Life is about making an impact, not making an income. -Kevin Kruse
Whatever the mind of man can conceive and believe, it can achieve. –Napoleon Hill
Strive not to be a success, but rather to be of value. –Albert Einstein
Two roads diverged in a wood, and I—I took the one less traveled by, And that has made all the difference. –Robert Frost
I attribute my success to this: I never gave or took any excuse. –Florence Nightingale
You miss 100% of the shots you don’t take. –Wayne Gretzky
I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed. –Michael Jorda

Let’s explore the form at <a>http://www.webscrapingfordatascience.com/postform2/</a> using Selenium.

In [27]:
from selenium import webdriver
url = 'http://www.webscrapingfordatascience.com/postform2/'
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)

Let’s start by talking a bit more about navigation. We have already seen the get
method to navigate to a URL using Selenium. Similarly, you can also call a driver’s
forward and back methods (these take no arguments) to go forward and backward in the
browser’s history. Regarding cookies, it is helpful to know that — since Selenium uses
a real browser — we don’t need to worry about cookie management ourselves. If you
want to output the cookies currently available, you can call the get_cookies method on
a WebDriver object. The add_cookie method allows you to set a new cookie (it expects a
dictionary with “name” and “value” keys as its argument).
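As a small illustration of the cookie methods, the list of dictionaries returned by get_cookies can be flattened into a plain name-to-value mapping, for example to reuse a browser session’s cookies elsewhere. This is only a sketch: `cookies_as_dict` is a helper name made up here, and `StubDriver` is a hypothetical stand-in for a real WebDriver so that the snippet runs without a browser. Its cookie values are the ones copied from the browser in the earlier requests example.

```python
def cookies_as_dict(driver):
    """Flatten get_cookies() output (a list of dicts) into {name: value}."""
    return {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}


class StubDriver:
    """Hypothetical stand-in for a WebDriver; a real one returns extra keys
    per cookie (such as domain and path) next to 'name' and 'value'."""
    def get_cookies(self):
        return [
            {'name': 'PHPSESSID', 'value': 'm1ajmim0qtqj40foh7t6h61emr'},
            {'name': 'nonce', 'value': '6205'},
        ]


print(cookies_as_dict(StubDriver()))
```

With a real driver that has loaded the page, the resulting dictionary could be passed as the `cookies` argument of `requests.get`, in the same way as `my_cookies` earlier.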
<br>
Every time you retrieve elements using the find_element and find_elements methods
(the older find_element_by_* and find_elements_by_* helpers have been removed from recent Selenium 4 releases),
Selenium will return WebElement objects. The following example uses several of the methods
these objects offer, such as send_keys and click:

In [1]:
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By 

url = 'http://www.webscrapingfordatascience.com/postform2/'
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)

driver.find_element(By.NAME,'name').send_keys('Qamar') # type a value into the name input
driver.find_element(By.CSS_SELECTOR,'input[name="gender"][value="M"]').click() # click selects a radio button or checkbox

driver.find_element(By.NAME,'pizza').click()
driver.find_element(By.NAME,'salad').click()
Select(driver.find_element(By.NAME,'haircolor')).select_by_value('brown') # choose an option from the select list
driver.find_element(By.NAME,'comments').send_keys(['First line', Keys.ENTER, 'Second line']) # text area; Keys.ENTER starts a new line

input('Press ENTER to submit the form')
# driver.find_element(By.TAG_NAME,'form').submit()
driver.find_element(By.CSS_SELECTOR,'input[type="submit"]').click()



Instead of working with actions directly as seen above, Selenium also provides an
<b>ActionChains object </b>(found under “selenium.webdriver.common.action_chains”) to
construct more fine-grained chains of actions. This is useful for doing more complex
actions like hover over and drag and drop. The following example is functionally equivalent to the above, but it uses action
chains to fill in most of the form fields: 

In [11]:
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By 

from selenium.webdriver.common.action_chains import ActionChains
url = 'http://www.webscrapingfordatascience.com/postform2/'
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)
chain = ActionChains(driver)
chain.send_keys_to_element(driver.find_element(By.NAME,'name'), 'Seppe')
chain.click(driver.find_element(By.CSS_SELECTOR,'input[name="gender"][value="M"]'))
chain.click(driver.find_element(By.NAME,'pizza'))
chain.click(driver.find_element(By.NAME,'salad'))
chain.click(driver.find_element(By.NAME,'comments'))
chain.send_keys('This is a first line', Keys.ENTER, 'And this a second') #don't use [] in chains 
chain.perform()
Select(driver.find_element(By.NAME,'haircolor')).select_by_value('brown')
# input('Press ENTER to submit the form')
driver.find_element(By.TAG_NAME,'form').submit()
# Or: driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()