## Starting Off

With a partner, answer the following question.

Is it legal to scrape data from websites?

# Advanced Webscraping: How to make sure you don't get blocked.

## Aims:

- Write scripts that can handle errors and minimize the likelihood of your IP address getting blocked.

- Write a Selenium script to automatically log in to a website.

## Agenda

- Talk about the legality of scraping
- Look at ways to programmatically avoid getting banned
- Set up the Selenium WebDriver
- Learn how to use Selenium
- Write your own script

In [12]:
import requests
from bs4 import BeautifulSoup as BS


In [8]:

url="https://www.amazon.com/Best-Sellers/zgbs"

page = requests.get(url)
page

<Response [200]>

In [13]:
soup = BS(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr">
 <head>
  <link href="https://images-na.ssl-images-amazon.com/images/I/41YNcL3lZpL._RC|51giv2WPknL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31bAdTWQ3tL.css,01gIQOTI9IL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21EuGTxgpoL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01cbS3UK11L.css,21Yu8dLExyL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident" rel="stylesheet"/>
  <script>
   (function(g,h,Q,z){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x.count&&x.count("aui:"+a,0===b?0:b||(x.count("aui:"+a)||0)+1)}function m(a){try{return a.test(navigator.userAgent)}catch(b){return!1}}function y(a,b,c){a.addEventListener?a.addEventListener(b,c,!1):a.attachEvent&&a.attachEvent("on"+b,c)}function q(a,b,c,d){b=b&&c?b+a+c:b||c;return d?q(a,b,d):b}function H

In [41]:
urls = soup.select('ul#zg_browseRoot a' )
urls[0]

<a href="https://www.amazon.com/Best-Sellers/zgbs/amazon-devices">Amazon Devices &amp; Accessories</a>

In [42]:
urls[0]['href']

'https://www.amazon.com/Best-Sellers/zgbs/amazon-devices'

In [None]:
for url in urls:
    # each item in `urls` is an <a> tag, so pass its href to requests
    page = requests.get(url['href'])
    # more code to process the results
    

## 1- Check 200 status code
It is always good to check the HTTP status code early and proceed accordingly.

This is good:

~~~
if response.status_code == 200:
    pass  # proceed further
~~~

This is better inside a function, because returning early keeps the rest of the logic unindented:

~~~
if response.status_code != 200:
    return False
~~~

In [None]:
for url in urls:
    page = requests.get(url['href'])
    # include code to do status check
    if page.status_code != 200:
        # `return` only works inside a function, so report and skip this URL
        print(url['href'], page.status_code)
        continue
    
    # more code to process the results
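
To make "proceed accordingly" concrete, here is a minimal sketch that maps a status code to an action. The function name `classify_status` and the action labels are hypothetical, and which codes deserve a retry is an assumption that depends on the site you are scraping.

```python
from http import HTTPStatus

# Codes that usually mean "slow down and try again later" (an assumption;
# adjust for the site you are scraping).
RETRYABLE = {
    HTTPStatus.TOO_MANY_REQUESTS,    # 429
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503
}


def classify_status(status_code):
    """Return 'ok', 'retry', 'blocked', or 'skip' for an HTTP status code."""
    if status_code == HTTPStatus.OK:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code == HTTPStatus.FORBIDDEN:  # 403: the server refused us
        return "blocked"
    return "skip"


for code in (200, 429, 403, 404):
    print(code, classify_status(code))
```

In the loop above you could call `classify_status(page.status_code)` and decide whether to parse the page, pause and retry, or move on to the next URL.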

## 2- Never Trust HTML

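Especially if you can't control it. Web scraping depends on the HTML DOM, and a simple change in an element or class name could break your entire script. The best way to deal with this is to check whether the selector returned anything before you use the result (`select()` returns an empty list, and `select_one()` returns `None`, when nothing matches).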

~~~
page_count = soup.select('.pager-pages > li > a')
if page_count:
    pass  # do your stuff
else:
    pass  # ALERT!! Send notification to Admin
~~~

Here I am checking whether the CSS selector returned something legitimate; if it did, the script proceeds.
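
The "send notification" branch can be as simple as a logged warning. The sketch below is one way to do it, not the only one: `require_matches`, the logger name, and the example URL are all hypothetical, and the helper takes the list returned by `soup.select(...)` so it does not depend on any particular page.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("scraper")  # hypothetical logger name


def require_matches(matches, selector, url):
    """Return matches if non-empty; otherwise log a warning and return None."""
    if not matches:
        logger.warning(
            "Selector %r matched nothing on %s; did the page layout change?",
            selector,
            url,
        )
        return None
    return matches


# In a real script `matches` would be the result of soup.select(selector).
selector = ".pager-pages > li > a"
found = require_matches(["<a>1</a>", "<a>2</a>"], selector, "https://example.com/list")
missing = require_matches([], selector, "https://example.com/list")
print(found, missing)
```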

In [None]:
for url in urls:
    page = requests.get(url['href'])
    # include code to do status check
    if page.status_code != 200:
        print(url['href'], page.status_code)
        continue
    
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 3- Set headers

Python Requests does not force you to send request headers, but some websites will not return anything useful unless certain headers are set. I once faced a situation where the HTML I saw in the browser was different from what my script received, kind of like magic, huh. So it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

~~~
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

response = requests.get(url, headers=headers, timeout=5)

~~~
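
If every request should carry the same headers, one option is to set them once on a `requests.Session` instead of passing `headers=` on each call. This is a sketch rather than something the script above does, and `make_session` is a hypothetical helper name.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Build a Session whose requests all send the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
# No request is sent here; this only shows the header the session will use.
print(session.headers["User-Agent"])
```

Calling `session.get(url, timeout=5)` then behaves like `requests.get` with those headers already attached.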

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url['href'], headers = headers)
    # include code to do status check
    if page.status_code != 200:
        print(url['href'], page.status_code)
        continue
    
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 4- Set timeout

One issue with Python Requests is that, if you don't specify a timeout, it will wait indefinitely for a response. This might be acceptable in certain conditions, but not in the majority of cases. Therefore, it's always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.

~~~
response = requests.get(url, headers=headers, timeout=5)
~~~

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url['href'], headers = headers, timeout=5)
    # include code to do status check
    if page.status_code != 200:
        print(url['href'], page.status_code)
        continue
    
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 5- Exception handling

It is always good to implement exception handling. It not only helps avoid an unexpected exit of the script but can also help you log errors and informational messages. When using Python Requests, I prefer to catch exceptions like this:

~~~
try:
    # your logic is here
    response = requests.get(url, headers=headers, timeout=5)

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program") 
~~~

Check the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump everything at the time of exit.
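
A minimal sketch of that "wrap things up" idea follows. It assumes a hypothetical `fetch` callable standing in for the real request, and the names `scrape_all` and `fake_fetch` are invented for the illustration; the `finally` block is what guarantees the partial results get written.

```python
import json
import os
import tempfile


def scrape_all(urls, fetch, out_path):
    """Call fetch(url) for each URL and always save what was collected."""
    results = []
    try:
        for url in urls:
            results.append({"url": url, "body": fetch(url)})
    except KeyboardInterrupt:
        print("Interrupted; saving partial results")
    finally:
        # Runs on normal completion, on Ctrl+C, and on any other error.
        with open(out_path, "w") as f:
            json.dump(results, f)
    return results


def fake_fetch(url):
    """Stand-in for a real request; simulates Ctrl+C on the third URL."""
    if url.endswith("/3"):
        raise KeyboardInterrupt
    return "page for " + url


out_path = os.path.join(tempfile.mkdtemp(), "partial_results.json")
demo_urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]
saved = scrape_all(demo_urls, fake_fetch, out_path)
print(len(saved), "pages saved to", out_path)
```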

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url['href'], headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(url['href'], page.status_code)
            continue
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 6- Regulate your request pace

Many websites limit how many requests you can make within a minute, hour, or day. You want to be aware of that limit and adjust your script to account for it.

One example is using the `sleep()` function from the `time` module. This pauses your script for a set amount of time.

~~~
import time
 
 
## Start loop ##
for url in urls:

    # try to make request here.

    #### Delay for 1 second ####
    time.sleep(1)
        
~~~
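
A fixed one-second pause is the simplest option. Two common refinements, neither of which is in the script above, are adding random jitter so requests are not perfectly regular and waiting longer after each failed attempt (exponential backoff). The sketch below assumes a hypothetical `fetch` callable that raises an exception on failure, and it takes `sleep` as a parameter so it can be tried without actually waiting.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try fetch(url) up to `retries` times, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; let the caller see the error
            # 1s, 2s, 4s, ... plus up to half a second of random jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


# Demo with a stand-in fetch that fails twice, then succeeds.
calls = {"count": 0}
waits = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok: " + url


result = fetch_with_backoff(flaky_fetch, "https://example.com", sleep=waits.append)
print(result, waits)
```

Passing `waits.append` as `sleep` records the delays instead of pausing; in a real script you would leave the default `time.sleep`.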

In [48]:
import time
 
 
## Start loop ##
for url in urls:
    print("Current date & time " + time.strftime("%c"))
 
    time.sleep(1)

Current date & time Wed Mar 20 16:45:01 2019
Current date & time Wed Mar 20 16:45:02 2019
Current date & time Wed Mar 20 16:45:03 2019
Current date & time Wed Mar 20 16:45:04 2019
Current date & time Wed Mar 20 16:45:05 2019
Current date & time Wed Mar 20 16:45:06 2019
Current date & time Wed Mar 20 16:45:07 2019
Current date & time Wed Mar 20 16:45:08 2019
Current date & time Wed Mar 20 16:45:09 2019
Current date & time Wed Mar 20 16:45:10 2019


KeyboardInterrupt: 

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url['href'], headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(url['href'], page.status_code)
            continue
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
        
    else:
        print("Data is coming back blank")
    
    time.sleep(1)

## 7- Save as you go

Your script might break halfway through a scrape, so make sure you are saving your data as you go.

~~~ 
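import csv
import os
...
# open() does not expand "~", so expand it explicitly
output_path = os.path.expanduser("~/Desktop/output.csv")
with open(output_path, "w", newline="") as f:
    writer = csv.writer(f)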

    # collected_items = [
    #   ["Product #1", "10", "http://example.com/product-1"],
    #   ["Product #2", "25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)
~~~

With `csv.DictWriter`, each row is a dictionary keyed by column name:

~~~
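import csv
import os
...
field_names = ["Product Name", "Price", "Detail URL"]
# open() does not expand "~", so expand it explicitly
output_path = os.path.expanduser("~/Desktop/output.csv")
with open(output_path, "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)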

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
~~~
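
Both snippets above open the file in `"w"` mode, which is fine when everything is written at the end. To save as you go, one approach is to append after every page and write the header only when the file is new. The sketch below assumes that approach; `append_rows` is a hypothetical helper, and the product rows are made-up sample data.

```python
import csv
import os
import tempfile

FIELD_NAMES = ["Product Name", "Price", "Detail URL"]


def append_rows(path, rows):
    """Append dict rows to a CSV, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, FIELD_NAMES)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory so nothing on the Desktop is touched.
path = os.path.join(tempfile.mkdtemp(), "output.csv")
append_rows(path, [{"Product Name": "Product #1", "Price": "10",
                    "Detail URL": "http://example.com/product-1"}])
append_rows(path, [{"Product Name": "Product #2", "Price": "25",
                    "Detail URL": "http://example.com/product-2"}])

with open(path, newline="") as f:
    saved = list(csv.reader(f))
print(saved)
```

Because each call reopens the file in append mode, rows from pages that were already scraped survive even if a later page crashes the script.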

In [None]:
import csv
import os

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url['href'], headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(url['href'], page.status_code)
            continue
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break
    # more code to process the results
    # parse the page so we can query it
    soup = BS(page.content, 'html.parser')
    
    items = soup.select('.specific_class')
    if items:
        pass  # continue processing the data
        
    else:
        print("Data is coming back blank")
    # one row per matched element (placeholder for your real extraction logic)
    collected_items = [[item.get_text(strip=True)] for item in items]
    # append this page's rows to the csv file ("w" would overwrite earlier pages)
    with open(os.path.expanduser("~/Desktop/output.csv"), "a", newline="") as f:
        writer = csv.writer(f)

        for item_property_list in collected_items:
            writer.writerow(item_property_list)
    time.sleep(1)

## More Resources 
- [More advanced issues](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)
- [Requests: Advanced Usage](http://docs.python-requests.org/en/master/user/advanced/#)

Web scraping with Python often requires no more than the use of the Beautiful Soup module to reach the goal. Beautiful Soup is a popular Python library that makes web scraping by traversing the DOM (document object model) easier to implement.

## Selenium

The Selenium package automates web browser interaction from Python, which makes it possible to script actions such as filling in forms and clicking buttons.

In [49]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [50]:
driver = webdriver.Chrome()
driver.get("https://www.instagram.com/accounts/login/")


In [None]:
username = ''
pw = ''

In [6]:
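from selenium.common.exceptions import TimeoutException

# the find_element(s)_by_* helpers were removed in Selenium 4.3; use By locators instead
form_inputs = driver.find_elements(By.CSS_SELECTOR, 'form input')
email, password = form_inputs[0], form_inputs[1]
email.send_keys(username)
password.send_keys(pw)
login = driver.find_element(By.XPATH, '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/button')
login.click()
try:
    # wait up to 15 seconds for the "Not Now" prompt and dismiss it
    not_now = WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.XPATH, '//button[text()="Not Now"]')
    )
    not_now.click()
except TimeoutException:
    # the prompt never appeared, so carry on
    pass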
driver.get("https://www.instagram.com/foodandprobability")

ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

### Transitioning to Beautiful Soup
Beautiful Soup remains the best way to traverse the DOM and scrape the data. After utilizing Selenium to handle the interactive parts, it is time to ask Beautiful Soup to grab the data that you need.
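
The hand-off can be sketched as follows: Selenium exposes the rendered HTML as `driver.page_source`, and that string can be passed straight to Beautiful Soup. The helper name `soup_from_driver` is hypothetical, and a stand-in object replaces the real driver here so the example runs without opening a browser.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup as BS


def soup_from_driver(driver):
    """Parse whatever HTML the browser is currently showing."""
    return BS(driver.page_source, "html.parser")


# Stand-in for a real Selenium driver; only the page_source attribute is used.
fake_driver = SimpleNamespace(
    page_source="<html><body><h1>Profile</h1><a href='/p/1'>Post</a></body></html>"
)

soup = soup_from_driver(fake_driver)
links = [a["href"] for a in soup.select("a")]
print(soup.h1.get_text(), links)
```

With a real session you would call `soup_from_driver(driver)` after the page has finished loading, then use `select()` exactly as in the earlier sections.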