## Starting Off

With a partner, answer the following question:

Is it legal to scrape data from websites?

# Advanced Webscraping: How to make sure you don't get blocked.

## Aims:

- Write scripts that can handle errors and minimize the likelihood of your IP address getting blocked.


## Agenda

- Talk about the legality of scraping
- Practice scraping
- Look at ways to programmatically avoid getting banned
- Set up the selenium webdriver
- Learn how to use Selenium

## 1. Check 200 status code
It is always good to check the HTTP status code early and proceed accordingly.

This is good:

~~~
if response.status_code == 200:
    pass  # proceed further
~~~

This is better, because returning early keeps the rest of the function from being nested inside an `if`:

~~~
if response.status_code != 200:
    return False
~~~
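
As a minimal sketch of the early-return pattern, the hypothetical helper `process` below bails out as soon as it sees a non-200 status, so the main logic is not nested. The `SimpleNamespace` objects are stand-ins for real `requests` responses (an assumption made so the example runs without a network connection).

```python
from types import SimpleNamespace


def process(response):
    """Return False for any non-200 response, True once the body is handled."""
    if response.status_code != 200:
        return False  # guard clause: stop before touching the body

    # Main logic lives at the top indentation level of the function.
    print(f"Got {len(response.text)} characters")
    return True


# Stand-ins for requests.Response objects (assumption: only these two
# attributes are needed for the illustration).
ok = SimpleNamespace(status_code=200, text="<html></html>")
not_found = SimpleNamespace(status_code=404, text="Not Found")

results = [process(ok), process(not_found)]
```

With a real response you would pass the object returned by `requests.get(url)` instead of the stand-ins.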

In [1]:
for url in urls:
    page = requests.get(url)
    if page.status_code == 200:
        # include code to process the page
        pass
    else:
        print(page.status_code)

    # more code to process the results

## 2. Never Trust HTML

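Especially if you can’t control it. Web scraping depends on the HTML DOM, so a simple change in an element or class name could break your entire script. The best way to deal with this is to check whether your selector actually matched anything: `select()` returns an empty list when nothing matches, while `select_one()` and `find()` return `None`.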

~~~
page_count = soup.select('.pager-pages > li > a')
if page_count:
    pass  # do your stuff
else:
    # ALERT!! Send notification to Admin
    print("Selector matched nothing; the page layout may have changed")

Here I am checking whether the CSS selector returned something legitimate; if it did, the script proceeds.
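
The difference between an empty result and `None` is easy to see on a small hand-written snippet. The HTML below is made up for illustration, and `.no-such-class` is a selector chosen because it matches nothing.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a downloaded page (assumption).
html = """
<ul class="pager-pages">
  <li><a href="/page/1">1</a></li>
  <li><a href="/page/2">2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select(".pager-pages > li > a")        # list of matching tags
missing = soup.select(".no-such-class")             # empty list, not None
first_missing = soup.select_one(".no-such-class")   # None

if links:
    page_numbers = [a.get_text(strip=True) for a in links]
else:
    page_numbers = []
    print("Pager not found; the page layout may have changed")
```

Because an empty list and `None` are both falsy, a plain `if links:` check covers either case.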

In [2]:
broken_links = []
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print( page.status_code)
        broken_links.append(url)
        continue
    
    # more code to process the results
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

NameError: name 'urls' is not defined

## 3. Set headers

`requests` does not force you to send any particular headers, but some websites will not return anything useful unless certain headers are set. I once ran into a case where the HTML I saw in the browser was different from what my script received. So it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

~~~
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

response = requests.get(url, headers=headers, timeout=5)

~~~
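
If you make many requests, one option is to set the header once on a `requests.Session` so that every request made through it carries the same User-Agent. This is a sketch rather than the only way to do it; `https://example.com` is a placeholder URL, and the request is only prepared here, not sent.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
)

session = requests.Session()
# Headers set on the session are merged into every request it sends.
session.headers.update({"User-Agent": USER_AGENT})

# Prepare (but do not send) a request to inspect the headers it would carry.
prepared = session.prepare_request(requests.Request("GET", "https://example.com"))
print(prepared.headers["User-Agent"])

# To actually fetch a page you would call, for example:
# response = session.get(url, timeout=5)
```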

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers=headers)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 4. Set timeout

One of the issues with `requests` is that, if you don’t set a **timeout**, it will keep waiting for a response indefinitely. If the server never answers, your script is left hanging.

To set the request’s timeout, use the `timeout` parameter. `timeout` can be an integer or float giving the number of seconds to wait for a response before timing out:

~~~
response = requests.get(url, headers=headers, timeout=5)
~~~


You can also pass a tuple to timeout with the first element being a connect timeout (the time it allows for the client to establish a connection to the server), and the second being a read timeout (the time it will wait on a response once your client has established a connection):

~~~ 
requests.get('https://api.github.com', timeout=(2, 5))
~~~
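
With a tuple timeout, `requests` raises a different exception depending on which phase ran out of time, so you can report the two cases separately. The helper name `fetch_with_timeout` and its default values are assumptions for this sketch.

```python
import requests


def fetch_with_timeout(url, connect_timeout=2, read_timeout=5):
    """Return the response, or None if either timeout was exceeded."""
    try:
        return requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        # The connection itself could not be established in time.
        print(f"Could not connect to {url} within {connect_timeout}s")
    except requests.exceptions.ReadTimeout:
        # Connected, but the server was too slow to send a response.
        print(f"Connected to {url}, but no response within {read_timeout}s")
    return None
```

A caller would then check for `None` before parsing, for example `page = fetch_with_timeout(url)` followed by `if page is None: continue` inside the loop.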

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers=headers, timeout=(2, 5))
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 5. Exception handling

It is always good to implement exception handling. It not only keeps the script from exiting unexpectedly but also lets you log errors and informational messages. When using Python `requests`, I prefer to catch exceptions like this:

~~~
try:
    pass  # your logic is here

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program") 
~~~

Look at the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump everything at the time of exit.
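
Here is a minimal sketch of that "dump everything on exit" idea. The names `scrape_all` and `fake_fetch` are hypothetical, the output goes to a temporary directory, and the Ctrl+C is simulated by raising `KeyboardInterrupt` on the third URL so the example runs on its own.

```python
import json
import os
import tempfile


def scrape_all(urls, fetch, output_path):
    """Collect fetch(url) for each url and always write what was gathered."""
    results = []
    try:
        for url in urls:
            results.append(fetch(url))
    except KeyboardInterrupt:
        print("Interrupted; saving partial results")
    finally:
        # Runs whether the loop finished or was interrupted.
        with open(output_path, "w") as f:
            json.dump(results, f)
    return results


def fake_fetch(url):
    """Stand-in for a real request; pretends the user hit Ctrl+C on page 3."""
    if url.endswith("/3"):
        raise KeyboardInterrupt
    return {"url": url}


output_path = os.path.join(tempfile.mkdtemp(), "partial_results.json")
urls = [f"https://example.com/{i}" for i in range(1, 6)]
saved = scrape_all(urls, fake_fetch, output_path)
```

The two pages fetched before the interrupt end up in the JSON file instead of being lost.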

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url, headers=headers, timeout=5)
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue  # page was never assigned, so skip to the next url
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break

    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

This code is starting to get long and hard to read. So let's start to modularize it.  

In [None]:
def get_page(url):
    """Request url and return the response, or None if the request failed."""
    page = None  # stays None if requests.get raises
    try:
        page = requests.get(url, headers=headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)

    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")

    return page


We can replace a chunk of our code with this function:

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    # use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 6. Regulate your request pace

Many websites limit how many times you can request pages within a minute, hour, or day. You want to be aware of that limit and adjust your script to stay under it.

One example is using the `sleep()` function from the `time` module, which pauses your script for a set amount of time.

~~~
import time


## Start loop ##
for url in urls:

    # try to make the request here.


    #### Delay for 1 second ####
    time.sleep(1)

~~~
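
A fixed delay produces a very regular request pattern. One common variation, shown here as a sketch rather than a requirement, is to sleep for a random interval instead. The helper name `polite_pause` and its default bounds are assumptions; the demonstration call uses tiny values so it finishes quickly.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Demonstration with very short bounds; in a scraper you would call
# polite_pause() with the defaults at the end of each loop iteration.
waited = polite_pause(0.01, 0.02)
print(f"Paused for {waited:.3f} seconds")
```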

In [None]:
import time


## Start loop ##
for url in urls:
    print("Current date & time " + time.strftime("%c"))

    # use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

    time.sleep(3)

## 7 - Save as you go

Your script might break halfway through a scrape, so make sure you are saving your data as you go.

~~~ 
import csv
import os
...
# open() does not expand "~", so resolve the home directory first
output_path = os.path.expanduser("~/Desktop/output.csv")
with open(output_path, "w", newline="") as f:
    writer = csv.writer(f)

    # collected_items = [
    #   ["Product #1", "10", "http://example.com/product-1"],
    #   ["Product #2", "25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)
~~~
~~~
import csv
import os
...
field_names = ["Product Name", "Price", "Detail URL"]
# open() does not expand "~", so resolve the home directory first
output_path = os.path.expanduser("~/Desktop/output.csv")
with open(output_path, "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
~~~
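
Both snippets above open the file in `"w"` mode, which starts the file over each time. To save as you go across many pages, one option is to append instead and write the header only when the file is new. The helper `append_rows` is hypothetical, and the demonstration writes to a temporary directory with made-up rows.

```python
import csv
import os
import tempfile

FIELD_NAMES = ["Product Name", "Price", "Detail URL"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELD_NAMES)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demonstration: two separate "pages" of made-up results.
csv_path = os.path.join(tempfile.mkdtemp(), "output.csv")
append_rows(csv_path, [{"Product Name": "Product #1", "Price": "10",
                        "Detail URL": "http://example.com/product-1"}])
append_rows(csv_path, [{"Product Name": "Product #2", "Price": "25",
                        "Detail URL": "http://example.com/product-2"}])

with open(csv_path, newline="") as f:
    saved_rows = list(csv.DictReader(f))
```

If the script crashes on a later page, the rows from earlier pages are already on disk.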

In [20]:
import csv
import os
import time

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# open() does not expand "~", so resolve the home directory first
output_path = os.path.expanduser("~/Desktop/output.csv")

for url in urls:
    print("Current date & time " + time.strftime("%c"))

    # use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    # imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select('.specific_class')

    # check to make sure we have items of that class
    if not items:
        print("Data is coming back blank")

    # Saving your data as you go

    # Option 1: append this page's rows to a csv file.
    # Mode "a" keeps the rows written for earlier urls.
    with open(output_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            writer.writerow([item.get_text(strip=True)])

    # Option 2: Inserting the data into a DB
    # This code uses a theoretical module, sql_helpers.
    # The functions below are examples and will not run,
    # so they are left commented out.
    # import sql_helpers as sql
    #
    # db = sql.create_connection()
    # for item in items:
    #     data = item
    #     query = "INSERT INTO table_name VALUES (%s,%s,%s,%s)"
    #     sql.insert_data(db, query, data)

    # Taking a one second pause to help slow down your requests
    time.sleep(1)

## More Resources 
- [More advanced issues](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)
- [Request Advanced Usage](http://docs.python-requests.org/en/master/user/advanced/#)

Web scraping with Python often requires no more than the use of the Beautiful Soup module to reach the goal. Beautiful Soup is a popular Python library that makes web scraping by traversing the DOM (document object model) easier to implement.

## Applied: Scraping Amazon's Best Sellers list:


Amazon keeps track of the best sellers for 41 different categories of products. We want to grab that data from Amazon so that we can keep track of which products are on that list and stock our mom and pop store with them.  


Deliverable: a file that contains all of the products on Amazon's best seller list. 

```
[{'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'},
 {'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'}]
```

In [4]:
import requests
from bs4 import BeautifulSoup as BS


First we start by grabbing the page where all of the best sellers lists are located.

In [5]:

url="https://www.amazon.com/Best-Sellers/zgbs"

# a plain request is enough here; get_page(url) from above would add headers, a timeout and error handling
page = requests.get(url)
page

<Response [200]>

In [6]:
soup = BS(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr">
 <head>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01KD4yyr5LL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistHome" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/517rp2NH2UL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.218320-T1.206347-T1" rel="stylesheet"/>
  <script>
   (function(f,h,R,A){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x.count&&x.count("aui:"+a,0===b?0:b||(x.count("aui:"+a)||0)+1)}function p(a){try{return a.test(navigator.userAgent)}catch

Now that we have this page, we want to find the urls of all the other pages to scrape those.  

In [7]:
#using the select statement to find the elements containing each url
urls = soup.select('ul#zg_browseRoot a' )


Amazon Devices & Accessories 
 https://www.amazon.com/Best-Sellers/zgbs/amazon-devices


In [8]:
urls

[<a href="https://www.amazon.com/Best-Sellers/zgbs/amazon-devices">Amazon Devices &amp; Accessories</a>,
 <a href="https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost">Amazon Launchpad</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances">Appliances</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Apps &amp; Games</a>,
 <a href="https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts">Arts, Crafts &amp; Sewing</a>,
 <a href="https://www.amazon.com/Best-Sellers-Audible-Audiobooks/zgbs/audible">Audible Books &amp; Originals</a>,
 <a href="https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive">Automotive</a>,
 <a href="https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products">Baby</a>,
 <a href="https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty">Beauty &amp; Personal Care</a>,
 <a href="https://www.amazon.com/best-sellers-books-Amazon/zgbs/books">Books</a>,
 <a href="https://www.amaz

In [12]:
print(urls[4].text, '\n',urls[4]['href'])

Arts, Crafts & Sewing 
 https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts


In [13]:
#list of all best seller urls
urls = [url['href'] for url in urls]

Select a URL (product category) that you want to investigate, and let's build our script to parse one page. Then we can apply it to all of the pages.

In [16]:
urls[3]

'https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps'

In [18]:
url=urls[3]

apps = requests.get(url)
apps

<Response [200]>

In [19]:
app_soup = BS(apps.content, 'html.parser')
print(app_soup.prettify())

<!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo">
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01WTbMujHuL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistList" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/517rp2NH2UL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.218320-T1.206347-T1" rel="stylesheet"/>
  <script>
   (function(f,h,R,A){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v

Inspect the actual webpage to determine the data you want and the corresponding elements you want to parse out. Then use that element tag or class to pull those elements out of the page. 

In [24]:
# your code here

app_soup.find_all(class_='aok-inline-block')

[<div class="a-row a-spacing-none aok-inline-block"><span class="a-size-small aok-float-left zg-badge-body zg-badge-color"><span class="zg-badge-text">#1</span></span><span class="aok-float-left zg-badge-triangle zg-badge-color"></span></div>,
 <span class="aok-inline-block zg-item"><a class="a-link-normal" href="/Mojang-Minecraft/dp/B00992CF6W?_encoding=UTF8&amp;psc=1"><span class="zg-text-center-align"><div class="a-section a-spacing-small"><img alt="Minecraft" height="200" src="https://images-na.ssl-images-amazon.com/images/I/61GA3lSuDNL._AC_UL200_SR200,200_.png" width="200"/></div></span>
 <div aria-hidden="true" class="p13n-sc-truncate p13n-sc-line-clamp-1" data-rows="1">
             Minecraft
         </div>
 </a><div class="a-row a-size-small"><span class="a-size-small a-color-base">Mojang</span></div>
 <div class="a-icon-row a-spacing-none">
 <a class="a-link-normal" href="/product-reviews/B00992CF6W" title="4.4 out of 5 stars">
 <i class="a-icon a-icon-star a-star-4-5"><span 

Now that you can access all the data you need, let's put this into a loop so that we can process all of the products and create one list with all of the data.
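
One possible shape for that loop is sketched below. It runs against a small hand-written snippet modelled on the output above rather than the live page, so the selectors (`span.zg-item`, `a.a-link-normal`, `.p13n-sc-truncate`) are assumptions that may not match what Amazon serves today. The function name `parse_products` is also hypothetical. Product links on the page are relative, so `urljoin` is used to turn them into full URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written snippet modelled on the page output shown above (assumption).
SAMPLE_HTML = """
<span class="aok-inline-block zg-item">
  <a class="a-link-normal" href="/Mojang-Minecraft/dp/B00992CF6W">
    <div class="p13n-sc-truncate p13n-sc-line-clamp-1">
        Minecraft
    </div>
  </a>
</span>
"""


def parse_products(html, base_url):
    """Return a list of {'name', 'url'} dicts, one per product block."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select("span.zg-item"):
        link = item.select_one("a.a-link-normal")
        name = item.select_one(".p13n-sc-truncate")
        if link is None or name is None:
            continue  # never trust HTML: skip blocks missing either piece
        products.append({
            "name": name.get_text(strip=True),
            "url": urljoin(base_url, link["href"]),
        })
    return products


products = parse_products(
    SAMPLE_HTML,
    "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps",
)
print(products)
```

The result has the same shape as the deliverable: a list of dictionaries with `name` and `url` keys.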

In [23]:
# your code here

Now that we have each individual part working, let's wrap this all up in a function that we can run for each product category.


In [None]:
def parse_bestseller_cat(___):
    #your code here
    
    return ___

In [None]:
The next step is to add this function to the larger script we have from above.