# Advanced Webscraping: How to make sure you don't get blocked.

## Aims:

- Write scripts that can handle errors and minimize the likelihood of your IP address getting blocked.


## Agenda

- Talk about the legality of scraping
- Practice scraping
- Look at ways to programmatically avoid getting banned
- Set up the selenium webdriver
- Learn how to use Selenium

## 1. Check 200 status code
It is always good to check the HTTP status code early and proceed accordingly.

This is good:

~~~
if response.status_code == 200:
   #Proceed further
~~~

This is better:

~~~
if response.status_code != 200:
  return False
~~~

In [3]:
broken_urls = []
for url in urls:
    page = requests.get(url)
    if page.status_code == 200:
        # the request succeeded, so process the page here
        pass
    else:
        # record the failing URL and its status code
        print(page.status_code)
        broken_urls.append(url)
    # more code to process the results
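
A status check can do more than pass or fail. The sketch below sorts a status code into three outcomes. `classify_status` and its three labels are hypothetical names chosen for this illustration, and treating 429 and 5xx responses as temporary is an assumption that will not hold for every site.

```python
from http import HTTPStatus


def classify_status(status_code):
    """Return 'ok', 'retry', or 'skip' for an HTTP status code (hypothetical helper)."""
    if status_code == HTTPStatus.OK:
        return "ok"
    # Assumption: 429 (Too Many Requests) and 5xx errors are often temporary,
    # so the URL may be worth trying again later.
    if status_code == HTTPStatus.TOO_MANY_REQUESTS or 500 <= status_code < 600:
        return "retry"
    # Anything else (for example 404) is treated as a broken URL.
    return "skip"


for code in (200, 404, 429, 503):
    print(code, classify_status(code))
```

In the loop above you could call `classify_status(page.status_code)` and append the URL to `broken_urls` only when the result is `"skip"`.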

## 2. Never Trust HTML

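Especially if you can’t control it. Web scraping depends on the HTML DOM, so a simple change in an element or class name could break your entire script. The best way to deal with this is to check whether your selector returned anything: `select()` returns an empty list and `find()` returns `None` when nothing matches.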

~~~
page_count = soup.select('.pager-pages > li > a')
if page_count:
 #do your stuff
else:
 # ALERT!! Send notification to Admin
~~~

Here I am checking whether the CSS selector returned something legitimate; if it did, proceed further.

In [None]:
broken_links= []
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print( page.status_code)
        broken_links.append(url)
        continue
    
    # more code to process the results
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")
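
The same guard can be wrapped in a small helper so that every lookup degrades gracefully. This is a minimal sketch: `first_text` is a hypothetical name, and it only assumes that each match has a `get_text()` method, as Beautiful Soup elements do. The `FakeTag` class stands in for a real element so the example runs without downloading a page.

```python
def first_text(matches, default=None):
    """Return the stripped text of the first match, or `default` if nothing matched."""
    # `matches` is expected to be the list returned by soup.select(...);
    # an empty list (or None) means the selector no longer matches the page.
    if not matches:
        return default
    return matches[0].get_text().strip()


class FakeTag:
    """Stand-in for a Beautiful Soup element (assumption for this demo only)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


print(first_text([FakeTag("  Page 3  ")]))   # text of the first match
print(first_text([], default="missing"))     # selector matched nothing
```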

## 3. Set headers

`requests` does not force you to set request headers when sending requests, but a few smart websites will not let you read anything important unless certain headers are set. I once faced a situation where the HTML I was seeing in the browser was different from what I was getting via my script, kind of like magic, huh. So it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

~~~
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

response = requests.get(url, headers=headers, timeout=5)

~~~

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")
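
A User-Agent alone is sometimes not enough. The sketch below builds a slightly fuller set of browser-like headers. `build_headers` is a hypothetical helper, and the `Accept` and `Accept-Language` values are illustrative assumptions rather than values any particular site requires.

```python
def build_headers(user_agent, extra=None):
    """Return a dict of browser-like request headers (hypothetical helper)."""
    headers = {
        "User-Agent": user_agent,
        # Assumed values: typical of what a desktop browser sends.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    if extra:
        # Caller-supplied headers override the defaults above.
        headers.update(extra)
    return headers


headers = build_headers(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    extra={"Referer": "https://www.example.com/"},
)
print(headers)
```

The resulting dictionary can be passed as `headers=headers` to `requests.get`, in the same way as the hand-written dictionary in the cell above.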

## 4. Set timeout

One of the issues with `requests` is that, if you don’t specify a **timeout**, it will wait for a response indefinitely. If your request is never fulfilled, your script will be left hanging there waiting for a response.

To set the request’s timeout, use the `timeout` parameter. `timeout` can be an integer or float representing the number of seconds to wait on a response before timing out:

~~~
response = requests.get(url, headers=headers, timeout=5)
~~~


You can also pass a tuple to timeout with the first element being a connect timeout (the time it allows for the client to establish a connection to the server), and the second being a read timeout (the time it will wait on a response once your client has established a connection):

~~~ 
requests.get('https://api.github.com', timeout=(2, 5))
~~~

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers, timeout=(2,5))
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 5. Exception handling

It is always good to implement exception handling. It not only helps avoid an unexpected exit of the script but also helps you log errors and informational messages. When using Python `requests`, I prefer to catch exceptions like this:

~~~
try:
    # your logic is here

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program") 
~~~

Check the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump it all at the time of exit.
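
One way to wrap things up is a `finally` block, which runs whether the loop finishes, fails, or is interrupted. In this sketch `scrape_all`, `fetch`, and `save` are hypothetical names: `fetch` stands for whatever downloads and parses one URL, and `save` for whatever writes the results to disk.

```python
def scrape_all(urls, fetch, save):
    """Fetch every URL and always hand the collected results to `save`."""
    results = []
    try:
        for url in urls:
            results.append(fetch(url))
    except KeyboardInterrupt:
        # Ctrl+C lands here; fall through to `finally` instead of crashing.
        print("Interrupted, saving what we have so far")
    finally:
        # Runs on normal completion, on Ctrl+C, and on any other error.
        save(results)
    return results


# Demo with stand-ins: the fake fetch simulates Ctrl+C on the second URL.
saved = []


def fake_fetch(url):
    if url == "b":
        raise KeyboardInterrupt
    return url.upper()


collected = scrape_all(["a", "b", "c"], fake_fetch, saved.extend)
print(collected, saved)
```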

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url, headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
            continue
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break

    #imagine we have gotten the contents of the page in the soup variable
    #(we only reach this point when the request succeeded)
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')

    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

This code is starting to get long and hard to read. So let's start to modularize it.  

In [None]:
def get_page(url, headers):
    # returns the response, or None if the request failed
    page = None
    try:
        page = requests.get(url, headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)

    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")
        # re-raise so the calling loop stops as well
        raise

    return page
    

We can replace a chunk of our code with this function.

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    #use our new function to process each url
    page = get_page(url, headers)

    #skip this url if the request failed
    if page is None:
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')

    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 6. Regulate your request pace

Many websites limit how many times you can request pages within a minute/hour/day. You want to be aware of that and adjust your script to account for it.

One example is using the `sleep()` function from the `time` module. This can pause your script for a set amount of time.

~~~
import time
 
 
## Start loop ##
for url in urls:

    # try to make the request here.

    #### Delay for 1 second ####
    time.sleep(1)
        
~~~

In [None]:
import time
 
 
## Start loop ##
for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url, headers)

    #skip this url if the request failed, but still pause before the next one
    if page is None:
        time.sleep(.3)
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')

    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")
    
    time.sleep(.3)
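
A fixed delay is the simplest option. Two common variations are a randomized pause between requests and an exponential backoff between retries, sketched below. Both function names and all default values are assumptions for illustration, not limits published by any site.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time between the two bounds and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # `sleep` is injectable so the function can be tested instantly
    return delay


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


# Delays for the first five retries with the assumed defaults.
print([backoff_delay(n) for n in range(5)])
```

Calling `polite_sleep()` at the end of each loop iteration would replace the fixed `time.sleep(.3)` used above.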

## 7. Save as you go

You might run into an issue halfway through your scrape that breaks your script, so make sure you are saving your data as you go.

~~~ 
import csv
import os
...
# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.writer(f)

    # collected_items = [
    #   ["Product #1", "10", "http://example.com/product-1"],
    #   ["Product #2", "25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)
~~~
~~~
import csv
import os
...
field_names = ["Product Name", "Price", "Detail URL"]
# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
~~~
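
Both snippets above write everything in one go at the end. To save as you go, one option is to open the file in append mode each time a page has been parsed, writing the header only when the file is new. `append_rows` is a hypothetical helper, and the demo writes to a temporary directory rather than the Desktop path used above.

```python
import csv
import os
import tempfile


def append_rows(path, field_names, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=field_names)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


field_names = ["Product Name", "Price", "Detail URL"]
demo_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Two separate calls, as if two pages had been scraped one after the other.
append_rows(demo_path, field_names, [
    {"Product Name": "Product #1", "Price": "10", "Detail URL": "http://example.com/product-1"},
])
append_rows(demo_path, field_names, [
    {"Product Name": "Product #2", "Price": "25", "Detail URL": "http://example.com/product-2"},
])

with open(demo_path, newline="") as f:
    saved_rows = list(csv.DictReader(f))
print(saved_rows)
```

If the script breaks partway through, every page written before the failure is already on disk.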

In [None]:
import csv
import os
import time

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# open() does not expand "~", so expand it explicitly
output_path = os.path.expanduser("~/Desktop/output.csv")

for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url, headers)

    #skip this url if the request failed
    if page is None:
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')

    #check to make sure we have items of that class
    if not items:
        print("Data is coming back blank")
        continue

    #Saving your data as you go

    # Option 1: append the data for this page to a csv file
    # ("a" keeps earlier rows; "w" would overwrite them on every loop)
    with open(output_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            # writerow expects a list of values, so wrap the element's text
            writer.writerow([item.get_text(strip=True)])
        


## More Resources 
- [More advanced issues](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)
- [Request Advanced Usage](http://docs.python-requests.org/en/master/user/advanced/#)

Web scraping with Python often requires no more than the use of the Beautiful Soup module to reach the goal. Beautiful Soup is a popular Python library that makes web scraping by traversing the DOM (document object model) easier to implement.

https://realpython.com/beautiful-soup-web-scraper-python/#part-3-parse-html-code-with-beautiful-soup

## Applied: Scraping Amazon's Best Sellers list:


Amazon keeps track of the best sellers for 41 different categories of products. We want to grab that data from Amazon so that we can keep track of which products are on that list and stock our mom and pop store with them.  


Deliverable: a file that contains all of the products on Amazon's best seller list. 

```
[{'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'},
 {'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'}]
```

In [6]:
import requests
from bs4 import BeautifulSoup as BS


First we start by grabbing the page where all of the best seller lists are located.

In [8]:

url="https://www.amazon.com/Best-Sellers/zgbs"

#fetch the page that lists all of the best seller categories
page = requests.get(url)


In [11]:
soup = BS(page.content, 'html.parser')


In [9]:
# print(soup.prettify())

Now that we have this page, we want to find the urls of all the other pages to scrape those.  

In [20]:
#using find to locate the list element that contains the urls
urls = soup.find('ul', id='zg_browseRoot')


In [21]:
type(urls)

bs4.element.Tag

In [22]:
len(urls)

5

In [23]:
urls.find_all('a')

[<a href="https://www.amazon.com/Best-Sellers/zgbs/amazon-devices">Amazon Devices &amp; Accessories</a>,
 <a href="https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost">Amazon Launchpad</a>,
 <a href="https://www.amazon.com/Best-Sellers-Prime-Pantry/zgbs/pantry">Amazon Pantry</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances">Appliances</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Apps &amp; Games</a>,
 <a href="https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts">Arts, Crafts &amp; Sewing</a>,
 <a href="https://www.amazon.com/Best-Sellers-Audible-Audiobooks/zgbs/audible">Audible Books &amp; Originals</a>,
 <a href="https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive">Automotive</a>,
 <a href="https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products">Baby</a>,
 <a href="https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty">Beauty &amp; Personal Care</a>,
 <a href="https:/

In [26]:
#using the select statement to find the elements containing each url
urls = soup.select('ul#zg_browseRoot a' )     # same as find statement


In [32]:
urls[0]['href']   

'https://www.amazon.com/Best-Sellers/zgbs/amazon-devices'

In [24]:
for url in urls:
    print(url['href'])

https://www.amazon.com/Best-Sellers/zgbs/amazon-devices
https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost
https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances
https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps
https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts
https://www.amazon.com/Best-Sellers-Audible-Audiobooks/zgbs/audible
https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive
https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products
https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty
https://www.amazon.com/best-sellers-books-Amazon/zgbs/books
https://www.amazon.com/best-sellers-music-albums/zgbs/music
https://www.amazon.com/best-sellers-camera-photo/zgbs/photo
https://www.amazon.com/Best-Sellers/zgbs/wireless
https://www.amazon.com/Best-Sellers/zgbs/fashion
https://www.amazon.com/Best-Sellers-Collectible-Coins/zgbs/coins
https://www.amazon.com/Best-Sellers-Computers-Accessories/zgbs/pc
https://www.amazon.

In [33]:
print(urls[4].text, '\n',urls[4]['href'])

Apps & Games 
 https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps


In [34]:
#list of all best seller urls
urls = [url['href'] for url in urls]

Select a URL/product category that you want to investigate, and let's build our script to parse one page. Then we can apply it to all of the pages.

In [35]:
urls[2]

'https://www.amazon.com/Best-Sellers-Prime-Pantry/zgbs/pantry'

## Grabbing the products from each page. 

Now that we have the URL for the pages, we want to parse those to get the actual information about the bestselling products.

So now we need to go over each best seller link and parse that page to grab the products.

In [38]:
# Grabbing one specific URL to parse
url=urls[2]

apps = requests.get(url)
apps

<Response [200]>

In [39]:
app_soup = BS(apps.content, 'html.parser')
print(app_soup.prettify())

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo">
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21GOoQKemTL._RC|01WTbMujHuL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistList" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|012LjolmrML.css,41cDRFS39BL.css,21WV2mrxM2L.css,01Vctty9pOL.css,017DsKjNQJL.css,01l9iDpr-DL.css,41EWOOlBJ9L.css,11UoGyLuXoL.css,01ElnPiDxWL.css,11QxHU4QYaL.css,01Sp8sB1HiL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,01evdoiemkL.css,01oDR3IULNL.css,31zpKVx8wkL.css,01ZTetsDh7L.css,01Jb-VvL4uL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,115z+SGcjHL.css,21GwE3cR-yL.css,11KLBtpWIAL.css,119I1lTJONL.css,11M4XwS6hxL.css,11WgRxUdJRL.css,01-fWz3sOQL.css,11ocrgKoE-L.css,11k89RclloL.css,11cm-8W5AzL.css,01QrWuRrZ-L.css,21pIv-yKhaL.css,01M3ZzSySfL.css,01gAR5pB+IL.css,119dKrtBoVL.css,0

We have grabbed one specific best seller URL, but now we need to figure out how to grab the details about the products.

- Inspect the actual webpage to determine the data you want and the corresponding elements you want to parse out. 

- Then use that element tag or class to pull those elements out of the page. 

In [74]:
blocks = app_soup.select('span.zg-item') # look for spans that have class zg-item
len(blocks)

50

In [75]:
# blocks = app_soup.find_all(True, {'class':['aok-inline-block', 'zg-item']})

In [76]:
len(blocks)

50

In [77]:
for part in blocks[0].children:
    print(part, '\n')

<a class="a-link-normal" href="/Idahoan-Potatoes-Buttery-Homestyle-4-Ounce/dp/B00SH4NJH2?_encoding=UTF8&amp;psc=1"><span class="zg-text-center-align"><div class="a-section a-spacing-small"><img alt="Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4…" height="200" src="https://images-na.ssl-images-amazon.com/images/I/9135ES8qVlL._AC_UL200_SR200,200_.jpg" width="200"/></div></span>
<div aria-hidden="true" class="p13n-sc-truncate p13n-sc-line-clamp-2 p13n-sc-truncate-desktop-type2" data-rows="2">
            Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4 Servings)
        </div>
</a> 


 

<div class="a-icon-row a-spacing-none">
<a class="a-link-normal" href="/product-reviews/B00SH4NJH2" title="4.7 out of 5 stars">
<i class="a-icon a-icon-star a-star-4-5 aok-align-top"><span class="a-icon-alt">4.7 out of 5 stars</span></i>
</a>
<a class="a-size-small a-link-

In [78]:
block = blocks[0]

In [84]:
list(block.children)[0]

<a class="a-link-normal" href="/Idahoan-Potatoes-Buttery-Homestyle-4-Ounce/dp/B00SH4NJH2?_encoding=UTF8&amp;psc=1"><span class="zg-text-center-align"><div class="a-section a-spacing-small"><img alt="Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4…" height="200" src="https://images-na.ssl-images-amazon.com/images/I/9135ES8qVlL._AC_UL200_SR200,200_.jpg" width="200"/></div></span>
<div aria-hidden="true" class="p13n-sc-truncate p13n-sc-line-clamp-2 p13n-sc-truncate-desktop-type2" data-rows="2">
            Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4 Servings)
        </div>
</a>

In [66]:
list(block.children)[0].get_text()

'\n\n            Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4 Servings)\n        \n'

## Parsing the data
Now that you can access the block for each product, we need to pull out specific information for the products.

- Think about what data you need from the 'block' and create a sample data structure that you will want to use.

- Parse one block into that data structure.

- Put this into a loop so that we can process all of the products and create one list with all of the data.
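
The deliverable asks for each product's URL, and the `href` values inside a product block are relative (for example `/product-reviews/B00SH4NJH2`). Here is a minimal sketch of turning them into full links with the standard library's `urljoin`. `absolute_url` and `BASE_URL` are hypothetical names, and the base is assumed to be the Amazon domain used throughout this notebook.

```python
from urllib.parse import urljoin

# Assumption: every relative link on the page is relative to this domain.
BASE_URL = "https://www.amazon.com"


def absolute_url(href, base=BASE_URL):
    """Turn a relative href into a full URL; absolute hrefs come back unchanged."""
    return urljoin(base, href)


print(absolute_url("/product-reviews/B00SH4NJH2"))
print(absolute_url("https://www.amazon.com/Best-Sellers/zgbs/amazon-devices"))
```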

In [91]:
# your code here

product = {"name" : str(list(block.children)[0].get_text()),
          "price" : list(block.children)[4].get_text()}

Now that we have each individual part working, let's wrap this all up in a function that we can run for each product class.

Create a function that takes in a URL (product category).

This function should use the code above that parses the individual products on the page.

The function should then return all of that data in the correct format (a list of dictionaries).

In [88]:
def parse_bestseller_cat(prod):
    
    name = prod['name']
    
    return ___

Next step is now to add this function to the larger script we have from above that will loop over our list of urls (categories) and grab all of the data we need. 

In [89]:
product.keys()

dict_keys(['name', 'price'])

In [90]:
product['name']

'\n\n            Idahoan Buttery Homestyle Mashed Potatoes, Made with Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4 Servings)\n        \n'

In [92]:
product['price']

'$0.69 '
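
The raw values above still carry newlines, padding, and a currency symbol. A hedged sketch of a cleanup step follows. `clean_product` is a hypothetical helper, and it assumes prices look like `'$0.69 '`, returning `None` for anything it cannot parse.

```python
def clean_product(raw):
    """Return a copy of a scraped product dict with a tidy name and a numeric price."""
    # Collapse newlines and runs of spaces in the name into single spaces.
    name = " ".join(raw["name"].split())
    price_text = raw["price"].strip().lstrip("$").replace(",", "")
    try:
        price = float(price_text)
    except ValueError:
        # Assumption: anything that is not a single dollar amount is left unset.
        price = None
    return {"name": name, "price": price}


raw_product = {
    "name": "\n\n            Idahoan Buttery Homestyle Mashed Potatoes, Made with "
            "Gluten-Free 100-Percent Real Idaho Potatoes, 4-ounce Pouch (4 Servings)\n        \n",
    "price": "$0.69 ",
}
print(clean_product(raw_product))
```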