# Web Scraping

Web scraping means downloading web pages and parsing them into a readable, structured format.

The pages are usually HTML. We'll use
- **Requests** to get the webpage
- **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** to parse it. It parses HTML and XML with the help of an underlying parser (**html.parser** or **lxml**)

<center> <h1> Beautiful Soup  </h1> </center>

From the website

**"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help."**

Install the following in your virtual environment:
- **pip install beautifulsoup4**
- **pip install lxml**
- **pip install requests**

# Protocol to follow when scraping a web page
- Check robots.txt to see what is allowed.
- Avoid making many simultaneous calls, or your IP may get blocked. Sleep between GET calls to avoid this.

- Use the Requests get method to fetch the webpage HTML.
- Parse it using Beautiful Soup and lxml. This creates a hierarchical structure of HTML elements.
- In Chrome, right-click and choose Inspect to open the developer tools. Inspect the HTML elements for their attributes and hierarchical order.
- Use the Beautiful Soup object to get to the desired element.
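
The pause-between-calls step in the protocol above can be sketched as a small helper. This is only an illustration and not part of the original notebook: `fetch_politely`, its `fetch` and `sleep` parameters, and the 10-second default are hypothetical names and values. In real use `fetch` would be something like `requests.get`; the demo substitutes stand-ins so that it runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    All names here are hypothetical; pass requests.get as `fetch` for real use.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every call except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demo with stand-ins: str.upper plays the role of the HTTP call and
# pauses.append records each requested pause instead of actually sleeping.
pauses = []
pages = fetch_politely(["page-1", "page-2", "page-3"], fetch=str.upper,
                       delay_seconds=10, sleep=pauses.append)
print(pages)
print(pauses)
```

Passing `sleep` in as a parameter keeps the helper easy to try out without waiting; leaving it at the default gives a real pause between requests.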

# An example of parsing html

Visit this [w3schools](https://www.w3schools.com/html/html_basic.asp) page to get an idea about HTML.


In [1]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""




In [2]:
from bs4 import BeautifulSoup as bsoup

In [3]:
soup = bsoup(html_doc, 'lxml')
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>



# Navigating this data structure

In [4]:
soup.title

<title>The Dormouse's story</title>

In [5]:
soup.title.text

"The Dormouse's story"

In [6]:
soup.title.string

"The Dormouse's story"

In [7]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [8]:
# Text contained in the parent tag (here, <head>)
soup.title.parent.text

"The Dormouse's story"

# The p tag represents a paragraph of text

In [9]:
soup.p

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>

The p tag has attributes too, such as class here. To get the value of an attribute, index the tag like a dictionary:

In [10]:
soup.p['class']

['story_title']
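
As a hedged aside on attribute access, the sketch below uses a made-up one-line snippet (not the notebook's `html_doc`) and the built-in `html.parser` so that it does not depend on lxml. It illustrates that `class` comes back as a list because it can hold several values, that a single-valued attribute such as `id` is a plain string, and that `.get()` returns `None` for a missing attribute where indexing would raise `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
snippet = '<p class="story_title intro" id="first"><b>Title</b></p>'
tag = BeautifulSoup(snippet, "html.parser").p

classes = tag["class"]     # multi-valued attribute: a list of class names
tag_id = tag["id"]         # single-valued attribute: a plain string
missing = tag.get("href")  # .get() returns None instead of raising KeyError
print(classes, tag_id, missing)
```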

But there are more **p** tags. To get all of them from the soup data structure, use find_all:

In [34]:
soup.find_all('p')

[<p class="story_title"><b>The Dormouse's story three little sisters</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
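
`find_all` can also narrow the search. The sketch below is an illustration on a shortened, made-up version of the story markup, parsed with the built-in `html.parser`; the variable names are chosen here. It filters paragraphs by class with the `class_` keyword, caps the number of results with `limit`, and uses a CSS selector through `select` to reach links nested inside story paragraphs.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the story markup.
mini_html = """
<p class="story_title"><b>Title</b></p>
<p class="story">First <a class="sister" href="http://example.com/elsie">Elsie</a></p>
<p class="story">...</p>
"""
mini_soup = BeautifulSoup(mini_html, "html.parser")

# Only <p> tags whose class list contains "story" ("story_title" does not count).
story_paragraphs = mini_soup.find_all("p", class_="story")
# Stop after the first two matches.
first_two = mini_soup.find_all("p", limit=2)
# CSS selector: <a class="sister"> anywhere inside <p class="story">.
sister_links = mini_soup.select("p.story a.sister")
print(len(story_paragraphs), len(first_two), len(sister_links))
```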

The **a** tag defines a hyperlink that links to another webpage.


In [12]:
# Print the URL (href attribute) of every a tag
for atag in soup.find_all('a'):
    print(atag['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [13]:
# Third link
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [14]:
# complete text
soup.get_text()

"The Dormouse's story\n\nThe Dormouse's story three little sisters\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n\n\n"

# Let's create some Beautiful Soup

We will scrape Fry's Electronics for telescopes, following the protocol, and store the result in a CSV file.
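
The CSV step can be sketched with the standard library's `csv.DictWriter`. This is a minimal illustration under stated assumptions: the two rows are copied from the scraper output shown later in this notebook with whitespace stripped from keys and values, `rows_to_csv` is a hypothetical helper name, and the demo writes to an in-memory buffer rather than a file.

```python
import csv
import io

# Two rows taken from the scraper output in this notebook (whitespace stripped).
rows = [
    {"product_desc": "Polaroid Telescope with Interchangeable 75x/150x Eyepiece Lenses",
     "Frys #": "9547962", "Brand": "Polaroid", "UPC": "681066491096", "Model": "IT-160X"},
    {"product_desc": "Polaroid 168x/525x Refractor Telescope",
     "Frys #": "9547922", "Brand": "Polaroid", "UPC": "681066110782", "Model": "IT-525X-RFR"},
]
fieldnames = ["product_desc", "Frys #", "Brand", "UPC", "Model"]


def rows_to_csv(rows, fieldnames, out_file):
    """Write a header line followed by one CSV line per product dict."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
rows_to_csv(rows, fieldnames, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same helper could be called with `open("telescopes.csv", "w", newline="")`, where the filename is only an example.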



# Checking robots.txt

In [6]:
!curl  https://www.frys.com/robots.txt

User-agent: * 
Crawl-delay: 10 
Sitemap: http://www.frys.com/sitemap_index.xml 
Visit-time: 0030-0300 
Disallow: /ShopCartServlet 
Disallow: /wf 
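
The rules above can also be checked programmatically. The sketch below is an illustration using the standard library's `urllib.robotparser`; it parses the robots.txt text shown above from a string so that it runs offline. The two URLs tested are examples chosen here, and lines the parser does not understand (such as `Visit-time`) are ignored by it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown above, pasted in as a string.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Sitemap: http://www.frys.com/sitemap_index.xml
Visit-time: 0030-0300
Disallow: /ShopCartServlet
Disallow: /wf
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

search_allowed = parser.can_fetch("*", "https://www.frys.com/search")
cart_allowed = parser.can_fetch("*", "https://www.frys.com/ShopCartServlet")
delay = parser.crawl_delay("*")  # seconds to wait between requests, or None
print(search_allowed, cart_allowed, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the file instead of parsing a pasted string.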



In [16]:
# import requests and bs4 
from bs4 import BeautifulSoup as bsoup
import requests

In [17]:
response = requests.get('https://www.frys.com/search?query_search=&cat=-68822&nearbyStoreName=false&pType=pDisplay&fq=a%20Regular%20Items&start=0&cat=-68822&from=0&to=99&isKeyword=true')

In [18]:
response.status_code

200

In [19]:
response.text

'\r\n\r\n<!-- Desktop page for search. -->\r\n\r\n<HTML>\r\n\t<HEAD>\r\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t\t<base href="https://www.frys.com/">\r\n\t\t\n<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[function(e,n,t){function r(){}function o(e,n,t){return function(){return i(e,[c.now()].concat(u(arguments)),n?null:this,t),n?void 0:this}}var i=e("handle"),a=e(3),u=e(4),f=e("ee").get("tracer"),c=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],d="api-",l=d+"ixn-";a(p,function(e,n){s[n]=o(d+n,!0,"api")}),s.addPageAction=o(d+"addPageAction",!0)

In [20]:
soup = bsoup(response.text, 'lxml')

In [21]:
telescope_containers = soup.find_all('div', {"class":"col-xs-12 col-sm-12 pad_lr_tab5 product togrid"})
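
Passing the whole class string, as in the cell above, matches the `class` attribute as one exact string, so it can stop matching if the site reorders or adds a class. A hedged alternative is a CSS selector that names only the classes that matter. The markup below is made up for illustration and is not the Fry's page; `div.product.togrid` is an assumption about which classes identify a product.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes in a different order on the second div.
demo_html = """
<div class="col-xs-12 product togrid">A</div>
<div class="product togrid col-xs-12">B</div>
<div class="col-xs-12 banner">C</div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Exact string match on the full class attribute.
exact = demo_soup.find_all("div", {"class": "col-xs-12 product togrid"})
# CSS selector: any <div> carrying both classes, in any order.
by_selector = demo_soup.select("div.product.togrid")
print([d.text for d in exact])
print([d.text for d in by_selector])
```

Here the exact-string search should find only the first div, while the selector finds both product divs.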

In [23]:
telescope_container = telescope_containers[0]

In [49]:
def scrap_telescope(telescope_container):
    """Return a dict with the description and attributes of one product container."""
    product_dict = {}
    # Block that holds the description and the model details
    telescope_info = telescope_container.find('div', {"class":"col-xs-12 col-sm-4 col-md-5 pad_none_tab pad_lr_desk5 toGirdDesc"})

    # The first <p> holds the product description
    product_desc_container = telescope_info.find('p')
    product_desc = product_desc_container.text.strip()
    print(product_desc)

    product_dict['product_desc'] = product_desc
    product_info_container = telescope_info.find('div', {"class":"col-xs-12 pad_none_tab pad_none_desk prodModel"})

    product_attr_container = product_info_container.find_all('p')

    # Every <p> except the last is a "name: value" pair; split on the first colon only
    for product_attr in product_attr_container[:-1]:
        product_val = product_attr.text.strip().split(':', 1)
        product_dict[product_val[0]] = product_val[1]
    return product_dict
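
The function above keeps the whitespace around each name and value, which is why the output contains keys such as `'UPC '` and values such as `' 9547962'`. A minimal sketch of a stricter split follows; `parse_attribute` is a hypothetical helper, not part of the original notebook, and the sample strings are reconstructed from the printed output.

```python
def parse_attribute(line):
    """Split 'name : value' on the first colon and strip both parts.

    Returns None when the line contains no colon.
    """
    name, separator, value = line.partition(":")
    if not separator:
        return None
    return name.strip(), value.strip()


samples = ["Frys #: 9547962", "UPC : 681066491096", "no colon here"]
parsed = [parse_attribute(sample) for sample in samples]
print(parsed)
```

Returning `None` for a line without a colon lets the caller skip it instead of failing with an `IndexError`.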

In [50]:
for telescope_container in telescope_containers:
    print(scrap_telescope(telescope_container))


Polaroid Telescope with Interchangeable 75x/150x Eyepiece Lenses
{'product_desc': 'Polaroid Telescope with Interchangeable 75x/150x Eyepiece Lenses', 'Frys #': ' 9547962', 'Brand': ' Polaroid', 'UPC ': ' 681066491096', 'Model': ' IT-160X'}
Polaroid 168x/525x Refractor Telescope
{'product_desc': 'Polaroid 168x/525x Refractor Telescope', 'Frys #': ' 9547922', 'Brand': ' Polaroid', 'UPC ': ' 681066110782', 'Model': ' IT-525X-RFR'}
CELESTRON KIDS MICROSCPE KIDS MICROSCOPE W/CASE
{'product_desc': 'CELESTRON KIDS MICROSCPE KIDS MICROSCOPE W/CASE', 'Frys #': ' 8335697', 'Brand': ' Celestron', 'UPC ': ' 050234441230', 'Model': ' KIDS 28 PIECE MICROS'}
Celestron PowerSeeker 60EQ Refractor Telescope
{'product_desc': 'Celestron PowerSeeker 60EQ Refractor Telescope', 'Frys #': ' 5536400', 'Brand': ' Celestron', 'UPC ': ' 050234210430', 'Model': ' POWERSEEKER 60EQ'}
Celestron LandScout 50mm Spotting Scope
{'product_desc': 'Celestron LandScout 50mm Spotting Scope', 'Frys #': ' 7897269', 'Brand': ' C