# Basics of Web Scraping

Web scraping means fetching webpages and parsing them into a readable, structured format.

Webpages are usually HTML. We'll use
- **Requests** to get the webpage
- **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** to parse it. It parses HTML and XML with the help of an underlying parser (**html.parser** or **lxml**)

<center> <h1> Beautiful Soup  </h1> </center>

From the website

**"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help."**

Install
- **pip install requests**
- **pip install beautifulsoup4**
- **pip install lxml**

in your virtual environment.

# Protocol to follow when scraping a web page
- Check robots.txt to see what is allowed.
- Avoid lots of simultaneous calls, or your IP may get blocked. Sleep between GET calls to avoid this.
- Use the Requests get method to fetch the webpage HTML.
- Parse it using Beautiful Soup and lxml. This creates a hierarchical structure of HTML elements.
- In Chrome, right-click and choose Inspect to open the developer tools. Inspect the HTML elements for their attributes and hierarchical order.
- Use the Beautiful Soup object to get to the desired element.
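
The throttling step in the list above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `fetch_politely`, its `fetch` argument, and the one-second default delay are hypothetical names and values chosen for this example.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls.

    `fetch` is any callable that takes a URL and returns the page content
    (hypothetical helper; pass in whatever does the real download).
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every call except the first to avoid hammering the server
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

With Requests, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. Passing the fetcher in keeps the pause logic separate from the network call.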

# An example of parsing HTML

Visit this [w3schools](https://www.w3schools.com/html/html_basic.asp) page for an introduction to HTML.


In [4]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""




In [2]:
from bs4 import BeautifulSoup as bsoup
import urllib.robotparser

In [5]:
soup = bsoup(html_doc, 'lxml')
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>



# Navigating this data structure

In [6]:
soup.title

<title>The Dormouse's story</title>

In [7]:
soup.title.text

"The Dormouse's story"

In [8]:
soup.title.string

"The Dormouse's story"

In [9]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [10]:
# Text inside the parent tag (<head>); here it is just the title text
soup.title.parent.text


"The Dormouse's story"
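
To see both the name and the text of the parent tag, a small sketch along these lines may help. It uses a shortened copy of the document and the built-in `html.parser` (an assumption made to keep the example dependency-light); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent = soup.title.parent
parent_name = parent.name        # tag name of the parent element
parent_text = parent.get_text()  # all text nested inside the parent
print(parent_name, "|", parent_text)
```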

# The p tag represents a paragraph of text

In [11]:
soup.p

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>

The p tag has attributes too, such as class here. Index the tag like a dictionary to get an attribute's value:

In [13]:
soup.p['class']

['story_title']
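
Indexing raises a KeyError when the attribute is absent, so a more defensive variant might look like the sketch below. The one-line HTML snippet and the variable names are made up for illustration, and `html.parser` is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story_title"><b>Title</b></p>', "html.parser")
p = soup.p

all_attrs = p.attrs            # dict of every attribute on the tag
missing_id = p.get("id")       # None, because this tag has no id attribute
classes = p.get("class", [])   # class is multi-valued, so it comes back as a list
```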

But soup.p returns only the first **p** tag. Use **find_all** to get all of them from the soup data structure:

In [14]:
soup.find_all('p')

[<p class="story_title"><b>The Dormouse's story three little sisters</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
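
When only some of the paragraphs are wanted, find_all can also filter on attributes. The following is a sketch on a trimmed-down copy of the document, parsed with `html.parser` as an assumption; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p class="story_title"><b>Title</b></p>
<p class="story">First</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

story_paragraphs = soup.find_all("p", class_="story")  # keyword filter on the class attribute
same_via_css = soup.select("p.story")                  # equivalent CSS selector
story_texts = [p.get_text() for p in story_paragraphs]
```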

The **a** tag defines a hyperlink to another webpage.

In [15]:
# Find all the url(href) in a tags
for atag in soup.find_all('a'):
    print(atag['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [16]:
# Third link
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [17]:
# complete text
soup.get_text()

"The Dormouse's story\n\nThe Dormouse's story three little sisters\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n\n\n"
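
The raw text above keeps every newline from the markup. If a single tidy line is preferred, get_text accepts `separator` and `strip` arguments, as in this sketch on a short made-up snippet (parsed with `html.parser` as an assumption):

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time</p>\n<p>there were <b>three</b> sisters</p>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops empty ones;
# separator is placed between the remaining fragments
clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)
```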

# Let's create some Beautiful Soup

We will scrape Fry's Electronics for telescopes, following the protocol, and store the result in a CSV file.
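
The final "store in a CSV file" step can be tried out independently of any live site. The sketch below reuses two of the sister links from the earlier example rather than real product data; the column names `name` and `url` are assumptions, and an in-memory buffer stands in for the file.

```python
import csv
import io

from bs4 import BeautifulSoup

html = '''<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>'''
soup = BeautifulSoup(html, "html.parser")

# One dictionary per scraped element
rows = [{"name": a.get_text(strip=True), "url": a["href"]} for a in soup.find_all("a")]

buffer = io.StringIO()  # swap for open("links.csv", "w", newline="") to write a real file
writer = csv.DictWriter(buffer, fieldnames=["name", "url"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```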



# Checking robots.txt

## Caution: This does not work because the business closed down permanently. Feel free to watch the video, but it is no longer possible to reproduce this exact example.

In [15]:
!curl  https://www.frys.com/robots.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

In [18]:
# urllib.robotparser ships with the Python standard library; urllib3 is a separate package
!pip install urllib3



In [18]:
#urllib has evolved since the video was recorded. Some of the syntax is different.
#https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-urllib3-and-requests-modul

import urllib.robotparser  # import the submodule explicitly; a bare `import urllib` does not load it
rp = urllib.robotparser.RobotFileParser()
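
Since the original site is offline, the parser can still be exercised on robots.txt rules supplied by hand. The rules and URLs below are hypothetical and exist only to show how `parse` and `can_fetch` behave.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse() accepts an iterable of lines, so no network call is needed

allowed = rp.can_fetch("*", "https://example.com/search?q=telescope")
blocked = rp.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

Against a live site, `rp.set_url(...)` followed by `rp.read()` would download the real file before calling `can_fetch`.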

In [28]:
import requests
response = requests.get("https://www.frys.com/search?q=telescope...")
# This will not work as the website was taken down. Please feel free to watch the video and practice with a different website.
response.status_code

200

In [29]:
response.text

'<html lang="en">\n\n<head>\n    <title>Robot or human?</title>\n    <meta name="viewport" content="width=device-width">\n    <style>\n    #sign-in-widget a,\n    #sign-in-widget a:active,\n    #sign-in-widget a:hover {\n        color: #000\n    }\n\n    #sign-in-widget h1 {\n        font-weight: 500;\n        font-size: 20px;\n        font-size: 1.25rem;\n        letter-spacing: -.6px;\n        margin: 1px auto\n    }\n\n    @media (min-width:30em) {\n        #sign-in-widget h1 {\n            margin-top: 24px;\n            font-size: 24px;\n            font-size: 1.5rem\n        }\n    }\n\n    #sign-in-widget {\n        font-family: BogleWeb, Helvetica Neue, Helvetica, Arial, sans-serif\n    }\n\n    #sign-in-widget * {\n        box-sizing: border-box\n    }\n\n    #sign-in-widget .text-right {\n        text-align: right\n    }\n\n    @font-face {\n        font-family: NewYorkIcons;\n        src: url(6255ed72d86ece856725a2d80878bce6.eot);\n        font-weight: 400;\n        font-styl
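
A 200 status code does not guarantee that the expected page came back: the title in the response above is "Robot or human?", which suggests a bot-check page rather than search results. One way to notice this early is to look at the title before parsing further. This is a sketch; `page_title` is a hypothetical helper and the HTML string is a shortened stand-in for the response.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


sample = '<html lang="en"><head><title>Robot or human?</title></head><body></body></html>'
title = page_title(sample)
if title is not None and "robot" in title.lower():
    print("Looks like a bot-check page:", title)
```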

In [24]:
from bs4 import BeautifulSoup as bsoup

In [30]:
soup = bsoup(response.text, 'lxml')

In [41]:
#telescope_containers = soup.find_all('div',{"class":"mb0 ph1 pa0-xl bb b--near-white w-25"})
#telescope_containers=soup.find_all('div', {"class":"h-100 pb1-xl pr4-xl pv1 ph1"})
telescope_containers=soup.find_all('div', {"data-testid":"flex-container"})
print(telescope_containers)

[]


In [34]:
telescope_container = telescope_containers[1]

IndexError: list index out of range
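
The IndexError follows directly from the empty list printed earlier: find_all found no matching div, so there is no element at index 1. A guarded version might look like the sketch below, which uses a made-up page with no matching containers to reproduce the situation; `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>No products here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

containers = soup.find_all("div", {"data-testid": "flex-container"})

# Check the length before indexing so an empty result is reported, not raised
if len(containers) > 1:
    second = containers[1]
else:
    second = None
    print(f"Expected at least 2 containers, found {len(containers)}")
```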