# Web scraping
Web scraping, also called web data mining or web harvesting, is the process of building an agent that automatically downloads, parses, extracts and organizes useful information from the web. Instead of saving data from websites by hand, web scraping software loads the pages and extracts the data we need from them automatically.

![webScrapping.png](attachment:webScrapping.png)

# Working of a Web Scraper
A web scraper is a program or script that downloads the contents of multiple web pages and extracts data from them.
![wS_working.png](attachment:wS_working.png)

# The Components of a Web Page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

HTML: contains the main content of the page.<br>
CSS: adds styling to make the page look nicer.<br>
JS: JavaScript files add interactivity to web pages.<br>
Images: image formats such as JPG and PNG allow web pages to show pictures.<br><br>
After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

# HTML example:
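A minimal sketch of what such a document looks like, using the markup of the simple example page that this notebook downloads later. `TagCollector` is a hypothetical helper built on the standard library `html.parser` module; it only records the order in which opening tags appear, to make the nesting visible.

```python
from html.parser import HTMLParser

# Markup of the simple example page used throughout this notebook.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


class TagCollector(HTMLParser):
    """Hypothetical helper: records each opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SIMPLE_HTML)
print(collector.tags)
```

The printed list is `['html', 'head', 'title', 'body', 'p']`: the head and body sit inside html, and the title and p tags sit inside those.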

# The requests library
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. 
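When a download is part of a longer script, it can help to fail early on bad responses. The sketch below wraps `requests.get` in a hypothetical helper named `fetch_html`; the helper name and the 10 second default timeout are assumptions made for illustration, not something the requests library prescribes.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text.

    Hypothetical helper for illustration, not part of requests itself.
    """
    # The timeout stops the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # .text is the decoded string; .content would give the raw bytes.
    return response.text
```

Calling `fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")` should return the page markup as a string, provided the server is reachable.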

In [1]:
# http://dataquestio.github.io/web-scraping-pages/simple.html

In [54]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [3]:
page.status_code

200

In [55]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

# Parsing a page with BeautifulSoup
As we can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

In [56]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# prettify is a method, so it must be called to get the indented markup
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns a list iterator rather than a list, so we need to call the list function on it to see its items:

In [57]:
soup.children

<list_iterator at 0x24f21d4e1d0>

In [58]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

The above tells us that there are two items at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

In [11]:
for item in list(soup.children):
    print(type(item))

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>


As we can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text.
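Because the newline strings and the doctype show up next to the tags, it can be convenient to keep only the Tag objects when walking the tree. A minimal sketch, assuming the same simple page held in a string; the names `html_doc`, `doc_soup` and `html_tag` are chosen for this illustration only.

```python
from bs4 import BeautifulSoup, Tag

# Markup of the simple example page, inlined so the sketch runs offline.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

doc_soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects; the Doctype and the newline strings are skipped.
top_level_tags = [child for child in doc_soup.children if isinstance(child, Tag)]
html_tag = top_level_tags[0]

# The same filter one level down leaves just <head> and <body>.
child_names = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(child_names)
```

With this filter there is no need to remember which list positions hold newlines.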

In [14]:
print(list(soup.children))
html = list(soup.children)[2]
html

['html', '\n', <html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>]


<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [15]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [16]:
for ele in list(html.children):
    print(ele)



<head>
<title>A simple example page</title>
</head>


<body>
<p>Here is some simple content for this page.</p>
</body>




In [17]:
list(html.children)[0]

'\n'

In [18]:
list(html.children)[1]

<head>
<title>A simple example page</title>
</head>

In [19]:
list(html.children)[2]

'\n'

In [20]:
list(html.children)[3]

<body>
<p>Here is some simple content for this page.</p>
</body>

In [21]:
body = list(html.children)[3]
print(list(body.children))

['\n', <p>Here is some simple content for this page.</p>, '\n']


In [23]:
p = list(body.children)[1]
p

<p>Here is some simple content for this page.</p>

In [24]:
p.get_text()

'Here is some simple content for this page.'
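The step-by-step descent above shows how the tree is built, but BeautifulSoup also offers shorter routes to the same tag. A minimal sketch, again assuming the simple page held in a string (`html_doc` and `demo_soup` are illustrative names):

```python
from bs4 import BeautifulSoup

# Markup of the simple example page, inlined so the sketch runs offline.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns the first descendant tag with that name.
text_via_attributes = demo_soup.html.body.p.get_text()

# find() searches the whole tree and also returns the first match.
text_via_find = demo_soup.find("p").get_text()

print(text_via_attributes)
print(text_via_find)
```

Both lines print the same paragraph text that `p.get_text()` returned above.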

# Finding all instances of a tag at once
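The `find_all` method returns every tag that matches a name, so a loop over its result replaces the manual navigation used earlier. The sketch below runs on a small made-up document (the markup in `sample_html` is hypothetical, not taken from any real page) before the cells below apply the same calls to a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
sample_html = """
<html>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <img src="/images/a.jpg" alt="A">
    <img alt="no source">
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like result containing every matching tag.
paragraph_texts = [p.get_text() for p in sample_soup.find_all("p")]

# Tag.get returns None when the attribute is missing instead of raising KeyError.
image_sources = [img.get("src") for img in sample_soup.find_all("img")]

print(paragraph_texts)
print(image_sources)
```

Using `get('src')` rather than `img['src']` keeps the loop from failing on tags that lack the attribute.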

In [59]:
import requests
from bs4 import BeautifulSoup
url = 'https://bongirwarvk.webnode.com/cgngui/'

In [61]:
r = requests.get(url)
htmlContent = r.content

In [62]:
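soup = BeautifulSoup(htmlContent, 'html.parser')
# Without parentheses, soup.prettify is only a reference to the method, so the
# output below is its repr wrapped around the raw document.
# Use print(soup.prettify()) to get the indented markup instead.
soup.prettify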

<bound method Tag.prettify of <!DOCTYPE html>

<!--[if IE 8]><html class="lt-ie10 lt-ie9 no-js" prefix="og: https://ogp.me/ns#" lang="en-us"><![endif]--><!--[if IE 9]><html class="lt-ie10 no-js" prefix="og: https://ogp.me/ns#" lang="en-us"><![endif]--><!--[if gt IE 9]><!--><html class="no-js" lang="en-us" prefix="og: https://ogp.me/ns#"><!--<![endif]--><head><meta charset="utf-8"/><link href="https://d1di2lzuh97fh2.cloudfront.net/files/44/442/442wv9.ico?ph=73f0ed939b" rel="shortcut icon"/><link href="https://d1di2lzuh97fh2.cloudfront.net/files/44/442/442wv9.ico?ph=73f0ed939b" rel="apple-touch-icon"/><link href="https://d1di2lzuh97fh2.cloudfront.net/files/44/442/442wv9.ico?ph=73f0ed939b" rel="icon"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><title>Computer Graphics &amp; GUI Design Technologies CST319 :: Bongirwarvk</title><meta content="width=device-width,initial-scale=1,viewport-fit=cover" name="viewport"/><meta content="no" name="msapplication-tap-highlight"/><li

In [48]:
title = soup.title
print(type(title))
title

<class 'bs4.element.Tag'>


<title>Computer Graphics &amp; GUI Design Technologies CST319 :: Bongirwarvk</title>

In [49]:
paragraphs = soup.find_all('p')
paragraphs

[<p>Computer Graphics</p>, <p>CST319</p>]

In [50]:
paragraphs = soup.find_all('p')
# Print only the text inside each <p> tag
for ele in paragraphs:
    print(ele.get_text())

Computer Graphics
CST319


In [53]:
images = soup.find_all('img')
print(images)
# The src attribute holds the URL of each image
for ele in images:
    print(ele.get('src'))

[<img alt="" height="272" src="https://bongirwarvk.webnode.com/_files/200000005-f06e8f167c/200/cg1.jpg" width="185"/>, <img alt="" height="203" src="https://bongirwarvk.webnode.com/_files/200000009-4c32a4d2c4/200/images.jpg" width="248"/>]
https://bongirwarvk.webnode.com/_files/200000005-f06e8f167c/200/cg1.jpg
https://bongirwarvk.webnode.com/_files/200000009-4c32a4d2c4/200/images.jpg


In [44]:
anchors = soup.find_all('a')
anchors

[<a href="/home/">
 <div class="logo-embed">
 <div class="logo-embed-cell">
 <embed data-src="https://d1di2lzuh97fh2.cloudfront.net/files/3g/3g3/3g3es4.svg?ph=73f0ed939b" id="wnd_LogoBlock_379717_img" type="image/svg+xml"/><script>checkAndChangeSvgColor('wnd_LogoBlock_379717_img');</script></div>
 </div>
 <div class="logo-text">
 <span class="logo-text-cell"></span>
 </div>
 </a>,
 <a href="#" id="menu-submit"><span></span>Menu</a>,
 <a class="close-menu" href="#" rel="nofollow">
 <span>Close Menu</span>
 </a>,
 <a class="menu-item" href="/home/"><span class="menu-item-text">Homepage</span></a>,
 <a class="menu-item" href="/wat/"><span class="menu-item-text">Web Architecture &amp; Technologies CST 413-1</span></a>,
 <a class="menu-item" href="/cgngui/"><span class="menu-item-text">Computer Graphics &amp; GUI Design Technologies CST319</span></a>,
 <a class="menu-item" href="/csp319/"><span class="menu-item-text">CSP319</span></a>,
 <a class="menu-item" href="/pps/"><span class="menu-it

In [31]:
# A set keeps each href only once
links = set()
for link in soup.find_all('a'):
    links.add(link.get('href'))
links

{'#',
 '/cgngui/',
 '/contact/',
 '/csp319/',
 '/home/',
 '/pps/',
 '/publications/',
 '/wat/',
 'https://us.webnode.com?utm_source=button&utm_medium=footer&utm_campaign=free1&utm_content=wnd2',
 'https://us.webnode.com?utm_source=text&utm_medium=footer&utm_campaign=free1&utm_content=wnd2'}
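Several of the collected href values are relative paths such as `/home/`, which cannot be requested on their own. A minimal sketch of turning them into absolute URLs with the standard library `urljoin`, using a few of the values shown above; the `None` entry is an assumption standing in for an anchor tag that has no href attribute.

```python
from urllib.parse import urljoin

# Page the links were collected from.
base_url = "https://bongirwarvk.webnode.com/cgngui/"

# A few of the href values from the set above, plus None for a missing href (assumed).
hrefs = {"#", "/cgngui/", "/home/", "/wat/", None}

# Skip missing hrefs and in-page anchors, then resolve the rest against the base URL.
absolute_links = sorted(
    urljoin(base_url, href)
    for href in hrefs
    if href and not href.startswith("#")
)

for link in absolute_links:
    print(link)
```

The resulting URLs can be passed straight to `requests.get` to scrape the linked pages.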

In [None]:
QUESTIONS
Implement web scraping for the web page
https://bongirwarvk.webnode.com/wat/

    Find all the links.
    Find the text.
    Find the images.
    Count how many times the words Test, assignment and lecture appear on the page.
    Find every word on the page and its frequency, and store the result in a dictionary.
    