## Web Scraping

* A technique for automatically extracting information from websites, using a script that simulates a human browsing the web.
* Web scraping lets us collect large volumes of information that would be tedious to gather by hand.

### Scraping Rules

* Check the website's Terms and Conditions before you scrape it.
* Do not flood the website with a large number of requests to a specific web page.
* Update your code from time to time, as websites change.
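
One way to follow these rules in code is to consult the site's `robots.txt` and to pause between requests. This complements reading the Terms and Conditions; it does not replace it. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the user agent name `my-scraper` and the `example.com` URLs are made up for illustration, and the file is inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")

# Seconds to wait between requests, if the site states a preference.
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

In a real scraper, `parser.set_url(...)` followed by `parser.read()` would load the live file, and `time.sleep(delay)` between requests would respect the crawl delay.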

### Libraries Used
* Beautiful Soup
* Selenium
* Scrapy

### Process
* Find the URL you wish to scrape.
* Send an HTTP request to that URL and get the HTML as the response.
* Parse the HTML content.
* Inspect the web page and find the data that we want to extract.
* Extract the required data and store it in the required format.
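
The steps above can be sketched as three small functions. This is only an outline under assumptions: the function names `fetch_html`, `extract_links` and `to_csv` are hypothetical, the 10-second timeout is an arbitrary choice, and the demo parses an inline HTML string so that it runs without network access.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Send an HTTP request and return the HTML body (not called in the demo below)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Parse the HTML and return (link text, href) pairs for every <a> with an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


def to_csv(rows):
    """Store the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "href"])
    writer.writerows(rows)
    return buffer.getvalue()


# Offline demo on a small HTML snippet instead of a live page.
sample_html = '<p><a href="https://codingninjas.in/"> Coding Nnjas </a></p>'
rows = extract_links(sample_html)
print(to_csv(rows))
```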

In [1]:
html = '<!DOCTYPE html>\
<html>\
<head>\
<title>Testing Web Page</title>\
</head>\
<body>\
<h1>Web Scraping</h1>\
<p class = "abc">\
Hello ! Good morning. Have a good Day !\
</p>\
<p id = "def" >\
<a href="https://codingninjas.in/"> Coding Nnjas </a>\
</p>\
<p class = "abc">\
hello\
</p>\
</body>\
</html>'

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
data = BeautifulSoup(html, 'html.parser')
type(data)

bs4.BeautifulSoup

In [4]:
print(data.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing Web Page
  </title>
 </head>
 <body>
  <h1>
   Web Scraping
  </h1>
  <p class="abc">
   Hello ! Good morning. Have a good Day !
  </p>
  <p id="def">
   <a href="https://codingninjas.in/">
    Coding Nnjas
   </a>
  </p>
  <p class="abc">
   hello
  </p>
 </body>
</html>


In [5]:
# data.<tag_name> returns the first tag with that name
data.title

<title>Testing Web Page</title>

In [6]:
data.h1
data.p # if there are several <p> tags, only the first one is returned

<p class="abc">Hello ! Good morning. Have a good Day !</p>

In [7]:
data.h2 # returns None (so nothing is displayed) if the tag is not present
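
Because a missing tag comes back as `None`, chaining an attribute such as `.string` onto it raises an `AttributeError`. A minimal sketch of guarding against that; the fallback text `"(no subtitle)"` is an arbitrary choice:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Web Scraping</h1>", "html.parser")

subtitle_tag = soup.h2  # there is no <h2> in this document, so this is None

# Check for None before touching attributes of the tag.
subtitle = subtitle_tag.string if subtitle_tag is not None else "(no subtitle)"
print(subtitle)
```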

In [8]:
print(data.title)

<title>Testing Web Page</title>


In [9]:
data.title.name # print name of tag

'title'

In [10]:
data.title.string # print contents of the tag

'Testing Web Page'

In [11]:
data.p.attrs # gets attributes of the tag

{'class': ['abc']}

In [12]:
data.p['class'] # get value of class attribute

['abc']

In [13]:
data.p.get('class') # alternatively..

['abc']
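
The two access styles differ when the attribute is absent: indexing raises a `KeyError`, while `.get()` returns `None` or a default. A short sketch on a made-up `<p>` tag that also shows why `class` comes back as a list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="abc note" id="intro">Hello</p>', "html.parser")
p = soup.p

classes = p["class"]                   # class is multi-valued, so a list is returned
missing = p.get("title")               # attribute absent: None, no exception
fallback = p.get("title", "untitled")  # attribute absent: the given default

try:
    p["title"]                         # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, missing, fallback, raised)
```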

In [14]:
data.get_text() # Text available on the webpage

'Testing Web PageWeb ScrapingHello ! Good morning. Have a good Day ! Coding Nnjas hello'

In [15]:
data.find('p') # find returns the first tag that matches

<p class="abc">Hello ! Good morning. Have a good Day !</p>

In [16]:
li = data.find_all('p') # all occurrences of a tag
print(li)

[<p class="abc">Hello ! Good morning. Have a good Day !</p>, <p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>, <p class="abc">hello</p>]


In [17]:
for i in li:
    print(i)

<p class="abc">Hello ! Good morning. Have a good Day !</p>
<p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>
<p class="abc">hello</p>


### Navigate Tree

We can navigate the parse tree in the following directions:
* Going Up
* Going Down
* Going Sideways
* Going back and forth

In [18]:
data.find_all(['p', 'a']) # matches any tag whose name is in the list

[<p class="abc">Hello ! Good morning. Have a good Day !</p>,
 <p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>,
 <a href="https://codingninjas.in/"> Coding Nnjas </a>,
 <p class="abc">hello</p>]

In [19]:
data.find_all(True) # Every tag present in the document

[<html><head><title>Testing Web Page</title></head><body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body></html>,
 <head><title>Testing Web Page</title></head>,
 <title>Testing Web Page</title>,
 <body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body>,
 <h1>Web Scraping</h1>,
 <p class="abc">Hello ! Good morning. Have a good Day !</p>,
 <p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>,
 <a href="https://codingninjas.in/"> Coding Nnjas </a>,
 <p class="abc">hello</p>]

In [20]:
data.find_all(id = 'def') # finds all with id = def

[<p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>]

In [21]:
data.find_all(class_ = "abc") # finds all with class = abc

[<p class="abc">Hello ! Good morning. Have a good Day !</p>,
 <p class="abc">hello</p>]
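
These filters can be combined in one `find_all` call. The sketch below, on a trimmed-down copy of the test page, combines a tag name with a class, caps the number of results with `limit`, and uses `href=True` to keep only tags that carry the attribute.

```python
from bs4 import BeautifulSoup

html = ('<body><h1>Web Scraping</h1>'
        '<p class="abc">Hello</p>'
        '<p id="def"><a href="https://codingninjas.in/">link</a></p>'
        '<p class="abc">hello</p></body>')
soup = BeautifulSoup(html, "html.parser")

abc_paragraphs = soup.find_all("p", class_="abc")        # tag name AND class
first_only = soup.find_all("p", class_="abc", limit=1)   # stop after one match
with_href = soup.find_all("a", href=True)                # any <a> that has an href
by_attrs = soup.find_all(attrs={"id": "def"})            # dict form, for arbitrary attributes

print(len(abc_paragraphs), len(first_only), len(with_href), by_attrs[0].name)
```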

In [22]:
# For CSS selectors, pass the selector string to data.select()
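
A small sketch of CSS selection with `select` and `select_one`, using a trimmed-down copy of the test page. The selectors shown (`p.abc`, `#def`, `p > a[href]`) are ordinary CSS; the HTML snippet is shortened for illustration.

```python
from bs4 import BeautifulSoup

html = ('<body><h1>Web Scraping</h1>'
        '<p class="abc">Hello</p>'
        '<p id="def"><a href="https://codingninjas.in/">link</a></p>'
        '<p class="abc">hello</p></body>')
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("p.abc")        # every <p> with class "abc" (a list)
by_id = soup.select_one("#def")        # first element with id "def", or None
links = [a["href"] for a in soup.select("p > a[href]")]  # <a href> directly inside a <p>

print(len(by_class), by_id["id"], links)
```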

In [23]:
print(data.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing Web Page
  </title>
 </head>
 <body>
  <h1>
   Web Scraping
  </h1>
  <p class="abc">
   Hello ! Good morning. Have a good Day !
  </p>
  <p id="def">
   <a href="https://codingninjas.in/">
    Coding Nnjas
   </a>
  </p>
  <p class="abc">
   hello
  </p>
 </body>
</html>


In [24]:
# Navigation using tag names
data.head

<head><title>Testing Web Page</title></head>

In [25]:
data.head.title

<title>Testing Web Page</title>

In [26]:
data.title.string # fetch string within tag

'Testing Web Page'

In [27]:
# .string returns the text when a tag has a single child (a string, or one nested tag that itself has a .string); it returns None otherwise


for i in data.find_all('p'):
    print(i.string)

Hello ! Good morning. Have a good Day !
 Coding Nnjas 
hello


In [28]:
# In case of multiple children, we can use .strings, which returns a generator object

for i in data.find_all('p'):
    print(list(i.strings))

['Hello ! Good morning. Have a good Day !']
[' Coding Nnjas ']
['hello']


In [29]:
# We can use .stripped_strings to strip leading and trailing whitespace from the strings


for i in data.find_all('p'):
    print(list(i.stripped_strings))

['Hello ! Good morning. Have a good Day !']
['Coding Nnjas']
['hello']
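
None of the `<p>` tags in the test page has more than one child, so the `None` case never shows up above. A sketch on a made-up paragraph with mixed content makes the difference visible:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b> !</p>", "html.parser")
p = soup.p

single = p.string                       # three children (text, <b>, text), so None
pieces = list(p.strings)                # every text fragment, whitespace included
stripped = list(p.stripped_strings)     # the same fragments with whitespace trimmed
joined = p.get_text(" ", strip=True)    # trimmed fragments joined with a space

print(single, pieces, stripped, joined)
```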


In [30]:
# Use .contents to get a list of all direct children of a tag

data.html.contents

[<head><title>Testing Web Page</title></head>,
 <body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body>]

In [31]:
li = data.html.contents
print(li)
len(li)

[<head><title>Testing Web Page</title></head>, <body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body>]


2

In [32]:
# .children can also be used, but it returns an iterator instead

type(data.html.children)

list_iterator

In [33]:
for i in data.html.children:
    print(i)

<head><title>Testing Web Page</title></head>
<body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body>


In [34]:
# .descendants returns a generator over all descendants of the tag (children, grandchildren, ...)

desc = list(data.html.descendants)

print(len(desc))

for i in desc:
    print(i)
    print()

13
<head><title>Testing Web Page</title></head>

<title>Testing Web Page</title>

Testing Web Page

<body><h1>Web Scraping</h1><p class="abc">Hello ! Good morning. Have a good Day !</p><p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p><p class="abc">hello</p></body>

<h1>Web Scraping</h1>

Web Scraping

<p class="abc">Hello ! Good morning. Have a good Day !</p>

Hello ! Good morning. Have a good Day !

<p id="def"><a href="https://codingninjas.in/"> Coding Nnjas </a></p>

<a href="https://codingninjas.in/"> Coding Nnjas </a>

 Coding Nnjas 

<p class="abc">hello</p>

hello
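
As the listing shows, `.descendants` yields both tags and the text nodes inside them. A sketch that separates the two with `isinstance`, on a small made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<div><h1>Title</h1><p>Text <a href="#">link</a></p></div>',
                     "html.parser")

nodes = list(soup.div.descendants)

tag_names = [n.name for n in nodes if isinstance(n, Tag)]           # only the tags
texts = [str(n) for n in nodes if isinstance(n, NavigableString)]   # only the text nodes
direct = [child.name for child in soup.div.children]                # direct children, for comparison

print(len(nodes), tag_names, texts, direct)
```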



### For going upwards, some attributes are:
.parent, .parents

### For going sideways:
.next_sibling and .previous_sibling
.next_siblings and .previous_siblings

### A few other directional attributes are:

.next_element and .previous_element
.next_elements and .previous_elements
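
A short sketch of these attributes on a made-up snippet. The HTML is written without whitespace between tags on purpose: with indented HTML, the sibling of a tag is often a whitespace-only text node rather than the next tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><h1>Web Scraping</h1>'
                     '<p class="abc">Hello</p>'
                     '<p id="def"><a href="#">link</a></p></body>',
                     "html.parser")

link = soup.a
parent_name = link.parent.name                     # going up one level
ancestor_names = [t.name for t in link.parents]    # going up all the way to the document

first_p = soup.find("p")
next_id = first_p.next_sibling["id"]               # sideways: the following <p>
prev_name = first_p.previous_sibling.name          # sideways: the preceding <h1>
later_names = [t.name for t in soup.h1.next_siblings]

next_el = soup.h1.next_element                     # next node in document order: the text inside <h1>

print(parent_name, ancestor_names, next_id, prev_name, later_names, str(next_el))
```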

## Scraping a web page!

In [35]:
response = requests.get("http://info.cern.ch/hypertext/WWW/TheProject.html")
print(response)
print(response.headers)

<Response [200]>
{'Date': 'Tue, 03 Dec 2019 18:06:05 GMT', 'Server': 'Apache', 'Last-Modified': 'Thu, 03 Dec 1992 08:37:20 GMT', 'ETag': '"40521e06-8a9-291e721905000"', 'Accept-Ranges': 'bytes', 'Content-Length': '2217', 'Connection': 'close', 'Content-Type': 'text/html'}


In [36]:
html_data = response.text

In [37]:
data = BeautifulSoup(html_data, "html.parser")
print(data.prettify())

<header>
 <title>
  The World Wide Web project
 </title>
 <nextid n="55"/>
</header>
<body>
 <h1>
  World Wide Web
 </h1>
 The WorldWideWeb (W3) is a wide-area
 <a href="WhatIs.html" name="0">
  hypermedia
 </a>
 information retrieval
initiative aiming to give universal
access to a large universe of documents.
 <p>
  Everything there is online about
W3 is linked directly or indirectly
to this document, including an
  <a href="Summary.html" name="24">
   executive
summary
  </a>
  of the project,
  <a href="Administration/Mailing/Overview.html" name="29">
   Mailing lists
  </a>
  ,
  <a href="Policy.html" name="30">
   Policy
  </a>
  , November's
  <a href="News/9211.html" name="34">
   W3  news
  </a>
  ,
  <a href="FAQ/List.html" name="41">
   Frequently Asked Questions
  </a>
  .
  <dl>
   <dt>
    <a href="../DataSources/Top.html" name="44">
     What's out there?
    </a>
    <dd>
     Pointers to the
world's online information,
     <a href="../DataSources/bySubject/Overview.htm

In [38]:
# Printing heading 1

data.h1

<h1>World Wide Web</h1>

In [39]:
data.h1.string

'World Wide Web'

In [40]:
# Printing title

data.title

<title>The World Wide Web project</title>

In [41]:
data.title.string

'The World Wide Web project'

In [42]:
# Get complete text available on webpage

print(data.get_text())


The World Wide Web project



World Wide WebThe WorldWideWeb (W3) is a wide-area
hypermedia information retrieval
initiative aiming to give universal
access to a large universe of documents.
Everything there is online about
W3 is linked directly or indirectly
to this document, including an executive
summary of the project, Mailing lists
, Policy , November's  W3  news ,
Frequently Asked Questions .

What's out there?
 Pointers to the
world's online information, subjects
, W3 servers, etc.
Help
 on the browser you are using
Software Products
 A list of W3 project
components and their current state.
(e.g. Line Mode ,X11 Viola ,  NeXTStep
, Servers , Tools , Mail robot ,
Library )
Technical
 Details of protocols, formats,
program internals etc
Bibliography
 Paper documentation
on  W3 and references.
People
 A list of some people involved
in the project.
History
 A summary of the history
of the project.
How can I help ?
 If you would like
to support the web..
Getting code
 Getting the cod

In [43]:
# Value of first hyperlink on webpage

data.a['href']

'WhatIs.html'

In [44]:
# All links

data.find_all('a')

[<a href="WhatIs.html" name="0">
 hypermedia</a>, <a href="Summary.html" name="24">executive
 summary</a>, <a href="Administration/Mailing/Overview.html" name="29">Mailing lists</a>, <a href="Policy.html" name="30">Policy</a>, <a href="News/9211.html" name="34">W3  news</a>, <a href="FAQ/List.html" name="41">Frequently Asked Questions</a>, <a href="../DataSources/Top.html" name="44">What's out there?</a>, <a href="../DataSources/bySubject/Overview.html" name="45"> subjects</a>, <a href="../DataSources/WWW/Servers.html" name="z54">W3 servers</a>, <a href="Help.html" name="46">Help</a>, <a href="Status.html" name="13">Software Products</a>, <a href="LineMode/Browser.html" name="27">Line Mode</a>, <a href="Status.html#35" name="35">Viola</a>, <a href="NeXT/WorldWideWeb.html" name="26">NeXTStep</a>, <a href="Daemon/Overview.html" name="25">Servers</a>, <a href="Tools/Overview.html" name="51">Tools</a>, <a href="MailRobot/Overview.html" name="53"> Mail robot</a>, <a href="Status.html#57" na

In [45]:
li = data.find_all('a')

for i in li:
    print(i.string, i['href'])
    print()


hypermedia WhatIs.html

executive
summary Summary.html

Mailing lists Administration/Mailing/Overview.html

Policy Policy.html

W3  news News/9211.html

Frequently Asked Questions FAQ/List.html

What's out there? ../DataSources/Top.html

 subjects ../DataSources/bySubject/Overview.html

W3 servers ../DataSources/WWW/Servers.html

Help Help.html

Software Products Status.html

Line Mode LineMode/Browser.html

Viola Status.html#35

NeXTStep NeXT/WorldWideWeb.html

Servers Daemon/Overview.html

Tools Tools/Overview.html

 Mail robot MailRobot/Overview.html


Library Status.html#57

Technical Technical.html

Bibliography Bibliography.html

People People.html

History History.html

How can I help Helping.html

Getting code ../README.html


anonymous FTP LineMode/Defaults/Distribution.html
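
The `href` values printed above are relative to the page they came from. A sketch of turning them into absolute URLs with the standard library's `urljoin`, using the page URL from the request above and three of the printed links:

```python
from urllib.parse import urljoin

page_url = "http://info.cern.ch/hypertext/WWW/TheProject.html"
hrefs = ["WhatIs.html", "../DataSources/Top.html", "Status.html#35"]

# urljoin resolves each relative link against the page it was found on,
# including ".." segments and "#fragment" parts.
absolute = [urljoin(page_url, href) for href in hrefs]

for url in absolute:
    print(url)
```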



In [46]:
# Links within a particular part of the page, e.g. the first <dl>

li_2 = data.dl.find_all('a')

for i in li_2:
    print(i.string, i['href'])
    print()

What's out there? ../DataSources/Top.html

 subjects ../DataSources/bySubject/Overview.html

W3 servers ../DataSources/WWW/Servers.html

Help Help.html

Software Products Status.html

Line Mode LineMode/Browser.html

Viola Status.html#35

NeXTStep NeXT/WorldWideWeb.html

Servers Daemon/Overview.html

Tools Tools/Overview.html

 Mail robot MailRobot/Overview.html


Library Status.html#57

Technical Technical.html

Bibliography Bibliography.html

People People.html

History History.html

How can I help Helping.html

Getting code ../README.html


anonymous FTP LineMode/Defaults/Distribution.html



In [47]:
response = requests.get('http://books.toscrape.com/')
response

<Response [200]>

In [48]:
response.headers

{'Server': 'nginx/1.12.1', 'Date': 'Tue, 03 Dec 2019 18:06:07 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 29 Jun 2016 21:39:03 GMT', 'X-Upstream': 'toscrape-books-master_web', 'Content-Encoding': 'gzip'}
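
The `Content-Type` header above carries no charset. In that situation `requests` falls back to ISO-8859-1 when building `response.text`, which garbles a page that is actually UTF-8. The sketch below reproduces the effect offline on a made-up string; with `requests`, the usual remedies are setting `response.encoding = 'utf-8'` before reading `response.text`, or passing `response.content` (bytes) to BeautifulSoup.

```python
# A made-up string containing two non-ASCII characters.
original = "Miss Peregrine’s Home, £51.77"

raw_bytes = original.encode("utf-8")       # what a UTF-8 server sends over the wire

garbled = raw_bytes.decode("iso-8859-1")   # decoding with the wrong charset
repaired = raw_bytes.decode("utf-8")       # decoding with the right charset

print(garbled)
print(repaired)
```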

In [49]:
html_data = response.text

In [50]:
data = BeautifulSoup(html_data, 'html.parser')
print(data.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [51]:
data.title

<title>
    All products | Books to Scrape - Sandbox
</title>

In [52]:
data.a # first link

<a href="index.html">Books to Scrape</a>

In [53]:
data.a.string 

'Books to Scrape'

In [54]:
data.header.a # a of header

<a href="index.html">Books to Scrape</a>

In [55]:
data.header.a.string

'Books to Scrape'

In [56]:
b1 = data.find(class_ = 'product_pod') # first product with class product_pod

In [57]:
b1.h3

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>

In [58]:
b1.h3.a

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [59]:
b1.h3.a["title"]

'A Light in the Attic'

In [60]:
b1.h3.a['href']

'catalogue/a-light-in-the-attic_1000/index.html'

In [61]:
print('http://books.toscrape.com/' + b1.h3.a['href'])

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html


In [62]:
# Links to every book on the first page of the website

books = data.find_all(class_ = 'product_pod')

print(len(books))

20


In [63]:
base_url = 'http://books.toscrape.com/'
books_url = []

for i in books:
    books_url.append(base_url + i.h3.a['href'])
    
print(books_url[1:3])

['http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'http://books.toscrape.com/catalogue/soumission_998/index.html']


### Extracting information about all 1000 books available on the website

In [64]:
all_urls = ['http://books.toscrape.com/catalogue/page-1.html']

current_page = 'http://books.toscrape.com/catalogue/page-1.html'
base_url = 'http://books.toscrape.com/catalogue/'

response = requests.get(current_page)
    
while response.status_code == 200:

    data = BeautifulSoup(response.text, 'html.parser')

    next_page = data.find(class_ = 'next')
    if next_page is None:  # no "next" link means this is the last page
        break
    next_page_url = base_url + next_page.a['href']
    all_urls.append(next_page_url)
    current_page = next_page_url
    response = requests.get(current_page)
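
The same "follow the next link" idea can be wrapped in a function. This is a sketch under assumptions: `collect_page_urls`, its `fetch` parameter and the `max_pages` safety cap are hypothetical names, and the three fake pages on `example.com` stand in for real responses so the example runs offline. It uses `urljoin` instead of string concatenation so the base URL does not have to be hard-coded.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_page_urls(start_url, fetch, max_pages=100):
    """Follow the link inside the element with class "next" until there is none."""
    urls = []
    current = start_url
    while current is not None and len(urls) < max_pages:
        urls.append(current)
        soup = BeautifulSoup(fetch(current), "html.parser")
        next_item = soup.find(class_="next")
        if next_item is None or next_item.a is None:
            current = None  # last page reached
        else:
            current = urljoin(current, next_item.a["href"])
    return urls


# Fake site (an assumption for the demo): URL -> HTML body.
fake_site = {
    "http://example.com/catalogue/page-1.html":
        '<ul><li class="next"><a href="page-2.html">next</a></li></ul>',
    "http://example.com/catalogue/page-2.html":
        '<ul><li class="next"><a href="page-3.html">next</a></li></ul>',
    "http://example.com/catalogue/page-3.html":
        "<p>last page</p>",
}

pages = collect_page_urls("http://example.com/catalogue/page-1.html", fake_site.__getitem__)
print(pages)
```

Against a live site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`.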

In [65]:
# First 10 result pages: page-1.html ... page-10.html
allPages = ['http://books.toscrape.com/catalogue/page-{}.html'.format(n) for n in range(1, 11)]

for url in allPages:
    response = requests.get(url)
    response.encoding = 'utf-8' # the server sends no charset, so requests would otherwise decode as ISO-8859-1
    data = BeautifulSoup(response.text, 'html.parser')
    books = data.find_all(class_ = 'product_pod')
    for book in books:
        print(book.h3.a['title'])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
In Her Wake
How Music Works
Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More
Chase Me (Paris Nights #2)
Black Dust
Birdsong: A Story in Pictures
A

Modern Romance
Miss Peregrine’s Home for Peculiar Children (Miss Peregrine’s Peculiar Children #1)
Louisa: The Extraordinary Life of Mrs. Adams
Little Red
Library of Souls (Miss Peregrine’s Peculiar Children #3)
Large Print Heart of the Pride
I Had a Nice Time And Other Lies...: How to find love & sh*t like that
Hollow City (Miss Peregrine’s Peculiar Children #2)
Grumbles
Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond
Frostbite (Vampire Academy #2)
Follow You Home
First Steps for New Christians (Print Edition)
Finders Keepers (Bill Hodges Trilogy #2)
Fables, Vol. 1: Legends in Exile (Fables #1)
Eureka Trivia 6.0
Drive: The Surprising Truth About What Motivates Us
Done Rubbed Out (Reightman & Bailey #1)
Doing It Over (Most Likely To #1)
Deliciously Ella Every Day: Quick and Easy Recipes for Gluten-Free Snacks, Packed Lunches, and Simple Meals


In [66]:
# Extract the title, URL, price and number of copies in stock for a book

response = requests.get("http://books.toscrape.com/")
data = BeautifulSoup(response.text, 'html.parser')

# Extract url of first book

b1 = data.find(class_ = 'product_pod')
base_url = 'http://books.toscrape.com/'
b1_url = base_url + b1.h3.a['href']
print(b1_url)

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html


In [67]:
response = requests.get(b1_url)
response.encoding = 'utf-8' # the server sends no charset; without this, '£' is decoded as 'Â£'
data = BeautifulSoup(response.text, 'html.parser')
title = data.h1.string
price = data.find(class_ = 'price_color').string
qty = data.find(class_ = 'instock availability')
qty = qty.contents[-1].strip() # the last child is the text node holding the availability

In [68]:
print(title, b1_url, price, qty)

A Light in the Attic http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html £51.77 In stock (22 available)


In [69]:
import re

In [70]:
qty = int(re.search(r'\d+', qty).group()) # group() returns the part of the string (qty) that matches the pattern
price = float(re.search(r'[\d.]+', price).group()) # raw strings keep '\d' from being treated as a string escape
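
`re.search` returns `None` when nothing matches, so calling `.group()` directly fails on text such as "Out of stock". A sketch of two small helpers that handle this; the names `parse_price` and `parse_quantity`, and the choice to treat "no number" as a quantity of 0, are assumptions for illustration.

```python
import re


def parse_price(text):
    """Return the first decimal number in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError("no price found in {!r}".format(text))
    return float(match.group())


def parse_quantity(text):
    """Return the first integer in text, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


print(parse_price("£51.77"))
print(parse_quantity("In stock (22 available)"))
print(parse_quantity("Out of stock"))
```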

In [71]:
print(title,b1_url, price, qty)

A Light in the Attic http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html 51.77 22


In [72]:
# Creating DF and converting to csv file

book_details = []

book_details.append([title, b1_url, price, qty])

In [73]:
import pandas as pd

In [74]:
df = pd.DataFrame(book_details, columns = ["Title", "Link", "Price", "Quantity_in_stock"])
df

|   | Title | Link | Price | Quantity_in_stock |
|---|---|---|---|---|
| 0 | A Light in the Attic | http://books.toscrape.com/catalogue/a-light-in... | 51.77 | 22 |


In [75]:
df.to_csv('books.csv')

In [76]:
# Removing row indices

df.to_csv('books.csv', index = False)
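
To go from one book to many, the per-book steps can be wrapped in a function and applied in a loop. The sketch below is an outline under assumptions: `parse_book_page` is a hypothetical helper, and the two sample pages are made-up HTML that only mimics the class names used above (`price_color`, `instock availability`), so the example runs offline. The second book and its values are invented.

```python
import re

from bs4 import BeautifulSoup


def parse_book_page(html, url):
    """Return [title, url, price, quantity] for one product page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.h1.get_text(strip=True)
    price_text = soup.find(class_="price_color").get_text()
    stock_text = soup.find(class_="instock").get_text(strip=True)

    price = float(re.search(r"\d+(?:\.\d+)?", price_text).group())
    stock_match = re.search(r"\d+", stock_text)
    quantity = int(stock_match.group()) if stock_match else 0
    return [title, url, price, quantity]


# Made-up product pages (URL -> HTML) standing in for real responses.
sample_pages = {
    "http://example.com/book-1.html": (
        "<h1>A Light in the Attic</h1>"
        '<p class="price_color">£51.77</p>'
        '<p class="instock availability"><i class="icon-ok"></i> In stock (22 available) </p>'
    ),
    "http://example.com/book-2.html": (
        "<h1>Example Book Two</h1>"
        '<p class="price_color">£9.50</p>'
        '<p class="instock availability"><i class="icon-ok"></i> In stock (3 available) </p>'
    ),
}

book_details = [parse_book_page(html, url) for url, html in sample_pages.items()]
print(book_details)
```

The resulting list of rows has the same shape as `book_details` above, so it can be passed to `pd.DataFrame(book_details, columns=[...])` and written out with `to_csv` in the same way.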