### **Introduction to Web Scraping**

**What is Web Scraping?**


*   Web scraping is the process of gathering information from the Internet.
*   Web scraping is an important technique for generating structured data from the unstructured data available on the web.
*   The structured data produced by scraping is then stored in a central database or analyzed in spreadsheets.
*   Common data-scraping techniques include traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web-scraping software, vertical aggregation platforms, semantic annotation recognition, and computer-vision web-page analyzers.



 **Need for Web Scraping?** [Use Cases](https://www.webharvy.com/articles/web-scraper-use-cases.html)

> The uses of Web Scraping for business as well as personal requirements are endless. Each business or individual has their own specific need for gathering data.

<p align="center">
    <img src="https://www.webharvy.com/images/web%20scraping%20uses.png">
</p>

1.   Price monitoring
2.   Spotting possible market trends
3.   Keeping a watch on your competitors
4.   Maintaining your brand identity
5.   Social media management
6.   SEO enhancement
7.   Knowing your target audience
8.   Improvising better solutions
9.   Targeted ads
10.  Tracking trends




# **Outline for Scraping**
1. Design considerations
2. Crawling
3. Scraping


<p align="center">
    <img src="http://raghudathesh.weebly.com/uploads/4/8/9/6/48968251/scraping-considatrations_orig.png">
</p>

# **Design Considerations**
1. What should the output be?
> * Type of information
> * Quality requirements
2. What is the best-suited input?
3. Which method gets you from the input to the output?




# **The essential fundamentals of web scraping are:**


*   Understanding the basics of HTML and CSS: HTML gives a web page its structure, and CSS styles it.
*   Exploring the structure of a web page with the browser's developer tools.
*   Making HTTP requests and receiving HTML responses.
*   Extracting specific structured information with Beautiful Soup.






# **BeautifulSoup** [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


*   Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.
*   Say you’ve found some webpages that display data relevant to your work/research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information.
*   It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.
*   This approach suits static content, that is, content already present in the HTML returned by an HTTP request for the page.
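
As a minimal sketch of the idea above, the snippet below parses a small, made-up HTML string (the page content, the `address` class and the street names are all hypothetical) and pulls out only the text of the items of interest. It uses the built-in `html.parser` so that nothing beyond Beautiful Soup needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; a real page would come from a file or an HTTP response.
html = """
<html><body>
  <h1>Office locations</h1>
  <ul>
    <li class="address">12 Example Street</li>
    <li class="address">34 Sample Avenue</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install

# Keep only the text of the elements we care about, dropping the markup.
addresses = [li.get_text(strip=True) for li in soup.find_all("li", class_="address")]
print(addresses)
```

The resulting list of plain strings can then be written to a CSV file or a database.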




# **Basic Terms in Web Scraping**

1.   **Crawler**: a web bot that visits a set of web pages and accumulates their links (URLs), deriving new URLs from each new HTML page it visits. A crawler may or may not save the pages' content to a data store. It does not follow links to further depth unless explicitly programmed to.
2.   **Scraper**: a bot that visits the web pages of a given set of URLs. Unlike a crawler, it does not collect new URLs; it visits pre-collected URLs and retrieves the relevant data to save in a data store.
3.   **Parser**: an offline program that processes or analyzes given data to derive a proper data structure. It extracts information from unstructured data, whether read from a data store or taken directly from the web (e.g. HTML).





# **Types of Parser**

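1.   **html.parser** : built into Python, no extra dependencies needed.
2.   **html5lib** : the most lenient; it parses pages the way a web browser does, so prefer it when the HTML is broken. It is also the slowest and must be installed separately.
3.   **lxml** : the fastest; must be installed separately.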
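
To see how the parsers differ on broken markup, the sketch below feeds the same invalid fragment to each of them. The fragment `<a></p>` is only an illustration; `lxml` and `html5lib` are optional packages, so the loop skips whichever is not installed rather than failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_html = "<a></p>"  # stray closing tag, deliberately invalid

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # Each parser repairs the broken fragment in its own way.
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} -> {markup if markup is not None else 'not installed'}")
```

Because the repaired trees can differ, it is worth naming the parser explicitly instead of relying on whichever one happens to be available.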


**How to Make a Soup out of HTML File**

(Note: here "soup" means the parsed representation of the HTML tree.)

In [47]:
from bs4 import BeautifulSoup

def read_file():
    file = open('intro_to_soup_html.html')
    data = file.read()
    file.close()
    return data

# Make soup
# Syntax: BeautifulSoup(html_data, parser)
# The parser here is 'lxml' (installed separately); the built-in 'html.parser' also works

html_file = read_file()
print(html_file)
print("------------------------------------\n")


soup = BeautifulSoup(html_file,'lxml')
print(soup)
print("------------------------------------\n")
type(soup)

# soup prettify
print(soup.prettify())
print("------------------------------------\n")


<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Intro_to_soup</title>
    </head>
    <body>
        <div>
            <p>In first div</p>
        </div>
        <div>
            <p>In second div</p>
        </div>
    </body>
</html>
------------------------------------

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Intro_to_soup</title>
</head>
<body>
<div>
<p>In first div</p>
</div>
<div>
<p>In second div</p>
</div>
</body>
</html>
------------------------------------

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Intro_to_soup
  </title>
 </head>
 <body>
  <div>
   <p>
    In first div
   </p>
  </div>
  <div>
   <p>
    In second div
   </p>
  </div>
 </body>
</html>

------------------------------------



# **How to Make a Soup out of any Website HTML**

# **fake_useragent:** [Library link](https://pypi.org/project/fake-useragent/)
Generates a random, realistic User-Agent string to send in the request headers, so that a request made without a browser looks like it came from one.



In [26]:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

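ua = UserAgent()
header = {'user-agent': ua.chrome}  # a random Chrome User-Agent string
google_page = requests.get('https://www.google.com', headers=header, timeout=10)  # the timeout avoids waiting forever
google_page.raise_for_status()  # stop early on an HTTP error status
#print(google_page.content)

soup = BeautifulSoup(google_page.content,'lxml') # 'html.parser' also works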

print(soup.prettify())


#identify some tags


<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN">
 <head>
  <meta charset="utf-8"/>
  <meta content="origin" name="referrer"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script nonce="tCQIDU4gccWzULl4OK6NXQ">
   (function(){var _g={kEI:'Ev7eZKKPLM3a4-EPrtCliAM',kEXPI:'31',kBL:'kYBU',kOPI:89978449};(function(){var a;(null==(a=window.google)?0:a.stvsc)?google.kEI=_g.kEI:window.google=_g;}).call(this);})();(function(){google.sn='webhp';google.kHL='en-IN';})();(function(){
var h=this||self;function l(){return void 0!==window.google&&void 0!==window.google.kOPI&&0!==window.google.kOPI?window.google.kOPI:null};var m,n=[];function p(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||m}function q(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b}function r(a){/^http:/i.test(a)&&"https:"===win

**Requests**: `requests.get()` issues an HTTP GET request to the given URL, retrieves the HTML that the server sends back, and stores it in a Python `Response` object.

# **Analysis of HTML Tags**

In [27]:
from bs4 import BeautifulSoup

def read_file():
    file = open('tags.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')
#print(soup)
print("------------------------------------\n")
# Accessing tags
meta = soup.meta
#print(meta) # gives us the first occurrence of the meta tag
print("------------------------------------\n")
div = soup.div
#print(div) # gives us the first occurrence of the div tag

# What a tag offers:
#   .name         -- the tag's name
#   .attrs        -- its attributes, as a dictionary
#   .get(key)     -- reads an attribute, returning None if it is absent
#   tag[key]      -- dictionary-style access, raising KeyError if it is absent

#print("Value of charset via the get method is: ")
#print(meta.get("charset"))

#print("Value of charset via dictionary access is: ")
#print(meta["charset"]) # a tag can be treated as a dictionary

# modify attributes at runtime
body = soup.body
#print(body) # prints the entire body content
#print(body.get('style'))  # None, as there is no style attribute yet (body['style'] would raise KeyError)
body['style'] = 'some style'
#print(body['style']) # returns 'some style'

# Multi-valued attributes
print(body['class']) # class is multi-valued, so it is returned as a list: first and second

------------------------------------

------------------------------------

['first', 'second']
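
The attribute access patterns from the cell above can be tried on an inline fragment. This is a small sketch with a made-up `div` (the `id`, `class` and `style` values are hypothetical), using the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = '<div id="main" class="first second">Hello</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

tag_name = div.name            # the tag's name
tag_id = div.get("id")         # .get() returns the value, or None if absent
missing = div.get("style")     # no style attribute yet, so this is None

try:
    div["title"]               # dictionary-style access to a missing attribute
    raised = False
except KeyError:
    raised = True              # ...raises KeyError instead of returning None

div["style"] = "color: red"    # attributes can be added or changed at runtime
classes = div["class"]         # multi-valued attribute, returned as a list

print(tag_name, tag_id, missing, raised, div["style"], classes)
```

Using `.get()` is the safer choice when an attribute may be absent on some of the scraped tags.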


# **Navigable Strings**

In [28]:
from bs4 import BeautifulSoup

def read_file():
    file = open('intro_to_soup_html.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# Navigable strings in the HTML file are: Intro_to_soup, In first div, In second div

# To access the string inside a tag, use the .string attribute
title = soup.title

#print(title)         # the complete HTML element is printed
#print(title.string)  # only the string inside the element is printed


# .replace_with(new_text) replaces a navigable string in place
print("Before replacing:")
print(title)

title.string.replace_with("title has been changed")# replaces "Intro_to_soup" to "title has been changed"

print("After replacing:")
print(title)

Before replacing:
<title>Intro_to_soup</title>
After replacing:
<title>title has been changed</title>
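
One detail worth knowing about `.string`: it is only set when a tag has a single child. The sketch below (inline, made-up markup) contrasts it with `get_text()`, which concatenates all the text inside a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>Demo</title><p>Hello <b>world</b></p>", "html.parser")

title_text = soup.title.string   # the single navigable string inside <title>
p_string = soup.p.string         # None: <p> holds a string and a <b> tag
p_text = soup.p.get_text()       # all text inside <p>, nested tags included

print(repr(title_text), p_string, repr(p_text))
```

When a tag may contain nested markup, `get_text()` is usually the more robust way to read its text.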


# **Navigating Through Tag Names**

In [29]:
from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# example -- accessing tags directly by their tag names
title = soup.title
print(title) # prints 1st title tag
print("------------------------------------\n")
p = soup.p
print(p) # prints 1st p tag
print("------------------------------------\n")

<title>
            The Dormouse's story
        </title>
------------------------------------

<p class="title">
<b>
                The Dormouse's story
            </b>
</p>
------------------------------------



# **Navigating Through Child tag**

In [46]:
from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# tag.contents         -- returns a list of children
head = soup.head
#print(head.contents)
#print("------------------------------------\n")
for child in head.contents:
    #print(child if child is not None else'')
    #print("------------------------------------\n")
    pass

body = soup.body
#print(body.contents) # note the newline strings between tags; a page fetched from the web may not contain them
#print("------------------------------------\n")
for child in body.contents:
    #print(child if child is not None else '', end='\n\n\n\n')  # end='\n\n\n\n' only separates the tags visually; it works fine without it
    pass

#------ nop------------
# .children         -- returns an iterator
for child in body.children:
    #print(child if child is not None else '', end='\n\n\n\n')
    pass
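
The newline strings mentioned above show up as children alongside the tags. The following sketch, on a small made-up `body`, filters them out by keeping only `Tag` objects, and also shows `.descendants`, which walks the whole subtree rather than just the direct children.

```python
from bs4 import BeautifulSoup, Tag

html = """<body>
<p class="title"><b>First</b></p>
<p class="story">Second</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list; it includes the "\n" strings between the tags.
print(len(body.contents))

# .children is an iterator over the same direct children; keep only real tags.
child_names = [child.name for child in body.children if isinstance(child, Tag)]

# .descendants goes deeper: children, grandchildren, and so on.
descendant_names = [node.name for node in body.descendants if isinstance(node, Tag)]

print(child_names, descendant_names)
```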


# Navigating with Beautifulsoup - Going Up - use three_sisters.html

There are three types of movement across the HTML parse tree:

1.   Down the tree - from the body tag to a p tag
2.   Up the tree - from a p tag to the body tag
3.   Sideways - from one p tag to another p tag






In [52]:
#This script describes how to move up in an html parse tree from a child tag

from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')
title = soup.title

parent = title.parent
#print(parent)   # prints the complete parent element
#print(parent.name)   # .name gives the parent tag's name


# .parent
p = soup.p
#print(p)  # prints the first occurrence of the p tag
#print(p.parent) # prints the complete body tag, since it is the parent of the p tag
#print(p.parent.name) # prints only the name of the parent

'''
Note: all p tags are siblings in the HTML.
The tree starts from soup --> its child is html --> html has head and body as children --> head and body have children depending on the structure of the web page
'''

# html
html = soup.html
print(type(html.parent))         # the BeautifulSoup object is the top-level parent of every parsed tag, including html


# soup
print(soup.parent) # returns None, as soup is at the top of the hierarchy

<class 'bs4.BeautifulSoup'>
None


In [32]:
'''This script describes the .parents attribute: how to access all the
parents of a particular tag'''

from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

'''
 .parents              --- returns a generator over all the parents
 we shall use the 'a' tag
 the 'a' tag has the 'p' tag as its parent, which has the 'body' tag as its parent, and so on

 moving up the tree: a --> p --> body --> html --> BeautifulSoup object

'''

link = soup.a
#print(link) # prints the first a tag
#print(link.parents) # prints a generator object, not the parents themselves
#print(link.parent) # returns the enclosing p tag
#print(link.parent.name) # returns only the name of the a tag's parent

for parent in link.parents:
    #print(parent.name) # p --> body --> html --> [document]
    pass
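
The climb up the tree can be checked on an inline fragment. This sketch uses a shortened, made-up stand-in for `three_sisters.html` and collects the name of every parent of the first `a` tag.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

link = soup.a
# .parents is a generator; it ends at the BeautifulSoup object itself,
# whose name is the special value "[document]".
parent_names = [parent.name for parent in link.parents]
print(parent_names)
```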


# Navigating with Beautifulsoup - Going Sideways (moving through siblings) - use three_sisters.html

In [33]:
# This script demonstrates moving from the current tag to the next sibling tag
# Here we are moving sideways

# observe that the first b and p tags are siblings

from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

body = soup.body
p = soup.body.p
#print(p) #print first p tag with class title

# body - contents
#print(body.contents)


# .next_sibling
# Our task now is to move from the p tag "title" to the next p tag "story".
# Observe the output of print(body.contents): there is a newline character "\n"
# and then the p tag "story"


print(p.next_sibling) # prints a blank line, as the next sibling is a newline string
#print(p.next_sibling.next_sibling) # prints the p tag "story": moving sideways







In [34]:
'''
This script demonstrates moving from the current tag to the previous sibling tag.
Here we are moving sideways,
from the body tag to the head tag.
'''

from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

body = soup.body

# contents - html
#print(soup.html.contents) # prints the children of the html tag

# we shall move from the body tag to the head tag
# .previous_sibling
#print(body.previous_sibling) # prints a blank line, as the previous sibling is a newline string
#print(body.previous_sibling.previous_sibling) # prints the head tag, the previous sibling of the body tag


In [2]:
'''
This script demonstrates moving from the current tag to its next and previous sibling tags.
Here we are moving sideways:
from the 'p' tag to the following 'p' tags, and also to the preceding 'b' tag sibling.
'''

from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')
p = soup.body.p
#print(p)   # prints the first 'p' tag: <p class="title">


# .next_siblings (after the b tag there are two siblings, i.e. two p tags)
# Use a conditional expression to skip the '\n' strings: (value if condition else '')

for sibling in p.next_siblings:
  #print(sibling.name if sibling != '\n' else '') # note: we are omitting the newline strings; see the tree
  pass


# .previous_siblings (before the first 'p' tag there is only one 'b' tag) # note: we are omitting the newline strings; see the tree

for sibling in p.previous_siblings:
  print(sibling if sibling != '\n' else '')


<p class="title">
<b>
                The Dormouse's story
            </b>
</p>

<b></b>
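
Instead of stepping over the newline strings by hand, Beautiful Soup's `find_next_sibling()` and `find_next_siblings()` return only sibling tags that match a filter. The sketch below uses a shortened, made-up version of the three-sisters markup to contrast them with the raw `.next_sibling`.

```python
from bs4 import BeautifulSoup

html = """<body>
<p class="title">The Dormouse's story</p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

raw_next = first_p.next_sibling                  # the newline string right after the tag
next_p = first_p.find_next_sibling("p")          # the next sibling that is a <p> tag
later_ps = first_p.find_next_siblings("p")       # every following <p> sibling

print(repr(raw_next))
print(next_p.get_text())
print(len(later_ps))
```

The mirror-image methods `find_previous_sibling()` and `find_previous_siblings()` work the same way in the other direction.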



Use Case Assignment: [Beautiful Soup](https://realpython.com/beautiful-soup-web-scraper-python/)

### Regular Expressions

In [36]:
import re   # regular expression module

# A specialised language used to search for text within a given document with precision and efficiency

# expression --> compiled into bytecode --> executed by a matching engine written in C

# Usage in our case:
# matching characters, finding tags, finding data in parsed strings.

r'''
What is a Regular Expression?
A simple expression matches itself in the given string:

REGEX: abc

String: abcdef     (the regex matches the leading "abc")

Exception --> Metacharacters

They don't match themselves; they have special meanings.

Complete list of metacharacters  -->         . ^ $ * + ? { } [ ] \ | ( )

'''

# The first metacharacters we will look at are --->        [   ]

'''
# REGEX: [abcdef]  or  [a-f]

# String: b        : applying the regex gives a match (True); any character outside the class gives no match (False)

# REGEX: [12345]  or  [1-5]


# String: 9        : the output will be False.



[ ] is used for specifying a character class - a character class is a set of characters you wish to match

for example, if I've written the following regex:

            [xyz]

this will match any x, y or z character

We could also give a range using a hyphen:


            [x-z]           --- equivalent to ---              [xyz]

'''

# **Compile function and Character class**

In [37]:
import re

# re.compile(pattern)       -- returns a regex object

#regex = re.compile('[ccca]')
#regex = re.compile('[a-h]') #Range
regex = re.compile('[a-zA-Z]') #Range


# regex.match(string to match) -- tries to match at the start of the string only; returns None if there is no match, else a match object
print(regex.match('BA'))


# character class

# complement the set [^pattern]
regex = re.compile('[^ccca]')
#regex = re.compile('[a-h]') #Range
#regex = re.compile('[a-zA-Z]') #Range
#regex = re.compile('[+]')

#print(regex.match('c'))



# most metacharacters lose their special meaning inside a character class (exceptions include ^ at the start, -, ] and \)
#regex = re.compile('[+]')
#print(regex.match('+'))

<re.Match object; span=(0, 1), match='B'>
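
The commented-out lines in the cell above can be run as a small experiment. This sketch uses only the standard `re` module; the test strings are arbitrary examples.

```python
import re

letters = re.compile(r"[a-zA-Z]")   # any ASCII letter
not_a_or_c = re.compile(r"[^ac]")   # complement: any character except a or c
literal_plus = re.compile(r"[+]")   # + is an ordinary character inside [ ]

first = letters.match("BA")         # match() looks only at the start: finds 'B'
no_match = letters.match("9A")      # the string starts with a digit, so None
found_later = letters.search("9A")  # search() scans the whole string: finds 'A'

print(first.group(), no_match, found_later.group())
print(not_a_or_c.match("c"), not_a_or_c.match("b").group())
print(literal_plus.match("+").group())
```

The difference between `match()` and `search()` matters in scraping, where the text of interest rarely sits at the very start of a string.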


## Special Sequences

In [38]:
import re

# special sequences
# (patterns are written as raw strings, r'...', so Python passes the backslash through to the regex engine unchanged)


# \d        -- matches any decimal digit --     [0-9]

regex = re.compile(r'\d')


# \D        -- matches any non-digit character  -- [^0-9]

regex = re.compile(r'\D')

# \s        -- matches any whitespace character

regex = re.compile(r'\s')

# \S        -- matches any non-whitespace character

regex = re.compile(r'\S')

# \w        -- matches any alphanumeric character or underscore -- [a-zA-Z0-9_]

regex = re.compile(r'\w')

# \W        -- matches any non-alphanumeric character -- [^a-zA-Z0-9_]

regex = re.compile(r'\W')
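
To see the special sequences at work, the sketch below applies a few of them to one arbitrary sample sentence with `re.findall()` and `re.split()`. The sentence is made up for illustration.

```python
import re

text = "Room 4B costs 120 USD"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits or underscores
gaps = re.findall(r"\s", text)         # individual whitespace characters
pieces = re.split(r"\s+", text)        # split on runs of whitespace

print(digits)
print(words)
print(len(gaps))
print(pieces)
```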

## Asterisk: Repeating Things

In [39]:
import re

# * character -- specifies that the previous character can be matched zero or more times, instead of exactly once.

# Simple regex
#regex = re.compile('aaaaa') # matches exactly 5 occurrences of 'a'
# for * the matching range is 0 to infinity
regex = re.compile('a*')  # matches any number of occurrences of 'a', whether 0, 500 or 50000
#print(regex.match('aaaaaaaccaa'))
#print(regex.match('')) # -- the lower limit is 0 and the upper limit is infinity

regex = re.compile('[a-c]*')       # -- the lower limit is 0 and the upper limit is infinity
#print(regex.match('caaaaaaaaaaabcaaaaa'))

### Plus (+): Repeating Things

In [40]:
import re

# +  character -- this specifies that the previous character can be matched one or more times

# difference from '*'-- 0 - infinity ,      for  '+'  matching range is 1 to infinity

regex = re.compile('a+')
#print(regex.match(''))
#print(regex.match('a'))
#print(regex.match('aaaaaaaaa'))


# using character classes
#regex = re.compile('[a-c]*')
regex = re.compile('[a-c]+')
#print(regex.match('abcabcabc'))
#print(regex.match(''))

## ? and {m,n}: Repeating Things

In [41]:
import re



# ? question mark -- says the previous character can either come once or not at all

regex = re.compile('a?b')           # min - 0       max - 1
#print(regex.match('b'))
#print(regex.match('a'))
#print(regex.match('ab'))
#print(regex.match('aab'))


# {m,n}    m and n are integer values   -- this quantifier means there must be at least m repetitions, and at most n

regex = re.compile('a{2,4}')            # aa aaa aaaa
#print(regex.match('a'))
#print(regex.match('aa'))
#print(regex.match('aaa'))
#print(regex.match('aaaa'))
#print(regex.match('aaaaa'))



# * is equivalent to {0,}

regex = re.compile('a{0,}')    # zero to infinite
#print(regex.match(''))
#print(regex.match('a'))
#print(regex.match('aa'))
#print(regex.match('aaa'))
#print(regex.match('aaaa'))
#print(regex.match('aaaaa'))
#print(regex.match('aaaaaaaa'))

# + is equivalent to {1,}

# Assignment: write test cases for different scenarios


# ? is equivalent to {0,1}
# Assignment: write test cases for different scenarios
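
One possible way to explore the quantifiers side by side is sketched below. It uses `re.fullmatch()`, which succeeds only when the whole string fits the pattern; `match()` would also accept a matching prefix, so `a{2,4}` would "match" `aaaaa` by consuming its first four characters. The helper name `full_matches` and the candidate strings are hypothetical.

```python
import re

candidates = ["", "a", "aa", "aaaa", "aaaaa", "b", "ab", "aab"]

def full_matches(pattern, strings):
    """Return the strings that the pattern matches from start to end."""
    return [s for s in strings if re.fullmatch(pattern, s)]

for pattern in (r"a*", r"a{0,}", r"a+", r"a{1,}", r"a?b", r"a{0,1}b", r"a{2,4}"):
    print(f"{pattern:8} {full_matches(pattern, candidates)}")

# match() only anchors at the start, so a longer run still yields a (partial) match.
prefix = re.match(r"a{2,4}", "aaaaa").group()
print(prefix)
```

The printed rows for `*` and `{0,}`, for `+` and `{1,}`, and for `?` and `{0,1}` come out identical, which is the equivalence stated in the comments above.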


### Metacharacters, Continued

In [42]:
import re
# ^ character   -- says that the match must begin at the start of the string
regex = re.compile('^abc')


# | character -- is the "or" operator: matches the pattern on either side

regex = re.compile('a|b')

# $ character -- matches at the end of the string (or just before a trailing newline)

regex = re.compile('abc$')
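
The three patterns compiled above can be exercised with `search()`, which scans the whole string and so makes the effect of the anchors visible. The test strings are arbitrary.

```python
import re

starts = re.compile(r"^abc")
ends = re.compile(r"abc$")
either = re.compile(r"a|b")

starts_ok = starts.search("abcdef") is not None   # "abc" is at the start
starts_no = starts.search("xabc") is None         # "abc" is present, but not at the start

ends_ok = ends.search("xyzabc") is not None       # "abc" is at the end
ends_no = ends.search("abcx") is None             # something follows "abc"
ends_newline = ends.search("abc\n") is not None   # $ also matches before a final newline

alternatives = either.findall("cab")              # every 'a' or 'b' in the string

print(starts_ok, starts_no, ends_ok, ends_no, ends_newline, alternatives)
```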



### Introduction to searching with Beautiful Soup

**Use Case: Say your tree has thousands of tags. How are you going to navigate through thousands of tags to get to the tags you want?**
* Solution:
>* This is where searching comes in. Beautiful Soup provides powerful and efficient methods that return the tags we want.
>* They search the whole parse tree for the tags we want and return those tags to us.
>* The most popular methods for searching are "find" and "find_all".


In [43]:
from bs4 import BeautifulSoup
import re

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# most popular methods

# find()
# find_all()        -- to keep it simple for now, it takes the tag name as parameter


# Kinds of filters we can use to retrieve tags - a filter is passed as a parameter to the find/find_all methods

# string
#print(soup.find_all('b')) # gives a list of tags
#print(soup.find_all('a'))

# regular expression

# tag names start with b

regex = re.compile('^b')

for tag in soup.find_all(regex):
    #print(tag.name)
    pass


# tag names contains t

regex = re.compile('t')

for tag in soup.find_all(regex):
    #print(tag.name)
    pass


# list

# all a and b tags

for tag in soup.find_all(['a','b']):
    #print(tag.name)
    pass


# function

# just giving an example here - we'll discuss this more when we implement find_all

def has_class(tag):
    return tag.has_attr('class')

for tag in soup.find_all(has_class):
  #print(tag.name)
  pass
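
The four filter kinds can be compared on an inline document. The markup below is a shortened stand-in for `three_sisters.html` (the actual file is not shown here, so its exact contents are an assumption), parsed with the built-in `html.parser`.

```python
import re
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

def has_id(tag):
    # A function filter receives each tag and returns True to keep it.
    return tag.has_attr("id")

by_string = [t.name for t in soup.find_all("b")]                # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile(r"^b"))]   # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]           # any name in the list
by_function = [t.name for t in soup.find_all(has_id)]           # custom predicate

print(by_string, by_regex, by_list, by_function)
```

Results come back in document order, which is why `body` appears before `b` in the regex case.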

## find_all function

In [44]:
from bs4 import BeautifulSoup


def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# Signature: find_all(name, attrs, recursive, string, limit, **kwargs)  # kwargs = keyword arguments

# the name parameter can take a string, a regex object, a list, a function, or True

a_tags = soup.find_all('a')
#print(a_tags)

# attrs parameter

# dictionary


attr = {'class' : 'sister'}
first_a = soup.find_all('a' , attrs=attr)
#print(first_a)

attr = {'class':'story'}
first_a = soup.find_all(attrs=attr)
#print(first_a)

attr = {'class' : 'sister', 'id' : 'link1'}
first_a = soup.find_all('a' , attrs=attr)
#print(first_a)


# limit parameter used to limit number of tags to return

a_tags = soup.find_all('a',limit=2)
#print(a_tags)

In [45]:
from bs4 import BeautifulSoup
import re


def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

# string parameter: accepts a string or a regex object
regex = re.compile('Elsie')
#regex = re.compile('story')

#tag = soup.find_all(string=regex)
#print(tag)


# **kwargs arguments
tags = soup.find_all(class_='sister')
#tags = soup.find_all(class_='story')
for tag in tags:
  #print(tag)
  pass

# to filter on the class attribute of a tag, use class_ because class is a reserved keyword in Python


# recursive parameter


title = soup.find_all('title',recursive=False) # returns an empty list: with recursive=False only the direct children of soup (the html tag) are searched; try with True
#print(title)
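
As a compact recap of the `find_all` parameters covered above, the sketch below runs each one against a shortened stand-in for `three_sisters.html` (an assumption, since the file itself is not shown) and also includes `find()`, which returns only the first match.

```python
import re
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.find_all("a", attrs={"class": "sister"})                # attrs dictionary
link1 = soup.find_all("a", attrs={"class": "sister", "id": "link1"})   # several attributes
by_kwarg = soup.find_all(class_="sister")                              # keyword form of class
first_two = soup.find_all("a", limit=2)                                # stop after two results
elsie_text = soup.find_all(string=re.compile("Elsie"))                 # matches strings, not tags

# recursive=False looks only at direct children of the tag it is called on.
top_level = soup.find_all("title", recursive=False)        # title is not a direct child of soup
under_head = soup.head.find_all("title", recursive=False)  # but it is a direct child of head

first_link = soup.find("a")                                # first match, or None if there is none

print(len(sisters), len(link1), len(by_kwarg), len(first_two))
print(elsie_text, top_level, under_head, first_link["id"])
```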
