### Web Scraping
* Reference: https://realpython.com/python-web-scraping-practical-introduction/
* Web scraping is the process of collecting and parsing data from the Web, and the Python community has built some powerful web scraping tools.
* One useful package for web scraping in Python’s standard library is urllib, which contains tools for working with URLs.
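
As a small illustration of the URL tools mentioned above, the sketch below splits a URL into its parts with `urllib.parse.urlparse`. The variable names are only for illustration; the URL is one of the profile pages used later in this notebook.

```python
from urllib.parse import urlparse

profile_url = "http://olympus.realpython.org/profiles/aphrodite"

# urlparse() only splits the string into components; it makes no network request.
parts = urlparse(profile_url)

print(parts.scheme)  # http
print(parts.netloc)  # olympus.realpython.org
print(parts.path)    # /profiles/aphrodite
```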


### Using String Methods

In [9]:
# Example 1:
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)  # urlopen() returns an HTTPResponse object
page

<http.client.HTTPResponse at 0x7f6ad8f7ab90>

In [7]:
html_bytes = page.read()  # read() is provided by HTTPResponse and returns the response body as bytes
html_decode = html_bytes.decode('utf-8')  # decode() converts the bytes to a str, using the UTF-8 encoding
print(html_decode)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



In [8]:
type(html_decode)

str
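
The open, read, and decode steps above can be wrapped in one helper. This is a minimal sketch, not the tutorial's own code: the name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are all assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as a str (hypothetical helper)."""
    # The with-statement closes the response once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
        # Prefer the charset declared in the Content-Type header.
        # Falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going if a few bytes do not match the charset.
    return raw_bytes.decode(charset, errors="replace")
```

Calling `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` should return the same kind of string as `html_decode` above, provided the site is reachable.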

#### Extract Text From HTML With String Methods

In [10]:
# Example 2:
url="http://olympus.realpython.org/profiles/poseidon"
page=urlopen(url)
page

<http.client.HTTPResponse at 0x7f6ad9083d90>

In [11]:
html_bytes=page.read()
html=html_bytes.decode('utf-8')
print(html)

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>



In [14]:
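# The page writes its opening tag as "<title >" (with a space), so find("<title>")
# returns -1. start_index becomes -1 + len("<title>") = 6, and the slice picks up
# extra HTML before the title, as the output below shows.
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
print(title)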



<head>
<title >Profile: Poseidon
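
The stray output above comes from `find()` returning -1 when the exact string `<title>` is not present. A minimal sketch of a more defensive string-method approach is shown below. The helper name `extract_between` and the two inline HTML strings are assumptions for illustration; they stand in for the downloaded pages.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:  # find() signals "not found" with -1
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end]


# Inline stand-ins for the two profile pages (no network needed).
clean_html = "<html><head><title>Profile: Aphrodite</title></head></html>"
messy_html = "<html><head><title >Profile: Poseidon</title></head></html>"

clean_title = extract_between(clean_html, "<title>", "</title>")
messy_title = extract_between(messy_html, "<title>", "</title>")

print(clean_title)  # Profile: Aphrodite
print(messy_title)  # None, because "<title >" is not the same string as "<title>"
```

Returning None makes the failure visible instead of silently producing a wrong slice, but it still cannot read the messy tag. That limitation is what motivates the regular expression in the next cell.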


In [16]:
# Extract the title without HTML tags using regular expressions.
import re
pattern = "<title.*?>.*?</title.*?>"  # tolerates extra characters, such as spaces, inside the tags
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)  # Remove HTML tags

print(title)

Profile: Poseidon
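
Note that `re.search()` returns None when nothing matches, so calling `.group()` on the result would raise an AttributeError for a page without a title. The sketch below is one possible variation, not the tutorial's code. The name `extract_title`, the capture-group pattern, and the use of `re.DOTALL` (so a title spread over several lines still matches) are assumptions.

```python
import re

# Capture only the text between the tags, so no second re.sub() pass is needed.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title as a str, or None if no <title> element is found."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


print(extract_title("<html><head><title >Profile: Poseidon</title></head></html>"))
# Profile: Poseidon
print(extract_title("<html><body>No title here</body></html>"))
# None
```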


### Use an HTML Parser for Web Scraping in Python

In [22]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [38]:
from bs4 import BeautifulSoup
url="http://olympus.realpython.org/profiles/dionysus"
page=urlopen(url)
html=page.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")  # Creates a BeautifulSoup object and assigns it to soup
# BeautifulSoup() takes two arguments here: the HTML string to parse, and the name of the
# parser to use behind the scenes. "html.parser" is Python’s built-in HTML parser.

In [39]:
print(soup.get_text().replace("\n", " "))  # get_text() strips all HTML tags and returns only the text

  Profile: Dionysus      Name: Dionysus  Hometown: Mount Olympus  Favorite animal: Leopard   Favorite Color: Wine    
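
The output above still contains runs of spaces where the newlines used to be. As a sketch, `get_text()` also accepts `separator` and `strip` arguments that tidy this up in one step. The inline `sample_html` string is an assumption that stands in for the downloaded page, so the cell runs without a network connection.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded profile page (illustrative only).
sample_html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<h2>Name: Dionysus</h2>
Hometown: Mount Olympus
<br><br>
Favorite animal: Leopard
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator=" " joins the remaining pieces with a single space.
text = soup.get_text(separator=" ", strip=True)
print(text)
```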


In [47]:
soup.find_all('img')

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

In [44]:
image1, image2 = soup.find_all('img')  # find_all() returns Tag objects, provided by BeautifulSoup

In [45]:
# Each Tag object has a .name property that returns a string containing the HTML tag type:
image1.name

'img'

In [48]:
image1["src"]

'/static/dionysus.jpg'
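
Unpacking into `image1, image2` works only when the page has exactly two images, and `image1["src"]` raises a KeyError if the attribute is missing. The sketch below shows a more forgiving way to collect image sources, assuming an inline `sample_html` string (not the real page) that includes one `<img>` without a `src`.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a page; the middle <img> deliberately has no src attribute.
sample_html = """
<html><head><title>Profile: Dionysus</title></head>
<body>
<img src="/static/dionysus.jpg" />
<img alt="no source here" />
<img src="/static/grapes.png" />
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tag.get() returns None instead of raising KeyError when the attribute is absent,
# so images without a src are simply skipped.
image_sources = [img.get("src") for img in soup.find_all("img") if img.get("src")]

print(image_sources)  # ['/static/dionysus.jpg', '/static/grapes.png']
```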

In [50]:
# Certain tags in HTML documents can be accessed by properties of the Tag object.
soup.title # gives the title tag 

<title>Profile: Dionysus</title>

In [51]:
soup.title.string # gives only the string in title tag

'Profile: Dionysus'

In [76]:
# Example 3:
base_url="http://olympus.realpython.org"
page=urlopen(base_url+'/profiles')
html= page.read().decode('utf-8')
soup=BeautifulSoup(html,"html.parser")
print("Title = ",soup.title.string)
print("\n",soup.prettify())

Title =  All Profiles

 <html>
 <head>
  <title>
   All Profiles
  </title>
 </head>
 <body bgcolor="yellow">
  <center>
   <br/>
   <br/>
   <h1>
    All Profiles:
   </h1>
   <br/>
   <br/>
   <h2>
    <a href="/profiles/aphrodite">
     Aphrodite
    </a>
    <br/>
    <br/>
    <a href="/profiles/poseidon">
     Poseidon
    </a>
    <br/>
    <br/>
    <a href="/profiles/dionysus">
     Dionysus
    </a>
   </h2>
  </center>
 </body>
</html>



In [71]:
soup.find_all('a')

[<a href="/profiles/aphrodite">Aphrodite</a>,
 <a href="/profiles/poseidon">Poseidon</a>,
 <a href="/profiles/dionysus">Dionysus</a>]

In [63]:
for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus
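
Concatenating `base_url + link["href"]` works here because every href on this page starts with `/`. As a hedged alternative, `urllib.parse.urljoin` also copes with hrefs that are already absolute. In the sketch below, the `hrefs` list is illustrative: the first entry matches the page above, while the absolute one is a hypothetical case that does not appear on the real page.

```python
from urllib.parse import urljoin

# URL of the page the links were found on.
page_url = "http://olympus.realpython.org/profiles"

hrefs = [
    "/profiles/aphrodite",                              # root-relative, as on the real page
    "http://olympus.realpython.org/profiles/dionysus",  # already absolute (hypothetical)
]

# urljoin() resolves each href against the page URL and leaves absolute URLs unchanged.
link_urls = [urljoin(page_url, href) for href in hrefs]

for link_url in link_urls:
    print(link_url)
# http://olympus.realpython.org/profiles/aphrodite
# http://olympus.realpython.org/profiles/dionysus
```

With plain concatenation, the second href would have produced a broken URL that repeats the host name.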


In [77]:
pip install MechanicalSoup

Collecting MechanicalSoup
  Downloading MechanicalSoup-1.2.0-py3-none-any.whl (19 kB)
Installing collected packages: MechanicalSoup
Successfully installed MechanicalSoup-1.2.0
Note: you may need to restart the kernel to use updated packages.
