<a href="https://colab.research.google.com/github/quicksilverTrx/practical-nlp/blob/master/Ch2/01_WebScraping_using_BeautifulSoup_self.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this notebook we show how to scrape data from web pages using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python library.

In [1]:
#making the necessary imports
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen 

In [2]:
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"  # URL of the page to scrape
html = urlopen(myurl).read()  # query the website; returns the raw HTML of the page as bytes
soupified = BeautifulSoup(html, 'html.parser')  # parse the HTML in the 'html' variable into a BeautifulSoup object
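
Some sites reject requests that carry urllib's default User-Agent, in which case `urlopen(myurl)` raises an `HTTPError`. The following is a minimal sketch of one possible workaround, assuming the site accepts a descriptive header. The helper names `build_request` and `fetch_html` and the header value are illustrative and not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical header value; any descriptive User-Agent string could be used instead.
USER_AGENT = "Mozilla/5.0 (compatible; practical-nlp-notebook)"

def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Fetch the page and return its raw HTML as bytes (performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

request = build_request("https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python")
print(request.get_header("User-agent"))  # urllib stores header names capitalized this way
```

If the plain call fails, `html = fetch_html(myurl)` could stand in for `urlopen(myurl).read()`.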

The HTML of the page (`soupified`) is large, so we show only part of the output: the first cell below prints the full structure (which the notebook truncates), and the next one prints only the first 2000 characters.

In [3]:
pprint(soupified.prettify())      #for printing the full HTML structure of the webpage

Streaming output truncated to the last 5000 lines.
 'height="36" viewbox="0 0 36 36" width="36">\n'
 '               <path d="M6 14l8 8L30 6v8L14 30l-8-8v-8z">\n'
 '               </path>\n'
 '              </svg>\n'
 '             </div>\n'
 '            </div>\n'
 '            <a aria-label="Timeline" class="js-post-issue grid--cell s-btn '
 's-btn__unset c-pointer py6 mx-auto" data-controller="s-tooltip" '
 'data-ks-title="timeline" data-s-tooltip-placement="right" data-shortcut="T" '
 'href="/posts/51829852/timeline" title="Show activity on this post.">\n'
 '             <svg aria-hidden="true" class="mln2 mr0 svg-icon iconHistory" '
 'height="18" viewbox="0 0 19 18" width="19">\n'
 '              <path d="M3 9a8 8 0 113.73 6.77L8.2 14.3A6 6 0 105 '
 '9l3.01-.01-4 4-4-4h3L3 9zm7-4h1.01L11 9.36l3.22 2.1-.6.93L10 10V5z">\n'
 '              </path>\n'
 '             </svg>\n'
 '            </a>\n'
 '           </div>\n'
 '          </div>\n'
 '          <div class="answe

In [4]:
pprint(soupified.prettify()[:2000])  # first 2000 characters, to get an idea of the HTML structure of the webpage

('<!DOCTYPE html>\n'
 '<html class="html__responsive" itemscope="" '
 'itemtype="https://schema.org/QAPage">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How to get the current time in Python - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name="viewport"/>\n'
 '  

In [5]:
soupified.title #to get the title of the web page 

<title>datetime - How to get the current time in Python - Stack Overflow</title>
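
`soupified.title` returns the whole tag, markup included. Here is a small sketch of how the tag name and its text could be read separately. It uses an inline HTML string as a stand-in for the live page, so the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document so the example runs without a network call.
sample = (
    "<html><head>"
    "<title>datetime - How to get the current time in Python - Stack Overflow</title>"
    "</head></html>"
)
soup = BeautifulSoup(sample, "html.parser")

title_tag = soup.title         # the Tag object, printed with its markup
title_name = title_tag.name    # the name of the tag
title_text = title_tag.string  # only the text inside the tag

print(title_name)
print(title_text)
```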

In [6]:
question = soupified.find("div", {"class": "question"})  # find the tag and the class it belongs to
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())

answer = soupified.find("div", {"class": "answer"})  # find() returns the first matching tag, i.e. the first answer on the page
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())

Question: 
 What is the module/method used to get the current time?
Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
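
The class names used above depend on the page's current markup, and `find` returns `None` when nothing matches, so a chained call can fail with an `AttributeError` if the site changes its layout. The sketch below shows one way to guard against that and to collect every answer with `find_all`. The HTML snippet is a made-up stand-in that only mimics the class names seen above, and `post_text` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page; only the class names mirror the notebook.
SAMPLE_HTML = """
<html><body>
  <div class="question">
    <div class="s-prose js-post-body"><p>How do I get the current time?</p></div>
  </div>
  <div class="answer">
    <div class="s-prose js-post-body"><p>Use datetime.datetime.now().</p></div>
  </div>
  <div class="answer">
    <div class="s-prose js-post-body"><p>Use time.strftime().</p></div>
  </div>
</body></html>
"""

def post_text(container):
    """Return the stripped text of a post body, or None if anything is missing."""
    if container is None:
        return None
    body = container.find("div", class_="js-post-body")
    return body.get_text().strip() if body is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

question_text = post_text(soup.find("div", class_="question"))
# find_all returns every match, not just the first one
answer_texts = [post_text(div) for div in soup.find_all("div", class_="answer")]
# No element has this class, so find returns None and the helper passes it through
missing = post_text(soup.find("div", class_="comment"))

print(question_text)
print(answer_texts)
print(missing)
```

Matching on a single class with `class_` is a design choice: it keeps working if the site adds or reorders other classes on the same element, whereas the exact string `"s-prose js-post-body"` does not.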


Beautiful Soup is one of many libraries that allow us to scrape web pages. Depending on your needs, you can choose among the available options, such as Beautiful Soup, Scrapy, and Selenium.