## Scraping content from the web

Required packages: `beautifulsoup4` (the scraping library), `lxml` (the HTML parser) and `requests` (to fetch web pages).

To check whether they are installed, run `conda list <package-name>` or `pip show <package-name>`.
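
The same check can be done from inside Python. This is a minimal sketch, assuming the import names `bs4`, `lxml` and `requests` (the `beautifulsoup4` package is imported as `bs4`); `missing_packages` is a hypothetical helper written for this example, not part of any library:

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# beautifulsoup4 is imported as "bs4", so that is the name to look up.
print(missing_packages(["bs4", "lxml", "requests"]))
```

An empty list means all three packages can be imported.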

We'll retrieve web pages, parse their HTML, and extract the information we want.

HTML is a markup language that wraps content in tags, e.g. `<p>this is a paragraph</p>`.

So we need to inspect the HTML of the page we want to scrape and find out which tags hold the content we care about.
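
To make the idea of tags concrete, here is a small sketch that parses a one-line snippet. The snippet and its `intro` class are made up for illustration, and the built-in `html.parser` is used instead of `lxml` so that only `beautifulsoup4` is needed:

```python
from bs4 import BeautifulSoup

# "html.parser" ships with Python, so this sketch does not need lxml.
snippet = '<p class="intro">this is a <b>paragraph</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.p
print(tag.name)      # the tag name
print(tag["class"])  # attributes are read like dict keys; class is a list
print(tag.text)      # text of the tag and its children, markup removed
```

A tag therefore gives us three things to work with: its name, its attributes, and the text it wraps.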

Open the `example.html` file to see a simple HTML document.

In [2]:
# We'll start by parsing the example.html file as a simple exercise:
from bs4 import BeautifulSoup as bsoup

with open('example.html') as html_file:
    soup = bsoup(html_file, 'lxml')
    
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>



In [3]:
# Getting the text inside the <title> tag:
page_title = soup.title.text

print(page_title)

Test - A Sample Website


In [4]:
# Getting the footer <div>
page_footer = soup.find('div', class_='footer')

print(page_footer)

<div class="footer">
<p>Footer Information</p>
</div>
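
One thing to keep in mind: `find()` returns `None` when nothing matches, so chaining `.text` onto a missing element raises an `AttributeError`. A minimal sketch of a guard, using an inline snippet and a made-up `sidebar` class that does not exist in it:

```python
from bs4 import BeautifulSoup

html = '<div class="footer"><p>Footer Information</p></div>'
soup = BeautifulSoup(html, "html.parser")

# There is no element with class "sidebar" in this snippet.
sidebar = soup.find("div", class_="sidebar")
print(sidebar)

# Guard before touching .text, otherwise an AttributeError is raised.
sidebar_text = sidebar.text if sidebar is not None else ""

# find_all() returns an empty list-like result instead of None.
print(len(soup.find_all("div", class_="sidebar")))
```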


In [5]:
# Getting all the articles' headlines and summaries in the page:
# First, we find out how to parse one article:
article = soup.find('div', class_='article')
headline = article.h2.a.text
summary = article.p.text
print(f'{article.prettify()} \n\n {headline} \n\n {summary}')

<div class="article">
 <h2>
  <a href="article_1.html">
   Article 1 Headline
  </a>
 </h2>
 <p>
  This is a summary of article 1
 </p>
</div>
 

 Article 1 Headline 

 This is a summary of article 1


In [6]:
# With this info, we can now get all the articles with a simple for loop and the find_all() method:
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary = article.p.text
    print(f'{headline} \n{summary}')
    print('-' * 30)

Article 1 Headline 
This is a summary of article 1
------------------------------
Article 2 Headline 
This is a summary of article 2
------------------------------
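
Printing is fine for exploring, but for later processing it is usually handier to collect the results. The sketch below gathers each article into a dict; the inline HTML mirrors the two articles of `example.html`, and the dict keys (`headline`, `url`, `summary`) are names chosen for this example:

```python
from bs4 import BeautifulSoup

html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

articles = []
for article in soup.find_all("div", class_="article"):
    link = article.h2.a
    articles.append({
        "headline": link.text.strip(),
        "url": link["href"],          # attribute access, like a dict
        "summary": article.p.text.strip(),
    })

print(articles)
```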


## Exercise
Now we'll do the same, but on a live website.

The goal is to access this URL: https://en.wikipedia.org/wiki/Python_(programming_language) and:

1. Retrieve all the headlines (the first is 'History')
2. Get a list of all the languages in which the page is available (bottom of the left sidebar)

In [1]:
# Let's request the page and save the response:
from bs4 import BeautifulSoup as bsoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)').text
soup = bsoup(source, 'lxml')
    
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Python (programming language) - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":917802878,"wgRevisionId":917802878,"wgArticleId":23862,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Use dmy dates from August 2015","Articles containing potentially dated statements from March 2018","All articles containing potentially dated statements","Articles containing potentially dated statements from August 2016","Articles containing potentially dated statements from December 2018","Articles with Curlie links","Wikipedia articles with BNF identifiers","Wikipedia articles wi
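
On a live site the request itself can fail or hang, so it may be worth wrapping it. This is a sketch under assumptions, not the only way to do it: `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the `User-Agent` string is a placeholder you would replace with one describing your own script:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent string; replace it with one describing your script.
    headers = {"User-Agent": "scraping-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (needs network access):
# source = fetch_html("https://en.wikipedia.org/wiki/Python_(programming_language)")
```

Without `raise_for_status()`, an error page would be parsed as if it were the article and the later `find_all()` calls would simply return nothing.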

In [None]:
# Let's solve task #1, retrieve all the headlines:

In [2]:
for headline in soup.find_all('span', class_='mw-headline'):
    print(headline.text)

History
Features and philosophy
Syntax and semantics
Indentation
Statements and control flow
Expressions
Methods
Typing
Mathematics
Libraries
Development environments
Implementations
Reference implementation
Other implementations
Unsupported implementations
Cross-compilers to other languages
Performance
Development
Naming
API documentation generators
Uses
Languages influenced by Python
See also
References
Sources
Further reading
External links


In [None]:
# Let's solve task #2, get a list of languages:

In [3]:
lang_lst = []
for language in soup.find_all('a', class_='interlanguage-link-target'):
    lang_lst.append(language.text)
print(lang_lst)

['Afrikaans', 'Alemannisch', 'العربية', 'Aragonés', 'অসমীয়া', 'Asturianu', 'Azərbaycanca', 'تۆرکجه', 'বাংলা', 'Bân-lâm-gú', 'Беларуская', 'Български', 'Bosanski', 'Català', 'Cebuano', 'Čeština', 'Cymraeg', 'Dansk', 'Deutsch', 'Eesti', 'Ελληνικά', 'Español', 'Esperanto', 'Euskara', 'فارسی', 'Français', 'Galego', 'ગુજરાતી', '한국어', 'Հայերեն', 'हिन्दी', 'Hrvatski', 'Bahasa Indonesia', 'Interlingua', 'Íslenska', 'Italiano', 'עברית', 'ქართული', 'Қазақша', 'Кыргызча', 'Latina', 'Latviešu', 'Lietuvių', 'La .lojban.', 'Lumbaart', 'Magyar', 'Македонски', 'മലയാളം', 'मराठी', 'Bahasa Melayu', 'Монгол', 'မြန်မာဘာသာ', 'Nederlands', 'नेपाली', '日本語', 'Norsk', 'Norsk nynorsk', 'ଓଡ଼ିଆ', 'Oʻzbekcha/ўзбекча', 'پنجابی', 'ភាសាខ្មែរ', 'Plattdüütsch', 'Polski', 'Português', 'Română', 'Русский', 'Scots', 'Shqip', 'සිංහල', 'Simple English', 'Slovenčina', 'Slovenščina', 'کوردی', 'Српски / srpski', 'Srpskohrvatski / српскохрватски', 'ၽႃႇသႃႇတႆး ', 'Suomi', 'Svenska', 'Tagalog', 'தமிழ்', 'Татарча/tatarça', 'తెలుగు'

## Now the real exercise!

We'll be scraping a different URL: https://pplware.sapo.pt/microsoft/e-verdade-o-notepad-do-windows-recebeu-uma-atualizacao/

Now the goal is:

* Get all the comments made on the post, **but not the replies**
    * Get the author's name
    * Get the comment text
* Save the output to a CSV file with 3 columns: *Comment #*, *Author* & *Message*

***

## Solution

In [4]:
from bs4 import BeautifulSoup
import requests
import csv
from unidecode import unidecode # you might have to install this package: conda install unidecode

url = 'https://pplware.sapo.pt/microsoft/e-verdade-o-notepad-do-windows-recebeu-uma-atualizacao/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

# The with block closes the file even if parsing fails midway
with open('comments.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Comment #', 'Author', 'Message'])

    # Top-level comments (not replies) are the <li> elements with class depth-1
    # start=1 so that the first comment is numbered 1 rather than 0
    for number, comment in enumerate(soup.find_all('li', class_='depth-1'), start=1):
        body = comment.div
        author = body.find('span', class_='fn').text
        msg = body.find('p').text  # first paragraph of the comment
        # unidecode transliterates accented characters to plain ASCII
        csv_writer.writerow([number, unidecode(author), unidecode(msg)])
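
`unidecode` replaces accented characters with plain ASCII. If you would rather keep the original characters, one option is to write the file as UTF-8 instead. The sketch below uses invented sample rows (the names and messages are made up for illustration, not scraped from the page) and writes to a temporary directory:

```python
import csv
import os
import tempfile

# Invented sample rows, standing in for scraped comments.
rows = [(1, "João", "Olá, ótimo artigo!"), (2, "Inês", "Não concordo.")]

path = os.path.join(tempfile.mkdtemp(), "comments_utf8.csv")

# newline="" lets the csv module control line endings; encoding keeps accents.
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Comment #", "Author", "Message"])
    writer.writerows(rows)

# Read the file back to confirm the text survived the round trip.
with open(path, newline="", encoding="utf-8") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back[1])
```

Note that the csv module returns every field as a string when reading, so the comment number comes back as `'1'`, not `1`.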