## DEMO: Scraping Webpages and HTML Parsing with the BeautifulSoup Library  
Accessing the HTML content of a webpage on the internet and extracting useful information/data from it is called **web scraping** or **web harvesting**.

Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library [**requests**](https://requests.readthedocs.io/en/latest/).  

2. Once we have the HTML content, it must be parsed. Because HTML is nested, plain string processing is not a reliable way to extract data; a parser is needed that builds a nested/tree structure from the markup. Several HTML parser libraries are available. [**html5lib**](https://github.com/html5lib/html5lib-python) is a lenient one that parses pages the way a web browser does, while the code cells in this demo pass Python's built-in `'html.parser'` to Beautiful Soup.  

3. Now all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, [**Beautiful Soup**](https://www.crummy.com/software/BeautifulSoup/). It is a Python library for pulling data out of HTML and XML files.
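
The three steps above can be sketched without touching the network. This is a minimal illustration, not the code used later in this demo: the HTML string, the class name `post`, and the example paths are made up and stand in for a real server response.

```python
from bs4 import BeautifulSoup

# Step 1 stand-in: a hard-coded HTML string instead of a live HTTP response.
html = """
<html><body>
  <div class="post"><h4>First post</h4><a href="/blog/2023/01/first/">read</a></div>
  <div class="post"><h4>Second post</h4><a href="/blog/2023/02/second/">read</a></div>
</body></html>
"""

# Step 2: parse the markup into a tree (Python's built-in parser).
soup = BeautifulSoup(html, "html.parser")

# Step 3: search the tree for the elements of interest.
titles = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(titles)
print(links)
```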

---
Sources:
+ [Implementing Webscraping in Python with BeautifulSoup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)

---
History:  
+ V1 29.09.2022 by daniel benninger


### Installing the required Third-Party Libraries

In [None]:
!pip install requests
!pip install html5lib
!pip install beautifulsoup4

## Accessing the HTML Content from a Webpage  

+ First of all import the *requests* library.  
+ Then, *specify the URL* of the webpage you want to scrape.  
+ *Send an HTTP request* to the specified URL and save the response from the server in a response object called *response*.
+ Now print *response.content* to get the raw HTML content of the webpage. It is of type `bytes`; *response.text* holds the same content decoded to a string.

In [18]:
import requests
URL = 'https://www.analyticsvidhya.com/blog-archive/'
response = requests.get(URL)
print(response.content)
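
The request step can be made a little more defensive. The following is a minimal sketch, not the notebook's own code: `fetch_html` is a hypothetical helper name and the 10-second timeout is an assumed value. It calls `raise_for_status()` so that HTTP error responses raise an exception instead of being parsed as if they were the page, and it returns `response.text` (a decoded string) rather than the raw bytes in `response.content`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`; raise on HTTP error status codes."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(URL)` with the `URL` defined above would return the page as a string that can be passed directly to `BeautifulSoup`.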




In [None]:
from bs4 import BeautifulSoup
results = BeautifulSoup(response.content, 'html.parser')
print(results.contents)

+ Find entries by *HTML class name* (**class_**= ...)

In [None]:
blogbeitraege = results.find_all('div', class_='list-card-content')

for blogbeitrag in blogbeitraege:
    print(blogbeitrag, end='\n'*2)

+ Find title entries by *HTML heading* (**h4**)

In [21]:
for blogbeitrag in blogbeitraege:
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)

<h4>How to Become a Research Analyst? Description, Skills, and Salary</h4>
<h4>What is Data Redundancy? Benefits, Drawbacks and Tips</h4>
<h4>Python in Excel: Opening the Door to Advanced Data Analytics</h4>
<h4>Harness the Power of LLMs: Zero-shot and Few-shot Prompting</h4>
<h4>Exploring Advanced Generative AI | Conditional VAEs</h4>
<h4>Adversarial Autoencoders: Bridging the Gap Between Autoencoders and GANs</h4>
<h4>Building and Training Large Language Models for Code: A Deep Dive into StarCoder</h4>
<h4>Building A Model From Scratch to Generate Text From Prompts</h4>
<h4>Google Launches AI-Powered Search in India | Learn How to Use It</h4>
<h4>Generative AI’s Shift From GPT-3.5 to GPT-4 Journey</h4>
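
The output above still contains the surrounding `<h4>` tags. As a small sketch of how the plain title text could be pulled out instead, the example below runs on a made-up HTML fragment (the titles and the `sidebar` class are invented for illustration; only the `list-card-content` class comes from the cells above). It also guards against entries without an `<h4>`, for which `find` returns `None`.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure searched for above.
html = """
<div class="list-card-content"><h4>Example title A</h4></div>
<div class="sidebar"><h4>Not a blog entry</h4></div>
<div class="list-card-content"><h4> Example title B </h4></div>
<div class="list-card-content"><p>No heading here</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for card in soup.find_all("div", class_="list-card-content"):
    heading = card.find("h4")  # None if the entry has no <h4>
    if heading is not None:
        # get_text(strip=True) drops the tags and surrounding whitespace.
        titles.append(heading.get_text(strip=True))

print(titles)
```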


## Scanning Multiple Webpages and Extracting Titles/Links  

+ First, *redirect (standard) output* to a local file.  
+ Then, *specify the URL* and the *number of pages* you want to scrape.  
Note: Select a blog category tag (e.g. *data-science*).  
+ Scan each page for blog entries whose link contains a year of the form 20xx.
+ Extract the title and link of the matching blog entries.
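
As a quick check of the page URLs involved, the sketch below only builds the list of addresses and makes no requests. It assumes that pages are numbered from 1 and that all `NO_of_PAGES` pages should be visited.

```python
NO_of_PAGES = 40
BLOG_URL = 'https://www.analyticsvidhya.com/blog/tag/data-science/page/'

# range() excludes its stop value, so add 1 to include the last page.
page_urls = [BLOG_URL + str(page) for page in range(1, NO_of_PAGES + 1)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```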

In [22]:
# Python library re for "regular expressions" (https://docs.python.org/3/library/re.html)
import re
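
The year filter used further down can be tried in isolation. This is a small sketch on invented example URLs (the `example.com` addresses are placeholders, not real blog links). Note that the pattern matches "20" followed by any two digits anywhere in the string, so it would also accept a URL that merely contains a number such as 2048; a stricter pattern like `/20\d{2}/` could be used if that matters.

```python
import re

# "20" followed by exactly two more digits, anywhere in the string.
YEAR_IN_URL = re.compile(r"20\d{2}")

hrefs = [
    "https://example.com/blog/2023/09/some-post/",
    "https://example.com/blog/author/jane/",
    "https://example.com/blog/category/data-science/",
    "https://example.com/blog/2019/01/older-post/",
]

# Keep only links that look like dated blog posts.
dated = [href for href in hrefs if YEAR_IN_URL.search(href)]

print(dated)
```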

In [54]:
# redirect standard output
import sys
old_stdout = sys.stdout
sys.stdout = open('bloglist-datascience.html', 'w')

In [55]:
#scan all necessary blog pages
NO_of_PAGES = 40
BLOG_URL = 'https://www.analyticsvidhya.com/blog/tag/data-science/page/'

for iPage in range(1, NO_of_PAGES + 1):  # pages 1..NO_of_PAGES inclusive
  URL =  BLOG_URL + str(iPage)

  website = requests.get(URL)
  results = BeautifulSoup(website.content, 'html.parser')


  blogbeitraege = results.find_all('div', class_='list-card-content')

  for blogbeitrag in blogbeitraege:
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)

    for link in blogbeitrag.find_all('a', href=True):
      # keep only links whose URL contains a year (20xx); ignore other links
      if re.search(r"20\d{2}", link['href']):
        print(link['href'])
        # write a well-formed anchor tag with a quoted href
        print(f'<a href="{link["href"]}">weblink</a>')


In [56]:
# close the output file, then restore standard output
sys.stdout.close()
sys.stdout = old_stdout
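
An alternative to swapping `sys.stdout` by hand is the standard library's `contextlib.redirect_stdout`, which restores the original stream automatically, even if an exception occurs inside the block. The following is a minimal sketch under that approach; `write_links` is a hypothetical helper name and the UTF-8 encoding is an assumed choice.

```python
import contextlib


def write_links(path, hrefs):
    """Write one HTML anchor per link to `path` using redirected print()."""
    # Both the file and the redirection are undone when the block exits.
    with open(path, "w", encoding="utf-8") as out, contextlib.redirect_stdout(out):
        for href in hrefs:
            print(f'<a href="{href}">weblink</a>')
```

For example, `write_links('bloglist-datascience.html', links)` would write the collected links to the same file name used above, where `links` is assumed to be a list of URL strings.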