## DEMO: Scraping Webpages and HTML Parsing with the BeautifulSoup Library  
Accessing the HTML content of a webpage on the internet and extracting useful information/data from it is called **web scraping** or **web harvesting**.

Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library [**requests**](https://requests.readthedocs.io/en/latest/).  

2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract data simply through string processing. One needs a parser which can create a nested/tree structure of the HTML data. There are many HTML parser libraries available; [**html5lib**](https://github.com/html5lib/html5lib-python) is a particularly lenient one that parses pages the same way a web browser does. The code cells in this demo use Python's built-in `html.parser` instead.  

3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, [**Beautiful Soup**](https://www.crummy.com/software/BeautifulSoup/). It is a Python library for pulling data out of HTML and XML files.
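
To illustrate why step 2 calls for a real parser, here is a minimal sketch that uses only Python's standard-library `html.parser.HTMLParser` (not Beautiful Soup) to record how deeply each tag is nested. The class name `DepthTracker` and the sample HTML string are made up for this illustration, and the depth counting assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class DepthTracker(HTMLParser):
    """Record (depth, tag) for every opening tag in the document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []  # list of (nesting depth, tag name)

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical sample markup, not taken from a real page
sample_html = "<div><a href='/post'><h4>Title</h4></a><span>Tag</span></div>"

tracker = DepthTracker()
tracker.feed(sample_html)
tracker.close()

# Print the tags indented by their nesting depth
for depth, tag in tracker.seen:
    print("  " * depth + tag)
```

The indented output makes the tree structure visible that plain string searching would miss. Beautiful Soup builds such a tree for you and adds search methods on top.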

---
Sources:
+ [Implementing Webscraping in Python with BeautifulSoup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)

---
History:  
+ V1 29.09.2022 by daniel benninger


### Installing the required Third-Party Libraries

In [None]:
!pip install requests
!pip install html5lib
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=81831b66b42803b9ab00c49d4c871e15763b7cbc8c19be09119feab4f5abdff0
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


## Accessing the HTML Content from a Webpage  

+ First of all, import the *requests* library.  
+ Then, *specify the URL* of the webpage you want to scrape.  
+ *Send an HTTP request* to the specified URL and save the response from the server in a response object called *response*.
+ Now, print *response.content* to get the raw HTML content of the webpage. It is of type `bytes`; the decoded string is available as *response.text*.
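
In requests, `response.content` holds bytes, while `response.text` holds the decoded string. A minimal sketch of the difference, using a hand-made byte string (`raw`, a hypothetical stand-in) in place of a real response so that it runs without network access:

```python
# Stand-in for response.content: HTML encoded as UTF-8 bytes (hypothetical sample)
raw = "<p>Grüezi</p>".encode("utf-8")
print(type(raw))  # bytes, like response.content

# Decoding gives a str, roughly what response.text provides for a UTF-8 page
html = raw.decode("utf-8")
print(type(html))  # str

# Byte count and character count differ as soon as non-ASCII text is involved
print(len(raw), len(html))
```

Beautiful Soup accepts either form; when given bytes, it tries to detect the encoding itself.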

In [None]:
import requests
URL = 'https://www.analyticsvidhya.com/blog-archive/'
response = requests.get(URL)
print(response.content)




In [None]:
from bs4 import BeautifulSoup
results = BeautifulSoup(response.content, 'html.parser')
print(results.contents)

['html', '\n', <html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="address=no" name="format-detection"/>
<link href="https://www.analyticsvidhya.com" rel="preconnect"/>
<link href="https://www.analyticsvidhya.com/blog" rel="preconnect"/>
<link href="https://cdnjs.cloudflare.com" rel="preconnect"/>
<link href="https://stackpath.bootstrapcdn.com" rel="preconnect"/>
<link href="https://code.jquery.com/" rel="preconnect"/>
<link href="https://cdn.jsdelivr.net" rel="preconnect"/>
<link href="https://www.analyticsvidhya.com/wp-content/uploads/2015/02/logo_square_v2.jpg" id="favicon" rel="icon" type="image/x-icon"/>
<link href="https://www.analyticsvidhya.com/wp-content/uploads/2015/02/logo_square_v2.jpg" rel="apple-touch-icon"/>
<link href="https://www.analyticsvidhya.com/wp-content/uploads/2015/02/logo_square_v2.jpg" rel="apple-touch-icon"

+ Find entries by *HTML class name* (**class_**= ...)

In [None]:
blogbeitraege = results.find_all('div', class_='list-card-content')

for blogbeitrag in blogbeitraege:
    print(blogbeitrag, end='\n'*2)

<div class="list-card-content">
<a href="https://www.analyticsvidhya.com/blog/2023/09/generative-ai-tools/"><h4>Top 140+ Generative AI Tools That Can Make Your Work Easy</h4></a>
<h6>
<a href="https://www.analyticsvidhya.com/blog/author/yana_khare/">Yana Khare</a>, September 26, 2023</h6>
<span><a href="https://www.analyticsvidhya.com/blog/category/artificial-intelligence/">Artificial Intelligence</a>, <a href="https://www.analyticsvidhya.com/blog/category/beginner/">Beginner</a>, <a href="https://www.analyticsvidhya.com/blog/category/generative-ai/">Generative AI</a>, <a href="https://www.analyticsvidhya.com/blog/category/image/">Image</a>, <a href="https://www.analyticsvidhya.com/blog/category/listicle/">Listicle</a>, <a href="https://www.analyticsvidhya.com/blog/category/midjourney/">Midjourney</a>, <a href="https://www.analyticsvidhya.com/blog/category/text/">Text</a>, <a href="https://www.analyticsvidhya.com/blog/category/videos/">Videos</a> </span>
</div>

<div class="list-card-c

+ Find title entries by *HTML heading* (**h4**)

In [None]:
for blogbeitrag in blogbeitraege:
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)

<h4>Top 140+ Generative AI Tools That Can Make Your Work Easy</h4>
<h4>How to Become a Data Scientist After BCom?</h4>
<h4>Advancing Forensic Science with Generative AI</h4>
<h4>How is AI Changing the Forex Market in 2023?</h4>
<h4>Enhancing Customer Surveys Feedback Analysis with Large Language Models</h4>
<h4>Top 10 Web Scraping Projects to Do in 2023</h4>
<h4>Empowering Contextual Document Retrieval: Leveraging GPT-2 and LlamaIndex</h4>
<h4>How to Become a Supply Chain Analyst in 2023?</h4>
<h4>5 Free Data Science Projects With Solutions</h4>
<h4>Data Science Curriculum for Self Study</h4>
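
The loop above prints each `<h4>` tag including its markup. If only the title text and the link are needed, Beautiful Soup's `get_text()` and attribute access can be used. A minimal sketch on a hand-written HTML snippet (the markup and URLs are invented for illustration and only mimic the `list-card-content` structure shown above):

```python
from bs4 import BeautifulSoup

# Invented sample that mimics the structure of one blog card
sample_html = """
<div class="list-card-content">
  <a href="https://example.com/blog/2023/09/first-post/"><h4>First Post</h4></a>
  <h6><a href="https://example.com/blog/author/jane/">Jane</a>, September 26, 2023</h6>
</div>
<div class="other">ignored</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

titles = []
for card in soup.find_all("div", class_="list-card-content"):
    heading = card.find("h4")
    if heading is None:  # skip cards without a title
        continue
    title = heading.get_text(strip=True)  # text only, without the <h4> tags
    anchor = card.find("a", href=True)    # first link inside the card
    link = anchor["href"] if anchor else None
    titles.append((title, link))

print(titles)
```

Collecting the results in a list of tuples, rather than printing them directly, makes it easier to sort, filter, or export them later.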


## Scanning Multiple Webpages and Extracting Titles/Links  

+ First, *redirect (standard) output* to a local file.  
+ Then, *specify the URL* and the *number of pages* you want to scrape.  
Note: Select a blog category tag (e.g. *data-science*).  
+ Scan each page for blog entries whose link contains a year from 2000 onward (20xx).  
+ Extract the title and link of the matching blog entries.
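
The year filter can be tried in isolation before scanning real pages. The sketch below applies a regular expression to a few example links (the URLs are invented for illustration) and keeps only those whose path contains a four-digit year starting with 20. The stricter pattern with surrounding slashes is an assumption about the URL layout, not something the blog guarantees.

```python
import re

# Invented example links: one post, one author page, one category page
links = [
    "https://example.com/blog/2023/09/generative-ai-tools/",
    "https://example.com/blog/author/jane/",
    "https://example.com/blog/category/beginner/",
]

# Match a path segment such as /2023/ (years 2000 to 2099)
year_pattern = re.compile(r"/20\d{2}/")

# Keep only the links that look like dated blog posts
post_links = [url for url in links if year_pattern.search(url)]
print(post_links)
```

Requiring the slashes avoids accidental matches on other digit runs in a URL, such as a numeric ID that happens to contain "20".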

In [None]:
# Python library re for "regular expressions" (https://docs.python.org/3/library/re.html)
import re

In [None]:
# redirect standard output
import sys
old_stdout = sys.stdout
sys.stdout = open('bloglist-datascience.html', 'w')
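
Reassigning `sys.stdout` by hand works, but the output stays redirected if a later cell fails before the restore step runs. As an alternative, the standard library's `contextlib.redirect_stdout` restores the stream automatically. A minimal sketch, writing to a temporary file (the file name `demo-output.html` is a placeholder):

```python
import contextlib
import os
import tempfile

# Placeholder path in a temporary directory, so the working directory stays clean
out_path = os.path.join(tempfile.mkdtemp(), "demo-output.html")

with open(out_path, "w", encoding="utf-8") as out_file:
    with contextlib.redirect_stdout(out_file):
        print("<h4>Example title</h4>")  # goes to the file, not the notebook

# At this point sys.stdout is back to normal and the file is closed
with open(out_path, encoding="utf-8") as in_file:
    written = in_file.read()

print(written)
```

With this pattern, the redirect, the scraping loop, and the restore would all live in a single cell.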

In [None]:
# Scan all necessary blog pages
NO_of_PAGES = 40
BLOG_URL = 'https://www.analyticsvidhya.com/blog/tag/data-science/page/'

# range() excludes its end value, so add 1 to include page NO_of_PAGES
for iPage in range(1, NO_of_PAGES + 1):
  URL = BLOG_URL + str(iPage)

  website = requests.get(URL)
  results = BeautifulSoup(website.content, 'html.parser')

  blogbeitraege = results.find_all('div', class_='list-card-content')

  for blogbeitrag in blogbeitraege:
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)

    for link in blogbeitrag.find_all('a', href=True):
      # Keep only links that contain a year (20xx); ignore all other links
      if re.search(r"20\d{2}", link['href']):
        print(link['href'])
        # Write a well-formed anchor tag with a quoted href attribute
        print(f'<a href="{link["href"]}">weblink</a>')


In [None]:
# close the output file, then restore standard output
sys.stdout.close()
sys.stdout = old_stdout