## Web Scraping and HTML Parsing with BeautifulSoup and Python  
Accessing the HTML content of a webpage on the internet and extracting useful information from it is called **web scraping** or **web harvesting**.

Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use the third-party HTTP library **requests**.  

2. Once we have the HTML content, the next task is parsing it. Because HTML is nested, data cannot be extracted reliably through simple string processing. We need a parser that builds a nested tree structure from the HTML. Several parsers are available: Python's built-in **html.parser** (used in the examples below), **lxml**, and **html5lib**, which parses pages the way a web browser does and is the most lenient with malformed HTML.  

3. Finally, we navigate and search the parse tree that was created, i.e. tree traversal. For this task, we will use another third-party Python library, **Beautiful Soup**, which pulls data out of HTML and XML files.
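
Steps 2 and 3 can be tried without any network access. The following is a minimal sketch, assuming an invented HTML snippet (`html_doc` and its contents are made up for illustration) and Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html_doc = """
<html>
  <body>
    <div class="post"><h4>First title</h4><a href="/blog/2022/10/first/">read</a></div>
    <div class="post"><h4>Second title</h4><a href="/blog/2022/10/second/">read</a></div>
  </body>
</html>
"""

# Step 2: build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Step 3: navigate and search the tree
titles = [h4.get_text() for h4 in soup.find_all("h4")]
first_link = soup.find("a")["href"]

print(titles)
print(first_link)
```

Here `find_all` returns every matching tag in document order, while `find` returns only the first match.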

--- 
Sources: [Implementing Webscraping in Python with BeautifulSoup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)

--- 
+ V1 29.09.2022 by daniel benninger - initial version for KETE HS22


### Step 0: Installing the required third-party libraries

In [1]:
!pip install requests
!pip install html5lib
!pip install bs4



## Case A: Accessing the HTML content of a webpage  

+ First, import the *requests* library.  
+ Then, *specify the URL* of the webpage you want to scrape.  
+ *Send an HTTP request* to the specified URL and save the server's response in a response object called *response*.
+ Now, print *response.content* to get the raw HTML content of the webpage. It is of type `bytes`; the decoded string is available as *response.text*.

In [21]:
import requests
URL = 'https://www.analyticsvidhya.com/blog-archive/'
response = requests.get(URL)
print(response.content)
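
The difference between raw bytes and decoded text can be seen without a network request. A minimal sketch, assuming a hand-made byte string `raw` that stands in for `response.content`:

```python
# Hypothetical stand-in for response.content (raw bytes as sent by a server)
raw = "<p>Grüße aus Zürich</p>".encode("utf-8")

# Decoding with the correct encoding yields a str, comparable to response.text
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(str(raw)[:2])         # str() on bytes gives the repr, which starts with b'
```

Calling `str()` on bytes stores the `b'...'` representation with escaped characters, so decoded text is usually the better choice when saving a page to a text file.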




In [22]:
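# Write the decoded HTML (response.text) instead of str(response.content),
# which would store the bytes repr (b'...') rather than the page itself
with open("1results.txt", "w", encoding="utf-8") as out_file:
    out_file.write(response.text)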

In [None]:
import requests 
from bs4 import BeautifulSoup

URL = 'https://www.analyticsvidhya.com/blog-archive/'
response = requests.get(URL)
results = BeautifulSoup(response.content, 'html.parser')

print(results)

In [24]:
# The context manager closes the file even if writing fails
with open("2results.txt", "w", encoding="utf-8") as out_file:
    out_file.write(str(results))

## Case B: Parsing the HTML content of a webpage  

+ Find elements (e.g. *blog articles*) by specific **HTML class names** (e.g. *list-card-content*)

In [None]:
blogbeitraege = results.find_all('div', class_='list-card-content')

for blogbeitrag in blogbeitraege: 
    print(blogbeitrag, end='\n'*2)
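
How class-based lookups behave can be checked on a small snippet. A sketch, assuming invented markup that merely reuses the class name `list-card-content`; `select` with a CSS selector is shown as an equivalent alternative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name is taken from the example above
html_doc = """
<div class="list-card-content featured"><h4>Post A</h4></div>
<div class="sidebar"><h4>Not a post</h4></div>
<div class="list-card-content"><h4>Post B</h4></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches any element whose class list contains the given name
by_class = soup.find_all("div", class_="list-card-content")

# The same query expressed as a CSS selector
by_css = soup.select("div.list-card-content")

print(len(by_class), len(by_css))
```

Both queries also match the first `div`, because it carries `list-card-content` as one of several classes.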

---  
+ Find text blocks (e.g. *blog article title*) by specific **HTML header tags** (e.g. *h4*)

In [17]:
for blogbeitrag in blogbeitraege: 
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)

<h4>Book your Seats now for Upcoming DataHour Sessions!</h4>
<h4>Key Components and Challenges of Data Lakes</h4>
<h4>Demystifying NoSQL: Your Complete Interview Guide</h4>
<h4>CheXzero: Detect Pathologies From Unannotated X-ray Images</h4>
<h4>Apache Kafka Use Cases and Installation Guide</h4>
<h4>Informed Search Strategies for State Space Search Solving</h4>
<h4>AI: Grand Challenges and Emerging Economies</h4>
<h4>Is MLOps Another Redundant Terminology?</h4>
<h4>Using MongoDB with Pandas, NumPy, and PyArrow</h4>
<h4>Sentiment Analysis Using VADER</h4>
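
Printing the tag includes the surrounding `<h4>` markup, and `find` returns `None` when a card has no heading. A sketch of extracting only the text, assuming invented markup and a hypothetical helper `title_of`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one card with a heading, one without
html_doc = """
<div class="list-card-content"><h4>  Sentiment Analysis Using VADER </h4></div>
<div class="list-card-content"><p>No heading here</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def title_of(card):
    """Return the stripped h4 text of a card, or None if it has no h4."""
    heading = card.find("h4")
    return heading.get_text(strip=True) if heading is not None else None


titles = [title_of(card) for card in soup.find_all("div", class_="list-card-content")]
print(titles)
```

Checking for `None` before calling `get_text` avoids an `AttributeError` on cards that lack the expected tag.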


---  
+ Find links (e.g. *blog article link/URL*) by the **href attribute** of HTML *a* tags

In [19]:
import re

for blogbeitrag in blogbeitraege: 
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)
    
    for link in blogbeitrag.find_all('a', href=True):
        # Keep only links containing a year (20xx), i.e. article URLs
        if re.search(r"20\d{2}", link['href']):
            print(link['href'])

<h4>Book your Seats now for Upcoming DataHour Sessions!</h4>
https://www.analyticsvidhya.com/blog/2022/10/book-your-seats-now-for-upcoming-datahour-sessions-2/
<h4>Key Components and Challenges of Data Lakes</h4>
https://www.analyticsvidhya.com/blog/2022/10/key-components-and-challenges-of-data-lakes/
<h4>Demystifying NoSQL: Your Complete Interview Guide</h4>
https://www.analyticsvidhya.com/blog/2022/10/demystifying-nosql-your-complete-interview-guide/
<h4>CheXzero: Detect Pathologies From Unannotated X-ray Images</h4>
https://www.analyticsvidhya.com/blog/2022/10/chexzero-detect-pathologies-from-unannotated-x-ray-images/
<h4>Apache Kafka Use Cases and Installation Guide</h4>
https://www.analyticsvidhya.com/blog/2022/10/apache-kafka-use-cases-and-installation-guide/
<h4>Informed Search Strategies for State Space Search Solving</h4>
https://www.analyticsvidhya.com/blog/2022/10/informed-search-strategies-for-state-space-search-solving/
<h4>AI: Grand Challenges and Emerging Economies</h4>
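
The `href` values can also be converted to absolute URLs and filtered with a stricter pattern. A sketch, assuming invented markup, an assumed base URL, and a hypothetical pattern that requires `/20xx/` as a full path segment:

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.analyticsvidhya.com/"  # assumed base for relative links
YEAR_SEGMENT = re.compile(r"/20\d{2}/")        # hypothetical stricter filter

# Hypothetical markup with an article link, an author link, and a bare anchor
html_doc = """
<div class="list-card-content">
  <a href="/blog/2022/10/example-article/">article</a>
  <a href="/blog/author/someone/">author</a>
  <a>anchor without href</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

article_links = [
    urljoin(BASE_URL, a["href"])
    for a in soup.find_all("a", href=True)  # href=True skips <a> tags without href
    if YEAR_SEGMENT.search(a["href"])
]
print(article_links)
```

Anchoring the year between slashes reduces false matches on slugs that merely contain a number such as `2048`.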


## Case C: Scraping multiple webpages and Parsing for Text and Links  

---  
+ Scan multiple webpages (e.g. *24*) of a whole blog archive (e.g. **nlp** *tagged blog articles*) and extract the corresponding titles and links 

In [None]:
import re

for iPage in range(1, 25):  # pages 1 to 24; the end of range() is exclusive
  URL = 'https://www.analyticsvidhya.com/blog/tag/nlp/page/' + str(iPage)
  #print(URL)

  website = requests.get(URL)

  results = BeautifulSoup(website.content, 'html.parser')
  # print(results)

  blogbeitraege = results.find_all('div', class_='list-card-content')

  for blogbeitrag in blogbeitraege: 
    # print(blogbeitrag)
    blog_titel = blogbeitrag.find('h4')
    print(blog_titel)
      
    for link in blogbeitrag.find_all('a', href=True):
      # print ("Found the URL:", link['href'])
      if re.search("20\\d{2}", link['href']): 
        print (link['href'])
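
A loop over many pages benefits from a timeout, a status check, and a pause between requests. The following sketch separates fetching from parsing so the parsing part can be tried offline; `fetch_html`, `parse_cards`, `scrape_pages`, and `fake_fetch` are hypothetical names, and the one-second delay is an assumption, not a requirement of the site:

```python
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://www.analyticsvidhya.com/blog/tag/nlp/page/"


def fetch_html(url):
    """Download one page; raise an exception on HTTP errors or timeouts."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_cards(html):
    """Return the h4 texts of all list-card-content blocks in one page."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", class_="list-card-content")
    return [c.find("h4").get_text(strip=True) for c in cards if c.find("h4")]


def scrape_pages(pages, fetch=fetch_html, delay=1.0):
    """Collect titles from several pages, pausing between requests."""
    titles = []
    for page in pages:
        titles.extend(parse_cards(fetch(BASE + str(page))))
        time.sleep(delay)
    return titles


# Offline trial with a fake fetcher instead of real HTTP requests
def fake_fetch(url):
    page = url.rsplit("/", 1)[-1]
    return f'<div class="list-card-content"><h4>Title {page}</h4></div>'


print(scrape_pages(range(1, 4), fetch=fake_fetch, delay=0))
```

Passing the fetch function as a parameter keeps the parsing logic testable without contacting the server; a real run would call `scrape_pages(range(1, 25))`.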

In [29]:
# Collect titles and links from all archive pages in one text file
with open("3results.txt", "w", encoding="utf-8") as out_file:
    for iPage in range(1, 25):  # pages 1 to 24
        URL = 'https://www.analyticsvidhya.com/blog/tag/nlp/page/' + str(iPage)
        out_file.write("\n" + URL + "\n\n\n")

        website = requests.get(URL)
        results = BeautifulSoup(website.content, 'html.parser')

        blogbeitraege = results.find_all('div', class_='list-card-content')

        for blogbeitrag in blogbeitraege:
            blog_titel = blogbeitrag.find('h4')
            out_file.write(str(blog_titel) + "\n")

            # Keep only links containing a year (20xx), i.e. article URLs
            for link in blogbeitrag.find_all('a', href=True):
                if re.search(r"20\d{2}", link['href']):
                    out_file.write(link['href'] + "\n")
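
Plain text output is hard to process later; a structured format such as CSV keeps each title together with its link. A sketch using the standard `csv` module, assuming invented rows and writing to an in-memory buffer (a real run would pass a file opened with `newline=''` instead):

```python
import csv
import io

# Hypothetical scraped records (title, link)
rows = [
    ("Example title one", "https://example.com/blog/2022/10/one/"),
    ("Title with, a comma", "https://example.com/blog/2022/10/two/"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "link"])  # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original records, comma included
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The `csv` module quotes fields that contain commas, so titles with punctuation survive a round trip.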