
##Large text analysis using Clustering

A markup language is a system for annotating documents
that distinguishes the annotations from the document text. 

In the case of HTML, these annotations are instructions on how to visualize a web page.

Web page visualization is usually carried out using a web browser.

Of course, during large-scale data analysis, we don’t need to render every page.
Computers can process document texts without requiring any visualization. Thus,
when analyzing HTML documents, we can focus on the text while skipping over the
display instructions.

Separating the text from the display instructions requires a basic knowledge of HTML structure, so that knowledge is essential for online text analysis.

With this in mind, we begin this section by reviewing the HTML structure. Then
we learn how to parse that structure using Python libraries.
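
To make the idea of keeping the text while skipping the display instructions concrete, here is a minimal sketch that assumes only Python's standard `html.parser` module. The `TextExtractor` class, the `extract_text` function, and the `sample` string are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text between tags and ignores the tags themselves (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; skip whitespace-only pieces.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html_string):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_string)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative input, not taken from a real web page
sample = "<html><body><h1>Data Science is Fun</h1><p>Hello</p></body></html>"
print(extract_text(sample))  # Data Science is Fun Hello
```

This sketch keeps every piece of text it encounters, including the contents of script or style elements, so a real pipeline would need additional filtering.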

##Setup

In [None]:
!pip install bs4

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
from collections import defaultdict
from collections import Counter
import time
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup as bs

import seaborn as sns
import matplotlib.pyplot as plt
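# display and HTML are part of the public IPython.display module;
# importing display from IPython.core.display is deprecated
from IPython.display import display, HTML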

##HTML document structure

Let's explore some common HTML tags by building an HTML string step by step and rendering it.

In [4]:
# Rendering an HTML string
def render(html_contents):
  display(HTML(html_contents))

html_contents = "<html>Hello</html>"
render(html_contents)

In [5]:
# Defining a title in HTML
title = "<title>Data Science is Fun</title>"
html_contents = f"<html>{title}Hello</html>"
render(html_contents)

In [8]:
# Adding a head and body to the HTML string
head = f"<head>{title}</head>"
body = "<body>Hello</body>"
html_contents = f"<html> {head} {body} </html>"
render(html_contents)

In [12]:
# Adding a header to the HTML string
header = "<h1>Data Science is Fun</h1>"
body = f"<body>{header}Hello</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

Let’s add two consecutive paragraphs to our HTML.

In [13]:
# Adding paragraphs to the HTML string
paragraphs = ""
for i in range(2):
  paragraph_string = f"paragraphs {i} " * 40
  paragraphs += f"<p>{paragraph_string}</p>"

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

In [14]:
# Adding id attributes to the paragraphs
paragraphs = ""
for i in range(2):
  paragraph_string = f"paragraphs {i} " * 40
  attribute = f"id='paragraphs {i}'"
  paragraphs += f"<p {attribute}>{paragraph_string}</p>"

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)
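
The ids above contain a space, which browsers tolerate but which the HTML standard does not allow in an `id` value. As a hedged alternative, the sketch below builds the same kind of paragraphs with hyphenated ids. The `make_paragraph` helper, its `repeats` parameter, and the `paragraph-{i}` naming scheme are assumptions made for illustration, not part of the original notebook.

```python
def make_paragraph(index, repeats=3):
    """Build one <p> element with a whitespace-free id (hypothetical helper)."""
    text = (f"Paragraph {index} " * repeats).strip()
    element_id = f"paragraph-{index}"  # a hyphen instead of a space
    return f"<p id='{element_id}'>{text}</p>"


# Join two generated paragraphs into one HTML fragment
paragraphs_sketch = "".join(make_paragraph(i) for i in range(2))
print(paragraphs_sketch)
```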

Let's create a hyperlink that reads Data Science Bookcamp and links that text to the book's page on the Manning website.

In [16]:
link_text = "Data Science Bookcamp"
url = "https://www.manning.com/books/data-science-bookcamp"
hyperlink = f"<a href='{url}'>{link_text}</a>"
new_paragraph = f"<p id='paragraphs 2'>Here is a link to {hyperlink}</p>"
paragraphs += new_paragraph

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)
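
During text analysis, we often care about where a hyperlink points rather than how it is displayed. The sketch below reads the `href` attribute back out of an anchor tag using the standard `html.parser` module. The `LinkCollector` class and the `link_sample` string are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag it encounters (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


link_sample = (
    "<p>Here is a link to "
    "<a href='https://www.manning.com/books/data-science-bookcamp'>"
    "Data Science Bookcamp</a></p>"
)
collector = LinkCollector()
collector.feed(link_sample)
print(collector.links)
```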

Beyond just headers and paragraphs,
we can also visualize lists of texts in an HTML document.

In [17]:
# Adding an unordered list (the <ul> tag) to the HTML string
libraries = ['NumPy', 'SciPy', 'Pandas', 'Scikit-Learn']
items = ""
for library in libraries:
  items += f"<li>{library}</li>"

unstructured_list = f"<ul>{items}</ul>"
header2 = "<h2>Common Data Science Libraries</h2>"
body = f"<body>{header}{paragraphs}{header2}{unstructured_list}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)
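
The loop above can also be written as a small reusable function. This is a sketch rather than the notebook's own method: the `to_unordered_list` name is hypothetical, and escaping each item with `html.escape` is an added assumption so that characters such as `<` or `&` inside an item do not get interpreted as markup.

```python
from html import escape


def to_unordered_list(items):
    """Wrap each item in <li> tags and the whole sequence in <ul> tags."""
    list_items = "".join(f"<li>{escape(item)}</li>" for item in items)
    return f"<ul>{list_items}</ul>"


print(to_unordered_list(['NumPy', 'SciPy', 'Pandas', 'Scikit-Learn']))
# <ul><li>NumPy</li><li>SciPy</li><li>Pandas</li><li>Scikit-Learn</li></ul>
```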

In [18]:
# Adding divisions to the HTML string
div1 = f"<div id='paragraphs' class='text'>{paragraphs}</div>"
div2 = f"<div id='list' class='text'>{header2}{unstructured_list}</div>"
div3 = "<div id='empty' class='empty'></div>"

body = f"<body>{header}{div1}{div2}{div3}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)
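
Each division carries an id that is unique to it and a class that it can share with other divisions. The sketch below illustrates that distinction by grouping division ids under their class, again assuming only the standard `html.parser` module. The `DivIndexer` class and the `div_sample` string are hypothetical names introduced for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DivIndexer(HTMLParser):
    """Groups the ids of <div> tags by their class attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids_by_class = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attributes = dict(attrs)
            self.ids_by_class[attributes.get("class")].append(attributes.get("id"))


# Empty divisions that mirror the id and class values used above
div_sample = (
    "<div id='paragraphs' class='text'></div>"
    "<div id='list' class='text'></div>"
    "<div id='empty' class='empty'></div>"
)
indexer = DivIndexer()
indexer.feed(div_sample)
print(dict(indexer.ids_by_class))
# {'text': ['paragraphs', 'list'], 'empty': ['empty']}
```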

In [19]:
# Printing the altered HTML string
print(html_contents)

<html> <title>Data Science is Fun</title> <body><h1>Data Science is Fun</h1><div id='paragraphs' class='text'><p id='paragraphs 0'>paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 </p><p id='paragraphs 1'>paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1

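Suppose we want to extract the title from the HTML string using basic Python. First, we’d split the string on the > bracket and iterate over the pieces until we reach one that ends with <title.

Next, we’d go one index over and extract the string containing the title’s text.

Finally, we’d clean the title string by splitting on the remaining < bracket and keeping the first piece.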

In [20]:
# Extracting the HTML title using basic Python
split_contents = html_contents.split(">")
for i, substring in enumerate(split_contents):
  if substring.endswith("<title"):
    next_string = split_contents[i + 1]
    title = next_string.split("<")[0]
    print(title)
    break

Data Science is Fun
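
This manual approach works here, but it is brittle. The sketch below wraps the same logic in a function and feeds it a title tag that carries an attribute. The `extract_title_naive` name and both sample strings are hypothetical, introduced only to illustrate the failure mode.

```python
def extract_title_naive(html_string):
    """Return the title text using the split-on-'>' approach, or None if not found."""
    split_contents = html_string.split(">")
    for i, substring in enumerate(split_contents):
        if substring.endswith("<title"):
            return split_contents[i + 1].split("<")[0]
    return None


plain = "<html><title>Data Science is Fun</title></html>"
with_attribute = "<html><title lang='en'>Data Science is Fun</title></html>"

print(extract_title_naive(plain))           # Data Science is Fun
print(extract_title_naive(with_attribute))  # None
```

With the attribute present, no piece ends with <title, so the title is missed entirely.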


Is there a cleaner way to extract elements from HTML documents? 

Yes! We don’t
need to manually parse the documents. 

Instead, we can use the external **Beautiful
Soup** library.

##Parsing HTML