
##Large text analysis using Clustering

A markup language is a system for annotating documents
that distinguishes the annotations from the document text. 

In the case of HTML, these annotations are instructions on how to visualize a web page.

Web page visualization is usually carried out using a web browser.

Of course, during large-scale data analysis, we don’t need to render every page.
Computers can process document texts without requiring any visualization. Thus,
when analyzing HTML documents, we can focus on the text while skipping over the
display instructions.

Consequently, a basic knowledge of HTML structure is imperative for online text analysis.

With this in mind, we begin this section by reviewing the HTML structure. Then
we learn how to parse that structure using Python libraries.
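
As a minimal sketch of what "focusing on the text while skipping over the display instructions" can look like, the snippet below uses Python's built-in `html.parser` module rather than the library introduced later in this section. The class name `TextCollector` and the sample HTML string are assumptions made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags; the tags themselves are skipped
        if data.strip():
            self.chunks.append(data.strip())

# Made-up sample document
sample_html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"

collector = TextCollector()
collector.feed(sample_html)
collector.close()

extracted_text = " ".join(collector.chunks)
print(extracted_text)
```

The printed string contains only the words of the document; none of the markup survives.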

##Setup

In [None]:
!pip install beautifulsoup4

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [29]:
from collections import defaultdict
from collections import Counter
import time
import numpy as np
import pandas as pd

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

##HTML document structure

Let's explore some common HTML tags by building up an HTML string and rendering it step by step.

In [4]:
# Rendering an HTML string
def render(html_contents):
  display(HTML(html_contents))

html_contents = "<html>Hello</html>"
render(html_contents)

In [5]:
# Defining a title in HTML
title = "<title>Data Science is Fun</title>"
html_contents = f"<html>{title}Hello</html>"
render(html_contents)

In [6]:
# Adding a head and body to the HTML string
head = f"<head>{title}</head>"
body = "<body>Hello</body>"
html_contents = f"<html> {head} {body} </html>"
render(html_contents)

In [7]:
# Adding a header to the HTML string
header = "<h1>Data Science is Fun</h1>"
body = f"<body>{header}Hello</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

Let’s add two consecutive paragraphs to our HTML.

In [8]:
# Adding paragraphs to the HTML string
paragraphs = ""
for i in range(2):
  paragraph_string = f"paragraphs {i} " * 40
  paragraphs += f"<p>{paragraph_string}</p>"

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

In [9]:
# Adding id attributes to the paragraphs
paragraphs = ""
for i in range(2):
  paragraph_string = f"paragraphs {i} " * 40
  attribute = f"id='paragraphs {i}'"
  paragraphs += f"<p {attribute}>{paragraph_string}</p>"

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

Let's create a hyperlink that reads Data Science Bookcamp and links to the book's web page.

In [10]:
link_text = "Data Science Bookcamp"
url = "https://www.manning.com/books/data-science-bookcamp"
hyperlink = f"<a href='{url}'>{link_text}</a>"
new_paragraph = f"<p id='paragraph 2'>Here is a link to {hyperlink}</p>"
paragraphs += new_paragraph

body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

Beyond just headers and paragraphs,
we can also visualize lists of texts in an HTML document.

In [11]:
# Adding an unordered (bulleted) list to the HTML string
libraries = ['NumPy', 'SciPy', 'Pandas', 'Scikit-Learn']
items = ""
for library in libraries:
  items += f"<li>{library}</li>"

unstructured_list = f"<ul>{items}</ul>"
header2 = "<h2>Common Data Science Libraries</h2>"
body = f"<body>{header}{paragraphs}{header2}{unstructured_list}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

In [12]:
# Adding divisions to the HTML string
div1 = f"<div id='paragraphs' class='text'>{paragraphs}</div>"
div2 = f"<div id='list' class='text'>{header2}{unstructured_list}</div>"
div3 = "<div id='empty' class='empty'></div>"

body = f"<body>{header}{div1}{div2}{div3}</body>"
html_contents = f"<html> {title} {body} </html>"
render(html_contents)

In [13]:
# Printing the altered HTML string
print(html_contents)

<html> <title>Data Science is Fun</title> <body><h1>Data Science is Fun</h1><div id='paragraphs' class='text'><p id='paragraphs 0'>paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 </p><p id='paragraphs 1'>paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1

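Suppose we want to extract the title from `html_contents` using only basic Python. First, we’d split the string on the `>` bracket and iterate over the resulting substrings until we reach one that ends with `<title`. Next, we’d go one index over and extract the substring containing the title’s text.

Finally, we’d clean the title string by splitting on the remaining `<` bracket and keeping the first piece.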

In [14]:
# Extracting the HTML title using basic Python
split_contents = html_contents.split(">")
for i, substring in enumerate(split_contents):
  if substring.endswith("<title"):
    next_string = split_contents[i + 1]
    title = next_string.split("<")[0]
    print(title)
    break

Data Science is Fun
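
The split-based approach depends on the exact spelling of the opening tag. As a rough illustration, the sketch below wraps the same logic in a function (the name `extract_title_manually` and both sample strings are made up for this example) and shows that the match is lost as soon as the title tag carries an attribute.

```python
def extract_title_manually(html_string):
    """Same split-based logic as the cell above, wrapped in a function."""
    split_contents = html_string.split(">")
    for i, substring in enumerate(split_contents):
        if substring.endswith("<title"):
            return split_contents[i + 1].split("<")[0]
    return None  # No substring ended with "<title"

# Made-up inputs: the second title tag carries an attribute
plain = "<html><title>Data Science is Fun</title></html>"
with_attribute = "<html><title lang='en'>Data Science is Fun</title></html>"

print(extract_title_manually(plain))
print(extract_title_manually(with_attribute))
```

The first call finds the title, while the second returns `None` because the substring before the title text now ends with `lang='en'` rather than `<title`.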


Is there a cleaner way to extract elements from HTML documents? Yes! We don’t need to manually parse the documents. Instead, we can use the external **Beautiful Soup** library.

##Parsing HTML

We now initialize the `BeautifulSoup` class (imported as `bs` in the setup) by running `bs(html_contents)`.

In [15]:
# Printing readable HTML
soup = bs(html_contents)
print(soup.prettify())

<html>
 <head>
  <title>
   Data Science is Fun
  </title>
 </head>
 <body>
  <h1>
   Data Science is Fun
  </h1>
  <div class="text" id="paragraphs">
   <p id="paragraphs 0">
    paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0
   </p>
   <p id="paragraphs 1">
    paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 par
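
Calling `bs(html_contents)` without a second argument leaves the parser choice to Beautiful Soup, so the result can depend on which parsers are installed. The printed output above contains a `<head>` element that the input string did not have, which suggests that a parser such as lxml was picked in that environment. The sketch below names Python's built-in parser explicitly. The sample string is a shortened stand-in for `html_contents`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the notebook's html_contents
sample_html = "<html> <title>Data Science is Fun</title> <body>Hello</body> </html>"

# Naming the parser explicitly keeps the parse tree the same across machines
soup_builtin = BeautifulSoup(sample_html, "html.parser")
print(soup_builtin.title.text)

# html.parser keeps the document as written and does not add a <head> element
print(soup_builtin.head)
```

Whichever parser is used, passing its name as the second argument is a reasonable habit when results need to be reproducible.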

In [16]:
# Extracting the title
title = soup.find("title")
print(title)
print(type(title))
print(title.text)

<title>Data Science is Fun</title>
<class 'bs4.element.Tag'>
Data Science is Fun
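
One detail worth keeping in mind: `find` returns `None` when no matching tag exists, so accessing `.text` on the result would raise an `AttributeError`. The following sketch, using a made-up document without a title, shows one way to guard against that.

```python
from bs4 import BeautifulSoup

# Made-up document that has no <title> tag
soup_example = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

missing_title = soup_example.find("title")
print(missing_title)

# Fall back to an empty string instead of calling .text on None
title_text = missing_title.text if missing_title is not None else ""
print(repr(title_text))
```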


In [17]:
# Accessing the title’s text attribute
assert soup.title.text == title.text

In [18]:
# Accessing the body’s text attribute
body = soup.body
print(body.text)

Data Science is Funparagraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragr
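
Notice that the header and the first paragraph run together as `Funparagraphs` in the output above, because `.text` concatenates the text of nested tags without any separator. A small sketch of an alternative is the `get_text` method with its `separator` and `strip` arguments. The sample HTML here is an assumption chosen to keep the output short.

```python
from bs4 import BeautifulSoup

# Made-up body with a header and two short paragraphs
sample_html = (
    "<body><h1>Data Science is Fun</h1>"
    "<p>First paragraph</p><p>Second paragraph</p></body>"
)
sample_body = BeautifulSoup(sample_html, "html.parser").body

# .text joins the pieces with nothing in between
print(sample_body.text)

# get_text can insert a separator and strip surrounding whitespace
print(sample_body.get_text(separator=" ", strip=True))
```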

In [19]:
# Accessing the text of the first paragraph
assert body.p.text == soup.p.text
print(soup.p.text)

paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 


In [20]:
# Accessing all paragraphs in the body
paragraphs = body.find_all("p")
for i, paragraph in enumerate(paragraphs):
  print(f"\nPARAGRAPH {i}:")
  print(paragraph.text)


PARAGRAPH 0:
paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 

PARAGRAPH 1:
paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraph

In [21]:
# Accessing all bullet points in the body
print([bullet.text for bullet in body.find_all("li")])

['NumPy', 'SciPy', 'Pandas', 'Scikit-Learn']


In [22]:
# Accessing a paragraph by ID
paragraph_2 = soup.find(id="paragraph 2")
print(paragraph_2.text)

Here is a link to Data Science Bookcamp


In [23]:
# Accessing an attribute in a tag
assert paragraph_2.get("id") == "paragraph 2"
print(paragraph_2.a.get("href"))

https://www.manning.com/books/data-science-bookcamp
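
Besides `get`, a tag's attributes can be read with dictionary-style indexing or through the `attrs` dictionary. The sketch below, built on a made-up snippet, illustrates the differences: `get` returns `None` for a missing attribute, indexing would raise a `KeyError`, and the `class` attribute comes back as a list because a tag may have several classes.

```python
from bs4 import BeautifulSoup

# Made-up snippet with a division and a link
sample_html = "<div id='list' class='text'><a href='https://www.manning.com'>Manning</a></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
division = sample_soup.div
link = division.a

# Dictionary-style access; link["title"] would raise a KeyError here
print(link["href"])

# get() returns None when the attribute is absent
print(link.get("title"))

# attrs holds every attribute; class is stored as a list of class names
print(division.attrs)
```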


In [24]:
# Accessing divisions by their shared class attribute
for division in soup.find_all("div", class_="text"):
  id_ = division.get("id")
  print(f"\nDivision with id '{id_}':")
  print(division.text)


Division with id 'paragraphs':
paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 0 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragraphs 1 paragr

In [25]:
# Paragraph deletion
body.find(id="paragraphs 0").decompose()
soup.find(id="paragraphs 1").decompose()
print(body.find(id="paragraphs").text)

Here is a link to Data Science Bookcamp
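
`decompose` removes a tag from the tree and destroys it. When the removed tag is still needed afterwards, Beautiful Soup also offers `extract`, which removes the tag and returns it. A short comparison on a made-up document:

```python
from bs4 import BeautifulSoup

# Made-up document with three paragraphs
sample_html = "<div><p id='a'>first</p><p id='b'>second</p><p id='c'>third</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# decompose() deletes the tag outright
sample_soup.find(id="a").decompose()

# extract() detaches the tag from the tree and hands it back
removed = sample_soup.find(id="b").extract()

print(removed.text)
print(sample_soup.div.text)
print(sample_soup.find(id="a"))
```

Only the third paragraph remains inside the division, while the extracted paragraph is still available through `removed`.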


In [26]:
# Initializing an empty paragraph Tag
new_paragraph = soup.new_tag("p")
print(new_paragraph)

<p></p>


In [27]:
# Updating the text of an empty paragraph
new_paragraph.string = "This paragraph is new"
print(new_paragraph)

<p>This paragraph is new</p>


In [28]:
# Paragraph insertion
soup.find(id="empty").append(new_paragraph)
render(soup.prettify())
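
`append` always places the new tag at the end of its parent. As a small sketch of a variation, `insert` takes a position, and `new_tag` accepts attributes as keyword arguments. The document and the id `last` below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up division that already holds one paragraph
sample_soup = BeautifulSoup("<div id='box'><p>existing</p></div>", "html.parser")
box = sample_soup.find(id="box")

# insert(0, ...) places the new tag before the existing children
first = sample_soup.new_tag("p")
first.string = "inserted first"
box.insert(0, first)

# Attributes can be set while creating the tag
last = sample_soup.new_tag("p", id="last")
last.string = "appended last"
box.append(last)

paragraph_texts = [p.text for p in box.find_all("p")]
print(paragraph_texts)
```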

##Parsing online data

Let’s briefly review the procedure for downloading HTML files.

In [30]:
# Downloading an HTML document
url = "https://www.manning.com/books/data-science-bookcamp"
html_contents = urlopen(url).read()
print(html_contents[:1000])

b'\n<!DOCTYPE html>\n<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6 ie"> <![endif]-->\n<!--[if IE 7 ]>    <html lang="en" class="no-js ie7 ie"> <![endif]-->\n<!--[if IE 8 ]>    <html lang="en" class="no-js ie8 ie"> <![endif]-->\n<!--[if IE 9 ]>    <html lang="en" class="no-js ie9 ie"> <![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js"><!--<![endif]-->\n\n<head>\n    <meta name="theme-color" content="#333333">\n    <title>Data Science Bookcamp</title>\n\n\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">\n<meta name="application-name" content="Data Science Bookcamp"/>\n<meta name="apple-mobile-web-app-title" content="Data Science Bookcamp"/>\n\n<meta property="og:title" content="Data Science Bookcamp"/>\n<meta name="twitter:title" content="Data Science Bookcamp"/>\n\n<meta name="tw
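
The leading `b'` in the output shows that `urlopen(url).read()` returns raw bytes rather than a string. Beautiful Soup accepts bytes directly, but for other text processing it can help to decode them. The sketch below is one possible wrapper, not part of the original notebook: the function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are all assumptions. To stay runnable offline, the demo call uses a `data:` URL as a stand-in for a real page.

```python
from urllib.request import urlopen

def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return it as a decoded string."""
    # The context manager closes the connection once reading is done
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
        # Use the charset declared in the response headers, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")

# Offline stand-in for a real web address (URL-encoded "<title>Hello</title>")
demo_url = "data:text/html;charset=utf-8,%3Ctitle%3EHello%3C%2Ftitle%3E"
demo_html = fetch_html(demo_url)
print(demo_html)
```

Replacing `demo_url` with the book's URL from the cell above would download the live page, subject to network access.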

In [31]:
# Accessing the title
soup = bs(html_contents)
print(soup.title.text)

Data Science Bookcamp


In [32]:
# Accessing a description of this book
for division in soup.find_all("div"):
  header = division.h2
  if header is None:
    continue
  if header.text.lower() == "about the book":
    print(division.text)


about the book

Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.
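
The loop above can be generalized into a reusable function. The following is a minimal sketch under stated assumptions: the name `find_division_text` and the sample HTML are hypothetical, and the function returns only the first matching division instead of printing every match.

```python
from bs4 import BeautifulSoup

def find_division_text(soup, header_text):
    """Hypothetical helper: text of the first <div> whose first <h2> matches header_text."""
    for division in soup.find_all("div"):
        header = division.h2
        if header is None:
            continue
        # Compare case-insensitively and ignore surrounding whitespace
        if header.text.strip().lower() == header_text.lower():
            return division.get_text(separator=" ", strip=True)
    return None  # No division had a matching header

# Made-up page fragment with two divisions
sample_html = """
<div><h2>Reviews</h2><p>Great read.</p></div>
<div><h2> About the Book </h2><p>Case studies in data science.</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
about_text = find_division_text(sample_soup, "about the book")
print(about_text)
```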
    


We are now ready to use Beautiful Soup to parse job postings as part of our case
study solution.