Version: 02.14.2023

# Lab 3.1: Extracting Text from Webpages and Images

In the first part of this lab, you will use Beautiful Soup to extract text from a webpage and load the results into a pandas dataframe.

In the second part of the lab, you will use Amazon Textract to extract text from images.


## Lab steps

To complete this lab, you will follow these steps:

1. [Extracting information from a webpage](#1.-Extracting-information-from-a-webpage)
2. [Extracting text from images](#2.-Extracting-text-from-images)
    


In [1]:
#Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade sagemaker
!pip install --upgrade beautifulsoup4
!pip install --upgrade html5lib
!pip install --upgrade requests
!pip install --upgrade textract-trp

Collecting pip
  Downloading pip-24.2-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 72.2 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.2
    Uninstalling pip-23.3.2:
      Successfully uninstalled pip-23.3.2
Successfully installed pip-24.2
Collecting sagemaker
  Downloading sagemaker-2.232.2-py3-none-any.whl.metadata (16 kB)
Collecting sagemaker-mlflow (from sagemaker)
  Downloading sagemaker_mlflow-0.1.0-py3-none-any.whl.metadata (3.3 kB)
Collecting mlflow>=2.8 (from sagemaker-mlflow->sagemaker)
  Downloading mlflow-2.17.0-py3-none-any.whl.metadata (29 kB)
Collecting mlflow-skinny==2.17.0 (from mlflow>=2.8->sagemaker-mlflow->sagemaker)
  Downloading mlflow_skinny-2.17.0-py3-none-any.whl.metadata (30 kB)
Collecting alembic!=1.10.0,<2 (from mlflow>=2.8->sagemaker-mlflow->sagema

## 1. Extracting information from a webpage
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will use Beautiful Soup to extract the titles, authors, summaries, publication dates, and hyperlinks from blog posts. The extracted text could then be used in a downstream NLP task, such as topic extraction, sentiment analysis, text-to-speech, or translation.

Start by importing both the **Beautiful Soup** and **requests** packages.

In [2]:
from bs4 import BeautifulSoup
import requests

The page you will parse is the [AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/).

Using your web browser, open the AWS Machine Learning Blog page.

Use the browser's *inspector mode* to discover the structure of the page. In Mozilla Firefox and Google Chrome, you can open the inspector by pressing CTRL+SHIFT+C. If you use a different browser, consult the browser documentation.

View the different elements of the webpage by moving your pointer over the page. Move the pointer over the following elements, and see whether you can find the tags that are used to identify the information:

* Title of the blog post
* Author
* Date published
* Text summary
* Hyperlink to the blog post

Don't worry if you can't find all the tags. The following code walkthrough will help you find tags.


First, use the **requests** library to load the webpage. Before you proceed, confirm that the HTTP status code is *200*.

In [3]:
page = requests.get('https://aws.amazon.com/blogs/machine-learning/')
page.status_code

200
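
The status check above can also be made automatic. The following is a minimal sketch, not part of the lab code: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an assumed value. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw bytes of a page, raising on HTTP or network errors.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx,
    # so the caller never parses an error page by accident.
    response.raise_for_status()
    return response.content
```

A call such as `fetch_html('https://aws.amazon.com/blogs/machine-learning/')` would then return the same bytes as `page.content`, or fail loudly if the request did not succeed.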

Load the **content** from the page into a **soup** object.

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')

View the entire page by using the `soup.prettify()` function.

**Note:** The content from the AWS Blogs page might be lengthy. To move to the next task, scroll down in this notebook.

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js aws-lng-en_US" data-aws-assets="https://a0.awsstatic.com" data-css-version="1.0.538" data-js-version="1.0.681" data-static-assets="https://a0.awsstatic.com" lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   AWS Machine Learning Blog
  </title>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="default-src 'self' data: https://a0.awsstatic.com https://prod.us-east-1.ui.gcr-chat.marketing.aws.dev; base-uri 'none'; connect-src 'self' *.akamaized.net *.googlevideo.com/videoplayback https://*.analytics.console.aws.a2z.com https://*.panorama.console.api.aws https://*.prod.chc-features.uxplatform.aws.dev https://112-tzm-766.mktoresp.com https://112-tzm-766.mktoutil.com https://a0.awsstatic.com https://a0.p.awsstatic.com https://a1.awsstatic.com https://amazonwebservices.d2.sc.omtrdc.net https://amazonwebservi

All the elements on the page can be accessed by using dot (.) notation. For example, to view the title, you could use `soup.title`. If you want only the text inside the tag, use the `text` property as follows:

In [6]:
print(soup.title.text)

AWS Machine Learning Blog
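
One detail of dot notation is worth seeing on a small example: it returns only the *first* matching tag, and it returns `None` when no such tag exists. The sketch below uses a made-up HTML string (not the AWS page) so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<html><head><title>Example Blog</title></head>
<body>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""
soup_demo = BeautifulSoup(html, "html.parser")

title_text = soup_demo.title.text   # text inside <title>
first_p = soup_demo.p.text          # dot notation: first <p> only
all_p = [p.text for p in soup_demo.find_all("p")]  # every <p>
missing = soup_demo.table           # no <table> in the document, so None
```

Because a missing tag yields `None`, chaining such as `soup_demo.table.text` would raise an `AttributeError`. Keep this in mind when a page layout might vary.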


When you used the inspector to search for tags on the AWS Blogs page, you might have found that blog-post content is marked with `<article>` tags, which indicate a self-contained unit of content.

In [7]:
print(soup.article.prettify())

<article class="blog-post" typeof="TechArticle" vocab="https://schema.org/">
 <meta content="en-US" property="inLanguage"/>
 <meta content="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2024/07/10/ml-17178-hero.jpg" property="image"/>
 <div class="lb-row lb-snap">
  <div class="lb-col lb-mid-6 lb-tiny-24">
   <a href="https://aws.amazon.com/blogs/machine-learning/empowering-everyone-with-genai-to-rapidly-build-customize-and-deploy-apps-securely-highlights-from-the-aws-new-york-summit/" property="url" rel="bookmark">
    <img alt="" class="attachment-large size-large wp-post-image" height="253" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2024/07/10/ml-17178-hero.jpg" width="486"/>
   </a>
  </div>
  <div class="lb-col lb-mid-18 lb-tiny-24">
   <h2 class="lb-bold blog-post-title">
    <a href="https://aws.amazon.com/blogs/machine-learning/empowering-everyone-with-genai-to-rapidly-build-customize-and-deploy-apps-secu

Review the output. Can you find the title?

The title can be found at `soup.article.h2.span`:

In [8]:
print(soup.article.h2.span.prettify())

<span property="name headline">
 Empowering everyone with GenAI to rapidly build, customize, and deploy apps securely: Highlights from the AWS New York Summit
</span>



To display only the text, use the `text` property:

In [9]:
print(soup.article.h2.span.text)

Empowering everyone with GenAI to rapidly build, customize, and deploy apps securely: Highlights from the AWS New York Summit


Find the publish date of the article:

In [10]:
print(soup.article.time.text)

10 JUL 2024


Next, extract the article summary:

In [11]:
print(soup.article.section.p.text)

See how AWS is democratizing generative AI with innovations like Amazon Q Apps to make AI apps from prompts, Amazon Bedrock upgrades to leverage more data sources, new techniques to curtail hallucinations, and AI skills training.


The author name is in the footer. A blog post can have multiple authors. However, for now, retrieve only the *first author*:

In [12]:
print(soup.article.footer.span.prettify())

<span>
 by
 <span property="author" typeof="Person">
  <span property="name">
   Swami Sivasubramanian
  </span>
 </span>
</span>
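
The nesting shown above (an outer `span`, then a `span` with `property="author"`, then a `span` with `property="name"`) can also be navigated by attribute rather than by position. This is a small sketch on a made-up footer with placeholder author names; it assumes the same attribute layout as the output above.

```python
from bs4 import BeautifulSoup

# Made-up footer that mirrors the structure printed above
footer_html = """
<footer>
 <span>by
  <span property="author" typeof="Person"><span property="name">Author One</span></span>,
  <span property="author" typeof="Person"><span property="name">Author Two</span></span>
 </span>
</footer>
"""
footer = BeautifulSoup(footer_html, "html.parser").footer

# find() returns the first tag that matches the attribute filter
first_author = footer.find("span", attrs={"property": "author"})
first_name = first_author.span.get_text(strip=True)

# find_all() returns every match, which covers posts with several authors
all_names = [
    tag.get_text(strip=True)
    for tag in footer.find_all("span", attrs={"property": "name"})
]
```

Here `get_text(strip=True)` removes the surrounding whitespace that `prettify()` makes visible in the output.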



The hyperlink to the full article text is the last piece of information that you must find:

In [13]:
print(soup.article.div.a['href'])

https://aws.amazon.com/blogs/machine-learning/empowering-everyone-with-genai-to-rapidly-build-customize-and-deploy-apps-securely-highlights-from-the-aws-new-york-summit/
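
Tag attributes behave like dictionary entries, which is why `a['href']` works. The sketch below, on a made-up snippet, contrasts bracket access with `.get()` and shows one way to resolve a relative link. The relative path is hypothetical; the links on the AWS page shown above are already absolute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up snippet: one anchor with a relative href, one without any href
snippet = '<div><a href="/blogs/example-post/">Example</a><a>No link</a></div>'
links = BeautifulSoup(snippet, "html.parser").find_all("a")

first_href = links[0]["href"]        # bracket access; KeyError if the attribute is absent
missing_href = links[1].get("href")  # .get() returns None instead of raising

# Resolve the relative path against the page URL
absolute = urljoin("https://aws.amazon.com/blogs/machine-learning/", first_href)
```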


You have now identified all the relevant elements. To retrieve every article on the page, use the `find_all()` method. You can then loop through the results and output information about each blog post, such as the title, authors, and publication date.

For example, to print the details of every blog post, including all of its authors, loop through the results of `find_all('article')`:

In [14]:
for article in soup.find_all('article'):
    print('==========================================')
    print(article.h2.span.text)
    authors = article.footer.find_all('span', {"property":"author"})
    print('by', end=' ')
    for author in authors:
        if author.span is not None:
            print(author.span.text, end=', ')
    print(f'on {article.time.text}')
    print(article.section.p.text)
    print(article.div.a['href'])
    

Empowering everyone with GenAI to rapidly build, customize, and deploy apps securely: Highlights from the AWS New York Summit
by Swami Sivasubramanian, on 10 JUL 2024
See how AWS is democratizing generative AI with innovations like Amazon Q Apps to make AI apps from prompts, Amazon Bedrock upgrades to leverage more data sources, new techniques to curtail hallucinations, and AI skills training.
https://aws.amazon.com/blogs/machine-learning/empowering-everyone-with-genai-to-rapidly-build-customize-and-deploy-apps-securely-highlights-from-the-aws-new-york-summit/
Generative AI foundation model training on Amazon SageMaker
by Trevor Harvey, Guillaume Mangeot, Kanwaljit Khurmi, Miron Perel, on 22 OCT 2024
In this post, we explore how organizations can cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these powerful tools enable organizations to optimize compute resources and reduce the com

After you figure out the data format, you can add the results to a list:

In [None]:
blog_posts = []
for article in soup.find_all('article'):
    authors = article.footer.find_all('span', {"property":"author"})
    author_text = []
    for author in authors:
        if author.span is not None:
            author_text.append(author.span.text)
    blog_posts.append([article.h2.span.text, ', '.join(author_text), article.time.text, article.section.p.text, article.div.a['href'] ])
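
The loop above assumes that every article contains every element. As a hedged alternative, the sketch below shows one defensive pattern: `text_of` and `parse_article` are hypothetical helper names, and the two sample articles are made up so that the second one lacks an author, a date, and a summary.

```python
from bs4 import BeautifulSoup


def text_of(tag, default=""):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_article(article):
    """Turn one <article> element into a dict; missing parts become empty strings."""
    authors = [
        text_of(author.span)
        for author in article.find_all("span", attrs={"property": "author"})
    ]
    link = article.find("a", href=True)
    return {
        "title": text_of(article.find("h2")),
        "authors": ", ".join(name for name in authors if name),
        "published": text_of(article.find("time")),
        "summary": text_of(article.find("p")),
        "link": link["href"] if link is not None else "",
    }


# Made-up sample: the second article is deliberately incomplete
sample = """
<article>
  <h2><a href="https://example.com/post-1/"><span property="name headline">Post one</span></a></h2>
  <footer><span>by <span property="author"><span property="name">Author One</span></span>
  on <time>10 JUL 2024</time></span></footer>
  <section><p>Summary of post one.</p></section>
</article>
<article>
  <h2><a href="https://example.com/post-2/"><span property="name headline">Post two</span></a></h2>
</article>
"""
records = [
    parse_article(article)
    for article in BeautifulSoup(sample, "html.parser").find_all("article")
]
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, which takes the column names from the keys.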
    

Next, load the list into a pandas dataframe:

In [None]:
import pandas as pd
import time

In [None]:
df = pd.DataFrame(blog_posts, columns=['title','authors','published','summary','link'])

The **published** column currently holds strings such as `10 JUL 2024`. Convert it to `datetime` values so that you can sort and filter by date.

In [None]:
df['published'] = pd.to_datetime(df['published'], format='%d %b %Y')
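
The format string `'%d %b %Y'` uses the same directive codes as Python's standard `datetime.strptime`: `%d` is the day of the month, `%b` is the abbreviated month name, and `%Y` is the four-digit year. The sketch below uses only the standard library to show how one of the dates from the page is interpreted. It assumes an English locale, because `%b` matches locale-specific month names.

```python
from datetime import datetime

# The same format codes that are passed to pd.to_datetime above
parsed = datetime.strptime("10 JUL 2024", "%d %b %Y")
iso = parsed.date().isoformat()

# A string in a different layout does not match the format and raises ValueError
try:
    datetime.strptime("2024-07-10", "%d %b %Y")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```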

Adjust the column width for pandas, and display the first five rows of the dataframe:

In [None]:
# Show the full contents of each cell instead of truncating long strings
pd.set_option('display.max_colwidth', None)
df.head()
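
Once the dates are real `datetime` values, sorting and filtering by date become one-line operations. The following sketch uses a small dataframe with made-up rows (placeholder titles and authors, not the scraped data) to illustrate the idea.

```python
import pandas as pd

# Made-up rows for illustration only
posts = pd.DataFrame(
    [
        ["Post A", "Author One", "10 JUL 2024"],
        ["Post B", "Author Two", "22 OCT 2024"],
        ["Post C", "Author Three", "05 SEP 2024"],
    ],
    columns=["title", "authors", "published"],
)
posts["published"] = pd.to_datetime(posts["published"], format="%d %b %Y")

# Most recent posts first
newest_first = posts.sort_values("published", ascending=False)

# Only posts published on or after 1 September 2024
since_september = posts[posts["published"] >= pd.Timestamp("2024-09-01")]
```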

Now that the data is in a pandas dataframe, you can use this data in downstream NLP tasks. You will come back to this data in Module 5.

## 2. Extracting text from images
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will extract the text from an image by using Amazon Textract.

For this exercise, you will use the following simple image. This file was loaded into Amazon Simple Storage Service (Amazon S3) when you started the lab.

![Image of a simple document](../s3/simple-document-image.jpg)

Start by importing the library for the AWS SDK for Python (Boto3).

In [15]:
import boto3

Set up the variables for the bucket and document name.

In [16]:
# Document
s3BucketName = "c133864a3391488l8075554t1w388111502021-labbucket-6fafswxet9kr"
documentName = "lab31/simple-document-image.jpg"

Extract text from the image by calling the Amazon Textract `detect_document_text` application programming interface (API).

In [17]:
# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

print(response)

{'DocumentMetadata': {'Pages': 1}, 'Blocks': [{'BlockType': 'PAGE', 'Geometry': {'BoundingBox': {'Width': 1.0, 'Height': 1.0, 'Left': 0.0, 'Top': 0.0}, 'Polygon': [{'X': 0.0, 'Y': 0.0}, {'X': 1.0, 'Y': 0.0}, {'X': 1.0, 'Y': 1.0}, {'X': 0.0, 'Y': 1.0}]}, 'Id': 'da977d3f-bd03-4a58-98a2-4c5dc346e947', 'Relationships': [{'Type': 'CHILD', 'Ids': ['8a448825-b76f-4250-ba0c-9af9df1a09da', '1312e305-0768-4fcc-8330-ce271d8bad49', 'b4b1d17f-352e-4c9a-bbb6-d013cf70e70b', 'fb1176a5-1bc4-41cd-acf2-3c598ddf4a39']}]}, {'BlockType': 'LINE', 'Confidence': 99.52398681640625, 'Text': 'Amazon.com, Inc. is located in Seattle, WA', 'Geometry': {'BoundingBox': {'Width': 0.512660026550293, 'Height': 0.06824082136154175, 'Left': 0.06333211064338684, 'Top': 0.1989629715681076}, 'Polygon': [{'X': 0.06337157636880875, 'Y': 0.20793944597244263}, {'X': 0.5759921669960022, 'Y': 0.1989629715681076}, {'X': 0.5759671330451965, 'Y': 0.2590251564979553}, {'X': 0.06333211064338684, 'Y': 0.26720380783081055}]}, 'Id': '8a448

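The raw response is hard to read, but the **Blocks** list contains the key information that you need. Each block has a **BlockType**, such as `PAGE` or `LINE`, and the `LINE` blocks carry the detected text.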

Extract this information from the **Blocks** list:

In [18]:
# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        # Print each detected line in blue by using ANSI escape codes
        print('\033[94m' + item["Text"] + '\033[0m')
        text = text + " " + item["Text"]


Text
Amazon.com, Inc. is located in Seattle, WA
It was founded July 5th, 1994 by Jeff Bezos
Amazon.com allows customers to buy everything from books to blenders
Seattle is north of Portland and south of Vancouver, BC.


You have now extracted the text from the image. You can use this text in a downstream NLP task.
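
Each `LINE` block also carries a `Confidence` score and a bounding box, as the raw response shows. The sketch below is one possible way to use them: `extract_lines` is a hypothetical helper, the 90.0 threshold is an arbitrary assumption, and `sample_response` is a hand-built dictionary that mimics only the fields needed here, so the code runs without calling AWS.

```python
# Hand-built stand-in for a Textract response; values are made up
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Confidence": 99.5, "Text": "Second line",
         "Geometry": {"BoundingBox": {"Top": 0.40, "Left": 0.06}}},
        {"BlockType": "LINE", "Confidence": 99.1, "Text": "First line",
         "Geometry": {"BoundingBox": {"Top": 0.20, "Left": 0.06}}},
        {"BlockType": "LINE", "Confidence": 42.0, "Text": "Smudged text",
         "Geometry": {"BoundingBox": {"Top": 0.60, "Left": 0.06}}},
        {"BlockType": "WORD", "Confidence": 99.0, "Text": "First",
         "Geometry": {"BoundingBox": {"Top": 0.20, "Left": 0.06}}},
    ]
}


def extract_lines(response, min_confidence=90.0):
    """Return LINE texts at or above a confidence threshold, ordered top to bottom."""
    lines = [
        block for block in response["Blocks"]
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]
    # BoundingBox.Top is the distance from the top of the page (0.0 to 1.0)
    lines.sort(key=lambda block: block["Geometry"]["BoundingBox"]["Top"])
    return [block["Text"] for block in lines]


# Joining with a space avoids the leading space that string concatenation leaves
text_demo = " ".join(extract_lines(sample_response))
```

With the real `response` from the earlier cell, `extract_lines(response)` would return the detected lines as a list of strings.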

You will now experiment with one additional image. This image contains *tables* of text.

![Image of Employment Application](../s3/employmentapp.png)

Set the new document name:

In [None]:
# Document
documentName = "lab31/employmentapp.png"

Call the Amazon Textract API again. However, this time, specify the **TABLES** feature type:

In [None]:
# Call Amazon Textract again, reusing the client that was created earlier

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
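
With the **TABLES** feature type, the response contains `TABLE` blocks whose `CHILD` relationships point to `CELL` blocks, and each `CELL` has a 1-based `RowIndex` and `ColumnIndex` plus its own `CHILD` relationships to `WORD` blocks. The sketch below rebuilds a grid from those raw blocks. The helper names, block IDs, and cell contents are all made up for illustration, and the sketch ignores merged cells and selection elements.

```python
def cell_text(cell, blocks_by_id):
    """Join the WORD children of a CELL block into one string."""
    words = []
    for relationship in cell.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def tables_from_blocks(blocks):
    """Return each TABLE block as a list of rows, where each row is a list of strings."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "CHILD":
                for cell_id in relationship["Ids"]:
                    cell = blocks_by_id[cell_id]
                    if cell["BlockType"] == "CELL":
                        key = (cell["RowIndex"], cell["ColumnIndex"])
                        cells[key] = cell_text(cell, blocks_by_id)
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append([
            [cells.get((row, col), "") for col in range(1, n_cols + 1)]
            for row in range(1, n_rows + 1)
        ])
    return tables


# Hand-built blocks with made-up IDs and text
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "Start"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Employer"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1/15/2009"},
]
grid = tables_from_blocks(sample_blocks)
```

Walking the blocks by hand like this is verbose, which is the motivation for using a results parser.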


Parse the table by using the Amazon Textract results parser (**textract-trp**).

**Note:** You installed the Amazon Textract results parser when you ran the `pip install --upgrade textract-trp` command at the start of this notebook.

In [None]:
from trp import Document
doc = Document(response)

for page in doc.pages:
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
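
Instead of printing each cell, you might prefer to collect a table into rows for further processing. The sketch below assumes only what the loop above already uses: a table exposes `rows`, each row exposes `cells`, and each cell exposes `text`. `table_to_rows` and `make_table` are hypothetical helpers, and the stand-in table is built with `SimpleNamespace` and made-up values so that the example runs without a Textract response.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Collect a parsed table into a list of rows of stripped cell text."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def make_table(rows):
    """Build a stand-in object with the same .rows/.cells/.text shape (illustration only)."""
    return SimpleNamespace(rows=[
        SimpleNamespace(cells=[SimpleNamespace(text=text) for text in row])
        for row in rows
    ])


# Made-up table contents
fake_table = make_table([["Name ", "Position "], ["Jane Doe ", "Analyst "]])
rows = table_to_rows(fake_table)

# Treat the first row as the header and turn the rest into records
header, *body = rows
table_records = [dict(zip(header, row)) for row in body]
```

In the loop above, `table_to_rows(table)` could be called for each `table` in `page.tables`, and the resulting rows could be loaded into a pandas dataframe.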

You have now extracted the text from a different image, and you could continue to process it further, if needed.

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2023 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*