# Assignment: Web Scraping with Beautiful Soup

## Objective:

The objective of this assignment is to help trainees gain hands-on
experience with Beautiful Soup, a popular Python library for web
scraping. By the end of this assignment, trainees should be able to
scrape data from websites, navigate HTML structures, and store the
extracted data in various formats.

### Task 1: Install and Set Up Beautiful Soup

-   Install Required Libraries: Install Beautiful Soup along with a
    parser like lxml and the requests library for making HTTP requests.

In \[1\]:

    # Command



### Task 3: Scrape Data Using Beautiful Soup

**Make an HTTP Request:** Use the requests library to send a GET request
to the website and retrieve the HTML content.

In \[ \]:

    # Comments

**Parse HTML Content:** Use Beautiful Soup to parse the HTML content
retrieved from the website.

In \[4\]:

    # Comments
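
The parsing step can be sketched as follows. This is a minimal example, not the assignment's required solution: `make_soup` and `SAMPLE_HTML` are hypothetical names introduced here, and the markup is a stand-in for a downloaded page. It assumes you want the lxml parser from Task 1 when it is installed and Python's built-in `html.parser` otherwise.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(SAMPLE_HTML)
print(soup.title.string)
```

Both parsers give the same result for well-formed pages like this one; they can differ in how they repair broken markup.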

**Extract Specific Data:**

-   Quotes Website: Extract quotes, authors, and tags from the website.

-   Hacker News: Extract headlines and links to the articles.

-   IMDB: Extract movie titles, ratings, and years of release.

In \[6\]:

    # Comments

    Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    Author: Albert Einstein
    Tags: ['change', 'deep-thoughts', 'thinking', 'world']

    Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
    Author: J.K. Rowling
    Tags: ['abilities', 'choices']

    Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
    Author: Albert Einstein
    Tags: ['inspirational', 'life', 'live', 'miracle', 'miracles']

    Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
    Author: Jane Austen
    Tags: ['aliteracy', 'books', 'classic', 'humor']

    Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
    Author: Marilyn Monroe
    Tags: ['be-yourself', 'inspirational']

    Quote: “Try not to become a man of success. Rather become a man of value.”
    Author: Albert Einstein
    Tags: ['adulthood', 'success', 'value']

    Quote: “It is better to be hated for what you are than to be loved for what you are not.”
    Author: André Gide
    Tags: ['life', 'love']

    Quote: “I have not failed. I've just found 10,000 ways that won't work.”
    Author: Thomas A. Edison
    Tags: ['edison', 'failure', 'inspirational', 'paraphrased']

    Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
    Author: Eleanor Roosevelt
    Tags: ['misattributed-eleanor-roosevelt']

    Quote: “A day without sunshine is like, you know, night.”
    Author: Steve Martin
    Tags: ['humor', 'obvious', 'simile']

**Handle Missing Data:** Implement error handling to manage cases where
certain elements might be missing or where requests might fail.

In \[7\]:

    # Comments
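
One possible way to approach both kinds of failure is sketched below. The helper names (`fetch_html`, `text_or_none`, `parse_quotes`) and the sample markup are assumptions made for illustration; the class names follow the structure of quotes.toscrape.com. Failed requests return `None` instead of raising, and missing elements are recorded as `None` rather than crashing the loop.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"Request failed for {url!r}: {exc}")
        return None
    return response.text


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching child, or None."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


def parse_quotes(html):
    """Extract quote records, leaving None where an element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="quote"):
        records.append({
            "quote": text_or_none(block, "span", "text"),
            "author": text_or_none(block, "small", "author"),
            "tags": [a.get_text(strip=True) for a in block.find_all("a", class_="tag")],
        })
    return records


# Hypothetical markup: the second quote has no author element and no tags.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author One</small>
  <a class="tag" href="#">life</a>
</div>
<div class="quote">
  <span class="text">Second quote</span>
</div>
"""

records = parse_quotes(SAMPLE_HTML)
print(records)
```

In a real run you would call `fetch_html(url)` first and only pass the result to `parse_quotes` when it is not `None`.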

### Task 4: Store the Scraped Data

**Save Data to a JSON File:** Store the extracted data in a JSON file.

In \[10\]:

    # Comments

**Save Data to a CSV File:** Store the extracted data in a CSV file.

In \[13\]:

    # Comments
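
A minimal sketch of the CSV step, using only the standard library. The records below are hypothetical and simply mirror the quote/author/tags shape used in this assignment; `save_quotes_csv` is an illustrative name, and joining the tags with `;` is one design choice among several, made because a CSV cell cannot hold a list directly.

```python
import csv

# Hypothetical records in the shape produced by the extraction step.
quotes_data = [
    {"quote": "First quote", "author": "Author One", "tags": ["life", "love"]},
    {"quote": "Second quote", "author": "Author Two", "tags": []},
]


def save_quotes_csv(records, path):
    """Write one row per quote; tags are joined into a single cell."""
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["quote", "author", "tags"])
        writer.writeheader()
        for record in records:
            row = dict(record)  # copy so the original record keeps its list
            row["tags"] = ";".join(record["tags"])
            writer.writerow(row)


save_quotes_csv(quotes_data, "quotes.csv")
```

Opening the file with `newline=""` lets the `csv` module control line endings, which avoids blank rows on Windows.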

### Task 1: Install and Set Up Beautiful Soup


-   Install Required Libraries: Install Beautiful Soup along with a
    parser like lxml and the requests library for making HTTP requests.


In [None]:
# Command: install Beautiful Soup, the lxml parser, and requests
pip install beautifulsoup4 lxml requests

### Task 2: Choose a Website to Scrape

**Select a Website:** Choose a simple, publicly accessible website to
scrape. Some example websites include:

-   <http://quotes.toscrape.com> (A website designed for practicing web
    scraping)
-   <https://news.ycombinator.com/> (Hacker News)
-   <https://www.imdb.com/chart/top/> (IMDB Top 250)

**Understand the Website’s Structure:**

-   Use your browser’s developer tools (Inspect element) to explore the
    HTML structure of the website.
-   Identify the elements you need to scrape (e.g., titles, links,
    prices, etc.).
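
Once the elements are identified in the inspector, each tag/class pair can be turned into a CSS selector. The sketch below assumes a simplified copy of one quote block from quotes.toscrape.com (`SAMPLE_HTML` is a hand-trimmed stand-in, not the live page) and uses Beautiful Soup's `select_one` and `select` methods.

```python
from bs4 import BeautifulSoup

# Simplified copy of one quote block as it appears in the inspector.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">A day without sunshine is like, you know, night.</span>
  <span>by <small class="author">Steve Martin</small></span>
  <div class="tags">
    <a class="tag" href="/tag/humor/page/1/">humor</a>
    <a class="tag" href="/tag/obvious/page/1/">obvious</a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each CSS selector mirrors a tag/class pair spotted in the inspector.
quote_text = soup.select_one("div.quote span.text").get_text()
author = soup.select_one("div.quote small.author").get_text()
tags = [a.get_text() for a in soup.select("div.quote a.tag")]

print(quote_text, author, tags)
```

Checking selectors against a small saved snippet like this is a quick way to confirm them before requesting the real page.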

#### Comments

-   I want to scrape the following elements:
    -   Title
    -   Quotes
    -   Author
    -   Tags

In [4]:
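import requests
from bs4 import BeautifulSoup

# Send a GET request to the practice site and print the raw HTML
url = 'https://quotes.toscrape.com/'
html = requests.get(url, timeout=10)  # html is a Response; the timeout avoids hanging forever
print(html.text)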

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

In [5]:
html.status_code

200

In [6]:
soup = BeautifulSoup(html.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

In [7]:
soup.title

<title>Quotes to Scrape</title>

In [8]:
soup.title.name

'title'

In [9]:
soup.title.string

'Quotes to Scrape'

In [22]:
# First, find all div tags that hold a quote
div_tags = soup.find_all('div', class_='quote')
len(div_tags)  # not displayed: only a cell's last expression is echoed
print(div_tags)

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K.

In [12]:
def get_quotes(div_tags):
    """Return the text of every quote found in div_tags."""
    quotes = []
    for tag in div_tags:
        quote = tag.find('span', class_='text').text
        quotes.append(quote)
    return quotes

get_quotes(div_tags)

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [41]:
# Similarly, collect the author of each quote
def get_author(div_tags):
    """Return the author name for every quote in div_tags."""
    authors = []
    for tag in div_tags:
        author = tag.find('small', class_='author').text
        authors.append(author)
    return authors

get_author(div_tags)

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

In [30]:
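# Similarly, collect the tags of each quote
def get_quotes_tags(div_tags):
    """Return each quote's tags as one comma-separated string."""
    tags = []
    for tag in div_tags:
        # The <meta class="keywords"> element stores all tags in its content attribute
        name_tag = tag.find('div', class_='tags').meta['content']
        tags.append(name_tag)
    return tags

get_quotes_tags(div_tags)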

['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']
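
Each entry above is a single comma-separated string, whereas the assignment's expected output lists tags individually. A small sketch of converting one form to the other is shown here; `raw_tags` is a hypothetical two-item sample of the output above, and the filter is an added precaution for quotes whose `content` attribute is empty.

```python
# Sample of the comma-separated strings returned by get_quotes_tags.
raw_tags = ['change,deep-thoughts,thinking,world', 'abilities,choices']

# Split each string on commas; the filter drops empty pieces so that an
# empty string becomes [] rather than [''].
tag_lists = [[t for t in entry.split(',') if t] for entry in raw_tags]
print(tag_lists)
```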

In [33]:
# Build one dictionary per quote with its text, author, and list of tags
quotes_data = []

for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    # Each tag is its own <a class="tag"> link, so this yields a list of strings
    tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]

    quotes_data.append({
        'quote': text,
        'author': author,
        'tags': tags
    })
quotes_data

[{'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
 {'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices']},
 {'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']},
 {'quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor']},
 {'quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'author': 'Marilyn Monroe',
  'tags': 

In [35]:
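# Store the data in a JSON file
import json

# ensure_ascii=False keeps characters such as “ ” and é readable in the file
with open('quotes.json', 'w', encoding='utf-8') as json_file:
    json.dump(quotes_data, json_file, indent=4, ensure_ascii=False)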