# Web Scraping with Beautiful Soup

Beautiful Soup...

### Task 1: Install and Set Up Beautiful Soup

Install Required Libraries: Install Beautiful Soup along with a parser like lxml and the requests library for making HTTP requests.

In [1]:
# Import pandas to store the scraped data in a DataFrame
import pandas as pd

# Import BeautifulSoup to parse the scraped HTML
from bs4 import BeautifulSoup
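
The cell above assumes the packages are already installed; they are typically installed with `pip install beautifulsoup4 lxml requests pandas`. As a quick sanity check, the following sketch reports which of them cannot be found in the current environment. The helper name `missing_packages` is hypothetical and not part of any library.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used in this notebook (bs4 is the import name of beautifulsoup4)
required = ["bs4", "lxml", "requests", "pandas"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```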


### Task 2: Choose a Website to Scrape

Select a Website: Choose a simple, publicly accessible website to scrape. Some example websites include:

* http://quotes.toscrape.com (A website designed for practicing web scraping)
* https://news.ycombinator.com/ (Hacker News)
* https://www.imdb.com/chart/top/ (IMDB Top 250)

#### Understand the Website’s Structure:

* Use your browser’s developer tools (Inspect element) to explore the HTML structure of the website.
* Identify the elements you need to scrape (e.g., titles, links, prices, etc.).
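
Once the page has been inspected in the browser, the same structure can be confirmed from Python. The sketch below counts tag and class combinations in a small, hand-written HTML sample that imitates one quote block from quotes.toscrape.com. The sample and the helper name `summarize_structure` are assumptions for illustration, and the sketch uses Python's built-in `html.parser` so that it runs without lxml.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hand-written sample imitating one quote block (an assumption, not live data)
sample_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
"""


def summarize_structure(html):
    """Count how often each (tag name, class) combination occurs."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for element in soup.find_all(True):          # True matches every tag
        classes = " ".join(element.get("class", [])) or "(no class)"
        counts[(element.name, classes)] += 1
    return counts


for (name, classes), count in summarize_structure(sample_html).items():
    print(f"{name:<6} {classes:<12} {count}")
```

Combinations that repeat once per item, such as `a` with class `tag`, are usually the ones worth targeting in the scraper.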

#### Comments

* I want to scrape the following fields:
    * Title
    * Quotes
    * Author
    * Tags

### Task 3: Scrape Data Using Beautiful Soup

Make an HTTP Request: Send a GET request to the website and retrieve the HTML content. The task suggests the requests library; this notebook uses `urlopen` from the standard library's `urllib.request` instead.

In [3]:
from urllib.request import urlopen   # standard-library function for opening URLs

# import requests   # alternative HTTP library (not used below)

In [122]:

url_1 = "https://www.imdb.com/chart/top/"
html_content1 = urlopen(url_1)
html_content1

"""
I tried this link but could not open it; the server returned 403: Forbidden.
Hence, I worked with the quotes URL below."""

HTTPError: HTTP Error 403: Forbidden
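
A 403 response often means that the server rejected the request itself rather than that the page does not exist. Some sites refuse requests that carry Python's default User-Agent, so one thing to try is sending a descriptive header and catching the error explicitly. The sketch below is a minimal illustration under that assumption: the helper names `build_request` and `fetch_html` and the User-Agent string are hypothetical, there is no guarantee that this makes IMDb accessible, and a site's terms of use and robots.txt should be checked before scraping it.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Illustrative header value (an assumption); any descriptive string can be used
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; scraping-practice-notebook)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None on an error status or unreachable server."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as error:       # the server answered with an error status
        print(f"HTTP error {error.code} for {url}: {error.reason}")
    except URLError as error:        # the server could not be reached at all
        print(f"Could not reach {url}: {error.reason}")
    return None
```

Calling `fetch_html(url_1)` would return the raw HTML bytes on success, which can be passed to `BeautifulSoup` in the same way as an open response, or `None` with a printed message instead of an uncaught exception.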

In [5]:
url_2 = "http://quotes.toscrape.com/"
html2 = urlopen(url_2)   # open url_2 (the quotes site) and assign the response to html2
html2

<http.client.HTTPResponse at 0x2346c88ba30>

Parse HTML Content: Use Beautiful Soup to parse the HTML content retrieved from the website.

In [6]:
# Parse the response with BeautifulSoup using the 'lxml' parser
soup = BeautifulSoup(html2, 'lxml')


type(soup)

bs4.BeautifulSoup

In [8]:
# print the html content as soup
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="t

In [7]:
# .prettify() returns the HTML as an indented, readable string

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

Extract Specific Data:

* Quotes Website: Extract quotes, authors, and tags from the website.

* Hacker News: Extract headlines and links to the articles.

* IMDB: Extract movie titles, ratings, and years of release.
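
For the quotes site, each of these fields sits under a predictable tag and class, so they can also be addressed with CSS selectors through `select` and `select_one`, as an alternative to the `find` and `find_all` calls used in this notebook. The sketch below runs on a hand-written sample that imitates the page's quote markup (an assumption, not live data) and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating two quote blocks (illustrative values only)
sample_html = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <span>by <small class="author">Author One</small></span>
  <div class="tags">
    <a class="tag" href="/tag/alpha/">alpha</a>
    <a class="tag" href="/tag/beta/">beta</a>
  </div>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <span>by <small class="author">Author Two</small></span>
  <div class="tags">
    <a class="tag" href="/tag/gamma/">gamma</a>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for block in sample_soup.select("div.quote"):            # one block per quote
    records.append({
        "Quotes": block.select_one("span.text").get_text(strip=True),
        "Author": block.select_one("small.author").get_text(strip=True),
        "Tag": [a.get_text(strip=True) for a in block.select("div.tags a.tag")],
    })

for record in records:
    print(record)
```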

In [105]:
# Collect every <div> with class "quote" into a list
quotes = soup.find_all('div', class_="quote")
quotes

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

In [111]:
# Create an empty list to collect the quotes, authors, and tags
information = []

# Loop over each quote block and extract the quote text, author, and tags
for info in quotes:
    quote = info.find('span', class_="text").get_text(strip=True)
    author = info.find('small', class_='author').text.strip()
    tags = [tag.get_text(strip=True) for tag in info.find_all('a', class_='tag')]

    # Append the quote, author, and tags to the list as a dictionary
    information.append({
        "Quotes": quote,
        "Author": author,
        "Tag": tags
    })

In [112]:
# Display the quotes, author names, and tags stored in the information list

information

[{'Quotes': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'Author': 'Albert Einstein',
  'Tag': ['change', 'deep-thoughts', 'thinking', 'world']},
 {'Quotes': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'Author': 'J.K. Rowling',
  'Tag': ['abilities', 'choices']},
 {'Quotes': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'Author': 'Albert Einstein',
  'Tag': ['inspirational', 'life', 'live', 'miracle', 'miracles']},
 {'Quotes': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'Author': 'Jane Austen',
  'Tag': ['aliteracy', 'books', 'classic', 'humor']},
 {'Quotes': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'Author': 'Marilyn Monroe',
  'Tag': 

In [78]:
"""

information = []

for info in quotes:
    quote = info.find('span', class_ = "text").get_text(strip = True)
    author = info.find('a').get('href').replace("/author/","").strip()
    tag = info.find('a').get('href')
    
    quotes.append({
        "Quotes": quote,
        "Author": author,
        "Tag" : tag
    })

This was my trial, but it did not produce the appropriate results. The quotes output below is from that trial.
    
  """  
    

In [72]:
quotes

[{'Quotes': <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
  'Author': '/',
  'Tag': '/'},
 {'Quotes': None,
  'Author': 'https://www.goodreads.com/quotes',
  'Tag': 'https://www.goodreads.com/quotes'}]

Handle Missing Data: Implement error handling to manage cases where certain elements might be missing or where requests might fail.
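
The extraction loop above calls `.get_text()` directly on the result of `find`, which raises `AttributeError` when an element is absent, because `find` returns `None` in that case. A minimal sketch of a more defensive version follows. The helper names `text_or_none` and `parse_quote` are hypothetical, and the sample HTML (with a deliberately missing author) is an assumption used only to exercise the missing-data path.

```python
from bs4 import BeautifulSoup

# The second block deliberately lacks an author element and tags
sample_html = """
<div class="quote">
  <span class="text">“Complete sample quote.”</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/alpha/">alpha</a>
</div>
<div class="quote">
  <span class="text">“Quote without an author.”</span>
</div>
"""


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(name, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


def parse_quote(block):
    """Build one record; missing fields become None or an empty list."""
    return {
        "Quotes": text_or_none(block, "span", "text"),
        "Author": text_or_none(block, "small", "author"),
        "Tag": [tag.get_text(strip=True) for tag in block.find_all("a", class_="tag")],
    }


sample_soup = BeautifulSoup(sample_html, "html.parser")
safe_records = [parse_quote(block) for block in sample_soup.find_all("div", class_="quote")]

for record in safe_records:
    print(record)
```

Failed requests can be handled separately by catching `urllib.error.HTTPError` and `urllib.error.URLError` around the `urlopen` call.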

In [113]:
# Store the scraped records in a DataFrame
df = pd.DataFrame(information)
df

|   | Quotes | Author | Tag |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
| 6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |


In [116]:
df.dtypes

Quotes    object
Author    object
Tag       object
dtype: object

### Task 4: Store the Scraped Data

Save Data to a JSON File: Store the extracted data in a JSON file.

In [120]:
df.to_json('Quotes.json')  # Store the DataFrame as a JSON file

print("Data is stored as .json file")

Data is stored as .json file
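
By default, `DataFrame.to_json` writes column-oriented JSON and escapes non-ASCII characters such as the curly quotation marks in the quotes. If a list with one object per quote is preferred, `orient="records"` together with `force_ascii=False` is one option. The sketch below uses a two-row stand-in for `df` (an assumption) and returns the JSON as a string instead of writing a file; the `indent` argument assumes pandas 1.0 or later.

```python
import json

import pandas as pd

# Two-row stand-in for the scraped DataFrame (illustrative values only)
sample_df = pd.DataFrame([
    {"Quotes": "“First sample quote.”", "Author": "Author One", "Tag": ["alpha", "beta"]},
    {"Quotes": "“Second sample quote.”", "Author": "Author Two", "Tag": ["gamma"]},
])

# With no path given, to_json returns the JSON text instead of writing a file
json_text = sample_df.to_json(orient="records", force_ascii=False, indent=2)
print(json_text)

# The text parses back into a list with one dictionary per quote
parsed = json.loads(json_text)
```

Passing a filename such as `'Quotes.json'` as the first argument writes the same content to disk.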


Save Data to a CSV File: Store the extracted data in a CSV file.

In [121]:
df.to_csv("Quotes.csv", index=False)  # Store the DataFrame as a CSV file

print("File has been saved as .csv file")

File has been saved as .csv file
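
Because the `Tag` column holds Python lists, `to_csv` writes each list as its string representation (for example `['abilities', 'choices']`), and reading the file back yields strings rather than lists. One possible workaround, sketched below on a two-row stand-in for `df`, is to join the tags with a delimiter before saving and split them again after loading. The `;` delimiter and the in-memory buffer are assumptions, chosen so that the example runs without touching the file system.

```python
import io

import pandas as pd

# Two-row stand-in for the scraped DataFrame (illustrative values only)
sample_df = pd.DataFrame([
    {"Quotes": "“First sample quote.”", "Author": "Author One", "Tag": ["alpha", "beta"]},
    {"Quotes": "“Second sample quote.”", "Author": "Author Two", "Tag": ["gamma"]},
])

# Join each list of tags into a single delimited string before writing
flat_df = sample_df.assign(Tag=sample_df["Tag"].str.join(";"))

buffer = io.StringIO()               # stands in for a file such as Quotes.csv
flat_df.to_csv(buffer, index=False)

# Read the CSV text back and split the tag strings into lists again
buffer.seek(0)
restored_df = pd.read_csv(buffer)
restored_df["Tag"] = restored_df["Tag"].str.split(";")

print(restored_df)
```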
