# Scraping Quotes And Related Details From **Quotes To Scrape** Website


![](https://fiverr-res.cloudinary.com/videos/so_0.629779,t_main1,q_auto,f_auto/cracplxl1o4fnxqi7np1/parse-website-and-collect-and-analyze-data-from-it.png)

Web scraping is a method for extracting and collecting huge volumes of data from websites and saving it to a file or database efficiently. In most cases, the data scraped will be stored in a tabulated form.
In most cases a web browser is required in order to access and read the data provided on website. Moreover, we cannot store a copy of this data to a database locally. If you need the data, the only way to get it is to manually copy and paste it - which is a time-consuming process that might take a few hours to a few days to accomplish.
Web scraping is a method of automating this process so that instead of manually downloading data from webpages, it may be done automatically using python and a few other tools

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
#Execute this to save new versions of the notebook
jovian.commit(project="web-scraping-with-python-2-0")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "vigilantstars6/web-scraping-with-python-2-0" on https://jovian.ai[0m
[jovian] Committed successfully! https://jovian.ai/vigilantstars6/web-scraping-with-python-2-0[0m


'https://jovian.ai/vigilantstars6/web-scraping-with-python-2-0'

In [4]:
#jovian.submit(assignment="zerotoanalyst-project1")

### Project Outline:

- `Quotes To Scrape` is a website that hosts hundreds of intriguing quotes. It is complied by  ['Scraping Hub'](https://www.zyte.com/)  from the website ['Good Reads'](https://www.goodreads.com/quotes?page=1).
- In this Web Scraping Project we will scrape the website, extract details like `Quote` `Author` `Genre Related Topic Tags` `About Author` and store it in a dataframe of `CSV FORMAT`. 
- We will first use the requests library and download the web pages into a file.
- We will then use Beautiful Soup library to parse and extract information from the downloaded or locally saved webpage data.
- We will then define function for accessing various parts of the data and save it into python lists.
- We then define another function that combines the utility of all the previously defined functions into one. 
- Using this function we get all the required data and save it into a CSV file. 


### Tools That Will Be Used:
- **Python 3.9** 
- **Jupyter Notebooks** to save and display our code and output
- **Beautiful Soup Library** for parsing through a website
- **Requests Library** to send HTTP requests and receive HTML files
- **Pandas library** to construct a dataframe and saving the result to .CSV file

## Using requests library to download web pages

![image.png](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/ws1.png)

Requests is a HTTP library for the Python programming language. The goal of the project is to make HTTP requests simpler and more human-friendly. The current version is 2.26.0.
You can read more about this [here](https://en.wikipedia.org/wiki/Requests_(software)).

In [5]:
! pip install requests --quiet --upgrade 

In [6]:
import requests

In [7]:
topics_url = "http://quotes.toscrape.com"

In [8]:
response= requests.get(topics_url)
response.status_code

200

In [9]:
page_contents = response.text

In [10]:
len(page_contents)

11010

In [11]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <sp

In [12]:
with open("webpage.html", "w") as f:
    f.write(page_contents)

## Using Beautiful Soup to parse and extract information

![image.png](https://sixfeetup.com/blog/an-introduction-to-beautifulsoup/@@images/27e8bf2a-5469-407e-b84d-5cf53b1b0bb6.png)

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML which is useful for web scraping. You can read more about this [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [13]:
!pip install beautifulsoup4 --upgrade --quiet

In [14]:
from bs4 import BeautifulSoup

In [15]:
doc = BeautifulSoup(page_contents, "html.parser")

In [16]:
type(doc)

bs4.BeautifulSoup

In [17]:
doc.find_all('head')

[<head>
 <meta charset="utf-8"/>
 <title>Quotes to Scrape</title>
 <link href="/static/bootstrap.min.css" rel="stylesheet"/>
 <link href="/static/main.css" rel="stylesheet"/>
 </head>]

## Defining function to get Beautiful Soup object from any URL

We will now define a function that will take our website URL as the input and return a document as the output. This document file consists of a copy of the website that we choose to scrape.  

In [18]:
def get_doc(url):
    response= requests.get(url)
    doc = BeautifulSoup(response.text, "html.parser")
    return doc

In [19]:
doc= get_doc(topics_url)
doc.find_all('head')

[<head>
 <meta charset="utf-8"/>
 <title>Quotes to Scrape</title>
 <link href="/static/bootstrap.min.css" rel="stylesheet"/>
 <link href="/static/main.css" rel="stylesheet"/>
 </head>]

On executing the above code block we've get the required output which is stored in the variable `doc`.


## Defining function to get all div tags from the webpage

![image-3.png](https://creive.me/wp-content/uploads/2016/03/div-1.jpg)

After saving a copy of the entire website, we will now define a function `get_quote_divtags` that will accesses the individual <div> tags from the entire website HTML data.  

In [20]:
selection_class = "quote"
quotes_divtag = doc.find_all("div",{"class": selection_class})
len(quotes_divtag)

10

This is the zeroth `div-tag` element. When converted into tabular data, it will consists an entire row worth of data. 

In [21]:
quotes_divtag[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

In [22]:
def get_quote_divtags(doc):
    selection_class = "quote"
    quote_tags = doc.find_all("div",{"class": selection_class})
    return quote_tags

In [23]:
quote_divtags = get_quote_divtags(doc)
quote_divtags


[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

On executing the above code block we get the required output which is stored in the variable `quote_divtags`.


### Defining a function to get all quotes from the webpage

In [24]:
def get_quotes(quote_divtags):
    all_quotes = []
    for quote_tag in quote_divtags :
        quote= quote_tag.find("span",class_="text").text
        all_quotes.append(quote)
    return all_quotes

In [25]:
all_quotes= get_quotes(quote_divtags)
all_quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

On executing the above code block we've get the required output which is stored in the variable `all_quotes`.


### Define a function to get all the authors names from a website 


In [26]:
def get_all_authors(quotes_divtag):
    all_authors =[]
    for author_tag in quote_divtags:
        author= author_tag.find("small",class_="author").text
        all_authors.append(author)
    return all_authors

In [27]:
all_authors = get_all_authors(get_quote_divtags)
all_authors

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

On executing the above code block we've get the required output which is stored in the variable `all_authors`.


### Defining a function to get all the topic tags related to the quote

Here we first define a for loop that pull out the topic tags related to a quote. We then embed this for loop under another nested for loop that runs the entire block of code based on the user input. 

In [28]:
topic_tag=[i.text for i in quotes_divtag[0].find_all("a",class_="tag")]
topic_tag

['change', 'deep-thoughts', 'thinking', 'world']

In [29]:
all_topic_tags=[]
a=len(quotes_divtag)
for i in range(a):
    b= [i.text for i in quotes_divtag[i].find_all("a",class_="tag")]
    all_topic_tags.append(" ".join(b))
all_topic_tags    

['change deep-thoughts thinking world',
 'abilities choices',
 'inspirational life live miracle miracles',
 'aliteracy books classic humor',
 'be-yourself inspirational',
 'adulthood success value',
 'life love',
 'edison failure inspirational paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor obvious simile']

In [30]:
def get_all_topic_tags(quotes_divtag):
    all_topic_tags=[]
    for topic_tag in quotes_divtag:
        topic= [i.text for i in topic_tag.find_all("a",class_="tag")]
        all_topic_tags.append(" ".join(topic)) 
    return all_topic_tags
    

In [31]:
all_topic_tags= get_all_topic_tags(quotes_divtag)
all_topic_tags

['change deep-thoughts thinking world',
 'abilities choices',
 'inspirational life live miracle miracles',
 'aliteracy books classic humor',
 'be-yourself inspirational',
 'adulthood success value',
 'life love',
 'edison failure inspirational paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor obvious simile']

On executing the above code block we've get the required output which is stored in the variable `all_topic_tags`.


### Defining a function to get all the links for the authors bio

In [32]:
about_link= quotes_divtag[0].find("a")["href"]
about_link

'/author/Albert-Einstein'

In [33]:
complete_link= "http://quotes.toscrape.com" + quotes_divtag[0].find("a")["href"]
complete_link


'http://quotes.toscrape.com/author/Albert-Einstein'

In [34]:
def get_all_authors_links(divtags):
    all_authors_links=[]
    complete_about_url= ("http://quotes.toscrape.com/")
    for about_link in divtags:
            b= about_link.find("a")["href"]
            k=complete_about_url + b
            all_authors_links.append(k)
    return all_authors_links


In [35]:
all_authors_links= get_all_authors_links(quotes_divtag)
all_authors_links

['http://quotes.toscrape.com//author/Albert-Einstein',
 'http://quotes.toscrape.com//author/J-K-Rowling',
 'http://quotes.toscrape.com//author/Albert-Einstein',
 'http://quotes.toscrape.com//author/Jane-Austen',
 'http://quotes.toscrape.com//author/Marilyn-Monroe',
 'http://quotes.toscrape.com//author/Albert-Einstein',
 'http://quotes.toscrape.com//author/Andre-Gide',
 'http://quotes.toscrape.com//author/Thomas-A-Edison',
 'http://quotes.toscrape.com//author/Eleanor-Roosevelt',
 'http://quotes.toscrape.com//author/Steve-Martin']

On executing the above code block we've get the required output which is stored in the variable **all_authors_links**.


## Defining a function that will scrape a single page

This function **scrape_single_page** combines all the above defined functions into a single code block. Its output can be saved into a pandas dataframe of CSV Format  

In [36]:
def scrape_single_page(url):
    '''This function takes a single URL as the input and returns multiple lists that store all the data that we require '''
    quotes_tags = get_quote_divtags(get_doc(url))
    quotes_list= get_quotes(quotes_tags)

    authors_list= get_all_authors(quotes_tags)
    topic_tags_list= get_all_topic_tags(quotes_tags)
    links_list= get_all_authors_links(quotes_tags)
    return [quotes_list, authors_list, topic_tags_list, links_list]

It returns a code block in  the form of a `primary-list` that contains four `sub-lists` as defined by us. 

## Defining a function that will scrape multiple pages

This function makes use of the **scrape_single_page** that we defined earlier. Apart from that here we make use of a recursive for loop that scrapes the web pages based on the user input. 
For example, if the input value is 6, we get scraped data from six consecutive web pages.  

In [37]:
# topic url was defined before
def scrape_multiple_pages(total_pages):
    base_url = "http://quotes.toscrape.com/"
    all_quotes, all_authors, all_tags, all_urls = [], [], [], []
    for i in range(total_pages):
        url = base_url+"page/" + str(i+1)
        print(url)
        quotes, authors, quotes_tags, urls = scrape_single_page(url)
        all_quotes.extend(quotes)
        all_authors.extend(authors)
        all_tags.extend(quotes_tags)
        all_urls.extend(urls)
    return [all_quotes, all_authors, all_tags, all_urls]

In [38]:
quotes, authors, quotes_tags, urls = scrape_single_page("http://quotes.toscrape.com/page/2/")


In [39]:
all_quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [40]:
no_of_pages= int(input("Enter the number of pages you want to scrap(atmost 10 pages)"))

Enter the number of pages you want to scrap(atmost 10 pages)10


In [41]:

[all_quotes,all_authors,all_tags,all_urls]=scrape_multiple_pages(no_of_pages)

http://quotes.toscrape.com/page/1
http://quotes.toscrape.com/page/2
http://quotes.toscrape.com/page/3
http://quotes.toscrape.com/page/4
http://quotes.toscrape.com/page/5
http://quotes.toscrape.com/page/6
http://quotes.toscrape.com/page/7
http://quotes.toscrape.com/page/8
http://quotes.toscrape.com/page/9
http://quotes.toscrape.com/page/10


Lets check the first 20 authors from the list that contains all authors

In [42]:
all_authors[:20]

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin',
 'Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

## Create CSV file(s) with the extracted information

We now have all the data that was required by us. Let's now build a DataFrame with Pandas to store the data we have, nicely into a tabular format.

![image.png](https://media.vlpt.us/images/odobenus/post/90f13824-5ca3-46f3-b0bc-a0473fe44e26/Screen%20Shot%202021-09-05%20at%202.49.47%20PM.png)

In [43]:
!pip install pandas --upgrade --quiet

In [44]:
import pandas as pd

In [45]:
quotes = pd.DataFrame({
'Quote': all_quotes,
'Author': all_authors,
'Tags': all_tags,
'About Author': all_urls,
})

In [46]:
quotes

Unnamed: 0,Quote,Author,Tags,About Author
0,“The world as we have created it is a process ...,Albert Einstein,change deep-thoughts thinking world,http://quotes.toscrape.com//author/Albert-Eins...
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,abilities choices,http://quotes.toscrape.com//author/J-K-Rowling
2,“There are only two ways to live your life. On...,Albert Einstein,inspirational life live miracle miracles,http://quotes.toscrape.com//author/Albert-Eins...
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,aliteracy books classic humor,http://quotes.toscrape.com//author/Jane-Austen
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,be-yourself inspirational,http://quotes.toscrape.com//author/Marilyn-Monroe
...,...,...,...,...
95,“You never really understand a person until yo...,Albert Einstein,better-life-empathy,http://quotes.toscrape.com//author/Harper-Lee
96,“You have to write the book that wants to be w...,André Gide,books children difficult grown-ups write write...,http://quotes.toscrape.com//author/Madeleine-L...
97,“Never tell the truth to people who are not wo...,Thomas A. Edison,truth,http://quotes.toscrape.com//author/Mark-Twain
98,"“A person's a person, no matter how small.”",Eleanor Roosevelt,inspirational,http://quotes.toscrape.com//author/Dr-Seuss


In [47]:
quotes.to_csv('quotes.csv', header=False, index=False)

In [None]:
jovian.commit(outputs="quotes.csv")

<IPython.core.display.Javascript object>

## Summary
So here’s an entire gist of what this Web Scraping Project comprises of:
1. Download and save a copy of the website https://quotes.toscrape.com/ using the Requests Library. 
2. We then used the Beautiful soup library to skim through and analyse the HTML source code.
3. We then define the following functions `get_doc` `get_quote_divtags` `get_quotes` `get_all_authors` `get_all_topic_tags` `get_all_authors_links`  `scrape_single_page` `scrape_multiple_pages`
4. We then save all data gathered from these functions into a data frame which is later on converted to a .csv file by making use of the pandas library.
5. We then analyse the data before making any final edits like adding or removing coloumns, rows, null or repeating values. 
6. Hope you enjoyed going through this project!
   Thank you for your valuable time.

## Future Works
- Will work on a Exploratory Data Analysis Project.
- Will work on making changes to this project for better based on users feedback. 

## References
1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python
2. https://www.learndatasci.com/tutorials/applied-introduction-to-numpy-python-tutorial/
3. https://www.analyticsvidhya.com/blog/2021/08/beginners-web-scraping-project-web-scraping-subreddit-step-by-step/
4. https://quotes.toscrape.com/
5. https://www.crummy.com/software/BeautifulSoup/bs4/doc/