# Scraping Your Favorite Quotes from BrainyQuote using Python



![banner_image](https://i.imgur.com/9rLptlu.png)


### Web Scraping

Web scraping is the extraction of information from web pages, typically in an automated fashion. There are several approaches to accomplish this. In this project, I will demonstrate how to use Python to scrape information from a website that hosts thousands of quotes. The method I outline relies primarily on the Python libraries `requests` and `BeautifulSoup`. The result is one master function (with several underlying helper functions) that takes the topic of interest as its argument and writes a CSV file containing all the quotes on that topic, together with their authors and a direct link to each quote.


### BrainyQuote
The website *BrainyQuote* claims to be the world's largest quotation site, and indeed forms an extensive reservoir of quotes. As stated on its website:

*Originally published in 2001, BrainyQuote is one of the oldest and most established quotation sites on the web. Our site was built from scratch into the behemoth it is today. In the beginning, we used library books to enter famous quotations by hand. Armed with eyedrops and comfy wrist-rests at our computers, we typed, and typed, and typed! Today, you can enjoy the fruits of our labors; we are a shining example of the little engine that could.*

Despite the large amounts of data that can be harnessed to provide novel insights, **quotes** remain a powerful way of capturing the essence of a phenomenon in a concise and appealing way. For that reason, book authors often use one or several quotes to start a chapter. A quote, in essence, consists of two parts: the exact quote, and the author of that quote. Although in theory a good quote stands on its own, in practice it is the combination of *what* is said and *who* said it that makes a quote powerful. Therefore, in this project we will extract both the exact quote and the person to whom it is attributed (the author).





On [BrainyQuote.com](https://www.brainyquote.com/), quotes are categorized by author, by topic, and there are also the options to view the quote of the day or to use the search bar, as shown below:

![site_outline](https://i.imgur.com/2l0ujWp.png)


BrainyQuote is a great resource for browsing through quotes. However, it can be valuable to collect quotes for documentation, inspiration, or for further analysis. On the page https://www.brainyquote.com/topics an overview of all the available topics on the site can be found. In this project we will use *web scraping* to extract a subset of quotes of interest from this site using the Python libraries [Requests](https://docs.python-requests.org/en/latest/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

The goal of this project is: to use web scraping to download all the quotes that belong to a certain topic. As an example, we will focus on the topic 'motivational' in order to find out the steps to be taken. Afterwards, we will derive a set of functions that can subsequently be used to scrape any topic of interest.

The outline of the steps is given below:

1. Identify the webpages
2. Download a webpage using Requests
3. Use Beautiful Soup to parse the HTML source code
4. Extract author, quote text and url for each quote on the page
5. Collect the downloaded data into Python lists
6. Extract and combine data from multiple pages
7. Create CSV file with the extracted information


Ultimately, the results will be exported to a CSV file in the following format:

```
author, quote, url
St. Jerome, Good, better, best. Never let it rest. 'Til your good is better and your better is best., https://www.brainyquote.com/quotes/st_jerome_389605?src=t_motivational
Charles R. Swindoll, Life is 10% what happens to you and 90% how you react to it., https://www.brainyquote.com/quotes/charles_r_swindoll_388332?src=t_motivational
```




### How to Run the Code
In order to execute the code, please use the "Run" button at the top of this page and select "Run on Binder". You can edit the notebook and save a personal version to [Jovian](https://www.jovian.ai) by executing the cells below:

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="snoek-quotes-scraping")


## 1. Identify the webpages

On the page https://www.brainyquote.com/topics all topics are listed. For this project we will select quotes from the topic 'motivational':

![motivational](https://i.imgur.com/5vbdIDP.png)

By inspecting the url of the resulting page, we find that the url is structured in the following way:

`https://www.brainyquote.com/topics/motivational-quotes`

Therefore, we will save the topic in a variable, which can then be used to construct the main url of the target page:

In [None]:
topic = 'motivational'

In [None]:
# generation of main url of the to be scraped page
main_url = 'https://www.brainyquote.com/topics/' + topic + '-quotes'

In [None]:
# view the url
main_url

By visiting the above url, we notice that the topic is spread over 5 pages in total.


![subpages](https://i.imgur.com/kSWnMGt.png)

In addition to the main page, these are:

https://www.brainyquote.com/topics/motivational-quotes_2

https://www.brainyquote.com/topics/motivational-quotes_3

https://www.brainyquote.com/topics/motivational-quotes_4

https://www.brainyquote.com/topics/motivational-quotes_5


We will save the number of subpages in a variable:

In [None]:
# enter the number of subpages
nr_of_subpages = 5

We will use this knowledge to create a list of URLs to be scraped.

In [None]:
# initialize a list 
urls = [main_url]
base_url = main_url

for i in range(2, nr_of_subpages + 1):
    url = base_url + '_' + str(i)
    urls.append(url)

urls

Let's put this in a function. The function asks the user to check and then enter the number of subpages for the topic of interest.

In [None]:
def get_topic_pages(topic):
    # the main_url (i.e. the first page with quotes on the topic of interest) is generated
    main_url = 'https://www.brainyquote.com/topics/' + topic + '-quotes'
    
    # the user is asked for input
    nr_of_subpages = input("Enter the number of subpages of {}".format(main_url))
    
    # we initialize a list of urls starting off with the main_url. The main_url is also used as a base_url
    urls = [main_url]
    base_url = main_url
    
    # we iterate over the number of subpages to generate the urls which we will scrape
    for i in range(2, int(nr_of_subpages) + 1):
        url = base_url + '_' + str(i)
        urls.append(url)
    return urls

Let's check whether the function works properly

In [None]:
get_topic_pages('motivational')

Indeed, the function returns the correct urls which we will scrape.
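
If prompting for input is inconvenient (for example when the code runs unattended), the page count can be passed in as an argument instead. The sketch below is a hypothetical variant, `build_topic_urls`, that is not part of the original notebook; it assumes the `_2`, `_3`, ... URL pattern observed above holds for the chosen topic.

```python
def build_topic_urls(topic, nr_of_pages):
    """Return the list of page URLs for a topic (hypothetical helper).

    Assumes the URL pattern observed above: the first page has no suffix,
    and page i (for i >= 2) ends in '_i'.
    """
    if nr_of_pages < 1:
        raise ValueError("nr_of_pages must be at least 1")
    main_url = 'https://www.brainyquote.com/topics/' + topic + '-quotes'
    # page 1 is the main URL; pages 2..n get a numeric suffix
    return [main_url] + [main_url + '_' + str(i) for i in range(2, nr_of_pages + 1)]


example_urls = build_topic_urls('motivational', 5)
print(len(example_urls))
print(example_urls[-1])
```

Keeping the URL construction free of `input` also makes it easy to test in isolation.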

## 2. Download a web page using `requests`

Before we can access the information on the website, the website needs to be downloaded. We will use the [`requests`](https://docs.python-requests.org/en/master/) library to download the web page. 

The library can be installed using `pip`, Python's package installer, and subsequently imported:

In [None]:
# Install the library
!pip install requests --upgrade --quiet

In [None]:
# Import the library
import requests

The library is now installed and imported.

We will first focus on collecting the quotes from the first page, and store its URL in a variable:

In [None]:
topic_url = urls[0]

The function `requests.get` returns a response object containing the data from the web page and some other information. We will save this in a variable called `response`:

In [None]:
response = requests.get(topic_url)

To check whether the response was successful, we access the `.status_code` property of the response object. A successful response will yield an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.

In [None]:
response.status_code

The request was successful.
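
The 200-299 rule can be written as a small check. The helper below is a hypothetical illustration (the name `is_success` does not come from `requests`); it uses the standard-library `http.HTTPStatus` enum only to look up the descriptive phrase that belongs to a code.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for status codes in the 2xx range (hypothetical helper)."""
    return 200 <= status_code <= 299


for code in (200, 204, 404, 500):
    # HTTPStatus maps a numeric code to its standard reason phrase
    print(code, HTTPStatus(code).phrase, is_success(code))
```

With a real response object, the same check would read `is_success(response.status_code)`.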

Access the contents of the web page using the `.text` property of the `response` object:

In [None]:
page_content = response.text

We will check out the page length (i.e. number of characters on the page)

In [None]:
len(page_content)

Since the page contains over 60,000 characters, we will limit the view to the first 500 characters:

In [None]:
page_content[:500]

The above shows us the [HTML source code](https://nl.wikipedia.org/wiki/HyperText_Markup_Language) of the web page.

We can write the page content to a file, which then allows us to view the page locally within Jupyter using "File > Open":

In [None]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

When opening the downloaded page, one would typically see the original page with none of the links working. In this case we see that the downloaded page does not load as the original:

![html page](https://i.imgur.com/In5gXD6.png)

This is likely caused by the presence of advertisements on the page that are dynamically loaded.

In this section we used the `requests` library to download a web page as HTML.

## 3. Use `BeautifulSoup` to parse the HTML source code

Now that we have downloaded the web page, the next step is to locate the information we require within the HTML code of the site.

We will use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to extract information from the HTML source code. First we will install the library and then import the `BeautifulSoup` class from the `bs4` module.

In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

We will create a `BeautifulSoup` object that will contain the parsed content of the page

In [None]:
doc = BeautifulSoup(page_content, 'html.parser')

In [None]:
type(doc)

The power of Beautiful Soup is that the resulting `doc` object possesses several properties and methods to extract data from the HTML document. For example, we can extract the title of the page using `doc.title`:

In [None]:
title_tag = doc.title

In [None]:
title_tag

We can obtain the title by extracting the text of the tag using `.text`

In [None]:
title_tag.text

In addition, when a certain tag occurs more than once in the document (for example an `img` tag) we can use `doc.img` to find the first occurrence of this tag:

In [None]:
first_image = doc.img

In [None]:
first_image

A very powerful usage of Beautiful Soup is to find all the tags of the same type within `doc` using the `find_all` method. This will be demonstrated in the next section of this project.

Let's put this in a function:

In [None]:
def download_and_parse(topic_url):
    '''Download a web page and return a Beautiful Soup doc'''
    
    # download the page
    response = requests.get(topic_url)
    
    # check if download was successful; raise an exception if not
    if response.status_code != 200:
        raise Exception("loading problem with {}".format(topic_url))
    
    # get the page HTML
    page_content = response.text
    
    # create a bs4 doc
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

Let us check whether the function returns the same output as we found above:

In [None]:
doc2 = download_and_parse('https://www.brainyquote.com/topics/motivational-quotes')

In [None]:
doc.title == doc2.title

We can now use the function `download_and_parse` to download any web page and parse it using Beautiful Soup.

## 4. Extract author, quote text and URL for each quote on the page

As mentioned before, the information we require can be found by inspecting the HTML code of the page. The basic structure of an HTML document consists of tags, such as `html`, `head`, `body` and `title` tags. In essence, tags mark the beginning and end of an *element*. For example, a `title` element consists of the opening tag `<title>`, followed by the content, and closes with `</title>`, as we can also see by inspecting the first few lines of the downloaded page (webpage.html) with Notepad Plus: ![html_in_Notepad](https://i.imgur.com/G8bFQwi.png)

Here we see the following:
- `<!DOCTYPE html>`: this is the document type declaration, and tells the browser the type of HTML that is being used (in this case: HTML5)
- `<html lang="en">`: this `html` opening tag indicates that the page is written in HTML. The corresponding closing tag (`</html>`) can be found at the very end of the page:

![end_of_page](https://i.imgur.com/WDrFf8u.png)

- `<head>`: the `head` tag indicates the beginning of the section that contains information *about* the page that will not appear in the browser
- `<title>Motivational Quotes - BrainyQuote</title>`: within the `head` section the `title` element is found, which specifies the title of the page (e.g. the title that is displayed in the title bar of the browser window).

Although most opening tags are followed by closing tags, there are also tags that do not require a closing tag, such as `img` and `br` tags. As a rule of thumb, tags with content between them should be closed. Taken together, the basic structure of most elements is the following: opening tag - content - closing tag. We can use this knowledge to find the information we are looking for.

Tags can have *attributes*, the function of which is to modify the behavior or the display of the element. The attributes are located inside the opening tag and the values are specified within quotation marks. Examples of attributes are: `href` (used within `a` tags), `id` (can be added to almost all tags), `src` (to be used with `img` tags), `style` and `class`.
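
To make tags and attributes concrete, the sketch below parses a small hand-written HTML fragment with Beautiful Soup. The fragment, the `demo` and `link` names, and the attribute values are made up for illustration and are not taken from BrainyQuote.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only; it is not taken from BrainyQuote
html = '<p><a href="/topics" id="topics-link" class="nav big">All topics</a><img src="banner.png"></p>'
demo = BeautifulSoup(html, 'html.parser')

link = demo.a
print(link.name)        # the tag name
print(link.text)        # the content between the opening and closing tag
print(link['href'])     # the value of a single attribute
print(link['class'])    # class can hold several values, so it is returned as a list
print(link.attrs)       # all attributes as a dictionary

print(demo.img['src'])  # img has attributes but no closing tag and no content
```

Note that `class` comes back as a list of values; this is why a tag with several classes can be found by searching for just one of them.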

### Author Names

Let us first focus on the author names. We will navigate to the first page containing quotes in Chrome (https://www.brainyquote.com/topics/motivational-quotes), locate a quote, hover over the author name, right-click, and then select the "Inspect" option:

![inspect_element](https://i.imgur.com/AQBhDQn.png)

Now we can inspect in detail the HTML code of the part of the page that displays the author name. This teaches us that the author name is within an `a` tag:

![html_author](https://i.imgur.com/cglxzRf.png)

`a` tags are so-called 'anchor' tags, and are used to define links. As we see here, the tag contains an `href` attribute that specifies the destination of the link (i.e.: "/authors/st-jerome-quotes"). In this case, the author name on the page is indeed a clickable link, that re-directs us to the page where all the quotes of this particular author (St. Jerome) are grouped:
![St_Jerome_page](https://i.imgur.com/rkKFSOD.png)

As a first attempt to obtain the tags containing the author names, let us collect all `a` tags from the page:

In [None]:
a_tags = doc.find_all('a')

We can check how many tags we have collected:

In [None]:
len(a_tags)

We have collected 195 `a` tags. This is more than there are quotes on the page. Let us inspect the HTML code further by looking at the underlying code (using the "element" button) for three authors:

`<a href="/authors/st-jerome-quotes" class="bq-aut qa_389605 oncl_a" title="view author">St. Jerome</a>`

`<a href="/authors/charles-r-swindoll-quotes" class="bq-aut qa_388332 oncl_a" title="view author">Charles R. Swindoll</a>`

`<a href="/authors/walt-disney-quotes" class="bq-aut qa_130027 oncl_a" title="view author">Walt Disney</a>`

It seems that author names are embedded in `a` tags whose `class` attribute starts with `bq-aut`, followed by a quote-specific value such as `qa_389605`. We will select the `a` tags that have the class `bq-aut` and then check the number of collected tags:

In [None]:
author_tags = doc.find_all('a', class_ = "bq-aut")

In [None]:
len(author_tags)

Browsing through the first page (out of 5), there indeed seem to be 60 quotes per page. Let us check out the first five author tags:

In [None]:
author_tags[:5]

We can extract the author name out of the tag using ".text":

In [None]:
author_tags[0].text

Now we can write a function that collects all the author names from a page:

In [None]:
def get_author_info(doc):
    author_tags = doc.find_all('a', class_ = "bq-aut")
    authors = [tag.text for tag in author_tags]
    return authors

Let us verify the function by displaying the first five results:

In [None]:
authors = get_author_info(doc)
authors[:5]

We can also check the number of authors

In [None]:
len(authors)

The function appears to return 60 author names

### Quotes

Analogous to the method followed for the author names, we use the "Inspect" function in the browser to investigate the HTML code underlying the quotes:

![inspect_quote](https://i.imgur.com/M8hGO7B.png)


It appears that the quote is embedded in a so-called `div` tag, which is itself embedded within an `a` tag:

`<div style="display: flex;justify-content: space-between">`

Note that `div` tags, unlike most other tags, do not carry a particular meaning. `div` (division) elements are basically used to group larger pieces of content together, and by default the browser places a line break before and after them.

We will try to collect these `div` tags directly, by specifying their `style` attribute as well:

In [None]:
quote_tags = doc.find_all('div', {'style': 'display: flex;justify-content: space-between'})

In [None]:
len(quote_tags)

Here, too, we get about 60 hits. Let's verify that these tags indeed contain quotes:

In [None]:
quote_tags[1].text

To get rid of the newline characters (\n) we have to apply the strip method to the tags:

In [None]:
quote_tags[1].text.strip()

We create a function to collect all the quotes in a list:

In [None]:
def get_quote_info(doc):
    quote_tags = doc.find_all('div', {'style': 'display: flex;justify-content: space-between'})
    quotes = [tag.text.strip() for tag in quote_tags]
    return quotes

We check that the function works by displaying the first three quotes:

In [None]:
quotes = get_quote_info(doc)
quotes[:3]

In addition we check the number of quotes collected

In [None]:
len(get_quote_info(doc))

The function indeed appears to return 60 quotes.

### URLs

In order to collect the URL leading to the quote, we can take advantage of the inspection of the HTML code we carried out for the retrieval of the quotes (see above). We already noticed that the link to a page displaying the quote is embedded within an `a` tag:

`<a href="/quotes/st_jerome_389605?src=t_motivational" class="b-qt qt_389605 oncl_q" title="view quote">`


The `class` attribute is used for layout and styling. Note that since `class` is a reserved keyword in Python, we have to use `class_` here in order to extract the `a` tags from the class 'b-qt':

In [None]:
link_tags = doc.find_all('a', class_ = "b-qt")

In [None]:
len(link_tags)

The number seems to be correct. Let us inspect the first tag:

In [None]:
link_tags[0]

We can obtain the part of the tag that leads to the page with the quote by accessing the `href` attribute:

In [None]:
link_tags[0]['href']

And then we can re-create the full URL:

In [None]:
# specify the base_url
base_url = 'https://www.brainyquote.com'

# then generate the url of the first quote
topic0_url = base_url + link_tags[0]['href']

In [None]:
topic0_url

Clicking the link leads us to a page displaying the quote, a nice background as well as additional information:

![quote_example](https://i.imgur.com/b5Wszxp.png)

We can put the above in a function that scrapes the URL of each quote and collects them in a list:

In [None]:
def get_link_info(doc):
    link_tags = doc.find_all('a', class_ = "b-qt")
    base_url = 'https://www.brainyquote.com'
    quote_urls = [base_url + tag['href'] for tag in link_tags]
    return quote_urls

We can verify that the function works:

In [None]:
quote_urls = get_link_info(doc)
quote_urls[:3]

In [None]:
len(quote_urls)

The function returns 60 URLs, each leading to a quote page.
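
Since the quote `div` sits inside the `a` tag that carries the link, a possible alternative is to read both the quote text and the URL from the same tag, which keeps each quote paired with its own link. The sketch below runs on hand-written HTML modelled on the two tags shown above; whether the live page contains nothing but the quote text inside that `a` tag is an assumption that would need to be checked. It also uses `urljoin` from the standard library instead of string concatenation to build the full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hand-written HTML modelled on the tags shown above (not the live page)
html = '''
<a href="/quotes/st_jerome_389605?src=t_motivational" class="b-qt qt_389605 oncl_q" title="view quote">
<div style="display: flex;justify-content: space-between">
Good, better, best. Never let it rest.
</div>
</a>
<a href="/quotes/charles_r_swindoll_388332?src=t_motivational" class="b-qt qt_388332 oncl_q" title="view quote">
<div style="display: flex;justify-content: space-between">
Life is 10% what happens to you and 90% how you react to it.
</div>
</a>
'''
demo = BeautifulSoup(html, 'html.parser')

pairs = []
for tag in demo.find_all('a', class_='b-qt'):
    quote_text = tag.text.strip()  # text of the nested div, without newlines
    quote_url = urljoin('https://www.brainyquote.com', tag['href'])
    pairs.append((quote_text, quote_url))

for quote_text, quote_url in pairs:
    print(quote_text, '->', quote_url)
```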

## 5. Collect the downloaded data into Python lists

We can now generate a function that for a single page generates a doc file, and then collects the required data (i.e. authors, quotes, and URLs). This function will call the functions we have defined above.

In [None]:
def get_quotes_per_page(url):
    doc = download_and_parse(url)
    authors = get_author_info(doc)
    quotes = get_quote_info(doc)
    quote_urls = get_link_info(doc)
    return authors, quotes, quote_urls

Let us verify the function works:

In [None]:
test_quotes_per_page = get_quotes_per_page('https://www.brainyquote.com/topics/motivational-quotes')

In [None]:
test_quotes_per_page[0][:5]

In [None]:
len(test_quotes_per_page)

We have collected 3 lists, let us verify the length and first three items of each:

In [None]:
print('The first list contains {} items, the first 3 are:'.format(len(test_quotes_per_page[0])), test_quotes_per_page[0][:3])

In [None]:
print('The second list contains {} items, the first 3 are:'.format(len(test_quotes_per_page[1])), test_quotes_per_page[1][:3])

In [None]:
print('The third list contains {} items, the first 3 are:'.format(len(test_quotes_per_page[2])), test_quotes_per_page[2][:3])

We have verified that the function `get_quotes_per_page` creates a doc object for a single page, and then collects the required data (i.e. authors, quotes, and URLs) as three separate lists.
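
Because the three lists are matched up by position when the table is built in step 7, it can be worth confirming that they have the same length before combining them. The helper below, `check_aligned`, is a hypothetical addition and not part of the original notebook.

```python
def check_aligned(authors, quotes, quote_urls):
    """Raise ValueError if the three lists differ in length (hypothetical helper)."""
    lengths = {
        'authors': len(authors),
        'quotes': len(quotes),
        'quote_urls': len(quote_urls),
    }
    # all three lengths must collapse to a single distinct value
    if len(set(lengths.values())) != 1:
        raise ValueError('list lengths differ: {}'.format(lengths))
    return lengths['authors']


n_items = check_aligned(
    ['St. Jerome'],
    ['Good, better, best. Never let it rest.'],
    ['https://www.brainyquote.com/quotes/st_jerome_389605?src=t_motivational'],
)
print(n_items)
```

A call such as `check_aligned(*get_quotes_per_page(url))` would apply the same check to a scraped page.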

## 6. Extract and combine data from multiple pages

In order to collect all the quotes belonging to a certain topic, we have to ensure all subpages are scraped in turn. To this end we define a function that first calls the function `get_topic_pages` to generate the set of to-be-scraped urls, and then for each url calls the function `get_quotes_per_page`, which we defined in the previous section. Note that this function also calls the function `get_output_file`, which we will define in the next section.

In [None]:
def scraping(topic):
    # empty lists are initialized in which all authors, quotes, and quote urls will be collected
    all_authors, all_quotes, all_quote_urls  = [], [], []
    
    # call 'get_topic_pages' to obtain a set of to-be-scraped URLs
    urls = get_topic_pages(topic)
    
    # loop over the URLs
    for url in urls:
        # for each URL collect the lists of authors, quotes and URLs
        authors, quotes, quote_urls = get_quotes_per_page(url)
        
        # add the collected lists of data to the master lists (i.e. all_authors, all_quotes, all_quote_urls)
        all_authors += authors
        all_quotes += quotes
        all_quote_urls += quote_urls
        
    # write all the collected data to a csv file by calling the function 'get_output_file'
    get_output_file(all_authors, all_quotes, all_quote_urls, topic) 

## 7. Create CSV file with the extracted information

Now that we have collected all the relevant data, we will use the Pandas library in order to create a dataframe from our collected data. First we install and import pandas:

In [None]:
!pip install pandas --quiet

In [None]:
import pandas as pd

The library has now been installed and imported. Now we create a dictionary with 'author', 'quote' and 'url' as keys, and the collected data (in lists) as values.

In [None]:
quotes_dict = {
    'author': authors,
    'quote': quotes,
    'url': quote_urls
}

The dictionary we have created will now be converted into a DataFrame:

In [None]:
quotes_df = pd.DataFrame(quotes_dict)

Let us inspect the first five rows of the dataframe:

In [None]:
quotes_df.head(5)

Finally, we ensure the topic of interest is included in the filename, and then we write the dataframe to a CSV file:

In [None]:
filename = topic + '-quotes.csv'

# "index=False" in order to not include the row numbers in the file:
quotes_df.to_csv(filename, index=False)

Let us inspect the first lines of the CSV file:

In [None]:
!head motivational-quotes.csv

We have now reached the final goal of the project, i.e. we have exported the results to a CSV file in the desired format.

Let's put this in a function:

In [None]:
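def get_output_file(authors, quotes, quote_urls, topic):
    # the topic is passed in explicitly, matching the call in 'scraping'
    filename = topic + '-quotes.csv'
    
    quotes_dict = {
        'author': authors,
        'quote': quotes,
        'url': quote_urls}
    quotes_df = pd.DataFrame(quotes_dict)
    # index=False leaves the row numbers out of the file
    quotes_df.to_csv(filename, index=False)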
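
One detail of the CSV output is worth knowing: many quotes contain commas, so they cannot be written to a comma-separated file as-is. The sketch below uses the standard-library `csv` module (not pandas) to show how such a field is wrapped in double quotes and read back intact. pandas' `to_csv` defaults to the same minimal-quoting rule, so the real file should contain quoted fields where the simplified format shown earlier does not.

```python
import csv
import io

rows = [
    ('St. Jerome', "Good, better, best. Never let it rest. 'Til your good is better and your better is best."),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # the default quoting is csv.QUOTE_MINIMAL
writer.writerow(['author', 'quote'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][1])
```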

## Summary

Here's what we have covered:

1. Identify the webpages
2. Download a webpage using `requests`
3. Use `Beautiful Soup` to parse the HTML source code
4. Extract author, quote text and url for each quote on the page
5. Collect the downloaded data into Python lists
6. Extract and combine data from multiple pages
7. Create CSV file with the extracted information

The CSV file we created has this format:

```
author, quote, url
St. Jerome, Good, better, best. Never let it rest. 'Til your good is better and your better is best., https://www.brainyquote.com/quotes/st_jerome_389605?src=t_motivational
Charles R. Swindoll, Life is 10% what happens to you and 90% how you react to it., https://www.brainyquote.com/quotes/charles_r_swindoll_388332?src=t_motivational
```

Here is the complete code for this project:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
  

# This function collects all info from all urls belonging to a certain topic
def scraping(topic):
    all_authors, all_quotes, all_quote_urls  = [], [], []
    
    urls = get_topic_pages(topic)
    
    for url in urls:
        authors, quotes, quote_urls = get_quotes_per_page(url)
        all_authors += authors
        all_quotes += quotes
        all_quote_urls += quote_urls
        
    get_output_file(all_authors, all_quotes, all_quote_urls, topic)    



# This function generates the set of to-be-scraped urls
def get_topic_pages(topic):
    main_url = 'https://www.brainyquote.com/topics/' + topic + '-quotes'
    nr_of_subpages = input("Enter the number of subpages of {}".format(main_url))
    urls = [main_url]
    base_url = main_url
    for i in range(2, int(nr_of_subpages) + 1):
        url = base_url + '_' + str(i)
        urls.append(url)
    return urls

# This function should get the data from one page
def get_quotes_per_page(url):
    doc = download_and_parse(url)
    authors = get_author_info(doc)
    quotes = get_quote_info(doc)
    quote_urls = get_link_info(doc)
    return authors, quotes, quote_urls


# This function parses the sites and generates doc
def download_and_parse(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception("loading problem with {}".format(topic_url))
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc 
    
# Gets all the author info from the doc as a list    
def get_author_info(doc):
    author_tags = doc.find_all('a', class_ = "bq-aut")
    authors = [tag.text for tag in author_tags]
    return authors
  
# Gets all the quotes from the doc as a list    
def get_quote_info(doc):
    quote_tags = doc.find_all('div', {'style': 'display: flex;justify-content: space-between'})
    quotes = [tag.text.strip() for tag in quote_tags]
    return quotes

# Gets all the quote urls from the doc as a list 
def get_link_info(doc):
    link_tags = doc.find_all('a', class_ = "b-qt")
    base_url = 'https://www.brainyquote.com'
    quote_urls = [base_url + tag['href'] for tag in link_tags]
    return quote_urls

# Creates a dictionary, converts it to a df, and then writes output to csv file
def get_output_file(authors, quotes, quote_urls, topic):
    quotes_dict = {
        'author': authors,
        'quote': quotes,
        'url': quote_urls}
    quotes_df = pd.DataFrame(quotes_dict)
    quotes_df.to_csv(topic + '-quotes.csv', index=False)

Let us verify this for the topic 'knowledge'. We only have to call the master function `scraping` and provide 'knowledge' as the argument:

In [None]:
scraping('knowledge')

We notice the page `https://www.brainyquote.com/topics/knowledge-quotes` has 17 subpages, which we enter as input:
![knowledge](https://i.imgur.com/zRlhLRt.png)

Let us verify that a CSV file named 'knowledge-quotes.csv' has been generated and inspect its first lines:

In [None]:
!head knowledge-quotes.csv

Finally, we open the CSV file directly to verify all 17 pages have been scraped. Since there are 60 quotes per page, the file should contain 960 - 1020 quotes:

![knowledge_csv](https://i.imgur.com/zdwueVt.png)

Indeed, the CSV file contains 1001 lines, indicating 17 pages have been successfully scraped for the topic 'knowledge'.

We will call the function again for our original topic 'motivational':

In [None]:
scraping('motivational')

Again, we open the CSV file directly to verify all 5 pages have been scraped. Since there are 60 quotes per page, the file should contain 240 - 300 quotes:

![motivational](https://i.imgur.com/xVzebKZ.png)

Indeed, the CSV file contains 283 lines, indicating that five pages have been successfully scraped for the topic 'motivational'.

Now we can save this notebook together with the generated CSV files for 'motivational' and 'knowledge'.

In [None]:
# Execute this to save new versions of the notebook (including the csv files)
jovian.commit(files = ['motivational-quotes.csv', 'knowledge-quotes.csv'])

And of course we conclude this notebook with a [quote](https://www.brainyquote.com/quotes/nelson_mandela_378967?img=4&src=t_motivational):

![Mandela](https://i.imgur.com/wWA2yuO.png)

## Future Work

* We can now fetch individual topic pages and get all the quotes. Further refinement of this notebook could include an option to scrape quotes from a particular author, the quotes of the day, or even the quotes resulting from specific search queries.
* The current notebook requires user input (i.e. entering the number of subpages). The notebook could be further improved by automating this step as well.
* With the collected data, further analysis can be carried out. For example the most frequently used words per topic could be determined, and then it could be analyzed how this differs among topics.
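
As a starting point for the word-frequency idea, the sketch below counts the most frequent words in a list of quotes using `collections.Counter`. The function name `most_common_words` and the simple letters-and-apostrophes tokenization are assumptions for illustration; a real analysis would probably also remove common stop words such as "is" and "to".

```python
import re
from collections import Counter


def most_common_words(quotes, n=3):
    """Return the n most frequent lowercase words across a list of quotes."""
    words = []
    for quote in quotes:
        # keep runs of letters and apostrophes; digits and punctuation are dropped
        words.extend(re.findall(r"[a-z']+", quote.lower()))
    return Counter(words).most_common(n)


sample_quotes = [
    "Good, better, best. Never let it rest. 'Til your good is better and your better is best.",
    "Life is 10% what happens to you and 90% how you react to it.",
]
print(most_common_words(sample_quotes))
```

With a scraped file, the `quote` column of the DataFrame could be passed in as the list of quotes.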

## References

- [Jovian Web Scraping Tutorial](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis)
- [Requests documentation](https://docs.python-requests.org/en/latest/)
- [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
- [HTML Tutorials by HTML Dog](https://htmldog.com/guides/html/)
