# Webscraping
Main objective: scrape quote and author information from the web, load it into a pandas DataFrame, and export it to CSV and Excel files.

In this web scraping exercise, I will use the site below.

url: http://quotes.toscrape.com/

In [2]:
# Import libraries

import pandas as pd
import urllib.request
from bs4 import BeautifulSoup as bs
from urllib.error import HTTPError, URLError

## 1. Scrape data from the web

In [16]:
allQuoteData = []

baseUrl = "http://quotes.toscrape.com/"
curPage = baseUrl

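while True:
    print("Retrieving page: " + curPage)
    try:
        curPageContent = urllib.request.urlopen(curPage).read()
    except (HTTPError, URLError) as e:
        # Stop paginating if the listing page cannot be fetched
        print(curPage, " ", e)
        break

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs(curPageContent, "html.parser")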

    quoteBlocks = soup.find_all(class_="quote")
    
    for quoteBlock in quoteBlocks:        
        quoteDict = {}

        # Add quote
        quote = quoteBlock.find(class_="text").text
        quoteDict['Quote'] = quote

        # URL of the author's page
        authorUrl = baseUrl + 'author/' + quoteBlock.find("a").get("href").split('/')[2]

        # Read the author page; skip this quote if the page cannot be fetched
        try:
            authorInfo = urllib.request.urlopen(authorUrl).read()
        except (HTTPError, URLError) as e:
            print(curPage, " ", authorUrl, " ", e)
            continue

        authorSoup = bs(authorInfo, "html.parser")

        # Author name
        author = authorSoup.find(class_ = 'author-title').text
        quoteDict['Author name'] = author

        # Author date of birth
        authorDob = authorSoup.find(class_ = 'author-born-date').text
        quoteDict['Author DOB'] = authorDob

        # Author birthplace (remove only the first "in ", which is the leading one)
        authorPlace = authorSoup.find(class_ = 'author-born-location').text
        quoteDict['Author born location'] = authorPlace.replace('in ', '', 1)

        # Author bio
        authorDesc = authorSoup.find(class_ = 'author-description').text
        quoteDict['Author description'] = authorDesc.strip()

        allQuoteData.append(quoteDict)

    # Stop when there is no "next" link, i.e. on the last page
    nextLink = soup.find(class_="next")
    if nextLink is None:
        break

    # Drop the leading "/" from the href so it can be appended to baseUrl
    curPage = baseUrl + nextLink.find('a').get('href')[1:]



Retrieving page: http://quotes.toscrape.com/
Retrieving page: http://quotes.toscrape.com/page/2/
Retrieving page: http://quotes.toscrape.com/page/3/
Retrieving page: http://quotes.toscrape.com/page/4/
Retrieving page: http://quotes.toscrape.com/page/5/
Retrieving page: http://quotes.toscrape.com/page/6/
Retrieving page: http://quotes.toscrape.com/page/7/
Retrieving page: http://quotes.toscrape.com/page/8/
Retrieving page: http://quotes.toscrape.com/page/9/
Retrieving page: http://quotes.toscrape.com/page/10/
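
The loop above builds each URL by slicing and splitting the `href` strings by hand. As an alternative sketch (not what the notebook does), `urllib.parse.urljoin` from the standard library resolves a relative `href` against a base URL. The sample `href` values below are assumptions inferred from the string handling in the loop, not values fetched from the site.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"

# Assumed href values, inferred from the slicing and splitting in the loop
next_href = "/page/2/"
author_href = "/author/Albert-Einstein"

# A root-relative href replaces the path of the base URL
next_page = urljoin(base_url, next_href)

# Resolving against a later listing page still gives a site-rooted URL
author_url = urljoin("http://quotes.toscrape.com/page/2/", author_href)

print(next_page)
print(author_url)
```

With `urljoin`, the `[1:]` slice and the `split('/')[2]` lookup are no longer needed, and the result stays valid if the site ever switches to absolute links.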


## 2. Load the data into a DataFrame

In [17]:
headers = allQuoteData[0].keys()
quoteDf = pd.DataFrame(allQuoteData, columns=headers)

# Check the size of the dataframe
quoteDf.shape

# Check if we have the right information in the dataframe
quoteDf.head(10)

| | Quote | Author name | Author DOB | Author born location | Author description |
|---|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein\n | March 14, 1879 | Ulm, Germany | In 1879, Albert Einstein was born in Ulm, Germ... |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling\n | July 31, 1965 | Yate, South Gloucestershire, England, The Unit... | See also: Robert GalbraithAlthough she writes ... |
| 2 | “There are only two ways to live your life. On... | Albert Einstein\n | March 14, 1879 | Ulm, Germany | In 1879, Albert Einstein was born in Ulm, Germ... |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen\n | December 16, 1775 | Steventon Rectory, Hampshire, The United Kingdom | Jane Austen was an English novelist whose work... |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe\n | June 01, 1926 | The United States | Marilyn Monroe (born Norma Jeane Mortenson; Ju... |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein\n | March 14, 1879 | Ulm, Germany | In 1879, Albert Einstein was born in Ulm, Germ... |
| 6 | “It is better to be hated for what you are tha... | André Gide\n | November 22, 1869 | Paris, France | André Paul Guillaume Gide was a French author ... |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison\n | February 11, 1847 | Milan, Ohio, The United States | Thomas Alva Edison was an American inventor, s... |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt\n | October 11, 1884 | The United States | Anna Eleanor Roosevelt was an American politic... |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin\n | August 14, 1945 | Waco, Texas, The United States | Stephen Glenn "Steve" Martin is an American ac... |
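
The `Author name` values in the preview end with a newline (`\n`) left over from the page markup. One way to clean this up, sketched here in plain Python on a hypothetical sample record shaped like an entry of `allQuoteData`, is to strip whitespace from every string field before building the DataFrame. The helper name `strip_fields` is an assumption made for illustration.

```python
def strip_fields(record):
    """Return a copy of record with surrounding whitespace removed from string values."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical sample shaped like one entry of allQuoteData
sample = [
    {
        "Quote": "“A sample quote.”",
        "Author name": "Albert Einstein\n",
        "Author DOB": "March 14, 1879",
    },
]

cleaned = [strip_fields(record) for record in sample]
print(repr(cleaned[0]["Author name"]))
```

In the notebook, the equivalent step would be `allQuoteData = [strip_fields(q) for q in allQuoteData]` just before the `pd.DataFrame` call.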


## 3. Export the DataFrame to CSV and Excel files

In [18]:
quoteDf.to_csv("../data/raw/Quotes.csv", encoding='utf-8-sig')
quoteDf.to_excel("../data/raw/Quotes.xlsx")

print('Successfully converted data to CSV, xlsx')

Successfully converted data to CSV, xlsx
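
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below compares the default with `index=False` on a small made-up DataFrame; it writes to in-memory buffers rather than to `../data/raw/`, and the sample rows are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped data
df = pd.DataFrame(
    [
        {"Quote": "“A sample quote.”", "Author name": "Jane Austen"},
        {"Quote": "“Another sample quote.”", "Author name": "André Gide"},
    ]
)

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index is a design choice: for this dataset it is only a row counter, so passing `index=False` to `to_csv` (and to `to_excel`, which accepts the same parameter) would give cleaner files.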


## 4. Final cell: all steps combined

In [20]:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup as bs
from urllib.error import HTTPError, URLError

allQuoteData = []

baseUrl = "http://quotes.toscrape.com/"
curPage = baseUrl

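while True:
    print("Retrieving page: " + curPage)
    try:
        curPageContent = urllib.request.urlopen(curPage).read()
    except (HTTPError, URLError) as e:
        # Stop paginating if the listing page cannot be fetched
        print(curPage, " ", e)
        break

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs(curPageContent, "html.parser")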

    quoteBlocks = soup.find_all(class_="quote")
    
    for quoteBlock in quoteBlocks:        
        quoteDict = {}

        # Add quote
        quote = quoteBlock.find(class_="text").text
        quoteDict['Quote'] = quote

        # URL of the author's page
        authorUrl = baseUrl + 'author/' + quoteBlock.find("a").get("href").split('/')[2]

        # Read the author page; skip this quote if the page cannot be fetched
        try:
            authorInfo = urllib.request.urlopen(authorUrl).read()
        except (HTTPError, URLError) as e:
            print(curPage, " ", authorUrl, " ", e)
            continue

        authorSoup = bs(authorInfo, "html.parser")

        # Author name
        author = authorSoup.find(class_ = 'author-title').text
        quoteDict['Author name'] = author

        # Author date of birth
        authorDob = authorSoup.find(class_ = 'author-born-date').text
        quoteDict['Author DOB'] = authorDob

        # Author birthplace (remove only the first "in ", which is the leading one)
        authorPlace = authorSoup.find(class_ = 'author-born-location').text
        quoteDict['Author born location'] = authorPlace.replace('in ', '', 1)

        # Author bio
        authorDesc = authorSoup.find(class_ = 'author-description').text
        quoteDict['Author description'] = authorDesc.strip()

        allQuoteData.append(quoteDict)

    # Stop when there is no "next" link, i.e. on the last page
    nextLink = soup.find(class_="next")
    if nextLink is None:
        break

    # Drop the leading "/" from the href so it can be appended to baseUrl
    curPage = baseUrl + nextLink.find('a').get('href')[1:]
    
headers = allQuoteData[0].keys()
quoteDf = pd.DataFrame(allQuoteData, columns=headers)

quoteDf.to_csv("../data/raw/Quotes.csv", encoding='utf-8-sig')
quoteDf.to_excel("../data/raw/Quotes.xlsx")

print('Successfully converted data to CSV, xlsx')


Retrieving page: http://quotes.toscrape.com/
Retrieving page: http://quotes.toscrape.com/page/2/
Retrieving page: http://quotes.toscrape.com/page/3/
Retrieving page: http://quotes.toscrape.com/page/4/
Retrieving page: http://quotes.toscrape.com/page/5/
Retrieving page: http://quotes.toscrape.com/page/6/
Retrieving page: http://quotes.toscrape.com/page/7/
Retrieving page: http://quotes.toscrape.com/page/8/
Retrieving page: http://quotes.toscrape.com/page/9/
Retrieving page: http://quotes.toscrape.com/page/10/
Successfully converted data to CSV, xlsx
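
The loop requests the author page once per quote, so an author with several quotes is downloaded several times. A minimal sketch of caching pages by URL follows. `make_cached_fetcher` and `fake_fetch` are hypothetical names, the author URLs are illustrative, and the stand-in fetcher replaces `urllib.request.urlopen(url).read()` so the example runs without network access.

```python
def make_cached_fetcher(fetch):
    """Wrap fetch(url) so that each distinct URL is fetched at most once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch


# Stand-in for a real download; it records every call so the saving is visible
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html>" + url.encode("utf-8") + b"</html>"


get_page = make_cached_fetcher(fake_fetch)

# Illustrative URLs; the first author appears twice
author_urls = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/Jane-Austen",
    "http://quotes.toscrape.com/author/Albert-Einstein",
]
pages = [get_page(url) for url in author_urls]

print(len(pages), "pages requested,", len(calls), "actually fetched")
```

In the notebook, the wrapped function could be something like `lambda url: urllib.request.urlopen(url).read()`, with `get_page(authorUrl)` replacing the direct `urlopen` call inside the quote loop.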
